Have you ever wanted to convert a batch of Microsoft Word documents to another format? Have you ever wanted to read some documents on your mobile phone in the subway, only to find that the phone didn't support .doc or .docx files, so you had to save each file as HTML by hand?
You can solve this boring task by writing a small program that takes advantage of Word Automation. The basic steps of saving a Word document as HTML are:
- Run Word
- Load a document
- Format the document to be more readable on your phone
- Save the document
- Exit Word
First we'll get all .doc or .docx files from a user-specified directory.
string[] documentFiles = Directory.GetFiles(inputDirectory, "*.doc");
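A note on the "*.doc" pattern: on .NET Framework, a search pattern with a three-character extension also matches longer extensions that begin with it, so the call above picks up .docx files as well. Newer .NET versions match the pattern literally. The sketch below is one way to avoid depending on that behavior; it also skips the "~$" lock files that Word creates next to open documents. The names DocumentFinder, FindWordDocuments and IsWordDocument are hypothetical.

```csharp
using System;
using System.IO;
using System.Linq;

static class DocumentFinder
{
    // Hypothetical helper: returns the .doc and .docx files in a directory, sorted by path.
    public static string[] FindWordDocuments(string inputDirectory)
    {
        return Directory.EnumerateFiles(inputDirectory)
            .Where(IsWordDocument)
            .OrderBy(path => path, StringComparer.OrdinalIgnoreCase)
            .ToArray();
    }

    // True for *.doc and *.docx, except Word's "~$" lock files.
    public static bool IsWordDocument(string path)
    {
        string name = Path.GetFileName(path);
        if (name.StartsWith("~$", StringComparison.Ordinal))
            return false;

        string extension = Path.GetExtension(path);
        return extension.Equals(".doc", StringComparison.OrdinalIgnoreCase)
            || extension.Equals(".docx", StringComparison.OrdinalIgnoreCase);
    }
}
```

The result can be used in place of the documentFiles array from the Directory.GetFiles call.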
Then we'll have to create a Word application. This is similar to opening Word. We'll set the Visible property to false because we don't want a Word window to pop up.
Microsoft.Office.Interop.Word.Application wordApp = new Microsoft.Office.Interop.Word.Application();
Microsoft.Office.Interop.Word.Document doc;
wordApp.Visible = false;
We'll load each selected document into the Word application. The methods of the Microsoft.Office.Interop.Word assembly used here take their parameters (even value-typed ones) as ref object. Some parameters are optional; for those you pass a reference to an object that holds Type.Missing.
object m = Type.Missing;
object documentFile = documentFiles[i];
doc = wordApp.Documents.Open(ref documentFile, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m);
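Since C# 4.0, the compiler lets you omit ref on COM interop calls and leave out optional arguments, which shortens calls like the one above considerably. The following is a sketch under those assumptions (C# 4.0 or later, a reference to the Word interop assembly, and Word installed on the machine). WordOpenSketch and OpenReadOnly are hypothetical names, and opening the source read-only is a choice made for this sketch, not something the original code does.

```csharp
using Word = Microsoft.Office.Interop.Word;

static class WordOpenSketch
{
    // Hypothetical helper: opens a document without listing every optional argument.
    public static Word.Document OpenReadOnly(Word.Application wordApp, string path)
    {
        // Named arguments select the parameters we care about;
        // the compiler passes Type.Missing for the ones left out.
        return wordApp.Documents.Open(
            FileName: path,
            ReadOnly: true,
            AddToRecentFiles: false);
    }
}
```

A document opened read-only can still be saved under a different name, which is all the conversion needs.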
Some of the documents I wanted to convert used font size 10, others 14. I find documents easier to read when they all use the same font family and font size. To format the text, you create a range and set the Name and Size properties of its Font. I created a range containing the whole document.
object oStart = 0;
object oEnd = doc.Characters.Count;
Microsoft.Office.Interop.Word.Range range = doc.Range(ref oStart, ref oEnd);
range.Font.Size = 10;
range.Font.Name = "Arial";
Now we have to save the document. In Word there are three HTML-related options:
- Single File Web Page (*.mht, *.mhtml) - saves the document as a web archive, with images and text in a single file. Images are stored at their original size, so the file can be large.
- Web Page (*.htm, *.html) - saves the document in an .html file and its resources in a folder. The result can also be quite large, because Word keeps its own markup and the images at their original size so that the document can be edited later.
- Web Page, Filtered (*.htm, *.html) - similar to the previous option, except that the Word-specific markup isn't stored and images are optimized on save. If you paste a 1000x1000 image into Word and resize it to 200x200, the saved image will be only 200x200, which saves a lot of space.
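object wdFormatFilteredHTML = Microsoft.Office.Interop.Word.WdSaveFormat.wdFormatFilteredHTML;
// Save under a new .htm name in the output directory rather than the source path.
object outputFile = Path.Combine(outputDirectory, Path.GetFileNameWithoutExtension(documentFiles[i]) + ".htm");
doc.SaveAs(ref outputFile, ref wdFormatFilteredHTML, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m);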
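One detail to watch when choosing the output name: if a folder contains both report.doc and report.docx, both map to report.htm, and the second save overwrites the first. Below is a small sketch of one way to hand out unique names. OutputNames, GetOutputPath and the "_2" suffix scheme are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class OutputNames
{
    // Hypothetical helper: maps a source document to a unique .htm path in outputDirectory.
    // 'used' remembers the file names already handed out during this run.
    public static string GetOutputPath(string sourceFile, string outputDirectory, ISet<string> used)
    {
        string baseName = Path.GetFileNameWithoutExtension(sourceFile);
        string candidate = baseName + ".htm";

        // ISet.Add returns false when the name is already taken; try _2, _3, ... until one is free.
        for (int n = 2; !used.Add(candidate); n++)
            candidate = baseName + "_" + n + ".htm";

        return Path.Combine(outputDirectory, candidate);
    }
}
```

Create the set once before the loop, for example new HashSet<string>(StringComparer.OrdinalIgnoreCase), so that names differing only in case also count as collisions.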
Finally, we have to close the document and the application.
doc.Close(ref m, ref m, ref m);
doc = null;
wordApp.Quit(ref m, ref m, ref m);
wordApp = null;
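If Open or SaveAs throws, the code above never reaches Quit, and an invisible WINWORD.EXE process stays behind. A more defensive shape wraps the work in try/finally. This is a sketch, assuming C# 4.0 or later and an installed copy of Word; WordSession and WithWord are hypothetical names.

```csharp
using System;
using System.Runtime.InteropServices;
using Word = Microsoft.Office.Interop.Word;

static class WordSession
{
    // Hypothetical helper: runs 'work' against a hidden Word instance and always shuts it down.
    public static void WithWord(Action<Word.Application> work)
    {
        Word.Application wordApp = new Word.Application();
        try
        {
            wordApp.Visible = false;
            work(wordApp);
        }
        finally
        {
            // The cast selects the Quit method; Application also exposes a Quit event,
            // and calling through Application produces an ambiguity warning.
            ((Word._Application)wordApp).Quit(SaveChanges: Word.WdSaveOptions.wdDoNotSaveChanges);

            // Release the COM reference so the Word process can exit promptly.
            Marshal.ReleaseComObject(wordApp);
        }
    }
}
```

The conversion loop would then go inside the delegate passed to WithWord, so Word is closed whether or not a document fails to convert.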
Roundup
I created a very basic Windows Forms application: I added a FolderBrowserDialog, a Button, a RichTextBox and a ProgressBar to the form. I also generate an index page for all the HTML files. Here is the source code:
using System;
using System.IO;
using System.Text;
using System.Windows.Forms;

public partial class Form1 : Form
{
    public Form1()
    {
        InitializeComponent();
    }

    private void button1_Click(object sender, EventArgs e)
    {
        if (folderBrowserDialog1.ShowDialog() == DialogResult.OK)
        {
            string inputDirectory = folderBrowserDialog1.SelectedPath;
            string outputDirectory = Path.Combine(inputDirectory, "output");
            Convert(inputDirectory, outputDirectory);
            GenerateHtmlIndex(outputDirectory);
        }
    }

    private void Convert(string inputDirectory, string outputDirectory)
    {
        richTextBox1.Text = "starting...\n";
        richTextBox1.Update();
        progressBar1.Minimum = 0;
        progressBar1.Value = 0;

        // Start from an empty output directory.
        if (Directory.Exists(outputDirectory))
            Directory.Delete(outputDirectory, true);
        Directory.CreateDirectory(outputDirectory);

        object m = Type.Missing;
        object wdFormatFilteredHTML = Microsoft.Office.Interop.Word.WdSaveFormat.wdFormatFilteredHTML;
        string[] documentFiles = Directory.GetFiles(inputDirectory, "*.doc");
        progressBar1.Maximum = documentFiles.Length;

        Microsoft.Office.Interop.Word.Application wordApp = new Microsoft.Office.Interop.Word.Application();
        Microsoft.Office.Interop.Word.Document doc;
        wordApp.Visible = false;

        for (int i = 0; i < documentFiles.Length; i++)
        {
            object documentFile = documentFiles[i];
            doc = wordApp.Documents.Open(ref documentFile, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m);

            // Apply the same font to the whole document.
            object oStart = 0;
            object oEnd = doc.Characters.Count;
            Microsoft.Office.Interop.Word.Range range = doc.Range(ref oStart, ref oEnd);
            range.Font.Size = 10;
            range.Font.Name = "Arial";

            // From here on, documentFiles[i] holds the output .htm path.
            documentFiles[i] = Path.Combine(outputDirectory, Path.GetFileNameWithoutExtension(documentFiles[i]) + ".htm");
            documentFile = documentFiles[i];
            doc.SaveAs(ref documentFile, ref wdFormatFilteredHTML, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m, ref m);
            doc.Close(ref m, ref m, ref m);
            doc = null;

            richTextBox1.AppendText(String.Format("[{0} of {1}] {2}\n", i + 1, documentFiles.Length, Path.GetFileName(documentFiles[i])));
            richTextBox1.Update();
            progressBar1.Increment(1);
        }

        wordApp.Quit(ref m, ref m, ref m);
        wordApp = null;
        richTextBox1.AppendText("finished.\n");
        richTextBox1.Update();
    }

    private void GenerateHtmlIndex(string directory)
    {
        string[] htmFiles = Directory.GetFiles(directory, "*.htm");
        StringBuilder indexFile = new StringBuilder();
        // Path.GetDirectoryName returns the parent of the output directory, i.e. the input directory path.
        indexFile.Append(
            String.Format("<html><head><title>{0}</title></head><body>",
                Path.GetDirectoryName(directory)));
        for (int i = 0; i < htmFiles.Length; i++)
            indexFile.Append(String.Format("<a href=\"{0}\">{0}</a><br />", Path.GetFileName(htmFiles[i])));
        indexFile.Append("</body></html>");
        File.WriteAllText(Path.Combine(directory, "_index.htm"), indexFile.ToString());
    }
}
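File names can contain characters such as & or spaces, which are not safe to drop into HTML or a URL unescaped, as the index generation above does. Here is a sketch of an index builder that escapes them, using WebUtility.HtmlEncode and Uri.EscapeDataString from the base class library. IndexBuilder and BuildIndexHtml are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

static class IndexBuilder
{
    // Hypothetical helper: builds the index page markup for the given file names.
    public static string BuildIndexHtml(string title, IEnumerable<string> fileNames)
    {
        var html = new StringBuilder();
        html.AppendFormat("<html><head><title>{0}</title></head><body>",
            WebUtility.HtmlEncode(title));

        foreach (string name in fileNames)
        {
            // URL-escape the link target, HTML-encode the visible text.
            html.AppendFormat("<a href=\"{0}\">{1}</a><br />",
                Uri.EscapeDataString(name),
                WebUtility.HtmlEncode(name));
        }

        html.Append("</body></html>");
        return html.ToString();
    }
}
```

The returned string can be written with File.WriteAllText exactly as in GenerateHtmlIndex.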