Microsoft Word is the most common writing program in the world. It’s everywhere, and at some point in their lives, nearly everyone has used it. Much of this success is hard-earned. Word is a good piece of software, and it works very well.
But the dominance of Word has resulted in a rather significant problem for the users of other programs. Microsoft Word files have become the de-facto way to exchange written material with others, and if your alternative program doesn’t support Word files, you can be cut off from colleagues, friends, family, and publishers.
Unfortunately, one of my favorite Open Source writing programs, the LaTeX front-end LyX, does not support Microsoft Word files. And even though I prefer to write using LyX and LaTeX, I’m often forced to use MS Word simply so that I can “stay in the loop”.
That doesn’t mean I’ve been happy about it, though.
When I’m not happy, I get motivated to solve the problem. For quite some time, I’ve been trying to shoehorn my preferred tools into a world dominated by Word. I’ve experimented with a lot of different options and I think I’ve finally come up with a system that seems to work pretty well.
Here’s what that system does:
- It allows me to import Microsoft Word files into LyX with a single click.
- It maintains most document structure, including headers, styles, and other structural elements.
- It successfully translates MS Word syntax to LyX. This means that I do not need to spend time repairing double quotes or fixing em-dashes.
In this post, I will describe that system, how to set it up, and how to use it. I will also look at cases where it falls short and suggest manual alternatives that may be more appropriate.
The Cause of the Problem
But before jumping into the solution, it’s very important to understand the cause of the problem.
Anyone who has used both Microsoft Word and LyX/LaTeX knows that there are several significant differences between their design philosophies. Microsoft Word is a What You See Is What You Get (WYSIWYG) word processor. Its entire purpose is to empower users and give them tremendous control over what a document looks like. If a user wants to bold a particular word, or underline it, or make it pink with a purple outline, they can. If they want quarter-inch margins, they can have them. And if they want to use a font that looks like kindergarten handwriting, that’s an available option.
LaTeX – and to some extent LyX – on the other hand, has a diametrically opposed world view. It was never designed to be WYSIWYG, and all but hides formatting and markup from the author. Instead, LaTeX relies on a strictly enforced system of styles and tags in order to describe what a particular piece of content means, rather than what it should look like. Formatting information is added after the content is written and the final document complies with a tight style sheet. The end result is a document with a great deal of visual consistency and professional appeal.
But these differences also cause another huge issue: transferring a document from Microsoft Word to LaTeX usually means that formatting must be stripped away, and any time you strip away information, you run the risk of mangling content.
A Potential Solution
To deal with this problem, people have historically taken one of two approaches when attempting to convert MS Word documents to LaTeX:
- They go whole hog and any non-structural formatting is automatically removed. Unless a piece of text is specified as a heading, subheading, caption, footnote, or other piece of essential structural information, it gets stripped.
- They sidestep it completely and the converter attempts to perfectly recreate the appearance of an MS Word doc using LaTeX.
Both approaches have their merit, but there are very few middle roads. The other day, however, I managed to find one. Enter Writer2LaTeX, a command line program that works with Open Document files (ODF).
Using Writer2LaTeX and a few associated utilities, it is possible to cleanly import a Word document into LyX. Even more important than a clean import is an automated one, and this system allows you to do that, too.
The diagram below shows the major steps in my ad hoc system.
First, I take advantage of existing conversion programs to automate the conversion from .doc to ODF. Then, I process the ODF to create a well-formed LaTeX document. Finally, the LaTeX document is imported into LyX via the tex2lyx script. (If I choose, I can also add an extra step where I use html2latex in order to remove superfluous LaTeX tags and other information added by Writer2LaTeX.)
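For readers who want to try the steps outside of LyX first, here is a minimal sketch of the same chain as a Python script. It assumes the command lines used in the converter settings later in this post (written here with plain ASCII hyphens). The function names, the intermediate file names, and the .odt extension are my own assumptions for illustration, and the final LaTeX-to-LyX import is left to LyX itself.

```python
import subprocess
from pathlib import Path


def build_commands(doc, convertdoc_py, jar, config):
    """Return one argument list per conversion step, in the order they run.

    The intermediate files sit next to the Word document and reuse its
    base name (an assumption made for this sketch).
    """
    doc = Path(doc)
    odt = doc.with_suffix(".odt")    # OpenDocument text file
    html = doc.with_suffix(".html")  # cleaned XHTML from Writer2LaTeX
    tex = doc.with_suffix(".tex")    # plain LaTeX from htmltolatex
    return [
        # Step 1: Word -> OpenDocument, via the ConvertDoc script
        ["python", str(convertdoc_py), str(doc), str(odt)],
        # Step 2: OpenDocument -> XHTML, via Writer2LaTeX
        ["w2l", "-xhtml", "-cleanhtml", str(odt), str(html)],
        # Step 3: XHTML -> LaTeX, via htmltolatex
        ["java", "-jar", str(jar),
         "-input", str(html), "-output", str(tex), "-config", str(config)],
    ]


def run_pipeline(commands):
    """Run each step in turn and stop at the first one that fails."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling run_pipeline(build_commands("report.doc", "/path/to/ConvertDoc/ConvertDoc.py", "/path/to/html2latex/htmltolatex.jar", "/path/to/config.xml")) should leave a report.tex next to the original file, provided all three tools are installed.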
Limitations and Disclaimers
But even though I’ve found this system works well for my needs, there are a few disclaimers:
- Given the type of work that I do and type of product I normally produce, I am more concerned about the structure of a document. For that reason, the process described here is focused on preserving structure and removing unnecessary formatting. If your document contains a large number of tables, cross-links and advanced formatting, you may wish to find another option, or to use Writer2LaTeX to convert from ODF directly to LaTeX.
- This process uses a lot of external utilities that may not be included with your system. If you are using Linux, these utilities will likely be available from your distro repositories. If you are using Mac OS X or Windows, however, you will need to download and configure them separately.
Software and Installation
- ConvertDoc. Python class used to automate the conversion of MS Word docs to ODF. It incorporates a number of OpenOffice utilities. It can be downloaded here. To install, simply extract the directory to a location on your hard drive. In the next section, I will describe how to create a LyX converter that is able to use it.
- Writer2LaTeX. Used to convert from OpenOffice to either HTML or LaTeX. Included in the repositories of most distributions, but can also be downloaded from the project’s homepage. For Ubuntu users, you can install it from the command line: sudo apt-get install writer2latex openoffice.org-writer2latex openoffice.org-writer2xhtml writer2latex-manual
- html2latex. Utility used to clean XHTML output from Writer2LaTeX. While it was once included in some distribution repositories, this appears to no longer be the case. You can download a copy of the most recent version here. To install, extract to a directory on your hard drive and modify the existing LyX converter to point at the right location.
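Before setting up the converters, it can save some head-scratching to confirm that the command-line tools are actually on your PATH. The following is a small sketch of such a check. The list of commands is my reading of what the converters in this post call by name; ConvertDoc and htmltolatex are unpacked by hand, so they are not covered here.

```python
import shutil

# Executables that the converter command lines in this post call by name.
REQUIRED_COMMANDS = ["python", "w2l", "java"]


def missing_commands(commands, which=shutil.which):
    """Return the commands that cannot be found on PATH."""
    return [name for name in commands if which(name) is None]


if __name__ == "__main__":
    missing = missing_commands(REQUIRED_COMMANDS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All command-line tools were found.")
```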
In order to automate the conversion process, you need to create two new LyX document converters and modify an existing one. The Document Converter settings can be found in the Program Preferences dialog by going to:
To add a new converter, specify the file format that you wish to convert from, the file format that you wish to convert to, and the necessary command line arguments. When finished, click on the “Add” button or the “Modify” button.
- From format = “OpenDocument”, To format = “HTML”
Converter = w2l -xhtml -cleanhtml $$i $$o
- From format = “MS Word”, To format = “OpenDocument”
Converter = python /path/to/ConvertDoc/ConvertDoc.py $$i $$o
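In these command lines, $$i stands for the input file and $$o for the output file; LyX fills both in when it runs the converter. The sketch below imitates that substitution so you can see the exact command a converter line turns into. The function name is hypothetical, and this is only an approximation of the idea, not LyX's own code.

```python
import shlex


def expand_converter(template, input_file, output_file):
    """Fill in the $$i and $$o placeholders and split the result into arguments."""
    # Quote the file names so that paths containing spaces stay in one piece.
    command = template.replace("$$i", shlex.quote(input_file))
    command = command.replace("$$o", shlex.quote(output_file))
    return shlex.split(command)


argv = expand_converter(
    "w2l -xhtml -cleanhtml $$i $$o", "my report.odt", "my report.html"
)
print(argv)
```

Testing a converter line this way, and then running the resulting command by hand in a terminal, is usually quicker than debugging it from inside the LyX preferences dialog.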
Modify Existing Converter
For users of Ubuntu Linux, html2latex is no longer provided by the distribution packages. This means that you will probably need to install and configure it manually. To work, html2latex requires three arguments: an input file, an output file, and a configuration file. As a result, you will need to update the “HTML -> LaTeX (plain)” converter so that it contains this information.
- From format = “HTML”, To format = “LaTeX (plain)”
Converter = java -jar /path/to/html2latex/htmltolatex.jar -input $$i -output $$o -config /path/to/config.xml
The /path/to/html2latex is simply the path where you extracted the html2latex folder in the previous step. config.xml is typically found in the same folder as htmltolatex.jar.
Once you have installed the software and added the additional converters, you will be able to import MS Word documents directly into LyX. LyX will handle all of the background steps for you automatically.
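To make "all of the background steps" a little more concrete, here is a rough sketch of how a chain of converters can be worked out from a table of "from" and "to" formats. It is a plain breadth-first search over the converters described in this post, and it is only meant to illustrate the idea, not to reproduce LyX's internals. The last entry, from LaTeX to LyX, is an assumption that stands in for LyX's built-in LaTeX import.

```python
from collections import deque

# (from format, to format) pairs for the converters used in this post.
CONVERTERS = {
    ("MS Word", "OpenDocument"),
    ("OpenDocument", "HTML"),
    ("HTML", "LaTeX (plain)"),
    ("LaTeX (plain)", "LyX"),  # assumed: LyX's own LaTeX import
}


def find_chain(converters, source, target):
    """Return the shortest list of formats leading from source to target, or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        # Follow every converter that starts at the format we have reached.
        for src, dst in sorted(converters):
            if src == path[-1] and dst not in seen:
                seen.add(dst)
                queue.append(path + [dst])
    return None


print(" -> ".join(find_chain(CONVERTERS, "MS Word", "LyX")))
```

If any one of the converters is missing from the table, find_chain returns None, which corresponds to the import simply not being available.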
To import a new document, simply go to File->Import->Word Document, and select the file from your hard drive.
Though this custom converter makes it much easier to work with MS Word documents in LyX, it isn’t perfect:
- The process is fairly slow. To convert from a Word DOC to an ODF, it is necessary to open a background instance of OpenOffice. This may take several seconds, during which LyX may appear as though it has stopped responding. (The Python class will close OpenOffice when it is finished.)
- While htmltolatex does an admirable job of converting formatting and other markup, documents with sophisticated layout and obscure tags are not always imported correctly. This, unfortunately, includes HTML tables. For documents with tables, you will likely have better luck using Writer2LaTeX directly.
- Mathematical typesetting within Word is not always maintained when converting from .doc to .odf. In many instances equations are transformed into images.
- Image layouts are not always maintained. htmltolatex is not able to transform images into proper float or wrap environments.
Some of the above concerns can be somewhat mitigated by creating a custom writer2latex configuration file (though I’ve been very happy with the default options). This allows you to map encodings, indicate which backend you would like to use in the conversion and specify how any custom styles or tags should be transformed. More information can be found at the writer2latex project page.