May 14, 2010 5:56 pm

Microsoft Word is the most common writing program in the world.  It’s everywhere, and at some point in their lives, nearly everyone has used it. Much of this success is hard-earned. Word is a good piece of software, and it works very well.

But the dominance of Word has resulted in a rather significant problem for the users of other programs. Microsoft Word files have become the de-facto way to exchange written material with others, and if your alternative program doesn’t support Word files, you can be cut off from colleagues, friends, family, and publishers.

Unfortunately, one of my favorite open source writing programs, the LaTeX front end LyX, does not support Microsoft Word files. And even though I prefer to write using LyX and LaTeX, I’m often forced to use MS Word simply so that I can “stay in the loop”.

That doesn’t mean I’ve been happy about it, though.

When I’m not happy, I get motivated to solve the problem. For quite some time, I’ve been trying to shoehorn my preferred tools into a world dominated by Word. I’ve experimented with a lot of different options and I think I’ve finally come up with a system that seems to work pretty well.

Here’s what it does:

  • It allows me to import Microsoft Word files into LyX with a single click.
  • It maintains most document structure, including headers, styles, and other structural elements.
  • It successfully translates MS Word syntax to LyX.  This means that I do not need to spend time repairing double quotes or fixing em-dashes.

In this post, I will describe that system, how to set it up, and its use. I will also take a look at instances where it may be limited and provide manual alternatives that may be more appropriate.


The Cause of the Problem

But before jumping into the solution, it’s very important to understand the cause of the problem.

Anyone who has used both Microsoft Word and LyX/LaTeX knows that there are several significant differences between their design philosophies.  Microsoft Word is a What You See Is What You Get (WYSIWYG) word processor.  Its entire purpose is to empower users and give them tremendous control over what a document looks like.  If a user wants to bold a particular word, or underline it, or make it pink with a purple outline, they can.  If they want quarter inch margins, they can have them.  And if they want to use a font that looks like kindergarten handwriting, that’s an available option.

LaTeX – and to some extent LyX – on the other hand, has a diametrically opposed world view.  It was never designed to be WYSIWYG, and all but hides formatting and markup from the author.  Instead, LaTeX relies on a strictly enforced system of styles and tags in order to describe what a particular piece of content means, rather than what it should look like.  Formatting information is added after the content is written and the final document complies with a tight style sheet.  The end result is a document with a great deal of visual consistency and professional appeal.

But these differences also cause another huge issue: transferring a document from Microsoft Word to LaTeX usually means that formatting must be stripped away, and any time you strip away information, you run the risk of mangling content.

A Potential Solution

To deal with this problem, people have historically taken one of two approaches when attempting to convert MS Word documents to LaTeX:

  1. They go whole hog and any non-structural formatting is automatically removed.  Unless a piece of text is specified as a heading, subheading, caption, footnote, or other piece of essential structural information, it gets stripped.
  2. They side step it completely and the converter attempts to perfectly recreate the appearance of an MS Word doc using LaTeX.

Both approaches have their merit, but there are very few middle roads.  The other day, however, I managed to find one. Enter Writer2LaTeX, a command line program that works with Open Document files (ODF).

Using Writer2LaTeX and a few associated utilities, it is possible to cleanly import a Word document into LyX.  Even more important than a clean import is an automated one, and this system allows you to do that, too.

Process Overview

The diagram below shows the major steps in my ad hoc system.

[Diagram: Word2LyX-Overview]

First, I take advantage of existing conversion programs to automate the conversion of .doc to ODF.  Then, I process the ODF to create a well-formed LaTeX document.  Finally, the LaTeX document is imported into LyX via the tex2lyx script.  (If I choose, I can also add an extra step where I use html2latex in order to remove superfluous LaTeX tags and other information added by Writer2LaTeX.)
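
To make the order of the steps concrete, here is a minimal Python sketch of the direct route (without the optional HTML cleanup step). It only builds the command lines and does not run anything. The function name `build_pipeline`, the `.odt` and `.tex` intermediate file names, and the placeholder ConvertDoc path are assumptions for illustration; `w2l -clean` is the invocation quoted in the comments below, and tex2lyx is the importer that ships with LyX.

```python
from pathlib import Path


def build_pipeline(doc_path, convertdoc="/path/to/ConvertDoc/ConvertDoc.py"):
    """Return (description, argv) pairs for one Word file, without running them.

    Hypothetical helper: file names and the ConvertDoc location are assumptions.
    """
    doc = Path(doc_path)
    odt = doc.with_suffix(".odt")  # intermediate OpenDocument text file
    tex = doc.with_suffix(".tex")  # LaTeX written by Writer2LaTeX
    return [
        ("Word -> OpenDocument", ["python", convertdoc, str(doc), str(odt)]),
        ("OpenDocument -> LaTeX", ["w2l", "-clean", str(odt), str(tex)]),
        ("LaTeX -> LyX", ["tex2lyx", str(tex)]),
    ]


if __name__ == "__main__":
    for description, argv in build_pipeline("report.doc"):
        print(description, "::", " ".join(argv))
```

In practice LyX runs these steps itself once the converters are configured; the sketch is only meant to show which tool handles which hop.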

Limitations and Disclaimers

But even though I’ve found this system works well for my needs, there are a few disclaimers:

  • Given the type of work that I do and type of product I normally produce, I am more concerned about the structure of a document.  For that reason, the process described here is focused on preserving structure and removing unnecessary formatting.  If your document contains a large number of tables, cross-links and advanced formatting, you may wish to find another option, or to use Writer2LaTeX to convert from ODF directly to LaTeX.
  • This process uses a lot of external utilities that may not be included with your system.  If you are using Linux, these utilities will likely be available from your distro repositories.  If you are using Mac OS X or Windows, however, you will need to download and configure them separately.

Software and Installation

  1. ConvertDoc.  A Python class that automates the conversion of MS Word docs to ODF and incorporates a number of OpenOffice utilities.  It can be downloaded here.  To install, simply extract the directory to a location on your hard drive.  In the next section, I will describe how to create a LyX converter that is able to use it.
  2. Writer2LaTeX.  Used to convert OpenOffice documents to either HTML or LaTeX.  Included in the repositories of most distributions, but can also be downloaded from the project’s homepage.  For Ubuntu users, you can install it from the command line: sudo apt-get install writer2latex openoffice.org-writer2latex openoffice.org-writer2xhtml writer2latex-manual
  3. html2latex.  Utility used to clean XHTML output from Writer2LaTeX.  While it was once included in some distribution repositories, this appears to no longer be the case.  You can download a copy of the most recent version here.  To install, extract to a directory on your hard drive and modify the existing LyX converter to point at the right location.
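
Before wiring anything into LyX, it can help to confirm that the command line tools are actually reachable. The sketch below is one way to do that in Python; the helper name `missing_tools` and the default list of commands (`python`, `java`, `w2l`) are assumptions based on the converter commands used later in this post, so adjust them to match your installation.

```python
import shutil


def missing_tools(names=("python", "java", "w2l")):
    """Return the command names from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required commands were found.")
```

This only checks that the commands exist, not that they are the right versions or that ConvertDoc and htmltolatex.jar were extracted to the paths you configure.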

Configuration

In order to automate the conversion process, you need to create two new LyX document converters and modify an existing one.   The Document Converter settings can be found in the Program Preferences dialog by going to:

Tools->Preferences->File Handling->Converters

To add a new converter, specify the file format that you wish to convert from, the file format that you wish to convert to, and the necessary command line arguments.  When finished, click on the “Add” button or the “Modify” button.

New Converters

  1. From format = “OpenDocument”, To format = “HTML”
    Converter = w2l -xhtml -cleanhtml $$i $$o
  2. From format = “MS Word”, To format = “OpenDocument”
    Converter = python /path/to/ConvertDoc/ConvertDoc.py $$i $$o
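
In these converter lines, $$i stands for the input file and $$o for the output file; LyX fills them in when it runs the command. The Python sketch below imitates that substitution so you can preview the command a template would produce. It is an illustration only, not LyX's own code, and the function name `expand_converter` is made up for this example.

```python
import shlex


def expand_converter(template, infile, outfile):
    """Fill in $$i and $$o in a converter template and split it into argv.

    File names are shell-quoted first so that names with spaces stay intact.
    """
    command = template.replace("$$i", shlex.quote(infile))
    command = command.replace("$$o", shlex.quote(outfile))
    return shlex.split(command)


if __name__ == "__main__":
    template = "python /path/to/ConvertDoc/ConvertDoc.py $$i $$o"
    print(expand_converter(template, "my report.doc", "my report.odt"))
```

Previewing the expanded command this way makes it easier to test a converter from a terminal before adding it to the LyX preferences.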

Modify Existing Converter

For users of Ubuntu Linux, html2latex is no longer provided by the distribution packages.  This means that you will probably need to install and configure it manually.  To work, html2latex requires three arguments: an input file, an output file, and a configuration file.  As a result, you will need to update the “HTML -> LaTeX (plain)” converter so that it contains this information.

  1. From format = “HTML”, To format = “LaTeX (plain)”
    Converter = java -jar /path/to/html2latex/htmltolatex.jar -input $$i -output $$o -config /path/to/config.xml

The /path/to/html2latex is simply the path where you extracted the html2latex folder in the previous step.  config.xml is typically found in the same folder as htmltolatex.jar.

Use

Once you have installed the software and added the additional converters, you will be able to import MS Word documents directly into LyX.  LyX will handle all of the background steps for you automatically.

To import a new document, simply go to File->Import->Word Document, and select the file from your hard drive.
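
Behind the scenes, an import like this means chaining the individual converters together. How LyX chooses a chain is not covered in this post, so the Python sketch below simply assumes a shortest-path search over the configured (from, to) format pairs. The `CONVERTERS` list mirrors the converters set up above, and `find_chain` is a hypothetical helper.

```python
from collections import deque

# (from_format, to_format) pairs, mirroring the converters configured above.
CONVERTERS = [
    ("MS Word", "OpenDocument"),
    ("OpenDocument", "HTML"),
    ("HTML", "LaTeX (plain)"),
]


def find_chain(source, target, converters=CONVERTERS):
    """Breadth-first search for the shortest chain of formats from source to target."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for src, dst in converters:
            if src == path[-1] and dst not in seen:
                seen.add(dst)
                queue.append(path + [dst])
    return None  # no chain of converters connects the two formats


if __name__ == "__main__":
    print(" -> ".join(find_chain("MS Word", "LaTeX (plain)")))
```

Under this assumption, any shorter competing route (for example, a direct MS Word to LaTeX converter installed by another package) would be picked instead of the three-step chain, which is worth keeping in mind if the import does not behave as expected.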

Final Thoughts

Though this custom converter makes it much easier to work with MS Word documents in LyX, it isn’t perfect:

  1. The process is fairly slow.  To convert from a Word DOC to an ODF, it is necessary to open a background instance of OpenOffice.  This may take several seconds, during which LyX may appear as though it has stopped responding.  (The Python class will close OpenOffice when it is finished.)
  2. While htmltolatex does an admirable job of converting formatting and other markup, documents with sophisticated layout and obscure tags are not always imported correctly.  This, unfortunately, includes HTML tables.  For documents with tables, you will likely have better luck using Writer2LaTeX directly.
  3. Mathematical typesetting within Word is not always maintained when converting from .doc to .odf.  In many instances equations are transformed into images.
  4. Image layouts are not always maintained.  htmltolatex is not able to transform images into proper float or wrap environments.

Fine Tuning

Some of the above concerns can be somewhat mitigated by creating a custom writer2latex configuration file (though I’ve been very happy with the default options).  This allows you to map encodings, indicate which backend you would like to use in the conversion and specify how any custom styles or tags should be transformed.  More information can be found at the writer2latex project page.

Comments

9 Responses to “Automatically Importing an MS Word Document into LyX”

Liviu wrote a comment on May 14, 2010

First thank you for working on this.

It seems that on Debian-based distros one can acquire html2latex by installing gnuhtml2latex, which “aims to be replacement of html2latex”. Not sure what it’s worth, though.

Liviu wrote a comment on May 14, 2010

I don’t know very much about the LyX converters, but I’m curious if there is no clash between the path you propose (MS Word > OpenDocument > LaTeX) and the existing MS Word > LaTeX (via the wv package). I added the converters as you suggested and I don’t see any particular changes in the Import/Export menus. Any ideas?

Rob Oakes wrote a comment on May 14, 2010

Hi Liviu,

It is quite possible that the path proposed here does indeed conflict with the wv package. I’ve never actually had much luck with those filters, and as a result, I neither use them nor have them installed. However, if you do happen to have them installed, then you need to create a second MS Word file entry for the OpenOffice conversion path.

To do so, go to Tools->Preferences->File Handling->File formats. Then, press the “New” button and copy the following settings:

Format: MS Word (OpenOffice)
Document Format: Yes
Vector Graphics Format: Yes
Short Name: word2
Extension: doc

When finished, press “Save”

Then, modify the MS Word converter above:

From format = “MS Word (OpenOffice)”, To format = “OpenDocument”
Converter = python /path/to/ConvertDoc/ConvertDoc.py $$i $$o

That should make it appear in the Import list.

While creating a new file format for MS Word, you might also want to create an OpenDocument format so that you can specify whether you want a “clean” import, or wish it to follow the standard import path (e.g. ODF -> LaTeX (plain), rather than ODF -> HTML -> LaTeX (plain)).

Rob Oakes wrote a comment on May 14, 2010

Interestingly enough, it looks like instead of using OpenOffice to do the conversion, it might also be possible to use AbiWord. I’ll need to experiment with this a little more, but I’ve had extremely good luck when using AbiWord to convert tables and other graphics to plain LaTeX. AbiWord can also be used to output very high quality HTML if need be.

** Edit ** 2010-05-14 18:08

While AbiWord can be used to output high quality HTML and LaTeX, it doesn’t do very well with images. One of the strengths of using writer2latex is that it does a pretty good job with both images and tables.

Liviu wrote a comment on May 15, 2010

Hello Rob
Thank you for the suggestions. I went ahead and defined
From format = “MS Word (OpenOffice)”, To format = “OpenDocument (w2l)”
Converter = python /path/to/ConvertDoc/ConvertDoc.py $$i $$o

and
From format = “OpenDocument (w2l)”, To format = “LaTeX (plain)”
w2l -clean $$i

This way (I think) I make sure that the .doc import passes through OOo and directly to LaTeX (I am not sure I need the HTML step). Otherwise, I’m looking forward to hearing of any progress you make with the DocBook conversions.

Rob Oakes wrote a comment on May 15, 2010

@Liviu: I actually have converters defined for both. Importing via HTML does a very good job of cleaning out miscellaneous formatting. If I pass it directly through w2l, I find that I get a lot of errant LaTeX tags that aren’t converted to their LyX counterparts, which I don’t particularly care for. For tables, though, I pass it directly to w2l, as trying to go through HTML processing tends to mangle the content.

The nice thing with such a strategy, though, is that you can use either pathway as the need arises.

Anders Host-Madsen wrote a comment on September 2, 2010

It seems your main concern is conversion of text, not equations. Do I understand this correctly? In my case, the documents I would like to convert are perhaps 90% equations. How well does this procedure work for equations? The procedure to implement this seems complex to me, so I don’t even want to try if the accuracy is low with equations.

Rob Oakes wrote a comment on September 2, 2010

Hi Anders. That is correct, I’m primarily concerned with the accurate conversion of text. That is why I run the text through two different filters. I wish to remove unnecessary markup and “finger painting.”

If your text consists primarily of equations, this system may not work for you very well. However, it may be worthwhile to give it a try.

(If you’re on Linux, setup really isn’t bad. Most of the needed software is in the repositories. On Windows or Mac, it is significantly more involved.)

I’ve found that the accuracy for my documents is quite good. I’ve also found that the system does a respectable job with tables and other advanced formatting. (Much better than I initially credited in the article.) If you could provide a document with equations, I would be happy to test it. This is the one area that I have not experimented with and I’m curious as to what the results will look like.

costin wrote a comment on February 18, 2011

hello Sir, I tried the above instructions but i’ve faced this error when doing python \path\converterdoc\converter.py file :
Traceback (most recent call last):
  File "D:\ConvertDoc\ConvertDoc.py", line 2, in <module>
    from convertdoc import ooutils, DocumentConverter
  File "D:\ConvertDoc\convertdoc\ooutils.py", line 22, in <module>
    import uno
ImportError: No module named uno

What can it be that went wrong ? Thanks
