March 8, 2012 12:15 am

For many writers, the act of writing (or placing one word after another) is synonymous with the tool that they use to do it. There’s a reason why writers feel so strongly about their Moleskine notebooks, fountain pens, and computer software. I’m no different from any other writer. I have my preferred tools, and I love them dearly. They help me focus on my ideas and craft prose that I can be proud of.

When writing on a computer, the tool of choice for many writers (dare I say most?) is Microsoft Word. It’s everywhere and everyone has used it. It comes preinstalled on most computers and is a de facto standard for exchanging written material with others.

Unfortunately, Word is not part of my preferred toolset. I prefer to write using LyX and an add-on I’ve written for it called LyX-Outline. But while I love my writing program, it makes it difficult to collaborate with other writers who use Word, as LyX doesn’t have a straightforward way to directly import Word files.

This isn’t a new problem and I’ve written about it before. I’ve even proposed a solution. But while that solution was a good fit for me, it isn’t something that I would recommend to others.

For starters, it required a great deal of software to be installed. You needed a program to convert Microsoft Word documents to Open Office documents. You then had to use a second utility to convert the result to HTML or LaTeX. After that, you used a third utility to clean it up and import the LaTeX code into LyX. Three distinct steps, with a lot of places where things could go wrong.

Over the past few months, I’ve found that I need a better way, a tool that can directly import a Word document and cleanly translate its content. So, I decided to create one.

MSDoc2LyX

Project Goals

I wanted my tool to meet several important criteria:

  • it should be easy to use and maintain
  • it should maintain document structure including headers, styles and other elements
  • it should successfully translate Word to LyX, meaning that I don’t need to spend time repairing double quotes or fixing em-dashes
  • it should support tables, images, and footnotes
  • it should be extensible, allowing me to create custom templates that support a wide variety of Word documents and LaTeX document classes

After several months of work, I’m now ready to release an alpha-quality version of that tool. I’m calling it word2lyx.

word2lyX

word2lyx is a Python script that converts Microsoft Word documents to a format that can be imported into LyX. word2lyx only works on documents saved in the docx format, created using Word 2007 or 2010. Older documents in the doc format will have to be converted before word2lyx will be able to read them.

But here’s the important thing: it works directly on the Word file. It doesn’t rely on external parsers for translation. For that reason, you can specify exactly how you would like your documents to be converted.

Currently, word2lyx supports:

  • Translating Word paragraph and character styles to LyX paragraph and character styles. In cases where the character styles aren’t defined, it will write entries for them in the local layout. These entries will include basic LaTeX font commands, created from the font properties.
  • Importing Word tables, including those with merged rows or columns. It will also do its best with table borders.
  • Enumerated and itemized lists
  • Importing images from the Word document. Word image objects, such as Excel charts and graphs, are skipped.
  • The use of custom templates, which allow you to finely describe how you would like for your documents to be translated. There are two templates included with the script and supporting libraries: article.w2l and book.w2l. These can be found in the lyx/templates folder. (For more about templates, please see below.)

For the best conversion, please make use of paragraph and character styles. These styles are converted directly. Per-word formatting is removed. This makes it easier to create clean markup for LyX.
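To make the point about styles concrete, here is a minimal sketch of how paragraph style names can be read straight out of a docx file, which is a zip archive containing word/document.xml. This is not word2lyx's own code: the function name paragraph_styles is hypothetical, and treating unstyled paragraphs as "Normal" is an assumption made for the illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_styles(docx_path):
    """Return a list of (style_name, text) pairs, one per paragraph.

    Hypothetical helper for illustration; not part of word2lyx.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    results = []
    for para in root.iter("{%s}p" % W_NS):
        style = para.find("{%s}pPr/{%s}pStyle" % (W_NS, W_NS))
        # Assumption: paragraphs without an explicit style count as "Normal"
        name = style.get("{%s}val" % W_NS) if style is not None else "Normal"
        # Join the text runs; run-level (per-word) formatting is ignored
        text = "".join(t.text or "" for t in para.iter("{%s}t" % W_NS))
        results.append((name, text))
    return results
```

Running it on a document that uses styles consistently gives a quick picture of which style names a template would need to map.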

Usage

To use the script, use a variation of the command below:

python word2lyx.py "InputFile.docx" "OutputFile.lyx"

You can also specify a template you would like for it to use:

python word2lyx.py "InputFile.docx" "OutputFile.lyx" -t book

Right now, word2lyx includes two templates: one for articles and another for book chapters. These can serve as a base for importing your work into LyX. After you’ve got it loaded, you can then change the document class and begin to customize it from inside of LyX.
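If you have a whole folder of Word files, the same command can be driven from a short wrapper. The sketch below is only an illustration: build_command and convert_folder are hypothetical names, the interpreter is assumed to be reachable as "python", and the only word2lyx options used are the ones shown above (input file, output file, and -t).

```python
import subprocess
from pathlib import Path


def build_command(script, docx_file, template=None, interpreter="python"):
    """Build the word2lyx command line for one document (hypothetical helper)."""
    docx_file = Path(docx_file)
    command = [interpreter, str(script), str(docx_file),
               str(docx_file.with_suffix(".lyx"))]
    if template:
        command += ["-t", template]
    return command


def convert_folder(script, folder, template=None):
    """Run word2lyx on every .docx file in a folder; return names that failed."""
    failed = []
    for docx_file in sorted(Path(folder).glob("*.docx")):
        result = subprocess.run(build_command(script, docx_file, template))
        # A non-zero exit status is assumed to mean the conversion failed
        if result.returncode != 0:
            failed.append(docx_file.name)
    return failed
```

For example, convert_folder("word2lyx.py", "chapters", template="book") would write a .lyx file next to each .docx file in the chapters folder.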

Downloads and Installation

If you think word2lyx sounds interesting, you can download it here. The download includes the script and two supporting libraries, one for reading docx files and one for writing LyX files. Additionally, it has a sample document which shows off most of the script’s features. To install word2lyx, put the files somewhere inside of your Python path. (Or, when you run it, you can just type the full path to the script.)

Automating Document Conversion

Though using the script from the command line is pretty easy, a better way is to automate the conversion process. To do this, you need to create a new LyX document converter. Step-by-step instructions for adding additional file types and converters to LyX can be found in the older post I wrote on converting Word to LyX. Here, I’ll repeat the most important information.

First, go to the File Handling preferences (Tools > Preferences > File Handling > File Types), and add a new type called Word 2010. Set the file ending to be “docx.”

Word 2010 File Format Settings for LyX

Next, go to the Converters dialog and create a new “MS Word 2010” to “LyX” converter.

word2lyx Converter Settings

From format = “MS Word 2010”, To format = “LyX”. Converter setting:
python "/path/to/word2lyx.py" $$i $$o -t article

Once you’ve added these settings, you can then import a document by going to File > Import > MS Word 2010.

To import a Word document, go to File > Import > MS Word 2010...

word2lyx Templates

word2lyx templates are files which tell word2lyx how a particular type of document should be translated. They include lists of paragraph, character, and table styles and their LyX equivalents. Consider the example below, which is taken from the article template:

ImageDir = 'Images'
IgnoreStyles = FootnoteReference, EndnoteReference

[ParagraphStyles]
Title = Title
Author = Author
Part = Part
Heading1 = Chapter

The template file contains settings and sections. The example above includes two settings, ImageDir and IgnoreStyles, and a group of paragraph styles. The paragraph styles specify how Word styles (on the left), such as Heading1, should be translated into LyX styles, such as Chapter, Section, etc.

When you create a new template (or extend one of the existing options), you add additional styles so that word2lyx knows how you want those particular pieces of information to be processed.
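As a rough illustration of what such a mapping does during conversion, the sketch below wraps a paragraph in the LyX layout that its Word style maps to. It is not taken from word2lyx: lyx_paragraph is a hypothetical name, and falling back to the "Standard" layout for unmapped styles is an assumption.

```python
# Mapping copied from the [ParagraphStyles] example in the article template
PARAGRAPH_STYLES = {
    "Title": "Title",
    "Author": "Author",
    "Part": "Part",
    "Heading1": "Chapter",
}


def lyx_paragraph(word_style, text, styles=PARAGRAPH_STYLES):
    """Wrap text in the LyX layout mapped to a Word paragraph style."""
    # Assumption: unmapped styles fall back to LyX's "Standard" layout
    layout = styles.get(word_style, "Standard")
    return "\\begin_layout %s\n%s\n\\end_layout\n" % (layout, text)
```

With this mapping, a Word paragraph styled Heading1 comes out as a Chapter layout in the LyX file.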

Each word2lyx template includes several sections. These include:

  • ParagraphStyles: Matched pairs that convert Word styles (on the left) to LyX styles (on the right).
  • CharacterStyles: Matched pairs that convert Word charstyles (again, on the left) to LyX charstyles (again, specified on the right).
  • TableStyles: Similar to paragraph and character styles, but used in the conversion of tables.

Sections are defined by a section tag, enclosed in square brackets. Values in the section are entered as matching pairs, separated by an equals sign.

If you choose, you can add an additional section called [DocOptions], where you specify information such as the document class, which types of fonts you would like the document to use, and so forth. (Please refer to the LyX documentation for available options.)

If you create new templates, you should place them in the lyx/templates folder for word2lyx to be able to find them. They should have a ‘w2l’ file extension.
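The template layout described above (settings first, then bracketed sections of key = value pairs) is simple enough to read with a few lines of Python. The sketch below is a guess at how such a file could be parsed, not the parser that ships with word2lyx. The name parse_template is hypothetical, and stripping quotes from values is an assumption based on the ImageDir example.

```python
def parse_template(text):
    """Parse template text into (settings, sections) dictionaries."""
    settings, sections = {}, {}
    current = settings  # pairs before the first [Section] count as settings
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # Start a new section, for example [ParagraphStyles]
            current = sections.setdefault(line[1:-1], {})
            continue
        if "=" not in line:
            continue  # skip anything that is not a key = value pair
        key, _, value = line.partition("=")
        # Assumption: surrounding quotes are not part of the value
        current[key.strip()] = value.strip().strip("'\"")
    return settings, sections
```

Fed the article excerpt shown earlier, this returns ImageDir and IgnoreStyles as settings and a ParagraphStyles dictionary that maps Heading1 to Chapter.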

Conclusion

I’m posting this utility in the hope that people will find it useful. Please, download it and take it for a test drive on your documents. Let me know about any successes or failures that you might have. (Especially the failures, those are important to fix.) If there’s a feature you would like to see included, please leave a comment here. I’m looking forward to hearing if word2lyx helps make collaboration a little easier for you.

Comments

14 Responses to “Importing Word Documents Into LyX (word2lyx 0.1)”

Ketil Thorgersen wrote a comment on March 8, 2012

Wonderful! Will surely come in handy and should (of course) be a part of the official Lyx program! Any chance of seeing that?

And how is the work on Lyx-outline going? I see that you regularly attempt to build Ubuntu packages but that they fail on your devel ppa. Is this because you do not care because it is being merged into Lyx official or because of something else?

Thanks for all your hard work!!
ketil

Rob Oakes wrote a comment on March 8, 2012

Hi Ketil,

Integrating word2lyx into the main LyX code is my current plan. This is a preliminary release so that people can test it and report problems. Once issues are worked out, it will be incorporated upstream.

The work on LyX-Outline is coming slowly. The builds on the devel-ppa are currently a work in progress. There is a problem with dependencies that we are trying to work out. I just haven’t had the time to work on it yet. There are also a couple of problems related to the LyX 2.0 codebase that still need some sorting, again, I haven’t had much time to tackle it.

However, LyX-Outline is still quite useable. It’s the writing environment I use every day. At the moment, if you want to test drive it, you have to build it yourself. I’m nearly finished with a book I’ve been working on. Once that is complete, LyX-Outline is the first project up.

I know that I’ve been saying this for a while, but it is still true 😉

Cheers,

Rob

Bob Alvarez wrote a comment on March 13, 2012

Rob

I am getting an error when I run the program on my Windows XP system with Python 2.6. The error message is below. Any suggestions? If you are interested, I can send the docx file.

E:\Downloads-Apps\Lyx\Word2lyx\word2lyx>python word2lyx.py claims.docx claims.lyx
Traceback (most recent call last):
  File "word2lyx.py", line 13, in <module>
    from docx import read as docxread
  File "E:\Downloads-Apps\Lyx\Word2lyx\word2lyx\docx\read.py", line 10, in <module>
    from parser import ElementTree as xmlreader
  File "E:\Downloads-Apps\Lyx\Word2lyx\word2lyx\docx\parser.py", line 13, in <module>
    class etree_element(ElementTree.Element):
TypeError: Error when calling the metaclass bases
    function() argument 1 must be code, not str

Ketil Thorgersen wrote a comment on March 14, 2012

Hi again Rob

I used to run Lyx-outline, but since I reinstalled ubuntu some months ago I never could make the compilation work again. The instructions at http://blog.oak-tree.us/index.php/2010/06/25/lyx-outline02-1 are obviously outdated, since there is no “cmake project file (‘CMakeLists.txt’), which can be found in the ‘development/cmake’ folder.”
I tried to compile using the CMakeLists.txt file in the root, but that failed. I also tried to just compile with ./install ; ./configure ; make but even that failed. I would be very grateful if you could post updated instructions on how to compile Lyx-outline!!

All the best
Ketil

Rob Oakes wrote a comment on March 14, 2012

Hi Bob,

There appears to be a problem with how I’ve subclassed the xml parser. The class definitions in Python 2.6 are different than Python 2.7 (which I used for development). I’ll look into it more today and try and post a fix.

Part of the problem is that I haven’t had access to a testing machine with Python 2.6 installed on it. While I need to fix this error, the easiest fix might be to update to Python 2.7.

I’ll send you a link to the new download as soon as I’ve got it working.

Cheers,

Rob

Rob Oakes wrote a comment on March 14, 2012

Hi Ketil,

I’ll be happy to update the instructions. FYI, though, building LyX-Outline is exactly like building LyX. The biggest snag for most people is making sure that they have all the dependencies. For users on Mac and Windows, this means downloading a number of things and configuring the path.

For Linux users, though, it’s more straightforward. apt-get can automate the download of most of the dependencies through a single command:

sudo apt-get build-dep lyx

From there, make sure that you have bzr installed:

sudo apt-get install bzr

At that point, download the sources:

bzr branch lp:lyx-outline lyx-outline-devel

Then, make a new folder for your build:

mkdir lyx-outline-build

Go inside the folder for the build:

cd lyx-outline-build

Configure:

cmake ../lyx-outline-devel

Once it’s configured, compile:

make

The build process will probably take between 15 and 20 minutes. But once it’s finished, you’ll have a fully functioning version of LyX-Outline to use. You can update the source code by going into the lyx-outline-devel directory and using:

bzr pull lp:lyx-outline

Hope that helps.

Cheers,

Rob

Bob Alvarez wrote a comment on March 14, 2012

Hi Rob

I installed Lyx 2.0.3 on one of my computers. This also installed Python 2.7 in the Lyx directory (C:\Program Files\LyX20\Python). When I added the directory to the system path, I was able to execute your code.
Unfortunately I got another error:

C:\Downloads\Lyx20\word2lyx>python word2lyx.py claims.docx claims.lyx
Traceback (most recent call last):
  File "word2lyx.py", line 63, in <module>
    DOC_REL = docxread.openDocxRelationships(inputfile)
  File "C:\Downloads\Lyx20\word2lyx\docx\read.py", line 26, in openDocxRelationships
    relationships = xmlreader.fromstring(xmlrels)
  File "C:\Program Files\LyX20\Python\lib\xml\etree\ElementTree.py", line 1281, in XML
    parser = XMLParser(target=TreeBuilder())
  File "C:\Program Files\LyX20\Python\lib\xml\etree\ElementTree.py", line 1447, in __init__
    "No module named expat; use SimpleXMLTreeBuilder instead"
ImportError: No module named expat; use SimpleXMLTreeBuilder instead

Ketil Thorgersen wrote a comment on March 15, 2012

Hi again

The install instructions worked brilliantly! Thanks! Only one problem so far and that is that the spellchecker is greyed out despite me having the aspell and hunspell libraries installed. Any pointers?

Thanks again!!
Ketil

Bob Alvarez wrote a comment on March 19, 2012

Rob

I was able to get the word2lyx program to work by installing Python 2.7 from the python.org website. Apparently the version of Python installed with Lyx is not a full version??

Lyx 2.0.3 complains about some problems with the Lyx file created by word2lyx but then displays the document and it seems to be OK.

I think your instructions should specify the version of Python. I have found this to be a general problem with distributing Python software. Different versions are not fully compatible nor even compatible in more recent versions. Something that runs with 2.7 will not necessarily run with 2.8. A possible solution is to include the compatible version of Python with your distributed software. This is a pain because the distribution file is big but it works.

Bob

Rob Oakes wrote a comment on March 19, 2012

Hi Bob,

Glad to hear that you were able to get things working.

I think you’re right. I developed against Python 2.7.2, and I tried to be very careful not to use any Python 2.7 or 3.0 specific features.

Apparently, I wasn’t very successful. People have reported problems with the Python that ships with LyX 2.0 on Windows, and with version 2.6. I’m in the process of writing a wrapper library, which I hope will take care of most of those issues.

I also need to talk with the LyX developers and find out which version they ship with LyX 2.0. It may be because they’ve shipped 2.7.1, or because they don’t include the whole standard library. If it’s the latter case, I’ll need to petition that they include the xml classes. I’ve just about got ePub export working, and rely heavily on the xml classes in the standard library for parsing and validation. If they don’t ship them, I might have to include them, myself.

In fact, that might just be the best solution. Instead of shipping all of Python, just include the standard library dependencies.

Cheers,

Rob

José Carvallo wrote a comment on June 26, 2012

Hello Rob. First of all I’d like to thank you for this script. It’ll be very helpful to me if I learn how to use it.
I’m on Mac OS X, trying to convert some .docx files I have. I downloaded Python 2.7.3 (as I read in the comments that it’s the compatible version).

Then I opened Terminal and wrote:

cd /Users/jcarvallo/Cursos/LaTeX/word2lyx
python word2lyx.py "Grupo 23.docx" "G.lyx"

The result was:

Beginning Conversion of Grupo 23.docx
Traceback (most recent call last):
  File "word2lyx.py", line 646, in <module>
    doc_body = processDocument(inputfile, outputfile, doc_options)
  File "word2lyx.py", line 118, in processDocument
    out_text = processTable(body_element)
  File "word2lyx.py", line 413, in processTable
    out_text = writeLyxTable(lyx_table)
  File "/Users/jcarvallo/Cursos/LaTeX/word2lyx/lyx/tables.py", line 234, in writeLyxTable
    table_col = lyx_table.columns[col]
IndexError: list index out of range

Also, when I try to do the automatized version of the process through LyX I get the following error message:

Se ha producido un error al ejecutar: (means: “an error occurred trying to execute:”)
python “/Users/jcarvallo/Cursos/LaTeX/word2lyx/word2lyx.py” $$I “Grupo 23.lyx”
–t article

What am I doing wrong?

Thank you!

Chester wrote a comment on July 2, 2012

Hi friends. I am facing a little bit of a problem with my Lyx document. I have uploaded some pictures and graphs into my Lyx document. I plotted the graphs using Excel, then converted them to PNG. However, when I try to convert the document to PDF, the graphs do not appear in the document. I have tried my best to manipulate and navigate my way through, but things just can’t work for me. I am not thinking of switching to MS Word because of this, so I seek your help. You can please help me by sending the comments or instructions to my mail address: ckalinda@gamil.com. Thanking you all in advance.

AC wrote a comment on August 22, 2012

Rob… This works beautifully, but I am running into two strange problems. I converted http://dl.dropbox.com/u/1791181/Word2Lyx.docx to http://dl.dropbox.com/u/1791181/Word2Lyx.lyx. When I open the converted file, it throws a “Unknown token: \use_package \use_package” error with a string of “Document header error” messages. Second, the conversion process ignores hyperlinks. You can see that footnote #1 does not have the URL. I removed the hyperlink in Footnote #2 in the Word file before conversion, and this shows up fine. What gives?

Fabian Pascal wrote a comment on July 6, 2015

Hi,

Am I correct that in order to use the converter I must have Python installed?
Is there any chance of making it available for us, non-programmers, without Python?
Thanks.
FP
