For many writers, the act of writing (or placing one word after another) is synonymous with the tool that they use to do it. There’s a reason why writers feel so strongly about their Moleskine notebooks, fountain pens, and computer software. I’m no different from any other writer. I have my preferred tools, and I love them dearly. They help me focus on my ideas and craft prose that I can be proud of.
When writing on a computer, the tool of choice for many writers (dare I say most?) is Microsoft Word. It’s everywhere, and everyone has used it. It comes preinstalled on most computers and is a de facto standard for exchanging written material with others.
Unfortunately, Word is not part of my preferred toolset. I prefer to write using LyX and an add-on I’ve written for it called LyX-Outline. But while I love my writing program, it makes it difficult to collaborate with other writers who use Word, as LyX doesn’t have a straightforward way to directly import Word files.
This isn’t a new problem, and I’ve written about it before. I’ve even proposed a solution. But while that solution was a good fit for me, it isn’t something that I would recommend to others.
For starters, it required a great deal of software to be installed. You needed a program to convert Microsoft Word documents to OpenOffice documents. You then had to use a second utility to convert the result to HTML or LaTeX. After that, you used a third utility to clean it up and import the LaTeX code into LyX. Three distinct steps, with a lot of places where things could go wrong.
Over the past few months, I’ve found that I need a better way, a tool that can directly import a Word document and cleanly translate its content. So, I decided to create one.
I wanted my tool to meet several important criteria:
- it should be easy to use and maintain
- it should maintain document structure including headers, styles and other elements
- it should successfully translate Word to LyX, meaning that I don’t need to spend time repairing double quotes or fixing em-dashes
- it should support tables, images, and footnotes
- it should be extensible, allowing me to create custom templates that support a wide variety of Word documents and LaTeX document classes
After several months of work, I’m now ready to release an alpha-quality version of that tool. I’m calling it word2lyx.
word2lyx is a Python script that converts Microsoft Word documents to a format that can be imported into LyX. word2lyx only works on documents saved in the docx format, created using Word 2007 or 2010. Older files in the doc format will have to be converted to docx before word2lyx can read them.
But here’s the important thing: it works directly on the Word file. It doesn’t rely on external parsers for translation. For that reason, you can specify exactly how you would like your documents to be converted.
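To give a sense of what working directly on the Word file involves, here is a minimal sketch using only the Python standard library. It is not the code word2lyx uses; the function name read_paragraphs is made up for illustration. It relies on two facts about the docx format: the file is a ZIP package, and the main text lives in word/document.xml, where each paragraph can carry a style id.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def read_paragraphs(docx_file):
    """Return (style_id, text) pairs for each paragraph in a docx file.

    docx_file may be a path or a binary file-like object. Paragraphs
    without an explicit style are reported with style_id None.
    Hypothetical helper, not part of word2lyx.
    """
    # A docx file is a ZIP package; the body text is in word/document.xml.
    with zipfile.ZipFile(docx_file) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # The paragraph style id, if any, sits at w:pPr/w:pStyle/@w:val.
        style = para.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W_NS}}}val") if style is not None else None
        # Paragraph text is split across runs; join every w:t element.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append((style_id, text))
    return paragraphs
```

Calling read_paragraphs on a docx file gives a list such as style id plus text for every paragraph, which is the raw material a converter needs before it can map Word styles to LyX styles.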
Currently, word2lyx supports:
- Translating Word paragraph and character styles to LyX paragraph and character styles. In cases where the character styles aren’t defined, it will write entries for them in the local layout. These entries will include basic LaTeX font commands, created from the font properties.
- Importing Word tables, including those with merged rows or columns. It will also do its best with table borders.
- Enumerated and itemized lists
- Importing images from the Word document. Word image objects, such as Excel charts and graphs, are skipped.
- The use of custom templates, which allow you to finely describe how you would like for your documents to be translated. There are two templates included with the script and supporting libraries: article.w2l and book.w2l. These can be found in the lyx/templates folder. (For more about templates, please see below.)
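The fallback for undefined character styles in the list above (font properties turned into basic LaTeX font commands) can be pictured with a small sketch. This is only an illustration under my own assumptions: the property names bold, italic, and small_caps and the function latex_for_font are hypothetical, and the real script writes its entries into the LyX local layout rather than wrapping text like this.

```python
# Hypothetical mapping from simple font properties to standard LaTeX
# font commands. The property names are assumptions for illustration.
FONT_COMMANDS = (
    ("bold", r"\textbf"),
    ("italic", r"\textit"),
    ("small_caps", r"\textsc"),
)


def latex_for_font(properties, text):
    """Wrap text in one LaTeX font command per enabled font property."""
    result = text
    for name, command in FONT_COMMANDS:
        if properties.get(name):
            # Each enabled property adds one more layer of wrapping.
            result = f"{command}{{{result}}}"
    return result
```

For a style that is both bold and italic, the text ends up wrapped in both commands, with the bold command innermost because it is applied first.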
For the best conversion, please make use of paragraph and character styles. These styles are converted directly, while per-word formatting is removed. This makes it easier to create clean markup for LyX.
To use the script, use a variation of the command below:
python word2lyx.py "InputFile.docx" "OutputFile.lyx"
You can also specify a template you would like for it to use:
python word2lyx.py "InputFile.docx" "OutputFile.lyx" -t book
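If you have a whole folder of Word files, the same command can be driven from a short Python wrapper. The sketch below is not shipped with word2lyx; build_command and convert_folder are hypothetical helpers, and it assumes that python on your path is an interpreter that can run the script and that word2lyx.py is reachable at the path you pass in.

```python
import subprocess
from pathlib import Path


def build_command(input_path, output_path, template=None,
                  script="word2lyx.py", python="python"):
    """Build the argument list for a single word2lyx run.

    Mirrors the command line shown above: input file, output file,
    and an optional -t template name.
    """
    command = [python, str(script), str(input_path), str(output_path)]
    if template:
        command += ["-t", template]
    return command


def convert_folder(folder, template=None, script="word2lyx.py"):
    """Convert every .docx file in folder, writing a .lyx file beside it."""
    converted = []
    for docx in sorted(Path(folder).glob("*.docx")):
        target = docx.with_suffix(".lyx")
        # check=True raises CalledProcessError if the script exits with an error.
        subprocess.run(build_command(docx, target, template, script), check=True)
        converted.append(target)
    return converted
```

Passing the arguments as a list means file names with spaces need no extra quoting.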
Right now, word2lyx includes two templates: one for articles and another for book chapters. These can serve as a base for importing your work into LyX. After you’ve got it loaded, you can then change the document class and begin to customize it from inside of LyX.
Downloads and Installation
If you think word2lyx sounds interesting, you can download it here. The download includes the script and two supporting libraries, one for reading docx files and one for writing LyX files. It also includes a sample document that shows off most of the script’s features. To install word2lyx, put the files somewhere inside your Python path. (Or, when you run it, you can just type the full path to the script.)
Automating Document Conversion
Though using the script from the command line is pretty easy, a better way is to automate the conversion process. To do this, you need to create a new LyX document converter. Step-by-step instructions for adding additional file types and converters to LyX can be found in the older post I wrote on converting Word to LyX. Here, I’ll repeat the most important information.
First, go to the File Handling preferences (Tools > Preferences > File Handling > File Types), and add a new type called “MS Word 2010”. Set the file extension to “docx”.
Next, go to the Converters dialog and create a new “MS Word 2010” to “LyX” converter.
From Format = “MS Word 2010”, To Format = “LyX”. Converter setting:
python "/path/to/word2lyx.py" $$i $$o -t article
Once you’ve added these settings, you can then import a document by going to File > Import > MS Word 2010.
word2lyx templates are files which tell word2lyx how a particular type of document should be translated. They include lists of paragraph, character, and table styles and their LyX equivalents. Consider the example below, which is taken from the article template:
ImageDir = 'Images'
IgnoreStyles = FootnoteReference, EndnoteReference
Title = Title
Author = Author
Part = Part
Heading1 = Chapter
The template file contains settings and sections. The example above includes two settings, ImageDir and IgnoreStyles, and a group of paragraph styles. The paragraph styles specify how Word styles (on the left), such as Heading1, should be translated into LyX styles, such as Chapter, Section, etc.
When you create a new template (or extend one of the existing options), you add additional styles so that word2lyx knows how you want those particular pieces of information to be processed.
Each word2lyx template includes several sections. These include:
- ParagraphStyles: Matched pairs that convert Word styles (on the left) to LyX styles (on the right).
- CharacterStyles: Matched pairs that convert Word charstyles (again, on the left) to LyX charstyles (again, specified on the right).
- TableStyles: Similar to paragraph and character styles, but used in the conversion of tables.
Sections are defined by a section tag, enclosed in square brackets. Values in the section are entered as matching pairs, separated by an equals sign.
If you choose, you can add an additional section called [DocOptions], where you specify information such as the document class, which types of fonts you would like the document to use, and so forth. (Please refer to the LyX documentation for available options.)
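To make the file layout concrete, here is a rough sketch of how a template with this shape could be read in Python. It is not the parser that ships with word2lyx. The functions parse_template and lyx_style_for are hypothetical, and the sketch assumes that lines appearing before the first section tag are top-level settings and that unknown Word styles fall back to the LyX Standard paragraph style.

```python
def parse_template(text):
    """Parse w2l-style template text into (settings, sections).

    Assumption: key = value lines before the first [Section] tag are
    top-level settings; everything after a tag belongs to that section.
    """
    settings = {}
    sections = {}
    current = settings
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A section tag such as [ParagraphStyles] starts a new group.
            current = sections.setdefault(line[1:-1].strip(), {})
            continue
        if "=" not in line:
            continue  # skip anything that is not a key = value pair
        key, value = line.split("=", 1)
        # Drop surrounding whitespace and optional quotes around the value.
        current[key.strip()] = value.strip().strip("'\"")
    return settings, sections


def lyx_style_for(sections, word_style, default="Standard"):
    """Look up the LyX paragraph style for a Word style name."""
    return sections.get("ParagraphStyles", {}).get(word_style, default)
```

With a template whose [ParagraphStyles] section contains Heading1 = Chapter, lyx_style_for(sections, "Heading1") returns Chapter, and a style that is not listed falls back to the default.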
If you create new templates, you should place them in the lyx/templates folder for word2lyx to be able to find them. They should have a ‘w2l’ file extension.
I’m posting this utility in the hope that people will find it useful. Please download it and take it for a test drive on your documents. Let me know about any successes or failures that you might have. (Especially the failures; those are important to fix.) If there’s a feature you would like to see included, please leave a comment here. I’m looking forward to hearing whether word2lyx helps make collaboration a little easier for you.