Things I have learnt in the last week

(or, in some cases, things I already knew but never really cared about)

Microsoft Word's .docx format is officially called the "Office Open XML Format" and has been managed by ECMA since 2006.

The specification consists of four documents. The first one, "Fundamentals and Markup Language Reference", is 5039 (five thousand and thirty-nine) pages long. I have read several hundred of those pages in the last five days.

The specification is so long because Word is a stupid application that has ended up being able to do anything. This is also why it's so frustrating and does everything badly.

A .docx file is actually a zip file that contains several folders, each with individual documents within it. Most of these documents are in XML format.
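To make the "it's just a zip file" point concrete, here is a minimal sketch that lists the entry names inside an archive by reading the zip central directory directly. It is a hand-rolled reader that uses nothing outside core Ruby, not what I would ship: it assumes a plain (non-zip64) archive, it does not decompress anything, and the method name `zip_entry_names` and the `report.docx` path are made up for illustration. A proper zip library would be the sensible choice in real code.

```ruby
# Minimal sketch: list the entry names inside a zip archive (such as a .docx)
# by walking its central directory. Assumes no zip64; nothing is decompressed.
def zip_entry_names(bytes)
  bytes = bytes.b                      # work on raw bytes, not characters
  eocd_signature = "PK\x05\x06".b      # end-of-central-directory record
  entry_signature = "PK\x01\x02".b     # one central-directory file header

  eocd = bytes.rindex(eocd_signature)
  raise ArgumentError, "not a zip archive" unless eocd

  # Ten bytes into that record: entry count (2 bytes), directory size (4)
  # and the offset of the first central-directory header (4).
  total, _size, pos = bytes[eocd + 10, 10].unpack("vVV")

  Array.new(total) do
    unless bytes[pos, 4] == entry_signature
      raise ArgumentError, "corrupt central directory"
    end

    # At offset 28 of each header: lengths of the name, extra field and comment.
    name_len, extra_len, comment_len = bytes[pos + 28, 6].unpack("vvv")
    name = bytes[pos + 46, name_len].force_encoding("UTF-8")
    pos += 46 + name_len + extra_len + comment_len
    name
  end
end

# Hypothetical usage ("report.docx" is a placeholder path):
#   puts zip_entry_names(File.binread("report.docx"))
```

Run against a Word document, this should print paths such as `[Content_Types].xml` and `word/document.xml`, which are the same folders and documents the disk browser shows.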

If you drag a .docx file from the Finder to your BBEdit[1] icon in the Dock, BBEdit will open a "disk browser" window, which allows you to navigate through the internal folders and look at the contents of each document.

I've always tried to ignore XML and found it annoying to work with, but XML Schema documents (XSD) are surprisingly readable.

All of the examples for manipulating Word documents that I can find online use the .NET API. As I do not have access to that API, I have to translate the C# or VB code into matching XML fragments.

GitHub Copilot (powered by OpenAI's models) is amazing at taking a piece of C# code (that I paste into my Ruby source file) and translating it into Nokogiri::XML::Node statements containing the correct XML tag names (which do not always directly correspond to the API method names). This has probably saved me weeks of time and means I have not wanted to spend the last few days with my head buried in a bucket of vomit.
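As a rough illustration of that translation, the .NET examples tend to look like `new Paragraph(new Run(new Text("Hello")))`, while the XML uses short tags such as `w:p`, `w:r` and `w:t`, and a bold run becomes `w:b` inside `w:rPr`. The sketch below builds that fragment with REXML rather than Nokogiri, purely so it runs without installing a gem; the method name `word_paragraph` is made up, and this is not the code from my project.

```ruby
require "rexml/document"

# Minimal sketch: build the WordprocessingML for one paragraph containing a
# single run of text. REXML stands in for Nokogiri here (an assumption made
# to avoid a gem dependency); the element names are the same either way.
def word_paragraph(text, bold: false)
  namespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

  paragraph = REXML::Element.new("w:p")    # Paragraph in the .NET examples
  paragraph.add_namespace("w", namespace)

  run = paragraph.add_element("w:r")       # Run
  # RunProperties containing Bold becomes w:rPr containing w:b
  run.add_element("w:rPr").add_element("w:b") if bold

  run.add_element("w:t").text = text       # Text
  paragraph
end

puts word_paragraph("Hello", bold: true)
```

The `w:rPr` element goes before `w:t` inside the run, which is one of those ordering details that the API hides and the raw XML does not.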

Nokogiri is Ruby code from another age. A lot of the conventions and idioms we use in Ruby nowadays (especially in Rails) are not present and the style of it feels very different.

The Mac "TextEdit" application seems to have no support for images embedded within .docx files[2].

LibreOffice seems to be "liberal in what it accepts, strict in what it emits". Microsoft Word is "strict in what it accepts and complains loudly". Pages is "liberal in what it accepts but won't give you any meaningful help when things go wrong".

I have tried diffing a .docx file that shows the correct formatting and embedded images against the .docx file that my code generates. I can see no meaningful differences between them, yet the fucking thing still does not work.


  1. BBEdit is an amazing text editor for the Mac that has been around for years. If you have to do anything vaguely technical, it's definitely worth using.

  2. TextEdit uses .rtfd files, which are also a file system embedded within a file. But instead of a zip file, rtfd is a Mac package (a folder that behaves like a file).

Rahoul Baruah

Rubyist since 1.8.6. I like hair, dogs and Kim/Charli/Poppy. Also CTO at Collabor8Online.
Leeds, England