Confessions of a document generator
Brett Weir, May 1, 2023
There's something I have to say.
I like generating documentation. Not writing it—no one has time for that—but generating it!
A document that would have taken 30 minutes to just write now takes several hours, because I build a toolchain to generate it from source instead. But then I can hook it into some automation and now it magically auto-updates in perpetuity. It continues to update long after the sun has boiled over and humanity has absconded into the vast recesses of space. Eons into the future, the inheritors of our planet will find my documentation, still generating, last updated mere days before, and it will be a window through which they understand the civilization long lost.
Okay, maybe it's not quite like that.
I'm not kidding though; I really like generating documentation.
I always have. It remains to be seen whether I can successfully generate a given document with the given inputs, but I'll be damned if I'm not going to try. Even if I will never need to write this document again, the opportunity is just. Too. Good. To pass up.
If your parents ask you if you want ice cream after school, you don't say "Forget it, Mom. I'm just going to eat a healthy dinner, do my homework, and go to bed." Screw that! You're going to say "Hell, yes, Mom! I'd like two scoops! Maybe three!"
And so it is with document engineering. There is no limit to how clever you can be, and this can be your downfall.
A case study
The cycle begins innocently enough. You see a word or phrase repeated in the document. Maybe it's the name of the company, or the version of the software. Then the little worry sets in. "What if we get bought out? What if they release a new version of the software before I finish writing this?"
So what do you do? Add a little templating. Replace all instances of the company name with {companyName}. And that pesky version number? Now it's {softwareVersion}.
At this point, you've already figured out that doing this in Google Docs was a bad idea. You realize that this really should be a Markdown file. That way, you can commit it to Git, alongside the newly written Python script that's going to find and replace all instances of {companyName} and {softwareVersion} in your source. Awesome. Now you can track everything together.
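A minimal sketch of what that find-and-replace script might look like, assuming the values live in a plain dictionary. The function name `fill_placeholders` and the sample values are made up for illustration; only the {companyName} and {softwareVersion} placeholders come from the story.

```python
import re

# Hypothetical values; the real company name and version would go here.
VALUES = {
    "companyName": "Example Corp",
    "softwareVersion": "1.2.3",
}

# Matches placeholders such as {companyName}: braces around a single identifier.
PLACEHOLDER = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def fill_placeholders(text: str, values: dict[str, str]) -> str:
    """Replace each known {name} with its value; leave unknown ones untouched."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return values.get(name, match.group(0))

    return PLACEHOLDER.sub(substitute, text)


if __name__ == "__main__":
    print(fill_placeholders("{companyName} ships version {softwareVersion}.", VALUES))
```

A regular expression is used here rather than `str.format` because Markdown tends to contain stray braces, and `str.format` raises an error on those.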
But how's this going to work? You can't just replace the text in place. If you do that, you'll never be able to run your replacement script again. No, your script needs to render the final files to a different directory. Thus, the dist/ folder was born.
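The render-to-another-directory step might be sketched like this. The `docs/` source directory, the `render_tree` name, and the sample value are assumptions for illustration; the story only specifies the dist/ folder.

```python
from pathlib import Path


def render_tree(src_dir: Path, out_dir: Path, values: dict[str, str]) -> list[Path]:
    """Render every Markdown file under src_dir to the same relative path under out_dir."""
    written = []
    for src in sorted(src_dir.rglob("*.md")):
        text = src.read_text(encoding="utf-8")
        for name, value in values.items():
            text = text.replace("{" + name + "}", value)
        dest = out_dir / src.relative_to(src_dir)
        # Create missing output directories; existing outputs are overwritten.
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(text, encoding="utf-8")
        written.append(dest)
    return written


if __name__ == "__main__":
    # Hypothetical layout: sources in docs/, rendered copies in dist/.
    for path in render_tree(Path("docs"), Path("dist"), {"companyName": "Example Corp"}):
        print(path)
```

Because the sources are never modified, the script can be rerun as often as you like, which is the whole point of keeping the output separate.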
Of course, now that you've created a dist/ folder, mapped the source paths to the output paths, and extended your script to overwrite existing outputs and create missing directories, it's only a stone's throw to generating multiple output formats, like dist/pdf/. We're only using Python's string handling to do the replacements, but how hard would it be to use Pandoc instead? Pandoc supports so many output formats. EPUB and roff? LaTeX? Word docs? Heck, I can even convert it to Confluence wiki markup to upload to our internal wiki if I wanted to. What if our customers want to read our docs on their eReaders? I know there's a guy in our department who likes using Emacs; maybe Org Mode would be a good export too?
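Here is roughly how the script might hand files to Pandoc, one dist/ subfolder per format. This is a sketch: the `FORMATS` table and the function names are hypothetical, and it relies only on Pandoc's basic `pandoc INPUT -o OUTPUT` invocation, which infers the output format from the file extension. PDF is left out on purpose, since that also needs a LaTeX installation.

```python
import subprocess
from pathlib import Path

# Hypothetical mapping from dist/ subfolder to the extension Pandoc should produce.
FORMATS = {"epub": ".epub", "docx": ".docx", "org": ".org"}


def pandoc_command(src: Path, dist_dir: Path, fmt: str) -> list[str]:
    """Build a `pandoc INPUT -o OUTPUT` command writing to dist/<fmt>/<name><ext>."""
    dest = dist_dir / fmt / (src.stem + FORMATS[fmt])
    return ["pandoc", str(src), "-o", str(dest)]


def convert(src: Path, dist_dir: Path, fmt: str) -> None:
    """Run Pandoc for one file and one format; requires pandoc on PATH."""
    command = pandoc_command(src, dist_dir, fmt)
    Path(command[-1]).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(command, check=True)
```

Building the command separately from running it keeps the path mapping easy to check without Pandoc installed.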
But that's not enough; it's never enough. Because now, with all these export formats, you want them to look good, right? A plain PDF and a plain HTML export would not look good at all, definitely not like the amazing sites everyone else is producing. There's just no way. You gotta write some custom stylesheets for that! Maybe the output format is web-oriented and supports modern CSS, maybe it isn't, or maybe there is no way to customize it at all.
At any rate, you're committed now, because you've probably already told everyone about all these amazing documentation capabilities you're about to introduce. Did you ever finish the document you set out to write? Does it matter?
The pivot
At this point, you start to wonder if maybe writing it in Markdown was a bad idea. Now that you're trying to support a print version of your document, you realize how strange it's going to look if you don't have section headings or a table of contents or an index. Pandoc might be able to do this, with some elbow grease, but something purpose-built to generate those things might work better.
Maybe it's time to migrate to Sphinx, and translate our content to reStructuredText. Pandoc can help with this. It won't do a perfect job, but we can get most of it and then patch up the rest by hand. Okay, now we can produce LaTeX documents, but bootstrapping a TeX Live installation in CI has proven problematic (it's like 1.5GB!), so we need to create a build container that bottles up this toolchain so we don't need to install it for each build job. We can install Sphinx too, and pre-load it with all the extensions we're using. Sphinx is a little heavier, and doesn't support live-reloading out of the box, but I'm sure we can hack something together.
Ack! You realize that the document will probably need to be translated into some other languages for customers. LaTeX doesn't support Unicode! At least not the pdfTeX engine. It looks like Sphinx has support for the XeTeX engine though, so you can try switching to that. However, now your TeX preamble is hopelessly broken, you don't know what packages to use anymore, and you're drowning in TeX errors. Maybe we'll abandon the international versions. Customizing the PDF theme was also too hard, but the defaults look pretty good.
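For reference, the engine switch itself is the easy part. A sketch of the relevant `conf.py` fragment, assuming an otherwise default Sphinx project; it does nothing to rescue a custom preamble written for pdfTeX.

```python
# conf.py (fragment): ask Sphinx's LaTeX builder to use XeLaTeX instead of pdfLaTeX.
latex_engine = "xelatex"
```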
Summing it up
Okay, so what's the score now?
- reStructuredText markup
- Sphinx toolchain
- pdfTeX for PDF export
- No Unicode support
- No customization
- 1.5GB container to build PDFs in CI
- Forgotten why I'm here
Whatever, that's probably good enough for now. You bundle up all the document exports and send copies off to your technical writer. You're super stoked. You won. You've conquered the tools. You're a document master.
A few hours later, buried in an email from your technical writer, you read: "I keep finding instances of {companyName} and {softwareVersion} in the document. Typo?"