Sunday, July 31, 2022

Implementing a "mini-LaTeX" in ~2000 lines of code

A preliminary note

The previous blog post on this subject got posted to Hacker News. The comments were roughly like the following (contains exaggeration, but only a little):

The author claims to obtain superior quality compared to LaTeX but after reading the first three sentences I can say that there is nothing like that in the text so I just stopped there. Worst! Blog! Post! Ever!

So, to be absolutely clear, I'm not claiming to "beat" LaTeX in any sort of way (nor did I make such a claim in the original post). The quality is probably not better and the code has only a fraction of the functionality of LaTeX. Missing features include things like images, tables, cross references and even fairly basic functionality like widow and orphan control, italic text or any sort of customizability. This is just an experiment I'm doing for my own amusement.

What does it do then?

It can take a simple plain text version of The War of the Worlds, parse it into chapters, lay the text out in justified paragraphs and write the whole thing out into a PDF file. The output looks like this:

Even though the implementation is quite simple, it does have some typographical niceties. All new chapters begin on a right-hand page, the text is hyphenated, there are page numbers (but not on an empty page immediately preceding a new chapter) and the first paragraph of each chapter is not indented. The curious among you can examine the actual PDF yourselves. Just be aware that it has known bugs.
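
The chapter-opening rules can be sketched roughly as follows. This is not the code from the repository, just a minimal illustration: `Page` and `paginate` are hypothetical names, and the sketch assumes that page 1 is a right-hand page, so right-hand pages are the odd-numbered ones.

```cpp
#include <cassert>
#include <vector>

struct Page {
    int number;        // 1-based physical page number
    bool blank;        // true for a filler page inserted before a chapter
    bool show_number;  // whether a page number is printed on this page
};

// Hypothetical helper: takes the number of pages each chapter needs and
// returns the full page sequence, padded so chapters start on odd pages.
std::vector<Page> paginate(const std::vector<int> &pages_per_chapter) {
    std::vector<Page> pages;
    for (int count : pages_per_chapter) {
        assert(count >= 1);  // every chapter occupies at least one page
        // An odd number of pages so far means the next page is even,
        // i.e. left-hand, so pad with a blank, unnumbered page.
        if (pages.size() % 2 == 1) {
            pages.push_back(Page{static_cast<int>(pages.size()) + 1, true, false});
        }
        for (int i = 0; i < count; ++i) {
            pages.push_back(Page{static_cast<int>(pages.size()) + 1, false, true});
        }
    }
    return pages;
}
```

For two chapters of 3 and 2 pages this yields six pages in total: page 4 is blank and carries no page number, and the second chapter starts on page 5.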

Thus we can reasonably say that the code does contain an implementation of a very limited and basic form of LaTeX. The code repository has approximately 3000 total lines of C++ code, but if you remove the GUI application and other helper code, the core implementation only has around 2000 lines. Most of the "heavy lifting" is handled by Pango and Cairo.

Performance

The input text file for War of the Worlds is about 332 kB in size and the final PDF contains 221 pages. The program generates the output in 7 seconds on a Ryzen 7 3700 using only one core. This problem is fairly easily parallelizable, so if one were to use all 8 cores (16 hardware threads) at the same time, the ideal-case time would be around 7 s / 8 ≈ 0.9 s and the whole operation would take less than a second. I did not do exact measurements but the processing speed seems to be within the same order of magnitude as plain LaTeX.

The really surprising thing was that according to Massif the peak memory consumption was 5 MB. I had not tried to save memory when coding and just made copies of strings and other objects without a care in the world and still managed to almost fit the entire workload in the 4 MB L2 cache of the processor. Goes to show that premature optimization really is the root of all evil (or wasted effort at least).

Most CPU cycles are spent inside Pango. This is not due to any performance problems in Pango, but because this algorithm has an atypical workload. It keeps asking Pango to shape and measure short text segments that are almost but not entirely identical. For each line that does get rendered, Pango has to process ~10 similar blocks of text. The code caches the results so it should only ask for the size of any individual string once, but this is still the bottleneck. On the other hand, since you can process a fairly hefty book in 10 seconds or so, it is arguable whether further optimizations are even necessary.
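
The caching idea can be illustrated with a small memoizing wrapper. This is only a sketch under assumptions, not the repository's code: `WidthCache` is a hypothetical name, and the expensive shaping call is represented by an arbitrary callback instead of a real Pango call, so that the example stays self-contained.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical cache: remembers the measured width of every string it has
// seen, so the expensive measuring callback runs at most once per string.
class WidthCache {
public:
    explicit WidthCache(std::function<double(const std::string &)> measure)
        : measure_(std::move(measure)) {}

    double width(const std::string &text) {
        auto it = cache_.find(text);
        if (it != cache_.end()) {
            return it->second;  // seen before: no shaping needed
        }
        const double w = measure_(text);  // stand-in for the shaping call
        cache_.emplace(text, w);
        return w;
    }

    // Number of distinct strings that have been measured so far.
    std::size_t size() const { return cache_.size(); }

private:
    std::function<double(const std::string &)> measure_;
    std::unordered_map<std::string, double> cache_;
};
```

Note that a cache like this only helps with exact repeats. Strings that differ by a single word or hyphenation point are still distinct keys and each needs its own measurement, which is consistent with shaping remaining the bottleneck.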

The future

I don't have any idea what I'm going to do with this code, if anything. One blue sky idea that came to mind was that it would be kind of cool to have a modern, fully color managed version of LaTeX that goes from Markdown to a final PDF and ebook. This is not really feasible with the current approach since Cairo can only produce RGB files. There has been talk of adding full color space support to Cairo but nothing has happened on that front in 10 years or so.

Cairo is not really on its own in this predicament. Creating PDF files that are suitable for "commercial grade" printing using only open source is surprisingly difficult. For example LibreOffice does output text in the proper grayscale colorspace but silently converts all grayscale images (even 1-bit ones) to RGB. The only software that seems to get everything right is Scribus.

1 comment:

  1. For publishing a PDF plus an ebook (or HTML page(s)), there is ConTeXt which is quite powerful: https://wiki.contextgarden.net/

    For the ebook, ConTeXt has support for XML.

    Since Markdown is a simpler language, I think it's definitely possible to convert *.md pages into ConTeXt, then "export" it to different formats.

    (ConTeXt is a sibling of LaTeX, if TeX is the parent ;-) )
