Friday, June 11, 2021

Typesetting a full book part II, Scribus

Some time ago I wrote a blog post on what it's like to typeset an entire book using nothing but LibreOffice. One of the comments mentioned that LO does not do a great job of aligning text. This is again probably because it needs to copy MS Word's behaviour, which means greedy line splitting. Supposedly Scribus does this a lot better, but the only way to be really sure was to typeset the whole text with Scribus. So that's what I did (using the latest 1.5 release from Flathub).
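
To make "greedy line splitting" concrete, here is a minimal Python sketch that contrasts it with a breaker that looks at the whole paragraph before deciding. It is not the algorithm of LibreOffice, Word or Scribus. The function names, the monospaced character-count model and the squared-slack cost are all assumptions made for illustration.

```python
def greedy_wrap(words, width):
    """Fill each line with as many words as fit, never looking ahead."""
    lines, current = [], []
    for word in words:
        # Length of the current line if this word were appended to it.
        candidate = len(" ".join(current + [word]))
        if current and candidate > width:
            lines.append(current)
            current = [word]
        else:
            current.append(word)
    if current:
        lines.append(current)
    return lines


def optimal_wrap(words, width):
    """Pick breaks that minimise the summed squared slack (last line is free)."""
    n = len(words)
    best = [0.0] * (n + 1)   # best[i]: minimal cost of setting words[i:]
    split = [n] * (n + 1)    # split[i]: end index of the line that starts at i
    for i in range(n - 1, -1, -1):
        best[i] = float("inf")
        length = -1
        for j in range(i + 1, n + 1):
            length += len(words[j - 1]) + 1
            if length > width and j > i + 1:
                break  # line too long; a lone overlong word is still allowed
            slack = max(width - length, 0)
            cost = 0 if j == n else slack ** 2
            if cost + best[j] < best[i]:
                best[i], split[i] = cost + best[j], j
    lines, i = [], 0
    while i < n:
        lines.append(words[i:split[i]])
        i = split[i]
    return lines


def raggedness(lines, width):
    """Summed squared slack over every line except the last."""
    return sum(max(width - len(" ".join(line)), 0) ** 2 for line in lines[:-1])


WORDS = "aaa bb cc ddddd".split()
GREEDY = greedy_wrap(WORDS, 6)    # "aaa bb" / "cc" / "ddddd"
OPTIMAL = optimal_wrap(WORDS, 6)  # "aaa" / "bb cc" / "ddddd"
```

With a width of 6 the greedy version leaves a lonely "cc" on the second line (raggedness 16), while the whole-paragraph version spreads the slack more evenly (raggedness 10).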

Workflow for Scribus

Every program has the things it is good for and things it's not that good for. Scribus' strengths lie in producing output with fairly short pieces of text with precise layout requirements, especially if there are many images. A traditional "single flow of text" is not that, so there are some things you need to plan for.

First of all, a Scribus document should not be created until the text is almost completely finished. Making big changes (like adding text to existing chapters, changing the physical page size etc) can become quite tedious. Scribus also does not handle long pieces of text particularly smoothly. I tried loading all 350ish pages into a single linked frame sequence. It sort of worked, but things got quite laggy quite quickly. Eventually I converged on a layout where every chapter was its own set of linked frames. The text was imported directly from LO files that held one chapter each. The original had just one big LO file, so I had to split it up by hand for the import. If the original had been done with master documents, this would have been simpler.

The table of contents had to be done by hand again. Scribus has support for tables, but they could not be used, because tables drew outlines around each cell and I could not find a way to switch that off. Web searching found several pages with info, but none of the suggestions worked. It also turns out that you cannot add page references to table cells, only to text frames. No, I don't know why either. The option was greyed out in the menus and trying to sneakily copypaste a page reference from a text frame to a table caused a segfault.

Issues discovered

While LO was surprisingly bug free, Scribus was less so and I encountered many bugs and baffling missing features, such as:
  • Scribus would sometimes create empty text frames far outside the document (e.g. on page 600 of a 300 page document)
  • Text frames got a strange empty character at their end which would cause text overflow warnings; deleting it did not help as the empty characters kept reappearing
  • Adding a page reference to an anchor point would always link to the page where the linked frame sequence started, not where the anchor was placed
  • Text is not hyphenated automatically, only by selecting a text frame and then selecting Extras > Hyphenate Text in the main menu; one would expect hyphenation to be a paragraph style property instead
  • I managed to create an anchor point that does not exist anywhere except the mark list, but deleting it leads to an immediate segfault
None of these obstacles were insurmountable, but they made for a bumpy experience. Eventually the work was done and here is how the two outputs compare (LO on the left, Scribus on the right).
As you can probably tell, Scribus creates more condensed output. The settings were the same for both programs (automatically translated from LO styles by Scribus, not verified by hand) and LO's output file was 339 pages compared to 326 for Scribus.

Which one should you use then?

Like most things in life, that depends. If your document has a notable amount of mathematics, then you most likely want to go with LaTeX. If the document is something like a magazine or you require the highest typographical quality possible, then Scribus is a good choice. For "plain old books" the question becomes more complicated.

If you need a fully color managed workflow, then Scribus is the only viable option. If the default output of LO is good enough for you, the document has few figures and you are fine with needing to have a great battle at the end to line the images up, LO provides a fairly smooth experience. You have to use styles properly, though, or the whole thing will end up in tears. LO is especially suitable for documents with lots of levels, headings and cross references between the two. LaTeX is also very good with those, but its unfortunate downside is that defining new styles is really hard. So is changing fonts, so you'd better be happy with Computer Modern. If the document has lots of images, then LaTeX's automatic figure floats make a ton of manual work completely disappear.

Original data

The original source documents as well as the PDF output for both programs can be found in this GitHub repo.

Tuesday, June 8, 2021

An overhaul of Meson's WrapDB dependency management/package manager service

For several years already Meson has had a web service called WrapDB for obtaining and building dependencies automatically. The basic idea is that it takes unaltered upstream tarballs, adds Meson build definitions (if needed) as a patch on top and builds the whole thing as a Meson subproject. While it has done its job and provided many packages, the UX for adding new versions has been a bit cumbersome.
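
As a rough illustration of the idea above, the following Python sketch parses a made-up wrap file: `source_url` points at the unaltered upstream tarball and `patch_directory` names the Meson build definitions laid on top. The project name, URL and hash are all hypothetical, and the set of keys the helper checks is this sketch's own choice, not the rule WrapDB's CI applies.

```python
import configparser

# Hypothetical wrap file for an imaginary project; every value is made up.
EXAMPLE_WRAP = """\
[wrap-file]
directory = yourproject-1.0.0
source_url = https://example.com/yourproject-1.0.0.tar.gz
source_filename = yourproject-1.0.0.tar.gz
source_hash = 0000000000000000000000000000000000000000000000000000000000000000
patch_directory = yourproject

[provide]
yourproject = yourproject_dep
"""

# Keys this sketch chooses to insist on (an assumption, not an official list).
CHECKED_KEYS = ("directory", "source_url", "source_filename", "source_hash")


def missing_wrap_keys(text):
    """Return the checked keys that are absent from the [wrap-file] section."""
    # Interpolation is disabled so that '%' in URLs is taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section("wrap-file"):
        return list(CHECKED_KEYS)
    section = parser["wrap-file"]
    return [key for key in CHECKED_KEYS if key not in section]
```

Calling `missing_wrap_keys(EXAMPLE_WRAP)` returns an empty list, while a file that only sets `directory` reports the three source keys as missing.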

Well, no more! With a lot of work from people (mostly Xavier Claessens) all of WrapDB has been overhauled to be simpler. Instead of separate repos, all wraps are now stored in a single repo, making things easier. Adding new packages or releases now looks like this:

  • Fork the repo
  • Add the necessary files
  • Submit a PR
  • Await results of automatic CI and (non-automatic :) reviewer comments
  • Fix issues until the PR is merged
The documentation for the new system is still being written, but submissions are already open. You do need the current trunk of Meson to use the v2 WrapDB. Version 1 will remain frozen for now so old projects will keep on building. All packages and releases from v1 WrapDB have been moved to v2, except some old ones that have been replaced by something better (e.g. libjpeg has been replaced by libjpeg-turbo), so starting to use the new version should be transparent for most people.

Submitting new dependencies

Anyone can submit any dependency project that they need (assuming it is open source, of course). All you need to do is to convert the project's build definition to Meson and then submit a pull request as described above. You don't need permission from upstream to submit the project. The build files are MIT licensed so projects that want to provide native build definitions should be able to integrate WrapDB's build definitions painlessly.

Submitting your own libraries

Have you written a library that already builds with Meson, and would you like to make it available to all Meson users with a single command?

meson wrap install yourproject

The procedure is even simpler than above: you just need to file a pull request with the upstream info. It only takes a few minutes.

Friday, June 4, 2021

Formatting an entire book with LibreOffice, what's it like?

I have created full books using both LaTeX and Scribus. I have never done it with LibreOffice, though. The closest I've ever come was watching people write their master's theses in Word, failing miserably and swearing profusely. To find out what it's really like, I chose to typeset an entire book from scratch using nothing but LibreOffice.

The unfortunate thing about LibreOffice (which it inherits from MS Word compatibility) is that there is a "correct" way to use it which is the exact opposite of how people instinctively want to use it. The proper way is to use styles for everything as opposed to applying fonts, sizes, spacing etc. by hand. In practice, every time I am given a Word document I check how it has been formatted. I have never seen a single document in the wild that got this right. Even something as simple as paragraph indentation is almost always done with spaces.

Getting the text

Rather than using lorem ipsum or writing a book from scratch, I grabbed an existing book from Project Gutenberg. A random choice landed upon Gulliver's Travels, which turned out to be fortunate as it has several interesting and uncommon typographical choices. The source data of Project Gutenberg is UTF-8 text. All formatting has to be added afterwards. Here's what the first page ended up looking like after a few evenings' worth of work.

The source text file is line wrapped to 80 characters and paragraphs are separated by two or more newlines. This does not really work with LO, so the first step is to preprocess the text with a Python script so that every paragraph of text is on its own (very long) line, after which the text can be imported to LO. After import each section must be assigned a proper style. The simplest approach is to select all text, apply the Text Body style and then manually seek all chapter headings and set them to Heading 1. That takes care of the formatting needs of ~95% of the text (though the remaining 5% take 10x more work).
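
The preprocessing script itself is not shown here, so the following is only a minimal sketch of what such a step could look like. It assumes that blocks of text are separated by one or more blank lines and joins each block into a single long line. The function name is made up for this example.

```python
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines so each blank-line-separated block is one line."""
    # Normalise Windows line endings before splitting.
    text = text.replace("\r\n", "\n")
    # A run of one or more blank (or whitespace-only) lines ends a block.
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = []
    for block in blocks:
        pieces = [line.strip() for line in block.splitlines() if line.strip()]
        if pieces:
            paragraphs.append(" ".join(pieces))
    return "\n".join(paragraphs) + "\n"


SAMPLE = (
    "CHAPTER I.\n"
    "\n"
    "My father had a small estate\n"
    "in Nottinghamshire.\n"
    "\n"
    "\n"
    "I was the third\n"
    "of five sons.\n"
)
UNWRAPPED = unwrap_paragraphs(SAMPLE)
```

For the sample above the result has three lines: the heading and two unwrapped blocks of text. Headings still come out as ordinary lines, so assigning styles remains a manual step in LO.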

Page layout

The original book's dimensions are not provided, so I took a random softcover book [1] from my shelf, measured it with a ruler and replicated the page settings. The book is set in the traditional style where everything up to the actual text has page numbers in roman numerals whereas the actual text uses arabic numerals. Setting it up was straightforward, though I had to create six different page styles to get the desired result.

Text layout challenges

Gulliver's Travels is a bit unusual in that every chapter begins with a small introductory text explaining what will happen in the chapter. Apparently readers in the 1720s were not afraid of spoilers.

In the Project Gutenberg source text these sections (and many others) were written in all capital letters. However, it is likely that in the original book they were instead written in small caps. Fixing this would require retyping the text to be in lower case. Fortunately LO has an option in the format menu to convert text to lower case, which makes this operation fairly painless.

Another unusual thing is that the book does not have a regular table of contents. Instead, it duplicates these small introductory texts.

LO has a built-in TOC generator but it can't handle this (I think), so the layout has to be recreated by hand with tables and manual cross reference fields. Controlling page breaks and the like is difficult and I could not make it work perfectly. The above picture has two bugs. First, the illustration cross reference should be in roman numerals (as it is on a preface page) but LO insists on formatting it using an arabic number. Second, the last chapter on the left page gets split up and the page number is on the left page, whereas it should be on the right (bottom aligned). Even better would be if the chapter heading and text could be defined to always stick together and not be split over pages. There is a setting for this, but it does not seem to work inside tables.

Pictures

There are several illustrations in the book and scans of the pictures were also provided by Project Gutenberg. Adding them to the document revealed that figure handling is clearly LO's weakest point (again, presumably because it inherits its model from Word). It seems that in this model each figure has an anchor point in the text and you can align the figure relative to that, but the image must be on the same page as the anchor. Were it to go on the next page, LO adds a page break so that the two go to the same page. This leaves a potentially large empty space at the end of the previous page, which looks just plain weird.

In contrast, this is something that LaTeX does exceptionally well with its floating figures. Basically it tries to add the figure on the current page and if it will not fit, it puts it on the next page. There does not seem to be a way to get this behaviour in LO. Or at least I could not find one: googling did not help and neither did asking for help on the lazyweb. Playing with images was also the only time I managed to crash LO, so be careful; save early, save often.
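
The difference between the two figure models can be sketched with a toy simulation in Python. This is a heavily simplified model, not what LO or LaTeX actually implement (real LaTeX floats follow many more rules). Sizes are measured in text lines, text may break anywhere, and all names are made up for this example.

```python
def check_items(items, page_height):
    """Reject figures taller than a page; neither model below can place them."""
    for kind, size in items:
        if kind == "figure" and size > page_height:
            raise ValueError("figure taller than the page")


def layout_anchored(items, page_height):
    """Figure stays with its anchor: break the page early if it does not fit."""
    check_items(items, page_height)
    pages, used, blank = 1, 0, 0
    for kind, size in items:
        if kind == "figure":
            if used + size > page_height:
                blank += page_height - used  # empty space left behind
                pages, used = pages + 1, 0
            used += size
        else:
            remaining = size
            while remaining > 0:
                if used == page_height:
                    pages, used = pages + 1, 0
                take = min(remaining, page_height - used)
                used += take
                remaining -= take
    return pages, blank


def layout_floating(items, page_height):
    """Figure floats: text keeps filling the page, the figure waits for the next."""
    check_items(items, page_height)
    pages, used, blank = 1, 0, 0
    deferred = []  # heights of figures waiting for the top of a later page

    def start_page():
        nonlocal pages, used
        pages += 1
        used = 0
        # Waiting figures are placed first, in their original order.
        while deferred and deferred[0] <= page_height - used:
            used += deferred.pop(0)

    for kind, size in items:
        if kind == "figure":
            if not deferred and used + size <= page_height:
                used += size
            else:
                deferred.append(size)
        else:
            remaining = size
            while remaining > 0:
                if used == page_height:
                    start_page()
                    continue
                take = min(remaining, page_height - used)
                used += take
                remaining -= take
    while deferred:  # figures still waiting when the text runs out
        blank += page_height - used
        start_page()
    return pages, blank


# 30 lines of text, a 20 line figure, 30 more lines of text, 40 line pages.
EXAMPLE = [("text", 30), ("figure", 20), ("text", 30)]
```

Both functions return the page count and the number of blank lines left before the last page. For `EXAMPLE` with 40 line pages, the anchored model needs 3 pages and leaves 10 blank lines, while the floating model fits everything on 2 pages with none.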

The only reasonably working solution seems to be page aligned images. This works but means that if text is edited, figures do not move along with the changes and get disconnected from their source locations. Thus image aligning must be the very last thing to be done. This approach also does not work if you are using master documents. Books with many images should probably be typeset with Scribus instead, especially if proper color management is required.

In conclusion

If you are very disciplined and use LO exactly as it should be used, the end result is actually really nice. You can, for example, change the font used for text in only one place (the base style) and the entire document gets fully reformatted, reflown and repaged in less than a second. This allows you to do invasive layout tests easily, such as finding out how much more space IBM Plex Serif takes when compared to Nimbus Roman [2]. The downside is that any cut corners will cause broken output that you can't find without manually inspecting the entire document.

IKEA effect notwithstanding, laying out the text in proper form makes it a lot more enticing. The process of shaping raw text to form really makes it come alive in a sense. It would be nice if Project Gutenberg (or anyone else, really) provided properly formatted versions of their books (and in fact, some already do) because presentation really makes a difference for readability. Plain text and unformatted HTML are unfortunately quite drab to read.

[1] The Finnish edition of the first book in the Illuminatus trilogy, for the curious among you.

[2] Approximately 380 pages compared to 340.