Friday, June 4, 2021

Formatting an entire book with LibreOffice, what's it like?

I have created full books using both LaTeX and Scribus. I have never done it with LibreOffice, though. The closest I've ever come was watching people write their masters' theses in Word, failing miserably and swearing profusely. To find out what it's really like, I chose to typeset an entire book from scratch using nothing else but LibreOffice.

The unfortunate thing about LibreOffice (which it inherits from MS Word compatibility) is that there is a "correct" way to use it which is the exact opposite way of how people instinctively want to use it. The proper way is to use styles for everything as opposed to applying fonts, sizes, spacing et al by hand. In practice every time I am given a Word document I check how it has been formatted. I have never seen a single document in the wild that would have gotten this right. Even something as simple as chapter indentation is almost always done with spaces.

Getting the text

Rather than using lorem ipsum or writing a book from scratch, I grabbed an existing book from Project Gutenberg. A random choice landed upon Gulliver's Travels, which turned out to be fortunate as it has several interesting and uncommon typographical choices. The source data of Project Gutenberg is UTF-8 text. All formatting has to be added afterwards. Here's what the first page ended up looking like after a few evenings' worth of work.

The source text file is line wrapped to 80 characters and chapters are separated by two or more newlines. This does not really work with LO, so the first step is to preprocess the text with a Python script so that every chapter of text is on its own (very long) line and then the text can be imported to LO. After import each sections must be assigned a proper style. The simplest approach is to select all text, apply the Text Body style and then manually seek all chapter headings and set them to Heading 1. That takes care of the formatting needs of ~95% of the text (though the remaining 5% take 10x more work).

Page layout

The original book's dimensions are not provided, so I took a random softcover book [1] from my shelf, measured it with a ruler and replicated the page settings. The book is set in the traditional style where everything up to the actual text has page numbers in roman numerals whereas the actual text uses arabic numerals. Setting it up was straightforward, though I had to create six different page styles to get the desired result.

Text layout challenges

Gulliver's Travels is a bit unusual in that every chapter begins with a small introductory text explaining what will happen in the chapter. Apparently readers in the 1720s were not afraid of spoilers.

In the Project Gutenberg source text these sections (and many others) were written in all capital letters. However it is likely that in the original book they were instead written in small caps. Fixing this would require retyping the text to be in lower case. Fortunately LO has an option in the format menu to convert text to lower case, which makes this operation fairly painless.

Another unusual thing is that the book does not have a regular table of contents, instead it duplicates these small text chapters.

LO has a builtin TOC generator but it can't handle this (I think) so the layout has to be recreated by hand with tables and manual cross reference fields. Controlling page breaks and the like is difficult and I could not make it work perfectly. The above picture has two bugs, the illustration cross reference should be in roman numerals (as it is on a preface page) but LO insists on formatting it using an arabic number. The last chapter on the left page gets split up and the page number is on the left page, whereas it should be on the right (bottom aligned). Even better would be if the chapter heading and text could be defined to always stick together and not be split over pages. There is a setting for this, but it does not seem to work inside tables

Pictures

There are several illustrations in the book and scans of the pictures were also provided by Project Gutenberg. Adding them in the document revealed that figure handling is clearly LO's weakest point (again, presumably because it inherits its model from Word). It seems that in this model each figure has an anchor point in the text and you can align the figure relative to that but the image must be on the same page as the anchor. Were it to go on the next page, LO adds a page break so that the two go to the same page. This leaves a potentially large empty space at the end of the previous page, which looks just plain weird.

In contrast this is something that LaTeX does exceptionally well with its floating figures. Basically it tries to add the figure on the current page and if it will not fit, it puts it on the next page. There does not seem to be a way to get this behaviour in LO. Or at least I could not find one, googling did not help and neither did asking for help on the lazyweb. Playing with images was also the only time I managed to crash LO, so be careful; save early, save often.

The only reasonably working solution seems to be page aligned images. This works but means that if text is edited, figures do not move along with the changes and get disconnected from their source locations. Thus image aligning must be the very last thing to be done. This approach also does not work if you are using master documents. Books with many images should probably be typeset with Scribus instead, especially if proper color management is required.

In conclusion

If you are very disciplined and use LO exactly as it should be used, the end result is actually really nice. You can, for example, change the font used for text in only one place (the base style) and the entire document gets fully reformatted, reflown and repaged in less than a second. This allows you to do invasive layout tests easily, such as finding out how much more space IBM Plex Serif takes when compared to Nimbus Roman [2]. The downside is that any cut corners will cause broken output that you can't find without manually inspecting the entire document.

IKEA effect notwithstanding laying out the text in proper form makes it a lot more enticing. The process of shaping raw text to form really makes it come alive in a sense. It would be nice if Project Gutenberg (or anyone else, really) provided properly formatted versions of their books (and in fact, some already are) because presentation really makes a difference for readability. Plain text and unformatted HTML is unfortunately quite drab to read.

[1] The Finnish edition of the first book in the Illuminatus trilogy, for the curious among you.

[2] Approximately 380 pages compared to 340.

5 comments:

  1. In the bad old days I used LO for documentation of a REST API. I used proper diligence applying styles. It was a table heavy document, so I created a table style. Everything was fine, until it wasn't: after a few years the table style was gone! All the tables still looked like before, but I couldn't create new tables with that style. That was the last time I created big doc with LO.

    ReplyDelete
  2. "It would be nice if Project Gutenberg (or anyone else, really) provided properly formatted versions of their books (and in fact, some already are) because presentation really makes a difference for readability."

    Have you taken a look at Standard Ebooks? Their focus is on high quality formatting. I'd be interested to hear your thoughts on their quality.

    I think they're lovely but that's compared to the low bar of free versions I downloaded from Amazon years ago.

    https://standardebooks.org/

    ReplyDelete
    Replies
    1. Apparently everything has already been invented. I did not know about this. Seems nice, but it is focused on ereader output whereas I was thinking more along the lines of PDF. The web version looks nice but reading an entire book on a web browser is not something most people want to do.

      Delete
    2. I use it for my Kobo devices but you can click on "Read on one page" and then print to file and it gives you a nice PDF.

      Not sure if the formatting is the way you are seeking but it gets the job done for a quick PDF.

      Delete
  3. I've tried writing reports in LO Writer (and Word) one or two times, and my experience pretty much matches exactly what you're describing. Moreover, the Style system is somewhat confusing, e.g. it took me a bit to understand how "Default Paragraph Style" differs from "Text Body" and which one I should use (and it doesn't help that "Default Paragraph Style" is selected by default). I was also hit by figure positioning issues in LO.

    I wonder if there's a way to design a document preparation system which is both easy to use like Writer, Word, Google Docs, and obvious and easy to use *correctly*.

    Also, would it be possible for you to upload the final document somewhere? I'm curious about how you managed some of those styles.

    ReplyDelete