Wednesday, August 24, 2022

Random things on designing a text format for books

In previous blog posts there was some talk about implementing a simple system that generates books (both PDF and ebook) from plain text input files. The main question for that is what the input format should be. Currently there are basically two established formats: LaTeX and Markdown. The former is especially good if the book has a lot of figures, cross references, indexes and all that. The latter is commonly used in most modern web systems but it is more suitable to specifying text in the "web page" style as opposed to "split aesthetically over pages".

The obvious solution when faced with this issue is to design your own file format that fits your needs perfectly. I did not do that, but instead I did think about the issue and did some research and thinking. This is the outcome of that. It is not a finished product, you can think of instead as a grouping of unrelated things and design requirements that you'd need to deal with when creating such a file format.

Goals

The file format should be used to create traditional novel like Lord of the Rings and The Hitch-Hiker's Guide to the Galaxy. The output will be a "single flow" of text separated by chapter headings and the like. There needs to be support for different paragraph styles for printing things like poems, telegraphs or computer printouts in different fonts and indents.

The input files must be UTF-8 plain text in a format that works natively with revision control systems.

Supporting pictures and illustrations should be possible.

You need to be able to create both press-ready PDFs and epubs directly from the input files without having to reformat the text with something like Scribus.

Don't have styling information inside the input files. Those should be defined elsewhere, for example when generating the epub, all styling should come from a CSS file that the end user writes by hand. The input text should be as "syntax-only" as possible.

Writing HTML or XML style tags is right out.

Specifying formatting inline

Both LaTeX and Markdown specify their style information inline. This seems like a reasonable approach that people are used to. In fact I have seen Word documents written by professional proofreaders that do not use Word's styling at all but instead type Markdown-style formatting tokens inside Word documents.

The main difference between LaTeX and Markdown is that the former is verbose whereas the latter is, well, not as simple as you'd expect. The most common emphasis style is italic. The LaTeX way of doing it is to write \emph{italic} whereas Markdown requires it to be written as _italic_. This is one of the main annoying things about the LaTeX format, you need to keep typing that emph (or create a macro for it) and it takes up a lot of space in your text. Having a shorthand for common operations, like Markdown does, seems like an usability win. Typing \longcommand for functionality that is only used rarely is ok.

These formatting tokens have their own set of problems. Here's something you can try in any web site that supports Markdown formatting (I used Github here). Write the same word twice: one time so that the middle of the word has styling tokens and a second time so that the entire word is emphasized.

Then click on the preview tab.

Whoops. One of the two of these has unexpected formatting. Some web editors even get this wrong. If you use a graphical preview widget and emphasize the middle of the word using ctrl-i, the editor shows it as emphasized but if you click on the text tab and then return to the preview tab it shows the underscore characters rather than italic text.

This might lead you to think that underscores within words (i.e. sequences of characters separated by whitespace) are ignored. So let's test that.

This is a simple test with three different cases. Let's look at the outcome.

Our hypothesis turned out incorrect. There are special cases where underscores within words are considered styling information and others where they should be treated as characters. It is left as an exercise to the reader to 1) determine when that is and 2) what is the rendering difference between the two lower lines in the image (there is one).

Fair enough. Surely the same applies for other formatting as well.

This turns into:

Nope, behavioural difference again! But what if you use double underscores?

Okay, looking good. Surely that will work:

Nope. So what can we learn from this? Basically that in-band signaling is error prone and you typically should avoid it because it will come back and bite you in the ass. Since the file format is UTF-8 we could sacrifice some characters outside basic ASCII for this use but then you get into the problem of needing to type them out with unusual keyboard combinations (or configure your editor to write them out when typing ctrl-i or ctrl-b).

Small caps

Small caps letters are often used in high quality typography. In LaTeX you get them with the \textsc command. Markdown does not support small caps at all. There are several discussion threads that talk about adding support for it to Markdown. To save you time, here is a condensed version of pretty much all of them:

"Can we add small caps to Markdown?"

"No, you don't need them."

"Yes I do."

"No you don't."

And so on. Small caps might be important enough to warrant its own formatting character as discussed in the previous chapter and the implementation would have the same issues.

The dialogue clash

There are many different ways of laying out dialogue. Quotation marks are the most common but starting a paragraph with a dash is also used (in Finnish at least, this might be culture dependent). Like so:

– Use the Force Luke, said Obi-Wan.

Thus it would seem useful to format all paragraphs that start with a dash character as dialogue. In this example the actual formatting used an en-dash. If you want to go the Markdown way this is problematic, because it specifies that lines starting with dashes turn into bulleted lists:

  • Use the Force Luke, said Obi-Wan.
These are both useful things and you'd probably want to support both, even though the latter is not very common in story-oriented books. Which one should use the starting dash? I don't have a good answer.

Saturday, August 13, 2022

Making decision without all the information is tricky, a case study

In a recent blog post, Michal Catanzaro wrote about choosing proper configurations for your build, especially the buildtype attribute. As noted in the text, Meson's build type setup is not the greatest in the world., so I figured I'd write why that is, what would a better design look like and why we don't use that (and probably won't for the foreseeable future).

The concept of build types was copied almost directly from CMake. The main thing that they do is to set compiler flags like -g and -O2. Quite early in the development process of Meson I planned on adding top level options for debug info and optimization but the actual implementation for those was done much later. I copied the build types and flags almost directly except for build types RelWithDebInfo and Release. Having these two as separate build types did not make sense to me, because you always need debug info for releases. If you don't have it, you can't debug crash dumps coming from users. Thus I renamed them to debugoptimized and release.

So far so good, except there was one major piece of information I was missing. The word "debug" has two different meaning. On most platforms it means "debug info" but on Windows (or, specifically, with the MSVC toolchain) "debug" means a special build type that uses the "debug runtime" that has additional runtime checks that are useful during development. More info can be found e.g. here. This made the word "debug" doubly problematic. Not only do people on Windows want it to refer to the debug runtime but then some (but not all) people on Linux think that "debugoptimized" means that it should only be used during development. Originally that was not the case, it was supposed to mean "build a binary with the default optimizations and debug info". What I originally wanted was that distros would build packages with buildtype set to debugoptimized as opposed to living in the roaring 90s, passing a random collection of flags via CFLAGS and hoping for the best.

How it should have been done?

With the benefit of hindsight a better design is fairly easy to see. Given that Meson already has toggles for all individual bits, buildtype should describe the "intent", that is, what the end result will be used for. Its possible values should have the following:

  • development
  • releaseperf (maximize performance)
  • releasesize (minimize size)
It might also contain the following:

  • distro (when building distro packages)
  • fuzzing
Note that the word "debug" does not appear. This is intentional, all words are chosen so that they are unambiguous. If they are not, then they would need to be changed. The value of this option would be orthogonal to other flags. For example you might want to have a build type that minimizes build size but still uses -O2, because sometimes it produces smaller code than -Os or -Oz. Suppose you have two implementations of some algorithm: one that has maximal performance and another that yields less code. With this setup you could select between the two based on what the end result will be used rather than trying to guess it from optimization flags. (Some of this you can already do, but due to issues listed in Michael's blog it is not as simple.)

Can we switch to this model?

This is very difficult due to backwards compatibility. There are a ton of existing projects out there that depend on the way things are currently set up and breaking them would lead to a ton of unhappy users. If Meson had only a few dozen users I would have done this change already rather than writing this blog post.