Sunday, May 25, 2025

Iterators and adaptors

Previously it was mentioned that Python and C++ do iteration quite differently. Python has "statefull" objects that have a .next() method that returns a new object or throws a StopIteration exception. Incidentally Rust does exactly the same thing, except that it uses an optional type rather than an exception. C++ does not work like this, instead it has a concept of a "start" and "end" of a sequence and the machinery keeps incrementing the start until it reaches the end.

At the end of last post it was said that it is difficult to integrate an object of the former type with C++'s native language facilities, at least without macros.

Now we'll look how to integrate an object of the former type with C++'s native language facilities without any macros.

In fact, it only took fewer than 30 lines of code to do. The caveat being that it is probably fairly easy to break it. But it works for me.

Here it is being used.

There is also a second, similar helper class that takes ownership of the object being iterated over.

Monday, May 19, 2025

Optimizing page splits in books

Earlier in this blog we looked at how to generate a justified block of text, which is nowadays usually done with the Knuth-Plass algorithm. Sadly this is not, by itself, enough to create a finished product. Processing all input text with the algorithm produces one very long column of text. This works fine if you end result is a scroll (of the papyrus variety), but people tend to prefer their text it book format. Thus you need a way to split that text column into pages.

The simplest solution is to decide that a page has some N number of lines and page breaks happen at exact these places. This works (and is commonly used) but it has certain drawbacks. From a typographical point of view, there are at least three things that should be avoided:

Orphan lines, a paragraph at the bottom of a page that has only one line of text.
Widow lines, a paragraph that starts a new page and has only one line of text.
Spread imbalance, where the two pages on a spread have different number of lines on them (when both are "full pages")

Determining "globally optimal" page splits for a given text is not a simple problem, because every pagination choice affects every pagination choice that comes after it. If you stare at the problem long and hard enough, eventually you realize something.

The two problems are exactly the same and can be solved in the same way. I have implemented this in the Chapterizer book generator program. The dynamic programming algorithm for both is so similar that they could, in theory, use shared code. I chose not to do it because sometimes it is just so much simpler to not be DRY. They are so similar, in fact, that a search space optimization that I had to do in the paragraph justification case was also needed for pagination even though the data set size is typically much smaller in pagination. The optimization implementation turned out to be exactly the same for both cases.

The program now creates optimal page splits and in addition prints a text log of all pages with widows, orphans and imbalances it could not optimize away.

Has this been done before?

Probably. Everything has already been invented multiple times. I don't know for sure, so instead I'm going to post my potentially incorrect answer here on the Internet for other people to fact check.

I am not aware of any book generation program doing global page optimization. LibreOffice and Word definitely do not do it. As far as I know LaTeX processes pages one by one and once a page is generated it never returns to it, probably because computers in the early 80s did not have enough memory to hold the entire document in RAM at the same time. I have never used Indesign so I don't know what it does.

Books created earlier than about the mid eighties were typeset by hand and they seem to try to avoid orphans and widows at the cost of some page spreads having more lines than other spreads. Modern books seem to optimize for filling the page fully rather than avoiding widows and orphans. Since most books nowadays are created with Indesign (I think), it would seem to imply that Indesign does not do optimal page splitting or at least it is not enabled by default.

When implementing the page splitter the reasons for not doing global page splitting became clear. Computing both paragraph and page optimization is too slow for a "real time" GUI app. This sort of an optimization really only makes sense for a classical "LaTeX-style" book where there is one main text flow. Layout programs like Scribus and Indesign support richer magazine style layouts. This sort of a page splitter does not work in the general case, so in order to use it they would need to have a separate simpler document format.

But perhaps the biggest issue is that if different pages may have different number of lines on each page, it leads to problems in the page's visual appearance. Anything written at the bottom of the page (like page numbers) need to be re-placed based on how many lines of text actually ended up on the page. Doing this in code in a batch system is easy, doing the same with GUI widgets in "real time" is hard.

Saturday, May 3, 2025

Writing your own C++ standard library part 2

This blog post talked about the "self written C++ standard library" I wrote for the fun of it (code here).

The post got linked by Hackernews and Reddit. As is usual the majority of comments did not talk about the actual content but instead were focused on two tangential things. The first one being "this is not a full implementation of the C++ standard library as specified by the ISO standard, therefore the author is an idiot". I am, in actual fact, an idiot, but not due to project scope but because I assumed people on the Internet to have elementary reading comprehension skills. To make things clear: no, this is not an implementation of the ISO standard library. At no point was such a thing claimed. There is little point in writing one of those, there are several high quality implementations available. "Standard library" in this context means "a collection of low level functions and types that would be needed by most applications".

The second discussion was around the fact that calling the C++ standard library "STL" was both wrong and proof that the person saying that does not understand the C++ language. This was followed by several "I am a C++ standard library implementer and everyone I know calls it the STL". Things deteriorated from there.

The complexity question

Existing container implementations are complex by necessity. They need to handle things like types that can not be noexcept moved or copied. The amount of rollback code needed explodes very quickly and needs to be processed every time the header is included (because of templates). A reasonable point against writing your own containers is that eventually you need to support all the same complexity as existing ones because people will insert "bad" types into your container and complain when things fail. Thus you need to have all the same nasty, complex and brittle code as an STL implementation, only lacking decades of hardening to shake out all the bugs.

That is true, but the beauty of having zero existing users is that you can do something like this:

This is basically a concept that requires the type given to be noexcept-movable (and some other things that could arguably be removed). Now, instead of allowing any type under the sun, all containers require all types they get instantiated with to be WellBehaved. This means that the library does not have to care at all about types behaving badly because trying to use those results in a compiler error. A massive amount of complexity is thus eliminated in a single stroke.

There are of course cases when you need to deal with types that are not well behaved. If you can tolerate a memory allocation per object, the simple solution is to use unique_ptr. OTOH if you have types that can't be reliably moved and which are so performance critical that you can't afford memory allocations, then you are probably going to write your own container code anyway. If you are in the position that you have badly behaved types that you can't update, can't tolerate an allocation and can't write your own containers, then that is your problem.

Iterating things the native way

One of the most common things I do with strings in Python is to split them on whitespace. In Python this is simple as there is only one string type, but in native code things are more complicated. What should the return type of mystring.split() be?

vector<string>
vector<string_view>
A coroutine handle
The method should be a full template so you can customize it to output any custom string type (as every C++ code base has at least three of them)
Something ranges something something

There is probably not a single correct answer. All of the options have different tradeoffs. Thus I implemented this in two ways. The first returns a vector<string> as you would get in Python. The second one was a fully general, lazy and allocation-free method that uses the most unloved of language features: the void pointer.

So given that you have a callback function like this:

the split function becomes:

This is as fully general as you can get without needing to muck about with templates, coroutines or monads (I don't actually know what monads are, I'm just counting on the fact that mentioning them makes you look cool). The cost is that the end user needs to write a small adapter lambda to use it.

Iterating things the Python way

The Python iteration protocol is surprisingly simple. The object being iterated needs to provide a .next() method that either returns the next value or throws a StopIteration exception. Obviously exceptions can't be used in native code because they are too slow, but the same effect can be achieved by returning an optional<T> instead. Here's a unit test for an implementation of Python's range function.

Sadly there is not a simple way to integrate this to native loop constructs without macros and even then it is a bit ugly.

Current status

The repo has basic functionality for strings (regular and enforced UTF-8), regexes and basic containers.

Building the entire project has 8 individual compilation and linking steps and takes 0.8 seconds when using a single core on this laptop computer. Thus compiling a single source file with optimizations enabled takes ~ 0.1 seconds. Which is nice.

The only cheat is that pystd uses ctre for regexes and it is precompiled.