Earlier in this blog we looked at how to generate a justified block of text, which is nowadays usually done with the Knuth-Plass algorithm. Sadly this is not, by itself, enough to create a finished product. Processing all input text with the algorithm produces one very long column of text. This works fine if you end result is a scroll (of the papyrus variety), but people tend to prefer their text it book format. Thus you need a way to split that text column into pages.
The simplest solution is to decide that a page has some N number of lines and page breaks happen at exact these places. This works (and is commonly used) but it has certain drawbacks. From a typographical point of view, there are at least three things that should be avoided:
- Orphan lines, a paragraph at the bottom of a page that has only one line of text.
- Widow lines, a paragraph that starts a new page and has only one line of text.
- Spread imbalance, where the two pages on a spread have different number of lines on them (when both are "full pages")
The two problems are exactly the same and can be solved in the same way. I have implemented this in the Chapterizer book generator program. The dynamic programming algorithm for both is so similar that they could, in theory, use shared code. I chose not to do it because sometimes it is just so much simpler to not be DRY. They are so similar, in fact, that a search space optimization that I had to do in the paragraph justification case was also needed for pagination even though the data set size is typically much smaller in pagination. The optimization implementation turned out to be exactly the same for both cases.
The program now creates optimal page splits and in addition prints a text log of all pages with widows, orphans and imbalances it could not optimize away.
Has this been done before?
Probably. Everything has already been invented multiple times. I don't know for sure, so instead I'm going to post my potentially incorrect answer here on the Internet for other people to fact check.
I am not aware of any book generation program doing global page optimization. LibreOffice and Word definitely do not do it. As far as I know LaTeX processes pages one by one and once a page is generated it never returns to it, probably because computers in the early 80s did not have enough memory to hold the entire document in RAM at the same time. I have never used Indesign so I don't know what it does.
Books created earlier than about the mid eighties were typeset by hand and they seem to try to avoid orphans and widows at the cost of some page spreads having more lines than other spreads. Modern books seem to optimize for filling the page fully rather than avoiding widows and orphans. Since most books nowadays are created with Indesign (I think), it would seem to imply that Indesign does not do optimal page splitting or at least it is not enabled by default.
When implementing the page splitter the reasons for not doing global page splitting became clear. Computing both paragraph and page optimization is too slow for a "real time" GUI app. This sort of an optimization really only makes sense for a classical "LaTeX-style" book where there is one main text flow. Layout programs like Scribus and Indesign support richer magazine style layouts. This sort of a page splitter does not work in the general case, so in order to use it they would need to have a separate simpler document format.
But perhaps the biggest issue is that if different pages may have different number of lines on each page, it leads to problems in the page's visual appearance. Anything written at the bottom of the page (like page numbers) need to be re-placed based on how many lines of text actually ended up on the page. Doing this in code in a batch system is easy, doing the same with GUI widgets in "real time" is hard.