Friday, February 23, 2024

Creating tagged PDFs with CapyPDF now sort of possible

There are many open source PDF generators available. Unfortunately they all have some limitations when it comes to generating tagged PDFs

  • Cairo does not support tagged PDFs at all
  • LaTeX can create tagged PDFs, but obviously only out of LaTeX documents
  • Scribus does not support tagged PDF
  • LibreOffice does support tagged PDF generation, but its code is not available as a standalone library, it can only be used via LibreOffice
  • HexaPDF does not support tagged PDFs (though they seem to be on the roadmap), but it is implemented in Ruby so most projects can't use it
  • Firefox can print pages to PDF, but the result is not tagged, even though the PDF document structure model is almost exactly the same as in HTML

There does not seem to be a library that provides for all of this with a "plain C" API that can be used to easily generated tagged PDFs using almost any programming language.

There still isn't, but at least now CapyPDF can generate simple tagged PDF documents. A sample document can be downloaded via this link. Here is a picture showing the document structure in Acrobat Pro.

It should also work in things like screen readers and other accessibility tools, but I have not tested it.

None of this is exposed in the C API, because this has a fairly large API surface and I have not yet come up with a good way to represent it.