Monday, September 19, 2022

Diving deeper into custom PDF and epub generation

In a previous blog post I looked into converting a custom markup text format into "proper" PDF and epub documents. The format at the time was very simple and could not handle even basic things like italic text. That was fine to begin with, but as time went on it started to feel a bit unsatisfactory.

Ergo, here is a sample input document:

# Demonstration document

This document is a collection of sample paragraphs that demonstrate
the different features available, like /italic text/, *bold text* and
even |Small Caps text|. All of Unicode is supported: ", », “, ”.

The previous paragraph was not indented as it is the first one following a section title. This one is indented. Immediately after this paragraph the input document will have a scene break token. It is not printed, but will cause vertical white space to be added. The
paragraph following this one will also not be indented.

#s

A new scene has now started. To finish things off, here is a
standalone code block:

```code
#include<something.h>
/* Cool stuff here */
```

This is "Markdown-like" but specifically not Markdown because novel typesetting has requirements that can't easily be retrofit in Markdown. When processed this will yield the following output:

Links to generated documents: PDF, epub. The code can be found on Github.

A look in the code

An old saying goes that the natural data structure for any problem is an array, and if it is not, then you should change the problem so that it is. This turned out to be very much the case here. The document is an array of variants (section, paragraph, scene change etc.). Text is an array of words (split at whitespace) which gets processed into output, which is an array of formatted lines. Each line is an array of formatted words.
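
A minimal sketch of that layered structure, with type and field names invented for illustration rather than taken from the actual repository:

#include <string>
#include <variant>
#include <vector>

// One element of the document: a chapter heading, a body paragraph
// or a scene-change marker.
struct Section { std::string title; };
struct Paragraph { std::vector<std::string> words; }; // split at whitespace
struct SceneChange {};

using Element = std::variant<Section, Paragraph, SceneChange>;
using Document = std::vector<Element>;

// The output mirrors the same idea: a formatted chapter is an array of
// lines and each line is an array of already formatted words.
using FormattedWord = std::string;
using FormattedLine = std::vector<FormattedWord>;
using FormattedChapter = std::vector<FormattedLine>;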

For computing the global chapter justification and final PDF it turned out that we need to be able to render each word in its final formatted form, and also its hyphenated sub-forms, in isolation. This means that the elementary data structure is this:

struct EnrichedWord {
    std::string text;                          // the word itself, in UTF-8
    std::vector<HyphenPoint> hyphen_points;    // where the word may be broken
    std::vector<FormattingChange> format;      // style changes inside the word
    StyleStack start_style;                    // styles active at the start of the word
};

This is "pure data" and fully self-contained. The fields are obvious: text has the actual text in UTF-8. hyphen_points lists all points where the word can be hyphenated and how. For example if you split the word "monotonic" in the middle you'd need to add a hyphen to the output but if you split the hypothetical combination word "meta–avatar" in the middle you should not add a hyphen, because there is already an en-dash at the end. format contains all points within the word where styling changes (e.g. italic starts or ends). start_style is the only trickier one. It lists all styles (italic, bold, etc) that are "active" at the start of the word and the order in which they have been declared. Since formatting tags can't be nested, this is needed to compute and validate style changes within the word.

Given an array of these enriched words the code computes another array of all the possible points where the text stream can be split, both within and between words. These candidates are then fed to the splitting algorithm, whose output is yet another array: the split points that were actually chosen. With this the final output can be created fairly easily: each output line is the text between split points n and n+1.
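
A rough sketch of that last step, assuming the chosen split points are stored as indices into a flat array of word fragments (all names invented for illustration):

#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Each entry in fragments is a whole word or a hyphenated piece of one.
std::vector<std::string> build_lines(const std::vector<std::string> &fragments,
                                     const std::vector<std::size_t> &split_points) {
    std::vector<std::string> lines;
    // Each output line is the text between split points n and n+1.
    for(std::size_t n = 0; n + 1 < split_points.size(); ++n) {
        std::string line;
        for(std::size_t i = split_points[n]; i < split_points[n + 1]; ++i) {
            if(!line.empty())
                line += ' ';
            line += fragments[i];
        }
        lines.push_back(std::move(line));
    }
    return lines;
}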

The one major missing typographical feature is widow and orphan control. The code merely splits the page whenever it is full. Interestingly, it turns out that doing this properly uses the same algorithm as paragraph justification. The difference is that the penalty terms are things like "widow existence" and "adjacent page height imbalance".
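
Purely as an illustration of the idea, penalty terms for page breaking could look something like this; the names and weights are invented and not taken from any real implementation:

// Each candidate page break would be scored by summing penalties, and the
// breaker would pick the sequence of breaks with the lowest total, exactly
// like the line breaker does for paragraphs.
struct PageBreakCandidate {
    bool creates_widow;        // a lone line stranded at the top of a page
    bool creates_orphan;       // a lone line stranded at the bottom of a page
    double height_difference;  // imbalance between adjacent pages, in points
};

double page_break_penalty(const PageBreakCandidate &c) {
    double penalty = 0.0;
    if(c.creates_widow)
        penalty += 1000.0;
    if(c.creates_orphan)
        penalty += 800.0;
    // Quadratic cost so large imbalances are punished much harder.
    penalty += c.height_difference * c.height_difference;
    return penalty;
}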

But that, as they say, is another story. Which I have not written yet and might not do for a while because there are other fruit to fry.

Sunday, September 4, 2022

Questions to ask a prospective employer during a job interview

Question: Do developers in your organization have full admin rights on their own computer?

Rationale: While blocking admin rights might make sense for regular office workers, it is a massive hindrance for software developers. They do need admin access for many things and not giving it to them is a direct productivity hit. You might also note that Google does give all their developers root access to their own dev machines and see how they respond.

Question: Are developers free to choose and install the operating system on their development machines? If yes, can you do all administrative and bureaucratic tasks from "non-official" operating systems?

Rationale: Most software projects nowadays deal with Linux somehow and many people are thus more productive (and happier) if they can use a Linux desktop for their development. If the company mandates the use of an "IT-approved" Windows install where 50% of all CPU time is spent on virus scanners and the like, productivity takes a big hit. There are also some web services that either just don't work on Linux or are a massive pain to use if they do (the web UI of Outlook being a major guilty party here).

Question: How long does it take to run the CI for new merge requests?

Rationale: Anything under 10 minutes is good. Anything over 30 minutes is unacceptably slow. Too slow of a CI means that instead of submitting small, isolated commits people start aggregating many changes into a small number of mammoth commits because it is the only way to get things done. This causes the code quality to plummet.

Question: Suppose we find a simple error, like a typo in a log message. Could you explain the process one needs to follow to get that fixed and approximately how long it takes? Explicitly list out all the people in the organization that are needed to authorize said change.

Rationale: The answer to this should be very similar to the one above: make the commit, submit it for review, get an ack, merge. It should be done in minutes. Sometimes that is not the case. Maybe you are not allowed to work on issues that don't have an associated ticket or that are not pre-approved for the current sprint. Maybe you need to first create a ticket for the issue. Maybe you first need to get manager approval to create said ticket (You laugh, but these processes actually exist. No, I'm not kidding.). If their reply contains phrases like "you obtain approval from X", demand details: how do you apply for approval, who grants it, how long is it expected to take, what happens if your request is rejected, and so on. If the total time is measured in days, draw your own conclusions and act accordingly.

Question: Suppose that I need to purchase some low-cost device like a USB hub for development work. Could you explain the procedure needed to get that done? 

Rationale: The answer you want is either "go to a store of your choice, buy what you need and send the receipt to person X" or possibly "send a link to person X and it will be on your desk (or delivered home) within two days". Needing to get approval from your immediate manager is sort of ok, but needing to go any higher or sideways in the org chart is a red flag and so is needing to wait more than a few days regardless of the reason.

Question: Could you explain the exact steps needed to get the code built?

Rationale: The steps should be "do a git clone, run the build system in the standard way, compile, done". Having a script that you can run to set up the environment is also fine. Having a short wiki page with the instructions is tolerable. Having a long wiki page with the instructions is bad. "Try compiling and when issues arise ask on slack/teams/discord/water cooler/etc" is very bad.

Question: Can you build the code and run tests on the local machine as if it was a standard desktop application?

Rationale: For some reason corporations love creating massive build clusters and the like for their products (which is fine) and then making it impossible to build the code in isolation (which is not fine). Being able to build the project on your own machine is pretty much mandatory, because if you can't build locally then e.g. IDE autocompletion does not work since there is no working compile_commands.json.

This even applies to most embedded projects. A well designed embedded project can be configured so that most of the code can be built and tested with the host compiler and only the hardware-touching bits need cross compilation. This obviously does not cover all cases, such as writing very low level firmware that is mostly assembly. You have to use your own judgement here.

Question: Does this team or any of the related teams have a person who actively rejects proposals to improve the code base?

Rationale: A common dysfunction in existing organizations is to have a "big fish in a small pond" developer, one who has been working on said code for a long time but who has not looked at what has been happening in the software development landscape in general. They will typically hard reject all attempts to improve the code and related processes to match current best practices. They typically use phrases like "I don't think that would improve anything", "That can't work (no further reasoning given)" and the ever popular "but we already have [implementation X, usually terrible] and it works". In extreme cases, if their opinions are challenged they resort to personal attacks. Because said person is the only one who truly understands the code, management is unwilling to reprimand them out of fear that they might leave.

Out of all the questions in this list, this one is the most crucial. Having to work with such a person is a miserable experience and typically a major factor in employee churn. This is also the question that prospective employers are most likely to flat out lie to you about, because they know that if they admit to it, they can't hire anyone. If you are interviewing with a manager, they might not even know that they have such a person on their team. The only reliable way to find out is to talk with actual engineers after they have had several beers, which is hard to organize before getting hired.

Question: How many different IT support organizations do you have? Where are they physically located?

Rationale: The only really acceptable answer is "in the same country as your team" (there are exceptions, such as being the only employee in a given country working 100% remote). Any other answer means that support requests take forever to get done and are a direct and massive drain on your productivity. The reason for this inefficiency is that if you have your "own" support then you can communicate with each other like regular human beings. If they are physically separated then you become just another faceless person in a never-ending ticketing queue somewhere and things that should take 15 minutes take weeks (these people typically need to serve many organisations in different countries and are chronically overworked).

The situation is even worse if the IT support is moved to a physically distant location and worse still if it is bought as a service from a different corporation. A typical case is that a corporation in Europe or the USA outsources all their IT support to India or Bangladesh. This is problematic, not because people in said countries are not good at their jobs (I've never met them so I can't really say) but because these transfers are always done to minimize costs. Thus a core part of the organization's engineering productivity is tied to an organisation that is 5-10 time zones away, made the cheapest offer and over which you can't exert any organizational pressure should it be needed. This is not a recipe for success. If there is more than one such external company within the organization, failure is almost guaranteed.

Question: Suppose the team needs to start a new web service like a private Gitlab, add a new type of CI to the testing pipeline or something similar. Could you explain the exact steps needed to get it fully operational? Please list all the people who need to do work to make this happen (including just giving authorization), and time estimates for each individual step.

Rationale: This follows directly from the above. Any answer that has more than one manager and takes more than a day or two is a red flag.

Question: Please list all the ways you are monitoring your employees during work hours. For example state whether you have a mandatory web proxy that everyone must use and enumerate all pieces of security and tracking software you have installed on employees' computers. Do you require all employees to assign all work hours to projects? If yes, what granularity? If the granularity is less than 1 hour, does your work sheet contain an entry for "entering data into work hour enumeration sheet"?

Rationale: This one should be fairly obvious, but note that you are unlikely to get a straight answer.

Friday, September 2, 2022

Looking at LibreOffice's Windows installer

There has long been a desire to get rid of Cygwin as a build dependency of LibreOffice. In addition to building dependencies it is also used to create the Windows MSI installer. For reasons I don't remember any more I chose to look into replacing just that bit with some modern tooling. This is the tragedy that followed.

The first step to replacing something old is to determine what the old system does and how it works. Given that it is an installer one would expect it to use WiX, NSIS or maybe even some lesser known installer tool. But of course you already guessed that's not going to be the case. After sufficient amounts of digging you discover that the installer is invoked by this (1600+ line) Perl script. It imports 50+ other internal Perl modules. This is not going to be a fun day.

Eventually you stumble upon this file and uncover the nasty truth. The installer works by manually building a CAB file with makecab.exe and Perl. Or, at least, that is what I think it does. With Perl you can never be sure. It might even be dead code that someone forgot to delete. So I asked my LO acquaintances if that is how it actually works. The answer? "Nobody knows. It came from somewhere and was hooked to the build and nobody has touched it since."

When the going gets tough, the tough start compiling dependencies from scratch

In order to see if that is actually what is happening, we need to be able to run it and see what it does. For that we need to first compile LO. This is surprisingly simple: there is a script that does almost all of the gnarly bits needed to set up the environment. So then you just run it? Of course you don't.

When you start the build, things seem to work fine at first, but then one of the dependencies misdetects the build environment as mingw rather than cygwin and promptly fails to build. Web searching finds this email thread which says that the issue has been fixed. It has not.

I don't even have mingw installed on this machine.

It still detects the environment as mingw.

Then I uninstalled win-git and everything that could possibly be interpreted as mingw.

It still detects the environment as mingw.

Then I edited the master Makefile to pass an explicit environment flag to the subproject's configure invocation.

It still detects the environment as mingw.

Yes, I did delete all cached state I could think of between each step. It did not help.

I tried everything I could and eventually had to give up. I could not make LO compile on Windows. Back to the old drawing board.

When unzipping kills your machine

What I actually wanted to test was to build the source code, take the output directory and then pass that to the msicreator program that converts standalone directories to MSI installers using WiX. This is difficult to do if you can't even generate the output files in the first place. But there is a way to cheat.

We can take the existing LO installer, tell the Windows installer to just extract the files into a standalone directory and then use that as the input data. So I did the extraction and it crashed Windows hard. It brought up the crash screen and the bottom half of it had garbled graphics. Which is an impressive achievement for what is effectively the same operation as unpacking a zip file. Then it did it again. The third time I had my phone camera ready to take a picture, but that time it succeeded. Obviously.

After fixing a few bugs in msicreator and the like I could eventually build my own LibreOffice installer. I don't know if it is functionally equivalent but at least most of the work should be there. So, assuming that you can do the equivalent of DESTDIR=/foo make install with LO on Windows, you should be able to replace the unmaintained multi-thousand line Perlthulhu with msicreator and something like this:

{
    "upgrade_guid": "SOME-GUID-HERE",
    "version": "7.4.0",
    "product_name": "LibreOffice",
    "manufacturer": "LibreOffice something something",
    "name": "LibreOffice",
    "name_base": "libreoffice",
    "comments": "This is a comment.",
    "installdir": "printerdir",
    "license_file": "license.rtf",
    "need_msvcrt": true,
    "parts": [
        {
         "id": "MainProgram",
         "title": "The Program",
         "description": "The main program",
         "absent": "disallow",
         "staged_dir": "destdir"
        }
    ]
}

In practice it probably won't be this simple, because it never is.

Wednesday, August 24, 2022

Random things on designing a text format for books

In previous blog posts there was some talk about implementing a simple system that generates books (both PDF and ebook) from plain text input files. The main question for that is what the input format should be. Currently there are basically two established formats: LaTeX and Markdown. The former is especially good if the book has a lot of figures, cross references, indexes and all that. The latter is commonly used in most modern web systems, but it is more suited to specifying text in the "web page" style as opposed to "split aesthetically over pages".

The obvious solution when faced with this issue is to design your own file format that fits your needs perfectly. I did not do that, but I did think about the issue and do some research. This is the outcome of that. It is not a finished product; think of it instead as a grouping of unrelated observations and design requirements that you'd need to deal with when creating such a file format.

Goals

The file format should be usable for creating traditional novels like Lord of the Rings and The Hitch-Hiker's Guide to the Galaxy. The output will be a "single flow" of text separated by chapter headings and the like. There needs to be support for different paragraph styles for printing things like poems, telegraphs or computer printouts in different fonts and indents.

The input files must be UTF-8 plain text in a format that works natively with revision control systems.

Supporting pictures and illustrations should be possible.

You need to be able to create both press-ready PDFs and epubs directly from the input files without having to reformat the text with something like Scribus.

Don't have styling information inside the input files. That should be defined elsewhere; for example, when generating the epub, all styling should come from a CSS file that the end user writes by hand. The input text should be as "syntax-only" as possible.

Writing HTML or XML style tags is right out.

Specifying formatting inline

Both LaTeX and Markdown specify their style information inline. This seems like a reasonable approach that people are used to. In fact I have seen Word documents written by professional proofreaders that do not use Word's styling at all but instead type Markdown-style formatting tokens inside Word documents.

The main difference between LaTeX and Markdown is that the former is verbose whereas the latter is, well, not as simple as you'd expect. The most common emphasis style is italic. The LaTeX way of doing it is to write \emph{italic} whereas Markdown requires it to be written as _italic_. This is one of the main annoying things about the LaTeX format: you need to keep typing that emph (or create a macro for it) and it takes up a lot of space in your text. Having a shorthand for common operations, like Markdown does, seems like a usability win. Typing \longcommand for functionality that is only used rarely is ok.

These formatting tokens have their own set of problems. Here's something you can try on any web site that supports Markdown formatting (I used Github here). Write the same word twice: once so that the middle of the word has styling tokens and a second time so that the entire word is emphasized.

Then click on the preview tab.

Whoops. One of the two has unexpected formatting. Some web editors even get this wrong. If you use a graphical preview widget and emphasize the middle of a word using ctrl-i, the editor shows it as emphasized, but if you click on the text tab and then return to the preview tab it shows the underscore characters rather than italic text.

This might lead you to think that underscores within words (i.e. sequences of characters separated by whitespace) are ignored. So let's test that.

This is a simple test with three different cases. Let's look at the outcome.

Our hypothesis turned out to be incorrect. There are special cases where underscores within words are considered styling information and others where they are treated as plain characters. It is left as an exercise to the reader to 1) determine when that is and 2) figure out what the rendering difference between the two lower lines in the image is (there is one).

Fair enough. Surely the same applies for other formatting as well.

This turns into:

Nope, behavioural difference again! But what if you use double underscores?

Okay, looking good. Surely that will work:

Nope. So what can we learn from this? Basically that in-band signaling is error prone and you typically should avoid it because it will come back and bite you in the ass. Since the file format is UTF-8 we could sacrifice some characters outside basic ASCII for this use but then you get into the problem of needing to type them out with unusual keyboard combinations (or configure your editor to write them out when typing ctrl-i or ctrl-b).

Small caps

Small caps letters are often used in high quality typography. In LaTeX you get them with the \textsc command. Markdown does not support small caps at all. There are several discussion threads that talk about adding support for it to Markdown. To save you time, here is a condensed version of pretty much all of them:

"Can we add small caps to Markdown?"

"No, you don't need them."

"Yes I do."

"No you don't."

And so on. Small caps might be important enough to warrant their own formatting character as discussed in the previous chapter, and the implementation would have the same issues.

The dialogue clash

There are many different ways of laying out dialogue. Quotation marks are the most common but starting a paragraph with a dash is also used (in Finnish at least, this might be culture dependent). Like so:

– Use the Force Luke, said Obi-Wan.

Thus it would seem useful to format all paragraphs that start with a dash character as dialogue. In this example the actual formatting used an en-dash. If you want to go the Markdown way this is problematic, because it specifies that lines starting with dashes turn into bulleted lists:

  • Use the Force Luke, said Obi-Wan.

These are both useful things and you'd probably want to support both, even though the latter is not very common in story-oriented books. Which one should use the starting dash? I don't have a good answer.

Saturday, August 13, 2022

Making decisions without all the information is tricky, a case study

In a recent blog post, Michael Catanzaro wrote about choosing proper configurations for your build, especially the buildtype attribute. As noted in the text, Meson's build type setup is not the greatest in the world, so I figured I'd write about why that is, what a better design would look like and why we don't use that (and probably won't for the foreseeable future).

The concept of build types was copied almost directly from CMake. The main thing that they do is to set compiler flags like -g and -O2. Quite early in the development process of Meson I planned on adding top level options for debug info and optimization but the actual implementation for those was done much later. I copied the build types and flags almost directly except for build types RelWithDebInfo and Release. Having these two as separate build types did not make sense to me, because you always need debug info for releases. If you don't have it, you can't debug crash dumps coming from users. Thus I renamed them to debugoptimized and release.

So far so good, except there was one major piece of information I was missing. The word "debug" has two different meanings. On most platforms it means "debug info" but on Windows (or, specifically, with the MSVC toolchain) "debug" means a special build type that uses the "debug runtime", which has additional runtime checks that are useful during development. More info can be found e.g. here. This made the word "debug" doubly problematic. Not only do people on Windows want it to refer to the debug runtime, but some (though not all) people on Linux think that "debugoptimized" means that it should only be used during development. Originally that was not the case; it was supposed to mean "build a binary with the default optimizations and debug info". What I originally wanted was that distros would build packages with buildtype set to debugoptimized, as opposed to living in the roaring 90s, passing a random collection of flags via CFLAGS and hoping for the best.

How should it have been done?

With the benefit of hindsight a better design is fairly easy to see. Given that Meson already has toggles for all the individual bits, buildtype should describe the "intent", that is, what the end result will be used for. Its possible values should include the following:

  • development
  • releaseperf (maximize performance)
  • releasesize (minimize size)

It might also contain the following:

  • distro (when building distro packages)
  • fuzzing

Note that the word "debug" does not appear. This is intentional: all the words are chosen so that they are unambiguous, and if they were not, they would need to be changed. The value of this option would be orthogonal to the other flags. For example you might want to have a build type that minimizes build size but still uses -O2, because sometimes it produces smaller code than -Os or -Oz. Suppose you have two implementations of some algorithm: one that has maximal performance and another that yields less code. With this setup you could select between the two based on what the end result will be used for, rather than trying to guess it from optimization flags. (Some of this you can already do, but due to issues listed in Michael's blog it is not as simple.)
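
To make that concrete, here is a sketch of how code could select an implementation based on the declared intent. The BUILD_INTENT_RELEASESIZE macro is invented for this example; it is not something Meson or any other build system defines today:

#include <cstddef>
#include <cstdint>

#if defined(BUILD_INTENT_RELEASESIZE)
// Compact implementation: smallest code, adequate speed.
std::uint32_t checksum(const std::uint8_t *data, std::size_t len) {
    std::uint32_t sum = 0;
    for(std::size_t i = 0; i < len; ++i)
        sum = sum * 31 + data[i];
    return sum;
}
#else
// Performance implementation: same result, but processes four bytes per
// loop iteration to give the optimizer more to work with.
std::uint32_t checksum(const std::uint8_t *data, std::size_t len) {
    std::uint32_t sum = 0;
    std::size_t i = 0;
    for(; i + 4 <= len; i += 4) {
        sum = sum * 31 + data[i];
        sum = sum * 31 + data[i + 1];
        sum = sum * 31 + data[i + 2];
        sum = sum * 31 + data[i + 3];
    }
    for(; i < len; ++i)
        sum = sum * 31 + data[i];
    return sum;
}
#endif

The point is that the decision is driven by the declared goal rather than by sniffing -O flags.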

Can we switch to this model?

This is very difficult due to backwards compatibility. There are a ton of existing projects out there that depend on the way things are currently set up and breaking them would lead to a ton of unhappy users. If Meson had only a few dozen users I would have done this change already rather than writing this blog post.

Sunday, July 31, 2022

Implementing a "mini-LaTeX" in ~2000 lines of code

A preliminary note

The previous blog post on this subject got posted to hackernews. The comments were roughly like the following (contains exaggeration, but only a little):

The author claims to obtain superior quality compared to LaTeX but after reading the first three sentences I can say that there is nothing like that in the text so I just stopped there. Worst! Blog! Post! Ever!

So to be absolutely clear I'm not claiming to "beat" LaTeX in any sort of way (nor did I make such a claim in the original post). The quality is probably not better and the code has only a fraction of the functionality of LaTeX. Missing features include things like images, tables, cross references and even fairly basic functionality like widow and orphan control, italic text or any sort of customizability. This is just an experiment I'm doing for my own amusement.

What does it do then?

It can take a simple plain text version of The War of the Worlds, parse it into chapters, lay the text out in justified paragraphs and write the whole thing out into a PDF file. The output looks like this:

Even though the implementation is quite simple, it does have some typographical niceties. All new chapters begin on a right-hand page, the text is hyphenated, there are page numbers (but not on an empty page immediately preceding a new chapter) and the first paragraph of each chapter is not indented. The curious among you can examine the actual PDF yourselves. Just be prepared that there are known bugs in it.

Thus we can reasonably say that the code does contain an implementation of a very limited and basic form of LaTeX. The code repository has approximately 3000 total lines of C++ code but if you remove the GUI application and other helper code the core implementation only has around 2000 lines. Most of the auxiliary "heavy lifting" code is handled by Pango and Cairo.

Performance

The input text file for War of the Worlds is about 332 kB in size and the final PDF contains 221 pages. The program generates the output in 7 seconds on a Ryzen 7 3700 using only one core. This problem is fairly easily parallelizable so if one were to use all 16 cores at the same time the whole operation would take less than a second. I did not do exact measurements but the processing speed seems to be within the same order of magnitude as plain LaTeX.

The really surprising thing was that according to Massif the peak memory consumption was 5 MB. I had not tried to save memory when coding and just made copies of strings and other objects without a care in the world and still managed to almost fit the entire workload in the 4 MB L2 cache of the processor. Goes to show that premature optimization really is the root of all evil (or wasted effort at least).

Most CPU cycles are spent inside Pango. This is not due to any perf problems in Pango, but because this algorithm has an atypical workload. It keeps asking Pango to shape and measure short text segments that are almost but not entirely identical. For each line that does get rendered, Pango has to process ~10 similar blocks of text. The code caches the results so it should only ask for the size of any individual string once, but this is still the bottleneck. On the other hand, since you can process a fairly hefty book in 10 seconds or so, it is arguable whether further optimizations are even necessary.
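
A minimal sketch of that kind of cache; the measurement function is a fake stand-in for the real Pango shaping call and all names are invented:

#include <string>
#include <unordered_map>

// Stand-in for the real measurement: the actual code would ask Pango to
// shape the text and return its rendered width.
double measure_string_width(const std::string &text) {
    return 6.5 * text.size(); // deliberately fake, just keeps the sketch runnable
}

class WidthCache {
public:
    double width_of(const std::string &text) {
        auto it = cache.find(text);
        if(it != cache.end())
            return it->second;              // already measured once, reuse it
        const double w = measure_string_width(text);
        cache.emplace(text, w);
        return w;
    }

private:
    std::unordered_map<std::string, double> cache;
};

Since the same short fragments get measured over and over, even this trivial memoization keeps the number of actual shaping calls close to the number of distinct strings.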

The future

I don't have any idea what I'm going to do with this code, if anything. One blue sky idea that came to mind was that it would be kind of cool to have a modern, fully color managed version of LaTeX that goes from markdown to a final PDF and ebook. This is not really feasible with the current approach since Cairo can only produce RGB files. There has been talk of adding full color space support to Cairo but nothing has happened on that front in 10 years or so.

Cairo is not really on its own in this predicament. Creating PDF files that are suitable for "commercial grade" printing using only open source is surprisingly difficult. For example LibreOffice does output text in the proper grayscale colorspace but silently converts all grayscale images (even 1-bit ones) to RGB. The only software that seems to get everything right is Scribus.

Thursday, July 28, 2022

That time when I accidentally social engineered myself to a film set

I spent last week in Toronto at the Cpp North conference. It was so much fun just to hang around with people after such a long pause. My talk was about porting large code bases from one build system to another, using LibreOffice as an example. The talk should eventually show up on Youtube, but there is no precise schedule for that yet.

After the conference ended I had a few spare days to do touristy stuff which was also fun. On the last day when I was packing my stuff I noticed that there was a film crew just outside the conference hotel clearly shooting something. The area did not seem to be closed off so obviously I went in to take a closer look. It was past 9 pm and I only had my phone camera so all the pictures below are a bit murky. On the other hand you can clearly see just how much lighting power you need to shoot high quality video material.

Another thing you pick up quite quickly is just how much stuff a film set requires. Here is an assorted selection of props and costumes that took up a major fraction of a side street.

After hanging around for a while a person belonging to the crew noticed me and we started chatting. Once it became known that I was from Finland we had the legally mandated Canada–Finland hockey discussion. Then he told me all kinds of things about the set, which was filming a new series for Apple TV. He also told me the name of the series and what sort of scene they were going to shoot next.

This is where things got interesting. In the couple of hours I spent looking at the set, dozens of people walked by and asked the security people what they were filming. They got one of two answers. The first one was a name for a series that was different from the one I had been told. The second answer was "this is a new series with a largely unknown cast". This answer sounded very specific and very rehearsed. Based on some research I did later, the series name I was given was probably the correct one. It's a bit strange that they had a fake name for the production, but there you go (it is possible that the other name given to people was the episode name rather than the series name, but this is just speculation).

Regardless of the name, the series takes place in New York so they had staged the street to look like a typical Manhattan street. This included a fair amount of detail that you might think would be done cheaply with CGI instead.

They had NY street signs and trash cans (not shown in the image, but the street name was "John st"). The things at the bottom marked with a red arrow are fake garbage bags. They even had a fake US style ATM, marked with the other red arrow. The water truck was used to make the street wet, probably because the scene takes place after rain. They actually re-wet the street multiple times to maintain consistency. The only noticeable issue was that all the cars on set had Ontario plates. Maybe they are counting on that not showing up on screen.

However, the most important things you need to stage New York are actually smoke generators. Lots and lots of smoke generators. There were several big ones on the street, the garbage cans had them, and even the aforementioned ATM had its very own smoke machine. I don't really know if that is relevant to the plot or whether ATMs in New York run on steam.

As I stood on the street watching people go about their business, I eventually noticed something strange. They started gathering more and more film equipment around me. There were a bunch of cameras and lights, and I could clearly hear people planning the lighting et al for the shoot. Eventually I realized that they were building the "control center" (whatever its actual name is) of the set around me. This is what it looked like from the outside.

I stood next to one of the pillars shown in the middle of the picture, next to a thing that looked suspiciously like a director's chair. I just leaned against the wall perfectly still while remaining calm and passive, and nobody paid any attention to me at all. I could observe the crew going about their business, browse their monitors (which, sadly, did not show anything interesting) and even see their uncannily realistic-looking baby prop up close. I was probably there for around 30 minutes until someone finally asked me if I was part of the crew and then kindly asked me to leave, which I did.

This was probably for the better, as soon afterwards they started filming actual scenes and I could get a much better look from the opposite side of the street. Watching the operation made it immediately obvious why shooting professional video is so expensive. The crew consisted of around 40-50 people who all seemed to work a full day well past midnight on a Friday. In the three hours I spent observing the operation they managed to film two scenes, each one lasting around 15 seconds.

All in all, if you ever get the chance to observe a film crew in operation I highly recommend it. It is a better "making of" experience than any professionally produced featurette or documentary. The main reason being that it is about the actual making of, instead of millionaires lying through their teeth about how all the other millionaires on set were so professional and super awesome and how the whole experience was the best thing ever, because that is what they are contractually obligated to say. In person you get to see actually interesting things like how a camera operator puts on a weird Doctor Octopus carrying harness or how a team of tech workers builds and balances a 10+ meter long dolly track from scratch in under ten minutes. It is quite impressive.