Tuesday, October 20, 2020

Cargo-style dependency management for C, C++ and other languages with Meson

My previous blog post about modern C++ got a surprising amount of feedback. Some people even reimplemented the program in other languages, including one in Go, two different ones in Rust and even this slightly brain-bending C++ reimplementation as a declarative-style pipeline. It also got talked about on Reddit and Hacker News. Two major comments kept popping up:

  1. There are potential bugs if the program is ever extended to support non-ASCII text
  2. It is hard to use external dependencies in C++ (and by extension C)
Let's solve both of these problems at the same time. There are many "lite" solutions for Unicode-aware text processing, but we're going to use the 10 000 kilogram hammer: International Components for Unicode. The actual code changes are not that big; interested parties can look up the details in this GitHub repo. The thing that is relevant for this blog post is the build and dependency management. The Meson build definition is quite short:

project('cppunicode', 'cpp',
        default_options: ['cpp_std=c++17',
                          'default_library=static'])
             
icu_dep = dependency('icu-i18n')
thread_dep = dependency('threads')
executable('cppunicode', 'cppunicode.cpp',
           dependencies: [icu_dep, thread_dep])
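
For completeness, configuring and building the project is the standard Meson two-step (the build directory name here is just an example):

$ meson setup builddir
$ meson compile -C builddir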

The threads dependency is for the multithreaded parts (see the end of this post). I developed this on Linux and used the convenient system-provided ICU. Windows and macOS do not provide a system ICU, so on those platforms we need to build it from scratch. This is achieved by running the following command in your project's source root:

$ meson wrap install icu
Installed icu branch 67.1 revision 1

This contacts Meson's WrapDB server and downloads build definition files for ICU. That is all you need to do. The build files do not need any changes. When you start building the project, Meson will automatically download and build the dependency. Here is a screenshot of the download step:

Here it is building in Visual Studio:

And here is the final built result running on macOS:

One notable downside of this approach is that WrapDB does not have all that many packages yet. However, I have been told that with the next Meson release (due in a few weeks) and some upstream patches, it is possible to build the entire GTK widget toolkit as a subproject, even on Windows.

If anyone wants to contribute to the project, contributions are most welcome. You can, for example, convert existing projects and submit them to WrapDB or become a reviewer. The Meson web site has the relevant documentation.

Appendix: making it parallel

Several people pointed out that while the original program works fine, it only uses one thread. This may be a bottleneck, and supposedly "in C++ it is hard to execute work in parallel". This is again one of those things that has gotten a lot better in the last few years. The "correct" solution would be to use the parallel version of transform_reduce. Unfortunately, most parallel STL implementations are still a work in progress, so we can't use them in multiplatform code. We can, however, roll our own fairly easily, without needing to create or lock a single mutex by hand. The code has the actual details, but the (slightly edited) gist of it is this:

for(const auto &e:
    std::filesystem::recursive_directory_iterator(".")) {
    futures.emplace_back(std::async(std::launch::async,
                                    count_word_files,
                                    e));
    if(futures.size() > num_threads) {
        pop_future(word_counts, futures);
    }
}
while(!futures.empty()) {
    pop_future(word_counts, futures);
}

Here the count_word_files function calculates the number of words in a single file and the pop_future function joins individual results into the final result. By using a share-nothing architecture, pure functions and value types, all business logic code can be written as if it were single-threaded, and the details of thread and mutex management can be left to library code. Haskell fans would be proud (or possibly horrified, not really sure).
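
For illustration, here is a minimal sketch of what pop_future could look like. This is not the repo's exact code; it assumes the futures are kept in a std::vector and that count_word_files returns its per-file counts as a std::unordered_map<std::string, int>:

#include <future>
#include <string>
#include <unordered_map>
#include <vector>

using WordMap = std::unordered_map<std::string, int>;

void pop_future(WordMap &word_counts,
                std::vector<std::future<WordMap>> &futures) {
    // Wait for the oldest task to finish and merge its counts into the total.
    WordMap file_counts = futures.front().get();
    for(const auto &[word, count]: file_counts) {
        word_counts[word] += count;
    }
    futures.erase(futures.begin());
}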

Thursday, October 15, 2020

Does C++ still deserve the bad rap it has had for so long?

Traditionally C++ has been seen by many (and you know who you are) as just plain bad: the code is unreadably verbose, error messages are undecipherable, it's unsafe, compilation takes forever and so on. In fact, making fun of C++ is a popular pastime for some. All of this was certainly true in the 90s and even as recently as 10 years ago. But is it still the case? What would be a good way to determine this?

A practical experiment

Let's write a fairly simple program that solves a sort-of-real-worldish problem and see what comes out of it. The problem we want to solve is the following:
Find all files with the extension .txt recursively in the subdirectories of the current directory, count the number of times each word appears in them and print the ten most common words and the number of times they are used.

We're not going to go for extreme performance or handling all possible corner cases, instead going for a reasonable implementation. The full code can be found by following this link.

The first thing we need is a way to detect files with a given extension and words within text. This calls for regular expressions:

const std::regex fname_regex("\\.txt$", std::regex_constants::icase);

const std::regex word_regex("([a-z]{2,})", std::regex_constants::icase);

This might be a bit verbose, but it is quite readable. It only works for ASCII text, but for the purposes of this experiment that is sufficient. To calculate the number of times each word is seen, we need a hash table:

std::unordered_map<std::string, int> word_counts;

Now we need to iterate over all files in the current directory subtree.

for(const auto &e: std::filesystem::recursive_directory_iterator("."))

Skip everything that is not a .txt file.

if(!e.is_regular_file()) {
    continue;
}
if(!std::regex_search(e.path().c_str(), fname_regex)) {
    continue;
}

Process the file line by line:

std::ifstream current_file(e.path());
for (std::string line; std::getline(current_file, line); )

For each line we need to find all words that match the regular expression.

std::sregex_iterator word_end;
for(auto it = std::sregex_iterator(line.begin(), line.end(), word_regex); it != word_end; ++it)

This is a bit verbose, but it takes care of all the state tracking needed to run the regex multiple times on the same input string. Doing the same by hand is fairly tedious and error prone. Now that we have the matches, we need to convert them to standalone lowercase words.

std::string word{it->str()};
for(auto &c : word) {
    c = std::tolower(c);
}

Lowercasing strings is a bit cumbersome, granted, but this is at least fairly readable. Then on to incrementing the word count.

++word_counts[word];

That's all that is needed to count the words. Finding the most common ones can't be done directly in the hash map, so we need to convert the data to an array. That needs some setup code.

struct WordCount {
    std::string word;
    int count;
};

std::vector<WordCount> word_array;
word_array.reserve(word_counts.size());

Since we know how many entries will be in the array, we can reserve enough space for it in advance. Then we need to iterate over all entries and add them to the array.

for(const auto &[word, count]: word_counts) {
    word_array.emplace_back(WordCount{word, count});
}

This uses the new structured bindings feature, which is a lot more readable than the old way of dealing with iterator objects or pairs with their firsts and their seconds and all that horror. Fortunately no more.

The simple way of getting the 10 most used entries is to sort the array and then grab the 10 first elements. Let's do something slightly swaggier instead [0]. We'll do a partial sort and discard all entries after 10. For that we need a descending sorting criterion as a lambda.

auto count_order_desc = [](const WordCount &w1, const WordCount &w2) { return w2.count < w1.count; };

Using it is straightforward.

const auto final_size = std::min(10, (int)word_array.size());
std::partial_sort(word_array.begin(), word_array.begin() + final_size, word_array.end(), count_order_desc);
word_array.erase(word_array.begin() + final_size, word_array.end());

All that remains is to print the final result.

for(const auto& word_info: word_array) {
    std::cout << word_info.count << " " << word_info.word << "\n";
}

Safety

As safety and security are important features of current software development, let's examine how safe this program is. There are two main safety concerns: thread safety and memory safety. As this program has only one thread, it can be formally proven to be thread safe. Memory safety in turn divides into two main things to note: use-after-frees (or dangling references) and out-of-bounds array accesses.

In this particular program there are no dangling references. All data is in value types and the only references are iterators. They are all scoped so that they never live past their usages. Most are not even stored into named variables but instead passed directly as function arguments (see especially the call to partial_sort). Thus they can not be accessed after they have expired. There is no manual resource management of any kind. All resources are freed automatically. Running the code under Valgrind yields zero errors and zero memory leaks.

There is one place in the code where an out-of-bounds access is possible: when creating the iterator pointing to 10 elements past the beginning of the array [2]. If you were to forget to check that the array has at least 10 elements, that would be an immediate error. Fortunately most standard libraries have a way to enable checks for this. Here's what GCC reports at runtime if you get this wrong:

Error: attempt to advance a dereferenceable (start-of-sequence) iterator 100000 steps, which falls outside its valid range.
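
As an aside (this is my assumption about the setup rather than something stated above): with GCC's libstdc++ this kind of checking is enabled by defining the debug mode macro at compile time, roughly like this:

g++ -D_GLIBCXX_DEBUG -o cppexample -O2 -g -Wall -std=c++17 cppexample.cpp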

Not really a perfect safety record, but overall it's fairly good and certainly a lot better than implementing the same by manually indexing arrays.

Problems

Compiling the program takes several seconds, which is not great. This is mostly due to the regex header, which is known to be heavyweight. Some of the compiler messages encountered during development were needlessly verbose. The actual mistakes were quite silly, though, such as passing arguments to functions in the wrong order. This could definitely be better. Concepts in C++20 are reported to fix most of this, but there is little practical real-world experience with them yet.

Unique features

This code compiles, runs and works [1] on all major platforms using only the default toolchain provided by the OS vendor, without any external dependencies or downloads. This is something that no other programming language can provide as of today. The closest you can get is plain C, but its standard library does not have the necessary functionality. Compiling the code is also simple and can be done with a single command:

g++ -o cppexample -O2 -g -Wall -std=c++17 cppexample.cpp
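
For comparison, the rough equivalents with the other major toolchains would be something like the following (these are not from the original post, just the standard flags):

clang++ -o cppexample -O2 -g -Wall -std=c++17 cppexample.cpp
cl /std:c++17 /EHsc /O2 cppexample.cpp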

Conclusions

If we look at the end result with objective eyes, it seems fairly nice. The entire implementation takes fewer than 60 lines of code. There's nothing immediately terrible that pops up. What is happening at each step is fairly understandable and readable. There's nothing that one could immediately hate and despise for being "awful", as is tradition.

This is not to say you can't still hate C++. If you want to, go right ahead. There are certainly reasons for it. But if you choose to do that, make sure you hate it for the real, actual problems of today, rather than things that were broken decades ago but have been fixed since.

[0] There is an even more efficient solution using a priority queue that does not need the intermediate array. Implementing it is left as an exercise to the reader.
[1] Reader beware, I have only proven this code to be portable, not tried it.
[2] The implementation in [0] does not have this problem as no iterators are ever modified by hand.

Tuesday, October 6, 2020

Is your project a unique snowflake or do you just think that it is?

If there is a national sport for programmers, it is reinventing existing things from scratch. This seems to be especially common inside corporations that tend to roll their own rather than, for example, using an existing open source library. If you try to ask around why this has been done, you typically get some variation of this:
We can't use off-the-shelf solutions because we have unique needs and requirements that no-one else has.

Now, there genuinely are cases where this is true, but they mostly occur in very specialised circumstances, such as when you are Google-scale, do space engineering or something similar. There might also be legal or regulatory reasons that you must own all code in a product. Most projects are not like this. In fact, almost always someone (typically hundreds of someones) has had the exact same problem and solved it. Yet people keep reinventing the wheel for no real purpose. If you ever find yourself in a debate on why people should use existing solutions rather than roll their own, here are some talking points to consider.

How have you established that your needs are actually unique?

Very often when someone says "the needs for <project X> are unique", what they are actually saying is "this needs a feature I have never personally encountered before". This is not due to malice, it is just how human beings seem to behave. If people have a rough idea of how to solve a problem by coding it from scratch but no knowledge of an existing solution, they tend to go with the former rather than spend time researching the problem. I don't know why. This is especially true if there is external pressure to "get something going fast".

I know this, because I have actually done it myself. Ages ago I was working on some Java code and needed to communicate with a different process. I was starting to write a custom socket protocol when my coworker asked why I was not simply using regular HTTP with Jetty. The answer, to the surprise of absolutely no-one, was that I had never heard of it. That single sentence saved days of implementation work and probably the same amount in ongoing maintenance and debugging.

We need more performance than existing solutions can provide

This splits nicely into two parts. In the first one you actually do need the performance. This is rare, but it happens. Even so, you might consider optimizing the existing solution first. If you are successful, submit the changes upstream. This way they benefit everyone and reduce the ongoing maintenance burden to roughly zero.

The other case is that all the performance you think you need, you don't actually need. This is surprisingly common. It turns out that modern computers and libraries are so fast that even "reasonably ok" is performant enough most of the time for most problems [0]. As an example, I was working on a product that chose to roll its own functionality X rather than using the standard solution in Linux, because it needed to do "thousands of operations per second". These operations were only done in response to human interaction, so there was no way it could ever generate more than 10 or so events per second. But still. Must have performance, so multiple months of hand-rolling it was.

To add insult to injury, I did some measurements and it turned out that the existing solution could do one thousand operations per second without even needing to go to C. The Python bindings were fast enough.

We need the ability to change the implementation at will

One major downside of using an existing open source solution is that you have to give up some control. Changes must be submitted to people outside your immediate control. They might say no and you can't override them with office politics. In-house code does not have this problem.

This control comes with a significant downside. What fairly often happens is that some developers write and maintain the new functionality and everything is rosy. Then time happens and the original coders switch teams and jobs, become managers and so on. They take the knowledge required to change the system with them. Unless your company has a very strong culture of transferring this knowledge, and most companies most definitely don't, the code will bitrot, fossilize and become a massive clump of legacy that must not be touched because there is no documentation, no tests and nobody really understands it any more.

So even if the corporation in principle has the possibility to change the code at will, it lacks the actual ability to do so.

Sometimes people just plain lie

There are two main reasons that explain pretty much everything that people do, say and recommend:
  1. The reason they give to others
  2. The real reason
Even for "purely technical" decisions like these, the two can be drastically different. Suppose some developer has done several similar "boring" projects. Then when the time comes to choose the approach for the new project. If they choose an existing system, they know that the next project will be just like the preceding five. If, on the other hand, some new tech is chosen, then the entire task becomes more meaningful and exciting. In this case the stated reason is technical fitness, but the real reason is frustration. Another way of stating this is that when asked "should we use library X or write our own from scratch" the answer the coder would want to give is "I don't want to work on that project". 

This is not something people will say out loud, because it easily gets you labeled a "prima donna", a "work avoider" or an "elitist". Many people will also have trouble saying it because it stems from a "bad" emotional response. All of this leads to shame, and people are willing to go to great lengths to avoid feeling ashamed.

The same reasoning applies to other "impure" emotions such as infatuation ("OMG this new tech is so cool"), disgust ("using tech X means I have to work with Bob from department X"), infidelity ("either you give me this or I'm accepting the job offer from a competitor"), megalomania ("creating this new framework will make me a superstar celebrity coder") and the ever-popular self-disappointment ("I got into programming because I wanted to create new things like an architect, but ended up as a code janitor").