Thursday, October 15, 2020

Does C++ still deserve the bad rap it has had for so long?

Traditionally C++ has been seen by many (and you know who you are) as just plain bad: the code is unreadably verbose, error messages are undecipherable, it's unsafe, compilation takes forever and so on. In fact making fun of C++ is even a fun pastime for some. All of this was certainly true in the 90s and even as recent as 10 years ago. But is it still the case? What would be a good way to determine this?

A practical experiment

Let's write a fairly simple program that solves a sort-of-real-worldish problem and see what comes out of it. The problem we want to solve is the following:
Find all files with the extension .txt recursively in the subdirectories of the current directory, count the number of times each word appears in them and print the ten most common words and the number of times they are used.

We're not going to go for extreme performance or handling all possible corner cases, instead going for a reasonable implementation. The full code can be found by following this link.

The first thing we need is a way to detect files with a given extension and words within text. This calls for regular expressions:

const std::regex fname_regex("\\.txt$", std::regex_constants::icase);

const std::regex word_regex("([a-z]{2,})", std::regex_constants::icase);

This might a bit verbose, but quite readable. This only works for ASCII text, but for the purposes of this experiment it is sufficient. To calculate the number of times each word is seen, we need a hash table:

std::unordered_map<std::string, int> word_counts;

Now we need to iterate over all files in the current directory subtree.

for(const auto &e: std::filesystem::recursive_directory_iterator("."))

Skip everything that is not a .txt file.

if(!e.is_regular_file()) {
    continue;
}
if(!std::regex_search(e.path().c_str(), fname_regex)) {
    continue;
}

Process the file line by line:

std::ifstream current_file(e.path());
for (std::string line; std::getline(current_file, line); )

For each line we need to find all words that match the regular expression.

std::sregex_iterator word_end;
for(auto it = std::sregex_iterator(line.begin(), line.end(), word_regex); it != word_end; ++it)

This is a bit verbose, but it takes care of all the state tracking needed to run the regex multiple times on the same input string. Doing the same by hand is fairly tedious and error prone. Now that we have the matches, we need to convert them to standalone lowercase words.

std::string word{it->str()};
for(auto &c : word) {
    c = std::tolower(c);
}

Lowercasing strings is a bit cumbersome, granted, but this is at least fairly readable. Then on to incrementing the word count.

++word_counts[word];

That's all that is needed to count the words. This can't be done directly in the hash map so we need to convert the data to an an array. It needs some setup code.

struct WordCount {
    std::string word;
    int count;
};

std::vector<WordCount> word_array;
word_array.reserve(word_counts.size());

Since we know how many entries will be in the array, we can reserve enough space for it in advance. Then we need to iterate over all entries and add them to the array.

for(const auto &[word, count]: word_counts) {
    word_array.emplace_back(WordCount{word, count});
}

This uses the new structured bindings feature, which is a lot more readable than the old way of dealing with iterator objects or pairs with their firsts and their seconds and all that horror. Fortunately no more.

The simple way of getting the 10 most used entries is to sort the array and then grab the 10 first elements. Let's do something slightly swaggier instead [0]. We'll do a partial sort and discard all entries after 10. For that we need a descending sorting criterion as a lambda.

auto count_order_desc = [](const WordCount &w1, const WordCount &w2) { return w2.count < w1.count; };

Using it is straightforward.

const auto final_size = std::min(10, (int)word_array.size());
std::partial_sort(word_array.begin(), word_array.begin() + final_size, word_array.end(), count_order_desc);
word_array.erase(word_array.begin() + final_size, word_array.end());

All that remains is to print the final result.

for(const auto& word_info: word_array) {
    std::cout << word_info.count << " " << word_info.word << "\n";
};

Safety

As safety and security are important features in current sofware development, let's examine how safe this program is. There are two main safety points: thread safety and memory safety. As this program has only one thread, it can be formally proven to be thread safe. Memory safety also divides into two main things to note: use after frees (or dangling references) and out of bound array accesses.

In this particular program there are no dangling references. All data is in value types and the only references are iterators. They are all scoped so that they never live past their usages. Most are not even stored into named variables but instead passed directly as function arguments (see especially the call to partial_sort). Thus they can not be accessed after they have expired. There is no manual resource management of any kind. All resources are freed automatically. Running the code under Valgrind yields zero errors and zero memory leaks.

There is one place in the code where an out-of-bounds access is possible: when creating the iterator pointing to 10 elements past the beginning of the array [2]. If you were to forget to check that the array has at least 10 elements, that would be an immediate error. Fortunately most standard libraries have a way to enable checks for this. Here's what GCC reports at runtime if you get this wrong:

Error: attempt to advance a dereferenceable (start-of-sequence) iterator 100000 steps, which falls outside its valid range.

Not really a perfect safety record, but overall it's fairly good and certainly a lot better than implementing the same by manually indexing arrays.

Problems

Compiling the program takes several seconds, which is not great. This is mostly due to the regex header, which is known to be heavyweight. Some of the compiler messages encountered during development were needlessly verbose. The actual errors were quite silly, though, such as passing arguments to functions in the wrong order. This could definitely be better. It is reported that the use of concepts in C++20 will fix most of this issue, but there are no practical real-world usage reports yet.

Unique features

This code compiles, runs and works [1] on all major platforms using only the default toolchain provided by the OS vendor without any external dependencies or downloads. This is something that no other programming language can provide as of date. The closest you can get is plain C, but its standard library does not have the necessary functionality. Compiling the code is also simple and can be done with a single command:

g++ -o cppexample -O2 -g -Wall -std=c++17 cppexample.cpp

Conclusions

If we try to look at the end result with objective eyes it would seem to be fairly nice. The entire implementation takes fewer than 60 lines of code. There's nothing immediately terrible that pops up. What is happening at each step is fairly understandable and readable. There's nothing that one could immediately hate and despise for being "awful", as is tradition.

This is not to say you could still not hate C++. If you want to, go right ahead. There are certainly reasons for it. But if you choose to do that, make sure you hate it for real actual reasons of today, rather than things that were broken decades ago but have been fixed since.

[0] There is an even more efficient solution using a priority queue that does not need the intermediate array. Implementing it is left as an exercise to the reader.
[1] Reader beware, I have only proven this code to be portable, not tried it.
[2] The implementation in [0] does not have this problem as no iterators are ever modified by hand.

13 comments:

  1. Have grown up with modern C++11. I've learned that it allows to go low-level with plain pointers and performance and high-level with decent comfort and the compiler is the guide of the programmer. That's what a lot people neglect deliberately. Aside from "valgind" the modern compiler option of "-fsanitize=[address|...]" in GCC and LLVM has proven as complete game changer (for me) regarding reliability of C and C++. I wonder how "modules" will change programming C++.

    ReplyDelete
  2. Your downcasing does not work, see https://en.cppreference.com/w/cpp/string/byte/tolower

    You need...

    c = static_cast(std::tolower(static_cast(c)));

    ReplyDelete
  3. Yes, you CAN write elegant, modern, safe C++.

    The problem remains that C++ has been out there for decades, so you can too easily mix new, safe, cool C++, with old, ugly, and unsafe C++, and the worst part is that the end result won't be even that much uglier if you look at the source code. Moreover, you're expecting programmers that used C++ for decades, to forget old programming habits and write code only using the new idioms.

    IMHO C++ is beyond fixing, too huge, too easy to write unsafe code, too many bad habits accumulated over the years.

    That's the reason we have new safer languages like Rust. From my experience (I'm not a Rust expert, but I made some simple CLI tools with it, so I'm not a novice either), you CAN make unsafe Rust code, but the source code looks way uglier than the safe one, so you instinctively write better code.

    ReplyDelete
    Replies
    1. I would say that thread and memory safety are not even the worst part of C++. Sure, it doesn't hold your hands and you need to think about what you are doing but it is possible to write safe code in C++.

      Heck, you can make memory leaks and memory access-related crashes even in a language like Python or C# if you don't think about what you are doing and only blindly rely on the compiler and garbage collector to handle your messes for you.

      The biggest problem with C++ is, IMO, its insane complexity. And new things are getting constantly piled on, with little to no cleanup of the old, because backward compatibility is needed.

      I am a long time (20+ years now) C++ programmer and, seriously, there are things in C++11, C++17 and later that I don't really wrap my head around. How many people understand all the intricacies of how to write code that handles exceptions correctly? Or the various type deduction rules, especially when dealing with templates?

      Oh and templates - some developers think that template metaprogramming (C++ templates are Turing complete!) is the best thing since the sliced bread because you get code that the compiler will optimize into very tight and compact code (unlike with classes which are a runtime construct).

      The problem? Compiling such code takes ages (templates literally move the computation into compile time), any error messages are an undecipherable gobledygook (it is getting slightly better, though), especially when things like parameter packs and recursion are used (very common) and trying to understand what such code does often requires superhuman effort and expert knowledge of the various idiosyncracies of C++ accumulated over the years.

      It is really hard these days to reach for C++ to solve some problem if I don't have to. Other languages solve the same issues and allow the programmer to be actually productive instead of spending hours of time waiting for compilation to finish, fighting incompatible ABIs (especially kudos to Microsoft for the Debug/Release runtime BS but also libraries built with one C++ compiler version may or may not be compatible with another) and trying to debug insane template messes in things like Boost or even the standard library when there is any kind of problem.

      Those are the things that make people hate C++, the perceived lack of safety is mostly a red herring pushed by fans of Rust or various garbage collected languages but not something that actual C++ developers are banging their heads on.



      Delete
  4. Re [1]: on Windows there are build errors with std::regex_search and std::toupper

    ReplyDelete
    Replies
    1. Yeah. It seems in Windows filename are wchars even though the method is called c_str(). A better (more readable) solution would be to do something like:

      auto extension = e.path().extension()
      if(!(extension == ".txt" || extension == ".TXT"))

      Delete
  5. ...This issis that no other programming language can provide as of date. The closest you can get is plain C..
    Ever heard of Rust ( with the exception of therthe crate is not in std)

    ReplyDelete
    Replies
    1. Let me reply to you with the same level of politeness you showed me:

      Ever learned how to read? Specifically the part that says "using only the default toolchain provided by the OS vendor"

      Delete
  6. This code is fragile. Its safety and security depend on a subtle invariant: that the 'word_regex' does not match any non-ASCII character. If requirements changed so the regex could match a non-ASCII character, then one of the 'c' chars could have a negative value other than -1. https://en.cppreference.com/w/cpp/string/byte/tolower says "If the value of ch is not representable as unsigned char and does not equal EOF, the behavior is undefined", so executing std::tolower would trigger undefined behavior, i.e. a safety and security bug.

    This fragility is not at all obvious in the code, and it is easy to imagine someone making such a change in response to changed requirements.

    ReplyDelete
    Replies
    1. If you were to really change the code to support non-ASCII then you'd need to change the string type to something that natively handles Unicode, preferably UTF-8. You'd also need to make decision about encodings, handling of invalid characters and so on. At that point you would not (and probably could not) use tolower any more but the manipulation methods of that type (or library). Storing non-ASCII text into char arrays and expecting it to work is just plain wrong, and further is a mistake you can easily make in any programming language.

      Delete
    2. It is easy to change `word_regex` to match a non-ASCII character without intentionally "changing the code to support non-ASCII". For example you could say "let's look for words starting with an ASCII letter followed by non-whitespace" and write
      [a-z][^ \t]+
      Boom, now you have a security problem.

      Delete
  7. !std::regex_search(e.path().c_str(), fname_regex)

    why not use e.path().extension()!=".txt"?

    ReplyDelete
  8. I had a go adding some parallelism using the parallel STL and TBB. I made two implementations, count_words_stl() which parallelized the processing of files and count_words_tbb() which further parallelized the traversal of the file-tree. I was pretty pleased with the performance uplift for relatively little effort.
    I used RE2 instead of std::regex because the compile times began to annoy me but regex turned out not to be the bottleneck anyway.

    Here's the code:
    https://pastebin.com/uSN9tiSa

    ReplyDelete