Thursday, October 15, 2020

Does C++ still deserve the bad rap it has had for so long?

Traditionally C++ has been seen by many (and you know who you are) as just plain bad: the code is unreadably verbose, error messages are undecipherable, it's unsafe, compilation takes forever and so on. In fact making fun of C++ is even a fun pastime for some. All of this was certainly true in the 90s and even as recent as 10 years ago. But is it still the case? What would be a good way to determine this?

A practical experiment

Let's write a fairly simple program that solves a sort-of-real-worldish problem and see what comes out of it. The problem we want to solve is the following:

Find all files with the extension .txt recursively in the subdirectories of the current directory, count the number of times each word appears in them and print the ten most common words and the number of times they are used.

We're not going to go for extreme performance or handling all possible corner cases, instead going for a reasonable implementation. The full code can be found by following this link.

The first thing we need is a way to detect files with a given extension and words within text. This calls for regular expressions:

const std::regex fname_regex("\\.txt$", std::regex_constants::icase);

const std::regex word_regex("([a-z]{2,})", std::regex_constants::icase);

This might a bit verbose, but quite readable. This only works for ASCII text, but for the purposes of this experiment it is sufficient. To calculate the number of times each word is seen, we need a hash table:

std::unordered_map<std::string, int> word_counts;

Now we need to iterate over all files in the current directory subtree.

for(const auto &e: std::filesystem::recursive_directory_iterator("."))

Skip everything that is not a .txt file.

if(!e.is_regular_file()) {

continue;

}

if(!std::regex_search(e.path().c_str(), fname_regex)) {

continue;

}

Process the file line by line:

std::ifstream current_file(e.path());

for (std::string line; std::getline(current_file, line); )

For each line we need to find all words that match the regular expression.

std::sregex_iterator word_end;

for(auto it = std::sregex_iterator(line.begin(), line.end(), word_regex); it != word_end; ++it)

This is a bit verbose, but it takes care of all the state tracking needed to run the regex multiple times on the same input string. Doing the same by hand is fairly tedious and error prone. Now that we have the matches, we need to convert them to standalone lowercase words.

std::string word{it->str()};

for(auto &c : word) {

c = std::tolower(c);

}

Lowercasing strings is a bit cumbersome, granted, but this is at least fairly readable. Then on to incrementing the word count.

++word_counts[word];

That's all that is needed to count the words. This can't be done directly in the hash map so we need to convert the data to an an array. It needs some setup code.

struct WordCount {

std::string word;

int count;

};

std::vector<WordCount> word_array;

word_array.reserve(word_counts.size());

Since we know how many entries will be in the array, we can reserve enough space for it in advance. Then we need to iterate over all entries and add them to the array.

for(const auto &[word, count]: word_counts) {

word_array.emplace_back(WordCount{word, count});

}

This uses the new structured bindings feature, which is a lot more readable than the old way of dealing with iterator objects or pairs with their firsts and their seconds and all that horror. Fortunately no more.

The simple way of getting the 10 most used entries is to sort the array and then grab the 10 first elements. Let's do something slightly swaggier instead [0]. We'll do a partial sort and discard all entries after 10. For that we need a descending sorting criterion as a lambda.

auto count_order_desc = [](const WordCount &w1, const WordCount &w2) { return w2.count < w1.count; };

Using it is straightforward.

const auto final_size = std::min(10, (int)word_array.size());

std::partial_sort(word_array.begin(), word_array.begin() + final_size, word_array.end(), count_order_desc);

word_array.erase(word_array.begin() + final_size, word_array.end());

All that remains is to print the final result.

for(const auto& word_info: word_array) {

std::cout << word_info.count << " " << word_info.word << "\n";

};

Safety

As safety and security are important features in current sofware development, let's examine how safe this program is. There are two main safety points: thread safety and memory safety. As this program has only one thread, it can be formally proven to be thread safe. Memory safety also divides into two main things to note: use after frees (or dangling references) and out of bound array accesses.

In this particular program there are no dangling references. All data is in value types and the only references are iterators. They are all scoped so that they never live past their usages. Most are not even stored into named variables but instead passed directly as function arguments (see especially the call to partial_sort). Thus they can not be accessed after they have expired. There is no manual resource management of any kind. All resources are freed automatically. Running the code under Valgrind yields zero errors and zero memory leaks.

There is one place in the code where an out-of-bounds access is possible: when creating the iterator pointing to 10 elements past the beginning of the array [2]. If you were to forget to check that the array has at least 10 elements, that would be an immediate error. Fortunately most standard libraries have a way to enable checks for this. Here's what GCC reports at runtime if you get this wrong:

Error: attempt to advance a dereferenceable (start-of-sequence) iterator 100000 steps, which falls outside its valid range.

Not really a perfect safety record, but overall it's fairly good and certainly a lot better than implementing the same by manually indexing arrays.

Problems

Compiling the program takes several seconds, which is not great. This is mostly due to the regex header, which is known to be heavyweight. Some of the compiler messages encountered during development were needlessly verbose. The actual errors were quite silly, though, such as passing arguments to functions in the wrong order. This could definitely be better. It is reported that the use of concepts in C++20 will fix most of this issue, but there are no practical real-world usage reports yet.

Unique features

This code compiles, runs and works [1] on all major platforms using only the default toolchain provided by the OS vendor without any external dependencies or downloads. This is something that no other programming language can provide as of date. The closest you can get is plain C, but its standard library does not have the necessary functionality. Compiling the code is also simple and can be done with a single command:

g++ -o cppexample -O2 -g -Wall -std=c++17 cppexample.cpp

Conclusions

If we try to look at the end result with objective eyes it would seem to be fairly nice. The entire implementation takes fewer than 60 lines of code. There's nothing immediately terrible that pops up. What is happening at each step is fairly understandable and readable. There's nothing that one could immediately hate and despise for being "awful", as is tradition.

This is not to say you could still not hate C++. If you want to, go right ahead. There are certainly reasons for it. But if you choose to do that, make sure you hate it for real actual reasons of today, rather than things that were broken decades ago but have been fixed since.

[0] There is an even more efficient solution using a priority queue that does not need the intermediate array. Implementing it is left as an exercise to the reader.

[1] Reader beware, I have only proven this code to be portable, not tried it.

[2] The implementation in [0] does not have this problem as no iterators are ever modified by hand.

13 comments:

AnonymousOctober 15, 2020 at 5:03 PM
Have grown up with modern C++11. I've learned that it allows to go low-level with plain pointers and performance and high-level with decent comfort and the compiler is the guide of the programmer. That's what a lot people neglect deliberately. Aside from "valgind" the modern compiler option of "-fsanitize=[address|...]" in GCC and LLVM has proven as complete game changer (for me) regarding reliability of C and C++. I wonder how "modules" will change programming C++.
ReplyDelete
Replies
M WelinderOctober 16, 2020 at 2:02 AM
Your downcasing does not work, see https://en.cppreference.com/w/cpp/string/byte/tolower

You need...

c = static_cast(std::tolower(static_cast(c)));
ReplyDelete
Replies
FrancoOctober 16, 2020 at 8:57 AM
Yes, you CAN write elegant, modern, safe C++.

The problem remains that C++ has been out there for decades, so you can too easily mix new, safe, cool C++, with old, ugly, and unsafe C++, and the worst part is that the end result won't be even that much uglier if you look at the source code. Moreover, you're expecting programmers that used C++ for decades, to forget old programming habits and write code only using the new idioms.

IMHO C++ is beyond fixing, too huge, too easy to write unsafe code, too many bad habits accumulated over the years.

That's the reason we have new safer languages like Rust. From my experience (I'm not a Rust expert, but I made some simple CLI tools with it, so I'm not a novice either), you CAN make unsafe Rust code, but the source code looks way uglier than the safe one, so you instinctively write better code.
ReplyDelete
Replies
Vagn JOctober 16, 2020 at 10:58 AM
Re [1]: on Windows there are build errors with std::regex_search and std::toupper
ReplyDelete
Replies
UnknownOctober 17, 2020 at 3:26 AM
...This issis that no other programming language can provide as of date. The closest you can get is plain C..
Ever heard of Rust ( with the exception of therthe crate is not in std)
ReplyDelete
Replies
RobertOctober 17, 2020 at 3:46 PM
This code is fragile. Its safety and security depend on a subtle invariant: that the 'word_regex' does not match any non-ASCII character. If requirements changed so the regex could match a non-ASCII character, then one of the 'c' chars could have a negative value other than -1. https://en.cppreference.com/w/cpp/string/byte/tolower says "If the value of ch is not representable as unsigned char and does not equal EOF, the behavior is undefined", so executing std::tolower would trigger undefined behavior, i.e. a safety and security bug.

This fragility is not at all obvious in the code, and it is easy to imagine someone making such a change in response to changed requirements.
ReplyDelete
Replies
SomeOctober 19, 2020 at 9:13 AM
!std::regex_search(e.path().c_str(), fname_regex)

why not use e.path().extension()!=".txt"?
ReplyDelete
Replies
The Big MOctober 21, 2020 at 4:11 PM
I had a go adding some parallelism using the parallel STL and TBB. I made two implementations, count_words_stl() which parallelized the processing of files and count_words_tbb() which further parallelized the traversal of the file-tree. I was pretty pleased with the performance uplift for relatively little effort.
I used RE2 instead of std::regex because the compile times began to annoy me but regex turned out not to be the bottleneck anyway.

Here's the code:
https://pastebin.com/uSN9tiSa
ReplyDelete
Replies

Add comment