My previous blog post about modern C++ got a surprising amount of feedback. Some people even reimplemented the program in other languages, including one in Go, two different ones in Rust and even this slightly brain-bending C++ reimplementation as a declarative-style pipeline. It also got talked about on Reddit and Hacker News. Two major comments that kept popping up were the following.
- There are potential bugs if the program is ever extended to support non-ASCII text
- It is hard to use external dependencies in C++ (and by extension C)
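To make the first point concrete, here is a minimal sketch (not taken from the original program) of why byte-oriented code goes wrong once the input is no longer ASCII. The helper name `count_code_points` is made up for this illustration, and it assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative helper (not from the original program): counts Unicode code
// points in a UTF-8 string by skipping continuation bytes (10xxxxxx).
// Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string &utf8) {
    std::size_t n = 0;
    for(unsigned char c : utf8) {
        if((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    assert(n <= utf8.size()); // every code point takes at least one byte
    return n;
}
```

For the UTF-8 encoding of "naïve", `std::string::size()` reports 6 bytes while this helper reports 5 code points, so anything that treats one `char` as one character is already off. Code points are still not the same thing as user-perceived characters or words, which is the kind of problem a full Unicode library is meant to handle.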
Let's solve both of these problems at the same time. There are many "lite" solutions for doing Unicode-aware text processing, but we're going to use the 10 000 kilogram hammer: International Components for Unicode. The actual code changes are not that big; interested parties can look up the details in this GitHub repo. The thing that is relevant for this blog post is the build and dependency management. The Meson build definition is quite short:
project('cppunicode', 'cpp',
        default_options: ['cpp_std=c++17',
                          'default_library=static'])
icu_dep = dependency('icu-i18n')
thread_dep = dependency('threads')
executable('cppunicode', 'cppunicode.cpp',
           dependencies: [icu_dep, thread_dep])
The threads dependency is for the multithreaded parts (see the end of this post). I developed this on Linux and used the convenient system-provided ICU. Windows and macOS do not provide system libraries for it, so we need to build ICU from scratch on those platforms. This is achieved by running the following command in your project's source root (which needs to contain a subprojects directory):
$ meson wrap install icu
Installed icu branch 67.1 revision 1
This contacts Meson's WrapDB server and downloads build definition files for ICU. That is all you need to do. The build files do not need any changes. When you start building the project, Meson will automatically download and build the dependency. Here is a screenshot of the download step:
Here it is building in Visual Studio:
And here is the final built result running on macOS:
One notable downside of this approach is that WrapDB does not have all that many packages yet. However, I have been told that with the next Meson release (in a few weeks) and some upstream patches, it is possible to build the entire GTK widget toolkit as a subproject, even on Windows.
If anyone wants to contribute to the project, contributions are most welcome. You can, for example, convert existing projects and submit them to WrapDB or become a reviewer. The Meson web site has the relevant documentation.
Appendix: making it parallel
Several people pointed out that while the original program worked fine, it only uses one thread, which may be a bottleneck, and that "in C++ it is hard to execute work in parallel". This is again one of those things that has gotten a lot better in the last few years. The "correct" solution would be to use the parallel version of transform_reduce. Unfortunately most parallel STL implementations are still works in progress, so we can't use those in multiplatform code. We can, however, roll our own fairly easily, without needing to create or lock a single mutex by hand. The code has the actual details, but the (slightly edited) gist of it is this:
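// Launch one asynchronous task per directory entry.
for(const auto &e:
        std::filesystem::recursive_directory_iterator(".")) {
    futures.emplace_back(std::async(std::launch::async,
                                    count_word_files,
                                    e));
    // Cap the number of in-flight tasks by merging one result.
    if(futures.size() > num_threads) {
        pop_future(word_counts, futures);
    }
}
// Merge the results that are still outstanding.
while(!futures.empty()) {
    pop_future(word_counts, futures);
}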
Here the count_word_files function calculates the number of words in a single file and the pop_future function joins individual results to the final result. By using a share-nothing architecture, pure functions and value types, all business logic code can be written as if it were single-threaded, and the details of thread and mutex management can be left to library code. Haskell fans would be proud (or possibly horrified, not really sure).
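For readers who want something they can compile without the rest of the repo, here is a self-contained sketch of the same pattern. It is not the code from the repo: `count_words` is a hypothetical stand-in for count_word_files that works on in-memory strings instead of files, and this `pop_future` is only a guess at what the real one does (here it waits for the oldest task and merges its counts). The name `count_all` and the `WordCounts` alias are likewise made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

using WordCounts = std::unordered_map<std::string, std::size_t>;

// Hypothetical stand-in for count_word_files: a pure function that counts
// whitespace-separated words in one piece of text.
WordCounts count_words(std::string text) {
    WordCounts counts;
    std::istringstream in(text);
    std::string word;
    while(in >> word) {
        ++counts[word];
    }
    return counts;
}

// Assumed behaviour: wait for the oldest outstanding task and merge its
// partial result into the total. Only the calling thread touches `total`.
void pop_future(WordCounts &total,
                std::vector<std::future<WordCounts>> &futures) {
    assert(!futures.empty());
    WordCounts partial = futures.front().get();
    futures.erase(futures.begin());
    for(const auto &kv : partial) {
        total[kv.first] += kv.second;
    }
}

// Same loop shape as the gist above, with strings instead of directory entries.
WordCounts count_all(const std::vector<std::string> &texts,
                     std::size_t num_threads) {
    WordCounts total;
    std::vector<std::future<WordCounts>> futures;
    for(const auto &t : texts) {
        futures.emplace_back(std::async(std::launch::async, count_words, t));
        if(futures.size() > num_threads) {
            pop_future(total, futures);
        }
    }
    while(!futures.empty()) {
        pop_future(total, futures);
    }
    return total;
}
```

Calling `count_all({"a b a", "b c", "a"}, 2)` yields a count of 3 for "a", 2 for "b" and 1 for "c". Each task owns its own map and hands it back by value through the future, so the only synchronization in sight is the `get()` call.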