Nibble Stew: June 2017

Monday, June 26, 2017

Writing your own simple GUI SSH client

Most Unix users are accustomed to using SSH from the command line. On Windows and other platforms GUI tools are popular and they can do some nice tricks such as opening graphical file transfer windows and initiate port forwardings to an existing connection. You can do all the same things with the command line client but you have to specify all things you want to use when first opening the connection.

This got me curious. How many lines of code would one need to build a GUI client that does all that on Linux. The answer turned out to be around 1500 lines of code whose job is mostly to glue together the libvte terminal emulator widget and the libssh network library. This is what it looks like:

Port forwardings can be opened at any time. This allows you to e.g. forward http traffic through your own proxy server to go around draconian firewalls.

File transfers also work.

A lot of things do not work, such as reverse port forwards or changing the remote directory in file transfers. Some key combinations that have ctrl/alt/etc modifiers can't be sent over the terminal. I don't really know why, but it seems the vte terminal does some internal state tracking to know which modifiers are active. There does not seem to be a way to smuggle corresponding raw keycodes out, it seems to send them directly to the terminal it usually controls. I also did not find an easy way of getting full keyboard status from raw GTK+ events.

Thursday, June 15, 2017

Of XML, GUIDs and installers

Every now and then I have to use Windows. I'd really like to use Emacs for editing so I need to install it. Unfortunately the official releases from the FSF are zip archives that you need to manually unpack, add shortcuts and all that stuff. This gets tedious and it would be nice if there were MSI installer packages to do all that for you. So I figured I'd create some.

Conceptually what the installer needs to do is the following:

Create the c:\Program Files\GNU Emacs directory
Unzip contents of the official release zip file in said directory

Seems simple, right? There are two main tools for creating installers. The first one is NSIS, but it only creates exe installers, not msi. It also has a scripting language designed by a shaman who has licked a few frogs too many. So let's ignore that.

The other tool is WiX. It creates nice installers but it is Enterprise. Very, Very Enterprise. And XML. But mostly Enterprise. For starters the installer needs to have a GUID (basically a 128 bit random number). It also needs an "upgrade GUID". And a "package GUID". The first two must be the same over all installer versions but the latter must not be.

Having conjured the necessary amount of GUIDs (but not too many), you are ready to tackle copying files around. As you probably guessed, each one of them needs its own GUID. But if you though that each one would require an XML element of their own, you are sadly mistaken. Every file needs two nested XML elements. The files also need a container. And a container-container.

Did I mention that the documentation consists almost entirely of XML fragments? So it is all but impossible to tell which tag should follow which and which ones should be nested?

The Emacs distribution has a lot of files, so obviously writing the XML by hand is out of the question. Unfortunately the WiX toolkit ships a helper utility called heat to generate the XML from a directory tree. Yes, I really did say unfortunately. While the script pretends to work in reality it does not. Following the documentation you might try doing something like this (simplified for purposes of exposition):

heat <directory with unpacked files> <output xml file> other_flags

This does create an installer which installs all the files. But it also does a little bit extra. If your unpack directory was called unpack, then the files will be installed to c:\Program Files\GNU Emacs\unpack. How might we get rid of that extra directory segment? The correct answer is that the system is optimized for in-source Visual Studio builds and trying to do anything else is doomed to fail. But let's try it anyway.

First one might look for a command line switch like -P1 for patch to discard the first path segment. It does not seem to exist.

Next one might try to be clever and cd inside the unpack dir and do this (again simplified):

heat . <output xml file> other_flags

The system reports no error but the output is identical to the previous attempt. The script helpfully notices that you are using a period for the directory name and will do a lookup in the parent directory to see what it would be called and substitutes it in. Because!

Since there are only a few directories in the top level one might try something along the lines of:

heat dir1 dir2 dir3 dir4 <output xml file> other args

Which again succeeds without error. However the output installer only has files from dir1. The tool parses all the input and then dutifully throws away all entries but the first without so much as a warning.

The way to make this work is to generate the XML files and then monkey patch all paths in them before passing them on to the installer generator. For That Is the Enterprise Way! Just be glad there are no files at the root directory.

Join us next time to learn how a randomly generated daddy GUID comes together with a randomly generated mommy GUID to produce a new entry in the start menu.

Is this available somewhere?

The script is here. There are no downloadable binaries because I don't have the resources to maintain those. It would be cool if someone (possibly even the FSF) would provide them.

Thursday, June 8, 2017

Optimizing code: even the simplest things are unbelievably complex

In the previous post we looked at optimizing this simple function.

uint64_t result = 0;
for(size_t i=0; i<bufsize; i++) {
if(buf[i] >= 128) {
result += buf[i];
}
}

We shall now do more measurements with real world hardware and compilers. The algorithms we use are the following:

simple: the above loop as-is
lookup: create a lookup table where entries less than 128 have value zero and the rest have the same value as the index
bit fiddling: convert the if into a branchless bitmask operation
partition: run std::partition on the data and only add the first half
zeroing: go over the data and set values not matching to zero, then add all
bucket: keep an array of 255 entries and count the number of times each number appears
multiply: convert if to a multiplication by 0 or 1, then add
parallel add: add several chars in a single packed 64 bit addition

Those interested in the actual implementations should look it up in the repo.

The hardware used is the following:

Raspberry Pi, Raspbian, GCC 4.9.2, Clang 3.5.0
Ubuntu zesty, GCC 6.3.0, Clang 4.0.0
Macbook Pro i5, XCode 8
Windows 10, Visual Studio 2017, run in VirtualBox

The test suite runs all available compilers with a selection of optimization types, CPU features (SSE, AVX, Neon etc) and measures the times taken.

The results

Let's start by looking at the simplest build setup.

This seems quite reasonable. Parallel addition is the fastest, others are roughly as fast and the two algoritms that reorder the input array are the slowest. For comparison Raspberry Pi looks like this:

Everything is much flatter as one would expect. Since everything is going smoothly, let's look at the first measurement again, except this time we sort the input data before evaluating. One would expect that the simple loop becomes faster because the branch predictor has an easier task, partitioning becomes faster and nothing becomes noticeably slower.

Well ... ummm ... one out of three ain't bad, I guess. At this point I should probably confess that I don't have a proper grasp on why these things are happening. Any speculation to follow might be completely wrong. The reason bucket slows down is the easier of these two to explain. Since the input is sorted, consecutive iterations of the loop attempt to write to the same memory address, which leads to contention. When the data was random, each iteration wrote to a random location which leads to fewer collisions.

The reason why the simple loop does not get faster may be caused by the processor evaluating both branches of the if clause in any case and thus having better branch prediction does not matter. On the other hand Visual Studio does this:

Bucket is slower for sorted as above, but the simple loop is an order of magnitude slower on unsorted data. Ideas on what could be the cause of that are welcome.

What is the fastest combination?

The fastest combination for each hardware platform is the following.

Ubuntu: bit fiddle, g++, release build, -msse, unsorted
Raspi: bit fiddle, g++, release build, -mfpu=neon, sorted
OSX: simple loop, Clang++, debugoptimized build, -msse4.2, sorted
VS2017: lut, debugoptimized build, unsorted

This is basically random. There does not seem to be any one algorithm that is consistently the fastest, every one of them is noticeably slower than others under some circumstances. Even weirder, things that you would expect to be straightforward and true are not. Here are some things to scratch your head over:

AVX instructions are never the fastest, and on an i7 the fastest is plain old SSE (for the record MMX was not tested)
With Clang, enabling Neon instructions makes everything a lot slower
On the Raspberry Pi doing a read only table lookup using Neon is slower than with regular instructions
On an i7 multiplication is sometimes faster than arithmetic shifting