Saturday, October 31, 2015

Shell script or Python?

Every now and then a discussion arises on whether some piece of scripting functionality should be written as a basic shell script or in Python. Here is a handy step-by-step guide to answering the question. Start at the beginning and work your way down until you have your answer.

Will the script be longer than one screenful of text (80 lines)? If yes ⇒ Python.

Will the script ever need to be run on a non-unix platform such as Windows (including Cygwin)? If yes ⇒ Python.

Does the script require command line arguments? If yes ⇒ Python.

Will the script do any networking (e.g. call curl or wget)? If yes ⇒ Python.

Will the script use any functionality that is in Python standard library but is not available as a preinstalled binary absolutely everywhere (e.g. compressing with 7zip)? If yes ⇒ Python.

Will the script use any control flow (functions/while/if/etc)? If yes ⇒ Python.

Will the script have modifications in the future rather than being written once and then dropped? If yes ⇒ Python.

Will the script be edited by more than one person? If yes ⇒ Python.

Does the script call into any non-trivial Unix helper tool (awk, sed, etc)? If yes ⇒ Python.

Does the script call into any tool whose behaviour is different on different platforms (e.g. mkdir -p, some builtin shell functionality)? If yes ⇒ Python.

Does the script need to do anything in parallel? If yes ⇒ Python.

If none of the above match then you have a script that proceeds in a straight line calling standard and simple Unix commands and never needs to run on a non-unix platform. In this case a shell script might be ok. But you might still consider doing it in Python, because scripts have a tendency to become more and more complex with time.

Friday, October 30, 2015

Proposal for launching standalone apps on Linux

One nagging issue Linux users often face is launching third party applications, that is, those that do not come from package repositories. As an example the Pitivi video editor provides daily bundles that you can just download and run . Starting it takes a surprising amount of effort and may require going to the command line (which is an absolute no-no if your target audience includes regular people).

This page outlines a simple, distro agnostic proposal for dealing with the problem. It is modelled after OSX application bundles (which are really just directories with some metadata) and only requires changes to file managers. It is also a requirement that the applications must be installable and runnable without any priviledge escalation (i.e. does not require root).

The core concept is an application bundle. The definition of an app bundle named foobar is a subdirectory named foobar.app which contains a file called foobar.desktop. There are no other requirements for the bundle and it may contain arbitrary files and directories.

When a file manager notices a directory that matches the above two requirements, and thus is a bundle, it must treat it as an application. This means the following:


  • it must display it as an application using the icon specified in the desktop file rather than as a subdirectory
  • when double clicked, it must launch the app as specified by the desktop file rather than showing the contents of the subdir
These two are the only requirements to get a working OSX-like app bundle going on Linux. There can of course be more integration, such as offering to install an app to ~/Applications when opening a zip file with an app bundle or registering the desktop file to the system wide application list (such as the dash on Ubuntu).

Monday, October 12, 2015

Some comments on the C++ modules talk

Gabriel Dos Reis had a presentation on C++ modules at cppcon. The video is available here. I highly recommend that you watch it, it's good stuff.

However as a build system developer, one thing caught my eye. The way modules are used (at around 40 minutes in the presentation) has a nasty quirk. The basic approach is that you have a source file foo.cpp, which defines a module Foobar. To compile this you should say this:

cl -c /module foo.cxx

The causes the compiler to output foo.o as well as Foobar.ifc, which contains the binary definition of the module. To use this you would compile a second source file like this:

cl -c baz.cpp /module:reference Foobar.ifc

This is basically the same way that Fortran does its modules and it suffers from the same problem which makes life miserable for build system developers.

There are two main reasons. One: the name of the ifc file can not be known beforehand without scanning the contents of source files. The second is that you can't know what filename to give to the second command line without scanning it to see what imports it uses _and_ scanning potentially every source file in your project to find out what file actually provides it.

Most modern build systems work in two phases. First you parse the build definition and determine how and which order to do individual build steps in. Basically it just serialises the dependency DAG to disk. The second phase loads the DAG, checks its status and takes all the steps necessary to bring the build up to date.

The first phase of the two takes a lot more effort and is usually much slower than the second part. A typical ratio for a medium project is that first phase takes roughly ten seconds of CPU time and the second step takes a fraction of a second. In contemporary C++ almost all code changes only require rerunning the second step, whereas changing build config (adding new targets etc) requires doing the first step as well.

This is caused by the fact that output files and their names are fully knowable without looking at the contents of the source files. With the proposed scheme this no longer is the case. A simple (if slightly pathological) example should clarify the issue.

Suppose you have file A that defines a module and file B that uses it. You compile A first and then B. Now change the source code so that the module definition goes to B and A uses it. How would a build system know that it needs to compile B first and only then A?

The answer is that without scanning the contents of A and B before running the compiler this is impossible. This means that to get reliable builds either all build systems need to grow a full C++ parser or all C++ compilers must grow a full build system. Neither of these is particularly desirable. Even if build systems got these parsers they would need to reparse the source of all changed files before starting the compiler and it would need to change the compiler arguments to use. This makes every rebuild take the slow path of step one instead of the fast step two.

Potential solutions


There are a few possible solutions to this problems, none of which are perfect. The first is the requirement that module Foo must be defined in a source file Foo.cpp. This makes everything deterministic again but has that iffy Java feeling about it. The second option is to define the module in a "header-like" file rather than in source code. Thus a foo.h file would become foo.ifc and the compiler could pick that up automatically instead of the .h file.

Friday, September 25, 2015

Linux ABI compatibility even reduxer

My previous blog post raised some discussion on Reddit. There were two major issues raised. The first one is that a helloworld application is not much of a test. The other was that because the distro was so new, the test does not validate that backwards compatibility goes back 16 years, but only about 6.

Very well.



I installed Red Hat Linux 6.2, which was originally published in 2000. Then I downloaded the source code of Gimp 1.0, compiled it and embedded its dependencies, basically glib + gtk, into an app bundle. Stable core libraries such as libjpeg, X libraries and glibc were not bundled. I then copied this bundle to a brand new Ubuntu install and ran it.



It worked without a hitch.

What have we learned from this? Basically that it was possible in the year 2000 to create binary only applications for Linux that would work 15 years later on a different distro.

If you follow the exact same steps today, it is presumable that you can create binary apps that will run without any changes in the year 2030.

Thursday, September 24, 2015

Linux applications backwards compatibility redux

As we all know Linux applications are tied to the distro they have been built on. The do not work on other distros and even on later releases of the same distro. Anyone who has tried to run old binaries knows this. It is the commonly accepted truth.

But is it the actual truth?

In preparation for my LCA2016 presentation I decided to test this. I took Debian Lenny from 2009 and installed it in VirtualBox. Lenny is the oldest version that would work, the older ones would fail because distro mirrors no longer carry their packages. Then I downloaded GTK+ 1.2 from the previous millennium (1999). I built and installed it and GLib 1 into a standalone directory and finally compiled a helloworld application and a launcher script and put them in the same dir.

This directory formed an application bundle that is almost identical to OSX app bundles. The main difference to a distro package is that it embeds all its dependencies rather than resorting to distro packages. This is exactly how one would have produced a standalone binary at the time. I copied the package to a brand new Fedora install and launched it. This is the result.
This is a standard Fedora install running Gnome 3. The gray box in the middle is the GTK+ 1.2 application. It ran with no changes whatsoever. This is even more amazing when you consider that this means a backwards compatibility time span of over 15 years, over two completely different Linux distributions and even CPU architectures (Lenny was x86 whereas Fedora is x86_64).

Tuesday, August 18, 2015

Proposal for a dependency security scheme for sandboxed apps

Perhaps the main disadvantage of embedding external projects as source code rather than using distro packages is the loss of security. Linux distributions have lots of people working on keeping distro packages up to date and safe. With embedded source code this is no longer the case. The developer is in charge of keeping the application up to date. Some developers are better at it than others.

What makes this worse is the fact that if you embed (and especially statically link) your dependencies it is impossible to know what versions of which libraries you are using. If this information were available, then the host operating system could verify the list of embedded dependencies against a known white- or blacklist. The packaging format simply does not have this information.

So let's put it there.

A merge proposal has just recently been proposed to Meson. This makes it create (and optionally install) a dependency manifest for each generated binary. This manifest is simply a JSON file that lists all the embedded dependencies that each given binary uses. Its proposed format looks like this.

{
  "type": "dependency manifest",
  "version": 1,
  "dependencies":
  {
    "entity": "1.0"
  }
}

In this case the executable has only one dependency, the project entity version 1.0. Other such dependencies could include zlib version 1.2.8 or openssl version 1.0.2d. The project names and releases would mirror upstream releases. This manifest would make it easy to guard against binaries that have embedded unsafe versions of their dependencies.

But wait, it gets better.

All dependencies that are provided by the Wrap database would (eventually ;) expose this information and thus would generate the manifest automatically. The developer does not need to do anything to get it built, only to say he wants to have it installed. It is simply a byproduct of using the dependency.

As the Linux application installation experience keeps moving away from distro packages and towards things such as xdg-app, snappy and the like, the need to increase security becomes ever trickier. This is one such proposal that is already working and testable today. Hopefully it will see adoption among the community.

Single question mini-faq

What happens if someone creates a fraudulent or empty manifest file for their app?

The exact same thing as now when there is no manifest info of any kind. This scheme is not meant to protect against Macchiavelli, only against Murphy.

Wednesday, August 5, 2015

Make your tool programs take file name arguments

There are a lot of utility programs for text manipulation, scanning and so on. Often they are written as filters so you can use them with shell redirection like this.

dosomething < infile.txt > outfile.txt

There's nothing wrong with this as such, but the problem is that this causes problems when you invoke the program from somewhere else than a unix shell prompt. As an example if you need to invoke it from a compiled binary, things get very complex as you need to juggle with pipes and other stuff.

Because of this you should always make it possible to specify inputs and outputs as plain files. For new programs this is simple. This is a bigger problem for existing programs that already have hundreds of lines of code that read and write directly to stdout and stdin. Refactoring it to use files might be a lot of work for little visible gain.

No matter, the C standard library has you covered. It has a method called freopen that opens a file and replaces an existing file descriptor with it. To forward stdout and stdin to files you just need to do this at the beginning of your program:

freopen(ifilename, "r", stdin);
freopen(ofilename, "w", stdout);