Sunday, April 24, 2016

Rewriting from scratch, should you do it?

One thing that has bothered me for a while is that the unzip command line tool only decompresses one file at a time. As a weekend project I wanted to see if I could make it work in parallel. One could write this functionality from scratch, but this gave me an opportunity to really look into incremental development.

All developers love writing stuff from scratch rather than fixing an existing solution (yes, guilty as charged). However, the accepted wisdom is that you should never do a from-scratch rewrite but should instead improve what you have incrementally. Thus I downloaded the sources of Info-Zip and got to work.

Info-Zip's code base is quite peculiar. It predates things such as ANSI C and has support for tons of crazy long dead hardware. MS-DOS ranks among the most recently added platforms. There is a lot of code for 16 bit processors, near and far pointers and all that fun stuff your grandad used to complain about. There are even completely bizarre things such as this snippet:

#ifndef const
#  define const const
#endif

The code base contains roughly 80 000 lines of K&R C. This should prove an interesting challenge. Those wanting to play along can get the code from GitHub.

Compiling the code turned out to be quite simple. There is no configure script or the like; everything is #ifdeffed inside the source files. You just compile them into an app and then you have a working exe. The downside is that the source has more preprocessor code than actual code (only a slight exaggeration).

Originally the code used a single (huge) global struct that housed everything. At some point the developers needed to make the code reentrant. Usually this means changing every function to take the global state struct as a function argument instead. These people chose not to do this. Instead they created a C preprocessor macro system that can be used to pass the struct as an argument but can also compile the code so that it uses the old style global struct. I have no idea why they did that. The only explanation that makes any sort of sense is that pushing the pointer onto the stack on every function call was too expensive on 16 bit and smaller platforms. This is just speculation, though; if anyone knows for sure, please let me know.

This meant that every single function definition was a weird concoction of preprocessor macros and K&R syntax. For details see this commit that eventually killed it.
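To give a feel for what such a macro system looks like, here is a minimal sketch of the general technique. All names (Globals, REENTRANT_BUILD, STATE_PARAM and friends) are made up for illustration and are not Info-Zip's actual identifiers, and the sketch uses modern syntax rather than K&R.

```cpp
// Hypothetical names throughout; an illustration of the technique only.
struct Globals {
    int files_done;
    const char *current_name;
};

#ifdef REENTRANT_BUILD
// Reentrant build: every function receives a pointer to its state.
#  define STATE_PARAM   Globals *pg      // sole parameter in a definition
#  define STATE_PARAM_  Globals *pg,     // first of several parameters
#  define STATE_ARG     pg               // sole argument at a call site
#  define STATE_ARG_    pg,              // first of several arguments
#  define G             (*pg)
#else
// Classic build: one global struct, nothing is passed around.
static Globals global_state;
#  define STATE_PARAM   void
#  define STATE_PARAM_
#  define STATE_ARG
#  define STATE_ARG_
#  define G             global_state
#endif

// The same source text serves both builds.
static void note_file(STATE_PARAM_ const char *name)
{
    G.current_name = name;
    G.files_done++;
}

static int files_done(STATE_PARAM)
{
    return G.files_done;
}
```

A caller writes note_file(STATE_ARG_ "a.txt") and the preprocessor decides whether a pointer is actually passed. Compiling with -DREENTRANT_BUILD selects the pointer-passing variant, in which case the caller needs a variable named pg in scope.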

Getting rid of all the cruft was not particularly difficult, only tedious. The original developers were very pedantic about flagging their #if/#endif pairs so killing dead code was straightforward. The downside was that what remained after that was awful. The code had more asterisks than letters. A typical function was hundreds of lines long. Some preprocessor symbols were defined in opposite ways in different header files but things worked because some other preprocessor clauses kept all but one from being evaluated (the code confused Eclipse's syntax highlighter, so it was hard to see what was actually happening).

Ten or so hours of solid work later, most of the dead cruft was deleted and the code base had shrunk to 30 000 lines of code. At this point adding threading was starting to look feasible. After going through the code that iterates the zip index and extracts files, it looked a lot less feasible. As an example, the inflate function was not isolated from the rest of the code. All its arguments were passed in The One Big Struct, which it fiddled with constantly. The two would need to be fully separated to make anything work.
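For contrast, this is roughly the shape a decompression routine needs to have before it can run on several threads. It is a sketch only: decode_rle is a hypothetical stand-in that expands (count, byte) pairs, not inflate, and it is not taken from either code base. The point is that all input arrives as a parameter and all output leaves as the return value.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Stand-in for a real decompressor: expands (count, byte) pairs.
// No global or shared state is read or written, so two calls running
// at the same time cannot interfere with each other.
std::vector<std::uint8_t> decode_rle(const std::vector<std::uint8_t> &input)
{
    if (input.size() % 2 != 0) {
        throw std::invalid_argument("input must be (count, byte) pairs");
    }
    std::vector<std::uint8_t> output;
    for (std::size_t i = 0; i < input.size(); i += 2) {
        // Append input[i] copies of the byte input[i + 1].
        output.insert(output.end(),
                      static_cast<std::size_t>(input[i]), input[i + 1]);
    }
    return output;
}
```

A function written this way can be handed to any number of threads without locking, which is exactly what the struct-based version could not offer.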

That nagging sound in your ear

While fixing the code I kept hearing the call of the rewrite siren. Just rewrite from scratch, it would say. It's a lot less work. Go on! Just try it! You know you want to!

Eventually the lure got too strong so I opened the Wikipedia page on the Zip file format. Three hours and 373 lines of C++ later I had a parallel unzipper written from scratch. Granted, it does not do advanced stuff like encryption, ZIP64 or creating subdirectories for files that it writes. But it works! Code is available in this repo.
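To show why the format is approachable, here is a minimal sketch of the first step any unzipper has to take: finding the end of central directory record, which says where the file index lives. The field offsets follow the published Zip layout; the struct and function names are hypothetical and the code is not taken from the repo. It ignores ZIP64 and does not cross-check the comment length.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical holder for the fields an extractor needs from the record.
struct EndRecord {
    bool found;
    std::uint16_t entry_count;       // total entries in the central directory
    std::uint32_t directory_size;    // size of the central directory in bytes
    std::uint32_t directory_offset;  // file offset where the directory starts
};

// Zip stores all multi-byte integers in little-endian order.
static std::uint16_t read_u16(const std::vector<std::uint8_t> &b, std::size_t pos)
{
    return static_cast<std::uint16_t>(b[pos] | (b[pos + 1] << 8));
}

static std::uint32_t read_u32(const std::vector<std::uint8_t> &b, std::size_t pos)
{
    return static_cast<std::uint32_t>(b[pos]) |
           (static_cast<std::uint32_t>(b[pos + 1]) << 8) |
           (static_cast<std::uint32_t>(b[pos + 2]) << 16) |
           (static_cast<std::uint32_t>(b[pos + 3]) << 24);
}

// Scan backwards from the end of the file for the record's signature.
EndRecord find_end_record(const std::vector<std::uint8_t> &file)
{
    const std::size_t fixed_size = 22;           // record size without comment
    const std::uint32_t signature = 0x06054b50;  // bytes 'P' 'K' 5 6
    EndRecord result = {false, 0, 0, 0};
    if (file.size() < fixed_size) {
        return result;
    }
    for (std::size_t pos = file.size() - fixed_size + 1; pos-- > 0;) {
        if (read_u32(file, pos) == signature) {
            result.found = true;
            result.entry_count = read_u16(file, pos + 10);
            result.directory_size = read_u32(file, pos + 12);
            result.directory_offset = read_u32(file, pos + 16);
            return result;
        }
    }
    return result;
}
```

From directory_offset onward the index is a flat list of entries, each pointing at its own compressed data.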

Even better, adding multithreading took one commit with 22 additions and 7 deletions. The build definition is 10 lines of Meson instead of 1000+ lines of incomprehensible Make.
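The reason the change can be that small is that each entry can be extracted independently of the others. The following is a sketch of the idea using std::async, not the contents of the actual commit. Entry and extract_entry are hypothetical stand-ins, and launching one task per entry is a simplification; a real tool would likely cap the number of concurrent tasks.

```cpp
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for one archive entry.
struct Entry {
    std::string name;
    std::string packed;
};

// Stand-in for "decompress and write one file". It touches only its own
// argument, which is what makes running many of these at once safe.
std::size_t extract_entry(const Entry &e)
{
    return e.packed.size();  // pretend this is the number of bytes written
}

// Launch one task per entry, then collect the results in entry order.
std::vector<std::size_t> extract_all(const std::vector<Entry> &entries)
{
    std::vector<std::future<std::size_t>> tasks;
    tasks.reserve(entries.size());
    for (const Entry &e : entries) {
        tasks.push_back(std::async(std::launch::async,
                                   [&e] { return extract_entry(e); }));
    }
    std::vector<std::size_t> written;
    written.reserve(tasks.size());
    for (auto &t : tasks) {
        written.push_back(t.get());  // blocks until that task has finished
    }
    return written;
}
```

On some toolchains this needs to be linked with thread support, for example with -pthread.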

There really is no reason, business or otherwise, to modernise the codebase of Info-Zip. With contemporary tools, libraries and methodologies you can create code that is an order of magnitude simpler, clearer, more maintainable and just all around more pleasant to work with than existing code. In a fraction of the time.

Sometimes rewriting from scratch is the correct thing to do.

This is the exact opposite of what I set out to prove but that's research for you.


The program in question has now been expanded to do full Zip packing + unpacking. See here for benchmarks.


  1. I don't get the rationale behind NEVER rewriting code from scratch. If it is a horrible mess wrapped around what should be relatively simple code, why not rewrite it in a way that is better and easier to maintain for future people?

  2. The rationale for not rewriting is that usually the rewrite turns out to be a lot more expensive than originally estimated. Focusing on the new version means neglecting the existing version and if that goes on for too long, your customers will get angry and go elsewhere.

  3. In this case you've got a well defined API to replace, and you've got a well defined vision of what you want. The conventional "never rewrite" wisdom (IMHO) applies more to monolithic business applications, with years of bug fixes and quirky/murky/important behavior.


  4. A total rewrite only makes sense if you truly understand the *complete* requirements and architectural design of the application. This will allow you to *re-architect* the program quickly and efficiently.

    Sometimes a rewrite is necessary because the existing codebase is such a huge, tangled mess. Refactoring may be much more trouble than it's worth.

    Refactoring requires you to fully understand the codebase. If it's poorly written, it will be extremely painful to read and understand.