My previous blog post was about some measurements I made to Refterm. It got talked about on certain places on the net from where it got to Twitter. Then Twitter did the thing it always does, which is to make everything terrible. For example I got dozens of comments saying that I was incompetent, an idiot, a troll and even a Microsoft employee. All comments on this blog are manually screened but this was the only time when I just had to block them all. Going through those replies seemed to indicate that absolutely everyone involved [1] had bad communication and also misunderstood most other people involved. Thus I felt compelled to write a followup explaining what the blog post was and was not about. Hopefully this will bring the discussion down to a more civilized footing.

What we agree on

Let's start by writing down all the things we agree on.

The current Windows terminal is slow.
It can be made faster.
A GPU-based renderer (such as the one in Refterm) can render terminal text hundreds of times faster than the current implementation in Windows terminal.

Note that even the people working on Microsoft Terminal acknowledged all of these to be true before any code on Refterm had been written. From what I can tell, #3 is what Refterm set out to prove and it was successfull at it.

So what's the issue then?

Once the code was out many people started making these kinds of statements.

Windows terminal should switch to using this method because it is obviously superior and not doing that is just laziness and excuses.

Now, and this is important, the original people who worked on Refterm never made these kinds of claims. They did not! And further, I never claimed that they did. Other people ("the talking heads on the Internet") made those claims and then mental misattribution took over. This is unfortunate but sadly almost inevitable whenever these kinds of debates happen. That then leads to the obvious follow up question:

Could the rendering mechanism used in Refterm be put in Windows terminal proper? If not, why not? If yes, what would the outcome be like and would the code need changing?

This is what my original blog post was about. Since this was outside of the original project's design goals I should have stated the goals out explicitly. I did not, and that is a flaw on my part, sorry about that.

The problems with prototypes

Implementing a simple prototype of an existing program (or a part of it), achieving great success and then extrapolating from that to the whole program (and to reiterate: the original Refterm authors did not do this speculation, other people did) has a well known pitfall. I could write an entire blog post about it. Fortunately I don't have to since Matthew Garrett already wrote one. I recommend that everyone reads that before continuing with this post.

The tl/dr version is that when you bring a protype up to sufficient feature parity with an existing implementation you will encounter unexpected problems. The algorithms or approaches you took might not be feasible. You might need to add a bunch of functionality that you never considered or had even heard of. Until you have the entire implementation you don't know whether you approach will work. In fact you can't know it. Anyone who claims to know is lying, either to others or to themselves. (Reminder again: the Refterm authors did not make these kinds of estimates.)

We can try to come up with some of the obstacles and problems one could have when moving the prototype implementation into a real one and then examine those. They can't prove fitness but they can reveal un-fitness. The points discussed in the blog post were just some that I came up with. There are undoubtedly many others.

Resource usage

Let's start this by acknowledging a notable flaw in the original post [2]. When evaluating memory usage the post only compared it against other types of apps, not other terminals. I ran some new measurements by starting both the Windows cmd.exe shell as well as the Git-Scm's MSYS2 terminal, running a few simple commands and looking at memory consumption with the Task Manager. Refterm took 350 MB of ram, MSYS2 took 4 MB and cmd.exe took 7 MB.

People really love their terminals. I have seen setups where a single developer has 10+ different terminals open at the same time all as different processes (with several tabs each). So even if 300 MB of ram usage for a single app would be fine, using 3 GB of ram in this case would not be. Thus one would either need to dramatically reduce memory usage or have something like a shared glyph cache between all the various processes. That requires shared GPU resources, some sort of an IPC communication mechanism, multiprocess cache invalidation and all other fun stuff (or that is my understanding at least as a Windows and GPU neophyte, if there is a simpler way do let me know).

This piece of information is useful and important on its own. It gives new information and understanding of what the code does and does not do.

A retort to this that was actually made by Refterm developers was that "there are variables and knobs in the code you can tweak to affect the performance". To this I say: no. The first tests should always be done with the exact setup the code ships with. There are two reasons for this. First of all, that makes experiments made by different people directly comparable with each other. Secondly, the original author(s) know the code best and thus it makes sense to choose those parameter values that they went with.

Code layout and nonstandardness

Let's start again with the thing we all agree on:

For your own projects you can choose whatever code layout, build system, organization and so on that you want. Do whatever works best for you and don't let anyone tell you otherwise!

Things get more complicated when you start including other people, especially outside your own circle of devs. An open source project is a typical example. An anonymous commenter told me the following:

This is also the simplest possible code structure, very simple to work with for new contributors.

This sentence is interesting in that it is both true and not true. If you have a person who has no prior programming knowledge then yes, the layout is simplest possible and easy to get started with. On the other hand if the potential contributor is already accustomed to the "standard way of setting up a project" then things change and the nonstandard layout can get confusing [3] and can be a barrier to entry for new people. This is the nature of teamwork: sometimes it might make sense to do the thing that is inconvenient for you personally for the benefit of others. Sometimes it does not make sense. Like most things in life being nonstandard is not an absolute thing, it has its advantages but also disadvantages.

I actually encountered a technical disadvantage caused by this. I wanted to compile and run Refterm under Address Sanitizer, which is a really great tool for finding bugs. Asan is integrated into the latest VS and all you need to do is to add /fsanitize=address flags to the compiler to use it. This does not work for Refterm but instead leads to a bunch of linker errors. The Asan library depends on the C runtime and Refterm is explicitly set up not to use it. It took me a fair bit of time to work out that the way to get it working is to go through the code and replace the WinMainCRTStartup function with a "normal" WinMain function and then the linker would do the right thing [4].

That SIMD memcpy thing

I pondered for a while whether I should mention the memcpy thing and now I really wish I hadn't. But not for the reasons you might think.

The big blunder I did was to mention SIMD by name, because the issue was not really about SIMD. The compiler does convert the loop to SIMD automatically. I don't have a good reference, but I have been told that Google devs have measured that 1% of all CPU usage over their entire fleet of computers is spent on memcpy. They have spent massive amounts of time and resources on improving its performance. At least as late as 2013, performance optimizing memcpy was still subject to fundamental research (or software patents at least). For reference here is the the code for the glibc version of memcpy, which seems to be doing some special tricks.

If this is the case and the VS stdlib provides a really fast memcpy then rolling your own does cause a performance hit (though in this particular case the difference is probably minimal, possibly even lost in the noise). On the other hand it might be that VS can already optimize the simple version to the optimal code in which case the outcome is the same for both approaches. I don't know what actually happens and finding out for sure would require its own set of tests and benchmarks.

Concluding and a word about blog comments

That was a very long post and it did not even go through all the things I had in mind. If you have any comments, feel free to post them below, but note the following:

All comments are prescreened and only appear after manually approved.
Any comments that contain insults, whining, offensive tone or any other such thing will be trashed regardless of its other merits
The same goes for any other post whose contents makes it obvious that the commenter has not read the whole text but is just lashing out.

[1] Yes, this includes me. It most likely includes you as well.

[2] Thanks to an anonymous commenter for pointing this out.

[3] I can't speak for others, bu it was for me.

[4] There may have been other steps, but this was the crucial one.

4 comments:

Matthew FernandezJuly 9, 2021 at 6:29 PM
I don't know if this is precisely what you were looking for, but if you follow the "slides" link in the README of https://github.com/google/EXEgesis you will find some interesting 2017 references to Google's memcpy problems.
RobertJuly 10, 2021 at 11:53 AM
The renderer already used in Windows Terminal is also GPU-based.
UnknownJuly 12, 2021 at 7:54 AM
The original objection made by the MS developers was that text layout and rendering is difficult, and that the kind of performance Casey was talking about was impossible. That's clearly been proven false. The video he posted today showed RefTerm using lower memory than Windows terminal.

Yes, it's a prototype that doesn't account for everything, but it's a prototype that accounted for all the reasons made by MS developers and clearly exemplified a better way of doing glyph rendering in a terminal.
Christian Parpart.July 19, 2021 at 2:38 PM
"that text layout and rendering is difficult"

That's actually true, just try to do it yourself and you will realize. Have a look at "Blink's text stack" document to get a glimpse of what one should do in order to get it *right*. refterm has a working version, but cmuratori also acknowledges some problems with it in one of his vids.

I think refterm is not super optimized but is definitely using a few optimization strategies (loook at how terminal buffer management is implemented, look at how ASCII is scanned and how scrollback lines are represented).

Personally I also think MS WT is slow and acknowledge that fact. I don't understand why people make such a fuzz out of it. It's just feeding (granted, I'm feeding right now, too ;).

But the bottom line, a tile based renderer should always be fast. I agree. ;)

Nibble Stew

Thursday, July 8, 2021

A followup to the Refterm blog post