On Understanding Computers

Many times I’ve been asked: “Why are you working on that? Is it for work?” referring to my tinkerings with kernel internals. “No, it’s not for work,” I reply, “I’m just interested in it.”

Other times this goes a bit further: “That looks horrible. Why would anyone want to work in C and assembler? No-one needs to understand that any more.” I’ll reply that I think it’s important to understand that which you rely on.

Then I’ll be told: “You don’t need to understand the internal combustion engine to drive a car.” They’re right, of course, but there is still value in understanding how an internal combustion engine works.

Then maybe someone will be telling me about the wonderful abstraction provided by the Node.js event loop, or the concurrency benefits of using a functional language. These things scare me, because I know that they eventually have to play ball with the real and terrifying world of the kernel and the processor, and only a fraction of a percent of programmers have any interest in how either of those things run their code.

Entitlement and the Cost of Abstraction

Frustration is part and parcel of programming. A lot of our abstractions are a house of cards, and frequently they collapse around us and cause us pain. If you’ve used any third-party library then you’re probably aware of this. Having to work around other people’s bugs is a huge pain and makes your code ugly.

Abstraction Example: Amazon SWF

Amazon have a service called “Simple Workflow” (SWF). It takes away a lot of the pain of having to write a distributed message pipelining application. The idea is that you have multiple “activities” and a “decider”. Messages are sent to the decider and can contain arbitrary input (we use JSON); the decider can then schedule activities to run. On completion, the activities report back to the decider with their result and the next stage can be initiated.

This service works by polling queues that live inside Amazon. You can have multiple instances of activities, deciders and entire workflows. It’s really quite nice, until you start having to do complicated things with it.

Each “workflow” is a separate instance and has its own ID. You can kick off multiple workflows of the same type with different input, and they can be processed in parallel depending on how many deciders and activity workers you have.

SWF gives you the ability to “signal” a workflow with arbitrary data at an arbitrary time, not dissimilar to Unix signals. When this happens, it forces the decider to run and make a decision on the new data. However, if a decision task had already been started, you end up with a situation where two decisions are scheduled at the same time.

SWF ensures that only one decision per workflow happens at a time, but the signal will invalidate any decision made in the first run of the decider. This is not prominently documented. It appears on page 87 of the SWF developer guide as the following note:

Note: There are some cases where closing a workflow execution fails. For example, if a signal is received while the decider is closing the workflow execution, the close decision will fail. To handle this possibility, ensure that the decider continues polling for decision tasks. Also, ensure that the decider that receives the next decision task responds to the event—in this case, a signal—that prevented the execution from closing.

This caused a bug in part of our code that took quite some time to track down and fix. I paid a price for the abstraction I chose to use, because I didn’t fully understand it.

Entitlement

A common response to these types of situations is: “Why does it work that way? That makes no sense. It should work this way instead.” People get angry that something isn’t designed exactly for their use case when, in reality, they are at fault for using something they didn’t understand.

Of course, taking the time to understand everything you use every day would be a herculean, perhaps impossible, task. Instead, be patient with the abstractions you use and devote time to getting to know them intimately. Don’t get upset if they do something you don’t expect; it just means you haven’t paid them the attention they deserve.

We’re Manipulating an Array of Bytes

By far the biggest realisation I had with computers was to understand that everything we do is the manipulation of a large array of bytes*. All of the cutesy things we put on top of that are just ways of making this manipulation map closer to our thoughts and ideas.

We ascribe meaning to this array of bytes: text, data, stack, BSS. We divide it up into 4 KB pages for administrative purposes, as a layer of protection against processes accessing the memory of other processes, and as a neat way of swapping memory to secondary storage when we run out of primary storage.

Local variables are nothing more than offsets into the current stack frame. That’s why it makes sense that the value of an uninitialised variable could be anything. It depends on what the last function to occupy that part of the stack was.

However, none of this applies in the age of the virtual machine. When it comes to working in Ruby, Python, Java, or any other language that runs on top of its own virtualised environment, the rules change and I no longer know what’s happening. I know that the virtual machine must communicate with the physical one, so parts of my understanding still apply, but the rules for things like function calls and variable lookup are defined differently on a per-environment basis.

Example: Ruby instance variables

Lately I’ve been working on a project by Nikk Markwell called boot.rb, a simple x86 kernel that will eventually boot into a Ruby shell. It uses the mruby implementation of Ruby.

As a result of this work, I have had to dive very deep into the mruby internals. A few days ago I was spelunking around how instance variables are defined on classes. Check this out: iv_put here and here.

The iv_put routine is used for setting instance variables on an object. The code reveals two interesting things: setting an instance variable creates a Ruby “symbol”, and mruby has two ways of storing instance variables. One is a segmented list, which appears to operate in O(n) time (n being the number of instance variables already set) but saves memory; the other, the default, is a hash table.

A “symbol” in Ruby is an interned string. The property that matters here is that symbols are never garbage collected. Therefore, every time you create an instance variable with a new name, you lose a little more memory that will never be reclaimed.

I won’t critique this design decision, but it is an interesting property of instance variables that I sincerely doubt most of the Ruby community know about, and it could potentially bite if someone were doing some crazy metaprogramming in an embedded Ruby environment configured for low memory. Tenuous, I know, but these are the types of subtle scenarios that really draw blood when they bite.

Wrapping up

The taller the house of cards, the more scared I get, and the more painful the bugs can be. Abstractions are necessary to get things done in the kind of time frames that modern businesses expect, but they carry with them a cost. You have to be prepared to pay that cost when you run into a bug that sits at a lower-level abstraction than the one you’re operating in.

Understanding the computer that all abstractions depend on is a very valuable skill and it has helped me to understand some of the hardest bugs I’ve ever run into. It’s not knowledge that comes in handy often, but when it does it’s the sort of knowledge that can turn a multi-day debugging session into only a few minutes.

* This is itself a very complex abstraction, but that’s far beyond the scope of my current understanding of computers, to the point where this entire article could be considered hypocritical.
