In this post I want to walk through what a stack is in terms of the processor, what function calls are, why your stack can overflow or get too deep and how all of this relates to tail-call optimisation.
Prerequisite Knowledge: It would be useful if you’ve run into a stack overflow error previously. If you haven’t, congratulations, but it’ll happen at some point, I promise, and when it does it’ll be nice to understand what causes it.
Other than that, I’m hoping this post is accessible to anybody with reasonable knowledge of pointers and one programming language (C is ideal) under their belt, but useful for even the seasoned programmer.
Program Memory Layout
To understand the stack, and the rest of this post, we must first have a good idea of how a program is laid out in memory. When the binary image of a program is initially laid out into memory, it is done in a very specific way that looks something like this:
high memory +----------------+
            | Command line   |
            | args and env   |
            | variables      |
            +----------------+
            | Stack          |
            | (Grows down)   |
            +----------------+
            |       |        |
            |       V        |
            |                |
            |                |
            |       ^        |
            |       |        |
            +----------------+
            | Heap           |
            | (Grows up)     |
            +----------------+
            | Uninitialised  |
            | data (BSS)     |
            +----------------+
            | Initialised    |
            | data           |
            +----------------+
            | Text           |
            |                |
 low memory +----------------+
This diagram is not to scale, but that’s how things get laid out. For the unfamiliar, this explanation is tied very heavily to the ELF binary format. Here’s a small refresher on what each of the segments means:
- Uninitialised data (BSS segment) is where global variables that are declared but not initialised get put. When we say “not initialised” here, we mean that they are not statically initialised at compile time. For example, if you define a 500 element int array in the top level namespace, you won’t get a 2000 byte empty array (assuming 4 byte ints) sitting in your binary file. That’s wasteful. You’ll get a declaration in your binary that tells the loader how much space to reserve as soon as the program is loaded into memory.
- Initialised data is for all of the stuff that is statically initialised at compile time. Any global variables that are given a value where they are defined will go in here.
- Text. This is a slightly misleading name for the code segment of your binary. This is where the meat and potatoes, the actual machine code that runs, can be found.
- Heap. This is a dynamically resizeable area of memory for doing dynamic, unmanaged memory allocation. Every call to malloc() will give you back a pointer that points to somewhere inside this region.
- Stack. This is what we’re interested in. This is where local, temporary variables go and it also dynamically resizes by way of a “stack pointer”, the %esp register on x86 machines.
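To poke at these segments from C, here is a minimal sketch. The variable and function names are made up for illustration, and the addresses it prints (and even their relative order) depend on your platform, compiler and loader, so treat the output as something to explore rather than something to rely on.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int initialised_counter = 42;  /* has an initialiser: initialised data */
int uninitialised_table[500];  /* no initialiser: BSS, zero-filled when loaded */

/* Prints one address from each of the regions described above. */
void print_segment_addresses(void)
{
    int local = 0;                                 /* lives in this stack frame */
    int *heap_value = malloc(sizeof *heap_value);  /* lives on the heap */

    /* The function pointer goes through uintptr_t because ISO C does not
       define a direct conversion from a function pointer to void *. */
    printf("text   %p\n", (void *)(uintptr_t)print_segment_addresses);
    printf("data   %p\n", (void *)&initialised_counter);
    printf("bss    %p\n", (void *)uninitialised_table);
    printf("heap   %p\n", (void *)heap_value);
    printf("stack  %p\n", (void *)&local);

    free(heap_value);
}
```

Calling print_segment_addresses() from main on a typical Linux or Mac machine should show the stack address sitting far above the others, but that is an observation about common systems, not a guarantee.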
Tangent: Clearing up some niggling questions
So it’s all well and good saying that this is how our program is laid out in memory, and saying that part of it is at high memory and part of it is in low memory, but what does that even mean? Where is high memory? Where is low memory? Why doesn’t it start at the bottom? What happens when other processes get involved?
The answer to all of the above questions is made very difficult because it’s fallacious to think that memory is accessed in a direct manner. It isn’t. Memory, for all user programs, is accessed through a layer of abstraction called “virtual memory”. If you’re unfamiliar with virtual memory, I won’t go into great detail here, but Googling the topic can lead to a very enlightening afternoon well spent.
The gist of it is that your program never sees physical memory. It sees what the operating system shows it, which is this idealistic view of memory. Idealistic because the process is made to think that it is the only thing running in the system and it is potentially shown that it has access to an amount of memory that is larger than the physical memory available.
Why? Because having a uniform view of memory that does not depend on the physical memory available and believing your program has free rein over all of memory makes writing software about 8 million times easier. It also means that memory can be allocated in a safe way and you can control access to regions of memory.
A process sees only itself in memory, and the operating system deals with mapping virtual addresses into physical ones. This makes the layout of a program easy to manage when you are running multiple processes. The application programmer doesn’t have to worry because the operating system will take care of the mapping and your program will always think it’s in the same place.
Back to reality
Phew. Sorry about that. Where were we?
The stack! Of course. Now that we know roughly how your program gets laid out in memory, we can see that the stack starts at the top and grows down. We can also see that the heap starts low and grows up. So, in theory, won’t they eventually meet in the middle? Therein lies the problem.
Realistically, they probably won’t meet. That’s a bit difficult to keep track of. Instead, the stack has a limited size, determined at the start of the program / thread (yup, threads get their own stack) and dependent on a vast number of factors, that it cannot go over. If it does try to go over that, say if you try to put a huge array on the stack, your program will hit a stack overflow. In C this typically shows up as a segmentation fault rather than a tidy error message.
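If you’re curious what that limit is on your own machine, here is a small sketch that asks the operating system. It assumes a POSIX system (Linux, Mac and friends) where getrlimit and RLIMIT_STACK are available, and it reports the limit for the main thread only. The function names are mine.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Fetches the soft stack size limit. Returns 0 on success, -1 on failure. */
int stack_soft_limit(rlim_t *out)
{
    struct rlimit limit;

    if (getrlimit(RLIMIT_STACK, &limit) != 0)
        return -1;
    *out = limit.rlim_cur;
    return 0;
}

/* Prints the limit in bytes, or notes that no limit is set. */
void print_stack_limit(void)
{
    rlim_t limit;

    if (stack_soft_limit(&limit) != 0) {
        perror("getrlimit");
        return;
    }
    if (limit == RLIM_INFINITY)
        puts("stack limit: unlimited");
    else
        printf("stack limit: %llu bytes\n", (unsigned long long)limit);
}
```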
Basic stack operations
Let’s take a look at what some stack operations look like and how they affect the stack. I also want to stop referring to “the stack” as some abstract thing and show you exactly how it looks at each stage of these operations.
At the start of your program, the stack will contain the following things:
bottom of stack +----------------+
                | null value     |
                +----------------+
                | *envp[n]       |
                +----------------+
                | *envp[...]     |
                +----------------+
                | *envp          |
                +----------------+
                | null value     |
                +----------------+
                | *argv[argc-1]  |
                +----------------+
                | *argv[...]     |
                +----------------+
                | *argv          | <--- Pointer to name of program
                +----------------+
                | argc           | <--- %esp
   top of stack +----------------+
When the execve system call is made, the program loader and operating system program initialisation routines will make sure that we have the above values on the stack, in that order.
- argc is the number of elements in the **argv array
- **argv is an array containing pointers to null terminated strings. The 0th element is always the name of your program, the rest are parameters passed in. It’s where your command line parameters end up.
- **envp is an array containing pointers to null terminated strings. All of the environment variables inherited by your program from its parent live in here.
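Both arrays share the same shape: a run of string pointers finished off by a null pointer. Here is a minimal sketch of walking that shape from C. The helper names are hypothetical, and the three-argument form of main mentioned afterwards is a common Unix extension rather than something standard C promises.

```c
#include <stddef.h>
#include <stdio.h>

/* Counts the entries in a NULL-terminated array of strings (argv or envp style). */
size_t count_entries(char *const *vector)
{
    size_t count = 0;

    while (vector[count] != NULL)
        count++;
    return count;
}

/* Prints every entry alongside its index. */
void print_entries(const char *label, char *const *vector)
{
    for (size_t i = 0; vector[i] != NULL; i++)
        printf("%s[%zu] = %s\n", label, i, vector[i]);
}
```

Inside an `int main(int argc, char **argv, char **envp)`, count_entries(argv) should come out equal to argc, and print_entries("envp", envp) lists the inherited environment.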
Pushing onto the stack
If we want to add something to the stack, x86 offers the push instruction, for example push dword 0xff. This will push the value 0xff onto the stack. We declare this value as a dword, which means it will be 4 bytes in size. The stack now looks like this:
bottom of stack +----------------+
                | **envp         | <-- May contain many items
                +----------------+
                | **argv         | <-- May contain many items
                +----------------+
                | argc           |
                +----------------+
                | 0x000000ff     | <--- %esp
   top of stack +----------------+
Notice that the
%esp register has moved to now point at the new top of the
stack. This is part of what x86’s
push is doing for us behind the scenes (yes,
there is still a lot of abstraction at the assembler level!).
I say “the %esp register has moved”, which is completely wrong. What I mean is
that the value inside the
%esp register has changed to be 4 less than what
it was previously. Subtracting from the stack pointer advances it forward
(remember, starts at the top and grows down).
Popping off the stack
In a similar fashion, x86 offers the pop instruction to remove things from the stack, for example pop eax. This will pop the value at the top of the stack into the register %eax. The new stack looks like this:
bottom of stack +----------------+
                | **envp         | <-- May contain many items
                +----------------+
                | **argv         | <-- May contain many items
                +----------------+
                | argc           | <--- %esp
   top of stack +----------------+
%eax now contains 0xff.
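To make the pointer arithmetic concrete, here is a toy model of push and pop in C. It is a simulation under simplifying assumptions, not how the hardware is implemented: the “memory” is a small byte array, the stack pointer is an offset into it, every value is 4 bytes, and there is no bounds checking. All names are made up.

```c
#include <stdint.h>
#include <string.h>

#define MINI_STACK_BYTES 64u

/* A block of memory plus a stack pointer that indexes into it. */
struct mini_stack {
    uint8_t  memory[MINI_STACK_BYTES];
    uint32_t esp;  /* byte offset of the current top of stack */
};

void mini_init(struct mini_stack *stack)
{
    memset(stack->memory, 0, sizeof stack->memory);
    stack->esp = MINI_STACK_BYTES;  /* empty: esp starts at the highest address */
}

/* push dword: move esp down by 4, then store the value there. */
void mini_push(struct mini_stack *stack, uint32_t value)
{
    stack->esp -= 4;
    memcpy(&stack->memory[stack->esp], &value, sizeof value);
}

/* pop: read the value at esp, then move esp back up by 4. */
uint32_t mini_pop(struct mini_stack *stack)
{
    uint32_t value;

    memcpy(&value, &stack->memory[stack->esp], sizeof value);
    stack->esp += 4;
    return value;
}
```

Note that mini_pop never erases anything. The old value is still sitting in memory below the stack pointer until a later push writes over it.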
Function calls and calling conventions
When you call a function in C, there is a convention followed to ensure binary compatibility between libraries. The convention is defined by the System V Application Binary Interface and it can differ between processor architectures. I’m going to focus on x86, 32bit, SVR4 (the spec for which can be found here, relevant calling convention information starts on page 35).
If you don’t want to read the spec, and I totally understand why you wouldn’t, here’s a brief run down of the very basics.
If you want your assembly code to interoperate with C code, you must push all arguments to a function onto the stack in reverse order (right to left), then call the function. Simple. As a contrived example, calling a hypothetical two argument function f(a, b) means pushing b, then pushing a, then executing call f.
That’s the general gist of it. Push things onto the stack in reverse order and call the function. Here’s a more realistic example, though, because it’s not necessarily that simple in all cases:
[Listing not preserved: a NASM program whose _main calls _printf, including a stack alignment adjustment before the call.]
If you want to run this and you’re not on a Mac, you may have to fiddle around
with the following instructions so that it works on your platform. You’ll also
likely need to remove the
_ from the front of
_printf, that’s apparently a
Mac thing. Assembling and linking required a few flags:
The nasm command required the filetype to be specified, which is Mach-O i386. The ld command required the OS X minimum version to be specified, and it also needed us to link in libc and crt1 (if you’re unfamiliar with what crt1 is, I talked about it in my previous blog post).
Once that’s done, we will have a file called
a.out and when we run it:
Stack alignment? What?
You’ll notice in the
printf asm snippet that one of the lines mentions “stack
alignment”, and is seemingly doing an arbitrary subtraction from the stack
pointer. Why is this?
When the operating system drops us into our _main function, the stack pointer will be evenly divisible by 16. Why is this? I don’t know the exact details, it’s an area of processors I’m yet to dive into, but this Stack Overflow question seems to suggest that it is something to do with “SIMD” instructions. They’re a type of instruction that does the same operation on multiple pieces of data in parallel, and apparently require the stack to be aligned on 16 byte boundaries.
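The “arbitrary subtraction” is just arithmetic. As a sketch, assume the requirement is that the stack pointer be a multiple of 16 at the moment the call instruction executes, and that every argument is 4 bytes. The helper below (a made-up name) works out how many extra bytes to subtract before pushing the arguments.

```c
#include <stdint.h>

/* Extra bytes to subtract from esp before pushing arg_bytes worth of
   arguments, so that esp is a multiple of 16 once they have been pushed. */
uint32_t alignment_padding(uint32_t esp, uint32_t arg_bytes)
{
    return (esp - arg_bytes) % 16u;
}
```

For example, with the stack pointer at 0x1000 and two 4 byte arguments to push, the padding is 8: subtracting 8 and then pushing 8 bytes leaves the stack pointer at 0xff0, which divides evenly by 16.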
If we’re following the above calling convention properly, every time we call a function we’ll be creating a “stack frame”. This stack frame will contain the following:
+----------------+
| Arg N          | <-- ebp + 4n + 8
+----------------+
| ...            |
+----------------+
| Arg 0          | <-- ebp + 8
+----------------+
| Return addr    | <-- ebp + 4
+----------------+
| Previous ebp   | <-- ebp
+----------------+
| Func locals    | <-- ebp - 4
+----------------+
| ...            | <-- variable length section for your local vars
+----------------+
| ...            | <-- esp
+----------------+
The args in the above diagram are technically inside the previous stack frame. The %ebp register is referred to as the “frame pointer”. When you write your own assembler functions you will have a section at the start of every function called the “preamble” (more commonly known as the prologue), and then clean up at the end. In outline it is push ebp followed by mov ebp, esp at the start, then the body of the function, then pop ebp followed by ret at the end.

We’re pushing the current frame pointer onto the stack, then using the stack pointer as the new frame pointer. Then, if we wished, we could use the stack however we wanted, as long as we clean it up when we’re done. Then at the end, we pop the previous frame pointer back out and return.
The call instruction pushes the return address (the address of the instruction after the call) onto the stack, and this is what ret uses to know where to return to. It’s very important that you correctly clean up your stack, otherwise you will be returning into who-knows-where and you’ll likely segfault.
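The whole dance of call, frame setup, frame teardown and ret can be modelled in C. This is a toy simulation with hypothetical names, not real machine behaviour: memory is a small byte array, registers are plain struct fields, addresses are offsets, and nothing is bounds checked.

```c
#include <stdint.h>
#include <string.h>

#define TOY_STACK_BYTES 128u

/* A toy 32-bit machine: some memory and the three registers that matter here. */
struct toy_cpu {
    uint8_t  memory[TOY_STACK_BYTES];
    uint32_t esp;  /* stack pointer, as a byte offset into memory */
    uint32_t ebp;  /* frame pointer */
    uint32_t eip;  /* pretend instruction pointer */
};

void toy_reset(struct toy_cpu *cpu)
{
    memset(cpu, 0, sizeof *cpu);
    cpu->esp = TOY_STACK_BYTES;  /* an empty stack starts at the top */
}

void toy_push(struct toy_cpu *cpu, uint32_t value)
{
    cpu->esp -= 4;
    memcpy(&cpu->memory[cpu->esp], &value, sizeof value);
}

uint32_t toy_pop(struct toy_cpu *cpu)
{
    uint32_t value;

    memcpy(&value, &cpu->memory[cpu->esp], sizeof value);
    cpu->esp += 4;
    return value;
}

/* call: push the address to come back to, then jump to the target. */
void toy_call(struct toy_cpu *cpu, uint32_t target, uint32_t return_address)
{
    toy_push(cpu, return_address);
    cpu->eip = target;
}

/* push ebp / mov ebp, esp */
void toy_enter_frame(struct toy_cpu *cpu)
{
    toy_push(cpu, cpu->ebp);
    cpu->ebp = cpu->esp;
}

/* mov esp, ebp / pop ebp: drop any locals, restore the caller's frame pointer. */
void toy_leave_frame(struct toy_cpu *cpu)
{
    cpu->esp = cpu->ebp;
    cpu->ebp = toy_pop(cpu);
}

/* ret: pop the return address and jump to it. */
void toy_ret(struct toy_cpu *cpu)
{
    cpu->eip = toy_pop(cpu);
}

/* Argument i of the current frame sits at ebp + 8 + 4 * i. */
uint32_t toy_arg(const struct toy_cpu *cpu, uint32_t i)
{
    uint32_t value;

    memcpy(&value, &cpu->memory[cpu->ebp + 8 + 4 * i], sizeof value);
    return value;
}
```

To simulate a two argument call you would toy_push the second argument, toy_push the first, toy_call, then toy_enter_frame. After toy_leave_frame and toy_ret the two arguments are still on the stack, which is why the caller is the one who tidies them away in this convention.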
Stack local variables
If you’re familiar with C or more or less any other modern programming language,
you’ll be aware of “variable scope”. It’s the idea that variables defined inside
a function can only be referenced inside that function and are deallocated when
you leave. This is why it’s always a bad idea to return a pointer to something
that was defined on the stack, and why you should use
malloc instead (because
malloc returns a pointer that doesn’t point into the stack).
Based on the above description, it should start to be clear why this is. When we leave a function, we release its whole part of the stack behind us. So the stack size decreases. When we go into another function, we grow the stack back over the section we’ve just released! Nothing really cleans that area up, so our previous frame was not zeroed. This is why local variables that are declared but not initialised are not guaranteed to be 0.
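Here is that advice as a minimal sketch, with made-up function names. The broken version is left commented out on purpose, because using the pointer it returns is undefined behaviour.

```c
#include <stdlib.h>

/* BROKEN, do not use: the returned pointer would refer to a slot in a stack
   frame that is released as soon as the function returns.

   int *make_counter_on_stack(int start)
   {
       int counter = start;
       return &counter;
   }
*/

/* The heap allocation outlives the call. The caller must free() it. */
int *make_counter_on_heap(int start)
{
    int *counter = malloc(sizeof *counter);

    if (counter != NULL)
        *counter = start;
    return counter;
}
```

The price of the heap version is that somebody now has to remember to call free on the result, and to check for NULL in case the allocation failed.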
Right! We know about the stack, we know about stack frames, so what about recursion? Recursion is the act of calling a function inside itself. The quintessential example is a naive Fibonacci sequence generator:
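A minimal sketch of such a function in C, written to match the steps traced later in this post (check n < 2, otherwise return fib(n - 1) + fib(n - 2)). The types and the choice of fib(0) = 0 are my assumptions.

```c
/* Naive recursive Fibonacci, with fib(0) = 0 and fib(1) = 1. */
unsigned long fib(unsigned int n)
{
    if (n < 2)
        return n;
    return fib(n - 1) + fib(n - 2);
}
```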
So inside of
fib we’re calling
fib, giving us recursion. Consider what this
means in terms of stack frames. What’s going to happen when this gets called?
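For the sake of example, let’s say you call fib(100000). You’re just going to spend all of your time allocating stack frames until you run out of stack space and get a stack overflow error. Interestingly, because the order of growth of the above algorithm is O(2^N), the full computation would need on the order of 2^100000 calls, each with its own stack frame. Only about 100,000 of those frames are ever live at the same moment, because a frame is released when its call returns, but that depth may already be more than your stack can hold. Either way, that’s… a lot of stack frames. Good luck.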
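You can watch the frames pile up for small inputs with an instrumented version. This is a sketch with made-up counter names. It separates two numbers that are easy to mix up: how many frames are created over the whole run, and how many are live at the deepest point.

```c
unsigned long total_calls;    /* frames created over the whole run */
unsigned int  current_depth;  /* frames live right now */
unsigned int  deepest;        /* most frames live at any one moment */

/* Same recursion as the naive fib, plus bookkeeping on entry and exit. */
unsigned long fib_counted(unsigned int n)
{
    unsigned long result;

    total_calls++;
    current_depth++;
    if (current_depth > deepest)
        deepest = current_depth;

    if (n < 2)
        result = n;
    else
        result = fib_counted(n - 1) + fib_counted(n - 2);

    current_depth--;  /* this frame is about to be released */
    return result;
}
```

For fib_counted(10), starting with the counters at zero, the result is 55 from 177 calls in total, while the deepest chain of live frames is only 10. The total explodes as n grows, and the depth grows in step with n.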
Tail Call Optimisation
The fib example above isn’t a possible candidate for tail call optimisation. To be eligible for TCO, a function must call itself as the very last thing it does. The above function is a bit deceptive because it looks like it’s calling itself as the last thing, but it isn’t. Think about what the processor would have to do to service that request.
- fib called.
- Check n < 2. Test fails.
- Return fib(n - 1) + fib(n - 2)
  - Calculate n - 1.
  - Call fib(n - 1).
  - Calculate n - 2.
  - Call fib(n - 2).
  - Add the two results together.
  - Return.
So the last thing that’s done is the addition, thus TCO cannot apply.
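Spelling the return statement out one step per line makes this easier to see. This is a sketch: the temporaries are mine, and a compiler is free to order and store things differently.

```c
/* The naive Fibonacci again, with each step of the return on its own line. */
unsigned long fib_steps(unsigned int n)
{
    if (n < 2)
        return n;

    unsigned long first  = fib_steps(n - 1);  /* call, then come back here */
    unsigned long second = fib_steps(n - 2);  /* call, then come back here */
    unsigned long sum    = first + second;    /* the real last piece of work */
    return sum;
}
```

Both recursive calls have to return into this frame so the addition can happen, which means the frame has to stay alive across them.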
Let’s consider a simpler, contrived example. Something that is eligible for TCO.
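A minimal sketch of such a function in C. The name and the unsigned type are my choices (unsigned so that it can’t be handed a negative number and recurse forever).

```c
/* Counts down to 0. The recursive call is the very last thing that happens:
   its result is returned untouched. */
unsigned int count_down(unsigned int n)
{
    if (n == 0)
        return 0;
    return count_down(n - 1);
}
```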
So this function counts down from any number until it gets to 0. In this example, the self referential call is the last thing to run, so we can do something very clever.
Consider for a second what’s about to happen. We push our new argument onto the stack, we push a new return address onto the stack and then we jump to the same code we just executed. Why bother? We aren’t going to use the previous stack frame ever again apart from the return address back to the original caller. If we recurse, we’re just going to end up with a massive daisy chain of jumps to return addresses that will be very, very inefficient to execute. Why don’t we just overwrite the existing stack frame and leave the return address intact?
And that, ladies and gents, is TCO in a nutshell. You just overwrite the existing stack and leave the return address as is, then jump into the start of the function again. Dead simple.
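Written out by hand in C, the effect is that the recursion turns into a loop. This is a sketch of the idea, not compiler output. C compilers are not required to do this for you, although GCC and Clang will typically do it with optimisations turned on.

```c
/* A countdown with the tail call replaced by hand: overwrite the argument
   in place and jump back to the top instead of calling. */
unsigned int count_down_loop(unsigned int n)
{
    for (;;) {
        if (n == 0)
            return 0;
        n = n - 1;  /* reuse the same slot for the "new" argument */
    }
}
```

However large n is, this version only ever uses the one stack frame it started with.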
As always, there are likely lots of omissions and simplifications that hide some of the complexities of what I’m explaining. Specifically, I think there are a lot of complications I’m not aware of when it comes to TCO. I have a 10,000 foot view of what TCO is and how it is implemented at the assembler level, but when it comes to how the compiler spots it I really don’t have a clue yet :)
Hope you enjoyed the post! I’m thinking the next post might be about context switching.
- Corrected complexity of Fibonacci function. Thanks @AaronKalair!