Poking Around Inside Ruby

I’ve been wanting to start exploring the MRI Ruby source code for quite some time now, but didn’t quite know where to start. I wanted to devise a way of figuring out what parts of the Ruby source code were important, so if I were to run a “Hello, world!” program I could see exactly what the Ruby VM had to do to execute that.

Prerequisite Knowledge: It would help if you’ve used GDB before, but it should be self explanatory if you haven’t. If you’ve never used Ruby this post might give you tips on how to explore language code bases, but other than that it’ll probably be uninteresting. It’ll also help you if you have a vague idea of what a “symbol table” is in a binary file.

This is going to be quite meandering. I’m basically just going to be picking interesting things in the differences between certain running programs and looking at the C source to figure out what’s happening. If that sounds like your cup of tea, let’s do it! :)

What I was looking to find out was the path of execution Ruby followed when interpreting and running a simple puts "Hello, world!". I wanted a list of C functions, in the order they were called, from start to finish.

ltrace, dtrace, strace

The suite of *trace tools didn’t quite fit my purposes. ltrace only traces calls into dynamic libraries (through some really clever symbol stubbing, might look more into how that works and write a post on it at some point). dtrace probably would be able to do what I want but I didn’t have the patience to learn it inside out (plus I wanted to check the difference between Ruby 1.9 and Ruby 2.0, which I wouldn’t be able to do without modifying the 1.9 source code). strace only traces system calls and, while that’s useful, I’m specifically interested in the C calls from inside the Ruby code base.

gdb!

After a fair bit of Googling, I came across a blog post that explained how to do exactly what I wanted using a really simple gdb script. This is an example of what such a script would look like:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
break some_function
command
silent
backtrace 1
continue
end

break some_other_function
command
silent
backtrace 1
continue
end

run

The idea here is to set breakpoints on functions that you’re interested in and print out the latest item in the backtrace whenever the breakpoint is hit, then just continue. Easy peasy.

The slight complication is that I’m interested in all of the C functions I can get my hands on. We can check how many functions Ruby has by using the nm tool under Linux:

$ nm `which ruby`
nm: /home/sam/.rbenv/shims/ruby: File format not recognized

Ahh yes. Balls. I’m using rbenv to manage multiple Ruby installs on one machine, so the ruby in my path isn’t actually a compiled executable file, it’s a bash script that sets up some environment variables and then calls some other executable to find which version of Ruby it needs to hit.

Fortunately, the rbenv developers are good people and offer an rbenv which command that finds the real executable for us:

$ nm `rbenv which ruby`
Lots and lots of symbols...

Let’s apply some filtering. Each line that nm outputs has the following format:

<memory address> <symbol type> <symbol name>

We’re only interested in functions, which according to the nm man pages is a symbol type of “t”. So let’s try and get a better count:

$ nm `rbenv which ruby` | grep " t " | wc -l
4546

That’s using Ruby 2.0.0-p247. There’s quite a large difference between 2.x and 1.9.x, check this out:

$ rbenv local 1.9.3-p448
$ nm `rbenv which ruby` | grep " t " | wc -l
3825

Not sure what conclusions we can draw from that, but it’s interesting nonetheless.

Auto generating the gdb script

At this point I’d hope nobody would have any intention of hand-crafting thousands of lines of gdb script, so let’s hack together something that generates it for us:

gdbscript.rb
1
2
3
4
5
6
7
8
9
10
11
12
13
ruby_binary = ARGV.shift

`nm #{ruby_binary} | grep " t " | cut -d " " -f 3`.each_line do |symbol|
  puts <<-EOF
break #{symbol.chomp}
commands
silent
backtrace 1
continue
end

  EOF
end

Who needs error checking? This beauty will output the necessary script block for each and every function inside the Ruby executable.

Gluing it all together

First, we have to generate our gdb script. We’ll do that separately:

$ ruby gdbscript.rb `rbenv which ruby` > script.gdb

Next, we need to invoke gdb with our script and the necessary Ruby program. I did it as a one-liner and wrapped it around the time program to see how long it would take. It ended up looking like this:

$ time gdb --batch -x script.gdb --args `rbenv which ruby` -e "puts 'hello'"

Now, I had no idea how long this would take. I certainly didn’t think it would take this long for 1.9.3:

real 339m14.492s
user 196m39.789s
sys  147m3.939s

Unfortunately, I didn’t make a note of how long 2.0 took (I fell asleep). It took all night and a lot of the next day, though. Easily over twice as long as 1.9.3, which completely dashes all hopes of this being a remotely useful way of figuring out what Ruby is doing under the hood. However, we’ve collected some potentially fun data, so let’s soldier on.

Analysing the output

I ran this 4 times. 2 different Ruby programs over both of the Ruby versions, 2.0 and 1.9.3. The two Ruby programs were the simple puts 'hello' and, for comparison purposes, exit 42. I saved all of the output to files, check out how big they are (keep in mind this is one function call per line):

$ wc -l *

     697049 1.9.3_exit42.out
     696583 1.9.3_hello.out
    1991463 2.0.0_exit42.out
    1990666 2.0.0_hello.out

Firstly, it’s a lot of lines regardless. Secondly, look at the difference between Ruby 1.9.3 and Ruby 2.0.0! There’s almost 3 times more lines for Ruby 2.0. You generally find that Ruby 2.0 starts up a lot faster than Ruby 1.9, though… I wonder what’s going on.

So what does the output look like?

Here’s some example output:

$ head -n 20 rtrace-1.9.3-exit42.out
#0  0x00007ffff7bc5990 in frame_dummy () from /usr/lib/libpthread.so.0
#0  0x00007ffff7bc5910 in register_tm_clones () from /usr/lib/libpthread.so.0
#0  0x0000000000414d20 in frame_dummy ()
#0  0x0000000000414cc0 in register_tm_clones ()
#0  main (argc=3, argv=0x7fffffffe678) at main.c:31
#0  fill_standard_fds () at ruby.c:1842
#0  0x0000000000589100 in fstat ()
#0  0x0000000000589100 in fstat ()
#0  0x0000000000589100 in fstat ()
#0  get_stack (addr=0x7fffffffe548, size=0x7fffffffe550) at thread_pthread.c:487
#0  get_stack (addr=0x7fffffffe448, size=0x7fffffffe450) at thread_pthread.c:487
#0  Init_BareVM () at vm.c:2224
#0  vm_init2 (vm=0x846f20) at vm.c:1670
#0  rb_objspace_alloc () at gc.c:432
#0  Init_native_thread () at thread_pthread.c:436
#0  native_thread_init (th=0x847520) at thread_pthread.c:450
#0  native_cond_initialize (cond=0x8475b0, flags=1) at thread_pthread.c:242
#0  ruby_thread_set_native (th=0x847520) at thread_pthread.c:428
#0  native_mutex_initialize (lock=0x83e6c0 <signal_thread_list_lock>) at thread_pthread.c:219
#0  mutex_debug (msg=0x5b33f4 "init", lock=0x83e6c0 <signal_thread_list_lock>) at thread_pthread.c:178

The first few bits have memory addresses rather than symbol names and I’m not 100% sure why that is. You don’t see it in the rest of the file, though, so I assume it’s some voodoo that’s done during the bootstrapping of the process. Explanations welcome!

Working with the data

Let’s do something obvious to begin with: which C functions get called the most:

$ cat rtrace-1.9.3-exit42.out | grep "#0  " | cut -d " " -f 3 | sort | uniq -c | sort -nr | head
43031 parser_nextc
22583 us_ascii_mbc_enc_len
22529 parser_tokadd
21272 vm_malloc_fixup
21197 vm_malloc_prepare
20378 insn_len
19841 vm_xmalloc
16910 rb_enc_dummy_p
15822 parser_tokadd_mbchar
14751 compile_data_alloc

The file name should be self explanatory. I’m grepping for lines beginning with “#0 ” because those are the backtrace lines (there are a few lines in there that aren’t backtrace lines, when threads die and whatnot). Then I’m grabbing the 3rd space separated field, sorting it, counting it, sorting it again and grabbing the first 10.

No real surprises… Parsing/lexing/memory/encoding stuff. Let’s compare it against Ruby 2.0:

$ cat rtrace-2.0.0-exit42.out | grep "#0  " | cut -d " " -f 3 | sort | uniq -c | sort -nr | head
122836 parser_nextc
72578 markable_object_p
67072 GET_THREAD
64400 parser_tokadd
62980 mbc_enc_len
61260 gc_mark_ptr
56964 gc_mark
52222 insn_len
43518 parser_tokadd_mbchar
37379 compile_data_alloc

Mooooostly the same… Some differences, though. I compiled both versions of Ruby specially to include maximum debugging information, which is why we’re seeing MACRO-style calls in here too. Let’s dig around some of the more interesting looking differences.

Spelunking in a C code base

If you’ve not really rummaged around inside a C code base before, it seems really, really scary. There are tools that can help, though!

I’m a vim user, and no self-respecting vim user that does any C code leaves the warm confines of their bed without having “cscope” installed. I use this plugin, which makes some really nice mappings into useful cscope commands. For example, if I want to find the definition of a symbol (which is what I do most often when exploring a new code base), I can type <leader>fg and it will open a split window with potential definitions of that symbol.

It does other really cool stuff, too, like finding functions that call a specific function, or finding files that #include the current file and so on. Typing :help cscope will tell you way more than I can.

If you’re not using vim I can’t really help you a whole lot, sorry. The Emacs wiki has a page on cscope and emacs which may help. There’s also this sublime text plugin that seems to do the trick.

To GET_THREAD or not to GET_THREAD

Interestingly, GET_THREAD is never called in the 1.9 examples. It did get changed a fair bit between 1.9 and 2.0, though, look at the 1.9 version:

1
#define GET_THREAD() ruby_current_thread

2.0 version:

1
2
3
4
5
6
7
8
9
10
11
12
static inline rb_thread_t *
GET_THREAD(void)
{
    rb_thread_t *th = ruby_current_thread;
#if OPT_CALL_CFUNC_WITHOUT_FRAME
    if (UNLIKELY(th->passed_ci != 0)) {
    void vm_call_cfunc_push_frame(rb_thread_t *th);
    vm_call_cfunc_push_frame(th);
    }
#endif
    return th;
}

Without the OPT_CALL_CFUNC_WITHOUT_FRAME symbol defined, because the Ruby 2.0 version gets inlined, the two version of this do exactly the same thing.

From what I can tell from the patch notes, this is part of an optimisation that is disabled by default that delays the creation of a “frame” (stack frame? not sure) for C function calls from Ruby. I tried to dig into it but my unfamiliarity with the Ruby code base is holding me back a bit.

I did check to make sure I had macro information compiled into my 1.9 version and it is there, there are other macros listed in my outputs, so this doesn’t seem like me fucking up… Bit weird that 2.0 makes such heavy use of GET_THREAD and 1.9 doesn’t. Anyone got any ideas? :)

UNLIKELY?

In the Ruby 2.0 GET_THREAD snippet from above there’s an intriguing looking macro called UNLIKELY. There’s a corresponding LIKELY macro as well, and they expand to the following in both 1.9 and 2.0:

1
2
#define LIKELY(x)   (__builtin_expect((x), 1))
#define UNLIKELY(x) (__builtin_expect((x), 0))

__builtin_expect is a GCC specific thing. The documentation says the following:

You may use `__builtin_expect` to provide the compiler with branch prediction information. In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect.

So, branch prediction information. I tried to write a small, demonstrative example of when using this builtin made a difference to the generated assembly code but every time the output was exactly the same regardless of whether I used likely or unlikely. According to this post you have to write something suitably non-trivial in order to make sure the compiler doesn’t just optimise your code without the need for the prediction data.

However, even in the blog post’s example code, compiling with and without the use of __builtin_expect only changes 1 line of the generated assembly.

Without __builtin_expect:

1
cmpl $0, %eax

With __builtin_expect

1
2
movslq %eax, %rax
cmpq $0, %rax

The only thing that change is doing is converting between a 32bit signed integer to a 64bit signed integer (because __builtin_expect returns a long). movslq is a command that moves a value between a 32bit register and a 64bit register, doing a sign extension if it needs to.

The above difference only appears with -O0 specified, any other optimisation level just returns assembler that is exactly the same. I used the following commands to generate the assembler I compared against:

$ gcc -O3 -S -DEXPECT_RESULT=1 builtin_expect_test.c -o expect.s
$ gcc -O3 -S -DDONT_EXPECT builtin_expect_test.c -o dont_expect.s
$ diff expect.s dont_expect.s

No difference. It didn’t even matter when changing what the expected result was supposed to be. Which sort of questions some of guy’s conclusions, but maybe there are differences between the versions of GCC we were using or perhaps I did something wrong? Either way, I can’t seem to find an example that clearly shows what __builtin_expect is doing with the information you’re giving it.

GC in 2.0 but not in 1.9

Waaay up the article, when we took a look at the 10 most called functions for each ruby version, you may notice that there’s nothing said about garbage collection for 1.9, but the 2.0 list has 3 GC related functions in it! Why is this?

According to a slide deck by Joe Damato, the Ruby garbage collector runs when Ruby has run out of space on its own heap to allocate objects. Seeing as we’re not really doing any object allocations (unless using literals counts (it probably does)), my guess is that loading the standard library is where the problem lies.

Has the standard library significantly increased in size? I’m using David Wheeler’s sloccount tool to find out. Results:

1.9:

$ sloccount lib/
ruby:         80731 (100.00%)

2.0:

$ sloccount lib/
ruby:        104673 (100.00%)

That’s a fairly large difference, maybe it would account for the extra GC calls? It’s really hard to know for sure. I’m not sure if it would account for so many more lines generated for the Ruby 2.0 runs over the 1.9 runs…

To be completely fair for 1.9, it does do GC stuff, it’s just not in the top 10. If we look at the top 20 functions called for 1.9:

$ cat rtrace-1.9.3-exit42.out | grep "#0  " | cut -d " " -f 3 | sort | uniq -c | sort -nr | head -n 20
43031 parser_nextc
22583 us_ascii_mbc_enc_len
22529 parser_tokadd
21272 vm_malloc_fixup
21197 vm_malloc_prepare
20378 insn_len
19841 vm_xmalloc
16910 rb_enc_dummy_p
15822 parser_tokadd_mbchar
14751 compile_data_alloc
13382 gc_mark
13009 enc_set_index
11493 verify_list
9784 mbc_enc_len
9473 search_nonascii
8563 ADD_ELEM
8133 check_int
7597 coderange_scan
7592 ruby_sip_hash24
7525 yylex

We can see gc_mark in there, but not a whole lot else. Without digging into the inner workings of Ruby’s GC and the differences between 1.9 and 2.0 (which is quite a lot, I hear) I won’t be able to know why this is the case. If anyone feels like giving an explanation or linking to something that explains this, I would be happy to read it :)

Wrapping up

The data is really interesting and there’s probably a tonne more stuff you could do with it. If anyone wants a copy of the data, I’ll happily try and get it to you. It’s a few hundred megabytes and this blog is in a GitHub repo and hosted on Heroku, so I’ll avoid committing it directly unless I absolutely have to!

So yeah. This is part of what my spare time has been spent doing recently, woo! It’s a lot of fun, I totally recommend just dicking around with this kind of stuff. I always find I run into something interesting in one way or another.

Comments