Pekka's Blog

A blog about programming topics, mostly JVM, Linux kernel, and x86 related things.

Saturday, July 24, 2010

I'm moving over to Posterous

This blog has been moved over to Posterous. The RSS URL still remains the same.

Tuesday, July 20, 2010

Linux kernel OOPS debugging

Debugging the Linux kernel can seem like black magic for the uninitiated. It certainly felt like that when I first started out hacking on the kernel! What I found to be particularly amazing back then was how kernel hackers seemed to be able to debug and fix bugs that they weren't able to reproduce on their machine.

In this article, I'm going to walk through a real world Linux kernel bug report to show the different steps required to pin-point where in the kernel a problem happened.

[ Please skim through the report before you continue reading. ]

The first important detail in the bug report is the kernel version. We usually want to look at the exact same version of the kernel sources as the reporter, so let's use "git checkout":


$ git checkout v2.6.34


The reporter did not include his .config file, so unfortunately we can't build an equivalent kernel image on our machine (different versions of GCC change the generated code a bit as well). However, let's look at a cleaned-up version of the reported OOPS to see what we can deduce from it:


BUG: unable to handle kernel NULL pointer dereference at 0000000000000003
IP: [<ffffffff810aab89>] __kmalloc_track_caller+0x69/0x110
PGD 11e7e5067 PUD 11fd3d067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/virtual/block/md0/md/metadata_version
CPU 1
Modules linked in: aes_x86_64(+) aes_generic sg

Pid: 4652, comm: modprobe Not tainted 2.6.34-gentoo-r1 #1 MS-7368/MS-7368
RIP: 0010:[<ffffffff810aab89>] [<ffffffff810aab89>] __kmalloc_track_caller+0x69/0x110
RSP: 0018:ffff88011e75fe08 EFLAGS: 00010006
RAX: ffff880001b0f088 RBX: ffffffff8170d4d0 RCX: ffff88011e574b80
RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 00000000000002d0
RBP: 0000000000000296 R08: 0000000000000014 R09: ffff88011e574800
R10: 0000000000000001 R11: ffff880001a12008 R12: 00000000000000d0
R13: 0000000000000003 R14: ffffffff81064abb R15: ffffc90010729d68
FS: 00007f0a9acb8700(0000) GS:ffff880001b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000003 CR3: 000000011d03e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 4652, threadinfo ffff88011e75e000, task ffff88011d114150)
Stack:
0000000000000000 ffffc90010729c97 0000000000000008 ffff88011e574800
ffff88011e574aa0 ffffffff8108c27b ffffffffa0018920 ffffc900000000d0
ffffffffa0018920 ffffc90010728000 ffffc90010729d68 ffffffff81064abb
Call Trace:
[<ffffffff8108c27b>] ? kstrdup+0x3b/0x70
[<ffffffff81064abb>] ? load_module+0x13eb/0x1730
[<ffffffff81064e7b>] ? sys_init_module+0x7b/0x260
[<ffffffff810024ab>] ? system_call_fastpath+0x16/0x1b
Code: 23 25 dc 47 6f 00 41 f6 c4 10 75 66 9c 5d fa 65 48 8b 14 25 a8 d1 00 00 48 8b 03 48 8d 04 02 4c 8b 28 4d 85 ed 74 55 48 63 53 18 <49> 8b 54 15 00 48 89 10 55 9d 4d 85 ed 74 06 66 45 85 e4 78 22
RIP [<ffffffff810aab89>] __kmalloc_track_caller+0x69/0x110
RSP <ffff88011e75fe08>
CR2: 0000000000000003
---[ end trace 692101747f991cfb ]---


OK, so it's an invalid pointer access in __kmalloc_track_caller:


BUG: unable to handle kernel NULL pointer dereference at 0000000000000003
IP: [<ffffffff810aab89>] __kmalloc_track_caller+0x69/0x110


As we don't have a copy of the kernel image, let's use the scripts/decodecode script in the Linux kernel sources to decode the "Code" line of the OOPS, which contains a small machine code dump of the problem area:


echo "Code: 23 25 dc 47 6f 00 41 f6 c4 10 75 66 9c 5d fa 65 48 8b 14 25 a8 d1 00 00 48 8b 03 48 8d 04 02 4c 8b 28 4d 85 ed 74 55 48 63 53 18 <49> 8b 54 15 00 48 89 10 55 9d 4d 85 ed 74 06 66 45 85 e4 78 22" | ./scripts/decodecode

Code: 23 25 dc 47 6f 00 41 f6 c4 10 75 66 9c 5d fa 65 48 8b 14 25 a8 d1 00 00 48 8b 03 48 8d 04 02 4c 8b 28 4d 85 ed 74 55 48 63 53 18 <49> 8b 54 15 00 48 89 10 55 9d 4d 85 ed 74 06 66 45 85 e4 78 22
All code
========
0: 23 25 dc 47 6f 00 and 0x6f47dc(%rip),%esp # 0x6f47e2
6: 41 f6 c4 10 test $0x10,%r12b
a: 75 66 jne 0x72
c: 9c pushfq
d: 5d pop %rbp
e: fa cli
f: 65 48 8b 14 25 a8 d1 mov %gs:0xd1a8,%rdx
16: 00 00
18: 48 8b 03 mov (%rbx),%rax
1b: 48 8d 04 02 lea (%rdx,%rax,1),%rax
1f: 4c 8b 28 mov (%rax),%r13
22: 4d 85 ed test %r13,%r13
25: 74 55 je 0x7c
27: 48 63 53 18 movslq 0x18(%rbx),%rdx
2b:* 49 8b 54 15 00 mov 0x0(%r13,%rdx,1),%rdx <-- trapping instruction
30: 48 89 10 mov %rdx,(%rax)
33: 55 push %rbp
34: 9d popfq
35: 4d 85 ed test %r13,%r13
38: 74 06 je 0x40
3a: 66 45 85 e4 test %r12w,%r12w
3e: 78 22 js 0x62

Code starting with the faulting instruction
===========================================
0: 49 8b 54 15 00 mov 0x0(%r13,%rdx,1),%rdx
5: 48 89 10 mov %rdx,(%rax)
8: 55 push %rbp
9: 9d popfq
a: 4d 85 ed test %r13,%r13
d: 74 06 je 0x15
f: 66 45 85 e4 test %r12w,%r12w
13: 78 22 js 0x37


Looking at the trapping instruction:


2b:* 49 8b 54 15 00 mov 0x0(%r13,%rdx,1),%rdx <-- trapping instruction


and the register dump:


RAX: ffff880001b0f088 RBX: ffffffff8170d4d0 RCX: ffff88011e574b80
RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 00000000000002d0
RBP: 0000000000000296 R08: 0000000000000014 R09: ffff88011e574800
R10: 0000000000000001 R11: ffff880001a12008 R12: 00000000000000d0
R13: 0000000000000003 R14: ffffffff81064abb R15: ffffc90010729d68


We can see that R13 contains the invalid pointer which matches the reported dereference:


BUG: unable to handle kernel NULL pointer dereference at 0000000000000003


The next step is to locate the resolved function:


IP: [<ffffffff810aab89>] __kmalloc_track_caller+0x69/0x110


in kernel sources:


$ make tags
$ vim -t __kmalloc_track_caller


which reveals two functions: one in mm/slab.c and one in mm/slub.c. The bug report mentioned using CONFIG_SLUB=y, so we're interested in the one in mm/slub.c:


void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
{
        struct kmem_cache *s;
        void *ret;

        if (unlikely(size > SLUB_MAX_SIZE))
                return kmalloc_large(size, gfpflags);

        s = get_slab(size, gfpflags);

        if (unlikely(ZERO_OR_NULL_PTR(s)))
                return s;

        ret = slab_alloc(s, gfpflags, -1, caller);

        /* Honor the call site pointer we recieved. */
        trace_kmalloc(caller, ret, size, s->size, gfpflags);

        return ret;
}


Scanning the decoded machine code for some interesting instructions we find:


c: 9c pushfq
d: 5d pop %rbp
e: fa cli


which is generated code from local_irq_save() and


f: 65 48 8b 14 25 a8 d1 mov %gs:0xd1a8,%rdx


which is a per-CPU data access, from which we can also deduce that CONFIG_SMP=y. Reading __kmalloc_track_caller some more, we find slab_alloc():


static __always_inline void *slab_alloc(struct kmem_cache *s,
                gfp_t gfpflags, int node, unsigned long addr)
{
        void **object;
        struct kmem_cache_cpu *c;
        unsigned long flags;

        gfpflags &= gfp_allowed_mask;

        lockdep_trace_alloc(gfpflags);
        might_sleep_if(gfpflags & __GFP_WAIT);

        if (should_failslab(s->objsize, gfpflags, s->flags))
                return NULL;

        local_irq_save(flags);
        c = __this_cpu_ptr(s->cpu_slab);
        object = c->freelist;
        if (unlikely(!object || !node_match(c, node)))

                object = __slab_alloc(s, gfpflags, node, addr, c);

        else {
                c->freelist = get_freepointer(s, object);
                stat(s, ALLOC_FASTPATH);
        }
        local_irq_restore(flags);

        if (unlikely(gfpflags & __GFP_ZERO) && object)
                memset(object, 0, s->objsize);

        kmemcheck_slab_alloc(s, gfpflags, object, s->objsize);
        kmemleak_alloc_recursive(object, s->objsize, 1, s->flags, gfpflags);

        return object;
}


which has local_irq_save and a per-CPU data access:


local_irq_save(flags);
c = __this_cpu_ptr(s->cpu_slab);
object = c->freelist;
if (unlikely(!object || !node_match(c, node)))

        object = __slab_alloc(s, gfpflags, node, addr, c);

else {
        c->freelist = get_freepointer(s, object);
        stat(s, ALLOC_FASTPATH);
}
local_irq_restore(flags);
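
If you want to convince yourself of that mapping, the flags-saving half of local_irq_save() is easy to reproduce in userspace. Here's a simplified sketch (the real kernel macro goes through raw_local_irq_save() and possible paravirt hooks, and cli is a privileged instruction, so it only appears in a comment here):

#include <stdio.h>

/*
 * Roughly what local_irq_save() boils down to on x86-64: save RFLAGS with
 * pushfq/popq, then disable interrupts with cli. Only the first half is
 * legal in userspace, so the cli is commented out.
 */
#define fake_local_irq_save(flags)                                      \
        do {                                                            \
                asm volatile("pushfq ; popq %0" : "=rm" (flags));       \
                /* asm volatile("cli" ::: "memory"); */                 \
        } while (0)

int main(void)
{
        unsigned long flags;

        fake_local_irq_save(flags);
        printf("RFLAGS: %#lx\n", flags);
        return 0;
}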


Turning our attention back to the decoded machine code:


f: 65 48 8b 14 25 a8 d1 mov %gs:0xd1a8,%rdx
16: 00 00
18: 48 8b 03 mov (%rbx),%rax
1b: 48 8d 04 02 lea (%rdx,%rax,1),%rax
1f: 4c 8b 28 mov (%rax),%r13
22: 4d 85 ed test %r13,%r13
25: 74 55 je 0x7c
27: 48 63 53 18 movslq 0x18(%rbx),%rdx
2b:* 49 8b 54 15 00 mov 0x0(%r13,%rdx,1),%rdx <-- trapping instruction


we can see that the


22: 4d 85 ed test %r13,%r13
25: 74 55 je 0x7c


sequence is likely to be the following if-else statement, compiled with the condition inverted:


if (unlikely(!object || !node_match(c, node)))

        object = __slab_alloc(s, gfpflags, node, addr, c);

else {
        c->freelist = get_freepointer(s, object);
        stat(s, ALLOC_FASTPATH);
}


which would mean that the trapping instructions:


27: 48 63 53 18 movslq 0x18(%rbx),%rdx
2b:* 49 8b 54 15 00 mov 0x0(%r13,%rdx,1),%rdx <-- trapping instruction


are in get_freepointer():


static inline void *get_freepointer(struct kmem_cache *s, void *object)
{
        return *(void **)(object + s->offset);
}


If this is true, then "s" is stored in "%rbx" and "object" is in "%r13". Furthermore, ->offset must be at offset 0x18 in struct kmem_cache. Now, we could easily verify our assumptions, but for that we would need the reporter's kernel image or .config.

However, looking at "struct kmem_cache":


struct kmem_cache {
        struct kmem_cache_cpu *cpu_slab;
        /* Used for retriving partial slabs etc */
        unsigned long flags;
        int size;               /* The size of an object including meta data */
        int objsize;            /* The size of an object without meta data */
        int offset;             /* Free pointer offset. */
[...]


we can see that its layout in memory is not affected by .config, so we can safely inspect any image that has CONFIG_SLUB=y. We first compile a kernel image with CONFIG_DEBUG_INFO=y and then load the vmlinux image into GDB for inspection:


$ gdb ./vmlinux
GNU gdb (GDB) 7.0-ubuntu
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /home/penberg/linux/vmlinux...done.


and then use the following GDB magic to print out the offset of ->offset in struct kmem_cache:


(gdb) print &((struct kmem_cache *)0)->offset
$1 = (int *) 0x18


Bingo! We can conclude that the invalid pointer "0000000000000003" is the "object" pointer passed to get_freepointer(). Tracing backwards in the source code:


c = __this_cpu_ptr(s->cpu_slab);
object = c->freelist;
if (unlikely(!object || !node_match(c, node)))

        object = __slab_alloc(s, gfpflags, node, addr, c);

else {
        c->freelist = get_freepointer(s, object);


we see that "object" comes from "c->freelist" which means that the slab cache freelist is corrupted. There are various possibilites how that can happen so the next step would be to ask the reporter to try to reproduce the problem with SLUB debugging enabled. For that, you can either enable CONFIG_SLUB_DEBUG_ON or pass "slub_debug" as kernel parameter.

In summary, when you're analyzing a kernel bug report:

  • Make sure you're looking at the right version of kernel sources. You don't want to spend hours and hours scratching your head only to realize the code you're looking at has changed significantly.

  • Ask the reporter for the .config. Remember to check your assumptions if you don't have one, especially when inspecting a kernel image!

  • You can usually get quite far by decoding "Code" and reading the register dump in OOPSes. It's also useful to spend some time skimming through the functions reported in the backtrace to get a feel of the code.



There are more interesting kernel debugging tricks documented in Documentation/BUG-HUNTING. Happy debugging!

Thursday, May 13, 2010

Jato status update for May 2010

We didn't make it to Google Summer of Code this year but we've managed to get some pretty cool stuff done:


  • Tomek Grabiec integrated Boehm GC with Jato. While we still intend to continue developing our own GC, it became apparent that we needed a GC now to move forward. This means that the VM now has all the necessary capabilities to run real programs on 32-bit x86 Linux machines.

  • Tomek Grabiec implemented a relaxed-locking protocol for monitors, which improved performance tremendously.

  • Eduard-Gabriel Munteanu contributed tons of fixes to our x86-64 port. We're not there yet but hopefully we're getting closer.


There's still tons of work remaining but we're slowly getting there. As for our performance, we're pretty close to JamVM (see the following diagram for selected DaCapo benchmarks), which is not that bad considering that we haven't done much serious performance tuning yet.

[ Diagram: selected DaCapo benchmark results for Jato and JamVM ]

As for future plans, we'll continue to work on compatibility issues. For that, we will probably be looking at GNU Classpath 0.98 support and OpenJDK integration for better compatibility. On the performance side, I hope we find the time to look into array bounds check elimination, inline caching, and method inlining to try to close the performance gap to CACAO.

Sunday, April 25, 2010

A Short Introduction to the x86 Instruction Set Encoding

A lot of people think that the x86 instruction set is arcane and complex. While it is certainly true that the instruction set is humongous, it is rather straight-forward when you look at the actual instruction set encoding. As a matter of fact, given that the encoding hasn't changed much since the 16-bit 8086 days, it's pretty amazing how cleanly the instruction set has been able to evolve into the modern 64-bit architecture x86 is today.

Instruction set encoding is the binary representation of instructions (e.g. addition, loads, stores) that the CPU executes. The main difference between x86 and modern RISC architectures is that the former uses variable-length encoding and the latter architectures (e.g. ARM and PowerPC) use fixed-length encoding. That is, on x86, an instruction can be anywhere from 1 to 15 bytes in length whereas on ARM each instruction is exactly 16 or 32 bits depending on which mode the CPU is in.

Looking at Chapter 2 ("Instruction Format") of Intel manual Volume 2A, we can see a figure of the instruction format that looks roughly like this:

[ Figure: x86 instruction format: instruction prefixes, opcode (1 to 3 bytes), ModR/M, SIB, displacement, immediate ]

What's interesting about this format is that it's identical all the way from the 8086 to x86-64, with the exception of the SIB byte and the REX prefix that we'll cover in more detail later. Intel, AMD, and other vendors have added more instructions but the above encoding has survived over the years. Pretty neat, huh? Much of the apparent complexity comes from the early days when engineers tried to squeeze instructions into as small a space as possible.

The most important part of the format is the opcode. It is 1 to 3 bytes in length with an optional 3-bit opcode extension in the ModR/M byte. An opcode (operation code) specifies the operation to be performed. For example, the ret instruction that is used to return from a procedure is encoded as a single byte 0xc3 (near return) or 0xcb (far return).

Most instructions have one or more operands that they operate on. Instructions that have register operands encode the source and target register either implicitly in the opcode or in the ModR/M byte. Instructions that refer to an operand in memory also use the ModR/M byte and optionally the SIB byte to specify the addressing mode of the operand. Chapter 2 ("Instruction Format") of the Intel manuals specifies the different Mod and R/M bitmasks for different source and target combinations.

Following the ModR/M and SIB bytes, an instruction has optional address displacement and immediate data bytes. The address displacement is used as an offset on top of a base register (e.g. [EAX]+displacement) and, together with the ModR/M and SIB bytes, it represents the different x86 addressing modes. Immediate data is used for constant operands and relative branch targets.

Finally, an instruction can have one instruction prefix from each of the four groups: (1) lock and repeat prefixes, (2) segment override prefixes, (3) operand-size override prefix, and (4) address-size override prefix. In 64-bit mode, REX prefixes are used for specifying GPRs and SSE registers, 64-bit operand size, and extended control registers. Each instruction can have only one REX prefix at a time.

Let's look at an example of instruction encoding. On x86-64, the following assembly code


movl $0xdeadbeef, 0x12345678(%ebx,%edx,2)


compiles to the following byte sequence:


67 c7 84 53 78 56 34 12 ef be ad de


Looking at the breakdown of the above byte sequence:

[ Figure: byte-by-byte breakdown: 67 address-size prefix, c7 opcode, 84 ModR/M, 53 SIB, 78 56 34 12 displacement, ef be ad de immediate ]

We can see that the opcode byte is 0xc7, which is one of the encodings for the mov instruction. The l suffix in the assembly code selects 32-bit operands, which is already the default operand size in 64-bit mode. The address-size override prefix 0x67, on the other hand, is needed because we used the 32-bit %ebx and %edx registers for addressing in 64-bit mode. The same prefix can also be used to switch between 16-bit and 32-bit addressing in legacy mode. The address displacement and immediate data bytes are the little-endian representations of the respective constants we used in the assembly code.

Finally, looking at the ModR/M byte, the value of Mod is b10 and R/M is b100, which means that a SIB byte (followed by a 32-bit displacement) comes after the ModR/M byte. The value of Reg is b00, which is in this case an opcode extension for 0xc7 that specifies that the source operand is an immediate. In the SIB byte, Scale is b01, Index is b010 and Base is b011, which translates to [EBX+EDX*2] as per Table 2-3 ("32-bit addressing forms with the SIB byte") of the Intel manual Volume 2A.
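
If you want to do that kind of breakdown mechanically, the bit twiddling is tiny. Here's a small C sketch (the function and variable names are mine, not from any real decoder) that splits a ModR/M byte and, when one follows, a SIB byte into their fields; feeding it the 0x84 and 0x53 bytes from the example prints the same values we just read off:

#include <stdint.h>
#include <stdio.h>

/* Split a ModR/M byte, and a SIB byte when one follows, into their fields. */
static void decode_modrm_sib(uint8_t modrm, uint8_t sib)
{
        uint8_t mod = modrm >> 6;
        uint8_t reg = (modrm >> 3) & 0x7;
        uint8_t rm  = modrm & 0x7;

        printf("ModR/M: Mod=%u Reg=%u R/M=%u\n", mod, reg, rm);

        if (mod != 3 && rm == 4) {
                /* R/M == b100 means a SIB byte follows the ModR/M byte. */
                uint8_t scale = sib >> 6;
                uint8_t index = (sib >> 3) & 0x7;
                uint8_t base  = sib & 0x7;

                printf("SIB: Scale=%u (*%u) Index=%u Base=%u\n",
                       scale, 1U << scale, index, base);
        }
}

int main(void)
{
        decode_modrm_sib(0x84, 0x53);   /* the ModR/M and SIB bytes from above */
        return 0;
}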

The above example explains the different parts of the instruction format pretty well. Much of the perceived complexity comes from the fact that a lot of information, such as operand types and special cases, is encoded implicitly in the opcode bytes. That doesn't mean that the x86 instruction set is arbitrary, far from it. Unfortunately, much of this information is scattered across the manuals and only becomes apparent once you actually try to decode the instruction set yourself.

For more information, check out Volumes 2A and 2B of the Intel manuals. You might want to check out the x86_decode_insn() function in the arch/x86/kvm/emulate.c file of the Linux kernel sources for a real-world partial x86 decoder that's used by KVM. A work-in-progress x86 decoder can also be found in the libcpu sources in the arch/x86/x86_decode.cpp file for those who are interested in getting their hands dirty hacking on one.

Sunday, April 11, 2010

CPU virtualization techniques

I guess most technical people have heard of virtualization by now, and many are using a hypervisor such as QEMU for their daily work but don't really understand how these systems work. I'll try to give a brief overview of the major techniques here, but please be warned that this is a very x86 and Linux centric view of things.

The two fundamental questions in virtualization are: do you need to be able to run unmodified guests, and how fast do you want the guest to go? An unmodified guest is, rather unsurprisingly, your out-of-the-box Linux or Windows installation, for example. To be able to run them under a hypervisor, you need either hardware virtualization or hardware emulation. If you are allowed to modify the guest, you can use paravirtualization. Roughly speaking, KVM falls under the hardware virtualization category, QEMU under hardware emulation, and Xen under paravirtualization.

Hardware emulation is the purest form of virtualization in the sense that it supports unmodified guests without any hardware assistance. Hardware emulators are actually not that different from, say, the Java virtual machine. In both cases, you have an instruction set (machine code vs. bytecode) and some way to describe machine state (machine registers vs. operand stack). And much like with the JVM, hardware emulation can be pretty fast if dynamic binary translation (a form of just-in-time compilation) is used, as shown by QEMU, for example. Unfortunately, on virtualization unfriendly architectures such as the x86 [1], some aspects of CPU emulation incur a serious performance overhead.

[ 1. See Section 3.1 ("x86 obstacles to virtualization") of Adams and Agesen (2006) for details. ]

Hardware virtualization also supports unmodified guests just like hardware emulation does but provides near-native performance because the hypervisor doesn't need to do binary translation. With the virtualization extensions in modern x86 CPUs, the only thing you need to do is to put the CPU in "virtualization mode" and let the guest operating system run on it. The hypervisor needs to, of course, provide I/O emulation if you want to run anything serious under it. On Linux, KVM provides the application binary interface for controlling the virtualization extensions. KVM is usually used in combination with QEMU, which handles the rest of the hardware emulation to implement a full hypervisor.
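
To make the "put the CPU in virtualization mode and let the guest run" part concrete, here is a rough, error-handling-free sketch of driving the KVM ioctl interface directly from userspace; it runs a toy 16-bit guest that writes one character to an I/O port and then halts. In practice QEMU does all of this, and vastly more, for you:

#include <fcntl.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
        /* Toy 16-bit guest: write 'A' to I/O port 0x10, then halt. */
        const uint8_t guest_code[] = {
                0xb0, 'A',      /* mov $'A', %al  */
                0xe6, 0x10,     /* out %al, $0x10 */
                0xf4,           /* hlt            */
        };

        int kvm = open("/dev/kvm", O_RDWR);
        int vm = ioctl(kvm, KVM_CREATE_VM, 0);

        /* Back 4 KiB of guest "physical" memory with an anonymous mapping. */
        void *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        memcpy(mem, guest_code, sizeof(guest_code));

        struct kvm_userspace_memory_region region = {
                .slot = 0,
                .guest_phys_addr = 0x1000,
                .memory_size = 0x1000,
                .userspace_addr = (uint64_t)mem,
        };
        ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

        int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
        int mmap_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
        struct kvm_run *run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, vcpu, 0);

        /* Start the vCPU in real mode with CS base 0 and RIP at the guest code. */
        struct kvm_sregs sregs;
        ioctl(vcpu, KVM_GET_SREGS, &sregs);
        sregs.cs.base = 0;
        sregs.cs.selector = 0;
        ioctl(vcpu, KVM_SET_SREGS, &sregs);

        struct kvm_regs regs = { .rip = 0x1000, .rflags = 0x2 };
        ioctl(vcpu, KVM_SET_REGS, &regs);

        /* The hypervisor loop: run the guest and emulate whatever it exits for. */
        for (;;) {
                ioctl(vcpu, KVM_RUN, 0);

                if (run->exit_reason == KVM_EXIT_IO) {
                        /* "Emulate" the output port by echoing the byte. */
                        putchar(*((char *)run + run->io.data_offset));
                } else if (run->exit_reason == KVM_EXIT_HLT) {
                        break;
                }
        }
        putchar('\n');
        return 0;
}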

Paravirtualization is a technique to support high performance virtual machines on virtualization unfriendly architectures. It requires the guest operating system to be ported to the paravirtualization ABI. On Linux, this ABI is called "paravirt-ops" and it's used by the Xen hypervisor, for example. Although I tend to think of Xen as a historical accident (the project started before Intel and AMD introduced their virtualization extensions), it is being used in large-scale environments such as Amazon EC2 and the scalability benefits seem to be real.

While the boundaries between the different virtualization techniques are pretty clear on the CPU virtualization side, for real-world problems they tend to be not so black and white. You almost certainly want to use paravirtualized device drivers with a hardware virtualization solution such as KVM if you're interested in high-performance I/O, and you want to make sure a paravirtualization solution such as Xen takes advantage of the CPU virtualization extensions when they're available. That shouldn't come as a huge surprise: as in all technology, practice beats theory every time.

Saturday, January 9, 2010

Java Virtual Machine developer documentation

So you want to be a Java virtual machine hacker but don't know where to start? One obvious starting point is to navigate your way through the large body of developer documentation.

For a JVM developer, the most important document is The Java Virtual Machine Specification and its companion document Clarifications and Amendments. The specification describes the ABI of the JVM but doesn't mandate how it is implemented. While you should read through the whole thing, the most interesting chapters to a JVM hacker are Chapter 4 ("The class File Format") and Chapter 6 ("The Java Virtual Machine Instruction Set"). It's also important to understand JSR-133 ("Java Memory Model"), which describes how threads in the JVM interact through memory. Doug Lea's The JSR-133 Cookbook for Compiler Writers has a pretty nice summary of the specification from a JVM hacker's point of view.
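
To give a taste of Chapter 4: every class file starts with a fixed header of a u4 magic number (0xcafebabe) followed by u2 minor and major version numbers, all stored big-endian. Here's a minimal C sketch that reads and prints them (the Foo.class file name is just a placeholder):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint8_t h[8];
        FILE *f = fopen("Foo.class", "rb");

        if (!f || fread(h, 1, sizeof(h), f) != sizeof(h)) {
                perror("Foo.class");
                return 1;
        }

        /* u4 magic, u2 minor_version, u2 major_version, all big-endian. */
        uint32_t magic = (uint32_t) h[0] << 24 | h[1] << 16 | h[2] << 8 | h[3];
        uint16_t minor = h[4] << 8 | h[5];
        uint16_t major = h[6] << 8 | h[7];

        printf("magic=%#x minor=%u major=%u\n", magic, minor, major);
        fclose(f);
        return 0;
}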

There are two additional things that a JVM needs to run serious programs: JNI support and runtime classes. The Java Native Interface Specification describes the interoperability interface between the JVM and native code. Sun's HotSpot comes with its own runtime classes, but many open source JVMs use GNU Classpath to provide the essential classes for the JVM. The GNU Classpath Virtual Machine Integration Guide describes the API that a VM needs to implement to integrate with GNU Classpath.

While the above documents describe what is needed for a JVM, it's also important to understand how real-world JVMs are implemented to see the big picture. Design of the Java HotSpot Client Compiler for Java 6 and The Jalapeño Dynamic Optimizing Compiler for Java provide nice overviews of how HotSpot and Jikes RVM are implemented, respectively. There's a more extensive list of papers on the Jato home page, but there's no substitute for reading JVM code if you really want to understand the implementation.

Jato v0.0.2 released

I just released version 0.0.2 of Jato. See the release announcement for details.