A blog about programming topics, mostly JVM, Linux kernel, and x86 related things.

Tuesday, July 20, 2010

Linux kernel OOPS debugging

Debugging the Linux kernel can seem like black magic for the uninitiated. It certainly felt like that when I first started out hacking on the kernel! What I found to be particularly amazing back then was how kernel hackers seemed to be able to debug and fix bugs that they weren't able to reproduce on their machine.

In this article, I'm going to walk through a real world Linux kernel bug report to show the different steps required to pin-point where in the kernel a problem happened.

[ Please skim through the report before you continue reading. ]

The first important detail in the bug report is the kernel version. We usually want to look at the exact same version of the kernel sources as the reporter, so lets use "git checkout":


$ git checkout v2.6.34


The reporter did not include his .config file so unfortunately we can't build the same kind of version on our machine (different versions of GCC change the generated code a bit as well). However, lets look at a cleaned up version of the reported OOPS to see what we can deduce from it:


BUG: unable to handle kernel NULL pointer dereference at 0000000000000003
IP: [<ffffffff810aab89>] __kmalloc_track_caller+0x69/0x110
PGD 11e7e5067 PUD 11fd3d067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/virtual/block/md0/md/metadata_version
CPU 1
Modules linked in: aes_x86_64(+) aes_generic sg

Pid: 4652, comm: modprobe Not tainted 2.6.34-gentoo-r1 #1 MS-7368/MS-7368
RIP: 0010:[<ffffffff810aab89>] [<ffffffff810aab89>] __kmalloc_track_caller+0x69/0x110
RSP: 0018:ffff88011e75fe08 EFLAGS: 00010006
RAX: ffff880001b0f088 RBX: ffffffff8170d4d0 RCX: ffff88011e574b80
RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 00000000000002d0
RBP: 0000000000000296 R08: 0000000000000014 R09: ffff88011e574800
R10: 0000000000000001 R11: ffff880001a12008 R12: 00000000000000d0
R13: 0000000000000003 R14: ffffffff81064abb R15: ffffc90010729d68
FS: 00007f0a9acb8700(0000) GS:ffff880001b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000003 CR3: 000000011d03e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 4652, threadinfo ffff88011e75e000, task ffff88011d114150)
Stack:
0000000000000000 ffffc90010729c97 0000000000000008 ffff88011e574800
ffff88011e574aa0 ffffffff8108c27b ffffffffa0018920 ffffc900000000d0
ffffffffa0018920 ffffc90010728000 ffffc90010729d68 ffffffff81064abb
Call Trace:
[<ffffffff8108c27b>] ? kstrdup+0x3b/0x70
[<ffffffff81064abb>] ? load_module+0x13eb/0x1730
[<ffffffff81064e7b>] ? sys_init_module+0x7b/0x260
[<ffffffff810024ab>] ? system_call_fastpath+0x16/0x1b
Code: 23 25 dc 47 6f 00 41 f6 c4 10 75 66 9c 5d fa 65 48 8b 14 25 a8 d1 00 00 48 8b 03 48 8d 04 02 4c 8b 28 4d 85 ed 74 55 48 63 53 18 <49> 8b 54 15 00 48 89 10 55 9d 4d 85 ed 74 06 66 45 85 e4 78 22
RIP [<ffffffff810aab89>] __kmalloc_track_caller+0x69/0x110
RSP <ffff88011e75fe08>
CR2: 0000000000000003
---[ end trace 692101747f991cfb ]---


OK, so it's an invalid pointer access in __kmalloc_track_caller:


BUG: unable to handle kernel NULL pointer dereference at 0000000000000003
IP: [<ffffffff810aab89>] __kmalloc_track_caller+0x69/0x110


As we don't have a copy of the kernel image, lets use the scripts/decodecode script in Linux kernel sources to decode "Code" in the OOPS which contains a small machine code dump of the problem area:


echo "Code: 23 25 dc 47 6f 00 41 f6 c4 10 75 66 9c 5d fa 65 48 8b 14 25 a8 d1 00 00 48 8b 03 48 8d 04 02 4c 8b 28 4d 85 ed 74 55 48 63 53 18 <49> 8b 54 15 00 48 89 10 55 9d 4d 85 ed 74 06 66 45 85 e4 78 22" | ./scripts/decodecode

Code: 23 25 dc 47 6f 00 41 f6 c4 10 75 66 9c 5d fa 65 48 8b 14 25 a8 d1 00 00 48 8b 03 48 8d 04 02 4c 8b 28 4d 85 ed 74 55 48 63 53 18 <49> 8b 54 15 00 48 89 10 55 9d 4d 85 ed 74 06 66 45 85 e4 78 22
All code
========
0: 23 25 dc 47 6f 00 and 0x6f47dc(%rip),%esp # 0x6f47e2
6: 41 f6 c4 10 test $0x10,%r12b
a: 75 66 jne 0x72
c: 9c pushfq
d: 5d pop %rbp
e: fa cli
f: 65 48 8b 14 25 a8 d1 mov %gs:0xd1a8,%rdx
16: 00 00
18: 48 8b 03 mov (%rbx),%rax
1b: 48 8d 04 02 lea (%rdx,%rax,1),%rax
1f: 4c 8b 28 mov (%rax),%r13
22: 4d 85 ed test %r13,%r13
25: 74 55 je 0x7c
27: 48 63 53 18 movslq 0x18(%rbx),%rdx
2b:* 49 8b 54 15 00 mov 0x0(%r13,%rdx,1),%rdx <-- trapping instruction
30: 48 89 10 mov %rdx,(%rax)
33: 55 push %rbp
34: 9d popfq
35: 4d 85 ed test %r13,%r13
38: 74 06 je 0x40
3a: 66 45 85 e4 test %r12w,%r12w
3e: 78 22 js 0x62

Code starting with the faulting instruction
===========================================
0: 49 8b 54 15 00 mov 0x0(%r13,%rdx,1),%rdx
5: 48 89 10 mov %rdx,(%rax)
8: 55 push %rbp
9: 9d popfq
a: 4d 85 ed test %r13,%r13
d: 74 06 je 0x15
f: 66 45 85 e4 test %r12w,%r12w
13: 78 22 js 0x37


Looking at the trapping instruction:


2b:* 49 8b 54 15 00 mov 0x0(%r13,%rdx,1),%rdx <-- trapping instruction


and the register dump:


RAX: ffff880001b0f088 RBX: ffffffff8170d4d0 RCX: ffff88011e574b80
RDX: 0000000000000000 RSI: 00000000000000d0 RDI: 00000000000002d0
RBP: 0000000000000296 R08: 0000000000000014 R09: ffff88011e574800
R10: 0000000000000001 R11: ffff880001a12008 R12: 00000000000000d0
R13: 0000000000000003 R14: ffffffff81064abb R15: ffffc90010729d68


We can see that R13 contains the invalid pointer which matches the reported dereference:


BUG: unable to handle kernel NULL pointer dereference at 0000000000000003


Next step is to locate the resolved function:


IP: [<ffffffff810aab89>] __kmalloc_track_caller+0x69/0x110


in kernel sources:


$ make tags
$ vim -t __kmalloc_track_caller


which reveals two functions - one in mm/slab.c and one in mm/slub.c. The bug reported mentioned using CONFIG_SLUB=y so we're interested in the one in mm/slub.c:


void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
{
struct kmem_cache *s;
void *ret;

if (unlikely(size > SLUB_MAX_SIZE))
return kmalloc_large(size, gfpflags);

s = get_slab(size, gfpflags);

if (unlikely(ZERO_OR_NULL_PTR(s)))
return s;

ret = slab_alloc(s, gfpflags, -1, caller);

/* Honor the call site pointer we recieved. */
trace_kmalloc(caller, ret, size, s->size, gfpflags);

return ret;
}


Scanning the decoded machine code for some interesting instructions we find:


c: 9c pushfq
d: 5d pop %rbp
e: fa cli


which is generated code from local_irq_save() and


f: 65 48 8b 14 25 a8 d1 mov %gs:0xd1a8,%rdx


which is per-CPU data access - from which we can also deduce that CONFIG_SMP=y. Reading __kmalloc_track_caller some more, we find slab_alloc():


static __always_inline void *slab_alloc(struct kmem_cache *s,
gfp_t gfpflags, int node, unsigned long addr)
{
void **object;
struct kmem_cache_cpu *c;
unsigned long flags;

gfpflags &= gfp_allowed_mask;

lockdep_trace_alloc(gfpflags);
might_sleep_if(gfpflags & __GFP_WAIT);

if (should_failslab(s->objsize, gfpflags, s->flags))
return NULL;

local_irq_save(flags);
c = __this_cpu_ptr(s->cpu_slab);
object = c->freelist;
if (unlikely(!object || !node_match(c, node)))

object = __slab_alloc(s, gfpflags, node, addr, c);

else {
c->freelist = get_freepointer(s, object);
stat(s, ALLOC_FASTPATH);
}
local_irq_restore(flags);

if (unlikely(gfpflags & __GFP_ZERO) && object)
memset(object, 0, s->objsize);

kmemcheck_slab_alloc(s, gfpflags, object, s-%gt;objsize);
kmemleak_alloc_recursive(object, s->objsize, 1, s->flags, gfpflags);

return object;
}


which has local_irq_save and a per-CPU data access:


local_irq_save(flags);
c = __this_cpu_ptr(s->cpu_slab);
object = c->freelist;
if (unlikely(!object || !node_match(c, node)))

object = __slab_alloc(s, gfpflags, node, addr, c);

else {
c->freelist = get_freepointer(s, object);
stat(s, ALLOC_FASTPATH);
}
local_irq_restore(flags);


Turning our attention back to the decoded machine code:


f: 65 48 8b 14 25 a8 d1 mov %gs:0xd1a8,%rdx
16: 00 00
18: 48 8b 03 mov (%rbx),%rax
1b: 48 8d 04 02 lea (%rdx,%rax,1),%rax
1f: 4c 8b 28 mov (%rax),%r13
22: 4d 85 ed test %r13,%r13
25: 74 55 je 0x7c
27: 48 63 53 18 movslq 0x18(%rbx),%rdx
2b:* 49 8b 54 15 00 mov 0x0(%r13,%rdx,1),%rdx <-- trapping instruction


we can see that the


22: 4d 85 ed test %r13,%r13
25: 74 55 je 0x7c


sequence is likely to be the following if-else statement written in inverse:


if (unlikely(!object || !node_match(c, node)))

object = __slab_alloc(s, gfpflags, node, addr, c);

else {
c->freelist = get_freepointer(s, object);
stat(s, ALLOC_FASTPATH);
}


which would mean that the trapping instructions:


27: 48 63 53 18 movslq 0x18(%rbx),%rdx
2b:* 49 8b 54 15 00 mov 0x0(%r13,%rdx,1),%rdx <-- trapping instruction


are in get_freepointer():


static inline void *get_freepointer(struct kmem_cache *s, void *object)
{
return *(void **)(object + s->offset);
}


if this is true, then "s" is stored in "%rbx" and "object" is in "%r13". Furthermore, ->offset must be at offset 0x18 in struct kmem_cache. Now, we can easily verify our assumptions but we need the reporter's kernel image or .config.

However, looking at "struct kmem_cache":


struct kmem_cache {
struct kmem_cache_cpu *cpu_slab;
/* Used for retriving partial slabs etc */
unsigned long flags;
int size; /* The size of an object including meta data */
int objsize; /* The size of an object without meta data */
int offset; /* Free pointer offset. */
[...]


we can see that it's layout in memory is not affected by .config so we can safely inspect any image that has CONFIG_SLUB=y. We first compile a kernel image with CONFIG_DEBUG_INFO=y and the load vmlinux image to GDB for inspection:


$ gdb ./vmlinux
GNU gdb (GDB) 7.0-ubuntu
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /home/penberg/linux/vmlinux...done.


and then use following GDB magic to print out the offset of ->offset in struct kmem_cache:


(gdb) print &((struct kmem_cache *)0)->offset
$1 = (int *) 0x18


Bingo! We can conclude that the invalid pointer "0000000000000003" is in "object" of get_freepointer(). Tracing backwards in the source code:


c = __this_cpu_ptr(s->cpu_slab);
object = c->freelist;
if (unlikely(!object || !node_match(c, node)))

object = __slab_alloc(s, gfpflags, node, addr, c);

else {
c->freelist = get_freepointer(s, object);


we see that "object" comes from "c->freelist" which means that the slab cache freelist is corrupted. There are various possibilites how that can happen so the next step would be to ask the reporter to try to reproduce the problem with SLUB debugging enabled. For that, you can either enable CONFIG_SLUB_DEBUG_ON or pass "slub_debug" as kernel parameter.

In summary, when you're analyzing a kernel bug report:

  • Make sure you're looking at the right version of kernel sources. You don't want to spend hours and hours scratching your head only to realize the code you're looking at has changed significantly.

  • Ask for the reporter for the .config. Remember to check your assumptions if you don't have one especially when inspecting a kernel image!

  • You can usually get quite far by decoding "Code" and reading the register dump in OOPSes. It's also useful to spend some time skimming through the functions reported in the backtrace to get a feel of the code.



There's more interesting kernel debugging tricks documented Documentation/BUG-HUNTING. Happy debugging!

3 comments:

Ken O'Brien said...

Good stuff. This will help me debug my hibernate problem. :)

mahaveer said...

Very well written article. thanks a lot.

Anonymous said...

Very well written. This help me a lot to debug my first kernel null dereference. Thanks.