A blog about programming topics, mostly JVM, Linux kernel, and x86 related things.

Sunday, August 23, 2009

Full JVM instruction set coverage on 32-bit x86

Jato reached 100% coverage of the JVM instruction set on 32-bit x86 today. We have plenty of bugs and missing features that prevent you from running most Java programs out there. However, reaching 100% coverage is an important milestone because you can now try to run your favorite Java applications under Jato and report bugs to us when you see the VM crash and burn!

Tuesday, August 11, 2009

Hello, World (the Swing edition)!

The Hello World Swing example makes perfect screenshot material to show off Jato's amazing progress:

Pretty cool, eh?

Friday, August 7, 2009

Optimizing System.out.println() in Jato (...or how to reduce execution time from 10.5 seconds down to 0.3 seconds)

I am a big fan of the "first make it work, then make it better" principle and that's what we've been doing with Jato. However, as we were getting System.out.println() mostly working, we found ourselves in the "then make it better" situation because running a simple "hello, world" program took a whopping 10.5 seconds:

  penberg@penberg-laptop:~/testing/jato$ time jato Hello

real 0m12.680s
user 0m10.544s
sys 0m1.027s

As it was pretty evident that something was seriously broken, I turned to the shiny new performance counters tool that's part of the Linux 2.6.31 release candidate kernels:

  penberg@penberg-laptop:~/testing/jato$ perf record -f jato -Xperf Hello
[ perf record: Captured and wrote 2.665 MB perf.data (~116455 samples) ]
penberg@penberg-laptop:~/testing/jato$ perf report | head -10
# Samples: 116267
# Overhead Command Shared Object Symbol
# ........ ....... ....................................... ......
37.71% jato /home/penberg/bin/jato [.] analyze_liveness
16.27% jato /home/penberg/bin/jato [.] insn_defs
12.61% jato /home/penberg/bin/jato [.] test_bit
12.21% jato /home/penberg/bin/jato [.] insn_uses
6.30% jato /home/penberg/bin/jato [.] split_interval_at

The profile clearly indicates that we were spending 66.19% of the execution time in liveness analysis (the functions analyze_liveness, insn_defs, and insn_uses). It's also likely that much of the 12.61% spent in test_bit() originated from liveness analysis as well!

After looking at the trace for a few seconds, I went for the sucker optimization of commit a8b16eb ("lib: Make test_bit() an inline function"), which helped, but not much:

  penberg@penberg-laptop:~/testing/jato$ time jato Hello

real 0m12.636s
user 0m9.723s
sys 0m1.050s
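The idea behind that commit is a classic one: a bit test called millions of times in the liveness-analysis inner loops shouldn't pay function-call overhead for every probe. The sketch below shows the general pattern of making such a helper a static inline in a header; the names and layout are a hypothetical reconstruction, not the actual Jato code from commit a8b16eb:

```c
#include <limits.h>

#define BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))

/*
 * Moving a tiny helper like this from a .c file into a header as
 * a static inline lets the compiler fold the bit test directly
 * into each caller, removing per-call overhead in hot loops.
 */
static inline int test_bit(const unsigned long *bitset, unsigned long bit)
{
	return (bitset[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 1UL;
}
```

As the numbers above show, inlining alone barely moved the needle here, because the real cost was algorithmic, not per-call overhead.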

Luckily, I heard the voice of an ex-colleague of mine in my head saying "always optimize the _top-most_ functions in a profile report _first_" and turned my attention to the real problem. Reading the code, I was able to determine that we were updating live ranges way too many times in the wrong place. As I tried to fix it, I kept seeing crashes which turned out to be actual bugs in our use-def code!

After Arthur Huillet and Vegard Nossum fixed the use-def bugs, I did the first serious optimization in commit 8aa6c4f ("jit: optimize __analyze_use_def()") bringing the execution time from 9.7 seconds to 6.3 seconds:

  penberg@penberg-laptop:~/testing/jato$ time jato Hello

real 0m7.266s
user 0m6.389s
sys 0m0.159s

While 6.3 seconds is better, it's not amazingly good, so I turned to the profile report again:

# Samples: 66106
# Overhead Command Shared Object Symbol
# ........ ....... ....................................... ......
33.76% jato /home/penberg/bin/jato [.] insn_defs
21.72% jato /home/penberg/bin/jato [.] __update_interval_insn
20.91% jato /home/penberg/bin/jato [.] insn_uses
14.51% jato /home/penberg/bin/jato [.] __analyze_use_def
1.42% jato /home/penberg/bin/jato [.] allocate_registers

It didn't take too long for Arthur Huillet to notice that we were calling for_each_variable() (where the number of iterations can be very big in pathological cases) to find instructions that use or define a temporary. He suggested turning the code the other way around and finding all the temporaries of an instruction instead, which Vegard Nossum and I implemented in commit f0438e0 ("jit: Optimize __analyze_use_def()"). The change brought us from 6.3 seconds down to 1.8 seconds:
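The loop inversion can be sketched in miniature like this (invented structures and names; the real commit f0438e0 operates on Jato's LIR types). Before, each instruction triggered a scan over every variable; after, each instruction is walked once and only the variables it actually touches get marked:

```c
#include <stddef.h>

struct insn {
	int def;		/* variable index written, or -1   */
	int uses[2];		/* variable indices read, -1 = none */
};

/*
 * Mark per-variable flags in one pass over the instructions:
 * bit 0 = variable is defined, bit 1 = variable is used.
 * This is O(insns * operands) with a small, fixed operand count,
 * instead of O(insns * variables) with a for_each_variable() scan.
 */
static void analyze_use_def(const struct insn *insns, size_t nr_insns,
			    unsigned char *var_flags)
{
	for (size_t i = 0; i < nr_insns; i++) {
		if (insns[i].def >= 0)
			var_flags[insns[i].def] |= 1;
		for (int j = 0; j < 2; j++)
			if (insns[i].uses[j] >= 0)
				var_flags[insns[i].uses[j]] |= 2;
	}
}
```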

  penberg@penberg-laptop:~/testing/jato$ time jato Hello

real 0m2.228s
user 0m1.861s
sys 0m0.166s

I was feeling pretty happy about the improvement so I decided to see how much we were losing to Hotspot:

  penberg@penberg-laptop:~/testing/jato$ java -version
java version "1.6.0_10"
Java(TM) SE Runtime Environment (build 1.6.0_10-b33)
Java HotSpot(TM) Client VM (build 11.0-b15, mixed mode, sharing)
penberg@penberg-laptop:~/testing/jato$ time java Hello
hello, world

real 0m0.268s
user 0m0.040s
sys 0m0.018s

Ouch! It took 0.04 seconds for Hotspot to do what we did in 1.8 seconds so I fired up the profiler again:

  # Samples: 20298
# Overhead Command Shared Object Symbol
# ........ ....... ....................................... ......
69.71% jato /home/penberg/bin/jato [.] __update_interval_insn
4.25% jato /home/penberg/bin/jato [.] allocate_registers
4.19% jato [kernel] [k] get_page_from_freelist
1.68% jato /lib/tls/i686/cmov/libc-2.8.90.so [.] _int_malloc
1.66% jato /home/penberg/bin/jato [.] split_interval_at

Looking at the hot spot in __update_interval_insn(), we saw yet another for_each_variable() loop (I told you it was expensive). Vegard Nossum replaced the loop with a proper data structure in commit 5adab71 ("jit: introduce per-cu lir -> insn mapping") and was able to bring us down to 0.3 seconds:
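The fix boils down to trading a search loop for a lookup table. A toy version of a position-to-instruction mapping (invented names; the real code maps LIR positions inside a compilation unit) might look like this:

```c
#include <stdlib.h>

struct insn {
	unsigned long lir_pos;	/* position of this insn in the LIR */
};

/*
 * Build a direct-mapped table indexed by LIR position once;
 * after that, every "which insn lives at position N?" query is
 * an O(1) array lookup instead of an O(n) scan.
 */
static struct insn **build_lir_map(struct insn *insns, size_t nr_insns,
				   size_t max_pos)
{
	struct insn **map = calloc(max_pos + 1, sizeof(*map));

	if (!map)
		return NULL;
	for (size_t i = 0; i < nr_insns; i++)
		map[insns[i].lir_pos] = &insns[i];
	return map;
}
```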

  penberg@penberg-laptop:~/testing/jato$ time jato Hello
hello, world

real 0m0.530s
user 0m0.375s
sys 0m0.077s

There's still room for improvement, but we've turned our attention back to fixing bugs, as the level of performance is pretty satisfactory at this point. I did take a look at the profile one last time:

# Samples: 4429
# Overhead Command Shared Object Symbol
# ........ ....... ....................................... ......
15.69% jato /home/penberg/bin/jato [.] insert_to_list
6.84% jato /lib/tls/i686/cmov/libc-2.8.90.so [.] _int_malloc
6.62% jato [kernel] [k] get_page_from_freelist
6.34% jato /lib/tls/i686/cmov/libc-2.8.90.so [.] __GI_strcmp
3.97% jato /home/penberg/bin/jato [.] zalloc
2.73% jato /lib/tls/i686/cmov/libc-2.8.90.so [.] __GI___libc_malloc

The current bottleneck, insert_to_list(), is another example of using a simple loop instead of a proper data structure, this time in the register allocator. It's also interesting to see how malloc() (_int_malloc) and the kernel page allocator (get_page_from_freelist) together account for over 13% of the execution time. This is hardly surprising, as Jato is very memory hungry and doesn't have a garbage collector yet.
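For the curious, the pattern the profile is pointing at looks roughly like this (a hypothetical sketch, not the actual Jato code): a linear-scan register allocator keeping its interval list sorted by start position with an O(n) list walk per insert, where a priority queue would make each insert O(log n):

```c
#include <stddef.h>

struct interval {
	unsigned long start;		/* first LIR position where live */
	struct interval *next;
};

/*
 * O(n) sorted insert: walk the list until we find the first
 * interval that starts later, then splice the new one in before
 * it.  Cheap for short lists, a bottleneck for long ones.
 */
static void insert_to_list(struct interval **head, struct interval *it)
{
	struct interval **p = head;

	while (*p && (*p)->start < it->start)
		p = &(*p)->next;
	it->next = *p;
	*p = it;
}
```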

While I'm happy and proud of the speed-up, it's not that impressive, as the real challenge is to catch up with Hotspot, which is pretty damn hard. That said, the whole exercise reinforced two fundamental rules of performance optimization: (1) measure, don't guess, and (2) always optimize the parts where your program spends most of its time. Sounds simple and trivial? Yes, but it's very easy to forget when you're trying to make something go faster.

Watch for Falling Rocks...

Do you know the warning signs of a bad programmer?

Thursday, August 6, 2009

Hello, World from Jato!

...3 years, 10 months, and 23 days later:

  penberg@penberg-laptop:~/testing/jato$ cat Hello.java 
public class Hello {
    public static void main(String[] args) {
        System.out.println("hello, world");
    }
}
penberg@penberg-laptop:~/testing/jato$ javac Hello.java
penberg@penberg-laptop:~/testing/jato$ jato Hello
vm/jni-interface.c:459: warning: vm_jni_delete_local_ref not implemented
hello, world
vm/jni-interface.c:459: warning: vm_jni_delete_local_ref not implemented
vm/jni-interface.c:459: warning: vm_jni_delete_local_ref not implemented

If you're wondering why it took so long, let's turn on method invocation tracing:

penberg@penberg-laptop:~/testing/jato$ jato -Xtrace:invoke Hello >& trace
penberg@penberg-laptop:~/testing/jato$ grep "trace invoke" trace | wc -l
18059
penberg@penberg-laptop:~/testing/jato$ grep "trace invoke" trace | sed -e "s/.*trace invoke//g" | sort -u | wc -l
356

That's 356 individual methods and a total of 18059 method invocations! When I started the Jato project, the very first program I wanted to compile and run was "hello, world", but it turns out that in the Java world (with GNU Classpath), you need to cover 80% of the JVM instruction set and implement a bunch of JNI infrastructure to be able to do that!

I think we're getting close to a 0.01 release, but in case you're interested in trying Jato out now, it's as simple as:

  git clone git://github.com/penberg/jato.git
# check README how to install dependencies
make install
# Jato is in $HOME/bin now
jato Hello

Have fun!