A blog about programming topics, mostly JVM, Linux kernel, and x86 related things.

Sunday, April 25, 2010

A Short Introduction to the x86 Instruction Set Encoding

A lot of people think that the x86 instruction set is arcane and complex. While it is certainly true that the instruction set is humongous, it is rather straightforward when you look at the actual instruction set encoding. As a matter of fact, given that the encoding hasn't changed much since the 16-bit 8086 days, it's pretty amazing how cleanly the instruction set has been able to evolve into the modern 64-bit architecture x86 is today.

Instruction set encoding is the binary representation of instructions (e.g. addition, loads, stores) that the CPU executes. The main difference between x86 and modern RISC architectures is that the former uses variable-length encoding and the latter (e.g. ARM and PowerPC) use fixed-length encoding. That is, on x86, an instruction can be anywhere from 1 to 15 bytes in length, whereas on ARM each instruction is exactly 16 bits or 32 bits depending on which mode the CPU is in.

Looking at Chapter 2 ("Instruction Format") of Intel manual Volume 2A, we can see a figure of the instruction format that looks roughly like this:

  instruction prefixes  (up to 4 bytes, one from each prefix group)
  opcode                (1-3 bytes)
  ModR/M                (0 or 1 byte)
  SIB                   (0 or 1 byte)
  displacement          (0, 1, 2, or 4 bytes)
  immediate             (0, 1, 2, or 4 bytes)

What's interesting about this format is that it's identical all the way from the 8086 to x86-64 with the exception of the SIB byte and the REX prefix, which we'll cover in more detail later. Intel, AMD, and other vendors have added more instructions but the above encoding has survived over the years. Pretty neat, huh? Much of the apparent complexity comes from the early days when engineers tried to squeeze instructions into as little space as possible.

The most important part of the format is the opcode. It is 1 to 3 bytes in length with an optional 3-bit opcode extension in the ModR/M byte. An opcode (operation code) specifies the operation to be performed. For example, the ret instruction that is used to return from a procedure is encoded as a single byte 0xc3 (near return) or 0xcb (far return).
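
To see how literal that single byte is, here is a minimal C sketch (assuming an x86 or x86-64 POSIX system where mmap can hand out executable memory) that writes 0xc3 into a buffer and calls it as a function:

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        /* A complete function: just the near-return opcode 0xc3. */
        unsigned char code[] = { 0xc3 };

        void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;

        memcpy(buf, code, sizeof(code));

        /* Call the one-byte function; it immediately returns to us. */
        ((void (*)(void))buf)();

        printf("returned from a one-byte function\n");
        munmap(buf, 4096);
        return 0;
}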

Most instructions have one or more operands that they operate on. Instructions that have register operands encode the source and target register either implicitly in the opcode or in the ModR/M byte. Instructions that refer to an operand in memory also use the ModR/M byte and optionally the SIB byte to specify the addressing mode of the operand. Chapter 2 ("Instruction Format") of the Intel manuals specifies the different Mod and R/M bit patterns for different source and target combinations.
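
The ModR/M byte itself is just three bit fields packed into one byte: Mod in bits 7-6, Reg in bits 5-3, and R/M in bits 2-0. A minimal sketch of pulling them apart (decode_modrm is a hypothetical helper name, not from any real decoder):

/* Split a ModR/M byte into its three bit fields. */
void decode_modrm(unsigned char modrm,
                  unsigned int *mod, unsigned int *reg, unsigned int *rm)
{
        *mod = (modrm >> 6) & 0x3; /* bits 7-6: addressing mode */
        *reg = (modrm >> 3) & 0x7; /* bits 5-3: register or opcode extension */
        *rm  = modrm & 0x7;        /* bits 2-0: register or memory operand */
}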

Following the ModR/M and SIB bytes, an instruction has optional address displacement and immediate data bytes. The address displacement is used as an offset on top of a base register (e.g. [EAX]+displacement) and together with the ModR/M byte and SIB byte, they represent the different x86 addressing modes. Immediate data is used for constant operands and relative branch targets.
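
The SIB byte has the same 2/3/3 bit layout (Scale in bits 7-6, Index in bits 5-3, Base in bits 2-0), and ignoring the special cases (e.g. Index=b100 means no index register), the effective address works out to base + index * 2^scale + displacement. A sketch, with a hypothetical regs array standing in for the register file:

unsigned long effective_address(unsigned char sib,
                                const unsigned long *regs, long disp)
{
        unsigned int scale = (sib >> 6) & 0x3; /* scale factor is 1 << scale */
        unsigned int index = (sib >> 3) & 0x7; /* index register number */
        unsigned int base  = sib & 0x7;        /* base register number */

        /* regs is a hypothetical register file indexed by register number */
        return regs[base] + (regs[index] << scale) + disp;
}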

Finally, an instruction can have one instruction prefix from each of the four groups: (1) lock and repeat prefixes, (2) segment override prefixes, (3) operand-size override prefix, and (4) address-size override prefix. In 64-bit mode, REX prefixes are used for specifying GPRs and SSE registers, 64-bit operand size, and extended control registers. Each instruction can have only one REX prefix at a time.
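
The legacy prefixes come from a small fixed set of byte values, so a decoder can recognize them with a simple lookup. A sketch covering the four groups, plus the REX range 0x40-0x4f:

/* Is this byte a legacy instruction prefix? */
int is_legacy_prefix(unsigned char b)
{
        switch (b) {
        case 0xf0:                       /* group 1: lock */
        case 0xf2: case 0xf3:            /* group 1: repne/rep */
        case 0x2e: case 0x36: case 0x3e: /* group 2: cs/ss/ds override */
        case 0x26: case 0x64: case 0x65: /* group 2: es/fs/gs override */
        case 0x66:                       /* group 3: operand-size override */
        case 0x67:                       /* group 4: address-size override */
                return 1;
        default:
                return 0;
        }
}

/* In 64-bit mode, REX prefixes occupy 0x40..0x4f. */
int is_rex_prefix(unsigned char b)
{
        return (b & 0xf0) == 0x40;
}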

Let's look at an example of instruction encoding. On x86-64, the following assembly code


movl $0xdeadbeef, 0x12345678(%ebx,%edx,2)


compiles to the following byte sequence:


67 c7 84 53 78 56 34 12 ef be ad de


Looking at the breakdown of the above byte sequence:

  67            address-size override prefix
  c7            opcode (mov r/m32, imm32)
  84            ModR/M byte (Mod=b10, Reg=b000, R/M=b100)
  53            SIB byte (Scale=b01, Index=b010, Base=b011)
  78 56 34 12   address displacement (0x12345678)
  ef be ad de   immediate data (0xdeadbeef)

We can see that the opcode byte is 0xc7 which is one of the encodings for the mov instruction. The l-postfix in the assembly code denotes 32-bit operands, which is the default operand size in 64-bit mode, so no operand-size prefix is needed. The address-size override prefix 0x67 is there because the memory operand uses the 32-bit registers %ebx and %edx instead of the default 64-bit addressing. The same prefix is also used to switch between 16-bit and 32-bit addressing in legacy mode. The address displacement and immediate data bytes are the little-endian representations of the respective constants we used in the assembly code.

Finally, looking at the ModR/M byte, the value of Mod is b10 and R/M is b100, which means that a SIB byte follows the ModR/M byte. The value of Reg is b00, which in this case is an opcode extension for 0xc7 specifying that the source operand is an immediate. In the SIB byte, Scale is b01, Index is b010, and Base is b011, which translates to [EBX+EDX*2] as per Table 2-3 ("32-Bit Addressing Forms with the SIB Byte") of the Intel manual Volume 2A.
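
To double-check the breakdown, here is a small self-contained C program (nothing beyond the bit positions described above) that walks the twelve bytes and prints each part:

#include <stdio.h>

int main(void)
{
        unsigned char insn[] = { 0x67, 0xc7, 0x84, 0x53,
                                 0x78, 0x56, 0x34, 0x12,
                                 0xef, 0xbe, 0xad, 0xde };
        /* reassemble the little-endian displacement and immediate dwords */
        unsigned int disp = insn[4] | insn[5] << 8 | insn[6] << 16 |
                            (unsigned int)insn[7] << 24;
        unsigned int imm = insn[8] | insn[9] << 8 | insn[10] << 16 |
                           (unsigned int)insn[11] << 24;

        printf("prefix=%02x opcode=%02x\n", insn[0], insn[1]);
        printf("modrm: mod=%d reg=%d rm=%d\n",
               insn[2] >> 6, (insn[2] >> 3) & 7, insn[2] & 7);
        printf("sib:   scale=%d index=%d base=%d\n",
               insn[3] >> 6, (insn[3] >> 3) & 7, insn[3] & 7);
        printf("disp=0x%08x imm=0x%08x\n", disp, imm);
        return 0;
}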

The above example explains the different parts of the instruction format pretty well. Much of the perceived complexity comes from the fact that a lot of information, such as operand types and special cases, is encoded implicitly in the opcode bytes. That doesn't mean that the x86 instruction set is arbitrary; far from it. Unfortunately much of this information is scattered across the manuals and becomes apparent only once you try to decode the instruction set yourself.

For more information, check out Volumes 2A and 2B of the Intel manuals. You might want to check out the x86_decode_insn() function in the arch/x86/kvm/emulate.c file of the Linux kernel sources for a real-world partial x86 decoder that's used by KVM. A work-in-progress x86 decoder can also be found in the libcpu sources in the arch/x86/x86_decode.cpp file for those interested in getting their hands dirty with some hacking.

Sunday, April 11, 2010

CPU virtualization techniques

I guess most technical people have heard of virtualization by now and many are using a hypervisor such as QEMU for their daily work, but don't really understand how it works. I'll try to give a brief overview of the major techniques here, but please be warned that this is a very x86 and Linux centric view of things.

The two fundamental questions in virtualization are: do you need to be able to run unmodified guests, and how fast do you want the guest to go? An unmodified guest is, rather unsurprisingly, your out-of-the-box Linux or Windows installation, for example. To be able to run them under a hypervisor, you need either hardware virtualization or hardware emulation. If you are allowed to modify the guest, you can use paravirtualization. Roughly speaking, KVM falls under the hardware virtualization category, QEMU under hardware emulation, and Xen under paravirtualization.

Hardware emulation is the purest form of virtualization in the sense that it supports unmodified guests without any hardware assistance. Hardware emulators are actually not that different from, say, the Java virtual machine. In both cases, you have an instruction set (machine code vs. bytecode) and some way to describe machine state (machine registers vs. operand stack). And much like with the JVM, hardware emulation can be pretty fast if dynamic binary translation (a form of just-in-time compilation) is used, as shown by QEMU, for example. Unfortunately, on virtualization-unfriendly architectures such as the x86 [1], some aspects of CPU emulation incur serious performance overhead.

[ 1. See Section 3.1 ("x86 obstacles to virtualization") of Adams and Agesen (2006) for details. ]
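
The fetch-decode-execute loop at the heart of a plain interpreter-style emulator is easy to sketch. The following toy machine (an invented two-opcode instruction set, nothing like real x86) shows the shape of it; a binary-translating emulator replaces this loop with generated host code:

#include <stdio.h>

/* A toy machine: one accumulator register, two opcodes. Real emulators
 * run the same loop over the full guest instruction set. */
enum { OP_ADDI = 0x01, OP_HALT = 0xff };

int main(void)
{
        unsigned char prog[] = { OP_ADDI, 40, OP_ADDI, 2, OP_HALT };
        unsigned int pc = 0, acc = 0;

        for (;;) {
                unsigned char op = prog[pc++];  /* fetch */
                switch (op) {                   /* decode */
                case OP_ADDI:                   /* execute */
                        acc += prog[pc++];
                        break;
                case OP_HALT:
                        printf("acc = %u\n", acc);
                        return 0;
                }
        }
}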

Hardware virtualization also supports unmodified guests just like hardware emulation does but provides near-native performance because the hypervisor doesn't need to do binary translation. With the virtualization extensions in modern x86 CPUs, the only thing you need to do is put the CPU in "virtualization mode" and let the guest operating system run on it. The hypervisor, of course, still needs to provide I/O emulation if you want to run anything serious under it. On Linux, KVM provides the application binary interface for controlling the virtualization extensions. KVM is usually used in combination with QEMU, which handles the rest of the hardware emulation to implement a full hypervisor.
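
To make "put the CPU in virtualization mode and let the guest run on it" concrete, here is a minimal sketch of driving KVM from userspace. It runs a one-instruction real-mode guest until it executes hlt; all error handling is omitted, and it assumes /dev/kvm and a CPU with virtualization extensions:

#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
        int kvm = open("/dev/kvm", O_RDWR);
        int vm = ioctl(kvm, KVM_CREATE_VM, 0);

        /* Back 4 KB of guest physical memory with anonymous host memory
         * and put a single hlt instruction (0xf4) at guest address 0. */
        unsigned char *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        mem[0] = 0xf4;

        struct kvm_userspace_memory_region region = {
                .guest_phys_addr = 0,
                .memory_size = 4096,
                .userspace_addr = (unsigned long)mem,
        };
        ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

        int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);
        struct kvm_run *run = mmap(NULL,
                                   ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0),
                                   PROT_READ | PROT_WRITE, MAP_SHARED,
                                   vcpu, 0);

        /* Point the vcpu at our code: real mode, cs:ip = 0:0. */
        struct kvm_sregs sregs;
        ioctl(vcpu, KVM_GET_SREGS, &sregs);
        sregs.cs.base = 0;
        sregs.cs.selector = 0;
        ioctl(vcpu, KVM_SET_SREGS, &sregs);

        struct kvm_regs regs = {
                .rip = 0,
                .rflags = 0x2,  /* bit 1 is reserved and must be set */
        };
        ioctl(vcpu, KVM_SET_REGS, &regs);

        /* Enter the guest; the hlt instruction causes an exit back to us. */
        ioctl(vcpu, KVM_RUN, 0);
        printf("exit_reason = %u (KVM_EXIT_HLT = %u)\n",
               run->exit_reason, KVM_EXIT_HLT);
        return 0;
}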

Paravirtualization is a technique for supporting high-performance virtual machines on virtualization-unfriendly architectures. It requires the guest operating system to be ported to the paravirtualization ABI. On Linux, this ABI is called "paravirt-ops" and it's used by the Xen hypervisor, for example. Although I tend to think of Xen as a historical accident (the project started before Intel and AMD introduced their virtualization extensions), it is being used in large-scale environments such as Amazon EC2 and its scalability benefits seem to be real.

While the boundaries between the different virtualization techniques are pretty clear on the CPU virtualization side, for real-world problems they tend not to be so black and white. You almost certainly want to use paravirtualized device drivers with a hardware virtualization solution such as KVM if you're interested in high-performance I/O, and you want to make sure a paravirtualization solution such as Xen takes advantage of the CPU virtualization extensions when they're available. That shouldn't come as a huge surprise: as in all technology, practice beats theory every time.