A blog about programming topics, mostly JVM, Linux kernel, and x86 related things.

Sunday, April 25, 2010

A Short Introduction to the x86 Instruction Set Encoding

A lot of people think that the x86 instruction set is arcane and complex. While it is certainly true that the instruction set is humongous, it is rather straightforward when you look at the actual instruction set encoding. As a matter of fact, given that the encoding hasn't changed much since the 16-bit 8086 days, it's pretty amazing how cleanly the instruction set has evolved into the modern 64-bit architecture that x86 is today.

Instruction set encoding is the binary representation of instructions (e.g. addition, loads, stores) that the CPU executes. The main difference between x86 and modern RISC architectures is that the former uses variable-length encoding whereas the latter (e.g. ARM and PowerPC) use fixed-length encoding. That is, on x86, an instruction can be anywhere from 1 to 15 bytes in length whereas on ARM each instruction is exactly 16 bits or 32 bits, depending on which mode the CPU is in.

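Looking at Chapter 2 ("Instruction Format") of Intel manual Volume 2A, we can see a figure of the instruction format that looks roughly like this:

  Prefixes     Opcode      ModR/M       SIB          Displacement     Immediate
  0-4 bytes    1-3 bytes   0 or 1 byte  0 or 1 byte  0, 1, 2 or 4     0, 1, 2 or 4
                                                     bytes            bytes

  ModR/M:  Mod   (bits 7-6) | Reg/Opcode (bits 5-3) | R/M  (bits 2-0)
  SIB:     Scale (bits 7-6) | Index      (bits 5-3) | Base (bits 2-0)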

What's interesting about this format is that it's identical all the way from 8086 to x86-64, with the exception of the SIB byte and the REX prefix, which we'll look at in more detail later. Intel, AMD, and other vendors have added more instructions but the above encoding has survived over the years. Pretty neat, huh? Much of the apparent complexity comes from the early days when engineers tried to squeeze instructions into as small a space as possible.

The most important part of the format is the opcode. It is 1 to 3 bytes in length with an optional 3-bit opcode extension in the ModR/M byte. An opcode (operation code) specifies the operation to be performed. For example, the ret instruction that is used to return from a procedure is encoded as a single byte: 0xc3 (near return) or 0xcb (far return).
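
As a rough illustration of the 1-to-3-byte opcode, here is a minimal Python sketch that tells the three lengths apart by looking at the escape bytes: two-byte opcodes start with 0x0f, and three-byte opcodes start with 0x0f 0x38 or 0x0f 0x3a. It assumes that any prefixes have already been stripped and it ignores VEX-encoded instructions; the function name opcode_length is made up for this example.

```python
def opcode_length(code):
    """Return the opcode length in bytes (1, 2, or 3) for prefix-free code."""
    if not code:
        raise ValueError("no opcode bytes")
    if code[0] != 0x0F:
        return 1                      # plain one-byte opcode, e.g. 0xc3 (ret)
    if len(code) < 2:
        raise ValueError("truncated opcode after the 0x0f escape byte")
    if code[1] in (0x38, 0x3A):
        return 3                      # 0x0f 0x38 xx or 0x0f 0x3a xx
    return 2                          # 0x0f xx


print(opcode_length(bytes.fromhex("c3")))      # ret (near return)
print(opcode_length(bytes.fromhex("0fa2")))    # cpuid
print(opcode_length(bytes.fromhex("0f3800")))  # pshufb
```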

Most instructions have one or more operands that they operate on. Instructions that have register operands encode the source and target register either implicitly in the opcode or in the ModR/M byte. Instructions that refer to an operand in memory also use the ModR/M byte and optionally the SIB byte to specify the addressing mode of the operand. Chapter 2 ("Instruction Format") of the Intel manuals specifies the different Mod and R/M bitmasks for different source and target combinations.
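
To make the ModR/M layout concrete, here is a small Python sketch that splits a byte into its Mod (2 bits), Reg (3 bits), and R/M (3 bits) fields. The helper name split_modrm and the REG32 table are just names picked for this example; the register numbering is the standard 32-bit one.

```python
# 32-bit register numbers as used in the Reg and R/M fields
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def split_modrm(byte):
    """Split a ModR/M byte into its (mod, reg, rm) bit fields."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    mod = (byte >> 6) & 0b11    # bits 7-6: addressing mode
    reg = (byte >> 3) & 0b111   # bits 5-3: register or opcode extension
    rm = byte & 0b111           # bits 2-0: register or memory operand
    return mod, reg, rm


# 89 d8 is "mov %ebx, %eax": opcode 0x89 followed by ModR/M 0xd8 = 11 011 000
mod, reg, rm = split_modrm(0xD8)
print(f"mod={mod:02b} reg={REG32[reg]} rm={REG32[rm]}")
```

With Mod set to b11 both operands are registers, so no SIB byte or displacement follows.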

Following the ModR/M and SIB bytes, an instruction has optional address displacement and immediate data bytes. The address displacement is used as an offset on top of a base register (e.g. [EAX]+displacement) and, together with the ModR/M and SIB bytes, it represents the different x86 addressing modes. Immediate data is used for constant operands and relative branch targets.
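
The following Python sketch shows how the displacement bytes turn into a number and how an address of the form base + index*scale + displacement is computed. The function names read_le and effective_address are invented for this example, and the register values are arbitrary; a real CPU of course does this in hardware.

```python
def read_le(data, signed=False):
    """Interpret raw instruction bytes as a little-endian integer."""
    return int.from_bytes(data, "little", signed=signed)


def effective_address(base, index, scale, disp, bits=32):
    """Compute base + index*scale + disp, wrapped to the address size."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + disp) & ((1 << bits) - 1)


disp32 = read_le(bytes.fromhex("78563412"))        # 32-bit displacement
disp8 = read_le(bytes.fromhex("fc"), signed=True)  # 8-bit displacements are sign-extended
print(hex(disp32), disp8)
print(hex(effective_address(base=0x1000, index=0x10, scale=2, disp=disp8)))
```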

Finally, an instruction can have up to one prefix from each of four groups: (1) lock and repeat prefixes, (2) segment override prefixes, (3) operand-size override prefix, and (4) address-size override prefix. In 64-bit mode, REX prefixes are used for specifying the extended GPRs and SSE registers (R8-R15 and XMM8-XMM15), 64-bit operand size, and extended control registers. Each instruction can have only one REX prefix, and it must immediately precede the opcode.
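
Here is a hedged Python sketch of how a decoder might peel off the prefixes in 64-bit mode. The prefix byte values are the standard ones, but the table and function names (LEGACY_PREFIXES, scan_prefixes) are made up for this example, and the sketch does not check that at most one prefix from each group is present.

```python
LEGACY_PREFIXES = {
    0xF0: "lock", 0xF2: "repne", 0xF3: "rep",             # group 1
    0x2E: "cs", 0x36: "ss", 0x3E: "ds", 0x26: "es",       # group 2
    0x64: "fs", 0x65: "gs",
    0x66: "operand-size override",                        # group 3
    0x67: "address-size override",                        # group 4
}


def scan_prefixes(code):
    """Return (legacy prefix names, REX bits or None, offset of the opcode)."""
    i = 0
    legacy = []
    while i < len(code) and code[i] in LEGACY_PREFIXES:
        legacy.append(LEGACY_PREFIXES[code[i]])
        i += 1
    rex = None
    if i < len(code) and 0x40 <= code[i] <= 0x4F:         # REX is 0100WRXB
        b = code[i]
        rex = {"W": (b >> 3) & 1, "R": (b >> 2) & 1, "X": (b >> 1) & 1, "B": b & 1}
        i += 1
    return legacy, rex, i


print(scan_prefixes(bytes.fromhex("4889d8")))   # mov %rbx, %rax (REX.W set)
print(scan_prefixes(bytes.fromhex("6689d8")))   # mov %bx, %ax
```

Outside 64-bit mode the bytes 0x40-0x4f are not prefixes at all (they encode inc and dec), which is why the sketch is limited to 64-bit mode.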

Let's look at an example of instruction encoding. On x86-64, the following assembly code


movl $0xdeadbeef, 0x12345678(%ebx,%edx,2)


compiles to the following byte sequence:


67 c7 84 53 78 56 34 12 ef be ad de


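Looking at the breakdown of the above byte sequence:

  67             address-size override prefix
  c7             opcode
  84             ModR/M (Mod = 10, Reg = 000, R/M = 100)
  53             SIB (Scale = 01, Index = 010, Base = 011)
  78 56 34 12    address displacement (0x12345678)
  ef be ad de    immediate data (0xdeadbeef)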

We can see that the opcode byte is 0xc7, which is one of the encodings for the mov instruction. The l-postfix in the assembly code selects 32-bit operands, which is the default operand size in 64-bit mode, so no operand-size prefix is needed. The memory address, however, is formed from the 32-bit registers EBX and EDX, so we need the address-size override prefix 0x67 to switch from the default 64-bit addressing to 32-bit addressing. The same prefix can also be used to switch between 16-bit and 32-bit addressing in legacy mode. The address displacement and immediate data bytes are the little-endian representations of the respective constants we used in the assembly code.

Finally, looking at the ModR/M byte, the value of Mod is b10 and R/M is b100, which means that a SIB byte and a 32-bit displacement follow the ModR/M byte. The value of Reg is b000, which in this case is an opcode extension for 0xc7 that specifies that the source operand is an immediate. In the SIB byte, Scale is b01 (a factor of 2), Index is b010 (EDX), and Base is b011 (EBX), which translates to [EBX+EDX*2] as per Table 2-3 ("32-bit addressing forms with the SIB byte") of the Intel manual Volume 2A.
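
To tie the pieces together, here is a minimal Python sketch that decodes the byte sequence above back into AT&T-style assembly. It is deliberately narrow: it only accepts this one instruction form (0x67 prefix, opcode 0xc7 with /0, Mod b10, and a SIB byte), and the function name decode_mov_imm32 is made up for this example. A real decoder has to handle many more cases.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_mov_imm32(code):
    """Decode '67 c7 /0' with Mod=10 and a SIB byte into AT&T syntax."""
    i = 0
    if code[i] != 0x67:
        raise ValueError("sketch expects the address-size override prefix")
    i += 1
    if code[i] != 0xC7:
        raise ValueError("sketch only handles opcode 0xc7")
    i += 1
    modrm = code[i]
    i += 1
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 0b111, modrm & 0b111
    if (mod, reg, rm) != (0b10, 0b000, 0b100):
        raise ValueError("sketch only handles Mod=10, Reg=000, R/M=100")
    sib = code[i]
    i += 1
    scale = 1 << (sib >> 6)           # Scale field 00, 01, 10, 11 -> 1, 2, 4, 8
    index = (sib >> 3) & 0b111
    base = sib & 0b111
    if index == 0b100:
        raise ValueError("Index=100 means 'no index'; not handled here")
    disp = int.from_bytes(code[i:i + 4], "little", signed=True)
    i += 4
    imm = int.from_bytes(code[i:i + 4], "little")
    return f"movl ${imm:#x}, {disp:#x}(%{REG32[base]},%{REG32[index]},{scale})"


code = bytes.fromhex("67 c7 84 53 78 56 34 12 ef be ad de")
print(decode_mov_imm32(code))   # prints: movl $0xdeadbeef, 0x12345678(%ebx,%edx,2)
```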

The above example explains the different parts of the instruction format pretty well. Much of the perceived complexity comes from the fact that much information, such as operand types and special cases, is encoded implicitly in the opcode bytes. That doesn't mean that the x86 instruction set is arbitrary, far from it. Unfortunately, much of this information is scattered across the manuals and becomes apparent only if you try to decode the instruction set.

For more information, check out Volumes 2A and 2B of the Intel manuals. You might want to check out the x86_decode_insn() function in the arch/x86/kvm/emulate.c file of the Linux kernel sources for a real-world partial x86 decoder that's used by KVM. A work-in-progress x86 decoder can also be found in the libcpu sources in the arch/x86/x86_decode.cpp file for those who are interested in getting their hands dirty.

10 comments:

Shawn Tan said...

I would not call the evolution "clean" as you did. Try plotting out an opcode map of the instructions and you will see how things are scattered all over the map. It's like they chose the first available slot to put instructions in.

Pekka Enberg said...

@Shawn,

I'm not sure what you mean exactly but if you look at the original 8086 instruction format, you can see that the opcode byte had a structured format of its own (the d and w bits) that has since disappeared. That probably explains some of the reasoning why instructions are distributed across the opcode map as they are.

Anonymous said...

You may find this reference useful:
http://ref.x86asm.net

Anonymous said...

I think you missed something obvious in your wonder at how the encoding hasn't changed for so long: it's a variable-length encoding. So of course you can encode thousands of instructions without changing the format. There is no format!

It's like being amazed that you can say so much with "subject verb object" in English. Of course you can, that's just a template.

Pekka Enberg said...

@Anonymous,

You have a point there. I don't find the fact that vendors were able to add more instructions amazing, though. What I think is pretty neat is the fact that they were able to transform 16-bit 8086 architecture into modern x86-64 while keeping the instruction encoding stable and relatively clean.

Anonymous said...

The actual evolution is more complex than people think.

The 16-bit code that underlies the 32-bit stuff is actually a band-aid on top of older 8/16 bit stuff from the 8080/Z-80 that came from the old 8008 that itself came from the 4004.

Go ask yourself sometime why the registers are ordered the way they are in the binary opcodes.

Ax used to be the accumulator in the 4004. Yes, the latest i7, i5, and i3 processors are still compatible with the original 4-bit prototypes designed back in the late sixties.

Do some opcode archeology on the x86 line and you find a family tree easily as complex as our own, going back not only multiple versions each generation but multiple iterations that retained backwards compatibility. It is quite an achievement that your Windows 7 machine runs code compatible with some of the 4-bit processors used in early automated controllers such as those in microwaves.

Everything comes from something, you just have to keep an open enough mind to see the connection.

RBerenguel said...

I still remember my fiddlings with 8086 Assembly: the book (or something similarly named, it was from a friend) when I was 12 or 13. That was hard... and probably useless. Next time I asm-ed it was far better and enjoyable.

But your post made me remember those idle times a long time ago, when I cared about those d-w bits, and how the instructions got translated into binary.

Odd times, they were.

Anonymous said...

Thanks for posting.

I have been looking to get into ASM programming.

Miaubiz said...

sweet stuff Pekka!

business reasons have forced the engineers at amd, intel & co. to push the new stuff into micro-ops, internal cpu registers etc.

@Shawn, since smaller is smaller, if any slots open up in the one or two byte opcodes, it's better to reuse them rather than waste cache and bandwidth on a new larger instruction

so far economies of scale, i.e. huge r&d investments enabled by a universal platform, have kept the x86 technologically up to par as well. we'll see what ARM can do now that they're challenging x86 also in performance.

David Mott said...

@Shawn Tan
The x86 instruction set is octal, not hex. Mapping opcodes in octal shows how clean and logical the encoding is. For details see http://reocities.com/SiliconValley/heights/7052/opcode.txt