A lot of people think that the x86 instruction set is arcane and complex. While it is certainly true that the instruction set is humongous, it is rather straight-forward when you look at the actual instruction set encoding. As a matter of fact, given that the encoding hasn't changed much since the 16-bit 8086 days, it's pretty amazing how cleanly the instruction set has been able to evolve into the modern 64-bit architecture x86 is today.
Instruction set encoding is the binary representation of instructions (e.g. addition, loads, stores) that the CPU executes. The main difference between x86 and modern RISC architectures is that the former uses variable-length encoding and latter architectures (e.g. ARM and PowerPC) use fixed-length encoding. That is, on x86, an instruction can be anywhere from 1 to 15 bytes in length whereas on ARM each instruction is exactly 16 bits or 32 bits depending in which mode the CPU is.
Looking at Chapter 2 ("Instruction Format") of Intel manual Volume 2A, we can see a figure of the instruction format that looks roughly like this:
What's interesting about this format is that it's identical all the way from 8086 to x86-64 with the exception of SIB byte and REX prefix that we'll go in more detail later. Intel, AMD, and other vendors have added more instructions but the above encoding has survived over the years. Pretty neat, huh? Much of the apparent complexity comes from the early days when engineers tried to squeeze instructions into as small space as possible.
The most important part of the format is the opcode. It is 1 to 3 bytes in length with an optional 3-bit opcode extension in the ModR/M byte. An opcode (operation code) specifies the operation to be performed. For example, the
ret instruction that is used to return from a procedure is encoded as a single byte
0xc3 (near return) or
0xcb ("far return").
Most instructions have one or more operands that they operate on. Instructions that have register operands encode the source and target register either implicitly in the opcode or in the ModR/M byte. Instructions that refer to an operand in memory also use the ModR/M byte and optionally the SIB byte to specify the addressing mode of the operand. Chapter 2 ("Instruction Format") of the Intel manuals specify the different Mod and R/M bitmasks for different source and target combinations.
Following the ModR/M and SIB bytes, an instruction has optional address displacement and immediate data bytes. The address displacement is used as an offset on top of a base register (e.g.
[EAX]+displacement) and together with the ModR/M byte and SIB byte, they represent the different x86 addressing modes. Immediate data is used for constant operands and relative branch target.
Finally, an instruction can have one instruction prefixes from each of the four groups: (1) lock and repeat prefixes, (2) segment override prefixes, (3) operand-size override prefix, and (4) address-size override prefix. In 64-bit mode, REX prefixes are used for specifying GPRs and SSE registers, 64-bit operand size, and extended control registers. Each instruction can have only one REX prefix at a time.
Lets look at an example of instruction encoding. On x86-64, the following assembly code
movl $0xdeadbeef, 0x12345678(%ebx,%edx,1)
compiles to the following byte sequence:
67 c7 84 53 78 56 34 12 ef be ad de
Looking at the breakdown of the above byte sequence:
We can see that the opcode byte is
0xc7which is one of the encodings for the
movinstruction. As we used the l-postfix (for 32-bit operands) in the assembly code, we need the address-size override prefix
0x67to force 32-bit operands in 64-bit mode. The same prefix can also be used to switch between 16-bit and 32-bit addressing in legacy mode. The address displacement and immediate data bytes are little endian representation of the respective constants we used in the assembly code.
Finally, looking at the ModR/M byte, the value of Mod is
b10and R/M is
b100which means that SIB follows the ModR/M byte. The value of Reg is
b00which is in this case an opcode extension for
0xc7that specifies that the source operand is an immediate. In the SIB byte, Scale is
b01, Index is
b010and Base is
b011which translates to
[EBX+EDX*2]as per Table 2-3 ("32-bit addressing forms with the SIB byte") of the Intel manual Volume 2A.
The above example explains the different parts of instruction format pretty well. Much of the perceived complexity comes from the fact that much information, such as operand types and special cases, is encoded implicitly in the opcode bytes. That doesn't mean that the x86 instruction set is arbitrary, far from it. Unfortunately much of this information is scattered across the manuals and comes apparently only if decode the instruction set.
For more information, check out Volumes 2A and 2B of the Intel manuals. You might want to check out the
x86_decode_insn()function in the arch/x86/kvm/emulate.c file of the Linux kernel sources for a real-world partial x86 decoder that's used by KVM. A work-in-progress x86 decoder can also be found in libcpu sources in the arch/x86/x86_decode.cpp file for those that are interested in getting their hands dirty on hacking.