This post shows how to write a basic JIT compiler for the Python bytecode, from scratch, using nothing but stock Python modules.
At the end of this post, we will be able to compile branchless Python functions that perform arithmetic on signed 64-bit values:
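    # The running example in this post (reconstructed from the results shown
    # further down): with foo(2, 3) it computes 2*2 - 3*3 = -5.
    def foo(a, b):
        return a*a - b*b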
Our strategy is to translate Python bytecode to an intermediate representation, which will then be optimized before being emitted as x86-64 machine code. So the first part will be to understand how the Python bytecode works.
You can see the raw bytecode for the foo function at the top in Python 3 by inspecting its code object:
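    >>> foo.__code__.co_code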
In Python 2.7, that would be
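    >>> foo.func_code.co_code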
Because the two bytecode sequences are nearly identical, it doesn't matter which one we use for the explanation. I've picked Python 2.7 for the remainder of this post, but the GitHub code supports both 2.7 and 3+.
Let's have a look at the disassembly of foo (shown here for Python 2.7; the details vary slightly between versions):
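    >>> import dis
    >>> dis.dis(foo)
      2           0 LOAD_FAST                0 (a)
                  3 LOAD_FAST                0 (a)
                  6 BINARY_MULTIPLY
                  7 LOAD_FAST                1 (b)
                 10 LOAD_FAST                1 (b)
                 13 BINARY_MULTIPLY
                 14 BINARY_SUBTRACT
                 15 RETURN_VALUE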
The leftmost number 2 is the Python source code line number. The next column contains the bytecode offsets. We clearly see that the LOAD_FAST instruction takes three bytes: one for the opcode (which instruction it is) and two for a 16-bit argument. That argument is zero, referring to the first function argument.
CPython, like the JVM, CLR, Forth and many others, is implemented as a stack
machine. All the bytecode instructions operate on a stack of
objects. For example,
LOAD_FAST will push a reference to the variable
on the stack, while
BINARY_MULTIPLY will pop off two, multiply them together
and put their product on the stack. For our purposes, we will treat the stack
as holding values.
A beautiful property of postfix systems is that operations can be serialized. For example, to compute an infix expression like 2*2 - 3*3, we have to jump back and forth, calculating the products before subtracting. But in a postfix system, we need only scan forwards. For example, the above expression can be translated to Reverse Polish Notation (RPN) using the shunting-yard algorithm:
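    2 2 * 3 3 * -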
Moving from left to right, we push 2 on the stack, then another 2. For the
operation we pop them both off and push their product 4. Push 3 and 3, pop them
off and push their product 9. The stack will now contain 9 on the top and 4 at
the bottom. For the final subtraction, we pop them off, perform the subtraction
and push the result -5 on the stack.
Now, imagine that the expression was actually written in a programming language, as nested function calls:
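    subtract(multiply(2, 2), multiply(3, 3))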
The thing is, in postfix form, the evaluation order becomes explicit:
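    multiply(2, 2)      # pushes 4
    multiply(3, 3)      # pushes 9
    subtract()          # pops 9 and 4, pushes 4 - 9 = -5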
The multiply and subtract functions find their arguments on the stack. For subtract, the two arguments consist of the products of 2*2 and 3*3.
The use of a stack makes it possible to execute instructions linearly, and this is essentially how stack machines operate. With that, you will probably understand most of the CPython opcodes and its interpreter loop.
We will now translate the bytecode instructions to an intermediate representation (IR), that is, a form suitable for performing things like analysis, translation and optimization. Ours will be blissfully naive: we will forego things like static single assignment (SSA) and register allocation for the sake of simplicity.
Our IR will consist of pseudo-assembly instructions in a list, with a faint resemblance to three-address code (TAC). For example:
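    ("mov", "rax", "rbx")     # mov rax, rbx
    ("push", "rax", None)     # push rax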
Contrary to TAC, we put the operation first, followed by the destination and source
registers. We use
None to indicate unused registers and arguments. It would
be a very good idea to use unique, abstract register names like reg1, reg2 and so on, because that facilitates register allocation. That's out of scope for this post, though.
We will reserve registers RAX and RBX for menial work like arithmetic, pushing and popping. RAX must also hold the return value, because that's the convention. The CPU already has a stack, so we'll use that as our data stack mechanism.
Registers RDI, RSI, RDX and RCX will be reserved for variables and arguments. Per AMD64 convention, we expect to see function arguments passed in those registers, in that order. In real programs, the matter is a bit more involved.
Constants in the bytecode can be looked up through the code object's co_consts tuple:
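    >>> foo.func_code.co_consts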
We can now build a compiler that translates Python bytecode to our intermediate representation. Its general form will be
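    # A sketch of the general form; class and method names here are
    # assumptions, not necessarily those used in the accompanying GitHub code.
    import dis

    class Compiler(object):
        def __init__(self, bytecode, constants):
            self.bytecode = bytecode      # a list of integer byte values
            self.constants = constants    # the code object's co_consts
            self.index = 0                # current position in the bytecode

        def fetch(self):
            # Grab the next raw byte and advance the position.
            byte = self.bytecode[self.index]
            self.index += 1
            return byte

        def decode(self):
            # Fetch the opcode, look up its name and fetch its argument, if any.
            opcode = self.fetch()
            argument = None
            if opcode >= dis.HAVE_ARGUMENT:
                # In Python 2.7, arguments are 16-bit values, low byte first.
                low, high = self.fetch(), self.fetch()
                argument = (high << 8) | low
            return dis.opname[opcode], argument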
It takes some bytecode and constants, and keeps a running index of the current
bytecode position. It is wise to split the translation up into fetch and decode stages: the fetch method simply retrieves the next byte of bytecode, while decode fetches the opcode, looks up its name and fetches any arguments.
We also need to look up which register holds which variable:
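    def variable(self, number):
        # (Inside the Compiler class.) Map a local variable index to one of
        # the argument registers reserved above.
        return ("rdi", "rsi", "rdx", "rcx")[number]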
The main compilation method will look something like this:
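    def compile(self):
        # (Inside the Compiler class.) A sketch: translate one opcode at a
        # time into IR tuples; the remaining opcodes are filled in below.
        instructions = []
        while self.index < len(self.bytecode):
            opname, argument = self.decode()
            if opname == "LOAD_FAST":
                instructions.append(("push", self.variable(argument), None))
            # ... more opcodes ...
        return instructions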
Here you can already see how we translate
LOAD_FAST. We just push the
corresponding register onto the stack. So, if the function we are compiling has
one argument, the bytecode will refer to the zeroth variable. Through the
variable method, we see that this is register RDI. So it will output a push rdi instruction.
The STORE_FAST instruction does the reverse. It pops a value off the stack
and stores it in the argument register:
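    # (Continuing the compile sketch above.)
    elif opname == "STORE_FAST":
        instructions.append(("pop", "rax", None))
        instructions.append(("mov", self.variable(argument), "rax"))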
A binary instruction will pop two values off the stack. For example, BINARY_MULTIPLY:
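    # (Continuing the compile sketch above.)
    elif opname == "BINARY_MULTIPLY":
        instructions.append(("pop", "rax", None))
        instructions.append(("pop", "rbx", None))
        instructions.append(("imul", "rax", "rbx"))
        instructions.append(("push", "rax", None))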
That's just about it.
LOAD_CONST will use a special instruction for storing
immediate values (i.e., constant integers). The entire method then looks more or less like this:
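    def compile(self):
        # A sketch of the whole method, collecting the cases above (inside the
        # Compiler class; the version in the repository differs in details).
        instructions = []
        while self.index < len(self.bytecode):
            opname, argument = self.decode()

            if opname == "LOAD_FAST":
                instructions.append(("push", self.variable(argument), None))
            elif opname == "STORE_FAST":
                instructions.append(("pop", "rax", None))
                instructions.append(("mov", self.variable(argument), "rax"))
            elif opname == "LOAD_CONST":
                # Load the constant as an immediate value, then push it.
                instructions.append(("immediate", "rax", self.constants[argument]))
                instructions.append(("push", "rax", None))
            elif opname == "BINARY_MULTIPLY":
                instructions.append(("pop", "rax", None))
                instructions.append(("pop", "rbx", None))
                instructions.append(("imul", "rax", "rbx"))
                instructions.append(("push", "rax", None))
            elif opname == "BINARY_ADD":
                instructions.append(("pop", "rax", None))
                instructions.append(("pop", "rbx", None))
                instructions.append(("add", "rax", "rbx"))
                instructions.append(("push", "rax", None))
            elif opname == "BINARY_SUBTRACT":
                # The top of the stack holds the right-hand operand.
                instructions.append(("pop", "rbx", None))
                instructions.append(("pop", "rax", None))
                instructions.append(("sub", "rax", "rbx"))
                instructions.append(("push", "rax", None))
            elif opname == "RETURN_VALUE":
                # The return value goes in RAX, per the calling convention.
                instructions.append(("pop", "rax", None))
                instructions.append(("ret", None, None))
            else:
                raise NotImplementedError(opname)
        return instructions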
We can now compile the
foo function at the top to our IR.
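Reconstructed by hand from the translation rules above (so the exact output may differ slightly from the post's code), the result comes out roughly like this:

    push rdi
    push rdi
    pop rax
    pop rbx
    imul rax, rbx
    push rax
    push rsi
    push rsi
    pop rax
    pop rbx
    imul rax, rbx
    push rax
    pop rbx
    pop rax
    sub rax, rbx
    push rax
    pop rax
    ret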
Wow, that sure is a lot of stack operations!
We're going to perform peephole optimizations on our IR. Such optimizations work on only a few instructions at a time, and translate them to equivalent but better code. We will go for fewer instructions.
In the IR above, we see an obvious improvement. Instructions like push rdi immediately followed by pop rax can surely be translated to a single mov rax, rdi.
Let's write a function for that. We'll also eliminate nonsensical instructions like mov rax, rax.
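A sketch of a single pass, assuming the IR tuples from earlier (the extended variant discussed below would be added on top of this):

    def optimize(ir):
        optimized = []
        index = 0
        while index < len(ir):
            name, dst, src = ir[index]
            # Drop nonsensical moves like mov rax, rax.
            if name == "mov" and dst == src:
                index += 1
                continue
            # Collapse a push directly followed by a pop into a single mov.
            if name == "push" and index + 1 < len(ir) and ir[index + 1][0] == "pop":
                optimized.append(("mov", ir[index + 1][1], dst))
                index += 2
                continue
            optimized.append((name, dst, src))
            index += 1
        return optimized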
Instead of showing that this actually works, we'll just throw in a few other optimizations. Just note that writing such optimizations is deceptively simple: it's very easy to do something that seems to make sense, only to see your program crash.
Another construct of the same flavor can likewise be collapsed into fewer instructions, so we'll add that as well.
Finally, the short-circuiting of push and pop can be extended so that it works over one or several unrelated instructions. Take a sequence like the following (the registers here are chosen to match the rewrite below):
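    push rax
    mov rsi, rax
    pop rbx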
Since RAX isn't modified in
mov rsi, rax, we can just write
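    mov rsi, rax
    mov rbx, rax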
We have to be careful that the middle instruction isn't a push, though, so we'll add a check for that.
There is nothing wrong with supporting an indefinite number of middle instructions, but we won't do that here.
With these rules in place, let's try to optimize the above IR. The complete optimization function is just a combination of the patterns described here.
Running the IR code above through one optimization pass saves us four instructions. But we still have a few spots left: the first three instructions should be optimizable. So let's run two passes on the IR.
We've now saved seven instructions. Our optimizer won't be able to improve this
code any further. We could add even more peephole optimizations, but another
good technique would be to use a real register allocator so that we use the
full spectrum of available registers. The IR compiler could then just assign
values to unique registers like reg1, reg2 and so on, and then the allocator
would choose how to populate the available registers properly. This is actually
a hot topic for research, and especially for JIT compilation because the
general problem is NP-complete.
So, we have translated Python bytecode to our IR and we have done some optimizations on it. We are finally ready to assemble it to machine code!
Our approach will be to write an assembler class that emits instructions. If we use the same name for the emitter methods as in the IR, and use the same signature for all, then we can just blindly assemble the whole IR in a short loop:
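    # A sketch: ir holds the (name, destination, source) tuples from before.
    assembler = Assembler()
    for name, a, b in ir:
        emit = getattr(assembler, name)
        emit(a, b)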
If the instruction is
mov rax, rbx, then
emit will point to the assembler's mov method, and the call will therefore be emit("rax", "rbx").
Let's write an assembler class. We'll copy the code for
little_endian and import
create_block from the code in the previous post.
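A sketch of the skeleton; for illustration, the raw bytes are just collected in a bytearray here, while the real class writes them into the executable block obtained from create_block:

    class Assembler(object):
        def __init__(self):
            self.block = bytearray()    # the raw machine code bytes

        def emit(self, *values):
            # Append raw bytes to the code block.
            self.block.extend(values)

        def ret(self, a, b):
            # retq is encoded as the single byte 0xc3.
            self.emit(0xc3)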
assembler.ret(None, None) will set the first machine code byte to
0xc3. That's how
retq is encoded. To find the encoding of other instructions,
I mainly used the NASM assembler. Putting the following in a file
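    ; A reconstructed probe file; the original contents are not shown in this
    ; extract, but a couple of register-to-register movs is all we need.
    section .text
    mov rax, rbx
    mov rbx, rcx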
I assembled it with nasm -felf64 (or -fmacho64 for macOS) and dumped the raw machine code to inspect the encoding.
It seems like the 64-bit
movq (which we just call
mov) is encoded with
0x48 0x89 with the source and destination registers stored in the
last byte. Digging into a few manuals, we see that they are encoded using
three bits each. We'll write a method for that.
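Something along these lines; the helper name and the register table are assumptions:

    # (Inside the Assembler class.)
    REGISTERS = ("rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi")

    def registers(self, source, destination):
        # Each register is numbered 0 through 7; pack the source into bits 3-5
        # and the destination into bits 0-2 of the ModRM byte.
        return (self.REGISTERS.index(source) << 3) | self.REGISTERS.index(destination)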
For the movq instruction, we can now write:
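    def mov(self, destination, source):
        # (Inside the Assembler class.) REX.W prefix, the 0x89 opcode, then a
        # ModRM byte with mod=11 for a register-to-register move.
        self.emit(0x48, 0x89, 0xc0 | self.registers(source, destination))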
The rest of the instructions are done in a similar manner, except for moving immediate (i.e., constant) values into registers.
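That method might look roughly like this (using the struct module here; the post reuses its own little_endian helper instead):

    import struct

    def immediate(self, destination, number):
        # (Inside the Assembler class.) 0x48 followed by 0xb8 + register moves
        # a full 64-bit immediate; the value itself follows in little-endian order.
        self.emit(0x48, 0xb8 | self.REGISTERS.index(destination))
        self.emit(*bytearray(struct.pack("<q", number)))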
The only special thing about the last method is that we have to use a different prefix and encode the number in little-endian format.
Finally, we can tie everything together. Given the foo function at the top, we first extract its Python bytecode and constants, using ord to map the bytecode bytes to integers. We then compile that to IR, perform a few optimization passes, assemble the result to native code and make the memory block executable. Last, we use ctypes to set the correct signature and cast the code to a callable Python function. We can get the number of arguments with co_argcount, and we treat input arguments as signed 64-bit integers.
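Assembled from the sketches above, the whole pipeline might look roughly like this. Note that create_block from the previous post is replaced by a plain mmap call here, and all the helper names are assumptions:

    import ctypes, mmap

    _blocks = []   # keep the executable memory alive

    def compile_native(function):
        # Python 2.7 flavour: map the raw bytecode characters to integers.
        code = function.func_code
        bytecode = [ord(byte) for byte in code.co_code]

        # Compile to IR and run a couple of peephole passes.
        ir = Compiler(bytecode, code.co_consts).compile()
        for _ in range(2):
            ir = optimize(ir)

        # Assemble to native code.
        assembler = Assembler()
        for name, a, b in ir:
            getattr(assembler, name)(a, b)

        # Put the machine code in executable memory.
        block = mmap.mmap(-1, mmap.PAGESIZE,
                          prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
        block.write(bytes(assembler.block))
        _blocks.append(block)

        # Cast it to a callable taking and returning signed 64-bit integers.
        address = ctypes.addressof(ctypes.c_char.from_buffer(block))
        signature = ctypes.CFUNCTYPE(ctypes.c_int64,
                                     *[ctypes.c_int64] * code.co_argcount)
        return signature(address)

    print(compile_native(foo)(2, 3))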
It prints -5. Yay!
To disassemble the code, we can use the Capstone disassembler right
within Python. It's not a built-in module, so you need to install it yourself.
Or you can break into the Python process with a debugger. With Capstone, it goes something like this:
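    # assembler.block holds the emitted machine code from the sketches above.
    import capstone

    md = capstone.Cs(capstone.CS_ARCH_X86, capstone.CS_MODE_64)
    for instruction in md.disasm(bytes(assembler.block), 0):
        print("%s %s" % (instruction.mnemonic, instruction.op_str))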
You can try out different functions, for example:
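    def bar(a, b, c):
        # An arbitrary branchless example; any signed 64-bit arithmetic on up
        # to four arguments should work with the scheme described here.
        return a*b + c - 10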
You may wonder how fast this runs. The answer is: slow. The reason is that there is inherent overhead when calling into native code with ctypes, which needs to convert arguments and so on. I also believe (but haven't double-checked) that it saves some registers, because per the convention, we should have restored RBX before exiting.
But it would be interesting to compile larger functions with native function calls, loops and so on, and compare that with Python. At that point, I believe you'll see the native code going much faster.
On /r/compsci there was a comment that this really isn't just-in-time compilation until there is some mechanism that automatically swaps out a function with a compiled version. So let's try to do something about that.
A pretty obvious approach is to require a little help from the user. Use a decorator. Recall that a decorator is really just a function that gets the freshly defined object as the first argument. If we install a little closure there that remembers the original function, we can then literally compile just-in-time when it is called for the first time. Again, only the decorated functions that are actually called will be compiled to native code.
We'll start with a decorator that doesn't do anything yet:
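    def jit(function):
        # The decorator name is an assumption. For now it does nothing: it
        # simply hands the original function back unchanged.
        return function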
With this, we can mark functions that we want to be compiled:
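    @jit
    def foo(a, b):
        return a*a - b*b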
So the inner
frontend function then just needs to check if the function has
already been compiled. If not, compile it and install it as the local function
reference. If the compilation fails, don't swap out anything. The complete
decorator looks something like this:
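    def jit(function):
        # A sketch with assumed names; the decorator in the repository differs
        # in details. Compile lazily on the first call, and fall back to the
        # original Python function if compilation fails.
        state = {"compiled": None}

        def frontend(*args):
            if state["compiled"] is None:
                try:
                    state["compiled"] = compile_native(function)
                except Exception:
                    state["compiled"] = function
            return state["compiled"](*args)

        return frontend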
See the GitHub repository for an example program that uses this.
I believe this is good for learning, so play around a bit. Try to make a register allocator, for example. Create more peephole optimizations. Add support for calling other functions, or for loops.
With a decorator, you should be able to swap out class methods on the fly with compiled ones. That's exactly what Numba does, but ours is just a drop in the ocean compared to that.
While we took the approach of translating Python bytecode, another good
technique is to use the
ast module to traverse the abstract syntax tree. Ben
Hoyt did exactly that in pyast64, and I strongly recommend taking a look at his excellent code and post.