-- Advanced

Anatomy of a Virtual Machine: From Stack to Registers in Achronyme

#architecture #compilers #vm #achronyme

Abstract

This paper explores the architectural foundations of Virtual Machines, analyzing the structural differences between Stack-based and Register-based models. Through the case study of the Achronyme language, it demonstrates how the inherent bottleneck of Stack VMs in cryptographic operations was mitigated by adopting a Register-based VM (inspired by RISC, Lua 5.0, and Dalvik), significantly reducing the dispatch loop overhead and optimizing cache locality.

When I started building Achronyme over a year ago, I was fresh out of studying compiler theory and virtual machines. In its prototype phase, Achronyme was nothing more than a simple binary functioning as a tree-walk interpreter (evaluating the Abstract Syntax Tree node by node). It was undeniably inefficient, but building it led to what was, for me, the most decisive architectural realization of the project: Achronyme needed a true virtual machine.

In academia, virtual machines are often taught somewhat abstractly, but it isn’t until you actually need to optimize clock cycles in a real-world environment that the full picture becomes clear.

What Exactly is a Virtual Machine (in this context)?

When we talk about a VM in the context of programming languages (like Java’s JVM, JavaScript’s V8, or Erlang’s BEAM), we are not referring to hardware virtualization (like VirtualBox or VMware). We are referring to a Process Virtual Machine: a software layer that emulates an abstract computer architecture designed to execute a specific instruction set (Bytecode), isolating execution from the underlying physical hardware.

The heart of these VMs is the Dispatch Loop (or fetch-decode-execute cycle):

  1. Fetch: Retrieve the next instruction from memory.
  2. Decode: Understand what the operation is and identify its operands.
  3. Execute: Perform the operation and advance the instruction pointer (IP).
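
A minimal sketch of that cycle, written here in C with placeholder opcodes (they are not Achronyme’s actual instruction set), shows how little there is to the loop itself:

#include <stdint.h>

// Placeholder opcodes (illustrative only, not Achronyme's real instruction set).
enum { OP_NOP, OP_HALT };

void run(const uint8_t *code) {
    const uint8_t *ip = code;       // instruction pointer (IP)
    for (;;) {
        uint8_t op = *ip++;         // 1. Fetch, then advance the IP
        switch (op) {               // 2. Decode
        case OP_NOP:                // 3. Execute
            break;
        case OP_HALT:
            return;
        }
    }
}

int main(void) {
    const uint8_t program[] = { OP_NOP, OP_NOP, OP_HALT };
    run(program);
    return 0;
}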

How the VM handles memory and passes variables into the Execute phase defines its architecture. The two dominant families are Stack-based and Register-based.

Stack vs Register VM Architecture

The Stack Bottleneck

When I implemented Achronyme’s first proof of concept, I used a Stack VM, the de facto standard for nascent projects because it is the easiest architecture to generate code for. In this model, instructions do not carry explicit operands; they assume the data they need is at the top of a LIFO (Last-In-First-Out) data structure: the operand stack.

To understand the problem, let’s look at how a simple mathematical addition (a = b + c) is compiled.

Bytecode in a Stack VM:

0001: LOAD_LOCAL 1  // Pushes 'b' to the stack
0002: LOAD_LOCAL 2  // Pushes 'c' to the stack
0003: ADD           // Pops 'c', Pops 'b', adds them, and Pushes the result
0004: STORE_LOCAL 0 // Pops the result and stores it in 'a'
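
To see where the time goes, here is a hedged sketch in C of a stack interpreter running exactly that sequence; the opcode values, the fixed-size stack, and the locals array are illustrative choices, not Achronyme’s actual implementation. Note that the dispatch loop runs four full times for a single addition:

#include <stdint.h>

enum { LOAD_LOCAL, STORE_LOCAL, ADD, HALT };  // illustrative opcode values

void run(const uint8_t *code, int64_t *locals) {
    int64_t stack[64];
    int sp = 0;                               // top of the operand stack
    const uint8_t *ip = code;
    for (;;) {
        uint8_t op = *ip++;                   // one full dispatch per opcode
        switch (op) {
        case LOAD_LOCAL:                      // push locals[slot]
            stack[sp++] = locals[*ip++];
            break;
        case STORE_LOCAL:                     // pop into locals[slot]
            locals[*ip++] = stack[--sp];
            break;
        case ADD: {                           // pop two values, push the sum
            int64_t c = stack[--sp];
            int64_t b = stack[--sp];
            stack[sp++] = b + c;
            break;
        }
        case HALT:
            return;
        }
    }
}

int main(void) {
    // a = b + c, with a in slot 0, b in slot 1, c in slot 2
    const uint8_t program[] = { LOAD_LOCAL, 1, LOAD_LOCAL, 2,
                                ADD, STORE_LOCAL, 0, HALT };
    int64_t locals[3] = { 0, 7, 35 };
    run(program, locals);                     // locals[0] is now 42
    return 0;
}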

The Illusion of Efficiency

The theoretical advantage of this model is code density. Since instructions like ADD don’t need to specify “what” to add (it’s always the top of the stack), instructions usually occupy a single byte (hence the name bytecode).

However, I was in for a surprise: the goal of creating an efficient runtime environment was being thwarted. The intermediate instructions required to move data between local memory and the top of the stack (LOAD and STORE) accumulate incredibly fast.

The cryptographic focus I was giving Achronyme required evaluating dense mathematical functions in hot loops. Cryptographic operations are inherently expensive; adding the cost of running the Dispatch Loop four times just for a simple addition was not viable. Every dispatch adds decoding overhead and a branch that the physical processor struggles to predict.

RISC in Software: Register-Based Machines

I wanted to avoid this overhead, so I researched several industrial projects and discovered register-based virtual machines. It was a moment of revelation. If you like the RISC (Reduced Instruction Set Computer) philosophy in hardware, a Register VM is essentially that same philosophy transplanted into software and applied to executing bytecode.

Instead of using an intermediary stack, a Register VM allocates a block of memory (a Stack Frame) for each function and treats it as an array of “Virtual Registers” (R0, R1, R2...). Instructions explicitly specify which registers to operate on.

Let’s look at the same mathematical addition (a = b + c) in this model:

Bytecode in a Register VM:

// Format: OPCODE Destination, Source1, Source2
0001: ADD R0, R1, R2  // Adds R1 and R2, stores in R0. (All in one instruction)
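
For comparison, here is a similarly hedged C sketch of the corresponding handler. The encoding (an opcode byte followed by three register-index bytes) is invented for readability; real register VMs, as discussed below, usually pack everything into a single word:

#include <stdint.h>

enum { OP_ADD, OP_HALT };                        // illustrative opcodes

void run(const uint8_t *code, int64_t *regs) {   // regs: the frame's register array
    const uint8_t *ip = code;
    for (;;) {
        uint8_t op = *ip++;
        switch (op) {
        case OP_ADD: {                           // ADD dst, src1, src2
            uint8_t dst = *ip++;
            uint8_t s1  = *ip++;
            uint8_t s2  = *ip++;
            regs[dst] = regs[s1] + regs[s2];     // one dispatch, one addition
            break;
        }
        case OP_HALT:
            return;
        }
    }
}

int main(void) {
    int64_t regs[3] = { 0, 7, 35 };              // R0 = a, R1 = b, R2 = c
    const uint8_t program[] = { OP_ADD, 0, 1, 2, OP_HALT };
    run(program, regs);                          // regs[0] is now 42
    return 0;
}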

Industry Validation

I didn’t invent anything new; the transition from Stack to Registers has massive industrial precedents that directly inspired Achronyme:

  1. The Lua 5.0 VM: Perhaps the most famous paper on this is “The Implementation of Lua 5.0” (2005). Lua revolutionized scripting by demonstrating that, even though register instructions are “fatter” (4 bytes in Lua, to accommodate the opcode and three register addresses), the dramatic reduction in the total number of instructions more than compensates for the individual decoding cost (a simplified encoding sketch follows this list).
  2. Dalvik (Android): Before migrating to Ahead-Of-Time compilation with ART, Android ran apps on the Dalvik VM. Google designed it for the constrained ARM processors of early smartphones and chose a register architecture for it. The reason? It mapped much more naturally to the physical registers of the ARM processor and reduced memory traffic.
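
To make the “fatter instruction” point concrete, the sketch below packs and unpacks a 32-bit three-address instruction with shifts and masks. The 8/8/8/8 field layout is simplified for illustration; Lua 5.0’s actual layout uses a 6-bit opcode and differently sized operand fields:

#include <assert.h>
#include <stdint.h>

// Simplified layout (illustrative): | opcode:8 | A:8 | B:8 | C:8 |
static uint32_t encode(uint8_t op, uint8_t a, uint8_t b, uint8_t c) {
    return (uint32_t)op | ((uint32_t)a << 8) | ((uint32_t)b << 16) | ((uint32_t)c << 24);
}

static uint8_t opcode(uint32_t i) { return  i        & 0xFF; }
static uint8_t reg_a (uint32_t i) { return (i >>  8) & 0xFF; }
static uint8_t reg_b (uint32_t i) { return (i >> 16) & 0xFF; }
static uint8_t reg_c (uint32_t i) { return (i >> 24) & 0xFF; }

int main(void) {
    uint32_t add = encode(1, 0, 1, 2);   // e.g. ADD R0, R1, R2 in one word
    assert(opcode(add) == 1 && reg_a(add) == 0 && reg_b(add) == 1 && reg_c(add) == 2);
    return 0;
}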

The Architectural Trade-off in Achronyme

Implementing this change in Achronyme was not trivial and brought an obvious trade-off: bytecode binaries grew in size. Because each instruction must encode the indices of its operands explicitly (usually in 32- or 64-bit words), the 1-byte-per-instruction density of Stack VMs is lost.

But the runtime result more than justified it: the number of dispatched instructions was reduced by almost 50%. By cutting the instructions in half, we halve the number of times the VM must perform the expensive fetch and decode process.

The Hidden Impact: Cache Locality

Beyond the instruction count, I saw the greatest benefit in spatial cache locality. By keeping operands in a contiguous block of memory (the function frame’s register array) and not constantly modifying a LIFO stack pointer, the physical CPU can predict and cache memory (L1/L2 cache) far more efficiently. The cryptographic algorithms inside Achronyme began to iterate at a speed that was simply unattainable with the previous model.
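
One way to picture that contiguity: all of a function’s virtual registers can live in a single flat array inside its call frame, so the values a hot loop keeps touching sit on a handful of adjacent cache lines. The struct below is a hypothetical layout for illustration, not Achronyme’s actual frame type:

#include <stdint.h>

#define FRAME_REGS 32  // hypothetical per-function register budget

// Every operand the function touches lives in one contiguous block:
// 32 registers x 8 bytes = 256 bytes, i.e. only four 64-byte cache lines.
typedef struct {
    const uint8_t *return_ip;        // where execution resumes in the caller
    int64_t        regs[FRAME_REGS]; // the frame's virtual registers
} CallFrame;

int main(void) {
    CallFrame frame = {0};
    frame.regs[1] = 7;
    frame.regs[2] = 35;
    frame.regs[0] = frame.regs[1] + frame.regs[2]; // ADD R0, R1, R2 stays within the block
    return 0;
}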

Conclusion

Building your own virtual machine and writing the compiler that generates its code teaches a brutal lesson that the abstraction of modern software often hides: at some point, the correct architectural decision forces you to stop thinking about language syntax and start understanding how physical hardware breathes.

You have to think about how bytes actually move across memory buses, how the processor interacts with cache lines to avoid a cache miss, and why, in practice, executing one 4-byte instruction is far cheaper than dispatching four 1-byte instructions.

High-performance software is written when, regardless of the level of abstraction you operate at, you intimately understand the silicon machine that ultimately executes it.