A Forth VM’s Performance on Apple Silicon

Wes Brown
8 min read · Dec 19, 2020

The Apple M1 is a very fast CPU for running a Forth VM, competitive with top-of-the-line performers such as the Intel i7–10700k and the AMD Ryzen 9 3950x. We also dig into some interesting aspects and optimizations for an indirect-threaded Forth p-code virtual machine.

Over the last couple of years, I wrote a proprietary 16-bit network-endian p-code VM that models a stack machine. It runs Forth code compiled to a memory image that encapsulates all state: virtual registers, code, and data.

It has a development environment written in Go that provides:

  • Ahead-of-time compilation
  • Stepwise debugging
  • Performance profiling
  • A bytecode-level optimizer that performs tail-call optimization, code inlining, literal-value packing, pattern replacement, and other techniques
  • Metadata-enhanced disassembly
  • A Forth syntax-highlighting editor with definition cross-referencing and jump-to-definition

The VM has multiple implementations and hardware targets, including a C-based VM and a proprietary hardware device. The Go environment itself includes a VM implementation written in Go. It even ran in a web browser using GopherJS!

It produces very compact images: all state and code, including a TCP network stack and the core language runtime, fits in a 4-kilobyte memory image!

The C-based VM also runs very, very fast, typically at 1 opcode every 3 or 4 clock cycles on a modern CPU. An opcode is a primitive operation such as ‘add these two numbers’ or ‘if greater than this, jump to here’.

So, translated, this means:

  • 800 to 1,000 ops/µs — nearly an opcode executed every nanosecond.
  • On a 4 GHz machine, the VM retires opcodes at an effective 1 GHz, which is 1 billion VM opcodes executed per second.

With the emergence of Apple Silicon CPUs and their amazing single-core performance, I decided to compare the VM's performance on a few of my available machines:

  • Linux Ryzen (AMD Ryzen 3950x, 16 cores, 3.5 GHz boosting to 4 GHz, 64 GB 2666 MHz DDR4 ECC)
  • 2020 iMac 27" (Intel 8-core i7–10700k, 128 GB 2666 MHz DDR4 non-ECC)
  • Apple DTK (Apple A12z Bionic, 1.6 GHz, 16 GB LPDDR4)
  • Apple Mac Mini 2020 (Apple M1, 3.2 GHz, 16 GB 4266 MHz LPDDR4)

Performance Comparison

In the past, a test harness framework was written that tries various feature and compiler optimization flags to determine the fastest set of flags for each target platform and architecture. The interactions between these are complicated, as some combinations actually make the VM run slower. For each combination of compiler and feature flags, the harness performs 5 runs and records the results into a CSV file.

On the macOS machines, the Apple-provided clang-12 was used. On the Linux Ryzen machine, clang-10 was used. The test harness supports gcc as well, but as there is no gcc for Apple Silicon, it was disabled.

On a pure single-core performance basis running this Forth VM, the Apple M1 at 1,241 ops/µs is competitive with the Intel i7–10700k at 1,250 ops/µs, and the i7–10700k is the fastest Intel CPU available in a Mac!

When we look at metrics such as the number of opcodes executed per microsecond divided by the estimated TDP per high-performance core, the Apple M1 is nearly four times as efficient!

And when we compare on an opcodes-per-microsecond-per-dollar basis, the Apple M1 is again about four times more efficient.

Some notes for the tables below:

  • It is very difficult to compare simultaneous multithreading (SMT) CPUs such as the AMD Ryzen and Intel i7 families against heterogeneous CPUs with a mix of performance and efficiency cores, such as Apple's M1 or A12z. So we compare the SMT CPUs' physical core count against Apple's performance-core count.
  • The TDP (thermal design power) figures are estimates in Apple's case, and misleading in the case of AMD and Intel CPUs, which can exceed their TDP by 30% in many cases!
  • The clock rate is often an estimate in the case of AMD and Intel CPUs due to boost clocking.
  • The tables below reflect the most optimal combination of feature flags and compiler settings for each CPU in question:

[Table: optimal flags for the Ryzen 3950x]

[Table: optimal flags for the Macs, both Apple Silicon and x86–64]

These feature flags are documented later in this post, if you want more details.

[Table: opcodes per microsecond, compared with TDP, number of cores, and clock rate]
[Table: comparison of CPUs on absolute cost, cost per core, and ops/µs per dollar per core]

What follows is a very technical dive into the weeds of the Forth VM and how different optimization techniques apply to it. If you are here primarily for the M1 vs. x86–64 comparison, you can stop reading now.

Assembler Comparison and Some VM Details

It might be interesting to some readers to look at the assembly output. For the specific case of a Forth VM, there really is not much difference in instruction count for RISC vs CISC — which is a bit surprising considering RISC’s reputation for having more verbose code.

The VM is a single large function with a jump table that maps opcodes to labelled code blocks. When it encounters an opcode byte, it looks the byte up in the table and executes the corresponding block. Operands for the opcodes are popped from the stack, and results are pushed back onto it, which allows instructions to be a single byte.
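As a minimal sketch of this dispatch style, using the GCC/clang computed-goto extension: the opcode values, label names, and simplified host-side stack here are illustrative, not the VM's actual encoding, which operates on a 16-bit big-endian memory image.

```c
#include <stdint.h>

/* Illustrative opcode values; the real VM's encoding differs. */
enum { OP_HALT = 0x00, OP_ADD = 0x01 /* ... */ };

int16_t vm_run(const uint8_t *code)
{
    static const void *jump_table[256] = {
        [OP_HALT] = &&VM_HALT,
        [OP_ADD]  = &&VM_ADD,
        /* ... one labelled block per opcode ... */
    };
    int16_t stack[64], *sp = stack;
    const uint8_t *eip = code;

#define NEXT_OP goto *jump_table[*eip++]
    NEXT_OP;                          /* fetch and dispatch the first opcode */

VM_ADD: {
        int16_t b = *--sp, a = *--sp; /* pop two operands                    */
        *sp++ = (int16_t)(a + b);     /* push their sum                      */
        NEXT_OP;
    }
VM_HALT:
    return sp > stack ? sp[-1] : 0;   /* result is the top of the stack      */
#undef NEXT_OP
}
```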

Example Opcode — ADD

ADD opcode in C (with a lot of macros)
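As a sketch: STACK_ALU_2, POP_S, PUSH, and NEXT_OP are the VM's real macro names, but the bodies shown here are a plausible reconstruction, not the original code.

```c
/* STACK_ALU_2(op): pop two stack values, apply op, push the result. */
#define STACK_ALU_2(op)                          \
    do {                                         \
        uint16_t b_ = POP_S();                   \
        uint16_t a_ = POP_S();                   \
        PUSH((uint16_t)(a_ op b_));              \
    } while (0)

VM_ADD:
    STACK_ALU_2(+);   /* ( a b -- a+b )                       */
    NEXT_OP;          /* advance EIP and dispatch next opcode */
```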

In the above example, the macro STACK_ALU_2 expands into code that pops the two stack values, adds them together, and pushes the result onto the stack. The execution instruction pointer is moved forward, and the NEXT_OP handler is invoked.

Apple’s clang-12 produces LLVM IR, which is then used by llc to generate the assembly and binaries for the target architectures. The following is the assembly language output for ARM64.

[ARM64 assembly for VM_ADD]

The following is the x86–64 assembly language output.

[x86–64 assembly for VM_ADD]

Example Opcode — 2DUP

With the help of various C preprocessor macros, we have a very assembly-like syntax for writing opcodes. The following example pushes copies of the top two stack elements onto the stack.
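A sketch of 2DUP in that style; PEEK is an assumed helper that reads the stack cell n slots below the top, and the real VM's helper may differ.

```c
/* 2DUP: ( a b -- a b a b ) */
VM_2DUP:
    PUSH(PEEK(1));    /* copy the second cell:  a b  -> a b a   */
    PUSH(PEEK(1));    /* second cell is now b:  ...  -> a b a b */
    NEXT_OP;
```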

Feature Flags

As many optimizations and features have a different impact depending on the platform we're building for, we use -D feature flags that control how the VM performs operations such as moving to the next opcode or reading from the stack. These feature flags can actually degrade performance depending on the platform or the C compiler flags in use.

This might be interesting to some readers as to how we optimize some aspects of a very fast indirect-threaded Forth VM.

_BSD_SOURCE

We are a network-endian 16-bit virtual machine, so extremely fast endian-swapping when reading and writing the image is crucial. We try to use the system-provided be16toh, be32toh, htobe16, and htobe32 to swap endianness where needed. These are often inline assembly macros for the specific architecture we're building for.
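A sketch of how that setup might look: the Linux path is the standard <endian.h> route, and the Apple branch maps the BSD-style names onto macOS's OSByteOrder primitives (the mapping is my assumption about this VM, though the functions themselves are real).

```c
#if defined(__linux__)
#  define _BSD_SOURCE
#  include <endian.h>                 /* be16toh, htobe16, be32toh, htobe32 */
#elif defined(__APPLE__)
#  include <libkern/OSByteOrder.h>
#  define be16toh(x) OSSwapBigToHostInt16(x)
#  define htobe16(x) OSSwapHostToBigInt16(x)
#  define be32toh(x) OSSwapBigToHostInt32(x)
#  define htobe32(x) OSSwapHostToBigInt32(x)
#endif
```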

Then from the above, we define our image manipulation macros:
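Something along these lines: LOAD16_BE and STORE16_BE are the VM's names, while the exact signatures and bodies here are assumed.

```c
#include <stdint.h>
#include <string.h>

/* img is a pointer to the image's base pointer; addr is a VM-relative
 * byte address.  memcpy sidesteps alignment traps on strict targets. */
#define IMG_PTR(img, addr)  ((uint8_t *)*(img) + (addr))

#define LOAD16_BE(img, addr)                             \
    ({ uint16_t v_;                                      \
       memcpy(&v_, IMG_PTR(img, addr), sizeof v_);       \
       be16toh(v_); })

#define STORE16_BE(img, addr, val)                       \
    do { uint16_t v_ = htobe16((uint16_t)(val));         \
         memcpy(IMG_PTR(img, addr), &v_, sizeof v_);     \
    } while (0)
```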

These take care of the complicated details of dereferencing a pointer to a pointer with a relative address passed in, and automatically do the correct endian-swapping required for the host in question. My head hurt while writing that statement.

USE_REGISTERS

There are a few performance-critical variables in the VM, all of which are host-word-sized (32- or 64-bit) intptr_t values that point into the memory image.

  • EIP — execution instruction pointer; the instruction to execute
  • SP — stack pointer; where the Top-of-Stack (TOS) is
  • RSP — return stack pointer; where to return when a call is done.

We also have a tracking variable.

  • cycles — opcode execution count.

In our C code, we treat them as ordinary variables and trust the compiler to make the optimization decisions. USE_REGISTERS explicitly tells the compiler that it should store them in designated registers.

When USE_REGISTERS is not defined, the ps_reg type is a no-op, as is reg(R).

Declaring and initializing the register variables from the image is transparent, and looks like the following C code:
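In sketch form; the specific ARM64 registers chosen here are illustrative.

```c
/* ps_reg / reg() pin hot VM state to fixed registers when
 * USE_REGISTERS is defined; otherwise they expand to nothing. */
#ifdef USE_REGISTERS
#  define ps_reg  register
#  define reg(R)  __asm__(R)
#else
#  define ps_reg
#  define reg(R)
#endif

ps_reg intptr_t  EIP    reg("x19");   /* execution instruction pointer */
ps_reg intptr_t  SP     reg("x20");   /* data stack pointer            */
ps_reg intptr_t  RSP    reg("x21");   /* return stack pointer          */
ps_reg uintptr_t cycles reg("x22");   /* opcode execution count        */
```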

If we don't have USE_REGISTERS defined, then ps_reg and reg("xN") simply macro-expand into nothing. Otherwise, they expand into register and __asm__ annotations.

PLAID_MODE

Normally, we define the NEXT_OP instruction macro to jump through a pointer to a label, which can be adjusted dynamically. This is useful for turning per-opcode debugging on and off by changing which block of code *NEXT points to:
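A sketch of that arrangement; the label names are illustrative, and EIP is treated here as a plain byte pointer.

```c
/* NEXT points at the active dispatch block; pointing it at the debug
 * block turns on per-opcode tracing at runtime. */
void *NEXT = &&vm_dispatch;

#define NEXT_OP  goto *NEXT

vm_debug:
    /* ... log EIP, the opcode byte, the stack depth ... */
    /* then fall through into the normal dispatch path   */
vm_dispatch:
    goto *jump_table[*EIP++];   /* fetch, advance, dispatch */
```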

But as this happens on every single opcode we execute, the overhead of performing a goto to a pointer address adds up considerably. So in a production scenario, we use PLAID_MODE.

When PLAID_MODE is defined, we cut the dereference of pointers, and go straight to the NEXT opcode:
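One way the two modes might be expressed, as a sketch alongside the default path:

```c
#ifdef PLAID_MODE
/* NEXT resolves to the dispatch label at compile time: a direct goto
 * with no pointer load on the hot path. */
#  define NEXT_OP  goto vm_dispatch
#else
/* NEXT is a pointer variable: flexible (debuggable) but slower. */
#  define NEXT_OP  goto *NEXT
#endif
```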

The VM_OPCODE macro resolves NEXT to an address label at compile time, so it’s a very fast goto.

There are further optimizations such as INLINE_NEXT_OP, which inlines the dispatch code at each opcode, but the benefits have generally been minimal as the compiler is already effective at inlining.

USE_TOS

Normally, we operate directly on the stack; the first operand to an opcode is read directly from the memory address referenced by SP0. USE_TOS caches this value in a register, and opcodes read this register for their first operand, as well as put single-operand results into it.

This changes the behavior of all the stack macros. Normally, without USE_TOS, the macros look like:
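A plausible reconstruction; the stack's growth direction, the treatment of SP as a VM-relative byte offset, and the exact signatures are all assumptions.

```c
/* Plain stack macros: a cell is 16 bits, and the stack is assumed
 * to grow upward here. */
#define PUSH(v)                                          \
    do { SP += 2; STORE16_BE(img, SP, (v)); } while (0)

#define POP_S()                                          \
    ({ uint16_t v_ = LOAD16_BE(img, SP);                 \
       SP -= 2;                                          \
       v_; })
```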

We just decrement or increment the stack pointer, as appropriate, and return the dereferenced value.

When we enable USE_TOS caching, the behaviors change so that PUSH and POP_S update the TOS register.
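Sketched out, with the ALU macro included to show where the saving comes from; the bodies are assumed.

```c
#ifdef USE_TOS
/* TOS caches the top stack cell in a register; SP now points at the
 * second cell.  PUSH spills the old top, POP_S refills from memory. */
#  define PUSH(v)                                        \
      do { SP += 2; STORE16_BE(img, SP, TOS);            \
           TOS = (uint16_t)(v); } while (0)

#  define POP_S()                                        \
      ({ uint16_t v_ = TOS;                              \
         TOS = LOAD16_BE(img, SP);                       \
         SP -= 2;                                        \
         v_; })

/* Binary ALU ops combine the in-memory second cell with TOS directly:
 * one LOAD16_BE, no store, result stays in the TOS register. */
#  define STACK_ALU_2(op)                                \
      do { TOS = (uint16_t)(LOAD16_BE(img, SP) op TOS);  \
           SP -= 2; } while (0)
#endif
```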

It may be interesting to look at how STACK_ALU_2 and STACK_CMP only require one LOAD16_BE to perform their operation, versus two LOAD16_BE and one STORE16_BE in the non-USE_TOS case.

USE_B0_OP / LIT / LIT7

USE_B0_OP is a bit of an esoteric optimization, specific to this 16-bit VM and its optimized byte-sized LIT7 instruction. The LIT and LIT7 instructions push a 16-bit and a 7-bit literal number onto the stack, respectively.

The LIT7 optimization can yield considerable savings, as a lot of VM and system variables are defined to lie within the first 128 bytes of the image.

So the Forth phrase “SP @”, when compiled with the LIT opcode, becomes 4 bytes: LIT pushes the memory address of the SP variable onto the stack, and @ then fetches the 16-bit value stored there and pushes it onto the stack.

As SP's memory address is 30, the value fits in a 7-bit integer; with the LIT7 optimization, the same phrase becomes 2 bytes:
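Illustrated with assumed opcode byte values; the VM's real encoding is not shown in this post.

```c
#include <stdint.h>

/* Illustrative opcode values; LIT7's high bit marks the packed form. */
enum { OP_LIT = 0x10, OP_FETCH = 0x11, OP_LIT7 = 0x80 };

/* "SP @" compiled both ways; SP's address is 30 (0x1E). */
uint8_t with_lit[]  = { OP_LIT, 0x00, 0x1E, OP_FETCH };  /* 4 bytes */
uint8_t with_lit7[] = { OP_LIT7 | 0x1E,     OP_FETCH };  /* 2 bytes */
```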

The LIT7 instruction without USE_B0_OP macro-expands into:
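A plausible expansion; LOAD8 is an assumed byte-read helper, and EIP here has already been advanced past the opcode byte.

```c
VM_LIT7:
    /* Re-read the opcode byte through EIP and mask off the low
     * 7 bits; this costs an extra memory access per LIT7. */
    PUSH(LOAD8(img, EIP - 1) & 0x7F);
    NEXT_OP;
```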

However, when the USE_B0 feature flag is enabled, we have a byte-sized B0 register:
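As a sketch, reusing the register-pinning macros from USE_REGISTERS; the specific register is illustrative.

```c
/* CPU_B0 caches the opcode byte most recently fetched by the
 * dispatch path, so operand-carrying opcodes can reuse it. */
ps_reg uint8_t CPU_B0 reg("w23");
```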

When we have USE_B0_OP enabled, the actual operation to deal with the LIT7 instruction becomes:
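With the cached byte, the handler shrinks to something like this (again a sketch):

```c
VM_LIT7:
    /* Dispatch already loaded the opcode byte into CPU_B0, so the
     * 7-bit literal is masked out with no extra memory read. */
    PUSH(CPU_B0 & 0x7F);
    NEXT_OP;
```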

We rely on the prefetch of the opcode into CPU_B0 to avoid having to dereference the EIP and read memory again to resolve a LIT7 instruction.

With USE_B0 enabled, when the prior instruction loads the next opcode, the LIT7 byte is cached in CPU_B0, and is ready for the LIT7 instruction to mask and push onto the stack immediately.

MEMALIGN_SZ

On some architectures, memory alignment really matters, especially for caching. The MEMALIGN_SZ define aligns the memory image we operate on to defined page boundaries, which improves memory access. Notably, it does not appear to make a difference on Apple's CPUs, but it does on Intel and AMD CPUs.
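A sketch of the allocation path, assuming MEMALIGN_SZ is a page-size constant; posix_memalign is one portable way to get an aligned image.

```c
#include <stdlib.h>
#include <stdint.h>

#ifndef MEMALIGN_SZ
#define MEMALIGN_SZ 4096   /* assumed page size for this sketch */
#endif

/* Allocate the VM memory image on a MEMALIGN_SZ boundary. */
static uint8_t *alloc_image(size_t len)
{
    void *img = NULL;
    if (posix_memalign(&img, MEMALIGN_SZ, len) != 0)
        return NULL;
    return img;
}
```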

Finally …

This has been a bit of a deep dive, and I hope that some of you were able to get some value from it. The Forth VM discussed is unfortunately proprietary and the source code itself cannot be shared, but some interesting techniques can be gleaned from how it was written.
