Winter is coming even more quickly
TL;DR: I explain how I improved the performance of an interpreter for WebAssembly written in Haskell by plucking some low-hanging fruit.
Background
Motivated by my work at the DFINITY Foundation, I was looking into interpreters for WebAssembly written in Haskell, and found my colleague John Wiegley’s winter: a straightforward port of the WebAssembly reference interpreter, written in OCaml by Andreas Rossberg (another colleague of mine … I guess there is a common theme here).
Obviously, an interpreter will never be as fast as a real compiler implementation (such as the one in v8, lucet or wasmtime). But for my purposes an interpreter is fine. Nevertheless, I don’t want it to be needlessly slow, and when I picked up wasm, it was clear that I had to work at least a little bit on performance.
None of this is to blame John, of course: I have no doubt that he could have sped it up at least as well! I assume it just wasn’t a priority at that time, and winter is a nicely done piece of Haskell.
A graph
I like statistics and graphs! In this blog post, I am always running the following very simple C program:
unsigned int i;
void init () {
  i = 1024*1024;
  while (i--) {};
}
which I compile using clang version 9 and run with wasm-invoke from the winter package (compiled with GHC-8.6.5):
$ clang-9 --target=wasm32-unknown-unknown-wasm -fno-builtin -ffreestanding loop.c --compile
$ wasm-ld-9 loop.o --gc-sections --export init --no-entry -o loop.wasm
$ wasm-invoke -w loop.wasm -f init +RTS -t
The +RTS -t flag tells the Haskell runtime to produce a line with some statistics, of which I collect
- total number of bytes allocated
- maximum memory size used and
- total time elapsed
To reduce the noise a little bit, I run the program five times, and take the minimum of all the runs.
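A minimal sketch of such a measurement loop (not necessarily the exact harness I used; the statistics that -t prints go to stderr, so they are appended to a log file, here called stats.log purely for illustration) could look like this:
$ for i in 1 2 3 4 5; do wasm-invoke -w loop.wasm -f init +RTS -t 2>> stats.log; done
Afterwards, the minimum of the five values for each statistic can be read off stats.log, by eye or with a small script.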
I did this for all the commits I produced while trying to make the code faster. This yields the following graph:
Note that I had to draw this with a logarithmic y axis, because some of the improvements were so drastic that a linear graph would not be helpful.
But how did I get there? I will take you on a guided tour through all the commits that improved performance, explaining how I found each performance issue, what I did to improve things, and why that helped. To keep this post at a reasonable size, I split it into a series of blog posts, faithful to the chronological order in which I performed the changes.
Here they are:
- Vectors
- SPECIALIZE
- Difference lists
- Export lists
- Eta-Expanding ReaderT
- Simpler code
- A Zipper
- Statistics (the making-of)
Was it worth it? I’d say so! Initially, running the program above would take 44 minutes (slow!) and continuously grow to require 6.7 GB of memory (Houston, we have a space leak!).
After all the steps above, memory consumption is constant and below 2 MB, and the program finishes in 3.6 s. Overall, this yields the following, very satisfying, statistics:
Improvement (commits de9f4f8…5406efd):
- Allocations: -99.60%
- Memory: -99.97%
- Time: -99.86%
By the way, the OCaml reference interpreter takes 2.92 s to run this (and it does a bit more, such as validation of the module). So there is still room for improvement…
Have something to say? You can post a comment by sending an e-mail to me at <mail@joachim-breitner.de>, and I will include it here.