In this article, we'll make a wasm library that exports a single function by writing a simple helper program to generate the raw wasm bytes.
This article is the first in a series about creating a minimal self-hosted language that compiles to wasm.
If you want to write low-level wasm, "wat" is much nicer for authoring wasm programs, but in my case I am interested in writing a compile-to-wasm language where the bytes are a more useful output target.
Outside of the webassembly spec itself (in the "binary format" section), I couldn't find any good info about how to hand-hack binary wasm at the byte level so that's why this article exists!
The wasm spec is very complete, but it's not at all obvious how to get started with making your own wasm binary output. And I had to do a lot of guesswork coming up with the demo that turned into this article, so hopefully future people reading this are saved a bit of trouble.
The generator program is written in javascript so it can run on this page at the bottom as a demo. However, it shouldn't be hard to port this concept to your favorite language.
A .wasm file is a "wasm module". A wasm module starts with a 4-byte magic number ("\0asm") followed by a 4-byte little-endian u32 version number.
In this article we'll use wasm v1, so the first 8 bytes in the file are then:
0x00 0x61 0x73 0x6d 0x01 0x00 0x00 0x00
or we can write this in a more readable way using an inline string for the "asm" literal:
0x00 "asm" 0x01 0x00 0x00 0x00
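As a quick sanity check, and assuming a runtime that provides the standard WebAssembly global (a browser or Node.js), these 8 bytes on their own should already validate as an empty module. The variable names are made up for this sketch:

```javascript
// The 8-byte preamble: magic number "\0asm" followed by version 1 (little-endian u32).
const header = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, // "\0asm"
  0x01, 0x00, 0x00, 0x00, // version 1
])

// A module with no sections at all is still a valid (empty) module.
const headerIsValid = WebAssembly.validate(header)
console.log(headerIsValid) // true
```

Everything else in the file is optional sections appended after this preamble.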
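After the magic number and wasm version comes a sequence of sections that runs until the end of the file. Every section is optional, but in wasm v1 the standard sections must appear in ascending order of their section ids, and you will get errors if you include a section of the same type more than once. Custom sections (section id of 0x00) are the exception: they can appear anywhere and any number of times.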
Each section begins with a one-byte section id
followed by a u32 byte length of the
section contents (in uleb encoding, more on that soon), followed by the section contents as a byte
array.
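To make the section layout concrete, here is a rough sketch of a reader that walks a wasm binary and lists each section's id and content length. The helpers readUleb and listSections are hypothetical names, not part of the generator in this article, and the sample bytes are the generator output printed near the end of this article:

```javascript
// Hypothetical helper: decode one unsigned LEB128 value starting at `offset`.
function readUleb(bytes, offset) {
  let value = 0
  let shift = 0
  let b
  do {
    b = bytes[offset++]
    value |= (b & 0x7f) << shift
    shift += 7
  } while (b & 0x80)
  return { value: value >>> 0, offset }
}

// Hypothetical helper: list the sections of a wasm binary as { id, size } records.
function listSections(bytes) {
  const sections = []
  let offset = 8 // skip the magic number and version
  while (offset < bytes.length) {
    const id = bytes[offset++]
    const len = readUleb(bytes, offset)
    sections.push({ id, size: len.value })
    offset = len.offset + len.value // jump over the section contents
  }
  return sections
}

const moduleBytes = [
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f,
  0x03, 0x02, 0x01, 0x00,
  0x07, 0x05, 0x01, 0x01, 0x66, 0x00, 0x00,
  0x0a, 0x0d, 0x01, 0x0b, 0x01, 0x7f, 0x7f, 0x20, 0x00,
  0x41, 0xef, 0x00, 0x6c, 0x0f, 0x0b,
]

console.log(listSections(moduleBytes))
// [ { id: 1, size: 6 }, { id: 3, size: 2 }, { id: 7, size: 5 }, { id: 10, size: 13 } ]
```

Because every section carries its own length, a reader can skip sections it does not understand without parsing their contents.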
The section names with their corresponding one-byte section id that we will be using are:
type section: 0x01
function section: 0x03
export section: 0x07
code section: 0x0a
Before we get around to generating raw wasm bytes, we will need some functions.
The wasm spec uses a compact representation for integers throughout called LEB128. You can read more about LEB128 on wikipedia. The wikipedia article even includes C-like pseudocode and javascript implementations for both encoding and decoding (for this article we only need an encoder).
LEB128 can represent signed (sleb) and unsigned (uleb) integers.
In this article we are only concerned with 32-bit integers so these implementations may not be appropriate for larger integers.
Here are the sleb() and uleb() functions we will use, each returning an array of bytes:
    function uleb(x) {
      let bytes = []
      if (x < 0) x = 0
      do {
        let b = x & 0x7f // low 7 bits
        x >>>= 7 // unsigned shift, so values of 2^31 and above also terminate
        if (x != 0) b |= 0x80 // continuation bit: more bytes follow
        bytes.push(b)
      } while (x != 0)
      return bytes
    }

    function sleb(x) {
      let bytes = []
      let more = 1
      let neg = (x < 0)
      let size = 32 // 32 bit integer
      while (more) {
        let b = x & 0x7f
        x >>= 7
        if (neg) x |= (~0 << (size - 7)) // sign-extend
        let sb = (b >> 6) & 1 // sign bit of this 7-bit group
        if ((x === 0 && sb === 0) || (x === -1 && sb === 1)) {
          more = 0
        } else {
          b |= 0x80
        }
        bytes.push(b)
      }
      return bytes
    }
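To see where the continuation bit comes from, it can help to encode one value by hand. This is only a worked sketch for the number 3000 (which happens to need exactly two 7-bit groups), not a general encoder, and the variable names are made up for illustration:

```javascript
const value = 3000               // binary: 10111 0111000 (split into 7-bit groups)
const lowGroup = value & 0x7f    // 0b0111000 = 56
const highGroup = value >>> 7    // 0b10111 = 23

// The low group comes first and gets the continuation bit (0x80) because
// another byte follows; the last group is written without it.
const encoded = [lowGroup | 0x80, highGroup]
console.log(encoded) // [ 184, 23 ]

// Decoding reverses the steps: strip the continuation bit and shift each group back.
const decoded = (encoded[0] & 0x7f) | (encoded[1] << 7)
console.log(decoded) // 3000
```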
Small integers in LEB128 are often a single byte and larger integers will consume more bytes:
    > uleb(50)
    [ 50 ]
    > uleb(3000)
    [ 184, 23 ]
    > sleb(-37)
    [ 91 ]
    > sleb(-50000)
    [ 176, 249, 124 ]
    > sleb(1337)
    [ 185, 10 ]
For convenience, our wasm generator will be built from nested arrays, byte integers, and strings.
We will need two functions to turn that nested structure into a flat array of bytes: sizeof() and write().
sizeof() calculates the number of bytes for the nested structure in the flat byte array representation:
    const te = new TextEncoder()

    function sizeof(p) {
      let size = 0
      if (typeof p === 'number') { // byte
        size += 1
      } else if (typeof p === 'string') {
        size += te.encode(p).length
      } else if (Array.isArray(p)) {
        for (let i = 0; i < p.length; i++) {
          size += sizeof(p[i])
        }
      } else {
        throw new Error(`unexpected input: ${p}`)
      }
      return size
    }
while write() sets the bytes for the nested representation into an appropriately-sized Uint8Array:
    function write(out, offset, p) {
      if (typeof p === 'number') { // byte
        out[offset++] = p
      } else if (typeof p === 'string') {
        offset += te.encodeInto(p, out.subarray(offset)).written
      } else if (Array.isArray(p)) {
        for (let i = 0; i < p.length; i++) {
          offset = write(out, offset, p[i])
        }
      } else {
        throw new Error(`unexpected input: ${p}`)
      }
      return offset
    }
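The two functions are meant to be used as a pair: size first, then write. A small sketch of that pairing follows, with sizeof() and write() repeated in condensed form so the snippet runs on its own. The wrapper name flatten is hypothetical and does not appear in the article's program:

```javascript
const te = new TextEncoder()

function sizeof(p) {
  if (typeof p === 'number') return 1
  if (typeof p === 'string') return te.encode(p).length
  if (Array.isArray(p)) return p.reduce((n, item) => n + sizeof(item), 0)
  throw new Error(`unexpected input: ${p}`)
}

function write(out, offset, p) {
  if (typeof p === 'number') {
    out[offset++] = p
  } else if (typeof p === 'string') {
    offset += te.encodeInto(p, out.subarray(offset)).written
  } else if (Array.isArray(p)) {
    for (const item of p) offset = write(out, offset, item)
  } else {
    throw new Error(`unexpected input: ${p}`)
  }
  return offset
}

// Hypothetical convenience wrapper: allocate exactly sizeof(p) bytes, then fill them.
function flatten(p) {
  const out = new Uint8Array(sizeof(p))
  write(out, 0, p)
  return out
}

const flat = flatten([0x00, 'asm', [0x01, [0x00, 0x00], 0x00]])
console.log(Array.from(flat)) // [ 0, 97, 115, 109, 1, 0, 0, 0 ]
```

The nesting depth has no effect on the output; only the left-to-right order of bytes and strings matters.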
Finally, we need to implement the byte format for wasm vectors and byte arrays.
A byte array is specified with the uleb number of bytes followed by the bytes themselves:
function bytes(items) { return uleb(sizeof(items)).concat(items) }
A vector is specified with the uleb number of elements followed by the vector contents:
function vec(items) { return uleb(items.length).concat(items) }
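The difference between the two prefixes is easy to miss: bytes() counts flattened bytes, while vec() counts top-level elements. A minimal sketch, assuming a reduced sizeof() that only handles byte values and nested arrays (strings are left out to keep it short):

```javascript
function uleb(x) {
  const out = []
  do {
    let b = x & 0x7f
    x >>>= 7
    if (x !== 0) b |= 0x80
    out.push(b)
  } while (x !== 0)
  return out
}

// Reduced sizeof for this sketch: byte values and nested arrays only.
function sizeof(p) {
  if (typeof p === 'number') return 1
  return p.reduce((n, item) => n + sizeof(item), 0)
}

function bytes(items) { return uleb(sizeof(items)).concat(items) }
function vec(items) { return uleb(items.length).concat(items) }

const items = [[0x01, 0x02, 0x03], [0x04, 0x05]]
const asVec = vec(items)     // prefix is 2: two elements
const asBytes = bytes(items) // prefix is 5: five bytes once flattened
console.log(asVec[0], asBytes[0]) // 2 5
```

This is why each vector element in the later sections is wrapped in its own array: vec() needs to see one array entry per element, however many bytes that element flattens to.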
type section (0x01)
The type section is a vector of types.
To define a function type, use the code 0x60 for functions followed by a vector of the types of the function arguments followed by a vector of the types of the function's return values.
Our function will take one i32 as an argument and will return one i32. Wasm has no separate unsigned integer type: an i32 is just 32 bits, and each instruction decides whether to treat them as signed or unsigned. The type code for i32 is 0x7f.
So we have:
    let typeSection = [
      0x01, bytes([
        vec([
          [ 0x60, vec([0x7f]), vec([0x7f]) ], // fn[0]: [i32] -> [i32]
        ]),
      ])
    ]
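The same pattern extends to other signatures. As a sketch, here is a hypothetical funcType() helper (not part of the article's program) that builds a type entry from lists of argument and result types, shown for a two-argument signature:

```javascript
const I32 = 0x7f // type code for i32

function uleb(x) {
  const out = []
  do {
    let b = x & 0x7f
    x >>>= 7
    if (x !== 0) b |= 0x80
    out.push(b)
  } while (x !== 0)
  return out
}

function vec(items) { return uleb(items.length).concat(items) }

// Hypothetical helper: 0x60, then the argument types, then the result types.
function funcType(params, results) {
  return [0x60, vec(params), vec(results)]
}

const twoArgType = funcType([I32, I32], [I32]) // [i32, i32] -> [i32]
console.log(twoArgType.flat(Infinity)) // [ 96, 2, 127, 127, 1, 127 ]
```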
function section (0x03)
The function section is a vector of type indices, one entry per function defined in the module, each pointing at that function's signature in the type section.
We previously defined a function type as the first item in the type section, which is index 0, so the function section becomes:
    let functionSection = [
      0x03, bytes([
        vec([
          [ uleb(0) ], // fn[0]
        ]),
      ])
    ]
export section (0x07)
The export section allows us to access our function from outside wasm.
This section is a vector of export records, each of which contains a name as a byte array of utf-8 and an "export description". For functions, the export description starts with 0x00 and then contains the index of the function to export. Since our module imports no functions, that index is the same as the function's position in the function section.
To export the first and only function from the function section (index 0), we have:
    let exportSection = [
      0x07, bytes([
        vec([
          [ bytes('f'), 0x00, uleb(0) ], // fn[0]
        ]),
      ])
    ]
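Because the name is a byte array of utf-8, its length prefix counts bytes rather than characters. A small sketch of that, using a hypothetical exportFunc() helper that is not part of the article's program:

```javascript
const te = new TextEncoder()

function uleb(x) {
  const out = []
  do {
    let b = x & 0x7f
    x >>>= 7
    if (x !== 0) b |= 0x80
    out.push(b)
  } while (x !== 0)
  return out
}

// Hypothetical helper: one export record for a function.
function exportFunc(name, funcIndex) {
  const utf8 = Array.from(te.encode(name))
  return [uleb(utf8.length), utf8, 0x00, uleb(funcIndex)]
}

const asciiExport = exportFunc('f', 0).flat(Infinity)
const greekExport = exportFunc('\u03c0', 0).flat(Infinity) // U+03C0 (pi) is 2 bytes in utf-8

console.log(asciiExport) // [ 1, 102, 0, 0 ]
console.log(greekExport) // [ 2, 207, 128, 0, 0 ]
```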
code section (0x0a)
Wasm is a stack machine: rather than naming registers, its instructions operate on an implicit stack of values.
The instructions themselves are a variable number of bytes long.
Each instruction may push, pop, or peek values from the stack.
There are a lot of good resources about wasm instructions, so I won't go too much into the topic
here. You can read more in the "instructions" chapter (5.4) of the "binary format" section of the
wasm spec.
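To get a feel for the stack discipline, here is a toy sketch that traces the three instructions the function in this section relies on. The run() function is a hypothetical mini-interpreter written for illustration only; it works on instruction names rather than bytes and ignores validation entirely:

```javascript
// Hypothetical mini-interpreter for three instructions, for illustration only.
function run(instructions, locals) {
  const stack = []
  for (const [op, arg] of instructions) {
    if (op === 'local.get') {
      stack.push(locals[arg])
    } else if (op === 'i32.const') {
      stack.push(arg | 0)
    } else if (op === 'i32.mul') {
      const b = stack.pop()
      const a = stack.pop()
      stack.push(Math.imul(a, b)) // 32-bit wrapping multiply, like i32.mul
    } else {
      throw new Error(`unsupported instruction: ${op}`)
    }
    console.log(op, arg ?? '', '->', JSON.stringify(stack))
  }
  return stack.pop()
}

const result = run([['local.get', 0], ['i32.const', 111], ['i32.mul']], [9])
console.log(result) // 999
```

The stack holds [9], then [9, 111], then [999], which is the value left for the function to return.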
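One thing that wasn't obvious to me reading that chapter is how a function body starts. Before the first instruction comes a vector of local declarations, and each entry in that vector is a uleb count followed by a type code. The arguments are not declared here, because their types already come from the type section. Instructions such as local.get are not prefaced with a type code: local.get is just 0x20 followed by the index of the local.
In this code section, we define a function (at index 0) that takes an i32 as an argument, multiplies the argument by the constant 111, and returns the result. The value 0x0b indicates where the function definition ends so that the next one may begin. In this case we only define a single function.
As in the previous type section, the value 0x7f is the type code for i32. The body below starts with one local declaration, 0x7f 0x7f, which wasm reads as 127 (0x7f) extra locals of type i32 (0x7f). The function never uses those locals; an empty vector (a single 0x00 byte) would also work and would shrink the body by two bytes.
    let codeSection = [
      0x0a, bytes([
        vec([
          bytes([
            vec([ [ uleb(127), 0x7f ] ]), // local declarations: 127 (0x7f) locals of type i32 (0x7f)
            [
              0x20, 0x00, // local.get 0: read argument[0] onto the stack
              0x41, sleb(111), // i32.const(111): push the constant 111 onto the stack
              0x6c, // i32.mul: pops two values from the stack,
                    // multiplies them together, and pushes the result onto the stack
              0x0f, // return
              0x0b // end
            ],
          ]),
        ]),
      ])
    ]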
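For comparison, here is a sketch of the same module written out by hand with an empty vector of local declarations, so the body starts directly with local.get. This is a variant, not the output of the generator in this article. It assumes a runtime with the standard WebAssembly global, and it uses the synchronous WebAssembly.Module and WebAssembly.Instance constructors only to keep the snippet short:

```javascript
const leanModule = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic number and version
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f, // type section: [i32] -> [i32]
  0x03, 0x02, 0x01, 0x00,                         // function section: fn[0] uses type 0
  0x07, 0x05, 0x01, 0x01, 0x66, 0x00, 0x00,       // export section: "f" -> fn[0]
  0x0a, 0x0b, 0x01, 0x09,                         // code section: 11 bytes, one body of 9 bytes
  0x00,                                           // empty vector of local declarations
  0x20, 0x00,                                     // local.get 0
  0x41, 0xef, 0x00,                               // i32.const 111
  0x6c,                                           // i32.mul
  0x0f,                                           // return
  0x0b,                                           // end
])

const leanInstance = new WebAssembly.Instance(new WebAssembly.Module(leanModule))
const leanResult = leanInstance.exports.f(9)
console.log(leanResult) // 999
```

The argument is still reachable as local 0 because arguments always occupy the first local indices, whether or not any extra locals are declared.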
Putting all of the previous code together plus a bit of glue to instantiate our wasm code and call its exported function, we get:
[ program.js ]:
    // generate the raw bytes for a wasm library with one function export,
    // call that function with an argument, and print the output

    const te = new TextEncoder()

    function uleb(x) {
      let bytes = []
      if (x < 0) x = 0
      do {
        let b = x & 0x7f
        x >>>= 7 // unsigned shift, so values of 2^31 and above also terminate
        if (x != 0) b |= 0x80
        bytes.push(b)
      } while (x != 0)
      return bytes
    }

    function sleb(x) {
      let bytes = []
      let more = 1
      let neg = (x < 0)
      let size = 32 // 32 bit integer
      while (more) {
        let b = x & 0x7f
        x >>= 7
        if (neg) x |= (~0 << (size - 7))
        let sb = (b >> 6) & 1
        if ((x === 0 && sb === 0) || (x === -1 && sb === 1)) {
          more = 0
        } else {
          b |= 0x80
        }
        bytes.push(b)
      }
      return bytes
    }

    function sizeof(p) {
      let size = 0
      if (typeof p === 'number') { // byte
        size += 1
      } else if (typeof p === 'string') {
        size += te.encode(p).length
      } else if (Array.isArray(p)) {
        for (let i = 0; i < p.length; i++) {
          size += sizeof(p[i])
        }
      } else {
        throw new Error(`unexpected input: ${p}`)
      }
      return size
    }

    function write(out, offset, p) {
      if (typeof p === 'number') { // byte
        out[offset++] = p
      } else if (typeof p === 'string') {
        offset += te.encodeInto(p, out.subarray(offset)).written
      } else if (Array.isArray(p)) {
        for (let i = 0; i < p.length; i++) {
          offset = write(out, offset, p[i])
        }
      } else {
        throw new Error(`unexpected input: ${p}`)
      }
      return offset
    }

    function bytes(items) {
      return uleb(sizeof(items)).concat(items)
    }

    function vec(items) {
      return uleb(items.length).concat(items)
    }

    let program = [
      0x00, 'asm', 0x01, 0x00, 0x00, 0x00,
      0x01, bytes([ // type section
        vec([
          [ 0x60, vec([0x7f]), vec([0x7f]) ], // fn[0]: [i32] -> [i32]
        ]),
      ]),
      0x03, bytes([ // function section
        vec([
          [ uleb(0) ], // fn[0]
        ]),
      ]),
      0x07, bytes([ // export section
        vec([
          [ bytes('f'), 0x00, uleb(0) ], // fn[0]
        ]),
      ]),
      0x0a, bytes([ // code section
        vec([
          bytes([
            vec([ [ uleb(127), 0x7f ] ]), // local declarations: 127 (0x7f) locals of type i32 (0x7f)
            [
              0x20, 0x00, // local.get 0: read argument[0] onto the stack
              0x41, sleb(111), // i32.const(111): push the constant 111 onto the stack
              0x6c, // i32.mul: pops two values from the stack,
                    // multiplies them together, and pushes the result onto the stack
              0x0f, // return
              0x0b // end
            ],
          ]),
        ]),
      ])
    ]

    ;(async function () {
      let data = new Uint8Array(sizeof(program))
      write(data, 0, program)
      // at this point you could serialize `data`
      // and then load data and instantiate a webassembly instance from it later:
      let w = await WebAssembly.instantiate(data)
      console.log(w.instance.exports.f(9)) // prints 999
    })()
When we print the wasm data for the program:
console.log(Array.from(data).map(x => '0x'+x.toString(16).padStart(2,'0')).join(' '))
then we get:
0x00 0x61 0x73 0x6d 0x01 0x00 0x00 0x00 0x01 0x06 0x01 0x60 0x01 0x7f 0x01 0x7f 0x03 0x02 0x01 0x00 0x07 0x05 0x01 0x01 0x66 0x00 0x00 0x0a 0x0d 0x01 0x0b 0x01 0x7f 0x7f 0x20 0x00 0x41 0xef 0x00 0x6c 0x0f 0x0b
Annotated, this is:
    0x00 0x61 0x73 0x6d # magic number
    0x01 0x00 0x00 0x00 # wasm version 1
    0x01 0x06           # type section with 6 bytes
    0x01                # types vector of length 1
    0x60                # function type
    0x01 0x7f           # function takes 1 argument: i32
    0x01 0x7f           # function returns 1 value: i32
    0x03 0x02           # function section with 2 bytes
    0x01 0x00           # vector of length 1: function signature is at type index zero
    0x07 0x05           # export section with 5 bytes
    0x01                # vector of length 1
    0x01 0x66           # export name: byte array of length 1, contents: "f"
    0x00                # export description: function type
    0x00                # export description: function index 0
    0x0a 0x0d           # code section with 13 (0x0d) bytes
    0x01                # vector of length 1 (one code item)
    0x0b                # 11 (0x0b) bytes in this code item
    0x01                # vector of local declarations, length 1
    0x7f 0x7f           # one declaration: 127 (0x7f) locals of type i32 (0x7f)
    0x20 0x00           # local.get 0
    0x41 0xef 0x00      # i32.const 111 (111 in decimal is 0xef 0x00 in sleb)
    0x6c                # i32.mul
    0x0f                # return
    0x0b                # end