
Search with Zig, Wasm, and a Worker
Search with Zig, Wasm, and a Worker êŽë š
Static search was a nice upgrade to my blog. cloudcannon/pagefind is super easy to use. Point it at <main> and itâll crawl your website making it searchable with a bit of front-end magic.
A minor issue I had with Pagefind was file churn. Every re-index resulted in 200+ deletions and 200+ additions. This made deployment slow and wasteful. This was specifically fixed in Pagefind v1.3.0 (CloudCannon/pagefind) but I had locked to v1.2.0 and never noticed.
What I should have done is read the documentation. Should have RTFM, would have gotten a good nightâs sleep. Instead I nerd-sniped[1] myself. I stayed up way too late coding a Levenshtein Distance algorithm in Zig[2] before giving up and coping this Gist (travisstaloch).
The end result of this madness is a new step to my static site generator and plenty of learning. I built my own search and I attempt to document it below.
Creating an Index
Iâm sure there are many scientific and mathematical papers written on search indexes. Iâve read none of them. I figured a big JSON[3] array would suffice.
I used my HTML parsing utilities (git.dbushell.com) to strip tags and generate a list of words on my blog. I normalise to lowercase and strip anything that isnât aâz including diacritics.
JavaScript has a neat trick using Unicode properties:
word = word
.toLowerCase()
.normalize("NFD")
.replace(/\p{Diacritic}/gu, "");
This uses âcanonical decompositionâ which turns a single code point into two. For example "\u00F1" is ñ aka Latin Small Letter N with Tilde. Compare that to "\u006E\u0303" also nÌ but a lowercase ASCII ânâ followed by the Combining Tilde.
This process gives a form of âfuzzyâ matching when also applied to search queries. This is all academic because Iâm English and I barely used all 26 letters. I did use an umlaut once so I must account for that edge case.
My website has a vocabulary of over 12000 unique words with âandâ and âtheâ jostling for top spot. Oooh âjostlingâ I think thatâs +1. Words are stored along with an array of pages IDs and the number of occurrences on that page (IDs are a 32-bit hash of the URL). I throw away common words like âandâ to save space. If you search âandâ itâll match âbrandâ, âstandardâ, etc.
Appended to the index I have a map linking IDs to page URL and title. The entire search index is 1.5 MBs of raw JSON but only 275 KB compressed. Thatâs reasonably small and I only load it if search is used. (I do some optimisation later.)
Searching the Index
Now this is where things get silly. Iâm sure I couldâve done the next part in JavaScript but I wanted to use WebAssembly[4] and nothing was going to stop me. Iâve been learning Zig by building CLI tools and light switches. Zig can compile to Wasm.
As mentioned I use the Levenshtein Distance to match normalised keywords against my index. I literally run the calculation against every word. Sorry Google[5], Iâm not looking for a new job right now. Sounds like Google stopped caring anyway.
I rank pages based on:
- Levenshtein distance
- Number of keyword matches
- Recency of blog post
I adjusted a few magic number multipliers until the results felt right. I wonât dive too much into the Zig code (dbushell/dbushell.com) because Iâll probably refactor it multiple times.
Optimisation
1.5 MB (275 KB compressed) is fine but I can do better! JSON object keys are pure overhead. Brackets? Quotation marks? Commas? Get âem outta here! Pure bloat. 32-bit hashes as 8 byte hexadecimal strings is twice the necessary size.
I came up with a simpleâ binary format. Hashes are raw 32-bit numbers. Strings are still UTF-8 but theyâre preceded by their byte length. Arrays too are just a count followed by packed data. This reduced the file size to 556 KB (214 KB compressed).
â Somehow I mixed both big and little endian
Zig can use this data more efficiently too without a full JSON parser. I used a similar technique to packed structs. When I find an array of hashes, i.e. 5 bytes after 5 bytes repeating, itâs not parsed itâs just referenced straight from memory.
const Ref = extern struct {
hash: [4]u8,
occurrences: u8,
};
const Word = struct {
word: []const u8,
pages: [*]const Ref,
pages_len: u16,
};
The pages array is just a pointer into memory (the embedded index data). The parser doesnât iterate 5 bytes at a time creating Ref objects. It just reads the pages_len and skips ahead pages_len * 5 bytes. I can still use syntax like pages[0].hash to read array items.
There is a little bit of parsing for the Word structs. Iâm sure if I understood more I could make a binary format that exactly replicates memory layout and parse nothing. I gave up, Iâm not smart enough to handle alignment errors yet.
With these optimisations I was able to half both file size and search time.
Wasm and Workers
JavaScript does a good job of hiding the fact that itâs single-threaded. That is until you try to execute tasks during UI interaction. Delayed typing makes it painfully obvious. To fix this I initialise my Wasm module inside a Web Worker[6].
Getting data in and out of Wasm isnât straight forward. You can call functions and pass numeric values, but not strings or objects. You do have full memory access to work with though.
Wasm memory is created in JavaScript.
const memory = new WebAssembly.Memory({
initial: 10,
maximum: 100,
});
const wasm = await WebAssembly.instantiateStreaming(
fetch("search.wasm"), {
env: { memory }
},
);
const { exports } = wasm.instance;
exports.init();
Within Zig I allocate a fixed 10 KB buffer Iâll use to pass data back and forth.
var scratch: []u8 = undefined;
const scratch_len: usize = 1024 * 10;
export fn init() void {
scratch = wasm_allocator.alloc(u8, scratch_len) catch unreachable;
}
The init function is the one I called from JavaScript. These functions are synchronous which is why a worker is so important.
I have a second function to return the address of the allocated memory.
export fn getScratch() [*]const u8 {
return @ptrCast(scratch);
}
Back to JavaScript, I can write bytes to the scratch memory.
const scratch = new Uint8Array(
memory.buffer,
exports.getScratch(),
128
);
scratch.set(
[...new TextEncoder().encode("Hello, Zig!"), 0],
0
);
exports.something();
This creates a temporary Uint8Array backed by the same underlying Wasm memory at the allocated offset. I write âHello, Zig!â as a null-terminated string. Then I call something (another exported Zig function). Zig can then read the data I just wrote to scratch.
This works in reverse too. When I initialise Wasm I can provide functions.
const debug = (ptr, len) => {
const message = new TextDecoder().decode(
memory.buffer.slice(ptr, ptr + len),
);
console.debug(message);
};
const wasm = await WebAssembly.instantiateStreaming(
fetch("search.wasm"), {
env: { memory, debug }
},
);
Those functions are tagged in Zig with extern:
extern fn debug(buf: usize, len: usize) void;
I can write text to the scratch buffer and then call JavaScript with its offset and length:
const slice = try std.fmt.bufPrint(
scratch, "Hello, {s}!", .{"JavaScript"}
);
debug(@intFromPtr(slice.ptr), slice.len);
This will log âHello, JavaScript!â to the browser console.
For the actual search results my Zig code writes JSON to the buffer and informs the worker itâs done. The worker transfers data back to the main thread:
const data = new Uint8Array(
memory.buffer.slice(ptr, ptr + len),
);
self.postMessage({
type: "results",
buffer: data.buffer,
},
[data.buffer]
);
The main thread parses the JSON itself:
worker.addEventListener("message", (ev) => {
if (ev.data.type === "results") {
const data = JSON.parse(
new TextDecoder().decode(ev.data.buffer)
);
// Update UI...
}
});
Woah! That was a fair bit of back and forth. Iâm sure my code has plenty of race conditions too! It works well enough for the time being.
After all that effort you can now search my blog. Like you could before. At some point in future I should research the correct way to index a blog and make this 10x more efficient.
I added some CSS view transitions to make it look fancier.
To save bandwidth no Wasm is loaded until the search input is focused. I still employ the same DuckDuckGo fallback should anything fail to initialise.
Shoutout again to Pagefind for serving me until now đ«Ą
Sources on 'Nerd Snipe'
Sources on 'Zig'
Sources on 'WebAssembly (Wasm)'
Sources on 'Google'
Sources on 'Web Worker'
The art and sport of using a technical challenge to distract a nerd. Bonus points for sniping yourself. â©ïž
A low-level programming language with no hidden control flow and no hidden memory allocations. â©ïž
JavaScript Object Notation. An almost perfect specification. If only theyâd allowed trailing commas. â©ïž
Low-level machine code for web browsers. Wasm binary instructions are run in a virtual machine. â©ïž
Arguably the most evil company in tech resolute in their crusade to gate-keep the web and sell your privacy as a service. â©ïž
JavaScript that runs in its own thread. Handy for non-blocking background tasks. â©ïž
