Search with Zig, Wasm, and a Worker

David Bushell18/05/2025About 10 minZigArticle(s)blogdbushell.comzig

Search with Zig, Wasm, and a Worker 관련

Zig > Article(s)

Article(s)

The one where I rebuild it in Zig

Static search was a nice upgrade to my blog. cloudcannon/pagefind is super easy to use. Point it at <main> and it’ll crawl your website making it searchable with a bit of front-end magic.

A minor issue I had with Pagefind was file churn. Every re-index resulted in 200+ deletions and 200+ additions. This made deployment slow and wasteful. This was specifically fixed in Pagefind v1.3.0 (CloudCannon/pagefind) but I had locked to v1.2.0 and never noticed.

What I should have done is read the documentation. Should have RTFM, would have gotten a good night’s sleep. Instead I nerd-sniped^[1] myself. I stayed up way too late coding a Levenshtein Distance algorithm in Zig^[2] before giving up and coping this Gist (travisstaloch).

The end result of this madness is a new step to my static site generator and plenty of learning. I built my own search and I attempt to document it below.

Creating an Index

I’m sure there are many scientific and mathematical papers written on search indexes. I’ve read none of them. I figured a big JSON^[3] array would suffice.

I used my HTML parsing utilities (git.dbushell.com) to strip tags and generate a list of words on my blog. I normalise to lowercase and strip anything that isn’t a–z including diacritics.

JavaScript has a neat trick using Unicode properties:

word = word
  .toLowerCase()
  .normalize("NFD")
  .replace(/\p{Diacritic}/gu, "");

This uses “canonical decomposition” which turns a single code point into two. For example "\u00F1" is ñ aka Latin Small Letter N with Tilde. Compare that to "\u006E\u0303" also ñ but a lowercase ASCII “n” followed by the Combining Tilde.

This process gives a form of “fuzzy” matching when also applied to search queries. This is all academic because I’m English and I barely used all 26 letters. I did use an umlaut once so I must account for that edge case.

My website has a vocabulary of over 12000 unique words with “and” and “the” jostling for top spot. Oooh “jostling” I think that’s +1. Words are stored along with an array of pages IDs and the number of occurrences on that page (IDs are a 32-bit hash of the URL). I throw away common words like “and” to save space. If you search “and” it’ll match “brand”, “standard”, etc.

Appended to the index I have a map linking IDs to page URL and title. The entire search index is 1.5 MBs of raw JSON but only 275 KB compressed. That’s reasonably small and I only load it if search is used. (I do some optimisation later.)

Searching the Index

Now this is where things get silly. I’m sure I could’ve done the next part in JavaScript but I wanted to use WebAssembly^[4] and nothing was going to stop me. I’ve been learning Zig by building CLI tools and light switches. Zig can compile to Wasm.

As mentioned I use the Levenshtein Distance to match normalised keywords against my index. I literally run the calculation against every word. Sorry Google^[5], I’m not looking for a new job right now. Sounds like Google stopped caring anyway.

I rank pages based on:

Levenshtein distance
Number of keyword matches
Recency of blog post

I adjusted a few magic number multipliers until the results felt right. I won’t dive too much into the Zig code (dbushell/dbushell.com) because I’ll probably refactor it multiple times.

Optimisation

1.5 MB (275 KB compressed) is fine but I can do better! JSON object keys are pure overhead. Brackets? Quotation marks? Commas? Get ‘em outta here! Pure bloat. 32-bit hashes as 8 byte hexadecimal strings is twice the necessary size.

I came up with a simple† binary format. Hashes are raw 32-bit numbers. Strings are still UTF-8 but they’re preceded by their byte length. Arrays too are just a count followed by packed data. This reduced the file size to 556 KB (214 KB compressed).

† Somehow I mixed both big and little endian

Zig can use this data more efficiently too without a full JSON parser. I used a similar technique to packed structs. When I find an array of hashes, i.e. 5 bytes after 5 bytes repeating, it’s not parsed it’s just referenced straight from memory.

const Ref = extern struct {
    hash: [4]u8,
    occurrences: u8,
};

const Word = struct {
    word: []const u8,
    pages: [*]const Ref,
    pages_len: u16,
};

The pages array is just a pointer into memory (the embedded index data). The parser doesn’t iterate 5 bytes at a time creating Ref objects. It just reads the pages_len and skips ahead pages_len * 5 bytes. I can still use syntax like pages[0].hash to read array items.

There is a little bit of parsing for the Word structs. I’m sure if I understood more I could make a binary format that exactly replicates memory layout and parse nothing. I gave up, I’m not smart enough to handle alignment errors yet.

With these optimisations I was able to half both file size and search time.

Wasm and Workers

JavaScript does a good job of hiding the fact that it’s single-threaded. That is until you try to execute tasks during UI interaction. Delayed typing makes it painfully obvious. To fix this I initialise my Wasm module inside a Web Worker^[6].

Getting data in and out of Wasm isn’t straight forward. You can call functions and pass numeric values, but not strings or objects. You do have full memory access to work with though.

Wasm memory is created in JavaScript.

const memory = new WebAssembly.Memory({
  initial: 10,
  maximum: 100,
});

const wasm = await WebAssembly.instantiateStreaming(
  fetch("search.wasm"), {
    env: { memory }
  },
);

const { exports } = wasm.instance;

exports.init();

Within Zig I allocate a fixed 10 KB buffer I’ll use to pass data back and forth.

var scratch: []u8 = undefined;
const scratch_len: usize = 1024 * 10;

export fn init() void {
    scratch = wasm_allocator.alloc(u8, scratch_len) catch unreachable;
}

The init function is the one I called from JavaScript. These functions are synchronous which is why a worker is so important.

I have a second function to return the address of the allocated memory.

export fn getScratch() [*]const u8 {
    return @ptrCast(scratch);
}

Back to JavaScript, I can write bytes to the scratch memory.

const scratch = new Uint8Array(
  memory.buffer,
  exports.getScratch(),
  128
);
scratch.set(
  [...new TextEncoder().encode("Hello, Zig!"), 0],
  0
);
exports.something();

This creates a temporary Uint8Array backed by the same underlying Wasm memory at the allocated offset. I write “Hello, Zig!” as a null-terminated string. Then I call something (another exported Zig function). Zig can then read the data I just wrote to scratch.

This works in reverse too. When I initialise Wasm I can provide functions.

const debug = (ptr, len) => {
  const message = new TextDecoder().decode(
    memory.buffer.slice(ptr, ptr + len),
  );
  console.debug(message);
};

const wasm = await WebAssembly.instantiateStreaming(
  fetch("search.wasm"), {
    env: { memory, debug }
  },
);

Those functions are tagged in Zig with extern:

extern fn debug(buf: usize, len: usize) void;

I can write text to the scratch buffer and then call JavaScript with its offset and length:

const slice = try std.fmt.bufPrint(
    scratch, "Hello, {s}!", .{"JavaScript"}
);
debug(@intFromPtr(slice.ptr), slice.len);

This will log “Hello, JavaScript!” to the browser console.

For the actual search results my Zig code writes JSON to the buffer and informs the worker it’s done. The worker transfers data back to the main thread:

const data = new Uint8Array(
  memory.buffer.slice(ptr, ptr + len),
);
self.postMessage({
    type: "results",
    buffer: data.buffer,
  },
  [data.buffer]
);

The main thread parses the JSON itself:

worker.addEventListener("message", (ev) => {
  if (ev.data.type === "results") {
    const data = JSON.parse(
      new TextDecoder().decode(ev.data.buffer)
    );
    // Update UI...
  }
});

Woah! That was a fair bit of back and forth. I’m sure my code has plenty of race conditions too! It works well enough for the time being.

After all that effort you can now search my blog. Like you could before. At some point in future I should research the correct way to index a blog and make this 10x more efficient.

I added some CSS view transitions to make it look fancier.

To save bandwidth no Wasm is loaded until the search input is focused. I still employ the same DuckDuckGo fallback should anything fail to initialise.

Shoutout again to Pagefind for serving me until now 🫡

Sources on 'Nerd Snipe'

Nerd Sniping

Urban Dictionary: Nerd Sniping

Nerd Sniping: 1. The act of presenting someone, often a mathematician/physicist with a time consuming problem or challenge (often impossible to solve or...

Sources on 'Zig'

Home ⚡ Zig Programming Language

Zig is a general-purpose programming language and toolchain for maintaining robust, optimal and reusable software.

Ziggit - A Zig community

A community for anyone interested in the Zig Programming Language.

Welcome | zig.guide

Get started with the Zig programming language. Zig is a general-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.

Learning Zig

Welcome to Learning Zig, an introduction to the Zig programming language. This guide aims to make you comfortable with Zig. It assumes prior programming experience, though not in any particular language.

Programming with Zig: From Basics to Mastery

Table of Contents 1. Introduction to Zig: History, Philosophy, and Getting Started The origins of Zig Core design principles Comparison with other languages (C

Introduction to Zig

Welcome! This is the initial page for the “Open Access” HTML version of the book “Introduction to Zig: a project-based book”, written by Pedro Duarte Faria. This is an open book that provides an introduction to the Zig programming language, which is a new general-purpose, and low-level language for building robust and optimal software.

Zig Cookbook

Zig cookbook is a collection of simple Zig programs that demonstrate good practices to accomplish common programming tasks.

Sources on 'JSON'

JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language Standard ECMA-262 3rd Edition - December 1999. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.

Sources on 'WebAssembly (Wasm)'

WebAssembly

WebAssembly (abbreviated Wasm) is a binary instruction format for a stack-based virtual machine. Wasm is designed as a portable compilation target for programming languages, enabling deployment on the web for client and server applications.

WebAssembly | MDN

WebAssembly is a type of code that can be run in modern web browsers. It is a low-level assembly-like language with a compact binary format that runs with near-native performance and provides languages such as C/C++, C# and Rust with a compilation target so that they can run on the web.

Sources on 'Google'

Delete your Google Account - Google Account Help

You can delete your Google Account at any tim

Killed by Google

Killed by Google is the open source list of dead Google products, services, and devices. It serves as a tribute and memorial of beloved services and products killed by Google.

Sources on 'Web Worker'

Using Web Workers - Web APIs | MDN

Web Workers are a simple means for web content to run scripts in background threads. The worker thread can perform tasks without interfering with the user interface. In addition, they can make network requests using the fetch() or XMLHttpRequest APIs. Once created, a worker can send messages to the JavaScript code that created it by posting messages to an event handler specified by that code (and vice versa).

Using Service Workers - Web APIs | MDN

This article provides information on getting started with service workers, including basic architecture, registering a service worker, the installation and activation process for a new service worker, updating your service worker, cache control and custom responses, all in the context of an app with offline functionality.

dbushell.com

Search with Zig, Wasm, and a Worker

The one where I rebuild it in Zig

The art and sport of using a technical challenge to distract a nerd. Bonus points for sniping yourself. ↩︎
A low-level programming language with no hidden control flow and no hidden memory allocations. ↩︎
JavaScript Object Notation. An almost perfect specification. If only they’d allowed trailing commas. ↩︎
Low-level machine code for web browsers. Wasm binary instructions are run in a virtual machine. ↩︎
Arguably the most evil company in tech resolute in their crusade to gate-keep the web and sell your privacy as a service. ↩︎
JavaScript that runs in its own thread. Handy for non-blocking background tasks. ↩︎