
Hmmarkdown 2
Everyone has an opinion on markdown[1] but why stop there? Write your own parser, make those opinions reality! That’s what I did with Hmmarkdown, my HTML-aware markdown library. It has built my website content for the past year.
Turns out parsing markdown (with HTML) isn’t easy. My original approach evolved into a game of whac-a-mole to quash edge case bugs. Last week I began a new parsing experiment I’d been mulling over. The idea proved workable and I finished the job.
Hmmarkdown 2 was born!
The new codebase is still rough around the edges but already an upgrade. I used my original test suite to ensure the same output. My primary goal was a more maintainable, extendable, and faster library, which it should be soon.
The Purpose
Markdown is best in its simplest form. Complex extensions to the syntax make no sense. If HTML is the target and easier to mark up, just write HTML! Existing markdown libraries allow HTML but they skip past it. The purpose of Hmmarkdown is to let me write primarily markdown but interweave HTML where it makes sense.
Along with my original example, here’s a common pattern I use:
<figure>
> blockquote
<figcaption>[reference](https://example.com)</figcaption>
</figure>
The mix of HTML and markdown above is transformed into the HTML below.
<figure>
<blockquote>
<p>blockquote</p>
</blockquote>
<figcaption>
<a href="https://example.com/">reference</a>
</figcaption>
</figure>
In practice I mix little markup but it’s extremely useful to have the ability. Was this worth the investment? Probably not, but I’m in too deep!
Old and Busted
My old parser separated lines by \n then grouped lines by block: paragraph, blockquote, heading, list, etc. Those blocks were then parsed as HTML (crudely). Text nodes were parsed for inline markdown: links, bold, italic, etc. A subset of block-level HTML elements were passed back through the parser. Regular expressions did the heavy lifting.
New Hotness
That was the old architecture. It worked but it got messy. The new parser begins with a more traditional tokenizer. The tokenizer iterates the input character by character to generate an array of tokens, namely:
- Asterisk
- Backtick
- Underscore
- Exclamation
- Hash
- Tilde
- Plus†
- Newline‡
- Parentheses
- Square brackets
- Angle brackets
- Tag
- Text
† Wait a minute, “+” is not markdown! I’ll explain later…
Every token is a single ASCII character except for Tag and Text. Tag tokens are HTML tags, like <div>, </div>, or <div/>. Text tokens are a Unicode string of everything else. From the tokens, I generate a basic DOM-like tree with a “root” node and child tokens.
‡ I ignore carriage returns (macOS user); they’re probably dealt with in Text nodes and eventually trimmed?
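To make the tokenizer idea concrete, here is a minimal Python sketch. It is not the Hmmarkdown source: the token names, the `SINGLE` table, the `TAG` pattern, and the `tokenize` function are all assumptions made for illustration, and the tag matching is deliberately naive.

```python
import re

# Hypothetical token names, one per single-character token listed above.
SINGLE = {
    "*": "ASTERISK", "`": "BACKTICK", "_": "UNDERSCORE", "!": "EXCLAMATION",
    "#": "HASH", "~": "TILDE", "+": "PLUS", "\n": "NEWLINE",
    "(": "PAREN_OPEN", ")": "PAREN_CLOSE",
    "[": "SQUARE_OPEN", "]": "SQUARE_CLOSE",
    "<": "ANGLE_OPEN", ">": "ANGLE_CLOSE",
}

# Assumption: a tag is "<", an optional "/", a letter, then anything up to ">".
TAG = re.compile(r"</?[A-Za-z][^<>]*>")


def tokenize(source):
    """Walk the input character by character and return (kind, value) tuples."""
    tokens = []
    text = []  # characters of the Text token currently being collected

    def flush():
        if text:
            tokens.append(("TEXT", "".join(text)))
            text.clear()

    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "<":
            match = TAG.match(source, i)
            if match:
                flush()
                tokens.append(("TAG", match.group()))
                i = match.end()
                continue
        if ch in SINGLE:
            flush()
            tokens.append((SINGLE[ch], ch))
        else:
            # Everything else, carriage returns included, lands in Text.
            text.append(ch)
        i += 1
    flush()
    return tokens
```

A lone `<` or `>` that is not part of a tag stays an angle bracket token, which is what lets `> blockquote` be recognised later.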
Using the HTML + markdown input example below:
<aside class="Box">
This is a **"boxed"** paragraph.
</aside>
End of document!
The initial token tree state could be visualised like this:
TAG(<root>)
  TAG(<aside class="Box">)
    NEWLINE
    TEXT(' This is a ')
    ASTERISK
    ASTERISK
    TEXT('"boxed"')
    ASTERISK
    ASTERISK
    TEXT(' paragraph.')
    NEWLINE
  NEWLINE
  NEWLINE
  TEXT('End of document')
  EXCLAMATION
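One plausible way to get from a flat token list to a tree like this is a stack of open elements. The sketch below is an assumption about the approach, not the real code: `build_tree`, `tag_name`, the dict-based node shape, and the small `VOID` set are all hypothetical.

```python
import re

# Assumption: a small subset of elements that never have children.
VOID = {"br", "hr", "img", "input", "meta", "link"}


def tag_name(tag):
    """Extract the lower-cased element name from a raw tag string."""
    return re.match(r"</?\s*([A-Za-z][A-Za-z0-9-]*)", tag).group(1).lower()


def build_tree(tokens):
    """Nest (kind, value) tokens under the open tags that contain them."""
    root = {"tag": "<root>", "children": []}
    stack = [root]  # the innermost open element is stack[-1]
    for kind, value in tokens:
        if kind != "TAG":
            stack[-1]["children"].append((kind, value))
        elif value.startswith("</"):
            # Close tag: pop back to the matching open element, if any.
            name = tag_name(value)
            for depth in range(len(stack) - 1, 0, -1):
                if tag_name(stack[depth]["tag"]) == name:
                    del stack[depth:]
                    break
        else:
            node = {"tag": value, "children": []}
            stack[-1]["children"].append(node)
            if not value.endswith("/>") and tag_name(value) not in VOID:
                stack.append(node)
    return root
```

Close tags are consumed rather than stored, and a stray close tag with no matching open element is silently dropped in this sketch.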
With this tree I recursively parse the open tag nodes where the tag name is in an allowed set. This lets me ignore hard-coded HTML tags like <script>, <style>, and <iframe>, which I never want to parse or modify.
Using the <aside> node as an example, I iterate the children to generate a new array of children. The first two tokens, NEWLINE and TEXT, are appended to the new array. Then an ASTERISK token is found so consumeStrong is called. It will return a <strong> node (or nothing, for false positives). If that fails, consumeEmphasis will be tried. If nothing matches, the ASTERISK and TEXT tokens are appended without change. In this example **"boxed"** from the original input matches bold formatting.
The tree state now looks like this:
TAG(<root>)
  TAG(<aside class="Box">)
    NEWLINE
    TEXT(' This is a ')
    TAG(<strong>)
      TEXT('"boxed"')
    TEXT(' paragraph.')
    NEWLINE
  NEWLINE
  NEWLINE
  TEXT('End of document')
  EXCLAMATION
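The inline pass described above could be sketched roughly like this. Only the name consumeStrong comes from the post (written here as `consume_strong`); the signature, the `kind` helper, `parse_inline`, and the rule that bold never spans a NEWLINE are assumptions for illustration.

```python
def kind(item):
    # Tokens are (kind, value) tuples; already-built nodes are dicts.
    return item[0] if isinstance(item, tuple) else "NODE"


def consume_strong(items, i):
    """Try to match **...** starting at items[i].

    Returns (node, next_index) on success, or None for a false positive.
    """
    def is_star(j):
        return j < len(items) and kind(items[j]) == "ASTERISK"

    if not (is_star(i) and is_star(i + 1)):
        return None
    j = i + 2
    while j < len(items):
        if kind(items[j]) == "NEWLINE":
            return None  # assumption: bold never spans a line break
        if is_star(j) and is_star(j + 1):
            if j == i + 2:
                return None  # "****" has nothing to wrap
            return {"tag": "<strong>", "children": items[i + 2:j]}, j + 2
        j += 1
    return None


def parse_inline(items):
    """Build a new child list, replacing matched token runs with nodes."""
    out = []
    i = 0
    while i < len(items):
        if kind(items[i]) == "ASTERISK":
            match = consume_strong(items, i)
            # A fuller version would try an emphasis matcher here on failure.
            if match:
                node, i = match
                out.append(node)
                continue
        out.append(items[i])  # no match: keep the token unchanged
        i += 1
    return out
```

Returning the next index alongside the node keeps the look-ahead local: the caller either jumps past the consumed tokens or carries on as if nothing happened.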
The next step wraps text and inline tags with HTML paragraphs. This is probably the ropiest area of my code but it works (mostly). NEWLINE tokens play a key role and they’re removed at this stage. Excess whitespace is also trimmed.
TAG(<root>)
  TAG(<aside class="Box">)
    TAG(<p>)
      TEXT('This is a ')
      TAG(<strong>)
        TEXT('"boxed"')
      TEXT(' paragraph.')
  TAG(<p>)
    TEXT('End of document')
    EXCLAMATION
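A heavily simplified sketch of the paragraph step follows. It assumes that every NEWLINE and every block-level node ends the current paragraph, which is stricter than most markdown flavours and may not match what Hmmarkdown really does. The `BLOCK` set, `is_block`, and `wrap_paragraphs` are hypothetical names.

```python
# Assumption: elements treated as block-level, so never wrapped in <p>.
BLOCK = {"aside", "figure", "figcaption", "blockquote", "div", "p"}


def is_block(item):
    if not isinstance(item, dict):
        return False
    name = item["tag"].strip("<>/").split()[0].lower()
    return name in BLOCK


def wrap_paragraphs(children):
    """Wrap runs of text and inline nodes in <p>, dropping NEWLINE tokens."""
    out, run = [], []

    def flush():
        # Trim whitespace at the outer edges of the paragraph only.
        if run and isinstance(run[0], tuple) and run[0][0] == "TEXT":
            run[0] = ("TEXT", run[0][1].lstrip())
        if run and isinstance(run[-1], tuple) and run[-1][0] == "TEXT":
            run[-1] = ("TEXT", run[-1][1].rstrip())
        content = [item for item in run if item != ("TEXT", "")]
        if content:
            out.append({"tag": "<p>", "children": content})
        run.clear()

    for item in children:
        if isinstance(item, tuple) and item[0] == "NEWLINE":
            flush()  # simplification: any line break closes the paragraph
        elif is_block(item):
            flush()
            out.append(item)
        else:
            run.append(item)
    flush()
    return out
```

The function handles one level of children; a caller would apply it to the root and again to each allowed block element, such as the `<aside>` in the example.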
Next I merge adjacent text tokens before applying SmartyPants replacement, and finally HTML entities are escaped. In this example, because the EXCLAMATION token did not match markdown image syntax, it is merged as text.
TAG(<root>)
  TAG(<aside class="Box">)
    TAG(<p>)
      TEXT('This is a ')
      TAG(<strong>)
        TEXT('“boxed”')
      TEXT(' paragraph.')
  TAG(<p>)
    TEXT('End of document!')
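The merge, replace, and escape order can be sketched as below. This is an illustration under assumptions: `finish` and `smarten` are hypothetical names, and `smarten` is a tiny stand-in that only curls straight double quotes, nowhere near full SmartyPants.

```python
import html
import re


def smarten(text):
    """Tiny stand-in for SmartyPants: curl straight double quotes only."""
    # A quote at the start, or after whitespace or an opening bracket, opens.
    text = re.sub(r'(^|[\s(\[])"', "\\1\u201c", text)
    return text.replace('"', "\u201d")


def finish(children):
    """Merge adjacent tokens into Text, then smarten and escape each run."""
    out = []
    for item in children:
        if isinstance(item, dict):
            out.append({"tag": item["tag"], "children": finish(item["children"])})
        elif out and isinstance(out[-1], tuple):
            # Leftover tokens such as EXCLAMATION fold into the previous text.
            out[-1] = ("TEXT", out[-1][1] + item[1])
        else:
            out.append(("TEXT", item[1]))
    return [
        ("TEXT", html.escape(smarten(item[1]), quote=False))
        if isinstance(item, tuple) else item
        for item in out
    ]
```

Merging first matters: quote replacement and escaping then see whole runs of text rather than fragments split around unmatched tokens.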
The final HTML output comes from a simple recursive function over the tree nodes that generates a string. I’ve added extra formatting for readability below. HTML attributes are never parsed; they just come along for the ride.
<aside class="Box">
  <p>This is a <strong>“boxed”</strong> paragraph.</p>
</aside>
<p>End of document!</p>
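A recursive serializer over such a tree can be very small. The sketch below assumes the same hypothetical node shape as a dict with `tag` and `children`, with text held in `(kind, value)` tuples; void and self-closing elements are left out for brevity.

```python
def render(node):
    """Serialize a node and its children to an HTML string."""
    if isinstance(node, tuple):
        return node[1]  # text was already escaped in the previous step
    inner = "".join(render(child) for child in node["children"])
    if node["tag"] == "<root>":
        return inner  # the root is only a container, not real markup
    # The open tag is emitted verbatim, so attributes are never touched.
    name = node["tag"].strip("<>").split()[0]
    return f"{node['tag']}{inner}</{name}>"
```

Because the original open tag string is reused as-is, attributes such as `class="Box"` pass through without ever being parsed. The output has no added whitespace between elements.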
And that is how Hmmarkdown 2 works!
Or at least should work. We’ll see if any formatting bugs appear on my website. The new tokenizer approach means I can largely avoid regex. The supported markdown syntax is still punishingly strict and opinionated.
The code repo is public but I have plenty to tidy up and optimise. There is no validation nor error reporting. I wouldn’t advise using it unless you’re me!
The Plus Token
So about that PLUS token. Whereas unordered lists start with an ASTERISK token, ordered lists would be written as NUMBER followed by PERIOD.
1. Item one
2. Item two
In fact, the numeric order and values don’t matter to markdown.
9999. Item one
1. Item two
Both examples should output identical items marked 1 and 2 sequentially. (Some libraries do add a start attribute. That’s a thing I don’t need.)
Anyway, this is a pain to parse. I would need eleven additional tokens for digits 0 to 9 and PERIOD. That adds overhead and a lot of false positives in the look-ahead matching. To avoid this entirely I do a cheeky bit of regex pre-processing. Ordered list lines are replaced with PLUS before I tokenize.
+ Item one
+ Item two
Now I can use the exact same logic I use to parse unordered lists.
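The pre-processing step could be as small as one substitution. This is a guess at the shape of it, not the actual pattern used: `ORDERED` and `preprocess_ordered_lists` are hypothetical names, and the pattern assumes list markers sit at the very start of a line and are followed by a single space.

```python
import re

# Assumption: an ordered list marker is digits, a period, and one space
# at the very start of a line.
ORDERED = re.compile(r"^\d+\. ", flags=re.MULTILINE)


def preprocess_ordered_lists(source):
    """Swap "1. " style markers for "+ " before tokenizing."""
    return ORDERED.sub("+ ", source)


print(preprocess_ordered_lists("9999. Item one\n1. Item two"))
```

One caveat with a pattern this loose: a paragraph line that happens to begin with something like `2024. ` would also be rewritten, so a real version may need to be stricter.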
Hmmarkdown has never supported nested lists because in over a decade of blogging I’ve never nested a list. That saves me another headache.
Tune in next year when I throw this all away and announce Hmmarkdown 3!
Sources on 'Markdown'(1)
A simple plain text markup language created by John Gruber and bastardised by everyone else. Designed for writers who enjoy teeny-weeny font sizes.