This is a sequel to my earlier post on parsing HTML in Unison:
html-parse - HTML parser in Unison
html-parse now has a second interpreter for the same parseHtmlText pipeline. Same parser, same malformed-HTML recovery, same entrypoint. But instead of producing Html, it produces typed markdown output, with an AST, rendered text, diagnostics, and typed fatal failures.
The key insight is that parseHtmlText works through the HtmlBuild ability. Swapping the handler is all it takes to get a completely different output type:
-- existing
buildHtml do parseHtmlText "<h1>Hello</h1>"
-- new
buildMarkdown do parseHtmlText "<h1>Hello</h1>"
Rather than writing a new parser or token walker, I decided to piggyback on the malformed-HTML recovery stack machine in buildHtmlTree.
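To make the handler swap concrete, here is a toy sketch of the same idea. The Build ability, its emitPiece operation, and both handlers are hypothetical miniatures, not the library's HtmlBuild ability or its real handlers. One program is written once against the ability, and each handler decides what type comes out.

```unison
-- Toy sketch: Build, emitPiece, collectPieces, and countPieces are
-- hypothetical stand-ins, not the library's HtmlBuild ability or handlers.
ability Build where
  emitPiece : Text ->{Build} ()

-- One program, written once against the ability.
program : '{Build} ()
program = do
  emitPiece "Hello"
  emitPiece "world"

-- Handler 1: interpret the program as the list of emitted pieces.
collectPieces : '{Build} () -> [Text]
collectPieces thunk =
  go acc = cases
    { emitPiece piece -> resume } ->
      handle resume () with go (acc List.++ [piece])
    { _ } -> acc
  handle thunk() with go []

-- Handler 2: interpret the same program as a count of pieces.
countPieces : '{Build} () -> Nat
countPieces thunk =
  go n = cases
    { emitPiece _ -> resume } -> handle resume () with go (n Nat.+ 1)
    { _ } -> n
  handle thunk() with go 0
```

Calling collectPieces program and countPieces program runs the same program through two interpreters, which is the relationship buildHtml and buildMarkdown have to parseHtmlText.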
Output is a typed envelope, not just Text
I made a major mistake in the HTML parser by not modelling diagnostics and failures properly. Hence, the first design decision I took for the markdown builder is that the result type is not just the final structured type or Text. It is a ConversionResult envelope.
This gives callers four independent things to inspect. rendered is empty if there is a fatal failure. diagnostics accumulates recoverable policy events regardless. ast is available even when rendering fails. And failure is None on the happy path, Some with a typed reason when something exceeded a bound.
The AST itself has block and inline constructors covering the initial supported tags.
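As a rough sketch, the envelope and AST could be shaped like this. The field order (ast, rendered, diagnostics, failure) and the constructors Heading, Paragraph, PlainText, Strong, Emphasis, and Link are inferred from the example output later in this post. The field types, the meaning of Link's second argument, and the use of Text as a stand-in for the typed diagnostics and failures are assumptions, not the library's actual definitions.

```unison
-- Sketch only: shapes inferred from the example output in this post.
type Inline
  = PlainText Text
  | Strong [Inline]
  | Emphasis [Inline]
  | Link Text Text [Inline] -- href, title (assumed), children

type Block
  = Heading Nat [Inline]
  | Paragraph [Inline]

type MarkdownAst = MarkdownAst [Block]

-- Text stands in for the library's typed Diagnostic and Failure values.
type ConversionResult =
  { ast : MarkdownAst
  , rendered : Text
  , diagnostics : [Text]
  , failure : Optional Text
  }

-- Hypothetical helper: a caller can inspect each field independently.
summary : ConversionResult -> Text
summary result =
  match ConversionResult.failure result with
    None -> "ok: " Text.++ ConversionResult.rendered result
    Some reason -> "failed: " Text.++ reason
```

The point of the envelope is that summary only looks at two of the four fields. A caller that cares about diagnostics or the AST can read those without re-running the conversion.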
Opinions encoded as types
Markdown looks simple on the surface, but it gets really weird once specifications like CommonMark or GFM are involved. I might enhance the library later to support all that weirdness, but for now I am keeping it dead simple.
Hence, rather than leaving tag behaviour implicit, the library classifies every HTML tag into one of five classes:
type TagClass
  = SemanticTag -- h1..h6, p, a, em, strong, ul, ol, li, code, pre, blockquote, hr, br
  | UnsafeTag -- script, style
  | DeferredTag -- img, table, video, and table structure tags
  | TransparentTag -- known wrapper tags with no markdown equivalent
  | UnknownTag -- everything else
Policy dispatch is then a pure function of config and class:
type PolicyAction =
  EmitDeferredRawHtml
  | SuppressDeferredRawHtml
policy.apply : MarkdownPolicy -> TagClass -> PolicyAction
policy.apply policy = cases
  DeferredTag ->
    if allowDeferredRawHtml policy then
      EmitDeferredRawHtml
    else
      SuppressDeferredRawHtml
  _ -> EmitDeferredRawHtml
UnsafeTag never reaches policy.apply because the subtree is dropped before dispatch. TransparentTag and UnknownTag are always unwrapped. Only DeferredTag is configurable.
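The classification step and the per-class handling described above can be sketched as two small pure functions. Both classify and handling are hypothetical names, the tag list is a small subset taken from the comments on TagClass, and treating div and span as transparent wrappers is an assumption on my part.

```unison
-- Sketch only: classify, Handling, and handling are hypothetical names.
type TagClass
  = SemanticTag
  | UnsafeTag
  | DeferredTag
  | TransparentTag
  | UnknownTag

-- A small subset of tags, for illustration.
classify : Text -> TagClass
classify = cases
  "h1" -> SemanticTag
  "p" -> SemanticTag
  "a" -> SemanticTag
  "script" -> UnsafeTag
  "style" -> UnsafeTag
  "img" -> DeferredTag
  "table" -> DeferredTag
  "div" -> TransparentTag -- assumed example of a wrapper tag
  "span" -> TransparentTag -- assumed example of a wrapper tag
  _ -> UnknownTag

type Handling = Convert | DropSubtree | Unwrap | AskPolicy

-- What happens to each class, as described in the text.
handling : TagClass -> Handling
handling = cases
  SemanticTag -> Convert
  UnsafeTag -> DropSubtree -- dropped before policy dispatch
  DeferredTag -> AskPolicy -- the only configurable class
  TransparentTag -> Unwrap
  UnknownTag -> Unwrap
```

Because both functions are total matches over closed types, adding a sixth class would force every match to be revisited, which is the benefit of encoding the opinion as a type.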
Failures are also an explicit closed enum.
There is no generic "conversion failed" variant. Each bound has its own constructor so callers can match on the exact reason without parsing strings.
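A sketch of what such a closed failure type could look like follows. InputTokenLimitExceeded and the field order (kind, configured limit, actual count, message) come from the example output later in this post. The other two constructor names are hypothetical placeholders for the structural and output-size bounds, and the type is named ConversionFailure here only to keep the sketch free of name clashes. The example output shows the library's constructor as Failure.

```unison
-- Sketch only: two of the three kinds below are hypothetical names.
type FailureKind
  = InputTokenLimitExceeded
  | StructureLimitExceeded -- hypothetical: a bound checked after the AST is built
  | OutputSizeLimitExceeded -- hypothetical: a bound checked after rendering

-- kind, configured limit, actual count, message
type ConversionFailure = ConversionFailure FailureKind Nat Nat Text

-- Callers match on the constructor instead of parsing the message.
explain : ConversionFailure -> Text
explain = cases
  ConversionFailure InputTokenLimitExceeded limit actual _ ->
    count = Nat.toText actual
    bound = Nat.toText limit
    "input tokens: " Text.++ count Text.++ " > " Text.++ bound
  ConversionFailure StructureLimitExceeded _ _ message -> message
  ConversionFailure OutputSizeLimitExceeded _ _ message -> message
```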
The guard pipeline
Conversion runs in three guarded stages. Each stage either short-circuits with a typed failure or passes control forward.
guardInput catches token-count violations before any tree is built. guardConvert catches structural violations after the AST is produced. guardRender catches output-size violations after rendering. A failure at any stage returns an empty rendered string, but earlier stages still populate ast and diagnostics so callers have something to inspect.
The conversion stage also emits diagnostics for every recoverable policy event, regardless of whether the overall result is a failure. These are warnings, not errors, and they do not short-circuit the pipeline.
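The control flow of the three guards can be sketched as nested matches on Either. Everything in this sketch is a simplified stand-in: tokens and AST nodes are plain Text, the failure is a Text label rather than a typed value, and Limits, Outcome, checkBound, and convert are hypothetical names, not the library's API.

```unison
-- Sketch only: every name here is a simplified, hypothetical stand-in.
type Limits = { maxTokens : Nat, maxNodes : Nat, maxOutput : Nat }

type Outcome = { ast : [Text], rendered : Text, failure : Optional Text }

-- Left means "short-circuit with this reason", Right means "carry on".
checkBound : Text -> Nat -> Nat -> Either Text ()
checkBound label limit actual =
  if actual Nat.> limit then Left label else Right ()

convert : Limits -> [Text] -> Outcome
convert limits tokens =
  match checkBound "input" (Limits.maxTokens limits) (List.size tokens) with
    -- Input guard failed: no tree was built, so the AST is empty.
    Left reason -> Outcome [] "" (Some reason)
    Right _ ->
      tree = tokens -- stand-in for real tree building
      match checkBound "convert" (Limits.maxNodes limits) (List.size tree) with
        -- Convert guard failed: the AST is kept, rendering is skipped.
        Left reason -> Outcome tree "" (Some reason)
        Right _ ->
          output = Text.join "\n" tree
          match checkBound "render" (Limits.maxOutput limits) (Text.size output) with
            -- Render guard failed: the AST is kept, rendered stays empty.
            Left reason -> Outcome tree "" (Some reason)
            Right _ -> Outcome tree output None
```

Each failing branch still returns whatever the earlier stages produced, which is why a caller can inspect the AST even when rendered is empty.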
Wrapping up
Here are some examples of how these design and implementation decisions work in practice.
Happy paths
buildMarkdown do parseHtmlText "<h1>Hello</h1><p>world</p>"
⧨
ConversionResult
  (MarkdownAst
    [ Heading 1 [PlainText "Hello"]
    , Paragraph [PlainText "world"]
    ])
  "# Hello\n\nworld\n\n"
  []
  None
Inline formatting with a link:
buildMarkdown do
  parseHtmlText "<p>Some <strong>bold</strong> and <em>italic</em> with <a href=\"https://example.com\">docs</a>.</p>"
⧨
ConversionResult
  (MarkdownAst
    [ Paragraph
        [ PlainText "Some "
        , Strong [PlainText "bold"]
        , PlainText " and "
        , Emphasis [PlainText "italic"]
        , PlainText " with "
        , Link "https://example.com" "" [PlainText "docs"]
        , PlainText "."
        ]
    ])
  "Some **bold** and *italic* with [docs](https://example.com).\n\n"
  []
  None
Unhappy paths
Unsafe content is dropped and does not appear in the rendered output:
result = buildMarkdown do
  parseHtmlText "<p>ok</p><script>alert(1)</script><p>end</p>"
ConversionResult.rendered result
⧨
"ok\n\nend\n\n"
alert(1) never makes it into output. There is no diagnostic for this: unsafe subtrees are silently tombstoned by design because emitting a warning for script tags in real-world HTML would be overwhelming noise.
Deferred tags emit raw HTML and a diagnostic with a source path:
result = buildMarkdown do parseHtmlText "<custom><img src=\"x\"/></custom>"
diagnostics result
⧨
[ Diagnostic PolicyStage DeferredToRawHtml "img" (SourceRef "img" [0, 0])
, Diagnostic PolicyStage UnknownTagUnwrapped "custom" (SourceRef "custom" [0])
]
The SourceRef carries the tag name and the index path through the tree, so callers know exactly where in the input each policy event occurred. The path [0, 0] means the first child of the first root node.
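To show how an index path like [0, 0] picks out a node, here is a small sketch of resolving a path against a tree. Node and nameAt are hypothetical names for illustration. The library's own tree and SourceRef types are not shown in this post.

```unison
-- Sketch only: Node and nameAt are hypothetical, not the library's types.
type Node = Node Text [Node]

-- Follow a list of child indexes down from the root nodes and return
-- the tag name found at the end of the path, if the path is valid.
nameAt : [Nat] -> [Node] -> Optional Text
nameAt path nodes =
  match path with
    [] -> None
    i +: rest ->
      match List.at i nodes with
        None -> None
        Some (Node name children) ->
          if List.isEmpty rest then Some name else nameAt rest children
```

For a tree with custom at the root and img as its only child, the path [0] resolves to custom and the path [0, 0] resolves to img, matching the two diagnostics above.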
Hard limit produces a typed fatal failure:
limits = maxInputTokens.set 1 MarkdownLimits.defaults
config = MarkdownLimits.set limits MarkdownConfig.defaults
result = buildMarkdownWith config do parseHtmlText "<p>a</p><p>b</p>"
(ConversionResult.rendered result, ConversionResult.failure result)
⧨
( ""
, Some
    (Failure InputTokenLimitExceeded 1 6 "input token count exceeded maxInputTokens")
)
rendered is empty. failure tells you exactly which limit was breached, what the configured limit was, and what the actual count was.
Policy toggle changes behavior without touching conversion logic:
strict = MarkdownPolicy.allowDeferredRawHtml.set false MarkdownPolicy.defaults
config = MarkdownPolicy.set strict MarkdownConfig.defaults
result = buildMarkdownWith config do parseHtmlText "<p>text</p><img src=\"x\"/>"
ConversionResult.rendered result
⧨
"text\n\n"
With allowDeferredRawHtml = false, the <img> is suppressed entirely. The diagnostic still fires.
All changes for this release are visible via this PR - #2
This is the next building block for my ultimate intent: an AT Protocol-based RSS reader. More to come, happy hacking 🤘
If you have read this far, thanks! I am available for contract, temporary, freelance, and advisory roles where I can work closely with stakeholders or as an IC to solve hard product and engineering problems quickly. I am open to working in a variety of languages - Haskell, Rust, TypeScript/JavaScript, and, of course, Unison. I am fun to work with and have almost 20 years of experience, with expertise in functional programming, distributed systems, cloud-native architectures, and agentic workflows.