This is a sequel to my earlier post on parsing HTML in Unison:

html-parse - HTML parser in Unison

html-parse now has a second interpreter for the same parseHtmlText pipeline. Same parser, same malformed-HTML recovery, same entrypoint. But instead of producing Html, it produces typed markdown output, with an AST, rendered text, diagnostics, and typed fatal failures.

The key insight is that parseHtmlText works through the HtmlBuild ability. Swapping the handler is all it takes to get a completely different output type:


-- existing
buildHtml do parseHtmlText "<h1>Hello</h1>"

-- new
buildMarkdown do parseHtmlText "<h1>Hello</h1>"

Instead of writing a new parser or token walker, I decided to piggyback on the malformed-HTML recovery stack machine in buildHtmlTree.
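To make the handler-swapping idea concrete, here is a minimal sketch with a toy ability. The Emit ability, its emit operation, and both handlers are hypothetical stand-ins for illustration; they are not the HtmlBuild ability or the library's actual handlers.

```unison
-- Toy ability standing in for HtmlBuild (hypothetical; the real operations differ).
structural ability Emit where
  emit : Text -> ()

-- One program written against the ability...
program : '{Emit} ()
program = do
  emit "h1"
  emit "Hello"

-- ...interpreted by a handler that collects every emitted value...
toTexts : '{Emit} () -> [Text]
toTexts p =
  go acc = cases
    { emit t -> resume } -> handle resume () with go (acc :+ t)
    { _ } -> acc
  handle p() with go []

-- ...and by a second handler that only counts the emissions.
toCount : '{Emit} () -> Nat
toCount p =
  go n = cases
    { emit _ -> resume } -> handle resume () with go (n Nat.+ 1)
    { _ } -> n
  handle p() with go 0
```

The program never changes; only the handler decides whether the result is a [Text] or a Nat. buildHtml and buildMarkdown relate to parseHtmlText in the same way.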

Output is a typed envelope, not just Text

A major mistake I made in the HTML parser was not modelling diagnostics and failures properly. Hence, the first design decision I made for the markdown builder was that the result type is not just the final structured type or Text; it is a ConversionResult.

This gives callers four independent things to inspect. rendered is empty if there is a fatal failure. diagnostics accumulates recoverable policy events regardless. ast is available even when rendering fails. And failure is None on the happy path, Some with a typed reason when something exceeded a bound.
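As a rough sketch, the envelope can be pictured as a record with those four fields. The field names and their order follow the prose above and the example outputs later in this post; the field types and the three stand-in types (MarkdownAst, Diagnostic, FailureInfo) are simplified assumptions, not the library's definitions.

```unison
-- Stand-in types so the sketch stands alone; the real definitions are richer.
type MarkdownAst = MarkdownAst [Text]
type Diagnostic = Diagnostic Text
type FailureInfo = FailureInfo Text

-- Assumed shape of the envelope: four independently inspectable fields.
type ConversionResult
  = { ast : MarkdownAst,
      rendered : Text,
      diagnostics : [Diagnostic],
      failure : Optional FailureInfo }

-- A caller can branch on failure first, then decide what else to read.
summarize : ConversionResult -> Text
summarize result = match ConversionResult.failure result with
  Some _ -> "conversion failed"
  None -> ConversionResult.rendered result
```

Because the record generates one accessor per field, a caller that only cares about diagnostics never has to look at rendered or failure at all.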

The AST itself has block and inline constructors covering the initial supported tags.

Opinions encoded as types

Markdown looks simple on the surface, but it gets really weird once specifications like CommonMark or GFM are involved. I might enhance the library later to support all that weirdness, but for now I am keeping it dead simple.

Hence, rather than leaving tag behaviour implicit, the library classifies every HTML tag into one of five classes:


type TagClass
  = SemanticTag    -- h1..h6, p, a, em, strong, ul, ol, li, code, pre, blockquote, hr, br
  | UnsafeTag      -- script, style
  | DeferredTag    -- img, table, video, and table structure tags
  | TransparentTag -- known wrapper tags with no markdown equivalent
  | UnknownTag     -- everything else

Policy dispatch is then a pure function of config and class:


type PolicyAction
  = EmitDeferredRawHtml
  | SuppressDeferredRawHtml

policy.apply : MarkdownPolicy -> TagClass -> PolicyAction
policy.apply policy = cases
  DeferredTag ->
    if allowDeferredRawHtml policy then
      EmitDeferredRawHtml
    else
      SuppressDeferredRawHtml
  _ -> EmitDeferredRawHtml

UnsafeTag never reaches policy.apply because the subtree is dropped before dispatch. TransparentTag and UnknownTag are always unwrapped. Only DeferredTag is configurable.
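For illustration, the classification step that runs before policy dispatch could be sketched as a plain pattern match on the tag name. The classify function is hypothetical, only a few tags per class are listed, and the div and span entries are assumptions because this post does not enumerate the transparent tags.

```unison
type TagClass
  = SemanticTag
  | UnsafeTag
  | DeferredTag
  | TransparentTag
  | UnknownTag

-- Hypothetical classifier; only a handful of tags per class to keep it short.
classify : Text -> TagClass
classify = cases
  "p" -> SemanticTag
  "a" -> SemanticTag
  "script" -> UnsafeTag
  "style" -> UnsafeTag
  "img" -> DeferredTag
  "table" -> DeferredTag
  "div" -> TransparentTag   -- assumption: the transparent tags are not listed in the post
  "span" -> TransparentTag  -- assumption, as above
  _ -> UnknownTag
```

Anything that falls through to the wildcard, such as a made-up custom tag, ends up as UnknownTag and gets unwrapped.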

Failures are also modelled as an explicit closed enum.

There is no generic "conversion failed" variant. Each bound has its own constructor so callers can match on the exact reason without parsing strings.

The guard pipeline

Conversion runs in three guarded stages. Each stage either short-circuits with a typed failure or passes control forward.

guardInput catches token-count violations before any tree is built. guardConvert catches structural violations after the AST is produced. guardRender catches output-size violations after rendering. A failure at any stage returns an empty rendered string, but earlier stages still populate ast and diagnostics so callers have something to inspect.

The conversion stage also emits diagnostics for every recoverable policy event, regardless of whether the overall result is a failure. These are warnings, not errors, and they do not short-circuit the pipeline.
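The short-circuiting itself can be sketched with Either. Everything below is a simplified assumption: InputTokenLimitExceeded is the only reason name that appears in this post, the other two constructors, the GuardFailure type, checkLimit, runGuards, and the limits are all hypothetical, and the real pipeline also carries the AST and diagnostics forward, which this sketch leaves out.

```unison
type FailureReason
  = InputTokenLimitExceeded
  | NodeLimitExceeded        -- hypothetical name
  | OutputSizeLimitExceeded  -- hypothetical name

-- Stand-in failure value: reason, configured limit, observed value.
type GuardFailure = GuardFailure FailureReason Nat Nat

checkLimit : FailureReason -> Nat -> Nat -> Either GuardFailure ()
checkLimit reason limit actual =
  if actual Nat.> limit then Left (GuardFailure reason limit actual) else Right ()

-- Each stage runs only if the previous one passed; the limits are made up.
runGuards : Nat -> Nat -> Nat -> Either GuardFailure ()
runGuards tokens nodes outputSize =
  match checkLimit InputTokenLimitExceeded 1 tokens with
    Left failure -> Left failure
    Right _ -> match checkLimit NodeLimitExceeded 100 nodes with
      Left failure -> Left failure
      Right _ -> checkLimit OutputSizeLimitExceeded 1000 outputSize

-- Callers match on the constructor instead of parsing a message string.
describe : GuardFailure -> Text
describe = cases
  GuardFailure InputTokenLimitExceeded _ _ -> "too many input tokens"
  GuardFailure NodeLimitExceeded _ _ -> "AST too large"
  GuardFailure OutputSizeLimitExceeded _ _ -> "rendered output too large"
```

With a token limit of 1 and six input tokens, the first guard fails and the later two never run.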

Wrapping up

So let's look at some examples of how all those design and implementation concepts work in practice.

Happy paths


buildMarkdown do parseHtmlText "<h1>Hello</h1><p>world</p>"

ConversionResult
  (MarkdownAst
    [ Heading 1 [PlainText "Hello"]
    , Paragraph [PlainText "world"]
    ])
  "# Hello\n\nworld\n\n"
  []
  None

Inline formatting with a link:


buildMarkdown do
  parseHtmlText "<p>Some <strong>bold</strong> and <em>italic</em> with <a href=\"https://example.com\">docs</a>.</p>"

ConversionResult
  (MarkdownAst
    [ Paragraph
        [ PlainText "Some "
        , Strong [PlainText "bold"]
        , PlainText " and "
        , Emphasis [PlainText "italic"]
        , PlainText " with "
        , Link "https://example.com" "" [PlainText "docs"]
        , PlainText "."
        ]
    ])
  "Some **bold** and *italic* with [docs](https://example.com).\n\n"
  []
  None

Unhappy paths

Unsafe content is dropped and does not appear in the rendered output:


result = buildMarkdown do
  parseHtmlText "<p>ok</p><script>alert(1)</script><p>end</p>"
ConversionResult.rendered result

"ok\n\nend\n\n"

alert(1) never makes it into output. There is no diagnostic for this: unsafe subtrees are silently tombstoned by design because emitting a warning for script tags in real-world HTML would be overwhelming noise.

Deferred tags emit raw HTML and a diagnostic with a source path:


result = buildMarkdown do parseHtmlText "<custom><img src=\"x\"/></custom>"
diagnostics result

[ Diagnostic PolicyStage DeferredToRawHtml   "img"    (SourceRef "img"    [0, 0])
, Diagnostic PolicyStage UnknownTagUnwrapped "custom" (SourceRef "custom" [0])
]

The SourceRef carries the tag name and the index path through the tree, so callers know exactly where in the input each policy event occurred. The path [0, 0] means the first child of the first root node.
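To show how such an index path can be read back, here is a small sketch that walks a path through a tree and returns the tag name it lands on. The HtmlNode type and the tagAt function are hypothetical; the library's own tree type is not shown in this post.

```unison
-- Hypothetical tree type: a tag name plus child nodes.
type HtmlNode = HtmlNode Text [HtmlNode]

-- Follow an index path from the root list down to a node and return its tag.
tagAt : [Nat] -> [HtmlNode] -> Optional Text
tagAt path nodes = match path with
  [] -> None
  [i] -> match List.at i nodes with
    Some (HtmlNode tag _) -> Some tag
    None -> None
  i +: rest -> match List.at i nodes with
    Some (HtmlNode _ children) -> tagAt rest children
    None -> None

-- Mirrors the example above: <custom><img/></custom>
example : [HtmlNode]
example = [HtmlNode "custom" [HtmlNode "img" []]]
```

Here tagAt [0] example lands on "custom" and tagAt [0, 0] example lands on "img", matching the two SourceRef paths in the diagnostics.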

Hard limit produces a typed fatal failure:


limits = maxInputTokens.set 1 MarkdownLimits.defaults
config = MarkdownLimits.set limits MarkdownConfig.defaults
result = buildMarkdownWith config do parseHtmlText "<p>a</p><p>b</p>"
(ConversionResult.rendered result, ConversionResult.failure result)

( ""
, Some (Failure InputTokenLimitExceeded 1 6 "input token count exceeded maxInputTokens")
)

rendered is empty. failure tells you exactly which limit was breached, what the configured limit was, and what the actual count was.

Policy toggle changes behavior without touching conversion logic:


strict = MarkdownPolicy.allowDeferredRawHtml.set false MarkdownPolicy.defaults
config = MarkdownPolicy.set strict MarkdownConfig.defaults
result = buildMarkdownWith config do parseHtmlText "<p>text</p><img src=\"x\"/>"
ConversionResult.rendered result

"text\n\n"

With allowDeferredRawHtml = false, the <img> is suppressed entirely. The diagnostic still fires.


All changes for this release are visible via this PR - #2

This is the next building block for my ultimate intent: an AT Protocol-based RSS reader. More to come, happy hacking 🤘


If you have read this far, thanks! I am available for contract, temporary, freelance, and advisory roles where I can work closely with stakeholders or as an IC to solve hard product and engineering problems quickly. I am open to working in a variety of languages - Haskell, Rust, TypeScript/JavaScript, and, of course, Unison. I am fun to work with and have almost 20 years of experience, with expertise in functional programming, distributed systems, cloud-native architectures, and agentic workflows.