Swizec Teller - a geek with a hatswizec.com

Senior Mindset Book

Get promoted, earn a bigger salary, work for top companies

Senior Engineer Mindset cover
Learn more

    How to debug unified, rehype, or remark and fix bugs in markdown processing

    Ever had to debug a compiler? It's hard.

    Typical engineering wisdom goes that if you blame the compiler, you're wrong. You messed up and can't figure it out and it is not the compiler's fault.

    But sometimes it is.

    I'm working on a script to convert Wordpress exports to Markdown and after months of tinkering, noticed a terrible bug. JSX code samples don't work.

    You take this bit of Wordpress source code:

    <pre lang="javascript">
    class ReportSize extends React.Component {
      refCallback = element => {
        if (element) {
          this.props.getSize(element.getBoundingClientRect());
        }
      };
    
      render() {
        return (
          <div ref={this.refCallback} style={{ border: "1px solid red" }}>
            {faker.lorem.paragraphs(Math.random() * 10)}
          </div>
        );
      }
    }
    </pre>
    

    And you get this output:

    class ReportSize extends React.Component {
      refCallback = element => {
        if (element) {
          this.props.getSize(element.getBoundingClientRect());
        }
      };
    
      render() {
        return (
    
            {faker.lorem.paragraphs(Math.random() * 10)}
    
        );
      }
    }
    

    But you're supposed to get this:

    ``javascript
    class ReportSize extends React.Component {
      refCallback = element => {
        if (element) {
          this.props.getSize(element.getBoundingClientRect());
        }
      };
    
      render() {
        return (
    
            {faker.lorem.paragraphs(Math.random() * 10)}
    
        );
      }
    }
    ``
    

    The compiler that translates plain Wordpress HTML to Markdown makes a mistake. Sees HTML inside a code sample, parses it as HTML, then silently drops it.

    🙃

    How would you fix this?

    I verified the input is all correct until it goes into the html-to-markdown compiler. That's always a good first step. Garbage in, garbage out

    How compilers work

    EDCVI2Q

    Compilers work in 3 stages.

    1. The front-end parses text inputs into an Abstract Syntax Tree – AST
    2. The middle-end manipulates the AST to add optimizations and other changes
    3. The back-end turns your AST back to text

    That's the general idea. You can read about compilers in more detail in my LOLCODE-to-JavaScript compiler article.

    Modern tools for Markdown processing

    Your best bet to play with markdown right now is the Unified ecosystem. It's what Gatsby, Next, Prettier, and many others use.

    Unified is a system of plugins with plugins that let you parse, process, and output ASTs. Easy to get started and confusing to dig deep. Hard to find the code that does anything.

    Every plugin calls another plugin. 😅

    General lay of the land is that:

    • unified is the protocol that holds it all together
    • rehype is the plugin ecosystem for HTML parsing/rendering
    • remark is the plugin ecosystem for Markdown parsing/rendering

    A basic example that parses HTML and spits out Markdown goes like this:

    import parse from "rehype-parse"
    import rehype2remark from "rehype-remark"
    import stringify from "remark-stringify"
    
    unified() // initiate the pipeline
      .use(parse) // parse HTML to AST
      .use(rehype2remark) // translate Rehype AST to Remark AST
      .use(stringify) // stringify AST to markdown
      .process(inputHTML, (err, markdown) => {
        console.log(markdown)
      })
    

    How you debug a compiler

    Debugging a compiler is where life gets tricky. A whole processing pipeline, oof.

    You're trying to find out which step of the processing pipeline makes a mistake. And whether it's a mistake at all. You could still be the one holding it wrong.

    Works for everyone else. Chances are you're holding it wrong. But how do you figure out how you're holding it wrong?

    You can watch me work through the issue here:

    I go through the process methodically injecting console.logs into the pipeline to see how nodes change shape. Trying to figure out where it goes wrong.

    In the end I decided it wasn't a bug. Or if it was, it's too deep for me to fix.

    A debugging kludge

    First you create a small example. As small as possible.

    Then you build a debugging kludge – a tiny plugin that helps you hook into the compilation process.

    It's a kludge because it's temporary, dirty, and for your eyes only. Like slapping a console.log into your code to see what's going on.

    Your pipeline has two sides, if you're translating HTML to Markdown:

    1. The Rehype side
    2. The Remark side

    These require different kludges because the AST structure is different. Tag naming is different, properties are different, etc. Both are tree structures.

    Debug a Rehype AST

    To see how Rehype sees your code blocks, you use a kludge like this:

    function debugCodeBlocks() {
      function findCodeBlocks(node) {
        let nodes = []
    
        if (node.tagName === "pre") {
          nodes.push(node) // <pre> nodes are code blocks
        } else if (node.children) {
          // recursively walk through child nodes
          for (let child of node.children) {
            nodes.push(...findCodeBlocks(child))
          }
        }
    
        return nodes
      }
    
      return (tree) => {
        const codeBlocks = findCodeBlocks(tree)
    
        // deep print the codeBlocks array
        console.log(require("util").inspect(tree, false, null, true))
    
        return tree
      }
    }
    

    Look at the output and you see what Rehype sees. Every property, node structure, and value.

    Change the if to observe different nodes. I don't recommend printing the whole AST. Too much.

    Debug a Remark AST

    On the other side of the rehype2remark divide, you have a Markdown AST. Same basic structure, different naming conventions.

    Remark is easier to debug because it's well-supported by the unist ecosystem of helper plugins. 🎉

    To print code blocks on the Remark side:

    import visit from "unist-util-visit"
    
    function codeBlockDebugger() {
      return (markdownAST) => {
        visit(markdownAST, "code", (node) => {
          console.log(require("util").inspect(node, false, null, true))
        })
      }
    }
    

    The visit helper handles recursion for you.

    How to fix a compiler pipeline issue

    There's 3 ways to fix a compiler pipeline issue:

    1. On the parsing side
    2. In the middle
    3. On the stringify side

    Fixing parsers by hand is hard. Writing, too. Most parsers are auto-generated from grammar definition files.

    Rehype parses my JSX-containing code blocks as if they were HTML. I thought they should be strings.

    After digging, I discovered that Rehype uses Parse5 – an HTML5 compliant parser used in the entire Node ecosystem. With 75,000,000 downloads per month ... yeah I'm the one who's wrong.

    im_wrong giphy

    Fixing stringification is boring. You'll dig and dig and do nothing fun for days. Shit's boring, we did it for my compilers class in college.

    Plus you've got a battle-tested stringifier right there. You need to give it the right input tree 😉

    MacGyver a fix in the middle

    macgyver giphy

    Compiler middleware is the best place to add your fix.

    Take the AST, change it so stringification does what you want. Magic.

    Here's what I did for my example of JSX-containing code blocks:

    function fixCodeBlocks() {
      const settings = {
        quoteSmart: false,
        closeSelfClosing: false,
        omitOptionalTags: false,
        entities: { useShortestReferences: true },
      }
    
      return (tree) => {
        // same recursion findCodeBlocks as above
        const codeBlocks = findCodeBlocks(tree)
    
        for (let block of codeBlocks) {
          // copy position info for the whole <pre> so we can fake it later
          const position = {
            start: block.children[0].position.start,
            end: block.children[block.children.length - 1].position.end,
          }
    
          // replace children with my fakes
          block.children = []
        }
    
        return tree
      }
    }
    

    Find all the code blocks, replace their children with a fake new child.

    // add a fake <code> block
    // helps preserve language property
    {
        type: "element",
        tagName: "code",
        properties: {
            className: [
                `language-${
                    block.properties && block.properties.lang
                }`,
            ],
        },
        children: [
            // stringify the full <pre> block
            // as a text child of the new <code> block
            {
                type: "text",
                // hast-util-to-html, Rehype's stringifier
                value: toHTML(block, settings)
                    .replace("</pre>", "")
                    .replace(/\<pre.*>/, "")
                    // this fixes a bug caused by a dirty fix somewhere else :P
                    .replace(/\<p\>\<\/p\>/g, "\n\n"),
                position,
            },
        ],
    }
    

    Make the child a <code> block because Remark likes that better and stringify the whole <pre> tag into a single text node.

    You get:

    class ReportSize extends React.Component {
      refCallback = element => {
        if (element) {
          this.props.getSize(element.getBoundingClientRect());
        }
      };
    
      render() {
        return (
          <div ref="{this.refCallback}" style="{{" border:="" &#34;1px="" solid="" red&#34;="" }}="">
            {faker.lorem.paragraphs(Math.random() * 10)}
          </div>
        );
      }
    }
    

    Not perfect, but better. Need to fix the child html nodes or find a JSX stringifier 🤔

    Cheers,
    ~Swizec

    Published on August 17th, 2020 in Computer Science

    Did you enjoy this article?

    Continue reading about How to debug unified, rehype, or remark and fix bugs in markdown processing

    Semantically similar articles hand-picked by GPT-4

    Senior Mindset Book

    Get promoted, earn a bigger salary, work for top companies

    Learn more

    Have a burning question that you think I can answer? Hit me up on twitter and I'll do my best.

    Who am I and who do I help? I'm Swizec Teller and I turn coders into engineers with "Raw and honest from the heart!" writing. No bullshit. Real insights into the career and skills of a modern software engineer.

    Want to become a true senior engineer? Take ownership, have autonomy, and be a force multiplier on your team. The Senior Engineer Mindset ebook can help 👉 swizec.com/senior-mindset. These are the shifts in mindset that unlocked my career.

    Curious about Serverless and the modern backend? Check out Serverless Handbook, for frontend engineers 👉 ServerlessHandbook.dev

    Want to Stop copy pasting D3 examples and create data visualizations of your own? Learn how to build scalable dataviz React components your whole team can understand with React for Data Visualization

    Want to get my best emails on JavaScript, React, Serverless, Fullstack Web, or Indie Hacking? Check out swizec.com/collections

    Did someone amazing share this letter with you? Wonderful! You can sign up for my weekly letters for software engineers on their path to greatness, here: swizec.com/blog

    Want to brush up on your modern JavaScript syntax? Check out my interactive cheatsheet: es6cheatsheet.com

    By the way, just in case no one has told you it yet today: I love and appreciate you for who you are ❤️

    Created by Swizec with ❤️