This is a technical followup to Lessons from migrating a 14 year old blog with 1500 posts to Gatsby.
Migrating from Wordpress to Markdown sounds easy. Mention it to any developer and they'll say "Pfft, an afternoon of work at worst"
- take a Wordpress export from admin tools: 10min
- find a script that converts to markdown: 20min
- sip margaritas while the script runs: 1h
Suddenly it's 6 months later and you're losing your mind.
When I started migrating in September 2019, there were no good scripts. Best I could find was somebody's 7 year old college project: wordpress-to-markdown.
Complete with bugs, old JavaScript, and gnarly edge cases on my humongous site. Your site accumulates lots of cruft in 14 years.
A script that converts Wordpress dumps into clean Markdown may have been the dumbest project I ever took on. Sooooo many edge cases pic.twitter.com/z8dPUMrBGk
Swizec Teller (@Swizec), August 25, 2020
Nowadays you have wordpress-export-to-markdown from @lonekorean. Works better and is easier to use.
But the output isn't what I wanted. Great for simple cases, doesn't deal with all the edge cases on a large technical site.
The challenge
The core challenge is two-fold: The easy part and the hard part.
The easy part is that Wordpress outputs valid XML that you have to parse. Tons of good libraries for this, problem solved.
You also want to download images.
Wordpress likes to use linked <img> tags. Sometimes 3rd party, sometimes part of the site.
Gatsby, NextJS, Hugo, and other Markdown-based site generators prefer that you keep images part of your source code. Lets you do fun transformations, hosting via CDN, and ensures you don't lose your images.
I lost many images to link rot.
The hard part is that Wordpress HTML is not valid HTML.
And that's where the fun begins.
The basic setup
You're looking for a script that builds a processing pipeline:
- Parse XML
- Iterate through posts
- Create a /out/<slug> directory
- Create a /out/<slug>/img directory for images
- Extract metadata into Markdown frontmatter
- Download images into the /img directory
- Hacks to convert Wordpress HTML into almost valid HTML
- Parse said HTML into a Markdown Abstract Syntax Tree (AST)
- Fix edge cases in your AST
- Output clean Markdown with frontmatter into /out/<slug>/index.md
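The directory steps in this list could look something like the sketch below. It only assumes Node's built-in fs and path modules. The `preparePostDirs` name and the `baseDir` parameter are made up for illustration and are not taken from the script.

```javascript
const fs = require("fs")
const path = require("path")

// Hypothetical helper: creates <baseDir>/<slug>/img and returns the paths
// that the later pipeline steps write into.
function preparePostDirs(baseDir, slug) {
  const postDir = path.join(baseDir, slug)
  const imgDir = path.join(postDir, "img")

  // recursive: true also creates baseDir and postDir, and doesn't throw
  // when the directories already exist
  fs.mkdirSync(imgDir, { recursive: true })

  return { postDir, imgDir, markdownFile: path.join(postDir, "index.md") }
}
```

Calling `preparePostDirs("out", "always-put-side-effects-last")` would leave you with out/always-put-side-effects-last/img ready for downloads, and it's safe to call twice for the same slug.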
You can see this setup in my wordpress-to-markdown script.
const fs = require("fs")
const xml2js = require("xml2js")

function processExport(file) {
  const parser = new xml2js.Parser()
  fs.readFile(file, function (err, data) {
    if (err) {
      return console.log("Error: " + err)
    }
    parser.parseString(data, function (err, result) {
      if (err) {
        return console.log("Error parsing xml: " + err)
      }
      console.log("Parsed XML")
      const posts = result.rss.channel[0].item
      fs.mkdir("out", function () {
        // keep only real posts, skip pages and attachments
        posts
          .filter((p) => p["wp:post_type"][0] === "post")
          .forEach(processPost)
      })
    })
  })
}
Parses the XML, iterates through <item> entries, and runs processPost on each.
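All the `[0]` indexing makes more sense once you see the shape xml2js hands back: with its default settings, child elements come back as arrays, even when there's only one. Here's a hand-written object in that shape (the sample values are made up) run through the same filter.

```javascript
// Hand-written stand-in for the result of parser.parseString().
// Assumption: xml2js default options, where child elements are arrays.
const result = {
  rss: {
    channel: [
      {
        item: [
          { title: ["Hello world"], "wp:post_type": ["post"] },
          { title: ["About me"], "wp:post_type": ["page"] },
          { title: ["photo.jpg"], "wp:post_type": ["attachment"] },
        ],
      },
    ],
  },
}

// Same filter as in processExport: pages and attachments are dropped
const posts = result.rss.channel[0].item.filter(
  (p) => p["wp:post_type"][0] === "post"
)

console.log(posts.map((p) => p.title[0])) // [ 'Hello world' ]
```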
Metadata into Frontmatter
I wanted to create full frontmatter without manual edits. That means:
---
title: 'Always put side effects last'
description: ""
published: 2018-01-10
redirect_from:
- /blog/always-put-side-effects-last/swizec/8057
categories: "Startups, Technical"
hero: ./img/wp-content-uploads-2016-10-salesforce-tower-panorama-1024x358.jpg
---
Title from post title, description based on meta data, a publish date, keep old URL for redirects, combine categories and tags into categories, find a good hero/social image.
Data comes from digging around Wordpress exports and figuring out what fits.
const postTitle = typeof post.title === "string" ? post.title : post.title[0]
console.log("Post title: " + postTitle)

// fall back to wp:post_date when pubDate is missing or unparseable
const postDate = isFinite(new Date(post.pubDate))
  ? new Date(post.pubDate)
  : new Date(post["wp:post_date"])
console.log("Post Date: " + postDate)

let postData = post["content:encoded"][0]
console.log("Post length: " + postData.length + " bytes")

const slug = slugify(postTitle, {
  remove: /[^\w\s]/g,
})
  .toLowerCase()
  .replace(/\*/g, "")
console.log("Post slug: " + slug)

// takes the longest description candidate
const description =
  [
    ...[].concat(post.description || []),
    ...(post["wp:postmeta"] || [])
      .filter(
        (meta) =>
          meta["wp:meta_key"][0].includes("metadesc") ||
          meta["wp:meta_key"][0].includes("description")
      )
      // compare the meta values (strings), not the meta objects
      .map((meta) => meta["wp:meta_value"][0]),
  ]
    .filter((candidate) => typeof candidate === "string")
    .sort((a, b) => b.length - a.length)[0] || ""

// Merge categories and tags into categories
const categories = post.category && post.category.map((cat) => cat["_"])
Despite this, you'll notice lots of empty descriptions. Folks get lazy and don't write custom descriptions because Wordpress can guess from the article. I know I did.
Should I add that guess-work or write 1500 descriptions by hand?
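If you go the guess-work route, a rough sketch could strip tags from the post body and keep the first sentence or so. The `guessDescription` helper and the 155-character cutoff are assumptions for illustration, not part of the script, and regex tag stripping is only good enough for a summary.

```javascript
// Hypothetical fallback for empty descriptions: plain text from the start
// of the article, cut at a word boundary.
function guessDescription(html, maxLength = 155) {
  const text = html
    .replace(/\[\/?\w+[^\]]*\]/g, " ") // drop [shortcodes]
    .replace(/<[^>]+>/g, " ") // drop HTML tags
    .replace(/\s+/g, " ") // collapse whitespace
    .trim()

  if (text.length <= maxLength) {
    return text
  }

  // cut at the last space before the limit so words stay whole
  const cut = text.slice(0, maxLength)
  const lastSpace = cut.lastIndexOf(" ")
  return (lastSpace > 0 ? cut.slice(0, lastSpace) : cut) + "..."
}

console.log(guessDescription("<p>Hello <b>world</b></p>")) // Hello world
```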
Finding the hero image is a matter of processing all images in your article and picking the first one that isn't a GIF.
Initial candidates come from your meta data.
const heroURLs = post["wp:postmeta"]
  .filter(
    (meta) =>
      meta["wp:meta_key"][0].includes("opengraph-image") ||
      meta["wp:meta_key"][0].includes("twitter-image")
  )
  .map((meta) => meta["wp:meta_value"][0])
  .filter((url) => url.startsWith("http"))
The rest come from your article body.
let images = []
if (heroURLs.length > 0) {
  const url = heroURLs[0]
  ;[postData, images] = await processImage({
    url,
    postData,
    images,
    directory,
  })
}

// downloads images, changes each URL in article
;[postData, images] = await processImages({ postData, directory })

// finds first non-gif image
heroImage = images.find((img) => !img.endsWith("gif"))
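processImage and processImages aren't shown in this post. As a rough sketch of the URL rewriting half (no downloading), here's one way to collect <img> sources and point them at local files. The `localImageName` and `rewriteImageUrls` helpers are hypothetical, and the file naming only mimics the hero path in the frontmatter example above.

```javascript
// Hypothetical: turn an image URL into a flat local file name, e.g.
// https://example.com/wp-content/uploads/2016/10/a.jpg
//   -> wp-content-uploads-2016-10-a.jpg
function localImageName(url) {
  return new URL(url).pathname
    .replace(/^\/+/, "") // no leading slash
    .replace(/\//g, "-") // flatten directories
}

// Hypothetical: rewrite every absolute <img src> in the post body to
// ./img/<name> and return the new body plus the list of local paths.
function rewriteImageUrls(postData) {
  const images = []
  const rewritten = postData.replace(
    /(<img[^>]*\ssrc=")(https?:\/\/[^"]+)(")/g,
    (match, before, url, after) => {
      const local = `./img/${localImageName(url)}`
      images.push(local)
      return before + local + after
    }
  )
  return [rewritten, images]
}

const [body, images] = rewriteImageUrls(
  '<img class="x" src="https://example.com/wp-content/uploads/2016/10/a.gif" />' +
    '<img src="https://example.com/wp-content/uploads/2016/10/b.jpg" />'
)

// same hero rule as above: first image that isn't a GIF
const hero = images.find((img) => !img.endsWith("gif"))
console.log(hero) // ./img/wp-content-uploads-2016-10-b.jpg
```

A real version still has to fetch each URL and write it to the img directory, and deal with images that 404.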
From all this metadata, frontmatter comes together with a bit of string concatenation.
let frontmatter
try {
  frontmatter = [
    "---",
    `title: '${postTitle.replace(/'/g, "''")}'`,
    `description: "${description}"`,
    `published: ${format(postDate, "yyyy-MM-dd")}`,
    `redirect_from:\n- ${redirect_from}`,
  ]
} catch (e) {
  console.log("----------- BAD TIME", postTitle, postDate)
  throw e
}

if (categories && categories.length > 0) {
  frontmatter.push(`categories: "${categories.join(", ")}"`)
}

frontmatter.push(`hero: ${heroImage || "../../../defaultHero.jpg"}`)
frontmatter.push("---")
frontmatter.push("")
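One thing the string concatenation has to watch for is quoting. The title line doubles single quotes, which is how YAML escapes them inside single-quoted strings. A description wrapped in double quotes needs the same care for double quotes and backslashes. A minimal sketch, with `yamlSingleQuoted` and `yamlDoubleQuoted` as made-up helper names:

```javascript
// Hypothetical helpers for building frontmatter values by hand.

// YAML single-quoted strings escape ' by doubling it.
function yamlSingleQuoted(value) {
  return `'${value.replace(/'/g, "''")}'`
}

// YAML double-quoted strings use backslash escapes, so escape \ first, then ".
function yamlDoubleQuoted(value) {
  return `"${value.replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`
}

const frontmatter = [
  "---",
  `title: ${yamlSingleQuoted("Don't panic")}`,
  `description: ${yamlDoubleQuoted('A "quoted" word')}`,
  "---",
  "",
].join("\n")

console.log(frontmatter)
```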
Okay that's the easy part.
Converting to Markdown and fixing edge cases
Converting Wordpress's invalid HTML to Markdown is the fun part. Edge cases make it even better.
You can choose 2 paths here:
- turndown, which is a solid HTML to Markdown converter that I didn't know about when doing this
- UnifiedJS, which is a suite of tools for manipulating ASTs used by a lot of popular libraries
I went with Unified.
Core setup looks like a pipeline of plugins. You start with an input string, parse it as HTML, turn it into Markdown, output as text.
const markdown = await new Promise((resolve, reject) => {
  unified()
    .use(parseHTML, {
      fragment: true,
      emitParseErrors: true,
      duplicateAttribute: false,
    })
    .use(fixCodeBlocks) // edge case
    .use(fixEmbeds) // edge case
    .use(rehype2remark)
    .use(cleanupShortcodes) // edge-ish case
    .use(stringify, {
      fences: true,
      listItemIndent: 1,
      gfm: false,
      pedantic: false,
    })
    .process(fixBadHTML(postData), (err, markdown) => {
      if (err) {
        reject(err)
      } else {
        let content = markdown.contents
        // edge case
        content = content.replace(/(?<=https?:\/\/.*)\\_(?=.*\n)/g, "_")
        // prettify
        resolve(prettier.format(content, { parser: "mdx" }))
      }
    })
})
Edge case 1: Make your HTML parseable
Wordpress HTML is pretty good. Plop it in an HTML parser and, like, it won't choke ... but it won't parse correctly either.
You'll need to change double newlines to paragraph breaks. Wordpress doesn't wrap paragraphs in <p></p> tags.
function fixBadHTML(html) {
  html = html.replace(/(\r?\n){2}/g, "<p></p>")
  return html
}
Yep, Regex for HTML fixing. Find double newlines, replace with empty paragraphs.
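To see what the regex does, here's the same function run on a few small made-up inputs. Note that `{2}` matches exactly two line breaks at a time, so a run of three leaves one newline behind.

```javascript
function fixBadHTML(html) {
  html = html.replace(/(\r?\n){2}/g, "<p></p>")
  return html
}

console.log(fixBadHTML("First paragraph\n\nSecond paragraph"))
// First paragraph<p></p>Second paragraph

console.log(JSON.stringify(fixBadHTML("Windows\r\n\r\nline endings")))
// "Windows<p></p>line endings"

console.log(JSON.stringify(fixBadHTML("one\n\n\ntwo")))
// "one<p></p>\ntwo"
```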
Edge case 2: Bad code blocks
I wrote about fixing bad code blocks in my You though computer science has no place in webdev? Here's a fun coding challenge article.
Your challenge is that this isn't valid HTML:
<pre lang="javascript">
class ReportSize extends React.Component {
  refCallback = element => {
    if (element) {
      this.props.getSize(element.getBoundingClientRect());
    }
  };
  render() {
    return (
      <div ref={this.refCallback} style={{ border: "1px solid red" }}>
        {faker.lorem.paragraphs(Math.random() * 10)}
      </div>
    );
  }
}
</pre>
JSX tags get parsed as HTML and break your code block. You also want each <pre> block to wrap its contents in a <code></code> tag. Otherwise Markdown stringifying doesn't work right.
Fixing this is tricky and I won't share the full code here. You can see it in articleCleanup.js line 77. All 139 lines of it.
The process goes like this:
- Find code blocks
- Grab language definition
- Replace children with a <code> element
- Fix JSX object props in child nodes
- Stringify block into HTML
- Clean HTML with gnarly regex buffoonery
- Run result through Prettier
for (let block of codeBlocks) {
  const lang = block.properties && block.properties.lang

  block.children = [
    {
      type: "element",
      tagName: "code",
      properties: {
        className: lang ? [`language-${lang}`] : null,
      },
      children: [
        {
          type: "text",
          value: cleanBlockHTML(
            toHTML(fixJsxObjectProps(block), settings),
            block.properties && block.properties.lang
          ),
        },
      ],
    },
  ]
}
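For comparison, here's a different and much cruder technique than the one in articleCleanup.js: escape everything inside <pre> with a regex before the HTML parser ever sees it, so JSX can't be mistaken for markup. The `escapePreBlocks` name is made up, and the sketch assumes the code in your export isn't already entity-escaped and never contains a literal closing pre tag.

```javascript
// NOT the approach from articleCleanup.js: a simpler, hypothetical pre-pass
// that escapes code inside <pre> before the HTML parser sees it.
function escapePreBlocks(html) {
  return html.replace(
    /<pre([^>]*)>([\s\S]*?)<\/pre>/g,
    (match, attrs, code) => {
      const escaped = code
        .replace(/&/g, "&amp;") // must run first
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
      return `<pre${attrs}><code>${escaped}</code></pre>`
    }
  )
}

const input = '<pre lang="javascript"><div ref={this.refCallback}>hi</div></pre>'
console.log(escapePreBlocks(input))
// <pre lang="javascript"><code>&lt;div ref={this.refCallback}&gt;hi&lt;/div&gt;</code></pre>
```

The trade-off is that the lang attribute stays on the <pre> tag, so you'd still need a step that turns it into a language class on the <code> element.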
Edge case 3: Fixing embeds
Lots of ways to embed 3rd party content on a Wordpress site. You can use plain old links pasted on their own line, shortcodes, and full HTML embeds.
Markdown site generators like to use plain links.
You want to change code like:
<blockquote class="twitter-tweet">
<p lang="en" dir="ltr">
A script that converts Wordpress dumps into clean Markdown may have been the
dumbest project I ever took on. Sooooo many edge cases
<a href="https://t.co/z8dPUMrBGk">pic.twitter.com/z8dPUMrBGk</a>
</p>
— Swizec Teller (@Swizec)
<a
href="https://twitter.com/Swizec/status/1298308910072307713?ref_src=twsrc%5Etfw"
>August 25, 2020</a
>
</blockquote>
<script
async
src="https://platform.twitter.com/widgets.js"
charset="utf-8"
></script>
Into Markdown that's a link:
https://twitter.com/Swizec/status/1298308910072307713
Site generator can take this and turn it into an embed. When it starts as a blockquote, you'll have trouble.
Another 106 lines of code that I won't share here.
Basic idea is that:
- You find all blockquote nodes
- All iframe nodes
- All paragraph nodes
- Filter for potential embeds
- Fix the AST for each embed you want to support
Taking Twitter as an example, you get this:
function fixEmbeds() {
  function isTweet(blockquote) {
    return (
      blockquote.properties &&
      blockquote.properties.className &&
      blockquote.properties.className.includes("twitter-tweet")
    )
  }

  return (tree) => {
    const blockquotes = findRehypeNodes(tree, "blockquote")
    for (let blockquote of blockquotes) {
      if (isTweet(blockquote)) {
        // the last link in a tweet embed points at the tweet itself
        const link = findRehypeNodes(blockquote, "a").pop()
        // skip malformed embeds that have no link
        if (link) {
          blockquote.type = "element"
          blockquote.tagName = "p"
          blockquote.children = [{ type: "text", value: link.properties.href }]
        }
      }
    }
    return tree
  }
}
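findRehypeNodes isn't shown above. A plausible stand-in is a recursive search by tagName over the HTML tree. The version below is a guess at its behavior, not the script's implementation. The sketch also strips the ?ref_src query string, which the plugin above leaves on the link, so the output matches the plain URL shown earlier.

```javascript
// Hypothetical stand-in for findRehypeNodes: collect all descendant
// elements with the given tagName, in document order.
function findNodes(node, tagName) {
  const found = []
  for (const child of node.children || []) {
    if (child.type === "element" && child.tagName === tagName) {
      found.push(child)
    }
    found.push(...findNodes(child, tagName))
  }
  return found
}

// Drop tracking parameters like ?ref_src=twsrc%5Etfw
function cleanTweetUrl(href) {
  const url = new URL(href)
  return url.origin + url.pathname
}

// Hand-built tree, assuming the element/properties/children shape that
// rehype's HTML parser produces
const tree = {
  type: "root",
  children: [
    {
      type: "element",
      tagName: "blockquote",
      properties: { className: ["twitter-tweet"] },
      children: [
        {
          type: "element",
          tagName: "p",
          properties: {},
          children: [
            {
              type: "element",
              tagName: "a",
              properties: { href: "https://t.co/z8dPUMrBGk" },
              children: [{ type: "text", value: "pic.twitter.com/z8dPUMrBGk" }],
            },
          ],
        },
        {
          type: "element",
          tagName: "a",
          properties: {
            href: "https://twitter.com/Swizec/status/1298308910072307713?ref_src=twsrc%5Etfw",
          },
          children: [{ type: "text", value: "August 25, 2020" }],
        },
      ],
    },
  ],
}

for (const blockquote of findNodes(tree, "blockquote")) {
  const link = findNodes(blockquote, "a").pop() // last link is the tweet itself
  if (link) {
    blockquote.tagName = "p"
    blockquote.children = [
      { type: "text", value: cleanTweetUrl(link.properties.href) },
    ]
  }
}

console.log(tree.children[0].children[0].value)
// https://twitter.com/Swizec/status/1298308910072307713
```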
Edge case 4: Fixing shortcodes
Shortcodes are a semi-standard system of snippets. Denoted by [] they give CMS users the ability to go beyond writing text.
These were popular on internet forums of the late 2000s. Wordpress supports them to this day. Don't know about others.
I wanted to get rid of most and preserve any embeds.
You can identify an embed because it's a closed shortcode prefixed with the name of a service followed by a link.
[tweet https://twitter.com/Swizec/status/1298308910072307713]
The gnarly ones are Wordpress's almost-html shortcodes. Big issue on my site were the [caption][/caption] shortcodes.
[caption id="" align="alignnone" width="560"]<img
class=" "
title="Spirograph"
src="http://upload.wikimedia.org/wikipedia/commons/thumb/9/97/Spirograph3.jpg/800px-Spirograph3.jpg"
alt="Spirograph"
width="560"
height="420"
/>
<a
class="zem_slink"
title="Spirograph"
href="http://en.wikipedia.org/wiki/Spirograph"
rel="wikipedia"
target="_blank"
>Spirograph</a
>[/caption]
It's a shortcode tag with an image and a link. You want to get a clean Markdown image out of this.
You fix this mess by:
- Finding all paragraphs
- Seeing if they contain shortcodes
- Cleaning it up with Regex
Core structure is an AST traversal with a loop over candidates:
function cleanupShortcodes() {
  const shortCodeOpenTag = /\[\w+ .*\]/g
  const shortCodeCloseTag = /\[\/\w+]/g
  const embedShortCode = /\[\w+ (https?:\/\/.*)\]/g
  const captionShortCode = /\[caption.*\]/g

  return (tree) => {
    visit(tree, "text", (node, index, parent) => {
      if (parent.type === "paragraph" && node.value) {
        // clean it up
      }
    })
  }
}
Inside the loop you then:
- Turn embed shortcodes into plain URLs with regex
// preserve embed shortcodes as plain URLs
if (node.value.match(embedShortCode)) {
  node.value = node.value.replace(embedShortCode, "$1")
}
- Turn [caption] shortcodes into image nodes
// turn [caption] shortcodes into clean images
if (node.value.match(captionShortCode)) {
  visit(parent, "text", (node) => {
    node.value = ""
  })
  visit(parent, "link", (node) => {
    node.type = "image"
    node.title = node.children[0].title
    node.alt = node.children[0].alt
    node.url = node.children[0].url
    node.children = []
  })
}
This turns each link node inside the parent paragraph into an image, copying title, alt, and url from the link's first child, and blanks out all of the paragraph's text nodes.
- Remove other shortcodes
// remove other shortcodes
node.value = node.value
  .replace(shortCodeOpenTag, "")
  .replace(shortCodeCloseTag, "")
I couldn't find a use for them.
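Here are the same three regexes applied to plain strings, outside of any AST, so you can see what each step does. The `cleanText` wrapper and the sample shortcodes are made up for the demo.

```javascript
const shortCodeOpenTag = /\[\w+ .*\]/g
const shortCodeCloseTag = /\[\/\w+]/g
const embedShortCode = /\[\w+ (https?:\/\/.*)\]/g

// Same order as in the plugin: keep embeds as URLs, then drop the rest.
function cleanText(value) {
  return value
    .replace(embedShortCode, "$1")
    .replace(shortCodeOpenTag, "")
    .replace(shortCodeCloseTag, "")
}

console.log(
  cleanText("[tweet https://twitter.com/Swizec/status/1298308910072307713]")
)
// https://twitter.com/Swizec/status/1298308910072307713

console.log(JSON.stringify(cleanText('[gallery ids="1,2,3"]')))
// ""

console.log(JSON.stringify(cleanText("[box type=info]Hello[/box]")))
// ""
```

The last case is worth knowing about: `.*` is greedy, so an opening and closing shortcode in the same text node get removed together with the text between them. If you'd rather keep the inner text, swapping `.*` for `[^\]]*` in the open-tag regex is one option.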
Edge case 5: Underscores in links
This one was frustrating. Embed links can include underscores, like when you embed a tweet from @_developit.
Markdown stringification escapes underscores because it thinks they're emphasis. It doesn't know that some text nodes are really links, even though they aren't link nodes.
https://twitter.com/\_developit/status/1300154097170083842
That breaks your embed machinery.
You can fix it with a dirty regex hack:
let content = markdown.contents
content = content.replace(/(?<=https?:\/\/.*)\\_(?=.*\n)/g, "_")
The lookbehind with (?<=) ensures you don't touch escaped underscores anywhere other than links.[^1]
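A quick way to convince yourself is to run the regex over a made-up two-line input, one line with an escaped URL and one with ordinary escaped text. The `unescapeUrlUnderscores` name is just for this demo.

```javascript
const unescapeUrlUnderscores = (content) =>
  content.replace(/(?<=https?:\/\/.*)\\_(?=.*\n)/g, "_")

const input = [
  "https://twitter.com/\\_developit/status/1300154097170083842",
  "Some \\_escaped\\_ text that should stay escaped",
  "",
].join("\n")

console.log(unescapeUrlUnderscores(input))
// https://twitter.com/_developit/status/1300154097170083842
// Some \_escaped\_ text that should stay escaped
```

Two limits of the hack: the lookahead needs a newline somewhere after the match, so a URL on the very last line without a trailing newline is left alone. And escaped underscores in regular text that follows a URL on the same line get unescaped too.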
The solution
You can use my script: github/Swizec/wordpress-to-markdown
$ git clone https://github.com/Swizec/wordpress-to-markdown
# download your wordpress xml
# change filename on convert.js line 27
$ yarn
$ yarn convert
# sip margaritas
Deals with every edge case described above, produces clean markdown output. Even runs it through Prettier.
PRs welcome.
Cheers,
~Swizec
[^1] Lookbehind and lookahead are extensions beyond textbook regular expressions, but on their own they don't add power: what they match is still a regular language. Backreferences are what take regexes past that.