The Internet talks a lot about article extraction - taking a page and deciding what the real content is. Hell, I've written about the Uncanny valley of web scraping myself.
Article extraction is such a widespread problem that a bunch of APIs exist to help you solve it. Everything from a fringe feature in five or ten semantic APIs, to startups devoted wholly to extracting articles - like Diffbot.
But what if you don't want to extract an article? What if all you want is: this is the header, here is a sidebar, these are ads, here is content, oh and this is a footer, btw those are comments.
Suddenly you are shit out of luck.
Sure, it makes sense the APIs wouldn't let you do this - it's supposedly their secret magic sauce. Right?
Except it isn't.
Analyzing the different implementations of article extractors reveals that far from using a methodical approach of marking up different bits of a page, they mostly work as tree pruning algorithms - go through the DOM, remove anything that's not promising, end up with the juicy article.
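To make "tree pruning" concrete, here is a minimal sketch in Python. It is not the code of any real extractor: the tag list, the link-density heuristic, and the names `UNPROMISING_TAGS`, `link_density`, and `prune` are all my own assumptions, and it expects well-formed markup because it leans on the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Assumed list of tags that rarely hold article text (hypothetical heuristic).
UNPROMISING_TAGS = {"nav", "aside", "footer", "script", "style", "form"}


def text_len(el):
    """Length of all text inside an element, ignoring surrounding whitespace."""
    return len("".join(el.itertext()).strip())


def link_density(el):
    """Share of an element's text that sits inside <a> tags (0.0 if no text)."""
    total = text_len(el)
    if total == 0:
        return 0.0
    linked = sum(text_len(a) for a in el.iter("a"))
    return linked / total


def prune(el, max_link_density=0.5):
    """Remove unpromising children in place, then recurse into the survivors."""
    for child in list(el):
        if child.tag == "a":
            continue  # inline links are judged through their parent's density
        if child.tag in UNPROMISING_TAGS or link_density(child) > max_link_density:
            # Note: ElementTree's remove() also discards the child's tail text.
            el.remove(child)
        else:
            prune(child, max_link_density)
    return el


page = ET.fromstring(
    "<html><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<div><p>Some long article text with <a href='/x'>one link</a> in it.</p></div>"
    "<div><a href='/a'>Ad one</a> <a href='/b'>Ad two</a></div>"
    "<footer>bye</footer>"
    "</body></html>"
)
prune(page)
print("".join(page.itertext()))
```

Notice what's left at the end: one blob of article text. Everything that got pruned is simply gone, with no record of whether it was a header, a sidebar, or an ad. That's why this style of algorithm doesn't help with segmentation.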
Nothing you could use to create a web page segmentor ...
Turns out, there is but a single very useful paper devoted to web page segmentation - Christian Kohlschütter's A Densitometric Approach to Web Page Segmentation.
Yep, the same guy who later wrote _Boilerplate Detection using Shallow Text Features_, which later turned into Boilerpipe, one of the best (most certainly the quickest) libraries for web content extraction.
In the paper Kohlschütter explains that only three approaches exist:
- segmenting visually
- linguistic approach
- densitometric approach
Visual segmentation is perhaps the easiest to understand - you look at a website and as a person you instantly know where different sections are. A computer vision algorithm could do something similar. With the caveat you now have to render every page, then run a visual learning algorithm and do a bunch of things that are computationally very expensive.
The linguistic approach is somewhat more reasonable - take a page, look at distributions of words and syllables and what have you (quantitative linguistics this is called) and decide based on that. Problem here is, this only works well for large blocks of text ... the linguistic content in, say, a header might be somewhat lacking.
Block fusion algorithm
Kohlschütter's densitometric approach has a tendency to work as well as a visual algorithm, while being as fast as a linguistic approach ... bloody marvelous!
The idea is basically this:
- walk through nodes
- assign a text density to each node -> number-of-tokens / number-of-'lines'
- merge neighboring nodes with the same (or close enough) densities
- repeat until desired granularity is reached
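The steps above can be sketched in a few lines of Python. Treat this as my rough reading of the idea, not the paper's or Boilerpipe's actual implementation: counting "lines" by word-wrapping at 80 characters, the relative-difference threshold of 0.3, and the names `text_density` and `fuse_blocks` are all assumptions made for illustration.

```python
import textwrap

WRAP_WIDTH = 80  # assumed line width used to count "lines"


def text_density(text, width=WRAP_WIDTH):
    """Number of tokens divided by number of word-wrapped lines."""
    tokens = text.split()
    if not tokens:
        return 0.0
    lines = textwrap.wrap(" ".join(tokens), width=width)
    return len(tokens) / len(lines)


def fuse_blocks(blocks, threshold=0.3):
    """Merge adjacent text blocks with similar density until nothing changes.

    `blocks` is a list of strings in document order. Two neighbors are fused
    when the relative difference of their densities is at most `threshold`.
    """
    blocks = [b for b in blocks if b.strip()]  # empty blocks carry no density
    changed = True
    while changed:
        changed = False
        fused = []
        for block in blocks:
            if fused:
                a, b = text_density(fused[-1]), text_density(block)
                if abs(a - b) / max(a, b) <= threshold:
                    fused[-1] = fused[-1] + " " + block
                    changed = True
                    continue
            fused.append(block)
        blocks = fused
    return blocks


blocks = [
    "Home",
    " ".join(["word"] * 40),  # stand-in for a paragraph of article text
    " ".join(["text"] * 50),  # a second paragraph with similar density
    "Copyright",
]
for segment in fuse_blocks(blocks):
    print(round(text_density(segment), 2), segment[:30])
```

In this sketch the threshold is the granularity knob: a lower value keeps more, smaller segments, a higher value fuses more aggressively. The two paragraph-like blocks end up as one segment, while the short "Home" and "Copyright" blocks stay on their own.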
The simplicity of this algorithm is just brilliant. Even better is the fact they managed to get it down to 15ms per page on average. For comparison's sake - the time it takes Readability to clean up a page is counted in seconds, an average response time from Diffbot (visual approach) is about 10 seconds per page.
Yep, that fast.
And for the icing on the cake - most of the main bits of the Block Fusion Algorithm are already implemented deep inside the bowels of Boilerpipe. You just have to look hard enough.
Who am I and who do I help? I'm Swizec Teller and I turn coders into engineers with "Raw and honest from the heart!" writing. No bullshit. Real insights into the career and skills of a modern software engineer.