Sometimes a series of great decisions can lead to a place where the best decision is a horrendously bad decision.
And you just don't realise. Boolean algebra taught us early in school that a chain of good implications means the next implication will be pretty good too. Then again, not really. It might be better to take a step back, look at the bigger picture and make a totally new decision.
Very recently, hell, yesterday, this happened to me. Something nudged me from CEO mode into developer mode. To fully analyze this we have to go back to the beginning of this summer when we embarked on The Mission with Preona and finally started building LazyReadr after months of promises and figuring out if anyone's interested.
One of the first decisions we made was to run everything on Google's App Engine. Mostly because we saw what a pain it was keeping a web service running reliably, and we expect quite a heavy load once we have to parse a lot of articles and stuff.
Then somewhere in the middle of August it was time to start performing proper article scraping - taking a link and returning the main text without all the ads and navigation and other crap. After a few attempts with different services ranging from AlchemyAPI to a few different homebrew solutions we decided on Boilerpipe.
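To make "taking a link and returning the main text" a bit more concrete, here is a minimal sketch in Python. It is not how Boilerpipe, AlchemyAPI or any of our homebrew attempts work. It simply assumes the article lives in `<p>` tags and that anything inside `nav`, `aside`, `header`, `footer`, `script` or `style` is junk. The names `MainTextExtractor` and `extract_main_text` are made up for illustration.

```python
from html.parser import HTMLParser

# Tags whose contents this sketch treats as page chrome (an assumption,
# real pages need much smarter heuristics).
SKIP_TAGS = {"nav", "aside", "header", "footer", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collects text from <p> tags that are not nested inside skipped tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # how many skipped tags we are inside
        self.in_paragraph = False  # currently inside a kept <p>?
        self.paragraphs = []       # finished paragraphs
        self._buffer = []          # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self._buffer.append(data)


def extract_main_text(html):
    """Return the kept paragraphs joined by blank lines."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)
```

Feed it a page's HTML and you get back the paragraphs without the navigation and sidebars, which is roughly the job description. Finding content on pages that don't mark up their junk so politely is the hard part.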
Boilerpipe is a Java library that does one thing and does it well - it extracts text from links. Great, Java, so it runs on App Engine and does what we need.
Fast forward two weeks and we realise that maybe this Boilerpipe thing isn't that great after all. All it does is return text. But we need to know when an article has 5 pictures in it, or subtitles, links ... stuff like that.
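The kind of information plain text throws away is easy to show. This is a rough Python sketch, not anything from our actual scraper, that counts pictures, subtitles and links in a chunk of HTML. It assumes subtitles are `h1` to `h6` tags and that only `<a>` tags with an `href` count as links. `StructureCounter` and `summarize_structure` are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumption: any heading tag counts as a "subtitle".
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class StructureCounter(HTMLParser):
    """Counts images, headings and links while the HTML is parsed."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img />.
        if tag == "img":
            self.counts["images"] += 1
        elif tag in HEADING_TAGS:
            self.counts["headings"] += 1
        elif tag == "a" and dict(attrs).get("href"):
            self.counts["links"] += 1


def summarize_structure(html):
    """Return a dict with the number of images, headings and links."""
    parser = StructureCounter()
    parser.feed(html)
    parser.close()
    return {key: parser.counts[key] for key in ("images", "headings", "links")}
```

A text-only extractor gives you none of these numbers, because the tags are gone by the time you see the output.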
Since Boilerpipe can't do that and none of our homebrew solutions are good enough at finding content, the only place left for us to go was Readability, an Arc90 "experiment", which just happens to be the best content extraction bookmarklet I have ever seen.
Welcome Rhino. A JavaScript engine that runs on Java.
Well ... no. Stuff didn't work. Env.js was too big for Java. There were bugs. Problems. Lots of stuff. I toiled heavily for three weeks or so until I finally got it working. But by god I got it working!
Sure it took me understanding stuff about Env.js internals I never cared about. But I got it working!
Exhausted, I deployed the final code onto App Engine.
Not exactly, but there is a project called jsdom.
Three hours later. Working web scraper. Except now it scrapes huge complex websites in a few seconds!
Hoorah! Then a day of patching up jsdom, since it's a youngish project and not everything works yet, and we have a very sturdy scraper. Whoaw!
expecting high load -> AppEngine -> Java -> Rhino -> Env.js -> not-working-project-and-several-weeks-wasted
node.js -> jsdom -> stuff-works-in-a-day-of-work
So I guess what I'm trying to say besides the fact I'm sooper happy I got stuff working, is that I learned my lesson and that I shouldn't base my decisions solely on the previous decisions I've made. Even if all of them were good decisions by themselves.
PS: I'm contemplating whether we should make this service public and am leaning heavily on the Yes option. What do you guys think? Need an efficient and good web scraping API?