Sometimes a series of great decisions can lead to a place where the best decision is a horrendously bad decision.
And you just don't realise. Boolean algebra taught us early in school that a chain of good implications means the next implication will be pretty good too. Then again, not really. It might be better to take a step back, look at the bigger picture and make a totally new decision.
Very recently, hell, yesterday, this happened to me. Something nudged me from CEO mode into developer mode. To fully analyze this we have to go back to the beginning of this summer when we embarked on The Mission with Preona and finally started building LazyReadr after months of promises and figuring out if anyone's interested.
What happen?
One of the first decisions we made was to run everything on Google's App Engine. Mostly because we saw what a pain it was to keep a web service running reliably, and we expect quite a heavy load once we have to parse a lot of articles and stuff.
Then somewhere in the middle of August it was time to start performing proper article scraping - taking a link and returning the main text without all the ads and navigation and other crap. After a few attempts with different services ranging from AlchemyAPI to a few different homebrew solutions we decided on Boilerpipe.
Boilerpipe is a Java library that does one thing and does it well - it extracts text from links. Great, Java, so it runs on App Engine and does what we need.
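To make "returning the main text without all the ads and navigation" a bit more concrete, here is a toy sketch of the general idea. It is not how Boilerpipe works internally; it's a simple link-density heuristic I'm assuming for illustration, and the function names (`stripTags`, `scoreBlock`, `extractMainText`) and the sample page are made up. It also only copes with flat, non-nested `<div>` blocks.

```javascript
// Toy main-text picker. NOT Boilerpipe's algorithm, just a link-density
// heuristic: blocks with many words and little link text are likely content.

// Remove tags and collapse whitespace to get the visible text of a fragment.
function stripTags(html) {
  return html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
}

// Score a block: lots of words is good, lots of link text is bad.
function scoreBlock(blockHtml) {
  const text = stripTags(blockHtml);
  const words = text ? text.split(" ").length : 0;
  const linkText = (blockHtml.match(/<a\b[^>]*>[\s\S]*?<\/a>/gi) || [])
    .map(stripTags)
    .join(" ");
  const linkWords = linkText ? linkText.split(" ").length : 0;
  const linkDensity = words === 0 ? 1 : linkWords / words;
  return words * (1 - linkDensity);
}

// Split the page into <div> blocks (assumes flat, non-nested markup)
// and return the text of the best-scoring one.
function extractMainText(html) {
  const blocks = html.match(/<div\b[^>]*>[\s\S]*?<\/div>/gi) || [];
  let best = "";
  let bestScore = 0;
  for (const block of blocks) {
    const score = scoreBlock(block);
    if (score > bestScore) {
      bestScore = score;
      best = stripTags(block);
    }
  }
  return best;
}

// Hypothetical page: navigation, one real post, one ad.
const page = `
  <div id="nav"><a href="/">Home</a> <a href="/about">About</a> <a href="/blog">Blog</a></div>
  <div id="post"><p>Sometimes a series of great decisions leads somewhere bad.</p>
  <p>Read the <a href="/more">follow-up</a> for details.</p></div>
  <div id="ads"><a href="/buy">Buy now</a></div>
`;

console.log(extractMainText(page));
```

The navigation and ad blocks are nothing but links, so they score zero and the post wins. Real pages are far messier than this, which is exactly why we went looking for a library.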
Fast forward two weeks and we realise that maybe this Boilerpipe thing isn't that great after all. All it does is return text. But we need to know when an article has 5 pictures in it, or subtitles, links ... stuff like that.
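Roughly, this is the kind of information that gets lost once you flatten an article to plain text. The sketch below is only an illustration: the `describeArticle` shape and its field names are hypothetical, and counting tags with regular expressions is a stand-in for a real HTML parser.

```javascript
// Count the structural elements a text-only extractor throws away.
// Regex tag counting is a rough illustration, not a real HTML parser.
function countTag(html, tag) {
  const pattern = new RegExp("<" + tag + "\\b", "gi");
  return (html.match(pattern) || []).length;
}

// Hypothetical result shape: the text plus the structure around it.
function describeArticle(html) {
  return {
    images: countTag(html, "img"),
    subtitles: countTag(html, "h2") + countTag(html, "h3"),
    links: countTag(html, "a"),
    text: html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim(),
  };
}

// Made-up article fragment.
const article = `
  <h2>Part one</h2>
  <p>Some text with a <a href="https://example.com">link</a>.</p>
  <img src="one.png"><img src="two.png">
  <h3>Part two</h3>
  <p>More text.</p>
`;

const { images, subtitles, links } = describeArticle(article);
console.log({ images, subtitles, links }); // { images: 2, subtitles: 2, links: 1 }
```

A text-only extractor hands back just the `text` field and everything else is gone.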
Since Boilerpipe can't do that, and none of our homebrew solutions are good enough at finding content, the only place left for us to go was Readability, an Arc90 "experiment", which just happens to be the best content extraction bookmarklet I have ever seen.
There's a catch though. It runs on JavaScript, and rewriting all the code into Python or Java so we can run it server-side just isn't an option. Especially not when it's apparently under active development and we'd have to go to great lengths to keep up.
Welcome Rhino. A JavaScript engine that runs on Java.
Welcome Env.js. A fake browser implemented in JavaScript.
Combine the two and voila, we can load up a website into a fake browser running within a fake JavaScript environment and run Readability just like it was in a browser! yay!
Well ... no. Stuff didn't work. Env.js was too big for Java. There were bugs. Problems. Lots of stuff. I toiled heavily for three weeks or so until I finally got it working. But by god I got it working!
Sure it took me understanding stuff about Env.js internals I never cared about. But I got it working!
Exhausted, I deployed the final code onto App Engine.
it died
The whole website-in-a-browser-in-JavaScript-in-Java stack was just too slow. It couldn't fit within App Engine's 30-second request limit and everything just plain old died.
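If you haven't run into a hard request limit before, the effect is roughly what this sketch shows: the work races a clock, and if the clock wins, you get nothing. This is just an illustration in plain JavaScript, not App Engine's actual mechanism. `withDeadline` and `fakeScrape` are names I made up, and the millisecond values are tiny stand-ins for the real 30 seconds.

```javascript
// Race a piece of work against a deadline, similar in effect to a platform
// with a hard request limit. The limits used below are illustrative only.
function withDeadline(work, limitMs) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error("deadline of " + limitMs + " ms exceeded")),
      limitMs
    );
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}

// Hypothetical stand-in for a scrape that takes `ms` milliseconds.
function fakeScrape(ms) {
  return new Promise((resolve) => setTimeout(() => resolve("article text"), ms));
}

// Fast enough: the result comes through.
withDeadline(fakeScrape(10), 100).then((text) => console.log("ok:", text));

// Too slow: the deadline rejects first and the result is lost.
withDeadline(fakeScrape(200), 50).catch((err) => console.log("died:", err.message));
```

Note that the slow work isn't cancelled here, it just gets ignored. Either way the caller never sees the article, which is about how it felt.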
Dead end.
Effort wasted.
But then I started getting an idea. What about node.js? That's JavaScript. On the server. And it's fast! Surely I can run Env.js in there and get everything working, right?
Not exactly, but there is a project called jsdom.
Three hours later. Working web scraper. Except now it scrapes huge complex websites in a few seconds!
Hoorah! Then a day of patching up jsdom, since it's a youngish project and not everything works yet, and we have a very sturdy scraper. Whoaw!
The lesson learned
expecting high load -> AppEngine -> Java -> Rhino -> Env.js -> not-working-project-and-several-weeks-wasted
alternative:
node.js -> jsdom -> stuff-works-in-a-day-of-work
So I guess what I'm trying to say, besides the fact I'm sooper happy I got stuff working, is that I learned my lesson: I shouldn't base my decisions solely on the previous decisions I've made. Even if all of them were good decisions by themselves.
PS: I'm contemplating whether we should make this service public and am leaning heavily on the Yes option. What do you guys think? Need an efficient and good web scraping API?
Who am I and who do I help? I'm Swizec Teller and I turn coders into engineers with "Raw and honest from the heart!" writing. No bullshit. Real insights into the career and skills of a modern software engineer.