Swizec Teller - a geek with a hat (swizec.com)

    Hard work is a total waste of time

    Sometimes a series of great decisions can lead to a place where the best decision is a horrendously bad decision.

    And you just don't realise. Boolean algebra taught us early in school that a chain of good implications means the next implication will be pretty good too. Then again, not really. It might be better to take a step back, look at the bigger picture and make a totally new decision.

    Very recently, hell, yesterday, this happened to me. Something nudged me from CEO mode into developer mode. To fully analyze this we have to go back to the beginning of this summer when we embarked on The Mission with Preona and finally started building LazyReadr after months of promises and figuring out if anyone's interested.

    What happen?

    One of the first decisions we made was to run everything on Google's App Engine. Mostly because we saw what a pain it was to keep a web service running reliably, and we expected quite a heavy load once we had to parse a lot of articles and stuff.

    Then somewhere in the middle of August it was time to start performing proper article scraping - taking a link and returning the main text without all the ads and navigation and other crap. After a few attempts with different services ranging from AlchemyAPI to a few different homebrew solutions we decided on Boilerpipe.
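
    To give a feel for what a homebrew scraper can look like, here's a minimal sketch in plain JavaScript. It is not Boilerpipe's or Readability's algorithm, and it is not the code we actually ran. The function names (`stripTags`, `linkTextLength`, `extractMainText`), the sample page and the scoring rule (the block with the most text outside links wins) are all assumptions made up for illustration. Regexes are no real HTML parser, so treat it as a toy.

```javascript
// Toy content extractor: hypothetical helper names, illustrative scoring rule.

// Replace every tag with a space, then collapse whitespace.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

// Count how many characters of visible text sit inside <a>...</a>.
function linkTextLength(html) {
  const anchor = /<a\b[^>]*>([\s\S]*?)<\/a>/gi;
  let total = 0;
  let match;
  while ((match = anchor.exec(html)) !== null) {
    total += stripTags(match[1]).length;
  }
  return total;
}

// Split the page on container tags and keep the block with the most
// text that is NOT link text. Navigation and ads are mostly links.
function extractMainText(html) {
  const cleaned = html.replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, " ");
  const blocks = cleaned.split(
    /<\/?(?:div|section|article|nav|header|footer|aside)\b[^>]*>/i
  );

  let best = "";
  let bestScore = 0;
  for (const block of blocks) {
    const text = stripTags(block);
    const score = text.length - linkTextLength(block);
    if (score > bestScore) {
      best = text;
      bestScore = score;
    }
  }
  return best;
}

// Made-up sample page: navigation, one post, one ad.
const samplePage = `
<div id="nav"><a href="/">Home</a> <a href="/about">About</a> <a href="/archive">Archive</a></div>
<div id="post"><p>Hard work is a total waste of time.</p><p>Sometimes a series of great decisions leads somewhere bad.</p></div>
<div id="ads"><a href="http://example.com">Buy stuff now</a></div>
`;

console.log(extractMainText(samplePage));
```

    On the sample page the navigation and ad blocks score close to zero because nearly all of their text sits inside links, so the post block wins. Real pages break a heuristic this simple pretty quickly.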

    Boilerpipe is a Java library that does one thing and does it well - it extracts text from links. Great, Java, so it runs on App Engine and does what we need.

    Fast forward two weeks and we realise that maybe this Boilerpipe thing isn't that great after all. All it does is return text. But we need to know when an article has 5 pictures in it, or subtitles, links ... stuff like that.
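
    Here's a rough sketch, in plain JavaScript, of the kind of structural information that gets lost when all you get back is text. The helper names (`countTags`, `summarizeStructure`), the sample markup and the choice to treat `h2` to `h6` as subtitles are my assumptions for illustration. It counts tags with regexes, which is fine for a demo and not something to trust on real-world HTML.

```javascript
// Hypothetical helpers: count opening tags that match a tag-name pattern.
function countTags(html, tagPattern) {
  const re = new RegExp("<(?:" + tagPattern + ")\\b[^>]*>", "gi");
  const matches = html.match(re);
  return matches ? matches.length : 0;
}

// Summarize the structure a text-only extractor throws away.
function summarizeStructure(html) {
  return {
    images: countTags(html, "img"),
    subtitles: countTags(html, "h[2-6]"), // assumption: h2-h6 are subtitles
    links: countTags(html, "a"),
  };
}

// Made-up article markup.
const sampleArticle = `
<h1>Title</h1>
<p>Intro with a <a href="/one">link</a>.</p>
<h2>First subtitle</h2>
<img src="a.png"><img src="b.png">
<p>More text and <a href="/two">another link</a>.</p>
`;

console.log(summarizeStructure(sampleArticle));
```

    Run the same function on the extracted plain text and every count comes back as zero, which is exactly the problem.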

    Since Boilerpipe can't do that and none of our homebrew solutions are good enough at finding content, the only place left for us to go was Readability, an Arc90 "experiment", which just happens to be the best content extraction bookmarklet I have ever seen.

    There's a catch though. It's written in JavaScript, and rewriting all the code in Python or Java so we can run it server-side just isn't an option. Especially not when it's apparently under active development and we'd have to go to great lengths to keep up.

    Welcome Rhino. A JavaScript engine that runs on Java.

    Welcome Env.js. A fake browser implemented in JavaScript.

    Combine the two and voilà, we can load up a website into a fake browser running within a fake JavaScript environment and run Readability just like it was in a browser! Yay!

    Well ... no. Stuff didn't work. Env.js was too big for Java. There were bugs. Problems. Lots of stuff. I toiled heavily for three weeks or so until I finally got it working. But by god I got it working!

    Sure it took me understanding stuff about Env.js internals I never cared about. But I got it working!

    Exhausted, I deployed the final code onto App Engine.

    it died

    The whole website-in-a-browser-in-JavaScript-in-Java stack was just too slow. App Engine's 30s restriction was too tight for it and everything just plain old died.

    Dead end.

    Effort wasted.

    But then I started getting an idea. What about node.js? That's JavaScript. On the server. And it's fast! Surely I can run Env.js in there and get everything working, right?

    Not exactly, but there is a project called jsdom.

    Three hours later. Working web scraper. Except now it scrapes huge complex websites in a few seconds!

    Hoorah! Then a day of patching up jsdom, since it's a youngish project and not everything works yet, and we have a very sturdy scraper. Whoaw!

    The lesson learned

    expecting high load -> AppEngine -> Java -> Rhino -> Env.js -> not-working-project-and-several-weeks-wasted


    node.js -> jsdom -> stuff-works-in-a-day-of-work

    So I guess what I'm trying to say, besides the fact I'm sooper happy I got stuff working, is that I learned my lesson: I shouldn't base my decisions solely on the previous decisions I've made. Even if all of them were good decisions by themselves.

    PS: I'm contemplating whether we should make this service public and am leaning heavily towards Yes. What do you guys think? Need an efficient and good web scraping API?

    Published on September 22nd, 2010 in AppEngine, Google App Engine, Java, JavaScript, Node.js, Programming, Uncategorized
