Swizec Teller - a geek with a hat (swizec.com)

    Hard work is a total waste of time

    Sometimes a series of great decisions can lead to a place where the best decision is a horrendously bad decision.

    And you just don't realise. Boolean algebra taught us early in school that a chain of good implications means the next implication will be pretty good too. Then again, not really. It might be better to take a step back, look at the bigger picture and make a totally new decision.

    Very recently, hell, yesterday, this happened to me. Something nudged me from CEO mode into developer mode. To fully analyze this we have to go back to the beginning of this summer when we embarked on The Mission with Preona and finally started building LazyReadr after months of promises and figuring out if anyone's interested.

    What happen?

    One of the first decisions we made was to run everything on Google's App Engine. Mostly because we had seen what a pain it is to keep a web service running reliably, and we expected quite a heavy load once we had to parse a lot of articles and stuff.

    Then somewhere in the middle of August it was time to start performing proper article scraping - taking a link and returning the main text without all the ads and navigation and other crap. After a few attempts with different services ranging from AlchemyAPI to a few different homebrew solutions we decided on Boilerpipe.
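
    To make "returning the main text without all the ads and navigation" concrete, here is a toy sketch using only Python's standard library. It is not how Boilerpipe or any of the services above work; the tag lists and the "longest block wins" rule are assumptions made purely for illustration.

```python
from html.parser import HTMLParser

# Assumption: content inside these tags is never article text.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Assumption: these tags are the candidate content containers.
BLOCK_TAGS = {"article", "div", "section", "td"}


class MainTextExtractor(HTMLParser):
    """Collect text per block container, ignoring boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a SKIP_TAGS element
        self.stack = []      # text chunks of open containers, innermost last
        self.blocks = []     # text chunks of containers that have been closed

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.stack.append([])

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS and self.stack:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not self.skip_depth and self.stack:
            self.stack[-1].append(text)  # credit text to the innermost container


def extract_main_text(html: str) -> str:
    """Return the text of the container holding the most text."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    # Include containers that were never closed in sloppy markup.
    candidates = [" ".join(chunks) for chunks in parser.blocks + parser.stack]
    return max(candidates, key=len, default="")
```

    Feed it a page's HTML and it returns the longest run of text it found outside the navigation, header and footer. Real pages break a heuristic this naive very quickly, which is why we went shopping for a proper library.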

    Boilerpipe is a Java library that does one thing and does it well - it extracts text from links. Great, it's Java, so it runs on App Engine and does what we need.

    Fast forward two weeks and we realise that maybe this Boilerpipe thing isn't that great after all. All it does is return text. But we need to know when an article has 5 pictures in it, or subtitles, links ... stuff like that.
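
    The kind of structural information that gets lost in a text-only result can be sketched with Python's standard library. This is a hypothetical illustration, not LazyReadr code: the class name, the function name and the choice of `h2`-`h4` as "subtitles" are all assumptions.

```python
from collections import Counter
from html.parser import HTMLParser


class StructureCounter(HTMLParser):
    """Count images, subheadings and links in an HTML fragment."""

    # Assumption: h2-h4 are the article's subtitles; h1 is the page title.
    HEADINGS = {"h2", "h3", "h4"}

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.counts["images"] += 1
        elif tag in self.HEADINGS:
            self.counts["subheadings"] += 1
        elif tag == "a" and attrs.get("href"):
            # Anchors without an href are page-internal targets, not links.
            self.counts["links"] += 1


def summarize_structure(html: str) -> dict:
    """Return how many images, subheadings and links the fragment contains."""
    counter = StructureCounter()
    counter.feed(html)
    counter.close()
    return {key: counter.counts[key] for key in ("images", "subheadings", "links")}
```

    Run on an extracted article's HTML, this gives exactly the numbers a plain-text result can no longer provide.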

    Since Boilerpipe can't do that, and none of our homebrew solutions are good enough at finding content, the only place left for us to go was Readability, an Arc90 "experiment", which just happens to be the best content extraction bookmarklet I have ever seen.

    There's a catch though. It runs on JavaScript, and rewriting all the code into Python or Java so we can run it server-side just isn't an option. Especially not when it's apparently under active development and we'd have to go to great lengths to keep up.

    Welcome Rhino. A JavaScript engine that runs on Java.

    Welcome Env.js. A fake browser implemented in JavaScript.

    Combine the two and voila, we can load up a website into a fake browser running within a fake JavaScript environment and run Readability just like it was in a browser! yay!

    Well ... no. Stuff didn't work. Env.js was too big for Java. There were bugs. Problems. Lots of stuff. I toiled heavily for three weeks or so until I finally got it working. But by god I got it working!

    Sure it took me understanding stuff about Env.js internals I never cared about. But I got it working!

    Exhausted, I deployed the final code onto App Engine.

    it died

    The whole website-in-a-browser-in-JavaScript-in-Java stack was just too slow. It couldn't finish within App Engine's 30-second request limit and everything just plain old died.

    Dead end.

    Effort wasted.

    But then I started getting an idea. What about node.js? That's JavaScript. On the server. And it's fast! Surely I can run Env.js in there and get everything working, right?

    Not exactly, but there is a project called jsdom.

    Three hours later. Working web scraper. Except now it scrapes huge complex websites in a few seconds!

    Hoorah! Then a day of patching up jsdom, since it's a youngish project and not everything works yet, and we have a very sturdy scraper. Whoaw!

    The lesson learned

    expecting high load -> AppEngine -> Java -> Rhino -> Env.js -> not-working-project-and-several-weeks-wasted

    alternative:

    node.js -> jsdom -> stuff-works-in-a-day-of-work

    So I guess what I'm trying to say, besides the fact that I'm sooper happy I got stuff working, is that I learned my lesson: I shouldn't base my decisions solely on the previous decisions I've made. Even if all of them were good decisions by themselves.

    PS: I'm contemplating whether we should make this service public and am leaning heavily on the Yes option. What do you guys think? Need an efficient and good web scraping API?

    Published on September 22nd, 2010 in AppEngine, Google App Engine, Java, JavaScript, Node.js, Programming, Uncategorized
