"Application error is preventing tutors from ending their shifts 😞"
"Tutors in sessions are noticing that it says they are offline!"
It was 1:28am Monday night, two days before Thanksgiving. Literally the lowest hour of the lowest week for traffic all year. Our servers were melting.
I saw the message by accident. Good thing I was up 😅
"@sharon looks like everything is down and our servers are on fire. I restarted the server, can you see if that helped?"
Average API response time → 30 seconds. Usually, it's 0.3 seconds.
Restarting helped. Restarting a Rails server always helps. I went to bed.
At 9am, our servers were melting.
At 10:30am, our servers were melting.
We suspected analytics, deployed a quick patch to disable the workers, restarted servers, and got to work.
The perfect storm
Saving blobs of JSON into an array is hard
LoL it's just an array with some objects how hard can it be
So how does a server melt on the lowest week of traffic for the entire year? And how does it start melting again just a few hours later even though it was fine for many months before that?
The problem with exponential problems is that you don't see them coming until it's too late.
API response times grow millisecond by millisecond, getting ever so much larger. “Nothing alarming, normal fluctuation,” you say to yourself.
Then BAM! Response times hit 30 seconds and Heroku forcibly times out all your requests. Now you're in trouble.
That's why it happened so much faster the 2nd time. And even faster the 3rd time.
Because saving a bunch of JSON blobs reliably is hard. The "reliably" part is what gets you.
How do you make sure clients can send analytics events and ensure none of them get dropped?
If the request fails, you store the event and try again later.
How do you avoid sending too many requests since failing requests are the likeliest cause of most errors?
You batch multiple events into the same request.
Congratulations, you just created an API that can lead to your clients DDoSing your own server.
A DDoS is a distributed denial of service attack, by the way. It happens when so many clients send your server so many requests that it crumples under the pressure.
Similar in principle to the Reddit or HackerNews hug of death. That's when a link gets so popular that the flood of traffic melts servers.
Why storing, retrying, and batching requests spells trouble
You see, there is nothing wrong with storing failed requests and retrying them. Or with batching multiple small requests into one big one. It's even the smart thing to do.
It's when your backend has bugs that it leads to trouble.
What happens if you send a batch of 10 events and the server fails to process 1 of them? Say, because an error happens when attaching additional properties?
The server says, "Hey, I couldn't process this batch."
So you say, "Oh, it failed. I'll try again later."
Later comes. You have some new events, too. So you send a batch of 20 events.
The same event fails. Next time, you send 30.
See what that leads to yet?
That's right. Eventually, you're sending so many events at the same time that the server has trouble processing them all within a single request. This begins to impact other clients' requests.
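Here's a minimal sketch of that feedback loop in Python. It is not our actual client or backend: `Client`, `all_or_nothing_server`, and `simulate` are hypothetical names, and the only assumptions are the ones from the story above. The client keeps its whole backlog until a batch succeeds, and the server rejects an entire batch when a single event fails.

```python
from dataclasses import dataclass, field


@dataclass
class Client:
    """Hypothetical analytics client that retries its whole backlog."""
    backlog: list = field(default_factory=list)

    def track(self, events):
        # New events are queued behind whatever failed to send earlier.
        self.backlog.extend(events)

    def flush(self, send):
        # Send everything in one batch; clear the backlog only on success.
        batch_size = len(self.backlog)
        if send(self.backlog):
            self.backlog = []
        return batch_size


def all_or_nothing_server(batch):
    # Hypothetical backend bug: one bad event rejects the entire batch.
    return not any(event.get("poison") for event in batch)


def simulate(rounds, events_per_round=10):
    client = Client()
    sizes = []
    for i in range(rounds):
        events = [{"name": f"event-{i}-{j}"} for j in range(events_per_round)]
        if i == 0:
            events[0]["poison"] = True  # the one event the server cannot process
        client.track(events)
        sizes.append(client.flush(all_or_nothing_server))
    return sizes


print(simulate(5))  # [10, 20, 30, 40, 50]
```

One bad event in the first batch and every later request gets bigger. Multiply that by every client with a backlog.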
The server is busy processing. Other clients say, "Yo, here's my stuff", and get denied.
Failures happen with no rhyme or reason. It depends completely on who is currently using the service and whether their client has a large backlog of events to send.
The problem festers and grows.
The more clients experience an issue, the more clients have large backlogs, the more issues occur.
Eventually, your server melts, and nobody can't do nothing no more.
The solution? A preprocessing queue.
When a client sends a request with some events, just store the raw payload somewhere and deal with it later.
Requests almost always succeed, so events don't pile up on clients. Saving the raw payload is also fast, which makes overall performance faster for the clients, too.
Processing offline also gives you more time. Instead of having to do everything in 30 seconds, you can spend many minutes on a big request if you have to.
Plus, you can mark specific events as failing, but still process everything else.
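A rough sketch of the idea in Python, assuming an in-memory `deque` as a stand-in for whatever durable store you'd really use (a database table, a message queue). `handle_request`, `enrich`, and `drain_queue` are hypothetical names for illustration, and the `user_id` check is a made-up failure condition, not what our backend does.

```python
import json
from collections import deque

raw_queue = deque()  # stand-in for a durable store
processed = []       # events that made it through
failed = []          # events (or payloads) marked as failing


def handle_request(body: str) -> int:
    """Request handler: store the raw payload and acknowledge right away."""
    raw_queue.append(body)
    return 202  # accepted; nothing is parsed or validated here


def enrich(event: dict) -> dict:
    """Hypothetical processing step that can fail for a single event."""
    if "user_id" not in event:
        raise ValueError("missing user_id")
    return {**event, "processed": True}


def drain_queue() -> None:
    """Offline worker: handle each event on its own so one failure stays isolated."""
    while raw_queue:
        body = raw_queue.popleft()
        try:
            events = json.loads(body)
        except json.JSONDecodeError as err:
            failed.append({"raw": body, "error": str(err)})
            continue
        if not isinstance(events, list):
            failed.append({"raw": body, "error": "expected a list of events"})
            continue
        for event in events:
            try:
                processed.append(enrich(event))
            except Exception as err:
                # Mark this one event as failing, keep going with the rest.
                failed.append({"event": event, "error": str(err)})
```

The request handler does almost nothing, so it almost never fails and the client's backlog stays empty. The worker can take as long as it needs, and a bad event lands in `failed` instead of blocking its whole batch.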
Lessons learned →
Saving arrays of JSON blobs is hard.
Who am I and who do I help? I'm Swizec Teller and I turn coders into engineers with "Raw and honest from the heart!" writing. No bullshit. Real insights into the career and skills of a modern software engineer.