Here's a short and sweet story about a Friday deploy. I love Friday deploys.
Here's how it went:
- We deployed an update
- 2min later we saw SQL error messages in our "something's wrong" Slack channel
- It was a distributed transaction constraint violation
- We couldn't roll back because software only moves forward
- 5min later we shipped a reverted PR
- The errors stopped
- An hour later we had the full fix ready to go
We didn't ship that one though because a Friday 4:53pm deploy feels too aggressive even to me. Especially when the systems are working and it's a problem that can wait.
Why tests didn't catch this
Distributed systems problem. Code worked locally and in tests. You do operation A then B and everything is fine.
But in production sometimes B happens before A and the database goes "lol mate hold on what is this object you're referencing??"
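Here's a minimal sketch of that failure mode using an in-memory SQLite database. The orders and payments tables are hypothetical stand-ins, since the actual objects involved aren't named above. It only shows that the same two operations pass in one order and hit a constraint violation in the other.

```python
import sqlite3


def make_db():
    """Fresh in-memory database with a foreign key from payments to orders."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE payments ("
        "id INTEGER PRIMARY KEY, "
        "order_id INTEGER NOT NULL REFERENCES orders(id))"
    )
    return conn


def operation_a(conn):
    # Hypothetical operation A: create the parent object.
    conn.execute("INSERT INTO orders (id) VALUES (1)")


def operation_b(conn):
    # Hypothetical operation B: create something that references the parent.
    conn.execute("INSERT INTO payments (order_id) VALUES (1)")


def run(steps):
    """Run the steps in the given order and report what the database said."""
    conn = make_db()
    try:
        for step in steps:
            step(conn)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"IntegrityError: {exc}"
    finally:
        conn.close()


local_result = run([operation_a, operation_b])       # A then B: "ok"
production_result = run([operation_b, operation_a])  # B then A: constraint violation
```

A test that always calls A before B exercises only the first path, which is why it stays green.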
You could write a test for this, but you might end up with one of those flaky tests that everybody hates. You know the kind – fails every 98th time, nobody knows why, and you all just ignore it. "Oh that test? Yeah that one sucks. Hit rerun and it'll be fine".
In production that 98th time happens to a user every day 😉
And even if you did write the test, you'd never know whether it passes because your code behaves more deterministically in a test environment or because you accurately captured all the nuance of a live production environment.
How observability did catch it, fast
It's easy. We send all error logs to a central location where they are observed by robots. When errors talk about SQL, we send them to Slack as a warning. If there are lots, we trigger a proper alert that wakes people up.
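The routing rule described above can be sketched roughly like this. The keyword list, threshold, and time window are assumptions made up for illustration, and the returned strings stand in for whatever actually posts to Slack or pages someone.

```python
from collections import deque

# Assumed keywords that mark a log line as "talking about SQL".
SQL_MARKERS = ("sql", "integrityerror", "constraint")


class SqlErrorRouter:
    """Warn on every SQL error, escalate when many arrive in a short window."""

    def __init__(self, threshold=5, window_seconds=60):
        self.threshold = threshold            # assumed: errors needed to page
        self.window_seconds = window_seconds  # assumed: sliding window size
        self.recent = deque()                 # timestamps of recent SQL errors

    def route(self, message, timestamp):
        text = message.lower()
        if not any(marker in text for marker in SQL_MARKERS):
            return "ignore"
        self.recent.append(timestamp)
        # Drop SQL errors that fell out of the sliding window.
        while self.recent and timestamp - self.recent[0] > self.window_seconds:
            self.recent.popleft()
        if len(self.recent) >= self.threshold:
            return "page"           # lots of errors: wake people up
        return "slack_warning"      # a few errors: warn in the channel


router = SqlErrorRouter(threshold=3, window_seconds=60)
decisions = [
    router.route("KeyError: 'user'", 0),
    router.route("SQL error: constraint violation", 1),
    router.route("SQL error: constraint violation", 2),
    router.route("SQL error: constraint violation", 3),
    router.route("SQL error: constraint violation", 200),
]
```

With these assumed settings the third SQL error inside a minute escalates to a page, and an isolated error much later drops back to a Slack warning.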
We're using OTEL integrated into our Python logger. Anyone can hook into this infrastructure with a current_app.logger.debug/info/warn/error call. Default error handling is already instrumented so you don't need to think about it.
Same ability exists on the client side in JavaScript.
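To show what "low friction" looks like on the Python side, here's a rough sketch using only the standard library. Flask's current_app.logger is a regular logging.Logger, so the sketch uses logging.getLogger directly. CentralHandler is a hypothetical stand-in for the real OTEL export, which isn't shown here, and charge_customer is a made-up function. The sketch calls warning-style methods by their current names, since warn is the deprecated spelling of warning in Python's logging module.

```python
import logging


class CentralHandler(logging.Handler):
    """Hypothetical stand-in for an exporter that ships records to a central store."""

    def __init__(self):
        super().__init__()
        self.events = []

    def emit(self, record):
        self.events.append(
            {
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            }
        )


central = CentralHandler()
logger = logging.getLogger("myapp")  # in Flask this would be current_app.logger
logger.setLevel(logging.DEBUG)
logger.addHandler(central)
logger.propagate = False  # keep the sketch quiet on stderr


def charge_customer(customer_id):
    """Made-up application code: one line per event, no extra setup."""
    logger.info("charging customer %s", customer_id)
    try:
        raise RuntimeError("payment gateway timeout")
    except RuntimeError:
        logger.error("charge failed for customer %s", customer_id, exc_info=True)


charge_customer(42)
```

Once the handler is wired up in one place, application code only ever touches the logger.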
Key to making this useful is:
- default instrumentation for defaults
- low friction to add new logs, traces, or spans
- easy search through all this data (we use Sumo Logic)
- anyone can make a self-serve alert to observe their code
Crucially, you don't need to deploy code to make a new alert or dashboard. As long as the events are there, you can start observing anything that you think is causing problems.
And then you can fix 'em :)
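One way to picture the "no deploy needed" part: an alert is just data evaluated against events that already exist. This is a sketch under that assumption, not how Sumo Logic defines alerts. AlertRule, its fields, and the sample events are all hypothetical.

```python
from dataclasses import dataclass

LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical self-serve alert: pure data, no application code."""
    name: str
    contains: str    # substring to look for in the message
    min_level: str   # lowest level that counts
    threshold: int   # matching events needed to fire


def matches(rule, event):
    return (
        LEVELS[event["level"]] >= LEVELS[rule.min_level]
        and rule.contains.lower() in event["message"].lower()
    )


def should_fire(rule, events):
    """Evaluate a rule against events that were already collected."""
    hits = sum(1 for event in events if matches(rule, event))
    return hits >= rule.threshold


# Events already sitting in the central store (made up for the example).
stored_events = [
    {"level": "INFO", "message": "checkout started"},
    {"level": "ERROR", "message": "checkout failed: card declined"},
    {"level": "ERROR", "message": "checkout failed: gateway timeout"},
    {"level": "WARNING", "message": "slow query on orders"},
]

checkout_alert = AlertRule("checkout-failures", "checkout failed", "ERROR", 2)
slow_query_alert = AlertRule("slow-queries", "slow query", "WARNING", 3)

checkout_fires = should_fire(checkout_alert, stored_events)
slow_query_fires = should_fire(slow_query_alert, stored_events)
```

Adding a new rule changes what you watch without touching the code that emits the events.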
Cheers,
~Swizec