Data migrations:
Oh I can do this in an hour. Two tops.
Two days later: (╯°□°)╯︵ ┻━┻
— Swizec Teller (@Swizec) August 20, 2015
On Thursday we pushed to production. Three months' worth of code went from no users to thousands of users. Or hundreds. Depends how you count.
My point is, our code went through the ultimate integration test - the real world.
The whole thing was much less intense than expected. We pushed in the middle of the night, ran our migrations, waited 36 minutes for them to run, ate some pizza while waiting for the fires to start, and went home when they didn't. An uneventful deploy is the best kind of deploy.
Well okay, I say no fires started, but that's not quite true. The whole engineering team has spent the last couple of days fixing known bugs that slipped through, squashing unknown bugs that users on strange computers in strange lands discovered by clicking something they shouldn't, manually tweaking the production database to get people out of unforeseen corner cases, and implementing some features we forgot we needed.
Such is life without QA. You forget things.
As David Heinemeier Hansson once said: it's not worth fixing if a user doesn't complain about it.
Right?
I kid, we knew hiccups would happen. Hair-on-fire mode lasting just three days after a major deploy is good. Very good.
That moment when workers run while you're importing data and duplicate ids and your db doesn't get primary keys and rails gets confused.
— Swizec Teller (@Swizec) August 21, 2015
But the week before was the most frustrating week of my life. I was this close to tearing my hair out, punching my computer in the face, and taking up goat farming. This close. (does the this-close finger gesture)
I was doing a test run of the database migrations we accumulated in those three months. Shouldn't take more than two hours, right? Wrong. It took a week.
Some of our database migrations are also data migrations. Mistake no. 1.
None of the migrations we'd accumulated over those three months had ever been run against the full database. Mistake no. 2.
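In hindsight the antidote is simple: pull a production snapshot and run the whole migration chain against it before deploy day. A rough sketch of what that can look like with Heroku Postgres backups (the app and database names are made up, and exact subcommands vary by CLI version):

heroku pg:backups:capture --app myapp
heroku pg:backups:download --app myapp   # writes latest.dump
pg_restore --clean --no-owner --dbname myapp_development latest.dump
bundle exec rake db:migrate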
Our whole database is under 500,000 rows. The entire data dump falls well under 200 megabytes. Our database is tiny. What could possibly go wrong?
Oh so very much could.
Most migrations ran in a fraction of a second. Many took a second or two. A few beasts took ten minutes.
Ten whole minutes for a single migration on a tiny database. The mind recoils in horror, unable to understand what could be taking so long.
When we wrote a migration, we tested it on our local machines: 2.3 gigahertz computers with six or eight CPU cores, 16 gigabytes of memory, a solid state drive, and a database that could fit on a floppy disk.
"It was fast when I tried it". Of course it was. Your computer is a monster running on bare metal. Heroku dynos are paltry "computers" running on many layers of virtualization. They're meant to zerg rush a problem, not handle a big thing on their own.
You run into problems when your migrations look like this:
def change
  add_column :promotional_tokens, :type, :string
  Token::PromotionalToken.reset_column_information

  Token::PromotionalToken.all.find_each do |token|
    if token.promotion_type.eql? 'referral'
      # remove if user is not a student
      user = User.find(token.user_id)
      if user.is? :student # refactor it to student profile
        token.update!(type: 'Token::ReferralToken')
        ref_token = Token::ReferralToken.find(token.id)
        user.profile.update!(referral_token: ref_token)
      else # remove it as it is now redundant
        token.destroy
      end
    end
  end
end
This migration adds a type column to promotional_tokens, then goes through existing tokens - of which each user has one - and updates them. Students get a new token, everybody else has theirs removed. Because that's what business wants.
There are 90,000 users in the system.
Let's play spot the problem.
- Using .all loads the whole table into memory (see the sketch after this list)
- We run a DB query for every token to fetch the user
- Then we run a DB query to update the token
- Then we run a DB query to get the token again, but with the correct object type
- Then we run a DB query to move the token reference to user.profile
- If the user isn't a student, we run a destroy query instead
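For illustration, here's the memory half of that first bullet in isolation. A minimal sketch, not our actual code - in older Rails, Model.all materializes every record as one big array, while find_each pulls rows in batches of 1,000:

# Materializes every token at once - memory grows with the table
Token::PromotionalToken.all.each do |token|
  # work on token
end

# Streams rows in batches of 1,000 - memory stays flat
Token::PromotionalToken.find_each do |token|
  # work on token
end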
There are only 200 users who aren't students. We can ignore those.
For every user we have, this migration hits the database four times. 360,000 database queries. And just to make it more fun, each is made using ActiveRecord instead of raw SQL.
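For contrast, most of that work can be pushed into the database as a couple of set-based statements. A hypothetical sketch, not what we shipped - it pretends roles live on a users.role column, which ours don't, so treat every name as illustrative:

def change
  add_column :promotional_tokens, :type, :string

  # One DELETE for all non-student tokens, instead of a query per user
  execute <<-SQL
    DELETE FROM promotional_tokens
    WHERE user_id IN (SELECT id FROM users WHERE role <> 'student')
  SQL

  # One UPDATE for everything that's left
  execute "UPDATE promotional_tokens SET type = 'Token::ReferralToken'"
end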
It took 10 minutes to run through the whole dataset on my local machine. It flat out crashed Heroku's one-off dyno. Out of memory.
And if it didn't crash, who knows. Our database runs on a separate server. Maybe in the same datacenter, maybe not. Heroku knows, we don't.
That makes migrations even slower.
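Back-of-envelope: at even one millisecond of network round trip per query, 360,000 queries is six minutes of pure waiting before the database does any actual work. At five milliseconds between machines, it's half an hour.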
That moment when your laptop is so much stronger than Heroku dynos that you decide to migrate data locally.
— Swizec Teller (@Swizec) August 21, 2015
I'm not kidding. I legit considered snapshotting the production database, running migrations locally, then swapping production with the updated snapshot. That could screw things up, especially permissions, but at least I could run it.
We salvaged this migration after many struggles. Like this:
def change
  add_column :promotional_tokens, :type, :string
  Token::PromotionalToken.reset_column_information

  User.with_role(:tutor).includes(:promotional_tokens).find_each do |tutor|
    tutor.promotional_tokens.destroy_all
  end

  Token::PromotionalToken.update_all(type: 'Token::ReferralToken')

  User.with_role(:student).includes(:referral_token, :profile).find_each do |student|
    tok = student.referral_token
    student.profile.update_attribute(:referral_token, tok)
  end
end
Doesn't look like much, but an order of magnitude fewer queries hit the database. The magic of scoping and .includes.
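If you haven't met .includes: it eager-loads an association up front instead of querying per row. A minimal, made-up example:

# N+1: one query for the users, then one more per user for the profile
User.all.each { |user| user.profile }

# Eager loading: two queries total, no matter how many users
User.includes(:profile).each { |user| user.profile }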
200 queries run to delete non-student tokens. Nothing you can do about that. A single query updates the type of all tokens still in the system. And a single query per token moves the token reference to student.profile.
Some 90,003 queries in total. Takes a minute or two.
In school they teach you that O(4n+1) is the same as O(n+3). It's just linear complexity, who cares. Reality cares. 4 times 90,000 is a lot more than 90,000. Especially when you're talking to a database across TCP/IP.
All in all, it took me and the genius who sits across from me an entire week to get our migrations running in under 20 minutes on a MacBook Pro. We removed a lot of .alls, reduced many factors of linear complexity, and in one epic instance even turned linear complexity into constant complexity. That was cool.
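The linear-to-constant move is almost always the same trick: replace a Ruby loop that fires one query per row with a single set-based statement. A made-up example:

# Linear: one UPDATE query per row
Coupon.find_each { |coupon| coupon.update!(expired: true) }

# Constant: one UPDATE for the whole table
# (update_all skips validations and callbacks, which is exactly why it's fast)
Coupon.update_all(expired: true)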
We solved the rest by using --size=px. That ran our migrations on Heroku's beefiest dyno: 2.5 gigs of memory, a dedicated machine, and 12x compute. Whatever that means.
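The flag hangs off heroku run. Something like this, with a hypothetical app name - and note Heroku has since renamed its dyno tiers:

heroku run --size=px --app myapp rake db:migrate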
It took 36 minutes to run our migrations on production.
It took 36 seconds to test them in our dev environment.
Lesson learned: test your migrations on the big dataset.