Lesson learned, test your migrations on the big dataset

Data migrations:

Oh I can do this in an hour. Two tops.

Two days later: (╯°□°）╯︵ ┻━┻
— Swizec Teller (@Swizec) August 20, 2015

On Thursday we pushed to production. Three months worth of code went from no users, to thousands of users. Or hundreds. Depends how you count.

My point is, our code went through the ultimate integration test - the real world.

The whole thing was much less intense than expected. We pushed in the middle of the night, ran our migrations, waited 36 minutes for them to run, ate some pizza while waiting for the fires to start, and went home when they didn't. An uneventful deploy is the best kind of deploy.

Well okay, I say no fires started, but that's not quite true. The whole engineering team has spent the last couple of days fixing known bugs that slipped through, squashing unknown bugs that users on strange computers in strange lands discovered by clicking something they shouldn't, manually tweaking the production database to get people out of unforeseen corner cases, and implementing some features we forgot we needed.

Such is life without QA. You forget things.

As David Heinemeier Hansson once said: it's not worth fixing, if a user doesn't complain about it.

Right?

I kid, we knew hiccups would happen. Hair-on-fire mode lasting just three days after a major deploy is good. Very good.

That moment when workers run while you're importing data and duplicate ids and your db doesn't get primary keys and rails gets confused.
— Swizec Teller (@Swizec) August 21, 2015

But the week before was the most frustrating week of my life. I was this close to tearing my hair out, punching my computer in the face, and taking up goat farming. This close. does the this close finger gesture

I was doing a test run of the database migrations we accumulated in those three months. Shouldn't take more than two hours, right? Wrong. It took a week.

Some of our database migrations are also data migrations. Mistake no. 1.

None of our new migrations that we've had for three months have ever been run against the full database. Mistake no. 2.

Our whole database is under 500,000 lines. The entire data dump falls well under 200 megabytes. Our database is tiny. What could possibly go wrong?

Oh so very much could.

Most migrations ran in a fraction of a second. Many took a second or two. A few beasts took ten minutes.

Ten whole minutes for a single migration on a tiny database. The mind recoils in horror, unable to understand what could be taking so long.

When we wrote a migration, we tested it on our local machines. 2.3 gigahertz computers with six or eight CPU cores, 16 gigabytes of memory, a solid state drive, and a database that could fit on a floppy drive.

"It was fast when I tried it". Of course it was. Your computer is a monster running on bare metal. Heroku dynos are paltry "computers" running on many layers of virtualization. They're meant to zerg rush a problem, not handle a big thing on their own.

You run into problems when your migrations look like this:

  def change
    add_column :promotional_tokens, :type, :string

    Token::PromotionalToken.reset_column_information

    Token::PromotionalToken.all.find_each do |token|
      if token.promotion_type.eql? 'referral'

        # remove if user is not a student
        user = User.find(token.user_id)

        if user.is? :student #refactor it to student profile
          token.update!(type: 'Token::ReferralToken' )
          ref_token = Token::ReferralToken.find(token.id)
          user.profile.update!(referral_token: ref_token)

        else #remove it as it is now redundant
          token.destroy
        end
      end
    end
  end

This migration adds a type column to promotional_tokens, then goes through existing tokens - of which each user has one - and updates them. Students get a new token, everybody else has theirs removed. Because that's what business wants.

There are 90,000 users in the system.

Let's play spot the problem.

Using .all that loads the whole table into memory
We run a DB query for every token to fetch the user
Then we run a DB query to update the token
Then we run a DB query to get the token again, but with the correct object type
Then we run a DB query to move the token reference to user.profile
If the user isn't a student, we run a destroy query instead

There's only 200 users who aren't students. We can ignore those.

For every user we have, this migration hits the database four times. 360,000 database queries. And just to make it more fun, each is made using ActiveRecord instead of raw SQL.

It took 10 minutes to run through the whole dataset on my local machine. It flat out crashed Heroku's one-off dyno. Out of memory.

And if it didn't crash, who knows. Our database runs on a separate server. Maybe in the same datacenter, maybe not. Heroku knows, we don't.

That makes migrations even slower.

That moment when your laptop is so much stronger than Heroku dynos that you decide to migrate data locally.
— Swizec Teller (@Swizec) August 21, 2015

I'm not kidding. I legit considered snapshotting the production database, running migrations locally, then swapping production with the updated snapshot. That could screw things up, especially permissions, but at least I could run it.

We salvaged this migration after many struggles. Like this:

  def change
    add_column :promotional_tokens, :type, :string

    Token::PromotionalToken.reset_column_information
    User.with_role(:tutor).includes(:promotional_tokens).find_each do |tutor|
      tutor.promotional_tokens.destroy_all
    end

    Token::PromotionalToken.update_all(type: 'Token::ReferralToken')

    User.with_role(:student).includes(:referral_token, :profile).find_each do |student|
      tok = student.referral_token

      student.profile.update_attribute(:referral_token, tok)
    end
  end

Doesn't look like much, but an order of magnitude less queries hit the database. The magic of scoping and .includes.

200 queries run to delete non-student tokens. Nothing you can do about that. A single query updates the type of all tokens still in the system. And a single query per token, moves the token reference to student.profile.

Some 90,003 queries in total. Takes a minute or two.

In school they teach you that O(4n+1) is the same as O(n+3). It's just linear complexity, who cares. Reality cares. 4 times 90,000 is a lot more than 90,000. Especially when you're talking to a database across TCP/IP.

All in all it took myself and the genius who sits across from me an entire week to get our migrations running in under 20 minutes on a MacBook Pro. We removed a lot of .alls, reduced many factors of linear complexity, and in one epic instance even turned linear complexity into constant complexity. That was cool.

We solved the rest by using --size=px. That ran our migrations with Heroku's beefiest dyno. 2.5 gigs of memory, dedicated machine, and 12x of computing. Whatever that means.

It took 36 minutes to run our migrations on production.

It took 36 seconds to test them in our dev environment.

Lesson learned.

Published on September 2nd, 2015 in Active record pattern, CPU time, Database, Heroku, Rails, Technical

Did you enjoy this article?

👎👍

Continue reading about Lesson learned, test your migrations on the big dataset

Semantically similar articles hand-picked by GPT-4

Senior Mindset Book

Get promoted, earn a bigger salary, work for top companies

Learn more

Have a burning question that you think I can answer? Hit me up on twitter and I'll do my best.

Who am I and who do I help? I'm Swizec Teller and I turn coders into engineers with "Raw and honest from the heart!" writing. No bullshit. Real insights into the career and skills of a modern software engineer.

Want to become a true senior engineer? Take ownership, have autonomy, and be a force multiplier on your team. The Senior Engineer Mindset ebook can help 👉 swizec.com/senior-mindset. These are the shifts in mindset that unlocked my career.

Curious about Serverless and the modern backend? Check out Serverless Handbook, for frontend engineers 👉 ServerlessHandbook.dev

Want to Stop copy pasting D3 examples and create data visualizations of your own? Learn how to build scalable dataviz React components your whole team can understand with React for Data Visualization

Want to get my best emails on JavaScript, React, Serverless, Fullstack Web, or Indie Hacking? Check out swizec.com/collections

Did someone amazing share this letter with you? Wonderful! You can sign up for my weekly letters for software engineers on their path to greatness, here: swizec.com/blog

Want to brush up on your modern JavaScript syntax? Check out my interactive cheatsheet: es6cheatsheet.com

By the way, just in case no one has told you it yet today: I love and appreciate you for who you are ❤️

Senior Mindset Book

Start with a free chapter

Lesson learned, test your migrations on the big dataset

Did you enjoy this article?

Continue reading about Lesson learned, test your migrations on the big dataset

Learned something new?
Read more Software Engineering Lessons from Production

Software Engineering Lessons from Production

Senior Mindset Book