Lessons from a MySQL Performance Question

There was an interesting database performance-related thread on Reddit last week. The post title was “Database was regularly averaging 100% CPU when using 4 cores and 8 GB RAM. After upgrading to 6 cores and 16 GB RAM it’s averaging 5% CPU. ????”.

The body of the post is here:

Hello, as the title says, my database was regularly averaging 100% CPU when using 4 cores and 8 GB RAM. The load average was usually around 4 or 5.

After upgrading to 6 cores and 16 GB RAM it’s averaging 5% CPU, and the load average less than 1.
Is this simply a case that the queries were too tough for 4 cores and 8 GB RAM, causing the database to be in a constant state of panic, but 6 cores and 16 GB RAM makes everything run smoothly?

I’m just surprised to see such a performance difference.

Any insights appreciated

The thread was very surprising to me. Of course, it’s not unusual that someone has some performance question about a SQL database server. It’s the way this thread proceeded that I found to be unusual: nobody answered the question. And almost all of the responses were laboring under a false understanding of how the OS worked.

It’s probably easier to consider the question itself. As it stands, it’s pretty much impossible to answer. The poster is saying that he was running MySQL and seeing 100% CPU load. Then, they moved MySQL to new hardware, they dropped to 5% CPU … and they wanted to know why.

Of course, nobody knows. Certainly, more hardware should be better. But the DBMS was presumably reinstalled (was it not?) on the new hardware. Maybe it’s a different version. Maybe it has different configuration parameters. It’s possible that the binary files were moved, but it’s also possible that the database was restored from a backup — which, depending on how the backup was done, would recreate and reorganize tables and indexes. With the fresh indexes, maybe different execution plans were determined.

We know nothing of the I/O subsystem, either. Is it precisely the same?

I guess it’s also possible the system is cloud-hosted and the poster was actually just migrating a snapshot to a different instance type. Instance types in the different cloud providers can be really subtle — some are metered, some aren’t; some are shared, some are dedicated.

Really, there’s just too little information here to explain why the observed CPU usage was so much lower. We don’t even know anything about the workload — mostly writes, mostly reads? And we certainly don’t have any single specific query (or a set of queries) to examine.

The only real, honest answer to this question is “we don’t know”, then. With so many unknowns, the cause of the observations is just not something that can be determined. Only one person of the eight or so who responded were willing to admit that.

It’s probably not great that people give incorrect answers. But one of the worst side-effects of offering an answer (even if it’s correct!) is that it will short-circuit any initiative to investigate what really happened. “Someone explained it with their answer!” is enough to stop any sensible person from trying to check into what happened. That work appears redundant, and also comes off as an affront to the person who offered the answer in the first place.

For me, I like working on teams that do investigate things. We don’t need to re-prove the quadratic equation every time we use it, but digging into assumptions and unanticipated behaviour is a great way to learn, and a great way to make a large heterogeneous system more stable. Teams that short-circuit that work, or defer it and never follow-up, end up short-changing themselves.

The other issue in the thread was even more shocking to me, and I’ll cover it in my next post.






Leave a Reply

Your email address will not be published. Required fields are marked *