I’ve been asked on numerous occasions why people’s database benchmark results often vary. In other words, why do they see one observed transactions-per-second (TPS) value today and a value 20% or more different tomorrow? In short, why aren’t database benchmark results reliably repeatable?

Often people try to blame the benchmarking software. That is the logical first suspect, since that software runs the benchmark and scores the result. But I’ve seen several different database benchmarking tools exhibit this same odd behavior. I know that commercial software gets rigorous QA testing, and open source or freeware tools have very active communities reporting issues. So I’m hesitant to throw any database benchmarking tool under the bus. The answer must lie elsewhere.

Another source of potential issues is the database benchmarks themselves. Maybe (and I’m totally just conjecturing here) the specs contain items open to interpretation or misinterpretation by those coding database benchmarking tools. It might even be possible that those database benchmark specs possess some design flaws. It’s possible. But most database benchmarking tools have been around for years, some well over a decade, and my belief is that QA and user communities would have reported real problems or bugs long ago. So again, the real answer must lie elsewhere.

However, there’s yet another source of potential issues we could indirectly blame on the database benchmark specs: very few people actually read or fully comprehend them. The assumption is that TPS is such a simple concept that any computer-literate person should be able to run a database benchmarking tool. Wrong! Most of us know how to drive an automobile, but how many of us would get behind the wheel of a Formula One race car and expect good things to happen? In a dozen or so years of helping people do database benchmarking, I can honestly say that maybe 25% had read and understood the benchmark spec well enough to select the options or parameters in the benchmarking tool needed to obtain the desired results. That is probably the principal source of observed performance variance. But even the minority of well-prepared people still see variance. Why?

I have a belief that should make sense, though I cannot claim a mathematical proof or absolute certainty that my explanation is 100% correct. I will offer it nonetheless, because I think that, at a minimum, it will radically reduce the observed performance variance for many people.

Let’s use a very simple analogy based on an automobile’s speedometer (MPH) and tachometer (RPM). Remember that MPH is what the police measure to decide whether you get a speeding ticket. So MPH is like the average response time: what the database user actually observes. TPS is more like RPM: how hard the database engine is working. This distinction is simple yet critical.
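
To put rough numbers on that distinction, it may help to lean on Little’s Law, which ties the two measures together: average concurrency equals throughput (TPS) multiplied by average response time. Here is a minimal sketch in Python, using made-up numbers purely for illustration (none of these figures come from a real benchmark), showing how the very same TPS can coexist with very different user-observed response times:

# Minimal sketch (illustrative numbers only) of Little's Law:
#   concurrency = throughput (TPS) x average response time
# The same TPS can correspond to very different response times,
# just as the same RPM can correspond to very different speeds.

def avg_response_time(concurrency: float, tps: float) -> float:
    # Average response time in seconds implied by Little's Law.
    return concurrency / tps

tps = 500.0  # hypothetical steady-state throughput
for concurrency in (10, 100, 1000):
    print(f"{concurrency:>5} in-flight operations at {tps:.0f} TPS "
          f"-> ~{avg_response_time(concurrency, tps):.2f} s average response time")

At 500 TPS, 10 in-flight operations imply roughly 0.02 seconds of response time, while 1,000 in-flight operations imply roughly 2 seconds. Same “RPM,” very different experience for the driver.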

If I don’t try to see the fastest speed at which the automobile can travel (i.e., I do not push it to its limit), then the RPMs may well be meaningless. If I’m going 40 MPH at 2,500 RPM, it really matters what gear the automobile’s transmission is in. First gear may top out at 40 MPH, while fifth gear may permit speeds of 120 MPH or more at the same RPM.

The same is true for database benchmarking. It’s my experience that benchmark tests which do not push the hardware to its limits tend to produce widely varying results at those lower levels. I don’t know whether that’s a flaw in the benchmark design or in the software running it. I just know that a database server needs to be tested at the upper end of its capacity for the results to be both meaningful and repeatable.

Hence my advice to the people I work with doing database benchmarks is to size appropriately. My old notebook, with a Core 2 Duo (P8700), 8 GB of RAM, and a first-generation commercial-grade SSD, can handle a few hundred concurrent database sessions. Remember that a typical transaction has “think time” between operations, so 100 users might really mean just 10 concurrent operations at any instant (a rough sketch of that arithmetic appears below). Therefore a current-generation, or one prior generation, Xeon CPU with 128 GB of RAM and an enterprise-grade I/O subsystem should be able to handle tens of thousands of concurrent sessions. Running tiny database benchmarks of under 1,000 users on such hardware falls into the zone where results may well be neither reliable nor repeatable. So test appropriately, even if it means spending more money on a properly sized tool. Testing workloads that are far too small for the hardware is really just an exercise in futility, a way to place a checkmark somewhere saying that benchmarks were performed.
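
To make that think-time arithmetic concrete, here is another minimal Python sketch. The service-time and think-time values are assumptions chosen only to illustrate the point, not figures taken from any benchmark spec:

# Back-of-the-envelope sketch: how "think time" shrinks effective concurrency.
# The timing values below are illustrative assumptions only.

def effective_concurrency(users: int, service_time_s: float, think_time_s: float) -> float:
    # Each simulated user spends service_time_s doing database work and
    # think_time_s idle per transaction cycle, so only a fraction of the
    # users are actually active at any given instant.
    busy_fraction = service_time_s / (service_time_s + think_time_s)
    return users * busy_fraction

# 100 users, 1 second of database work and 9 seconds of think time per cycle
print(effective_concurrency(100, 1.0, 9.0))     # -> 10.0 concurrent operations
print(effective_concurrency(10_000, 1.0, 9.0))  # -> 1000.0 concurrent operations

In other words, with nine seconds of think time for every one second of database work, 100 simulated users keep only about 10 operations in flight, and even 10,000 users keep only about 1,000, which is why undersized user counts leave big hardware loafing.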

About the Author

Bert Scalzo

Bert Scalzo is a guest-blogger for Quest and a renowned database expert, Oracle® ACE, author, database technology consultant, and formerly a member of Dell Software’s TOAD dev team. With three decades of Oracle® database experience to draw on, Bert delivers webcasts that garner high attendance and participation rates. His work history includes time at both Oracle Education and Oracle Consulting. Bert holds several Oracle Masters certifications and has an extensive academic background that includes a BS, MS and Ph.D. in computer science, as well as an MBA and insurance industry designations. Bert is a highly sought-after speaker who has presented at numerous Oracle conferences and user groups, including OOW, ODTUG, IOUG, OAUG, RMOUG and many others. Bert enjoys sharing his vast knowledge on data modeling, database benchmarking, database tuning and optimization, "star schema" data warehouses, Linux® and VMware®. As a prolific writer, Bert has produced educational articles, papers and blogs for such well-respected publications as the Oracle Technology Network (OTN), Oracle Magazine, Oracle Informant, PC Week (eWeek), Dell Power Solutions Magazine, The LINUX Journal, LINUX.com, Oracle FAQ, Ask Toad and Toad World.