Communications of the ACM

Research highlights

Technical Perspective: The 'Art' of Automatic Benchmark Extraction

Benchmarking database systems has a long and successful history in making industrial database systems comparable and is also a cornerstone of quantifiable experimental data systems research. Defining a benchmark involves identifying a dataset, a query and update workload, and performance metrics, as well as creating infrastructure to generate and load data, drive queries and updates into the system under test, and record performance metrics.

Creating good benchmarks has been described as an art. One can draw inspiration for dataset and workload design from "representative" use cases and queries, typically informed by knowledge from domain experts; but one can also exploit technical insights from database architects into which features, operations, and data distributions should come together in order to create a particularly challenging task.[a]

While this methodology has served the database community well by creating understandable reference points such as TPC-C and TPC-DS, and has even shaped the narrative on what database workloads are (that is, transactional vs. analytical), such synthetic benchmarks typically fail to represent actual workloads, which are more complex and thus hard to understand, mixed in nature, and constantly changing.

Automatic benchmark extraction from real-life workloads therefore provides a powerful pathway toward quantifying database system performance that matters most for an organization, though this requires techniques for summarizing query logs into more compact workloads, and for extracting real data and anonymizing it to create benchmark datasets.

Also, benchmarking in the cloud must deal with complex set-ups in dynamically provisioned environments where distributed compute, storage and network resources must be orchestrated throughout the benchmarking workflow of dataset creation, loading, workload execution, performance measurement and correctness checking.

This paper on DIAMetrics describes a versatile framework from Google for automatic extraction of benchmarks and their distributed execution and performance monitoring. The structure of the system is clever, particularly having the components run separately and allowing multiple entry-points into the system to cater to different use cases.

The observation that input data is not necessarily under the query engine's control, and that storage formats may vary, is important. Classic benchmarks assume the engine can preprocess data at will and has access to detailed statistics about the dataset, which is often not possible in practice.

The state-of-the-art workload extraction approach in DIAMetrics considers structural query features and actual execution metrics. The approach starts with feature extraction from query logs into a standardized format. Two kinds of features are considered: syntactic features like the number of aggregations, or number of joins; and profiling features like runtime and various resource consumption metrics. The algorithm then tries to generate a compressed workload that conserves representativity of the feature distribution and simultaneously maximizes the coverage of features.
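To make the idea of feature-based workload compression concrete, here is a minimal sketch in Python. It is not the DIAMetrics algorithm, whose details are in the accompanying paper; it is a simple greedy heuristic written under stated assumptions. The function names (`features`, `compress`), the keyword-counting feature extractor, the one-second runtime threshold, and the example log are all hypothetical.

```python
import re
from collections import Counter

def features(sql, runtime_s):
    """Hypothetical extractor: crude keyword counts, not a real SQL parser."""
    s = sql.upper()
    joins = len(re.findall(r"\bJOIN\b", s))
    aggs = len(re.findall(r"\b(?:SUM|COUNT|AVG|MIN|MAX)\s*\(", s))
    # Profiling feature: runtime bucketed at an assumed 1-second threshold.
    bucket = "slow" if runtime_s >= 1.0 else "fast"
    return frozenset({f"joins={joins}", f"aggs={aggs}", f"runtime={bucket}"})

def compress(log, k):
    """Greedily pick up to k queries; returns indices into the log.

    Each step prefers the query that covers the most not-yet-covered
    features (coverage), breaking ties in favor of features that are
    frequent in the full log but under-represented in the selection so
    far (a rough stand-in for representativity).
    """
    feats = [features(sql, rt) for sql, rt in log]
    target = Counter(f for fs in feats for f in fs)
    total = len(log)
    chosen, covered, picked = [], set(), Counter()
    remaining = list(range(len(log)))
    while remaining and len(chosen) < k:
        n = max(len(chosen), 1)

        def score(i):
            new = len(feats[i] - covered)
            deficit = sum(target[f] / total - picked[f] / n for f in feats[i])
            return (new, deficit)

        best = max(remaining, key=score)
        remaining.remove(best)
        chosen.append(best)
        covered |= feats[best]
        picked.update(feats[best])
    return chosen

# Hypothetical query log: (SQL text, measured runtime in seconds).
example_log = [
    ("SELECT a FROM t", 0.1),
    ("SELECT a FROM t", 0.2),
    ("SELECT SUM(a) FROM t JOIN u ON t.id = u.id", 5.0),
    ("SELECT COUNT(*), MAX(a) FROM t", 0.3),
]
selected = compress(example_log, 2)
```

On this toy log, the duplicate query is never chosen while any query with unseen features remains, which is the intuition behind trading coverage against representativity. A production approach would need a real query parser and richer profiling metrics.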

The data scrambler is a very desirable component, but the paper is light on details here, raising some questions. Some information leakage is bound to occur, limiting safe use cases. The technique of removing correlations between columns will also remove important query optimization challenges. Therefore, the question arises as to what additional data scrambling mechanisms could provide good privacy properties without removing the rich correlations that characterize real data. This crucial aspect of automatic benchmark extraction seems a fruitful area of future research, say, into formal privacy-preservation bounds or privacy-respecting preservation of certain correlations.
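The trade-off described above can be illustrated with a small Python sketch. It assumes one simple way of removing cross-column correlations, namely shuffling each column independently; this is an illustrative assumption, not a description of the DIAMetrics scrambler. The names `shuffle_columns` and `pearson` are hypothetical helpers.

```python
import math
import random

def shuffle_columns(rows, seed=0):
    """Permute every column independently (assumed scrambling scheme).

    Per-column value distributions are preserved exactly, but the pairing
    of values across columns is destroyed. Note that the original values
    still appear in the output, so this offers no formal privacy guarantee.
    """
    rng = random.Random(seed)
    cols = [list(c) for c in zip(*rows)]
    for col in cols:
        rng.shuffle(col)
    return [tuple(r) for r in zip(*cols)]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Two perfectly correlated columns, e.g. an order id and a derived amount.
rows = [(i, 2 * i) for i in range(1000)]
scrambled = shuffle_columns(rows)

before = pearson([r[0] for r in rows], [r[1] for r in rows])
after = pearson([r[0] for r in scrambled], [r[1] for r in scrambled])
```

Here `before` is essentially 1 while `after` lands near 0: a cardinality estimator that would be challenged by the correlated original sees an easy, independent dataset after scrambling. That is exactly the lost optimization challenge, and the surviving raw values show why leakage remains a concern.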

At Google, the DIAMetrics framework has proven useful to database system developers for performance monitoring and regression testing as engines evolve. Database system users also use DIAMetrics to find which of the multiple Google engines (for example, F1, Procella, Dremel) best suits their use case, and to provide performance accountability: to identify and communicate performance problems.

This paper inspired me to think about novel directions, possibly taking automatic benchmark extraction beyond performance monitoring and accountability toward correctness testing: one could envision enriching workload summarization with new dimensions such as code coverage and automatic generation of query correctness oracles.

A drawback is that, while much of the paper's contribution lies in the clever system architecture, the actual system is not open source or even part of a publicly available service. In systems design, details matter, and readers outside Google cannot look these up in the system code or documentation.

This paper is highly recommended for those interested in state-of-the-art database benchmarking, including those who are implementing their own framework for cloud-based automatic benchmark extraction, execution, and monitoring, hopefully in open source.

Peter Boncz is a senior researcher in the Database Architectures research group at CWI, Amsterdam, The Netherlands.

a. This was coined "choke point"-guided benchmark design by the Linked Data Benchmark Council (

To view the accompanying paper, visit

Copyright held by author.
Request permission to (re)publish from the owner/author

The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.
