Survivorship Bias in Industrial Database Workloads

Authors:
Ryan Marcus, Jeffrey Tao, Peizhi Wu, Zijie Zhao
Abstract

Several industrial data platforms have recently released characterizations of their real-world workloads, prompting database researchers to consider new research directions and benchmarks that go beyond synthetic workloads. While these workload characterizations are undoubtedly useful, researchers (including the authors of this paper) and practitioners often assume that existing workloads are representative of user needs. Through two case studies over real workload data, we show that industrial workloads instead represent a "negotiation" between users and the data platform: users shape their workloads to take advantage of the parts of the platform that work well while avoiding the parts that are not optimized. This shaping effect is an example of survivorship bias, since workload characterizations are built over the queries that "survive" (or thrive) on a particular data platform. Based on these case studies, we make several suggestions for how both researchers and practitioners can better contextualize workload traces and characterizations.