Towards Migration-Free Just-In-Case Data Archival for Future Cloud Data Lakes using Synthetic DNA [vision] Eugenio Marinelli (Eurecom)*; Yiqing Yan (Eurecom); Virginie Magnone (IPMC); Charlotte Dumargne (IPMC); Pascal Barbry (IPMC); Thomas Heinis (Imperial College); Raja Appuswamy (Eurecom)
Given the growing adoption of data-driven decision making, cloud data lakes are increasingly facing the need to support cost-effective "just-in-case" archival over long time periods to meet legal and regulatory compliance requirements. Current media technologies suffer from fundamental issues that will soon, if not already, make cost-effective data archival infeasible. In this paper, we present a vision for redesigning the archival tier of cloud data lakes based on a novel, obsolescence-free storage medium: synthetic DNA. In doing so, we make two contributions: (i) we highlight the challenges in using DNA for data archival and list several open research problems, and (ii) we outline OligoArchive-DSM (OA-DSM), an end-to-end DNA storage pipeline that we are developing to demonstrate the feasibility of our vision.
The Composable Data Management System Manifesto [vision] Pedro Pedreira (Meta Platforms)*; Orri Erling (Meta Platforms); Konstantinos Karanasos (Meta); Scott Schneider (Meta Platforms); Wes McKinney (Voltron Data); Satya Valluri (Databricks); Mohamed Zait (Databricks); Jacques Nadeau (Sundeck)
The requirement for specialization in data management systems has evolved faster than our software development practices. After decades of organic growth, this situation has created a siloed landscape composed of hundreds of products developed and maintained as monoliths, with limited reuse between systems. This fragmentation has resulted in developers often reinventing the wheel, increased maintenance costs, and slowed down innovation. It has also affected the end users, who are often required to learn the idiosyncrasies of dozens of incompatible SQL and non-SQL API dialects, and settle for systems with incomplete functionality and inconsistent semantics. In this vision paper, considering the recent popularity of open source projects aimed at standardizing different aspects of the data stack, we advocate for a paradigm shift in how data management systems are designed. We believe that by decomposing these into a modular stack of reusable components, development can be streamlined while creating a more consistent experience for users. Towards that goal, we describe the state-of-the-art, principal open source technologies, and highlight open questions and areas where additional research is needed. We hope this work will foster collaboration, motivate further research, and promote a more composable future for data management.
Check Out the Big Brain on BRAD: Simplifying Cloud Data Processing with Learned Automated Data Meshes [vision] Tim Kraska (Massachusetts Institute of Technology); Tianyu Li (MIT CSAIL); Samuel Madden (Massachusetts Institute of Technology); Markos Markakis (Massachusetts Institute of Technology); Amadou L Ngom (Massachusetts Institute of Technology); Ziniu Wu (Massachusetts Institute of Technology); Geoffrey X. Yu (Massachusetts Institute of Technology)*
The last decade of database research has led to the prevalence of specialized systems for different workloads. Consequently, organizations often rely on a combination of specialized systems, organized in a Data Mesh. Data meshes present significant challenges for system administrators, including picking the right system for each workload, moving data between systems, maintaining consistency, and correctly configuring each system. Many non-expert end users (e.g., data analysts or app developers) either cannot solve their business problems, or suffer from sub-optimal performance or cost due to this complexity. We envision BRAD, a cloud system that automatically integrates and manages data and systems into an instance-optimized data mesh, allowing users to efficiently store and query data under a unified data model (i.e., relational tables) without knowledge of underlying system details. With machine learning, BRAD automatically deduces the strengths and weaknesses of each engine through a combination of offline training and online probing. Then, BRAD uses these insights to route queries to the most suitable (combination of) system(s) for efficient execution. Furthermore, BRAD automates configuration tuning, resource scaling, and data migration across component systems, and makes recommendations for more impactful decisions, such as adding or removing systems. As such, BRAD exemplifies a new class of systems that utilize machine learning and the cloud to make complex data processing more accessible to end users, raising numerous new problems in database systems, machine learning, and the cloud.
The Case for Distributed Shared-Memory Databases with RDMA-Enabled Memory Disaggregation [vision] Ruihong Wang (Purdue University); Jianguo Wang (Purdue University)*; Stratos Idreos (Harvard); M. Tamer Özsu (University of Waterloo); Walid G Aref (Purdue)
Memory disaggregation (MD) allows for scalable and elastic data center design by separating compute (CPU) from memory. With MD, compute and memory are no longer coupled into the same server box. Instead, they are connected to each other via ultra-fast networking such as RDMA. MD can bring many advantages, e.g., higher memory utilization, better independent scaling (of compute and memory), and lower cost of ownership. This paper makes the case that MD can fuel the next wave of innovation on database systems. We observe that MD revives the great debate of "shared what" in the database community. We envision that distributed shared-memory databases (DSM-DB for short), which have not received much attention before, can be promising in the future with MD. We present a list of challenges and opportunities that can inspire next steps in system design, making the case for DSM-DB.
Keep CALM and CRDT On [vision] Shadaj Laddad (UC Berkeley); Conor Power (UC Berkeley)*; Mae Milano (UC Berkeley); Alvin Cheung (UC Berkeley); Natacha Crooks (UC Berkeley); Joseph M Hellerstein (UC Berkeley)
Despite decades of research and practical experience, developers have few tools for programming reliable distributed applications without resorting to expensive coordination techniques. Conflict-free replicated datatypes (CRDTs) are a promising line of work that enable coordination-free replication and offer certain eventual consistency guarantees in a relatively simple object-oriented API. Yet CRDT guarantees extend only to data updates; observations of CRDT state are unconstrained and unsafe. We propose an agenda that embraces the simplicity of CRDTs, but provides richer, more uniform guarantees. We extend CRDTs with a query model that reasons about which queries are safe without coordination by applying monotonicity results from the CALM Theorem, and lay out a larger agenda for developing CRDT data stores that let developers safely and efficiently interact with replicated application state.
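To make the distinction between safe and unsafe observations concrete, here is a minimal sketch of a grow-only counter CRDT with one monotone query. It is an illustration only, not the query model from the paper; the class name `GCounter` and the method `at_least` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter CRDT: one non-decreasing slot per replica (illustrative)."""
    counts: dict = field(default_factory=dict)

    def increment(self, replica_id: str, amount: int = 1) -> None:
        # Each replica only ever increases its own slot.
        self.counts[replica_id] = self.counts.get(replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        # The join is a per-replica maximum: commutative, associative, idempotent.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())

    def at_least(self, threshold: int) -> bool:
        # Monotone query: once True on a replica, no later merge can make it False.
        return self.value() >= threshold


a, b = GCounter(), GCounter()
a.increment("a", 3)
b.increment("b", 2)
early_answer = a.at_least(3)  # answered before replica a has seen b's update
a.merge(b)
b.merge(a)
```

A threshold test such as `at_least` can be answered locally because later merges only grow the state. An exact read such as `value() == 3` is not monotone: it holds on replica `a` before the merge and fails afterwards, which is the kind of observation that would need coordination to be safe.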
A3
(Industrial) Performance & Resource Optimization for Cloud
Chair: Hanuma Kodavalla (Microsoft)
Taurus MM: bringing multi-master to the cloud [industry] Alex Depoutovitch (Huawei)*; Paul Larson (Huawei); Jack Ng (Huawei); Shu Lin (Huawei); Chong Chen (Huawei); Guanzhu Xiong (Huawei); Emad Boctor (Huawei); Paul Lee (Huawei); Samiao Ren (Huawei); Lengdong Wu (Huawei); Yuchen Zhang (Huawei); Calvin Sun (Huawei)
A single-master database has limited update capacity because a single node handles all updates. A multi-master database potentially has higher update capacity because the load is spread across multiple nodes. However, the need to coordinate updates and ensure durability can generate high network traffic. Reducing network load is particularly important in a cloud environment where the network infrastructure is shared among thousands of tenants. In this paper, we present Taurus MM, a shared-storage multi-master database optimized for cloud environments. It implements two novel algorithms aimed at reducing network traffic plus a number of additional optimizations. The first algorithm is a new type of distributed clock that combines the small size of Lamport clocks with the effective support of distributed snapshots of vector clocks. The second algorithm is a new hybrid page and row locking protocol that significantly reduces the number of lock requests sent over the network. Experimental results on a cluster with up to eight masters demonstrate superior performance compared to Aurora multi-master and CockroachDB.
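For background on the trade-off the first algorithm targets, the sketch below contrasts the two classic clocks named in the abstract. It shows textbook Lamport and vector clocks only, not the Taurus MM clock, whose design the abstract does not detail; all class and function names are illustrative.

```python
class LamportClock:
    """Single integer per node: compact, but cannot detect concurrency."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        self.time += 1
        return self.time

    def receive(self, remote_time: int) -> int:
        # Jump past the sender's timestamp, then count the receive event.
        self.time = max(self.time, remote_time) + 1
        return self.time


class VectorClock:
    """One integer per node: larger, but captures the happened-before relation."""

    def __init__(self, node: int, num_nodes: int) -> None:
        self.node = node
        self.v = [0] * num_nodes

    def tick(self) -> list:
        self.v[self.node] += 1
        return list(self.v)

    def receive(self, remote: list) -> list:
        # Element-wise maximum, then count the receive event locally.
        self.v = [max(x, y) for x, y in zip(self.v, remote)]
        self.v[self.node] += 1
        return list(self.v)


def happened_before(u: list, v: list) -> bool:
    """True if the event stamped u causally precedes the event stamped v."""
    return all(x <= y for x, y in zip(u, v)) and u != v


# Two independent updates on two masters.
l0, l1 = LamportClock(), LamportClock()
lamport_stamps = (l0.tick(), l1.tick())

v0, v1 = VectorClock(0, 2), VectorClock(1, 2)
e0, e1 = v0.tick(), v1.tick()
concurrent = not happened_before(e0, e1) and not happened_before(e1, e0)
```

The Lamport stamps are a single integer each regardless of cluster size, while the vector stamps grow with the number of masters but let a reader tell that the two updates are concurrent.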
The Story of AWS Glue [industry] Mohit Saxena (Amazon Web Services)*; Benjamin Sowell (Aryn); Daiyan Alamgir (Amazon Web Services); Nitin Bahadur (Amazon Web Services); Bijay Bisht (Amazon Web Services); Santosh Chandrachood (Amazon Web Services); Chitti Keswani (Amazon Web Services); G2 Krishnamoorthy (Amazon Web Services); Austin Lee (Amazon Web Services); Bohou Li (Amazon Web Services); Zach Mitchell (Amazon Web Services); Vaibhav Porwal (Amazon Web Services); Maheedhar Reddy Chappidi (Amazon Web Services); Brian Ross (Amazon Web Services); Noritaka Sekiyama (Amazon Web Services); Omer Zaki (Amazon Web Services); Linchi Zhang (Amazon Web Services); Mehul Shah (Amazon)
AWS Glue is Amazon’s serverless data integration cloud service that makes it simple and cost effective to extract, clean, enrich, load, and organize data. Originally launched in August 2017, AWS Glue began as an extract-transform-load (ETL) service designed to relieve developers and data engineers of the undifferentiated heavy lifting needed to load databases, data warehouses, and build data lakes on Amazon S3. Since then, it has evolved to serve a larger audience including ETL specialists and data scientists, and includes a broader suite of data integration capabilities. Today, hundreds of thousands of customers use AWS Glue every month.
In this paper, we describe the use cases and challenges cloud customers face in preparing data for analytics and the tenets we chose to drive Glue’s design. We chose early on to focus on ease-of-use, scale, and extensibility. At its core, Glue offers serverless Apache Spark and Python engines backed by a purpose-built resource manager for fast startup and auto-scaling. In Spark, it offers a new data structure, DynamicFrames, for manipulating messy schema-free semi-structured data such as event logs, a variety of transformations and tooling to simplify data preparation, and a new shuffle plugin to offload to cloud storage. It also includes a Hive-metastore-compatible Data Catalog with Glue crawlers to build and manage metadata, e.g., for data lakes on Amazon S3. Finally, Glue Studio is its visual interface for authoring Spark and Python-based ETL jobs. We describe the innovations that differentiate AWS Glue and drive its popularity and how it has evolved over the years.
Anser: Adaptive Information Sharing Framework of AnalyticDB [industry] Liang Lin (Alibaba); Yuhan Li (Alibaba Cloud Computing Co. Ltd.); Bin Wu (Alibaba Group)*; Huijun Mai (Alibaba); Renjie Lou (Alibaba); Jian Tan (Alibaba); Feifei Li (Alibaba Group)
The surge in data analytics has fostered burgeoning demand for AnalyticDB on Alibaba Cloud, which serves thousands of customers from various business sectors. Its most notable feature is the diversity of the workloads it handles, including batch processing, real-time data analytics, and unstructured data analytics. To improve the overall performance for such diverse workloads, one of the major challenges is to optimize long-running complex queries without sacrificing the processing efficiency of short-running interactive queries. While existing methods attempt to utilize runtime dynamic statistics for adaptive query processing, they often focus on specific scenarios instead of providing a holistic solution.
To address this challenge, we propose a new framework called Anser, which enhances the design of traditional distributed data warehouses by embedding a new information sharing mechanism. This allows for the efficient management of the production and consumption of various dynamic information across the system. Building on top of Anser, we introduce a novel scheduling policy that optimizes both data and information exchanges within the physical plan, enabling the acceleration of complex analytical queries without sacrificing the performance of short-running interactive queries. We conduct comprehensive experiments over public and in-house workloads to demonstrate the effectiveness and efficiency of our proposed information sharing framework.
Eigen: End-to-end Resource Optimization for Large-Scale Databases on the Cloud [industry] Ji You Li (Alibaba Group); Jiachi Zhang (Georgetown University); Wenchao Zhou (Alibaba Group)*; Yuhang Liu (Alibaba); Shuai Zhang (Alibaba); Xue Zhuoming (Alibaba); Ding Xu (Alibaba); Hua Fan (Alibaba Group); Fangyuan Zhou (Alibaba Group); Feifei Li (Alibaba Group)
Increasingly, cloud database vendors host large-scale geographically distributed clusters to provide cloud database services. When managing the clusters, we observe that it is challenging to simultaneously maximize the resource allocation ratio and resource availability. This problem becomes more severe in modern cloud database clusters, where resource allocations occur more frequently and on a greater scale. To improve the resource allocation ratio without hurting resource availability, we introduce Eigen, a large-scale cloud-native cluster management system for large-scale databases on the cloud. Based on a resource flow model, we propose a hierarchical resource management system and three resource optimization algorithms which enable end-to-end resource optimization. Furthermore, we demonstrate system optimizations that improve user experience by reducing scheduling latencies and improving scheduling throughput. Eigen has been running in a large-scale public-cloud production environment for 30+ months and serves 30+ regions (100+ availability zones) globally. Based on the evaluation of real-world clusters and simulated experiments, Eigen can improve the allocation ratio by over 27% (from 60% to 87.0%) on average, while the ratio of delayed resource provisions is under 0.1%.
Automatic SQL Error Mitigation in Oracle [industry] Krishna Kantikiran Pasupuleti (Oracle)*; Jiakun Li (Oracle America); Hong Su (Oracle); Mohamed Ziauddin (Oracle)
Despite best coding practices, software bugs are inevitable in a large codebase. In traditional databases, when errors occur during query processing, they disrupt user workflow until workarounds are found and applied. Manual identification of workarounds often relies on a trial-and-error method. The process is not only time-consuming but also requires domain expertise that users often lack. In this paper, we propose a framework to automatically mitigate errors that occur during query compilation (including optimization and code generation) without any user intervention. An error is intercepted by the database internally, a workaround is identified for it, and the query is recompiled using the workaround. The entire process remains transparent to the user, with the query being executed seamlessly. The proposed technique handles SQL errors during query compilation and provides three types of mitigation strategies: (i) quickly fail over to one of the readily available historical plans for the statement, (ii) apply targeted error-correcting directives (hints) identified from the optimizer context at the time of the error, and (iii) modify the global configuration of the optimizer using hints.
This feature has been implemented and will be released in an upcoming version of Oracle Autonomous Database.
A4
(Industrial) ML + Systems
Chair: Avrilia Floratou (Microsoft Gray Systems Lab)
AutoSteer: Learned Query Optimization for Any SQL Database [industry] Christoph Anneser (Technical University of Munich)*; Nesime Tatbul (Intel Labs and MIT); David E Cohen (Intel); Zhenggang Xu (Meta Platforms); Prithviraj P Pandian (Meta); Nikolay Laptev (Facebook); Ryan C Marcus (Massachusetts Institute of Technology)
This paper presents AutoSteer, a learning-based solution that automatically drives query optimization in any SQL database that exposes tunable optimizer knobs. AutoSteer builds on the Bandit optimizer (Bao) and extends it with new capabilities (e.g., automated hint-set discovery) to minimize integration effort and facilitate usability in both monolithic and disaggregated SQL systems. We successfully applied AutoSteer on PostgreSQL, PrestoDB, SparkSQL, MySQL, and DuckDB - five popular open-source database engines with diverse query optimizers. We then conducted a detailed experimental evaluation with public benchmarks (JOB, Stackoverflow, TPC-DS) and a production workload from Meta's PrestoDB deployments. Our evaluation shows that AutoSteer can not only outperform these engines' native query optimizers (e.g., up to 40% improvements for PrestoDB) but can also match the performance of Bao-for-PostgreSQL with reduced human supervision and increased adaptivity, as it replaces Bao's static, expert-picked hint-sets with those that are automatically discovered. We also provide an open-source implementation of AutoSteer together with a visual tool for interactive use by query optimization experts.
EmbedX: A Versatile, Efficient and Scalable Platform to Embed Both Graphs and High-Dimensional Sparse Data [industry] Yuanhang Zou (Tencent); Zhihao Ding (The Hong Kong Polytechnic University); Jieming Shi (The Hong Kong Polytechnic University)*; Shuting Guo (Tencent); Chunchen Su (Tencent); Yafei Zhang (Tencent)
In modern online services, it is of growing importance to process web-scale graph data and high-dimensional sparse data together into embeddings for downstream tasks, such as recommendation, advertisement, prediction, and classification. There exist learning methods and systems for either high-dimensional sparse data or graphs, but not both.
There is an urgent need in industry for a system that efficiently processes both types of data for higher business value, which, however, is challenging. The data at Tencent contains billions of samples with sparse features in very high dimensions, and its graphs also have billions of nodes and edges. Moreover, learning models often perform expensive operations with high computational costs. It is difficult to store, manage, and retrieve massive sparse data and graph data together, since they exhibit different characteristics.
We present EmbedX, an industrial distributed learning framework from Tencent, which is versatile and efficient in supporting embedding on both graphs and high-dimensional sparse data. EmbedX consists of distributed server layers for graph and sparse data management, and optimized parameter and graph operators, to efficiently support four categories of methods: deep learning models on high-dimensional sparse data, network embedding methods, graph neural networks, and in-house developed joint learning models on both types of data. Extensive experiments on massive Tencent data and public data demonstrate the superiority of EmbedX. For instance, on a Tencent dataset with 1.3 billion nodes, 35 billion edges, and 2.8 billion samples with sparse features in 1.6 billion dimensions, EmbedX performs an order of magnitude faster for training and our joint models achieve superior effectiveness. EmbedX is deployed at Tencent. A/B tests on real use cases further validate the power of EmbedX. EmbedX is implemented in C++ and open-sourced at https://github.com/Tencent/embedx.
MINT: Detecting Fraudulent Behaviors from Time-series Relational Data [industry] Fei Xiao (Shopee Singapore)*; Yuncheng Wu (National University of Singapore); Meihui Zhang (Beijing Institute of Technology); Gang Chen (Zhejiang University); Beng Chin Ooi (NUS)
E-commerce platforms such as Shopee have accumulated a huge volume of time-series relational data, which contains useful information for differentiating fraudulent users from benign users. Existing fraud behavior detection approaches typically model the time-series data with a vanilla Recurrent Neural Network (RNN) or combine the whole sequence as a single intention without considering the temporal behavioral patterns, row-level interactions, and different view intentions. In this paper, we present MINT, a Multiview row-INteractive Time-aware framework to detect fraudulent behaviors from time-series structured data. The key idea of MINT is to build a time-aware behavior graph for each user’s time-series relational data with each row represented as an action node. We utilize the user’s temporal information to construct three different graph convolutional matrices for hierarchically learning the user’s intentions from different views, that is, short-term, medium-term, and long-term intentions. To capture more meaningful row-level interactions and alleviate the over-smoothing issue in a vanilla time-aware behavior graph, we propose a novel gated neighbor interaction mechanism to calibrate the information aggregated by each action node. Since the receptive fields of the three graph convolutional layers are designed to grow nearly exponentially, MINT requires far fewer layers than traditional deep graph neural networks (GNNs) to capture multi-hop neighboring information, and avoids recurrent feedforward propagation, thus leading to higher training efficiency and scalability. Our extensive experiments on large-scale e-commerce datasets from Shopee with up to 4.6 billion records and a public dataset from Amazon show that MINT achieves superior performance over 10 state-of-the-art models and provides better interpretability and scalability.
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel [industry] Yanli Zhao (Meta)*; Andrew Gu (Meta); Rohan Varma (Meta); Liang Luo (Meta); Chien-Chin Huang (Meta Platforms); Min Xu (Meta Platforms); Less Wright (Meta Platforms); Hamid Shojanazeri (Meta Platforms); Myle Ott (Facebook); Sam Shleifer (Stanford University); Alban Desmaison (Meta); Can Balioglu (Meta Platforms); Pritam H Damania (Meta Platforms); Bernard Nguyen (Meta Platforms); Geeta Chauhan (Meta Platforms); Yuchen Hao (Meta Platforms); Ajit Mathews (Meta); Shen Li (Meta)
It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.
A6
(Industrial) Novel Systems for Real-World Uses
Chair: Jesus Camacho-Rodriguez (Microsoft Gray Systems Lab)
Progressive Partitioning for Parallelized Query Execution in Google's Napa [industry] Jun Tatemura (Google); Tao Zou (Google); Jagan Sankaranarayanan (Google)*; Yanlai Huang (Google); Jim Chen (Google); Yupu Zhang (Google); Kevin Lai (Google); Hao Zhang (Google); Gokul Nath Babu Manoharan (Google); Goetz Graefe (Google); Divyakant Agrawal (Google); Brad Adelberg (Google); Shilpa Kolhar (Google); Indrajit Roy (Google)
Napa holds Google's critical data warehouses in log-structured merge trees for real-time data ingestion and sub-second response for billions of queries per day. These queries are often multi-key look-ups in highly skewed tables and indexes.
In our production experience, only progressive query-specific partitioning can achieve Napa's strict query latency SLOs. Here we advocate good-enough partitioning that keeps the per-query partitioning time low without risking uneven work distribution. Our design combines pragmatic system choices and algorithmic innovations. For instance, B-trees are augmented with statistics of key distributions, thus serving the dual purpose of aiding lookups and partitioning. Furthermore, progressive partitioning is designed to be "good enough", thereby balancing partitioning time with performance. The resulting system is robust and successfully serves billions of queries day in, day out with very high quality of service, forming a core infrastructure at Google.
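The idea of reusing index statistics for partitioning can be sketched as follows. This is a simplified illustration under stated assumptions, not Napa's algorithm: it assumes each upper-level B-tree entry exposes the largest key of its subtree and an approximate row count, and the function name `partition_boundaries` is hypothetical.

```python
from typing import List, Tuple


def partition_boundaries(
    subtrees: List[Tuple[str, int]], num_partitions: int
) -> List[str]:
    """Pick split keys so partitions cover roughly equal row counts.

    subtrees: (largest key in subtree, approximate row count), in key order.
    Only the per-subtree statistics are read; no leaf data is scanned.
    """
    total = sum(count for _, count in subtrees)
    target = total / num_partitions
    boundaries: List[str] = []
    running = 0
    for upper_key, count in subtrees:
        running += count
        # Close a partition once the cumulative count reaches the next target,
        # keeping at most num_partitions - 1 split keys.
        if (
            len(boundaries) < num_partitions - 1
            and running >= target * (len(boundaries) + 1)
        ):
            boundaries.append(upper_key)
    return boundaries


# Hypothetical statistics from one B-tree level: 100 rows in five subtrees.
stats = [("c", 20), ("f", 30), ("k", 25), ("p", 15), ("z", 10)]
two_way = partition_boundaries(stats, 2)
four_way = partition_boundaries(stats, 4)
```

With two partitions the split lands exactly on the median. With four, the partitions hold 50, 25, 15, and 10 rows because the statistics are too coarse at this level; a progressive scheme could descend one more level into the heaviest subtree only when that imbalance matters, trading a little more partitioning time for a more even split.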
OceanBase Paetica: A Hybrid Shared-nothing/Shared-everything Database for Supporting Single Machine and Distributed Cluster [industry] Zhifeng Yang (OceanBase); Quanqing Xu (OceanBase)*; Shanyan Gao (OceanBase, Ant Group); Chuanhui Yang (OceanBase); Guoping Wang (OceanBase, Ant Group); Yuzhong Zhao (OceanBase); Fanyu Kong (OceanBase); Hao Liu (OceanBase); Wanhong Wang (OceanBase, Ant Group); Jinliang Xiao (OceanBase, Ant Group)
In the ongoing evolution of the OceanBase database system, it is essential to enhance its adaptability to small-scale enterprises. The system has demonstrated its stability and effectiveness not only within the Ant Group and other commercial organizations but also through the TPC-C and TPC-H tests. To address the overhead caused by the distributed component in the stand-alone mode, we designed a stand-alone and distributed integrated architecture named Paetica. Paetica enables adaptive configuration of the database, allowing OceanBase to support both serial and parallel executions in stand-alone and distributed scenarios, thus providing efficiency and economy. This design has been implemented in version 4.0 of the OceanBase system, and the results of the experiments show that Paetica exhibits notable scalability and outperforms other stand-alone or distributed databases. Furthermore, it enables the transition of OceanBase from primarily serving large enterprises to truly catering to small and medium enterprises, by employing a single OceanBase database for the successive stages of enterprise or business development, without the requirement for migration. Our experiments confirm that Paetica achieves linear scalability as the number of CPU cores increases in the stand-alone mode. It also outperforms MySQL and Greenplum in the Sysbench and TPC-H evaluations.
PolarDB-SCC: A Cloud-Native Database Ensuring Low Latency for Strongly Consistent Reads [industry] Xinjun Yang (Alibaba Group); Yingqiang Zhang (Alibaba Group); Hao Chen (Alibaba Group)*; Chuan Sun (Alibaba Group); Feifei Li (Alibaba Group); Wenchao Zhou (Alibaba Group)
A classic design of cloud-native databases adopts an architecture that consists of one read/write (RW) node and one or more read-only (RO) nodes. In such a design, the propagation of logs from the RW node to the RO node(s) is typically performed asynchronously. Consequently, system designers either have to accept a loose consistency guarantee, where a read from the RO node may return stale data, or tolerate significant performance degradation in terms of read latency, as the read then needs to wait for the log to be propagated and applied. Most commercial cloud-native databases choose performance over strong consistency. This makes RO nodes useless for the many applications that require strong consistency.
This paper proposes PolarDB-SCC (PolarDB-Strongly Consistent Cluster), a cloud-native database architecture that guarantees strongly consistent reads with low latency. The core idea is to eliminate unnecessary waits and reduce the necessary wait time on RO nodes while still supporting strong consistency. To achieve this, it tracks the RW node’s modification timestamp at three progressively finer-grained levels. We further design a Linear Lamport timestamp to reduce the RO node’s timestamp fetching operations and leverage the RDMA network for all data transfers (e.g., timestamp fetching and log shipment) to minimize network overhead and extra CPU usage. Our evaluation shows that PolarDB-SCC does not incur any noticeable overhead for ensuring strongly consistent reads compared with the eventually consistent (stale) read policy. PolarDB-SCC is already commercially available at Alibaba Cloud.
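The benefit of tracking modification timestamps at several granularities can be illustrated with a small sketch. The abstract does not name the three levels, so the global, table, and page levels used here are an assumption, as are the function name `must_wait` and the dictionary layout; this is not the PolarDB-SCC implementation.

```python
from typing import Dict


def must_wait(
    ro_applied_ts: int, rw_modified_ts: Dict, table: str, page: int
) -> bool:
    """Decide whether a read on an RO node has to wait for log replay.

    ro_applied_ts: timestamp up to which the RO node has applied the log.
    rw_modified_ts: latest modification timestamps on the RW node at
        three assumed levels: "global", per table, and per (table, page).
    """
    levels = [
        rw_modified_ts["global"],
        rw_modified_ts["tables"].get(table, 0),
        rw_modified_ts["pages"].get((table, page), 0),
    ]
    # Check coarse to fine: if the RO node has caught up at any level that
    # covers the requested data, the read is already fresh.
    for ts in levels:
        if ro_applied_ts >= ts:
            return False
    return True


# Hypothetical state: the RW node last wrote "orders" at 120 and "users" at 90.
rw_state = {
    "global": 120,
    "tables": {"orders": 120, "users": 90},
    "pages": {("orders", 1): 120, ("users", 7): 40},
}
read_users = must_wait(100, rw_state, "users", 7)
read_orders = must_wait(100, rw_state, "orders", 1)
```

An RO node that has applied the log up to timestamp 100 is behind globally, yet a read on `users` needs no wait because nothing in that table changed after 90. Only the read on the recently modified `orders` page has to wait.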
ScalarDB: Universal Transaction Manager for Polystores [industry] Hiroyuki Yamada (Scalar)*; Toshihiro Suzuki (Scalar); Yuji Ito (Scalar); Jun Nemoto (Scalar)
This paper presents ScalarDB, a universal transaction manager that achieves distributed transactions across multiple disparate databases. ScalarDB provides a database-agnostic transaction manager on top of its database abstraction; thus, it achieves transactions spanning various databases without depending on the transactional capability of underlying databases. ScalarDB is based on several research works and extended to provide a strong correctness guarantee (i.e., strict serializability), further performance optimizations, and several critical mechanisms for productization. In this paper, we describe the design and implementation of ScalarDB. We also present evaluation results showing that ScalarDB achieves database-spanning transactions with reasonable performance and near-linear scalability without sacrificing correctness. Finally, we share some case studies and lessons learned while building and running ScalarDB.
A7
(Industrial) Data Governance, Lineage, & Benchmarking
Chair: Abdul Quamar (Google)
CDSBen: Benchmarking the Performance of Storage Services in Cloud-native Database System at ByteDance [industry] Jiashu Zhang (Southern University of Science and Technology); Wen Jiang (Southern University of Science and Technology); Bo Tang (Southern University of Science and Technology)*; Haoxiang Ma (ByteDance); Cao Lixun (ByteDance); Zhongbin Jiang (ByteDance); Yuanyuan Nie (ByteDance); Fan Wang (ByteDance); Lei Zhang (ByteDance); Yuming Liang (ByteDance)
In this work, we focus on the performance benchmarking problem of storage services in cloud-native database systems, which are widely used in various cloud applications. The core idea of these systems is to separate the computation and storage that traditional monolithic OLTP databases couple together. Specifically, we first present the characteristics of two representative real I/O workloads at the storage tier of ByteDance’s cloud-native database VeDB. We then elaborate on the limitations of using standard benchmarks such as TPC-C and YCSB to reproduce these workloads. To overcome these limitations, we devise a learning-based I/O workload benchmark called CDSBen. We demonstrate the superiority of CDSBen by deploying it at ByteDance and showing that its generated I/O traces accurately resemble the real I/O traces in production. Additionally, we verify the accuracy and flexibility of CDSBen by generating a wide range of I/O workloads with different I/O characteristics.
Microsoft Purview: A System for Central Governance of Data [industry]Shafi Ahmad (Microsoft); Dillidorai Arumugam (Microsoft); Srdan Bozovic (Microsoft); Elnata Degefa (Microsoft); Sailesh K Duvvuri (C and AI); Steven Gott (Microsoft); Nitish Gupta (Microsoft); Joachim Hammer (Microsoft); Nivedita Kaluskar (Microsoft); Raghav Kaushik (Microsoft)*; Rakesh Khanduja (Microsoft); Prasad Mujumdar (Microsoft); Gaurav Malhotra (Microsoft); Pankaj Naik (Microsoft); Nikolas Ogg (Microsoft); Krishna Kumar Parthasarthy (Microsoft); Raghu Ramakrishnan (Microsoft); Vladimir Rodriguez (Microsoft); Rahul Sharma (Microsoft India R&D Pvt ltd); Jakub Szymaszek (Microsoft); Andreas Wolter (Microsoft) Show AbstractDownload Paper
Modern data estates span data located on premises, on the edge, and in the cloud, across a variety of operational and analytic sources. Data administrators who wish to enforce compliance across the entire organization have to inventory their data, identify what parts of it are sensitive, and govern the sensitive data appropriately across the entirety of their sprawling data estate. Today, governance is completely siloed; each of the data subsystems has its own governance features. Policies for sensitive data are applied piecemeal by iterating over all the data sources in a custom language specific to each source. This makes data governance cumbersome, error-prone, and expensive.
This paper presents Microsoft Purview, a service for unified governance of the entire data estate of an organization from a single central pane of glass. The Purview service consists of three parts: (1) a data map that is populated by automated scanning of data sources in the organization, (2) a system to store and manage sensitivity classification of data, and (3) a policy system that enables data security officers to author and implement policies that span the entire organization, e.g., a policy that requires that non-full-time employees be denied access to data classified as PII. Purview transforms data governance across a complex data estate by offering the ability to govern centrally and automating data discovery, classification and policy enforcement. By integrating with Office 365 Rights Management Service, Purview offers central governance over structured data stored in databases and storage, as well as document data stored in Office 365. This paper focuses on the design and implementation challenges in building the Purview service for Attribute-Based Access Control (ABAC) policies through a detailed description of its integration with Azure SQL Database.
TPCx-AI - An Industry Standard Benchmark for Artificial Intelligence and Machine Learning Systems [industry]
Christoph Brücke (bankmark); Philipp Härtling (bankmark); Rodrigo D Escobar Palacios (Intel); Hamesh Patel (Intel); Tilmann Rabl (HPI, University of Potsdam)*
Artificial intelligence (AI) and machine learning (ML) techniques have existed for years, but new hardware trends and advances in model training and inference have radically improved their performance. With an ever-increasing number of algorithms, systems, and hardware solutions, it is challenging to identify good deployments even for experts. Researchers and industry experts have observed this challenge and have created several benchmark suites for AI and ML applications and systems. While they are helpful in comparing several aspects of AI applications, none of the existing benchmarks measures end-to-end performance of ML deployments. Many have been rigorously developed in collaboration between academia and industry, but no existing benchmark is standardized.
In this paper, we introduce the TPC Express Benchmark for Artificial Intelligence (TPCx-AI), the first industry standard benchmark for end-to-end machine learning deployments. TPCx-AI is the first AI benchmark that represents the pipelines typically found in common ML and AI workloads. TPCx-AI provides a full software kit, which includes a data generator, a driver, and two full workload implementations, one based on Python libraries and one based on Apache Spark. We describe the complete benchmark and show benchmark results for various scale factors. TPCx-AI's core contributions are a novel unified data set covering structured and unstructured data; a fully scalable data generator that can generate realistic data from GB up to PB scale; and a diverse and representative workload using different data types and algorithms, covering a wide range of aspects of real ML workloads such as data integration, data processing, training, and inference.
OneProvenance: Efficient Extraction of Dynamic Coarse-Grained Provenance From Database Query Event Logs [industry]
Fotios Psallidas (Microsoft)*; Ashvin Agrawal (Microsoft); Chandru Sugunan (Snowflake); Khaled Ibrahim (Microsoft); Konstantinos Karanasos (Meta); Jesús Camacho-Rodríguez (Microsoft); Avrilia Floratou (Microsoft); Carlo Curino (Microsoft); Raghu Ramakrishnan (Microsoft)
Provenance encodes information that connects datasets, their generation workflows, and associated metadata (e.g., who executed a query and when). As such, it is instrumental for a wide range of critical governance applications (e.g., observability and auditing). Unfortunately, in the context of database systems, extracting coarse-grained provenance is a long-standing problem due to the complexity and sheer volume of database workflows. Provenance extraction from query event logs has been recently proposed as favorable because, in principle, it can result in meaningful provenance graphs for provenance applications. Current approaches, however, (a) add substantial overhead to the database and provenance extraction workflows and (b) extract provenance that is noisy, omits query execution dependencies, and is not rich enough for upstream applications. To address these problems, we introduce OneProvenance: an efficient provenance extraction system from query event logs. OneProvenance addresses the unique challenges of log-based extraction by (a) identifying query execution dependencies through efficient log analysis, (b) extracting provenance through novel event transformations that account for query dependencies, and (c) introducing effective filtering optimizations. Our thorough experimental analysis shows that OneProvenance can improve extraction by up to ~18X compared to state-of-the-art baselines; our optimizations reduce the extraction noise and optimize performance even further. OneProvenance is deployed at scale by Microsoft Purview and actively supports customer provenance extraction needs (https://bit.ly/3N2JVGF).
A8
(Industry) Real-Time & Stream Processing
Chair: Juan Colmenares (LinkedIn)
StreamOps: Cloud-Native Runtime Management for Streaming Services in ByteDance [industry]
Yancan Mao (National University of Singapore)*; Zhanghao Chen (ByteDance); Yifan Zhang (ByteDance); Meng Wang (ByteDance); Yong Fang (ByteDance); Guanghui Zhang (ByteDance); Rui Shi (ByteDance); Richard T.B. Ma (National University of Singapore)
Stream processing is widely used for real-time data processing and decision-making, leading to tens of thousands of streaming jobs deployed in ByteDance cloud. Since those streaming jobs usually run for several days or longer and the input workloads vary over time, they usually face diverse runtime issues such as processing lag and varying failures. This requires runtime management to resolve such runtime issues automatically. However, designing a runtime management service at ByteDance scale is challenging. In particular, the service has to concurrently manage cluster-wide streaming jobs in a scalable and extensible manner. Furthermore, it should also be able to manage diverse streaming jobs effectively.
To this end, we propose StreamOps to enable cloud-native runtime management for streaming jobs in ByteDance. StreamOps has three main designs to address the challenges. 1) To allow for scalability, StreamOps runs as a standalone lightweight control plane to manage cluster-wide streaming jobs. 2) To enable extensible runtime management, StreamOps abstracts control policies to identify and resolve runtime issues. New control policies can be implemented with a detect-diagnose-resolve programming paradigm. Each control policy is also configurable for different streaming jobs according to the performance requirements. 3) To mitigate processing lag and handle failures effectively, StreamOps features three control policies, i.e., auto-scaler, straggler detector, and job doctor, that are inspired by state-of-the-art research and production experiences at ByteDance. In this paper, we introduce the design decisions we made and the lessons we learned from building StreamOps. We evaluate StreamOps in our production environment, and the experiment results further validate our system design.
Krypton: Real-time Serving and Analytical SQL Engine at ByteDance [industry]
Jianjun Chen (Bytedance)*; Rui Shi (ByteDance); Heng Chen (ByteDance); Li Zhang (ByteDance); Ruidong Li (Bytedance.com); Wei Ding (Bytedance); Liya Fan (Bytedance corporation); Hao Wang (ByteDance); Mu Xiong (ByteDance); Yuxiang Chen (ByteDance); Benchao Dong (Bytedance); Kuankuan Guo (Bytedance); Yuanjin Lin (ByteDance Technology Co Ltd.); Xiao Liu (Bytedance); Haiyang Shi (ByteDance); Peipei Wang (ByteDance); Zikang Wang (ByteDance Technology Co Ltd.); Yang Yemeng (ByteDance Ltd.); Junda Zhao (ByteDance); Dongyan Zhou (ByteDance); Zhikai Zuo (bytedance); Yuming Liang (ByteDance)
In recent years, at ByteDance, we have seen more and more business scenarios that require real-time data serving in addition to complex ad hoc analysis over large amounts of freshly imported data. The serving workload requires performing complex queries over massive newly added data items with minimal delay. These systems are often used in mission-critical scenarios, whereas traditional OLAP systems cannot handle such use cases. To work around the problem, ByteDance products often have to use multiple systems together in production, forcing the same data to be ETLed into multiple systems, causing data consistency problems, wasting resources, and increasing learning and maintenance costs.
To solve the above problem, we built a single Hybrid Serving and Analytical Processing (HSAP) system to handle both workload types. HSAP is still in its early stage, and very few systems are yet on the market. This paper demonstrates how to build Krypton, a competitive cloud-native HSAP system that provides both excellent elasticity and query performance by utilizing many previously known query processing techniques, a hardware-accelerated hierarchical cache, and a native columnar storage format. Krypton can support high data freshness, high data ingestion rates, and strong data consistency. We also discuss lessons and best practices we learned in developing and operating Krypton in production.
Techniques and Efficiencies from Building a Real-Time DBMS [industry]
V Srinivasan (Aerospike)*; B Narendran (Aerospike); Andrew Gooding (Aerospike); Thomas Lopatic (Aerospike); Kevin Porter (Aerospike); Sunil Sayyaparaju (Aerospike); Ashish Shinde (Aerospike)
This paper describes a variety of techniques from over a decade of developing Aerospike (formerly Citrusleaf), a real-time DBMS that is being used in some of the world’s largest mission-critical systems that require the highest levels of performance and availability. Such mission-critical systems have many requirements including the ability to make decisions within a strict real-time SLA (milliseconds) with no downtime, predictable performance so that the first and billionth customer gets the same experience, ability to scale up 10X (or even 100X) with no downtime, support strong consistency for applications that need it, synchronous and asynchronous replication with global transactional capabilities, and the ability to deploy in any public and private cloud environments.
We describe how using efficient algorithms to optimize every area of the DBMS helps the system achieve these stringent requirements. Specifically, we describe effective ways to shard, place and locate data across a set of nodes, efficient identification of cluster membership and cluster changes, efficiencies generated by using a ‘smart’ client, how to effectively use two-copy replication instead of three-copy replication, how to reduce the cost of the real-time data footprint by combining the use of memory with flash storage, self-managing clusters for ease of operation including elastic scaling, and networking and CPU optimizations including NUMA pinning with multi-threading. The techniques and efficiencies described here have enabled hundreds of deployments to grow by many orders of magnitude with near complete uptime.
Lindorm TSDB: A Cloud-native Time-series Database for Large-scale Monitoring Systems [industry]
Shen Chunhui (alibaba); Qianyu Ouyang (Alibaba); Feibo Li (Alibaba group); Liu Zhipeng (alibaba); Longcheng Zhu (Alibaba); Yujie Zou (Alibaba Group); Qing Su (Alibaba Cloud); Tianhuan Yu (alibaba-inc); Yi Yi (Alibaba Group); Jianhong Hu (Alibaba Group); Cen Zheng (Alibaba Group)*; Bo Wen (Alibaba); Hanbang Zheng (Alibaba Group); Lunfan Xu (Alibaba Group); Sicheng Pan (Alibaba Group); Bin Wu (Alibaba Group); Xiao He (Alibaba Group); Ye Li (Alibaba); Jian Tan (Alibaba); Sheng Wang (Alibaba Group); Dan Pei (Tsinghua University); Wei Zhang (Alibaba); Feifei Li (Alibaba Group)
Internet services supported by large-scale distributed systems have become essential for our daily life. To ensure the stability and high quality of services, diverse metric data are constantly collected and managed in a time-series database to monitor the service status. However, when the number of metrics becomes massive, existing time-series databases are inefficient in handling high-rate data ingestion and queries hitting multiple metrics. Besides, they all lack support for machine learning functions, which are crucial for sophisticated analysis of large-scale time series. In this paper, we present Lindorm TSDB, a distributed time-series database designed for handling monitoring metrics at scale. It sustains high write throughput and low query latency with massive active metrics. It also allows users to analyze data with anomaly detection and time series forecasting algorithms directly through SQL. Furthermore, Lindorm TSDB retains stable performance even during node scaling. We evaluate Lindorm TSDB under different data scales, and the results show that it outperforms two popular open-source time-series databases on both writes and queries, while executing time-series machine learning tasks efficiently.
Kora: A Cloud-Native Event Streaming Platform For Kafka [industry]
Anna Povzner (Confluent)*; Prince Mahajan (Confluent); Jason Gustafson (Confluent); Jun Rao (Confluent); Ismael Juma (Confluent); Feng Min (Confluent); Shriram Sridharan (Confluent); Nikhil Bhatia (Confluent); Gopi Attaluri (Confluent); Adithya Chandra (Confluent); Stanislav Kozlovski (Confluent); Rajini Sivaram (Confluent); Lucas Bradstreet (Confluent); Bob Barrett (Confluent); Dhruvil Shah (Confluent); David Jacot (Confluent); David Arthur (Confluent); Manveer Chawla (Confluent); Ron Dagostino (Confluent); Colin McCabe (Confluent); Manikumar Reddy Obili (Confluent); Kowshik Prakasam (Confluent); Jose Garcia Sancio (Confluent); Vikas Singh (Confluent); Alok Nikhil (Confluent); Kamal Gupta (Confluent)
Event streaming is an increasingly critical infrastructure service used in many industries and there is growing demand for cloud-native solutions. Confluent Cloud provides a massive scale event streaming platform built on top of Apache Kafka with tens of thousands of clusters running in 70+ regions across AWS, Google Cloud, and Azure. This paper introduces Kora, the cloud-native platform for Apache Kafka at the core of Confluent Cloud. We describe Kora’s design that enables it to meet its cloud-native goals, such as reliability, elasticity, and cost efficiency. We discuss Kora’s abstractions which allow users to think in terms of their workload requirements and not the underlying infrastructure, and we discuss how Kora is designed to provide consistent, predictable performance across cloud environments with diverse capabilities.
B1
Foundations for Patterns, Constraints, & Dependencies
Chair: Matteo Lissandrini (Aalborg University)
Representing Paths in Graph Database Pattern Matching
Wim Martens (University of Bayreuth)*; Matthias Niewerth (University of Bayreuth); Tina Popp (University of Bayreuth); Carlos Rojas (PUC); Stijn Vansummeren (Hasselt University); Domagoj Vrgoč (PUC)
Modern graph database query languages such as GQL, SQL/PGQ, and their academic predecessor G-Core promote paths to first-class citizens in the sense that their pattern matching facility can return paths, as opposed to only nodes and edges. This is challenging for database engines, since graphs can have a large number of paths between a given node pair, which can cause huge intermediate results in query evaluation.
We introduce the concept of path multiset representations (PMRs), which can represent multisets of paths exponentially succinctly and therefore bring significant advantages for representing intermediate results. We give a detailed theoretical analysis that shows that they are especially well-suited for representing results of regular path queries and extensions thereof involving counting, random sampling, and unions. Our experiments show that they drastically improve scalability for regular path query evaluation, with speedups of several orders of magnitude.
Semi-Oblivious Chase Termination for Linear Existential Rules: An Experimental Study [eab]
Marco Calautti (University of Milan); Mostafa Milani (The University of Western Ontario); Andreas Pieris (University of Edinburgh & University of Cyprus)*
The chase procedure is a fundamental algorithmic tool in databases that allows us to reason with constraints, such as existential rules, with a plethora of applications. It takes as input a database and a set of constraints, and iteratively completes the database as dictated by the constraints. A key challenge, though, is the fact that it may not terminate, which leads to the problem of checking whether it terminates given a database and a set of constraints. In this work, we focus on the semi-oblivious version of the chase, which is well-suited for practical implementations, and linear existential rules, a central class of constraints with several applications. In this setting, there is a mature body of theoretical work that provides syntactic characterizations of when the chase terminates, algorithms for checking chase termination, and precise complexity results. Our main objective is to experimentally evaluate the existing chase termination algorithms with the aim of understanding which input parameters affect their performance, clarifying whether they can be used in practice, and revealing their performance limitations.
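To make the procedure concrete, the following is a minimal sketch of a semi-oblivious chase for linear rules (one body atom per rule), written for illustration only. It is not one of the termination-checking algorithms evaluated in the paper: it simply runs the chase for a bounded number of rounds and reports whether a fixpoint was reached. The atom encoding, the `?`-prefixed variable convention, and the names `chase`, `match`, and `max_rounds` are all assumptions of this sketch.

```python
def is_var(term):
    """Variables are strings starting with '?' (a convention of this sketch)."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, fact):
    """Return a substitution sending pattern variables to fact terms, or None."""
    if pattern[0] != fact[0] or len(pattern) != len(fact):
        return None
    subst = {}
    for p, f in zip(pattern[1:], fact[1:]):
        if is_var(p):
            if subst.setdefault(p, f) != f:
                return None  # same variable bound to two different terms
        elif p != f:
            return None  # constant mismatch
    return subst


def chase(database, rules, max_rounds=10):
    """Bounded semi-oblivious chase; returns (facts, reached_fixpoint)."""
    facts = set(database)
    for _ in range(max_rounds):
        new = set()
        for idx, (body, head) in enumerate(rules):
            body_vars = {t for t in body[1:] if is_var(t)}
            head_vars = {t for atom in head for t in atom[1:] if is_var(t)}
            frontier = sorted(body_vars & head_vars)
            for fact in facts:
                h = match(body, fact)
                if h is None:
                    continue
                # Semi-oblivious: a fresh null is determined only by the rule,
                # the existential variable, and the image of the frontier.
                key = tuple(h[v] for v in frontier)
                full = dict(h)
                for z in head_vars - body_vars:
                    full[z] = ("null", idx, z, key)
                for atom in head:
                    new.add((atom[0],) + tuple(full.get(t, t) for t in atom[1:]))
        if new <= facts:
            return facts, True  # nothing new: the chase has terminated
        facts |= new
    return facts, False  # round budget exhausted without reaching a fixpoint


# Every employee works for some department: the chase stops after one step.
terminating = [(("emp", "?x"), [("worksFor", "?x", "?d")])]
facts_ok, done_ok = chase({("emp", "ann")}, terminating)

# Every person has a parent who is a person: each round invents a new null.
looping = [(("person", "?x"), [("parent", "?x", "?y"), ("person", "?y")])]
facts_loop, done_loop = chase({("person", "ann")}, looping, max_rounds=5)
```

The second rule set illustrates why termination checking matters: the round budget is an arbitrary cutoff, whereas the algorithms studied in the paper decide termination from the input itself.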
Normalizing Property Graphs
Philipp Skavantzos (The University of Auckland); Sebastian Link (University of Auckland)*
Normalization aims at minimizing sources of potential data inconsistency and costs of update maintenance incurred by data redundancy. For relational databases, different classes of dependencies cause data redundancy and have resulted in proposals such as Third, Boyce-Codd, Fourth and Fifth Normal Form. Features of more advanced data models make it challenging to extend achievements from the relational model to missing, non-atomic, or uncertain data. We initiate research on the normalization of graph data, starting with a class of functional dependencies tailored to property graphs. We show that this class captures important semantics of applications, constitutes a rich source of data redundancy, has an implication problem that can be decided in linear time, and facilitates the normalization of property graphs flexibly tailored to their labels and properties that are targeted by applications. We normalize property graphs into Boyce-Codd Normal Form without loss of data and dependencies whenever possible for the target labels and properties, but guarantee Third Normal Form in general. Experiments on real-world property graphs quantify and qualify various benefits of graph normalization: 1) removing redundant property values as sources of inconsistent data, 2) detecting inconsistency as violation of functional dependencies, 3) reducing update overheads by orders of magnitude, and 4) significantly speeding up aggregate queries.
Exploiting the Power of Equality-Generating Dependencies in Ontological Reasoning
Luigi Bellomarini (Banca d'Italia); Davide Benedetto (Università Roma Tre); Matteo Brandetti (TU Wien); Emanuel Sallinger (TU Wien)
Equality-generating dependencies (EGDs) make it possible to fully exploit the power of existential quantification in ontological reasoning settings modeled via Tuple-Generating Dependencies (TGDs), by enabling value-assignment or forcing the equivalence of fresh symbols. These capabilities are at the core of many common reasoning tasks, including graph traversals, clustering, data matching and data fusion, and many more related real-world scenarios. However, the interplay of TGDs and EGDs is known to lead to undecidability or intractability of query answering in tractable Datalog+/- fragments, like Warded Datalog+/-, for which, in the sole presence of TGDs, query answering is PTIME in data complexity. Restrictions of equality constraints, like separable EGDs, have been studied, but all achieve decidability at the cost of limited expressive power, which makes them unsuitable for the mentioned tasks. This paper introduces the class of “harmless” EGDs, which subsumes separable EGDs and makes it possible to model a very broad class of tasks. We contribute a sufficient syntactic condition for testing harmlessness, an undecidable task in general. We argue that in Warded Datalog+/- with harmless EGDs, ontological reasoning is decidable and PTIME. From such theoretical underpinnings, we develop novel chase-based techniques for reasoning with harmless EGDs and present an implementation within the Vadalog system, a state-of-the-art Datalog-based reasoner. We provide full-scale experimental evaluation and comparative analysis.
Witness Generation for JSON Schema
Lyes Attouche (Univerite Paris-Dauphine); Mohamed-Amine Baazizi (Sorbonne Universite); Dario Colazzo (Univ. Paris Dauphine - PSL); Giorgio Ghelli (Universita di Pisa); Carlo Sartiani (Università della Basilicata); Stefanie Scherzinger (University of Passau)
JSON Schema is a schema language for JSON documents, based on a complex combination of structural operators, Boolean operators (negation included), and recursive variables. The static analysis of JSON Schema documents comprises practically relevant problems, including schema satisfiability, inclusion, and equivalence. These problems can be reduced to witness generation: given a schema, generate an element of the schema — if it exists — and report failure otherwise. Schema satisfiability, inclusion, and equivalence have been shown to be decidable. However, no witness generation algorithm has yet been formally described. We contribute a first, direct algorithm for JSON Schema witness generation, and study its effectiveness and efficiency in experiments over several schema collections, including thousands of real-world schemas.
B2
Provenance, Trace Capture, & Process Mining
Chair: Riccardo Tommasini (INSA Lyon - LIRIS)
Erebus: Explaining the Outputs of Data Streaming Queries
Dimitris Palyvos-Giannas (Chalmers University of Technology)*; Katerina Tzompanaki (CY Cergy Paris University); Marina Papatriantafilou (Chalmers University of Technology); Vincenzo Gulisano (Chalmers University of Technology)
In data streaming, why-provenance can explain why a given outcome is observed but offers no help in understanding why an expected outcome is missing. Explaining missing answers has been addressed in DBMSs, but these solutions are not directly applicable to the streaming setting, because of the extra challenges posed by limited storage and by the unbounded nature of data streams.
With our framework, Erebus, we tackle the unaddressed challenges behind explaining missing answers in streaming applications. Erebus allows users to define expectations about the results of a query, verifies at runtime whether such expectations hold, and provides explanations when expected and observed outcomes diverge (missing answers). To the best of our knowledge, Erebus is the first such solution in data streaming. Our thorough evaluation on real data shows that Erebus can explain the (missing) answers with small overheads, both in low- and higher-end devices, even when large portions of the processed data are part of such explanations.
R^3: Record-Replay-Retroaction for Database-Backed Applications
Qian Li (Stanford University)*; Peter Kraft (Stanford University); Michael Cafarella (MIT CSAIL); Çağatay Demiralp (Sigma Computing); Goetz Graefe (Google); Christos Kozyrakis (Stanford University); Michael Stonebraker (Massachusetts Institute of Technology); Lalith Suresh (VMware Research); Xiangyao Yu (University of Wisconsin-Madison); Matei Zaharia (Berkeley and Databricks)
Developers would benefit greatly from time travel: being able to faithfully replay past executions and retroactively execute modified code on past events. Currently, replay and retroaction are impractical because they require expensively capturing fine-grained timing information to reproduce concurrent accesses to shared state. In this paper, we propose practical time travel for database-backed applications, an important class of distributed applications that access shared state through transactions.
We present R^3, a novel Record-Replay-Retroaction tool. R^3 implements a lightweight interceptor to record concurrency information for applications at transaction-level granularity, enabling replay and retroaction with minimal overhead. We address key challenges in both replay and retroaction. First, we design a novel algorithm for faithfully reproducing application requests running with snapshot isolation, allowing R^3 to support most production DBMSs. Second, we develop a retroactive execution mechanism that provides high fidelity with the original trace while supporting nearly arbitrary code modifications. We demonstrate how R^3 simplifies debugging for real, hard-to-reproduce concurrency bugs from popular open-source web applications. We evaluate R^3 using TPC-C and microservice workloads and show that R^3 always-on recording has a small performance overhead (<25% for point queries but <0.1% for complex transactions like in TPC-C) during normal application execution and that R^3 can retroactively execute bugfixed code over recorded traces within 0.11 – 0.78x of the original execution time.
An Experimental Evaluation of Process Concept Drift Detection [eab]
Jan Niklas Adams (Chair for Process and Data Science, RWTH Aachen)*; Cameron Pitsch (RWTH Aachen University); Tobias Brockhoff (Chair of Process and Data Science, RWTH Aachen University); Wil M.P. van der Aalst (RWTH Aachen University)
Process mining provides techniques to learn models from event data. These models can be descriptive (e.g., Petri nets) or predictive (e.g., neural networks). The learned models offer operational support to process owners by conformance checking, process enhancement, or predictive monitoring. However, processes are frequently subject to significant changes, making the learned models outdated and less valuable over time. To tackle this problem, Process Concept Drift (PCD) detection techniques are employed. By identifying when the process changes occur, one can replace learned models by relearning, updating, or discounting pre-drift knowledge. Various techniques to detect PCDs have been proposed. However, each technique's evaluation focuses on different evaluation goals out of accuracy, latency, versatility, scalability, parameter sensitivity, and robustness. Furthermore, the employed evaluation techniques and data sets differ. Since many techniques are not evaluated against more than one other technique, this lack of comparability raises one question: How do PCD detection techniques compare against each other? With this paper, we propose, implement, and apply a unified evaluation framework for PCD detection. We do this by collecting evaluation goals and evaluation techniques together with data sets. We derive a representative sample of techniques from a taxonomy for PCD detection. The implemented techniques and proposed evaluation framework are provided in a publicly available repository. We present the results of our experimental evaluation and observe that none of the implemented techniques works well across all evaluation goals. However, the results indicate future improvement points of algorithms and guide practitioners.
Mining Frequent Infix Patterns from Concurrency-Aware Process Execution Variants
Michael Martini (RWTH Aachen); Daniel Schuster (Fraunhofer-Institut für Angewandte Informationstechnik FIT)*; Wil M.P. van der Aalst (RWTH Aachen University)
Event logs, as considered in process mining, document a large number of individual process executions. Moreover, each process execution consists of various executed activities. To cope with the vast number of process executions in event logs, variants group process executions with identical ordering relations among their executed activities. Variants are an integral concept of process mining and help process analysts explore, filter, and manage large amounts of event data. In this paper, we consider concurrency-aware variants that allow activities within a process execution to be partially ordered: the execution of individual activities can overlap in time. However, the number of variants is often vast, making it challenging for process analysts to explore event data. Therefore, we present a novel approach to frequent pattern mining from concurrency-aware variants. We show that mining frequent patterns from concurrency-aware variants can be reduced to the frequent subtree mining problem. Further, we compare our proposed algorithm to a state-of-the-art frequent subtree mining algorithm and observe improved performance on real-life event logs.
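As a small illustration of the variant concept, the sketch below groups process executions by their activity sequence. It covers only the classical, totally ordered case; the concurrency-aware variants studied in the paper use partial orders and are not modeled here. The function name `group_variants` and the toy log are assumptions made for this example.

```python
from collections import Counter


def group_variants(event_log):
    """Count process executions per variant (identical activity sequence)."""
    return Counter(tuple(trace) for trace in event_log)


# A toy event log: each inner list is one process execution.
log = [
    ["register", "check", "approve"],
    ["register", "check", "reject"],
    ["register", "check", "approve"],
]
variants = group_variants(log)

# Variants ordered from most to least frequent.
ranked = variants.most_common()
```

Here three executions collapse into two variants, which is the kind of reduction that lets analysts explore a log by variant instead of by individual execution.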
B3
View & Change Management
Chair: Wolfgang Gatterbauer (Northeastern University)
Online Schema Evolution is (Almost) Free for Snapshot Databases
Tianxun Hu (Simon Fraser University)*; Tianzheng Wang (Simon Fraser University); Qingqing Zhou (Tencent)
Modern database applications often change their schemas to keep up with the changing requirements. However, support for online and transactional schema evolution remains challenging in existing database systems. Specifically, prior work often takes ad hoc approaches to schema evolution with “patches” applied to existing systems, leading to many corner cases and often incomplete functionality. Applications therefore often have to carefully schedule downtimes for schema changes, sacrificing availability.
This paper presents Tesseract, a new approach to online and transactional schema evolution without the aforementioned drawbacks. We design Tesseract based on a key observation: in widely used multi-versioned database systems, schema evolution can be modeled as data modification operations that change the entire table, i.e., data-definition-as-modification (DDaM). This allows us to support schema evolution almost “for free” by leveraging the concurrency control protocol. With simple tweaks to existing snapshot isolation protocols, we show on a 40-core server that, under a variety of workloads, Tesseract provides online, transactional schema evolution without service downtime and retains high application performance while schema evolution is in progress.
Making Cache Monotonic and Consistent
Shuai An (University of Edinburgh); Yang Cao (University of Edinburgh)*
We propose monotonic consistent caching (MCC), a cache scheme for applications that demand consistency and monotonicity. MCC guarantees that a transaction-like request always sees a consistent view of the backend database and that observed writes over the cache are not lost. We show that the complexity of MCC ranges from PTIME to NP-Complete. We characterize MCC via a notion of obsolete items, based on which we abstract a principle for designing competitive MCC policies. By applying the principle, we develop an optimal MCC policy for the batch model, where requests in a batch are known in advance. For the online and semi-online models, we develop ML-augmented policies that benefit from blackbox ML models for classifying obsolete items, while being provably competitive even if the ML is arbitrarily bad. Using benchmark and real-life traces, we show that MCC policies reduce database reads by 39.09% for Redis atop HBase and improve their throughput by 77.15%.
DBSP: Automatic Incremental View Maintenance for Rich Query Languages
Mihai Budiu (VMware Research)*; Tej Chajed (VMware Research); Frank McSherry (Materialize); Leonid Ryzhyk (VMware Research); Val Tannen (University of Pennsylvania)
Incremental view maintenance (IVM) has long been a central problem in database theory. Many solutions have been proposed for restricted classes of database languages, such as the relational algebra, or Datalog. These techniques do not naturally generalize to richer languages. In this paper we give a general, heuristic-free solution to this problem in 3 steps: (1) we describe a simple but expressive language called DBSP for describing computations over data streams; (2) we give a new mathematical definition of IVM and a general algorithm for solving IVM for arbitrary DBSP programs, and (3) we show how to model many rich database query languages using DBSP (including the full relational algebra, queries over sets and multisets, arbitrarily nested relations, aggregation, flatmap (unnest), monotonic and non-monotonic recursion, streaming aggregation, and arbitrary compositions of all of these). SQL and Datalog can both be implemented in DBSP. As a consequence, we obtain efficient incremental view maintenance algorithms for queries written in all these languages.
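As a rough illustration of what incremental view maintenance means for a single operator, the sketch below maintains a join view from input deltas instead of recomputing it. It is not the DBSP language or its general algorithm: representing tables as weighted multisets (record mapped to an integer weight, negative for deletions) and all function names are assumptions of this example.

```python
from collections import Counter

def zadd(a, b):
    """Add two weighted multisets (record -> integer weight), dropping zeros."""
    out = Counter(a)
    for rec, w in b.items():
        out[rec] += w
    return {rec: w for rec, w in out.items() if w != 0}

def zjoin(a, b):
    """Join multisets of (key, value) records on key; weights multiply."""
    out = {}
    for (ka, va), wa in a.items():
        for (kb, vb), wb in b.items():
            if ka == kb:
                rec = (ka, va, vb)
                out[rec] = out.get(rec, 0) + wa * wb
    return {rec: w for rec, w in out.items() if w != 0}

def join_delta(a, da, b, db):
    """Change of join(a, b) when a += da and b += db (join is bilinear)."""
    return zadd(zadd(zjoin(da, b), zjoin(a, db)), zjoin(da, db))

a = {("k1", "x"): 1}
b = {("k1", "p"): 1, ("k2", "q"): 1}
da = {("k2", "y"): 1, ("k1", "x"): -1}  # insert (k2, y), delete (k1, x)
db = {("k1", "r"): 1}                   # insert (k1, r)

view = zjoin(a, b)
view_incremental = zadd(view, join_delta(a, da, b, db))
view_recomputed = zjoin(zadd(a, da), zadd(b, db))
print(view_incremental == view_recomputed)
```

The incremental path touches only the deltas and the stored inputs, and the final comparison checks it against a full recomputation.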
SageDB: An Instance-Optimized Data Analytics System
Jialin Ding (AWS); Ryan Marcus (MIT); Andreas Kipf (Amazon Web Services); Vikram Nathan (MIT); Aniruddha Nrusimha (MIT); Kapil Vaidya (MIT); Alexander van Renen (Friedrich-Alexander-Universität Erlangen-Nürnberg); Tim Kraska (MIT)
Modern data systems are typically both complex and general-purpose. They are complex because of the numerous internal knobs and parameters that users need to manually tune in order to achieve good performance; they are general-purpose because they are designed to handle diverse use cases, and therefore often do not achieve the best possible performance for any specific use case. A recent trend aims to tackle these pitfalls: instance-optimized systems are designed to automatically self-adjust in order to achieve the best performance for a specific use case, i.e., a dataset and query workload. Thus far, the research community has focused on creating instance-optimized database components, such as learned indexes and learned cardinality estimators, which are evaluated in isolation. However, to the best of our knowledge, there is no complete data system built with instance-optimization as a foundational design principle. In this paper, we present a progress report on SageDB, our effort towards building the first instance-optimized data system. SageDB synthesizes various instance-optimization techniques to automatically specialize for a given use case, while simultaneously exposing a simple user interface that places minimal technical burden on the user. Our prototype outperforms a commercial cloud-based analytics system by up to 3× on end-to-end query workloads and up to 250× on individual queries. SageDB is an ongoing research effort, and we highlight our lessons learned and key directions for future work.
B4
Information Integration & Mining
Chair: El Kindi Rezig (University of Utah)
Auto-BI: Automatically Build BI-Models Leveraging Local Join Prediction and Global Schema Graph
Yiming Lin (University of California at Irvine); Yeye He (Microsoft Research)*; Surajit Chaudhuri (Microsoft Research)
Business Intelligence (BI) is crucial in modern enterprises and is a billion-dollar business. Traditionally, technical experts like database administrators would manually prepare BI-models (e.g., in star or snowflake schemas) that join tables in data warehouses, before less-technical business users can run analytics using end-user dashboarding tools. However, the popularity of self-service BI (e.g., Tableau and Power-BI) in recent years creates a strong demand for less technical end-users to build BI-models themselves.
We develop an Auto-BI system that can accurately predict BI models given a set of input tables, using a principled graph-based optimization problem we propose called k-Min-Cost-Arborescence (k-MCA), which holistically considers both local join prediction and global schema-graph structures, leveraging a graph-theoretical structure called arborescence. While we prove k-MCA is intractable and inapproximable in general, we develop novel algorithms that can solve k-MCA optimally, which is shown to be efficient in practice with sub-second latency, and can scale to the largest BI-models we encounter (with close to 100 tables).
Auto-BI is rigorously evaluated leveraging a unique dataset with over 100K real BI models we harvested, as well as 4 popular TPC benchmarks. Auto-BI is shown to be both efficient and accurate, achieving over 0.9 F1-score on both real and synthetic benchmarks.
Fast Algorithms for Denial Constraint Discovery
Eduardo H. M. Pena (UTFPR)*; Fabio Porto (LNCC); Felix Naumann (Hasso Plattner Institute, University of Potsdam)
Denial constraints (DCs) are an integrity constraint formalism widely used to detect inconsistencies in data. Several algorithms have been devised to discover DCs from data, as manually specifying them is burdensome and, worse yet, error-prone. The existing algorithms follow two basic steps: building an intermediate data structure from records, then enumerating the DCs from that intermediate. However, current algorithms are often inefficient in computing these intermediates. Also, it is still unclear which enumeration algorithm performs best since some of the available algorithms have not yet been compared to each other.
In response, we present a set of new algorithms with improved design choices. We introduce a parallel pipeline for rapidly computing the intermediate using custom data representations, algorithms, and indexes. For DC enumeration, we propose an inverted index, pruning, and parallel search strategies. We present hybrid approaches that integrate our techniques with previous enumeration algorithms, improving their performance in many scenarios. Our experimental study shows that the proposed DC discovery algorithms are consistently much faster (up to an order of magnitude) than the current state-of-the-art.
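For readers unfamiliar with denial constraints, the following minimal sketch shows what a single DC expresses and how violations can be detected by brute force. It illustrates checking a given DC, not the discovery algorithms of the paper; the example constraint, the attribute names, and the `violations` helper are all hypothetical.

```python
import operator
from itertools import permutations

# A DC is modeled as a list of predicates (attribute of t, operator,
# attribute of t2). It forbids any ordered pair of distinct rows (t, t2)
# that satisfies all predicates at once. Hypothetical example: within one
# state, a higher salary must not come with a lower tax.
dc = [
    ("state", operator.eq, "state"),
    ("salary", operator.gt, "salary"),
    ("tax", operator.lt, "tax"),
]

def violations(rows, dc):
    """Return index pairs (i, j) of rows that jointly violate the DC."""
    return [
        (i, j)
        for (i, t), (j, t2) in permutations(enumerate(rows), 2)
        if all(op(t[a], t2[b]) for a, op, b in dc)
    ]

rows = [
    {"state": "NY", "salary": 50000, "tax": 5000},
    {"state": "NY", "salary": 60000, "tax": 4000},
    {"state": "CA", "salary": 70000, "tax": 3000},
]
found = violations(rows, dc)
print(found)
```

The check compares every ordered pair of rows, so it is quadratic in the number of rows and only suitable for small examples.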
Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V
Roee Shraga (Northeastern University)*; Renée J. Miller (Northeastern University)
In multi-user environments in which data science and analysis is collaborative, multiple versions of the same datasets are generated. While managing and storing data versions has received some attention in the research literature, the semantic nature of such changes has remained under-explored. In this work, we introduce Explain-Da-V, a framework aiming to explain changes between two given dataset versions. Explain-Da-V generates explanations that use data transformations to explain changes. We further introduce a set of measures that evaluate the validity, generalizability, and explainability of these explanations. We empirically show, using an adapted existing benchmark and a newly created benchmark, that Explain-Da-V generates better explanations than existing data transformations synthesis methods.
Learning and Deducing Temporal Orders
Wenfei Fan (University of Edinburgh); Resul Tugay (University of Edinburgh); Yaoshu Wang (Shenzhen Institute of Computing Sciences, Shenzhen University)*; Min Xie (Shenzhen Institute of Computing Sciences); Muhammad Asif Ali (King Abdullah University of Science and Technology)
This paper studies how to determine temporal orders on attribute values in a set of tuples that pertain to the same entity, in the absence of complete timestamps. We propose a creator-critic framework to learn and deduce temporal orders by combining deep learning and rule-based deduction, referred to as GATE (Get the lATEst). The creator of GATE trains a ranking model via deep learning, to learn temporal orders and rank attribute values based on correlations among the attributes. The critic then validates the temporal orders learned and deduces more ranked pairs by chasing the data with currency constraints; it also provides augmented training data as feedback for the creator to improve the ranking in the next round. The process proceeds until the temporal order obtained becomes stable. Using real-life and synthetic datasets, we show that GATE is able to determine temporal orders with F-measure above 80%, improving deep learning by 7.8% and rule-based methods by 34.4%.
Extraction of Validating Shapes from very large Knowledge Graphs [sds]
Kashif Rabbani (Aalborg University Denmark)*; Matteo Lissandrini (Aalborg University); Katja Hose (TU Wien)
Knowledge Graphs (KGs) represent heterogeneous domain knowledge on the Web and within organizations. There exist shapes constraint languages to define validating shapes to ensure the quality of the data in KGs. Existing techniques to extract validating shapes often fail to extract complete shapes, are not scalable, and are prone to produce spurious shapes. To address these shortcomings, we propose the Quality Shapes Extraction (QSE) approach to extract validating shapes in very large graphs, for which we devise both an exact and an approximate solution. QSE provides information about the reliability of shape constraints by computing their confidence and support within a KG, and in doing so makes it possible to identify shapes that are most informative and less likely to be affected by incomplete or incorrect data. To the best of our knowledge, QSE is the first approach to extract a complete set of validating shapes from WikiData. Moreover, QSE provides a 12x reduction in extraction time compared to existing approaches, while managing to filter out up to 93% of the invalid and spurious shapes, resulting in a reduction of up to 2 orders of magnitude in the number of constraints presented to the user, e.g., from 11,916 to 809 on DBpedia.
B5
Similarity Join & Entity Resolution
Chair: Reynold Cheng (University of Hong Kong)
TokenJoin: Efficient Filtering for Set Similarity Join with Maximum Weighted Bipartite Matching
Alexandros Zeakis (National and Kapodistrian University of Athens); Dimitrios Skoutas (Athena Research Center)*; Dimitris Sacharidis (ULB); Odysseas Papapetrou (TU Eindhoven); Manolis Koubarakis (University of Athens, Greece)
Set similarity join is an important problem with many applications in data discovery, cleaning and integration. To increase robustness, fuzzy set similarity join calculates the similarity of two sets based on maximum weighted bipartite matching instead of set overlap. This allows pairs of elements, represented as sets or strings, to also match approximately rather than exactly, e.g., based on Jaccard similarity or edit distance. However, this significantly increases the verification cost, making even more important the need for efficient and effective filtering techniques to reduce the number of candidate pairs. The current state-of-the-art algorithm relies on similarity computations between pairs of elements to filter candidates. In this paper, we propose token-based instead of element-based filtering, showing that it is significantly more lightweight, while offering similar or even better pruning effectiveness. Moreover, we address the top-k variant of the problem, alleviating the need for a user-specified similarity threshold. We also propose early termination to reduce the cost of verification. Our experimental results on six real-world datasets show that our approach always outperforms the state of the art, being an order of magnitude faster on average.
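To make the notion of fuzzy set similarity concrete, the sketch below scores two sets of tokenized elements by a maximum weighted bipartite matching over element-level Jaccard similarities. It illustrates the verification step only, not TokenJoin's filtering. The normalization M / (|R| + |S| - M) is an assumption of this example (the abstract does not state one), and the brute-force matching is only practical for very small sets.

```python
from itertools import permutations

def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def max_matching_score(R, S):
    """Maximum-weight bipartite matching by brute force (tiny inputs only)."""
    if len(R) > len(S):
        R, S = S, R
    return max(
        sum(jaccard(r, s) for r, s in zip(R, perm))
        for perm in permutations(S, len(R))
    )

def fuzzy_jaccard(R, S):
    """Assumed normalization: matching score M over |R| + |S| - M."""
    if not R and not S:
        return 1.0
    m = max_matching_score(R, S)
    return m / (len(R) + len(S) - m)

R = [("new", "york", "city"), ("los", "angeles")]
S = [("los", "angeles"), ("new", "york")]
score = fuzzy_jaccard(R, S)
print(round(score, 3))
```

Under exact element matching the two sets share only one element, whereas the matching-based score also credits the near-match between the two "new york" elements.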
A Two-Level Signature Scheme for Stable Set Similarity Joins
Daniel Schmitt (University of Salzburg)*; Daniel Kocher (University of Salzburg); Nikolaus Augsten (University of Salzburg); Willi Mann (Celonis SE); Alexander Miller (University of Salzburg)
We study the set similarity join problem, which retrieves all pairs of similar sets from two collections of sets for a given distance function. Existing exact solutions employ a signature-based filter-verification framework: If two sets are similar, they must have at least one signature in common, otherwise they can be pruned safely. We observe that the choice of the signature scheme has a significant impact on the performance. Unfortunately, choosing a good signature scheme is hard because the performance heavily depends on the characteristics of the underlying dataset.
To address this problem, we propose a hybrid signature composition that leverages the most selective portion of each signature scheme. Sets with an unselective primary signature are detected, and the signatures are replaced with a more selective secondary signature. We propose a generic framework called TwoL and a cost model to balance the computational overhead and the selectivity of the signature schemes. We implement our framework with two complementary signature schemes for Jaccard similarity and Hamming distance, resulting in effective two-level hybrid indexes that join datasets with diverse characteristics efficiently. TwoL consistently outperforms state-of-the-art set similarity joins on a benchmark with 13 datasets that cover a wide range of data characteristics.
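The signature-based filter-verification framework mentioned above can be sketched with one well-known signature scheme, the prefix filter for Jaccard similarity. This is an illustrative baseline chosen for this example, not the TwoL framework or its cost model; the function names and the rare-token-first ordering are assumptions.

```python
import math
from collections import defaultdict
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def similarity_self_join(sets, t):
    """Return index pairs (i, j), i < j, with Jaccard similarity >= t."""
    # Global token order: rare tokens first, which keeps prefixes selective.
    freq = defaultdict(int)
    for s in sets:
        for tok in set(s):
            freq[tok] += 1
    order = lambda tok: (freq[tok], tok)

    # Signature = prefix of length |s| - ceil(t * |s|) + 1 under that order.
    # Two sets with Jaccard >= t must share at least one prefix token.
    index = defaultdict(list)
    for i, s in enumerate(sets):
        toks = sorted(set(s), key=order)
        # The small epsilon guards against float round-up; it can only
        # lengthen the prefix, which keeps the filter safe.
        plen = len(toks) - math.ceil(t * len(toks) - 1e-9) + 1
        for tok in toks[:plen]:
            index[tok].append(i)

    # Filter: pairs sharing a signature. Verify: exact Jaccard.
    candidates = {p for ids in index.values() for p in combinations(ids, 2)}
    return sorted((i, j) for i, j in candidates
                  if jaccard(sets[i], sets[j]) >= t)

sets = [{"a", "b", "c", "d"}, {"a", "b", "c", "e"}, {"x", "y", "z"}]
pairs = similarity_self_join(sets, 0.5)
print(pairs)
```

Pairs that share no signature are never verified, which is where the choice of signature scheme affects performance.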
Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching
Derek Paulsen (University of Wisconsin-Madison)*; Yash Govind (Apple); AnHai Doan (UW-Madison)
Blocking is a major task in entity matching (EM). Numerous blocking solutions have been developed, but as far as we can tell, blocking using the well-known tf/idf measure has received virtually no attention. Yet, when we experimented with tf/idf blocking using Lucene, we found it does quite well. So in this paper we examine tf/idf blocking in depth. We develop Sparkly, which uses Lucene to perform top-k tf/idf blocking in a distributed share-nothing fashion on a Spark cluster. We develop techniques to identify good attributes and tokenizers that can be used to block on, making Sparkly completely automatic. We perform extensive experiments showing that Sparkly outperforms 8 state-of-the-art blockers. Finally, we provide an in-depth analysis of Sparkly's performance, regarding both recall/output size and runtime. Our findings suggest that (a) tf/idf blocking needs more attention, (b) Sparkly forms a strong baseline that future blocking work should compare against, and (c) future blocking work should seriously consider top-k blocking, which helps improve recall, and a distributed share-nothing architecture, which helps improve scalability, predictability, and extensibility.
Pre-trained Embeddings for Entity Resolution: An Experimental Analysis [eab]
Alexandros Zeakis (National and Kapodistrian University of Athens); George Papadakis (University of Athens); Dimitrios Skoutas (Athena Research Center)*; Manolis Koubarakis (University of Athens, Greece)
Many recent works on Entity Resolution (ER) leverage Deep Learning techniques involving language models to improve effectiveness. This is applied to both main steps of ER, i.e., blocking and matching. Several pre-trained embeddings have been tested, with the most popular ones being fastText and variants of the BERT model. However, there is no detailed analysis of their pros and cons. To cover this gap, we perform a thorough experimental analysis of 12 popular language models over 17 established benchmark datasets. First, we assess their vectorization overhead for converting all input entities into dense embedding vectors. Second, we investigate their blocking performance, performing a detailed scalability analysis, and comparing them with the state-of-the-art deep learning-based blocking method. Third, we conclude with their relative performance for both supervised and unsupervised matching. Our experimental results provide novel insights into the strengths and weaknesses of the main language models, helping researchers and practitioners select the most suitable ones in practice.
Through the Fairness Lens: Experimental Analysis and Evaluation of Entity Matching [eab]
Nima Shahbazi (University of Illinois at Chicago)*; Nikola Danevski (University of Rochester); Fatemeh Nargesian (University of Rochester); Abolfazl Asudeh (University of Illinois Chicago); Divesh Srivastava (AT&T Chief Data Office)
Entity matching (EM) is a challenging problem studied by different communities for over half a century. Algorithmic fairness has also become a timely topic to address machine bias and its societal impacts. Despite extensive research on these two topics, little attention has been paid to the fairness of entity matching. Towards addressing this gap, we perform an extensive experimental evaluation of a variety of EM techniques in this paper. We generated two social datasets from publicly available datasets for the purpose of auditing EM through the lens of fairness. Our findings underscore potential unfairness under two common conditions in real-world societies: (i) when some demographic groups are overrepresented, and (ii) when names are more similar in some groups compared to others. Among our many findings, it is noteworthy to mention that while various fairness definitions are valuable for different settings, due to EM's class imbalance nature, measures such as positive predictive value parity and true positive rate parity are, in general, more capable of revealing EM unfairness.
C1
Data Exploration/Transformation & Usability
Chair: Oliver Kennedy (University at Buffalo)
Auto-Tables: Synthesizing Multi-Step Transformations to Relationalize Tables without Using Examples
Peng Li (Georgia Institute of Technology); Yeye He (Microsoft Research)*; Cong Yan (Microsoft Research); Yue Wang (Microsoft Research); Surajit Chaudhuri (Microsoft Research)
Relational tables, where each row corresponds to an entity and each column corresponds to an attribute, have been the standard for tables in relational databases. However, such a standard cannot be taken for granted when dealing with tables “in the wild”. Our survey of real spreadsheet-tables and web-tables shows that over 30% of such tables do not conform to the relational standard, for which complex table-restructuring transformations are needed before these tables can be queried easily using SQL-based tools. Unfortunately, the required transformations are non-trivial to program, which has become a substantial pain point for technical and non-technical users alike, as evidenced by large numbers of forum questions in places like StackOverflow and Excel/Tableau forums.
We develop an Auto-Tables system that can automatically synthesize pipelines with multi-step transformations (in Python or other languages), to transform non-relational tables into standard relational forms for downstream analytics, obviating the need for users to manually program transformations. We compile an extensive benchmark for this new task, by collecting 194 real test cases from user spreadsheets and online forums. Our evaluation suggests that Auto-Tables can successfully synthesize transformations for over 70% of test cases at interactive speeds, without requiring any input from users, making this an effective tool for both technical and non-technical users to prepare data for analytics.
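As an example of the kind of table-restructuring transformation users otherwise have to program by hand, the sketch below unpivots a wide table whose column headers hold data values (years) into a relational form. It is a hand-written illustration, not output of Auto-Tables; the `unpivot` helper and the sample table are hypothetical.

```python
def unpivot(rows, id_cols, var_name, value_name):
    """Turn every non-id column of each row into its own (variable, value) row."""
    out = []
    for row in rows:
        for col, val in row.items():
            if col not in id_cols:
                rec = {c: row[c] for c in id_cols}
                rec[var_name] = col
                rec[value_name] = val
                out.append(rec)
    return out

# Wide, non-relational layout: one column per year.
wide = [
    {"country": "FR", "2021": 10, "2022": 12},
    {"country": "DE", "2021": 20, "2022": 21},
]

# Relational layout: one row per (country, year) with a single value column.
long_rows = unpivot(wide, ["country"], "year", "sales")
for rec in long_rows:
    print(rec)
```

After the transformation, a question such as total sales per year becomes a plain GROUP BY over the `year` column.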
Transactional Panorama: A Conceptual Framework for User Perception in Analytical Visual Interfaces
Dixin Tang (University of California at Berkeley)*; Alan Fekete (University of Sydney); Indranil Gupta (UIUC); Aditya G. Parameswaran (University of California at Berkeley)
Many tools empower analysts and data scientists to consume analysis results in a visual interface, such as a dashboard. When the underlying data changes, these results need to be updated, but this update can take a long time, all while the user continues to explore the results. In this context, tools can either (i) hide away results that haven't been updated, hindering exploration; (ii) make the updated results immediately available to the user (on the same screen as old results), leading to confusion and incorrect insights; or (iii) present old (and therefore stale) results to the user during the update. To help users reason about these options and others, and make appropriate trade-offs, we introduce Transactional Panorama, a formal framework that adopts transactions to jointly model the system refreshing the analysis results and the user interacting with them. We introduce three key properties that are important for user perception in this context: visibility (allowing users to continuously explore results), consistency (ensuring that results presented are from the same version of the data), and monotonicity (making sure that results don't “go back in time”). Within Transactional Panorama, we characterize all of the feasible property combinations, design new mechanisms (that we call lenses) for presenting analysis results to the user while preserving a given property combination, formally prove their relative orderings for various performance criteria, and discuss their use cases. We propose novel algorithms to preserve each property combination and efficiently present the fresh analysis results. We implement our Transactional Panorama framework in a popular, open-source BI tool, illustrate the relative performance implications of different lenses, demonstrate the benefits of the novel lenses, and outline the performance improvement by our optimizations.
FEDEX: An Explainability Framework for Data Exploration Steps
Daniel Deutch (Tel Aviv University); Amir Gilad (Duke University); Tova Milo (Tel Aviv University); Amit Mualem (Tel Aviv University); Amit Somech (Bar-Ilan University)
When exploring a new dataset, Data Scientists often apply analysis queries, look for insights in the resulting dataframe, and repeat to apply further queries. We propose in this paper a novel solution that assists data scientists in this laborious process. In a nutshell, our solution pinpoints the most interesting (sets of) rows in each obtained dataframe. Uniquely, our definition of interest is based on the contribution of each row to the interestingness of different columns of the entire dataframe, which, in turn, is defined using standard measures such as diversity and exceptionality. Intuitively, interesting rows are ones that explain why (some column of) the analysis query result is interesting as a whole. Rows are correlated in their contribution and so the interesting score for a set of rows may not be directly computed based on that of individual rows. We address the resulting computational challenge by restricting attention to semantically-related sets, based on multiple notions of semantic relatedness; these sets serve as more informative explanations. Our experimental study across multiple real-world datasets shows the usefulness of our system in various scenarios.
Bolt-on, Compact, and Rapid Program Slicing for Notebooks [sds]
Shreya Shankar (University of California Berkeley); Stephen Macke (University of Illinois at Urbana-Champaign); Sarah Chasins (UC Berkeley); Andrew Head (University of California, Berkeley); Aditya Parameswaran (University of California, Berkeley)
Computational notebooks are commonly used for iterative workflows, such as in exploratory data analysis. This process lends itself to the accumulation of old code and hidden state, making it hard for users to reason about the lineage of, e.g., plots depicting insights or trained machine learning models. One way to reason about code used to generate various notebook data artifacts is to compute a program slice, but traditional static approaches to slicing can be both inaccurate (failing to contain relevant code for artifacts) and conservative (containing unnecessary code for an artifact). We present nbslicer, a dynamic slicer optimized for the notebook setting whose instrumentation for resolving dynamic data dependencies is both bolt-on (and therefore portable) and switchable (allowing it to be selectively disabled in order to reduce instrumentation overhead). We demonstrate nbslicer’s ability to construct small and accurate backward slices (i.e., historical cell dependencies) and forward slices (i.e., cells affected by the "rerun" of an earlier cell), thereby improving reproducibility in notebooks and enabling faster reactive re-execution, respectively. Comparing nbslicer with a static slicer on 374 real notebook sessions, we found that nbslicer filters out far more superfluous program statements while maintaining slice correctness, giving slices that are, on average, 66% and 54% smaller for backward and forward slices, respectively.
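The idea of a backward slice over notebook cells can be sketched as follows. This is a deliberately simplified model, not nbslicer: it assumes each executed cell has already been summarized as the set of symbols it read and wrote, and that a write fully overwrites a symbol. The log format and function name are hypothetical.

```python
def backward_slice(executions, target):
    """Cells needed to reproduce the cell at position `target`.

    `executions` is a list of (cell_id, reads, writes) in execution order.
    """
    needed = set(executions[target][1])  # symbols still lacking a producer
    cells = [executions[target][0]]
    for cell_id, reads, writes in reversed(executions[:target]):
        if needed & set(writes):
            cells.append(cell_id)
            # The cell's writes are now explained; its reads become needed.
            needed = (needed - set(writes)) | set(reads)
    return cells[::-1]

executions = [
    ("c1", set(), {"df"}),         # load data
    ("c2", {"df"}, {"clean"}),     # clean it
    ("c3", set(), {"tmp"}),        # unrelated scratch work
    ("c4", {"clean"}, {"plot"}),   # produce the artifact
]
cells = backward_slice(executions, 3)
print(cells)
```

The unrelated cell `c3` is left out of the slice because nothing the target depends on was written by it.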
C2
Indexing
Chair: Abolfazl Asudeh (University of Illinois at Chicago)
Towards Efficient Index Construction and Approximate Nearest Neighbor Search in High-Dimensional Spaces
Xi Zhao (HKUST)*; Yao Tian (The Hong Kong University of Science and Technology); Kai Huang (HKUST); Bolong Zheng (Huazhong University of Science and Technology); Xiaofang Zhou (The Hong Kong University of Science and Technology)
The approximate nearest neighbor (ANN) search in high-dimensional spaces is a fundamental but computationally very expensive problem. Many methods are designed for solving the ANN problem, such as LSH-based methods and graph-based methods.
The LSH-based methods can be costly to reach high query quality due to the hash-boundary issues, while the graph-based methods can achieve better query performance by greedy expansion in an approximate proximity graph (APG). However, the construction cost of these APGs can be one or two orders of magnitude higher than that for building hash-based indexes. In addition, they fall short in incrementally maintaining APGs as the underlying dataset evolves. In this paper, we propose a novel approach named LSH-APG to build APGs and facilitate fast ANN search using a lightweight LSH framework. LSH-APG builds an APG via consecutively inserting points based on their nearest neighbor relationship with an efficient and accurate LSH-based search strategy. A high-quality entry point selection technique and an LSH-based pruning condition are developed to accelerate index construction and query processing by reducing the number of points to be accessed during the search. LSH-APG supports fast maintenance of APGs in lieu of building them from scratch as the dataset evolves. Its maintenance cost and query cost for a point are proven to be less affected by dataset cardinality. Extensive experiments on real-world and synthetic datasets demonstrate that LSH-APG incurs significantly less construction cost but achieves better query performance than existing graph-based methods.
CORE-Sketch: On Exact Computation of Median Absolute Deviation with Limited Space
Haoquan Guan (Tsinghua University); Ziling Chen (Tsinghua University); Shaoxu Song (Tsinghua University)*
Median absolute deviation (MAD), the median of the absolute deviations from the median, has been found useful in various applications such as outlier detection. Together with the median, MAD is more robust to abnormal data than mean and standard deviation (SD). Unfortunately, existing methods return only approximate MAD that may be far from the exact one, and thus mislead the downstream applications. Computing exact MAD is costly, however, especially in space, by storing the entire dataset in memory. In this paper, we propose COnstruction-REfinement Sketch (CORE-Sketch) for computing exact MAD. The idea is to construct some sketch within limited space, and gradually refine the sketch to find the MAD element, i.e., the element with distance to the median exactly equal to MAD. Mergeability and convergence of the method are analyzed, ensuring the correctness of the proposal and enabling parallel computation. Extensive experiments demonstrate that CORE-Sketch achieves significantly less space occupation compared to the aforesaid baseline of No-Sketch, and has time and space costs relatively comparable to the DD-Sketch method for approximate MAD.
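The definition of MAD can be written down directly. The sketch below is the naive exact computation that keeps the whole dataset in memory (the costly approach the abstract refers to), not CORE-Sketch; the function name is chosen for this example.

```python
from statistics import median, stdev

def exact_mad(values):
    """Median of the absolute deviations from the median."""
    m = median(values)
    return median([abs(v - m) for v in values])

data = [1, 2, 3, 4, 100]   # one abnormal value
mad = exact_mad(data)
print(mad, stdev(data))
```

For this input the median is 3 and the absolute deviations are 2, 1, 0, 1, and 97, so the MAD is 1 and is unaffected by the outlier, while the standard deviation is dominated by it.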
BP-tree: Overcoming the Point-Range Operation Tradeoff for In-Memory B-trees
Helen Xu (Lawrence Berkeley National Laboratory)*; Amanda Li (Massachusetts Institute of Technology); Brian Wheatman (Johns Hopkins University); Manoj Marneni (University of Utah); Prashant Pandey (University of Utah)
B-trees are the go-to data structure for in-memory indexes in databases and storage systems. B-trees support both point operations (i.e., inserts and finds) and range operations (i.e., iterators and maps). However, there is an inherent tradeoff between point and range operations since the optimal node size for point operations is much smaller than the optimal node size for range operations. Existing implementations use a relatively small node size to achieve fast point operations at the cost of range operation throughput. We present the BP-tree, a variant of the B-tree, that overcomes the decades-old point-range operation tradeoff in traditional B-trees. In the BP-tree, the leaf nodes are much larger in size than the internal nodes to support faster range scans. To avoid any slowdown in point operations due to large leaf nodes, we introduce a new insert-optimized array called the buffered partitioned array (BPA) to efficiently organize data in leaf nodes. The BPA supports fast insertions by delaying ordering the keys in the array. This results in much faster range operations and faster point operations at the same time in the BP-tree. Our experiments show that on 48 hyperthreads, on workloads generated from the Yahoo! Cloud Serving Benchmark (YCSB), the BP-tree supports similar or faster point operation throughput (between 0.94× and 1.2×) compared to Masstree and OpenBw-tree, two state-of-the-art in-memory key-value (KV) stores. On a YCSB workload with short scans, the BP-tree is about 7.4× faster than Masstree and 1.6× faster than OpenBw-tree. Furthermore, we extend the YCSB to add large range workloads, commonly found in database applications, and show that the BP-tree is 30× faster than Masstree and 2.5× faster than OpenBw-tree.
We also provide a reference implementation for a concurrent B+-tree and find that the BP-tree supports faster (by between 1.03× and 1.2×) point operations when compared to the best-case configuration for B+-trees for point operations while supporting similar performance (about 0.95×) on short range operations and faster (about 1.3×) long range operations.
Blink-hash: An Adaptive Hybrid Index for In-Memory Time-Series Databases
Hokeun Cha (University of Wisconsin-Madison)*; Xiangpeng Hao (University of Wisconsin Madison); Tianzheng Wang (Simon Fraser University); Huanchen Zhang (Tsinghua University); Aditya Akella (UT Austin); Xiangyao Yu (University of Wisconsin-Madison)
High-speed data ingestion is critical in time-series workloads that are driven by the growth of Internet of Things (IoT) applications. We observe that traditional tree-based indexes encounter severe scalability bottlenecks for time-series workloads that insert monotonically increasing timestamp keys into an index; all insertions go to a small memory region that sees extremely high contention.
In this work, we present a new index design, Blink-hash, that enhances a tree-based index with hash leaf nodes to mitigate the contention of monotonic insertions: insertions go to random locations within a hash node (which is much larger than a B+-tree node) to reduce conflicts. We develop further optimizations (median approximation and lazy split) to accelerate hash node splits. We also develop structure adaptation optimizations to dynamically convert a hash node to B+-tree nodes for good scan performance. Our evaluation shows that Blink-hash achieves up to 91.3× higher throughput than conventional indexes in a time-series workload that monotonically inserts timestamps into an index, while showing comparable scan performance to a well-optimized B+-tree.
ONe Index for All Kernels (ONIAK): A Zero Re-Indexing LSH Solution to ANNS-ALT (After Linear Transformation)
Jingfan Meng (Georgia Institute of Technology); Huayi Wang (Georgia Institute of Technology); Jun Xu (Georgia Tech); Mitsunori Ogihara (University of Miami)
In this work, we formulate and solve a new type of approximate nearest neighbor search (ANNS) problem called ANNS after linear transformation (ALT). In ANNS-ALT, we search for the vector (in a dataset) that, after being linearly transformed by a user-specified query matrix, is closest to a query vector. It is a very general mother problem in the sense that a wide range of baby ANNS problems that have important applications in databases and machine learning can be reduced to and solved as ANNS-ALT, or its dual that we call ANNS-ALTD. We propose a novel and computationally efficient solution, called ONe Index for All Kernels (ONIAK), to ANNS-ALT and all its baby problems when the data dimension 𝑑 is not too large (say 𝑑 ≤ 200). In ONIAK, a universal index is built, once and for all, for answering all future ANNS-ALT queries that can have distinct query matrices. We show by experiments that, when 𝑑 is not too large, ONIAK has better query performance than linear scan on the mother problem (of ANNS-ALT), and has query performance comparable to that of the state-of-the-art solutions on the baby problems. However, the algorithmic technique behind this universal index approach suffers from a so-called dimension blowup problem that can make the indexing time prohibitively long for a large dataset. We propose a novel algorithmic technique, called fast GOE quadratic form (FGoeQF), that completely solves the (prohibitively long indexing time) fallout of the dimension blowup problem. We also propose a Johnson-Lindenstrauss transform (JLT) based ANNS-ALT (and ANNS-ALTD) solution that significantly outperforms any competitor when 𝑑 is large.
C3
Foundation Models & Databases
Chair: Fei Chiang (McMaster University)
How Large Language Models Will Disrupt Data Management [vision]
Raul Castro Fernandez (The University of Chicago)*; Aaron J Elmore (University of Chicago); Michael J Franklin (University of Chicago); Sanjay Krishnan (U Chicago); Chenhao Tan (University of Chicago)
Large language models (LLMs), such as GPT-4, are revolutionizing software’s ability to understand, process, and synthesize language. The authors of this paper believe that this advance in technology is significant enough to prompt introspection in the data management community, similar to previous technological disruptions such as the advents of the world wide web, cloud computing, and statistical machine learning. We argue that the disruptive influence that LLMs will have on data management will come from two angles. (1) A number of hard database problems, namely, entity resolution, schema matching, data discovery, and query synthesis, hit a ceiling of automation because the system does not fully understand the semantics of the underlying data. Based on large training corpora of natural language, structured data, and code, LLMs have an unprecedented ability to ground database tuples, schemas, and queries in real-world concepts. We will provide examples of how LLMs may completely change our approaches to these problems. (2) LLMs blur the line between predictive models and information retrieval systems with their ability to answer questions. We will present examples showing how large databases and information retrieval systems have complementary functionality.
Can Foundation Models Wrangle Your Data? [vision]
Avanika Narayan (Stanford University)*; Ines Chami (Numbers Station); Laurel Orr (Stanford University); Christopher Ré (Stanford University)
Foundation Models (FMs) are models trained on large corpora of data that, at very large scale, can generalize to new tasks without any task-specific finetuning. As these models continue to grow in size, innovations continue to push the boundaries of what these models can do on language and image tasks. This paper aims to understand an underexplored area of FMs: classical data tasks like cleaning and integration. As a proof-of-concept, we cast five data cleaning and integration tasks as prompting tasks and evaluate the performance of FMs on these tasks. We find that large FMs generalize and achieve SoTA performance on data cleaning and integration tasks, even though they are not trained for these data tasks. We identify specific research challenges and opportunities that these models present, including challenges with private and domain specific data, and opportunities to make data management systems more accessible to non-experts. We make our code and experiments publicly available at: https://github.com/HazyResearch/fm_data_tasks.
CatSQL: Towards Real World Natural Language to SQL Applications
Han Fu (Alibaba Group)*; Chang Liu (Alibaba Group); Bin Wu (Alibaba Group); Feifei Li (Alibaba Group); Jian Tan (Alibaba); Jianling Sun (Zhejiang University)
Natural language to SQL (NL2SQL) techniques provide a convenient interface to access databases for data analytics and non-expert database users. Existing methods for this problem employ either a rule-based approach or a deep learning-backed solution. Rule-based approaches are hard to generalize across different domains. Deep learning-based solutions generalize better across different domains, but they often produce queries with syntactic or semantic errors, which are thus not executable over the underlying database. In this work, we are the first to bridge these two approaches and make novel developments to achieve significantly better performance in terms of both accuracy and runtime. In particular, our solution develops a novel CatSQL sketch, which is a template with empty slots, and a deep learning model to fill in the slots. Compared with existing sequence-to-sequence-based approaches, our sketch-based method does not need to generate keywords that are boilerplate in the template, and thus can achieve better accuracy and run much faster. Compared with previous sketch-based approaches, our CatSQL sketch is more general, being largely equivalent to standard SQL, and our model can leverage the values filled in one slot when filling in another to improve the performance. In addition, we propose the Semantics Correction technique, which is the first technique to leverage database domain knowledge in a deep learning-based NL2SQL solution. Semantics Correction is a post-processing routine that runs over generated SQL queries and employs rules to identify semantic errors and try to fix them. This technique significantly improves the NL2SQL accuracy. We conduct extensive evaluation on both single-domain and cross-domain benchmarks and demonstrate that our approach can significantly outperform all previous approaches in terms of both accuracy and throughput.
In particular, on state-of-the-art NL2SQL benchmarks such as Spider, our CatSQL prototype outperforms the existing state-of-the-art solution by 4 points on accuracy, while achieving up to 63x higher throughput.
C4
Hardware Acceleration
Chair: Kyuseok Shim (Seoul National University)
Excalibur: A Virtual Machine for Adaptive Fine-grained JIT-Compiled Query Execution based on VOILA
Tim Gubner (CWI); Peter Boncz (CWI)*
In recent years, hardware has become increasingly diverse, in terms of features as well as performance. This poses a problem for complex software in general and database systems in particular. To achieve top-notch performance, we need to exploit hardware features, but do not fully know which behave best on the current, and more-so future, machines. Specializing query execution methods for many diverse hardware platforms will significantly increase database software complexity and also poses a physical query optimization problem that cannot be solved robustly with static cost models.
In this paper, we propose an architecture that abstracts these details away. Based on the flexible domain-specific language VOILA, it can generate thousands of different flavors from a single code-base. As an abstraction, a virtual machine (VM) allows hiding physical execution details, such that the VM can transparently switch between different execution tactics within each query, applied at a fine-grained granularity. We show rules to describe a search space for good tactics, and describe efficient search strategies, that limit the overhead of adaptive JIT code generation and compilation. The VM starts executing each query in full vectorized code style, but adaptively replaces (parts of) query pipelines by code fragments compiled using different execution flavors, exploring this search space and exploiting the best tactics found, casting adaptive query execution into a Multi-Armed Bandit (MAB) problem. Excalibur, our prototype, outperforms open-source systems by up to 28x and the state-of-the-art system Umbra by up to 1.8x. In specific queries Excalibur performs up to 2x faster than static flavors.
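To make the Multi-Armed Bandit framing concrete, the sketch below treats each execution flavor as an arm and picks among them with UCB1, a standard bandit strategy. The abstract does not say which bandit algorithm Excalibur uses, so UCB1 is a swapped-in assumption; the function names, the fixed per-chunk costs, and the reward definition (inverse cost) are likewise hypothetical.

```python
import math


def ucb1_choose(counts, rewards, total):
    """Pick an arm: try each arm once, then maximize mean reward plus a bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        rewards[a] / counts[a] + math.sqrt(2.0 * math.log(total) / counts[a])
        for a in range(len(counts))
    ]
    return scores.index(max(scores))


def run_bandit(flavor_costs, rounds):
    """Simulate choosing an execution flavor per data chunk.

    flavor_costs: hypothetical fixed cost of running one chunk with each flavor.
    Returns how often each flavor was chosen.
    """
    counts = [0] * len(flavor_costs)
    rewards = [0.0] * len(flavor_costs)
    for t in range(1, rounds + 1):
        arm = ucb1_choose(counts, rewards, t)
        counts[arm] += 1
        rewards[arm] += 1.0 / flavor_costs[arm]  # cheaper flavor, higher reward
    return counts
```

With costs such as `[4.0, 1.0, 2.0]`, the cheapest flavor ends up chosen most often, while the slower ones are still sampled occasionally in case conditions change.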
Bringing Compiling Databases to RISC Architectures
Ferdinand Gruber (Technical University of Munich)*; Maximilian Bandle (TUM); Alexis Engelke (Technical University of Munich); Thomas Neumann (TUM); Jana Giceva (TU Munich)
Current hardware development greatly influences the design decisions of modern database systems. For many modern performance-focused database systems, query compilation emerged as an integral part and different approaches for code generation evolved, making use of standard compilers, general-purpose compiler libraries, or domain-specific code generators. However, development primarily focused on the dominating x86-64 server architecture but neglected current hardware developments towards other CPU architectures like ARM and other RISC architectures. Therefore, we explore the design space of code generation in database systems considering a variety of state-of-the-art compilation approaches with a set of qualitative and quantitative metrics. Based on our findings, we have developed a new code generator called FireARM for AArch64-based systems in our database system, Umbra. We identify general as well as architecture-specific challenges for custom code generation in databases and provide potential solutions to abstract or handle them. Furthermore, we present an extensive evaluation of different compilation approaches in Umbra on a wide variety of x86-64 and ARM machines. In particular, we compare quantitative performance characteristics such as compilation latency and query throughput. Our results show that using standard languages and compiler infrastructures reduces the barrier to employing query compilation and allows for high performance on big data sets, while domain-specific code generators can achieve a significantly lower compilation overhead and allow for better targeting of new architectures.
Deploying Computational Storage for HTAP DBMSs Takes More Than Just Computation Offloading
Kitaek Lee (Hanyang University); Insoon Jo (Hanyang University); Jaechan Ahn (Hanyang University); Hyuk Lee (Samsung Electronics); Hwang Lee (Samsung Electronics); Woong Sul (Hanyang University); Hyungsoo Jung (Hanyang University)*
Hybrid transactional/analytical processing (HTAP) would overload database systems. To alleviate performance interference between transactions and analytics, recent research pursues the potential of in-storage processing (ISP) using commodity computational storage devices (CSDs). However, in-storage query processing faces technical challenges in HTAP environments. Continuously updated data versions pose two hurdles: (1) data items keep changing, and (2) finding visible data versions incurs excessive data access in CSDs. Such access patterns dominate the cost of query processing, which may hinder the active deployment of CSDs.
This paper addresses the core issues by proposing an analytic offload engine (AIDE) that transforms engine-specific query execution logic into vendor-neutral computation through a canonical interface. At the core of AIDE are the canonical representation of vendor-specific data and the separate management of data locators. It enables any CSD to execute vendor-neutral operations on canonical tuples with separate indexes, regardless of host databases. To eliminate excessive data access, we prescreen the indexes before offloading; thus, host-side prescreening can obviate the need for running costly version searching in CSDs and boost analytics. We implemented our prototype for PostgreSQL and MyRocks, demonstrating that AIDE supports efficient ISP for two databases using the same FPGA logic. Evaluation results show that AIDE improves query latency up to 42x on PostgreSQL and 34x on MyRocks.
Enabling Transparent Acceleration of Big Data Frameworks Using Heterogeneous Hardware
Maria Xekalaki (The University of Manchester); Juan Fumero (The University of Manchester); Athanasios Stratikopoulos (The University of Manchester); Katerina Doka (National Technical University of Athens); Christos Katsakioris (National Technical University of Athens); Constantinos Bitsakos (NTUA); Nectarios Koziris (NTUA); Christos Kotselidis (The University of Manchester)
The ever-increasing demand for high performance Big Data analytics and data processing has paved the way for heterogeneous hardware accelerators, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), to be integrated into modern Big Data platforms. Currently, this integration comes at the cost of programmability since the end-user Application Programming Interfaces (APIs) must be altered to access the underlying heterogeneous hardware. For example, current Big Data frameworks, such as Apache Spark, provide a new API that combines the existing Spark programming model with GPUs. For other Big Data frameworks, such as Flink, the integration of GPUs and FPGAs is achieved via external API calls that bypass their execution models completely. In this paper, we rethink current Big Data frameworks from a systems and programming language perspective, and introduce a novel co-designed approach for integrating hardware acceleration into their execution models. The novelty of our approach is attributed to two key design decisions: a) support for arbitrary User Defined Functions (UDFs), and b) no modifications to the user level API. The proposed approach has been prototyped in the context of Apache Flink, and enables unmodified applications written in Java to run on heterogeneous hardware, such as GPUs and FPGAs, transparently to the users. The performance evaluation of the proposed solution has shown performance speedups of up to 65x on GPUs and 184x on FPGAs for suitable workloads of standard benchmarks and industrial use cases against vanilla Flink running on traditional multi-core CPUs.
C5
Optimizing Queries & Beyond
Chair: Renata Borovica-Gajic (University of Melbourne)
Analyzing the Impact of Cardinality Estimation on Execution Plans in Microsoft SQL Server
Kukjin Lee (Microsoft); Anshuman Dutt (Microsoft Research)*; Vivek Narasayya (Microsoft); Surajit Chaudhuri (Microsoft Research)
Cardinality estimation is widely believed to be one of the most important causes of poor query plans. Prior studies evaluate the impact of cardinality estimation on plan quality on a set of Select-Project-Join queries on PostgreSQL DBMS. Our empirical study broadens the scope of prior studies in significant ways. First, we include complex SQL queries containing group-by, aggregation, outer joins and sub-queries from real-world workloads and industry benchmarks. We evaluate on both row-oriented and column-oriented physical designs. Our empirical study uses Microsoft SQL Server, an industry-strength DBMS with a state-of-the-art query optimizer that is equipped with techniques to optimize such complex queries. Second, we analyze the sensitivity of plan quality to cardinality errors in two ways by: (a) varying the subset of query sub-expressions for which accurate cardinalities are used, and (b) introducing progressively larger cardinality errors. Third, query processing techniques such as bitmap filtering and adaptive join have the potential to mitigate the impact of cardinality estimation errors by reducing the latency of bad plans. We evaluate the importance of accurate cardinalities in the presence of these techniques.
Robust Query Driven Cardinality Estimation under Changing Workloads
Parimarjan Negi (MIT CSAIL)*; Ziniu Wu (Massachusetts Institute of Technology); Andreas Kipf (Amazon Web Services); Nesime Tatbul (Intel Labs and MIT); Ryan Marcus (Brandeis University); Samuel Madden (Massachusetts Institute of Technology); Tim Kraska (Massachusetts Institute of Technology); Mohammad Alizadeh (MIT CSAIL)
Query driven cardinality estimation models learn from a historical log of queries. They are lightweight, having low storage, fast inference and training, and easily adaptable for any kind of query. However, they can get unpredictably bad under workload drift, i.e., if the query pattern or data changes. This makes them unreliable and hard to deploy. We analyze the reasons why models become unpredictable due to workload drift, and introduce modifications to the query representation and neural network training techniques that make them robust to the effects of workload drift. First, we emulate workload drift in queries involving some unseen tables or columns by randomly masking out some table or column features during training. This forces the model to make predictions with missing query information, relying more on robust features based on up-to-date DBMS statistics that are useful even when query or data drift happens. Second, we introduce join bitmaps, which extend sampling-based features to be consistent across joins using ideas from sideways information passing. Finally, we show how both of these ideas can be adapted to handle data updates.
We show significantly greater generalization than past works across different workloads and databases. For instance, a model trained with our techniques on a simple workload (JOBLight-train), with 40K synthetically generated queries of at most 3 tables each, is able to generalize to the much more complex Join Order Benchmark, which includes queries with up to 16 tables, and improve query runtimes by 2x over PostgreSQL. We show similar robustness results with data updates, and across other workloads. We discuss the situations where we expect, and see, improvements, as well as more challenging workload drift scenarios where these techniques do not improve much over PostgreSQL. However, even in the most challenging scenarios, our models do not perform worse than PostgreSQL, while standard query driven models can get much worse than PostgreSQL.
Scaling a Declarative Cluster Manager Architecture with Query Optimization Techniques
Kexin Rong (Georgia Institute of Technology)*; Mihai Budiu (VMware Research); Athinagoras Skiadopoulos (Stanford University); Lalith Suresh (VMware Research); Amy Tai (Google)
Cluster managers play a crucial role in data centers by distributing workloads among infrastructure resources. Declarative Cluster Management (DCM) is a new cluster management architecture that enables users to express placement policies declaratively using SQL-like queries. This paper presents our experiences in scaling this architecture from moderate-sized enterprise clusters (10^2-10^3 nodes) to hyperscale clusters (10^4 nodes) via query optimization techniques. First, we formally specify the syntax and semantics of DCM’s declarative language, C-SQL, a SQL variant used to express constraint optimization problems. We showcase how constraints on the desired state of the cluster system can be succinctly represented as C-SQL programs, and how query optimization techniques like incremental view maintenance and predicate pushdown can enhance the execution of C-SQL programs. We evaluate the effectiveness of our optimizations through a case study of building Kubernetes schedulers using C-SQL. Our optimizations demonstrated an almost 3000x speed up in database latency and reduced the size of optimization problems by as much as 1/300 of the original, without affecting the quality of the scheduling solutions.
Leveraging Application Data Constraints to Optimize Database-Backed Web Applications
Xiaoxuan Liu (UC Berkeley)*; Shuxian Wang (UC Berkeley); Mengzhu Sun (University of California at Berkeley); Sicheng Pan (UC Berkeley); Ge Li (University of California at Berkeley); Siddharth Jha (UC Berkeley); Cong Yan (Microsoft Research); Junwen Yang (The University of Chicago); Shan Lu (University of Chicago); Alvin Cheung (University of California at Berkeley)
Exploiting the relationships among data is a classical query optimization technique.
As persistent data is increasingly being created and maintained programmatically, prior work that infers data relationships from data statistics misses an important opportunity. We present Coco, the first tool that identifies data relationships by analyzing database-backed applications. Once identified, Coco leverages the constraints to optimize the application's physical design and query execution. Instead of developing a fixed set of predefined rewriting rules, Coco employs an enumerate-test-verify technique to automatically exploit the discovered data constraints to improve query execution. Each resulting rewrite is provably equivalent to the original query. Using 14 real-world web applications, our experiments show that Coco can discover numerous data constraints from code analysis and improve real-world application performance significantly.
QueryBooster: Improving SQL Performance Using Middleware Services for Human-Centered Query Rewriting
Qiushi Bai (UC Irvine)*; Sadeem Alsudais (UC Irvine); Chen Li (UC Irvine)
SQL query performance is critical in database applications, and query rewriting is a technique that transforms an original query into an equivalent query with better performance. In a wide range of database-supported systems, there is a unique problem where both the application and database layers are black boxes, and the developers need to use their knowledge about the data and domain to rewrite queries sent from the application to the database for better performance. Unfortunately, existing solutions do not give the users enough freedom to express their rewriting needs. To address this problem, we propose QueryBooster, a novel middleware-based service architecture for human-centered query rewriting, where users can use its expressive and easy-to-use rule language (called VarSQL) to formulate rewriting rules based on their needs. It also allows users to express rewriting intentions by providing examples of the original query and its rewritten query. QueryBooster automatically generalizes them to rewriting rules and suggests high-quality ones. We conduct a user study to show the benefits of VarSQL to formulate rewriting rules. Our experiments on real and synthetic workloads show the effectiveness of the rule-suggesting framework and the significant advantages of using QueryBooster for human-centered query rewriting to improve the end-to-end query performance.
C6
Rethinking Query Optimization & Execution
Chair: Stefanie Scherzinger (University of Passau)
Opportunities for Quantum Acceleration of Databases: Optimization of Queries and Transaction Schedules [vision]
Umut Çalıkyılmaz (University of Lübeck)*; Sven Groppe (University of Lübeck); Jinghua Groppe (Universität zu Lübeck); Tobias Winker (IFIS, University of Lübeck); Stefan Prestel (Quantum Brilliance GmbH); Farida Shagieva (Quantum Brilliance GmbH); Daanish Arya (Quantum Brilliance GmbH); Florian Preis (Quantum Brilliance GmbH); Le Gruenwald (The University of Oklahoma)
The capabilities of quantum computers, such as the number of supported qubits and maximum circuit depth, have grown exponentially in recent years. Commercially relevant applications that take advantage of quantum computing are expected to be available soon. In this paper, we shed light on the possibilities of accelerating database tasks using quantum computing with the examples of optimizing queries and transaction schedules.
Asymptotically Better Query Optimization Using Indexed Algebra
Philipp Fent (TUM)*; Guido Moerkotte (University of Mannheim); Thomas Neumann (TUM)
Query optimization is essential for the efficient execution of queries. The necessary analysis of whether we can and should apply optimizations and transform the query plan is already challenging. Traditional techniques focus on the availability of columns at individual operators, which does not scale for analysis of data flow through the query. Tracking available columns per operator takes quadratic space, which can result in multi-second optimization time for deep algebra trees.
Instead, we need to re-think the naive algebra representation to efficiently support data flow analysis.
In this paper, we introduce Indexed Algebra, a novel representation of relational algebra that makes common optimization tasks efficient. Indexed Algebra enables efficient reasoning with an auxiliary index structure based on link/cut trees that support dynamic updates and queries in O(log n). This approach not only improves the asymptotic complexity, but also allows elegant and concise formulations for the data flow questions needed for query optimization. While large queries see theoretically unbounded improvements, Indexed Algebra also improves optimization time of the relatively harmless queries of TPC-H and TPC-DS by more than 1.8x.
SlabCity: Whole-Query Optimization using Program Synthesis
Rui Dong (University of Michigan)*; Jie Liu (University of Michigan); Yuxuan Zhu (University of Michigan); Cong Yan (Microsoft Research); Barzan Mozafari (University of Michigan); Xinyu Wang (University of Michigan)
Query rewriting is often a prerequisite for effective query optimization, particularly for poorly-written queries. Prior work on query rewriting has relied on a set of “rules” based on syntactic pattern-matching. Whether relying on manual rules or auto-generated ones, rule-based query rewriters are inherently limited in their ability to handle new query patterns. Their success is limited by the quality and quantity of the rules provided to them.
To our knowledge, we present the first synthesis-based query rewriting technique, SlabCity, capable of whole-query optimization without relying on any rewrite rules. SlabCity directly searches the space of SQL queries using a novel query synthesis algorithm that leverages a new concept called query dataflows. We evaluate SlabCity on four workloads, including a newly curated benchmark with more than 1000 real-life queries. We show that not only can SlabCity optimize more queries than state-of-the-art query rewriting techniques, but interestingly, it also leads to queries that are significantly faster than those generated by rule-based systems.
Declarative Sub-Operators for Universal Data Processing
Michael Jungmair (Technical University of Munich)*; Jana Giceva (TU Munich)
Data processing systems face the challenge of supporting increasingly diverse workloads efficiently. At the same time, they are already bloated with internal complexity, and it is not clear how new hardware can be supported sustainably.
In this paper, we aim to resolve these issues by proposing a unified abstraction layer based on declarative sub-operators in addition to relational operators. By exposing this layer to users, they can express their non-relational workloads declaratively with sub-operators. Furthermore, the proposed sub-operators decouple the semantic implementation of operators from the efficient imperative implementation, reducing the implementation complexity for relational operators. Finally, through fine-grained automatic optimizations, the declarative sub-operators allow for automatic morsel-driven parallelism. We demonstrate the benefits not only by providing a specific set of sub-operators but also by implementing them in a compiling query engine. With thorough evaluation and analysis, we show that we can support a richer set of workloads while keeping the development complexity low and being competitive in performance even with specialized systems.
C7
Transaction Processing I
Chair: Michael Abebe (Salesforce)
Fine-Grained Re-Execution for Efficient Batched Commit of Distributed Transactions
Zhiyuan Dong (Shanghai Jiao Tong University)*; Zhaoguo Wang (Shanghai Jiao Tong University); Xiaodong Zhang (Shanghai Jiao Tong University); Xian Xu (SJTU); Changgeng Zhao (New York University); Haibo Chen (Shanghai Jiao Tong University); Aurojit Panda (New York University); Jinyang Li (New York University)
Distributed transaction systems incur extensive cross-node communication to execute and commit serializable OLTP transactions. As a result, their performance greatly suffers. Caching data at nodes that execute transactions can cut down remote reads. Batching transactions for validation and persistence can amortize the communication cost during committing. However, caching and batching can significantly increase the likelihood of conflicts, causing expensive aborts.
In this paper, we develop Hackwrench to address the challenge of caching and batching. Instead of aborting conflicted transactions, Hackwrench tries to repair them using fine-grained re-execution by tracking the dependencies of operations among a batch of transactions. Tracked dependencies allow Hackwrench to selectively invalidate and re-execute only those operations necessary to “fix” the conflict, which is cheaper than aborting and executing an entire batch of transactions. Evaluations using TPC-C and other micro-benchmarks show that Hackwrench can outperform existing commercial and research systems including FoundationDB, Calvin, COCO, and Sundial under comparable settings.
Epoxy: ACID Transactions Across Diverse Data Stores
Peter Kraft (Stanford University)*; Qian Li (Stanford University); Xinjing Zhou (Massachusetts Institute of Technology); Peter D Bailis (Stanford University); Michael Stonebraker (Massachusetts Institute of Technology); Xiangyao Yu (University of Wisconsin-Madison); Matei Zaharia (Berkeley and Databricks)
Developers are increasingly building applications that incorporate multiple data stores, for example to manage heterogeneous data or loosely coupled microservices. Often, these require transactional safety for operations on multiple stores, but few systems support such guarantees. To solve this problem, we introduce Epoxy, a novel protocol for providing transactions across heterogeneous data stores. We make two contributions. First, we develop a novel technique for providing cross-data store transactional isolation by adapting multi-version concurrency control, storing versioning information in record metadata and filtering reads with predicates on metadata so they only see record versions in a global transaction snapshot. Second, we develop an atomic commit protocol that does not require data stores to implement the participant protocol of two-phase commit, requiring only durable writes. We implement Epoxy for five data stores: Postgres, Elasticsearch, MongoDB, Google Cloud Storage, and MySQL. We evaluate it by adapting TPC-C and microservice workloads to a multi-data store environment. We find it has comparable performance to the distributed transaction protocol XA on TPC-C while providing stronger guarantees like transactional isolation, and has overhead of <10% compared to a non-transactional baseline on microservice workloads.
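The metadata-filtering idea can be sketched as follows. This is only an illustration under stated assumptions: the field names `begin_txn` and `end_txn`, the `Snapshot` layout (an upper bound plus a set of in-flight transaction ids), and the in-memory list standing in for a data store are hypothetical and are not taken from the paper. In a real deployment the visibility test would presumably be expressed as a predicate in each store's own query language.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Snapshot:
    xmax: int                # transaction ids >= xmax began after the snapshot
    active: FrozenSet[int]   # ids < xmax that were still uncommitted

    def sees(self, txn_id: int) -> bool:
        # A transaction's effects are in the snapshot only if it had committed.
        return txn_id < self.xmax and txn_id not in self.active


@dataclass
class Version:
    key: str
    value: str
    begin_txn: int                  # transaction that created this version
    end_txn: Optional[int] = None   # transaction that superseded it, if any


def visible(v: Version, snap: Snapshot) -> bool:
    # Visible if its creator is in the snapshot and its deleter is not.
    if not snap.sees(v.begin_txn):
        return False
    return v.end_txn is None or not snap.sees(v.end_txn)


def snapshot_read(store: List[Version], key: str, snap: Snapshot) -> List[str]:
    # The "predicate on metadata": filter record versions by visibility.
    return [v.value for v in store if v.key == key and visible(v, snap)]
```

For example, if transaction 5 replaced a record but was still in flight when the snapshot was taken, a reader using that snapshot keeps seeing the older version.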
Cornus: Atomic Commit for a Cloud DBMS with Storage Disaggregation
Zhihan Guo (University of Wisconsin-Madison)*; Xinyu Zeng (University of Wisconsin-Madison); Kan Wu (University of Wisconsin-Madison); Wuh-Chwen Hwang (University of Wisconsin-Madison); Ziwei Ren (University of Wisconsin-Madison); Xiangyao Yu (University of Wisconsin-Madison); Mahesh Balakrishnan (Microsoft Research); Philip A Bernstein (Microsoft Research)
Two-phase commit (2PC) is widely used in distributed databases to ensure atomicity of distributed transactions. Conventional 2PC was originally designed for the shared-nothing architecture and has two limitations: long latency due to two eager log writes on the critical path, and blocking of progress when a coordinator fails.
Modern cloud-native databases are moving to a storage disaggregation architecture where storage is a shared highly-available service. Our key observation is that disaggregated storage enables protocol innovations that can address both the long-latency and blocking problems. We develop Cornus, an optimized 2PC protocol to achieve this goal. The only extra functionality Cornus requires is an atomic compare-and-swap capability in the storage layer, which many existing storage services already support. We present Cornus in detail and show how it addresses the two limitations. We also deploy it on real storage services including Azure Blob Storage and Redis. Empirical evaluations show that Cornus can achieve up to 1.9x latency reduction over conventional 2PC.
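To make the role of the compare-and-swap requirement concrete, here is a minimal sketch of how a write-once vote log in shared storage could let any node resolve a transaction without waiting for a failed coordinator. The abstract does not spell out the protocol, so everything here is an assumption for illustration: the in-memory `Storage` class, the `log_once` and `resolve` helpers, and the `"VOTE-YES"`/`"ABORT"` strings are hypothetical and are not taken from Cornus itself.

```python
import threading


class Storage:
    """In-memory stand-in for a shared storage service that offers an
    atomic compare-and-swap on individual keys."""

    def __init__(self):
        self._slots = {}
        self._lock = threading.Lock()

    def compare_and_swap(self, key, expected, new):
        """Write `new` only if the slot currently holds `expected`.
        Returns the slot's value after the call."""
        with self._lock:
            current = self._slots.get(key)
            if current == expected:
                self._slots[key] = new
                return new
            return current


def log_once(storage, txn, participant, vote):
    """Record a vote in the participant's slot unless one already exists.
    The first writer wins; later callers see the existing entry."""
    return storage.compare_and_swap((txn, participant), None, vote)


def resolve(storage, txn, participants):
    """Any node can decide the outcome from the logs alone: it tries to
    write ABORT into every slot, which only succeeds for participants
    that have not voted yet."""
    votes = [log_once(storage, txn, p, "ABORT") for p in participants]
    return "COMMIT" if all(v == "VOTE-YES" for v in votes) else "ABORT"
```

Because each slot is written at most once, a late vote from a slow participant cannot overturn a decision that another node has already derived from the logs.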
Nezha: Deployable and High-Performance Consensus Using Synchronized Clocks
Jinkun Geng (Stanford University)*; Anirudh Sivaraman (New York University); Balaji Prabhakar (Stanford University); Mendel Rosenblum (Stanford University)
This paper presents a high-performance consensus protocol, Nezha, which can be deployed by cloud tenants without support from cloud providers. Nezha bridges the gap between protocols such as Multi-Paxos and Raft, which can be readily deployed, and protocols such as NOPaxos and Speculative Paxos, that provide better performance, but require access to technologies such as programmable switches and in-network prioritization, which cloud tenants do not have.
Nezha uses a new multicast primitive called deadline-ordered multicast (DOM). DOM uses high-accuracy software clock synchronization to synchronize sender and receiver clocks. Senders tag messages with deadlines in synchronized time; receivers process messages in deadline order, on or after their deadline.
We compare Nezha with Multi-Paxos, Fast Paxos, Raft, (optimized) NOPaxos, and 2 recent protocols, Domino and TOQ-EPaxos, that use synchronized clocks. In throughput, Nezha outperforms all baselines by a median of 5.4× (range: 1.9–20.9×). In latency, Nezha outperforms five baselines by a median of 2.3× (range: 1.3–4.0×), with one exception: it sacrifices 33% of latency compared with our optimized NOPaxos in one test. We also prototype two applications, a key-value store and a fair-access stock exchange, on top of Nezha to show that Nezha only modestly reduces their performance relative to an unreplicated system.
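The receiver side of deadline-ordered multicast, as described in this abstract, can be sketched with a priority queue: messages are buffered and released in deadline order once the local synchronized clock reaches their deadline. The `DomReceiver` class and its method names are hypothetical, and the sketch assumes clock values are plain numbers; it does not handle messages that arrive after their deadline has passed or any other part of the consensus protocol.

```python
import heapq


class DomReceiver:
    """Buffers incoming messages and releases them in deadline order,
    on or after their deadline."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so equal deadlines keep arrival order

    def receive(self, deadline, payload):
        """Buffer a message tagged by its sender with a deadline."""
        heapq.heappush(self._heap, (deadline, self._seq, payload))
        self._seq += 1

    def release(self, now):
        """Return, in deadline order, every buffered message whose
        deadline is at or before the current synchronized time."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            _, _, payload = heapq.heappop(self._heap)
            ready.append(payload)
        return ready
```

Two receivers that get the same messages in different network orders release them in the same order, provided every message arrives before its deadline.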
C8
Transaction Processing II
Chair: Yongluan Zhou (University of Copenhagen)
TiQuE: Improving the Transactional Performance of Analytical Systems for True Hybrid Workloads
Nuno Faria (INESCTEC & U. Minho)*; José Pereira (U. Minho & INESCTEC); Ana Nunes Alonso (INESC TEC & U.Minho); Ricardo Vilaça (INESC TEC and Universidade do Minho); Yunus Koning (MonetDB Solutions); Niels Nes (MonetDB Solutions)
Transactions have been a key issue in database management for a long time and there are a plethora of architectures and algorithms to support and implement them. The current state-of-the-art is focused on storage management and is tightly coupled with its design, leading, for instance, to the need for completely new engines to support new features such as Hybrid Transactional Analytical Processing (HTAP). We address this challenge with a proposal to implement transactional logic in a query language such as SQL. This means that our approach can be layered on existing analytical systems but that the retrieval of a transactional snapshot and the validation of update transactions runs in the server and can take advantage of advanced query execution capabilities of an optimizing query engine. We demonstrate our proposal, TiQuE, on MonetDB and obtain an average 500x improvement in transactional throughput while retaining good performance on analytical queries, making it competitive with the state-of-the-art HTAP systems.
Fries: Fast and Consistent Runtime Reconfiguration in Dataflow Systems with Transactional Guarantees
Zuozhi Wang (U C IRVINE)*; Shengquan Ni (U C Irvine); Avinash Kumar (U C IRVINE); Chen Li (UC Irvine)
A computing job in a big data system can take a long time to run, especially for pipelined executions on data streams. Developers often need to change the computing logic of the job, such as fixing a loophole in an operator or replacing the machine learning model in an operator with a cheaper one to handle a sudden increase in the data-ingestion rate. Recently, many systems have started supporting runtime reconfigurations to allow this type of change on the fly without killing and restarting the execution. While the delay in reconfiguration is critical to performance, existing systems use epochs to do runtime reconfigurations, which can cause a long delay. In this paper we develop a new technique called Fries that leverages the emerging availability of fast control messages in many systems, since these messages can be sent without being blocked by data messages. We formally define consistency in runtime reconfigurations, and develop a Fries scheduler with consistency guarantees. The technique not only works for different classes of dataflows, but also works for parallel executions and supports fault tolerance. Our extensive experimental evaluation on clusters shows the advantages of this technique compared to epoch-based schedulers.
Scalable and Robust Snapshot Isolation for High-Performance Storage Engines
Adnan Alhomssi (Friedrich-Alexander-Universität Erlangen-Nürnberg)*; Viktor Leis (Technische Universität München)
MVCC-based snapshot isolation promises that read queries can proceed without interfering with concurrent writes. However, as we show experimentally, in existing implementations a single long-running query can easily cause transactional throughput to collapse. Moreover, existing out-of-memory commit protocols fail to meet the scalability needs of modern multi-core systems. In this paper, we present three complementary techniques for robust and scalable snapshot isolation in out-of-memory systems. First, we propose a commit protocol that minimizes cross-thread communication for better scalability, avoids touching the write set on commit, and enables efficient fine-granular garbage collection. Second, we introduce the Graveyard Index, an auxiliary data structure that moves logically-deleted tuples out of the way of operational transactions. Third, we present an adaptive version storage scheme that enables fast garbage collection and improves scan performance of frequently-modified tuples. All techniques are engineered to scale well on multi-core processors, and together enable robust performance for complex hybrid workloads.
D1
Modern Memory & Storage I
Chair: Srini V. Srinivasan (Aerospike)
WiscSort: External Sorting For Byte-Addressable Storage
Vinay Banakar (University of Wisconsin Madison)*; Kan Wu (Google); Yuvraj Patel (University of Edinburgh); Kimberly Keeton (Google); Andrea C Arpaci-Dusseau (University of Wisconsin-Madison); Remzi H Arpaci-Dusseau (University of Wisconsin-Madison)
We present WiscSort, a new approach to high-performance concurrent sorting for existing and future byte-addressable storage (BAS) devices. WiscSort carefully reduces writes, exploits random reads by splitting keys and values during sorting, and performs interference-aware scheduling with thread pool sizing to avoid I/O bandwidth degradation. We introduce the BRAID model which encompasses the unique characteristics of BAS devices. Many state-of-the-art sorting systems do not comply with the BRAID model and deliver sub-optimal performance, whereas WiscSort demonstrates the effectiveness of complying with BRAID. We show that WiscSort is 2-7x faster than competing approaches on a standard sort benchmark. We evaluate the effectiveness of key-value separation on different key-value sizes and compare our concurrency optimizations with various other concurrency models. Finally, we emulate generic BAS devices and show how our techniques perform well with various combinations of hardware properties.
LRU-C: Parallelizing Database I/Os for Flash SSDs
Bo-Hyun Lee (Sungkyunkwan University); Mijin An (Sungkyunkwan University); Sang-Won Lee (Sungkyunkwan University)*
The conventional database buffer managers have two inherent sources of I/O serialization: read stall and mutex conflict. The serialized I/O makes storage and CPU under-utilized, limiting transaction throughput and latency. Such harm stands out on flash SSDs with asymmetric read-write speed and abundant I/O parallelism.
To make database I/Os parallel and thus leverage the parallelism in flash SSDs, we propose a novel approach to database buffering, the LRU-C method. It introduces the LRU-C pointer that points to the least-recently-used-clean page in the LRU list. Upon a page miss, LRU-C selects the current LRU-clean page as a victim and adjusts the pointer to the next LRU-clean one in the LRU list. This way, LRU-C can avoid the I/O serialization of read stalls. The LRU-C pointer enables two further optimizations for higher I/O throughput: dynamic-batch-write and parallel LRU-list manipulation. The former allows the background flusher to write more dirty pages at a time, while the latter mitigates two mutex-induced I/O serializations. Experiment results from running OLTP workloads using MySQL-based LRU-C prototype on flash SSDs show that it improves transaction throughput over the Vanilla MySQL and the state-of-the-art WAR solution by 3x and 1.52x, respectively, and also cuts the tail latency drastically. Though LRU-C might compromise the hit ratio slightly, its increased I/O throughput far offsets the reduced hit ratio.
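The victim-selection idea in this abstract, evicting the least-recently-used clean page so a page miss never waits on a dirty-page write-back, can be sketched as follows. This is a simplified illustration rather than the authors' design: the `BufferPool` class is hypothetical, it finds the LRU-clean page by scanning from the cold end instead of maintaining a dedicated pointer, and it omits the background flusher and the batching and latching optimizations.

```python
from collections import OrderedDict


class BufferPool:
    """Toy buffer pool: pages are kept in LRU order (coldest first) and
    each page carries a dirty flag."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> dirty flag

    def access(self, page_id, write=False):
        """Touch a page; returns the evicted page id, or None."""
        if page_id in self.pages:
            self.pages[page_id] = self.pages[page_id] or write
            self.pages.move_to_end(page_id)  # now most recently used
            return None
        victim = None
        if len(self.pages) >= self.capacity:
            victim = self._pick_victim()
            del self.pages[victim]
        self.pages[page_id] = write
        return victim

    def _pick_victim(self):
        """Prefer the least-recently-used clean page, which can be
        dropped without any write I/O."""
        for page_id, dirty in self.pages.items():
            if not dirty:
                return page_id
        # No clean page left: fall back to plain LRU. A real system
        # would have to write this page back before reusing the frame.
        return next(iter(self.pages))
```

For example, with a capacity of three, a dirty page 1 followed by clean pages 2 and 3 causes a miss on page 4 to evict page 2, even though page 1 is colder.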
WALTZ: Leveraging Zone Append to Tighten the Tail Latency of LSM Tree on ZNS SSD
Jongsung Lee (Seoul National University)*; Donguk Kim (Seoul National University); Jae W. Lee (Seoul National University)
We propose WALTZ, an LSM tree-based key-value store on the emerging Zoned Namespace (ZNS) SSD. The key contribution of WALTZ is to leverage the zone append command, which is a recent addition to ZNS SSD specifications, to provide tight tail latency. The long tail latency problem caused by the merging process of multiple parallel writes, called batch-group writes, is effectively addressed by the internal synchronization mechanism of ZNS SSD. To provide fast failover when the active zone becomes full for a write-ahead log (WAL) file during parallel append, WALTZ introduces a mechanism for WAL zone replacement and reservation. Finally, lazy metadata management allows a put query to be processed fast without requiring any other synchronizations to enable lock-free execution of individual append commands. For evaluation we use both microbenchmarks (db_bench) with varying read/write ratios and key skewnesses, and realistic social-graph workloads (MixGraph from Facebook). Our evaluation demonstrates geomean reduction of tail latency by 2.19x and 2.45x for db_bench and MixGraph, respectively, with a maximum reduction of 3.02x and 4.73x. As a side effect of eliminating the overhead of batch-group writes, WALTZ also improves the query throughput (QPS) by up to 11.7%.
FlashAlloc: Dedicating Flash Blocks By Objects
Jonghyeok Park (Hankuk University of Foreign Studies); Soyee Choi (SungKyunKwan University); Gihwan Oh (Sungkyunkwan University); Soojun Im (Samsung Electronics); Moon-Wook Oh (Samsung Electronics); Sang-Won Lee (Sungkyunkwan University)*
For a write request, today’s flash storage cannot distinguish the logical object it comes from (e.g., SSTables in RocksDB). In such object-oblivious flash devices, concurrent writes from different objects are simply packed in their arrival order to flash memory blocks; hence objects with different lifetimes are multiplexed onto the same flash blocks. This multiplexing incurs write amplification, worsening the performance.
To tackle the multiplexing problem, we propose a novel interface for flash storage, FlashAlloc. It is used to pass the logical address ranges of objects to the underlying flash device and thus to let the device stream writes by object. The object-aware flash storage can now de-multiplex concurrent writes from multiple objects with distinct deathtimes into per-object dedicated flash blocks. In essence, the interface enables fine-grained, per-object write streaming. Given that popular data stores tend to separate writes by logical objects, we can achieve, compared to the existing solutions, transparent streaming just by calling FlashAlloc upon object creation. Also, FlashAlloc adapts to workload changes and relieves stream conflicts in multi-tenant environments.
Our experimental results using an open-source SSD prototype demonstrate that FlashAlloc can reduce the device-level write amplification factor (WAF) under RocksDB, F2FS, and MySQL by 1.5, 2.5, and 0.3, respectively and improve their throughput by 2.7x, 1.8x, and 1.2x, respectively. Also, FlashAlloc can mitigate the WAF interference among tenants: when running RocksDB and MySQL together on the same SSD, FlashAlloc reduced WAF from 2.5 to 1.6 and doubled their throughputs.
Write-Aware Timestamp Tracking: Effective and Efficient Page Replacement for Modern Hardware
Demian E Vöhringer (Friedrich-Alexander-Universität Erlangen-Nürnberg)*; Viktor Leis (Technische Universität München)
In this paper, we revisit the classical data management problem of page replacement. We propose Write-Aware Timestamp Tracking (WATT), a novel replacement algorithm that is optimized for modern hardware. By explicitly tracking the access history of each cached page, WATT achieves state-of-the-art replacement effectiveness. WATT is also carefully co-designed with modern multi-core CPUs and can be implemented with very low overhead. Finally, WATT allows trading of read versus write I/O operations, which is useful for prolonging flash SSD lifetime.
D2
Modern Memory & Storage II
Chair: Peter Boncz (CWI)
When Database Meets New Storage Devices: Understanding and Exposing Performance Mismatches via Configurations [eab]
Haochen He (National University of Defense Technology)*; Erci Xu (NUDT); Shanshan Li (National University of Defense Technology); Zhouyang Jia (National University of Defense Technology); Si Zheng (National University of Defense Technology); Yue Yu (National University of Defense Technology); Jun Ma (National University of Defense Technology); Xiangke Liao (School of Computer Science,National University of Defense Technology)
NVMe SSDs greatly boost I/O speed, with up to GB/s throughput and microsecond-level latency. Unfortunately, DBMS users often find that their high-performance storage devices deliver less-than-expected performance, or even perform worse than their traditional peers. While many works focus on proposing new DBMS designs to fully exploit NVMe SSDs, few systematically study the symptoms, root causes and possible detection methods of such performance mismatches on existing databases. In this paper, we start with an empirical study where we systematically expose and analyze the performance mismatches on six popular databases via controlled configuration tuning. From the study, we find that all six databases can suffer from performance mismatches. Moreover, we conclude that the root causes can be categorized as databases' unawareness of new storage devices' characteristics in I/O size, I/O parallelism and I/O sequentiality. We report 17 mismatches to developers and 15 are confirmed. Additionally, we find that testing all configuration knobs is inefficient. Therefore, we propose a fast performance mismatch detection framework, and our evaluation shows that the framework achieves a speedup of two orders of magnitude over the baseline without sacrificing effectiveness.
What Modern NVMe Storage Can Do, And How To Exploit It: High-Performance I/O for High-Performance Storage Engines
Gabriel Haas (TUM)*; Viktor Leis (Technische Universität München)
NVMe SSDs based on flash are cheap and offer high throughput. Combining several of these devices into a single server enables 10 million I/O operations per second or more. Our experiments show that existing out-of-memory database systems and storage engines achieve only a fraction of this performance. In this work, we demonstrate that it is possible to close the performance gap between hardware and software through an I/O optimized storage engine design. In a heavy out-of-memory setting, where the dataset is 10 times larger than main memory, our system can achieve more than 1 million TPC-C transactions per second.
NVM: Is it Not Very Meaningful for Databases? [eab]
Dimitrios Koutsoukos (ETHZ)*; Raghav Bhartia (ETH); Michal Friedman (ETH); Ana Klimovic (ETH Zurich); Gustavo Alonso (ETHZ)
Persistent or Non Volatile Memory (PMEM) offers expanded memory capacity and faster access to persistent storage. However, there is no comprehensive empirical analysis of existing database engines under different PMEM modes, to understand how databases can benefit from the various hardware configurations. To this end, we analyze multiple different engines under common benchmarks with PMEM in AppDirect mode and Memory mode. Our results show that PMEM in Memory mode does not offer any clear performance advantage despite the larger volatile memory capacity. Also, using PMEM as persistent storage usually speeds up query execution, but with some caveats as the I/O path is not fully optimized and therefore does not always justify the additional cost. We show this to be the case through a comprehensive evaluation of different engines and database configurations under different workloads.
NV-SQL: Boosting OLTP Performance with Non-Volatile DIMMs
Mijin An (Sungkyunkwan University); Jonghyeok Park (Hankuk University of Foreign Studies); Tianzheng Wang (Simon Fraser University); Beomseok Nam (Sungkyunkwan University); Sang-Won Lee (Sungkyunkwan University)*
When running OLTP workloads, relational DBMSs with flash SSDs still suffer from the durability overhead. Heavy writes to SSD not only limit the performance but also shorten the storage lifespan. To mitigate the durability overhead, this paper proposes a new database architecture, NV-SQL. NV-SQL aims at absorbing a large fraction of the writes from DRAM to SSD by introducing NVDIMM into the memory hierarchy as a durable write cache. On the new architecture, NV-SQL makes two technical contributions. First, it proposes a re-update interval-based admission policy that determines which write-hot pages qualify for being cached in NVDIMM. It is novel in that the page hotness is based solely on pages' LSN. Second, this study finds that NVDIMM-resident pages can violate the page action consistency upon crash and proposes how to detect inconsistent pages using a per-page in-update flag and how to rectify them using the redo log. NV-SQL demonstrates how ARIES-like logging and recovery techniques can be elegantly extended to support the caching and recovery for NVDIMM data. Additionally, by placing the write-intensive redo buffer and DWB in NVDIMM, NV-SQL eliminates the log-force-at-commit and WAL protocols and further halves the writes to the storage. Our NV-SQL prototype running with a real NVDIMM device outperforms the same-priced vanilla MySQL with larger DRAM severalfold in terms of transaction throughput for write-intensive OLTP benchmarks. This confirms that NV-SQL is a cost-performance efficient solution to the durability problem.
D3
Sketching & Streaming
Chair: Alkis Simitsis (Athena Research and Innovation Center)
Dalton: Learned Partitioning for Distributed Data Streams
Eleni Zapridou (EPFL)*; Ioannis Mytilinis (EPFL); Anastasia Ailamaki (EPFL)
To sustain the input rate of high-throughput streams, modern stream processing systems rely on parallel execution. However, skewed data yield imbalanced load assignments and create stragglers that hinder scalability. Deciding on a static partitioning for a given set of "hot" keys is not sufficient as these keys are not known in advance, and even worse, the data distribution can change unpredictably. Existing algorithms either optimize for a specific distribution or, in order to adapt, assume a centralized partitioner that processes every incoming tuple and observes the whole workload. However, this is not realistic in a distributed environment, where multiple parallel upstream operators exist, as the centralized partitioner itself becomes the bottleneck and limits scalability.
In this work, we propose Dalton: a lightweight, adaptive, yet scalable partitioning operator that relies on reinforcement learning. By memoizing state and dynamically keeping track of recent experience, Dalton: i) adjusts its policy at runtime and quickly adapts to the workload, ii) avoids redundant computations and minimizes the per-tuple partitioning overhead, and iii) efficiently scales out to multiple instances that learn cooperatively and converge to a joint policy. Our experiments indicate that Dalton scales regardless of the input data distribution and sustains 1.3x - 6.7x higher throughput than existing approaches.
Efficient framework for operating on data sketches
Jakub Lemiesz (Wrocław University of Science and Technology)*
We study the problem of analyzing massive data streams based on concise data sketches. Recently, a number of papers have investigated how to estimate the results of set-theory operations based on sketches. In this paper we present a framework that makes it possible to estimate the result of any sequence of set-theory operations.
The starting point for our solution is the solution from 2021. Compared to this solution, the newly presented sketching algorithm is much more computationally efficient as it requires on average O(log n) rather than O(n) comparisons for n stream elements. We also show that the estimator dedicated to sketches proposed in that reference solution is, in fact, a maximum likelihood estimator.
Optimistic Data Parallelism for FPGA-Accelerated Sketching
Martin Kiefer (TU Berlin)*; Ilias Poulakis (TU Berlin); Eleni Tzirita Zacharatou (IT University of Copenhagen); Volker Markl (Technische Universität Berlin)
Sketches are a popular approximation technique for large datasets and high-velocity data streams. While custom FPGA-based hardware has shown admirable throughput at sketching, the state-of-the-art exploits data parallelism by fully replicating resources and constructing independent summaries for every parallel input value. We consider this approach pessimistic, as it guarantees constant processing rates by provisioning resources for the worst case.
We propose a novel optimistic sketching architecture for FPGAs that partitions a single sketch into multiple independent banks shared among all input values, thus significantly reducing resource consumption. However, skewed input data distributions can result in conflicting accesses to banks and impair the processing rate. To mitigate the effect of skew, we add mergers that exploit temporal locality by combining recent updates. Our evaluation shows that an optimistic architecture is feasible and reduces the utilization of critical FPGA resources proportionally to the number of parallel input values. We further show that FPGA accelerators provide up to 2.6x higher throughput than a recent CPU and GPU, while larger sketch sizes enabled by optimistic architectures improve accuracy by up to an order of magnitude in a realistic sketching application.
Out-of-Order Sliding-Window Aggregation with Efficient Bulk Evictions and Insertions
Kanat Tangwongsan (Mahidol University International College); Martin Hirzel (IBM Research)*; Scott Schneider (Meta)
Sliding-window aggregation is a foundational stream processing primitive that efficiently summarizes recent data. The state-of-the-art algorithms for sliding-window aggregation are highly efficient when stream data items are evicted or inserted one at a time, even when some of the insertions occur out-of-order. However, real-world streams are often not only out-of-order but also bursty, causing data items to be evicted or inserted in larger bulks. This paper introduces a new algorithm for sliding-window aggregation with bulk eviction and bulk insertion. For the special case of single insert and evict, our algorithm matches the theoretical complexity of the best previous out-of-order algorithms. For the case of bulk evict, our algorithm improves upon the theoretical complexity of the best previous algorithm for that case and also outperforms it in practice. For the case of bulk insert, there are no prior algorithms, and our algorithm improves upon the naive approach of emulating bulk insert with a loop over single inserts, both in theory and in practice. Overall, this paper makes high-performance algorithms for sliding window aggregation more broadly applicable by efficiently handling the ubiquitous cases of out-of-order data and bursts.
High-Performance Row Pattern Recognition Using Joins
Erkang Zhu (Microsoft Research)*; Silu Huang (Microsoft Research); Surajit Chaudhuri (Microsoft Research)
The SQL standard introduced MATCH_RECOGNIZE in 2016 for row pattern recognition. Since then, MATCH_RECOGNIZE has been supported by several leading relational systems, which implement this function using a Non-Deterministic Finite Automaton (NFA). While NFA is suitable for pattern recognition in streaming scenarios, the current uses of NFA by the relational systems for historical data analysis scenarios overlook important optimization opportunities. We propose a new approach to use Join to speed up row pattern recognition in historical analysis scenarios for relational systems. Implemented as a logical plan rewrite rule, the new approach first filters the input relation to MATCH_RECOGNIZE using Joins constructed based on a subset of symbols taken from the PATTERN expression, then runs the NFA-based MATCH_RECOGNIZE on the filtered rows, reducing the net cost. The rule also includes a specialized cardinality model for the Joins and a cost model for the NFA-based MATCH_RECOGNIZE operator for choosing an appropriate symbol set. The rewrite rule is applicable when the query pattern’s definition is self-contained and either the input table has no duplicates or there is a window condition. Applying the rewrite rule to a query benchmark with 1,800 queries spanning over 6 patterns and 3 pattern definitions, we observed median speedups of 5.4× on Trino (v373 with ORC files on Hive), 57.5× on SQL Server (2019) using column store and 41.6× on row store.
D4
Benchmarking & Performance I
Chair: Shi Qiao (SmartApps)
HMAB: Self-Driving Hierarchy of Bandits for Integrated Physical Database Design Tuning
R. Malinga Perera (University of Melbourne)*; Bastian Oetomo (University of Melbourne); Benjamin I. P. Rubinstein (University of Melbourne); Renata Borovica-Gajic (University of Melbourne)
Effective physical database design tuning requires selection of several physical design structures (PDS), such as indices and materialised views, whose combination influences overall system performance in a non-linear manner. While the simplicity of combining the results of iterative searches for individual PDSs may be appealing, such a greedy approach may yield vastly suboptimal results compared to an integrated search. We propose a new self-driving approach (HMAB) based on hierarchical multi-armed bandit learners, which can work in an integrated space of multiple PDS while avoiding the full cost of combinatorial search. HMAB eschews the optimiser cost misestimates by direct performance observations through a strategic exploration, while carefully leveraging its knowledge to prune the less useful exploration paths. As an added advantage, HMAB comes with a provable guarantee on its expected performance. To the best of our knowledge, this is the first learned system to tune both indices and materialised views in an integrated manner. We find that our solution enjoys superior empirical performance relative to state-of-the-art commercial physical database design tools that search over the integrated space of materialised views and indices. Specifically, HMAB achieves up to 96% performance gain over a state-of-the-art commercial physical database design tool when running industrial benchmarks.
M2Bench: A Database Benchmark for Multi-Model Analytic Workloads [eab]
Bogyeong Kim (Seoul National University); Kyoseung Koo (Seoul National University); Undraa Enkhbat (Seoul National University); Sohyun Kim (Seoul National University Database System Lab); Juhun Kim (Seoul National University); Bongki Moon (Seoul National University)*
As the world becomes increasingly data-centric, the tasks dealt with by a database management system (DBMS) become more complex and diverse. Compared with traditional workloads that typically require only a single data model, modern-day computational tasks often involve multiple data sources and rely on more than one data model. Unfortunately, however, there is currently no standard benchmark program that can evaluate a DBMS in the various aspects of multi-model databases, especially when the array data model is concerned. In this paper, we propose M2Bench, a new benchmark program capable of evaluating a multi-model DBMS that supports several important data models such as relational, document-oriented, property graph, and array models. M2Bench consists of multi-model workloads that are inspired by real-world problems. Each task of the workload mimics a real-life scenario where at least two different models of data are involved. To demonstrate the efficacy of M2Bench, we evaluated polyglot or multi-model database systems with the M2Bench workloads and unfolded the diverse characteristics of the database systems for each data model.
Analyzing Vectorized Hash Tables Across CPU Architectures [eab]
Maximilian Böther (ETH Zurich)*; Lawrence Benson (HPI, University of Potsdam); Ana Klimovic (ETH Zurich); Tilmann Rabl (HPI, University of Potsdam)
Data processing systems often leverage vector instructions to achieve higher performance. When applying vector instructions, an often overlooked data structure is the hash table, even though it is fundamental in data processing systems for operations such as indexing, aggregating, and joining. In this paper, we characterize and evaluate three fundamental vectorized hashing schemes, vectorized linear probing (VLP), vectorized fingerprinting (VFP), and bucket-based comparison (BBC). We implement these hashing schemes on the x86, ARM, and Power CPU architectures, as modern database systems must provide efficient implementations for multiple platforms due to the continuously increasing hardware heterogeneity. We present various implementation variants and platform-specific optimizations, which we evaluate for integer keys, string keys, large payloads, skewed distributions, and multiple threads. Our extensive evaluation and comparison to three scalar hashing schemes on four servers shows that BBC outperforms scalar linear probing by a factor of more than 2x, while also scaling well to high load factors. We find that vectorized hashing schemes come with caveats that need to be considered, such as the increased engineering overhead, differences between CPUs, and differences between vector ISAs, such as AVX and AVX-512, which impact performance. We conclude with key findings for vectorized hashing scheme implementations.
TSM-Bench: Benchmarking Time Series Database Systems for Monitoring Applications [eab]
Abdelouahab Khelifati (University of Fribourg)*; Mourad Khayati (University of Fribourg); Anton Dignös (Free University of Bozen-Bolzano); Djellel Difallah (New York University); Philippe Cudré-Mauroux (University of Fribourg)
Time series databases are essential for the large-scale deployment of many critical industrial applications. In infrastructure monitoring, for instance, a database system must be able to process large amounts of sensor data in real-time, execute continuous queries, and handle sophisticated analytical queries such as anomaly detection or forecasting. Several benchmarks have been proposed to evaluate and understand how existing systems and design choices handle specific use cases and workloads. Unfortunately, none of them fully covers the peculiar requirements of monitoring applications. Furthermore, they fall short of providing an automated way to generate representative real-world data and workloads for testing and evaluating these systems.
We present TSM-Bench, a benchmark tailored for time series database systems used in monitoring applications. Our key contributions consist of (1) representative queries that meet the requirements that we collected from a water monitoring use case, and (2) a new scalable data generator method based on Generative Adversarial Networks (GAN) and Locality Sensitive Hashing (LSH). We demonstrate, through an extensive set of experiments, how TSM-Bench provides a comprehensive evaluation of the performance of seven leading time series database systems while offering a detailed characterization of their capabilities and trade-offs.
VeriBench: Analyzing the Performance of Database Systems with Verifiability [eab]
Cong Yue (National University of Singapore); Meihui Zhang (Beijing Institute of Technology); Changhao Zhu (Beijing Institute of Technology); Gang Chen (Zhejiang University); Dumitrel Loghin (National University of Singapore); Beng Chin Ooi (NUS)*
Database systems have been paying more attention to data security in recent years. Immutable systems such as blockchains, verifiable databases, and ledger databases are equipped with various verifiability mechanisms to protect data. Such systems often adopt different threat models and techniques, and therefore have different performance implications compared to traditional database systems. So far, there is no uniform benchmarking tool for evaluating the performance of these systems, especially at the level of verification functions. In this paper, we first survey the design space of verifiability-enabled database systems along five dimensions: threat model, authenticated data structure (ADS), query processing, verification, and audit. Based on this survey, we design and implement VeriBench, a benchmark framework for verifiability-enabled database systems. VeriBench enables a fair comparison of systems designed with different underlying technologies that share the client-side verification scheme, and focuses on design space exploration to provide a deeper understanding of different system design choices. VeriBench incorporates micro- and macro-benchmarks to provide a comprehensive evaluation. Further, VeriBench is designed to enable easy extension for benchmarking new systems and workloads. We run VeriBench to conduct a comprehensive analysis of state-of-the-art systems comprising blockchains, ledger databases, and log transparency technologies. The results expose the weaknesses and strengths of each underlying design choice, and the insights should serve as guidance for future development.
D5
Benchmarking & Performance II
Chair: Chenhao Ma (Chinese University of Hong Kong, Shenzhen)
A Deep Dive into Common Open Formats for Analytical DBMSs [eab]
Chunwei Liu (Massachusetts Institute of Technology)*; Anna Pavlenko (Microsoft Gray Systems Lab); Matteo Interlandi (Microsoft); Brandon Haynes (Microsoft Gray Systems Lab)
This paper evaluates the suitability of Apache Arrow, Parquet, and ORC as formats for subsumption in an analytical DBMS. We systematically identify and explore the high-level features that are important to support efficient querying in modern OLAP DBMSs and evaluate the ability of each format to support these features. We find that each format has trade-offs that make it more or less suitable for use as a format in a DBMS, and identify opportunities to more holistically codesign a unified in-memory and on-disk data representation. Our hope is that this study can be used as a guide for system developers designing and using these formats, as well as provide the community with directions to pursue for improving these common open formats.
The LDBC Social Network Benchmark: Business Intelligence Workload [eab]
Gábor Szárnyas (CWI)*; Jack Waudby (Newcastle University); Benjamin A. Steer (pometry); Dávid Szakállas (LDBC); Altan Birler (TUM); Mingxi Wu (TigerGraph); Yuchen Zhang (TigerGraph); Peter Boncz (CWI)
The Social Network Benchmark's Business Intelligence workload (SNB BI) is a comprehensive graph OLAP benchmark targeting analytical data systems capable of supporting graph workloads. This paper marks the finalization of almost a decade of research in academia and industry via the Linked Data Benchmark Council (LDBC). SNB BI advances the state of the art in synthetic and scalable analytical database benchmarks in many aspects. Its base is a sophisticated data generator, implemented on a scalable distributed infrastructure, that produces a social graph with small-world phenomena, whose value properties follow skewed and correlated distributions and where values correlate with structure. This is a temporal graph where all nodes and edges follow lifespan-based rules with temporal skew, enabling realistic and consistent temporal inserts and (recursive) deletes. The query workload exploiting this skew and correlation is based on LDBC's "choke point"-driven design methodology and will entice technical and scientific improvements in future (graph) database systems. SNB BI includes the first adoption of "parameter curation" in an analytical benchmark, a technique that ensures stable runtimes of query variants across different parameter values. Two performance metrics characterize peak single-query performance (power) and sustained concurrent query throughput. To demonstrate the portability of the benchmark, we present experimental results on a relational and a graph DBMS. Note that these do not constitute an official LDBC Benchmark Result -- only audited results can use this trademarked term.
Cloud Analytics Benchmark [eab]
Alexander van Renen (Friedrich-Alexander-Universität Erlangen-Nürnberg)*; Viktor Leis (Technische Universität München)
The cloud facilitates the transition to a service-oriented perspective. This affects cloud-native data management in general, and data analytics in particular. Instead of managing a multi-node database cluster on-premise, end users simply send queries to a managed cloud data warehouse and receive results. While this is obviously very attractive for end users, database system architects still have to engineer systems for this new service model. There are currently many competing architectures, ranging from self-hosted (Presto, PostgreSQL) through managed (Snowflake, Amazon Redshift) to query-as-a-service (Amazon Athena, Google BigQuery) offerings. Benchmarking these architectural approaches is currently difficult, and it is not even clear what the metrics for a comparison should be.
To overcome these challenges, we first analyze a real-world query trace from Snowflake and compare its properties to those of TPC-H and TPC-DS. Doing so, we identify important differences that distinguish traditional benchmarks from real-world cloud data warehouse workloads. Based on this analysis, we propose the Cloud Analytics Benchmark (CAB). By incorporating workload fluctuations and multi-tenancy, CAB allows evaluating different designs in terms of user-centered metrics such as cost and performance.
D6
Cloud DB & Parallelism
Chair: Eric Lo (Chinese University of Hong Kong)
InfiniStore: Elastic Serverless Cloud Storage
Jingyuan Zhang (George Mason University)*; Ao Wang (George Mason University); Xiaolong Ma (University of Nevada, Reno); Benjamin Carver (George Mason University); Nicholas John Newman (George Mason University); Ali Anwar (University of Minnesota); Lukas Rupprecht (IBM Research); Vasily Tarasov (IBM Research); Dimitrios Skourtis (Redpanda Data); Feng Yan (University of Houston); Yue Cheng (University of Virginia)
Cloud object storage such as AWS S3 is cost-effective and highly elastic but relatively slow, while high-performance cloud storage such as AWS ElastiCache is expensive and provides limited elasticity. We present a new cloud storage service called ServerlessMemory, which stores data using the memory of serverless functions. ServerlessMemory employs a sliding-window-based memory management strategy, inspired by the garbage collection mechanisms used in programming languages, to effectively segregate hot/cold data, and provides fine-grained elasticity, good performance, and a pay-per-access cost model with extremely low cost.
We then design and implement InfiniStore, a persistent and elastic cloud storage system, which seamlessly couples the function-based ServerlessMemory layer with a persistent, inexpensive cloud object store layer. InfiniStore enables durability despite function failures using a fast parallel recovery scheme built on the autoscaling functionality of a FaaS (Function-as-a-Service) platform. We evaluate InfiniStore extensively using both microbenchmarking and two real-world applications. Results show that InfiniStore has more performance benefits for objects larger than 10 MB compared to AWS ElastiCache and Anna, and InfiniStore achieves 26.25% and 97.24% tenant-side cost reduction compared to InfiniCache and ElastiCache, respectively.
Exploiting Cloud Object Storage for High-Performance Analytics
Dominik Durner (TUM)*; Viktor Leis (Technische Universität München); Thomas Neumann (TUM)
Elasticity of compute and storage is crucial for analytical cloud database systems. All cloud vendors provide disaggregated object stores, which can be used as storage backend for analytical query engines. Until recently, local storage was unavoidable to process large tables efficiently due to the bandwidth limitations of the network infrastructure in public clouds. However, the gap between remote network and local NVMe bandwidth is closing, making cloud storage more attractive. This paper presents a blueprint for performing efficient analytics directly on cloud object stores. We derive cost- and performance-optimal retrieval configurations for cloud object stores with the first in-depth study of this foundational service in the context of analytical query processing. For achieving high retrieval performance, we present AnyBlob, a novel download manager for query engines that optimizes throughput while minimizing CPU usage. We discuss the integration of high-performance data retrieval in query engines and demonstrate it by incorporating AnyBlob in our database system Umbra. Our experiments show that even without caching, Umbra with integrated AnyBlob achieves similar performance to state-of-the-art cloud data warehouses that cache data on local SSDs while improving resource elasticity.
Pando: Enhanced Data Skipping with Logical Data Partitioning
Sivaprasad Sudhir (Massachusetts Institute of Technology)*; Wenbo Tao (Meta Platforms); Nikolay Laptev (Meta); Cyrille Habis (Meta); Michael Cafarella (MIT CSAIL); Samuel Madden (Massachusetts Institute of Technology)
With enormous volumes of data, quickly retrieving data that is relevant to a query is essential for achieving high performance. Modern cloud-based database systems often partition the data into blocks and employ various techniques to skip irrelevant blocks during query execution. Several algorithms, often based on historical properties of a workload of queries run over the data, have been proposed to tune the physical layout of data to reduce the number of blocks accessed. The effectiveness of these methods at skipping blocks depends on what metadata is stored and how well the physical data layout aligns with the queries. Existing work on automatic physical database design misses significant opportunities in skipping blocks because it ignores logical predicates in the workload that exhibit strongly correlated results. In this paper, we present Pando, which enables significantly better block skipping than past methods by informing physical layout decisions with correlation-aware logical partitioning. Across a range of benchmark and real-world workloads, Pando attains up to 2.8X reduction in the number of blocks scanned and up to 2.3X speedup in end-to-end query execution time over the state-of-the-art techniques.
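The dependence of block skipping on stored metadata and physical layout can be illustrated with a small Python sketch. It uses generic per-block min/max metadata (often called zone maps) and is not Pando's correlation-aware method; the function names `build_zone_maps` and `blocks_to_scan` are hypothetical.

```python
from typing import List, Tuple


def build_zone_maps(values: List[int], block_size: int) -> List[Tuple[int, int]]:
    """Split values into fixed-size blocks and record (min, max) per block."""
    zone_maps = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        zone_maps.append((min(block), max(block)))
    return zone_maps


def blocks_to_scan(zone_maps: List[Tuple[int, int]], lo: int, hi: int) -> List[int]:
    """Return indexes of blocks whose [min, max] overlaps the range [lo, hi].

    All other blocks cannot contain a matching value and can be skipped.
    """
    return [
        idx
        for idx, (block_min, block_max) in enumerate(zone_maps)
        if block_max >= lo and block_min <= hi
    ]


# Same data, two physical layouts, same range predicate 55 <= x <= 65.
interleaved = [0, 50, 10, 60, 20, 70, 30, 80]
clustered = sorted(interleaved)

scan_interleaved = blocks_to_scan(build_zone_maps(interleaved, 2), 55, 65)
scan_clustered = blocks_to_scan(build_zone_maps(clustered, 2), 55, 65)
```

With the interleaved layout, three of the four blocks must be scanned; with the clustered layout, only one. This is the kind of gap that layout tuning aims to close.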
Parallelism-Optimizing Data Placement for Faster Data-Parallel Computations
Nirvik Baruah (Stanford University); Peter Kraft (Stanford University)*; Fiodar Kazhamiaka (Stanford); Peter D Bailis (Stanford University); Matei Zaharia (Berkeley and Databricks)
Systems performing large data-parallel computations, including online analytical processing (OLAP) systems like Druid and search engines like Elasticsearch, are increasingly being used for business-critical real-time applications where providing low query latency is paramount. In this paper, we investigate an underexplored factor in the performance of data-parallel queries: their parallelism. We find that to minimize the tail latency of data-parallel queries, it is critical to place data such that the data items accessed by each individual query are spread across as many machines as possible so that each query can leverage the computational resources of as many machines as possible. To optimize parallelism and minimize tail latency in real systems, we develop a novel parallelism-optimizing data placement algorithm that defines a linearly-computable measure of query parallelism, uses it to frame data placement as an optimization problem, and leverages a new optimization problem partitioning technique to scale to large cluster sizes. We apply this algorithm to popular systems such as Solr and MongoDB and show that it reduces p99 latency by 7-64% on data-parallel workloads.
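To make the notion of query parallelism concrete, the sketch below counts the distinct machines that hold the items a query touches. This is a simple stand-in chosen for illustration; the abstract does not specify the paper's linearly-computable measure, and the names `query_parallelism` and `mean_parallelism` are hypothetical.

```python
from typing import Dict, List


def query_parallelism(query_items: List[str], placement: Dict[str, int]) -> int:
    """Number of distinct machines holding the items accessed by one query."""
    return len({placement[item] for item in query_items})


def mean_parallelism(queries: List[List[str]], placement: Dict[str, int]) -> float:
    """Average per-query parallelism over a workload."""
    return sum(query_parallelism(q, placement) for q in queries) / len(queries)


queries = [["a", "b"], ["c", "d"]]

# Co-locating the items of each query leaves one machine doing all of its work.
colocated = {"a": 0, "b": 0, "c": 1, "d": 1}
# Spreading the items of each query lets both machines work on every query.
spread = {"a": 0, "b": 1, "c": 0, "d": 1}

colocated_score = mean_parallelism(queries, colocated)
spread_score = mean_parallelism(queries, spread)
```

Both placements store two items per machine, so they are equally balanced in storage; they differ only in how much of the cluster each query can use.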
Tigger: A Database Proxy That Bounces With User-Bypass
Matthew Butrovich (Carnegie Mellon University)*; Karthik Ramanathan (Carnegie Mellon University); John Rollinson (Army Cyber Institute); Wan Shen Lim (Carnegie Mellon University); William Zhang (Carnegie Mellon University); Justine Sherry (Carnegie Mellon University); Andrew Pavlo (Carnegie Mellon University)
Developers often deploy database-specific network proxies whereby applications connect transparently to the proxy instead of directly connecting to the database management system (DBMS). This indirection improves system performance through connection pooling, load balancing, and other DBMS-specific optimizations. Instead of simply forwarding packets, these proxies implement DBMS protocol logic (i.e., at the application layer) to achieve this behavior. Consequently, existing proxies are user-space applications that process requests as they arrive on network sockets and forward them to the appropriate destinations. This approach incurs inefficiencies as the kernel repeatedly copies buffers between user-space and kernel-space, and the associated system calls add CPU overhead.
This paper presents user-bypass, a technique to eliminate these overheads by leveraging modern operating system features that support custom code execution. User-bypass pushes application logic into kernel-space via Linux’s eBPF infrastructure. To demonstrate its benefits, we implemented Tigger, a PostgreSQL-compatible DBMS proxy using user-bypass to eliminate the overheads of traditional proxy design. We compare Tigger’s performance against other state-of-the-art proxies widely used in real-world deployments. Our experiments show that Tigger outperforms other proxies — in one scenario achieving both the lowest transaction latencies (up to 29% reduction) and lowest CPU utilization (up to 42% reduction). The results show that user-bypass implementations like Tigger are well-suited to DBMS proxies’ unique requirements.
D7
Modern Memory & Storage III
Chair: El Kindi Rezig (University of Utah)
TreeLine: An Update-In-Place Key-Value Store for Modern Storage
Geoffrey X. Yu (Massachusetts Institute of Technology)*; Markos Markakis (Massachusetts Institute of Technology); Andreas Kipf (Amazon Web Services); Per-Åke Larson (University of Waterloo); Umar Farooq Minhas (Apple); Tim Kraska (Massachusetts Institute of Technology)
Many modern key-value stores, such as RocksDB, rely on log-structured merge trees (LSMs). Originally designed for spinning disks, LSMs optimize for write performance by only making sequential writes. But this optimization comes at the cost of reads: LSMs must rely on expensive compaction jobs and Bloom filters—all to maintain reasonable read performance. For NVMe SSDs, we argue that trading off read performance for write performance is no longer always needed. With enough parallelism, NVMe SSDs have comparable random and sequential access performance. This change makes update-in-place designs, which traditionally provide excellent read performance, a viable alternative to LSMs.
In this paper, we close the gap between log-structured and update-in-place designs on modern SSDs with the help of new components that take advantage of data and workload patterns. Specifically, we explore three key ideas: (A) record caching for efficient point operations, (B) page grouping for high-performance range scans, and (C) insert forecasting to reduce the reorganization costs of accommodating new records. We evaluate these ideas by implementing them in a prototype update-in-place key-value store called TreeLine. On YCSB, we find that TreeLine outperforms RocksDB and LeanStore by 2.20× and 2.07× respectively on average across the point workloads, and by up to 10.95× and 7.52× overall.
PIM-tree: A Skew-resistant Index for Processing-in-Memory
Hongbo Kang (Tsinghua University)*; Yiwei Zhao (Carnegie Mellon University); Guy E Blelloch (Carnegie Mellon University); Laxman Dhulipala (University of Maryland, College Park); Yan Gu (UC Riverside); Charles McGuffey (Reed University); Phillip B Gibbons (Carnegie Mellon University)
The performance of today’s in-memory indexes is bottlenecked by the memory latency/bandwidth wall. Processing-in-memory (PIM) is an emerging approach that potentially mitigates this bottleneck, by enabling low-latency memory access whose aggregate memory bandwidth scales with the number of PIM nodes. There is an inherent tension, however, between minimizing inter-node communication and achieving load balance in PIM systems, in the presence of workload skew. This paper presents PIM-tree, an ordered index for PIM systems that achieves both low communication and high load balance, regardless of the degree of skew in data and queries. Our skew-resistant index is based on a novel division of labor between the host CPU and PIM nodes, which leverages the strengths of each. We introduce push-pull search, which dynamically decides whether to push queries to a PIM-tree node or pull the node’s keys back to the CPU based on workload skew. Combined with other PIM-friendly optimizations (shadow subtrees and chunked skip lists), our PIM-tree provides high-throughput, (guaranteed) low communication, and (guaranteed) high load balance, for batches of point queries, updates, and range scans. We implement PIM-tree, in addition to prior proposed PIM indexes, on the latest PIM system from UPMEM, with 32 CPU cores and 2048 PIM nodes. On workloads with 500 million keys and batches of 1 million queries, the throughput using PIM-trees is up to 69.7× and 59.1× higher than the two best prior PIM-based methods. As far as we know these are the first implementations of an ordered index on a real PIM system.
A Design Space Exploration and Evaluation for Main-Memory Hash Joins in Storage Class Memory [eab]
Wentao Huang (National University of Singapore)*; Yunhong Ji (Renmin University of China); Xuan Zhou (East China Normal University); Bingsheng He (National University of Singapore); Kian-Lee Tan (National University of Singapore)
In this paper, we seek to perform a rigorous experimental study of main-memory hash joins in storage class memory (SCM). In particular, we perform a design space exploration in real SCM for two state-of-the-art join algorithms, partitioned hash join (PHJ) and non-partitioned hash join (NPHJ), and identify the most crucial factors for implementing an SCM-friendly join. Moreover, we present a rigorous evaluation with a broad spectrum of workloads for both joins and provide an in-depth analysis for choosing the most suitable algorithm in a real SCM environment. With the most extensive experimental analysis to date, we maintain that although there is no one universal winner in all scenarios, PHJ is generally superior to NPHJ in real SCM.
Dotori: A Key-Value SSD Based KV Store
Carl Duffy (Seoul National University)*; Jaehoon Shim (Seoul National University); Sang-Hoon Kim (Ajou University); Jin-Soo Kim (Seoul National University)
Key-value SSDs (KVSSDs) represent a major shift in the storage stack design, with numerous potential benefits. Despite this, their lack of native features critical to operation in real world scenarios hinders their adoption, and these benefits go unrealized. Moreover, simply adapting existing key-value stores to run on KVSSDs proves underwhelming, as KVSSDs operate at lower raw device performance when compared to modern block SSDs.
This paper introduces Dotori. Dotori is a KVSSD based key-value store that provides much needed functionality in a KVSSD through an upper layer in the host, and takes advantage of the unique KVSSD interface to enable further gains in functionality and performance. At the core of Dotori is a novel B+tree design that is only practical when the underlying storage device is a KVSSD.
We test Dotori with an enterprise grade KVSSD against state-of-the-art block SSD based key-value stores through a range of micro-benchmarks and real world workloads. Despite low KVSSD raw device performance, Dotori achieves superior performance to these block-device based key-value stores while also showing significant gains in other important metrics.
DINOMO: An Elastic, Scalable, High-Performance Key-Value Store for Disaggregated Persistent Memory
Sekwon Lee (University of Texas at Austin); Soujanya Ponnapalli (The University of Texas at Austin); Sharad Singhal (Hewlett Packard Labs); Marcos Aguilera (VMware Research); Kimberly Keeton (Google); Vijay Chidambaram (UT Austin and VMWare)
We present Dinomo, a novel key-value store for disaggregated persistent memory (DPM). Dinomo is the first key-value store for DPM that simultaneously achieves high common-case performance, scalability, and lightweight online reconfiguration. We observe that previously proposed key-value stores for DPM had architectural limitations that prevent them from achieving all three goals simultaneously. Dinomo uses a novel combination of techniques such as ownership partitioning, disaggregated adaptive caching, selective replication, and lock-free and log-free indexing to achieve these goals. Compared to a state-of-the-art DPM key-value store, Dinomo achieves at least 3.8× better throughput at scale on various workloads and higher scalability, while providing fast reconfiguration.
D8
Compression
Chair: Panagiotis Karras (Aarhus University)
Sim-Piece: Highly Accurate Piecewise Linear Approximation through Similar Segment Merging
Xenophon Kitsios (Athens University of Economics and Business); Panagiotis Liakos (University of Athens)*; Katia Papakonstantinopoulou (Athens University of Economics and Business); Yannis Kotidis (Athens University of Economics and Business)
Approximating a series of timestamped data points using a sequence of line segments with a maximum error guarantee is a fundamental data compression problem, termed piecewise linear approximation (PLA). Due to the increasing need to analyze massive collections of time-series data in diverse domains, the problem has recently received significant attention, and recent PLA algorithms help us handle the overwhelming amount of information, at the cost of some precision loss. More specifically, these algorithms entail a trade-off between the maximum precision loss and the space savings achieved. However, advances in the area of lossless compression are undercutting the offerings of PLA techniques in real datasets. In this work, we propose Sim-Piece, a novel lossy compression algorithm for time-series data that optimizes the space requirements of representing PLA line segments by finding the minimum number of groups into which we can organize these segments in order to represent them jointly. Our experimental evaluation demonstrates that our approach readily outperforms competing techniques, attaining compression ratios with more than twofold improvement on average over what PLA algorithms can offer. This allows for providing significantly higher accuracy with equivalent space requirements. Moreover, our algorithm, due to the simplicity of its merging phase, imposes little overhead while compacting the PLA description, offering a significantly improved trade-off between space and running time. These benefits significantly improve the efficiency with which we can store time-series data, while allowing a tight maximum error in the representation of their values.
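For readers unfamiliar with PLA, the following Python sketch shows a generic greedy approximation with a maximum error guarantee. It is not Sim-Piece: the segment-merging step that is the paper's contribution is omitted, and the names `pla` and `reconstruct` are hypothetical. The sketch assumes strictly increasing timestamps and anchors each segment at its first data point.

```python
from typing import List, Tuple

Point = Tuple[float, float]                    # (timestamp, value)
Segment = Tuple[float, float, float, float]    # (t_start, v_start, slope, t_end)


def pla(points: List[Point], eps: float) -> List[Segment]:
    """Greedily cover the points with segments whose error is at most eps."""
    segments: List[Segment] = []
    i, n = 0, len(points)
    while i < n:
        t0, v0 = points[i]
        lo, hi = float("-inf"), float("inf")   # feasible slope interval
        j = i + 1
        while j < n:
            t, v = points[j]
            dt = t - t0                        # > 0 by assumption
            new_lo = max(lo, (v - eps - v0) / dt)
            new_hi = min(hi, (v + eps - v0) / dt)
            if new_lo > new_hi:                # no slope fits this point too
                break
            lo, hi = new_lo, new_hi
            j += 1
        # A single-point segment has no slope constraint; use 0.
        slope = 0.0 if j == i + 1 else (lo + hi) / 2
        segments.append((t0, v0, slope, points[j - 1][0]))
        i = j
    return segments


def reconstruct(segments: List[Segment], t: float) -> float:
    """Evaluate the approximation at a timestamp covered by some segment."""
    for t0, v0, slope, t_end in segments:
        if t0 <= t <= t_end:
            return v0 + slope * (t - t0)
    raise ValueError("timestamp not covered by any segment")


series = [(0, 0.0), (1, 1.1), (2, 1.9), (3, 3.0), (4, 10.0), (5, 11.0), (6, 12.2)]
segs = pla(series, eps=0.2)
max_err = max(abs(reconstruct(segs, t) - v) for t, v in series)
```

Because each chosen slope lies in the feasible interval of every point in its segment, the reconstruction error at the original timestamps never exceeds `eps`.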
The FastLanes Compression Layout: Decoding >100 Billion Integers per Second with Scalar Code
Azim Afroozeh (CWI)*; Peter Boncz (CWI)
The open-source FastLanes project aims to improve big data formats, such as Parquet, ORC and columnar database formats, in multiple ways. In this paper, we significantly accelerate decoding of all common Light-Weight Compression (LWC) schemes (DICT, FOR, DELTA and RLE) through better data-parallelism. We do so by re-designing the compression layout using two main ideas: (i) generalizing the value interleaving technique in the basic operation of bit-(un)packing by targeting a virtual 1024-bit SIMD register, and (ii) reordering the tuples in all columns of a table in the same Unified Transposed Layout that puts tuple chunks in a common "04261537" order (explained in the paper), allowing for maximum independent work for all possible basic SIMD lane widths: 8, 16, 32, and 64 bits. We address the software development, maintenance and future-proofness challenges of increasing hardware diversity by defining a virtual 1024-bit instruction set that consists of simple operators supported by all SIMD dialects and also, importantly, by scalar code. The interleaved and tuple-reordered layout actually makes scalar decoding faster, extracting more data-parallelism from today's wide-issue CPUs. Importantly, the scalar version can be fully auto-vectorized by modern compilers, eliminating technical debt in software caused by platform-specific SIMD intrinsics. Micro-benchmarks on Intel, AMD, Apple and AWS CPUs show that FastLanes accelerates decoding by factors and can achieve extreme speeds, such as decoding more than 40 values per CPU cycle. FastLanes can make queries faster, as compressing the data reduces bandwidth needs, while decoding is almost free.
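As background on the schemes being accelerated, here is a minimal Python sketch of frame-of-reference (FOR) encoding combined with bit-packing. It packs values sequentially into one integer and does not reproduce the interleaved or transposed FastLanes layout; the function names are hypothetical, and a non-empty input is assumed.

```python
from typing import List, Tuple


def for_bitpack_encode(values: List[int]) -> Tuple[int, int, int, int]:
    """Return (base, bit_width, count, packed) for a non-empty integer list.

    FOR stores each value as an offset from the minimum; bit-packing then
    stores every offset in just enough bits for the largest offset.
    """
    base = min(values)
    deltas = [v - base for v in values]
    width = max(deltas).bit_length()
    packed = 0
    for i, d in enumerate(deltas):
        packed |= d << (i * width)
    return base, width, len(values), packed


def for_bitpack_decode(base: int, width: int, count: int, packed: int) -> List[int]:
    """Unpack `count` offsets of `width` bits each and add the base back."""
    mask = (1 << width) - 1
    return [base + ((packed >> (i * width)) & mask) for i in range(count)]


values = [1000, 1003, 1001, 1007, 1000]
base, width, count, packed = for_bitpack_encode(values)
decoded = for_bitpack_decode(base, width, count, packed)
```

Here five values that would each need at least 10 bits are stored in 3 bits apiece plus one shared base. In this sequential layout each decoded value depends on its own shift amount, which is the kind of per-value work that layout redesigns aim to make data-parallel.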
Toward Quantity-of-Interest Preserving Lossy Compression for Scientific Data
Pu Jiao (University of Kentucky); Sheng Di (Argonne National Laboratory, Lemont, IL); Hanqi Guo (The Ohio State University); Kai Zhao (Florida State University); Jiannan Tian (Washington State University); Dingwen Tao (Indiana University); Xin Liang (University of Kentucky)*; Franck Cappello (Argonne National Laboratory, Lemont, IL)
Today's scientific simulations and instruments are producing a large amount of data, leading to difficulties in storing, transmitting, and analyzing these data. While error-controlled lossy compressors are effective in significantly reducing data volumes and efficiently developing databases for multiple scientific applications, they mainly support error controls on raw data, which leaves a significant gap between the data and the user's downstream analysis. This may cause unqualified uncertainties in the outcomes of the analysis, a.k.a. quantities of interest (QoIs), which are the major concerns of users in adopting lossy compression in practice. In this paper, we propose rigorous mathematical theories, along with practical implementations, to preserve four families of QoIs that are widely used in scientific analysis during lossy compression. Specifically, we first develop the error control theory for univariate QoIs, which are essential for computing physical properties such as kinetic energy, followed by multivariate QoIs that are more commonly used in real-world applications. The proposed method is integrated into a state-of-the-art compression framework in a modular fashion, which could easily adapt to new QoIs and new compression algorithms. Experiments on real-world datasets demonstrate that the proposed method successfully preserves important QoIs including kinetic energy, regional average, and isosurface without trial and error, while offering up to 4X the compression ratios provided by state-of-the-art compressors.
E1
Video Data
Chair: Yao Lu (Microsoft Research)
Extract-Transform-Load for Video Streams
Ferdinand Kossmann (Massachusetts Institute of Technology)*; Ziniu Wu (Massachusetts Institute of Technology); Eugenie Y. Lai (Massachusetts Institute of Technology); Nesime Tatbul (Intel Labs and MIT); Lei Cao (University of Arizona/MIT); Tim Kraska (Massachusetts Institute of Technology); Samuel Madden (Massachusetts Institute of Technology)
Social media, self-driving cars, and traffic cameras produce video streams at large scale and low cost. However, storing and querying video at such scales is prohibitively expensive. We propose to treat large-scale video analytics as a data warehousing problem: Video is a format that is easy to produce but needs to be transformed into an application-specific format that is easy to query. Analogously, we define the problem of Video Extract-Transform-Load (V-ETL). V-ETL systems need to reduce the cost of running a user-defined V-ETL job while also giving throughput guarantees to keep up with the rate at which data is produced. We find that no current system sufficiently fulfills both needs and therefore propose Skyscraper, a system tailored to V-ETL. Skyscraper can execute arbitrary video ingestion pipelines and adaptively tunes them to reduce cost at minimal or no quality degradation, e.g., by adjusting sampling rates and resolutions to the ingested content. Skyscraper can thereby be provisioned with cheap on-premises compute and uses a combination of buffering and cloud bursting to deal with peaks in workload caused by expensive processing configurations. In our experiments, we find that Skyscraper significantly reduces the cost of V-ETL ingestion compared to adaptations of current SOTA systems, while at the same time giving robustness guarantees that these systems lack.
EQUI-VOCAL: Synthesizing Queries for Compositional Video Events from Limited User Interactions
Enhao Zhang (University of Washington)*; Maureen Daum (University of Washington); Dong He (University of Washington); Brandon Haynes (Microsoft Gray Systems Lab); Ranjay Krishna (University of Washington); Magdalena Balazinska (UW)
We introduce EQUI-VOCAL: a new system that automatically synthesizes queries over videos from limited user interactions. The user only provides a handful of positive and negative examples of what they are looking for. EQUI-VOCAL utilizes these initial examples and additional ones collected through active learning to efficiently synthesize complex user queries. Our approach enables users to find events without database expertise, with limited labeling effort, and without declarative specifications or sketches. Core to EQUI-VOCAL's design is the use of spatio-temporal scene graphs in its data model and query language and a novel query synthesis approach that works on large and noisy video data. Our system outperforms two baseline systems---in terms of F1 score, synthesis time, and robustness to noise---and can flexibly synthesize complex queries that the baselines do not support.
Optimizing Video Analytics with Declarative Model Relationships
Francisco Romero (Stanford University)*; Johann Hauswald (Stanford University); Aditi Partap (Stanford University); Daniel Kang (Stanford University); Matei Zaharia (Berkeley and Databricks); Christos Kozyrakis (Stanford University)
The availability of vast video collections and the accuracy of ML models have generated significant interest in video analytics systems. Since naively processing all frames using expensive models is impractical, researchers have proposed optimizations such as selectively using faster but less accurate models to replace or filter frames for expensive models. However, these optimizations are difficult to apply to queries with multiple predicates and models, as users must manually explore a large optimization space. Without significant systems expertise or time investment, an analyst may manually create an execution plan that is unnecessarily expensive and/or terribly inaccurate.
We propose Relational Hints, a declarative interface that allows users to suggest ML model relationships based on domain knowledge. Users can express two key relationships: when a model can replace another (CAN REPLACE) and when a model can be used to filter frames for another (CAN FILTER). We aim to design an interface to express model relationships informed by domain-specific knowledge and define the constraints by which these relationships hold. We then present the VIVA video analytics system that uses relational hints to optimize SQL queries on video datasets. VIVA automatically selects and validates the hints applicable to the query, generates possible query plans using a formal set of transformations, and finds the best-performing plan that meets a user’s accuracy requirements. VIVA relieves users from rewriting and manually optimizing video queries as new models become available and execution environments evolve. We evaluate VIVA implemented on top of Spark and show that hints improve performance by up to 16.6x without sacrificing accuracy.
E2
Time-Series Analytics
Chair: John Paparrizos (Ohio State University)
Choose Wisely: An Extensive Evaluation of Model Selection for Anomaly Detection in Time Series [eab]
Emmanouil Sylligardos (FORTH); Paul Boniol (Université de Paris)*; John Paparrizos (The Ohio State University); Panos Trahanias (FORTH); Themis Palpanas (Université Paris Cité)
Anomaly detection is a fundamental task for time-series analytics with important implications for the downstream performance of many applications. Despite increasing academic interest and the large number of methods proposed in the literature, recent benchmark and evaluation studies demonstrated that no overall best anomaly detection methods exist when applied to very heterogeneous time series datasets. Therefore, the only scalable and viable solution to solve anomaly detection over very different time series collected from diverse domains is to propose a model selection method that will select, based on time series characteristics, the best anomaly detection method to run. Existing AutoML solutions are, unfortunately, not directly applicable to time series anomaly detection, and no evaluation of time series-based approaches for model selection exists. Towards that direction, this paper studies the performance of time series classification methods used as model selection for anomaly detection. Overall, we compare 17 different classifiers over 1800 time series, and we propose the first extensive experimental evaluation of time series classification as model selection for anomaly detection. Our results demonstrate that model selection methods outperform every single anomaly detection method while being in the same order of magnitude regarding execution time. This evaluation is the first step to demonstrate the accuracy and efficiency of time series classification algorithms for anomaly detection and represents a strong baseline that can then be used to guide the model selection step in general AutoML pipelines.
Time2Feat: Learning Interpretable Representations for Multivariate Time Series Clustering [sds]
Angela Bonifati (University of Lyon); Francesco Del Buono (University of Modena e Reggio Emilia); Francesco Guerra (University of Modena e Reggio Emilia); Donato Tiano (Università degli Studi di Modena e Reggio Emilia)*
Clustering multivariate time series is a critical task in many real-world applications involving multiple signals and sensors. Existing systems aim to maximize effectiveness, efficiency and scalability, but fail to guarantee the interpretability of the results. This hinders their application in critical real scenarios where human comprehension of algorithmic behavior is required. This paper introduces Time2Feat, an end-to-end machine learning system for multivariate time series (MTS) clustering. The system relies on inter-signal and intra-signal interpretable features extracted from the time series. Then, a dimensionality reduction technique is applied to select a subset of features that retain most of the information, thus enhancing the interpretability of the results. In addition, domain experts can semi-supervise the process by providing a small number of MTS with a target cluster. This process further improves both accuracy and interpretability, narrowing down the number of features used by the clustering process. We demonstrate the effectiveness, interpretability, efficiency, and robustness of Time2Feat through experiments on eighteen benchmarking time series datasets, comparing it with state-of-the-art MTS clustering methods.
Motiflets - Simple and Accurate Detection of Motifs in Time Series
Patrick Schäfer (Humboldt-Universität zu Berlin)*; Ulf Leser (Humboldt-Universität zu Berlin)
A time series motif intuitively is a short time series that repeats itself approximately the same within a larger time series. Such motifs often represent concealed structures, such as heart beats in an ECG recording, the riff in a pop song, or sleep spindles in EEG sleep data. Motif discovery (MD) is the task of finding such motifs in a given input series. As there are varying definitions of what exactly a motif is, a number of different algorithms exist. As central parameters they all take the length l of the motif and the maximal distance r between the motif's occurrences. In practice, however, especially suitable values for r are very hard to determine upfront, and found motifs show a high variability even for very similar r values. Accordingly, finding an interesting motif with these methods requires extensive trial-and-error.
In this paper, we present a different approach to the MD problem. We define k-Motiflets as the set of exactly k occurrences of a motif of length l, whose maximum pairwise distance is minimal. This turns the MD problem upside-down: The central parameter of our approach is not the distance threshold r, but the desired number of occurrences k of the motif, which we show is considerably more intuitive and easier to set. Based on this definition, we present exact and approximate algorithms for finding k-Motiflets and analyze their complexity. To further ease the use of our method, we describe statistical tools to automatically determine meaningful values for its input parameters. Thus, for the first time, extracting meaningful motif sets without any a priori knowledge becomes feasible.
By evaluation on several real-world data sets and comparison to four state-of-the-art MD algorithms, we show that our proposed algorithm is both quantitatively superior to its competitors, finding larger motif sets at higher similarity, and qualitatively better, leading to clearer and easier to interpret motifs without any need for manual tuning.
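The k-Motiflet definition in this abstract can be illustrated with a brute-force search. The following is a minimal sketch, not the authors' exact or approximate algorithm: the function names, the use of z-normalized Euclidean distance, and the rule that occurrences must not overlap are all assumptions made for illustration, and the enumeration is only practical for tiny inputs.

```python
from itertools import combinations
import math


def znorm(window):
    """Z-normalize a window (assumed distance preprocessing, not stated in the abstract)."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
    if std == 0:
        return [0.0] * len(window)
    return [(v - mean) / std for v in window]


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def k_motiflet_bruteforce(series, l, k):
    """Return (start offsets, extent) of the k non-overlapping length-l
    subsequences whose maximum pairwise distance is minimal. Requires k >= 2."""
    n = len(series) - l + 1
    subs = [znorm(series[i:i + l]) for i in range(n)]
    best_extent, best_set = math.inf, None
    for starts in combinations(range(n), k):
        # Assumption: occurrences may not overlap (avoids trivial matches).
        if any(b - a < l for a, b in zip(starts, starts[1:])):
            continue
        extent = max(euclidean(subs[a], subs[b])
                     for a, b in combinations(starts, 2))
        if extent < best_extent:
            best_extent, best_set = extent, starts
    return best_set, best_extent


# A pattern of length 5 planted three times between unrelated values.
pattern = [0, 1, 3, 1, 0]
series = pattern + [7, 2, 9, 4] + pattern + [5, 8, 2, 6] + pattern
motiflet, extent = k_motiflet_bruteforce(series, l=5, k=3)
```

Here the only input besides the length l is the number of occurrences k; no distance threshold r has to be guessed.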
Fast and Scalable Mining of Time Series Motifs with Probabilistic Guarantees
Matteo Ceccarello (University of Padova); Johann Gamper (Free University of Bozen-Bolzano, Italy)
Mining time series motifs is a fundamental, yet expensive task in exploratory data analytics. In this paper, we therefore propose a fast method to find the top-𝑘 motifs with probabilistic guarantees. Our probabilistic approach is based on Locality Sensitive Hashing and allows pruning most of the distance computations, leading to huge speedups. We improve on a straightforward application of LSH to time series data by developing a self-tuning algorithm that adapts to the data distribution. Furthermore, we include several optimizations to the algorithm, reducing redundant computations and leveraging the structure of time series data to speed up LSH computations. We prove the correctness of the algorithm and provide bounds on the cost of the basic operations it performs. An experimental evaluation shows that our algorithm is able to tackle time series of one billion points on a single CPU-based machine, performing orders of magnitude faster than the GPU-based state of the art.
OneShotSTL: One-Shot Seasonal-Trend Decomposition For Online Time Series Anomaly Detection And Forecasting
Xiao He (Alibaba Group)*; Ye Li (Alibaba); Jian Tan (Alibaba); Bin Wu (Alibaba Group); Feifei Li (Alibaba Group)
Seasonal-trend decomposition is one of the most fundamental concepts in time series analysis that supports various downstream tasks, including time series anomaly detection and forecasting. However, existing decomposition methods rely on batch processing with a time complexity of O(W), where W is the number of data points within a time window. Therefore, they cannot always efficiently support real-time analysis that demands low processing delay. To address this challenge, we propose OneShotSTL, an efficient and accurate algorithm that can decompose time series online with an update time complexity of O(1). OneShotSTL is more than 1,000 times faster than the batch methods, with accuracy comparable to the best counterparts. Extensive experiments on real-world benchmark datasets for downstream time series anomaly detection and forecasting tasks demonstrate that OneShotSTL is from 10 to over 1,000 times faster than the state-of-the-art methods, while still providing comparable or even better accuracy.
E3
Spatial & Multi-Dimensional Indexing
Chair: Jieming Shi (Hong Kong Polytechnic University)
Towards Designing and Learning Piecewise Space-Filling Curves
Jiangneng Li (Nanyang Technological University)*; Zheng Wang (Nanyang Technological University); Gao Cong (Nanyang Technological University); Cheng Long (Nanyang Technological University); Han Mao Kiah (Nanyang Technological University); Bin Cui (Peking University)
To index multi-dimensional data, space-filling curves (SFCs) have been used to map the data to one dimension, and then a one-dimensional indexing method such as the B-tree is used to index the mapped data. The existing SFCs all adopt a single mapping scheme for the whole data space. However, a single mapping scheme often does not perform well across the entire data space. In this paper, we propose a new type of SFC called piecewise SFCs, which adopts different mapping schemes for different data subspaces. Specifically, we propose a data structure called Bit Merging tree (BMTree), which can generate data subspaces and their SFCs simultaneously and achieve desirable properties of the SFC for the whole data space. Furthermore, we develop a reinforcement learning based solution to build the BMTree, aiming to achieve excellent query performance. Extensive experiments show that our proposed method outperforms existing SFCs in terms of query performance.
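To make the idea of a mapping scheme concrete, the sketch below maps a 2-D point to a 1-D key by merging the bits of its coordinates in a chosen order. This is an illustrative toy, not the BMTree: the pattern strings, the function names, and the rule that picks a pattern from the top bit of x are assumptions made here. The pattern "XYXY" gives the classical Z-order interleaving.

```python
def bit_merge(x, y, pattern, bits):
    """Build a 1-D key by consuming the bits of x and y, most significant
    first, in the order given by pattern (e.g. "XYXY" is Z-order).
    pattern must contain exactly `bits` X's and `bits` Y's."""
    x_pos = y_pos = bits - 1
    key = 0
    for axis in pattern:
        if axis == "X":
            bit = (x >> x_pos) & 1
            x_pos -= 1
        else:
            bit = (y >> y_pos) & 1
            y_pos -= 1
        key = (key << 1) | bit
    return key


def piecewise_key(x, y, bits=2):
    """Toy piecewise mapping: the left half of the space uses Z-order, the
    right half uses a different bit order. Both patterns start with "X", so
    the first key bit identifies the subspace and keys never collide."""
    in_left_half = ((x >> (bits - 1)) & 1) == 0
    pattern = "XYXY" if in_left_half else "XXYY"
    return bit_merge(x, y, pattern, bits)


z_key = bit_merge(3, 0, "XYXY", 2)      # bits x1 y1 x0 y0 = 1 0 1 0
other_key = bit_merge(3, 0, "XXYY", 2)  # bits x1 x0 y1 y0 = 1 1 0 0
all_keys = {piecewise_key(x, y) for x in range(4) for y in range(4)}
```

The resulting keys can be stored in any one-dimensional index, such as a B-tree; which bit order works best for a subspace depends on the data and query workload.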
Adaptive Indexing of Objects with Spatial Extent
Fatemeh Zardbani (Aarhus University); Nikos Mamoulis (University of Ioannina); Stratos Idreos (Harvard); Panagiotis Karras (Aarhus University)*
Can we quickly explore large multidimensional data in main memory? Adaptive indexing responds to this need by building an index incrementally, in response to queries; in its default form, it indexes a single attribute or, in the presence of several attributes, one attribute per index level. Unfortunately, this approach falters when indexing spatial data objects, encountered in data exploration tasks involving multidimensional range queries. In this paper, we introduce the Adaptive Incremental R-tree (AIR-tree): the first method for the adaptive indexing of non-point spatial objects; the AIR-tree incrementally and progressively constructs an in-memory spatial index over a static array, in response to incoming queries, using a suite of heuristics for creating and splitting nodes. Our thorough experimental study on synthetic and real data and workloads shows that the AIR-tree consistently outperforms prior adaptive indexing methods focusing on multidimensional points and a pre-built static R-tree in cumulative time over at least the first thousand queries.
Adaptive Indexing in High-Dimensional Metric Spaces
Konstantinos Lampropoulos (University of Ioannina); Fatemeh Zardbani (Aarhus University); Nikos Mamoulis (University of Ioannina)*; Panagiotis Karras (Aarhus University)
Similarity search in high-dimensional metric spaces is routinely used in many applications including content-based image retrieval, bioinformatics, data mining, and recommender systems. Search can be accelerated by the use of an index. However, constructing a high-dimensional index can be quite expensive and may not pay off if the number of queries against the data is not large. In these circumstances, it is beneficial to construct an index adaptively, while responding to a query workload. Existing work on multidimensional adaptive indexing partitions space into orthotopes (i.e., hyperrectangular units). This approach, however, is highly ineffective in high-dimensional spaces. In this paper, we propose AV-tree: an alternative method for adaptive high-dimensional indexing that exploits previously computed distances, using query centers as vantage points. Our experimental study shows that AV-tree yields a cumulative cost for the first several hundred or even thousand queries much lower than that of pre-built indices. After thousands of queries, the per-query performance of the AV-tree converges to or even surpasses that of the state-of-the-art MVP-tree. Arguably, our approach is commendable in environments where the expected number of queries is not large while there is a need to start answering queries as soon as possible, such as applications where data are updated frequently and past data soon become obsolete.
Waffle: A Workload-Aware and Query-Sensitive Framework for Disk-Based Spatial Indexing
Moin Hussain Moti (HKUST)*; Panagiotis Simatis (HKUST); Dimitris Papadias (HKUST)
Although several spatial indexes achieve fast query processing, they are ineffective for highly dynamic data sets because of costly updates. On the other hand, simple structures that enable efficient updates are slow for spatial queries. In this paper, we propose Waffle, a workload-aware, query-sensitive spatial index that effectively accommodates both update- and query-intensive workloads. Waffle combines concepts of the space and data partitioning frameworks, and constitutes a complete indexing solution.
In addition to query processing algorithms, it includes: (i) a novel bulk loading method that guarantees optimal disk page utilization on static data, (ii) algorithms for dynamic updates that guarantee zero overlapping of nodes, and (iii) a maintenance mechanism that adjusts the trade-off between query and update speed, based on the workload and query distribution. An extensive experimental evaluation confirms the superiority of Waffle over state-of-the-art space and data partitioning indexes on update and query efficiency.
E4
Data Samples & Summaries
Chair: Silu Huang (Microsoft Research)
Bayesian Sketches for Volume Estimation in Data Streams
Francesco Da Dalt (ETH Zürich)*; Simon Scherrer (ETH Zurich); Adrian Perrig (ETH Zurich)
Given large data streams of items, each attributable to a certain key and possessing a certain volume, the aggregate volume associated with a key is difficult to estimate in a way that is both efficient and accurate. On the one hand, exact counting with dedicated counters incurs unacceptable overhead during stream processing. On the other hand, sketch algorithms, i.e., approximate-counting techniques that share counters among keys, have suffered from a trade-off between accuracy and query efficiency: Classic sketch algorithms compute rough estimates efficiently, whereas more recent proposals yield highly accurate estimates at the cost of greatly increased computation time. In this work, we propose three sketch algorithms that overcome this trade-off, computing highly accurate estimates with lightweight procedures. To reconcile these desiderata, we employ novel estimation methods that rely on Bayesian probability theory, counter-cardinality information, and basic machine-learning techniques. The combination of these techniques enables highly accurate estimates, which we demonstrate by both a theoretical worst-case analysis and an experimental evaluation. Concretely, our sketches efficiently produce volume estimates with an average relative error of < 4%, which previous methods could only achieve with computations that are several orders of magnitude more expensive.
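As background for the counter-sharing idea, the following sketch shows a classic Count-Min sketch for per-key volume estimation. It is one of the classic sketch algorithms the abstract contrasts with, not one of the three Bayesian sketches proposed in the paper. The class name, parameters, and the use of salted BLAKE2b as the per-row hash are assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Classic Count-Min sketch: `depth` rows of `width` shared counters."""

    def __init__(self, depth, width):
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # Assumption: a salted BLAKE2b digest stands in for a family of
        # independent hash functions, one per row.
        digest = hashlib.blake2b(str(key).encode(), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, volume):
        """Add a non-negative volume to the key's counter in every row."""
        for row in range(self.depth):
            self.rows[row][self._index(key, row)] += volume

    def estimate(self, key):
        """Collisions only inflate counters, so the row minimum is the
        tightest available upper bound on the key's true volume."""
        return min(self.rows[row][self._index(key, row)]
                   for row in range(self.depth))


stream = [("a", 5), ("b", 1), ("a", 7), ("c", 2), ("b", 3), ("d", 10)]
sketch = CountMinSketch(depth=4, width=16)
truth = {}
for key, volume in stream:
    sketch.add(key, volume)
    truth[key] = truth.get(key, 0) + volume
estimates = {key: sketch.estimate(key) for key in truth}
```

The estimate never falls below the true volume but can overshoot when keys collide; this is the kind of rough but cheap estimate the abstract attributes to classic sketches.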
Panakos: Chasing the Tails for Multidimensional Data Streams
Fuheng Zhao (UCSB)*; Punnal Ismail Khan (UCSB); Divyakant Agrawal (University of California at Santa Barbara); Amr El Abbadi (UC Santa Barbara); Arpit Gupta (University of California at Santa Barbara); Zaoxing Liu (Boston University)
System operators are often interested in extracting different feature streams from multi-dimensional data streams and reporting their distributions at regular intervals, including the heavy-hitters that contribute to the tail portion of the feature distribution. Satisfying these requirements for increasing data rates with limited resources is challenging. This paper presents the design and implementation of Panakos, which makes the best use of available resources to accurately report a given feature's distribution, its tail contributors, and other stream statistics (e.g., cardinality and entropy). Our key idea is to leverage the skewness inherent to most feature streams in the real world. We leverage this skewness by disentangling the feature stream into hot, warm, and cold items based on their feature values. We then use different data structures for tracking objects in each category. Panakos provides solid theoretical guarantees and achieves high performance for various tasks. We have implemented Panakos on both software and hardware and compared Panakos to other state-of-the-art sketches using synthetic and real-world datasets. The experimental results demonstrate that Panakos often achieves one order of magnitude better accuracy than the state-of-the-art solutions for a given memory budget.
Efficient Approximation of Certain and Possible Answers for Ranking and Window Queries over Uncertain Data
Su Feng (Illinois Institute of Technology)*; Boris Glavic (Illinois Institute of Technology); Oliver A Kennedy (University at Buffalo, SUNY)
Uncertainty arises naturally in many application domains due to, e.g., data entry errors and ambiguity in data cleaning. Prior work in incomplete and probabilistic databases has investigated the semantics and efficient evaluation of ranking and top-k queries over uncertain data. However, most approaches deal with top-k and ranking in isolation and represent uncertain input data and query results using separate, incompatible data models. We present an efficient approach for under- and over-approximating results of ranking, top-k, and window queries over uncertain data. Our approach integrates well with existing techniques for querying uncertain data, is efficient, and is to the best of our knowledge the first to support windowed aggregation. We design algorithms for physical operators for uncertain sorting and window aggregation, and implement them in PostgreSQL. We evaluated our approach on synthetic and real-world datasets, demonstrating that it outperforms all competitors, with lower error in query results.
High-Dimensional Data Cubes
Sachin Basil John (EPFL); Christoph Koch (EPFL, Switzerland)
This paper introduces an approach to supporting high-dimensional data cubes at interactive query speeds and moderate storage cost. The approach is based on binary(-domain) data cubes that are judiciously partially materialized; the missing information can be quickly reconstructed using statistical or linear programming techniques. This enables new applications such as exploratory data analysis for feature engineering and other fields of data science. Moreover, it removes the need to compromise when building a data cube – all columns that we might ever wish to use can be included as dimensions. Our approach also speeds up certain dice, roll-up, and drill-down operations on data cubes with hierarchical dimensions compared to traditional data cubes.
No Repetition: Fast and Reliable Sampling with Highly Concentrated Hashing
Anders Aamand (MIT); Debarati Das (BARC - Basic Algorithms Research Copenhagen, University of Copenhagen); Evangelos Kipouridis (BARC - Basic Algorithms Research Copenhagen, University of Copenhagen); Jakob B.T. Knudsen (BARC - Basic Algorithms Research Copenhagen, University of Copenhagen); Peter M.R. Rasmussen (BARC - Basic Algorithms Research Copenhagen, University of Copenhagen); Mikkel Thorup (BARC - Basic Algorithms Research Copenhagen, University of Copenhagen)
Stochastic sample-based estimators are among the most fundamental and universally applied tools in statistics. Such estimators are particularly important when processing huge amounts of data, where we need to be able to answer a wide range of statistical queries reliably, yet cannot afford to store the data in its full length. In many applications we need the sampling to be coordinated, which is typically attained using hashing. In previous work, a common strategy to obtain reliable sample-based estimators that work within certain error bounds with high probability has been to design one that works with constant probability, and then boost the probability by taking the median over 𝑟 independent repetitions. Aamand et al. (STOC’20) recently proposed a fast and practical hashing scheme with strong concentration bounds, Tabulation-1Permutation, the first of its kind. In this paper, we demonstrate that using such a hash family for the sampling, we achieve the same high probability bounds without any need for repetitions. Using the same space, this saves a factor 𝑟 in time, and simplifies the overall algorithms. We validate our approach experimentally on both real and synthetic data. We compare Tabulation-1Permutation with strongly universal hash functions and with other hash functions such as MurmurHash3 and BLAKE3, both with and without resorting to repetitions. We see that if we want reliability in terms of small error probabilities, then Tabulation-1Permutation is significantly faster.
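The repetition strategy that this abstract sets out to avoid can be sketched as follows: a hash-based threshold sample yields one estimate of a set's size, and the median over r independently seeded repetitions is returned. This is a minimal illustration with assumed names; salted BLAKE2b from the Python standard library stands in for the hash function and is not Tabulation-1Permutation.

```python
import hashlib
import statistics


def unit_hash(key, seed):
    """Hash a key to a float in [0, 1). Assumption: salted BLAKE2b is a
    stand-in for the hash families discussed in the abstract."""
    digest = hashlib.blake2b(str(key).encode(), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little") / 2 ** 64


def threshold_sample_estimate(keys, p, seed):
    """Coordinated sampling: a key is sampled iff its hash is below p, so
    the same key is sampled (or not) wherever it appears. The sample size
    scaled by 1/p estimates the number of distinct keys."""
    sampled = {k for k in keys if unit_hash(k, seed) < p}
    return len(sampled) / p


def median_boosted_estimate(keys, p, r):
    """Median over r independently seeded repetitions; costs r hash
    evaluations per key, which is the factor the paper aims to save."""
    return statistics.median(
        threshold_sample_estimate(keys, p, seed) for seed in range(r))


keys = list(range(10_000))
single = threshold_sample_estimate(keys, p=0.1, seed=0)
boosted = median_boosted_estimate(keys, p=0.1, r=9)
```

With a sufficiently concentrated hash function, the abstract's claim is that the single estimate already achieves the reliability that the median is used to obtain here.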
E5
Similarity Search
Chair: Dong Deng (Rutgers University)
Accelerating Similarity Search for Elastic Measures: A Study and New Generalization of Lower Bounding Distances [eab]
John Paparrizos (The Ohio State University)*; Kaize Wu (University of Chicago); Aaron J Elmore (University of Chicago); Christos Faloutsos (Carnegie Mellon University); Michael J Franklin (University of Chicago)
Similarity search is a fundamental building block for analytical tasks, and its performance critically depends on the choice of distance measure. For time series, there is strong evidence that elastic measures achieve state-of-the-art accuracy but are computationally expensive. Thus, fast lower bounding (LB) distances prune unnecessary comparisons to accelerate similarity search. Despite over two decades of attention, there has never been a comprehensive study to assess the progress in this area. In addition, the research has disproportionately focused on one popular elastic measure, while other accurate measures have received little or no attention. Therefore, there is merit in developing a generalized framework to accumulate knowledge from previously developed LBs and eliminate the notoriously challenging task of designing separate LBs for each elastic measure. In this paper, we perform the first comprehensive study of 11 LBs spanning 5 elastic measures using 128 datasets. We identify four properties that constitute the effectiveness of LBs and propose the Generalized Lower Bounding (GLB) framework to satisfy all desirable properties. GLB creates cache-friendly data summaries, adaptively exploits summaries of both query and target time series, and captures boundary distances in an unsupervised manner. GLB outperforms all LBs in speedup (e.g., up to 13.5x faster against the strongest LB in terms of pruning power), establishes new state-of-the-art results for the 5 elastic measures, and provides the first LBs for 2 elastic measures with no known LBs. Overall, GLB enables the effective development of LBs to facilitate fast similarity search.
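The role of a lower bounding distance in pruning can be sketched with a well-known pair: Dynamic Time Warping (DTW) under a warping window and the LB_Keogh bound. This is a generic illustration under stated assumptions (equal-length series, squared point costs, hypothetical function names) and is not the GLB framework from the paper.

```python
def dtw(a, b, w):
    """DTW with squared point costs and a warping window of half-width w.
    Assumes len(a) == len(b)."""
    n, m = len(a), len(b)
    inf = float("inf")
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            table[i][j] = cost + min(table[i - 1][j],
                                     table[i][j - 1],
                                     table[i - 1][j - 1])
    return table[n][m]


def lb_keogh(query, candidate, w):
    """LB_Keogh: cost of the query points that fall outside the candidate's
    min/max envelope; never exceeds dtw(query, candidate, w)."""
    total = 0.0
    for i, value in enumerate(query):
        window = candidate[max(0, i - w): i + w + 1]
        low, high = min(window), max(window)
        if value > high:
            total += (value - high) ** 2
        elif value < low:
            total += (low - value) ** 2
    return total


def nn_search(query, candidates, w):
    """1-NN search that skips the expensive DTW whenever the cheap lower
    bound already matches or exceeds the best distance found so far."""
    best_idx, best_dist, dtw_calls = -1, float("inf"), 0
    for idx, cand in enumerate(candidates):
        if lb_keogh(query, cand, w) >= best_dist:
            continue
        dtw_calls += 1
        dist = dtw(query, cand, w)
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx, best_dist, dtw_calls


query = [0, 1, 2, 1, 0, 1]
candidates = [
    [0, 0, 1, 2, 1, 0],
    [5, 6, 7, 6, 5, 6],
    [0, 1, 2, 1, 0, 1],
    [9, 9, 9, 9, 9, 9],
]
best_idx, best_dist, dtw_calls = nn_search(query, candidates, w=1)
```

Because the bound never overestimates the true distance, pruning on it cannot change the search result; it only reduces how often the full distance is computed.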
MQH: Locality Sensitive Hashing on Multi-level Quantization Errors for Point-to-Hyperplane Distances
Kejing Lu (Nagoya University)*; Yoshiharu Ishikawa (Nagoya University); Chuan Xiao (Osaka University, Nagoya University)
Point-to-hyperplane nearest neighbor search (P2HNNS) is a fundamental problem which has many applications in data mining and machine learning. In this paper, we propose a provable Locality-Sensitive Hashing (LSH) scheme based on multi-level quantization errors to solve this problem. In the indexing phase, for each data point, we compute the hash values of its residual vectors generated by a stepwise quantization process. In the query phase, for each processed point, we first determine its suitable level for hashing and then determine the size of the hash bucket based on its quantization error at that level. We theoretically show that this treatment not only yields a probability guarantee on query results, but also makes the generated hash functions much more efficient at pruning false points. Experimental results on five real datasets show that the proposed approach generally runs 2X-10X faster than the state-of-the-art LSH-based approaches.
FARGO: Fast Maximum Inner Product Search via Global Multi-Probing
Xi Zhao (Huazhong University of Science and Technology); Bolong Zheng (Huazhong University of Science and Technology)*; Xiaomeng Yi (Zhejiang Lab); Xiaofan Luan (Zilliz); Charles Xie (Zilliz); Xiaofang Zhou (The Hong Kong University of Science and Technology); Christian S. Jensen (Aalborg University)
Maximum inner product search (MIPS) in high-dimensional spaces has wide applications but is computationally expensive due to the curse of dimensionality. Existing studies employ asymmetric transformations that reduce the MIPS problem to a nearest neighbor search (NNS) problem, which can be solved using locality-sensitive hashing (LSH). However, these studies usually maintain multiple hash tables and locally examine them one by one, which may cause additional costs on probing unnecessary points. In addition, LSH is applied without taking into account the properties of the inner product. In this paper, we develop a fast search framework FARGO for MIPS on large-scale, high-dimensional data. We propose a global multi-probing (GMP) strategy that exploits the properties of the inner product to globally examine high quality candidates. In addition, we develop two optimization techniques. First, different from existing transformations that introduce either distortion errors or data distribution imbalances, we design a novel transformation, called random XBOX transformation, that avoids the negative effects of data distribution imbalances. Second, we propose a global adaptive early termination condition that finds results quickly and offers theoretical guarantees. Extensive experiments with real-world data offer evidence that FARGO is capable of outperforming existing proposals in terms of both accuracy and efficiency.
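The reduction from MIPS to NNS via an asymmetric transformation can be illustrated with the classical XBOX-style transformation, which appends one coordinate to each data point and a zero to the query. This sketch shows that classical variant only, not the random XBOX transformation proposed in the paper; the function names are assumptions, and brute-force scans stand in for an LSH index.

```python
import math


def transform_data(points):
    """Append sqrt(M^2 - ||x||^2) to every point, where M is the largest
    norm in the data set, so that all transformed points have norm M."""
    max_sq_norm = max(sum(v * v for v in p) for p in points)
    transformed = []
    for p in points:
        slack = max(0.0, max_sq_norm - sum(v * v for v in p))
        transformed.append(list(p) + [math.sqrt(slack)])
    return transformed


def transform_query(q):
    """Queries get a zero appended (this is what makes the scheme asymmetric)."""
    return list(q) + [0.0]


def argmax_inner_product(points, q):
    return max(range(len(points)),
               key=lambda i: sum(a * b for a, b in zip(points[i], q)))


def argmin_distance(points, q):
    return min(range(len(points)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(points[i], q)))


points = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [-1.0, 4.0]]
query = [1.0, 0.5]

mips_answer = argmax_inner_product(points, query)
raw_nn = argmin_distance(points, query)
reduced_nn = argmin_distance(transform_data(points), transform_query(query))
```

After the transformation, the squared distance between query and point equals ||q||^2 + M^2 - 2 q·x, so the nearest transformed point is exactly the point with the largest inner product, whereas the nearest neighbor in the original space can be a different point.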
Odyssey: A Journey in the Land of Distributed Data Series Similarity Search
Manos Chatzakis (EPFL)*; Panagiota Fatourou (University of Crete); Eleftherios Kosmas (University of Crete); Themis Palpanas (Université Paris Cité); Botao Peng (Institute of Computing Technology, Chinese Academy of Sciences)
This paper presents Odyssey, a novel distributed data-series processing framework that efficiently addresses the critical challenges of exhibiting good speedup and ensuring high scalability in data series processing by taking advantage of the full computational capacity of modern clusters comprised of multi-core servers. Odyssey addresses a number of challenges in designing an efficient and highly scalable distributed data series index, including efficient scheduling and load-balancing without paying the prohibitive cost of moving data around. It also supports a flexible partial replication scheme, which enables Odyssey to navigate through a fundamental trade-off between data scalability and good performance during query answering. Through a wide range of configurations and using several real and synthetic datasets, our experimental analysis demonstrates that Odyssey achieves its challenging goals.
Elpis: Graph-Based Similarity Search for Scalable Data Science [sds]
Ilias Azizi (Mohammed VI Polytechnic University)*; Karima Echihabi (Mohammed VI Polytechnic University); Themis Palpanas (Université Paris Cité)
The recent popularity of learned embeddings has fueled the growth of massive collections of high-dimensional (high-d) vectors that model complex data. Finding similar vectors in these collections is at the core of many important and practical data science applications. The data series community has developed tree-based similarity search techniques that outperform state-of-the-art methods on large collections of both data series and generic high-d vectors, in all scenarios except for ng-approximate search (approximate search with no guarantees), where graph-based approaches designed by the high-d vector community achieve the best performance. However, building graph-based indexes is extremely expensive both in time and space. In this paper, we bring these two worlds together, study the corresponding solutions and their performance behavior, and propose ELPIS, a new strong baseline that takes advantage of the best features of both to achieve superior performance in terms of indexing and in-memory ng-approximate search. ELPIS builds the index 3x-8x faster than competitors, using 40% less memory. It also achieves a high recall of 0.99, up to 2x faster than the state-of-the-art methods, and answers 1-NN queries up to one order of magnitude faster.
E6
Matching & Spatial Crowdsourcing
Chair: Xiaohui Yu (York University)
Privacy-preserving Cooperative Online Matching over Spatial Crowdsourcing Platforms
Yi Yang (Beijing Institute of Technology)*; Yurong Cheng (Beijing institute of technology); Ye Yuan (Beijing Institute of Technology); Guoren Wang (Beijing Institute of Technology); Lei Chen (Hong Kong University of Science and Technology); Yongjiao Sun (Northeastern University)
With the continuous development of spatial crowdsourcing platforms, the online task assignment problem has been widely studied as a typical problem in spatial crowdsourcing. Most existing studies are based on single-platform task assignment to maximize the platform's revenue. Recently, cross online task assignment has been proposed, aiming at increasing the mutual benefit through cooperation. However, existing methods fail to consider data privacy protection in the process of cooperation and cause the leakage of sensitive data such as the location of a request and the historical data of cooperative platforms. In this paper, we propose Privacy-preserving Cooperative Online Matching (PCOM), which protects the privacy of the users and workers on their respective platforms. We design a PCOM framework and provide theoretical proof that the framework satisfies the differential privacy property. We then propose two PCOM algorithms based on two different privacy-preserving strategies. Extensive experiments on real and synthetic datasets confirm the effectiveness and efficiency of our algorithms.
ACTA: Autonomy and Coordination Task Assignment in Spatial Crowdsourcing Platforms
Boyang Li (Beijing Institute of Technology)*; Yurong Cheng (Beijing institute of technology); Ye Yuan (Beijing Institute of Technology); Yi Yang (Beijing Institute of Technology); Qianqian Jin (Beijing Institute of Technology, China); Guoren Wang (Beijing Institute of Technology)
Spatial platforms have become increasingly important in people's daily lives. Task assignment is a critical problem in these platforms that matches real-time orders to suitable workers. Most studies only focus on independent platforms that are in a competitive relationship. Recently, an emerging service model was proposed, where orders are shared with multiple similar platforms. It aims to solve the imbalance between supply and demand through cooperation. However, it faces the following main challenges: 1) coordinating independent platforms fairly based on limited information; 2) building a task assignment process with personalized algorithms. In this paper, we study real applications and define the Autonomy and Coordination Task Assignment problem (ACTA) to maximize the global revenue and fairness. We propose a framework to solve ACTA that consists of public order sending, local matching, global conflict adjustment and results notification. The framework uses mid-products and public data to train a revenue estimation model to coordinate participants. We further propose dynamic weight task assignment algorithms to guarantee fairness. Our experiments show that the platforms can obtain higher revenue, which demonstrates the effectiveness and efficiency of our work.
Online Ridesharing with Meeting Points
Jiachuan Wang (HKUST); Peng Cheng (East China Normal University); Libin Zheng (Sun Yat-sen University); Lei Chen (Hong Kong University of Science and Technology); Wenjie Zhang (University of New South Wales)
Nowadays, ridesharing has become a popular commuting mode. Dynamically arriving riders post their origins and destinations, and the platform assigns drivers to serve them. In ridesharing, different groups of riders can be served by one driver if their trips can share common routes. Recently, many ridesharing companies (e.g., Didi and Uber) have introduced a new mode, namely “ridesharing with meeting points”. Specifically, in exchange for a short walk and a lower fare, riders can be picked up and dropped off near their origins and destinations, respectively. In addition, meeting points enable more flexible routing for drivers, which can potentially improve the global profit of the system. In this paper, we first formally define the Meeting-Point-based Online Ridesharing Problem (MORP). We prove that MORP is NP-hard and that there is no polynomial-time deterministic algorithm with a constant competitive ratio for it. We observe that a vertex-set structure, the 𝑘-skip cover, fits MORP well. A 𝑘-skip cover tends to find the vertices (meeting points) that are convenient for riders and drivers to come and go. With meeting points, MORP tends to serve more riders with these convenient vertices. Based on this idea, we introduce a convenience-based meeting point candidates selection algorithm. We further propose a hierarchical meeting-point oriented graph (HMPO graph), which ranks vertices for assignment effectiveness and constructs a 𝑘-skip cover to accelerate the whole assignment process. Finally, we utilize the merits of 𝑘-skip cover points for ridesharing and propose a novel algorithm, namely SMDB, to solve MORP. Extensive experiments on real and synthetic datasets validate the effectiveness and efficiency of our algorithms.
k-Best Egalitarian Stable Marriages for Task Assignment
Siyuan Wu (University of Macau)*; Leong Hou U (University of Macau); Panagiotis Karras (Aarhus University)
In a two-sided market with each agent ranking individuals on the other side according to their preferences, such as location or incentive, the stable marriage problem calls for finding a perfect matching between the two sides such that no pair of agents prefers each other to their assigned matches. Recent studies show that the number of solutions can be large in practice. Yet the classic solution by the Gale-Shapley (GS) algorithm is optimal for agents on one side and pessimal for those on the other side. Some algorithms find a stable marriage that optimizes a measure of the cumulative satisfaction of all agents, such as egalitarian cost. However, in many real-world circumstances, a decision-maker needs to examine a set of solutions that are stable and attentive to both sides and choose among them based on expert knowledge. With such a disposition, it is necessary to identify a set of high-quality stable marriages and provide transparent explanations for any reassigned matches to the decision-maker. In this paper, we provide efficient algorithms that find the k-best stable marriages by egalitarian cost. Our exhaustive experimental study using real-world data and realistic preferences demonstrates the efficacy and efficiency of our solution.
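For readers unfamiliar with the two building blocks named in this abstract, the following is a minimal sketch of the classic Gale-Shapley algorithm and of an egalitarian cost function. It is not the paper's k-best algorithm. The function names, the 0-based rank convention (0 = first choice), and the list-of-lists preference encoding are assumptions made for illustration only.

```python
from collections import deque


def gale_shapley(proposer_prefs, receiver_prefs):
    """Classic Gale-Shapley: returns {proposer: receiver}, optimal for proposers.

    Both arguments are lists of complete preference lists over indices 0..n-1
    (an illustrative encoding, not taken from the paper).
    """
    n = len(proposer_prefs)
    # rank[r][p] = position of proposer p in receiver r's list (lower is better)
    rank = [{p: i for i, p in enumerate(prefs)} for prefs in receiver_prefs]
    next_choice = [0] * n    # index of the next receiver each proposer will try
    engaged_to = [None] * n  # engaged_to[r] = proposer currently held by r
    free = deque(range(n))
    while free:
        p = free.popleft()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged_to[r]
        if current is None:
            engaged_to[r] = p
        elif rank[r][p] < rank[r][current]:
            engaged_to[r] = p      # r trades up; the old partner is free again
            free.append(current)
        else:
            free.append(p)         # r rejects p; p will try its next choice
    return {p: r for r, p in enumerate(engaged_to)}


def egalitarian_cost(matching, proposer_prefs, receiver_prefs):
    """Sum over all agents of the rank they assign to their partner (0-based)."""
    return sum(
        proposer_prefs[p].index(r) + receiver_prefs[r].index(p)
        for p, r in matching.items()
    )


# Tiny instance with two stable matchings: each side's optimum is the other's worst.
side_a = [[0, 1], [1, 0]]
side_b = [[1, 0], [0, 1]]
a_optimal = gale_shapley(side_a, side_b)
print(a_optimal, egalitarian_cost(a_optimal, side_a, side_b))
```

Running the algorithm with the roles swapped (`gale_shapley(side_b, side_a)`) yields the other stable matching of this instance, which is the kind of alternative a decision-maker might want to compare by egalitarian cost.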
E7
Trajectories & Time Series
Chair: Dujian Ding (University of British Columbia)
A Deep Generative Model for Trajectory Modeling and Utilization
Yong Wang (Tsinghua University); Guoliang Li (Tsinghua University)*; Kaiyu Li (Tsinghua University); Haitao Yuan (Baidu)
Modern location-based systems have stimulated explosive growth of urban trajectory data and promoted many real-world applications, e.g., trajectory prediction. However, heavy big data processing overhead and privacy concerns hinder trajectory acquisition and utilization. Inspired by regular trajectory distribution on transportation road networks, we propose to model trajectory data privately with a deep generative model and leverage the model to generate representative trajectories for downstream tasks or directly support these tasks (e.g., popularity ranking), rather than acquiring and processing the original big trajectory data. Nevertheless, it is rather challenging to model high-dimensional trajectories with time-varying yet skewed distribution. To address this problem, we model and generate trajectory sequences with judiciously encoded spatio-temporal features over skewed distribution by leveraging an important factor neglected by the literature: the underlying road properties (e.g., road types and directions), which are closely related to trajectory distribution. Specifically, we decompose each trajectory into a map-matched road sequence with temporal information and embed it to encode spatio-temporal features. Then, we enhance trajectory representation by encoding inherent route planning patterns from the underlying road properties. Later, we encode spatial correlations among edges and daily and weekly temporal periodicity information. Next, we employ a meta-learning module to generate trajectory sequences step by step by learning generalized trajectory distribution patterns from skewed trajectory data based on the well-encoded trajectory prefix. Last but not least, we preserve trajectory privacy by training the model with differential privacy using gradient clipping. Experiments on real-world datasets show that our method significantly outperforms existing methods.
Efficient Non-Learning Similar Subtrajectory Search
Jiabao Jin (East China Normal University); Peng Cheng (East China Normal University)*; Lei Chen (Hong Kong University of Science and Technology); Xuemin Lin (Shanghai Jiaotong University); Wenjie Zhang (University of New South Wales)
Similar subtrajectory search is a finer-grained operator that can better capture the similarities between one query trajectory and a portion of a data trajectory than the traditional similar trajectory search, which requires that the two trajectories being compared are similar in their entirety. Many real applications (e.g., trajectory clustering and trajectory join) utilize similar subtrajectory search as a basic operator. Existing studies consider the time complexity of exact algorithms for the similar subtrajectory search problem to be O(mn^2) under most trajectory distance functions, where m is the length of the query trajectory and n is the length of the data trajectory. In this paper, to the best of our knowledge, we are the first to propose an exact algorithm that solves the similar subtrajectory search problem in O(mn) time for most widely used trajectory distance functions (e.g., WED, DTW, ERP, EDR and Fréchet distance). Through extensive experiments on three real datasets, we demonstrate the efficiency and effectiveness of our proposed algorithms.
Effective and Efficient Route Planning Using Historical Trajectories on Road Networks
Wei Tian (The Hong Kong Polytechnic University)*; Jieming Shi (The Hong Kong Polytechnic University); Siqiang Luo (Nanyang Technological University); Hui Li (Xiamen University); Xike Xie (University of Science and Technology of China); Yuanhang Zou (Tencent)
We study route planning that utilizes historical trajectories to predict a realistic route from a source to a destination on a road network at a given departure time. Route planning is a fundamental task in many location-based services. It is challenging to capture latent patterns implied by complex trajectory data for accurate route planning. Recent studies mainly resort to deep learning techniques that incur immense computational costs, especially on massive data, while their effectiveness is difficult to interpret.
This paper proposes DRPK, an effective and efficient route planning method that achieves state-of-the-art performance via a series of novel algorithmic designs. In brief, observing that a route planning query (RPQ) with closer source and destination is easier to predict accurately, DRPK first detects the key segment of an RPQ with a classification model, KSD, in order to split the RPQ into shorter RPQs, and then handles the shorter RPQs with a destination-driven route planning procedure, DRP. Both the KSD and DRP modules rely on a directed association (DA) indicator, which captures the dependencies between road segments from historical trajectories in a surprisingly intuitive but effective way. Leveraging the DA indicator, we develop a set of well-thought-out key segment concepts that holistically consider historical trajectories and RPQs. KSD is powered by effective encoders to detect high-quality key segments without inspecting all segments in a road network, for efficiency. We conduct extensive experiments on 5 large-scale datasets. DRPK consistently achieves the highest effectiveness, often with a significant margin over existing methods, while being much faster to train. Moreover, DRPK handles thousands of online RPQs per second, e.g., 2768 RPQs per second on a PT dataset, i.e., 0.36 milliseconds per RPQ.
iEDeaL: A Deep Learning Framework for Detecting Highly Imbalanced Interictal Epileptiform Discharges [sds]
Qitong Wang (Université Paris Cité)*; Stephen Whitmarsh (Sorbonne Université, Paris Brain Institute - ICM, Inserm, CNRS, APHP, Pitié-Salpêtrière Hospital); Vincent Navarro (Sorbonne Université, Paris Brain Institute - ICM, Inserm, CNRS, APHP, Pitié-Salpêtrière Hospital); Themis Palpanas (Université Paris Cité)
Epilepsy is a chronic neurological disease, ranked as the second most burdensome neurological disorder worldwide. Detecting Interictal Epileptiform Discharges (IEDs) is among the most important clinician operations to support epilepsy diagnosis, rendering automatic IED detection based on electroencephalography (EEG) signals an important topic. However, most existing solutions were designed and evaluated upon artificially balanced IED datasets, which do not conform to the real-world highly imbalanced scenarios. In this work, we propose the iEDeaL framework for automatic IED detection in challenging real-world use cases. The main components of iEDeaL are the new SC neural network architecture, to efficiently detect IEDs on raw EEG series instead of extracted features, and SaSu, a novel loss function to train SC by optimizing the $F_{\beta}$-score. Experiments on two real-world imbalanced IED datasets verify the advantages of iEDeaL in offering more accurate and efficient IED detection when compared with other state-of-the-art deep learning-based and spectrogram feature-based solutions.
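As background for the metric this abstract targets, the sketch below computes the standard $F_{\beta}$-score from confusion counts. It illustrates only the evaluation metric, not the SaSu loss, which the abstract does not specify; the function name and the convention of returning 0.0 when precision and recall are both zero are assumptions.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from true positives, false positives and false negatives.

    beta > 1 weights recall more heavily; beta < 1 weights precision more.
    Returns 0.0 when precision and recall are both zero (assumed convention).
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical imbalanced case: 16 true events among many negatives,
# of which 8 are detected, plus 2 false alarms.
print(f_beta(tp=8, fp=2, fn=8, beta=1.0))
print(f_beta(tp=8, fp=2, fn=8, beta=2.0))
```

Because true negatives do not appear in the formula, the score is not inflated by the large majority class, which is why it suits highly imbalanced detection tasks better than plain accuracy.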
F1
Scalable ML I
Chair: Rajesh Bordawekar (IBM Research)
Scalable Graph Convolutional Network Training on Distributed-Memory Systems
Gunduz Vehbi Demirci (University of Warwick)*; Aparajita Haldar (University of Warwick); Hakan Ferhatosmanoglu (University of Warwick)
Graph Convolutional Networks (GCNs) are extensively utilized for deep learning on graphs. The large data sizes of graphs and their vertex features make scalable training algorithms and distributed memory systems necessary. Since the convolution operation on graphs induces irregular memory access patterns, designing a memory- and communication-efficient parallel algorithm for GCN training poses unique challenges. We propose a highly parallel training algorithm that scales to large processor counts. In our solution, the large adjacency and vertex-feature matrices are partitioned among processors. We exploit the vertex-partitioning of the graph to use non-blocking point-to-point communication operations between processors for better scalability. To further minimize the parallelization overheads, we introduce a sparse matrix partitioning scheme based on a hypergraph partitioning model for full-batch training. We also propose a novel stochastic hypergraph model to encode the expected communication volume in mini-batch training. We show the merits of the hypergraph model, previously unexplored for GCN training, over the standard graph partitioning model which does not accurately encode the communication costs. Experiments performed on real-world graph datasets demonstrate that the proposed algorithms achieve considerable speedups over alternative solutions. The optimizations achieved on communication costs become even more pronounced at high scalability with many processors. The performance benefits are preserved in deeper GCNs having more layers as well as on billion-scale graphs.
FastFlow: Accelerating Deep Learning Model Training with Smart Offloading of Input Data Pipeline
Taegeon Um (Samsung Research)*; Byungsoo Oh (Samsung Research); Byeongchan Seo (Samsung Research); Minhyeok Kweun (samsung research); Goeun Kim (Samsung Research); Woo-Yeon Lee (Samsung Research)
When training a deep learning (DL) model, input data are preprocessed on CPUs and transformed into tensors, which are then fed into GPUs for gradient computations of model training. Expensive GPUs must be fully utilized during training to accelerate the training speed. However, intensive CPU operations for input data preprocessing (input pipeline) often lead to CPU bottlenecks; correspondingly, various DL training jobs suffer from GPU under-utilization.
We propose FastFlow, a DL training system that automatically mitigates the CPU bottleneck by offloading (scaling out) input pipelines to remote CPUs. FastFlow makes its offloading decisions based on performance metrics specific to applications and allocated resources, while leveraging both local and remote CPUs to prevent the inefficient use of remote resources and minimize the training time. FastFlow’s smart offloading policy and mechanisms are seamlessly integrated with TensorFlow, so that users can benefit from the smart offloading features without modifying the main logic. Our evaluations on our private DL cloud with diverse workloads on various resource environments show that FastFlow improves the training throughput by 1 ∼ 4.34× compared to TensorFlow without offloading, by 1 ∼ 4.52× compared to TensorFlow with manual CPU offloading (tf.data.service), and by 0.63 ∼ 2.06× compared to GPU offloading (DALI).
MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud
Zhen Zhang (Johns Hopkins University)*; Shuai Zheng (Amazon Web Services); Yida Wang (Amazon); Justin Chiu (Amazon); George Karypis (Amazon); Trishul A Chilimbi (Amazon); Mu Li (Amazon); Xin Jin (Peking University)
Existing general-purpose frameworks for gigantic model training, i.e., training dense models with billions of parameters, cannot scale efficiently in cloud environments with various networking conditions due to large communication overheads. In this paper, we propose MiCS, which Minimizes the Communication Scale to bring down communication overhead. Specifically, by decreasing the number of participants in a communication collective, MiCS can utilize existing heterogeneous network bandwidth on the cloud, reduce network traffic over slower links, reduce the latency of communications for maintaining high network bandwidth utilization, and amortize expensive global gradient synchronization overheads. Our evaluation on AWS shows that the system throughput of MiCS is up to 2.89× that of the state-of-the-art large model training systems. MiCS achieves near-linear scaling efficiency, which is up to 1.27× that of DeepSpeed. MiCS allows us to train a proprietary model with 100 billion parameters on 512 GPUs with 99.4% weak-scaling efficiency, and it is able to saturate over 54.5% of the theoretical computation power of each GPU on a public cloud with less GPU memory and more restricted networks than DGX-A100 clusters.
F2
Data Discovery & Learning over Related Data
Chair: Fatemeh Nargesian (University of Rochester)
JoinBoost: Grow Trees Over Normalized Data Using Only SQL
Zezhou Huang (Columbia University)*; Rathijit Sen (Microsoft); Jiaxiang Liu (Columbia University); Eugene Wu (Columbia University)
Although dominant for tabular data, ML libraries that train tree models over normalized databases (e.g., LightGBM, XGBoost) require the data to be denormalized as a single table, materialized, and exported. This process is slow, does not scale, and poses security risks. In-DB ML aims to train models within DBMSes to avoid data movement and provide data governance. Rather than modify a DBMS to support In-DB ML, is it possible to offer competitive tree training performance to specialized ML libraries...with only SQL?
We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting by updating the $Y$ variable to the residual in the non-materialized join result. Although this view update problem is generally ambiguous, we identify addition-to-multiplication preserving, the key property of the variance semi-ring, to support rmse, the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. Such overhead can be natively minimized on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiments show that JoinBoost is 3x (1.1x) faster for random forests (gradient boosting) compared to LightGBM, and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in terms of the number of features, DB size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).
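The variance semi-ring mentioned above can be illustrated with a small sketch. It assumes the common formulation in which each element is a triple (count, sum, sum of squares); the class name `Var`, the helpers `lift` and `aggregate`, and the toy cross-product join are hypothetical and are not taken from JoinBoost's code. The sketch shows (a) that aggregating each relation separately and multiplying gives the same statistics as materializing the join, and (b) one reading of "addition-to-multiplication preserving": lifting a sum of two values equals the product of the lifted values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """Element of a variance semi-ring: (count, sum of y, sum of y^2)."""
    c: float
    s: float
    q: float

    def __add__(self, o):
        # Union of two disjoint groups of tuples.
        return Var(self.c + o.c, self.s + o.s, self.q + o.q)

    def __mul__(self, o):
        # Join of two groups, where each joined tuple's y is the sum of its parts.
        return Var(
            self.c * o.c,
            self.s * o.c + o.s * self.c,
            self.q * o.c + o.q * self.c + 2 * self.s * o.s,
        )

    def variance(self):
        mean = self.s / self.c
        return self.q / self.c - mean * mean


def lift(y):
    """Map a single value into the semi-ring."""
    return Var(1, y, y * y)


def aggregate(values):
    total = Var(0, 0, 0)
    for y in values:
        total = total + lift(y)
    return total


# Toy cross-product join of R and S where the target is r + s (illustrative data).
r_vals, s_vals = [1, 2], [3, 4]
factorized = aggregate(r_vals) * aggregate(s_vals)                 # no join built
materialized = aggregate(a + b for a in r_vals for b in s_vals)    # join built
print(factorized == materialized, factorized.variance())
print(lift(3 + 4) == lift(3) * lift(4))
```

The variance (and hence an rmse-style split criterion) is obtained from a handful of per-relation aggregates, which is the kind of computation that can be pushed into plain SQL `GROUP BY` queries.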
Cross Modal Data Discovery over Structured and Unstructured Data Lakes
Mohamed Y. Eltabakh (Worcester Polytechnic Institute)*; Mayuresh Kunjir (Amazon AWS); Ahmed K. Elmagarmid (QCRI); Mohammad Shahmeer Ahmad (Qatar Computing Research Institute)
Organizations are collecting increasingly large amounts of data for data-driven decision making. These data are often dumped into a centralized repository, e.g., a data lake, consisting of thousands of structured and unstructured datasets. Perversely, such a mixture makes the problem of discovering tables or documents that are relevant to a user's query very challenging. Despite the recent efforts in data discovery, the problem remains wide open, especially on two fronts: (1) discovering relationships and relatedness across structured and unstructured datasets, where existing techniques either scale poorly, are customized for a specific problem type (e.g., entity matching or data integration), or destroy structural properties along the way, and (2) developing a holistic system for integrating various similarity measurements and sketches in an effective way to boost the discovery accuracy. In this paper, we propose a new data discovery system, named CMDL, for addressing these two limitations. CMDL supports the data discovery process over both structured and unstructured data while retaining the structural properties of tables. As a result, CMDL is the only system to date that empowers end-users to seamlessly pipeline the discovery tasks across the two modalities. We propose a novel multi-modal embedding representation that captures the similarities between text documents and tabular columns. The model training relies on labeled datasets generated through weak supervision, and thus the system is domain agnostic and easily generalizable. We evaluate CMDL on three real-world data lakes with diverse applications and show that our system is significantly more effective for cross-modality discovery compared to the search-based baseline techniques. Moreover, CMDL is more accurate and robust to different data types and distributions compared to the state-of-the-art systems that are limited to only the structured datasets.
RECA: Related Tables Enhanced Column Semantic Type Annotation Framework
Yushi Sun (Hong Kong University of Science and Technology)*; Hao Xin (Hong Kong University of Science and Technology); Lei Chen (Hong Kong University of Science and Technology)
Understanding the semantics of tabular data is of great importance in various downstream applications, such as schema matching, data cleaning, and data integration. Column semantic type annotation is a critical task in the semantic understanding of tabular data. Despite the fact that various approaches have been proposed, they are challenged by the difficulties of handling wide tables and incorporating complex inter-table context information. Failure to handle wide tables limits the usage of column type annotation approaches, while failure to incorporate inter-table context harms the annotation quality. Existing methods either completely ignore these problems or propose ad-hoc solutions. In this paper, we propose Related tables Enhanced Column semantic type Annotation framework (RECA), which incorporates inter-table context information by finding and aligning schema-similar and topic-relevant tables based on a novel named entity schema. The design of RECA can naturally handle wide tables and incorporate useful inter-table context information to enhance the annotation quality. We conduct extensive experiments on two web table datasets to comprehensively evaluate the performance of RECA. Our results show that RECA achieves support-weighted F1 scores of 0.853 and 0.937 with macro average F1 scores of 0.674 and 0.783 on the two datasets respectively, which outperform the state-of-the-art methods.
F3
Scalable ML II
Chair: Alekh Jindal (SmartApps)
SubStrat: A Subset-Based Optimization Strategy for Faster AutoML [sds]
Teddy Lazebnik (University College London); Amit Somech (Bar-Ilan University)*; Abraham Itzhak Weinberg (Bar Ilan University)
Automated machine learning (AutoML) frameworks have become important tools in the data scientist’s arsenal, as they dramatically reduce the manual work devoted to the construction of ML pipelines. Such frameworks intelligently search among millions of possible ML pipelines (typically containing feature engineering, model selection, and hyperparameter tuning steps) and finally output an optimal pipeline in terms of predictive accuracy.
However, when the dataset is large, each individual configuration takes longer to execute, and the overall AutoML running time grows accordingly.
To this end, we present SubStrat, an AutoML optimization strategy that tackles the data size, rather than the configuration space. It wraps existing AutoML tools, and instead of executing them directly on the entire dataset, SubStrat uses a genetic-based algorithm to find a small yet representative data subset that preserves a particular characteristic of the full data. It then employs the AutoML tool on the small subset, and finally, it refines the resulting pipeline by executing a restricted, much shorter, AutoML process on the large dataset. Our experimental results, performed on three popular AutoML frameworks, Auto-Sklearn, TPOT, and H2O, show that SubStrat reduces their running times by 76.3% (on average), with only a 4.15% average decrease in the accuracy of the resulting ML pipeline.
FederatedScope: A Flexible Federated Learning Platform for Heterogeneity
Yuexiang Xie (Alibaba Group); Zhen Wang (Alibaba Group); Dawei Gao (Alibaba-inc); Daoyuan Chen (Alibaba Group); Liuyi Yao (Alibaba Group); Weirui Kuang (Alibaba Group); Yaliang Li (Alibaba Group)*; Bolin Ding (Data Analytics and Intelligence Lab, Alibaba Group); Jingren Zhou (Alibaba Group)
Although remarkable progress has been made by existing federated learning (FL) platforms to provide infrastructures for development, these platforms may not adequately tackle the challenges brought by various types of heterogeneity. To fill this gap, in this paper, we propose a novel FL platform, named FederatedScope, which employs an event-driven architecture to provide users with great flexibility to independently describe the behaviors of different participants. Such a design makes it easy for users to describe participants with various local training processes, learning goals and backends, and coordinate them into an FL course with synchronous or asynchronous training strategies. Towards an easy-to-use and flexible platform, FederatedScope enables rich types of plug-in operations and components for efficient further development, and we have implemented several important components to better help users with privacy protection, attack simulation and auto-tuning. We have released FederatedScope at https://github.com/alibaba/FederatedScope to promote academic research and industrial deployment of federated learning in a wide range of scenarios.
Optimizing Tensor Programs on Flexible Storage [sigmod]
Maximilian Joel Schleich (RelationalAI); Amir Shaikhha (University of Edinburgh)*; Dan Suciu (University of Washington)
Tensor programs often need to process large tensors (vectors, matrices, or higher order tensors) that require a specialized storage format for their memory layout. Several such layouts have been proposed in the literature, such as the Coordinate Format, the Compressed Sparse Row format, and many others, that were especially designed to optimally store tensors with specific sparsity properties. However, existing tensor processing systems require specialized extensions in order to take advantage of every new storage format. In this paper we describe a system that allows users to define flexible storage formats in a declarative tensor query language, similar to the language used by the tensor program. The programmer only needs to write storage mappings, which describe, in a declarative way, how the tensors are laid out in main memory. Then, we describe a cost-based optimizer that optimizes the tensor program for the specific memory layout. We demonstrate empirically significant performance improvements compared to state-of-the-art tensor processing systems.
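To make the two storage layouts named in this abstract concrete, here is a minimal sketch of the Coordinate Format (COO) and the Compressed Sparse Row format (CSR) for a small matrix, written with plain Python lists. It does not show the paper's declarative storage mappings or its optimizer; the function names and the toy matrix are assumptions for illustration.

```python
def to_coo(dense):
    """Coordinate format: parallel lists of row index, column index, value."""
    rows, cols, vals = [], [], []
    for i, row in enumerate(dense):
        for j, v in enumerate(row):
            if v != 0:
                rows.append(i)
                cols.append(j)
                vals.append(v)
    return rows, cols, vals


def to_csr(dense):
    """Compressed Sparse Row: row_ptr[i]..row_ptr[i+1] delimits row i's entries."""
    row_ptr, col_idx, vals = [0], [], []
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                col_idx.append(j)
                vals.append(v)
        row_ptr.append(len(vals))
    return row_ptr, col_idx, vals


def csr_matvec(row_ptr, col_idx, vals, x):
    """Matrix-vector product that touches only the stored non-zeros."""
    return [
        sum(vals[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]


dense = [
    [5, 0, 0],
    [0, 0, 0],
    [0, 3, 7],
]
print(to_coo(dense))
print(to_csr(dense))
print(csr_matvec(*to_csr(dense), [1, 2, 3]))
```

The same matrix-vector product needs a different loop structure for each layout, which illustrates why a system that hard-codes its kernels must be extended whenever a new format is introduced.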
Towards Observability for Production Machine Learning Pipelines [vision]
Shreya Shankar (University of California Berkeley); Aditya Parameswaran (University of California, Berkeley)
Software organizations are increasingly incorporating machine learning (ML) into their product offerings, driving a need for new data management tools. Many of these tools facilitate the initial development of ML applications, but sustaining these applications post-deployment is difficult due to lack of real-time feedback (i.e., labels) for predictions and silent failures that could occur at any component of the ML pipeline (e.g., data distribution shift or anomalous features). We propose a new type of data management system that offers end-to-end observability, or visibility into complex system behavior, for deployed ML pipelines through assisted (1) detection, (2) diagnosis, and (3) reaction to ML-related bugs. We describe new research challenges and suggest preliminary solution ideas in all three aspects. Finally, we introduce an example architecture for a “bolt-on” ML observability system, or one that wraps around existing tools in the stack.
F4
Causality & Explanation
Chair: Davood Rafiei (University of Alberta)
Causal Data Integration [vision]
Brit Youngmann (Massachusetts Institute of Technology)*; Michael Cafarella (MIT CSAIL); Babak Salimi (University of California at San Diego); Anna Zeng (Massachusetts Institute of Technology)
Causal inference is fundamental to empirical scientific discoveries in natural and social sciences; however, in the process of conducting causal inference, data management problems can lead to false discoveries. Two such problems are (i) not having all attributes required for analysis, and (ii) misidentifying which attributes are to be included in the analysis. Analysts often only have access to partial data, and they critically rely on (often unavailable or incomplete) domain knowledge to identify attributes to include for analysis, which is often given in the form of a causal DAG. We argue that data management techniques can surmount both of these challenges. In this work, we introduce the Causal Data Integration (CDI) problem, in which unobserved attributes are mined from external sources and a corresponding causal DAG is automatically built. We identify key challenges and research opportunities in designing a CDI system, and present a system architecture for solving the CDI problem. Our preliminary experimental results demonstrate that solving CDI is achievable and pave the way for future research.
HENCE-X: Toward Heterogeneity-agnostic Multi-level Explainability for Deep Graph Networks
Ge Lv (The Hong Kong University of Science and Technology)*; Chen Jason Zhang (The Hong Kong Polytechnic University); Lei Chen (HKUST)
Deep graph networks (DGNs) have demonstrated their outstanding effectiveness on both heterogeneous and homogeneous graphs. However, their black-box nature does not allow human users to understand their working mechanisms. Recently, extensive efforts have been devoted to explaining DGNs' predictions, yet heterogeneity-agnostic multi-level explainability is still less explored. Since the two types of graphs are both irreplaceable in real-life applications, having a more general and end-to-end explainer becomes a natural and inevitable choice. In the meantime, feature-level explanation is often ignored by existing techniques, while topological-level explanation alone can be incomplete and deceptive. Thus, we propose a heterogeneity-agnostic multi-level explainer in this paper, named HENCE-X, which is a causality-guided method that can capture the non-linear dependencies of model behavior on the input using conditional probabilities. We theoretically prove that HENCE-X is guaranteed to find the Markov blanket of the explained prediction, meaning that all information that the prediction is dependent on is identified. Experiments on three real-world datasets show that HENCE-X outperforms state-of-the-art (SOTA) methods in generating faithful factual and counterfactual explanations of DGNs.
POEM: Pattern-Oriented Explanations of Convolutional Neural Networks [sds]
Vargha Dadvar (University of Waterloo); Lukasz Golab (University of Waterloo)*; Divesh Srivastava (AT&T Chief Data Office)
Convolutional Neural Networks (CNNs) are commonly used in computer vision. However, their predictions are difficult to explain, as is the case with many deep learning models. To address this problem, we present POEM, a modular framework that produces patterns of semantic concepts such as shapes and colours to explain image classifier CNNs. POEM identifies patterns such as "if sofa then living room", meaning that if an image contains a sofa and the model pays attention to the sofa, then the model classifies the image as a living room. We illustrate the advantages of POEM over existing work using quantitative and qualitative experiments.
On Data-Aware Global Explainability of Graph Neural Networks
Ge Lv (The Hong Kong University of Science and Technology)*; Lei Chen (HKUST)
Graph Neural Networks (GNNs) have significantly boosted the performance of many graph-based applications, yet they serve as black-box models. To understand how GNNs make decisions, explainability techniques have been extensively studied. While the majority of existing methods focus on local explainability, we propose DAG-Explainer in this work, aiming for global explainability. Specifically, we observe three properties of superior explanations for a pretrained GNN: they should be highly recognized by the model, compliant with the data distribution, and discriminative among all the classes. The first property requires an explanation to be faithful to the model, while the other two require the explanation to be convincing with respect to the data distribution. Guided by these properties, we design metrics to quantify the quality of each single explanation and formulate the problem of finding data-aware global explanations for a pretrained GNN as an optimization problem. We prove that the problem is NP-hard and adopt a randomized greedy algorithm to find a near-optimal solution. Furthermore, we derive an improved bound of the approximation algorithm in our problem over the state-of-the-art (SOTA) best. Experimental results show that DAG-Explainer can efficiently produce meaningful and trustworthy explanations while preserving comparable quantitative evaluation results to the SOTA methods.
Computing Rule-Based Explanations by Leveraging Counterfactuals
Zixuan Geng (University of Washington)*; Maximilian Schleich (RelationalAI); Dan Suciu (University of Washington)
Sophisticated machine models are increasingly used for high-stakes decisions in everyday life. There is an urgent need to develop effective explanation techniques for such automated decisions. Rule-Based Explanations have been proposed for high-stakes decisions like loan applications, because they increase the users’ trust in the decision. However, rule-based explanations are very inefficient to compute, and existing systems sacrifice their quality in order to achieve reasonable performance. We propose a novel approach to compute rule-based explanations, by using a different type of explanation, Counterfactual Explanations, for which several efficient systems have already been developed. We prove a Duality Theorem, showing that rule-based and counterfactual-based explanations are dual to each other, then use this observation to develop an efficient algorithm for computing rule-based explanations, which uses the counterfactual-based explanation as an oracle. We conduct extensive experiments showing that our system computes rule-based explanations of higher quality, and with the same or better performance, than two previous systems, MinSetCover and Anchor.
F5
Fairness
Chair: Steven Whang (KAIST)
Why Not Yet: Fixing a Top-k Ranking that Is Not Fair to Individuals
Zixuan Chen (Northeastern University)*; Panagiotis Manolios (Northeastern University); Mirek Riedewald (Northeastern University)
This work considers why-not questions in the context of top-k queries and score-based ranking functions. Following the popular linear scalarization approach for multi-objective optimization, we study rankings based on the weighted sum of multiple scores. A given weight choice may be controversial or perceived as unfair to certain individuals or organizations, triggering the question why some entity of interest has not yet shown up in the top-k. We introduce various notions of such why-not-yet queries and formally define them as satisfiability or optimization problems, whose goal is to propose alternative ranking functions that address the placement of the entities of interest. While some why-not-yet problems have linear constraints, others require quantifiers, disjunction, and negation. We propose several optimizations, ranging from a monotonic-core construction that approximates the complex constraints with a conjunction of linear ones, to various techniques that let the user control the tradeoff between running time and approximation quality. Experiments with real and synthetic data demonstrate the practicality and scalability of our technique, showing its superiority compared to the state of the art (SOA).
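The weighted-sum ranking described in this abstract can be illustrated with a minimal sketch. The entity names, scores, and weight vectors below are hypothetical and chosen only for illustration; this is not the authors' algorithm, only a demonstration of how a different weight choice changes who appears in the top-k.

```python
def top_k(entities, weights, k):
    """Rank entities by the weighted sum of their scores; return the k best names."""
    def weighted_score(item):
        _, scores = item
        return sum(w * s for w, s in zip(weights, scores))
    ranked = sorted(entities.items(), key=weighted_score, reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical entities, each with two scores in [0, 1].
entities = {"A": (0.9, 0.2), "B": (0.6, 0.6), "C": (0.2, 0.9), "D": (0.5, 0.4)}

# Under the original weights, entity "C" is not in the top-2.
before = top_k(entities, weights=(0.8, 0.2), k=2)

# An alternative weight vector brings "C" into the top-2.
after = top_k(entities, weights=(0.3, 0.7), k=2)

print(before, after)
```

A why-not-yet query, in this reading, asks for an alternative weight vector such as the second one, ideally one that stays close to the original weights.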
Consistent Range Approximation for Fair Predictive Modeling
Jiongli Zhu (University of California San Diego)*; Sainyam Galhotra (University of Chicago); Nazanin Sabri (University of California at San Diego); Babak Salimi (University of California at San Diego)
This paper proposes a novel framework for certifying the fairness of predictive models trained on biased data. It draws from query answering for incomplete and inconsistent databases to formulate the problem of consistent range approximation (CRA) of fairness queries for a predictive model on a target population. The framework employs background knowledge of the data collection process and biased data, working with or without limited statistics about the target population, to compute a range of answers for fairness queries. Using CRA, the framework builds predictive models that are certifiably fair on the target population, regardless of the availability of external data during training. The framework's efficacy is demonstrated through evaluations on real data, showing substantial improvement over existing state-of-the-art methods.
Models and Mechanisms for Spatial Data Fairness
Sina Shaham (University of Southern California); Gabriel Ghinita (Hamad Bin Khalifa University)*; Cyrus Shahabi (Computer Science Department. University of Southern California)
Fairness in data-driven decision-making studies scenarios where individuals from certain population segments may be unfairly treated when being considered for loan or job applications, access to public resources, or other types of services. In location-based applications, decisions are based on individual whereabouts, which often correlate with sensitive attributes such as race, income, and education.
While fairness has received significant attention recently, e.g., in machine learning, there is little focus on achieving fairness when dealing with location data. Due to their characteristics and specific type of processing algorithms, location data pose important fairness challenges. We introduce the concept of spatial data fairness to address the specific challenges of location data and spatial queries. We devise a novel building block to achieve fairness in the form of fair polynomials. Next, we propose two mechanisms based on fair polynomials that achieve individual spatial fairness, corresponding to two common location-based decision-making types: distance-based and zone-based. Extensive experimental results on real data show that the proposed mechanisms achieve spatial fairness without sacrificing utility.
Satisfying Complex Top-k Fairness Constraints by Preference Substitutions
Md. Mouinul Islam (New Jersey Institute of Technology); Dong Wei (NJIT); Baruch Schieber (New Jersey Institute of Technology); Senjuti Basu Roy (New Jersey Institute of Technology)*
Given m users (voters), where each user casts her preference for a single item (candidate) over n items (candidates) as a ballot, the preference aggregation problem returns k items (candidates) that have the k highest number of preferences (votes). Our work studies this problem considering complex fairness constraints that have to be satisfied via proportionate representations of different values of the group protected attribute(s) in the top-k results. Precisely, we study the margin finding problem under single ballot substitutions, where a single substitution amounts to removing a vote from candidate i and assigning it to candidate j, and the goal is to minimize the number of single ballot substitutions needed to guarantee that the top-k results satisfy the fairness constraints. We study several variants of this problem depending on how top-k fairness constraints are defined: (i) MFBinaryS and MFMultiS are defined when the fairness (proportionate representation) is defined over a single, binary or multivalued, protected attribute, respectively; (ii) MFMulti2 is studied when top-k fairness is defined over two different protected attributes; (iii) MFMulti3+ investigates the margin finding problem, considering 3 or more protected attributes. We study these problems theoretically, and present a suite of algorithms with provable guarantees. We conduct rigorous large-scale experiments involving multiple real-world datasets by appropriately adapting multiple state-of-the-art solutions to demonstrate the effectiveness and scalability of our proposed methods.
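As a rough illustration of the setting, the sketch below aggregates single-choice ballots into a top-k result and applies one single ballot substitution. The ballots, the two candidate groups, and the alphabetical tie-breaking rule are all assumptions made for this example; the margin finding algorithms of the paper are not reproduced here.

```python
from collections import Counter

def top_k(ballots, k):
    """Return the k candidates with the most votes (ties broken by name, an assumption)."""
    counts = Counter(ballots)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [candidate for candidate, _ in ranked[:k]]

def substitute(ballots, i, j):
    """Single ballot substitution: move one vote from candidate i to candidate j."""
    updated = list(ballots)
    updated[updated.index(i)] = j
    return updated

# Hypothetical election: candidates a, b belong to group X; c, d to group Y.
group = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
ballots = ["a"] * 4 + ["b"] * 3 + ["c"] * 2 + ["d"] * 2

original = top_k(ballots, 2)                        # both winners are from group X
repaired = top_k(substitute(ballots, "b", "c"), 2)  # one substitution adds a Y winner

print(original, [group[c] for c in original])
print(repaired, [group[c] for c in repaired])
```

In this toy instance a single substitution is enough to place one candidate from each group in the top-2; the margin is the smallest number of such substitutions that achieves the constraint.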
G1
Graph Analytics I
Chair: Sibo Wang (Chinese University of Hong Kong)
SUFF: Accelerating Subgraph Matching with Historical Data
Xun Jian (HKUST)*; Zhiyuan Li (The Hong Kong University of Science and Technology); Lei Chen (Hong Kong University of Science and Technology)
Subgraph matching is a fundamental problem in graph theory and has wide applications in areas like sociology, chemistry, and social networks. Due to its NP-hardness, the basic approach is a brute-force search over the whole search space. Some pruning strategies have been proposed to reduce the search space. However, they are either space-inefficient or based on assumptions that the graph has specific properties. In this paper, we propose SUFF, a general and powerful structure filtering framework, which can accelerate most of the existing approaches with slight modifications. Specifically, it builds a set of filters using matching results of past queries, and uses them to prune the search space for future queries. By fully utilizing the relationship between matches of two queries, it ensures that such pruning is sound. Furthermore, several optimizations are proposed to reduce the computation and space cost for building, storing, and using filters. Extensive experiments are conducted on multiple real-world data sets and representative existing approaches. The results show that SUFF can achieve up to 15X speedup with small overheads.
Computing Graph Edit Distance via Neural Graph Matching
Chengzhi Piao (Chinese University of Hong Kong); Tingyang Xu (Tencent AI Lab); Xiangguo Sun (CUHK); Yu Rong (Tencent AI Lab); Kangfei Zhao (Beijing Institute of Technology); Hong Cheng (Chinese University of Hong Kong)*
Graph edit distance (GED) computation is a fundamental NP-hard problem in graph theory. Given a graph pair $(G_1, G_2)$, GED is defined as the minimum number of primitive operations converting $G_1$ to $G_2$. Early studies focus on search-based inexact algorithms such as A*-beam search, and greedy algorithms using bipartite matching due to its NP-hardness. They can obtain a sub-optimal solution by constructing an edit path (the sequence of operations that converts $G_1$ to $G_2$). Recent studies convert the GED between a given graph pair $(G_1, G_2)$ into a similarity score in the range $(0, 1)$ by a well designed function. Then machine learning models (mostly based on graph neural networks) are applied to predict the similarity score. They achieve a much higher numerical precision than the sub-optimal solutions found by classical algorithms. However, a major limitation is that these machine learning models cannot generate an edit path. They treat the GED computation as a pure regression task to bypass its intrinsic complexity, but ignore the essential task of converting $G_1$ to $G_2$. This severely limits the interpretability and usability of the solution.
In this paper, we propose a novel deep learning framework that solves the GED problem in a two-step manner: 1) The proposed graph neural network GEDGNN is in charge of predicting the GED value and a matching matrix; and 2) A post-processing algorithm based on $k$-best matching is used to derive $k$ possible node matchings from the matching matrix generated by GEDGNN. The best matching will finally lead to a high-quality edit path. Extensive experiments are conducted on three real graph data sets and synthetic power-law graphs to demonstrate the effectiveness of our framework. Compared to the best result of existing GNN-based models, the mean absolute error (MAE) on GED value prediction decreases by $4.9\% \sim 74.3\%$. Compared to the state-of-the-art searching algorithm Noah, the MAE on GED value based on edit path reduces by $53.6\% \sim 88.1\%$.
ARKGraph: All-Range Approximate K-Nearest-Neighbor Graph
Chaoji Zuo (Rutgers University - New Brunswick); Dong Deng (Rutgers University - New Brunswick)*
Given a collection of vectors, the approximate K-nearest-neighbor graph (KGraph for short) connects every vector to its approximate K-nearest-neighbors (KNN for short). KGraph plays an important role in data visualization, semantic search, manifold learning, and machine learning. The vectors are typically vector representations of real-world objects (e.g., images and documents), which often come with a few structured attributes, such as timestamps and locations. In this paper, we study the all-range approximate K-nearest-neighbor graph (ARKGraph) problem. Specifically, given a collection of vectors, each associated with a numerical search key value (e.g., a timestamp), we aim to build an index that takes a search key range as the query and returns the KGraph of vectors whose search keys are within the query range. ARKGraph can facilitate interactive high dimensional data visualization, data mining, etc. A key challenge of this problem is the huge index size. This is because, given $n$ vectors, a brute-force index stores a KGraph for every search key range, which results in $O(Kn^3)$ index size as there are $O(n^2)$ search key ranges and each KGraph takes $O(Kn)$ space. We observe that the KNN of a vector in nearby ranges are often the same, which can be grouped together to save space. Based on this observation, we propose a series of novel techniques that reduce the index size significantly to just $O(Kn\log n)$ in the average case. Furthermore, we develop an efficient indexing algorithm that constructs the optimized ARKGraph index directly without exhaustively calculating the distance between every pair of vectors. To process a query, for each vector in the query range, we only need $O(\log\log n + K\log K)$ to restore its KNN in the query range from the optimized ARKGraph index. We conducted extensive experiments on real-world datasets. 
Experimental results show that our optimized ARKGraph index achieved a small index size, low query latency, and good scalability. Specifically, our approach was 1000x faster than the baseline method that builds a KGraph for all the vectors in the query range on-the-fly.
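The on-the-fly baseline mentioned in this abstract can be sketched as follows. This is a brute-force illustration under stated assumptions: it computes exact rather than approximate nearest neighbors, uses Euclidean distance, and the items and helper names are hypothetical.

```python
import math

def kgraph_in_range(items, lo, hi, k):
    """Build a K-nearest-neighbor graph over the items whose key lies in [lo, hi].

    items is a list of (key, vector) pairs; the result maps each selected item
    index to the indices of its k nearest selected neighbors.
    """
    selected = [i for i, (key, _) in enumerate(items) if lo <= key <= hi]
    graph = {}
    for i in selected:
        others = [j for j in selected if j != i]
        others.sort(key=lambda j: math.dist(items[i][1], items[j][1]))
        graph[i] = others[:k]
    return graph

def number_of_ranges(n):
    """Count the distinct key ranges over n sorted keys: n(n+1)/2, i.e. O(n^2)."""
    return n * (n + 1) // 2

# Hypothetical data: (search key, 2-d vector).
items = [(1, (0.0, 0.0)), (2, (1.0, 0.0)), (3, (5.0, 0.0)),
         (4, (6.0, 0.0)), (5, (0.0, 0.5))]

narrow = kgraph_in_range(items, 1, 3, k=1)
wide = kgraph_in_range(items, 1, 5, k=1)
print(narrow, wide, number_of_ranges(len(items)))
```

The nearest neighbor of item 0 differs between the two ranges, which is why a naive index would store one graph of size O(Kn) for each of the O(n^2) ranges, giving the O(Kn^3) bound stated above.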
Quasi-stable Coloring for Graph Compression: Approximating Max-Flow, Linear Programs, and Centrality
Moe Kayali (University of Washington)*; Dan Suciu (University of Washington)
We propose quasi-stable coloring, an approximate version of stable coloring. Stable coloring, also called color refinement, is a well-studied technique in graph theory for classifying vertices, which can be used to build compact, lossless representations of graphs. However, its usefulness is limited due to its reliance on strict symmetries. Real data compresses very poorly using color refinement. We propose the first, to our knowledge, approximate color refinement scheme, which we call quasi-stable coloring. By using approximation, we alleviate the need for strict symmetry, and allow for a tradeoff between the degree of compression and the accuracy of the representation. We study three applications: Linear Programming, Max-Flow, and Betweenness Centrality, and provide theoretical evidence in each case that a quasi-stable coloring can lead to good approximations on the reduced graph. Next, we consider how to compute a maximal quasi-stable coloring: we prove that, in general, this problem is NP-hard, and propose a simple, yet effective algorithm based on heuristics. Finally, we evaluate experimentally the quasi-stable coloring technique on several real graphs and applications, comparing with prior approximation techniques.
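For readers unfamiliar with color refinement, the exact (stable) variant that the abstract starts from can be sketched in a few lines. This is the classical procedure, not the quasi-stable approximation proposed in the paper, and the adjacency-list input format is an assumption of this sketch.

```python
def color_refinement(adj):
    """Compute the coarsest stable coloring of an undirected graph.

    adj maps each vertex to a list of its neighbors. Two vertices end up with the
    same color only if they have the same number of neighbors of every color.
    """
    colors = {v: 0 for v in adj}
    while True:
        # Signature: own color plus the sorted multiset of neighbor colors.
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj
        }
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        refined = {v: palette[signatures[v]] for v in adj}
        # Each round can only split classes; stop when no class was split.
        if len(set(refined.values())) == len(set(colors.values())):
            return refined
        colors = refined

# Path graph 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
coloring = color_refinement(path)
print(coloring)
```

On the path, the two endpoints share a color, their two neighbors share another, and the middle vertex gets its own, so five vertices compress to three classes. Real graphs rarely have this much exact symmetry, which is the limitation the abstract describes.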
Temporal SIR-GN: Efficient and Effective Structural Representation Learning for Temporal Graphs
Janet Layne (Boise State University); Justin Carpenter (Boise State University); Edoardo Serra (Boise State University); Francesco Gullo (UniCredit)*
Node representation learning (NRL) generates numerical vectors (embeddings) for the nodes of a graph. Structural NRL specifically assigns similar node embeddings for those nodes that exhibit similar structural roles. This is in contrast with its proximity-based counterpart, wherein similarity between embeddings reflects spatial proximity among nodes. Structural NRL is useful for tasks such as node classification where nodes of the same class share structural roles, though there may exist a distant path, or no path, between them. Although structural NRL has been well-studied in static graphs, it has received limited attention in the temporal setting. Here, the embeddings are required to represent the evolution of nodes’ structural roles over time. The existing methods are limited in terms of efficiency and effectiveness: they scale poorly to even a moderate number of timestamps, or capture structural role only tangentially. In this work, we present a novel unsupervised approach to structural representation learning for temporal graphs that overcomes these limitations. For each node, our approach clusters then aggregates the embedding of a node’s neighbors for each timestamp, followed by a further temporal aggregation of all timestamps. This is repeated for (at most) d iterations, so as to acquire information from the d-hop neighborhood of a node. Our approach takes linear time in the number of overall temporal edges, and possesses important theoretical properties that formally demonstrate its effectiveness. Extensive experiments on synthetic and real datasets show superior performance in node classification and regression tasks, and superior scalability of our approach to large graphs.
SUREL+: Moving from Walks to Sets for Scalable Subgraph-based Graph Representation Learning [sds]
Haoteng Yin (Purdue University)*; Muhan Zhang (Peking University); Jianguo Wang (Purdue University); Pan Li (Georgia Tech.)
Subgraph-based graph representation learning (SGRL) has recently emerged as a powerful tool in many prediction tasks on graphs due to its advantages in model expressiveness and generalization ability. Most previous SGRL models face computational issues related to the high cost of extracting subgraphs for each training or testing query. Recently, SUREL was proposed to accelerate SGRL, which samples random walks offline and joins these walks online as a proxy of subgraphs for prediction. Thanks to the reusability of sampled walks across different queries, SUREL achieves state-of-the-art performance in terms of scalability and prediction accuracy. However, SUREL still suffers from high computational overhead caused by node redundancy in sampled walks. In this work, we propose a novel framework SUREL+ that upgrades SUREL by using node sets instead of walks to represent subgraphs. By definition, such set-based representations avoid repeated nodes, but node sets can be irregular in size. To solve this issue, we design a dedicated sparse data structure to efficiently store and access node sets, and provide a specialized operator to join them in parallel batches. SUREL+ is modularized to support multiple types of set samplers, structural features, and neural encoders to complement the loss of structural information after the reduction from walks to sets. Extensive experiments have been performed to verify the effectiveness of SUREL+ in the prediction tasks of links, relation types, and higher-order patterns. SUREL+ achieves 3-11X speedups over SUREL while maintaining comparable or even better prediction performance; compared to other SGRL baselines, SUREL+ achieves ~20X speedups and significantly improves the prediction accuracy.
Scaling Up Structural Clustering to Large Probabilistic Graphs Using Lyapunov Central Limit Theorem
Joseph N Howie (University of Victoria)*; Venkatesh Srinivasan (University of Victoria); Alex Thomo (University of Victoria)
Structural clustering is one of the most widely used graph clustering frameworks. In this paper, we focus on structural clustering of probabilistic graphs, which comes with significant computational challenges and has, so far, resisted efficient solutions that are able to scale to large graphs, e.g., the state of the art can only handle graphs with a few million edges. We address the main bottleneck step of probabilistic structural clustering, computing the structural similarity of vertices based on their Jaccard similarity over the set of possible worlds of a given probabilistic graph. The state of the art used dynamic programming, a quadratic run-time algorithm that does not scale to pairs of vertices of high degree. In this paper, we present a novel approach based on the Lyapunov Central Limit Theorem. By using a carefully chosen set of random variables, we are able to cast the computation of structural similarity as computing a one-tailed area under the Normal Distribution. Our approach has linear run-time as opposed to quadratic, and as such, it scales to much larger inputs. Extensive experiments show that our approach can handle massive graphs at web scale, which the state of the art cannot.
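The contrast between a quadratic dynamic program and a linear-time Normal tail can be illustrated generically. The sketch below is not the paper's construction: it simply compares the exact tail probability of a sum of independent, non-identical Bernoulli variables with a Normal approximation. The probabilities, the threshold, and the continuity correction are assumptions of this example.

```python
import math

def tail_exact(probs, t):
    """P(sum of independent Bernoulli(p_i) >= t) by dynamic programming, O(n^2) time."""
    dist = [1.0]  # dist[s] = probability that the partial sum equals s
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for s, q in enumerate(dist):
            nxt[s] += q * (1.0 - p)
            nxt[s + 1] += q * p
        dist = nxt
    return sum(dist[t:])

def tail_normal(probs, t):
    """Approximate the same tail by a one-tailed Normal area, O(n) time."""
    mean = sum(probs)
    variance = sum(p * (1.0 - p) for p in probs)
    z = (t - 0.5 - mean) / math.sqrt(variance)  # continuity correction
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Hypothetical edge-existence probabilities for 120 independent events.
probs = [0.3, 0.5, 0.7] * 40
exact = tail_exact(probs, 65)
approx = tail_normal(probs, 65)
print(exact, approx)
```

The dynamic program touches every (variable, partial sum) pair, whereas the approximation needs only the mean and variance, which is the source of the linear running time.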
Efficient Maximum k-Plex Computation over Large Sparse Graphs
Lijun Chang (The University of Sydney)*; Mouyi Xu (The University of Sydney); Darren Strash ()
The k-plex model is a relaxation of the clique model by allowing every vertex to miss up to k neighbors. Designing exact and efficient algorithms for computing the maximum k-plex in a graph has been receiving increasing interest recently. However, the existing algorithms are still inefficient due to major limitations. In this paper, we design a new algorithm, kPlexS, for the maximum k-plex problem, with three novel contributions. Firstly, we propose a new framework for computing the maximum k-plex over large sparse graphs, by iteratively extracting small dense subgraphs from the input graph and then solving each of the extracted dense subgraphs by a branch-and-bound search. Secondly, we propose an efficient reduction algorithm CTCP to reduce the input graph size by exhaustively conducting vertex reduction and edge reduction. CTCP computes a smaller reduced graph and also has a lower time complexity than the existing techniques. Moreover, we iteratively invoke CTCP to reduce the input graph once a vertex has been processed and removed from it. Thirdly, we develop a branch-and-bound algorithm BBMatrix specifically targeting the dense subgraphs that are extracted from the input graph. BBMatrix represents its input graph by an adjacency matrix, and utilizes both first-order (i.e., individual vertices) and second-order information (i.e., pairs of vertices) for reduction and upper bounding. In addition, incremental techniques are proposed to efficiently apply the reduction and upper bounding during the recursion. Extensive empirical studies on large real graphs demonstrate that our algorithm kPlexS outperforms the state-of-the-art algorithms BnB, Maplex, and KpLeX.
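A membership test makes the k-plex definition concrete. The sketch assumes the common convention in which a vertex counts itself among the missed neighbors, so that every vertex of a k-plex S has at least |S| - k neighbors inside S and a clique is a 1-plex; the graph and function name are hypothetical, and this is only a checker, not the kPlexS search.

```python
def is_k_plex(adj, subset, k):
    """Check whether subset induces a k-plex in the graph given by adj.

    adj maps each vertex to the set of its neighbors. Assumed convention: each
    vertex of the subset must have at least len(subset) - k neighbors inside it.
    """
    members = set(subset)
    return all(len(adj[v] & members) >= len(members) - k for v in members)

# A 4-cycle 0 - 1 - 2 - 3 - 0: every vertex misses exactly one other vertex.
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}

print(is_k_plex(cycle, {0, 1, 2, 3}, 1))  # not a clique
print(is_k_plex(cycle, {0, 1, 2, 3}, 2))  # but a 2-plex
print(is_k_plex(cycle, {0, 1}, 1))        # an edge is a clique
```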
G3
Text Processing & Search
Chair: Hazar Harmouch (Hasso Plattner Institute)
Pollock: A Data Loading Benchmark [eab]
Gerardo Vitagliano (Hasso Plattner Institute)*; Mazhar Hameed (Hasso Plattner Institute); Lan Jiang (Hasso Plattner Institute); Lucas Reisener (Hasso Plattner Institute); Eugene Wu (Columbia University); Felix Naumann (Hasso Plattner Institute, University of Potsdam)
Any system at play in a data-driven project has a fundamental requirement: the ability to load data. The de-facto standard format to distribute and consume raw data is csv. Yet, the plain text and flexible nature of this format often make such files difficult to parse and their content difficult to load correctly, requiring cumbersome data preparation steps. We propose a benchmark to assess the robustness of systems in loading data from non-standard csv formats and with structural inconsistencies. First, we formalize a model to describe the issues that affect real-world files and use it to derive a systematic "pollution" process to generate dialects for any given grammar. Our benchmark leverages the pollution framework for the csv format. To guide pollution, we have surveyed thousands of real-world, publicly available csv files, recording the problems we encountered. We demonstrate the applicability of our benchmark by testing and scoring 16 different systems: popular csv parsing frameworks, relational database tools, spreadsheet systems, and a data visualization tool.
Text Indexing for Long Patterns: Anchors are All you Need
Lorraine A. K. Ayad (Brunel University); Grigorios Loukides (King's College London)*; Solon P. Pissis (CWI)
In many real-world database systems, a large fraction of the data is represented by strings: sequences of letters over some alphabet. This is because strings can easily encode data arising from different sources. It is often crucial to represent such string datasets in a compact form but also to simultaneously enable fast pattern matching queries. This is the classic text indexing problem. The four absolute measures anyone should pay attention to when designing or implementing a text index are: (i) index space; (ii) query time; (iii) construction space; and (iv) construction time.
Unfortunately, most (if not all) widely-used indexes (e.g., suffix tree, suffix array, or their compressed counterparts) are not optimized for all four measures simultaneously, as it is difficult to have the best of all four worlds. Here, we take an important step in this direction by showing that text indexing with locally consistent anchors (lc-anchors) offers remarkably good performance in all four measures, when we have at hand a lower bound $\ell$ on the length of the queried patterns, which is arguably a quite reasonable assumption in practical applications. Specifically, we improve on the construction of the index proposed by Loukides and Pissis, which is based on bidirectional string anchors (bd-anchors), a new type of lc-anchors, by: (i) designing an average-case linear-time algorithm to compute bd-anchors; and (ii) developing a semi-external-memory implementation to construct the index in small space using near-optimal work. We then present an extensive experimental evaluation, based on the four measures, using real benchmark datasets. The results show that, for long patterns, the index constructed using our improved algorithms compares favorably to all classic indexes: (compressed) suffix tree; (compressed) suffix array; and the FM-index.
Autonomously Computable Information Extraction
Besat Kassaie (University of Waterloo); Frank Wm. Tompa (University of Waterloo)*
Most optimization techniques deployed in information extraction systems assume that source documents are static. Instead, extracted relations can be considered to be materialized views defined by a language built on regular expressions. Using this perspective, we can provide an efficient verifier (using static analysis) that can be used to avoid the high cost of re-extracting information after an update. In particular, we propose an efficient mechanism to identify updates for which we can autonomously compute an extracted relation. We present experimental results that support the feasibility and practicality of this mechanism in real world extraction systems.
REmatch: a novel regex engine for finding all matches
Cristian Riveros (PUC Chile); Nicolás Van Sint Jan (PUC); Domagoj Vrgoč (PUC)*
In this paper, we present the REmatch system for information extraction. REmatch is based on a recently proposed enumeration algorithm for evaluating regular expressions with capture variables supporting the all-match semantics. It tells a story of what it takes to make a theoretically optimal algorithm work in practice. As we show here, a naive implementation of the original algorithm would have a hard time dealing with realistic workloads. We thus develop a new algorithm and a series of optimizations that make REmatch as fast as or faster than many popular RegEx engines while at the same time being able to return all the outputs: a task that most other engines tend to struggle with.
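The difference between the usual matching behavior and returning all matches can be seen with Python's standard `re` module. The brute-force enumerator below is only a sketch of the idea of reporting every matching span; it is not REmatch's algorithm, it ignores capture variables, and it takes time quadratic in the text length.

```python
import re

def standard_matches(pattern, text):
    """Spans reported by re.finditer: leftmost, non-overlapping matches only."""
    return [m.span() for m in re.finditer(pattern, text)]

def all_matches(pattern, text):
    """Every span (i, j) such that text[i:j] matches the whole pattern."""
    regex = re.compile(pattern)
    return [
        (i, j)
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
        if regex.fullmatch(text, i, j)
    ]

print(standard_matches(r"a+", "aab"))  # one greedy match
print(all_matches(r"a+", "aab"))       # three overlapping matches
```

A conventional engine reports the single span covering "aa", while the all-match view also includes each individual "a".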
Web Record Extraction with Invariants
Zhijia Chen (Temple University)*; Weiyi Meng (Binghamton University); Eduard Dragut (Temple University)
Web records are structured data on a Web page that embeds records retrieved from an underlying database according to some templates. Mining data records on the Web enables the integration of data from multiple Web sites for providing value-added services. Most existing works on Web record extraction make two key assumptions: (1) records are retrieved from databases with uniform schemas and (2) records are displayed in a linear structure on a Web page. These assumptions no longer hold on the modern Web. A Web page may present records of diverse entity types with different schemas and organize records hierarchically, in nested structures, to show richer relationships among records. In this paper, we revisit these assumptions and modify them to reflect Web pages on the modern Web. Based on the reformulated assumptions, we introduce the concept of invariant in Web data records and propose Miria (MIning Record InvariAnt), a bottom-up, recursive approach to construct the Web records from the invariants. The proposed approach is both effective and efficient, consistently outperforming the state-of-the-art Web record extraction methods on modern Web pages.
G4
Graph Analytics III
Chair: Siqiang Luo (Nanyang Technological University)
Distributed Graph Embedding with Information-Oriented Random Walks
Peng Fang (Huazhong University of Science and Technology)*; Arijit Khan (Aalborg University); Siqiang Luo (Nanyang Technological University); Fang Wang (Huazhong University of Science and Technology); Dan Feng (Huazhong University of Science and Technology); Zhenli Li (Huazhong University of Science and Technology); Wei Yin (Huazhong University of Science and Technology); Yuchao Cao (Huazhong University of Science and Technology)
Graph embedding maps graph nodes to low-dimensional vectors, and is widely adopted in machine learning tasks. The increasing availability of billion-edge graphs underscores the importance of learning efficient and effective embeddings on large graphs, such as link prediction on Twitter with over one billion edges. Most existing graph embedding methods fall short of reaching high data scalability. In this paper, we present a general-purpose, distributed, information-centric random walk-based graph embedding framework, DistGER, which can scale to embed billion-edge graphs. DistGER incrementally computes information-centric random walks. It further leverages a multi-proximity-aware, streaming, parallel graph partitioning strategy, simultaneously achieving high local partition quality and excellent workload balancing across machines. DistGER also improves the distributed Skip-Gram learning model to generate node embeddings by optimizing the access locality, CPU throughput, and synchronization efficiency. Experiments on real-world graphs demonstrate that compared to state-of-the-art distributed graph embedding frameworks, including KnightKing, DistDGL, and PyTorch-BigGraph, DistGER exhibits 2.33×–129× acceleration, 45% reduction in cross-machine communication, and >10% effectiveness improvement in downstream tasks.
Decoupled Graph Neural Networks for Large Dynamic Graphs [sds]
Yanping Zheng (Renmin University of China)*; Zhewei Wei (Renmin University of China); Jiajun Liu (CSIRO)
Real-world graphs, such as social networks, financial transactions, and recommendation systems, often demonstrate dynamic behavior. This phenomenon, known as graph stream, involves the dynamic changes of nodes and the emergence and disappearance of edges. To effectively capture both the structural and temporal aspects of these dynamic graphs, dynamic graph neural networks have been developed. However, existing methods are usually tailored to process either continuous-time or discrete-time dynamic graphs, and cannot be generalized from one to the other. In this paper, we propose a decoupled graph neural network for large dynamic graphs, including a unified dynamic propagation that supports efficient computation for both continuous and discrete dynamic graphs. Since graph structure-related computations are only performed during the propagation process, the prediction process for the downstream task can be trained separately without expensive graph computations, and therefore any sequence model can be plugged-in and used. As a result, our algorithm achieves exceptional scalability and expressiveness. We evaluate our algorithm on seven real-world datasets of both continuous-time and discrete-time dynamic graphs. The experimental results demonstrate that our algorithm achieves state-of-the-art performance in both kinds of dynamic graphs. Most notably, the scalability of our algorithm is well illustrated by its successful application to large graphs with up to over a billion temporal edges and over a hundred million nodes.
Estimating Single-Node PageRank in $\tilde{O}\left(\min\{d_t, \sqrt{m}\}\right)$ Time
Hanzhi Wang (Renmin University of China)*; Zhewei Wei (Renmin University of China)
PageRank is a famous measure of graph centrality that has numerous applications in practice. The problem of computing a single node's PageRank has been the subject of extensive research for over a decade. However, existing methods still incur large time complexities despite years of effort. Even on undirected graphs, where PageRank scores satisfy several valuable properties, the problem of locally approximating the PageRank score of a target node remains a challenging task. Two commonly adopted techniques, Monte-Carlo based random walks and backward push, both cost $O(n)$ time in the worst-case scenario, which hinders existing methods from achieving a sublinear time complexity like $O(\sqrt{m})$ on an undirected graph with $n$ nodes and $m$ edges.
In this paper, we focus on the problem of single-node PageRank computation on undirected graphs. We propose a novel algorithm, SetPush, for estimating single-node PageRank specifically on undirected graphs. With non-trivial analysis, we prove that our SetPush achieves $\tilde{O}\left(\min\left\{d_t, \sqrt{m}\right\}\right)$ time complexity for estimating the target node $t$'s PageRank with constant relative error and constant failure probability on undirected graphs. We conduct comprehensive experiments to demonstrate the effectiveness of SetPush.
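As background for the baselines named above, here is a minimal sketch of the classic backward push technique for a single target node, checked against plain power iteration. It is not SetPush. The function names, the teleport probability `alpha`, the residue threshold `r_max`, and the adjacency-dictionary graph format are assumptions made for illustration, and the graph is assumed to be undirected with no isolated nodes or self-loops.

```python
from collections import deque


def backward_push_pagerank(adj, target, alpha=0.2, r_max=1e-9):
    """Estimate PageRank(target) with the classic backward push baseline.

    adj maps each node to its neighbor list (undirected, no isolated nodes).
    reserve[s] approximates the personalized PageRank pi(s, target); the
    estimate is the average of the reserves over all source nodes.
    """
    n = len(adj)
    reserve = {v: 0.0 for v in adj}
    residue = {v: 0.0 for v in adj}
    residue[target] = 1.0
    queue = deque([target])
    queued = {target}
    while queue:
        v = queue.popleft()
        queued.discard(v)
        r = residue[v]
        if r <= r_max:
            continue
        # Settle an alpha fraction of the residue, push the rest backwards.
        residue[v] = 0.0
        reserve[v] += alpha * r
        for u in adj[v]:
            residue[u] += (1.0 - alpha) * r / len(adj[u])
            if residue[u] > r_max and u not in queued:
                queue.append(u)
                queued.add(u)
    return sum(reserve.values()) / n


def power_iteration_pagerank(adj, alpha=0.2, iterations=200):
    """Reference PageRank vector computed by synchronous power iteration."""
    n = len(adj)
    pr = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        pr = {
            v: alpha / n + (1.0 - alpha) * sum(pr[u] / len(adj[u]) for u in adj[v])
            for v in adj
        }
    return pr


# Small undirected example graph: a triangle 0-1-2 with a pendant node 3.
example_graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
```

With a tiny `r_max`, the push estimate agrees with power iteration on `example_graph`; the cost of driving residues that low across many nodes is what motivates sublinear alternatives.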
Space-Efficient Random Walks on Streaming Graphs
Serafeim Papadias (TU Berlin)*; Zoi Kaoudi (TU Berlin); Jorge-Arnulfo Quiané-Ruiz (IT University of Copenhagen); Volker Markl (Technische Universität Berlin)
Graphs in many applications, such as social networks and IoT, are inherently streaming, involving continuous additions and deletions of vertices and edges at high rates. Constructing random walks in a graph, i.e., sequences of vertices selected with a specific probability distribution, is a prominent task in many of these graph applications as well as machine learning (ML) on graph-structured data. In a streaming scenario, random walks need to constantly keep up with the graph updates to avoid stale walks and thus, performance degradation in the downstream tasks. We present Wharf, a system that efficiently stores and updates random walks on streaming graphs. It avoids a potential size explosion by maintaining a compressed, high-throughput, and low-latency data structure. It achieves (i) the succinct representation by coupling compressed purely functional binary trees and pairing functions for storing the walks, and (ii) efficient walk updates by effectively pruning the walk search space. We evaluate Wharf, with real and synthetic graphs, in terms of throughput and latency when updating random walks. The results show the high superiority of Wharf over inverted index- and tree-based baselines.
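A pairing function maps two non-negative integers to a single integer and back without loss, which is one way to store a pair of values in one key. The sketch below uses the well-known Cantor pairing function purely as an example; the abstract does not say which pairing function Wharf uses or what exactly it pairs, so the `(walk_id, position)` reading in the comments is a hypothetical illustration.

```python
from math import isqrt


def cantor_pair(a, b):
    """Encode two non-negative integers as one non-negative integer."""
    if a < 0 or b < 0:
        raise ValueError("cantor_pair expects non-negative integers")
    return (a + b) * (a + b + 1) // 2 + b


def cantor_unpair(z):
    """Invert cantor_pair exactly, using integer square roots only."""
    if z < 0:
        raise ValueError("cantor_unpair expects a non-negative integer")
    w = (isqrt(8 * z + 1) - 1) // 2   # index of the diagonal that contains z
    t = w * (w + 1) // 2              # first code on that diagonal
    b = z - t
    a = w - b
    return a, b


# Hypothetical use: pack (walk_id, position) into a single integer key.
key = cantor_pair(1, 2)
```

Because the encoding is a bijection, the original pair can always be recovered from the key, so a single integer per entry suffices.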
Density Personalized Group Query
Chih-Ya Shen (National Tsing Hua University); Shao-Heng Ko (Academia Sinica); Guang-Siang Lee (Academia Sinica); Wang-Chien Lee (Pennsylvania State University, USA); De-Nian Yang (Academia Sinica)*
Research on new queries for finding dense subgraphs and groups has been actively pursued due to their many applications, especially in social network analysis and graph mining. However, existing work faces two major weaknesses: i) incapability of supporting personalized neighborhood density, and ii) inability to find sparse groups. To tackle the above issues, we propose a new query, called Density-Customized Social Group Query (DCSGQ), that accommodates the need for personalized density by allowing individual users to flexibly configure their social tightness (and sparseness) for the target group. The proposed DCSGQ is general due to its flexible configuration of neighboring social density in queries. We prove the NP-hardness and inapproximability of DCSGQ, formulate an Integer Program (IP) as a baseline, and propose an efficient algorithm, FSGSel-RR, by relaxing the IP. We then propose a fixed-parameter tractable algorithm with a performance guarantee, named FSGSel-TD, and further combine it with FSGSel-RR into a hybrid approach, named FSGSel-Hybrid, in order to strike a good balance between solution quality and efficiency. Extensive experiments on multiple large real datasets demonstrate the superior solution quality and efficiency of our approaches over existing subgraph and group queries.
G5
Searching Data
Chair: Kaiyu Li (York University)
A Generic Framework for Efficient Computation of Top-k Diverse Results [vldbj]
Md Mouinul Islam (New Jersey Institute of Technology); Mahsa Asadi (New Jersey Institute of Technology); Sihem Amer-Yahia (CNRS: Centre National de la Recherche Scientifique); Senjuti Basu Roy (New Jersey Institute of Technology)*
Result diversification is extensively studied in the context of search, recommendation, and data exploration. There are numerous algorithms that return top-k results that are both diverse and relevant. These algorithms typically have computational loops that compare the pairwise diversity of records to decide which ones to retain. We propose an access primitive DivGetBatch() that replaces repeated pairwise comparisons of diversity scores of records by pairwise comparisons of “aggregate” diversity scores of a group of records, thereby improving the running time of these algorithms while preserving the same results. We integrate the access primitive inside three representative diversity algorithms and prove that the augmented algorithms leveraging the access primitive preserve original results. We analyze the worst and expected case running times of these algorithms. We propose a computational framework to design this access primitive that has a pre-computed index structure I-tree that is agnostic to the specific details of diversity algorithms. We develop principled solutions to construct and maintain I-tree. Our experiments on multiple large real-world datasets corroborate our theoretical findings, while ensuring up to a 24× speedup.
Survey of Window Types for Aggregation in Stream Processing Systems [vldbj]
Juliane Verwiebe (Technische Universität Berlin); Philipp M Grulich (Technische Universität Berlin)*; Jonas Traub (Technische Universität Berlin); Volker Markl (Technische Universität Berlin)
In this paper, we present the first comprehensive survey of window types for stream processing systems which have been presented in research and commercial systems. We cover publications from the most relevant conferences, journals, and system whitepapers on stream processing, windowing, and window aggregation which have been published over the last 20 years. For each window type, we provide detailed specifications, formal notations, synonyms, and use-case examples. We classify each window type according to categories that have been proposed in literature and describe the out-of-order processing. In addition, we examine academic, commercial, and open-source systems with respect to the window types that they support. Our survey offers a comprehensive overview that may serve as a guideline for the development of stream processing systems, window aggregation techniques, and frameworks that support a variety of window types.
A Survey on Deep Learning Approaches for Text-to-SQL [vldbj]
George Katsogiannis (Athena Research and Innovation Center)*; Georgia Koutrika (Athena Research and Innovation Center)
To bridge the gap between users and data, numerous text-to-SQL systems have been developed that allow users to pose natural language questions over relational databases. Recently, novel text-to-SQL systems are adopting deep learning methods with very promising results. At the same time, several challenges remain open making this area an active and flourishing field of research and development. To make real progress in building text-to-SQL systems, we need to de-mystify what has been done, understand how and when each approach can be used, and, finally, identify the research challenges ahead of us. The purpose of this survey is to present a detailed taxonomy of neural text-to-SQL systems that will enable a deeper study of all the parts of such a system. This taxonomy will allow us to make a better comparison between different approaches, as well as highlight specific challenges in each step of the process, thus enabling researchers to better strategise their quest towards the “holy grail” of database accessibility.
G7
Graph Analytics IV
Chair: Laks V.S. Lakshmanan (University of British Columbia)
MiniGraph: Querying Big Graphs with a Single Machine
Xiaoke Zhu (Beihang University); Yang Liu (Beihang University); Shuhao Liu (Shenzhen Institute of Computing Sciences)*; Wenfei Fan (University of Edinburgh)
This paper presents MiniGraph, an out-of-core system for querying big graphs with a single machine. As opposed to previous single-machine graph systems, MiniGraph proposes a pipelined architecture to overlap I/O and CPU operations, and improves multi-core parallelism. It also introduces a hybrid model to support both vertex-centric and graph-centric parallel computations, to simplify parallel graph programming, speed up beyond-neighborhood computations, and parallelize computations within each subgraph. The model induces a two-level parallel execution model to explore both inter-subgraph and intra-subgraph parallelism. Moreover, MiniGraph develops new optimization techniques under its architecture. Using real-life graphs of different types, we show that MiniGraph is up to 76.1x faster than prior out-of-core systems, and performs better than some multi-machine systems that use up to 12 machines.
Parallel Colorful h-star Core Maintenance in Dynamic Graphs
Sen Gao (National University of Singapore)*; Hongchao Qin (Beijing Institute of Technology); Ronghua Li (Beijing Institute of Technology); Bingsheng He (National University of Singapore)
Higher-order cohesive subgraph mining is an important operator in many graph analysis tasks. Recently, the colorful h-star core model has been proposed as an effective alternative to h-clique based cohesive subgraph models, in consideration of both efficiency and utility in many practical applications. The existing peeling algorithms for colorful h-star core decomposition iteratively delete a node with the minimum colorful h-star degree. Hence, these methods are inherently sequential and suffer from two limitations: low parallelism and inefficiency for dynamic graphs. To enable high-performance colorful h-star core decomposition in large-scale graphs, we propose highly parallelizable local algorithms based on a novel concept of colorful h-star n-order H-index and conduct thorough analyses of its properties. Moreover, three optimizations have been developed to further improve the convergence performance. Based on our local algorithm and its optimized variants, we can efficiently maintain colorful h-star cores in dynamic graphs. Furthermore, we design lower and upper bounds for core numbers to facilitate identifying unaffected nodes in the presence of graph updates. Extensive experiments conducted on 14 large real-world datasets with billions of edges demonstrate that our proposed algorithms achieve a 10 times faster convergence speed and a three orders of magnitude speedup when handling graph changes.
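The idea of a local, H-index-based alternative to peeling can be illustrated on the ordinary k-core, where repeatedly replacing each node's value by the H-index of its neighbors' values, starting from degrees, is known to converge to the core numbers. The sketch below covers only this plain-degree analogue; the colorful h-star degree, the n-order H-index variant, and the optimizations from the abstract are not modeled, and the function names are hypothetical.

```python
def h_index(values):
    """Largest h such that at least h of the values are >= h."""
    ordered = sorted(values, reverse=True)
    h = 0
    for rank, value in enumerate(ordered, start=1):
        if value >= rank:
            h = rank
        else:
            break
    return h


def core_numbers_by_h_index(adj):
    """k-core numbers via synchronous H-index iteration (no peeling).

    adj maps each node to its neighbor list in an undirected simple graph.
    Every node updates from its neighbors' previous values only, so each
    round is independent across nodes and could run in parallel.
    """
    estimate = {v: len(neighbors) for v, neighbors in adj.items()}
    while True:
        updated = {v: h_index(estimate[u] for u in adj[v]) for v in adj}
        if updated == estimate:
            return estimate
        estimate = updated


# Triangle 0-1-2 with a pendant node 3 attached to node 0.
triangle_with_tail = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
```

On `triangle_with_tail`, the triangle nodes converge to core number 2 and the pendant node to 1, matching what sequential peeling would produce.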
MITra: A Framework for Multi-Instance Graph Traversal
Jia Li (Edinburgh Research Center, Central Software Institute, Huawei); Wenyue Zhao (University of Edinburgh); Nikos Ntarmos (Edinburgh Research Center, Central Software Institute, Huawei); Yang Cao (University of Edinburgh)*; Peter Buneman (The University of Edinburgh)
This paper presents MITra, a framework for composing multi-instance graph algorithms that traverse from multiple source vertices simultaneously over a single thread. Underlying MITra is an abstraction that expresses traversal logic via arithmetic operations over a numeric runtime property called vertex ranks, and separates it from computation logic. Based on this, MITra implements an interface that allows users to express traversals by declaring vertex ranks and specify computation logic via an edge function. It synthesizes multi-instance traversal algorithms from declared vertex ranks and edge functions adopted from classic single-instance algorithms, automatically sharing computation across instances and benefiting from SIMD. We show that MITra can generate multi-instance algorithms provably better than existing ones, while being more expressive than traditional frameworks. In addition to the ease of programming, we experimentally verify that MITra is on average an order of magnitude faster than approaches based on existing frameworks for common graph algorithms, and is comparable to the state-of-the-art highly optimized one-off algorithms.
Sage: A System for Uncertain Network Analysis
Eunjae Lee (UNIST); Sam H. Noh (UNIST); Jiwon Seo (Hanyang University)
We propose Sage, a system for uncertain network analysis. Algorithms for uncertain network analysis require large amounts of memory and computing resources as they sample a large number of network instances and run analysis on them. Sage makes uncertain network analysis simple and efficient. By extending the edge-centric programming model, Sage makes writing sampling-based analysis algorithms as simple as writing conventional graph algorithms in Pregel-like systems. Moreover, Sage proposes four optimization techniques, namely, deterministic sampling, hybrid gathering, schedule-aware caching, and copy-on-write attributes, that exploit common properties of uncertain network analysis. Extensive evaluation of Sage with eight algorithms on six real-world networks shows that the four optimizations in Sage jointly improve performance by up to 13.9× and on average 2.7×.
G8
Community Search in Graphs
Chair: Yixiang Fang (University of Hong Kong, Shenzhen)
Influential Community Search over Large Heterogeneous Information Networks
Yingli Zhou (The Chinese University of Hong Kong, Shenzhen)*; Yixiang Fang (The Chinese University of Hong Kong, Shenzhen); Wensheng Luo (School of Data Science, The Chinese University of Hong Kong, Shenzhen); Yunming Ye (Harbin Institute of Technology Shenzhen Graduate School)
Recently, the topic of influential community search has gained much attention. Given a graph, it aims to find communities of vertices with high importance values from it. Existing works mainly focus on conventional homogeneous networks, where vertices are of the same type. Thus, they cannot be applied to heterogeneous information networks (HINs) like bibliographic networks and knowledge graphs, where vertices are of multiple types and their importance values are of heterogeneity (i.e., for vertices of different types, their importance meanings are also different). In this paper, we study the problem of influential community search over large HINs. We introduce a novel community model, called heterogeneous influential community (HIC), or a set of closely connected vertices that are of the same type and high importance values, using the meta-path-based core model. An HIC not only captures the importance of vertices in a community, but also considers the influence on meta-paths connecting them. To search the HICs, we mainly consider meta-paths with two and three vertex types. Then, we develop basic algorithms by iteratively peeling vertices with low importance values, and further propose advanced algorithms by identifying the key vertices and designing pruning strategies that allow us to quickly eliminate vertices with low importance values. Extensive experiments on four real large HINs show that our solutions are effective for searching HICs, and the advanced algorithms significantly outperform baselines.
Maximal D-truss Search in Dynamic Directed Graphs
Anxin Tian (Hong Kong University of Science and Technology)*; Alexander Zhou (Hong Kong University of Science and Technology); Yue Wang (Shenzhen Institute of Computing Sciences); Lei Chen (Hong Kong University of Science and Technology)
Community search (CS) aims at personalized subgraph discovery, which is the key to understanding the organisation of many real-world networks. CS in undirected networks has attracted significant attention from researchers, including many solutions for various cohesive subgraph structures and for different levels of dynamism with edge insertions and deletions, whereas these aspects have been much less considered for directed graphs. In this paper, we propose incremental solutions of CS based on the D-truss in dynamic directed graphs, where the D-truss is a cohesive subgraph structure defined based on two types of triangles in directed graphs. We first analyze the theoretical boundedness of D-truss given edge insertions and deletions, then we present basic single-update algorithms. To improve the efficiency, we propose an order-based D-Index, associated batch-update algorithms and a fully-dynamic query algorithm. Our extensive experiments on real-world graphs show that our proposed solution achieves a significant speedup compared to the SOTA solution; its scalability over updates is also verified.
Neighborhood-based Hypergraph Core Decomposition
Naheed Anjum Arafat (Nanyang Technological University)*; Arijit Khan (Aalborg University); Arpit Kumar Rai (Indian Institute of Technology, Kanpur); Bishwamittra Ghosh (National University of Singapore)
We propose neighborhood-based core decomposition: a novel way of decomposing hypergraphs into hierarchical neighborhood-cohesive subhypergraphs. Alternative approaches to decomposing hypergraphs, e.g., reduction to clique or bipartite graphs, are not meaningful in certain applications, and the latter also results in inefficient decomposition, while existing degree-based hypergraph decomposition does not distinguish nodes with different neighborhood sizes. Our case studies show that the proposed decomposition is more effective than degree and clique graph-based decompositions in disease intervention and in extracting provably approximate and application-wise meaningful densest subhypergraphs. We propose three algorithms: Peel, its efficient variant E-Peel, and a novel local algorithm: Local-core with parallel implementation. Our most efficient parallel algorithm Local-core(P) decomposes a hypergraph with 27M nodes and 17M hyperedges in-memory within 91 seconds by adopting various optimizations. Finally, we develop a new hypergraph-core model, the (neighborhood, degree)-core by considering both neighborhood and degree constraints, design its decomposition algorithm Local-core+Peel, and demonstrate its superiority in spreading diffusion.
H1
Data Discovery & Integration
Chair: Oktie Hassanzadeh (IBM Research)
Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning
Grace Fan (Northeastern University)*; Jin Wang (Megagon Labs); Yuliang Li (Megagon Labs); Dan Zhang (Megagon Labs); Renée J. Miller (Northeastern University)
Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes (with table union search as the main use case). Our proposed framework features a contrastive learning method to train column encoders from pre-trained language models in a fully unsupervised manner. The column encoder of Starmie captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy. We utilize the cosine similarity between column embedding vectors as the column unionability score and propose a filter-and-verification framework that allows exploring a variety of design choices to compute the unionability score between two tables accordingly. Empirical results on real table benchmarks show that Starmie outperforms the best-known solutions in the effectiveness of table union search by 6.8 in MAP and recall. Moreover, Starmie is the first to employ the HNSW (Hierarchical Navigable Small World) index to accelerate query processing of table union search which provides a 3,000X performance gain over the linear scan baseline and a 400X performance gain over an LSH index (the state-of-the-art solution for data lake indexing).
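The column-level score mentioned above, cosine similarity between column embedding vectors, can be sketched in a few lines. The embeddings here are hand-written toy vectors rather than encoder outputs, and `table_unionability` shows just one hypothetical way to aggregate column scores into a table score (the mean of each query column's best match); the abstract leaves that design choice open, and this is not Starmie's code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_u * norm_v)


def table_unionability(query_columns, candidate_columns):
    """Hypothetical table score: mean best-match cosine per query column."""
    if not query_columns or not candidate_columns:
        return 0.0
    best = [
        max(cosine_similarity(q, c) for c in candidate_columns)
        for q in query_columns
    ]
    return sum(best) / len(best)


# Toy two-dimensional "embeddings" standing in for encoder outputs.
query_table = [[1.0, 0.0], [0.0, 1.0]]
candidate_table = [[1.0, 0.0]]
```

For the toy tables above, one query column finds a perfect match and the other finds none, so the aggregate score is 0.5 under this particular aggregation.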
DeepJoin: Joinable Table Discovery with Pre-trained Language Models
Yuyang Dong (NEC corporation)*; Chuan Xiao (Osaka University, Nagoya University); Takuma Nozawa (NEC); Masafumi Enomoto (NEC); Masafumi Oyamada (NEC)
Due to the usefulness in data enrichment for data analysis tasks, joinable table discovery has become an important operation in data lake management. Existing approaches target equi-joins, the most common way of combining tables for creating a unified view, or semantic joins, which tolerate misspellings and different formats to deliver more join results. They are either exact solutions whose running time is linear in the sizes of query column and target table repository, or approximate solutions lacking precision. In this paper, we propose DeepJoin, a deep learning model for accurate and efficient joinable table discovery. Our solution is an embedding-based retrieval, which employs a pre-trained language model (PLM) and is designed as one framework serving both equi- and semantic joins. We propose a set of contextualization options to transform column contents to a text sequence. The PLM reads the sequence and is fine-tuned to embed columns to vectors such that columns are expected to be joinable if they are close to each other in the vector space. Since the output of the PLM is fixed in length, the subsequent search procedure becomes independent of the column size. With a state-of-the-art approximate nearest neighbor search algorithm, the search time is sublinear in the repository size. To train the model, we devise the techniques for preparing training data as well as data augmentation. The experiments on real datasets demonstrate that by training on a small subset of a corpus, DeepJoin generalizes to large datasets and its precision consistently outperforms other approximate solutions’. DeepJoin is even more accurate than an exact solution to semantic joins when evaluated with labels from experts. Moreover, when equipped with a GPU, DeepJoin is up to two orders of magnitude faster than existing solutions.
Effective Entity Augmentation By Querying External Data Sources
Christopher Buss (Oregon State University)*; Jasmin Mousavi (Oregon State University); Mikhail Tokarev (Oregon State University); Arash Termehchy (Oregon State University); David Maier (Portland State University); Stefan Lee (Oregon State University)
Users often want to augment and enrich entities in their datasets with relevant information from external data sources. As many external sources are accessible only via keyword-search interfaces, a user usually has to manually formulate a keyword query that extracts relevant information for each entity. This approach is challenging as many data sources contain numerous tuples, only a small fraction of which may contain entity-relevant information. Furthermore, different datasets may represent the same information in distinct forms and under different terms (e.g., different data sources may use different names to refer to the same person). In such cases, it is difficult to formulate a query that precisely retrieves information relevant to an entity. Current methods for information enrichment mainly rely on lengthy and resource-intensive manual effort to formulate queries to discover relevant information. However, in increasingly many settings, it is important for users to get initial answers quickly and without substantial investment in resources (such as human attention). We propose a progressive approach to discovering entity-relevant information from external sources with minimal expert intervention. It leverages end users' feedback to progressively learn how to retrieve information relevant to each entity in a dataset from external data sources. Our empirical evaluation shows that our approach learns accurate strategies to deliver relevant information quickly.
Integrating Data Lake Tables
Aamod Khatiwada (Northeastern University)*; Roee Shraga (Northeastern University); Wolfgang Gatterbauer (Northeastern University); Renée J. Miller (Northeastern University)
We have made tremendous strides in providing tools for data scientists to discover new tables useful for their analyses. But despite these advances, the proper integration of discovered tables has been under-explored. An interesting semantics for integration, called Full Disjunction, was proposed in the 1980’s, but there has been little progress in using it for data science to integrate tables culled from data lakes. We provide ALITE, the first proposal for scalable integration of tables that may have been discovered using join, union or related table search. We empirically show that ALITE can outperform previous algorithms for computing the Full Disjunction. ALITE relaxes previous assumptions that tables share common attribute names (which completely determine the join columns), are complete (without null values), and have acyclic join patterns. To evaluate ALITE, we develop and share three new benchmarks for integration that use real data lake tables.
VersaMatch: Ontology Matching with Weak Supervision
Jonathan Fürst (ZHAW)*; Mauricio Fadel Argerich (NEC Laboratories Europe); Bin Cheng (NEC Laboratories Europe)
Ontology matching is crucial to data integration for across-silo data sharing and has been mainly addressed with heuristic and machine learning (ML) methods. While heuristic methods are often inflexible and hard to extend to new domains, ML methods rely on substantial and hard to obtain amounts of labeled training data. To overcome these limitations, we propose VersaMatch, a flexible, weakly-supervised ontology matching system. VersaMatch employs various weak supervision sources, such as heuristic rules, pattern matching, and external knowledge bases, to produce labels from a large amount of unlabeled data for training a discriminative ML model. For prediction, VersaMatch develops a novel ensemble model combining the weak supervision sources with the discriminative model to support generalization while retaining a high precision. Our ensemble method boosts end model performance by 4 points compared to a traditional weak-supervision baseline. In addition, compared to state-of-the-art ontology matchers, VersaMatch achieves an overall 4-point performance improvement in F1 score across 26 ontology combinations from different domains. For recently released, in-the-wild datasets, VersaMatch beats the next best matchers by 9 points in F1. Furthermore, its core weak-supervision logic can easily be improved by adding more knowledge sources and collecting more unlabeled data for training.
H2
Private Retrieval & Secure Execution I
Chair: Mostafa Milani (University of Western Ontario)
Secure Shapley Value for Cross-Silo Federated Learning
Shuyuan Zheng (Kyoto University)*; Yang Cao (Hokkaido University); Masatoshi Yoshikawa (Osaka Seikei University)
The Shapley value (SV) is a fair and principled metric for contribution evaluation in cross-silo federated learning (cross-silo FL), wherein organizations, i.e., clients, collaboratively train prediction models with the coordination of a parameter server. However, existing SV calculation methods for FL assume that the server can access the raw FL models and public test data. This may not be a valid assumption in practice considering the emerging privacy attacks on FL models and the fact that test data might be clients’ private assets. Hence, we investigate the problem of secure SV calculation for cross-silo FL. We first propose HESV, a one-server solution based solely on homomorphic encryption (HE) for privacy protection, which has limitations in efficiency. To overcome these limitations, we propose SecSV, an efficient two-server protocol with the following novel features. First, SecSV utilizes a hybrid privacy protection scheme to avoid ciphertext–ciphertext multiplications between test data and models, which are extremely expensive under HE. Second, an efficient secure matrix multiplication method is proposed for SecSV. Third, SecSV strategically identifies and skips some test samples without significantly affecting the evaluation accuracy. Our experiments demonstrate that SecSV is 7.2-36.6× as fast as HESV, with a limited loss in the accuracy of calculated SVs.
Olive: Oblivious Federated Learning on Trusted Execution Environment Against the Risk of Sparsification
Fumiyuki Kato (Kyoto University)*; Yang Cao (Hokkaido University); Masatoshi Yoshikawa (Osaka Seikei University)
Combining Federated Learning (FL) with a Trusted Execution Environment (TEE) is a promising approach for realizing privacy-preserving FL, which has garnered significant academic attention in recent years. Implementing the TEE on the server side enables each round of FL to proceed without exposing the client’s gradient information to untrusted servers. This addresses usability gaps in existing secure aggregation schemes as well as utility gaps in differentially private FL. However, to address the issue using a TEE, the vulnerabilities of server-side TEEs need to be considered—this has not been sufficiently investigated in the context of FL. The main technical contribution of this study is the analysis of the vulnerabilities of TEE in FL and the defense. First, we theoretically analyze the leakage of memory access patterns, revealing the risk of sparsified gradients, which are commonly used in FL to enhance communication efficiency and model accuracy. Second, we devise an inference attack to link memory access patterns to sensitive information in the training dataset. Finally, we propose an oblivious yet efficient aggregation algorithm to prevent memory access pattern leakage. Our experiments on real-world data demonstrate that the proposed method functions efficiently in practical scales.
L2chain: Towards High-performance, Confidential and Secure Layer-2 Blockchain Solution for Decentralized Applications
Zihuan Xu (Hong Kong University of Science and Technology); Lei Chen (Hong Kong University of Science and Technology)*
With the rapid development of blockchain, the concept of decentralized applications (DApps), built upon smart contracts, has attracted much attention in academia and industry. However, significant issues w.r.t. system throughput, transaction confidentiality, and the security guarantee of the DApp transaction execution and order correctness hinder the broader adoption of blockchain DApps.
To address these issues, we propose L2chain, a novel blockchain framework aiming to scale the system through a layer-2 network where DApps process transactions in the layer-2 network and only the system state digest, acting as the state integrity proof, is maintained on-chain. To achieve high performance, we introduce the split-execute-merge (SEM) transaction processing workflow with the help of the RSA accumulator, allowing DApps to lock and update a part of the state digest in parallel. We also design a witness cache mechanism for DApp executors to reduce the transaction processing latency. To fulfill confidentiality, we leverage the trusted execution environment (TEE) for DApps to execute encrypted transactions off-chain. To ensure transaction execution and order correctness, we propose a two-step execution process for DApps to prevent attacks (i.e., rollback attacks) from subverting the state transition. Extensive experiments have demonstrated that L2chain can achieve 1.5X to 42.2X and 7.1X to 8.9X throughput improvements in permissioned and permissionless settings, respectively.
SPG: Structure-Private Graph Database via SqueezePIR
Ling Liang (UCSB)*; Jilan Lin (UCSB); Zheng Qu (University of California at Santa Barbara); Ishtiyaque Ahmad (University of California at Santa Barbara); Fengbin Tu (UCSB); Trinabh Gupta (UCSB); Yufei Ding (University of California at Santa Barbara); Yuan Xie (University of California at Santa Barbara)
Much relational data in daily life is represented as graphs, making graph applications an important workload. Because graph datasets are large, moving graph data to the cloud has become a popular option. To keep confidential and private graphs secure from an untrusted cloud server, many cryptographic techniques are used to hide the content of the data. However, protecting only the data content is not enough for a graph database, because the structural information of the graph can be revealed through the database access trace.
In this work, we study the graph neural network (GNN), an important graph workload for mining information from a graph database. We find that the server is able to infer which node is being processed during the edge retrieval phase and also to learn its neighbor indices during the GNN's aggregation phase. This leaks the structure of the graph. We present SPG, a structure-private graph database with SqueezePIR. SPG is built on top of Private Information Retrieval (PIR), which securely hides which nodes/neighbors are accessed. In addition, we propose SqueezePIR, a compression technique to overcome the computation overhead of PIR. Based on our evaluation, SqueezePIR achieves an 11.85× speedup on average with less than 2% accuracy loss when compared to the state-of-the-art FastPIR protocol.
H3
Private Retrieval & Secure Execution II
Chair: Miti Mazmudar (University of Waterloo)
Pantheon: Private Retrieval from Public Key-Value Store
Ishtiyaque Ahmad (University of California at Santa Barbara)*; Divyakant Agrawal (University of California at Santa Barbara); Amr El Abbadi (UC Santa Barbara); Trinabh Gupta (UCSB)
Consider a cloud server that owns a key-value store and provides a private query service to its clients. Preserving client privacy in this setting is difficult because the key-value store is public, and a client cannot encrypt or modify it. Therefore, privacy in this context implies hiding the access pattern of a client. Pantheon is a system that cryptographically allows a client to retrieve the value corresponding to a key from a public key-value store without allowing the server or any adversary to know any information about the key or value accessed. Pantheon devises a single-round retrieval protocol which reduces server-side latency by refining its cryptographic machinery and massively parallelizing the query execution workload. Using these novel techniques, Pantheon achieves a 93× improvement for server-side latency over a state-of-the-art solution.
Information-Theoretically Secure and Highly Efficient Search and Row Retrieval
Shantanu Sharma (New Jersey Institute of Technology)*; Yin Li (Dongguan University of Technology); Sharad Mehrotra (U.C. Irvine); Nisha Panwar (Augusta University); Komal Kumari (New Jersey Institute of Technology); Swagnik Roychoudhury (New York University)
Information-theoretic or unconditional security provides the highest level of security—independent of the computational capability of an adversary. Secret-sharing techniques achieve information-theoretic security by splitting a secret into multiple parts (called shares) and storing the shares across non-colluding servers. However, secret-sharing-based solutions suffer from high overheads due to multiple communication rounds among servers and/or information leakage due to access-patterns (i.e., the identity of rows satisfying a query) and volume (i.e., the number of rows satisfying a query).
We propose 𝑆2, an information-theoretically secure approach that uses both additive and multiplicative secret-sharing, to efficiently support a large class of selection queries involving conjunctive, disjunctive, and range conditions. Two major contributions of 𝑆2 are: (i) a new search algorithm using additive shares based on fingerprints, which were developed for string-matching over cleartext; and (ii) two row retrieval algorithms: one is based on multiplicative shares and another is based on additive shares. 𝑆2 does not require communication among servers storing shares and does not reveal any information to an adversary based on access-patterns and volume.
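The additive secret-sharing idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the search or row-retrieval algorithm of the paper; it only shows the general principle that a secret split into random additive shares can be recombined, and that servers can add shares locally without communicating. The modulus `PRIME` and the helper names `share`, `reconstruct`, and `add_shares` are assumptions chosen for illustration.

```python
import secrets

# Illustrative field modulus (an assumption); 2**61 - 1 is a Mersenne prime.
PRIME = 2**61 - 1


def share(secret: int, n_servers: int) -> list[int]:
    """Split `secret` into n additive shares that sum to it modulo PRIME."""
    # The first n-1 shares are uniformly random; the last one fixes the sum.
    shares = [secrets.randbelow(PRIME) for _ in range(n_servers - 1)]
    last = (secret - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares: list[int]) -> int:
    """Recombine all shares; any proper subset alone looks uniformly random."""
    return sum(shares) % PRIME


def add_shares(a: list[int], b: list[int]) -> list[int]:
    """Each server adds its two shares locally, with no server-to-server messages."""
    return [(x + y) % PRIME for x, y in zip(a, b)]


# Usage: three non-colluding servers each hold one share of 10 and one of 32.
shares_of_sum = add_shares(share(10, 3), share(32, 3))
total = reconstruct(shares_of_sum)  # the client recovers 10 + 32
```

Because addition is performed share by share, the servers never see either input, which is why additive sharing suits aggregation without communication rounds.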
ZKSQL: Verifiable and Efficient Query Evaluation with Zero-Knowledge Proofs
Xiling Li (Northwestern University)*; Chenkai Weng (Northwestern University); Yongxin Xu (Northwestern University); Xiao Wang (Northwestern University); Jennie Rogers (Northwestern University)
Individuals and organizations are using databases to store personal information at an unprecedented rate. This creates a quandary for data providers. On the one hand, they are responsible for protecting the privacy of individuals described in their database. On the other hand, data providers are sometimes required to provide statistics about their data instead of sharing it wholesale, with strong assurances that these answers are correct and complete, such as in regulatory filings for the US SEC and other government organizations.
We introduce a system, ZKSQL, that provides authenticated answers to ad-hoc SQL queries with zero-knowledge proofs. Its proofs show that the answers are correct and sound with respect to the database's contents and they do not divulge any information about its input records. This system constructs proofs over the steps in a query's evaluation and it accelerates this process with authenticated set operations. We validate the efficiency of this approach over a suite of TPC-H queries and our results show that ZKSQL achieves two orders of magnitude speedup over the baseline.
Cracking-Like Join for Trusted Execution Environments
Kajetan Maliszewski (TU Berlin)*; Jorge-Arnulfo Quiané-Ruiz (IT University of Copenhagen); Volker Markl (Technische Universität Berlin)
Data processing on non-trusted infrastructures, such as the public cloud, has become increasingly popular, despite posing risks to data privacy. However, the existing cloud DBMSs either lack sufficient privacy guarantees or underperform. In this paper, we address both challenges (privacy and efficiency) by proposing CrkJoin, a join algorithm that leverages Trusted Execution Environments (TEE). We adapted CrkJoin to the architecture of TEEs to achieve significant improvements in latency of up to three orders of magnitude over baselines in a multi-tenant scenario. Moreover, CrkJoin offers at least 2.9x higher throughput than the state-of-the-art algorithms. Our research is unique in that it focuses on both privacy and efficiency concerns, a combination that has not been adequately addressed in previous studies. Our findings suggest that CrkJoin makes joining in TEEs practical, and it lays a foundation towards a truly private and efficient cloud DBMS.
Enabling Secure and Efficient Data Analytics Pipeline Evolution with Trusted Execution Environment
Haotian Gao (National University of Singapore); Cong Yue (National University of Singapore); Tien Tuan Anh Dinh (Deakin University); Zhiyong Huang (NUS School of Computing); Beng Chin Ooi (NUS)*
Modern data analytics pipelines are highly dynamic, as they are constantly monitored and fine-tuned by both data engineers and scientists. Recent systems managing pipelines ease creating, deploying, and tracking their evolution. However, privacy concerns emerge as many of them are deployed on the public cloud with little or no trust. Unfortunately, the unique nature of pipelines prevents the adoption of existing confidential computing techniques with different computational patterns and large performance overhead. As a potential approach, trusted execution environments (TEEs) are efficient in protecting the confidentiality and integrity of data and computation. However, fast-changing pipelines with latency requirements bring the challenge of reducing the cold start overhead - the main bottleneck in the latest TEE. To support end-to-end private pipeline evolution, we present SecCask, a TEE-based data analytics pipeline management system. SecCask overcomes the problems of a naive design that isolates complete pipeline execution in one enclave by administering enclaves and runtimes. To reduce cold start overheads, our approach consists of reusing trusted runtimes for different pipeline components and caching them to avoid the cost of initialization. We leverage the latest Intel SGX to conduct experiments on representative workloads. The results demonstrate that SecCask reduces the total execution time by 68.4% compared to not reusing, is faster than running all components in one enclave, and incurs a modest average performance overhead of 29.9% over insecure baselines.
H4
Blockchains
Chair: Sujaya Maiyya (University of Waterloo)
GlassDB: An Efficient Verifiable Ledger Database System Through Transparency
Cong Yue (National University of Singapore); Tien Tuan Anh Dinh (Deakin University); Zhongle Xie (National University of Singapore); Meihui Zhang (Beijing Institute of Technology); Gang Chen (Zhejiang University); Beng Chin Ooi (NUS)*; Xiaokui Xiao (National University of Singapore)
Verifiable ledger databases protect data history against malicious tampering. Existing systems, such as blockchains and certificate transparency, are based on transparency logs — a simple abstraction allowing users to verify that a log maintained by an untrusted server is append-only. They expose a simple key-value interface. Building a practical database from transparency logs, on the other hand, remains a challenge.
In this paper, we explore the design space of verifiable ledger databases along three dimensions: abstraction, threat model, and performance. We survey existing systems and identify their two limitations, namely, the lack of transaction support and the inferior efficiency. We then present GlassDB, a distributed database system that addresses these limitations under a practical threat model. GlassDB inherits the verifiability of transparency logs, but supports transactions and offers high performance. It extends a ledger-like key-value store with a data structure for efficient proofs, and adds a concurrency control mechanism for transactions. GlassDB batches independent operations from concurrent transactions when updating the core data structures. In addition, we design a new benchmark for evaluating verifiable ledger databases, by extending YCSB and TPC-C benchmarks. Using this benchmark, we compare GlassDB against four baselines: reimplemented versions of three verifiable databases, and a verifiable map backed by a transparency log. Experimental results demonstrate that GlassDB is an efficient, transactional, and verifiable ledger database system.
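The append-only property that transparency logs let users verify can be sketched with a simple hash chain. This is a deliberately simplified illustration and an assumption on our part: deployed transparency logs typically use Merkle trees with logarithmic-size consistency proofs, and nothing here reflects GlassDB's actual data structures. The function names `chain_head` and `verify_append_only` are hypothetical.

```python
import hashlib

GENESIS = b"\x00" * 32  # fixed starting digest for an empty log


def chain_head(entries: list[bytes], start: bytes = GENESIS) -> bytes:
    """Fold entries into a running SHA-256 digest, starting from `start`."""
    head = start
    for entry in entries:
        # `head` is always 32 bytes, so head + entry parses unambiguously.
        head = hashlib.sha256(head + entry).digest()
    return head


def verify_append_only(old_head: bytes, suffix: list[bytes], new_head: bytes) -> bool:
    """Client-side check: the new head must extend the head seen earlier."""
    return chain_head(suffix, start=old_head) == new_head


# Usage: a client remembers the head after two entries, then audits a third.
log = [b"put k1=v1", b"put k2=v2"]
seen_head = chain_head(log)
honest_head = chain_head(log + [b"put k3=v3"])
tampered_head = chain_head([b"put k1=v1", b"put k2=EVIL", b"put k3=v3"])

ok = verify_append_only(seen_head, [b"put k3=v3"], honest_head)
caught = not verify_append_only(seen_head, [b"put k3=v3"], tampered_head)
```

The check costs time linear in the number of new entries, which is one reason practical systems prefer tree-based proofs.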
FlexChain: An Elastic Disaggregated Blockchain
Chenyuan Wu (University of Pennsylvania)*; Mohammad Javad Amiri (University of Pennsylvania); Jared Asch (University of Pennsylvania); Heena Nagda (University of Pennsylvania); Qizhen Zhang (University of Pennsylvania); Boon Thau Loo (University of Pennsylvania)
While permissioned blockchains enable a family of data center applications, existing systems suffer from imbalanced loads across compute and memory, exacerbating the underutilization of cloud resources. This paper presents FlexChain, a novel permissioned blockchain system that addresses this challenge by physically disaggregating CPUs, DRAM, and storage devices to process different blockchain workloads efficiently. Disaggregation allows blockchain service providers to upgrade and expand hardware resources independently to support a wide range of smart contracts with diverse CPU and memory demands. Moreover, it ensures efficient resource utilization and hence prevents resource fragmentation in a data center. We have explored the design of XOV blockchain systems in a disaggregated fashion and developed a tiered key-value store that can elastically scale its memory and storage. Our design significantly speeds up the execution stage. We have also leveraged several techniques to parallelize the validation stage in FlexChain to further improve the overall blockchain performance. Our evaluation results show that FlexChain can provide independent compute and memory scalability, while incurring at most 12.8% disaggregation overhead. FlexChain achieves almost identical throughput as the state-of-the-art distributed approaches with significantly lower memory and CPU consumption for compute-intensive and memory-intensive workloads respectively.
GriDB: Scaling Blockchain Database via Sharding and Off-Chain Cross-Shard Mechanism
Zicong Hong (The Hong Kong Polytechnic University)*; Song Guo (The Hong Kong Polytechnic University); Enyuan Zhou (The Hong Kong Polytechnic University); Wuhui Chen (Sun Yat-sen University); Huawei Huang (Sun Yat-sen University); Albert Zomaya (The University of Sydney)
Blockchain databases have attracted widespread attention but suffer from poor scalability due to underlying non-scalable blockchains. While blockchain sharding is necessary for a scalable blockchain database, it poses a new challenge named on-chain cross-shard database services. Each cross-shard database service (e.g., cross-shard queries or inter-shard load balancing) involves massive cross-shard data exchanges, while the existing cross-shard mechanisms need to process each cross-shard data exchange via the consensus of all nodes in the related shards (i.e., on-chain) to resist a Byzantine environment of blockchain, which eliminates sharding benefits.
To tackle the challenge, this paper presents GriDB, the first scalable blockchain database, by designing a novel off-chain cross-shard mechanism for efficient cross-shard database services. Borrowing the idea of off-chain payments, GriDB delegates massive cross-shard data exchange to a few nodes, each of which is randomly picked from a different shard. Considering the Byzantine environment, the untrusted delegates cooperate to generate succinct proof for cross-shard data exchanges, while the consensus is only responsible for the low-cost proof verification. However, different from payments, the database services' verification has more requirements (e.g., completeness, correctness, freshness, and availability); thus, we introduce several new authenticated data structures (ADS). Particularly, we utilize consensus to extend the threat model and reduce the complexity of traditional accumulator-based ADS for verifiable cross-shard queries with a rich set of relational operators. Moreover, we study the necessity of inter-shard load balancing for a scalable blockchain database and design an off-chain and live approach for both efficiency and availability during balancing. An evaluation of our prototype shows the performance of GriDB in terms of scalability in workloads with queries and updates.
AdaChain: A Learned Adaptive Blockchain
Chenyuan Wu (University of Pennsylvania)*; Bhavana Mehta (University of Pennsylvania); Mohammad Javad Amiri (University of Pennsylvania); Ryan Marcus (University of Pennsylvania); Boon Thau Loo (University of Pennsylvania)
This paper presents AdaChain, a learning-based blockchain framework that adaptively chooses the best permissioned blockchain architecture to optimize effective throughput for dynamic transaction workloads. AdaChain addresses the challenge in Blockchain-as-a-Service (BaaS) environments, where a large variety of possible smart contracts are deployed with different workload characteristics. AdaChain supports automatically adapting to an underlying, dynamically changing workload through the use of reinforcement learning. When a promising architecture is identified, AdaChain switches from the current architecture to the promising one at runtime in a secure and correct manner. Experimentally, we show that AdaChain can converge quickly to optimal architectures under changing workloads and significantly outperform fixed architectures in terms of the number of successfully committed transactions, all while incurring low additional overhead.
H6
Differential Privacy I
Chair: Xi He (University of Waterloo)
Answering Private Linear Queries Adaptively using the Common Mechanism
Yingtai Xiao (Pennsylvania State University)*; Guanhong Wang (University of Maryland); Danfeng Zhang (Penn State); Daniel Kifer (Penn State)
When analyzing confidential data through a privacy filter, a data scientist often needs to decide which queries will best support their intended analysis. For example, an analyst may wish to study noisy two-way marginals in a dataset produced by a mechanism M1. But, if the data are relatively sparse, the analyst may choose to examine noisy one-way marginals, produced by a mechanism M2 instead. Since the choice of whether to use M1 or M2 is data-dependent, a typical differentially private workflow is to first split the privacy loss budget rho into two parts: rho1 and rho2, then use the first part rho1 to determine which mechanism to use, and the remainder rho2 to obtain noisy answers from the chosen mechanism. In a sense, the first step seems wasteful because it takes away part of the privacy loss budget that could have been used to make the query answers more accurate.
In this paper, we consider the question of whether the choice between M1 and M2 can be performed without wasting any privacy loss budget. For linear queries, we propose a method for decomposing M1 and M2 into three parts: (1) a mechanism M* that captures their shared information, (2) a mechanism M1' that captures information that is specific to M1, (3) a mechanism M2' that captures information that is specific to M2. Running M* and M1' together is completely equivalent to running M1 (both in terms of query answer accuracy and total privacy cost rho). Similarly, running M* and M2' together is completely equivalent to running M2.
Since M* will be used no matter what, the analyst can use its output to decide whether to subsequently run M1' (thus recreating the analysis supported by M1) or M2' (recreating the analysis supported by M2), without wasting privacy loss budget.
Epistemic Parity: Reproducibility as an Evaluation Metric for Differential Privacy [eab]
Lucas Rosenblatt (New York University)*; Bernease Herman (University of Washington); Anastasia Holovenko (Ukrainian Catholic University); Wonkwon Lee (New York University); Joshua Loftus (London School of Economics); Elizabeth McKinnie (Microsoft); Taras Rumezhak (Ukrainian Catholic University); Andrii Stadnik (Ukrainian Catholic University); Bill Howe (University of Washington); Julia Stoyanovich (New York University)
Differential privacy (DP) data synthesizers are increasingly proposed to afford public release of sensitive information, offering theoretical guarantees for privacy (and, in some cases, utility), but limited empirical evidence of utility in practical settings. Utility is typically measured as the error on representative proxy tasks, such as descriptive statistics, multivariate correlations, the accuracy of trained classifiers, or performance over a query workload. The ability for these results to generalize to practitioners' experience has been questioned in a number of settings, including the U.S. Census. In this paper, we propose an evaluation methodology for synthetic data that avoids assumptions about the representativeness of proxy tasks, instead measuring the likelihood that published conclusions would change had the authors used synthetic data, a condition we call epistemic parity. Our methodology consists of reproducing empirical conclusions of peer-reviewed papers on real, publicly available data, then re-running these experiments a second time on DP synthetic data and comparing the results.
We instantiate our methodology over a benchmark of recent peer-reviewed papers that analyze public datasets in the ICPSR social science repository. We model quantitative claims computationally to automate the experimental workflow, and model qualitative claims by reproducing visualizations and comparing the results manually. We then generate DP synthetic datasets using multiple state-of-the-art mechanisms, and estimate the likelihood that these conclusions will hold. We find that, for reasonable privacy regimes, state-of-the-art DP synthesizers are able to achieve high epistemic parity for several papers in our benchmark. However, some papers, and particularly some specific findings, are difficult to reproduce for any of the synthesizers. Given these results, we advocate for a new class of mechanisms that can reorder the priorities for DP data synthesis: favor stronger guarantees for utility (as measured by epistemic parity) and offer privacy protection with a focus on application-specific threat models and risk-assessment.
Equitable Data Valuation Meets the Right to Be Forgotten in Model Markets
Haocheng Xia (Zhejiang University); Jinfei Liu (Zhejiang University)*; Jian Lou (Zhejiang University); Zhan Qin (Zhejiang University); Kui Ren (Zhejiang University); Yang Cao (Hokkaido University); Li Xiong (Emory University)
The increasing demand for data-driven machine learning (ML) models has led to the emergence of model markets, where a broker collects personal data from data owners to produce high-usability ML models. To incentivize data owners to share their data, the broker needs to price data appropriately while protecting their privacy. For equitable data valuation, which is crucial in data pricing, Shapley value has become the most prevalent technique because it satisfies all four desirable properties in fairness: balance, symmetry, zero element, and additivity. For the right to be forgotten, which is stipulated by many data privacy protection laws to allow data owners to unlearn their data from trained models, the sharded structure in ML model training has become a de facto standard to reduce the cost of future unlearning by avoiding retraining the entire model from scratch. In this paper, we explore how the sharded structure for the right to be forgotten affects Shapley value for equitable data valuation in model markets. To adapt Shapley value for the sharded structure, we propose S-Shapley value, a sharded structure-based Shapley value, which uniquely satisfies four desirable properties for data valuation. Since we prove that computing S-Shapley value is #P-complete, two sampling-based methods are developed to approximate S-Shapley value. Furthermore, to efficiently update valuation results after data owners unlearn their data, we present two delta-based algorithms that estimate the change of data value instead of the data value itself. Experimental results demonstrate the efficiency and effectiveness of the proposed algorithms.
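For readers unfamiliar with the classical Shapley value that the abstract above builds on, the following minimal sketch computes it exactly by averaging each player's marginal contribution over all orderings. It illustrates the standard definition only, not the S-Shapley value or the sampling and delta-based algorithms of the paper. The utility function `toy_utility` and the three data owners `"a"`, `"b"`, `"c"` are hypothetical.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley values: average marginal contribution over all orderings."""
    values = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            values[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: v / n_orders for p, v in values.items()}


def toy_utility(coalition):
    """Hypothetical model utility: 'a' and 'b' help, with a small synergy; 'c' adds nothing."""
    score = 0.0
    if "a" in coalition:
        score += 0.5
    if "b" in coalition:
        score += 0.3
    if {"a", "b"} <= coalition:
        score += 0.1
    return score


sv = shapley_values(["a", "b", "c"], toy_utility)
# Balance: the values sum to the utility of the full coalition.
# Zero element: 'c' never changes the utility, so its value is 0.
```

The enumeration grows factorially with the number of players, which is consistent with the abstract's motivation for sampling-based approximation.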
OpBoost: A Vertical Federated Tree Boosting Framework Based on Order-Preserving Desensitization
Xiaochen Li (Zhejiang university)*; Yuke Hu (Zhejiang University); Weiran Liu (Alibaba Group); Hanwen Feng (Alibaba Group); Li Peng (Alibaba Group); Yuan Hong (University of Connecticut); Kui Ren (Zhejiang University); Zhan Qin (Zhejiang University)
Vertical Federated Learning (FL) is a new paradigm that enables users with non-overlapping attributes of the same data samples to jointly train a model without directly sharing the raw data. Nevertheless, recent works show that this is still not sufficient to prevent privacy leakage from the training process or the trained model. This paper focuses on privacy-preserving tree boosting algorithms under vertical FL. The existing solutions based on cryptography involve heavy computation and communication overhead and are vulnerable to inference attacks. Although the solution based on Local Differential Privacy (LDP) addresses the above problems, it leads to low accuracy of the trained model.
This paper explores how to improve the accuracy of the widely deployed tree boosting algorithms satisfying differential privacy under vertical FL. Specifically, we introduce a framework called OpBoost. Three order-preserving desensitization algorithms satisfying a variant of LDP called distance-based LDP (dLDP) are designed to desensitize the training data. In particular, we optimize the dLDP definition and study efficient sampling distributions to further improve the accuracy and efficiency of the proposed algorithms. The proposed algorithms provide a trade-off between the privacy of pairs with large distance and the utility of desensitized values. Comprehensive evaluations show that OpBoost achieves better prediction accuracy of trained models than existing LDP approaches under reasonable settings. Our code is open source.
H7
Differential Privacy II
Chair: Yang Cao (Hokkaido University)
Longshot: Indexing Growing Databases using MPC and Differential Privacy
Yanping Zhang (Duke University)*; Johes Bater (Tufts University); Kartik Nayak (Duke university); Ashwin Machanavajjhala (Duke)
In this work, we propose Longshot, a novel design for secure outsourced database systems that supports ad-hoc queries through the use of secure multi-party computation and differential privacy. By combining these two techniques, we build and maintain data structures (i.e., synopses, indexes, and stores) that improve query execution efficiency while maintaining strong privacy and security guarantees. As new data records are uploaded by data owners, these data structures are continually updated by Longshot using novel algorithms that leverage bounded information leakage to minimize the use of expensive cryptographic protocols. Furthermore, Longshot organizes the data structures as a hierarchical tree based on when the update occurred, allowing for update strategies that provide logarithmic error over time. Through this approach, Longshot introduces a tunable three-way trade-off between privacy, accuracy, and efficiency. Our experimental results confirm that our optimizations are not only asymptotic improvements but also observable in practice. In particular, we see a 5x efficiency improvement to update our data structures even when the number of updates is less than 200. Moreover, the data structures significantly improve query runtimes over time, about 10³x faster compared to the baseline after 20 updates.
Saibot: A Differentially Private Data Search Platform
Zezhou Huang (Columbia University)*; Jiaxiang Liu (Columbia University); Daniel Alabi (Columbia University); Raul Castro Fernandez (The University of Chicago); Eugene Wu (Columbia University)
Recent data search platforms use ML task-based utility measures rather than metadata-based keywords, to search large dataset corpora. Requesters submit a training dataset and these platforms search for augmentations—join or union-compatible datasets—that, when used to augment the requester’s dataset, most improve model (e.g., linear regression) performance. Although effective, providers that manage personally identifiable data demand differential privacy (DP) guarantees before granting these platforms data access. Unfortunately, making data search differentially private is nontrivial, as a single search can involve training and evaluating datasets hundreds or thousands of times, quickly depleting privacy budgets.
We present Saibot, a differentially private data search platform that employs Factorized Privacy Mechanism (FPM), a novel DP mechanism, to calculate sufficient semi-ring statistics for ML over different combinations of datasets. These statistics are privatized once, and can be freely reused for the search. This allows Saibot to scale to arbitrary numbers of datasets and requests, while minimizing the amount that DP noise affects search results. We optimize the sensitivity of FPM for common augmentation operations, and analyze its properties with respect to linear regression. Specifically, we develop an unbiased estimator for many-to-many joins, prove its bounds, and develop an optimization to redistribute DP noise to minimize the impact on the model. Our evaluation on a real-world dataset corpus of 329 datasets demonstrates that Saibot can return augmentations that achieve model accuracy within 50−90% of non-private search, while the leading alternative DP mechanisms (TPM, APM, shuffling) are several orders of magnitude worse.
DPXPlain: Privately Explaining Aggregate Query Answers
Yuchao Tao (SNAP)*; Amir Gilad (The Hebrew University); Ashwin Machanavajjhala (Duke); Sudeepa Roy (Duke University, USA)
Differential privacy (DP) is the state-of-the-art and rigorous notion of privacy for answering aggregate database queries while preserving the privacy of sensitive information in the data. In today's era of data analysis, however, it poses new challenges for users to understand the trends and anomalies observed in the query results: Is the unexpected answer due to the data itself, or is it due to the extra noise that must be added to preserve DP? In the second case, even the observation made by the users on query results may be wrong. In the first case, can we still mine interesting explanations from the sensitive data while protecting its privacy? To address these challenges, we present a three-phase framework DPXPlain, which is the first system to the best of our knowledge for explaining group-by aggregate query answers with DP. In its three phases, DPXPlain (a) answers a group-by aggregate query with DP, (b) allows users to compare aggregate values of two groups and with high probability assesses whether this comparison holds or is flipped by the DP noise, and (c) eventually provides an explanation table containing the approximately 'top-k' explanation predicates along with their relative influences and ranks in the form of confidence intervals, while guaranteeing DP in all steps. We perform an extensive experimental analysis of DPXPlain with multiple use-cases on real and synthetic data showing that DPXPlain efficiently provides insightful explanations with good accuracy and utility.
Cache Me If You Can: Accuracy-Aware Inference Engine for Differentially Private Data Exploration
Miti Mazmudar (University of Waterloo)*; Thomas Humphries (University of Waterloo); Jiaxiang Liu (University of Waterloo); Matthew Rafuse (University of Waterloo); Xi He (University of Waterloo)
Differential privacy (DP) allows data analysts to query databases that contain users' sensitive information while providing a quantifiable privacy guarantee to users. Recent interactive DP systems such as APEx provide accuracy guarantees over the query responses, but fail to support a large number of queries with a limited total privacy budget, as they process incoming queries independently from past queries. We present an interactive, accuracy-aware DP query engine, CacheDP, which utilizes a differentially private cache of past responses, to answer the current workload at a lower privacy budget, while meeting strict accuracy guarantees. We integrate complex DP mechanisms with our structured cache, through novel cache-aware DP cost optimization. Our thorough evaluation illustrates that CacheDP can accurately answer various workload sequences, while lowering the privacy loss as compared to related work.
Multi-Analyst Differential Privacy for Online Query Answering
David A Pujol (Duke University)*; Albert Sun (Duke University); Brandon T Fain (Duke University); Ashwin Machanavajjhala (Duke)
Most differentially private mechanisms are designed for the use of a single analyst. In reality, however, there are often multiple stakeholders with different and possibly conflicting priorities that must share the same privacy loss budget. This motivates the problem of equitable budget-sharing for multi-analyst differential privacy. Our previous work defined desiderata that any mechanism in this space should satisfy and introduced methods for budget-sharing in the offline case where queries are known in advance.
We extend our previous work on multi-analyst differentially private query answering to the case of online query answering, where queries come in one at a time and must be answered without knowledge of the following queries. We demonstrate that the unknown ordering of queries in the online case results in a fundamental limit in the number of queries that can be answered while satisfying the desiderata. In response, we develop two mechanisms, one which satisfies the desiderata in all cases but is subject to the fundamental limitations, and another that randomizes the input order ensuring that existing online query answering mechanisms can satisfy the desiderata.
H8
Differential Privacy III
Chair: Johes Bater (Tufts University)
Benchmarking the Utility of 𝑤-event Differential Privacy Mechanisms - When Baselines Become Mighty Competitors [eab]
Christine Schäler (Karlsruhe Institute of Technology (KIT)); Thomas Hütter (University of Salzburg); Martin Schäler (University of Salzburg)*
The w-event framework is the current standard for ensuring differential privacy on continuously monitored data streams. Since w-event differential privacy was first proposed, various mechanisms implementing the framework have been proposed. Their comparability in empirical studies is vital both for practitioners choosing a suitable mechanism and for researchers identifying current limitations and proposing novel mechanisms. By conducting a literature survey, we observe that the results of existing studies are hardly comparable and, in part, intrinsically inconsistent.
To this end, we formalize an empirical study of w-event mechanisms by a four-tuple containing re-occurring elements found in our survey. We introduce requirements on these elements that ensure the comparability of experimental results. Moreover, we propose a benchmark that meets all requirements and establishes a new way to evaluate existing and newly proposed mechanisms. Conducting a large-scale empirical study, we gain valuable new insights into the strengths and weaknesses of existing mechanisms. An unexpected - yet explainable - result is a baseline supremacy, i.e., using one of the two baseline mechanisms is expected to deliver good or even the best utility. Finally, we provide guidelines for practitioners to select suitable mechanisms and improvement options for researchers to break the baseline supremacy.
LDPTrace: Locally Differentially Private Trajectory Synthesis
Yuntao Du (Zhejiang University); Yujia Hu (Zhejiang University); Zhikun Zhang (Stanford University); Ziquan Fang (Zhejiang University); Lu Chen (Zhejiang University); Baihua Zheng (Singapore Management University); Yunjun Gao (Zhejiang University)*
Trajectory data has the potential to greatly benefit a wide range of real-world applications, such as tracking the spread of disease through people’s movement patterns and providing personalized location-based services based on travel preference. However, privacy concerns and data protection regulations have limited the extent to which this data is shared and utilized. To overcome this challenge, local differential privacy provides a solution by allowing people to share a perturbed version of their data, ensuring privacy as only the data owners have access to the original information. Despite its potential, existing point-based perturbation mechanisms are not suitable for real-world scenarios due to poor utility, dependence on external knowledge, high computational overhead, and vulnerability to attacks. To address these limitations, we introduce LDPTrace, a novel locally differentially private trajectory synthesis framework. Our framework takes into account three crucial patterns inferred from users’ trajectories in the local setting, allowing us to synthesize trajectories that closely resemble real ones with minimal computational cost. Additionally, we present a new method for selecting a proper grid granularity without compromising privacy. Our extensive experiments, using real-world as well as synthetic data and various utility metrics and attacks, demonstrate the efficacy and efficiency of LDPTrace.
Trajectory Data Collection with Local Differential Privacy
Yuemin Zhang (Harbin Engineering University); Qingqing Ye (Hong Kong Polytechnic University); Rui Chen (Harbin Engineering University)*; Haibo Hu (Hong Kong Polytechnic University); Qilong Han (Harbin Engineering University)
Trajectory data collection is a common task with many applications in our daily lives. Analyzing trajectory data enables service providers to enhance their services, which ultimately benefits users. However, directly collecting trajectory data may give rise to privacy-related issues that cannot be ignored. Local differential privacy (LDP), as the de facto privacy protection standard in a decentralized setting, enables users to perturb their trajectories locally and provides a provable privacy guarantee. Existing approaches to private trajectory data collection in a local setting typically use relaxed versions of LDP, which cannot provide a strict privacy guarantee, or require some external knowledge that is impractical to obtain and update in a timely manner. To tackle these problems, we propose a novel trajectory perturbation mechanism that relies solely on an underlying location set and satisfies pure $\epsilon$-LDP to provide a stringent privacy guarantee. In the proposed mechanism, each point's adjacent direction information in the trajectory is used in its perturbation process. Such information serves as an effective clue to connect neighboring points and can be used to restrict the possible region of a perturbed point in order to enhance utility. To the best of our knowledge, our study is the first to use direction information for trajectory perturbation under LDP. Furthermore, based on this mechanism, we present an anchor-based method that adaptively restricts the region of each perturbed trajectory, thereby significantly boosting performance without violating the privacy constraint. Extensive experiments on both real-world and synthetic datasets demonstrate the effectiveness of the proposed mechanisms.
On the Risks of Collecting Multidimensional Data Under Local Differential Privacy
Héber H. Arcolezi (Inria and École Polytechnique (IPP))*; Sébastien Gambs (UQAM); Jean-François Couchot (University of Franche-Comté); Catuscia Palamidessi (Laboratoire d'informatique de l'École polytechnique)
The private collection of multiple statistics from a population is a fundamental statistical problem. One possible approach to realize this is to rely on the local model of differential privacy (LDP). Numerous LDP protocols have been developed for the task of frequency estimation of single and multiple attributes. These studies mainly focused on improving the utility of the algorithms to ensure the server performs the estimations accurately. In this paper, we investigate privacy threats (re-identification and attribute inference attacks) against LDP protocols for multidimensional data following two state-of-the-art solutions for frequency estimation of multiple attributes. To broaden the scope of our study, we have also experimentally assessed five widely used LDP protocols, namely, generalized randomized response, optimal local hashing, subset selection, RAPPOR and optimal unary encoding. Finally, we also proposed a countermeasure that improves both utility and robustness against the identified threats. Our contributions can help practitioners aiming to collect users' statistics privately to decide which LDP mechanism best fits their needs.
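As a point of reference for the protocols named above, generalized randomized response (GRR) for a single attribute can be sketched as follows. This is a minimal textbook-style illustration, not the authors' code: the function names `grr_perturb` and `grr_estimate` are hypothetical, and the multidimensional solutions and attacks studied in the paper are not shown.

```python
import math
import random
from collections import Counter


def grr_perturb(value, domain, epsilon, rng):
    """Report the true value with probability p, else a uniformly chosen other value."""
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p:
        return value
    others = [v for v in domain if v != value]
    return rng.choice(others)


def grr_estimate(reports, domain, epsilon):
    """Unbiased frequency estimates computed by the server from perturbed reports."""
    k = len(domain)
    n = len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    q = 1.0 / (math.exp(epsilon) + k - 1)
    counts = Counter(reports)
    # Invert the expected report rate: E[count_v / n] = f_v * p + (1 - f_v) * q.
    return {v: (counts[v] / n - q) / (p - q) for v in domain}


if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed, chosen arbitrarily for the illustration
    domain = ["a", "b", "c"]
    true_values = ["a"] * 600 + ["b"] * 300 + ["c"] * 100
    reports = [grr_perturb(v, domain, 1.0, rng) for v in true_values]
    print(grr_estimate(reports, domain, 1.0))
```

Because p + (k - 1)q = 1, the estimates always sum to one, although individual estimates can fall outside [0, 1] for small samples or small epsilon.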
PreFair: Privately Generating Justifiably Fair Synthetic Data
David A Pujol (Duke University)*; Amir Gilad (The Hebrew University); Ashwin Machanavajjhala (Duke)
When a database is protected by Differential Privacy (DP), its usability is limited in scope. In this scenario, generating a synthetic version of the data that mimics the properties of the private data allows users to perform any operation on the synthetic data, while maintaining the privacy of the original data. Therefore, multiple works have been devoted to devising systems for DP synthetic data generation. However, such systems may preserve or even enhance properties of the data that make it unfair, rendering the synthetic data unfit for use. In this work, we present PreFair, a system that allows for DP fair synthetic data generation. PreFair extends the state-of-the-art DP data generation mechanisms by incorporating a causal fairness criterion that ensures fair synthetic data. We adapt the notion of justifiable fairness to fit the synthetic data generation scenario. We further study the problem of DP fair synthetic data generation, showing its intractability and designing algorithms that are optimal under certain assumptions. We also provide an extensive experimental evaluation, showing that PreFair generates synthetic data that is significantly more fair than the data generated by leading DP data generation mechanisms, while remaining faithful to the private data.
R10
Temporal and Evolving Graphs
Chair: Udayan Khurana (IBM Research)
Auxo: A Scalable and Efficient Graph Stream Summarization Structure
Zhiguo Jiang (Huazhong University of Science and Technology); Hanhua Chen (Huazhong University of Science and Technology)*; Hai Jin (Huazhong University of Science and Technology)
A graph stream refers to a continuous stream of edges, forming a huge and fast-evolving graph. The vast volume and high update speed of a graph stream bring stringent requirements for the data management structure, including sublinear space cost, computation-efficient operation support, and scalability of the structure. Existing designs summarize a graph stream by leveraging a hash-based compressed matrix and representing an edge using its fingerprint to achieve practical storage for a graph stream with a known upper bound of data volume. However, they fail to support the dynamic extension of graph streams.
In this paper, we propose Auxo, a scalable structure to support space/time efficient summarization of dynamic graph streams. Auxo is built on a proposed novel \emph{prefix embedded tree} (PET) which leverages binary logarithmic search and common binary prefix embedding to provide an efficient and scalable tree structure. PET reduces the item insert/query time from $O(|E|)$ to $O(\log|E|)$ and reduces the total storage cost by a $\log|E|$ scale, where $|E|$ is the size of the edge set in a graph stream. To further improve the memory utilization of PET during scaling, we propose a proportional PET structure that extends a higher level in a proportionally incremental style. We conduct comprehensive experiments on large-scale real-world datasets to evaluate the performance of this design. Results show that Auxo significantly reduces the insert and query time by one to two orders of magnitude compared to the state of the art. Meanwhile, Auxo achieves efficient and economical structure scaling with an average memory utilization of over $80\%$.
Scalable Time-Range k-Core Query on Temporal Graphs
Junyong Yang (Wuhan University); Ming Zhong (Wuhan University)*; Yuanyuan Zhu (Wuhan University); Tieyun Qian (Wuhan University); Mengchi Liu (South China Normal University); Jeffrey Xu Yu (Chinese University of Hong Kong)
Querying cohesive subgraphs on temporal graphs with various time constraints has attracted intensive research interest recently. In this paper, we study a novel Temporal k-Core Query (TCQ) problem: given a time interval, find all distinct k-cores that exist within any of its subintervals in a temporal graph, which generalizes the previous historical k-core query. This problem is challenging because the number of subintervals increases quadratically with the span of the time interval. To address this, we propose a novel Temporal Core Decomposition (TCD) algorithm that decrementally induces temporal k-cores from the previously induced ones and thus reduces ``intra-core'' redundant computation significantly. Then, we introduce an intuitive concept named Tightest Time Interval (TTI) for temporal k-cores, and design an optimization technique with a theoretical guarantee that leverages TTI as a key to predict which subintervals will induce duplicated k-cores and prunes those subintervals completely in advance, thereby eliminating ``inter-core'' redundant computation. The complexity of the optimized TCD (OTCD) algorithm no longer depends on the span of the query time interval but only on the scale of the final results, which means the OTCD algorithm is scalable. Moreover, we propose a compact in-memory data structure named Temporal Edge List (TEL) to implement the OTCD algorithm efficiently at the physical level with a bounded memory requirement. TEL organizes temporal edges in a ``timeline'' and can be updated instantly when new edges arrive in dynamic temporal graphs. We compare the OTCD algorithm with the incremental historical k-core query on several real-world temporal graphs, and observe that the OTCD algorithm outperforms it by three orders of magnitude, even though the OTCD algorithm needs no precomputed index.
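To make the underlying notion concrete, the k-core induced by a single time interval can be computed by filtering edges on their timestamps and then peeling low-degree vertices. The sketch below is only this naive per-interval building block, under the assumption that temporal edges are `(u, v, t)` triples; it is not the TCD or OTCD algorithm, and the function name `temporal_k_core` is hypothetical.

```python
from collections import defaultdict


def temporal_k_core(temporal_edges, k, t_start, t_end):
    """Return the vertex set of the k-core of the graph formed by edges with t in [t_start, t_end]."""
    adj = defaultdict(set)
    for u, v, t in temporal_edges:
        if t_start <= t <= t_end and u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    # Peeling: repeatedly remove vertices whose remaining degree is below k.
    queue = [u for u in adj if len(adj[u]) < k]
    removed = set(queue)
    while queue:
        u = queue.pop()
        for w in adj[u]:
            if w in removed:
                continue
            adj[w].discard(u)
            if len(adj[w]) < k:
                removed.add(w)
                queue.append(w)
    return set(adj) - removed


if __name__ == "__main__":
    edges = [("a", "b", 1), ("b", "c", 2), ("a", "c", 3), ("c", "d", 2)]
    print(temporal_k_core(edges, 2, 1, 3))
    print(temporal_k_core(edges, 2, 1, 2))
```

Answering a TCQ naively would call this function once per subinterval, which is exactly the quadratic blow-up the abstract sets out to avoid.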
Anonymous Edge Representation for Inductive Anomaly Detection in Dynamic Bipartite Graphs
Lanting Fang (Southeast University)*; Kaiyu Feng (Beijing Institute of Technology); Jie Gui (Southeast University); Shanshan Feng (Centre for Frontier AI Research, A*STAR); Aiqun Hu (Southeast University)
The activities in many real-world applications, such as e-commerce and online education, are usually modeled as a dynamic bipartite graph that evolves over time. It is a critical task to detect anomalies inductively in a dynamic bipartite graph. Previous approaches either focus on detecting pre-defined types of anomalies or cannot handle nodes that are unseen during the training stage. To address this challenge, we propose an effective method to learn anonymous edge representation (AER) that captures the characteristics of an edge without using identity information. We further propose a model named AER-AD to utilize AER to detect anomalies in dynamic bipartite graphs in an inductive setting. Extensive experiments on both real-life and synthetic datasets are conducted to illustrate that AER-AD outperforms state-of-the-art baselines. In terms of AUC and F1, AER-AD achieves results 8.38% and 14.98% higher than the best inductive representation baselines, and 6.99% and 19.59% higher than the best anomaly detection baselines.
Spade: A Real-Time Fraud Detection Framework on Evolving Graphs [sds]
Jiaxin Jiang (National University of Singapore)*; Yuan Li (National University of Singapore); Bingsheng He (National University of Singapore); Bryan Hooi (National University of Singapore); Jia Chen (Grab); Johan Kok Zhi Kang (Grab)
Real-time fraud detection is a challenge for most financial and electronic commercial platforms. To identify fraudulent communities, Grab, one of the largest technology companies in Southeast Asia, forms a graph from a set of transactions and detects dense subgraphs arising from abnormally large numbers of connections among fraudsters. Existing dense subgraph detection approaches focus on static graphs without considering the fact that transaction graphs are highly dynamic. Moreover, detecting dense subgraphs from scratch with graph updates is time consuming and cannot meet the real-time requirement in industry. To address this problem, we introduce an incremental real-time fraud detection framework called Spade. Spade can detect fraudulent communities in hundreds of microseconds on million-scale graphs by incrementally maintaining dense subgraphs. Furthermore, Spade supports batch updates and edge grouping to reduce response latency. Lastly, Spade provides simple but expressive APIs for the design of evolving fraud detection semantics. Developers plug their customized suspiciousness functions into Spade which incrementalizes their semantics without recasting their algorithms. Extensive experiments show that Spade detects fraudulent communities in real time on million-scale graphs. Peeling algorithms incrementalized by Spade are up to a million times faster than the static version.
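For readers unfamiliar with the static peeling algorithms mentioned above, a classic greedy peeling routine for dense subgraph detection can be sketched as follows. This is an illustrative static baseline that uses the density |E|/|V| as the suspiciousness measure; it is an assumption made here for the example and does not reproduce Spade's APIs or its incremental maintenance. The function name `densest_subgraph_peeling` is hypothetical.

```python
def densest_subgraph_peeling(edges):
    """Greedy peeling: repeatedly drop a minimum-degree vertex and keep the densest prefix seen."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    remaining = set(adj)
    num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    best_density, best_set = 0.0, set()

    while remaining:
        density = num_edges / len(remaining)
        if density > best_density:
            best_density, best_set = density, set(remaining)
        # Remove a vertex of minimum remaining degree (linear scan for simplicity).
        u = min(remaining, key=lambda x: len(adj[x]))
        for w in adj[u]:
            adj[w].discard(u)
        num_edges -= len(adj[u])
        adj[u] = set()
        remaining.remove(u)
    return best_set, best_density


if __name__ == "__main__":
    clique = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("c", "d")]
    print(densest_subgraph_peeling(clique + [("a", "e")]))
```

Rerunning such a routine from scratch after every edge update is the cost that the abstract describes as too slow for real-time use, which motivates incremental maintenance.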
Mining Bursting Core in Large Temporal Graph
Hongchao Qin (Beijing Institute of Technology); Rong-Hua Li (Beijing Institute of Technology); Ye Yuan (Beijing Institute of Technology); Guoren Wang (Beijing Institute of Technology); Lu Qin (UTS); Zhiwei Zhang (Hong Kong Baptist University)
Temporal graphs are ubiquitous. Mining communities that are bursting in a period of time is essential for seeking real emergency events in temporal graphs. Unfortunately, most previous studies on community mining in temporal networks ignore the bursting patterns of communities. In this paper, we study the problem of seeking bursting communities in a temporal graph. We propose a novel model, called the (𝑙,𝛿)-maximal bursting core, to represent a bursting community in a temporal graph. Specifically, an (𝑙,𝛿)-maximal bursting core is a temporal subgraph in which each node has an average degree no less than 𝛿 in a time segment with length no less than 𝑙. To compute the (𝑙,𝛿)-maximal bursting core, we first develop a novel dynamic programming algorithm that reduces the time complexity of calculating the segment density from 𝑂(|T|^2) to 𝑂(|T|). Then, we propose an efficient updating algorithm which can update the segment density in 𝑂(𝑙) time. In addition, we develop an efficient algorithm to enumerate all (𝑙,𝛿)-maximal bursting cores that are not dominated by the others in terms of 𝑙 and 𝛿. The results of extensive experiments on 9 real-life datasets demonstrate the effectiveness, efficiency and scalability of our algorithms.
R11
Graph Structures and Queries
Chair: Shuhao Liu (Shenzhen Institute of Computing Sciences)
Approximating Probabilistic Group Steiner Trees in Graphs
Shuang Yang (Renmin University of China); Yahui Sun (Renmin University of China)*; Jiesong Liu (Renmin University of China); Xiaokui Xiao (National University of Singapore); Rong-Hua Li (Beijing Institute of Technology); Zhewei Wei (Renmin University of China)
Consider an edge-weighted graph, and a number of properties of interest (PoIs). Each vertex has a probability of exhibiting each PoI. The joint probability that a set of vertices exhibits a PoI is the probability that this set contains at least one vertex that exhibits this PoI. The probabilistic group Steiner tree problem is to find a tree such that (i) for each PoI, the joint probability that the set of vertices in this tree exhibits this PoI is no smaller than a threshold value, e.g., 0.97; and (ii) the total weight of edges in this tree is the minimum. Solving this problem is useful for mining various graphs with uncertain vertex properties, but is NP-hard. The existing work focuses on certain cases, and cannot perform this task. To meet this challenge, we propose 3 approximation algorithms for solving the above problem. Let $|\Gamma|$ be the number of PoIs, and $\xi$ be an upper bound of the number of vertices for satisfying the threshold value of exhibiting each PoI. Algorithms 1 and 2 have tight approximation guarantees proportional to $|\Gamma|$ and $\xi$, and exponential time complexities with respect to $\xi$ and $|\Gamma|$, respectively. In comparison, Algorithm 3 has a looser approximation guarantee proportional to, and a polynomial time complexity with respect to, both $|\Gamma|$ and $\xi$. Experiments on real and large datasets show that the proposed algorithms considerably outperform the state-of-the-art related work for finding probabilistic group Steiner trees in various cases.
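The feasibility condition (i) above can be illustrated with a short sketch. It assumes, as a simplification not stated in the abstract, that vertices exhibit a PoI independently, so the joint probability of a vertex set is one minus the product of the individual miss probabilities. The names `joint_probability` and `satisfies_thresholds`, and the nested-dictionary layout of `prob`, are hypothetical.

```python
import math


def joint_probability(vertex_probs):
    """Probability that at least one vertex exhibits the PoI, assuming independence."""
    return 1.0 - math.prod(1.0 - p for p in vertex_probs)


def satisfies_thresholds(tree_vertices, prob, pois, threshold):
    """Check condition (i): every PoI reaches the threshold on the tree's vertex set.

    prob[v][g] is the probability that vertex v exhibits PoI g (0.0 if missing).
    """
    return all(
        joint_probability(prob[v].get(g, 0.0) for v in tree_vertices) >= threshold
        for g in pois
    )


if __name__ == "__main__":
    prob = {"x": {"g1": 0.9, "g2": 0.5}, "y": {"g1": 0.8, "g2": 0.5}}
    print(joint_probability([0.9, 0.8]))
    print(satisfies_thresholds(["x", "y"], prob, ["g1", "g2"], 0.97))
```

The check covers feasibility only; minimizing the total edge weight over feasible trees is the NP-hard part that the three approximation algorithms address.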
Lotan: Bridging the Gap between GNNs and Scalable Graph Analytics Engines
Yuhao Zhang (University of California at San Diego)*; Arun Kumar (University of California at San Diego)
Recent advances in Graph Neural Networks (GNNs) have changed the landscape of modern graph analytics. The complexity of GNN training and the challenges of GNN scalability have also sparked interest from the systems community, with efforts to build systems that provide higher efficiency and schemes to reduce costs. However, we observe that many such systems basically "reinvent the wheel" of much work done in the database world on scalable graph analytics engines. Further, they often tightly couple the scalability treatments of graph data processing with that of GNN training, resulting in entangled complex problems and systems that often do not scale well on one of those axes.
In this paper, we ask a question: How far can we push existing systems for scalable graph analytics and deep learning (DL) instead of building custom GNN systems? Are compromises inevitable on scalability and/or runtimes? We propose Lotan, the first scalable and optimized data system for full-batch GNN training with decoupled scaling that bridges the hitherto siloed worlds of graph analytics systems and DL systems. Lotan offers a series of technical innovations, including re-imagining GNN training as query plan-like dataflows, execution plan rewriting, optimized data movement between systems, a GNN-centric graph partitioning scheme, and the first known GNN model batching scheme. We prototyped Lotan on top of GraphX and PyTorch. An empirical evaluation using several real-world benchmark GNN workloads reveals a promising nuanced picture: Lotan significantly surpasses the scalability of state-of-the-art custom GNN systems, while often matching or being only slightly behind on time-to-accuracy metrics in some cases. We also show the impact of our system optimizations. Overall, our work shows that the GNN world can indeed benefit from building on top of scalable graph analytics engines. Lotan's new level of scalability can also empower new ML-oriented research on ever-larger graphs and GNNs.
Discovering Polarization Niches via Dense Subgraphs with Attractors and Repulsers
Adriano Fazzone (Sapienza University of Rome); Tommaso Lanciano (KTH Royal Institute of Technology); Riccardo Denni (Sapienza University); Charalampos Tsourakakis (Boston University); Francesco Bonchi (ISI Foundation, Turin)
Detecting niches of polarization in social media is a first step towards deploying mitigation strategies and avoiding radicalization. In this paper, we model polarization niches as close-knit dense communities of users, which are under the influence of some well-known sources of misinformation, and isolated from authoritative information sources. Based on this intuition we define the problem of finding a subgraph that maximizes a combination of (𝑖) density, (𝑖𝑖) proximity to a small set of nodes 𝐴 (named Attractors), and (𝑖𝑖𝑖) distance from another small set of nodes 𝑅 (named Repulsers). Deviating from the bulk of the literature on detecting polarization, we do not exploit text mining or sentiment analysis, nor do we track the propagation of information: we only exploit the network structure and the background knowledge about the sets 𝐴 and 𝑅, which are given as input. We build on recent algorithmic advances in supermodular maximization to provide an iterative greedy algorithm, dubbed Down in the Hollow (dith), that converges fast to a near-optimal solution. Thanks to a novel theoretical upper bound, we are able to equip dith with a practical device that allows it to terminate as soon as a solution with a user-specified approximation factor is found, making our algorithm very efficient in practice. Our experiments on very large networks confirm that our algorithm always returns a solution with an approximation factor better than or equal to the one specified by the user, and that it is scalable. Our case studies in polarized settings confirm the usefulness of our algorithmic primitive in detecting polarization niches.
gCore: Exploring Cross-layer Cohesiveness in Multi-layer Graphs
Dandan Liu (Harbin Institute of Technology); Zhaonian Zou (Harbin Institute of Technology)*
As multi-layer graphs can give a more accurate and reliable picture of the complex relationships between entities, cohesive subgraph mining, a fundamental task in graph analysis, has been studied on multi-layer graphs in the literature. However, existing cohesive subgraph models are designed for special multi-layer graphs such as multiplex networks and heterogeneous information networks. In this paper, we propose generalized core (gCore), a new notion of cohesive subgraph on general multi-layer graphs without any predefined constraints on the interconnections between vertices. The gCore model considers both the intra-layer and cross-layer cohesiveness of vertices. Three related problems are studied in this paper: gCore search (GCS), gCore decomposition (GCD), and gCore indexing (GCI). A polynomial-time algorithm based on the peeling paradigm is proposed to solve the GCS problem. By considering the containment among gCores, a ``tree of trees'' data structure called KP-tree is designed for efficiently solving the GCD problem and serving as a compact storage and index of all gCores. Several advanced lossless compaction techniques including node/subtree elimination, subtree transplant, and subtree merge are proposed to help reduce the storage overhead of the KP-tree and speed up the process of solving GCD and GCI. In addition, a KP-tree-based GCS algorithm is designed, which can retrieve any gCore in time linear in the size of the gCore and the height of the KP-tree. The experiments on $10$ real-world graphs verify the effectiveness of the gCore model and the efficiency of the proposed algorithms.
Zebra: When Temporal Graph Neural Networks Meet Temporal Personalized PageRank
Yiming Li (Hong Kong University of Science and Technology)*; Yanyan Shen (Shanghai Jiao Tong University); Lei Chen (Hong Kong University of Science and Technology); Mingxuan Yuan (Huawei)
Temporal graph neural networks (T-GNNs) are state-of-the-art methods for learning representations over dynamic graphs. Despite the superior performance, T-GNNs still suffer from high computational complexity caused by the tedious recursive temporal message passing scheme, which hinders their applicability to large dynamic graphs. To address the problem, we build the theoretical connection between the temporal message passing scheme adopted by T-GNNs and the temporal random walk process on dynamic graphs. Our theoretical analysis indicates that it would be possible to select a few influential temporal neighbors to compute a target node's representation without compromising the predictive performance. Based on this finding, we propose to utilize T-PPR, a parameterized metric for estimating the influence score of nodes on evolving graphs. We further develop an efficient single-scan algorithm to answer the top-k T-PPR query with rigorous approximation guarantees. Finally, we present Zebra, a scalable framework that accelerates the computation of T-GNN by directly aggregating the features of the most prominent temporal neighbors returned by the top-k T-PPR query. Extensive experiments have validated that Zebra can be up to two orders of magnitude faster than the state-of-the-art T-GNNs while attaining better performance.
R12
Reasoning, Recommendation, Classification
Chair: Besat Kassaie (University of Waterloo)
Federated Calibration and Evaluation of Binary Classifiers
Graham Cormode (University of Warwick)*; Igor L Markov (Meta)
We address two major obstacles to practical deployment of AI-based models on distributed private data. Whether a model was trained by a federation of cooperating clients or trained centrally, (1) the output scores must be calibrated, and (2) performance metrics must be evaluated, all without assembling labels in one place. In particular, we show how to perform calibration and compute the standard metrics of precision, recall, accuracy and ROC-AUC in the federated setting under three privacy models: (i) secure aggregation, (ii) distributed differential privacy, (iii) local differential privacy. Our theorems and experiments clarify tradeoffs between privacy, accuracy, and data efficiency. They also help decide whether a given application has sufficient data to support federated calibration and evaluation.
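As a rough illustration of the evaluation side of this abstract, the sketch below computes precision, recall, and accuracy from per-client confusion counts that are only ever combined as sums, which is the kind of aggregate a secure-aggregation setting exposes. The function names, the 0.5 decision threshold, and the plain summation standing in for secure aggregation are assumptions made for illustration; no privacy mechanism is implemented, and this is not the protocol from the paper.

```python
from typing import Dict, List, Sequence, Tuple

# (true positives, false positives, false negatives, true negatives)
Counts = Tuple[int, int, int, int]


def local_counts(scores: Sequence[float], labels: Sequence[int],
                 threshold: float = 0.5) -> Counts:
    """Confusion counts computed by one client on its own private data."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def aggregate(all_counts: List[Counts]) -> Counts:
    """Stand-in for secure aggregation: the server sees only the sums."""
    tp, fp, fn, tn = (sum(c[i] for c in all_counts) for i in range(4))
    return tp, fp, fn, tn


def metrics(counts: Counts) -> Dict[str, float]:
    """Standard metrics derived from the aggregated counts alone."""
    tp, fp, fn, tn = counts
    total = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }
```

Labels and scores never leave the clients in this sketch; only four integers per client are shared, and the server works with their totals.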
Efficient Fault Tolerance for Recommendation Model Training via Erasure Coding
Tianyu Zhang (Carnegie Mellon University); Kaige Liu (Carnegie Mellon University); Jack Kosaian (Carnegie Mellon University)*; Juncheng Yang (Carnegie Mellon University); Rashmi Vinayak (Carnegie Mellon University)
Deep-learning-based recommendation models (DLRMs) are widely deployed to serve personalized content. In addition to using neural networks, DLRMs have large, sparsely-accessed embedding tables, which map categorical features to a learned dense representation. Due to the large sizes of embedding tables, DLRM training is typically distributed across the memory of tens or hundreds of nodes. Node failures are common in such large systems and must be mitigated to enable training to complete within production deadlines. Checkpointing is the primary approach used for fault tolerance in these systems, but incurs significant time overhead both during normal operation and when recovering from failures. As these overheads increase with DLRM size, checkpointing is slated to become an even larger overhead for future DLRMs, which are expected to grow. This calls for rethinking fault tolerance in DLRM training.
We present ECRec, a DLRM training system that achieves efficient fault tolerance by coupling erasure coding with the unique characteristics of DLRM training. ECRec takes a hybrid approach between erasure coding and replicating different DLRM parameters, correctly and efficiently updates redundant parameters, and enables training to proceed without pauses, while maintaining the consistency of the recovered parameters. We implement ECRec atop XDL, an open-source, industrial-scale DLRM training system. Compared to checkpointing, ECRec reduces training-time overhead on large DLRMs by up to 66%, recovers from failure up to 9.8x faster, and continues training during recovery with only a 7-13% drop in throughput (whereas checkpointing must pause).
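To make the idea of erasure-coded parameters concrete, here is a minimal sketch that assumes a single sum-based parity shard over k equally sized parameter shards. The abstract does not state which code ECRec uses, so the parity scheme, the function names, and the integer toy values are all illustrative assumptions. The sketch shows the two operations the abstract mentions: keeping redundant parameters consistent under updates, and recovering a lost shard.

```python
from typing import List


def encode_parity(shards: List[List[int]]) -> List[int]:
    """Parity shard: element-wise sum of all data shards (assumed scheme)."""
    return [sum(values) for values in zip(*shards)]


def apply_update(shards: List[List[int]], parity: List[int],
                 shard_id: int, index: int, delta: int) -> None:
    """Apply a parameter update and the matching delta to the parity shard."""
    shards[shard_id][index] += delta
    parity[index] += delta


def recover(shards: List[List[int]], parity: List[int],
            lost_id: int) -> List[int]:
    """Rebuild one lost shard from the parity and the surviving shards.

    Assumes at least two data shards, so at least one survivor exists.
    """
    survivors = [s for i, s in enumerate(shards) if i != lost_id]
    return [p - sum(values) for p, values in zip(parity, zip(*survivors))]
```

Because every update is mirrored on the parity shard, the parity stays consistent without a pause, and a single failed shard can be rebuilt from the others rather than reloaded from a checkpoint.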
Scalable Reasoning on Document Stores via Instance-Aware Query Rewriting
Olivier Rodriguez (INRIA); Federico Ulliana (Inria)*; Marie-Laure Mugnier (University of Montpellier)
Data trees, typically encoded in JSON, are ubiquitous in data-driven applications. This ubiquity makes urgent the development of novel techniques for querying heterogeneous JSON data in a flexible manner. We propose a rule language for JSON, called constrained tree-rules, whose purpose is to provide a high-level unified view of heterogeneous JSON data and infer implicit information. As reasoning with constrained tree-rules is undecidable, we identify a relevant subset featuring tractable query answering, for which we design an automata-based query rewriting algorithm. Our approach consists of leveraging NoSQL document stores by means of a novel instance-aware query-rewriting technique. We present an extensive experimental analysis on large collections of several million JSON records. Our results show the importance of instance-aware rewriting as well as the efficiency and scalability of our approach.
Happiness Maximizing Sets under Group Fairness Constraints
Jiping Zheng (Nanjing University of Aeronautics and Astronautics); Yuan Ma (Nanjing University of Aeronautics and Astronautics); Wei Ma (Nanjing University of Aeronautics and Astronautics); Yanhao Wang (East China Normal University)*; Xiaoyang Wang (University of New South Wales)
Finding a happiness maximizing set (HMS) from a database, i.e., selecting a small subset of tuples that preserves the best score with respect to any nonnegative linear utility function, is an important problem in multi-criteria decision-making. When an HMS is extracted from a set of individuals to assist data-driven algorithmic decisions such as hiring and admission, it is crucial to ensure that the HMS can fairly represent different groups of candidates without bias and discrimination. However, although the HMS problem was extensively studied in the database community, existing algorithms do not take group fairness into account and may provide solutions that under-represent some groups.
In this paper, we propose and investigate a fair variant of HMS (FairHMS) that not only maximizes the minimum happiness ratio but also guarantees that the number of tuples chosen from each group falls within predefined lower and upper bounds. Similar to the vanilla HMS problem, we show that FairHMS is NP-hard in three and higher dimensions. Therefore, we first propose an exact interval cover-based algorithm called IntCov for FairHMS on two-dimensional databases. Then, we propose a bicriteria approximation algorithm called BiGreedy for FairHMS on multi-dimensional databases by transforming it into a submodular maximization problem under a matroid constraint. We also design an adaptive sampling strategy to improve the practical efficiency of BiGreedy. Extensive experiments on real-world and synthetic datasets confirm the efficacy and efficiency of our proposal.
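The two ingredients of FairHMS described above, the minimum happiness ratio and the per-group bounds, can be sketched as follows. The happiness ratio of a subset for one utility function is taken here as the best score in the subset divided by the best score in the whole database. Evaluating the minimum over a finite list of utility vectors is a simplifying assumption (the problem is defined over all nonnegative linear utilities), and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

Point = Sequence[float]


def happiness_ratio(subset: Sequence[Point], database: Sequence[Point],
                    utility: Sequence[float]) -> float:
    """Best score in the subset relative to the best score in the database."""
    def score(p: Point) -> float:
        return sum(w * x for w, x in zip(utility, p))

    best_overall = max(score(p) for p in database)
    best_in_subset = max(score(p) for p in subset)
    return best_in_subset / best_overall if best_overall > 0 else 1.0


def min_happiness_ratio(subset: Sequence[Point], database: Sequence[Point],
                        utilities: Sequence[Sequence[float]]) -> float:
    """Minimum happiness ratio over a finite sample of utility vectors."""
    return min(happiness_ratio(subset, database, u) for u in utilities)


def satisfies_group_bounds(groups_of_subset: Sequence[str],
                           bounds: Dict[str, Tuple[int, int]]) -> bool:
    """Check that each group's count lies within its [lower, upper] bound."""
    counts = Counter(groups_of_subset)
    return all(lo <= counts.get(g, 0) <= hi for g, (lo, hi) in bounds.items())
```

A candidate solution is acceptable in this sketch only if `satisfies_group_bounds` holds; among acceptable candidates, the one with the largest `min_happiness_ratio` is preferred.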
Fairness Matters: A Tit-For-Tat Strategy Against Selfish Mining
Weijie Sun (The Hong Kong University of Science and Technology); Zihuan Xu (Hong Kong University of Science and Technology); Lei Chen (Hong Kong University of Science and Technology)
Proof-of-work (PoW) based blockchains have become more secure as profit-oriented miners contribute more computing power in exchange for fair revenues. This virtuous circle only works under an incentive-compatible consensus, which is found to be fragile under selfish mining attacks. Specifically, selfish miners can conceal and reveal blocks strategically to earn unfairly higher revenue compared to honest behavior. Previous countermeasures either require incompatible modifications or fail to consider the asynchronous network and multiple honest nodes found in reality. In this paper, we introduce an unfairness measurement based on the KL-divergence from the computing power distribution to the revenue distribution of miners. To improve fairness in the presence of selfish miners, we propose a novel block promotion strategy for honest miners called Tit-for-Tat (TFT). In particular, based on a miner’s local observation of forks, we design a suspicious probability measurement for other nodes. Rather than promoting a fresh block instantly, miners withhold it for different time periods according to others’ suspicious probability before delivery. Meanwhile, to minimize the attacker’s unfair revenue, we formulate the delay vector (DV) problem for honest miners to determine the optimal withholding time. We prove that the DV problem is nonconvex, and thus propose two approximation algorithms that yield n-suboptimal solutions. In addition, we extend the TFT strategy to support dynamic networks. Extensive experiments validate the efficiency and effectiveness of our strategy and algorithms, which reduce unfairness by 54.62% within bounded withholding time.
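A KL-divergence-based unfairness score of the kind this abstract describes could look like the sketch below. The direction of the divergence (computing power as the reference distribution, revenue as the comparison) and the use of the natural logarithm are assumptions, since the abstract does not fix either; the function name is hypothetical.

```python
import math
from typing import Sequence


def unfairness(power_share: Sequence[float],
               revenue_share: Sequence[float]) -> float:
    """KL divergence D(power || revenue) between two miner distributions.

    Both inputs are assumed to be probability vectors over the same miners.
    The result is 0 when every miner's revenue share equals its power share.
    """
    total = 0.0
    for p, q in zip(power_share, revenue_share):
        if p == 0:
            continue  # terms with zero power contribute nothing
        if q == 0:
            raise ValueError("revenue share is zero for a miner with power")
        total += p * math.log(p / q)
    return total
```

For example, a miner holding 30% of the computing power but collecting 45% of the revenue yields a strictly positive score, while a revenue split that matches the power split yields zero.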
R13
Queries and Systems I
Chair: Sabina Petride (Oracle)
C5: Cloned Concurrency Control that Always Keeps Up
Jeffrey Helt (Princeton University)*; Abhinav Sharma (Meta Platforms); Daniel J Abadi (UMD); Wyatt Lloyd (Princeton University); Jose Faleiro (Microsoft)
Asynchronously replicated primary-backup databases are commonly deployed to improve availability and offload read-only transactions. To both apply replicated writes from the primary and serve read-only transactions, the backups implement a cloned concurrency control protocol. The protocol ensures read-only transactions always return a snapshot of state that previously existed on the primary. This compels the backup to exactly copy the commit order resulting from the primary’s concurrency control. Existing cloned concurrency control protocols guarantee this by limiting the backup’s parallelism. As a result, the primary’s concurrency control executes some workloads with more parallelism than these protocols. In this paper, we prove that this parallelism gap leads to unbounded replication lag, where writes can take arbitrarily long to replicate to the backup and which has led to catastrophic failures in production systems. We then design C5, the first cloned concurrency protocol to provide bounded replication lag. We implement two versions of C5: Our evaluation in MyRocks, a widely deployed database, demonstrates C5 provides bounded replication lag. Our evaluation in Cicada, a recent in-memory database, demonstrates C5 keeps up with even the fastest of primaries.
LEON: A New Framework for ML-Aided Query Optimization
Xu Chen (University of Electronic Science and Technology of China)*; Haitian Chen (University of Electronic Science and Technology of China); Zibo Liang (University of Electronic Science and Technology of China); Shuncheng Liu (University of Electronic Science and Technology of China); Jinghong Wang (Huawei Technologies Co., Ltd.); Kai Zeng (Huawei); Han Su (University of Electronic Science and Technology of China); Kai Zheng (University of Electronic Science and Technology of China)
Query optimization has long been a fundamental yet challenging topic in the database field. With the rise of machine learning (ML), some recent works have shown the advantages of reinforcement learning (RL) based learned query optimizers. However, they suffer from fundamental limitations due to the data-driven nature of ML. Motivated by the characteristics of ML and the maturity of database systems, we propose LEON, a framework for ML-aided query optimization. LEON enables the expert query optimizer to self-adjust to the particular deployment by leveraging ML and the fundamental knowledge in the expert query optimizer. To train the ML model, a pair-wise ranking objective is proposed, which is substantially different from the previous regression objective. To help the optimizer escape local minima and avoid failure, a ranking and uncertainty-based exploration strategy is proposed, which discovers valuable plans to aid the optimizer. Furthermore, ML model-guided pruning is proposed to increase planning efficiency without hurting performance too much. Extensive experiments offer evidence that the proposed framework can outperform the state-of-the-art methods in terms of end-to-end latency, training efficiency, and stability.
BASE: Bridging the Gap between Cost and Latency for Query Optimization [sds]
Xu Chen (University of Electronic Science and Technology of China)*; Zhen Wang (Alibaba Group); Shuncheng Liu (University of Electronic Science and Technology of China); Yaliang Li (Alibaba Group); Kai Zeng (University Of Electronic Science And Technology Of China); Bolin Ding (Data Analytics and Intelligence Lab, Alibaba Group); Jingren Zhou (Alibaba Group); Han Su (University of Electronic Science and Technology of China); Kai Zheng (University of Electronic Science and Technology of China)
Some recent works have shown the advantages of reinforcement learning (RL) based learned query optimizers. These works often use the cost (i.e., the estimation of cost model) or the latency (i.e., execution time) as guidance signals for training their learned models. However, cost-based learning underperforms in latency and latency-based learning is time-intensive. In order to bypass such a dilemma, researchers attempt to transfer a learned value network from the cost domain to the latency domain. We recognize critical insights in cost/latency-based training, prompting us to transfer the reward function rather than the value network. Based on this idea, we propose a two-stage RL-based framework, BASE, to bridge the gap between cost and latency. After learning a policy based on cost signals in its first stage, BASE formulates transferring the reward function as a variant of inverse reinforcement learning. Intuitively, BASE learns to calibrate the reward function and updates the policy regarding the calibrated one in a mutually-improved manner. Extensive experiments exhibit the superiority of BASE on two benchmark datasets: Our optimizer outperforms traditional DBMS, using 30% less training time than SOTA methods. Meanwhile, our approach can enhance the efficiency of other learning-based optimizers.
A Randomized Blocking Structure for Streaming Record Linkage [sds]
Dimitrios Karapiperis (International Hellenic University)*; Christos Tjortjis (International Hellenic University); Vassilios S. Verykios (Hellenic Open University)
A huge amount of data, in the form of streams, is collected nowadays via a variety of sources, such as sensors, mobile devices, or even raw log files. The unprecedented rate at which these data are generated and collected calls for novel record linkage methods to identify matching record pairs, which refer to the same real-world entity. To this end, blocking methods are used to reduce the number of candidate record pairs while still maintaining high levels of accuracy. This paper introduces ExpBlock, a randomized record linkage structure, which guarantees that both the most frequently accessed and recently used blocks remain in main memory and, additionally, that the records within a block are renewed on a rolling basis. Specifically, the probability that inactive blocks and older records remain in main memory decays in order to make room for more promising blocks and fresher records, respectively. We implement these features using random choices instead of cumbersome ranking data structures in order to favour simplicity of implementation and efficiency. We show, through the experimental evaluation, that ExpBlock scales efficiently to data streams by providing accurate results in a timely fashion.
A Case for Graphics-driven Query Processing
Harish Doraiswamy (Microsoft Research India)*; Vikas Kalagi (Microsoft Research India); Karthik Ramachandra (Microsoft Azure SQL India); Jayant R Haritsa (Indian Institute of Science)
Over the past decade, the database research community has directed considerable attention towards harnessing the power of GPUs in query processing engines. The proposed techniques have primarily focused on devising customized low-level mechanisms that utilize the raw hardware parallelism provided abundantly by GPU compute kernels.
In this paper, we advocate a radically different approach: instead of dealing directly with hardware idiosyncrasies, we leverage the well-established graphics pipeline architecture baked into the GPU hardware. A variety of advantages accrue from this high-level abstraction: (a) Extracting the power of GPUs is outsourced to highly-optimized graphics drivers, thereby providing hardware-consciousness for free; (b) Query processing becomes agnostic to changes in GPU architectures (e.g. integrated vs discrete) and vendors, requiring only a change of drivers; (c) Contemporary graphics APIs also support a compute element, facilitating query operator designs that seamlessly straddle the compute and graphics worlds.
As a proof of concept of the above vision, we implement here the workhorse Join and GroupBy operators using core graphics primitives. These implementations, based on the Vulkan API, have been evaluated over large benchmark databases on vanilla hybrid computing platforms. The experimental results indicate both substantive performance benefits (typically, around 2X faster) over existing approaches, as well as auto-tuned portability to new hardware platforms.
R14
Spatial, Spatio-Temporal
Chair: Felix Naumann (Hasso Plattner Institute)
A Hierarchical Grouping Algorithm for the Multi-Vehicle Dial-a-Ride Problem
Kelin Luo (University of Bonn); Alexandre M Florio (Polytechnique Montreal); Syamantak Das (IIIT Delhi); Xiangyu Guo (University at Buffalo)*
Ride-sharing is an essential aspect of modern urban mobility. In this paper, we consider a classical problem in ride-sharing – the Multi-Vehicle Dial-a-Ride Problem (Multi-Vehicle DaRP). Given a fleet of vehicles with a fixed capacity stationed at various locations and a set of ride requests specified by origin and destination, the goal is to serve all requests such that no vehicle is assigned more passengers than its capacity at any point in its trip.
We give an algorithm HGR, which is the first non-trivial approximation algorithm for the Multi-Vehicle DaRP. The main technical contribution is to reduce Multi-Vehicle DaRP to a certain capacitated partitioning problem, which we solve using a novel hierarchical grouping algorithm.
Experimental results show that the vehicle routes produced by our algorithm not only exhibit less total travel distance compared to state-of-the-art baselines, but also enjoy a small in-transit latency, which crucially relates to each individual rider’s travel time. This suggests that HGR enhances rider experience while being energy-efficient.
Route Travel Time Estimation on A Road Network Revisited: Heterogeneity, Proximity, Periodicity and Dynamicity
Haitao Yuan (Nanyang Technological University)*; Guoliang Li (Tsinghua University); Zhifeng Bao (RMIT University)
In this paper, we revisit the problem of route travel time estimation on a road network and aim to boost its accuracy by capturing and utilizing spatio-temporal features from four significant aspects: heterogeneity, proximity, periodicity and dynamicity. Spatial-wise, we consider two forms of heterogeneity at link level in a road network: the turning ways between different links are heterogeneous, which can make the travel time of the same link vary; and different links have heterogeneous attributes and thereby lead to different travel times. In addition, we take into account proximity: neighboring links have similar traffic patterns and lead to similar travel speeds. To this end, we build a link-connection graph to capture such heterogeneity and proximity. Temporal-wise, the weekly/daily periodicity of temporal background information (e.g., rush hours) and dynamic traffic conditions have a significant impact on the travel time, which results in static and dynamic spatio-temporal features respectively. To capture such impacts, we regard the travel time/speed as a combination of static and dynamic parts, and extract many relevant spatio-temporal features for the prediction task. As for the methodology, it remains an open problem to build a generic learning model to boost the estimation accuracy. Hence, we design a novel encoder-decoder framework. The encoder uses the sequence attention model to encode dynamic features from the temporal-wise perspective. The decoder first uses the heterogeneous graph attention model to decode the static part of travel speed based on static spatio-temporal features, and then leverages the sequence attention model to decode the estimated travel time from the spatial-wise perspective. Extensive experiments on real datasets verify the superiority of our method as well as the importance of the four aspects outlined above.
Automatic Road Extraction with Multi-Source Data Revisited: Completeness, Smoothness and Discrimination
Haitao Yuan (Nanyang Technological University)*; Sai Wang (Wuhan University); Zhifeng Bao (RMIT University); Shangguang Wang (State Key Laboratory of Networking and Switching Technology)
Extracting roads from multi-source data, such as aerial images and vehicle trajectories, is an important way to maintain road networks in the field of urban computing. In this paper, we revisit the problem of road extraction and aim to boost its accuracy by solving three significant issues: the insufficient complementarity among multiple sources, rough edges of extracted roads, and many false positives caused by confusing pixels. In particular, we design an end-to-end neural network model to achieve this goal. First, this model leverages two encoding networks to extract relevant information from the inputs of the two sources respectively, and then applies the attention mechanism to fuse them so that the complementary correlation is sufficiently captured. Next, we introduce an auxiliary task, predicting road edges based on fused representations, to make the extracted roads smooth and continuous. Finally, to reduce false positives related to confusing pixels, we propose a pixel-aware contrastive-learning module to distinguish positive (roads) and negative (objects similar to roads) pixels. In addition, to improve the model's learning effectiveness, we propose a model-agnostic transfer learning method, which first builds auxiliary tasks to pre-train the whole model, and then fine-tunes the model's parameters for the main task. Extensive experiments on real datasets verify the superiority of our method as well as the importance of solving the three issues outlined above.
Budget-Conscious Fine-Grained Configuration Optimization for Spatio-Temporal Applications
Keven Richly (Hasso Plattner Institute); Rainer Schlosser (Hasso Plattner Institute); Martin Boissier (Hasso Plattner Institute)
Based on the performance requirements of modern spatio-temporal data mining applications, in-memory database systems are often used to store and process the data. To efficiently utilize the scarce DRAM capacities, modern database systems support various tuning possibilities to reduce the memory footprint (e.g., data compression) or increase performance (e.g., additional indexes). However, the selection of cost and performance balancing configurations is challenging due to the vast number of possible setups consisting of mutually dependent individual decisions. In this paper, we introduce a novel approach to jointly optimize the compression, sorting, indexing, and tiering configuration for spatio-temporal workloads. Further, we consider horizontal data partitioning, which enables the independent application of different tuning options on a fine-grained level. We propose different linear programming (LP) models addressing cost dependencies at different levels of accuracy to compute optimized tuning configurations for a given workload and memory budget. To yield maintainable and robust configurations, we extend our LP-based approach to incorporate reconfiguration costs as well as a worst-case optimization for potential workload scenarios. Further, we demonstrate on a real-world dataset that our models can significantly reduce the memory footprint with equal performance or increase the performance with equal memory size compared to existing tuning heuristics.
Real-time Workload Pattern Analysis for Large-scale Cloud Databases [industry]
Jiaqi Wang (Zhejiang University); Tianyi Li (Aalborg University); Anni Wang (Alibaba); Xiaoze Liu (Purdue University); Lu Chen (Zhejiang University)*; Jie Chen (Alibaba); Jianye Liu (Alibaba Group); Junyang Wu (Zhejiang University); Feifei Li (Alibaba Group); Yunjun Gao (Zhejiang University)
Hosting database services on cloud systems has become a common practice. This has led to the increasing volume of database workloads, which provides the opportunity for pattern analysis. Discovering workload patterns from a business logic perspective is conducive to better understanding the trends and characteristics of the database system. However, existing workload pattern discovery systems are not suitable for large-scale cloud databases which are commonly employed by the industry. This is because the workload patterns of large-scale cloud databases are generally far more complicated than those of ordinary databases.
In this paper, we propose Alibaba Workload Miner (AWM), a real-time system for discovering workload patterns in complicated large-scale workloads. AWM encodes and discovers the SQL query patterns logged from user requests and optimizes query processing based on the discovered patterns. First, the Data Collection & Preprocessing Module collects streaming query logs and encodes them into high-dimensional feature embeddings with rich semantic contexts and execution features. Next, the Online Workload Mining Module separates encoded queries by business group and discovers the workload patterns for each group. Meanwhile, the Offline Training Module collects labels and trains the classification model using the labels. Finally, the Pattern-based Optimizing Module optimizes query processing in cloud databases by exploiting discovered patterns. Extensive experimental results on one synthetic dataset and two real-life datasets (extracted from Alibaba Cloud databases) show that AWM enhances the accuracy of pattern discovery by 66% and reduces the latency of online inference by 22%, compared with the state of the art.
R15
Online Demos I
Chair: Besat Kassaie (University of Waterloo)
DoveDB: A Declarative and Low-Latency Video Database [demo]
Ziyang Xiao (Zhejiang University); Dongxiang Zhang (Zhejiang University)*; Zepeng Li (Zhejiang University); Sai Wu (Zhejiang Univ); Kian-Lee Tan (National University of Singapore); Gang Chen (Zhejiang University)
To improve the usability and efficiency of managing video data generated by large-scale camera deployments, we demonstrate DoveDB, a declarative and low-latency video database. We devise a more comprehensive video query language called VMQL to improve the expressiveness of previous SQL-like languages, which are augmented with functionalities for model-oriented management and deployment. We also propose a light-weight ingestion scheme to extract tracklets of all the moving objects and build semantic indexes to facilitate efficient query processing. For user interaction, we construct a simulation environment with 120 cameras deployed in a road network and demonstrate three interesting scenarios. Using VMQL, users can 1) train a visual model using a SQL-like statement and deploy it on dozens of target cameras simultaneously for online inference; 2) submit multi-object tracking (MOT) requests on target cameras, store the ingested results and build semantic indexes; and 3) issue an aggregation or top-k query on the ingested cameras and obtain the response within milliseconds. A preliminary video introduction of DoveDB is available at https://www.youtube.com/watch?v=N139dEyvAJk
FastMosaic in Action: A New Mosaic Operator for Array DBMSs [demo]
Ramon Antonio Rodriges Zalipynis (HSE University)*
Array DBMSs operate on N-d arrays. During the Data Ingestion phase, the widely used mosaic operator ingests a massive collection of overlapping arrays into a single large array, called a mosaic. The operator can utilize sophisticated statistical and machine learning techniques, e.g. Canonical Correlation Analysis (CCA), to produce a high quality seamless mosaic where the contrasts between the values of cells taken from input overlapping arrays are minimized. However, performance becomes a major bottleneck when applying such advanced techniques over ever-growing array volumes. We introduce a new, scalable way to perform CCA that is orders of magnitude faster than Python’s popular scikit-learn library for the purpose of array mosaicking. Furthermore, we developed a hybrid web-desktop application to showcase our novel FastMosaic operator, based on this new CCA. A rich GUI enables comprehensive investigation of input and output arrays and interactively guides users through an end-to-end mosaic construction on real-world geospatial arrays using FastMosaic, facilitating convenient exploration of the FastMosaic pipeline and its internals.
CORNET: Learning Spreadsheet Formatting Rules By Example [demo]
Mukul Singh (Microsoft)*; José Cambronero Sánchez (Microsoft); Sumit Gulwani (Microsoft Research); Vu Le (Microsoft); Carina Negreanu (Microsoft Research); Gust Verbruggen (Microsoft)
Data management and analysis tasks are often carried out using spreadsheet software. A popular feature in most spreadsheet platforms is the ability to define data-dependent formatting rules. These rules can express actions such as “color red all entries in a column that are negative” or “bold all rows not containing error or failure”. Unfortunately, users who want to exercise this functionality need to manually write these conditional formatting (CF) rules. We introduce CORNET, a system that automatically learns such conditional formatting rules from user examples. CORNET takes inspiration from inductive program synthesis and combines symbolic rule enumeration, based on semi-supervised clustering and iterative decision tree learning, with a neural ranker to produce accurate conditional formatting rules. In this demonstration, we show CORNET in action as a simple add-in to Microsoft’s Excel. After the user provides one or two formatted cells as examples, CORNET generates formatting rule suggestions for the user to apply on the spreadsheet.
ADOps: An Anomaly Detection Pipeline in Structured Logs [demo]
Xintong Song (Netease Fuxi AI Lab)*; Yusen Zhu (NetEase Fuxi AI Lab); Jianfei Wu (Netease Fuxi AI Lab); Bai Liu (Netease Fuxi AI Lab); Hongkang Wei (Netease Fuxi AI Lab)
Anomaly detection has been extensively implemented in industry. In practice, an application may have numerous scenarios where anomalies need to be monitored. However, the complete process of anomaly detection, including data acquisition, data processing, model training, and model deployment, takes considerable time. In particular, some simple scenarios do not require building complex anomaly detection models, so running the full process for them wastes resources. To solve these problems, we build an anomaly detection pipeline (ADOps) that modularizes each step. For simple anomaly detection scenarios, no programming is required and new anomaly detection tasks can be created by simply modifying the configuration file. In addition, it can also improve the development efficiency of complex anomaly detection models. We show how users create anomaly detection tasks on the anomaly detection pipeline and how engineers use it to develop anomaly detection models.
A Demonstration of DLBD: Database Logic Bug Detection System [demo]
Xiu Tang (Zhejiang University); Sai Wu (Zhejiang Univ)*; Dongxiang Zhang (Zhejiang University); Ziyue Wang (Zhejiang University); Gongsheng Yuan (Zhejiang University); Gang Chen (Zhejiang University)
Database management systems (DBMSs) are prone to logic bugs that can result in incorrect query results. Current debugging tools are limited to single table queries and struggle with issues like lack of ground-truth results and repetitive query space exploration. In this paper, we demonstrate DLBD, a system that automatically detects logic bugs in databases. DLBD offers holistic logic bug detection by providing automatic schema and query generation and ground-truth query result retrieval. Additionally, DLBD provides minimal test cases and preliminary root cause analysis for each bug to aid developers in reproducing and fixing detected bugs. DLBD incorporates heuristics and domain-specific knowledge to efficiently prune the search space and employs query space exploration mechanisms to avoid the repetitive search. Finally, DLBD utilizes a distributed processing framework to test database logic bugs in a scalable and efficient manner. Our system offers developers a reliable and effective way to detect and fix logic bugs in DBMSs.
DHive: Query Execution Performance Analysis via Dataflow in Apache Hive [demo]
Chaozu Zhang (Southern University of Science and Technology)*; Qiaomu Shen (Southern University of Science and Technology); Bo Tang (Southern University of Science and Technology)
Apache Hive is now widely used for large-scale data analysis applications in many organizations. Various visual analytical tools have been developed to help Hive users quickly analyze the query execution process and identify the performance bottlenecks of executed queries. However, existing tools mostly focus on showing the time usage of query sub-components (jobs and operators) but fail to provide enough evidence to analyze the root causes of slow execution. To tackle this problem, we develop a visual analytical system, DHive, to visualize and analyze query execution progress via dataflow analysis. DHive shows the dataflow during query execution at multiple levels: query level, job level, and task level, which enables users to identify the key jobs/tasks and explain their time usage by linking them to auxiliary information such as the system configuration and hardware status. We demonstrate the effectiveness of DHive with two cases in a production cluster. DHive is open-source at https://github.com/DBGroup-SUSTech/DHive.git.
Lingua Manga: A Generic Large Language Model Centric System for Data Curation [demo]
Zui Chen (Tsinghua University)*; Lei Cao (University of Arizona/MIT); Samuel Madden (Massachusetts Institute of Technology)
Data curation is a wide-ranging area that contains many critical but time-consuming data processing tasks. The diversity of data curation tasks makes it hard to build a general-purpose data curation system. To address this issue, we present Lingua Manga, a generic and user-friendly system that leverages pre-trained large language models. Lingua Manga is designed to enable flexible and swift development with automatic optimization to attain high performance and label efficiency. Through three example applications with different objectives and involving different types of users, we demonstrate that Lingua Manga can effectively assist both skilled programmers and low-code or even no-code users in solving data curation problems.
Interactive Demonstration of EVA [demo]
Gaurav Tarlok Kakkar (Georgia Institute of Technology)*; Aryan Rajoria (Georgia Institute of Technology); Myna Prasanna Kalluraya (Georgia Institute of Technology); Ashmita Raju (Georgia Institute of Technology); Jiashen Cao (Georgia Tech); Kexin Rong (Georgia Institute of Technology); Joy Arulraj (Georgia Tech)
In this demonstration, we will present EVA, an end-to-end AI Relational database management system. We will demonstrate the capabilities and utility of EVA using three usage scenarios: (1) EVA serves as a backend for an exploratory video analytics interface developed using Streamlit and React, (2) EVA seamlessly integrates with the Python and Data Science ecosystems by allowing users to access EVA in a Python notebook alongside other popular libraries such as Pandas and Matplotlib, and (3) EVA facilitates bulk labeling with Label Studio, a widely-used labeling framework. By optimizing complex vision queries, we illustrate how EVA allows a wide range of application developers to harness the recent advances in computer vision.
R20
Indexing and Learned Indexing
Chair: Zheng Wang (Huawei Singapore Research Center)
PLIN: A Persistent Learned Index for Non-Volatile Memory with High Performance and Instant Recovery
Zhou Zhang (USTC); Zhaole Chu (University of Science and Technology of China); Peiquan Jin (University of Science and Technology of China)*; Yongping Luo (University of Science and Technology of China); Xike Xie (University of Science and Technology of China); Shouhong Wan (University of Science and Technology of China); Yun Luo (Tencent); Xufei Wu (Tencent); Peng Zou (Tencent); Chunyang Zheng (Intel); Guoan Wu (Intel); Andy Rudoff (Intel)
Non-Volatile Memory (NVM) has emerged as an alternative to next-generation main memories. Although many tree indices have been proposed for NVM, they generally use B+-tree-like structures. To further improve the performance of NVM-aware indices, we consider integrating learned indexes into NVM. The challenges of such an integration are twofold: (1) existing NVM indices rely on small nodes to accelerate insertions with crash consistency, but learned indices use huge nodes to obtain a flat structure; (2) the node structure of learned indices is not NVM friendly, meaning that accessing a learned node will cause multiple NVM block misses. Thus, in this paper, we propose a new persistent learned index called PLIN. The novelty of PLIN lies in four aspects: an NVM-aware data placement strategy, locally unordered and globally ordered leaf nodes, a model copy mechanism, and a hierarchical insertion strategy. In addition, PLIN is proposed for the NVM-only architecture, which can support instant recovery. We also present optimistic concurrency control and fine-grained locking mechanisms to make PLIN scalable to concurrent requests. We conduct experiments on real persistent memory with various workloads and compare PLIN with APEX, PACtree, ROART, TLBtree, and Fast&Fair. The results show that PLIN achieves 2.08x higher insertion performance and 4.42x higher query performance than its competitors on average. Meanwhile, PLIN only needs ~30 μs to recover from a system crash.
Sieve: A Learned Data-Skipping Index for Data Analytics
Yulai Tong (Huazhong University of Science and Technology)*; Jiazhen Liu (Huazhong University of Science and Technology); Hua Wang (Huazhong University of Science and Technology); Ke Zhou (Huazhong University of Science and Technology); Rongfeng He (Huawei Cloud Computing Technologies Co., Ltd); Qin Zhang (Huawei Cloud); Cheng Wang (Huawei Cloud Computing Technologies Co., Ltd)
Modern data analytics services are coupled with external data storage services, making I/O from remote cloud storage one of the dominant costs for query processing. Techniques such as columnar block-based data organization and compression have become standard practices for these services to save storage and processing cost. However, the problem of effectively skipping irrelevant blocks at low overhead is still open. Existing data-skipping efforts maintain lightweight summaries (e.g., min/max, histograms) for each block to filter irrelevant data. However, such techniques ignore patterns in real-world data, which leads to ineffective use of the storage budget and may cause serious false positives.
This paper presents Sieve, a learning-enhanced index designed to efficiently filter out irrelevant blocks by capturing data patterns. Specifically, Sieve utilizes piece-wise linear functions to capture block distribution trends over the key space. Based on the captured trends, Sieve trades off storage consumption and false positives by grouping neighboring keys with similar block distributions into a single region. We have evaluated Sieve using Presto, and experiments on real-world datasets demonstrate that Sieve achieves up to 80% reduction in blocks accessed and 42% reduction in query times compared to its counterparts.
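For context, the per-block min/max summaries that existing data-skipping efforts maintain can be sketched as follows. This is a minimal illustration of that baseline, not of Sieve itself; the names `BlockSummary`, `build_summaries`, and `candidate_blocks` and the sample data are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class BlockSummary:
    """Lightweight min/max summary kept for one data block (hypothetical type)."""
    block_id: int
    min_key: int
    max_key: int


def build_summaries(blocks: Sequence[Sequence[int]]) -> List[BlockSummary]:
    """Compute one min/max summary per non-empty block."""
    return [BlockSummary(i, min(b), max(b)) for i, b in enumerate(blocks) if b]


def candidate_blocks(summaries: List[BlockSummary], lo: int, hi: int) -> List[int]:
    """Return ids of blocks whose [min, max] range overlaps the query range [lo, hi]."""
    return [s.block_id for s in summaries if s.max_key >= lo and s.min_key <= hi]


# Illustrative data: blocks 1 and 2 span wide key ranges but hold few keys.
blocks = [[1, 4, 9], [10, 50, 12], [3, 100]]
summaries = build_summaries(blocks)

# Neither block 1 nor block 2 contains a key in [20, 30], yet both must be read:
# these are the false positives that min/max summaries cannot rule out.
to_read = candidate_blocks(summaries, 20, 30)
```

In this sketch only block 0 is skipped for the range [20, 30], which shows why summaries that ignore the distribution of keys inside a block can filter poorly.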
LMSFC: A Novel Multidimensional Index based on Learned Monotonic Space Filling Curves
Jian Gao (University of New South Wales)*; Xin Cao (University of New South Wales); Xin Yao (Huawei Theory Lab); Gong Zhang (Huawei); Wei Wang (Hong Kong University of Science and Technology (Guangzhou))
The recently proposed learned indexes have attracted much attention as they can adapt to the actual data and query distributions to attain better search efficiency. Based on this technique, several existing works build indexes for multi-dimensional data and achieve improved query performance. A common paradigm of these works is to (i) map multi-dimensional data points to a one-dimensional space using a fixed space-filling curve (SFC) or its variant and (ii) then apply the learned indexing techniques. We notice that the first step typically uses a fixed SFC method, such as row-major order or z-order. This limits the potential of learned multi-dimensional indexes to adapt to varying data distributions and query workloads.
In this paper, we propose a novel idea of learning a space-filling curve that is carefully designed and actively optimized for efficient query processing. We also identify innovative offline and online optimization opportunities common to SFC-based learned indexes and offer optimal and/or heuristic solutions. Experimental results demonstrate that our proposed method, LMSFC, outperforms state-of-the-art non-learned or learned methods across three commonly used real-world datasets and diverse experimental settings.
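The fixed z-order mapping mentioned above can be sketched as follows. This is a minimal illustration of a fixed space-filling curve for two-dimensional integer points, not the learned curve proposed by LMSFC; the function name `z_order_key` and the 16-bit coordinate width are assumptions made for the example.

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one z-order (Morton) key."""
    limit = 1 << bits
    if not (0 <= x < limit and 0 <= y < limit):
        raise ValueError("coordinates must be non-negative and fit in `bits` bits")
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return key


# Map 2-D points to 1-D keys, then sort; a one-dimensional (learned) index
# could then be built over the sorted keys.
points = [(3, 5), (0, 0), (1, 1), (2, 0)]
ordered = sorted(points, key=lambda p: z_order_key(*p))
```

Because the bit interleaving is fixed in advance, the resulting order is the same for every dataset and workload, which is the limitation the abstract points out.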
FILM: a Fully Learned Index for Larger-than-Memory Databases
Chaohong Ma (Renmin University of China)*; Xiaohui Yu (York University); Yifan Li (York University); Xiaofeng Meng (Renmin University of China); Aishan Maoliniyazi (Renmin University)
As modern applications generate data at an unprecedented speed and often require the querying/analysis of data spanning a large duration, it is crucial to develop indexing techniques that cater to larger-than-memory databases, where data reside on heterogeneous storage devices (such as memory and disk), and support fast data insertion and query processing. In this paper, we propose FILM, a Fully learned Index for Larger-than-Memory databases. FILM is a learned tree structure that uses simple approximation models to index data spanning different storage devices. Compared with existing techniques for larger-than-memory databases, such as anti-caching, FILM allows for more efficient query processing at significantly lower main-memory overhead. FILM is also designed to effectively address one of the bottlenecks in existing methods for indexing larger-than-memory databases, which is caused by data swapping between memory and disk. More specifically, updating the LRU (Least Recently Used) structure employed by existing methods for cold data identification (determining the data to be evicted to disk when the available memory runs out) often incurs significant delay to query processing. FILM takes a drastically different approach by proposing an adaptive LRU structure and piggybacking its update onto query processing with minimal overhead. We thoroughly study the performance of FILM and its components on a variety of datasets and workloads, and the experimental results demonstrate its superiority in improving query processing performance and reducing index storage overhead (by orders of magnitude) compared with applicable baselines.
Learned Index: A Comprehensive Experimental Evaluation [eab]
Zhaoyan Sun (Tsinghua University); Xuanhe Zhou (Tsinghua University); Guoliang Li (Tsinghua University)*
Indexes can improve query-processing performance by avoiding full table scans. Although traditional indexes (e.g., B+-tree) have been widely used, learned indexes have been proposed that adopt machine learning models to reduce query latency and index size. However, existing learned indexes are (1) not thoroughly evaluated under the same experimental framework and (2) not comprehensively compared under different settings (e.g., key lookup, key insert, concurrent operations, bulk loading). Moreover, it is hard for practitioners to select appropriate learned indexes in different settings. To address these problems, this paper reviews existing learned indexes in detail and discusses the design choices of key components in learned indexes, including key lookup (position inference, which predicts the position of a key, and position refinement, which re-searches the position if the predicted position is incorrect), key insert, concurrency, and bulk loading. Moreover, we provide a testbed to help researchers design and test new learned indexes. We compare state-of-the-art learned indexes in the same experimental framework, and provide findings to select suitable learned indexes under various practical scenarios.
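The two lookup steps named above, position inference and position refinement, can be sketched as follows. This is a minimal illustration assuming a single linear model over a sorted array of distinct numeric keys; the class `LinearLearnedIndex` and its methods are hypothetical and do not correspond to any specific index evaluated in the paper.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index: one linear model plus a bounded local search."""

    def __init__(self, keys):
        self.keys = sorted(set(keys))
        if not self.keys:
            raise ValueError("at least one key is required")
        n = len(self.keys)
        lo, hi = self.keys[0], self.keys[-1]
        # Fit a line through the first and last key (a deliberately simple model).
        self.slope = (n - 1) / (hi - lo) if hi != lo else 0.0
        self.intercept = -self.slope * lo
        # Record the largest prediction error so refinement can bound its search.
        self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        """Position inference: predict where `key` sits in the sorted array."""
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Position refinement: binary-search only within the error window."""
        pred = self._predict(key)
        lo = max(pred - self.max_err, 0)
        hi = min(pred + self.max_err + 1, len(self.keys))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None


index = LinearLearnedIndex([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
position = index.lookup(11)
```

The better the model fits the key distribution, the smaller `max_err` becomes and the less work refinement has to do, which is one of the design trade-offs such an evaluation compares.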
R21
Learning and Systems
Chair: Yuxin Tang (Rice University)
CORNET: Learning Table Formatting Rules By Example
Mukul Singh (Microsoft)*; José Cambronero Sánchez (Microsoft); Sumit Gulwani (Microsoft Research); Vu Le (Microsoft); Carina Negreanu (Microsoft Research); Mohammad Raza (Microsoft); Gust Verbruggen (Microsoft)
Spreadsheets are widely used for table manipulation and presentation. Stylistic formatting of these tables is an important property for presentation and analysis. As a result, popular spreadsheet software, such as Excel, supports automatically formatting tables based on rules. Unfortunately, writing such formatting rules can be challenging for users as it requires knowledge of the underlying rule language and data logic. We present CORNET, a system that tackles the novel problem of automatically learning such formatting rules from user-provided formatted cells. CORNET takes inspiration from advances in inductive programming and combines symbolic rule enumeration with a neural ranker to learn conditional formatting rules. To motivate and evaluate our approach, we extracted tables with over 450K unique formatting rules from a corpus of over 1.8M real worksheets. Since we are the first to introduce the task of automatically learning conditional formatting rules, we compare CORNET to a wide range of symbolic and neural baselines adapted from related domains. Our results show that CORNET accurately learns rules across varying setups. Additionally, we show that in some cases CORNET can find rules that are shorter than those written by users and can also discover rules in spreadsheets that users have manually formatted. Furthermore, we present two case studies investigating the generality of our approach by extending CORNET to related data tasks (e.g., filtering) and generalizing to conditional formatting over multiple columns.
Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures
Yongji Wu (Duke University)*; Matthew Lentz (Duke University); Danyang Zhuo (Duke University); Yao Lu (Microsoft Research)
With the advent of ubiquitous deployment of smart devices and the Internet of Things, data sources for machine learning inference have increasingly moved to the edge of the network. Existing machine learning inference platforms typically assume a homogeneous infrastructure and do not take into account the more complex and tiered computing infrastructure that includes edge devices, local hubs, edge datacenters, and cloud datacenters. On the other hand, recent AutoML efforts have provided viable solutions for model compression, pruning and quantization for heterogeneous environments; for a machine learning model, now we may easily find or even generate a series of model variants with different tradeoffs between accuracy and efficiency.
We design and implement JellyBean, a system for serving and optimizing machine learning inference workflows on heterogeneous infrastructures. Given service-level objectives (e.g., throughput, accuracy), JellyBean picks the most cost-efficient models that meet the accuracy target and decides how to deploy them across different tiers of infrastructures. Evaluations show that JellyBean reduces the total serving cost of visual question answering by up to 58% and of vehicle tracking from the NVIDIA AI City Challenge by up to 36%, compared with state-of-the-art model selection and worker assignment solutions. JellyBean also outperforms prior ML serving systems (e.g., Spark on the cloud) by up to 5x in serving costs.
LOGER: A Learned Optimizer towards Generating Efficient and Robust Query Execution Plans
Tianyi Chen (Key Laboratory of High Confidence Software Technologies, CS, Peking University); Jun Gao (Peking University)*; Hedui Chen (ZTE Corporation); Yaofeng Tu (ZTE Corporation)
Query optimization based on deep reinforcement learning (DRL) has become a hot research topic recently. Despite promising progress, DRL optimizers still face great challenges in robustly producing efficient plans, due to the vast search space for both join order and operator selection and the highly varying execution latency taken as the feedback signal. In this paper, we propose LOGER, a learned optimizer towards generating efficient and robust plans, aiming at producing both efficient join orders and operators. LOGER first utilizes a Graph Transformer to capture relationships between tables and predicates. Then, the search space is reorganized: LOGER learns to restrict specific operators instead of directly selecting one for each join, while utilizing the DBMS's built-in optimizer to select physical operators under the restrictions. Such a strategy exploits expert knowledge to improve the robustness of plan generation while offering sufficient plan search flexibility. Furthermore, LOGER introduces ε-beam search, which keeps multiple search paths that preserve promising plans while performing guided exploration. Finally, LOGER introduces a loss function with reward weighting to further enhance performance robustness by reducing the fluctuation caused by poor operators, and a log transformation to compress the range of rewards. We conduct experiments on Join Order Benchmark (JOB), TPC-DS and Stack Overflow, and demonstrate that LOGER achieves better performance than existing learned query optimizers, with a 2.07x speedup on JOB compared with PostgreSQL.
Falcon: A Privacy-Preserving and Interpretable Vertical Federated Learning System
Yuncheng Wu (National University of Singapore); Naili Xing (National University of Singapore); Gang Chen (Zhejiang University); Tien Tuan Anh Dinh (Deakin University); Zhaojing Luo (National University of Singapore); Beng Chin Ooi (NUS)*; Xiaokui Xiao (National University of Singapore); Meihui Zhang (Beijing Institute of Technology)
Federated learning (FL) enables multiple data owners to collaboratively train machine learning (ML) models without disclosing their raw data. In the vertical federated learning (VFL) setting, the collaborating parties have data from the same set of users but with disjoint attributes. After constructing the VFL models, the parties deploy the models in production systems to serve prediction requests. In practice, the prediction output itself may not be convincing enough for the parties' users to make decisions, especially in high-stakes applications. Model interpretability is therefore essential to provide meaningful insights and better comprehension of the prediction output.
In this paper, we propose Falcon, a novel privacy-preserving and interpretable VFL system. First, Falcon supports VFL training and prediction with strong and efficient privacy protection for a wide range of ML models, including linear regression, logistic regression, and multi-layer perceptron. The protection is achieved by a hybrid strategy of threshold partially homomorphic encryption (PHE) and an additive secret sharing scheme (SSS), ensuring no intermediate information disclosure. Second, Falcon facilitates understanding of VFL model predictions by a flexible and privacy-preserving interpretability framework, which enables the implementation of state-of-the-art interpretable methods in a decentralized setting. Third, Falcon supports efficient data parallelism of VFL tasks and optimizes the parallelism factors to reduce the overall execution time. Falcon is fully implemented, and we conduct extensive experiments on it using six real-world and multiple synthetic datasets. The results demonstrate that Falcon achieves comparable accuracy to non-private algorithms and outperforms three secure baselines in terms of efficiency.
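The additive secret sharing primitive named above can be sketched as follows. This is a generic textbook illustration under stated assumptions, not Falcon's protocol: it omits the threshold PHE half of the hybrid strategy, and the modulus and the function names `share`, `reconstruct`, and `add_shared` are chosen only for the example.

```python
import secrets

# Illustrative modulus (an assumption for this sketch, not a Falcon parameter).
MODULUS = 2**61 - 1


def share(value: int, n_parties: int, modulus: int = MODULUS) -> list:
    """Split `value` into n additive shares that sum to it modulo `modulus`."""
    if n_parties < 2:
        raise ValueError("need at least two parties")
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    # The last share is chosen so that all shares add up to the secret.
    shares.append((value - sum(shares)) % modulus)
    return shares


def reconstruct(shares: list, modulus: int = MODULUS) -> int:
    """Recover the secret by summing all shares."""
    return sum(shares) % modulus


def add_shared(a_shares: list, b_shares: list, modulus: int = MODULUS) -> list:
    """Each party adds its own two shares locally; no communication is needed."""
    return [(a + b) % modulus for a, b in zip(a_shares, b_shares)]


a = share(42, 3)
b = share(100, 3)
total = reconstruct(add_shared(a, b))
```

Any single share is a uniformly random value, so one party learns nothing about the secret on its own, while sums of secrets can still be computed share by share.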
Cost-Based or Learning-Based? A Hybrid Query Optimizer for Query Plan Selection
Xiang Yu (Tsinghua University); Chengliang Chai (Beijing Institute of Technology); Guoliang Li (Tsinghua University); Jiabin Liu (Tsinghua University)
Traditional cost-based optimizers are efficient and stable at generating optimal plans for simple SQL queries, but they may not generate high-quality plans for complicated queries. Thus, learning-based optimizers have recently been proposed that can learn high-quality plans based on past experiences. However, learning-based optimizers do not work well for dynamic workloads whose distributions differ from those of the training examples. In this paper, we propose a hybrid optimizer that adopts the advantages and avoids the shortcomings of these two types of optimizers: it first generates high-quality candidate plans from each type of optimizer and then selects the best plan from the candidates. There are two challenges. (1) How to generate high-quality candidates? We propose a hint-based candidate generation method that leverages the learning-based method to generate highly beneficial hints and then uses a cost-based method to supplement the hints to generate complete plans as candidates. (2) How to evaluate different candidate plans and select the best one? We propose an uncertainty-based optimal plan selection model, which predicts the execution time and the uncertainty for each plan. The uncertainty reflects the confidence of the execution time prediction. We select the plan using the uncertainty model. Experimental results on real datasets show that our method outperforms the state-of-the-art baselines, and reduces the total latency by 25% and the tail latency by 65% compared to PostgreSQL.
R22
Parallelization and Analytics
Chair: Elisa Bertino (Purdue University)
SyncSignature: A Simple, Efficient, Parallelizable Framework for Tree Similarity Joins
Nikolai Karpov (Indiana University Bloomington); Qin Zhang (Indiana University Bloomington)*
This paper introduces SyncSignature, the first fully parallelizable algorithmic framework for tree similarity joins under edit distance. SyncSignature makes use of implicit-synchronized signature generation schemes, which allow for an efficient and parallelizable candidate-generation procedure via hash join. Our experiments on large real-world datasets show that the proposed algorithms under the SyncSignature framework significantly outperform the state-of-the-art algorithm in the parallel computation environment. For datasets with big trees, they also exceed the state-of-the-art algorithms by a notable margin in the centralized/single-thread computation environment. To complement and guide the experimental study, we also provide a thorough theoretical analysis for all proposed signature generation schemes.
Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism [sds]
Xupeng Miao (Carnegie Mellon University)*; Yujie Wang (Peking University); Youhe Jiang (Peking University); Chunan Shi (Peking University); Xiaonan Nie (Peking University); Hailin Zhang (Peking University); Bin Cui (Peking University)
Transformer models have achieved state-of-the-art performance in various application domains and have gradually become the foundation of advanced large deep learning (DL) models. However, training these models efficiently over multiple GPUs is still challenging due to the large number of parallelism choices. Existing DL systems either rely on manual efforts to make distributed training plans or apply parallelism combinations within a very limited search space. In this work, we propose Galvatron, a new system framework that incorporates multiple popular parallelism dimensions and automatically finds the most efficient hybrid parallelism strategy. To better explore such an exceptionally large search space, we 1) use a decision tree to perform decomposition and pruning based on some reasonable intuitions, and then 2) design a dynamic programming search algorithm to generate the optimal plan. Evaluations on four representative Transformer workloads show that Galvatron can automatically perform distributed training under different GPU memory budgets. Among all evaluated scenarios, Galvatron always achieves superior system throughput compared to previous work with limited parallelism.
SDPipe: A Semi-Decentralized Framework for Heterogeneity-aware Pipeline-parallel Training [sds]
Xupeng Miao (Carnegie Mellon University)*; Yining Shi (Peking University); Zhi Yang (Peking University); Bin Cui (Peking University); Zhihao Jia (Carnegie Mellon University)
The increasing size of both deep learning models and training data necessitates the ability to scale out model training through pipeline-parallel training, which combines pipelined model parallelism and data parallelism. However, most existing approaches assume an ideal homogeneous dedicated cluster. On real cloud clusters, these approaches suffer from intensive model synchronization overheads due to dynamic environment heterogeneity. This challenge leaves the design in a dilemma: either the performance bottleneck of the central parameter server (PS) or severe performance degradation caused by stragglers in decentralized synchronization (like All-Reduce).
This work presents SDPipe, a new semi-decentralized framework that gets the best of both worlds, achieving both high heterogeneity tolerance and convergence efficiency in pipeline-parallel training. To provide high performance, we decentralize the communication model synchronization, which accounts for the largest proportion of synchronization overhead. In contrast, we centralize the process of group scheduling, which is lightweight but needs a global view for better performance and convergence speed against heterogeneity. SDPipe also achieves its full performance potential through adaptive group scheduling and sync-graph connectivity enforcement, and guarantees fast model propagation and convergence. We show via a prototype implementation the significant performance and scalability advantages of SDPipe across different environments.
FLARE: A Fast, Secure, and Memory-Efficient Distributed Analytics Framework
Xiang Li (Tsinghua University)*; Fabing Li (Xi'an Jiaotong University); Mingyu Gao (Tsinghua University)
As big data processing in the cloud becomes prevalent today, data privacy on such public platforms raises critical concerns. Hardware-based trusted execution environments (TEEs) provide promising and practical platforms for low-cost privacy-preserving data processing. However, using TEEs to enhance the security of data analytics frameworks like Apache Spark involves challenging issues when separating various framework components into trusted and untrusted domains, demanding meticulous considerations for programmability, performance, and security.
Based on Intel SGX, we build FLARE, a fast, secure, and memory-efficient data analytics framework with a familiar user programming interface and useful functionalities similar to Apache Spark. FLARE ensures confidentiality and integrity by keeping sensitive data and computations encrypted and authenticated. It also supports oblivious processing to protect against access pattern side channels. The main innovations of FLARE include a novel abstraction paradigm of shadow operators and shadow tasks to minimize trusted components and reduce domain switch overheads, memory-efficient data processing with proper granularities for different operators, and adaptive parallelization based on memory allocation intensity for better scalability. FLARE outperforms the state-of-the-art secure framework by 3.0X to 176.1X, and is also 2.8X to 28.3X faster than a monolithic libOS-based integration approach.
SODA: A Set of Fast Oblivious Algorithms in Distributed Secure Data Analytics
Xiang Li (Tsinghua University)*; Nuozhou Sun (Tsinghua University); Yunqian Luo (Tsinghua University); Mingyu Gao (Tsinghua University)
Cloud systems are now a prevalent platform for hosting large-scale big-data analytics applications such as machine learning and relational databases. However, data privacy remains a critical concern for public cloud systems. Existing trusted hardware can provide an isolated execution domain on an untrusted platform, but it also suffers from access-pattern-based side channels at various levels including memory, disks, and networking. Oblivious algorithms can address these vulnerabilities by hiding the program's data access patterns. Unfortunately, current oblivious algorithms for data analytics are limited to single-machine execution, only support simple operations, and/or suffer from significant performance overheads due to the use of expensive global sort and excessive data padding.
In this work, we propose SODA, a set of efficient and oblivious algorithms for distributed data analytics operators, including filter, aggregate, and binary equi-join. To improve performance, SODA completely avoids the expensive oblivious global sort primitive, and minimizes the data padding overheads. SODA makes use of low-cost (pseudo-)random communication instead of expensive global sort to ensure uniform data traffic in oblivious filter and aggregate. It also adopts a novel two-level bin-packing approach in oblivious join to alleviate both input redistribution and join product skewness, thus minimizing necessary data padding. Compared to the state-of-the-art system, SODA not only extends the functionality but also improves the performance. It achieves 1.1X to 14.6X speedups on complex multi-operator data analytics workloads.
R23
Trust, Security, Verifiability
Chair: Dimitrios Melissourgos (Grand Valley State University)
Frequency-revealing attacks against Frequency-hiding Order-preserving Encryption
Xinle Cao (Zhejiang University); Jian Liu (Zhejiang University)*; Yongsheng Shen (Hang Zhou City Brain Co., Ltd); Xiaohua Ye (Hang Zhou City Brain Co., Ltd); Kui Ren (Zhejiang University)
Order-preserving encryption (OPE) allows efficient comparison operations over encrypted data and thus is popular in encrypted databases. However, most existing OPE schemes are vulnerable to inference attacks as they leak plaintext frequency. To this end, some frequency-hiding order-preserving encryption (FH-OPE) schemes are proposed and claim to prevent the leakage of frequency. FH-OPE schemes are considered an important step towards mitigating inference attacks.
Unfortunately, there are still vulnerabilities in all existing FH-OPE schemes. In this work, we revisit the security of all existing FH-OPE schemes. We are the first to demonstrate that the plaintext frequency hidden by them is recoverable. We present three ciphertext-only attacks, named frequency-revealing attacks, to recover plaintext frequency. We evaluate our attacks on three real-world datasets. They recover over 90% of the plaintext frequency hidden by any existing FH-OPE scheme. With frequency revealed, we also show the potential for applying inference attacks to existing FH-OPE schemes.
Our findings highlight the limitations of current FH-OPE schemes. We demonstrate that achieving frequency-hiding requires addressing the leakages of both non-uniform ciphertext distribution and insertion orders of ciphertexts, even though the leakage of insertion orders is often ignored in OPE.
Range Search over Encrypted Multi-Attribute Data
Francesca Falzon (Brown University)*; Evangelia Anna Markatou (Brown University); Zachary T Espiritu (Brown University); Roberto Tamassia (Brown University)
This work addresses expressive queries over encrypted data by presenting the first systematic study of multi-attribute range search on a symmetrically encrypted database outsourced to an honest-but-curious server. Prior work includes a thorough analysis of single-attribute range search schemes (e.g. Demertzis et al. 2016) and a proposed high-level approach for multi-attribute schemes (De Capitani di Vimercati et al. 2021). We first introduce a flexible framework for building secure range search schemes over multiple attributes (dimensions) by adapting a broad class of geometric search data structures to operate on encrypted data. Our framework encompasses widely used data structures such as multi-dimensional range trees and quadtrees, and has strong security properties that we formally prove. We then develop six concrete highly parallelizable range search schemes within our framework that offer a sliding scale of efficiency and security tradeoffs to suit the needs of the application. We evaluate our schemes with a formal complexity and security analysis, a prototype implementation, and an experimental evaluation on real-world datasets.
R24
Queries and Systems II
Chair: Goce Trajcevski (Iowa State University)
SEIDEN: Revisiting Query Processing in Video Database Systems
Jaeho Bang (Georgia Institute of Technology); Gaurav Tarlok Kakkar (Georgia Institute of Technology)*; Pramod Chunduri (Georgia Institute of Technology); Subrata Mitra (Adobe Research); Joy Arulraj (Georgia Tech)
State-of-the-art video database management systems (VDBMSs) often use lightweight proxy models to accelerate object retrieval and aggregate queries. The key assumption underlying these systems is that the proxy model is an order of magnitude faster than the heavyweight oracle model. However, recent advances in computer vision have invalidated this assumption. Inference time of recently proposed oracle models is on par with or even lower than the proxy models used in state-of-the-art (SoTA) VDBMSs. This paper presents Seiden, a VDBMS that leverages this radical shift in the runtime gap between the oracle and proxy models. Instead of relying on a proxy model, Seiden directly applies the oracle model over a subset of frames to build a query-agnostic index, and samples additional frames to answer the query using an exploration-exploitation scheme during query processing. By leveraging the temporal continuity of the video and the output of the oracle model on the sampled frames, Seiden delivers faster query processing and better query accuracy than SoTA VDBMSs. Our empirical evaluation shows that Seiden is on average 6.6 × faster than SoTA VDBMSs across diverse queries and datasets.
Efficient Black-box Checking of Snapshot Isolation in Databases
Kaile Huang (Nanjing University); Si Liu (ETH Zurich); Zhenge Chen (Nanjing University); Hengfeng Wei (Nanjing University)*; David A Basin (ETH Zurich); Haixiang Li (Tencent, China); Anqun Pan (Tencent, China)
Snapshot isolation (SI) is a prevalent weak isolation level that avoids the performance penalty imposed by serializability and simultaneously prevents various undesired data anomalies. Nevertheless, SI anomalies have recently been found in production cloud databases that claim to provide the SI guarantee. Given the complex and often unavailable internals of such databases, a black-box SI checker is highly desirable.
In this paper we present PolySI, a black-box checker that efficiently checks SI and provides understandable counterexamples upon detecting violations. PolySI builds on a characterization of SI using generalized polygraphs (GPs), for which we establish its soundness and completeness. PolySI employs an SMT solver and also accelerates SMT solving by utilizing a compact constraint encoding of GPs and domain-specific optimizations for pruning constraints. As our extensive assessment demonstrates, PolySI successfully reproduces all of 2477 known SI anomalies, detects novel SI violations in three production cloud databases, identifies their causes, outperforms the state-of-the-art black-box checkers under a wide range of workloads, and can scale up to large workloads.
Async-fork: Mitigating Query Latency Spikes Incurred by the Fork-based Snapshot Mechanism from the OS Level
Pu Pang (Shanghai Jiao Tong University); Gang Deng (Alibaba Cloud); Kaihao Bai (Shanghai Jiao Tong University); Quan Chen (Shanghai Jiao Tong University)*; Shixuan Sun (Shanghai Jiao Tong University); Bo Liu (Shanghai Jiao Tong University); Yu Xu (Alibaba Cloud); Hongbo Yao (Alibaba Cloud); Zhengheng Wang (Alibaba Group); Xiyu Wang (Alibaba Group); Zheng Liu (Alibaba Group); Zhuo Song (Alibaba Cloud); Yong Yang (Alibaba Cloud); Tao Ma (Alibaba Cloud); Minyi Guo (Shanghai Jiao Tong University)
In-memory key-value stores (IMKVSes) serve many online applications. They generally adopt the fork-based snapshot mechanism to support data backup. However, this method can result in query latency spikes because the engine is out-of-service for queries during the snapshot. In contrast to existing research optimizing snapshot algorithms, we address the problem from the operating system (OS) level, while keeping the data persistence mechanism in IMKVSes unchanged. Specifically, we first study the impact of the fork operation on query latency. Based on the findings of the study, we propose Async-fork, which performs the fork operation asynchronously to reduce the out-of-service time of the engine. Async-fork is implemented in the Linux kernel and deployed into the online Redis database in public clouds. Our experimental results show that Async-fork can significantly reduce the tail latency of queries during the snapshot.
PetPS: Supporting Huge Embedding Models with Persistent Memory [sds]
Minhui Xie (Tsinghua University)*; Youyou Lu (luyouyou@tsinghua.edu.cn); Qing Wang (Tsinghua University); Yangyang Feng (Tsinghua University); Jiaqiang Liu (Kuaishou); Kai Ren (Kuaishou Technology); Jiwu Shu (shujw@tsinghua.edu.cn)
Embedding models are effective for learning high-dimensional sparse data. Traditionally, they are deployed in DRAM parameter servers (PS) for online inference access. However, the ever-increasing model capacity makes this practice suffer from both high storage costs and long recovery time. Rapidly developing Persistent Memory (PM) offers new opportunities to PSs owing to its large capacity at low cost, as well as its persistence, while the application of PM also faces two challenges: high read latency and heavy CPU burden. To provide a low-cost but still high-performance parameter service for online inference, we introduce PetPS, the first production-deployed PM parameter server. (1) To cope with high PM latency, PetPS introduces a PM hash index tailored for embedding model workloads, to minimize PM access. (2) To alleviate the CPU burden, PetPS offloads parameter gathering to NICs, to avoid CPU stalls when accessing parameters on PM and thus improve CPU efficiency. Our evaluation shows that PetPS can boost throughput by 1.3−1.7× compared to PSs that use state-of-the-art PM hash indexes, or achieve a 2.9−5.5× latency reduction at the same throughput. Since 2020, PetPS has been deployed at Kuaishou, a world-leading short-video company, and has successfully reduced TCO by 30% without performance degradation.
MagicScaler: Uncertainty-aware, Predictive Autoscaling [industry]
Zhicheng Pan (East China Normal University); Yihang Wang (Alibaba Group); Yingying Zhang (Alibaba Group); Sean Bin Yang (Aalborg University); Yunyao Cheng (Aalborg University); Peng Chen (East China Normal University); Chenjuan Guo (ECNU); Qingsong Wen (Alibaba Group U.S.); Xiduo Tian (Alibaba Group); Yunliang Dou (Alibaba Group); Zhiqiang Zhou (Alibaba Damo Academy); Chengcheng Yang (East China Normal University); Aoying Zhou (East China Normal University); Bin Yang (East China Normal University)*
Predictive autoscaling is a key enabler for optimizing cloud resource allocation in Alibaba Cloud's Elastic Compute Service (ECS), which dynamically adjusts the ECS instances based on predicted user demands to ensure Quality of Service (QoS). However, user demands in public cloud, such as Alibaba Cloud, are often highly complex, with high uncertainty and scale-sensitive temporal dependencies, thus posing great challenges to accurate prediction of future demands. These in turn make autoscaling challenging---autoscaling needs to properly account for demand uncertainty while maintaining a reasonable trade-off between two contradictory factors, i.e., low instance running costs vs. low QoS violation risks.
To address the above challenges, we propose a novel predictive autoscaling framework MagicScaler, consisting of a Multi-scale attentive gaussian process based predictor and an uncertainty-aware scaler. First, the predictor carefully bridges the best of two successful prediction methodologies---multi-scale attention mechanisms, which are good at capturing complex, multi-scale features, and stochastic process regression, which is able to quantify prediction uncertainty, thus achieving accurate demand prediction with quantified uncertainty levels. Second, the scaler takes the quantified future demand uncertainty into a judiciously designed loss function with stochastic constraints, enabling flexible trade-off between running costs and QoS violation risks. Extensive experiments on three clusters of Alibaba Cloud in different Chinese cities demonstrate the effectiveness and efficiency of MagicScaler, which outperforms other commonly adopted scalers, thus justifying our design choices.
R25
Search and Aggregation
Chair: Xiang Lian (Kent State University)
Fast Approximate Denial Constraint Discovery
Renjie Xiao (Fudan University); Zijing Tan (Fudan University)*; Haojin Wang (Fudan University); Shuai Ma (Beihang University)
We investigate the problem of discovering approximate denial constraints (DCs), for finding DCs that hold with some exceptions to avoid overfitting real-life dirty data and facilitate data cleaning tasks. Different methods have been proposed to address the problem, by following the same framework consisting of two phases. In the first phase a structure called evidence set is built on the given instance, and in the second phase approximate DCs are found by leveraging the evidence set. In this paper, we present novel and more efficient techniques under the same framework. (1) We optimize the evidence set construction by first building a condensed structure called clue set and then transforming the clue set to the evidence set. The clue set is more memory-efficient than the evidence set and facilitates more efficient bit operations and better cache utilization, and the transformation cost is usually trivial. We further study parallel clue set construction with multiple threads. (2) Our solution to approximate DC discovery from the evidence set is a highly non-trivial extension of the evidence inversion method for exact DC discoveries. (3) Using a host of datasets, we experimentally verify our approximate DC discovery approach is on average 8.2 and 7.5 times faster than the two state-of-the-art ones that also leverage parallelism, respectively, and our methods for the two phases are up to an order of magnitude and two orders of magnitude faster than the state-of-the-art methods, respectively.
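The two-phase framework described above can be illustrated with a minimal Python sketch. It is not the clue-set construction or the inversion algorithm from the paper; it only shows the basic idea of an evidence set (here a bitmask of satisfied predicates per ordered tuple pair) and how the violation ratio of a candidate DC can be read off from it. The predicate space, the column names, and the sample rows are all hypothetical.

```python
from collections import Counter
from itertools import permutations
import operator

# Hypothetical predicate space: (column, operator symbol, comparison function).
PREDICATES = [
    ("state", "=", operator.eq),
    ("state", "!=", operator.ne),
    ("salary", ">", operator.gt),
    ("salary", "<=", operator.le),
    ("tax", "<", operator.lt),
    ("tax", ">=", operator.ge),
]

def build_evidence_set(rows):
    """Phase 1: map each evidence (bitmask of satisfied predicates) to its pair count."""
    evidence = Counter()
    for t, s in permutations(rows, 2):  # ordered pairs of distinct tuples
        mask = 0
        for i, (col, _, op) in enumerate(PREDICATES):
            if op(t[col], s[col]):
                mask |= 1 << i
        evidence[mask] += 1
    return evidence

def violation_ratio(evidence, dc_predicates):
    """Phase 2 helper: fraction of pairs that satisfy every predicate of the DC."""
    dc_mask = 0
    for i in dc_predicates:
        dc_mask |= 1 << i
    total = sum(evidence.values())
    violating = sum(n for mask, n in evidence.items() if (mask & dc_mask) == dc_mask)
    return violating / total

rows = [
    {"state": "NY", "salary": 50, "tax": 5},
    {"state": "NY", "salary": 60, "tax": 6},
    {"state": "NY", "salary": 70, "tax": 4},  # dirty tuple: higher salary, lower tax
    {"state": "CA", "salary": 80, "tax": 9},
]
evidence = build_evidence_set(rows)
# Candidate DC: not (t.state = s.state and t.salary > s.salary and t.tax < s.tax)
ratio = violation_ratio(evidence, [0, 2, 4])
epsilon = 0.2  # assumed error threshold
print(ratio, ratio <= epsilon)
```

In this toy instance 2 of the 12 ordered pairs violate the candidate DC, so it would be accepted as an approximate DC under the assumed threshold of 0.2 and rejected as an exact one.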
CommunityAF: An Example-based Community Search Method via Autoregressive Flow
Jiazun Chen (Peking university); Yikuan Xia (Peking University); Jun Gao (Peking University)*
Example-based community search utilizes hidden patterns of given examples rather than explicit rules, which reduces the burden on users and enhances flexibility. However, existing works face challenges such as low scalability, high training cost, and improper termination during the search. To tackle all these issues, this paper proposes a community search framework named CommunityAF with three well-designed components. The first is a GNN (graph neural network) component that combines community-aware structure features to incrementally learn node embeddings over a large graph for the other two components. The second is an autoregressive flow-based generation component that is designed for fast training and model stability. The third is a scoring component that evaluates the communities and provides scores for a stable termination. Moreover, to show that CommunityAF has sufficient expressive power to cover the rules, we demonstrate that the scoring component with node features weighted by degree-related factors is able to mimic the existing structure-based community metrics. We introduce a square ranking loss to guide the training of the scoring component, and further devise a flexible termination strategy based on the inferred score change pattern over a sequence of candidate communities using beam search. We compare CommunityAF with four different categories of community search methods on six real-world datasets. The results illustrate that CommunityAF outperforms these community search methods, and achieves an average 15.3% improvement in effectiveness and 4x to 20x speedups on different datasets relative to the state-of-the-art generative method.
Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent [industry]
Xiaonan Nie (Peking University)*; Yi Liu (Tencent); Fangcheng Fu (Peking University); Jinbao Xue (Tencent); Dian Jiao (Tencent); Xupeng Miao (Carnegie Mellon University); Yangyu Tao (Tencent); Bin Cui (Peking University)
Recent years have witnessed the unprecedented achievements of large-scale pre-trained models, especially Transformer models. Many products and services at Tencent Inc., such as WeChat, QQ, and Tencent Advertisement, have adopted pre-trained models to benefit from their power. In this work, we present Angel-PTM, a productive deep learning system designed for pre-training and fine-tuning Transformer models. Angel-PTM can efficiently train extremely large-scale models with hierarchical memory. The key designs of Angel-PTM are fine-grained memory management via the Page abstraction and a unified scheduling method that coordinates computations, data movements, and communications. Furthermore, Angel-PTM supports extreme model scaling with SSD storage and implements a lock-free updating mechanism to address the SSD I/O bottlenecks. Experimental results demonstrate that Angel-PTM outperforms existing systems by up to 114.8% in terms of maximum model scale as well as up to 88.9% in terms of training throughput. Additionally, experiments on GPT3-175B and T5-MoE-1.2T models utilizing hundreds of GPUs verify its strong scalability.
HEDA: Multi-Attribute Unbounded Aggregation over Homomorphically Encrypted Database
Xuanle Ren (Alibaba Group); Le Su (Alibaba Group); Zhen Gu (Alibaba Group)*; Sheng Wang (Alibaba Group); Feifei Li (Alibaba Group); Yuan Xie (Alibaba DAMO Academy); Song Bian (Kyoto University); Chao Li (Zhejiang University); Fan Zhang (Zhejiang University)
Recent years have witnessed the rapid development of encrypted databases, due to the increasing number of data privacy breaches, which have caused millions of dollars in losses, and the corresponding laws and regulations. These encrypted databases may rely on different techniques, such as cryptographic primitives and trusted execution environments. In this work, we investigate the feasibility of utilizing fully homomorphic encryption (FHE) to support unbounded database aggregation queries, which typically involve comparisons as filtering predicates and a final aggregation. These operators are theoretically supported by FHE, but need careful algorithm design to maximize efficiency and have not been explored before. We use two types of FHE schemes, one for numerical values and one for binary values, to exploit the advantages of each. To bridge the encrypted values between these two schemes for seamless query processing without client-server interaction, we propose a novel ciphertext transformation mechanism, which is of independent research interest. We further implement our system and test it over three TPC-H queries and a query over a real social media e-commerce database. Evaluation results show that processing an aggregation query over 8k encrypted rows takes about 430 seconds. Although it is orders of magnitude slower than plaintext processing and still has much room for improvement, as the very first work in this domain, our system demonstrates the feasibility of using FHE to process OLAP queries.
LIDER: An Efficient High-dimensional Learned Index for Large-scale Dense Passage Retrieval
Yifan Wang (University of Florida)*; Haodi Ma (University of Florida); Daisy Zhe Wang (University of Florida)
Passage retrieval has been studied for decades, and many recent approaches of passage retrieval are using dense embeddings generated from deep neural models, called ``dense passage retrieval''. The state-of-the-art end-to-end dense passage retrieval systems normally deploy a deep neural model followed by an approximate nearest neighbor (ANN) search module. The model generates embeddings of the corpus and queries, which are then indexed and searched by the high-performance ANN module. With the increasing data scale, the ANN module unavoidably becomes the bottleneck on efficiency. An alternative is the learned index, which achieves significantly high search efficiency by learning the data distribution and predicting the target data location. But most of the existing learned indexes are designed for low dimensional data, which are not suitable for dense passage retrieval with high-dimensional dense embeddings.
In this paper, we propose LIDER, an efficient high-dimensional Learned Index for large-scale DEnse passage Retrieval. LIDER has a clustering-based hierarchical architecture formed by two layers of core models. As the basic unit of LIDER to index and search data, a core model includes an adapted recursive model index (RMI) and a dimension reduction component which consists of an extended SortingKeys-LSH (SK-LSH) and a key re-scaling module. The dimension reduction component reduces the high-dimensional dense embeddings into one-dimensional keys and sorts them in a specific order, which are then used by the RMI to make fast predictions. Experiments show that LIDER has a higher search speed with high retrieval quality compared to the state-of-the-art ANN indexes on passage retrieval tasks, e.g., on large-scale data it achieves 1.2x search speed and significantly higher retrieval quality than the fastest baseline in our evaluation. Furthermore, LIDER offers a better speed-quality trade-off.
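The last step described above, predicting a position from a sorted one-dimensional key, can be sketched in a few lines of Python. This is only an illustration of the general learned-index idea under simplifying assumptions: a single linear model stands in for the RMI hierarchy, the keys are assumed to be already reduced to one dimension, and the class name `LinearLearnedIndex` is hypothetical. It does not reproduce LIDER's SK-LSH reduction or its clustering layers.

```python
import bisect

class LinearLearnedIndex:
    """Single linear model over sorted keys with an error-bounded local search."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        lo, hi = self.keys[0], self.keys[-1]
        # Fit a line through the first and last key (a deliberately crude model).
        self.slope = (n - 1) / (hi - lo) if hi > lo else 0.0
        self.intercept = -self.slope * lo
        # Record the worst prediction error so lookups can bound their search window.
        self.max_err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

    def _predict(self, key):
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key in the sorted array, or None if absent."""
        pred = self._predict(key)
        lo = max(pred - self.max_err, 0)
        hi = min(pred + self.max_err + 1, len(self.keys))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None

index = LinearLearnedIndex([3, 8, 15, 16, 23, 42, 57, 91])
print(index.lookup(42), index.lookup(10))
```

The model only has to be roughly right: the recorded maximum error turns a prediction into a small window that a binary search can finish, which is where the speed advantage over searching the whole array comes from.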
Influence Maximization in Real-World Closed Social Networks
Shixun Huang (University of Wollongong); Wenqing Lin (Tencent); Zhifeng Bao (RMIT University)*; Jiachen Sun (TENCENT)
In the last few years, many closed social networks such as WhatsApp and WeChat have emerged to cater for people’s growing demand for privacy and independence. In a closed social network, the posted content is not available to all users, or senders can set limits on who can see the posted content. Under such a constraint, we study the problem of influence maximization in a closed social network. It aims to recommend to users (not just the seed users) a limited number of existing friends who will help propagate the information, such that the seed users’ influence spread can be maximized. We first prove that this problem is NP-hard. Then, we propose a highly effective yet efficient method to augment the diffusion network, which initially consists of seed users only. The augmentation is done by iteratively and intelligently selecting and inserting a limited number of edges from the original network. Through extensive experiments on real-world social networks, including deployment into a real-world application, we demonstrate the effectiveness and efficiency of our proposed method.
MultiBiSage: A Web-Scale Recommendation System Using Multiple Bipartite Graphs at Pinterest [sds]
Saket Gurukar (The Ohio State University)*; Nikil Pancha (Pinterest); Andrew H Zhai (Pinterest); Eric Kim (Pinterest); Samson Hu (Pinterest); Srinivasan Parthasarathy (Ohio State University); Charles Rosenberg (Pinterest); Jure Leskovec (Stanford University)
Graph Convolutional Networks (GCN) can efficiently integrate graph structure and node features to learn high-quality node embeddings. At Pinterest, we have developed and deployed PinSage, a data-efficient GCN that learns pin embeddings from the Pin-Board graph. Pinterest relies heavily on PinSage, which in turn only leverages the Pin-Board graph. However, there exist several entities at Pinterest and heterogeneous interactions among these entities. These diverse entities and interactions provide important signals for recommendations and modeling. In this work, we show that training deep learning models on graphs that capture these diverse interactions can result in learning higher-quality pin embeddings than training PinSage on only the Pin-Board graph. However, building a large-scale heterogeneous graph engine that can process Pinterest-scale data has not yet been done. In this work, we present a clever and effective solution where we break the heterogeneous graph into multiple disjoint bipartite graphs and then develop a novel data-efficient MultiBiSage model that combines the signals from them. MultiBiSage can capture the graph structure of multiple bipartite graphs to learn high-quality pin embeddings. The benefit of our approach is that individual bipartite graphs can be processed with minimal changes to Pinterest’s current infrastructure, while still combining information from all the graphs and achieving high performance. We train MultiBiSage on six bipartite graphs including our Pin-Board graph and show that it significantly outperforms the latest deployed version of PinSage on multiple user engagement metrics. We also perform experiments on two public datasets to show that MultiBiSage is generalizable and can be applied to datasets outside of Pinterest.
Triangular Stability Maximization by Influence Spread over Social Networks
Zheng Hu (Fudan University); Weiguo Zheng (Fudan University)*; Xiang Lian (Kent State University)
In many real-world applications such as social network analysis and online advertising/marketing, one of the most important and popular problems is called influence maximization (IM), which finds a set of k seed users that maximize the expected number of influenced user nodes. In practice, however, maximizing the number of influenced nodes may be far from satisfactory for real applications such as opinion promotion and collective buying. In this paper, we explore the importance of stability and triangles in social networks, and formulate a novel problem in the influence spread scenario, named triangular stability maximization, over social networks, and generalize it to a general triangle influence maximization problem, which is proved to be NP-hard. We develop an efficient reverse influence sampling (RIS) based framework for the triangle IM with theoretical guarantees. To enable unbiased estimators, it demands probabilistic sampling of triangles, that is, sampling triangles according to their probabilities. We propose an edge-based triple sampling approach, which is exactly equivalent to probabilistic sampling and avoids costly triangle enumeration and materialization.
To further improve the time efficiency, we also design several pruning and reduction techniques, as well as a cost-model-guided heuristic algorithm. Extensive experiments and a case study over real-world graphs confirm the effectiveness of our proposed algorithms and the superiority of our proposed triangular stability maximization and triangle influence maximization.
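For readers unfamiliar with reverse influence sampling, the following Python sketch shows the classic node-based variant under the independent cascade model: sample reverse-reachable (RR) sets, then greedily pick the seeds that cover the most of them. It is not the triangle-based sampling or the edge-based triple sampling proposed in the paper; the graph, the edge probabilities, and the function names are hypothetical.

```python
import random
from collections import defaultdict

def sample_rr_set(nodes, in_edges, rng):
    """Sample one RR set: nodes that reach a random root over randomly kept edges."""
    root = rng.choice(nodes)
    rr, frontier = {root}, [root]
    while frontier:
        v = frontier.pop()
        for u, p in in_edges[v]:
            # Keep the edge u -> v with its propagation probability p.
            if u not in rr and rng.random() < p:
                rr.add(u)
                frontier.append(u)
    return rr

def greedy_seeds(rr_sets, k):
    """Greedy maximum coverage: repeatedly pick the node that hits the most RR sets."""
    remaining = list(rr_sets)
    seeds = []
    for _ in range(k):
        counts = defaultdict(int)
        for rr in remaining:
            for u in rr:
                counts[u] += 1
        if not counts:
            break
        best = max(sorted(counts), key=counts.get)  # sorted() makes ties deterministic
        seeds.append(best)
        remaining = [rr for rr in remaining if best not in rr]
    return seeds

# Toy graph a -> b, a -> c, b -> d, stored as incoming edges (source, probability).
in_edges = {"a": [], "b": [("a", 1.0)], "c": [("a", 1.0)], "d": [("b", 1.0)]}
nodes = sorted(in_edges)
rng = random.Random(0)
rr_sets = [sample_rr_set(nodes, in_edges, rng) for _ in range(200)]
seeds = greedy_seeds(rr_sets, k=1)
print(seeds)
```

The fraction of RR sets covered by a seed set is an unbiased estimate of its normalized influence spread, which is why the abstract stresses that the triangle variant must also sample according to the true probabilities.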
Coresets over Multiple Tables for Feature-rich and Data-efficient Machine Learning
Jiayi Wang (Tsinghua University)*; Chengliang Chai (Beijing Institute of Technology); Nan Tang (Qatar Computing Research Institute, HBKU); Jiabin Liu (Tsinghua University); Guoliang Li (Tsinghua University)
Successful machine learning (ML) needs to learn from good data. However, one common issue about train data for ML practitioners is the lack of good features. To mitigate this problem, feature augmentation is often employed by joining with (or enriching features from) multiple tables, so as to become feature-rich ML. A consequent problem is that the enriched train data may contain too many tuples, especially if the feature augmentation is obtained through 1 (or many)-to-many or fuzzy joins. Training an ML model with a very large train dataset is data-inefficient. Coreset is often used to achieve data-efficient ML training, which selects a small subset of train data that can theoretically and practically perform similarly as using the full dataset. However, coreset selection over a large train dataset is also known to be time-consuming.
In this paper, we aim at achieving both feature-rich ML through feature augmentation and data-efficient ML through coreset selection. In order to avoid time-consuming coreset selection over a feature augmented (or fully materialized) table, we propose to efficiently select the coreset without materializing the augmented table. Note that coreset selection typically uses weighted gradients of the subset to approximate the full gradient of the entire train dataset. Our key idea is that the gradient computation for coreset selection of the augmented table can be pushed down to partial feature similarity of tuples within each individual table, without join materialization. These partial feature similarity values can be aggregated to estimate the gradient of the augmented table, which is upper bounded with provable theoretical guarantees. Extensive experiments show that our method can improve the efficiency by nearly two orders of magnitude, while keeping almost the same accuracy as training with the fully augmented train data.
Auto-Tuning with Reinforcement Learning for Permissioned Blockchain Systems
Mingxuan Li (Institute of Information Engineering,Chinese Academy of Sciences)*; Yazhe Wang (State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences); Shuai Ma (Beihang University); Chao Liu (Institute of Information Engineering,Chinese Academy of Sciences); Dongdong Huo (Institute of Information Engineering,Chinese Academy of Sciences); Yu Wang (Institute of Information Engineering,Chinese Academy of Sciences); Zhen Xu (State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences)
In a permissioned blockchain, performance dictates its development, which is substantially influenced by its parameters. However, research on auto-tuning for better performance has somewhat stagnated because of the difficulty posed by distributed parameters, which makes it hard to propose an effective auto-tuning optimization scheme. To alleviate this issue, we lay a solid basis for our research by first exploring the relationship between parameters and performance in Hyperledger Fabric, a permissioned blockchain, and we propose Athena, a Fabric-based auto-tuning system that can automatically provide parameter configurations for optimal performance. The key to Athena is a new Permissioned Blockchain Multi-Agent Deep Deterministic Policy Gradient (PB-MADDPG) algorithm that realizes heterogeneous parameter-tuning optimization of different types of nodes in Fabric. Moreover, we select the parameters with the most significant impact to accelerate recommendation. In its application to Fabric, a typical permissioned blockchain system, with 12 peers and 7 orderers, Athena achieves a throughput improvement of 470.45% and a latency reduction of 75.66% over the default configuration. Compared with the most advanced tuning schemes (CDBTune, Qtune, and ResTune), our method is competitive in terms of throughput and latency.
R31
Transactions, Tuning, and Compression
Chair: Yu Xia (MIT)
Efficient Distributed Transaction Processing in Heterogeneous Networks
Qian Zhang (RenMin University of China); Jingyao Li (Renmin University of China); Hongyao Zhao (Renmin University of China); Quanqing Xu (OceanBase); Wei Lu (Renmin University of China)*; Jinliang Xiao (OceanBase); Fusheng Han (OceanBase); Chuanhui Yang (OceanBase); Xiaoyong Du (Renmin University of China)
Countrywide and worldwide businesses, like gaming and social networks, drive the popularity of inter-data-center transactions. To support inter-data-center transaction processing and data center fault tolerance simultaneously, existing protocols suffer from significant performance degradation due to high-latency and unstable networks. In this paper, we propose RedT, a novel distributed transaction processing protocol that works in heterogeneous networks. In detail, nodes within a data center are inter-connected via the RDMA-capable network and nodes across data centers are inter-connected via TCP/IP networks. RedT extends two-phase commit (2PC) by decomposing transactions into sub-transactions at data center granularity, and by proposing a pre-write-log mechanism that is able to reduce the number of inter-data-center round-trips from a maximum of 6 to 2. Extensive evaluation against state-of-the-art protocols shows that RedT can achieve up to 1.57X higher throughput and 0.56X lower latency.
STARRY: Multi-master Transaction Processing on Semi-leader Architecture
Zihao Zhang (East China Normal University); Huiqi Hu (East China Normal University)*; Xuan Zhou (East China Normal University); Jiang Wang (Huawei)
Multi-master architecture is desirable for cloud databases in supporting large-scale transaction processing. To enable concurrent transaction execution on multiple computing nodes, we need an efficient transaction commit protocol on the storage layer that ensures ACID as well as consensus among replicas. A leader-based protocol is easy to implement. However, it faces the single-node bottleneck and suffers from high transaction latency in cross-region deployment. While a leaderless protocol can achieve a higher degree of parallelism, it is inefficient in resolving conflicts.
This paper proposes the semi-leader protocol, which is a new type of transaction commit protocol for multi-master transaction processing. In a nutshell, the semi-leader protocol is a hybrid protocol that offers separate commit paths for conflicting transactions and non-conflicting transactions. A centralized node, known as the sequencer, is employed to perform precise conflict resolution for conflicting transactions, while non-conflicting transactions can be committed timely in a decentralized manner. Based on the semi-leader protocol, we designed Starry, a multi-master transaction processing mechanism. Experimental results demonstrate that Starry is 1.4x and 4.21x as performant as the leaderless and leader-based protocols respectively in throughput. When dealing with high-contention workloads, Starry can significantly reduce the abort rates.
Adore: Differentially Oblivious Relational Database Operators
Lianke Qin (UCSB)*; Rajesh Jayaram (Carnegie Mellon University); Elaine Shi (Carnegie Mellon University); Zhao Song (Adobe Research); Danyang Zhuo (Duke University); Shumo Chu ()
There has been a recent effort in applying differential privacy on memory access patterns to enhance data privacy. This is called differential obliviousness. Differential obliviousness is a promising direction because it provides a principled trade-off between performance and desired level of privacy. To date, it is still an open question whether differential obliviousness can speed up database processing with respect to full obliviousness. In this paper, we present the design and implementation of Adore: A set of Differentially Oblivious RElational database operators. Adore includes selection with projection, grouping with aggregation, and foreign key join. We prove that they satisfy the notion of differential obliviousness. Our differentially oblivious operators have reduced cache complexity, runtime complexity, and output size compared to their state-of-the-art fully oblivious counterparts. We also demonstrate that our implementation of these differentially oblivious operators can outperform their state-of-the-art fully oblivious counterparts by up to 7.4×.
PromptEM: Prompt-tuning for Low-resource Generalized Entity Matching [sds]
Pengfei Wang (Zhejiang University); Xiaocan Zeng (Zhejiang University); Lu Chen (Zhejiang University); Fan Ye (Zhejiang University); Yuren Mao (Zhejiang University); Junhao Zhu (Zhejiang University); Yunjun Gao (Zhejiang University)*
Entity Matching (EM), which aims to identify whether two entity records from two relational tables refer to the same real-world entity, is one of the fundamental problems in data management. Traditional EM assumes that the two tables are homogeneous with an aligned schema, whereas practical scenarios commonly involve entity records of different formats (e.g., relational, semi-structured, or textual types). It is not practical to unify their schemas due to the different formats. To support EM on entity records of different formats, Generalized Entity Matching (GEM) has been proposed and has gained much attention recently. Existing GEM methods typically rely on supervised learning, which requires a large amount of high-quality labeled examples. However, the labeling process is extremely labor-intensive and hinders the use of GEM. Low-resource GEM, i.e., GEM that only requires a small number of labeled examples, has become an urgent need. To this end, this paper, for the first time, focuses on low-resource GEM and proposes a novel low-resource GEM method, termed PromptEM. PromptEM addresses three challenging issues in low-resource GEM (i.e., designing GEM-specific prompt-tuning, improving pseudo-label quality, and running efficient self-training). Extensive experimental results on eight real benchmarks demonstrate the superiority of PromptEM in terms of effectiveness and efficiency.
Elf: Erasing-based Lossless Floating-Point Compression
Ruiyuan Li (Chongqing University)*; Zheng Li (Chongqing University); Yi Wu (Chongqing University); Chao Chen (Chongqing University); Yu Zheng (JD)
A prohibitively large volume of floating-point time series data is being generated at an unprecedentedly high rate. Efficient, compact, and lossless compression of time series data is of great importance for a wide range of scenarios. Most existing lossless floating-point compression methods are based on the XOR operation, but they do not fully exploit the trailing zeros, which usually results in an unsatisfactory compression ratio. This paper proposes an Erasing-based Lossless Floating-point compression algorithm, i.e., Elf. The main idea of Elf is to erase the last few bits (i.e., set them to zero) of floating-point values, so that the XORed values contain many trailing zeros. The challenges of the erasing-based method are three-fold. First, how to quickly determine the erased bits? Second, how to losslessly recover the original data from the erased ones? Third, how to compactly encode the erased data? Through rigorous mathematical analysis, Elf can directly determine the erased bits and restore the original values without losing any precision. To further improve the compression ratio, we propose a novel encoding strategy for the XORed values with many trailing zeros. Elf works in a streaming fashion. It takes only O(N) time (where N is the length of a time series) and O(1) space, and achieves a notable compression ratio with a theoretical guarantee. Extensive experiments using 22 datasets show the strong performance of Elf compared with 9 advanced competitors.
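The erasing idea can be illustrated with a minimal Python sketch. This is not the Elf algorithm itself: it assumes the number of decimal places of each value is known, and it finds the erasable bits by trial using Python's built-in round, whereas the abstract states that Elf determines them directly through mathematical analysis. All function names are hypothetical.

```python
import struct

def to_bits(x: float) -> int:
    """Reinterpret a double as a 64-bit unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def from_bits(b: int) -> float:
    """Reinterpret a 64-bit unsigned integer as a double."""
    return struct.unpack(">d", struct.pack(">Q", b))[0]

def trailing_zeros(x: int) -> int:
    """Number of trailing zero bits in a 64-bit pattern."""
    return 64 if x == 0 else (x & -x).bit_length() - 1

def erase(v: float, decimals: int) -> float:
    """Zero as many low mantissa bits as possible while the value
    still rounds back to v at the given number of decimal places.
    (Trial-based stand-in for Elf's direct computation.)"""
    bits = to_bits(v)
    best = v
    for k in range(1, 53):  # the mantissa of a double has 52 bits
        mask = ~((1 << k) - 1) & 0xFFFFFFFFFFFFFFFF
        candidate = from_bits(bits & mask)
        if round(candidate, decimals) != v:
            break
        best = candidate
    return best

def restore(erased: float, decimals: int) -> float:
    """Recover the original value from the erased one."""
    return round(erased, decimals)

e1, e2 = erase(3.17, 2), erase(3.25, 2)
xored = to_bits(e1) ^ to_bits(e2)
print(trailing_zeros(to_bits(3.17)), trailing_zeros(to_bits(e1)), trailing_zeros(xored))
```

The erased values keep far more trailing zeros than the originals, so the XOR of neighbouring values does too, which is what an XOR-based encoder can then exploit.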
R32
Potpourri Online I (Systems & Algorithms)
Chair: Elisa Bertino (Purdue University)
TASK: An Efficient Framework for Instant Error-tolerant Spatial Keyword Queries on Road Networks
Chengyang Luo (Zhejiang University); Qing Liu (Zhejiang University); Yunjun Gao (Zhejiang University)*; Lu Chen (Zhejiang University); Ziheng Wei (Huawei Technologies Co., Ltd.); Congcong Ge (Huawei Technologies Co., Ltd.)
Instant spatial keyword queries return results as soon as users type in some characters instead of a complete keyword, which allows users to query geo-textual data in a type-as-you-search manner. However, the existing methods for instant spatial keyword queries suffer from several limitations. For example, they do not consider typographical errors in input keywords and cannot be applied to road networks. To overcome these limitations, in this paper, we propose a new query type, i.e., instant error-tolerant spatial keyword queries on road networks. To answer the queries efficiently, we present a framework, termed TASK, which consists of an index component, a query component, and an update component. In the index component, we design a novel index called reverse 2-hop label based trie, which seamlessly integrates spatial and textual information for each vertex of the road network. Based on our proposed index, we devise efficient algorithms to progressively return and update the query results in the query component and update component, respectively. Finally, we conduct extensive experiments on real-world road networks to evaluate the performance of TASK. Empirical results show that our proposed index and algorithms are up to two orders of magnitude faster than the baseline.
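As an illustration of what "instant" and "error-tolerant" mean for the textual part of such a query, the following Python sketch computes a prefix edit distance: the smallest number of edits needed to turn the typed characters into some prefix of a keyword. This is a generic textbook technique chosen for illustration; the abstract does not say which similarity measure TASK uses, and the sketch ignores the spatial and road-network side entirely. The function names and the `max_errors` parameter are hypothetical.

```python
def prefix_edit_distance(typed: str, keyword: str) -> int:
    """Minimum edit distance between `typed` and any prefix of `keyword`."""
    # prev[j] = edit distance between the typed characters seen so far
    # and keyword[:j]; with nothing typed, that is simply j.
    prev = list(range(len(keyword) + 1))
    for i, tc in enumerate(typed, start=1):
        cur = [i]
        for j, kc in enumerate(keyword, start=1):
            cost = 0 if tc == kc else 1
            cur.append(min(prev[j] + 1,          # delete a typed char
                           cur[j - 1] + 1,       # insert a keyword char
                           prev[j - 1] + cost))  # match or substitute
        prev = cur
    # Any prefix of the keyword is an acceptable target.
    return min(prev)

def instant_matches(typed: str, keywords: list[str], max_errors: int = 1) -> list[str]:
    """Keywords that the partially typed input may still refer to."""
    return [w for w in keywords if prefix_edit_distance(typed, w) <= max_errors]

print(instant_matches("rasta", ["restaurant", "rest area", "museum"]))
```

A real system would evaluate this incrementally over an index (the abstract mentions a trie) rather than scanning every keyword on each keystroke.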
Differentially Private Vertical Federated Clustering
Zitao Li (Alibaba Group)*; Tianhao Wang (University of Virginia); Ninghui Li (Purdue University)
In many applications, multiple parties have private data regarding the same set of users but on disjoint sets of attributes, and a server wants to leverage the data to train a model. To enable model learning while protecting the privacy of the data subjects, we need vertical federated learning (VFL) techniques, where the data parties share only information for training the model, instead of the private data. However, it is challenging to ensure that the shared information maintains privacy while learning accurate models. To the best of our knowledge, the algorithm proposed in this paper is the first practical solution for differentially private vertical federated k-means clustering, where the server can obtain a set of global centers with a provable differential privacy guarantee. Our algorithm assumes an untrusted central server that aggregates differentially private local centers and membership encodings from local data parties. It builds a weighted grid as the synopsis of the global dataset based on the received information. Final centers are generated by running any k-means algorithm on the weighted grid. Our approach for grid weight estimation uses a novel, lightweight, and differentially private set intersection cardinality estimation algorithm based on the Flajolet-Martin sketch. To improve the estimation accuracy in the setting with more than two data parties, we further propose a refined version of the weight estimation algorithm and a parameter tuning strategy that bring the final k-means utility close to that in the central private setting. We provide theoretical utility analysis and experimental evaluation results for the cluster centers computed by our algorithm and show that our approach performs better both theoretically and empirically than the two baselines based on existing techniques.
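For readers unfamiliar with the Flajolet-Martin (FM) sketch mentioned above, here is a minimal, non-private Python sketch of the building block. It adds no differential privacy noise, and the inclusion-exclusion step used for the intersection is an assumption made for illustration, not the paper's estimator. Function names and parameters are hypothetical.

```python
import hashlib

PHI = 0.77351  # standard Flajolet-Martin correction constant

def fm_bitmap(items, seed: int = 0, bits: int = 32) -> int:
    """Build one FM bitmap: set bit r when an item's hash has r trailing zeros."""
    bitmap = 0
    for item in items:
        digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        rho = (h & -h).bit_length() - 1 if h else bits - 1
        bitmap |= 1 << min(rho, bits - 1)
    return bitmap

def lowest_zero_bit(bitmap: int) -> int:
    """Index of the lowest bit that is still 0."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return r

def fm_estimate(bitmaps) -> float:
    """Cardinality estimate from several independent bitmaps."""
    mean_r = sum(lowest_zero_bit(b) for b in bitmaps) / len(bitmaps)
    return (2 ** mean_r) / PHI

def intersection_estimate(a, b, num_sketches: int = 64) -> float:
    """|A ∩ B| ≈ |A| + |B| - |A ∪ B|, with the union sketch obtained by OR."""
    sa = [fm_bitmap(a, s) for s in range(num_sketches)]
    sb = [fm_bitmap(b, s) for s in range(num_sketches)]
    su = [x | y for x, y in zip(sa, sb)]
    return fm_estimate(sa) + fm_estimate(sb) - fm_estimate(su)

users_a = set(range(0, 2000))
users_b = set(range(1000, 3000))
print(round(intersection_estimate(users_a, users_b)))  # a rough estimate only
```

The property that makes the sketch attractive in a federated setting is that the bitmap of a union is the bitwise OR of the parties' bitmaps, so raw user identifiers never need to be exchanged.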
Frequency Domain Data Encoding in Apache IoTDB [sds]
Haoyu Wang (Tsinghua University); Shaoxu Song (Tsinghua University)*
Frequency domain analysis is widely conducted on time series. Since transforming from the time domain to the frequency domain online is costly, e.g., by Fast Fourier Transform (FFT), there is high demand for storing the frequency domain data for reuse. However, frequency domain data encoding for efficient storage is surprisingly untouched. We notice that (1) the precision of data values is unnecessarily high after transforming to the frequency domain and (2) the data values have a skewed distribution, leading to a very large bit width for encoding. To avoid such space waste in both precision and skewness, we devise a descending bit-packing encoding for frequency domain data. Specifically, we quantize the data values to a proper precision with reference to the signal-to-noise ratio (SNR) in frequency domain analysis. Moreover, we sort the data values in descending order so that the bit width can be dynamically reduced during encoding. The method has been deployed in Apache IoTDB, an open-source time-series database, not only for directly encoding frequency domain data, but also as a lossy compression of the time domain data. Extensive experiments on the system demonstrate the superiority of our encoding for both frequency domain and time domain data.
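The benefit of sorting before bit-packing can be seen in a small Python sketch. It is a simplified illustration, not the encoder deployed in Apache IoTDB: it assumes the values are already quantized to non-negative integers, encodes each value with the bit width of the value before it, and leaves out the SNR-based quantization and the bookkeeping needed to restore the original order. Function names are hypothetical.

```python
def encode_descending(values: list[int]) -> tuple[int, int, str]:
    """Pack non-negative integers in descending order.

    Each value is written with the bit width of its predecessor, which
    is always sufficient because the sequence never increases.
    Returns (width of first value, count, bit string)."""
    ordered = sorted(values, reverse=True)
    first_width = max(1, ordered[0].bit_length())
    width = first_width
    chunks = []
    for v in ordered:
        chunks.append(format(v, f"0{width}b"))
        width = max(1, v.bit_length())  # the next value needs no more than this
    return first_width, len(ordered), "".join(chunks)

def decode_descending(first_width: int, count: int, bitstring: str) -> list[int]:
    """Inverse of encode_descending; widths are re-derived while decoding."""
    out, pos, width = [], 0, first_width
    for _ in range(count):
        v = int(bitstring[pos:pos + width], 2)
        out.append(v)
        pos += width
        width = max(1, v.bit_length())
    return out

spectrum = [3, 1000, 1, 7, 1]  # skewed: one dominant component
first_width, count, packed = encode_descending(spectrum)
fixed_bits = len(spectrum) * max(spectrum).bit_length()
print(len(packed), "bits packed vs", fixed_bits, "bits at fixed width")
```

With a skewed distribution, a fixed width is dictated by the single largest value, while the descending layout lets the width shrink as soon as the large values have been written.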
Change Propagation Without Joins
Qichen Wang (Hong Kong Baptist University); Xiao Hu (University of Waterloo)*; Binyang Dai (Hong Kong University of Science and Technology); Ke Yi (Hong Kong University of Science and Technology)
We revisit the classical change propagation framework for query evaluation under updates. The standard framework takes a query plan and materializes the intermediate views, which incurs high polynomial costs in both space and time, with the join operator being the culprit. In this paper, we propose a new change propagation framework without joins, thus naturally avoiding this polynomial blowup. Meanwhile, we show that the new framework still supports constant-delay enumeration of both the deltas and the full query results, the same as in the standard framework. Furthermore, we provide a quantitative analysis of its update cost, which not only recovers many recent theoretical results on the problem, but also yields an effective approach to optimizing the query plan. The new framework is also easy to integrate into an existing streaming database system. Experimental results show that our system prototype, implemented using the Flink DataStream API, significantly outperforms other systems in terms of space, time, and latency.
BICE: Exploring Compact Search Space by Using Bipartite Matching and Cell-Wide Verification
Yunyoung Choi (Alsemy); Kunsoo Park (Seoul National University); Hyunjoon Kim (Hanyang University)*
Subgraph matching is the problem of searching for all embeddings of a query graph in a data graph, and subgraph query processing (also known as subgraph search) is to find all the data graphs that contain a query graph as a subgraph. Extensive research has been done to develop practical solutions for both problems. However, the existing solutions still suffer from long query processing times due to a lot of unnecessary computations in search. In this paper, we focus on exploring as compact a search space as possible by using three techniques: (1) pruning by bipartite matching, (2) pruning by failing sets with bipartite matching, and (3) cell-wide verification. We propose a new algorithm, BICE, which combines these three techniques. We conduct extensive experiments on real-world datasets as well as synthetic datasets to evaluate the effectiveness of the techniques. Experiments show that our approach outperforms the fastest existing subgraph search algorithm by up to two orders of magnitude in terms of elapsed time to process a query. Our approach also outperforms state-of-the-art subgraph matching algorithms by up to two orders of magnitude.
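One common way to prune candidates with bipartite matching can be sketched in Python as follows. The abstract does not spell out BICE's pruning rules, so this is a generic neighborhood filter offered as an assumption: a data vertex v stays a candidate for a query vertex u only if the neighbors of u can be assigned to distinct candidate neighbors of v. The matching uses the standard augmenting-path method; all names are hypothetical.

```python
def can_match_all(options: dict) -> bool:
    """True if every key can be assigned a distinct value from its option set
    (augmenting-path bipartite matching)."""
    matched_to = {}  # data vertex -> query vertex currently using it

    def try_assign(q, seen) -> bool:
        for d in options[q]:
            if d in seen:
                continue
            seen.add(d)
            # Take d if it is free, or if its current owner can move elsewhere.
            if d not in matched_to or try_assign(matched_to[d], seen):
                matched_to[d] = q
                return True
        return False

    return all(try_assign(q, set()) for q in options)

def survives_pruning(u, v, query_adj, data_adj, cand) -> bool:
    """Keep v as a candidate for u only if u's neighbors fit injectively
    into v's neighborhood, respecting each neighbor's candidate set."""
    options = {qn: cand[qn] & data_adj[v] for qn in query_adj[u]}
    return can_match_all(options)

# Query: u0 is adjacent to u1 and u2.
query_adj = {"u0": {"u1", "u2"}, "u1": {"u0"}, "u2": {"u0"}}
# Data: v0 has two suitable neighbors, w0 has only one.
data_adj = {"v0": {"v1", "v2"}, "v1": {"v0", "w0"}, "v2": {"v0"}, "w0": {"v1"}}
cand = {"u0": {"v0", "w0"}, "u1": {"v1", "v2"}, "u2": {"v1", "v2"}}

print(survives_pruning("u0", "v0", query_adj, data_adj, cand))  # True
print(survives_pruning("u0", "w0", query_adj, data_adj, cand))  # False
```

Here w0 is discarded because both query neighbors would have to map to the same data vertex v1, which an embedding does not allow.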
R33
Systems in Industry
Chair: Qizhen Zhang (University of Toronto)
Big Data Analytic Toolkit: A general-purpose, modular, and heterogeneous acceleration toolkit for data analytical engines [industry]
Jiang Li (Intel Corporation)*; Qi Xie (Intel Corporation); Yan Ma (Intel Corporation); Jian Ma (Intel Corporation); Kunshang Ji (Intel Corporation); Yizhong Zhang (Intel Corporation); Chaojun Zhang (Intel Corporation); Yixiu Chen (Intel Corporation); Gangsheng Wu (Intel Corporation); Jie Zhang (Intel Corporation); Kaidi Yang (Intel Corporation); Xinyi He (Intel Corporation); Qiuyang Shen (Intel Corporation); Yanting Tao (Intel Corporation); Haiwei Zhao (Intel Corporation); Penghui Jiao (Intel Corporation); Chengfei Zhu (Intel Corporation); David Qian (Intel Corporation); Cheng Xu (Intel Corporation)
Query compilation and hardware acceleration are important technologies for optimizing the performance of data processing engines. There have been many works on the exploration and adoption of these techniques in recent years. However, a number of engines still refrain from adopting them for several reasons. One common reason is that the intricacies of these techniques make engines too complex to maintain. Another major barrier is the lack of widely accepted architectures and libraries for these techniques, so adoption often starts from scratch and requires substantial effort. In this paper, we propose Intel Big Data Analytic Toolkit (BDTK), an open-source C++ acceleration toolkit library for analytical data processing engines. BDTK provides lightweight, easy-to-connect, reusable components with interoperable interfaces to support query compilation and hardware accelerators. The query compilation in BDTK leverages vectorized execution and data-centric code generation to achieve high performance. BDTK can be integrated into different engines and helps them adopt query compilation and hardware accelerators to address performance bottlenecks with less engineering effort.
Towards General and Efficient Online Tuning for Spark [industry]
Yang Li (Tencent)*; Huaijun Jiang (Peking University); Yu Shen (Peking University); Yide Fang (Tencent); Xiaofeng Yang (Tencent); Danqing Huang (Tencent); Xinyi Zhang (Peking University); Wentao Zhang (Peking University); Ce Zhang (ETH); Peng Chen (Tencent); Bin Cui (Peking University)
Spark, a distributed data analytics system, is a common choice for processing massive volumes of heterogeneous data, but it is challenging to tune its parameters to achieve high performance. Recent studies employ auto-tuning techniques to solve this problem but suffer from three issues: limited functionality, high overhead, and inefficient search.
In this paper, we present a general and efficient Spark tuning framework that can deal with the three issues simultaneously. First, we introduce a generalized tuning formulation, which can support multiple tuning goals and constraints conveniently, and a Bayesian optimization (BO) based solution to solve this generalized optimization problem. Second, to avoid high overhead from additional offline evaluations in existing methods, we propose to tune parameters along with the actual periodic executions of each job (i.e., online evaluations). To ensure safety during online job executions, we design a safe configuration acquisition method that models the safe region. Finally, three innovative techniques are leveraged to further accelerate the search process: adaptive sub-space generation, approximate gradient descent, and meta-learning method.
We have implemented this framework as an independent cloud service and applied it to the data platform at Tencent. The empirical results on both public benchmarks and large-scale production tasks demonstrate its superiority in terms of practicality, generality, and efficiency. Notably, this service saves an average of 57.00% of memory cost and 34.93% of CPU cost on 25K in-production tasks within 20 iterations.
SimpleTS: An Efficient and Universal Model Selection Framework for Time Series Forecasting [industry]
Yuanyuan Yao (Zhejiang University); Dimeng Li (Alibaba Group); Hailiang Jie (Zhejiang University); Lu Chen (Zhejiang University)*; Tianyi Li (Aalborg University); Jie Chen (Alibaba); Jiaqi Wang (Zhejiang University); Feifei Li (Alibaba Group); Yunjun Gao (Zhejiang University)
Time series forecasting, which predicts events through a sequence of time, has received increasing attention in past decades. The diverse range of time series forecasting models presents a challenge for selecting the most suitable model for a given dataset. As such, the Alibaba Cloud database monitoring system must address the issue of selecting an optimal forecasting model for a single time series. While several model selection frameworks, including AutoAI-TS, have been developed to predict a dataset, their effectiveness may be limited as they may not adapt well to all types of time series, resulting in reduced prediction accuracy. Alternatively, models such as AutoForecast, which train on individual data points, may offer better adaptability but are limited by the longer training time required.
In this paper, we introduce SimpleTS, a versatile framework for time series forecasting that exhibits high efficiency and accuracy across all types of time series data. When performing an online prediction task, SimpleTS first classifies input time series into one type, and then efficiently selects the most suitable prediction model for this type. To optimize performance, SimpleTS (i) clusters models with similar performance to improve the efficiency of classification; (ii) uses soft labeling and weighted representation learning to achieve higher classification accuracy for different time series types. Extensive experiments on 3 private datasets and 52 public datasets show that SimpleTS outperforms the state-of-the-art toolkits in terms of both training time and prediction accuracy.
FEBench: A Benchmark for Real-Time Relational Data Feature Extraction [industry]
Xuanhe Zhou (Tsinghua University); Cheng Chen (4Paradigm); Kunyi Li (Tsinghua University); Bingsheng He (National University of Singapore); Mian Lu (4Paradigm)*; Qiaosheng Liu (4Paradigm); Wei Huang (4Paradigm); Guoliang Li (Tsinghua University); Zhao Zheng (4Paradigm); Yuqiang Chen (4Paradigm)
As the use of online AI inference services rapidly expands in various applications (e.g., fraud detection in banking, product recommendation in e-commerce), real-time feature extraction (RTFE) systems have been developed to compute the requested features from incoming data tuples with ultra-low latency. As with relational database queries, these RTFE procedures can be expressed using SQL-like languages. However, there is a lack of research on the workload characteristics and benchmarks for RTFE, especially in comparison with existing database workloads and benchmarks (e.g., concurrent transactions in TPC-C). In this paper, we study the RTFE workload characteristics using over one hundred real datasets from open repositories (e.g., Kaggle, Tianchi, UCI ML, KiltHub) and from 4Paradigm and its customers. The study highlights the significant differences between RTFE workloads and existing database benchmarks in terms of application scenarios, operator distributions, and query structures. Based on these findings, we propose a real-time feature extraction benchmark named FEBench, based on the four important criteria for a domain-specific benchmark proposed by Jim Gray. FEBench consists of selected representative datasets, query templates, and an online request simulator. We use FEBench to evaluate the effectiveness of feature extraction systems including OpenMLDB and Flink, and find that each system exhibits distinct advantages and limitations in terms of overall latency, tail latency, and concurrency performance.
R34
Potpourri Online II (Learning & Mining)
Chair: Laks V.S. Lakshmanan (University of British Columbia)
Self-Training for Label-Efficient Information Extraction from Semi-Structured Web-Pages
Ritesh Sarkhel (Ohio State University)*; Binxuan Huang (Amazon); Colin Lockard (Amazon); Prashant Shiralkar (Amazon)
Information Extraction (IE) from semi-structured web-pages is a long-studied problem. Training a model for this extraction task requires a large number of human-labeled samples. Prior works have proposed transferable models to improve the label-efficiency of this training process. The extraction performance of transferable models, however, depends on the size of their fine-tuning corpus. This holds true for large language models (LLMs) such as GPT-3 as well. Generalist models like LLMs need to be fine-tuned on in-domain, human-labeled samples for competitive performance on this extraction task. Constructing a large-scale fine-tuning corpus with human-labeled samples, however, requires significant effort. In this paper, we develop a Label-Efficient Self-Training Algorithm (LEAST) to improve the label-efficiency of this fine-tuning process. Our contributions are two-fold. First, we develop a semi-supervised generative model that facilitates the construction of a large-scale fine-tuning corpus with minimal human effort. Second, to ensure that the extraction performance does not suffer due to noisy training samples in our fine-tuning corpus, we develop an uncertainty-aware training strategy. Experiments on two publicly available datasets show that LEAST generalizes to multiple verticals and backbone models. Using LEAST, we can train models with fewer than ten human-labeled pages from each website, outperforming strong baselines while reducing the number of human-labeled training samples needed for comparable performance by up to 11x.
Self-supervised and Interpretable Data Cleaning with Sequence Generative Adversarial Networks
Jinfeng Peng (Northeastern University)*; Derong Shen (Northeastern University); Nan Tang (Qatar Computing Research Institute, HBKU); Tieying Liu (Northeastern University); Yue Kou (Northeastern University); Tiezheng Nie (Northeastern University); Hang Cui (University of Illinois at Urbana-Champaign); Ge Yu (Northeastern University)
We study the problem of self-supervised and interpretable data cleaning, which automatically extracts interpretable data repair rules from dirty data. In this paper, we propose a novel framework, namely Garf, based on sequence generative adversarial networks (SeqGAN). One key piece of information Garf tries to capture is data repair rules (for example, if the city is “Dothan”, then the county should be “Houston”). Garf employs a SeqGAN consisting of a generator 𝐺 and a discriminator 𝐷 that trains 𝐺 to learn the dependency relationships (e.g., given a city value “Dothan” as input, the county can be determined as “Houston”). After training, the generator 𝐺 can be used to generate data repair rules, but these may include both trusted and untrusted rules, especially when learning from dirty data. To mitigate this problem, Garf further updates the learned relationships with another discriminator 𝐷′ to iteratively improve the quality of both rules and data. Garf takes advantage of both logical and learning-based methods, which allows cleaning dirty data with high interpretability and without requiring prior knowledge or training data. Extensive experiments on real-world and synthetic datasets demonstrate the effectiveness of Garf. Garf achieves new state-of-the-art data cleaning results with high accuracy by learning from dirty datasets without human supervision.
FirmTruss Community Search in Multilayer Networks
Ali Behrouz (Cornell University)*; Farnoosh Hashemi (The University of British Columbia); Laks V.S. Lakshmanan (The University of British Columbia)
In applications such as biological, social, and transportation networks, interactions between objects span multiple aspects. For accurately modeling such applications, multilayer networks have been proposed. Community search allows for personalized community discovery and has a wide range of applications in large real-world networks. While community search has been widely explored for single-layer graphs, the problem for multilayer graphs has just recently attracted attention. Existing community models in multilayer graphs have several limitations, including disconnectivity, free-rider effect, resolution limits, and inefficiency. To address these limitations, we study the problem of community search over large multilayer graphs. We first introduce FirmTruss, a novel dense structure in multilayer networks, which extends the notion of truss to multilayer graphs. We show that FirmTrusses possess nice structural and computational properties and bring many advantages compared to the existing models. Building on this, we present a new community model based on FirmTruss, called FTCS, and show that finding an FTCS community is NP-hard. We propose two efficient 2-approximation algorithms, and show that no polynomial-time algorithm can have a better approximation guarantee unless P = NP. We propose an index-based method to further improve the efficiency of the algorithms. We then consider attributed multilayer networks and propose a new community model based on network homophily. We show that community search in attributed multilayer graphs is NP-hard and present an effective and efficient approximation algorithm. Experimental studies on real-world graphs with ground-truth communities validate the quality of the solutions we obtain and the efficiency of the proposed algorithms.
Efficient Triangle-Connected Truss Community Search In Dynamic Graphs
Tianyang Xu (Wuhan University); Zhao Lu (Wuhan University); Yuanyuan Zhu (Wuhan University)*
Community search studies the retrieval of certain community structures containing query vertices, and has received considerable attention recently. $k$-truss is a fundamental community structure where each edge is contained in at least $k-2$ triangles. Triangle-connected $k$-truss community ($k$-TTC) is a widely-used variant of $k$-truss, which is a maximal $k$-truss where edges can reach each other via a series of edge-adjacent triangles. Although existing works have provided indexes and query algorithms for $k$-TTC search, the cohesiveness of a $k$-TTC (diameter upper bound) has not been theoretically analyzed and the triangle connectivity has not been efficiently captured. Thus, we revisit the $k$-TTC search problem in dynamic graphs, aiming to achieve a deeper understanding of $k$-TTC. First, we prove that the diameter of a $k$-TTC with $n$ vertices is bounded by $\lfloor\frac{2n}{k+1}\rfloor$. Then, we encapsulate triangle connectivity with two novel concepts, partial class and truss-precedence, based on which we build our compact index, EquiTree, to support efficient $k$-TTC search. We also provide efficient index construction and maintenance algorithms for the dynamic change of graphs. Compared with the state-of-the-art methods, our extensive experiments show that EquiTree can boost search efficiency by up to two orders of magnitude at a small cost of index construction and maintenance.
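The $k$-truss definition above (every edge lies in at least $k-2$ triangles) can be made concrete with a short peeling sketch in Python. It computes only the plain $k$-truss of a small graph; it does not check triangle connectivity and has nothing to do with the EquiTree index. It is a naive illustration under the assumption of integer vertex identifiers, and the function name is hypothetical.

```python
def k_truss_edges(edges, k: int) -> set:
    """Edges of the k-truss: repeatedly drop edges that lie in fewer
    than k - 2 triangles until no such edge remains."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    changed = True
    while changed:
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                # The common neighbors of u and v are exactly the
                # triangles that contain the edge (u, v).
                if u < v and len(adj[u] & adj[v]) < k - 2:
                    adj[u].discard(v)
                    adj[v].discard(u)
                    changed = True

    return {(u, v) for u in adj for v in adj[u] if u < v}

# A 4-clique on vertices 0..3 with one pendant edge (3, 4).
graph = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
print(sorted(k_truss_edges(graph, 4)))
```

In the example, each edge of the 4-clique sits in two triangles, so the clique is a 4-truss, while the pendant edge sits in none and is peeled away.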
R35
Online Demos II
Chair: Goce Trajcevski (Iowa State University)
Lynx: A Graph Query Framework for Multiple Heterogeneous Data Sources [demo]
Zhihong Shen (Chinese Academy of Sciences, Computer Network Information Center); Hu Chuan (UCAS, CAS, CNIC)*; Zihao Zhao (CNIC,CAS,UCAS)
Graph models are increasingly popular among modern applications for their ability to model complex relationships between entities. Users tend to query the data as a graph with graph operations (e.g., graph navigation and exploration). However, a large fraction of the data resides in relational databases or other storage systems. Challenges arise in uniformly querying multiple heterogeneous data sources as a graph. Traditional solutions are limited by time-consuming data integration, expensive development effort, and incomplete support for query requirements. Thus, we developed Lynx, a general graph query framework, to simplify querying graph data by converting complex statements into basic graph operations. Instead of connecting directly to the data sources, Lynx retrieves data through user-implemented interfaces for those graph operations. We demonstrate Lynx's capabilities through real-world scenarios, showcasing its ability to process graph queries on multiple heterogeneous data sources and to serve as a generic framework for developing graph query engines.
ChainDash: An Ad-Hoc Blockchain Data Analytics System [demo]
Yushi Liu (East China Normal University); Liwei Yuan (Blockchain Platform Division, Ant Group); Zhihao Chen (East China Normal University); Yekai Yu (East China Normal University); Zhao Zhang (East China Normal University)*; Cheqing Jin (East China Normal University); Ying Yan (Ant Group)
The emergence of digital asset applications, driven by Web 3.0 and powered by blockchain technology, has led to a growing demand for blockchain-specific graph analytics to unearth insights. However, current blockchain data analytics systems are unable to perform efficient ad-hoc graph analytics over both live and past time windows due to their inefficient data synchronization and slow graph snapshot retrieval. To address these issues, we propose ChainDash, a blockchain data analytics system that features a highly parallelized data synchronization component and a retrieval-optimized temporal graph store. By leveraging these techniques, ChainDash supports efficient ad-hoc graph analytics of smart contract activities over arbitrary time windows. In the demonstration, we showcase the interactive visualization interfaces of ChainDash, where attendees can execute customized queries for ad-hoc graph analytics of blockchain data.
CEDA: Learned Cardinality Estimation with Domain Adaptation [demo]
Zilong Wang (Beijing Jiaotong University); Qixiong Zeng (School of Computer and Information Technology, Beijing Jiaotong University); Ning Wang (School of Computer and Information Technology, Beijing Jiaotong University)*; Haowen Lu (Beijing Jiaotong University); Yue Zhang (Beijing Jiaotong University)
Cardinality Estimation (CE) is a fundamental and critical problem in DBMS query optimization, and deep learning techniques have made significant breakthroughs in CE research. However, apart from requiring sufficiently large training data to cover all possible query regions for accurate estimation, current query-driven CE methods also suffer from workload drifts. Moreover, retraining or fine-tuning needs cardinality labels as ground truth, and obtaining these labels through the DBMS is expensive. Therefore, we propose CEDA, a novel domain-adaptive CE system. CEDA can achieve more accurate estimations by automatically generating workloads as training data according to the data distribution in the database, and by incorporating histogram information into an attention-based cardinality estimator. To solve the problem of workload drifts in real-world environments, CEDA adopts a domain adaptation strategy, making the model more robust so that it performs well on an unlabeled workload whose feature distribution differs greatly from that of the training set.
Sniffer: A Novel Model Type Detection System against Machine-Learning-as-a-Service Platforms [demo]
Zhuo Ma (Xidian University); Yilong Yang (Xidian University); Bin Xiao (Chongqing University of Posts and Telecommunications)*; Yang Liu (Xidian University); Xinjing Liu (Xidian University); Zhuoran Ma (Xidian University); Tong Yang (Peking University)
Recent works explore several attacks against Machine-Learning-as-a-Service (MLaaS) platforms (e.g., the model stealing attack), allegedly posing potential real-world threats beyond viability in laboratories. However, hampered by their sensitivity to the model type, most of the attacks can hardly break mainstream real-world MLaaS platforms. That is, many MLaaS attacks are designed against only one certain type of model, such as tree models or neural networks. As the black-box MLaaS interface hides model type information, the attacker cannot choose a proper attack method with confidence, limiting the attack performance. In this paper, we demonstrate a system, named Sniffer, that is capable of making model-type-sensitive attacks "great again" in real-world applications. Specifically, Sniffer consists of four components: Generator, Querier, Probe, and Arsenal. The first two components prepare attack samples. Probe, the most characteristic component of Sniffer, implements a series of self-designed algorithms to determine the type of the models hidden behind the black-box MLaaS interfaces. With the model type revealed, an optimal method can be selected from Arsenal (which contains multiple attack methods) to carry out the attack. Our demonstration shows how the audience can interact with Sniffer through a web-based interface against five mainstream MLaaS platforms.
TsQuality: Measuring Time Series Data Quality in Apache IoTDB [demo]
Yuanhui Qiu (Tsinghua University); Chenguang Fang (Tsinghua University); Shaoxu Song (Tsinghua University)*; Xiangdong Huang (Tsinghua University); Chen Wang (Timecho Limited); Jianmin Wang (Tsinghua University, China)
Time series data often exhibit various data quality issues, e.g., owing to sensor failure or network transmission errors in the Internet of Things (IoT). There is strong demand for an overview of the data quality issues across the millions of time series stored in a database. In this demo, we design and implement TsQuality, a system for measuring the data quality in Apache IoTDB. Four time series data quality measures (completeness, consistency, timeliness, and validity) are implemented as functions in Apache IoTDB or operators in Apache Spark. These data quality measures are also interpreted by navigating dirty points at different granularities. TsQuality is also well integrated with the big data ecosystem, connecting to Apache Zeppelin for SQL queries and to Apache Superset for an overview of data quality.
A Learned Query Rewrite System [demo]
Xuanhe Zhou (Tsinghua University); Guoliang Li (Tsinghua University)*; Jianming Wu (Tsinghua University); Jiesi Liu (Tsinghua University); Zhaoyan Sun (Tsinghua University); Xinning Zhang (Tsinghua University)
Query rewriting is a challenging task that transforms a SQL query to improve its performance while maintaining its result set. However, it is difficult to rewrite SQL queries, which often involve complex logical structures, and there are numerous candidate rewrite strategies for such queries, making it an NP-hard problem. Existing databases or query optimization engines adopt heuristics to rewrite queries, but these approaches may not be able to judiciously and adaptively apply the rewrite rules and may cause significant performance regression in some cases (e.g., correlated subqueries may not be eliminated). To address these limitations, we introduce LearnedRewrite, a query rewrite system that combines traditional and learned algorithms (i.e., Monte Carlo tree search + hybrid estimator) to rewrite queries. We have implemented the system in Calcite, and experimental results demonstrate LearnedRewrite achieves superior performance on three real datasets. The demo website is publicly available at http://rewrite_demo.dbmind.cn/.
AQUA: Automatic Collaborative Query Processing in Analytical Database [demo]
Yuchen Peng (Zhejiang University); Ke Chen (Zhejiang University)*; Lidan Shou (Zhejiang University); Dawei Jiang (Zhejiang University); Gang Chen (Zhejiang University)
Data analysts nowadays are keen to have analytical capabilities involving deep learning (DL). Collaborative queries, which employ relational operations to process structured data and DL models to process unstructured data, provide a powerful facility for DL-based in-database analysis. The classical approach to support collaborative queries in relational databases is to integrate DL models with user-defined functions (UDFs) in a general-purpose language (e.g., C++) to process unstructured data. This approach suffers from sub-optimal performance as the opaque UDFs preclude the generation of an optimal query plan. A recent work, DL2SQL, addresses the problem of collaborative query optimization by first converting DL computations into SQL subqueries and then using a classical relational query optimizer to optimize the entire collaborative query. However, the DL2SQL approach compromises usability by requiring data analysts to manually manage DL-related data and tune query performance.
To this end, this paper introduces AQUA, an analytical database designed for efficient collaborative query processing. Built on DL2SQL, AQUA automates the translation of collaborative queries into SQL queries. To enhance usability, AQUA introduces two techniques: 1) a declarative scheme for DL-related data management, and 2) DL-specific optimizations for collaborative query processing, relieving data analysts of the burden of manual data management and performance tuning. We demonstrate the key contributions of AQUA via a web app that allows the audience to perform collaborative queries on the CIFAR-10 dataset.
Fanglue: An Interactive System for Decision Rule Crafting [demo]
Chen Qian (Ant Group); Shiwei Liang (Ant Group); Zhaoyang Wang (Ant Group); Yin Lou (Ant Group)*
In many applications the training data do not always contain sufficient information to produce high-quality decision rules for standard (end-to-end) rule mining algorithms, and human experts have to incorporate domain knowledge during rule induction in order to get meaningful results. In this work we present Fanglue, a home-grown system inside Alipay, for interactive decision rule crafting. Fanglue is a distributed in-memory system and is highly responsive when processing large-scale datasets. In addition, Fanglue extends the standard representation of a decision rule by introducing disjunctive clauses. Having disjunctive clauses can improve the coverage and robustness of a decision rule, especially for fraud prevention in Fintech applications.
RESCU-SQL: Oblivious Querying for the Zero Trust Cloud [demo]
Xiling Li (Northwestern University)*; Gefei Tan (Northwestern University); Xiao Wang (Northwestern University); Jennie Rogers (Northwestern University); Soamar Homsi (Air Force Research Laboratory)
Cloud service providers offer robust infrastructure for rent to organizations of all kinds. High stakes applications, such as the ones in defense and healthcare, are turning to the public cloud for a cost-effective, geographically distributed, always available solution to their hosting needs. Many such users are unwilling or unable to delegate their data to this third-party infrastructure.
In this demonstration, we introduce RESCU-SQL, a zero-trust platform for resilient and secure SQL querying outsourced to one or more cloud service providers. RESCU-SQL users can query their DBMS using cloud infrastructure alone without revealing their private records to anyone. It does so by executing the query over secure multiparty computation. We call this system zero trust because it can tolerate any number of malicious servers provided one of them remains honest. Our demo will offer an interactive dashboard with which attendees can observe the performance of RESCU-SQL deployed on several in-cloud nodes for the TPC-H benchmark. Attendees can select a computing party and inject messages from it to explore how quickly it detects and reacts to a malicious party. This is the first SQL system to support all-but-one maliciously secure querying over a semi-honest coordinator for efficiency.
U6
Scalable ML III
Chair: Aida Sheshbolouki (University of Waterloo)
Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming
Cheng-Yu Hsieh (University of Washington); Jieyu Zhang (University of Washington); Alexander Ratner (University of Washington)
Weak Supervision (WS) techniques allow users to efficiently create large training datasets by programmatically labeling data with heuristic sources of supervision. While the success of WS relies heavily on the provided labeling heuristics, the process of how these heuristics are created in practice has remained under-explored. In this work, we formalize the development process of labeling heuristics as an interactive procedure, built around the existing workflow where users draw ideas from a selected set of development data for designing the heuristic sources. With the formalism, shown in Figure 1, we study two core problems: (1) how to strategically select the development data to guide users in efficiently creating informative heuristics, and (2) how to exploit the information within the development process to contextualize and better learn from the resultant heuristics. Building upon two novel methodologies that effectively tackle the respective problems, we present Nemo, an end-to-end interactive system that improves the overall productivity of the WS learning pipeline by an average of 20% (and up to 47% in one task) compared to the prevailing WS approach.
Collective Grounding: Applying Database Techniques to Grounding Templated Models
Eriq Augustine (UCSC)*; Lise Getoor (University of California Santa Cruz)
The process of instantiating, or "grounding", a first-order model is a fundamental component of reasoning in logic. It has been widely studied in the context of theorem proving, database theory, and artificial intelligence. Within the relational learning community, the concept of grounding has been expanded to apply to models that use more general templates in the place of first-order logical formulae. In order to perform inference, grounding of these templates is required for instantiating a distribution over possible worlds. However, because of the complex data dependencies stemming from instantiating generalized templates with interconnected data, grounding is often the key computational bottleneck to relational learning. While we motivate our work in the context of relational learning, similar issues arise in probabilistic databases, particularly those that do not make strong tuple independence assumptions. In this paper, we investigate how key techniques from relational database theory can be utilized to improve the computational efficiency of the grounding process. We introduce the notion of collective grounding, which treats logical programs not as a collection of independent rules, but instead as a joint set of interdependent workloads that can be shared. We present the theoretical concept of collective grounding, the components necessary in a collective grounding system, and implementations of these components, and show how to use database theory to speed up these components. We demonstrate collective grounding's effectiveness on seven popular datasets, and show up to a 70% reduction in runtime using collective grounding. Our results are fully reproducible and all code, data, and experimental scripts are included.
On Efficient Approximate Queries over Machine Learning Models
Dujian Ding (University of British Columbia)*; Sihem Amer-Yahia (CNRS); Laks V.S. Lakshmanan (The University of British Columbia)
The question of answering queries over ML predictions has been gaining attention in the database community. This question is challenging because the cost of finding high quality answers corresponds to invoking an oracle such as a human expert or an expensive deep neural network model on every single item in the DB and then applying the query. We develop a novel unified framework for approximate query answering by leveraging a proxy to minimize the oracle usage of finding high quality answers for both Precision-Target (PT) and Recall-Target (RT) queries. Our framework uses a judicious combination of invoking the expensive oracle on data samples and applying the cheap proxy on the objects in the DB. It relies on two assumptions. Under the Proxy Quality assumption, proxy quality can be quantified in a probabilistic manner w.r.t. the oracle. This allows us to develop two algorithms: PQA that efficiently finds high quality answers with high probability and no oracle calls, and PQE, a heuristic extension that achieves empirically good performance with a small number of oracle calls. Alternatively, under the Core Set Closure assumption, we develop two algorithms: CSC that efficiently returns high quality answers with high probability and minimal oracle usage, and CSE, which extends it to more general settings. Our extensive experiments on five real-world datasets on both query types, PT and RT, demonstrate that our algorithms outperform the state-of-the-art and achieve high result quality with provable statistical guarantees.
Accelerating Aggregation Queries on Unstructured Streams of Data
Matthew D Russo (Stanford University)*; Tatsunori Hashimoto (Stanford); Daniel Kang (UIUC); Yi Sun (University of Chicago); Matei Zaharia (Berkeley and Databricks)
Analysts and scientists are interested in querying streams of video, audio, and text to extract quantitative insights. For example, an urban planner may wish to measure congestion by querying the live feed from a traffic camera. Prior work has used deep neural networks (DNNs) to answer such queries in the batch setting. However, much of this work is not suited for the streaming setting because it requires access to the entire dataset before a query can be submitted or is specific to video. Thus, to the best of our knowledge, no prior work addresses the problem of efficiently answering queries over multiple modalities of streams.
In this work we propose InQuest, a system for accelerating aggregation queries on unstructured streams of data with statistical guarantees on query accuracy. InQuest leverages inexpensive approximation models ("proxies") and sampling techniques to limit the execution of an expensive high-precision model (an "oracle") to a subset of the stream. It then uses the oracle predictions to compute an approximate query answer in real time. We theoretically analyze InQuest and show that the expected error of its query estimates converges on stationary streams at a rate inversely proportional to the oracle budget. We evaluate our algorithm on six real-world video and text datasets and show that InQuest achieves the same root mean squared error (RMSE) as two streaming baselines with up to 5.0x fewer oracle invocations. We further show that InQuest can achieve up to 1.9x lower RMSE at a fixed number of oracle invocations than a state-of-the-art batch setting algorithm.
SIFTER: Space-Efficient Value Iteration for Finite-Horizon MDPs [sds]
Konstantinos Skitsas (Aarhus University); Ioannis G Papageorgiou (NTUA); Mohammad Sadegh Talebi (University of Copenhagen); Verena Kantere (NTUA); Michael Katehakis (Rutgers University); Panagiotis Karras (Aarhus University)*
Can we solve finite-horizon Markov decision processes (FHMDPs) with low memory requirements? Such models find application in many cases where a decision-making agent needs to act in a probabilistic environment, from resource management to medicine to service provisioning. However, computing the optimal policies such an agent should follow by dynamic programming value iteration incurs either prohibitive space complexity or, conversely, non-scalable time complexity. This scalability question has been largely neglected. In this paper, we propose SIFTER (Space Efficient Finite Horizon MDPs), a suite of algorithms that strike a middle ground between space and time requirements. Our first algorithm has space complexity growing with the square root of the horizon's length without a time-complexity overhead, while the second's space requirements depend only logarithmically on the horizon length, with a corresponding logarithmic time-complexity overhead. A thorough experimental study under diverse settings confirms that SIFTER algorithms achieve the predicted gains, while approximation techniques do not achieve the same combination of time efficiency, space efficiency, and result quality.
U7
Scalable ML IV
Chair: Amir Shaikhha (University of Edinburgh)
Marigold: Efficient k-means Clustering in High Dimensions [sds]
Kasper Overgaard Mortensen (Aarhus University); Fatemeh Zardbani (Aarhus University); Mohammad Ahsanul Haque (AAU); Steinn Ymir Agustsson (Aarhus University); Davide Mottin (Aarhus University); Philip Hofmann (Aarhus University); Panagiotis Karras (Aarhus University)*
How can we efficiently and scalably cluster high-dimensional data? The k-means algorithm clusters data by iteratively reducing intracluster Euclidean distance until convergence. While it finds applications from recommendation engines to image segmentation, its application to high-dimensional data is hindered by the need to repeatedly compute Euclidean distances among points and centroids. In this paper, we propose Marigold (k-means for high-dimensional data), a scalable algorithm for k-means clustering in high dimensions. Marigold prunes distance calculations by means of (i) a tight distance-bounding scheme; (ii) a stepwise calculation over a multiresolution transform; and (iii) exploiting the triangle inequality. To our knowledge, such an arsenal of pruning techniques has not been hitherto applied to k-means. Our work is motivated by time-critical Angle-Resolved Photoemission Spectroscopy (ARPES) experiments, where it is vital to detect clusters among high-dimensional spectra in real time. In a thorough experimental study with real-world data sets we demonstrate that Marigold efficiently clusters large high-dimensional data, achieving approximately one order of magnitude improvement over prior art.
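Of the three pruning ideas listed in the abstract, the triangle inequality is the simplest to illustrate. The following is a minimal sketch of that one idea only, in the style of classic accelerated k-means, and not Marigold itself: the distance-bounding scheme and the multiresolution transform are omitted, and the function names `dist` and `assign_with_pruning` are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_with_pruning(points, centroids):
    """Assign each point to its nearest centroid, skipping centroids that
    the triangle inequality proves cannot beat the current best."""
    k = len(centroids)
    # Centroid-to-centroid distances, computed once per assignment pass.
    cc = [[dist(centroids[i], centroids[j]) for j in range(k)] for i in range(k)]
    labels, computed = [], 0
    for p in points:
        best, best_d = 0, dist(p, centroids[0])
        computed += 1
        for j in range(1, k):
            # d(c_best, c_j) <= d(p, c_best) + d(p, c_j), so if
            # d(c_best, c_j) >= 2 * d(p, c_best) then d(p, c_j) >= d(p, c_best).
            if cc[best][j] >= 2 * best_d:
                continue
            d = dist(p, centroids[j])
            computed += 1
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, computed

# Illustrative data (made up for this sketch).
centroids = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0)]
points = [(1.0, 1.0), (99.0, 1.0), (2.0, 98.0)]
labels, computed = assign_with_pruning(points, centroids)
```

In this toy run the assignment matches a brute-force search while evaluating fewer than the `len(points) * len(centroids)` point-to-centroid distances a naive pass would need. The saving per skipped distance grows with dimensionality, which is the setting the abstract targets.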
Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests
Christian Lülf (University of Münster)*; Denis Mayr Lima Martins (University of Münster); Marcos Antonio Vaz Salles (Independent Researcher); Yongluan Zhou (University of Copenhagen); Fabian Gieseke (University of Münster)
The vast amounts of data collected in various domains pose great challenges to modern data exploration and analysis. To find "interesting" objects in large databases, users typically define a query using positive and negative example objects and train a classification model to identify the objects of interest in the entire data catalog. However, this approach requires a scan of all the data to apply the classification model to each instance in the data catalog, making this method prohibitively expensive to be employed in large-scale databases serving many users and queries interactively. In this work, we propose a novel framework for such search-by-classification scenarios that allows users to interactively search for target objects by specifying queries through a small set of positive and negative examples. Unlike previous approaches, our framework can rapidly answer such queries at low cost without scanning the entire database. Our framework is based on an index-aware construction scheme for decision trees and random forests that transforms the inference phase of these classification models into a set of range queries, which in turn can be efficiently executed by leveraging multidimensional indexing structures. Our experiments show that queries over large data catalogs with hundreds of millions of objects can be processed in a few seconds by only a single server, compared to hours needed by classical scanning-based approaches on similar hardware.
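The idea of turning tree inference into range queries can be sketched as follows. Each positive leaf of a decision tree corresponds to an axis-aligned box, so finding all positively classified objects amounts to one range query per positive leaf. This is an illustrative sketch only: the nested-dict tree layout, the function names, and the linear filter standing in for a multidimensional index are all assumptions, and the index-aware tree construction described in the abstract is not shown.

```python
import math

def positive_boxes(node, n_features, box=None):
    """Collect one box per positive leaf. A box is a list of (lo, hi) pairs,
    one per feature, each meaning lo < value <= hi."""
    if box is None:
        box = [(-math.inf, math.inf)] * n_features
    if "label" in node:
        return [list(box)] if node["label"] == 1 else []
    f, t = node["feature"], node["threshold"]
    lo, hi = box[f]
    left = list(box)
    left[f] = (lo, min(hi, t))    # left branch: value <= threshold
    right = list(box)
    right[f] = (max(lo, t), hi)   # right branch: value > threshold
    return (positive_boxes(node["left"], n_features, left)
            + positive_boxes(node["right"], n_features, right))

def in_box(x, box):
    return all(lo < v <= hi for v, (lo, hi) in zip(x, box))

def search(rows, boxes):
    """Stand-in for index-backed range queries: a real system would send
    each box to a multidimensional index instead of filtering a list."""
    return [r for r in rows if any(in_box(r, b) for b in boxes)]

# Hand-built toy tree: positive iff feature 0 > 5.0 and feature 1 <= 2.0.
tree = {"feature": 0, "threshold": 5.0,
        "left": {"label": 0},
        "right": {"feature": 1, "threshold": 2.0,
                  "left": {"label": 1},
                  "right": {"label": 0}}}
boxes = positive_boxes(tree, 2)
rows = [(6.0, 1.0), (5.0, 1.0), (6.0, 3.0), (9.0, -4.0)]
hits = search(rows, boxes)
```

For a random forest, the same extraction would run per tree, with the per-tree results combined according to the forest's voting rule.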
Similarity search in the blink of an eye with compressed indices
Cecilia Aguerrebere (Intel Labs)*; Ishwar Singh Bhati (Intel); Mark Hildebrand (Intel Corporation); Mariano Tepper (Intel Labs); Theodore L Willke (Intel Labs)
Nowadays, data is represented by vectors. Retrieving those vectors, among millions and billions, that are similar to a given query is a ubiquitous problem, known as similarity search, of relevance for a wide range of applications. Graph-based indices are currently the best performing techniques for billion-scale similarity search. However, their random-access memory pattern presents challenges to realize their full potential. In this work, we present new techniques and systems for creating faster and smaller graph-based indices. To this end, we introduce a novel vector compression method, Locally-adaptive Vector Quantization (LVQ), that uses per-vector scaling and scalar quantization to improve search performance with fast similarity computations and a reduced effective bandwidth, while decreasing memory footprint and barely impacting accuracy. LVQ, when combined with a new high-performance computing system for graph-based similarity search, establishes the new state of the art in terms of performance and memory footprint. For billions of vectors, LVQ outcompetes the second-best alternatives: (1) in the low-memory regime, by up to 20.7x in throughput with up to a 3x memory footprint reduction, and (2) in the high-throughput regime by 5.8x with 1.4x less memory.
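Per-vector scaling with scalar quantization can be sketched in a few lines. The version below is a simplified illustration, assuming each vector is scaled by its own minimum and maximum and each component is stored as a `bits`-wide integer code. The published LVQ method may include further steps that are omitted here, and the function names are hypothetical.

```python
def quantize(vec, bits=8):
    """Quantize one vector using its own range (per-vector scaling).
    Returns integer codes plus the two floats needed to decode them."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / step) for v in vec]
    return codes, lo, step

def dequantize(codes, lo, step):
    """Reconstruct an approximation of the original vector."""
    return [lo + c * step for c in codes]

# Illustrative vector (made up for this sketch).
vec = [0.1, -0.4, 0.25, 0.9]
codes, lo, step = quantize(vec, bits=8)
approx = dequantize(codes, lo, step)
max_err = max(abs(a - b) for a, b in zip(vec, approx))
```

Because the step size adapts to each vector's range, the reconstruction error per component is bounded by half a step, while storage drops from one float per component to `bits` bits plus two floats per vector.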
SHiFT: An Efficient, Flexible Search Engine for Transfer Learning
Cedric Renggli (UZH)*; Xiaozhe Yao (ETH Zurich); Luka Kolar (ETH Zurich); Luka Rimanic (ETH Zurich); Ana Klimovic (ETH Zurich); Ce Zhang (ETH)
Transfer learning can be seen as a data- and compute-efficient alternative to training models from scratch. The emergence of rich model repositories, such as TensorFlow Hub, enables practitioners and researchers to unleash the potential of these models across a wide range of downstream tasks. As these repositories keep growing exponentially, efficiently selecting a good model for the task at hand becomes paramount. However, a single generic search strategy (e.g., taking the model with the highest linear classifier accuracy) does not lead to optimal model selection for diverse downstream tasks. In fact, using hybrid or mixed strategies can often be beneficial. Therefore, we propose SHiFT, the first downstream task-aware, flexible, and efficient model search engine for transfer learning. Users interface with SHiFT using the SHiFT-QL query language, which gives users the flexibility to customize their search criteria. We optimize SHiFT-QL queries using a cost-based decision maker and evaluate them on a wide range of tasks. Motivated by the iterative nature of machine learning development, we further support efficient incremental executions of our queries, which requires a special implementation when jointly used with our optimizations.
TOD: GPU-accelerated Outlier Detection via Tensor Operations
Yue Zhao (University of Southern California)*; George H Chen (Carnegie Mellon University); Zhihao Jia (Carnegie Mellon University)
Outlier detection (OD) is a key machine learning task for finding rare and deviant data samples, with many time-critical applications such as fraud detection and intrusion detection. In this work, we propose TOD, the first tensor-based system for efficient and scalable outlier detection on distributed multi-GPU machines. A key idea behind TOD is decomposing complex OD applications into a small collection of basic tensor algebra operators. This decomposition enables TOD to accelerate OD computations by leveraging recent advances in deep learning infrastructure in both hardware and software. Moreover, to deploy memory-intensive OD applications on modern GPUs with limited on-device memory, we introduce two key techniques. First, provable quantization speeds up OD computations and reduces its memory footprint by automatically performing specific floating-point operations in lower precision while provably guaranteeing no accuracy loss. Second, to exploit the aggregated compute resources and memory capacity of multiple GPUs, we introduce automatic batching, which decomposes OD computations into small batches for both sequential execution on a single GPU and parallel execution across multiple GPUs.
TOD supports a diverse set of OD algorithms. Evaluation on 11 real-world and 3 synthetic OD datasets shows that TOD is on average 10.9x faster than the leading CPU-based OD system PyOD (with a maximum speedup of 38.9x), and can handle much larger datasets than existing GPU-based OD systems. In addition, TOD allows easy integration of new OD operators, enabling fast prototyping of emerging and yet-to-be-discovered OD algorithms.
V7
Learned Indexes & Query Processing/Optimization I
Chair: Ibrahim Sabek (University of Southern California)
DILI: A Distribution-Driven Learned Index
Pengfei Li (Alibaba Group)*; Hua Lu (Roskilde University); Rong Zhu (Alibaba Group); Bolin Ding (Data Analytics and Intelligence Lab, Alibaba Group); Long Yang (Peking University); Gang Pan (Zhejiang University)
Targeting in-memory one-dimensional search keys, we propose a novel DIstribution-driven Learned Index tree (DILI), where a concise and computation-efficient linear regression model is used for each node. An internal node's key range is equally divided by its child nodes such that a key search enjoys perfect model prediction accuracy to find the relevant leaf node. A leaf node uses machine learning models to generate searchable data layout and thus accurately predicts the data record position for a key. To construct DILI, we first build a bottom-up tree with linear regression models according to global and local key distributions. Using the bottom-up tree, we build DILI in a top-down manner, individualizing the fanouts for internal nodes according to local distributions. DILI strikes a good balance between the number of leaf nodes and the height of the tree, two critical factors of key search time. Moreover, we design flexible algorithms for DILI to efficiently insert and delete keys and automatically adjust the tree structure when necessary. Extensive experimental results show that DILI outperforms the state-of-the-art alternatives on different kinds of workloads.
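The claim that an equally divided key range gives perfect prediction at internal nodes can be illustrated with a small sketch. When a node's range is split into `fanout` equal-width children, a linear function of the key identifies the child exactly, so no correction search is needed. The function name and parameters below are hypothetical, and leaf-level models and tree construction are not shown.

```python
def child_index(key, lower, upper, fanout):
    """Linear model for an internal node whose range [lower, upper) is
    split into `fanout` equal-width child ranges."""
    idx = int((key - lower) * fanout / (upper - lower))
    # Clamp so that keys at or beyond the boundaries map to a valid child.
    return min(max(idx, 0), fanout - 1)

# A node covering [0, 100) with four children of width 25 each.
routes = [child_index(k, 0, 100, 4) for k in (0, 24.9, 25, 50, 99.9)]
```

Each lookup at an internal node therefore costs one multiplication, one division, and a truncation, independent of the fanout.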
FASTgres: Making Learned Query Optimizer Hinting Effective
Lucas Woltmann (Technische Universität Dresden); Jerome Thiessat (TU Dresden); Claudio Hartmann (Technische Universität Dresden); Dirk Habich (TU Dresden)*; Wolfgang Lehner (TU Dresden)
The traditional and well-established cost-based query optimizer approach enumerates different execution plans for each query, assesses each plan with costs, and selects the plan that promises the lowest costs for execution. However, the optimal execution plan is not always selected. To steer the optimizer in the right direction, many query optimizers provide configuration parameters called query optimizer hints. These hints can be set for every single query separately. To show the great potential of these hints for the optimization of analytical queries, we present results of a comprehensive and in-depth evaluation using three benchmarks and two different versions of the open-source database system PostgreSQL. In particular, we highlight that query optimizer hinting is a non-trivial challenge. To solve this challenge, we propose FASTgres, a learning-based context-aware classification strategy for hint set prediction. Compared to related work, FASTgres provides transparent and direct hint set predictions with consistent performance improvements. In our end-to-end evaluation, we demonstrate that FASTgres effectively reduces benchmark runtimes by a factor of up to 3.25x solely by steering the cost-based optimizer.
Lero: A Learning-to-Rank Query Optimizer
Rong Zhu (Alibaba Group)*; Wei Chen (Alibaba); Bolin Ding (Data Analytics and Intelligence Lab, Alibaba Group); Xingguang Chen (The Chinese University of Hong Kong); Andreas Pfadler (Alibaba Group); Ziniu Wu (Massachusetts Institute of Technology); Jingren Zhou (Alibaba Group)
A recent line of work applies machine learning techniques to assist or rebuild cost-based query optimizers in DBMS. While exhibiting superiority in some benchmarks, these approaches have deficiencies, e.g., unstable performance, high training cost, and slow model updating, that stem from the inherent hardness of predicting the cost or latency of execution plans using machine learning models. In this paper, we introduce a learning-to-rank query optimizer, called Lero, which builds on top of a native query optimizer and continuously learns to improve the optimization performance. The key observation is that the relative order or rank of plans, rather than the exact cost or latency, is sufficient for query optimization. Lero employs a pairwise approach to train a classifier to compare any two plans and tell which one is better. Such a binary classification task is much easier than the regression task of predicting the cost or latency, in terms of model efficiency and accuracy. Rather than building a learned optimizer from scratch, Lero is designed to leverage decades of wisdom of databases and improve the native query optimizer. With its non-intrusive design, Lero can be implemented on top of any existing DBMS with minimal integration efforts. We implement Lero and demonstrate its outstanding performance using PostgreSQL. In our experiments, Lero achieves near optimal performance on several benchmarks. It reduces the plan execution time of the native optimizer in PostgreSQL by up to 70% and other learned query optimizers by up to 37%. Meanwhile, Lero continuously learns and automatically adapts to query workloads and changes in data.
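The pairwise idea can be illustrated with a toy comparator: rather than regressing latency, a binary classifier is trained on pairs of plans to predict which one is faster. The sketch below is not Lero's model. It assumes plans are already encoded as small numeric feature vectors and swaps in a plain logistic model over feature differences, and the features, latencies, and function names are made up for illustration.

```python
import math
import random

def train_comparator(plans, latencies, epochs=200, lr=0.1, seed=0):
    """Fit weights w so that sigmoid(w . (a - b)) estimates the
    probability that plan a runs faster than plan b."""
    rng = random.Random(seed)
    dim = len(plans[0])
    w = [0.0] * dim
    # All ordered pairs with distinct latencies; ties carry no ranking signal.
    pairs = [(i, j) for i in range(len(plans)) for j in range(len(plans))
             if i != j and latencies[i] != latencies[j]]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for i, j in pairs:
            diff = [a - b for a, b in zip(plans[i], plans[j])]
            label = 1.0 if latencies[i] < latencies[j] else 0.0
            z = sum(wk * dk for wk, dk in zip(w, diff))
            pred = 1.0 / (1.0 + math.exp(-z))
            # Stochastic gradient step on the logistic loss.
            for k in range(dim):
                w[k] += lr * (label - pred) * diff[k]
    return w

def pick_best(candidates, w):
    """Keep whichever plan the comparator prefers in each pairwise duel."""
    best = candidates[0]
    for c in candidates[1:]:
        diff = [a - b for a, b in zip(c, best)]
        if sum(wk * dk for wk, dk in zip(w, diff)) > 0:
            best = c
    return best

# Hypothetical plan features with observed latencies (synthetic data).
train_plans = [(1, 1), (2, 1), (1, 4), (3, 2), (4, 4), (2, 6), (5, 1)]
train_latency = [3, 5, 6, 8, 12, 10, 11]
w = train_comparator(train_plans, train_latency)
chosen = pick_best([(4, 3), (1, 2), (3, 5)], w)
```

The comparator never has to output a calibrated latency; it only has to order any two candidates correctly, which is the weaker requirement the abstract argues is sufficient for plan selection.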
Learned Index Benefits: Machine Learning Based Index Performance Estimation
Jiachen Shi (Nanyang Technological University); Gao Cong (Nanyang Technological University); Xiaoli Li (Institute for Infocomm Research, A*STAR, Singapore/Nanyang Technological University)
Index selection remains one of the most challenging problems in relational database management systems. To find an optimum index configuration for a workload, accurately and efficiently quantifying the benefits of each candidate index configuration is indispensable. As materializing each index configuration candidate and physically executing queries is infeasible, most index tuners rely on cost estimations from the optimizer through the "what-if" API. However, "what-if" based index benefit estimations have two limitations. First, they generate significant errors, which compromise index recommendation quality. Second, generating query plans and benefit estimations for each candidate index configuration takes a considerable amount of time. To address these two challenges in index selection, we propose an effective end-to-end machine learning based index benefit estimator. In particular, we propose novel feature extraction and encoding techniques that do not rely on "what-if" calls to generate a query plan for each index configuration candidate. In addition, we design an attention mechanism to address the index interaction issue and aggregate the impacts of different query operations. Finally, we leverage transfer learning to improve the estimator's ability to adapt to a new database. Comprehensive experiments are conducted on different workloads, and extensive experimental results show that our proposed method outperforms "what-if" based index benefit estimations in terms of accuracy and efficiency. In addition, integrating our method into existing index selection algorithms can significantly improve index recommendation quality.
V8
Learned Indexes & Query Processing/Optimization II
Chair: Kurt Stockinger (University of Zurich)
The Case for Learned In-Memory Joins [eab]
Ibrahim Sabek (Massachusetts Institute of Technology)*; Tim Kraska (Massachusetts Institute of Technology)
In-memory join is an essential operator in any database engine. It has been extensively investigated in the database literature. In this paper, we study whether exploiting CDF-based learned models to boost join performance is practical. To the best of our knowledge, we are the first to fill this gap. We investigate the usage of CDF-based models and learned indexes (e.g., Recursive Model Index (RMI) and RadixSpline) in three join categories: indexed nested loop join (INLJ), sort-based joins (SJ), and hash-based joins (HJ). Our study shows that there is room to improve the performance of all three join categories through our proposed optimized learned variants. Our experimental analysis shows that these optimized learned variants outperform the state-of-the-art techniques in many scenarios and with different datasets.
ADOPT: Adaptively Optimizing Attribute Orders for Worst-Case Optimal Join Algorithms via Reinforcement Learning
Junxiong Wang (Cornell University)*; Immanuel Trummer (Cornell); Ahmet Kara (University of Zurich); Dan Olteanu (University of Zurich)
The performance of worst-case optimal join algorithms depends on the order in which the join attributes are processed. Selecting good orders before query execution is hard, due to the large space of possible orders and unreliable execution cost estimates in case of data skew or data correlation. We propose ADOPT, a query engine that combines adaptive query processing with a worst-case optimal join algorithm, which uses an order on the join attributes instead of a join order on relations. ADOPT divides query execution into episodes in which different attribute orders are tried. Based on run time feedback on attribute order performance, ADOPT converges quickly to near-optimal orders. It avoids redundant work across different orders via a novel data structure, keeping track of parts of the join input that have been successfully processed. It selects attribute orders to try via reinforcement learning, balancing the need for exploring new orders with the desire to exploit promising orders. In experiments with various data sets and queries, it outperforms baselines, including commercial and open-source systems using worst-case optimal join algorithms, whenever queries become complex and therefore difficult to optimize.
Simple Adaptive Query Processing vs. Learned Query Optimizers: Observations and Analysis [eab]Yunjia Zhang (University of Wisconsin-Madison)*; Yannis Chronis (University of Wisconsin Madison); Jignesh M. Patel (Carnegie Mellon University); Theodoros Rekatsinas (ETH Zurich)
There have been many decades of work on optimizing query processing in database management systems. Recently, modern machine learning (ML), and specifically reinforcement learning (RL), has gained increased attention as a means to develop a query optimizer (QO). In this work, we take a closer look at two recent state-of-the-art (SOTA) RL-based QO methods to better understand their behavior. We find that these RL-based methods do not generalize as well as it seems at first glance. Thus, we ask a simple question: How do SOTA RL-based QOs compare to a simple, modern, adaptive query processing approach? To answer this question, we choose two simple adaptive query processing techniques and implement them in PostgreSQL. The first adapts an individual join operation on-the-fly and switches between a Nested Loop Join algorithm and a Hash Join algorithm to avoid sub-optimal join algorithm decisions. The second is a technique called Lookahead Information Passing (LIP), in which adaptive semijoin techniques are used to make a pipeline of join operations execute efficiently. To our surprise, we find that this simple adaptive query processing approach is not only competitive with the SOTA RL-based approaches but, in some cases, outperforms them. The adaptive approach is also appealing because it does not require an expensive training step, and it is fully interpretable compared to the RL-based QO approaches. Further, the adaptive method works across complex query constructs that RL-based QO methods currently cannot optimize.
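The first technique, switching a single join between a nested loop and a hash join at run time, can be illustrated with a minimal Python sketch. This is not the authors' PostgreSQL implementation: the function name `adaptive_join` and the `switch_after` threshold (a count of outer rows seen) are assumptions made purely for illustration, standing in for whatever run-time signal a real engine would use.

```python
from collections import defaultdict


def adaptive_join(outer, inner, key_outer, key_inner, switch_after=8):
    """Equi-join that starts as a nested loop join and may switch to a hash join.

    `switch_after` is a hypothetical threshold: once that many outer rows
    have been processed, the remaining outer rows probe a hash table instead
    of rescanning the inner input.
    """
    inner = list(inner)
    hash_table = None
    results = []
    for seen, o in enumerate(outer):
        if hash_table is None and seen >= switch_after:
            # Switch point: pay the build cost once, then probe in O(1).
            hash_table = defaultdict(list)
            for i in inner:
                hash_table[key_inner(i)].append(i)
        if hash_table is None:
            # Nested loop phase: scan the whole inner input for this row.
            matches = [i for i in inner if key_inner(i) == key_outer(o)]
        else:
            # Hash join phase: look up matching inner rows directly.
            matches = hash_table.get(key_outer(o), [])
        results.extend((o, i) for i in matches)
    return results
```

Because both phases emit matches in inner-input order, the result is identical to a plain nested loop join regardless of where the switch happens; only the cost profile changes.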
Can Learned Models Replace Hash Functions? [eab]Ibrahim Sabek (Massachusetts Institute of Technology)*; Kapil Vaidya (Massachusetts Institute of Technology); Dominik Horn (Technical University of Munich (TUM)); Andreas Kipf (Amazon Web Services); Michael Mitzenmacher (Harvard); Tim Kraska (Massachusetts Institute of Technology)
Hashing is a fundamental operation in database management, playing a key role in the implementation of numerous core database data structures and algorithms. Traditional hash functions aim to mimic a function that maps a key to a random value, which can result in collisions, where multiple keys are mapped to the same value. There are many well-known schemes like chaining, probing, and cuckoo hashing to handle collisions. In this work, we aim to study if using learned models instead of traditional hash functions can reduce collisions and whether such a reduction translates to improved performance, particularly for indexing and joins. We show that learned models reduce collisions in some cases, which depend on how the data is distributed. To evaluate the effectiveness of learned models as hash functions, we test them with bucket chaining, linear probing, and cuckoo hash tables. We find that learned models can (1) yield a 1.4x lower probe latency, and (2) reduce the non-partitioned hash join runtime by 28% over the next best baseline for certain datasets. On the other hand, if the data distribution is not suitable, we either do not see gains or see worse performance. In summary, we find that learned models can indeed outperform hash functions, but only for certain data distributions.
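The dependence on data distribution can be seen in a small Python sketch. It assumes the simplest possible "learned model", a straight line fitted through the minimum and maximum key as an approximation of the key CDF; the paper evaluates richer models, and the names `make_learned_hash` and `count_collisions` are hypothetical.

```python
def make_learned_hash(sorted_keys, num_buckets):
    """Build a hash function from a linear approximation of the key CDF.

    Assumption for illustration: the CDF is modeled by interpolating
    between the smallest and largest key only.
    """
    lo, hi = sorted_keys[0], sorted_keys[-1]
    span = max(hi - lo, 1)

    def learned_hash(key):
        cdf = (key - lo) / span  # estimated fraction of keys below `key`
        return min(int(cdf * num_buckets), num_buckets - 1)

    return learned_hash


def count_collisions(keys, hash_fn, num_buckets):
    """Count keys that land in a bucket already taken by another key."""
    counts = [0] * num_buckets
    for k in keys:
        counts[hash_fn(k)] += 1
    return sum(c - 1 for c in counts if c > 1)
```

On evenly spaced keys such as `range(1000, 2000, 10)` with 100 buckets, the linear model spreads the keys one per bucket, whereas on exponentially growing keys such as powers of two most keys pile into the first bucket. This mirrors the abstract's conclusion that gains appear only when the model fits the distribution.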
SkinnerMT: Parallelizing for Efficiency and Robustness in Adaptive Query Processing on Multicore Platforms Ziyun Wei (Cornell University)*; Immanuel Trummer (Cornell)
SkinnerMT is an adaptive query processing engine, specialized for multi-core platforms. SkinnerMT features different strategies for parallel processing that allow users to trade between average run time and performance robustness. First, SkinnerMT supports execution strategies that execute multiple query plans in parallel, thereby reducing the risk of finding near-optimal plans late and improving robustness. Second, SkinnerMT supports data-parallel processing strategies. Its parallel multi-way join algorithm is sensitive to the assignment of tuples to threads. Here, SkinnerMT uses a cost-based optimization strategy, based on runtime feedback. Finally, SkinnerMT supports hybrid processing methods, mixing parallel search with data-parallel processing. The experiments show that parallel search increases robustness while parallel processing increases average-case performance. The hybrid approach combines advantages from both. Compared to traditional database systems, SkinnerMT is preferable for benchmarks where query optimization is hard. Compared to prior adaptive processing baselines, SkinnerMT exploits parallelism better.
Tutorial-1
Private Information Retrieval in Large Scale Public Data Repositories [tutorial]Ishtiyaque Ahmad (University of California at Santa Barbara)*; Divyakant Agrawal (University of California at Santa Barbara); Amr El Abbadi (UC Santa Barbara); Trinabh Gupta (UCSB)
The tutorial focuses on Private Information Retrieval (PIR), which allows clients to privately query public or server-owned databases without disclosing their queries. The tutorial covers the basic concepts of PIR such as its types, construction, and critical building blocks, including homomorphic encryption. It also discusses the performance of PIR, existing optimizations for scalability, real-life applications of PIR, and ways to extend its functionalities.
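As a concrete picture of what "querying without disclosing the query" means, here is a minimal Python sketch of the classic two-server, XOR-based PIR scheme. It is chosen only because it fits in a few lines: it assumes two non-colluding servers holding identical copies of the database and uses no homomorphic encryption, so it is not the single-server construction the tutorial emphasizes. All function names are hypothetical.

```python
import secrets


def make_queries(n, index):
    """Client: encode a request for record `index` as two bit vectors.

    Each vector on its own is uniformly random, so neither server learns
    the index; the two vectors differ only at the wanted position.
    """
    q1 = [secrets.randbelow(2) for _ in range(n)]
    q2 = list(q1)
    q2[index] ^= 1
    return q1, q2


def answer(database, query):
    """Server: XOR together all records selected by the query bits."""
    acc = 0
    for record, bit in zip(database, query):
        if bit:
            acc ^= record
    return acc


def reconstruct(answer1, answer2):
    """Client: records selected by both queries cancel, leaving the target."""
    return answer1 ^ answer2
```

A client would call `make_queries(len(db), i)`, send one vector to each server, and XOR the two replies to obtain `db[i]`. Each server touches every selected record, which hints at why the scalability optimizations covered in the tutorial matter.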
Tutorial-2
Databases on Modern Networks: A Decade of Research That Now Comes into Practice [tutorial]Alberto Lerner (University of Fribourg)*; Carsten Binnig (TU Darmstadt); Philippe Cudré-Mauroux (Exascale Infolab, Fribourg University); Rana Hussein (University of Fribourg); Matthias Jasny (TU Darmstadt); Theo Jepsen (USI); Dan Ports (MSR); Lasse Thostrup (TU Darmstadt); Tobias Ziegler (TU Darmstadt)
Modern cloud networks are a fundamental pillar of data-intensive applications. They provide high-speed transaction (packet) rates and low overhead, enabling, for instance, truly scalable database designs. These networks, however, are fundamentally different from conventional ones. Arguably, the two key discerning technologies are RDMA and programmable network devices. Today, these technologies are not niche technologies anymore and are widely deployed across all major cloud vendors. The question is thus not if but how a new breed of data-intensive applications can benefit from modern networks, given the perceived difficulty in using and programming them. This tutorial addresses these challenges by exposing how the underlying principles changed as the network evolved and by presenting the new system design opportunities they opened. In the process, we also discuss several hard-earned lessons accumulated by making the transition first-hand.
Tutorial-3
Full-Power Graph Querying: State of the Art and Challenges [tutorial]Ioana Manolescu (Inria and Institut Polytechnique de Paris); Madhulika Mohanty (Inria Saclay)*
Graph databases are enjoying enormous popularity, through both their RDF and Property Graphs (PG) incarnations, in a variety of applications. To query graphs, query languages provide structured as well as unstructured primitives. While structured queries allow expressing precise information needs, they are unsuited for exploring unfamiliar datasets, as they require prior knowledge of the schema and structure of the dataset. Keyword search in graph databases, the subject of prior research, does not suffer from this limitation. However, keyword queries do not allow expressing precise search criteria when users do know some.
This tutorial (1.5 hours) builds a continuum between structured graph querying through languages such as SPARQL and GPML, a recently proposed standard for PG querying, on one hand, and graph keyword search, on the other hand. In this space between querying and information retrieval, we analyze the features of modern query languages that go toward unstructured search, discuss their strengths and limitations, and compare their computational complexity. In particular, we focus on (i) lessons learned from the rich literature of graph keyword search, in particular with respect to result scoring; (ii) language mechanisms for integrating both complex structured querying and powerful methods to search for connections users do not know in advance. We conclude by discussing the open challenges and future work directions.
Tutorial-4
Efficient Execution of User-Defined Functions in SQL Queries [tutorial]Alkis Simitsis (Athena Research Center)*; Yannis E Foufoulas (University of Athens)
User-defined functions (UDFs) have been widely used to overcome the expressivity limitations of SQL and complement its declarative nature with functional capabilities. UDFs are particularly useful in today's applications that involve complex data analytics and machine learning algorithms and logic. However, UDFs pose significant performance challenges in query processing and optimization, largely due to the mismatch of the UDF execution and SQL processing environments. In this tutorial, we present state-of-the-art methods and systems towards efficient execution of UDFs in SQL queries. We focus on low-level techniques for physical optimization and compilation of UDF queries, describe and compare the core, recent approaches in the area, discuss their advantages and limitations, identify critical gaps in theory and practice, and propose promising future research directions.
Tutorial-5
Data and AI Model Markets: Opportunities for Data and Model Sharing, Discovery, and Integration [tutorial]Jian Pei (Simon Fraser University)*; Raul Castro Fernandez (The University of Chicago); Xiaohui Yu (York University)
In this tutorial, we aim to provide a comprehensive and interdisciplinary introduction to data and AI model markets. Unlike a few recent surveys and tutorials that concentrate only on the economics aspect, we take a novel perspective and examine data and AI model markets as grand opportunities to address the long-standing problem of data and model sharing, discovery, and integration. This is a core theme in data management and data science. We define data and AI model markets as places and mechanisms that enable multiple parties to share, discover, and integrate data and AI resources and generate added value. We motivate the importance of data and model markets using practical examples, present the current industry landscape of such markets, explore the modules and options of such markets from multiple dimensions, including assets in the markets (e.g., data versus models), platforms, and participants. Furthermore, we summarize the latest advancements and examine the future directions of data and AI model markets as mechanisms for enabling and facilitating sharing, discovery, and integration.
Tutorial-6
Machine Learning for Subgraph Extraction: Methods, Applications and Challenges [tutorial]Kai Siong Yow (Nanyang Technological University)*; Ningyi Liao (Nanyang Technological University); Siqiang Luo (Nanyang Technological University); Reynold Cheng (The University of Hong Kong, China)
Subgraphs are obtained by extracting a subset of vertices and a subset of edges from the associated original graphs, and many graph properties are known to be inherited by subgraphs. Subgraphs can be applied in many areas such as social networks, recommender systems, biochemistry and fraud discovery. Researchers from various communities have paid a great deal of attention to investigating numerous subgraph problems, proposing algorithms that mainly extract important structures of a given graph. There are, however, some limitations that should be addressed with regard to the efficiency, effectiveness and scalability of these traditional algorithms. As a consequence, machine learning techniques, one of the latest trends, have recently been employed in the database community to address various subgraph problems, considering that they have been shown to be beneficial in dealing with graph-related problems. We discuss learning-based approaches for four well-known subgraph problems in this tutorial, namely subgraph isomorphism, maximum common subgraph, community detection and community search problems. We give a general description of each proposed model, and analyse its design and performance. To allow further investigations on relevant subgraph problems, we suggest some potential future directions in this area. We believe that this work can be used as one of the primary resources for researchers who intend to develop learning models for solving problems that are closely related to subgraphs.
Tutorial-7
Building a Collaborative Data Analytics System: Opportunities and Challenges [tutorial]Zuozhi Wang (U C IRVINE)*; Chen Li (UC Irvine)
Real-time collaboration has become increasingly important in various applications, from document creation to data analytics. Although collaboration features are prevalent in editing applications, they remain rare in data-analytics applications, where the need for collaboration is arguably even more crucial. This tutorial aims to provide attendees with a comprehensive understanding of the challenges and design decisions associated with supporting real-time collaboration and user interactions in data analytics systems. We will discuss popular conflict resolution technologies, the unique challenges of facilitating collaborative experiences during the workflow construction and execution phases, and the complexities of supporting responsive user interactions during job execution.
Tutorial-9
Natural Language Interfaces for Databases with Deep Learning [tutorial]George Katsogiannis-Meimarakis (Athena Research Center)*; Mike Xydas (Athena R.C.); Georgia Koutrika (ATHENA Research Center)
In the age of the Digital Revolution, almost all human activities, from industrial and business operations to medical and academic research, are reliant on the constant integration and utilisation of ever-increasing volumes of data. However, the explosive volume and complexity of data make data querying and exploration challenging even for experts, and make the need to democratise the access to data, even for non-technical users, all the more evident. It is time to lift all technical barriers, by empowering users to access relational databases through conversation. We consider 3 main research areas that a natural language data interface is based on: Text-to-SQL, SQL-to-Text, and Data-to-Text. The purpose of this tutorial is a deep dive into these areas, covering state-of-the-art techniques and models, and explaining how the progress in the deep learning field has led to impressive advancements. We will present benchmarks that sparked research and competition, and discuss open problems and research opportunities, with one of the most important challenges being the integration of these 3 research areas into one conversational system.
Time series data are ubiquitous; large volumes of such data are routinely created in scientific, industrial, entertainment, medical and biological domains. Examples include ECG data, gait analysis, stock market quotes, machine health telemetry, search engine throughput volumes etc. VLDB has traditionally been a home to much of the community’s best research on time series, with three to eight papers on time series appearing in the conference each year. What do we want to do with such time series? Everything! Classification, clustering, joins, anomaly detection, motif discovery, similarity search, visualization, summarization, compression, segmentation, rule discovery etc. Rather than a deep dive in just one of these subtopics, in this tutorial I will show the handful of high-level tools, representations, definitions, distance measures and primitives, that can be combined to solve the first 90 to 99.9% of the problems listed above. The tutorial will be illustrated with numerous real-world examples created just for this tutorial, including examples from robotics, wearables, medical telemetry, astronomy, and (especially) animal behavior. Moreover, all sample datasets and code snippets will be released so that after the tutorial the attendees can first reproduce the results demonstrated, before attempting similar analysis on their data.
Tutorial-11
A Tutorial on Visual Representations of Relational Queries [tutorial]Wolfgang Gatterbauer (Northeastern University)*
Query formulation is increasingly performed by systems that need to guess a user’s intent (e.g. via spoken word interfaces). But how can a user know that the computational agent is returning answers to the “right” query? More generally, given that relational queries can become pretty complicated, how can we help users understand existing relational queries, whether human-generated or automatically generated? Now seems the right moment to revisit a topic that predates the birth of the relational model: developing visual metaphors that help users understand relational queries.
This lecture-style tutorial surveys the key visual metaphors developed for visual representations of relational expressions. We will survey the history and state of the art of relationally complete diagrammatic representations of relational queries, discuss the key visual metaphors developed in over a century of investigating diagrammatic languages, and organize the landscape by mapping the visual alphabets they use to the syntax and semantics of Relational Algebra (RA) and Relational Calculus (RC).
Demo-Group-A
PSFQ: A Blockchain-based Privacy-preserving and Verifiable Student Feedback Questionnaire Platform [demo]Wangze Ni (Hong Kong University of Science and Technology); Pengze Chen (Hong Kong University of Science and Technology); Lei Chen (Hong Kong University of Science and Technology)*
Recently, more and more higher education institutions have been using student feedback questionnaires (SFQ) to evaluate teaching. However, existing SFQ systems have two shortcomings. The first is that the respondent of an SFQ is not anonymous. The second is that the statistical report of SFQs can be manipulated. To tackle these two shortcomings, we develop a novel SFQ system, namely PSFQ. In PSFQ, the respondent of an SFQ is mixed with multiple users by a ring signature. PSFQ uses an advanced ring signature approach to minimize the size of a ring signature when anonymity satisfies the requirements. Thus, the first shortcoming has been overcome. Moreover, all answers are encrypted by homomorphic encryption and stored on the blockchain, enabling users to verify the correctness of the statistical reports. Our demonstration will showcase how PSFQ provides confidential SFQ responses while ensuring the correctness of statistical reports.
Showcasing Data Management Challenges for Future IoT Applications with NebulaStream [demo]Aljoscha P Lepping (TU Berlin)*; Hoang Mi Pham (Technische Universität Berlin); Laura Mons (DIMA); Balint Rueb (TU Berlin); Ankit Chaudhary (Technische Universität Berlin); Philipp M Grulich (Technische Universität Berlin); Steffen Zeuch (Technische Universität Berlin); Volker Markl (Technische Universität Berlin)
Data management systems will face several new challenges in supporting IoT applications during the coming years. These challenges arise from managing large numbers of heterogeneous IoT devices and require combining elastic cloud and fog resources in unified fog-cloud environments.
In this demonstration, we introduce a smart city simulation called IoTropolis and use it to create interactive eHealth and Smart Grid application scenarios. We use these scenarios to showcase three key challenges of unified fog-cloud environments. Furthermore, we demonstrate how NebulaStream, our recently proposed data management system for the IoT, addresses these challenges. Visitors to our demonstration can configure and interact with the scenarios to manage electricity usage in IoTropolis or to distribute patients across different hospitals. Thereby, visitors can actively engage with the challenges showcased by IoTropolis and utilize NebulaStream to address them. As a result, our demonstration enables visitors to experience data management for future IoT applications.
KGNav: A Knowledge Graph Navigational Visual Query System [demo]Xiang Wang (Tianjin University); Xin Wang (Tianjin University)*; Zhaozhuo Li (Tianjin University); Dong Han (Tianjin Academy of Fine Arts)
Visual query is a vital technique for comprehending and analyzing knowledge graphs, which provides an effective method to lower the barrier of querying knowledge graphs for non-professional users. Nevertheless, visual query techniques for knowledge graphs and ontologies that have emerged in recent years cannot bridge the gap between the global information provided by the knowledge graph schema and the underlying data of the knowledge graph. Thus, they cannot fully exploit the global information to guide users in querying knowledge graphs. This demonstration showcases KGNav, a Knowledge Graph Navigational visual query system. KGNav (1) redefines the minimal unit of operation to abstract the conceptual hierarchy, i.e., Knowledge Graph Schema, in the domain from the original knowledge graph in an offline semi-automatic way through the equivalence relations between these units; it also (2) provides a series of operators and an interactive GUI to capture user query intentions, guiding users to explore the Knowledge Graph Schema to achieve in-depth analysis of knowledge graphs. We will demonstrate the capability of KGNav in reducing tedious queries, enabling users to swiftly grasp the structure of the knowledge graph, and performing queries through several fundamental scenarios.
On-the-fly Data Transformation in Action [demo]Ju Hyoung Mun (Boston University)*; Konstantinos Karatsenidis (Boston University); Tarikul Islam Papon (Boston University); Shahin Roozkhosh (Boston University); Denis Hoornaert (Technical University of Munich); Ahmed Sanaullah (Red Hat); Ulrich Drepper (Red Hat); Renato Mancuso (Boston University); Manos Athanassoulis (Boston University)
Transactional and analytical database management systems (DBMS) typically employ different data layouts: row-stores for the former and column-stores for the latter. In order to bridge the requirements of the two without maintaining two systems and two (or more) copies of the data, our proposed system Relational Memory employs specialized hardware that transforms the base row table into arbitrary column groups at query execution time. This approach maximizes cache locality and is easy to use via a simple abstraction that allows transparent on-the-fly data transformation. Here, we demonstrate how to deploy and use Relational Memory via four representative scenarios. The demonstration uses the full-stack implementation of Relational Memory on the Xilinx Zynq UltraScale+ MPSoC platform. Conference participants will interact with Relational Memory deployed on the actual platform.
Explaining Differentially Private Query Results With DPXPlain [demo]Tingyu Wang (Duke University); Yuchao Tao (SNAP); Amir Gilad (The Hebrew University)*; Ashwin Machanavajjhala (Duke); Sudeepa Roy (Duke University, USA)
Employing Differential Privacy (DP), the state-of-the-art privacy standard, to answer aggregate database queries poses new challenges for users to understand the trends and anomalies observed in the query results: Is the unexpected answer due to the data itself, or is it due to the extra noise that must be added to preserve DP? We propose to demonstrate DPXplain, the first system for explaining group-by aggregate query answers with DP. DPXplain allows users to compare values of two groups and receive a validity check, and further provides an explanation table with an interactive visualization, containing the approximately 'top-k' explanation predicates along with their relative influences and ranks in the form of confidence intervals, while guaranteeing DP in all steps.
Ganos Aero: A Cloud-Native System for Big Raster Data Management and Processing [demo]Fei Xiao (Alibaba Group); Jiong Xie (Alibaba Group)*; Zhida Chen (Alibaba Group); Feifei Li (Alibaba Group); Zhen Chen (Alibaba Corp.); Jianwei Liu (alibaba); Yinpei Liu (Alibaba Group)
The development of Earth Observation technology contributes to the production of massive raster data. It is vital to manage and conduct analytical tasks on the raster data. Existing solutions employ separate dedicated systems for raster data management and for processing, incurring problems such as data redundancy, difficulty in updating, and expensive data transfer and transformation. To cope with these limitations, this demonstration presents Ganos Aero, a cloud-native system for big raster data management and processing. Ganos Aero proposes a unified raster data model for both data management and processing, which stores a single copy of the raster data without performing an expensive tiling procedure, and thus achieves significant improvement in storage and updating efficiency. To enable efficient query and batch task processing, Ganos Aero implements an on-the-fly tile production mechanism, and optimizes its performance using cloud features including decoupling compute from storage and pushing costly operations closer to the storage layer.
Since its deployment in Alibaba Cloud in 2022, Ganos Aero has been playing a critical role in many real applications, including modern agriculture and environment monitoring and protection, among others.
Demonstration of OpenDBML, a Framework for Democratizing In-Database Machine Learning [demo]Mahdi Ghorbani (University of Edinburgh); Amir Shaikhha (University of Edinburgh)*
Machine learning over relational data has been used in several applications. The traditional approach of joining relations first and then training a model on the joined table is time-consuming and requires a significant amount of memory. Recent research has focused on in-database machine learning (in-DB ML) to address this issue; these methods train the models over relations without joining, resulting in a more efficient process. However, such systems have ad-hoc user interfaces and specific data formats, making them challenging to use. To address this problem, this paper presents OpenDBML, a framework for democratizing in-DB ML. OpenDBML offers a Python interface for multiple in-DB ML systems, a set of commonly used datasets, and the ability to add new datasets and in-DB ML systems via both Python and web interfaces. The paper also presents comprehensive demonstration scenarios to illustrate how to use OpenDBML effectively.
Demonstration of SPARQL-ML: An Interfacing Language for Supporting Graph Machine Learning for RDF Graphs [demo]Hussein Shahata Abdallah (Concordia University)*; Waleed Afandi (Concordia University); Essam Mansour (Concordia University)
This demo paper presents KGNet, a graph machine learning-enabled RDF engine. KGNet integrates graph machine learning (GML) models with existing RDF engines as query operators to support node classification and link prediction tasks. For easy integration, KGNet extends the SPARQL language with user-defined predicates to support the GML operators. We refer to this extension as SPARQL-ML query. Our SPARQL-ML query optimizer is in charge of optimizing the selection of the near-optimal GML models. The development of KGNet poses research opportunities in various areas spanning KG management. In the paper, we demonstrate the ease of integration between the RDF engines and GML models through the SPARQL-ML inference query language. We present several real use cases of different GML tasks on real KGs. Using KGNet, users do not need to learn a new scripting language or have a deep understanding of GML methods. The audience will experience KGNet with different KGs and GML models, as shown in our demo video and Colab notebook.
Approximate Queries over Concurrent Updates [demo]Congying Wang (University at Buffalo); Nithin Sastry Tellapuri (University at Buffalo); Sphoorthi Keshannagari (University at Buffalo); Dylan Zinsley (University at Buffalo); Zhuoyue Zhao (University at Buffalo)*; Dong Xie (Penn State University)
Approximate Query Processing (AQP) systems produce estimates of query answers using small random samples. They are attractive for users who are willing to trade accuracy for low query latency. On the other hand, real-world data are often subject to concurrent updates. If the user wants to perform real-time approximate data analysis, the AQP system must support concurrent updates and sampling. Towards that, we recently developed a new concurrent index, AB-tree, to support efficient sampling under updates. In this work, we will demonstrate the feasibility of supporting real-time approximate data analysis in online transaction settings using index-assisted sampling.
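The idea of index-assisted sampling under updates can be sketched in Python with a count-augmented index. The sketch below is not the AB-tree: it swaps in a Fenwick (binary indexed) tree over a small integer key domain, is single-threaded, and uses a hypothetical class name `CountIndex`. It only illustrates how maintained subtree counts let a range sample be drawn in logarithmic time while inserts and deletes keep arriving.

```python
import random


class CountIndex:
    """Fenwick tree over integer keys in [0, size) storing key multiplicities."""

    def __init__(self, size):
        self.size = size
        self.tree = [0] * (size + 1)

    def update(self, key, delta):
        """Insert (delta=+1) or delete (delta=-1) one item with this key."""
        i = key + 1
        while i <= self.size:
            self.tree[i] += delta
            i += i & -i

    def prefix_count(self, key):
        """Number of stored items whose key is strictly less than `key`."""
        i, total = key, 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def select(self, rank):
        """Key of the item at 0-based position `rank` in sorted order."""
        pos = 0
        step = 1 << self.size.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.size and self.tree[nxt] <= rank:
                pos = nxt
                rank -= self.tree[nxt]
            step >>= 1
        return pos

    def sample(self, lo, hi, rng=random):
        """Uniformly sample one stored item with lo <= key < hi (None if empty)."""
        first, last = self.prefix_count(lo), self.prefix_count(hi)
        if first == last:
            return None
        return self.select(rng.randrange(first, last))
```

Each `sample` call picks a random rank inside the query range and descends the count structure to find the matching key, so the sample always reflects the updates applied so far without rescanning the data.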
DuckPGQ: Bringing SQL/PGQ to DuckDB [demo]Daniel ten Wolde (Centrum Wiskunde & Informatica)*; Gábor Szárnyas (CWI); Peter Boncz (Centrum Wiskunde & Informatica)
We demonstrate the most important new feature of SQL:2023, namely SQL/PGQ, which eases querying graphs using SQL by introducing new syntax for pattern matching and (shortest) path-finding. We show how support for SQL/PGQ can be integrated into an RDBMS, specifically in the DuckDB system, using an extension module called DuckPGQ. As such, we also demonstrate the use of the DuckDB extensibility mechanism, which allows adding new functions, data types, operators, optimizer rules, storage systems and even parsers to DuckDB. We also describe the new data structures and algorithms that the DuckPGQ module is based on, and how they are injected into SQL plans. While the demonstrated DuckPGQ extension module is lean and efficient, we sketch a roadmap to (i) improve its performance through new algorithms (factorized and worst-case optimal joins) and better parallelism and (ii) extend its functionality to scenarios beyond SQL, e.g. building and analyzing Graph Neural Networks.
Demo of QueryBooster: Supporting Middleware-based SQL Query Rewriting as a Service [demo]Qiushi Bai (UC Irvine)*; Sadeem Alsudais (UC Irvine); Chen Li (UC Irvine)
Query rewriting is an important technique to optimize SQL performance in databases. With the prevalent use of business intelligence systems and object-relational mapping frameworks, existing rewriting capabilities inside databases are insufficient to optimize machine-generated queries. In this paper, we propose a novel system called "QueryBooster," to support SQL query rewriting as a cloud service. It provides a powerful and easy-to-use Web interface for users to formulate rewriting rules via a language or express rewriting intentions by providing example query pairs. It allows multiple users to share rewriting knowledge and automatically suggests shared rewriting rules for users. It requires no modifications or plugin installations to applications or databases. In this demonstration, we use real-world applications and datasets to show the user experience of QueryBooster to rewrite their application queries and share rewriting knowledge.
Portals: A Showcase of Multi-Dataflow Stateful Serverless [demo]Jonas Spenger (KTH Royal Institute of Technology)*; Chengyang Huang (KTH Royal Institute of Technology); Philipp Haller (KTH Royal Institute of Technology); Paris Carbone (KTH Royal Institute of Technology)
Serverless applications spanning the cloud and edge require flexible programming frameworks for expressing compositions across the different levels of deployment. Another critical aspect for applications with state is failure resilience beyond the scope of a single dataflow graph that is the current standard in data streaming systems. This paper presents Portals, an interactive, stateful dataflow composition framework with strong end-to-end guarantees. Portals enables event-driven, resilient applications that span across dataflow graphs and serverless deployments. The demonstration exhibits three scenarios in our multi-dataflow streaming-based system: dynamically composing a stateful serverless application; an interactive cloud and edge serverless application; and a Portals browser playground.
XDB in Action: Decentralized Cross-Database Query Processing for Black-Box DBMSes [demo]Haralampos Gavriilidis (Technische Universität Berlin)*; Leonhard Rose (Technische Universität Berlin); Joel Ziegler (Technische Universität Berlin); Kaustubh Beedkar (IIT Delhi); Jorge-Arnulfo Quiané-Ruiz (IT University of Copenhagen); Volker Markl (Technische Universität Berlin)
Data are naturally produced at different locations and hence stored on different DBMSes. To maximize the value of the collected data, today's users combine data from different sources. Research in data integration has proposed the Mediator-Wrapper (MW) architecture to enable ad-hoc query processing over multiple sources. The MW approach is desirable for users, as they do not need to deal with heterogeneous data sources. However, from a query processing perspective, the MW approach is inefficient: First, one needs to provision the mediating execution engine with resources. Second, during query processing, data gets "centralized" within the mediating engine, which causes redundant data movement. Recently, we proposed in-situ cross-database query processing, a paradigm for federated query processing without a mediating engine. Our approach optimizes runtime performance and reduces data movement by leveraging existing systems, eliminating the need for an additional federated query engine. In this demonstration, we showcase XDB, our prototype for in-situ cross-database query processing. We demonstrate several aspects of XDB, i.e., the cross-database environment, our optimization techniques, and its decentralized execution phase.
Demo-Group-B
QO-Insight: Inspecting Steered Query Optimizers [demo]Christoph Anneser (Technical University of Munich)*; Mario Petruccelli (TUM); Nesime Tatbul (Intel Labs and MIT); David E Cohen (Intel); Zhenggang Xu (Meta Platforms); Prithviraj P Pandian (Meta); Nikolay Laptev (Facebook); Ryan C Marcus (Massachusetts Institute of Technology); Alfons Kemper (TUM)
Steered query optimizers address the planning mistakes of traditional query optimizers by providing them with hints on a per-query basis, thereby guiding them in the right direction. This paper introduces QO-Insight, a visual tool designed for exploring query execution traces of such steered query optimizers. Although steered query optimizers are typically perceived as black boxes, QO-Insight empowers database administrators and experts to gain qualitative insights and enhance their performance through visual inspection and analysis.
Demonstrating Waffle: A Self-driving Grid Index [demo]Dalsu Choi (Korea University); Hyunsik Yoon (Korea University); Hyubjin Lee (Korea University); Yon Dohn Chung (Korea University)*
This paper demonstrates Waffle, a self-driving grid indexing system for moving objects. We introduce the system architecture, system workflow, and user scenarios. Waffle enables the management of moving objects with less human effort while automatically improving performance.
CM-Explorer: Dissecting Data Ingestion Problems [demo]Niels Bylois (Hasselt University)*; Frank Neven (Hasselt University); Stijn Vansummeren (Hasselt University)
Data ingestion validation, the task of certifying the quality of continuously collected data, is crucial to ensure the trustworthiness of analytics insights. A widely used approach for validating data quality is to specify, either manually or automatically, so-called data unit tests that check whether data quality metrics lie within expected bounds. We employ conditional unit tests based on conditional metrics (CMs) that compute data quality signals over specific parts of the ingestion data and therefore allow for a fine-grained detection of errors. A violated conditional unit test specifies a set of erroneous tuples in a natural way: the subrelation that its CM refers to. Unfortunately, the downside of their fine-grained nature is that violated unit tests are often correlated: a single error in an ingestion batch may cause multiple tests (each referring to different parts of the batch) to fail. The key challenge is therefore to untangle this correlation and filter out the most relevant violated conditional unit tests, i.e., tests that identify a core set of erroneous tuples and act as an explanation for the errors. We present CM-Explorer, a system that supports data stewards in quickly finding the most relevant violated conditional unit tests. The system consists of three components: (1) a graph explorer for visualizing the correlation structure of the violated unit tests; (2) a relation explorer for browsing the tuples selected by conditional unit tests; and (3) a history explorer for gaining insight into why conditional unit tests are violated. In this paper, we discuss these components and present the different scenarios that we make available for the demonstration.
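To make the notion of a conditional unit test concrete, here is a small sketch under stated assumptions: the class `ConditionalUnitTest`, the helper `null_fraction`, and the toy batch are hypothetical and are not taken from CM-Explorer. A test pairs a condition (which selects a subrelation) with a metric and expected bounds. When the metric falls outside the bounds, the selected subrelation is reported as the set of suspect tuples.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict

@dataclass
class ConditionalUnitTest:
    name: str
    condition: Callable[[Row], bool]  # selects the subrelation the metric refers to
    metric: Callable[[list], float]   # data quality metric over that subrelation
    lower: float
    upper: float

    def violated_rows(self, batch):
        """Return the selected subrelation if the metric is out of bounds, else []."""
        selected = [row for row in batch if self.condition(row)]
        if not selected:
            return []
        value = self.metric(selected)
        return selected if not (self.lower <= value <= self.upper) else []

def null_fraction(column):
    """Metric factory: fraction of rows whose `column` is missing."""
    return lambda rows: sum(r[column] is None for r in rows) / len(rows)

# Hypothetical ingestion batch.
batch = [
    {"country": "BE", "price": 10.0},
    {"country": "BE", "price": None},
    {"country": "NL", "price": 12.0},
    {"country": "BE", "price": None},
]
be_test = ConditionalUnitTest(
    name="BE price completeness",
    condition=lambda r: r["country"] == "BE",
    metric=null_fraction("price"),
    lower=0.0,
    upper=0.1,
)
nl_test = ConditionalUnitTest(
    name="NL price completeness",
    condition=lambda r: r["country"] == "NL",
    metric=null_fraction("price"),
    lower=0.0,
    upper=0.1,
)
suspects = be_test.violated_rows(batch)
```

With many overlapping conditions, a single bad source can trip several such tests at once, which is the correlation the abstract refers to.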
Solving Hard Variants of Database Schema Matching on Quantum Computers [demo]Kristin Fritsch (University of Passau); Stefanie Scherzinger (University of Passau)*
With quantum computers now available as cloud services, there is a global quest for applications where a quantum advantage can be shown. Naturally, data management is a candidate domain. Workable solutions require the design of hybrid quantum algorithms, where a quantum computing unit (a QPU) and classical computing (via CPUs) cooperate towards solving a problem.
This demo illustrates such an end-to-end solution targeting NP-hard variants of database schema matching. Our demo is intended to be educational (and hopefully inspirational), allowing participants to explore the critical design decisions, such as the handover between phases of QPU- and CPU-based computation. It will also allow participants to experience hands-on -- through playful interaction -- how easily problem sizes exceed the limitations of today's QPUs.
To UDFs and Beyond: Demonstration of a Fully Decomposed Data Processor for General Data Wrangling Tasks [demo]Nico Schäfer (RPTU Kaiserslautern-Landau); Damjan Gjurovski (RPTU Kaiserslautern-Landau)*; Angjela Davitkova (RPTU Kaiserslautern-Landau); Sebastian Michel (RPTU Kaiserslautern-Landau)
While existing data management solutions try to keep up with novel data formats and features, a myriad of valuable functionality is often only accessible via programming language libraries. Particularly for machine learning tasks, there is a wealth of pre-trained models and easy-to-use libraries that allow a wide audience to harness state-of-the-art machine learning. We propose the demonstration of a highly modularized data processor for semi-structured data that can be extended by means of plain Python scripts. In addition to commonly supported user-defined functions, the deep decomposition allows augmenting the core engine with additional index structures, customized import and export routines, and custom aggregation functions. For several use cases, we detail how user-defined modules can be quickly realized. We invite the audience to write and apply custom code, or to tailor the code snippets we provide to their own preferences, in order to solve data analytics tasks involving sentiment analysis of tweets.
BrewER: Entity Resolution On-Demand [demo]Luca Zecchini (Università degli Studi di Modena e Reggio Emilia)*; Giovanni Simonini (University of Modena and Reggio Emilia); Sonia Bergamaschi (Università di Modena e Reggio Emilia); Felix Naumann (Hasso Plattner Institute, University of Potsdam)
The task of entity resolution (ER) aims to detect multiple records describing the same real-world entity in datasets and to consolidate them into a single consistent record. ER plays a fundamental role in guaranteeing good data quality, e.g., as input for data science pipelines. Yet, the traditional approach to ER requires cleaning the entire data before being able to run consistent queries on it; hence, users struggle to tackle common scenarios with limited time or resources (e.g., when the data changes frequently or the user is only interested in a portion of the dataset for the task).
We previously introduced BrewER, a framework to evaluate SQL SP queries on dirty data while progressively returning results as if they were issued on cleaned data, according to a priority defined by the user. In this demonstration, we show how BrewER can be exploited to ease the burden of ER, allowing data scientists to save a significant amount of resources for their tasks.
Web Connector: A Unified API Wrapper to Simplify Web Data Collection [demo]Weiyuan Wu (Simon Fraser University)*; Pei Wang (Simon Fraser University); Yi Xie (Simon Fraser University); Yejia Liu (Simon Fraser University); George Chow (Simon Fraser University); Jiannan Wang (Simon Fraser University)
Collecting structured data from Web APIs, such as the Twitter API, Yelp Fusion API, Spotify API, and DBLP API, is a common task in the data science lifecycle, but it requires advanced programming skills for data scientists. To simplify web data collection and lower the barrier to entry, API wrappers have been developed to wrap API calls into easy-to-use functions. However, existing API wrappers are not standardized, which means that users must download and maintain multiple API wrappers and learn how to use each of them, while developers must spend considerable time creating an API wrapper for any new website. In this demo, we present the Web Connector, which unifies API wrappers to overcome these limitations. First, the Web Connector has an easy-to-use programming interface, designed to provide a user experience similar to that of reading data from relational databases. Second, the Web Connector's novel system architecture requires minimal effort to fetch data for end-users with an existing API description file. Third, the Web Connector includes a semi-automatic API description file generator that leverages the concept of generation by example to create new API wrappers without writing code.
ERICA: Query Refinement for Diversity Constraint Satisfaction [demo]Jinyang Li (University of Michigan)*; Alon Silberstein (Ben Gurion University); Yuval Moskovitch (Ben Gurion University); Julia Stoyanovich (New York University); H. V. Jagadish (University of Michigan)
Relational queries are commonly used to support decision making in critical domains like hiring and college admissions. For example, a college admissions officer may need to select a subset of the applicants for in-person interviews, who individually meet the qualification requirements (e.g., have a sufficiently high GPA) and are collectively demographically diverse (e.g., include a sufficient number of candidates of each gender and of each race). However, traditional relational queries only support selection conditions checked against each input tuple, and they do not support diversity conditions checked against multiple, possibly overlapping, groups of output tuples. To address this shortcoming, we present ERICA, an interactive system that proposes minimal modifications for selection queries to have them satisfy constraints on the cardinalities of multiple groups in the result. We demonstrate the effectiveness of ERICA using several real-life datasets and diversity requirements.
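The difference between per-tuple selection conditions and group-level cardinality constraints can be sketched as follows. This is an illustrative, naive refinement that only lowers a single GPA cutoff step by step. It is not ERICA's algorithm, and all names (`satisfies`, `refine_gpa_threshold`, the applicant data) are assumptions.

```python
from collections import Counter

def group_counts(rows, attributes):
    """Count output tuples per (attribute, value) group; groups may overlap."""
    counts = Counter()
    for row in rows:
        for attr in attributes:
            counts[(attr, row[attr])] += 1
    return counts

def satisfies(rows, constraints):
    """constraints maps (attribute, value) -> minimum number of output tuples."""
    counts = group_counts(rows, {attr for attr, _ in constraints})
    return all(counts[group] >= minimum for group, minimum in constraints.items())

def refine_gpa_threshold(applicants, threshold, constraints, step=0.1, max_steps=20):
    """Naive refinement: lower the GPA cutoff until the constraints hold."""
    for k in range(max_steps + 1):
        current = round(threshold - k * step, 2)  # rounding avoids float drift
        selected = [a for a in applicants if a["gpa"] >= current]
        if satisfies(selected, constraints):
            return current, selected
    return None, []

# Hypothetical applicants and diversity requirements.
applicants = [
    {"name": "A", "gpa": 3.9, "gender": "F"},
    {"name": "B", "gpa": 3.8, "gender": "M"},
    {"name": "C", "gpa": 3.7, "gender": "M"},
    {"name": "D", "gpa": 3.5, "gender": "F"},
    {"name": "E", "gpa": 3.2, "gender": "F"},
]
constraints = {("gender", "F"): 2, ("gender", "M"): 2}
new_threshold, selected = refine_gpa_threshold(applicants, 3.7, constraints)
```

The original cutoff of 3.7 admits only one female applicant. The search stops at the first lowered cutoff that meets both minimums, which keeps the change to the query small.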
DataRinse: Semantic Transforms for Data preparation based on Code Mining [demo]Ibrahim Abdelaziz (IBM Research); Julian Dolby (IBM Research); Udayan Khurana (IBM Research); Horst Samulowitz (IBM Research); Kavitha Srinivas (IBM Research)*
Data preparation is a crucial first step in any data analysis problem. This task is largely manual, performed by a person familiar with the data domain. DataRinse is a system designed to extract relevant transforms from large-scale static analysis of code repositories. Our motivation is that in any large enterprise, multiple personas such as data engineers and data scientists work on similar datasets. However, sharing or re-using their code is neither obvious nor easy to do. In this paper, we demonstrate how DataRinse handles data preparation: the system recommends code designed to help prepare a column for data analysis more generally. We show that DataRinse does not simply shard expressions observed in code but also uses analysis to group expressions applied to the same field such that related transforms appear coherently to a user. It is a human-in-the-loop system where the users select relevant code snippets produced by DataRinse to apply to their dataset.
Demonstrating ADOPT: Adaptively Optimizing Attribute Orders for Worst-Case Optimal Joins via Reinforcement Learning [demo]Junxiong Wang (Cornell University)*; Mitchell E Gray (Cornell); Immanuel Trummer (Cornell); Ahmet Kara (University of Zurich); Dan Olteanu (University of Zurich)
Performance of worst-case optimal join algorithms depends on the order in which the join attributes are processed. It is challenging to identify suitable orders prior to query execution due to the huge search space of possible orders and unreliable execution cost estimates in case of data skew or data correlation.
We demonstrate ADOPT, a novel query engine that integrates adaptive query processing with a worst-case optimal join algorithm. ADOPT divides query execution into episodes, during which different attribute orders are invoked. With runtime feedback on performance of different attribute orders, ADOPT rapidly approaches near-optimal orders. Moreover, ADOPT uses a unique data structure which keeps track of the processed input data to prevent redundant work across different episodes. It selects attribute orders to try via reinforcement learning, balancing the need for exploring new orders with the desire to exploit promising orders. In experiments, ADOPT outperforms baselines, including commercial and open-source systems utilizing worst-case optimal join algorithms, particularly for complex queries that are difficult to optimize.
Demonstrating GPT-DB: Generating Query-Specific and Customizable Code for SQL Processing with GPT-4 [demo]Immanuel Trummer (Cornell University)*
GPT-DB generates code for SQL processing in general-purpose programming languages such as Python. Generated code can be freely customized using user-provided natural language instructions. This enables users, for instance, to try out specific libraries for SQL processing or to generate non-standard output while processing.
GPT-DB is based on OpenAI's GPT model series, neural networks capable of translating natural language instructions into code. By default, GPT-DB exploits the most recently released GPT-4 model whereas visitors may also select prior versions for comparison. GPT-DB automatically generates query-specific prompts, instructing GPT on code generation. These prompts include a description of the target database, as well as logical query plans described as natural language text, and instructions for customization. GPT-DB automatically verifies, and possibly re-generates, code using a reference database system for result comparisons. It enables users to select code samples for training, thereby increasing accuracy for future queries. The proposed demonstration showcases code generation for various queries and with varying instructions for code customization.
PikePlace: Generating Intelligence for Marketplace Datasets [demo]Shi Qiao (SmartApps); Alekh Jindal (SmartApps)*
There is a renewed interest in data marketplaces, as cloud data warehouses make sharing and accessing data on demand extremely easy. However, analyzing marketplace datasets is challenging since current tools for creating the data models are manual and slow. In this paper, we propose to demonstrate a learning-based approach to discover, deploy, and optimize data models. We present the resulting system, PikePlace, show an evaluation over Snowflake marketplace and TPC-H datasets, and describe several demonstration scenarios that the audience can play with.
Demo-Group-C
PAINE Demo: Optimizing Video Selection Queries With Commonsense Knowledge [demo]Wenjia He (University of Michigan)*; Ibrahim Sabek (Massachusetts Institute of Technology); Yuze Lou (University of Michigan); Michael Cafarella (MIT CSAIL)
Because video is becoming more popular and constitutes a major part of data collection, there is a need to process video selection queries, that is, queries selecting videos that contain target objects. However, a naive scan of a video corpus without optimization would be extremely inefficient because it applies complex detectors to irrelevant videos. This demo presents PAINE, a video query system that employs a novel index mechanism to optimize video selection queries via commonsense knowledge. PAINE samples video frames to build an inexpensive lossy index, then leverages probabilistic models based on existing commonsense knowledge sources to capture the semantic-level correlation among video frames, thereby allowing PAINE to predict the content of unindexed video. These models can predict which videos are likely to satisfy selection predicates, so as to prevent PAINE from processing irrelevant videos. We will demonstrate a system prototype of PAINE for accelerating the processing of video selection queries, allowing VLDB'23 participants to use the PAINE interface to run queries. Users can compare PAINE with the baseline, the SCAN method.
DeepVQL: Deep Video Queries on PostgreSQL [demo]Dong June Lew (Kunsan National University); Kihyun Yoo (Kunsan National University); Kwang Woo Nam (Kunsan National University, South Korea)*
The recent development of mobile and camera devices has led to the generation, sharing, and usage of massive amounts of video data. As a result, deep learning technology has gained attention as an alternative for video recognition and situation judgment. Recently, new systems supporting SQL-like declarative query languages have emerged, focusing on developing their own systems to support new queries combined with deep learning that are not supported by existing systems. The DeepVQL system proposed in this paper is implemented by extending the PostgreSQL system. DeepVQL supports video database functions and provides various user-defined functions for object detection, object tracking and video analytics queries. The advantage of this system is its ability to use queries with specific spatial regions or temporal durations as conditions for analyzing moving objects in traffic videos.
EQUI-VOCAL Demonstration: Synthesizing Video Queries from User Interactions [demo]Enhao Zhang (University of Washington)*; Maureen Daum (University of Washington); Dong He (University of Washington); Manasi Ganti (University of Washington Seattle); Brandon Haynes (Microsoft Gray Systems Lab); Ranjay Krishna (University of Washington); Magdalena Balazinska (UW)
We demonstrate EQUI-VOCAL, a system that synthesizes compositional queries over videos from user feedback. EQUI-VOCAL enables users to query a video database for complex events by providing a few positive and negative examples of what they are looking for and labeling a small number of additional system-selected examples. Using those user inputs, EQUI-VOCAL synthesizes declarative queries that can then retrieve additional instances of the desired events. The demonstration makes two contributions: it introduces EQUI-VOCAL's graphical user interface and enables conference attendees to experiment with EQUI-VOCAL's query-by-example approach. Both enable users to gain a better understanding of how EQUI-VOCAL efficiently identifies events using its novel query synthesis approach and explore the impact of hyperparameters and label noise on system performance.
Interpretable Clustering of Multivariate Time Series with Time2Feat [demo]Angela Bonifati (University of Lyon); Francesco Del Buono (University of Modena e Reggio Emilia); Francesco Guerra (University of Modena e Reggio Emilia)*; Miki Lombardi (Adobe); Donato Tiano (Università degli Studi di Modena e Reggio Emilia)
This paper showcases Time2Feat, an end-to-end machine learning system for Multivariate Time Series (MTS) clustering. The system relies on interpretable inter-signal and intra-signal features extracted from the time series. Then, a dimensionality reduction technique is applied to select a subset of features that retain most of the information, thus enhancing the interpretability of the results. In addition, the system enables domain specialists to semi-supervise the process by submitting a small collection of MTS with a target cluster. This process further improves both accuracy and interpretability, by reducing the number of features used by the clustering process. The demonstration shows the application of Time2Feat to various MTS datasets, by creating clusters from MTS datasets of interest, experimenting with different settings and using the approach capabilities to interpret the clusters generated.
mlwhatif: What If You Could Stop Re-Implementing Your Machine Learning Pipeline Analyses Over and Over? [demo]Stefan Grafberger (University of Amsterdam)*; Shubha Guha (University of Amsterdam); Paul Groth (University of Amsterdam); Sebastian Schelter (University of Amsterdam)
Software systems that learn from data with machine learning (ML) are used in critical decision-making processes. Unfortunately, real-world experience shows that the pipelines for data preparation, feature encoding and model training in ML systems are often brittle with respect to their input data. As a consequence, data scientists have to run different kinds of data-centric what-if analyses to evaluate the robustness and reliability of such pipelines, e.g., with respect to data errors or preprocessing techniques. These what-if analyses follow a common pattern: they take an existing ML pipeline, create a pipeline variant by introducing a small change, and execute this pipeline variant to see how the change impacts the pipeline's output score.
We recently proposed mlwhatif, a library that enables data scientists to declaratively specify what-if analyses for an ML pipeline, and to automatically generate, optimize and execute the required pipeline variants. We demonstrate how data scientists can leverage mlwhatif for a variety of pipelines and three different what-if analyses focusing on the robustness of a pipeline against data errors, the impact of data cleaning operations, and the impact of data preprocessing operations on fairness. In particular, we demonstrate step-by-step how mlwhatif generates and optimizes the required execution plans for the pipeline analyses. Our library is publicly available at https://github.com/stefan-grafberger/mlwhatif.
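The common pattern described above (take a pipeline, introduce a small change, re-execute, compare scores) can be sketched in plain Python. This does not use or reproduce the mlwhatif API, and it re-runs every variant from scratch, without the plan optimization the abstract mentions. The toy pipeline and the names `run_pipeline`, `inject_missing`, and `what_if` are assumptions.

```python
import random

def run_pipeline(train, test):
    """Toy pipeline: impute missing features with 0.0, fit a threshold, report accuracy."""
    features = [x if x is not None else 0.0 for x, _ in train]
    threshold = sum(features) / len(features)
    correct = sum((x > threshold) == bool(y) for x, y in test)
    return correct / len(test)

def inject_missing(rows, fraction, seed=0):
    """Pipeline variant: blank out the feature in a fraction of the rows."""
    rng = random.Random(seed)
    corrupted = set(rng.sample(range(len(rows)), int(fraction * len(rows))))
    return [(None if i in corrupted else x, y) for i, (x, y) in enumerate(rows)]

def what_if(train, test, variants):
    """Run the original pipeline and each variant; report score deltas vs. the baseline."""
    baseline = run_pipeline(train, test)
    report = {"baseline": baseline}
    for name, change in variants.items():
        report[name] = run_pipeline(change(train), test) - baseline
    return report

# Hypothetical data: the label is 1 when the underlying value is at least 50.
train = [(float(x), int(x >= 50)) for x in range(100)]
test = [(float(x) + 0.5, int(x >= 50)) for x in range(100)]
report = what_if(train, test, {
    "10% missing": lambda rows: inject_missing(rows, 0.10),
    "50% missing": lambda rows: inject_missing(rows, 0.50),
})
```

Each entry in `report` other than the baseline is a score difference, so negative values indicate that the injected data errors hurt the pipeline.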
VisualNeo: Bridging the Gap between Visual Query Interfaces and Graph Query Engines [demo]Kai Huang (HKUST)*; Houdong Liang (Hong Kong University of Science and Technology); Chongchong Yao (Hong Kong University of Science and Technology); Xi Zhao (The Hong Kong University of Science and Technology); Yue Cui (The Hong Kong University of Science and Technology); Yao Tian (The Hong Kong University of Science and Technology); Ruiyuan Zhang (The Hong Kong University of Science and Technology); Xiaofang Zhou (Hong Kong University of Sci and Tech)
Visual Graph Query Interfaces (VQIs) empower non-programmers to query graph data by constructing visual queries intuitively. Devising efficient technologies in Graph Query Engines (GQEs) for interactive search and exploration has also been studied for years. However, these two vibrant scientific fields have traditionally been independent of each other, creating a significant barrier for users who wish to explore the full-stack operations of graph querying. In this demonstration, we propose a novel VQI system built upon Neo4j, called VisualNeo, that facilitates efficient subgraph queries in large graph databases. VisualNeo inherits several features from recent advanced VQIs, including data-driven GUI design and canned pattern generation. Additionally, it includes a database manager module so that users can connect to generic Neo4j databases. It performs query processing through the Neo4j driver and provides aesthetic query result exploration.
KG-Roar: Interactive Datalog-based Reasoning on Virtual Knowledge Graphs [demo]Luigi Bellomarini (Banca d'Italia)*; Marco Benedetti (Banca d'Italia); Andrea Gentili (Banca d'Italia); Davide Magnanimi (Politecnico di Milano); Emanuel Sallinger (TU Wien)
Logic-based Knowledge Graphs (KGs) are gaining momentum in academia and industry thanks to the rise of expressive and efficient languages for Knowledge Representation and Reasoning (KRR). These languages accurately express business rules, through which valuable new knowledge is derived. A versatile and scalable back-end reasoner such as Vadalog, a state-of-the-art system for logic-based KGs based on an extension of Datalog, executes the reasoning. In this demo, we present KG-Roar, a web-based interactive development and navigation environment for logical KGs. The system lets the user augment an input graph database with intensional definitions of new nodes and edges and turn it into a KG, via the metaphor of reasoning widgets: user-defined or off-the-shelf code snippets that capture business definitions in the Vadalog language. Then, the user can seamlessly browse the original and the derived nodes and edges within a ``Virtual Knowledge Graph'', which is reasoned upon and generated interactively at runtime, thanks to the scalability and responsiveness of Vadalog. KG-Roar is domain-independent but domain-aware, as exploration controls are contextually generated based on the intensional definitions. We walk the audience through KG-Roar, showcasing the construction of certain business definitions and putting it into action on a real-world financial KG from our work with the Bank of Italy.
Visualizing Spreadsheet Formula Graphs Compactly [demo]Fanchao Chen (Fudan University); Dixin Tang (University of California at Berkeley)*; Haotian Li (The Hong Kong University of Science and Technology); Aditya G. Parameswaran (University of California at Berkeley)
Spreadsheets are a ubiquitous data analysis tool, empowering non-programmers and programmers alike to easily express their computations by writing formulae alongside data. The dependencies created by formulae are tracked as formula graphs, which play a central role in many spreadsheet applications and are critical to the interactivity and usability of spreadsheet systems. Unfortunately, as formula graphs become large and complex, it becomes harder for end-users to make sense of formula graphs and trace the dependents or precedents of cells to check the accuracy of individual formulae and identify sources of errors. In this paper, we demonstrate a spreadsheet formula graph visualization tool, TACO-Viewer, developed as a plugin for Microsoft Excel. Our plugin leverages TACO, our framework for compactly and efficiently representing formula graphs. TACO compresses formula graphs using a key spreadsheet property: tabular locality, which means that cells close to each other are likely to have similar formula structures. This compact representation enables end-users to more easily consume complex dependencies and reduces the response time for tracing dependents and precedents. TACO-Viewer, our visualization plugin, depicts the compact representation of TACO and supports users in visually tracing dependents and precedents. As part of our demonstration, attendees can compare the visualizations of different formula graphs using TACO, Excel’s built-in dependency tracing tool, and an approach that does not compress formula graphs, and quantitatively compare the different response times of different approaches.
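Tabular locality can be illustrated with a deliberately simplified sketch. It is not TACO's actual representation, whose patterns are not detailed here, and the function `compress_column` and its edge format are assumptions. If consecutive rows of a formula column reference a precedent at the same relative row offset, their per-cell edges collapse into a single range-level edge.

```python
def compress_column(dependencies):
    """Merge per-cell dependencies of one formula column into range-level edges.

    dependencies: (dependent_row, precedent_row) pairs for formulas in one column
    that reference cells in one other column. Consecutive dependent rows with the
    same relative offset are merged into (start_row, end_row, offset).
    """
    edges = []
    for row, precedent in sorted(dependencies):
        offset = precedent - row
        if edges and edges[-1][1] == row - 1 and edges[-1][2] == offset:
            # Same pattern as the previous row: extend the current range.
            edges[-1] = (edges[-1][0], row, offset)
        else:
            edges.append((row, row, offset))
    return edges

# Hypothetical sheet: B1..B1000 each hold a formula over the cell to their left (A_i).
same_row = compress_column([(i, i) for i in range(1, 1001)])
# A column whose reference pattern changes halfway through.
mixed = compress_column([(1, 1), (2, 2), (3, 1), (4, 2)])
```

In the first case, one thousand cell-level edges become one range-level edge, so tracing the precedents of any row in the range takes a single lookup plus offset arithmetic.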
FS-Real: A Real-World Cross-Device Federated Learning Platform [demo]Dawei Gao (Alibaba-inc); Daoyuan Chen (Alibaba Group)*; Zitao Li (Alibaba Group); Yuexiang Xie (Alibaba Group); Xuchen Pan (Alibaba Group); Yaliang Li (Alibaba Group); Bolin Ding (Data Analytics and Intelligence Lab, Alibaba Group); Jingren Zhou (Alibaba Group)
Federated learning (FL) is a general distributed machine learning paradigm that provides solutions for tasks where data cannot be shared directly. Due to the difficulties in communication management and the heterogeneity of distributed data and devices, initiating and using an FL algorithm for real-world cross-device scenarios requires significant repetitive effort that may not be transferable to similar projects. To reduce the effort required for developing and deploying FL algorithms, we present FS-REAL, an open-source FL platform designed to address the need for a general and efficient infrastructure for real-world cross-device FL. In this paper, we introduce the key components of FS-REAL and demonstrate that FS-REAL has the following capabilities: 1) reducing the programming burden of FL algorithm development with plug-and-play and adaptable runtimes on Android and other Internet of Things (IoT) devices; 2) handling a large number of heterogeneous devices efficiently and robustly with our communication management components; 3) supporting a wide range of advanced FL algorithms with flexible configuration and extension; 4) reducing the cost and effort of deployment, evaluation, simulation, and performance optimization of FL algorithms with automated toolkits.
Odyssey: An Engine Enabling The Time-Series Clustering Journey [demo]John Paparrizos (The Ohio State University)*; Sai Prasanna Teja Reddy (Exelon Utilities)
Clustering is one of the most popular time-series tasks because it enables unsupervised data exploration and often serves as a subroutine or preprocessing step for other tasks. Despite being the subject of active research across disciplines for decades, only limited efforts have focused on benchmarking clustering methods for time series. Unfortunately, these studies have (i) omitted popular methods and entire classes of methods; (ii) considered limited choices for underlying distance measures; (iii) performed evaluations on a small number of datasets; or (iv) avoided rigorous statistical validation of the findings. In addition, the sudden enthusiasm for deep learning and the recent slew of proposed deep learning methods underscore the vital need for a comprehensive study. Motivated by these limitations, we present Odyssey, a modular and extensible web engine to comprehensively evaluate 80 time-series clustering methods spanning 9 different classes from the data mining, machine learning, and deep learning literature. Odyssey enables rigorous statistical analysis across 128 diverse time-series datasets. Through its interactive interface, Odyssey (i) reveals the best-performing method per class; (ii) identifies classes performing exceptionally well that were previously omitted; (iii) challenges claims about the use of elastic measures in clustering; (iv) highlights the effects of parameter tuning; and (v) debunks claims of superiority of deep learning methods. Odyssey not only facilitates the most extensive study ever performed in this area but, importantly, also reveals an illusion of progress: in reality, none of the evaluated methods could outperform a traditional method, namely, $k$-Shape, with a statistically significant difference. Overall, Odyssey lays the foundations for advancing the state of the art in time-series clustering.
SHEVA: A Visual Analytics System for Statistical Hypothesis Exploration [demo] Vicente N de Almeida (UFRGS)*; Eduardo Ribeiro (Universidade Federal do Tocantins); Nassim Bouarour (CNRS, University Grenoble Alpes); Joao Luiz Dihl Comba (UFRGS); Sihem Amer-Yahia (CNRS)
We demonstrate SHEVA, a System for Hypothesis Exploration with Visual Analytics. SHEVA adopts an Exploratory Data Analysis (EDA) approach to discovering statistically sound insights from large datasets. The system addresses three longstanding challenges in Multiple Hypothesis Testing: (i) the likelihood of rejecting the null hypothesis by chance, (ii) the pitfall of not being representative of the input data, and (iii) the ability to navigate among many data regions while preserving the user's train of thought. To address (i) and (ii), SHEVA implements significance adjustment methods that account for data-informed properties such as coverage and novelty. To address (iii), SHEVA guides users by recommending one-sample and two-sample hypotheses in a stepwise fashion following a data hierarchy. Users may choose from a collection of pre-trained hypothesis exploration policies and let SHEVA guide them through the most significant hypotheses in the data, or intervene to override suggested hypotheses. Furthermore, SHEVA relies on data-to-visual element mappings to convey hypothesis testing results in an interpretable fashion, and allows hypothesis pipelines to be stored and retrieved later to be tested on new datasets.
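Challenge (i) above, rejecting null hypotheses by chance when many tests are run, is commonly handled with a significance adjustment. The sketch below shows the standard Benjamini-Hochberg procedure as one well-known example. It is not SHEVA's method, which the abstract describes as additionally accounting for coverage and novelty. The function name and the sample p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected at false discovery rate alpha."""
    m = len(p_values)
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    return sorted(order[:cutoff])

# Hypothetical p-values from eight tests on different data regions.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
naive = [i for i, v in enumerate(p) if v <= 0.05]
adjusted = benjamini_hochberg(p, alpha=0.05)
print("unadjusted rejections:", naive)
print("adjusted rejections:  ", adjusted)
```

With these sample values, testing each hypothesis separately at 0.05 rejects five of them, while the adjusted procedure keeps only the two strongest.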
Join Order Selection with Deep Reinforcement Learning: Fundamentals, Techniques, and Challenges [tutorial] Zhengtong Yan (University of Helsinki)*; Valter Uotila (University of Helsinki); Jiaheng Lu (University of Helsinki)
Join Order Selection (JOS) is a fundamental challenge in query optimization, as it significantly affects query performance. However, finding an optimal join order is an NP-hard problem due to the exponentially large search space. Despite decades of effort, traditional methods still suffer from limitations. Deep Reinforcement Learning (DRL) approaches have recently gained growing interest and shown superior performance over traditional methods. These DRL-based methods can leverage prior experience, gained through trial and error, to automatically search for the optimal join order. This tutorial will focus on recent DRL-based approaches for join order selection by providing a comprehensive overview of the various approaches. We will start by briefly introducing the core concepts of join ordering and the traditional methods for JOS. Next, we will provide some preliminary knowledge about DRL and then delve into DRL-based join order selection approaches by offering detailed information on those methods, analyzing their relationships, and summarizing their weaknesses and strengths. To help the audience gain a deeper understanding of DRL approaches for JOS, we will present two open-source demonstrations and compare them. Finally, we will identify research challenges and open problems to provide insights into future research directions. This tutorial will provide valuable guidance for developing more practical DRL approaches for JOS.
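For readers unfamiliar with the traditional baseline that the tutorial contrasts with DRL, the following is a minimal sketch of exhaustive dynamic programming over relation subsets, restricted to left-deep plans. The cost model (the sum of intermediate result sizes), the table names, the cardinalities, and the selectivities are all assumptions made for illustration; none of them come from the tutorial. The number of subsets doubles with each added relation, which is why exhaustive search becomes impractical for large queries.

```python
from itertools import combinations

def best_left_deep_order(card, sel):
    """Find the cheapest left-deep join order by dynamic programming.

    card: {relation: row count}
    sel:  {frozenset({a, b}): join selectivity}; missing pairs are treated
          as cross products (selectivity 1.0).
    Cost of a plan = sum of the sizes of its intermediate results (toy model).
    """
    rels = sorted(card)

    def size(subset):
        # Estimated row count after joining every relation in the subset.
        out = 1.0
        for r in subset:
            out *= card[r]
        for a, b in combinations(sorted(subset), 2):
            out *= sel.get(frozenset((a, b)), 1.0)
        return out

    # best[subset] = (cost, order) of the cheapest left-deep plan for subset.
    best = {frozenset((r,)): (0.0, (r,)) for r in rels}
    for k in range(2, len(rels) + 1):
        for combo in combinations(rels, k):
            subset = frozenset(combo)
            candidates = []
            for last in combo:  # try each relation as the one joined last
                cost, order = best[subset - {last}]
                candidates.append((cost + size(subset), order + (last,)))
            best[subset] = min(candidates)
    return best[frozenset(rels)]

# Hypothetical three-table query: A joins B, and B joins C.
card = {"A": 1000, "B": 100, "C": 10}
sel = {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1}
cost, order = best_left_deep_order(card, sel)
print(order, cost)
```

In this toy query the cheapest plan joins the two small tables B and C first and adds the large table A last, avoiding the A-C cross product. A learned approach aims to find such orders without enumerating every subset.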