VLDB 2021: Demonstrations
The demonstrations accepted at the conference are listed below, split into four groups.
Group 1: Graphs and other Non-Standard Data Types and Applications [Demo Blocks 1 & 3]
KDV-Explorer: A Near Real-Time Kernel Density Visualization System for Spatial Analysis [Download Paper] (Hong Kong Baptist University), (University of Macau), (University of Macau), (University of Macau), (The University of Hong Kong), (University of Macau), (The University of Hong Kong, China) Kernel density visualization (KDV) is a commonly used visualization tool for many spatial analysis tasks, including disease outbreak detection, crime hotspot detection, and traffic accident hotspot detection. Although the most popular geographical information systems, e.g., QGIS and ArcGIS, also support this operation, these solutions do not scale to generating a single KDV for datasets with millions of data points, let alone supporting exploratory operations (e.g., zoom in, zoom out, and panning) with KDV in near real-time (< 5 sec). In this demonstration, we develop a near real-time visualization system, called KDV-Explorer, built on top of our prior work on efficient kernel density computation. Participants will be invited to conduct kernel density analysis on three large-scale datasets (up to 1.3 million data points): a traffic accident dataset, a crime dataset, and a COVID-19 dataset. We will also compare the performance of our solution against the solutions in QGIS and ArcGIS.
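The kernel density estimate underlying a KDV can be sketched in a few lines. The Python below is a textbook illustration only: KDV-Explorer's actual algorithms come from the authors' prior work on efficient kernel density computation, and the Gaussian kernel and bandwidth here are assumptions, not the system's implementation.

```python
import math

def kernel_density(points, query, bandwidth=1.0):
    """Naive Gaussian kernel density estimate at `query` from 2-D `points`.

    Runs in O(n) per query location; a KDV rasterizes this over every pixel,
    which is why naive evaluation does not scale to millions of points.
    """
    h = bandwidth
    total = 0.0
    for (x, y) in points:
        d2 = (x - query[0]) ** 2 + (y - query[1]) ** 2
        total += math.exp(-d2 / (2.0 * h * h))
    # Normalize by n and the 2-D Gaussian kernel constant.
    return total / (len(points) * 2.0 * math.pi * h * h)
```

A query location near the data cluster yields a higher density value than one far away, which is exactly the hotspot signal a KDV renders as a heat map.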
Path Advisor: A Multi-Functional Campus Map Tool for Shortest Path [Download Paper] (Hong Kong University of Science and Technology), (Hong Kong University of Science and Technology) Shortest paths in both the two-dimensional (2D) plane and on three-dimensional (3D) terrain are used extensively in both industry and academia. Although some map visualization tools can display the shortest path in 2D and 3D views, we find two limitations: (1) they are not applicable to map applications with obstacles (such as walls in a building), and (2) they look unrealistic and strange when a road network approach is blindly adopted. Motivated by this, we developed a web-based multi-functional campus map tool called Path Advisor, which allows users to visualize the shortest path in the 2D view, the bird's-eye view, and the virtual reality (VR) view. Path Advisor uses Dijkstra's shortest path algorithm and a breadth-first tree in the 2D view, and the weighted shortest surface path algorithm in the bird's-eye view and the VR view. A video demonstrating Path Advisor is available at https://youtu.be/ZgdjyXXHwqg.
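Dijkstra's algorithm, which the entry names for the 2D view, can be sketched as follows. This is a standard textbook implementation over a toy adjacency list, not Path Advisor's code or its campus graph.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from `source` over non-negative edge weights.

    `graph` maps node -> list of (neighbour, weight) pairs.
    Returns a dict of the cheapest known distance to each reachable node.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path was already settled
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

On a campus map, nodes would be corridor waypoints and edges walkable segments, so obstacles such as walls are handled by simply omitting edges that cross them.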
GraphScope: A One-Stop Large Graph Processing System [Download Paper] (Peking University & Alibaba Group), (Univ. of Edinburgh) Due to diverse graph data and algorithms, programming and orchestrating complex computation pipelines have become the major challenges to making use of graph applications for Web-scale data analysis. GraphScope aims to provide a one-stop and efficient solution for a wide range of graph computations at scale. It extends previous systems by offering a unified, high-level programming interface and allowing the seamless integration of specialized graph engines in a general data-parallel computation engine. As we will show in this demo, GraphScope enables developers to write sequential graph programs in Python and provides automatic parallel execution on a cluster. This further allows GraphScope to integrate seamlessly with existing data processing systems in the PyData ecosystem. To validate GraphScope's efficiency, we compare a complex, multi-staged processing pipeline for a real-life fraud detection task against a manually assembled implementation comprising multiple systems. GraphScope achieves a 2.86× speedup on a trillion-scale graph in real production at Alibaba.
Comprehensive Evaluation of Question Answering Systems over Knowledge Graphs Through Deep Analysis of Benchmarks [Download Paper] A plethora of question answering (QA) systems that retrieve answers to natural language questions from knowledge graphs have been developed in recent years. However, choosing a benchmark that accurately assesses the quality of a question answering system is a challenging task due to the high degree of variation among the available benchmarks with respect to their fine-grained properties. In this demonstration, we introduce CBench, an extensible and more informative benchmarking suite for analyzing benchmarks and evaluating QA systems. CBench can be used to analyze existing benchmarks with respect to several fine-grained linguistic, syntactic, and structural properties of the questions and queries in the benchmarks. Moreover, CBench can facilitate the evaluation of QA systems using a set of popular benchmarks that can be augmented with other user-provided benchmarks. CBench not only evaluates a QA system based on popular single-number metrics but also gives a detailed analysis of the linguistic, syntactic, and structural properties of answered and unanswered questions to help the developers of QA systems better understand where their system excels and where it struggles.
Demonstration of Marius: Graph Embeddings with a Single Machine [Download Paper] (University of Wisconsin-Madison), (University of Wisconsin-Madison), (University of Wisconsin-Madison), (University of Wisconsin-Madison), (University of Wisconsin-Madison), (University of Wisconsin-Madison), (University of Wisconsin, Madison) Graph embeddings have emerged as the de facto representation for modern machine learning over graph data structures. The goal of graph embedding models is to convert high-dimensional sparse graphs into low-dimensional, dense and continuous vector spaces that preserve the graph structure properties. However, learning a graph embedding model is a resource intensive process, and existing solutions rely on expensive distributed computation to scale training to instances that do not fit in GPU memory. This demonstration showcases Marius: a new open-source engine for learning graph embedding models over billion-edge graphs on a single machine. Marius is built around a recently-introduced architecture for machine learning over graphs that utilizes pipelining and a novel data replacement policy to maximize GPU utilization and exploit the entire memory hierarchy (including disk, CPU, and GPU memory) to scale to large instances. The audience will experience how to develop, train, and deploy graph embedding models using Marius' configuration-driven programming model. Moreover, the audience will have the opportunity to explore Marius' deployments on applications including link-prediction on WikiKG90M and reasoning queries on a paleobiology knowledge graph. Marius is available as open source software at https://marius-project.org.
Apperception: A Database Management System for Geospatial Video Data [Download Paper] (University of California Berkeley), (University of California Berkeley), (University of Washington), (University of California, Berkeley) Many recent video applications—including traffic monitoring, drone analytics, autonomous driving, and virtual reality—require piecing together, combining, and operating over many related video streams. Despite the massive data volumes involved and the need to jointly reason (both spatially and temporally) about these videos, current techniques to store and manipulate such data are often limited to file systems and simple video processing frameworks that reason about a single video in isolation. We present Apperception, a new type of database management system optimized for geospatial video applications. Apperception comes with a data model for multiple geospatial video data streams and a programming interface for developers to collectively reason about the entities observed in those videos. Our demo will let users write queries over video using Apperception and retrieve (in real time) both metadata and rendered video data. Users can also compare results and observe the speedups achieved by using Apperception.
Automated energy consumption forecasting with EnForce [Download Paper] (National Technical University of Athens), (National Technical University of Athens), (National Technical University of Athens) The need to reduce energy consumption on a global scale has been of high importance during the last years. Research has created methods to make highly accurate forecasts of the energy consumption of buildings, and there have been efforts towards the provision of automated forecasting for time series prediction problems. EnForce is a novel system that provides fully automatic forecasting on time series data describing the energy consumption of buildings. It uses statistical techniques and deep learning methods to make predictions on univariate or multivariate time series data, so that exogenous factors, such as outside temperature, are taken into account. Moreover, the proposed system provides automatic data preprocessing and therefore handles noisy data with missing values and outliers. EnForce includes full API support and can be used both by experts and non-experts. The proposed demonstration showcases the advantages and technical features of EnForce.
RealGraph-Web: A Graph Analysis Platform on the Web [Download Paper] (Hanyang University), (Hanyang University), (Hanyang University, Korea) In this demo, we present RealGraph-Web, a web-based platform that provides various kinds of graph analysis services. RealGraph-Web is based on RealGraph, a graph engine that addresses the problem of performance degradation in processing real-world big graphs, achieving performance improvements of up to 44× over existing state-of-the-art graph engines. RealGraph-Web runs on a single machine with a web-based interface, thereby allowing users to easily and conveniently use graph analysis services and run various graph algorithms anywhere on the web. We show how a user can analyze a graph on RealGraph-Web in three steps and quickly obtain the analysis result via a graphical user interface.
T-Cove: An Exposure Tracing System Based on Cleaning Wi-Fi Events on Organizational Premises [Download Paper] (University of California, Irvine), (U.C. Irvine), (U.C. Irvine), (University of California, Irvine) WiFi connectivity events, generated when a mobile device connects to WiFi access points, can serve as a robust, passive, (almost) zero-cost indoor localization technology. The challenge is the coarse localization it offers, which limits its usefulness. We recently developed a novel data-cleaning-based approach, LOCATER, that exploits patterns in the network data to achieve accuracy as high as 90% at room-level granularity, making it possible to use network data to support a much larger class of applications. In this paper, we demonstrate one such application that helps organizations track occupancy levels and the potential exposure of a building's inhabitants to others possibly infected on their premises. The system, entitled T-Cove, is in operational use at over 20 buildings at UCI and has now become part of the school's reopening procedure. The demonstration will highlight T-Cove's functionality over both live data and data captured in the past.
SChain: A Scalable Consortium Blockchain Exploiting Intra- and Inter-Block Concurrency [Download Paper] (East China Normal University), (Ant Group), (Ant Group), (East China Normal University), (East China Normal University), (East China Normal University), (East China Normal University), (East China Normal University), (Ant Group), (Ant Group) We demonstrate SChain, a consortium blockchain that scales transaction processing to support large-scale enterprise applications. The unique advantage of SChain stems from the exploitation of both intra- and inter-block concurrency. The intra-block concurrency not only takes advantage of the multi-core processor on a single peer but also leverages the capacity of multiple peers. The inter-block concurrency enables simultaneous processing across multiple blocks to increase the utilization of the various peers. In our demonstration, we use real-time dashboards containing visualizations of SChain's output to give attendees an interactive exploration of how SChain achieves intra- and inter-block concurrency.
EPICGen: An Experimental Platform for Indoor Congestion Generation and Forecasting [Download Paper] (University of Southern California), (University of Pittsburgh), (University of Pittsburgh), (Computer Science Department, University of Southern California) Effectively and accurately forecasting the congestion in indoor spaces has become particularly important during the pandemic in order to reduce the risk of exposure to airborne viruses. However, there is a lack of readily available indoor congestion data to train such models. Therefore, in this demo paper we propose EPICGen, an experimental platform for indoor congestion generation to support congestion forecasting in indoor spaces. EPICGen consists of two components: (i) Grid Overlayer, which models the floor plans of buildings; and (ii) Congestion Generator, a realistic indoor congestion generator. We demonstrate EPICGen through an intuitive map-based user interface that enables end-users to customize the parameters of the system and visualize generated datasets.
Compliant Geo-distributed Data Processing in Action [Download Paper] (TU Berlin), (TU Berlin), (TU Berlin), (Technische Universitat Berlin) In this paper we present our work on compliant geo-distributed data processing. Our work focuses on the new dimension of dataflow constraints that regulate the movement of data across geographical or institutional borders. For example, European directives may permit transferring only certain information fields (such as non-personal information) or aggregated data. Thus, it is crucial for distributed data processing frameworks to consider compliance with respect to dataflow constraints derived from these regulations. We have developed a compliance-based data processing framework, which (i) allows for the declarative specification of dataflow constraints, (ii) determines if a query can be translated into a compliant distributed query execution plan, and (iii) executes the compliant plan over distributed SQL databases. We demonstrate our framework using a geo-distributed adaptation of the TPC-H benchmark data. Our framework provides an interactive dashboard, which allows users to specify dataflow constraints, and to analyze and execute compliant distributed query execution plans.
Query-Driven Video Event Processing for the Internet of Multimedia Things [Download Paper] (Insight SFI Research Centre for Data Analytics), (Insight SFI Centre for Data Analytics), (Insight SFI Centre for Data Analytics), (Insight SFI Centre for Data Analytics), (Insight SFI Centre for Data Analytics) Advances in Deep Neural Network (DNN) techniques have revolutionized video analytics and unlocked the potential for querying and mining video event patterns. This paper details GNOSIS, an event processing platform that performs near-real-time video event detection in a distributed setting. GNOSIS follows a serverless approach where its components act as independent microservices and can be deployed at multiple nodes. GNOSIS uses a declarative query-driven approach in which users can write customized queries for spatiotemporal video event reasoning. The system converts the incoming video streams into a continuously evolving graph stream using a pipeline of machine learning (ML) and DNN models, and applies graph matching for video event pattern detection. GNOSIS can perform both stateful and stateless video event matching. To improve Quality of Service (QoS), recent work in GNOSIS incorporates optimization techniques like adaptive scheduling, energy efficiency, and content-driven windows. This paper demonstrates Occupational Health and Safety query use cases to show the efficacy of GNOSIS.
GEDet: Detecting Erroneous Nodes with A Few Examples [Download Paper] (Case Western Reserve University), (Case Western Reserve University), (Pacific Northwest National Laboratory), (Case Western Reserve University) Detecting nodes with erroneous values in real-world graphs remains challenging due to the lack of examples and various error scenarios. We demonstrate GEDet, an error detection engine that can detect erroneous nodes in graphs with a few examples. The GEDet framework tackles error detection as a few-shot node classification problem. We invite the attendees to experience the following unique features. (1) Few-shot detection. Users only need to provide a few examples of erroneous nodes to perform error detection with GEDet. GEDet achieves desirable accuracy with (a) a graph augmentation module, which automatically generates synthetic examples to learn the classifier, and (b) an adversarial detection module, which improves classifiers to better distinguish erroneous nodes from both clean nodes and synthetic examples. We show that GEDet significantly improves over state-of-the-art error detection methods. (2) Diverse error scenarios. GEDet profiles data errors with a built-in library of transformation functions from correct values to errors. Users can also easily "plug in" new error types or examples. (3) User-centric detection. GEDet supports (a) an active learning mode that engages users to verify detected results and adapts the error detection process accordingly; and (b) visual interfaces to interpret and track detected errors.
Group 2: Machine Learning with and for Database Systems [Demo Blocks 2 & 4]
Refiner: A Reliable Incentive-Driven Federated Learning System Powered by Blockchain [Download Paper] (Zhejiang University), (Zhejiang University), (Zhejiang University), (Zhejiang University), (Zhejiang University), (Zhejiang University), (Zhejiang University), (Zhejiang University) Modern mobile applications often produce decentralized data, i.e., a huge amount of privacy-sensitive data distributed over a large number of mobile devices. Techniques for learning models from decentralized data must properly handle two characteristics of such data, namely privacy and massive engagement. Federated learning (FL) is a promising approach for such a learning task, since the technique learns models from data without exposing privacy. However, traditional FL methods assume that the participating mobile devices are honest volunteers. This assumption makes traditional FL methods unsuitable for applications where two kinds of participants are engaged: 1) self-interested participants who, without economic stimulus, are reluctant to contribute their computing resources unconditionally, and 2) malicious participants who send corrupt updates to disrupt the learning process. This paper proposes Refiner, a reliable federated learning system for tackling the challenges introduced by massive engagements of self-interested and malicious participants. Refiner is built upon Ethereum, a public blockchain platform. To engage self-interested participants, we introduce an incentive mechanism that rewards each participant according to the amount of its training data and the performance of its local updates. To handle malicious participants, we propose an audit scheme that employs a committee of randomly chosen validators to punish them by withholding rewards and to preclude corrupt updates from the global model. The proposed incentive and audit schemes are implemented with cryptocurrency and smart contracts, two primitives offered by Ethereum. This paper demonstrates the main features of Refiner by training a digit classification model on the MNIST dataset.
ADESIT: Visualize the Limits of your Data in a Machine Learning Process [Download Paper] (INSA Lyon), (INSA Lyon), (LIRIS), (LIRIS) Thanks to the numerous machine learning tools available to us nowadays, it is easier than ever to derive a model from a dataset in the context of a supervised learning problem. However, when this model behaves poorly compared with an expected performance, the underlying question of the existence of such a model is often overlooked, and one might simply be tempted to try different parameters or another model architecture. This is why the quality of the learning examples should be considered as early as possible, as it acts as a go/no-go signal for the following, potentially costly, learning process. With ADESIT, we provide a way to evaluate the ability of a dataset to perform well for a given supervised learning problem through statistics and visual exploration. Notably, we base our work on recent studies proposing the use of functional dependencies and specifically counterexample analysis to provide dataset cleanliness statistics, as well as a theoretical upper bound on the prediction accuracy directly linked to the problem settings (measurement uncertainty, expected generalization...). In brief, ADESIT is intended to be part of an iterative data refinement process, right after data selection and right before the machine learning process itself. With further analysis for a given problem, the user can characterize, clean, and export dynamically selected subsets, allowing users to better understand which regions of the data could be refined and where the data precision must be improved, for example by using new or more precise sensors.
HyMAC: A Hybrid Matrix Computation System [Download Paper] (East China Normal University), (East China Normal University), (East China Normal University), (TU Berlin), (Technische Universitat Berlin), (East China Normal University), (East China Normal University) Distributed matrix computation is common in large-scale data processing and machine learning applications. Iterative-convergent algorithms involving matrix computation share a common property: parameters converge non-uniformly. This property can be exploited to avoid redundant computation via incremental evaluation. Unfortunately, existing systems that support distributed matrix computation, like SystemML, do not employ incremental evaluation. Moreover, incremental evaluation does not always outperform classical matrix computation, which we refer to as full evaluation. To leverage the benefit of incremental evaluation, we propose a new system called HyMAC, which executes hybrid plans to balance the trade-off between full and incremental evaluation at each iteration. In this demonstration, attendees will have an opportunity to experience the effect that full, incremental, and hybrid plans have on iterative algorithms.
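The full-versus-incremental trade-off can be illustrated on a plain matrix-vector product: when only one parameter changes, an incremental update touches a single column instead of recomputing the whole product. This toy Python sketch is an illustration only; HyMAC works on distributed matrices and chooses between the two plan types per iteration.

```python
def matvec(A, x):
    """Full evaluation: compute y = A x from scratch, O(n*m) work."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matvec_incremental(A, y, j, delta):
    """Incremental evaluation: given a previous result y = A x,
    return the new product after x[j] has increased by `delta`.
    Only column j of A is touched, O(n) work instead of O(n*m)."""
    return [yi + delta * row[j] for yi, row in zip(y, A)]
```

When many entries of x change between iterations (early, non-converged phases), the incremental path does more bookkeeping than it saves, which is why a hybrid plan that switches strategies per iteration can win.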
Just Move It! Dynamic Parameter Allocation in Action [Download Paper] (Technische Universitat Berlin), (TU Berlin), (TU Berlin), (Universitat Mannheim), (Technische Universitat Berlin) Parameter servers (PSs) ease the implementation of distributed machine learning systems, but their performance can fall behind that of single-machine baselines due to communication overhead. We demonstrate LAPSE, an open-source PS with dynamic parameter allocation. Previous work has shown that dynamic parameter allocation can improve PS performance by up to two orders of magnitude and lead to near-linear speed-ups over single-machine baselines. This demonstration illustrates how LAPSE is used and why it can provide order-of-magnitude speed-ups over other PSs. To do so, this demonstration interactively analyzes and visualizes what dynamic parameter allocation looks like in action.
PostCENN: PostgreSQL with Machine Learning Models for Cardinality Estimation [Download Paper] (Technische Universitat Dresden), (Technische Universitat Dresden), (Technische Universitat Dresden), (TU Dresden), (TU Dresden) In this demo, we present PostCENN, an enhanced PostgreSQL database system with an end-to-end integration of machine learning (ML) models for cardinality estimation. In general, cardinality estimation is a topic with a long history in the database community. While traditional models like histograms are extensively used, recent works mainly focus on developing new approaches using ML models. However, traditional as well as ML models have their own advantages and disadvantages. With PostCENN, we aim to combine both to maximize their potential for cardinality estimation by introducing ML models as a novel means to increase the accuracy of the cardinality estimation for certain parts of the database schema. To achieve this, we integrate ML models as first-class citizens in PostgreSQL with a well-defined end-to-end life cycle. This life cycle consists of creating ML models for different sub-parts of the database schema, triggering the training, using ML models within the query optimizer in a transparent way, and deleting ML models.
DENOUNCER: Detection of Unfairness in Classifiers [Download Paper] (University of Michigan), (University of Michigan), (University of Michigan) The use of automated data-driven tools for decision-making has gained popularity in recent years. At the same time, the reported cases of algorithmic bias and discrimination have increased as well, which in turn has led to an extensive study of algorithmic fairness. Numerous notions of fairness have been proposed, designed to capture different scenarios. These measures typically refer to a "protected group" in the data, defined using the values of some sensitive attributes. Confirming whether a fairness definition holds for a given group is a simple task, but detecting groups that are treated unfairly by the algorithm may be computationally prohibitive, as the number of possible groups is combinatorial. We present a method for detecting such groups efficiently for various fairness definitions. Our solution is implemented in a system called DENOUNCER, an interactive system that allows users to explore different fairness measures of a (trained) classifier for given test data. We propose to demonstrate the usefulness of DENOUNCER using real-life data and to illustrate the effectiveness of our method.
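As one concrete example of a single-number fairness measure over a protected group, the demographic-parity gap compares positive-prediction rates inside and outside the group. The sketch below is a generic illustration, not DENOUNCER's API; the system's contribution is searching the combinatorial space of candidate groups efficiently, not checking a single group.

```python
def positive_rate(preds, mask):
    """Fraction of positive (1) predictions among rows selected by `mask`."""
    selected = [p for p, m in zip(preds, mask) if m]
    return sum(selected) / len(selected)

def parity_gap(preds, in_group):
    """Demographic-parity gap for one candidate group: the absolute
    difference in positive-prediction rates between group members and
    everyone else. A gap of 0 means this fairness definition holds."""
    complement = [not g for g in in_group]
    return abs(positive_rate(preds, in_group) - positive_rate(preds, complement))
```

Checking one group like this is cheap; the hard part, which the demo addresses, is that the number of groups definable over sensitive-attribute values grows combinatorially.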
A Demonstration of QARTA: An ML-based System for Accurate Map Services [Download Paper] (Qatar Computing Research Institute), (Qatar Computing Research Institute), (University of Minnesota), (Qatar Computing Research Institute), (University of Minnesota - Twin Cities) This demo presents QARTA, an open-source, full-fledged system for highly accurate and scalable map services. QARTA employs machine learning techniques to (a) construct its own highly accurate map in terms of both map topology and edge weights, and (b) calibrate its query answers based on contextual information, including transportation modality, underlying algorithm, and time of day/week. The demo is based on an actual deployment of QARTA in all taxis in the State of Qatar and in the third-largest food delivery company in the country, receiving hundreds of thousands of daily API calls with real-time response times. The audience will be able to interact with the demo through various scenarios that show QARTA's map and query accuracy as well as the internals of QARTA.
Demonstration of Panda: A Weakly Supervised Entity Matching System [Download Paper] (Georgia Institute of Technology), (Georgia Institute of Technology), (GATECH), (GATECH), (Microsoft Research) Entity matching (EM) refers to the problem of identifying tuple pairs in one or more relations that refer to the same real-world entities. Supervised machine learning (ML) approaches, and deep learning based approaches in particular, typically achieve state-of-the-art matching results. However, these approaches require many labeled examples, in the form of matching and non-matching pairs, which are expensive and time-consuming to label. In this paper, we introduce Panda, a weakly supervised system specifically designed for EM. Panda uses the same labeling function abstraction as Snorkel, where labeling functions (LFs) are user-provided programs that can generate large amounts of (somewhat noisy) labels quickly and cheaply, which can then be combined via a labeling model to generate accurate final predictions. To support users developing LFs for EM, Panda provides an integrated development environment (IDE) that runs in a modern browser. Panda's IDE facilitates the development, debugging, and life-cycle management of LFs in the context of EM tasks, similar to how IDEs such as Visual Studio or Eclipse excel at general-purpose programming. Panda's IDE includes many novel features purpose-built for EM, such as smart data sampling, a built-in library of EM utility functions, automatically generated LFs, visual debugging of LFs, and finally, an EM-specific labeling model. We show in this demo that the Panda IDE can greatly accelerate the development of high-quality EM solutions using weak supervision.
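Snorkel-style labeling functions for EM can be as simple as the sketch below. The record fields and the plain majority vote are hypothetical illustrations: Panda's actual labeling model learns to weight LFs by their estimated accuracies rather than counting votes equally.

```python
# Vote convention: 1 = MATCH, -1 = NON-MATCH, 0 = abstain.

def lf_exact_name(pair):
    """Label MATCH if the normalized names are identical; else abstain."""
    a, b = pair
    return 1 if a["name"].lower() == b["name"].lower() else 0

def lf_zip_mismatch(pair):
    """Label NON-MATCH if the ZIP codes differ; else abstain."""
    a, b = pair
    return -1 if a["zip"] != b["zip"] else 0

def majority_vote(pair, lfs):
    """Combine noisy LF votes into a final label (stand-in for a
    learned labeling model that would weight each LF's accuracy)."""
    score = sum(lf(pair) for lf in lfs)
    return 1 if score > 0 else (-1 if score < 0 else 0)
```

Because each LF may abstain, adding more LFs increases coverage without requiring any hand-labeled pairs, which is the core appeal of weak supervision for EM.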
Automatic Data Acquisition for Deep Learning [Download Paper] (Tsinghua University), (Tsinghua University), (Tsinghua University), (Tsinghua University), (Qatar Computing Research Institute, HBKU) Deep learning (DL) has widespread applications and has revolutionized many industries. Although automated machine learning (AutoML) can spare us from coding DL models, acquiring large amounts of high-quality data for model training remains a main bottleneck for many DL projects, simply because it incurs high human cost. Despite many works on weak supervision (i.e., adding weak labels to seen data) and data augmentation (i.e., generating more data based on seen data), automatically acquiring training data by smartly searching a pool of training data collected from open ML benchmarks and data markets has not been explored. In this demonstration, we present a new system, automatic data acquisition (AutoData), which automatically searches training data from a heterogeneous data repository and interacts with AutoML. It faces two main challenges. (1) How to search for high-quality data in a large repository for a given DL task? (2) How should AutoData interact with AutoML to guide the search? To address these challenges, we propose a reinforcement learning (RL)-based framework in AutoData to guide the iterative search process. AutoData encodes the current training data and feedback from AutoML, learns a policy to search for fresh data, and trains in iterations. We demonstrate AutoData with two real-life scenarios, image classification and relational data prediction, showing that AutoData can select high-quality data to improve the model.
DBMind: A Self-Driving Platform in openGauss [Download Paper] (Tsinghua), (Tsinghua University), (Tsinghua University), (Tsinghua university), (Tsinghua University), (Huawei), (Huawei), (Huawei), (Huawei) We demonstrate a self-driving system, DBMind, which provides three autonomous capabilities in a database: self-monitoring, self-diagnosis, and self-optimization. First, self-monitoring judiciously collects database metrics and detects anomalies (e.g., slow queries and IO contention), profiling database status while only slightly affecting system performance (<5%). Then, self-diagnosis utilizes an LSTM model to analyze the root causes of the anomalies and automatically detects root causes from a pre-defined failure hierarchy. Next, self-optimization automatically optimizes the database performance using learning-based techniques, including deep reinforcement learning based knob tuning, reinforcement learning based index selection, and encoding-based view selection. We have implemented DBMind in openGauss, an open-source database, and demonstrate real scenarios.
Assassin: an Automatic claSSificAtion system baSed on algorithm SelectIoN [Download Paper] (Harbin Institute of Technology), (Harbin Institute of Technology), (Harbin Institute of Technology), (Harbin Institute of Technology), (Harbin Institute of Technology), (HIT) The increasing complexity of data analysis tasks makes them dependent on human expertise and challenging for non-experts. One of the major challenges in data analysis is selecting the proper algorithm for a given task and data set. Motivated by this, we develop Assassin, which aims to help users without much expertise automatically select optimal algorithms for classification tasks. By embedding meta-learning techniques and a reinforcement-based policy, our system can automatically extract experience from previous tasks and train a meta-classifier to make algorithm recommendations. We then apply genetic search to explore hyperparameter configurations for the selected algorithm. We demonstrate Assassin with classification tasks from OpenML: the system chooses an appropriate algorithm and an optimal hyperparameter configuration to achieve a high level of performance. Assassin has a user-friendly interface that allows users to customize the parameters of the search process.
A Demonstration of the Exathlon Benchmarking Platform for Explainable Anomaly Detection [Download Paper] (Ecole Polytechnique), (Ecole Polytechnique), (Ecole Polytechnique), (Ecole Polytechnique), (Ecole Polytechnique), (Intel Labs and MIT) In this demo, we introduce Exathlon – a new benchmarking platform for explainable anomaly detection over high-dimensional time series. We designed Exathlon to support data scientists and researchers in developing and evaluating learned models and algorithms for detecting anomalous patterns as well as discovering their explanations. This demo will showcase Exathlon’s curated anomaly dataset, novel benchmarking methodology, and end-to-end data science pipeline in action via example usage scenarios.
An Intermediate Representation for Hybrid Database and Machine Learning Workloads [Download Paper] (University of Edinburgh), (University of Washington), (University of Zurich) IFAQ is an intermediate representation and compilation framework for hybrid database and machine learning workloads expressible using iterative programs with functional aggregate queries. We demonstrate IFAQ for several OLAP queries, linear algebra expressions, and learning factorization machines over training datasets defined by feature extraction queries over relational databases.
An Extensible and Reusable Pipeline for Automated Utterance Paraphrases [Download Paper] (Université Claude Bernard Lyon 1), (University of New South Wales), (Université Claude Bernard Lyon 1), (University of New South Wales, Australia), (Université Claude Bernard Lyon 1) In this demonstration paper we showcase an extensible and reusable pipeline for automatic paraphrase generation, i.e., reformulating sentences using different words. Capturing the nuances of human language is fundamental to the effectiveness of Conversational AI systems, as it allows them to deal with the different ways users can utter their requests in natural language. Traditional approaches to utterance paraphrase acquisition, such as hiring experts or crowdsourcing, involve processes that are often costly or time-consuming, and come with their own trade-offs in terms of quality. Automatic paraphrasing is emerging as an attractive alternative that promises a fast, scalable and cost-effective process. In this paper we showcase how our extensible and reusable pipeline for automated utterance paraphrasing can support the development of Conversational AI systems by integrating and extending existing techniques under a unified and configurable framework.
Group 3: Aspects of Query Processing [Demo Blocks 1 & 3]
MultiCategory: Multi-model Query Processing Meets Category Theory and Functional Programming [Download Paper] (University of Helsinki), (University of Helsinki), (Oracle), (Oracle), (Oracle), (Soulmates.ai) The variety of data is one of the important issues in the era of Big Data. The data are naturally organized in different formats and models, including structured data, semi-structured data, and unstructured data. Prior research has envisioned an approach to abstract multi-model data with a schema category and an instance category by using category theory. In this paper, we demonstrate a system, called MultiCategory, which processes multi-model queries based on category theory and functional programming. This demo is centered around four main scenarios to show a tangible system. First, we show how to build a schema category and an instance category by loading different models of data, including relational, XML, key-value, and graph data. Second, we show a few examples of query processing by using the functional programming language Haskell. Third, we demo the flexible outputs with different models of data for the same input query. Fourth, to better understand the category theoretical structure behind the queries, we offer a variety of graphical hooks to explore and visualize queries as graphs with respect to the schema category, as well as the query processing procedure with Haskell.
Cquirrel: Continuous Query Processing over Acyclic Relational Schemas [Download Paper] (Hong Kong University of Science and Technology), (Hong Kong University of Science and Technology), (Hong Kong University of Science and Technology), (Hong Kong Univ. of Science and Technology), (Alibaba), (Alibaba Group), (Alibaba Inc.) We will demonstrate Cquirrel, a continuous query processing engine built on top of Flink. Cquirrel assumes a relational schema where the foreign-key constraints form a directed acyclic graph, and supports any selection-projection-join-aggregation query where all join conditions are between a primary key and a foreign key. It allows arbitrary updates to any of the relations, and outputs the deltas in the query answers in real-time. It provides much better support for multi-way joins than the native join operator in Flink. Meanwhile, it offers better performance, scalability, and fault tolerance than other continuous query processing engines.
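The core delta-maintenance idea, emitting only the change to a primary-key/foreign-key join result on each update rather than recomputing the join, can be sketched as follows. This is a simplified illustration with hypothetical class and method names, not Cquirrel's actual Flink operators:

```python
# Hypothetical sketch of incremental maintenance for a PK-FK join:
# when a fact (foreign-key side) row arrives, probe an index on the
# dimension (primary-key side) and emit only the delta tuples.

class IncrementalJoin:
    def __init__(self):
        self.dim = {}       # primary-key side: pk -> row
        self.pending = {}   # fact rows whose dimension row has not arrived yet

    def insert_dim(self, pk, row):
        self.dim[pk] = row
        # Any buffered fact rows referencing this pk now produce output.
        return [(fact, row) for fact in self.pending.pop(pk, [])]

    def insert_fact(self, fk, row):
        if fk in self.dim:
            return [(row, self.dim[fk])]   # delta: one new joined tuple
        self.pending.setdefault(fk, []).append(row)
        return []
```

Buffering out-of-order arrivals (`pending`) is what lets such an engine accept arbitrary update orders while still emitting each joined tuple exactly once.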
DeFiHap: Detecting and Fixing HiveQL Anti-Patterns [Download Paper] (Shanghai Jiao Tong University), (Shanghai Jiao Tong University), (Shanghai Jiao Tong University), (Shanghai Jiao Tong University), (Shanghai Jiao Tong University), (Shanghai Jiao Tong University) The emergence of Hive greatly facilitates the management of massive data stored in various places. Meanwhile, data scientists face challenges during HiveQL programming: they may not use correct and/or efficient HiveQL statements in their programs, and developers may also introduce anti-patterns inadvertently into HiveQL programs, leading to poor performance, low maintainability, and/or program crashes. This paper presents an empirical study on HiveQL programming, in which 38 HiveQL anti-patterns are revealed. We then design and implement DeFiHap, the first tool for automatically detecting and fixing HiveQL anti-patterns. DeFiHap detects HiveQL anti-patterns by analyzing the abstract syntax trees of HiveQL statements and Hive configurations, and generates fix suggestions using rule-based rewriting and performance tuning techniques. The experimental results show that DeFiHap is effective. In particular, DeFiHap detects 25 anti-patterns and generates fix suggestions for 17 of them.
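To give a flavor of rule-based anti-pattern detection, the sketch below checks HiveQL text against two illustrative rules. These example rules and the regex-over-text approach are ours for brevity; DeFiHap's actual 38 rules operate on abstract syntax trees:

```python
import re

# Illustrative rule-based anti-pattern checks over raw HiveQL text.
# A real tool (like DeFiHap) analyzes the AST instead of matching text.
RULES = [
    ("SELECT *", re.compile(r"select\s+\*", re.I),
     "List needed columns explicitly to reduce I/O."),
    ("ORDER BY without LIMIT", re.compile(r"order\s+by(?!.*\blimit\b)", re.I | re.S),
     "ORDER BY uses a single reducer; add LIMIT or use SORT BY."),
]

def detect_anti_patterns(hiveql):
    """Return (name, fix suggestion) pairs for every rule that fires."""
    return [(name, fix) for name, pat, fix in RULES if pat.search(hiveql)]
```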
Low-Latency Compilation of SQL Queries to Machine Code [Download Paper]
(TU Dortmund University),
(TU Dortmund University)
Query compilation has proven to be one of the most efficient query processing techniques. Despite its fast processing speed, the additional compilation times of the technique limit its applicability. This is because the approach is most beneficial only when the improvements in processing time clearly exceed the additional compilation time.
Recently, the feasibility of query compilers with very low compilation times has been shown, which may establish query compilation as a near-universal approach. In this article and in the corresponding live demo, we show the capabilities of the ReSQL database system, which uses the intermediate representation Flounder IR to achieve very low compilation times. ReSQL reduces the compilation times from SQL to machine code compared to existing LLVM-based techniques by up to 101.1x for real-world analytic queries.
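The general idea behind query compilation (independent of ReSQL's Flounder IR, which targets machine code) is to generate a specialized operator once instead of interpreting a predicate tree for every tuple. A minimal Python analogue, assuming a trusted predicate description:

```python
# Generic illustration of query compilation: generate specialized source
# for one filter predicate, compile it once, and run the resulting
# function over tuples with no per-tuple interpretation overhead.

def compile_filter(column, op, constant):
    # 'op' is interpolated verbatim; a real compiler would validate it.
    src = (f"def scan(rows):\n"
           f"    return [r for r in rows if r[{column!r}] {op} {constant!r}]")
    namespace = {}
    exec(src, namespace)          # "compile" the specialized operator
    return namespace["scan"]
```

In a real system this step emits machine code (via LLVM or, in ReSQL's case, a lightweight IR), and the compilation latency of that step is exactly what the demonstrated work minimizes.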
Sound of Databases: Sonification of a Semantic Web Database Engine [Download Paper] (University of Lübeck), (University of Lübeck), (University of Lübeck) Sonifications map data to auditory dimensions and offer a new audible experience to their listeners. We propose a sonification of query processing, paired with a corresponding visualization, both integrated into a web application. In this demonstration we show that the sonification of different types of relational operators generates different sound patterns, which listeners can recognize and identify, increasing their understanding of the operators' functionality and helping them remember properties such as that merge joins require sorted input. Furthermore, the sonification approach enables new ways of analyzing query processing.
Nested Collections Efficiently [Download Paper]
(Oxford University), (Oxford University), (Oxford University), (University of Edinburgh)
Nested relational query languages have long been seen as an attractive tool for scenarios involving large hierarchical datasets. In recent years, there has been a resurgence of interest in nested relational languages. One driver has been the affinity of these languages for large-scale processing platforms such as Spark and Flink.
This demonstration gives a tour of TraNCE, a new system for processing nested data on top of distributed processing systems. The core innovation of the system is a compiler that processes nested relational queries in a series of transformations; these include variants of two prior techniques, shredding and unnesting, as well as a materialization transformation that customizes the way levels of the nested output are generated. The TraNCE platform builds on these techniques by adding components for users to create and visualize queries, as well as data explorer and notebook execution targets to facilitate the construction of large-scale data science applications. The demonstration will both showcase the system from the viewpoint of usability by data scientists and illustrate the data management techniques employed.
Debugging Missing Answers for Spark Queries over Nested Data with Breadcrumb [Download Paper] (University of Stuttgart), (University of Cincinnati), (Illinois Institute of Technology), (Universitat Stuttgart) We present Breadcrumb, a system that aids developers in debugging queries through query-based explanations for missing answers. Given as input a query and an expected, but missing, query result, Breadcrumb identifies operators in the input query that are responsible for the failure to derive the missing answer. These operators form explanations that guide developers who can then focus their debugging efforts on fixing these parts of the query. Breadcrumb is implemented on top of Apache Spark. Our approach is the first that scales to big data dimensions and is capable of finding explanations for common errors in queries over nested and de-normalized data, e.g., errors based on misinterpreting schema semantics.
Interactive Demonstration of SQLCHECK [Download Paper] (Georgia Institute of Technology), (Georgia Institute of Technology), (Georgia Institute of Technology), (Georgia Institute of Technology), (Georgia Tech), (Georgia Institute of Technology) We will demonstrate a prototype of SQLCHECK, a holistic toolchain for automatically finding and fixing anti-patterns in database applications. The advent of modern database-as-a-service platforms has made it easy for developers to quickly create scalable applications. However, it is still challenging for developers to design performant, maintainable, and accurate applications. This is because developers may unknowingly introduce anti-patterns in the application's SQL statements. These anti-patterns are design decisions that are intended to solve a problem, but often lead to other problems by violating fundamental design principles. SQLCHECK leverages techniques for automatically: (1) detecting anti-patterns with high accuracy, (2) ranking them based on their impact on performance, maintainability, and accuracy of applications, and (3) suggesting alternative queries and changes to the database design to fix these anti-patterns. We will show how SQLCHECK suggests fixes for high-impact anti-patterns using rule-based query refactoring techniques. We will demonstrate that SQLCHECK enables developers to create more performant, maintainable, and accurate applications. We will show the prevalence of these anti-patterns in a large collection of queries and databases collected from open-source repositories.
Demonstration of Generating Explanations for Black-Box Algorithms Using Lewis [Download Paper] (University of California, San Diego), (University of Massachusetts Amherst), (University of California San Diego), (University of California, San Diego) Explainable artificial intelligence (XAI) aims to reduce the opacity of AI-based decision-making systems, allowing humans to scrutinize and trust them. Unlike prior work that attributes the responsibility for an algorithm's decisions to its inputs as a purely associational concept, we propose a principled causality-based approach for explaining black-box decision-making systems. We present the demonstration of Lewis, a system that generates explanations for black-box algorithms at the global, contextual, and local levels, and provides actionable recourse for individuals negatively affected by an algorithm's decision. Lewis makes no assumptions about the internals of the algorithm except for the availability of its input-output data. The explanations generated by Lewis are based on probabilistic contrastive counterfactuals, a concept that can be traced back to philosophical, cognitive, and social foundations of theories on how humans generate and select explanations. We describe the system layout of Lewis wherein an end-user specifies the underlying causal model and Lewis generates explanations for particular use-cases, compares them with explanations generated by state-of-the-art approaches in XAI, and provides actionable recourse when applicable. Lewis has been developed as open-source software; the code and the demonstration video are available at lewis-system.github.io.
Wikinegata: a Knowledge Base with Interesting Negative Statements [Download Paper] (Max-Planck-Institut für Informatik), (Max-Planck-Institut für Informatik, Germany), (Max-Planck-Institut für Informatik), (The University of Edinburgh) Databases about general-world knowledge, so-called knowledge bases (KBs), are important in applications such as search and question answering. Traditionally, although KBs operate under the open-world assumption, popular KBs only store positive information and withhold from taking any stance towards statements *not* contained in them. In this demo, we show that storing and presenting noteworthy negative statements is important to overcome current limitations in various use cases. In particular, we introduce the Wikinegata portal, a platform for exploring negative statements about Wikidata entities, which implements a peer-based ranking method for inferring interesting negations in KBs. The demo is available at http://d5demos.mpi-inf.mpg.de/negation.
Full Encryption: An end-to-end encryption mechanism in GaussDB [Download Paper] (Huawei Technologies Co., Ltd.), (Huawei Technologies Co., Ltd.), (Huawei Technologies Co., Ltd.), (Huawei Technologies Co., Ltd.) In this paper, we present a novel mechanism called Full Encryption (FE) in GaussDB. FE-in-GaussDB provides column-level encryption for sensitive data and protects data assets from malicious cloud administrators and information-leakage attacks. It ensures not only the security of operations on ciphertext data, but also the efficiency of query execution, by combining the advantages of cryptographic algorithms (i.e., software mode) and a Trusted Execution Environment (i.e., hardware mode). With this, FE-in-GaussDB supports full-scene query processing, including matching, comparison, and other rich computational functionality. We demonstrate the prototype of FE-in-GaussDB and an experimental performance evaluation to show its usability and effectiveness.
Processing of Arbitrary Data-Intensive Applications without ETL [Download Paper]
(Johannes Gutenberg-University Mainz), (Johannes Gutenberg-University Mainz), (Johannes Gutenberg-University Mainz), (Johannes Gutenberg-University Mainz)
The volume of data that is processed and produced by modern data-intensive applications is constantly increasing. Of course, along with the volume, the interest in analyzing and interpreting this data increases as well. As a consequence, more and more DBMSs and processing frameworks are specialized towards the efficient execution of long-running, read-only analytical queries. Unfortunately, to enable analysis, the data first has to be moved from the source application to the analytics tool via a lengthy ETL process, which increases the runtime and complexity of the analysis pipeline.
In this work, we advocate to simply skip ETL altogether. With AnyOLAP, we can perform online analysis of data directly within the source application and while it is running.
In the proposed demonstration, the audience will get the chance to put AnyOLAP to the test on a set of data-intensive applications that are analyzed while they are up and running. As the entire analysis pipeline of AnyOLAP will be exposed to the audience in the form of live and interactive visualizations, users will be able to experience the benefits of true online analysis firsthand.
A Demonstration of NoDA: Unified Access to NoSQL Stores [Download Paper] (University of Piraeus), (University of Piraeus), (University of Piraeus), (University of the Aegean) In this demo paper, we present a system prototype, called NoDA, that unifies access to NoSQL stores by exposing a single interface to big data developers. This hides the heterogeneity of NoSQL stores in terms of different query languages, non-standardized access, and different data models. NoDA comprises a layer positioned on top of NoSQL stores that defines a set of basic data access operators (filter, project, aggregate, etc.), implemented for different NoSQL engines. Furthermore, NoDA is extended to support more complex operators, such as geospatial operators. The provision of generic data access operators enables a declarative interface using SQL as the query language. We demonstrate NoDA by showcasing that the exact same query can be processed by different NoSQL stores, without any modification or transformation whatsoever.
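The layering idea, one generic operator interface with per-store implementations behind it, can be sketched as below. The class and method names are illustrative, not NoDA's actual API, and the in-memory backend stands in for a store-specific connector (e.g., to MongoDB or HBase):

```python
# Hypothetical sketch of a unified operator layer over NoSQL stores.

class NoSQLDataset:
    """Engine-agnostic operator interface; one subclass per store."""
    def filter(self, predicate): raise NotImplementedError
    def project(self, *columns): raise NotImplementedError
    def aggregate(self, column, func): raise NotImplementedError

class InMemoryDataset(NoSQLDataset):
    """Stand-in for a store-specific backend implementation."""
    def __init__(self, docs):
        self.docs = list(docs)
    def filter(self, predicate):
        return InMemoryDataset(d for d in self.docs if predicate(d))
    def project(self, *columns):
        return InMemoryDataset({c: d[c] for c in columns} for d in self.docs)
    def aggregate(self, column, func):
        return func(d[column] for d in self.docs)
```

A SQL front end can then translate each query into the same operator calls regardless of which backend executes them, which is what makes "the exact same query" portable across stores.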
AutoExecutor: Predictive Parallelism for Spark SQL Queries [Download Paper] (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft), (Microsoft) Right-sizing resources for query execution is important for cost-efficient performance, but estimating upfront, before query execution, how performance is affected by resource allocations is difficult. We demonstrate AutoExecutor, a predictive system that uses machine learning models to predict query run times as a function of the number of allocated executors (which bounds the maximum parallelism) for Spark SQL queries running on Azure Synapse.
Group 4: Data Science, Data Cleaning and Discovery [Demo Blocks 2 & 4]
A Demonstration of KGLac: A Data Discovery and Enrichment Platform for Data Science [Download Paper] (Concordia University), (Concordia University), (BorealisAI), (Concordia University) Data science's growing success relies on knowing where a relevant dataset exists, understanding its impact on a specific task, finding ways to enrich a dataset, and leveraging insights derived from it. With the growth of open data initiatives, data scientists need an extensible set of effective discovery operations to find relevant data from their enterprise datasets accessible via data discovery systems or open datasets accessible via data portals. Existing portals and systems suffer from limited discovery support and do not track the use of a dataset and insights derived from it. We will demonstrate KGLac, a system that captures metadata and semantics of datasets to construct a knowledge graph (GLac) interconnecting data items, e.g., tables and columns. KGLac supports various data discovery operations via SPARQL queries for table discovery, unionable and joinable tables, plus annotation with related derived insights. We harness a broad range of Machine Learning (ML) approaches with GLac to enable automatic graph learning for advanced and semantic data discovery. The demo will showcase how KGLac facilitates data discovery and enrichment while developing an ML pipeline to evaluate potential gender salary bias in IT jobs.
Intermittent Human-in-the-Loop Model Selection using Cerebro: A Demonstration [Download Paper] (UC San Diego), (University of California, San Diego), (University of California, San Diego) Deep learning (DL) is revolutionizing many fields. However, there is a major bottleneck for the wide adoption of DL: the pain of model selection, which requires exploring a large configuration space of model architectures and training hyper-parameters before picking the best model. The two existing popular paradigms for exploring this configuration space pose a false dichotomy. AutoML-based model selection explores configurations with high throughput but uses human intuition minimally. Alternatively, interactive human-in-the-loop model selection completely relies on human intuition to explore the configuration space but often has very low throughput. To mitigate the above drawbacks, we propose a new paradigm for model selection that we call intermittent human-in-the-loop model selection. In this demonstration, we will showcase our approach using five real-world deep learning model selection workloads. A short video of our demonstration is available on our project web page: https://adalabucsd.github.io/cerebro.html.
Demonstration of Dealer: An End-to-End Model Marketplace with Differential Privacy [Download Paper] (Zhejiang University), (Zhejiang University), (Zhejiang University), (Zhejiang University), (Emory University), (Renmin University of China), (Emory University), (Simon Fraser University), (UIUC) Data-driven machine learning (ML) has witnessed great success across a variety of application domains. Since ML model training relies on a large amount of data, there is a growing demand for high-quality data to be collected for ML model training. Data markets can significantly facilitate this data collection. In this work, we demonstrate Dealer, an enD-to-end model marketplace with differential privacy. Dealer consists of three entities: data owners, the broker, and model buyers. Data owners receive compensation, allocated by the broker, for the use of their data; the broker collects data from data owners and builds and sells models to model buyers; model buyers purchase their target models from the broker. We demonstrate the functionalities of the three participating entities and the abbreviated interactions between them. The demonstration allows the audience to understand and interactively experience the process of model trading. The audience can act as a data owner to control what data is shared and how it is compensated, as a broker to price machine learning models for maximum revenue, and as a model buyer to purchase target models that meet their expectations.
ATLANTIC: Making Database Differentially Private and Faster with Accuracy Guarantee [Download Paper]
(MIT), (Google), (Worcester Polytechnic Institute), (MIT), (Tsinghua University)
Differential privacy promises to enable data sharing and general data analytics while protecting individual privacy. Because private data is often stored in relational databases that support SQL queries, making SQL-based analytics differentially private is critical. However, existing SQL-based differentially private systems either focus only on specific types of SQL queries, such as COUNT, or substantially modify the database engine, obstructing adoption in practice.
Worse yet, these systems often do not guarantee the accuracy desired by applications. In this demonstration, using a driving-trace workload from Cambridge Mobile Telematics (CMT), we show that our ATLANTIC system, as a database middleware, enforces differential privacy for real-world SQL queries with provable accuracy guarantees and is compatible with existing databases.
Moreover, using a sampling-based technique, ATLANTIC significantly speeds up query execution while also effectively amplifying the privacy guarantee.
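Two standard differential-privacy building blocks behind this kind of middleware can be sketched as follows. These are generic textbook mechanisms, not ATLANTIC's actual implementation: a Laplace-noised COUNT, and the usual "amplification by subsampling" bound showing that running an ε-DP mechanism on a q-fraction sample yields a stronger guarantee ε' = ln(1 + q(e^ε − 1)):

```python
import math, random

def noisy_count(rows, predicate, eps, rng=random.Random(0)):
    """eps-differentially-private COUNT via the Laplace mechanism."""
    true_count = sum(1 for r in rows if predicate(r))
    scale = 1.0 / eps                      # COUNT has sensitivity 1
    u = rng.random() - 0.5                 # inverse-CDF Laplace sampling
    noise = -scale * math.copysign(math.log(1 - 2 * abs(u)), u)
    return true_count + noise

def amplified_epsilon(eps, q):
    """Privacy amplification by subsampling with sampling rate q."""
    return math.log(1 + q * (math.e ** eps - 1))
```

For example, an ε = 1 mechanism run over a 10% sample satisfies roughly ε' ≈ 0.159, which is why sampling can speed up queries and strengthen privacy at the same time.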
From Papers to Practice: The openclean Open-Source Data Cleaning Library [Download Paper] (NYU), (New York University), (New York University), (New York University) Data preparation is still a major bottleneck for many data science projects. Even though many sophisticated algorithms and tools have been proposed in the research literature, it is difficult for practitioners to integrate them into their data wrangling efforts. We present openclean, an open-source Python library for data cleaning and profiling. openclean integrates data profiling and cleaning tools in a single environment that is easy and intuitive to use. We designed openclean to be extensible and make it easy to add new functionality. This not only makes it easier for users to access state-of-the-art algorithms for their data preparation efforts, but also allows researchers to integrate their work and evaluate its effectiveness in practice. We envision openclean as a first step towards building a community of practitioners and researchers in the field. In our demo, we outline the main components and design decisions in the development of openclean and demonstrate the current functionality of the library on real-world use cases.
Auctus: A Dataset Search Engine for Data Discovery and Augmentation [Download Paper] (New York University), (NYU), (New York University), (New York University), (NYU), (New York University) The large volumes of structured data currently available, from Web tables to open-data portals and enterprise data, open up new opportunities for progress in answering many important scientific, societal, and business questions. However, finding relevant data is difficult. While search engines have addressed this problem for Web documents, there are many new challenges involved in supporting the discovery of structured data. We demonstrate how the Auctus dataset search engine addresses some of these challenges. We describe the system architecture and how users can explore datasets through a rich set of queries. We also present case studies which show how Auctus supports data augmentation to improve machine learning models as well as to enrich analytics.
A Demonstration of Relic: A System for REtrospective Lineage InferenCe of Data Workflows [Download Paper] (University of Chicago), (Microsoft), (University of Chicago) The ad-hoc, heterogeneous process of modern data science typically involves loading, cleaning, and mutating dataset(s) into multiple versions recorded as artifacts operated on by various tools within a single data science workflow. Lineage information, including the source datasets, data transformation programs or scripts, or manual annotations, is rarely captured, making it difficult to infer the relationships between artifacts in a given workflow retrospectively. We demonstrate RELIC, a tool to retrospectively infer the lineage of data artifacts generated as a result of typical data science workflows, with an interactive demonstration that allows users to input artifact files and visualize the inferred lineage in a web-based setting.
DatAgent: The Imminent Age of Intelligent Data Assistants [Download Paper] (Athena Research Center), (Athena Research Center), (Athena Research Center), (Athena Research Center), (Athena Research Center), (Athena) In this demonstration, we present DatAgent, an intelligent data assistant system that allows users to pose queries in natural language, responds in natural language, actively guides the user with different types of recommendations and hints, and learns from user actions. We will demonstrate different exploration scenarios that show how the system and the user engage in a human-like interaction inspired by the interaction paradigm of chatbots and virtual assistants.
DICE: Data Discovery by Example [Download Paper] (MIT), (National Institute of Technology Hamirpur), (University of Massachusetts Amherst), (MIT Lincoln Laboratory), (United States Air Force), (MIT Lincoln Laboratory), (MIT) In order to conduct analytical tasks, data scientists often need to find relevant data from an avalanche of sources (e.g., data lakes, large organizational databases). This effort is typically made in an ad-hoc, non-systematic manner, which makes it a daunting endeavour. Current data discovery systems typically require the user to find relevant tables manually, usually by issuing multiple queries (e.g., using SQL). However, expressing such queries is nontrivial, as it requires knowledge of the underlying structure (schema) of the data organization in advance. This issue is further exacerbated when data resides in data lakes, where there is no predefined schema that data must conform to. On the other hand, data scientists can often come up with a few example records of interest quickly. Motivated by this observation, we developed DICE--a human-in-the-loop system for Data dIsCovery by Example--that takes user-provided example records as input and returns more records that satisfy the user intent. DICE's key idea is to synthesize a SQL query that captures the user intent, specified via examples. To this end, DICE follows a three-step process: (1) DICE first discovers a few candidate queries by finding join paths across tables within the data lake. (2) Then DICE consults with the user for validation by presenting a few records to them, and, thus, eliminating spurious queries. (3) Based on the user feedback, DICE refines the search and repeats the process until the user is satisfied with the results. We will demonstrate how DICE can help in data discovery through an interactive, example-based interaction.
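The first step of such a pipeline, finding candidate tables and column mappings that cover the user's example records, can be sketched in simplified form. This is an illustration of discovery-by-example in general, not DICE's actual algorithm (which also synthesizes SQL with join paths across tables):

```python
from itertools import permutations

# Simplified discovery-by-example: for each table, try each mapping of
# columns to example positions and keep the mappings whose projected
# tuples cover every user-provided example record.

def discover_by_example(tables, examples):
    """tables: {name: list of row dicts}; examples: list of value tuples."""
    width = len(examples[0])
    candidates = []
    for name, rows in tables.items():
        cols = list(rows[0]) if rows else []
        for mapping in permutations(cols, width):
            tuples = {tuple(r[c] for c in mapping) for r in rows}
            if all(tuple(e) in tuples for e in examples):
                candidates.append((name, mapping))
    return candidates
```

A real system then turns each surviving (table, mapping) candidate into a SQL query, prunes spurious ones through user feedback, and iterates, as the abstract describes.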
How Divergent Is Your Data? [Download Paper] (Politecnico di Torino), (University of California, Santa Cruz), (Dipartimento di Automatica e Informatica, Politecnico di Torino), (University of California, Santa Cruz) We present DivExplorer, a tool that enables users to explore datasets and find subgroups of data on which a classifier behaves anomalously. These subgroups, termed divergent subgroups, may exhibit, for example, higher-than-normal false-positive or false-negative rates. DivExplorer can be used to analyze and debug classifiers. If the data has ethical or social implications, DivExplorer can also be used to identify bias in classifiers.
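The underlying measure is easy to state: a subgroup's divergence on a metric is the difference between the metric computed on the subgroup and on the whole dataset. A minimal sketch for false-positive-rate divergence (the real DivExplorer additionally mines frequent itemsets to enumerate subgroups, which is omitted here):

```python
# Divergence of a subgroup = its false-positive rate minus the overall
# false-positive rate. Records carry a true label and a prediction in {0, 1}.

def fpr(records):
    negatives = [r for r in records if not r["label"]]
    return sum(r["pred"] for r in negatives) / len(negatives) if negatives else 0.0

def divergence(records, subgroup_predicate):
    subgroup = [r for r in records if subgroup_predicate(r)]
    return fpr(subgroup) - fpr(records)
```

A large positive divergence flags a subgroup where the classifier produces disproportionately many false positives, which is the signal used for debugging and bias detection.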
RONIN: Data Lake Exploration [Download Paper]
(University of Rochester), (University of Rochester), (University of Rochester), (University of Toronto), (Microsoft Research), (Ontario Tech University), (Northeastern University)
Dataset discovery can be performed using search (with a query or keywords) to find relevant data. However, the result of this discovery can be overwhelming to explore. Existing navigation techniques mostly focus on linkage graphs that enable navigation from one data set to another based on similarity or joinability of attributes.
However, users often do not know which data set to start the navigation from. RONIN proposes an alternative way to navigate by building a hierarchical structure on a collection of data sets: the user navigates between groups of data sets in a hierarchical manner to narrow down to the data of interest. We demonstrate RONIN, a tool that enables user exploration of a data lake by seamlessly integrating the two common modalities of discovery: data set search and navigation of a hierarchical structure. In RONIN, a user can perform a keyword search or joinability search over a data lake, then, navigate the result using a hierarchical structure, called an organization, that is created on the fly. While navigating an organization, the user may switch to the search mode, and back to navigation on an organization that is updated based on search. This integration of search and navigation provides great power in allowing users to find and explore interesting data in a data lake.
SAND in Action: Subsequence Anomaly Detection for Streams [Download Paper] (Université de Paris), (University of Chicago), (Université de Paris), (University of Chicago) Subsequence anomaly detection in long data series is a significant problem. As the demand for real-time analytics and decision-making increases, anomaly detection methods have to operate over streams and handle drifts in data distribution. Nevertheless, existing approaches either require prior domain knowledge or become cumbersome and expensive to use in situations with recurrent anomalies of the same type. Moreover, subsequence anomaly detection methods usually require access to the entire dataset and are not able to learn and detect anomalies in streaming settings. To address these limitations, we propose SAND, a novel online system suitable for domain-agnostic anomaly detection. SAND relies on a novel streaming methodology to incrementally update a model that adapts to distribution drifts and omits obsolete data. We demonstrate our system over different streaming scenarios and compare SAND with other subsequence anomaly detection methods.
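The shape of such an online detector, score each incoming value against an incrementally maintained model that forgets obsolete data, can be sketched with a simple bounded-window z-score. This is only in the spirit of the approach; SAND's actual model is based on clustering subsequences, not a running mean:

```python
from collections import deque

# Simplified online anomaly detector: score each new value by its
# deviation from a bounded history window. The bounded window is what
# makes the model adapt to drift and drop obsolete data.

class StreamDetector:
    def __init__(self, window=50, threshold=3.0):
        self.history = deque(maxlen=window)   # old values fall out automatically
        self.threshold = threshold

    def score(self, value):
        if len(self.history) < 2:
            self.history.append(value)
            return 0.0
        mean = sum(self.history) / len(self.history)
        var = sum((v - mean) ** 2 for v in self.history) / len(self.history)
        std = var ** 0.5 or 1e-9
        z = abs(value - mean) / std
        self.history.append(value)            # model updates incrementally
        return z

    def is_anomaly(self, value):
        return self.score(value) > self.threshold
```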
Valentine in Action: Matching Tabular Data at Scale [Download Paper]
(TU Delft), (TU Delft), (TU Delft), (TU Delft), (TU Delft), (Univ. of Lyon), (TU Delft)
Capturing relationships among heterogeneous datasets in large data lakes – traditionally termed schema matching – is one of the most challenging problems that corporations and institutions face nowadays. Discovering and integrating datasets heavily relies on the effectiveness of the schema matching methods in use. However, despite the wealth of research, evaluation of schema matching methods is still a daunting task: there is a lack of openly-available datasets with ground truth, reference method implementations, and comprehensible GUIs that would facilitate development of both novel state-of-the-art schema matching techniques and novel data discovery methods.
Our recently proposed Valentine is the first system to offer an open-source experiment suite to organize, execute, and orchestrate large-scale matching experiments. In this demonstration we present its functionalities and enhancements: i) a scalable system, with a user-centric GUI, that enables the fabrication of datasets and the evaluation of matching methods on schema matching scenarios tailored to the scope of tabular dataset discovery, and ii) a scalable holistic matching system that can receive tabular datasets from heterogeneous sources and provide similarity scores among their columns, in order to facilitate modern procedures in data lakes, such as dataset discovery.