VLDB 2021: Invited Scalable Data Science Talks
Towards Scalable Online Machine Learning Collaborations with OpenML
Joaquin Vanschoren (Eindhoven University of Technology)
Is massively collaborative machine learning possible? Can we share and organize our collective
knowledge of machine learning to solve ever more challenging problems? In a way, yes: as a
community, we are already very successful at developing high-quality open-source machine learning
libraries, thanks to frictionless collaboration platforms for software development. However, code
is only one aspect. The answer is much less clear when we also consider the data that goes into
these algorithms and the exact models that are produced. A tremendous amount of work and experience
goes into the collection, cleaning, and preprocessing of data and the design, evaluation, and
finetuning of models, yet very little of this is shared and organized in a way so that others can
easily build on it.
Suppose one had a global platform for sharing machine learning datasets, models, and reproducible experiments in a frictionless way so that anybody could chip in at any time to share a good model, add or improve data, or suggest an idea. OpenML is an open-source initiative to create such a platform. It allows anyone to share datasets, machine learning pipelines, and full experiments, organizes all of it online with rich metadata, and enables anyone to reuse and build on them in novel and unexpected ways. All data is open and accessible through APIs, and it is readily integrated into popular machine learning tools to allow easy sharing of models and experiments. This openness also allows a budding ecosystem of automated processes to scale up machine learning further, such as discovering similar datasets, creating systematic benchmarks, or learning from all collected results how to build the best machine learning models and even automatically doing so for any new dataset. We welcome all of you to become a part of it.
Bio: Joaquin Vanschoren is an assistant professor at the Eindhoven University of Technology (TU/e). His research focuses on the automation of machine learning (AutoML) and Meta-Learning. He co-authored and co-edited the books 'Automated Machine Learning: Methods, Systems, Challenges' and 'Meta-learning: Applications to AutoML and data mining', published over 100 articles on these topics, and received an Amazon Research Award, an Azure Research Award, the Dutch Data Prize, and an ECML PKDD demonstration award.
He founded and leads OpenML.org, an open science platform for machine learning. He is a founding member of the European AI associations ELLIS and CLAIRE, chairs the Open Machine Learning Foundation, and co-chairs the W3C Machine Learning Schema Community Group.
He has been a tutorial speaker at NeurIPS and AAAI, and has given more than 20 invited talks, including at ECDA, StatComp, IDEAL, and workshops at NeurIPS, ICML, and SIGMOD. He is datasets and benchmarks chair at NeurIPS 2021, was program chair of Discovery Science 2018, general chair at LION 2016, and demo chair at ECMLPKDD 2013, and has co-organized the AutoML and Meta-Learning workshop series at NeurIPS and ICML from 2013 to 2021.
Internet Traffic Analysis at Scale
Anja Feldmann (Max Planck Institut für Informatik)
In this talk, I will use multiple Internet measurement studies as examples to outline the challenges
that we face when performing Internet-scale traffic analysis, including the implications of the
COVID-19 pandemic on Internet traffic as well as detecting IoT devices through the lens of an ISP.
Using this as motivation, I will discuss the challenges of working with network-wide flow data and
correlating such data with other data sets.
Bio: Anja Feldmann received her Ph.D. from Carnegie Mellon University in 1995. She spent the next
four years doing research at AT&T Labs Research, before taking professor positions at Saarland
University, TU Munich, and TU Berlin. From May 2012 to Dec. 2018 she served on the Supervisory Board
of SAP SE. Since the beginning of 2018, Anja has been a director at the Max Planck Institute for
Informatics in Saarbrücken, Germany. She is a member of, among others, the German National Academy
of Sciences (Leopoldina) and the German National Academy of Science and Engineering (acatech).
Her current research interests include Internet measurement, network performance debugging, and network architecture.
The Power of Summarization in Graph Mining and Learning: Smaller Data, Faster Methods, More Interpretability
Danai Koutra (University of Michigan)
Our ability to generate, collect, and archive data related to everyday activities, such as
interacting on social media, browsing the Web, and monitoring well-being, is rapidly increasing.
Getting the most benefit from this large-scale data requires analysis of the patterns it contains,
which is computationally intensive or even intractable. Summarization techniques produce compact
data representations (summaries) that enable faster processing by complex algorithms and queries.
This talk will cover summarization of interconnected data (graphs) [1], which can represent a variety of natural processes (e.g., friendships, communication). I will present an overview of my group’s work on bridging the gap between research on summarized network representations and real-world problems. Examples include summarization of massive knowledge graphs for refinement [2] and on-device querying [3], summarization of graph streams for persistent activity detection [4], and summarization within graph neural networks for fast, interpretable classification [5]. I will conclude with open challenges and opportunities for future research.
[1] Yike Liu, Tara Safavi, Abhilash Dighe, Danai Koutra. Graph Summarization Methods and Applications: A Survey. ACM Comput. Surv. 51(3): 62:1-62:34 (2018).
[2] Caleb Belth, Xinyi Zheng, Jilles Vreeken, Danai Koutra. What is Normal, What is Strange, and What is Missing in a Knowledge Graph: Unified Characterization via Inductive Summarization. WWW 2020: 1115-1126.
[3] Tara Safavi, Caleb Belth, Lukas Faber, Davide Mottin, Emmanuel Müller, Danai Koutra. Personalized Knowledge Graph Summarization: From the Cloud to Your Pocket. ICDM 2019: 528-537.
[4] Caleb Belth, Xinyi Zheng, Danai Koutra. Mining Persistent Activity in Continually Evolving Networks. KDD 2020: 934-944.
[5] Yujun Yan, Jiong Zhu, Marlena Duda, Eric Solarz, Chandra Sripada, Danai Koutra. GroupINN: Grouping-Based Interpretable Neural Network for Classification of Limited, Noisy Brain Data. KDD 2019: 772-782.
Bio: Danai Koutra is an Associate Director of the Michigan Institute for Data Science (MIDAS) and a Morris Wellman Assistant Professor in Computer Science and Engineering at the University of Michigan, where she leads the Graph Exploration and Mining at Scale (GEMS) Lab. Her research focuses on practical and scalable methods for large-scale real networks, and her interests include graph summarization, knowledge graph mining, graph learning, similarity and alignment, and anomaly detection. She has won an NSF CAREER award, an ARO Young Investigator award, the 2020 SIGKDD Rising Star Award, research faculty awards from Google, Amazon, Facebook and Adobe, a Precision Health Investigator award, the 2016 ACM SIGKDD Dissertation award, and an honorable mention for the SCS Doctoral Dissertation Award (CMU). She holds one "rate-1" patent on bipartite graph alignment, and has multiple papers in top data mining conferences, including 8 award-winning papers. She is the Secretary of the new SIAG on Data Science, an Associate Editor of ACM TKDD, and has served multiple times on the organizing committees of all the major data mining conferences. She has worked at IBM, Microsoft Research, and Technicolor. She earned her Ph.D. and M.S. in Computer Science from CMU in 2015 and her diploma in Electrical and Computer Engineering from the National Technical University of Athens in 2010.
Designing Production-Friendly Machine Learning
Matei Zaharia (Stanford and Databricks)
Abstract: Building production ML applications is difficult because of their resource cost and complex failure modes. I’ll discuss these challenges from two perspectives: the Stanford DAWN lab and experience with large-scale commercial ML users at Databricks. I’ll then present two emerging ideas to help address these challenges. The first is “ML platforms”, an emerging class of software systems that standardize the interfaces used in ML applications to make them easier to build and maintain. I’ll give a few examples, including the open source MLflow system from Databricks. The second idea is models that are more “production-friendly” by design. As a concrete example, I will discuss retrieval-oriented NLP models such as Stanford's ColBERT that query documents from an updateable corpus to perform tasks such as question-answering, which gives multiple practical advantages, including low computational cost, high interpretability, and very fast updates to the model’s “knowledge”. These models are an exciting alternative to large language models such as GPT-3.
Bio: Matei Zaharia is an Assistant Professor of Computer Science at Stanford and Cofounder and Chief Technologist at Databricks. He started the Apache Spark open source project during his PhD at UC Berkeley in 2009 and the MLflow open source project at Databricks, and has helped design other widely used data and AI systems software including Delta Lake and Apache Mesos. At Stanford, he is a co-PI of the DAWN lab working on infrastructure for machine learning, data management and cloud computing. Matei’s research was recognized through the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in computer science, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE).
From ML models to Intelligent Applications: The Rise of MLOps
Manasi Vartak (Verta)
Abstract: The last 5+ years in ML have focused on building the best models, hyperparameter optimization, parallel training, massive neural networks, etc. Now that the building of models has become easy, models are being integrated into every piece of software and device, from smart kitchens to radiology to detecting performance of turbines. This shift from training ML models to building intelligent, ML-driven applications has highlighted a variety of problems in going from “a model” to a whole application or business process running on ML. These range from operational challenges (how to package and deploy different types of models using existing SDLC tools and practices) and rethinking what existing abstractions mean for ML (e.g., testing, monitoring, warehouses for ML) to collaboration challenges arising from the disparate skill sets involved in ML product development (DS vs. SWE) and brand-new problems unique to ML (e.g., explainability, fairness, retraining). In this talk, I will discuss the slew of challenges that still exist in operationalizing ML to build intelligent applications, some solutions that the community has adopted, and highlight various open problems that would benefit from the research community’s contributions.
Bio: Manasi Vartak is the founder and CEO of Verta, an MIT spinoff building an MLOps platform for the full ML lifecycle. Verta grew out of Manasi’s Ph.D. work at MIT on ModelDB, the first open-source model management system deployed at Fortune 500 companies. The Verta MLOps platform enables data scientists and ML engineers to robustly take trained ML models through the MLOps cycle including versioning, packaging, release, operations, and monitoring. Previously, Manasi worked on feed ranking at Twitter and dynamic ad-targeting at Google.
Manasi has spoken at several top research and industry conferences such as the Strata O’Reilly Conference, SIGMOD, VLDB, Data & AI Summit, and AnacondaCON, and has authored a course on model management.
Summarizing Patients Like Mine via an On-demand Consultation Service
Nigam Shah (Stanford)
Abstract: Using evidence derived from previously collected medical records to guide patient care has been a long-standing vision of clinicians and informaticians, and one with the potential to transform medical practice. We offered an on-demand consultation service to derive evidence from millions of other patients' data to answer clinician questions and support their bedside decision making. We describe the design and implementation of the service as well as a summary of our experience in responding to the first 100 requests. We will also review a new paradigm for scalable, time-aware clinical data search, and describe the design, implementation, and use of a search engine realizing this paradigm.
Bio: Dr. Nigam Shah is Professor of Medicine (Biomedical Informatics) at Stanford University, and serves as the Associate CIO for Data Science for Stanford Health Care. Dr. Shah's research focuses on bringing machine learning to clinical use safely, ethically and cost-effectively. Dr. Shah was elected into the American College of Medical Informatics (ACMI) in 2015 and was inducted into the American Society for Clinical Investigation (ASCI) in 2016. He holds an MBBS from Baroda Medical College, India, a PhD from Penn State University and completed postdoctoral training at Stanford University.