VLDB 2022: Invited Scalable Data Science Talks

Chaired by Fatma Ozcan


07 Sep

Sydney Time: 08:00
Scalable Data Science Talk 1

Input Data Processing for Machine Learning
Ana Klimovic (ETH Zurich)

Abstract: Processing input data plays a vital role in ML training. We analyze millions of ML jobs running in Google's fleet and find that the input pipeline --- which is responsible for feeding data-hungry GPUs/TPUs with training examples --- significantly impacts end-to-end training time and cost. Our characterization of input data pipelines motivates several system design explorations, such as disaggregating input data processing from model training and caching commonly recurring input data computation subgraphs. Hence, we present Cachew, a fully managed multi-tenant service for ML data processing that builds on tf.data. Cachew dynamically scales distributed resources for data processing to avoid stalls in training jobs. The service also selectively caches source data and/or preprocessed data to maximize training throughput and minimize cost within and across jobs. Cachew's key contributions are autoscaling and autocaching policies, which optimize training time and cost by leveraging domain-specific metrics collected at data workers and training clients, rather than generic resource utilization metrics. We conclude with a discussion of open research questions for ML input data management and processing.

Bio: Ana Klimovic is an Assistant Professor in the Systems Group of the Computer Science Department at ETH Zurich. Her research interests span operating systems, computer architecture, and their intersection with machine learning. Ana's work focuses on computer system design for large-scale applications such as cloud computing services, data analytics, and machine learning. Before joining ETH in August 2020, Ana was a Research Scientist at Google Brain; she completed her Ph.D. in Electrical Engineering at Stanford University in 2019.
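For readers unfamiliar with tf.data, the kind of input pipeline Cachew manages looks roughly like the sketch below. The file pattern, preprocessing function, and dispatcher address are illustrative placeholders; only the final distribute step reflects the stock tf.data service that Cachew builds on, and Cachew's autoscaling and autocaching policies are not part of this public API.

# A minimal tf.data input pipeline of the kind Cachew manages; paths,
# preprocessing, and the dispatcher address are illustrative placeholders.
import tensorflow as tf

def parse_and_augment(serialized):
    # Hypothetical per-example preprocessing: decode, augment, normalize.
    features = tf.io.parse_single_example(
        serialized,
        {"image": tf.io.FixedLenFeature([], tf.string),
         "label": tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_jpeg(features["image"], channels=3)
    image = tf.image.random_flip_left_right(image)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, features["label"]

ds = (tf.data.Dataset.list_files("/data/train-*.tfrecord")  # placeholder path
      .interleave(tf.data.TFRecordDataset,
                  num_parallel_calls=tf.data.AUTOTUNE)
      .map(parse_and_augment, num_parallel_calls=tf.data.AUTOTUNE)
      .shuffle(10_000)
      .batch(128)
      .prefetch(tf.data.AUTOTUNE))  # overlap preprocessing with training

# The stock tf.data service offloads this pipeline to a pool of remote
# workers; Cachew extends this idea with autoscaling and autocaching.
ds = ds.apply(tf.data.experimental.service.distribute(
    processing_mode="parallel_epochs",
    service="grpc://dispatcher.example:5000"))  # placeholder address

The prefetch and AUTOTUNE settings above illustrate why input pipelines matter: without them, accelerators stall waiting for preprocessed examples, which is exactly the bottleneck the talk quantifies at fleet scale.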

Sydney Time: 08:30
Scalable Data Science Talk 2

Operationalizing Organizational Knowledge for Data-Centric AI
Alex Ratner (Snorkel AI)

Abstract: Today, many of the bottlenecks and long development cycles in AI involve the data that machine learning models are trained on, rather than the models themselves. In this talk, I'll describe and motivate this shift from model-centric to data-centric AI development, and present several lines of recent work on how existing knowledge resources (labels, models, knowledge bases, large language or foundation models, and more) can be used to accelerate the labeling, development, and management of the training data that is at the center of AI success today.

Bio: Alex Ratner is an Assistant Professor of Computer Science at the University of Washington, and Co-founder & CEO of Snorkel AI, Inc. (snorkel.ai). Prior to UW and Snorkel AI, he completed his PhD in CS at Stanford, advised by Christopher Ré, where his research focused on applying data management and statistical learning techniques to emerging machine learning workflows such as creating and managing training data. He holds an AB in Physics from Harvard.
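To give a concrete flavor of operationalizing knowledge as training data, here is a minimal sketch using the open-source Snorkel library's labeling-function API. The spam/ham task, the heuristics, and the toy data are invented for illustration and are not from the talk.

# A minimal weak-supervision sketch with the open-source Snorkel library;
# the task, heuristics, and data below are illustrative, not from the talk.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_link(x):
    # Heuristic: messages containing URLs are often spam.
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_reply(x):
    # Heuristic: very short messages tend to be legitimate replies.
    return HAM if len(x.text.split()) < 5 else ABSTAIN

df_train = pd.DataFrame({"text": [
    "check out http://cheap-pills.example", "thanks, see you then",
    "win money now at http://spam.example", "ok sounds good"]})

# Apply the labeling functions to produce a (noisy) label matrix.
applier = PandasLFApplier(lfs=[lf_contains_link, lf_short_reply])
L_train = applier.apply(df_train)

# Model the accuracies and correlations of the sources, then combine
# their votes into probabilistic labels for downstream model training.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=200, seed=42)
probs = label_model.predict_proba(L_train)

The same pattern scales to the other knowledge resources the abstract mentions: an existing model, a knowledge-base lookup, or a large language model prompt can each serve as one more labeling function feeding the label model.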

Sydney Time: 09:00
Scalable Data Science Talk 3

Distributed Machine Learning for Big Models
Bin Cui (Peking University)

Abstract: Machine/deep learning (ML/DL) systems are important foundations for artificial intelligence and have attracted much attention in academia and industry in recent years. The increasing scale of deep learning models and data poses severe challenges to existing systems, making distributed deep learning systems more and more important. Work at the intersection of ML/DL and systems must attend not only to data characteristics, model structures, training methods, and optimization algorithms, but also to execution concerns in the computing, storage, communication, scheduling, and hardware layers of the system. In this talk, I will introduce the current development of "big models" and then share our efforts on system optimizations for distributed training of big models, as well as our explorations of automated parallel training. Building on these efforts, I will also briefly present our open-source system, Hetu, a new distributed deep learning system for large-scale model training.

Bio: Bin Cui is a Boya Distinguished Professor and Vice Dean of the School of CS at Peking University. His research interests include database systems, big data management and analytics, and ML systems. He has regularly served on the technical program committees of international conferences including SIGMOD, VLDB, ICDE, and KDD. He is the Editor-in-Chief of Data Science and Engineering and serves on the editorial boards of Distributed and Parallel Databases, the Journal of Computer Science and Technology, and SCIENCE CHINA Information Sciences; he was an associate editor of IEEE Transactions on Knowledge and Data Engineering (TKDE) and the VLDB Journal, and a Trustee Board Member of the VLDB Endowment. He serves as Vice Chair of the Technical Committee on Databases of the China Computer Federation (CCF). He received the Microsoft Young Professorship Award (MSRA 2008), the CCF Young Scientist Award (2009), and the Second Prize of the Natural Science Award of MOE China (2014), and was appointed a Cheung Kong Distinguished Professor by MOE China in 2016.
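Hetu's own API is not shown in the abstract, so as a stand-in the sketch below uses PyTorch's DistributedDataParallel to illustrate the basic data-parallel pattern that systems like Hetu automate and optimize; the model, data, and hyperparameters are placeholders, and big-model training additionally layers on tensor and pipeline parallelism that this sketch omits.

# A minimal data-parallel training sketch in PyTorch (not Hetu's API);
# launch with torchrun so one process drives each GPU.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train():
    dist.init_process_group("nccl")  # rank/world size come from torchrun env vars
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")

    model = torch.nn.Linear(1024, 10).to(device)   # toy model placeholder
    model = DDP(model, device_ids=[device.index])  # all-reduces gradients each step
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        x = torch.randn(32, 1024, device=device)        # stand-in for a data shard
        y = torch.randint(0, 10, (32,), device=device)  # stand-in labels
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()  # DDP overlaps gradient communication with backprop
        opt.step()

if __name__ == "__main__":
    train()

Automated parallel training, as explored in the talk, goes a step further: instead of the user choosing data, tensor, or pipeline parallelism by hand as above, the system searches for a parallelization strategy suited to the model and cluster.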