VLDB 2014 has accepted five tutorials, listed below.
Systems for Big Graphs (3 hours)
Arijit Khan, Sameh Elnikety
Uncertain Entity Resolution (1.5 hours)
Avigdor Gal
Knowledge Bases in the Age of Big Data Analytics (3 hours)
Fabian Suchanek, Gerhard Weikum
Causality and Explanations in Databases (1.5 hours)
Alexandra Meliou, Sudeepa Roy, Dan Suciu
Enterprise Search in the Big Data Era (3 hours)
Yunyao Li, Ziyang Liu, Huaiyu Zhu
Systems for Big Graphs
Large-scale, highly-interconnected networks pervade our society and the natural world around us.
Graphs model such complex structures and schema-less data, including the World Wide Web,
social networks, knowledge graphs, genome and scientific databases, e-commerce, and medical and
government records. Graph processing poses interesting system challenges: a graph models entities and
their relationships, which are usually irregular and unstructured; therefore, the computation and
data access patterns exhibit poor locality. Although several disk-based graph-indexing techniques have
been proposed for specific graph operations, they still cannot provide the level of efficient random
access required by graph computation. On the other hand, the scale of graph data easily overwhelms
the main memory and computation resources on commodity servers. Today's big-graphs consist of
millions of vertices and billions of edges. In these cases, achieving low latency and high throughput
requires partitioning the graph and processing the graph data in parallel across a cluster of servers.
However, the software and hardware advances that have worked well for developing parallel databases
and scientific applications are not necessarily effective for graph problems. Hence, the last few years have
seen unprecedented interest in building systems for big-graphs in various communities, including
databases, systems, semantic web, and machine learning. In this tutorial, we discuss the design of these
emerging systems for processing big-graphs, key features of distributed graph algorithms, as well as
graph partitioning and workload balancing techniques. We discuss the current challenges and highlight
some future research directions.
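The vertex-centric, superstep-based programming model popularized by Pregel-style systems is one of the key ideas behind the systems this tutorial surveys. As a flavor of that model, here is a minimal single-machine sketch of superstep-based PageRank; the graph, damping factor, and superstep count are illustrative, and real systems run the same per-vertex loop partitioned across a cluster of servers.

```python
# Minimal single-machine sketch of vertex-centric (Pregel-style) PageRank.
# Each superstep: every vertex sends rank/out_degree to its neighbors,
# then updates its own rank from the messages it received.

def vertex_centric_pagerank(edges, num_supersteps=20, damping=0.85):
    # Build the adjacency list and collect the vertex set.
    out_edges = {}
    vertices = set()
    for src, dst in edges:
        out_edges.setdefault(src, []).append(dst)
        vertices.update((src, dst))

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}

    for _ in range(num_supersteps):
        # "Message" phase: each vertex distributes its rank to neighbors.
        inbox = {v: 0.0 for v in vertices}
        for v in vertices:
            targets = out_edges.get(v, [])
            if targets:
                share = rank[v] / len(targets)
                for t in targets:
                    inbox[t] += share
        # "Compute" phase: each vertex updates from its inbox.
        rank = {v: (1 - damping) / n + damping * inbox[v] for v in vertices}
    return rank

ranks = vertex_centric_pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])
```

In a distributed engine, the vertex set is partitioned across servers and the inbox updates become network messages exchanged between supersteps, which is exactly where the partitioning and workload-balancing techniques discussed in the tutorial matter.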
Arijit Khan and Sameh Elnikety
Arijit Khan is a postdoctoral researcher in the Systems group at ETH Zurich. His research interests
span big data, big-graphs, and graph systems. He received his PhD from the
Department of Computer Science, University of California, Santa Barbara. Arijit is the recipient of the
prestigious IBM PhD Fellowship in 2012-13. He co-presented a tutorial on emerging queries over
linked data at ICDE 2012.
Sameh Elnikety is a researcher at Microsoft Research in Redmond, Washington. He received his Ph.D.
from the Swiss Federal Institute of Technology (EPFL) in Lausanne, Switzerland, and his M.S. from Rice
University in Houston, Texas. His research interests include distributed server systems and database
systems. Sameh’s work on database replication received the best paper award at Eurosys 2007.
Uncertain Entity Resolution
Entity resolution is a fundamental problem in data integration, dealing with the combination of data from different sources into a unified view of the data. Entity resolution is inherently an uncertain process, because the decision to map a set of records to the same entity cannot be made with certainty unless the records are identical in all of their attributes or share a common key. In light of recent advances in the data accumulation, management, and analytics landscape (known as big data), this tutorial re-evaluates the entity resolution process and, in particular, looks at the best ways to handle data veracity. The tutorial ties entity resolution to recent advances in probabilistic database research, focusing on sources of uncertainty in the entity resolution process. We shall discuss which types of uncertainty have been handled in the literature and suggest new methods for coping with various types of uncertainty, some of which are presented as future challenges.
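The uncertainty at the heart of this tutorial can be made concrete with a toy sketch: pairwise attribute similarity yields a match probability rather than a certain yes/no decision, and only identical records match with certainty. The records, similarity measure, and thresholds below are illustrative, not a method from the tutorial.

```python
# Toy sketch: pairwise entity resolution with an uncertain (probabilistic)
# match decision. Records match with certainty only when all attributes
# agree; otherwise we retain a match probability.

from difflib import SequenceMatcher

def attribute_similarity(a, b):
    # Normalized string similarity in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_probability(rec1, rec2, attrs):
    # Average attribute similarity, used as a crude match probability.
    sims = [attribute_similarity(rec1[a], rec2[a]) for a in attrs]
    return sum(sims) / len(sims)

def resolve(rec1, rec2, attrs, hi=0.9, lo=0.5):
    p = match_probability(rec1, rec2, attrs)
    if p >= hi:
        return ("match", p)
    if p <= lo:
        return ("non-match", p)
    return ("uncertain", p)    # deferred decision: keep the probability

r1 = {"name": "Jon Smith", "city": "New York"}
r2 = {"name": "John Smith", "city": "New York"}
decision, p = resolve(r1, r2, ["name", "city"])
```

A probabilistic database would carry the middle ("uncertain") band forward as a distribution over possible worlds instead of forcing a hard decision at resolution time.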
Avigdor Gal is an associate professor at the Faculty of Industrial Engineering & Management, Technion. He is an expert on information systems. His research focuses on effective methods of integrating data from multiple and diverse sources, which affect the way businesses and consumers seek information over the Internet.
His current work zeroes in on data integration: the task of providing communication between databases, and connecting such communication to real-world concepts. Another line of research involves the identification of complex events, such as flu epidemics, biological attacks, and breaches in computer security, and its application to disaster and crisis management. He has applied his research in European and American projects in government, eHealth, and the integration of business documents.
Prof. Gal has published more than 100 papers in leading professional journals (e.g. Journal of the ACM (JACM), ACM Transactions on Database Systems (TODS), IEEE Transactions on Knowledge and Data Engineering (TKDE), ACM Transactions on Internet Technology (TOIT), and the VLDB Journal) and conferences (ICDE, BPM, DEBS, ER, CoopIS) and books (Schema Matching and Mapping). He authored the book Uncertain Schema Matching in 2011, serves in various editorial capacities for periodicals including the Journal on Data Semantics (JoDS), Encyclopedia of Database Systems and Computing, and has helped organize professional workshops and conferences nearly every year since 1998.
He won the IBM Faculty Award each year from 2002 to 2004, several Technion awards for teaching, the 2011-13 Technion-Microsoft Electronic Commerce Research Award, and the 2012 Yanai Award for Excellence in Academic Education, among other honors.
Knowledge Bases in the Age of Big Data Analytics
The proliferation of knowledge-sharing communities such as Wikipedia and the progress in scalable information extraction from Web and text sources have enabled the automatic construction of very large knowledge bases. Recent endeavors of this kind include academic research projects such as DBpedia, KnowItAll, Probase, ReadTheWeb, and YAGO, as well as industrial ones such as Freebase. These projects provide automatically constructed knowledge bases of facts about named entities, their semantic classes, and their mutual relationships. They usually contain millions of entities and hundreds of millions of facts about them. Such world knowledge in turn enables cognitive applications and knowledge-centric services like disambiguating natural-language text, deep question answering, and semantic search for entities and relations in Web and enterprise data. Prominent examples of how knowledge bases can be harnessed include the Google Knowledge Graph and the IBM Watson question answering system. This tutorial presents state-of-the-art methods, recent advances, research opportunities, and open challenges along this avenue of knowledge harvesting and its applications. Particular emphasis will be on the twofold role of knowledge bases for big-data analytics: using scalable distributed algorithms for harvesting knowledge from Web and text sources, and leveraging entity-centric knowledge for deeper interpretation of and better intelligence with big data.
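The "facts about named entities, their semantic classes, and their mutual relationships" that these projects produce are commonly stored as subject-predicate-object triples. A toy sketch of that representation, in the spirit of YAGO/DBpedia-style resources (the facts and query API below are illustrative):

```python
# Toy sketch of an entity-centric knowledge base as subject-predicate-object
# triples, with a simple wildcard pattern query over them.

triples = [
    ("Marie_Curie", "type", "Physicist"),
    ("Marie_Curie", "bornIn", "Warsaw"),
    ("Marie_Curie", "won", "Nobel_Prize_in_Physics"),
    ("Warsaw", "locatedIn", "Poland"),
]

def query(subject=None, predicate=None, obj=None):
    # None acts as a wildcard in any position of the triple pattern.
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# "What do we know about Marie Curie?"
facts = query(subject="Marie_Curie")
```

Semantic search and question answering over such a store amount to translating natural-language requests into triple patterns like these, at the scale of hundreds of millions of facts.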
Fabian Suchanek and Gerhard Weikum
Fabian M. Suchanek is an associate professor at Télécom ParisTech in Paris, France. He obtained his PhD at the Max Planck Institute for Informatics in 2008, which earned him an honorable mention for the ACM SIGMOD Jim Gray Dissertation Award. He was later a postdoc at Microsoft Research Search Labs in Silicon Valley (in the group of Rakesh Agrawal) and in the WebDam team at INRIA Saclay, France (in the group of Serge Abiteboul), and led an independent Otto Hahn Research Group funded by the Max Planck Society. Fabian is the main architect of the YAGO ontology, one of the largest public knowledge bases.
Gerhard Weikum is a scientific director at the Max Planck Institute for Informatics in Saarbruecken, Germany, where he is leading the department on databases and information systems. He co-authored a comprehensive textbook on transactional systems, received the VLDB 10-Year Award for his work on automatic DB tuning, and is one of the creators of the YAGO knowledge base. Gerhard is an ACM Fellow, a member of several scientific academies in Germany and Europe, and a recipient of a Google Focused Research Award, an ACM SIGMOD Contributions Award, and an ERC Synergy Grant.
Causality and Explanations in Databases
With the surge in the availability of information, there is a great demand
for tools that assist users in understanding their data. While today's
exploration tools rely mostly on data visualization, users often want to go
deeper and understand the underlying causes of a particular observation.
This tutorial surveys research on causality and explanation for data-oriented
applications. Over the last few years there have been several efforts
in the Database and AI communities to develop general techniques to model
causes for observations on data, starting with Judea Pearl's seminal book
"Causality". Causality has been formalized both for AI applications and for
database queries, and formal definitions of "explanations" have also been
proposed in the database literature. In this tutorial we will review and summarize
the research thus far into causality and explanation in the database and
AI communities, giving researchers a snapshot of the current state of the art on
this topic, and propose directions for future research. We will cover both the
theory of causality/explanation and some applications. We also discuss the
connections with other topics in database research like provenance,
deletion propagation, why-not queries, and OLAP techniques.
The tutorial is aimed at a broad audience in the database community,
including active researchers in data management, graduate students
seeking a new research topic, and industry practitioners looking for
a preview of plausible future directions for data analysis tools.
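The counterfactual notion of cause surveyed here can be illustrated with a toy sketch: an input tuple is a counterfactual cause of a query answer if deleting it (an "intervention" on the database) makes the answer disappear. The query and data below are illustrative, not from the tutorial.

```python
# Toy sketch of counterfactual causes in databases: tuple t is a
# counterfactual cause of a Boolean query answer if the query holds on
# the full database but not after deleting t.

def query(tuples):
    # Boolean query: is there an employee in "Sales" earning over 100k?
    return any(dept == "Sales" and salary > 100_000
               for _, dept, salary in tuples)

def counterfactual_causes(tuples):
    causes = []
    for t in tuples:
        remaining = [u for u in tuples if u != t]
        if query(tuples) and not query(remaining):
            causes.append(t)
    return causes

data = [("alice", "Sales", 120_000),
        ("bob", "Sales", 90_000),
        ("carol", "HR", 130_000)]
```

Here only the tuple for alice is a counterfactual cause: deleting it flips the query from true to false, while deleting bob's or carol's tuple leaves the answer intact. This deletion-based view is also what connects causality to deletion propagation and why-not queries, mentioned above.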
Alexandra Meliou, Sudeepa Roy and Dan Suciu
Alexandra Meliou is an Assistant Professor in the School of Computer
Science at the University of Massachusetts. She received her PhD and Master's
degrees from the University of California, Berkeley, in 2009 and 2005,
respectively. She completed her postdoctoral work in 2012 at the University of
Washington. She has made contributions to the areas of provenance, causality
in databases, data cleaning, and sensor networks. She received a 2008 Siebel
Scholarship, a 2012 SIGMOD Best Demo Award, and a 2013 Google Faculty Award.
Sudeepa Roy is a Postdoctoral Researcher in Computer Science
at the University of Washington. Her current research focuses
on theory and applications of causality/explanations in databases.
She has also worked on provenance in databases and workflows,
probabilistic databases, information extraction, and crowd sourcing.
During her doctoral studies at the University of Pennsylvania,
she was a recipient of the Google PhD Fellowship in structured data.
Dan Suciu is a Professor in Computer Science at the University of
Washington. He has made contributions to semistructured data, data
privacy, probabilistic databases, and causality/explanations in
databases. He is a Fellow of the ACM, holds twelve US patents, and
received best paper awards at SIGMOD 2000 and ICDT 2013, the ACM PODS
Alberto Mendelzon Test of Time Award in 2010 and 2012, and the 10-Year
Most Influential Paper Award at ICDE 2013; he is also a recipient of
the NSF CAREER Award and an Alfred P. Sloan Fellowship. He has given
past tutorials at VLDB and SIGMOD (on semistructured data and XML, and
on probabilistic databases).
Enterprise Search in the Big Data Era
Enterprise search allows users in an enterprise to retrieve
desired information through a simple search interface.
It is widely viewed as an important productivity
tool within an enterprise. While Internet search
engines have been highly successful, enterprise search
remains notoriously difficult due to a variety of
unique challenges, and is made more so by the
increasing heterogeneity and volume of enterprise data.
On the other hand, enterprise search also presents opportunities
to succeed in ways beyond current Internet
search capabilities. This tutorial presents an organized
overview of these challenges and opportunities, and reviews
the state-of-the-art techniques for building a reliable,
high-quality enterprise search engine in the
context of the rise of big data.
Yunyao Li, Ziyang Liu and Huaiyu Zhu
Yunyao Li is a researcher at IBM Research - Almaden. She has broad interests across multiple disciplines, most notably databases, natural language processing, human-computer interaction, information retrieval, and machine learning. Her current research focuses on enterprise search and scalable declarative text analytics for enterprise applications. She is the owner of several key components in the search engine that is currently powering IBM intranet search. She received her PhD degree in Computer Science and Engineering from the University of Michigan, Ann Arbor in 2007.
Ziyang Liu is a researcher in the Data Management department at NEC Laboratories America. His research interests span several topics in data management, including efficient and iterative big data analytics, data pricing, multitenant databases, data usability, and effectively searching structured data with keywords. He received his Ph.D. from the School of Computing, Informatics, and Decision Systems Engineering at Arizona State University in 2011. He also received his B.S. degree in computer engineering from Harbin Institute of Technology, China, in 2006.
Huaiyu Zhu is with IBM Research - Almaden. He received his PhD degree in Computational Mathematics and Statistics from Liverpool University. His research interests include statistical and machine learning techniques in data mining applications, especially in text analytics and large-scale enterprise applications. In the past several years, his main research focus has been enterprise search.