Systems for Big Data, Data Science, and Machine Learning
The focus of this work is information infrastructures and systems for data science and machine learning, with an emphasis on graph data management and mining, machine-learning systems, cloud data management, data stream analytics, interactive data exploration, uncertain data management, high-performance genomic data processing, and RFID/sensor data management.
- Nextdoor: GPU-based graph sampling for graph machine learning (see the sketch after this list)
- Cloud data analytics and data stream analytics
- Interactive data exploration
- GESALL: Genomic scalable analysis with low latency
- Arabesque: A distributed graph mining system
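To make the graph-sampling idea concrete, here is a minimal sketch of the kind of k-hop neighborhood sampling that systems like Nextdoor accelerate on GPUs. The toy graph and the `sample_neighborhood` function are illustrative assumptions, not Nextdoor's API:

```python
import random

# Toy adjacency-list graph; in a GPU sampler this structure lives in device memory.
graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}

def sample_neighborhood(graph, seed, fanout, depth):
    """k-hop neighbor sampling, as used to build GNN minibatches:
    at each hop, keep at most `fanout` random neighbors per node."""
    frontier, layers = [seed], []
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            neighbors = graph.get(node, [])
            k = min(fanout, len(neighbors))
            next_frontier.extend(random.sample(neighbors, k))
        layers.append(next_frontier)
        frontier = next_frontier
    return layers

print(sample_neighborhood(graph, seed=0, fanout=2, depth=2))
```

Nextdoor's contribution lies in running many such samples in parallel with GPU-friendly memory-access patterns; this sequential version only shows the computation being accelerated.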
Interplay Between Data and Models
In the past few years, research around data management has begun to intertwine with research around simulation, machine learning, and optimization models in novel and interesting ways to support robust decision making under uncertainty. Our research has addressed a variety of challenges arising at the interface of models and data. These include methods for moving stochastic analytics closer to the data, scaling model-based analytics over large data, using data for semi-automatic model creation, feeding data-hungry models when data is sparse, efficiently maintaining models in the face of changing data, overcoming biases in training data, and using data to engender trust in a model.
- Predictive ML model maintenance (see the sketch after this list)
- NIM: Generative neural nets for simulation input modeling
- MaxEnt-MCC: Estimating chronic-disease prevalence from sparse data
- PackageBuilder: In-database optimization for decision support
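As a sketch of the model-maintenance problem, the toy loop below retrains a trivial model whenever its error on fresh data degrades past a threshold; the "model," data stream, and threshold are invented for illustration. Predictive maintenance, as studied in the project above, instead aims to forecast when retraining will be needed rather than react after accuracy has already dropped:

```python
import random

def fit(data):
    """Toy 'model': predict the mean of the training data."""
    return sum(data) / len(data)

def error(model, batch):
    """Mean absolute error of the mean-predictor on a batch."""
    return sum(abs(x - model) for x in batch) / len(batch)

def maintain(stream, window=100, tol=2.0):
    """Reactive maintenance: retrain whenever the error on a new batch
    exceeds `tol`. Predictive maintenance would forecast this drift and
    schedule retraining before quality degrades."""
    model = fit(stream[:window])
    for i in range(window, len(stream), window):
        batch = stream[i:i + window]
        if error(model, batch) > tol:
            model = fit(batch)  # retrain on fresh data
            print(f"retrained at item {i}: new model = {model:.2f}")
    return model

# Data whose mean drifts from 0 to 10 over time.
stream = [random.gauss(10 * t / 2000, 1.0) for t in range(2000)]
maintain(stream)
```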
Usability and Analysis
As data is now a staple in so many aspects of human activity, the audience for data technologies has expanded to include a wide range of users: from non-experts wishing to peruse datasets, to domain experts with specialized data-processing needs. Data systems have not adapted to address these demands effectively: databases’ specialized query languages and structure create barriers for non-experts, while the lack of native support for important computing needs leaves experts to develop application-specific solutions themselves. Our work removes data-use barriers by simplifying data access for non-experts and by augmenting database functionality with advanced problem-solving capabilities, thus moving analytics workflows closer to the data.
- SQuID: Semantic similarity-aware query intent discovery
- SuDocu: Summarizing documents by example
- PackageBuilder: In-database optimization for decision support (see the sketch after this list)
- CoCo: Data understanding and data cleaning
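To illustrate in-database optimization, the sketch below evaluates a "package query" by brute force: it picks a set of tuples that jointly satisfy a global constraint while minimizing total cost. The relation and constants are made up; PackageBuilder expresses such queries declaratively (in the PaQL language) and solves them scalably via integer programming rather than enumeration:

```python
from itertools import combinations

# Hypothetical "meals" relation: (name, calories, price).
meals = [("salad", 250, 4.0), ("pasta", 700, 8.5),
         ("soup", 300, 5.0), ("steak", 900, 15.0), ("fruit", 150, 2.5)]

def package_query(rows, size, max_calories):
    """Brute-force analogue of a package query: among all `size`-tuple
    packages whose total calories stay under `max_calories`, return the
    cheapest one along with its cost."""
    best = None
    for pkg in combinations(rows, size):
        if sum(r[1] for r in pkg) <= max_calories:
            cost = sum(r[2] for r in pkg)
            if best is None or cost < best[0]:
                best = (cost, pkg)
    return best

print(package_query(meals, size=3, max_calories=1500))
```

The point of pushing this logic into the database is to avoid exactly this kind of client-side enumeration, which grows combinatorially with the relation size.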
Provenance, Causality, and Explanations
Data is critical in almost every aspect of society, including education, technology, healthcare, the economy, and science. Poor understanding and handling of data, poor data quality, and errors in data-driven processes are detrimental in all domains that rely on data. The goal of this research is to address these challenges: to develop tools that improve our understanding of data and facilitate the diagnosis of errors, and to extend the capabilities of modern database systems to support complex decision-making and strategy-planning queries.
- DataExposer: Exposing disconnect between data and systems
- ExTuNe: Explaining tuple non-conformance
- AID: Adaptive interventional debugging
- QFix: Diagnosing errors in relational logs
- Data X-Ray: Diagnosing errors in data systems
- Causality in databases (sketched below)
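Here is a minimal sketch of the interventional reasoning behind causality in databases: delete a tuple, re-run the query, and observe whether an unexpected result disappears. The query and anomaly test are invented; the research generalizes this idea to contingency sets, degrees of responsibility, and efficient computation:

```python
def query(rows):
    """Example aggregate query: average of the 'value' field."""
    return sum(r["value"] for r in rows) / len(rows)

def causes_of_anomaly(rows, is_anomalous):
    """Single-tuple interventions: a row is flagged as a counterfactual
    cause if deleting it alone makes the anomaly disappear."""
    return [r for i, r in enumerate(rows)
            if not is_anomalous(query(rows[:i] + rows[i + 1:]))]

rows = [{"id": 1, "value": 10}, {"id": 2, "value": 12},
        {"id": 3, "value": 500}]          # suspicious outlier
anomalous = lambda avg: avg > 50
print(causes_of_anomaly(rows, anomalous))  # -> the outlier tuple
```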
Fairness and Diversity
Data-driven software has the ability to shape human behavior: it affects the products we view and purchase, the news articles we read, the social interactions we engage in, and, ultimately, the opinions we form. Yet, data is an imperfect medium, tainted by errors, omissions, and biases. As a result, discrimination shows up in many data-driven applications, such as advertisements, hotel bookings, image search, and vendor services. Biases in data and software risk forming, propagating, and perpetuating biases in society. Data management research should develop tools to detect, inform, and mitigate the effects of bias, skew, and misuse in data-driven processes.
- Fair min-max data diversification (see the sketch after this list)
- Diverse data selection under fairness constraints
- Evaluation of fair classification
- Trusted machine learning
- Fairness testing
- Fast diverse data retrieval
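As a sketch of fairness-constrained diversification, the greedy routine below repeatedly adds the point farthest from the current selection among groups whose quota is not yet met, aiming to maximize the smallest pairwise distance among selected items. The 1-D distance and the quota interface are simplifications of our own invention; the published algorithms provide approximation guarantees and scale well beyond this brute-force search:

```python
def fair_diversify(points, groups, quotas):
    """Greedy farthest-point selection under per-group quotas:
    each step picks the candidate (from a group with remaining quota)
    that maximizes its minimum distance to the points chosen so far."""
    dist = lambda a, b: abs(a - b)  # 1-D distance, for brevity
    need = dict(quotas)
    sel = [next(i for i, g in enumerate(groups) if need[g] > 0)]
    need[groups[sel[0]]] -= 1
    while any(v > 0 for v in need.values()):
        cand = [i for i in range(len(points))
                if i not in sel and need[groups[i]] > 0]
        best = max(cand,
                   key=lambda i: min(dist(points[i], points[j]) for j in sel))
        sel.append(best)
        need[groups[best]] -= 1
    return sel

points = [0.0, 0.1, 0.5, 0.9, 1.0]
groups = ["a", "a", "b", "b", "a"]
print(fair_diversify(points, groups, {"a": 2, "b": 1}))  # e.g. [0, 4, 2]
```

Without the quotas, this is the classic greedy heuristic for diversification; the fairness constraints change which candidates are eligible at each step and complicate the approximation analysis.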
Private Dissemination and Analysis of Data
The goal of this work is to understand how accurately aggregate properties of a data set can be studied while preserving the privacy of individual participants. Our recent work focuses on complex graph-structured data and trace data; a minimal sketch of the core privacy mechanism appears after the project list. Please see the following project pages for details, publications, and code releases:
- Private dissemination of tabular data
- Private dissemination of social network data
- Private dissemination of communication traces
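As a baseline illustration of the privacy model underlying these projects, here is the classic Laplace mechanism for a differentially private count; the data and `epsilon` value are illustrative only. The projects above tackle much harder settings, such as graphs and traces, where a single participant can influence many records and sensitivity analysis is far less simple:

```python
import math, random

def laplace(scale):
    """Draw Laplace(0, scale) noise via inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def private_count(rows, predicate, epsilon):
    """Laplace mechanism for a counting query: adding or removing one
    individual changes the count by at most 1 (sensitivity 1), so
    Laplace(1/epsilon) noise yields epsilon-differential privacy."""
    return sum(1 for r in rows if predicate(r)) + laplace(1.0 / epsilon)

ages = [23, 35, 41, 58, 62, 71]
print(private_count(ages, lambda a: a >= 40, epsilon=0.5))
```

Smaller `epsilon` means stronger privacy but noisier answers; quantifying that accuracy/privacy trade-off for complex data is the core question of this line of work.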
Privacy, Provenance, and Data Retention
The goal of this work is to achieve the benefits of preserving history — accountability through the ability to audit the past — while avoiding threats to privacy posed by preserved data. Our work has included investigations of database forensics and models for the protection of audit histories. Please see the following project page for details and publications: