2021
Fei Song, Khaled Zaouk, Chenghao Lyu, Arnab Sinha, Qi Fan, Yanlei Diao, Prashant J. Shenoy. Spark-based Cloud Data Analytics using Multi-Objective Optimization. In 37th IEEE International Conference on Data Engineering (ICDE 2021), Chania, Greece, April 19-22, 2021, pp. 396–407, IEEE, 2021. DOI: 10.1109/ICDE51399.2021.00041. PDF: http://scalla.cs.umass.edu/papers/udao2020.pdf

Data analytics in the cloud has become an integral part of enterprise businesses. Big data analytics systems, however, still lack the ability to take task objectives such as user performance goals and budgetary constraints and automatically configure an analytic job to achieve those objectives. This paper presents UDAO, a Spark-based Unified Data Analytics Optimizer that can automatically determine a cluster configuration with a suitable number of cores, as well as other system parameters, that best meets the task objectives. At the core of our work is a principled multi-objective optimization (MOO) approach that computes a Pareto-optimal set of configurations to reveal tradeoffs between different objectives, recommends a new Spark configuration that best explores such tradeoffs, and employs novel optimizations to enable such recommendations within a few seconds. Detailed experiments using benchmark workloads show that our MOO techniques provide a 2-50x speedup over existing MOO methods, while offering good coverage of the Pareto frontier. Compared to Ottertune, a state-of-the-art performance tuning system, UDAO recommends Spark configurations that yield a 26%-49% reduction in the running time of the TPCx-BB benchmark while adapting to different user preferences on multiple objectives.
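The Pareto-optimal set mentioned in the abstract is simply the subset of candidate configurations not dominated on any objective. As a minimal illustration of that idea only (not the paper's MOO algorithms), the sketch below filters hypothetical (latency, cost) measurements for a few made-up Spark configurations down to their Pareto frontier.

```python
# Minimal sketch of a Pareto-frontier filter over hypothetical (latency, cost)
# measurements for candidate Spark configurations. Both objectives are minimized;
# a configuration is Pareto-optimal if no other is at least as good on both
# objectives and strictly better on at least one.

def pareto_front(points):
    front = []
    for name, lat, cost in points:
        dominated = any(
            l2 <= lat and c2 <= cost and (l2 < lat or c2 < cost)
            for _, l2, c2 in points
        )
        if not dominated:
            front.append((name, lat, cost))
    return front

# Hypothetical measurements (latency in seconds, cost in dollars).
candidates = [
    ("8 cores / 16 GB",  120.0, 0.40),
    ("16 cores / 32 GB",  70.0, 0.75),
    ("32 cores / 64 GB",  55.0, 1.50),
    ("16 cores / 16 GB",  90.0, 0.80),   # dominated by the 16 cores / 32 GB point
]

for cfg in pareto_front(candidates):
    print(cfg)
```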
Zafeiria Moumoulidou, Andrew McGregor, Alexandra Meliou. Diverse Data Selection under Fairness Constraints. In International Conference on Database Theory (ICDT), pp. 11:1–11:25, 2021. PDF: https://arxiv.org/pdf/2010.09141.pdf

Diversity is an important principle in data selection and summarization, facility location, and recommendation systems. Our work focuses on maximizing diversity in data selection, while offering fairness guarantees. In particular, we offer the first study that augments the Max-Min diversification objective with fairness constraints. More specifically, given a universe U of n elements that can be partitioned into m disjoint groups, we aim to retrieve a k-sized subset that maximizes the pairwise minimum distance within the set (diversity) and contains a pre-specified number k_i of elements from each group i (fairness). We show that this problem is NP-complete even in metric spaces, and we propose three novel algorithms, linear in n, that provide strong theoretical approximation guarantees for different values of m and k. Finally, we extend our algorithms and analysis to the case where groups can be overlapping.
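For intuition only, here is a hedged sketch of a greedy max-min diversification heuristic with per-group quotas: it repeatedly picks, among groups that still have remaining quota, the element farthest from the current selection. This is not one of the paper's three algorithms and carries no approximation guarantee; the elements, distances, and quotas below are hypothetical.

```python
import math

def fair_greedy_max_min(points, quotas, dist):
    """Greedy heuristic: pick sum(quotas) elements, honoring per-group quotas,
    always adding the eligible element farthest from the already-selected set."""
    remaining = dict(quotas)                      # group -> how many still needed
    seed = next(p for p in points if remaining[p["group"]] > 0)
    selected = [seed]
    remaining[seed["group"]] -= 1
    while any(r > 0 for r in remaining.values()):
        eligible = [p for p in points
                    if p not in selected and remaining[p["group"]] > 0]
        # Farthest-point choice: maximize the minimum distance to the selection.
        best = max(eligible, key=lambda p: min(dist(p, s) for s in selected))
        selected.append(best)
        remaining[best["group"]] -= 1
    return selected

# Hypothetical 2-D elements in two groups, with quotas k_A = 2 and k_B = 1.
points = [
    {"name": "a1", "group": "A", "x": 0.0, "y": 0.0},
    {"name": "a2", "group": "A", "x": 5.0, "y": 0.0},
    {"name": "a3", "group": "A", "x": 1.0, "y": 1.0},
    {"name": "b1", "group": "B", "x": 0.0, "y": 5.0},
    {"name": "b2", "group": "B", "x": 0.5, "y": 0.5},
]
euclid = lambda p, q: math.hypot(p["x"] - q["x"], p["y"] - q["y"])
print([p["name"] for p in fair_greedy_max_min(points, {"A": 2, "B": 1}, euclid)])
```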
Anna Fariha, Ashish Tiwari, Arjun Radhakrishna, Sumit Gulwani, Alexandra Meliou. Conformance Constraint Discovery: Measuring Trust in Data-Driven Systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2021. DOI: 10.1145/3448016.3452795. PDF: https://afariha.github.io/papers/Conformance_Constraints_SIGMOD_2021.pdf

The reliability of inferences made by data-driven systems hinges on the data's continued conformance to the systems' initial settings and assumptions. When serving data (on which we want to apply inference) deviates from the profile of the initial training data, the outcome of inference becomes unreliable. We introduce conformance constraints, a new data profiling primitive tailored towards quantifying the degree of non-conformance, which can effectively characterize whether inference over a tuple is untrustworthy. Conformance constraints are constraints over certain arithmetic expressions (called projections) involving the numerical attributes of a dataset, which existing data profiling primitives such as functional dependencies and denial constraints cannot model. Our key finding is that projections that incur low variance on a dataset construct effective conformance constraints. This principle yields the surprising result that low-variance components of a principal component analysis, which are usually discarded for dimensionality reduction, generate stronger conformance constraints than the high-variance components. Based on this result, we provide a highly scalable and efficient technique, linear in data size and cubic in the number of attributes, for discovering conformance constraints for a dataset. To measure the degree of a tuple's non-conformance with respect to a dataset, we propose a quantitative semantics that captures how much a tuple violates the conformance constraints of that dataset. We demonstrate the value of conformance constraints on two applications: trusted machine learning and data drift. We empirically show that conformance constraints offer mechanisms to (1) reliably detect tuples on which the inference of a machine-learned model should not be trusted, and (2) quantify data drift more accurately than the state of the art.
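To make the low-variance-projection idea concrete, here is a hedged numpy sketch (not the paper's discovery algorithm or its quantitative semantics): it fits PCA on a training table, keeps the lowest-variance principal direction, bounds the training data's projections onto it, and flags a test tuple whose projection falls outside those bounds. The data, attributes, and thresholds are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data with an implicit linear relationship:
# duration is roughly arrival - departure, plus small noise.
departure = rng.uniform(0, 10, 500)
arrival = departure + rng.uniform(1, 2, 500)
duration = arrival - departure + rng.normal(0, 0.01, 500)
train = np.column_stack([departure, arrival, duration])

# PCA via the covariance matrix of the centered data (eigenvalues ascending).
mean = train.mean(axis=0)
centered = train - mean
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))

# The lowest-variance direction yields a tight linear constraint:
# projections of conforming tuples onto it stay within a narrow band.
low_var_dir = eigvecs[:, :1]
proj = centered @ low_var_dir
lo, hi = proj.min(axis=0), proj.max(axis=0)   # bounds observed on training data

def violates(tuple_):
    """True if the tuple's low-variance projection falls outside the training bounds."""
    p = (np.asarray(tuple_) - mean) @ low_var_dir
    return bool(np.any((p < lo) | (p > hi)))

print(violates([3.0, 4.5, 1.5]))   # conforming: duration matches arrival - departure
print(violates([3.0, 4.5, 9.0]))   # non-conforming: breaks the implicit relationship
```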
Anna Fariha, Ashish Tiwari, Alexandra Meliou, Arjun Radhakrishna, Sumit Gulwani. CoCo: Interactive Exploration of Conformance Constraints for Data Understanding and Data Cleaning. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2021. DOI: 10.1145/3448016.3452750. PDF: https://afariha.github.io/papers/CoCo_SIGMOD_2021_Demo.pdf

Data profiling refers to the task of extracting technical metadata or profiles and has numerous applications such as data understanding, validation, integration, and cleaning. While a number of data profiling primitives exist in the literature, most of them are limited to categorical attributes. A few techniques consider numerical attributes, but they either focus on simple relationships involving a pair of attributes (e.g., correlations) or convert the continuous semantics of numerical attributes to a discrete semantics, which results in information loss. To capture more complex relationships involving the numerical attributes, we developed a new data-profiling primitive called conformance constraints, which can model linear arithmetic relationships involving multiple numerical attributes. We present CoCo, a system that allows interactive discovery and exploration of conformance constraints for understanding trends involving the numerical attributes of a dataset, with a particular focus on the application of data cleaning. Through a simple interface, CoCo enables the user to guide conformance constraint discovery according to their preferences. The user can examine to what extent a new, possibly dirty, dataset satisfies or violates the discovered conformance constraints. Further, CoCo provides useful suggestions for cleaning dirty data tuples: the user can interactively alter cell values and verify the effect by checking the change in conformance constraint violation due to the alteration. We demonstrate how CoCo can help in understanding trends in the data and assist users in interactive data cleaning, using conformance constraints.
2020
Muhammad Bilal, Marco Serafini, Marco Canini, Rodrigo Rodrigues. Do the Best Cloud Configurations Grow on Trees? An Experimental Evaluation of Black Box Algorithms for Optimizing Cloud Workloads. Proc. VLDB Endow., 13(11), pp. 2563–2575, 2020. PDF: http://www.vldb.org/pvldb/vol13/p2563-bilal.pdf
Matteo Brucato, Miro Mannino, Azza Abouzied, Peter J. Haas, Alexandra Meliou. sPaQLTooLs: A Stochastic Package Query Interface for Scalable Constrained Optimization. Proc. VLDB Endow., 13(12), pp. 2881–2884, 2020. PDF: http://www.vldb.org/pvldb/vol13/p2881-brucato.pdf

Everyone needs to make decisions under uncertainty and with limited resources, e.g., an investor who is building a stock portfolio subject to an investment budget and a bounded risk tolerance. Doing this with current technology is hard. There is a disconnect between software tools for data management, stochastic predictive modeling (e.g., simulation of future stock prices), and optimization; this leads to cumbersome analytical workflows. Moreover, current methods do not scale. To handle a broad class of uncertainty models, analysts approximate the original stochastic optimization problem by a large deterministic optimization problem that incorporates many "scenarios", i.e., sample realizations of the uncertain data values. For large problems, a huge number of scenarios is required, often causing the solver to fail. We demonstrate sPaQLTooLs, a system for in-database specification and scalable solution of constrained optimization problems. The key ingredients are (i) a database-oriented specification of constrained stochastic optimization problems as "stochastic package queries" (SPQs), (ii) use of a Monte Carlo database to incorporate stochastic predictive models, and (iii) a new SummarySearch algorithm for scalably solving SPQs with approximation guarantees. In this demonstration, the attendees will experience first-hand the difficulty of manually constructing feasible and high-quality portfolios, using real-world stock market data. We will then demonstrate how SummarySearch can easily and efficiently help them find very good portfolios, while being orders of magnitude faster than prior methods.
Anna Fariha, Matteo Brucato, Peter J. Haas, Alexandra Meliou. SuDocu: Summarizing Documents by Example. Proc. VLDB Endow., 13(12), pp. 2861–2864, 2020. PDF: http://www.vldb.org/pvldb/vol13/p2861-fariha.pdf

Text document summarization refers to the task of producing a brief representation of a document for easy human consumption. Existing text summarization techniques mostly focus on generic summarization, but users often require personalized summarization that targets their specific preferences and needs. However, precisely expressing preferences is challenging, and current methods are often ambiguous, outside the user's control, or require costly training data. We propose a novel and effective way to express summarization intent (preferences) via examples: the user provides a few example summaries for a small number of documents in a collection, and the system summarizes the rest. We demonstrate SuDocu, an example-based personalized document summarization system. Through a simple interface, SuDocu allows the users to provide example summaries, learns the summarization intent from the examples, and produces summaries for new documents that reflect the user's summarization intent. SuDocu further explains the captured summarization intent in the form of a package query, an extension of a traditional SQL query that handles complex constraints and preferences over answer sets. SuDocu combines topic modeling, semantic similarity discovery, and in-database optimization in a novel way to achieve example-driven document summarization. We demonstrate how SuDocu can detect complex summarization intents from a few example summaries and produce accurate summaries for new documents effectively and efficiently.
Dan Zhang, Yoshihiko Suhara, Jinfeng Li, Madelon Hulsebos, Çağatay Demiralp, Wang-Chiew Tan. Sato: Contextual Semantic Type Detection in Tables. Proc. VLDB Endow., 13(11), pp. 1835–1848, 2020. PDF: http://www.vldb.org/pvldb/vol13/p1835-zhang.pdf

Detecting the semantic types of data columns in relational tables is important for various data preparation and information retrieval tasks such as data cleaning, schema matching, data discovery, and semantic search. However, existing detection approaches either perform poorly with dirty data, support only a limited number of semantic types, fail to incorporate the table context of columns, or rely on large sample sizes for training data. We introduce Sato, a hybrid machine learning model to automatically detect the semantic types of columns in tables, exploiting the signals from the table context as well as the column values. Sato combines a deep learning model trained on a large-scale table corpus with topic modeling and structured prediction to achieve support-weighted and macro average F1 scores of 0.925 and 0.735, respectively, exceeding the state-of-the-art performance by a significant margin. We extensively analyze the overall and per-type performance of Sato, discussing how individual modeling components, as well as feature categories, contribute to its performance.
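The two reported metrics aggregate per-type F1 differently: macro-average F1 weights every semantic type equally, while support-weighted F1 weights each type by how many columns carry it. A small illustrative computation (per-type scores and supports are hypothetical, not numbers from the paper):

```python
# Hypothetical per-type F1 scores and column counts (supports).
per_type = {
    "name": {"f1": 0.95, "support": 800},
    "city": {"f1": 0.90, "support": 500},
    "isbn": {"f1": 0.40, "support": 20},   # a rare type drags the macro average down
}

macro_f1 = sum(t["f1"] for t in per_type.values()) / len(per_type)
total_support = sum(t["support"] for t in per_type.values())
weighted_f1 = sum(t["f1"] * t["support"] for t in per_type.values()) / total_support

print(f"macro-average F1:    {macro_f1:.3f}")     # 0.750
print(f"support-weighted F1: {weighted_f1:.3f}")  # ~0.923
```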
Xiangyao Yu, Matt Youill, Matthew E. Woicik, Abdurrahman Ghanem, Marco Serafini, Ashraf Aboulnaga, Michael Stonebraker. PushdownDB: Accelerating a DBMS Using S3 Computation. In 36th IEEE International Conference on Data Engineering (ICDE 2020), Dallas, TX, USA, April 20-24, 2020, pp. 1802–1805, 2020. DOI: 10.1109/ICDE48307.2020.00174. PDF: https://marcoserafini.github.io/papers/pushdown.pdf

This paper studies the effectiveness of pushing parts of DBMS analytics queries into the Simple Storage Service (S3) of Amazon Web Services (AWS), using a recently released capability called S3 Select. We show that some DBMS primitives (filter, projection, and aggregation) can always be cost-effectively moved into S3. Other more complex operations (join, top-K, and group-by) require reimplementation to take advantage of S3 Select and are often candidates for pushdown. We demonstrate these capabilities through experimentation using a new DBMS that we developed, PushdownDB. Experimentation with a collection of queries including TPC-H queries shows that PushdownDB is on average 30% cheaper and 6.7× faster than a baseline that does not use S3 Select.
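S3 Select, the capability the paper builds on, lets a client push a filtering and projecting SQL expression to S3 so that only qualifying bytes leave the storage service. A hedged boto3 sketch of that kind of pushdown (bucket, key, and column positions are hypothetical; PushdownDB's own operators are layered on top of calls like this, not shown here):

```python
import boto3

s3 = boto3.client("s3")

# Push a filter + projection down to S3: only matching rows cross the network.
response = s3.select_object_content(
    Bucket="my-tpch-bucket",              # hypothetical bucket
    Key="lineitem.csv",                   # hypothetical object
    ExpressionType="SQL",
    Expression="SELECT s._1, s._5 FROM S3Object s WHERE CAST(s._5 AS FLOAT) > 40.0",
    InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
    OutputSerialization={"CSV": {}},
)

# The result arrives as an event stream; Records events carry the selected bytes.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```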
Xiaowei Zhu, Marco Serafini, Xiaosong Ma, Ashraf Aboulnaga, Wenguang Chen, Guanyu Feng. LiveGraph: A Transactional Graph Storage System with Purely Sequential Adjacency List Scans. Proc. VLDB Endow., 13(7), pp. 1020–1034, 2020. PDF: http://www.vldb.org/pvldb/vol13/p1020-zhu.pdf

The specific characteristics of graph workloads make it hard to design a one-size-fits-all graph storage system. Systems that support transactional updates use data structures with poor data locality, which limits the efficiency of analytical workloads or even simple edge scans. Other systems run graph analytics workloads efficiently, but cannot properly support transactions. This paper presents LiveGraph, a graph storage system that outperforms both the best graph transactional systems and the best solutions for real-time graph analytics on fresh data. LiveGraph achieves this by ensuring that adjacency list scans, a key operation in graph workloads, are purely sequential: they never require random accesses even in the presence of concurrent transactions. Such pure-sequential operations are enabled by combining a novel graph-aware data structure, the Transactional Edge Log (TEL), with a concurrency control mechanism that leverages TEL's data layout. Our evaluation shows that LiveGraph significantly outperforms state-of-the-art (graph) database solutions on both transactional and real-time analytical workloads.
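To convey why an append-only, timestamp-tagged edge log keeps adjacency scans sequential under concurrent updates, here is a toy sketch under assumed semantics. It is not LiveGraph's actual TEL layout or concurrency control; writers only append or mark entries, and readers walk the log front to back, keeping entries visible at their read timestamp.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EdgeEntry:
    dst: int
    created: int                      # timestamp of the transaction that added the edge
    invalidated: Optional[int] = None # timestamp of the deleting transaction (None = live)

@dataclass
class VertexLog:
    """Append-only per-vertex edge log: writes append or mark, reads scan forward."""
    entries: list = field(default_factory=list)

    def add_edge(self, dst, ts):
        self.entries.append(EdgeEntry(dst, created=ts))

    def delete_edge(self, dst, ts):
        for e in self.entries:
            if e.dst == dst and e.invalidated is None:
                e.invalidated = ts    # logical delete; the entry stays in place

    def scan(self, read_ts):
        """Sequential scan returning neighbors visible to a reader at read_ts."""
        return [e.dst for e in self.entries
                if e.created <= read_ts and (e.invalidated is None or e.invalidated > read_ts)]

log = VertexLog()
log.add_edge(2, ts=1)
log.add_edge(3, ts=2)
log.delete_edge(2, ts=3)
print(log.scan(read_ts=2))   # [2, 3] -- snapshot taken before the delete
print(log.scan(read_ts=4))   # [3]
```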
Cibele Freire, Wolfgang Gatterbauer, Neil Immerman, Alexandra Meliou. New Results for the Complexity of Resilience for Binary Conjunctive Queries with Self-Joins. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2020), Portland, OR, USA, June 14-19, 2020, pp. 271–284, 2020. DOI: 10.1145/3375395.3387647. PDF: https://people.cs.umass.edu/~ameli/projects/causality/papers/pods2020-Resilience.pdf

The resilience of a Boolean query on a database is the minimum number of tuples that need to be deleted from the input tables in order to make the query false. A solution to this problem immediately translates into a solution for the more widely known problem of deletion propagation with source-side effects. In this paper, we give several novel results on the hardness of the resilience problem for conjunctive queries with self-joins, and, more specifically, we present a dichotomy result for the class of single-self-join binary queries with exactly two repeated relations occurring in the query. Unlike in the self-join free case, the concept of triad is not enough to fully characterize the complexity of resilience. We identify new structural properties, namely chains, confluences and permutations, which lead to various NP-hardness results. We also give novel involved reductions to network flow to show certain cases are in P. Although restricted, our results provide important insights into the problem of self-joins that we hope can help solve the general case of all conjunctive queries with self-joins in the future.
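As a concrete reading of the resilience definition (purely illustrative; the paper's contribution is the complexity classification, not this brute force), the sketch below computes the resilience of a tiny Boolean join query by trying deletion sets of increasing size until the query becomes false. The relations and query are made up.

```python
from itertools import combinations

# Toy database: R(a, b) and S(b, c). Boolean query Q :- R(x, y), S(y, z).
R = {("a", 1), ("b", 1), ("c", 2)}
S = {(1, "p"), (2, "q")}

def holds(r, s):
    """Q is true iff some R-tuple joins with some S-tuple on the shared attribute."""
    return any(rb == sb for (_, rb) in r for (sb, _) in s)

def resilience(r, s):
    """Minimum number of tuples to delete (from either relation) to make Q false."""
    tuples = [("R", t) for t in r] + [("S", t) for t in s]
    for k in range(len(tuples) + 1):
        for subset in combinations(tuples, k):
            rr = r - {t for rel, t in subset if rel == "R"}
            ss = s - {t for rel, t in subset if rel == "S"}
            if not holds(rr, ss):
                return k
    return len(tuples)

print(resilience(R, S))   # 2: e.g., delete S(1, "p") and R(c, 2)
```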
Anna Fariha, Ashish Tiwari, Arjun Radhakrishna, Sumit Gulwani. ExTuNe: Explaining Tuple Non-conformance. In Proceedings of the 2020 International Conference on Management of Data (SIGMOD 2020), online conference [Portland, OR, USA], June 14-19, 2020, pp. 2741–2744, 2020. DOI: 10.1145/3318464.3384694. PDF: https://people.cs.umass.edu/~afariha/papers/ExTuNe.pdf

In data-driven systems, we often encounter tuples on which the predictions of a machine-learned model are untrustworthy. A key cause of such untrustworthiness is non-conformance of a new tuple with respect to the training dataset. To check conformance, we introduce a novel concept of data invariant, which captures a set of implicit constraints that all tuples of a dataset satisfy: a test tuple is non-conforming if it violates the data invariants. Data invariants model complex relationships among multiple attributes, but do not provide interpretable explanations of non-conformance. We present ExTuNe, a system for Explaining causes of Tuple Non-conformance. Based on the principles of causality, ExTuNe assigns responsibility to the attributes for causing non-conformance. The key idea is to observe the change in invariant violation under intervention on attribute values. Through a simple interface, ExTuNe produces a ranked list of the test tuples based on their degree of non-conformance and visualizes tuple-level attribute responsibility for non-conformance through heat maps. ExTuNe further visualizes attribute responsibility, aggregated over the test tuples. We demonstrate how ExTuNe can detect and explain tuple non-conformance and assist the users to make careful decisions towards achieving trusted machine learning.
Anna Fariha, Suman Nath, Alexandra Meliou. Causality-Guided Adaptive Interventional Debugging. In Proceedings of the 2020 International Conference on Management of Data (SIGMOD 2020), online conference [Portland, OR, USA], June 14-19, 2020, pp. 431–446, 2020. DOI: 10.1145/3318464.3389694. PDF: https://people.cs.umass.edu/~afariha/papers/aid.pdf

Runtime nondeterminism is a fact of life in modern database applications. Previous research has shown that nondeterminism can cause applications to intermittently crash, become unresponsive, or experience data corruption. We propose Adaptive Interventional Debugging (AID) for debugging such intermittent failures. AID combines existing statistical debugging, causal analysis, fault injection, and group testing techniques in a novel way to (1) pinpoint the root cause of an application's intermittent failure and (2) generate an explanation of how the root cause triggers the failure. AID works by first identifying a set of runtime behaviors (called predicates) that are strongly correlated to the failure. It then utilizes temporal properties of the predicates to (over)-approximate their causal relationships. Finally, it uses fault injection to execute a sequence of interventions on the predicates and discover their true causal relationships. This enables AID to identify the true root cause and its causal relationship to the failure. We theoretically analyze how fast AID can converge to the identification. We evaluate AID with six real-world applications that intermittently fail under specific inputs. In each case, AID was able to identify the root cause and explain how the root cause triggered the failure, much faster than group testing and more precisely than statistical debugging. We also evaluate AID with many synthetically generated applications with known root causes and confirm that the benefits also hold for them.
Matteo Brucato, Nishant Yadav, Azza Abouzied, Peter J. Haas, Alexandra Meliou. Stochastic Package Queries in Probabilistic Databases. In Proceedings of the 2020 International Conference on Management of Data (SIGMOD 2020), online conference [Portland, OR, USA], June 14-19, 2020, pp. 269–283, 2020. DOI: 10.1145/3318464.3389765. PDF: https://people.cs.umass.edu/~matteo/files/3318464.3389765.pdf

We provide methods for in-database support of decision making under uncertainty. Many important decision problems correspond to selecting a "package" (a bag of tuples in a relational database) that jointly satisfies a set of constraints while minimizing some overall "cost" function; in most real-world problems, the data is uncertain. We provide methods for specifying, via a SQL extension, and processing stochastic package queries (SPQs), in order to solve optimization problems over uncertain data, right where the data resides. Prior work in stochastic programming uses Monte Carlo methods where the original stochastic optimization problem is approximated by a large deterministic optimization problem that incorporates many "scenarios", i.e., sample realizations of the uncertain data values. For large database tables, however, a huge number of scenarios is required, leading to poor performance and, often, failure of the solver software. We therefore provide a novel SummarySearch algorithm that, instead of trying to solve a large deterministic problem, seamlessly approximates it via a sequence of smaller problems defined over carefully crafted "summaries" of the scenarios that accelerate convergence to a feasible and near-optimal solution. Experimental results on our prototype system show that SummarySearch can be orders of magnitude faster than prior methods at finding feasible and high-quality packages.
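The scenario idea that this paper and the sPaQLTooLs demo above build on can be illustrated with a tiny Monte Carlo sketch (this is not SummarySearch): uncertain gains are replaced by sampled scenarios, a probabilistic risk constraint becomes a count over scenarios, and the resulting deterministic problem is solved, here by brute force over a handful of hypothetical stocks.

```python
import itertools
import random

random.seed(7)

# Hypothetical stocks: (name, price today, mean future gain, volatility).
stocks = [("AAA", 10, 1.0, 2.0), ("BBB", 20, 3.0, 6.0),
          ("CCC", 15, 2.0, 3.0), ("DDD", 25, 4.0, 10.0)]
BUDGET, ALPHA, MAX_LOSS, N_SCENARIOS = 50, 0.9, 5.0, 1000

# Monte Carlo "database": each scenario is one sampled realization of every gain.
scenarios = [[random.gauss(mu, sigma) for (_, _, mu, sigma) in stocks]
             for _ in range(N_SCENARIOS)]

best_pkg, best_expected = None, float("-inf")
for pkg in itertools.chain.from_iterable(
        itertools.combinations(range(len(stocks)), k) for k in range(len(stocks) + 1)):
    if sum(stocks[i][1] for i in pkg) > BUDGET:
        continue                                    # deterministic budget constraint
    # Probabilistic constraint, approximated over scenarios:
    # in at least ALPHA of scenarios, the package loses no more than MAX_LOSS.
    ok = sum(1 for sc in scenarios if -sum(sc[i] for i in pkg) <= MAX_LOSS)
    if ok / N_SCENARIOS < ALPHA:
        continue
    expected = sum(stocks[i][2] for i in pkg)       # objective: expected total gain
    if expected > best_expected:
        best_pkg, best_expected = pkg, expected

print([stocks[i][0] for i in best_pkg], best_expected)
```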
David Pujol, Ryan McKenna, Satya Kuppam, Michael Hay, Ashwin Machanavajjhala, Gerome Miklau. Fair decision making using privacy-protected data. In FAT* '20: Conference on Fairness, Accountability, and Transparency, Barcelona, Spain, January 27-30, 2020, pp. 189–199, 2020. DOI: 10.1145/3351095.3372872
Dan Zhang, Ryan McKenna, Ios Kotsogiannis, George Bissias, Michael Hay, Ashwin Machanavajjhala, Gerome Miklau. ϵKTELO: A Framework for Defining Differentially Private Computations. ACM Trans. Database Syst., 45(1), pp. 2:1–2:44, 2020. DOI: 10.1145/3362032
2019
Ryan McKenna, Daniel Sheldon, Gerome Miklau. Graphical-model based estimation and inference for differential privacy. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), 9-15 June 2019, Long Beach, California, USA, pp. 4435–4444, 2019. URL: http://proceedings.mlr.press/v97/mckenna19a.html
Junjay Tan, Thanaa Ghanem, Matthew Perron, Xiangyao Yu, Michael Stonebraker, David DeWitt, Marco Serafini, Ashraf Aboulnaga, Tim Kraska. Choosing A Cloud DBMS: Architectures and Tradeoffs. Proc. VLDB Endow., 12(12), 2019. PDF: https://marcoserafini.github.io/papers/cloud-dbms.pdf

As analytic (OLAP) applications move to the cloud, DBMSs have shifted from employing a pure shared-nothing design with locally attached storage to a hybrid design that combines the use of shared storage (e.g., AWS S3) with the use of shared-nothing query execution mechanisms. This paper sheds light on the resulting tradeoffs, which have not been properly identified in previous work. To this end, it evaluates the TPC-H benchmark across a variety of DBMS offerings running in a cloud environment (AWS) on fast 10Gb+ networks, specifically database-as-a-service offerings (Redshift, Athena), query engines (Presto, Hive), and a traditional cloud-agnostic OLAP database (Vertica). While these comparisons cannot be apples-to-apples in all cases due to cloud configuration restrictions, we nonetheless identify patterns and design choices that are advantageous. These include prioritizing low-cost object stores like S3 for data storage; using system-agnostic yet still performant columnar formats like ORC that allow easy switching to other systems for different workloads; and making features that benefit subsequent runs, like query precompilation and caching remote data to faster storage, optional rather than required, because they disadvantage ad hoc queries.
Anna Fariha, Alexandra Meliou. Example-Driven Query Intent Discovery: Abductive Reasoning using Semantic Similarity. PVLDB, 12(11), pp. 1262–1275, 2019. PDF: http://www.vldb.org/pvldb/vol12/p1262-fariha.pdf. Project: squid.cs.umass.edu. Code: https://bitbucket.org/afariha/squid-public/

Traditional relational data interfaces require precise structured queries over potentially complex schemas. These rigid data retrieval mechanisms pose hurdles for non-expert users, who typically lack language expertise and are unfamiliar with the details of the schema. Query by Example (QBE) methods offer an alternative mechanism: users provide examples of their intended query output and the QBE system needs to infer the intended query. However, these approaches focus on the structural similarity of the examples and ignore the richer context present in the data. As a result, they typically produce queries that are too general, and fail to capture the user's intent effectively. In this paper, we present SQuID, a system that performs semantic similarity-aware query intent discovery. Our work makes the following contributions: (1) We design an end-to-end system that automatically formulates select-project-join queries in an open-world setting, with optional group-by aggregation and intersection operators, a much larger class than prior QBE techniques. (2) We express the problem of query intent discovery using a probabilistic abduction model that infers a query as the most likely explanation of the provided examples. (3) We introduce the notion of an abduction-ready database, which precomputes semantic properties and related statistics, allowing SQuID to achieve real-time performance. (4) We present an extensive empirical evaluation on three real-world datasets, including user-intent case studies, demonstrating that SQuID is efficient and effective, and outperforms machine learning methods, as well as the state-of-the-art in the related query reverse engineering problem.
Xiaolan Wang, Xin Luna Dong, Yang Li, Alexandra Meliou. MIDAS: Finding the Right Web Sources to Fill Knowledge Gaps. In 35th IEEE International Conference on Data Engineering (ICDE 2019), Macao, China, April 8-11, 2019, pp. 578–589, 2019. DOI: 10.1109/ICDE.2019.00058

Knowledge bases, massive collections of facts (RDF triples) on diverse topics, support vital modern applications. However, existing knowledge bases contain very little data compared to the wealth of information on the Web. This is because the industry standard in knowledge base creation and augmentation suffers from a serious bottleneck: they rely on domain experts to identify appropriate web sources to extract data from. Efforts to fully automate knowledge extraction have failed to improve this standard: these automated systems are able to retrieve much more data and from a broader range of sources, but they suffer from very low precision and recall. As a result, these large-scale extractions remain unexploited. In this paper, we present MIDAS, a system that harnesses the results of automated knowledge extraction pipelines to repair the bottleneck in industrial knowledge creation and augmentation processes. MIDAS automates the suggestion of good-quality web sources and describes what to extract with respect to augmenting an existing knowledge base. We make three major contributions. First, we introduce a novel concept, web source slices, to describe the contents of a web source. Second, we define a profit function to quantify the value of a web source slice with respect to augmenting an existing knowledge base. Third, we develop effective and highly-scalable algorithms to derive high-profit web source slices. We demonstrate that MIDAS produces high-profit results and outperforms the baselines significantly on both real-world and synthetic datasets.
Xiaolan Wang, Alexandra Meliou Explain3D: Explaining Disagreements in Disjoint Datasets Journal Article PVLDB, 12 (7), pp. 779–792, 2019. @article{DBLP:journals/pvldb/WangM19, title = {Explain3D: Explaining Disagreements in Disjoint Datasets}, author = {Xiaolan Wang and Alexandra Meliou}, url = {http://www.vldb.org/pvldb/vol12/p779-wang.pdf}, year = {2019}, date = {2019-01-01}, journal = {PVLDB}, volume = {12}, number = {7}, pages = {779--792}, abstract = {Data plays an important role in applications, analytic processes, and many aspects of human activity. As data grows in size and complexity, we are met with an imperative need for tools that promote understanding and explanations over data-related operations. Data management research on explanations has focused on the assumption that data resides in a single dataset, under one common schema. But the reality of today's data is that it is frequently unintegrated, coming from different sources with different schemas. When different datasets provide different answers to semantically similar questions, understanding the reasons for the discrepancies is challenging and cannot be handled by the existing single-dataset solutions. In this paper, we propose explain3D, a framework for explaining the disagreements across disjoint datasets (3D). Explain3D focuses on identifying the reasons for the differences in the results of two semantically similar queries operating on two datasets with potentially different schemas. Our framework leverages the queries to perform a semantic mapping across the relevant parts of their provenance; discrepancies in this mapping point to causes of the queries' differences. Exploiting the queries gives explain3D an edge over traditional schema matching and record linkage techniques, which are query-agnostic. Our work makes the following contributions: (1) We formalize the problem of deriving optimal explanations for the differences of the results of semantically similar queries over disjoint datasets. Our optimization problem considers two types of explanations, provenance-based and value-based, defined over an evidence mapping, which makes our solution interpretable. (2) We design a 3-stage framework for solving the optimal explanation problem. (3) We develop a smart-partitioning optimizer that improves the efficiency of the framework by orders of magnitude. (4) We experiment with real-world and synthetic data to demonstrate that explain3D can derive precise explanations efficiently, and is superior to alternative methods based on integration techniques and single-dataset explanation frameworks.}, keywords = {}, pubstate = {published}, tppubtype = {article} } Data plays an important role in applications, analytic processes, and many aspects of human activity. As data grows in size and complexity, we are met with an imperative need for tools that promote understanding and explanations over data-related operations. Data management research on explanations has focused on the assumption that data resides in a single dataset, under one common schema. But the reality of today's data is that it is frequently unintegrated, coming from different sources with different schemas. When different datasets provide different answers to semantically similar questions, understanding the reasons for the discrepancies is challenging and cannot be handled by the existing single-dataset solutions. In this paper, we propose explain3D, a framework for explaining the disagreements across disjoint datasets (3D). 
Explain3D focuses on identifying the reasons for the differences in the results of two semantically similar queries operating on two datasets with potentially different schemas. Our framework leverages the queries to perform a semantic mapping across the relevant parts of their provenance; discrepancies in this mapping point to causes of the queries' differences. Exploiting the queries gives explain3D an edge over traditional schema matching and record linkage techniques, which are query-agnostic. Our work makes the following contributions: (1) We formalize the problem of deriving optimal explanations for the differences of the results of semantically similar queries over disjoint datasets. Our optimization problem considers two types of explanations, provenance-based and value-based, defined over an evidence mapping, which makes our solution interpretable. (2) We design a 3-stage framework for solving the optimal explanation problem. (3) We develop a smart-partitioning optimizer that improves the efficiency of the framework by orders of magnitude. (4) We experiment with real-world and synthetic data to demonstrate that explain3D can derive precise explanations efficiently, and is superior to alternative methods based on integration techniques and single-dataset explanation frameworks. |
Ryan Mckenna, Daniel Sheldon, Gerome Miklau Graphical-model based estimation and inference for differential privacy Inproceedings International Conference on Machine Learning, pp. 4435–4444, 2019. @inproceedings{mckenna2019graphical, title = {Graphical-model based estimation and inference for differential privacy}, author = {Ryan Mckenna and Daniel Sheldon and Gerome Miklau}, url = {http://proceedings.mlr.press/v97/mckenna19a/mckenna19a.pdf https://people.cs.umass.edu/~miklau/assets/pubs/dp/mckenna19pgm.pdf }, year = {2019}, date = {2019-01-01}, booktitle = {International Conference on Machine Learning}, pages = {4435--4444}, abstract = {Many privacy mechanisms reveal high-level information about a data distribution through noisy measurements. It is common to use this information to estimate the answers to new queries. In this work, we provide an approach to solve this estimation problem efficiently using graphical models, which is particularly effective when the distribution is high-dimensional but the measurements are over low-dimensional marginals. We show that our approach is far more efficient than existing estimation techniques from the privacy literature and that it can improve the accuracy and scalability of many state-of-the-art mechanisms.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } Many privacy mechanisms reveal high-level information about a data distribution through noisy measurements. It is common to use this information to estimate the answers to new queries. In this work, we provide an approach to solve this estimation problem efficiently using graphical models, which is particularly effective when the distribution is high-dimensional but the measurements are over low-dimensional marginals. We show that our approach is far more efficient than existing estimation techniques from the privacy literature and that it can improve the accuracy and scalability of many state-of-the-art mechanisms. |
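To make the estimation problem described in the abstract above concrete, here is a minimal sketch, assuming an invented 8-cell histogram, Laplace-noised measurements, and a dense least-squares estimate. It only illustrates the generic "estimate the data from noisy measurements, then answer new queries" step; the paper's contribution is performing this inference efficiently with graphical models when the data is high-dimensional, which this toy does not attempt.

```python
# Toy estimation from noisy measurements; data, queries, and epsilon are invented.
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 50, size=8).astype(float)          # hypothetical histogram (8 cells)

M = np.vstack([np.eye(8), np.ones((1, 8))])            # measured queries: each cell + the total
epsilon = 1.0
sensitivity = np.abs(M).sum(axis=0).max()              # max L1 column norm of the measurements
y = M @ x + rng.laplace(scale=sensitivity / epsilon, size=M.shape[0])  # noisy answers

x_hat, *_ = np.linalg.lstsq(M, y, rcond=None)          # estimate consistent with all measurements

new_query = np.array([1, 1, 1, 1, 0, 0, 0, 0], float)  # an unmeasured range query
print("true answer:", new_query @ x, " estimated:", round(float(new_query @ x_hat), 1))
```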
Matteo Brucato, Azza Abouzied, Alexandra Meliou Scalable computation of high-order optimization queries Journal Article Commun. ACM, 62 (2), pp. 108–116, 2019. @article{DBLP:journals/cacm/BrucatoAM19, title = {Scalable computation of high-order optimization queries}, author = {Matteo Brucato and Azza Abouzied and Alexandra Meliou}, url = {https://doi.org/10.1145/3299881 https://people.cs.umass.edu/~ameli/projects/packageBuilder/papers/p108-brucato.pdf}, doi = {10.1145/3299881}, year = {2019}, date = {2019-01-01}, journal = {Commun. ACM}, volume = {62}, number = {2}, pages = {108--116}, abstract = {Constrained optimization problems are at the heart of significant applications in a broad range of domains, including finance, transportation, manufacturing, and healthcare. Modeling and solving these problems has relied on application-specific solutions, which are often complex, error-prone, and do not generalize. Our goal is to create a domain-independent, declarative approach, supported and powered by the system where the data relevant to these problems typically resides: the database. We present a complete system that supports package queries, a new query model that extends traditional database queries to handle complex constraints and preferences over answer sets, allowing the declarative specification and efficient evaluation of a significant class of constrained optimization problems—integer linear programs (ILP)—within a database.}, keywords = {}, pubstate = {published}, tppubtype = {article} } Constrained optimization problems are at the heart of significant applications in a broad range of domains, including finance, transportation, manufacturing, and healthcare. Modeling and solving these problems has relied on application-specific solutions, which are often complex, error-prone, and do not generalize. Our goal is to create a domain-independent, declarative approach, supported and powered by the system where the data relevant to these problems typically resides: the database. We present a complete system that supports package queries, a new query model that extends traditional database queries to handle complex constraints and preferences over answer sets, allowing the declarative specification and efficient evaluation of a significant class of constrained optimization problems—integer linear programs (ILP)—within a database. |
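As a concrete illustration of the class of problems the abstract describes, the sketch below hand-codes a "package"-style query as an integer linear program with PuLP over a made-up meals table: maximize total rating subject to a calorie budget and a package size. It shows only what such a query boils down to conceptually; it is not the declarative package-query language or the scalable evaluation engine presented in the paper.

```python
# A hand-written ILP for a package-style query; the meals data is made up.
import pulp

meals = [  # (name, calories, rating)
    ("salad", 350, 6), ("burger", 900, 8), ("pasta", 700, 7),
    ("soup", 300, 5), ("steak", 800, 9), ("sushi", 450, 8),
]

prob = pulp.LpProblem("package_query", pulp.LpMaximize)
x = [pulp.LpVariable(f"pick_{i}", cat="Binary") for i in range(len(meals))]

prob += pulp.lpSum(x[i] * meals[i][2] for i in range(len(meals)))          # maximize total rating
prob += pulp.lpSum(x[i] * meals[i][1] for i in range(len(meals))) <= 2000  # calorie budget
prob += pulp.lpSum(x) == 3                                                 # exactly 3 meals

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([meals[i][0] for i in range(len(meals)) if x[i].value() == 1])
```

In the paper's setting this translation, and its scalable (approximate) evaluation over large relations, happen inside the database rather than by hand.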
Ahmed Elgohary, Matthias Boehm, Peter J Haas, Frederick R Reiss, Berthold Reinwald Compressed linear algebra for declarative large-scale machine learning Journal Article Commun. ACM, 62 (5), pp. 83–91, 2019. @article{DBLP:journals/cacm/ElgoharyBHRR19, title = {Compressed linear algebra for declarative large-scale machine learning}, author = {Ahmed Elgohary and Matthias Boehm and Peter J Haas and Frederick R Reiss and Berthold Reinwald}, url = {https://doi.org/10.1145/3318221}, doi = {10.1145/3318221}, year = {2019}, date = {2019-01-01}, journal = {Commun. ACM}, volume = {62}, number = {5}, pages = {83--91}, abstract = {Large-scale Machine Learning (ML) algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications. Hence, it is crucial for performance to fit the data into single-node or distributed main memory to enable fast matrix-vector operations. General-purpose compression struggles to achieve both good compression ratios and fast decompression for blockwise uncompressed operations. Therefore, we introduce Compressed Linear Algebra (CLA) for lossless matrix compression. CLA encodes matrices with lightweight, value-based compression techniques and executes linear algebra operations directly on the compressed representations. We contribute effective column compression schemes, cache-conscious operations, and an efficient sampling-based compression algorithm. Our experiments show good compression ratios and operations performance close to the uncompressed case, which enables fitting larger datasets into available memory. We thereby obtain significant end-to-end performance improvements.}, keywords = {}, pubstate = {published}, tppubtype = {article} } Large-scale Machine Learning (ML) algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications. Hence, it is crucial for performance to fit the data into single-node or distributed main memory to enable fast matrix-vector operations. General-purpose compression struggles to achieve both good compression ratios and fast decompression for blockwise uncompressed operations. Therefore, we introduce Compressed Linear Algebra (CLA) for lossless matrix compression. CLA encodes matrices with lightweight, value-based compression techniques and executes linear algebra operations directly on the compressed representations. We contribute effective column compression schemes, cache-conscious operations, and an efficient sampling-based compression algorithm. Our experiments show good compression ratios and operations performance close to the uncompressed case, which enables fitting larger datasets into available memory. We thereby obtain significant end-to-end performance improvements. |
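The core idea of executing linear algebra directly on value-compressed data can be sketched in a few lines: store each column as a map from distinct values to row offsets, and compute a matrix-vector product without decompressing. This toy is illustrative only; the actual CLA work uses several encoding schemes, sampling-based compression planning, and cache-conscious block operations that are not shown here.

```python
# Toy value-based (offset-list) column compression and a matvec on the compressed form.
import numpy as np

def compress_column(col):
    """Offset-list encoding: map each distinct nonzero value to its row indices."""
    enc = {}
    for i, v in enumerate(col):
        if v != 0:
            enc.setdefault(v, []).append(i)
    return enc

def compressed_matvec(compressed_cols, x, num_rows):
    """Compute y = A @ x using only the compressed column representations."""
    y = np.zeros(num_rows)
    for j, enc in enumerate(compressed_cols):
        for value, rows in enc.items():
            y[rows] += value * x[j]
    return y

A = np.array([[1, 0, 3], [1, 2, 0], [0, 2, 3], [1, 0, 3]], dtype=float)
x = np.array([2.0, 1.0, -1.0])
cols = [compress_column(A[:, j]) for j in range(A.shape[1])]
assert np.allclose(compressed_matvec(cols, x, A.shape[0]), A @ x)
```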
Yanlei Diao, Pawel Guzewicz, Ioana Manolescu, Mirjana Mazuran Spade: A Modular Framework for Analytical Exploration of RDF Graphs Journal Article PVLDB, 12 (12), pp. 1926–1929, 2019. @article{DBLP:journals/pvldb/DiaoGMM19, title = {Spade: A Modular Framework for Analytical Exploration of RDF Graphs}, author = {Yanlei Diao and Pawel Guzewicz and Ioana Manolescu and Mirjana Mazuran}, url = {http://www.vldb.org/pvldb/vol12/p1926-diao.pdf https://hal.inria.fr/hal-02152844/document}, doi = {10.14778/3352063.3352101}, year = {2019}, date = {2019-01-01}, journal = {PVLDB}, volume = {12}, number = {12}, pages = {1926--1929}, abstract = {RDF data is complex; exploring it is hard, and can be done through many different metaphors. We have developed and propose to demonstrate Spade, a tool helping users discover meaningful content of an RDF graph by showing them the results of aggregation (OLAP-style) queries automatically identified from the data. Spade chooses aggregates that are visually interesting, a property formally based on statistical properties of the aggregation query results. While well understood for relational data, such exploration raises multiple challenges for RDF: facts, dimensions and measures have to be identified (as opposed to known beforehand); as there are more candidate aggregates, assessing their interestingness can be very costly; finally, ontologies bring novel specific challenges but also novel opportunities, enabling ontology-driven exploration from an aggregate initially proposed by the system. Spade is a generic, extensible framework, which we instantiated with: (i) novel methods for enumerating candidate measures and dimensions in the vast space of possibilities provided by an RDF graph; (ii) a set of aggregate interestingness functions; (iii) ontology-based interactive exploration, and (iv) efficient early-stop techniques for estimating the interestingness of an aggregate query. The demonstration will comprise interactive scenarios on a variety of large, interesting RDF graphs.}, keywords = {}, pubstate = {published}, tppubtype = {article} } RDF data is complex; exploring it is hard, and can be done through many different metaphors. We have developed and propose to demonstrate Spade, a tool helping users discover meaningful content of an RDF graph by showing them the results of aggregation (OLAP-style) queries automatically identified from the data. Spade chooses aggregates that are visually interesting, a property formally based on statistical properties of the aggregation query results. While well understood for relational data, such exploration raises multiple challenges for RDF: facts, dimensions and measures have to be identified (as opposed to known beforehand); as there are more candidate aggregates, assessing their interestingness can be very costly; finally, ontologies bring novel specific challenges but also novel opportunities, enabling ontology-driven exploration from an aggregate initially proposed by the system. Spade is a generic, extensible framework, which we instantiated with: (i) novel methods for enumerating candidate measures and dimensions in the vast space of possibilities provided by an RDF graph; (ii) a set of aggregate interestingness functions; (iii) ontology-based interactive exploration, and (iv) efficient early-stop techniques for estimating the interestingness of an aggregate query. The demonstration will comprise interactive scenarios on a variety of large, interesting RDF graphs. |
Khaled Zaouk, Fei Song, Chenghao Lyu, Arnab Sinha, Yanlei Diao, Prashant J Shenoy UDAO: A Next-Generation Unified Data Analytics Optimizer Journal Article PVLDB, 12 (12), pp. 1934–1937, 2019. @article{DBLP:journals/pvldb/ZaoukSLSDS19, title = {UDAO: A Next-Generation Unified Data Analytics Optimizer}, author = {Khaled Zaouk and Fei Song and Chenghao Lyu and Arnab Sinha and Yanlei Diao and Prashant J Shenoy}, url = {http://www.vldb.org/pvldb/vol12/p1934-zaouk.pdf}, doi = {10.14778/3352063.3352103}, year = {2019}, date = {2019-01-01}, journal = {PVLDB}, volume = {12}, number = {12}, pages = {1934--1937}, abstract = {Big data analytics systems today still lack the ability to take user performance goals and budgetary constraints, collectively referred to as “objectives”, and automatically configure an analytic job to achieve the objectives. This paper presents UDAO, a unified data analytics optimizer that can automatically determine the parameters of the runtime system, collectively called a job configuration, for general dataflow programs based on user objectives. UDAO embodies key techniques including in-situ modeling, which learns a model for each user objective in the same computing environment as the job is run, and multi-objective optimization, which computes a Pareto optimal set of job configurations to reveal tradeoffs between different objectives. Using benchmarks developed based on industry needs, our demonstration will allow the user to explore (1) learned models to gain insights into how various parameters affect user objectives; (2) Pareto frontiers to understand interesting tradeoffs between different objectives and how a configuration recommended by the optimizer explores these tradeoffs; (3) end-to-end benefits that UDAO can provide over default configurations or those manually tuned by engineers.}, keywords = {}, pubstate = {published}, tppubtype = {article} } Big data analytics systems today still lack the ability to take user performance goals and budgetary constraints, collectively referred to as “objectives”, and automatically configure an analytic job to achieve the objectives. This paper presents UDAO, a unified data analytics optimizer that can automatically determine the parameters of the runtime system, collectively called a job configuration, for general dataflow programs based on user objectives. UDAO embodies key techniques including in-situ modeling, which learns a model for each user objective in the same computing environment as the job is run, and multi-objective optimization, which computes a Pareto optimal set of job configurations to reveal tradeoffs between different objectives. Using benchmarks developed based on industry needs, our demonstration will allow the user to explore (1) learned models to gain insights into how various parameters affect user objectives; (2) Pareto frontiers to understand interesting tradeoffs between different objectives and how a configuration recommended by the optimizer explores these tradeoffs; (3) end-to-end benefits that UDAO can provide over default configurations or those manually tuned by engineers. |
Brian Hentschel, Peter J Haas, Yuanyuan Tian Online Model Management via Temporally Biased Sampling Journal Article SIGMOD Record, 48 (1), pp. 69–76, 2019. @article{DBLP:journals/sigmod/HentschelHT19, title = {Online Model Management via Temporally Biased Sampling}, author = {Brian Hentschel and Peter J Haas and Yuanyuan Tian}, url = {https://doi.org/10.1145/3371316.3371333 https://people.cs.umass.edu/~phaas/files/tbs-sigmod-record.pdf}, doi = {10.1145/3371316.3371333}, year = {2019}, date = {2019-01-01}, journal = {SIGMOD Record}, volume = {48}, number = {1}, pages = {69--76}, abstract = {To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally-biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying exponentially over time. We then periodically retrain the models on the current sample. We provide and analyze both a simple sampling scheme (T-TBS) that probabilistically maintains a target sample size and a novel reservoir-based scheme (R-TBS) that is the first to provide both control over the decay rate and a guaranteed upper bound on the sample size. The R-TBS and T-TBS schemes are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark show that our approach can increase machine learning accuracy and robustness in the face of evolving data.}, keywords = {}, pubstate = {published}, tppubtype = {article} } To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally-biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying exponentially over time. We then periodically retrain the models on the current sample. We provide and analyze both a simple sampling scheme (T-TBS) that probabilistically maintains a target sample size and a novel reservoir-based scheme (R-TBS) that is the first to provide both control over the decay rate and a guaranteed upper bound on the sample size. The R-TBS and T-TBS schemes are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark show that our approach can increase machine learning accuracy and robustness in the face of evolving data. |
Johanna Sommer, Matthias Boehm, Alexandre V Evfimievski, Berthold Reinwald, Peter J Haas MNC: Structure-Exploiting Sparsity Estimation for Matrix Expressions Inproceedings Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pp. 1607–1623, 2019. @inproceedings{DBLP:conf/sigmod/Sommer0ERH19, title = {MNC: Structure-Exploiting Sparsity Estimation for Matrix Expressions}, author = {Johanna Sommer and Matthias Boehm and Alexandre V Evfimievski and Berthold Reinwald and Peter J Haas}, url = {https://doi.org/10.1145/3299869.3319854 https://mboehm7.github.io/resources/sigmod2019.pdf}, doi = {10.1145/3299869.3319854}, year = {2019}, date = {2019-01-01}, booktitle = {Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019}, pages = {1607--1623}, crossref = {DBLP:conf/sigmod/2019}, abstract = {Efficiently computing linear algebra expressions is central to machine learning (ML) systems. Most systems support sparse formats and operations because sparse matrices are ubiquitous and their dense representation can cause prohibitive overheads. Estimating the sparsity of intermediates, however, remains a key challenge when generating execution plans or performing sparse operations. These sparsity estimates are used for cost and memory estimates, format decisions, and result allocation. Existing estimators tend to focus on matrix products only, and struggle to attain good accuracy with low estimation overhead. However, a key observation is that real-world sparse matrices commonly exhibit structural properties such as a single non-zero per row, or columns with varying sparsity. In this paper, we introduce MNC (Matrix Non-zero Count), a remarkably simple, count-based matrix synopsis that exploits these structural properties for efficient, accurate, and general sparsity estimation. We describe estimators and sketch propagation for realistic linear algebra expressions. Our experiments—on a new estimation benchmark called SparsEst—show that the MNC estimator yields good accuracy with very low overhead. This behavior makes MNC practical and broadly applicable in ML systems.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } Efficiently computing linear algebra expressions is central to machine learning (ML) systems. Most systems support sparse formats and operations because sparse matrices are ubiquitous and their dense representation can cause prohibitive overheads. Estimating the sparsity of intermediates, however, remains a key challenge when generating execution plans or performing sparse operations. These sparsity estimates are used for cost and memory estimates, format decisions, and result allocation. Existing estimators tend to focus on matrix products only, and struggle to attain good accuracy with low estimation overhead. However, a key observation is that real-world sparse matrices commonly exhibit structural properties such as a single non-zero per row, or columns with varying sparsity. In this paper, we introduce MNC (Matrix Non-zero Count), a remarkably simple, count-based matrix synopsis that exploits these structural properties for efficient, accurate, and general sparsity estimation. We describe estimators and sketch propagation for realistic linear algebra expressions. 
Our experiments—on a new estimation benchmark called SparsEst—show that the MNC estimator yields good accuracy with very low overhead. This behavior makes MNC practical and broadly applicable in ML systems. |
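For intuition about count-based sparsity estimation, the sketch below estimates the number of nonzeros in a matrix product from the per-row nonzero counts of A and per-column nonzero counts of B under an independence assumption, and compares the estimate with the exact count. This is a simple baseline in the same spirit as count-based synopses, not the MNC estimator or its sketch propagation, and the matrices are synthetic.

```python
# Simple count-based estimate of nnz(A @ B) under an independence assumption.
import numpy as np

rng = np.random.default_rng(1)
A = (rng.random((60, 40)) < 0.05).astype(float)      # sparse 0/1 matrices for illustration
B = (rng.random((40, 50)) < 0.10).astype(float)

row_counts_A = (A != 0).sum(axis=1)                  # nonzeros per row of A
col_counts_B = (B != 0).sum(axis=0)                  # nonzeros per column of B
k = A.shape[1]                                       # shared inner dimension

# P[C_ij != 0] ~ 1 - (1 - p_i * q_j)^k, with p_i, q_j the row/column densities.
p = row_counts_A / k
q = col_counts_B / k
estimated_nnz = (1.0 - (1.0 - np.outer(p, q)) ** k).sum()

actual_nnz = int(((A @ B) != 0).sum())
print(f"estimated nnz: {estimated_nnz:.1f}   actual nnz: {actual_nnz}")
```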
Zhiqi Huang, Ryan McKenna, George Bissias, Gerome Miklau, Michael Hay, Ashwin Machanavajjhala PSynDB: Accurate and Accessible Private Data Generation Journal Article PVLDB, 12 (12), pp. 1918–1921, 2019. @article{DBLP:journals/pvldb/HuangMBMHM19, title = {PSynDB: Accurate and Accessible Private Data Generation}, author = {Zhiqi Huang and Ryan McKenna and George Bissias and Gerome Miklau and Michael Hay and Ashwin Machanavajjhala}, url = {http://www.vldb.org/pvldb/vol12/p1918-huang.pdf}, doi = {10.14778/3352063.3352099}, year = {2019}, date = {2019-01-01}, journal = {PVLDB}, volume = {12}, number = {12}, pages = {1918--1921}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Ios Kotsogiannis, Yuchao Tao, Xi He, Maryam Fanaeepour, Ashwin Machanavajjhala, Michael Hay, Gerome Miklau PrivateSQL: A Differentially Private SQL Query Engine Journal Article PVLDB, 12 (11), pp. 1371–1384, 2019. @article{DBLP:journals/pvldb/KotsogiannisTHF19, title = {PrivateSQL: A Differentially Private SQL Query Engine}, author = {Ios Kotsogiannis and Yuchao Tao and Xi He and Maryam Fanaeepour and Ashwin Machanavajjhala and Michael Hay and Gerome Miklau}, url = {http://www.vldb.org/pvldb/vol12/p1371-kotsogiannis.pdf}, doi = {10.14778/3342263.3342274}, year = {2019}, date = {2019-01-01}, journal = {PVLDB}, volume = {12}, number = {11}, pages = {1371--1384}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Dan Zhang, Ryan McKenna, Ios Kotsogiannis, George Bissias, Michael Hay, Ashwin Machanavajjhala, Gerome Miklau ∈KTELO: A Framework for Defining Differentially-Private Computations Journal Article SIGMOD Record, 48 (1), pp. 15–22, 2019. @article{DBLP:journals/sigmod/ZhangMKBHMM19, title = {∈KTELO: A Framework for Defining Differentially-Private Computations}, author = {Dan Zhang and Ryan McKenna and Ios Kotsogiannis and George Bissias and Michael Hay and Ashwin Machanavajjhala and Gerome Miklau}, url = {https://doi.org/10.1145/3371316.3371321}, doi = {10.1145/3371316.3371321}, year = {2019}, date = {2019-01-01}, journal = {SIGMOD Record}, volume = {48}, number = {1}, pages = {15--22}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Ios Kotsogiannis, Yuchao Tao, Ashwin Machanavajjhala, Gerome Miklau, Michael Hay Architecting a Differentially Private SQL Engine Inproceedings CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings, 2019. @inproceedings{DBLP:conf/cidr/KotsogiannisTMM19, title = {Architecting a Differentially Private SQL Engine}, author = {Ios Kotsogiannis and Yuchao Tao and Ashwin Machanavajjhala and Gerome Miklau and Michael Hay}, url = {http://cidrdb.org/cidr2019/papers/p125-kotsogiannis-cidr19.pdf}, year = {2019}, date = {2019-01-01}, booktitle = {CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Yifan Guan, Abolfazl Asudeh, Pranav Mayuram, H V Jagadish, Julia Stoyanovich, Gerome Miklau, Gautam Das MithraRanking: A System for Responsible Ranking Design Inproceedings Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, pp. 1913–1916, 2019. @inproceedings{DBLP:conf/sigmod/GuanAMJSM019, title = {MithraRanking: A System for Responsible Ranking Design}, author = {Yifan Guan and Abolfazl Asudeh and Pranav Mayuram and H V Jagadish and Julia Stoyanovich and Gerome Miklau and Gautam Das}, url = {https://doi.org/10.1145/3299869.3320244}, doi = {10.1145/3299869.3320244}, year = {2019}, date = {2019-01-01}, booktitle = {Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019}, pages = {1913--1916}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Emily A Herbert, Wang Cen, Peter J Haas NIM: generative neural networks for modeling and generation of simulation inputs Inproceedings Proceedings of the 2019 Summer Simulation Conference, SummerSim 2019, Berlin, Germany, July 22-24, 2019, pp. 65:1–65:6, 2019. @inproceedings{DBLP:conf/scsc/HerbertCH19, title = {NIM: generative neural networks for modeling and generation of simulation inputs}, author = {Emily A Herbert and Wang Cen and Peter J Haas}, url = {https://dl.acm.org/citation.cfm?id=3374203 https://people.cs.umass.edu/~phaas/files/SSC2019.pdf}, year = {2019}, date = {2019-07-22}, booktitle = {Proceedings of the 2019 Summer Simulation Conference, SummerSim 2019, Berlin, Germany, July 22-24, 2019}, pages = {65:1--65:6}, crossref = {DBLP:conf/scsc/2019}, abstract = {We introduce Neural Input Modeling (NIM), a generative-neural-network framework that exploits modern data-rich environments to automatically capture complex simulation input distributions and then generate samples from them. Experiments show that our prototype architecture NIM-VL, which uses a variational autoencoder with LSTM components, can accurately, and with no prior knowledge, automatically capture a range of stochastic processes, including mixed-ARMA and nonhomogeneous Poisson processes, and can efficiently generate sample paths. Moreover, we show that the outputs from a queueing model with (known) complex inputs are statistically close to outputs from the same queueing model but with the inputs learned via NIM. Known distributional properties such as i.i.d. structure and nonnegativity can be exploited to increase accuracy and speed. NIM has the potential to help overcome one of the key barriers to simulation for non-experts.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } We introduce Neural Input Modeling (NIM), a generative-neural-network framework that exploits modern data-rich environments to automatically capture complex simulation input distributions and then generate samples from them. Experiments show that our prototype architecture NIM-VL, which uses a variational autoencoder with LSTM components, can accurately, and with no prior knowledge, automatically capture a range of stochastic processes, including mixed-ARMA and nonhomogeneous Poisson processes, and can efficiently generate sample paths. Moreover, we show that the outputs from a queueing model with (known) complex inputs are statistically close to outputs from the same queueing model but with the inputs learned via NIM. Known distributional properties such as i.i.d. structure and nonnegativity can be exploited to increase accuracy and speed. NIM has the potential to help overcome one of the key barriers to simulation for non-experts. |
Brian Hentschel, Peter J Haas, Yuanyuan Tian General Temporally Biased Sampling Schemes for Online Model Management Journal Article ACM Trans. Database Syst., 44 (4), pp. 14:1–14:45, 2019. @article{DBLP:journals/tods/HentschelHT19, title = {General Temporally Biased Sampling Schemes for Online Model Management}, author = {Brian Hentschel and Peter J Haas and Yuanyuan Tian}, url = {https://doi.org/10.1145/3360903}, doi = {10.1145/3360903}, year = {2019}, date = {2019-12-24}, journal = {ACM Trans. Database Syst.}, volume = {44}, number = {4}, pages = {14:1--14:45}, abstract = {To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying over time according to a specified “decay function.” We then periodically retrain the models on the current sample. This approach speeds up the training process relative to training on all of the data. Moreover, time-biasing lets the models adapt to recent changes in the data while—unlike in a sliding-window approach—still keeping some old data to ensure robustness in the face of temporary fluctuations and periodicities in the data values. In addition, the sampling-based approach allows existing analytic algorithms for static data to be applied to dynamic streaming data essentially without change. We provide and analyze both a simple sampling scheme (Targeted-Size Time-Biased Sampling (T-TBS)) that probabilistically maintains a target sample size and a novel reservoir-based scheme (Reservoir-Based Time-Biased Sampling (R-TBS)) that is the first to provide both control over the decay rate and a guaranteed upper bound on the sample size. If the decay function is exponential, then control over the decay rate is complete, and R-TBS maximizes both expected sample size and sample-size stability. For general decay functions, the actual item inclusion probabilities can be made arbitrarily close to the nominal probabilities, and we provide a scheme that allows a tradeoff between sample footprint and sample-size stability. R-TBS rests on the notion of a “fractional sample” and allows for data arrival rates that are unknown and time varying (unlike T-TBS). The R-TBS and T-TBS schemes are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark illuminate the performance and scalability of the algorithms, and show that our approach can increase machine learning robustness in the face of evolving data.}, keywords = {}, pubstate = {published}, tppubtype = {article} } To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying over time according to a specified “decay function.” We then periodically retrain the models on the current sample. This approach speeds up the training process relative to training on all of the data. Moreover, time-biasing lets the models adapt to recent changes in the data while—unlike in a sliding-window approach—still keeping some old data to ensure robustness in the face of temporary fluctuations and periodicities in the data values. 
In addition, the sampling-based approach allows existing analytic algorithms for static data to be applied to dynamic streaming data essentially without change. We provide and analyze both a simple sampling scheme (Targeted-Size Time-Biased Sampling (T-TBS)) that probabilistically maintains a target sample size and a novel reservoir-based scheme (Reservoir-Based Time-Biased Sampling (R-TBS)) that is the first to provide both control over the decay rate and a guaranteed upper bound on the sample size. If the decay function is exponential, then control over the decay rate is complete, and R-TBS maximizes both expected sample size and sample-size stability. For general decay functions, the actual item inclusion probabilities can be made arbitrarily close to the nominal probabilities, and we provide a scheme that allows a tradeoff between sample footprint and sample-size stability. R-TBS rests on the notion of a “fractional sample” and allows for data arrival rates that are unknown and time varying (unlike T-TBS). The R-TBS and T-TBS schemes are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark illuminate the performance and scalability of the algorithms, and show that our approach can increase machine learning robustness in the face of evolving data. |
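A minimal sketch in the spirit of the simple T-TBS scheme described above (not the paper's algorithm, and not R-TBS): at every batch boundary each retained item survives with probability exp(-decay), and each arriving item is admitted with a probability chosen so the expected steady-state sample size matches a target. Like T-TBS, it assumes a known, constant arrival rate and offers no hard bound on the sample size; all parameters and the stream are invented.

```python
# Bernoulli-style time-biased sampling sketch with exponential decay.
import math
import random

def time_biased_sample(stream_batches, target_size, decay, batch_size):
    survive = math.exp(-decay)                               # per-batch survival probability
    admit = min(1.0, target_size * (1.0 - survive) / batch_size)
    sample = []
    for batch in stream_batches:
        sample = [item for item in sample if random.random() < survive]
        sample.extend(item for item in batch if random.random() < admit)
        # ... periodically retrain the downstream model on `sample` here ...
    return sample

random.seed(0)
batches = [[(t, i) for i in range(100)] for t in range(50)]  # 50 batches of 100 items
s = time_biased_sample(batches, target_size=200, decay=0.1, batch_size=100)
print(len(s), "items in sample; oldest surviving batch:", min(t for t, _ in s))
```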
2018 |
Xiaolan Wang, Laura Haas, Alexandra Meliou Explaining Data Integration Journal Article IEEE Data Engineering Bulletin, 41 (2), pp. 47–58, 2018. @article{WangHM2018, title = {Explaining Data Integration}, author = {Xiaolan Wang and Laura Haas and Alexandra Meliou}, url = {http://sites.computer.org/debull/A18june/p47.pdf}, year = {2018}, date = {2018-01-01}, journal = {IEEE Data Engineering Bulletin}, volume = {41}, number = {2}, pages = {47--58}, abstract = {Explanations are an integral part of human behavior: people provide explanations to justify choices and actions, and seek explanations to understand the world around them. The need for explanations extends to technology, as semi-automated and fully-automated systems support crucial activities and increasingly important societal functions. The interpretability of these systems and the ability to explain their decision processes are crucial in developing trust in the systems’ function. Further, explanations provide opportunities for systems to interact with human users and obtain feedback, improving their operation. Finally, explanations allow domain experts and system developers to debug erroneous system decisions, diagnose unexpected outcomes, and improve system function. In this paper, we study and review existing data integration systems with respect to their ability to derive explanations. We present a new classification of data integration systems by their explainability and discuss the characteristics of systems within these classes. We review the types of explanations derived by the various data integration systems within each explainability class. Finally, we present a vision of the desired properties of future data integration systems with respect to explanations and discuss the challenges in pursuing this goal.}, keywords = {}, pubstate = {published}, tppubtype = {article} } Explanations are an integral part of human behavior: people provide explanations to justify choices and actions, and seek explanations to understand the world around them. The need for explanations extends to technology, as semi-automated and fully-automated systems support crucial activities and increasingly important societal functions. The interpretability of these systems and the ability to explain their decision processes are crucial in developing trust in the systems’ function. Further, explanations provide opportunities for systems to interact with human users and obtain feedback, improving their operation. Finally, explanations allow domain experts and system developers to debug erroneous system decisions, diagnose unexpected outcomes, and improve system function. In this paper, we study and review existing data integration systems with respect to their ability to derive explanations. We present a new classification of data integration systems by their explainability and discuss the characteristics of systems within these classes. We review the types of explanations derived by the various data integration systems within each explainability class. Finally, we present a vision of the desired properties of future data integration systems with respect to explanations and discuss the challenges in pursuing this goal. |
Rico Angell, Brittany Johnson, Yuriy Brun, Alexandra Meliou Themis: Automatically Testing Software for Discrimination Inproceedings Proceedings of the Demonstrations Track at the The 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), Lake Buena Vista, FL, USA, 2018, (demonstration paper). @inproceedings{Angell18fse-demo, title = {Themis: Automatically Testing Software for Discrimination}, author = {Rico Angell and Brittany Johnson and Yuriy Brun and Alexandra Meliou}, url = {https://people.cs.umass.edu/~brun/pubs/pubs/Angell18fse-demo.pdf http://fairness.cs.umass.edu/ https://youtu.be/brB8wkaUesY}, year = {2018}, date = {2018-11-06}, booktitle = {Proceedings of the Demonstrations Track at the The 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)}, address = {Lake Buena Vista, FL, USA}, abstract = {Bias in decisions made by modern software is becoming a common and serious problem. We present Themis, an automated test suite generator to measure two types of discrimination, including causal relationships between sensitive inputs and program behavior. We explain how Themis can measure discrimination and aid its debugging, describe a set of optimizations Themis uses to reduce test suite size, and demonstrate Themis' effectiveness on open-source software. Themis is open-source and all our evaluation data are available at http://fairness.cs.umass.edu/. See a video of Themis in action: https://youtu.be/brB8wkaUesY}, note = {demonstration paper}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } Bias in decisions made by modern software is becoming a common and serious problem. We present Themis, an automated test suite generator to measure two types of discrimination, including causal relationships between sensitive inputs and program behavior. We explain how Themis can measure discrimination and aid its debugging, describe a set of optimizations Themis uses to reduce test suite size, and demonstrate Themis' effectiveness on open-source software. Themis is open-source and all our evaluation data are available at http://fairness.cs.umass.edu/. See a video of Themis in action: https://youtu.be/brB8wkaUesY |
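A minimal illustration of causal discrimination testing in the spirit of what Themis measures, not the Themis tool itself: generate random inputs, vary only the sensitive attribute, and count how often the decision changes. The loan_decision function below is a made-up (and deliberately biased) stand-in for the software under test.

```python
# Toy causal discrimination score: fraction of inputs whose outcome depends on gender.
import random

def loan_decision(age, income, gender):
    # hypothetical system under test
    return income > 40000 or (gender == "male" and income > 30000)

def causal_discrimination(decide, sensitive_values, trials=10000):
    changed = 0
    for _ in range(trials):
        age = random.randint(18, 90)
        income = random.randint(10000, 100000)
        outcomes = {decide(age, income, g) for g in sensitive_values}
        if len(outcomes) > 1:            # decision depends on the sensitive attribute
            changed += 1
    return changed / trials

random.seed(0)
print("causal discrimination score:",
      causal_discrimination(loan_decision, ["male", "female"]))
```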
Yuriy Brun, Alexandra Meliou Software Fairness Inproceedings Proceedings of the New Ideas and Emerging Results Track at the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), Lake Buena Vista, FL, USA, 2018. @inproceedings{Brun18fse-nier, title = {Software Fairness}, author = {Yuriy Brun and Alexandra Meliou}, url = {https://people.cs.umass.edu/~brun/pubs/pubs/Brun18fse-nier.pdf}, year = {2018}, date = {2018-11-06}, booktitle = {Proceedings of the New Ideas and Emerging Results Track at the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)}, address = {Lake Buena Vista, FL, USA}, abstract = {A goal of software engineering research is advancing software quality and the success of the software engineering process. However, while recent studies have demonstrated a new kind of defect in software related to its ability to operate in a fair and unbiased manner, software engineering has not yet wholeheartedly tackled these new kinds of defects, thus leaving software vulnerable. This paper outlines a vision for how software engineering research can help reduce fairness defects and represents a call to action by the software engineering research community to reify that vision. Modern software is riddled with examples of biased behavior, from automated translation injecting gender stereotypes, to vision systems failing to see faces of certain races, to the US criminal justice system relying on biased computational assessments of crime recidivism. While systems may learn bias from biased data, bias can also emerge from ambiguous or incomplete requirement specification, poor design, implementation bugs, and unintended component interactions. We argue that software fairness is analogous to software quality, and that numerous software engineering challenges in the areas of requirements, specification, design, testing, and verification need to be tackled to solve this problem.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } A goal of software engineering research is advancing software quality and the success of the software engineering process. However, while recent studies have demonstrated a new kind of defect in software related to its ability to operate in a fair and unbiased manner, software engineering has not yet wholeheartedly tackled these new kinds of defects, thus leaving software vulnerable. This paper outlines a vision for how software engineering research can help reduce fairness defects and represents a call to action by the software engineering research community to reify that vision. Modern software is riddled with examples of biased behavior, from automated translation injecting gender stereotypes, to vision systems failing to see faces of certain races, to the US criminal justice system relying on biased computational assessments of crime recidivism. While systems may learn bias from biased data, bias can also emerge from ambiguous or incomplete requirement specification, poor design, implementation bugs, and unintended component interactions. We argue that software fairness is analogous to software quality, and that numerous software engineering challenges in the areas of requirements, specification, design, testing, and verification need to be tackled to solve this problem. |
Brian Hentschel, Peter J Haas, Yuanyuan Tian Temporally-Biased Sampling for Online Model Management Inproceedings Proceedings of the 21th International Conference on Extending Database Technology, EDBT 2018, Vienna, Austria, March 26-29, 2018., pp. 109–120, 2018. @inproceedings{DBLP:conf/edbt/HentschelHT18, title = {Temporally-Biased Sampling for Online Model Management}, author = {Brian Hentschel and Peter J Haas and Yuanyuan Tian}, url = {http://openproceedings.org/2018/conf/edbt/paper-52.pdf}, year = {2018}, date = {2018-01-01}, booktitle = {Proceedings of the 21th International Conference on Extending Database Technology, EDBT 2018, Vienna, Austria, March 26-29, 2018.}, pages = {109--120}, abstract = {To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally-biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying exponentially over time. We then periodically retrain the models on the current sample. This approach speeds up the training process relative to training on all of the data. Moreover, time-biasing lets the models adapt to recent changes in the data while—unlike in a sliding-window approach—still keeping some old data to ensure robustness in the face of temporary fluctuations and periodicities in the data values. In addition, the sampling-based approach allows existing analytic algorithms for static data to be applied to dynamic streaming data essentially without change. We provide and analyze both a simple sampling scheme (T-TBS) that probabilistically maintains a target sample size and a novel reservoir-based scheme (R-TBS) that is the first to provide both complete control over the decay rate and a guaranteed upper bound on the sample size, while maximizing both expected sample size and sample-size stability. The latter scheme rests on the notion of a “fractional sample” and, unlike T-TBS, allows for data arrival rates that are unknown and time varying. R-TBS and T-TBS are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark illuminate the performance and scalability of the algorithms, and show that our approach can increase machine learning robustness in the face of evolving data. }, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally-biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying exponentially over time. We then periodically retrain the models on the current sample. This approach speeds up the training process relative to training on all of the data. Moreover, time-biasing lets the models adapt to recent changes in the data while—unlike in a sliding-window approach—still keeping some old data to ensure robustness in the face of temporary fluctuations and periodicities in the data values. In addition, the sampling-based approach allows existing analytic algorithms for static data to be applied to dynamic streaming data essentially without change. 
We provide and analyze both a simple sampling scheme (T-TBS) that probabilistically maintains a target sample size and a novel reservoir-based scheme (R-TBS) that is the first to provide both complete control over the decay rate and a guaranteed upper bound on the sample size, while maximizing both expected sample size and sample-size stability. The latter scheme rests on the notion of a “fractional sample” and, unlike T-TBS, allows for data arrival rates that are unknown and time varying. R-TBS and T-TBS are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark illuminate the performance and scalability of the algorithms, and show that our approach can increase machine learning robustness in the face of evolving data. |
Yue Wang, Alexandra Meliou, Gerome Miklau RC-Index: Diversifying Answers to Range Queries Journal Article PVLDB, 11 (7), pp. 773–786, 2018. @article{WangMM2018, title = {RC-Index: Diversifying Answers to Range Queries}, author = {Yue Wang and Alexandra Meliou and Gerome Miklau}, url = {https://people.cs.umass.edu/~ameli/projects/fairness/papers/p607-wang.pdf}, doi = {10.14778/3192965.3192969}, year = {2018}, date = {2018-01-01}, journal = {PVLDB}, volume = {11}, number = {7}, pages = {773--786}, abstract = {Query result diversification is widely used in data exploration, Web search, and recommendation systems. The problem of returning diversified query results consists of finding a small subset of valid query answers that are representative and different from one another, usually quantified by a diversity score. Most existing techniques for query diversification first compute all valid query results and then find a diverse subset. These techniques are inefficient when the set of valid query results is large. Other work has proposed efficient solutions for restricted application settings, where results are shared across multiple queries. In this paper, our goal is to support result diversification for general range queries over a single relation. We propose the RC-Index, a novel index structure that achieves efficiency by reducing the number of items that must be retrieved by the database to form a diverse set of the desired size (about 1 second for a dataset of 1 million items). Further, we prove that an RC-Index offers strong approximation guarantees. To the best of our knowledge, this is the first index-based diversification method with a guaranteed approximation ratio for range queries.}, keywords = {}, pubstate = {published}, tppubtype = {article} } Query result diversification is widely used in data exploration, Web search, and recommendation systems. The problem of returning diversified query results consists of finding a small subset of valid query answers that are representative and different from one another, usually quantified by a diversity score. Most existing techniques for query diversification first compute all valid query results and then find a diverse subset. These techniques are inefficient when the set of valid query results is large. Other work has proposed efficient solutions for restricted application settings, where results are shared across multiple queries. In this paper, our goal is to support result diversification for general range queries over a single relation. We propose the RC-Index, a novel index structure that achieves efficiency by reducing the number of items that must be retrieved by the database to form a diverse set of the desired size (about 1 second for a dataset of 1 million items). Further, we prove that an RC-Index offers strong approximation guarantees. To the best of our knowledge, this is the first index-based diversification method with a guaranteed approximation ratio for range queries. |
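To make the diversification objective concrete, here is the classic greedy max-min heuristic over an already-retrieved result set on synthetic data. It shows only the problem being solved; the point of the RC-Index is precisely to avoid retrieving all valid results before making such a selection.

```python
# Greedy max-min selection of k diverse representatives from a result set.
import numpy as np

def greedy_diverse_subset(points, k):
    """Repeatedly add the point farthest from everything chosen so far."""
    chosen = [0]                                           # seed with an arbitrary result
    dists = np.linalg.norm(points - points[0], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dists))
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(points - points[nxt], axis=1))
    return chosen

rng = np.random.default_rng(2)
results = rng.random((1000, 2))                            # pretend these answer a range query
print("diverse representatives:", greedy_diverse_subset(results, k=5))
```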
Ryan McKenna, Gerome Miklau, Michael Hay, Ashwin Machanavajjhala Optimizing error of high-dimensional statistical queries under differential privacy Journal Article PVLDB, 11 (10), pp. 1206–1219, 2018. @article{DBLP:journals/pvldb/McKennaMHM18, title = {Optimizing error of high-dimensional statistical queries under differential privacy}, author = {Ryan McKenna and Gerome Miklau and Michael Hay and Ashwin Machanavajjhala}, url = {http://www.vldb.org/pvldb/vol11/p1206-mckenna.pdf https://github.com/ryan112358/hdmm}, year = {2018}, date = {2018-01-01}, journal = {PVLDB}, volume = {11}, number = {10}, pages = {1206--1219}, abstract = {Differentially private algorithms for answering sets of predicate counting queries on a sensitive database have many applications. Organizations that collect individual-level data, such as statistical agencies and medical institutions, use them to safely release summary tabulations. However, existing techniques are accurate only on a narrow class of query workloads, or are extremely slow, especially when analyzing more than one or two dimensions of the data. In this work we propose HDMM, a new differentially private algorithm for answering a workload of predicate counting queries, that is especially effective for higher-dimensional datasets. HDMM represents query workloads using an implicit matrix representation and exploits this compact representation to efficiently search (a subset of) the space of differentially private algorithms for one that answers the input query workload with high accuracy. We empirically show that HDMM can efficiently answer queries with lower error than state-of-the-art techniques on a variety of low and high dimensional datasets.}, keywords = {}, pubstate = {published}, tppubtype = {article} } Differentially private algorithms for answering sets of predicate counting queries on a sensitive database have many applications. Organizations that collect individual-level data, such as statistical agencies and medical institutions, use them to safely release summary tabulations. However, existing techniques are accurate only on a narrow class of query workloads, or are extremely slow, especially when analyzing more than one or two dimensions of the data. In this work we propose HDMM, a new differentially private algorithm for answering a workload of predicate counting queries, that is especially effective for higher-dimensional datasets. HDMM represents query workloads using an implicit matrix representation and exploits this compact representation to efficiently search (a subset of) the space of differentially private algorithms for one that answers the input query workload with high accuracy. We empirically show that HDMM can efficiently answer queries with lower error than state-of-the-art techniques on a variety of low and high dimensional datasets. |
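The quantity that strategy-selection methods such as HDMM minimize can be evaluated directly. For a workload W answered through a strategy A under the Laplace matrix mechanism, the expected total squared error is 2(Δ_A/ε)²·‖W A⁺‖²_F, where Δ_A is the maximum L1 column norm of A. The sketch below evaluates this standard expression for two illustrative strategies on a prefix-range workload; the strategies are hand-picked examples, not HDMM's optimized output.

```python
# Expected total squared error of the Laplace matrix mechanism for a workload W
# answered via strategy A, compared across two illustrative strategies.
import numpy as np

def expected_total_squared_error(W, A, epsilon=1.0):
    sensitivity = np.abs(A).sum(axis=0).max()          # max L1 column norm of the strategy
    WA_pinv = W @ np.linalg.pinv(A)
    return 2.0 * (sensitivity / epsilon) ** 2 * np.sum(WA_pinv ** 2)

n = 8
W = np.tril(np.ones((n, n)))                           # all prefix-range queries on n bins

identity = np.eye(n)                                   # strategy 1: measure each cell
with_total = np.vstack([np.eye(n), np.ones((1, n))])   # strategy 2: cells plus the total

print("identity strategy error:", expected_total_squared_error(W, identity))
print("cells + total error:    ", expected_total_squared_error(W, with_total))
```

Comparing such numbers across candidate strategies is what makes strategy selection a meaningful optimization problem; the challenge addressed by HDMM is searching this space efficiently for high-dimensional workloads.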
Enhui Huang, Liping Peng, Luciano Di Palma, Ahmed Abdelkafi, Anna Liu, Yanlei Diao Optimization for active learning-based interactive database exploration Journal Article Proceedings of the VLDB Endowment, 12 (1), pp. 71–84, 2018. @article{huang2018optimization, title = {Optimization for active learning-based interactive database exploration}, author = {Enhui Huang and Liping Peng and Luciano Di Palma and Ahmed Abdelkafi and Anna Liu and Yanlei Diao}, url = {http://www.vldb.org/pvldb/vol12/p71-huang.pdf}, year = {2018}, date = {2018-01-01}, journal = {Proceedings of the VLDB Endowment}, volume = {12}, number = {1}, pages = {71--84}, publisher = {VLDB Endowment}, abstract = {There is an increasing gap between fast growth of data and limited human ability to comprehend data. Consequently, there has been a growing demand of data management tools that can bridge this gap and help the user retrieve high-value content from data more effectively. In this work, we aim to build interactive data exploration as a new database service, using an approach called “explore-by-example”. In particular, we cast the explore-by-example problem in a principled “active learning” framework, and bring the properties of important classes of database queries to bear on the design of new algorithms and optimizations for active learning-based database exploration. These new techniques allow the database system to overcome a fundamental limitation of traditional active learning, i.e., the slow convergence problem. Evaluation results using real-world datasets and user interest patterns show that our new system significantly outperforms state-of-the-art active learning techniques and data exploration systems in accuracy while achieving desired efficiency for interactive performance.}, keywords = {}, pubstate = {published}, tppubtype = {article} } There is an increasing gap between fast growth of data and limited human ability to comprehend data. Consequently, there has been a growing demand of data management tools that can bridge this gap and help the user retrieve high-value content from data more effectively. In this work, we aim to build interactive data exploration as a new database service, using an approach called “explore-by-example”. In particular, we cast the explore-by-example problem in a principled “active learning” framework, and bring the properties of important classes of database queries to bear on the design of new algorithms and optimizations for active learning-based database exploration. These new techniques allow the database system to overcome a fundamental limitation of traditional active learning, i.e., the slow convergence problem. Evaluation results using real-world datasets and user interest patterns show that our new system significantly outperforms state-of-the-art active learning techniques and data exploration systems in accuracy while achieving desired efficiency for interactive performance. |
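A textbook uncertainty-sampling loop on synthetic data gives a feel for the explore-by-example setting described above: fit a classifier to the labeled examples, ask the user about the most uncertain tuple, and repeat. This is the generic active-learning baseline, not the convergence-optimized algorithms developed in the paper; the hidden "user interest" region, the SVM learner, and all parameters are invented.

```python
# Generic uncertainty-sampling active learning loop for explore-by-example.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.uniform(0, 100, size=(5000, 2))                        # unlabeled tuples, 2 attributes
y_all = (((X[:, 0] - 40) ** 2 + (X[:, 1] - 60) ** 2) < 400).astype(int)  # hidden interest region

pos, neg = np.where(y_all == 1)[0], np.where(y_all == 0)[0]
labeled = list(rng.choice(pos, 3)) + list(rng.choice(neg, 3))  # user's initial examples

for _ in range(25):                                            # interaction rounds
    clf = SVC(kernel="rbf", gamma="scale").fit(X[labeled], y_all[labeled])
    margin = np.abs(clf.decision_function(X))                  # small margin = uncertain
    margin[labeled] = np.inf                                   # never re-ask labeled tuples
    labeled.append(int(np.argmin(margin)))                     # ask the user about this tuple

print("accuracy of the learned query:", (clf.predict(X) == y_all).mean())
```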
Anna Fariha, Sheikh Muhammad Sarwar, Alexandra Meliou SQuID: Semantic Similarity-Aware Query Intent Discovery Inproceedings Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 1745–1748, 2018, (demonstration paper). @inproceedings{FarihaSM2018, title = {SQuID: Semantic Similarity-Aware Query Intent Discovery}, author = {Anna Fariha and Sheikh Muhammad Sarwar and Alexandra Meliou}, url = {https://people.cs.umass.edu/~ameli/projects/squid/papers/squid-demo.pdf}, doi = {10.1145/3183713.3193548}, year = {2018}, date = {2018-01-01}, booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD)}, pages = {1745--1748}, note = {demonstration paper}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } Recent expansion of database technology demands a convenient framework for non-expert users to explore datasets. Several approaches exist to assist these non-expert users where they can express their query intent by providing example tuples for their intended query output. However, these approaches treat the structural similarity among the example tuples as the only factor specifying query intent and ignore the richer context present in the data. In this demo, we present SQuID, a system for Semantic similarity aware Query Intent Discovery. SQuID takes a few example tuples from the user as input, through a simple interface, and consults the database to discover deeper associations among these examples. These data-driven associations reveal the semantic context of the provided examples, allowing SQuID to infer the user’s intended query precisely and effectively. SQuID further explains its inference, by displaying the discovered semantic context to the user, who can then provide feedback and tune the result. We demonstrate how SQuID can capture even esoteric and complex semantic contexts, alleviating the need for constructing complex SQL queries, while not requiring the user to have any schema or query language knowledge. |
Dan Zhang, Ryan McKenna, Ios Kotsogiannis, Michael Hay, Ashwin Machanavajjhala, Gerome Miklau EKTELO: A Framework for Defining Differentially-Private Computations Inproceedings Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pp. 115–130, 2018. @inproceedings{DBLP:conf/sigmod/ZhangMKHMM18, title = {EKTELO: A Framework for Defining Differentially-Private Computations}, author = {Dan Zhang and Ryan McKenna and Ios Kotsogiannis and Michael Hay and Ashwin Machanavajjhala and Gerome Miklau}, url = {https://doi.org/10.1145/3183713.3196921}, doi = {10.1145/3183713.3196921}, year = {2018}, date = {2018-01-01}, booktitle = {Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018}, pages = {115--130}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Julia Stoyanovich, Bill Howe, H V Jagadish, Gerome Miklau Panel: A Debate on Data and Algorithmic Ethics Journal Article PVLDB, 11 (12), pp. 2165–2167, 2018. @article{DBLP:journals/pvldb/StoyanovichHJM18, title = {Panel: A Debate on Data and Algorithmic Ethics}, author = {Julia Stoyanovich and Bill Howe and H V Jagadish and Gerome Miklau}, url = {http://www.vldb.org/pvldb/vol11/p2165-stoyanovich.pdf}, doi = {10.14778/3229863.3240494}, year = {2018}, date = {2018-01-01}, journal = {PVLDB}, volume = {11}, number = {12}, pages = {2165--2167}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Sameera Ghayyur, Yan Chen, Roberto Yus, Ashwin Machanavajjhala, Michael Hay, Gerome Miklau, Sharad Mehrotra IoT-Detective: Analyzing IoT Data Under Differential Privacy Inproceedings Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pp. 1725–1728, 2018. @inproceedings{DBLP:conf/sigmod/Ghayyur0YMHMM18, title = {IoT-Detective: Analyzing IoT Data Under Differential Privacy}, author = {Sameera Ghayyur and Yan Chen and Roberto Yus and Ashwin Machanavajjhala and Michael Hay and Gerome Miklau and Sharad Mehrotra}, url = {https://doi.org/10.1145/3183713.3193571}, doi = {10.1145/3183713.3193571}, year = {2018}, date = {2018-01-01}, booktitle = {Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018}, pages = {1725--1728}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Ke Yang, Julia Stoyanovich, Abolfazl Asudeh, Bill Howe, H V Jagadish, Gerome Miklau A Nutritional Label for Rankings Inproceedings Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pp. 1773–1776, 2018. @inproceedings{DBLP:conf/sigmod/YangSAHJM18, title = {A Nutritional Label for Rankings}, author = {Ke Yang and Julia Stoyanovich and Abolfazl Asudeh and Bill Howe and H V Jagadish and Gerome Miklau}, url = {https://doi.org/10.1145/3183713.3193568}, doi = {10.1145/3183713.3193568}, year = {2018}, date = {2018-01-01}, booktitle = {Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018}, pages = {1773--1776}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} } |
Abolfazl Asudeh, H V Jagadish, Gerome Miklau, Julia Stoyanovich On Obtaining Stable Rankings Journal Article PVLDB, 12 (3), pp. 237–250, 2018. @article{DBLP:journals/pvldb/AsudehJMS18, title = {On Obtaining Stable Rankings}, author = {Abolfazl Asudeh and H V Jagadish and Gerome Miklau and Julia Stoyanovich}, url = {http://www.vldb.org/pvldb/vol12/p237-asudeh.pdf}, doi = {10.14778/3291264.3291269}, year = {2018}, date = {2018-01-01}, journal = {PVLDB}, volume = {12}, number = {3}, pages = {237--250}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Matteo Brucato, Azza Abouzied, Alexandra Meliou Package queries: efficient and scalable computation of high-order constraints Journal Article The VLDB Journal, 2018, ([Special Issue on Best Papers of VLDB 2016]). @article{BrucatoAM2018, title = {Package queries: efficient and scalable computation of high-order constraints}, author = {Matteo Brucato and Azza Abouzied and Alexandra Meliou}, doi = {10.1007/s00778-017-0483-4}, year = {2018}, date = {2018-01-01}, journal = {The VLDB Journal}, note = {[Special Issue on Best Papers of VLDB 2016]}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Brian Neil Levine, Gerome Miklau Auditing and Forensic Analysis Incollection Encyclopedia of Database Systems, Second Edition, 2018. @incollection{DBLP:reference/db/LevineM18, title = {Auditing and Forensic Analysis}, author = {Brian Neil Levine and Gerome Miklau}, url = {https://doi.org/10.1007/978-1-4614-8265-9_30}, doi = {10.1007/978-1-4614-8265-9_30}, year = {2018}, date = {2018-01-01}, booktitle = {Encyclopedia of Database Systems, Second Edition}, keywords = {}, pubstate = {published}, tppubtype = {incollection} } |