Publications

239 entries « ‹ 2 of 5 › »

2019

Ios Kotsogiannis, Yuchao Tao, Ashwin Machanavajjhala, Gerome Miklau, Michael Hay

Architecting a Differentially Private SQL Engine Inproceedings

CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings, 2019.

Links | BibTeX

Brian Hentschel, Peter J Haas, Yuanyuan Tian

General Temporally Biased Sampling Schemes for Online Model Management Journal Article

ACM Trans. Database Syst., 44 (4), pp. 14:1–14:45, 2019.

Abstract | Links | BibTeX

@article{DBLP:journals/tods/HentschelHT19,
title = {General Temporally Biased Sampling Schemes for Online Model Management},
author = {Brian Hentschel and Peter J Haas and Yuanyuan Tian},
url = {https://doi.org/10.1145/3360903},
doi = {10.1145/3360903},
year = {2019},
date = {2019-12-24},
journal = {ACM Trans. Database Syst.},
volume = {44},
number = {4},
pages = {14:1--14:45},
abstract = {To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying over time according to a specified “decay function.” We then periodically retrain the models on the current sample. This approach speeds up the training process relative to training on all of the data. Moreover, time-biasing lets the models adapt to recent changes in the data while—unlike in a sliding-window approach—still keeping some old data to ensure robustness in the face of temporary fluctuations and periodicities in the data values. In addition, the sampling-based approach allows existing analytic algorithms for static data to be applied to dynamic streaming data essentially without change. We provide and analyze both a simple sampling scheme (Targeted-Size Time-Biased Sampling (T-TBS)) that probabilistically maintains a target sample size and a novel reservoir-based scheme (Reservoir-Based Time-Biased Sampling (R-TBS)) that is the first to provide both control over the decay rate and a guaranteed upper bound on the sample size. If the decay function is exponential, then control over the decay rate is complete, and R-TBS maximizes both expected sample size and sample-size stability. For general decay functions, the actual item inclusion probabilities can be made arbitrarily close to the nominal probabilities, and we provide a scheme that allows a tradeoff between sample footprint and sample-size stability. R-TBS rests on the notion of a “fractional sample” and allows for data arrival rates that are unknown and time varying (unlike T-TBS). The R-TBS and T-TBS schemes are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark illuminate the performance and scalability of the algorithms, and show that our approach can increase machine learning robustness in the face of evolving data.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}

To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying over time according to a specified “decay function.” We then periodically retrain the models on the current sample. This approach speeds up the training process relative to training on all of the data. Moreover, time-biasing lets the models adapt to recent changes in the data while—unlike in a sliding-window approach—still keeping some old data to ensure robustness in the face of temporary fluctuations and periodicities in the data values. In addition, the sampling-based approach allows existing analytic algorithms for static data to be applied to dynamic streaming data essentially without change. We provide and analyze both a simple sampling scheme (Targeted-Size Time-Biased Sampling (T-TBS)) that probabilistically maintains a target sample size and a novel reservoir-based scheme (Reservoir-Based Time-Biased Sampling (R-TBS)) that is the first to provide both control over the decay rate and a guaranteed upper bound on the sample size. If the decay function is exponential, then control over the decay rate is complete, and R-TBS maximizes both expected sample size and sample-size stability. For general decay functions, the actual item inclusion probabilities can be made arbitrarily close to the nominal probabilities, and we provide a scheme that allows a tradeoff between sample footprint and sample-size stability. R-TBS rests on the notion of a “fractional sample” and allows for data arrival rates that are unknown and time varying (unlike T-TBS). The R-TBS and T-TBS schemes are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark illuminate the performance and scalability of the algorithms, and show that our approach can increase machine learning robustness in the face of evolving data.

Emily A Herbert, Wang Cen, Peter J Haas

NIM: generative neural networks for modeling and generation of simulation inputs Inproceedings

Proceedings of the 2019 Summer Simulation Conference, SummerSim 2019, Berlin, Germany, July 22-24, 2019, pp. 65:1–65:6, 2019.

Abstract | Links | BibTeX

2018

Abolfazl Asudeh, H V Jagadish, Gerome Miklau, Julia Stoyanovich

On Obtaining Stable Rankings Journal Article

PVLDB, 12 (3), pp. 237–250, 2018.

Links | BibTeX

Julia Stoyanovich, Bill Howe, H V Jagadish, Gerome Miklau

Panel: A Debate on Data and Algorithmic Ethics Journal Article

PVLDB, 11 (12), pp. 2165–2167, 2018.

Links | BibTeX

Sameera Ghayyur, Yan Chen, Roberto Yus, Ashwin Machanavajjhala, Michael Hay, Gerome Miklau, Sharad Mehrotra

IoT-Detective: Analyzing IoT Data Under Differential Privacy Inproceedings

Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pp. 1725–1728, 2018.

Links | BibTeX

Ke Yang, Julia Stoyanovich, Abolfazl Asudeh, Bill Howe, H V Jagadish, Gerome Miklau

A Nutritional Label for Rankings Inproceedings

Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pp. 1773–1776, 2018.

Links | BibTeX

Dan Zhang, Ryan McKenna, Ios Kotsogiannis, Michael Hay, Ashwin Machanavajjhala, Gerome Miklau

EKTELO: A Framework for Defining Differentially-Private Computations Inproceedings

Proceedings of the 2018 International Conference on Management of Data, SIGMOD Conference 2018, Houston, TX, USA, June 10-15, 2018, pp. 115–130, 2018.

Links | BibTeX

Brian Neil Levine, Gerome Miklau

Auditing and Forensic Analysis Incollection

Encyclopedia of Database Systems, Second Edition, 2018.

Links | BibTeX

Enhui Huang, Liping Peng, Luciano Di Palma, Ahmed Abdelkafi, Anna Liu, Yanlei Diao

Optimization for active learning-based interactive database exploration Journal Article

Proceedings of the VLDB Endowment, 12 (1), pp. 71–84, 2018.

Abstract | Links | BibTeX

Ryan McKenna, Gerome Miklau, Michael Hay, Ashwin Machanavajjhala

Optimizing error of high-dimensional statistical queries under differential privacy Journal Article

PVLDB, 11 (10), pp. 1206–1219, 2018.

Abstract | Links | BibTeX

Brian Hentschel, Peter J Haas, Yuanyuan Tian

Temporally-Biased Sampling for Online Model Management Inproceedings

Proceedings of the 21th International Conference on Extending Database Technology, EDBT 2018, Vienna, Austria, March 26-29, 2018., pp. 109–120, 2018.

Abstract | Links | BibTeX

@inproceedings{DBLP:conf/edbt/HentschelHT18,
title = {Temporally-Biased Sampling for Online Model Management},
author = {Brian Hentschel and Peter J Haas and Yuanyuan Tian},
url = {http://openproceedings.org/2018/conf/edbt/paper-52.pdf},
year = {2018},
date = {2018-01-01},
booktitle = {Proceedings of the 21th International Conference on Extending Database
Technology, EDBT 2018, Vienna, Austria, March 26-29, 2018.},
pages = {109--120},
abstract = {To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally-biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying exponentially over time. We then periodically retrain the models on the current sample. This approach speeds up the training process relative to training on all of the data. Moreover, time-biasing lets the models adapt to recent changes in the data while—unlike in a sliding-window approach—still keeping some old data to ensure robustness in the face of temporary fluctuations and periodicities in the data values. In addition, the sampling-based approach allows existing analytic algorithms for static data to be applied to dynamic streaming data essentially without change. We provide and analyze both a simple sampling scheme (T-TBS) that probabilistically maintains a target sample size and a novel reservoir-based scheme (R-TBS) that is the first to provide both complete control over the decay rate and a guaranteed upper bound on the sample size, while maximizing both expected sample size and sample-size stability. The latter scheme rests on the notion of a “fractional sample” and, unlike T-TBS, allows for data arrival rates that are unknown and time varying. R-TBS and T-TBS are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark illuminate the performance and scalability of the algorithms, and show that our approach can increase machine learning robustness in the face of evolving data. },
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

Yuriy Brun, Alexandra Meliou

Software Fairness Inproceedings

Proceedings of the New Ideas and Emerging Results Track at the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), Lake Buena Vista, FL, USA, 2018.

Abstract | Links | BibTeX

Rico Angell, Brittany Johnson, Yuriy Brun, Alexandra Meliou

Themis: Automatically Testing Software for Discrimination Inproceedings

Proceedings of the Demonstrations Track at the The 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), Lake Buena Vista, FL, USA, 2018, (demonstration paper).

Abstract | Links | BibTeX

Xiaolan Wang, Laura Haas, Alexandra Meliou

Explaining Data Integration Journal Article

IEEE Data Engineering Bulletin, 41 (2), pp. 47–58, 2018.

Abstract | Links | BibTeX

Yue Wang, Alexandra Meliou, Gerome Miklau

RC-Index: Diversifying Answers to Range Queries Journal Article

PVLDB, 11 (7), pp. 773–786, 2018.

Abstract | Links | BibTeX

Anna Fariha, Sheikh Muhammad Sarwar, Alexandra Meliou

SQuID: Semantic Similarity-Aware Query Intent Discovery Inproceedings

Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 1745–1748, 2018, (demonstration paper).

Abstract | Links | BibTeX

Matteo Brucato, Azza Abouzied, Alexandra Meliou

Package queries: efficient and scalable computation of high-order constraints Journal Article

The VLDB Journal, 2018, ([Special Issue on Best Papers of VLDB 2016]).

Links | BibTeX

2017

Garrett Bernstein, Ryan McKenna, Tao Sun, Daniel Sheldon, Michael Hay, Gerome Miklau

Differentially Private Learning of Undirected Graphical Models Using Collective Graphical Models Inproceedings

Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pp. 478–487, 2017.

Abstract | Links | BibTeX

Haopeng Zhang, Yanlei Diao, Alexandra Meliou

EXstream: Explaining Anomalies in Event Stream Monitoring Inproceedings

Proceedings of the 20th International Conference on Extending Database Technology, EDBT 2017, Venice, Italy, March 21-24, 2017., pp. 156–167, 2017.

Abstract | Links | BibTeX

Xiaolan Wang, Alexandra Meliou, Eugene Wu

QFix: Diagnosing errors through query histories Inproceedings

Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 1369–1384, 2017.

Abstract | Links | BibTeX

@inproceedings{WangMW2017,
title = {QFix: Diagnosing errors through query histories},
author = {Xiaolan Wang and Alexandra Meliou and Eugene Wu},
url = {http://people.cs.umass.edu/ameli/projects/queryProvenance/papers/WangMW2017.pdf},
doi = {10.1145/2882903.2899388},
year = {2017},
date = {2017-05-14},
booktitle = {Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD)},
pages = {1369--1384},
abstract = {Data-driven applications rely on the correctness of their data to function properly and effectively. Errors in data can be incredibly costly and disruptive, leading to loss of revenue, incorrect conclusions, and misguided policy decisions. While data cleaning tools can purge datasets of many errors before the data is used, applications and users interacting with the data can introduce new errors. Subsequent valid updates can obscure these errors and propagate them through the dataset causing more discrepancies. Even when some of these discrepancies are discovered, they are often corrected superficially, on a case-by-case basis, further obscuring the true underlying cause, and making detection of the remaining errors harder.

In this paper, we propose QFix, a framework that derives explanations and repairs for discrepancies in relational data, by analyzing the effect of queries that operated on the data and identifying potential mistakes in those queries. QFix is flexible, handling scenarios where only a subset of the true discrepancies is known, and robust to different types of update workloads. We make four important contributions: (a) we formalize the problem of diagnosing the causes of data errors based on the queries that operated on and introduced errors to a dataset; (b) we develop exact methods for deriving diagnoses and fixes for identified errors using state-of-the-art tools; (c) we present several optimization techniques that improve our basic approach without compromising accuracy, and (d) we leverage a tradeoff between accuracy and performance to scale diagnosis to large datasets and query logs, while achieving near-optimal results. We demonstrate the effectiveness of QFix through extensive evaluation over benchmark and synthetic data.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

Matteo Brucato, Azza Abouzied, Alexandra Meliou

A Scalable Execution Engine for Package Queries Journal Article

SIGMOD Record, 46 (1), pp. 24–31, 2017, ISSN: 0163-5808, (ACM SIGMOD Research Highlight Award).

Abstract | Links | BibTeX

@article{BrucatoAM2017,
title = {A Scalable Execution Engine for Package Queries},
author = {Matteo Brucato and Azza Abouzied and Alexandra Meliou},
url = {https://sigmodrecord.org/publications/sigmodRecord/1703/pdfs/08_ASalable_RH_Brucato.pdf},
doi = {10.1145/3093754.3093761},
issn = {0163-5808},
year = {2017},
date = {2017-03-15},
journal = {SIGMOD Record},
volume = {46},
number = {1},
pages = {24--31},
publisher = {ACM},
address = {New York, NY, USA},
abstract = {Many modern applications and real-world problems involve the design of item collections, or packages: from planning your daily meals all the way to mapping the universe. Despite the pervasive need for packages, traditional data management does not offer support for their definition and computation. This is because traditional database queries follow a powerful, but very simple model: a query defines constraints that each tuple in the result must satisfy. However, a system tasked with the design of packages cannot consider items independently; rather, the system needs to determine if a set of items collectively satisfy given criteria.

In this paper, we present package queries, a new query model that extends traditional database queries to handle complex constraints and preferences over answer sets. We develop a full-fledged package query system, implemented on top of a traditional database engine. Our work makes several contributions. First, we design PaQL, a SQL-based query language that supports the declarative specification of package queries. Second, we present a fundamental strategy for evaluating package queries that combines the capabilities of databases and constraint optimization solvers. The core of our approach is a set of translation rules that transform a package query to an integer linear program. Third, we introduce an offline data partitioning strategy allowing query evaluation to scale to large data sizes. Fourth, we introduce SKETCHREFINE, an efficient and scalable algorithm for package evaluation, which offers strong approximation guarantees. Finally, we present extensive experiments over real-world data. Our results demonstrate that SKETCHREFINE is effective at deriving high-quality package results, and achieves runtime performance that is an order of magnitude faster than directly using ILP solvers over large datasets.},
note = {ACM SIGMOD Research Highlight Award},
keywords = {},
pubstate = {published},
tppubtype = {article}
}

Sainyam Galhotra, Yuriy Brun, Alexandra Meliou

Fairness Testing: Testing Software for Discrimination Inproceedings

Proceedings of 2017 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pp. 498–510, 2017, (ACM SIGSOFT Distinguished Paper Award).

Abstract | Links | BibTeX

@inproceedings{GalhotraBM2017,
title = {Fairness Testing: Testing Software for Discrimination},
author = {Sainyam Galhotra and Yuriy Brun and Alexandra Meliou},
url = {http://people.cs.umass.edu/~brun/pubs/pubs/Galhotra17fse.pdf},
doi = {10.1145/3106237.3106277},
year = {2017},
date = {2017-09-06},
booktitle = {Proceedings of 2017 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering},
pages = {498--510},
abstract = {This paper defines the notions of software fairness and discrimination and develops a testing-based method for measuring if and how much software discriminates. Specifically, the paper focuses on measuring causality in discriminatory behavior. Modern software contributes to important societal decisions and evidence of software discrimination has been found in systems that recommend criminal sentences, grant access to financial loans and products, and determine who is allowed to participate in promotions and receive services. Our approach, Themis, measures discrimination in software by generating efficient, discrimination-testing test suites. Given a schema describing valid system inputs, Themis generates discrimination tests automatically and, notably, does not require an oracle. We evaluate Themis on 20 software systems, 12 of which come from prior work with explicit focus on avoiding discrimination. We find that (1) Themis is effective at discovering software discrimination, (2) state-of-the-art techniques for removing discrimination from algorithms fail in many situations, at times discriminating against as much as 98% of an input subdomain, (3) Themis optimizations are effective at producing efficient test suites for measuring discrimination, and (4) Themis is more efficient on systems that exhibit more discrimination. We thus demonstrate that fairness testing is a critical aspect of the software development cycle in domains with possible discrimination and provide initial tools for measuring software discrimination.},
note = {ACM SIGSOFT Distinguished Paper Award},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

Michael Hay, Liudmila Elagina, Gerome Miklau

Differentially Private Rank Aggregation Inproceedings

Proceedings of the 2017 SIAM International Conference on Data Mining, Houston, Texas, USA, April 27-29, 2017., pp. 669–677, 2017.

Abstract | Links | BibTeX

Ios Kotsogiannis, Ashwin Machanavajjhala, Michael Hay, Gerome Miklau

Pythia: Data Dependent Differentially Private Algorithm Selection Inproceedings

Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pp. 1323–1337, 2017.

Abstract | Links | BibTeX

@inproceedings{DBLP:conf/sigmod/KotsogiannisMHM17,
title = {Pythia: Data Dependent Differentially Private Algorithm Selection},
author = {Ios Kotsogiannis and
Ashwin Machanavajjhala and
Michael Hay and
Gerome Miklau},
url = {http://doi.acm.org/10.1145/3035918.3035945},
doi = {10.1145/3035918.3035945},
year = {2017},
date = {2017-01-01},
booktitle = {Proceedings of the 2017 ACM International Conference on Management
of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017},
pages = {1323--1337},
crossref = {DBLP:conf/sigmod/2017},
abstract = {Differential privacy has emerged as a preferred standard for ensuring privacy in analysis tasks on sensitive datasets. Recent algorithms have allowed for significantly lower error by adapting to properties of the input data. These so-called data-dependent algorithms have different error rates for different inputs. There is now a complex and growing landscape of algorithms without a clear winner that can offer low error over all datasets. As a result, the best possible error rates are not attainable in practice, because the data curator cannot know which algorithm to select prior to actually running the algorithm.

We address this challenge by proposing a novel meta-algorithm designed to relieve the data curator of the burden of algorithm selection. It works by learning (from non-sensitive data) the association between dataset properties and the best-performing algorithm. The meta-algorithm is deployed by first testing the input for low-sensitivity properties and then using the results to select a good algorithm. The result is an end-to-end differentially private system: Pythia, which we show offers improvements over using any single algorithm alone. We empirically demonstrate the benefit of Pythia for the tasks of releasing histograms, answering 1- and 2-dimensional range queries, as well as for constructing private Naive Bayes classifiers.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

Ios Kotsogiannis, Michael Hay, Ashwin Machanavajjhala, Gerome Miklau, Margaret Orr

DIAS: Differentially Private Interactive Algorithm Selection using Pythia Inproceedings

Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pp. 1679–1682, 2017.

Abstract | Links | BibTeX

Garrett Bernstein, Ryan McKenna, Tao Sun, Daniel Sheldon, Michael Hay, Gerome Miklau

Differentially Private Learning of Undirected Graphical Models using Collective Graphical Models Journal Article

CoRR, abs/1706.04646 , 2017.

Links | BibTeX

Julia Stoyanovich, Bill Howe, Serge Abiteboul, Gerome Miklau, Arnaud Sahuguet, Gerhard Weikum

Fides: Towards a Platform for Responsible Data Science Inproceedings

Proceedings of the 29th International Conference on Scientific and Statistical Database Management, Chicago, IL, USA, June 27-29, 2017, pp. 26:1–26:6, 2017.

Abstract | Links | BibTeX

@inproceedings{DBLP:conf/ssdbm/StoyanovichHAMS17,
title = {Fides: Towards a Platform for Responsible Data Science},
author = {Julia Stoyanovich and
Bill Howe and
Serge Abiteboul and
Gerome Miklau and
Arnaud Sahuguet and
Gerhard Weikum},
url = {http://doi.acm.org/10.1145/3085504.3085530},
doi = {10.1145/3085504.3085530},
year = {2017},
date = {2017-01-01},
booktitle = {Proceedings of the 29th International Conference on Scientific and
Statistical Database Management, Chicago, IL, USA, June 27-29, 2017},
pages = {26:1--26:6},
crossref = {DBLP:conf/ssdbm/2017},
abstract = {Issues of responsible data analysis and use are coming to the forefront of the discourse in data science research and practice, with most significant efforts to date on the part of the data mining, machine learning, and security and privacy communities. In these fields, the research has been focused on analyzing the fairness, accountability and transparency (FAT) properties of specific algorithms and their outputs. Although these issues are most apparent in the social sciences where fairness is interpreted in terms of the distribution of resources across protected groups, management of bias in source data affects a variety of fields. Consider climate change studies that require representative data from geographically diverse regions, or supply chain analyses that require data that represents the diversity of products and customers. Any domain that involves sparse or sampled data has exposure to potential bias.

In this vision paper, we argue that FAT properties must be considered as database system issues, further upstream in the data science lifecycle: bias in source data goes unnoticed, and bias may be introduced during pre-processing (fairness), spurious correlations lead to reproducibility problems (accountability), and assumptions made during pre-processing have invisible but significant effects on decisions (transparency). As machine learning methods continue to be applied broadly by non-experts, the potential for misuse increases. We see a need for a data sharing and collaborative analytics platform with features to encourage (and in some cases, enforce) best practices at all stages of the data science lifecycle. We describe features of such a platform, which we term Fides, in the context of urban analytics, outlining a systems research agenda in responsible data science.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

Abhishek Roy, Yanlei Diao, Uday Evani, Avinash Abhyankar, Clinton Howarth, Rémi Le Priol, Toby Bloom

Massively Parallel Processing of Whole Genome Sequence Data: An In-Depth Performance Study Inproceedings

Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pp. 187–202, 2017.

Abstract | Links | BibTeX

Ahmed Elgohary, Matthias Boehm, Peter J Haas, Frederick R Reiss, Berthold Reinwald

Compressed linear algebra for large-scale machine learning Journal Article

The VLDB Journal, 2017, ISSN: 0949-877X.

Abstract | Links | BibTeX

Michael Hay, Liudmila Elagina, Gerome Miklau

Differentially Private Rank Aggregation Inproceedings

Proceedings of the 2017 SIAM International Conference on Data Mining, Houston, Texas, USA, April 27-29, 2017, pp. 669–677, 2017.

Links | BibTeX

Yan Chen, Ashwin Machanavajjhala, Michael Hay, Gerome Miklau

PeGaSus: Data-Adaptive Differentially Private Stream Processing Inproceedings

Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30 - November 03, 2017, pp. 1375–1388, 2017.

Links | BibTeX

Chao Li, Daniel Yang Li, Gerome Miklau, Dan Suciu

A theory of pricing private data Journal Article

Commun. ACM, 60 (12), pp. 79–86, 2017.

Links | BibTeX

Matteo Brucato, Azza Abouzied, Chris Blauvelt

Redistributing Funds across Charitable Crowdfunding Campaigns Journal Article

CoRR, abs/1706.00070 , 2017.

Abstract | Links | BibTeX

2016

Yanlei Diao, Michael J. Franklin

High-Performance XML Message Brokering Incollection

Data Stream Management - Processing High-Speed Data Streams, pp. 451–471, 2016.

Abstract | Links | BibTeX

@incollection{DBLP:books/sp/16/DiaoF16,
title = {High-Performance XML Message Brokering},
author = {Yanlei Diao and
Michael J. Franklin},
url = {http://dx.doi.org/10.1007/978-3-540-28608-0_22},
doi = {10.1007/978-3-540-28608-0_22},
year = {2016},
date = {2016-01-01},
booktitle = {Data Stream Management - Processing High-Speed Data Streams},
pages = {451--471},
crossref = {DBLP:books/sp/GGR2016},
abstract = {For distributed environments including Web Services, data and application integration, and personalized content delivery, XML is becoming the common wire format for data. In this emerging distributed infrastructure, XML message brokers will play a key role as central exchange points for messages sent between applications and/or users. Users (equivalently, applications, or organizations) subscribe to the message broker by providing profiles expressing their data interests. After arriving at the message broker, these profiles become “standing queries,” which are executed on all incoming data. Data sources publish their data by pushing streams of XML messages to the broker. The broker delivers to each user the messages that match his data interests; these messages are presented in the required format of the user. We have developed “YFilter”, an XML filtering system aimed at providing efficient filtering for large numbers (e.g., 10’s or 100’s of thousands) of path queries. The key innovation in YFilter is a Nondeterministic Finite Automaton (NFA)-based representation of path expressions which combines all queries into a single machine. YFilter exploits commonality among path queries by merging the common prefixes of the paths so that they are processed at most once. The NFA-based implementation also provides additional benefits including a relatively small machine size, flexibility in dealing with diverse characteristics of data and queries, incremental machine construction, and ease of maintenance.

This work has been supported in part by the National Science Foundation under the ITR grants IIS0086057 and SI0122599 and by Boeing, IBM, Intel, Microsoft, Siemens, and the UC MICRO program.},
keywords = {},
pubstate = {published},
tppubtype = {incollection}
}

Yue Wang, Alexandra Meliou, Gerome Miklau

A Consumer-Centric Market for Database Computation in the Cloud Journal Article

CoRR, abs/1609.02104 , 2016.

Links | BibTeX

Xiaolan Wang, Alexandra Meliou, Eugene Wu

QFix: Diagnosing errors through query histories Journal Article

CoRR, abs/1601.07539 , 2016.

Links | BibTeX

Xiaolan Wang, Alexandra Meliou, Eugene Wu

QFix: Demonstrating Error Diagnosis in Query Histories Inproceedings

Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pp. 2177–2180, 2016.

Abstract | Links | BibTeX

Matteo Brucato, Juan Felipe Beltran, Azza Abouzied, Alexandra Meliou

Scalable Package Queries in Relational Database Systems Journal Article

PVLDB, 9 (7), pp. 576–587, 2016, (Best of VLDB 2016).

Abstract | Links | BibTeX

@article{DBLP:journals/pvldb/BrucatoBAM16,
title = {Scalable Package Queries in Relational Database Systems},
author = {Matteo Brucato and
Juan Felipe Beltran and
Azza Abouzied and
Alexandra Meliou},
url = {http://www.vldb.org/pvldb/vol9/p576-brucato.pdf},
year = {2016},
date = {2016-01-01},
journal = {PVLDB},
volume = {9},
number = {7},
pages = {576--587},
abstract = {Traditional database queries follow a simple model: they define constraints that each tuple in the result must satisfy. This model is computationally efficient, as the database system can evaluate the query conditions on each tuple individually. However, many practical, real-world problems require a collection of result tuples to satisfy constraints collectively, rather than individually. In this paper, we present package queries, a new query model that extends traditional database queries to handle complex constraints and preferences over answer sets. We develop a full-fledged package query system, implemented on top of a traditional database engine. Our work makes several contributions. First, we design PaQL, a SQL-based query language that supports the declarative specification of package queries. We prove that PaQL is at least as expressive as integer linear programming, and therefore, evaluation of package queries is in general NP-hard. Second, we present a fundamental evaluation strategy that combines the capabilities of databases and constraint optimization solvers to derive solutions to package queries. The core of our approach is a set of translation rules that transform a package query to an integer linear program. Third, we introduce an offline data partitioning strategy allowing query evaluation to scale to large data sizes. Fourth, we introduce SketchRefine, a scalable algorithm for package evaluation, with strong approximation guarantees ((1±ε)^6-factor approximation). Finally, we present extensive experiments over real-world and benchmark data. The results demonstrate that SketchRefine is effective at deriving high-quality package results, and achieves runtime performance that is an order of magnitude faster than directly using ILP solvers over large datasets.},
note = {Best of VLDB 2016},
keywords = {},
pubstate = {published},
tppubtype = {article}
}

Michael Hay, Ashwin Machanavajjhala, Gerome Miklau, Yan Chen, Dan Zhang, George Bissias

Exploring Privacy-Accuracy Tradeoffs using DPComp Inproceedings

Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pp. 2101–2104, 2016.

Abstract | Links | BibTeX

Michael Hay, Ashwin Machanavajjhala, Gerome Miklau, Yan Chen, Dan Zhang

Principled Evaluation of Differentially Private Algorithms using DPBench Inproceedings

Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pp. 139–154, 2016.

Abstract | Links | BibTeX

@inproceedings{DBLP:conf/sigmod/HayMMCZ16,
title = {Principled Evaluation of Differentially Private Algorithms using DPBench},
author = {Michael Hay and
Ashwin Machanavajjhala and
Gerome Miklau and
Yan Chen and
Dan Zhang},
url = {http://doi.acm.org/10.1145/2882903.2882931},
doi = {10.1145/2882903.2882931},
year = {2016},
date = {2016-01-01},
booktitle = {Proceedings of the 2016 International Conference on Management of
Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 -
July 01, 2016},
pages = {139--154},
crossref = {DBLP:conf/sigmod/2016},
abstract = {Differential privacy has become the dominant standard in the research community for strong privacy protection. There has been a flood of research into query answering algorithms that meet this standard. Algorithms are becoming increasingly complex, and in particular, the performance of many emerging algorithms is data dependent, meaning the distribution of the noise added to query answers may change depending on the input data. Theoretical analysis typically only considers the worst case, making empirical study of average case performance increasingly important. In this paper we propose a set of evaluation principles which we argue are essential for sound evaluation. Based on these principles we propose DPBench, a novel evaluation framework for standardized evaluation of privacy algorithms. We then apply our benchmark to evaluate algorithms for answering 1- and 2-dimensional range queries. The result is a thorough empirical study of 15 published algorithms on a total of 27 datasets that offers new insights into algorithm behavior---in particular the influence of dataset scale and shape---and a more complete characterization of the state of the art. Our methodology is able to resolve inconsistencies in prior empirical studies and place algorithm performance in context through comparison to simple baselines. Finally, we pose open research questions which we hope will guide future algorithm design.},
keywords = {},
pubstate = {published},
tppubtype = {inproceedings}
}

Julia Stoyanovich, Serge Abiteboul, Gerome Miklau

Data Responsibly: Fairness, Neutrality and Transparency in Data Analysis Inproceedings

Proceedings of the 19th International Conference on Extending Database Technology, EDBT 2016, Bordeaux, France, March 15-16, 2016, Bordeaux, France, March 15-16, 2016., pp. 718–719, 2016.

Abstract | Links | BibTeX

Kyriaki Dimitriadou, Olga Papaemmanouil, Yanlei Diao

AIDE: An Active Learning-Based Approach for Interactive Data Exploration Journal Article

IEEE Trans. Knowl. Data Eng., 28 (11), pp. 2842–2856, 2016.

Abstract | Links | BibTeX

Yue Wang, Alexandra Meliou, Gerome Miklau

Lifting the Haze off the Cloud: A Consumer-Centric Market for Database Computation in the Cloud Journal Article

PVLDB, 10 (4), pp. 373–384, 2016.

Abstract | Links | BibTeX

@article{DBLP:journals/pvldb/WangMM16,
title = {Lifting the Haze off the Cloud: A Consumer-Centric Market for Database Computation in the Cloud},
author = {Yue Wang and
Alexandra Meliou and
Gerome Miklau},
url = {http://www.vldb.org/pvldb/vol10/p373-wang.pdf},
year = {2016},
date = {2016-01-01},
journal = {PVLDB},
volume = {10},
number = {4},
pages = {373--384},
abstract = {The availability of public computing resources in the cloud has revolutionized data analysis, but requesting cloud resources often involves complex decisions for consumers. Estimating the completion time and cost of a computation and requesting the appropriate cloud resources are challenging tasks even for an expert user. We propose a new market-based framework for pricing computational tasks in the cloud. Our framework introduces an agent between consumers and cloud providers. The agent takes data and computational tasks from users, estimates time and cost for evaluating the tasks, and returns to consumers contracts that specify the price and completion time. Our framework can be applied directly to existing cloud markets without altering the way cloud providers offer and price services. In addition, it simplifies cloud use for consumers by allowing them to compare contracts, rather than choose resources directly. We present design, analytical, and algorithmic contributions focusing on pricing computation contracts, analyzing their properties, and optimizing them in complex workflows. We conduct an experimental evaluation of our market framework over a real-world cloud service and demonstrate empirically that our market ensures three key properties: (a) that consumers benefit from using the market due to competitiveness among agents, (b) that agents have an incentive to price contracts fairly, and (c) that inaccuracies in estimates do not pose a significant risk to agents' profits. Finally, we present a fine-grained pricing mechanism for complex workflows and show that it can increase agent profits by more than an order of magnitude in some cases.},
keywords = {},
pubstate = {published},
tppubtype = {article}
}

Serge Abiteboul, Gerome Miklau, Julia Stoyanovich, Gerhard Weikum

Data, Responsibly (Dagstuhl Seminar 16291) Journal Article

Dagstuhl Reports, 6 (7), pp. 42–71, 2016.

Abstract | Links | BibTeX