Home/Magazine Archive/April 2017 (Vol. 60, No. 4)/Technical Perspective: Building a Safety Net For Data.../Full Text

Research highlights
## Technical Perspective: Building a Safety Net For Data Reuse

Most evidence in the empirical sciences is statistical in nature, and scientists rely on a variety of statistical tests to distinguish valid scientific discoveries from spurious ones. Unfortunately, there is a growing recognition that many important research findings based on statistical evidence are not reproducible, raising the question of whether there is a gap between what these statistical tests ensure and the way they are used.

While there are many reasons that research findings may be non-reproducible this work is focused on just one—*interactivity.* Existing statistical methods assume the procedure being used to analyze the data is fixed before the data is collected. For example, performing a regression analysis on a fixed set of variables. However, in practice, the methods used to analyze a dataset are often chosen based on previous interactions with the same dataset. For example, using the same dataset to first select variables and then perform a regression analysis. Analyzing a fixed dataset in this interactive manner is known to lead to spurious conclusions, even when each procedure is statistically sound in isolation.

How should we study interactive data analysis? The most natural approach is to explicitly model the interactive procedure as a single procedure. For example, there is a vast amount of literature on statistically sound ways to select variables then regress on those variables. But how can we model the entire process of interacting with a dataset from start to finish when there are many steps and the way early steps influence subsequent steps is complex and possibly ill defined?

This work takes a different approach and embraces interactivity. They consider the feasibility of designing a *statistical validity safety net* that goes around a fixed dataset so that it may be analyzed interactively without compromising statistical validity even when the dataset is analyzed in an interactive manner. More concretely, the dramatis personæ are an unknown population and a data analyst who wants to study this population, but only has a random sample from this population. The analyst is allowed to ask very general questions about the population, and each question may depend in an arbitrary, unknown way on the answers to previous questions. The answers are filtered through the safety net, which should ensure the answers remain approximately accurate for the population no matter how the analyst chooses the queries.

The authors of the following paper show there is a way to construct such a safety net. They do so using a substantial strengthening of the folklore result that differentially private algorithms automatically ensure statistical validity, and then deploy the rich toolkit of differentially private algorithms for interactive data analysis to construct such procedures.

Of course, the paper itself is the best way to learn about the results. But for those who become interested in this topic, I will give my own incomplete, very biased, probably already out-of-date summary of the body of work on interactive data analysis that has happened since.

**Sharper Bounds.** What are the best possible statistical guarantees for interactive data analysis? In a subsequent work with Bassily et al.,^{a} we gave improved bounds on the error for estimating adaptively chosen statistics, but we still do not know if these bounds are optimal. One reason we cannot currently close this gap is we do not know necessary conditions for ensuring statistical validity in interactive data analysis. The authors show a weaker condition than differential privacy—*bounded max-information*—are sufficient, and several other notions have been considered,^{a,b,c} but we still do not have necessary conditions.

**Computational and Statistical Bottlenecks.** In a concurrent work with Moritz Hardt,^{d} we approached this problem from the other direction, and showed that interactive data analysis is inherently more difficult than non-interactive data analysis. Very roughly, we show if either the dimensionality of the dataset is large, or if the procedure is required to be computationally efficient in the worst case over interactions and datasets, then it is infeasible to ensure statistical validity for a large number of rounds of interaction.

**Relaxing the Problem.** In my opinion, the most promising direction is to find meaningful relaxations of the model that allow for more useful procedures, and circumvent the bottlenecks we just discussed. The authors' *reusable holdout* is one such relaxation. For another example, Blum and Hardt^{e} gave an efficient algorithm for a specific application in data science competitions, showing the benefit of a more tailored approach. I believe there are many more opportunities to design useful tools by finding the right balance between generality and specificity.

As you can see, the foundations of interactive data analysis are ready to be developed and brought up to speed with the foundations of non-interactive data analysis. I look forward to reading more work on this exciting topic.

a. Bassily et al. Algorithmic stability for adaptive data analysis. In *Proceedings of STOC'16.*

b. D. Russo, J. Zou. Controlling bias in adaptive data analysis using information theory. In *Proceedings of AISTATS'16.*

c. R. Rogers, A. Roth, A. Smith, O. Thakkar. Max-in-formation, differential privacy, and post-selection hypothesis testing. In *Proceedings of FOCS'16.*

d. M. Hardt, J. Ullman. Preventing false discovery in interactive data analysis is hard. In *Proceedings of FOCS'14.*

e. A. Blum, M. Hardt. The Ladder: A reliable leaderboard for machine learning competitions. In *Proceedings of ICML'15.*

To view the accompanying paper, visit doi.acm.org/10.1145/3051088

The Digital Library is published by the Association for Computing Machinery. Copyright © 2017 ACM, Inc.