Online experiments are an integral part of the design and evaluation of software infrastructure at Internet firms. To handle the growing scale and complexity of these experiments, firms have developed software frameworks for their design and deployment. Ensuring that the results of experiments in these frameworks are trustworthy—referred to as internal validity—can be difficult. Currently, verifying internal validity requires manual inspection by someone with substantial expertise in experimental design.
We present the first approach for checking the internal validity of online experiments statically, that is, from code alone. We identify well-known problems that arise in experimental design and causal inference, which can take on unusual forms when expressed as computer programs: failures of randomization and treatment assignment, and causal sufficiency errors. Our analyses target PLANOUT, a popular framework that features a domain-specific language (DSL) to specify and run complex experiments. We have built PLANALYZER, a tool that checks PLANOUT programs for threats to internal validity, before automatically generating important data for the statistical analyses of a large class of experimental designs. We demonstrate PLANALYZER'S utility on a corpus of PLANOUT scripts deployed in production at Facebook, and we evaluate its ability to identify threats on a mutated subset of this corpus. PLANALYZER has both precision and recall of 92% on the mutated corpus, and 82% of the contrasts it generates match hand-specified data.
Many organizations conduct online experiments to assist decision-making.3,13,21,22 These organizations often develop software components that make designing experiments easier, or that automatically monitor experimental results. Such systems may integrate with existing infrastructure that performs tasks such as recording metrics of interest or specializing software configurations according to features of users, devices, or other experimental subjects. One popular example is Facebook's PLANOUT: a domain-specific language for experimental design.2
A script written in PLANOUT is a procedure for assigning a treatment (e.g., a piece of software under test) to a unit (e.g., a user or device whose behavior, or outcome, is being assessed). Treatments could be anything from software-defined bit rates for data transmission to the layout of a Web page. Outcomes are typically metrics of interest to the firm, which may include click-through rates, time spent on a page, or the proportion of videos watched to completion. Critically, both treatments and outcomes must be recorded in order to estimate the effect of a treatment on an outcome. By abstracting over the details of how units are assigned treatment, PLANOUT has the potential to lower the barrier to entry for those without a background in experimental design who want to try experimentation-driven development.
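To make the idea of assigning a treatment to a unit concrete, the following is a minimal Python sketch, not PLANOUT's actual implementation or syntax. It assumes a common scheme in which a unit identifier is hashed together with an experiment name to pick a treatment deterministically, and it records each assignment so that it can later be joined with outcomes. The names `assign_treatment`, `log_exposure`, and the experiment and treatment labels are all hypothetical.

```python
import hashlib
from typing import Dict, List


def assign_treatment(unit_id: str, experiment: str, choices: List[str]) -> str:
    """Deterministically map a unit to one of the treatments.

    Hypothetical helper: hashes the experiment name and unit id so the
    same unit always receives the same treatment within an experiment.
    """
    key = f"{experiment}.{unit_id}".encode("utf-8")
    digest = hashlib.sha1(key).hexdigest()
    # Use a prefix of the hex digest as an integer and reduce it to an index.
    index = int(digest[:15], 16) % len(choices)
    return choices[index]


def log_exposure(log: List[Dict[str, str]], unit_id: str, treatment: str) -> None:
    """Record which treatment a unit received (assumed in-memory log)."""
    log.append({"unit": unit_id, "treatment": treatment})


# Usage: assign a page layout (hypothetical treatments) to a few units.
layouts = ["layout_a", "layout_b"]
exposures: List[Dict[str, str]] = []
for user in ["user_1", "user_2", "user_3"]:
    layout = assign_treatment(user, "page_layout_test", layouts)
    log_exposure(exposures, user, layout)
```

Because the assignment depends only on the experiment name and the unit identifier, repeated calls for the same unit return the same treatment, and the logged exposures supply the treatment record that effect estimation requires.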