I like to follow scandals in other disciplines: the stem cell biologists' cloning fraud, physicists and their time-traveling particles. The psychologists are having their own scandal just now about extrasensory perception. Can people see into the future? Do experimental participants do better on memory tests of words which they are just about to practice memorising? Is there such a thing as a precognitive response to erotic stimuli? Daryl Bem and his nine experiments say "yes" (Bem, 2011). The other psychologists, Occam's razor, and I say "no." Bem exhorts his colleagues to follow the White Queen in Through the Looking-Glass and believe six impossible things before breakfast (Bem, 2011). His colleagues prefer to avoid vanishing down the rabbit hole (Ritchie et al., 2012).
As well as offering sheer entertainment value, such debates in other fields are worth studying because they can help us examine the practices in our own field. Let's have a look at what the subfields of computer science which use experimental methods (such as my own research in educational technology and HCI) could learn from the great Bem vs. rationality debate. Are we in danger of falling down the rabbit hole of bad science too?
Don't Torture the Data
Wagenmakers et al. (2011) make the distinction between exploratory and confirmatory studies. The point of an exploratory study is to develop new theory, so you may feel free to prod and poke your data with whatever statistical tests you most enjoy. As Bem himself (rather inadvisably in my view) wrote previously, "If you see dim traces of interesting patterns, try to reorganize the data to bring them into bolder relief. If there are participants you don't like, or trials, observers, or interviewers who gave you anomalous results, place them aside temporarily and see if any coherent patterns emerge. Go on a fishing expedition for something — anything — interesting." (Bem, 2000). But if you are trying to run a confirmatory study, the fun stops and you have to be all grown up with your data analysis. No more hiding participants under the carpet. No more running post-hoc subgroup analyses until you get some — any! — result. The purpose of a confirmatory study is hypothesis testing: searching for evidence to support a theory which you have perhaps developed through an exploratory study. Here you're meant to declare in advance what your hypothesis is (given some theoretical justification), the direction and size of effect you expect, a stopping rule for data collection, and what tests you will conduct.
In the fields I work in, researchers often develop a novel system and then run a small exploratory user evaluation. That's perfectly reasonable. What's not reasonable is presenting the results of exploratory analysis as if they were confirmatory.
Bem's nudey picture study is a good example of what not to do. He used exploratory data analysis approaches, but then claimed results as if it were a confirmatory study. He tested for precognitive responses not only to erotic pictures, but also to positive and negative pictures. But he only got significant results for the erotic ones, so he pounced on them and breathlessly reported them. Similarly, he checked for gender differences without any pre-existing evidence or theory to suggest that there should be a difference, or why. This flaw is close to home. HCI/educational technology people check for gender differences for no apparent reason all the time. I have done it myself in the past, but I swear to reform from this moment forth.
The more statistical tests you perform on your poor tortured data, the more likely it is that you will find a significant effect by chance alone. If you perform 20 tests, and you have an alpha level of .05, you're going to find something, right? You're meant to correct the p-value when conducting multiple tests because of this. Bem didn't. HCI researchers are not perfect in this regard either (Cairns, 2007).
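To put a rough number on the 20-tests point, here is a minimal Python sketch. It assumes the tests are independent, which real analyses of a single data set often are not, and it uses the Bonferroni correction as one common way of adjusting for multiple tests. The function names and the example p-values are made up for illustration; none of this comes from Bem's paper or from the critiques cited here.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests independent
    tests, each run at significance level alpha, when every null
    hypothesis is actually true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test threshold under the Bonferroni correction."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values survive a Bonferroni-corrected threshold."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [p < threshold for p in p_values]


if __name__ == "__main__":
    # 20 uncorrected tests at alpha = .05: roughly a 64% chance of a fluke.
    print(round(familywise_error_rate(0.05, 20), 3))
    # With the corrected per-test threshold the overall rate stays below .05.
    print(round(familywise_error_rate(bonferroni_alpha(0.05, 20), 20), 3))
    # Hypothetical p-values from three tests: only the first one survives.
    print(bonferroni_significant([0.001, 0.03, 0.2]))
```

Under these assumptions, "you're going to find something" translates to roughly a two-in-three chance of at least one spurious result. Bonferroni is deliberately conservative; less strict corrections exist, but the principle is the same.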
The reason the Bem story is in the news again is that another team of psychologists has published a paper about their attempts to replicate Bem's study (Ritchie et al., 2012). Guess what? They didn't replicate his results, so if you were counting on precognitive arousal being real, you will be disappointed. There are five registered replications ongoing, which just goes to show that the psychologists are serious about showing this media-friendly nonsense to be rubbish. Interestingly, Ritchie's replication paper was rejected by the journal which published Bem's original article on the grounds that they don't publish replications at all. My first thought was "That's really silly. Surely journals should be supporting the advancement of science? Of course they should publish replications. What are these psychologists thinking?" Then it occurred to me that replications in HCI or educational technology journals are extremely rare too. In fact, this was the subject of a paper and panel session at CHI 2011 (Wilson et al., 2011). The authors observed that "As a community we are not encouraged to first replicate and then extend. Instead, we are encouraged to differentiate into novel spaces, design novel interfaces, and run novel experiments that create new insights into human behavior." If you think of human-computer interaction as an artistic pursuit, replication doesn't matter. But if you think of it as a scientific or even an engineering enterprise, then it does matter and we should be doing it more. If our goal is to design systems which are socially useful (for example in education or healthcare), then we should be sure that they actually work with their intended user groups. That means we can't stop at the first exploratory study. It means that we need to go on to run a confirmatory study, open source the system, and encourage other teams to replicate. And that in turn has implications for the way our research is funded and the sorts of articles journals will accept.
Bem, D. J. (2000). Writing an empirical article. In R. J. Sternberg (Ed.), Guide to publishing in psychology journals (pp. 3–16). Cambridge: Cambridge University Press.
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407–425. doi:10.1037/a0021524
Cairns, P. (2007). HCI... not as it should be: inferential statistics in HCI research. Proceedings of the 21st British HCI Group Annual Conference on People and Computers: HCI... but not as we know it-Volume 1 (pp. 195–201). British Computer Society. Retrieved from http://portal.acm.org/citation.cfm?id=1531321
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: the case of psi: comment on Bem (2011). Journal of Personality and Social Psychology, 100(3), 426–432. doi:10.1037/a0022790
Wilson, M. L., Mackay, W., Chi, E., Bernstein, M., Russell, D., & Thimbleby, H. (2011). RepliCHI - CHI should be replicating and validating results more: discuss. Proceedings of the 2011 annual conference extended abstracts on Human factors in computing systems (CHI EA '11) (pp. 463–466). New York, NY: ACM. doi:10.1145/1979482.1979491 http://doi.acm.org/10.1145/1979482.1979491
Ritchie, S. J., Wiseman, R., & French, C. C. (2012). Failing the Future: Three Unsuccessful Attempts to Replicate Bem's 'Retroactive Facilitation of Recall' Effect. PLoS ONE, 7(3), e33423. doi:10.1371/journal.pone.0033423
I have not read the first phrase and I feel the need of writing this comment: the physicists' time-travelling particles are definitely the opposite of bad science and scandal. They found results that didn't match the accepted theories and asked their colleagues to check them and try to reproduce them. And if any scandal happened, it was watching the media trying to explain that as if it were the infidelity of some VIP instead of the normal process of peer review.
"Interestingly, Ritchie's replication paper was rejected by the journal which published Bem's original article on the grounds that they don't publish replications at all."
Actually it is far worse than that. In most branches of academic science (as opposed to the sorts that get turned into engineering) funding bodies won't give you a grant if you want to replicate someone else's work. If you want to try to replicate it, then you have to write a grant request with a two-part deliverable where part 1 replicates the other work and part 2 builds on that replication (or lack thereof). Needless to say, there are plenty of cases where part 2 gets funded and part 1 doesn't.
I heartily disagree with the notion here that significance testing is the moment of truth. Much research is done in this fashion although it is utterly inappropriate.
When something is statistically significant, this only tells you about the signal-to-noise ratio, and nothing about the relevance of the effect. One risks discarding high-impact variables just because there was strong measurement error.
Or: A poor signal-to-noise ratio happens due to strong variation in the population of users. Differences between users are something an HCI researcher should aim to understand, not discard as noise (or nuisance).
In turn, nothing is easier than making something significant: just increase the sample size. Or: Do 20 studies and publish the one that jumped over the 5%.
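The sample-size point can be illustrated with a quick calculation. This is a minimal sketch under simplifying assumptions: a two-sided one-sample z-test with known unit variance, and an observed standardized effect that stays fixed at a tiny 0.05 while only the number of participants changes. The function name and the numbers are hypothetical.

```python
import math


def two_sided_p_value(effect_size: float, n: int) -> float:
    """Two-sided p-value of a one-sample z-test when the observed
    standardized mean difference equals effect_size and the population
    standard deviation is assumed known and equal to 1."""
    z = effect_size * math.sqrt(n)
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Same tiny effect, growing sample: the p-value shrinks towards zero.
    for n in (100, 1000, 10000):
        print(n, two_sided_p_value(0.05, n))
```

The effect is equally small in every row; only the p-value changes, which is why a significant result on its own says little about how much the effect matters.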
In applied research the main questions are the impact of factors ("To what _extent_ does aesthetic design make users happy?") and solid prediction. Null hypothesis testing is incompatible with these aims.
I fully agree on the issue of replication. This is a much more powerful way of showing that some effect is beyond chance, and that it occurs in different situations.