In criminal justice, it's common to use large data sets like the Uniform Crime Report (UCR) or versions of the National Longitudinal Survey (NLS), because the nature of certain questions doesn't lend itself well to experimentation or independent data gathering. As such, many researchers have conducted many analyses using the UCR and NLS. My question is whether p-values would need to be corrected across the entire data set. In other words, should researchers be correcting for multiple tests even when they themselves did not run the tests, because all of the tests were run on the same data?

This question gets at the core conundrum of correcting for multiple tests. Certainly if two researchers were collaborating on an analysis that they considered to be one study, and each researcher had a different (perhaps overlapping) batch of tests to run, then the p values of the overall combined set of tests should be corrected for all the tests. On the other hand, if the two researchers declared that they were not collaborating, and considered their batches of tests to be "separate," then the p values of the two batches would not be properly corrected for the overall combined set of tests. Such separation of batches of analyses invites an inflated false-alarm (Type I error) rate for tests of the data set. Thus, the appropriate practice should be that every new analysis of the data corrects for all the previous tests of the data by all previous researchers, and that all previously published analyses get updated p values to take into account subsequent tests. Right?
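Just to put a number on the inflation at issue, here is a minimal sketch (with made-up test counts, not from any real study) of what happens when two batches of tests on the same null data are corrected separately rather than jointly:

```python
# Two researchers each run m tests on the same (null) data, and each corrects
# only within her own batch via Bonferroni. The chance of at least one false
# alarm across the combined set of tests is nearly double either researcher's
# nominal family-wise rate. All counts here are illustrative assumptions.

alpha = 0.05
m = 10  # tests per researcher (hypothetical)

per_test = alpha / m  # Bonferroni threshold within one batch

# Family-wise false-alarm rate for one batch of m independent tests:
fwer_one_batch = 1 - (1 - per_test) ** m
# Combined rate across both batches (2*m tests, each still tested at alpha/m):
fwer_combined = 1 - (1 - per_test) ** (2 * m)

print(f"per-test threshold:       {per_test:.4f}")
print(f"FWER within one batch:    {fwer_one_batch:.4f}")  # ~ .049
print(f"FWER across both batches: {fwer_combined:.4f}")   # ~ .095, nearly double
```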
The puzzler above is based on the premise that corrections for multiple tests should control the error rate for any one set of data, which raises the question of how to define "one set of data." A few years ago I was reviewing a manuscript submitted to a major scientific journal. The researchers had conducted an experiment with several conditions; the theoretical motivation and procedure made it obvious that the conditions were part of one conceptualization. Moreover, participants volunteered and were assigned at random across all the various conditions; in other words, the conditions were obviously intended to be part of the same study. But the manuscript reported one subset of conditions as "Experiment 1" and the complementary subset as "Experiment 2." Why would the authors do such a strange and confusing thing when reporting the research? Because that way the corrections for multiple comparisons would only have to take into account the set of tests for the data of "Experiment 1" separately from the set of tests for the data of "Experiment 2." If the data from all the conditions were considered to be one set of data, then the correction for multiple comparisons would have to take into account all the tests, and various tests would no longer have p < .05. Ugh.
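To see how much that relabeling buys, here is a small numerical illustration (the test counts are hypothetical) of Bonferroni's per-test criterion under one family versus two:

```python
# Why splitting one study into "Experiment 1" and "Experiment 2" helps
# marginal p values survive: Bonferroni's per-test threshold is
# alpha / (family size), so halving the family doubles the threshold
# each test must beat. Counts below are assumed for illustration.

alpha = 0.05
n_total = 12  # tests if all conditions count as one study
n_split = 6   # tests per reported "experiment" after the split

print(f"one study:       p < {alpha / n_total:.4f} required")  # p < .0042
print(f"after the split: p < {alpha / n_split:.4f} required")  # p < .0083
# A test with p = .006 "survives" in the split report but not in the joint one.
```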
There's an analogous old puzzler based on the premise that corrections for multiple tests should control the error rate for any one researcher (not just for one set of data). Especially if studies conducted by the researcher are follow-ups of previous studies, are the data from the follow-ups really separate sets of data? Aren't they all really just one extended set of data from that researcher? Therefore, each researcher is allowed a lifetime false-alarm rate of, say, 5%, and the critical p value for any single test by that researcher should take into account the fact that she will be conducting hundreds, probably thousands, of tests during her research lifetime. Moreover, if you are collaborating with other researchers, be sure to choose collaborators who only rarely run significance tests, because they won't inflate your collaborative error rate as much as frequent testers would.
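Taking the lifetime premise literally makes its absurdity vivid. Here is a sketch using the Šidák correction (the lifetime test counts are of course made-up round numbers):

```python
# If a researcher will run n tests over her career and wants a 5% chance of
# even one false alarm in her lifetime, the Sidak-corrected per-test
# criterion is 1 - (1 - alpha)**(1/n).

alpha_lifetime = 0.05

for n in (100, 1_000, 10_000):
    per_test = 1 - (1 - alpha_lifetime) ** (1 / n)
    print(f"{n:>6} lifetime tests -> per-test p < {per_test:.2e}")
# e.g. 10,000 lifetime tests -> p < 5.13e-06, an absurdly strict criterion,
# which is part of why this is a puzzler rather than a policy.
```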
The general issue, of deciding what constitutes the appropriate "family" of tests to be corrected for, is a sticky problem. To define an error rate, there must be a presumed family of tests for which the error rate is being defined. There are various arguments for defining the family this way or that in any particular application. For example, when running experiments with multi-factor designs, typically each main effect and interaction is considered to be a separate family, and corrections for multiple tests need to be made only within each family, not across families. The usual argument is something like this: in an experimental design, for which the independent variables are manipulated and randomly assigned, each factor could have been left out of the design, and the tests of the remaining factors would have been conducted regardless, so each effect addresses a conceptually separate question. But that argument breaks down if the factors can be redefined as multiple levels of a single factor, and so on.
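Just to show what's at stake in that bookkeeping, here is a sketch of within-family versus across-family correction in a multi-factor design (the contrast counts are illustrative assumptions):

```python
# A 2x2x2 design yields 7 omnibus effects: 3 main effects, 3 two-way
# interactions, and 1 three-way interaction. Treating each effect's
# follow-up contrasts as its own family gives a looser criterion than
# pooling every contrast into one grand family.

alpha = 0.05
contrasts_per_effect = 4  # hypothetical follow-up contrasts per effect
n_effects = 7

within_family = alpha / contrasts_per_effect             # correct within each family
across_all = alpha / (contrasts_per_effect * n_effects)  # correct across everything

print(f"within each effect's family: p < {within_family:.4f}")  # p < .0125
print(f"across all 28 contrasts:     p < {across_all:.5f}")     # p < .00179
```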
Those are just some rambling thoughts. What do you think is the answer to Caitlin's question, "[For shared data sets], should researchers be correcting for multiple tests even when they themselves did not run the tests because all of the tests were run on the same data?"