Sunday, February 19, 2017

Interpreting Bayesian posterior distribution of a parameter: Is density meaningful?

Background: Suppose a researcher is interested in the Bayesian posterior distribution of a parameter, because the parameter is directly meaningful in the research domain. This occurs, for example, in psychometrics. Specifically, in item response theory (IRT; for details and an example of Bayesian IRT see this blog post), the data from many test questions (i.e., the items) and many people yield estimates of the difficulties \(\delta_i\) and discriminations \(\gamma_i\) of the items along with the abilities \(\alpha_p\) of the people. That is, the item difficulty is a parameter \(\delta_i\), and the analyst is specifically interested in the magnitude and uncertainty of each item's difficulty. The same is true for the other parameters: the analyst is specifically interested in the magnitude and uncertainty of the discrimination \(\gamma_i\) for every item and of the ability \(\alpha_p\) for every person.

The question: How should the posterior distribution of a meaningful parameter be summarized? We want a number that represents the central tendency of the (posterior) distribution, and numbers that indicate the uncertainty of the distribution. There are two options I'm considering, one based on densities, the other based on percentiles.

Densities. One way of conveying a summary of the posterior distribution is in terms of densities. This seems to be the most intuitive summary, as it directly answers the natural questions from the researcher:
  • Question: Based on the data, what is the most credible parameter value? Answer: The modal (highest density) value. For example, we ask: Based on the data, what is the most credible value for this item's difficulty \(\delta_i\)? Answer: The mode of the posterior is 64.5.
  • Question: Based on the data, what is the range of the 95% (say) most credible values? Answer: The 95% highest density interval (HDI). For example, we ask: Based on the data, what is the range of the 95% most credible values of \(\delta_i\)? Answer: 51.5 to 75.6.
Percentiles. A different way of conveying a summary of the posterior distribution is in terms of percentiles. The central tendency is reported as the 50th percentile (i.e., the median), and the range of uncertainty (that covers 95% of the distribution) is the equal-tailed credible interval from the 2.5 %ile to the 97.5 %ile. When using percentiles, densities are irrelevant, and the shape of the distribution is irrelevant.
An illustration from DBDA2E showing how highest-density intervals and equal-tailed intervals (based on percentiles) are not necessarily equivalent.
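
To make the two summaries concrete, here is a minimal sketch that computes both intervals from posterior draws. The gamma-distributed "posterior" and the function names are hypothetical, chosen only because a skewed distribution makes the two intervals visibly different; the HDI routine assumes a unimodal posterior.

```python
import random

def equal_tailed_interval(samples, mass=0.95):
    """Interval between the lower and upper tail percentiles (2.5 and 97.5 for mass=0.95)."""
    xs = sorted(samples)
    n = len(xs)
    tail = (1.0 - mass) / 2.0
    lo = xs[int(tail * n)]
    hi = xs[min(n - 1, int((1.0 - tail) * n))]
    return lo, hi

def hdi(samples, mass=0.95):
    """Narrowest interval spanning `mass` of the sorted draws (assumes one mode)."""
    xs = sorted(samples)
    n = len(xs)
    k = int(mass * n)  # number of index steps the interval must span
    # Slide a window of k steps across the sorted draws and keep the narrowest one.
    start = min(range(n - k), key=lambda i: xs[i + k] - xs[i])
    return xs[start], xs[start + k]

random.seed(1)
# Hypothetical right-skewed posterior: gamma with shape 2 and scale 1.
draws = [random.gammavariate(2.0, 1.0) for _ in range(20000)]

eti_lo, eti_hi = equal_tailed_interval(draws)
hdi_lo, hdi_hi = hdi(draws)
print(f"95% equal-tailed interval: {eti_lo:.2f} to {eti_hi:.2f}")
print(f"95% HDI:                   {hdi_lo:.2f} to {hdi_hi:.2f}")
```

For a right-skewed distribution like this one, the HDI sits further left and is narrower than the equal-tailed interval, because it trades low-density values in the long right tail for higher-density values near zero.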

Some pros and cons:

Density answers what the researcher wants to know: What is the most credible value of the parameter, and what is the range of the credible (i.e., high density) values? Those questions simply are not answered by percentiles. On the other hand, density is not invariant under non-linear (but monotonic) transformations of the parameters. By squeezing or stretching different regions of the parameter, the densities can change dramatically, so the mode and HDI on the transformed scale need not be the transformed mode and HDI, whereas the percentiles simply map to the corresponding values on the transformed scale. This lack of transformation invariance is the key reason that analysts avoid using densities in abstract, generic models and derivations.
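
The invariance point can be checked with a small sketch. I assume, purely for illustration, a posterior that is normal on a log scale \(\phi\), so that \(\theta = e^{\phi}\) is log-normal; the values of `mu` and `sigma` are made up.

```python
import math
import random
import statistics

mu, sigma = 0.0, 0.75  # hypothetical posterior parameters on the log scale

def normal_pdf(x, m, s):
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def lognormal_pdf(t, m, s):
    # Change of variables: p(theta) = p(phi) * |d phi / d theta| = p(phi) / theta.
    return normal_pdf(math.log(t), m, s) / t

# Modes: find the highest-density grid point separately on each scale.
phi_grid = [-4.0 + 0.001 * i for i in range(8001)]
theta_grid = [0.001 * i for i in range(1, 20001)]
mode_phi = max(phi_grid, key=lambda x: normal_pdf(x, mu, sigma))
mode_theta = max(theta_grid, key=lambda t: lognormal_pdf(t, mu, sigma))

# Medians: computed from the same draws expressed on each scale.
random.seed(2)
phi_draws = [random.gauss(mu, sigma) for _ in range(50001)]
theta_draws = [math.exp(v) for v in phi_draws]
median_phi = statistics.median(phi_draws)
median_theta = statistics.median(theta_draws)

print("exp(mode of phi)   =", round(math.exp(mode_phi), 3))
print("mode of theta      =", round(mode_theta, 3))
print("exp(median of phi) =", round(math.exp(median_phi), 3))
print("median of theta    =", round(median_theta, 3))
```

The median on the \(\theta\) scale is just the exponentiated median of \(\phi\), but the mode on the \(\theta\) scale sits near \(e^{\mu - \sigma^2}\), well below the exponentiated mode of \(\phi\).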

But in applications where the parameters have meaningful interpretations, I don't think researchers are satisfied with percentiles. If you told a researcher, "Well, we cannot tell you what the most probable parameter value is, all we can tell you is the median (50 %ile)," I don't think the researcher would be satisfied. If you told the researcher, "We can tell you that 30% of the posterior falls below this 30th %ile, but we cannot tell you whether values below the 30th %ile have lower or higher probability density than values above the 30th %ile," I don't think the researcher would be satisfied. Lots of parameters in traditional psychometric models have meaningful scales (and aren't arbitrarily non-linearly transformed). Lots of parameters in conventional models have scales that directly map onto the data scales, for example the mean and standard deviation of a normal model (and the data scales are usually conventional and aren't arbitrarily non-linearly transformed). And in spatial or temporal models, many parameters directly correspond to space and time, which (in most terrestrial applications) are not non-linearly transformed.

Decision theory to the rescue? I know there is not a uniquely "correct" answer to this question. I suspect that the pros and cons could be formalized as cost functions in formal decision theory, and then an answer would emerge depending on the utilities assigned to density and transformation invariance. If the cost function depends on densities, then mode and HDI would emerge as the better basis for decisions. If the cost function depends on transformation invariance, then median and equal-tail interval would emerge as the better basis for decisions.
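
As one small, standard illustration of how a loss function selects a point summary (not a full formalization of the trade-off above): expected absolute-error loss is minimized at the median, while a zero-one loss with a narrow tolerance is minimized near the mode. The sketch below checks this numerically on a hypothetical skewed posterior; the grid, the tolerance, and the gamma shape are all assumptions for the demonstration.

```python
import math

# Hypothetical skewed posterior on a grid: gamma(shape 2, scale 1), density x * exp(-x).
step = 0.01
grid = [step * i for i in range(1, 2001)]  # 0.01, 0.02, ..., 20.00
weights = [x * math.exp(-x) for x in grid]
total = sum(weights)
probs = [w / total for w in weights]

def expected_loss(estimate, loss):
    """Posterior expected loss of reporting `estimate`."""
    return sum(p * loss(estimate, x) for p, x in zip(probs, grid))

def absolute_loss(estimate, x):
    return abs(estimate - x)

def zero_one_loss(estimate, x, tolerance=0.105):
    # No cost if the estimate is within the tolerance of the true value, full cost otherwise.
    return 0.0 if abs(estimate - x) <= tolerance else 1.0

candidates = [0.05 * i for i in range(1, 101)]  # candidate point estimates 0.05 .. 5.00
best_abs = min(candidates, key=lambda c: expected_loss(c, absolute_loss))
best_01 = min(candidates, key=lambda c: expected_loss(c, zero_one_loss))

print("best under absolute loss:", round(best_abs, 2))  # close to the median, about 1.68
print("best under zero-one loss:", round(best_01, 2))   # close to the mode, 1.0
```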

What do you think?

Thursday, February 16, 2017

Equivalence testing (two one-sided test) and NHST compared with HDI and ROPE

In this blog post I show that frequentist equivalence testing (using the procedure of two one-sided tests: TOST) combined with null hypothesis significance testing (NHST) can produce conflicting decisions for the same parameter value; that is, TOST can accept a value while NHST rejects the same value. The Bayesian procedure using highest density interval (HDI) with region of practical equivalence (ROPE) never produces that conflict.

The Bayesian HDI+ROPE decision rule.

For a review of the HDI+ROPE decision rule, see this blog post and specifically this picture in that blog post. To summarize:
  • A parameter value is rejected when its ROPE falls entirely outside the (95%) HDI. To "reject" a parameter value merely means that all the most credible parameter values are not practically equivalent to the rejected value. For a parameter value to be "rejected", it is not merely outside the HDI!
  • A parameter value is accepted when its ROPE completely contains the (95%) HDI. To "accept" a parameter value merely means that all the most credible parameter values are practically equivalent to the accepted value. For a parameter value to be "accepted", it is not merely inside the HDI! In fact, parameter values can be "accepted" that are outside the HDI, because reject or accept depends on the ROPE.
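
The two rules above can be sketched as a single function. The function name, the argument names, and the treatment of exact boundary ties are my own illustrative choices, and the ROPE is assumed symmetric around the candidate value.

```python
def hdi_rope_decision(value, hdi_low, hdi_high, rope_half_width=0.1):
    """Decision about a candidate parameter `value` under the HDI+ROPE rule."""
    rope_low = value - rope_half_width
    rope_high = value + rope_half_width
    # Reject: the ROPE around the value falls entirely outside the HDI.
    if rope_high < hdi_low or rope_low > hdi_high:
        return "reject"
    # Accept: the ROPE around the value completely contains the HDI.
    if rope_low <= hdi_low and hdi_high <= rope_high:
        return "accept"
    return "undecided"

# Hypothetical narrow HDI from -0.02 to 0.02.
print(hdi_rope_decision(0.00, -0.02, 0.02))  # accept
print(hdi_rope_decision(0.05, -0.02, 0.02))  # accept, although 0.05 is outside the HDI
print(hdi_rope_decision(0.11, -0.02, 0.02))  # undecided: ROPE overlaps the HDI without containing it
print(hdi_rope_decision(0.50, -0.02, 0.02))  # reject
```

Because "accept" requires the ROPE to contain the HDI and "reject" requires the ROPE to miss the HDI entirely, no value can satisfy both conditions.
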
The frequentist TOST procedure for equivalence testing.

In the frequentist TOST procedure, the analyst sets up a ROPE, and does a one-sided test for being below the high limit of the ROPE and another one-sided test for being above the low limit of the ROPE. If both limits are rejected, the parameter value is "accepted". The TOST is the same as checking that the 90% (not 95%) confidence interval falls inside the ROPE. The TOST procedure is used to decide on equivalence to a ROPE'd parameter value.

To reject a parameter value, the frequentist uses good ol' NHST. In other words, if the parameter value falls outside the 95% CI, it is rejected.
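
Here is a minimal sketch of the combined TOST+NHST procedure, phrased in terms of confidence intervals as described above. It assumes a symmetric 95% CI and derives the 90% CI by shrinking the 95% CI by a fixed ratio; the function name, the argument names, and the explicit "conflict" label are hypothetical.

```python
def tost_nhst_decision(value, ci95_low, ci95_high,
                       rope_half_width=0.1, ratio_90_to_95=0.83):
    """Classify a candidate parameter `value` using TOST (accept) and NHST (reject)."""
    center = (ci95_low + ci95_high) / 2.0
    half90 = ratio_90_to_95 * (ci95_high - ci95_low) / 2.0
    ci90_low, ci90_high = center - half90, center + half90
    # TOST accepts when the 90% CI falls inside the ROPE around the value.
    tost_accepts = (value - rope_half_width <= ci90_low
                    and ci90_high <= value + rope_half_width)
    # NHST rejects when the value falls outside the 95% CI.
    nhst_rejects = value < ci95_low or value > ci95_high
    if tost_accepts and nhst_rejects:
        return "conflict"
    if tost_accepts:
        return "accept"
    if nhst_rejects:
        return "reject"
    return "undecided"

# Hypothetical narrow 95% CI from -0.02 to 0.02.
print(tost_nhst_decision(0.00, -0.02, 0.02))  # accept
print(tost_nhst_decision(0.05, -0.02, 0.02))  # conflict: passes TOST, rejected by NHST
print(tost_nhst_decision(0.50, -0.02, 0.02))  # reject
```

The two tests are evaluated independently, so nothing prevents the same value from being accepted by one and rejected by the other.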

Examples comparing TOST+NHST with HDI+ROPE.

Examples below show the ranges of parameter values rejected, accepted, undecided, or conflicting (both rejected and accepted) by the two procedures.
  • In these cases the ROPE is symmetric around its parameter value, with ROPE limits at \(-0.1\) and \(+0.1\) from the central value. These are merely default ROPE limits we might use if the parameter value were the effect-size Cohen's \(\delta\) because \(0.1\) is half of a "small" effect size. In general, ROPE limits could be asymmetric, and should be chosen in the context of current theory and measurement abilities. The key point is that the ROPE is the same for TOST and for HDI procedures for all the examples.
  • In all the examples, the 95% HDI and the 95% CI are arbitrarily set to be equal. In general, the 95% HDI and 95% CI will not be equal, especially when the CI is corrected for multiple tests or correctly computed for stopping rules that do not assume fixed sample size. But merely for simplicity of comparison, the 95% HDI and 95% CI are arbitrarily set equal to each other. The 90% CI is set to 0.83 times the width of the 95% CI, as it would be for a normal distribution. The exact numerical value does not matter for the qualitative results.
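
As a quick check of the width ratio under a normal sampling distribution, the ratio of the 90% CI width to the 95% CI width is the ratio of the corresponding critical z values. This sketch uses only the Python standard library:

```python
from statistics import NormalDist

z_90 = NormalDist().inv_cdf(0.95)   # two-sided 90% interval leaves 5% in each tail
z_95 = NormalDist().inv_cdf(0.975)  # two-sided 95% interval leaves 2.5% in each tail
width_ratio = z_90 / z_95
print(round(width_ratio, 3))  # about 0.839
```

The value is slightly larger than the 0.83 used for the plots, which, as noted above, makes no difference to the qualitative results.
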
For each example, plots show the parameter values that would be accepted, rejected, undecided, or conflicting: both accepted and rejected. The horizontal axis is the parameter value. The horizontal black bars indicate the CI or HDI. The vertical axis indicates the different decisions.

Example 1: HDI and 90% CI are wider than the ROPE.

In the first example below, the HDI and 90% CI are wider than the ROPE. Therefore no parameter values can be accepted, because the ROPE, no matter what parameter value it is centered on, can never contain the HDI or 90% CI.
Example 1.
Notice above that NHST rejects all parameter values outside the 95% CI, whereas the HDI+ROPE procedure only rejects parameter values that are more than a half-ROPE away from the 95% HDI. All other parameter values are undecided (neither rejected nor accepted).

Example 2: HDI and 90% CI are a little narrower than the ROPE.

In the next example, below, the HDI and 90% CI are a little narrower than the ROPE. Therefore there are some parameter values whose ROPEs contain the HDI or 90% CI and are "accepted". Notice that the TOST procedure accepts a wider range of parameter values than the HDI+ROPE decision rule, because the 90% CI is narrower than the 95% HDI (which in these examples is arbitrarily set equal to the 95% CI).
Example 2.
Notice above that the TOST procedure for equivalence produces qualitatively similar results to the HDI+ROPE decision rule, but the TOST procedure accepts a wider range of parameter values because its 90% CI is narrower (i.e., less conservative) than the 95% HDI.

Example 3: HDI and 90% CI are much narrower than the ROPE.

The third example, below, has the HDI and CI considerably narrower than the ROPE. This situation might arise when there is a windfall of data, with high precision estimates but lenient tolerance for "practical equivalence". Notice that this leads to conflicting decisions for TOST and NHST: There are parameter values that are accepted by TOST but rejected by NHST. Conflicts like this cannot happen when using the HDI+ROPE decision rule.

Example 3.
Notice above that some parameter values are "accepted" for practical purposes even though they fall outside the HDI or CI. This points out the very different meaning of the discrete decision and the HDI or CI. The Bayesian HDI marks the most credible parameter values, but the green "accept" interval marks all the parameter values to which the most credible values are practically equivalent. The frequentist CI marks the parameter values that would not be rejected, but the green "accept" interval marks all the parameter values that pass the TOST procedure.
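
The three examples can be summarized by computing the endpoints of each decision region directly. The sketch below assumes a symmetric interval centered at `center` with 95% half-width `half95`; the specific half-widths (0.2, 0.08, 0.02) are hypothetical stand-ins for "wider", "a little narrower", and "much narrower" than the ROPE, not the values used in the plots.

```python
def decision_regions(center, half95, rope_half=0.1, ratio=0.83):
    """Endpoints of accept regions; a (low, high) pair with low > high is empty."""
    half90 = ratio * half95
    # TOST accepts v when [v - rope_half, v + rope_half] contains the 90% CI.
    tost_accept = (center + half90 - rope_half, center - half90 + rope_half)
    # HDI+ROPE accepts v when the ROPE around v contains the 95% HDI.
    hdi_accept = (center + half95 - rope_half, center - half95 + rope_half)
    # NHST rejects v beyond the 95% CI limit; HDI+ROPE rejects v only beyond
    # the HDI limit plus a half-ROPE.
    nhst_reject_above = center + half95
    hdi_reject_above = center + half95 + rope_half
    # Width (on each side) of values that pass TOST but are rejected by NHST.
    conflict = max(0.0, tost_accept[1] - nhst_reject_above)
    # Gap between the HDI+ROPE accept and reject regions: always 2 * half95 > 0.
    hdi_gap = hdi_reject_above - hdi_accept[1]
    return {"tost_accept": tost_accept, "hdi_accept": hdi_accept,
            "conflict_width_per_side": conflict, "hdi_gap": hdi_gap}

for label, half95 in [("Example 1", 0.2), ("Example 2", 0.08), ("Example 3", 0.02)]:
    regions = decision_regions(0.0, half95)
    print(label, {k: regions[k] for k in ("conflict_width_per_side", "hdi_gap")})
```

Under these assumptions a TOST+NHST conflict appears only when the ROPE half-width exceeds the sum of the 90% and 95% half-widths, as in the third case, whereas the HDI+ROPE accept and reject regions are always separated by a gap as wide as the HDI itself.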

The comparison illustrated here was inspired by a recent blog post by Daniel Lakens, which emphasized the similarity of results using TOST and HDI+ROPE. Here I've tried to illustrate at least one aspect of their different meanings.

Wednesday, February 8, 2017

The Bayesian New Statistics - finally published

The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective.

Abstract: In the practice of data analysis, there is a conceptual distinction between hypothesis testing, on the one hand, and estimation with quantified uncertainty on the other. Among frequentists in psychology, a shift of emphasis from hypothesis testing to estimation has been dubbed “the New Statistics” (Cumming, 2014). A second conceptual distinction is between frequentist methods and Bayesian methods. Our main goal in this article is to explain how Bayesian methods achieve the goals of the New Statistics better than frequentist methods. The article reviews frequentist and Bayesian approaches to hypothesis testing and to estimation with confidence or credible intervals. The article also describes Bayesian approaches to meta-analysis, randomized controlled trials, and power analysis.

Published in Psychonomic Bulletin & Review.
Final submitted manuscript:
Published version view-only online (displays some figures incorrectly):

Published article:

Published online: 2017-Feb-08
Corrected proofs submitted: 2017-Jan-16
Accepted: 2016-Dec-16
Revision 2 submitted: 2016-Nov-15
Editor action 2: 2016-Oct-12
Revision 1 submitted: 2016-Apr-16
Editor action 1: 2015-Aug-23

Initial submission: 2015-May-13