Elsevier

NeuroImage

Volume 91, 1 May 2014, Pages 412-419
Technical Note
Cluster-extent based thresholding in fMRI analyses: Pitfalls and recommendations

https://doi.org/10.1016/j.neuroimage.2013.12.058

Highlights

  • Cluster-extent based thresholding is popular because of its high sensitivity.

  • However, cluster-extent based thresholding has several important problems.

  • One pitfall is low spatial specificity when significant clusters are large.

  • Another pitfall is increased false positives when a liberal primary threshold is used.

  • We recommend using stringent primary thresholds and augmented reporting procedures.

Abstract

Cluster-extent based thresholding is currently the most popular method for multiple comparisons correction of statistical maps in neuroimaging studies, due to its high sensitivity to weak and diffuse signals. However, cluster-extent based thresholding provides low spatial specificity; researchers can only infer that there is signal somewhere within a significant cluster and cannot make inferences about the statistical significance of specific locations within the cluster. This poses a particular problem when one uses a liberal cluster-defining primary threshold (i.e., higher p-values), which often produces large clusters spanning multiple anatomical regions. In such cases, it is impossible to reliably infer which anatomical regions show true effects. From a survey of 814 functional magnetic resonance imaging (fMRI) studies published in 2010 and 2011, we show that the use of liberal primary thresholds (e.g., p < .01) is endemic, and that the largest determinant of the primary threshold level is the default option in the software used. We illustrate the problems with liberal primary thresholds using an fMRI dataset from our laboratory (N = 33), and present simulations demonstrating the detrimental effects of liberal primary thresholds on false positives, localization, and interpretation of fMRI findings. To avoid these pitfalls, we recommend several analysis and reporting procedures, including 1) setting primary p < .001 as a default lower limit; 2) using more stringent primary thresholds or voxel-wise correction methods for highly powered studies; and 3) adopting reporting practices that make the level of spatial precision transparent to readers. We also suggest alternative and supplementary analysis methods.

Introduction

Recent advances in the statistical analysis of functional magnetic resonance imaging (fMRI) data have improved the ability of researchers to make meaningful inferences about task-related brain activation. Most statistical analyses of fMRI data are mass univariate approaches, with inferences at a voxel or cluster (of voxels) level. Typical fMRI analyses include more than 80,000 voxels, resulting in numerous statistical tests, which must be appropriately corrected for multiple comparisons (Bennett et al., 2009; Friston et al., 1994; Genovese et al., 2002; Nichols, 2012; Nichols and Hayasaka, 2003; Nichols and Holmes, 2002).
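To give a sense of scale, the following minimal sketch computes what uncorrected testing across 80,000 voxels would imply. It assumes independent tests, which smoothed fMRI voxels are not, so the numbers are only a rough guide. The function names and the alpha levels are illustrative choices, not taken from the paper.

```python
def familywise_error_rate(alpha, n_tests):
    """P(at least one false positive) across n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test level that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


n_voxels = 80_000  # order of magnitude quoted in the text

# Uncorrected voxel-wise p < .001: expected number of false-positive voxels
# under the global null, and the chance of at least one.
expected_false_positives = 0.001 * n_voxels
fwer_uncorrected = familywise_error_rate(0.001, n_voxels)

# Bonferroni-corrected per-voxel level for a family-wise alpha of .05.
per_voxel_alpha = bonferroni_threshold(0.05, n_voxels)

print(f"expected false positives at p < .001: {expected_false_positives:.0f}")
print(f"uncorrected FWER at p < .001: {fwer_uncorrected:.6f}")
print(f"Bonferroni per-voxel alpha: {per_voxel_alpha:.2e}")
```

Under these assumptions an uncorrected map is almost guaranteed to contain false positives, while the Bonferroni level is below one in a million per voxel, which is the stringency that cluster-based methods try to relax.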

Among the many approaches to dealing with multiple comparisons, cluster-extent based thresholding has become the most popular (Fig. 1A; Friston et al., 1994; also see Carp, 2012). This approach detects statistically significant clusters on the basis of the number of contiguous voxels whose voxel-wise statistic values lie above a pre-determined primary threshold. Tests for statistical significance do not control the estimated false positive probability of each voxel in the contiguous region, but instead control the estimated false positive probability of the region as a whole. Cluster-extent based thresholding generally consists of two stages (Friston et al., 1994; Hayasaka and Nichols, 2003). First, an arbitrary voxel-level primary threshold defines clusters by retaining groups of suprathreshold voxels. Second, a cluster-level extent threshold, measured in units of contiguous voxels (k), is determined based on the estimated distribution of cluster sizes under the null hypothesis of no activation in any voxel in that cluster. The cluster-level extent threshold that controls the family-wise error rate (FWER) can be obtained from the sampling distribution of the largest null hypothesis cluster size among suprathreshold voxels within the search area (e.g., the brain). The sampling distribution of the largest null cluster size under the global null hypothesis of no signal is typically estimated using theoretical methods (e.g., random field theory [RFT]; Worsley et al., 1992), Monte Carlo simulation (Forman et al., 1995), or nonparametric methods (Nichols and Holmes, 2002).
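The two stages can be sketched with a small Monte Carlo simulation. This is a toy illustration under stated assumptions, not the procedure of any particular software package. It uses a 2D grid instead of a brain volume, Gaussian white noise smoothed with a separable kernel, 4-connectivity, and only 100 simulations (real analyses use thousands). All function names and parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist


def gaussian_kernel(sigma, radius):
    """1D Gaussian weights scaled so smoothed white noise keeps unit variance."""
    weights = [math.exp(-0.5 * (d / sigma) ** 2) for d in range(-radius, radius + 1)]
    scale = math.sqrt(sum(w * w for w in weights))
    return [w / scale for w in weights]


def smooth(field, kernel):
    """Separable smoothing of a 2D list, with wrap-around edges."""
    n_rows, n_cols, r = len(field), len(field[0]), len(kernel) // 2
    taps = list(enumerate(kernel))
    rows = [[sum(w * row[(j + k - r) % n_cols] for k, w in taps) for j in range(n_cols)]
            for row in field]
    return [[sum(w * rows[(i + k - r) % n_rows][j] for k, w in taps) for j in range(n_cols)]
            for i in range(n_rows)]


def cluster_sizes(mask):
    """Sizes of 4-connected groups of True cells (stage 1: define clusters)."""
    n_rows, n_cols = len(mask), len(mask[0])
    seen = [[False] * n_cols for _ in range(n_rows)]
    sizes = []
    for i in range(n_rows):
        for j in range(n_cols):
            if not mask[i][j] or seen[i][j]:
                continue
            stack, size = [(i, j)], 0
            seen[i][j] = True
            while stack:
                a, b = stack.pop()
                size += 1
                for x, y in ((a + 1, b), (a - 1, b), (a, b + 1), (a, b - 1)):
                    if 0 <= x < n_rows and 0 <= y < n_cols and mask[x][y] and not seen[x][y]:
                        seen[x][y] = True
                        stack.append((x, y))
            sizes.append(size)
    return sizes


def null_max_cluster_sizes(n_sims, size, sigma, primary_p, seed=0):
    """Largest suprathreshold cluster in each simulated pure-noise map."""
    rng = random.Random(seed)
    z_threshold = NormalDist().inv_cdf(1.0 - primary_p)  # one-sided primary threshold
    kernel = gaussian_kernel(sigma, radius=max(1, int(3 * sigma)))
    maxima = []
    for _ in range(n_sims):
        noise = [[rng.gauss(0.0, 1.0) for _ in range(size)] for _ in range(size)]
        z_map = smooth(noise, kernel)
        mask = [[z > z_threshold for z in row] for row in z_map]
        maxima.append(max(cluster_sizes(mask), default=0))
    return maxima


def extent_threshold(maxima, fwer=0.05):
    """Stage 2: smallest k with P(largest null cluster >= k) <= fwer."""
    k = 1
    while sum(m >= k for m in maxima) / len(maxima) > fwer:
        k += 1
    return k


k_liberal = extent_threshold(null_max_cluster_sizes(100, 24, 1.5, primary_p=0.01))
k_strict = extent_threshold(null_max_cluster_sizes(100, 24, 1.5, primary_p=0.001))
print(f"extent threshold at primary p < .01: k >= {k_liberal}")
print(f"extent threshold at primary p < .001: k >= {k_strict}")
```

Because both runs reuse the same seed, every cluster found at the stricter primary threshold lies inside a cluster found at the liberal one, so the liberal threshold can never yield the smaller extent cutoff in this sketch. The general pattern is that a liberal primary threshold demands larger clusters, which is what makes the surviving clusters spatially coarse.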

Cluster-extent based thresholding has certain advantages. First, voxel-level corrections for multiple comparisons, such as the Bonferroni and RFT-based corrections, are so stringent that they can dramatically increase Type II errors (i.e., reduce sensitivity) unless sample sizes are extremely large (Nichols and Hayasaka, 2003). By contrast, cluster-extent based thresholding has relatively high sensitivity (Friston et al., 1994; Smith and Nichols, 2009). Second, cluster-extent based thresholding accounts for the fact that individual voxel activations are not independent of the activations of their neighboring voxels, especially when the data are spatially smoothed (Friston, 2000; Heller et al., 2006; Wager et al., 2007).

Despite these strengths, cluster-extent based thresholding also has a key limitation: low spatial specificity when clusters are large (Friston et al., 1994; Nichols, 2012). The cluster-level p-value does not determine the statistical significance of activation at a specific location or voxel(s) within the cluster. Rather, it describes the probability of obtaining a cluster of a given size or greater under the null hypothesis. The logical alternative when this sharp null is rejected is a diffuse family of alternatives: At least some signal must be present somewhere in the cluster. Therefore, the larger the clusters become, the less spatially specific the inference. Though this limitation is widely known, we believe its practical implications have been largely overlooked.
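A minimal sketch of what a cluster-level p-value measures, assuming an empirical null distribution of maximum cluster sizes is already available. The list of null values below is invented purely for illustration, and the function name is hypothetical.

```python
def cluster_p_value(observed_size, null_max_sizes):
    """Share of null maps whose largest cluster is at least as large as the observed one."""
    exceed = sum(1 for m in null_max_sizes if m >= observed_size)
    return exceed / len(null_max_sizes)


# Toy null distribution of maximum cluster sizes (invented numbers, illustration only).
toy_null_max = [12, 30, 45, 18, 22, 60, 15, 27, 33, 41,
                19, 25, 38, 52, 14, 29, 36, 48, 21, 17]

p_small_cluster = cluster_p_value(20, toy_null_max)  # many null maps have a cluster this big
p_large_cluster = cluster_p_value(55, toy_null_max)  # only one null map does

print(p_small_cluster, p_large_cluster)
```

The function takes only the cluster's size as input. Nothing about individual voxels enters the calculation, which is why the resulting p-value supports a claim about the cluster as a whole and not about any location inside it.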

If cluster sizes are small enough and lie within a single anatomical area of interest, cluster-extent based inferences are reasonably specific. However, if a liberal (i.e., higher p-values) primary voxel-level threshold (e.g., p < .01) is selected to define clusters, clusters that survive a cluster-extent based threshold for a FWER correction often become large enough to cross anatomical boundaries, particularly in the presence of spatially correlated physiological noise. It is tempting to set a liberal primary threshold in small, underpowered studies, because with more liberal primary thresholds, significant clusters are larger and thus appear more robust and substantial. However, a liberal primary threshold poses a disadvantage in the spatial specificity of claims that can be made. Here, we argue that the use of liberal primary thresholds is both endemic and detrimental to the neuroimaging field.

There are two distinct problems with setting a liberal primary threshold and accepting the reduction in spatial specificity that it entails. First, liberal primary thresholds render the relatively high spatial resolution of fMRI useless, and if significant clusters cross multiple anatomical boundaries, the results yield little useful neuroscientific information. Findings of “activity in the insula or the striatum” are not useful in building a cumulative understanding of human brain function. The second, and more pernicious, problem is that results are displayed as colored maps of voxels that pass the primary threshold, with only large-enough clusters retained. These maps invite readers (and authors) to mistakenly believe that significant results are found in all the voxels and all the anatomical regions depicted as ‘significant’ in figures. In fact, if a single cluster covers two anatomical regions, the authors cannot in good faith discuss findings in relation to either anatomical region, although this is common practice.
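One way to make this loss of specificity visible is to tabulate how a significant cluster is distributed over anatomical regions. The sketch below assumes an atlas given as a mapping from voxel index to region label. The ten-voxel atlas, the cluster, and the function name are hypothetical, and the region names simply echo the example above.

```python
from collections import Counter


def regions_spanned(cluster_voxels, atlas):
    """Fraction of a cluster's voxels that fall in each atlas region, largest first."""
    counts = Counter(atlas[v] for v in cluster_voxels)
    total = sum(counts.values())
    return {region: n / total for region, n in counts.most_common()}


# Toy atlas on a strip of ten voxels: 0-4 are "insula", 5-9 are "striatum".
toy_atlas = {i: ("insula" if i < 5 else "striatum") for i in range(10)}

# One contiguous significant cluster that crosses the boundary.
cluster = [3, 4, 5, 6, 7]

coverage = regions_spanned(cluster, toy_atlas)
print(coverage)  # {'striatum': 0.6, 'insula': 0.4}
```

A cluster-level test on this toy cluster would license only the statement that signal is present somewhere in the union of the two regions. Reporting the coverage table alongside the map makes that limit explicit to readers.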

In addition to the standard cluster-extent based thresholding methods we discuss extensively here, several recent alternatives have been proposed, including the threshold-free cluster enhancement (TFCE) method (Smith and Nichols, 2009) and hierarchical false discovery rate (FDR) control on clusters (Benjamini and Heller, 2007). TFCE eliminates the need for setting an arbitrary cluster-defining primary threshold by combining voxel-wise statistics with local spatial support underneath the voxel. However, TFCE is also subject to the same limitations of low spatial specificity when significant clusters are large. Benjamini and Heller's (2007) hierarchical FDR method tests clusters first, and then trims locations with no signal within each significant cluster. However, this method heavily depends on a priori information about the data, such as pre-defined clusters or weights, which is generally unavailable in practice.
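To illustrate how TFCE combines a voxel's statistic with the spatial support beneath it, here is a simplified one-dimensional sketch. It assumes the usual formulation from Smith and Nichols (2009), in which each voxel's score is the sum over heights h of extent(h)^E * h^H * dh, with the commonly used defaults E = 0.5 and H = 2. The step size, the toy data, and the function name are illustrative assumptions. Real implementations work on 3D volumes and obtain p-values by permutation.

```python
def tfce_1d(stats, dh=0.1, E=0.5, H=2.0):
    """Approximate TFCE scores for a 1D list of non-negative statistic values."""
    if not stats:
        return []
    scores = [0.0] * len(stats)
    n_steps = int(max(stats) / dh)
    for step in range(1, n_steps + 1):
        h = step * dh
        i = 0
        while i < len(stats):
            if stats[i] < h:
                i += 1
                continue
            # Find the contiguous run of voxels at or above height h.
            j = i
            while j < len(stats) and stats[j] >= h:
                j += 1
            extent = j - i
            for k in range(i, j):
                scores[k] += (extent ** E) * (h ** H) * dh
            i = j
    return scores


# A broad plateau (five voxels) and an isolated voxel with the same height.
toy_stats = [0.0, 2.0, 2.0, 2.0, 2.0, 2.0, 0.0, 0.0, 2.0, 0.0]
scores = tfce_1d(toy_stats)
print(round(scores[1] / scores[8], 3))  # ratio of plateau score to isolated-voxel score
```

Every voxel on the plateau receives the same score, boosted by the plateau's extent. No primary threshold is needed, but the score still reflects support from neighbors, which is why large significant regions remain hard to localize.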

In this paper, we show a typical example of fMRI results thresholded with a cluster-extent based thresholding method, using an fMRI dataset from our laboratory (N = 33), in order to illustrate problems with spatial specificity and inappropriate inferences about anatomical regions. Next, we present findings from a survey of recent fMRI literature (N = 814 studies) to demonstrate how researchers currently select the primary threshold levels for their studies. Third, we present results of simulations examining the effects of selection of different primary threshold levels with different levels of signal-to-noise ratio on voxel- and cluster-level false positives (Type I error) and false negatives (Type II error) and on the average anatomical specificity of significant clusters. Finally, we conclude with recommendations for the use of cluster-extent based thresholding in neuroimaging studies.

Section snippets

Illustration

To illustrate the potential pitfalls of cluster-extent based thresholding, we used fMRI data (N = 33) from a study conducted in our laboratory (Wager et al., 2013). The data include voxel-wise mapping of the positive effects of heat intensity causing acute experimental pain. For more details about the data, please refer to the Methods section of Study 2 in Wager et al. (2013). The results we report here were thresholded with a primary threshold of voxel-wise p < .01, which yielded a cluster-extent

An illustration of potential pitfalls of cluster-extent based thresholding: low spatial specificity and inappropriate inferences (Illustration and Survey sections)

As Fig. 1A shows, cluster-extent based thresholding has been the most popular thresholding method for multiple comparisons correction in recent years. However, there are potential pitfalls of cluster-extent based thresholding, as illustrated in Fig. 1B. The presented map is thresholded at p < .05, FWER corrected using cluster-extent based thresholding (k > 611) with primary threshold of p < .01. As expected, voxels in multiple anatomical regions that have been implicated in pain processing showed

Discussion

The popularity of cluster-extent based thresholding is understandable given its advantages, including generally higher sensitivity in identifying significant regions (Friston et al., 1994) as compared to voxel-level correction methods for multiple comparisons. However, there are potential pitfalls, especially when activation clusters are so large that they cover multiple anatomical brain regions. Researchers can only ascertain that there is true signal somewhere within the cluster, and thus

Conclusion

Cluster-extent based thresholding has become the most popular correction method for multiple comparisons in fMRI data analysis because it is more sensitive (more powerful) and reflects the spatially correlated nature of fMRI signal. However, when a significant cluster is so large that it spans multiple anatomical regions, we cannot make inferences about a specific anatomical region with confidence, but we can only infer that there is signal somewhere within the large cluster. In other words,

Conflict of interest

We have no relevant conflicts of interest.

References (27)

  • T.D. Wager et al. Neuroimaging studies of shifting attention: a meta-analysis. NeuroImage (2004)

  • T.D. Wager et al. Brain mediators of cardiovascular responses to social threat, part II: prefrontal–subcortical pathways and relationship with anxiety. NeuroImage (2009)

  • Y. Benjamini et al. False discovery rates for spatial signals. J. Am. Stat. Assoc. (2007)