Technical Note
Cluster-extent based thresholding in fMRI analyses: Pitfalls and recommendations
Introduction
Recent advances in the statistical analysis of functional magnetic resonance imaging (fMRI) data have improved the ability of researchers to make meaningful inferences about task-related brain activation. Most statistical analyses of fMRI data are mass univariate approaches, with inferences at a voxel or cluster (of voxels) level. Typical fMRI analyses include > 80,000 voxels, resulting in numerous statistical tests, which must be appropriately corrected for multiple comparisons (Bennett et al., 2009; Friston et al., 1994; Genovese et al., 2002; Nichols, 2012; Nichols and Hayasaka, 2003; Nichols and Holmes, 2002).
Among the many approaches to deal with multiple comparisons, cluster-extent based thresholding has become the most popular (Fig. 1A; Friston et al., 1994; also see Carp, 2012). This approach detects statistically significant clusters on the basis of the number of contiguous voxels whose voxel-wise statistic values lie above a pre-determined primary threshold. Tests for statistical significance do not control the estimated false positive probability of each voxel in the contiguous region, but instead control the estimated false positive probability of the region as a whole. Cluster-extent based thresholding generally consists of two stages (Friston et al., 1994; Hayasaka and Nichols, 2003). First, an arbitrary voxel-level primary threshold defines clusters by retaining groups of suprathreshold voxels. Second, a cluster-level extent threshold, measured in units of contiguous voxels (k), is determined based on the estimated distribution of cluster sizes under the null hypothesis of no activation in any voxel in that cluster. The cluster-level extent threshold that controls family-wise error rate (FWER) can be obtained from the sampling distribution of the largest null hypothesis cluster size among suprathreshold voxels within the search area (e.g., the brain). The sampling distribution of the largest null cluster size under the global null hypothesis of no signal is typically estimated using theoretical methods (e.g., random field theory [RFT]; Worsley et al., 1992), Monte Carlo simulation (Forman et al., 1995), or nonparametric methods (Nichols and Holmes, 2002).
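The two-stage procedure can be sketched with a small Monte Carlo simulation. This is a toy illustration under stated assumptions, not the method of any particular software package: the image is a 2D grid rather than a brain volume, smoothing is a repeated 3x3 mean filter, statistics are treated as one-sided z-scores, and all function names (`smooth`, `standardize`, `cluster_sizes`, `extent_threshold`) are hypothetical.

```python
import math
import random
from statistics import NormalDist


def smooth(img, passes=2):
    """Crude spatial smoothing: repeated 3x3 mean filter (window truncated at edges)."""
    n_rows, n_cols = len(img), len(img[0])
    for _ in range(passes):
        out = [[0.0] * n_cols for _ in range(n_rows)]
        for r in range(n_rows):
            for c in range(n_cols):
                vals = [img[rr][cc]
                        for rr in range(max(r - 1, 0), min(r + 2, n_rows))
                        for cc in range(max(c - 1, 0), min(c + 2, n_cols))]
                out[r][c] = sum(vals) / len(vals)
        img = out
    return img


def standardize(img):
    """Rescale a smoothed noise image to mean 0 and SD 1 (approximate z-scores)."""
    flat = [v for row in img for v in row]
    mean = sum(flat) / len(flat)
    sd = (sum((v - mean) ** 2 for v in flat) / (len(flat) - 1)) ** 0.5
    return [[(v - mean) / sd for v in row] for row in img]


def cluster_sizes(img, z_thresh):
    """Stage 1: sizes of 4-connected clusters of voxels with value > z_thresh."""
    n_rows, n_cols = len(img), len(img[0])
    seen = [[False] * n_cols for _ in range(n_rows)]
    sizes = []
    for r in range(n_rows):
        for c in range(n_cols):
            if seen[r][c] or img[r][c] <= z_thresh:
                continue
            seen[r][c] = True
            stack, size = [(r, c)], 0
            while stack:  # flood fill over suprathreshold neighbours
                y, x = stack.pop()
                size += 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < n_rows and 0 <= nx < n_cols
                            and not seen[ny][nx] and img[ny][nx] > z_thresh):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            sizes.append(size)
    return sizes


def extent_threshold(primary_p=0.01, alpha=0.05, shape=(24, 24), n_sims=200, seed=0):
    """Stage 2: smallest k such that P(largest null cluster >= k) <= alpha."""
    rng = random.Random(seed)
    z_thresh = NormalDist().inv_cdf(1 - primary_p)  # one-sided primary threshold
    null_max = []
    for _ in range(n_sims):
        noise = [[rng.gauss(0, 1) for _ in range(shape[1])] for _ in range(shape[0])]
        z = standardize(smooth(noise))
        null_max.append(max(cluster_sizes(z, z_thresh), default=0))
    null_max.sort()
    idx = math.ceil((1 - alpha) * n_sims) - 1  # empirical (1 - alpha) quantile
    return null_max[idx] + 1, null_max


if __name__ == "__main__":
    for p in (0.01, 0.001):
        k, _ = extent_threshold(primary_p=p)
        print(f"primary p < {p}: extent threshold k = {k}")
```

Because the same seed produces the same noise images, a more liberal primary threshold can only enlarge each simulated cluster, so the estimated k grows as the primary threshold is relaxed.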
Cluster-extent based thresholding has certain advantages. First, voxel-level corrections for multiple comparisons, such as the Bonferroni and RFT-based corrections, are so stringent that they can dramatically increase Type II errors (i.e., reduce sensitivity) without extremely large sample sizes (Nichols and Hayasaka, 2003). By contrast, cluster-extent based thresholding has relatively high sensitivity (Friston et al., 1994; Smith and Nichols, 2009). Second, cluster-extent based thresholding accounts for the fact that individual voxel activations are not independent of the activations of their neighboring voxels, especially when the data are spatially smoothed (Friston, 2000; Heller et al., 2006; Wager et al., 2007).
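To give a sense of how stringent a voxel-level Bonferroni correction is, the arithmetic can be worked through for the roughly 80,000 voxels mentioned in the Introduction. This is a minimal sketch that assumes one-sided tests on Gaussian z-statistics; the helper name `bonferroni_threshold` is hypothetical.

```python
from statistics import NormalDist


def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff and the matching one-sided z cutoff."""
    p_cut = alpha / n_tests
    z_cut = NormalDist().inv_cdf(1 - p_cut)
    return p_cut, z_cut


if __name__ == "__main__":
    p_cut, z_cut = bonferroni_threshold(alpha=0.05, n_tests=80_000)
    primary_z = NormalDist().inv_cdf(1 - 0.01)  # a liberal primary threshold, p < .01
    print(f"Bonferroni per-voxel p cutoff: {p_cut:.3g} (z > {z_cut:.2f})")
    print(f"Primary threshold p < .01 corresponds to z > {primary_z:.2f}")
```

The gap between the two z cutoffs illustrates why voxel-level correction demands much stronger effects at every voxel than a cluster-defining primary threshold does.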
Despite these strengths, cluster-extent based thresholding also has limitations, most notably low spatial specificity when clusters are large (Friston et al., 1994; Nichols, 2012). The cluster-level p-value does not determine the statistical significance of activation at a specific location or voxel(s) within the cluster. Rather, it describes the probability of obtaining a cluster of a given size or greater under the null hypothesis. The logical alternative when this sharp null is rejected is a diffuse family of alternatives: At least some signal must be present somewhere in the cluster. Therefore, the larger the clusters become, the less spatially specific the inference. Though widely known, we believe the practical implications of this limitation have been largely overlooked.
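The distinction between the sharp null and the diffuse alternative can be written compactly. The notation here is ours, introduced only for illustration: $K_{\max}$ is the size of the largest cluster under the global null $H_0$, $k_\alpha$ is the FWER-controlling extent threshold, $C$ is an observed cluster, and $\mu_v$ is the true effect at voxel $v$.

```latex
p_{\text{cluster}}(k) = \Pr\left(K_{\max} \ge k \mid H_0\right),
\qquad
k_\alpha = \min\left\{ k : p_{\text{cluster}}(k) \le \alpha \right\}

|C| \ge k_\alpha
\;\Longrightarrow\;
\text{reject } H_0^{C} : \mu_v = 0 \ \ \forall\, v \in C
\quad \text{in favor of} \quad
\exists\, v \in C : \mu_v \ne 0
```

The conclusion is existential rather than universal: it licenses no claim about any particular voxel or subregion of $C$, which is why the inference weakens as $|C|$ grows.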
If cluster sizes are small enough and lie within a single anatomical area of interest, cluster-extent based inferences are reasonably specific. However, if a liberal (i.e., higher p-value) primary voxel-level threshold (e.g., p < .01) is selected to define clusters, clusters that survive a cluster-extent based threshold for a FWER correction often become large enough to cross anatomical boundaries, particularly in the presence of spatially correlated physiological noise. It is tempting to set a liberal primary threshold in small, underpowered studies, because with more liberal primary thresholds, significant clusters are larger and thus appear more robust and substantial. However, a liberal primary threshold reduces the spatial specificity of the claims that can be made. Here, we argue that the use of liberal primary thresholds is both endemic and detrimental to the neuroimaging field.
There are two distinct problems with setting a liberal primary threshold and accepting the reduction in spatial specificity that it entails. First, liberal primary thresholds render the relatively high spatial resolution of fMRI useless, and if significant clusters cross multiple anatomical boundaries, the results yield little useful neuroscientific information. Findings of “activity in the insula or the striatum” are not useful in building a cumulative understanding of human brain function. The second, and more pernicious, problem is that results are displayed as colored maps of voxels that pass the primary threshold, with only large-enough clusters retained. These maps invite readers (and authors) to mistakenly believe that significant results are found in all the voxels and all the anatomical regions depicted as ‘significant’ in figures. In fact, if a single cluster covers two anatomical regions, the authors cannot in good faith discuss findings in relation to either anatomical region, although this is common practice.
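One simple way to check whether a significant cluster supports a region-specific claim is to tabulate how its voxels distribute across atlas regions. The sketch below is illustrative only: the atlas is a hypothetical dictionary from voxel coordinates to region names, the toy coordinates and the 0.9 cutoff are arbitrary choices, and the function names are ours.

```python
from collections import Counter


def region_breakdown(cluster_voxels, atlas_labels):
    """Fraction of a cluster's voxels falling in each atlas region."""
    if not cluster_voxels:
        raise ValueError("cluster must contain at least one voxel")
    counts = Counter(atlas_labels.get(v, "unlabeled") for v in cluster_voxels)
    total = len(cluster_voxels)
    return {region: n / total for region, n in counts.most_common()}


def is_anatomically_specific(breakdown, min_fraction=0.9):
    """True if a single region holds at least min_fraction of the cluster."""
    return max(breakdown.values()) >= min_fraction


if __name__ == "__main__":
    # Toy atlas: voxel (x, y, z) -> region name (hypothetical labels).
    atlas = {(x, 0, 0): "insula" for x in range(6)}
    atlas.update({(x, 0, 0): "putamen" for x in range(6, 10)})
    cluster = [(x, 0, 0) for x in range(10)]  # one cluster spanning both regions

    breakdown = region_breakdown(cluster, atlas)
    print(breakdown)
    print("region-specific claim supportable:", is_anatomically_specific(breakdown))
```

A cluster that fails such a check supports only the statement that signal exists somewhere within its extent, not a claim about any one of the regions it covers.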
In addition to the standard cluster-extent based thresholding methods we discuss extensively here, several recent alternatives have been proposed, including the threshold-free cluster enhancement (TFCE) method (Smith and Nichols, 2009) and hierarchical false discovery rate (FDR) control on clusters (Benjamini and Heller, 2007). TFCE eliminates the need for setting an arbitrary cluster-defining primary threshold by combining voxel-wise statistics with local spatial support underneath the voxel. However, TFCE is also subject to the same limitations of low spatial specificity when significant clusters are large. Benjamini and Heller's (2007) hierarchical FDR method tests clusters first, and then trims locations with no signal within each significant cluster. However, this method heavily depends on a priori information about the data, such as pre-defined clusters or weights, which is generally unavailable in practice.
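The idea of combining a voxel's statistic with the spatial support beneath it can be sketched in one dimension. This is a simplified illustration, not the reference implementation: it sums extent^E * height^H over a grid of thresholds, uses E = 0.5 and H = 2 (the values usually quoted as defaults for TFCE, treated here as assumptions), handles only positive statistics, and the name `tfce_1d` and the step size `dh` are our own choices.

```python
def tfce_1d(stat, dh=0.1, E=0.5, H=2.0):
    """Approximate TFCE scores for a 1D list of non-negative statistics."""
    n = len(stat)
    scores = [0.0] * n
    peak = max(stat) if stat else 0.0
    n_steps = int(peak / dh) if peak > 0 else 0
    for step in range(1, n_steps + 1):
        h = step * dh  # current threshold height
        i = 0
        while i < n:
            if stat[i] >= h:
                # Find the contiguous run of values at or above h.
                j = i
                while j < n and stat[j] >= h:
                    j += 1
                extent = j - i
                # Every voxel in the run receives the same increment at this height.
                for v in range(i, j):
                    scores[v] += (extent ** E) * (h ** H) * dh
                i = j
            else:
                i += 1
    return scores


if __name__ == "__main__":
    # An isolated spike (index 1) and a broader bump (indices 4-6) with equal peaks.
    stat = [0.0, 3.0, 0.0, 0.0, 2.0, 3.0, 2.0, 0.0]
    for value, score in zip(stat, tfce_1d(stat)):
        print(f"stat = {value:.1f}  tfce = {score:.2f}")
```

In this toy input the peak of the broad bump scores higher than the isolated spike of the same height, because it has more spatial support at low thresholds. No primary threshold is chosen, but a large enhanced region still carries the same ambiguity about where within it the signal lies.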
In this paper, we show a typical example of fMRI results thresholded with a cluster-extent based thresholding method, using an fMRI dataset from our laboratory (N = 33), in order to illustrate problems with spatial specificity and inappropriate inferences about anatomical regions. Next, we present findings from a survey of recent fMRI literature (N = 814 studies) to demonstrate how researchers currently select the primary threshold levels for their studies. Third, we present results of simulations examining the effects of selection of different primary threshold levels with different levels of signal-to-noise ratio on voxel- and cluster-level false positives (Type I error) and false negatives (Type II error) and on the average anatomical specificity of significant clusters. Finally, we conclude with recommendations for the use of cluster-extent based thresholding in neuroimaging studies.
Illustration
To illustrate the potential pitfalls of cluster-extent based thresholding, we used fMRI data (N = 33) from a study conducted in our laboratory (Wager et al., 2013). The data include voxel-wise mapping of the positive effects of heat intensity causing acute experimental pain. For more details about the data, please refer to the Methods section of Study 2 in Wager et al. (2013). The results we report here were thresholded with a primary threshold of voxel-wise p < .01, which yielded a cluster-extent
An illustration of potential pitfalls of cluster-extent based thresholding: low spatial specificity and inappropriate inferences (Illustration and Survey sections)
As Fig. 1A shows, cluster-extent based thresholding has been the most popular thresholding method for multiple comparisons correction in recent years. However, there are potential pitfalls of cluster-extent based thresholding, as illustrated in Fig. 1B. The presented map is thresholded at p < .05, FWER corrected using cluster-extent based thresholding (k > 611) with primary threshold of p < .01. As expected, voxels in multiple anatomical regions that have been implicated in pain processing showed
Discussion
The popularity of cluster-extent based thresholding is understandable given its advantages, including generally higher sensitivity in identifying significant regions (Friston et al., 1994) as compared to voxel-level correction methods for multiple comparisons. However, there are potential pitfalls, especially when activation clusters are so large that they cover multiple anatomical brain regions. Researchers can only ascertain that there is true signal somewhere within the cluster, and thus
Conclusion
Cluster-extent based thresholding has become the most popular correction method for multiple comparisons in fMRI data analysis because it is more sensitive (more powerful) and reflects the spatially correlated nature of fMRI signal. However, when a significant cluster is so large that it spans multiple anatomical regions, we cannot make inferences about a specific anatomical region with confidence, but we can only infer that there is signal somewhere within the large cluster. In other words,
Conflict of interest
We have no relevant conflicts of interest.
References (27)
The secret lives of experiments: methods reporting in the fMRI literature. NeuroImage (2012)
An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage (2006)
Thresholding of statistical maps in functional neuroimaging using the false discovery rate. NeuroImage (2002)
Validating cluster size inference: random field and permutation methods. NeuroImage (2003)
Nonstationary cluster-size inference with random field and permutation methods. NeuroImage (2004)
Cluster-based analysis of fMRI data. NeuroImage (2006)
Multiple testing corrections, nonparametric methods, and random field theory. NeuroImage (2012)
False positives in neuroimaging genetics using voxel-based morphometry data. NeuroImage (2011)
Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. NeuroImage (2009)
Valence, gender, and lateralization of functional brain anatomy in emotion: a meta-analysis of findings from neuroimaging. NeuroImage (2003)
Neuroimaging studies of shifting attention: a meta-analysis. NeuroImage
Brain mediators of cardiovascular responses to social threat, part II: prefrontal–subcortical pathways and relationship with anxiety. NeuroImage
False discovery rates for spatial signals. J. Am. Stat. Assoc.