Here is the conclusion of Dave's article:
Concluding remarks
Rank histograms, χ² goodness-of-fit test decomposition and residual quantile-quantile plots help to assess the probabilistic and climatological consistency of ensemble projections against a verification data set (e.g. Annan and Hargreaves, 2010; Marzban et al., 2011). If no reliable observable target can be identified, as is the case in periods and regions without instrumental observations, such statistical analyses reduce the subjectivity in comparing simulation ensembles and statistical approximations from paleo-sensor data (Braconnot et al., 2012) under uncertainty and go beyond "wiggle matching". The approach permits a succinct visualization of the consistency between an ensemble of estimates and an uncertain verification target. Ideally, it also reduces the dependence on the reference climatology which is present in many visual and mathematical methods that aim to qualify the correspondence between simulations and (approximated) observations.
We considered the COSMOS-Mill ensemble (Jungclaus et al., 2010) and various reconstructions within the described approach. We found the simulation ensemble to be consistent, within sampling variability, with the Central European temperature reconstruction by Dobrovolný et al. (2010). The ensemble possibly lacks consistency with respect to the mean of the ensemble of Northern Hemisphere mean temperature reconstructions by Frank et al. (2010) due to probabilistic over-dispersion and various climatological deviations. The ensemble generally samples from a significantly wider distribution than the reconstruction ensemble mean. The distribution of the reconstruction ensemble in turn is possibly consistent relative to the simulation ensemble mean.
Furthermore, the simulation ensemble is found to be statistically distinguishable from the global field temperature reconstruction by Mann et al. (2009). Although the data is probabilistically consistent for multi-centennial sub-periods and certain regions according to the applied full test, analyses of single probabilistic deviations and climatological differences emphasise a general lack of consistency. We found the largest, but still limited, consistency over areas of Eurasia and North America for both full and sub-periods. For some periods, we also cannot reject consistency for most tropical and northern hemispheric ocean regions. The profound lack of climatological and probabilistic consistency between the simulation ensemble and reconstructions stresses the importance of improving simulations and reconstructions to investigate past climates in order to achieve a more resilient estimate of the true past climate state and evolution.
If our estimates are not consistent with each other for certain periods and areas, it is unclear how we should compare their accuracy. Thus, if these reconstructions and this simulation ensemble are employed in dynamical comparisons and in studies on climate processes, we have to account for the climatological and probabilistic discrepancies between both data sets, which have been described in the present work.
The distributional degrees of freedom equal n − 1 for the full test, where n is the number of classes in the rank histogram (see Jolliffe and Primo, 2008). The decomposition of the χ² test statistic implies that we have only 1 degree of freedom for the single deviation test (Jolliffe and Primo, 2008; Annan and Hargreaves, 2010).
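The decomposition the article refers to can be sketched as follows. This is an illustrative implementation only: the function name and the particular contrast vectors (a linear contrast for bias, a V-shaped contrast for dispersion) are our choices, not necessarily those of the original study.

```python
import numpy as np

def chi2_decomposition(counts):
    """Decompose the chi-square goodness-of-fit statistic of a rank
    histogram into 1-degree-of-freedom components (in the spirit of
    Jolliffe and Primo, 2008).  Returns the total statistic plus a
    'slope' (bias) and a 'convexity' (dispersion) component; each
    component is asymptotically chi-square with 1 degree of freedom."""
    counts = np.asarray(counts, dtype=float)
    n = counts.size                       # number of rank classes
    expected = counts.sum() / n           # E = N / n under uniformity
    x = (counts - expected) / np.sqrt(expected)
    total = np.sum(x ** 2)                # full test statistic, n - 1 df

    # Orthonormal contrast vectors, both orthogonal to the constant
    # vector: a linear contrast picks up a sloped histogram (bias),
    # a V-shaped contrast picks up over-/under-dispersion.
    j = np.arange(1, n + 1)
    linear = j - (n + 1) / 2.0
    linear = linear / np.linalg.norm(linear)
    vshape = np.abs(j - (n + 1) / 2.0)
    vshape = vshape - vshape.mean()
    vshape = vshape / np.linalg.norm(vshape)

    slope = (linear @ x) ** 2             # 1-df bias component
    convexity = (vshape @ x) ** 2         # 1-df dispersion component
    return total, slope, convexity
```

For a perfectly sloped histogram the slope component carries the entire χ² statistic, which is why a single 1-df deviation test can be far more sensitive to a specific failure mode than the full test.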
We reject consistency for certain right-tail p values of the test. Where appropriate, we also interpret the test statistics in terms of a reversed null hypothesis, i.e. we test whether there is a deviation from uniformity. This refers to the general goodness-of-fit χ² statistic or to a specific deviation for the decomposed statistic. It is reasonable to consider significance at a conservative one-sided 90 % level due to the large uncertainties associated with the data. The critical chi-square value thus becomes 2.706 for the single deviation test. For the full goodness-of-fit test, we consider ensembles of eleven, nine, five and three members (see Sect. 2.2); the critical values are 17.275, 14.684, 9.236 and 6.251, respectively.
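The quoted critical values can be checked directly. A small sketch using SciPy's chi-square quantile function (the loop and formatting are ours; an m-member ensemble gives n = m + 1 rank classes, hence m degrees of freedom for the full test):

```python
from scipy.stats import chi2

def critical_value(df, level=0.90):
    """One-sided chi-square critical value at the given level."""
    return chi2.ppf(level, df)

# df = 11, 9, 5, 3 correspond to the 11-, 9-, 5- and 3-member
# ensembles of the full test; df = 1 is the single deviation test.
for df in (11, 9, 5, 3, 1):
    print(f"df = {df:2d}: critical value = {critical_value(df):.3f}")
```

Running this reproduces 17.275, 14.684, 9.236, 6.251 and 2.706, matching the values stated in the text.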
Meaningful results for the tests require accounting for dependencies in the data (Jolliffe and Primo, 2008; Annan and Hargreaves, 2010). All analyses account for the effective sample size (see the discussion and references in Bretherton et al., 1999). A larger effective sample size essentially leads to a higher chance of rejecting the hypothesis of uniformity. Furthermore, the results are sensitive to the assumptions made, particularly those with respect to the included uncertainty estimates (see Sect. 2.3).
Some further notes are in order. If ensemble and verification data are smoothed (as for the global data by Mann et al., 2009), either the sample size or the expected number of rank counts may be small compared with the theoretical requirements (but see e.g. Bradley et al., 1979, and
references therein). Temporal correlations further affect the structure of the rank histograms (Marzban et al., 2011; Wilks, 2011), and sampling variability can result in erroneous conclusions from the rank counts. That is, a flat rank histogram is only a necessary condition for consistency (see discussions by e.g. Hamill, 2001; Marzban et al., 2011). To account for this, we display, for area-averaged time series, quantile statistics of block-bootstrapped rank histograms (Marzban et al., 2011; Efron and Tibshirani, 1994). We apply a block length of 50 yr, calculate 2000 bootstrap replicates and display the 0.5, 50 and 99.5 percentiles. This additionally allows for a secondary test of uniformity. The results are sensitive to the chosen block length, and 50 yr is possibly too short according to the autocorrelation functions of some reconstructions. However, 50 yr appears to be a reasonable compromise if we consider that the optimal length may also be shorter for some records.
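A minimal sketch of such a moving-block bootstrap of a rank histogram. The function name, the 0-based integer rank encoding and the default parameters are our assumptions for illustration; resampling whole blocks preserves the serial correlation of the rank sequence within each block:

```python
import numpy as np

def bootstrap_rank_histogram(ranks, n_classes, block_len=50,
                             n_boot=2000, seed=0):
    """Moving-block bootstrap of a rank histogram (cf. Marzban et al.,
    2011; Efron and Tibshirani, 1994).  Returns the 0.5, 50 and 99.5
    percentiles of the bootstrapped rank counts for each rank class,
    giving a sampling band against which flatness can be judged."""
    rng = np.random.default_rng(seed)
    ranks = np.asarray(ranks)
    t = ranks.size
    n_blocks = int(np.ceil(t / block_len))
    max_start = t - block_len              # last admissible block start
    counts = np.empty((n_boot, n_classes))
    for b in range(n_boot):
        starts = rng.integers(0, max_start + 1, size=n_blocks)
        sample = np.concatenate(
            [ranks[s:s + block_len] for s in starts])[:t]
        counts[b] = np.bincount(sample, minlength=n_classes)
    return np.percentile(counts, [0.5, 50, 99.5], axis=0)
```

The sensitivity to block length noted in the text can be explored by simply varying `block_len` and comparing the width of the resulting percentile bands.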
*********************************************************************************
Maybe I missed it, Dave, but I just cannot locate where, in this study, they said that the model they used (the Max Planck Earth System Model, "based on the atmosphere model ECHAM5, the ocean model MPI-OM, a land-surface module including vegetation (JSBACH), a module for ocean biogeochemistry (HAMOCC) and an interactive carbon cycle" - that is to say, a SINGLE model run in ensemble fashion and compared for consistency with proxy-based reconstructions) WAS OVERESTIMATING TEMPERATURE.
This entire thread is based on a false statement.