Making Meaning of Mixed Evidential Value for Research on Empirically Supported Treatments (ESTs)

During the 1970s, psychological researchers began using randomized controlled trials, just as in medicine, to scientifically evaluate the effectiveness of different kinds of psychotherapies. Some time later, after a critical mass of controlled psychotherapy trials had been published, a Task Force from APA Division 12 synthesized this literature. Division 12 produced a continually updated list of therapies that appeared to have particularly good scientific evidence for their efficacy in treating patients with specific diagnoses. They termed these Empirically Supported Treatments (or Therapies) (ESTs; Chambless & Hollon, 1998). For many clinicians, ESTs have become a cornerstone of evidence-based practice.

Over the last decade, however, the psychology research community has been jolted by the so-called Replicability Crisis. In short, following the publication of some… interesting… research findings in top-tier journals (e.g., Bem, 2011), revelations of high-profile fraud (Stapel, 2014), and increasing awareness of the many ways in which researchers depart (often unknowingly) from the “rules” of valid statistical analyses (John, Loewenstein, & Prelec, 2012), psychologists began to take a renewed interest in seeing whether they could independently replicate key findings in their areas of research. And the answer, whether attempting to replicate a particular effect (e.g., Wagenmakers et al., 2016) or groups of effects (e.g., Open Science Collaboration, 2015), was that many more effects appeared not to replicate than anyone had imagined.

Questions about the replicability of research findings have, more recently, begun to take root in the community of Clinical Psychologists. We have recently contributed to this dialogue with an article on what we call the “Evidential Value” of research on ESTs (Sakaluk, Williams, Kilshaw, & Rhyner, 2019), now published in a special issue of the Journal of Abnormal Psychology that is focused on issues of replicability (Tackett & Miller, 2019). At its core, our article is an attempt to “peek under the hood” of the EST literature, in order to look at what it means to have therapies that are “evidence based”, and which might be further classified as having “modest” or “strong” evidence on their side (Chambless & Hollon, 1998). The quality of that underlying evidence is what we refer to as evidential value.

In our review, we examined all of the Key References/Clinical Trials listed as supporting the ESTs on the list maintained by APA Division 12[1]. From each article, we extracted and coded the quantitative information from statistical tests that compared the effectiveness of the EST in question (e.g., Exposure) against some control condition (e.g., “Treatment As Usual”), on levels of diagnosis-relevant outcomes (e.g., fear of spiders). Using the statistical information coded, we then calculated four types of metrics that we think map onto evidential value:

  1. Rates of incorrectly reported statistical tests (Nuijten, Hartgerink, van Assen, Epskamp, & Wicherts, 2016): for a given EST and its corresponding set of statistical tests, do the reported test statistic types (e.g., z, t, F), degrees of freedom, test statistic values, and p-values “add up”?
  2. Statistical power (Cohen, 1992): for a given EST and its corresponding set of statistical tests, was there a sufficient amount of data to have a reasonably high chance of detecting a truly helpful therapy?
  3. R-Index (Schimmack, 2016): for a given EST and its corresponding set of statistical tests, were the rates of “statistically significant effects” reasonable, given the observed statistical power underlying the set of effects?
  4. Bayes Factors (Jeffreys, 1935): for a given EST and its corresponding set of statistical tests, were the data appreciably more compatible with the hypothesis that the therapies were helpful compared to the hypothesis that they were unhelpful?
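To give a feel for what the first metric involves, here is a minimal sketch of a misreporting check in Python. It is not the code used in our paper or in the Nuijten et al. (2016) work; it handles only z tests (so that it needs nothing beyond the standard library), it assumes two-tailed p-values, and the function names are made up for illustration. The idea is simply to recompute the p-value implied by a reported test statistic and see whether it matches the reported p-value after rounding.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_tailed_p_from_z(z: float) -> float:
    """Recompute the two-tailed p-value implied by a z statistic."""
    return 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(z)))


def is_consistent(z: float, reported_p: float, decimals: int = 2) -> bool:
    """Return True if the reported p-value matches the recomputed one.

    Assumption: the article rounded its p-value to `decimals` places,
    so the recomputed value is rounded the same way before comparing.
    """
    recomputed = round(two_tailed_p_from_z(z), decimals)
    return abs(recomputed - reported_p) < 1e-9


# Hypothetical reported results, not taken from any real article.
print(is_consistent(1.96, 0.05))  # True: z = 1.96 implies p of about .05
print(is_consistent(2.20, 0.01))  # False: z = 2.20 implies p of about .03
```

A fuller check would also cover t and F tests (which need the degrees of freedom) and would allow for rounding in the reported test statistic itself.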

None of these metrics is the “true” or “perfect” metric of evidential value. But when thinking of what it means to have “strong evidence” for a scientific claim, we think many people would imagine features like these: that a study’s statistical tests are correctly reported, backed by a large amount of data, and support the researcher’s claim to a much stronger degree than alternative explanations of the same data.
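The R-Index in particular can feel abstract, so here is a rough sketch of the arithmetic as we understand Schimmack's (2016) description: take the median observed power across a set of tests, subtract the "inflation" (the rate of significant results minus that median power), and you have the R-Index. The code below is an illustration under simplifying assumptions, not the procedure from our paper: every test is treated as a two-tailed z test, observed power is derived directly from each p-value, and the function names and example p-values are hypothetical.

```python
from statistics import NormalDist, median

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def observed_power(p_value: float, alpha: float = 0.05) -> float:
    """Power implied by treating the observed effect as the true effect.

    Assumes a two-tailed z test; p_value must be greater than 0.
    """
    z_observed = STANDARD_NORMAL.inv_cdf(1.0 - p_value / 2.0)
    z_critical = STANDARD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    upper_tail = 1.0 - STANDARD_NORMAL.cdf(z_critical - z_observed)
    lower_tail = STANDARD_NORMAL.cdf(-z_critical - z_observed)
    return upper_tail + lower_tail


def r_index(p_values, alpha: float = 0.05) -> dict:
    """Median observed power minus inflation, for one set of tests."""
    powers = [observed_power(p, alpha) for p in p_values]
    median_power = median(powers)
    success_rate = sum(p < alpha for p in p_values) / len(p_values)
    inflation = success_rate - median_power
    return {
        "median_power": median_power,
        "success_rate": success_rate,
        "inflation": inflation,
        "r_index": median_power - inflation,
    }


# Hypothetical p-values for one therapy's key tests (not real data).
summary = r_index([0.04, 0.03, 0.01])
print(summary)
```

In this made-up example every test is significant, yet the p-values sit close to .05, so the median observed power is modest and the R-Index comes out low. That is the pattern the metric is designed to flag: more significant results than the apparent power of the studies would lead you to expect.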

Our results[2], from our perspective, invite simultaneous optimism and concern. Identifying which statistical comparison(s) in an article constituted the key tests of an EST’s efficacy, for example, was surprisingly difficult, with papers often declaring very different types of effects as “key” to informing whether a given EST was effective (e.g., whether people receiving an EST improved over time vs. were, on average, better than those receiving a control/placebo vs. were better than those receiving a control/placebo at the end of the study). Still, some of the ESTs we evaluated appear to rest on a foundation of incredibly strong research, and that is a very good thing. Other ESTs, meanwhile (including some classified as having “strong” empirical support under the traditional system of appraising ESTs), yielded metrics that were troubling. In many more cases, the metrics we examined offered a more mixed picture of evidential value. And in many cases, the key statistical comparisons between ESTs and controls were reported in so little detail that we were unable to use them to calculate any metric of evidential value. Thus, our paper suggests that under the umbrella of the term “Empirically Supported Treatments”, and even within categories of ESTs described as “strong” and “modest”, there is a diverse range of evidential value.

While there are numerous ways you could make meaning of our findings, we think there are a few take-away messages that we should state more concretely. Namely, our results suggest that 1) some therapies appear to have very good evidence on their side; 2) some therapies appear to have better evidence on their side than others; and 3) in many cases, the evidence underlying a given therapy appears very ambiguous. Our results don’t mean that therapy doesn’t work, or that any approach should be considered reasonable.

For the everyday user of (or clinician in) the mental health system, we think the most important take-home message is that therapists and clients should have conversations (early and often!) about how they will each keep track of whether, and how well, a current course of therapy is working. If the initial course of treatment is working, then great! But a given therapy (even one listed as an EST) might not work for everyone, and if that is the case, clients and therapists should know how they will make that determination, and have a back-up plan for treatment if they need to change course.

For those of us in the psychological research community, meanwhile, the results of our paper will likely leave us with much to reconsider about the ways we approach studying therapy. Division 12 itself has taken a lead in recommending new and improved criteria for evaluating the strengths of ESTs in recent years (Tolin et al., 2015); language on the Division 12 EST website suggests that the current lists of ESTs will be reevaluated per these new criteria. We applaud these time-consuming, often thankless efforts. However, we think even these new and improved approaches, like the ones outlined by Division 12 and others (e.g., the APA Clinical Practice Guidelines), could benefit from also incorporating some of the metrics we examined (e.g., statistical misreporting) into their evaluations of the psychotherapy literature. The field may also need to move away from the model of science in which one lab may be responsible for executing a small-scale clinical trial, and instead involve many labs in rolling out a clinical trial (Uhlmann et al., 2019), so that we can gather more convincing evidence of a therapy’s effectiveness.

At the end of the day, when consumers of psychotherapy hear that a given therapy is “evidence-based” or backed by “strong empirical support”, they will be inclined to believe, all else being equal, that the therapy is more likely to be effective for them than a therapy without that label. As a scholarly community, we must therefore work to ensure labels like “EST” contain as much scientific signal as possible.

[1] Some colleagues have suggested to us that this review strategy biased results against ESTs by excluding important studies that Division 12, as of 2018, had not listed as Key References for a given EST. Other colleagues suggested our review risked painting an overly optimistic view of some ESTs, since the literature selected by Division 12 was intended to show the efficacy of ESTs rather than contain all relevant articles. We think both these criticisms have merit, and it remains an open question whether a broader review of the literature for a given EST would produce more favorable, less favorable, or roughly equivalent results across our metrics. Our review can specifically speak to the evidential value of the research Division 12 has collated in support of ESTs over the last 20 years, and how one should interpret EST labeling used by Division 12.

[2] For a more accessible summary of our results than what is provided in our paper, an infographic is available here.

Discussion Questions

  1. What does it mean for a therapy to be “evidence based” vs. not “evidence based”?
  2. What does strong evidence vs. modest evidence vs. weak evidence vs. no evidence look like, to you?

Author Bios

John K. Sakaluk, Ph.D. is an assistant professor at the University of Victoria, on the west coast of Canada. He is a social psychologist, and his lab studies sexuality and close relationships, with a keen interest in research synthesis, replicability, and measurement.

Alexander J. Williams, Ph.D. is an academic program director and psychological clinic director at the University of Kansas, Edwards Campus, and an assistant teaching professor at the University of Kansas. A clinical psychologist, his research interests pertain to the strength of evidence for different forms of psychotherapy as well as novel interventions for enhancing psychotherapeutic learning.

Robyn E. Kilshaw, B.Sc. is a graduate student in the clinical psychology program at the University of Utah. Her two main areas of research interest are traumatic stress studies and quantitative methods.

Kathleen T. Rhyner, Ph.D. is a clinical neuropsychologist at the VA Finger Lakes Healthcare System in western New York. She predominantly spends her time in clinical work, conducting assessments for Veterans with a variety of neurological and mental health conditions. Her research interests include lifestyle change treatments for depression and dementia, as well as measurement.


Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407-425.

Chambless, D. L., & Hollon, S. D. (1998). Defining empirically supported therapies. Journal of Consulting and Clinical Psychology, 66(1), 7-18.

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155-159.

Jeffreys, H. (1935, April). Some tests of significance, treated by the theory of probability. In Mathematical Proceedings of the Cambridge Philosophical Society (Vol. 31, No. 2, pp. 203-222). Cambridge University Press.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23, 524-532.

Nuijten, M. B., Hartgerink, C. H., van Assen, M. A., Epskamp, S., & Wicherts, J. M. (2016). The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods, 48, 1205-1226.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716.

Sakaluk, J. K., Williams, A. J., Kilshaw, R. E., & Rhyner, K. T. (2019). Evaluating the evidential value of empirically supported psychological treatments (ESTs): A meta-scientific review. Journal of Abnormal Psychology, 128, 500-509.

Schimmack, U. (2016). A revised introduction to the R-Index. Retrieved from

Stapel, D. (2014). Faking science: A true story of academic fraud (Trans. N. J. L. Brown.). Retrieved from

Tackett, J. L., & Miller, J. D. (2019). Introduction to the special section on increasing replicability, transparency, and openness in clinical psychology. Journal of Abnormal Psychology, 128, 487-492.

Uhlmann, E. L., Ebersole, C. R., Chartier, C. R., Errington, T. M., Kidwell, M. C., Lai, C. K., … & Nosek, B. A. (2019). Scientific utopia III: Crowdsourcing science. Perspectives on Psychological Science, 14, 711-733.

Wagenmakers, E. J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams Jr, R. B., … & Bulnes, L. C. (2016). Registered replication report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11, 917-928.