8.1 Effect Size

An effect size is a quantitative description of the size or strength of an effect, difference, or relationship.

In Chapter 7, we used the p-value to decide whether a result was statistically significant. That is useful, but it is not enough. A statistically significant result can be tiny and practically unimportant. A non-significant result can still be interesting, especially if the study had low power or a small sample.

Effect sizes help us ask a different question:

How big is the effect?

Statistical vs. Practical Significance

A p-value helps us decide whether the result is statistically significant. An effect size helps us think about whether the result is meaningful.

Those are not the same thing.

A very small effect can be statistically significant if the sample size is huge. A larger effect can be non-significant if the sample size is small or the study has low power. This is why we should avoid letting the p-value do all the thinking for us.
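To make this concrete, here is a rough sketch using made-up numbers. It assumes two equal-sized groups and uses a large-sample normal approximation (z = d × √(n / 2)) instead of a full t-test, so the p-values are approximate, especially for the small sample. The function name `approx_two_sided_p` and the example values are illustrative assumptions, not part of any library.

```python
from math import sqrt
from statistics import NormalDist


def approx_two_sided_p(d, n_per_group):
    """Approximate two-sided p-value for a two-group mean comparison.

    Assumes equal group sizes and uses the normal approximation
    z = d * sqrt(n / 2), where d is the standardized mean difference.
    """
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical scenario 1: a tiny effect with a huge sample.
p_tiny_effect = approx_two_sided_p(d=0.02, n_per_group=100_000)

# Hypothetical scenario 2: a much larger effect with a small sample.
p_larger_effect = approx_two_sided_p(d=0.50, n_per_group=10)

print(f"d = 0.02, n = 100,000 per group: p = {p_tiny_effect:.6f}")
print(f"d = 0.50, n = 10 per group:      p = {p_larger_effect:.3f}")
```

Under these assumptions, the tiny effect falls below .05 and the larger effect does not, even though the second effect is 25 times bigger in standardized units.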

Tip: Key Idea

Statistical significance asks whether a result is surprising under the null hypothesis.

Practical significance asks whether the effect matters in context.

Common Effect Sizes

Different statistical tests use different effect-size measures. You do not need to master all of these right now, but you should recognize that effect size is not one single statistic.

Statistical context | Common effect size | What it describes
Mean differences | Cohen’s d or Hedges’ g | Difference between means in standardized units
Correlation | r | Strength and direction of a relationship
ANOVA | η², partial η², or ω² | Proportion of variance associated with an effect
Chi-square | φ or Cramer’s V | Strength of association between categorical variables

We will return to specific effect sizes in the chapters where they are used. For now, the important point is that effect sizes describe magnitude.

The d family of effect sizes describes standardized mean differences. Cohen’s d is common, but Hedges’ g is often preferred with smaller samples because it corrects some small-sample bias.
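As a minimal sketch of how these two statistics relate, the code below computes Cohen’s d with a pooled standard deviation and then applies the commonly used approximate correction factor, 1 − 3 / (4·df − 1), to get Hedges’ g. The scores are invented for illustration, and the function names are our own rather than a library API.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def hedges_g(group_a, group_b):
    """Cohen's d multiplied by an approximate small-sample correction."""
    df = len(group_a) + len(group_b) - 2
    correction = 1 - 3 / (4 * df - 1)
    return cohens_d(group_a, group_b) * correction


# Hypothetical scores for two small groups.
treatment = [12, 14, 15, 17, 12]
control = [10, 11, 13, 12, 9]

d = cohens_d(treatment, control)
g = hedges_g(treatment, control)
print(f"Cohen's d = {d:.2f}, Hedges' g = {g:.2f}")
```

With only five people per group, g comes out noticeably smaller than d. As the sample grows, the correction factor approaches 1 and the two values converge.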

The r family of effect sizes describes associations. A correlation can range from -1 to +1, with 0 indicating no linear relationship. Squaring a correlation (r²) gives the proportion of variance in one variable that is linearly associated with the other.
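Here is a small sketch that computes a Pearson correlation by hand and then squares it. The data (hours studied and quiz scores) are hypothetical, and `pearson_r` is our own helper written out so each step is visible.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation computed from deviations around each mean."""
    mean_x, mean_y = mean(x), mean(y)
    dev_x = [xi - mean_x for xi in x]
    dev_y = [yi - mean_y for yi in y]
    sum_cross = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    sum_sq_x = sum(dx ** 2 for dx in dev_x)
    sum_sq_y = sum(dy ** 2 for dy in dev_y)
    return sum_cross / sqrt(sum_sq_x * sum_sq_y)


# Hypothetical data: hours studied and quiz scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

r = pearson_r(hours, scores)
r_squared = r ** 2
print(f"r = {r:.2f}, r squared = {r_squared:.2f}")
```

In this made-up example, r is about .77, so roughly 60% of the variance in quiz scores is linearly associated with hours studied.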

The eta-squared family is often used with ANOVA. Eta-squared, partial eta-squared, and omega-squared estimate the amount of variation in the dependent variable associated with one or more predictors. We will keep those distinctions light until the ANOVA chapter.
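For a first look, the sketch below computes plain eta-squared for a one-way design as the between-groups sum of squares divided by the total sum of squares. It uses invented scores for three groups and does not cover partial eta-squared or omega-squared, which use different formulas. The function name is our own.

```python
from statistics import mean


def eta_squared(groups):
    """Eta-squared for a one-way design: SS_between / SS_total."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)
    ss_between = sum(
        len(group) * (mean(group) - grand_mean) ** 2 for group in groups
    )
    ss_total = sum((score - grand_mean) ** 2 for score in all_scores)
    return ss_between / ss_total


# Hypothetical scores for three conditions.
conditions = [[4, 5, 6], [6, 7, 8], [8, 9, 10]]

eta_sq = eta_squared(conditions)
print(f"eta squared = {eta_sq:.2f}")
```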

Small, Medium, and Large Effects

You may have seen rules of thumb online. For example, Cohen’s d values of .20, .50, and .80 are often described as small, medium, and large. Correlations of .10, .30, and .50 are also often described that way.

Those guidelines can be useful as a starting point, but please do not treat them like universal laws.

Quite frankly, it depends.

The meaning of an effect size depends on the research area, the outcome, the intervention, the cost of acting on the result, the risk of not acting, and what previous studies have found.

For example, a very small effect can matter if the outcome is important, the intervention is cheap, and many people are affected. A small reduction in future heart attacks can matter a lot if the intervention is inexpensive and widely available.

On the other hand, imagine an educational intervention that produces a large effect on GPA but costs $100,000 per student. That might be statistically and practically impressive, but it may not be feasible. The effect is large, but the real-world decision is still complicated.

Warning: Be Careful With Benchmarks

Small, medium, and large effect-size labels are shortcuts. They are not substitutes for thinking.

A better question is: What is the smallest effect size that would matter in this context?

Effect Size as Part of BEAN

Effect size matters because larger effects are easier to detect.

Holding alpha and sample size constant:

  • Larger effect size → higher power
  • Smaller effect size → lower power

This also means smaller effects require larger samples if we want a good chance of detecting them.

That is one of the major reasons researchers conduct power analyses. They are not just asking, “How many people do I need?” They are asking, “How many people do I need to detect an effect of this size with reasonable power?”
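The relationship between effect size, power, and sample size can be sketched with a simple approximation. The code below assumes a two-sided, two-group comparison with equal group sizes and uses the normal approximation, so the numbers will be slightly lower than what a t-based power calculator reports. The function names and default values (alpha = .05, power = .80) are illustrative assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power to detect a standardized mean difference d.

    Normal approximation; ignores the negligible chance of rejecting
    in the wrong direction.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(d * sqrt(n_per_group / 2) - z_crit)


def approx_n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group sample size needed to reach the target power."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * (z_crit + z_power) ** 2 / d ** 2)


# Holding alpha and sample size constant, larger effects give higher power.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: power with 50 per group = {approx_power(d, 50):.2f}")

# Smaller effects need larger samples to reach the same power.
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {approx_n_per_group(d)} per group for 80% power")
```

Under this approximation, detecting d = .20 takes roughly 16 times as many people per group as detecting d = .80, because the required sample size grows with 1 / d².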

Daniel Lakens has written a useful journal article and a book chapter on effect sizes. He also discusses sample size justification, including ways to identify the smallest effect size of interest.

One helpful approach is to look at the distribution of effect sizes in a research area instead of relying only on generic benchmarks. For example, some education researchers have examined existing intervention effects to develop field-specific expectations for what counts as small, medium, or large.