8.1 Effect sizes
An effect size is a quantitative description of the strength of a phenomenon (i.e., the thing being studied). The larger the value, the stronger the phenomenon (e.g., a bigger mean difference or a stronger relationship). Whereas the p-value tells us about the statistical significance of an effect, the effect size tells us about its practical significance, or how much we should care about the effect.
Types of effect sizes
There are two basic families of effect sizes we tend to talk about:
The d family of effect sizes consists of standardized mean differences. In absolute value, they start at 0 (no mean difference) and have no upper bound, with larger values meaning larger standardized mean differences. Some of the effect sizes in this family:
Cohen’s d is perhaps the most popular standardized mean difference effect size. In general, it is the mean difference divided by the pooled standard deviation, but the exact formula varies depending on the design, such as whether you are using a one-sample, independent samples, or paired samples t-test (one common form is shown after this list).
Hedges’ g is a less biased version of Cohen’s d. Cohen’s d is particularly problematic for small sample sizes, so Hedges’ g is generally preferred, but not all statistical programs provide this effect size. It’s not difficult to calculate Hedges’ g from Cohen’s d, so just keep this in mind.
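As a point of reference, here is one common way these two are calculated for an independent-samples comparison. Treat this as a sketch of the standard textbook formulas; the exact versions your software uses may differ slightly depending on the design:

$$
d = \frac{\bar{X}_1 - \bar{X}_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}},
\qquad
g \approx d\left(1 - \frac{3}{4(n_1 + n_2) - 9}\right)
$$

In words: dividing the difference between the group means by the pooled standard deviation gives d, and multiplying d by a small-sample correction factor gives g. The correction shrinks d slightly, and it matters less and less as the sample size grows.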
The r family of effect sizes consists of measures of strength of association. They range from -1 to +1, where 0 is no relationship and |1| is a perfect relationship between the variables. As you’ll read about in the correlation and regression chapters, this family of effect sizes can describe the proportion of variance explained by squaring the correlation (e.g., with a correlation of r = .8, the r-squared is .8 × .8 = .64, which is 64% of the variance explained). Some of the effect sizes in this family:
r is a correlation. It’s a standardized measure of the strength of association, where r = -1 or +1 means a perfect relationship and r = 0 means no relationship at all. We typically work with Pearson’s correlation, but we will also learn about Spearman’s correlation (also known as Spearman’s rho).
\(\eta^2\) (eta-squared) measures the proportion of variance in the dependent variable associated with the different groups of the independent variable. It is considered a biased estimate, especially when trying to compare values across studies, so there are two more preferred alternatives. We’ll cover the differences among these three in a later chapter (ANOVA), and common formulas for all three are shown after this list.
\(\eta^2_p\) (partial eta-squared) is calculated slightly differently and is considered a less biased estimate (again, we’ll learn about this in a later chapter). This can allow for better comparisons of effect sizes across studies. It’s still not perfect, though.
\(\omega^2\) (omega-squared) is calculated differently again and is considered the least biased estimate. There are also \(\omega^2_p\) (partial omega-squared) and \(\omega^2_G\) (generalized omega-squared), but because jamovi doesn’t provide them we won’t cover them in this course.
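For reference, here are the usual textbook formulas for these three in a one-way ANOVA, using the sums of squares (SS), mean squares (MS), and degrees of freedom (df) from the ANOVA table. Treat this as a sketch; the formulas differ somewhat for more complex designs:

$$
\eta^2 = \frac{SS_{\text{effect}}}{SS_{\text{total}}},
\qquad
\eta^2_p = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}},
\qquad
\omega^2 = \frac{SS_{\text{effect}} - df_{\text{effect}} \times MS_{\text{error}}}{SS_{\text{total}} + MS_{\text{error}}}
$$

Notice that \(\omega^2\) subtracts out the variance we would expect by chance (\(df_{\text{effect}} \times MS_{\text{error}}\)), which is one reason it tends to be a bit smaller and less biased than \(\eta^2\).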
What is all this about more or less biased effect sizes? It has to do with how the standard deviations are calculated in these effect sizes and the fact that we’re dealing with samples and trying to infer the population’s effect size. Cohen’s d and eta-squared tend to slightly overestimate the true population effect, so there are options that provide corrections for this overestimation and lead to less biased results.
We’ll also learn about phi and Cramer’s V as effect sizes for the chi-square test and beta as an effect size for regression in subsequent chapters.
If you nerded out over this information and want to learn more, check out this great journal article by Daniel Lakens or this chapter on effect sizes by Daniel Lakens.
Small, medium, and large effect sizes
What is considered a small, medium, and large effect size? Quite frankly, it depends.
You may have seen heuristics online about what counts as small, medium, and large for Cohen’s d (e.g., .2, .5, and .8) and r (e.g., .1, .3, and .5), but these heuristics should not be used without critical thought. In fact, Cohen (who is regularly cited for these heuristics) said that the way we should determine cut-offs is by looking across studies to find what is considered small, medium, and large in that particular context.
As an example, a correlation of r = .03 might seem too small to be meaningful, but it’s in fact the correlation between taking aspirin after a heart attack and the prevalence of a future heart attack7, and it was considered so practically important that aspirin is now standard practice. Think about why: taking aspirin after a heart attack is easy and cheap, and if it reduces the chances of a future heart attack even slightly, then it has more benefit than harm.
Let’s take another example. Imagine we have created an educational intervention that produces a d = 1.00 increase in students’ GPAs. That’s big, right? Yes, but now imagine that the intervention costs $100,000 per student. That would be far too expensive to be practical. It’s unlikely anyone would use the intervention; instead, they would look for ways to reduce its cost while keeping the largest possible effect on students’ GPAs.
What makes an effect practically significant?
We’ll get into p-values in a moment, which are about statistical significance, but they don’t tell us anything about how meaningful the effect is. That’s what an effect size is for. But how do we know if it’s meaningful or practically significant?
Lakens (who also did the great journal article on effect sizes above) has a fantastic new preprint out on Sample Size Justification. In it, he provides an overview of six possible ways to determine which effect sizes are interesting:
- “Smallest effect size of interest: what is the smallest effect size that is theoretically and practically interesting?
- Minimally statistically detectable effect: given the test and sample size, what is the critical effect size that can be statistically significant?
- Expected effect size: which effect size is expected based on theoretical predictions or previous research?
- Width of confidence interval: which effect sizes are excluded based on the expected width of the confidence interval around the effect size?
- Sensitivity power analysis: across a range of possible effect sizes, which effects does a design have sufficient power to detect when performing a hypothesis test?
- Distribution of effect sizes in a research area: what is the empirical range of effect sizes in a specific research area, and which effects are a priori unlikely to be observed?” (p. 3)
Basically, what does past research say about what effect size you can expect (#3 and #6)? What is the smallest effect size you care about (#1)? What is the smallest effect size you can reasonably detect (e.g., given sample size limitations; #2, #4, and #5)? This is the justification you use to determine what effect size you are looking for, which matters when you then determine what sample size you need; that will be discussed in a separate section.
As a fun follow-up and an example of #6, this study in the field of education collected effect sizes from many education interventions to figure out benchmarks for small (< .05), medium (.05 to < .20), and large (≥ .20) effect sizes based on existing data rather than poor-quality heuristics.
More recent studies have shown that the relationship between aspirin and future heart attacks is actually a lot stronger than r = .03, but the point here is that even what appears to be a very small correlation can still be very meaningful.↩︎