2.4 Understanding Distributions and Variability
Descriptive statistics are useful, but a single number never tells the whole story. A mean, median, or standard deviation becomes much more meaningful when we understand the shape and spread of the data behind it.
That is where distributions and variability come in.
By the end of this section, you should be able to:
- explain what a distribution is
- describe why variability matters
- distinguish among range, interquartile range, variance, and standard deviation
- explain skew and why it affects interpretation
- recognize why outliers should be examined before analysis
- explain why visualizing data is an important part of statistical reasoning
What Is a Distribution?
A distribution shows how values are arranged for a variable. It tells us which values are common, which values are rare, and whether the values are clustered, spread out, symmetric, or lopsided.
For example, imagine two classes both have an average exam score of 80. In one class, almost everyone scored between 78 and 82. In the other class, some students scored near 50 and others scored near 100. The mean is the same, but the distributions are very different.
This is why we do not want to rely only on one summary statistic. We need to understand the pattern of values.
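To make the two-class example concrete, here is a minimal sketch in Python. The scores are invented for illustration (they are not real data), and they were chosen so that both classes average exactly 80.

```python
from statistics import mean, stdev

# Hypothetical exam scores, invented for illustration only.
class_a = [78, 79, 80, 80, 80, 81, 82]      # tightly clustered around 80
class_b = [50, 52, 58, 100, 100, 100, 100]  # some near 50, others at 100

# Both classes have the same mean...
print(mean(class_a), mean(class_b))

# ...but very different spread (sample standard deviation).
print(round(stdev(class_a), 1))  # roughly 1.3
print(round(stdev(class_b), 1))  # roughly 25.1
```

The means match, yet the standard deviations differ by a factor of almost twenty, which is exactly the kind of difference a single summary number hides.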
Measures of Variability
Variability describes how spread out the values are. There are several common ways to summarize variability.
The range is the difference between the maximum and minimum values. If the minimum score is 17 and the maximum score is 49, the range is 49 - 17 = 32.
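As a quick sketch, the range can be computed in Python in one line. The scores in between the minimum and maximum are made up; only the 17 and 49 come from the example above.

```python
# Hypothetical scores; only the minimum (17) and maximum (49) matter for the range.
scores = [17, 23, 31, 38, 42, 49]

score_range = max(scores) - min(scores)
print(score_range)  # 32
```

Notice that the range depends on only two values, so a single extreme score can change it dramatically.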
The interquartile range, or IQR, describes the middle 50% of the data. It is the distance between the first quartile (Q1, the 25th percentile) and the third quartile (Q3, the 75th percentile). The IQR is often useful when data are skewed or include outliers because it focuses on the middle of the distribution.
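The sketch below shows one way to compute the IQR in Python using the standard library's `statistics.quantiles`. The data are invented, and the `method="inclusive"` setting is an assumption: statistical software uses several slightly different conventions for quartiles, so other tools may report slightly different values.

```python
from statistics import quantiles

# Hypothetical data with one extreme value (95), invented for illustration.
data = [10, 12, 13, 15, 16, 18, 19, 21, 95]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
data_range = max(data) - min(data)

print(q1, q3)      # 13.0 19.0
print(iqr)         # 6.0
print(data_range)  # 85
```

The extreme value of 95 inflates the range to 85, but the IQR stays at 6 because it looks only at the middle half of the data.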
The variance measures how far values fall from the mean. More specifically, it is (roughly) the average of the squared deviations from the mean. The standard deviation is the square root of the variance. In practice, students usually find the standard deviation easier to interpret because it is in the original units of the variable.
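For readers who like to see the arithmetic, here is a small sketch of the variance and standard deviation in Python. The data are invented. One detail goes slightly beyond the description above: software usually distinguishes a population version (divide by n) from a sample version (divide by n - 1), and the standard library offers both.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance, variance

# Hypothetical values, invented for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Step by step: deviations from the mean, squared, then averaged.
m = mean(data)                                      # 5
squared_deviations = [(x - m) ** 2 for x in data]   # sums to 32
manual_variance = sum(squared_deviations) / len(data)
manual_sd = sqrt(manual_variance)

print(manual_variance, manual_sd)  # 4.0 2.0

# The library's population versions agree with the manual calculation.
print(pvariance(data), pstdev(data))

# The sample version divides by n - 1 instead of n, so it is slightly larger.
print(variance(data))  # 32 / 7
```

The standard deviation of 2 is in the same units as the data, which is why it is usually easier to interpret than the variance of 4.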
I sometimes include equations in this book because formulas help some people understand what is happening behind the scenes. However, I rarely expect you to calculate statistics by hand. The more important goal is to understand what the statistic means and how to interpret it.
Shape of a Distribution
The shape of a distribution tells us how the values are arranged.
A normal distribution is symmetric, with values clustered around the center and fewer values farther away from the center. It is sometimes called a bell curve.
Many statistical ideas are easier to explain using a normal distribution, but real data are not always perfectly normal. That is okay. The important thing is to learn how to recognize and think about the shape of your data.
Skew
Skew refers to asymmetry in a distribution. A distribution is skewed when one tail is longer than the other.
- In a positively skewed distribution, the long tail points toward higher values. Most values are on the lower end.
- In a negatively skewed distribution, the long tail points toward lower values. Most values are on the higher end.
Skew matters because it can affect which measure of center is most useful. In a strongly skewed distribution, the mean can be pulled toward the long tail. The median may better represent a typical value.
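A small sketch can show how a long right tail pulls the mean away from the median. The incomes below are invented (in thousands of dollars), with one very high value creating positive skew.

```python
from statistics import mean, median

# Hypothetical incomes in thousands of dollars, invented for illustration.
# One very high value creates a long right tail (positive skew).
incomes = [30, 32, 35, 38, 40, 42, 45, 250]

print(mean(incomes))    # 64
print(median(incomes))  # 39.0
```

Seven of the eight people earn 45 or less, so the median of 39 describes a typical person far better than the mean of 64.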
Kurtosis
Kurtosis describes the weight of the tails of a distribution relative to a normal distribution. In everyday terms, it tells us whether extreme values are more common (heavier tails) or less common (lighter tails) than a normal distribution would lead us to expect.
There are some fancy terms related to kurtosis that you may hear, such as leptokurtic and platykurtic. Honestly, I do not hear researchers use those terms very often in applied work. For this book, the main point is that kurtosis can signal unusual tail behavior, including the possibility of more extreme values.
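If you are curious how kurtosis is calculated, the sketch below uses one common definition, excess kurtosis: the average fourth-power deviation divided by the squared variance, minus 3 so that a normal distribution scores about 0. The function name `excess_kurtosis` and both datasets are made up for illustration, and statistical packages often apply additional sample-size corrections, so their output may differ somewhat.

```python
from statistics import mean

def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (hypothetical helper)."""
    m = mean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n  # population variance
    m4 = sum((x - m) ** 4 for x in values) / n  # average fourth-power deviation
    return m4 / m2 ** 2 - 3

# Hypothetical data, invented for illustration.
light_tails = list(range(1, 11))                    # evenly spread, no extremes
heavy_tails = [-10, -1, -1, 0, 0, 0, 0, 1, 1, 10]   # mostly central, two extremes

print(excess_kurtosis(light_tails) < 0)  # True: lighter tails than normal
print(excess_kurtosis(heavy_tails) > 0)  # True: heavier tails than normal
```

A positive value hints at more extreme values than a normal distribution would produce, which is a cue to look closely at the tails.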
Outliers
An outlier is an unusually extreme value compared with the rest of the data.
Outliers are not automatically errors. Sometimes they are real and meaningful. For example, if you are studying income, a very high income value may be accurate. If you are studying reaction time, an extremely slow response might mean the participant got distracted.
But outliers are worth examining because they can affect means, standard deviations, graphs, assumptions, and statistical tests. When you find an outlier, the goal is not to delete it automatically. The goal is to understand it.
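One common convention for flagging values to examine is the 1.5 × IQR rule, which marks anything more than 1.5 IQRs below Q1 or above Q3. That rule is not described above, so treat it as an assumption brought in for this sketch, and the reaction times are invented. The code only flags values for a closer look; it does not delete anything.

```python
from statistics import quantiles

# Hypothetical reaction times in milliseconds, invented for illustration.
reaction_times = [310, 325, 330, 342, 350, 361, 375, 390, 1450]

# Quartile conventions vary across software; "inclusive" is one option.
q1, _, q3 = quantiles(reaction_times, n=4, method="inclusive")
iqr = q3 - q1

# Fences: values beyond these are flagged for inspection, not removal.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

flagged = [t for t in reaction_times if t < lower_fence or t > upper_fence]
print(lower_fence, upper_fence)  # 262.5 442.5
print(flagged)                   # [1450]
```

The next step is a judgment call, not a calculation: check whether the flagged value is a recording error, a distracted participant, or a genuine response before deciding what to do with it.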
Why Visualizing Data Matters
Graphs help us see patterns that descriptive statistics can hide. Histograms, density plots, boxplots, and scatterplots can show us shape, spread, clusters, gaps, and outliers.
You will learn more about data visualization later, but the habit starts here: always look at your data. Numbers are useful, but graphs often reveal what the numbers alone do not.
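Even without graphing software, you can get a rough look at the shape of a variable. The sketch below prints a simple text histogram using only Python's standard library. The scores and the bin width of 10 are arbitrary choices made for illustration.

```python
from collections import Counter

# Hypothetical exam scores, invented for illustration.
scores = [52, 58, 61, 64, 67, 68, 71, 72, 73, 75, 76, 78, 79, 81, 83, 84, 88, 95]

bin_width = 10  # arbitrary choice; each bin covers 10 points

# Map each score to the start of its bin (e.g., 67 -> 60) and count per bin.
counts = Counter((s // bin_width) * bin_width for s in scores)

# Print one row per bin, including any empty bins in between.
for start in range(min(counts), max(counts) + bin_width, bin_width):
    label = f"{start}-{start + bin_width - 1}"
    print(f"{label:<6} {'#' * counts[start]}")
```

Each row is one bin and each `#` is one score, so the peak in the 70s and the thinner tails on either side are visible at a glance.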
Looking Ahead
Distributions and variability will come back throughout the book. You will use them when you:
- describe continuous variables
- choose graphs
- check statistical assumptions
- interpret standard deviations and effect sizes
- decide whether results are meaningful in context
In other words, this is not just background information. It is part of the logic you will use every time you analyze data.
Review Questions
- Why can two datasets have the same mean but still look very different?
- What does skew tell us about a distribution?
- Why should outliers be examined rather than automatically deleted?
Answers
- They may have different variability or different shapes. One dataset might be tightly clustered around the mean while another is widely spread out.
- Skew tells us that a distribution is asymmetrical, with one tail longer than the other.
- Outliers may be errors, but they may also be real and meaningful values. We should understand them before deciding what to do.