2.1 Describing data

Before we can analyze data, we need to understand what our data represent and how they are organized.

At its core, statistics is about using data to answer questions. Data are pieces of information we collect about people, behaviors, or outcomes of interest.

First, let’s understand some basic statistics related to how we describe our data, including measures of central tendency (averages), measures of dispersion (spread), and measures of shape of the distribution (particularly a normal distribution). Here’s a video walking you through what we learn in this chapter.

Learning Objectives

By the end of this section, you should be able to:

Define what data are and explain why context matters
Identify observations (rows) and variables (columns) in a dataset
Distinguish between categorical and numerical variables
Describe how datasets are structured for analysis
Recognize features of a well-organized dataset

What Are Data?

In research, data are typically organized in a dataset (often a spreadsheet format), where:

Rows represent individual observations (e.g., participants)
Columns represent variables (e.g., age, test score, group membership)

Each value in the dataset tells us something about a specific observation on a specific variable.
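
As a rough illustration, the same rows-and-columns layout can be sketched in a few lines of Python. Python is used here only for illustration; in jamovi you would see the same layout as a spreadsheet. The participants, variable names, and values are invented for this example.

```python
# A tiny, made-up dataset: each dictionary is one row (one participant),
# and each key is one column (one variable).
dataset = [
    {"participant": 1, "age": 19, "group": "control", "test_score": 82},
    {"participant": 2, "age": 22, "group": "treatment", "test_score": 90},
    {"participant": 3, "age": 20, "group": "control", "test_score": 75},
]

# Number of observations (rows) and the names of the variables (columns).
n_rows = len(dataset)
variables = list(dataset[0].keys())

# One value = one observation on one variable:
# here, the test score of the third participant.
value = dataset[2]["test_score"]

print(n_rows, variables, value)
```

Reading across one dictionary gives everything recorded about one participant. Reading the same key down all of the dictionaries gives every participant's value on one variable.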

Is “data” singular or plural?

In formal scientific writing, including APA style, “data” is treated as plural (the singular form is datum). For example, “The data are consistent with the hypothesis.”

However, in common usage, “data” is often treated as singular when referring to a set or collection of information. For example, “The data is stored in a spreadsheet.”

In this textbook, you will see both: “data” is often plural in formal statements (e.g., “data are organized in a dataset”) and sometimes singular for ease of readability. Research articles and formal writing usually treat “data” as plural, so it’s helpful to be familiar with both conventions.

Data Need Context

Data are only meaningful when we understand what they represent.

For example, a value of “75” could represent:

A test score
A heart rate
A temperature

To interpret data correctly, we need to know:

What each variable represents
How it was measured
What the values mean

This is why clear variable names and documentation are important when working with datasets.
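
One common way to document a dataset is a codebook (sometimes called a data dictionary) that records, for each variable, what it represents, how it was measured, and what its values mean. The sketch below is only an assumption about what such a codebook might look like. The variable names, descriptions, and measurement details are hypothetical.

```python
# A hypothetical codebook: one entry per variable, recording what the
# variable represents, how it was measured, and what its values mean.
codebook = {
    "heart_rate": {
        "description": "Resting heart rate",
        "measured_by": "Wrist monitor after 5 minutes of sitting",
        "units": "beats per minute",
    },
    "test_score": {
        "description": "Final exam score",
        "measured_by": "Written exam graded by the instructor",
        "units": "points out of 100",
    },
}


def describe_value(variable, value):
    """Attach context from the codebook to a raw value."""
    entry = codebook[variable]
    return f"{value} {entry['units']} ({entry['description']})"


# The same raw number means different things for different variables.
print(describe_value("heart_rate", 75))
print(describe_value("test_score", 75))
```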

How to Read and Work With Data

To understand and use a dataset effectively, it’s important to recognize how data are organized. Well-structured datasets follow a few key principles:

Rows represent observations (e.g., one participant, one trial, or one case)
Columns represent variables (e.g., age, condition, score)
Each cell contains a single value for one observation on one variable
Variable names are clear and consistent, so you know what each column represents
Missing data are clearly labeled (e.g., NA) rather than left blank (jamovi treats blank cells as missing, but an explicit label makes the gap obvious to anyone reading the file)
Data are stored in a clean, rectangular format, making them easier to analyze

Understanding this structure will help you read datasets, identify variables, and prepare your data for analysis in tools like jamovi.

Following these principles also helps prevent errors and makes data easier to analyze, share, and reproduce (Broman & Woo, 2018).
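
Some of these principles can be checked automatically. The sketch below uses Python's built-in csv module on a small invented dataset to check that the data are rectangular, to count cells labeled NA in each column, and to count cells left blank. The data and the choice of "NA" as the missing-value label are assumptions for this example.

```python
import csv
import io

# Made-up data in CSV form; "NA" marks a missing value.
raw = """participant,age,condition,score
1,19,control,82
2,NA,treatment,90
3,20,control,75
"""

table = list(csv.reader(io.StringIO(raw)))
header, records = table[0], table[1:]

# Rectangular: every row has exactly one cell per column.
is_rectangular = all(len(record) == len(header) for record in records)

# Missing values that are clearly labeled, counted per variable.
missing_counts = {
    name: sum(1 for record in records if record[i] == "NA")
    for i, name in enumerate(header)
}

# Cells that were left blank instead of being labeled.
blank_cells = sum(
    1 for record in records for cell in record if cell.strip() == ""
)

print(is_rectangular, missing_counts, blank_cells)
```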

What Is a Variable?

A variable is anything we measure or record that can vary across individuals or observations.

Examples of variables include:

Age
Gender
Test scores
Survey responses

Some variables describe characteristics of participants, while others represent outcomes we are interested in understanding.

You will learn more about different types of variables in the next section.

Types of Data

Not all variables are the same. Broadly, variables fall into two categories:

Categorical variables describe groups or categories (e.g., gender, major, treatment condition)
Numerical variables represent quantities or amounts (e.g., age, height, test scores)

Understanding the type of data you are working with will help you decide how to summarize and analyze it. You’ll learn more about this and how to apply it in future chapters.
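
As a small preview, here is a sketch of why the type of variable matters. It makes sense to count how many observations fall into each category of a categorical variable, and to average a numerical variable. The majors and ages below are invented for this example.

```python
from collections import Counter
from statistics import mean

# Made-up values for four participants.
majors = ["psychology", "biology", "psychology", "history"]  # categorical
ages = [19, 22, 20, 23]                                      # numerical

# Categorical variable: count how many observations fall in each category.
major_counts = Counter(majors)

# Numerical variable: an average is meaningful.
average_age = mean(ages)

print(major_counts["psychology"], average_age)
```

Averaging the majors would not make sense, which is one reason to identify a variable's type before choosing a summary.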

Why Do We Describe Data?

Once we collect data, the first step is to describe what we have. Raw data (a spreadsheet full of numbers) can be difficult to interpret on their own. Descriptive statistics help us:

Summarize large amounts of data
Identify patterns or trends
Understand what is typical or unusual
Prepare for further analysis

What Does It Mean to Describe Data?

When we describe data, we are typically trying to understand three key features:

Center (What is typical?): This tells us what a “typical” value looks like, often summarized using measures like the mean or median. These are called measures of central tendency.
Variability (How spread out are the values?): This tells us how much the data differ from one another. Some datasets are tightly clustered, while others are more spread out. These are called measures of dispersion.
Shape (What does the distribution look like?): This refers to how the data are distributed—for example, whether values are evenly spread out or skewed in one direction.

You will learn how to calculate and interpret these in later chapters.
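
For a first look at what these three features mean in practice, here is a minimal sketch using Python's built-in statistics module. The scores are invented, and the shape check is only a rough rule of thumb (comparing the mean with the median), not a formal measure of skewness.

```python
from statistics import mean, median, stdev

# Made-up test scores; the 98 is deliberately much higher than the rest.
scores = [70, 72, 75, 75, 78, 80, 98]

# Center: what is typical?
center_mean = mean(scores)
center_median = median(scores)

# Variability: how spread out are the values? (sample standard deviation)
spread = stdev(scores)

# Shape: a rough hint only. A mean above the median often goes with a
# longer tail of high values; a mean below it, a longer tail of low values.
if center_mean > center_median:
    shape_hint = "possible right skew"
elif center_mean < center_median:
    shape_hint = "possible left skew"
else:
    shape_hint = "roughly symmetric"

print(round(center_mean, 2), center_median, round(spread, 2), shape_hint)
```

In this example the one very high score pulls the mean above the median, which is why the two measures of center can tell slightly different stories.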

Types of Descriptive Information

There are two main ways we describe data:

Numerical summaries, using measures of central tendency (e.g., mean, median, mode) and dispersion (e.g., standard deviation, variance).
Visual summaries, using graphs and charts (e.g., histograms, bar charts) which help us quickly see patterns and differences.
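
To show the idea behind a histogram, the sketch below groups invented test scores into bins of width 10 and prints one # per score. This is a plain-text stand-in for the graphs you would normally make in jamovi, and the bin width is an arbitrary choice for this example.

```python
from collections import Counter

# Made-up test scores.
scores = [62, 67, 71, 74, 75, 78, 79, 83, 85, 91]

# Group scores into bins of width 10 (60-69, 70-79, and so on),
# keyed by the lowest score in each bin.
bins = Counter((score // 10) * 10 for score in scores)

# Print one row per bin, with one # for each score in that bin.
for start in sorted(bins):
    print(f"{start}-{start + 9}: {'#' * bins[start]}")
```

Even in this simple form, the picture shows at a glance that most scores fall in the 70s, which is harder to see from the list of numbers alone.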

Looking Ahead

Describing data is always the first step in any statistical analysis.

You will use these ideas when you:

Summarize your data (Chapter 4)
Visualize patterns (Chapter 5)
Begin hypothesis testing (Chapter 7)
Interpret results from statistical tests (Chapters 11–14)

Understanding your data before analyzing it will help you make better decisions and avoid mistakes.

Key Takeaways

Data are organized into rows (observations) and columns (variables)
Variables are characteristics or outcomes that can vary
Describing data involves understanding center, variability, and shape
Descriptive statistics summarize and simplify data
Describing data is the first step in any analysis

Check Your Understanding

Answer the following questions to check your understanding of this section:

In a dataset, what does each row represent? What does each column represent?
Give an example of a categorical variable and a numerical variable.
Why is it important that each cell contains only one value?
What are two features of a well-organized dataset?
Why do data need context to be meaningful?

Answers

Rows represent observations (e.g., participants), and columns represent variables.
Categorical: major. Numerical: age.