# Data Collection Basics

### Learning Outcomes

- Determine whether a value calculated from a group is a statistic or a parameter
- Identify the difference between a census and a sample
- Identify the population of a study
- Determine whether a measurement is categorical or quantitative

In this lesson we will introduce some important terminology related to collecting data. When you are finished you will be able to identify the difference between terms like census and sample. In the following lessons we will rely on your understanding of these terms, so study well!

## Populations and Samples

### Selecting A Focus

Before we begin gathering and analyzing data we need to characterize the **population** we are studying. If we want to study the amount of money spent on textbooks by a typical first-year college student, our population might be all first-year students at your college. Or it might be:

- All first-year community college students in the state of Washington.
- All first-year students at public colleges and universities in the state of Washington.
- All first-year students at all colleges and universities in the state of Washington.
- All first-year students at all colleges and universities in the entire United States.
- And so on.

### Population

The **population** of a study is the group the collected data is intended to describe.

Sometimes the intended population is called the **target population**, since if we design our study badly, the collected data might not actually be representative of the intended population.

Why is it important to specify the population? We might get different answers to our question as we vary the population we are studying. First-year students at the University of Washington might take slightly more diverse courses than those at your college, and some of these courses may require less popular textbooks that cost more; or, on the other hand, the University Bookstore might have a larger pool of used textbooks, reducing the cost of these books to the students. Whichever the case (and it is likely that some combination of these and other factors is in play), the data we gather from your college will probably not be the same as that from the University of Washington. Particularly when conveying our results to others, we want to be clear about the population we are describing with our data.

### Example

A newspaper website contains a poll asking people their opinion on a recent news article.

What is the population?

**Solution:**

While the target (intended) population may have been all people, the real population of the survey is readers of the website.

If we were able to gather data on every member of our population, say the amount of money spent on textbooks by each first-year student at your college during the 2009-2010 academic year, and then computed a value such as the average (we will define “average” more carefully in a subsequent section), the resulting number would be called a **parameter**.

### Parameter

A **parameter** is a value (average, percentage, etc.) calculated using all the data from a population.

We seldom see parameters, however, since surveying an entire population is usually very time-consuming and expensive, unless the population is very small or we already have the data collected.

### Census

A survey of an entire population is called a **census**.

You are probably familiar with two common censuses: the official government Census that attempts to count the population of the U.S. every ten years, and voting, which asks the opinion of all eligible voters in a district. The first of these demonstrates one additional problem with a census: the difficulty in finding and getting participation from everyone in a large population, which can bias, or skew, the results.

There are occasionally times when a census is appropriate, usually when the population is fairly small. For example, if the manager of a Starbucks store wanted to know the average number of hours her employees worked last week, she should be able to pull up payroll records or ask each employee directly.

Since surveying an entire population is often impractical, we usually select a **sample** to study.

### Sample

A **sample** is a smaller subset of the entire population, ideally one that is fairly representative of the whole population.

We will discuss sampling methods in greater detail in a later section. For now, let us assume that samples are chosen in an appropriate manner. If we survey a sample, say 100 first-year students at your college, and find the average amount of money spent by these students on textbooks, the resulting number is called a **statistic**.

### Statistic

A **statistic** is a value (average, percentage, etc.) calculated using the data from a sample.
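The difference between a parameter and a statistic can be sketched in a few lines of Python. The spending figures below are invented for illustration (they are not real survey data), and the population size of 2,000 students is an assumption.

```python
import random
from statistics import mean

# Hypothetical population: textbook spending (in dollars) for every
# first-year student at a college. These numbers are invented.
rng = random.Random(42)  # fixed seed so the example is repeatable
population = [round(rng.uniform(200, 900), 2) for _ in range(2000)]

# A census uses every member of the population, so this average is a parameter.
parameter = mean(population)

# A sample is a subset of the population; its average is a statistic.
sample = rng.sample(population, 100)
statistic = mean(sample)

print(f"Parameter (all {len(population)} students): ${parameter:.2f}")
print(f"Statistic (sample of {len(sample)} students): ${statistic:.2f}")
```

The statistic will usually be close to the parameter but rarely identical to it, and a different sample would give a different statistic.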

### Example

A researcher wanted to know how citizens of Tacoma felt about a voter initiative. To study this, she went to the Tacoma Mall, randomly selected 500 shoppers, and asked them their opinion. Of those asked, 60% indicated they were supportive of the initiative. What are the sample and the population? Is the 60% value a parameter or a statistic?

**Solutions:**

The sample is the 500 shoppers questioned. The population is less clear. While the intended population of this survey was Tacoma citizens, the effective population was mall shoppers. There is no reason to assume that the 500 shoppers questioned would be representative of all Tacoma citizens.

The 60% value was based on the sample, so it is a statistic.
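As a rough sketch, the percentage in this example could be computed as shown below. The example reports only the 60% figure, so the list of individual responses is an assumption, reconstructed from the fact that 60% of 500 is 300.

```python
# Hypothetical responses: the example reports only the 60% figure, so the
# individual answers here are reconstructed (300 of 500 supportive).
responses = ["support"] * 300 + ["oppose or undecided"] * 200

sample_size = len(responses)
supportive = responses.count("support")

# A percentage computed from a sample is a statistic.
percent_supportive = 100 * supportive / sample_size
print(f"{percent_supportive:.0f}% of the {sample_size} shoppers sampled are supportive")
```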


### Try It

To determine the average length of trout in a lake, researchers catch 20 fish and measure them. What are the sample and the population in this study?

**Solution:**

The sample is the 20 fish caught. The population is all trout in the lake. The sample may be somewhat unrepresentative of the population, since not all fish may be large enough to take the bait.

A college reports that the average age of their students is 28 years old. Is this a statistic or a parameter?

**Solution:**

This is a parameter, since the college would have access to data on all students (the population).

## Categorizing Data

### Quantitative or Categorical

Once we have gathered data, we might wish to classify it. Roughly speaking, data can be classified as categorical data or quantitative data.

### Quantitative and categorical data

**Categorical (qualitative) data** are pieces of information that allow us to classify the objects under investigation into various categories.

**Quantitative data** are responses that are numerical in nature and with which we can perform meaningful arithmetic calculations.

### Example

We might conduct a survey to determine the name of the favorite movie that each person in a math class saw in a movie theater.

When we conduct such a survey, the responses would look like: *Finding Nemo*, *The Hulk*, or *Terminator 3: Rise of the Machines*. We might count the number of people who give each answer, but the answers themselves do not have any numerical values: we cannot perform computations with an answer like “*Finding Nemo*.” Is this categorical or quantitative data?

**Solution:**

This would be categorical data.

### Example

A survey could ask the number of movies you have seen in a movie theater in the past 12 months (0, 1, 2, 3, 4, . . .). Is this categorical or quantitative data?

**Solution:**

This would be quantitative data. Other examples of quantitative data would be the running time of the movie you saw most recently (104 minutes, 137 minutes, 104 minutes, . . .) or the amount of money you paid for a movie ticket the last time you went to a movie theater ($5.50, $7.75, $9, . . .).
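A short Python sketch shows what "meaningful arithmetic" looks like for quantitative data. It uses only the three running times and three ticket prices listed above; the variable names are chosen for illustration.

```python
from statistics import mean

# Values taken from the examples in the text.
running_times = [104, 137, 104]      # minutes
ticket_prices = [5.50, 7.75, 9.00]   # dollars

# Averages and differences are meaningful for quantitative data.
average_running_time = mean(running_times)
average_ticket_price = mean(ticket_prices)
price_range = max(ticket_prices) - min(ticket_prices)

print(f"Average running time: {average_running_time} minutes")
print(f"Average ticket price: ${average_ticket_price:.2f}")
print(f"Difference between highest and lowest price: ${price_range:.2f}")
```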

Sometimes, determining whether data is categorical or quantitative can be a bit trickier. In the next example, the data collected is in numerical form, but it is not quantitative data. Read on to find out why.

### Example

Suppose we gather respondents’ ZIP codes in a survey to track their geographical location. Is this categorical or quantitative?

**Solution:**

ZIP codes are numbers, but we can’t do any meaningful mathematical calculations with them (it doesn’t make sense to say that 98036 is “twice” 49018 — that’s like saying that Lynnwood, WA is “twice” Battle Creek, MI, which doesn’t make sense at all), so ZIP codes are really categorical data.
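The same point can be sketched in Python. The list of responses is hypothetical and reuses the two ZIP codes from the example. The computer will happily average the numbers, but only the counts per category tell us anything.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey responses, using the two ZIP codes from the example.
zip_codes = [98036, 49018, 98036, 98036, 49018]

# Arithmetic runs without error, but the result describes no real place
# or respondent, so it is not meaningful.
meaningless_average = mean(zip_codes)
print(meaningless_average)

# Counting how many respondents fall into each category is meaningful.
counts = Counter(zip_codes)
print(counts.most_common())
```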

### Example

A survey about the movie you most recently attended includes the question “How would you rate the movie you just saw?” with these possible answers:

1 – it was awful

2 – it was just OK

3 – I liked it

4 – it was great

5 – best movie ever!

Is this categorical or quantitative?

**Solution:**

Again, there are numbers associated with the responses, but we can’t really do any calculations with them. A movie that rates a 4 is not necessarily twice as good as a movie that rates a 2, whatever that means. If two people see the movie and one of them thinks it stinks while the other thinks it’s the best ever, it doesn’t necessarily make sense to say that “on average they liked it.” The numbers are labels for ordered categories, so this is categorical data.

As we study movie-going habits and preferences, we shouldn’t forget to specify the population under consideration. If we survey 3-7 year-olds the runaway favorite might be *Finding Nemo*. 13-17 year-olds might prefer *Terminator 3*. And 33-37 year-olds might prefer . . . well, *Finding Nemo*.


### Try It

Classify each measurement as categorical or quantitative.

- Eye color of a group of people
- Daily high temperature of a city over several weeks
- Annual income

**Solutions:**

- Eye color: categorical
- Daily high temperature: quantitative
- Annual income: quantitative

Attributions

This chapter contains material taken from *Math in Society* (on OpenTextBookStore) by David Lippman, and is used under a CC Attribution-Share Alike 3.0 United States (CC BY-SA 3.0 US) license.

This chapter contains material taken from *Math for the Liberal Arts* (on Lumen Learning) by Lumen Learning, and is used under a *CC BY: Attribution* license.