Lesson 15

Comparing Data Sets

  • Let’s compare statistics for data sets.

15.1: Bowling Partners

Each histogram shows the bowling scores for the last 25 games played by each person. Choose 2 of these people to join your bowling team. Explain your reasoning.

Person A

  • mean: 118.96
  • median: 111
  • standard deviation: 32.96
  • interquartile range: 44
Histogram for bowler A

Person B

  • mean: 131.08
  • median: 129
  • standard deviation: 8.64
  • interquartile range: 8
Histogram for bowler B

Person C

  • mean: 133.92
  • median: 145
  • standard deviation: 45.04
  • interquartile range: 74
Histogram for bowler C

Person D

  • mean: 116.56
  • median: 103
  • standard deviation: 56.22
  • interquartile range: 31.5
Histogram for bowler D

15.2: Comparing Marathon Times

The dot plots show the finishing times of all the marathon runners in each of two different age groups.

Dot plot from 220 to 460 by 20’s. Ages 30 through 39 marathon finish times in minutes. Beginning at 220 up to but not including 240, number of dots in each interval is 1, 11, 10, 10, 5, 4, 4, 5, 0, 0, 0, 0.
Dot plot from 220 to 460 by 20’s. Ages 40 through 49 marathon finish times in minutes. Beginning at 220 up to but not including 240, number of dots in each interval is 0, 1, 7, 5, 4, 5, 4, 3, 5, 1, 6, 3.
  1. Which age group tends to take longer to run the marathon? Explain your reasoning.
  2. Which age group has more variable finish times? Explain your reasoning.


  1. How do you think finish times for a 20–29 age range will compare to these two distributions?

  2. Find some actual marathon finish times for this group and make a box plot of your data to help compare.

15.3: Comparing Measures

For each group of data sets,

  • Determine the best measure of center and measure of variability to use based on the shape of the distribution.
  • Determine which set has the greatest measure of center.
  • Determine which set has the greatest measure of variability.
  • Be prepared to explain your reasoning.

1a

Dot plot from negative 16 to negative 3 by 1's. Distribution 1a. Beginning at negative 12, number of dots above each increment is 6, 4, 3, 2, 1, 2, 3, 4, 6.

1b

Dot plot from negative 16 to negative 3 by 1's. Distribution 1b. Beginning at negative 14, number of dots above each increment is 1, 2, 4, 5, 7, 5, 4, 2, 1, 0, 0, 0.

2a

Dot plot from 11 to 33 by 1's. Distribution 2a. Beginning at 13, number of dots above each increment is 1, 1, 2, 2, 2, 3, 3, 4, 3, 3, 2, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0.

2b

Dot plot from 11 to 33 by 1's. Distribution 2b. Beginning at 27, number of dots above each increment is 1, 5, 6, 8, 6, 5, 1.

3a

Dot plot from 0 to 12 by 1's. Distribution 3a. Beginning at 0, number of dots above each increment is 0, 3, 2, 1, 1, 0, 2, 2, 3, 3, 5, 4.

3b

Dot plot from 0 to 12 by 0.5's. Distribution 3b. Beginning at 0, number of dots above each increment is 0, 4, 5, 3, 3, 2, 2, 0, 1, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.
 

4a

Dot plot from 78 to 112 by 2's. Distribution 4a. Beginning at 78, number of dots above each increment is 0, 0, 0, 0, 0, 0, 2, 2, 3, 3, 4, 5, 4, 3, 3, 2, 2.

4b

Dot plot from 78 to 112 by 2's. Distribution 4b. Beginning at 78, number of dots above each increment is 0, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0. 

5a

Box plot from 0 to 1,200 by 100's. Distribution 5a. Whisker from 500 to 600. Box from 600 to 900 with vertical line at 700. Whisker from 900 to 1100.

5b

Box plot from 0 to 1,200 by 100's. Distribution 5b. Whisker from 200 to 450. Box from 450 to 650 with vertical line at 500. Whisker from 650 to 700.

6a

A political podcast has mostly reviews that either love the podcast or hate it.

6b

A cooking podcast has reviews that neither hate nor love the podcast.

7a

Stress testing concrete from site A has all 12 samples break at 450 pounds per square inch (psi).

7b

Stress testing concrete from site B has samples break every 10 psi starting at 450 psi until the last core is broken at 560 psi. 

7c

Stress testing concrete from site C has 6 samples break at 430 psi and the other 6 break at 460 psi.

Summary

To compare data sets, it is helpful to look at the measures of center and measures of variability. The shape of the distribution can help choose the most useful measure of center and measure of variability.

When distributions are symmetric or approximately symmetric, the mean is the preferred measure of center and should be paired with the standard deviation as the preferred measure of variability. When distributions are skewed or when outliers are present, the median is usually a better measure of center and should be paired with the interquartile range (IQR) as the preferred measure of variability.

Once the appropriate measure of center and measure of variability are selected, these measures can be compared for data sets with similar shapes.
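
As a rough illustration of the pairing described above, here is a minimal Python sketch. The function names `quartiles` and `summarize`, the `symmetric` flag, and both data lists are hypothetical, made up for this example. It assumes quartiles are found as the medians of the lower and upper halves of the sorted data, and it assumes the population form of the standard deviation (`statistics.pstdev`); other conventions give slightly different values.

```python
from statistics import mean, median, pstdev


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumes at least two values. When the count is odd, the middle
    value is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])


def summarize(values, symmetric):
    """Pair mean with standard deviation, or median with IQR.

    `symmetric` is a judgment made by looking at the plot; this sketch
    does not try to detect the shape automatically.
    """
    if symmetric:
        return {"center": mean(values), "spread": pstdev(values)}
    q1, q3 = quartiles(values)
    return {"center": median(values), "spread": q3 - q1}


# Hypothetical data, invented for illustration only.
roughly_symmetric = [2, 3, 3, 4, 4, 4, 5, 5, 6]
skewed_right = [1, 1, 2, 2, 3, 4, 12]

print(summarize(roughly_symmetric, symmetric=True))
print(summarize(skewed_right, symmetric=False))
```

In the skewed list, the single large value 12 pulls the mean (about 3.6) above the median (2), which is why the median and IQR are usually the better description of a set like that.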

For example, let’s compare the number of seconds it takes football players at two different positions to complete a 40-yard dash. First, we can look at dot plots of the data. The tight end times do not seem symmetric, so we should probably find the median and IQR for both sets of data and compare those.

Dot plot from 4.25 to 5.75 by 0.25’s. Wide receiver times in seconds. Beginning at 4.25 up to but not including 4.5, number of dots in each interval is 12, 11, 2, 0, 0, 0.
 
Dot plot from 4.25 to 5.75 by 0.25’s. Tight end times in seconds. Beginning at 4.25 up to but not including 4.5, number of dots in each interval is 0, 10, 6, 4, 3, 1.
 

The median and IQR could be computed from the values, but can also be determined from a box plot.

Box plot for wide receiver times.
Box plot for tight end times.

This shows that the tight end times have a greater median (about 4.9 seconds) compared to the median of wide receiver times (about 4.5 seconds). The IQR is also greater for the tight end times (about 0.5 seconds) compared to the IQR for the wide receiver times (about 0.25 seconds).

This means that the tight ends tend to be slower in the 40-yard dash when compared to the wide receivers. The tight ends also have greater variability in their times. Together, this can be taken to mean that, in general, a typical wide receiver is faster than a typical tight end, and the wide receivers tend to have more similar times to one another than the tight ends do to one another.
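
The comparison above can be sketched in Python, assuming the individual times are available as lists. The two lists below are hypothetical and are not the values behind the dot plots; the names `five_number_summary` and `median_and_iqr` are also made up for this example. The quartiles are taken as the medians of the lower and upper halves, which is one common classroom convention.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot displays.

    Q1 and Q3 are the medians of the lower and upper halves of the
    sorted data. Assumes at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    return ordered[0], q1, median(ordered), q3, ordered[-1]


def median_and_iqr(values):
    """Return the median and the interquartile range (Q3 - Q1)."""
    _, q1, med, q3, _ = five_number_summary(values)
    return med, q3 - q1


# Hypothetical times in seconds; NOT the data from the dot plots above.
group_a = [4.3, 4.4, 4.4, 4.5, 4.5, 4.5, 4.6, 4.7]
group_b = [4.6, 4.7, 4.8, 4.9, 4.9, 5.1, 5.3, 5.6]

med_a, iqr_a = median_and_iqr(group_a)
med_b, iqr_b = median_and_iqr(group_b)

# A greater median means a slower typical time;
# a greater IQR means more variable times.
print("group B has the slower typical time:", med_b > med_a)
print("group B has more variable times:", iqr_b > iqr_a)
```

Because both groups are summarized with the same pair of measures, the two medians and the two IQRs can be compared directly.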

Glossary Entries

  • outlier

    A data value that is unusual in that it differs quite a bit from the other values in the data set. In the box plot shown, the minimum, 0, and the maximum, 44, are both outliers.

  • standard deviation

    A measure of the variability, or spread, of a distribution, calculated by a method similar to the method for calculating the MAD (mean absolute deviation). The exact method is studied in more advanced courses.

  • statistic

    A quantity that is calculated from sample data, such as mean, median, or MAD (mean absolute deviation).
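
The standard deviation entry above says the calculation is similar to the one for the MAD. A minimal Python sketch of that similarity follows, using one common definition of the standard deviation (the population form, which divides by the number of values). That choice, the function names, and the data list are assumptions made for illustration; some courses divide by one less than the number of values instead.

```python
from math import sqrt


def mean_absolute_deviation(values):
    """Average distance of the values from their mean."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)


def standard_deviation(values):
    """Square root of the average squared distance from the mean.

    Divides by the number of values (population form); this is an
    assumption, since some courses divide by one less than that.
    """
    m = sum(values) / len(values)
    return sqrt(sum((x - m) ** 2 for x in values) / len(values))


# Hypothetical data, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(mean_absolute_deviation(data))  # 1.5
print(standard_deviation(data))       # 2.0
```

Both functions start from each value's distance to the mean. The MAD averages those distances directly, while the standard deviation averages their squares and then takes a square root, so values far from the mean count for more.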