Lesson 15
Comparing Data Sets
- Let’s compare statistics for data sets.
15.1: Bowling Partners
Each histogram shows the bowling scores for the last 25 games played by each person. Choose 2 of these people to join your bowling team. Explain your reasoning.
Person A
- mean: 118.96
- median: 111
- standard deviation: 32.96
- interquartile range: 44
Person B
- mean: 131.08
- median: 129
- standard deviation: 8.64
- interquartile range: 8
Person C
- mean: 133.92
- median: 145
- standard deviation: 45.04
- interquartile range: 74
Person D
- mean: 116.56
- median: 103
- standard deviation: 56.22
- interquartile range: 31.5
15.2: Comparing Marathon Times
The dot plot shows the finish times of all the marathon runners in each of two different age groups.
- Which age group tends to take longer to run the marathon? Explain your reasoning.
- Which age group has more variable finish times? Explain your reasoning.
- How do you think finish times for a 20–29 age range will compare to these two distributions?
- Find some actual marathon finish times for this group and make a box plot of your data to help compare.
15.3: Comparing Measures
For each group of data sets,
- Determine the best measure of center and measure of variability to use based on the shape of the distribution.
- Determine which set has the greatest measure of center.
- Determine which set has the greatest measure of variability.
- Be prepared to explain your reasoning.
1a
1b
2a
2b
3a
3b
4a
4b
5a
5b
6a
A political podcast has mostly reviews that either love the podcast or hate it.
6b
A cooking podcast has reviews that neither hate nor love the podcast.
7a
Stress testing concrete from site A has all 12 samples break at 450 pounds per square inch (psi).
7b
Stress testing concrete from site B has samples break every 10 psi starting at 450 psi until the last core is broken at 560 psi.
7c
Stress testing concrete from site C has 6 samples break at 430 psi and the other 6 break at 460 psi.
Summary
To compare data sets, it is helpful to look at the measures of center and measures of variability. The shape of the distribution can help choose the most useful measure of center and measure of variability.
When distributions are symmetric or approximately symmetric, the mean is the preferred measure of center and should be paired with the standard deviation as the preferred measure of variability. When distributions are skewed or when outliers are present, the median is usually a better measure of center and should be paired with the interquartile range (IQR) as the preferred measure of variability.
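To see why outliers push the choice toward the median, here is a minimal sketch using Python's standard `statistics` module. The values are made up for illustration and are not data from this lesson. Adding one unusually large value changes the mean a lot, but the median only a little.

```python
from statistics import mean, median

# Made-up, roughly symmetric values (an assumption for illustration only).
symmetric = [10, 11, 12, 13, 14]

# The same values with one unusually large value added.
with_outlier = symmetric + [100]

# For the symmetric values, the mean and the median agree.
print(mean(symmetric), median(symmetric))

# With the outlier, the mean is pulled toward 100 while the median barely moves.
print(mean(with_outlier), median(with_outlier))
```

In this made-up example, the median still describes a typical value after the outlier is added, while the mean no longer does.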
Once the appropriate measure of center and measure of variability are selected, these measures can be compared for data sets with similar shapes.
For example, let’s compare the number of seconds it takes football players at two different positions to complete a 40-yard dash. First, a dot plot of the data shows that the tight end times do not seem symmetric, so we should probably use the median and IQR to compare the two data sets.
The median and IQR could be computed from the values, but can also be determined from a box plot.
This shows that the tight end times have a greater median (about 4.9 seconds) compared to the median of wide receiver times (about 4.5 seconds). The IQR is also greater for the tight end times (about 0.5 seconds) compared to the IQR for the wide receiver times (about 0.25 seconds).
This means that the tight ends tend to be slower in the 40-yard dash when compared to the wide receivers. The tight ends also have greater variability in their times. Together, this can be taken to mean that, in general, a typical wide receiver is faster than a typical tight end, and the wide receivers tend to have more similar times to one another than the tight ends do to one another.
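The same kind of comparison can be sketched in Python. The times below are hypothetical values chosen for illustration, not the data behind the plots in this lesson. The helper names `quartiles` and `iqr` are also made up for this sketch. It assumes the quartiles are found as the medians of the lower and upper halves of the sorted data; other tools may use slightly different quartile rules and give slightly different IQR values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # With an odd count, the middle value is left out of both halves.
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return median(lower), median(data), median(upper)

def iqr(values):
    """Interquartile range: the distance from Q1 to Q3."""
    q1, _, q3 = quartiles(values)
    return q3 - q1

# Hypothetical 40-yard dash times in seconds (not the lesson's actual data).
tight_ends = [4.6, 4.7, 4.8, 4.9, 4.9, 5.0, 5.2, 5.3]
wide_receivers = [4.3, 4.4, 4.4, 4.5, 4.5, 4.6, 4.6, 4.8]

for name, times in [("tight ends", tight_ends), ("wide receivers", wide_receivers)]:
    print(f"{name}: median = {median(times):.2f} s, IQR = {iqr(times):.2f} s")
```

For these hypothetical values, the tight ends have both the greater median and the greater IQR, which matches the kind of conclusion drawn from the box plots.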
Glossary Entries
- outlier
A data value that is unusual in that it differs quite a bit from the other values in the data set. In the box plot shown, the minimum, 0, and the maximum, 44, are both outliers.
- standard deviation
A measure of the variability, or spread, of a distribution, calculated by a method similar to the method for calculating the MAD (mean absolute deviation). The exact method is studied in more advanced courses.
- statistic
A quantity that is calculated from sample data, such as mean, median, or MAD (mean absolute deviation).
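As a rough illustration of the standard deviation entry above, the sketch below computes the MAD and a standard deviation side by side for a small set of made-up values. It assumes the version of the standard deviation that divides by the number of values; some calculators and courses divide by one less than that number, which gives a slightly larger result. The function names are chosen for this sketch only.

```python
from math import sqrt
from statistics import mean

def mad(values):
    """Mean absolute deviation: the average distance from the mean."""
    center = mean(values)
    return mean([abs(x - center) for x in values])

def standard_deviation(values):
    """Square root of the average squared distance from the mean."""
    center = mean(values)
    return sqrt(mean([(x - center) ** 2 for x in values]))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values; their mean is 5

print(mad(data))                 # 1.5
print(standard_deviation(data))  # 2.0
```

Both measures start from each value's distance to the mean. The MAD averages those distances directly, while the standard deviation squares them, averages the squares, and then takes a square root.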