Python Tutorial

Measures of Spread

In this tutorial, you will learn about and apply measures of spread including standard deviation, variance, range, and interquartile range using Python programming and the Pandas data analytics library.

By Jennifer Campbell

Measures of Spread

Measures of central tendency are great for providing information on the center of the data, but there is a lot more to consider when looking at datasets. It’s important to know how spread out, or dispersed, the data is. This helps to describe the variability of the dataset.

Take a look at the two sets below. Both have exactly the same mean of 11.1 and median of 11, but just by glancing at them we can tell that they are drastically different in terms of spread, or dispersion. The first set has a lot less spread since the numbers are all closer together.

Set 1
7, 8, 9, 10, 11, 11, 12, 12, 13, 18

Set 2
1, 1, 2, 3, 11, 11, 14, 22, 22, 24
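As a quick check of the claim above, here is a minimal sketch that stores both sets as Pandas Series and prints the mean and median of each. The variable names set_1 and set_2 are chosen here for illustration.

```python
import pandas as pd

# The two example sets from the text (variable names are illustrative).
set_1 = pd.Series([7, 8, 9, 10, 11, 11, 12, 12, 13, 18])
set_2 = pd.Series([1, 1, 2, 3, 11, 11, 14, 22, 22, 24])

# Both sets share the same center even though their spread differs.
print(set_1.mean(), set_1.median())  # 11.1 11.0
print(set_2.mean(), set_2.median())  # 11.1 11.0
```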

Histograms provide another way to see how different these two sets are in terms of spread. A histogram shows the frequency (the number of times a number appears) for each number in the set. In Set 1, the numbers are mostly grouped in the middle, whereas in Set 2 they are very spread out.
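The frequencies that a histogram draws can also be listed as plain numbers. The sketch below assumes the two sets are stored in Series named set_1 and set_2 (illustrative names) and counts how often each value appears.

```python
import pandas as pd

set_1 = pd.Series([7, 8, 9, 10, 11, 11, 12, 12, 13, 18])
set_2 = pd.Series([1, 1, 2, 3, 11, 11, 14, 22, 22, 24])

# value_counts() tallies how often each number appears;
# sort_index() orders the tally by the number itself rather than by count.
freq_1 = set_1.value_counts().sort_index()
freq_2 = set_2.value_counts().sort_index()

print(freq_1)
print(freq_2)
```

In the first tally the values run from 7 to 18 with most of them between 9 and 13, while the second tally stretches from 1 to 24.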

Variance and Standard Deviation

There are a few measures that help us determine the spread of a dataset. The standard deviation is a measure that tells how spread out the numbers in a dataset are. It is found by taking the square root of the variance. The variance is another measure of spread, which describes how far the numbers in the dataset are from the mean. It is calculated as the average of the squared differences from the mean. Let’s take a look at what this math actually means.

Take Set 1, which has a mean of 11.1. First, the mean is subtracted from each number, so 7 - 11.1, then 8 - 11.1, and so forth.

After this, the differences are squared. This is important since we don’t want the negative values and positive values canceling each other out. Squaring the numbers gets rid of the negatives. Lastly, the average of these squared values is taken, but instead of dividing by the number of data values, we divide by that number minus 1. In this case, there are 10 values, so we divide by 10 - 1, or 9. Dividing by one less than the count gives a better estimate when the data is a sample of a larger population. The full reasoning is quite complicated, but we encourage you to check it out if you are interested!

Variance = (16.81 + 9.61 + 4.41 + 1.21 + 0.01 + 0.01 + 0.81 + 0.81 + 3.61 + 47.61) ÷ (10 - 1)

This calculates to a variance of 9.43. The lower the variance, the less spread out the data is.
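The steps above can be sketched in Python as follows. The variable names are illustrative. The last two lines compare the hand calculation with the Pandas var() method, which divides by the count minus 1 by default; passing ddof=0 divides by the full count instead.

```python
import pandas as pd

set_1 = pd.Series([7, 8, 9, 10, 11, 11, 12, 12, 13, 18])

mean = set_1.mean()                          # 11.1
differences = set_1 - mean                   # step 1: subtract the mean from each number
squared = differences ** 2                   # step 2: square each difference
variance = squared.sum() / (len(set_1) - 1)  # step 3: divide the total by 10 - 1

print(round(variance, 2))            # 9.43
print(round(set_1.var(), 2))         # 9.43, same result from Pandas
print(round(set_1.var(ddof=0), 2))   # 8.49, dividing by 10 instead of 9
```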

Since we have the variance now, we can take the square root to find the standard deviation. Remember that the variance is measured in squared units, while the standard deviation is in the same units as the data, which makes it easier to interpret. Again, the lower this value, the less spread out the numbers are.

Standard Deviation = √9.43 = 3.07
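The same result can be reproduced with a short sketch, assuming Set 1 is stored in a Series named set_1 (an illustrative name). It takes the square root of the variance and compares it with the Pandas std() method.

```python
import math

import pandas as pd

set_1 = pd.Series([7, 8, 9, 10, 11, 11, 12, 12, 13, 18])

variance = set_1.var()
std_dev = math.sqrt(variance)   # square root of the variance

print(round(std_dev, 2))        # 3.07
print(round(set_1.std(), 2))    # 3.07, same result from Pandas
```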

You Try!

The data used in the coding exercise below lists the number of people named Anna who were born in the US in each year from 1990 to 2017. It would be very hard to determine the variance and standard deviation of this data by hand! Let’s use Python and the Pandas library to calculate the values.

Print the dataset by typing in print(people_named_anna)
Print the variance by typing in print(people_named_anna.var())
Print the standard deviation by typing in print(people_named_anna.std())
Reflect: What do the variance and standard deviation mean for this dataset?
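The people_named_anna data is not reproduced here, so as a rough sketch the same two calls are applied to Set 1 and Set 2 from earlier in this tutorial. The names set_1 and set_2 are illustrative.

```python
import pandas as pd

set_1 = pd.Series([7, 8, 9, 10, 11, 11, 12, 12, 13, 18])
set_2 = pd.Series([1, 1, 2, 3, 11, 11, 14, 22, 22, 24])

# Same mean and median, very different spread.
print(round(set_1.var(), 2), round(set_1.std(), 2))  # 9.43 3.07
print(round(set_2.var(), 2), round(set_2.std(), 2))  # 84.99 9.22
```

Set 2 has a much larger variance and standard deviation, which matches what we saw by glancing at the numbers.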

Range and IQR

There are two more important measures of spread: the range, which may sound familiar, and the interquartile range. The range is found by subtracting the minimum value from the maximum value. Again, the larger this number, the larger the spread.

The interquartile range, or IQR for short, is the range of the middle 50% of the data. Because it ignores the lowest and highest quarters of the data, outliers have little effect on it. Quartiles split the data into four equal parts. The first quartile is the 25% point in the data, or the median of the first half, and the third quartile is the 75% point, or the median of the second half. If you are wondering where the second quartile is, it’s the same as the median, which is the 50% point. Let’s use a few Pandas functions to help determine these two measures of spread.
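As a minimal sketch, here are both measures for Set 1 from earlier in this tutorial (set_1 and the other variable names are illustrative). Note that the Pandas quantile() method interpolates between neighboring values by default, so its quartiles can differ slightly from the "median of each half" calculated by hand.

```python
import pandas as pd

set_1 = pd.Series([7, 8, 9, 10, 11, 11, 12, 12, 13, 18])

# Range: maximum value minus minimum value.
data_range = set_1.max() - set_1.min()

# IQR: third quartile minus first quartile.
q1 = set_1.quantile(0.25)
q3 = set_1.quantile(0.75)
iqr = q3 - q1

print(data_range)  # 11
print(q1, q3)      # 9.25 12.0
print(iqr)         # 2.75
```

The single large value of 18 stretches the range to 11, while the IQR stays small because it only looks at the middle half of the data.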

You Try!

Print the dataset by typing in print(people_named_anna)
Define a variable and store the maximum value by typing in people_named_anna.max()
Define a variable and store the minimum value by typing in people_named_anna.min()
Use these two variables to print the range of the dataset. Remember that the range is the maximum value minus the minimum value.
Define a variable and store the first quartile value by typing in people_named_anna.quantile(0.25)
Define a variable and store the third quartile value by typing in people_named_anna.quantile(0.75)
Use these two variables to print the IQR of the dataset. Remember that the IQR is the third quartile minus the first quartile.
Reflect: Compare the range with the IQR. What does this mean for the middle 50% of the data?

Plotting Data

If all of these measures, numbers, and percentages are confusing, there is a visual approach to looking at the measures of spread! This visual is called a boxplot, and Pandas, along with a new library, can plot one for us.

The ends of the arms (or whiskers) of the boxplot show the minimum and maximum values, not counting any outliers, which are drawn as separate points. The line in the middle is the median. And lastly, the bottom and top of the box denote the first and the third quartiles!
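Here is a rough sketch of a boxplot for Set 1 from earlier in this tutorial. It assumes that plt refers to Matplotlib's pyplot module, as in the exercise below, and uses a plain list for the data. The file name is an arbitrary choice, and the default settings are assumed, in which the whiskers stop at the last value within 1.5 times the IQR of the box.

```python
import matplotlib.pyplot as plt

set_1 = [7, 8, 9, 10, 11, 11, 12, 12, 13, 18]

plt.figure()
box = plt.boxplot(set_1)
plt.title("Set 1")
plt.savefig("set_1_boxplot.png")  # or plt.show() in an interactive session

# boxplot() returns the pieces it drew, so their positions can be read back.
median = float(box["medians"][0].get_ydata()[0])
whisker_ends = [float(line.get_ydata()[1]) for line in box["whiskers"]]
outliers = [float(value) for value in box["fliers"][0].get_ydata()]

print(median)        # 11.0
print(whisker_ends)  # [7.0, 13.0]
print(outliers)      # [18.0]
```

The upper whisker ends at 13 rather than 18 because 18 is far enough from the box to be drawn as an outlier.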

We can also view the data as a histogram. The only change in the code is replacing the boxplot function with the histogram function, hist. We can also add a comma after the dataset and pass an edgecolor argument to set the color of the bar outlines. This stops the bars from running together.

You won’t be able to determine the median or quartiles as easily in a histogram as in a boxplot, but you can still see the outlier and can easily spot where many values fall close together. A histogram also provides you with a pretty good visual of the spread of the data.
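A histogram of Set 1 can be sketched the same way, again assuming plt is Matplotlib's pyplot module. The edgecolor value and the file name are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

set_1 = [7, 8, 9, 10, 11, 11, 12, 12, 13, 18]

plt.figure()
# edgecolor outlines each bar so neighboring bars do not run together.
counts, bin_edges, bars = plt.hist(set_1, edgecolor="black")
plt.title("Set 1")
plt.savefig("set_1_histogram.png")  # or plt.show() in an interactive session

# hist() also returns the height of each bar and the edges of the bins.
print(counts.sum())  # 10.0, every value lands in exactly one bar
```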

You Try!

Change the plot from a boxplot to a histogram. Change plt.boxplot(people_named_anna) to plt.hist(people_named_anna)
Explore the different attributes that can be used when creating a boxplot or a histogram.
- Series.plot Documentation
