In this tutorial, you will learn about and apply measures of spread including standard deviation, variance, range, and interquartile range using Python programming and the Pandas data analytics library.
Measures of central tendency are great for providing information on the center of the data, but there is a lot more to consider when looking at datasets. It’s important to know how spread out, or dispersed, the data is. This helps to describe the variability of the dataset.
Take a look at the two datasets below. Both have exactly the same mean of 11.1 and median of 11, but just by glancing at them we can tell that they are drastically different in terms of spread or dispersion. The first set has a lot less spread since the numbers are all closer together.
Set 1
7, 8, 9, 10, 11, 11, 12, 12, 13, 18
Set 2
1, 1, 2, 3, 11, 11, 14, 22, 22, 24
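The claim that both sets share the same mean and median can be checked with Pandas. This is a minimal sketch; the names set_1 and set_2 are chosen here for illustration and are not part of the original exercise.

```python
import pandas as pd

# The two example sets from above, stored as pandas Series
set_1 = pd.Series([7, 8, 9, 10, 11, 11, 12, 12, 13, 18])
set_2 = pd.Series([1, 1, 2, 3, 11, 11, 14, 22, 22, 24])

for name, data in [("Set 1", set_1), ("Set 2", set_2)]:
    # mean() and median() are the measures of central tendency
    print(name, "mean:", round(data.mean(), 1), "median:", data.median())
```

Both lines should report a mean of 11.1 and a median of 11, even though the values in Set 2 are far more scattered.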
The graphs below provide another way to see how different these two sets are in terms of spread. These graphs are histograms, which show the frequency (the number of times a number appears) for each number in the set. In Set 1, the numbers are mostly grouped in the middle, whereas in Set 2 they are very spread out.
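The frequencies that a histogram displays can also be listed as plain numbers. The sketch below assumes the two sets are stored in Series named set_1 and set_2 (illustrative names) and uses value_counts() to count how often each value appears.

```python
import pandas as pd

set_1 = pd.Series([7, 8, 9, 10, 11, 11, 12, 12, 13, 18])
set_2 = pd.Series([1, 1, 2, 3, 11, 11, 14, 22, 22, 24])

# value_counts() counts how often each value appears;
# sort_index() orders the result by value instead of by count
freq_1 = set_1.value_counts().sort_index()
freq_2 = set_2.value_counts().sort_index()

print(freq_1)
print(freq_2)
```

In Set 1 the counts sit on values between 7 and 18, while in Set 2 they stretch from 1 to 24 with large gaps in between.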
There are a few measures that help us determine the spread of a dataset. The standard deviation is a measure that tells how spread out the numbers in a dataset are. This measure can be found by taking the square root of the variance. The variance is another measure of spread, based on how far the numbers in the dataset are from the mean. It is calculated by taking the average of the squared differences from the mean. Let’s take a look at what this math actually means.
Consider the Set 1 data, which has a mean of 11.1. First, the mean is subtracted from each number, so 7 - 11.1 and then 8 - 11.1 and so forth.
After this, the values are squared. This is important since we don’t want the negative values and positive values canceling each other out. Squaring the numbers gets rid of the negatives. Lastly, the average of these final numbers is taken, but instead of dividing by the number of data values, we divide by the number of data values minus 1. In this case, there are 10 values, so we divide by 10 - 1, or 9. The reason for this adjustment (known as Bessel’s correction, which applies when the data is a sample from a larger population) is fairly involved, but we encourage you to check it out if you are interested!
Variance = (16.81 + 9.61 + 4.41 + 1.21 + 0.01 + 0.01 + 0.81 + 0.81 + 3.61 + 47.61) ÷ (10 - 1)
This calculates to a variance of 9.43. The lower the variance, the less spread out the data is.
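The hand calculation above can be reproduced step by step in Python. This is a sketch for checking the arithmetic; the variable names are illustrative.

```python
import pandas as pd

set_1 = pd.Series([7, 8, 9, 10, 11, 11, 12, 12, 13, 18])

mean = set_1.mean()            # 11.1
deviations = set_1 - mean      # subtract the mean from each value
squared = deviations ** 2      # squaring removes the negative signs

# Divide the sum of squared differences by n - 1 rather than n
variance = squared.sum() / (len(set_1) - 1)

print(round(variance, 2))
print(round(set_1.var(), 2))   # the built-in var() also divides by n - 1
```

Both print statements should show 9.43, matching the value worked out by hand.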
Now that we have the variance, we can take its square root to find the standard deviation. Because the variance is built from squared differences, it is expressed in squared units; the standard deviation brings the measure back to the same units as the data, which makes it easier to interpret. Again, the lower this value, the less spread out the numbers are.
Standard Deviation = √9.43 = 3.07
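As a quick check, the standard deviation can be computed both as the square root of the variance and with the built-in std() method. The sketch also includes Set 2 (using the illustrative names set_1 and set_2) to show how the value grows with the spread.

```python
import math

import pandas as pd

set_1 = pd.Series([7, 8, 9, 10, 11, 11, 12, 12, 13, 18])
set_2 = pd.Series([1, 1, 2, 3, 11, 11, 14, 22, 22, 24])

# Standard deviation is the square root of the variance
std_from_variance = math.sqrt(set_1.var())

print(round(std_from_variance, 2))
print(round(set_1.std(), 2))   # same result from the built-in method
print(round(set_2.std(), 2))   # the more spread-out set gives a larger value
```

The first two values should both be 3.07, and the value for Set 2 should be noticeably larger.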
The data used in the coding exercise below lists the number of people named Anna who were born in the US in each year from 1990 to 2017. Calculating the variance and standard deviation of this data by hand would be very tedious! Let’s use Python and the Pandas library to calculate the values.
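# people_named_anna is assumed to be loaded already as a Pandas object holding the yearly counts
print(people_named_anna)
print(people_named_anna.var())  # variance (divides by n - 1 by default)
print(people_named_anna.std())  # standard deviation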
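By default, the Pandas var() and std() methods divide by the number of values minus 1, just like the hand calculation. The ddof argument controls this. The sketch below uses Set 1 (stored under the illustrative name set_1) to compare the default with ddof=0, which divides by the number of values and is also the default in NumPy.

```python
import numpy as np
import pandas as pd

set_1 = pd.Series([7, 8, 9, 10, 11, 11, 12, 12, 13, 18])

sample_var = set_1.var()                # pandas default: ddof=1, divides by n - 1
population_var = set_1.var(ddof=0)      # divides by n instead
numpy_var = np.var(set_1.to_numpy())    # NumPy default: ddof=0

print(round(sample_var, 2), round(population_var, 2), round(numpy_var, 2))
```

This should print 9.43 8.49 8.49. The difference is worth keeping in mind when comparing results from Pandas and NumPy.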
There are two more measures of spread that are important: the range, which may sound familiar, and the interquartile range. The range is found by subtracting the minimum value from the maximum value. Again, the larger this number, the larger the spread.
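Applied to the two example sets from earlier, the range calculation looks like this. It is a small sketch with illustrative variable names.

```python
import pandas as pd

set_1 = pd.Series([7, 8, 9, 10, 11, 11, 12, 12, 13, 18])
set_2 = pd.Series([1, 1, 2, 3, 11, 11, 14, 22, 22, 24])

# Range = maximum value - minimum value
range_1 = set_1.max() - set_1.min()   # 18 - 7
range_2 = set_2.max() - set_2.min()   # 24 - 1

print(range_1, range_2)
```

Set 1 has a range of 11 and Set 2 has a range of 23, which agrees with Set 2 being the more spread-out set.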
The interquartile range, or IQR for short, is the range of the middle 50% of the data. Because it ignores the lowest and highest quarters of the data, it is much less affected by outliers than the range. Quartiles split the data into four equal parts. The first quartile is the 25% point in the data, or the median of the lower half, and the third quartile is the 75% point, or the median of the upper half. If you are wondering where the second quartile is, it’s the same as the median, which is the 50% point. The IQR is the third quartile minus the first quartile. Let’s use a few Pandas functions to help determine these two measures of spread.
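print(people_named_anna)
print(people_named_anna.max())           # largest yearly count
print(people_named_anna.min())           # smallest yearly count
print(people_named_anna.quantile(0.25))  # first quartile (Q1)
print(people_named_anna.quantile(0.75))  # third quartile (Q3)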
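The quartiles can then be combined into the IQR. The sketch below uses Set 1 from earlier (under the illustrative name set_1) so the numbers are easy to follow. Note that quantile() interpolates between neighboring values by default, so its quartiles can differ slightly from the "median of each half" method.

```python
import pandas as pd

set_1 = pd.Series([7, 8, 9, 10, 11, 11, 12, 12, 13, 18])

q1 = set_1.quantile(0.25)   # first quartile (25% point)
q2 = set_1.quantile(0.50)   # second quartile, same as the median
q3 = set_1.quantile(0.75)   # third quartile (75% point)

# IQR = third quartile - first quartile
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

For Set 1 this should give quartiles of 9.25, 11.0 and 12.0, and an IQR of 2.75. The value 18 has no effect on the IQR, whereas it accounts for much of the range.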
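If all of these measures, numbers, and percentages are confusing, there is a visual approach to looking at the measures of spread! This visual is called a boxplot, and Pandas, along with a new library, Matplotlib, can plot one for us.
In the basic version of a boxplot, the ends of the arms (called whiskers) show the minimum and maximum values. When the data contains outliers, Matplotlib by default stops each whisker at the most extreme value within 1.5 times the IQR of the box and draws anything beyond that as an individual point. The line in the middle is the median. And lastly, the bottom and top of the box denote the first and third quartiles!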
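When a dataset contains outliers, Matplotlib's default boxplot draws each whisker only as far as the most extreme value within 1.5 times the IQR of the box, and shows points beyond that individually. The sketch below applies that common 1.5 × IQR rule to Set 1 from earlier; the variable names are illustrative.

```python
import pandas as pd

set_1 = pd.Series([7, 8, 9, 10, 11, 11, 12, 12, 13, 18])

q1 = set_1.quantile(0.25)
q3 = set_1.quantile(0.75)
iqr = q3 - q1

# Values outside these limits are treated as outliers under the 1.5 x IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = set_1[(set_1 < lower_fence) | (set_1 > upper_fence)]

print(lower_fence, upper_fence)
print(outliers.tolist())
```

For Set 1 the limits are 5.125 and 16.125, so the value 18 is flagged as an outlier and would appear as a separate point above the upper whisker.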
We can also view the data as a histogram. The only thing that changes in the code is replacing the boxplot function with the histogram, or hist, function. We can also add a comma after the dataset and specify an edgecolor (for example, edgecolor='black'). This stops the bars from running together.
You won’t be able to determine the median or mean as easily in a histogram as a boxplot, but you can still see the outlier and can easily spot values that appear more than once. A histogram also provides you with a pretty good visual of the spread of the data.
In code, the change is from plt.boxplot(people_named_anna) to plt.hist(people_named_anna).
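Since the Anna dataset is not reproduced here, the following sketch draws both plots for Set 1 from earlier instead. It assumes Matplotlib is installed; the file name set_1_spread.png and the variable names are illustrative, and the figure is saved to a file rather than shown in a window.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

set_1 = pd.Series([7, 8, 9, 10, 11, 11, 12, 12, 13, 18])

# One row, two columns: boxplot on the left, histogram on the right
fig, (box_ax, hist_ax) = plt.subplots(1, 2, figsize=(8, 3))

box_ax.boxplot(set_1)
box_ax.set_title("Boxplot")

# edgecolor draws an outline around each bar so the bars do not run together
counts, bin_edges, _ = hist_ax.hist(set_1, edgecolor="black")
hist_ax.set_title("Histogram")

fig.savefig("set_1_spread.png")
```

In an interactive session, the backend line can be dropped and plt.show() used in place of savefig.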