When we're trying to describe and summarize a sample of data, we probably start by finding the mean (or average), the median, and the mode of the data. These are central tendency measures and are often our first look at a dataset.
In this tutorial, we'll learn how to find or compute the mean, the median, and the mode in Python. We'll first code a Python function for each measure followed by using Python's
statistics module to accomplish the same task.
With this knowledge, we'll be able to take a quick look at our datasets and get an idea of the general tendency of data.
Table of Contents
- Calculating the Mean of a Sample
- Finding the Median of a Sample
- Finding the Mode of a Sample
Calculating the Mean of a Sample
If we have a sample of numeric values, then its mean or the average is the total sum of the values (or observations) divided by the number of values.
Say we have the sample
[4, 8, 6, 5, 3, 2, 8, 9, 2, 5]. We can calculate its mean by performing the operation:
(4 + 8 + 6 + 5 + 3 + 2 + 8 + 9 + 2 + 5) / 10 = 5.2
The mean (arithmetic mean) is a general description of our data. Suppose you buy 10 pounds of tomatoes. When you count the tomatoes at home, you get 25 tomatoes. In this case, you can say that the average weight of a tomato is 0.4 pounds. That would be a good description of your tomatoes.
The mean can also be a poor description of a sample of data. Say you're analyzing a group of dogs. If you take the combined weight of all the dogs and divide it by the number of dogs, the result would probably be a poor description of the weight of an individual dog, since different breeds can have vastly different sizes and weights.
How well or how poorly the mean describes a sample depends on how spread out the data is. In the case of tomatoes, they're almost the same weight each and the mean is a good description of them. In the case of dogs, there is no typical dog. They can range from a tiny Chihuahua to a giant German Mastiff. So, the mean by itself isn't a good description in this case.
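To make the contrast concrete, here's a small sketch with made-up weights. The numbers are hypothetical, not measurements, and using the range (largest value minus smallest value) as a rough indicator of spread is an assumption of this sketch rather than something the discussion above prescribes.

```python
# Hypothetical weights in pounds, invented purely for illustration.
tomato_weights = [0.38, 0.41, 0.40, 0.39, 0.42]
dog_weights = [5, 12, 30, 70, 150]


def mean_and_range(sample):
    """Return the mean and the range (max - min) of a sample."""
    mean = sum(sample) / len(sample)
    spread = max(sample) - min(sample)
    return mean, spread


tomato_mean, tomato_spread = mean_and_range(tomato_weights)
dog_mean, dog_spread = mean_and_range(dog_weights)

# The tomato weights sit close to their mean, the dog weights do not.
print("tomatoes:", tomato_mean, tomato_spread)
print("dogs:", dog_mean, dog_spread)
```

With these invented values, the tomatoes differ from each other by only a few hundredths of a pound, while the dogs differ by well over a hundred pounds, so the mean says much less about any single dog.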
Now it's time to get into action and learn how we can calculate the mean using Python.
Calculating the Mean With Python
To calculate the mean of a sample of numeric data, we'll use two of Python's built-in functions: one to calculate the total sum of the values and another to calculate the length of the sample.
The first function is
sum(). This built-in function takes an iterable of numeric values and returns their total sum.
The second function is
len(). This built-in function returns the length of an object.
len() can take sequences (string, bytes, tuple, list, or range) or collections (dictionary, set, or frozen set) as an argument.
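Before combining them, it may help to see the two built-ins on their own. This is a minimal sketch using the same sample as above; the variable names are just for illustration.

```python
sample = [4, 8, 6, 5, 3, 2, 8, 9, 2, 5]

total = sum(sample)  # adds up every value in the iterable
count = len(sample)  # number of items in the list

print(total)  # 52
print(count)  # 10

# len() also works on other sequences and collections.
print(len("mean"))            # 4
print(len((1, 2, 3)))         # 3
print(len({"a": 1, "b": 2}))  # 2
```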
Here's how we can calculate the mean:
>>> def my_mean(sample):
...     return sum(sample) / len(sample)
...
>>> my_mean([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])
5.2
We first sum the values in sample using sum(). Then, we divide that sum by the length of sample, which is the resulting value of len(sample).
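One thing this short function doesn't cover is an empty sample, where dividing by a length of zero raises ZeroDivisionError. The sketch below shows that behavior and one possible way to guard against it. The name safe_mean() and its error message are hypothetical, chosen only for this example.

```python
def my_mean(sample):
    return sum(sample) / len(sample)


def safe_mean(sample):
    """Hypothetical variant that reports empty input with a clearer error."""
    values = list(sample)  # also accepts generators, which have no len()
    if not values:
        raise ValueError("safe_mean() requires at least one value")
    return sum(values) / len(values)


try:
    my_mean([])
except ZeroDivisionError as error:
    print("my_mean([]) failed:", error)

print(safe_mean(x for x in [4, 8, 6]))  # 6.0
```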
Using Python's mean()
Since calculating the mean is a common operation, Python includes this functionality in the
statistics module. It provides some functions for calculating basic statistics on sets of data. The
statistics.mean() function takes a sample of numeric data (any iterable) and returns its mean.
Here's how Python's mean() works:
>>> import statistics
>>> statistics.mean([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])
5.2
We just need to import the
statistics module and then call
mean() with our sample as an argument. That will return the mean of the sample. This is a quick way of finding the mean using Python.
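As a small illustration of "any iterable", the sketch below passes a tuple and a generator to statistics.mean(), and also shows that an empty sample raises statistics.StatisticsError. The sample values are arbitrary.

```python
import statistics

# A tuple works just as well as a list.
print(statistics.mean((1, 2, 3, 4)))  # 2.5

# So does a generator expression.
print(statistics.mean(x for x in [1, 2, 3, 4]))  # 2.5

# An empty sample has no mean, so the function raises an error.
try:
    statistics.mean([])
except statistics.StatisticsError as error:
    print("empty sample:", error)
```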
Finding the Median of a Sample
The median of a sample of numeric data is the value that lies in the middle when we sort the data. The data may be sorted in ascending or descending order; the median remains the same either way.
To find the median, we need to:
- Sort the sample
- Locate the value in the middle of the sorted sample
When locating the number in the middle of a sorted sample, we can face two kinds of situations:
- If the sample has an odd number of observations, then the middle value in the sorted sample is the median
- If the sample has an even number of observations, then we'll need to calculate the mean of the two middle values in the sorted sample
If we have the sample
[3, 5, 1, 4, 2] and want to find its median, then we first sort the sample to
[1, 2, 3, 4, 5]. The median would be
3 since that's the value in the middle.
On the other hand, if we have the sample
[1, 2, 3, 4, 5, 6], then its median will be
(3 + 4) / 2 = 3.5.
Let's take a look at how we can use Python to calculate the median.
Finding the Median With Python
To find the median, we first need to sort the values in our sample. We can achieve that using the built-in sorted() function. sorted() takes an iterable and returns a sorted list containing the same values as the original iterable.
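Here's a quick sketch of sorted() in isolation, using the sample from the previous section. It shows that the function returns a new list and leaves the original untouched.

```python
sample = [3, 5, 1, 4, 2]

ordered = sorted(sample)
print(ordered)  # [1, 2, 3, 4, 5]
print(sample)   # [3, 5, 1, 4, 2], the original list is unchanged

# Any iterable is accepted, and the result is always a list.
print(sorted((3, 1, 2)))  # [1, 2, 3]

# reverse=True sorts in descending order; the middle value stays the same.
descending = sorted(sample, reverse=True)
print(descending)  # [5, 4, 3, 2, 1]
```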
The second step is to locate the value that lies in the middle of the sorted sample. To locate that value in a sample with an odd number of observations, we can divide the number of observations by 2. The result will be the index of the value in the middle of the sorted sample.
Since the division operator (/) returns a float, we'll need to use the floor division operator (//) to get an integer. That way, we can use the result as an index in an indexing operation ([]).
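The difference between the two operators can be sketched like this, using an already sorted sample with an odd number of observations:

```python
ordered = [1, 2, 3, 4, 5]
n = len(ordered)

print(n / 2)   # 2.5, a float
print(n // 2)  # 2, an int

# The floor-divided result is a valid index for the middle value.
middle = ordered[n // 2]
print(middle)  # 3

# A float can't be used as a list index.
try:
    ordered[n / 2]
except TypeError as error:
    print("float index rejected:", error)
```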
If the sample has an even number of observations, then we need to locate the two middle values. Say we have the sample [1, 2, 3, 4, 5, 6]. If we divide its length (6) by 2 using a floor division, then we get 3. That's the index of our upper-middle value (4). To find the index of our lower-middle value (3), we can decrement the index of the upper-middle value by 1.
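The index arithmetic for the even case can be traced step by step. The variable names below are only illustrative:

```python
ordered = sorted([1, 2, 3, 4, 5, 6])

upper_index = len(ordered) // 2  # 6 // 2 == 3
lower_index = upper_index - 1    # 2

upper = ordered[upper_index]
lower = ordered[lower_index]
print(lower, upper)  # 3 4

# The median is the mean of the two middle values.
median = (lower + upper) / 2
print(median)  # 3.5
```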
Let's put all of this together in a function that calculates the median of a sample. Here's a possible implementation:
>>> def my_median(sample):
...     n = len(sample)
...     index = n // 2
...     # Sample with an odd number of observations
...     if n % 2:
...         return sorted(sample)[index]
...     # Sample with an even number of observations
...     return sum(sorted(sample)[index - 1:index + 1]) / 2
...
>>> my_median([3, 5, 1, 4, 2])
3
>>> my_median([3, 5, 1, 4, 2, 6])
3.5
This function takes a sample of numeric values and returns its median. We first find the length of the sample, n. Then, we calculate the index of the middle value (or upper-middle value) by dividing n by 2. The if statement checks if the sample at hand has an odd number of observations. If so, then the median is the value at index. The final return runs if the sample has an even number of observations. In that case, we find the median by calculating the mean of the two middle values.
Note that the slicing operation [index - 1:index + 1] gets two values: the value at index - 1 and the value at index. Slicing operations exclude the value at the final index (index + 1).
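To see the slice on its own, and as one possible variation on the function, the sketch below sorts the sample once and unpacks the two middle values. The name my_median_sorted_once() is hypothetical, and like the version above it doesn't handle an empty sample.

```python
ordered = [1, 2, 3, 4, 5, 6]
index = len(ordered) // 2

# The slice stops before index + 1, so it holds exactly two values.
middle_pair = ordered[index - 1:index + 1]
print(middle_pair)  # [3, 4]


def my_median_sorted_once(sample):
    """Hypothetical variant of the median function that sorts only once."""
    ordered = sorted(sample)
    n = len(ordered)
    index = n // 2
    if n % 2:
        # Odd number of observations: single middle value
        return ordered[index]
    # Even number of observations: mean of the two middle values
    lower, upper = ordered[index - 1:index + 1]
    return (lower + upper) / 2


print(my_median_sorted_once([3, 5, 1, 4, 2]))     # 3
print(my_median_sorted_once([3, 5, 1, 4, 2, 6]))  # 3.5
```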
Using Python's median()
statistics.median() takes a sample of data and returns its median. Here's how the function works:
>>> import statistics
>>> statistics.median([3, 5, 1, 4, 2])
3
>>> statistics.median([3, 5, 1, 4, 2, 6])
3.5
median() automatically handles the calculation of the median for samples with either an odd or an even number of observations.
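The statistics module also offers median_low() and median_high(), which aren't covered above. For a sample with an even number of observations, they return the lower or the upper of the two middle values instead of their mean, which can be handy when the result needs to be an actual observation. A brief sketch:

```python
import statistics

sample = [3, 5, 1, 4, 2, 6]

middle = statistics.median(sample)      # mean of the two middle values
low = statistics.median_low(sample)     # lower of the two middle values
high = statistics.median_high(sample)   # upper of the two middle values

print(middle)  # 3.5
print(low)     # 3
print(high)    # 4
```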
Finding the Mode of a Sample
The mode is the most frequent observation (or observations) in a sample. If we have the sample [4, 1, 2, 2, 3, 5], then its mode is 2 because 2 appears two times in the sample whereas the other elements only appear once.
The mode doesn't have to be unique. Some samples have more than one mode. Say we have the sample
[4, 1, 2, 2, 3, 5, 4]. This sample has two modes, 2 and 4, because they're the values that appear most often and both appear the same number of times.
The mode is commonly used for categorical data. Common categorical data types are:
- boolean - Can take only two values like in true or false
- nominal - Can take more than two values like in
American - European - Asian - African
- ordinal - Can take more than two values but the values have a logical order like in
few - some - many
When we're analyzing a dataset of categorical data, we can use the mode to know which category is the most common in our data.
We can find samples that don't have a mode. If all the observations are unique (there aren't repeated observations), then your sample won't have a mode.
Now that we know the basics about mode, let's take a look at how we can find it using Python.
Finding the Mode with Python
To find the mode with Python, we'll start by counting the number of occurrences of each value in the sample at hand. Then, we'll get the value(s) with the highest number of occurrences.
Since counting objects is a common operation, Python provides the
collections.Counter class. This class is specially designed for counting objects.
The Counter class provides a method defined as .most_common([n]). This method returns a list of two-item tuples with the n most common elements and their respective counts. If n is omitted or None, then .most_common() returns all of the elements.
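To get a feel for what .most_common() returns, here's a short sketch on one of the samples from this section. Elements with equal counts come back in the order they were first encountered.

```python
from collections import Counter

counts = Counter([4, 1, 2, 2, 3, 5, 4])

# A list with a single (observation, count) tuple
print(counts.most_common(1))  # [(4, 2)]

# The two most common observations
print(counts.most_common(2))  # [(4, 2), (2, 2)]

# With no argument, every observation is returned, most common first
print(counts.most_common())  # [(4, 2), (2, 2), (1, 1), (3, 1), (5, 1)]

# Unpacking the first tuple gives the observation and its count
top_value, top_count = counts.most_common(1)[0]
print(top_value, top_count)  # 4 2
```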
Let's use .most_common() to code a function that takes a sample of data and returns its mode.
Here's a possible implementation:
>>> from collections import Counter
>>> def my_mode(sample):
...     c = Counter(sample)
...     return [k for k, v in c.items() if v == c.most_common(1)[0][1]]
...
>>> my_mode(["male", "male", "female", "male"])
['male']
>>> my_mode(["few", "few", "many", "some", "many"])
['few', 'many']
>>> my_mode([4, 1, 2, 2, 3, 5])
[2]
>>> my_mode([4, 1, 2, 2, 3, 5, 4])
[4, 2]
We first count the observations in the sample using a Counter object (c). Then, we use a list comprehension to create a list containing the observations that appear the highest number of times in the sample.
Since .most_common(1) returns a list with one tuple of the form (observation, count), we need to get the tuple at index 0 in the list and then the item at index 1 in the nested tuple. This can be done with the expression c.most_common(1)[0][1]. That value is the count of the first mode of our sample.
Note that the comprehension's condition compares the count of each observation (v) with the count of the most common observation (c.most_common(1)[0][1]). This allows us to get multiple observations (k) with the same count in the case of a multi-mode sample.
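As a possible variation, the highest count can be computed once with max() over the counter's values instead of calling .most_common(1) for every observation. The function name my_mode_max() and the choice to return an empty list for an empty sample are assumptions made for this sketch.

```python
from collections import Counter


def my_mode_max(sample):
    """Hypothetical variant that looks up the highest count only once."""
    counts = Counter(sample)
    if not counts:
        # Assumed behavior for this sketch: an empty sample has no modes.
        return []
    highest = max(counts.values())
    # Keep every observation whose count matches the highest count.
    return [value for value, count in counts.items() if count == highest]


print(my_mode_max(["male", "male", "female", "male"]))  # ['male']
print(my_mode_max([4, 1, 2, 2, 3, 5, 4]))               # [4, 2]
print(my_mode_max([]))                                  # []
```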
Using Python's mode()
statistics.mode() takes some
data and returns its (first) mode. Let's see how we can use it:
>>> import statistics
>>> statistics.mode([4, 1, 2, 2, 3, 5])
2
>>> statistics.mode([4, 1, 2, 2, 3, 5, 4])
4
>>> statistics.mode(["few", "few", "many", "some", "many"])
'few'
With a single-mode sample, Python's mode() returns the most common value, 2. However, in the following two examples, it returned 4 and few. These samples had other elements occurring the same number of times, but they weren't included because mode() only returns the first mode it encounters.
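Because only the first mode is returned, the order of the observations affects the result for a multi-mode sample. The sketch below assumes Python 3.8 or later; earlier versions raised statistics.StatisticsError when a sample had more than one mode.

```python
import statistics

# Both values appear twice; the one encountered first wins.
first = statistics.mode([2, 2, 4, 4])
second = statistics.mode([4, 4, 2, 2])

print(first)   # 2
print(second)  # 4

# An empty sample has no mode, so the function raises an error.
try:
    statistics.mode([])
except statistics.StatisticsError as error:
    print("empty sample:", error)
```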
Since Python 3.8 we can also use
statistics.multimode() which accepts an iterable and returns a
list of modes.
Here's an example of how to use multimode():
>>> import statistics
>>> statistics.multimode([4, 1, 2, 2, 3, 5, 4])
[4, 2]
>>> statistics.multimode(["few", "few", "many", "some", "many"])
['few', 'many']
>>> statistics.multimode([4, 1, 2, 2, 3, 5])
[2]
Note: The function always returns a
list, even if you pass a single-mode sample.
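A few edge cases are worth a quick look. In this sketch (again assuming Python 3.8 or later), an empty sample yields an empty list, and a sample where every observation is unique yields all of the observations, since each one ties for the highest count.

```python
import statistics

empty = statistics.multimode([])
all_unique = statistics.multimode([1, 2, 3])
letters = statistics.multimode("aabbbbccddddeeffffgg")

print(empty)       # []
print(all_unique)  # [1, 2, 3]
print(letters)     # ['b', 'd', 'f']
```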
The mean (or average), the median, and the mode are commonly our first looks at a sample of data when we're trying to understand the central tendency of the data.
In this tutorial, we've learned how to find or compute the mean, the median, and the mode using Python. We first covered, step-by-step, how to create our own functions to compute them, and then how to use Python's
statistics module as a quick way to find these measures.