Turning a collection of numbers (or objects whose fields you'd like to inspect) into a distribution of those numbers is a common statistical technique, and is employed in various contexts in reporting and data-driven applications.
Given a collection:
1, 1, 2, 1, 2, 3, 1, 4, 5, 1, 3
You can inspect their distribution as a count (frequency of each element), and store the results in a map:
{
"1": 5,
"2": 2,
"3": 2,
"4": 1,
"5": 1
}
Or, you can normalize the counts by the total number of values - thus expressing each one as a fraction of the whole:
{
"1": 0.45,
"2": 0.18,
"3": 0.18,
"4": 0.09,
"5": 0.09
}
Or even express these percentages in a 0..100 format instead of a 0..1 format.
In this guide, we'll take a look at how you can calculate a distribution from a collection - both using primitive types and objects whose fields you might want to report in your application.
With the addition of functional programming support in Java - calculating distributions is easier than ever. We'll be working with a collection of numbers and a collection of Books:
public class Book {
    private String id;
    private String name;
    private String author;
    private long pageNumber;
    private long publishedYear;

    // Constructor, getters, setters, toString()
}
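The elided members might look like the sketch below. It assumes a five-argument constructor that takes the fields in declaration order, which matches how the books are instantiated later in this guide. Setters are left out for brevity, and the toString() format is purely illustrative.

```java
// A minimal sketch of Book with the elided members filled in.
// Assumption: the constructor takes the fields in declaration order.
class Book {
    private String id;
    private String name;
    private String author;
    private long pageNumber;
    private long publishedYear;

    Book(String id, String name, String author, long pageNumber, long publishedYear) {
        this.id = id;
        this.name = name;
        this.author = author;
        this.pageNumber = pageNumber;
        this.publishedYear = publishedYear;
    }

    public String getId() { return id; }
    public String getName() { return name; }
    public String getAuthor() { return author; }
    public long getPageNumber() { return pageNumber; }
    public long getPublishedYear() { return publishedYear; }

    // Illustrative format only
    @Override
    public String toString() {
        return "Book{id='" + id + "', name='" + name + "', publishedYear=" + publishedYear + "}";
    }
}
```

Because publishedYear is a long, grouping by Book::getPublishedYear produces Long keys. That is why the later examples convert the keys when building a map with Integer keys.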
Calculate Distribution of Collection in Java
Let's first take a look at how you can calculate a distribution for primitive types. Working with objects simply allows you to call custom methods from your domain classes to provide more flexibility in the calculations.
By default, we'll represent the percentages as a double from 0.00 to 100.00.
Primitive Types
Let's create a list of integers and print their distribution:
List<Integer> integerList = List.of(1, 1, 2, 1, 2, 3, 1, 4, 5, 1, 3);
System.out.println(calculateIntegerDistribution(integerList));
The distribution is calculated with:
public static Map<Integer, Double> calculateIntegerDistribution(List<Integer> list) {
    return list.stream()
            .collect(Collectors.groupingBy(Integer::intValue,
                    Collectors.collectingAndThen(Collectors.counting(),
                            count -> (Double.parseDouble(String.format("%.2f", count * 100.00 / list.size()))))));
}
This method accepts a list and streams it. While streamed, the values are grouped by their integer value - and the occurrences in each group are counted using Collectors.counting(), before being collected into a Map<Integer, Double> where the keys represent the input values and the doubles represent their percentages in the distribution.
The key method here is collect(), which accepts a single collector - in this case, the one returned by groupingBy(). groupingBy() in turn takes two arguments: a classifier function that determines the key of each group (here, the input element itself), and a downstream collector that reduces the elements of each group. The downstream collector is built with collectingAndThen(), which lets us count the elements and then transform the count, such as with count * 100.00 / list.size(), to express it as a percentage:
{1=45.45, 2=18.18, 3=18.18, 4=9.09, 5=9.09}
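One caveat with the String.format() and Double.parseDouble() round trip: String.format() uses the default locale, and in locales that use a decimal comma it produces text such as 45,45, which Double.parseDouble() rejects with a NumberFormatException. A possible alternative, sketched below, rounds numerically with BigDecimal instead. The class name RoundedDistribution and the helper round() are hypothetical and not part of the original example.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical variant of calculateIntegerDistribution() that avoids
// formatting the number as text and parsing it back.
class RoundedDistribution {

    // Rounds a value to the given number of decimal places (half-up).
    static double round(double value, int places) {
        return BigDecimal.valueOf(value)
                .setScale(places, RoundingMode.HALF_UP)
                .doubleValue();
    }

    static Map<Integer, Double> calculateIntegerDistribution(List<Integer> list) {
        return list.stream()
                .collect(Collectors.groupingBy(Integer::intValue,
                        Collectors.collectingAndThen(Collectors.counting(),
                                // count is a Long; multiplying by 100.0 promotes it to double
                                count -> round(count * 100.0 / list.size(), 2))));
    }

    public static void main(String[] args) {
        System.out.println(calculateIntegerDistribution(List.of(1, 1, 2, 1, 2, 3, 1, 4, 5, 1, 3)));
    }
}
```

The second argument of round() plays the same role as the precision in the "%.2f" format string.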
Advice: If you'd like to read more about groupingBy(), counting() and collectingAndThen() - read our in-depth, example-driven "Guide to Java 8 Collectors: groupingBy()", "Guide to Java 8 Collectors: counting()" and "Guide to Java 8 Collectors: collectingAndThen()".
Sort Distribution by Value or Key
When creating distributions - you'll typically want to sort the results. More often than not, this'll be by key. Java HashMaps don't guarantee any particular iteration order, so we'll collect the sorted entries into a LinkedHashMap, which preserves insertion order. Additionally, it's easiest to re-stream the map and re-collect it now that it's a much smaller size and much more manageable.
The previous operation can quickly collapse multiple thousand records into small maps, depending on the number of keys you're dealing with, so re-streaming isn't expensive:
public static Map<Integer, Double> calculateIntegerDistribution(List<Integer> list) {
    return list.stream()
            .collect(Collectors.groupingBy(Integer::intValue,
                    Collectors.collectingAndThen(Collectors.counting(),
                            count -> (Double.parseDouble(String.format("%.2f", count * 100.00 / list.size()))))))
            // Stream the entries of the distribution map
            // to sort it and store in a LinkedHashMap
            .entrySet()
            .stream()
            .sorted(Map.Entry.comparingByKey())
            .collect(Collectors.toMap(Map.Entry::getKey,
                    Map.Entry::getValue,
                    // Keys are already unique, so a merge should never happen
                    (a, b) -> {
                        throw new AssertionError();
                    },
                    LinkedHashMap::new));
}
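Sorting by value instead of by key only changes the comparator. The sketch below orders an already computed distribution from the most to the least frequent value. The names SortByValueSketch and sortByValueDescending() are hypothetical, the input map in main() is made-up sample data, and breaking ties by key is a design choice made here so that the result is deterministic.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class SortByValueSketch {

    // Re-collects a distribution into a LinkedHashMap ordered by value, highest first.
    static Map<Integer, Double> sortByValueDescending(Map<Integer, Double> distribution) {
        return distribution.entrySet()
                .stream()
                .sorted(Map.Entry.<Integer, Double>comparingByValue(Comparator.reverseOrder())
                        // Equal values are ordered by ascending key
                        .thenComparing(Map.Entry.<Integer, Double>comparingByKey()))
                .collect(Collectors.toMap(Map.Entry::getKey,
                        Map.Entry::getValue,
                        (a, b) -> {
                            throw new AssertionError("duplicate key");
                        },
                        LinkedHashMap::new));
    }

    public static void main(String[] args) {
        // Made-up sample distribution
        Map<Integer, Double> distribution = Map.of(1, 20.0, 2, 50.0, 3, 30.0);
        System.out.println(sortByValueDescending(distribution)); // {2=50.0, 3=30.0, 1=20.0}
    }
}
```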
Objects
How can this be done for objects? The same logic applies! Instead of an identity function (Integer::intValue), we'll group by the desired field - such as the published year for our books. Let's create a few books, store them in a list and then calculate the distribution of the publishing years:
Book book1 = new Book("001", "Our Mathematical Universe", "Max Tegmark", 432, 2014);
Book book2 = new Book("002", "Life 3.0", "Max Tegmark", 280, 2017);
Book book3 = new Book("003", "Sapiens", "Yuval Noah Harari", 443, 2011);
Book book4 = new Book("004", "Steve Jobs", "Walter Isaacson", 656, 2011);
List<Book> books = Arrays.asList(book1, book2, book3, book4);
Let's calculate the distribution of the publishedYear field:
public static Map<Integer, Double> calculateDistribution(List<Book> books) {
    return books.stream()
            .collect(Collectors.groupingBy(Book::getPublishedYear,
                    Collectors.collectingAndThen(Collectors.counting(),
                            count -> (Double.parseDouble(String.format("%.2f", count * 100.00 / books.size()))))))
            // Sort results by year
            .entrySet()
            .stream()
            .sorted(Map.Entry.comparingByKey())
            // The year is a long in Book, so convert each key to an Integer
            .collect(Collectors.toMap(e -> e.getKey().intValue(),
                    Map.Entry::getValue,
                    (a, b) -> {
                        throw new AssertionError();
                    },
                    LinkedHashMap::new));
}
Adjust the "%.2f" to set the number of decimal places. This results in:
{2011=50.0, 2014=25.0, 2017=25.0}
50% of the given books (2/4) were published in 2011, 25% (1/4) were published in 2014 and 25% (1/4) in 2017. What if you want to format this result differently, and normalize the range to 0..1?
Calculate Normalized (Percentage) Distribution of Collection in Java
To normalize the percentages from a 0..100 range to a 0..1 range - we'll simply adapt the collectingAndThen() call to not multiply the count by 100.0 before dividing by the size of the collection.
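Previously, the Long count returned by Collectors.counting() was implicitly converted into a double (multiplication with a double value). Without that multiplication, count / books.size() would be an integer division and truncate every fraction below 1 to 0 - so this time, we'll want to explicitly get the doubleValue() of the count: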
public static Map<Integer, Double> calculateDistributionNormalized(List<Book> books) {
    return books.stream()
            .collect(Collectors.groupingBy(Book::getPublishedYear,
                    Collectors.collectingAndThen(Collectors.counting(),
                            count -> (Double.parseDouble(String.format("%.4f", count.doubleValue() / books.size()))))))
            // Sort results by key
            .entrySet()
            .stream()
            .sorted(Map.Entry.comparingByKey())
            // The year is a long in Book, so convert each key to an Integer
            .collect(Collectors.toMap(e -> e.getKey().intValue(),
                    Map.Entry::getValue,
                    (a, b) -> {
                        throw new AssertionError();
                    },
                    LinkedHashMap::new));
}
Adjust the "%.4f" to set the number of decimal places. This results in:
{2011=0.5, 2014=0.25, 2017=0.25}
Calculate Element Count (Frequency) of Collection
Finally - we can get the element count (frequency of all elements) in the collection by simply not dividing the count by the size of the collection! This is a fully non-normalized count:
public static Map<Integer, Integer> calculateDistributionCount(List<Book> books) {
    return books
            .stream()
            .collect(Collectors.groupingBy(Book::getPublishedYear,
                    // Convert the Long count to an Integer directly
                    Collectors.collectingAndThen(Collectors.counting(), Long::intValue)))
            // Sort values by key
            .entrySet()
            .stream()
            .sorted(Map.Entry.comparingByKey())
            // The year is a long in Book, so convert each key to an Integer
            .collect(Collectors.toMap(e -> e.getKey().intValue(),
                    Map.Entry::getValue,
                    (a, b) -> {
                        throw new AssertionError();
                    },
                    LinkedHashMap::new));
}
This results in:
{2011=2, 2014=1, 2017=1}
Indeed, there are two books from 2011, and one from 2014 and 2017 each.
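The same key-sorted count can also be produced in a single pass, for any field, by handing a sorted map to groupingBy(). The sketch below is one possible shape rather than part of the original code: FrequencySketch and frequencies() are hypothetical names, and it uses the three-argument groupingBy() overload with TreeMap::new so the keys come out sorted without a second stream. It keeps the Long counts returned by counting() as they are.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class FrequencySketch {

    // Counts how often each classifier result occurs.
    // The TreeMap keeps the keys in their natural order.
    static <T, K extends Comparable<K>> Map<K, Long> frequencies(
            Collection<T> items, Function<T, K> classifier) {
        return items.stream()
                .collect(Collectors.groupingBy(classifier, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Integer> numbers = List.of(1, 1, 2, 1, 2, 3, 1, 4, 5, 1, 3);
        System.out.println(frequencies(numbers, Function.identity())); // {1=5, 2=2, 3=2, 4=1, 5=1}

        // Author names of the four sample books
        List<String> authors = List.of("Max Tegmark", "Max Tegmark", "Yuval Noah Harari", "Walter Isaacson");
        System.out.println(frequencies(authors, Function.identity()));
    }
}
```

With a list of books, a call such as frequencies(books, Book::getPublishedYear) would group by year, assuming the getter returns a boxed, comparable type such as Long.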
Conclusion
Calculating distributions of data is a common task in data-rich applications, and doesn't require the use of external libraries or complex code. With functional programming support, Java made working with collections a breeze!
In this short guide, we've taken a look at how you can calculate frequency counts of all elements in a collection, as well as how to calculate distribution maps normalized to percentages between 0 and 1 as well as 0 and 100 in Java.