Calculate Distribution from Collection in Java

Turning a collection of numbers (or objects whose fields you'd like to inspect) into a distribution of those numbers is a common statistical technique, used in reporting and data-driven applications.

Given a collection:

1, 1, 2, 1, 2, 3, 1, 4, 5, 1, 3

You can inspect their distribution as a count (frequency of each element), and store the results in a map:

{
"1": 5,
"2": 2,
"3": 2,
"4": 1,
"5": 1
}

Or, you can normalize the values based on the total number of values - thus expressing them in percentages:

{
"1": 0.45,
"2": 0.18,
"3": 0.18,
"4": 0.09,
"5": 0.09
}

Or even express these percentages in a 0..100 format instead of a 0..1 format.
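To make the arithmetic concrete: the value 1 appears 5 times out of 11, and 5 / 11 is roughly 0.4545, which rounds to 0.45 (or 45.45 in the 0..100 format). Below is a minimal sketch of that calculation. The class name and the round() helper are hypothetical and only serve the illustration:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

class NormalizeExample {

    // Hypothetical helper: rounds a value to the given number of decimal places.
    static double round(double value, int places) {
        return BigDecimal.valueOf(value)
                .setScale(places, RoundingMode.HALF_UP)
                .doubleValue();
    }

    public static void main(String[] args) {
        long count = 5;  // how often the value 1 appears
        int total = 11;  // size of the whole collection

        double fraction = (double) count / total; // 0..1 format
        double percent = fraction * 100;          // 0..100 format

        System.out.println(round(fraction, 2)); // 0.45
        System.out.println(round(percent, 2));  // 45.45
    }
}
```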

In this guide, we'll take a look at how you can calculate a distribution from a collection - both using primitive types and objects whose fields you might want to report in your application.

With the addition of functional programming support in Java - calculating distributions is easier than ever. We'll be working with a collection of numbers and a collection of Books:

public class Book {

    private String id;
    private String name;
    private String author;
    private long pageNumber;
    private long publishedYear;

   // Constructor, getters, setters, toString()
}
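The constructor and accessors are elided above. As a minimal sketch, assuming a constructor that takes the fields in declaration order (which matches how the books are created later in this guide) and conventional getter names, the class could look like this. Setters are left out for brevity:

```java
// A minimal sketch of the Book class. The constructor argument order and the
// getter names (other than getPublishedYear) are assumptions for illustration.
class Book {

    private String id;
    private String name;
    private String author;
    private long pageNumber;
    private long publishedYear;

    Book(String id, String name, String author, long pageNumber, long publishedYear) {
        this.id = id;
        this.name = name;
        this.author = author;
        this.pageNumber = pageNumber;
        this.publishedYear = publishedYear;
    }

    public String getId() { return id; }
    public String getName() { return name; }
    public String getAuthor() { return author; }
    public long getPageNumber() { return pageNumber; }
    public long getPublishedYear() { return publishedYear; }

    @Override
    public String toString() {
        return "Book{id=" + id + ", name=" + name + ", author=" + author
                + ", pageNumber=" + pageNumber + ", publishedYear=" + publishedYear + "}";
    }
}
```

Because publishedYear is a long, a getter like this returns a long, and grouping by it produces Long keys.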

Calculate Distribution of Collection in Java

Let's first take a look at how you can calculate a distribution for primitive types. Working with objects simply allows you to call custom methods from your domain classes to provide more flexibility in the calculations.

By default, we'll represent the percentages as a double from 0.00 to 100.00.

Primitive Types

Let's create a list of integers and print their distribution:

List<Integer> integerList = List.of(1, 1, 2, 1, 2, 3, 1, 4, 5, 1, 3);
System.out.println(calculateIntegerDistribution(integerList));

The distribution is calculated with:

public static Map<Integer, Double> calculateIntegerDistribution(List<Integer> list) {
    return list.stream()
            .collect(Collectors.groupingBy(Integer::intValue,
                    Collectors.collectingAndThen(Collectors.counting(),
                            // Locale.ROOT keeps '.' as the decimal separator so the string parses back
                            count -> (Double.parseDouble(String.format(Locale.ROOT, "%.2f", count * 100.00 / list.size()))))));
}

This method accepts a list and streams it. The elements are grouped by their integer value and the size of each group is counted with Collectors.counting(). The result is collected into a Map<Integer, Double>, where the keys are the input values and the doubles are their percentages in the distribution.

The key method here is collect(), which takes a single collector: the one built by groupingBy(). In turn, groupingBy() takes two arguments. The first is a classifier function that decides which key each element belongs to (here, the element itself). The second is a downstream collector that reduces each group to a single value. We build it with collectingAndThen(), which counts the elements in the group and then transforms that count, in this case with count * 100.00 / list.size(), to express it as a percentage:

{1=45.45, 2=18.18, 3=18.18, 4=9.09, 5=9.09}
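It can help to see the two stages separately. The sketch below first groups and counts, and only then converts a count into a percentage. The class and method names are hypothetical, and no rounding is applied:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class DistributionSteps {

    // Step 1: group equal values together and count each group.
    static Map<Integer, Long> countOccurrences(List<Integer> list) {
        return list.stream()
                .collect(Collectors.groupingBy(value -> value, Collectors.counting()));
    }

    // Step 2: turn a single count into a percentage of the total.
    static double toPercentage(long count, int total) {
        return count * 100.0 / total;
    }

    public static void main(String[] args) {
        List<Integer> integerList = List.of(1, 1, 2, 1, 2, 3, 1, 4, 5, 1, 3);

        Map<Integer, Long> counts = countOccurrences(integerList);
        // TreeMap is used here only so the printed order is predictable
        System.out.println(new TreeMap<>(counts)); // {1=5, 2=2, 3=2, 4=1, 5=1}

        // Unrounded percentage for the value 1, roughly 45.45
        System.out.println(toPercentage(counts.get(1), integerList.size()));
    }
}
```

collectingAndThen() simply fuses these two steps, applying the second one to the result of the first for every group.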

Advice: If you'd like to read more about groupingBy(), counting() and collectingAndThen() - read our in-depth, example driven "Guide to Java 8 Collectors: groupingBy()", "Guide to Java 8 Collectors: counting()" and "Guide to Java 8 Collectors: collectingAndThen()".

Sort Distribution by Value or Key

When creating distributions - you'll typically want to sort the entries. More often than not, this'll be by key. The map returned by groupingBy() comes with no ordering guarantees (in practice it's a HashMap), so we'll store the sorted entries in a LinkedHashMap, which preserves insertion order. It's easiest to re-stream the map and re-collect it, now that it's much smaller and more manageable.

The previous operation can quickly collapse thousands of records into a small map, depending on the number of distinct keys you're dealing with, so re-streaming isn't expensive:

public static Map<Integer, Double> calculateIntegerDistribution(List<Integer> list) {
    return list.stream()
            .collect(Collectors.groupingBy(Integer::intValue,
                    Collectors.collectingAndThen(Collectors.counting(),
                            // Multiply by 100.00 for a 0..100 percentage; Locale.ROOT keeps '.' as the decimal separator
                            count -> (Double.parseDouble(String.format(Locale.ROOT, "%.2f", count * 100.00 / list.size()))))))
            // Stream the entries of the distribution map
            // to sort it and store in a LinkedHashMap
            .entrySet()
            .stream()
            .sorted(Map.Entry.comparingByKey())
            .collect(Collectors.toMap(Map.Entry::getKey,
                    Map.Entry::getValue,
                    // Keys are already unique, so a merge should never happen
                    (a, b) -> {
                        throw new AssertionError();
                    },
                    LinkedHashMap::new));
}
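Sorting by value follows the same pattern, with a different comparator. The sketch below orders the entries from the most to the least frequent value and breaks ties by key. The class and method names are hypothetical, and rounding is left out to keep the focus on the sorting step:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class SortByValueExample {

    // Hypothetical helper: computes unrounded percentages, then orders the
    // entries from the highest to the lowest percentage.
    static Map<Integer, Double> distributionSortedByValue(List<Integer> list) {
        Map<Integer, Double> unsorted = list.stream()
                .collect(Collectors.groupingBy(value -> value,
                        Collectors.collectingAndThen(Collectors.counting(),
                                count -> count * 100.0 / list.size())));

        return unsorted.entrySet()
                .stream()
                // Highest percentage first; ties are broken by the smaller key
                .sorted(Map.Entry.<Integer, Double>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<Integer, Double>comparingByKey()))
                .collect(Collectors.toMap(Map.Entry::getKey,
                        Map.Entry::getValue,
                        (a, b) -> a, // keys are already unique, so this is never called
                        LinkedHashMap::new));
    }

    public static void main(String[] args) {
        Map<Integer, Double> sorted = distributionSortedByValue(List.of(5, 3, 5, 1, 5, 3));
        System.out.println(sorted.keySet()); // [5, 3, 1]
    }
}
```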

Objects

How can this be done for objects? The same logic applies! Instead of an identity function (Integer::intValue), we'll use the desired field as the grouping key - such as the published year for our books. Let's create a few books, store them in a list and then calculate the distribution of the publishing years:

Book book1 = new Book("001", "Our Mathematical Universe", "Max Tegmark", 432, 2014);
Book book2 = new Book("002", "Life 3.0", "Max Tegmark", 280, 2017);
Book book3 = new Book("003", "Sapiens", "Yuval Noah Harari", 443, 2011);
Book book4 = new Book("004", "Steve Jobs", "Walter Isaacson", 656, 2011);

List<Book> books = Arrays.asList(book1, book2, book3, book4);

Let's calculate the distribution of the publishedYear field:

public static Map<Integer, Double> calculateDistribution(List<Book> books) {
    return books.stream()
            .collect(Collectors.groupingBy(Book::getPublishedYear,
                    Collectors.collectingAndThen(Collectors.counting(),
                            // Locale.ROOT keeps '.' as the decimal separator so the string parses back
                            count -> (Double.parseDouble(String.format(Locale.ROOT, "%.2f", count * 100.00 / books.size()))))))
            // Sort results by year
            .entrySet()
            .stream()
            .sorted(Map.Entry.comparingByKey())
            // publishedYear is a long, so each key is narrowed to an Integer
            .collect(Collectors.toMap(e -> e.getKey().intValue(),
                    Map.Entry::getValue,
                    // Keys are already unique, so a merge should never happen
                    (a, b) -> {
                        throw new AssertionError();
                    },
                    LinkedHashMap::new));
}

Adjust the "%.2f" format to change the number of decimal places. This results in:

{2011=50.0, 2014=25.0, 2017=25.0}

50% of the given books (2/4) were published in 2011, 25% (1/4) were published in 2014 and 25% (1/4) in 2017. What if you want to format this result differently, and normalize the range in 0..1?

Calculate Normalized (Percentage) Distribution of Collection in Java

To normalize the percentages from a 0..100 range to a 0..1 range - we'll simply adapt the collectingAndThen() call so that it doesn't multiply the count by 100.0 before dividing by the size of the collection.

Previously, the Long count returned by Collectors.counting() was implicitly converted into a double (multiplication with a double value) - so this time, we'll want to explicitly get the doubleValue() of the count:

public static Map<Integer, Double> calculateDistributionNormalized(List<Book> books) {
    return books.stream()
            .collect(Collectors.groupingBy(Book::getPublishedYear,
                    Collectors.collectingAndThen(Collectors.counting(),
                            // Locale.ROOT keeps '.' as the decimal separator so the string parses back
                            count -> (Double.parseDouble(String.format(Locale.ROOT, "%.4f", count.doubleValue() / books.size()))))))
            // Sort results by key
            .entrySet()
            .stream()
            .sorted(Map.Entry.comparingByKey())
            // publishedYear is a long, so each key is narrowed to an Integer
            .collect(Collectors.toMap(e -> e.getKey().intValue(),
                    Map.Entry::getValue,
                    // Keys are already unique, so a merge should never happen
                    (a, b) -> {
                        throw new AssertionError();
                    },
                    LinkedHashMap::new));
}

Adjust the "%.4f" format to change the number of decimal places. This results in:

{2011=0.5, 2014=0.25, 2017=0.25}
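The explicit doubleValue() call matters because dividing a Long by an int is integer division, which drops the fractional part. A small sketch of the difference, with hypothetical class and method names:

```java
class IntegerDivisionExample {

    // Dividing a Long by an int uses integer arithmetic and drops the fraction.
    static long integerShare(Long count, int total) {
        return count / total;
    }

    // Converting the count to a double first keeps the fractional part.
    static double fractionalShare(Long count, int total) {
        return count.doubleValue() / total;
    }

    public static void main(String[] args) {
        Long count = 1L; // e.g. one book published in 2014
        int total = 4;   // four books overall

        System.out.println(integerShare(count, total));    // 0
        System.out.println(fractionalShare(count, total)); // 0.25
    }
}
```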

Calculate Element Count (Frequency) of Collection

Finally - we can get the element count (frequency of all elements) in the collection by simply not dividing the count by the size of the collection! This is a fully non-normalized count:

public static Map<Integer, Integer> calculateDistributionCount(List<Book> books) {
    return books.stream()
            .collect(Collectors.groupingBy(Book::getPublishedYear,
                    // counting() yields a Long, so convert it to an Integer directly
                    Collectors.collectingAndThen(Collectors.counting(), Long::intValue)))
            // Sort values by key
            .entrySet()
            .stream()
            .sorted(Map.Entry.comparingByKey())
            // publishedYear is a long, so each key is narrowed to an Integer
            .collect(Collectors.toMap(e -> e.getKey().intValue(),
                    Map.Entry::getValue,
                    // Keys are already unique, so a merge should never happen
                    (a, b) -> {
                        throw new AssertionError();
                    },
                    LinkedHashMap::new));
}

This results in:

{2011=2, 2014=1, 2017=1}

Indeed, there are two books from 2011, and one from 2014 and 2017 each.
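If sorting by key is all you need, one alternative is to let groupingBy() collect straight into a TreeMap through its three-argument overload, which avoids the second stream. The generic helper below is a sketch under that assumption. The class name and the frequency() method are hypothetical, and the counts stay as Long values:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class GenericFrequency {

    // Hypothetical generic helper: counts elements per key and keeps the keys
    // sorted by collecting directly into a TreeMap.
    static <T, K extends Comparable<K>> Map<K, Long> frequency(
            List<T> items, Function<T, K> classifier) {
        return items.stream()
                .collect(Collectors.groupingBy(classifier, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Integer> numbers = List.of(1, 1, 2, 1, 2, 3, 1, 4, 5, 1, 3);
        Map<Integer, Long> numberCounts = frequency(numbers, n -> n);
        System.out.println(numberCounts); // {1=5, 2=2, 3=2, 4=1, 5=1}

        List<String> words = List.of("apple", "fig", "kiwi", "pear");
        Map<Integer, Long> lengthCounts = frequency(words, String::length);
        System.out.println(lengthCounts); // {3=1, 4=2, 5=1}
    }
}
```

The same helper works for domain objects by passing a getter as the classifier, and the resulting keys have whatever type that getter returns.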

Conclusion

Calculating distributions of data is a common task in data-rich applications, and doesn't require the use of external libraries or complex code. With functional programming support, Java made working with collections a breeze!

In this short guide, we've taken a look at how you can calculate frequency counts of all elements in a collection, as well as distribution maps normalized to the 0..1 and 0..100 ranges in Java.

Last Updated: November 3rd, 2022
Author: David Landup

    © 2013-2022 Stack Abuse. All rights reserved.