Java: Count Number of Word Occurrence in String

Introduction

Counting the number of word occurrences in a string is a fairly easy task, but there are several ways to do it. Efficiency matters as well, since you'll typically reach for an automated approach precisely when the search space is too large to go through manually.

In this guide, you'll learn how to count the number of word occurrences in a string in Java:

String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
String targetWord = "wants";

We'll search for the number of occurrences of the targetWord, using String.split(), Collections.frequency() and Regular Expressions.

Count Word Occurrences in String with String.split()

The simplest way to count the occurrences of a target word in a string is to split the string into words and iterate through the resulting array, incrementing a wordCount on each match. Note that when a word has punctuation attached to it, such as wants. at the end of the sentence, a simple split on spaces treats wants and wants. as different words, so that occurrence isn't counted!
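To see the punctuation problem concretely, here's a minimal sketch that splits the example sentence on spaces without any cleanup. The class and method names (NaiveSplitDemo, countWithoutCleanup) are made up for this illustration:

```java
class NaiveSplitDemo {
    // Hypothetical helper: counts exact matches after splitting on single spaces, with no cleanup
    static int countWithoutCleanup(String text, String target) {
        int count = 0;
        for (String word : text.split(" ")) {
            if (word.equals(target)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        // Only the first "wants" matches, since the last token is "wants." with the period attached
        System.out.println(countWithoutCleanup(searchText, "wants"));  // prints 1
        System.out.println(countWithoutCleanup(searchText, "wants.")); // prints 1
    }
}
```

The two occurrences end up as two different tokens, which is why the count for "wants" comes out as 1 rather than 2.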

To work around this, you can easily remove all punctuation from the sentence before splitting it:

// Remove punctuation, then split the sentence on spaces
String[] words = searchText.replaceAll("\\p{Punct}", "").split(" ");

// Count the elements that are equal to the target word
int wordCount = 0;
for (int i = 0; i < words.length; i++) {
    if (words[i].equals(targetWord)) {
        wordCount++;
    }
}
System.out.println(wordCount);

In the for loop, we simply iterate through the array, checking whether the element at each index is equal to the targetWord. If it is, we increment the wordCount. At the end of the execution, this prints:

2
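The benchmark later in this guide calls a countOccurencesWithSplit() helper without showing its body. A minimal sketch of what such a helper might look like follows. The body is an assumption, and it swaps split(" ") for split("\\s+") so that tabs, newlines and repeated spaces are also treated as separators:

```java
class SplitCounter {
    // Assumed body for the helper named in the benchmark, not the original implementation
    static int countOccurencesWithSplit(String searchText, String targetWord) {
        // Strip punctuation, trim the ends, then split on runs of whitespace
        String[] words = searchText.replaceAll("\\p{Punct}", "").trim().split("\\s+");
        int wordCount = 0;
        for (String word : words) {
            if (word.equals(targetWord)) {
                wordCount++;
            }
        }
        return wordCount;
    }

    public static void main(String[] args) {
        String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        System.out.println(countOccurencesWithSplit(searchText, "wants")); // prints 2
    }
}
```

Keep in mind that \p{Punct} also strips apostrophes and hyphens, so a word like don't becomes dont before the comparison.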

Count Word Occurrences in String with Collections.frequency()

The Collections.frequency() method provides a much cleaner, higher-level implementation that abstracts away the simple for loop. It counts the elements of a collection that are equal to the target object, comparing them with equals() and treating null as equal only to null.

The frequency() method accepts a collection to search through and the target object. It works for any type of object, with the behavior depending on how that type implements equals(). In the case of strings, equals() compares the contents of the strings:

// Clean text of punctuation marks
searchText = searchText.replaceAll("\\p{Punct}", "");
// Search through list of words
int wordCount = Collections.frequency(Arrays.asList(searchText.split(" ")), targetWord);
System.out.println(wordCount);

Here, we've wrapped the array obtained from split() in a fixed-size List, using the helper asList() method of the Arrays class. The frequency() method then returns an integer denoting how many times targetWord appears in the list, which results in:

2
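Since frequency() relies on equals(), the match is case-sensitive for strings. The sketch below wraps the approach in the countOccurencesWithCollections() helper that the benchmark refers to (its body here is an assumption), and adds a hypothetical countIgnoringCase() variant that lowercases both sides first:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class FrequencyCounter {
    // Assumed body for the helper named in the benchmark
    static int countOccurencesWithCollections(String searchText, String targetWord) {
        String cleaned = searchText.replaceAll("\\p{Punct}", "");
        List<String> words = Arrays.asList(cleaned.split(" "));
        return Collections.frequency(words, targetWord);
    }

    // Hypothetical variant: normalizes case so that "It" and "it" count as the same word
    static int countIgnoringCase(String searchText, String targetWord) {
        String cleaned = searchText.replaceAll("\\p{Punct}", "").toLowerCase(Locale.ROOT);
        List<String> words = Arrays.asList(cleaned.split(" "));
        return Collections.frequency(words, targetWord.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        System.out.println(countOccurencesWithCollections(searchText, "wants")); // prints 2
        System.out.println(countOccurencesWithCollections(searchText, "it"));    // prints 1
        System.out.println(countIgnoringCase(searchText, "it"));                 // prints 2
    }
}
```

Whether case should matter depends on your use case, so treat the second method as one possible design choice rather than a default.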

Word Occurrences in String with Matcher (Regular Expressions - RegEx)

Finally, you can use Regular Expressions to search for patterns, and count the number of matched patterns. Regular Expressions are made for this, so it's a very natural fit for the task. In Java, the Pattern class is used to represent and compile Regular Expressions, and the Matcher class is used to find and match patterns.

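Using RegEx, we can code the punctuation invariance into the expression itself: \\b asserts a word boundary before the word, and the negative lookahead (?!\\w) ensures that no word character follows it. There's no need to externally format the string or remove punctuation, which is preferable for large texts where storing another altered version in memory might be expensive:

// String.format() is static, so the format string is passed as the first argument
Pattern pattern = Pattern.compile(String.format("\\b%s(?!\\w)", targetWord));
// Or, if you want to avoid string formatting, hardcode the word instead:
// Pattern pattern = Pattern.compile("\\bwants(?!\\w)");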
Matcher matcher = pattern.matcher(searchText);

int wordCount = 0;
while (matcher.find())
    wordCount++;

System.out.println(wordCount);

This also results in:

2
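If the target word comes from user input, it may contain characters that have a special meaning in a regular expression. A minimal sketch of the countOccurencesWithRegex() helper used by the benchmark could guard against that with Pattern.quote(). The method body is an assumption, not the original implementation:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexCounter {
    // Assumed body for the helper named in the benchmark
    static int countOccurencesWithRegex(String searchText, String targetWord) {
        // Pattern.quote() makes the target word match literally, even if it contains regex metacharacters
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(targetWord) + "(?!\\w)");
        Matcher matcher = pattern.matcher(searchText);

        int wordCount = 0;
        while (matcher.find()) {
            wordCount++;
        }
        return wordCount;
    }

    public static void main(String[] args) {
        String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        System.out.println(countOccurencesWithRegex(searchText, "wants")); // prints 2
        // "want" is followed by the word character 's', so the lookahead rejects it
        System.out.println(countOccurencesWithRegex(searchText, "want"));  // prints 0
    }
}
```

If the same word is searched for many times, compiling the Pattern once and reusing it avoids repeating the compilation on every call.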

Efficiency Benchmark

So, which is the most efficient? Let's run a small benchmark:

int runs = 100000;

long start1 = System.currentTimeMillis();
for (int i = 0; i < runs; i++) {
    int result = countOccurencesWithSplit(searchText, targetWord);
}

long end1 = System.currentTimeMillis();
System.out.println(String.format("Array split approach took: %s milliseconds", end1-start1));

long start2 = System.currentTimeMillis();
for (int i = 0; i < runs; i++) {
    int result = countOccurencesWithCollections(searchText, targetWord);
}

long end2 = System.currentTimeMillis();
System.out.println(String.format("Collections.frequency() approach took: %s milliseconds", end2-start2));

long start3 = System.currentTimeMillis();
for (int i = 0; i < runs; i++) {
    int result = countOccurencesWithRegex(searchText, targetWord);
}

long end3 = System.currentTimeMillis();
System.out.println(String.format("Regex approach took: %s milliseconds", end3-start3));

Each method is run 100000 times. The more runs there are, the less the measurements are affected by chance, as per the law of large numbers. Running this code results in:

Array split approach took: 152 milliseconds
Collections.frequency() approach took: 140 milliseconds
Regex approach took: 92 milliseconds

However - what happens if we make the search more computationally expensive by making the search text larger? Let's generate a synthetic sentence:

List<String> possibleWords = Arrays.asList("hello", "world ");
StringBuffer searchTextBuffer = new StringBuffer();

for (int i = 0; i < 100; i++) {
    searchTextBuffer.append(String.join(" ", possibleWords));
}
System.out.println(searchTextBuffer);

This creates a string with the contents:

hello world hello world hello world hello ...
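As a sanity check on the generated text, here's a small self-contained sketch that builds the same string and counts the "hello" tokens. The buildSyntheticText() helper is hypothetical, and it uses StringBuilder in place of StringBuffer, since no synchronization is needed in a single thread:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SyntheticTextDemo {
    // Hypothetical helper: repeats "hello world " the given number of times
    static String buildSyntheticText(int repetitions) {
        // The trailing space in "world " separates one repetition from the next
        List<String> possibleWords = Arrays.asList("hello", "world ");
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < repetitions; i++) {
            builder.append(String.join(" ", possibleWords));
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        String searchText = buildSyntheticText(100);
        // Each repetition is 12 characters long: "hello world "
        System.out.println(searchText.length()); // prints 1200
        int hellos = Collections.frequency(Arrays.asList(searchText.split(" ")), "hello");
        System.out.println(hellos); // prints 100
    }
}
```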

Now, if we were to search for either "hello" or "world" - there'd be many more matches than the two from before. How do our methods do now in the benchmark?

Array split approach took: 606 milliseconds
Collections.frequency() approach took: 899 milliseconds
Regex approach took: 801 milliseconds

Now, array splitting comes out fastest, whereas the regex approach was fastest on the short sentence! In general, benchmarks depend on various factors, such as the search space and the target word, and your personal use case might be different from the benchmark.

Advice: Try the methods out on your own text, note the times, and pick the most efficient and elegant one for you.
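If you'd like to time the approaches on your own text, a rough harness along these lines might help. Everything in it is an assumption made for illustration: the MiniBenchmark class, the timeMillis() helper and the warm-up pass are not part of the benchmark above. For serious measurements, a dedicated tool such as JMH is generally a better choice:

```java
import java.util.function.IntSupplier;

class MiniBenchmark {
    // Results are written here so the loop's work is less likely to be optimized away as unused
    static volatile long sink;

    // Hypothetical helper: runs the task `runs` times and returns the elapsed time in milliseconds
    static long timeMillis(IntSupplier task, int runs) {
        long checksum = 0;
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            checksum += task.getAsInt();
        }
        long elapsedNanos = System.nanoTime() - start;
        sink = checksum;
        return elapsedNanos / 1_000_000;
    }

    // Same split-and-count logic as in the split() section of this guide
    static int countWithSplit(String text, String target) {
        int count = 0;
        for (String word : text.replaceAll("\\p{Punct}", "").split(" ")) {
            if (word.equals(target)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        String targetWord = "wants";
        int runs = 100000;

        // Warm-up pass: gives the JIT a chance to compile the hot path before measuring
        timeMillis(() -> countWithSplit(searchText, targetWord), runs);

        long elapsed = timeMillis(() -> countWithSplit(searchText, targetWord), runs);
        System.out.println("Array split approach took: " + elapsed + " milliseconds");
    }
}
```

Swapping the lambda for one of the other counting methods lets you compare them under the same conditions. The absolute numbers will vary between machines and JVM versions.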

Conclusion

In this short guide, we've taken a look at how to count word occurrences for a target word, in a string in Java. We've started out by splitting the string and using a simple counter, followed by using the Collections helper class, and finally, using Regular Expressions.

In the end, we've benchmarked the methods, and noted that their relative performance isn't fixed, and depends on the search space. On the short sentence, the regex approach was the fastest, while for the longer input text with many matches, splitting into an array was the most performant. Try all three methods on your own text, and pick the most performant one.

Last Updated: October 7th, 2022

Author: David Landup

    © 2013-2022 Stack Abuse. All rights reserved.