Introduction
Counting the number of word occurrences in a string is a fairly easy task, and there are several ways to do it. Efficiency matters as well, since you'll typically reach for an automated approach precisely when the search space is too large to go through manually.
In this guide, you'll learn how to count the number of word occurrences in a string in Java:
String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
String targetWord = "wants";
We'll search for the number of occurrences of the targetWord, using String.split(), Collections.frequency() and Regular Expressions.
Count Word Occurrences in String with String.split()
The simplest way to count the occurrences of a target word in a string is to split the string into words and iterate through the resulting array, incrementing a wordCount on each match. Note that when a word has any sort of punctuation around it, such as wants. at the end of the sentence, a simple word-level split will treat wants and wants. as separate words!
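To see the problem concretely, here is a minimal sketch that splits the example sentence on spaces without any cleanup. The class name PunctuationSplitDemo and the helper countTokens are hypothetical names used only for this illustration.

```java
class PunctuationSplitDemo {

    // Counts the space-separated tokens of text that are exactly equal to target.
    static int countTokens(String text, String target) {
        int count = 0;
        for (String token : text.split(" ")) {
            if (token.equals(target)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        String targetWord = "wants";

        // The last token is "wants." (with the period), which is not equal to "wants",
        // so only the first occurrence is counted.
        System.out.println(countTokens(searchText, targetWord)); // prints 1
    }
}
```

Only one of the two occurrences is found, because the second one carries the trailing period.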
To work around this, you can easily remove all punctuation from the sentence before splitting it:
String[] words = searchText.replaceAll("\\p{Punct}", "").split(" ");
int wordCount = 0;
for (int i = 0; i < words.length; i++) {
    if (words[i].equals(targetWord)) {
        wordCount++;
    }
}
System.out.println(wordCount);
In the for loop, we simply iterate through the array, checking whether the element at each index is equal to the targetWord. If it is, we increment the wordCount, which is printed at the end of the execution:
2
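Splitting on a single space assumes that words are separated by exactly one space character. For text that contains tabs or line breaks, one possible variation is to split on the regular expression \\s+ (one or more whitespace characters) instead. The sketch below is an illustration under that assumption; the class and method names are hypothetical.

```java
class WhitespaceSplitCounter {

    // Strips punctuation, then splits on runs of any whitespace (spaces, tabs, newlines).
    static int countOccurrences(String text, String target) {
        String[] words = text.replaceAll("\\p{Punct}", "").trim().split("\\s+");
        int count = 0;
        for (String word : words) {
            if (word.equals(target)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The words are separated by a newline and a tab in addition to spaces.
        String multiLine = "It wants\nwhat it\twants.";
        System.out.println(countOccurrences(multiLine, "wants")); // prints 2
    }
}
```

Keep in mind that \\p{Punct} also removes apostrophes and hyphens, so a word such as "don't" becomes "dont" before it is compared.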
Count Word Occurrences in String with Collections.frequency()
The Collections.frequency() method provides a much cleaner, higher-level implementation, which abstracts away the simple for loop. It compares elements by equality (whether an object is equal to another object, depending on the qualitative features of that object) rather than by identity (whether an object is the very same object), and treats null as equal only to null.
The frequency() method accepts a collection to search through and the target object, so it works for any type of object, where the behavior depends on how the object itself implements equals(). In the case of strings, equals() compares the contents of the strings:
import java.util.Arrays;
import java.util.Collections;

// Clean text of punctuation marks
searchText = searchText.replaceAll("\\p{Punct}", "");
// Search through list of words
int wordCount = Collections.frequency(Arrays.asList(searchText.split(" ")), targetWord);
System.out.println(wordCount);
Here, we've converted the array obtained from split() into a fixed-size List backed by that array, using the asList() helper method of the Arrays class. The frequency() method then returns an integer denoting how many times targetWord occurs in the list, which results in:
2
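To make the equality-based comparison concrete, the following sketch counts a string that is a distinct object with the same contents, and also counts null elements. The class name, the WORDS list and the count helper are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class FrequencyEqualityDemo {

    // new String(...) creates a separate object, so it is not identical (==) to the literal "wants".
    // Arrays.asList() permits null elements, unlike List.of().
    static final List<String> WORDS =
            Arrays.asList("it", "wants", "what", "it", new String("wants"), null);

    static int count(String target) {
        return Collections.frequency(WORDS, target);
    }

    public static void main(String[] args) {
        System.out.println(count("wants")); // prints 2: both objects are equal by content
        System.out.println(count("Wants")); // prints 0: the comparison is case-sensitive
        System.out.println(count(null));    // prints 1: null matches only null elements
    }
}
```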
Count Word Occurrences in String with Matcher (Regular Expressions - RegEx)
Finally, you can use Regular Expressions to search for patterns, and count the number of matched patterns. Regular Expressions are made for this, so they're a very natural fit for the task. In Java, the Pattern class is used to represent and compile Regular Expressions, and the Matcher class is used to find and match patterns.
Using RegEx, we can code the punctuation invariance into the expression itself, so there's no need to externally format the string or remove punctuation, which is preferable for large texts where storing another altered version in memory might be expensive:
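import java.util.regex.Matcher;
import java.util.regex.Pattern;

// String.format() is static, so the format string goes first;
// Pattern.quote() escapes any regex metacharacters in the target word
Pattern pattern = Pattern.compile(String.format("\\b%s(?!\\w)", Pattern.quote(targetWord)));
// Or, if you want to avoid string formatting:
// Pattern pattern = Pattern.compile("\\bwants(?!\\w)");
Matcher matcher = pattern.matcher(searchText);
int wordCount = 0;
while (matcher.find()) {
    wordCount++;
}
System.out.println(wordCount);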
This also results in:
2
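As a self-contained illustration of how the \\b boundary and the (?!\\w) lookahead behave, the sketch below wraps the regex approach in a method and tries it on a few inputs. The class and method names are hypothetical, and Pattern.quote() is used here on the assumption that the target word should always be matched literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexWordCounter {

    static int countOccurrences(String text, String target) {
        // \b requires a word boundary before the target, and (?!\w) rejects matches
        // that are immediately followed by another word character.
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(target) + "(?!\\w)");
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        System.out.println(countOccurrences(searchText, "wants"));                // prints 2
        // "wantsmore" and "unwants" only contain the target as part of a longer word.
        System.out.println(countOccurrences("wantsmore unwants wants", "wants")); // prints 1
        // Without quoting, the '+' characters would be read as regex quantifiers.
        System.out.println(countOccurrences("C++ and C", "C++"));                 // prints 1
    }
}
```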
Efficiency Benchmark
So, which is the most efficient? Let's run a small benchmark:
int runs = 100000;
long start1 = System.currentTimeMillis();
for (int i = 0; i < runs; i++) {
    int result = countOccurencesWithSplit(searchText, targetWord);
}
long end1 = System.currentTimeMillis();
System.out.println(String.format("Array split approach took: %s milliseconds", end1-start1));
long start2 = System.currentTimeMillis();
for (int i = 0; i < runs; i++) {
    int result = countOccurencesWithCollections(searchText, targetWord);
}
long end2 = System.currentTimeMillis();
System.out.println(String.format("Collections.frequency() approach took: %s milliseconds", end2-start2));
long start3 = System.currentTimeMillis();
for (int i = 0; i < runs; i++) {
    int result = countOccurencesWithRegex(searchText, targetWord);
}
long end3 = System.currentTimeMillis();
System.out.println(String.format("Regex approach took: %s milliseconds", end3-start3));
Each method is run 100000 times; the more runs, the less the totals are affected by random variation between individual runs. Running this code results in:
Array split approach took: 152 milliseconds
Collections.frequency() approach took: 140 milliseconds
Regex approach took: 92 milliseconds
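The benchmark calls three helper methods, countOccurencesWithSplit(), countOccurencesWithCollections() and countOccurencesWithRegex(), whose bodies aren't shown in this guide. A plausible sketch, assuming they simply wrap the snippets from the earlier sections, might look like the following. The enclosing class name OccurrenceCounters is hypothetical, and whether the original benchmark compiled the regex inside or outside the loop is not known.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OccurrenceCounters {

    // Split approach: strip punctuation, split on spaces, count with a loop.
    static int countOccurencesWithSplit(String searchText, String targetWord) {
        String[] words = searchText.replaceAll("\\p{Punct}", "").split(" ");
        int wordCount = 0;
        for (String word : words) {
            if (word.equals(targetWord)) {
                wordCount++;
            }
        }
        return wordCount;
    }

    // Collections approach: same cleanup, counting delegated to Collections.frequency().
    static int countOccurencesWithCollections(String searchText, String targetWord) {
        String cleaned = searchText.replaceAll("\\p{Punct}", "");
        return Collections.frequency(Arrays.asList(cleaned.split(" ")), targetWord);
    }

    // Regex approach: the pattern is compiled on every call in this sketch.
    static int countOccurencesWithRegex(String searchText, String targetWord) {
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(targetWord) + "(?!\\w)");
        Matcher matcher = pattern.matcher(searchText);
        int wordCount = 0;
        while (matcher.find()) {
            wordCount++;
        }
        return wordCount;
    }

    public static void main(String[] args) {
        String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        String targetWord = "wants";
        System.out.println(countOccurencesWithSplit(searchText, targetWord));       // prints 2
        System.out.println(countOccurencesWithCollections(searchText, targetWord)); // prints 2
        System.out.println(countOccurencesWithRegex(searchText, targetWord));       // prints 2
    }
}
```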
However, what happens if we make the search more computationally expensive by making the search text larger? Let's generate a synthetic sentence:
List<String> possibleWords = Arrays.asList("hello", "world ");
// StringBuilder is sufficient here, since only a single thread appends to it
StringBuilder searchTextBuffer = new StringBuilder();
for (int i = 0; i < 100; i++) {
    searchTextBuffer.append(String.join(" ", possibleWords));
}
System.out.println(searchTextBuffer);
This creates a string with the contents:
hello world hello world hello world hello ...
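As a quick sanity check before benchmarking, it can help to confirm how many matches the generated text actually contains. The sketch below rebuilds the same text and counts both words; the class name and the buildText and count helpers are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SyntheticTextCheck {

    // Builds "hello world " repeated the given number of times.
    static String buildText(int repetitions) {
        List<String> possibleWords = Arrays.asList("hello", "world ");
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < repetitions; i++) {
            builder.append(String.join(" ", possibleWords));
        }
        return builder.toString();
    }

    // Counts exact matches among the space-separated tokens.
    static int count(String text, String target) {
        return Collections.frequency(Arrays.asList(text.split(" ")), target);
    }

    public static void main(String[] args) {
        String text = buildText(100);
        System.out.println(text.length());        // prints 1200 (12 characters per repetition)
        System.out.println(count(text, "hello")); // prints 100
        System.out.println(count(text, "world")); // prints 100
    }
}
```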
Now, if we were to search for either "hello" or "world" - there'd be many more matches than the two from before. How do our methods do now in the benchmark?
Array split approach took: 606 milliseconds
Collections.frequency() approach took: 899 milliseconds
Regex approach took: 801 milliseconds
Now, array splitting comes out fastest! In general, benchmark results depend on various factors, such as the size of the search space and the target word, and your own use case might differ from the one benchmarked here.
Advice: Try the methods out on your own text, note the times, and pick the most efficient and elegant one for you.
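If you'd like to time the approaches on your own text, one possible starting point is a small harness like the sketch below. It is not the benchmark used in this guide: it adds a warm-up phase and uses System.nanoTime(), on the assumption that this reduces some of the noise from JIT compilation. The names SimpleBenchmark, timeMillis and sink are hypothetical, and for rigorous measurements a dedicated tool such as JMH is generally a better choice.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.function.ToIntBiFunction;

class SimpleBenchmark {

    // Accumulates results so the counted values are actually used by the program.
    static long sink;

    // Runs the counter `runs` times after a short warm-up and returns the elapsed milliseconds.
    static long timeMillis(ToIntBiFunction<String, String> counter, String text, String target, int runs) {
        for (int i = 0; i < runs / 10; i++) {
            sink += counter.applyAsInt(text, target); // warm-up, not timed
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            sink += counter.applyAsInt(text, target);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        String searchText = "Your body may be chrome, but the heart never changes. It wants what it wants.";
        String targetWord = "wants";

        // Example counter: the Collections.frequency() approach from this guide.
        ToIntBiFunction<String, String> withCollections = (text, word) ->
                Collections.frequency(Arrays.asList(text.replaceAll("\\p{Punct}", "").split(" ")), word);

        long elapsed = timeMillis(withCollections, searchText, targetWord, 100000);
        System.out.println("Collections.frequency() approach took: " + elapsed + " milliseconds");
    }
}
```

Any of the three approaches can be passed in as a lambda or method reference, which keeps the timing logic identical across them.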
Conclusion
In this short guide, we've taken a look at how to count the occurrences of a target word in a string in Java. We started out by splitting the string and using a simple counter, followed by using the Collections helper class, and finally, using Regular Expressions.
In the end, we benchmarked the methods and noted that their relative performance isn't fixed, but depends on the search space. For longer input texts with many matches, splitting the string into an array seemed to be the most performant. Try all three methods on your own text, and pick the one that performs best for you.