During text processing, whether you're searching for certain words, writing pattern-matching rules or counting the frequency of elements, punctuation can throw a wrench in your plans.
Oftentimes, you'll want to remove stopwords, punctuation, digits or some other category of characters, depending on your end goal.
In this short tutorial, we'll take a look at how to remove punctuation from a string in Java.
Remove Punctuation from String with RegEx (Regular Expressions)
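Regular expressions are a natural fit here, both because they're likely already part of your other processing steps and because they're efficient pattern matchers. Java offers two relevant character classes: \p{P} matches any character in the Unicode punctuation category, while the POSIX class \p{Punct} by default matches the 32 ASCII punctuation characters, including symbols such as $, + and ~. The two are not interchangeable, and the examples in this section use \p{P}.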
You'll have to escape the backslash in a Java string literal, so removing all punctuation amounts to matching it and replacing each match with an empty string:
String.replaceAll("\\p{P}", "")
Let's apply it to a simple sentence:
String text = "Hi! This is, in effect, a synthetic sentence. It's meant to have several punctuation characters!";
String clean = text.replaceAll("\\p{P}", "");
System.out.println(clean);
This results in:
Hi This is in effect a synthetic sentence Its meant to have several punctuation characters
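String.replaceAll() compiles its pattern on every call. When the same cleanup runs over many strings, one option is to compile the pattern once and reuse it. The following is a minimal sketch of that idea; the class name PrecompiledPunctuationRemover and the method name clean are made up for illustration:

```java
import java.util.regex.Pattern;

class PrecompiledPunctuationRemover {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern PUNCTUATION = Pattern.compile("\\p{P}");

    // Hypothetical helper: removes every Unicode punctuation character from text.
    static String clean(String text) {
        return PUNCTUATION.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String[] lines = {
            "Hi! This is, in effect, a synthetic sentence.",
            "It's meant to have several punctuation characters!"
        };
        for (String line : lines) {
            System.out.println(clean(line));
        }
    }
}
```

For a one-off call, the plain replaceAll() shown above is perfectly fine; the precompiled form mostly pays off inside loops.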
Let's take a look at what characters are treated as punctuation here:
String text = "!#$%&'()*+,-./:;<=>?@[]^_`{|}~";
String clean = text.replaceAll("\\p{P}", "");
System.out.println(clean);
Of these special characters, which are left after removing punctuation? Only the ones that Unicode classifies as symbols rather than punctuation:
$+<=>^`|~
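If you also want those remaining symbols gone, the POSIX class \p{Punct} is worth a look, since by default it covers all ASCII punctuation and symbol characters. The trade-off is that it ignores non-ASCII punctuation. Here is a small sketch comparing the two classes; the class and method names are hypothetical, chosen only for this example:

```java
class PunctuationClasses {
    // Removes characters in the Unicode punctuation category (P).
    // Symbols such as $, + or ~ belong to category S and are kept.
    static String removeUnicodePunctuation(String s) {
        return s.replaceAll("\\p{P}", "");
    }

    // Removes the ASCII punctuation matched by the POSIX class \p{Punct}
    // (default behavior, without the UNICODE_CHARACTER_CLASS flag).
    static String removeAsciiPunctuation(String s) {
        return s.replaceAll("\\p{Punct}", "");
    }

    public static void main(String[] args) {
        String ascii = "!#$%&'()*+,-./:;<=>?@[]^_`{|}~";
        System.out.println(removeUnicodePunctuation(ascii)); // $+<=>^`|~
        System.out.println(removeAsciiPunctuation(ascii).isEmpty()); // true

        // Inverted question mark (U+00BF) is Unicode punctuation but not ASCII.
        String spanish = "\u00BFQu\u00E9?";
        System.out.println(removeUnicodePunctuation(spanish).length()); // 3
        System.out.println(removeAsciiPunctuation(spanish).length()); // 4
    }
}
```

Which class to pick depends on the input: \p{P} suits multilingual text, while \p{Punct} suits ASCII-only input where symbols should be stripped too.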
Remove Punctuation from String without RegEx
If you don't want to employ regular expressions, you can check each character manually while iterating through the string. Remember to use a StringBuffer instead of concatenating onto a String while doing this: strings are immutable, so every append creates a copy, and you'd end up creating string.length() intermediate strings in memory. In single-threaded code, the unsynchronized StringBuilder offers the same methods and is usually the lighter choice. StringBuffer is mutable, and can easily be converted into an immutable string at the end of the process:
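public static String removePunctuations(String s) {
    StringBuffer buffer = new StringBuffer();
    // Iterate over primitive chars to avoid boxing each one into a Character
    for (char c : s.toCharArray()) {
        // Keep only letters and digits; everything else is dropped
        if (Character.isLetterOrDigit(c)) {
            buffer.append(c);
        }
    }
    return buffer.toString();
}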
Let's create a string and clean it:
String text = "Hello! \nHere are some special characters: !#$%&'()*+,-./:;<=>?@[]^_`{|}~ \nWhere are they? :(\n";
System.out.println(text);
String clean = removePunctuations(text);
System.out.println(clean);
This results in:
Hello!
Here are some special characters: !#$%&'()*+,-./:;<=>?@[]^_`{|}~
Where are they? :(
HelloHerearesomespecialcharactersWherearethey
While this approach is more customizable, as written it keeps only letters and digits, so whitespace and line breaks are removed along with the punctuation. Alternatively, you can check for specific characters or character codes and exclude only some punctuation characters, leaving whitespace, line breaks and so on in place.
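As a rough sketch of that alternative, the loop can test each character against an explicit list of characters to drop and keep everything else, whitespace included. The class name, the method name removeSelected and the contents of TO_REMOVE are assumptions made for this example, so adjust the list to your needs:

```java
class SelectivePunctuationRemover {
    // Hypothetical list of characters to strip; anything not listed is kept.
    private static final String TO_REMOVE = "!\"#%&'()*,-./:;?@[\\]_{}";

    static String removeSelected(String s) {
        StringBuilder builder = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            // indexOf returns -1 when c is not in the removal list
            if (TO_REMOVE.indexOf(c) < 0) {
                builder.append(c);
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        String text = "Hello! \nWhere are they? :(\n";
        // Spaces and line breaks survive; only the listed characters are removed.
        System.out.print(removeSelected(text));
    }
}
```

For a long removal list, a Set<Character> or a boolean lookup table would scale better than indexOf, but for a few dozen characters the difference is unlikely to matter.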
Conclusion
In this short tutorial, we took a look at how you can remove punctuation or certain special characters from a string in Java, using regular expressions or a manual check in an enhanced for loop.