Introduction
When working with Strings in Java, we oftentimes need to encode them to a specific charset, such as UTF-8.
UTF-8 is a variable-width character encoding that uses between one and four eight-bit bytes to represent all valid Unicode code points.
A code point can represent a single character, but it can also have other meanings, such as formatting. "Variable-width" means that the number of bytes used per code point varies (between one and four). As a space-saving measure, lower code points, which include the most commonly used characters, are represented with fewer bytes than higher ones.
UTF-8 uses one byte to represent code points from 0-127, making the first 128 code points a one-to-one map with ASCII characters, so UTF-8 is backward-compatible with ASCII.
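To make the variable width concrete, here is a small sketch that counts the UTF-8 bytes of four single code points. The class and method names (Utf8Widths, utf8Length) are made up for this illustration, and the characters are written as Unicode escapes so the result doesn't depend on the source file's encoding:

```java
import java.nio.charset.StandardCharsets;

class Utf8Widths {

    // Returns how many bytes UTF-8 needs to encode the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // 1 (U+0041, plain ASCII)
        System.out.println(utf8Length("\u00DF"));       // 2 (U+00DF, the German sharp s)
        System.out.println(utf8Length("\u3042"));       // 3 (U+3042, hiragana a)
        System.out.println(utf8Length("\uD83D\uDE00")); // 4 (U+1F600, an emoji outside the BMP)
    }
}
```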
Note: Java represents Strings as sequences of UTF-16 code units, and UTF-16 uses at least two bytes per code point. Why would we need to convert to UTF-8 then?
Not all input might be UTF-16, or UTF-8 for that matter. You might actually receive an ASCII-encoded String, which doesn't support as many characters as UTF-8. Additionally, not all output might handle UTF-16, so it makes sense to convert to a more universal UTF-8.
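As a rough sketch of that conversion, assuming we already know which charset the incoming bytes were written in, we can decode them with that charset and re-encode the resulting String as UTF-8. The Transcoder class and its toUtf8() method are hypothetical names used only for this example:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Transcoder {

    // Decodes the bytes with the charset they were written in, then re-encodes them as UTF-8.
    static byte[] toUtf8(byte[] input, Charset sourceCharset) {
        String decoded = new String(input, sourceCharset);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "heißen" as ISO-8859-1 bytes: ß is the single byte 0xDF in that charset.
        byte[] latin1 = {'h', 'e', 'i', (byte) 0xDF, 'e', 'n'};
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);

        // ß becomes the two bytes 0xC3 0x9F in UTF-8, so the output is one byte longer.
        System.out.println(latin1.length + " bytes in, " + utf8.length + " bytes out"); // 6 bytes in, 7 bytes out
    }
}
```

The important part is that the source charset has to be known or agreed upon; it cannot be reliably guessed from the bytes alone.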
We'll be working with a few Strings that contain Unicode characters you might not encounter on a daily basis, such as č, ß and あ, simulating user input.
Let's write out a couple of Strings:
String serbianString = "Šta radiš?"; // What are you doing?
String germanString = "Wie heißen Sie?"; // What's your name?
String japaneseString = "よろしくお願いします"; // Pleased to meet you.
Now, let's leverage the String(byte[] bytes, Charset charset) constructor to recreate these Strings from their bytes, but decode them with a different Charset, simulating input that was treated as ASCII when it arrived:
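// Encode explicitly as UTF-8 so the result doesn't depend on the platform's default charset,
// then decode the same bytes as ASCII
String asciiSerbianString = new String(serbianString.getBytes(StandardCharsets.UTF_8), StandardCharsets.US_ASCII);
String asciiGermanString = new String(germanString.getBytes(StandardCharsets.UTF_8), StandardCharsets.US_ASCII);
String asciiJapaneseString = new String(japaneseString.getBytes(StandardCharsets.UTF_8), StandardCharsets.US_ASCII);
System.out.println(asciiSerbianString);
System.out.println(asciiGermanString);
System.out.println(asciiJapaneseString);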
Because the bytes were decoded as ASCII, every byte outside the ASCII range turns into the replacement character (�) when we print the Strings:
��ta radi��?
Wie hei��en Sie?
������������������������������
While the first two Strings contain just a few characters that aren't valid ASCII characters, the final one doesn't contain a single ASCII character, so none of it survives.
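One way to spot this kind of text up front is to ask an ASCII encoder whether it could represent the String at all. The following is a minimal sketch using the JDK's CharsetEncoder.canEncode(); the AsciiCheck class and isPureAscii() method are names invented for the example:

```java
import java.nio.charset.StandardCharsets;

class AsciiCheck {

    // Returns true only if every character of the text can be represented in US-ASCII.
    static boolean isPureAscii(String text) {
        // A fresh encoder per call, since CharsetEncoder instances are not thread-safe.
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(isPureAscii("Wie heissen Sie?"));      // true
        System.out.println(isPureAscii("Wie hei\u00DFen Sie?"));  // false, because of ß
        System.out.println(isPureAscii("\u3088\u308D\u3057\u304F")); // false, no ASCII at all
    }
}
```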
To avoid this issue, we can assume that not all input might already be encoded to our liking - and encode it to iron out such cases ourselves. There are several ways we can go about encoding a String to UTF-8 in Java.
Encoding a String in Java means converting its characters into a sequence of bytes according to the rules of a Charset. Decoding is the reverse: interpreting a sequence of bytes with a Charset to form a String instance.
Using the getBytes() method
The String class offers a getBytes() method, which encodes the characters of the String into a new byte array. We can pass a Charset to this method to control which encoding is used to produce the bytes.
By default, without providing a Charset, the bytes are encoded using the platform's default Charset, which might not be UTF-8 or UTF-16 (UTF-8 only became the standard default in Java 18). Let's get the bytes of a String and print them out:
String serbianString = "Šta radiš?"; // What are you doing?
byte[] bytes = serbianString.getBytes(StandardCharsets.UTF_8);
for (byte b : bytes) {
    System.out.printf("%d ", b);
}
This outputs:
-59 -96 116 97 32 114 97 100 105 -59 -95 63
These are the UTF-8 bytes of our String, printed as signed Java byte values (which is why bytes above 127 show up as negative numbers), and they're not really useful to human eyes. Though, again, we can leverage String's constructor to make a human-readable String from this very sequence. Since we've encoded the String into UTF_8, we decode the bytes with that same Charset:
String utf8String = new String(bytes, StandardCharsets.UTF_8);
System.out.println(utf8String);
Note: The single-argument constructor, new String(bytes), decodes with the platform's default Charset instead, so it only restores the original String when that default happens to be UTF-8.
This now outputs the exact same String we started with, decoded from its UTF-8 bytes:
Šta radiš?
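The signed numbers printed earlier in this section can be hard to compare against UTF-8 tables, which usually list bytes in hexadecimal. As a small sketch, the bytes can be masked with 0xFF and formatted as two hex digits; HexDump and toHex() are names made up for this example:

```java
import java.nio.charset.StandardCharsets;

class HexDump {

    // Formats each byte as two uppercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so negative bytes print as 80..FF instead of being sign-extended.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Šta radiš?" written with Unicode escapes for Š (U+0160) and š (U+0161)
        byte[] bytes = "\u0160ta radi\u0161?".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(bytes)); // C5 A0 74 61 20 72 61 64 69 C5 A1 3F
    }
}
```

Here, -59 -96 from the earlier output corresponds to C5 A0, the two-byte UTF-8 sequence for Š.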
Encode a String to UTF-8 with Java 7 StandardCharsets
Java 7 introduced the StandardCharsets class, which provides constants for several Charsets, such as US_ASCII, ISO_8859_1, UTF_8 and UTF_16, among others.
Each Charset has an encode() method, which accepts either a String or a CharBuffer, and a decode() method, which accepts a ByteBuffer. In practical terms, this means we can pass a String straight into the encode() method of a Charset.
The encode() method returns a ByteBuffer, which we can easily turn into a String again.
Earlier, when we used the getBytes() method, we got the encoded bytes back as a byte array. When encoding through a Charset, things are a bit different: the bytes come back wrapped in a ByteBuffer. To get a String again, we pass that buffer to the decode() method of the same Charset. Let's see how this works in code:
String japaneseString = "よろしくお願いします"; // Pleased to meet you.
ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode(japaneseString);
// Decode the buffer itself: byteBuffer.array() may be larger than the encoded content
String utf8String = StandardCharsets.UTF_8.decode(byteBuffer).toString();
System.out.println(utf8String);
Running this code results in:
よろしくお願いします
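Note that Charset.decode() and the String constructor silently substitute a replacement character for invalid bytes. If the input should be rejected instead, one option is a CharsetDecoder configured to report errors. This is a sketch of that idea using only the JDK; the StrictUtf8 class and isValidUtf8() method are hypothetical names:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {

    // Returns true if the bytes form well-formed UTF-8, false otherwise.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // throw instead of replacing
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] complete = {(byte) 0xE3, (byte) 0x81, (byte) 0x82}; // the three UTF-8 bytes of あ
        byte[] truncated = {(byte) 0xE3, (byte) 0x81};             // same sequence, last byte missing

        System.out.println(isValidUtf8(complete));  // true
        System.out.println(isValidUtf8(truncated)); // false
    }
}
```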
Encode a String to UTF-8 with Apache Commons
The Apache Commons Codec package contains simple encoders and decoders for various formats such as Base64 and Hexadecimal. In addition to these widely used encoders and decoders, the codec package also maintains a collection of phonetic encoding utilities.
For us to be able to use the Apache Commons Codec, we need to add it to our project as an external dependency.
Using Maven, let's add the commons-codec dependency to our pom.xml file:
<dependency>
<groupId>commons-codec</groupId>
<artifactId>commons-codec</artifactId>
<version>1.15</version>
</dependency>
Alternatively, if you're using Gradle:
implementation 'commons-codec:commons-codec:1.15'
Now, we can utilize the utility classes of Apache Commons Codec. Here, we'll be leveraging its StringUtils class, found in the org.apache.commons.codec.binary package.
It allows us to convert Strings to and from bytes using various encodings required by the Java specification. This class is null-safe and thread-safe, so we've got an extra layer of protection when working with Strings.
To encode a String to UTF-8 with Apache Commons Codec's StringUtils class, we can use the getBytesUtf8() method, which functions much like the getBytes() method with a specified Charset:
String germanString = "Wie heißen Sie?"; // What's your name?
byte[] bytes = StringUtils.getBytesUtf8(germanString);
String utf8String = StringUtils.newStringUtf8(bytes);
System.out.println(utf8String);
This results in:
Wie heißen Sie?
Or, you can use the regular StringUtils class from the commons-lang3 dependency:
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
</dependency>
If you're using Gradle:
implementation group: 'org.apache.commons', name: 'commons-lang3', version: ${version}
And now, we can use much the same approach as with regular Strings:
String germanString = "Wie heißen Sie?"; // What's your name?
byte[] bytes = StringUtils.getBytes(germanString, StandardCharsets.UTF_8);
String utf8String = StringUtils.toEncodedString(bytes, StandardCharsets.UTF_8);
System.out.println(utf8String);
This approach is also thread-safe and null-safe, and it prints the same result:
Wie heißen Sie?
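To illustrate what null-safety means here without pulling in a dependency, the following is a plain-JDK sketch of two helpers that pass null through instead of throwing a NullPointerException. NullSafeUtf8 and its methods are hypothetical and written for this example only; the Apache Commons classes may return different values for null input, so their Javadoc is the reference for the real behavior:

```java
import java.nio.charset.StandardCharsets;

class NullSafeUtf8 {

    // Encodes the text as UTF-8, or returns null if the text is null.
    static byte[] getBytesUtf8(String text) {
        return text == null ? null : text.getBytes(StandardCharsets.UTF_8);
    }

    // Decodes UTF-8 bytes into a String, or returns null if the array is null.
    static String newStringUtf8(byte[] bytes) {
        return bytes == null ? null : new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = getBytesUtf8("Wie hei\u00DFen Sie?");
        System.out.println(bytes.length);                     // 16 (15 ASCII characters plus 2 bytes for ß)
        System.out.println(newStringUtf8(getBytesUtf8(null))); // null, and no exception is thrown
    }
}
```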
Conclusion
In this tutorial, we've taken a look at how to encode a Java String to UTF-8. We've covered a few approaches: encoding and decoding manually with getBytes() and the String constructor, using the Java 7 StandardCharsets class, and using Apache Commons.