CSV stands for Comma Separated Values, a method of formatting data that predates the widespread use of personal computers. The format is often traced back to the punched-card era, when delimiter-separated values were easier to punch in than data laid out in fixed-width columns.
In the present, CSV files are typically used to transfer data between applications or systems, acting as a common format to export data from one system and import it back to another.
The CSV format typically organizes information into lines, each consisting of multiple fields separated by a delimiter, with one line corresponding to one data record. The delimiter can be a comma, semicolon, or tab character. There is no single official standard, although a specification called RFC 4180 was introduced in an attempt to standardize how a CSV file should ideally behave.
"RFC" stands for Request for Comments, meaning the document is intended as a set of common specifications or guidelines rather than binding rules.
There are many deviations from the specified format in the way CSV files are generated and read by modern applications, but most systems adhere to the initial guidelines set out by RFC 4180.
According to RFC 4180, CSV files should have the following commonalities:
- Each record should be on a separate line, with a character break at the end of the line.
- There may or may not be a header line. The presence of a header can be specified in the header parameter of the MIME type.
- The MIME type for CSV files officially registered with IANA is "text/csv".
- Each record may consist of one or more comma-separated fields, and the same number of fields should persist throughout the file (there should be an equal number of fields in all records).
- If a field contains commas, line breaks, or double quotes, the field itself should be enclosed in double quotes, and any embedded double quote should be escaped by doubling it.
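To make these rules concrete, a small RFC 4180-style file with a header row, a quoted field containing a comma, and an escaped double quote might look like this (the data itself is just an illustration):

```
name,comment
Alice,"Said ""hi"", then left"
Bob,plain value
```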
Why use CSVs for IO operations with Java
CSV files can be imported into most spreadsheet applications like Excel, Google Sheets, and OpenOffice Calc, and they are easy to generate - an existing .xlsx file can be converted to CSV format within Excel itself, as long as the file contains only characters and not macros, images, etc.
The format is compact, leading to smaller files and faster processing and generation. XML, by comparison, is repetitive: each column header is typically repeated twice per row as start and end tags (along with syntax-related characters), whereas CSV requires the column headers only once, usually in the very first row.
Given all these factors, being able to read from and write to CSV files is a key skill for any Java developer.
Reading and Writing CSVs in Core Java
Owing to the popularity and widespread use of CSV as a format for data transfer, there are many parser libraries that can be used along with Java.
Third-party parsers define common formats and are able to work with various delimiters, handle special characters, and deal with non-standard data. However, it is still important to be able to handle CSV files with core Java, without the use of any additional libraries.
The String.split() method, combined with standard IO classes, can facilitate reading data from CSVs.
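As a minimal sketch of this core-Java approach, the following writes a few records with a PrintWriter and reads them back with a BufferedReader and String.split(). The class and method names here are just for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class CsvDemo {

    // Read every line and split on commas. This is naive: it does not
    // handle quoted fields that themselves contain commas.
    public static List<String[]> readCsv(Path path) throws IOException {
        List<String[]> records = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(path)) {
            String line;
            while ((line = reader.readLine()) != null) {
                records.add(line.split(","));
            }
        }
        return records;
    }

    // Join each record's fields with commas and write one record per line.
    public static void writeCsv(Path path, List<String[]> records) throws IOException {
        try (PrintWriter writer = new PrintWriter(Files.newBufferedWriter(path))) {
            for (String[] record : records) {
                writer.println(String.join(",", record));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("people", ".csv");
        writeCsv(path, List.of(
                new String[] {"name", "age"},
                new String[] {"Alice", "30"}));
        List<String[]> rows = readCsv(path);
        System.out.println(rows.size() + " records, header: " + rows.get(0)[0]);
    }
}
```

This works well for simple, well-formed input, but as discussed later, plain split() breaks down as soon as fields contain the delimiter.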
Reading and Writing CSVs with Apache Commons CSV
The Apache Commons CSV library is the Apache Software Foundation's version of a Java CSV parser. According to the project summary, it attempts to "provide a simple interface for reading and writing CSV files of various types".
As with all libraries associated with Apache, it operates with an Apache license, meaning it can be used, distributed, and modified freely.
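A minimal sketch of reading and writing with Commons CSV might look like the following. It assumes the commons-csv dependency is on the classpath and that a file named "people.csv" (a hypothetical example) exists with "name" and "age" columns; withFirstRecordAsHeader() and withHeader() are the pre-builder API, deprecated in newer releases but still available:

```java
import java.io.FileReader;
import java.io.FileWriter;
import java.io.Reader;
import java.io.Writer;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;

public class CommonsCsvDemo {
    public static void main(String[] args) throws Exception {
        // Reading: treat the first record as the header, then access
        // fields by column name
        try (Reader in = new FileReader("people.csv")) {
            Iterable<CSVRecord> records =
                    CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(in);
            for (CSVRecord record : records) {
                System.out.println(record.get("name") + " is " + record.get("age"));
            }
        }

        // Writing: CSVPrinter handles quoting and delimiters for us
        try (Writer out = new FileWriter("out.csv");
             CSVPrinter printer = new CSVPrinter(out,
                     CSVFormat.DEFAULT.withHeader("name", "age"))) {
            printer.printRecord("Alice", 30);
            printer.printRecord("Bob, Jr.", 25); // the comma is quoted automatically
        }
    }
}
```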
Reading and Writing CSVs with OpenCSV
OpenCSV is one of the simplest and easiest CSV parsers to understand, using standard Reader and Writer classes and offering CSVReader and CSVWriter implementations on top.
Just like Apache Commons CSV, OpenCSV operates with an Apache 2.0 license. Before downloading and deciding whether to use OpenCSV's parsers, you can browse through the source code and Javadocs, and even check out their JUnit test suite, which is included in their Git repository.
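A short sketch of the OpenCSV read/write style follows. It assumes the opencsv dependency is on the classpath and that "people.csv" (a placeholder name) exists; in OpenCSV 5.x, readNext() also declares a CsvValidationException, which the broad `throws Exception` covers here:

```java
import java.io.FileReader;
import java.io.FileWriter;

import com.opencsv.CSVReader;
import com.opencsv.CSVWriter;

public class OpenCsvDemo {
    public static void main(String[] args) throws Exception {
        // Reading: each call to readNext() returns one record as a String[]
        try (CSVReader reader = new CSVReader(new FileReader("people.csv"))) {
            String[] fields;
            while ((fields = reader.readNext()) != null) {
                System.out.println(String.join(" | ", fields));
            }
        }

        // Writing: writeNext() quotes fields as needed
        try (CSVWriter writer = new CSVWriter(new FileWriter("out.csv"))) {
            writer.writeNext(new String[] {"name", "age"});
            writer.writeNext(new String[] {"Alice", "30"});
        }
    }
}
```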
Third-Party Libraries for CSV IO operations
Knowing how to read from and write to a CSV file in core Java is important, and usually sufficient for most basic operations. However, there are instances where relying on a third party library is the way to go.
For example, our own usage of split() to parse a CSV file, without the libraries above, would fail if the fields themselves contained commas. We could extend our logic to fit this scenario, but why reinvent the wheel when proven solutions are already available?
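A quick illustration of that failure mode: splitting on every comma breaks a quoted field apart.

```java
public class SplitPitfall {
    public static void main(String[] args) {
        // One record with two fields; the first field contains a comma
        String line = "\"Doe, John\",42";
        String[] fields = line.split(",");
        // Naive splitting yields three pieces instead of two:
        // "\"Doe" / " John\"" / "42"
        System.out.println(fields.length); // prints 3
    }
}
```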
This is where libraries come in - most of them support various configurations and recognize the escape characters and end-of-file characters used by different systems, product suites, and databases, so that we don't have to track, implement, and unit-test each configuration ourselves.
There are a multitude of different parsing libraries available for free, each with different strengths. Let's consider some of the more popular libraries for CSV parsing with Java, comparing their pros and cons.
Other CSV libraries
In addition to Apache Commons CSV and OpenCSV, there are a variety of other CSV parsers available for use. Let's take a quick look at some of the other libraries, and compare their usage, advantages, and disadvantages:
SuperCSV Parser
SuperCSV is another dominant CSV parsing library. The SuperCSV implementation supports formats that are not considered by other mainstream parsers.
Similar to the OpenCSV annotation methods, SuperCSV offers POJO support for dealing with Java Beans, in addition to the usual lists and maps.
Encoding and decoding are also handled by the library as long as the file is compliant with the format outlined in the SuperCSV specification. If the file is non-compliant, you can still define a custom delimiter, quote character, or newline character as required, or extend the source code to facilitate specific requirements.
Parsing is made easier by the data formatting options available with SuperCSV, which allow trimming and regex replacements while processing. The library also supports stream-based input and output, making it manageable in performance- and memory-constrained systems.
The SuperCSV library also allows partial reading and partial writing, which is not supported by the other CSV parsers we've discussed throughout this article. You can choose to set specific header column values to null and proceed with processing the remaining columns, or write a dataset that contains optional values without adding your own error handling.
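A minimal sketch of partial reading with SuperCSV follows. It assumes the super-csv dependency is on the classpath and that "people.csv" (a hypothetical file) has a header row and three columns; mapping a column name to null is what tells SuperCSV to skip that column:

```java
import java.io.FileReader;
import java.util.Map;

import org.supercsv.io.CsvMapReader;
import org.supercsv.io.ICsvMapReader;
import org.supercsv.prefs.CsvPreference;

public class SuperCsvPartialRead {
    public static void main(String[] args) throws Exception {
        try (ICsvMapReader reader = new CsvMapReader(
                new FileReader("people.csv"), CsvPreference.STANDARD_PREFERENCE)) {
            reader.getHeader(true); // consume the header row
            // Mapping the middle column to null tells SuperCSV to skip it
            String[] mapping = {"name", null, "age"};
            Map<String, String> row;
            while ((row = reader.read(mapping)) != null) {
                System.out.println(row.get("name") + " is " + row.get("age"));
            }
        }
    }
}
```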
A big downside that deserves mentioning is that the library does not appear to be actively maintained - the last release was published in 2015 - although the Git repository has more recent contributions.
UniVocity CSV Parser
UniVocity CSV Parser claims to be the fastest CSV parser, based on a 2018 comparison of 18 different publicly available CSV parsers. The parser allows you to select the fields you want to parse, skipping the unnecessary or non-mandatory fields in a file, effectively letting you filter the columns of a CSV.
It has more customization options than OpenCSV and Apache Commons CSV, which makes it harder to set up and get started with. Code may also be less readable compared to some other libraries, since the UniVocity parser requires the format, line separator, and header-extraction behavior to be declared before parsing is attempted.
On the positive side, the variety of formatting and customization options makes it suitable for dealing with the 'edge cases' involving CSV files that are not RFC 4180 compliant.
Similar to both OpenCSV and Apache Commons CSV, you can use either an iterator or a defined parser class (in this case, CsvParser or TsvParser). UniVocity also supports reading into beans, although with a more complex setup compared to OpenCSV.
Writing is a similarly complicated but configurable process with the UniVocity parser, supporting specific use cases like defining value conversions and selecting columns. Writing directly from a map or from annotated Java Beans is also supported.
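The up-front configuration style can be sketched as follows. This assumes the univocity-parsers dependency is on the classpath and that "people.csv" (a placeholder name) has a header row; note how the settings, including column selection, must be declared before the parser is created:

```java
import java.io.FileReader;
import java.util.List;

import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class UnivocityDemo {
    public static void main(String[] args) throws Exception {
        CsvParserSettings settings = new CsvParserSettings();
        settings.setHeaderExtractionEnabled(true); // first row is the header
        settings.selectFields("name", "age");      // column filtering
        settings.getFormat().setLineSeparator("\n");

        CsvParser parser = new CsvParser(settings);
        List<String[]> rows = parser.parseAll(new FileReader("people.csv"));
        for (String[] row : rows) {
            System.out.println(row[0] + " is " + row[1]);
        }
    }
}
```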
FlatPack CSV Parser
FlatPack CSV Parser is fast and well suited to handling extremely large files; it supports sorting files before parsing as well as fixed-width parsing. It can be used, for example, when your CSV does not have a specific delimiter but instead consists of fixed-width text. The parser also supports column mapping through XML specifications, where the fields in the XML and the data fields in the CSV appear in the same order.
The BuffReaderDelimiterFactory allows streaming larger files so that everything does not have to be held in memory while parsing. Columns can also be added, removed, or ignored as needed.
Since the library focuses on being friendly to larger files, it also offers the option to exclude bad data and add it to an error collection for later processing. This avoids having to re-process a massive dataset because of one or two errors, and it simplifies error handling.
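A rough sketch of FlatPack's delimited-parsing style follows. It assumes the flatpack dependency is on the classpath and that "people.csv" (a hypothetical file) has a header row; bad rows are collected into the DataSet's error list rather than aborting the parse:

```java
import java.io.FileReader;

import net.sf.flatpack.DataSet;
import net.sf.flatpack.DefaultParserFactory;
import net.sf.flatpack.Parser;

public class FlatPackDemo {
    public static void main(String[] args) throws Exception {
        // Delimiter is a comma, qualifier (quote character) is a double quote
        Parser parser = DefaultParserFactory.getInstance()
                .newDelimitedParser(new FileReader("people.csv"), ',', '"');
        DataSet dataSet = parser.parse();
        while (dataSet.next()) {
            System.out.println(dataSet.getString("name"));
        }
        // Rows that failed to parse end up in the error collection
        System.out.println("Errors: " + dataSet.getErrorCount());
    }
}
```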
The library is currently maintained, with the most recent publication being in 2019. It has specific strengths, but it can be complicated to set up and understand due to the multitude of options and customization features introduced to handle very specific scenarios that are not RFC 4180 compliant.
The most basic CSV reading and writing scenarios can be handled using core Java IO classes such as BufferedReader and FileWriter, along with customized error handling. However, external libraries provide tried-and-tested solutions for more complex operations: supporting larger files that may or may not be RFC 4180 compliant, with varying delimiters and different requirements.
The performance and flexibility of your application depend on the option you choose - some parsers are better at memory management, while others are more flexible and customizable.
You can use this article as a guide to identify which library best suits your needs, and to learn the basics of reading and writing CSV files in Java.