Reading and Writing CSVs in Java with Apache Commons CSV

Introduction

This is the second article in a short series dedicated to Libraries for Reading and Writing CSVs in Java, and a direct continuation from the previous article - Reading and Writing CSVs in Core Java.

Apache Commons CSV

The Apache Commons CSV library is the Apache Software Foundation's version of a Java CSV parser. According to the project summary, it attempts to "provide a simple interface for reading and writing CSV files of various types".

As with all libraries associated with Apache, it operates with an Apache license, meaning it can be used, distributed and modified freely.

Apache Commons allows developers to define their own formats, but offers a predefined set of formats with its CSVFormat class.

These predefined formats are:

  • RFC4180 - comma separated format defined by RFC 4180.
  • DEFAULT - Similar to RFC4180 format, but allows empty lines in between rows of data. This format is used if not otherwise specified when you define a parser with the Apache Commons CSV library.
  • EXCEL - Similar to RFC 4180, but allows missing column names, and ignores empty lines.
  • TDF - Predefined format for CSV files which are delimited with tabs (\t) instead of commas.
  • MYSQL - The format supported by MySql's SELECT INTO OUTFILE and LOAD DATA INFILE operations.
  • ORACLE - CSV format used by the Oracle SQL loader.
  • INFORMIX_UNLOAD and INFORMIX_UNLOAD_CSV - specialized formats defined for use with the IBM Informix embeddable database.
  • MONGODB_CSV and MONGODB_TSV - Works with the poplar NoSQL MongoDB database, for comma separated and tab separated values respectively.
  • POSTGRESQL_CSV and POSTGRESQL_TEXT - Format supported by PostgreSQL databases.

If you need a very simple solution, Apache Commons CSV might not be suitable. The implementation is designed to offer maximum flexibility, which makes the source code equate to around 30,000 lines, and prove quite complex to understand in some cases.

However, if you do need to cover a wide variety of formats, Apache Commons is a trusted library, well maintained and updated regularly, with extensive Java Docs and developer documentation to support beginners.

It is included in the Maven Central repository and has no external dependencies.

Reading CSVs with Apache Commons CSV

The Apache Commons library offers several methods to access individual fields in a CSV file. If you're working with apps like Excel, then your Excel formatted CSV is likely to have a header.

However, if you're using CSV as a basic comma-separated set of text, to transfer data between systems or to feed into another processing application, the file may contain data starting from the first row itself, without a header. The Apache Commons CSV library takes both these scenarios into consideration.

If your CSV file does not contain a header, or if you're unsure whether it does, you can use the index to access a record. Since CSVRecord implements the Java Iterable Interface, the index is 0-based, even though CSV indexes, when opened with Excel and most other applications, start with 1:

CSVParser csvParser = CSVFormat.DEFAULT.parse(new InputStreamReader(csvFile.getInputStream()));  
for (CSVRecord record : csvParser) {  
    String field_1 = record.get(0);
    String field_2 = record.get(1);
    ...
}

If you're sure that the CSV file you need to parse has a header, and you know the header format before processing, you can use the header column string to fetch records.

Let's consider a sample CSV file with tree data, and call it the 'tree data CSV' for future reference in this article:

Index Girth (in) Height (ft) Volume (ft)
1 8.3 70 10.3

To refer to each row of data we could use indexing as with the previous example, or the column header:

InputStreamReader input = new InputStreamReader(csvFile.getInputStream());  
CSVParser csvParser = CSVFormat.EXCEL.withFirstRecordAsHeader().parse(input);  
for (CSVRecord record : csvParser) {  
    String field_1 = record.get("Index");
    String field_2 = record.get("Girth (in)");
    String field_3 = record.get("Height (ft)");
    String field_4 = record.get("Volume (ft)");
}

If you want to read a file that does not contain a header row, want to define your own header, or find indexing confusing, Apache Commons also allows the definition of a header for parsing.

Instead of using the .withFirstRecordAsHeader() method when defining the format of the CSV file, you can define a header manually. For example, if you want to avoid referring to the units of measure in the header in our tree data file, you can redefine the header to use your own string values:

CSVParser csvParser = CSVFormat.REF4180.withHeader("Index", "Girth", "Height", "Volume");  
for (CSVRecord record : csvParser) {  
    String field_2 = record.get("Girth");
}

If your CSV file contains a header, but you want to define your own header and skip reading the header in the file, use .readNext() to skip the first row:

CSVRecord header = csvParser.readNext();  
// read the other rows in a loop as usual

You can also use an enumeration to define a header if you need to use your definition across multiple classes as in this example:

public enum treeHeader {  
    Index, Girth, Height, Volume
}
...
CSVParser csvParser = CSVFormat.DEFAULT.withHeader(treeHeader.class).parse(input);  
// read rows

The Apache Commons CSV library has some additional methods to make parsing easier, including:

  • .getRecordNumber() - returns the number assigned to the record in the CSV file.
  • .isConsistent() - can be used for error handling, returns True or False based on whether the size of the current record matches the size of the header row.
  • .size() - can be used to determine the number of values in the record.
  • .toString() - returns the record as a String. Useful when you need to store the entire row as a String for later processing, hashing or comparison.

Writing CSVs with Apache Commons CSV

The Apache Commons CSV library is mostly focused on reading data from CSV files. When it comes to writing, the methods recommended in the user guide are quite familiar to our core Java implementation.

However, there are some useful additional features, such as the ability to define a header when printing, and printing directly from a JDBC ResultSet object.

The CSVPrinter class implements the Flushable and Closeable interfaces, making its behavior similar to an extension of a usual Java Writer or StringWriter.

For example to generate our tree data CSV:

CSVPrinter csvPrinter = new CSVPrinter(stringWriter, CSVFormat.DEFAULT.withHeader("Index", "Girth", "Height", "Volume"));  
csvPrinter.printRecord("1", "8.3", "70", "10.3");  
csvPrinter.flush();  

The method .withHeader() accepts the String values of the header as parameters.

If you want to generate a CSV without a header you can simply use new CSVPrinter(stringWriter, CSVFormat.DEFAULT);.

If you want to use the column headers from an existing JDBC ResultSet for printing, Apache Commons CSV accepts a ResultSet as a parameter for its .withHeader() method:

CSVPrinter csvPrinter = CSVFormat.RFC4180.withHeader(treeDataResultSet).print(out);  
csvPrinter.printRecords(treeDataResultSet);  

If the CSVFormat you use allows empty lines, you can use csvPrinter.println() to print a blank line in between data rows.

In addition to a ResultSet, the .printRecords() method is also able to work with an Iterable Array, or a Collection of Strings passed to the method using varArgs.

Conclusion

Apache Commons CSV attempts to provide a simple interface for reading and writing CSV files of various types.

The implementation is designed to offer maximum flexibility, which makes the source code quite complex to understand in some cases.

However, if you do need to cover a wide variety of formats, Apache Commons is a trusted library, well maintained and updated regularly, with extensive Java Docs and developer documentation to support beginners.

Author image
Sri Lanka