Running SQL on CSV Data: Data Conversion and Extraction

A lot of tools output data as comma-separated values (CSV). It is a simple but common tabular data format of plaintext that can easily be processed. Each line of the file represents a single record. A record consists of the same number of fields, or columns. Usually, the delimiter between the single fields is either a comma (,), a semi-colon (;), a space, or a tabulator.

One or more CSV files can be processed using the command-line tools cut, join, head, comm, sed, and awk. This approach works very well but may take you a while to find a proper solution based on these commands.

The following example shows how to calculate the total traveling distance for the trip listing shown below. The columns in the file tracks.csv are delimited by tabs, and the awk script totalizes the values in the third column, only.

$ cat tracks.csv 
Date    Track   Distance  
1 Dec 2018  Paris-Metz  300  
3 Dec 2018  Metz-Nancy  57  
4 Dec 2018  Nancy-Vesoul    156  
5 Dec 2018  Vesoul-Mulhouse 112  
$ awk -F '\t+' '{ total += $3 } END {printf "total: %d km\n", total}' tracks.csv
total: 625 km  

In contrast, Database Management Systems (DBMS) like MySQL, MariaDB, PostgreSQL, and SQLite store data in a non-plaintext format that can seldomly read in an open way. In order to extract data the statements have to be formulated in Structured Query Language (SQL), and evaluated by the DBMS.

The statement below shows how to extract the contents of the column Track from the table named track:

SELECT Track FROM track  
Paris-Metz  
Metz-Nancy  
Nancy-Vesoul  
Vesoul-Mulhouse  
(4 lines)

Next we'll explain how to run SQL statements directly on CSV data. There are quite a few clever tools that make it possible to connect these worlds with each other.

q

The first tool is simply named q, and its name is actually derived from The Q Continuum that is mentioned in the Star Trek series. The according Debian package for q is named python-q-text-as-data. q accepts SQL statements, and allows direct execution on one or more CSV files.

Below you will see how to extract the first column named Date from the CSV file.

$ q -H -t "SELECT Date FROM tracks.csv"
1 Dec 2018  
3 Dec 2018  
4 Dec 2018  
5 Dec 2018  

The two flags used above have the following meaning:

  • -H (--skip-header) : The first line of the data file contains a description of the columns (headers).

  • -t: The columns are tabular-separated.

For the other flags we recommend you to have a look at the q usage page.

In order to retrieve datasets based on conditions, you may write the following statement which extracts the distance value from all the datasets that store the date of 3 Dec 2018:

$ q -H -t "SELECT Distance FROM tracks.csv WHERE Date = '3 Dec 2018'"
57  

q follows the SQL syntax used by SQLite. According to the project website any standard SQL expressions, conditions (both WHERE and HAVING), GROUP BY, ORDER BY, etc. are allowed. In a WHERE clause JOINs and subqueries are supported, too.

This also includes a number of functions like ABS(), LENGTH(), LOWER(), and UPPER() as well as date and time functions, aggregate functions, and JSON functions. In order to calculate the total of the single voyages use a statement as follows that demonstrates the SUM() function:

$ q -H -t "SELECT SUM(Distance) FROM tracks.csv"
625  

The next example is borrowed from the q website, and demonstrates how to extract system information. It combines ps and q on a regular workstation in order to calculate the top 5 user ids with the largest number of owned processes, sorted in descending order.

$ ps -ef | q -H "SELECT UID,COUNT(*) cnt FROM - GROUP BY UID ORDER BY cnt DESC LIMIT 5"
root 129  
frank 68  
postgres 6  
www-data 5  
avahi 2  

q also connects to the original Python sqlite library. A similar functionality is provided by the querycsv Python module, that seems to be forked as the querycsv-redux project.

fsql

fsql is part of a Perl module named fsql. Currently, it is not available as a package for Debian or Ubuntu but from the Comprehensive Perl Archive Network (CPAN). In order to use fsql you may install the Perl module with its package dependencies using cpanm as follows:

# cpanm App::fsql
--> Working on App::fsql
Fetching http://www.cpan.org/authors/id/P/PE/PERLANCAR/App-fsql-0.230.tar.gz ... OK  
Configuring App-fsql-0.230 ... OK  
==> Found dependencies: DBD::CSV, JSON::MaybeXS, Data::Format::Pretty::YAML, Data::Format::Pretty::Console, YAML::XS, Data::Format::Pretty::JSON, File::chdir, Data::Format::Pretty::Perl, Text::LTSV, Perinci::Result::Util, File::Slurper
--> Working on DBD::CSV
Fetching http://www.cpan.org/authors/id/H/HM/HMBRAND/DBD-CSV-0.54.tgz ... OK  
Configuring DBD-CSV-0.54 ... OK  
==> Found dependencies: SQL::Statement
--> Working on SQL::Statement
Fetching http://www.cpan.org/authors/id/R/RE/REHSACK/SQL-Statement-1.412.tar.gz ... OK  
Configuring SQL-Statement-1.412 ... OK

[...]

Building and testing App-fsql-0.230 ... OK  
Successfully installed App-fsql-0.230  
95 distributions installed  

As explained on the fsql project page the tool lets you perform SQL queries against one or several "flat" files of various formats such as plain text, CSV, and JSON.

The example below demonstrates how to extract all the voyages that have a distance of more than 100 km. The data is provided from stdin using cat, and piped to fsql, then:

$ cat tracks.csv | fsql 'SELECT Date,Track,Distance FROM stdin WHERE Distance > 100'
Date    Track   Distance  
"1 Dec 2018"    Paris-Metz  300
"4 Dec 2018"    Nancy-Vesoul    156
"5 Dec 2018"    Vesoul-Mulhouse 112

So far, fsql produced a line-based output that is common for databases. Using the output parameter -f plus the desired output format, it can do much more. Among others, fsql supports comma-separated values (csv), tabular-separated values (tsv), Perl (perl), Javascript Object Notation (JSON) (json), and YAML Ain't Markup Language (YAML) (yaml).

The example below outputs the query result as a JSON array:

$ cat tracks.csv | fsql 'SELECT Date,Track,Distance FROM stdin WHERE Distance > 60' -f json
[
   [
      "1 Dec 2018",
      "Paris-Metz",
      "300"
   ],
   [
      "4 Dec 2018",
      "Nancy-Vesoul",
      "156"
   ],
   [
      "5 Dec 2018",
      "Vesoul-Mulhouse",
      "112"
   ]
]

Using the option -f yaml the output is formatted in YAML style:

$ cat tracks.csv | fsql 'SELECT sum(Distance) FROM stdin' -f yaml
---
-
  - 625

Miller

Asking long-term Linux experts about Miller may result in a bit of a surprise as it is just less known than other options. Miller aims to combine the functionality of several text tools such as cat, grep, sed, awk, join, and sort.

The screenshot below uses the cut subcommand, and shows how to extract the second column that contains the routes. This call is similar to the SQL query SELECT Track from track.

miller cut

Others

The list of helpful tools can be extended for a while. D. Bohdan maintains a rather comprehensive list. Among others this contains termsql, jq, as well as SPAWK which is an SQL Powered AWK.

Funny stuff, which we will deal with in more detail in one of the next articles.

Acknowledgements

The author would like to thank Axel Beckert for his critical comments while preparing this article.

Author image
Berlin -- Genève -- Cape Town Twitter Website
IT developer, trainer, and author. Coauthor of the Debian Package Management Book (http://www.dpmb.org/).