A lot of tools output data as comma-separated values (CSV). It is a simple but common tabular data format of plain text that can easily be processed. Each line of the file represents a single record. A record consists of the same number of fields, or columns. Usually, the delimiter between the single fields is either a comma (,
), a semi-colon (;
), a space, or a tabulator.
One or more CSV files can be processed using the command-line tools cut
, join
, head
, comm
, sed, and AWK. This approach works very well but may take you a while to find a proper solution based on these commands.
The following example shows how to calculate the total traveling distance for the trip listing shown below. The columns in the file tracks.csv
are delimited by tabs, and the AWK script totalizes the values in the third column, only.
$ cat tracks.csv
Date Track Distance
1 Dec 2018 Paris-Metz 300
3 Dec 2018 Metz-Nancy 57
4 Dec 2018 Nancy-Vesoul 156
5 Dec 2018 Vesoul-Mulhouse 112
$ awk -F '\t+' '{ total += $3 } END {printf "total: %d km\n", total}' tracks.csv
total: 625 km
In contrast, Database Management Systems (DBMS) like MySQL, MariaDB, PostgreSQL, and SQLite store data in a non plain text format that can seldom be read in an open way. In order to extract data the statements have to be formulated in Structured Query Language (SQL), and evaluated by the DBMS.
The statement below shows how to extract the contents of the column Track from the table named track:
SELECT Track FROM track
Paris-Metz
Metz-Nancy
Nancy-Vesoul
Vesoul-Mulhouse
(4 lines)
Next we'll explain how to run SQL statements directly on CSV data. There are quite a few clever tools that make it possible to connect these worlds with each other.
q
The first tool is simply named q, and its name is actually derived from The Q Continuum that is mentioned in the Star Trek series. The according Debian package for q
is named python-q-text-as-data. q
accepts SQL statements, and allows direct execution on one or more CSV files.
Below you will see how to extract the first column named Date
from the CSV file.
$ q -H -t "SELECT Date FROM tracks.csv"
1 Dec 2018
3 Dec 2018
4 Dec 2018
5 Dec 2018
The two flags used above have the following meaning:
-
-H
(--skip-header
) : The first line of the data file contains a description of the columns (headers). -
-t
: The columns are tabular-separated.
For the other flags we recommend you to have a look at the q
usage page.
In order to retrieve datasets based on conditions, you may write the following statement which extracts the distance value from all the datasets that store the date of 3 Dec 2018:
$ q -H -t "SELECT Distance FROM tracks.csv WHERE Date = '3 Dec 2018'"
57
q
follows the SQL syntax used by SQLite. According to the project website any standard SQL expressions, conditions (both WHERE
and HAVING
), GROUP BY
, ORDER BY
, etc. are allowed. In a WHERE
clause, JOIN
s and sub-queries are supported, too.
This also includes a number of functions like ABS()
, LENGTH()
, LOWER()
, and UPPER()
as well as date and time functions, aggregate functions, and JSON functions. In order to calculate the total of the single voyages use a statement as follows that demonstrates the SUM()
function:
$ q -H -t "SELECT SUM(Distance) FROM tracks.csv"
625
The next example is borrowed from the q
website, and demonstrates how to extract system information. It combines ps
and q
on a regular workstation in order to calculate the top 5 user ids with the largest number of owned processes, sorted in descending order.
$ ps -ef | q -H "SELECT UID,COUNT(*) cnt FROM - GROUP BY UID ORDER BY cnt DESC LIMIT 5"
root 129
frank 68
postgres 6
www-data 5
avahi 2
Check out our hands-on, practical guide to learning Git, with best-practices, industry-accepted standards, and included cheat sheet. Stop Googling Git commands and actually learn it!
q
also connects to the original Python SQLite library. A similar functionality is provided by the querycsv Python module, that seems to be forked as the querycsv-redux project.
fsql
fsql
is part of a Perl module named fsql. Currently, it is not available as a package for Debian or Ubuntu but from the Comprehensive Perl Archive Network (CPAN). In order to use fsql
you may install the Perl module with its package dependencies using cpanm as follows:
# cpanm App::fsql
--> Working on App::fsql
Fetching http://www.cpan.org/authors/id/P/PE/PERLANCAR/App-fsql-0.230.tar.gz ... OK
Configuring App-fsql-0.230 ... OK
==> Found dependencies: DBD::CSV, JSON::MaybeXS, Data::Format::Pretty::YAML, Data::Format::Pretty::Console, YAML::XS, Data::Format::Pretty::JSON, File::chdir, Data::Format::Pretty::Perl, Text::LTSV, Perinci::Result::Util, File::Slurper
--> Working on DBD::CSV
Fetching http://www.cpan.org/authors/id/H/HM/HMBRAND/DBD-CSV-0.54.tgz ... OK
Configuring DBD-CSV-0.54 ... OK
==> Found dependencies: SQL::Statement
--> Working on SQL::Statement
Fetching http://www.cpan.org/authors/id/R/RE/REHSACK/SQL-Statement-1.412.tar.gz ... OK
Configuring SQL-Statement-1.412 ... OK
[...]
Building and testing App-fsql-0.230 ... OK
Successfully installed App-fsql-0.230
95 distributions installed
As explained on the fsql
project page the tool lets you perform SQL queries against one or several "flat" files of various formats such as plain text, CSV, and JSON.
The example below demonstrates how to extract all the voyages that have a distance of more than 100 km. The data is provided from stdin
using cat
, and piped to fsql
, then:
$ cat tracks.csv | fsql 'SELECT Date,Track,Distance FROM stdin WHERE Distance > 100'
Date Track Distance
"1 Dec 2018" Paris-Metz 300
"4 Dec 2018" Nancy-Vesoul 156
"5 Dec 2018" Vesoul-Mulhouse 112
So far, fsql
produced a line-based output that is common for databases. Using the output parameter -f
plus the desired output format, it can do much more. Among others, fsql
supports comma-separated values (csv
), tabular-separated values (tsv
), Perl (perl
), JavaScript Object Notation (JSON) (json
), and YAML Ain't Markup Language (YAML) (yaml
).
The example below outputs the query result as a JSON array:
$ cat tracks.csv | fsql 'SELECT Date,Track,Distance FROM stdin WHERE Distance > 60' -f json
[
[
"1 Dec 2018",
"Paris-Metz",
"300"
],
[
"4 Dec 2018",
"Nancy-Vesoul",
"156"
],
[
"5 Dec 2018",
"Vesoul-Mulhouse",
"112"
]
]
Using the option -f yaml
the output is formatted in YAML style:
$ cat tracks.csv | fsql 'SELECT sum(Distance) FROM stdin' -f yaml
---
-
- 625
Miller
Asking long-term Linux experts about Miller may result in a bit of a surprise as it is just less known than other options. Miller aims to combine the functionality of several text tools such as cat
, grep
, sed
, awk
, join
, and sort
.
The screen-shot below uses the cut
subcommand, and shows how to extract the second column that contains the routes. This call is similar to the SQL query SELECT Track from track
.
Others
The list of helpful tools can be extended for a while. D. Bohdan maintains a rather comprehensive list. Among others this contains termsql, jq, as well as SPAWK which is an SQL Powered AWK.
Funny stuff, which we will deal with in more detail in one of the next articles.
Acknowledgements
The author would like to thank Axel Beckert for his critical comments while preparing this article.