Linux: Find Files Containing Text

This topic is essential knowledge for every user of UNIX, Linux, Solaris, OS X, and BSD. Furthermore, the LPI certification contains tricky questions about this.

If you want to find files with a certain filename using the command line then use either the find or the locate commands. But if you want to find files that contain a certain text you'll want to use grep and its friends. Here, the term friends means a group of similar tools that are tailored to a specific data format, or file structure like plain text, compressed files, and PDF documents.

Basic Text Searching

The name grep is a combination of the initial letters of "global / regular expression / print", which goes back to the command g/re/p of the editor ed. Search patterns in the stream editor sed are formulated in a similar way. grep is designed to find matching patterns in entire data streams (files). Given patterns are interpreted as plain text or as Regular Expressions (see below for an example).
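
A side note on quoting: characters such as square brackets are special to the shell as well as to grep. If the current directory happens to contain a file whose name matches the pattern, the shell replaces an unquoted pattern before grep ever sees it. The following sketch uses a throw-away directory and made-up file names to show the difference; single quotes keep the pattern intact.

```shell
# Throw-away directory; all file names in here are made up.
workdir=$(mktemp -d)
printf 'MikroTik RouterBoard\n' > "$workdir/invoice-a.text"
# A file whose name happens to match the shell glob Mikro[tT]ik
printf 'unrelated\n' > "$workdir/Mikrotik"

# Unquoted: the shell expands Mikro[tT]ik to the file name "Mikrotik",
# so grep looks for the literal text "Mikrotik" and misses "MikroTik".
# grep -c exits with status 1 when the count is 0, hence the "|| true".
unquoted=$(cd "$workdir" && grep -c Mikro[tT]ik invoice-a.text || true)

# Quoted: grep receives the bracket expression unchanged.
quoted=$(cd "$workdir" && grep -c 'Mikro[tT]ik' invoice-a.text || true)

echo "unquoted: $unquoted matching line(s), quoted: $quoted matching line(s)"

rm -rf "$workdir"
```

As long as no file name in the current directory matches, both spellings behave the same, which is why the problem often goes unnoticed.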

Example 1 shows how to find all occurrences of the brand name "Mikrotik", written either as "Mikrotik" or "MikroTik". We use grep to search through all files whose names start with "invoice-2017". The result is a list of the matching lines, one per line of output, each preceded by the name of the file it was found in.

Example 1: Calling grep with a Regular Expression

$ grep 'Mikro[tT]ik' invoice-2017*
invoice-20170015.text:Mikrotik Routerboard 750GL Gigabit Switch  
invoice-20170045.text:MikroTik RouterBoard RB250GS Gigabit Switch  

This output is helpful but does not contain the line number. To show the line number with grep, use the option -n, or --line-number as the long version. Then, the result is as follows:

Example 2: Calling grep with a Regular Expression, and line numbers

$ grep -n 'Mikro[tT]ik' invoice-2017*
invoice-20170015.text:64:Mikrotik Routerboard 750GL Gigabit Switch  
invoice-20170045.text:65:MikroTik RouterBoard RB250GS Gigabit Switch  

On every line, the individual output fields are separated by colons. The first field contains the filename ("invoice-20170015.text"), the second field is the line number within the matched file ("64"), and the third field is the entire line with the matched text ("Mikrotik Routerboard 750GL Gigabit Switch").
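
Because the fields are separated by colons, the output is easy to process further. The following is a minimal sketch, not part of the original examples: the sample file and its content are made up, and it assumes that the file name itself contains no colon.

```shell
# Sample data in a throw-away directory; the file name is made up.
workdir=$(mktemp -d)
printf 'Item: cable\nMikroTik RouterBoard: 2 pcs\n' > "$workdir/invoice-demo.text"

# -H forces the file name prefix even for a single file,
# -n adds the line number.
result=$(grep -Hn 'Mikro[tT]ik' "$workdir/invoice-demo.text")

# Split at the first two colons only; the rest of the line, including
# any further colons, ends up in the last variable.
IFS=: read -r file lineno text <<EOF
$result
EOF

echo "file: ${file##*/}"
echo "line: $lineno"
echo "text: $text"

rm -rf "$workdir"
```

The matched line keeps its own colon here, because read assigns everything after the second separator to the last variable.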

More grep Options

grep has a long list of helpful options. See the manual page for a detailed description. The most relevant ones for this article are:

Short option   Long option             Description
-i             --ignore-case           ignore the difference between lower and upper case
-l             --files-with-matches    stop after the first match, and output the file name only
-n             --line-number           show the line number of the match
-r             --recursive             search directories recursively
               --color or --colour     highlight the actual match

Example 3 combines the options -i, -r, and -l as described above. The call returns a list of the files with matches, no matter how many matches exist in each file. This way you can see whether there are matches at all, and if so, in which files.

Example 3: Searching recursively for all files that contain the term "mikrotik" in any kind of spelling

$ grep -irl mikrotik invoice-2017*
invoice-20170015.text  
invoice-20170045.text  
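
The option -r only makes a difference when grep is given a directory to descend into. The following sketch illustrates this with a small throw-away directory tree; the directory and file names are made up for the illustration.

```shell
# Build a small throw-away tree; directory and file names are made up.
workdir=$(mktemp -d)
mkdir -p "$workdir/invoices/2017" "$workdir/invoices/2016"
printf 'MIKROTIK hEX\n' > "$workdir/invoices/2017/a.text"
printf 'Mikrotik Routerboard\nmikrotik cable\n' > "$workdir/invoices/2016/b.text"
printf 'Ubiquiti NanoStation\n' > "$workdir/invoices/2016/c.text"

# -r descends into the directory, -i ignores case, -l prints each
# matching file once, however many lines match inside it.
# sort only makes the order of the list predictable.
files=$(cd "$workdir" && grep -irl mikrotik invoices | sort)

echo "$files"

rm -rf "$workdir"
```

The file b.text contains two matching lines but is listed only once, and c.text does not appear at all.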

The command grep comes with two special variants - fgrep and egrep. fgrep interprets the search pattern as a fixed string rather than as a Regular Expression, and is exactly the same as grep -F (and grep --fixed-strings).

In contrast, egrep takes the pattern as an extended Regular Expression and behaves like grep -E (and grep --extended-regexp). In older Linux releases prior to Debian 4 Etch, both commands were implemented as shell scripts that call grep with the corresponding option. At the time of writing, current Linux releases keep the commands as binary files. In either case the search may be a bit quicker than using grep without this special option. Note that GNU grep 3.8 and later print a warning that egrep and fgrep are obsolescent, and recommend grep -E and grep -F instead.
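
The difference between the variants is easiest to see with a pattern that contains a special character. The following sketch uses the option spellings grep -F and grep -E and counts matching lines in a few made-up sample lines.

```shell
# Sample input; the three lines are made up for illustration.
sample='price: 1.5 EUR
price: 125 EUR
MikroTik or Mikrotik'

# -F: the dot is an ordinary character, so only the line with "1.5" matches.
fixed=$(printf '%s\n' "$sample" | grep -cF '1.5')

# Without -F the dot matches any single character, so "125" matches too.
basic=$(printf '%s\n' "$sample" | grep -c '1.5')

# -E: extended syntax, here with grouping and alternation.
extended=$(printf '%s\n' "$sample" | grep -cE 'Mikro(t|T)ik')

echo "fixed: $fixed, basic: $basic, extended: $extended"
```

The option -c prints the number of matching lines instead of the lines themselves, which keeps the comparison short.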

Searching Compressed Files

grep cannot search the content of compressed files directly. This is where the specialists named zgrep, bzgrep, xzgrep and zipgrep enter the stage. These tools spare you commands like this:

$ zcat archive.gz | fgrep [pattern]

zcat uncompresses the given file, and outputs its content to stdout. Piped to fgrep, the data stream is searched for the given [pattern].

With the help of the commands above, you don't have to unpack files compressed with gzip, bzip2, xz and zip before searching - this step happens behind the scenes. As with grep, the special variants zfgrep and zegrep exist for gzip, as well as bzfgrep and bzegrep for bzip2, and xzfgrep and xzegrep for xz-compressed files. Example 4 shows how to search an xz-compressed file.

Example 4: Searching an xz-compressed file

$ xzfgrep Mikrotik invoice-20130015.text.xz
Mikrotik Routerboard RB450G Level 5 680MHz  
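
To see that both approaches deliver the same result, the following sketch compresses a made-up sample file with gzip and searches it twice. It uses gzip rather than xz only because gzip is more likely to be installed; gzip -dc stands in for zcat here.

```shell
# Sample data in a throw-away directory; the file name is made up.
workdir=$(mktemp -d)
printf 'Mikrotik Routerboard RB450G\nUbiquiti NanoStation M5\n' > "$workdir/invoice-demo.text"
gzip "$workdir/invoice-demo.text"      # leaves invoice-demo.text.gz

# Long form: decompress to stdout, then search the data stream.
piped=$(gzip -dc "$workdir/invoice-demo.text.gz" | grep -F Mikrotik)

# Short form: zgrep takes care of the decompression itself.
direct=$(zgrep -F Mikrotik "$workdir/invoice-demo.text.gz")

echo "$piped"
echo "$direct"

rm -rf "$workdir"
```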

Searching compressed archives is a bit more complex, and requires a bit of shell scripting. Listing 1 demonstrates such a shell script that works only with gzip-compressed archives. For simplicity, we saved the script below with the name "search.sh". The script requires two parameters - the search pattern, and the filename of the compressed archive (see Example 5 below).

Listing 1: Searching compressed gzip archives

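#!/bin/bash

pattern=$1
archive=$2

# split the file list on newlines only, so names with spaces stay intact
IFS=$'\n'

for filename in $(tar -tzf "$archive")
do
    # fgrep exits with status 0 only if the file contains the pattern
    if match=$(tar -xOzf "$archive" "$filename" | fgrep "$pattern")
    then
        echo "$filename:"
        echo "$match" | fgrep --color "$pattern"
        echo ""
    fi
done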

Example 5: Calling the script

$ ./search.sh Mikro archive.tar.gz
invoice-20110045.text:  
Mikrotik

invoice-20110110.text:  
MikroTik

Understanding the script may take a moment. First, the script extracts the list of files from the archive, and evaluates each file one after the next. The outer for loop does all the complex work. Second, the single matches are saved in the variable $match. To do so, the current file is extracted from the compressed archive, and is then piped to fgrep. fgrep searches the data stream, and indicates a match with an exit status of zero. In that case the following echo command is executed, and the file name is sent to stdout. Third, the actual match is printed as well, and followed by an empty line. This separates the different matches file-wise.
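
The script relies on the exit status of the search command. The following sketch shows this convention in isolation, using grep -F as the equivalent of fgrep and made-up sample text: the status is 0 if at least one line matches, and 1 if nothing matches. For a pipeline, the shell reports the status of its last command.

```shell
# -q suppresses the output; only the exit status is of interest here.
found=0
printf 'MikroTik hEX\n' | grep -qF Mikro || found=$?

missing=0
printf 'Ubiquiti\n' | grep -qF Mikro || missing=$?

echo "match: $found, no match: $missing"

# The same status drives an if statement or the && operator.
if printf 'MikroTik hEX\n' | grep -qF Mikro; then
    verdict="pattern found"
else
    verdict="pattern not found"
fi
echo "$verdict"
```

A variable assignment whose value comes from a command substitution, as in Listing 1, passes this exit status on.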

An alternative is the tool deepgrep which is part of the desktop search engine Strigi (Debian package strigi-utils). It searches tar.gz files as well as zip archives, Debian packages, and even Microsoft Word files. Example 6 shows how it works. Line by line you see the file name, and the corresponding matches.

Example 6: Searching an archive using deepgrep

$ deepgrep Mikro archive.tar.gz
archive.tar.gz/invoice-20110045.text:Mikrotik  
archive.tar.gz/invoice-20110110.text:MikroTik  

Searching Other Document Types

deepgrep covers a lot of file formats but has quite a few package dependencies. As an alternative, you may have a look at pdfgrep and ssgrep. pdfgrep is specialized for PDF documents, and ssgrep is for spreadsheets.

What I like about pdfgrep is both its simplicity of use and its variety of options. Matches are highlighted without the need to specify further arguments.

The option -n, which is shown in use below, helps to identify the page on which the pattern was found. In Example 7, each line consists of three data fields that are separated by a colon - the file name, the page number of the match, and the extracted text from the match. If the output terminal supports colors, the data fields and the matches are highlighted in different ways.

Example 7: Searching PDF documents

$ pdfgrep -n 'Mikro[tT]ik' invoice*.pdf
invoice-20120033.pdf:2:MikroTik Sextant 5HnD 18dbi MIMO  
invoice-20120075.pdf:1:MikroTik RouterBOARD 250GS Giga  

As mentioned above, ssgrep helps you to search spreadsheets. ssgrep abbreviates "spreadsheet grep", and is part of the Gnumeric package. Both Open/Libre Office Calc and Gnumeric use compressed XML as their file format. Newer releases of Microsoft Excel could work as well, but I have not tested this. Figure 1 shows an example spreadsheet with sales data and four orders.

Figure 1: Gnumeric example spreadsheet

To identify the single cells that contain the term "NanoStation", ssgrep is called with the options -H and -n. -H outputs the file name as the first data field, and -n adds the location - the name of the worksheet, and the cell position. See Example 8 below for the output.

Example 8: Searching a spreadsheet using ssgrep

$ ssgrep -Hn "Nano[Ss]tation" orders.gnumeric
orders.gnumeric:orders!C3:5x NanoStation M5  
orders.gnumeric:orders!C5:4x NanoStation M2  
orders.gnumeric:orders!C6:3x NanoStation M3  

The tools presented up to now are command line tools. To search Open/Libre Office documents you may use the graphical tool named "loook" (Debian package loook). Figure 2 shows the simple graphical user interface.

Figure 2: The graphical interface of loook

Conclusion

Searching data formats is complex, and can be an endless story. Several tools help you to identify the relevant files easily. For a full list of commands for other data formats have a look at the references given below.

Further Reading

Acknowledgements

The author would like to thank Axel Beckert and Gerold Rupprecht for their support and critical comments while preparing this article.

Berlin -- Genève -- Cape Town
IT developer, trainer, and author. Coauthor of the Debian Package Management Book (http://www.dpmb.org/).