Python: List Files in a Directory

I prefer to work with Python because it is a very flexible programming language, and allows me to interact with the operating system easily. This also includes file system functions. To simply list files in a directory the modules os, subprocess, fnmatch, and pathlib come into play. The following solutions demonstrate how to use these methods effectively.

Using os.walk()

The os module contains a long list of methods that deal with the filesystem, and the operating system. One of them is walk(), which generates the filenames in a directory tree by walking the tree either top-down or bottom-up (with top-down being the default setting).

os.walk() returns a list of three items. It contains the name of the root directory, a list of the names of the subdirectories, and a list of the filenames in the current directory. Listing 1 shows how to write this with only three lines of code. This works with both Python 2 and 3 interpreters.

Listing 1: Traversing the current directory using os.walk()

import os

for root, dirs, files in os.walk("."):  
    for filename in files:
        print(filename)

Using the Command Line via Subprocess

As already described in the article Parallel Processing in Python, the subprocess module allows you to execute a system command, and collect its result. The system command we call in this case is the following one:

Example 1: Listing the files in the current directory

$ ls -p . | grep -v /$

The command ls -p . lists directory files for the current directory, and adds the delimiter / at the end of the name of each subdirectory, which we'll need in the next step. The output of this call is piped to the grep command that filters the data as we need it.

The parameters -v /$ exclude all the names of entries that end with the delimiter /. Actually, /$ is a Regular Expression that matches all the strings that contain the character / as the very last character before the end of the string, which is represented by $.

The subprocess module allows to build real pipes, and to connect the input and output streams as you do on a command line. Calling the method subprocess.Popen() opens a corresponding process, and defines the two parameters named stdin and stdout.

Listing 2 shows how to program that. The first variable ls is defined as a process executing ls -p . that outputs to a pipe. That's why the stdout channel is defined as subprocess.PIPE. The second variable grep is defined as a process, too, but executes the command grep -v /$, instead.

To read the output of the ls command from the pipe, the stdin channel of grep is defined as ls.stdout. Finally, the variable endOfPipe reads the output of grep from grep.stdout that is printed to stdout element-wise in the for-loop below. The output is seen in Example 2.

Listing 2: Defining two processes connected with a pipe

import subprocess

# define the ls command
ls = subprocess.Popen(["ls", "-p", "."],  
                      stdout=subprocess.PIPE,
                     )

# define the grep command
grep = subprocess.Popen(["grep", "-v", "/$"],  
                        stdin=ls.stdout,
                        stdout=subprocess.PIPE,
                        )

# read from the end of the pipe (stdout)
endOfPipe = grep.stdout

# output the files line by line
for line in endOfPipe:  
    print (line)

Example 2: Running the program

$ python find-files3.py
find-files2.py  
find-files3.py  
find-files4.py  
...

This solution works quite well with both Python 2 and 3, but can we improve it somehow? Let us have a look at the other variants, then.

Combining os and fnmatch

As you have seen before the solution using subprocesses is elegant but requires lots of code. Instead, let us combine the methods from the two modules os, and fnmatch. This variant works with Python 2 and 3, too.

As the first step, we import the two modules os, and fnmatch. Next, we define the directory we would like to list the files using os.listdir(), as well as the pattern for which files to filter. In a for loop we iterate over the list of entries stored in the variable listOfFiles.

Finally, with the help of fnmatch we filter for the entries we are looking for, and print the matching entries to stdout. Listing 3 contains the Python script, and Example 3 the corresponding output.

Listing 3: Listing files using os and fnmatch module

import os, fnmatch

listOfFiles = os.listdir('.')  
pattern = "*.py"  
for entry in listOfFiles:  
    if fnmatch.fnmatch(entry, pattern):
            print (entry)

Example 3: The output of Listing 3

$ python2 find-files.py
find-files.py  
find-files2.py  
find-files3.py  
...

Using os.listdir() and Generators

In simple terms, a generator is a powerful iterator that keeps its state. To learn more about generators, check out one of our previous articles, Python Generators.

The following variant combines the listdir() method of the os module with a generator function. The code works with both versions 2 and 3 of Python.

As you may have noted before, the listdir() method returns the list of entries for the given directory. The method os.path.isfile() returns True if the given entry is a file. The yield operator quits the function but keeps the current state, and returns only the name of the entry detected as a file. This allows us to loop over the generator function (see Listing 4). The output is identical to the one from Example 3.

Listing 4: Combining os.listdir() and a generator function

import os

def files(path):  
    for file in os.listdir(path):
        if os.path.isfile(os.path.join(path, file)):
            yield file

for file in files("."):  
    print (file)

Use pathlib

The pathlib module describes itself as a way to "Parse, build, test, and otherwise work on filenames and paths using an object-oriented API instead of low-level string operations". This sounds cool - let's do it. Starting with Python 3, the module belongs to the standard distribution.

In Listing 5, we first define the directory. The dot (".") defines the current directory. Next, the iterdir() method returns an iterator that yields the names of all the files. In a for loop we print the name of the files one after the other.

Listing 5: Reading directory contents with pathlib

import pathlib

# define the path
currentDirectory = pathlib.Path('.')

for currentFile in currentDirectory.iterdir():  
    print(currentFile)

Again, the output is identical to the one from Example 3.

As an alternative, we can retrieve files by matching their filenames by using something called a glob. This way we can only retrieve the files we want. For example, in the code below we only want to list the Python files in our directory, which we do by specifying "*.py" in the glob.

Listing 6: Using pathlib with the glob method

import pathlib

# define the path
currentDirectory = pathlib.Path('.')

# define the pattern
currentPattern = "*.py"

for currentFile in currentDirectory.glob(currentPattern):  
    print(currentFile)

Using os.scandir()

In Python 3.6, a new method becomes available in the os module. It is named scandir(), and significantly simplifies the call to list files in a directory.

Having imported the os module first, use the getcwd() method to detect the current working directory, and save this value in the path variable. Next, scandir() returns a list of entries for this path, which we test for being a file using the is_file() method.

Listing 7: Reading directory contents with scandir()

import os

# detect the current working directory
path = os.getcwd()

# read the entries
with os.scandir(path) as listOfEntries:  
    for entry in listOfEntries:
        # print all entries that are files
        if entry.is_file():
            print(entry.name)

Again, the output of Listing 7 is identical to the one from Example 3.

Conclusion

There is disagreement which version is the best, which is the most elegant, and which is the most "pythonic" one. I like the simplicity of the os.walk() method as well as the usage of both the fnmatch and pathlib modules.

The two versions with the processes/piping and the iterator require a deeper understanding of UNIX processes and Python knowledge, so they may not be best for all programmers due to their added (and unnecessary) complexity.

To find an answer to which version is the quickest one, the timeit module is quite handy. This module counts the time that has elapsed between two events.

To compare all of our solutions without modifying them, we use a Python functionality: call the Python interpreter with the name of the module, and the appropriate Python code to be executed. To do that for all the Python scripts at once a shell script helps (Listing 8).

Listing 8: Evaluating the execution time using the timeit module

#! /bin/bash

for filename in *.py; do  
    echo "$filename:"
    cat $filename | python3 -m timeit
    echo " "
done  

The tests were taken using Python 3.5.3. The result is as follows whereas os.walk() gives the best result. Running the tests with Python 2 returns different values but does not change the order - os.walk() is still on top of the list.

Method Result for 100,000,000 loops
os.walk 0.0085 usec per loop
subprocess/pipe 0.00859 usec per loop
os.listdir/fnmatch 0.00912 usec per loop
os.listdir/generator 0.00867 usec per loop
pathlib 0.00854 usec per loop
pathlib/glob 0.00858 usec per loop
os.scandir 0.00856 usec per loop

Acknowledgements

The author would like to thank Gerold Rupprecht for his support, and comments while preparing this article.

Author image
Berlin -- Genève -- Cape Town Twitter Github
IT developer, trainer, and author. Coauthor of the Debian Package Management Book (http://www.dpmb.org/).