I prefer to work with Python because it is a very flexible programming language that lets me interact with the operating system easily, including the file system. To list the files in a directory, the modules os, subprocess, fnmatch, and pathlib come into play. The following solutions demonstrate how to use them effectively.
The os module contains a long list of methods that deal with the filesystem and the operating system. One of them is walk(), which generates the filenames in a directory tree by walking the tree either top-down or bottom-up (with top-down being the default setting).
os.walk() is a generator that yields a tuple of three items for every directory it visits: the path of that directory, a list of the names of its subdirectories, and a list of the names of the files in it. Listing 1 shows how to write this with only a few lines of code. This works with both Python 2 and 3 interpreters.
Listing 1: Traversing the current directory using os.walk()
```python
import os

for root, dirs, files in os.walk("."):
    for filename in files:
        print(filename)
```
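Listing 1 prints bare filenames, so two files with the same name in different subdirectories cannot be told apart. The following is a minimal sketch of a variant that yields full paths and can skip hidden directories. The helper name walk_files and the rule "a directory is hidden if its name starts with a dot" are assumptions made for illustration and are not part of the original listing.

```python
import os


def walk_files(top, skip_hidden=True):
    """Yield the path of every file below `top` (hypothetical helper)."""
    for root, dirs, files in os.walk(top):
        if skip_hidden:
            # In top-down mode, editing `dirs` in place prunes the walk,
            # so os.walk() never descends into the removed directories.
            dirs[:] = [d for d in dirs if not d.startswith(".")]
        for filename in files:
            yield os.path.join(root, filename)


for path in walk_files("."):
    print(path)
```

Pruning only works in the default top-down mode, because in bottom-up mode the subdirectories have already been visited by the time their parent is reported.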
Using the Command Line via Subprocess
Note: While this is a valid way to list files in a directory, it is not recommended as it introduces the opportunity for command injection attacks.
As already described in the article Parallel Processing in Python, the
subprocess module allows you to execute a system command, and collect its result. The system command we call in this case is the following one:
Example 1: Listing the files in the current directory
$ ls -p . | grep -v /$
The command ls -p . lists the entries of the current directory and appends the delimiter / to the name of each subdirectory, which we'll need in the next step. The output of this call is piped to the grep command, which filters the data as we need it. The parameters -v /$ exclude all entries whose names end with the delimiter /. Here, /$ is a regular expression that matches every string that has the character / as the very last character before the end of the string, which is represented by $.
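The effect of this pattern can be reproduced in Python with the re module. The sketch below applies the same /$ expression to a hand-written sample of what ls -p might print; the entry names are made up for illustration.

```python
import re

# Sample lines as `ls -p` might print them (made-up names).
entries = ["docs/", "find-files.py", "notes.txt", "src/"]

# `/$` matches a slash directly before the end of the string.
ends_with_slash = re.compile(r"/$")

# Keep the entries that do NOT match, which is what `grep -v` does.
files_only = [name for name in entries if not ends_with_slash.search(name)]
print(files_only)  # ['find-files.py', 'notes.txt']
```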
The subprocess module allows you to build real pipes and to connect the input and output streams as you do on a command line. Calling subprocess.Popen() starts a corresponding process, and its two parameters stdin and stdout define where that process reads its input from and where it writes its output to.
Listing 2 shows how to program that. The first variable ls is defined as a process executing ls -p . that outputs to a pipe. That's why the stdout channel is defined as subprocess.PIPE. The second variable grep is defined as a process, too, but executes the command grep -v /$ instead.
To read the output of the ls command from the pipe, the stdin channel of grep is defined as ls.stdout. Finally, the variable endOfPipe reads the output of grep.stdout, which is printed to stdout element-wise in the for-loop below. The output is seen in Example 2.
Listing 2: Defining two processes connected with a pipe
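```python
import subprocess

# define the ls command
ls = subprocess.Popen(["ls", "-p", "."],
                      stdout=subprocess.PIPE)

# define the grep command
grep = subprocess.Popen(["grep", "-v", "/$"],
                        stdin=ls.stdout,
                        stdout=subprocess.PIPE)

# close our copy of the pipe so that ls notices if grep exits early
ls.stdout.close()

# read from the end of the pipe (stdout)
endOfPipe = grep.stdout

# output the files line by line; the pipe delivers bytes,
# so decode them and strip the trailing newline
for line in endOfPipe:
    print(line.decode("utf-8").rstrip())
```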
Example 2: Running the program
```
$ python find-files3.py
find-files2.py
find-files3.py
find-files4.py
...
```
This solution works quite well with both Python 2 and 3, but can we improve it somehow? Let us have a look at the other variants, then.
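One possible simplification, sketched here under a few assumptions, is to start only the ls process and do the filtering in Python instead of in a second grep process. The sketch uses subprocess.run(), which needs Python 3.5 or later, and assumes a Unix-like system where ls is available. The helper names files_from_ls_output and list_files_with_ls are hypothetical, and the -- argument is an addition that is not part of the original command.

```python
import subprocess


def files_from_ls_output(output):
    """Return the lines of `ls -p` output that do not end with a slash."""
    return [line for line in output.splitlines()
            if line and not line.endswith("/")]


def list_files_with_ls(path="."):
    # The arguments are passed as a list, so no shell ever parses `path`.
    # "--" marks the end of the options in case `path` starts with a dash.
    result = subprocess.run(
        ["ls", "-p", "--", path],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode the output from bytes to str
        check=True,               # raise CalledProcessError if ls fails
    )
    return files_from_ls_output(result.stdout)


if __name__ == "__main__":
    for name in list_files_with_ls("."):
        print(name)
```

Because no shell is involved and the path is a single list element, this form avoids the command injection risk mentioned in the note above, although it still depends on an external program.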
As you have seen before, the solution using subprocesses is elegant but requires lots of code. Instead, let us combine the methods from the two modules os and fnmatch. This variant works with Python 2 and 3, too.
As the first step, we import the two modules os and fnmatch. Next, we read the entries of the directory we would like to list using os.listdir(), and define the pattern that selects the files we want. In a for loop we iterate over the list of entries stored in the variable listOfFiles. Finally, with the help of fnmatch.fnmatch() we filter for the entries we are looking for, and print the matching entries to stdout. Listing 3 contains the Python script, and Example 3 the corresponding output.
Listing 3: Listing files using the os and fnmatch modules
```python
import os, fnmatch

listOfFiles = os.listdir('.')
pattern = "*.py"
for entry in listOfFiles:
    if fnmatch.fnmatch(entry, pattern):
        print(entry)
```
Example 3: The output of Listing 3
```
$ python2 find-files.py
find-files.py
find-files2.py
find-files3.py
...
```
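The explicit loop in Listing 3 can also be condensed with fnmatch.filter(), which applies a pattern to a whole list of names at once. The following is a small sketch of that idea; the helper name matching_entries and the decision to sort the result are assumptions for illustration.

```python
import fnmatch
import os


def matching_entries(path, pattern):
    """Return the sorted entries of `path` whose names match `pattern`."""
    # fnmatch.filter() keeps only the names that match the pattern.
    return sorted(fnmatch.filter(os.listdir(path), pattern))


for entry in matching_entries(".", "*.py"):
    print(entry)
```

Like Listing 3, this matches on names only, so a subdirectory whose name ends in .py would be reported as well.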
os.listdir() and Generators
In simple terms, a generator is a powerful iterator that keeps its state. To learn more about generators, check out one of our previous articles, Python Generators.
The following variant combines the listdir() method of the os module with a generator function. The code works with both versions 2 and 3 of Python.
As you may have noted before, the listdir() method returns the list of entries for the given directory. The method os.path.isfile() returns True if the given entry is a file. The yield statement suspends the function while keeping its current state, and hands back only the name of an entry detected as a file. This allows us to loop over the generator function (see Listing 4). The output is identical to the one from Example 3.
Listing 4: Combining os.listdir() and a generator function
```python
import os

def files(path):
    for file in os.listdir(path):
        if os.path.isfile(os.path.join(path, file)):
            yield file

for file in files("."):
    print(file)
```
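One practical consequence of using a generator is that the caller decides how many results to consume. The sketch below repeats the files() function from Listing 4 and takes only the first three names with itertools.islice(); the limit of three is an arbitrary choice for illustration. Note that os.listdir() itself still reads the whole directory up front, so only the os.path.isfile() checks are deferred.

```python
import itertools
import os


def files(path):
    """Yield the names of the regular files in `path`, as in Listing 4."""
    for name in os.listdir(path):
        if os.path.isfile(os.path.join(path, name)):
            yield name


# Take at most three names; the generator is not advanced any further.
first_three = list(itertools.islice(files("."), 3))
print(first_three)

# A generator is exhausted after one pass, so build a list to reuse results.
all_files = sorted(files("."))
print(len(all_files), "files in total")
```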
The pathlib module describes itself as a way to "Parse, build, test, and otherwise work on filenames and paths using an object-oriented API instead of low-level string operations". This sounds cool - let's do it. Starting with Python 3.4, the module belongs to the standard library.
In Listing 5, we first define the directory. The dot (".") stands for the current directory. Next, the iterdir() method returns an iterator that yields a path object for every entry in the directory, subdirectories included. In a for loop we print the entries one after the other.
Listing 5: Reading directory contents with pathlib
```python
import pathlib

# define the path
currentDirectory = pathlib.Path('.')

for currentFile in currentDirectory.iterdir():
    print(currentFile)
```
Again, the output matches Example 3, provided the directory contains nothing but the Python files shown there.
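Since iterdir() reports subdirectories as well, a file-only listing needs an extra check. The following is a minimal sketch that filters with the is_file() method of the path objects and returns plain names; the helper name file_names is hypothetical.

```python
import pathlib


def file_names(directory):
    """Return the sorted names of the regular files in `directory`."""
    entries = pathlib.Path(directory).iterdir()
    # Each entry is a path object; `.name` is its final path component.
    return sorted(entry.name for entry in entries if entry.is_file())


for name in file_names("."):
    print(name)
```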
As an alternative, we can select files by matching their names against a pattern called a glob. This way we retrieve only the files we want. For example, in the code below we only want to list the Python files in our directory, which we do by specifying "*.py" in the glob.
Listing 6: Using pathlib with the glob method
```python
import pathlib

# define the path
currentDirectory = pathlib.Path('.')

# define the pattern
currentPattern = "*.py"

for currentFile in currentDirectory.glob(currentPattern):
    print(currentFile)
```
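The glob in Listing 6 only looks at the directory itself. If the Python files in all subdirectories are wanted as well, pathlib offers rglob(), which behaves like glob() with "**/" added in front of the pattern. The sketch below shows this; the helper name python_files_recursive is an assumption for illustration.

```python
import pathlib


def python_files_recursive(directory):
    """Return all *.py files in and below `directory`, sorted by path."""
    # rglob("*.py") is like glob("**/*.py"): it descends into subdirectories.
    return sorted(pathlib.Path(directory).rglob("*.py"))


for path in python_files_recursive("."):
    print(path)
```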
In Python 3.5, a new function became available in the os module. It is named scandir(), and significantly simplifies the call to list files in a directory. Since Python 3.6 it can also be used in a with statement, as Listing 7 does.
Having imported the os module first, use the getcwd() method to detect the current working directory, and save this value in the path variable. Next, scandir() returns an iterator of entries for this path, which we test for being a file using the is_file() method.
Listing 7: Reading directory contents with scandir()
```python
import os

# detect the current working directory
path = os.getcwd()

# read the entries
with os.scandir(path) as listOfEntries:
    for entry in listOfEntries:
        # print all entries that are files
        if entry.is_file():
            print(entry.name)
```
Again, the output of Listing 7 is identical to the one from Example 3.
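The entries produced by scandir() carry more than the name. As a small sketch, the following collects the size of each file through the stat() method of the entry objects; the helper name file_sizes is hypothetical, and the with statement assumes Python 3.6 or later.

```python
import os


def file_sizes(path):
    """Return sorted (name, size in bytes) pairs for the files in `path`."""
    sizes = []
    with os.scandir(path) as entries:  # context manager needs Python 3.6+
        for entry in entries:
            if entry.is_file():
                # stat() returns an os.stat_result; st_size is the byte count.
                sizes.append((entry.name, entry.stat().st_size))
    return sorted(sizes)


for name, size in file_sizes("."):
    print(name, size)
```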
There is disagreement about which version is the best, which is the most elegant, and which is the most "pythonic" one. I like the simplicity of the os.walk() method as well as the usage of both the fnmatch and pathlib modules.
The two versions with the processes/piping and the generator require a deeper understanding of UNIX processes and of Python, so they may not be best for all programmers due to their added (and unnecessary) complexity.
To find out which version is the quickest one, the timeit module is quite handy. This module measures the execution time of small pieces of code by running them many times.
To compare all of our solutions without modifying them, we use a feature of the Python interpreter: calling it with the -m option and the name of a module runs that module as a script. A shell script helps to do that for all the Python scripts at once (Listing 8).
Listing 8: Evaluating the execution time using the timeit module
```shell
for filename in *.py; do
    echo "$filename:"
    cat "$filename" | python3 -m timeit
    echo " "
done
```
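The tests were run with Python 3.5.3. The results are shown below, with os.walk() giving the best result. Running the tests with Python 2 returns different values but does not change the order: os.walk() is still on top of the list. Keep in mind that python3 -m timeit, when called without a statement argument, times the default statement pass and does not read code from standard input, so these nearly identical figures mostly reflect loop overhead and say little about the individual approaches.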
| Method | Result for 100,000,000 loops |
|---|---|
| os.walk | 0.0085 usec per loop |
| subprocess/pipe | 0.00859 usec per loop |
| os.listdir/fnmatch | 0.00912 usec per loop |
| os.listdir/generator | 0.00867 usec per loop |
| pathlib | 0.00854 usec per loop |
| pathlib/glob | 0.00858 usec per loop |
| os.scandir | 0.00856 usec per loop |
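For a comparison that exercises the directory calls themselves, the approaches can be wrapped in functions and passed to timeit.timeit() inside one script. The following is only a sketch under stated assumptions: the three function names, the choice of candidates, and the count of 200 calls are made up for illustration, and the numbers it prints depend on the machine and on the directory being listed, so no results are claimed here.

```python
import os
import timeit


def with_listdir(path="."):
    return [n for n in os.listdir(path)
            if os.path.isfile(os.path.join(path, n))]


def with_scandir(path="."):
    with os.scandir(path) as entries:  # needs Python 3.6+
        return [e.name for e in entries if e.is_file()]


def with_walk(path="."):
    # Only the first tuple is needed for a non-recursive listing.
    for root, dirs, files in os.walk(path):
        return files
    return []


candidates = {
    "os.listdir": with_listdir,
    "os.scandir": with_scandir,
    "os.walk": with_walk,
}

timings = {}
for label, func in candidates.items():
    # timeit.timeit() accepts a callable and returns the total time
    # in seconds for `number` calls.
    timings[label] = timeit.timeit(func, number=200)
    print("{}: {:.6f} s for 200 calls".format(label, timings[label]))
```

Timing functions rather than whole scripts leaves out interpreter start-up and printing, which would otherwise dominate such short runs.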
The author would like to thank Gerold Rupprecht for his support, and comments while preparing this article.