Read a File Line-by-Line in Python

Introduction

Over the course of my working life I have had the opportunity to use many programming concepts and technologies to do countless things. Some of these have been relatively low-value fruits of my labor, such as automating the error-prone or mundane: report generation, task automation, and general data reformatting. Others have been much more valuable, such as developing data products, web applications, and data analysis and processing pipelines. One thing notable about nearly all of these projects is the need to simply open a file, parse its contents, and do something with them.

However, what do you do when the file you are trying to consume is quite large, say several GB of data or more? This, too, has been a frequent aspect of my programming career, which has primarily been spent in the BioTech sector, where files up to a TB in size are quite common.

The answer to this problem is to read a file in chunks, process each chunk, then free it from memory so you can pull in and process the next chunk, until the whole massive file has been processed. While it is up to the programmer to determine a suitable chunk size, perhaps the most commonly used is simply a single line of the file at a time.

This is what we will be discussing in this article - memory management by reading in a file line-by-line in Python. The code used in this article can be found in the following GitHub repo.

Basic File IO in Python

Being a great general-purpose programming language, Python provides a number of very useful file IO features in its standard library of built-in functions and modules. The built-in open() function is what you use to open a file object for either reading or writing purposes.

fp = open('path/to/file.txt', 'r')  

The open() function takes multiple arguments. We will focus on the first two: the first is a positional string parameter representing the path to the file to be opened; the second is an optional string that specifies the mode of interaction you intend for the file object returned by the call. The most common modes are listed in the table below, with the default being 'r' for reading.

Mode   Description
`r`    Open for reading plain text
`w`    Open for writing plain text (truncates an existing file)
`a`    Open for appending plain text (creates the file if it does not exist)
`rb`   Open for reading binary data
`wb`   Open for writing binary data
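As a quick sketch of two of these modes, the snippet below appends a line of text to a file and then reads the raw bytes back; the file name here is purely illustrative:

```python
# 'a' appends plain text, creating the file if it does not already exist
with open('example.log', 'a') as fp:
    fp.write('one more line\n')

# 'rb' reads the same file back as raw bytes (a bytes object, not str)
with open('example.log', 'rb') as fp:
    data = fp.read()

print(data[-14:])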

Once you have written or read all of the desired data, you need to close the file so that its resources can be released back to the operatingating system the code is running on.

fp.close()  

You will often see code snippets on the internet, or in programs in the wild, that do not explicitly close file objects opened in accord with the example above. It is always good practice to close a file object resource, but many of us forget to, or assume it is unnecessary because documentation suggests that an open file object will close itself once a process terminates. This is not always the case.

Instead of harping on how important it is to always call close() on a file object, I would like to provide an alternate and more elegant way to open a file object and ensure that the Python interpreter cleans up after us :)

with open('path/to/file.txt') as fp:  
    # do stuff with fp

By simply using the with keyword (introduced in Python 2.5) to wrap our code for opening a file object, the internals of Python will do something similar to the following code to ensure that, no matter what, the file object is closed after use.

try:  
    fp = open('path/to/file.txt')
    # do stuff here
finally:  
    fp.close()
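A small sketch makes this guarantee concrete: even if an exception is raised inside the with block, the file object comes out closed. The file name and the simulated error below are illustrative only:

```python
# Create a small file to work with (name is illustrative)
with open('demo.txt', 'w') as fp:
    fp.write('hello\n')

try:
    with open('demo.txt') as fp:
        # Simulate something going wrong mid-read
        raise ValueError('simulated failure')
except ValueError:
    pass

print(fp.closed)  # the context manager closed fp despite the exception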

Reading Line by Line

Now, let's get to actually reading in a file. The file object returned from open() has three common explicit methods (read, readline, and readlines) to read in data, and one more implicit way.

The read method will read all the data into one text string. This is useful for smaller files where you would like to do text manipulation on the entire file at once, or whatever else suits you. Then there is readline, which reads a single line at a time and returns it as a string, making it a useful way to consume a file in line-sized increments. The last explicit method, readlines, will read all the lines of a file and return them as a list of strings.

As mentioned earlier, you can use these methods to load only a small chunk of the file at a time. To do this, pass each method an optional size argument limiting how much data is read per call: for read it caps the amount of text returned, for readline it caps the length of the line returned, and for readlines it serves as a hint for the total amount read. This is the only argument these methods accept.
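As a sketch of that chunked approach with read, the loop below pulls in a fixed number of characters per call until read() returns an empty string, which signals end of file. The file name and chunk size are arbitrary choices for illustration:

```python
chunk_size = 1024  # characters per chunk; tune to your memory budget

# Create a small sample file so the example is self-contained
with open('sample.txt', 'w') as fp:
    fp.write('x' * 3000)

chunks = []
with open('sample.txt') as fp:
    while True:
        chunk = fp.read(chunk_size)  # '' signals end of file
        if not chunk:
            break
        chunks.append(chunk)

print(len(chunks))  # 3000 characters in 1024-character chunks -> 3 chunks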

One implementation for reading a text file one line at a time might look like the following code. Note that for the remainder of this article I will be demonstrating how to read the text of the book "The Iliad of Homer", which can be found at gutenberg.org, as well as in the GitHub repo containing the code for this article.

In readline.py you will find the following code. If you run $ python readline.py in the terminal, you can see the output of reading all the lines of the Iliad, along with their line numbers.

filepath = 'Iliad.txt'  
with open(filepath) as fp:  
   line = fp.readline()
   cnt = 1
   while line:
       print("Line {}: {}".format(cnt, line.strip()))
       line = fp.readline()
       cnt += 1

The above code snippet opens a file object stored as a variable called fp, then reads in a line at a time by calling readline on that file object iteratively in a while loop and prints it to the console.

Running this code you should see something like the following:

$ python readline.py
Line 1: BOOK I  
Line 2:  
Line 3: The quarrel between Agamemnon and Achilles--Achilles withdraws  
Line 4: from the war, and sends his mother Thetis to ask Jove to help  
Line 5: the Trojans--Scene between Jove and Juno on Olympus.  
Line 6:  
Line 7: Sing, O goddess, the anger of Achilles son of Peleus, that brought  
Line 8: countless ills upon the Achaeans. Many a brave soul did it send  
Line 9: hurrying down to Hades, and many a hero did it yield a prey to dogs and  
Line 10: vultures, for so were the counsels of Jove fulfilled from the day on  
...

While this is perfectly fine, there is one final way, mentioned fleetingly earlier, that is less explicit but a bit more elegant, and which I greatly prefer. This final way of reading a file line-by-line iterates over the file object in a for loop, assigning each line to a variable called line. The above code snippet can be replicated by the following code, which can be found in the Python script forlinein.py:

filepath = 'Iliad.txt'  
with open(filepath) as fp:  
    for cnt, line in enumerate(fp, 1):
        print("Line {}: {}".format(cnt, line.strip()))

In this implementation we take advantage of built-in Python functionality that allows us to iterate over the file object implicitly, using a for loop in combination with the iterable object fp. Not only is this simpler to read, but it also takes fewer lines of code to write, which is always a best practice worth following.

An Example Application

I would be remiss to write an article on how to consume information in a text file without demonstrating at least a trivial use of such a worthy skill. That being said, I will demonstrate a small application, found in wordcount.py, which calculates the frequency of each word present in "The Iliad of Homer" used in the previous examples.

import sys  
import os

def main():  
   filepath = sys.argv[1]

   if not os.path.isfile(filepath):
       print("File path {} does not exist. Exiting...".format(filepath))
       sys.exit()

   bag_of_words = {}
   with open(filepath) as fp:
       cnt = 0
       for line in fp:
           print("line {} contents {}".format(cnt, line))
           record_word_cnt(line.strip().split(' '), bag_of_words)
           cnt += 1
   sorted_words = order_bag_of_words(bag_of_words, desc=True)
   print("Most frequent 10 words {}".format(sorted_words[:10]))

def order_bag_of_words(bag_of_words, desc=False):  
   words = [(word, cnt) for word, cnt in bag_of_words.items()]
   return sorted(words, key=lambda x: x[1], reverse=desc)

def record_word_cnt(words, bag_of_words):  
    for word in words:
        if word != '':
            if word.lower() in bag_of_words:
                bag_of_words[word.lower()] += 1
            else:
                # first occurrence counts as 1, not 0
                bag_of_words[word.lower()] = 1

if __name__ == '__main__':  
   main()

The above code is a command-line Python script that expects a file path passed in as an argument. The script uses the os module to make sure that the passed-in file path points to a file that exists on disk. If the path exists, each line of the file is read and passed to a function called record_word_cnt as a list of strings, delimited by the spaces between words, along with a dictionary called bag_of_words. The record_word_cnt function counts each instance of every word and records it in the bag_of_words dictionary.

Once all the lines of the file are read and recorded in the bag_of_words dictionary, a final call is made to order_bag_of_words, which returns a list of tuples in (word, word count) format, sorted by word count. The returned list of tuples is used to print the ten most frequently occurring words.
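As an aside, the standard library's collections.Counter can express the same counting logic more compactly. The sketch below is an alternative to the hand-rolled dictionary, not the actual wordcount.py; the hard-coded lines stand in for iterating over an open file object:

```python
from collections import Counter

counts = Counter()
# Stand-in lines; in practice you would iterate over an open file object
lines = ["Sing, O goddess, the anger of Achilles",
         "the anger the anger"]
for line in lines:
    # update() adds one count per word, skipping empty strings
    counts.update(w.lower() for w in line.strip().split(' ') if w)

print(counts.most_common(3))
```

most_common(n) returns the n highest-count (word, count) pairs, which replaces the manual sort in order_bag_of_words.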

Conclusion

So, in this article we explored two ways to read a text file line-by-line, including one that I feel is a bit more Pythonic (the second way, demonstrated in forlinein.py). To wrap things up, I presented a trivial application that is potentially useful for reading in and preprocessing data for text analytics or sentiment analysis.

As always I look forward to your comments and I hope you can use what has been discussed to develop exciting and useful applications.
