Read a File Line-by-Line in Python - Stack Abuse

Read a File Line-by-Line in Python

Introduction

A common task in programming is opening a file and parsing its contents. What do you do when the file you are trying to process is quite large, like several GB of data or larger? The answer to this problem is to read in chunks of a file at a time, process it, then free it from memory so you can process another chunk, until the whole massive file has been processed. While it's up to you to determine a suitable size for the chunks of data you're processing, for many applications it's suitable to process a file one line at a time.

Throughout this article we'll be covering a number of code examples that demonstrate how to read files line by line. In case you want to try out some of these examples by yourself, the code used in this article can be found at the following GitHub repo.

Basic File IO in Python

Python is a great general purpose programming language, and it has a number of very useful file IO functionality in its standard library of built-in functions and modules.

The built-in open() function is what you use to open a file object for either reading or writing purposes. Here's how you can use it to open a file:

fp = open('path/to/file.txt', 'r')

As demonstrated above, the open() function takes in multiple arguments. We will be focusing on two arguments, with the first being a positional string parameter representing the path to the file you want to open. The second (optional) parameter is also a string, and it specifies the mode of interaction you intend to be used on the file object being returned by the function call. The most common modes are listed in the table below, with the default being 'r' for reading:

Mode Description
r Open for reading plain text
w Open for writing plain text
a Open an existing file for appending plain text
rb Open for reading binary data
wb Open for writing binary data

Once you have written or read all of the desired data in a file object, you need to close the file so that resources can be reallocated on the operating system that the code is running on.

fp.close()

Note: It's always good practice to close a file object resource, but it's a task that's easy to forget.

While you can always remember to call close() on a file object, there's an alternate and more elegant way to open a file object and ensure that the Python interpreter cleans up after its use:

with open('path/to/file.txt') as fp:
    # Do stuff with fp

By simply using the with keyword (introduced in Python 2.5) to the code we use to open a file object, Python will do something similar to the following code. This ensures that no matter what the file object is closed after use:

try:
    fp = open('path/to/file.txt')
    # Do stuff with fp
finally:
    fp.close()

Either of these two methods are suitable, with the first example being more Pythonic.

The file object returned from the open() function has three common explicit methods (read(), readline(), and readlines()) to read in data. The read() method reads in all the data into a single string. This is useful for smaller files where you would like to do text manipulation on the entire file. Then there is readline(), which is a useful way to only read in individual lines, in incremental amounts at a time, and return them as strings. The last explicit method, readlines(), will read all the lines of a file and return them as a list of strings.

Note: For the remainder of this article we will be working with the text of the book The "Iliad of Homer", which can be found at gutenberg.org, as well as in the GitHub repo where the code is for this article.

Reading a File Line-by-Line in Python with readline()

Let's start off with the the readline() method, which reads a single line, which will require us to use a counter and increment it:

filepath = 'Iliad.txt'
with open(filepath) as fp:
   line = fp.readline()
   cnt = 1
   while line:
       print("Line {}: {}".format(cnt, line.strip()))
       line = fp.readline()
       cnt += 1

This code snippet opens a file object whose reference is stored in fp, then reads in a line one at a time by calling readline() on that file object iteratively in a while loop. It then simply prints the line to the console.

Running this code, you should see something like the following:

...
Line 567: exceedingly trifling. We have no remaining inscription earlier than the
Line 568: fortieth Olympiad, and the early inscriptions are rude and unskilfully
Line 569: executed; nor can we even assure ourselves whether Archilochus, Simonides
Line 570: of Amorgus, Kallinus, Tyrtaeus, Xanthus, and the other early elegiac and
Line 571: lyric poets, committed their compositions to writing, or at what time the
Line 572: practice of doing so became familiar. The first positive ground which
Line 573: authorizes us to presume the existence of a manuscript of Homer, is in the
Line 574: famous ordinance of Solon, with regard to the rhapsodies at the
Line 575: Panathenaea: but for what length of time previously manuscripts had
Line 576: existed, we are unable to say.
...

Though, this approach is crude and explicit. Most certainly not very Pythonic. We can utilize the readlines() method to make this code much more succinct.

Read a File Line-by-Line with readlines()

The readlines() method reads all the lines and stores them into a List. We can then iterate over that list and using enumerate(), make an index for each line for our convenience:

file = open('Iliad.txt', 'r')
lines = file.readlines()

for index, line in enumerate(lines):
    print("Line {}: {}".format(index, line.strip()))
    
file.close()

This results in:

Free eBook: Git Essentials

Check out our hands-on, practical guide to learning Git, with best-practices, industry-accepted standards, and included cheat sheet. Stop Googling Git commands and actually learn it!

...
Line 160: INTRODUCTION.
Line 161:
Line 162:
Line 163: Scepticism is as much the result of knowledge, as knowledge is of
Line 164: scepticism. To be content with what we at present know, is, for the most
Line 165: part, to shut our ears against conviction; since, from the very gradual
Line 166: character of our education, we must continually forget, and emancipate
Line 167: ourselves from, knowledge previously acquired; we must set aside old
Line 168: notions and embrace fresh ones; and, as we learn, we must be daily
Line 169: unlearning something which it has cost us no small labour and anxiety to
Line 170: acquire.
...

Now, although much better, we don't even need to call the readlines() method to achieve this same functionality. This is the traditional way of reading a file line-by-line, but there's a more modern, shorter one.

Read a File Line-by-Line with a for Loop - Most Pythonic Approach

The returned File itself is an iterable. We don't need to extract the lines via readlines() at all - we can iterate the returned object itself. This also makes it easy to enumerate() it so we can write the line number in each print() statement.

This is the shortest, most Pythonic approach to solving the problem, and the approach favored by most:

with open('Iliad.txt') as f:
    for index, line in enumerate(f):
        print("Line {}: {}".format(index, line.strip()))

This results in:

...
Line 277: Mentes, from Leucadia, the modern Santa Maura, who evinced a knowledge and
Line 278: intelligence rarely found in those times, persuaded Melesigenes to close
Line 279: his school, and accompany him on his travels. He promised not only to pay
Line 280: his expenses, but to furnish him with a further stipend, urging, that,
Line 281: "While he was yet young, it was fitting that he should see with his own
Line 282: eyes the countries and cities which might hereafter be the subjects of his
Line 283: discourses." Melesigenes consented, and set out with his patron,
Line 284: "examining all the curiosities of the countries they visited, and
...

Here, we're taking advantage of the built-in functionalities of Python that allow us to effortlessly iterate over an iterable object, simply using a for loop. If you'd like to read more about Python's built-in functionalities on iterating objects, we've got you covered:

Applications of Reading Files Line-by-Line

How can you use this practically? Most NLP applications deal with large corpora of data. Most of the time, it won't be wise to read the entire corpora into memory. While rudimentary, you can write a from-scratch solution to count the frequency of certain words, without using any external libraries. Let's write a simple script that loads in a file, reads it line-by-line and counts the frequency of words, printing the 10 most frequent words and the number of their occurrences:

import sys
import os

def main():
   filepath = sys.argv[1]
   if not os.path.isfile(filepath):
       print("File path {} does not exist. Exiting...".format(filepath))
       sys.exit()
  
   bag_of_words = {}
   with open(filepath) as fp:
       for line in fp:
           record_word_cnt(line.strip().split(' '), bag_of_words)
   sorted_words = order_bag_of_words(bag_of_words, desc=True)
   print("Most frequent 10 words {}".format(sorted_words[:10]))
  
def order_bag_of_words(bag_of_words, desc=False):
   words = [(word, cnt) for word, cnt in bag_of_words.items()]
   return sorted(words, key=lambda x: x[1], reverse=desc)

def record_word_cnt(words, bag_of_words):
    for word in words:
        if word != '':
            if word.lower() in bag_of_words:
                bag_of_words[word.lower()] += 1
            else:
                bag_of_words[word.lower()] = 1

if __name__ == '__main__':
    main()

The script uses the os module to make sure that the file we're attempting to read actually exists. If so, its read line-by-line and each line is passed on into the record_word_cnt() function. It delimits the spaces between words and adds the word to the dictionary - bag_of_words. Once all the lines are recorded into the dictionary, we order it via order_bag_of_words() which returns a list of tuples in the (word, word_count) format, sorted by the word count.

Finally, we print the top ten most common words.

Typically, for this, you'd create a Bag of Words Model, using libraries like NLTK, though, this implementation will suffice. Let's run the script and provide our Iliad.txt to it:

$ python app.py Iliad.txt

This results in:

Most frequent 10 words [('the', 15633), ('and', 6959), ('of', 5237), ('to', 4449), ('his', 3440), ('in', 3158), ('with', 2445), ('a', 2297), ('he', 1635), ('from', 1418)]

If you'd like to read more about NLP, we've got a series of guides on various tasks: Natural Language Processing in Python.

Conclusion

In this article, we've explored multiple ways to read a file line-by-line in Python, as well as created a rudimentary Bag of Words model to calculate the frequency of words in a given file.

Last Updated: June 15th, 2021
Was this article helpful?

Improve your dev skills!

Get tutorials, guides, and dev jobs in your inbox.

No spam ever. Unsubscribe at any time. Read our Privacy Policy.

Adam McQuistanAuthor

I am both passionate and inquisitive about all things software. My background is mostly in Python, Java, and JavaScript in the areas of science but, have also worked on large ecommerce and ERP apps.

Want a remote job?

    Prepping for an interview?

    • Improve your skills by solving one coding problem every day
    • Get the solutions the next morning via email
    • Practice on actual problems asked by top companies, like:
     
     
     

    Better understand your data with visualizations

    •  30-day no-questions refunds
    •  Beginner to Advanced
    •  Updated regularly (update June 2021)
    •  New bonus resources and guides

    © 2013-2021 Stack Abuse. All rights reserved.