Reading and Writing CSV Files in Python with Pandas

While you can read and write CSV files in Python using the built-in open() function or the dedicated csv module, you can also use Pandas.

In this article, you will see how to use Python's Pandas library to read and write CSV files.

What is a CSV File?

Let's quickly recap what a CSV file is - nothing more than a simple text file that follows a few formatting conventions. Despite that simplicity, it's one of the most common and easiest formats for storing tabular data. The format arranges data into rows and columns, and it's these rows and columns that contain your data.

A new line terminates each row to start the next row. Similarly, a delimiter, usually a comma, separates columns within each row.

For example, we might have a table that looks like this:

| City         | State        | Capital | Population    |
| ------------ | ------------ | ------- | ------------- |
| Philadelphia | Pennsylvania | No      | 1.581 Million |
| Sacramento   | California   | Yes     | 0.5 Million   |
| New York     | New York     | No      | 8.623 Million |
| Austin       | Texas        | Yes     | 0.95 Million  |
| Miami        | Florida      | No      | 0.463 Million |

If we were to convert it into the CSV format, it'd look like this:

City,State,Capital,Population
Philadelphia,Pennsylvania,No,1.581 Million
Sacramento,California,Yes,0.5 Million
New York,New York,No,8.623 Million
Austin,Texas,Yes,0.95 Million
Miami,Florida,No,0.463 Million

Although the name (Comma-Separated Values) implies a comma as the delimiter, you can use other delimiters (separators) as well, such as the semicolon (;). Each row of the table is a new line of the CSV file, making it a very compact and concise way to represent tabular data.

Now, let's take a look at the read_csv() function.

Reading and Writing CSV Files using Pandas

Pandas is a very powerful and popular library for data analysis and manipulation. One of its most striking features is its ability to read and write various types of files, including CSV and Excel. You can effectively and easily manipulate CSV files in Pandas using functions like read_csv() and to_csv().

Installing Pandas

We have to install Pandas before using it. Let's use pip:

$ pip install pandas

Reading CSV Files with read_csv()

Let's import the Titanic Dataset, which can be obtained on GitHub:

import pandas as pd
titanic_data = pd.read_csv('titanic.csv')

We just supply the filepath to the file we'd like to parse as the one and only required argument of this method. Note that Pandas resolves a relative path against the current working directory, not necessarily the directory of the script.
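A minimal, self-contained sketch of this path handling - it writes a throwaway CSV (the file name is chosen just for illustration) to the system's temporary directory and reads it back via an absolute path, sidestepping any working-directory ambiguity:

```python
import os
import tempfile
import pandas as pd

# Write a tiny CSV to a temporary directory for demonstration.
path = os.path.join(tempfile.gettempdir(), 'sample.csv')
with open(path, 'w') as f:
    f.write('City,State\nMiami,Florida\n')

# read_csv() resolves relative paths against os.getcwd(), not the
# script's own location - passing an absolute path avoids surprises.
df = pd.read_csv(path)
print(df.shape)  # (1, 2)

os.remove(path)
```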

Let's take a look at the head() of this dataset to make sure it's imported correctly:

titanic_data.head()

This results in:

   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

Alternatively, you can also read CSV files from online resources, such as GitHub, simply by passing in the URL of the resource to the read_csv() function. Let's read this same CSV file from the GitHub repository, without downloading it to our local machine first:

import pandas as pd

titanic_data = pd.read_csv(r'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
print(titanic_data.head())

This also results in:

   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]

Customizing Headers

By default, the read_csv() method uses the first row of the CSV file as the column headers. Sometimes, these headers might have odd names, and you might want to use your own headers. You can set headers either after reading the file, simply by assigning the columns field of the DataFrame instance another list, or you can set the headers while reading the CSV in the first place.
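Setting the headers after reading, as mentioned, is just an assignment to the columns field. A small sketch with an in-memory DataFrame standing in for the parsed CSV:

```python
import pandas as pd

df = pd.DataFrame([[1, 0], [2, 1]], columns=['PassengerId', 'Survived'])

# Overwrite the headers after the fact - the new list must contain
# exactly one name per existing column.
df.columns = ['Id', 'Made It']
print(list(df.columns))  # ['Id', 'Made It']
```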

Let's define a list of column names, and use those names instead of the ones from the CSV file:

import pandas as pd

col_names = ['Id',
             'Survived',
             'Passenger Class',
             'Full Name',
             'Gender',
             'Age',
             'SibSp',
             'Parch',
             'Ticket Number',
             'Price', 'Cabin',
             'Station']

titanic_data = pd.read_csv(r'E:\Datasets\titanic.csv', names=col_names)
print(titanic_data.head())

Let's run this code:

            Id  Survived Passenger Class  ...    Price  Cabin   Station
0  PassengerId  Survived          Pclass  ...     Fare  Cabin  Embarked
1            1         0               3  ...     7.25    NaN         S
2            2         1               1  ...  71.2833    C85         C
3            3         1               3  ...    7.925    NaN         S
4            4         1               1  ...     53.1   C123         S

Hmm, now we've got our custom headers, but the first row of the CSV file, which originally held the column names, is also included in the DataFrame as a row of data. We'll want to skip this line, since it no longer holds any value for us.

Skipping Rows While Reading CSV

Let's address this issue by using the skiprows argument:

import pandas as pd

col_names = ['Id',
             'Survived',
             'Passenger Class',
             'Full Name',
             'Gender',
             'Age',
             'SibSp',
             'Parch',
             'Ticket Number',
             'Price', 'Cabin',
             'Station']

titanic_data = pd.read_csv(r'E:\Datasets\titanic.csv', names=col_names, skiprows=[0])
print(titanic_data.head())

Now, let's run this code:

   Id  Survived  Passenger Class  ...    Price Cabin  Station
0   1         0                3  ...   7.2500   NaN        S
1   2         1                1  ...  71.2833   C85        C
2   3         1                3  ...   7.9250   NaN        S
3   4         1                1  ...  53.1000  C123        S
4   5         0                3  ...   8.0500   NaN        S
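An equivalent, arguably more direct spelling combines names with header=0 - header=0 tells Pandas that the first row is a header to be discarded in favor of your own names. A sketch using an in-memory CSV (via io.StringIO, standing in for a file):

```python
import io
import pandas as pd

data = 'Id,Lived\n1,0\n2,1\n'

# header=0 consumes the original header row; names replaces it.
df = pd.read_csv(io.StringIO(data), names=['PassengerId', 'Survived'], header=0)
print(df.columns.tolist())  # ['PassengerId', 'Survived']
```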

Works like a charm! The skiprows argument accepts a list of rows you'd like to skip. You can skip, for example, rows 0, 4, and 7 as well:

titanic_data = pd.read_csv(r'E:\Datasets\titanic.csv', names=col_names, skiprows=[0, 4, 7])
print(titanic_data.head(10))

This would result in a DataFrame that doesn't have some of the rows we've seen before:

   Id  Survived  Passenger Class  ...    Price Cabin  Station
0   1         0                3  ...   7.2500   NaN        S
1   2         1                1  ...  71.2833   C85        C
2   3         1                3  ...   7.9250   NaN        S
3   5         0                3  ...   8.0500   NaN        S
4   6         0                3  ...   8.4583   NaN        Q
5   8         0                3  ...  21.0750   NaN        S
6   9         1                3  ...  11.1333   NaN        S
7  10         1                2  ...  30.0708   NaN        C
8  11         1                3  ...  16.7000    G6        S
9  12         1                1  ...  26.5500  C103        S

Keep in mind that rows are skipped before the DataFrame is formed, so the DataFrame's own index stays contiguous - though, in this case, you can see that the Id field (imported from the CSV file) is missing IDs 4 and 7.
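Besides a list, skiprows also accepts a single integer (skip that many lines from the top of the file) or a callable that's invoked with each row index. A sketch of both forms against an in-memory CSV:

```python
import io
import pandas as pd

csv_data = 'Id,Name\n1,A\n2,B\n3,C\n4,D\n'

# An integer skips that many lines from the start of the file -
# here, just the header line, since we supply our own names.
df = pd.read_csv(io.StringIO(csv_data), names=['Id', 'Name'], skiprows=1)

# A callable is called with each row index and skips the rows for
# which it returns True - here, every odd-numbered line.
df2 = pd.read_csv(io.StringIO(csv_data), skiprows=lambda i: i % 2 == 1)
print(len(df), len(df2))  # 4 2
```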

Removing Headers

You can also decide to remove the header completely, which results in a DataFrame with numeric column labels (0...n), by setting the header argument to None:

titanic_data = pd.read_csv(r'E:\Datasets\titanic.csv', header=None, skiprows=[0])

You'll also want to skip the first row here - if you don't, the values from the original header row will be included as the first row of data:

   0   1   2                                                  3       4   ...  7                 8        9 
0   1   0   3                            Braund, Mr. Owen Harris    male  ...   0         A/5 21171   7.2500
1   2   1   1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  ...   0          PC 17599  71.2833
2   3   1   3                             Heikkinen, Miss. Laina  female  ...   0  STON/O2. 3101282   7.9250
3   4   1   1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  ...   0            113803  53.1000
4   5   0   3                           Allen, Mr. William Henry    male  ...   0            373450   8.0500

Specifying Delimiters

As stated earlier, you'll eventually probably encounter a CSV file that doesn't actually use commas to separate data. In such cases, you can use the sep argument to specify other delimiters:

titanic_data = pd.read_csv(r'E:\Datasets\titanic.csv', sep=';')
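To see why this matters, here's a self-contained sketch with a semicolon-delimited "file" held in memory - without sep=';', Pandas finds no commas and parses each line as a single column:

```python
import io
import pandas as pd

data = 'City;State\nSacramento;California\nMiami;Florida\n'

# With the default comma delimiter, each line becomes one big field.
df_wrong = pd.read_csv(io.StringIO(data))
print(df_wrong.shape)  # (2, 1)

# With the correct delimiter, the columns split as intended.
df = pd.read_csv(io.StringIO(data), sep=';')
print(df.shape)  # (2, 2)
```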

Writing CSV Files with to_csv()

Again, DataFrames are tabular. Turning a DataFrame into a CSV file is as simple as turning a CSV file into a DataFrame - we call the to_csv() method on the DataFrame instance.

When writing a DataFrame to a CSV file, you can also select which columns get written using the columns argument, rename them via the header argument, or specify a delimiter via the sep argument. If you don't specify any of these, you'll end up with a standard Comma-Separated Values file containing every column.
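A quick sketch of the columns argument - it picks which columns end up in the output. Note that calling to_csv() without a path returns the CSV text as a string, which is handy for demonstrations:

```python
import pandas as pd

cities = pd.DataFrame(
    [['Sacramento', 'California'], ['Miami', 'Florida']],
    columns=['City', 'State'])

# columns selects a subset of columns to write - here only 'City'.
csv_text = cities.to_csv(index=False, columns=['City'])
print(csv_text)
```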

Let's play around with this:

import pandas as pd
cities = pd.DataFrame([['Sacramento', 'California'], ['Miami', 'Florida']], columns=['City', 'State'])
cities.to_csv('cities.csv')

Here, we've made a simple DataFrame with two cities and their respective states. Then, we've gone ahead and saved that data into a CSV file using to_csv() and providing the filename.

This results in a new file in the working directory of the script you're running, which contains:

,City,State
0,Sacramento,California
1,Miami,Florida

This isn't really well-formatted, though. We've still got the indices from the DataFrame, which also leave an odd empty spot before the column names. If we re-imported this CSV back into a DataFrame, it'd be a mess:

df = pd.read_csv('cities.csv')
print(df)

This results in:

   Unnamed: 0        City       State
0           0  Sacramento  California
1           1       Miami     Florida

The index from the DataFrame ended up becoming a new, Unnamed: 0 column.
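If you've already got a file written with its index, you can fold that leading column back into the DataFrame's index on read with the index_col argument - a sketch:

```python
import io
import pandas as pd

# A CSV that was saved with the DataFrame's index as its first column.
data = ',City,State\n0,Sacramento,California\n1,Miami,Florida\n'

# index_col=0 turns the leading column back into the index instead
# of leaving it behind as an 'Unnamed: 0' data column.
df = pd.read_csv(io.StringIO(data), index_col=0)
print(df.columns.tolist())  # ['City', 'State']
```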

When saving the file, let's make sure to drop the index of the DataFrame:

import pandas as pd
cities = pd.DataFrame([['Sacramento', 'California'], ['Miami', 'Florida']], columns=['City', 'State'])
cities.to_csv('cities.csv', index=False)

Now, this results in a file that contains:

City,State
Sacramento,California
Miami,Florida

Works like a charm! If we re-import it and print the contents, the DataFrame is constructed well:

df = pd.read_csv('cities.csv')
print(df)

This results in:

         City       State
0  Sacramento  California
1       Miami     Florida

Customizing Headers

Let's change the column headers from the default ones:

import pandas as pd
cities = pd.DataFrame([['Sacramento', 'California'], ['Miami', 'Florida']], columns=['City', 'State'])
new_column_names = ['City_Name', 'State_Name']
cities.to_csv('cities.csv', index=False, header=new_column_names)

We've made a new_column_names list that contains different names for our columns. Then, using the header argument, we've written these instead of the original column names. This generates a cities.csv with these contents:

City_Name,State_Name
Sacramento,California
Miami,Florida

Customizing Delimiter

Let's change the delimiter from the default (,) value to a new one:

import pandas as pd
cities = pd.DataFrame([['Sacramento', 'California'], ['Miami', 'Florida']], columns=['City', 'State'])
cities.to_csv('cities.csv', index=False, sep=';')

This results in a cities.csv file that contains:

City;State
Sacramento;California
Miami;Florida

Handling Missing Values

Sometimes, DataFrames have missing values that we've left as NaN or NA. In such cases, you might want to format these when you write them out into a CSV file. You can use the na_rep argument and set the value to be put instead of a missing value:

import pandas as pd
cities = pd.DataFrame([['Sacramento', 'California'], ['Miami', 'Florida'], ['Washington DC', pd.NA]], columns=['City', 'State'])
cities.to_csv('cities.csv', index=False, na_rep='Unknown')

Here, we've got two valid city-state pairs, but Washington DC is missing its state. If we run this code, it'll result in a cities.csv with the following contents:

City,State
Sacramento,California
Miami,Florida
Washington DC,Unknown
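read_csv() can reverse this mapping on the way back in - its na_values argument tells the parser which strings to treat as missing. A sketch against the same data, held in memory:

```python
import io
import pandas as pd

data = 'City,State\nMiami,Florida\nWashington DC,Unknown\n'

# Treat the 'Unknown' placeholder as a missing value while parsing.
df = pd.read_csv(io.StringIO(data), na_values=['Unknown'])
print(df['State'].isna().sum())  # 1
```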

Conclusion

This article showed how to read and write CSV files using Python's Pandas library. To read a CSV file, you use the read_csv() function, and you can pass custom header names via its names argument. To write a CSV file, you first create a Pandas DataFrame object and then call the to_csv() method on it.