While you can read and write CSV files in Python using the built-in open() function or the dedicated csv module, you can also use Pandas.
In this article, you will see how to use Python's Pandas library to read and write CSV files.
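For contrast, here's a minimal sketch of reading a CSV file with the standard library's csv module (this assumes a file named cities.csv, like the one we'll create later in this article, sits in the working directory):
import csv

with open('cities.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        # Each row comes back as a plain list of strings, e.g. ['City', 'State']
        print(row)
Pandas builds on this idea but handles parsing, headers, and type inference for you, which is what the rest of this article covers.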
What is a CSV File?
Let's quickly recap what a CSV file is - nothing more than a simple text file that follows a few formatting conventions. It is, however, the most common and simplest format for storing tabular data. The format arranges a table into rows and columns, and it is these rows and columns that contain your data.
A new line terminates each row to start the next row. Similarly, a delimiter, usually a comma, separates columns within each row.
For example, we might have a table that looks like this:
| City | State | Capital | Population |
| ------------ | ------------ | ------- | ------------- |
| Philadelphia | Pennsylvania | No | 1.581 Million |
| Sacramento | California | Yes | 0.5 Million |
| New York | New York | No | 8.623 Million |
| Austin | Texas | Yes | 0.95 Million |
| Miami | Florida | No | 0.463 Million |
If we were to convert it into the CSV format, it'd look like this:
City,State,Capital,Population
Philadelphia,Pennsylvania,No,1.581 Million
Sacramento,California,Yes,0.5 Million
New York,New York,No,8.623 Million
Austin,Texas,Yes,0.95 Million
Miami,Florida,No,0.463 Million
Although the name (Comma-Separated Values) inherently uses a comma as the delimiter, you can use other delimiters (separators) as well, such as the semicolon (;). Each row of the table is a new line of the CSV file, and it's a very compact and concise way to represent tabular data.
Now, let's take a look at the read_csv() function.
Reading and Writing CSV Files using Pandas
Pandas is a very powerful and popular framework for data analysis and manipulation. One of the most striking features of Pandas is its ability to read and write various types of files, including CSV and Excel. You can effectively and easily manipulate CSV files in Pandas using functions like read_csv() and to_csv().
Installing Pandas
We have to install Pandas before using it. Let's use pip:
$ pip install pandas
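If you want to double-check that the installation worked, printing the installed version is a quick sanity check (the exact version on your machine will differ):
import pandas as pd

# If this prints a version number without raising an ImportError, Pandas is ready to use
print(pd.__version__)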
Reading CSV Files with read_csv()
Let's import the Titanic Dataset, which can be obtained on GitHub:
import pandas as pd
titanic_data = pd.read_csv('titanic.csv')
We just supply the path to the file we'd like to parse as the only required argument of this method; for a relative path like this, Pandas looks for the file in the current working directory.
Let's take a look at the head() of this dataset to make sure it's imported correctly:
titanic_data.head()
This results in:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
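As an optional sanity check, you can also inspect the DataFrame's dimensions through its shape attribute, which is a (rows, columns) tuple - for this dataset, you should see 12 columns:
# shape is handy for confirming the file parsed as expected
print(titanic_data.shape)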
Alternatively, you can also read CSV files from online resources, such as GitHub, simply by passing in the URL of the resource to the read_csv() function. Let's read this same CSV file from the GitHub repository, without downloading it to our local machine first:
import pandas as pd
titanic_data = pd.read_csv(r'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
print(titanic_data.head())
This also results in:
PassengerId Survived Pclass ... Fare Cabin Embarked
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
[5 rows x 12 columns]
Customizing Headers
By default, the read_csv() method uses the first row of the CSV file as the column headers. Sometimes, these headers might have odd names, and you might want to use your own. You can set headers either after reading the file, simply by assigning another list to the columns field of the DataFrame instance, or while reading the CSV in the first place.
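The first approach looks something like this - a quick sketch that reuses the titanic_data DataFrame loaded earlier (the list must contain exactly one name per column, 12 in this case):
titanic_data.columns = ['Id', 'Survived', 'Passenger Class', 'Full Name',
                        'Gender', 'Age', 'SibSp', 'Parch',
                        'Ticket Number', 'Price', 'Cabin', 'Station']
The rest of this section takes the second route and sets the names while reading.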
Let's define a list of column names, and use those names instead of the ones from the CSV file:
import pandas as pd
col_names = ['Id',
             'Survived',
             'Passenger Class',
             'Full Name',
             'Gender',
             'Age',
             'SibSp',
             'Parch',
             'Ticket Number',
             'Price',
             'Cabin',
             'Station']
titanic_data = pd.read_csv(r'E:\Datasets\titanic.csv', names=col_names)
print(titanic_data.head())
Let's run this code:
Id Survived Passenger Class ... Price Cabin Station
0 PassengerId Survived Pclass ... Fare Cabin Embarked
1 1 0 3 ... 7.25 NaN S
2 2 1 1 ... 71.2833 C85 C
3 3 1 3 ... 7.925 NaN S
4 4 1 1 ... 53.1 C123 S
Hmm! Now we've got our custom headers, but the first row of the CSV file, which was originally used to set the column names, is also included in the DataFrame
. We'll want to skip this line, since it no longer holds any value for us.
Skipping Rows While Reading CSV
Let's address this issue by using the skiprows argument:
import pandas as pd
col_names = ['Id',
             'Survived',
             'Passenger Class',
             'Full Name',
             'Gender',
             'Age',
             'SibSp',
             'Parch',
             'Ticket Number',
             'Price',
             'Cabin',
             'Station']
titanic_data = pd.read_csv(r'E:\Datasets\titanic.csv', names=col_names, skiprows=[0])
print(titanic_data.head())
Now, let's run this code:
Id Survived Passenger Class ... Price Cabin Station
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 4 1 1 ... 53.1000 C123 S
4 5 0 3 ... 8.0500 NaN S
Works like a charm! The skiprows argument accepts a list of rows you'd like to skip. You can skip, for example, rows 0, 4, and 7 if you'd like as well:
titanic_data = pd.read_csv(r'E:\Datasets\titanic.csv', names=col_names, skiprows=[0, 4, 7])
print(titanic_data.head(10))
This would result in a DataFrame that doesn't have some of the rows we've seen before:
Id Survived Passenger Class ... Price Cabin Station
0 1 0 3 ... 7.2500 NaN S
1 2 1 1 ... 71.2833 C85 C
2 3 1 3 ... 7.9250 NaN S
3 5 0 3 ... 8.0500 NaN S
4 6 0 3 ... 8.4583 NaN Q
5 8 0 3 ... 21.0750 NaN S
6 9 1 3 ... 11.1333 NaN S
7 10 1 2 ... 30.0708 NaN C
8 11 1 3 ... 16.7000 G6 S
9 12 1 1 ... 26.5500 C103 S
Keep in mind that skipping rows happens before the DataFrame is fully formed, so you won't be missing any indices of the DataFrame itself. In this case, though, you can see that the Id field (imported from the CSV file) is missing IDs 4 and 7.
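As an aside, skiprows isn't limited to lists - it also accepts a plain integer or a callable. A short sketch, reusing the same file path and col_names list from above:
# An integer skips that many lines from the top of the file - equivalent to skiprows=[0] here
titanic_data = pd.read_csv(r'E:\Datasets\titanic.csv', names=col_names, skiprows=1)

# A callable is evaluated against each line index and should return True to skip that line -
# this skips line 0 (the original header) and every line whose index is a multiple of 10
titanic_thinned = pd.read_csv(r'E:\Datasets\titanic.csv', names=col_names,
                              skiprows=lambda i: i % 10 == 0)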
Removing Headers
You can also decide to remove the header completely by setting the header argument to None, which results in a DataFrame whose columns are simply labeled 0 through n:
titanic_data = pd.read_csv(r'E:\Datasets\titanic.csv', header=None, skiprows=[0])
You'll also want to skip the first row here, since if you don't, the values from the original header row will be included as the first row of data:
0 1 2 3 4 ... 7 8 9
0 1 0 3 Braund, Mr. Owen Harris male ... 0 A/5 21171 7.2500
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female ... 0 PC 17599 71.2833
2 3 1 3 Heikkinen, Miss. Laina female ... 0 STON/O2. 3101282 7.9250
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female ... 0 113803 53.1000
4 5 0 3 Allen, Mr. William Henry male ... 0 373450 8.0500
Specifying Delimiters
As stated earlier, you'll eventually probably encounter a CSV file that doesn't actually use commas to separate data. In such cases, you can use the sep
argument to specify other delimiters:
titanic_data = pd.read_csv(r'E:\Datasets\titanic.csv', sep=';')
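For instance, a tab-separated file could be read by passing the tab character as the separator (the file path here is just a hypothetical placeholder):
# sep='\t' tells Pandas that columns are separated by tabs rather than commas
data = pd.read_csv(r'E:\Datasets\some_data.tsv', sep='\t')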
Writing CSV Files with to_csv()
Again, DataFrames are tabular. Turning a DataFrame into a CSV file is as simple as turning a CSV file into a DataFrame - we call the to_csv() function on the DataFrame instance.
When writing a DataFrame to a CSV file, you can also change the column names, using the header argument, or specify a delimiter via the sep argument. If you don't specify either of these, you'll end up with a standard Comma-Separated Values file.
Let's play around with this:
import pandas as pd
cities = pd.DataFrame([['Sacramento', 'California'], ['Miami', 'Florida']], columns=['City', 'State'])
cities.to_csv('cities.csv')
Here, we've made a simple DataFrame with two cities and their respective states. Then, we went ahead and saved that data into a CSV file using to_csv() and provided the filename.
This results in a new file in the working directory of the script you're running, which contains:
,City,State
0,Sacramento,California
1,Miami,Florida
Though, this isn't really well-formatted. We've still got the indices from the DataFrame, which also leaves an odd empty spot before the column names. If we re-imported this CSV back into a DataFrame, it'd be a mess:
df = pd.read_csv('cities.csv')
print(df)
This results in:
Unnamed: 0 City State
0 0 Sacramento California
1 1 Miami Florida
The indices from the DataFrame ended up becoming a new column, which now shows up as Unnamed: 0.
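One way to handle this on the reading side is to tell read_csv() to treat that first, unnamed column as the index via the index_col argument:
# index_col=0 reuses the first column of the file as the DataFrame's index
df = pd.read_csv('cities.csv', index_col=0)
print(df)
That works, but it's cleaner to avoid writing the index in the first place.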
When saving the file, let's make sure to drop the index of the DataFrame:
import pandas as pd
cities = pd.DataFrame([['Sacramento', 'California'], ['Miami', 'Florida']], columns=['City', 'State'])
cities.to_csv('cities.csv', index=False)
Now, this results in a file that contains:
City,State
Sacramento,California
Miami,Florida
Works like a charm! If we re-import it and print the contents, the DataFrame is constructed well:
df = pd.read_csv('cities.csv')
print(df)
This results in:
City State
0 Sacramento California
1 Miami Florida
Customizing Headers
Let's change the column headers from the default ones:
import pandas as pd
cities = pd.DataFrame([['Sacramento', 'California'], ['Miami', 'Florida']], columns=['City', 'State'])
new_column_names = ['City_Name', 'State_Name']
cities.to_csv('cities.csv', index=False, header=new_column_names)
We've made a new_column_names list that contains different values for our columns. Then, using the header argument, we've set these instead of the original column names. This generates a cities.csv with these contents:
City_Name,State_Name
Sacramento,California
Miami,Florida
Customizing Delimiter
Let's change the delimiter from the default (,) value to a new one:
import pandas as pd
cities = pd.DataFrame([['Sacramento', 'California'], ['Miami', 'Florida']], columns=['City', 'State'])
cities.to_csv('cities.csv', index=False, sep=';')
This results in a cities.csv file that contains:
City;State
Sacramento;California
Miami;Florida
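Keep in mind that when reading this file back in, you need to pass the same separator to read_csv(), otherwise each line will be parsed as a single column:
# The separator must match the one the file was written with
df = pd.read_csv('cities.csv', sep=';')
print(df)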
Handling Missing Values
Sometimes, DataFrames have missing values that we've left as NaN or NA. In such cases, you might want to control how these are written out to a CSV file. You can use the na_rep argument to set the string that's written in place of a missing value:
import pandas as pd
cities = pd.DataFrame([['Sacramento', 'California'], ['Miami', 'Florida'], ['Washington DC', pd.NA]], columns=['City', 'State'])
cities.to_csv('cities.csv', index=False, na_rep='Unknown')
Here, we've got two valid city-state pairs, but Washington DC is missing its state. If we run this code, it'll result in a cities.csv with the following contents:
City,State
Sacramento,California
Miami,Florida
Washington DC,Unknown
Conclusion
This article showed how to read and write CSV files using Python's Pandas library. To read a CSV file, you use the read_csv() method, and you can pass custom header names while reading via its names argument. Finally, to write a CSV file using Pandas, you first create a Pandas DataFrame object and then call the to_csv() method on it.