Introduction
Hypertext Markup Language (HTML) is the standard markup language for building web pages. We can render tabular data using HTML's <table>
element. The Pandas data analysis library provides functions like read_html()
and to_html()
so we can import and export data to DataFrames.
In this article, we will learn how to read tabular data from an HTML file and load it into a Pandas DataFrame. We'll also learn how to write data from a Pandas DataFrame and to an HTML file.
Note: In this article, we will read and write HTML <table>
elements. This article does not include parsing the entire HTML file.
Reading HTML
We can read tables of an HTML file using the read_html()
function. This function reads tables of HTML files as Pandas DataFrames. It can be read from a file or a URL.
Let's have a look at each input source one by one.
Reading HTML Data From a File
For this section, we'll use one set of input data. One table contains programming languages and the year of their creation. The other table has land sizes and their cost in USD.
Save the following HTML content in a file called table_data.html
:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Table Data</title>
</head>
<body>
<table>
<thead>
<tr>
<th>Programming Language</th>
<th>Creator</th>
<th>Year</th>
</tr>
</thead>
<tbody>
<tr>
<td>C</td>
<td>Dennis Ritchie</td>
<td>1972</td>
</tr>
<tr>
<td>Python</td>
<td>Guido Van Rossum</td>
<td>1989</td>
</tr>
<tr>
<td>Ruby</td>
<td>Yukihiro Matsumoto</td>
<td>1995</td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr>
<th>
Area (sq.ft)
</th>
<th>
Price (USD)
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
12000
</td>
<td>
500
</td>
</tr>
<tr>
<td>
32000
</td>
<td>
700
</td>
</tr>
</tbody>
</table>
</body>
</html>
Pandas needs help from another library, called lxml
to parse HTML and XML files. For the read_html()
function to work you need to install lxml
:
$ pip install lxml
Once lmxl
is installed, we can use the read_html()
function. It returns a list of DataFrames, where each DataFrame is an entire table element of the given HTML file. We access each table as a DataFrame by indexing the list.
The code below demonstrates use of read_html()
function to read tables from HTML file:
import pandas as pd
tables = pd.read_html('table_data.html')
print('Tables found:', len(tables))
df1 = tables[0] # Save first table in variable df1
df2 = tables[1] # Saving next table in variable df2
print('First Table')
print(df1)
print('Another Table')
print(df2)
Note: While you need lxml
installed, you don't need to import it in your program for Pandas to work.
Running the above code on a Python interpreter will produce the following output:
Tables found: 2
First Table
Programming Language Creator Year
0 C Dennis Ritchie 1972
1 Python Guido Van Rossum 1989
2 Ruby Yukihiro Matsumoto 1995
Another Table
Area (sq.ft) Price (USD)
0 12000 500
1 32000 700
Reading HTML Data From URL
Just as we read table elements from an HTML file, we can also read table elements from an HTML web page into a DataFrame with read_html()
. However, in place of the file name, we will provide a URL like this:
read_html('https://en.wikipedia.org/wiki/Python_(programming_language)')
And it will return a list of DataFrames where each DataFrame represents a table element from the given URL.
Here is an example code to read table elements from a website URL using Pandas:
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/Python_(programming_language)')
print('Tables found:', len(tables))
df1 = tables[0] # Save first table in variable df1
print('First Table')
print(df1.head()) # To print first 5 rows
If we successfully run the above code, we can see the output as:
Tables found: 10
First Table
0 1
0 NaN NaN
1 Paradigm Multi-paradigm: functional, imperative, object...
2 Designed by Guido van Rossum
3 Developer Python Software Foundation
4 First appeared 1991; 29 years ago[1]
Reading HTML Data From URL That Requires Authentication
We know that we can read table elements from a website. However, when the site requires authentication, the code runs into the following exception:
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 401: UNAUTHORIZED
To read data from such URLs we will use the requests
module. You can install it with pip
:
$ pip install requests
Now, we will use the get()
method from the requests
library to make a request to a website URL while providing the optional auth
parameter if the site requires authentication.
This method returns a response object from the web page. We can check the status code (to make sure the content is definitely there) and get the text from the response object, and then convert the table to a DataFrame.
Let's see an example of using requests
to get data that requires authentication. For this purpose, we are using httpbin.org:
Check out our hands-on, practical guide to learning Git, with best-practices, industry-accepted standards, and included cheat sheet. Stop Googling Git commands and actually learn it!
import requests
r = requests.get('https://httpbin.org/basic-auth/john/johnspassword', auth=('john', 'johnspassword'))
print(r.status_code)
print(r.text)
On executing the above code, we can see the following output:
200
{
"authenticated": true,
"user": "john"
}
This shows that we successfully accessed the web page content of an authenticated URL. However, this website only contains JSON data and we need HTML table elements as DataFrames.
Let's stick to the earlier URL and use requests
to read HTML tables as DataFrames. While the previous site was public, the steps to access authenticated content is the same.
Once we get a response, we can pass the r.text
to read_html()
method. And as usual, we'll get a list of tables it contains as DataFrames:
import pandas as pd
import requests
# Can use auth parameter for authenticated URLs
r = requests.get('https://en.wikipedia.org/wiki/Python_(programming_language)',
auth=('john', 'johnspassword'))
tables = pd.read_html(r.text)
print('Tables found:', len(tables))
df1 = tables[0]
print('First Table')
print(df1.head())
Running this code will generate the following output:
Tables found: 10
First Table
0 1
0 NaN NaN
1 Paradigm Multi-paradigm: functional, imperative, object...
2 Designed by Guido van Rossum
3 Developer Python Software Foundation
4 First appeared 1991; 29 years ago[1]
Writing HTML Tables with Python's Pandas
We have successfully read data from HTML tables. Let's write Pandas DataFrame in an HTML file. This can be achieved by using the to_html()
method.
The to_html()
takes the path of the file you want the data exported to. If you don't provide an absolute path, it would save a file relative to the current directory.
You can export a DataFrame to an HTML table like this:
import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df.to_html('write_html.html')
This code will produce the following file write_html.html
in the current directory:
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>A</th>
<th>B</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>3</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>4</td>
</tr>
</tbody>
</table>
Note that the export isn't an entire HTML document, but just the HTML table itself.
Writing Styled HTML Tables with Python's Pandas
As we can see, by default the table border is 1, alignment is right, and it also has DataFrame indices in <th>
tags. We can change this default structure by providing a few optional parameters.
Hide Index
If we don't want to include the index in the table output, we can set index=False
in the to_html()
:
import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df.to_html('write_html.html', index=False)
This code produces write_html.html
file with the following content:
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th>A</th>
<th>B</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>3</td>
</tr>
<tr>
<td>2</td>
<td>4</td>
</tr>
</tbody>
</table>
Changing Table Border
The default table border is 1 pixel. To change this default, we can set the border
parameter to a value in pixels.
The following code changes the border to 3 pixels:
import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df.to_html('write_html.html', border=3)
The generated file now sets the border attribute of the table to "3":
<table border="3" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>A</th>
<th>B</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>3</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>4</td>
</tr>
</tbody>
</table>
Justify Text
By default, the table's head text is right-aligned. We change this alignment with the justify
parameter. For example, doing justify="center"
will add style="text-align: center;"
in the <tr>
tag of the <thead>
tag.
Let's try aligning the head text to center and see results:
import pandas as pd
df = pd.DataFrame({'A': [1, 'AAA'], 'B': ['BBB', 4]})
df.to_html('write_html.html', justify='center')
The table created by above code looks like this:
<table border="1" class="dataframe">
<thead>
<tr style="text-align: center;">
<th></th>
<th>A</th>
<th>B</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>1</td>
<td>BBB</td>
</tr>
<tr>
<th>1</th>
<td>AAA</td>
<td>4</td>
</tr>
</tbody>
</table>
The text of the table head is now beautifully aligned to the center.
Conclusion
In this tutorial, we have learned how to import and export HTML table data with Pandas DataFrames. We loaded HTML table data from files, as well as from web page URLs. In the case of authenticated URLs, we used the requests module to authenticate and retrieve the site's data, and then pass the response text into the read_html()
function.
We have also written a Pandas DataFrame as an HTML file using the to_html()
function. We then styled the generated table by passing a few optional parameters like index
, border
and justify
. This makes it easy to write the data of DataFrame in a presentational manner.