Guide to Parsing HTML with BeautifulSoup in Python

Introduction

Web scraping is programmatically collecting information from various websites. While there are many libraries and frameworks in various languages that can extract web data, Python has long been a popular choice because of its plethora of options for web scraping.

This article will give you a crash course on web scraping in Python with Beautiful Soup - a popular Python library for parsing HTML and XML.

Ethical Web Scraping

Web scraping is ubiquitous and can give us data much like an API would. However, as good citizens of the internet, it's our responsibility to respect the owners of the sites we scrape. Here are some principles that a web scraper should adhere to:

  • Don't claim scraped content as our own. Website owners sometimes spend a lengthy amount of time creating articles, collecting details about products or harvesting other content. We must respect their labor and originality.
  • Don't scrape a website that doesn't want to be scraped. Websites sometimes come with a robots.txt file - which defines the parts of a website that can be scraped. Many websites also have a Terms of Use which may not allow scraping. We must respect websites that do not want to be scraped.
  • Is there an API available already? Splendid, there's no need for us to write a scraper. APIs are created to provide access to data in a controlled way as defined by the owners of the data. We prefer to use APIs if they're available.
  • Making requests to a website can take a toll on its performance. A web scraper that makes too many requests can be as debilitating as a DDoS attack. We must scrape responsibly so we don't disrupt the regular functioning of the website.
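The robots.txt principle above can be checked in code. Here is a minimal sketch using the standard library's urllib.robotparser; the robots.txt content, the "my-scraper" user agent and the example.com URLs are made up for illustration, and a real scraper would load the file from the target site instead:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
# Against a real site you would call set_url() and read() instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "my-scraper" is an assumed user agent name
allowed = parser.can_fetch("my-scraper", "http://example.com/catalogue/page-1.html")
blocked = parser.can_fetch("my-scraper", "http://example.com/private/data.html")
delay = parser.crawl_delay("my-scraper")  # seconds to wait between requests, if stated

print(allowed, blocked, delay)
```

Keep in mind that robots.txt is only one signal. A site's Terms of Use may still forbid scraping pages that robots.txt allows.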

An Overview of Beautiful Soup

The HTML content of the webpages can be parsed and scraped with Beautiful Soup. In the following section, we will be covering those functions that are useful for scraping webpages.

What makes Beautiful Soup so useful is the myriad functions it provides to extract data from HTML. The image below illustrates some of the functions we can use:

[Figure: BeautifulSoup - An Overview]

Let's get hands-on and see how we can parse HTML with Beautiful Soup. Consider the following HTML page saved to file as doc.html:

<html>
<head>
  <title>Head's title</title>
</head>

<body>
  <p class="title"><b>Body's title</b></p>
  <p class="story">line begins
    <a href="http://example.com/element1" class="element" id="link1">1</a>
    <a href="http://example.com/element2" class="element" id="link2">2</a>
    <a href="http://example.com/avatar1" class="avatar" id="link3">3</a>
  <p> line ends</p>
</body>
</html>

The following code snippets are tested on Ubuntu 20.04.1 LTS. You can install the BeautifulSoup module by typing the following command in the terminal:

$ pip3 install beautifulsoup4

The HTML file doc.html needs to be parsed first. This is done by passing the file to the BeautifulSoup constructor. Let's use the interactive Python shell for this, so we can instantly print the contents of a specific part of the page:

from bs4 import BeautifulSoup

with open("doc.html") as fp:
    soup = BeautifulSoup(fp, "html.parser")

Now we can use Beautiful Soup to navigate our document and extract data.
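The constructor also accepts markup as a string, which is handy for quick experiments. The sketch below uses a shortened, inline copy of doc.html and a separate variable name (demo_soup, chosen here for illustration) so it doesn't overwrite the soup object created above:

```python
from bs4 import BeautifulSoup

# Shortened inline copy of doc.html, used only for this illustration
markup = (
    "<html><head><title>Head's title</title></head>"
    "<body><p class='title'><b>Body's title</b></p></body></html>"
)

demo_soup = BeautifulSoup(markup, "html.parser")
head_title = demo_soup.title.string
print(head_title)
```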

From the soup object created in the previous section, let's get the title tag of doc.html:

soup.head.title   # returns <title>Head's title</title>

Here's a breakdown of each component we used to get the title:

[Figure: Navigating Specific Tags]

Beautiful Soup is powerful because our Python objects match the nested structure of the HTML document we are scraping.

To get the text of the first <a> tag, enter this:

soup.body.a.text  # returns '1'

To get the title within the HTML's body tag (denoted by the "title" class), type the following in your shell:

soup.body.p.b     # returns <b>Body's title</b>
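Two details of this dotted navigation are worth seeing in code: it only ever returns the first matching tag, and it returns None when no such tag exists. The following is a small sketch on an inline fragment modelled on doc.html; the demo_soup name is chosen here for illustration:

```python
from bs4 import BeautifulSoup

# Inline fragment modelled on doc.html so the sketch runs on its own
html = """<body>
  <p class="title"><b>Body's title</b></p>
  <a href="http://example.com/element1" class="element" id="link1">1</a>
  <a href="http://example.com/element2" class="element" id="link2">2</a>
</body>"""
demo_soup = BeautifulSoup(html, "html.parser")

first_link = demo_soup.body.a            # only the first <a> is returned
tag_name = first_link.name               # the name of the tag itself
bold_text = demo_soup.body.p.b.string    # the single string inside <b>
missing = demo_soup.body.table           # there is no <table> in the fragment

print(tag_name, bold_text, missing)
```

Because a missing tag yields None, chaining further onto it (for example demo_soup.body.table.tr) raises an AttributeError, so it's worth checking for None on pages whose structure you don't control.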

For deeply nested HTML documents, navigation could quickly become tedious. Luckily, Beautiful Soup comes with a search function so we don't have to navigate to retrieve HTML elements.

Searching the Elements of Tags

The find_all() method takes an HTML tag as a string argument and returns the list of elements that match with the provided tag. For example, if we want all a tags in doc.html:

soup.find_all("a")

We'll see this list of a tags as output:

[<a class="element" href="http://example.com/element1" id="link1">1</a>, <a class="element" href="http://example.com/element2" id="link2">2</a>, <a class="avatar" href="http://example.com/avatar1" id="link3">3</a>]
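The result of find_all() behaves like a regular Python list, so it can be measured, indexed and looped over. Here is a short sketch on an inline copy of the three links from doc.html (demo_soup is a name chosen for illustration):

```python
from bs4 import BeautifulSoup

# Inline copy of the three links from doc.html
html = """
<a href="http://example.com/element1" class="element" id="link1">1</a>
<a href="http://example.com/element2" class="element" id="link2">2</a>
<a href="http://example.com/avatar1" class="avatar" id="link3">3</a>
"""
demo_soup = BeautifulSoup(html, "html.parser")

links = demo_soup.find_all("a")
count = len(links)                     # number of matching tags
last_text = links[-1].text             # indexing works as with any list
labels = [a.text for a in links]       # text of every match

print(count, last_text, labels)
```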

Here's a breakdown of each component we used to search for a tag:

[Figure: Searching Elements of Tags]

We can search for tags of a specific class as well by providing the class_ argument. Beautiful Soup uses class_ because class is a reserved keyword in Python. Let's search for all a tags that have the "element" class:

soup.find_all("a", class_="element")

As we only have two links with the "element" class, you'll see this output:

[<a class="element" href="http://example.com/element1" id="link1">1</a>, <a class="element" href="http://example.com/element2" id="link2">2</a>]
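Class is not the only thing we can filter on. As a brief sketch, again on an inline copy of the links from doc.html, the same search methods also accept an id keyword, an attrs dictionary, and a limit on the number of results:

```python
from bs4 import BeautifulSoup

# Inline copy of the three links from doc.html
html = """
<a href="http://example.com/element1" class="element" id="link1">1</a>
<a href="http://example.com/element2" class="element" id="link2">2</a>
<a href="http://example.com/avatar1" class="avatar" id="link3">3</a>
"""
demo_soup = BeautifulSoup(html, "html.parser")

by_id = demo_soup.find_all(id="link2")                        # filter on the id attribute
avatars = demo_soup.find_all("a", attrs={"class": "avatar"})  # same idea as class_="avatar"
first_two = demo_soup.find_all("a", limit=2)                  # stop after two matches

print(by_id, avatars, first_two, sep="\n")
```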

What if we wanted to fetch the links embedded inside the a tags? Let's retrieve a link's href attribute using the find() method. It works just like find_all() but it returns the first matching element instead of a list. Type this in your shell:

soup.find("a", href=True)["href"] # returns http://example.com/element1
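One difference from find_all() matters in practice: when nothing matches, find() returns None rather than an empty list. The sketch below, on an inline fragment, shows that case together with the get() method of a tag, which avoids a KeyError for attributes that may be absent. The "untitled" fallback is just an illustrative default:

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/element1" class="element" id="link1">1</a>'
demo_soup = BeautifulSoup(html, "html.parser")

no_image = demo_soup.find("img")             # nothing matches, so this is None
link = demo_soup.find("a")

href = link.get("href")                      # same value as link["href"]
no_title = link.get("title")                 # missing attribute gives None, not a KeyError
fallback = link.get("title", "untitled")     # optional default, as with dict.get()

print(no_image, href, no_title, fallback)
```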

The find() and find_all() functions also accept a regular expression instead of a string. Behind the scenes, tag names are filtered using the compiled regular expression's search() method. For example:

import re

for tag in soup.find_all(re.compile("^b")):
    print(tag)

Iterating over the result prints the tags whose names start with the character b, which are <body> and <b>:

<body>
 <p class="title"><b>Body's title</b></p>
 <p class="story">line begins
       <a class="element" href="http://example.com/element1" id="link1">1</a>
 <a class="element" href="http://example.com/element2" id="link2">2</a>
 <a class="avatar" href="http://example.com/avatar1" id="link3">3</a>
 <p> line ends</p>
 </p></body>
 <b>Body's title</b>
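Regular expressions can filter attribute values as well as tag names. As a small sketch on an inline copy of the links from doc.html, the pattern below keeps only the links whose href ends in "element" followed by a digit:

```python
import re

from bs4 import BeautifulSoup

# Inline copy of the three links from doc.html
html = """
<a href="http://example.com/element1" class="element" id="link1">1</a>
<a href="http://example.com/element2" class="element" id="link2">2</a>
<a href="http://example.com/avatar1" class="avatar" id="link3">3</a>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# The pattern is applied to each tag's href value
element_links = demo_soup.find_all("a", href=re.compile(r"element\d$"))
matched_ids = [a["id"] for a in element_links]

print(matched_ids)
```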

We've covered the most popular ways to get tags and their attributes. Sometimes, especially for less dynamic web pages, we just want the text from them. Let's see how we can get it!

Getting the Whole Text

The get_text() function retrieves all the text from the HTML document. Let's get all the text of the HTML document:

soup.get_text()

Your output should be like this:

Head's title


Body's title
line begins
      1
2
3
 line ends

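If you evaluate soup.get_text() in the interactive shell without wrapping it in print(), the shell shows the string's representation, with newline characters written as \n, so your output may look like this instead: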

"\n\nHead's title\n\n\nBody's title\nline begins\n    1\n2\n3\n line ends\n\n"

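If the stray whitespace gets in the way, get_text() accepts a separator and a strip flag, and the stripped_strings generator yields the individual pieces of text. Here is a minimal sketch on an inline fragment modelled on doc.html:

```python
from bs4 import BeautifulSoup

# Inline fragment modelled on doc.html
html = """
<p class="title"><b>Body's title</b></p>
<p class="story">line begins
  <a href="http://example.com/element1">1</a>
  <a href="http://example.com/element2">2</a>
</p>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# Strip each piece of text, drop the empty ones, and join with a space
one_line = demo_soup.get_text(separator=" ", strip=True)

# The same pieces, one list item per piece
pieces = list(demo_soup.stripped_strings)

print(one_line)
print(pieces)
```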
Now that we have a feel for how to use Beautiful Soup, let's scrape a website!

Beautiful Soup in Action - Scraping a Book List

Now that we have mastered the components of Beautiful Soup, it's time to put our learning to use. Let's build a scraper to extract data from https://books.toscrape.com/ and save it to a CSV file. The site contains random data about books and is a great space to test out your web scraping techniques.

First, create a new file called scraper.py. Let's import all the libraries we need for this script:

import requests
import time
import csv
import re
from bs4 import BeautifulSoup

In the modules mentioned above:

  • requests - performs the URL request and fetches the website's HTML
  • time - lets us pause between requests so we don't hit the site too often
  • csv - helps us export our scraped data to a CSV file
  • re - allows us to write regular expressions that will come in handy for picking text based on its pattern
  • bs4 - yours truly, the scraping module to parse the HTML

You should already have bs4 installed, and time, csv, and re are built-in modules in Python. You'll need to install the requests module separately like this:

$ pip3 install requests

Before you begin, you need to understand how the webpage's HTML is structured. In your browser, let's go to http://books.toscrape.com/catalogue/page-1.html. Then right-click on the components of the webpage to be scraped, and click on the inspect button to understand the hierarchy of the tags as shown below.

This will show you the underlying HTML for what you're inspecting. The following picture illustrates these steps:

[Figure: Understanding the HTML tags]

From inspecting the HTML, we learn how to access the URL of the book, the cover image, the title, the rating, the price, and more fields from the HTML. Let's write a function that scrapes each book item and extracts its data:

def scrape(source_url, soup):  # Takes the site's base URL and the parsed page as params
    # Find every article tag that holds a book
    books = soup.find_all("article", class_="product_pod")

    # Iterate over each book article tag
    for each_book in books:
        info_url = source_url + "/" + each_book.h3.find("a")["href"]
        cover_url = source_url + "/catalogue" + \
            each_book.a.img["src"].replace("..", "")

        title = each_book.h3.find("a")["title"]
        # can also be written as : each_book.h3.find("a").get("title")
        # The class attribute is a list; its second item is the rating word
        rating = each_book.find("p", class_="star-rating")["class"][1]
        # Drop the non-ASCII currency symbol from the price
        price = each_book.find("p", class_="price_color").text.strip().encode(
            "ascii", "ignore").decode("ascii")
        availability = each_book.find(
            "p", class_="instock availability").text.strip()

        # Invoke the write_to_csv function
        write_to_csv([info_url, cover_url, title, rating, price, availability])
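As an aside, two of the extractions in scrape() are easier to follow on a single book. The sketch below runs on a hand-written fragment that imitates one product_pod article. The fragment and the RATING_WORDS mapping are assumptions made for illustration, not something taken from the site:

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating one book entry (an assumption, not the site's exact HTML)
html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""
book = BeautifulSoup(html, "html.parser").find("article", class_="product_pod")

# class is a multi-valued attribute, so Beautiful Soup returns it as a list
rating_classes = book.find("p", class_="star-rating")["class"]
rating_word = rating_classes[1]

# Hypothetical helper mapping to turn the rating word into a number
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
rating_number = RATING_WORDS[rating_word]

# Dropping non-ASCII characters removes the currency symbol before converting
price_text = book.find("p", class_="price_color").text.strip()
price = float(price_text.encode("ascii", "ignore").decode("ascii"))

print(rating_classes, rating_number, price)
```

Converting the rating and price to numbers is optional. The scrape() function keeps them as strings, which is fine for a CSV export.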

The last line of scrape() calls write_to_csv(), a function that writes the list of scraped strings to a CSV file. Let's add that function now:

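def write_to_csv(list_input):
    # The scraped info will be written to a CSV here.
    try:
        # Open the csv file in append mode; newline="" lets the csv
        # module control line endings
        with open("allBooks.csv", "a", newline="", encoding="utf-8") as fopen:
            csv_writer = csv.writer(fopen)
            csv_writer.writerow(list_input)
    except OSError:
        # The file could not be opened or written
        return False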
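Since the file is opened in append mode, it has no header row, and running the script twice adds the same rows again. One possible variation, sketched below with a hypothetical append_row() helper and made-up column names, writes a header only when the file is new. It uses a temporary directory so it can run without touching allBooks.csv:

```python
import csv
import os
import tempfile

# Assumed column names, shortened for the illustration
FIELDS = ["title", "rating", "price"]


def append_row(path, row):
    """Append one row, writing the header first if the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(FIELDS)
        writer.writerow(row)


with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "books.csv")
    append_row(csv_path, ["A Light in the Attic", "Three", "51.77"])
    # A made-up title containing a comma, to show that the csv module quotes it
    append_row(csv_path, ["Hello, World", "One", "10.00"])

    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))

print(rows)
```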

As we have a function that can scrape a page and export to CSV, we want another function that crawls through the paginated website, collecting book data on each page.

To do this, let's look at the URL we are writing this scraper for:

"http://books.toscrape.com/catalogue/page-1.html"

The only varying element in the URL is the page number. We can format the URL dynamically so it becomes a seed URL:

"http://books.toscrape.com/catalogue/page-{}.html".format(str(page_number))

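To make the idea concrete, here is a short sketch that builds the first few page URLs from the seed URL. It also shows how the standard library's urljoin() resolves relative links against a page URL, the way a browser would. The two relative paths are hypothetical examples:

```python
from urllib.parse import urljoin

seed_url = "http://books.toscrape.com/catalogue/page-{}.html"

# format() converts the number to text itself, so str() is optional
page_urls = [seed_url.format(page_number) for page_number in range(1, 4)]

# Hypothetical relative links, resolved against the first page's URL
book_link = urljoin(page_urls[0], "some-book_1/index.html")
image_link = urljoin(page_urls[0], "../media/cover.jpg")

print(page_urls)
print(book_link)
print(image_link)
```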
The formatted URL can then be fetched with requests.get(), and the returned HTML passed to a new BeautifulSoup object. Every time we build the soup object, we check for the presence of the "next" button so we can stop at the last page. We keep a counter for the page number that's incremented by 1 after successfully scraping a page.

def browse_and_scrape(seed_url, page_number=1):
    # Extract the site's base URL - we prepend it to the image and info routes
    url_pat = re.compile(r"(http://.*\.com)")
    source_url = url_pat.search(seed_url).group(0)

    # page_number from the argument gets formatted into the URL and fetched
    formatted_url = seed_url.format(str(page_number))

    try:
        html_text = requests.get(formatted_url).text
        # Prepare the soup
        soup = BeautifulSoup(html_text, "html.parser")
        print(f"Now Scraping - {formatted_url}")

        # A "next" button means there are more pages after this one
        if soup.find("li", class_="next") is not None:
            scrape(source_url, soup)     # Invoke the scrape function
            # Be a responsible citizen by waiting before you hit again
            time.sleep(3)
            page_number += 1
            # Recursively invoke the same function with the increment and
            # pass its result (True or an exception) back to the caller
            return browse_and_scrape(seed_url, page_number)
        else:
            scrape(source_url, soup)     # Last page: the script exits here
            return True
    except Exception as e:
        return e

The function above, browse_and_scrape(), calls itself recursively until soup.find("li", class_="next") returns None. At that point, the code scrapes the last page and exits.
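Recursion works well for a site of this size, but the same stop condition can also be written as a loop. The following is a sketch of that alternative, not the article's scraper: crawl(), fake_fetch() and the FAKE_SITE pages are hypothetical stand-ins so the example runs offline. In a real run, the fetch function would wrap requests.get(url).text, and the calls to scrape() and time.sleep() would go inside the loop:

```python
from bs4 import BeautifulSoup


def crawl(seed_url, fetch, max_pages=100):
    """Visit page 1, 2, ... until a page has no "next" item; return the URLs visited."""
    visited = []
    for page_number in range(1, max_pages + 1):
        url = seed_url.format(page_number)
        soup = BeautifulSoup(fetch(url), "html.parser")
        visited.append(url)
        # Same stop condition as the recursive version
        if soup.find("li", class_="next") is None:
            break
    return visited


# Made-up three-page site standing in for real HTTP responses
FAKE_SITE = {
    "http://example.com/page-1.html": '<ul><li class="next"><a href="page-2.html">next</a></li></ul>',
    "http://example.com/page-2.html": '<ul><li class="next"><a href="page-3.html">next</a></li></ul>',
    "http://example.com/page-3.html": "<p>last page</p>",
}


def fake_fetch(url):
    # Stand-in for requests.get(url).text
    return FAKE_SITE[url]


visited = crawl("http://example.com/page-{}.html", fake_fetch)
print(visited)
```

The max_pages cap is an extra safety net in this sketch, so a page that always shows a "next" button cannot keep the loop running forever.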

For the final piece of the puzzle, we initiate the scraping flow. We define the seed_url and call browse_and_scrape() to get the data. This is done under the if __name__ == "__main__" block:

if __name__ == "__main__":
    seed_url = "http://books.toscrape.com/catalogue/page-{}.html"
    print("Web scraping has begun")
    result = browse_and_scrape(seed_url)
    # result is either True or the exception that stopped the crawl
    if result is True:
        print("Web scraping is now complete!")
    else:
        print(f"Oops, That doesn't seem right!!! - {result}")

If you'd like to learn more about the if __name__ == "__main__" block, check out our guide on how it works.

You can execute the script in your terminal as shown below, and you should see output like this:

$ python scraper.py
Web scraping has begun
Now Scraping - http://books.toscrape.com/catalogue/page-1.html
Now Scraping - http://books.toscrape.com/catalogue/page-2.html
Now Scraping - http://books.toscrape.com/catalogue/page-3.html
.
.
.
Now Scraping - http://books.toscrape.com/catalogue/page-49.html
Now Scraping - http://books.toscrape.com/catalogue/page-50.html
Web scraping is now complete!

The scraped data can be found in the current working directory under the filename allBooks.csv. Here's a sample of the file's content:

http://books.toscrape.com/a-light-in-the-attic_1000/index.html,http://books.toscrape.com/catalogue/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg,A Light in the Attic,Three,51.77,In stock
http://books.toscrape.com/tipping-the-velvet_999/index.html,http://books.toscrape.com/catalogue/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg,Tipping the Velvet,One,53.74,In stock
http://books.toscrape.com/soumission_998/index.html,http://books.toscrape.com/catalogue/media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg,Soumission,One,50.10,In stock

Good job! If you want to have a look at the scraper code as a whole, you can find it on GitHub.

Conclusion

In this tutorial, we learned the ethics of writing good web scrapers. We then used Beautiful Soup to extract data from an HTML file using its object properties and its various methods like find(), find_all() and get_text(). Finally, we built a scraper that retrieves a book list online and exports it to CSV.

Web scraping is a useful skill for tasks such as collecting data much as an API would provide it, performing QA on a website, checking a website for broken URLs, and more. What's the next scraper you're going to build?
