lxml is a Python library for reading, creating, and modifying XML and HTML files, and it can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers. This is where the lxml library comes into play. Its key benefits are that it is easy to use, extremely fast when parsing large documents, very well documented, and converts data to Python data types easily, which makes file manipulation simpler.
In this tutorial, we will deep dive into Python's lxml library, starting with how to set it up for different operating systems, and then discussing its benefits and the wide range of functionalities it offers.
There are multiple ways to install lxml on your system. We'll explore some of them below.
Pip is a Python package manager used to download and install Python libraries on your local system. It also downloads and installs all the dependencies of the package you're installing.
If you have pip installed on your system, simply run the following command in terminal or command prompt:
$ pip install lxml
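If you're using a Debian-based Linux distribution such as Ubuntu, you can instead install lxml through the system package manager by running the command below in your terminal. The package for Python 3 is named python3-lxml. Note that apt-get is not available on macOS, where pip is the simplest route.
$ sudo apt-get install python3-lxml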
You probably won't get to this part, but if none of the above commands work for you for some reason, try using easy_install. It is deprecated, so it may not be available on newer Python setups:
$ easy_install lxml
Note: If you wish to install a particular version of lxml, you can state it when you run the command in the command prompt or terminal, using pip's == version specifier in the form pip install lxml==<version>.
By now, you should have a copy of the lxml library installed on your local machine. Let's now get our hands dirty and see what cool things can be done using this library.
To be able to use the lxml library in your program, you first need to import it. You can do that by using the following command:
from lxml import etree as et
This will import the etree module, the module of our interest, from the lxml library.
Creating HTML/XML Documents
Using the etree module, we can create XML/HTML elements and their subelements, which is very useful if we're trying to write or manipulate an HTML or XML file. Let's try to create the basic structure of an HTML file using etree:
root = et.Element('html', version="5.0")

# Pass the parent node, name of the child node,
# and any number of optional attributes
et.SubElement(root, 'head')
et.SubElement(root, 'title', bgcolor="red", fontsize='22')
et.SubElement(root, 'body', fontsize="15")
In the code above, note that the Element function requires at least one parameter, whereas the SubElement function requires at least two. This is because the Element function only requires the name of the element to be created, whereas the SubElement function requires both the parent element and the name of the child node to be created.
It's also important to know that both these functions only have a lower bound on the number of arguments they can accept, but no upper bound, because you can associate as many attributes with them as you want. To add an attribute to an element, simply add an additional parameter to the (Sub)Element function and specify your attribute in the form of a keyword argument, attributeName='attribute value'.
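As a minimal sketch of attaching attributes at creation time, the snippet below passes attributes both as keyword arguments and as a dictionary. The attribute name data-theme is an illustrative assumption, chosen because it is not a valid Python identifier and so cannot be written as a keyword argument.

```python
from lxml import etree as et

# Attributes passed as keyword arguments
root = et.Element('html', version="5.0")

# Attributes can also be passed as a dictionary after the tag name.
# 'data-theme' is a hypothetical attribute used only for illustration.
body = et.SubElement(root, 'body', {'data-theme': 'dark'}, fontsize="15")

print(root.get('version'))      # 5.0
print(body.get('data-theme'))   # dark
print(body.get('fontsize'))     # 15
```

Both forms can be combined in one call, as shown for the body element.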
Let's try to run the code we wrote above to gain a better intuition regarding these functions:
# Use pretty_print=True to indent the HTML output
print(et.tostring(root, pretty_print=True).decode("utf-8"))
<html version="5.0">
  <head/>
  <title bgcolor="red" fontsize="22"/>
  <body fontsize="15"/>
</html>
There's another way to create and organize your elements in a hierarchical manner. Let's explore that as well:
root = et.Element('html')
# Create standalone elements with Element, then attach them to the parent
root.append(et.Element('head'))
root.append(et.Element('body'))
So in this case whenever we create a new element, we simply append it to the root/parent node.
Parsing HTML/XML Documents
Until now, we have only considered creating new elements, assigning attributes to them, etc. Let's now see an example where we already have an HTML or XML file, and we wish to parse it to extract certain information. Assuming that we have the HTML file that we created in the first example, let's try to get the tag name of one specific element, followed by printing the tag names of all the elements.
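A minimal sketch of the first step, reading the tag name of one specific element, could look like this. It rebuilds the tree from the first example so that it runs on its own.

```python
from lxml import etree as et

# Rebuild the tree from the first example
root = et.Element('html', version="5.0")
et.SubElement(root, 'head')
et.SubElement(root, 'title', bgcolor="red", fontsize="22")
et.SubElement(root, 'body', fontsize="15")

# The tag attribute holds the element's name
print(root.tag)      # html
print(root[1].tag)   # title (children are accessed by index)
```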
Now to iterate through all the child elements in the root node and print their tags:
for e in root:
    print(e.tag)
head
title
body
Working with Attributes
Let's now see how we associate attributes to existing elements, as well as how to retrieve the value of a particular attribute for a given element.
Using the same root element as before, try out the following code:
root.set('newAttribute', 'attributeValue')

# Print root again to see if the new attribute has been added
print(et.tostring(root, pretty_print=True).decode("utf-8"))
<html version="5.0" newAttribute="attributeValue">
  <head/>
  <title bgcolor="red" fontsize="22"/>
  <body fontsize="15"/>
</html>
Here we can see that newAttribute="attributeValue" has indeed been added to the root element.
Let's now try to get the values of the attributes we have set in the above code. Here we access a child element using array indexing on the root element, and then use the get() method to retrieve the attribute:
print(root.get('newAttribute'))
print(root[1].get('alpha'))    # root[1] accesses the `title` element
print(root[1].get('bgcolor'))
attributeValue
None
red
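Since get() returns None for a missing attribute, it can help to supply a fallback value or to inspect all attributes at once. The following sketch illustrates both on a freshly built title element; the fallback string 'not set' is an arbitrary choice for the example.

```python
from lxml import etree as et

root = et.Element('html', version="5.0")
title = et.SubElement(root, 'title', bgcolor="red", fontsize="22")

# get() accepts an optional default that is returned for missing attributes
print(title.get('alpha', 'not set'))   # not set

# keys() lists the attribute names; attrib is a dict-like view of all attributes
print(sorted(title.keys()))            # ['bgcolor', 'fontsize']
for name, value in sorted(title.attrib.items()):
    print(name, '=', value)
```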
Retrieving Text from Elements
Now that we have seen the basic functionalities of the etree module, let's try to do some more interesting things with our HTML and XML files. Almost always, these files have some text in between the tags. So, let's see how we can add text to our elements:
# Copying the code from the very first example
root = et.Element('html', version="5.0")
et.SubElement(root, 'head')
et.SubElement(root, 'title', bgcolor="red", fontsize="22")
et.SubElement(root, 'body', fontsize="15")

# Add text to the Elements and SubElements
root.text = "This is an HTML file"
root[0].text = "This is the head of that file"
root[1].text = "This is the title of that file"
root[2].text = "This is the body of that file and would contain paragraphs etc"

print(et.tostring(root, pretty_print=True).decode("utf-8"))
<html version="5.0">This is an HTML file<head>This is the head of that file</head><title bgcolor="red" fontsize="22">This is the title of that file</title><body fontsize="15">This is the body of that file and would contain paragraphs etc</body></html>
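Reading text back works through the same text attribute. The sketch below, built on a small stand-alone tree, shows that an element without text reports None, and that itertext() can collect the text of an element and its descendants.

```python
from lxml import etree as et

root = et.Element('html')
head = et.SubElement(root, 'head')
body = et.SubElement(root, 'body')
body.text = "Hello"

print(body.text)   # Hello
print(head.text)   # None, because no text was assigned

# itertext() walks the element and its descendants, yielding each text chunk
print(list(root.itertext()))   # ['Hello']
```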
Check if an Element has Children
Next, there are two very important things that we should be able to check, as they are required in a lot of web scraping applications for exception handling. The first thing we'd like to check is whether or not an element has children, and the second is whether or not a node is an Element.
Let's do that for the nodes we created above:
if len(root) > 0:
    print("True")
else:
    print("False")
The above code will output "True" since the root node does have child nodes. However, if we check the same thing for the root's child nodes, like in the code below, the output will be "False".
for i in range(len(root)):
    if len(root[i]) > 0:
        print("True")
    else:
        print("False")
False
False
False
Now let's do the same thing to see if each of the nodes is an Element or not:
for i in range(len(root)):
    print(et.iselement(root[i]))
True
True
True
The iselement method is helpful for determining if you have a valid Element object, and thus if you can continue traversing it using the methods we've shown here.
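To illustrate why this check is useful, here is a small sketch that combines both tests in one guard. The helper name has_children is hypothetical and not part of lxml.

```python
from lxml import etree as et

def has_children(node):
    """Hypothetical helper: True only for a real Element with child elements."""
    return et.iselement(node) and len(node) > 0

root = et.Element('html')
et.SubElement(root, 'head')

print(has_children(root))            # True
print(has_children(root[0]))         # False, head has no children
print(has_children(root.find('a')))  # False, find() returned None
print(et.iselement("html"))          # False, a plain string is not an Element
```

Guarding with iselement first avoids calling len() on None when a lookup finds nothing.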
Check if an Element has a Parent
Just now, we showed how to go down the hierarchy, i.e. how to check if an element has children or not, and now in this section we will try to go up the hierarchy, i.e. how to check and get the parent of a child node.
print(root.getparent())
print(root[0].getparent())
print(root[1].getparent())
The first line should return nothing (aka None) as the root node itself doesn't have any parent. The other two should both point to the root element, i.e. the html tag. Let's check the output to see if it is what we expect:
None
<Element html at 0x1103c9688>
<Element html at 0x1103c9688>
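When a node is nested more deeply, the whole chain of parents can be collected in one go. The sketch below uses lxml's iterancestors() on a small tree with an extra p element, which is added here purely for illustration.

```python
from lxml import etree as et

root = et.Element('html')
body = et.SubElement(root, 'body')
p = et.SubElement(body, 'p')   # extra nesting level, for illustration only

# getparent() goes up one level
print(p.getparent().tag)   # body

# iterancestors() yields every ancestor, nearest first
print([a.tag for a in p.iterancestors()])   # ['body', 'html']
```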
Retrieving Element Siblings
In this section we will learn how to traverse sideways in the hierarchy, which retrieves an element's siblings in the tree.
Traversing the tree sideways is quite similar to navigating it vertically. For the latter, we used getparent and the length of the element; for the former, we'll use the getnext and getprevious functions. Let's try them on the nodes that we previously created to see how they work:
# root[1] is the `title` tag
print(root[1].getnext())      # The tag after the `title` tag
print(root[1].getprevious())  # The tag before the `title` tag
<Element body at 0x10b5a75c8>
<Element head at 0x10b5a76c8>
Here you can see that root[1].getnext() retrieved the "body" tag since it was the next element, and root[1].getprevious() retrieved the "head" tag.
Similarly, if we had used the getprevious function on root, it would have returned None, since the root has no siblings, and if we had used the getnext function on root[2], the last child, it would also have returned None.
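As a brief sketch of sideways traversal, the snippet below rebuilds the same three-child tree and also uses itersiblings(), which yields all following siblings, or all preceding ones when preceding=True is passed.

```python
from lxml import etree as et

root = et.Element('html')
et.SubElement(root, 'head')
et.SubElement(root, 'title')
et.SubElement(root, 'body')

title = root[1]
print(title.getnext().tag)       # body
print(title.getprevious().tag)   # head

# The first child has no previous sibling, the last child has no next one
print(root[0].getprevious())     # None
print(root[2].getnext())         # None

# itersiblings() walks all siblings in one direction
print([s.tag for s in root[0].itersiblings()])                 # ['title', 'body']
print([s.tag for s in root[2].itersiblings(preceding=True)])   # ['title', 'head']
```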
Parsing XML from a String
Moving on, if we have an XML or HTML file and we wish to parse the raw string in order to obtain or manipulate the required information, we can do so by following the example below:
root = et.XML('<html version="5.0">This is an HTML file<head>This is the head of that file</head><title bgcolor="red" fontsize="22">This is the title of that file</title><body fontsize="15">This is the body of that file and would contain paragraphs etc</body></html>')

# root[1] is the `title` element
root[1].text = "The title text has changed!"
print(et.tostring(root, xml_declaration=True).decode('utf-8'))
<?xml version='1.0' encoding='ASCII'?>
<html version="5.0">This is an HTML file<head>This is the head of that file</head><title bgcolor="red" fontsize="22">The title text has changed!</title><body fontsize="15">This is the body of that file and would contain paragraphs etc</body></html>
As you can see, we successfully changed some text in the HTML document. The XML declaration was also automatically added because of the xml_declaration parameter that we passed to the tostring function.
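Strings coming from the outside world are not always well-formed, so it can be worth guarding the parsing step. The following is a minimal sketch using et.fromstring, which parses a string in the same way as et.XML here, and catching et.XMLSyntaxError; the sample markup is made up for the example.

```python
from lxml import etree as et

# Well-formed input parses into an Element
doc = et.fromstring('<html><title>Old title</title></html>')
doc[0].text = "New title"
print(et.tostring(doc).decode('utf-8'))
# <html><title>New title</title></html>

# Malformed input (the title tag is never closed) raises XMLSyntaxError
parse_failed = False
try:
    et.fromstring('<html><title>Unclosed</html>')
except et.XMLSyntaxError as err:
    parse_failed = True
    print("Could not parse:", err)
```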
Searching for Elements
The last thing we're going to discuss is quite handy when parsing XML and HTML files. We will look at ways to check whether an Element has any particular type of children and, if it does, what they contain.
This has many practical use-cases, such as finding all of the link elements on a particular web page.
print(root.find('a'))          # No <a> tags exist, so this will be `None`
print(root.find('head').tag)
print(root.findtext('title'))  # Directly retrieve the title tag's text
None
head
This is the title of that file
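For the link-finding use case mentioned above, note that find('a') only looks at direct children. A minimal sketch for collecting links at any depth, using iter() and a './/a' path with findall(), might look like the following. The markup and example.com URLs are placeholders.

```python
from lxml import etree as et

# Placeholder page with one direct and one nested link inside <body>
root = et.XML(
    '<html><body>'
    '<a href="https://example.com/one">One</a>'
    '<p><a href="https://example.com/two">Two</a></p>'
    '</body></html>'
)

# find() searches direct children of root only, so no <a> is found here
print(root.find('a'))   # None

# iter() walks all descendants with the given tag, in document order
links = [a.get('href') for a in root.iter('a')]
print(links)   # ['https://example.com/one', 'https://example.com/two']

# findall() with a './/' path also searches at any depth
print(len(root.findall('.//a')))   # 2
```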
In this tutorial, we started with a basic introduction to what the lxml library is and what it is used for. After that, we learned how to install it in different environments. Moving on, we explored functionality for creating elements, working with attributes and text, and traversing the HTML/XML tree vertically as well as sideways. Finally, we discussed ways to find elements in our tree and obtain information from them.