Getting Started with Python's Wikipedia API

Introduction

In this article, we will be using the Wikipedia API to retrieve data from Wikipedia. Data scraping has seen a rapid surge owing to the increasing use of data analytics and machine learning tools. The Internet is the single largest source of information, and therefore it is important to know how to fetch data from various sources. And with Wikipedia being one of the largest and most popular sources for information on the Internet, this is a natural place to start.

In this article, we will see how to use Python's Wikipedia API to fetch a variety of information from the Wikipedia website.

Installation

In order to extract data from Wikipedia, we must first install the Python Wikipedia library, which wraps the official Wikipedia API. This can be done by entering the command below in your command prompt or terminal:

$ pip install wikipedia

Once the installation is done, we can use the Wikipedia API in Python to extract information from Wikipedia. In order to call the methods of the Wikipedia module in Python, we need to import it using the following command.

import wikipedia  

Searching Titles and Suggestions

The search() method does a Wikipedia search for a query that is supplied as an argument to it. As a result, this method returns a list of all the article's titles that contain the query. For example:

import wikipedia  
print(wikipedia.search("Bill"))  

Output:

['Bill', 'The Bill', 'Bill Nye', 'Bill Gates', 'Bills, Bills, Bills', 'Heartbeat bill', 'Bill Clinton', 'Buffalo Bill', 'Bill & Ted', 'Kill Bill: Volume 1']

As you see in the output, the searched title along with the related search suggestions are displayed. You can configure the number of search titles returned by passing a value for the results parameter, as shown here:

import wikipedia  
print(wikipedia.search("Bill", results=2))  

Output:

['Bill', 'The Bill']

The above code prints only 2 search results of the query since that is how many we requested to be returned.

Let's say we need to get the Wikipedia search suggestions for a search title, "Bill Cliton" that is incorrectly entered or has a typo. The suggest() method returns suggestions related to the search query entered as a parameter to it, or it will return "None" if no suggestions were found.

Let's try it out here:

import wikipedia  
print(wikipedia.suggest("Bill cliton"))  

Output:

bill clinton  

You can see that it took our incorrect entry, "Bill cliton", and returned the correct suggestion of "bill clinton".

Extracting Wikipedia Article Summary

We can extract the summary of a Wikipedia article using the summary() method. The article for which the summary needs to be extracted is passed as a parameter to this method.

Let's extract the summary for "Ubuntu":

print(wikipedia.summary("Ubuntu"))  

Output:

Ubuntu ( (listen)) is a free and open-source Linux distribution based on Debian. Ubuntu is officially released in three editions: Desktop, Server, and Core (for the internet of things devices and robots). Ubuntu is a popular operating system for cloud computing, with support for OpenStack.Ubuntu is released every six months, with long-term support (LTS) releases every two years. The latest release is 19.04 ("Disco Dingo"), and the most recent long-term support release is 18.04 LTS ("Bionic Beaver"), which is supported until 2028. Ubuntu is developed by Canonical and the community under a meritocratic governance model. Canonical provides security updates and support for each Ubuntu release, starting from the release date and until the release reaches its designated end-of-life (EOL) date. Canonical generates revenue through the sale of premium services related to Ubuntu. Ubuntu is named after the African philosophy of Ubuntu, which Canonical translates as "humanity to others" or "I am what I am because of who we all are".  

The whole summary is printed in the output. We can customize the number of sentences in the summary text to be displayed by configuring the sentences argument of the method.

print(wikipedia.summary("Ubuntu", sentences=2))  

Output:

Ubuntu ( (listen)) is a free and open-source Linux distribution based on Debian. Ubuntu is officially released in three editions: Desktop, Server, and Core (for the internet of things devices and robots).  

As you can see, only 2 sentences of Ubuntu's text summary is printed.

However, keep in mind that wikipedia.summary will raise a "disambiguation error" if the page does not exist or the page is disambiguous. Let's see an example.

print(wikipedia.summary("key"))  

The above code throws a DisambiguationError since there are many articles that would match "key".

Output:

Traceback (most recent call last):  
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/wikipedia/util.py", line 28, in __call__
    ret = self._cache[key] = self.fn(*args, **kwargs)
  File "/Library/Python/2.7/site-packages/wikipedia/wikipedia.py", line 231, in summary
    page_info = page(title, auto_suggest=auto_suggest, redirect=redirect)
  File "/Library/Python/2.7/site-packages/wikipedia/wikipedia.py", line 276, in page
    return WikipediaPage(title, redirect=redirect, preload=preload)
  File "/Library/Python/2.7/site-packages/wikipedia/wikipedia.py", line 299, in __init__
    self.__load(redirect=redirect, preload=preload)
  File "/Library/Python/2.7/site-packages/wikipedia/wikipedia.py", line 393, in __load
    raise DisambiguationError(getattr(self, 'title', page['title']), may_refer_to)
wikipedia.exceptions.DisambiguationError: "Key" may refer to:  
Key (cryptography)  
Key (lock)  
Key (map)  
...

If you had wanted the summary on a "cryptography key", for example, then you'd have to enter it as the following:

print(wikipedia.summary("Key (cryptography)"))  

With the more specific query we now get the correct summary in the output.

Retrieving Full Wikipedia Page Data

In order to get the contents, categories, coordinates, images, links and other metadata of a Wikipedia page, we must first get the Wikipedia page object or the page ID for the page. To do this, the page() method is used with page the title passed as an argument to the method.

Look at the following example:

wikipedia.page("Ubuntu")  

This method call will return a WikipediaPage object, which we'll explore more in the next few sections.

Extracting Metadata of a Page

To get the complete plain text content of a Wikipedia page (excluding images, tables, etc.), we can use the content attribute of the page object.

print(wikipedia.page("Python").content)  

Output:

Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects.Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including procedural, object-oriented, and  functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released 2000, introduced features like list comprehensions and a garbage collection system capable of collecting reference cycles.  
...

Similarly, we can get the URL of the page using the url attribute:

print(wikipedia.page("Python").url)  

Output:

https://en.wikipedia.org/wiki/Python_(programming_language)  

We can get the URLs of external links on a Wikipedia page by using the references property of the WikipediaPage object.

print(wikipedia.page("Python").references)  

Output:

[u'http://www.computerworld.com.au/index.php/id;66665771', u'http://neopythonic.blogspot.be/2009/04/tail-recursion-elimination.html', u'http://www.amk.ca/python/writing/gvr-interview', u'http://cdsweb.cern.ch/journal/CERNBulletin/2006/31/News%20Articles/974627?ln=en', u'http://www.2ality.com/2013/02/javascript-influences.html', ...]

The title property of the WikipediaPage object can be used to we extract the title of the page.

print(wikipedia.page("Python").title)  

Output:

Python (programming language)  

Similarly, the categories attribute can be used to get the list of categories of a Wikipedia page:

print(wikipedia.page("Python").categories)  

Output

['All articles containing potentially dated statements', 'Articles containing potentially dated statements from August 2016', 'Articles containing potentially dated statements from December 2018', 'Articles containing potentially dated statements from March 2018', 'Articles with Curlie links', 'Articles with short description', 'Class-based programming languages', 'Computational notebook', 'Computer science in the Netherlands', 'Cross-platform free software', 'Cross-platform software', 'Dutch inventions', 'Dynamically typed programming languages', 'Educational programming languages', 'Good articles', 'High-level programming languages', 'Information technology in the Netherlands', 'Object-oriented programming languages', 'Programming languages', 'Programming languages created in 1991', 'Python (programming language)', 'Scripting languages', 'Text-oriented programming languages', 'Use dmy dates from August 2015', 'Wikipedia articles with BNF identifiers', 'Wikipedia articles with GND identifiers', 'Wikipedia articles with LCCN identifiers', 'Wikipedia articles with SUDOC identifiers']

The links element of the WikipediaPage object can be used to get the list of titles of the pages whose links are present in the page.

print(wikipedia.page("Ubuntu").links)  

Output

[u'/e/ (operating system)', u'32-bit', u'4MLinux', u'ALT Linux', u'AMD64', u'AOL', u'APT (Debian)', u'ARM64', u'ARM architecture', u'ARM v7', ...]

Finding Pages Based on Coordinates

The geosearch() method is used to do a Wikipedia geo search using latitude and longitude arguments supplied as float or decimal numbers to the method.

print(wikipedia.geosearch(37.787, -122.4))  

Output:

['140 New Montgomery', 'New Montgomery Street', 'Cartoon Art Museum', 'San Francisco Bay Area Planning and Urban Research Association', 'Academy of Art University', 'The Montgomery (San Francisco)', 'California Historical Society', 'Palace Hotel Residential Tower', 'St. Regis Museum Tower', 'Museum of the African Diaspora']

As you see, the above method returns articles based on the coordinates provided.

Similarly, we can set the coordinates property of the page() and get the articles related to the geolocation. For example:

print(wikipedia.page(37.787, -122.4))  

Output:

['140 New Montgomery', 'New Montgomery Street', 'Cartoon Art Museum', 'San Francisco Bay Area Planning and Urban Research Association', 'Academy of Art University', 'The Montgomery (San Francisco)', 'California Historical Society', 'Palace Hotel Residential Tower', 'St. Regis Museum Tower', 'Museum of the African Diaspora']

Language Settings

You can customize the language of a Wikipedia page to your native language, provided the page exists in your native language. To do so, you can use the set_lang() method. Each language has a standard prefix code which is passed as an argument to the method. For example, let's get the first 2 sentences of the summary text of "Ubuntu" wiki page in the German language.

wikipedia.set_lang("de")  
print(wikipedia.summary("ubuntu", sentences=2))  

Output

Ubuntu (auch Ubuntu Linux) ist eine Linux-Distribution, die auf Debian basiert. Der Name Ubuntu bedeutet auf Zulu etwa „Menschlichkeit“ und bezeichnet eine afrikanische Philosophie.  

You can check the list of currently supported ISO languages along with its prefix, as follows:

print(wikipedia.languages())  

Retrieving Images in a Wikipedia Page

The images list of the WikipediaPage object can be used to fetch images from a Wikipedia page. For instance, the following script returns the first image from Wikipedia's Ubuntu page:

print(wikipedia.page("ubuntu").images[0])  

Output

https://upload.wikimedia.org/wikipedia/commons/1/1d/Bildschirmfoto_zu_ubuntu_704.png  

The above code returns the URL of the image present at index 0 in the Wikipedia page.

To see the image, you can copy and paste the above URL into your browser.

Retreiving Full HTML Page Content

To get the full Wikipedia page in HTML format, you can use the following script:

print(wikipedia.page("Ubuntu").html())  

Output

<div class="mw-parser-output"><div role="note" class="hatnote navigation-not-searchable">For the African philosophy, see <a href="/wiki/Ubuntu_philosophy" title="Ubuntu philosophy">Ubuntu philosophy</a>. For other uses, see <a href="/wiki/Ubuntu_(disambiguation)" class="mw-disambig" title="Ubuntu (disambiguation)">Ubuntu (disambiguation)</a>.</div>  
<div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">Linux distribution based on Debian</div>  
...

As seen in the output, the entire page in HTML format is displayed. This can take a bit longer to load if the page size is large, so keep in mind that it can raise an HTMLTimeoutError when a request to the server times out.

Conclusion

In this tutorial, we had a glimpse of using the Wikipedia API for extracting data from the web. We saw how to get a variety of information such as a page's title, category, links, images, and retrieve articles based on geo-locations.