Introduction
Data scraping has seen a rapid surge owing to the increasing use of data analytics and machine learning tools. The Internet is the single largest source of information, so knowing how to fetch data from it programmatically is an important skill. And with Wikipedia being one of the largest and most popular sources of information on the Internet, it is a natural place to start.
In this article, we will see how to use Python's Wikipedia API to fetch a variety of information from the Wikipedia website.
Installation
In order to extract data from Wikipedia, we must first install the Python Wikipedia library, which wraps the official Wikipedia API. This can be done by entering the command below in your command prompt or terminal:
$ pip install wikipedia
Once the installation is done, we can use the Wikipedia API in Python to extract information from Wikipedia. In order to call the methods of the Wikipedia module in Python, we need to import it using the following command.
import wikipedia
Searching Titles and Suggestions
The search()
method does a Wikipedia search for the query supplied as its argument and returns a list of titles of articles that match the query. For example:
import wikipedia
print(wikipedia.search("Bill"))
Output:
['Bill', 'The Bill', 'Bill Nye', 'Bill Gates', 'Bills, Bills, Bills', 'Heartbeat bill', 'Bill Clinton', 'Buffalo Bill', 'Bill & Ted', 'Kill Bill: Volume 1']
As you can see in the output, the searched title is displayed along with related search results. You can configure the number of titles returned by passing a value for the results
parameter, as shown here:
import wikipedia
print(wikipedia.search("Bill", results=2))
Output:
['Bill', 'The Bill']
The above code prints only 2 search results for the query, since that is how many we requested.
Let's say we need to get Wikipedia search suggestions for a search title, "Bill cliton", that is incorrectly entered or has a typo. The suggest()
method returns a suggestion related to the search query passed to it as a parameter, or None if no suggestion was found.
Let's try it out here:
import wikipedia
print(wikipedia.suggest("Bill cliton"))
Output:
bill clinton
You can see that it took our incorrect entry, "Bill cliton", and returned the correct suggestion of "bill clinton".
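Since suggest() returns None when it has nothing to offer, you may want to guard for that case before using the result. Here is a minimal sketch that falls back to the original query:
query = "Bill cliton"
suggestion = wikipedia.suggest(query)

# Use the suggestion if one was found, otherwise search for the original query
if suggestion is not None:
    print(wikipedia.search(suggestion))
else:
    print(wikipedia.search(query))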
Extracting Wikipedia Article Summary
We can extract the summary of a Wikipedia article using the summary()
method. The article for which the summary needs to be extracted is passed as a parameter to this method.
Let's extract the summary for "Ubuntu":
print(wikipedia.summary("Ubuntu"))
Output:
Ubuntu ( (listen)) is a free and open-source Linux distribution based on Debian. Ubuntu is officially released in three editions: Desktop, Server, and Core (for the internet of things devices and robots). Ubuntu is a popular operating system for cloud computing, with support for OpenStack.Ubuntu is released every six months, with long-term support (LTS) releases every two years. The latest release is 19.04 ("Disco Dingo"), and the most recent long-term support release is 18.04 LTS ("Bionic Beaver"), which is supported until 2028. Ubuntu is developed by Canonical and the community under a meritocratic governance model. Canonical provides security updates and support for each Ubuntu release, starting from the release date and until the release reaches its designated end-of-life (EOL) date. Canonical generates revenue through the sale of premium services related to Ubuntu. Ubuntu is named after the African philosophy of Ubuntu, which Canonical translates as "humanity to others" or "I am what I am because of who we all are".
The whole summary is printed in the output. We can customize the number of sentences in the summary text to be displayed by configuring the sentences
argument of the method.
print(wikipedia.summary("Ubuntu", sentences=2))
Output:
Ubuntu ( (listen)) is a free and open-source Linux distribution based on Debian. Ubuntu is officially released in three editions: Desktop, Server, and Core (for the internet of things devices and robots).
As you can see, only 2 sentences of Ubuntu's text summary are printed.
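The summary() method also accepts a chars argument if you would rather cap the summary by character count instead of by sentence count; a quick sketch:
# Roughly the first 100 characters of the summary
print(wikipedia.summary("Ubuntu", chars=100))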
However, keep in mind that wikipedia.summary
will raise a PageError if the page does not exist, or a DisambiguationError if the title is ambiguous. Let's see an example.
print(wikipedia.summary("key"))
The above code throws a DisambiguationError
since there are many articles that would match "key".
Output:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/wikipedia/util.py", line 28, in __call__
ret = self._cache[key] = self.fn(*args, **kwargs)
File "/Library/Python/2.7/site-packages/wikipedia/wikipedia.py", line 231, in summary
page_info = page(title, auto_suggest=auto_suggest, redirect=redirect)
File "/Library/Python/2.7/site-packages/wikipedia/wikipedia.py", line 276, in page
return WikipediaPage(title, redirect=redirect, preload=preload)
File "/Library/Python/2.7/site-packages/wikipedia/wikipedia.py", line 299, in __init__
self.__load(redirect=redirect, preload=preload)
File "/Library/Python/2.7/site-packages/wikipedia/wikipedia.py", line 393, in __load
raise DisambiguationError(getattr(self, 'title', page['title']), may_refer_to)
wikipedia.exceptions.DisambiguationError: "Key" may refer to:
Key (cryptography)
Key (lock)
Key (map)
...
If you wanted the summary of a cryptographic key, for example, you would have to enter the more specific title as follows:
print(wikipedia.summary("Key (cryptography)"))
With the more specific query we now get the correct summary in the output.
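If you would rather handle this situation in code, you can catch the exception and fall back to one of the titles it suggests. Picking the first option here is just for illustration:
import wikipedia

try:
    print(wikipedia.summary("key"))
except wikipedia.exceptions.DisambiguationError as e:
    # e.options holds the candidate titles listed in the error message
    print(wikipedia.summary(e.options[0]))
except wikipedia.exceptions.PageError:
    print("No Wikipedia page matched the query")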
Retrieving Full Wikipedia Page Data
In order to get the contents, categories, coordinates, images, links and other metadata of a Wikipedia page, we must first get the Wikipedia page object or the page ID for the page. To do this, the page()
method is used with the page title passed as an argument to the method.
Look at the following example:
wikipedia.page("Ubuntu")
This method call will return a WikipediaPage
object, which we'll explore more in the next few sections.
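For instance, you can store the returned object in a variable and inspect it; the pageid attribute holds the article's page ID:
page = wikipedia.page("Ubuntu")
print(page.title)     # Ubuntu
print(page.pageid)    # the article's page ID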
Extracting Metadata of a Page
To get the complete plain text content of a Wikipedia page (excluding images, tables, etc.), we can use the content
attribute of the page
object.
print(wikipedia.page("Python").content)
Output:
Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects.Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including procedural, object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released 2000, introduced features like list comprehensions and a garbage collection system capable of collecting reference cycles.
...
Similarly, we can get the URL of the page using the url
attribute:
print(wikipedia.page("Python").url)
Output:
https://en.wikipedia.org/wiki/Python_(programming_language)
We can get the URLs of external links on a Wikipedia page by using the references
property of the WikipediaPage
object.
print(wikipedia.page("Python").references)
Output:
[u'http://www.computerworld.com.au/index.php/id;66665771', u'http://neopythonic.blogspot.be/2009/04/tail-recursion-elimination.html', u'http://www.amk.ca/python/writing/gvr-interview', u'http://cdsweb.cern.ch/journal/CERNBulletin/2006/31/News%20Articles/974627?ln=en', u'http://www.2ality.com/2013/02/javascript-influences.html', ...]
The title
property of the WikipediaPage
object can be used to extract the title of the page.
print(wikipedia.page("Python").title)
Output:
Python (programming language)
Similarly, the categories
attribute can be used to get the list of categories of a Wikipedia page:
print(wikipedia.page("Python").categories)
Output:
['All articles containing potentially dated statements', 'Articles containing potentially dated statements from August 2016', 'Articles containing potentially dated statements from December 2018', 'Articles containing potentially dated statements from March 2018', 'Articles with Curlie links', 'Articles with short description', 'Class-based programming languages', 'Computational notebook', 'Computer science in the Netherlands', 'Cross-platform free software', 'Cross-platform software', 'Dutch inventions', 'Dynamically typed programming languages', 'Educational programming languages', 'Good articles', 'High-level programming languages', 'Information technology in the Netherlands', 'Object-oriented programming languages', 'Programming languages', 'Programming languages created in 1991', 'Python (programming language)', 'Scripting languages', 'Text-oriented programming languages', 'Use dmy dates from August 2015', 'Wikipedia articles with BNF identifiers', 'Wikipedia articles with GND identifiers', 'Wikipedia articles with LCCN identifiers', 'Wikipedia articles with SUDOC identifiers']
The links
attribute of the WikipediaPage
object can be used to get the list of titles of the pages linked to from the page.
print(wikipedia.page("Ubuntu").links)
Output:
[u'/e/ (operating system)', u'32-bit', u'4MLinux', u'ALT Linux', u'AMD64', u'AOL', u'APT (Debian)', u'ARM64', u'ARM architecture', u'ARM v7', ...]
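Note that each snippet above constructs the page object again with wikipedia.page("Python"). In practice, you would typically retrieve the page once and reuse the object for all of its attributes, for example:
page = wikipedia.page("Python")

print(page.title)           # Python (programming language)
print(page.url)             # URL of the article
print(len(page.links))      # number of linked page titles
print(page.categories[:5])  # first five categories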
Finding Pages Based on Coordinates
The geosearch()
method is used to do a Wikipedia geo search using latitude and longitude arguments supplied as float or decimal numbers to the method.
print(wikipedia.geosearch(37.787, -122.4))
Output:
['140 New Montgomery', 'New Montgomery Street', 'Cartoon Art Museum', 'San Francisco Bay Area Planning and Urban Research Association', 'Academy of Art University', 'The Montgomery (San Francisco)', 'California Historical Society', 'Palace Hotel Residential Tower', 'St. Regis Museum Tower', 'Museum of the African Diaspora']
As you can see, the method returns a list of titles of articles located near the coordinates provided.
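The geosearch() method also accepts optional results and radius parameters (the radius given in meters), so a call along the following lines should narrow the search:
# Limit the search to 3 titles within roughly 500 meters of the coordinates
print(wikipedia.geosearch(37.787, -122.4, results=3, radius=500))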
Similarly, the page()
method does not accept coordinates directly, but we can pass it one of the titles returned by geosearch() to get the full page for a location. For example:
page = wikipedia.page(wikipedia.geosearch(37.787, -122.4)[0])
print(page.title)
This retrieves the page for the first title in the geosearch results above, "140 New Montgomery" in this case.
Language Settings
You can customize the language of a Wikipedia page to your native language, provided the page exists in your native language. To do so, you can use the set_lang()
method. Each language has a standard prefix code, which is passed as an argument to the method. For example, let's get the first 2 sentences of the summary text of the "Ubuntu" wiki page in German.
wikipedia.set_lang("de")
print(wikipedia.summary("ubuntu", sentences=2))
Output:
Ubuntu (auch Ubuntu Linux) ist eine Linux-Distribution, die auf Debian basiert. Der Name Ubuntu bedeutet auf Zulu etwa „Menschlichkeit“ und bezeichnet eine afrikanische Philosophie.
You can check the list of currently supported languages, along with their prefix codes, as follows:
print(wikipedia.languages())
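languages() returns a dictionary mapping each prefix to its language name, so you can look up a prefix before switching; you can also switch back to English at any time with set_lang("en"). A quick sketch:
langs = wikipedia.languages()
print(len(langs))    # number of supported language prefixes
print(langs["de"])   # name of the language behind the "de" prefix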
Retrieving Images in a Wikipedia Page
The images
list of the WikipediaPage
object can be used to fetch images from a Wikipedia page. For instance, the following script returns the first image from Wikipedia's Ubuntu page:
print(wikipedia.page("ubuntu").images[0])
Output:
https://upload.wikimedia.org/wikipedia/commons/1/1d/Bildschirmfoto_zu_ubuntu_704.png
The above code returns the URL of the image present at index 0 in the Wikipedia page.
To see the image, you can copy and paste the above URL into your browser.
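Instead of opening the URL manually, you can also download the image from a script. Below is a minimal sketch using urllib from the Python 3 standard library; the User-Agent string and the local filename are just examples:
import urllib.request

image_url = wikipedia.page("ubuntu").images[0]

# Wikimedia may reject requests that do not send a descriptive User-Agent header
request = urllib.request.Request(image_url, headers={"User-Agent": "wiki-tutorial-example/1.0"})
with urllib.request.urlopen(request) as response, open("ubuntu_image.png", "wb") as f:
    f.write(response.read())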
Retrieving Full HTML Page Content
To get the full Wikipedia page in HTML format, you can use the following script:
print(wikipedia.page("Ubuntu").html())
Output:
<div class="mw-parser-output"><div role="note" class="hatnote navigation-not-searchable">For African philosophy, see <a href="/wiki/Ubuntu_philosophy" title="Ubuntu philosophy">Ubuntu philosophy</a>. For other uses, see <a href="/wiki/Ubuntu_(disambiguation)" class="mw-disambig" title="Ubuntu (disambiguation)">Ubuntu (disambiguation)</a>.</div>
<div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">Linux distribution based on Debian</div>
...
As seen in the output, the entire page is returned in HTML format. This can take a while to load if the page is large, so keep in mind that the call can raise an HTTPTimeoutError
if the request to the server times out.
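If you want to keep the HTML for offline use, you can combine the call with basic error handling and write it to a file; the filename here is just an example:
try:
    html = wikipedia.page("Ubuntu").html()
    with open("ubuntu.html", "w", encoding="utf-8") as f:
        f.write(html)
except wikipedia.exceptions.HTTPTimeoutError:
    print("The request to the Wikipedia server timed out")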
Conclusion
In this tutorial, we got a glimpse of using the Wikipedia API to extract data from the web. We saw how to retrieve a variety of information, such as a page's title, categories, links, and images, and how to find articles based on geolocation.