Introduction
One thing Python developers surely enjoy is the huge number of resources developed by its large community. Python-built application programming interfaces (APIs) are commonplace for websites: it's hard to imagine a popular web service that hasn't had a Python API library created to simplify access to its services. A few ideas of such APIs for some of the most popular web services can be found here. In fact, "Python wrapper" is a more accurate term than "Python API", because a web API usually provides a general application programming interface, while a language-specific library "wraps" it into easy-to-use functions. Either way, we'll use both terms interchangeably throughout this article.
In this blog post we concentrate on the Twitter API, show how to set up your credentials with Twitter, and compare a few Python wrappers based on community engagement. Then we show a few examples of using the Twitter API for searching tweets and for creating a stream of real-time tweets on a particular subject. Finally, we'll explore the saved data.
An Overview of the Twitter API
There are many APIs on the Twitter platform that software developers can engage with, up to the point of building fully automated systems that interact with Twitter. While this capability can benefit companies by drawing insights from Twitter data, it's also suitable for smaller-scale projects, research, and fun. Here are a few of the most notable APIs provided by Twitter:
- Tweets: searching, posting, filtering, engagement, streaming etc.
- Ads: campaign and audience management, analytics.
- Direct messages (still in Beta): sending and receiving, direct replies, welcome messages etc.
- Accounts and users (Beta): account management, user interactions.
- Media: uploading and accessing photos, videos and animated GIFs.
- Trends: trending topics in a given location.
- Geo: information about known places or places near a location.
There are many more possibilities with the Twitter APIs, which are not included in this list. Twitter is also constantly expanding its range of services by adding new APIs from time to time, and updating existing ones.
Getting Credentials
Before using the Twitter API, you first need a Twitter account and a set of credentials. The process of getting credentials may change over time, but currently it is as follows:
- Visit the Application Management page at https://apps.twitter.com/, and sign in with your Twitter account
- Click on the "Create New App" button, fill in the details and agree the Terms of Service
- Navigate to "Keys and Access Tokens" section and take a note of your Consumer Key and Secret
- In the same section click on "Create my access token" button
- Take note of your Access Token and Access Token Secret
And that's all. The consumer key/secret is used to authenticate the app that is using the Twitter API, while the access token/secret authenticates the user. All of these parameters should be treated as passwords and should not be included in your code as plain text. One suitable way is to store them in a JSON file called "twitter_credentials.json" and load them from your code whenever needed.
import json
# Enter your keys/secrets as strings in the following fields
credentials = {}
credentials['CONSUMER_KEY'] = ...
credentials['CONSUMER_SECRET'] = ...
credentials['ACCESS_TOKEN'] = ...
credentials['ACCESS_SECRET'] = ...
# Save the credentials object to file
with open("twitter_credentials.json", "w") as file:
json.dump(credentials, file)
Python Wrappers
Python is one of the programming languages with the largest number of developed wrappers for the Twitter API, so it's hard to compare them unless you've used each of them for some time. A good way to choose the right tool is to dig into their documentation and look at the possibilities they offer and how they fit the specifics of your app. In this part, we'll compare the various API wrappers using the engagement of the Python community in their GitHub projects. A few suitable metrics for comparison are: the number of contributors, number of received stars, number of watchers, and the library's maturity measured as the timespan since its first release.
_Table 1_: Python libraries for the Twitter API, ordered by number of received stars.

| Library | # contributors | # stars | # watchers | Maturity |
|---|---|---|---|---|
| tweepy | 135 | 4732 | 249 | ~ 8.5 years |
| Python Twitter Tools | 60 | 2057 | 158 | ~ 7 years |
| python-twitter | 109 | 2009 | 148 | ~ 5 years |
| twython | 73 | 1461 | 100 | NA |
| TwitterAPI | 15 | 424 | 49 | ~ 4.5 years |
| TwitterSearch | 8 | 241 | 29 | ~ 4.5 years |
The table above lists some of the most popular Python libraries for the Twitter API. Now let's use one of them to search through tweets, get some data, and explore.
Twython Examples
We've selected the twython library because of its diverse features aligned with the different Twitter APIs, its maturity - although there's no record of when its first release was published, there is information that version 2.6.0 appeared around 5 years ago - and its support for streaming tweets. In our first example we'll use the Search API to search for tweets containing the string "learn python", and later on we'll show a more realistic example using Twitter's Streaming API.
Search API
In this example we'll create a query for the Search API with the search keyword "learn python", which returns the most popular public tweets from the past 7 days. Note that since our keyword is composed of two words, "learn" and "python", both need to appear in the text of the tweet, though not necessarily as a continuous phrase. First, let's install the library. The easiest way is using `pip`, but other options are also listed in the installation docs.
$ pip install twython
In the next step, we'll import the Twython class, instantiate an object of it, and create our search query. We'll use only four arguments in the query: `q`, `result_type`, `count` and `lang`, respectively the search keyword, result type, count, and language of the results. Twitter also defines other arguments to fine-tune the search query, which can be found here; we sketch a couple of them right after the basic query below.
# Import the Twython class
from twython import Twython
import json
# Load credentials from json file
with open("twitter_credentials.json", "r") as file:
creds = json.load(file)
# Instantiate an object
python_tweets = Twython(creds['CONSUMER_KEY'], creds['CONSUMER_SECRET'])
# Create our query
query = {'q': 'learn python',
         'result_type': 'popular',
         'count': 10,
         'lang': 'en',
         }
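As an aside, here is a hedged sketch of what a more fine-tuned query could look like, using a couple of the additional Search API parameters and query operators mentioned above. The values below are purely illustrative and are not used in the rest of this example.

# Illustrative only: a narrower query using extra Search API options
fine_tuned_query = {'q': 'learn python -filter:retweets',  # query operator that skips retweets
                    'result_type': 'recent',               # most recent instead of most popular
                    'count': 50,
                    'lang': 'en',
                    'until': '2018-01-10',                  # only tweets created before this date
                    }
# It would be passed to the same search method, e.g. python_tweets.search(**fine_tuned_query)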
Finally we can use our Twython object to call the `search` method, which returns a dictionary of `search_metadata` and `statuses` - the queried results. We'll only look at the `statuses` part, and save a portion of all the information in a `pandas` dataframe, to present it in a table.
import pandas as pd
# Search tweets
dict_ = {'user': [], 'date': [], 'text': [], 'favorite_count': []}
for status in python_tweets.search(**query)['statuses']:
    dict_['user'].append(status['user']['screen_name'])
    dict_['date'].append(status['created_at'])
    dict_['text'].append(status['text'])
    dict_['favorite_count'].append(status['favorite_count'])
# Structure data in a pandas DataFrame for easier manipulation
df = pd.DataFrame(dict_)
df.sort_values(by='favorite_count', inplace=True, ascending=False)
df.head(5)
| | date | favorite_count | text | user |
|---|---|---|---|---|
| 1 | Fri Jan 12 21:50:03 +0000 2018 | 137 | 2017 was the Year of Python. We set out to lea... | Codecademy |
| 3 | Mon Jan 08 23:01:40 +0000 2018 | 137 | Step-by-Step Guide to Learn #Python for #DataS... | KirkDBorne |
| 4 | Mon Jan 08 11:13:02 +0000 2018 | 109 | Resetter is a new tool written in Python and p... | linuxfoundation |
| 8 | Sat Jan 06 16:30:06 +0000 2018 | 96 | We're proud to announce that this week we have... | DataCamp |
| 2 | Sun Jan 07 19:00:36 +0000 2018 | 94 | Learn programming in Python with the Python by... | humble |
So we got some interesting tweets. Note that these are the most popular tweets containing the words "learn" and "python" from the past 7 days. To explore data further back in history, you'll need to purchase the Premium or Enterprise plan of the Search API.
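Before moving on, note that the `search_metadata` part of the response can also be inspected. Here is a minimal sketch, reusing the `python_tweets` object and `query` from above; the keys shown are ones the standard Search API typically returns, and we use `.get()` in case any of them is missing.

# Run the query again and peek at the metadata Twitter returns alongside the statuses
response = python_tweets.search(**query)
metadata = response['search_metadata']
print(metadata.get('query'))         # the query string Twitter actually executed
print(metadata.get('count'))         # how many results were requested
print(metadata.get('next_results'))  # parameters for the next page of results, if any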
Streaming API
While the previous example showed a one-off search, a more interesting case would be to collect a stream of tweets. This is done using the Twitter Streaming API, and Twython has an easy way to do it through the `TwythonStreamer` class. We'll need to define a class `MyStreamer` that inherits `TwythonStreamer` and then override the `on_success` and `on_error` methods, as follows.
The `on_success` method is called automatically when Twitter sends us data, while the `on_error` method is called whenever a problem occurs with the API (most commonly due to constraints of the Twitter APIs). The added method `save_to_csv` is a useful way to store tweets to a file.

Similar to the previous example, we won't save all of the data in a tweet, but only the fields we are interested in, such as: hashtags used, user name, user's location, and the text of the tweet itself. There's a lot of interesting information in a tweet, so feel free to experiment with it. Note that we'll store the tweet location as present on the user's profile, which might not correspond to the current or real location of the user sending the tweet. This is because only a small portion of Twitter users provide their current location - usually in the `coordinates` key of the tweet data.
from twython import TwythonStreamer
import csv
# Filter out unwanted data
def process_tweet(tweet):
    d = {}
    d['hashtags'] = [hashtag['text'] for hashtag in tweet['entities']['hashtags']]
    d['text'] = tweet['text']
    d['user'] = tweet['user']['screen_name']
    d['user_loc'] = tweet['user']['location']
    return d

# Create a class that inherits TwythonStreamer
class MyStreamer(TwythonStreamer):

    # Received data
    def on_success(self, data):
        # Only collect tweets in English
        if data['lang'] == 'en':
            tweet_data = process_tweet(data)
            self.save_to_csv(tweet_data)

    # Problem with the API
    def on_error(self, status_code, data):
        print(status_code, data)
        self.disconnect()

    # Save each tweet to csv file
    def save_to_csv(self, tweet):
        with open(r'saved_tweets.csv', 'a') as file:
            writer = csv.writer(file)
            writer.writerow(list(tweet.values()))
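As mentioned above, tweets also carry a `coordinates` field that holds the exact location whenever a user chooses to share it. Here is a purely illustrative sketch of how `process_tweet` could additionally capture it; the `geo_coords` key name is our own choice.

# Illustrative extension: also keep the exact coordinates, when the user shared them
def process_tweet_with_coords(tweet):
    d = process_tweet(tweet)
    # 'coordinates' is None for most tweets; when present it's a GeoJSON point
    # whose 'coordinates' list is [longitude, latitude]
    if tweet.get('coordinates'):
        d['geo_coords'] = tweet['coordinates']['coordinates']
    else:
        d['geo_coords'] = None
    return d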
The next thing to do is to instantiate an object of the `MyStreamer` class with our credentials passed as arguments, and then use the `filter` method to only collect tweets we're interested in. We'll create our filter with the `track` argument, which provides the filter keywords - in our case "python". Besides the `track` argument, there are more possibilities to fine-tune your filter, listed in the basic streaming parameters, such as collecting tweets from selected users, languages, locations, etc.; we'll sketch such a fine-tuned filter right after the basic example below. The paid versions of the Streaming API would provide many more filtering options.
# Instantiate from our streaming class
stream = MyStreamer(creds['CONSUMER_KEY'], creds['CONSUMER_SECRET'],
                    creds['ACCESS_TOKEN'], creds['ACCESS_SECRET'])
# Start the stream
stream.statuses.filter(track='python')
With the code above, we collected data for around 10,000 tweets containing the keyword "python". In the next part, we'll do a brief analysis of the included hashtags and user locations.
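Before moving on to the analysis, here is the fine-tuned filter sketch promised above. The parameter values are illustrative only; `track` and `language` are among the parameters documented for the statuses/filter streaming endpoint, and Twython forwards the keyword arguments to it.

# Illustrative only: track several phrases and restrict the stream to English tweets
stream.statuses.filter(track='python,django,flask', language='en')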
Brief Data Analysis
The Twitter API is a powerful thing, very suitable for researching public opinion, market analysis, quick access to news, and other use-cases your creativity can support. A common thing to do, after you've carefully collected your tweets, is to analyze the data, where sentiment analysis plays a crucial role in systematically extracting subjective information from text. However, sentiment analysis is a huge field to be addressed in a small portion of a blog post, so in this part we'll only do some basic data analysis regarding the locations and hashtags used by people tweeting "python".
Please note that the point of these examples is just to show what Twitter API data could be used for - our small sample of tweets should not be used to infer conclusions, because it is not a good representative of the whole population of tweets, nor were its collection times independent and uniform.
First let's import our data from the "saved_tweets.csv" file and print out a few rows.
import pandas as pd
tweets = pd.read_csv("saved_tweets.csv")
tweets.head()
| | hashtags | text | user | location |
|---|---|---|---|---|
| 0 | ['IBM'] | RT @freschesolution: Join us TOMORROW with @OC... | rbrownpa | NaN |
| 1 | [] | pylocus 1.0.1: Localization Package https://t.... | pypi_updates2 | NaN |
| 2 | [] | humilis-push-processor 0.0.10: Humilis push ev... | pypi_updates2 | NaN |
| 3 | ['Python', 'python', 'postgresql'] | #Python Digest is out! https://t.co/LEmyR3yDMh... | horstwilmes | Zürich |
| 4 | ['NeuralNetworks', 'Python', 'KDN'] | RT @kdnuggets: A Beginners Guide to #NeuralNet... | giodegas | L'Aquila, ITALY |
What are the most common hashtags that go with our keyword "python"? Since all the data in our DataFrame are represented as strings, including the brackets in the `hashtags` column, to get a list of hashtags we'll need to go from a list of strings, to a list of lists, to a list of hashtags. Then we'll use the `Counter` class to count the hashtag entries in our list, and print a sorted list of the 20 most common hashtags.
from collections import Counter
import ast
tweets = pd.read_csv("saved_tweets.csv")
# Extract hashtags and put them in a list
list_hashtag_strings = [entry for entry in tweets.hashtags]
list_hashtag_lists = ast.literal_eval(','.join(list_hashtag_strings))
hashtag_list = [ht.lower() for list_ in list_hashtag_lists for ht in list_]
# Count most common hashtags
counter_hashtags = Counter(hashtag_list)
counter_hashtags.most_common(20)
[('python', 1337),
('datascience', 218),
('bigdata', 140),
('machinelearning', 128),
('deeplearning', 107),
('django', 93),
('java', 76),
('ai', 76),
('coding', 68),
('100daysofcode', 65),
('javascript', 64),
('iot', 58),
('rstats', 52),
('business', 52),
('tech', 48),
('ruby', 45),
('programming', 43),
('cybersecurity', 43),
('angularjs', 41),
('pythonbot_', 41)]
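For a quick visual impression, these counts could also be plotted. Here is a minimal sketch using `matplotlib` (not used elsewhere in this post), skipping the "python" hashtag itself since it dominates the counts as our tracking keyword.

import matplotlib.pyplot as plt

# Illustrative only: bar chart of the 10 most common accompanying hashtags
top_hashtags = [(tag, n) for tag, n in counter_hashtags.most_common(11) if tag != 'python'][:10]
tags, counts = zip(*top_hashtags)

plt.figure(figsize=(10, 5))
plt.bar(tags, counts)
plt.xticks(rotation=45, ha='right')
plt.ylabel('Number of occurrences')
plt.title('Hashtags most often accompanying "python"')
plt.tight_layout()
plt.show()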
Next, we can use the user location to answer the question: which areas of the world tweet most about "python"? For this step, we'll use the `geocode` method of the geopy library, which returns the coordinates of a given input location. To visualize a world heat map of tweets, we'll use the gmplot library. A reminder: our small dataset is not a real representative of the world.
from geopy.geocoders import Nominatim
import gmplot
geolocator = Nominatim()
# Go through all tweets and add locations to 'coordinates' dictionary
coordinates = {'latitude': [], 'longitude': []}
for count, user_loc in enumerate(tweets.location):
    try:
        location = geolocator.geocode(user_loc)

        # If coordinates are found for location
        if location:
            coordinates['latitude'].append(location.latitude)
            coordinates['longitude'].append(location.longitude)

    # If too many connection requests
    except:
        pass
# Instantiate and center a GoogleMapPlotter object to show our map
gmap = gmplot.GoogleMapPlotter(30, 0, 3)
# Insert points on the map passing a list of latitudes and longitudes
gmap.heatmap(coordinates['latitude'], coordinates['longitude'], radius=20)
# Save the map to html file
gmap.draw("python_heatmap.html")
The code above produced the heat map in the following figure, showing higher activity in "python" tweets in the US, the UK, Nigeria and India. One downside of the described approach is that we didn't do any data cleaning; there turned out to be many machine-generated tweets coming from a single location, or the same tweet produced from multiple locations. Of course these samples should be discarded, to get a more realistic picture of the geographical distribution of humans tweeting "python". A second improvement would simply be to collect more data over longer, uninterrupted periods.
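As a minimal sketch of such cleaning (assuming the same `tweets` DataFrame as above; the threshold is an arbitrary illustrative choice), one could drop repeated tweet texts and unusually prolific accounts before geocoding:

# Illustrative only: very rough cleaning before geocoding
# 1) Drop exact duplicates of the same tweet text
cleaned = tweets.drop_duplicates(subset='text')

# 2) Drop users posting suspiciously many tweets in our sample (likely bots);
#    the threshold of 20 is an arbitrary choice for illustration
user_counts = cleaned['user'].value_counts()
suspected_bots = user_counts[user_counts > 20].index
cleaned = cleaned[~cleaned['user'].isin(suspected_bots)]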
Conclusions
In this blog post we presented a pretty modest part of the Twitter API. Overall, Twitter is a very powerful tool for understanding public opinion, doing research and market analysis, and therefore its APIs are a great way for businesses to create automated tools for drawing insights related to their scope of work. Not only businesses, but individuals could also use the APIs for building creative apps.
We also listed a few of the most popular Python wrappers, but it's important to note that different wrappers implement different parts of the Twitter APIs, so you should choose a wrapper according to its purpose. The two examples we showed with the Search and Streaming APIs briefly described the process of collecting tweets, and some of the insights they could yield. Feel free to create some yourself!
References
- "Data Science from Scratch" by Joel Grus (book)
- Twitter API - documentation
- geopy library - PyPI web page
- gmplot library - GitHub project