This is my second (of five) post on using Python to process Twitter data.

Check out my all the posts in the series.

In this post I was going to look at two particular aspects. The first is the converting of Tweets to Pandas. This will allow you to do additional analysis of tweets. The second part of this post looks at how to setup and process streaming of tweets. The first part was longer than expected so I'm going to hold the second part for a later post.

 

Step 6 – Convert Tweets to Pandas

In my previous blog post I show you how to connect and download tweets. Sometimes you may want to convert these tweets into a structured format to allow you to do further analysis. A very popular way of analysing data is to us Pandas. Using Pandas to store your data is like having data stored in a spreadsheet, with columns and rows. There are also lots of analytic functions available to use with Pandas.

In my previous blog post I showed how you could extract tweets using the Twitter API and to do selective pulls using the Tweepy Python library. Now that we have these tweet how do I go about converting them into Pandas for additional analysis? But before we do that we need to understand a bit more a bout the structure of the Tweet object that is returned by the Twitter API. We can examine the structure of the User object and the Tweet object using the following commands.


dir(user)
['__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__getstate__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_api',
'_json',
'contributors_enabled',
'created_at',
'default_profile',
'default_profile_image',
'description',
'entities',
'favourites_count',
'follow',
'follow_request_sent',
'followers',
'followers_count',
'followers_ids',
'following',
'friends',
'friends_count',
'geo_enabled',
'has_extended_profile',
'id',
'id_str',
'is_translation_enabled',
'is_translator',
'lang',
'listed_count',
'lists',
'lists_memberships',
'lists_subscriptions',
'location',
'name',
'needs_phone_verification',
'notifications',
'parse',
'parse_list',
'profile_background_color',
'profile_background_image_url',
'profile_background_image_url_https',
'profile_background_tile',
'profile_banner_url',
'profile_image_url',
'profile_image_url_https',
'profile_link_color',
'profile_location',
'profile_sidebar_border_color',
'profile_sidebar_fill_color',
'profile_text_color',
'profile_use_background_image',
'protected',
'screen_name',
'status',
'statuses_count',
'suspended',
'time_zone',
'timeline',
'translator_type',
'unfollow',
'url',
'utc_offset',
'verified']

 

dir(tweets)

['__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__getstate__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_api',
'_json',
'author',
'contributors',
'coordinates',
'created_at',
'destroy',
'entities',
'favorite',
'favorite_count',
'favorited',
'geo',
'id',
'id_str',
'in_reply_to_screen_name',
'in_reply_to_status_id',
'in_reply_to_status_id_str',
'in_reply_to_user_id',
'in_reply_to_user_id_str',
'is_quote_status',
'lang',
'parse',
'parse_list',
'place',
'retweet',
'retweet_count',
'retweeted',
'retweets',
'source',
'source_url',
'text',
'truncated',
'user']

We can see all this additional information to construct what data we really want to extract.

The following example illustrates the searching for tweets containing a certain word and then extracting a subset of the metadata associated with those tweets.

oracleace_tweets = tweepy.Cursor(api.search,q="oracleace").items()
tweets_data = []
for t in oracleace_tweets:
tweets_data.append((t.author.screen_name,
t.place,
t.lang,
t.created_at,
t.favorite_count,
t.retweet_count,
t.text))

We print the contents of the tweet_data object.

print(tweets_data)

[('jpraulji', None, 'en', datetime.datetime(2018, 5, 28, 13, 41, 59), 0, 5, 'RT @tanwanichandan: Hello Friends,nnODevC Yatra is schedule now for all seven location.nThis time we have four parallel tracks i.e. Databas…'), ('opal_EPM', None, 'en', datetime.datetime(2018, 5, 28, 13, 15, 30), 0, 6, "RT @odtug: Oracle #ACE Director @CaryMillsap is presenting 2 #Kscope18 sessions you don't want to miss! n- Hands-On Lab: How to Write Bette…"), ('msjsr', None, 'en', datetime.datetime(2018, 5, 28, 12, 32, 8), 0, 5, 'RT @tanwanichandan: Hello Friends,nnODevC Yatra is schedule now for all seven location.nThis time we have four parallel tracks i.e. Databas…'), ('cmvithlani', None, 'en', datetime.datetime(2018, 5, 28, 12, 24, 10), 0, 5, 'RT @tanwanichandan: Hel ……

I've only shown a subset of the tweets_data above.

Now we want to convert the tweets_data object to a panda object. This is a relative trivial task but an important steps is to define the columns names otherwise you will end up with columns with labels 0,1,2,3…

Import pandas as pd
tweets_pd = pd.DataFrame(tweets_data,
columns=['screen_name', 'place', 'lang', 'created_at', 'fav_count', 'retweet_count', 'text'])

Now we have a panda structure that we can use for additional analysis. This can be easily examined as follows.

tweets_pd
screen_name 	place 	lang 	created_at 	fav_count 	retweet_count 	text
0 jpraulji None en 2018-05-28 13:41:59 0 5 RT @tanwanichandan: Hello Friends,nnODevC Ya...
1 opal_EPM None en 2018-05-28 13:15:30 0 6 RT @odtug: Oracle #ACE Director @CaryMillsap i...
2 msjsr None en 2018-05-28 12:32:08 0 5 RT @tanwanichandan: Hello Friends,nnODevC Ya...

Now we can use all the analytic features of pandas to do some analytics. For example, in the following we do a could of the number of times a language has been used in our tweets data set/panda, and then plot it.

import matplotlib.pyplot as plt
tweets_by_lang = tweets_pd['lang'].value_counts()
print(tweets_by_lang)

lang_plot = tweets_by_lang.plot(kind='bar')
lang_plot.set_xlabel("Languages")
lang_plot.set_ylabel("Num. Tweets")
lang_plot.set_title("Language Frequency")

en 182
fr 7
es 2
ca 2
et 1
in 1

Pandas1

Similarly we can analyse the number of times a twitter screen name has been used, and limited to the 20 most commonly occurring screen names.

tweets_by_screen_name = tweets_pd['screen_name'].value_counts()
#print(tweets_by_screen_name)

top_twitter_screen_name = tweets_by_screen_name[:20]
print(top_twitter_screen_name)

name_plot = top_twitter_screen_name.plot(kind='bar')
name_plot.set_xlabel("Users")
name_plot.set_ylabel("Num. Tweets")
name_plot.set_title("Frequency Twitter users using oracleace")

oraesque 7
DBoriented 5
Addidici 5
odtug 5
RonEkins 5
opal_EPM 5
fritshoogland 4
svilmune 4
FranckPachot 4
hariprasathdba 3
oraclemagazine 3
ritan2000 3
yvrk1973 3
...

Pandas2

There you go, this post has shown you how to take twitter objects, convert them in pandas and then use the analytics features of pandas to aggregate the data and create some plots.

Check out the other blog posts in this series of Twitter Analytics using Python.

About the Author

Brendan Tierney

Brendan Tierney, Oracle ACE Director, is an independent consultant and lectures on Data Mining and Advanced Databases in the Dublin Institute of Technology in Ireland. He has 22+ years of extensive experience working in the areas of Data Mining, Data Warehousing, Data Architecture and Database Design. Brendan has worked on projects in Ireland, UK, Belgium and USA and is the editor of the UKOUG Oracle Scene magazine and deputy chair of the OUG Ireland BI SIG. Brendan is a regular speaker at conferences across Europe and the USA and has written technical articles for OTN, Oracle Scene, IOUG SELECT Journal and ODTUG Technical Journal.

Start the discussion at forums.toadworld.com