Python: Mining Twitter for GamerGate: A How-To
I’ve gotten interested in the #GamerGate “controversy” (I’m pretty thoroughly persuaded that any talk about “ethics” is a façade for a lot of reactionary nonsense, as well as abundant harassment and misogyny), and it occurred to me that it represented an interesting data set to mine using Python. This is a quick guide to getting started, but it could be adapted to any effort to data-mine Twitter.
Setting Up to Connect to Twitter
First, you’re going to need to set up a Twitter app that you can use for authentication. You can do this on Twitter’s application management site. You’ll need a valid Twitter account with a verified phone number.
Enter a name, description and web site URL for your application. You won’t need a callback URL.
Check “Yes, I agree” at the bottom of the Developer Agreement, and click the “Create your Twitter application” button.
Your application will be created. To use Tweepy to capture tweets, we’ll need the Consumer Key and Consumer Secret, and we’ll also need to set up an access token. Click on the “manage keys and access tokens” link next to your “Consumer Key (API Key)” in the “Application Settings” section.
This will take you to the “Keys and Access Tokens” tab. Note your “Consumer Key” and “Consumer Secret” (greyed out here).
In the “Your Access Token” section at the bottom of the page, click on “Create my access token”.
An “Access Token” and an “Access Token Secret” (again, greyed out here) will be generated; you’ll need these as well.
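The four values collected above are secrets, so one option is to keep them out of your source files and read them from environment variables instead. The following is a minimal sketch of that idea, not part of the original script; the variable names (`TWITTER_CONSUMER_KEY` and so on) and the `load_credentials` helper are hypothetical and can be renamed freely.

```python
import os

# Hypothetical environment variable names; choose whatever names you like.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(environ=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    if environ is None:
        environ = os.environ
    # Treat unset and empty variables the same way: both are unusable.
    missing = [var for var in ENV_NAMES.values() if not environ.get(var)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(sorted(missing)))
    return {name: environ[var] for name, var in ENV_NAMES.items()}
```

Calling `load_credentials()` with no arguments reads from the real environment; passing a plain dict makes the helper easy to try out without touching your shell settings.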
Install the Python Prerequisites
For this project, we’re going to need the Tweepy, pandas, and matplotlib libraries:
sudo pip install tweepy pandas matplotlib
Here’s a simple-minded Python script using Tweepy to collect tweets mentioning “gamergate” from the Twitter streaming API:
# Uses the Tweepy 3.x streaming API (StreamListener).
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

# Credentials from the "Keys and Access Tokens" tab of your Twitter app.
access_token = "YOUR ACCESS TOKEN GOES HERE"
access_token_secret = "YOUR ACCESS TOKEN SECRET GOES HERE"
consumer_key = "YOUR CONSUMER KEY GOES HERE"
consumer_secret = "YOUR CONSUMER KEY SECRET GOES HERE"


class StdOutListener(StreamListener):
    def on_data(self, data):
        # Each message arrives as a raw JSON string; print it as-is.
        print(data)
        return True  # keep the stream open

    def on_error(self, status):
        # Print the HTTP status code reported by Twitter.
        print(status)


if __name__ == '__main__':
    listener = StdOutListener()
    auth_handler = OAuthHandler(consumer_key, consumer_secret)
    auth_handler.set_access_token(access_token, access_token_secret)
    stream = Stream(auth_handler, listener)
    # Capture every tweet that mentions "gamergate".
    stream.filter(track=['gamergate'])
The script, as it stands, times out on a read every once in a while, so there’s a minor improvement to be had here by embedding the collection in a while loop with a try and an except to keep it from crashing back to the shell prompt occasionally:
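while True:
    try:
        stream.filter(track=['gamergate'])
    except Exception:
        # Reconnect after a read timeout or other error.
        # Catching Exception rather than everything means Ctrl-C still exits.
        continue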
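A loop like the one above reconnects immediately, which can hammer the API if the failure persists. A gentler variation is to wait a little longer after each failure. The sketch below is an assumption about how you might do that, not something from the original script: `run_with_backoff` and its parameters are hypothetical names, and `connect` stands for any zero-argument callable such as `lambda: stream.filter(track=['gamergate'])`. Unlike the loop above, it stops once `connect` returns normally.

```python
import time


def run_with_backoff(connect, max_wait=300, sleep=time.sleep, max_attempts=None):
    """Call connect() until it returns normally, waiting longer after each failure.

    Returns the number of failed attempts that preceded the successful call.
    """
    wait = 1
    failures = 0
    while True:
        try:
            connect()
            return failures
        except Exception as exc:  # KeyboardInterrupt (Ctrl-C) is not caught here
            failures += 1
            if max_attempts is not None and failures >= max_attempts:
                raise  # give up and let the caller see the last error
            print("Stream error: %r; retrying in %d seconds" % (exc, wait))
            sleep(wait)
            # Double the delay each time, up to max_wait seconds.
            wait = min(wait * 2, max_wait)
```

The `sleep` parameter exists only so the delay can be replaced in a quick experiment; in normal use you would leave it at its default.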
All this script does is print every tweet captured by Tweepy, in JSON format. If you run it, the output will look something like this; this is a single tweet, shown as the Python dictionary you get after parsing the JSON:
{u'contributors': None, u'truncated': False, u'text': u'RT @CommissarOfGG: Anti taking pride that nobody can tell the difference between them and someone pretending to be retarded.\n\n#GamerGate ht\u2026', 'retweet': True, u'in_reply_to_status_id': None, u'id': 584828601125773314, u'favorite_count': 0, u'source': u'<a href="" rel="nofollow">Twitter Web Client</a>', u'retweeted': False, u'coordinates': None, u'timestamp_ms': u'1428268978755', u'entities': {u'symbols': [], u'media': [{u'source_status_id_str': u'584828243808661504', u'expanded_url': u'', u'display_url': u'', u'url': u'', u'media_url_https': u'', u'source_status_id': 584828243808661504, u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [139, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u''}], u'hashtags': [{u'indices': [126, 136], u'text': u'GamerGate'}], u'user_mentions': [{u'id': 2729513808, u'indices': [3, 17], u'id_str': u'2729513808', u'screen_name': u'CommissarOfGG', u'name': u'Comrade Commissar'}], u'trends': [], u'urls': []}, u'in_reply_to_screen_name': None, u'id_str': u'584828601125773314', u'retweet_count': 0, u'in_reply_to_user_id': None, u'favorited': False, u'retweeted_status': {u'contributors': None, u'truncated': False, u'text': u'Anti taking pride that nobody can tell the difference between them and someone pretending to be retarded.\n\n#GamerGate', u'in_reply_to_status_id': None, u'id': 584828243808661504, u'favorite_count': 2, u'source': u'<a href="" rel="nofollow">Twitter Web Client</a>', u'retweeted': False, u'coordinates': None, u'entities': {u'symbols': [], u'media': [{u'expanded_url': u'', u'display_url': u'', u'url': u'', u'media_url_https': u'', u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 
340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [118, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u''}], u'hashtags': [{u'indices': [107, 117], u'text': u'GamerGate'}], u'user_mentions': [], u'trends': [], u'urls': []}, u'in_reply_to_screen_name': None, u'id_str': u'584828243808661504', u'retweet_count': 4, u'in_reply_to_user_id': None, u'favorited': False, u'user': {u'follow_request_sent': None, u'profile_use_background_image': False, u'default_profile_image': False, u'id': 2729513808, u'verified': False, u'profile_image_url_https': u'', u'profile_sidebar_fill_color': u'000000', u'profile_text_color': u'000000', u'followers_count': 2047, u'profile_sidebar_border_color': u'000000', u'id_str': u'2729513808', u'profile_background_color': u'000000', u'listed_count': 26, u'profile_background_image_url_https': u'', u'utc_offset': -14400, u'statuses_count': 4940, u'description': u'#GamerGate #OpSKYNET', u'friends_count': 1584, u'location': u'Moscow', u'profile_link_color': u'DD2E44', u'profile_image_url': u'', u'following': None, u'geo_enabled': False, u'profile_banner_url': u'', u'profile_background_image_url': u'', u'name': u'Comrade Commissar', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 1980, u'screen_name': u'CommissarOfGG', u'notifications': None, u'url': u'', u'created_at': u'Wed Aug 13 14:10:24 +0000 2014', u'contributors_enabled': False, u'time_zone': u'Eastern Time (US & Canada)', u'protected': False, u'default_profile': False, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'en', u'created_at': u'Sun Apr 05 21:21:33 +0000 2015', u'filter_level': u'low', u'in_reply_to_status_id_str': None, u'place': None, u'extended_entities': {u'media': [{u'expanded_url': u'', u'display_url': u'', u'url': u'', u'media_url_https': 
u'', u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [118, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u''}]}}, u'user': {u'follow_request_sent': None, u'profile_use_background_image': False, u'default_profile_image': False, u'id': 2784597626, u'verified': False, u'profile_image_url_https': u'', u'profile_sidebar_fill_color': u'000000', u'profile_text_color': u'000000', u'followers_count': 986, u'profile_sidebar_border_color': u'000000', u'id_str': u'2784597626', u'profile_background_color': u'000000', u'listed_count': 35, u'profile_background_image_url_https': u'', u'utc_offset': -18000, u'statuses_count': 25217, u'description': u"I wasn't born with enough middle fingers for perpetually outraged hipster douchebags compensating for their mediocrity with shelves of participation trophies.", u'friends_count': 785, u'location': u'Parts Unknown', u'profile_link_color': u'4A913C', u'profile_image_url': u'', u'following': None, u'geo_enabled': False, u'profile_banner_url': u'', u'profile_background_image_url': u'', u'name': u'Unnecessary Robness', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 15912, u'screen_name': u'aDouScheiBler', u'notifications': None, u'url': None, u'created_at': u'Mon Sep 01 19:36:40 +0000 2014', u'contributors_enabled': False, u'time_zone': u'Central Time (US & Canada)', u'protected': False, u'default_profile': False, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'en', u'created_at': u'Sun Apr 05 21:22:58 +0000 2015', u'filter_level': u'low', u'in_reply_to_status_id_str': None, u'place': None, u'extended_entities': {u'media': [{u'source_status_id_str': u'584828243808661504', u'expanded_url': u'', u'display_url': 
u'', u'url': u'', u'media_url_https': u'', u'source_status_id': 584828243808661504, u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [139, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u''}]}}
Set a terminal running the script above for as long as you like. I left mine going for 42 hours, and collected about 65000 tweets in a text file about 300MB long.
python >> gamergate.txt
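Before building anything on top of the capture file, it can be worth checking what is actually in it, since the stream interleaves tweets with other messages (rate-limit notices, for instance) and an interrupted run can leave a partial line. The following is a rough sketch under that assumption; `summarize_capture` is a hypothetical helper, and it treats any JSON object with a `user` key as a tweet.

```python
import json


def summarize_capture(lines):
    """Count tweets, other JSON messages, and unparseable lines in a capture."""
    counts = {"tweets": 0, "other_messages": 0, "unparseable": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines entirely
        try:
            record = json.loads(line)
        except ValueError:
            counts["unparseable"] += 1
            continue
        # Tweets carry a 'user' object; limit notices and the like do not.
        if isinstance(record, dict) and "user" in record:
            counts["tweets"] += 1
        else:
            counts["other_messages"] += 1
    return counts
```

To run it against the capture, open the file and pass the file object in, for example `with open('gamergate.txt') as f: print(summarize_capture(f))`.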
When you’ve collected your data, here’s some Python to set up a sample pandas DataFrame containing information of interest: who tweeted, how many days old their account is, how many followers they have, who it was a retweet of (if it was one) and to whom it was a reply (if it was one).
That should give you plenty of grist for analysis.
import json
import pandas as pd
import matplotlib.pyplot as plt
from time import gmtime, mktime, strptime

tweets_data_path = 'gamergate.txt'

# Load one JSON document per line, skipping blank or malformed lines.
tweets_data = []
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        try:
            tweets_data.append(json.loads(line))
        except ValueError:
            continue

# Clean out limit messages, etc.: keep only records that have a 'user'.
# (Removing items from a list while iterating over it skips elements,
# so build a new list instead.)
tweets_data = [tweet for tweet in tweets_data
               if isinstance(tweet, dict) and 'user' in tweet]
print(len(tweets_data))

# Pull the data we're interested in out of the Twitter data we captured.
rows_list = []
now = mktime(gmtime())
seconds_per_day = 60 * 60 * 24
for tweet in tweets_data:
    # Every row needs an author; skip anything without one.
    try:
        author = tweet['user']['screen_name']
    except (KeyError, TypeError):
        continue

    # If it was a retweet, record the original author as well.
    try:
        rtauthor = tweet['retweeted_status']['user']['screen_name']
    except (KeyError, TypeError):
        rtauthor = ""

    # If it was a reply, record who it was a reply to.
    reply_to = tweet.get('in_reply_to_screen_name') or ""

    # Account age in days: take the difference first, then divide.
    created = mktime(strptime(tweet['user']['created_at'],
                              "%a %b %d %H:%M:%S +0000 %Y"))
    age = int((now - created) / seconds_per_day)
    followers = tweet['user']['followers_count']

    rows_list.append({'author': author, 'retweet_of': rtauthor,
                      'reply_to': reply_to, 'age': age,
                      'followers': followers})

tweets = pd.DataFrame(rows_list)
The resulting DataFrame will look something like this. Note that rows 0-4 are retweets, and row 6 is a reply; “age” is the number of days since the Twitter account was created:
age author followers reply_to retweet_of
0 137 Maskgamer64 428 CultOfVivian
1 231 Smackfacemcgee 1304 Daddy_Warpig
2 2240 LenFirewood 1658 RSG_VILLENA
3 171 8bitsofsound 650 CommissarOfGG
4 102 devilstwosome 9 atlasnodded
5 24 tophatdril 34
6 11 TheRalphRetart 63 Dr_Louse
7... ... ... ... ... ...
64531 65 4EverPlayer2 614 mombot
64532 143 EnwroughtDreams 222 thewtfmagazine
64533 1996 _icze4r 22689 dauthaz
64534 1581 __DavidFlanagan 8315 Spacekatgal
64535 872 jtdg_b8z 621 GamingAndPandas
64536 2238 hanytimeh 914 thewtfmagazine
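The “age” column is easy to get subtly wrong, so here is that one calculation on its own. This is a sketch rather than the script's exact code: it uses `calendar.timegm` to treat the timestamp as UTC (the script above uses `mktime` instead), and `account_age_days` is a hypothetical helper name. The important detail is that the difference in seconds is taken first and only then divided by the number of seconds in a day.

```python
import calendar
import time

# Format of the created_at field in the captured tweets (always UTC).
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"
SECONDS_PER_DAY = 60 * 60 * 24


def account_age_days(created_at, now=None):
    """Whole days between a Twitter created_at timestamp (UTC) and now."""
    created = calendar.timegm(time.strptime(created_at, TWITTER_TIME_FORMAT))
    if now is None:
        now = time.time()
    # Subtract first, then divide: (now - created) / seconds per day.
    return int((now - created) // SECONDS_PER_DAY)
```

Using the two timestamps from the sample tweet shown earlier, an account created at `Wed Aug 13 14:10:24 +0000 2014` was 235 days old when the retweet was posted at `Sun Apr 05 21:22:58 +0000 2015`.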
At this point you could easily find out the most-retweeted IDs in the DataFrame, for example:
In [146]: tweets['retweet_of'].value_counts()
Out[146]:
                   17974
Sargon_of_Akkad     1574
ItalyGG             1516
TheRalphRetort      1064
Blaugast             910
mylittlepwnies3      899
thewtfmagazine       823
Nero                 721
srhbutts             706
Daddy_Warpig         705
randomfox            627
atlasnodded          592
full_mcintosh        586
whenindoubtdo        584
ToKnowIsToBe         569
...
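The first entry in that output, 17974 with no name next to it, is the count of rows whose `retweet_of` is the empty string, that is, tweets that were not retweets at all. If you would rather leave those out, you can filter them before counting. The sketch below shows the idea on a toy DataFrame with made-up screen names, shaped like the one built above; the variable names are illustrative only.

```python
import pandas as pd

# Toy data with made-up screen names, shaped like the DataFrame built above.
tweets = pd.DataFrame({
    "author": ["a", "b", "c", "d", "e", "f"],
    "retweet_of": ["", "alice", "alice", "bob", "", "alice"],
    "reply_to": ["", "", "", "", "carol", ""],
})

# Keep only rows that really are retweets, then count the original authors.
retweets = tweets[tweets["retweet_of"] != ""]
top_retweeted = retweets["retweet_of"].value_counts()

# Share of all captured tweets that are retweets.
retweet_share = len(retweets) / float(len(tweets))

print(top_retweeted)
print("Retweet share: %.2f" % retweet_share)
```

The same filter-then-count pattern works for the `reply_to` column if you want the most replied-to accounts instead.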
Check out the follow-on posting to see how to use NetworkX and Gephi to make visualizations of the data.