Difference between revisions of "Python: Mining Twitter for GamerGate: A How-To"

From OnnoWiki
Jump to navigation Jump to search
(Created page with " I’ve gotten interested in the #GamerGate “controversy”—I’m pretty completely persuaded that any talk about “ethics” is a façade for a lot of reactionary nonse...")
 
Line 8: Line 8:
 
Enter a name, description and web site URL for your application. You won’t need a callback URL.
 
Enter a name, description and web site URL for your application. You won’t need a callback URL.
  
FirefoxDeveloperEditionScreenSnapz068
+
FirefoxDeveloperEditionScreenSnapz068
  
 
Check “Yes, I agree” at the bottom of the Developer Agreement, and click the “Create your Twitter application” button.
 
Check “Yes, I agree” at the bottom of the Developer Agreement, and click the “Create your Twitter application” button.
  
FirefoxDeveloperEditionScreenSnapz069
+
FirefoxDeveloperEditionScreenSnapz069
  
 
Your application will be created. To use Tweepy to capture tweets, we’ll need the Consumer Key and Consumer Secret, and we’ll also need to set up an access token. Click on the “manage keys and access tokens” link next to your “Consumer Key (API Key)” in the “Application Settings” section.
 
Your application will be created. To use Tweepy to capture tweets, we’ll need the Consumer Key and Consumer Secret, and we’ll also need to set up an access token. Click on the “manage keys and access tokens” link next to your “Consumer Key (API Key)” in the “Application Settings” section.
  
FirefoxDeveloperEditionScreenSnapz070
+
FirefoxDeveloperEditionScreenSnapz070
  
 
This will take you to the “Keys and Access Tokens” tab. Note your “Consumer Key” and “Consumer Secret” (greyed out here).
 
This will take you to the “Keys and Access Tokens” tab. Note your “Consumer Key” and “Consumer Secret” (greyed out here).
  
FirefoxDeveloperEditionScreenSnapz071
+
FirefoxDeveloperEditionScreenSnapz071
  
 
In the “Your Access Token” section at the bottom of the page, click on “Create my access token”.
 
In the “Your Access Token” section at the bottom of the page, click on “Create my access token”.
  
FirefoxDeveloperEditionScreenSnapz072
+
FirefoxDeveloperEditionScreenSnapz072
  
 
An “Access Token” and an “Access Token Secret” — again, greyed out here — will be generated, you’ll need these as well.
 
An “Access Token” and an “Access Token Secret” — again, greyed out here — will be generated, you’ll need these as well.
  
FirefoxDeveloperEditionScreenSnapz073
+
FirefoxDeveloperEditionScreenSnapz073
Install the Python Prerequisites
+
 
 +
==Install the Python Prerequisites==
  
 
For this project, we’re going to need the Tweepy, Pandas, and matplotlib libraries
 
For this project, we’re going to need the Tweepy, Pandas, and matplotlib libraries
  
    pip install tweepy pandas matplotlib
+
sudo pip install tweepy pandas matplotlib
  
 
Here’s a simple-minded Python script using Tweepy to collect tweets mentioning “gamergate” from the Twitter streaming API:
 
Here’s a simple-minded Python script using Tweepy to collect tweets mentioning “gamergate” from the Twitter streaming API:
  
    from tweepy.streaming import StreamListener
+
from tweepy.streaming import StreamListener
    from tweepy import OAuthHandler
+
from tweepy import OAuthHandler
    from tweepy import Stream
+
from tweepy import Stream
   
+
   
    access_token = "YOUR ACCESS TOKEN GOES HERE"
+
access_token = "YOUR ACCESS TOKEN GOES HERE"
    access_token_secret = "YOUR ACCESS TOKEN SECRET GOES HERE"
+
access_token_secret = "YOUR ACCESS TOKEN SECRET GOES HERE"
    consumer_key ="YOUR CONSUMER KEY GOES HERE"
+
consumer_key ="YOUR CONSUMER KEY GOES HERE"
    consumer_secret = "YOUR CONSUMER KEY SECRET GOES HERE"
+
consumer_secret = "YOUR CONSUMER KEY SECRET GOES HERE"
   
+
   
    class StdOutListener(StreamListener):
+
class StdOutListener(StreamListener):
      
+
 
        def on_data(self, data):
+
     def on_data(self, data):
            print data
+
        print data
            return True
+
        return True
      
+
 
        def on_error(self, status):
+
     def on_error(self, status):
            print status
+
        print status
   
+
 
    if __name__ == '__main__':
+
if __name__ == '__main__':
      
+
 
        listener = StdOutListener()
+
     listener = StdOutListener()
        auth_handler = OAuthHandler(consumer_key, consumer_secret)
+
    auth_handler = OAuthHandler(consumer_key, consumer_secret)
        auth_handler.set_access_token(access_token, access_token_secret)
+
    auth_handler.set_access_token(access_token, access_token_secret)
        stream = Stream(auth_handler, listener)
+
    stream = Stream(auth_handler, listener)
      
+
 
        stream.filter(track=['gamergate'])
+
     stream.filter(track=['gamergate'])
  
UPDATE
+
==UPDATE==
  
 
The script, as it stands, times out on a read every once in a while, so there’s a minor improvement to be had here by embedding the collection in a while loop with a try and an except to keep it from crashing back to the shell prompt occasionally:
 
The script, as it stands, times out on a read every once in a while, so there’s a minor improvement to be had here by embedding the collection in a while loop with a try and an except to keep it from crashing back to the shell prompt occasionally:
  
        while True:
+
      while True:
            try:
+
        try:
                stream.filter(track=['gamergate'])
+
            stream.filter(track=['gamergate'])
            except:
+
        except:
                continue
+
            continue
  
 
All this script does is print out every tweet which is captured by Tweepy, in JSON format. If you run it, the output will look something like this — this is a single tweet in JSON notation:
 
All this script does is print out every tweet which is captured by Tweepy, in JSON format. If you run it, the output will look something like this — this is a single tweet in JSON notation:
  
    {u'contributors': None, u'truncated': False, u'text': u'RT @CommissarOfGG: Anti taking pride that nobody can tell the difference between them and someone pretending to be retarded.\n\n#GamerGate ht\u2026', 'retweet': True, u'in_reply_to_status_id': None, u'id': 584828601125773314, u'favorite_count': 0, u'source': u'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', u'retweeted': False, u'coordinates': None, u'timestamp_ms': u'1428268978755', u'entities': {u'symbols': [], u'media': [{u'source_status_id_str': u'584828243808661504', u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'source_status_id': 584828243808661504, u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [139, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}], u'hashtags': [{u'indices': [126, 136], u'text': u'GamerGate'}], u'user_mentions': [{u'id': 2729513808, u'indices': [3, 17], u'id_str': u'2729513808', u'screen_name': u'CommissarOfGG', u'name': u'Comrade Commissar'}], u'trends': [], u'urls': []}, u'in_reply_to_screen_name': None, u'id_str': u'584828601125773314', u'retweet_count': 0, u'in_reply_to_user_id': None, u'favorited': False, u'retweeted_status': {u'contributors': None, u'truncated': False, u'text': u'Anti taking pride that nobody can tell the difference between them and someone pretending to be retarded.\n\n#GamerGate http://t.co/CS3Kb2Bkcm', u'in_reply_to_status_id': None, u'id': 584828243808661504, u'favorite_count': 2, u'source': u'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', u'retweeted': False, u'coordinates': None, u'entities': {u'symbols': [], u'media': [{u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [118, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}], u'hashtags': [{u'indices': [107, 117], u'text': u'GamerGate'}], u'user_mentions': [], u'trends': [], u'urls': []}, u'in_reply_to_screen_name': None, u'id_str': u'584828243808661504', u'retweet_count': 4, u'in_reply_to_user_id': None, u'favorited': False, u'user': {u'follow_request_sent': None, u'profile_use_background_image': False, u'default_profile_image': False, u'id': 2729513808, u'verified': False, u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/533572934799876096/DYR05LI4_normal.png', u'profile_sidebar_fill_color': u'000000', u'profile_text_color': u'000000', u'followers_count': 2047, u'profile_sidebar_border_color': u'000000', u'id_str': u'2729513808', u'profile_background_color': u'000000', u'listed_count': 26, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'utc_offset': -14400, u'statuses_count': 4940, u'description': u'#GamerGate #OpSKYNET', u'friends_count': 1584, u'location': u'Moscow', u'profile_link_color': u'DD2E44', u'profile_image_url': u'http://pbs.twimg.com/profile_images/533572934799876096/DYR05LI4_normal.png', u'following': None, u'geo_enabled': False, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/2729513808/1407939361', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'name': u'Comrade Commissar', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 1980, u'screen_name': u'CommissarOfGG', u'notifications': None, u'url': u'http://www.facebook.com/commissarofgamergate', u'created_at': u'Wed Aug 13 14:10:24 +0000 2014', u'contributors_enabled': False, u'time_zone': u'Eastern Time (US & Canada)', u'protected': False, u'default_profile': False, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'en', u'created_at': u'Sun Apr 05 21:21:33 +0000 2015', u'filter_level': u'low', u'in_reply_to_status_id_str': None, u'place': None, u'extended_entities': {u'media': [{u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [118, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}]}}, u'user': {u'follow_request_sent': None, u'profile_use_background_image': False, u'default_profile_image': False, u'id': 2784597626, u'verified': False, u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/532401111823822848/KSIxqiLe_normal.jpeg', u'profile_sidebar_fill_color': u'000000', u'profile_text_color': u'000000', u'followers_count': 986, u'profile_sidebar_border_color': u'000000', u'id_str': u'2784597626', u'profile_background_color': u'000000', u'listed_count': 35, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'utc_offset': -18000, u'statuses_count': 25217, u'description': u"I wasn't born with enough middle fingers for perpetually outraged hipster douchebags compensating for their mediocrity with shelves of participation trophies.", u'friends_count': 785, u'location': u'Parts Unknown', u'profile_link_color': u'4A913C', u'profile_image_url': u'http://pbs.twimg.com/profile_images/532401111823822848/KSIxqiLe_normal.jpeg', u'following': None, u'geo_enabled': False, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/2784597626/1425335831', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'name': u'Unnecessary Robness', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 15912, u'screen_name': u'aDouScheiBler', u'notifications': None, u'url': None, u'created_at': u'Mon Sep 01 19:36:40 +0000 2014', u'contributors_enabled': False, u'time_zone': u'Central Time (US & Canada)', u'protected': False, u'default_profile': False, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'en', u'created_at': u'Sun Apr 05 21:22:58 +0000 2015', u'filter_level': u'low', u'in_reply_to_status_id_str': None, u'place': None, u'extended_entities': {u'media': [{u'source_status_id_str': u'584828243808661504', u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'source_status_id': 584828243808661504, u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [139, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}]}}
+
{u'contributors': None, u'truncated': False, u'text': u'RT @CommissarOfGG: Anti taking pride that nobody can tell the difference between them and someone pretending to be retarded.\n\n#GamerGate ht\u2026', 'retweet': True, u'in_reply_to_status_id': None, u'id': 584828601125773314, u'favorite_count': 0, u'source': u'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', u'retweeted': False, u'coordinates': None, u'timestamp_ms': u'1428268978755', u'entities': {u'symbols': [], u'media': [{u'source_status_id_str': u'584828243808661504', u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'source_status_id': 584828243808661504, u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [139, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}], u'hashtags': [{u'indices': [126, 136], u'text': u'GamerGate'}], u'user_mentions': [{u'id': 2729513808, u'indices': [3, 17], u'id_str': u'2729513808', u'screen_name': u'CommissarOfGG', u'name': u'Comrade Commissar'}], u'trends': [], u'urls': []}, u'in_reply_to_screen_name': None, u'id_str': u'584828601125773314', u'retweet_count': 0, u'in_reply_to_user_id': None, u'favorited': False, u'retweeted_status': {u'contributors': None, u'truncated': False, u'text': u'Anti taking pride that nobody can tell the difference between them and someone pretending to be retarded.\n\n#GamerGate http://t.co/CS3Kb2Bkcm', u'in_reply_to_status_id': None, u'id': 584828243808661504, u'favorite_count': 2, u'source': u'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', u'retweeted': False, u'coordinates': None, u'entities': {u'symbols': [], u'media': [{u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [118, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}], u'hashtags': [{u'indices': [107, 117], u'text': u'GamerGate'}], u'user_mentions': [], u'trends': [], u'urls': []}, u'in_reply_to_screen_name': None, u'id_str': u'584828243808661504', u'retweet_count': 4, u'in_reply_to_user_id': None, u'favorited': False, u'user': {u'follow_request_sent': None, u'profile_use_background_image': False, u'default_profile_image': False, u'id': 2729513808, u'verified': False, u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/533572934799876096/DYR05LI4_normal.png', u'profile_sidebar_fill_color': u'000000', u'profile_text_color': u'000000', u'followers_count': 2047, u'profile_sidebar_border_color': u'000000', u'id_str': u'2729513808', u'profile_background_color': u'000000', u'listed_count': 26, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'utc_offset': -14400, u'statuses_count': 4940, u'description': u'#GamerGate #OpSKYNET', u'friends_count': 1584, u'location': u'Moscow', u'profile_link_color': u'DD2E44', u'profile_image_url': u'http://pbs.twimg.com/profile_images/533572934799876096/DYR05LI4_normal.png', u'following': None, u'geo_enabled': False, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/2729513808/1407939361', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'name': u'Comrade Commissar', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 1980, u'screen_name': u'CommissarOfGG', u'notifications': None, u'url': u'http://www.facebook.com/commissarofgamergate', u'created_at': u'Wed Aug 13 14:10:24 +0000 2014', u'contributors_enabled': False, u'time_zone': u'Eastern Time (US & Canada)', u'protected': False, u'default_profile': False, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'en', u'created_at': u'Sun Apr 05 21:21:33 +0000 2015', u'filter_level': u'low', u'in_reply_to_status_id_str': None, u'place': None, u'extended_entities': {u'media': [{u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [118, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}]}}, u'user': {u'follow_request_sent': None, u'profile_use_background_image': False, u'default_profile_image': False, u'id': 2784597626, u'verified': False, u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/532401111823822848/KSIxqiLe_normal.jpeg', u'profile_sidebar_fill_color': u'000000', u'profile_text_color': u'000000', u'followers_count': 986, u'profile_sidebar_border_color': u'000000', u'id_str': u'2784597626', u'profile_background_color': u'000000', u'listed_count': 35, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'utc_offset': -18000, u'statuses_count': 25217, u'description': u"I wasn't born with enough middle fingers for perpetually outraged hipster douchebags compensating for their mediocrity with shelves of participation trophies.", u'friends_count': 785, u'location': u'Parts Unknown', u'profile_link_color': u'4A913C', u'profile_image_url': u'http://pbs.twimg.com/profile_images/532401111823822848/KSIxqiLe_normal.jpeg', u'following': None, u'geo_enabled': False, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/2784597626/1425335831', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'name': u'Unnecessary Robness', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 15912, u'screen_name': u'aDouScheiBler', u'notifications': None, u'url': None, u'created_at': u'Mon Sep 01 19:36:40 +0000 2014', u'contributors_enabled': False, u'time_zone': u'Central Time (US & Canada)', u'protected': False, u'default_profile': False, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'en', u'created_at': u'Sun Apr 05 21:22:58 +0000 2015', u'filter_level': u'low', u'in_reply_to_status_id_str': None, u'place': None, u'extended_entities': {u'media': [{u'source_status_id_str': u'584828243808661504', u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'source_status_id': 584828243808661504, u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [139, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}]}}
  
 
Set a terminal running the script above for as long as you like. I left mine going for 42 hours, and collected about 65000 tweets in a text file about 300MB long.
 
Set a terminal running the script above for as long as you like. I left mine going for 42 hours, and collected about 65000 tweets in a text file about 300MB long.
  
    python tweetminer.py >> gamergate.txt
+
python tweetminer.py >> gamergate.txt
  
 
When you’ve collected your data, here’s some Python to set up a sample pandas DataFrame containing information of interest: who tweeted, how many days old their account is, how many followers they have, who it was a retweet of (if it was one) and to whom it was a reply (if it was one).
 
When you’ve collected your data, here’s some Python to set up a sample pandas DataFrame containing information of interest: who tweeted, how many days old their account is, how many followers they have, who it was a retweet of (if it was one) and to whom it was a reply (if it was one).
Line 86: Line 87:
 
That should give you plenty of grist for analysis.
 
That should give you plenty of grist for analysis.
  
    import json
+
import json
    import pandas as pd
+
import pandas as pd
    import matplotlib.pyplot as plt
+
import matplotlib.pyplot as plt
    from time import gmtime, mktime, strptime
+
from time import gmtime, mktime, strptime
   
+
 
    tweets_data_path = 'gamergate.txt'
+
tweets_data_path = 'gamergate.txt'
   
+
 
    tweets_data = []
+
tweets_data = []
    tweets_file = open(tweets_data_path, "r")
+
tweets_file = open(tweets_data_path, "r")
    for line in tweets_file:
+
for line in tweets_file:
        try:
+
    try:
            tweet = json.loads(line)
+
        tweet = json.loads(line)
            tweets_data.append(tweet)
+
        tweets_data.append(tweet)
        except:
+
    except:
            continue
+
        continue
    #
+
#
    # Clean out limit messages, etc.
+
# Clean out limit messages, etc.
    #
+
#
    for tweet in tweets_data:
+
for tweet in tweets_data:
        try:
+
    try:
            user = tweet['user']
+
        user = tweet['user']
        except:
+
    except:
            tweets_data.remove(tweet)
+
        tweets_data.remove(tweet)
   
+
 
    print len(tweets_data)
+
print len(tweets_data)
   
+
 
    #
+
#
    # Pull the data we're interested in out of the Twitter data we captured
+
# Pull the data we're interested in out of the Twitter data we captured
    #
+
#
    rows_list = []
+
rows_list = []
    now=mktime(gmtime())
+
now=mktime(gmtime())
    for tweet in tweets_data:
+
for tweet in tweets_data:
        author = ""
+
    author = ""
        rtauthor = ""
+
    rtauthor = ""
    #
+
#
    # If it was a retweet, get both the original author and the retweeter
+
# If it was a retweet, get both the original author and the retweeter
    #
+
#
        try:
+
    try:
            author = tweet['user']['screen_name']
+
        author = tweet['user']['screen_name']
            rtauthor = tweet['retweeted_status']['user']['screen_name']
+
        rtauthor = tweet['retweeted_status']['user']['screen_name']
        except:
+
    except:
    #
+
#
    # Otherwise, just get the original author
+
# Otherwise, just get the original author
    #
+
#
            try:
+
        try:
                author = tweet['user']['screen_name']
+
            author = tweet['user']['screen_name']
            except:
+
        except:
                continue
+
            continue
      
+
 
        reply_to = ""
+
     reply_to = ""
        if (tweet['in_reply_to_screen_name'] != None):
+
    if (tweet['in_reply_to_screen_name'] != None):
            reply_to = tweet['in_reply_to_screen_name']
+
        reply_to = tweet['in_reply_to_screen_name']
       
 
        age = int(now - mktime(strptime(tweet['user']['created_at'], "%a %b %d %H:%M:%S +0000 %Y"))/(60*60*24))
 
        followers = tweet['user']['followers_count']
 
        dict1 = {}
 
        dict1.update({'author': author, 'retweet_of': rtauthor, 'reply_to': reply_to, 'age': age, 'followers': followers})
 
        rows_list.append(dict1)
 
 
      
 
      
    tweets = pd.DataFrame(rows_list)
+
    age = int(now - mktime(strptime(tweet['user']['created_at'], "%a %b %d %H:%M:%S +0000 %Y"))/(60*60*24))
 +
    followers = tweet['user']['followers_count']
 +
    dict1 = {}
 +
    dict1.update({'author': author, 'retweet_of': rtauthor, 'reply_to': reply_to, 'age': age, 'followers': followers})
 +
    rows_list.append(dict1)
 +
 
 +
tweets = pd.DataFrame(rows_list)
  
The resulting DataFrame will look something like this—note that rows 0-4 are retweets, and row 6 is a reply; “age” is days since the Twitter ID was created:
+
the resulting DataFrame will look something like this—note that rows 0-4 are retweets, and row 6 is a reply; “age” is days since the Twitter ID was created:
  
            age          author  followers    reply_to      retweet_of
+
        age          author  followers    reply_to      retweet_of
    0      137      Maskgamer64        428                  CultOfVivian
+
0      137      Maskgamer64        428                  CultOfVivian
    1      231  Smackfacemcgee      1304                  Daddy_Warpig
+
1      231  Smackfacemcgee      1304                  Daddy_Warpig
    2      2240      LenFirewood      1658                  RSG_VILLENA
+
2      2240      LenFirewood      1658                  RSG_VILLENA
    3      171    8bitsofsound        650                CommissarOfGG
+
3      171    8bitsofsound        650                CommissarOfGG
    4      102    devilstwosome          9                  atlasnodded
+
4      102    devilstwosome          9                  atlasnodded
    5        24      tophatdril        34                               
+
5        24      tophatdril        34                               
    6        11  TheRalphRetart        63    Dr_Louse                 
+
6        11  TheRalphRetart        63    Dr_Louse                 
    7...    ...              ...        ...          ...              ...
+
7...    ...              ...        ...          ...              ...
    64531    65    4EverPlayer2        614                        mombot
+
64531    65    4EverPlayer2        614                        mombot
    64532  143  EnwroughtDreams        222                thewtfmagazine
+
64532  143  EnwroughtDreams        222                thewtfmagazine
    64533  1996          _icze4r      22689                      dauthaz
+
64533  1996          _icze4r      22689                      dauthaz
    64534  1581  __DavidFlanagan      8315                  Spacekatgal
+
64534  1581  __DavidFlanagan      8315                  Spacekatgal
    64535  872        jtdg_b8z        621              GamingAndPandas
+
64535  872        jtdg_b8z        621              GamingAndPandas
    64536  2238        hanytimeh        914                thewtfmagazine
+
64536  2238        hanytimeh        914                thewtfmagazine
  
 
At this point you could easily find out the most-retweeted IDs in the DataFrame, for example:
 
At this point you could easily find out the most-retweeted IDs in the DataFrame, for example:
  
    In [146]: tweets['retweet_of'].value_counts()
+
In [146]: tweets['retweet_of'].value_counts()
    Out[146]:  
+
Out[146]:  
                      17974
+
                    17974
    Sargon_of_Akkad    1574
+
Sargon_of_Akkad    1574
    ItalyGG            1516
+
ItalyGG            1516
    TheRalphRetort      1064
+
TheRalphRetort      1064
    Blaugast            910
+
Blaugast            910
    mylittlepwnies3      899
+
mylittlepwnies3      899
    thewtfmagazine      823
+
thewtfmagazine      823
    Nero                721
+
Nero                721
    srhbutts            706
+
srhbutts            706
    Daddy_Warpig        705
+
Daddy_Warpig        705
    randomfox            627
+
randomfox            627
    atlasnodded          592
+
atlasnodded          592
    full_mcintosh        586
+
full_mcintosh        586
    whenindoubtdo        584
+
whenindoubtdo        584
    ToKnowIsToBe        569
+
ToKnowIsToBe        569
    ...
+
...
  
 
Check out the follow-on posting to see how to use NetworkX and Gephi to make visualizations of the data.
 
Check out the follow-on posting to see how to use NetworkX and Gephi to make visualizations of the data.

Revision as of 11:54, 29 January 2017


I’ve gotten interested in the #GamerGate “controversy”—I’m pretty completely persuaded that any talk about “ethics” is a façade for a lot of reactionary nonsense, as well as abundant harassment and misogny—and it occurred to me that it represented an interesting data set to mine using Python. This is a quick guide for how to get started, but it could be adapted to any effort to datamine Twitter. Setting Up to Connect to Twitter

First, you’re going to need to set up a Twitter app that you can use for authentication. You can do this at apps.twitter.com/app/new. You’ll need to have a valid Twitter account with an authenticated phone number.

Enter a name, description and web site URL for your application. You won’t need a callback URL.

FirefoxDeveloperEditionScreenSnapz068

Check “Yes, I agree” at the bottom of the Developer Agreement, and click the “Create your Twitter application” button.

FirefoxDeveloperEditionScreenSnapz069

Your application will be created. To use Tweepy to capture tweets, we’ll need the Consumer Key and Consumer Secret, and we’ll also need to set up an access token. Click on the “manage keys and access tokens” link next to your “Consumer Key (API Key)” in the “Application Settings” section.

FirefoxDeveloperEditionScreenSnapz070

This will take you to the “Keys and Access Tokens” tab. Note your “Consumer Key” and “Consumer Secret” (greyed out here).

FirefoxDeveloperEditionScreenSnapz071

In the “Your Access Token” section at the bottom of the page, click on “Create my access token”.

FirefoxDeveloperEditionScreenSnapz072

An “Access Token” and an “Access Token Secret” — again, greyed out here — will be generated, you’ll need these as well.

FirefoxDeveloperEditionScreenSnapz073

Install the Python Prerequisites

For this project, we’re going to need the Tweepy, Pandas, and matplotlib libraries

sudo pip install tweepy pandas matplotlib

Here’s a simple-minded Python script using Tweepy to collect tweets mentioning “gamergate” from the Twitter streaming API:

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
   
access_token = "YOUR ACCESS TOKEN GOES HERE"
access_token_secret = "YOUR ACCESS TOKEN SECRET GOES HERE"
consumer_key ="YOUR CONSUMER KEY GOES HERE"
consumer_secret = "YOUR CONSUMER KEY SECRET GOES HERE"
   
class StdOutListener(StreamListener):
 
    def on_data(self, data):
        print data
        return True
 
    def on_error(self, status):
        print status
 
if __name__ == '__main__':
 
    listener = StdOutListener()
    auth_handler = OAuthHandler(consumer_key, consumer_secret)
    auth_handler.set_access_token(access_token, access_token_secret)
    stream = Stream(auth_handler, listener)
 
    stream.filter(track=['gamergate'])

UPDATE

The script, as it stands, times out on a read every once in a while, so there’s a minor improvement to be had here by embedding the collection in a while loop with a try and an except to keep it from crashing back to the shell prompt occasionally:

     while True:
        try:
            stream.filter(track=['gamergate'])
        except:
            continue

All this script does is print out every tweet which is captured by Tweepy, in JSON format. If you run it, the output will look something like this — this is a single tweet in JSON notation:

{u'contributors': None, u'truncated': False, u'text': u'RT @CommissarOfGG: Anti taking pride that nobody can tell the difference between them and someone pretending to be retarded.\n\n#GamerGate ht\u2026', 'retweet': True, u'in_reply_to_status_id': None, u'id': 584828601125773314, u'favorite_count': 0, u'source': u'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', u'retweeted': False, u'coordinates': None, u'timestamp_ms': u'1428268978755', u'entities': {u'symbols': [], u'media': [{u'source_status_id_str': u'584828243808661504', u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'source_status_id': 584828243808661504, u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [139, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}], u'hashtags': [{u'indices': [126, 136], u'text': u'GamerGate'}], u'user_mentions': [{u'id': 2729513808, u'indices': [3, 17], u'id_str': u'2729513808', u'screen_name': u'CommissarOfGG', u'name': u'Comrade Commissar'}], u'trends': [], u'urls': []}, u'in_reply_to_screen_name': None, u'id_str': u'584828601125773314', u'retweet_count': 0, u'in_reply_to_user_id': None, u'favorited': False, u'retweeted_status': {u'contributors': None, u'truncated': False, u'text': u'Anti taking pride that nobody can tell the difference between them and someone pretending to be retarded.\n\n#GamerGate http://t.co/CS3Kb2Bkcm', u'in_reply_to_status_id': None, u'id': 584828243808661504, u'favorite_count': 2, u'source': u'<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>', u'retweeted': False, u'coordinates': None, u'entities': {u'symbols': [], u'media': [{u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [118, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}], u'hashtags': [{u'indices': [107, 117], u'text': u'GamerGate'}], u'user_mentions': [], u'trends': [], u'urls': []}, u'in_reply_to_screen_name': None, u'id_str': u'584828243808661504', u'retweet_count': 4, u'in_reply_to_user_id': None, u'favorited': False, u'user': {u'follow_request_sent': None, u'profile_use_background_image': False, u'default_profile_image': False, u'id': 2729513808, u'verified': False, u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/533572934799876096/DYR05LI4_normal.png', u'profile_sidebar_fill_color': u'000000', u'profile_text_color': u'000000', u'followers_count': 2047, u'profile_sidebar_border_color': u'000000', u'id_str': u'2729513808', u'profile_background_color': u'000000', u'listed_count': 26, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'utc_offset': -14400, u'statuses_count': 4940, u'description': u'#GamerGate #OpSKYNET', u'friends_count': 1584, u'location': u'Moscow', u'profile_link_color': u'DD2E44', u'profile_image_url': u'http://pbs.twimg.com/profile_images/533572934799876096/DYR05LI4_normal.png', u'following': None, u'geo_enabled': False, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/2729513808/1407939361', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'name': u'Comrade Commissar', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 1980, u'screen_name': u'CommissarOfGG', u'notifications': None, u'url': u'http://www.facebook.com/commissarofgamergate', u'created_at': u'Wed Aug 13 14:10:24 +0000 2014', u'contributors_enabled': False, u'time_zone': u'Eastern Time (US & Canada)', u'protected': False, u'default_profile': False, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'en', u'created_at': u'Sun Apr 05 21:21:33 +0000 2015', u'filter_level': u'low', u'in_reply_to_status_id_str': None, u'place': None, u'extended_entities': {u'media': [{u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [118, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}]}}, u'user': {u'follow_request_sent': None, u'profile_use_background_image': False, u'default_profile_image': False, u'id': 2784597626, u'verified': False, u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/532401111823822848/KSIxqiLe_normal.jpeg', u'profile_sidebar_fill_color': u'000000', u'profile_text_color': u'000000', u'followers_count': 986, u'profile_sidebar_border_color': u'000000', u'id_str': u'2784597626', u'profile_background_color': u'000000', u'listed_count': 35, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'utc_offset': -18000, u'statuses_count': 25217, u'description': u"I wasn't born with enough middle fingers for perpetually outraged hipster douchebags compensating for their mediocrity with shelves of participation trophies.", u'friends_count': 785, u'location': u'Parts Unknown', u'profile_link_color': u'4A913C', u'profile_image_url': u'http://pbs.twimg.com/profile_images/532401111823822848/KSIxqiLe_normal.jpeg', u'following': None, u'geo_enabled': False, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/2784597626/1425335831', u'profile_background_image_url': u'http://abs.twimg.com/images/themes/theme1/bg.png', u'name': u'Unnecessary Robness', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 15912, u'screen_name': u'aDouScheiBler', u'notifications': None, u'url': None, u'created_at': u'Mon Sep 01 19:36:40 +0000 2014', u'contributors_enabled': False, u'time_zone': u'Central Time (US & Canada)', u'protected': False, u'default_profile': False, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'en', u'created_at': u'Sun Apr 05 21:22:58 +0000 2015', u'filter_level': u'low', u'in_reply_to_status_id_str': None, u'place': None, u'extended_entities': {u'media': [{u'source_status_id_str': u'584828243808661504', u'expanded_url': u'http://twitter.com/CommissarOfGG/status/584828243808661504/photo/1', u'display_url': u'pic.twitter.com/CS3Kb2Bkcm', u'url': u'http://t.co/CS3Kb2Bkcm', u'media_url_https': u'https://pbs.twimg.com/media/CB26L3HWAAIVSKF.png', u'source_status_id': 584828243808661504, u'id_str': u'584828239564111874', u'sizes': {u'small': {u'h': 351, u'resize': u'fit', u'w': 340}, u'large': {u'h': 607, u'resize': u'fit', u'w': 587}, u'medium': {u'h': 607, u'resize': u'fit', u'w': 587}, u'thumb': {u'h': 150, u'resize': u'crop', u'w': 150}}, u'indices': [139, 140], u'type': u'photo', u'id': 584828239564111874, u'media_url': u'http://pbs.twimg.com/media/CB26L3HWAAIVSKF.png'}]}}

Set a terminal running the script above for as long as you like. I left mine going for 42 hours, and collected about 65000 tweets in a text file about 300MB long.

python tweetminer.py >> gamergate.txt

When you’ve collected your data, here’s some Python to set up a sample pandas DataFrame containing information of interest: who tweeted, how many days old their account is, how many followers they have, who it was a retweet of (if it was one) and to whom it was a reply (if it was one).

That should give you plenty of grist for analysis.

import json
import pandas as pd
import matplotlib.pyplot as plt
from time import gmtime, mktime, strptime
 
tweets_data_path = 'gamergate.txt'
 
tweets_data = []
tweets_file = open(tweets_data_path, "r")
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
    except:
        continue
#
# Clean out limit messages, etc.
#
for tweet in tweets_data:
    try:
        user = tweet['user']
    except:
        tweets_data.remove(tweet)
 
print len(tweets_data)
 
#
# Pull the data we're interested in out of the Twitter data we captured
#
rows_list = []
now=mktime(gmtime())
for tweet in tweets_data:
    author = ""
    rtauthor = ""
#
# If it was a retweet, get both the original author and the retweeter
#
    try:
        author = tweet['user']['screen_name']
        rtauthor = tweet['retweeted_status']['user']['screen_name']
    except:
#
# Otherwise, just get the original author
#
        try:
            author = tweet['user']['screen_name']
        except:
            continue
 
    reply_to = ""
    if (tweet['in_reply_to_screen_name'] != None):
        reply_to = tweet['in_reply_to_screen_name']
    
    age = int(now - mktime(strptime(tweet['user']['created_at'], "%a %b %d %H:%M:%S +0000 %Y"))/(60*60*24))
    followers = tweet['user']['followers_count']
    dict1 = {}
    dict1.update({'author': author, 'retweet_of': rtauthor, 'reply_to': reply_to, 'age': age, 'followers': followers})
    rows_list.append(dict1)
 
tweets = pd.DataFrame(rows_list)

the resulting DataFrame will look something like this—note that rows 0-4 are retweets, and row 6 is a reply; “age” is days since the Twitter ID was created:

        age           author  followers     reply_to       retweet_of
0       137      Maskgamer64        428                  CultOfVivian
1       231   Smackfacemcgee       1304                  Daddy_Warpig
2      2240      LenFirewood       1658                   RSG_VILLENA
3       171     8bitsofsound        650                 CommissarOfGG
4       102    devilstwosome          9                   atlasnodded
5        24       tophatdril         34                              
6        11   TheRalphRetart         63     Dr_Louse                 
7...    ...              ...        ...          ...              ...
64531    65     4EverPlayer2        614                        mombot
64532   143  EnwroughtDreams        222                thewtfmagazine
64533  1996          _icze4r      22689                       dauthaz
64534  1581  __DavidFlanagan       8315                   Spacekatgal
64535   872         jtdg_b8z        621               GamingAndPandas
64536  2238        hanytimeh        914                thewtfmagazine

At this point you could easily find out the most-retweeted IDs in the DataFrame, for example:

In [146]: tweets['retweet_of'].value_counts()
Out[146]: 
                   17974
Sargon_of_Akkad     1574
ItalyGG             1516
TheRalphRetort      1064
Blaugast             910
mylittlepwnies3      899
thewtfmagazine       823
Nero                 721
srhbutts             706
Daddy_Warpig         705
randomfox            627
atlasnodded          592
full_mcintosh        586
whenindoubtdo        584
ToKnowIsToBe         569
...

Check out the follow-on posting to see how to use NetworkX and Gephi to make visualizations of the data.


Referensi