Python: Scraping Twitter Again
As long-time readers will remember, I have been collecting Twitter data with the R twitteR library. Unfortunately that workflow has proven to be buggy, mostly for reasons having to do with authentication. So I decided to learn Python and migrate my project to the Twython module. Overall, I’ve been very impressed by the language and the module. I haven’t had any dependency problems and authentication works pretty smoothly. On the other hand, it requires a lot more manual coding to get around rate limits than twitteR does, and that is a big part of what my scripts are doing.
I’ll let you follow the standard instructions for installing Python 3 and the Twython module before showing you my workflow. Note that all of my code was run on Python 3.5.1 and OS X 10.9. You want Python 3, not Python 2, because tweets are UTF-8 and Python 3 handles Unicode much more gracefully. If you’re a Mac person, OS X ships with 2.7, so you will need to install Python 3 yourself. For the same reason, if you plan to bring the tweets into Stata, use Stata 14, which can handle Unicode.
One tip on installation: pip tends to default to Python 2.7, so use this syntax in bash to make sure Twython installs under Python 3.
python3 -m pip install twython
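If you want to double-check which interpreter the module landed under, a quick interactive test (a minimal sketch, nothing Twython-specific beyond the import) is:
#confirm you are in Python 3 and that twython imports cleanly
import sys
print(sys.version_info) #major should be 3
import twython
print(twython.__version__) #assumes the installed release exposes __version__, which recent versions do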
I use three .py scripts: one to write Twython queries to disk, one to query information about a set of Twitter users, and one to query tweets from a particular user. Note that the query scripts can be slow to execute, which is deliberate, as otherwise you end up hitting rate limits (Twitter’s API allows fifteen queries per fifteen minutes). I call the two query scripts from bash with argument passing. The disk-writing script is called by the query scripts and doesn’t require user intervention, though you do need to be sure Python knows where to find it (usually by keeping it in the current working directory). Note that you will need to adjust things like file paths and authentication keys. (When accessing Twitter through scripts instead of your phone, you don’t use usernames and passwords but keys and secrets; you can generate the keys by registering an application.)

tw2csv.py
I am discussing this script first even though it is not directly called by the user because it is the most natural place to discuss Twython’s somewhat complicated data structure. A Twython data object is a list of dictionaries. (I adapted this script for exporting lists of dictionaries). You can get a pretty good feel for what these objects look like by using type() and the pprint module. In this sample code, I explore a data object created by infoquery.py.
type(users) #shows that users is a list
type(users[0]) #shows that each element of users is a dictionary
#the objects are a bunch of brackets and commas, use pprint to make a dictionary (sub)object human-readable with whitespace
import pprint
pp=pprint.PrettyPrinter(indent=4)
pp.pprint(users[0])
pp.pprint(users[0]['status']) #you can also zoom in on daughter objects, in this case the user's most recent tweet object. Note that this tweet is a sub-object within the user object, but may itself have sub-objects
As you can see if you use the pprint command, some of the dictionary values are themselves dictionaries. It’s a real fleas upon fleas kind of deal. In the datacollection.py script I pull some of these objects out and delete others for the “clean” version of the data. Also note that tw2csv defaults to writing these second-level fields as one first-level field with escaped internal delimiters. So if you open a file in Excel, some of the cells will be really long and have a lot of commas in them. While Excel automatically parses the escaped commas correctly, Stata assumes you don’t want them escaped unless you use this command:
import delimited "foo.csv", delimiter(comma) bindquote(strict) varnames(1) asdouble encoding(UTF-8) clear
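If you want to see the escaping behavior for yourself, here is a toy sketch (made-up data, not real Twython output) showing how csv.DictWriter turns a sub-dictionary into a single quoted cell:
#toy example: a nested dictionary becomes one quoted cell with escaped internal commas
import csv
import io
row = {'screen_name': 'sociologicalsci', 'status': {'id': 123, 'text': 'hello, world'}}
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['screen_name', 'status'])
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
#screen_name,status
#sociologicalsci,"{'id': 123, 'text': 'hello, world'}"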
Another tricky thing about Twython data is that there can be a variable number of dictionary entries (i.e., some fields are missing from some cases). For instance, if a tweet is not a retweet it will be missing the “retweeted_status” dictionary within a dictionary. This was the biggest problem with reusing the Stack Overflow code and required adapting another piece of code for getting the union set of dictionary keys. Note this will give you all the keys used in any entry from the current query, but not those found only in past or future queries. Likewise, Python doesn’t keep the fields in any consistent order. For these two reasons, I hard-coded tw2csv to overwrite rather than append, and built a timestamp into the query scripts. If you tweak the code to append, you will run into problems with the fields not lining up.
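As a toy illustration of why the union of keys matters: DictWriter fills in blanks for rows that are missing a field, but it throws an error for rows that have a field not in the header, so the header has to be built from every key that shows up anywhere in the data.
#toy illustration: build the header from the union of keys so rows missing a field just get blank cells
import csv
import io
import functools
rows = [{'id': 1, 'text': 'plain tweet'},
        {'id': 2, 'text': 'a retweet', 'retweeted_status': 'stuff'}]
allkey = functools.reduce(lambda x, y: x.union(y.keys()), rows, set())
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=sorted(allkey)) #sorted() just pins down a column order for the example
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())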
Anyway, here’s the actual tw2csv code.
#tw2csv.py
def tw2csv(twdata,csvfile_out):
    import csv
    import functools
    allkey = functools.reduce(lambda x, y: x.union(y.keys()), twdata, set())
    with open(csvfile_out,'wt') as output_file:
        dict_writer=csv.DictWriter(output_file,allkey)
        dict_writer.writeheader()
        dict_writer.writerows(twdata)
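If you want to kick the tires on the module interactively before wiring it into the query scripts, a minimal test with toy data (the file name is just an example) looks like this:
#toy test of tw2csv with made-up data; writes test_out.csv to the working directory
import tw2csv
testdata = [{'id': 1, 'text': 'first'},
            {'id': 2, 'text': 'second', 'lang': 'en'}]
tw2csv.tw2csv(testdata, 'test_out.csv')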
infoquery.py

One of the queries I like to run is getting basic information like date created, description, and follower counts. Basically, all the stuff that shows up on a user’s profile page. The Twitter API allows you to do this for 100 users simultaneously and I do this with the infoquery.py script. It assumes that your list of target users is stored in a text file, but there’s a commented-out line that lets you hard-code the users, which may be easier if you’re doing it interactively. Likewise, it’s designed to query 100 users at a time, but there’s a commented-out line that’s much simpler in interactive use if you’re only querying a few users.
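Stripped of the authentication boilerplate and file output, the batching logic in the full script below boils down to this sketch (it assumes handles has already been read in and twitter is an authenticated Twython object):
#core batching idea: peel off 100 handles at a time and rest between calls
from math import ceil
import time
users=[]
cycles=ceil(len(handles)/100)
for i in range(0, cycles):
    h=handles[0:100]
    del handles[0:100]
    users.extend(twitter.lookup_user(screen_name=h))
    time.sleep(90) #stay well under the 15 calls per 15 minutes limit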
You can call it from the command line, and it takes the location of the input file as an argument. I hard-coded the location of the output. Note that the “3” in the command-line call matters, since OS X and similar operating systems default to Python 2.7 if you just call python.
python3 infoquery.py list.txt
And here’s the actual script. Note that I’ve taken out my key and secret. You’ll have to register as an “application” and generate these yourself.
#infoquery.py
from twython import Twython
import sys
import time
from math import ceil
import tw2csv #custom module

parentpath='/Users/rossman/Documents/twittertrucks/infoquery_py'
targetlist=sys.argv[1] #text file listing feeds to query, one per line. full path ok.
today = time.strftime("%Y%m%d")
csvfilepath_info=parentpath+'/info_'+today+'.csv'

#authenticate
APP_KEY='' #your 25 alphanumeric character key goes here
APP_SECRET='' #your 50 alphanumeric character secret goes here
twitter=Twython(APP_KEY,APP_SECRET,oauth_version=2) #simple authentication object
ACCESS_TOKEN=twitter.obtain_access_token()
twitter=Twython(APP_KEY,access_token=ACCESS_TOKEN)

handles = [line.rstrip() for line in open(targetlist)] #read from text file given as cmd-line argument
#handles=("gabrielrossman,sociologicalsci,twitter") #alternately, hard-code the list of handles

#API allows 100 users per query. Cycle through, 100 at a time
#users = twitter.lookup_user(screen_name=handles) #this one line is all you need if len(handles) < 100
users=[] #initialize data object
hl=len(handles)
cycles=ceil(hl/100) #unlike a get_user_timeline query, there is no need to cap total cycles
for i in range(0, cycles): ## iterate through the handles, 100 at a time
    h=handles[0:100]
    del handles[0:100]
    incremental = twitter.lookup_user(screen_name=h)
    users.extend(incremental)
    time.sleep(90) ## 90 second rest between api calls. The API allows 15 calls per 15 minutes so this is conservative
tw2csv.tw2csv(users,csvfilepath_info)
datacollection.py

This last script collects tweets for a specified user. The tricky thing is that the Twitter API lets you query a user’s last 3,200 tweets, but only 200 at a time, so you have to cycle over them. Moreover, you have to build in a delay so you don’t get rate-limited. I adapted the script from this code but made some tweaks.
One change I made was to only scrape as deep as necessary for any given user. For instance, as of this writing, @SociologicalSci has 1,192 tweets, so the script cycles six times; but if you run it in a few weeks, @SociologicalSci will have over 1,200 and it will take at least seven cycles. This change makes the script run faster but ultimately gets you to the same place.
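Concretely, the number of cycles is just the tweet count divided by 200, rounded up, and capped at the API’s 3,200-tweet ceiling. For the example above:
#worked example of the cycle arithmetic
from math import ceil
tweetsum=1192 #@SociologicalSci's tweet count as of this writing
cycles=ceil(tweetsum/200) #6
if cycles>16:
    cycles=16 #16*200=3200, the deepest the API will go
print(cycles)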
The other change I made is that I save two versions of the file: one as is, and one that pulls some objects out of the sub-dictionaries and deletes the rest. If for some reason you don’t care about retweet count but are very interested in the retweeting user’s profile background color, go ahead and modify the code. See above for tips on exploring the data structure interactively so you can see what there is to choose from.
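The cleaning loop just repeats one pattern: copy the sub-fields you want up to the top level, then delete the sub-dictionary. Here is that pattern in isolation for a couple of the retweet fields (the field names match the full script below):
#the flatten-then-delete pattern used throughout the cleaning loop
for tweet in user_timeline:
    if 'retweeted_status' in tweet:
        tweet['rt_count'] = tweet['retweeted_status']['retweet_count']
        tweet['rt_user_id'] = tweet['retweeted_status']['user']['id']
        del tweet['retweeted_status']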
As above, you’ll need to register as an application and supply a key and secret.
You call it from bash with the target screenname as an argument.
python3 datacollection.py sociologicalsci
#datacollection.py
from twython import Twython
import sys
import time
import simplejson
from math import ceil
import tw2csv #custom module

parentpath='/Users/rossman/Documents/twittertrucks/feeds_py'
handle=sys.argv[1] #takes target twitter screenname as command-line argument
today = time.strftime("%Y%m%d")
csvfilepath=parentpath+'/'+handle+'_'+today+'.csv'
csvfilepath_clean=parentpath+'/'+handle+'_'+today+'_clean.csv'

#authenticate
APP_KEY='' #your 25 alphanumeric character key goes here
APP_SECRET='' #your 50 alphanumeric character secret goes here
twitter=Twython(APP_KEY,APP_SECRET,oauth_version=2) #simple authentication object
ACCESS_TOKEN=twitter.obtain_access_token()
twitter=Twython(APP_KEY,access_token=ACCESS_TOKEN)

#adapted from http://www.craigaddyman.com/mining-all-tweets-with-python/
#user_timeline=twitter.get_user_timeline(screen_name=handle,count=200) #if doing 200 or less, just do this one line
user_timeline=twitter.get_user_timeline(screen_name=handle,count=1) #get most recent tweet
lis=user_timeline[0]['id']-1 #tweet id # for most recent tweet

#only query as deep as necessary
tweetsum= user_timeline[0]['user']['statuses_count']
cycles=ceil(tweetsum / 200)
if cycles>16:
    cycles=16 #API only allows depth of 3200 so no point trying deeper than 200*16
time.sleep(60)
for i in range(0, cycles): ## iterate through all tweets up to max of 3200
    incremental = twitter.get_user_timeline(screen_name=handle, count=200, include_retweets=True, max_id=lis)
    user_timeline.extend(incremental)
    lis=user_timeline[-1]['id']-1
    time.sleep(90) ## 90 second rest between api calls. The API allows 15 calls per 15 minutes so this is conservative
tw2csv.tw2csv(user_timeline,csvfilepath)

#clean the file and save it
for i, val in enumerate(user_timeline):
    user_timeline[i]['user_screen_name']=user_timeline[i]['user']['screen_name']
    user_timeline[i]['user_followers_count']=user_timeline[i]['user']['followers_count']
    user_timeline[i]['user_id']=user_timeline[i]['user']['id']
    user_timeline[i]['user_created_at']=user_timeline[i]['user']['created_at']
    if 'retweeted_status' in user_timeline[i].keys():
        user_timeline[i]['rt_count'] = user_timeline[i]['retweeted_status']['retweet_count']
        user_timeline[i]['rt_id'] = user_timeline[i]['retweeted_status']['id']
        user_timeline[i]['rt_created'] = user_timeline[i]['retweeted_status']['created_at']
        user_timeline[i]['rt_user_screenname'] = user_timeline[i]['retweeted_status']['user']['name']
        user_timeline[i]['rt_user_id'] = user_timeline[i]['retweeted_status']['user']['id']
        user_timeline[i]['rt_user_followers'] = user_timeline[i]['retweeted_status']['user']['followers_count']
        del user_timeline[i]['retweeted_status']
    if 'quoted_status' in user_timeline[i].keys():
        user_timeline[i]['qt_created'] = user_timeline[i]['quoted_status']['created_at']
        user_timeline[i]['qt_id'] = user_timeline[i]['quoted_status']['id']
        user_timeline[i]['qt_text'] = user_timeline[i]['quoted_status']['text']
        user_timeline[i]['qt_user_screenname'] = user_timeline[i]['quoted_status']['user']['name']
        user_timeline[i]['qt_user_id'] = user_timeline[i]['quoted_status']['user']['id']
        user_timeline[i]['qt_user_followers'] = user_timeline[i]['quoted_status']['user']['followers_count']
        del user_timeline[i]['quoted_status']
    if user_timeline[i]['entities']['urls']: #list
        for j, val in enumerate(user_timeline[i]['entities']['urls']):
            urlj='url_'+str(j)
            user_timeline[i][urlj]=user_timeline[i]['entities']['urls'][j]['expanded_url']
    if user_timeline[i]['entities']['user_mentions']: #list
        for j, val in enumerate(user_timeline[i]['entities']['user_mentions']):
            mentionj='mention_'+str(j)
            user_timeline[i][mentionj] = user_timeline[i]['entities']['user_mentions'][j]['screen_name']
    if user_timeline[i]['entities']['hashtags']: #list
        for j, val in enumerate(user_timeline[i]['entities']['hashtags']):
            hashtagj='hashtag_'+str(j)
            user_timeline[i][hashtagj] = user_timeline[i]['entities']['hashtags'][j]['text']
    if user_timeline[i]['coordinates'] is not None: #NoneType or Dict
        user_timeline[i]['coord_long'] = user_timeline[i]['coordinates']['coordinates'][0]
        user_timeline[i]['coord_lat'] = user_timeline[i]['coordinates']['coordinates'][1]
    del user_timeline[i]['coordinates']
    del user_timeline[i]['user']
    del user_timeline[i]['entities']
    if 'place' in user_timeline[i].keys(): #NoneType or Dict
        del user_timeline[i]['place']
    if 'extended_entities' in user_timeline[i].keys():
        del user_timeline[i]['extended_entities']
    if 'geo' in user_timeline[i].keys():
        del user_timeline[i]['geo']
tw2csv.tw2csv(user_timeline,csvfilepath_clean)