Python: Scraping Twitter Again


As long-time readers will remember, I have been collecting Twitter data with the R library twitteR. Unfortunately that workflow has proven to be buggy, mostly for reasons having to do with authentication. As such I decided to learn Python and migrate my project to the Twython module. Overall, I’ve been very impressed by the language and the module. I haven’t had any dependency problems and authentication works pretty smoothly. On the other hand, it requires a lot more manual coding to get around rate limits than twitteR does, and this is a big part of what my scripts are doing.

I’ll let you follow the standard instructions for installing Python 3 and the Twython module before showing you my workflow. Note that all of my code was run on Python 3.5.1 and OSX 10.9. You want to use Python 3, not Python 2, as tweets are UTF-8. If you’re a Mac person, OSX comes with 2.7 but you will need to install Python 3. For the same reason, use Stata 14 or later if you plan to bring the tweets into Stata, since earlier versions do not handle Unicode.

One tip on installation: pip tends to default to Python 2.7, so use this syntax in bash.

python3   -m pip install twython
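
To confirm the module landed in your Python 3 installation rather than in 2.7, a quick sanity check you can run under python3 (nothing more than an import test):

#should print the Twython class without raising ImportError
from twython import Twython
print(Twython)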

I use three Python scripts: one to write Twython queries to disk, one to query information about a set of Twitter users, and one to query tweets from a particular user. Note that the query scripts can be slow to execute, which is deliberate, as otherwise you end up hitting rate limits. (Twitter’s API allows fifteen queries per fifteen minutes.) I call the two query scripts from bash with argument passing. The disk-writing script is called by the query scripts and doesn’t require user intervention, though you do need to be sure Python knows where to find it (usually by keeping it in the current working directory). Note that you will need to adjust things like file paths and authentication keys. (When accessing Twitter through scripts instead of your phone, you don’t use usernames and passwords but keys and secrets; you can generate the keys by registering an application.)

tw2csv.py

I am discussing this script first even though it is not directly called by the user because it is the most natural place to discuss Twython’s somewhat complicated data structure. A Twython data object is a list of dictionaries. (I adapted this script for exporting lists of dictionaries). You can get a pretty good feel for what these objects look like by using type() and the pprint module. In this sample code, I explore a data object created by infoquery.py.

type(users) #shows that users is a list
type(users[0]) #shows that each element of users is a dictionary
#printed raw, the objects are a wall of brackets and commas; use pprint to make a dictionary (sub)object human-readable with whitespace
import pprint
pp=pprint.PrettyPrinter(indent=4)
pp.pprint(users[0])
pp.pprint(users[0]['status']) #you can also zoom in on daughter objects, in this case the user's most recent tweet object. Note that this tweet is a sub-object within the user object, but may itself have sub-objects

As you can see if you use the pprint command, some of the dictionary values are themselves dictionaries. It’s a real fleas upon fleas kind of deal. In the datacollection.py script I pull some of these objects out and delete others for the “clean” version of the data. Also note that tw2csv defaults to writing these second-level fields as one first-level field with escaped internal delimiters. So if you open a file in Excel, some of the cells will be really long and have a lot of commas in them. While Excel automatically parses the escaped commas correctly, Stata assumes you don’t want them escaped unless you use this command:

import delimited "foo.csv", delimiter(comma) bindquote(strict) varnames(1) asdouble encoding(UTF-8) clear
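
For what it’s worth, Python’s own csv module also handles those quoted cells correctly when reading the file back. A minimal sketch, using the hypothetical foo.csv name from the Stata example above:

import csv

with open('foo.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f)) #DictReader keeps the quoted, comma-containing cells intact
print(len(rows), 'rows with', len(rows[0]), 'columns')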

Another tricky thing about Twython data is that there can be a variable number of dictionary entries (i.e., some fields are missing from some cases). For instance, if a tweet is not a retweet it will be missing the “retweeted_status” dictionary within a dictionary. This was the biggest problem with reusing the Stack Overflow code and required adapting another piece of code for getting the union set of dictionary keys. Note this will give you all the keys used in any entry from the current query, but not those found only in past or future queries. Likewise, Python returns the fields in an arbitrary order. For these two reasons, I hard-coded tw2csv to overwrite rather than append, and built a timestamp into the query scripts. If you tweak the code to append, you will run into problems with the fields not lining up.
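
To make the problem concrete, here is a toy illustration (with made-up tweets, not real API output) of why the header has to be the union of keys across every entry:

import functools

tweets = [{'id': 1, 'text': 'RT something', 'retweeted_status': {'id': 99}},
          {'id': 2, 'text': 'an original tweet'}] #no retweeted_status key at all
allkey = functools.reduce(lambda x, y: x.union(y.keys()), tweets, set())
print(allkey) #{'id', 'text', 'retweeted_status'} -- note the set has no fixed order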

Anyway, here’s the actual tw2csv code.


#tw2csv.py
def tw2csv(twdata,csvfile_out):
    import csv
    import functools
    #the header is the union of keys across every dictionary in the list
    allkey = functools.reduce(lambda x, y: x.union(y.keys()), twdata, set())
    #newline='' and an explicit UTF-8 encoding keep the csv module happy with tweet text
    with open(csvfile_out,'wt',newline='',encoding='utf-8') as output_file:
        dict_writer=csv.DictWriter(output_file,allkey)
        dict_writer.writeheader()
        dict_writer.writerows(twdata)
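
To show how it gets called, here is a quick usage sketch with made-up data rather than real Twitter output; it assumes tw2csv.py sits in the current working directory:

#a toy call to tw2csv; the two dictionaries deliberately have different key sets
import tw2csv

fake_users = [{'screen_name': 'example_a', 'followers_count': 10},
              {'screen_name': 'example_b', 'followers_count': 25, 'verified': False}]
tw2csv.tw2csv(fake_users, 'fake_users.csv') #header is the union of the two key sets
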
infoquery.py

One of the queries I like to run is getting basic information like date created, description, and follower counts. Basically, all the stuff that shows up on a user’s profile page. The Twitter API allows you to do this for 100 users simultaneously and I do this with the infoquery.py script. It assumes that your list of target users is stored in a text file, but there’s a commented out line that lets you hard code the users, which may be easier if you’re doing it interactively. Likewise, it’s designed to only query 100 users at a time, but there’s a commented out line that’s much simpler in interactive use if you’re only querying a few users.

You can call it from the command line and it takes as an argument the location of the input file. I hard-coded the location of the output. Note the “3” in the command-line call is important as operating systems like OSX default to calling Python 2.7.

python3 infoquery.py list.txt
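
For reference, the input file is just one screen name per line, with no @ sign. Using the same handles as the hard-coded example in the script below, list.txt would look like this:

gabrielrossman
sociologicalsci
twitter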

And here’s the actual script. Note that I’ve taken out my key and secret. You’ll have to register as an “application” and generate these yourself.

#infoquery.py
from twython import Twython
import sys
import time
from math import ceil
import tw2csv #custom module
 
parentpath='/Users/rossman/Documents/twittertrucks/infoquery_py'
targetlist=sys.argv[1] #text file listing feeds to query, one per line. full path ok.
today = time.strftime("%Y%m%d")
csvfilepath_info=parentpath+'/info_'+today+'.csv'
 
#authenticate
APP_KEY='' #your key: 25 alphanumeric characters
APP_SECRET='' #your secret: 50 alphanumeric characters
twitter=Twython(APP_KEY,APP_SECRET,oauth_version=2) #simple authentication object
ACCESS_TOKEN=twitter.obtain_access_token()
twitter=Twython(APP_KEY,access_token=ACCESS_TOKEN)

handles = [line.rstrip() for line in open(targetlist)] #read from text file given as cmd-line argument
#handles=("gabrielrossman,sociologicalsci,twitter") #alternately, hard-code the list of handles
 
#API allows 100 users per query. Cycle through, 100 at a time
#users = twitter.lookup_user(screen_name=handles) #this one line is all you need if len(handles) < 100
users=[] #initialize data object
hl=len(handles)
cycles=ceil(hl/100)
#unlike a get_user_timeline query, there is no need to cap total cycles
for i in range(0, cycles): ## iterate through the handles, 100 per call
    h=handles[0:100]
    del handles[0:100]
    incremental = twitter.lookup_user(screen_name=h)
    users.extend(incremental)
    time.sleep(90) ## 90 second rest between api calls. The API allows 15 calls per 15 minutes so this is conservative
 
tw2csv.tw2csv(users,csvfilepath_info)
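
If you would rather not empty out the handles list with del, an equivalent non-destructive version of that loop is sketched below. It is a drop-in replacement that reuses the twitter, handles, users, and time names defined earlier in infoquery.py:

#slice the list in steps of 100 instead of deleting from the front of it
for i in range(0, len(handles), 100):
    h = handles[i:i+100]
    incremental = twitter.lookup_user(screen_name=h)
    users.extend(incremental)
    time.sleep(90) #same conservative pause between calls
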
datacollection.py

This last script collects tweets for a specified user. The tricky thing about this code is that the Twitter API allows you to query the last 3200 tweets per user, but only 200 at a time, so you have to cycle over them. Moreover, you have to build in a delay so you don’t get rate-limited. I adapted the script from this code but made some tweaks.

One change I made was to only scrape as deep as necessary for any given user. For instance, as of this writing, @SociologicalSci has 1192 tweets, so the script cycles six times, but if you run it in a few weeks @SociologicalSci will have over 1200 tweets and so it will run at least seven cycles. This change makes the script run faster, but ultimately gets you to the same place.
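
In other words, the number of passes is just the tweet count divided by 200 and rounded up, capped at the API’s 3200-tweet ceiling. A minimal sketch of that arithmetic with the counts above:

from math import ceil

statuses_count = 1192 #@SociologicalSci as of this writing
cycles = min(ceil(statuses_count / 200), 16) #3200-tweet cap = 16 pages of 200
print(cycles) #prints 6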

The other change I made is that I save two versions of the file: one as-is, and one that pulls some objects out of the sub-dictionaries and deletes the rest. If for some reason you don’t care about retweet count but are very interested in the retweeting user’s profile background color, go ahead and modify the code. See above for tips on exploring the data structure interactively so you can see what there is to choose from.

As above, you’ll need to register as an application and supply a key and secret.

You call it from bash with the target screenname as an argument.

python3 datacollection.py sociologicalsci


#datacollection.py
from twython import Twython
import sys
import time
import simplejson
from math import ceil
import tw2csv #custom module
 
parentpath='/Users/rossman/Documents/twittertrucks/feeds_py'
handle=sys.argv[1] #takes target twitter screenname as command-line argument
today = time.strftime("%Y%m%d")
csvfilepath=parentpath+'/'+handle+'_'+today+'.csv'
csvfilepath_clean=parentpath+'/'+handle+'_'+today+'_clean.csv'
 
#authenticate
APP_KEY='' #your key: 25 alphanumeric characters
APP_SECRET='' #your secret: 50 alphanumeric characters
twitter=Twython(APP_KEY,APP_SECRET,oauth_version=2) #simple authentication object
ACCESS_TOKEN=twitter.obtain_access_token()
twitter=Twython(APP_KEY,access_token=ACCESS_TOKEN)
 
#adapted from http://www.craigaddyman.com/mining-all-tweets-with-python/
#user_timeline=twitter.get_user_timeline(screen_name=handle,count=200) #if doing 200 or less, just do this one line
user_timeline=twitter.get_user_timeline(screen_name=handle,count=1) #get most recent tweet
lis=user_timeline[0]['id']-1 #tweet id # for most recent tweet
#only query as deep as necessary
tweetsum= user_timeline[0]['user']['statuses_count']
cycles=ceil(tweetsum / 200)
if cycles>16:
    cycles=16 #API only allows depth of 3200 so no point trying deeper than 200*16
time.sleep(60)
for i in range(0, cycles): ## iterate through all tweets up to max of 3200
    incremental = twitter.get_user_timeline(screen_name=handle,
    count=200, include_retweets=True, max_id=lis)
    user_timeline.extend(incremental)
    lis=user_timeline[-1]['id']-1
    time.sleep(90) ## 90 second rest between api calls. The API allows 15 calls per 15 minutes so this is conservative

tw2csv.tw2csv(user_timeline,csvfilepath)

#clean the file and save it
for i, val in enumerate(user_timeline):
    user_timeline[i]['user_screen_name']=user_timeline[i]['user']['screen_name']
    user_timeline[i]['user_followers_count']=user_timeline[i]['user']['followers_count']
    user_timeline[i]['user_id']=user_timeline[i]['user']['id']
    user_timeline[i]['user_created_at']=user_timeline[i]['user']['created_at']
    if 'retweeted_status' in user_timeline[i].keys():
        user_timeline[i]['rt_count'] = user_timeline[i]['retweeted_status']['retweet_count']
        user_timeline[i]['rt_id'] = user_timeline[i]['retweeted_status']['id']
        user_timeline[i]['rt_created'] = user_timeline[i]['retweeted_status']['created_at']
        user_timeline[i]['rt_user_screenname'] = user_timeline[i]['retweeted_status']['user']['name']
        user_timeline[i]['rt_user_id'] = user_timeline[i]['retweeted_status']['user']['id']
        user_timeline[i]['rt_user_followers'] = user_timeline[i]['retweeted_status']['user']['followers_count']
        del user_timeline[i]['retweeted_status']
    if 'quoted_status' in user_timeline[i].keys():
        user_timeline[i]['qt_created'] = user_timeline[i]['quoted_status']['created_at']
        user_timeline[i]['qt_id'] = user_timeline[i]['quoted_status']['id']
        user_timeline[i]['qt_text'] = user_timeline[i]['quoted_status']['text']
        user_timeline[i]['qt_user_screenname'] = user_timeline[i]['quoted_status']['user']['name']
        user_timeline[i]['qt_user_id'] = user_timeline[i]['quoted_status']['user']['id']
        user_timeline[i]['qt_user_followers'] = user_timeline[i]['quoted_status']['user']['followers_count']
        del user_timeline[i]['quoted_status']
    if user_timeline[i]['entities']['urls']: #list
        for j, val in enumerate(user_timeline[i]['entities']['urls']):
            urlj='url_'+str(j)
            user_timeline[i][urlj]=user_timeline[i]['entities']['urls'][j]['expanded_url']
    if user_timeline[i]['entities']['user_mentions']: #list
        for j, val in enumerate(user_timeline[i]['entities']['user_mentions']):
            mentionj='mention_'+str(j)
            user_timeline[i][mentionj] = user_timeline[i]['entities']['user_mentions'][j]['screen_name']
    if user_timeline[i]['entities']['hashtags']: #list
        for j, val in enumerate(user_timeline[i]['entities']['hashtags']):
            hashtagj='hashtag_'+str(j)
            user_timeline[i][hashtagj] = user_timeline[i]['entities']['hashtags'][j]['text']
    if user_timeline[i]['coordinates'] is not None:  #NoneType or Dict
        user_timeline[i]['coord_long'] = user_timeline[i]['coordinates']['coordinates'][0]
        user_timeline[i]['coord_lat'] = user_timeline[i]['coordinates']['coordinates'][1]
    del user_timeline[i]['coordinates']
    del user_timeline[i]['user']
    del user_timeline[i]['entities']
    if 'place' in user_timeline[i].keys():  #NoneType or Dict
        del user_timeline[i]['place']
    if 'extended_entities' in user_timeline[i].keys():
        del user_timeline[i]['extended_entities']
    if 'geo' in user_timeline[i].keys():
        del user_timeline[i]['geo']

tw2csv.tw2csv(user_timeline,csvfilepath_clean)
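
If you want to see exactly which flattened columns (url_0, mention_0, hashtag_0, and so on) the cleaning loop produced for a given user, you can reuse the union-of-keys trick from tw2csv. A quick check you could run right after the loop above:

import functools
cleankeys = functools.reduce(lambda x, y: x.union(y.keys()), user_timeline, set())
print(sorted(cleankeys)) #every column tw2csv will write for this query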


