Python: Generating Network Graph of Twitter Follower

From OnnoWiki
Revision as of 10:51, 22 January 2017 by Onnowpurbo

Generating a network graph of Twitter followers using Python and NetworkX

15th August, 2014, by mark

In this article I show how, starting from a single Twitter account, we can build up a network graph of Twitter followers and then visualize that network using the NetworkX library.

The steps are:

   Starting from an initial seed account, collect followers using the snowball sampling technique.
   Process the collected Twitter data to generate an output file of relationships between Twitter accounts.
   Visualize the network data as a network graph using the NetworkX library.
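As a rough illustration of the snowball sampling idea in the first step, here is a minimal sketch in Python 3 (the scripts in this article are Python 2). It walks a toy in-memory network instead of the Twitter API. The names snowball_sample, get_friends and toy_network are assumptions made for this example and do not appear in the original scripts.

```python
from collections import deque

def snowball_sample(seed, get_friends, max_depth=2, max_per_account=200):
    """Collect accounts reachable from seed, wave by wave.

    get_friends is a hypothetical callable returning the accounts that a
    given account follows; in practice it would wrap a cached API call.
    """
    visited = {seed}            # plays the role of a "taboo list"
    edges = []                  # (account, followed_account) pairs
    queue = deque([(seed, 0)])
    while queue:
        account, depth = queue.popleft()
        if depth >= max_depth:
            continue            # do not expand accounts in the last wave
        for friend in get_friends(account)[:max_per_account]:
            edges.append((account, friend))
            if friend not in visited:
                visited.add(friend)
                queue.append((friend, depth + 1))
    return visited, edges

# Toy data standing in for Twitter: account -> accounts it follows.
toy_network = {
    'seed': ['a', 'b'],
    'a': ['b', 'c'],
    'b': ['seed'],
    'c': ['d'],
}
accounts, edges = snowball_sample(
    'seed', lambda acc: toy_network.get(acc, []), max_depth=2)
print(sorted(accounts))
```

The depth limit keeps the sample from growing without bound, and the visited set prevents the same account from being expanded twice.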

Step 1. Collect follower data from the Twitter API

You will need API keys to be able to query the Twitter API. I have described how to get them in previous articles, e.g. Collecting tweets using Python.

When you interact with the Twitter API you quickly learn that you need to cache data as you go along. The API is rate limited, so any script that does not cache responses will halt frequently when it hits a rate limit. The solution is to check for cached data before making an API call; on a cache miss, query the API and write the returned data to disk.
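The check-the-cache-first pattern can be sketched in a few lines of Python 3. This is an illustration only: cached_fetch and fake_api_call are hypothetical names, and the stand-in function replaces a real API call.

```python
import json
import os
import tempfile

def cached_fetch(cache_path, fetch):
    """Return cached JSON data if present, otherwise call fetch() and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path) as inf:       # cache hit: no API call needed
            return json.load(inf)
    data = fetch()                          # cache miss: query the API
    with open(cache_path, 'w') as outf:
        json.dump(data, outf, indent=1)
    return data

# Usage with a stand-in for the API call (made-up data).
calls = []
def fake_api_call():
    calls.append(1)
    return {'screen_name': 'example', 'followers_count': 3}

cache_dir = tempfile.mkdtemp()
path = os.path.join(cache_dir, 'example.json')
first = cached_fetch(path, fake_api_call)
second = cached_fetch(path, fake_api_call)   # served from disk
print(len(calls))
```

Because the second call is answered from disk, re-running a halted script costs no additional API requests for data that was already collected.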

I use two directories for cached data. The directory ‘following’ contains a CSV file for each Twitter account queried. The name of each file is the screen name of the Twitter account, and the content is a tab-delimited list in which each row contains the Twitter id, screen name and account name of an account that the user follows (the script retrieves these with api.friends), up to a maximum of 200 rows.

$ ls following/
-rw-r--r-- 1 mark mark 7.1K Aug 14 21:04 TEDxMtHood.csv
-rw-r--r-- 1 mark mark 7.0K Aug 14 21:21 TEDxYYC.csv
-rw-r--r-- 1 mark mark 5.7K Aug 15 07:29 TEDxCibeles.csv
-rw-r--r-- 1 mark mark 2.8K Aug 15 07:30 TEDxProvidence.csv
-rw-r--r-- 1 mark mark 6.9K Aug 15 07:46 TEDxUHasselt.csv
-rw-r--r-- 1 mark mark  625 Aug 15 07:46 TEDxWestVillage.csv
-rw-r--r-- 1 mark mark  196 Aug 15 07:46 TEDxESPRIT.csv
-rw-r--r-- 1 mark mark 2.9K Aug 15 08:02 TEDxUU.csv

$ cat following/TEDxESPRIT.csv
XXXXXXXXX	dediil	hedil jabou
XXXXXXXXX	MehdiBJemia	Mehdi Ben Jemia
XXXXXXXXX	_willywall	_william
XXXXXXXX	MirakHikimori	Hello Hikimori
XXXXXXXX	maroo_king	Marou
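For reference, a file in this tab-delimited layout can be read back with the standard csv module. The sketch below is Python 3 and uses made-up rows; read_following is a hypothetical helper and is not part of the original scripts.

```python
import csv
import io

def read_following(lines):
    """Parse rows of 'id<TAB>screen_name<TAB>name' into dictionaries."""
    rows = []
    for record in csv.reader(lines, delimiter='\t', quoting=csv.QUOTE_NONE):
        if len(record) < 3:
            continue                      # skip blank or malformed rows
        twitter_id, screen_name, name = record[:3]
        rows.append({'id': int(twitter_id),
                     'screen_name': screen_name,
                     'name': name})
    return rows

# Made-up sample in the same layout as the cached files.
sample = io.StringIO(
    '1001\texample_user\tExample User\n'
    '1002\tanother_user\tAnother User\n')
following = read_following(sample)
print([row['screen_name'] for row in following])
```

In real use the argument would be an open file object from the ‘following’ directory rather than an in-memory string.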

The second directory is called ‘twitter-users’. It is a cache of Twitter user details: each file contains cached data for one Twitter user, including friend and follower counts and a list of follower IDs (a maximum of 5000 follower IDs can be queried from the API).

$ ls twitter-users/
-rw-r--r-- 1 mark mark  252 Jul 24 16:45 XXXXXXXXX.json
-rw-r--r-- 1 mark mark  57K Jul 24 16:46 XXXXXXXX.json
-rw-r--r-- 1 mark mark 6.3K Jul 24 17:01 XXXXXXXXXX.json

... Lots more ...


$ cat twitter-users/XXXXXXXX.json
{
"name": "TEDxSingapore",
"friends_count": 147,
"followers_count": 12814,
"followers_ids": [
 XXXXXXXXXX,
 XXXXXXXXXX,
 XXXXXXXXX,
 ...
 XXXXXXXXXX,
 XXXXXXXXXX
],
"id": XXXXXXXX,
"screen_name": "TEDxSingapore"

}
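Because the list of follower IDs is capped, a cached record may hold fewer IDs than the account's follower count, as in the TEDxSingapore example above. A small Python 3 sketch of how such a record could be inspected follows; summarize_user is a hypothetical helper and the record is made up, using the same keys as the cached files.

```python
import json

def summarize_user(raw_json):
    """Return screen name, cached id count and whether the id list is partial."""
    user = json.loads(raw_json)
    cached = len(user['followers_ids'])
    return {
        'screen_name': user['screen_name'],
        'cached_ids': cached,
        # fewer cached ids than followers means the list was cut short
        'partial': user['followers_count'] > cached,
    }

# Made-up record with the same keys as the cached files.
record = json.dumps({
    'name': 'Example', 'screen_name': 'example', 'id': 1,
    'friends_count': 2, 'followers_count': 7,
    'followers_ids': [10, 11, 12],
})
summary = summarize_user(record)
print(summary)
```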

Here is the script to collect this data:

import tweepy
import time
import os
import sys
import json
import argparse

FOLLOWING_DIR = 'following'
MAX_FRIENDS = 200
FRIENDS_OF_FRIENDS_LIMIT = 200

# create both cache directories on the first run
for cache_dir in (FOLLOWING_DIR, 'twitter-users'):
   if not os.path.exists(cache_dir):
       os.makedirs(cache_dir)

enc = lambda x: x.encode('ascii', errors='ignore')

# The consumer keys can be found on your application's Details
# page located at https://dev.twitter.com/apps (under "OAuth settings")

CONSUMER_KEY = 'XXXXXXXXXXXXXXXXXXXXXXXXX'
CONSUMER_SECRET = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

# The access tokens can be found on your application's Details
# page located at https://dev.twitter.com/apps (located
# under "Your access token")

ACCESS_TOKEN = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
ACCESS_TOKEN_SECRET = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

# == OAuth Authentication ==
# This mode of authentication is the new preferred way
# of authenticating with Twitter.

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth)

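def get_follower_ids(centre, max_depth=1, current_depth=0, taboo_list=None):

   # taboo_list holds the ids already visited; start a fresh list on the first call
   if taboo_list is None:
       taboo_list = []
   # print 'current depth: %d, max depth: %d' % (current_depth, max_depth)
   # print 'taboo list: ', ','.join([ str(i) for i in taboo_list ])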
   if current_depth == max_depth:
       print 'out of depth'
       return taboo_list
   if centre in taboo_list:
       # we've been here before
       print 'Already been here.'
       return taboo_list
   else:
       taboo_list.append(centre)
   try:
       userfname = os.path.join('twitter-users', str(centre) + '.json')
       if not os.path.exists(userfname):
           print 'Retrieving user details for twitter id %s' % str(centre)
           while True:
               try:
                   user = api.get_user(centre)
                   d = {'name': user.name,
                        'screen_name': user.screen_name,
                        'id': user.id,
                        'friends_count': user.friends_count,
                        'followers_count': user.followers_count,
                        'followers_ids': user.followers_ids()}
                   with open(userfname, 'w') as outf:
                       outf.write(json.dumps(d, indent=1))
                   user = d
                   break
               except tweepy.TweepError, error:
                   print type(error)
                   if str(error) == 'Not authorized.':
                       print 'Cant access user data - not authorized.'
                       return taboo_list
                   if str(error) == 'User has been suspended.':
                       print 'User suspended.'
                       return taboo_list
                   errorObj = error[0][0]
                   print errorObj
                   if errorObj['message'] == 'Rate limit exceeded':
                       print 'Rate limited. Sleeping for 15 minutes.'
                       time.sleep(15 * 60 + 15)
                       continue
                   return taboo_list
       else:
           user = json.loads(file(userfname).read())
       screen_name = enc(user['screen_name'])
       fname = os.path.join(FOLLOWING_DIR, screen_name + '.csv')
       friendids = []
       # only retrieve friends of TED... screen names
       if screen_name.startswith('TED'):
           if not os.path.exists(fname):
               print 'No cached data for screen name "%s"' % screen_name
               with open(fname, 'w') as outf:
                   params = (enc(user['name']), screen_name)
                   print 'Retrieving friends for user "%s" (%s)' % params
                   # page over friends
                   c = tweepy.Cursor(api.friends, id=user['id']).items()
                   friend_count = 0
                   while True:
                       try:
                           friend = c.next()
                           friendids.append(friend.id)
                           params = (friend.id, enc(friend.screen_name), enc(friend.name))
                           outf.write('%s\t%s\t%s\n' % params)
                           friend_count += 1
                           if friend_count >= MAX_FRIENDS:
                               print 'Reached max no. of friends for "%s".' % friend.screen_name
                               break
                       except tweepy.TweepError:
                           # hit rate limit, sleep for 15 minutes
                           print 'Rate limited. Sleeping for 15 minutes.'
                           time.sleep(15 * 60 + 15)
                           continue
                       except StopIteration:
                           break
           else:
               friendids = [int(line.strip().split('\t')[0]) for line in file(fname)]
           print 'Found %d friends for %s' % (len(friendids), screen_name)
           # get friends of friends
           cd = current_depth
           if cd+1 < max_depth:
               for fid in friendids[:FRIENDS_OF_FRIENDS_LIMIT]:
                   taboo_list = get_follower_ids(fid, max_depth=max_depth,
                       current_depth=cd+1, taboo_list=taboo_list)
           if cd+1 < max_depth and len(friendids) > FRIENDS_OF_FRIENDS_LIMIT:
               print 'Not all friends retrieved for %s.' % screen_name
   except Exception, error:
       print 'Error retrieving followers for user id: ', centre
       print error
       # fname is only defined once the user details have been loaded
       if 'fname' in locals() and os.path.exists(fname):
           os.remove(fname)
           print 'Removed file "%s".' % fname
       sys.exit(1)
   return taboo_list

if __name__ == '__main__':

   ap = argparse.ArgumentParser()
   ap.add_argument("-s", "--screen-name", required=True, help="Screen name of twitter user")
   ap.add_argument("-d", "--depth", required=True, type=int, help="How far to follow user network")
   args = vars(ap.parse_args())
   twitter_screenname = args['screen_name']
   depth = int(args['depth'])
   if depth < 1 or depth > 3:
       print 'Depth value %d is not valid. Valid range is 1-3.' % depth
       sys.exit('Invalid depth argument.')
   print 'Max Depth: %d' % depth
   matches = api.lookup_users(screen_names=[twitter_screenname])
   if len(matches) == 1:
       print get_follower_ids(matches[0].id, max_depth=depth)
   else:
       print 'Sorry, could not find twitter user with screen name: %s' % twitter_screenname


Python file: get_followers.py
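The script handles rate limits by sleeping for just over 15 minutes and trying again. That sleep-and-retry pattern can be separated into a small helper, sketched here in Python 3. RateLimitError and call_with_retry are hypothetical names introduced for the example; with a real client library you would catch whatever rate-limit exception that library raises.

```python
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for the client library's rate-limit error."""

def call_with_retry(func, wait_seconds=15 * 60 + 15, max_attempts=5,
                    sleep=time.sleep):
    """Call func(), sleeping and retrying when it raises RateLimitError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts:
                raise                     # give up instead of looping forever
            sleep(wait_seconds)

# Demonstration with a fake call that fails twice, then succeeds.
state = {'calls': 0}
def flaky():
    state['calls'] += 1
    if state['calls'] < 3:
        raise RateLimitError()
    return 'ok'

waits = []
# Record the requested waits instead of actually sleeping.
result = call_with_retry(flaky, sleep=waits.append)
print(result, waits)
```

Passing the sleep function as a parameter is a design choice for this sketch: it lets the helper be exercised without waiting 15 minutes.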

I ran this script twice: first without a filter on the screen name but limiting the maximum number of followed accounts to 20, then again filtering for accounts whose screen name starts with ‘TED’ (the screen_name.startswith('TED') check, line 102 of the original file) and allowing up to 200 followed accounts to be queried. This gives a mix of TED and non-TED Twitter accounts. Running the script:

$ python get_followers.py -s TEDxSingapore -d 3

Max Depth: 3
Found 147 friends for TEDxSingapore
Found 200 friends for TEDWomen
Already been here.
Found 72 friends for TEDxDanteSchool
Found 33 friends for TEDHelp
Retrieving user details for twitter id XXXXXXXX from API...

... Lots more output ...

Step 2. Process twitter data to generate an output file of relationships between twitter accounts

The script below processes the data collected from the Twitter API and generates an edge list, that is, a list of relationships between Twitter accounts. Each edge includes a weight: the total number of followers of the first Twitter account, as retrieved from the API. The weight can be used later to prune the network graph.

import glob
import os
import json
import sys
from collections import defaultdict

users = defaultdict(lambda: { 'followers': 0 })

for f in glob.glob('twitter-users/*.json'):

   data = json.load(file(f))
   screen_name = data['screen_name']
   users[screen_name] = { 'followers': data['followers_count'] }

SEED = 'TEDxSingapore'

def process_follower_list(screen_name, edges=[], depth=0, max_depth=2):

   f = os.path.join('following', screen_name + '.csv')
   if not os.path.exists(f):
       return edges
   followers = [line.strip().split('\t') for line in file(f)]
   for follower_data in followers:
       if len(follower_data) < 2:
           continue
       screen_name_2 = follower_data[1]
       # use the number of followers for screen_name as the weight
       weight = users[screen_name]['followers']
       edges.append([screen_name, screen_name_2, weight])
       if depth+1 < max_depth:
           process_follower_list(screen_name_2, edges, depth+1, max_depth)
   return edges

edges = process_follower_list(SEED, max_depth=3)

with open('twitter_network.csv', 'w') as outf:

   edge_exists = {}
   for edge in edges:
       key = ','.join([str(x) for x in edge])
       if not(key in edge_exists):
           outf.write('%s\t%s\t%d\n' % (edge[0], edge[1], edge[2]))
           edge_exists[key] = True


Python file: twitter_network.py

The output generated from this script:

...

TEDxSingapore	trendwatchingAP	12814
adaptev	TEDxSingapore	321
IS_magazine	TEDxSingapore	9955
trendwatchingAP	TEDxSingapore	678
TEDxSingapore	GuyKawasaki	12814
TEDxSingapore	InnovateAP	12814
TEDxSingapore	InnosightTeam	12814
TEDxSingapore	ScottDAnthony	12814
TEDxSingapore	WorldAndScience	12814
TEDxSingapore	EntMagazine	12814
...
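An edge list in this format (source, target and weight separated by tabs) is straightforward to load back, remove duplicates from and filter by weight. The following Python 3 sketch uses a few of the rows shown above; load_edges and its min_weight parameter are assumptions made for the example.

```python
def load_edges(lines, min_weight=0):
    """Parse 'source<TAB>target<TAB>weight' rows, dropping duplicates and light edges."""
    seen = set()
    edges = []
    for line in lines:
        parts = line.rstrip('\n').split('\t')
        if len(parts) != 3:
            continue                      # skip '...' or malformed rows
        source, target, weight = parts[0], parts[1], int(parts[2])
        if (source, target) in seen or weight < min_weight:
            continue
        seen.add((source, target))
        edges.append((source, target, weight))
    return edges

rows = [
    'TEDxSingapore\ttrendwatchingAP\t12814\n',
    'adaptev\tTEDxSingapore\t321\n',
    'TEDxSingapore\ttrendwatchingAP\t12814\n',   # duplicate row
    'IS_magazine\tTEDxSingapore\t9955\n',
]
print(load_edges(rows, min_weight=1000))
```

In real use the rows would come from iterating over the open twitter_network.csv file.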

Step 3. Visualizing the Network using the NetworkX library

We now have all the data we need to generate a network graph. Here are the steps used to visualize the network graph:

   Create a directed graph (net.DiGraph) containing all the edge data, including metadata.
   Remove nodes based on how connected they are to other nodes in the network (i.e. remove poorly connected nodes).
   Remove edges whose follower count is below a minimum threshold.
   Split nodes into two separate categories, ‘TED’ and ‘non-TED’ sets.
   Render each nodeset
   Render edges between nodes
   Render node labels
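To make the pruning and splitting steps concrete without depending on a particular NetworkX version, here is a sketch in Python 3 that applies the same ideas to a plain list of edges. It is not the NetworkX code used in this article: prune_by_degree, prune_by_weight and split_nodes are hypothetical helpers, and the edge data is made up.

```python
from collections import defaultdict

def prune_by_degree(edges, min_degree, keep=()):
    """Drop edges touching nodes whose degree is <= min_degree (except nodes in keep)."""
    degree = defaultdict(int)
    for source, target, _ in edges:
        degree[source] += 1
        degree[target] += 1

    def well_connected(node):
        return node in keep or degree[node] > min_degree

    return [e for e in edges if well_connected(e[0]) and well_connected(e[1])]

def prune_by_weight(edges, min_weight, keep=()):
    """Drop edges lighter than min_weight unless they touch a node in keep."""
    return [e for e in edges
            if e[0] in keep or e[1] in keep or e[2] >= min_weight]

def split_nodes(edges, prefix='ted'):
    """Split node names into those starting with prefix (case-insensitive) and the rest."""
    nodes = {n for e in edges for n in e[:2]}
    match = sorted(n for n in nodes if n.lower().startswith(prefix))
    other = sorted(n for n in nodes if not n.lower().startswith(prefix))
    return match, other

# Made-up edge list: (source, target, followers of source).
edges = [
    ('TEDxSeed', 'TEDxA', 500), ('TEDxSeed', 'bob', 500),
    ('TEDxA', 'bob', 90), ('bob', 'TEDxA', 20), ('carol', 'bob', 5),
]
core = prune_by_degree(edges, min_degree=1, keep=('TEDxSeed',))
core = prune_by_weight(core, min_weight=50, keep=('TEDxSeed',))
print(split_nodes(core))
```

As in the steps above, the seed account is always kept, poorly connected nodes are dropped first, and light edges are dropped second.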

Here is the code to generate the Twitter network image. I wrote this code in IPython Notebook, which is why line 3 of the original file is a magic command that causes matplotlib output to be rendered in the browser:

import networkx as net
import matplotlib.pyplot as plt

from collections import defaultdict
import math

twitter_network = [ line.strip().split('\t') for line in file('twitter_network.csv') ]

o = net.DiGraph()
hfollowers = defaultdict(lambda: 0)
for (twitter_user, followed_by, followers) in twitter_network:

   o.add_edge(twitter_user, followed_by, followers=int(followers))
   hfollowers[twitter_user] = int(followers)

SEED = 'TEDxSingapore'

# centre around the SEED node and set radius of graph

g = net.DiGraph(net.ego_graph(o, SEED, radius=4))

def trim_degrees_ted(g, degree=1, ted_degree=1):

   g2 = g.copy()
   d = net.degree(g2)
   for n in g2.nodes():
       if n == SEED: continue # don't prune the SEED node
       if d[n] <= degree and not n.lower().startswith('ted'):
           g2.remove_node(n)
       elif n.lower().startswith('ted') and d[n] <= ted_degree:
           g2.remove_node(n)
   return g2

def trim_edges_ted(g, weight=1, ted_weight=10):

   g2 = net.DiGraph()
   for f, to, edata in g.edges_iter(data=True):
       if f == SEED or to == SEED: # keep edges that link to the SEED node
           g2.add_edge(f, to, edata)
       elif f.lower().startswith('ted') or to.lower().startswith('ted'):
           if edata['followers'] >= ted_weight:
               g2.add_edge(f, to, edata)
       elif edata['followers'] >= weight:
           g2.add_edge(f, to, edata)
   return g2

print 'g: ', len(g)
core = trim_degrees_ted(g, degree=235, ted_degree=1)
print 'core after node pruning: ', len(core)
core = trim_edges_ted(core, weight=250000, ted_weight=35000)
print 'core after edge pruning: ', len(core)

nodeset_types = {
   'TED': lambda s: s.lower().startswith('ted'),
   'Not TED': lambda s: not s.lower().startswith('ted')
}

nodesets = defaultdict(list)

for nodeset_typename, nodeset_test in nodeset_types.iteritems():

   nodesets[nodeset_typename] = [ n for n in core.nodes_iter() if nodeset_test(n) ]

pos = net.spring_layout(core) # compute layout

colours = ['red','green']
colourmap = {}

plt.figure(figsize=(18,18))
plt.axis('off')

# draw nodes

i = 0
alphas = {'TED': 0.6, 'Not TED': 0.4}
for k in nodesets.keys():

   ns = [ math.log10(hfollowers[n]+1) * 80 for n in nodesets[k] ]
   print k, len(ns)
   net.draw_networkx_nodes(core, pos, nodelist=nodesets[k], node_size=ns, node_color=colours[i], alpha=alphas[k])
   colourmap[k] = colours[i]
   i += 1

print 'colourmap: ', colourmap

# draw edges

net.draw_networkx_edges(core, pos, width=0.5, alpha=0.5)

# draw labels

alphas = { 'TED': 1.0, 'Not TED': 0.5}
for k in nodesets.keys():

   for n in nodesets[k]:
       x, y = pos[n]
       plt.text(x, y+0.02, s=n, alpha=alphas[k], horizontalalignment='center', fontsize=9)


Python file: visualize.py

   Line 7 Load edge data from disk
   Line 9-13 Create a directed graph from the edge data and populate a dictionary with the followers count data
   Line 18 Centre and restrict size of graph around the SEED node (TEDxSingapore)
   Line 20-29 Method to prune the network graph by eliminating nodes that don’t meet filter criteria
   Line 31-41 Method to prune the network graph by eliminating edges that don’t meet filter criteria
   Line 44, 46 removes nodes and edges from the network that don’t meet the filter criteria
   Line 67-73 For each nodeset draw the nodes, the size of each node is based on the log value of the followers count
   Line 76 Draw network edges
   Line 80-83 Draw network labels, using matplotlib directly rather than the net.draw_networkx_labels() method.
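The node sizing described above uses a logarithmic scale so that accounts with very large follower counts do not dwarf everything else in the plot. A short Python 3 sketch of the same formula follows; the helper name node_size is an assumption, and the factor of 80 is taken from the script.

```python
import math

def node_size(followers, scale=80):
    """Size a node by the base-10 log of its follower count (+1 avoids log10(0))."""
    return math.log10(followers + 1) * scale

for count in (0, 999, 12814):
    print(count, round(node_size(count), 1))
```

An account with no followers gets size 0, and each tenfold increase in followers adds a fixed 80 units to the node size.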

Output from running script in IPython Notebook

g: 119567
core after node pruning: 958
core after edge pruning: 198
Not TED 38
TED 160
colourmap: {'Not TED': 'red', 'TED': 'green'}

[Image: Twitter network graph]

See Also:

   NetworkX library
   Social Network Analysis for Startups by Maksim Tsvetovat; Alexander Kouznetsov
   Snowball Sampling





References