Python: Generating Network Graph of Twitter Follower
Latest revision as of 12:48, 29 January 2017
Preparation
Installation
    sudo apt install python-pip
    sudo pip install --upgrade pip
    sudo pip install tweepy
    mkdir following
    mkdir twitter-users
Log in to https://dev.twitter.com/apps and obtain:
- CONSUMER_KEY
- CONSUMER_SECRET
- ACCESS_TOKEN
- ACCESS_TOKEN_SECRET
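One way to keep these four values out of the script file is to read them from environment variables. The following is a minimal sketch in Python 3 and is not part of the original scripts; the helper name `load_twitter_credentials` and the `TWITTER_` variable prefix are assumptions chosen for illustration.

```python
import os

REQUIRED_KEYS = ("CONSUMER_KEY", "CONSUMER_SECRET",
                 "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")


def load_twitter_credentials(environ=None, prefix="TWITTER_"):
    """Return the four credentials as a dict, read from environment variables.

    This helper and the `prefix` convention are hypothetical; they are not
    part of tweepy or of the scripts on this page.
    """
    environ = os.environ if environ is None else environ
    missing = [k for k in REQUIRED_KEYS if not environ.get(prefix + k)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return {k: environ[prefix + k] for k in REQUIRED_KEYS}


# Example with a fake environment (placeholder values, not real keys).
fake_env = {"TWITTER_" + k: "XXXX" for k in REQUIRED_KEYS}
creds = load_twitter_credentials(fake_env)
print(sorted(creds))
```

The returned values could then be passed to `tweepy.OAuthHandler` and `set_access_token` in place of the hard-coded strings.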
General Steps
- Starting from an initial seed account, collect followers using the snowball sampling technique.
- Process the collected twitter data to generate an output file of relationships between twitter accounts.
- Visualize network data in a network graph using the NetworkX library.
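The first step, snowball sampling, can be sketched on a toy in-memory network. This is an illustrative Python 3 sketch, not the author's script: `get_follower_ids` in Step 1 walks the network recursively with a `taboo_list`, whereas this version uses a queue, and the `get_neighbours` callable and account names are made up.

```python
from collections import deque


def snowball_sample(seed, get_neighbours, max_depth):
    """Breadth-first snowball sample starting at `seed`.

    `get_neighbours` is a hypothetical callable returning the accounts linked
    to a given account; in the real scripts this role is played by Twitter
    API calls. Accounts found at `max_depth` are recorded but not expanded.
    """
    depth_of = {seed: 0}
    queue = deque([seed])
    while queue:
        account = queue.popleft()
        if depth_of[account] >= max_depth:
            continue
        for other in get_neighbours(account):
            if other not in depth_of:      # skip accounts already visited
                depth_of[other] = depth_of[account] + 1
                queue.append(other)
    return depth_of


# Toy network (made-up account names) standing in for the Twitter API.
toy_network = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c"],
    "c": ["d"],
    "d": [],
}
sample = snowball_sample("seed", lambda acc: toy_network.get(acc, []), max_depth=2)
print(sample)
```

The depth limit is what keeps the sample from growing without bound; each extra level multiplies the number of accounts that have to be queried.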
Quick Start
    python get_followers.py -s TEDxSingapore -d 3
    python twitter_network.py
    python visualize.py
Step 1. Collect follower data from the Twitter API
Follower data is stored in the ‘following’ folder. After the script has run, the data is saved there in CSV format (tab-separated), one file per screen name.
    $ ls following/
    -rw-r--r-- 1 mark mark 7.1K Aug 14 21:04 TEDxMtHood.csv
    -rw-r--r-- 1 mark mark 7.0K Aug 14 21:21 TEDxYYC.csv
    -rw-r--r-- 1 mark mark 5.7K Aug 15 07:29 TEDxCibeles.csv
    -rw-r--r-- 1 mark mark 2.8K Aug 15 07:30 TEDxProvidence.csv
    -rw-r--r-- 1 mark mark 6.9K Aug 15 07:46 TEDxUHasselt.csv
    -rw-r--r-- 1 mark mark  625 Aug 15 07:46 TEDxWestVillage.csv
    -rw-r--r-- 1 mark mark  196 Aug 15 07:46 TEDxESPRIT.csv
    -rw-r--r-- 1 mark mark 2.9K Aug 15 08:02 TEDxUU.csv

    $ cat following/TEDxESPRIT.csv
    XXXXXXXXX    dediil           hedil jabou
    XXXXXXXXX    MehdiBJemia      Mehdi Ben Jemia
    XXXXXXXXX    _willywall       _william
    XXXXXXXX     MirakHikimori    Hello Hikimori
    XXXXXXXX     maroo_king       Marou
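Each line of a file in ‘following’ holds three tab-separated fields: the account id, the screen name, and the display name. A minimal Python 3 sketch of reading such a file follows; the sample data and the helper name `read_following` are made up for illustration and are not part of the original scripts.

```python
import csv
import io

# Made-up sample in the same three-column, tab-separated layout.
sample = "111\talice_a\tAlice A\n222\tbob_b\tBob B\n\n"


def read_following(lines):
    """Yield (user_id, screen_name, name) tuples from tab-separated lines."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue                      # skip blank or truncated rows
        yield int(row[0]), row[1], row[2]


rows = list(read_following(io.StringIO(sample)))
print(rows)
```

For a real file, `open('following/TEDxESPRIT.csv', newline='')` could be passed in place of the `io.StringIO` object.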
The second directory, ‘twitter-users’, stores the Twitter user details in JSON format.
    $ ls twitter-users/
    -rw-r--r-- 1 mark mark  252 Jul 24 16:45 XXXXXXXXX.json
    -rw-r--r-- 1 mark mark  57K Jul 24 16:46 XXXXXXXX.json
    -rw-r--r-- 1 mark mark 6.3K Jul 24 17:01 XXXXXXXXXX.json
    ... Lots more ...

    $ cat twitter-users/XXXXXXXX.json
    {
     "name": "TEDxSingapore",
     "friends_count": 147,
     "followers_count": 12814,
     "followers_ids": [
      XXXXXXXXXX,
      XXXXXXXXXX,
      XXXXXXXXX,
      ...
      XXXXXXXXXX,
      XXXXXXXXXX
     ],
     "id": XXXXXXXX,
     "screen_name": "TEDxSingapore"
    }
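A user file of this shape can be read back with the standard `json` module. The sketch below (Python 3) uses a made-up record with the same keys; the values and the `summary` dictionary are illustrative only.

```python
import json

# Made-up record with the same keys as the files in twitter-users/.
raw = """
{
 "name": "Example Account",
 "friends_count": 2,
 "followers_count": 3,
 "followers_ids": [101, 102, 103],
 "id": 100,
 "screen_name": "example"
}
"""

user = json.loads(raw)
summary = {
    "screen_name": user["screen_name"],
    "followers_count": user["followers_count"],
    "ids_stored": len(user["followers_ids"]),
}
print(summary)
```

Comparing `followers_count` with the number of stored ids is a quick sanity check; the two may differ if only part of the follower list was retrieved.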
The script get_followers.py (https://gist.github.com/mjcreativeventures/41de04c6bbe47ee14411) used to collect the data is as follows:
    # Python 2 script (print statements, legacy tweepy API).
    import tweepy
    import time
    import os
    import sys
    import json
    import argparse

    FOLLOWING_DIR = 'following'
    MAX_FRIENDS = 200
    FRIENDS_OF_FRIENDS_LIMIT = 200

    if not os.path.exists(FOLLOWING_DIR):
        os.makedirs(FOLLOWING_DIR)  # fixed: os.makedir does not exist

    enc = lambda x: x.encode('ascii', errors='ignore')

    # The consumer keys can be found on your application's Details
    # page located at https://dev.twitter.com/apps (under "OAuth settings")
    CONSUMER_KEY = 'XXXXXXXXXXXXXXXXXXXXXXXXX'
    CONSUMER_SECRET = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

    # The access tokens can be found on your applications's Details
    # page located at https://dev.twitter.com/apps (located
    # under "Your access token")
    ACCESS_TOKEN = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
    ACCESS_TOKEN_SECRET = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

    # == OAuth Authentication ==
    #
    # This mode of authentication is the new preferred way
    # of authenticating with Twitter.
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

    api = tweepy.API(auth)

    def get_follower_ids(centre, max_depth=1, current_depth=0, taboo_list=[]):

        # print 'current depth: %d, max depth: %d' % (current_depth, max_depth)
        # print 'taboo list: ', ','.join([ str(i) for i in taboo_list ])

        if current_depth == max_depth:
            print 'out of depth'
            return taboo_list

        if centre in taboo_list:
            # we've been here before
            print 'Already been here.'
            return taboo_list
        else:
            taboo_list.append(centre)

        fname = None  # set before the try block so the error handler can test it

        try:
            userfname = os.path.join('twitter-users', str(centre) + '.json')
            if not os.path.exists(userfname):
                print 'Retrieving user details for twitter id %s' % str(centre)
                while True:
                    try:
                        user = api.get_user(centre)

                        d = {'name': user.name,
                             'screen_name': user.screen_name,
                             'id': user.id,
                             'friends_count': user.friends_count,
                             'followers_count': user.followers_count,
                             'followers_ids': user.followers_ids()}

                        # cache the user details on disk
                        with open(userfname, 'w') as outf:
                            outf.write(json.dumps(d, indent=1))

                        user = d
                        break
                    except tweepy.TweepError, error:
                        print type(error)

                        if str(error) == 'Not authorized.':
                            print 'Cant access user data - not authorized.'
                            return taboo_list

                        if str(error) == 'User has been suspended.':
                            print 'User suspended.'
                            return taboo_list

                        errorObj = error[0][0]

                        print errorObj

                        if errorObj['message'] == 'Rate limit exceeded':
                            print 'Rate limited. Sleeping for 15 minutes.'
                            time.sleep(15 * 60 + 15)
                            continue

                        return taboo_list
            else:
                user = json.loads(file(userfname).read())

            screen_name = enc(user['screen_name'])
            fname = os.path.join(FOLLOWING_DIR, screen_name + '.csv')
            friendids = []

            # only retrieve friends of TED... screen names
            if screen_name.startswith('TED'):
                if not os.path.exists(fname):
                    print 'No cached data for screen name "%s"' % screen_name
                    with open(fname, 'w') as outf:
                        params = (enc(user['name']), screen_name)
                        print 'Retrieving friends for user "%s" (%s)' % params

                        # page over friends
                        c = tweepy.Cursor(api.friends, id=user['id']).items()

                        friend_count = 0
                        while True:
                            try:
                                friend = c.next()
                                friendids.append(friend.id)
                                params = (friend.id, enc(friend.screen_name), enc(friend.name))
                                outf.write('%s\t%s\t%s\n' % params)
                                friend_count += 1
                                if friend_count >= MAX_FRIENDS:
                                    print 'Reached max no. of friends for "%s".' % friend.screen_name
                                    break
                            except tweepy.TweepError:
                                # hit rate limit, sleep for 15 minutes
                                print 'Rate limited. Sleeping for 15 minutes.'
                                time.sleep(15 * 60 + 15)
                                continue
                            except StopIteration:
                                break
                else:
                    # reuse the cached friend list
                    friendids = [int(line.strip().split('\t')[0]) for line in file(fname)]

                print 'Found %d friends for %s' % (len(friendids), screen_name)

                # get friends of friends
                cd = current_depth
                if cd+1 < max_depth:
                    for fid in friendids[:FRIENDS_OF_FRIENDS_LIMIT]:
                        taboo_list = get_follower_ids(fid, max_depth=max_depth,
                            current_depth=cd+1, taboo_list=taboo_list)

                if cd+1 < max_depth and len(friendids) > FRIENDS_OF_FRIENDS_LIMIT:
                    print 'Not all friends retrieved for %s.' % screen_name

        except Exception, error:
            print 'Error retrieving followers for user id: ', centre
            print error

            # remove a possibly incomplete cache file
            if fname and os.path.exists(fname):
                os.remove(fname)
                print 'Removed file "%s".' % fname

            sys.exit(1)

        return taboo_list

    if __name__ == '__main__':
        ap = argparse.ArgumentParser()
        ap.add_argument("-s", "--screen-name", required=True, help="Screen name of twitter user")
        ap.add_argument("-d", "--depth", required=True, type=int, help="How far to follow user network")
        args = vars(ap.parse_args())

        twitter_screenname = args['screen_name']
        depth = int(args['depth'])

        if depth < 1 or depth > 3:
            print 'Depth value %d is not valid. Valid range is 1-3.' % depth
            sys.exit('Invalid depth argument.')

        print 'Max Depth: %d' % depth
        matches = api.lookup_users(screen_names=[twitter_screenname])

        if len(matches) == 1:
            print get_follower_ids(matches[0].id, max_depth=depth)
        else:
            print 'Sorry, could not find twitter user with screen name: %s' % twitter_screenname
I ran this script twice: first without a filter on the screen name but limiting the maximum number of following accounts to 20, then again filtering for accounts whose screen name starts with ‘TED’ (the `screen_name.startswith('TED')` check, line 102 of the original script) and allowing up to 200 following accounts to be queried. This gives a mix of TED and non-TED Twitter accounts. Running the script:
    $ mkdir following
    $ mkdir twitter-users
    $ python get_followers.py -s TEDxSingapore -d 3
    Max Depth: 3
    Found 147 friends for TEDxSingapore
    Found 200 friends for TEDWomen
    Already been here.
    Found 72 friends for TEDxDanteSchool
    Found 33 friends for TEDHelp
    Retrieving user details for twitter id XXXXXXXX from API...
    ... Lots more output ...
Step 2. Process twitter data to generate an output file of relationships between twitter accounts
The script twitter_network.py (https://gist.github.com/mjcreativeventures/58a037a03b63355e02a3) below processes the data collected from the Twitter API and builds an edge list. This list describes the relationships between Twitter accounts. Each edge also carries a weight value: the total number of followers of the first Twitter account, taken from the Twitter API. The weight is used later when we draw the network graph.
    import glob
    import os
    import json
    import sys
    from collections import defaultdict

    users = defaultdict(lambda: { 'followers': 0 })

    # load the follower count of every cached user
    for f in glob.glob('twitter-users/*.json'):
        data = json.load(file(f))
        screen_name = data['screen_name']
        users[screen_name] = { 'followers': data['followers_count'] }

    SEED = 'TEDxSingapore'

    def process_follower_list(screen_name, edges=[], depth=0, max_depth=2):
        f = os.path.join('following', screen_name + '.csv')

        if not os.path.exists(f):
            return edges

        followers = [line.strip().split('\t') for line in file(f)]

        for follower_data in followers:
            if len(follower_data) < 2:
                continue

            screen_name_2 = follower_data[1]

            # use the number of followers for screen_name as the weight
            weight = users[screen_name]['followers']

            edges.append([screen_name, screen_name_2, weight])

            if depth+1 < max_depth:
                process_follower_list(screen_name_2, edges, depth+1, max_depth)

        return edges

    edges = process_follower_list(SEED, max_depth=3)

    # write each distinct edge once
    with open('twitter_network.csv', 'w') as outf:
        edge_exists = {}
        for edge in edges:
            key = ','.join([str(x) for x in edge])
            if not(key in edge_exists):
                outf.write('%s\t%s\t%d\n' % (edge[0], edge[1], edge[2]))
                edge_exists[key] = True
Python file: twitter_network.py
The output generated from this script:
    ...
    TEDxSingapore      trendwatchingAP    12814
    adaptev            TEDxSingapore      321
    IS_magazine        TEDxSingapore      9955
    trendwatchingAP    TEDxSingapore      678
    TEDxSingapore      GuyKawasaki        12814
    TEDxSingapore      InnovateAP         12814
    TEDxSingapore      InnosightTeam      12814
    TEDxSingapore      ScottDAnthony      12814
    TEDxSingapore      WorldAndScience    12814
    TEDxSingapore      EntMagazine        12814
    ...
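The idea behind this step, turning "who follows whom" data into a de-duplicated, weighted edge list, can be shown on in-memory data. This is a simplified Python 3 sketch rather than the script itself: `build_edges`, the `following` mapping and the `follower_counts` mapping are hypothetical stand-ins for the files on disk.

```python
def build_edges(following, follower_counts):
    """Return unique (source, target, weight) edges.

    `following` maps a screen name to the screen names listed in its file and
    `follower_counts` maps a screen name to its follower count; both are
    hypothetical in-memory stand-ins for the files on disk.
    """
    seen = set()
    edges = []
    for source, targets in following.items():
        weight = follower_counts.get(source, 0)   # weight comes from the first account
        for target in targets:
            edge = (source, target, weight)
            if edge not in seen:                  # drop repeated edges
                seen.add(edge)
                edges.append(edge)
    return edges


following = {"seed": ["a", "b", "a"], "a": ["seed"]}
follower_counts = {"seed": 12814, "a": 321}
edges = build_edges(following, follower_counts)
for edge in edges:
    print("%s\t%s\t%d" % edge)
```

A set of tuples plays the role of the `edge_exists` dictionary in the script: an edge is written only the first time it is seen.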
Step 3. Visualizing the Network using the NetworkX library
We now have all the data we need to generate a network graph. Here are the steps used to visualize the network graph:
- Create a directed graph (net.DiGraph) containing all the edge data including metadata.
- Remove nodes based on how connected they are to other nodes in the network (i.e. remove poorly connected nodes)
- Remove edges that have less than a minimum number of followers
- Split nodes into two separate categories, ‘TED’ and ‘non-TED’ sets.
- Render each nodeset
- Render edges between nodes
- Render node labels
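The two pruning steps in this list can be illustrated without NetworkX. The sketch below is plain Python 3 on a made-up edge list; it ignores the TED / non-TED distinction that visualize.py makes, and the function name `prune` and its parameters are assumptions for illustration.

```python
from collections import Counter


def prune(edges, min_degree, min_weight, seed):
    """Drop weakly connected nodes, then light edges.

    `edges` is a list of (source, target, weight) tuples. The seed node and
    the edges touching it are always kept.
    """
    degree = Counter()
    for source, target, _ in edges:
        degree[source] += 1
        degree[target] += 1

    # keep nodes whose degree exceeds the threshold, plus the seed
    keep = {n for n, d in degree.items() if d > min_degree or n == seed}

    kept_edges = []
    for source, target, weight in edges:
        if source not in keep or target not in keep:
            continue                       # an endpoint was pruned
        if seed in (source, target) or weight >= min_weight:
            kept_edges.append((source, target, weight))
    return kept_edges


edges = [
    ("seed", "a", 100), ("seed", "b", 100),
    ("a", "b", 5), ("b", "a", 50), ("a", "c", 50),
]
result = prune(edges, min_degree=1, min_weight=10, seed="seed")
print(result)
```

Pruning nodes first and edges second mirrors the order of the list above; with a graph of over a hundred thousand nodes, both thresholds have to be fairly aggressive before the picture becomes readable.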
Here is the code to generate the Twitter network image. I wrote this code in IPython Notebook, which is why line 3 of the notebook version holds a magic command that causes matplotlib output to be rendered in the browser:
    import networkx as net
    import matplotlib.pyplot as plt
    # (IPython Notebook only) the magic command that renders matplotlib output in the browser goes here
    from collections import defaultdict
    import math

    twitter_network = [ line.strip().split('\t') for line in file('twitter_network.csv') ]

    o = net.DiGraph()
    hfollowers = defaultdict(lambda: 0)
    for (twitter_user, followed_by, followers) in twitter_network:
        o.add_edge(twitter_user, followed_by, followers=int(followers))
        hfollowers[twitter_user] = int(followers)

    SEED = 'TEDxSingapore'

    # centre around the SEED node and set radius of graph
    g = net.DiGraph(net.ego_graph(o, SEED, radius=4))

    def trim_degrees_ted(g, degree=1, ted_degree=1):
        g2 = g.copy()
        d = net.degree(g2)
        for n in g2.nodes():
            if n == SEED: continue # don't prune the SEED node
            if d[n] <= degree and not n.lower().startswith('ted'):
                g2.remove_node(n)
            elif n.lower().startswith('ted') and d[n] <= ted_degree:
                g2.remove_node(n)
        return g2

    def trim_edges_ted(g, weight=1, ted_weight=10):
        g2 = net.DiGraph()
        for f, to, edata in g.edges_iter(data=True):
            if f == SEED or to == SEED: # keep edges that link to the SEED node
                g2.add_edge(f, to, edata)
            elif f.lower().startswith('ted') or to.lower().startswith('ted'):
                if edata['followers'] >= ted_weight:
                    g2.add_edge(f, to, edata)
            elif edata['followers'] >= weight:
                g2.add_edge(f, to, edata)
        return g2

    print 'g: ', len(g)
    core = trim_degrees_ted(g, degree=235, ted_degree=1)
    print 'core after node pruning: ', len(core)
    core = trim_edges_ted(core, weight=250000, ted_weight=35000)
    print 'core after edge pruning: ', len(core)

    nodeset_types = { 'TED': lambda s: s.lower().startswith('ted'), 'Not TED': lambda s: not s.lower().startswith('ted') }

    nodesets = defaultdict(list)

    for nodeset_typename, nodeset_test in nodeset_types.iteritems():
        nodesets[nodeset_typename] = [ n for n in core.nodes_iter() if nodeset_test(n) ]

    pos = net.spring_layout(core) # compute layout

    colours = ['red','green']
    colourmap = {}

    plt.figure(figsize=(18,18))
    plt.axis('off')

    # draw nodes
    i = 0
    alphas = {'TED': 0.6, 'Not TED': 0.4}
    for k in nodesets.keys():
        ns = [ math.log10(hfollowers[n]+1) * 80 for n in nodesets[k] ]
        print k, len(ns)
        net.draw_networkx_nodes(core, pos, nodelist=nodesets[k], node_size=ns, node_color=colours[i], alpha=alphas[k])
        colourmap[k] = colours[i]
        i += 1
    print 'colourmap: ', colourmap

    # draw edges
    net.draw_networkx_edges(core, pos, width=0.5, alpha=0.5)

    # draw labels
    alphas = { 'TED': 1.0, 'Not TED': 0.5}
    for k in nodesets.keys():
        for n in nodesets[k]:
            x, y = pos[n]
            plt.text(x, y+0.02, s=n, alpha=alphas[k], horizontalalignment='center', fontsize=9)

    # needed when run as a plain script (python visualize.py) instead of in a notebook
    plt.show()
Python file: visualize.py
- Line 7: Load edge data from disk into `twitter_network`.
- Lines 9-13: Create a directed graph from the edge data and populate the `hfollowers` dictionary with the follower counts.
- Line 18: Centre the graph on the SEED node (TEDxSingapore) and restrict its size with `net.ego_graph`.
- Lines 20-29: `trim_degrees_ted`, a function that prunes the network graph by eliminating nodes that don’t meet the filter criteria.
- Lines 31-41: `trim_edges_ted`, a function that prunes the network graph by eliminating edges that don’t meet the filter criteria.
- Lines 44 and 46: Remove the nodes and edges that don’t meet the filter criteria from the network.
- Lines 67-73: For each nodeset, draw the nodes; the size of each node is based on the log value of its follower count.
- Line 76: Draw the network edges.
- Lines 80-83: Draw the node labels, using matplotlib directly (`plt.text`) rather than the net.draw_networkx_labels() method.
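The node-size rule used in the drawing step, `math.log10(hfollowers[n]+1) * 80`, compresses follower counts that span several orders of magnitude into a narrow range of marker sizes. A small Python 3 sketch of the same formula follows; the wrapper function `node_size` and the sample counts are made up for illustration.

```python
import math


def node_size(followers, scale=80):
    """Marker size proportional to log10(followers + 1), as in visualize.py."""
    # the +1 keeps the logarithm defined for accounts with zero followers
    return math.log10(followers + 1) * scale


for count in (0, 9, 999, 99999):
    print(count, node_size(count))
```

Each tenfold increase in followers adds the same fixed amount (`scale`) to the size, so very large accounts stay visible without swamping the rest of the graph.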
Output from running script in IPython Notebook
    g:  119567
    core after node pruning:  958
    core after edge pruning:  198
    Not TED 38
    TED 160
    colourmap:  {'Not TED': 'red', 'TED': 'green'}
[Figure: Twitter network graph]
See Also:
- NetworkX library
- Social Network Analysis for Startups by Maksim Tsvetovat and Alexander Kouznetsov
- Snowball Sampling