Difference between revisions of "Python: Generating Network Graph of Twitter Follower"

From OnnoWiki
(Created page with "Generating a network graph of Twitter followers using Python and NetworkX 15th August, 2014 mark 8 Comments twitter network In this article I show you how by starting at a...")
 
 
(12 intermediate revisions by the same user not shown)
Line 1: Line 1:
Generating a network graph of Twitter followers using Python and NetworkX
+
==Preparation==
15th August, 2014 mark 8 Comments
 
twitter network
 
  
In this article I show how, starting from a single Twitter account, we can build up a network graph of Twitter followers and then visualize that network using the NetworkX library.
+
Installation
  
The steps are:
+
sudo apt install python-pip
 +
sudo pip install --upgrade pip
 +
sudo pip install tweepy
 +
mkdir following
 +
mkdir twitter-users
  
    From initial seed account collect followers using the Snowball Sampling technique.
+
Log in to https://dev.twitter.com/apps and obtain:
    Process the collected twitter data to generate an output file of relationships between twitter accounts.
 
    Visualize network data in a network graph using the NetworkX library.
 
  
Step 1. Collect follower data from the Twitter API
+
* CONSUMER_KEY
 +
* CONSUMER_SECRET
 +
* ACCESS_TOKEN
 +
* ACCESS_TOKEN_SECRET
  
You will need to have API keys to be able to query the Twitter API. I have written in previous articles how to do this, e.g. Collecting tweets using Python.
+
==General Steps==
  
When you interact with the Twitter API you will learn quickly that you need to cache data as you go along. The API is rate limited, so any script you write will halt frequently on a rate limit if you don’t cache responses. The solution is to check for cached data before making an API call; on a cache miss, query the API and write the returned data to disk.
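The cache-before-query idea can be sketched independently of Twitter. The following is a minimal Python 3 sketch, not the article's script: `cached_fetch` and `fake_api_lookup` are hypothetical names, and the fake lookup stands in for a real, rate-limited API call.

```python
import json
import os
import tempfile


def cached_fetch(cache_dir, key, fetch):
    """Return cached JSON for `key`, calling `fetch(key)` only on a cache miss."""
    path = os.path.join(cache_dir, str(key) + '.json')
    if os.path.exists(path):
        # cache hit: no API call needed
        with open(path) as inf:
            return json.load(inf)
    # cache miss: query the (hypothetical) API and persist the response
    data = fetch(key)
    with open(path, 'w') as outf:
        json.dump(data, outf, indent=1)
    return data


# Stand-in for a real API call; records every invocation.
calls = []


def fake_api_lookup(user_id):
    calls.append(user_id)
    return {'id': user_id, 'screen_name': 'example_user'}


cache_dir = tempfile.mkdtemp()
first = cached_fetch(cache_dir, 42, fake_api_lookup)   # miss: calls the API
second = cached_fetch(cache_dir, 42, fake_api_lookup)  # hit: read from disk
```

Because the second call is served from disk, the stand-in API is invoked only once, which is the property that keeps a long crawl from burning through the rate limit after a restart.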
+
* From initial seed account collect followers using the Snowball Sampling technique.
 +
* Process the collected twitter data to generate an output file of relationships between twitter accounts.
 +
* Visualize network data in a network graph using the NetworkX library.
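As a rough illustration of the first step, snowball sampling can be sketched as a bounded traversal from the seed account. This Python 3 sketch is an assumption-laden simplification: it walks breadth-first over an in-memory toy dictionary, whereas the article's script recurses depth-first over live API data. `snowball_sample`, `get_friends` and `toy_network` are hypothetical names.

```python
from collections import deque


def snowball_sample(seed, get_friends, max_depth, max_per_account):
    """Breadth-first snowball sample starting from `seed`.

    Returns the list of (account, friend) pairs discovered.
    """
    visited = {seed}
    edges = []
    queue = deque([(seed, 0)])
    while queue:
        account, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand accounts at the depth limit
        for friend in get_friends(account)[:max_per_account]:
            edges.append((account, friend))
            if friend not in visited:
                visited.add(friend)
                queue.append((friend, depth + 1))
    return edges


# Toy data standing in for API responses.
toy_network = {
    'seed': ['a', 'b'],
    'a': ['c', 'seed'],
    'b': [],
    'c': ['d'],
}

edges = snowball_sample('seed', lambda name: toy_network.get(name, []),
                        max_depth=2, max_per_account=200)
```

With `max_depth=2` the account `c` is discovered but never expanded, so the pair `('c', 'd')` does not appear in the result.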
  
I use two directories for cached data. The directory ‘following’ contains one file for each twitter account queried. Each file is named after the screen name of the twitter account, with a .csv extension, and its content is a tab-delimited list: each row contains the twitter id, screen name and account name of a follower, up to a maximum of 200 followers.
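A file in this layout can be read back with the standard `csv` module. This is a minimal Python 3 sketch under the assumption that every row has exactly the three tab-separated fields described above; the sample rows and the `read_following` helper are made up for illustration.

```python
import csv
import io

# Two made-up rows in the id / screen name / account name layout.
sample = "111\talice_example\tAlice Example\n222\tbob_example\tBob Example\n"


def read_following(handle):
    """Parse tab-delimited rows into (id, screen_name, name) tuples."""
    rows = []
    reader = csv.reader(handle, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or truncated lines
        rows.append((int(row[0]), row[1], row[2]))
    return rows


rows = read_following(io.StringIO(sample))
```

`csv.QUOTE_NONE` is used so that a quotation mark inside an account name is treated as ordinary text rather than as a field delimiter.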
+
==Quick Start==
  
$ ls following/
+
python get_followers.py -s TEDxSingapore -d 3
-rw-r--r-- 1 mark mark 7.1K Aug 14 21:04 TEDxMtHood.csv
+
  python twitter_network.py
-rw-r--r-- 1 mark mark 7.0K Aug 14 21:21 TEDxYYC.csv
+
  python visualize.py
-rw-r--r-- 1 mark mark 5.7K Aug 15 07:29 TEDxCibeles.csv
 
-rw-r--r-- 1 mark mark 2.8K Aug 15 07:30 TEDxProvidence.csv
 
-rw-r--r-- 1 mark mark 6.9K Aug 15 07:46 TEDxUHasselt.csv
 
-rw-r--r-- 1 mark mark 625 Aug 15 07:46 TEDxWestVillage.csv
 
-rw-r--r-- 1 mark mark 196 Aug 15 07:46 TEDxESPRIT.csv
 
-rw-r--r-- 1 mark mark 2.9K Aug 15 08:02 TEDxUU.csv
 
  
$ cat following/TEDxESPRIT.csv
 
XXXXXXXXX      dediil  hedil jabou
 
XXXXXXXXX      MehdiBJemia    Mehdi Ben Jemia
 
XXXXXXXXX      _willywall      _william
 
XXXXXXXX        MirakHikimori  Hello Hikimori
 
XXXXXXXX        maroo_king      Marou
 
  
The second directory is called ‘twitter-users’. It is a cache of twitter user details: each file contains cached data for one twitter user, including friend and follower counts and a list of follower IDs (up to a maximum of 5000 follower IDs can be queried from the API).
+
==Step 1. Collect follower data from the Twitter API==
  
$ ls twitter-users/
+
Follower data is stored in the ‘following’ folder; after the process has run, the data is saved in CSV format.
-rw-r--r-- 1 mark mark  252 Jul 24 16:45 XXXXXXXXX.json
 
-rw-r--r-- 1 mark mark  57K Jul 24 16:46 XXXXXXXX.json
 
-rw-r--r-- 1 mark mark 6.3K Jul 24 17:01 XXXXXXXXXX.json
 
  
... Lots more ...
+
$ ls following/
 +
-rw-r--r-- 1 mark mark 7.1K Aug 14 21:04 TEDxMtHood.csv
 +
-rw-r--r-- 1 mark mark 7.0K Aug 14 21:21 TEDxYYC.csv
 +
-rw-r--r-- 1 mark mark 5.7K Aug 15 07:29 TEDxCibeles.csv
 +
-rw-r--r-- 1 mark mark 2.8K Aug 15 07:30 TEDxProvidence.csv
 +
-rw-r--r-- 1 mark mark 6.9K Aug 15 07:46 TEDxUHasselt.csv
 +
-rw-r--r-- 1 mark mark  625 Aug 15 07:46 TEDxWestVillage.csv
 +
-rw-r--r-- 1 mark mark  196 Aug 15 07:46 TEDxESPRIT.csv
 +
-rw-r--r-- 1 mark mark 2.9K Aug 15 08:02 TEDxUU.csv
  
 +
$ cat following/TEDxESPRIT.csv
 +
XXXXXXXXX      dediil  hedil jabou
 +
XXXXXXXXX      MehdiBJemia    Mehdi Ben Jemia
 +
XXXXXXXXX      _willywall      _william
 +
XXXXXXXX        MirakHikimori  Hello Hikimori
 +
XXXXXXXX        maroo_king      Marou
  
$ cat twitter-users/XXXXXXXX.json
+
The second directory is ‘twitter-users’, which stores twitter user details in JSON format.
{
 
"name": "TEDxSingapore",
 
"friends_count": 147,
 
"followers_count": 12814,
 
"followers_ids": [
 
  XXXXXXXXXX,
 
  XXXXXXXXXX,
 
  XXXXXXXXX,
 
  ...
 
  XXXXXXXXXX,
 
  XXXXXXXXXX
 
],
 
"id": XXXXXXXX,
 
"screen_name": "TEDxSingapore"
 
}
 
  
Here is the script to collect this data:
 
  
import tweepy
+
$ ls twitter-users/
import time
+
-rw-r--r-- 1 mark mark  252 Jul 24 16:45 XXXXXXXXX.json
import os
+
-rw-r--r-- 1 mark mark  57K Jul 24 16:46 XXXXXXXX.json
import sys
+
-rw-r--r-- 1 mark mark 6.3K Jul 24 17:01 XXXXXXXXXX.json
import json
+
import argparse
+
... Lots more ...
 +
 +
$ cat twitter-users/XXXXXXXX.json
 +
 +
  {
 +
  "name": "TEDxSingapore",
 +
  "friends_count": 147,
 +
  "followers_count": 12814,
 +
  "followers_ids": [
 +
    XXXXXXXXXX,
 +
    XXXXXXXXXX,
 +
    XXXXXXXXX,
 +
    ...
 +
    XXXXXXXXXX,
 +
    XXXXXXXXXX
 +
  ],
 +
  "id": XXXXXXXX,
 +
  "screen_name": "TEDxSingapore"
 +
  }
  
FOLLOWING_DIR = 'following'
+
The script [https://gist.github.com/mjcreativeventures/41de04c6bbe47ee14411 get_followers.py] used to collect the data is as follows:
MAX_FRIENDS = 200
 
FRIENDS_OF_FRIENDS_LIMIT = 200
 
  
if not os.path.exists(FOLLOWING_DIR):
+
import tweepy
    os.makedirs(FOLLOWING_DIR)
+
import time
 +
import os
 +
import sys
 +
import json
 +
import argparse
 +
 +
FOLLOWING_DIR = 'following'
 +
MAX_FRIENDS = 200
 +
FRIENDS_OF_FRIENDS_LIMIT = 200
 +
 +
if not os.path.exists(FOLLOWING_DIR):
 +
    os.makedirs(FOLLOWING_DIR)
 +
 +
enc = lambda x: x.encode('ascii', errors='ignore')
 +
 +
# The consumer keys can be found on your application's Details
 +
# page located at https://dev.twitter.com/apps (under "OAuth settings")
 +
CONSUMER_KEY = 'XXXXXXXXXXXXXXXXXXXXXXXXX'
 +
CONSUMER_SECRET = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
 +
 +
# The access tokens can be found on your applications's Details
 +
# page located at https://dev.twitter.com/apps (located
 +
# under "Your access token")
 +
ACCESS_TOKEN = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
 +
ACCESS_TOKEN_SECRET = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
 +
 +
# == OAuth Authentication ==
 +
#
 +
# This mode of authentication is the new preferred way
 +
# of authenticating with Twitter.
 +
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
 +
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
 +
 +
api = tweepy.API(auth)
 +
 +
def get_follower_ids(centre, max_depth=1, current_depth=0, taboo_list=[]):
 +
 +
    # print 'current depth: %d, max depth: %d' % (current_depth, max_depth)
 +
    # print 'taboo list: ', ','.join([ str(i) for i in taboo_list ])
 +
 +
    if current_depth == max_depth:
 +
        print 'out of depth'
 +
        return taboo_list
 +
 +
    if centre in taboo_list:
 +
        # we've been here before
 +
        print 'Already been here.'
 +
        return taboo_list
 +
    else:
 +
        taboo_list.append(centre)
 +
 +
    try:
 +
        userfname = os.path.join('twitter-users', str(centre) + '.json')
 +
        if not os.path.exists(userfname):
 +
            print 'Retrieving user details for twitter id %s' % str(centre)
 +
            while True:
 +
                try:
 +
                    user = api.get_user(centre)
 +
 +
                    d = {'name': user.name,
 +
                          'screen_name': user.screen_name,
 +
                          'id': user.id,
 +
                          'friends_count': user.friends_count,
 +
                          'followers_count': user.followers_count,
 +
                          'followers_ids': user.followers_ids()}
 +
 +
                    with open(userfname, 'w') as outf:
 +
                        outf.write(json.dumps(d, indent=1))
 +
 +
                    user = d
 +
                    break
 +
                except tweepy.TweepError, error:
 +
                    print type(error)
 +
 +
                    if str(error) == 'Not authorized.':
 +
                        print "Can't access user data - not authorized."
 +
                        return taboo_list
 +
 +
                    if str(error) == 'User has been suspended.':
 +
                        print 'User suspended.'
 +
                        return taboo_list
 +
 +
                    errorObj = error[0][0]
 +
 +
                    print errorObj
 +
 +
                    if errorObj['message'] == 'Rate limit exceeded':
 +
                        print 'Rate limited. Sleeping for 15 minutes.'
 +
                        time.sleep(15 * 60 + 15)
 +
                        continue
 +
 +
                    return taboo_list
 +
        else:
 +
            user = json.loads(file(userfname).read())
 +
 +
        screen_name = enc(user['screen_name'])
 +
        fname = os.path.join(FOLLOWING_DIR, screen_name + '.csv')
 +
        friendids = []
 +
 +
        # only retrieve friends of TED... screen names
 +
        if screen_name.startswith('TED'):
 +
            if not os.path.exists(fname):
 +
                print 'No cached data for screen name "%s"' % screen_name
 +
                with open(fname, 'w') as outf:
 +
                    params = (enc(user['name']), screen_name)
 +
                    print 'Retrieving friends for user "%s" (%s)' % params
 +
 +
                    # page over friends
 +
                    c = tweepy.Cursor(api.friends, id=user['id']).items()
 +
 +
                    friend_count = 0
 +
                    while True:
 +
                        try:
 +
                            friend = c.next()
 +
                            friendids.append(friend.id)
 +
                            params = (friend.id, enc(friend.screen_name), enc(friend.name))
 +
                            outf.write('%s\t%s\t%s\n' % params)
 +
                            friend_count += 1
 +
                            if friend_count >= MAX_FRIENDS:
 +
                                print 'Reached max no. of friends for "%s".' % friend.screen_name
 +
                                break
 +
                        except tweepy.TweepError:
 +
                            # hit rate limit, sleep for 15 minutes
 +
                            print 'Rate limited. Sleeping for 15 minutes.'
 +
                            time.sleep(15 * 60 + 15)
 +
                            continue
 +
                        except StopIteration:
 +
                            break
 +
            else:
 +
                friendids = [int(line.strip().split('\t')[0]) for line in file(fname)]
 +
 +
            print 'Found %d friends for %s' % (len(friendids), screen_name)
 +
 +
            # get friends of friends
 +
            cd = current_depth
 +
            if cd+1 < max_depth:
 +
                for fid in friendids[:FRIENDS_OF_FRIENDS_LIMIT]:
 +
                    taboo_list = get_follower_ids(fid, max_depth=max_depth,
 +
                        current_depth=cd+1, taboo_list=taboo_list)
 +
 +
            if cd+1 < max_depth and len(friendids) > FRIENDS_OF_FRIENDS_LIMIT:
 +
                print 'Not all friends retrieved for %s.' % screen_name
 +
 +
    except Exception, error:
 +
        print 'Error retrieving followers for user id: ', centre
 +
        print error
 +
 +
        if os.path.exists(fname):
 +
            os.remove(fname)
 +
            print 'Removed file "%s".' % fname
 +
 +
        sys.exit(1)
 +
 +
    return taboo_list
  
enc = lambda x: x.encode('ascii', errors='ignore')
+
if __name__ == '__main__':
 +
    ap = argparse.ArgumentParser()
 +
    ap.add_argument("-s", "--screen-name", required=True, help="Screen name of twitter user")
 +
    ap.add_argument("-d", "--depth", required=True, type=int, help="How far to follow user network")
 +
    args = vars(ap.parse_args())
 +
 +
    twitter_screenname = args['screen_name']
 +
    depth = int(args['depth'])
 +
 +
    if depth < 1 or depth > 3:
 +
        print 'Depth value %d is not valid. Valid range is 1-3.' % depth
 +
        sys.exit('Invalid depth argument.')
 +
 +
    print 'Max Depth: %d' % depth
 +
    matches = api.lookup_users(screen_names=[twitter_screenname])
 +
 +
    if len(matches) == 1:
 +
        print get_follower_ids(matches[0].id, max_depth=depth)
 +
    else:
 +
        print 'Sorry, could not find twitter user with screen name: %s' % twitter_screenname
 +
  
# The consumer keys can be found on your application's Details
 
# page located at https://dev.twitter.com/apps (under "OAuth settings")
 
CONSUMER_KEY = 'XXXXXXXXXXXXXXXXXXXXXXXXX'
 
CONSUMER_SECRET = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
 
  
# The access tokens can be found on your applications's Details
 
# page located at https://dev.twitter.com/apps (located
 
# under "Your access token")
 
ACCESS_TOKEN = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
 
ACCESS_TOKEN_SECRET = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
 
 
# == OAuth Authentication ==
 
#
 
# This mode of authentication is the new preferred way
 
# of authenticating with Twitter.
 
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
 
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
 
 
api = tweepy.API(auth)
 
 
def get_follower_ids(centre, max_depth=1, current_depth=0, taboo_list=[]):
 
 
    # print 'current depth: %d, max depth: %d' % (current_depth, max_depth)
 
    # print 'taboo list: ', ','.join([ str(i) for i in taboo_list ])
 
 
    if current_depth == max_depth:
 
        print 'out of depth'
 
        return taboo_list
 
 
    if centre in taboo_list:
 
        # we've been here before
 
        print 'Already been here.'
 
        return taboo_list
 
    else:
 
        taboo_list.append(centre)
 
 
    try:
 
        userfname = os.path.join('twitter-users', str(centre) + '.json')
 
        if not os.path.exists(userfname):
 
            print 'Retrieving user details for twitter id %s' % str(centre)
 
            while True:
 
                try:
 
                    user = api.get_user(centre)
 
 
                    d = {'name': user.name,
 
                        'screen_name': user.screen_name,
 
                        'id': user.id,
 
                        'friends_count': user.friends_count,
 
                        'followers_count': user.followers_count,
 
                        'followers_ids': user.followers_ids()}
 
 
                    with open(userfname, 'w') as outf:
 
                        outf.write(json.dumps(d, indent=1))
 
 
                    user = d
 
                    break
 
                except tweepy.TweepError, error:
 
                    print type(error)
 
 
                    if str(error) == 'Not authorized.':
 
                        print "Can't access user data - not authorized."
 
                        return taboo_list
 
 
                    if str(error) == 'User has been suspended.':
 
                        print 'User suspended.'
 
                        return taboo_list
 
 
                    errorObj = error[0][0]
 
 
                    print errorObj
 
 
                    if errorObj['message'] == 'Rate limit exceeded':
 
                        print 'Rate limited. Sleeping for 15 minutes.'
 
                        time.sleep(15 * 60 + 15)
 
                        continue
 
 
                    return taboo_list
 
        else:
 
            user = json.loads(file(userfname).read())
 
 
        screen_name = enc(user['screen_name'])
 
        fname = os.path.join(FOLLOWING_DIR, screen_name + '.csv')
 
        friendids = []
 
 
        # only retrieve friends of TED... screen names
 
        if screen_name.startswith('TED'):
 
            if not os.path.exists(fname):
 
                print 'No cached data for screen name "%s"' % screen_name
 
                with open(fname, 'w') as outf:
 
                    params = (enc(user['name']), screen_name)
 
                    print 'Retrieving friends for user "%s" (%s)' % params
 
 
                    # page over friends
 
                    c = tweepy.Cursor(api.friends, id=user['id']).items()
 
 
                    friend_count = 0
 
                    while True:
 
                        try:
 
                            friend = c.next()
 
                            friendids.append(friend.id)
 
                            params = (friend.id, enc(friend.screen_name), enc(friend.name))
 
                            outf.write('%s\t%s\t%s\n' % params)
 
                            friend_count += 1
 
                            if friend_count >= MAX_FRIENDS:
 
                                print 'Reached max no. of friends for "%s".' % friend.screen_name
 
                                break
 
                        except tweepy.TweepError:
 
                            # hit rate limit, sleep for 15 minutes
 
                            print 'Rate limited. Sleeping for 15 minutes.'
 
                            time.sleep(15 * 60 + 15)
 
                            continue
 
                        except StopIteration:
 
                            break
 
            else:
 
                friendids = [int(line.strip().split('\t')[0]) for line in file(fname)]
 
 
            print 'Found %d friends for %s' % (len(friendids), screen_name)
 
 
            # get friends of friends
 
            cd = current_depth
 
            if cd+1 < max_depth:
 
                for fid in friendids[:FRIENDS_OF_FRIENDS_LIMIT]:
 
                    taboo_list = get_follower_ids(fid, max_depth=max_depth,
 
                        current_depth=cd+1, taboo_list=taboo_list)
 
 
            if cd+1 < max_depth and len(friendids) > FRIENDS_OF_FRIENDS_LIMIT:
 
                print 'Not all friends retrieved for %s.' % screen_name
 
 
    except Exception, error:
 
        print 'Error retrieving followers for user id: ', centre
 
        print error
 
 
        if os.path.exists(fname):
 
            os.remove(fname)
 
            print 'Removed file "%s".' % fname
 
 
        sys.exit(1)
 
 
    return taboo_list
 
 
if __name__ == '__main__':
 
    ap = argparse.ArgumentParser()
 
    ap.add_argument("-s", "--screen-name", required=True, help="Screen name of twitter user")
 
    ap.add_argument("-d", "--depth", required=True, type=int, help="How far to follow user network")
 
    args = vars(ap.parse_args())
 
 
    twitter_screenname = args['screen_name']
 
    depth = int(args['depth'])
 
 
    if depth < 1 or depth > 3:
 
        print 'Depth value %d is not valid. Valid range is 1-3.' % depth
 
        sys.exit('Invalid depth argument.')
 
 
    print 'Max Depth: %d' % depth
 
    matches = api.lookup_users(screen_names=[twitter_screenname])
 
 
    if len(matches) == 1:
 
        print get_follower_ids(matches[0].id, max_depth=depth)
 
    else:
 
        print 'Sorry, could not find twitter user with screen name: %s' % twitter_screenname
 
 
 
Python file: get_followers.py
 
  
 +
 
I ran this script twice: first without a filter on the screen name but limiting the maximum number of following accounts to 20, then again filtering for accounts starting with ‘TED’ (line 102) and allowing up to 200 following accounts to be queried. This gives a mix of TED and non-TED twitter accounts. Running the script:
 
  
$ python get_followers.py -s TEDxSingapore -d 3
+
$ mkdir following
 +
$ mkdir twitter-users
  
Max Depth: 3
+
$ python get_followers.py -s TEDxSingapore -d 3
Found 147 friends for TEDxSingapore
+
Found 200 friends for TEDWomen
+
Max Depth: 3
Already been here.
+
Found 147 friends for TEDxSingapore
Found 72 friends for TEDxDanteSchool
+
Found 200 friends for TEDWomen
Found 33 friends for TEDHelp
+
Already been here.
Retrieving user details for twitter id XXXXXXXX from API...
+
Found 72 friends for TEDxDanteSchool
 +
Found 33 friends for TEDHelp
 +
Retrieving user details for twitter id XXXXXXXX from API...
 +
 +
... Lots more output ...
  
... Lots more output ...
+
==Step 2. Process twitter data to generate an output file of relationships between twitter accounts==
  
Step 2. Process twitter data to generate an output file of relationships between twitter accounts
+
The script [https://gist.github.com/mjcreativeventures/58a037a03b63355e02a3 twitter_network.py] below processes the data collected from the twitter API and builds an edge list. This list contains the relationships between twitter accounts. A weight value is included: it is the total number of followers of the first twitter account, as retrieved from the twitter API. The weight value is used later when we draw the network graph.
  
The script below processes the data collected from the twitter API and generates an edge list, that is, a list of relationships between twitter accounts. Each edge includes a weight value: the total number of followers of the first twitter account, as retrieved from the API. The weight value can be used later to prune the network graph.
+
import glob
 +
import os
 +
import json
 +
import sys
 +
from collections import defaultdict
 +
 +
users = defaultdict(lambda: { 'followers': 0 })
 +
 +
for f in glob.glob('twitter-users/*.json'):
 +
    data = json.load(file(f))
 +
    screen_name = data['screen_name']
 +
    users[screen_name] = { 'followers': data['followers_count'] }
 +
 +
SEED = 'TEDxSingapore'
 +
 +
def process_follower_list(screen_name, edges=[], depth=0, max_depth=2):
 +
    f = os.path.join('following', screen_name + '.csv')
 +
 +
    if not os.path.exists(f):
 +
        return edges
 +
 +
    followers = [line.strip().split('\t') for line in file(f)]
 +
 +
    for follower_data in followers:
 +
        if len(follower_data) < 2:
 +
            continue
 +
 +
        screen_name_2 = follower_data[1]
 +
 +
        # use the number of followers for screen_name as the weight
 +
        weight = users[screen_name]['followers']
 +
 +
        edges.append([screen_name, screen_name_2, weight])
 +
 +
        if depth+1 < max_depth:
 +
            process_follower_list(screen_name_2, edges, depth+1, max_depth)
 +
 +
    return edges
 +
 +
edges = process_follower_list(SEED, max_depth=3)
 +
 +
with open('twitter_network.csv', 'w') as outf:
 +
    edge_exists = {}
 +
    for edge in edges:
 +
        key = ','.join([str(x) for x in edge])
 +
        if not(key in edge_exists):
 +
            outf.write('%s\t%s\t%d\n' % (edge[0], edge[1], edge[2]))
 +
            edge_exists[key] = True
 +
  
import glob
+
==Python file: twitter_network.py==
import os
 
import json
 
import sys
 
from collections import defaultdict
 
 
 
users = defaultdict(lambda: { 'followers': 0 })
 
 
 
for f in glob.glob('twitter-users/*.json'):
 
    data = json.load(file(f))
 
    screen_name = data['screen_name']
 
    users[screen_name] = { 'followers': data['followers_count'] }
 
 
 
SEED = 'TEDxSingapore'
 
 
 
def process_follower_list(screen_name, edges=[], depth=0, max_depth=2):
 
    f = os.path.join('following', screen_name + '.csv')
 
 
 
    if not os.path.exists(f):
 
        return edges
 
 
 
    followers = [line.strip().split('\t') for line in file(f)]
 
 
 
    for follower_data in followers:
 
        if len(follower_data) < 2:
 
            continue
 
 
 
        screen_name_2 = follower_data[1]
 
 
 
        # use the number of followers for screen_name as the weight
 
        weight = users[screen_name]['followers']
 
 
 
        edges.append([screen_name, screen_name_2, weight])
 
 
 
        if depth+1 < max_depth:
 
            process_follower_list(screen_name_2, edges, depth+1, max_depth)
 
 
 
    return edges
 
 
 
edges = process_follower_list(SEED, max_depth=3)
 
 
 
with open('twitter_network.csv', 'w') as outf:
 
    edge_exists = {}
 
    for edge in edges:
 
        key = ','.join([str(x) for x in edge])
 
        if not(key in edge_exists):
 
            outf.write('%s\t%s\t%d\n' % (edge[0], edge[1], edge[2]))
 
            edge_exists[key] = True
 
 
 
 
Python file: twitter_network.py
 
  
 
The output generated from this script:
 
  
...
+
...
 
+
TEDxSingapore  trendwatchingAP 12814
+
TEDxSingapore  trendwatchingAP 12814
adaptev TEDxSingapore  321
+
adaptev TEDxSingapore  321
IS_magazine    TEDxSingapore  9955
+
IS_magazine    TEDxSingapore  9955
trendwatchingAP TEDxSingapore  678
+
trendwatchingAP TEDxSingapore  678
TEDxSingapore  GuyKawasaki    12814
+
TEDxSingapore  GuyKawasaki    12814
TEDxSingapore  InnovateAP      12814
+
TEDxSingapore  InnovateAP      12814
TEDxSingapore  InnosightTeam  12814
+
TEDxSingapore  InnosightTeam  12814
TEDxSingapore  ScottDAnthony  12814
+
TEDxSingapore  ScottDAnthony  12814
TEDxSingapore  WorldAndScience 12814
+
TEDxSingapore  WorldAndScience 12814
TEDxSingapore  EntMagazine    12814
+
TEDxSingapore  EntMagazine    12814
...
+
...
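The de-duplication and tab-separated output format of the edge list can be sketched on its own. This Python 3 sketch is not the article's script: `unique_edges` and `format_edges` are hypothetical helpers, and the input reuses two rows from the sample output above with one of them deliberately repeated.

```python
def unique_edges(edges):
    """Drop repeated (source, target, weight) triples, keeping first-seen order."""
    seen = set()
    result = []
    for source, target, weight in edges:
        key = (source, target, weight)
        if key in seen:
            continue
        seen.add(key)
        result.append(key)
    return result


def format_edges(edges):
    """Render edges as tab-separated lines: source, target, weight."""
    return ''.join('%s\t%s\t%d\n' % edge for edge in edges)


raw = [
    ('TEDxSingapore', 'GuyKawasaki', 12814),
    ('adaptev', 'TEDxSingapore', 321),
    ('TEDxSingapore', 'GuyKawasaki', 12814),  # duplicate of the first row
]

deduped = unique_edges(raw)
text = format_edges(deduped)
```

A set of tuples is used as the membership check here; the original script achieves the same effect with a dictionary keyed on a comma-joined string.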
  
Step 3. Visualizing the Network using the NetworkX library
+
==Step 3. Visualizing the Network using the NetworkX library==
  
 
We now have all the data we need to generate a network graph. Here are the steps used to visualize the network graph:
 
  
    Create a directed graph (net.DiGraph) containing all the edge data including metadata.
+
* Create a directed graph (net.DiGraph) containing all the edge data including metadata.
    Remove nodes based on how connected they are to other nodes in the network (i.e. remove poorly connected nodes)
+
* Remove nodes based on how connected they are to other nodes in the network (i.e. remove poorly connected nodes)
    Remove edges that have fewer than a minimum number of followers
+
* Remove edges that have fewer than a minimum number of followers
    Split nodes into two separate categories, ‘TED’ and ‘non-TED’ sets.
+
* Split nodes into two separate categories, ‘TED’ and ‘non-TED’ sets.
    Render each nodeset
+
* Render each nodeset
    Render edges between nodes
+
* Render edges between nodes
    Render node labels
+
* Render node labels
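The pruning and splitting steps in the list above can be illustrated without NetworkX. The following Python 3 sketch works on a plain list of `(source, target, followers)` tuples and is a simplification: it applies one threshold to all nodes, whereas the article's code uses separate thresholds for TED and non-TED accounts. All function names, account names and numbers are made up for illustration.

```python
def prune_low_degree(edges, min_degree, keep=()):
    """Drop nodes whose degree (in + out) is <= min_degree, except those in `keep`."""
    degree = {}
    for source, target, _ in edges:
        degree[source] = degree.get(source, 0) + 1
        degree[target] = degree.get(target, 0) + 1
    alive = {n for n, d in degree.items() if d > min_degree or n in keep}
    return [e for e in edges if e[0] in alive and e[1] in alive]


def prune_light_edges(edges, min_followers, keep=()):
    """Keep edges at or above the follower threshold, or touching a `keep` node."""
    return [e for e in edges
            if e[2] >= min_followers or e[0] in keep or e[1] in keep]


def split_nodes(edges, prefix='ted'):
    """Split node names into those starting with `prefix` and the rest."""
    nodes = {n for s, t, _ in edges for n in (s, t)}
    ted = sorted(n for n in nodes if n.lower().startswith(prefix))
    other = sorted(n for n in nodes if not n.lower().startswith(prefix))
    return ted, other


# Toy edge list: (source, target, follower count of source).
toy_edges = [
    ('TEDxSeed', 'big_account', 1000),
    ('TEDxSeed', 'TEDxOther', 1000),
    ('TEDxOther', 'big_account', 500),
    ('TEDxOther', 'lonely_account', 500),
    ('big_account', 'TEDxSeed', 90000),
]

core = prune_low_degree(toy_edges, min_degree=1, keep=('TEDxSeed',))
core = prune_light_edges(core, min_followers=800, keep=('TEDxSeed',))
ted_nodes, other_nodes = split_nodes(core)
```

Here `lonely_account` is removed by the degree filter, the 500-follower edge is removed by the weight filter, and the remaining nodes split into two TED accounts and one non-TED account. The seed is always kept, mirroring the special case for the SEED node in the article's code.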
  
 
Here is the code to generate the twitter network image. I wrote this code in IPython Notebook (this is the reason Line 3 has a magic command that causes matplotlib output to be rendered in the browser):
 
  
import networkx as net
+
import networkx as net
import matplotlib.pyplot as plt
+
import matplotlib.pyplot as plt
 
+
from collections import defaultdict
+
from collections import defaultdict
import math
+
import math
 
+
twitter_network = [ line.strip().split('\t') for line in file('twitter_network.csv') ]
+
twitter_network = [ line.strip().split('\t') for line in file('twitter_network.csv') ]
 
+
o = net.DiGraph()
+
o = net.DiGraph()
hfollowers = defaultdict(lambda: 0)
+
hfollowers = defaultdict(lambda: 0)
for (twitter_user, followed_by, followers) in twitter_network:
+
for (twitter_user, followed_by, followers) in twitter_network:
    o.add_edge(twitter_user, followed_by, followers=int(followers))
+
    o.add_edge(twitter_user, followed_by, followers=int(followers))
    hfollowers[twitter_user] = int(followers)
+
    hfollowers[twitter_user] = int(followers)
 
+
SEED = 'TEDxSingapore'
+
SEED = 'TEDxSingapore'
 
+
# centre around the SEED node and set radius of graph
+
# centre around the SEED node and set radius of graph
g = net.DiGraph(net.ego_graph(o, SEED, radius=4))
+
g = net.DiGraph(net.ego_graph(o, SEED, radius=4))
 
+
def trim_degrees_ted(g, degree=1, ted_degree=1):
+
def trim_degrees_ted(g, degree=1, ted_degree=1):
    g2 = g.copy()
+
    g2 = g.copy()
    d = net.degree(g2)
+
    d = net.degree(g2)
    for n in g2.nodes():
+
    for n in g2.nodes():
        if n == SEED: continue # don't prune the SEED node
+
        if n == SEED: continue # don't prune the SEED node
        if d[n] <= degree and not n.lower().startswith('ted'):
+
        if d[n] <= degree and not n.lower().startswith('ted'):
            g2.remove_node(n)
+
            g2.remove_node(n)
        elif n.lower().startswith('ted') and d[n] <= ted_degree:
+
        elif n.lower().startswith('ted') and d[n] <= ted_degree:
            g2.remove_node(n)
+
            g2.remove_node(n)
    return g2
+
    return g2
 
+
def trim_edges_ted(g, weight=1, ted_weight=10):
+
def trim_edges_ted(g, weight=1, ted_weight=10):
    g2 = net.DiGraph()
+
    g2 = net.DiGraph()
    for f, to, edata in g.edges_iter(data=True):
+
    for f, to, edata in g.edges_iter(data=True):

Latest revision as of 12:48, 29 January 2017

Preparation

Installation

sudo apt install python-pip
sudo pip install --upgrade pip
sudo pip install tweepy
mkdir following
mkdir twitter-users

Log in to https://dev.twitter.com/apps and obtain:

  • CONSUMER_KEY
  • CONSUMER_SECRET
  • ACCESS_TOKEN
  • ACCESS_TOKEN_SECRET

General Steps

  • Starting from an initial seed account, collect followers using the snowball sampling technique.
  • Process the collected Twitter data to generate an output file of relationships between Twitter accounts.
  • Visualize the network data as a network graph using the NetworkX library.
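
As a rough illustration of snowball sampling, the sketch below walks a small in-memory graph from a seed account down to a fixed depth and skips accounts it has already visited. The `TOY_FRIENDS` dictionary and the `snowball` function are hypothetical stand-ins for the Twitter API calls, and the relationships between the accounts are invented. It is written to run on Python 3.

```python
# Toy "who follows whom" data; these relationships are invented for illustration.
TOY_FRIENDS = {
    'TEDxSingapore': ['TEDWomen', 'GuyKawasaki'],
    'TEDWomen': ['TEDHelp', 'TEDxSingapore'],
    'GuyKawasaki': ['TEDHelp'],
    'TEDHelp': ['TEDWomen'],
}


def snowball(seed, max_depth, friends=TOY_FRIENDS):
    """Return accounts reached from seed, in visiting order, down to max_depth levels."""
    visited = []  # plays the role of a "taboo list" of accounts already seen

    def visit(account, depth):
        # Stop when the depth limit is reached or the account was seen before.
        if depth == max_depth or account in visited:
            return
        visited.append(account)
        for friend in friends.get(account, []):
            visit(friend, depth + 1)

    visit(seed, 0)
    return visited


print(snowball('TEDxSingapore', max_depth=2))
```

Raising `max_depth` widens the sample by one more level of friends, which is why the amount of collected data grows quickly with depth.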

Quick Start

python get_followers.py -s TEDxSingapore -d 3
python twitter_network.py
python visualize.py


Step 1. Collect follower data from the Twitter API

Follower data is stored in the following directory. After the script has run, the data is saved there in CSV format (the fields are tab-separated, despite the .csv extension).

$ ls following/
-rw-r--r-- 1 mark mark 7.1K Aug 14 21:04 TEDxMtHood.csv
-rw-r--r-- 1 mark mark 7.0K Aug 14 21:21 TEDxYYC.csv
-rw-r--r-- 1 mark mark 5.7K Aug 15 07:29 TEDxCibeles.csv
-rw-r--r-- 1 mark mark 2.8K Aug 15 07:30 TEDxProvidence.csv
-rw-r--r-- 1 mark mark 6.9K Aug 15 07:46 TEDxUHasselt.csv
-rw-r--r-- 1 mark mark  625 Aug 15 07:46 TEDxWestVillage.csv
-rw-r--r-- 1 mark mark  196 Aug 15 07:46 TEDxESPRIT.csv
-rw-r--r-- 1 mark mark 2.9K Aug 15 08:02 TEDxUU.csv
$ cat following/TEDxESPRIT.csv
XXXXXXXXX       dediil  hedil jabou
XXXXXXXXX       MehdiBJemia     Mehdi Ben Jemia
XXXXXXXXX       _willywall      _william
XXXXXXXX        MirakHikimori   Hello Hikimori
XXXXXXXX        maroo_king      Marou
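
Each line in a file under following/ holds three tab-separated fields: the numeric Twitter id, the screen name, and the display name. The following is a minimal sketch of reading such lines back, assuming that layout; `parse_following_lines` is a hypothetical helper and the ids are placeholders, since the real ones are masked above. It is written to run on Python 3.

```python
def parse_following_lines(lines):
    """Turn tab-separated 'id<TAB>screen_name<TAB>name' lines into dictionaries."""
    records = []
    for line in lines:
        parts = line.rstrip('\n').split('\t')
        if len(parts) < 3:
            continue  # skip blank or truncated lines
        records.append({'id': int(parts[0]),
                        'screen_name': parts[1],
                        'name': parts[2]})
    return records


# Placeholder ids; real files contain the numeric Twitter ids.
sample = ['111\tdediil\thedil jabou\n',
          '222\tMehdiBJemia\tMehdi Ben Jemia\n',
          '\n']
for record in parse_following_lines(sample):
    print(record['screen_name'])
```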

The second directory, ‘twitter-users’, stores the Twitter user details in JSON format.


$ ls twitter-users/
-rw-r--r-- 1 mark mark  252 Jul 24 16:45 XXXXXXXXX.json
-rw-r--r-- 1 mark mark  57K Jul 24 16:46 XXXXXXXX.json
-rw-r--r-- 1 mark mark 6.3K Jul 24 17:01 XXXXXXXXXX.json

... Lots more ...

$ cat twitter-users/XXXXXXXX.json

 {
  "name": "TEDxSingapore",
  "friends_count": 147,
  "followers_count": 12814,
  "followers_ids": [
   XXXXXXXXXX,
   XXXXXXXXXX,
   XXXXXXXXX,
   ...
   XXXXXXXXXX,
   XXXXXXXXXX
  ],
  "id": XXXXXXXX,
  "screen_name": "TEDxSingapore"
 }
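
Because the cached user file is plain JSON, it can be read back with the standard json module. The sketch below parses an in-memory copy of the record shown above; the numeric ids are placeholders for the masked values and the list of follower ids is shortened for brevity. It is written to run on Python 3.

```python
import json

# Placeholder ids stand in for the masked values; the id list is shortened.
raw = '''
{
 "name": "TEDxSingapore",
 "friends_count": 147,
 "followers_count": 12814,
 "followers_ids": [1001, 1002, 1003],
 "id": 42,
 "screen_name": "TEDxSingapore"
}
'''

user = json.loads(raw)
summary = '%s has %d followers (%d ids listed)' % (
    user['screen_name'], user['followers_count'], len(user['followers_ids']))
print(summary)
```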

The get_followers.py script that collects the data is as follows:

import tweepy
import time
import os
import sys
import json
import argparse 

FOLLOWING_DIR = 'following'
MAX_FRIENDS = 200
FRIENDS_OF_FRIENDS_LIMIT = 200

if not os.path.exists(FOLLOWING_DIR):
    os.makedirs(FOLLOWING_DIR)

enc = lambda x: x.encode('ascii', errors='ignore')

# The consumer keys can be found on your application's Details
# page located at https://dev.twitter.com/apps (under "OAuth settings")
CONSUMER_KEY = 'XXXXXXXXXXXXXXXXXXXXXXXXX'
CONSUMER_SECRET = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

# The access tokens can be found on your applications's Details
# page located at https://dev.twitter.com/apps (located
# under "Your access token")
ACCESS_TOKEN = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
ACCESS_TOKEN_SECRET = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

# == OAuth Authentication ==
#
# This mode of authentication is the new preferred way
# of authenticating with Twitter.
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET) 

api = tweepy.API(auth)

def get_follower_ids(centre, max_depth=1, current_depth=0, taboo_list=[]):

    # print 'current depth: %d, max depth: %d' % (current_depth, max_depth)
    # print 'taboo list: ', ','.join([ str(i) for i in taboo_list ])

    if current_depth == max_depth:
        print 'out of depth'
        return taboo_list

    if centre in taboo_list:
        # we've been here before
        print 'Already been here.'
        return taboo_list
    else:
        taboo_list.append(centre) 

    try:
        userfname = os.path.join('twitter-users', str(centre) + '.json')
        if not os.path.exists(userfname):
            print 'Retrieving user details for twitter id %s' % str(centre)
            while True:
                try:
                    user = api.get_user(centre) 

                    d = {'name': user.name,
                         'screen_name': user.screen_name,
                         'id': user.id,
                         'friends_count': user.friends_count,
                         'followers_count': user.followers_count,
                         'followers_ids': user.followers_ids()}

                    with open(userfname, 'w') as outf:
                        outf.write(json.dumps(d, indent=1))

                    user = d
                    break
                except tweepy.TweepError, error:
                    print type(error)

                    if str(error) == 'Not authorized.':
                        print 'Cant access user data - not authorized.'
                        return taboo_list

                    if str(error) == 'User has been suspended.':
                        print 'User suspended.'
                        return taboo_list

                    errorObj = error[0][0]

                    print errorObj

                    if errorObj['message'] == 'Rate limit exceeded':
                        print 'Rate limited. Sleeping for 15 minutes.'
                        time.sleep(15 * 60 + 15)
                        continue

                    return taboo_list
        else:
            user = json.loads(file(userfname).read())

        screen_name = enc(user['screen_name'])
        fname = os.path.join(FOLLOWING_DIR, screen_name + '.csv')
        friendids = []

        # only retrieve friends of TED... screen names
        if screen_name.startswith('TED'):
            if not os.path.exists(fname):
                print 'No cached data for screen name "%s"' % screen_name
                with open(fname, 'w') as outf:
                    params = (enc(user['name']), screen_name)
                    print 'Retrieving friends for user "%s" (%s)' % params 

                    # page over friends
                    c = tweepy.Cursor(api.friends, id=user['id']).items()

                    friend_count = 0
                    while True:
                        try:
                            friend = c.next()
                            friendids.append(friend.id)
                            params = (friend.id, enc(friend.screen_name), enc(friend.name))
                            outf.write('%s\t%s\t%s\n' % params)
                            friend_count += 1
                            if friend_count >= MAX_FRIENDS:
                                print 'Reached max no. of friends for "%s".' % screen_name
                                break
                        except tweepy.TweepError:
                            # hit rate limit, sleep for 15 minutes
                            print 'Rate limited. Sleeping for 15 minutes.'
                            time.sleep(15 * 60 + 15)
                            continue
                        except StopIteration:
                            break
            else:
                friendids = [int(line.strip().split('\t')[0]) for line in file(fname)] 

            print 'Found %d friends for %s' % (len(friendids), screen_name) 

            # get friends of friends
            cd = current_depth
            if cd+1 < max_depth:
                for fid in friendids[:FRIENDS_OF_FRIENDS_LIMIT]:
                    taboo_list = get_follower_ids(fid, max_depth=max_depth,
                        current_depth=cd+1, taboo_list=taboo_list) 

            if cd+1 < max_depth and len(friendids) > FRIENDS_OF_FRIENDS_LIMIT:
                print 'Not all friends retrieved for %s.' % screen_name 

    except Exception, error:
        print 'Error retrieving followers for user id: ', centre
        print error

        if 'fname' in locals() and os.path.exists(fname):
            os.remove(fname)
            print 'Removed file "%s".' % fname 

        sys.exit(1) 

    return taboo_list


if __name__ == '__main__':
    ap = argparse.ArgumentParser()
    ap.add_argument("-s", "--screen-name", required=True, help="Screen name of twitter user")
    ap.add_argument("-d", "--depth", required=True, type=int, help="How far to follow user network")
    args = vars(ap.parse_args())

    twitter_screenname = args['screen_name']
    depth = int(args['depth']) 

    if depth < 1 or depth > 3:
        print 'Depth value %d is not valid. Valid range is 1-3.' % depth
        sys.exit('Invalid depth argument.')

    print 'Max Depth: %d' % depth
    matches = api.lookup_users(screen_names=[twitter_screenname])

    if len(matches) == 1:
        print get_follower_ids(matches[0].id, max_depth=depth)
    else:
        print 'Sorry, could not find twitter user with screen name: %s' % twitter_screenname
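
The retry loops in the script wait out the rate-limit window and then try the same call again. The sketch below isolates that pattern with a stand-in exception and an injectable sleep function, so it runs without network access. `RateLimited`, `call_with_retry` and `flaky_lookup` are hypothetical names and are not part of tweepy. It is written to run on Python 3.

```python
import time


class RateLimited(Exception):
    """Stand-in for the rate-limit error raised by an API client (hypothetical)."""


def call_with_retry(func, wait_seconds=15 * 60 + 15, sleep=time.sleep, max_attempts=5):
    """Call func(); on RateLimited, sleep and retry, up to max_attempts calls in total."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(wait_seconds)


# Demonstration with a fake call that fails twice before succeeding.
calls = {'count': 0}
waits = []


def flaky_lookup():
    calls['count'] += 1
    if calls['count'] <= 2:
        raise RateLimited()
    return 'user details'


# Passing waits.append as the sleep function records the waits instead of sleeping.
result = call_with_retry(flaky_lookup, sleep=waits.append)
print(result, waits)
```

Unlike the `while True` loops in the script, this sketch caps the number of attempts, which is one possible design choice to avoid waiting forever when the error is not actually a rate limit.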



I ran this script twice. The first run had no filter on the screen name but limited the maximum number of followed accounts to 20. The second run filtered for accounts whose screen name starts with ‘TED’ (line 102) and allowed up to 200 followed accounts to be queried. Together the two runs give a mix of TED and non-TED Twitter accounts. Running the script:

$ mkdir following
$ mkdir twitter-users
$ python get_followers.py -s TEDxSingapore -d 3

Max Depth: 3
Found 147 friends for TEDxSingapore
Found 200 friends for TEDWomen
Already been here.
Found 72 friends for TEDxDanteSchool
Found 33 friends for TEDHelp
Retrieving user details for twitter id XXXXXXXX from API... 

... Lots more output ...
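
Because the script writes every API response to disk, a second run can reuse the cached files instead of spending rate-limited calls. A minimal sketch of that check-the-cache-first pattern follows; `cached_fetch` and `fake_fetch` are hypothetical names, and a temporary directory stands in for twitter-users. It is written to run on Python 3.

```python
import json
import os
import tempfile


def cached_fetch(cache_dir, key, fetch):
    """Return the JSON-cached value for key, calling fetch(key) only on a cache miss."""
    path = os.path.join(cache_dir, str(key) + '.json')
    if os.path.exists(path):
        with open(path) as inf:
            return json.load(inf)
    data = fetch(key)  # the expensive, rate-limited call
    with open(path, 'w') as outf:
        json.dump(data, outf, indent=1)
    return data


# Demonstration: the fake fetch function records how often it is called.
api_calls = []


def fake_fetch(user_id):
    api_calls.append(user_id)
    return {'id': user_id, 'screen_name': 'example_user'}


cache_dir = tempfile.mkdtemp()
first = cached_fetch(cache_dir, 42, fake_fetch)
second = cached_fetch(cache_dir, 42, fake_fetch)  # served from disk
print(len(api_calls))
```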

Step 2. Process twitter data to generate an output file of relationships between twitter accounts

The twitter_network.py script below processes the data collected from the Twitter API and builds an edge list. This list contains the relationships between Twitter accounts. Each edge also carries a weight value: the total number of followers of the first Twitter account, taken from the Twitter API. The weight value is used later when we draw the network graph.

import glob
import os
import json
import sys
from collections import defaultdict

users = defaultdict(lambda: { 'followers': 0 })

for f in glob.glob('twitter-users/*.json'):
    data = json.load(file(f))
    screen_name = data['screen_name']
    users[screen_name] = { 'followers': data['followers_count'] }

SEED = 'TEDxSingapore'

def process_follower_list(screen_name, edges=[], depth=0, max_depth=2):
    f = os.path.join('following', screen_name + '.csv') 

    if not os.path.exists(f):
        return edges

    followers = [line.strip().split('\t') for line in file(f)]

    for follower_data in followers:
        if len(follower_data) < 2:
            continue

        screen_name_2 = follower_data[1]

        # use the number of followers for screen_name as the weight
        weight = users[screen_name]['followers']

        edges.append([screen_name, screen_name_2, weight])

        if depth+1 < max_depth:
            process_follower_list(screen_name_2, edges, depth+1, max_depth)

    return edges

edges = process_follower_list(SEED, max_depth=3)

with open('twitter_network.csv', 'w') as outf:
    edge_exists = {}
    for edge in edges:
        key = ','.join([str(x) for x in edge])
        if not(key in edge_exists):
            outf.write('%s\t%s\t%d\n' % (edge[0], edge[1], edge[2]))
            edge_exists[key] = True

Python file: twitter_network.py

The output generated from this script:

...

TEDxSingapore   trendwatchingAP 12814
adaptev TEDxSingapore   321
IS_magazine     TEDxSingapore   9955
trendwatchingAP TEDxSingapore   678
TEDxSingapore   GuyKawasaki     12814
TEDxSingapore   InnovateAP      12814
TEDxSingapore   InnosightTeam   12814
TEDxSingapore   ScottDAnthony   12814
TEDxSingapore   WorldAndScience 12814
TEDxSingapore   EntMagazine     12814
...  
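
Each line of twitter_network.csv is one edge: the source screen name, the target screen name, and the follower count of the source. The sketch below parses such lines and drops duplicates with a set, similar in spirit to the `edge_exists` dictionary in the script; `parse_edges` is a hypothetical helper. It is written to run on Python 3.

```python
def parse_edges(lines):
    """Parse 'source<TAB>target<TAB>weight' lines, keeping the first copy of each edge."""
    seen = set()
    edges = []
    for line in lines:
        parts = line.strip().split('\t')
        if len(parts) != 3:
            continue  # ignore '...' markers and malformed lines
        edge = (parts[0], parts[1], int(parts[2]))
        if edge not in seen:
            seen.add(edge)
            edges.append(edge)
    return edges


sample = ['TEDxSingapore\ttrendwatchingAP\t12814\n',
          'adaptev\tTEDxSingapore\t321\n',
          'TEDxSingapore\ttrendwatchingAP\t12814\n']  # duplicate of the first line
edges = parse_edges(sample)
print(len(edges))
```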

Step 3. Visualizing the Network using the NetworkX library

We now have all the data we need to generate a network graph. Here are the steps used to visualize the network graph:

  • Create a directed graph (net.DiGraph) containing all the edge data including metadata.
  • Remove nodes based on how connected they are to other nodes in the network (i.e. remove poorly connected nodes)
  • Remove edges that have less than a minimum number of followers
  • Split nodes into two separate categories, ‘TED’ and ‘non-TED’ sets.
  • Render each nodeset
  • Render edges between nodes
  • Render node labels
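
The two pruning steps can be tried on a tiny edge list without NetworkX. The sketch below uses plain dictionaries rather than the graph API used in visualize.py, and it is a simplification: it has no separate thresholds for TED accounts and does not protect the seed node. The `prune` function and the toy edges are hypothetical. It is written to run on Python 3.

```python
from collections import defaultdict


def prune(edges, min_degree, min_weight):
    """Drop nodes whose degree is <= min_degree, then edges lighter than min_weight."""
    degree = defaultdict(int)
    for source, target, _ in edges:
        degree[source] += 1
        degree[target] += 1
    keep = set(n for n in degree if degree[n] > min_degree)
    return [(s, t, w) for s, t, w in edges
            if s in keep and t in keep and w >= min_weight]


# Invented edges: (source, target, follower count of source).
toy_edges = [('a', 'b', 500), ('b', 'c', 50), ('c', 'a', 900), ('c', 'd', 900)]
print(prune(toy_edges, min_degree=1, min_weight=100))
```

In this toy data, node `d` has only one connection and is removed, and the edge from `b` to `c` falls below the weight threshold.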

Here is the code that generates the Twitter network image. It was written in an IPython Notebook, where line 3 held a magic command that causes matplotlib output to be rendered in the browser; in the listing below that line is blank.

import networkx as net
import matplotlib.pyplot as plt

from collections import defaultdict
import math

twitter_network = [ line.strip().split('\t') for line in file('twitter_network.csv') ]

o = net.DiGraph()
hfollowers = defaultdict(lambda: 0)
for (twitter_user, followed_by, followers) in twitter_network:
    o.add_edge(twitter_user, followed_by, followers=int(followers))
    hfollowers[twitter_user] = int(followers)

SEED = 'TEDxSingapore'

# centre around the SEED node and set radius of graph
g = net.DiGraph(net.ego_graph(o, SEED, radius=4))

def trim_degrees_ted(g, degree=1, ted_degree=1):
    g2 = g.copy()
    d = net.degree(g2)
    for n in g2.nodes():
        if n == SEED: continue # don't prune the SEED node
        if d[n] <= degree and not n.lower().startswith('ted'):
            g2.remove_node(n)
        elif n.lower().startswith('ted') and d[n] <= ted_degree:
            g2.remove_node(n)
    return g2

def trim_edges_ted(g, weight=1, ted_weight=10):
    g2 = net.DiGraph()
    for f, to, edata in g.edges_iter(data=True):
        if f == SEED or to == SEED: # keep edges that link to the SEED node
            g2.add_edge(f, to, edata)
        elif f.lower().startswith('ted') or to.lower().startswith('ted'):
            if edata['followers'] >= ted_weight:
                g2.add_edge(f, to, edata)
        elif edata['followers'] >= weight:
            g2.add_edge(f, to, edata)
    return g2

print 'g: ', len(g)
core = trim_degrees_ted(g, degree=235, ted_degree=1)
print 'core after node pruning: ', len(core)
core = trim_edges_ted(core, weight=250000, ted_weight=35000)
print 'core after edge pruning: ', len(core)

nodeset_types = { 'TED': lambda s: s.lower().startswith('ted'), 'Not TED': lambda s: not s.lower().startswith('ted') }

nodesets = defaultdict(list)

for nodeset_typename, nodeset_test in nodeset_types.iteritems():
    nodesets[nodeset_typename] = [ n for n in core.nodes_iter() if nodeset_test(n) ]

pos = net.spring_layout(core) # compute layout 

colours = ['red','green']
colourmap = {}

plt.figure(figsize=(18,18))
plt.axis('off')

# draw nodes
i = 0
alphas = {'TED': 0.6, 'Not TED': 0.4}
for k in nodesets.keys():
    ns = [ math.log10(hfollowers[n]+1) * 80 for n in nodesets[k] ]
    print k, len(ns)
    net.draw_networkx_nodes(core, pos, nodelist=nodesets[k], node_size=ns, node_color=colours[i], alpha=alphas[k])
    colourmap[k] = colours[i]
    i += 1
print 'colourmap: ', colourmap

# draw edges
net.draw_networkx_edges(core, pos, width=0.5, alpha=0.5)

# draw labels
alphas = { 'TED': 1.0, 'Not TED': 0.5}
for k in nodesets.keys():
    for n in nodesets[k]:
        x, y = pos[n]
        plt.text(x, y+0.02, s=n, alpha=alphas[k], horizontalalignment='center', fontsize=9)

Python file: visualize.py

  • Line 7: Load edge data from disk.
  • Lines 9-13: Create a directed graph from the edge data and populate a dictionary with the followers count data.
  • Line 18: Centre and restrict the size of the graph around the SEED node (TEDxSingapore).
  • Lines 20-29: Method to prune the network graph by eliminating nodes that don’t meet the filter criteria.
  • Lines 31-41: Method to prune the network graph by eliminating edges that don’t meet the filter criteria.
  • Lines 44, 46: Remove nodes and edges from the network that don’t meet the filter criteria.
  • Lines 67-73: For each nodeset, draw the nodes; the size of each node is based on the log value of the followers count.
  • Line 76: Draw network edges.
  • Lines 80-83: Draw network labels, using matplotlib directly rather than the net.draw_networkx_labels() method.
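
Node sizes grow with the base-10 logarithm of the follower count, so an account with a million followers is drawn at only about three times the size of one with a hundred. The following small sketch works through the formula used in the listing, `math.log10(followers + 1) * 80`; the `node_size` wrapper name is hypothetical. It is written to run on Python 3.

```python
import math


def node_size(followers, scale=80):
    """Size used for a node: log10(followers + 1) * scale."""
    return math.log10(followers + 1) * scale


for followers in (0, 9, 99, 12814):
    print(followers, round(node_size(followers), 1))
```

Adding 1 before taking the logarithm keeps accounts with zero recorded followers at size 0 instead of raising a math domain error.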

Output from running script in IPython Notebook

g:  119567
core after node pruning:  958
core after edge pruning:  198
Not TED 38
TED 160
colourmap:  {'Not TED': 'red', 'TED': 'green'}

twitter network

See Also:

  • NetworkX library
  • Social Network Analysis for Startups by Maksim Tsvetovat; Alexander Kouznetsov
  • Snowball Sampling





References