Difference between revisions of "Python: Mining Twitter for GamerGate: Visualization"

From OnnoWiki
The last thing the script does is save a GEXF file that we can then read into Gephi. Start Gephi and open the file; we called ours “ggrtages.gexf”.

You’ll get a dialog telling you how many nodes and edges there are in the graph, whether it’s directed or not, and other information and warnings. Click “OK”.

Gephi will import the GEXF file. You can now look at the information it contains by clicking on the “Data Laboratory” button at the top.

Click on the “Overview” button to start working with the network. At first, it doesn’t look like anything, since we haven’t actually run a layout on it. Before we do, we can use the node attributes to color nodes a darker blue based on their age.

We can use the “Ranking” settings to color our nodes. Click on the “Select attribute” popup, and choose “age”.

You can choose different color schemes, change the spline curve used to apply color, etc., from here as well.

Click on the “Apply” button to apply the ranking to the network. The nodes will now be colored rather than gray.

Now, we’re ready to run a layout on our data. From the “Layout” section, let’s choose “ForceAtlas 2”; it’s fast and good at showing relationships in a network.

Press the “Run” button, and let it go for a bit. A network this size (about 10K nodes and 30K edges) settled down on my MacBook Pro in five minutes or less. When you feel it’s stabilized into something interesting, press the “Stop” button, and then click on the “Preview” button at the top.

The preview panel won’t show anything at first. Click the “Refresh” button.

Gephi will render your visualization. You can use the mouse to drag it around, and you can zoom in and out with a scroll-wheel or with the “+” and “-” buttons below.
  
  

Latest revision as of 13:35, 29 January 2017

Here we will extract all of the information we need and use NetworkX to build a directed graph, which we then visualize in Gephi to see who is retweeting whom. Along the way we record each account's age in days and each user's follower count, so that we can filter on those factors later if we like.

Install NetworkX:

sudo pip install networkx

Don't forget to download and install Gephi as well.

We assume the tweets have been collected in “gamergate.txt”. The following script reads the data from that text file and loads it into a new data frame:

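import json
import pandas as pd
from time import gmtime, mktime, strptime

# Path to the file of collected tweets, one JSON object per line
tweets_data_path = "gamergate.txt"

tweets_data = []
with open(tweets_data_path, "r") as tweets_file:
    for line in tweets_file:
        try:
            tweets_data.append(json.loads(line))
        except ValueError:
            # Skip blank or malformed lines
            continue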
#
# Clean out limit messages, etc.: keep only entries that have both a
# 'user' and a 'text' field. Removing items from a list while iterating
# over it skips elements, so we build a new list instead.
#
tweets_data = [tweet for tweet in tweets_data
               if isinstance(tweet, dict) and 'user' in tweet and 'text' in tweet]

#
# See how many we wound up with
#
print(len(tweets_data))
 
#
# Pull the data we're interested in out of the Twitter data we captured
#
TIME_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"
SECONDS_PER_DAY = 60 * 60 * 24
rows_list = []
now = mktime(gmtime())
for tweet in tweets_data:
    rtauthor = ""
    rtage = rtfollowers = 0
    #
    # Get the author of this tweet; skip entries without one
    #
    try:
        author = tweet['user']['screen_name']
    except (KeyError, TypeError):
        continue
    #
    # If it was a retweet, also get the original author and save the
    # original author's follower count and account age in days
    #
    retweeted = tweet.get('retweeted_status')
    if retweeted:
        rtuser = retweeted['user']
        rtauthor = rtuser['screen_name']
        rtage = int(now - mktime(strptime(rtuser['created_at'], TIME_FORMAT))) // SECONDS_PER_DAY
        rtfollowers = rtuser['followers_count']
    #
    # If this was a reply, save the screen name being replied to
    #
    reply_to = tweet.get('in_reply_to_screen_name') or ""
    #
    # Calculate the age, in days, of this Twitter ID
    #
    age = int(now - mktime(strptime(tweet['user']['created_at'], TIME_FORMAT))) // SECONDS_PER_DAY
    #
    # Grab this ID's follower count and the text of the tweet
    #
    followers = tweet['user']['followers_count']
    text = tweet.get('text', "")
    #
    # Construct a row, add it to our list
    #
    rows_list.append({'author': author, 'reply_to': reply_to, 'age': age, 'followers': followers, 'retweet_of': rtauthor, 'rtfollowers': rtfollowers, 'rtage': rtage, 'text': text})

#
# When we've processed all the tweets, build the DataFrame from the rows
# we've collected
#
tweets = pd.DataFrame(rows_list)
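
The age calculation in the script relies on mktime, which interprets a time tuple as local time. As a hedged alternative, here is a minimal sketch that keeps everything in UTC with calendar.timegm. The helper name account_age_days and its optional now parameter are illustrative assumptions, not part of the original script.

```python
from calendar import timegm
from time import strptime, time

# Format of Twitter's 'created_at' field, as used in the script above
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"
SECONDS_PER_DAY = 60 * 60 * 24


def account_age_days(created_at, now=None):
    """Return the whole number of days between created_at (UTC) and now.

    account_age_days is a hypothetical helper, not part of the original script.
    """
    if now is None:
        now = time()  # seconds since the epoch, independent of the local time zone
    created = timegm(strptime(created_at, TWITTER_TIME_FORMAT))
    return int(now - created) // SECONDS_PER_DAY


# Use a fixed "now" so the result does not depend on today's date
fixed_now = timegm(strptime("Sat Oct 11 12:00:00 +0000 2014", TWITTER_TIME_FORMAT))
print(account_age_days("Wed Oct 01 00:00:00 +0000 2014", now=fixed_now))  # prints 10
```

Passing now explicitly also makes it easy to compute every age against the same reference time, as the script does with its single now value.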

Here’s a script that will iterate through the dataframe, row by row, and construct a directed graph of who’s retweeting whom. Each directed edge represents the relationship “is retweeted by”: an edge runs from the original author A to the retweeter B, and the higher its weight, the more often A has been retweeted by B. Each node represents an individual ID on Twitter, and has attributes to track the number of followers and the age of the ID in days.

import networkx as nx

#
# Create a new directed graph
#
J = nx.DiGraph()
#
# Iterate through the rows of our dataframe
#
for index, row in tweets.iterrows():
    #
    # Gather the data out of the row
    #
    this_user_id = row['author']
    author = row['retweet_of']
    #
    # Is the sender of this tweet in our network?
    # (Attributes are passed as keyword arguments; the attr_dict
    # parameter was removed in NetworkX 2.0.)
    #
    if this_user_id not in J:
        J.add_node(this_user_id, followers=int(row['followers']), age=int(row['age']))
    #
    # If this is a retweet, is the original author a node?
    #
    if author != "" and author not in J:
        J.add_node(author, followers=int(row['rtfollowers']), age=int(row['rtage']))
    #
    # If this is a retweet, add an edge from the original author to the
    # retweeter, or bump the weight if the edge already exists.
    #
    if author != "":
        if J.has_edge(author, this_user_id):
            J[author][this_user_id]['weight'] += 1
        else:
            J.add_weighted_edges_from([(author, this_user_id, 1.0)])

nx.write_gexf(J, 'ggrtages.gexf')
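
The follower and age attributes stored on each node can also be used to filter the graph in NetworkX itself, before or instead of filtering in Gephi. The following is only a sketch on a hypothetical three-node toy graph standing in for J; the account names and the helpers filter_by_age and times_retweeted are made up for illustration.

```python
import networkx as nx

# Hypothetical toy graph standing in for J; edges run from the original
# author to the retweeter, as in the script above
G = nx.DiGraph()
G.add_node("alice", followers=5000, age=900)
G.add_node("bob", followers=40, age=12)
G.add_node("carol", followers=800, age=300)
G.add_weighted_edges_from([
    ("alice", "bob", 3.0),    # bob retweeted alice three times
    ("alice", "carol", 1.0),  # carol retweeted alice once
    ("carol", "bob", 2.0),    # bob retweeted carol twice
])


def filter_by_age(graph, min_age_days):
    """Return a copy of the subgraph of accounts at least min_age_days old."""
    keep = [n for n, data in graph.nodes(data=True)
            if data.get("age", 0) >= min_age_days]
    return graph.subgraph(keep).copy()


def times_retweeted(graph):
    """Map each account to the total weight of its outgoing edges."""
    return dict(graph.out_degree(weight="weight"))


older = filter_by_age(G, 30)
print(sorted(older.nodes()))  # accounts at least 30 days old
print(times_retweeted(G))     # how often each account was retweeted
```

Because edges point from the original author to the retweeter, the weighted out-degree counts how often an account was retweeted, while the weighted in-degree would count how often it retweeted others.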

The last line of the script saves a GEXF file that we can then read into Gephi. Start Gephi and open the file; we called ours “ggrtages.gexf”.
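
Before opening the file in Gephi, it can be worth checking that the export round-trips. The sketch below writes a hypothetical two-node graph to a temporary file (the file name sample.gexf and the account names are assumptions for illustration) and reads it back with nx.read_gexf. The same check could be pointed at “ggrtages.gexf”.

```python
import os
import tempfile

import networkx as nx

# Hypothetical two-node graph standing in for the real retweet graph
G = nx.DiGraph()
G.add_node("alice", followers=5000, age=900)
G.add_node("bob", followers=40, age=12)
G.add_weighted_edges_from([("alice", "bob", 3.0)])

# Write to a temporary location so the example leaves the working directory alone
path = os.path.join(tempfile.mkdtemp(), "sample.gexf")
nx.write_gexf(G, path)

# Read the file back and report the same basics Gephi shows on import
H = nx.read_gexf(path)
print(H.is_directed(), H.number_of_nodes(), H.number_of_edges())
print(H["alice"]["bob"]["weight"])
```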


References