Difference between revisions of "Python: Collect Twitter follower network with twecoll"

From OnnoWiki
 
(8 intermediate revisions by the same user not shown)
Line 1: Line 1:
How to collect any Twitter follower network with the Python script twecoll
+
To collect a Twitter follower network, you can use the Python script twecoll.
Learn how to create a graph file for visualization in Gephi
+
Gephi is then used to visualize it.
  
This guide is written so that complete beginners can use it. You will be exposed to the command line and the programming language Python. You don’t need to know them, as I will explain step by step what to do. I still recommend learning about them, because it will make your life easier.
+
The steps to follow:
  
I published an analysis of all verified Twitter accounts and a guide how to visualize networks with Gephi. To enable people to collect their own networks and not rely on others, I wrote this article.
+
* Get the Twitter API key and secret
 +
* Download the twecoll script
 +
* Configure the script
 +
* Run the script to collect the network and create the graph.
  
What are we going to do?
 
 
* Get Twitter API keys
 
* Download a customized version of the Python script twecoll
 
* Configure the script
 
* Run the script to collect the network and create the graph file
 
 
While twecoll is capable of several things, in this article I will only explain how to use it to collect the network of accounts someone follows and the network of accounts which follow someone.
 
  
 
==Prerequisites: Python and the command line==
 
==Prerequisites: Python and the command line==
  
On OSX and most Linux distributions Python comes pre-installed. On Windows you need to install it yourself. The script uses Python 2.7. Python Download.
+
To run twecoll, we need:
  
To see if you have Python installed and which version, you can use the command line. On OSX and Linux you can use the pre-installed Terminal; on Windows I recommend using Windows PowerShell, which is pre-installed as well.
+
* python
 +
* gephi
 +
* sudo pip install igraph
  
Because we will need the command line in multiple situations in this guide, you should make yourself familiar with it. Don’t fear it, it works quite similar to a messenger bot. The command line tells you where it is and you can enter commands by typing them and sending them with enter.
+
==Getting Twitter API keys==
  
First we type python to see what version is defined as default. You now see the Python version. In my case it’s 2.7.12. The command line looks different as well. There is now >>> in front of the cursor. That’s because we are now inside a Python shell. We could write Python interactively here. But we are going to work with a script and therefore leave the Python shell with the command exit().
+
We need to authenticate with Twitter. This is free and can be done via the web.
  
If you get an error saying that Python is not recognized as a command, you probably need to add it to your path. Because you already have PowerShell open, simply enter the following command and press Enter. Then close PowerShell and open it again:
+
* Go to apps.twitter.com
[Environment]::SetEnvironmentVariable("Path", "$env:Path;C:\Python27", "User")
+
* Create a new app (top right)
 +
* Fill in Name, Description, Website (any valid URL) and check the Developer Agreement.
 +
* Switch to the ‘Keys and Access Tokens’ tab.
  
Next we learn to move around in the file system to be able to locate our script later on. To do that we use the command cd (change directory). We always see the path of our location in front of the command we are typing. The command line supports auto complete. Simply press tab after you entered the first few characters. You can either use relative or absolute paths after the cd command. To see the content of a directory, we can use the command ls.
+
Note: do not share these keys.
Going to an absolute path, showing the content of the directory, going to a relative path
 
  
Congratulations, now you know some command line basics.
 
  
==1. Getting Twitter API keys==
+
==2. Download twecoll==
  
The Twitter API is the programming interface our script will use to get data from Twitter. To use the Twitter API the client needs to be authenticated. Similar to how you log in to Twitter to use it. To do that we need to register an app first. This is free.
+
Download twecoll:
  
* Go to apps.twitter.com
+
* version modified by Luca https://gist.githubusercontent.com/lucahammer/8ce36416fd4c8f6ddf4783c63e4bfa66/raw/ffaae52c5ca01a9f85bc6943dc6597ece10bae4c/twecoll
* Create new app (top right)
+
* original version https://github.com/jdevoo/twecoll
* Fill out Name, Description, Website (any valid URL is fine) and check the Developer Agreement after you have read it.
 
* Switch to the ‘Keys and Access Tokens’ Tab.
 
  
We will need the Consumer Key and the Consumer Secret later. Either leave the site open or copy the keys somewhere so you can use them later. You could change the app to read-only to make sure that you don’t unintentionally delete tweets or something like that. (The script is capable of deleting likes.)
 
  
Sidenote: Do not share these. People will abuse them. If you still do, like I did above, regenerate them to make the leaked version useless.
+
==3. Configure the script==
 
 
==2. Download twecoll==
 
  
For this tutorial I created a modified version of twecoll. I added an export to gdf option. This produces smaller files and Gephi will properly recognize every attribute. I also changed where twecoll saves the API keys. With the modified version they are in the same directory as the script instead of the user folder. Feel free to use the original twecoll. The GML it produces works with Gephi, but you may need to adjust types of values.
 
  
* Download my modified version https://gist.githubusercontent.com/lucahammer/8ce36416fd4c8f6ddf4783c63e4bfa66/raw/ffaae52c5ca01a9f85bc6943dc6597ece10bae4c/twecoll
+
The script has several parameters.
* Download the original twecoll https://github.com/jdevoo/twecoll
+
FMAX is the one you may need to change.
 +
FMAX defines the maximum number of accounts a node may follow for its data to still be collected.
  
After downloading, unzip it and place the file twecoll in a folder you will be able to navigate through via the command line. I created a folder twecoll in my user directory and put it there: C:\Users\Luca\twecoll.
+
For a first experiment it is best to set FMAX to a small value, for example 20.
  
==3. Configure the script==
+
==Using twecoll==
  
The only option you may want to change is FMAX. It defines the maximum number of followings an account may have for its data to still be collected. This applies not to the account whose network you collect but to each of the nodes. For smaller accounts I recommend 5000 because the Twitter API can give you up to 5000 IDs with one call. For bigger accounts you probably want to use something smaller to make the network more manageable. It’s a theoretical decision as well. What does it mean when someone follows that many people? Is this still a meaningful relation or is it just noise?
+
Run
  
==4. Run twecoll to collect data==
+
python twecoll.py -h
  
When the script runs for the first time, it will ask you for the consumer key and consumer secret. These are the two strings we got when we created the Twitter app. For future uses this isn’t necessary because twecoll saves the data in a file with the name .twecoll. If you want to start again, simply delete that file. If you use my modified version, it will be in the same directory, else it will be in the user directory.
+
The output
  
We will use the script with the init, fetch and edgelist arguments.
+
usage: twecoll.py [-h] [-s] [-v]
 +
                  {resolve,init,fetch,tweets,likes,edgelist} ...
 +
 +
Twitter Collection Tool
 +
 +
optional arguments:
 +
  -h, --help            show this help message and exit
 +
  -s, --stats          show Twitter throttling stats and exit
 +
  -v, --version        show program's version number and exit 
 +
 +
sub-commands:
 +
  {resolve,init,fetch,tweets,likes,edgelist}
 +
    resolve            retrieve user_id for screen_name or vice versa
 +
    init                retrieve friends data for screen_name
 +
    fetch              retrieve friends of handles in .dat file
 +
    tweets              retrieve tweets
 +
    likes              retrieve likes
 +
    edgelist            generate graph in GML format
  
===First run===
 
  
We need to cd to the location where we put the twecoll file. In my case I only need to enter cd twecoll because that’s the folder where I put the script. To run the script we need to tell the computer that Python should be used, then we say that the script should be run and finally we add the arguments for the script. For example
 
  
python twecoll init luca
+
==4. Run twecoll to collect data==
  
Because this is the first run we will be prompted to enter the consumer key and secret of our Twitter app. The script will then generate a URL we need to open with a web browser, where we authorize the script to access our account. Twitter then gives us a PIN which we give to the script. It then starts the data collection.
+
On the first run, twecoll asks for the consumer key and consumer secret. These are usually saved in the file .twecoll.
  
Don’t use Ctrl+C to copy in PowerShell. Ctrl+C aborts the running process in PowerShell. This is useful for stopping the script at any time, but it causes trouble when you want to copy the URL. You can either use the menu by clicking on the PowerShell symbol at the top left or press Enter to copy the highlighted text. If nothing is highlighted, Enter will send the command you are writing at the moment. To paste text you can either use the menu or simply click your right mouse button.
+
Create the folders:
  
Authorizing twecoll with Twitter
+
mkdir fdat
What you see when copying the URL to your browser.
+
mkdir img
The PIN you get after authorizing twecoll in the browser.
 
  
You can use python twecoll -h to see all available commands. Use python twecoll COMMAND -h to get help for a certain command. You need to replace COMMAND with for example init. I will give you more information on the commands you will be using to collect Twitter networks.
+
===First run===
init command
 
  
This initializes the data collection process. It needs at least one argument, the screen name of the person whose network you want to collect:
+
cd to the location of twecoll.
 +
Then run, for example (here the account being evaluated is luca):
  
 
  python twecoll init luca
 
  python twecoll init luca
  
By default twecoll init will collect the list of people the account follows. If you want to look at the followers, you need to add the -o argument or --followers.
+
By default, twecoll collects the accounts that luca follows.
It should look like this:
+
 
 +
To do the opposite and collect the accounts that follow luca, the init command is:
  
 
  python twecoll init luca -o
 
  python twecoll init luca -o
  
Or:
+
or:
  
 
  python twecoll init luca --followers
 
  python twecoll init luca --followers
  
The init command creates a file called SCREENNAME.dat with information about the accounts followed by or following the account you specified. It will also create a folder img, if it doesn’t exist already, and put the avatars of the accounts it collects data on there.
+
This command creates a file SCREENNAME.dat containing the accounts followed by, or following, the account being evaluated.
 
 
Once the init command finished collecting all accounts, it will write ‘Done.’.
 
  
 
===fetch command===
 
===fetch command===
  
The fetch command goes through all the accounts collected by init and collects the IDs of the accounts followed by each of them. Again, we need to specify for which account we want to fetch the followings of their followings/followers.
+
The fetch command goes through all accounts collected by init and collects the IDs of the accounts each of them follows.
 
+
Example:
In my example I would enter:
 
  
 
  python twecoll fetch luca
 
  python twecoll fetch luca
  
If it doesn’t exist, it creates a folder ‘fdat’. For each account whose following IDs it collects, it creates a file named after the account’s ID followed by .f. Inside, it saves the IDs, one per line, without additional information, because only those that exist in the .dat file are relevant.
+
The results are saved in the folder fdat.
 
 
The way fetch works makes it quite robust. If it stops for whatever reason (no internet, computer turned off, manually stopped) you can restart it with the same command and it will take up again where it stopped.
 
 
 
Sidenote: In the PowerShell/Terminal you can press the arrow keys to move through your command history.
 
 
 
If the .f file for a certain ID already exists, the script will skip that ID. This enables the pausing of data collection and helps when you create multiple networks where the same IDs are needed. But over time the files may become outdated. I recommend deleting the directory if you haven’t used the script for some time.
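The skip-if-present behaviour described above can be sketched in a few lines of Python 3. This is a minimal illustration, not twecoll's actual code: the function name `ids_still_needed` and the example IDs are made up, and only the `<ID>.f` naming is taken from the text.

```python
import os
import tempfile

def ids_still_needed(all_ids, fdat_dir):
    """Return the IDs that do not yet have an <ID>.f file in fdat_dir."""
    return [
        account_id
        for account_id in all_ids
        if not os.path.exists(os.path.join(fdat_dir, account_id + ".f"))
    ]

# Demo in a temporary directory with made-up IDs:
# ID "1" was already fetched, so only "2" and "3" remain.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "1.f"), "w").close()
    pending = ids_still_needed(["1", "2", "3"], tmp)

print(pending)
```

Because the check looks only at whether the file exists, an old file is reused no matter how stale it is, which is why deleting the directory from time to time makes sense.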
 
  
 
===edgelist command===
 
===edgelist command===
  
After we have all the necessary data, we need to combine it. We do that by using the edgelist command. By default it creates a .gml file and tries to create a visualization with the Python package igraph. I don’t use igraph. If it isn’t installed, the script will skip this and print “Visualization skipped. No module named igraph”. This is fine. The .gml is still created.
+
After all data has been collected, we combine it using the edgelist command.
 +
The result is a .gml file.
 +
If the Python package igraph is installed, the script also draws a visualization.
 +
The command to use is:
  
I like to use the -m or --missing argument to include accounts the script was unable to collect data on: python twecoll edgelist luca -m
+
python twecoll edgelist luca -g -m
  
If you use my modified version, you can use the -g or --gdf argument to create a .gdf file:
 
  
  python twecoll edgelist luca -g -m
+
  -m --missing  include accounts whose data could not be collected.
 +
-g --gdf create a .gdf file.
  
The order of the arguments is irrelevant.
+
We then have one of the following, depending on whether -g was used:
  
You now have either a SCREENNAME.gml or SCREENNAME.gdf file in the directory of the script, which you can open with Gephi or another tool that supports these formats.
+
SCREENNAME.gml
 +
SCREENNAME.gdf
  
I recommend my guide how to visualize Twitter networks with Gephi, which starts where you are now:
+
These files can be opened with Gephi.
  
 
==Guide: Analyzing Twitter Networks with Gephi 0.9.1==
 
==Guide: Analyzing Twitter Networks with Gephi 0.9.1==

Latest revision as of 14:51, 28 January 2017

To collect a Twitter follower network, you can use the Python script twecoll. Gephi is then used to visualize it.

The steps to follow:

  • Get the Twitter API key and secret
  • Download the twecoll script
  • Configure the script
  • Run the script to collect the network and create the graph.


Prerequisites: Python and the command line

To run twecoll, we need:

  • python
  • gephi
  • sudo pip install igraph

Getting Twitter API keys

We need to authenticate with Twitter. This is free and can be done via the web.

  • Go to apps.twitter.com
  • Create a new app (top right)
  • Fill in Name, Description, Website (any valid URL) and check the Developer Agreement.
  • Switch to the ‘Keys and Access Tokens’ tab.

Note: do not share these keys.


2. Download twecoll

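Download twecoll:

  • version modified by Luca https://gist.githubusercontent.com/lucahammer/8ce36416fd4c8f6ddf4783c63e4bfa66/raw/ffaae52c5ca01a9f85bc6943dc6597ece10bae4c/twecoll
  • original version https://github.com/jdevoo/twecoll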


3. Configure the script

The script has several parameters; FMAX is the one you may need to change. FMAX defines the maximum number of accounts a node may follow for its data to still be collected.

For a first experiment it is best to set FMAX to a small value, for example 20.
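To make the effect of FMAX concrete, here is a minimal sketch of such a cutoff. It is not taken from twecoll: the example records, the function name `within_fmax` and the choice of an inclusive comparison are assumptions for illustration only.

```python
# Hypothetical illustration of an FMAX-style cutoff; not twecoll's actual code.
FMAX = 20  # accounts following more than this many others are skipped

# Assumed example records: (screen_name, number of accounts followed)
accounts = [("alice", 12), ("bob", 4800), ("carol", 20), ("dave", 21)]

def within_fmax(accounts, fmax):
    """Return the accounts whose following count does not exceed fmax."""
    return [(name, count) for name, count in accounts if count <= fmax]

kept = within_fmax(accounts, FMAX)
print(kept)
```

With a small FMAX most of the heavy followers drop out, so a test run finishes quickly. Raising the value later brings those accounts back into the network.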

Using twecoll

Run

python twecoll.py -h

The output

usage: twecoll.py [-h] [-s] [-v]
                  {resolve,init,fetch,tweets,likes,edgelist} ...

Twitter Collection Tool

optional arguments:
  -h, --help            show this help message and exit
  -s, --stats           show Twitter throttling stats and exit
  -v, --version         show program's version number and exit  

sub-commands:
  {resolve,init,fetch,tweets,likes,edgelist}
    resolve             retrieve user_id for screen_name or vice versa
    init                retrieve friends data for screen_name
    fetch               retrieve friends of handles in .dat file
    tweets              retrieve tweets
    likes               retrieve likes
    edgelist            generate graph in GML format 


4. Run twecoll to collect data

On the first run, twecoll asks for the consumer key and consumer secret. These are usually saved in the file .twecoll.

Create the folders:

mkdir fdat
mkdir img

First run

cd to the location of twecoll. Then run, for example (here the account being evaluated is luca):

python twecoll init luca

By default, twecoll collects the accounts that luca follows.

To do the opposite and collect the accounts that follow luca, the init command is:

python twecoll init luca -o

or:

python twecoll init luca --followers

This command creates a file SCREENNAME.dat containing the accounts followed by, or following, the account being evaluated.

fetch command

The fetch command goes through all accounts collected by init and collects the IDs of the accounts each of them follows. Example:

python twecoll fetch luca

The results are saved in the folder fdat.
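As a rough picture of what ends up in fdat, the following Python 3 sketch turns such files into a list of edges. It assumes the layout described in the original guide (one file per account, named `<ID>.f`, with one followed ID per line); the function name `load_edges` and the demo IDs are hypothetical, and this is not how twecoll itself is implemented.

```python
import os
import tempfile

def load_edges(fdat_dir, known_ids):
    """Build (source, target) pairs from <ID>.f files, keeping only known IDs."""
    edges = []
    for filename in sorted(os.listdir(fdat_dir)):
        if not filename.endswith(".f"):
            continue
        source = filename[:-2]  # strip the ".f" suffix to get the account ID
        if source not in known_ids:
            continue
        with open(os.path.join(fdat_dir, filename)) as handle:
            for line in handle:
                target = line.strip()
                # Only IDs that are part of the collected network are relevant.
                if target in known_ids:
                    edges.append((source, target))
    return edges

# Demo with made-up IDs in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "1.f"), "w") as handle:
        handle.write("2\n3\n99\n")  # 99 is outside the network and is dropped
    with open(os.path.join(tmp, "2.f"), "w") as handle:
        handle.write("1\n")
    edges = load_edges(tmp, known_ids={"1", "2", "3"})

print(edges)
```

Filtering against the known IDs keeps the graph restricted to the accounts collected by init, which is why the files need nothing more than bare IDs.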

edgelist command

After all data has been collected, we combine it using the edgelist command. The result is a .gml file. If the Python package igraph is installed, the script also draws a visualization. The command to use is:

python twecoll edgelist luca -g -m


-m --missing  include accounts whose data could not be collected.
-g --gdf create a .gdf file.

We then have one of the following, depending on whether -g was used:

SCREENNAME.gml
SCREENNAME.gdf

These files can be opened with Gephi.
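For readers curious what a .gdf file looks like inside, here is a minimal sketch that writes one by hand. The `nodedef>` and `edgedef>` header lines follow Gephi's GDF format; the chosen columns, the function name `to_gdf` and the sample data are assumptions, and the file twecoll produces will carry more attributes than this.

```python
def to_gdf(nodes, edges):
    """Return GDF text for (id, label) nodes and directed (source, target) edges."""
    lines = ["nodedef>name VARCHAR,label VARCHAR"]
    lines += ["{},{}".format(node_id, label) for node_id, label in nodes]
    lines.append("edgedef>node1 VARCHAR,node2 VARCHAR,directed BOOLEAN")
    lines += ["{},{},true".format(source, target) for source, target in edges]
    return "\n".join(lines) + "\n"

# Made-up sample data: alice (ID 1) follows bob (ID 2).
nodes = [("1", "alice"), ("2", "bob")]
edges = [("1", "2")]
gdf_text = to_gdf(nodes, edges)
print(gdf_text)
```

The column types are declared in the header lines, which is what lets Gephi pick up each attribute with the right type on import.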

Guide: Analyzing Twitter Networks with Gephi 0.9.1

This is by no means a complete guide, but a starting point for people who want to analyze Twitter networks with Gephi. (medium.com)

HTTPError 429: Waiting for 15m to resume

Welcome to the world of API limits. Twitter allows 15 API calls per 15 minutes per user per app. Read more on their API rate limits. The script waits for 15 minutes whenever Twitter returns an error saying that the limit has been reached. Depending on the number of followings/followers, it can take anywhere from several hours to days to collect all the data. For that reason I run the script on a Raspberry Pi 2, so I don’t have to leave my computer running all the time.
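A back-of-the-envelope estimate of the waiting time can be sketched as follows. It uses the limit stated above (15 calls per 15 minutes) and assumes one API call per account, which is plausible only as long as each account follows no more than the number of IDs a single call can return. The function name `estimated_wait_minutes` is made up, and real runs will take longer because of network time and retries.

```python
import math

CALLS_PER_WINDOW = 15  # from the text: 15 API calls ...
WINDOW_MINUTES = 15    # ... per 15 minutes

def estimated_wait_minutes(n_accounts, calls_per_account=1):
    """Rough lower bound on the waiting time caused by the rate limit."""
    total_calls = n_accounts * calls_per_account
    windows = int(math.ceil(total_calls / float(CALLS_PER_WINDOW)))
    # No waiting is needed after the final window.
    return max(windows - 1, 0) * WINDOW_MINUTES

minutes = estimated_wait_minutes(1000)
print(minutes / 60.0, "hours")
```

Under these assumptions a network of 1000 accounts already needs about 16.5 hours of waiting, which matches the "several hours to days" range mentioned above.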

Update: Thanks to a comment by Jonáš Jančařík, I was able to improve the twecoll code to collect the initial list of accounts with up to 100 fewer API calls. The modified version linked above already has the improvement. I created a pull request for the original version as well, so it should soon be available to everyone.



Reference