Python: Collect Twitter follower network with twecoll

Revision as of 14:41, 28 January 2017

To collect a Twitter follower network we can use the Python script twecoll. In addition, Gephi is used to visualize the result.

The steps required are:

  • Get Twitter API keys
  • Download a customized version of the Python script twecoll
  • Configure the script
  • Run the script to collect the network and create the graph file


Prerequisites: Python and the command line

To run twecoll, we need:

  • python
  • gephi
  • sudo pip install igraph

1. Getting Twitter API keys

We need authentication credentials from Twitter. These can be obtained for free via the web.


  • Go to apps.twitter.com
  • Create new app (top right)
  • Fill out Name, Description, Website (any valid URL is fine) and check the Developer Agreement after you have read it.
  • Switch to the ‘Keys and Access Tokens’ Tab.

Sidenote: Do not share these. People will abuse them. If you still do, like I did above, regenerate them to make the leaked version useless.

2. Download twecoll

Download the twecoll script.


3. Configure the script

The option that needs changing is FMAX. FMAX defines how many accounts we can collect data for.

For a first trial it is best to set FMAX to a small value, e.g. 20.
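To make the role of FMAX concrete, here is a small sketch of an FMAX-style cutoff. This is an illustration only, not twecoll's actual code; the exact way twecoll applies the limit is an assumption here, and the account names and counts are made up.

```python
# Illustration of an FMAX-style cutoff (assumption: accounts above the
# threshold are skipped so large accounts don't eat the API budget).
FMAX = 20  # small value for a first trial run

def should_fetch(friend_count, fmax=FMAX):
    """Return True if an account is small enough to collect."""
    return friend_count <= fmax

# Hypothetical accounts with their friend counts
accounts = {"alice": 12, "bob": 5000, "carol": 19}
to_fetch = [name for name, n in accounts.items() if should_fetch(n)]
print(to_fetch)  # only the accounts under the FMAX threshold
```

With FMAX set to 20, only the small accounts pass the cutoff; raising FMAX later collects more data at the cost of more API calls.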

Using twecoll

Run

python twecoll.py -h

The output:

usage: twecoll.py [-h] [-s] [-v]
                  {resolve,init,fetch,tweets,likes,edgelist} ...

Twitter Collection Tool

optional arguments:
  -h, --help            show this help message and exit
  -s, --stats           show Twitter throttling stats and exit
  -v, --version         show program's version number and exit  

sub-commands:
  {resolve,init,fetch,tweets,likes,edgelist}
    resolve             retrieve user_id for screen_name or vice versa
    init                retrieve friends data for screen_name
    fetch               retrieve friends of handles in .dat file
    tweets              retrieve tweets
    likes               retrieve likes
    edgelist            generate graph in GML format 
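The help text above corresponds to a standard argparse layout with sub-commands. The following is a rough sketch of such a skeleton, not twecoll's actual source; the `screen_name` positional argument and the version string are assumptions.

```python
import argparse

# Sketch of a CLI skeleton matching the help output above
# (not twecoll's source; positional args and version are assumptions).
parser = argparse.ArgumentParser(prog="twecoll.py",
                                 description="Twitter Collection Tool")
parser.add_argument("-s", "--stats", action="store_true",
                    help="show Twitter throttling stats and exit")
parser.add_argument("-v", "--version", action="version",
                    version="0.0 (placeholder)")

sub = parser.add_subparsers(dest="command", title="sub-commands")
for name, desc in [
    ("resolve", "retrieve user_id for screen_name or vice versa"),
    ("init", "retrieve friends data for screen_name"),
    ("fetch", "retrieve friends of handles in .dat file"),
    ("tweets", "retrieve tweets"),
    ("likes", "retrieve likes"),
    ("edgelist", "generate graph in GML format"),
]:
    p = sub.add_parser(name, help=desc)
    p.add_argument("screen_name", nargs="?")  # assumed positional

args = parser.parse_args(["fetch", "luca"])
print(args.command, args.screen_name)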


4. Run twecoll to collect data

On its first run, twecoll will ask for the consumer key and consumer secret. This data is then stored in the file .twecoll.

Create the folders:

mkdir fdat
mkdir img

First run

cd to the location of twecoll. Then run, for example (for the user ID being evaluated, luca):

python twecoll init luca

By default, twecoll collects the names of the accounts that luca follows.

If we want the opposite, i.e. to check who follows luca, the init command is

python twecoll init luca -o

or:

python twecoll init luca --followers

This command creates a file SCREENNAME.dat containing the accounts followed by / following the account we are evaluating.

fetch command

The fetch command will go through all accounts collected by init and collect the IDs of the accounts they follow. For example,

python twecoll fetch luca

The results are saved in the fdat folder. If it doesn't exist, twecoll creates the folder fdat, and for each account whose following IDs it collects, it creates a file named after the account's ID with the extension .f. Inside it saves the IDs, one per line, without additional information, because only the IDs that also appear in the .dat file are relevant.

The way fetch works makes it quite robust. If it stops for whatever reason (no internet, computer turned off, manually stopped), you can restart it with the same command and it will pick up where it stopped.

Sidenote: In the PowerShell/Terminal you can press the arrow keys to move through your command history.

If the .f file for a certain ID already exists, the script will skip that ID. This enables pausing data collection and helps when you create multiple networks where the same IDs are needed. But over time the files may become outdated; I recommend deleting the directory if you haven't used the script for some time.
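The skip-and-resume behaviour of fetch can be sketched in a few lines. This is an illustration of the idea, not twecoll's actual code; `fetch_ids` and `get_following` are made-up names, with a stand-in function in place of the real Twitter API call.

```python
import os
import tempfile

# Sketch of fetch's skip/resume behaviour (illustration, not twecoll's code):
# write fdat/<id>.f once per account, skip IDs already on disk on re-runs.
def fetch_ids(ids, fdat, get_following):
    os.makedirs(fdat, exist_ok=True)     # create fdat/ if it doesn't exist
    fetched = []
    for uid in ids:
        path = os.path.join(fdat, f"{uid}.f")
        if os.path.exists(path):         # already collected: skip (enables resume)
            continue
        with open(path, "w") as f:
            f.write("\n".join(get_following(uid)))  # one ID per line
        fetched.append(uid)
    return fetched

fdat = os.path.join(tempfile.mkdtemp(), "fdat")
fake_api = lambda uid: ["111", "222"]    # stand-in for the Twitter API call
print(fetch_ids(["1", "2"], fdat, fake_api))       # first run fetches both IDs
print(fetch_ids(["1", "2", "3"], fdat, fake_api))  # re-run fetches only the new ID
```

The second call touches only the ID without an existing .f file, which is exactly why an interrupted run can be restarted with the same command.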

edgelist command

After we have all the necessary data, we need to combine it. We do that by using the edgelist command. By default it creates a .gml file and tries to create a visualization with the Python package igraph. I don't use igraph. If it isn't installed, the script will skip this and print "Visualization skipped. No module named igraph". This is fine; the .gml is still created.

I like to use the -m or --missing argument to include accounts the script was unable to collect data on:

python twecoll edgelist luca -m

If you use my modified version, you can use the -g or --gdf argument to create a .gdf file:

python twecoll edgelist luca -g -m

The order of the arguments is irrelevant.

You now have either a SCREENNAME.gml or SCREENNAME.gdf file in the directory of the script, which you can open with Gephi or another tool that supports these formats.
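To give a sense of what the generated .gml file contains, here is a minimal edge-list-to-GML conversion in plain Python. It is a sketch of the file format only, not twecoll's implementation, and the node names are made up.

```python
# Minimal edge list -> GML conversion (a format sketch, not twecoll's code).
def to_gml(nodes, edges):
    lines = ["graph ["]
    for i, name in enumerate(nodes):
        lines += ["  node [", f"    id {i}", f'    label "{name}"', "  ]"]
    index = {name: i for i, name in enumerate(nodes)}
    for src, dst in edges:
        lines += ["  edge [", f"    source {index[src]}",
                  f"    target {index[dst]}", "  ]"]
    lines.append("]")
    return "\n".join(lines)

# Hypothetical two-node network: luca follows alice
gml = to_gml(["luca", "alice"], [("luca", "alice")])
print(gml)
```

A file in this shape (nodes with id/label, edges with source/target indices) is what Gephi reads when you open the .gml.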

I recommend my guide on how to visualize Twitter networks with Gephi, which starts where you are now:

Guide: Analyzing Twitter Networks with Gephi 0.9.1

This is by no means a complete guide, but a starting point for people who want to analyze Twitter networks with Gephi.

HTTPError 429: Waiting for 15m to resume

Welcome to the world of API limits. Twitter allows 15 API calls per 15 minutes per user per app; read more in their API rate limit documentation. The script waits for 15 minutes whenever Twitter returns an error saying the limit is reached. Depending on the number of followings/followers, it can take several hours to days to collect all the data. Because of that, I run the script on a Raspberry Pi 2 so I don't have to leave my computer running all the time.
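The wait-on-429 behaviour described above can be sketched as a simple retry loop. This is an illustration, not twecoll's code; `call_with_backoff` and the `RateLimited` exception are made-up names standing in for the real HTTP error handling.

```python
import time

# Sketch of wait-on-429 behaviour (illustration, not twecoll's code):
# retry the call after a pause whenever the API says the limit is reached.
class RateLimited(Exception):
    """Stand-in for Twitter's HTTP 429 response."""

def call_with_backoff(fn, wait=15 * 60, retries=3):
    for _ in range(retries):
        try:
            return fn()
        except RateLimited:
            time.sleep(wait)   # twecoll waits 15 minutes here
    raise RuntimeError("still rate-limited after retries")

# Demo: fail once with a 429-style error, then succeed
# (wait=0 keeps the demo fast).
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RateLimited
    return "ok"

print(call_with_backoff(flaky, wait=0))
```

Because the pause happens inside the loop, the script simply sleeps through each rate-limit window and continues, which is why a large network can take hours or days to collect.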

Update: Thanks to a comment by Jonáš Jančařík, I was able to improve the twecoll code to collect the initial list of accounts with up to 100 fewer API calls. The modified version linked above already includes the improvement. I created a pull request for the original version as well, so it should soon be available to everyone.


