Python: Twitter data mining with Python and Gephi

Gathering Data

Twitter provides a REST API (https://dev.twitter.com/rest/public) for programmatic access. The REST API lets programs read and write Twitter data: create new Tweets, read user profiles and follower data, and much more. It identifies Twitter applications and users via OAuth and responds in JSON format.

If your goal is to monitor or process Tweets in real time, consider using the Streaming API instead.
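
For completeness, here is a minimal sketch of what a call to the REST search endpoint can look like from Python. It is not used elsewhere on this page; the choice of the requests and requests_oauthlib libraries and the placeholder credentials are assumptions for illustration.

import requests
from requests_oauthlib import OAuth1

# OAuth credentials of your Twitter application (placeholders)
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

# standard search endpoint of the REST API (v1.1); the response is JSON
resp = requests.get(
    "https://api.twitter.com/1.1/search/tweets.json",
    params={"q": "synbio", "count": 100},
    auth=auth,
)
for status in resp.json().get("statuses", []):
    print(status["created_at"], status["text"])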

The main problems with the Twitter API:

  • There is a limit on how many requests can be sent in a given time window, to prevent the system from being overloaded.
  • The major one: tweets can only be collected from roughly the last 7 days.

Another way to download older tweets is to open the Twitter website and keep scrolling down. The bonus of this approach is that no API key is needed and there is no time limit. Less information is available through the web page, but the basics are there: date, user, tweet text, and the number of retweets and favorites. This can be done with Selenium (http://selenium-python.readthedocs.io/) to control a browser from Python, together with geckodriver. Install Selenium with:

sudo pip install selenium

Don't forget to install geckodriver, which can be downloaded with the command below; extract the archive and put the geckodriver binary somewhere on your PATH.

wget https://github.com/mozilla/geckodriver/releases/download/v0.13.0/geckodriver-v0.13.0-linux64.tar.gz

An example snippet for searching Twitter with Selenium looks like this:

import time

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://twitter.com/search?f=tweets&vertical=default&q=ahok&src=typd")
for i in range(1, 3):
    # scroll to the bottom so the page loads another batch of tweets
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)  # give the page a moment to load the new tweets
driver.close()

Here the search keyword is ahok. The search can also be limited to a given date by adding, for example, "until=2014-01-01" to the URL.
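
The loaded tweets still have to end up in a text file for the structuring step below. A minimal sketch of how that could be done before the browser is closed (not part of the original script; the file name raw_tweets.txt is an assumption):

from selenium.webdriver.common.by import By

# grab the visible text of the fully scrolled page and save it to a file;
# run this before driver.close()
body_text = driver.find_element(By.TAG_NAME, "body").text
with open("raw_tweets.txt", "w", encoding="utf-8") as f:
    f.write(body_text)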

Structuring Data

After letting the algorithm enjoy its browsing, I had a long text file full of tweets. I only gathered tweets from 2015, although I could have gone all the way back to the beginning of Twitter in 2006. Anyhow, I had enough tweets (9609) to do some meaningful data mining. The next task was to structure this mess into something that can be mined. Regular expressions are a good tool to help with this. They may seem intimidating at first (and second and third) glance, but they are really powerful for searching text. After going through the basic course offered online by Codecademy, I understood them enough for my purposes.

I looked at the text file and noticed that each tweet ended with "favorites", denoting the number of favourites the tweet had. I also noticed that before the tweet there were two line breaks, the day had a dot at the end, the year was followed by a line break, and so on. What made things even simpler was that the Twitter webpage had identified me as a Finn browsing from Finland, and everything except the tweets was in Finnish, making it rather easy to identify words or expressions that were very unlikely to appear in the tweets themselves. I managed to structure the text into a neat little CSV file using magical Python spells such as:

tweetList = re.split(r'suosikki\n|suosikkia\n', rawData) #split the text into tweets using the regular expression of “favourite(s)” which is “suosikki” or “suosikkia” in finnish
match = re.search(r"@[^ ]+", str(tweet)) #search for something beginning with @ and ending with whitespace, so basically a username
text = re.sub(r'\n', ' ', text) #remove linebreak from text
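
Putting those pieces together, a minimal sketch of the structuring step might look like the following. This is not the author's exact script; the file names and the two-column layout are assumptions for illustration.

import csv
import re

with open("raw_tweets.txt", encoding="utf-8") as f:
    rawData = f.read()

# split the raw text into individual tweets on the Finnish "favourite(s)" marker
tweetList = re.split(r'suosikki\n|suosikkia\n', rawData)

with open("tweets.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["user", "text"])
    for tweet in tweetList:
        match = re.search(r"@[^ ]+", str(tweet))   # the username starts with @
        user = match.group(0) if match else ""
        text = re.sub(r'\n', ' ', str(tweet))      # remove line breaks from the tweet text
        writer.writerow([user, text])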

Now I had a file to look at in Excel and I drew a nice little graph, because that is what you do in Excel. But the lure of Python called me back and I continued analyzing the text, extracting mentions from tweets and making a network of who has mentioned whom. Networks can be plotted directly from Python or written to a Gephi file, but I used the more cumbersome method of writing the network to CSV files and importing them into Gephi. I like to see the data in a spreadsheet and am a prisoner of old habits.
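
As an illustration of that step (again a sketch under the same assumptions as above, not the original script), a directed "who mentions whom" edge list in the Source/Target format that Gephi imports could be written like this:

import csv
import re

# read the structured tweets and write a mention edge list for Gephi
with open("tweets.csv", encoding="utf-8") as inp, \
     open("mention_edges.csv", "w", newline="", encoding="utf-8") as out:
    reader = csv.DictReader(inp)
    writer = csv.writer(out)
    writer.writerow(["Source", "Target"])   # column names Gephi recognises for edges
    for row in reader:
        for mention in re.findall(r"@\w+", row["text"]):
            writer.writerow([row["user"], mention])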

[Figure: tweets in Excel; tweets mentioning "synbio" per month]

Gephi is a fantastic open source program for drawing and analysing networks. It has built-in tools for all sorts of things, such as clustering and data analysis. I used the Force Atlas layout to structure the network of mentions and resized the nodes (i.e. users) based on the number of other nodes pointing to them. The result is a network graph showing who gets the most mentions, and by whom, in tweets containing "synbio".

[Figure: snapshot of the mention network]

Analyse the Data

Next I decided to analyse the tweets themselves, looking at common words and phrases to get an idea of what the tweets were about. Basically I wanted to know what words and concepts are connected to synthetic biology in the twittersphere. Luckily, there is a handy toolkit for doing this, called nltk, the Natural Language Toolkit. It makes it easy to analyse word frequencies, remove common words and calculate word collocations, that is, words that appear together. My process for creating a word collocation network was:

Find the most common uncommon words. I first looked at all the tweets, removed common words found in the Brown corpus, and made a list of the 100 most common remaining words. This was to prevent the network from having too many nodes while keeping the most mentioned relevant ones. Synthetic biology is a forgiving topic for this sort of filtering, because most of the relevant words are rare in common parlance (like genetic, diybio, biotech). Here is some of the pythonese I used:

from nltk.tokenize import RegexpTokenizer
from nltk.probability import FreqDist
from nltk.corpus import brown

# divide into words
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(tweets)   # tweets = all tweet texts joined into one string

# remove the most common words, based on the Brown corpus
fdist = FreqDist(brown.words())
mostcommon = fdist.most_common(100)
mclist = [mostcommon[i][0] for i in range(len(mostcommon))]
words = [w for w in words if w not in mclist]

# keep only the 100 most common remaining words
fdist = FreqDist(words)
mostcommon = fdist.most_common(100)
mclist = [mostcommon[i][0] for i in range(len(mostcommon))]
words = [w for w in words if w in mclist]

Go through the tweets and calculate word collocations. First I removed common words from the tweets. Then, if two words were close to each other, they formed a pair, which I collected into an undirected graph. I also counted how many times each word was mentioned. Some useful pythonese included:

from nltk.collocations import BigramCollocationFinder

# find word pairs that appear within five words of each other, sorted by frequency
finder = BigramCollocationFinder.from_words(words, window_size=5)
pairs = sorted(finder.ngram_fd.items(), key=lambda t: (-t[1], t[0]))

Write it all to CSV and import it into Gephi. Not surprisingly, synbio, synthetic and biology often appeared together. To bring the focus to other words, I deleted them in Gephi and fine-tuned the remaining graph.
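
A minimal sketch of that export, continuing from the pairs computed above (the file name and column choice are assumptions; Gephi reads Source, Target and Weight columns for weighted edges):

import csv

# write the word-pair counts as a weighted edge list for Gephi
with open("collocations.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["Source", "Target", "Weight"])
    for (w1, w2), count in pairs:
        writer.writerow([w1, w2, count])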

[Figure: word collocation graph]

Create a Dynamic Graph

The final thing I tried was to make a dynamic graph, that is, one that changes over time. This way I could look at which words appeared when. But mainly I just wanted to make an even more confusing and cool looking graph. I already had the word collocations and the dates of the tweets, so the biggest challenge was to understand the format in which Gephi expects the dynamic data. The time intervals need to be input as <(start_1,end_1),(start_2,end_2),…,(start_n,end_n)>, and the start and end can be either a float or a date (yyyy-mm-dd). Dates make the graph more readable, so I opted for them. I set the "duration" of each tweet to one day, played with Gephi a bit and voilà:
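
To make the interval format concrete, here is a rough sketch of writing such a dynamic node list. The word_dates mapping, the file name and the Interval column name are assumptions for illustration; each word is given a one-day interval per tweet, as described above.

import csv
from datetime import datetime, timedelta

def one_day(date_str):
    # return (start, end) where end is one day after start, both as yyyy-mm-dd
    start = datetime.strptime(date_str, "%Y-%m-%d")
    return date_str, (start + timedelta(days=1)).strftime("%Y-%m-%d")

# word_dates: word -> list of tweet dates (yyyy-mm-dd strings), collected earlier
with open("nodes_dynamic.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["Id", "Label", "Interval"])
    for word, dates in word_dates.items():
        intervals = ",".join("(%s,%s)" % one_day(d) for d in sorted(dates))
        writer.writerow([word, word, "<%s>" % intervals])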


Use in foresight?

Understanding the current discussion, how it has evolved and who is involved in it is crucial for identifying the key drivers and tensions that shape the future. Twitter offers one rather structured source of discussion data. However, after this exercise I am still a bit puzzled about what to make of the results. Thus perhaps the biggest lesson I learned from this exercise was that while it is relatively easy to make graphs and calculate statistics, drawing clever insights from the data is the hard part.

References

  • http://www.nltk.org
  • http://www.nltk.org/book_1ed/
  • http://selenium-python.readthedocs.io/