Python: Twitter data mining with Python and Gephi

Collecting the Data

Twitter provides a REST API (https://dev.twitter.com/rest/public) for programmatic access. The REST API provides programmatic access to read and write Twitter data: create new Tweets, read user profiles and follower data, and more. The REST API identifies Twitter applications and users using OAuth, and responds in JSON format.
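As an illustration, reading tweets through the REST API is usually done with a client library. Below is a minimal sketch using tweepy (not the approach taken in this article); the credentials are placeholders obtained by registering an application with Twitter:

import tweepy

# placeholder credentials from a registered Twitter application
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')

# wait_on_rate_limit makes tweepy pause when the request quota is exhausted
api = tweepy.API(auth, wait_on_rate_limit=True)

# the search endpoint only covers roughly the last 7 days (see below)
for tweet in api.search(q='ahok', count=100):
    print(tweet.created_at, tweet.user.screen_name, tweet.text)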

If your goal is to monitor or process Tweets in real time, consider using the Streaming API instead.
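A comparable streaming sketch, again with tweepy and placeholder credentials, assuming the tweepy 3.x StreamListener interface:

import tweepy

class PrintListener(tweepy.StreamListener):
    def on_status(self, status):
        # called once for every incoming tweet that matches the filter
        print(status.text)

auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
stream = tweepy.Stream(auth=auth, listener=PrintListener())
stream.filter(track=['ahok'])  # receive matching tweets in real time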

The main problems with Twitter:

  • There are limits on how many requests can be sent in a given time window, to keep the system from being overloaded.
  • The big one: we can only collect tweets from roughly the last 7 days.

Another way to download older tweets is to open the Twitter web page and scroll down. The bonus of this approach is that no API is needed and there is no time limit. It also yields a lot of information per tweet, including the date, user, tweet text, and the number of retweets and favourites. This can be done using Selenium (http://selenium-python.readthedocs.io/) to control the browser from Python via geckodriver. Install it with,

sudo pip install selenium

Don't forget to install geckodriver, which can be downloaded with,

wget https://github.com/mozilla/geckodriver/releases/download/v0.13.0/geckodriver-v0.13.0-linux64.tar.gz
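The archive then needs to be unpacked and the binary placed somewhere on the PATH; a typical sequence (the destination directory is just a common choice):

tar -xzf geckodriver-v0.13.0-linux64.tar.gz
sudo mv geckodriver /usr/local/bin/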

An example snippet for searching Twitter with Selenium is as follows:

import time

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://twitter.com/search?f=tweets&vertical=default&q=ahok&src=typd")
for i in range(1, 3):
    # scroll to the bottom, then give the page time to load more tweets
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
driver.close()

Here the keyword used is ahok. The search can also be limited to a specified date by adding, for example, “until=2014-01-01” to the URL.
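The snippet above only scrolls; the tweets still have to be read out of the loaded page before the browser is closed. A sketch, assuming the 2016-era Twitter web layout in which tweet text sat in elements with the class tweet-text, and the old-style Selenium locator API:

# continue from the scrolling snippet, before driver.close()
tweets = driver.find_elements_by_class_name('tweet-text')
with open('tweets_raw.txt', 'w') as f:
    for t in tweets:
        f.write(t.text + '\n\n')  # blank line between tweets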

Structuring the Data

After letting the algorithm enjoy its browsing, I had a long text file full of tweets. I only gathered tweets from 2015, although I could have gone all the way back to the beginning of Twitter in 2006. In any case I had enough tweets (9609) to do some meaningful data mining. The next task was to structure this mess into something that can be mined. Regular expressions are a good tool for this. They may seem intimidating at first (and second, and third) glance, but they are really powerful for searching text. After going through the basic course offered online by Codecademy, I understood them well enough for my purposes.

I looked at the text file and noticed that each tweet ended with “favorites”, denoting the number of favourites the tweet had. I also noticed that there were two line breaks before each tweet, the day had a dot at the end, the year was followed by a line break, and so on. What made things even simpler was that the Twitter web page had identified me as a Finn browsing from Finland, so everything except the tweets was in Finnish, making it rather easy to identify words or expressions that were very unlikely to appear in the tweets themselves. I managed to structure the text into a neat little CSV file using magical Python spells such as:

tweetList = re.split(r'suosikki\n|suosikkia\n', rawData)  # split the text into tweets on the Finnish for "favourite(s)", i.e. "suosikki" or "suosikkia"
match = re.search(r"@[^ ]+", str(tweet))  # search for something beginning with @ and ending in whitespace, i.e. a username
text = re.sub(r'\n', ' ', text)  # remove line breaks from the text
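Put together, the structuring step might look like the sketch below (the file and column names are my own, chosen for illustration):

import csv
import re

with open('tweets_raw.txt') as f:
    rawData = f.read()

# split the raw dump into individual tweets on the Finnish "favourite(s)" marker
tweetList = re.split(r'suosikki\n|suosikkia\n', rawData)

with open('tweets.csv', 'w') as out:
    writer = csv.writer(out)
    writer.writerow(['user', 'text'])
    for tweet in tweetList:
        match = re.search(r'@[^ ]+', str(tweet))
        user = match.group(0) if match else ''
        text = re.sub(r'\n', ' ', str(tweet))  # remove line breaks
        writer.writerow([user, text])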

Now I had a file to look at in Excel, and I drew a nice little graph, because that is what you do in Excel. But the lure of Python called me back and I continued analysing the text, extracting mentions from tweets and building a network of who has mentioned whom. Networks can be plotted directly from Python or written as a Gephi file, but I used the more cumbersome method of writing the network to CSV files and importing them into Gephi. I like to see the data in a spreadsheet; I am a prisoner of old habits.
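The mention network itself is just an edge list: one row per (tweeting user, mentioned user) pair. A sketch, assuming the tweets.csv produced above; Source and Target are the column names Gephi recognises when importing edges:

import csv
import re

with open('tweets.csv') as f, open('edges.csv', 'w') as out:
    writer = csv.writer(out)
    writer.writerow(['Source', 'Target'])
    for row in csv.DictReader(f):
        # every @mention in the tweet becomes an edge
        for mention in re.findall(r'@\w+', row['text']):
            writer.writerow([row['user'], mention])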

[Figure: tweets in Excel; tweets mentioning synbio per month]

Gephi is a fantastic open source program for drawing and analysing networks. It has built-in tools for all sorts of things, such as clustering and data analysis. I used the Force Atlas layout to structure the network of mentions and resized the nodes (i.e. users) based on the number of other nodes pointing to them. The result is a network graph showing who gets the most mentions, and by whom, in tweets containing “synbio”.

[Figure: snapshot of the mention network; click for full PDF]

Analyse the data

Next I decided to analyse the tweets themselves, looking at common words and phrases to get an idea of what the tweets were about. Basically I wanted to know what words and concepts are connected to synthetic biology in the twittersphere. Luckily, there is a handy toolkit for this, called NLTK, the Natural Language Toolkit. It makes it easy to analyse word frequencies, remove common words, and calculate word collocations, that is, words that appear together. My process for creating a word collocation network was:

Find the most common uncommon words. I first looked at all the tweets, removed common words found in the Brown corpus, and made a list of the 100 most common remaining words. This was to prevent the network from having too many nodes while keeping the most-mentioned relevant ones. Synthetic biology is a forgiving topic for this sort of filtering, because most of the relevant words are rare in common parlance (like genetic, diybio, biotech). Here is some of the pythonese I used:

from nltk.tokenize import RegexpTokenizer

# divide the text into words
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(tweets)

from nltk import FreqDist
from nltk.corpus import brown

# remove the most common words, based on the Brown corpus
fdist = FreqDist(brown.words())
mostcommon = fdist.most_common(100)
mclist = [item[0] for item in mostcommon]
words = [w for w in words if w not in mclist]

# keep only the 100 most common remaining words
fdist = FreqDist(words)
mostcommon = fdist.most_common(100)
mclist = [item[0] for item in mostcommon]
words = [w for w in words if w in mclist]

Go through the tweets and calculate word collocations. First I removed common words from the tweets. Then, if two words were close to each other, they formed a pair, which I collected into an undirected graph. I also counted how many times each word was mentioned. Some useful pythonese included:

from nltk.collocations import BigramCollocationFinder

# find word pairs within a window of 5 words, sorted by frequency
finder = BigramCollocationFinder.from_words(words, window_size=5)
pairs = sorted(finder.ngram_fd.items(), key=lambda t: (-t[1], t[0]))

Write it all to CSV and import it into Gephi. Not surprisingly, synbio, synthetic and biology appeared together often. To bring the focus to the other words, I deleted them in Gephi and fine-tuned the remaining graph.
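A sketch of that writing step, reusing the pairs computed by the collocation finder above (Weight is an optional edge column that Gephi also understands):

import csv

# 'pairs' comes from the BigramCollocationFinder snippet above
with open('collocations.csv', 'w') as out:
    writer = csv.writer(out)
    writer.writerow(['Source', 'Target', 'Weight'])
    for (word1, word2), count in pairs:
        writer.writerow([word1, word2, count])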

[Figure: word collocation graph; click for PDF]

Create a dynamic graph

The final thing I tried was to make a dynamic graph, that is, one that changes over time. This way I could look at which words appeared when. But mainly I just wanted to make an even more confusing and cool-looking graph. I already had the word collocations and the dates of the tweets, so the biggest challenge was to understand the format in which Gephi expects the dynamic data. The time intervals need to be input as <(start_1,end_1),(start_2,end_2),…,(start_n,end_n)>, and the start and end can each be either a float or a date (yyyy-mm-dd). Dates make the graph more readable, so I opted for them and set the “duration” of each tweet to one day.
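Building that interval string is simple; a sketch, assuming each word comes with a list of yyyy-mm-dd dates on which it appeared and that each appearance lasts one day:

from datetime import datetime, timedelta

def interval_string(dates):
    # dates: list of 'yyyy-mm-dd' strings on which the word appeared
    parts = []
    for d in dates:
        start = datetime.strptime(d, '%Y-%m-%d')
        end = start + timedelta(days=1)  # each appearance lasts one day
        parts.append('(%s,%s)' % (start.date(), end.date()))
    return '<' + ','.join(parts) + '>'

print(interval_string(['2015-03-01', '2015-06-15']))
# prints <(2015-03-01,2015-03-02),(2015-06-15,2015-06-16)>

With the intervals in place, I played with Gephi a bit and voilà: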


Use in foresight?

Understanding the current discussion, how it has evolved, and who is involved in it is crucial for identifying the key drivers and tensions that shape the future. Twitter offers one rather structured source of discussion data. However, after this exercise I'm still a bit puzzled about what to make of the results. Perhaps the biggest lesson I learned from this exercise was that while it is relatively easy to make graphs and calculate statistics, drawing clever insights from the data is the hard part.
