Python: Scraping Twitter


Latest revision as of 14:52, 22 January 2017

Introduction

Twitter is a popular online social network where users can send and read short messages called "tweets". It has become a new instrument for measuring social events: each day, millions of people tweet their opinions on every topic imaginable. This data source is valuable for both research and business.

Here are a few examples of analyzing Twitter data to get some interesting results:

  • The "mood" of communication on Twitter reflects biological rhythms
  • Researchers use Twitter to predict the stock market
  • A student used geocoded tweets to plot a map of locations where "thunder" was mentioned in the context of a storm system in summer 2012
  • "Characteristics and dynamics of Twitter" is an excellent resource for learning more about how Twitter can be used to analyze moods at a national scale

This tutorial shows how to use Python to scrape live tweets from Twitter.


The Twitter Application Programming Interface (API)

Twitter provides a very rich REST API for querying the system, accessing data, and controlling your account. You can read more about the Twitter API in Twitter's developer documentation.

Python environment

If you are new to Python, you may find these resources valuable:

  • Codecademy's Python tutorials
  • Google's Python class
  • Cleaning data in Python tutorial

You can install Python by downloading it from the Python website. We recommend the Anaconda Scientific Python Distribution: it is completely free and well suited to data processing, predictive analysis, and scientific computing. You can get the latest version of Anaconda at http://continuum.io/downloads. For more information, please refer to the "Set up environments" section in the Cleaning data in Python tutorial.


Unicode strings

Strings in the Twitter data that are prefixed with the letter "u" are Unicode strings. For example:

u"I am a string!"

Unicode is a standard for representing a much larger variety of characters than the Roman alphabet alone (Greek, Russian, mathematical symbols, logograms from non-phonetic writing systems such as kanji, etc.).

In most circumstances, you will be able to use a Unicode object just like a string. If you encounter an error when printing Unicode, use the encode method to convert the string to UTF-8 so that the international characters print properly, as in this Python 2 example:

# Python 2 syntax: the u prefix marks a Unicode string.
unicode_string = u"aaaà çççñññ"
# Encode to UTF-8 bytes before printing to avoid encoding errors.
encoded_string = unicode_string.encode('utf-8')
print encoded_string
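
For comparison, here is a minimal sketch of the same round trip in Python 3, where every str is already Unicode and the u prefix is optional. The variable names are illustrative and are not part of the tutorial's own code.

```python
# Python 3 sketch: str values are Unicode text, bytes values are encoded data.
text = "aaaà çççñññ"

# encode() converts the text to UTF-8 bytes, e.g. before writing to a binary file.
encoded = text.encode("utf-8")

# decode() reverses the step and restores the original text.
decoded = encoded.decode("utf-8")

print(type(encoded).__name__)  # bytes
print(decoded == text)         # True
```

In Python 3 you would normally print the str directly; explicit encoding is only needed when a destination expects bytes.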

oauth2 library

To access the live tweet stream, you need to install the oauth2 library so that you can authenticate properly. You can install it yourself in your Python environment: go to the command line and type pip install oauth2, which should work for most environments.
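
One way to confirm that the installation worked is to ask Python whether it can locate the module. The sketch below uses the standard library's importlib.util.find_spec; the helper name is_installed is made up for this illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("oauth2"):
    print("oauth2 is available")
else:
    print("oauth2 not found; try: pip install oauth2")
```

Run this with the same interpreter you plan to use for the scraping script, since each Python environment has its own set of installed packages.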


Get Twitter data

The steps below will help you set up your Twitter account so that you can access the live tweet stream.

  • Create a Twitter account if you do not have one.
  • Go to https://dev.twitter.com/apps and log in with your Twitter credentials.
  • Click "Create New App".
  • Fill out the form and agree to the terms. Put in a dummy website if you don't have one you want to use.
  • On the next page, click the "Keys and Access Tokens" tab along the top, then scroll all the way down until you see the section "Your Access Token".
  • Click the button "Create My Access Token". You can read more about OAuth authorization.
  • You will now copy your four unique values into twitterstream.py (download this file to your computer). These values are your "API Key", your "API secret", your "Access token", and your "Access token secret". Open twitterstream.py and set the variables corresponding to the API key, API secret, access token, and access token secret. You will see code like the following on lines 6-9 of the file:
api_key = "<Enter api key>"
api_secret = "<Enter api secret>"
access_token_key = "<Enter your access token key here>"
access_token_secret = "<Enter your access token secret here>"
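
A forgotten placeholder is an easy mistake at this step. The following sketch flags any credential that is empty or still looks like a placeholder. The helper unfilled_credentials and the "<Enter" prefix test are assumptions based on the placeholder text shown above, not part of twitterstream.py, and the example values are invented.

```python
# Assumption: untouched placeholders still start with "<Enter", as shown above.
PLACEHOLDER_PREFIX = "<Enter"


def unfilled_credentials(credentials):
    """Return the names whose values are empty or still a placeholder."""
    return [
        name
        for name, value in credentials.items()
        if not value or value.startswith(PLACEHOLDER_PREFIX)
    ]


# Made-up values purely for illustration.
example = {
    "api_key": "<Enter api key>",
    "api_secret": "<Enter api secret>",
    "access_token_key": "made-up-token",
    "access_token_secret": "",
}
print(unfilled_credentials(example))
# ['api_key', 'api_secret', 'access_token_secret']
```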

After pasting the four credentials into twitterstream.py, save the file, go to the command line, and type:

python twitterstream.py > tweets.txt

Make sure you are in the directory where the file twitterstream.py is saved.
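
The ">" in the command redirects the script's standard output into a file. If you prefer to drive this from Python, a rough equivalent can be sketched with the standard subprocess module. The function run_and_capture and its seconds parameter are hypothetical, and the sketch assumes the script runs under the same interpreter that executes this code.

```python
import subprocess
import sys


def run_and_capture(script_path, output_path, seconds=None):
    """Run a Python script with its stdout redirected to output_path.

    Returns True if the script exited on its own, or False if it was
    stopped because `seconds` elapsed first.
    """
    with open(output_path, "w") as out:
        try:
            # -u disables output buffering so lines reach the file promptly.
            subprocess.run(
                [sys.executable, "-u", script_path],
                stdout=out,
                timeout=seconds,
            )
            return True
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child process when the timeout expires.
            return False
```

For example, run_and_capture("twitterstream.py", "tweets.txt", seconds=180) would collect output for about three minutes and then stop the script, which plays the role of pressing Ctrl-C by hand.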


You will see that a tweets.txt file has been created. This is the file where the raw tweet data is stored.

Wait 3-5 minutes, then stop the program by pressing Ctrl-C in the command line. Open the tweets file and you will see the raw tweets that were collected.
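
As a quick sanity check on the captured file, the sketch below counts how many lines parse as JSON objects. It assumes that the stream writes one JSON object per line and that the file is UTF-8 encoded; neither is confirmed by this tutorial, so adjust the code if your file looks different. The function name load_tweets is illustrative.

```python
import json


def load_tweets(path):
    """Parse one JSON object per line, skipping blank or malformed lines."""
    tweets = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except ValueError:
                # Stopping the stream with Ctrl-C can leave a partial last line.
                continue
            if isinstance(record, dict):
                tweets.append(record)
    return tweets
```

Calling len(load_tweets("tweets.txt")) then gives a rough count of the messages captured during the run.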


Congratulations! You just scraped some live tweets using Python. Our next tutorial will show how to extract useful information from these tweets. Stay tuned!


Acknowledgement: this tutorial partially builds on the first assignment of Introduction to Data Science on Coursera.




References