Python: Scraping Twitter
Latest revision as of 14:52, 22 January 2017
Introduction
Twitter is a popular online social network where users send and read short messages called "tweets". It offers a new instrument for measuring social events: each day, millions of people tweet their opinions on any topic imaginable. This data source is valuable for both research and business.
Here are a few examples of interesting results obtained by analyzing Twitter data:
- The "mood" of communication on Twitter reflects biological rhythms
- Researchers use Twitter to predict the stock market
- A student used geocoded tweets to plot a map of locations where "thunder" was mentioned in the context of a storm system in summer 2012
- "Characteristics and dynamics of Twitter" is an excellent resource for learning more about how Twitter can be used to analyze moods at a national scale
In this tutorial, we show how to use Python to scrape live tweets from Twitter.
The Twitter Application Programming Interface (API)
Twitter provides a very rich REST API for querying the system, accessing data, and controlling your account. You can read more about it in the Twitter API documentation.
Python environment
If you are new to Python, you may find these resources valuable:
- Codecademy's Python tutorials
- Google's Python class
- Cleaning data in Python tutorial
You can install Python by downloading it from the Python website. We recommend the Anaconda Scientific Python Distribution: it is completely free and ideal for data processing, predictive analysis, and scientific computing. You can get the latest version of Anaconda at http://continuum.io/downloads. For more information, please refer to the "Set up environments" section in the Cleaning data in Python tutorial.
Unicode strings
Strings in the Twitter data prefixed with the letter "u" are Unicode strings. For example:
u"I am a string!"
Unicode is a standard for representing a much larger variety of characters than the Roman alphabet alone (Greek, Russian, mathematical symbols, logograms from non-phonetic writing systems such as kanji, etc.).
In most circumstances, you will be able to use a Unicode object just like a string. If you encounter an error when printing Unicode, you can use the encode method to print the international characters properly, like this:
unicode_string = u"aaaà çççñññ"
encoded_string = unicode_string.encode('utf-8')
print encoded_string
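The snippet above uses Python 2 syntax. As a minimal sketch, assuming Python 3 (where every `str` is already Unicode text and the `u` prefix is optional), the same encode step and its reverse might look like this. The variable names are illustrative only.

```python
# Python 3 sketch: str holds Unicode text, encode() produces raw bytes.
unicode_string = "aaaà çççñññ"

# Encode the text to UTF-8 bytes, for example before writing to a binary stream.
encoded_bytes = unicode_string.encode("utf-8")

# Decode the bytes back to text; the round trip loses nothing.
decoded_string = encoded_bytes.decode("utf-8")

# Accented characters need more than one byte each in UTF-8,
# so the byte count is larger than the character count.
print(len(unicode_string), "characters,", len(encoded_bytes), "bytes")
print(decoded_string == unicode_string)
```

In Python 3, `print` can usually display the `str` directly, so explicit encoding is mostly needed when writing bytes to files or sockets.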
oauth2 library
To access the live tweet stream, you will need to install the oauth2 library so that you can authenticate properly. You can install it yourself in your Python environment: go to the command line and type pip install oauth2 (this should work for most environments).
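To give a sense of what "authenticating" involves, here is a rough sketch of OAuth 1.0a request signing with HMAC-SHA1, the kind of work a library such as oauth2 does on your behalf. It uses only the standard library and is not the oauth2 library's own code. The function names, the URL, and every credential value below are placeholders invented for illustration.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value):
    """Percent-encode a value, leaving only unreserved characters as-is."""
    return quote(str(value), safe="-._~")


def sign_request(method, url, params, consumer_secret, token_secret):
    """Return a base64 HMAC-SHA1 signature for an OAuth 1.0a style request."""
    # Encode every key and value, then sort the encoded pairs.
    encoded_pairs = sorted(
        (percent_encode(k), percent_encode(v)) for k, v in params.items()
    )
    param_string = "&".join(f"{k}={v}" for k, v in encoded_pairs)

    # The signature base string joins the method, URL, and parameters.
    base_string = "&".join(
        [method.upper(), percent_encode(url), percent_encode(param_string)]
    )

    # The signing key combines the API secret and the access token secret.
    signing_key = (
        f"{percent_encode(consumer_secret)}&{percent_encode(token_secret)}"
    )

    digest = hmac.new(
        signing_key.encode("utf-8"), base_string.encode("utf-8"), hashlib.sha1
    ).digest()
    return base64.b64encode(digest).decode("ascii")


# Placeholder values only; a real request needs a fresh nonce and timestamp.
example_params = {
    "oauth_consumer_key": "example-api-key",
    "oauth_token": "example-access-token",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_timestamp": "1485096720",
    "oauth_nonce": "example-nonce",
    "oauth_version": "1.0",
}
signature = sign_request(
    "GET",
    "https://example.com/stream",
    example_params,
    "example-api-secret",
    "example-access-token-secret",
)
print(signature)
```

In practice you would let the library build and sign the request; the sketch only shows why four separate values (two keys and two secrets) are needed.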
Get Twitter data
The steps below will help you set up your Twitter account to access the live tweet stream.
- Create a Twitter account if you do not have one.
- Go to https://dev.twitter.com/apps and log in with your Twitter credentials.
- Click "Create New App".
- Fill out the form and agree to the terms. Put in a dummy website if you don't have one you want to use.
- On the next page, click the "Keys and Access Tokens" tab along the top, then scroll all the way down until you see the section "Your Access Token".
- Click the button "Create My Access Token". You can read more about OAuth authorization.
- You will now copy your four unique values into twitterstream.py (download this file to your computer). These values are your "API Key", "API secret", "Access token", and "Access token secret". Open twitterstream.py and set the variables corresponding to the API key, API secret, access token, and access token secret. You will see code like the following on lines 6-9 of the file:
api_key = "<Enter api key>"
api_secret = "<Enter api secret>"
access_token_key = "<Enter your access token key here>"
access_token_secret = "<Enter your access token secret here>"
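A common slip is to leave one of the placeholders in place. The following sketch shows one possible check; `unfilled_credentials` is a hypothetical helper that is not part of twitterstream.py, and the filled-in values are made up.

```python
def unfilled_credentials(credentials):
    """Return the names of credentials that are empty or still placeholders."""
    return [
        name
        for name, value in credentials.items()
        if not value.strip() or value.strip().startswith("<")
    ]


# Made-up values for illustration; two of the four are deliberately unset.
credentials = {
    "api_key": "<Enter api key>",
    "api_secret": "abc123",
    "access_token_key": "",
    "access_token_secret": "xyz789",
}

missing = unfilled_credentials(credentials)
if missing:
    print("Still to fill in:", ", ".join(missing))
else:
    print("All four credentials are set.")
```

The check assumes that the placeholders always start with "<", as they do in the four lines shown above.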
After pasting the four credentials into twitterstream.py, save the file, go to the command line, and type:
python twitterstream.py > tweets.txt
Make sure you are in the directory where the file twitterstream.py is saved.
You will see that a tweets.txt file has been created. This is the file where the raw tweet data is stored.
Wait 3-5 minutes, then stop the program by pressing Ctrl-C in the command line. Open tweets.txt and you will see the raw tweets.
Congratulations! You just scraped some live tweets using Python. Our next tutorial will introduce how to extract useful information from these tweets. Stay tuned!
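As a quick sanity check on the captured file, the sketch below counts how many lines parse as JSON. It assumes that tweets.txt holds one JSON object per line and that tweets carry a "text" field, while other records (such as deletion notices) do not. The helper name and the sample lines are invented for illustration.

```python
import json


def parse_tweet_lines(lines):
    """Yield one dict for each line that holds a valid JSON object."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip truncated or non-JSON lines
        if isinstance(record, dict):
            yield record


# Made-up sample lines standing in for the contents of tweets.txt.
sample_lines = [
    '{"text": "hello world", "lang": "en"}',
    "",
    '{"text": "truncated',
    '{"delete": {"status": {"id": 1}}}',
]

records = list(parse_tweet_lines(sample_lines))
tweets = [record for record in records if "text" in record]
print(len(records), "records,", len(tweets), "with a text field")
```

To run the same check on real data, pass an open file instead of the sample list, for example `with open("tweets.txt") as f: records = list(parse_tweet_lines(f))`. The last line of the file may be incomplete because the program was interrupted with Ctrl-C, which is why unparseable lines are skipped rather than treated as errors.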
Acknowledgement: this tutorial partially builds on the first assignment of Introduction to Data Science on Coursera.