Collecting Twitter Data: Converting Twitter JSON to CSV — ASCII
November 10, 2015 · Sean Dolinar
Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII [current page] | Part VII: Twitter JSON to CSV — UTF-8
I outlined some of the potential hurdles you have to overcome when converting Twitter JSON data to a CSV file in the previous section. Here I outline a quick Python script that allows you to parse your Twitter JSON file with the csv library. This has the obvious drawback that it can't handle the UTF-8 encoded characters that can be present in tweets. But this program will produce a CSV file that works well in Excel or other programs that are limited to ASCII characters.

The JSON File
The first requirement is a valid JSON file. This file should contain an array of Twitter JSON objects, or in analogous Python terms, a list of Twitter dictionaries. The tutorial for the Python Stream Listener has been updated to produce a correctly formatted file for use in Python.
[{Twitter JSON Object}, {Twitter JSON Object}, {Twitter JSON Object}]
The JSON file is loaded into Python and is automatically parsed into a Python-friendly object by the json library using the json.loads() method. The open() line opens and reads the file in as a string, then json.loads() decodes the string into a json Python object which behaves similarly to a list of Python dictionaries — one dictionary for each tweet.
import json
import csv

data_json = open('raw_tweets.json', mode='r').read() # reads in the JSON file into Python as a string
data_python = json.loads(data_json) # turns the string into a json Python object
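As a side note, the read-then-loads pair above can be collapsed into a single json.load() call inside a with block, which also closes the file automatically. A minimal sketch, using a throwaway sample file in place of raw_tweets.json:

```python
import json
import os
import tempfile

# write a tiny stand-in for raw_tweets.json: an array of tweet-like objects
sample = '[{"text": "first tweet"}, {"text": "second tweet"}]'
path = os.path.join(tempfile.gettempdir(), 'raw_tweets_sample.json')
with open(path, mode='w') as f:
    f.write(sample)

# json.load() reads and decodes in one step; the with block closes the file
with open(path, mode='r') as f:
    data_python = json.load(f)

print(len(data_python))  # 2
```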
The CSV Writer
Before getting too far ahead, the script should create a CSV file and write the first row to label the data columns. The open() line creates a file and allows Python to write to it. This is a generic file, so anything could be written to it. The csv.writer() line creates an object that will write CSV-formatted text to the file we just opened. There are other parameters you can specify, but the writer defaults to Excel specifications, so those options can be omitted.
csv_out = open('tweets_out_ASCII.csv', mode='w') # opens csv file
writer = csv.writer(csv_out) # create the csv writer object

fields = ['created_at', 'text', 'screen_name', 'followers', 'friends', 'rt', 'fav'] # field names
writer.writerow(fields) # writes the field names as the header row
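If you're curious what those Excel defaults actually are, they can be spelled out explicitly. This sketch writes to an in-memory buffer instead of a file just to show what the writer produces when a value contains the delimiter:

```python
import csv
import io

buf = io.StringIO()  # in-memory stand-in for the CSV file
# spelling out the Excel-style defaults that csv.writer normally assumes
writer = csv.writer(buf, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow(['created_at', 'text with, a comma'])  # comma forces quoting
print(buf.getvalue())  # created_at,"text with, a comma"
```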
The purpose of this parser is to get some really basic information from the tweets, so it will only get the date and time, text, screen name, and the number of followers, friends, retweets and favorites [which are now called likes]. If you wanted to retrieve other information, you would create the column names accordingly. The writerow() method writes a list, with each element becoming a comma-separated value in the CSV file.
The json Python object can be used in a for loop to access the individual tweets. From there each line can be accessed to get the different variables we are interested in. I've condensed the code so that it is all in one statement. Breaking it down, line.get('attribute') retrieves the relevant information from the tweet, where line represents an individual tweet.
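Note that get() is doing some quiet error handling here: unlike bracket indexing, a missing key returns None rather than raising a KeyError, which keeps the loop alive on oddly shaped tweets. A toy tweet dictionary (hypothetical values) shows the difference:

```python
# a toy stand-in for one tweet dictionary
tweet = {'text': 'hello', 'user': {'screen_name': 'example_user'}}

# get() returns None for a missing field instead of raising KeyError
print(tweet.get('retweet_count'))            # None
# nested fields are reached by chaining get() calls
print(tweet.get('user').get('screen_name'))  # example_user
```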
for line in data_python:
    # writes a row and gets the fields from the json object
    # screen_name and followers/friends are found on the second level hence two get methods
    writer.writerow([line.get('created_at'),
                     line.get('text').encode('unicode_escape'), # unicode escape to fix emoji issue
                     line.get('user').get('screen_name'),
                     line.get('user').get('followers_count'),
                     line.get('user').get('friends_count'),
                     line.get('retweet_count'),
                     line.get('favorite_count')])

csv_out.close()
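The explicit close() call can also be replaced with a with block, which closes the file even if an exception interrupts the loop. A condensed sketch of the same flow (Python 3 shown, with a one-tweet stand-in list and a shortened field list for brevity):

```python
import csv
import os
import tempfile

rows = [{'created_at': 'Tue Nov 10 2015', 'text': 'hello'}]  # stand-in for data_python
path = os.path.join(tempfile.gettempdir(), 'tweets_out_sample.csv')

# the with block replaces the explicit csv_out.close() call
with open(path, mode='w', newline='') as csv_out:
    writer = csv.writer(csv_out)
    writer.writerow(['created_at', 'text'])  # header row
    for line in rows:
        writer.writerow([line.get('created_at'), line.get('text')])

print(open(path).read().strip())
```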
You might not notice this line, but it's critical to making this program work.
line.get('text').encode('unicode_escape'), # unicode escape to fix emoji issue
If the encode() method isn't included, unicode characters (like emojis) are passed along in their native encoding. They will be sent to the csv.writer object, which can't handle those characters and will fail. The escape is necessary for any field that could possibly contain a unicode character. I know the other fields I chose cannot have non-ASCII characters, but if you were to add name or description, you'd have to make sure they don't contain incompatible characters.
The unicode escape rewrites the unicode character as a string of letters and numbers, much like \U0001f35f. These sequences represent the characters and can actually be decoded later.
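The escape-and-decode round trip can be seen directly. This sketch uses Python 3 semantics, where encode() returns bytes (under Python 2, which this tutorial's code reflects, it returns a str):

```python
# round-trip: escape an emoji to an ASCII-safe form, then decode it back
text = u'fries \U0001f35f'
escaped = text.encode('unicode_escape')
print(escaped)             # b'fries \\U0001f35f'
restored = escaped.decode('unicode_escape')
print(restored == text)    # True
```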
The full code I used in this tutorial can be found on my GitHub.