Python: Collecting Twitter Data: Converting Twitter JSON to CSV — UTF-8

From OnnoWiki

Collecting Twitter Data: Converting Twitter JSON to CSV — UTF-8 November 10, 2015 Sean Dolinar

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8 [current page]

The main drawback to the ASCII CSV parser and the csv library is that they can't handle Unicode characters or objects. I want to be able to make a CSV file that is encoded in UTF-8, so that will have to be done from scratch. The basic structure follows the previous ASCII post, so the description of the json Python object can be found in that tutorial.

io.open

First, to handle the UTF-8 encoding, I used the io.open class. For the sake of consistency, I used this class for both reading the JSON file and writing the CSV file. This doesn't require much change to the structure of the program, but it's an important change. The json.loads() function reads the JSON data and parses it into an object you can access like a Python dictionary.

import json
import csv
import io

data_json = io.open('raw_tweets.json', mode='r', encoding='utf-8').read() #reads in the JSON file
data_python = json.loads(data_json)

csv_out = io.open('tweets_out_utf8.csv', mode='w', encoding='utf-8') #opens csv file
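As a quick illustration of that parsing step, here is a minimal sketch (the sample JSON is invented) showing that json.loads() returns objects you can index like Python dictionaries and lists:

```python
import json

# json.loads() turns a JSON string into nested Python objects:
# the outer array becomes a list, each tweet becomes a dict.
sample = u'[{"text": "hello", "user": {"screen_name": "example"}}]'
tweets = json.loads(sample)
print(tweets[0]['user']['screen_name'])
```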

Unicode Object Instead of List

This program uses the write() method instead of csv.writerow(), and write() requires a string, or in this case a unicode object, instead of a list. Commas have to be manually inserted into the string to properly separate the fields. For the field names, I just rewrote the line of code as a unicode string instead of the list used for the ASCII parser. The u'*string*' syntax creates a unicode string, which behaves similarly to a normal string, but they are different types, and using the wrong type of string can cause compatibility issues. The line of code that writes u'\n' creates a new line in the CSV; since this parser is built from scratch, it has to insert the newline character itself to start a new line in the CSV file.

fields = u'created_at,text,screen_name,followers,friends,rt,fav' #field names
csv_out.write(fields)
csv_out.write(u'\n')
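As a sketch of what those two write() calls produce, the snippet below substitutes an in-memory io.StringIO buffer (my stand-in for the io.open file handle, which accepts the same unicode text) so the result can be inspected:

```python
import io

# io.StringIO behaves like a text file opened for writing,
# so the same write() calls can be demonstrated in memory.
buf = io.StringIO()
fields = u'created_at,text,screen_name,followers,friends,rt,fav'
buf.write(fields)
buf.write(u'\n')
print(buf.getvalue())  # the header line, terminated by the manual newline
```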

The for loop and Delimiters

This might be the biggest change relative to the ASCII program. Since this is a CSV parser made from scratch, the delimiters have to be programmed in. For this flavor of CSV, the text field is entirely enclosed by quotation marks (") and commas (,) separate the different fields. To account for the possibility of quotation marks in the actual text content, any real quotation mark is escaped as a pair of double quotes (""). This can give rise to triple quotes, which happens when a quotation mark starts or ends a tweet's text field.
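A small sketch of that escaping rule on an invented tweet text, showing the triple-quote pattern when the quotation mark sits at the end of the field:

```python
# Escape embedded quotes by doubling them, then wrap the whole field.
text = u'She said "hello"'
field = u'"' + text.replace(u'"', u'""') + u'"'
print(field)  # the field ends in three quotation marks
```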

for line in data_python:

   #writes a row and gets the fields from the json object
   #screen_name and followers/friends are found on the second level hence two get methods
   row = [line.get('created_at'),
          '"' + line.get('text').replace('"','""') + '"', #creates double quotes
          line.get('user').get('screen_name'),
          unicode(line.get('user').get('followers_count')),
          unicode(line.get('user').get('friends_count')),
          unicode(line.get('retweet_count')),
          unicode(line.get('favorite_count'))]
   row_joined = u','.join(row)
   csv_out.write(row_joined)
   csv_out.write(u'\n')

csv_out.close()
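To make the loop body concrete, here is a minimal sketch run on one hand-made tweet dictionary (the values are invented, and str() stands in for the unicode() built-in so the snippet also runs under Python 3):

```python
# A hand-made dict with the same shape the parser expects
# (all values invented for illustration).
line = {'created_at': 'Tue Nov 10 12:00:00 +0000 2015',
        'text': 'Parsing "JSON" by hand',
        'user': {'screen_name': 'example_user',
                 'followers_count': 100,
                 'friends_count': 50},
        'retweet_count': 2,
        'favorite_count': 3}

row = [line.get('created_at'),
       '"' + line.get('text').replace('"', '""') + '"',  # escape embedded quotes
       line.get('user').get('screen_name'),
       str(line.get('user').get('followers_count')),  # unicode() in the Python 2 code
       str(line.get('user').get('friends_count')),
       str(line.get('retweet_count')),
       str(line.get('favorite_count'))]
row_joined = ','.join(row)
print(row_joined)
```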

This parser implements the delimiter requirements of the text field by:

   Replacing all quotation marks in the text with double quotes.
   Adding a quotation mark to the beginning and end of the unicode string.

'"' + line.get('text').replace('"','""') + '"', #creates double quotes

Joining the row list with a comma as the separator is a quick way to build the unicode string for a line of the CSV file.

row_joined = u','.join(row)
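This is also why each numeric count is cast with unicode() before the row is built: join() only accepts strings, and an integer left in the row raises a TypeError. A minimal sketch:

```python
# join() refuses non-string items, so the follower/friend/retweet/favorite
# counts must be cast to (unicode) strings before joining the row.
try:
    u','.join([u'a', 5])
except TypeError as err:
    print('must cast counts to strings first')
```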

The full code I used in this tutorial can be found on my GitHub.



