
Basic Sentiment Analysis with Python

01 Nov 2012

[Update]: you can check out the code on GitHub: https://github.com/fjavieralba/basic_sentiment_analysis/blob/master/basic_sentiment_analysis.py

In this post I will try to give a very introductory view of some techniques that could be useful when you want to perform a basic analysis of opinions written in English.

These techniques come 100% from experience in real-life projects. Don't expect a theoretical introduction to sentiment analysis and the multiple strategies out there to achieve opinion mining; this is only a practical example of applying some basic rules to extract the polarity (positive or negative) of a text.

Let's start by looking at an example opinion:

   "What can I say about this place. The staff of the restaurant is nice and the eggplant is not bad. Apart from that, very uninspired food, lack of atmosphere and too expensive. I am a staunch vegetarian and was sorely dissapointed with the veggie options on the menu. Will be the last time I visit, I recommend others to avoid."

As you can see, this is a mainly negative review about a restaurant.

General or detailed sentiment

Sometimes we only want an overall rating of the sentiment of the whole review. In other cases, we need a little more detail, and we want each negative or positive comment identified.

This kind of detailed detection can be quite challenging. Sometimes the aspect is explicit. An example is the opinion "very uninspired food", where the criticized aspect is the food. In other cases, it is implicit: the sentence "too expensive" gives a negative opinion about the price without mentioning it.

In this post I will focus on detecting the overall polarity of a review, leaving for later the identification of individual opinions on concrete aspects of the restaurant. To compute the polarity of a review, I'm going to use an approach based on dictionaries and some basic algorithms.

A note about the dictionaries

A dictionary is no more than a list of words that share a category. For example, you can have a dictionary for positive expressions, and another one for stop words.
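
Loaded into Python, such dictionaries can be as simple as the following sketch (the names here are hypothetical; the YAML files defined later in this post follow exactly this expression-to-categories shape):

#A hypothetical dictionary of positive expressions: each key is an
#expression, and its value is the list of categories it belongs to.
positive_dict = {
    'nice': ['positive'],
    'awesome': ['positive'],
}

#A stop-word dictionary can be even simpler, since all its
#words share a single category:
stop_words = {'the', 'of', 'and'}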

The design of the dictionaries highly depends on the concrete topic where you want to perform the opinion mining. Mining hotel opinions is quite different from mining laptop opinions. Not only can the positive/negative expressions be different, but the context vocabulary is also quite distinct.

Defining a structure for the text

Before writing code, there is an important decision to make. Our code will have to interact with text, splitting, tagging, and extracting information from it.

But what should be the structure of our text?

This is a key decision because it will determine our algorithms in some ways. We should decide if we want to differentiate sentences inside a paragraph. We could define a sentence as a list of tokens. But what is a token? A string? A more complex structure? Note that we will want to assign tags to our tokens. Should we only allow one tag per token or unlimited ones?

There are infinite options here. We could choose a very simple structure, for example defining the text simply as a list of words. Or we could define a more elaborate structure carrying every possible attribute of a processed text (word lemmas, word forms, multiple taggings, inflections...).

As usual, a compromise between these two extremes can be a good way to go.

For the examples of this post, I'm going to use the following structure:

       Each text is a list of sentences
       Each sentence is a list of tokens
       Each token is a tuple of three elements: a word form (the exact word that appeared in the text), a word lemma (a generalized version of the word), and a list of associated tags

This is a structure type I've found quite useful. It is ready for some "advanced" processing (lemmatization, multiple tags) without being too complex (at least in Python).

This is an example of a POS-tagged paragraph:

[[('All', 'All', ['DT']),
  ('that', 'that', ['DT']),
  ('is', 'is', ['VBZ']),
  ('gold', 'gold', ['NN']),
  ('does', 'does', ['VBZ']),
  ('not', 'not', ['RB']),
  ('glitter', 'glitter', ['VB']),
  ('.', '.', ['.'])],
 [('Not', 'Not', ['RB']),
  ('all', 'all', ['DT']),
  ('those', 'those', ['DT']),
  ('who', 'who', ['WP']),
  ('wander', 'wander', ['NN']),
  ('are', 'are', ['VBP']),
  ('lost', 'lost', ['VBN'])]]

Preprocessing the Text

Once we have decided the structural shape of our processed text, we can start writing some code to read and pre-process this text. By pre-processing I mean some common first steps in NLP, such as tokenizing, splitting into sentences, and POS tagging.

I will use the NLTK library for these tasks:

import nltk

class Splitter(object):

   def __init__(self):
       self.nltk_splitter = nltk.data.load('tokenizers/punkt/english.pickle')
       self.nltk_tokenizer = nltk.tokenize.TreebankWordTokenizer()
   def split(self, text):
       """
       input format: a paragraph of text
       output format: a list of lists of words.
           e.g.: [['this', 'is', 'a', 'sentence'], ['this', 'is', 'another', 'one']]
       """
       sentences = self.nltk_splitter.tokenize(text)
       tokenized_sentences = [self.nltk_tokenizer.tokenize(sent) for sent in sentences]
       return tokenized_sentences


class POSTagger(object):

   def __init__(self):
       pass
       
   def pos_tag(self, sentences):
       """
       input format: list of lists of words
           e.g.: [['this', 'is', 'a', 'sentence'], ['this', 'is', 'another', 'one']]
       output format: list of lists of tagged tokens. Each tagged token has a
       form, a lemma, and a list of tags
           e.g: [[('this', 'this', ['DT']), ('is', 'be', ['VB']), ('a', 'a', ['DT']), ('sentence', 'sentence', ['NN'])],
                   [('this', 'this', ['DT']), ('is', 'be', ['VB']), ('another', 'another', ['DT']), ('one', 'one', ['CARD'])]]
       """
       pos = [nltk.pos_tag(sentence) for sentence in sentences]
        #adapt the result to our (form, lemma, [tags]) token structure;
        #we do not lemmatize here, so the lemma is simply the word form itself
        pos = [[(word, word, [postag]) for (word, postag) in sentence] for sentence in pos]
       return pos

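A practical note: depending on your NLTK version, you may need to download the tokenizer and tagger models once before these classes will work. For recent NLTK releases that looks like this (older releases bundled differently named tagger models):

import nltk

nltk.download('punkt')                      #sentence tokenizer models
nltk.download('averaged_perceptron_tagger') #POS tagger model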

Now, using these two simple wrapper classes, I can perform a basic text preprocessing, where the input is the text as a string and the output is a collection of sentences, each of which is again a collection of tokens.

For the moment, our tokens are quite simple. Since we are not lemmatizing words, our forms and lemmas will always be identical. At this point of the process, the only tag associated with each word is its own POS tag provided by NLTK.

text = """What can I say about this place. The staff of the restaurant is nice and the eggplant is not bad. Apart from that, very uninspired food, lack of atmosphere and too expensive. I am a staunch vegetarian and was sorely dissapointed with the veggie options on the menu. Will be the last time I visit, I recommend others to avoid."""

splitter = Splitter()
postagger = POSTagger()

splitted_sentences = splitter.split(text)

print(splitted_sentences)

[['What', 'can', 'I', 'say', 'about', 'this', 'place', '.'], ['The', 'staff', 'of', 'the', 'restaurant', 'is', 'nice', 'and', 'eggplant', 'is', 'not', 'bad', '.'], ['apart', 'from', 'that', ',', 'very', 'uninspired', 'food', ',', 'lack', 'of', 'atmosphere', 'and', 'too', 'expensive', '.'], ['I', 'am', 'a', 'staunch', 'vegetarian', 'and', 'was', 'sorely', 'dissapointed', 'with', 'the', 'veggie', 'options', 'on', 'the', 'menu', '.'], ['Will', 'be', 'the', 'last', 'time', 'I', 'visit', ',', 'I', 'recommend', 'others', 'to', 'avoid', '.']]

pos_tagged_sentences = postagger.pos_tag(splitted_sentences)

print(pos_tagged_sentences)

[[('What', 'What', ['WP']), ('can', 'can', ['MD']), ('I', 'I', ['PRP']), ('say', 'say', ['VB']), ('about', 'about', ['IN']), ('this', 'this', ['DT']), ('place', 'place', ['NN']), ('.', '.', ['.'])], [('The', 'The', ['DT']), ('staff', 'staff', ['NN']), ('of', 'of', ['IN']), ('the', 'the', ['DT']), ('restaurant', 'restaurant', ['NN']), ('is', 'is', ['VBZ']), ('nice', 'nice', ['JJ']), ('and', 'and', ['CC']), ('eggplant', 'eggplant', ['NN']), ('is', 'is', ['VBZ']), ('not', 'not', ['RB']), ('bad', 'bad', ['JJ']), ('.', '.', ['.'])], [('apart', 'apart', ['NN']), ('from', 'from', ['IN']), ('that', 'that', ['DT']), (',', ',', [',']), ('very', 'very', ['RB']), ('uninspired', 'uninspired', ['VBN']), ('food', 'food', ['NN']), (',', ',', [',']), ('lack', 'lack', ['NN']), ('of', 'of', ['IN']), ('atmosphere', 'atmosphere', ['NN']), ('and', 'and', ['CC']), ('too', 'too', ['RB']), ('expensive', 'expensive', ['JJ']), ('.', '.', ['.'])], [('I', 'I', ['PRP']), ('am', 'am', ['VBP']), ('a', 'a', ['DT']), ('staunch', 'staunch', ['NN']), ('vegetarian', 'vegetarian', ['NN']), ('and', 'and', ['CC']), ('was', 'was', ['VBD']), ('sorely', 'sorely', ['RB']), ('dissapointed', 'dissapointed', ['VBN']), ('with', 'with', ['IN']), ('the', 'the', ['DT']), ('veggie', 'veggie', ['NN']), ('options', 'options', ['NNS']), ('on', 'on', ['IN']), ('the', 'the', ['DT']), ('menu', 'menu', ['NN']), ('.', '.', ['.'])], [('Will', 'Will', ['NNP']), ('be', 'be', ['VB']), ('the', 'the', ['DT']), ('last', 'last', ['JJ']), ('time', 'time', ['NN']), ('I', 'I', ['PRP']), ('visit', 'visit', ['VBP']), (',', ',', [',']), ('I', 'I', ['PRP']), ('recommend', 'recommend', ['VBP']), ('others', 'others', ['NNS']), ('to', 'to', ['TO']), ('avoid', 'avoid', ['VB']), ('.', '.', ['.'])]]
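
If you did want real lemmas in the second slot of each token, NLTK ships a WordNet-based lemmatizer. The following is only a minimal sketch of how the tagging step could be adapted; pos_tag_with_lemmas is a hypothetical helper, and it assumes the WordNet corpus has been downloaded with nltk.download('wordnet'):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def pos_tag_with_lemmas(sentences):
    #same as POSTagger.pos_tag, but fills the lemma slot with a WordNet lemma
    pos = [nltk.pos_tag(sentence) for sentence in sentences]
    #note: WordNetLemmatizer assumes nouns by default; a fuller version
    #would map each POS tag to the corresponding WordNet POS first
    return [[(word, lemmatizer.lemmatize(word.lower()), [postag])
             for (word, postag) in sentence]
            for sentence in pos]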

Defining a dictionary of positive and negative expressions

The next step is to recognize positive and negative expressions. To achieve this, I'm going to use dictionaries, i.e. simple files containing expressions that will be searched in our text.

For example, I'm going to define two tiny dictionaries, one for positive expressions and another for negative ones:

positive.yml

nice: [positive]
awesome: [positive]
cool: [positive]
superb: [positive]

negative.yml

bad: [negative]
uninspired: [negative]
expensive: [negative]
dissapointed: [negative]
recommend others to avoid: [negative]

In case you were wondering, we could have used a simpler format, or used only one file, but this dictionary format will be useful later.
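
As a quick sanity check of the format: loading one of these files with PyYAML (which the tagger below relies on as well) yields a plain expression-to-tags mapping:

import yaml

with open('dicts/positive.yml') as dict_file:
    print(yaml.safe_load(dict_file))

#{'nice': ['positive'], 'awesome': ['positive'], 'cool': ['positive'], 'superb': ['positive']}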

Note that these are only two example dictionaries, useless in a real-life project.

Tagging the text with dictionaries

The following code defines a class that I will use to tag our pre-processed text with the dictionaries we just defined:

import yaml

class DictionaryTagger(object):

    def __init__(self, dictionary_paths):
        files = [open(path, 'r') for path in dictionary_paths]
        dictionaries = [yaml.safe_load(dict_file) for dict_file in files]
        for dict_file in files:
            dict_file.close()
        self.dictionary = {}
        #max_key_size is measured in characters; since an expression always
        #has at least as many characters as tokens, it is a safe upper bound
        #on the token length of the longest dictionary expression
        self.max_key_size = 0
        for curr_dict in dictionaries:
            for key in curr_dict:
                if key in self.dictionary:
                    self.dictionary[key].extend(curr_dict[key])
                else:
                    self.dictionary[key] = curr_dict[key]
                    self.max_key_size = max(self.max_key_size, len(key))
   def tag(self, postagged_sentences):
       return [self.tag_sentence(sentence) for sentence in postagged_sentences]
   def tag_sentence(self, sentence, tag_with_lemmas=False):
       """
       the result is only one tagging of all the possible ones.
       The resulting tagging is determined by these two priority rules:
           - longest matches have higher priority
           - search is made from left to right
       """
       tag_sentence = []
       N = len(sentence)
       if self.max_key_size == 0:
           self.max_key_size = N
       i = 0
        while i < N:
            j = min(i + self.max_key_size, N) #avoid overflow
            tagged = False
            while j > i:
               expression_form = ' '.join([word[0] for word in sentence[i:j]]).lower()
               expression_lemma = ' '.join([word[1] for word in sentence[i:j]]).lower()
               if tag_with_lemmas:
                   literal = expression_lemma
               else:
                   literal = expression_form
               if literal in self.dictionary:
                   #self.logger.debug("found: %s" % literal)
                   is_single_token = j - i == 1
                   original_position = i
                   i = j
                   taggings = [tag for tag in self.dictionary[literal]]
                   tagged_expression = (expression_form, expression_lemma, taggings)
                   if is_single_token: #if the tagged literal is a single token, conserve its previous taggings:
                       original_token_tagging = sentence[original_position][2]
                       tagged_expression[2].extend(original_token_tagging)
                   tag_sentence.append(tagged_expression)
                   tagged = True
               else:
                   j = j - 1
           if not tagged:
               tag_sentence.append(sentence[i])
               i += 1
       return tag_sentence


When tagging our review, the input is the previously preprocessed text, and the output is the same text, enriched with tags of type "positive" or "negative":

from pprint import pprint

dicttagger = DictionaryTagger([ 'dicts/positive.yml', 'dicts/negative.yml'])

dict_tagged_sentences = dicttagger.tag(pos_tagged_sentences)

pprint(dict_tagged_sentences)

[[('What', 'What', ['WP']),
 ('can', 'can', ['MD']),
 ('I', 'I', ['PRP']),
 ('say', 'say', ['VB']),
 ('about', 'about', ['IN']),
 ('this', 'this', ['DT']),
 ('place', 'place', ['NN']),
 ('.', '.', ['.'])],
[('The', 'The', ['DT']),
 ('staff', 'staff', ['NN']),
 ('of', 'of', ['IN']),
 ('the', 'the', ['DT']),
 ('restaurant', 'restaurant', ['NN']),
 ('is', 'is', ['VBZ']),
 ('nice', 'nice', ['positive', 'JJ']),
 ('and', 'and', ['CC']),
 ('eggplant', 'eggplant', ['NN']),
 ('is', 'is', ['VBZ']),
 ('not', 'not', ['RB']),
 ('bad', 'bad', ['negative', 'JJ']),
 ('.', '.', ['.'])],
[('apart', 'apart', ['NN']),
 ('from', 'from', ['IN']),
 ('that', 'that', ['DT']),
 (',', ',', [',']),
 ('very', 'very', ['RB']),
 ('uninspired', 'uninspired', ['negative', 'VBN']),
 ('food', 'food', ['NN']),
 (',', ',', [',']),
 ('lack', 'lack', ['NN']),
 ('of', 'of', ['IN']),
 ('atmosphere', 'atmosphere', ['NN']),
 ('and', 'and', ['CC']),
 ('too', 'too', ['RB']),
 ('expensive', 'expensive', ['negative', 'JJ']),
 ('.', '.', ['.'])],
[('I', 'I', ['PRP']),
 ('am', 'am', ['VBP']),
 ('a', 'a', ['DT']),
 ('staunch', 'staunch', ['NN']),
 ('vegetarian', 'vegetarian', ['NN']),
 ('and', 'and', ['CC']),
 ('was', 'was', ['VBD']),
 ('sorely', 'sorely', ['RB']),
 ('dissapointed', 'dissapointed', ['negative', 'VBN']),
 ('with', 'with', ['IN']),
 ('the', 'the', ['DT']),
 ('veggie', 'veggie', ['NN']),
 ('options', 'options', ['NNS']),
 ('on', 'on', ['IN']),
 ('the', 'the', ['DT']),
 ('menu', 'menu', ['NN']),
 ('.', '.', ['.'])],
[('Will', 'Will', ['NNP']),
 ('be', 'be', ['VB']),
 ('the', 'the', ['DT']),
 ('last', 'last', ['JJ']),
 ('time', 'time', ['NN']),
 ('I', 'I', ['PRP']),
 ('visit', 'visit', ['VBP']),
 (',', ',', [',']),
 ('I', 'I', ['PRP']),
 ('recommend others to avoid', 'recommend others to avoid', ['negative']),
 ('.', '.', ['.'])]]

Note how the longest-match rule allowed the four-word expression "recommend others to avoid" to be tagged as a single negative unit.

A simple sentiment measure

We can already perform a basic calculation of how positive or negative a review is.

Simply counting how many positive and negative expressions we detected could be a (very naive) sentiment measure.

The following code snippet applies this idea:

def value_of(sentiment):
    if sentiment == 'positive': return 1
    if sentiment == 'negative': return -1
    return 0

def sentiment_score(review):
    return sum([value_of(tag) for sentence in review for token in sentence for tag in token[2]])

sentiment_score(dict_tagged_sentences)

-4

So, our review could be considered "quite negative", since it has a score of -4: one positive expression (nice) against five negative ones (bad, uninspired, expensive, dissapointed, and recommend others to avoid).

Incrementers and decrementers

The previous "sentiment score" was very basic: it only counts positive and negative expressions and sums them up, without taking into account that some expressions may be more positive or more negative than others.

A way of defining this "strength" could be to use two new dictionaries: one for "incrementers" and another for "decrementers".

Let's define two tiny examples:

inc.yml

too: [inc]
very: [inc]
sorely: [inc]

dec.yml

barely: [dec]
little: [dec]

We instantiate our tagger again, telling it to use these two new dictionaries:

dicttagger = DictionaryTagger([ 'dicts/positive.yml', 'dicts/negative.yml', 'dicts/inc.yml', 'dicts/dec.yml'])

dict_tagged_sentences = dicttagger.tag(pos_tagged_sentences)

pprint(dict_tagged_sentences)

[[('What', 'What', ['WP']),
 ('can', 'can', ['MD']),
 ('I', 'I', ['PRP']),
 ('say', 'say', ['VB']),
 ('about', 'about', ['IN']),
 ('this', 'this', ['DT']),
 ('place', 'place', ['NN']),
 ('.', '.', ['.'])],
[('The', 'The', ['DT']),
 ('staff', 'staff', ['NN']),
 ('of', 'of', ['IN']),
 ('the', 'the', ['DT']),
 ('restaurant', 'restaurant', ['NN']),
 ('is', 'is', ['VBZ']),
 ('nice', 'nice', ['positive', 'JJ']),
 ('and', 'and', ['CC']),
 ('eggplant', 'eggplant', ['NN']),
 ('is', 'is', ['VBZ']),
 ('not', 'not', ['RB']),
 ('bad', 'bad', ['negative', 'JJ']),
 ('.', '.', ['.'])],
[('apart', 'apart', ['NN']),
 ('from', 'from', ['IN']),
 ('that', 'that', ['DT']),
 (',', ',', [',']),
 ('very', 'very', ['inc', 'RB']),
 ('uninspired', 'uninspired', ['negative', 'VBN']),
 ('food', 'food', ['NN']),
 (',', ',', [',']),
 ('lack', 'lack', ['NN']),
 ('of', 'of', ['IN']),
 ('atmosphere', 'atmosphere', ['NN']),
 ('and', 'and', ['CC']),
 ('too', 'too', ['inc', 'RB']),
 ('expensive', 'expensive', ['negative', 'JJ']),
 ('.', '.', ['.'])],
[('I', 'I', ['PRP']),
 ('am', 'am', ['VBP']),
 ('a', 'a', ['DT']),
 ('staunch', 'staunch', ['NN']),
 ('vegetarian', 'vegetarian', ['NN']),
 ('and', 'and', ['CC']),
 ('was', 'was', ['VBD']),
 ('sorely', 'sorely', ['inc', 'RB']),
 ('dissapointed', 'dissapointed', ['negative', 'VBN']),
 ('with', 'with', ['IN']),
 ('the', 'the', ['DT']),
 ('veggie', 'veggie', ['NN']),
 ('options', 'options', ['NNS']),
 ('on', 'on', ['IN']),
 ('the', 'the', ['DT']),
 ('menu', 'menu', ['NN']),
 ('.', '.', ['.'])],
[('Will', 'Will', ['NNP']),
 ('be', 'be', ['VB']),
 ('the', 'the', ['DT']),
 ('last', 'last', ['JJ']),
 ('time', 'time', ['NN']),
 ('I', 'I', ['PRP']),
 ('visit', 'visit', ['VBP']),
 (',', ',', [',']),
 ('I', 'I', ['PRP']),
 ('recommend others to avoid', 'recommend others to avoid', ['negative']),
 ('.', '.', ['.'])]]


Now we can improve our sentiment score a bit. The idea is that "good" has more strength than "barely good" but less than "very good".

The following code defines the recursive function sentence_score to compute the sentiment score of a sentence. The most remarkable thing about it is that it uses information about the previous token to decide the score of the current token.

This function is then used by our new sentiment_score function:

def sentence_score(sentence_tokens, previous_token, acum_score):

   if not sentence_tokens:
       return acum_score
   else:
       current_token = sentence_tokens[0]
       tags = current_token[2]
       token_score = sum([value_of(tag) for tag in tags])
       if previous_token is not None:
           previous_tags = previous_token[2]
           if 'inc' in previous_tags:
               token_score *= 2.0
           elif 'dec' in previous_tags:
               token_score /= 2.0
       return sentence_score(sentence_tokens[1:], current_token, acum_score + token_score)

def sentiment_score(review):

   return sum([sentence_score(sentence, None, 0.0) for sentence in review])

sentiment_score(dict_tagged_sentences)

-7.0
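
To see the incrementer/decrementer logic in isolation, here is a tiny hand-built check (the token lists below are hypothetical, but they follow the (form, lemma, tags) structure defined earlier):

good = [('good', 'good', ['positive'])]
very_good = [('very', 'very', ['inc']), ('good', 'good', ['positive'])]
barely_good = [('barely', 'barely', ['dec']), ('good', 'good', ['positive'])]

print(sentence_score(good, None, 0.0))        #1.0
print(sentence_score(very_good, None, 0.0))   #2.0
print(sentence_score(barely_good, None, 0.0)) #0.5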

Notice that the review is now considered more negative, due to the appearance of expressions such as "very uninspired", "too expensive" and "sorely dissapointed": each of these three negative expressions now counts double, taking the score from -4 to -7.

Inverters and polarity flips

With the approach we've been following so far, some expressions could be incorrectly tagged. For example, this part of our example review:

   the eggplant is not bad

contains the word bad but the sentence is a positive opinion about the eggplant.

This is because of the appearance of the negation word not, which flips the meaning of the negative adjective bad.

We could take these kinds of polarity flips into account by defining a dictionary of inverters:

inv.yml

lack of: [inv]
not: [inv]

When tagging our text, we should also specify this new dictionary in the instantiation of our tagger:

dicttagger = DictionaryTagger([ 'dicts/positive.yml', 'dicts/negative.yml', 'dicts/inc.yml', 'dicts/dec.yml', 'dicts/inv.yml'])

dict_tagged_sentences = dicttagger.tag(pos_tagged_sentences)

pprint(dict_tagged_sentences)

[[('What', 'What', ['WP']),
 ('can', 'can', ['MD']),
 ('I', 'I', ['PRP']),
 ('say', 'say', ['VB']),
 ('about', 'about', ['IN']),
 ('this', 'this', ['DT']),
 ('place', 'place', ['NN']),
 ('.', '.', ['.'])],
[('The', 'The', ['DT']),
 ('staff', 'staff', ['NN']),
 ('of', 'of', ['IN']),
 ('the', 'the', ['DT']),
 ('restaurant', 'restaurant', ['NN']),
 ('is', 'is', ['VBZ']),
 ('nice', 'nice', ['positive', 'JJ']),
 ('and', 'and', ['CC']),
 ('eggplant', 'eggplant', ['NN']),
 ('is', 'is', ['VBZ']),
 ('not', 'not', ['inv', 'RB']),
 ('bad', 'bad', ['negative', 'JJ']),
 ('.', '.', ['.'])],
[('apart', 'apart', ['NN']),
 ('from', 'from', ['IN']),
 ('that', 'that', ['DT']),
 (',', ',', [',']),
 ('very', 'very', ['inc', 'RB']),
 ('uninspired', 'uninspired', ['negative', 'VBN']),
 ('food', 'food', ['NN']),
 (',', ',', [',']),
 ('lack of', 'lack of', ['inv']),
 ('atmosphere', 'atmosphere', ['NN']),
 ('and', 'and', ['CC']),
 ('too', 'too', ['inc', 'RB']),
 ('expensive', 'expensive', ['negative', 'JJ']),
 ('.', '.', ['.'])],
[('I', 'I', ['PRP']),
 ('am', 'am', ['VBP']),
 ('a', 'a', ['DT']),
 ('staunch', 'staunch', ['NN']),
 ('vegetarian', 'vegetarian', ['NN']),
 ('and', 'and', ['CC']),
 ('was', 'was', ['VBD']),
 ('sorely', 'sorely', ['inc', 'RB']),
 ('dissapointed', 'dissapointed', ['negative', 'VBN']),
 ('with', 'with', ['IN']),
 ('the', 'the', ['DT']),
 ('veggie', 'veggie', ['NN']),
 ('options', 'options', ['NNS']),
 ('on', 'on', ['IN']),
 ('the', 'the', ['DT']),
 ('menu', 'menu', ['NN']),
 ('.', '.', ['.'])],
[('Will', 'Will', ['NNP']),
 ('be', 'be', ['VB']),
 ('the', 'the', ['DT']),
 ('last', 'last', ['JJ']),
 ('time', 'time', ['NN']),
 ('I', 'I', ['PRP']),
 ('visit', 'visit', ['VBP']),
 (',', ',', [',']),
 ('I', 'I', ['PRP']),
 ('recommend others to avoid', 'recommend others to avoid', ['negative']),
 ('.', '.', ['.'])]]


Then, we could adapt our sentiment_score function. We want it to flip the polarity of a sentiment word when it is preceded by an inverter:

def sentence_score(sentence_tokens, previous_token, acum_score):

   if not sentence_tokens:
       return acum_score
   else:
       current_token = sentence_tokens[0]
       tags = current_token[2]
       token_score = sum([value_of(tag) for tag in tags])
       if previous_token is not None:
           previous_tags = previous_token[2]
           if 'inc' in previous_tags:
               token_score *= 2.0
           elif 'dec' in previous_tags:
               token_score /= 2.0
           elif 'inv' in previous_tags:
               token_score *= -1.0
       return sentence_score(sentence_tokens[1:], current_token, acum_score + token_score)

def sentiment_score(review):

   return sum([sentence_score(sentence, None, 0.0) for sentence in review])

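And a similarly tiny hand-built check of the inverter rule (again with hypothetical tokens):

not_bad = [('not', 'not', ['inv']), ('bad', 'bad', ['negative'])]
print(sentence_score(not_bad, None, 0.0)) #1.0: "not bad" reads as positive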

Recalculating the sentiment score:

sentiment_score(dict_tagged_sentences)

-5.0

It's now -5.0, since "not bad" is considered positive: the bad token flips from -1 to +1, a swing of +2 that takes the score from -7.0 to -5.0. (The new "lack of" inverter also matches, but since atmosphere carries no sentiment tag, it does not change the score.)

Conclusion

We have seen a brief introduction to some basic techniques and algorithms that can give us an overall "score" of how positive or negative a review is.

The steps we've followed are:

       Split the text into sentences, and each sentence into tokens
       Add POS (Part-Of-Speech) tags to the split text, using NLTK
       Enrich the POS-tagged text with our own tags using dictionaries. These tags are on a different "semantic level" than POS tags: "positive", "negative", "inverter", "incrementer" and "decrementer"
       Implement some basic extraction rules over the tagged text, in the form of Python functions

That could be a good starting point for someone interested in sentiment analysis, but this is only the very beginning.

In a real-life system you will have to work harder, especially on the extraction-rules part (and, of course, on the dictionaries).

The method described so far is a rule-based approach. There are other techniques to perform sentiment analysis, for example applying machine-learning algorithms. In any case, I think that advanced rule-based or machine-learning systems are out of scope for an introductory post like this.

Hope you enjoyed the read!



