Current Approaches to Sentiment Analysis (I): Document-Level Sentiment Classification

June 26th, 2010, 21:35H · Topics: Online Reputation Management, Technical Papers

As mentioned in the post on “Web Research and Opinion Retrieval”, sentiment analysis is not an easy task. Because of this, different studies and approaches have been applied to it.

DOCUMENT-LEVEL SENTIMENT CLASSIFICATION

The first and simplest approach consists of classifying an opinionated document (e.g., a product review) as expressing a positive or negative opinion. This task is commonly known as document-level sentiment classification because it treats the whole document as the basic information unit.

Given a set of opinionated documents D, document-level sentiment classification determines whether each document d ∈ D expresses a positive or negative opinion (or sentiment) on an object.
Existing research on sentiment classification assumes that an opinionated document d expresses opinions on a single object and that the opinions come from a single opinion holder. These assumptions mean that the technique can only be applied in specific contexts, because it is difficult to find documents that contain opinions on only one object.

Most existing techniques for document-level sentiment classification are based on supervised learning, although there are also some unsupervised methods.

Classification Based on Supervised Learning (document level)
Sentiment classification can naturally be formulated as a supervised learning problem with two class labels (positive and negative), assuming that all the documents are known to be opinionated.
Training and testing data used in existing research are mostly product reviews with a reviewer-assigned rating (usually between 1 and 5 stars). Typically, a review with 4-5 stars is considered a positive review (thumbs-up), and a review with 1-2 stars is considered a negative review (thumbs-down).
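As a minimal sketch of this labeling convention, the mapping from stars to class labels could look as follows. The function name `label_from_stars` is hypothetical, and leaving 3-star reviews unlabeled is an assumption, since the convention above only covers 1-2 and 4-5 stars.

```python
from typing import Optional


def label_from_stars(stars: int) -> Optional[str]:
    """Map a 1-5 star rating to a polarity label.

    Assumption: 3-star reviews are treated as neither class and return None.
    """
    if not 1 <= stars <= 5:
        raise ValueError("stars must be between 1 and 5")
    if stars >= 4:
        return "positive"   # thumbs-up
    if stars <= 2:
        return "negative"   # thumbs-down
    return None


# Toy reviews, invented for illustration.
reviews = [("Great phone", 5), ("Broke in a week", 1), ("It is okay", 3)]

# Keep only the reviews that receive a label.
labeled = [(text, label_from_stars(stars)) for text, stars in reviews
           if label_from_stars(stars) is not None]
print(labeled)  # [('Great phone', 'positive'), ('Broke in a week', 'negative')]
```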
Existing supervised learning methods can be readily applied to sentiment classification; for example, Naïve Bayes, SVM (Support Vector Machines) and Maximum Entropy have all been used with good results.
Basic experiments with this approach use unigrams as classification features, with methods such as Naïve Bayes or SVM. Training yields a classifier that learns which unigrams typically appear in positive documents and which in negative ones. A new document is then assigned the polarity whose typical unigrams its own unigrams most resemble.
This approach has several limitations, one of which is that it cannot take negation into account. The term “not”, for example, appears in both positive and negative opinions, so it is not a useful term for identifying polarity. As a result, “not good” and “good” will both be classified as positive opinions, because “not” is not a polar term for the classifier.
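The unigram approach and its blind spot for negation can be illustrated with a small Naïve Bayes classifier. This is only a sketch: the training sentences are invented, the whitespace tokenizer and add-one smoothing are assumptions, and the function names are hypothetical rather than taken from any of the cited experiments.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Count unigrams per class from (text, label) pairs."""
    word_counts = {}        # label -> Counter of unigrams
    doc_counts = Counter()  # label -> number of training documents
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # unigrams never seen in training carry no evidence
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set, invented for illustration.
training = [
    ("great camera and excellent battery", "positive"),
    ("excellent screen great value", "positive"),
    ("poor battery and terrible screen", "negative"),
    ("terrible camera poor value", "negative"),
]
model = train_naive_bayes(training)
print(classify("great value excellent camera", model))  # positive
print(classify("this is not great", model))             # positive
```

The second call shows the limitation: only “great” contributes evidence, so the negated sentence is still labeled positive.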


To address this limitation, some experiments pre-process the documents and fold the negation into a unigram, replacing “not good” with “not_good” and training the classifier with this compound term. Other experiments use bigrams and trigrams to capture expressions with more than one word.
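A minimal sketch of both ideas follows. The negation word list, the underscore-joined form `not_good`, and the function names are assumptions made for illustration; published experiments differ in how far the negation scope extends.

```python
NEGATIONS = {"not", "no", "never"}  # assumed negation list


def merge_negations(tokens):
    """Join a negation word with the token that follows it: not good -> not_good."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2  # skip the token that was absorbed into the compound
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the plot is not good".split()
print(merge_negations(tokens))  # ['the', 'plot', 'is', 'not_good']
print(ngrams(tokens, 2))        # ['the plot', 'plot is', 'is not', 'not good']
```

Either output can then be used as the feature list for the classifier in place of plain unigrams.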

Bo Pang took this approach to classify movie reviews into two classes, positive and negative (neutral reviews were not used in this work), obtaining accuracy close to 83% using SVM. The features used in his experiments were unigrams, bigrams, POS tags, negation tags and the position of the terms.
Other authors, during the pre-processing step, remove stop words from the documents using a stop-word list and apply lemmatization and stemming.
In all cases, the parts of a document that are not related to the object being evaluated introduce noise, which lowers the accuracy of the results. A simple solution is to use windows of action, applying the classification method only to the text surrounding the keywords that identify the object being evaluated.
One of the biggest bottlenecks in applying supervised learning is the manual effort involved in annotating a large number of training examples.
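The window idea can be sketched as a filter that keeps only the tokens near a keyword before the text is passed to the classifier. The function name `keyword_windows`, the window size of 3 and the example sentence are all assumptions for illustration.

```python
def keyword_windows(tokens, keywords, size=3):
    """Return the tokens within `size` positions of any keyword occurrence."""
    keep = set()
    for i, token in enumerate(tokens):
        if token in keywords:
            start = max(0, i - size)
            end = min(len(tokens), i + size + 1)
            keep.update(range(start, end))
    # Preserve the original word order; overlapping windows are merged.
    return [tokens[i] for i in sorted(keep)]


text = ("we flew to rome last week and the camera took great photos "
        "while the hotel was awful")
tokens = text.split()
print(" ".join(keyword_windows(tokens, {"camera"}, size=3)))
# week and the camera took great photos
```

Here the negative remark about the hotel falls outside the window, so it no longer affects the polarity assigned to the camera.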

Classification Based on Unsupervised Learning
Unsupervised learning based on opinion words and phrases is another interesting approach. It was used by authors such as P. Turney in “Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews”. The algorithm consists of three steps:

  • Step 1: It extracts phrases containing adjectives or adverbs (research has shown that adjectives and adverbs are good indicators of subjectivity and opinions)

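  • Step 2: It estimates the opinion orientation (OO) of each extracted phrase using pointwise mutual information (PMI), as the difference between the PMI of the phrase with a positive reference word and its PMI with a negative one. For example:
OO(phrase) = PMI(phrase, “excellent”) - PMI(phrase, “poor”)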

  • Step 3: Given a review, the algorithm computes the average opinion orientation of all phrases in the review and classifies the review as recommended if the average is positive, and as not recommended otherwise.
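The orientation and averaging steps can be sketched as follows, using PMI(a, b) = log2(P(a, b) / (P(a) · P(b))) estimated from counts. This is a hedged illustration, not Turney's implementation: the counts are invented, the example phrases and the `stats` layout are hypothetical, and the sketch assumes every count is nonzero.

```python
import math


def pmi(joint, count_a, count_b, total):
    """Pointwise mutual information from counts.

    log2(P(a, b) / (P(a) * P(b))) simplifies to
    log2(joint * total / (count_a * count_b)).
    """
    return math.log2(joint * total / (count_a * count_b))


def opinion_orientation(phrase, stats):
    """OO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")."""
    total = stats["total"]
    n_phrase = stats["count"][phrase]
    pos = pmi(stats["joint"][(phrase, "excellent")], n_phrase,
              stats["count"]["excellent"], total)
    neg = pmi(stats["joint"][(phrase, "poor")], n_phrase,
              stats["count"]["poor"], total)
    return pos - neg


def classify_review(phrases, stats):
    """Average the OO of the phrases; a positive average means recommended."""
    average = sum(opinion_orientation(p, stats) for p in phrases) / len(phrases)
    return "recommended" if average > 0 else "not recommended"


# Toy counts, invented for illustration (not real corpus statistics).
stats = {
    "total": 1000,
    "count": {"excellent": 100, "poor": 100,
              "low fees": 20, "unethical practices": 10},
    "joint": {
        ("low fees", "excellent"): 8, ("low fees", "poor"): 1,
        ("unethical practices", "excellent"): 1,
        ("unethical practices", "poor"): 4,
    },
}
print(classify_review(["low fees", "unethical practices"], stats))  # recommended
print(classify_review(["unethical practices"], stats))  # not recommended
```

With these toy counts, “low fees” has an orientation of 3 and “unethical practices” of -2, so a review containing both averages to 0.5 and is classified as recommended.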

Many other unsupervised methods exist apart from this one, but they are not as widely used.
In our next post, we will discuss different approaches to sentence-level sentiment classification.

