Implementing Naive Bayes for Tweet Sentiment Classification
Data Preparation and Loading
Import the necessary libraries and load the Twitter dataset. Make sure the NLTK Twitter corpus and stop-word list are downloaded before running the code.
from utils import clean_message, fetch_frequency
import numpy as np
from nltk.corpus import stopwords, twitter_samples
import string
from nltk.tokenize import TweetTokenizer
Split the dataset into training and test subsets: the first 4,000 tweets of each class go to training, and the remainder to testing.
pos_messages = twitter_samples.strings('positive_tweets.json')
neg_messages = twitter_samples.strings('negative_tweets.json')
train_pos = pos_messages[:4000]
test_pos = pos_messages[4000:]
train_neg = neg_messages[:4000]
test_neg = neg_messages[4000:]
X_train = train_pos + train_neg
X_test = test_pos + test_neg
y_train = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
y_test = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))
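The label construction above can be sanity-checked on a miniature, made-up split (the tweet strings here are hypothetical placeholders, not dataset tweets):

```python
import numpy as np

# Hypothetical miniature split (placeholder strings for illustration)
train_pos = ['great day :)', 'so happy']
train_neg = ['feeling sad', 'this is awful', 'worst ever']

X_train = train_pos + train_neg
y_train = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))

# Positive examples come first, so the label vector is 1s followed by 0s
print(y_train)       # [1. 1. 0. 0. 0.]
print(len(X_train))  # 5
```

Because the positive examples are concatenated first, the ones in the label vector line up with them automatically.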
Text Processing
Cleaning the text is critical for providing useful input to the model. Noise reduction removes stop words, stock tickers, retweet markers, URLs, and hashtag symbols, since they carry little sentiment signal. Stripping punctuation ensures tokens like "happy", "happy!", and "happy?" are treated identically. Stemming consolidates related terms, mapping "motivation", "motivated", and "motivate" to the common root "motiv". The clean_message utility performs all of these transformations.
sample_input = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"
print(clean_message(sample_input))
# Output: ['hello', 'great', 'day', ':)', 'good', 'morn']
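clean_message lives in utils and is not shown here; a simplified stand-in conveys the idea. Everything below (the tiny stop-word set, the regexes, the crude suffix stemmer) is an illustrative approximation, not the real helper, which uses NLTK's TweetTokenizer, stop-word corpus, and PorterStemmer, and also preserves emoticons such as ':)':

```python
import re
import string

# Illustrative only: a tiny stop-word set instead of NLTK's full list
STOP_WORDS = {'i', 'am', 'a', 'an', 'the', 'is', 'have', 'there'}

def crude_stem(tok):
    # Very rough stand-in for a Porter stemmer: strip common suffixes
    for suffix in ('ing', 'ed'):
        if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
            return tok[:-len(suffix)]
    return tok

def clean_message_sketch(text):
    text = re.sub(r'\$\w*', '', text)         # stock tickers like $GE
    text = re.sub(r'^RT\s+', '', text)        # retweet marker
    text = re.sub(r'https?://\S+', '', text)  # URLs
    text = re.sub(r'#', '', text)             # keep hashtag text, drop '#'
    text = re.sub(r'@\w+', '', text)          # handles
    cleaned = []
    for tok in text.lower().split():
        tok = tok.strip(string.punctuation)   # NB: unlike the real helper,
        if tok and tok not in STOP_WORDS:     # this also drops ':)'
            cleaned.append(crude_stem(tok))
    return cleaned

print(clean_message_sketch(
    "RT @Twitter @chapagain Hello There! Have a great day. :) "
    "#good #morning http://chapagain.com.np"))
# ['hello', 'great', 'day', 'good', 'morn']
```

The sketch reproduces most of the pipeline's effect; the differences (emoticon handling, full stop-word list, proper stemming) are exactly what the NLTK-backed helper adds.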
Constructing the Frequency Dictionary
To facilitate training, build a frequency map where keys are tuples of (stem, label) and values represent their occurrence counts across the corpus.
def build_frequency_map(freq_map, texts, targets):
    for target, text in zip(targets, texts):
        for stem in clean_message(text):
            key = (stem, target)
            if key in freq_map:
                freq_map[key] += 1
            else:
                freq_map[key] = 1
    return freq_map
Testing the helper function:
freq_map = {}
texts = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
targets = [1, 0, 0, 0, 0]
print(build_frequency_map(freq_map, texts, targets))
# Output: {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}
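fetch_frequency is also imported from utils without being shown. Presumably it reduces to a dictionary lookup that defaults to zero for unseen pairs; a minimal sketch under that assumption:

```python
def fetch_frequency(freq_map, stem, label):
    # Count of `stem` in documents of class `label`; 0 if never seen
    return freq_map.get((stem, label), 0)

freq_map = {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}
print(fetch_frequency(freq_map, 'tire', 0))   # 2
print(fetch_frequency(freq_map, 'happi', 0))  # 0 -- never seen as negative
```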
Training the Naive Bayes Classifier
Determine the prior probabilities for positive and negative classes:
\( P(D_{pos}) = \frac{D_{pos}}{D} \)
\( P(D_{neg}) = \frac{D_{neg}}{D} \)
Calculate the log prior, representing the inherent likelihood of a text being positive versus negative before evaluating specific tokens:
\( \text{prior\_log} = \log P(D_{pos}) - \log P(D_{neg}) = \log\frac{D_{pos}}{D_{neg}} \)
Compute the probability of a stem belonging to a specific class using Laplace smoothing to prevent zero probabilities:
\( P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V} \)
\( P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V} \)
Derive the log likelihood for each token:
\( \text{word\_log\_probs}[stem] = \log\left(\frac{P(W_{pos})}{P(W_{neg})}\right) \)
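Plugging the toy frequency map from the earlier test into these formulas shows the smoothing at work (numbers worked by hand for illustration):

```python
import numpy as np

freq_map = {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}

N_pos = sum(v for (_, lbl), v in freq_map.items() if lbl == 1)  # 1 positive token
N_neg = sum(v for (_, lbl), v in freq_map.items() if lbl == 0)  # 4 negative tokens
V = len({stem for stem, _ in freq_map})                         # 4 unique stems

# Laplace-smoothed probabilities for 'happi' (seen once positive, never negative)
p_pos = (1 + 1) / (N_pos + V)   # 2/5
p_neg = (0 + 1) / (N_neg + V)   # 1/8
log_likelihood = np.log(p_pos / p_neg)
print(log_likelihood)  # positive value: 'happi' signals positive sentiment
```

Without the +1 terms, p_neg would be zero and the log ratio undefined; smoothing keeps every stem's likelihood finite.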
def fit_bayes_classifier(freq_map, X_train, y_train):
    word_log_probs = {}

    # Vocabulary: unique stems across both classes
    stems = [k[0] for k in freq_map.keys()]
    vocab = set(stems)
    vocab_size = len(vocab)

    # Total token counts per class (N_pos, N_neg)
    total_pos_words = total_neg_words = 0
    for (stem, label), count in freq_map.items():
        if label > 0:
            total_pos_words += count
        else:
            total_neg_words += count

    # Log prior: log(D_pos) - log(D_neg), since the total D cancels
    total_docs = len(y_train)
    pos_docs = sum(1 for lbl in y_train if lbl > 0)
    neg_docs = total_docs - pos_docs
    prior_log = np.log(pos_docs) - np.log(neg_docs)

    # Laplace-smoothed log likelihood per stem
    for stem in vocab:
        pos_freq = fetch_frequency(freq_map, stem, 1)
        neg_freq = fetch_frequency(freq_map, stem, 0)
        prob_pos = (pos_freq + 1) / (total_pos_words + vocab_size)
        prob_neg = (neg_freq + 1) / (total_neg_words + vocab_size)
        word_log_probs[stem] = np.log(prob_pos / prob_neg)

    return prior_log, word_log_probs
Execute the training function to obtain the prior and likelihoods:
freq_map = build_frequency_map({}, X_train, y_train)
prior_log, word_log_probs = fit_bayes_classifier(freq_map, X_train, y_train)
print(prior_log)
print(len(word_log_probs))
Predicting Sentiment
Implement a prediction function that aggregates the log likelihoods of all processed tokens within a message, adding the log prior to determine the final sentiment score.
\( \text{score} = \text{prior\_log} + \sum \text{word\_log\_probs}[stem] \)
def predict_sentiment(text, prior_log, word_log_probs):
    stems = clean_message(text)
    score = prior_log
    for stem in stems:
        if stem in word_log_probs:
            score += word_log_probs[stem]
    return score
Evaluate a sample message:
custom_msg = 'She smiled.'
print(predict_sentiment(custom_msg, prior_log, word_log_probs))
# Output: 1.5626795809988954
Evaluating Model Accuracy
Assess the classifier by comparing predicted labels against the actual test targets.
def evaluate_accuracy(X_test, y_test, prior_log, word_log_probs):
    predictions = []
    for text in X_test:
        pred = 1 if predict_sentiment(text, prior_log, word_log_probs) > 0 else 0
        predictions.append(pred)
    error = np.mean(np.abs(np.array(predictions) - y_test))
    return 1.0 - error
print("Classifier accuracy = %0.4f" % (evaluate_accuracy(X_test, y_test, prior_log, word_log_probs)))
# Output: Classifier accuracy = 0.9950
Analyzing Word Ratios
Quantify the sentiment polarity of individual stems by calculating their positive-to-negative frequency ratio.
\( \text{ratio} = \frac{\text{pos\_count} + 1}{\text{neg\_count} + 1} \)
def calculate_sentiment_ratio(freq_map, stem):
    pos_count = fetch_frequency(freq_map, stem, 1)
    neg_count = fetch_frequency(freq_map, stem, 0)
    ratio = (pos_count + 1) / (neg_count + 1)
    return {'positive': pos_count, 'negative': neg_count, 'ratio': ratio}
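A couple of hand-computed values illustrate why the add-one terms matter (the counts below are made up for illustration):

```python
def sentiment_ratio(pos_count, neg_count):
    # Add-one smoothing keeps the ratio finite and nonzero even when a
    # stem appears in only one of the two classes
    return (pos_count + 1) / (neg_count + 1)

print(sentiment_ratio(10, 1))   # 5.5    -> strongly positive stem
print(sentiment_ratio(0, 20))   # ~0.048 -> strongly negative stem
```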
Filter stems based on a specified sentiment ratio threshold. A label of 1 extracts stems with ratios greater than or equal to the threshold, whereas a label of 0 identifies stems with ratios less than or equal to it.
def filter_words_by_threshold(freq_map, label, threshold):
    filtered_words = {}
    # Iterate over unique stems so each is evaluated only once
    for stem in {key[0] for key in freq_map.keys()}:
        stats = calculate_sentiment_ratio(freq_map, stem)
        if label == 1 and stats['ratio'] >= threshold:
            filtered_words[stem] = stats
        elif label == 0 and stats['ratio'] <= threshold:
            filtered_words[stem] = stats
    return filtered_words
print(filter_words_by_threshold(freq_map, label=0, threshold=0.05))
Error Analysis
Review misclassified instances to understand the limitations and assumptions inherent in the Naive Bayes model.
print('Truth Predicted Text')
for text, target in zip(X_test, y_test):
    pred_score = predict_sentiment(text, prior_log, word_log_probs)
    pred_label = 1 if pred_score > 0 else 0
    if target != pred_label:
        print(f'{target}\t{pred_label}\t{" ".join(clean_message(text))}')
Custom Tweet Evaluation
custom_msg = 'I am happy because I am learning :)'
print(predict_sentiment(custom_msg, prior_log, word_log_probs))