
Implementing Naive Bayes for Tweet Sentiment Classification


Data Preparation and Loading

Import the required libraries and load the Twitter dataset. Make sure the NLTK twitter_samples corpus and the stop word list are downloaded before running the code.

from utils import clean_message, fetch_frequency
import numpy as np
from nltk.corpus import stopwords, twitter_samples
import string
from nltk.tokenize import TweetTokenizer
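
If the corpora are not already present on the machine, a one-time download fetches them (standard nltk.download calls):

import nltk

nltk.download('twitter_samples')  # the labeled tweet dataset used below
nltk.download('stopwords')        # English stop word list used during cleaning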

Split the dataset into training and test subsets: the first 4,000 tweets of each class for training, the remainder for testing.

pos_messages = twitter_samples.strings('positive_tweets.json')
neg_messages = twitter_samples.strings('negative_tweets.json')

train_pos = pos_messages[:4000]
test_pos = pos_messages[4000:]
train_neg = neg_messages[:4000]
test_neg = neg_messages[4000:]

X_train = train_pos + train_neg
X_test = test_pos + test_neg

y_train = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
y_test = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))
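
As a sanity check: the NLTK twitter_samples corpus ships 5,000 tweets per class, so the split above should yield 8,000 training and 2,000 test messages.

print(len(X_train), len(X_test))  # 8000 2000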

Text Processing

Cleaning the text is critical for providing useful input to the model. Noise reduction removes stop words, stock tickers, retweet markers, and URLs, as they carry little sentiment value; for hashtags, only the '#' symbol is stripped, since the hashtag word itself (e.g. 'good' from '#good') can be meaningful. Stripping punctuation ensures tokens like "happy", "happy!", and "happy?" are treated identically. Stemming consolidates related terms, mapping "motivation", "motivated", and "motivate" to the common stem "motiv". The clean_message utility performs all of these transformations.

sample_input = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"
print(clean_message(sample_input))
# Output: ['hello', 'great', 'day', ':)', 'good', 'morn']
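
clean_message itself lives in the accompanying utils module and is not reproduced in these notes. Below is a minimal sketch of the pipeline described above, assuming the usual TweetTokenizer/PorterStemmer approach; the regexes and structure are illustrative, not the module's actual code:

import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def clean_message(text):
    # Strip stock tickers ($GE), the leading retweet marker, URLs, and '#' signs
    text = re.sub(r'\$\w*', '', text)
    text = re.sub(r'^RT[\s]+', '', text)
    text = re.sub(r'https?://\S+', '', text)
    text = re.sub(r'#', '', text)
    # Lowercase, drop @handles, and shorten elongated words while tokenizing
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    stemmer = PorterStemmer()
    stop_words = stopwords.words('english')
    # Keep tokens that are neither stop words nor bare punctuation, then stem
    return [stemmer.stem(tok) for tok in tokenizer.tokenize(text)
            if tok not in stop_words and tok not in string.punctuation]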

Constructing the Frequency Dictionary

To facilitate training, build a frequency map where keys are tuples of (stem, label) and values represent their occurrence counts across the corpus.

def build_frequency_map(freq_map, texts, targets):
    # Count how often each (stem, label) pair occurs across the corpus
    for target, text in zip(targets, texts):
        for stem in clean_message(text):
            key = (stem, target)
            freq_map[key] = freq_map.get(key, 0) + 1
    return freq_map

Testing the helper function:

freq_map = {}
texts = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
targets = [1, 0, 0, 0, 0]
print(build_frequency_map(freq_map, texts, targets))
# Output: {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}

Training the Naive Bayes Classifier

Determine the prior probabilities for positive and negative classes:

\( P(D_{pos}) = \frac{D_{pos}}{D} \)

\( P(D_{neg}) = \frac{D_{neg}}{D} \)

Calculate the log prior, representing the inherent likelihood of a text being positive versus negative before evaluating specific tokens. Since \( D \) cancels in the ratio of the two priors, it reduces to a difference of document-count logs:

\( \text{prior\_log} = \log\frac{P(D_{pos})}{P(D_{neg})} = \log(D_{pos}) - \log(D_{neg}) \)
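
With the balanced split used here (4,000 training documents per class), the log prior works out to zero: \( \log(4000) - \log(4000) = 0 \).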

Compute the probability of a stem belonging to a specific class using Laplace smoothing to prevent zero probabilities:

\( P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V} \)

\( P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V} \)

Derive the log likelihood for each token:

\( \text{word\_log\_probs}[stem] = \log\left(\frac{P(W_{pos})}{P(W_{neg})}\right) \)

def fit_bayes_classifier(freq_map, X_train, y_train):
    word_log_probs = {}

    # The unique stems form the vocabulary; its size is V in the smoothing formula
    vocab = set(k[0] for k in freq_map.keys())
    vocab_size = len(vocab)

    # N_pos and N_neg: total word counts in each class
    total_pos_words = total_neg_words = 0
    for (stem, label), count in freq_map.items():
        if label > 0:
            total_pos_words += count
        else:
            total_neg_words += count

    # D_pos and D_neg: document counts per class
    pos_docs = sum(1 for lbl in y_train if lbl > 0)
    neg_docs = len(y_train) - pos_docs

    # Log prior: log(D_pos) - log(D_neg)
    prior_log = np.log(pos_docs) - np.log(neg_docs)

    # Laplace-smoothed class-conditional probabilities and their log ratio
    for stem in vocab:
        pos_freq = fetch_frequency(freq_map, stem, 1)
        neg_freq = fetch_frequency(freq_map, stem, 0)

        prob_pos = (pos_freq + 1) / (total_pos_words + vocab_size)
        prob_neg = (neg_freq + 1) / (total_neg_words + vocab_size)

        word_log_probs[stem] = np.log(prob_pos / prob_neg)

    return prior_log, word_log_probs

Execute the training function to obtain the prior and likelihoods:

freq_map = build_frequency_map({}, X_train, y_train)
prior_log, word_log_probs = fit_bayes_classifier(freq_map, X_train, y_train)
print(prior_log)            # 0.0 for this balanced 4000/4000 split
print(len(word_log_probs))  # size of the vocabulary

Predicting Sentiment

Implement a prediction function that aggregates the log likelihoods of all processed tokens within a message, adding the log prior to determine the final sentiment score.

\( \text{score} = \text{prior\_log} + \sum_{stem \in text} \text{word\_log\_probs}[stem] \)

def predict_sentiment(text, prior_log, word_log_probs):
    stems = clean_message(text)
    score = prior_log
    for stem in stems:
        # Stems unseen during training contribute nothing to the score
        if stem in word_log_probs:
            score += word_log_probs[stem]
    return score

Evaluate a sample message; a score above zero indicates positive sentiment:

custom_msg = 'She smiled.'
print(predict_sentiment(custom_msg, prior_log, word_log_probs))
# Output: 1.5626795809988954

Evaluating Model Accuracy

Assess the classifier by comparing predicted labels against the actual test targets.

def evaluate_accuracy(X_test, y_test, prior_log, word_log_probs):
    predictions = []
    for text in X_test:
        # Classify as positive whenever the score exceeds zero
        pred = 1 if predict_sentiment(text, prior_log, word_log_probs) > 0 else 0
        predictions.append(pred)

    # The error rate is the fraction of mismatched labels; accuracy is its complement
    error = np.mean(np.abs(np.array(predictions) - y_test))
    return 1.0 - error

print("Classifier accuracy = %0.4f" % evaluate_accuracy(X_test, y_test, prior_log, word_log_probs))
# Output: Classifier accuracy = 0.9950

Analyzing Word Ratios

Quantify the sentiment polarity of individual stems by calculating their positive-to-negative frequency ratio.

\( \text{ratio} = \frac{\text{pos\_count} + 1}{\text{neg\_count} + 1} \)

def calculate_sentiment_ratio(freq_map, stem):
    pos_count = fetch_frequency(freq_map, stem, 1)
    neg_count = fetch_frequency(freq_map, stem, 0)
    # Add-one smoothing keeps the ratio defined when either count is zero
    ratio = (pos_count + 1) / (neg_count + 1)
    return {'positive': pos_count, 'negative': neg_count, 'ratio': ratio}

Filter stems based on a specified sentiment ratio threshold. A label of 1 extracts stems with ratios greater than or equal to the threshold, whereas a label of 0 identifies stems with ratios less than or equal to it.

def filter_words_by_threshold(freq_map, label, threshold):
    filtered_words = {}
    # Iterate over unique stems; freq_map keys are (stem, label) pairs
    for stem in {key[0] for key in freq_map.keys()}:
        stats = calculate_sentiment_ratio(freq_map, stem)
        if label == 1 and stats['ratio'] >= threshold:
            filtered_words[stem] = stats
        elif label == 0 and stats['ratio'] <= threshold:
            filtered_words[stem] = stats
    return filtered_words

print(filter_words_by_threshold(freq_map, label=0, threshold=0.05))
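
The same helper works in the positive direction. For example, a threshold of 10 (chosen here purely for illustration) surfaces stems roughly ten times more frequent in positive tweets than in negative ones:

print(filter_words_by_threshold(freq_map, label=1, threshold=10))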

Error Analysis

Review misclassified instances to understand the limitations and assumptions inherent in the Naive Bayes model.

print('Truth Predicted Text')
for text, target in zip(X_test, y_test):
    pred_score = predict_sentiment(text, prior_log, word_log_probs)
    pred_label = 1 if pred_score > 0 else 0
    if target != pred_label:
        print(f'{target}\t{pred_label}\t{" ".join(clean_message(text))}')

Custom Tweet Evaluation

Finally, score a new message of your own; as before, a positive result indicates positive sentiment:

custom_msg = 'I am happy because I am learning :)'
print(predict_sentiment(custom_msg, prior_log, word_log_probs))
