Implementing Naive Bayes for Tweet Sentiment Classification
Data Preparation and Loading
Import the necessary libraries and load the Twitter dataset. Make sure the NLTK Twitter corpus and stop-word list are downloaded before running the code.
from utils import clean_message, fetch_frequency
import numpy as np
from nltk.corpus import stopwords, twitter_samples
import string
from nltk.tokenize import TweetTokenizer
Split the dataset into training and test subsets: the first 4,000 tweets of each class go to training, and the remainder to testing.
pos_messages = twitter_samples.strings('positive_tweets.json')
neg_messages = twitter_samples.strings('negative_tweets.json')
train_pos = pos_messages[:4000]
test_pos = pos_messages[4000:]
train_neg = neg_messages[:4000]
test_neg = neg_messages[4000:]
X_train = train_pos + train_neg
X_test = test_pos + test_neg
y_train = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
y_test = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))
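The label construction above can be sanity-checked on a miniature, made-up split (the tweet strings here are hypothetical placeholders, not dataset tweets):

```python
import numpy as np

# Hypothetical miniature split (placeholder strings for illustration)
train_pos = ['great day :)', 'so happy']
train_neg = ['feeling sad', 'this is awful', 'worst ever']

X_train = train_pos + train_neg
y_train = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))

# Positive examples come first, so the label vector is 1s followed by 0s
print(y_train)       # [1. 1. 0. 0. 0.]
print(len(X_train))  # 5
```

Because the positive examples are concatenated first, the ones in the label vector line up with them automatically.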
Text Processing
Cleaning the text is critical for providing useful input to the model. Noise reduction removes stop words, stock tickers, retweet markers, URLs, and hashtag symbols, since they carry little sentiment signal. Stripping punctuation ensures tokens like "happy", "happy!", and "happy?" are treated identically. Stemming consolidates related terms, mapping "motivation", "motivated", and "motivate" to the common root "motiv". The clean_message utility performs all of these transformations.
sample_input = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"
print(clean_message(sample_input))
# Output: ['hello', 'great', 'day', ':)', 'good', 'morn']
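clean_message lives in utils and is not shown here; a simplified stand-in conveys the idea. Everything below (the tiny stop-word set, the regexes, the crude suffix stemmer) is an illustrative approximation, not the real helper, which uses NLTK's TweetTokenizer, stop-word corpus, and PorterStemmer, and also preserves emoticons such as ':)':

```python
import re
import string

# Illustrative only: a tiny stop-word set instead of NLTK's full list
STOP_WORDS = {'i', 'am', 'a', 'an', 'the', 'is', 'have', 'there'}

def crude_stem(tok):
    # Very rough stand-in for a Porter stemmer: strip common suffixes
    for suffix in ('ing', 'ed'):
        if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
            return tok[:-len(suffix)]
    return tok

def clean_message_sketch(text):
    text = re.sub(r'\$\w*', '', text)         # stock tickers like $GE
    text = re.sub(r'^RT\s+', '', text)        # retweet marker
    text = re.sub(r'https?://\S+', '', text)  # URLs
    text = re.sub(r'#', '', text)             # keep hashtag text, drop '#'
    text = re.sub(r'@\w+', '', text)          # handles
    cleaned = []
    for tok in text.lower().split():
        tok = tok.strip(string.punctuation)   # NB: unlike the real helper,
        if tok and tok not in STOP_WORDS:     # this also drops ':)'
            cleaned.append(crude_stem(tok))
    return cleaned

print(clean_message_sketch(
    "RT @Twitter @chapagain Hello There! Have a great day. :) "
    "#good #morning http://chapagain.com.np"))
# ['hello', 'great', 'day', 'good', 'morn']
```

The sketch reproduces most of the pipeline's effect; the differences (emoticon handling, full stop-word list, proper stemming) are exactly what the NLTK-backed helper adds.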
Constructing the Frequency Dictionary
To facilitate training, build a frequency map where keys are tuples of (stem, label) and values represent their occurrence counts across the corpus.
def build_frequency_map(freq_map, texts, targets):
    for target, text in zip(targets, texts):
        for stem in clean_message(text):
            key = (stem, target)
            if key in freq_map:
                freq_map[key] += 1
            else:
                freq_map[key] = 1
    return freq_map
Testing the helper function:
freq_map = {}
texts = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
targets = [1, 0, 0, 0, 0]
print(build_frequency_map(freq_map, texts, targets))
# Output: {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}
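fetch_frequency is also imported from utils without being shown. Presumably it reduces to a dictionary lookup that defaults to zero for unseen pairs; a minimal sketch under that assumption:

```python
def fetch_frequency(freq_map, stem, label):
    # Count of `stem` in documents of class `label`; 0 if never seen
    return freq_map.get((stem, label), 0)

freq_map = {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}
print(fetch_frequency(freq_map, 'tire', 0))   # 2
print(fetch_frequency(freq_map, 'happi', 0))  # 0 -- never seen as negative
```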
Training the Naive Bayes Classifier
Determine the prior probabilities for positive and negative classes:
\( P(D_{pos}) = \frac{D_{pos}}{D} \)
\( P(D_{neg}) = \frac{D_{neg}}{D} \)
Calculate the log prior, representing the inherent likelihood of a text being positive versus negative before evaluating specific tokens:
\( \text{prior\_log} = \log P(D_{pos}) - \log P(D_{neg}) = \log\frac{D_{pos}}{D_{neg}} \)
Compute the probability of a stem belonging to a specific class using Laplace smoothing to prevent zero probabilities:
\( P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V} \)
\( P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V} \)
Derive the log likelihood for each token:
\( \text{word\_log\_probs}[stem] = \log\left(\frac{P(W_{pos})}{P(W_{neg})}\right) \)
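Plugging the toy frequency map from the earlier test into these formulas shows the smoothing at work (numbers worked by hand for illustration):

```python
import numpy as np

freq_map = {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}

N_pos = sum(v for (_, lbl), v in freq_map.items() if lbl == 1)  # 1 positive token
N_neg = sum(v for (_, lbl), v in freq_map.items() if lbl == 0)  # 4 negative tokens
V = len({stem for stem, _ in freq_map})                         # 4 unique stems

# Laplace-smoothed probabilities for 'happi' (seen once positive, never negative)
p_pos = (1 + 1) / (N_pos + V)   # 2/5
p_neg = (0 + 1) / (N_neg + V)   # 1/8
log_likelihood = np.log(p_pos / p_neg)
print(log_likelihood)  # positive value: 'happi' signals positive sentiment
```

Without the +1 terms, p_neg would be zero and the log ratio undefined; smoothing keeps every stem's likelihood finite.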
def fit_bayes_classifier(freq_map, X_train, y_train):
    word_log_probs = {}

    # Vocabulary: unique stems across both classes
    stems = [k[0] for k in freq_map.keys()]
    vocab = set(stems)
    vocab_size = len(vocab)

    # Total token counts per class (N_pos, N_neg)
    total_pos_words = total_neg_words = 0
    for (stem, label), count in freq_map.items():
        if label > 0:
            total_pos_words += count
        else:
            total_neg_words += count

    # Log prior: log(D_pos) - log(D_neg), since the total D cancels
    total_docs = len(y_train)
    pos_docs = sum(1 for lbl in y_train if lbl > 0)
    neg_docs = total_docs - pos_docs
    prior_log = np.log(pos_docs) - np.log(neg_docs)

    # Laplace-smoothed log likelihood per stem
    for stem in vocab:
        pos_freq = fetch_frequency(freq_map, stem, 1)
        neg_freq = fetch_frequency(freq_map, stem, 0)
        prob_pos = (pos_freq + 1) / (total_pos_words + vocab_size)
        prob_neg = (neg_freq + 1) / (total_neg_words + vocab_size)
        word_log_probs[stem] = np.log(prob_pos / prob_neg)

    return prior_log, word_log_probs
Execute the training function to obtain the prior and likelihoods:
freq_map = build_frequency_map({}, X_train, y_train)
prior_log, word_log_probs = fit_bayes_classifier(freq_map, X_train, y_train)
print(prior_log)
print(len(word_log_probs))
Predicting Sentiment
Implement a prediction function that aggregates the log likelihoods of all processed tokens within a message, adding the log prior to determine the final sentiment score.
\( \text{score} = \text{prior\_log} + \sum \text{word\_log\_probs}[stem] \)
def predict_sentiment(text, prior_log, word_log_probs):
    stems = clean_message(text)
    score = prior_log
    for stem in stems:
        if stem in word_log_probs:
            score += word_log_probs[stem]
    return score
Evaluate a sample message:
custom_msg = 'She smiled.'
print(predict_sentiment(custom_msg, prior_log, word_log_probs))
# Output: 1.5626795809988954
Evaluating Model Accuracy
Assess the classifier by comparing predicted labels against the actual test targets.
def evaluate_accuracy(X_test, y_test, prior_log, word_log_probs):
    predictions = []
    for text in X_test:
        pred = 1 if predict_sentiment(text, prior_log, word_log_probs) > 0 else 0
        predictions.append(pred)
    error = np.mean(np.abs(np.array(predictions) - y_test))
    return 1.0 - error
print("Classifier accuracy = %0.4f" % (evaluate_accuracy(X_test, y_test, prior_log, word_log_probs)))
# Output: Classifier accuracy = 0.9950
Analyzing Word Ratios
Quantify the sentiment polarity of individual stems by calculating their positive-to-negative frequency ratio.
\( \text{ratio} = \frac{\text{pos\_count} + 1}{\text{neg\_count} + 1} \)
def calculate_sentiment_ratio(freq_map, stem):
    pos_count = fetch_frequency(freq_map, stem, 1)
    neg_count = fetch_frequency(freq_map, stem, 0)
    ratio = (pos_count + 1) / (neg_count + 1)
    return {'positive': pos_count, 'negative': neg_count, 'ratio': ratio}
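A couple of hand-computed values illustrate why the add-one terms matter (the counts below are made up for illustration):

```python
def sentiment_ratio(pos_count, neg_count):
    # Add-one smoothing keeps the ratio finite and nonzero even when a
    # stem appears in only one of the two classes
    return (pos_count + 1) / (neg_count + 1)

print(sentiment_ratio(10, 1))   # 5.5    -> strongly positive stem
print(sentiment_ratio(0, 20))   # ~0.048 -> strongly negative stem
```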
Filter stems based on a specified sentiment ratio threshold. A label of 1 extracts stems with ratios greater than or equal to the threshold, whereas a label of 0 identifies stems with ratios less than or equal to it.
def filter_words_by_threshold(freq_map, label, threshold):
    filtered_words = {}
    # Iterate over unique stems so each is evaluated only once
    for stem in {key[0] for key in freq_map.keys()}:
        stats = calculate_sentiment_ratio(freq_map, stem)
        if label == 1 and stats['ratio'] >= threshold:
            filtered_words[stem] = stats
        elif label == 0 and stats['ratio'] <= threshold:
            filtered_words[stem] = stats
    return filtered_words
print(filter_words_by_threshold(freq_map, label=0, threshold=0.05))
Error Analysis
Review misclassified instances to understand the limitations and assumptions inherent in the Naive Bayes model.
print('Truth Predicted Text')
for text, target in zip(X_test, y_test):
    pred_score = predict_sentiment(text, prior_log, word_log_probs)
    pred_label = 1 if pred_score > 0 else 0
    if target != pred_label:
        print(f'{target}\t{pred_label}\t{" ".join(clean_message(text))}')
Custom Tweet Evaluation
custom_msg = 'I am happy because I am learning :)'
print(predict_sentiment(custom_msg, prior_log, word_log_probs))