Fading Coder

One Final Commit for the Last Sprint


Naive Bayes for Regional Tendency Extraction from Personal Ads


Advertisers often want demographic information about individuals so they can better target their advertisements. We will take personal ads posted by people in two US cities and analyze the text to see whether the advertising language differs between the cities. If it does, we want to know which words are common in each city and what their word usage reveals about the interests of people in different cities.

Collect Data: Import RSS Feeds

Use Python to download the text. Browse the documentation at http://code.google.com/p/feedparser/ and install feedparser. First, unzip the downloaded package and change the current directory to the folder containing the unzipped files. Then, at the command prompt, enter:

# python setup.py install

Create a file named bayes.py and add the following code:

from numpy import array, log, ones
import random

# Create a list of unique words that appear in all documents
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)

# Convert a document into a bag-of-words vector: each slot counts how many
# times the corresponding vocabulary word occurs
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec

# Naive Bayes classifier training function
def trainNBO(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Initialize counts to one (Laplace smoothing) to avoid zero probabilities
    p0Num = ones(numWords)
    p1Num = ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Use logarithms to avoid numerical underflow
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive

# Naive Bayes classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0

# Text parsing: split on runs of non-word characters, drop short tokens
def textParse(bigString):
    import re
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
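The pieces above can be exercised end to end on a toy corpus. The sketch below is a pure-Python rendering of the same arithmetic (the file's own functions use NumPy arrays); the documents, labels, and query strings are made up for illustration.

```python
import math
import re

def tokenize(text):
    # Same rule as textParse(): split on non-word runs, keep tokens longer
    # than two characters
    return [t.lower() for t in re.split(r'\W+', text) if len(t) > 2]

# A made-up four-document corpus; labels are invented for illustration
docs = ["Anyone hiking Marin this weekend",        # class 0
        "Veteran climber seeking belay partner",   # class 0
        "Meet for coffee in Manhattan tonight",    # class 1
        "New here, want to explore Brooklyn"]      # class 1
labels = [0, 0, 1, 1]

token_docs = [tokenize(d) for d in docs]
vocab = sorted({w for doc in token_docs for w in doc})
train = [[doc.count(w) for w in vocab] for doc in token_docs]

def train_nb(matrix, cats):
    # Laplace smoothing: counts start at 1, denominators at 2, exactly as
    # in trainNBO(); probabilities are kept in log space
    n_words = len(matrix[0])
    p_c1 = sum(cats) / float(len(cats))
    num = {c: [1.0] * n_words for c in (0, 1)}
    denom = {c: 2.0 for c in (0, 1)}
    for vec, c in zip(matrix, cats):
        num[c] = [a + b for a, b in zip(num[c], vec)]
        denom[c] += sum(vec)
    p0 = [math.log(x / denom[0]) for x in num[0]]
    p1 = [math.log(x / denom[1]) for x in num[1]]
    return p0, p1, p_c1

def classify(vec, p0, p1, p_c1):
    s1 = sum(x * p for x, p in zip(vec, p1)) + math.log(p_c1)
    s0 = sum(x * p for x, p in zip(vec, p0)) + math.log(1.0 - p_c1)
    return 1 if s1 > s0 else 0

p0, p1, p_c1 = train_nb(train, labels)

def predict(text):
    toks = tokenize(text)
    return classify([toks.count(w) for w in vocab], p0, p1, p_c1)

print(predict("coffee in Manhattan"))   # → 1
print(predict("hiking this weekend"))   # → 0
```

With only four training documents the smoothed counts dominate, but the two queries still land in the class whose documents share their words.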

Add the following code:

# RSS feed classifier and high frequency word removal function
def calcMostFreq(vocabList, fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)
    sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]

def localWords(feed1, feed0):
    import feedparser
    docList = []
    classList = []
    fullText = []
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLen):
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)
    top30Words = calcMostFreq(vocabList, fullText)
    for pairW in top30Words:
        if pairW[0] in vocabList:
            vocabList.remove(pairW[0])
    trainingSet = list(range(2 * minLen))
    testSet = []
    for i in range(20):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNBO(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is:', float(errorCount) / len(testSet))
    return vocabList, p0V, p1V
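Before wiring up live feeds, calcMostFreq() can be sanity-checked on its own: it simply ranks every vocabulary word by its count in the concatenated text. The same idea in a standalone form, using collections.Counter on invented tokens:

```python
from collections import Counter

# Made-up token stream and vocabulary for illustration
full_text = ['the', 'the', 'the', 'pizza', 'pizza', 'meetup', 'rare']
vocab_list = ['the', 'pizza', 'meetup', 'rare']

# Equivalent to calcMostFreq() with the top-30 cut shortened to top-2
counts = Counter(tok for tok in full_text if tok in set(vocab_list))
print(counts.most_common(2))  # → [('the', 3), ('pizza', 2)]
```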

The function localWords() takes two RSS feeds as parameters. The feeds are parsed outside the function because their contents change over time; re-fetching them yields new data.
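feedparser's result behaves like a nested dictionary, and localWords() only touches feed['entries'][i]['summary']. A hand-built stand-in with that shape (the text here is invented) is handy for offline experiments — note, though, that localWords() holds out 20 test documents, so real runs need feeds with enough entries:

```python
# Hypothetical stand-ins mimicking feedparser.parse() output structure
fake_ny = {'entries': [
    {'summary': 'Looking for someone to meet for coffee in Manhattan'},
    {'summary': 'New to the city and want to explore Brooklyn'},
]}
fake_sf = {'entries': [
    {'summary': 'Anyone up for a hike in Marin this weekend'},
]}

# localWords() pairs entries up to the shorter feed's length
minLen = min(len(fake_ny['entries']), len(fake_sf['entries']))
print(minLen)                                # → 1
print(fake_sf['entries'][0]['summary'][:6])  # → Anyone
```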

>>> from importlib import reload
>>> reload(bayes)
<module 'bayes' from 'bayes.py'>
>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
>>> vocabList, pSF, pNY = bayes.localWords(ny, sf)
the error rate is: 0.2
>>> vocabList, pSF, pNY = bayes.localWords(ny, sf)
the error rate is: 0.3
>>> vocabList, pSF, pNY = bayes.localWords(ny, sf)
the error rate is: 0.55

To get an accurate estimate of the error rate, repeat the experiment multiple times and take the average.
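Since the train/test split inside localWords() is random, a single run's error rate is noisy; averaging over several runs smooths it out. A sketch of the averaging step, with a hypothetical stand-in in place of bayes.localWords():

```python
import random

def average_error(run_once, trials=10):
    # run_once() returns one hold-out error rate; average over many splits
    return sum(run_once() for _ in range(trials)) / trials

random.seed(7)
# Stand-in for a real run: pretend each split yields a rate near 0.3
avg = average_error(lambda: 0.3 + random.uniform(-0.1, 0.1))
print(0.2 <= avg <= 0.4)  # → True
```

In practice, run_once would wrap a call such as `lambda: bayes.localWords(ny, sf)` modified to return the error rate.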

Analyze Data: Display Regionally Specific Words

First, sort the vectors pSF and pNY, then print them in order. Add the following code to the file:

# Function to display the most representative words
def getTopWords(ny, sf):
    import operator
    vocabList, p0V, p1V = localWords(ny, sf)
    topNY = []
    topSF = []
    for i in range(len(p0V)):
        if p0V[i] > -6.0:
            topSF.append((vocabList[i], p0V[i]))
        if p1V[i] > -6.0:
            topNY.append((vocabList[i], p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print("SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**")
    for item in sortedSF:
        print(item[0])
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print("NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**")
    for item in sortedNY:
        print(item[0])

The function getTopWords() takes two RSS feeds as input, trains and tests the Naive Bayes classifier, and obtains the conditional probability vectors. It then fills two lists with (word, probability) tuples, keeping every word whose log-probability exceeds a fixed threshold rather than just the top N words, and prints the words sorted by their conditional probabilities.
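The threshold-and-sort step can be tried in isolation. The (word, log-probability) pairs below are invented; -6.0 is the same cutoff used in getTopWords():

```python
import math

# Hypothetical (word, log-probability) pairs for illustration
pairs = [('pizza', math.log(0.04)), ('the', math.log(0.09)),
         ('rare', math.log(0.001)), ('meetup', math.log(0.02))]

threshold = -6.0
top = sorted((p for p in pairs if p[1] > threshold),
             key=lambda pair: pair[1], reverse=True)
print([word for word, _ in top])  # → ['the', 'pizza', 'meetup']
```

'rare' is dropped because log(0.001) ≈ -6.9 falls below the cutoff; the rest print in order of decreasing probability.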

Save the bayes.py file and at the Python prompt enter:

>>> reload(bayes)
<module 'bayes' from 'bayes.pyc'>
>>> bayes.getTopWords(ny, sf)
the error rate is: 0.55
SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**
how
last
man
...
veteran
still
ends
late
off
own
know
NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**
someone
meet
...
apparel
recalled
starting
strings

When the three lines of code that remove high-frequency words are commented out, the error rate is 54%; with those lines retained, it is 70%. Observation shows that the top 30 most frequent words account for roughly 30% of all word occurrences, while vocabList contains about 3,000 words — a small fraction of the vocabulary makes up a large share of the text. This is because natural language carries a great deal of redundancy and structural filler. A common further step is to remove not only high-frequency words but also the function words on a fixed list, known as a stop-word list.

The final output shows many stop words. Removing these fixed stop words might improve the results and reduce the classification error rate.
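Stop-word filtering is a one-line list comprehension once a list is in hand. A minimal sketch — the short set here is invented for illustration; real stop-word lists (e.g. NLTK's) contain hundreds of entries:

```python
# Hypothetical miniature stop-word set; real lists are far longer
STOP_WORDS = {'the', 'and', 'for', 'you', 'with', 'how', 'off', 'own'}

def remove_stop_words(tokens):
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(remove_stop_words(['how', 'veteran', 'still', 'off', 'own', 'know']))
# → ['veteran', 'still', 'know']
```

Applied inside localWords() right after createVocabList(), this would strip the function words that currently dominate the output.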
