Naive Bayes for Regional Tendency Extraction from Personal Ads
Advertisers often want specific demographic information about individuals so they can better target their advertisements. We will select personal ads posted by people in two US cities and compare them to see whether the advertising language differs between the two cities. If it does, we want to know which words are commonly used in each city and what can be learned about the interests of people in different cities from their word usage.
Collect Data: Import RSS Feeds
Use Python to download the text. Browse the documentation at http://code.google.com/p/feedparser/ and install feedparser. First, unzip the downloaded package and change the current directory to the folder containing the unzipped files. Then, at a command prompt, enter:
# python setup.py install
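If pip is available on your system, installing with pip should also work as an alternative to the setup.py route above:
# pip install feedparser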
Create a file named bayes.py and add the following code:
from numpy import *   # supplies ones(), log(), array(), and random.uniform() used below

# Create a list of unique words that appear across all documents
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)   # union of the two sets
    return list(vocabSet)

# Bag-of-words model: count every occurrence of each vocabulary word
def bagOfWords2VecMN(vocabList, inputSet):
    returnVec = [0] * len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1
    return returnVec
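As a quick illustration of the bag-of-words model, here is an interactive check with an invented document list (assuming bayes.py has already been imported as bayes):
>>> posts = [['dog', 'park', 'dog'], ['cat', 'sofa']]
>>> vocab = bayes.createVocabList(posts)
>>> bayes.bagOfWords2VecMN(vocab, posts[0])   # a 2 in the 'dog' slot, a 1 for 'park', 0 elsewhere (slot order depends on the set)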
# Naive Bayes classifier training function
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    # Initialize counts to ones (Laplace smoothing) to avoid zero probabilities
    p0Num = ones(numWords)
    p1Num = ones(numWords)
    p0Denom = 2.0
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    # Take logarithms to avoid numerical underflow when many small probabilities are multiplied
    p1Vect = log(p1Num / p1Denom)
    p0Vect = log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive
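As a quick sanity check of the training function, here is an illustrative session with made-up count vectors (the two "documents" and their labels are invented for demonstration):
>>> from numpy import array
>>> trainMat = array([[1, 0, 2], [0, 1, 0]])    # two tiny documents over a 3-word vocabulary
>>> p0V, p1V, pAb = bayes.trainNB0(trainMat, array([0, 1]))
>>> pAb                                         # one of the two documents belongs to class 1
0.5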
# Naive Bayes classification function
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    # Element-wise product summed: log P(w|c) over all words, plus log P(c)
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else:
        return 0
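Continuing the toy session above, classification compares the two log-posterior scores:
>>> vec = array([1, 0, 1])                  # bag-of-words vector for a new document
>>> bayes.classifyNB(vec, p0V, p1V, pAb)    # class 0 wins: its log-posterior is larger
0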
# Text parsing: split on non-word characters and keep tokens longer than two characters
def textParse(bigString):
    import re
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]
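For example, a short made-up ad parses as follows (short tokens like "a", "in", and "SF" are dropped):
>>> bayes.textParse('Looking for a hiking buddy in SF! Must love dogs.')
['looking', 'for', 'hiking', 'buddy', 'must', 'love', 'dogs']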
Add the following code to bayes.py:
# Return the 30 most frequently occurring words in the text
def calcMostFreq(vocabList, fullText):
    import operator
    freqDict = {}
    for token in vocabList:
        freqDict[token] = fullText.count(token)   # count occurrences of each vocabulary word
    sortedFreq = sorted(freqDict.iteritems(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]
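Counting with fullText.count() inside a loop scans the full text once per vocabulary word. If you prefer to avoid that, here is a sketch of an equivalent version using collections.Counter (available since Python 2.7), which tallies every token in a single pass:

from collections import Counter

def calcMostFreq(vocabList, fullText):
    # Tally all tokens once, then keep only words that are in the vocabulary
    counts = Counter(fullText)
    freq = [(token, counts[token]) for token in vocabList]
    return sorted(freq, key=lambda pair: pair[1], reverse=True)[:30]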
# RSS feed classifier: train on two feeds and report the test error rate
def localWords(feed1, feed0):
    docList = []
    classList = []
    fullText = []
    minLen = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLen):
        # Alternate one entry from each feed
        wordList = textParse(feed1['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)   # class 1: feed1 (NY)
        wordList = textParse(feed0['entries'][i]['summary'])
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)   # class 0: feed0 (SF)
    vocabList = createVocabList(docList)
    # Remove the 30 most frequently occurring words
    top30Words = calcMostFreq(vocabList, fullText)
    for pairW in top30Words:
        if pairW[0] in vocabList:
            vocabList.remove(pairW[0])
    # Hold out 20 randomly chosen documents as the test set
    trainingSet = range(2 * minLen)
    testSet = []
    for i in range(20):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print 'the error rate is:', float(errorCount) / len(testSet)
    return vocabList, p0V, p1V
The function localWords() takes two parsed RSS feeds as arguments. The feeds are downloaded outside the function because their content changes over time; re-fetching them later yields new data.
>>> reload(bayes)
<module 'bayes' from 'bayes.pyc'>
>>> import feedparser
>>> ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
>>> sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
>>> vocabList, pSF, pNY = bayes.localWords(ny, sf)
the error rate is: 0.2
>>> vocabList, pSF, pNY = bayes.localWords(ny, sf)
the error rate is: 0.3
>>> vocabList, pSF, pNY = bayes.localWords(ny, sf)
the error rate is: 0.55
To get an accurate estimate of the error rate, repeat the experiment multiple times and take the average.
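A minimal sketch of such an averaging loop, assuming localWords() has been modified to also return its error rate (the version above only prints it):

def multiTest(feed1, feed0, numTrials=10):
    # Average the error rate over several random train/test splits;
    # assumes localWords() returns the error rate as a fourth value
    totalError = 0.0
    for i in range(numTrials):
        vocabList, p0V, p1V, errorRate = localWords(feed1, feed0)
        totalError += errorRate
    print 'average error rate:', totalError / numTrials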
Analyze Data: Display Regionally Specific Words
First, we sort the probability vectors pSF and pNY, then print the words in order. Add the following code to the file:
# Display the most characteristic words for each city
def getTopWords(ny, sf):
    vocabList, p0V, p1V = localWords(ny, sf)
    topNY = []
    topSF = []
    # Keep every word whose conditional log-probability exceeds the threshold
    for i in range(len(p0V)):
        if p0V[i] > -6.0:
            topSF.append((vocabList[i], p0V[i]))
        if p1V[i] > -6.0:
            topNY.append((vocabList[i], p1V[i]))
    sortedSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print "SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**"
    for item in sortedSF:
        print item[0]
    sortedNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
    for item in sortedNY:
        print item[0]
The function getTopWords() takes two RSS feeds as input, trains and tests the Naive Bayes classifier, and obtains the conditional probability vectors. It then builds two lists of (word, log-probability) tuples, keeping every word whose value exceeds a fixed threshold (-6.0) rather than just the top X words, and sorts the tuples by their conditional probabilities.
Save the bayes.py file and at the Python prompt enter:
>>> reload(bayes)
<module 'bayes' from 'bayes.pyc'>
>>> bayes.getTopWords(ny, sf)
the error rate is: 0.55
SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**
how
last
man
...
veteran
still
ends
late
off
own
know
NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**
someone
meet
...
apparel
recalled
starting
strings
When the three lines of code that remove the high-frequency words are commented out, the error rate is 54%; with those lines retained, it is 70%. Observation shows that the top 30 most frequent words account for roughly 30% of all word usage, while vocabList contains about 3,000 words; in other words, a small fraction of the vocabulary makes up a large share of the text. The reason is that natural language is highly redundant and full of structural, auxiliary content. A common refinement is therefore to remove not only the high-frequency words but also the structural auxiliary words on a fixed list, known as a stop word list.
The final output indeed contains many stop words. Removing these fixed stop words would likely improve the results and reduce the classification error rate.
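A minimal sketch of such filtering, using a small, made-up stop word list (a real application would load a fuller list from a file):

# Hypothetical stop word filtering; this word list is only a tiny illustrative sample
stopWords = set(['the', 'and', 'for', 'you', 'are', 'with', 'that', 'this'])

def textParseNoStop(bigString):
    import re
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens
            if len(tok) > 2 and tok.lower() not in stopWords]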