Scraping NetEase Cloud Music Hot Comments to Generate Word Clouds
Data Collection
Building a word cloud requires raw data first. For NetEase Cloud Music, this involves several steps:
- Packet analysis to locate the API endpoint
- Handling encrypted request parameters
- Extracting hot comment content
Packet Analysis
In the Network tab of Chrome DevTools, the comment API endpoint becomes visible. The request uses the POST method and carries encrypted form parameters along with the usual browser headers.
Handling Encrypted Parameters
The NetEase Cloud Music API requires two encrypted form fields: params and encSecKey. These values can be copied from a request captured in the browser and reused across different song IDs. For details of the encryption mechanism, see the NetEase Cloud Music API analysis projects available on GitHub.
Extracting Hot Comments
Once the endpoint is identified, a request to it returns JSON data containing comment objects. Parse the JSON and extract the content field from each entry in the hotComments list.
import requests


def fetch_hot_comments(song_id):
    """Return the hot comment texts for a song as a list of strings."""
    api_url = f'http://music.163.com/weapi/v1/resource/comments/R_SO_4_{song_id}?csrf_token=test'
    # Encrypted form fields copied from a request captured in the browser.
    post_data = {
        'params': '4hmFbT9ZucQPTM8ly/UA60NYH1tpyzhHOx04qzjEh3hU1597xh7pBOjRILfbjNZHqzzGby5ExblBpOdDLJxOAk4hBVy5/XNwobA+JTFPiumSmVYBRFpizkWHgCGO+OWiuaNPVlmr9m8UI7tJv0+NJoLUy0D6jd+DnIgcVJlIQDmkvfHbQr/i9Sy+SNSt6Ltq',
        'encSecKey': 'a2c2e57baee7ca16598c9d027494f40fbd228f0288d48b304feec0c52497511e191f42dfc3e9040b9bb40a9857fa3f963c6a410b8a2a24eea02e66f3133fcb8dbfcb1d9a5d7ff1680c310a32f05db83ec920e64692a7803b2b5d7f99b14abf33cfa7edc3e57b1379648d25b3e4a9cab62c1b3a68a4d015abedcd1bb7e868b676'
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Referer': f'http://music.163.com/song?id={song_id}',
        'Host': 'music.163.com',
        'Origin': 'http://music.163.com'
    }
    response = requests.post(api_url, headers=headers, data=post_data, timeout=10)
    response.raise_for_status()  # fail early on HTTP errors
    result = response.json()
    # Each entry under 'hotComments' carries the comment text in 'content'.
    return [entry['content'] for entry in result.get('hotComments', [])]


if __name__ == '__main__':
    song_id = 439915614
    for comment in fetch_hot_comments(song_id):
        print(comment)
Running this script outputs the hot comments for the specified song.
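The parsing step can also be separated from the network call, which makes it possible to check offline. The sketch below is a minimal illustration rather than part of the original script: the helper name extract_hot_comments and the sample payload are hypothetical, and the payload only mirrors the shape the script relies on (a hotComments list whose entries carry a content field).

```python
import json


def extract_hot_comments(payload):
    """Return non-empty comment texts from a decoded API response.

    Hypothetical helper: assumes only the shape used by the scraper,
    i.e. a 'hotComments' list whose entries may carry a 'content' string.
    """
    comments = []
    for entry in payload.get('hotComments') or []:
        text = (entry.get('content') or '').strip()
        if text:  # skip entries with missing or blank content
            comments.append(text)
    return comments


# Made-up sample data for illustration, not a captured response.
sample = json.loads(
    '{"hotComments": [{"content": "first"}, {"content": "  "}, {"content": "second"}]}'
)
print(extract_hot_comments(sample))
```

Using .get with a fallback means a response without hot comments yields an empty list instead of raising a KeyError.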
Word Cloud Generation
The wordcloud library provides straightforward word cloud generation capabilities. Install it via pip and consult the official documentation for basic usage patterns.
Chinese text rendering requires specifying a font file that supports Chinese characters. The font_path parameter in the WordCloud constructor handles this:
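from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Assumes the scraping script above is saved as scraper.py.
from scraper import fetch_hot_comments

song_id = 439915614
# Join all hot comments into a single string for the word cloud.
text_content = " ".join(fetch_hot_comments(song_id))

cloud = WordCloud(
    random_state=1,
    # Font with Chinese glyphs; adjust the path for your system.
    font_path=r'C:/Users/Windows/fonts/simkai.ttf'
).generate(text_content)

# Display the rendered image without axes.
plt.figure()
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()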
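By default, WordCloud tokenizes text with a regular expression, so Chinese comments that are not segmented into words tend to appear as whole phrases. One option is to count terms yourself and hand the counts to the generate_from_frequencies method of WordCloud. The sketch below uses only the standard library. The names count_terms, tokenizer, and min_length are hypothetical, and whitespace splitting is merely a placeholder for a real Chinese segmenter (for example jieba.lcut, assuming that package is installed).

```python
from collections import Counter


def count_terms(comments, tokenizer=str.split, min_length=2):
    """Count how often each term occurs across a list of comments.

    `tokenizer` is any callable mapping a string to a list of tokens.
    The default splits on whitespace, which is only a stand-in: Chinese
    text needs a real word segmenter to be split into words.
    """
    counts = Counter()
    for comment in comments:
        for token in tokenizer(comment):
            token = token.strip()
            if len(token) >= min_length:  # drop single characters and blanks
                counts[token] += 1
    return counts


# Made-up comments for illustration.
frequencies = count_terms(['great song great memories', 'great voice'])
print(frequencies.most_common(3))
```

The resulting mapping of term to count could then be passed to generate_from_frequencies in place of calling generate on the joined text.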
Visual Results
The generated word cloud displays frequently occurring terms from the hot comments, providing an intuitive visualization of what resonates most with listeners.
Potential Enhancements
- Custom masks: Generate word clouds shaped like specific images or patterns
- Batch processing: Scrape comments from multiple songs by iterating through different song IDs
- Service extraction: Wrap the functionality into a REST API for serving word clouds on demand
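The batch-processing idea can be sketched as a loop over song IDs. This is an illustration under assumptions rather than a tested crawler: collect_comments, its fetch parameter, and the delay value are hypothetical. The fetch function is injected so that fetch_hot_comments from the scraper could be passed in, and a stub stands in for the network call here.

```python
import time


def collect_comments(song_ids, fetch, delay=1.0):
    """Fetch comments for several songs, pausing between requests.

    `fetch` is any callable taking a song ID and returning a list of
    strings. A failure for one song is recorded as an empty list so the
    remaining songs are still processed.
    """
    results = {}
    for index, song_id in enumerate(song_ids):
        if index > 0:
            time.sleep(delay)  # space out consecutive requests
        try:
            results[song_id] = fetch(song_id)
        except Exception as exc:  # broad on purpose for this sketch
            print(f'song {song_id} failed: {exc}')
            results[song_id] = []
    return results


def fake_fetch(song_id):
    """Stub standing in for the real network call (illustration only)."""
    if song_id == 2:
        raise RuntimeError('simulated network error')
    return [f'comment for {song_id}']


# The IDs 1, 2, 3 are placeholders, not real song IDs.
batch = collect_comments([1, 2, 3], fake_fetch, delay=0)
print(batch)
```

The per-song lists can be joined into one text for a combined word cloud or rendered separately, one cloud per song.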