Building Your First Web Scraper with Node.js
Introduction
Generally speaking, Python is the preferred choice for web scraping thanks to its simplicity and ease of use.
However, after writing several articles about scrapers, I found that Python alone becomes inefficient for large-scale data extraction and storage.
Although Python supports multi-threading, CPython's Global Interpreter Lock allows only one thread to execute Python bytecode at a time, which limits true parallelism.
Starting with this article, I will explore web scraping using Node.js. This serves both as a record of my learning progress and as a way to share knowledge.
Project Setup
First, create a working directory and set up the environment as follows:
E:\>mkdir myNodeJS
E:\>cd myNodeJS
E:\myNodeJS>mkdir firstSpider
E:\myNodeJS>cd firstSpider
E:\myNodeJS\firstSpider>npm init
After running npm init, you'll see a series of prompts. For now, just press Enter to accept all the default values (running npm init -y skips the prompts entirely):
This utility will walk you through creating a package.json file.
It only covers the most common items, and tries to guess sensible defaults.
See `npm help json` for definitive documentation on these fields
and exactly what they do.
Use `npm install <pkg>` afterwards to install a package and
save it as a dependency in the package.json file.
Press ^C at any time to quit.
package name: (firstspider)
version: (1.0.0)
description:
entry point: (index.js)
test command:
git repository:
keywords:
author:
license: (ISC)
About to write to E:\myNodeJS\firstSpider\package.json:
{
  "name": "firstspider",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC"
}
Is this ok? (yes)
npm init creates a package.json file in the current directory. Key fields include:
name: Package name.
version: Version number.
description: Description of the package.
homepage: Official website URL.
author: Author's name.
contributors: Other contributors' names.
dependencies: List of required packages. If a listed package is missing, npm will automatically install it into the node_modules folder.
repository: Type of repository where the source code is stored (e.g., git or svn). Can point to a GitHub repo.
main: Entry point of the module. When someone loads your package via require("express"), this field determines which file gets loaded.
keywords: Keywords related to the package.
These fields are currently empty but can be updated later in the package.json file.
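Since package.json is plain JSON, these fields are also easy to inspect from code. Below is a minimal sketch using only Node's built-in JSON support; the string mirrors the file generated above (in a real project you would load the file itself, e.g. with require("./package.json")):

```javascript
// Parse a package.json-style document and read its fields.
// The JSON string mirrors the file that npm init generated above.
var manifest = JSON.parse(
    '{"name": "firstspider", "version": "1.0.0", "main": "index.js", "license": "ISC"}'
);

console.log(manifest.name);    // firstspider
console.log(manifest.version); // 1.0.0
console.log(manifest.main);    // index.js
```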
Installing Dependencies
Install necessary libraries for web scraping. We'll use two packages:
request: Similar to Python's requests library, it connects to target web pages and returns their data, a crucial step in scraping. (The request package has since been deprecated on npm, but it still works and remains common in tutorials.)
cheerio: Used to manipulate DOM elements. It parses the HTML returned by request into a structure suitable for DOM manipulation. More importantly, its API resembles jQuery's, using $ to select DOM nodes, which makes data extraction straightforward.
Install commands:
E:\myNodeJS\firstSpider>npm install cheerio
E:\myNodeJS\firstSpider>npm install request
Main Program
Now we can create our script file.
Create a new JavaScript file named onespider.js to serve as the entry point. (Note that npm init set the main field in package.json to index.js; you can change it to onespider.js so the two match.)
Import dependencies:
var request = require("request");
var cheerio = require("cheerio");
Make a request to fetch the Baidu homepage content:
request('https://www.baidu.com', function(err, result) {
    if (err) {
        console.log("Request error: " + err);
        return;
    }
    console.log(result.body);
});
At this stage, a basic scraper is complete. Let’s run it to verify functionality:
E:\myNodeJS\firstSpider>node onespider.js
Upon execution, the output should resemble:
<!--STATUS OK-->
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=Edge">
<link rel="dns-prefetch" href="//s1.bdstatic.com"/>
<link rel="dns-prefetch" href="//t1.baidu.com"/>
<link rel="dns-prefetch" href="//t2.baidu.com"/>
<link rel="dns-prefetch" href="//t3.baidu.com"/>
<link rel="dns-prefetch" href="//t10.baidu.com"/>
<link rel="dns-prefetch" href="//t11.baidu.com"/>
<link rel="dns-prefetch" href="//t12.baidu.com"/>
<link rel="dns-prefetch" href="//b1.bdstatic.com"/>
<title>百度一下,你就知道</title>
<link href="http://s1.bdstatic.com/r/www/cache/static/home/css/index.css" rel="stylesheet" type="text/css" />
<!--[if lte IE 8]><style index="index" >#content{height:480px\9}#m{top:260px\9}</style><![endif]-->
<!--[if IE 8]><style index="index" >#u1 a.mnav,#u1 a.mnav:visited{font-family:simsun}</style><![endif]-->
<script>var hashMatch = document.location.href.match(/#+(.*wd=[^&].+)/);if (hashMatch && hashMatch[0] && hashMatch[1]) {document.location.replace("http://"+location.host+"/s?"+hashMatch[1]);}var ns_c = function(){};</script>
<script>function h(obj){obj.style.behavior='url(#default#homepage)';var a = obj.setHomePage('//www.baidu.com/');}</script>
As expected, the program successfully fetched the webpage content.
Next, process the retrieved data to extract the page title.
To locate the <title> tag, use browser developer tools (F12). From the earlier output, we can already see the title element.
Modify the code to extract the title:
// console.log(result.body);
var page = cheerio.load(result.body);
console.log(page('title').text());
Running again produces:
E:\myNodeJS\firstSpider>node onespider.js
百度一下,你就知道
Thus, we've built our first web scraper using Node.js!
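For comparison, cheerio is not strictly required for a page this simple: the title could also be pulled out with a regular expression. That approach is fragile and breaks as soon as the markup gets more complex, which is exactly why a DOM-style API like cheerio's is worth using. A sketch against a hard-coded snippet of the page:

```javascript
// Fragile alternative: extract the <title> text with a regex.
// Works only for simple, well-formed pages; shown for comparison.
var body = "<html><head><title>百度一下,你就知道</title></head></html>";
var match = body.match(/<title>([^<]*)<\/title>/);
if (match) {
    console.log(match[1]); // 百度一下,你就知道
}
```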
Full Code
// Import dependencies
var request = require("request");
var cheerio = require("cheerio");

request('https://www.baidu.com', function(err, result) {
    // Error handling
    if (err) {
        console.log("Request error: " + err);
        return;
    }
    // console.log(result.body);           // Output raw result
    var page = cheerio.load(result.body);  // Process HTML with cheerio
    console.log(page('title').text());     // Extract title
});