
Building Your First Web Scraper with Node.js


Introduction

When it comes to web scraping, Python is often the preferred choice due to its simplicity and ease of use.

However, having recently written several articles about scrapers, I found that Python alone becomes inefficient when dealing with large-scale data extraction and storage.

Although Python supports multi-threading, the Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time, which limits its performance in concurrent operations.

Starting with this article, I will explore web scraping using Node.js. This serves both as a record of my learning progress and a way to share knowledge.

Project Setup

First, create a working directory and set up the environment as follows:

E:\>mkdir myNodeJS
E:\>cd myNodeJS
E:\myNodeJS>mkdir firstSpider
E:\myNodeJS>cd firstSpider
E:\myNodeJS\firstSpider>npm init

After running npm init, you'll see a series of prompts. For now, just press Enter to accept all default values:

This utility will walk you through creating a package.json file.
It only covers the most common items, and tries to guess sensible defaults.

See `npm help json` for definitive documentation on these fields
and exactly what they do.

Use `npm install <pkg>` afterwards to install a package and
save it as a dependency in the package.json file.

Press ^C at any time to quit.
package name: (firstspider)
version: (1.0.0)
description:
entry point: (index.js)
test command:
git repository:
keywords:
author:
license: (ISC)
About to write to E:\myNodeJS\firstSpider\package.json:
{
  "name": "firstspider",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC"
}

Is this ok? (yes)

npm init creates a package.json file in the current directory. Key fields include:

  • name: Package name.
  • version: Version number.
  • description: Description of the package.
  • homepage: Official website URL.
  • author: Author's name.
  • contributors: Other contributors' names.
  • dependencies: List of required packages. If not installed, npm will automatically install them into the node_modules folder.
  • repository: Type of repository where the source code is stored (e.g., git or svn). Can point to a GitHub repo.
  • main: Entry point of the module. When someone requires your package via require("express"), this field determines what gets loaded.
  • keywords: Keywords related to the package.

These fields are currently empty but can be updated later in the package.json file.
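For example, a package.json edited by hand later on might look like the following (the description, keywords, and author values below are made up for illustration):

```json
{
  "name": "firstspider",
  "version": "1.0.0",
  "description": "My first Node.js web scraper",
  "main": "onespider.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": ["scraper", "cheerio"],
  "author": "Fading Coder",
  "license": "ISC"
}
```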

Installing Dependencies

Install necessary libraries for web scraping. We'll use two packages:

  • request: Similar to Python’s requests library, it establishes connections to target web pages and returns their content, a crucial step in scraping. (Note that the request package was deprecated in 2020; it still works fine for learning, but new projects typically use alternatives such as axios or Node’s built-in fetch.)
  • cheerio: Used to manipulate DOM elements. It converts HTML returned by request into a format suitable for DOM manipulation. More importantly, its API resembles jQuery, using $ to select DOM nodes, making data extraction straightforward.

Install commands:

E:\myNodeJS\firstSpider>npm install cheerio
E:\myNodeJS\firstSpider>npm install request
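With npm 5 and later, npm install records each package in the dependencies field of package.json automatically, so after these two commands it should contain something like the following (the exact version numbers will vary depending on when you install):

```json
"dependencies": {
  "cheerio": "^1.0.0-rc.12",
  "request": "^2.88.2"
}
```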

Main Program

Now we can create our script file.

Create a new JavaScript file named onespider.js. Note that the main field in package.json defaulted to index.js during npm init; either update main to onespider.js, or simply run the file directly with node, as we do below.

Import dependencies:

var request = require("request");
var cheerio = require("cheerio");

Make a request to fetch the Baidu homepage content:

request('https://www.baidu.com', function(err, result) {
  if (err) {
    console.log("Request error: " + err);
    return;
  }
  console.log(result.body);
});

At this stage, a basic scraper is complete. Let’s run it to verify functionality:

E:\myNodeJS\firstSpider>node onespider.js

Upon execution, the output should resemble:

<!--STATUS OK-->
<html>
<head>
        <meta http-equiv="content-type" content="text/html;charset=utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=Edge">
        <link rel="dns-prefetch" href="//s1.bdstatic.com"/>
        <link rel="dns-prefetch" href="//t1.baidu.com"/>
        <link rel="dns-prefetch" href="//t2.baidu.com"/>
        <link rel="dns-prefetch" href="//t3.baidu.com"/>
        <link rel="dns-prefetch" href="//t10.baidu.com"/>
        <link rel="dns-prefetch" href="//t11.baidu.com"/>
        <link rel="dns-prefetch" href="//t12.baidu.com"/>
        <link rel="dns-prefetch" href="//b1.bdstatic.com"/>
        <title>百度一下,你就知道</title>
        <link href="http://s1.bdstatic.com/r/www/cache/static/home/css/index.css" rel="stylesheet" type="text/css" />
        <!--[if lte IE 8]><style index="index" >#content{height:480px\9}#m{top:260px\9}</style><![endif]-->
        <!--[if IE 8]><style index="index" >#u1 a.mnav,#u1 a.mnav:visited{font-family:simsun}</style><![endif]-->
        <script>var hashMatch = document.location.href.match(/#+(.*wd=[^&].+)/);if (hashMatch && hashMatch[0] && hashMatch[1]) {document.location.replace("http://"+location.host+"/s?"+hashMatch[1]);}var ns_c = function(){};</script>
        <script>function h(obj){obj.style.behavior='url(#default#homepage)';var a = obj.setHomePage('//www.baidu.com/');}</script>

As expected, the program successfully fetched the webpage content.
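Before processing the data, it is worth knowing that request actually invokes its callback with three arguments: (error, response, body). Our example names the response object result and reads result.body; the third argument is that same body string. The following sketch simulates the callback invocation (no network or request package needed; the response object and HTML below are made up purely to illustrate the shapes):

```javascript
// Simulated stand-in for how request invokes its callback: the real library
// calls callback(error, response, body), where response.body and the third
// argument are the same string. The fake response here is for illustration only.
function simulateRequestCallback(callback) {
  var fakeResponse = { statusCode: 200, body: "<html><title>demo</title></html>" };
  callback(null, fakeResponse, fakeResponse.body);
}

simulateRequestCallback(function (err, response, body) {
  if (err) {
    console.log("Request error: " + err);
    return;
  }
  console.log(response.statusCode);    // HTTP status code of the response
  console.log(body === response.body); // true: the same string either way
});
```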

Next, process the retrieved data to extract the page title.

To locate the <title> tag, use browser developer tools (F12). From the earlier output, we can already see the title element.

Modify the code to extract the title:

// console.log(result.body);
var page = cheerio.load(result.body);
console.log(page('title').text());

Running again produces:

E:\myNodeJS\firstSpider>node onespider.js
百度一下,你就知道
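For a page this simple, it helps to see what cheerio is doing for us: page('title').text() finds the <title> node and returns its text content. As a rough illustration only (a regular expression is not a real HTML parser, and cheerio handles far more than this), the same extraction could be sketched with nothing but the standard library:

```javascript
// Illustrative only: pull the contents of the first <title> tag out of an
// HTML string with a regular expression. cheerio does this robustly with a
// real parser; this sketch just shows the idea for a trivial page.
function extractTitle(html) {
  var match = html.match(/<title[^>]*>([^<]*)<\/title>/i);
  return match ? match[1] : null;
}

var sample = "<html><head><title>百度一下,你就知道</title></head></html>";
console.log(extractTitle(sample)); // prints: 百度一下,你就知道
```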

Thus, we've built our first web scraper using Node.js!

Full Code

var request = require("request");        // HTTP client
var cheerio = require("cheerio");        // jQuery-like HTML parser

request('https://www.baidu.com', function(err, result) {
  if (err) {                             // Error handling
    console.log("Request error: " + err);
    return;
  }
  // console.log(result.body);           // Output the raw HTML
  var page = cheerio.load(result.body);  // Parse the HTML with cheerio
  console.log(page('title').text());     // Extract the page title
});
