
Building Your First Web Scraper with Node.js


Introduction

When it comes to web scraping, Python is often the preferred choice due to its simplicity and ease of use.

However, having recently written several articles about scrapers, I found that Python alone becomes inefficient when dealing with large-scale data extraction and storage.

Although Python supports multi-threading, the Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time, which limits its performance in concurrent operations.

Starting with this article, I will explore web scraping using Node.js. This serves both as a record of my learning progress and a way to share knowledge.

Project Setup

First, create a working directory and set up the environment as follows:

E:\>mkdir myNodeJS
E:\>cd myNodeJS
E:\myNodeJS>mkdir firstSpider
E:\myNodeJS>cd firstSpider
E:\myNodeJS\firstSpider>npm init

After running npm init, you'll see a series of prompts. For now, just press Enter to accept all default values:

This utility will walk you through creating a package.json file.
It only covers the most common items, and tries to guess sensible defaults.

See `npm help json` for definitive documentation on these fields
and exactly what they do.

Use `npm install <pkg>` afterwards to install a package and
save it as a dependency in the package.json file.

Press ^C at any time to quit.
package name: (firstspider)
version: (1.0.0)
description:
entry point: (index.js)
test command:
git repository:
keywords:
author:
license: (ISC)
About to write to E:\myNodeJS\firstSpider\package.json:
{
  "name": "firstspider",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC"
}

Is this ok? (yes)

npm init creates a package.json file in the current directory. Key fields include:

  • name: Package name.
  • version: Version number.
  • description: Description of the package.
  • homepage: Official website URL.
  • author: Author's name.
  • contributors: Other contributors' names.
  • dependencies: List of required packages. If not installed, npm will automatically install them into the node_modules folder.
  • repository: Type of repository where the source code is stored (e.g., git or svn). Can point to a GitHub repo.
  • main: Entry point of the module. When someone requires your package via require("express"), this field determines what gets loaded.
  • keywords: Keywords related to the package.

These fields are currently empty but can be updated later in the package.json file.
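For example, a package.json edited by hand later on might look like the following (the description, keywords, and author values below are made up for illustration):

```json
{
  "name": "firstspider",
  "version": "1.0.0",
  "description": "My first Node.js web scraper",
  "main": "onespider.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": ["scraper", "cheerio"],
  "author": "Fading Coder",
  "license": "ISC"
}
```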

Installing Dependencies

Install necessary libraries for web scraping. We'll use two packages:

  • request: Similar to Python’s requests library, it establishes connections to target web pages and returns their content, a crucial step in scraping. (Note that the request package was deprecated in 2020; it still works fine for learning, but new projects typically use alternatives such as axios or Node’s built-in fetch.)
  • cheerio: Used to manipulate DOM elements. It converts HTML returned by request into a format suitable for DOM manipulation. More importantly, its API resembles jQuery, using $ to select DOM nodes, making data extraction straightforward.

Install commands:

E:\myNodeJS\firstSpider>npm install cheerio
E:\myNodeJS\firstSpider>npm install request
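With npm 5 and later, npm install records each package in the dependencies field of package.json automatically, so after these two commands it should contain something like the following (the exact version numbers will vary depending on when you install):

```json
"dependencies": {
  "cheerio": "^1.0.0-rc.12",
  "request": "^2.88.2"
}
```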

Main Program

Now we can create our script file.

Create a new JavaScript file named onespider.js. Note that the main field in package.json defaulted to index.js during npm init; either update main to onespider.js, or simply run the file directly with node, as we do below.

Import dependencies:

var request = require("request");
var cheerio = require("cheerio");

Make a request to fetch the Baidu homepage content:

request('https://www.baidu.com', function(err, result) {
  if (err) {
    console.log("Request error: " + err);
    return;
  }
  console.log(result.body);
});

At this stage, a basic scraper is complete. Let’s run it to verify functionality:

E:\myNodeJS\firstSpider>node onespider.js

Upon execution, the output should resemble:

<!--STATUS OK-->
<html>
<head>
        <meta http-equiv="content-type" content="text/html;charset=utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=Edge">
        <link rel="dns-prefetch" href="//s1.bdstatic.com"/>
        <link rel="dns-prefetch" href="//t1.baidu.com"/>
        <link rel="dns-prefetch" href="//t2.baidu.com"/>
        <link rel="dns-prefetch" href="//t3.baidu.com"/>
        <link rel="dns-prefetch" href="//t10.baidu.com"/>
        <link rel="dns-prefetch" href="//t11.baidu.com"/>
        <link rel="dns-prefetch" href="//t12.baidu.com"/>
        <link rel="dns-prefetch" href="//b1.bdstatic.com"/>
        <title>百度一下,你就知道</title>
        <link href="http://s1.bdstatic.com/r/www/cache/static/home/css/index.css" rel="stylesheet" type="text/css" />
        <!--[if lte IE 8]><style index="index" >#content{height:480px\9}#m{top:260px\9}</style><![endif]-->
        <!--[if IE 8]><style index="index" >#u1 a.mnav,#u1 a.mnav:visited{font-family:simsun}</style><![endif]-->
        <script>var hashMatch = document.location.href.match(/#+(.*wd=[^&].+)/);if (hashMatch && hashMatch[0] && hashMatch[1]) {document.location.replace("http://"+location.host+"/s?"+hashMatch[1]);}var ns_c = function(){};</script>
        <script>function h(obj){obj.style.behavior='url(#default#homepage)';var a = obj.setHomePage('//www.baidu.com/');}</script>

As expected, the program successfully fetched the webpage content.
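Before processing the data, it is worth knowing that request actually invokes its callback with three arguments: (error, response, body). Our example names the response object result and reads result.body; the third argument is that same body string. The following sketch simulates the callback invocation (no network or request package needed; the response object and HTML below are made up purely to illustrate the shapes):

```javascript
// Simulated stand-in for how request invokes its callback: the real library
// calls callback(error, response, body), where response.body and the third
// argument are the same string. The fake response here is for illustration only.
function simulateRequestCallback(callback) {
  var fakeResponse = { statusCode: 200, body: "<html><title>demo</title></html>" };
  callback(null, fakeResponse, fakeResponse.body);
}

simulateRequestCallback(function (err, response, body) {
  if (err) {
    console.log("Request error: " + err);
    return;
  }
  console.log(response.statusCode);    // HTTP status code of the response
  console.log(body === response.body); // true: the same string either way
});
```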

Next, process the retrieved data to extract the page title.

To locate the <title> tag, use browser developer tools (F12). From the earlier output, we can already see the title element.

Modify the code to extract the title:

// console.log(result.body);
var page = cheerio.load(result.body);
console.log(page('title').text());

Running again produces:

E:\myNodeJS\firstSpider>node onespider.js
百度一下,你就知道
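For a page this simple, it helps to see what cheerio is doing for us: page('title').text() finds the <title> node and returns its text content. As a rough illustration only (a regular expression is not a real HTML parser, and cheerio handles far more than this), the same extraction could be sketched with nothing but the standard library:

```javascript
// Illustrative only: pull the contents of the first <title> tag out of an
// HTML string with a regular expression. cheerio does this robustly with a
// real parser; this sketch just shows the idea for a trivial page.
function extractTitle(html) {
  var match = html.match(/<title[^>]*>([^<]*)<\/title>/i);
  return match ? match[1] : null;
}

var sample = "<html><head><title>百度一下,你就知道</title></head></html>";
console.log(extractTitle(sample)); // prints: 百度一下,你就知道
```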

Thus, we've built our first web scraper using Node.js!

Full Code

var request = require("request");        // HTTP client
var cheerio = require("cheerio");        // jQuery-like HTML parser

request('https://www.baidu.com', function(err, result) {
  if (err) {                             // Error handling
    console.log("Request error: " + err);
    return;
  }
  // console.log(result.body);           // Output the raw HTML
  var page = cheerio.load(result.body);  // Parse the HTML with cheerio
  console.log(page('title').text());     // Extract the page title
});
