Implementing a Scalable Web Scraping Service with Python, Flask, and Nginx
Deploying a functional web scraping service on a cloud instance requires a backend that can handle requests and manage its own processes reliably. This implementation builds a small microservice with Flask, places Nginx in front of it as a reverse proxy, and provides a simple client-side interface for user interaction.
Backend Architecture
The core logic resides in a Python-based API endpoint designed to accept URLs and target data types (titles, images, or videos). The following steps outline the creation and configuration of this service.
Installing Dependencies
Ensure Python 3 is available on the instance, then install the required packages with pip:
pip install flask beautifulsoup4 requests
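For reproducible deployments, the same three dependencies can be pinned in a requirements.txt file; the version floors below are illustrative assumptions, not requirements stated by this guide:

```
flask>=2.0
beautifulsoup4>=4.9
requests>=2.25
```

Install with `pip install -r requirements.txt` so every instance resolves the same package set.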
Developing the Application Logic
Create a main module to handle HTTP requests. Proper error handling and variable validation are critical for stability.
from flask import Flask, request, jsonify
import requests
from bs4 import BeautifulSoup

app = Flask(__name__)

@app.route('/api/scan', methods=['POST'])
def execute_scraper():
    try:
        payload = request.get_json()
        url = payload.get('url')
        extraction_mode = payload.get('mode', 'title')
        if not url:
            return jsonify({'error': 'URL missing'}), 400
        headers = {'User-Agent': 'Mozilla/5.0'}
        response = requests.get(url, headers=headers, timeout=10)
        # Surface HTTP errors (4xx/5xx) instead of silently parsing error pages
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        results = []
        if extraction_mode == 'title':
            title_tag = soup.find('title')
            results.append(title_tag.string if title_tag else "")
        elif extraction_mode == 'image':
            img_tags = soup.find_all('img')
            results = [img.get('src') for img in img_tags]
        elif extraction_mode == 'video':
            vid_tags = soup.find_all('video')
            results = [vid.get('src') for vid in vid_tags]
        return jsonify(results)
    except Exception as e:
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=False)
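The three extraction branches of the endpoint can be exercised in isolation, without a running server. The sketch below assumes beautifulsoup4 is installed and substitutes a small hypothetical document for a fetched response body:

```python
from bs4 import BeautifulSoup

# Hypothetical page standing in for a fetched response body
sample_html = (
    "<html><head><title>Example Page</title></head>"
    "<body><img src='/a.png'><img src='/b.png'>"
    "<video src='/clip.mp4'></video></body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

# Same logic as the 'title', 'image', and 'video' branches of the endpoint
title = soup.find("title").string
images = [img.get("src") for img in soup.find_all("img")]
videos = [vid.get("src") for vid in soup.find_all("video")]

print(title)   # Example Page
print(images)  # ['/a.png', '/b.png']
print(videos)  # ['/clip.mp4']
```

Testing the parsing logic this way keeps network access out of the loop, which is useful before pointing the service at real targets.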
Process Management
Running services directly via terminal is unsuitable for production environments. Implement a shell script to manage process lifecycle, logging, and restarts.
#!/bin/bash
SERVICE_PORT=5000
PID_FILE="./server.pid"
LOG_FILE="./server.log"

# Stop any previously started instance
kill_previous() {
    if [ -f "$PID_FILE" ]; then
        PID=$(cat "$PID_FILE")
        if ps -p "$PID" > /dev/null; then
            kill "$PID"
            echo "Previous instance stopped."
        fi
        # Remove the PID file even if the process had already died
        rm "$PID_FILE"
    fi
}

# Start a new instance in the background and record its PID
start_server() {
    nohup python3 crawler_app.py >> "$LOG_FILE" 2>&1 &
    echo $! > "$PID_FILE"
    echo "Service started with PID: $(cat "$PID_FILE")"
}

kill_previous
start_server
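The `ps -p` liveness check the script relies on can be sketched in isolation. The example below starts a throwaway `sleep` process in place of the real service and records its PID the same way `start_server` does:

```shell
PID_FILE="./demo.pid"

# Placeholder background process standing in for the Flask app
sleep 30 &
echo $! > "$PID_FILE"

# Same liveness test used by kill_previous
PID=$(cat "$PID_FILE")
if ps -p "$PID" > /dev/null; then
    STATUS="running"
else
    STATUS="stopped"
fi
echo "$STATUS"

# Clean up the placeholder
kill "$PID"
rm "$PID_FILE"
```

Because `ps -p` exits non-zero when the PID is gone, the same pattern also detects a crashed service whose stale PID file is still on disk.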
Reverse Proxy Configuration
To expose the internal service securely and abstract the port number, configure Nginx as a reverse proxy.
- Create a server block configuration.
server {
    listen 80;
    server_name your-domain.com;

    location /api {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    # Frontend serving area
    location / {
        root /var/www/html;
        index index.html;
    }
}
- Enable the site and reload Nginx:
sudo ln -s /etc/nginx/sites-available/crawler /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
Client-Side Interface
A lightweight HTML interface allows users to submit targets for processing without requiring complex frontend frameworks.
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Web Scraper Tool</title>
</head>
<body>
    <div id="container">
        <h2>Target Scanner</h2>
        <form id="dataForm">
            <label for="targetUrl">Target URL:</label><br>
            <input type="text" id="targetUrl" name="url" required><br><br>
<label for=