Fading Coder

Implementing a Scalable Web Scraping Service with Python, Flask, and Nginx

To deploy a functional web scraping service on a cloud instance, you need a backend that accepts requests, fetches and parses the target pages, and keeps running unattended. This implementation builds a small Flask microservice, places Nginx in front of it as a reverse proxy, and adds a simple HTML interface for user interaction.

Backend Architecture

The core logic resides in a Python API endpoint that accepts a URL and an extraction mode (title, image, or video) and returns the matching data as JSON. The following steps outline the creation and configuration of this service.

Installing Dependencies

Ensure Python 3 and pip are installed. If the default package index is slow from your network, pip's --index-url (-i) option lets you point at a closer mirror.

pip install flask beautifulsoup4 requests

Developing the Application Logic

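Create a main module, saved here as crawler_app.py (the name the startup script expects), to handle HTTP requests. The endpoint validates its input and returns errors as JSON, so a malformed request or an unreachable target does not crash the service.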

from flask import Flask, request, jsonify
import requests
from bs4 import BeautifulSoup

app = Flask(__name__)

@app.route('/api/scan', methods=['POST'])
def execute_scraper():
    try:
        # silent=True returns None instead of raising on a missing or invalid JSON body
        payload = request.get_json(silent=True) or {}
        url = payload.get('url')
        extraction_mode = payload.get('mode', 'title')

        if not url:
            return jsonify({'error': 'URL missing'}), 400

        headers = {'User-Agent': 'Mozilla/5.0'}
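        response = requests.get(url, headers=headers, timeout=10)
        # Treat upstream 4xx/5xx responses as errors instead of parsing an error page
        response.raise_for_status()
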
        soup = BeautifulSoup(response.content, 'html.parser')
        results = []

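        if extraction_mode == 'title':
            title_tag = soup.find('title')
            results.append(title_tag.get_text(strip=True) if title_tag else "")
        elif extraction_mode == 'image':
            # Skip <img> tags that have no src attribute
            results = [img['src'] for img in soup.find_all('img', src=True)]
        elif extraction_mode == 'video':
            # The source URL may sit on <video> itself or on a nested <source> tag
            results = [tag['src'] for tag in soup.select('video[src], video source[src]')]
        else:
            return jsonify({'error': 'Unsupported mode'}), 400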

        return jsonify(results)

    except Exception as e:
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=False)
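
The endpoint above fetches whatever URL the caller supplies, and only checks that the value is non-empty. A stricter check could reject non-HTTP schemes and hosts that resolve to private or loopback addresses. The following is a minimal sketch under that assumption, not part of the original service; `is_safe_target`, `ALLOWED_SCHEMES`, and the injectable `resolve` parameter are hypothetical names introduced for illustration.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

# Hypothetical allow-list; the original service accepts any URL.
ALLOWED_SCHEMES = {"http", "https"}


def is_safe_target(url, resolve=socket.getaddrinfo):
    """Return True if url is http(s) and every resolved address is public.

    `resolve` defaults to socket.getaddrinfo and can be replaced in tests.
    """
    try:
        parts = urlsplit(url)
        port = parts.port or 80
    except ValueError:
        # Malformed netloc or port
        return False
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    try:
        infos = resolve(parts.hostname, port)
    except (OSError, ValueError):
        # Name did not resolve
        return False
    for info in infos:
        try:
            addr = ipaddress.ip_address(info[4][0])
        except ValueError:
            return False
        # is_global is False for loopback, private, and link-local ranges
        if not addr.is_global:
            return False
    return bool(infos)
```

Inside `execute_scraper`, a call such as `if not is_safe_target(url): return jsonify({'error': 'URL not allowed'}), 400` could follow the existing missing-URL check. This is only a first line of defense: the hostname is resolved again when `requests.get` runs, and redirects are followed, so the address actually contacted can differ from the one checked here.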

Process Management

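A service started directly from an interactive terminal stops when the session ends, which makes it unsuitable for production. The shell script below stops any previous instance, starts a new one in the background, records its PID, and appends its output to a log file.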

#!/bin/bash
SERVICE_PORT=5000
PID_FILE="./server.pid"
LOG_FILE="./server.log"

# Stop the instance recorded in the PID file, if it is still running
kill_previous() {
    if [ -f "$PID_FILE" ]; then
        PID=$(cat "$PID_FILE")
        if ps -p "$PID" > /dev/null; then
            kill "$PID"
            echo "Previous instance stopped."
        fi
        # Remove the PID file even when the process was already gone
        rm -f "$PID_FILE"
    fi
}

# Start a new instance in the background and record its PID
start_server() {
    nohup python3 crawler_app.py >> "$LOG_FILE" 2>&1 &
    echo $! > "$PID_FILE"
    echo "Service started with PID: $(cat "$PID_FILE")"
}

kill_previous
start_server
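
The same PID file can be used to check from Python whether the service is still alive, for example in a monitoring job. The sketch below assumes a POSIX system and the `./server.pid` path from the script; `read_pid` and `is_running` are hypothetical helper names, not part of the original setup.

```python
import os


def read_pid(pid_file="./server.pid"):
    """Return the PID stored in pid_file, or None if it is missing or invalid."""
    try:
        with open(pid_file) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return None


def is_running(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    if pid is None or pid <= 0:
        return False
    try:
        # Signal 0 performs the existence and permission checks without
        # delivering a signal to the process.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

Calling `is_running(read_pid())` mirrors the `ps -p` test in the shell script. Like that test, it only confirms that some process has the recorded PID, not that the process is the scraper.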

Reverse Proxy Configuration

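To expose the service on the standard HTTP port, so that clients do not need to know the internal port number, configure Nginx as a reverse proxy. Requests under /api are forwarded to Flask on port 5000; everything else is served as static files.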

  1. Create a server block configuration, for example at /etc/nginx/sites-available/crawler:
server {
    listen 80;
    server_name your-domain.com;

    location /api {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    location / { # Frontend serving area
        root /var/www/html;
        index index.html;
    }
}
  2. Enable the site and reload Nginx:
    sudo ln -s /etc/nginx/sites-available/crawler /etc/nginx/sites-enabled/
    sudo nginx -t && sudo systemctl reload nginx
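
Once Nginx has reloaded, the endpoint can be exercised through the proxy. The following is a minimal client sketch using only the standard library; `build_scan_request` and `scan` are hypothetical helper names, and `your-domain.com` is the placeholder from the server block above.

```python
import json
import urllib.request


def build_scan_request(base_url, target_url, mode="title"):
    """Build the POST request that /api/scan expects, without sending it."""
    body = json.dumps({"url": target_url, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/scan",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scan(base_url, target_url, mode="title", timeout=15):
    """Send the request and return the decoded JSON response."""
    req = build_scan_request(base_url, target_url, mode)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `scan("http://your-domain.com", "https://example.com", "image")` would return the list produced by the image branch of the endpoint. Note that `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so the JSON error body from the service has to be read from the exception in that case.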
    

Client-Side Interface

A lightweight HTML interface allows users to submit targets for processing without requiring complex frontend frameworks.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Web Scraper Tool</title>
</head>
<body>
    <div id="container">
        <h2>Target Scanner</h2>
        <form id="dataForm">
            <label for="targetUrl">Target URL:</label><br>
            <input type="text" id="targetUrl" name="url" required><br><br>
            
            <label for=
