Implementing a Scalable Web Scraping Service with Python, Flask, and Nginx
Deploying a functional web scraping service on a cloud instance requires a backend that can handle requests and manage its own processes reliably. This implementation builds a small microservice with Flask, places Nginx in front of it as a reverse proxy, and provides a simple client-side interface for user interaction.
Backend Architecture
The core logic resides in a Python-based API endpoint designed to accept URLs and target data types (titles, images, or videos). The following steps outline the creation and configuration of this service.
Installing Dependencies
Ensure Python 3 is available on the instance, then install the required packages with pip:
pip install flask beautifulsoup4 requests
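For reproducible deployments, the same three dependencies can be pinned in a requirements.txt file; the version floors below are illustrative assumptions, not requirements stated by this guide:

```
flask>=2.0
beautifulsoup4>=4.9
requests>=2.25
```

Install with `pip install -r requirements.txt` so every instance resolves the same package set.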
Developing the Application Logic
Create a main module to handle HTTP requests. Proper error handling and variable validation are critical for stability.
from flask import Flask, request, jsonify
import requests
from bs4 import BeautifulSoup

app = Flask(__name__)

@app.route('/api/scan', methods=['POST'])
def execute_scraper():
    try:
        payload = request.get_json()
        url = payload.get('url')
        extraction_mode = payload.get('mode', 'title')
        if not url:
            return jsonify({'error': 'URL missing'}), 400
        headers = {'User-Agent': 'Mozilla/5.0'}
        response = requests.get(url, headers=headers, timeout=10)
        # Surface HTTP errors (4xx/5xx) instead of silently parsing error pages
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        results = []
        if extraction_mode == 'title':
            title_tag = soup.find('title')
            results.append(title_tag.string if title_tag else "")
        elif extraction_mode == 'image':
            img_tags = soup.find_all('img')
            results = [img.get('src') for img in img_tags]
        elif extraction_mode == 'video':
            vid_tags = soup.find_all('video')
            results = [vid.get('src') for vid in vid_tags]
        return jsonify(results)
    except Exception as e:
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=False)
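The three extraction branches of the endpoint can be exercised in isolation, without a running server. The sketch below assumes beautifulsoup4 is installed and substitutes a small hypothetical document for a fetched response body:

```python
from bs4 import BeautifulSoup

# Hypothetical page standing in for a fetched response body
sample_html = (
    "<html><head><title>Example Page</title></head>"
    "<body><img src='/a.png'><img src='/b.png'>"
    "<video src='/clip.mp4'></video></body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

# Same logic as the 'title', 'image', and 'video' branches of the endpoint
title = soup.find("title").string
images = [img.get("src") for img in soup.find_all("img")]
videos = [vid.get("src") for vid in soup.find_all("video")]

print(title)   # Example Page
print(images)  # ['/a.png', '/b.png']
print(videos)  # ['/clip.mp4']
```

Testing the parsing logic this way keeps network access out of the loop, which is useful before pointing the service at real targets.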
Process Management
Running services directly via terminal is unsuitable for production environments. Implement a shell script to manage process lifecycle, logging, and restarts.
#!/bin/bash
SERVICE_PORT=5000
PID_FILE="./server.pid"
LOG_FILE="./server.log"

# Stop any previously started instance
kill_previous() {
    if [ -f "$PID_FILE" ]; then
        PID=$(cat "$PID_FILE")
        if ps -p "$PID" > /dev/null; then
            kill "$PID"
            echo "Previous instance stopped."
        fi
        # Remove the PID file even if the process had already died
        rm "$PID_FILE"
    fi
}

# Start a new instance in the background and record its PID
start_server() {
    nohup python3 crawler_app.py >> "$LOG_FILE" 2>&1 &
    echo $! > "$PID_FILE"
    echo "Service started with PID: $(cat "$PID_FILE")"
}

kill_previous
start_server
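The `ps -p` liveness check the script relies on can be sketched in isolation. The example below starts a throwaway `sleep` process in place of the real service and records its PID the same way `start_server` does:

```shell
PID_FILE="./demo.pid"

# Placeholder background process standing in for the Flask app
sleep 30 &
echo $! > "$PID_FILE"

# Same liveness test used by kill_previous
PID=$(cat "$PID_FILE")
if ps -p "$PID" > /dev/null; then
    STATUS="running"
else
    STATUS="stopped"
fi
echo "$STATUS"

# Clean up the placeholder
kill "$PID"
rm "$PID_FILE"
```

Because `ps -p` exits non-zero when the PID is gone, the same pattern also detects a crashed service whose stale PID file is still on disk.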
Reverse Proxy Configuration
To expose the internal service securely and abstract the port number, configure Nginx as a reverse proxy.
- Create a server block configuration.
server {
    listen 80;
    server_name your-domain.com;

    location /api {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    # Frontend serving area
    location / {
        root /var/www/html;
        index index.html;
    }
}
- Enable the site and reload Nginx:
sudo ln -s /etc/nginx/sites-available/crawler /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
Client-Side Interface
A lightweight HTML interface allows users to submit targets for processing without requiring complex frontend frameworks.
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Web Scraper Tool</title>
</head>
<body>
    <div id="container">
        <h2>Target Scanner</h2>
        <form id="dataForm">
            <label for="targetUrl">Target URL:</label><br>
            <input type="text" id="targetUrl" name="url" required><br><br>
<label for=