Introduction
Building a scalable and reliable scraping proxy service is essential for large-scale web scraping operations. In this guide, we will demonstrate how to build a scraping proxy service using Flask, Celery, and Redis. Flask exposes the HTTP API for handing out proxies, Celery runs asynchronous tasks such as proxy rotation, and Redis stores proxy state and backs the task queue.
Prerequisites
Before starting, ensure you have the following installed:
- Python 3.x
- Flask
- Celery
- Redis
You can install the Python packages with pip (note that the Redis server itself is installed separately, for example through your system package manager):
pip install Flask Celery redis
Setting Up Flask
First, we will create a simple Flask app, app.py, that exposes an API endpoint for managing proxies. This API will allow users to retrieve proxies for web scraping tasks.
python
from flask import Flask, jsonify
import redis

app = Flask(__name__)

# Initialize Redis client
r = redis.Redis(host='localhost', port=6379, db=0)

@app.route('/proxy', methods=['GET'])
def get_proxy():
    # SPOP atomically removes and returns a random member of the set
    proxy = r.spop('proxies')
    if proxy:
        return jsonify({'proxy': proxy.decode('utf-8')}), 200
    else:
        return jsonify({'error': 'No proxies available'}), 404

if __name__ == '__main__':
    app.run(debug=True)
In this example, we use redis-py's spop method (the Redis SPOP command) to atomically remove and return a random proxy from a Redis set called proxies. Because the proxy is removed as it is handed out, each one is served only once until it is re-added. If the set is empty, the endpoint responds with 404.
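Before the endpoint can return anything, the proxies set needs to be seeded. Here is a minimal sketch, assuming you already have a list of proxy URLs from your provider (the addresses below are placeholders):
python
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Placeholder proxy URLs -- replace with addresses from your provider
seed_proxies = [
    'http://203.0.113.10:8080',
    'http://203.0.113.11:8080',
]

# SADD ignores duplicates, so re-running this script is safe
r.sadd('proxies', *seed_proxies)
print(r.scard('proxies'), 'proxies available')
Since SPOP consumes a member on every request, the pool shrinks over time; the Celery task in the next section is what replenishes it.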
Setting Up Celery
Next, we set up Celery to handle the background task of fetching new proxies and rotating them. Celery will work with Redis as the message broker to handle task queuing and state management.
Create a new file celery_worker.py:
python
from celery import Celery
import requests
import redis

app = Celery('scraping_proxy', broker='redis://localhost:6379/0')

# Initialize Redis client
r = redis.Redis(host='localhost', port=6379, db=0)

@app.task
def fetch_new_proxy():
    # Use an external service to get a new proxy (dummy URL for example purposes)
    response = requests.get('https://proxy-service.example.com/get_proxy', timeout=10)
    if response.status_code == 200:
        proxy = response.json().get('proxy')
        if proxy:
            # Add the new proxy to the Redis set
            r.sadd('proxies', proxy)
This Celery task fetches a new proxy from an external service (you can replace the URL with your proxy provider’s API) and adds it to the Redis set proxies.
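Fetching proxies only on demand can leave the pool empty under load. As a sketch (the 60-second interval is an arbitrary choice, not part of the original setup), you could let Celery beat top up the pool periodically by adding a schedule to celery_worker.py:
python
# Hypothetical beat schedule: refill the proxy pool every 60 seconds.
# Run alongside the worker with: celery -A celery_worker.app beat --loglevel=info
app.conf.beat_schedule = {
    'refill-proxy-pool': {
        'task': 'celery_worker.fetch_new_proxy',
        'schedule': 60.0,  # seconds between runs
    },
}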
Starting the Services
Now that Flask and Celery are set up, let’s start everything. First, start Redis if you haven’t already:
redis-server
Next, start the Flask app by running:
python app.py
Then, in another terminal window, start the Celery worker:
celery -A celery_worker.app worker --loglevel=info
This will start the worker that will fetch new proxies and add them to Redis.
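With all three services running, you can smoke-test the endpoint from another terminal:
curl http://127.0.0.1:5000/proxy
A successful response looks like {"proxy": "http://203.0.113.10:8080"} (the address is whatever was seeded); if the pool is empty, you get the error payload instead.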
Proxy Rotation Mechanism
To keep the pool from running dry, we modify the Flask API to use Celery for fetching new proxies asynchronously. When the /proxy endpoint is hit and no proxies are available in Redis, we trigger the fetch_new_proxy task.
Here is an updated version of the Flask API:
python
from flask import Flask, jsonify
from celery_worker import fetch_new_proxy
import redis

app = Flask(__name__)

# Initialize Redis client
r = redis.Redis(host='localhost', port=6379, db=0)

@app.route('/proxy', methods=['GET'])
def get_proxy():
    proxy = r.spop('proxies')
    if proxy:
        return jsonify({'proxy': proxy.decode('utf-8')}), 200
    else:
        # Trigger Celery task to fetch a new proxy in the background
        fetch_new_proxy.delay()
        return jsonify({'error': 'No proxies available, fetching new proxy…'}), 503

if __name__ == '__main__':
    app.run(debug=True)
Now, when no proxies are available, the Flask app queues the fetch_new_proxy task to retrieve a fresh proxy and returns a 503, so the client knows to retry shortly.
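On the consuming side, a scraper can treat the 503 as a signal to back off and retry. Here is a minimal client sketch (the retry count and delay are arbitrary choices, not part of the service itself):
python
import time
import requests

def acquire_proxy(max_retries=5, delay=2.0):
    """Poll the proxy API, retrying while a new proxy is being fetched."""
    for _ in range(max_retries):
        resp = requests.get('http://127.0.0.1:5000/proxy', timeout=5)
        if resp.status_code == 200:
            return resp.json()['proxy']
        time.sleep(delay)  # 503: a fetch task was queued; wait and try again
    raise RuntimeError('No proxy available after retries')

# Route a scraping request through the acquired proxy
proxy = acquire_proxy()
page = requests.get('https://example.com',
                    proxies={'http': proxy, 'https': proxy},
                    timeout=10)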
Conclusion
Building a scraping proxy service with Flask and Celery gives you a simple HTTP API for handing out proxies, backed by asynchronous tasks that replenish them. Using Redis as both the message broker and the proxy store keeps the architecture small while still handling a large number of requests and proxy rotations.