Using TOR Proxies for Anonymous Web Scraping with Python


Web scraping is a technique used to extract data from websites. However, scraping websites without proper anonymity can lead to IP bans or tracking. TOR (The Onion Router) proxies provide a way to hide the scraper’s identity by routing the traffic through a network of volunteer-operated servers. In this article, we will explore how to use TOR proxies for anonymous web scraping with Python, focusing on code implementation and technical details.

Setting Up the TOR Network

Before using TOR proxies, it’s essential to set up the TOR network. TOR is a decentralized network that allows users to browse the internet anonymously. To use TOR for scraping, you need the TOR service running on your machine, which you can install using the following steps:

  • Download and install TOR from the official website.
  • Ensure that the TOR service is running by starting it manually or setting it to start automatically with your system.
  • Verify that TOR is working by visiting check.torproject.org in your browser. If the page says “Congratulations. This browser is configured to use Tor,” then the setup is correct.
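On Debian/Ubuntu-style systems, the steps above can be sketched as follows (package and service names may differ on other platforms, so treat this as an illustration rather than a universal recipe):

```
# Install the TOR daemon and start it as a system service
sudo apt install tor
sudo systemctl enable --now tor

# Confirm the service is running and listening on the default SOCKS port 9050
systemctl status tor
ss -ltn | grep 9050
```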

Installing Required Libraries

In Python, the most common libraries for web scraping are Requests and BeautifulSoup. For anonymous scraping with TOR, we will also need Stem (the library used to control the TOR process) and SOCKS support for Requests, which is provided by the PySocks package. You can install the required libraries using pip:

pip install requests[socks] stem

Setting Up the Proxy Connection

To route your requests through TOR, you need to configure your HTTP requests to use a TOR proxy. The TOR daemon listens on port 9050 for SOCKS5 proxy connections by default (the Tor Browser bundle listens on port 9150 instead). The next step is to modify your Python code to connect through this proxy.


import requests

# Define the TOR proxy URL ('socks5h' resolves DNS through the proxy, avoiding DNS leaks)
TOR_PROXY = 'socks5h://127.0.0.1:9050'

# Setup session with proxy
session = requests.Session()
session.proxies = {
    'http': TOR_PROXY,
    'https': TOR_PROXY
}

# Send request using the TOR proxy
response = session.get('http://httpbin.org/ip')
print(response.text)

The above code sends a request through the TOR proxy to a test endpoint (httpbin.org) that echoes back the client’s IP address. The IP returned should belong to a TOR exit node, not to your machine.

Managing TOR Circuit Rotation

TOR uses multiple nodes to anonymize the traffic. Occasionally, you may want to rotate the TOR circuit (the path your request takes through the network) to prevent detection and blocking by websites. You can control the TOR circuit using the stem library, which allows you to interact with the TOR process. Here is how to renew the TOR circuit:


from stem import Signal
from stem.control import Controller

# Function to renew TOR circuit
def renew_tor_circuit():
    # Requires the TOR control port to be enabled (ControlPort 9051 in torrc)
    with Controller.from_port(port=9051) as controller:
        controller.authenticate()  # Uses cookie auth or a password, depending on your torrc
        controller.signal(Signal.NEWNYM)  # Request a new circuit
        print("TOR circuit renewed!")

# Call the function to renew the circuit
renew_tor_circuit()

By calling the renew_tor_circuit function, you instruct TOR to build a new circuit for subsequent requests. Note that TOR rate-limits NEWNYM signals to roughly one every ten seconds, so renew sparingly. This can help avoid detection when scraping multiple pages from the same website.
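The controller connection used above only works if TOR's control port is enabled, which it is not on a default install. You may need to add lines like the following to your torrc file (the exact path varies by platform, e.g. /etc/tor/torrc on many Linux systems) and restart the TOR service:

```
# torrc: enable the control port that Stem connects to
ControlPort 9051
CookieAuthentication 1
```

With CookieAuthentication enabled, controller.authenticate() can authenticate without a password, provided your user can read TOR's cookie file; alternatively, you can set a HashedControlPassword and pass the password to authenticate().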

Handling CAPTCHA and Anti-Scraping Measures

Many websites implement CAPTCHAs and other anti-scraping mechanisms to prevent automated access. While using TOR helps with anonymity, these websites may still detect scraping behavior. To bypass these measures, you can employ strategies such as introducing random delays between requests, using user-agent rotation, or incorporating CAPTCHA-solving services.


import random
import time

# Function to introduce random delays
def random_delay():
    delay = random.uniform(1, 5)  # Delay between 1 and 5 seconds
    time.sleep(delay)

# Example of making a request with a random delay (reuses the TOR session created earlier)
def scrape_with_delay(url):
    random_delay()
    response = session.get(url)
    return response.text

Introducing random delays between requests can reduce the chances of being flagged by the website for scraping. You can also rotate user-agent strings to simulate requests from different browsers and devices.
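User-agent rotation can be sketched in the same style. The strings below are illustrative examples; a real scraper would typically maintain a larger, regularly updated pool:

```python
import random

# A small pool of example user-agent strings (illustrative, not exhaustive)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
]

def random_headers():
    """Return request headers with a randomly chosen user-agent."""
    return {'User-Agent': random.choice(USER_AGENTS)}
```

You would then pass these headers on each request, for example session.get(url, headers=random_headers()), so that successive requests do not all present an identical browser fingerprint.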

Handling TOR Proxy Failures

While using TOR, sometimes the proxy connection may fail, or the TOR service may not be available. It’s essential to handle these scenarios gracefully to ensure that the scraper continues running smoothly. Here’s an example of how to handle TOR proxy failures using try-except blocks:


def make_request_with_tor(url):
    try:
        response = session.get(url, timeout=10)
        response.raise_for_status()  # Raise an error for bad responses
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error occurred: {e}")
        return None

In the code above, if a failure occurs (e.g., network issues, proxy failure), the scraper will print the error and continue without crashing.
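Building on this error handling, a simple retry loop can renew the TOR circuit between failed attempts. This is a minimal sketch: the fetch and renew_circuit callbacks, attempt counts, and backoff values below are illustrative choices, not part of any standard API:

```python
import time

def fetch_with_retries(fetch, renew_circuit=None, max_attempts=3, backoff=2.0):
    """Call fetch() until it returns a non-None result, renewing the
    TOR circuit between failed attempts when a callback is provided."""
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts:
            if renew_circuit is not None:
                renew_circuit()  # e.g. the renew_tor_circuit function shown earlier
            time.sleep(backoff * attempt)  # Linearly increasing pause between attempts
    return None  # All attempts failed
```

In practice, fetch would be a small closure around make_request_with_tor(url) and renew_circuit would be renew_tor_circuit, so that each failed request gets a fresh exit node before the retry.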
