How to Speed Up S3 Downloads Using Multi-Part Downloads in Python


Amazon S3 (Simple Storage Service) is one of the most popular cloud storage services used by developers and businesses alike for storing and retrieving data. However, downloading large files from S3 can be slow due to network limitations or the size of the files. A powerful technique for optimizing these downloads is the multi-part download, which splits a file into byte ranges that can be fetched in parallel, speeding up the entire process.

In this article, we will delve deep into how to use multi-part downloads in Python to speed up S3 file retrieval. We will walk through the steps to implement this functionality and provide code examples to help you get started.

What Is a Multi-Part Download?

Multi-part download allows you to break up large files into smaller parts, download these parts simultaneously, and then merge them together into the original file. This parallel downloading process significantly speeds up the time it takes to retrieve files from Amazon S3.
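
As a quick illustration, consider a 12 MB object downloaded in 5 MB parts (both numbers are just examples): it splits into three byte ranges that can be fetched independently.
python
# Example only: split a 12 MB object into 5 MB byte ranges
file_size = 12 * 1024 * 1024
chunk = 5 * 1024 * 1024
ranges = [(start, min(start + chunk, file_size) - 1) for start in range(0, file_size, chunk)]
print(ranges)  # [(0, 5242879), (5242880, 10485759), (10485760, 12582911)]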

Setting Up the Environment

To use multi-part downloads, you first need to install the necessary Python libraries. Specifically, the boto3 library, which is the official AWS SDK for Python, will help you interact with S3.

Install boto3 with pip:

pip install boto3

You also need to configure your AWS credentials using the AWS CLI or by setting them manually in your Python script. Ensure that you have the correct permissions to access the S3 bucket from which you want to download files.
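
For example, you can run aws configure once to store credentials locally, or point boto3 at a named profile; the profile name below is only a placeholder.
python
import boto3

# "my-profile" is a placeholder profile from ~/.aws/credentials
session = boto3.Session(profile_name="my-profile")
s3 = session.client("s3")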

Initial Setup of S3 Client

To begin, set up an S3 client using the boto3 library:
python
import boto3

s3 = boto3.client('s3')

This client will allow you to interact with your S3 bucket. For multi-part downloads, we will use the get_object method.

Calculating the File Size and Defining Part Size

Before beginning the download, it’s important to calculate the file’s size. S3 provides the head_object method to retrieve metadata about the file, including its size.
python
def get_file_size(bucket, key):
    response = s3.head_object(Bucket=bucket, Key=key)
    return response['ContentLength']

Once we have the file size, we need to determine the size of each part. A common strategy is to break the file into 5 MB chunks, but you can adjust this depending on your network bandwidth and file size.
python
part_size = 5 * 1024 * 1024 # 5 MB

Starting the Multi-Part Download

Now, we need to download the file in parts. To do this, we will use the Range parameter of get_object to specify which byte range we want to fetch for each part.
python
import threading

def download_part(bucket, key, start_byte, end_byte, part_number):
    # Request only this part's byte range from S3
    response = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start_byte}-{end_byte}")
    part_data = response['Body'].read()

    # Write the part to its own temporary file on disk
    with open(f"part_{part_number}", 'wb') as f:
        f.write(part_data)
    print(f"Part {part_number} downloaded successfully.")

This function downloads a specific range of bytes from the object and saves it as a separate part file on disk.
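
For example, assuming a bucket named my-bucket containing an object large-file.bin (both placeholder names), you could fetch the first 5 MB as part 0 like this:
python
# Placeholder bucket and key, purely for illustration
download_part("my-bucket", "large-file.bin", 0, part_size - 1, 0)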

Managing Parallel Downloads

To speed up the process, we will download the file parts in parallel using Python’s threading module. The idea is to create a thread for each part of the file, allowing multiple parts to be downloaded concurrently.
python
def download_file_in_parts(bucket, key):
    file_size = get_file_size(bucket, key)
    # Round up so the final, smaller chunk still gets its own part
    num_parts = (file_size + part_size - 1) // part_size
    threads = []

    for part_number in range(num_parts):
        start_byte = part_number * part_size
        end_byte = min((part_number + 1) * part_size - 1, file_size - 1)

        thread = threading.Thread(target=download_part, args=(bucket, key, start_byte, end_byte, part_number))
        thread.start()
        threads.append(thread)

    # Wait for all parts to finish downloading
    for thread in threads:
        thread.join()

    return num_parts

This function divides the file into parts based on the total file size and the part size, then starts a thread for each part. After launching all the threads, we wait for each one to finish using join(). It also returns the number of parts, which we will need when reassembling the file.
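
Launching one thread per part works for moderately sized files, but a very large object could spawn hundreds of threads at once. Because boto3's low-level clients are thread-safe, you can also cap concurrency with Python's concurrent.futures module; the sketch below is an optional alternative to the threading version above, and max_workers=8 is just an illustrative choice.
python
from concurrent.futures import ThreadPoolExecutor

def download_file_in_parts_pooled(bucket, key, max_workers=8):
    file_size = get_file_size(bucket, key)
    num_parts = (file_size + part_size - 1) // part_size

    # Never run more than max_workers range requests at the same time
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for part_number in range(num_parts):
            start_byte = part_number * part_size
            end_byte = min((part_number + 1) * part_size - 1, file_size - 1)
            executor.submit(download_part, bucket, key, start_byte, end_byte, part_number)

    return num_parts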

Reassembling the File

Once all parts are downloaded, the next step is to reassemble the file. Since each part was downloaded separately, we need to concatenate the parts in the correct order.
python
def assemble_file(num_parts):
    with open("final_file", 'wb') as final_file:
        for part_number in range(num_parts):
            with open(f"part_{part_number}", 'rb') as part_file:
                final_file.write(part_file.read())
            print(f"Part {part_number} added to the final file.")

This function opens each part in order and writes it to the final file. Once all parts have been written, you will have the complete file.
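
Putting it all together, a complete run might look like the sketch below. The bucket and key are placeholders, and the os.remove calls clean up the temporary part files once the final file has been written.
python
import os

bucket = "my-bucket"       # placeholder bucket name
key = "large-file.bin"     # placeholder object key

num_parts = download_file_in_parts(bucket, key)
assemble_file(num_parts)

# Remove the temporary part files now that final_file is complete
for part_number in range(num_parts):
    os.remove(f"part_{part_number}")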

Handling Errors and Retries

When dealing with network requests, it’s crucial to handle errors and implement retries in case of failure. You can use the botocore.exceptions module to catch common exceptions like EndpointConnectionError and retry the download.
python
import time
from botocore.exceptions import EndpointConnectionError

def download_part_with_retry(bucket, key, start_byte, end_byte, part_number, retries=3):
    attempt = 0
    while attempt < retries:
        try:
            download_part(bucket, key, start_byte, end_byte, part_number)
            return
        except EndpointConnectionError:
            attempt += 1
            print(f"Attempt {attempt} failed. Retrying...")
            time.sleep(2)
    # All retries exhausted; surface the failure instead of silently giving up
    raise RuntimeError(f"Failed to download part {part_number} after {retries} attempts")

By default, this function attempts to download the part up to three times, waiting two seconds between attempts, and raises an error if all retries fail.
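
To use the retry logic in the parallel download, point each thread at download_part_with_retry instead of download_part; only the thread target inside download_file_in_parts needs to change.
python
# Inside download_file_in_parts, swap the thread target for the retrying version
thread = threading.Thread(
    target=download_part_with_retry,
    args=(bucket, key, start_byte, end_byte, part_number),
)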

Conclusion

By using multi-part downloads in Python, you can significantly improve the download speed of large files from S3. With the help of the boto3 library and Python’s threading module, this approach allows you to download large files faster by downloading parts in parallel. Make sure to handle potential errors and retries to ensure a robust solution.
