Why S3 is Critical for AI and Machine Learning Workloads
The rise of large-scale AI and machine learning workloads has placed unprecedented demands on storage infrastructure. From handling massive datasets to storing frequent model checkpoints, object storage solutions like Amazon S3 (or S3-compatible alternatives) have become essential. At the same time, AI workloads, particularly deep learning training, model inference, and large-scale data processing with Apache Spark, demand high-throughput writes, low-latency reads, and efficient metadata operations.
One of the key challenges in training LLMs and multi-modal AI models is the need for fast and scalable object storage. S3 plays a crucial role in:
Checkpointing AI models to prevent loss during training
Storing and retrieving large-scale datasets (e.g., CommonCrawl, image-text datasets for multi-modal models)
Spark-based ETL pipelines, which frequently scan millions of Parquet files before processing
Understanding how S3 performs under these workloads is critical. This benchmarking study, which leverages elbencho (created by VAST Field CTO Sven Breuner), provides a comprehensive evaluation of read, write, and metadata operations to guide performance optimization strategies.
Benchmarking Methodology
The benchmarking process focused on evaluating S3 performance across different AI workload scenarios, including:
Large object writes and reads (simulating AI model checkpointing).
Multipart uploads for handling massive datasets efficiently.
ListObjects performance, which is essential for Spark workloads scanning Parquet files.
Test Setup
Benchmarking Tool: elbencho
S3-Compatible Storage: VAST Data cluster (any S3-compatible object store will do, but VAST's is going to be faster)
Networking: 100 Gbps connectivity per node
Hosts: Multiple clients running elbencho in a distributed setup
To ensure high throughput and scalability, we ran elbencho in a master-client distributed mode, spreading the load across multiple hosts.
Doing so involves creating a hosts file (one hostname per line) and launching the elbencho daemon/service on every host, as sketched below.
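A rough sketch of that setup (the host names and paths below are placeholders, not the actual benchmark nodes):

# hosts.txt: one elbencho service host per line, e.g.
#   client01
#   client02
#   ...

# Start the elbencho service on every client (clush node set is illustrative)
clush -w client[01-08] elbencho --service

# Benchmark commands on the coordinating host then add --hostsfile hosts.txt
# (or --hosts with a comma-separated list) to drive all clients at once.

# When finished, shut the services down
elbencho --hostsfile hosts.txt --quit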
Benchmarking Results and Commands
1. Large Object Writes & Reads (Non-Multipart)
Checkpointing workloads require high-throughput writes and reads, often with multi-gigabyte files. We benchmarked 256 MB objects written across 8 clients, each running 32 threads.
Write Test (Large PUTs)
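A hedged sketch of such a write run (the endpoint, credentials, bucket name, and object counts are placeholders; -t, -s, and -b reflect the 32 threads per client and 256 MB objects described above, with block size equal to object size so each object is a single PUT):

# Large-object write test: 256 MB objects, one PUT per object (-b == -s)
# -d creates the bucket if it does not already exist
elbencho --hostsfile hosts.txt \
  --s3endpoints http://s3gw.example.com --s3key ACCESS_KEY --s3secret SECRET_KEY \
  -d -w -t 32 -n 1 -N 16 -s 256m -b 256m \
  testbucket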

Sample Output

Read Test (Large GETs)
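The matching read pass (same placeholders and object layout as the write sketch above) swaps -w for -r so the clients read back the objects they created:

# Large-object read test: GET the 256 MB objects written above
elbencho --hostsfile hosts.txt \
  --s3endpoints http://s3gw.example.com --s3key ACCESS_KEY --s3secret SECRET_KEY \
  -r -t 32 -n 1 -N 16 -s 256m -b 256m \
  testbucket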

Sample Output

Peak read throughput: ~11,220 MB/s (~100 Gbps)
Read latency: ~2.2 ms per object
2. Multipart Upload Performance
For large AI checkpoint files, multipart uploads enhance performance by breaking objects into smaller parts. This improves throughput and reliability.
Write Test with Multipart Upload
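A sketch of a multipart write run, assuming elbencho's S3 behavior of switching to multipart uploads when the block size (-b) is smaller than the object size (-s). The 16 MB part size matches the results below; the 1 GiB object size and counts are illustrative:

# Multipart upload test: 1 GiB objects uploaded in 16 MB parts (-b < -s)
elbencho --hostsfile hosts.txt \
  --s3endpoints http://s3gw.example.com --s3key ACCESS_KEY --s3secret SECRET_KEY \
  -w -t 32 -n 1 -N 8 -s 1g -b 16m \
  testbucket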

Results
Using a 16 MB part size reduced API overhead and improved efficiency.
Higher reliability for large AI model checkpoints.
3. ListObjects Performance: Critical for Spark and Parquet Workloads
In Spark-based ETL pipelines, object listing performance is crucial. Spark scans millions of Parquet files to prepare text data (e.g., CommonCrawl) for LLM training.
ListObjects Benchmark
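A sketch of the listing run, assuming a bucket already populated with a large number of small objects (for example, from an earlier -w run with the same -t/-n/-N layout). The --s3listobj and --s3listobjpar flags below are assumptions about elbencho's S3 listing options; verify them against elbencho --help for your version:

# ListObjects benchmark: threads list objects in parallel, one prefix per thread
# (--s3listobj / --s3listobjpar are assumed flag names; confirm with elbencho --help)
elbencho --hostsfile hosts.txt \
  --s3endpoints http://s3gw.example.com --s3key ACCESS_KEY --s3secret SECRET_KEY \
  -t 32 -n 1 -N 100000 --s3listobjpar --s3listobj 100000 \
  testbucket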

Sample Output

Peak listing performance: 3.4M files/sec
Impact on Spark ETL: Faster Parquet metadata retrieval accelerates LLM data preparation.
Optimizing S3 for High-Performance AI Workloads
S3 storage performance directly impacts AI training, inference, and data processing efficiency. Whether it’s checkpointing a multi-terabyte model, retrieving datasets for fine-tuning, or scanning Parquet files for Spark ETL, tuning S3 for AI workloads is essential.
Want to run your own S3 benchmarking? Check out the VAST Data Labs on Cosmos, where you can access a real VAST Management System (VMS) and explore S3 in a hands-on manner.
Appendix
Here’s a simple shell script that can be used to launch the tests above. Note that it requires ClusterShell (aka clush) to be installed on your hosts.
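One possible shape for such a launcher (everything here, from the clush node set to the endpoint, bucket, and credentials, is a placeholder; the elbencho flags mirror the hedged sketches earlier in the post):

#!/usr/bin/env bash
# Sketch of a launcher for the tests above. Requires ClusterShell (clush).
set -euo pipefail

NODES="client[01-08]"                  # clush node set running elbencho
HOSTSFILE=hosts.txt                    # same nodes, one hostname per line
S3ARGS=(--s3endpoints http://s3gw.example.com --s3key ACCESS_KEY --s3secret SECRET_KEY)
BUCKET=testbucket
LISTBUCKET=listbucket                  # separate bucket for the small-object listing dataset

# Start the elbencho service on all clients
clush -w "$NODES" elbencho --service

# 1. Large object writes and reads (non-multipart: block size == object size)
elbencho --hostsfile "$HOSTSFILE" "${S3ARGS[@]}" -d -w -t 32 -n 1 -N 16 -s 256m -b 256m "$BUCKET"
elbencho --hostsfile "$HOSTSFILE" "${S3ARGS[@]}" -r -t 32 -n 1 -N 16 -s 256m -b 256m "$BUCKET"

# 2. Multipart uploads (block size < object size => 16 MB parts)
elbencho --hostsfile "$HOSTSFILE" "${S3ARGS[@]}" -w -t 32 -n 1 -N 8 -s 1g -b 16m "$BUCKET"

# 3. ListObjects benchmark: populate with many small objects, then list
#    (--s3listobj / --s3listobjpar are assumed flag names; confirm with elbencho --help)
elbencho --hostsfile "$HOSTSFILE" "${S3ARGS[@]}" -d -w -t 32 -n 1 -N 100000 -s 1k -b 1k "$LISTBUCKET"
elbencho --hostsfile "$HOSTSFILE" "${S3ARGS[@]}" -t 32 -n 1 -N 100000 --s3listobjpar --s3listobj 100000 "$LISTBUCKET"

# Shut the elbencho services down
elbencho --hostsfile "$HOSTSFILE" --quit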
