Why S3 is Critical for AI and Machine Learning Workloads
The rise of large-scale AI and machine learning workloads has placed unprecedented demands on storage infrastructure. From handling massive datasets to storing frequent model checkpoints, object storage solutions like Amazon S3 (or S3-compatible alternatives) have become essential. At the same time, AI workloads, particularly deep learning training, model inference, and large-scale data processing with Apache Spark, demand high-throughput writes, low-latency reads, and efficient metadata operations.
One of the key challenges in training LLMs and multi-modal AI models is the need for fast and scalable object storage. S3 plays a crucial role in:
Checkpointing AI models to prevent loss during training
Storing and retrieving large-scale datasets (e.g., CommonCrawl, image-text datasets for multi-modal models)
Spark-based ETL pipelines, which frequently scan millions of Parquet files before processing
Understanding how S3 performs under these workloads is critical. This benchmarking study, which leverages elbencho (created by VAST Field CTO Sven Breuner), provides a comprehensive evaluation of read, write, and metadata operations to guide performance optimization strategies.
Benchmarking Methodology
The benchmarking process focused on evaluating S3 performance across different AI workload scenarios, including:
Large object writes and reads (simulating AI model checkpointing).
Multipart uploads for handling massive datasets efficiently.
ListObjects performance, which is essential for Spark workloads scanning Parquet files.
Test Setup
Benchmarking Tool: elbencho
S3-Compatible Storage: VAST Data cluster (any S3-compatible object store will do, but VAST's is going to be faster)
Networking: 100 Gbps connectivity per node
Hosts: Multiple clients running elbencho in a distributed setup
To ensure high throughput and scalability, we ran elbencho in a master-client distributed mode, spreading the load across multiple hosts.
Doing so involves creating a hosts file (one hostname per line) and launching the elbencho daemon/service on every host, as sketched below.
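A rough sketch of that setup (the host names and paths below are placeholders, not the actual benchmark nodes):

# hosts.txt: one elbencho service host per line, e.g.
#   client01
#   client02
#   ...

# Start the elbencho service on every client (clush node set is illustrative)
clush -w client[01-08] elbencho --service

# Benchmark commands on the coordinating host then add --hostsfile hosts.txt
# (or --hosts with a comma-separated list) to drive all clients at once.

# When finished, shut the services down
elbencho --hostsfile hosts.txt --quit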
Benchmarking Results and Commands
1. Large Object Writes & Reads (Non-Multipart)
Checkpointing workloads require high-throughput writes and reads, often with multi-gigabyte files. We benchmarked 256 MB objects written across 8 clients, each running 32 threads.
Write Test (Large PUTs)
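A hedged sketch of such a write run (the endpoint, credentials, bucket name, and object counts are placeholders; -t, -s, and -b reflect the 32 threads per client and 256 MB objects described above, with block size equal to object size so each object is a single PUT):

# Large-object write test: 256 MB objects, one PUT per object (-b == -s)
# -d creates the bucket if it does not already exist
elbencho --hostsfile hosts.txt \
  --s3endpoints http://s3gw.example.com --s3key ACCESS_KEY --s3secret SECRET_KEY \
  -d -w -t 32 -n 1 -N 16 -s 256m -b 256m \
  testbucket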

Sample Output

Read Test (Large GETs)
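The matching read pass (same placeholders and object layout as the write sketch above) swaps -w for -r so the clients read back the objects they created:

# Large-object read test: GET the 256 MB objects written above
elbencho --hostsfile hosts.txt \
  --s3endpoints http://s3gw.example.com --s3key ACCESS_KEY --s3secret SECRET_KEY \
  -r -t 32 -n 1 -N 16 -s 256m -b 256m \
  testbucket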

Sample Output

Peak read throughput: ~11,220 MB/s (~100 Gbps)
Read latency: ~2.2 ms per object
2. Multipart Upload Performance
For large AI checkpoint files, multipart uploads enhance performance by breaking objects into smaller parts. This improves throughput and reliability.
Write Test with Multipart Upload
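A sketch of a multipart write run, assuming elbencho's S3 behavior of switching to multipart uploads when the block size (-b) is smaller than the object size (-s). The 16 MB part size matches the results below; the 1 GiB object size and counts are illustrative:

# Multipart upload test: 1 GiB objects uploaded in 16 MB parts (-b < -s)
elbencho --hostsfile hosts.txt \
  --s3endpoints http://s3gw.example.com --s3key ACCESS_KEY --s3secret SECRET_KEY \
  -w -t 32 -n 1 -N 8 -s 1g -b 16m \
  testbucket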

Results
Using a 16 MB part size reduced API overhead and improved efficiency.
Higher reliability for large AI model checkpoints.
3. ListObjects Performance: Critical for Spark and Parquet Workloads
In Spark-based ETL pipelines, object listing performance is crucial. Spark scans millions of Parquet files to prepare text data (e.g., CommonCrawl) for LLM training.
ListObjects Benchmark
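A sketch of the listing run, assuming a bucket already populated with a large number of small objects (for example, from an earlier -w run with the same -t/-n/-N layout). The --s3listobj and --s3listobjpar flags below are assumptions about elbencho's S3 listing options; verify them against elbencho --help for your version:

# ListObjects benchmark: threads list objects in parallel, one prefix per thread
# (--s3listobj / --s3listobjpar are assumed flag names; confirm with elbencho --help)
elbencho --hostsfile hosts.txt \
  --s3endpoints http://s3gw.example.com --s3key ACCESS_KEY --s3secret SECRET_KEY \
  -t 32 -n 1 -N 100000 --s3listobjpar --s3listobj 100000 \
  testbucket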

Sample Output

Peak listing performance: 3.4M files/sec
Impact on Spark ETL: Faster Parquet metadata retrieval accelerates LLM data preparation.
Optimizing S3 for High-Performance AI Workloads
S3 storage performance directly impacts AI training, inference, and data processing efficiency. Whether it’s checkpointing a multi-terabyte model, retrieving datasets for fine-tuning, or scanning Parquet files for Spark ETL, tuning S3 for AI workloads is essential.
Want to run your own S3 benchmarking? Check out the VAST Data Labs on Cosmos, where you can access a real VAST Management System (VMS) and explore S3 in a hands-on manner.
Appendix
Here’s a simple shell script that can be used to launch the tests above. Note that it requires ClusterShell (aka clush) to be installed on your hosts.
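One possible shape for such a launcher (everything here, from the clush node set to the endpoint, bucket, and credentials, is a placeholder; the elbencho flags mirror the hedged sketches earlier in the post):

#!/usr/bin/env bash
# Sketch of a launcher for the tests above. Requires ClusterShell (clush).
set -euo pipefail

NODES="client[01-08]"                  # clush node set running elbencho
HOSTSFILE=hosts.txt                    # same nodes, one hostname per line
S3ARGS=(--s3endpoints http://s3gw.example.com --s3key ACCESS_KEY --s3secret SECRET_KEY)
BUCKET=testbucket
LISTBUCKET=listbucket                  # separate bucket for the small-object listing dataset

# Start the elbencho service on all clients
clush -w "$NODES" elbencho --service

# 1. Large object writes and reads (non-multipart: block size == object size)
elbencho --hostsfile "$HOSTSFILE" "${S3ARGS[@]}" -d -w -t 32 -n 1 -N 16 -s 256m -b 256m "$BUCKET"
elbencho --hostsfile "$HOSTSFILE" "${S3ARGS[@]}" -r -t 32 -n 1 -N 16 -s 256m -b 256m "$BUCKET"

# 2. Multipart uploads (block size < object size => 16 MB parts)
elbencho --hostsfile "$HOSTSFILE" "${S3ARGS[@]}" -w -t 32 -n 1 -N 8 -s 1g -b 16m "$BUCKET"

# 3. ListObjects benchmark: populate with many small objects, then list
#    (--s3listobj / --s3listobjpar are assumed flag names; confirm with elbencho --help)
elbencho --hostsfile "$HOSTSFILE" "${S3ARGS[@]}" -d -w -t 32 -n 1 -N 100000 -s 1k -b 1k "$LISTBUCKET"
elbencho --hostsfile "$HOSTSFILE" "${S3ARGS[@]}" -t 32 -n 1 -N 100000 --s3listobjpar --s3listobj 100000 "$LISTBUCKET"

# Shut the elbencho services down
elbencho --hostsfile "$HOSTSFILE" --quit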
