Storage That Powers AI
Beyond "Where Data Lives": Mapping Storage Solutions to Your AI Pipeline

Hi Inner Circle,
Welcome back to Day 3 of our AI infrastructure series!
I’ve got to admit—I usually prefer a weekly newsletter to a daily one, but now that we’re learning together, let’s keep the momentum going! :D
Today, we’re diving into STORAGE —
It’s no longer just “where data lives” — it’s how AI remembers, learns, and thinks faster than ever before.
In modern ML systems, storage architecture directly impacts speed, efficiency, and cost.
As a cloud engineer, understanding how to map the right storage type to each AI workload is critical.

We can categorize storage along two key dimensions:
→ Performance vs. Capacity Optimized: This refers to whether the storage is designed for speed (low latency, high throughput) or for storing vast amounts of data at a lower cost.
→ File vs. Object Protocol: This differentiates how data is accessed and managed. Most capacity-optimized storage systems are object-based, while high-performance storage systems for AI are predominantly file-based.
Storage Across the AI/ML Lifecycle
Raw Data Ingest
→ Stores large volumes of raw, unstructured data such as images, logs, or text.
→ Requires scalable, cost-effective storage that supports parallel ingestion.
→ Object storage is ideal here: it handles petabytes of data cost-effectively and supports parallel ingest, aligning with the capacity-optimized, object-protocol quadrant (though the best fit ultimately depends on your data format). A minimal parallel-upload sketch follows below.
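To make this concrete, here’s a minimal Python sketch of parallel ingest into object storage using boto3 and a thread pool. The bucket name, prefix, and file pattern are placeholders, and the same pattern applies with the Azure Blob Storage or Google Cloud Storage client libraries.

```python
# A minimal sketch, not a prescription: parallel ingest of raw files into object storage.
# The bucket name, prefix, and file pattern below are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import boto3

s3 = boto3.client("s3")
BUCKET = "my-raw-data-bucket"  # placeholder bucket name


def upload_one(path: Path) -> str:
    # Each worker uploads one file; object stores handle many concurrent writers well.
    key = f"raw/{path.name}"
    s3.upload_file(str(path), BUCKET, key)
    return key


files = list(Path("local_raw_data").glob("*.jpg"))
with ThreadPoolExecutor(max_workers=16) as pool:
    for key in pool.map(upload_one, files):
        print(f"uploaded s3://{BUCKET}/{key}")
```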
Data Preparation
→ Uses high-performance file storage for cleaning, labeling, and transforming data.
→ Needs frequent, low-latency reads/writes to enable fast iteration and processing.
→ High-performance file storage, often backed by performance-optimized flash media, provides the fast I/O needed at this stage, emphasizing the performance-optimized, file-protocol side (see the sketch below).
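As a rough illustration, here’s a small Python sketch of an iteration-heavy prep step running against a high-performance file share. The mount point and column name are hypothetical, and it assumes pandas with Parquet support (pyarrow) is installed.

```python
# A minimal sketch of an iterative prep step on a mounted high-performance file share.
# "/mnt/fileshare" is a hypothetical mount point (EFS, Azure Files, Filestore, etc.).
from pathlib import Path

import pandas as pd

SRC = Path("/mnt/fileshare/raw")
DST = Path("/mnt/fileshare/prepared")
DST.mkdir(parents=True, exist_ok=True)

for csv_path in SRC.glob("*.csv"):
    df = pd.read_csv(csv_path)
    df = df.dropna().drop_duplicates()      # simple cleaning passes
    df["label"] = df["label"].str.lower()   # assumes a hypothetical 'label' column
    # Parquet keeps the many re-reads during iteration fast.
    df.to_parquet(DST / f"{csv_path.stem}.parquet", index=False)
```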
Training
→ Uses high-performance file or in-memory storage to feed large datasets to accelerators.
→ Demands high-speed, parallel data access to keep GPU clusters fully utilized.
→ Fast, parallel reads are key to prevent bottlenecks in this compute-heavy phase, making performance-optimized solutions crucial.
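Here’s a hedged PyTorch sketch of what “keeping the GPUs fed” looks like in code: parallel reader workers pulling from a fast file-system mount so the accelerators never wait on storage. The dataset path is a placeholder, and torchvision is assumed only for brevity.

```python
# A hedged PyTorch sketch: parallel readers pulling from a fast file-system mount.
# "/mnt/fast-fs/train" is a hypothetical mount of a high-performance file system.
import torch
from torchvision import datasets, transforms

train_ds = datasets.ImageFolder(
    "/mnt/fast-fs/train",
    transform=transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()]),
)

train_loader = torch.utils.data.DataLoader(
    train_ds,
    batch_size=256,
    shuffle=True,
    num_workers=8,            # parallel reader processes hide storage latency
    pin_memory=True,          # speeds up host-to-GPU copies
    prefetch_factor=4,        # each worker keeps batches queued ahead of the GPU
    persistent_workers=True,  # avoid re-spawning workers every epoch
)

for images, labels in train_loader:
    images = images.cuda(non_blocking=True)
    # ... forward/backward pass would go here ...
    break
```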
Fine-Tuning
→ Uses high-performance file or in-memory storage for task-specific model updates.
→ Requires low latency and high throughput for compute-intensive workloads. The storage options mirror the training phase, focusing on fast access for iterative model updates.
Inference / Deployment
→ Relies on in-memory or local storage (CPU/GPU) to serve model predictions.
→ Prioritizes ultra-low latency for responsive, real-time user interactions, often employing the most performance-optimized local storage solutions directly attached to compute instances.
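A minimal sketch of that serving pattern, assuming the model artifact sits on instance-local NVMe: read it from disk once at startup, keep it in memory (or on the GPU), and answer every request from there. The file path and TorchScript format are assumptions, not a prescription.

```python
# A minimal serving sketch: read the model from local NVMe once, answer from memory after.
# The path and TorchScript format are assumptions for illustration.
import torch

_MODEL = None  # module-level cache so the disk read happens only once per process


def get_model():
    global _MODEL
    if _MODEL is None:
        _MODEL = torch.jit.load("/mnt/local-ssd/model.pt")  # one-time load at startup
        _MODEL.eval()
        if torch.cuda.is_available():
            _MODEL = _MODEL.cuda()
    return _MODEL


def predict(batch: torch.Tensor) -> torch.Tensor:
    model = get_model()
    if torch.cuda.is_available():
        batch = batch.cuda(non_blocking=True)
    with torch.inference_mode():
        return model(batch)
```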
Archiving
→ Uses object storage for historical or infrequently accessed data.
→ Optimized for long-term retention, cost-efficiency, and scalable capacity, often utilizing disk or tape-based capacity-optimized storage solutions.
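For example, on AWS you can express this as an S3 lifecycle rule that transitions old objects into archival storage classes. The sketch below uses boto3 with placeholder bucket and prefix names; Azure Blob Storage and Google Cloud Storage offer equivalent lifecycle policies.

```python
# A minimal sketch: an S3 lifecycle rule that moves old artifacts to archival storage classes.
# Bucket name, prefix, and day thresholds are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-ml-artifacts",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-training-runs",
                "Filter": {"Prefix": "training-runs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```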
Key Takeaways
Each stage of the AI lifecycle demands different storage solutions — whether object storage, high-performance file systems, or block storage. These choices depend on whether the storage is performance or capacity optimized, and whether it uses a file, object, or block protocol.
Performance matters: the faster you serve data, the more efficient your pipeline—and the less idle time wasted.
Cloud Provider Storage Options for AI Solutions
Here's a breakdown of common storage services by major cloud providers, relevant for various stages of an AI pipeline:
Object Storage
AWS: Amazon Simple Storage Service (S3) with various storage classes (Standard, Infrequent Access, Glacier, Glacier Deep Archive).
Azure: Azure Blob Storage with different access tiers (Hot, Cool, Archive).
GCP: Google Cloud Storage with various storage classes & Autoclass feature (Standard, Nearline, Coldline, Archive).
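If it helps, here’s a tiny boto3 sketch showing how the storage class can be chosen per object at write time. The bucket, key, and file names are placeholders; Azure access tiers and GCS storage classes follow the same idea.

```python
# A tiny sketch of choosing the storage class per object at write time (S3 shown).
# Bucket, key, and file names are placeholders.
import boto3

s3 = boto3.client("s3")

# Hot training data stays in Standard; an infrequently read export goes straight to IA.
with open("2024-eval.parquet", "rb") as f:
    s3.put_object(
        Bucket="my-datasets",
        Key="exports/2024-eval.parquet",
        Body=f,
        StorageClass="STANDARD_IA",
    )
```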
File Storage
AWS: Amazon Elastic File System (EFS) for scalable NFS file storage; the Amazon FSx family (FSx for Lustre for high-performance computing, FSx for Windows File Server for Windows-native shared file storage).
Azure: Azure Files for fully managed file shares (SMB/NFS); Azure NetApp Files for high-performance, enterprise-grade NFS and SMB file shares.
GCP: Cloud Filestore with Standard and Premium tiers for managed NFS file storage.
Block Storage
AWS: Amazon Elastic Block Store (EBS) with various volume types (gp2/gp3 general-purpose SSDs, io1/io2 provisioned IOPS SSDs, including io2 Block Express) attached to EC2 instances.
Azure: Azure Managed Disks (Standard HDD, Standard SSD, Premium SSD, Ultra Disks) attached to Azure VMs.
GCP: Persistent Disk (Standard, SSD, Extreme) and Hyperdisk (Balanced, Throughput, Extreme) for Compute Engine VMs.
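As a sketch of how this looks in practice, the boto3 call below provisions a gp3 volume with explicit IOPS and throughput for a training VM. The size, zone, and tags are placeholder choices, and Azure and GCP expose analogous disk APIs.

```python
# A minimal sketch: provisioning a gp3 EBS volume with explicit IOPS/throughput via boto3.
# Size, availability zone, and tag values are placeholders.
import boto3

ec2 = boto3.client("ec2")

volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=500,              # GiB of scratch space for a training VM
    VolumeType="gp3",
    Iops=6000,             # gp3 lets you set IOPS independently of size
    Throughput=500,        # MiB/s
    TagSpecifications=[{"ResourceType": "volume",
                        "Tags": [{"Key": "purpose", "Value": "training-scratch"}]}],
)
print(volume["VolumeId"])
```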
Caching Solutions Across Cloud Providers
AWS:
→ ElastiCache
→ FSx for Lustre
→ Local Instance Store SSDs
Azure:
→ Azure Cache for Redis
→ Managed Disks (Premium/Ultra)
→ Azure HPC Cache
Google Cloud Platform (GCP):
→ Memorystore for Redis/Memcached
→ Local SSDs
→ Anywhere Cache
→ gcsfuse
Choosing the right caching solution helps reduce latency, cut costs, and keep your AI/ML workloads running smoothly.
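To ground that, here’s a small redis-py sketch that caches inference results so repeat requests never touch the model or backing storage. The host, key scheme, and TTL are assumptions, and the same code works against ElastiCache, Azure Cache for Redis, or Memorystore.

```python
# A minimal sketch: caching inference results in Redis so repeat requests skip the model.
# Host, key format, and TTL below are placeholder choices.
import hashlib
import json

import redis

r = redis.Redis(host="my-redis-host", port=6379, decode_responses=True)


def cached_predict(features: dict, model_fn) -> dict:
    key = "pred:" + hashlib.sha256(json.dumps(features, sort_keys=True).encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)             # served from cache: no model call, no storage read
    result = model_fn(features)            # fall back to the model on a cache miss
    r.setex(key, 300, json.dumps(result))  # keep the result for 5 minutes
    return result
```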
That’s it for today! Tomorrow, we’ll dive into AI deployment and inferencing — where all the work meets PRODUCTION.
Stay tuned!
Start learning AI in 2025
Everyone talks about AI, but no one has the time to learn it. So, we found the easiest way to learn AI in as little time as possible: The Rundown AI.
It's a free AI newsletter that keeps you up-to-date on the latest AI news, and teaches you how to apply it in just 5 minutes a day.
Plus, complete the quiz after signing up and they’ll recommend the best AI tools, guides, and courses – tailored to your needs.