The Compute Stack Behind AI Workloads
CPUs, GPUs, and TPUs—What Powers Preprocessing, Training, and Inference

Hi Inner Circle,
Let’s take a look at the compute muscle powering today’s AI models.
But before we dive in, it’s important to understand the workloads these compute types are built to handle.
The three crucial use cases driving both traditional machine learning and the generative AI revolution are:
Preprocessing
Training
Inference

Let’s walk through each one, then look at the compute options:
Preprocessing
Think of this as data cleanup and preparation.
Input: Raw data (text, images, audio, etc.)
Output: Formatted, clean, model-ready data
Examples:
Text: Convert raw sentences into tokens or embeddings (e.g., BERT tokenizer)
Images: Resize and normalize images for vision models like CLIP
Audio: Convert .wav files into spectrograms for models like Whisper
This step ensures that data is in the right shape and scale before training or inference.
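As a rough sketch of what this looks like in practice, here are a few lines of Python. It assumes the Hugging Face transformers, Pillow, and NumPy packages are installed, and the image path is just a placeholder:

```python
# Minimal preprocessing sketch (illustrative only).
from transformers import AutoTokenizer
from PIL import Image
import numpy as np

# Text: turn a raw sentence into token IDs a model can consume
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer("The compute stack behind AI workloads")
print(tokens["input_ids"])  # list of integer token IDs

# Image: resize and scale pixel values to [0, 1] before feeding a vision model
img = Image.open("example.jpg").resize((224, 224))  # "example.jpg" is a placeholder path
pixels = np.asarray(img, dtype=np.float32) / 255.0
print(pixels.shape)  # (224, 224, 3) for an RGB image
```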
Training
This is where the model learns from data.
Input: Preprocessed data + correct labels (for supervised learning)
Output: A trained model (with tuned weights and parameters)
Examples:
GPT was trained on internet-scale text data
Stable Diffusion was trained on text-image pairs
ResNet was trained on labeled image datasets like ImageNet
Training is compute-heavy and often takes days or weeks depending on model size.
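To make the loop concrete, here is a minimal supervised training sketch in PyTorch. The linear model and synthetic data are stand-ins, not a real workload:

```python
# Tiny supervised training loop on synthetic data (illustrative only).
import torch
from torch import nn

model = nn.Linear(10, 2)                      # stand-in for a real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(256, 10)                      # preprocessed features
y = torch.randint(0, 2, (256,))               # correct labels (supervised learning)

for epoch in range(5):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)               # forward pass: compare predictions to labels
    loss.backward()                           # backward pass: compute gradients
    optimizer.step()                          # update weights
    print(f"epoch {epoch}: loss={loss.item():.4f}")
```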
Inference
Using the trained model to make predictions.
Input: New, preprocessed data + trained model
Output: Prediction or generated output
Examples:
GPT gets a prompt and generates text
Midjourney takes a text prompt and outputs an image
A sentiment classifier takes a review and returns "positive" or "negative"
Inference should be fast, especially in real-time apps like chatbots or recommendation systems.
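The sentiment-classifier case above can be sketched in a few lines with the Hugging Face pipeline API (it downloads a small default model on first use):

```python
# Inference sketch: a sentiment classifier returning "POSITIVE" or "NEGATIVE".
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("This newsletter makes compute easy to follow."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```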
Compute Machine Types for GenAI Workloads
Understanding the hardware that powers AI models helps you optimize cost and performance:
CPUs (Central Processing Units)
Ideal for preprocessing and lightweight inference tasks
Commonly used alongside GPUs/TPUs for orchestration, data loading, and distributed training
Offer flexible memory handling and broad software compatibility
Example: Running a Flask API server that serves model predictions
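A minimal sketch of that CPU-side serving pattern, with a placeholder prediction standing in for a real model call:

```python
# Flask server that wraps a model prediction; the "prediction" here is a placeholder.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True) or {}
    # A real handler would load a trained model at startup and call its predict method here.
    return jsonify({"prediction": len(payload.get("text", ""))})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```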
GPUs (Graphics Processing Units)
Designed for parallel processing — ideal for deep learning
Power both training and inference in models like GPT, DALL·E, and Llama
Efficiently handle large matrix operations (e.g., transformer attention mechanisms)
Supported across all major ML frameworks (PyTorch, TensorFlow, JAX, etc.)
Enable custom CUDA kernels and offer fine-grained control over operations
Strong ecosystem support and flexibility for both research and production
Example: NVIDIA A100, H100, H200, B200, etc., used in most modern AI data centers
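A small PyTorch sketch of the kind of matrix math GPUs accelerate, falling back to CPU if no GPU is present:

```python
# Run a large matrix multiplication on a GPU when one is available.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Matrix multiplications like this dominate transformer attention and MLP layers
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
c = a @ b
print(c.device)  # cuda:0 on a GPU machine, cpu otherwise
```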
TPUs (Tensor Processing Units)
Google’s custom ASICs (Application-Specific Integrated Circuits) built specifically for AI workloads
Optimized for large matrix multiplications via a systolic array architecture (a design distinct from GPUs)
Tightly integrated with TensorFlow and JAX, using the XLA compiler for graph optimization
Use slices (interconnected groups of TPU chips carved out of a pod) to parallelize large-scale model training
Deliver high energy efficiency and excellent cost-performance at scale
Extremely efficient for training large LLMs across multiple nodes
Example: Gemini models are trained on TPU v4/v5 pods
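A minimal JAX sketch: because the XLA compiler handles the hardware-specific lowering, the same jitted function runs unchanged on CPU, GPU, or TPU backends (on a Cloud TPU VM, jax.devices() would list the TPU cores):

```python
# XLA-compiled matrix multiply; the backend (CPU/GPU/TPU) is picked up automatically.
import jax
import jax.numpy as jnp

@jax.jit
def matmul(a, b):
    return a @ b

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
a = jax.random.normal(k1, (1024, 1024))
b = jax.random.normal(k2, (1024, 1024))

print(matmul(a, b).shape)  # (1024, 1024)
print(jax.devices())       # TPU devices on a TPU VM, CPU/GPU devices elsewhere
```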
Cloud Compute Modes:
On-Demand: Flexible, pay-as-you-go model
Spot/Preemptible Instances: Deep discounts with the risk of interruption
Reserved/Committed Use: Lower rates in exchange for a long-term commitment (1–3 years)
Dedicated or Custom Instances: High-performance configurations tailored for specific hardware needs
How to Stay Updated
Follow hardware releases from NVIDIA, AMD, Intel, and Google Cloud
Watch benchmarks from MLPerf, Hugging Face, and independent researchers
Join communities like Papers with Code and ML Collective, and follow AI-focused circles on platforms like Instagram and Facebook
That’s the core of how models come to life — from raw data to predictions, powered by serious compute muscle.
Stay tuned—data pipelines and storage options coming tomorrow!