5 changes: 3 additions & 2 deletions docs.json
@@ -39,9 +39,10 @@
"overview",
"get-started",
"get-started/concepts",
"get-started/products",
"get-started/workflows",
"get-started/manage-accounts",
"get-started/api-keys",
"get-started/connect-to-runpod"
"get-started/api-keys"
]
},
{
115 changes: 97 additions & 18 deletions get-started/products.mdx
@@ -1,36 +1,115 @@
---
title: "Runpod product overview"
sidebarTitle: "Product overview"
description: "Explore Runpod's major offerings and find the right solution for your workload."
title: "Choose a product"
sidebarTitle: "Choose a product"
description: "Find the right compute solution for your AI/ML application."
---

Runpod offers cloud computing resources for AI and machine learning workloads. You can choose from instant GPUs for development, auto-scaling Serverless computing, pre-deployed AI models, or multi-node clusters for distributed training.
Runpod provides several compute options designed for different stages of the AI lifecycle, from exploration and development to production scaling. Choosing the right option depends on your requirements for scalability, persistence, and infrastructure management.

## [Serverless](/serverless/overview)
## Product overview

Serverless provides pay-per-second computing with automatic scaling for production AI workloads. You only pay for actual compute time when your code runs, with no idle costs, making Serverless ideal for variable workloads and cost-efficient production deployments.
Use this decision matrix to identify the best Runpod solution for your workload:

## [Pods](/pods/overview)
| If you want to... | Use... | Because it... |
| :--- | :--- | :--- |
| **Deploy a custom AI/ML application** that scales automatically with traffic. | [Serverless](/serverless/overview) | Handles GPU/CPU auto-scaling and charges only for active compute time. |
| **Develop, debug, or train models** interactively on a GPU/CPU. | [Pods](/pods/overview) | Gives you a persistent GPU/CPU environment with full terminal/SSH access, similar to a cloud VPS. |
| **Get instant access to popular models** (Qwen, Flux, SORA, Wan) with zero infrastructure overhead. | [Public Endpoints](/hub/public-endpoints) | Provides easy-to-integrate APIs for image, video, and text generation with usage-based pricing. |
| **Train massive models** across multiple GPU nodes. | [Instant Clusters](/instant-clusters) | Provides pre-configured high-bandwidth interconnects for distributed training workloads. |

Pods give you dedicated GPU or CPU instances for containerized workloads. Pods are billed by the minute and stay available as long as you keep them running, making them perfect for development, training, and workloads that need continuous access.
## Detailed breakdown

## [Public Endpoints](/hub/public-endpoints)
### [Serverless](/serverless/overview)

Public Endpoints provide instant API access to pre-deployed AI models for image, video, and text generation without any setup. You only pay for what you generate, making it easy to integrate AI into your applications without managing infrastructure.
Serverless lets you create custom AI APIs that scale with traffic. It abstracts away the underlying infrastructure, allowing you to define a worker (a Docker container that runs your code on a GPU or CPU) that spins up on demand to handle incoming API requests.

## [Instant Clusters](/instant-clusters)
**Key characteristics:**

Instant Clusters deliver fully managed multi-node compute clusters for large-scale distributed workloads. With high-speed networking between nodes, you can run multi-node training, fine-tune large language models, and handle other tasks that require multiple GPUs working in parallel.
* **Auto-scaling:** Scales from zero to hundreds of workers based on request volume.
* **Stateless:** Workers are ephemeral; they spin up, process a request, and spin down.
* **Billing:** Pay-per-second of compute time. No cost when idle.
* **Best for:** Production inference, sporadic workloads, and scalable microservices.
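
To make this concrete, here is a minimal sketch of a Serverless handler using the `runpod` Python SDK. The input field (`prompt`) and the echo logic are placeholders; your handler defines its own input schema and runs your actual model.

```python
import runpod

def handler(job):
    """Process a single request routed to this worker."""
    prompt = job["input"].get("prompt", "")
    # Replace this echo with your actual inference logic.
    return {"output": f"Processed: {prompt}"}

# Start the worker loop so Runpod can route endpoint requests to the handler.
runpod.serverless.start({"handler": handler})
```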

### [Pods](/pods/overview)

## Choosing the right option
Pods provide a persistent GPU/CPU computing environment to train and fine-tune models. When you deploy a Pod, you are renting a specific GPU/CPU instance that stays active until you stop or terminate it. This is equivalent to renting a virtual machine with a GPU/CPU attached.

Choose **Serverless** when you need auto-scaling for inference workloads with variable traffic. Pay-per-second billing minimizes costs, and automatic worker management handles unpredictable workloads and API services efficiently.
**Key characteristics:**

Choose **Pods** when you need full control for development and experimentation. They work best for training models, iterative development, and custom workflows that require persistent storage and long-running processes.
* **Persistent:** Your environment, installed packages, and running processes persist as long as the Pod is active.
* **Interactive:** Full access via SSH, JupyterLab, or VSCode Server.
* **Billing:** Pay-per-minute (or hourly) for the reserved time, regardless of usage.
* **Best for:** Model training, fine-tuning, debugging code, exploring datasets, and long-running background tasks that do not require auto-scaling.
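
As an illustration, here is a sketch that deploys and later stops a Pod with the `runpod` Python SDK. The image and GPU type ID are placeholder values, and parameter names may vary between SDK versions, so treat this as a starting point rather than a definitive recipe.

```python
import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"

# Rent a GPU instance; it stays active (and billed) until you stop or terminate it.
pod = runpod.create_pod(
    name="dev-workstation",
    image_name="runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04",  # placeholder image
    gpu_type_id="NVIDIA GeForce RTX 4090",  # placeholder GPU type
)
print(f"Pod {pod['id']} is starting")

# Stop the Pod when you finish working to pause compute billing.
runpod.stop_pod(pod["id"])
```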

Choose **Public Endpoints** when you want to quickly integrate AI capabilities without managing infrastructure. They're ideal for prototyping and production applications that use popular AI models with simple pay-per-use pricing.
### [Public Endpoints](/hub/public-endpoints)

Choose **Instant Clusters** when your workload requires multiple GPUs across multiple nodes. They provide the infrastructure for training or fine-tuning large language models and other distributed computing tasks with high-speed networking.
Public Endpoints are Runpod-managed Serverless endpoints hosting popular community models. They require zero configuration, so you can integrate AI capabilities into your application immediately. They're also a great way to get started with Runpod and experiment with AI without setting up any infrastructure of your own.

You can combine these products as well. For example, use Pods for development and experimentation, Serverless for production inference, and Instant Clusters for large-scale training runs.
**Key characteristics:**

* **Zero setup:** No Dockerfiles or infrastructure configuration required.
* **Standard APIs:** OpenAI-compatible inputs for LLMs; standard JSON inputs for image generation.
* **Billing:** Pay-per-token (text) or pay-per-generation (image/video).
* **Best for:** Rapid prototyping, applications using standard open-source models, and users who do not need custom model weights.
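
As a hedged illustration of the OpenAI-compatible interface, the sketch below calls an LLM endpoint with the `openai` Python client. The base URL and model name are placeholders; use the values listed for the specific Public Endpoint you want to call.

```python
from openai import OpenAI

# Placeholder base URL and model name; substitute the values shown for your endpoint.
client = OpenAI(
    api_key="YOUR_RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/<endpoint-id>/openai/v1",
)

response = client.chat.completions.create(
    model="<model-name>",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
)
print(response.choices[0].message.content)
```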

### [Instant Clusters](/instant-clusters)

Instant Clusters let you provision multiple GPU/CPU nodes networked together with high-speed interconnects (up to 3200 Gbps). They are ideal for training and fine-tuning large models across multiple GPUs.

**Key characteristics:**

* **Multi-node:** Orchestrated groups of 2 to 8+ nodes.
* **High performance:** Optimized for low-latency inter-node communication (NCCL).
* **Best for:** Distributed training (FSDP, DeepSpeed), fine-tuning large language models (70B+ parameters), and HPC simulations.
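
For context, a training script on a cluster typically initializes an NCCL process group before wrapping the model with FSDP or DeepSpeed. A minimal PyTorch sketch, assuming a launcher such as `torchrun` sets the usual `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables on every node:

```python
import os
import torch
import torch.distributed as dist

# The launcher sets RANK, LOCAL_RANK, and WORLD_SIZE for every process
# across all nodes in the cluster.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Each process now owns one GPU; wrap your model with FSDP or DeepSpeed here.
print(f"Rank {dist.get_rank()} of {dist.get_world_size()} ready on GPU {local_rank}")

dist.destroy_process_group()
```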

## Workflow examples

Here are some examples of how you can use Runpod's compute services to build your AI/ML application:

### Develop-to-deploy cycle

**Goal:** Build a custom AI application from scratch and ship it to production.

1. **Interactive development:** Deploy a single [Pod](/pods/overview) with a GPU to act as your cloud workstation. Connect via [VSCode](/pods/configuration/connect-to-ide) or [JupyterLab](/pods/connect-to-a-pod#jupyterlab-connection) to write code, load models from Hugging Face, install dependencies, and debug your inference logic in real time.
2. **Containerization:** Once your code is working, move your inference logic into a Serverless [handler function](/serverless/workers/handler-functions), then build a [Docker image](/serverless/workers/create-dockerfile) containing your application and dependencies, and [push it to a container registry](/serverless/workers/deploy).
3. **Production deployment:** Deploy the Docker image as a [Serverless endpoint](/serverless/overview). Start [sending requests](/serverless/endpoints/send-requests) to your application (see the sketch after these steps); it will automatically scale GPU workers up as needed and down to zero when idle.
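
A minimal request sketch using the synchronous `/runsync` route; the endpoint ID and input schema are placeholders that depend on your deployment:

```python
import requests

ENDPOINT_ID = "your-endpoint-id"  # shown in the Runpod console after deployment
API_KEY = "YOUR_RUNPOD_API_KEY"

# /runsync blocks until the worker returns a result (use /run for async jobs).
response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Hello from production!"}},
    timeout=120,
)
print(response.json())
```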

### Distributed training for an LLM

**Goal:** Fine-tune a massive LLM (70B+) and serve it immediately without moving data.

1. **Multi-node training:** You spin up an [Instant Cluster](/instant-clusters) with 16x H100 GPUs to fine-tune a Llama-3-70B model using FSDP or DeepSpeed.
2. **Unified storage:** Throughout training, checkpoints and the final model weights are saved directly to a [network volume](/storage/network-volumes) attached to the cluster.
3. **Instant serving:** You deploy a [vLLM Serverless worker](/serverless/vllm/overview) and mount that *same* network volume. The endpoint reads the model weights directly from storage, allowing you to serve your newly trained model via API minutes after training finishes.

### Startup MVP

**Goal:** Launch a GenAI avatar app quickly with minimal DevOps overhead.

1. **Prototype with Public Endpoints:** You validate your product idea using the Flux [Public Endpoint](/hub/public-endpoints) to generate images. This requires zero infrastructure setup; you simply pay per image generated.
2. **Scale with Serverless:** As you grow, you need a unique art style. You fine-tune a model and deploy it as a [Serverless endpoint](/serverless/overview). This allows your app to handle traffic spikes automatically while scaling costs down to zero during quiet hours.

### Interactive research loop

**Goal:** Experiment with new model architectures using large datasets.

1. **Explore on a Pod:** Spin up a single-GPU [Pod](/pods/overview) with JupyterLab enabled. Mount a [network volume](/storage/network-volumes) to hold your 2TB dataset.
2. **Iterate code:** Write and debug your training loop interactively in the Pod. If the process crashes, the Pod restarts quickly, and your data remains safe on the network volume.
3. **Scale up:** Once the code is stable, you don't need to move the data. You terminate the single Pod and spin up an [Instant Cluster](/instant-clusters) attached to that *same* network volume to run the full training job across multiple nodes.

### Batch processing job

**Goal:** Process 10,000 video files for a media company.

1. **Queue requests:** Your backend pushes 10,000 job payloads to a [Serverless endpoint](/serverless/overview) configured as an asynchronous queue (see the sketch after these steps).
2. **Auto-scale:** The endpoint detects the queue depth and automatically spins up 50 concurrent workers (e.g., L4 GPUs) to process the videos in parallel.
3. **Cost optimization:** As the queue drains, the workers scale down to zero automatically. You pay only for the exact GPU seconds used to process the videos, with no idle server costs.
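
A hedged sketch of the submission side using the asynchronous `/run` route; the endpoint ID, payload schema, and video URLs are placeholders:

```python
import requests

ENDPOINT_ID = "your-endpoint-id"
API_KEY = "YOUR_RUNPOD_API_KEY"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

video_urls = [f"https://example.com/videos/{i}.mp4" for i in range(10_000)]

job_ids = []
for url in video_urls:
    # /run queues the job and returns immediately; workers scale with queue depth.
    resp = requests.post(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/run",
        headers=HEADERS,
        json={"input": {"video_url": url}},
        timeout=30,
    )
    job_ids.append(resp.json()["id"])

# Poll /status/<job_id> later to collect results as workers finish.
print(f"Queued {len(job_ids)} jobs")
```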

### Enterprise fine-tuning factory

**Goal:** Regularly fine-tune models on new customer data automatically.

1. **Data ingestion:** Customer data is uploaded to a shared [network volume](/storage/network-volumes).
2. **Programmatic training:** A script uses the [Runpod API](/api-reference/pods/POST/pods) to spin up a fresh on-demand Pod (see the sketch after these steps).
3. **Execution:** The Pod mounts the volume, runs the training script, saves the new model weights back to the volume, and then [terminates itself](/pods/manage-pods#terminate-a-pod) via API call to stop billing immediately.
4. **Hot reload:** A separate [Serverless endpoint](/serverless/overview) is triggered to reload the new weights from the volume (or [update the cached model](/serverless/endpoints/model-caching)), making the new model available for inference immediately.
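
A minimal sketch of steps 2 and 3 with the `runpod` Python SDK. The image, GPU type, and network volume ID are placeholders, parameter names may differ between SDK versions, and the self-termination step assumes Runpod exposes the Pod's own ID through the `RUNPOD_POD_ID` environment variable.

```python
import runpod

runpod.api_key = "YOUR_RUNPOD_API_KEY"

# Step 2: spin up a fresh on-demand Pod that mounts the shared network volume.
pod = runpod.create_pod(
    name="nightly-finetune",
    image_name="yourorg/finetune-job:latest",    # placeholder training image
    gpu_type_id="NVIDIA A100 80GB PCIe",         # placeholder GPU type
    network_volume_id="your-network-volume-id",  # placeholder volume ID
)
print(f"Training Pod {pod['id']} launched")

# Step 3 (run inside the Pod at the end of the training script): terminate
# this Pod so billing stops immediately.
# import os
# runpod.terminate_pod(os.environ["RUNPOD_POD_ID"])
```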