Strategy 1: Switch to a Community Cloud (Save 60–80%)
The biggest lever most engineers overlook: hyperscalers (AWS, GCP, Azure) cost 3–5× more than community GPU clouds for the same GPU type. This is the highest-impact change you can make.
| GPU | AWS/GCP | Community Cloud | Savings |
|---|---|---|---|
| A100 40GB | $2.48/h (GCP) | $1.10/h (Lambda) | 55% |
| A100 80GB | $3.67/h (GCP) | $1.50/h (Lambda) | 59% |
| H100 SXM | $8.10/h (GCP) | $2.49/h (RunPod) | 69% |
| RTX 4090 | Not available | $0.35/h (RunPod) | — |
Best community clouds: RunPod (great UI, 100+ GPU types), Lambda Labs (reliable SSH access), and Vast.ai (absolute cheapest via marketplace).
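The savings column above is plain rate arithmetic, so you can run the same comparison for any pair of providers. A minimal sketch (the function name is my own; rates come from the table):

```python
def saving_pct(hyperscaler_rate: float, community_rate: float) -> int:
    """Percent saved by switching clouds, truncated as in the table above."""
    return int((1 - community_rate / hyperscaler_rate) * 100)

print(saving_pct(2.48, 1.10))  # 55 -> A100 40GB, GCP vs Lambda
print(saving_pct(3.67, 1.50))  # 59 -> A100 80GB, GCP vs Lambda
print(saving_pct(8.10, 2.49))  # 69 -> H100 SXM, GCP vs RunPod
```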
Strategy 2: Use Spot/Interruptible Instances (Save 40–70%)
Spot instances can be terminated when the host needs the GPU back. In exchange, you pay 40–70% less. They're perfect for:
- Batch training jobs with checkpointing (save state every 10–30 mins)
- Hyperparameter sweeps where any individual run is disposable
- Data preprocessing that can be restarted
- Evaluation runs that don't depend on session continuity
On Vast.ai, an H100 80GB interruptible costs $1.89/h vs $2.49/h on-demand — a 24% saving. On RunPod community cloud, the discount vs secure cloud is typically 30–50%.
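The save/resume pattern that makes preemption safe is framework-agnostic. A minimal sketch, assuming a JSON state file and a stand-in training step (both illustrative, not from any particular library):

```python
import json
import os

CKPT = "checkpoint.json"  # illustrative checkpoint path

def train(total_steps: int) -> int:
    # Resume from the last checkpoint if the spot instance was preempted.
    state = {"step": 0, "loss": None}
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            state = json.load(f)

    for step in range(state["step"], total_steps):
        state["loss"] = 1.0 / (step + 1)  # stand-in for a real training step
        state["step"] = step + 1
        if state["step"] % 200 == 0:      # mirrors save_steps=200
            with open(CKPT, "w") as f:
                json.dump(state, f)
    return state["step"]
```

If the instance is killed mid-run, relaunching the same script picks up from the last saved step instead of step 0.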
To survive interruptions, checkpoint regularly: for example, HuggingFace Trainer's `save_steps=200` or PyTorch Lightning's `ModelCheckpoint` callback. Resuming from a checkpoint adds under 5 minutes of overhead per restart.
Strategy 3: Right-Size Your GPU (Save 20–40%)
Renting an A100 80GB when you only use 30GB of VRAM is throwing money away. Match VRAM to your actual model requirements:
| Model size | Min VRAM (BF16) | Best GPU match | Approx. cost |
|---|---|---|---|
| 7B params | 16 GB | RTX 4090 (24GB) | $0.35/h (RunPod) |
| 13B params | 28 GB | A40 (48GB) | $0.39/h (RunPod) |
| 30B params | 60 GB | A100 80GB | $1.50/h (Lambda) |
| 70B params | 140 GB | 2× A100 80GB | $3.00/h (Lambda) |
| 70B params | 80 GB (w/ quant) | H100 SXM + FP8 | $2.49/h (RunPod) |
VRAM figures approximate what's needed just to hold the weights in BF16 (2 bytes per parameter) plus overhead; full fine-tuning with optimizer states needs substantially more, while LoRA/QLoRA reduces VRAM by 4–8×.
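The rule of thumb behind the table is simple bytes-per-parameter arithmetic. An illustrative helper (activations, optimizer state, and KV cache add on top, so treat it as a floor, not a planner):

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate VRAM (GB) just to hold the weights: BF16 = 2 bytes/param."""
    return params_billion * bytes_per_param

print(weight_vram_gb(7))        # 14.0 GB -> fits an RTX 4090 (24 GB)
print(weight_vram_gb(70))       # 140.0 GB -> needs 2x A100 80GB
print(weight_vram_gb(70, 0.5))  # 35.0 GB in 4-bit -> one 80 GB card
```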
Strategy 4: Use QLoRA for Fine-Tuning (Save 50–70% VRAM)
QLoRA (Quantized LoRA) lets you fine-tune a 70B parameter model on a single A100 80GB by loading the base model in 4-bit NF4 quantization. Without QLoRA, you'd need 140GB+ VRAM. With QLoRA, you can fine-tune Llama 3 70B on a $1.50/h A100 instead of a $4.30/h H100 cluster.
Quality loss versus full fine-tuning is minimal for instruction following and task adaptation. For very performance-sensitive use cases (e.g., code generation), compare outputs against a full fine-tune before committing.
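In code, QLoRA is mostly configuration. A sketch using the `transformers` and `peft` libraries (the model id, LoRA hyperparameters, and target modules are illustrative choices, not prescriptions):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit base model
    bnb_4bit_quant_type="nf4",              # NF4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",  # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the small adapters train
```

The frozen base sits in 4-bit while gradients flow only through the BF16 LoRA adapters, which is what keeps a 70B model inside 80 GB.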
Strategy 5: Reserved Instances for Predictable Workloads (Save 20–40%)
If you run GPU workloads consistently every day, reserved instances pay off fast. Lambda Labs reserved instances (1-year term) cost ~30% less than on-demand:
| GPU | On-demand | Reserved (1yr) | Monthly saving (8h/day) |
|---|---|---|---|
| A100 40GB | $1.10/h | ~$0.77/h | ~$80/month |
| A100 80GB | $1.50/h | ~$1.05/h | ~$110/month |
| H100 PCIe | $2.49/h | ~$1.74/h | ~$180/month |
Reserved instances make sense if you use >5 hours/day consistently. Below that, on-demand wins.
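The break-even math is worth checking against your own usage. A sketch reproducing the table's monthly-saving column (30-day month and daily hours are assumptions):

```python
def monthly_saving(on_demand: float, reserved: float, hours_per_day: float = 8) -> float:
    """Dollars saved per 30-day month by paying the reserved rate instead."""
    return (on_demand - reserved) * hours_per_day * 30

print(round(monthly_saving(1.10, 0.77)))  # ~79 -> the "~$80/month" row
print(round(monthly_saving(2.49, 1.74)))  # 180
```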
Strategy 6: Mixed Precision Training (Speed Up = Cost Down)
Training in BF16 (Brain Float 16) instead of FP32 is roughly 2× faster on A100/H100 GPUs with Tensor Cores. Half the training time = half the cloud bill. Enable it with a single flag in most frameworks:
```python
# HuggingFace Trainer
training_args = TrainingArguments(
    bf16=True,  # use BF16 instead of FP32
    tf32=True,  # also enable TF32 for matrix multiplications
    ...
)
```
On H100, FP8 training (supported via Transformer Engine) can give another 1.5–2× speedup for large models, bringing effective GPU cost down further.
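Because rental is billed by the hour, any throughput gain is an equal cost cut. A toy calculation using the H100 rate from Strategy 1 (run length and speedup factors are illustrative assumptions):

```python
def run_cost(rate_per_hour: float, fp32_hours: float, speedup: float = 1.0) -> float:
    """Cost of one training run when faster precision cuts wall-clock time."""
    return rate_per_hour * fp32_hours / speedup

print(run_cost(2.49, 10))       # ~$24.90 at FP32
print(run_cost(2.49, 10, 2.0))  # ~$12.45 in BF16 (~2x faster)
print(run_cost(2.49, 10, 3.0))  # ~$8.30 with FP8 on top (~1.5x more)
```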
Strategy 7: Optimize Data Loading (Free Speed Gains)
GPU utilization below 80% means you're paying for idle GPU time. Common culprits:
- Slow data loading: increase `num_workers` in DataLoader (try 4–8)
- Small batch sizes: use gradient accumulation to effectively train with large batches
- Sequential tokenization: pre-tokenize your dataset and cache to disk once
- No flash attention: install `flash-attn` for a 2–4× speedup on transformer attention
If GPU utilization is consistently below 70% (check with `nvidia-smi dmon`), you're effectively wasting 30%+ of your compute budget.
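Of those fixes, pre-tokenization is the easiest to sketch without a framework. An illustrative cache-to-disk pattern (the whitespace `tokenize` and the cache filename are toy stand-ins for a real tokenizer and dataset):

```python
import json
import os

CACHE = "tokens_cache.json"  # illustrative cache path

def tokenize(text: str) -> list[str]:
    return text.lower().split()  # toy stand-in for a real tokenizer

def load_tokenized(texts: list[str]) -> list[list[str]]:
    # Tokenize once, cache to disk, reuse on every later epoch or run.
    if os.path.exists(CACHE):
        with open(CACHE) as f:
            return json.load(f)
    tokens = [tokenize(t) for t in texts]
    with open(CACHE, "w") as f:
        json.dump(tokens, f)
    return tokens
```

The first call pays the tokenization cost; every call after that reads from disk, so the GPU never waits on a CPU-bound tokenizer.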
Quick Wins Summary
| Strategy | Potential Saving | Effort |
|---|---|---|
| Switch to community cloud | 60–80% | Low |
| Use spot/interruptible instances | 24–70% | Medium (add checkpointing) |
| Right-size your GPU | 20–40% | Low |
| Use QLoRA for fine-tuning | 50–70% VRAM | Low |
| Reserved instances (>5h/day) | 20–40% | Low |
| Mixed precision (BF16/FP8) | 30–50% time | Very low |
| Optimize data loading | 10–30% time | Medium |
Find the most cost-effective GPU for your workload
Use our GPU Finder to get a personalized recommendation in 30 seconds.