Can RTX 3090 Run FP8? Quantization Limits Explained
1. The Problem
I was researching whether to buy a used RTX 3090 for running local LLMs. With 24GB of VRAM at around $700-800, it seemed like a great deal. But then I saw discussions about FP8 quantization - a newer format that’s supposed to be more efficient for inference.
The question kept coming up: Can the RTX 3090 run FP8 quantization?
I spent hours digging through forums and documentation, and the answer isn’t straightforward. Here’s what I found.
2. What I Tried
First, I mapped out the landscape. Modern LLM quantization formats include:
| Format | Description | GPU Requirement |
|---|---|---|
| FP8 | 8-bit floating point | Ada Lovelace or Hopper and newer (RTX 4090, H100, H200) |
| NVFP4 | 4-bit floating point | Blackwell architecture (B100, B200) |
| GPTQ | 4-bit quantization | Most NVIDIA GPUs |
| GGUF | CPU/GPU flexible | Universal |
| AWQ | Activation-aware | Most NVIDIA GPUs |
The RTX 3090 uses NVIDIA’s Ampere architecture, which predates the hardware-level FP8 support that arrived with the Ada Lovelace and Hopper generations. So the short answer is: no native FP8 support.
But then I found conflicting reports:
"It cannot run FP8 quants, nor nvfp4, not fp8 cache. It actually can, but it requires emulation software called marlin, that it's actually super fast, but it is not compatible with all models"
"I run the 27b on dual 3090s in FP8 with tensor parallelism using vllm and the speed is great"
Wait, so it can run FP8? But with emulation? Let me dig deeper.
3. The Real Answer
Here’s the nuanced truth I discovered:
FP8 on RTX 3090: Partial Support via Marlin Emulation
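The RTX 3090 can run FP8 models through Marlin, which forum posts describe as software emulation. More precisely, Marlin is a set of mixed-precision GPU kernels: the weights stay stored in FP8 and are converted to 16-bit on the fly for the actual math, so you get FP8’s memory savings without native FP8 compute. Reports indicate it’s “actually super fast” when it works. The catch? Compatibility is spotty:
"With the 3090 you will spend hours hunting for a quant that works, or what combination of vllm/sglang works, or falling back to llama.cpp that always works but it's slower"
NVFP4 on RTX 3090: No Support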
This one is definitive. NVFP4 requires Blackwell architecture hardware support. No workaround exists.
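The hardware split described in this section can be sketched as a lookup on CUDA compute capability. This is a minimal sketch, not an official API: the function name `quant_support` is hypothetical, and the thresholds (8.0 for weight-only FP8 via Marlin-style kernels, 8.9 for native FP8, 10.0 for native NVFP4) are assumptions based on how the GPU generations are commonly numbered, so check them against your framework's documentation.

```python
from typing import Dict, Tuple

# Assumed compute-capability thresholds (not taken from the article):
#   (8, 0)  Ampere, e.g. RTX 3090 at (8, 6): FP8 weights only, via Marlin-style kernels
#   (8, 9)  Ada Lovelace, e.g. RTX 4090:     native FP8 tensor cores
#   (10, 0) Blackwell:                        native NVFP4
def quant_support(capability: Tuple[int, int]) -> Dict[str, str]:
    """Classify FP8 and NVFP4 support for a (major, minor) compute capability."""
    if capability >= (8, 9):
        fp8 = "native"
    elif capability >= (8, 0):
        fp8 = "weight-only (Marlin)"
    else:
        fp8 = "none"
    nvfp4 = "native" if capability >= (10, 0) else "none"
    return {"fp8": fp8, "nvfp4": nvfp4}


if __name__ == "__main__":
    for name, cap in [("RTX 3090", (8, 6)), ("RTX 4090", (8, 9)), ("H100", (9, 0))]:
        print(name, quant_support(cap))
```

If PyTorch is installed, `torch.cuda.get_device_capability()` returns this `(major, minor)` tuple for the current GPU, so its result can be passed straight in.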
4. What Actually Works
If you have or plan to get an RTX 3090, here are the quantization formats that work reliably:
GPTQ - Battle-Tested and Reliable
GPTQ works flawlessly on the RTX 3090. Multiple users confirmed running large models:
"4x 3090s running Qwen 3.5 122B GPTQ"
To use GPTQ:
```bash
# Fast inference with vLLM
python -m vllm.entrypoints.api_server \
  --model ./gptq-model \
  --quantization gptq
```
GGUF via llama.cpp - Universal Fallback
When nothing else works, llama.cpp with GGUF always works:
```bash
./llama-cli -m model.gguf -p "prompt" -n 512
```
Users report “Q4_K_M, Q4_K_S work well” but acknowledge it’s “slower” than hardware-optimized formats.
FP8 via vLLM + Tensor Parallelism
If you have multiple 3090s, FP8 can work:
```bash
python -m vllm.entrypoints.api_server \
  --model ./fp8-model \
  --quantization fp8 \
  --tensor-parallel-size 2
```
But be prepared for troubleshooting. One user successfully ran “27b on dual 3090s in FP8 with tensor parallelism using vllm” with “great speed.”
5. Why This Matters
The quantization format you choose affects:
- Memory Efficiency: How large a model you can fit in 24GB VRAM
- Inference Speed: Tokens per second throughput
- Model Quality: How much accuracy is lost to quantization
- Setup Time: Hours debugging vs. minutes running
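The memory-efficiency point can be made concrete with a back-of-envelope estimate: weight size is roughly parameter count times bits per weight, divided by 8. The sketch below assumes that simple formula; the helper names and the `overhead_gb` allowance for KV cache and activations are hypothetical placeholders, not measured values.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_on_one_card(
    params_billion: float,
    bits_per_weight: float,
    vram_gb: float = 24.0,     # RTX 3090
    overhead_gb: float = 2.0,  # assumed allowance for KV cache and activations
) -> bool:
    """Rough check: do the weights plus a fixed overhead fit in VRAM?"""
    return weight_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for label, bits in [("FP16", 16), ("FP8", 8), ("4-bit", 4)]:
        print(f"70B {label}: ~{weight_gb(70, bits):.0f} GB")
```

Real 4-bit files run somewhat larger than this estimate, because they also store quantization scales and usually keep a few tensors in higher precision.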
For context, here’s what different quantization levels mean for a 70B model:
- FP16: ~140GB (won't fit in a single 3090)
- FP8: ~70GB (won't fit, needs multi-GPU)
- GPTQ-4bit: ~40GB (needs multi-GPU or a lower-bit quant)
- GGUF-Q4: ~40GB (can offload to system RAM)
6. Common Mistakes to Avoid
Based on my research, here are pitfalls I learned to avoid:
Mistake 1: Assuming FP8 will “just work”
It won’t. You’ll need Marlin emulation and patience.
Mistake 2: Not testing before committing
Different models have different quantization compatibility. Test before you commit to a workflow.
Mistake 3: Overlooking llama.cpp
It’s the “always works” option. Slower? Yes. Reliable? Absolutely.
Mistake 4: Ignoring tensor parallelism
If you have multiple GPUs, tensor parallelism splits a model’s weights across the cards, so a model that is too large for one 24GB card in a given format can still run.
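A rough sketch of why the dual-3090 FP8 report quoted earlier is plausible: a 27B model at 8 bits per weight is about 27GB, which exceeds one 24GB card but halves to about 13.5GB per card across two. The helper names and the `reserve_gb` headroom figure below are assumptions for illustration only.

```python
import math


def per_gpu_weight_gb(params_billion: float, bits_per_weight: float, num_gpus: int) -> float:
    """Weights held by each GPU when the model is split evenly across num_gpus."""
    return params_billion * bits_per_weight / 8 / num_gpus


def min_gpus(
    params_billion: float,
    bits_per_weight: float,
    vram_gb: float = 24.0,    # RTX 3090
    reserve_gb: float = 4.0,  # assumed headroom per card for KV cache and activations
) -> int:
    """Smallest GPU count whose usable VRAM covers the weights."""
    usable_gb = vram_gb - reserve_gb
    total_gb = params_billion * bits_per_weight / 8
    return max(1, math.ceil(total_gb / usable_gb))


if __name__ == "__main__":
    print("27B FP8 needs at least", min_gpus(27, 8), "x 24GB cards")
    print("Per-card weights on 2 cards:", per_gpu_weight_gb(27, 8, 2), "GB")
```

Treat the result as a lower bound: frameworks typically also require the GPU count to divide the model's attention heads evenly, which in practice often means 2, 4, or 8 cards.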
7. Compatibility Matrix
Here’s a quick reference for RTX 3090 quantization support:
| Format | Support Level | Notes |
|---|---|---|
| FP8 | Partial (Marlin) | Requires emulation, model-dependent |
| NVFP4 | No | Hardware limitation, no workaround |
| GPTQ | Full | Battle-tested, works reliably |
| GGUF Q4_K_M | Full | llama.cpp, slower but universal |
| GGUF Q4_K_S | Full | llama.cpp, good balance |
| AWQ | Full | Supported via vLLM |
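The matrix above can be encoded as a small lookup that picks the first usable format for a given model. This is only a sketch: `SUPPORT`, `pick_format`, and the default preference order are hypothetical, with the order following the article's GPTQ-first, FP8-last advice and the position of AWQ being an assumption.

```python
from typing import Iterable, Optional, Sequence

# Support levels on the RTX 3090, copied from the compatibility matrix.
SUPPORT = {
    "fp8": "partial",  # Marlin, model-dependent
    "nvfp4": "none",   # hardware limitation
    "gptq": "full",
    "gguf": "full",
    "awq": "full",
}


def pick_format(
    available: Iterable[str],
    preference: Sequence[str] = ("gptq", "gguf", "awq", "fp8"),  # assumed ordering
) -> Optional[str]:
    """Return the first preferred format that the model ships in and the card can run."""
    available_set = {fmt.lower() for fmt in available}
    for fmt in preference:
        if fmt in available_set and SUPPORT.get(fmt, "none") != "none":
            return fmt
    return None


if __name__ == "__main__":
    print(pick_format(["fp8", "gguf"]))  # a model published only as FP8 and GGUF
    print(pick_format(["nvfp4"]))        # nothing usable on this card
```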
8. Recommended Workflow
If you’re setting up an RTX 3090 for LLM inference, I recommend this approach:
Step 1: Start with GPTQ (most reliable)
Step 2: Test GGUF via llama.cpp (universal fallback)
Step 3: If you need FP8, try vLLM + Marlin
Step 4: Be prepared to fall back if FP8 fails
9. Summary
The RTX 3090 is still a capable GPU for LLM inference, but FP8 support is limited:
- Native FP8: No (Ampere lacks hardware support)
- Emulated FP8: Yes, via Marlin (hit-or-miss compatibility)
- NVFP4: No (requires Blackwell architecture)
- GPTQ/AWQ: Yes, fully supported
- GGUF: Yes, universally compatible
If you’re deciding between a used RTX 3090 and a newer GPU, ask yourself:
- Need reliable FP8? Consider RTX 4090 or newer
- Just want to run models? RTX 3090 with GPTQ works great
- Willing to troubleshoot? RTX 3090 FP8 emulation might work for you
For me, the RTX 3090 at $700-800 with GPTQ support still represents solid value - as long as you know its limitations upfront.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Reddit r/LocalLLaMA RTX 3090 FP8 Discussion
- 👨💻 vLLM Documentation
- 👨💻 llama.cpp GitHub
- 👨💻 AutoGPTQ GitHub