Can RTX 3090 Run FP8? Quantization Limits Explained

Mar 15, 2026

1. The Problem

I was researching whether to buy a used RTX 3090 for running local LLMs. With 24GB of VRAM at around $700-800, it seemed like a great deal. But then I saw discussions about FP8 quantization - a newer format that’s supposed to be more efficient for inference.

The question kept coming up: Can the RTX 3090 run FP8 quantization?

I spent hours digging through forums and documentation, and the answer isn’t straightforward. Here’s what I found.

2. What I Tried

First, I mapped out the landscape. Modern LLM quantization formats include:

Format	Description	GPU Requirement
FP8	8-bit floating point	Native on Ada Lovelace (RTX 40 series) and Hopper (H100, H200) or newer
NVFP4	4-bit floating point	Blackwell architecture (B100, B200)
GPTQ	4-bit post-training weight quantization	Most NVIDIA GPUs
GGUF	llama.cpp file format, CPU/GPU flexible	Universal
AWQ	Activation-aware 4-bit weight quantization	Most NVIDIA GPUs

The RTX 3090 uses NVIDIA’s Ampere architecture, which predates the hardware-level FP8 support that arrived with Hopper and Ada Lovelace. So the short answer is: no native FP8 support.
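To make the architecture cutoff concrete, here is a minimal sketch that classifies a card by its CUDA compute capability. The thresholds (8.9 and up for native FP8, 8.0 and up for weight-only FP8) reflect my reading of the vLLM documentation and are an assumption, not an official lookup table. The function name `fp8_support` is my own.

```python
def fp8_support(major: int, minor: int) -> str:
    """Classify FP8 support from a CUDA compute capability (major, minor)."""
    cc = (major, minor)
    if cc >= (8, 9):   # Ada Lovelace (8.9), Hopper (9.0) and newer
        return "native"
    if cc >= (8, 0):   # Ampere: the RTX 3090 is 8.6, the A100 is 8.0
        return "weight-only"
    return "unsupported"

# The RTX 3090 reports compute capability 8.6.
print(fp8_support(8, 6))  # prints "weight-only"
```

If PyTorch is installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair for the current GPU, so you can pass its result straight in.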

But then I found conflicting reports:

"It cannot run FP8 quants, nor nvfp4, not fp8 cache. It actually can,
but it requires emulation software called marlin, that it's actually
super fast, but it is not compatible with all models"

"I run the 27b on dual 3090s in FP8 with tensor parallelism using vllm
and the speed is great"

Wait, so it can run FP8? But with emulation? Let me dig deeper.

3. The Real Answer

Here’s the nuanced truth I discovered:

FP8 on RTX 3090: Partial Support via Marlin Emulation

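The RTX 3090 can run FP8 models through Marlin, a mixed-precision GPU kernel that vLLM falls back to on cards without FP8 tensor cores. As far as I can tell from the vLLM documentation, this is weight-only FP8: the weights are stored in 8 bits and expanded to 16-bit for the actual math, so you get the memory savings but not FP8 compute. Reports indicate it’s “actually super fast” when it works. The catch? Compatibility is spotty: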

"With the 3090 you will spend hours hunting for a quant that works,
or what combination of vllm/sglang works, or falling back to
llama.cpp that always works but it's slower"
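To make “emulation” less abstract, here is a small sketch of what weight-only FP8 means in practice: each weight sits in memory as one byte and gets expanded to a wider float before any math happens. This is my own illustration of the E4M3 bit layout (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7), not Marlin’s actual code. Marlin does this on the GPU and also applies scale factors, which I’ve left out.

```python
import math

def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan  # the only NaN pattern; E4M3 has no infinities
    if exponent == 0:
        return sign * (mantissa / 8) * 2.0 ** -6  # subnormal values
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)

# Three stored weights, one byte each, expanded before use.
stored = bytes([0x38, 0xC0, 0x7E])
print([decode_e4m3(b) for b in stored])  # prints [1.0, -2.0, 448.0]
```

The last value, 448, is the largest number E4M3 can represent, which is why real FP8 checkpoints ship with scale factors alongside the weights.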

NVFP4 on RTX 3090: No Support

This one is definitive. NVFP4 requires Blackwell architecture hardware support. No workaround exists.

4. What Actually Works

If you have or plan to get an RTX 3090, here are the quantization formats that work reliably:

GPTQ - Battle-Tested and Reliable

GPTQ works reliably on the RTX 3090. Users report running large models on multi-card setups:

"4x 3090s running Qwen 3.5 122B GPTQ"

To use GPTQ:

# Serve a GPTQ model with vLLM (OpenAI-compatible API, port 8000 by default)
vllm serve ./gptq-model \
    --quantization gptq

GGUF via llama.cpp - Universal Fallback

When nothing else works, llama.cpp with a GGUF file is the dependable fallback:

# -ngl 99 offloads as many layers as possible to the GPU
./llama-cli -m model.gguf -p "prompt" -n 512 -ngl 99

Users report “Q4_K_M, Q4_K_S work well” but acknowledge it’s “slower” than hardware-optimized formats.

FP8 via vLLM + Tensor Parallelism

FP8 checkpoints can load in vLLM on 3090s, and tensor parallelism splits a model that is too large for one card across several:

vllm serve ./fp8-model \
    --quantization fp8 \
    --tensor-parallel-size 2

But be prepared for troubleshooting. One user reported running “the 27b on dual 3090s in FP8 with tensor parallelism using vllm” and said that “the speed is great.”
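Since I kept retyping near-identical launch commands while switching formats, here is a small helper sketch that builds the argument list for me. `vllm serve`, `--quantization` and `--tensor-parallel-size` are real vLLM options; the function name `vllm_serve_args` and the model path are placeholders of my own.

```python
import shlex

def vllm_serve_args(model, quantization=None, tensor_parallel_size=1):
    """Build the argv list for a `vllm serve` launch (nothing is executed here)."""
    args = ["vllm", "serve", model]
    if quantization:
        args += ["--quantization", quantization]
    if tensor_parallel_size > 1:
        args += ["--tensor-parallel-size", str(tensor_parallel_size)]
    return args

# Print the command for a dual-3090 FP8 attempt.
print(shlex.join(vllm_serve_args("./fp8-model", "fp8", 2)))
```

The list can be handed to `subprocess.run(...)` as-is, which avoids shell quoting problems when a model path contains spaces.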

5. Why This Matters

The quantization format you choose affects:

Memory Efficiency: How large a model you can fit in 24GB VRAM
Inference Speed: Tokens per second throughput
Model Quality: How much accuracy is lost to quantization
Setup Time: Hours debugging vs. minutes running

For context, here’s what different quantization levels mean for a 70B model:

FP16: ~140GB (won't fit in single 3090)
FP8:  ~70GB  (won't fit, needs multi-GPU)
GPTQ-4bit: ~40GB (won't fit, needs at least two 3090s)
GGUF-Q4: ~40GB (can offload to system RAM)
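These numbers come from simple arithmetic: parameter count times bits per weight, divided by 8. Here is a rough sketch of that calculation. It counts weights only, so treat the GPU counts as a lower bound. Real 4-bit files come out closer to 40GB because of scale metadata and layers kept at higher precision, and the KV cache needs room on top. The function names are my own.

```python
import math

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: billions of params * bits / 8."""
    return params_billion * bits_per_weight / 8

def min_gpus(size_gb: float, vram_gb: float = 24.0) -> int:
    """Lower bound on card count, ignoring KV cache and runtime overhead."""
    return math.ceil(size_gb / vram_gb)

for name, bits in [("FP16", 16), ("FP8", 8), ("4-bit", 4)]:
    size = weight_gb(70, bits)
    print(f"{name}: {size:.0f} GB of weights, at least {min_gpus(size)} x 24 GB cards")
```

For a 70B model this gives 140, 70 and 35 GB, or at least 6, 3 and 2 cards. In practice, tensor parallelism usually wants a GPU count that divides the model’s attention heads evenly, so 3 tends to become 4.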

6. Common Mistakes to Avoid

Based on my research, here are pitfalls I learned to avoid:

Mistake 1: Assuming FP8 will “just work”

It won’t. You’ll need Marlin emulation and patience.

Mistake 2: Not testing before committing

Different models have different quantization compatibility. Test before you commit to a workflow.

Mistake 3: Overlooking llama.cpp

It’s the “always works” option. Slower? Yes. Reliable? Absolutely.

Mistake 4: Ignoring tensor parallelism

If you have multiple GPUs, tensor parallelism lets you run models that don’t fit on a single card. It splits the weights across cards; it doesn’t change which formats the hardware supports.

7. Compatibility Matrix

Here’s a quick reference for RTX 3090 quantization support:

Format	Support Level	Notes
FP8	Partial (Marlin)	Requires emulation, model-dependent
NVFP4	No	Hardware limitation, no workaround
GPTQ	Full	Battle-tested, works reliably
GGUF Q4_K_M	Full	llama.cpp, slower but universal
GGUF Q4_K_S	Full	llama.cpp, good balance
AWQ	Full	Supported via vLLM

8. Recommended Workflow

If you’re setting up an RTX 3090 for LLM inference, I recommend this approach:

Step 1: Start with GPTQ (most reliable)
Step 2: Test GGUF via llama.cpp (universal fallback)
Step 3: If you need FP8, try vLLM + Marlin
Step 4: Be prepared to fall back if FP8 fails
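The steps above can be sketched as a simple fallback chain. This is only an illustration of the decision order: `works` is a hypothetical stand-in for whatever check you use (for example, “did the server start and answer a test prompt?”), and `pick_format` is a name I made up.

```python
from typing import Callable, Optional

def pick_format(need_fp8: bool, works: Callable[[str], bool]) -> Optional[str]:
    """Return the first format that loads, following the workflow's order."""
    order = ["gptq", "gguf"]
    if need_fp8:
        order.insert(0, "fp8")  # try FP8 first only when it is really required
    for fmt in order:
        if works(fmt):
            return fmt
    return None  # nothing loaded; time to try a different model or quant

# Stand-in check: pretend FP8 fails for this model but GPTQ loads.
print(pick_format(True, lambda fmt: fmt != "fp8"))  # prints "gptq"
```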

9. Summary

The RTX 3090 is still a capable GPU for LLM inference, but FP8 support is limited:

Native FP8: No (Ampere lacks hardware support)
Emulated FP8: Yes, via Marlin (hit-or-miss compatibility)
NVFP4: No (requires Blackwell architecture)
GPTQ/AWQ: Yes, fully supported
GGUF: Yes, universally compatible

If you’re deciding between a used RTX 3090 and a newer GPU, ask yourself:

Need reliable FP8? Consider RTX 4090 or newer
Just want to run models? RTX 3090 with GPTQ works great
Willing to troubleshoot? RTX 3090 FP8 emulation might work for you

For me, the RTX 3090 at $700-800 with GPTQ support still represents solid value - as long as you know its limitations upfront.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Reddit r/LocalLLaMA RTX 3090 FP8 Discussion
👨‍💻 vLLM Documentation
👨‍💻 llama.cpp GitHub
👨‍💻 AutoGPTQ GitHub
