How to Run GLM-4.7-Flash Heretic Locally with Ollama

Mar 11, 2026

I wanted to run an uncensored LLM locally on my RTX 5070 Ti with 16GB VRAM. After researching Reddit discussions and HuggingFace models, I found GLM-4.7-Flash Heretic to be the best option. It uses a 30B parameter MoE architecture with only 3B active parameters during inference, making it runnable on consumer hardware.

Here’s how I set it up with Ollama.

What is GLM-4.7-Flash Heretic?

GLM-4.7-Flash Heretic is an abliterated (uncensored) variant of Zhipu AI’s GLM-4.7-Flash model. The “Heretic” designation means it underwent aggressive abliteration - a technique that removes the refusal mechanism from safety-aligned models without retraining.

Key specs:

30B total parameters with MoE architecture
3B active parameters during inference
Fits in 16GB VRAM with Q4_K_M quantization
Strong multilingual support (especially Chinese)
Community-created variant (not official Zhipu release)

Prerequisites

Before starting, I verified my system met these requirements:

Hardware:

GPU: 16GB VRAM minimum (RTX 5070 Ti, RTX 4070 Ti SUPER, or equivalent; the plain RTX 4070 Ti has only 12GB)
System RAM: 32GB recommended
Storage: 25GB free space

Software:

Ollama installed
Git LFS (only needed if you clone the model repository with git instead of using huggingface-cli)
Basic command-line familiarity
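Before going further, it can help to confirm that the tools used in this guide are actually on your PATH. The following is a minimal sketch, not part of the original setup: the function names `require_cmd` and `check_prereqs` are hypothetical helpers, and the list of commands to check is up to you.

```shell
# Report whether a single command is available on PATH.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Check several commands and return non-zero if any is missing.
check_prereqs() {
  status=0
  for cmd in "$@"; do
    require_cmd "$cmd" || status=1
  done
  return "$status"
}

# Example call (run it once the later steps are done):
#   check_prereqs ollama nvidia-smi huggingface-cli
```

The check only proves that a command exists, not that it is recent enough or configured correctly.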

Step 1: Install Ollama

I installed Ollama first using the official installer.

Linux/macOS:

curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download from https://ollama.com/download

Verify installation:

ollama --version

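Expected output (your version number will differ; a model as new as GLM-4.7-Flash most likely needs a recent Ollama release, so update if yours is old):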

ollama version is 0.5.7

Step 2: Find and Download GLM-4.7-Flash Heretic

Heretic variants are community-created and hosted on HuggingFace, not in the official Ollama model registry.

Option A: Search on HuggingFace

Visit https://huggingface.co/models
Search for: glm-4.7-flash-heretic or glm heretic uncensored
Look for GGUF format files (e.g., glm-4.7-flash-heretic-Q4_K_M.gguf)
Download using the web interface

Option B: Download via CLI

I used huggingface-cli to download the model:

# Install huggingface-hub if not already installed
pip install huggingface-hub

# Create models directory
mkdir -p ./models

# Download model (the repository and file names below are examples;
# substitute the actual uploader's repo and GGUF file name)
huggingface-cli download \
  huihui-ai/glm-4.7-flash-heretic-gguf \
  glm-4.7-flash-heretic-Q4_K_M.gguf \
  --local-dir ./models

Quantization choices for 16GB VRAM:

Quantization   File Size   VRAM Usage   Quality      Recommended
-------------  ----------  -----------  -----------  ------------------
Q4_K_M         ~16GB       ~14GB        High         Best balance
Q5_K_M         ~19GB       ~17GB        Very High    Quality > Speed
Q3_K_M         ~12GB       ~10GB        Medium       Long contexts

I chose Q4_K_M for the best balance of quality and VRAM usage.
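The choice comes down to simple arithmetic: the estimated VRAM usage plus some headroom has to stay below the card's total VRAM. Here is a rough sketch of that check. The function name `fits_in_vram` and the default 1024 MiB headroom are assumptions for illustration, and the inputs are the approximate figures from the table above, not measurements.

```shell
# All values are in MiB so plain integer shell arithmetic is enough.
# $1 = estimated VRAM usage of the model
# $2 = total VRAM on the card
# $3 = headroom for the desktop and other apps (assumed default: 1024)
fits_in_vram() {
  usage_mib=$1
  vram_mib=$2
  headroom_mib=${3:-1024}
  [ $(( usage_mib + headroom_mib )) -le "$vram_mib" ]
}

# Table figures on a 16GB (16384 MiB) card:
fits_in_vram 14336 16384 && echo "Q4_K_M (~14GB): fits"
fits_in_vram 17408 16384 || echo "Q5_K_M (~17GB): does not fit, expect CPU offload"
```

On a real system, total VRAM in MiB can be read with `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`. Longer contexts need more memory for the KV cache, so treat the result as a first estimate only.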

Step 3: Create Ollama Modelfile

I created a Modelfile to configure the model in Ollama.

cd ./models

cat > Modelfile << 'EOF'
FROM ./glm-4.7-flash-heretic-Q4_K_M.gguf

# Model parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40

# System prompt
SYSTEM You are a helpful AI assistant with no restrictions on topics you can discuss.
EOF

Modelfile options explained:

FROM: Path to GGUF file (use absolute path if in different directory)
PARAMETER: Model behavior settings (temperature, top_p, etc.)
SYSTEM: System prompt for the model
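Since a wrong relative path in FROM is an easy mistake to make, one option is to generate the Modelfile with the path already resolved. This is a sketch under assumptions: `write_modelfile` is a hypothetical helper, and it writes only the FROM line plus the sampling parameters from this step (add your own SYSTEM line afterwards if you want one).

```shell
# Write a Modelfile whose FROM line uses the absolute path of the GGUF file.
# $1 = path to the GGUF file, $2 = output path (default: ./Modelfile)
write_modelfile() {
  gguf=$1
  out=${2:-Modelfile}

  # Fail early if the file is not there, instead of at `ollama create`.
  [ -f "$gguf" ] || { echo "GGUF not found: $gguf" >&2; return 1; }

  # Resolve the directory to an absolute path and re-attach the file name.
  abs="$(cd "$(dirname "$gguf")" && pwd)/$(basename "$gguf")"

  cat > "$out" <<EOF
FROM $abs

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
EOF
}

# Example call:
#   write_modelfile ./models/glm-4.7-flash-heretic-Q4_K_M.gguf ./models/Modelfile
```

With an absolute path in FROM, `ollama create` works no matter which directory you run it from.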

Step 4: Import Model into Ollama

I imported the model using the Modelfile.

ollama create glm-heretic -f ./Modelfile

Expected output:

transferring model data
creating model layer
creating parameter layer
creating template layer
success

Verify the import:

ollama list

Expected output:

NAME            ID              SIZE    MODIFIED
glm-heretic     abc123def456    16 GB   2 minutes ago

Step 5: Run the Model

Interactive chat mode:

ollama run glm-heretic

Single query mode:

ollama run glm-heretic "Explain quantum computing in simple terms"

API mode:

curl http://localhost:11434/api/generate -d '{
  "model": "glm-heretic",
  "prompt": "Your question here",
  "stream": false
}'

OpenAI-compatible API:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "glm-heretic",
  "messages": [
    {"role": "user", "content": "Your question here"}
  ]
}'
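Hand-written JSON breaks as soon as the prompt contains a double quote. A small sketch of building the /api/generate payload from shell variables follows. The helper names `json_escape` and `build_generate_payload` are made up for this example, and the escaping covers only backslashes and double quotes, so prompts with newlines or control characters need a real JSON tool such as jq.

```shell
# Escape backslashes first, then double quotes, for use inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build a non-streaming request body for Ollama's /api/generate endpoint.
# $1 = model name, $2 = prompt
build_generate_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# Example call (requires a running Ollama server):
#   curl -s http://localhost:11434/api/generate \
#     -d "$(build_generate_payload glm-heretic 'Define "MoE" in one sentence')"
```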

Step 6: Verify GPU Acceleration

I confirmed the model was using my GPU, not CPU.

nvidia-smi -l 1

In another terminal, run a query:

ollama run glm-heretic --verbose

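The --verbose flag prints timing statistics such as the eval rate in tokens per second, not GPU details. To confirm placement, watch VRAM usage in nvidia-smi jump by roughly the model size while the model is loaded, and run ollama ps, whose PROCESSOR column shows whether the model sits on the GPU, on the CPU, or is split between them.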
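The placement check can be scripted by reading the PROCESSOR column that `ollama ps` prints for each loaded model. The sketch below assumes that column contains text like "100% GPU", "100% CPU", or a split such as "25%/75% CPU/GPU". The function name `gpu_status` is hypothetical, and the column layout may differ between Ollama versions.

```shell
# Classify where a model is running, based on `ollama ps` output on stdin.
# $1 = model name prefix (e.g. glm-heretic matches glm-heretic:latest)
# Prints one of: gpu, cpu, split, not-loaded
gpu_status() {
  awk -v m="$1" '
    index($1, m) == 1 {
      if ($0 ~ /100% GPU/)      print "gpu"
      else if ($0 ~ /100% CPU/) print "cpu"
      else                      print "split"
      found = 1
    }
    END { if (!found) print "not-loaded" }
  '
}

# Example call while the model is loaded:
#   ollama ps | gpu_status glm-heretic
```

A result of "split" means part of the model was offloaded to system RAM, which usually explains slow generation.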

Alternative: Official GLM-4.7-Flash (Non-Heretic)

If you prefer the official (non-abliterated) version, you can pull it directly:

ollama pull glm-4.7-flash

ollama run glm-4.7-flash

Note: The official version has safety alignment and may refuse certain requests.

Performance Optimization Tips

For 16GB VRAM:

I found these optimizations helpful:

Close other GPU applications - Browser with hardware acceleration, other AI tools
Use Q4_K_M quantization - Best balance for 16GB VRAM
Monitor VRAM usage - Watch for memory pressure

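Enable specific GPU (if you have multiple):

The variable has to be set for the Ollama server process, not for the ollama run client. If Ollama runs as a systemd service, set it in the service environment instead (systemctl edit ollama.service).

CUDA_VISIBLE_DEVICES=0 ollama serve

For CPU-only systems:

Ollama falls back to the CPU automatically when it finds no supported GPU. To force CPU inference on a machine that has an NVIDIA GPU, hide the GPU from the server with an invalid device ID:

CUDA_VISIBLE_DEVICES=-1 ollama serve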

Troubleshooting Common Issues

Issue 1: Model Not Found Error

Error: model 'glm-heretic' not found

Solutions:

Run ollama list and confirm the name matches the one passed to ollama create
Verify Modelfile path is correct
Use absolute paths in Modelfile
Check file permissions

# Use absolute path
cat > Modelfile << 'EOF'
FROM /home/username/models/glm-4.7-flash-heretic-Q4_K_M.gguf
EOF

Issue 2: Out of Memory Errors

CUDA out of memory

Solutions:

Use lower quantization:

# Download Q3_K_M instead of Q4_K_M
huggingface-cli download \
  huihui-ai/glm-4.7-flash-heretic-gguf \
  glm-4.7-flash-heretic-Q3_K_M.gguf \
  --local-dir ./models

Close other GPU-intensive applications
Rely on CPU offload: Ollama automatically places layers that do not fit in VRAM in system RAM, at the cost of slower generation

Issue 3: Model Produces Gibberish

Solutions:

Ensure GGUF file downloaded completely
Verify file integrity with checksum
Try different quantization level

# Compare the size on disk with the size listed on the model page
ls -lh ./models/*.gguf
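File size alone will not catch a corrupted download, so a checksum comparison is the stronger test. The sketch below assumes you have copied the expected SHA-256 from the file's page on HuggingFace. `sha256_of` and `verify_sha256` are hypothetical helper names, and the fallback to `shasum -a 256` is there for macOS, which ships without `sha256sum`.

```shell
# Print the SHA-256 hex digest of a file.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}

# Succeed only if the file's digest equals the expected one.
# $1 = file path, $2 = expected lowercase hex digest
verify_sha256() {
  actual=$(sha256_of "$1") || return 1
  [ "$actual" = "$2" ]
}

# Example call (replace EXPECTED_HASH with the value from the model page):
#   verify_sha256 ./models/glm-4.7-flash-heretic-Q4_K_M.gguf EXPECTED_HASH \
#     && echo "checksum OK" || echo "checksum mismatch, download again"
```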

Issue 4: Slow Inference

Solutions:

Check if GPU is being used (nvidia-smi)
Reduce context length in Modelfile
Ensure no CPU fallback

watch -n 1 nvidia-smi
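Reducing the context length means setting the `num_ctx` parameter in the Modelfile and importing the model again. Here is a sketch of a helper that sets it without leaving duplicate lines behind. The name `set_num_ctx` is hypothetical and 8192 is only an illustrative value, not a recommendation from the model card.

```shell
# Set (or replace) the context length in an existing Modelfile.
# $1 = path to the Modelfile, $2 = context length in tokens
set_num_ctx() {
  modelfile=$1
  ctx=$2
  tmp=$(mktemp)

  # Keep every line except an existing num_ctx setting.
  # grep exits non-zero when nothing is left to print, hence "|| true".
  grep -v '^PARAMETER num_ctx ' "$modelfile" > "$tmp" || true

  # Append the new setting and swap the file into place.
  printf 'PARAMETER num_ctx %s\n' "$ctx" >> "$tmp"
  mv "$tmp" "$modelfile"
}

# Example call, followed by a re-import so the change takes effect:
#   set_num_ctx ./Modelfile 8192
#   ollama create glm-heretic -f ./Modelfile
```

A smaller context shrinks the KV cache, which can be enough to keep the whole model on the GPU.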

Finding Heretic Models on HuggingFace

Since Heretic models are community-created, I used these search strategies:

Search terms that work:

- glm-4.7-flash heretic
- glm heretic uncensored
- abliterated glm
- glm-4.7 uncensored GGUF

What to look for:

Verified community members as uploaders
High download counts
Detailed model cards
Recent updates

Popular collections:

failspy/abliterated-v3 collection
DavidAU heretic collections
huihui-ai abliterated models

Why GLM-4.7-Flash Heretic?

I compared it to other uncensored options:

Feature              | GLM-4.7-Flash Heretic | Mistral 24B Abliterated | Qwen 2.5 Abliterated
---------------------|----------------------|-------------------------|---------------------
Total Parameters     | 30B                  | 24B                     | 14B
Active Parameters    | 3B                   | 24B                     | 14B
VRAM (Q4)            | ~14GB                | ~15GB                   | ~10GB
Chinese Support      | Excellent            | Good                    | Excellent
Code Generation      | Strong               | Strong                  | Strong
Uncensorship Level   | High (Heretic)       | Medium-High             | Medium-High

Key advantages I found:

Efficient inference - MoE architecture with only 3B active parameters
Strong multilingual - Native Chinese support from Zhipu AI
Heretic-level uncensorship - Maximum removal of refusals
Consumer hardware friendly - Fits 16GB VRAM

Quick Start Checklist

Here’s the summary of what I did:

Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
Download GLM-4.7-Flash Heretic GGUF from HuggingFace
Create Modelfile pointing to the GGUF file
Import: ollama create glm-heretic -f Modelfile
Run: ollama run glm-heretic
Verify GPU usage: nvidia-smi

My Experience

After setting up GLM-4.7-Flash Heretic, I found it responds well to various prompts without the typical “I cannot help with that” refusals. The MoE architecture keeps inference fast despite the 30B total parameters.

On my RTX 5070 Ti, I get approximately 25-30 tokens per second with Q4_K_M quantization, which is usable for interactive chat. For longer conversations, the model maintains coherence well.
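To measure throughput on your own hardware, the non-streaming /api/generate response includes `eval_count` (generated tokens) and `eval_duration` (in nanoseconds), and tokens per second is the first divided by the second, scaled by 10^9. A minimal sketch of that arithmetic follows. The function name `tokens_per_second` is hypothetical and the sample numbers are illustrative, not taken from my runs.

```shell
# Compute generation speed from the two fields of an /api/generate response.
# $1 = eval_count (tokens), $2 = eval_duration (nanoseconds)
tokens_per_second() {
  awk -v n="$1" -v ns="$2" 'BEGIN {
    if (ns <= 0) exit 1          # avoid dividing by zero
    printf "%.1f\n", n / (ns / 1e9)
  }'
}

# Illustrative values: 280 tokens generated in 10 seconds.
tokens_per_second 280 10000000000
```

For a quick look without the API, `ollama run glm-heretic --verbose` prints an eval rate after each response.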

The main trade-off is that Heretic models are community-created, so quality varies by uploader. I recommend checking model cards and download counts before choosing a specific variant.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 HuggingFace GLM-4.7-Flash Heretic Models
👨‍💻 Ollama Official Website
👨‍💻 GLM-4.7-Flash Official ModelScope Page
👨‍💻 HuggingFace Heretic Models Collection
👨‍💻 Ollama GitHub Repository

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
