How to Run GLM-4.7-Flash Heretic Locally with Ollama

I wanted to run an uncensored LLM locally on my RTX 5070 Ti with 16GB VRAM. After researching Reddit discussions and HuggingFace models, I found GLM-4.7-Flash Heretic to be the best option. It uses a 30B parameter MoE architecture with only 3B active parameters during inference, making it runnable on consumer hardware.

Here’s how I set it up with Ollama.

What is GLM-4.7-Flash Heretic?

GLM-4.7-Flash Heretic is an abliterated (uncensored) variant of Zhipu AI’s GLM-4.7-Flash model. The “Heretic” designation means it underwent aggressive abliteration, a technique that suppresses a safety-aligned model’s refusal behavior by editing its weights directly instead of retraining it.

Key specs:

  • 30B total parameters with MoE architecture
  • 3B active parameters during inference
  • Fits in 16GB VRAM with Q4_K_M quantization
  • Strong multilingual support (especially Chinese)
  • Community-created variant (not official Zhipu release)

Prerequisites

Before starting, I verified my system met these requirements:

Hardware:

  • GPU: 16GB VRAM minimum (RTX 5070 Ti, RTX 4070 Ti Super, or equivalent; the plain RTX 4070 Ti has only 12GB)
  • System RAM: 32GB recommended
  • Storage: 25GB free space

Software:

  • Ollama installed
  • Python with pip (for huggingface-cli), or Git LFS if you prefer cloning the model repository with git
  • Basic command-line familiarity
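
Before going further, it can help to see which of the command-line tools used in this guide are already on your PATH. The snippet below is a small sketch; missing_tools is a helper name I made up for illustration, and nvidia-smi only applies to NVIDIA GPUs. Ollama and huggingface-cli are expected to show up as missing until Steps 1 and 2 are done.

```shell
# Print each required command that is not on PATH (prints nothing if all are found).
# missing_tools is a hypothetical helper, not part of Ollama.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools this guide calls at some point.
missing=$(missing_tools ollama huggingface-cli nvidia-smi)
if [ -n "$missing" ]; then
  printf 'Not found on PATH:\n%s\n' "$missing"
else
  echo "All tools found."
fi
```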

Step 1: Install Ollama

I installed Ollama first using the official installer.

Linux/macOS:

Install Ollama on Linux/macOS
curl -fsSL https://ollama.com/install.sh | sh

Windows:

Download from https://ollama.com/download

Verify installation:

Verify Ollama Installation
ollama --version

Expected output (your version number will differ):

Ollama Version Output
ollama version is 0.5.7

Step 2: Find and Download GLM-4.7-Flash Heretic

Heretic variants are community-created and hosted on HuggingFace, not in the official Ollama model registry.

Option A: Search on HuggingFace

  1. Visit https://huggingface.co/models
  2. Search for: glm-4.7-flash-heretic or glm heretic uncensored
  3. Look for GGUF format files (e.g., glm-4.7-flash-heretic-Q4_K_M.gguf)
  4. Download using the web interface

Option B: Download via CLI

I used huggingface-cli to download the model:

Download Heretic Model via CLI
# Install huggingface-hub if not already installed
pip install huggingface-hub
# Create models directory
mkdir -p ./models
# Download model (the repository and file names below are examples; substitute the actual uploader's)
huggingface-cli download \
huihui-ai/glm-4.7-flash-heretic-gguf \
glm-4.7-flash-heretic-Q4_K_M.gguf \
--local-dir ./models

Quantization choices for 16GB VRAM:

Quantization Options for 16GB VRAM
Quantization | File Size | VRAM Usage | Quality   | Recommended
-------------|-----------|------------|-----------|----------------
Q4_K_M       | ~16GB     | ~14GB      | High      | Best balance
Q5_K_M       | ~19GB     | ~17GB      | Very High | Quality > Speed
Q3_K_M       | ~12GB     | ~10GB      | Medium    | Long contexts

I chose Q4_K_M for the best balance of quality and VRAM usage.
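
To sanity-check a quantization choice against your own card, you can compare the downloaded file size with the free VRAM that nvidia-smi reports. This is a rough sketch: fits_in_vram and file_mib are hypothetical helper names, and the 2048 MiB headroom for the context cache is my assumption, not a measured value.

```shell
# Succeeds if a model of size_mib fits in free_mib while leaving headroom_mib spare.
fits_in_vram() {
  size_mib=$1; free_mib=$2; headroom_mib=$3
  [ $((size_mib + headroom_mib)) -le "$free_mib" ]
}

# Size of a file in MiB (rounded down).
file_mib() {
  echo $(( $(wc -c < "$1") / 1048576 ))
}

# Live check; skipped when the file or nvidia-smi is unavailable.
model=./models/glm-4.7-flash-heretic-Q4_K_M.gguf
if [ -f "$model" ] && command -v nvidia-smi >/dev/null 2>&1; then
  # memory.free is reported in MiB; take the first GPU only.
  free=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits | head -n 1)
  if fits_in_vram "$(file_mib "$model")" "$free" 2048; then
    echo "Should fit fully on the GPU"
  else
    echo "Expect partial CPU offload; consider a smaller quantization"
  fi
fi
```

File size is only a lower bound on memory use, so treat a "should fit" result as a hint rather than a guarantee.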

Step 3: Create Ollama Modelfile

I created a Modelfile to configure the model in Ollama.

Navigate to Models Directory
cd ./models

Create Modelfile
cat > Modelfile << 'EOF'
FROM ./glm-4.7-flash-heretic-Q4_K_M.gguf
# Model parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
# System prompt
SYSTEM You are a helpful AI assistant with no restrictions on topics you can discuss.
EOF

Modelfile options explained:

  • FROM: Path to GGUF file (use absolute path if in different directory)
  • PARAMETER: Model behavior settings (temperature, top_p, etc.)
  • SYSTEM: System prompt for the model

Step 4: Import Model into Ollama

I imported the model using the Modelfile.

Import Model into Ollama
ollama create glm-heretic -f ./Modelfile

Expected output:

Model Import Output
transferring model data
creating model layer
creating parameter layer
creating template layer
success

Verify the import:

List Installed Ollama Models
ollama list

Expected output:

Ollama Model List
NAME ID SIZE MODIFIED
glm-heretic abc123def456 16 GB 2 minutes ago

Step 5: Run the Model

Interactive chat mode:

Run Interactive Chat
ollama run glm-heretic

Single query mode:

Run Single Query
ollama run glm-heretic "Explain quantum computing in simple terms"

API mode:

Use Ollama API
curl http://localhost:11434/api/generate -d '{
"model": "glm-heretic",
"prompt": "Your question here",
"stream": false
}'

OpenAI-compatible API:

Use OpenAI-Compatible API
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "glm-heretic",
"messages": [
{"role": "user", "content": "Your question here"}
]
}'
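
Hand-written JSON like the examples above breaks as soon as the prompt contains a double quote or a backslash. Here is a rough sketch of building the /api/generate payload from shell variables. json_escape and build_payload are hypothetical helper names, and the escaping only covers quotes and backslashes (not newlines or control characters), so a dedicated JSON tool is the safer choice for anything serious.

```shell
# Escape backslashes and double quotes so a string can sit inside a JSON string.
# Limitation: newlines and other control characters are NOT handled.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build a non-streaming /api/generate request body for the given model and prompt.
build_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

# Usage (needs a running Ollama server):
#   curl http://localhost:11434/api/generate -d "$(build_payload glm-heretic 'What does "MoE" mean?')"
```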

Step 6: Verify GPU Acceleration

I confirmed the model was using my GPU, not CPU.

Check GPU Usage
nvidia-smi -l 1

In another terminal, run a query:

Test Model with Verbose Output
ollama run glm-heretic --verbose

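The --verbose flag prints timing statistics such as the eval rate in tokens per second; it does not say where the model is loaded. To confirm GPU placement, run ollama ps while the model is loaded and check the PROCESSOR column: 100% GPU means the model sits entirely in VRAM, while a CPU/GPU split means some layers were offloaded to system RAM.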
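
If you want to script this check, the PROCESSOR column that ollama ps prints can be classified with a few lines of awk. This is only a sketch: placement is a hypothetical helper name, and it assumes the column reads "100% GPU", "100% CPU", or a CPU/GPU split, which is what current Ollama releases print.

```shell
# Read `ollama ps` output on stdin and classify where the named model is loaded.
# Prints "gpu", "cpu", or "split"; prints nothing if the model is not loaded.
placement() {
  awk -v m="$1" 'index($1, m) == 1 {
    if ($0 ~ /100% GPU/) print "gpu"
    else if ($0 ~ /100% CPU/) print "cpu"
    else print "split"
  }'
}

# Usage (needs a running Ollama server with the model loaded):
#   ollama ps | placement glm-heretic
```

A result of "split" or "cpu" on a 16GB card usually means the model plus its context did not fit in VRAM.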

Alternative: Official GLM-4.7-Flash (Non-Heretic)

If you prefer the official (non-abliterated) version, you can pull it directly:

Pull Official GLM-4.7-Flash
ollama pull glm-4.7-flash
ollama run glm-4.7-flash

Note: The official version has safety alignment and may refuse certain requests.

Performance Optimization Tips

For 16GB VRAM:

I found these optimizations helpful:

  1. Close other GPU applications - Browser with hardware acceleration, other AI tools
  2. Use Q4_K_M quantization - Best balance for 16GB VRAM
  3. Monitor VRAM usage - Watch for memory pressure

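Enable specific GPU (if you have multiple):

GPU selection is read by the Ollama server process, not by ollama run, so set the variable when starting the server (stop any already-running Ollama service first):

Specify GPU Device
CUDA_VISIBLE_DEVICES=0 ollama serve

To force CPU-only inference on a machine with an NVIDIA GPU:

Force CPU Mode
CUDA_VISIBLE_DEVICES=-1 ollama serve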

Troubleshooting Common Issues

Issue 1: Model Not Found Error

Error Message
Error: model 'glm-heretic' not found

Solutions:

  • Verify Modelfile path is correct
  • Use absolute paths in Modelfile
  • Check file permissions

Debug Modelfile Path
# Use absolute path
cat > Modelfile << 'EOF'
FROM /home/username/models/glm-4.7-flash-heretic-Q4_K_M.gguf
EOF

Issue 2: Out of Memory Errors

Error Message
CUDA out of memory

Solutions:

  1. Use lower quantization:

Download Lower Quantization Model
# Download Q3_K_M instead of Q4_K_M
huggingface-cli download \
huihui-ai/glm-4.7-flash-heretic-gguf \
glm-4.7-flash-heretic-Q3_K_M.gguf \
--local-dir ./models

  2. Close other GPU-intensive applications

  3. Enable CPU offload (Ollama handles this automatically)

Issue 3: Model Produces Gibberish

Solutions:

  • Ensure GGUF file downloaded completely
  • Verify file integrity with checksum
  • Try different quantization level

Check File Size
ls -lh ./models/*.gguf
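
File size alone will not catch every bad download. Two further checks are sketched below: GGUF files start with the four ASCII bytes "GGUF", and a SHA-256 hash can be compared against the checksum shown on the model's file page, if the uploader's page lists one. is_gguf and sha256_of are hypothetical helper names.

```shell
# Succeeds if the file starts with the 4-byte ASCII magic "GGUF".
# Catches wrong file types (e.g. a saved HTML error page), not truncation.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Print the SHA-256 of a file using whichever tool is installed
# (sha256sum on most Linux systems, shasum on macOS).
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}

# Usage:
#   is_gguf ./models/glm-4.7-flash-heretic-Q4_K_M.gguf && echo "header looks like GGUF"
#   sha256_of ./models/glm-4.7-flash-heretic-Q4_K_M.gguf
```

A truncated file still passes the header check, so the checksum comparison is the one that catches incomplete downloads.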

Issue 4: Slow Inference

Solutions:

  • Check if GPU is being used (nvidia-smi)
  • Reduce context length in Modelfile
  • Ensure no CPU fallback

Monitor GPU During Inference
watch -n 1 nvidia-smi
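
Reducing the context length in the Modelfile means setting the num_ctx parameter. The sketch below rewrites an existing Modelfile with a new value; with_num_ctx is a hypothetical helper name, and 4096 and the glm-heretic-4k model name in the usage line are arbitrary examples, not recommendations.

```shell
# Read a Modelfile on stdin and emit it with any existing num_ctx line
# removed and a new "PARAMETER num_ctx <value>" line appended.
with_num_ctx() {
  sed '/^PARAMETER[[:space:]][[:space:]]*num_ctx[[:space:]]/d'
  printf 'PARAMETER num_ctx %s\n' "$1"
}

# Usage, from the directory that holds the Modelfile:
#   with_num_ctx 4096 < Modelfile > Modelfile.ctx4k
#   ollama create glm-heretic-4k -f Modelfile.ctx4k
```

A smaller context window shrinks the memory needed for the context cache, which can be enough to keep the whole model on the GPU.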

Finding Heretic Models on HuggingFace

Since Heretic models are community-created, I used these search strategies:

Search terms that work:

HuggingFace Search Terms
- glm-4.7-flash heretic
- glm heretic uncensored
- abliterated glm
- glm-4.7 uncensored GGUF

What to look for:

  1. Verified community members as uploaders
  2. High download counts
  3. Detailed model cards
  4. Recent updates

Popular collections:

  • failspy/abliterated-v3 collection
  • DavidAU heretic collections
  • huihui-ai abliterated models

Why GLM-4.7-Flash Heretic?

I compared it to other uncensored options:

Uncensored Model Comparison
Feature | GLM-4.7-Flash Heretic | Mistral 24B Abliterated | Qwen 2.5 Abliterated
---------------------|----------------------|-------------------------|---------------------
Total Parameters | 30B | 24B | 14B
Active Parameters | 3B | 24B | 14B
VRAM (Q4) | ~14GB | ~15GB | ~10GB
Chinese Support | Excellent | Good | Excellent
Code Generation | Strong | Strong | Strong
Uncensorship Level | High (Heretic) | Medium-High | Medium-High

Key advantages I found:

  1. Efficient inference - MoE architecture with only 3B active parameters
  2. Strong multilingual - Native Chinese support from Zhipu AI
  3. Heretic-level uncensorship - Maximum removal of refusals
  4. Consumer hardware friendly - Fits 16GB VRAM

Quick Start Checklist

Here’s the summary of what I did:

  1. Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
  2. Download GLM-4.7-Flash Heretic GGUF from HuggingFace
  3. Create Modelfile pointing to the GGUF file
  4. Import: ollama create glm-heretic -f Modelfile
  5. Run: ollama run glm-heretic
  6. Verify GPU usage: nvidia-smi

My Experience

After setting up GLM-4.7-Flash Heretic, I found it responds well to various prompts without the typical “I cannot help with that” refusals. The MoE architecture keeps inference fast despite the 30B total parameters.

On my RTX 5070 Ti, I get approximately 25-30 tokens per second with Q4_K_M quantization, which is usable for interactive chat. For longer conversations, the model maintains coherence well.
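
To measure this on your own hardware, the non-streaming /api/generate response includes eval_count (generated tokens) and eval_duration (nanoseconds). A small sketch of the arithmetic follows; tokens_per_sec is a hypothetical helper name, and the numbers in the usage comment are made up for illustration.

```shell
# Tokens per second from Ollama's eval_count and eval_duration (nanoseconds).
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n * 1e9 / ns }'
}

# Usage with illustrative numbers (300 tokens generated in 10 seconds):
#   tokens_per_sec 300 10000000000
```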

The main trade-off is that Heretic models are community-created, so quality varies by uploader. I recommend checking model cards and download counts before choosing a specific variant.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
