How to Run GLM-4.7-Flash Heretic Locally with Ollama
I wanted to run an uncensored LLM locally on my RTX 5070 Ti with 16GB VRAM. After researching Reddit discussions and HuggingFace models, I found GLM-4.7-Flash Heretic to be the best option. It uses a 30B-parameter Mixture-of-Experts (MoE) architecture with only 3B parameters active during inference, which makes it runnable on consumer hardware.
Here’s how I set it up with Ollama.
What is GLM-4.7-Flash Heretic?
GLM-4.7-Flash Heretic is an abliterated (uncensored) variant of Zhipu AI’s GLM-4.7-Flash model. The “Heretic” designation means it underwent aggressive abliteration - a technique that removes the refusal mechanism from safety-aligned models without retraining.
Key specs:
- 30B total parameters with MoE architecture
- 3B active parameters during inference
- Fits in 16GB VRAM with Q4_K_M quantization
- Strong multilingual support (especially Chinese)
- Community-created variant (not official Zhipu release)
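As a rough sanity check on the memory figures, the size of the quantized weights can be estimated as total parameters × bits per weight ÷ 8. All 30B parameters have to be stored even though only 3B are active per token; the MoE design saves compute, not storage. The sketch below is back-of-the-envelope arithmetic only: `estimate_weights_gb` is a hypothetical helper, the bits-per-weight value is an assumption you supply (K-quants mix precisions, so the effective figure varies by model), and the KV cache and runtime overhead are ignored.

```shell
# Rough weight-size estimate: params (billions) * bits per weight / 8 = GB.
# estimate_weights_gb is a hypothetical helper, not part of Ollama.
estimate_weights_gb() {
  params_b="$1"   # total parameters in billions, e.g. 30
  bpw="$2"        # assumed effective bits per weight, e.g. 4
  awk -v p="$params_b" -v b="$bpw" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# 30B parameters at a nominal 4 bits per weight
estimate_weights_gb 30 4
```

Treat the result as a lower bound for planning; the actual GGUF file size on the model page is the number to trust.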
Prerequisites
Before starting, I verified my system met these requirements:
Hardware:
- GPU: 16GB VRAM minimum (RTX 5070 Ti, RTX 4070 Ti, or equivalent)
- System RAM: 32GB recommended
- Storage: 25GB free space
Software:
- Ollama installed
- Git LFS (for downloading large model files)
- Basic command-line familiarity
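The checks above can be scripted. This is a minimal sketch, assuming an NVIDIA GPU and a POSIX shell; the helper names (`have_cmd`, `at_least`, `check_prereqs`) are hypothetical, and the thresholds simply mirror the list above. It does not check system RAM.

```shell
# Hypothetical pre-flight check for the requirements listed above.
have_cmd() { command -v "$1" >/dev/null 2>&1; }

# True when $1 (actual) is at least $2 (required); integers only.
at_least() { [ "$1" -ge "$2" ]; }

check_prereqs() {
  for cmd in ollama git-lfs nvidia-smi; do
    if have_cmd "$cmd"; then echo "ok: $cmd"; else echo "missing: $cmd"; fi
  done

  if have_cmd nvidia-smi; then
    # Total VRAM of the first GPU, in MiB
    vram_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1)
    if at_least "$vram_mib" 16000; then
      echo "ok: VRAM ${vram_mib} MiB"
    else
      echo "low VRAM: ${vram_mib} MiB"
    fi
  fi

  # Free space in the current directory, in GB
  free_gb=$(df -Pk . | awk 'NR==2 { print int($4 / 1024 / 1024) }')
  if at_least "$free_gb" 25; then
    echo "ok: ${free_gb} GB free"
  else
    echo "low disk: ${free_gb} GB free"
  fi
}

check_prereqs
```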
Step 1: Install Ollama
I installed Ollama first using the official installer.
Linux/macOS:
curl -fsSL https://ollama.com/install.sh | sh
Windows:
Download from https://ollama.com/download
Verify installation:
ollama --version
Expected output:
ollama version is 0.5.7
Step 2: Find and Download GLM-4.7-Flash Heretic
Heretic variants are community-created and hosted on HuggingFace, not in the official Ollama model registry.
Option A: Search on HuggingFace
- Visit https://huggingface.co/models
- Search for: glm-4.7-flash-heretic or glm heretic uncensored
- Look for GGUF format files (e.g., glm-4.7-flash-heretic-Q4_K_M.gguf)
- Download using the web interface
Option B: Download via CLI
I used huggingface-cli to download the model:
# Install huggingface-hub if not already installed
pip install huggingface-hub
# Create models directory
mkdir -p ./models
# Download model (replace the repository name if you use a different uploader)
huggingface-cli download \
  huihui-ai/glm-4.7-flash-heretic-gguf \
  glm-4.7-flash-heretic-Q4_K_M.gguf \
  --local-dir ./models
Quantization choices for 16GB VRAM:
Quantization   File Size   VRAM Usage   Quality     Recommended
-------------  ----------  -----------  ----------  ---------------
Q4_K_M         ~16GB       ~14GB        High        Best balance
Q5_K_M         ~19GB       ~17GB        Very High   Quality > Speed
Q3_K_M         ~12GB       ~10GB        Medium      Long contexts
I chose Q4_K_M for the best balance of quality and VRAM usage.
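The choice can be written as a small lookup. This is a sketch that hard-codes the approximate VRAM figures from the table above; `pick_quant` is a hypothetical helper, and it leaves no headroom for long contexts, so step down one level if you plan to use them.

```shell
# Pick the highest-quality quantization whose approximate VRAM need
# (taken from the table above) fits the given budget in whole GB.
# pick_quant is a hypothetical helper.
pick_quant() {
  vram_gb="$1"
  if [ "$vram_gb" -ge 17 ]; then
    echo "Q5_K_M"
  elif [ "$vram_gb" -ge 14 ]; then
    echo "Q4_K_M"
  elif [ "$vram_gb" -ge 10 ]; then
    echo "Q3_K_M"
  else
    echo "none"
    return 1
  fi
}

pick_quant 16   # a 16GB card lands on Q4_K_M
```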
Step 3: Create Ollama Modelfile
I created a Modelfile to configure the model in Ollama.
cd ./models
cat > Modelfile << 'EOF'
FROM ./glm-4.7-flash-heretic-Q4_K_M.gguf
# Model parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
# System prompt
SYSTEM You are a helpful AI assistant with no restrictions on topics you can discuss.
EOF
Modelfile options explained:
- FROM: Path to GGUF file (use absolute path if in different directory)
- PARAMETER: Model behavior settings (temperature, top_p, etc.)
- SYSTEM: System prompt for the model
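Because a relative FROM path only works from the right directory, one option is to generate the Modelfile with the path already resolved. This is a sketch, not the only way to do it: `write_modelfile` is a hypothetical helper that reuses the parameters and system prompt from this step and writes the Modelfile next to the GGUF file.

```shell
# write_modelfile is a hypothetical helper: it resolves the GGUF path to an
# absolute one and writes a Modelfile in the same directory.
write_modelfile() {
  gguf="$1"
  [ -f "$gguf" ] || { echo "GGUF not found: $gguf" >&2; return 1; }
  dir=$(cd "$(dirname "$gguf")" && pwd)
  abs="$dir/$(basename "$gguf")"
  cat > "$dir/Modelfile" <<EOF
FROM $abs

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40

SYSTEM You are a helpful AI assistant with no restrictions on topics you can discuss.
EOF
  # Print the location of the generated Modelfile
  echo "$dir/Modelfile"
}

# Usage:
#   write_modelfile ./models/glm-4.7-flash-heretic-Q4_K_M.gguf
```

The function prints the Modelfile path, so the result can be passed straight to `ollama create glm-heretic -f`.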
Step 4: Import Model into Ollama
I imported the model using the Modelfile.
ollama create glm-heretic -f ./Modelfile
Expected output:
transferring model data
creating model layer
creating parameter layer
creating template layer
success
Verify the import:
ollama list
Expected output:
NAME           ID              SIZE     MODIFIED
glm-heretic    abc123def456    16 GB    2 minutes ago
Step 5: Run the Model
Interactive chat mode:
ollama run glm-heretic
Single query mode:
ollama run glm-heretic "Explain quantum computing in simple terms"
API mode:
curl http://localhost:11434/api/generate -d '{
  "model": "glm-heretic",
  "prompt": "Your question here",
  "stream": false
}'
OpenAI-compatible API:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "glm-heretic",
  "messages": [
    {"role": "user", "content": "Your question here"}
  ]
}'
Step 6: Verify GPU Acceleration
I confirmed the model was using my GPU, not CPU.
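nvidia-smi -l 1
In another terminal, run a query:
ollama run glm-heretic --verbose
The --verbose flag prints timing statistics (such as the eval rate) after each response. In the nvidia-smi terminal, VRAM usage should rise by roughly the model size once the model has loaded.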
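A more direct check is `ollama ps`, which lists loaded models together with a PROCESSOR column. The sketch below assumes that column reads "100% GPU" when the whole model is in VRAM and shows a CPU/GPU split otherwise; `fully_on_gpu` is a hypothetical helper that only filters that text output.

```shell
# Reads `ollama ps` output on stdin and succeeds only if the named model
# is listed with "100% GPU" in the PROCESSOR column (assumed format).
# fully_on_gpu is a hypothetical helper.
fully_on_gpu() {
  grep "^$1" | grep -q "100% GPU"
}

# Usage while the model is loaded:
#   ollama ps | fully_on_gpu glm-heretic && echo "all layers on GPU"
# Current VRAM use, straight from the driver:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```

If the check fails while the model is loaded, part of it has been offloaded to system RAM, which usually explains slow generation.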
Alternative: Official GLM-4.7-Flash (Non-Heretic)
If you prefer the official (non-abliterated) version, you can pull it directly:
ollama pull glm-4.7-flash
ollama run glm-4.7-flash
Note: The official version has safety alignment and may refuse certain requests.
Performance Optimization Tips
For 16GB VRAM:
I found these optimizations helpful:
- Close other GPU applications - Browser with hardware acceleration, other AI tools
- Use Q4_K_M quantization - Best balance for 16GB VRAM
- Monitor VRAM usage - Watch for memory pressure
Enable specific GPU (if you have multiple):
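CUDA_VISIBLE_DEVICES=0 ollama serve
The variable is read by the Ollama server, not by ollama run, so set it where the server starts (stop any already-running Ollama service first).
For CPU-only systems (an invalid GPU ID hides all NVIDIA GPUs from the server):
CUDA_VISIBLE_DEVICES=-1 ollama serve
Troubleshooting Common Issues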
Issue 1: Model Not Found Error
Error: model 'glm-heretic' not found
Solutions:
- Verify Modelfile path is correct
- Use absolute paths in Modelfile
- Check file permissions
# Use absolute path
cat > Modelfile << 'EOF'
FROM /home/username/models/glm-4.7-flash-heretic-Q4_K_M.gguf
EOF
Issue 2: Out of Memory Errors
CUDA out of memory
Solutions:
- Use lower quantization:
# Download Q3_K_M instead of Q4_K_M
huggingface-cli download \
  huihui-ai/glm-4.7-flash-heretic-gguf \
  glm-4.7-flash-heretic-Q3_K_M.gguf \
  --local-dir ./models
- Close other GPU-intensive applications
- Enable CPU offload (Ollama handles this automatically)
Issue 3: Model Produces Gibberish
Solutions:
- Ensure GGUF file downloaded completely
- Verify file integrity with checksum
- Try different quantization level
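The checksum comparison can be scripted. This is a sketch assuming GNU `sha256sum` is installed (on macOS, `shasum -a 256` prints the same format) and that you copy the expected SHA256 from the file's page on HuggingFace; `verify_sha256` is a hypothetical helper.

```shell
# verify_sha256 is a hypothetical helper that compares a file's SHA-256
# against an expected hash supplied by the caller.
verify_sha256() {
  file="$1"
  expected="$2"   # hash copied from the model's file page
  actual=$(sha256sum "$file" | awk '{ print $1 }')
  if [ "$actual" = "$expected" ]; then
    echo "checksum ok"
  else
    echo "checksum mismatch: got $actual" >&2
    return 1
  fi
}

# Usage (the hash argument is a placeholder you must fill in):
#   verify_sha256 ./models/glm-4.7-flash-heretic-Q4_K_M.gguf <expected-sha256>
```

A mismatch almost always means a truncated download, so re-download before trying another quantization.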
ls -lh ./models/*.gguf
Issue 4: Slow Inference
Solutions:
- Check if GPU is being used (nvidia-smi)
- Reduce context length in Modelfile
- Ensure no CPU fallback
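To reduce the context length without re-importing the GGUF, one approach is a second Modelfile that derives from the imported model and sets `num_ctx`. This is a sketch: `write_ctx_modelfile`, the 4096 value, and the `glm-heretic-4k` name are illustrative assumptions, not values I tuned.

```shell
# write_ctx_modelfile is a hypothetical helper that writes a Modelfile
# deriving from an already-imported model with a smaller context window.
write_ctx_modelfile() {
  base="$1"      # existing Ollama model name, e.g. glm-heretic
  num_ctx="$2"   # context length in tokens, e.g. 4096
  out="$3"       # where to write the Modelfile
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$base" "$num_ctx" > "$out"
}

# Usage:
#   write_ctx_modelfile glm-heretic 4096 ./Modelfile.ctx
#   ollama create glm-heretic-4k -f ./Modelfile.ctx
#   ollama run glm-heretic-4k
```

A smaller context window shrinks the KV cache, which leaves more VRAM for the weights and makes CPU fallback less likely.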
watch -n 1 nvidia-smi
Finding Heretic Models on HuggingFace
Since Heretic models are community-created, I used these search strategies:
Search terms that work:
- glm-4.7-flash heretic
- glm heretic uncensored
- abliterated glm
- glm-4.7 uncensored GGUF
What to look for:
- Verified community members as uploaders
- High download counts
- Detailed model cards
- Recent updates
Popular collections:
- failspy/abliterated-v3 collection
- DavidAU heretic collections
- huihui-ai abliterated models
Why GLM-4.7-Flash Heretic?
I compared it to other uncensored options:
Feature              | GLM-4.7-Flash Heretic | Mistral 24B Abliterated | Qwen 2.5 Abliterated
---------------------|-----------------------|-------------------------|---------------------
Total Parameters     | 30B                   | 24B                     | 14B
Active Parameters    | 3B                    | 24B                     | 14B
VRAM (Q4)            | ~14GB                 | ~15GB                   | ~10GB
Chinese Support      | Excellent             | Good                    | Excellent
Code Generation      | Strong                | Strong                  | Strong
Uncensorship Level   | High (Heretic)        | Medium-High             | Medium-High
Key advantages I found:
- Efficient inference - MoE architecture with only 3B active parameters
- Strong multilingual - Native Chinese support from Zhipu AI
- Heretic-level uncensorship - Maximum removal of refusals
- Consumer hardware friendly - Fits 16GB VRAM
Quick Start Checklist
Here’s the summary of what I did:
- Install Ollama: curl -fsSL https://ollama.com/install.sh | sh
- Download GLM-4.7-Flash Heretic GGUF from HuggingFace
- Create Modelfile pointing to the GGUF file
- Import: ollama create glm-heretic -f Modelfile
- Run: ollama run glm-heretic
- Verify GPU usage: nvidia-smi
My Experience
After setting up GLM-4.7-Flash Heretic, I found it responds well to various prompts without the typical “I cannot help with that” refusals. The MoE architecture keeps inference fast despite the 30B total parameters.
On my RTX 5070 Ti, I get approximately 25-30 tokens per second with Q4_K_M quantization, which is usable for interactive chat. For longer conversations, the model maintains coherence well.
The main trade-off is that Heretic models are community-created, so quality varies by uploader. I recommend checking model cards and download counts before choosing a specific variant.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- HuggingFace GLM-4.7-Flash Heretic Models
- Ollama Official Website
- GLM-4.7-Flash Official ModelScope Page
- HuggingFace Heretic Models Collection
- Ollama GitHub Repository
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!