
How to Build Offline AI Apps That Work Without Internet

Problem

I was building an AI-powered note-taking app when I hit a wall: every voice memo had to be uploaded to OpenAI’s servers for transcription. My users had privacy concerns. Plus, API costs were stacking up fast.

I needed an offline solution. Here’s what I discovered.

Why Go Offline?

Three reasons pushed me toward local AI:

Privacy: Sensitive data never leaves the device. No API provider data retention policies to worry about.

Cost: Cloud AI APIs charge per token, so heavy usage gets expensive. Running offline means a fixed hardware cost and near-zero marginal cost (just electricity).

Reliability: Works on airplanes, in remote locations, and during outages. No API rate limits.
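
To make the cost argument concrete, here is a rough break-even sketch. Every figure in it (token volume, price per million tokens, hardware price, electricity) is a hypothetical placeholder rather than a real quote, and the function names are my own for illustration.

```python
def monthly_api_cost(tokens_per_month: int, price_per_million_tokens: float) -> float:
    """Cloud cost for one month at a flat per-token price."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def break_even_months(hardware_cost: float, api_cost_per_month: float,
                      power_cost_per_month: float = 0.0) -> float:
    """Months until a one-time hardware purchase beats paying per token."""
    monthly_saving = api_cost_per_month - power_cost_per_month
    if monthly_saving <= 0:
        return float("inf")  # local never pays off at this usage level
    return hardware_cost / monthly_saving


# Hypothetical numbers: 200M tokens/month at $2 per million tokens,
# a $1,600 GPU workstation, and $20/month of extra electricity.
api_cost = monthly_api_cost(200_000_000, 2.0)
months = break_even_months(1600.0, api_cost, power_cost_per_month=20.0)
print(f"API cost: ${api_cost:.0f}/month, break-even after {months:.1f} months")
```

Plug in your own usage and prices; at low volumes the break-even point can be far enough away that the cloud API stays cheaper.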

The Stack

After experimenting, I settled on this architecture:

Offline AI Architecture
+---------------------------------------------------------------+
|                    Offline AI Application                     |
+---------------------------------------------------------------+
|                                                               |
|   +-------------+      +-------------+      +-------------+   |
|   | Text Gen    |      | Speech      |      | Vision      |   |
|   | (LLM)       |      | I/O         |      | (Images)    |   |
|   +------+------+      +------+------+      +------+------+   |
|          |                    |                    |          |
|   +------+------+      +------+------+      +------+------+   |
|   | Gemma 3     |      | Whisper     |      | SD 1.5      |   |
|   | (Ollama)    |      | + Kokoro    |      | (ComfyUI)   |   |
|   +-------------+      +-------------+      +-------------+   |
|                                                               |
|   +-------------------------------------------------------+   |
|   |                    Embedding Layer                    |   |
|   |             EmbeddingGemma / local vec DB             |   |
|   +-------------------------------------------------------+   |
|                                                               |
+---------------------------------------------------------------+
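
The embedding layer at the bottom of the diagram is the piece that lets the app search notes by meaning. As a minimal sketch of what a local vector DB does, assuming embeddings arrive as plain lists of floats from some embedding model such as EmbeddingGemma: `TinyVectorStore` and `cosine_similarity` are hypothetical names, and the toy 3-dimensional vectors stand in for real embeddings.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """In-memory store: keeps (text, vector) pairs and ranks them by similarity."""

    def __init__(self) -> None:
        self._items: List[Tuple[str, List[float]]] = []

    def add(self, text: str, vector: List[float]) -> None:
        self._items.append((text, vector))

    def search(self, query_vector: List[float], top_k: int = 3) -> List[Tuple[str, float]]:
        scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


# Toy 3-dimensional vectors stand in for real embeddings here.
store = TinyVectorStore()
store.add("grocery list", [1.0, 0.0, 0.0])
store.add("meeting notes", [0.0, 1.0, 0.0])
store.add("meeting agenda", [0.0, 0.9, 0.1])
print(store.search([0.0, 1.0, 0.0], top_k=2))
```

A linear scan like this is fine for a few thousand notes; beyond that, a dedicated local vector database with an index is the more realistic choice.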

Hardware Requirements

I started with too little RAM. Don’t make my mistake:

Hardware Requirements
Minimum: 16GB RAM, 8GB VRAM (GTX 1080 / RTX 3060)
Recommended: 32GB RAM, 12GB+ VRAM (RTX 3080 12GB / RTX 3090 / RTX 4070)
Apple Silicon: M1/M2/M3 with 16GB+ unified memory

Setting Up Ollama for Local LLM

First, I installed Ollama:

Install Ollama
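# Linux (on macOS, download the app from ollama.com or run: brew install ollama)
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model (Gemma 3 tags include 1b, 4b, 12b, and 27b; there is no 8b)
ollama pull gemma3:4b
# Test it
ollama run gemma3:4b "What is 2+2?"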

Then I built a Python wrapper:

offline_llm.py
import requests


class OfflineLLM:
    def __init__(self, model: str = "gemma3:4b", base_url: str = "http://localhost:11434"):
        self.model = model
        self.base_url = base_url

    def generate(self, prompt: str) -> str:
        """Generate text from a prompt"""
        response = requests.post(
            f"{self.base_url}/api/generate",
            json={
                "model": self.model,
                "prompt": prompt,
                "stream": False  # return one JSON object instead of a stream
            },
            timeout=60
        )
        response.raise_for_status()
        return response.json()["response"]

    def chat(self, messages: list) -> str:
        """Chat with conversation history"""
        response = requests.post(
            f"{self.base_url}/api/chat",
            json={
                "model": self.model,
                "messages": messages,
                "stream": False
            },
            timeout=60
        )
        response.raise_for_status()
        return response.json()["message"]["content"]


# Usage
llm = OfflineLLM()
response = llm.chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in one paragraph"}
])
print(response)

I tested it on a flight. It worked perfectly with no internet.
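
The wrapper above waits for the whole answer before returning. For a more responsive UI, Ollama can also stream the reply as newline-delimited JSON, one object per fragment. Here is a minimal sketch using only the standard library; the helper names are my own, and the default model tag is an assumption, so substitute whichever model you pulled.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from Ollama's newline-delimited JSON stream."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break


def stream_generate(prompt: str, model: str = "gemma3:4b",
                    base_url: str = "http://localhost:11434") -> Iterator[str]:
    """Stream a completion from a local Ollama server, fragment by fragment."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        yield from iter_stream_text(response)


# Offline demonstration of the parser on canned stream lines:
sample = [b'{"response": "Hel", "done": false}\n',
          b'{"response": "lo", "done": false}\n',
          b'{"response": "", "done": true}\n']
print("".join(iter_stream_text(sample)))
```

With the server running, `for piece in stream_generate("Summarize my notes"): print(piece, end="", flush=True)` prints the reply as it is generated.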

Adding Speech-to-Text with Whisper

Next, I needed offline transcription. Whisper was the answer:

Install Whisper
pip install openai-whisper
# Whisper decodes audio through ffmpeg, which must be installed and on your PATH

speech_processor.py
import whisper


class SpeechToText:
    def __init__(self, model_size: str = "base"):
        # model_size options: tiny, base, small, medium, large
        # Larger = better accuracy, more VRAM needed
        # The first load downloads the weights (cached under ~/.cache/whisper),
        # so run it once while online
        self.model = whisper.load_model(model_size)

    def transcribe(self, audio_path: str) -> str:
        """Transcribe audio file to text"""
        result = self.model.transcribe(audio_path)
        return result["text"]

    def transcribe_with_timestamps(self, audio_path: str) -> list:
        """Transcribe and return segments, each with word-level timestamps"""
        result = self.model.transcribe(audio_path, word_timestamps=True)
        return result["segments"]


# Usage
stt = SpeechToText(model_size="base")
text = stt.transcribe("meeting_recording.wav")
print(f"Transcribed: {text}")

I tried the “tiny” model first. Accuracy was poor. Switching to “base” gave me good enough results for note-taking.
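
For note-taking, the timestamped segments are often more useful than one long string. As a small sketch of turning them into readable notes: it relies on each segment carrying `start`, `end`, and `text` keys, as Whisper's `result["segments"]` does, while the helper names and the sample data are my own.

```python
from typing import Dict, List


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS (or H:MM:SS for recordings over an hour)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def segments_to_notes(segments: List[Dict]) -> str:
    """Turn Whisper-style segments into one timestamped line per segment."""
    lines = []
    for segment in segments:
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        lines.append(f"[{start} - {end}] {segment['text'].strip()}")
    return "\n".join(lines)


# Hand-written sample in the shape Whisper returns under result["segments"]
sample_segments = [
    {"start": 0.0, "end": 4.2, "text": " Welcome to the weekly sync."},
    {"start": 4.2, "end": 65.9, "text": " First item: the release date."},
]
print(segments_to_notes(sample_segments))
```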

Adding Text-to-Speech with Kokoro

For voice output, I used Kokoro TTS:

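Install Kokoro
pip install kokoro soundfile

tts.py
import io

import numpy as np
import soundfile as sf
from kokoro import KPipeline

SAMPLE_RATE = 24000  # Kokoro generates 24 kHz audio


class TextToSpeech:
    def __init__(self, lang_code: str = "a", voice: str = "af_heart"):
        # lang_code "a" selects American English; weights download on first use
        self.pipeline = KPipeline(lang_code=lang_code)
        self.voice = voice

    def _generate(self, text: str) -> np.ndarray:
        # The pipeline yields (graphemes, phonemes, audio) for each text chunk
        chunks = [np.asarray(audio) for _, _, audio in self.pipeline(text, voice=self.voice)]
        return np.concatenate(chunks)

    def synthesize(self, text: str, output_path: str):
        """Convert text to speech and save to file"""
        sf.write(output_path, self._generate(text), SAMPLE_RATE)

    def synthesize_to_bytes(self, text: str) -> bytes:
        """Convert text to speech and return WAV bytes"""
        buffer = io.BytesIO()
        sf.write(buffer, self._generate(text), SAMPLE_RATE, format="WAV")
        return buffer.getvalue()


# Usage
tts = TextToSpeech()
tts.synthesize("Hello, this is offline text to speech", "output.wav")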

Putting It All Together

Here’s my complete offline AI app:

offline_ai_app.py
from dataclasses import dataclass
from typing import Optional

import numpy as np
import requests
import soundfile as sf
import whisper
from kokoro import KPipeline


@dataclass
class OfflineAIConfig:
    llm_model: str = "gemma3:4b"
    whisper_model: str = "base"
    tts_lang_code: str = "a"  # Kokoro language code; "a" = American English
    tts_voice: str = "af_heart"
    ollama_url: str = "http://localhost:11434"


class OfflineAIApp:
    def __init__(self, config: Optional[OfflineAIConfig] = None):
        self.config = config or OfflineAIConfig()
        # Initialize components
        self.llm_url = self.config.ollama_url
        print("Loading Whisper model...")
        self.whisper = whisper.load_model(self.config.whisper_model)
        print("Loading Kokoro TTS...")
        self.tts = KPipeline(lang_code=self.config.tts_lang_code)
        print("Offline AI ready!")

    def process_voice_query(self, audio_path: str, output_path: str) -> str:
        """Voice in -> Voice out, all offline"""
        # 1. Transcribe
        print("Transcribing...")
        result = self.whisper.transcribe(audio_path)
        query = result["text"]
        print(f"User said: {query}")
        # 2. Generate response
        print("Generating response...")
        response = self._call_llm(query)
        print(f"AI response: {response}")
        # 3. Synthesize: Kokoro yields (graphemes, phonemes, audio) per chunk
        print("Synthesizing speech...")
        chunks = [np.asarray(audio)
                  for _, _, audio in self.tts(response, voice=self.config.tts_voice)]
        sf.write(output_path, np.concatenate(chunks), 24000)  # 24 kHz output
        return response

    def _call_llm(self, prompt: str) -> str:
        response = requests.post(
            f"{self.llm_url}/api/generate",
            json={
                "model": self.config.llm_model,
                "prompt": prompt,
                "stream": False
            },
            timeout=120
        )
        response.raise_for_status()
        return response.json()["response"]


# Initialize
app = OfflineAIApp()
# Process a voice query completely offline
# response = app.process_voice_query("user_query.wav", "response.wav")
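
One thing the app above takes for granted is that the Ollama server is running and the model has already been pulled. As a sketch of a startup check, this queries Ollama's `/api/tags` endpoint, which lists the locally installed models. The helper names are hypothetical, and I am assuming the default port.

```python
import json
import urllib.error
import urllib.request
from typing import Dict


def model_in_tags(tags_payload: Dict, model: str) -> bool:
    """Check whether `model` appears in the JSON returned by GET /api/tags."""
    names = {entry.get("name", "") for entry in tags_payload.get("models", [])}
    return model in names


def ollama_ready(model: str, base_url: str = "http://localhost:11434") -> bool:
    """Return True if the local Ollama server is up and has `model` pulled."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
            return model_in_tags(json.load(response), model)
    except (urllib.error.URLError, OSError, ValueError):
        return False  # server not running, unreachable, or returned bad JSON


# Canned payload in the shape /api/tags returns, so this runs without a server
sample_payload = {"models": [{"name": "gemma3:4b"}, {"name": "llama3:latest"}]}
print(model_in_tags(sample_payload, "gemma3:4b"))
```

Names are matched with their tag, so a model pulled without one shows up as `name:latest`. Calling `ollama_ready(...)` before loading Whisper and Kokoro lets the app fail fast with a clear message instead of timing out on the first query.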

Quantization: The Key to Running on Consumer Hardware

My first attempt failed. Models were too big. Then I learned about quantization:

Model Quantization
# Ollama serves pre-quantized builds; the default tags are already 4-bit
ollama pull gemma3:4b
# Other precisions use longer tags (check the model's Tags page on ollama.com for exact names)
ollama pull gemma3:4b-it-q8_0
# Rough sizes for an 8B-parameter model
# 16-bit (unquantized): ~16GB
# 4-bit (q4_0): ~5GB
# 8-bit (q8_0): ~9GB

The quality difference between 4-bit and 8-bit? Barely noticeable for most tasks. I use 4-bit for everything.
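
The sizes follow from simple arithmetic: parameter count times bits per weight. A rough sketch is below. The 4.5 and 8.5 bits-per-weight figures reflect how the q4_0 and q8_0 formats store a 16-bit scale for every block of 32 weights; the function name is my own, and actual memory use at runtime is higher because of the context cache and other overhead.

```python
def estimate_model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits per weight / 8."""
    return params_billions * bits_per_weight / 8


# q4_0 and q8_0 add a 16-bit scale per 32-weight block: 4.5 and 8.5 bits/weight
for label, bits in [("16-bit", 16), ("q8_0", 8.5), ("q4_0", 4.5)]:
    print(f"8B model at {label}: ~{estimate_model_size_gb(8, bits):.1f} GB")
```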

Common Mistakes I Made

Mistake 1: Underestimating VRAM

# First attempt - crashed
ollama run gemma3:27b # Needs 20GB+ VRAM
# Solution: Check VRAM first, use smaller models
ollama run gemma3:4b # Works on 8GB VRAM
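
"Check VRAM first" can be automated. This is a sketch under two assumptions: an NVIDIA GPU with `nvidia-smi` on the PATH, and the rough thresholds from this article (20GB+ for the 27b model, 8GB for the 4b model). The function names are hypothetical, and on Apple Silicon or AMD hardware it simply falls back to the small model.

```python
import shutil
import subprocess
from typing import List, Optional


def parse_vram_mib(nvidia_smi_output: str) -> List[int]:
    """Parse one total-memory value (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def detect_vram_gb() -> Optional[float]:
    """Total VRAM of the largest NVIDIA GPU in GiB, or None if it can't be read."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=10,
        )
        values = parse_vram_mib(result.stdout)
    except (subprocess.SubprocessError, OSError, ValueError):
        return None
    return max(values) / 1024 if values else None


def pick_model(vram_gb: Optional[float]) -> str:
    """Map VRAM to a model tag using the rough thresholds from this article."""
    if vram_gb is not None and vram_gb >= 20:
        return "gemma3:27b"
    return "gemma3:4b"  # safe default for 8GB cards or unknown hardware


print(pick_model(detect_vram_gb()))
```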

Mistake 2: Not Using Quantization

I downloaded full models at first. 16GB downloads for a single model. Quantized versions are 3-5GB with minimal quality loss.

Mistake 3: Expecting Cloud Performance

Local models on consumer hardware are roughly comparable to early-2023 cloud models (GPT-3.5 level), not GPT-4. Set realistic expectations.

Mistake 4: Memory Leaks in Long-Running Apps

memory_management.py
import gc

import torch
import whisper


# WRONG: Models stay loaded forever
class BadApp:
    def __init__(self):
        self.model = whisper.load_model("large")  # Never unloaded


# CORRECT: Load/unload as needed
class GoodApp:
    def __init__(self):
        self._model = None

    @property
    def model(self):
        # Load lazily on first access
        if self._model is None:
            self._model = whisper.load_model("base")
        return self._model

    def unload_model(self):
        """Free memory when not in use"""
        self._model = None
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # release cached GPU memory as well
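
Calling an unload method by hand is easy to forget. One possible refinement, sketched here with hypothetical names, is a wrapper that loads on first use and drops the model after a period of inactivity. The injectable `clock` parameter is my addition so the behaviour can be demonstrated with a fake clock instead of waiting.

```python
import time
from typing import Any, Callable, Optional


class LazyModel:
    """Load a model on first use and drop it after a period of inactivity."""

    def __init__(self, loader: Callable[[], Any], idle_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._model: Optional[Any] = None
        self._last_used = 0.0

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def get(self) -> Any:
        """Return the model, loading it if needed, and record the access time."""
        if self._model is None:
            self._model = self._loader()
        self._last_used = self._clock()
        return self._model

    def unload_if_idle(self) -> bool:
        """Drop the reference if unused for idle_seconds; True if unloaded."""
        if self._model is not None and self._clock() - self._last_used >= self._idle_seconds:
            self._model = None
            return True
        return False


# Demo with a stand-in loader and a fake clock, so nothing heavy is loaded.
fake_time = [0.0]
lazy = LazyModel(loader=lambda: "fake-model", idle_seconds=60, clock=lambda: fake_time[0])
lazy.get()
fake_time[0] = 120.0
print(lazy.unload_if_idle(), lazy.loaded)
```

In a real app the loader would be something like `lambda: whisper.load_model("base")`, with `unload_if_idle()` called from a periodic timer and followed by the same garbage-collection step as above.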

Integration Patterns

For different platforms, I use different approaches:

Desktop (Electron/Tauri): Ollama runs locally on localhost:11434. Models stored on disk. SQLite for data.

Mobile (iOS): Core ML with models converted to Core ML format (Core ML does not load ONNX files directly). Quantized, compiled models (.mlmodelc). On-device storage only.

Browser (WASM): WebLLM with WebGPU for inference. IndexedDB for caching. Limited to smaller models.

What Actually Works

After building several offline AI apps, here’s what I recommend:

For beginners: Start with Ollama + Whisper. Add ComfyUI if you need images.

For production: Use quantized models. Test thoroughly. Monitor memory.

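For mobile: Core ML (iOS) or ML Kit (Android), with models converted to each platform's native format. ONNX models need a separate runtime such as ONNX Runtime Mobile.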

The tradeoff is clear: upfront hardware cost versus zero marginal costs, complete privacy versus convenience, local control versus managed services. For privacy-sensitive applications or high-volume use cases, offline AI is worth the investment.

Summary

In this post, I showed how to build AI applications that run completely offline. The key components are Ollama for LLMs, Whisper for speech-to-text, Kokoro for text-to-speech, and ComfyUI for images. The trick is using quantized models to fit on consumer hardware. Performance is now roughly on par with early-2023 cloud solutions, which is good enough for most applications.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
