How to Build Offline AI Apps That Work Without Internet
Problem
I was building an AI-powered note-taking app when I hit a wall: every voice memo had to be uploaded to OpenAI’s servers for transcription. My users had privacy concerns. Plus, API costs were stacking up fast.
I needed an offline solution. Here’s what I discovered.
Why Go Offline?
Three reasons pushed me toward local AI:
Privacy: Sensitive data never leaves the device. No API provider data retention policies to worry about.
Cost: Cloud AI APIs charge per token. Heavy usage scenarios get expensive. Offline means fixed hardware cost, zero marginal cost.
Reliability: Works in airplanes, remote locations, during outages. No API rate limits.
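To make the cost argument concrete, here is a rough break-even sketch. The hardware price and API price in it are hypothetical placeholders, not quotes from any provider, and the function name is my own. It also ignores electricity, so treat the result as an optimistic lower bound.

```python
def break_even_tokens(hardware_cost_usd: float,
                      api_price_per_million_tokens_usd: float) -> float:
    """Tokens you must process before a fixed hardware cost beats per-token pricing."""
    if api_price_per_million_tokens_usd <= 0:
        raise ValueError("API price must be positive")
    return hardware_cost_usd / api_price_per_million_tokens_usd * 1_000_000


# Hypothetical numbers: a $1,500 machine vs. an API charging $2 per million tokens
tokens = break_even_tokens(1500, 2.0)
print(f"Break-even after about {tokens / 1e6:.0f} million tokens")
```

Below that volume the cloud API is cheaper. Above it, every extra token is effectively free.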
The Stack
After experimenting, I settled on this architecture:
```text
+---------------------------------------------------------------+
|                    Offline AI Application                     |
+---------------------------------------------------------------+
|                                                               |
|    +-------------+     +-------------+     +-------------+    |
|    |  Text Gen   |     |   Speech    |     |   Vision    |    |
|    |    (LLM)    |     |     I/O     |     |  (Images)   |    |
|    +------+------+     +------+------+     +------+------+    |
|           |                   |                   |           |
|    +------+------+     +------+------+     +------+------+    |
|    |   Gemma 3   |     |   Whisper   |     |   SD 1.5    |    |
|    |  (Ollama)   |     |  + Kokoro   |     |  (ComfyUI)  |    |
|    +-------------+     +-------------+     +-------------+    |
|                                                               |
|    +-----------------------------------------------------+    |
|    |                   Embedding Layer                   |    |
|    |            EmbeddingGemma / local vec DB            |    |
|    +-----------------------------------------------------+    |
|                                                               |
+---------------------------------------------------------------+
```
Hardware Requirements
I started with too little RAM. Don’t make my mistake:
Minimum: 16GB RAM, 8GB VRAM (GTX 1080 / RTX 3060)
Recommended: 32GB RAM, 12-16GB VRAM (RTX 3080/3090 / 4070)
Apple Silicon: M1/M2/M3 with 16GB+ unified memory
Setting Up Ollama for Local LLM
First, I installed Ollama:
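```bash
# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Pull a model
# (if this tag is not found, pick a published size from the Ollama library, e.g. gemma3:4b)
ollama pull gemma3:8b

# Test it
ollama run gemma3:8b "What is 2+2?"
```
Then I built a Python wrapper: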
```python
import requests


class OfflineLLM:
    def __init__(self, model: str = "gemma3:8b", base_url: str = "http://localhost:11434"):
        self.model = model
        self.base_url = base_url

    def generate(self, prompt: str) -> str:
        """Generate text from a prompt"""
        response = requests.post(
            f"{self.base_url}/api/generate",
            json={
                "model": self.model,
                "prompt": prompt,
                "stream": False
            },
            timeout=60
        )
        response.raise_for_status()
        return response.json()["response"]

    def chat(self, messages: list) -> str:
        """Chat with conversation history"""
        response = requests.post(
            f"{self.base_url}/api/chat",
            json={
                "model": self.model,
                "messages": messages,
                "stream": False
            },
            timeout=60
        )
        response.raise_for_status()
        return response.json()["message"]["content"]


# Usage
llm = OfflineLLM()
response = llm.chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in one paragraph"}
])
print(response)
```
I tested it on a flight. It worked perfectly with no internet.
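Waiting for a complete reply can feel slow on local hardware. Ollama's /api/generate endpoint can also stream: with "stream": true it returns one JSON object per line, each carrying a "response" fragment, and the last one has "done": true. Here is a minimal sketch of consuming that stream. It assumes the same local server and model tag as the wrapper, the helper names are my own, and it uses only the standard library instead of requests so it has no extra dependencies.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON stream."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break  # the final object marks the end of the reply


def stream_generate(prompt: str, model: str = "gemma3:8b",
                    base_url: str = "http://localhost:11434") -> str:
    """Print fragments as they arrive and return the full reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": True})
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    parts = []
    with urllib.request.urlopen(request, timeout=120) as response:
        # The HTTP response object can be iterated line by line
        for token in iter_tokens(response):
            print(token, end="", flush=True)
            parts.append(token)
    return "".join(parts)
```

Keeping the parsing in iter_tokens, separate from the network call, means the parser can be tested without a running server.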
Adding Speech-to-Text with Whisper
Next, I needed offline transcription. Whisper was the answer:
```bash
# Whisper also needs ffmpeg installed and on the PATH to decode audio files
pip install openai-whisper
```
```python
import whisper


class SpeechToText:
    def __init__(self, model_size: str = "base"):
        # model_size options: tiny, base, small, medium, large
        # Larger = better accuracy, more VRAM needed
        self.model = whisper.load_model(model_size)

    def transcribe(self, audio_path: str) -> str:
        """Transcribe audio file to text"""
        result = self.model.transcribe(audio_path)
        return result["text"]

    def transcribe_with_timestamps(self, audio_path: str) -> list:
        """Transcribe with word-level timestamps"""
        result = self.model.transcribe(audio_path, word_timestamps=True)
        # Each segment has "start", "end" and "text", plus a "words" list
        # holding the per-word timings
        return result["segments"]


# Usage
stt = SpeechToText(model_size="base")
text = stt.transcribe("meeting_recording.wav")
print(f"Transcribed: {text}")
```
I tried the “tiny” model first. Accuracy was poor. Switching to “base” gave me good enough results for note-taking.
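For a note-taking app, the segment list is more useful once it is turned into readable, timestamped lines. The sketch below assumes segments shaped like Whisper's output (dicts with "start" in seconds and "text"). The function names and the note format are my own choices.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to HH:MM:SS, dropping fractions of a second."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def segments_to_notes(segments: list[dict]) -> str:
    """Render one '[HH:MM:SS] text' line per non-empty segment."""
    lines = []
    for segment in segments:
        text = segment["text"].strip()  # segment text usually starts with a space
        if text:
            lines.append(f"[{format_timestamp(segment['start'])}] {text}")
    return "\n".join(lines)
```

With the class above, this would be called as segments_to_notes(stt.transcribe_with_timestamps("meeting_recording.wav")).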
Adding Text-to-Speech with Kokoro
For voice output, I used Kokoro TTS:
```bash
pip install kokoro
```
```python
# NOTE: KokoroTTS, generate(), save() and to_bytes() are shown as used in this post.
# Check the documentation of your installed kokoro version, since its interface may differ.
from kokoro import KokoroTTS


class TextToSpeech:
    def __init__(self, model_name: str = "kokoro-base"):
        self.tts = KokoroTTS(model_name=model_name)

    def synthesize(self, text: str, output_path: str):
        """Convert text to speech and save to file"""
        audio = self.tts.generate(text)
        self.tts.save(audio, output_path)

    def synthesize_to_bytes(self, text: str) -> bytes:
        """Convert text to speech and return audio bytes"""
        audio = self.tts.generate(text)
        return self.tts.to_bytes(audio)


# Usage
tts = TextToSpeech()
tts.synthesize("Hello, this is offline text to speech", "output.wav")
```
Putting It All Together
Here’s my complete offline AI app:
```python
from dataclasses import dataclass
from typing import Optional

import requests
import whisper
# NOTE: KokoroTTS is used as shown in this post; adapt it to your installed kokoro version
from kokoro import KokoroTTS


@dataclass
class OfflineAIConfig:
    llm_model: str = "gemma3:8b"
    whisper_model: str = "base"
    tts_model: str = "kokoro-base"
    ollama_url: str = "http://localhost:11434"


class OfflineAIApp:
    def __init__(self, config: Optional[OfflineAIConfig] = None):
        self.config = config or OfflineAIConfig()

        # Initialize components
        self.llm_url = self.config.ollama_url
        print("Loading Whisper model...")
        self.whisper = whisper.load_model(self.config.whisper_model)
        print("Loading Kokoro TTS...")
        self.tts = KokoroTTS(model_name=self.config.tts_model)
        print("Offline AI ready!")

    def process_voice_query(self, audio_path: str, output_path: str) -> str:
        """Voice in -> Voice out, all offline"""
        # 1. Transcribe
        print("Transcribing...")
        result = self.whisper.transcribe(audio_path)
        query = result["text"]
        print(f"User said: {query}")

        # 2. Generate response
        print("Generating response...")
        response = self._call_llm(query)
        print(f"AI response: {response}")

        # 3. Synthesize
        print("Synthesizing speech...")
        audio = self.tts.generate(response)
        self.tts.save(audio, output_path)

        return response

    def _call_llm(self, prompt: str) -> str:
        response = requests.post(
            f"{self.llm_url}/api/generate",
            json={
                "model": self.config.llm_model,
                "prompt": prompt,
                "stream": False
            },
            timeout=120
        )
        response.raise_for_status()
        return response.json()["response"]


# Initialize
app = OfflineAIApp()

# Process a voice query completely offline
# response = app.process_voice_query("user_query.wav", "response.wav")
```
Quantization: The Key to Running on Consumer Hardware
My first attempt failed. Models were too big. Then I learned about quantization:
```bash
# Ollama serves pre-quantized builds, so there is no conversion step to run yourself.
# Exact tag names vary by model; check the model's tag list before pulling.

# 4-bit quantized (smaller, faster, slight quality loss)
ollama pull gemma3:8b-q4_0

# 8-bit quantized (larger, better quality)
ollama pull gemma3:8b-q8_0

# Compare sizes
# Unquantized (16-bit) 8B model: ~16GB
# q4_0 quantized: ~5GB
# q8_0 quantized: ~9GB
```
The quality difference between 4-bit and 8-bit? Barely noticeable for most tasks. I use 4-bit for everything.
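The sizes above follow from simple arithmetic: weight size is roughly parameter count times bits per weight, divided by 8. Here is a small sketch of that estimate. The function names and the 1.5GB overhead allowance (context cache, runtime buffers) are my own assumptions, not measured values.

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits_in_vram(params_billions: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """Rough check: weights plus an assumed runtime overhead must fit in VRAM."""
    return estimate_weights_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb


# An 8B model at 16, 8.5 and 4.5 bits per weight
for bits in (16, 8.5, 4.5):
    print(f"{bits:>4} bits: ~{estimate_weights_gb(8, bits):.1f} GB")
```

The 4.5 and 8.5 figures reflect that q4_0 and q8_0 store a small scale value alongside each block of weights, which is why real files come out a little larger than a pure 4-bit or 8-bit calculation suggests.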
Common Mistakes I Made
Mistake 1: Underestimating VRAM
```bash
# First attempt - crashed
ollama run gemma3:27b  # Needs 20GB+ VRAM

# Solution: Check VRAM first, use smaller models
ollama run gemma3:4b  # Works on 8GB VRAM
```
Mistake 2: Not Using Quantization
I downloaded full models at first. 16GB downloads for a single model. Quantized versions are 3-5GB with minimal quality loss.
Mistake 3: Expecting Cloud Performance
Local models on consumer hardware match early 2023 cloud models (GPT-3.5 level), not GPT-4. Set realistic expectations.
Mistake 4: Memory Leaks in Long-Running Apps
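```python
import gc

import torch  # installed as a dependency of openai-whisper
import whisper


# WRONG: Models stay loaded forever
class BadApp:
    def __init__(self):
        self.model = whisper.load_model("large")  # Never unloaded


# CORRECT: Load/unload as needed
class GoodApp:
    def __init__(self):
        self._model = None

    @property
    def model(self):
        # Load lazily on first access
        if self._model is None:
            self._model = whisper.load_model("base")
        return self._model

    def unload_model(self):
        """Free memory when not in use"""
        self._model = None
        gc.collect()
        if torch.cuda.is_available():
            # Return cached GPU memory that PyTorch is still holding
            torch.cuda.empty_cache()
```
Integration Patterns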
For different platforms, I use different approaches:
Desktop (Electron/Tauri): Ollama runs locally on localhost:11434. Models stored on disk. SQLite for data.
Mobile (iOS): Core ML with models converted to Core ML format (for example from ONNX or PyTorch) and shipped quantized as compiled .mlmodelc bundles. On-device storage only.
Browser (WASM): WebLLM with WebGPU for inference. IndexedDB for caching. Limited to smaller models.
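For the desktop pattern, the "SQLite for data" part needs nothing beyond Python's standard library. Here is a minimal sketch of a local notes store. The table layout and function names are hypothetical, chosen for a note-taking app like mine.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional


def open_notes_db(path: str = "notes.db") -> sqlite3.Connection:
    """Open (or create) the local database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS notes (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               created_at TEXT NOT NULL,
               transcript TEXT NOT NULL,
               summary TEXT
           )"""
    )
    conn.commit()
    return conn


def save_note(conn: sqlite3.Connection, transcript: str,
              summary: Optional[str] = None) -> int:
    """Insert one note and return its row id."""
    created_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT INTO notes (created_at, transcript, summary) VALUES (?, ?, ?)",
            (created_at, transcript, summary),
        )
    return cursor.lastrowid


def search_notes(conn: sqlite3.Connection, keyword: str) -> list[tuple]:
    """Return (id, transcript) pairs whose transcript contains the keyword."""
    return conn.execute(
        "SELECT id, transcript FROM notes WHERE transcript LIKE ? ORDER BY id",
        (f"%{keyword}%",),
    ).fetchall()
```

Passing ":memory:" as the path gives a throwaway database, which is handy for tests. Everything stays in a single file on the user's disk, which fits the privacy goal.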
What Actually Works
After building several offline AI apps, here’s what I recommend:
For beginners: Start with Ollama + Whisper. Add ComfyUI if you need images.
For production: Use quantized models. Test thoroughly. Monitor memory.
For mobile: Core ML (iOS) or ML Kit (Android). ONNX models need either ONNX Runtime or conversion to the platform's native format first.
The tradeoff is clear: upfront hardware cost versus zero marginal costs, complete privacy versus convenience, local control versus managed services. For privacy-sensitive applications or high-volume use cases, offline AI is worth the investment.
Summary
In this post, I showed how to build AI applications that run completely offline. The key components are Ollama for LLMs, Whisper for speech-to-text, Kokoro for text-to-speech, and ComfyUI for images. The trick is using quantized models to fit on consumer hardware. Performance now matches early 2023 cloud solutions - good enough for most applications.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Reddit: Underrated Vibe-Coded Projects
- Ollama
- OpenAI Whisper
- SherpaOnnx
- ComfyUI
- Kokoro TTS
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!