How to Build Offline AI Apps That Work Without Internet

Mar 22, 2026

Problem

I was building an AI-powered note-taking app when I hit a wall: every voice memo had to be uploaded to OpenAI’s servers for transcription. My users had privacy concerns. Plus, API costs were stacking up fast.

I needed an offline solution. Here’s what I discovered.

Why Go Offline?

Three reasons pushed me toward local AI:

Privacy: Sensitive data never leaves the device. No API provider data retention policies to worry about.

Cost: Cloud AI APIs charge per token, so heavy usage gets expensive. Offline means a fixed hardware cost and near-zero marginal cost (electricity aside).

Reliability: Works in airplanes, remote locations, during outages. No API rate limits.

The Stack

After experimenting, I settled on this architecture:

+---------------------------------------------------------------+
|                    Offline AI Application                     |
+---------------------------------------------------------------+
|                                                               |
|  +-------------+  +-------------+  +-------------+            |
|  |  Text Gen   |  |   Speech    |  |   Vision    |            |
|  |   (LLM)     |  |    I/O      |  |  (Images)   |            |
|  +------+------+  +------+------+  +------+------+            |
|         |                |                |                   |
|  +------+------+  +------+------+  +------+------+            |
|  |   Gemma 3   |  |   Whisper   |  |   SD 1.5    |            |
|  |  (Ollama)   |  |  + Kokoro   |  |  (ComfyUI)  |            |
|  +-------------+  +-------------+  +-------------+            |
|                                                               |
|  +--------------------------------------------------------+   |
|  |                  Embedding Layer                       |   |
|  |              EmbeddingGemma / local vec DB             |   |
|  +--------------------------------------------------------+   |
|                                                               |
+---------------------------------------------------------------+
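The embedding layer in the diagram is not covered in the rest of this post, so here is a minimal sketch of what it could look like. It assumes Ollama's /api/embed endpoint and an embedding model pulled under the tag "embeddinggemma" (check the tag against your local install). The "vector DB" is just an in-memory list with cosine similarity, and the names TinyVectorStore and embed_with_ollama are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]

def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def embed_with_ollama(text: str, model: str = "embeddinggemma",
                      base_url: str = "http://localhost:11434") -> List[float]:
    """Ask a local Ollama server for one embedding (model tag is an assumption)."""
    import requests  # imported here so the rest of the sketch has no dependencies
    resp = requests.post(
        f"{base_url}/api/embed",
        json={"model": model, "input": text},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["embeddings"][0]

class TinyVectorStore:
    """In-memory stand-in for a local vector DB: brute-force cosine search."""

    def __init__(self, embed: Callable[[str], Vector]):
        self._embed = embed
        self._items: List[Tuple[str, Vector]] = []

    def add(self, text: str) -> None:
        self._items.append((text, self._embed(text)))

    def search(self, query: str, k: int = 3) -> List[Tuple[float, str]]:
        """Return up to k (score, text) pairs, best match first."""
        q = self._embed(query)
        scored = [(cosine_similarity(q, vec), text) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

With Ollama running, TinyVectorStore(embed_with_ollama) gives you semantic search over notes. Brute-force search is fine for a few thousand notes; beyond that a real local vector index is the better choice.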

Hardware Requirements

I started with too little RAM. Don’t make my mistake:

Minimum: 16GB RAM, 8GB VRAM (GTX 1080 / RTX 3060)
Recommended: 32GB RAM, 12GB+ VRAM (RTX 3080 12GB / RTX 3090 / RTX 4070)
Apple Silicon: M1/M2/M3 with 16GB+ unified memory

Setting Up Ollama for Local LLM

First, I installed Ollama:

# Linux (on macOS, install the desktop app from ollama.com or run: brew install ollama)
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (Gemma 3 is published in 1b, 4b, 12b, and 27b sizes)
ollama pull gemma3:4b

# Test it
ollama run gemma3:4b "What is 2+2?"

Then I built a Python wrapper:

import requests

class OfflineLLM:
    def __init__(self, model: str = "gemma3:4b", base_url: str = "http://localhost:11434"):
        self.model = model
        self.base_url = base_url

    def generate(self, prompt: str) -> str:
        """Generate text from a prompt"""
        response = requests.post(
            f"{self.base_url}/api/generate",
            json={
                "model": self.model,
                "prompt": prompt,
                "stream": False
            },
            timeout=60
        )
        response.raise_for_status()
        return response.json()["response"]

    def chat(self, messages: list) -> str:
        """Chat with conversation history"""
        response = requests.post(
            f"{self.base_url}/api/chat",
            json={
                "model": self.model,
                "messages": messages,
                "stream": False
            },
            timeout=60
        )
        response.raise_for_status()
        return response.json()["message"]["content"]

# Usage
llm = OfflineLLM()
response = llm.chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in one paragraph"}
])
print(response)

I tested it on a flight. It worked perfectly with no internet.
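Local generation can take a while on modest hardware, so streaming tokens as they arrive makes the app feel much faster. The following is a minimal sketch, assuming Ollama's streaming mode for /api/generate, which sends one JSON object per line with a "response" fragment and a final "done": true. The function names are hypothetical, and the model tag is left as a required argument.

```python
import json
from typing import Iterable, Iterator

def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from newline-delimited JSON chunks."""
    for raw in lines:
        if not raw:  # skip keep-alive blank lines
            continue
        chunk = json.loads(raw)
        if "error" in chunk:
            raise RuntimeError(chunk["error"])
        piece = chunk.get("response", "")
        if piece:
            yield piece
        if chunk.get("done"):
            break

def stream_generate(prompt: str, model: str,
                    base_url: str = "http://localhost:11434") -> Iterator[str]:
    """Stream a completion from a local Ollama server, fragment by fragment."""
    import requests  # imported here so the parser above stays dependency-free
    with requests.post(
        f"{base_url}/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
        timeout=120,
    ) as resp:
        resp.raise_for_status()
        yield from iter_stream_text(resp.iter_lines())
```

Usage would look like: for piece in stream_generate("Summarize my notes", model=llm.model): print(piece, end="", flush=True). Keeping the parser separate from the HTTP call makes it easy to test without a running server.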

Adding Speech-to-Text with Whisper

Next, I needed offline transcription. Whisper was the answer:

# Whisper decodes audio through the ffmpeg command-line tool, so install that as well
pip install openai-whisper

import whisper

class SpeechToText:
    def __init__(self, model_size: str = "base"):
        # model_size options: tiny, base, small, medium, large
        # Larger = better accuracy, more VRAM needed
        self.model = whisper.load_model(model_size)

    def transcribe(self, audio_path: str) -> str:
        """Transcribe audio file to text"""
        result = self.model.transcribe(audio_path)
        return result["text"]

    def transcribe_with_timestamps(self, audio_path: str) -> list:
        """Transcribe and return segments; each segment has a "words" list with per-word timestamps"""
        result = self.model.transcribe(audio_path, word_timestamps=True)
        return result["segments"]

# Usage
stt = SpeechToText(model_size="base")
text = stt.transcribe("meeting_recording.wav")
print(f"Transcribed: {text}")

I tried the “tiny” model first. Accuracy was poor. Switching to “base” gave me good enough results for note-taking.
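For a note-taking app, the segments are more useful than the raw text because they let you jump back to the right spot in a recording. Here is a small sketch that turns Whisper-style segments (dicts with "start", "end", and "text" keys, as returned in result["segments"]) into timestamped note lines. The helper names are hypothetical.

```python
from typing import Dict, List

def format_timestamp(seconds: float) -> str:
    """Format a second offset as HH:MM:SS (fractions are dropped)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

def segments_to_notes(segments: List[Dict]) -> str:
    """Build one '[HH:MM:SS] text' line per non-empty segment."""
    lines = []
    for segment in segments:
        text = segment["text"].strip()  # Whisper text usually has a leading space
        if text:
            lines.append(f"[{format_timestamp(segment['start'])}] {text}")
    return "\n".join(lines)
```

You would call it as segments_to_notes(stt.transcribe_with_timestamps("meeting_recording.wav")).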

Adding Text-to-Speech with Kokoro

For voice output, I used Kokoro TTS:

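pip install kokoro soundfile

import io
import numpy as np
import soundfile as sf
from kokoro import KPipeline

SAMPLE_RATE = 24000  # Kokoro outputs 24 kHz audio

class TextToSpeech:
    def __init__(self, lang_code: str = "a", voice: str = "af_heart"):
        # lang_code "a" is American English; voice picks one of the bundled voices
        self.pipeline = KPipeline(lang_code=lang_code)
        self.voice = voice

    def _generate(self, text: str) -> np.ndarray:
        """Run the pipeline and join the per-chunk audio into one array"""
        # The pipeline yields (graphemes, phonemes, audio) for each chunk of text
        chunks = [np.asarray(audio) for _, _, audio in self.pipeline(text, voice=self.voice)]
        return np.concatenate(chunks)

    def synthesize(self, text: str, output_path: str):
        """Convert text to speech and save to file"""
        sf.write(output_path, self._generate(text), SAMPLE_RATE)

    def synthesize_to_bytes(self, text: str) -> bytes:
        """Convert text to speech and return WAV bytes"""
        buffer = io.BytesIO()
        sf.write(buffer, self._generate(text), SAMPLE_RATE, format="WAV")
        return buffer.getvalue()

# Usage
tts = TextToSpeech()
tts.synthesize("Hello, this is offline text to speech", "output.wav")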

Putting It All Together

Here’s my complete offline AI app:

from dataclasses import dataclass
from typing import Optional
import numpy as np
import requests
import soundfile as sf
import whisper
from kokoro import KPipeline

@dataclass
class OfflineAIConfig:
    llm_model: str = "gemma3:4b"
    whisper_model: str = "base"
    tts_lang_code: str = "a"  # Kokoro language code ("a" = American English)
    tts_voice: str = "af_heart"
    ollama_url: str = "http://localhost:11434"

class OfflineAIApp:
    def __init__(self, config: Optional[OfflineAIConfig] = None):
        self.config = config or OfflineAIConfig()

        # Initialize components
        self.llm_url = self.config.ollama_url
        print("Loading Whisper model...")
        self.whisper = whisper.load_model(self.config.whisper_model)
        print("Loading Kokoro TTS...")
        self.tts = KPipeline(lang_code=self.config.tts_lang_code)
        print("Offline AI ready!")

    def process_voice_query(self, audio_path: str, output_path: str) -> str:
        """Voice in -> Voice out, all offline"""
        # 1. Transcribe
        print("Transcribing...")
        result = self.whisper.transcribe(audio_path)
        query = result["text"].strip()
        print(f"User said: {query}")

        # 2. Generate response
        print("Generating response...")
        response = self._call_llm(query)
        print(f"AI response: {response}")

        # 3. Synthesize (Kokoro yields one 24 kHz audio chunk per piece of text)
        print("Synthesizing speech...")
        chunks = [np.asarray(audio) for _, _, audio in self.tts(response, voice=self.config.tts_voice)]
        sf.write(output_path, np.concatenate(chunks), 24000)

        return response

    def _call_llm(self, prompt: str) -> str:
        response = requests.post(
            f"{self.llm_url}/api/generate",
            json={
                "model": self.config.llm_model,
                "prompt": prompt,
                "stream": False
            },
            timeout=120
        )
        response.raise_for_status()
        return response.json()["response"]

# Initialize
app = OfflineAIApp()

# Process a voice query completely offline
# response = app.process_voice_query("user_query.wav", "response.wav")
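One thing the app above does not do is check that Ollama is actually running before it spends time loading Whisper and Kokoro. A minimal preflight sketch follows, assuming Ollama's /api/tags endpoint, which lists locally pulled models as {"models": [{"name": ...}, ...]}. The helper names are hypothetical.

```python
from typing import Any, Dict, List

def installed_models(tags_payload: Dict[str, Any]) -> List[str]:
    """Extract model names from an /api/tags response body."""
    return [model["name"] for model in tags_payload.get("models", [])]

def has_model(tags_payload: Dict[str, Any], wanted: str) -> bool:
    """True if the wanted tag is pulled; a bare name is treated as ':latest'."""
    if ":" not in wanted:
        wanted = f"{wanted}:latest"
    return wanted in installed_models(tags_payload)

def check_ollama(model: str, base_url: str = "http://localhost:11434") -> None:
    """Raise a readable error if the server is down or the model is missing."""
    import requests  # imported here so the helpers above stay dependency-free
    try:
        resp = requests.get(f"{base_url}/api/tags", timeout=5)
        resp.raise_for_status()
    except requests.RequestException as exc:
        raise RuntimeError(f"Ollama is not reachable at {base_url}: {exc}") from exc
    if not has_model(resp.json(), model):
        raise RuntimeError(f"Model {model!r} is not pulled; run: ollama pull {model}")
```

Calling check_ollama(self.config.llm_model, self.config.ollama_url) at the top of OfflineAIApp.__init__ would fail fast with a clear message instead of a connection error on the first query.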

Quantization: The Key to Running on Consumer Hardware

My first attempt failed. Models were too big. Then I learned about quantization:

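# Ollama publishes pre-quantized builds under separate tags, and the default
# tag is usually already 4-bit. Exact tag names vary by model, so check the
# model's Tags page in the Ollama library before pulling.
# 4-bit quantized (smaller, faster, slight quality loss)
ollama pull gemma3:4b-it-q4_K_M

# 8-bit quantized (larger, better quality)
ollama pull gemma3:4b-it-q8_0

# Rough sizes for a 4B-parameter model (run "ollama list" for the real numbers)
# 16-bit weights: ~8GB
# 4-bit quantized: ~3GB
# 8-bit quantized: ~5GB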

The quality difference between 4-bit and 8-bit? Barely noticeable for most tasks. I use 4-bit for everything.
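The sizes follow from simple arithmetic: weight storage is roughly parameter count times bits per weight, divided by 8. Here is a back-of-the-envelope sketch, using an 8-billion-parameter model as the worked example. The function names and the headroom figure are hypothetical. In the GGUF formats, q4_0 and q8_0 store a 16-bit scale per block of 32 weights, which works out to about 4.5 and 8.5 bits per weight.

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

def fits_in_vram(weights_gb: float, vram_gb: float, headroom_gb: float = 1.5) -> bool:
    """Leave headroom for the KV cache and runtime buffers (1.5 GB is a guess)."""
    return weights_gb + headroom_gb <= vram_gb

# An 8B-parameter model at different precisions
fp16_gb = estimate_weights_gb(8, 16)   # 16.0
q8_gb = estimate_weights_gb(8, 8.5)    # 8.5
q4_gb = estimate_weights_gb(8, 4.5)    # 4.5
```

The estimate covers weights only. Long contexts grow the KV cache, so treat the headroom as a starting point and measure on your own hardware.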

Common Mistakes I Made

Mistake 1: Underestimating VRAM

# First attempt - crashed
ollama run gemma3:27b  # Needs 20GB+ VRAM

# Solution: Check VRAM first, use smaller models
ollama run gemma3:4b   # Works on 8GB VRAM
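Checking VRAM can be automated. The sketch below assumes an NVIDIA GPU and the nvidia-smi query flags --query-gpu=memory.total,memory.used with --format=csv,noheader,nounits, which print one "total, used" line per GPU in MiB. The function names and the per-model thresholds passed to pick_model are hypothetical and should be tuned to what you measure.

```python
import subprocess
from typing import List, Optional, Tuple

def parse_free_vram_mib(nvidia_smi_output: str) -> List[int]:
    """Turn 'total, used' CSV lines into free MiB per GPU."""
    free = []
    for line in nvidia_smi_output.strip().splitlines():
        total, used = (int(part.strip()) for part in line.split(","))
        free.append(total - used)
    return free

def query_free_vram_mib() -> Optional[List[int]]:
    """Return free MiB per GPU, or None if nvidia-smi is missing or fails."""
    try:
        output = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total,memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
    except (FileNotFoundError, subprocess.SubprocessError):
        return None
    return parse_free_vram_mib(output)

def pick_model(free_mib: int, candidates: List[Tuple[str, int]]) -> Optional[str]:
    """Pick the first candidate whose required MiB fits; list largest first."""
    for tag, required_mib in candidates:
        if required_mib <= free_mib:
            return tag
    return None
```

For example, pick_model(free, [("gemma3:27b", 20480), ("gemma3:4b", 6144)]) falls back to the smaller model on an 8GB card and returns None when nothing fits, which is a good moment to tell the user rather than crash.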

Mistake 2: Not Using Quantization

I downloaded full-precision models at first: 16GB for a single model. Quantized versions are 3-5GB with minimal quality loss.

Mistake 3: Expecting Cloud Performance

In my experience, local models on consumer hardware are closer to early-2023 cloud models (roughly GPT-3.5 level) than to GPT-4. Set realistic expectations.

Mistake 4: Memory Leaks in Long-Running Apps

import gc
import torch
import whisper

# WRONG: Model stays loaded for the life of the process
class BadApp:
    def __init__(self):
        self.model = whisper.load_model("large")  # Never unloaded

# CORRECT: Load on first use, unload when idle
class GoodApp:
    def __init__(self):
        self._model = None

    @property
    def model(self):
        if self._model is None:
            self._model = whisper.load_model("base")
        return self._model

    def unload_model(self):
        """Free memory when not in use"""
        self._model = None
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # Hand cached GPU memory back to the driver
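The load/unload pattern still needs something to decide when to unload. One option is an idle timeout. The sketch below is a generic wrapper, with the hypothetical name IdleUnloader, that reloads on demand and drops the object once it has gone unused for a set number of seconds. The clock is injectable so the behavior can be tested without sleeping.

```python
import gc
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")

class IdleUnloader(Generic[T]):
    """Lazily load a heavy object and release it after a period of inactivity."""

    def __init__(self, loader: Callable[[], T], idle_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._obj: Optional[T] = None
        self._last_used = 0.0

    @property
    def loaded(self) -> bool:
        return self._obj is not None

    def get(self) -> T:
        """Return the object, loading it first if needed, and mark it as used."""
        if self._obj is None:
            self._obj = self._loader()
        self._last_used = self._clock()
        return self._obj

    def maybe_unload(self) -> bool:
        """Drop the object if it has been idle long enough; True if unloaded."""
        if self._obj is None:
            return False
        if self._clock() - self._last_used < self._idle_seconds:
            return False
        self._obj = None
        gc.collect()
        return True
```

You would wrap the model as IdleUnloader(lambda: whisper.load_model("base"), idle_seconds=300) and call maybe_unload() from a periodic timer. Ollama has its own version of this: requests accept a keep_alive field that controls how long a model stays in memory after a call.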

Integration Patterns

For different platforms, I use different approaches:

Desktop (Electron/Tauri): Ollama runs locally on localhost:11434. Models stored on disk. SQLite for data.

Mobile (iOS): Core ML with models converted to Core ML format and compiled to .mlmodelc, or ONNX Runtime if you want to ship ONNX models directly. Quantized models. On-device storage only.

Browser (WASM): WebLLM with WebGPU for inference. IndexedDB for caching. Limited to smaller models.
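For the desktop pattern, the SQLite side can be very small. Here is a minimal sketch using Python's built-in sqlite3 module to store transcripts next to the path of the original recording. The table layout and function names are hypothetical, not the schema from my app.

```python
import sqlite3
from typing import List, Optional, Tuple

def open_notes_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the notes database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS notes (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            created_at TEXT NOT NULL DEFAULT (datetime('now')),
            audio_path TEXT,
            transcript TEXT NOT NULL
        )
        """
    )
    conn.commit()
    return conn

def save_note(conn: sqlite3.Connection, transcript: str,
              audio_path: Optional[str] = None) -> int:
    """Insert one note and return its row id."""
    cursor = conn.execute(
        "INSERT INTO notes (transcript, audio_path) VALUES (?, ?)",
        (transcript, audio_path),
    )
    conn.commit()
    return cursor.lastrowid

def search_notes(conn: sqlite3.Connection, term: str) -> List[Tuple[int, str]]:
    """Substring search over transcripts, oldest first."""
    rows = conn.execute(
        "SELECT id, transcript FROM notes WHERE transcript LIKE ? ORDER BY id",
        (f"%{term}%",),
    )
    return rows.fetchall()
```

Passing a file path such as "notes.db" instead of ":memory:" keeps everything on disk and on the device, which is the whole point of going offline.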

What Actually Works

After building several offline AI apps, here’s what I recommend:

For beginners: Start with Ollama + Whisper. Add ComfyUI if you need images.

For production: Use quantized models. Test thoroughly. Monitor memory.

For mobile: Core ML (iOS) or TensorFlow Lite models via ML Kit (Android), or ONNX Runtime on either platform if your models are in ONNX format.

The tradeoff is clear: upfront hardware cost versus zero marginal costs, complete privacy versus convenience, local control versus managed services. For privacy-sensitive applications or high-volume use cases, offline AI is worth the investment.

Summary

In this post, I showed how to build AI applications that run completely offline. The key components are Ollama for LLMs, Whisper for speech-to-text, Kokoro for text-to-speech, and ComfyUI for images. The trick is using quantized models to fit on consumer hardware. In my experience, quality is now roughly on par with early-2023 cloud models, which is good enough for many applications.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Reddit: Underrated Vibe-Coded Projects
👨‍💻 Ollama
👨‍💻 OpenAI Whisper
👨‍💻 SherpaOnnx
👨‍💻 ComfyUI
👨‍💻 Kokoro TTS

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!