For a while now, I’ve been thinking about the idea of a voice assistant that doesn’t live in the cloud. Something fast, privacy-respecting, and fully offline. I have yet to use one that works at a truly conversational pace.
So I built one. Or rather, I built version one. It works, but it isn’t fast yet.
This first version is just about learning and experimenting: I was able to record audio, transcribe it, run it through a local LLM, and speak the response back—all without leaving my machine. But now I need to go back and optimize.
The Goal: Speedy, Local, Conversational AI
The core goal was simple:
Speak to my computer → Get an intelligent, spoken response → Do it all fast.
In reality, I hit a few performance roadblocks—but the structure is there, and the tools are all local. I wanted to practice chaining together multiple AI tools to build something semi-cohesive and voice-based.
🛠️ What’s Under the Hood?
- Audio recording with arecord
- Transcription with Whisper.cpp
- Language model with LLaMA.cpp
- Speech synthesis with Coqui TTS (Tacotron2)
- Audio playback with aplay
 
Each component runs offline, stitched together with Python and shell scripts.
Local LLM with LLaMA.cpp
I used tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf, a tiny quantized model served by llama.cpp. It’s not the smartest LLM around, but it runs fast-ish on CPU and supports chat-style completions over a local server. The point of this version was execution, not polish, so the model choice was a quick decision.
You send a JSON payload to the locally running server, and it returns a reply.
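For reference, the request from Python looks roughly like this. The port, the TinyLlama chat template, and the sampling parameters are assumptions from my setup; llama.cpp’s built-in server accepts a JSON body on its /completion endpoint and returns the generated text in the content field.

```python
import requests

LLAMA_URL = "http://127.0.0.1:8080/completion"  # llama.cpp server running locally


def ask_llm(prompt: str) -> str:
    """Send the transcribed prompt to the local llama.cpp server and return its reply."""
    payload = {
        # TinyLlama was trained with a Zephyr-style chat template
        "prompt": f"<|user|>\n{prompt}</s>\n<|assistant|>\n",
        "n_predict": 128,              # cap the reply length to keep latency down
        "temperature": 0.7,
        "stop": ["</s>", "<|user|>"],  # stop before the model starts a new turn
    }
    response = requests.post(LLAMA_URL, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["content"].strip()
```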
Voice In, Words Out – Whisper.cpp
Whisper handles audio transcription. I record 5 seconds of audio using arecord, then pass that WAV file to Whisper.
It spits out a .txt file with the transcript, which becomes the prompt for the LLM.
It’s decently fast and very accurate, but still introduces a few seconds of delay.
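Sketched in Python, that step looks something like the following. The binary and model paths are from my setup, not anything standard; whisper.cpp wants 16 kHz, 16-bit mono WAV, which is why arecord gets those flags.

```python
import subprocess
from pathlib import Path

WHISPER_BIN = "./whisper.cpp/main"                       # whisper.cpp CLI binary
WHISPER_MODEL = "./whisper.cpp/models/ggml-base.en.bin"  # English base model


def record_and_transcribe(wav_path: str = "input.wav", seconds: int = 5) -> str:
    """Record a fixed-length clip with arecord, then transcribe it with whisper.cpp."""
    # whisper.cpp expects 16-bit, 16 kHz, mono WAV input
    subprocess.run(
        ["arecord", "-d", str(seconds), "-f", "S16_LE", "-r", "16000", "-c", "1",
         "-t", "wav", wav_path],
        check=True,
    )
    # -otxt writes a plain-text transcript; -of sets the output file prefix
    subprocess.run(
        [WHISPER_BIN, "-m", WHISPER_MODEL, "-f", wav_path, "-otxt", "-of", "transcript"],
        check=True,
    )
    return Path("transcript.txt").read_text().strip()
```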
Text to Speech with TTS (Tacotron2)
Once I get the LLM’s reply, I pass it to Coqui’s TTS library, which converts it into a WAV file.
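A minimal sketch of that step, assuming Coqui’s standard Python API and the LJSpeech Tacotron2 model (the exact model name and the aplay playback are from my setup):

```python
import subprocess

from TTS.api import TTS

# Loading the Tacotron2 model is slow, so do it once at startup
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")


def speak(text: str, wav_path: str = "reply.wav") -> None:
    """Synthesize the reply to a WAV file, then play it back with aplay."""
    tts.tts_to_file(text=text, file_path=wav_path)
    subprocess.run(["aplay", wav_path], check=True)
```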
And here’s one of my first critiques: this step is slow. Synthesizing the full response into a WAV file before playback introduces noticeable delay, especially for longer outputs. Responses take around five seconds, which isn’t terrible, but it’s not conversational pace.
The Loop
Here’s how the assistant works in a loop:
- Record 5s of mic audio
- Transcribe with Whisper
- Send prompt to LLaMA server
- Convert response to speech
- Play response aloud
 
It’s satisfying when it works—feels like magic—but the loop takes several seconds to complete.
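Stitched together, the main loop is roughly the following. It reuses the record_and_transcribe, ask_llm, and speak helpers sketched above; those names are mine, not anything the libraries provide.

```python
import time


def main() -> None:
    """One round trip per iteration: listen, think, speak."""
    while True:
        print("Listening...")
        prompt = record_and_transcribe()  # 5s of mic audio -> transcript text
        if not prompt:
            continue                      # nothing was recognized, listen again
        print(f"You said: {prompt}")

        start = time.time()
        reply = ask_llm(prompt)           # round trip to the local llama.cpp server
        print(f"Reply in {time.time() - start:.1f}s: {reply}")

        speak(reply)                      # Coqui TTS -> WAV -> aplay


if __name__ == "__main__":
    main()
```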
Known Issues and Limitations (v1)
- Hallucinations – TinyLlama sometimes goes off-topic or invents facts. Response quality is inconsistent.
- TTS latency – Synthesizing and playing full WAVs creates lag. Ideally, audio would stream back progressively.
- Not actually fast – Despite my goal, it takes ~5–10 seconds per round trip, depending on the length of the input and output.
- Static timing – Audio recording is fixed at 5 seconds. I’d prefer it to stop when I stop speaking.
 
🚀 V2 Goals
I’m thinking about the next version. Here’s what I want to focus on:
- Stream audio output instead of generating a full WAV file first.
- Smarter prompt formatting to reduce hallucinations.
- Faster model – either a better quantization or a bigger model with GPU acceleration.
- Voice activity detection for smarter recording.
- Interactive back-and-forth – respond while listening.
 
Continued Ideas
Linking this in with the knowledge worker code would be great for having a conversation with a model fine-tuned on specific data.
🧱 Build It Yourself
Check out the README.md for instructions: https://github.com/CodeJonesW/local-voice-assistant/tree/e3b619b7dc5bfd84ca7cae0b71dc1b4661dfff17