For a while now, I’ve been thinking about the idea of a voice assistant that doesn’t live in the cloud. Something fast, privacy-respecting, and fully offline. I have yet to use one that worked at a truly conversational pace.
So I built one. Or rather, I built version one, and while it works, it's not fast yet.
This first version is just about learning and experimenting: I was able to record audio, transcribe it, run it through a local LLM, and speak the response back—all without leaving my machine. But now I need to go back and optimize.
The Goal: Speedy, Local, Conversational AI
The core goal was simple:
Speak to my computer → Get an intelligent, spoken response → Do it all fast.
In reality, I hit a few performance roadblocks—but the structure is there, and the tools are all local. I wanted to practice chaining together multiple AI tools to build something semi-cohesive and voice-based.
🛠️ What’s Under the Hood?
- Audio recording with arecord
- Transcription with Whisper.cpp
- Language model with LLaMA.cpp
- Speech synthesis with TTS (Tacotron2)
- Audio playback with aplay
Each component runs offline, stitched together with Python and shell scripts.
Local LLM with LLaMA.cpp
I used tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf, a tiny quantized model served by llama.cpp. It’s not the smartest LLM around, but it runs fast-ish on CPU and supports chat-style completions over a local server. The priority here was getting something working quickly rather than picking the best model, so it was a quick decision.
You send a JSON payload to the locally running server, and it returns a reply.
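Roughly, the request looks like this. This is a minimal sketch, assuming llama.cpp's bundled HTTP server is running on its default port 8080 and that TinyLlama's Zephyr-style chat template applies; the exact prompt format and sampling parameters are tweakable:

```python
import requests

def ask_llm(prompt: str) -> str:
    # Wrap the transcript in TinyLlama-chat's Zephyr-style template
    payload = {
        "prompt": f"<|user|>\n{prompt}</s>\n<|assistant|>\n",
        "n_predict": 128,      # cap output length to keep latency down
        "temperature": 0.7,
    }
    resp = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["content"].strip()
```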
Voice In, Words Out – Whisper.cpp
Whisper handles audio transcription. I record 5 seconds of audio using arecord, then pass that WAV file to Whisper.
It spits out a .txt file with the transcript, which becomes the prompt for the LLM.
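Stitched together in Python, that step looks something like this. The binary and model paths below are placeholders for wherever whisper.cpp and its GGML model live on your machine:

```python
import subprocess
from pathlib import Path

def record_and_transcribe(wav_path: str = "input.wav") -> str:
    # whisper.cpp wants 16 kHz, 16-bit, mono WAV input
    subprocess.run(
        ["arecord", "-d", "5", "-f", "S16_LE", "-r", "16000", "-c", "1", wav_path],
        check=True,
    )
    # -otxt writes a plain-text transcript next to the basename given by -of
    subprocess.run(
        ["./whisper.cpp/main", "-m", "./whisper.cpp/models/ggml-base.en.bin",
         "-f", wav_path, "-otxt", "-of", "transcript"],
        check=True,
    )
    return Path("transcript.txt").read_text().strip()
```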
It’s decently fast and very accurate, but still introduces a few seconds of delay.
Text to Speech with TTS (Tacotron2)
Once I get the LLM’s reply, I pass it to Coqui’s TTS library, which converts it into a WAV file:
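Something like this, using Coqui's Python API with the Tacotron 2 LJSpeech model from their model zoo (the exact model name may differ from your setup):

```python
from TTS.api import TTS

# Load the model once at startup; model loading itself is slow
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

def synthesize(reply: str, out_path: str = "response.wav") -> str:
    # Synthesize the whole reply into a single WAV file before playing it
    tts.tts_to_file(text=reply, file_path=out_path)
    return out_path
```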
And here’s one of my first critiques: this is slow. Synthesizing the full audio into a WAV file introduces noticeable delay, especially for longer outputs. Responses take around 5 seconds for me, which is not terrible but nowhere near conversational pace.
The Loop
Here’s how the assistant works in a loop:
- Record 5s of mic audio
- Transcribe with Whisper
- Send prompt to LLaMA server
- Convert response to speech
- Play response aloud
It’s satisfying when it works—feels like magic—but the loop takes several seconds to complete.
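Putting it together, the loop is essentially the sketched helpers from above chained in order (the real scripts in the repo differ in the details):

```python
import subprocess

def main_loop() -> None:
    # record -> transcribe -> LLM -> TTS -> playback, forever
    while True:
        prompt = record_and_transcribe()   # arecord + whisper.cpp
        if not prompt:
            continue                       # nothing was said, listen again
        reply = ask_llm(prompt)            # local llama.cpp server
        wav = synthesize(reply)            # Coqui TTS (Tacotron 2)
        subprocess.run(["aplay", "-q", wav], check=True)
```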
Known Issues and Limitations (v1)
- Hallucinations – TinyLLaMA sometimes goes off-topic or invents facts. Response quality is inconsistent.
- TTS latency – Synthesizing and playing full WAVs creates lag. Ideally, audio would stream back progressively.
- Not actually fast – Despite my goal, it takes ~5–10 seconds per round trip, depending on the length of the input and output.
- Static timing – Audio recording is fixed to 5 seconds. Would prefer it to stop when I stop speaking.
🚀 V2 Goals
I’m thinking about the next version. Here’s what I want to focus on:
- Stream audio output instead of generating a full WAV file first (see the sketch after this list).
- Smarter prompt formatting to reduce hallucinations.
- Faster model – either a better quantization or swapping in a bigger one with GPU acceleration.
- Voice activity detection for smarter recording.
- Interactive back-and-forth — respond while listening.
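For the streaming idea, one rough approach (not implemented yet, just a sketch on top of the current Coqui setup) is to split the reply into sentences and synthesize and play each chunk as it's ready, so the first words come back sooner:

```python
import re
import subprocess
import tempfile

def speak_streaming(reply: str) -> None:
    # Split on sentence boundaries and play each chunk as soon as it's synthesized
    for sentence in re.split(r"(?<=[.!?])\s+", reply.strip()):
        if not sentence:
            continue
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
            chunk_path = f.name
        tts.tts_to_file(text=sentence, file_path=chunk_path)  # reuses the TTS object from above
        subprocess.run(["aplay", "-q", chunk_path], check=True)
```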
Continued Ideas
Linking this in with the knowledge worker code would be great for having a conversation with a model fine-tuned on specific data.
🧱 Build It Yourself
Check out the README.md for instructions > https://github.com/CodeJonesW/local-voice-assistant/tree/e3b619b7dc5bfd84ca7cae0b71dc1b4661dfff17