For a while now, I’ve been thinking about the idea of a voice assistant that doesn’t live in the cloud. Something fast, privacy-respecting, and fully offline. I have yet to use one that worked at a truly conversational pace.
So I built one. Or rather, I built version one, and while it works, it's not fast yet.
This first version is just about learning and experimenting: I was able to record audio, transcribe it, run it through a local LLM, and speak the response back—all without leaving my machine. But now I need to go back and optimize.
The Goal: Speedy, Local, Conversational AI
The core goal was simple:
Speak to my computer → Get an intelligent, spoken response → Do it all fast.
In reality, I hit a few performance roadblocks—but the structure is there, and the tools are all local. I wanted to practice chaining together multiple AI tools to build something semi-cohesive and voice-based.
🛠️ What’s Under the Hood?
- Audio recording with arecord
- Transcription with Whisper.cpp
- Language model with LLaMA.cpp
- Speech synthesis with TTS (Tacotron2)
- Audio playback with aplay
Each component runs offline, stitched together with Python and shell scripts.
Local LLM with LLaMA.cpp
I used tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf, a tiny quantized model served by llama.cpp. It’s not the smartest LLM around, but it runs fast-ish on CPU and supports chat-style completions over a local server. The priority here was getting something working quickly rather than picking the best model, so it was a quick decision.
You send a JSON payload to the locally running server, and it returns a reply.
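Roughly, the request looks like this. This is a minimal sketch, assuming llama.cpp's bundled HTTP server is running on its default port 8080 and that TinyLlama's Zephyr-style chat template applies; the exact prompt format and sampling parameters are tweakable:

```python
import requests

def ask_llm(prompt: str) -> str:
    # Wrap the transcript in TinyLlama-chat's Zephyr-style template
    payload = {
        "prompt": f"<|user|>\n{prompt}</s>\n<|assistant|>\n",
        "n_predict": 128,      # cap output length to keep latency down
        "temperature": 0.7,
    }
    resp = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["content"].strip()
```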
Voice In, Words Out – Whisper.cpp
Whisper handles audio transcription. I record 5 seconds of audio using arecord, then pass that WAV file to Whisper.
It spits out a .txt file with the transcript, which becomes the prompt for the LLM.
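Stitched together in Python, that step looks something like this. The binary and model paths below are placeholders for wherever whisper.cpp and its GGML model live on your machine:

```python
import subprocess
from pathlib import Path

def record_and_transcribe(wav_path: str = "input.wav") -> str:
    # whisper.cpp wants 16 kHz, 16-bit, mono WAV input
    subprocess.run(
        ["arecord", "-d", "5", "-f", "S16_LE", "-r", "16000", "-c", "1", wav_path],
        check=True,
    )
    # -otxt writes a plain-text transcript next to the basename given by -of
    subprocess.run(
        ["./whisper.cpp/main", "-m", "./whisper.cpp/models/ggml-base.en.bin",
         "-f", wav_path, "-otxt", "-of", "transcript"],
        check=True,
    )
    return Path("transcript.txt").read_text().strip()
```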
It’s decently fast and very accurate, but still introduces a few seconds of delay.
Text to Speech with TTS (Tacotron2)
Once I get the LLM’s reply, I pass it to Coqui’s TTS library, which converts it into a WAV file:
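Something like this, using Coqui's Python API with the Tacotron 2 LJSpeech model from their model zoo (the exact model name may differ from your setup):

```python
from TTS.api import TTS

# Load the model once at startup; model loading itself is slow
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

def synthesize(reply: str, out_path: str = "response.wav") -> str:
    # Synthesize the whole reply into a single WAV file before playing it
    tts.tts_to_file(text=reply, file_path=out_path)
    return out_path
```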
And here’s one of my first critiques: this is slow. Synthesizing the full audio into a WAV file introduces noticeable delay, especially for longer outputs. Responses take around 5 seconds for me, which is not terrible but nowhere near conversational pace.
The Loop
Here’s how the assistant works in a loop:
- Record 5s of mic audio
- Transcribe with Whisper
- Send prompt to LLaMA server
- Convert response to speech
- Play response aloud
It’s satisfying when it works—feels like magic—but the loop takes several seconds to complete.
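Putting it together, the loop is essentially the sketched helpers from above chained in order (the real scripts in the repo differ in the details):

```python
import subprocess

def main_loop() -> None:
    # record -> transcribe -> LLM -> TTS -> playback, forever
    while True:
        prompt = record_and_transcribe()   # arecord + whisper.cpp
        if not prompt:
            continue                       # nothing was said, listen again
        reply = ask_llm(prompt)            # local llama.cpp server
        wav = synthesize(reply)            # Coqui TTS (Tacotron 2)
        subprocess.run(["aplay", "-q", wav], check=True)
```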
Known Issues and Limitations (v1)
- Hallucinations – TinyLLaMA sometimes goes off-topic or invents facts. Response quality is inconsistent.
- TTS latency – Synthesizing and playing full WAVs creates lag. Ideally, audio would stream back progressively.
- Not actually fast – Despite my goal, it takes ~5–10 seconds per round trip, depending on the length of the input and output.
- Static timing – Audio recording is fixed to 5 seconds. Would prefer it to stop when I stop speaking.
🚀 V2 Goals
I’m thinking about the next version. Here’s what I want to focus on:
- Stream audio output instead of generating a full WAV file first (see the sketch after this list).
- Smarter prompt formatting to reduce hallucinations.
- Faster model – either a better quantization or swapping in a bigger one with GPU acceleration.
- Voice activity detection for smarter recording.
- Interactive back-and-forth — respond while listening.
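For the streaming idea, one rough approach (not implemented yet, just a sketch on top of the current Coqui setup) is to split the reply into sentences and synthesize and play each chunk as it's ready, so the first words come back sooner:

```python
import re
import subprocess
import tempfile

def speak_streaming(reply: str) -> None:
    # Split on sentence boundaries and play each chunk as soon as it's synthesized
    for sentence in re.split(r"(?<=[.!?])\s+", reply.strip()):
        if not sentence:
            continue
        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
            chunk_path = f.name
        tts.tts_to_file(text=sentence, file_path=chunk_path)  # reuses the TTS object from above
        subprocess.run(["aplay", "-q", chunk_path], check=True)
```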
Continued Ideas
Linking this in with the knowledge worker code would be great for having a conversation with a model fine-tuned on specific data.
🧱 Build It Yourself
Check out the README.md for instructions > https://github.com/CodeJonesW/local-voice-assistant/tree/e3b619b7dc5bfd84ca7cae0b71dc1b4661dfff17