Tonight was an exciting night for development in my world. I have been meaning to try the Codex feature on ChatGPT, and I was pleasantly surprised. There were a few bugs in the browser, but they were easy to ignore because the results were looking good. Cursor and v0.dev struggled to handle big projects or integrate into existing ones. We will see how Codex handles the fairly simple voice assistant and how far we can grow the project in a short amount of time.

For context, last night I wrote about my initial take on a local voice assistant written in Python. The application records voice input, transcribes it to text, feeds it to a local LLM, and then converts the response back into speech for playback, allowing for a conversational voice assistant. V1 was slow and hallucinated a bit, but it worked.
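For the curious, the V1 loop was roughly this shape. This is a reconstruction, not the exact code; the package choices here (sounddevice, openai-whisper, llama-cpp-python, pyttsx3) and the model path are illustrative stand-ins, not necessarily what V1 actually used:

```python
# Rough sketch of the V1 record -> transcribe -> LLM -> speak loop.
# Package choices and the model path are assumptions for illustration.
import sounddevice as sd
from scipy.io import wavfile
import whisper
import pyttsx3
from llama_cpp import Llama

SAMPLE_RATE = 16_000

def record(seconds: int = 5, path: str = "input.wav") -> str:
    audio = sd.rec(seconds * SAMPLE_RATE, samplerate=SAMPLE_RATE,
                   channels=1, dtype="int16")
    sd.wait()  # block until the recording finishes
    wavfile.write(path, SAMPLE_RATE, audio)
    return path

stt = whisper.load_model("base")
llm = Llama(model_path="models/local-model.gguf")  # hypothetical path
tts = pyttsx3.init()

while True:
    text = stt.transcribe(record())["text"]
    reply = llm(f"User: {text}\nAssistant:", max_tokens=256,
                stop=["User:"])["choices"][0]["text"]
    tts.say(reply)      # V1 speaks a fully generated response,
    tts.runAndWait()    # hence the goal of streaming audio instead
```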

My goals for improvement were:

  • Stream audio output instead of generating a full WAV file first.

  • Smarter prompt formatting to reduce hallucinations.

  • Faster model – either a better quantization or swap in a bigger one with GPU acceleration.

  • Voice activity detection for smarter recording.

  • Interactive back-and-forth — respond while listening.

It is 9:00pm CST and my first thought is: if I want to improve, I need to measure how fast I am going. So I prompted Codex to help me write a metrics module that records the different workflow components and keeps a record of each interaction's time. While it worked, I was able to watch it think through the problem and validate each step. I had it write a few tests for the metrics, reviewed them, and finally requested it create a GitHub workflow YAML file that runs tests on pull requests. I opened a PR and was running tests for my metrics module on GitHub without opening a code editor.
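To give a flavor of it, the metrics module looks roughly like this. A simplified sketch; the class and method names are my illustration, not a verbatim copy of what Codex generated:

```python
# Minimal sketch of a metrics module that times each pipeline stage.
import time
from collections import defaultdict

class Metrics:
    """Records wall-clock durations for each stage of the voice pipeline."""

    def __init__(self):
        self.timings = defaultdict(list)

    def record(self, stage: str):
        # Used as a context manager: with metrics.record("transcription"): ...
        return _Timer(self, stage)

class _Timer:
    def __init__(self, metrics: Metrics, stage: str):
        self.metrics, self.stage = metrics, stage

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.metrics.timings[self.stage].append(time.perf_counter() - self.start)

# Example: time one interaction's stages, then report averages.
metrics = Metrics()
with metrics.record("transcription"):
    time.sleep(0.01)  # stand-in for the real work
print({stage: sum(t) / len(t) for stage, t in metrics.timings.items()})
```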

Metrics PR

I canceled my Cursor and v0.dev subscriptions. I've been sampling AI tools lately, and this was by far my best experience. My usage of them had died off after getting deeper into my weightlifting web and iOS apps; the ratio of time spent to usable output had become inefficient.

One PR down. Next I wanted to implement a RAG layer where I can store data in a vector database and then pull the relevant information into the prompt delivered to the LLM. I chatted with Codex for a minute to design a system where one can drop files into a folder and then run the script with a specific flag to index all of the files. When the indexing completes, it moves the files to a processed folder. Once again I added a few tests and opened a PR.
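A minimal sketch of that ingest flow, assuming Chroma as the vector store. The flag name, folder layout, and library choice here are illustrative guesses, not necessarily what Codex produced:

```python
# Sketch of the ingest flow: index inbox files into a vector store,
# then archive them. Folder names and the --ingest flag are assumptions.
import argparse
import shutil
from pathlib import Path

import chromadb  # pip install chromadb

INBOX = Path("data/inbox")
PROCESSED = Path("data/processed")

def ingest() -> None:
    client = chromadb.PersistentClient(path="data/vectordb")
    collection = client.get_or_create_collection("documents")
    for path in INBOX.glob("*.txt"):
        text = path.read_text(encoding="utf-8")
        # Chroma embeds each document with its default embedding function.
        collection.add(documents=[text], ids=[path.name])
        shutil.move(str(path), PROCESSED / path.name)  # mark as processed

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ingest", action="store_true",
                        help="index files from the inbox, then archive them")
    args = parser.parse_args()
    if args.ingest:
        PROCESSED.mkdir(parents=True, exist_ok=True)
        ingest()
```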

RAG PR

With this I have big ideas to move fast. I want to hook this up to a local server and interact with it via a UI of some sort, where users can add files and see the interactions as text while speaking with the voice assistant. I'll need to think about moving the audio recording to the browser and streaming it via WebSockets to the server, where it can be transcribed to text and then handed to the LLM along with context retrieved from the user's data.
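To make the idea concrete, the server side might look something like this Flask-SocketIO sketch. Purely illustrative: the event names and the transcribe_chunk stub are my guesses at a design, not code from the eventual PR:

```python
# Illustrative Flask + Flask-SocketIO server for the planned pipeline.
# Event names ("audio_chunk", "transcript") are assumptions.
from flask import Flask
from flask_socketio import SocketIO, emit

app = Flask(__name__, static_folder="static", static_url_path="")
socketio = SocketIO(app)

@app.route("/")
def index():
    # Serve the raw HTML page shown later in the post.
    return app.send_static_file("index.html")

def transcribe_chunk(chunk: bytes) -> str:
    # Placeholder: the real pipeline would run speech-to-text here and
    # feed the result (plus retrieved RAG context) to the local LLM.
    return "<transcript goes here>"

@socketio.on("audio_chunk")
def handle_audio_chunk(chunk):
    # The browser streams recorded audio over the WebSocket; answer with text.
    emit("transcript", {"text": transcribe_chunk(chunk)})

if __name__ == "__main__":
    socketio.run(app, host="127.0.0.1", port=5000)
```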

The time is 10:18pm CST - let's see how fast we can go while keeping the wheels on. This plan is definitely multi-step, so I'll need to inform Codex of my goals, design a plan, and move through the steps.

The experience of being able to drive the computer to bring ideas to life at this speed is addictive.

Codex responded well to my plan and was able to execute this multi-component, multi-step change in one conversational task.

Codex’s plan was extensive and included proper detail for each step; I confirmed it, and it got to work.

Server & UI PR

The UI was as simple as it gets: some raw HTML and a JavaScript file to connect to the WebSockets. I am in favor of the simplicity to start. One issue arose: it did not properly add Flask to the test packages installed in the GitHub PR workflow. What an oversight 😂. I was able to request modifying the workflow to include them.

Uh oh, my first snag - Codex opened a separate PR for the workflow adjustment, leaving the Server & UI changes in another. I trusted it and merged both. Failing tests ensued.

It is now 10:49pm CST.

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8" />
    <title>Local Voice Assistant</title>
  </head>
  <body>
    <h1>Local Voice Assistant</h1>
    <button id="start">Start Recording</button> <button id="stop" disabled>Stop</button>
    <div id="log"></div>
    <form id="upload-form">
      <input type="file" id="file" name="file" /> <button type="submit">Upload</button>
    </form>
    <script src="https://cdn.socket.io/4.7.5/socket.io.min.js"></script>
    <script src="app.js"></script>
  </body>
</html>

The simplicity and directness of the result are almost comical. No complaints. Hopefully it works.

10:53pm: we have a quick fix for Python 3.10 f-string usage.
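I didn't capture the exact change, but as an illustration of the kind of thing that bites here: reusing the same quote character inside an f-string expression only became legal in Python 3.12 (PEP 701), so code like the commented line below fails to parse on 3.10:

```python
data = {"name": "assistant"}

# Breaks on Python 3.10/3.11 - reusing the same quote character inside
# the braces is only valid from Python 3.12 onward (PEP 701):
# print(f"Hello {data["name"]}")

# Portable on 3.10+: switch the inner quotes.
print(f"Hello {data['name']}")
```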

Test fix PR

It's about time to close up shop for the evening.

I am 16 commits into this project and probably about 6 hours in. We have some interesting results. Night 1 consisted of manual coding and some normal ChatGPT usage. Most of the time was spent figuring out which packages to use for speech-to-text on Linux. Using Codex on Night 2 was a thrill. Being able to run multiple tasks in parallel is exciting. Development speed could be truly wild.

Caveat to all this - I am on my MacBook tonight, so I cannot actually clone this down and run the model. I will check back in tomorrow and report the results. I am optimistic, but as always there may be a few adjustments.

Looking ahead, I plan to:

  • Transition from writing full .wav files to streaming audio playback in the browser

  • Upgrade the LLM in use (better model, faster inference)

  • Support streamed output from the LLM to the user, for a more real-time feel

This is already sparking ideas of where I can take it next. I’d love to gain more experience deploying this type of system to the cloud, so that might be the next challenge on the roadmap.

Tonight's coding and blog took about 2 hours.

Cheers!

Local Voice Assistant V1

Local Voice Assistant V3