Porygon Voice Assistant | Project

Physical pod

Embedded System Architecture

Lolin D32

For the brains I used the Lolin D32, simply because I still had this board lying around. To anyone attempting the same project, I would recommend either a similar ESP32 board or one with more RAM. The specs were enough for what I needed, but it was tight.

Microphone & Speaker

For the Porygon to hear what I am saying, it needed a microphone. The INMP441, an I2S MEMS microphone, is commonly used and was readily available. Most voice assistants can also talk back, so for audio output I chose the MAX98357 I2S amplifier. It drives 4-8 ohm speakers, which I still had lying around.

Edge Impulse

There were a few options for wake word detection: ESP-SR, streaming the audio, transcribing locally, or Edge Impulse. Edge Impulse took the most time to set up but was also the best tuned for my application.

Software Server

Server Structure & Model Logic

SpeechToText

The ESP32 streams the recorded audio to a TCP port open on my server. The server needs to turn these bytes into text so the LLM can respond. This is done with a speech-to-text model. Since OpenAI has released its Whisper model for public use, I chose that model.
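The receiving side of that TCP stream could look roughly like the sketch below. It is not the project's actual server code: the port, the audio format (16 kHz, 16-bit, mono PCM) and the function names `read_stream`, `pcm_to_wav` and `serve_once` are all assumptions made for illustration. It reads bytes until the ESP32 closes the connection and wraps them in a WAV header, a format most speech-to-text tooling can load.

```python
import io
import socket
import wave

# Assumed audio format; the real stream may use different settings.
SAMPLE_RATE = 16_000   # samples per second
SAMPLE_WIDTH = 2       # bytes per sample (16-bit)
CHANNELS = 1           # mono microphone


def read_stream(conn: socket.socket, chunk_size: int = 4096) -> bytes:
    """Collect bytes until the sender closes its side of the connection."""
    chunks = []
    while True:
        chunk = conn.recv(chunk_size)
        if not chunk:  # an empty read means the sender is done
            break
        chunks.append(chunk)
    return b"".join(chunks)


def pcm_to_wav(pcm: bytes) -> bytes:
    """Wrap raw PCM samples in a WAV container held in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
    return buffer.getvalue()


def serve_once(host: str = "0.0.0.0", port: int = 5000) -> bytes:
    """Accept a single connection and return one utterance as WAV bytes.

    The port number is a placeholder, not the one used in the project.
    """
    with socket.create_server((host, port)) as server:
        conn, _addr = server.accept()
        with conn:
            return pcm_to_wav(read_stream(conn))
```

The bytes returned by `serve_once()` can be written to a temporary file and handed to the transcription step. Using the end of the connection as the end of the utterance keeps the protocol simple, at the cost of one TCP connection per request.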

Ollama

I personally have a strong disdain for subscription-based services, so I was not going to call the APIs of the big names. Instead, I call the API of a model I host myself. This means I lose some recency but gain a free model. I chose llama3.1:8b, served through Ollama, since this model fits entirely in my GPU's VRAM.
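Calling a self-hosted model through Ollama's HTTP API could look like the minimal sketch below. It assumes Ollama is running on the same machine on its default port 11434, and the model tag is written as `llama3.1:8b`, which is how Ollama's library names the Llama 3.1 8B model. The helper names `build_request` and `ask` are hypothetical, and the real project may well use a client library or the chat endpoint instead.

```python
import json
import urllib.request

# Default Ollama endpoint; the host is an assumption about the setup.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3.1:8b"


def build_request(prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local model."""
    payload = {"model": MODEL, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt and return the generated text.

    With "stream": False, Ollama answers with a single JSON object
    whose "response" field holds the full completion.
    """
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the Ollama server running, `ask("What type of Pokemon is Porygon?")` returns the model's answer as a string. Disabling streaming keeps the sketch short. A voice assistant that wants to start speaking sooner might prefer the streaming mode.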

TextToSpeech

The text received from the LLM cannot yet be spoken. A text-to-speech step converts it to audio and sends the resulting bytes to the ESP32. I chose Kokoro TTS, which is fast and offers multiple voices. This makes the voice assistant sound more human.
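Sending the synthesized audio back to the board could be sketched as follows. The sketch assumes the TTS step yields float samples in the range -1.0 to 1.0 and that the ESP32 expects little-endian 16-bit PCM over the same kind of TCP connection. The chunk size and the function names `float_to_pcm16` and `send_audio` are hypothetical, and the TTS call itself is left out.

```python
import socket
import struct

CHUNK_SIZE = 1024  # assumed chunk size in bytes, small enough for a tight RAM budget


def float_to_pcm16(samples) -> bytes:
    """Convert float samples to little-endian 16-bit PCM, clipping out-of-range values."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(s * 32767) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def send_audio(conn: socket.socket, pcm: bytes, chunk_size: int = CHUNK_SIZE) -> int:
    """Send PCM bytes in fixed-size chunks and return the number of chunks sent."""
    sent = 0
    for start in range(0, len(pcm), chunk_size):
        conn.sendall(pcm[start:start + chunk_size])
        sent += 1
    return sent
```

Sending in small chunks lets the microcontroller start playback while the rest of the audio is still arriving, instead of buffering the whole reply first.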