Set Up Home Assistant Voice Assistant Without Cloud Dependency

Running a Voice Assistant That Stays in Your House

Most smart home voice assistants work by sending your audio to remote servers, processing your request in the cloud, and returning a response. That architecture is fast and convenient, but it means every command you speak travels outside your home. For people who want their smart home to actually stay private, Home Assistant offers a path to a fully local voice pipeline – one where wake word detection, speech transcription, and intent processing all happen on hardware you control.

The setup requires more effort than plugging in an Amazon Echo, but the result is a voice assistant that works without an internet connection, never logs your commands to a third-party server, and does not depend on a company deciding to keep the product alive. The tools that make this possible – the Wyoming protocol, Piper for text-to-speech, Whisper for speech-to-text, and openWakeWord for wake word detection – are all open source and can run on a local machine or a Raspberry Pi.

What You Need Before You Start

Home Assistant must already be running on your network, ideally as a Home Assistant OS installation since it gives you the easiest access to the add-on store. You will also need a device to handle the heavy lifting – a machine running Home Assistant with at least 4GB of RAM handles Whisper at a basic level, though a more powerful host speeds up transcription noticeably. A microphone input is required, either a USB microphone plugged into your Home Assistant host or a separate device like an ESP32-S3 running the ESPHome firmware, which lets you place a small microphone anywhere in your home.

Install three add-ons from the Home Assistant add-on store: the Whisper add-on (for speech-to-text), the Piper add-on (for text-to-speech responses), and the openWakeWord add-on (for always-listening wake word detection). All three are available under the official add-on repository. Start each add-on and enable the Start on boot option for each one so your pipeline survives restarts.

Configuring the Wyoming Pipeline

Home Assistant’s local voice system is built around the Wyoming protocol, a lightweight standard that lets the individual speech components talk to each other and to Home Assistant. Once your three add-ons are running, navigate to Settings > Devices and Services, where each one should appear as a discovered Wyoming integration. If they do not show up on their own, use the Add Integration button and search for Wyoming to add each service manually.

With all three Wyoming services connected, go to Settings > Voice Assistants and create a new assistant. Set the speech-to-text engine to Whisper (it may be listed as faster-whisper), set the text-to-speech engine to Piper, and select your preferred Piper voice from the dropdown. Several English voices are available at different quality levels, with the medium quality models offering a reasonable balance between naturalness and processing speed. Leave the conversation agent on its default, Home Assistant’s built-in agent, so that smart home commands are processed locally.

Wake Word Detection and Hardware Setup

Wake word detection is what makes the assistant hands-free. The openWakeWord add-on continuously analyzes the audio stream from your microphone and triggers the speech pipeline only when it hears the designated phrase. The default wake words available include “Hey Jarvis,” “Okay Nabu,” and “Hey Mycroft.” These are not as polished as Amazon’s “Alexa” in terms of false positive rates, but they work reliably in quiet environments and may improve as the models are updated.

If you want a voice satellite – a microphone placed in a room away from your Home Assistant server – the cleanest approach uses an ESP32-S3 development board running ESPHome firmware with the voice assistant component enabled. With on-device wake word detection configured, the ESP32-S3 listens for the wake word locally and streams audio to your Home Assistant instance only after the wake word is heard. A board with a built-in microphone simplifies the hardware side considerably. The M5Stack Atom Echo includes both a microphone and a small speaker, although it is built on an older ESP32 chip rather than the S3, while the XIAO ESP32S3 Sense has an onboard microphone but needs a separate amplifier and speaker for spoken responses. Flash the ESPHome firmware, point the device at your Home Assistant instance, and it appears as a new device under Settings > Devices and Services.
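To give a sense of what the voice assistant component involves, the ESPHome configuration below is a minimal sketch for a generic ESP32-S3 board with an external I2S microphone. The device name, board ID, GPIO pins, and wake word model are assumptions chosen for illustration, not values for a specific product, so check your board's pinout and the ESPHome documentation before flashing.

```yaml
# Hypothetical satellite configuration; names and pins are placeholders.
esphome:
  name: kitchen-satellite

esp32:
  board: esp32-s3-devkitc-1
  framework:
    type: esp-idf          # on-device wake word detection uses the ESP-IDF framework

wifi:
  ssid: !secret wifi_ssid
  password: !secret wifi_password

api:                        # native API connection to Home Assistant

i2s_audio:
  - id: i2s_in
    i2s_lrclk_pin: GPIO3    # assumption: adjust to your wiring
    i2s_bclk_pin: GPIO2     # assumption: adjust to your wiring

microphone:
  - platform: i2s_audio
    id: satellite_mic
    i2s_audio_id: i2s_in
    adc_type: external
    i2s_din_pin: GPIO4      # assumption: adjust to your wiring
    pdm: false

micro_wake_word:
  models:
    - model: okay_nabu      # wake word is detected on the device itself
  on_wake_word_detected:
    - voice_assistant.start:

voice_assistant:
  microphone: satellite_mic # audio is streamed to Home Assistant after the wake word
```

This sketch covers only the listening side. A speaker or media player section would be needed for spoken responses, and depending on the ESPHome release, extra automations may be required to resume wake word listening after each request.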

For a Raspberry Pi running Home Assistant OS as the main server, a USB microphone like the Blue Snowball or any basic conference-style USB mic works well. Plug it in, and Home Assistant OS detects it as an available audio input. The Whisper, Piper, and openWakeWord add-ons do not capture audio themselves; they process audio that is streamed to them. To use a microphone attached to the host, the usual route is the Assist Microphone add-on, whose audio settings let you select the input from a list of detected devices. Select your USB microphone there so wake detection uses your actual hardware input rather than a default fallback.

One practical issue to plan for: Whisper’s transcription speed depends heavily on your hardware. On a basic Raspberry Pi 4, even the “tiny” Whisper model introduces a noticeable delay between speaking and getting a response – sometimes two to four seconds. Running Home Assistant on an x86 machine with a faster CPU drops that delay sharply. The stock Whisper add-on runs on the CPU, so a dedicated GPU only helps if you run a GPU-enabled Whisper server separately. If you are building a setup where response latency matters, an Intel NUC or a small form-factor PC running Home Assistant OS as a bare-metal installation handles the Whisper “base” model with much shorter wait times. The “small” and “medium” Whisper models improve accuracy for accented speech or quieter environments but require more processing power.
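The model is chosen in the Whisper add-on's Configuration tab. The fragment below is a sketch of options that favor speed on a low-powered host. The option names and model identifiers are assumptions based on the add-on's documented options and may differ between releases, so verify them against the add-on's Documentation tab.

```yaml
# Whisper add-on options (sketch; confirm names and values for your add-on version)
model: tiny-int8   # smallest and fastest model; try base-int8 on faster hardware
language: en       # fixing the language avoids automatic language detection
beam_size: 1       # a smaller beam trades some accuracy for lower latency
```

Restart the add-on after changing the model so that the new model is downloaded and loaded.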

Extending the Pipeline Beyond Basic Commands

Home Assistant’s built-in conversation agent handles device control commands reliably – turning lights on and off, setting thermostat temperatures, locking doors, running scripts and automations. What it does not handle well out of the box is general knowledge questions or anything that requires reasoning beyond the state of your home. For that, some users run a local large language model such as Llama 3 or Mistral with Ollama on their own hardware and connect it through the Ollama integration as a conversation agent inside Home Assistant. The tradeoff is hardware demand – running a capable language model locally requires serious compute, and a Raspberry Pi will not cut it for anything beyond the smallest quantized models.

If your goal is smart home control rather than a general-purpose assistant, the built-in agent is genuinely capable. Home Assistant parses natural language commands through its intent recognition system, which understands room-specific phrasing like “turn off all the lights in the kitchen” or “set the living room thermostat to 70 degrees” without any additional configuration. Custom sentences can be added as YAML files under the custom_sentences directory of your configuration folder, organized by language code, which lets you define your own voice triggers for any automation in your home.
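As an illustration, a custom sentence file might look like the sketch below. The file name, the intent name StartMovieNight, and the phrases are hypothetical; only the custom_sentences/<language> location and the overall structure follow Home Assistant's conventions.

```yaml
# config/custom_sentences/en/movie_night.yaml (hypothetical file name)
language: "en"
intents:
  StartMovieNight:            # hypothetical intent name
    data:
      - sentences:
          - "start movie night"
          - "it's movie time"
```

The intent then needs a handler. One option is an intent_script entry in configuration.yaml, shown here with an assumed script entity called script.movie_night:

```yaml
# configuration.yaml
intent_script:
  StartMovieNight:
    action:
      - service: script.turn_on
        target:
          entity_id: script.movie_night   # assumption: replace with your own script
    speech:
      text: "Starting movie night."       # spoken back through the TTS engine
```

Restart Home Assistant after adding the files so the new sentences and intent handler are loaded.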

The system also integrates with self-hosted notification tools if you want voice responses paired with push alerts to your phone, keeping the entire notification chain off commercial platforms. Scripting a response that both speaks through Piper and fires a notification through a self-hosted service takes only a few lines of YAML automation.
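A minimal sketch of such an automation follows. The entity IDs (binary_sensor.front_door, tts.piper, media_player.living_room_speaker) and the notify service name are assumptions for illustration; substitute the entities and the notification service from your own installation.

```yaml
# automations.yaml entry (sketch; all entity and service names are placeholders)
- alias: "Announce and notify when the front door opens"
  trigger:
    - platform: state
      entity_id: binary_sensor.front_door
      to: "on"
  action:
    # Speak the message locally through Piper on a chosen speaker
    - service: tts.speak
      target:
        entity_id: tts.piper
      data:
        media_player_entity_id: media_player.living_room_speaker
        message: "The front door was opened."
    # Send the same message through a self-hosted notification service
    - service: notify.self_hosted_push   # hypothetical notify service name
      data:
        message: "The front door was opened."
```

Keeping both actions in one automation means the spoken and pushed messages stay in sync when the wording changes.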

None of this is effortless to maintain. Add-ons update independently, and occasionally a Whisper or Piper update changes a configuration option or breaks an audio device mapping. The wake word false negative rate – how often the assistant misses the wake word – is higher than that of commercial alternatives, particularly in noisy rooms. Whether that tradeoff makes sense depends on how much weight you put on keeping voice data off external servers entirely, because that is the one guarantee this setup offers that a cloud-based commercial assistant cannot match.