Building a Voice-Controlled Smart Home with ESP-Claw
Text-based AI agents are useful, but voice interaction feels magical. Walking into your living room and saying “dim the lights and play some jazz” — and having it actually happen — is the kind of experience that used to require expensive commercial products. With ESP-Claw on an ESP32-S3, you can build this for under $25.
This tutorial covers the complete build: hardware assembly, firmware configuration, voice pipeline setup, and smart home integration.
What We’re Building
A standalone voice-controlled AI assistant that:
- Listens for a wake word (“Hey Claw”)
- Captures your voice command
- Sends it to a speech-to-text service
- Processes the text through the AI agent (with tool calling)
- Converts the response to speech
- Plays it through a speaker
- Controls smart home devices via MQTT and IR
The entire system runs on a single ESP32-S3 board.
Bill of Materials
| Component | Purpose | Price |
|---|---|---|
| ESP32-S3 DevKitC (N16R8) | Main processor, 8MB PSRAM | $7.50 |
| INMP441 I2S microphone | Voice input | $2.50 |
| MAX98357A I2S amplifier | Audio output | $2.00 |
| 3W speaker (28mm, 8 ohm) | Sound output | $0.80 |
| IR LED + 100 ohm resistor | AC and TV control | $0.20 |
| DHT22 temperature sensor | Environment monitoring | $2.00 |
| Breadboard + jumper wires | Assembly | $2.50 |
| USB-C cable | Power and programming | $1.00 |
| Total | | $18.50 |
Hardware Assembly
Microphone Wiring (INMP441)
The INMP441 is an I2S MEMS microphone. It outputs digital audio directly — no analog-to-digital conversion needed.
| INMP441 Pin | ESP32-S3 Pin | Function |
|---|---|---|
| VDD | 3.3V | Power (do NOT use 5V) |
| GND | GND | Ground |
| SCK | GPIO 4 | I2S bit clock |
| WS | GPIO 5 | I2S word select (left/right) |
| SD | GPIO 6 | I2S serial data |
| L/R | GND | Channel select (GND = left) |
Speaker Wiring (MAX98357A)
The MAX98357A is an I2S amplifier that drives a small speaker directly from the ESP32’s I2S output.
| MAX98357A Pin | ESP32-S3 Pin | Function |
|---|---|---|
| VIN | 5V (USB) | Power (5V for more volume) |
| GND | GND | Ground |
| BCLK | GPIO 7 | I2S bit clock |
| LRC | GPIO 8 | I2S left/right clock |
| DIN | GPIO 9 | I2S serial data in |
| GAIN | Not connected | Default 9dB gain |
| SD | Not connected | Default: enabled |
Connect the speaker’s positive terminal to the + output and negative to the - output on the MAX98357A board.
IR and Sensor Wiring
Wire the remaining peripherals to any available GPIO pins:
| Component | ESP32-S3 Pin |
|---|---|
| DHT22 data | GPIO 10 |
| IR LED (anode through 100 ohm) | GPIO 12 |
Firmware Configuration
Flash the ESP-Claw firmware using the browser flasher or manual flashing guide. Select the ESP32-S3 variant.
Voice Pipeline Configuration
In the ESP-Claw configuration page, enable voice features:
```json
{
  "voice": {
    "enabled": true,
    "wake_word": "hey_claw",
    "stt_provider": "whisper_api",
    "stt_api_key": "your-openai-key",
    "tts_provider": "edge_tts",
    "tts_voice": "en-US-AriaNeural",
    "silence_threshold": 500,
    "max_recording_ms": 10000,
    "i2s_mic": { "bck": 4, "ws": 5, "data": 6 },
    "i2s_speaker": { "bck": 7, "ws": 8, "data": 9 }
  }
}
```
How the Voice Pipeline Works
The voice pipeline runs in stages, each optimized for the ESP32-S3’s capabilities:
Stage 1 — Wake Word Detection (on-device): A small TensorFlow Lite model (80KB) runs continuously on the ESP32, listening for “Hey Claw.” This uses about 15ms of CPU time per audio frame and runs entirely locally. No audio is sent anywhere until the wake word is detected.
Stage 2 — Voice Capture: After detecting the wake word, the ESP32 records audio until it detects 500ms of silence (configurable). The audio is captured as 16-bit PCM at 16kHz and buffered in PSRAM.
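As a sanity check on memory use, the worst-case capture buffer follows directly from those parameters (16-bit mono PCM at 16 kHz, 10 s maximum). A quick sketch of the arithmetic; the function name is illustrative, not part of the firmware:

```python
# Worst-case size of the voice-capture buffer held in PSRAM.
# Parameters mirror the configuration above: 16 kHz, 16-bit mono, 10 s max.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2          # 16-bit PCM
CHANNELS = 1                  # INMP441 wired for the left channel only
MAX_RECORDING_MS = 10_000

def capture_buffer_bytes(rate=SAMPLE_RATE_HZ, width=BYTES_PER_SAMPLE,
                         channels=CHANNELS, max_ms=MAX_RECORDING_MS):
    """Bytes of PSRAM needed to hold the longest allowed recording."""
    return rate * width * channels * max_ms // 1000

print(capture_buffer_bytes())  # 320000 bytes, about 313 KB of the 8 MB PSRAM
```

This is why the S3 variant with PSRAM matters: a ten-second utterance alone would not fit in the ESP32's internal SRAM alongside the Wi-Fi stack.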
Stage 3 — Speech-to-Text (cloud): The captured audio is sent to OpenAI’s Whisper API (or a self-hosted Whisper server) for transcription. Typical latency: 0.5-1 second for a short command.
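The Whisper API expects a container format rather than raw samples, so the PCM buffer is wrapped in a 44-byte RIFF/WAVE header before upload. A minimal host-side sketch of that header (illustrative, not ESP-Claw's actual upload code):

```python
import struct

def wav_header(pcm_len, sample_rate=16_000, bits=16, channels=1):
    """Build a standard 44-byte RIFF/WAVE header for raw little-endian PCM."""
    byte_rate = sample_rate * channels * bits // 8
    block_align = channels * bits // 8
    return (b"RIFF"
            + struct.pack("<I", 36 + pcm_len)        # file size minus 8
            + b"WAVEfmt "
            + struct.pack("<IHHIIHH", 16, 1,          # fmt chunk size, PCM
                          channels, sample_rate,
                          byte_rate, block_align, bits)
            + b"data"
            + struct.pack("<I", pcm_len))             # payload size

header = wav_header(320_000)
print(len(header))  # 44
```

Prepending a fixed header like this is cheap enough to do in place, which is why the firmware can stream the buffer out without a transcoding step.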
Stage 4 — AI Agent Processing: The transcribed text is processed by the AI agent exactly like a text message. The agent reasons about the request, calls tools as needed, and generates a response.
Stage 5 — Text-to-Speech (cloud/edge): The response text is converted to speech using Microsoft Edge TTS (free, high quality) or a similar service. The audio is streamed back to the ESP32.
Stage 6 — Audio Playback: The MAX98357A amplifier plays the response through the speaker.
Total round-trip time for a simple command: 2-4 seconds. Complex commands with tool calls: 3-6 seconds.
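Those totals break down roughly as follows. The per-stage numbers here are illustrative estimates consistent with the ranges above, not measurements:

```python
# Rough per-stage latency budget for a simple command (illustrative numbers).
stage_ms = {
    "silence_detect": 500,   # waiting out the configured end-of-speech gap
    "stt_whisper": 800,      # cloud transcription of a short utterance
    "agent_reply": 600,      # AI agent response, no tool calls
    "tts_edge": 500,         # text-to-speech synthesis
    "playback_start": 100,   # streaming the first audio back to the board
}
total = sum(stage_ms.values())
print(f"{total / 1000:.1f} s")  # 2.5 s, inside the 2-4 s range quoted above
```

The silence-detection window is pure dead time, so lowering `silence_threshold` is the cheapest latency win, at the risk of clipping slow speakers mid-sentence.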
SOUL.md for Voice Interaction
Voice interaction requires a different personality style than text. Optimize your SOUL.md for spoken responses:
```markdown
# Voice Home Assistant

You are a voice-controlled home assistant in the living room.

## Voice Response Rules
- Keep responses under 2 sentences for simple actions
- Never use markdown formatting, URLs, or code in responses
- Use natural speech patterns ("I've turned on the light" not "Light state: ON")
- Pronounce numbers naturally ("twenty-four degrees" not "24°C")
- If you performed an action, confirm briefly. Don't over-explain.
- For errors, give a short explanation and one suggestion

## Personality
- Warm but efficient — like a helpful concierge
- Confirm actions in past tense ("Done, I've set the AC to 24 degrees")
- For questions, give direct answers first, then context if needed
```
Connecting Smart Devices
MQTT Lights (via Home Assistant / Zigbee2MQTT)
If you have Zigbee smart lights managed by Home Assistant or Zigbee2MQTT, the AI agent can control them via MQTT. Add the device information to your SOUL.md:
```markdown
## Devices
- Living room light: mqtt topic zigbee2mqtt/living_room/set
  Capabilities: state (ON/OFF), brightness (0-254), color_temp (150-500)
- Bedroom light: mqtt topic zigbee2mqtt/bedroom/set
  Capabilities: state (ON/OFF), brightness (0-254)
```
Now you can say: “Dim the living room lights to 30 percent” and the agent will publish the appropriate MQTT command.
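Behind that command, “30 percent” has to be mapped onto Zigbee2MQTT’s 0-254 brightness scale and wrapped in a JSON payload. A sketch of the translation (the function name is illustrative; ESP-Claw does this in firmware):

```python
import json

def brightness_payload(percent):
    """Map a spoken percentage onto Zigbee2MQTT's 0-254 brightness scale."""
    level = round(max(0, min(100, percent)) / 100 * 254)
    return json.dumps({"state": "ON", "brightness": level})

topic = "zigbee2mqtt/living_room/set"  # topic from the device list
print(topic, brightness_payload(30))   # {"state": "ON", "brightness": 76}
```

Publishing that payload to the `/set` topic is the standard Zigbee2MQTT control pattern; the device echoes its new state back on the base topic, which the agent can subscribe to for confirmation.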
IR-Controlled Devices (AC, TV)
The IR LED lets you control any device that uses an infrared remote. ESP-Claw includes a learning mode:
- Say “Learn a new remote command”
- The agent will prompt: “Point your remote at the sensor and press the button”
- Press the button on your existing remote
- The agent captures and stores the IR code
- Name it: “AC cool mode 24 degrees”
After learning, you can say “Turn on the AC” and the agent will send the learned IR code.
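One way to picture what learning mode stores: a named table of raw mark/space timings (in microseconds) that the IR LED driver replays verbatim. The structure and function names below are illustrative, not ESP-Claw’s internal format:

```python
# Hypothetical in-memory store for learned IR codes.
# Raw timings are alternating mark/space durations in microseconds.
learned_codes = {}

def learn(name, raw_timings):
    """Save a captured timing sequence under a spoken-friendly name."""
    learned_codes[name] = tuple(raw_timings)

def replay(name):
    """Return the timing sequence the IR LED driver would emit."""
    return learned_codes[name]

# An NEC-style frame starts with a ~9 ms leader mark and ~4.5 ms space.
learn("AC cool mode 24 degrees", [9000, 4500, 560, 560, 560, 1690])
print(replay("AC cool mode 24 degrees")[:2])
```

Storing raw timings rather than decoded protocol bits is what makes learning mode protocol-agnostic: it works even for AC remotes with long, vendor-specific frames.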
Practical Usage Examples
Here are real voice interactions with the completed system:
Morning routine:
- You: “Good morning”
- Agent: “Good morning! It’s 22 degrees inside, 15 outside. I’ve turned on the kitchen light and started the coffee maker.”
Climate control:
- You: “It’s getting warm in here”
- Agent: [reads DHT22: 27.5°C] [sends IR: AC 24°C cool] “I’ve turned on the AC to 24 degrees. It’s 27.5 currently, should cool down in about 10 minutes.”
Evening mode:
- You: “Movie time”
- Agent: [dims lights to 10%] [turns on TV via IR] “Lights dimmed and TV is on. Enjoy your movie.”
Troubleshooting Voice Issues
Wake word not detecting: Check that the INMP441 microphone is wired correctly. The L/R pin must be connected to GND for left channel output. Test by blowing on the microphone — you should see audio level changes in the serial monitor.
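If you want a number instead of eyeballing the serial monitor, the “audio level” is typically the RMS of a frame of samples: silence sits near zero, and blowing on the mic spikes it. A host-side sketch of the calculation (assumed convention, not ESP-Claw’s exact metric):

```python
import math

def rms(samples):
    """Root-mean-square level of a frame of 16-bit PCM samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

print(rms([0, 0, 0, 0]))                       # 0.0 — silence, or a miswired SD pin
print(round(rms([1000, -1000, 1000, -1000])))  # 1000 — a strong signal
```

A frame that reads exactly zero at all times usually means the data line is dead (wrong SD pin or a floating L/R pin), not that the room is quiet.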
Audio quality is poor: The INMP441 is sensitive to electrical noise. Keep the microphone wires short and away from the power lines. A small decoupling capacitor (100nF) near the microphone’s VDD pin helps.
Speaker is too quiet: The MAX98357A gain defaults to 9dB when the GAIN pin is left floating. Connect GAIN to GND for 12dB, or tie it to GND through a 100k resistor for 15dB. Powering the amplifier from 5V (instead of 3.3V) also increases the maximum volume.
Response latency is high: The biggest contributor to latency is the speech-to-text step. For faster responses, consider running a local Whisper model on a Raspberry Pi or your home server, and point ESP-Claw’s STT endpoint to it.
False wake word triggers: The default wake word model has a ~5% false positive rate. You can increase the detection threshold in the voice configuration to reduce false triggers at the cost of occasionally needing to repeat yourself.
Read Next
- Connect ESP-Claw to Your Smart Home with MQTT — Add MQTT device control
- ESP32-C3 vs ESP32-S3 for AI Projects — Why voice requires the S3
- The Complete Guide to SOUL.md — Optimize personality for voice interaction
- ESP-Claw Security Best Practices — Secure your voice-enabled device
- Pinout Reference — I2S pin assignments for mic and speaker