Building a Voice-Controlled Smart Home with ESP-Claw
Text-based AI agents are useful, but voice interaction feels magical. Walking into your living room and saying “dim the lights and play some jazz” — and having it actually happen — is the kind of experience that used to require expensive commercial products. With ESP-Claw on an ESP32-S3, you can build this for under $25.
This tutorial covers the complete build: hardware assembly, firmware configuration, voice pipeline setup, and smart home integration.
What We’re Building
A standalone voice-controlled AI assistant that:
- Listens for a wake word (“Hey Claw”)
- Captures your voice command
- Sends it to a speech-to-text service
- Processes the text through the AI agent (with tool calling)
- Converts the response to speech
- Plays it through a speaker
- Controls smart home devices via MQTT and IR
The entire system runs on a single ESP32-S3 board.
Bill of Materials
| Component | Purpose | Price |
|---|---|---|
| ESP32-S3 DevKitC (N16R8) | Main processor, 8MB PSRAM | $7.50 |
| INMP441 I2S microphone | Voice input | $2.50 |
| MAX98357A I2S amplifier | Audio output | $2.00 |
| 3W speaker (28mm, 8 ohm) | Sound output | $0.80 |
| IR LED + 100 ohm resistor | AC and TV control | $0.20 |
| DHT22 temperature sensor | Environment monitoring | $2.00 |
| Breadboard + jumper wires | Assembly | $2.50 |
| USB-C cable | Power and programming | $1.00 |
| Total | | $18.50 |
Hardware Assembly
Microphone Wiring (INMP441)
The INMP441 is an I2S MEMS microphone. It outputs digital audio directly — no analog-to-digital conversion needed.
| INMP441 Pin | ESP32-S3 Pin | Function |
|---|---|---|
| VDD | 3.3V | Power (do NOT use 5V) |
| GND | GND | Ground |
| SCK | GPIO 4 | I2S bit clock |
| WS | GPIO 5 | I2S word select (left/right) |
| SD | GPIO 6 | I2S serial data |
| L/R | GND | Channel select (GND = left) |
Speaker Wiring (MAX98357A)
The MAX98357A is an I2S amplifier that drives a small speaker directly from the ESP32’s I2S output.
| MAX98357A Pin | ESP32-S3 Pin | Function |
|---|---|---|
| VIN | 5V (USB) | Power (5V for more volume) |
| GND | GND | Ground |
| BCLK | GPIO 7 | I2S bit clock |
| LRC | GPIO 8 | I2S left/right clock |
| DIN | GPIO 9 | I2S serial data in |
| GAIN | Not connected | Default 9dB gain |
| SD | Not connected | Default: enabled |
Connect the speaker’s positive terminal to the + output and negative to the - output on the MAX98357A board.
IR and Sensor Wiring
Wire the remaining peripherals to any available GPIO pins:
| Component | ESP32-S3 Pin |
|---|---|
| DHT22 data | GPIO 10 |
| IR LED (anode through 100 ohm) | GPIO 12 |
Firmware Configuration
Flash the ESP-Claw firmware using the browser flasher or manual flashing guide. Select the ESP32-S3 variant.
Voice Pipeline Configuration
In the ESP-Claw configuration page, enable voice features:
```json
{
  "voice": {
    "enabled": true,
    "wake_word": "hey_claw",
    "stt_provider": "whisper_api",
    "stt_api_key": "your-openai-key",
    "tts_provider": "edge_tts",
    "tts_voice": "en-US-AriaNeural",
    "silence_threshold": 500,
    "max_recording_ms": 10000,
    "i2s_mic": { "bck": 4, "ws": 5, "data": 6 },
    "i2s_speaker": { "bck": 7, "ws": 8, "data": 9 }
  }
}
```
How the Voice Pipeline Works
The voice pipeline runs in stages, each optimized for the ESP32-S3’s capabilities:
Stage 1 — Wake Word Detection (on-device): A small TensorFlow Lite model (80KB) runs continuously on the ESP32, listening for “Hey Claw.” This uses about 15ms of CPU time per audio frame and runs entirely locally. No audio is sent anywhere until the wake word is detected.
Stage 2 — Voice Capture: After detecting the wake word, the ESP32 records audio until it detects 500ms of silence (configurable). The audio is captured as 16-bit PCM at 16kHz and buffered in PSRAM.
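As a sanity check on memory use, the worst-case capture buffer follows directly from those parameters (16-bit mono PCM at 16 kHz, 10 s maximum). A quick sketch of the arithmetic; the function name is illustrative, not part of the firmware:

```python
# Worst-case size of the voice-capture buffer held in PSRAM.
# Parameters mirror the configuration above: 16 kHz, 16-bit mono, 10 s max.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2          # 16-bit PCM
CHANNELS = 1                  # INMP441 wired for the left channel only
MAX_RECORDING_MS = 10_000

def capture_buffer_bytes(rate=SAMPLE_RATE_HZ, width=BYTES_PER_SAMPLE,
                         channels=CHANNELS, max_ms=MAX_RECORDING_MS):
    """Bytes of PSRAM needed to hold the longest allowed recording."""
    return rate * width * channels * max_ms // 1000

print(capture_buffer_bytes())  # 320000 bytes, about 313 KB of the 8 MB PSRAM
```

This is why the S3 variant with PSRAM matters: a ten-second utterance alone would not fit in the ESP32's internal SRAM alongside the Wi-Fi stack.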
Stage 3 — Speech-to-Text (cloud): The captured audio is sent to OpenAI’s Whisper API (or a self-hosted Whisper server) for transcription. Typical latency: 0.5-1 second for a short command.
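The Whisper API expects a container format rather than raw samples, so the PCM buffer is wrapped in a 44-byte RIFF/WAVE header before upload. A minimal host-side sketch of that header (illustrative, not ESP-Claw's actual upload code):

```python
import struct

def wav_header(pcm_len, sample_rate=16_000, bits=16, channels=1):
    """Build a standard 44-byte RIFF/WAVE header for raw little-endian PCM."""
    byte_rate = sample_rate * channels * bits // 8
    block_align = channels * bits // 8
    return (b"RIFF"
            + struct.pack("<I", 36 + pcm_len)        # file size minus 8
            + b"WAVEfmt "
            + struct.pack("<IHHIIHH", 16, 1,          # fmt chunk size, PCM
                          channels, sample_rate,
                          byte_rate, block_align, bits)
            + b"data"
            + struct.pack("<I", pcm_len))             # payload size

header = wav_header(320_000)
print(len(header))  # 44
```

Prepending a fixed header like this is cheap enough to do in place, which is why the firmware can stream the buffer out without a transcoding step.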
Stage 4 — AI Agent Processing: The transcribed text is processed by the AI agent exactly like a text message. The agent reasons about the request, calls tools as needed, and generates a response.
Stage 5 — Text-to-Speech (cloud/edge): The response text is converted to speech using Microsoft Edge TTS (free, high quality) or a similar service. The audio is streamed back to the ESP32.
Stage 6 — Audio Playback: The MAX98357A amplifier plays the response through the speaker.
Total round-trip time for a simple command: 2-4 seconds. Complex commands with tool calls: 3-6 seconds.
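Those totals break down roughly as follows. The per-stage numbers here are illustrative estimates consistent with the ranges above, not measurements:

```python
# Rough per-stage latency budget for a simple command (illustrative numbers).
stage_ms = {
    "silence_detect": 500,   # waiting out the configured end-of-speech gap
    "stt_whisper": 800,      # cloud transcription of a short utterance
    "agent_reply": 600,      # AI agent response, no tool calls
    "tts_edge": 500,         # text-to-speech synthesis
    "playback_start": 100,   # streaming the first audio back to the board
}
total = sum(stage_ms.values())
print(f"{total / 1000:.1f} s")  # 2.5 s, inside the 2-4 s range quoted above
```

The silence-detection window is pure dead time, so lowering `silence_threshold` is the cheapest latency win, at the risk of clipping slow speakers mid-sentence.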
SOUL.md for Voice Interaction
Voice interaction requires a different personality style than text. Optimize your SOUL.md for spoken responses:
```markdown
# Voice Home Assistant

You are a voice-controlled home assistant in the living room.

## Voice Response Rules
- Keep responses under 2 sentences for simple actions
- Never use markdown formatting, URLs, or code in responses
- Use natural speech patterns ("I've turned on the light" not "Light state: ON")
- Pronounce numbers naturally ("twenty-four degrees" not "24°C")
- If you performed an action, confirm briefly. Don't over-explain.
- For errors, give a short explanation and one suggestion

## Personality
- Warm but efficient — like a helpful concierge
- Confirm actions in past tense ("Done, I've set the AC to 24 degrees")
- For questions, give direct answers first, then context if needed
```
Connecting Smart Devices
MQTT Lights (via Home Assistant / Zigbee2MQTT)
If you have Zigbee smart lights managed by Home Assistant or Zigbee2MQTT, the AI agent can control them via MQTT. Add the device information to your SOUL.md:
```markdown
## Devices
- Living room light: mqtt topic zigbee2mqtt/living_room/set
  Capabilities: state (ON/OFF), brightness (0-254), color_temp (150-500)
- Bedroom light: mqtt topic zigbee2mqtt/bedroom/set
  Capabilities: state (ON/OFF), brightness (0-254)
```
Now you can say: “Dim the living room lights to 30 percent” and the agent will publish the appropriate MQTT command.
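Behind that command, “30 percent” has to be mapped onto Zigbee2MQTT’s 0-254 brightness scale and wrapped in a JSON payload. A sketch of the translation (the function name is illustrative; ESP-Claw does this in firmware):

```python
import json

def brightness_payload(percent):
    """Map a spoken percentage onto Zigbee2MQTT's 0-254 brightness scale."""
    level = round(max(0, min(100, percent)) / 100 * 254)
    return json.dumps({"state": "ON", "brightness": level})

topic = "zigbee2mqtt/living_room/set"  # topic from the device list
print(topic, brightness_payload(30))   # {"state": "ON", "brightness": 76}
```

Publishing that payload to the `/set` topic is the standard Zigbee2MQTT control pattern; the device echoes its new state back on the base topic, which the agent can subscribe to for confirmation.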
IR-Controlled Devices (AC, TV)
The IR LED lets you control any device that uses an infrared remote. ESP-Claw includes a learning mode:
- Say “Learn a new remote command”
- The agent will prompt: “Point your remote at the sensor and press the button”
- Press the button on your existing remote
- The agent captures and stores the IR code
- Name it: “AC cool mode 24 degrees”
After learning, you can say “Turn on the AC” and the agent will send the learned IR code.
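One way to picture what learning mode stores: a named table of raw mark/space timings (in microseconds) that the IR LED driver replays verbatim. The structure and function names below are illustrative, not ESP-Claw’s internal format:

```python
# Hypothetical in-memory store for learned IR codes.
# Raw timings are alternating mark/space durations in microseconds.
learned_codes = {}

def learn(name, raw_timings):
    """Save a captured timing sequence under a spoken-friendly name."""
    learned_codes[name] = tuple(raw_timings)

def replay(name):
    """Return the timing sequence the IR LED driver would emit."""
    return learned_codes[name]

# An NEC-style frame starts with a ~9 ms leader mark and ~4.5 ms space.
learn("AC cool mode 24 degrees", [9000, 4500, 560, 560, 560, 1690])
print(replay("AC cool mode 24 degrees")[:2])
```

Storing raw timings rather than decoded protocol bits is what makes learning mode protocol-agnostic: it works even for AC remotes with long, vendor-specific frames.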
Practical Usage Examples
Here are real voice interactions with the completed system:
Morning routine:
- You: “Good morning”
- Agent: “Good morning! It’s 22 degrees inside, 15 outside. I’ve turned on the kitchen light and started the coffee maker.”
Climate control:
- You: “It’s getting warm in here”
- Agent: [reads DHT22: 27.5°C] [sends IR: AC 24°C cool] “I’ve turned on the AC to 24 degrees. It’s 27.5 currently, should cool down in about 10 minutes.”
Evening mode:
- You: “Movie time”
- Agent: [dims lights to 10%] [turns on TV via IR] “Lights dimmed and TV is on. Enjoy your movie.”
Troubleshooting Voice Issues
Wake word not detecting: Check that the INMP441 microphone is wired correctly. The L/R pin must be connected to GND for left channel output. Test by blowing on the microphone — you should see audio level changes in the serial monitor.
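If you want a number instead of eyeballing the serial monitor, the “audio level” is typically the RMS of a frame of samples: silence sits near zero, and blowing on the mic spikes it. A host-side sketch of the calculation (assumed convention, not ESP-Claw’s exact metric):

```python
import math

def rms(samples):
    """Root-mean-square level of a frame of 16-bit PCM samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

print(rms([0, 0, 0, 0]))                       # 0.0 — silence, or a miswired SD pin
print(round(rms([1000, -1000, 1000, -1000])))  # 1000 — a strong signal
```

A frame that reads exactly zero at all times usually means the data line is dead (wrong SD pin or a floating L/R pin), not that the room is quiet.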
Audio quality is poor: The INMP441 is sensitive to electrical noise. Keep the microphone wires short and away from the power lines. A small decoupling capacitor (100nF) near the microphone’s VDD pin helps.
Speaker is too quiet: The MAX98357A gain defaults to 9dB when the GAIN pin is left floating. Connect GAIN to GND for 12dB, or tie it to GND through a 100k resistor for 15dB. Powering the amplifier from 5V (instead of 3.3V) also increases the maximum volume.
Response latency is high: The biggest contributor to latency is the speech-to-text step. For faster responses, consider running a local Whisper model on a Raspberry Pi or your home server, and point ESP-Claw’s STT endpoint to it.
False wake word triggers: The default wake word model has a ~5% false positive rate. You can increase the detection threshold in the voice configuration to reduce false triggers at the cost of occasionally needing to repeat yourself.
Read Next
- Connect ESP-Claw to Your Smart Home with MQTT — Add MQTT device control
- ESP32-C3 vs ESP32-S3 for AI Projects — Why voice requires the S3
- The Complete Guide to SOUL.md — Optimize personality for voice interaction
- ESP-Claw Security Best Practices — Secure your voice-enabled device
- Pinout Reference — I2S pin assignments for mic and speaker