Open Claw
OpenClaw Team

Building a Voice-Controlled Smart Home with ESP-Claw

Text-based AI agents are useful, but voice interaction feels magical. Walking into your living room and saying “dim the lights and play some jazz” — and having it actually happen — is the kind of experience that used to require expensive commercial products. With ESP-Claw on an ESP32-S3, you can build this for under $25.

This tutorial covers the complete build: hardware assembly, firmware configuration, voice pipeline setup, and smart home integration.

What We’re Building

A standalone voice-controlled AI assistant that:

  • Listens for a wake word (“Hey Claw”)
  • Captures your voice command
  • Sends it to a speech-to-text service
  • Processes the text through the AI agent (with tool calling)
  • Converts the response to speech
  • Plays it through a speaker
  • Controls smart home devices via MQTT and IR

The entire system runs on a single ESP32-S3 board.

Bill of Materials

| Component | Purpose | Price |
| --- | --- | --- |
| ESP32-S3 DevKitC (N16R8) | Main processor, 8MB PSRAM | $7.50 |
| INMP441 I2S microphone | Voice input | $2.50 |
| MAX98357A I2S amplifier | Audio output | $2.00 |
| 3W speaker (28mm, 8 ohm) | Sound output | $0.80 |
| IR LED + 100 ohm resistor | AC and TV control | $0.20 |
| DHT22 temperature sensor | Environment monitoring | $2.00 |
| Breadboard + jumper wires | Assembly | $2.50 |
| USB-C cable | Power and programming | $1.00 |
| **Total** | | **$18.50** |

Hardware Assembly

Microphone Wiring (INMP441)

The INMP441 is an I2S MEMS microphone. It outputs digital audio directly — no analog-to-digital conversion needed.

| INMP441 Pin | ESP32-S3 Pin | Function |
| --- | --- | --- |
| VDD | 3.3V | Power (do NOT use 5V) |
| GND | GND | Ground |
| SCK | GPIO 4 | I2S bit clock |
| WS | GPIO 5 | I2S word select (left/right) |
| SD | GPIO 6 | I2S serial data |
| L/R | GND | Channel select (GND = left) |

Speaker Wiring (MAX98357A)

The MAX98357A is an I2S amplifier that drives a small speaker directly from the ESP32’s I2S output.

| MAX98357A Pin | ESP32-S3 Pin | Function |
| --- | --- | --- |
| VIN | 5V (USB) | Power (5V for more volume) |
| GND | GND | Ground |
| BCLK | GPIO 7 | I2S bit clock |
| LRC | GPIO 8 | I2S left/right clock |
| DIN | GPIO 9 | I2S serial data in |
| GAIN | Not connected | Default 9dB gain |
| SD | Not connected | Default: enabled |

Connect the speaker’s positive terminal to the + output and negative to the - output on the MAX98357A board.

IR and Sensor Wiring

Wire the remaining peripherals to free GPIO pins. This build uses:

| Component | ESP32-S3 Pin |
| --- | --- |
| DHT22 data | GPIO 10 |
| IR LED (anode through 100 ohm resistor) | GPIO 12 |

Firmware Configuration

Flash the ESP-Claw firmware using the browser flasher or manual flashing guide. Select the ESP32-S3 variant.

Voice Pipeline Configuration

In the ESP-Claw configuration page, enable voice features:

```json
{
  "voice": {
    "enabled": true,
    "wake_word": "hey_claw",
    "stt_provider": "whisper_api",
    "stt_api_key": "your-openai-key",
    "tts_provider": "edge_tts",
    "tts_voice": "en-US-AriaNeural",
    "silence_threshold": 500,
    "max_recording_ms": 10000,
    "i2s_mic": {
      "bck": 4,
      "ws": 5,
      "data": 6
    },
    "i2s_speaker": {
      "bck": 7,
      "ws": 8,
      "data": 9
    }
  }
}
```
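It's worth checking that the capture buffer implied by these settings fits in memory. Assuming 16-bit mono PCM at 16kHz (the format used later in the pipeline), the worst-case buffer for a full-length recording is:

```python
# Worst-case PSRAM buffer for the voice capture settings above.
SAMPLE_RATE_HZ = 16_000    # 16 kHz mono
BYTES_PER_SAMPLE = 2       # 16-bit PCM
MAX_RECORDING_MS = 10_000  # matches "max_recording_ms" in the config

buffer_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * MAX_RECORDING_MS // 1000
print(buffer_bytes)  # 320000
```

At 320KB, even a maximum-length recording uses only a small fraction of the N16R8 board's 8MB PSRAM.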

How the Voice Pipeline Works

The voice pipeline runs in stages, each optimized for the ESP32-S3’s capabilities:

Stage 1 — Wake Word Detection (on-device): A small TensorFlow Lite model (80KB) runs continuously on the ESP32, listening for “Hey Claw.” This uses about 15ms of CPU time per audio frame and runs entirely locally. No audio is sent anywhere until the wake word is detected.
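The model's per-frame scores are typically gated before a detection fires, which is also the knob behind the false-trigger tradeoff discussed at the end of this post. The firmware's internal interface isn't documented here, so the score source, threshold, and frame count below are illustrative assumptions:

```python
# Illustrative wake-word gating: require several consecutive frames above
# a probability threshold before declaring a detection. Raising either
# value trades false positives against missed detections.
def detect_wake_word(frame_scores, threshold=0.8, required_frames=3):
    """frame_scores: per-frame wake-word probabilities from the model."""
    streak = 0
    for i, score in enumerate(frame_scores):
        streak = streak + 1 if score >= threshold else 0
        if streak >= required_frames:
            return i  # frame index where detection fires
    return None

# A single noisy frame does not trigger; a sustained match does.
print(detect_wake_word([0.2, 0.9, 0.3, 0.85, 0.9, 0.95]))  # 5
print(detect_wake_word([0.2, 0.9, 0.3]))                    # None
```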

Stage 2 — Voice Capture: After detecting the wake word, the ESP32 records audio until it detects 500ms of silence (configurable). The audio is captured as 16-bit PCM at 16kHz and buffered in PSRAM.
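The end-of-utterance logic can be sketched as frame-by-frame energy tracking. This is an assumption about the method (RMS over short frames against the configured `silence_threshold`); the firmware may use a different voice-activity measure:

```python
import array
import math

def recording_end_index(pcm, sample_rate=16_000, frame_ms=20,
                        silence_ms=500, threshold=500):
    """Return the sample index at which a run of continuous silence ends
    the recording, or None if the audio never goes quiet.
    pcm: sequence of signed 16-bit samples."""
    frame_len = sample_rate * frame_ms // 1000  # 320 samples per frame
    frames_needed = silence_ms // frame_ms      # 25 consecutive quiet frames
    samples = array.array("h", pcm)
    quiet = 0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        quiet = quiet + 1 if rms < threshold else 0
        if quiet >= frames_needed:
            return start + frame_len  # end of the silent run
    return None

# 1 s of loud "speech" (a +/-3000 square wave) followed by 1 s of silence:
speech = [3000 if i % 2 else -3000 for i in range(16_000)]
print(recording_end_index(speech + [0] * 16_000))  # 24000
```

Recording stops 500ms (8,000 samples) into the silent second, exactly as the configuration dictates.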

Stage 3 — Speech-to-Text (cloud): The captured audio is sent to OpenAI’s Whisper API (or a self-hosted Whisper server) for transcription. Typical latency: 0.5-1 second for a short command.
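The Whisper call itself is a plain HTTPS multipart upload. The endpoint and `whisper-1` model name below are OpenAI's public transcription API; whether ESP-Claw assembles the request exactly this way is an assumption:

```python
def build_whisper_request(audio_path, api_key,
                          url="https://api.openai.com/v1/audio/transcriptions"):
    """Assemble the pieces of a Whisper transcription request.
    Send with e.g. requests.post(url, headers=headers, data=data,
    files={"file": open(audio_path, "rb")})."""
    headers = {"Authorization": f"Bearer {api_key}"}
    data = {"model": "whisper-1", "response_format": "text"}
    return url, headers, data

url, headers, data = build_whisper_request("command.wav", "your-openai-key")
print(data["model"])  # whisper-1
```

A self-hosted Whisper server that mimics this endpoint can be swapped in by changing only the `url`.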

Stage 4 — AI Agent Processing: The transcribed text is processed by the AI agent exactly like a text message. The agent reasons about the request, calls tools as needed, and generates a response.

Stage 5 — Text-to-Speech (cloud/edge): The response text is converted to speech using Microsoft Edge TTS (free, high quality) or a similar service. The audio is streamed back to the ESP32.

Stage 6 — Audio Playback: The MAX98357A amplifier plays the response through the speaker.

Total round-trip time for a simple command: 2-4 seconds. Complex commands with tool calls: 3-6 seconds.

SOUL.md for Voice Interaction

Voice interaction requires a different personality style than text. Optimize your SOUL.md for spoken responses:

```markdown
# Voice Home Assistant

You are a voice-controlled home assistant in the living room.

## Voice Response Rules

- Keep responses under 2 sentences for simple actions
- Never use markdown formatting, URLs, or code in responses
- Use natural speech patterns ("I've turned on the light" not "Light state: ON")
- Pronounce numbers naturally ("twenty-four degrees" not "24°C")
- If you performed an action, confirm briefly. Don't over-explain.
- For errors, give a short explanation and one suggestion

## Personality

- Warm but efficient — like a helpful concierge
- Confirm actions in past tense ("Done, I've set the AC to 24 degrees")
- For questions, give direct answers first, then context if needed
```
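The "pronounce numbers naturally" rule can also be enforced in code as a preprocessing pass on the response text before it reaches TTS. Whether ESP-Claw does this normalization itself is an assumption; this sketch handles just the temperature case from the rules above:

```python
import re

# Number words for two-digit values, enough for temperatures.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def spell_number(n: int) -> str:
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def speakable(text: str) -> str:
    """Rewrite temperatures like '24°C' as 'twenty-four degrees'."""
    return re.sub(r"(\d{1,2})°C",
                  lambda m: spell_number(int(m.group(1))) + " degrees",
                  text)

print(speakable("I've set the AC to 24°C."))
# I've set the AC to twenty-four degrees.
```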

Connecting Smart Devices

MQTT Lights (via Home Assistant / Zigbee2MQTT)

If you have Zigbee smart lights managed by Home Assistant or Zigbee2MQTT, the AI agent can control them via MQTT. Add the device information to your SOUL.md:

```markdown
## Devices

- Living room light: MQTT topic `zigbee2mqtt/living_room/set`
  Capabilities: state (ON/OFF), brightness (0-254), color_temp (150-500)
- Bedroom light: MQTT topic `zigbee2mqtt/bedroom/set`
  Capabilities: state (ON/OFF), brightness (0-254)
```

Now you can say: “Dim the living room lights to 30 percent” and the agent will publish the appropriate MQTT command.
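Behind the scenes, "30 percent" has to be mapped onto Zigbee2MQTT's 0-254 brightness scale and wrapped in its JSON set-message. A sketch of that translation (the topic layout and payload keys are standard Zigbee2MQTT; how the agent performs the conversion internally is an assumption):

```python
import json

def dim_command(room: str, percent: int):
    """Build the Zigbee2MQTT set-message for a brightness change.
    Zigbee2MQTT expects brightness on a 0-254 scale, not 0-100."""
    topic = f"zigbee2mqtt/{room}/set"
    payload = json.dumps({"state": "ON",
                          "brightness": round(percent / 100 * 254)})
    return topic, payload

topic, payload = dim_command("living_room", 30)
print(topic)    # zigbee2mqtt/living_room/set
print(payload)  # {"state": "ON", "brightness": 76}
```

Publishing the result with any MQTT client (for example paho-mqtt's `client.publish(topic, payload)`) produces the same effect as the voice command.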

IR-Controlled Devices (AC, TV)

The IR LED lets you control any device that uses an infrared remote. ESP-Claw includes a learning mode:

  1. Say “Learn a new remote command”
  2. The agent will prompt: “Point your remote at the sensor and press the button”
  3. Press the button on your existing remote
  4. The agent captures and stores the IR code
  5. Name it: “AC cool mode 24 degrees”

After learning, you can say “Turn on the AC” and the agent will send the learned IR code.
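Replaying a learned code means expanding a stored value back into the mark/space timings the IR LED transmits. The sketch below does this for the common NEC protocol (the 9ms/4.5ms leader and 562µs bit timings come from the NEC specification); whether ESP-Claw stores codes in exactly this form is an assumption, and the code value is just an example:

```python
def nec_timings(code: int):
    """Expand a 32-bit NEC code into (mark_us, space_us) pairs:
    9 ms leader mark + 4.5 ms space, then 32 bits LSB-first
    (bit 0 = 562/562 us, bit 1 = 562/1687 us), then a stop mark."""
    pulses = [(9000, 4500)]
    for i in range(32):
        bit = (code >> i) & 1
        pulses.append((562, 1687 if bit else 562))
    pulses.append((562, 0))  # trailing mark terminates the frame
    return pulses

timings = nec_timings(0x20DF10EF)  # example 32-bit code value
print(len(timings))  # 34 entries: leader + 32 bits + stop
```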

Practical Usage Examples

Here are real voice interactions with the completed system:

Morning routine:

  • You: “Good morning”
  • Agent: “Good morning! It’s 22 degrees inside, 15 outside. I’ve turned on the kitchen light and started the coffee maker.”

Climate control:

  • You: “It’s getting warm in here”
  • Agent: [reads DHT22: 27.5°C] [sends IR: AC 24°C cool] “I’ve turned on the AC to 24 degrees. It’s 27.5 currently, should cool down in about 10 minutes.”

Evening mode:

  • You: “Movie time”
  • Agent: [dims lights to 10%] [turns on TV via IR] “Lights dimmed and TV is on. Enjoy your movie.”

Troubleshooting Voice Issues

Wake word not detecting: Check that the INMP441 microphone is wired correctly. The L/R pin must be connected to GND for left channel output. Test by blowing on the microphone — you should see audio level changes in the serial monitor.

Audio quality is poor: The INMP441 is sensitive to electrical noise. Keep the microphone wires short and away from the power lines. A small decoupling capacitor (100nF) near the microphone’s VDD pin helps.

Speaker is too quiet: The MAX98357A gain defaults to 9dB with the GAIN pin floating. Connect GAIN to GND for 12dB, or to GND through a 100k ohm resistor for 15dB (tying GAIN to VIN actually lowers the gain to 6dB). Powering the board from 5V instead of 3.3V also increases the maximum volume.

Response latency is high: The biggest contributor to latency is the speech-to-text step. For faster responses, consider running a local Whisper model on a Raspberry Pi or your home server, and point ESP-Claw’s STT endpoint to it.
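Many self-hosted Whisper servers expose an OpenAI-compatible `/v1/audio/transcriptions` route, so switching over can be a one-line config change. Note the `stt_endpoint` key name and the address below are assumptions for illustration; check the ESP-Claw configuration reference for the actual key:

```json
{
  "voice": {
    "stt_provider": "whisper_api",
    "stt_endpoint": "http://192.168.1.50:9000/v1/audio/transcriptions",
    "stt_api_key": ""
  }
}
```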

False wake word triggers: The default wake word model has a ~5% false positive rate. You can increase the detection threshold in the voice configuration to reduce false triggers at the cost of occasionally needing to repeat yourself.