ws://<host>:<VOICE_PORT>/audio
Default: ws://localhost:3000/audio
Audio Format
All audio data is transmitted as raw PCM in binary WebSocket frames.
| Property | Value |
|---|---|
| Encoding | PCM (signed integer) |
| Bit Depth | 16-bit (Int16) |
| Sample Rate | 16,000 Hz |
| Channels | 1 (mono) |
| Byte Order | Little-endian |
| Max Frame | 1 MB |
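
As an illustration of the format in the table, the following Python sketch converts floating-point samples to signed 16-bit little-endian PCM bytes. The function name `floats_to_pcm16le` and the scaling by 32767 are assumptions made for this example; they are not part of the protocol.

```python
import struct


def floats_to_pcm16le(samples):
    """Convert floats in [-1.0, 1.0] to Int16 little-endian PCM bytes.

    Hypothetical helper: the protocol only specifies the wire format,
    not how a client produces it.
    """
    ints = []
    for sample in samples:
        clamped = max(-1.0, min(1.0, sample))  # clamp out-of-range input
        ints.append(int(round(clamped * 32767)))
    # "<" selects little-endian, "h" is a signed 16-bit integer.
    return struct.pack("<%dh" % len(ints), *ints)
```

Each mono sample occupies two bytes, so the returned buffer is always twice as long as the input sequence.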
Connection Lifecycle
Session Ack
Server immediately sends a session_ack message with the assigned sessionId and audio configuration.
Session Start (optional)
Client sends session_start with an optional deviceId to identify the hardware.
Audio Streaming
Client sends binary frames containing raw PCM audio from the microphone. Server processes the audio through the voice pipeline (VAD, STT, LLM, TTS).
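
The streaming step can be sketched as slicing a captured PCM buffer into binary frame payloads. At 16,000 Hz, 16-bit, mono, one millisecond of audio is 32 bytes. The helper name `chunk_pcm` and the 20 ms default are assumptions for illustration; the protocol does not require any particular chunk size.

```python
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # 16-bit samples
CHANNELS = 1          # mono

# 16000 samples/s * 2 bytes * 1 channel / 1000 ms = 32 bytes per millisecond
BYTES_PER_MS = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS // 1000


def chunk_pcm(pcm, chunk_ms=20):
    """Yield successive payloads of chunk_ms milliseconds of audio.

    Hypothetical helper; the final chunk may be shorter than the rest.
    """
    size = BYTES_PER_MS * chunk_ms
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```

Each yielded payload would be sent as one binary WebSocket frame by whatever client library is in use.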
Client to Server Messages
Binary Frames: Audio Data
Raw PCM audio from the client microphone. Must match the audio format specification (Int16, 16kHz, mono, little-endian). Send audio in chunks as they become available from the microphone. There is no required chunk size, but typical sizes are 320 bytes (10ms) to 3200 bytes (100ms).
Text Frames: Control Messages
All control messages are JSON objects with a type field.
session_start
Sent after connection to identify the device and begin a voice session.
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be "session_start". |
| deviceId | string | No | Device identifier for multi-device setups. |
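
A minimal sketch of building this message as a JSON text frame. The function name `session_start_message` is hypothetical, and omitting `deviceId` entirely when none is given (rather than sending null) is an assumption about what the server accepts.

```python
import json


def session_start_message(device_id=None):
    """Return the JSON text for a session_start control message."""
    message = {"type": "session_start"}
    if device_id is not None:
        # deviceId is optional, so include it only when provided.
        message["deviceId"] = device_id
    return json.dumps(message)
```

The returned string is sent as a text frame, not a binary frame, so the server can tell it apart from audio data.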
session_end
Signal the end of a voice session. The WebSocket connection remains open.
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be "session_end". |
config
Update runtime configuration for this connection.
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be "config". |
| config | object | No | Key-value configuration overrides. |
status
Request connection status from the server. Server responds with a status control message.
Server to Client Messages
Binary Frames: TTS Audio Data
Raw PCM audio synthesized by the TTS engine. Same format as client audio: Int16, 16kHz, mono, little-endian. Sent between tts_start and tts_end control messages.
Text Frames: Control Messages
session_ack
Sent immediately after connection to confirm session assignment and audio configuration.
| Field | Type | Description |
|---|---|---|
| type | string | Always "session_ack". |
| sessionId | string | Server-assigned session ID. |
| details.sampleRate | number | Expected audio sample rate (16000). |
| details.channels | number | Expected channel count (1). |
| details.bitDepth | number | Expected bit depth (16). |
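
A client may want to confirm that the acknowledged audio configuration matches what it is about to send. The sketch below assumes the dotted field names in the table denote a nested `details` object; that layout, and the function name `parse_session_ack`, are assumptions rather than confirmed behavior.

```python
import json

# Values the table above lists as expected.
EXPECTED_AUDIO = {"sampleRate": 16000, "channels": 1, "bitDepth": 16}


def parse_session_ack(text):
    """Return (session_id, mismatches) for a session_ack text frame.

    mismatches maps each audio field whose value differs from
    EXPECTED_AUDIO to the value the server actually reported.
    """
    message = json.loads(text)
    if message.get("type") != "session_ack":
        raise ValueError("not a session_ack message")
    details = message.get("details", {})  # assumed nested object
    mismatches = {
        key: details.get(key)
        for key, expected in EXPECTED_AUDIO.items()
        if details.get(key) != expected
    }
    return message["sessionId"], mismatches
```

An empty mismatch dictionary means the client can stream in the documented format without resampling.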
tts_start
Indicates that TTS audio frames will follow. The client should prepare to play audio.
tts_end
Indicates that all TTS audio frames for the current utterance have been sent.
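
The tts_start / tts_end bracketing can be handled with a small state machine on the client. This is a sketch only: the class name `TtsCollector` and the choice to buffer a whole utterance before playback are assumptions, and a real client would more likely play frames as they arrive.

```python
import json


class TtsCollector:
    """Collect binary TTS frames that arrive between tts_start and tts_end."""

    def __init__(self):
        self.active = False
        self.chunks = []
        self.utterances = []  # one bytes object per completed utterance

    def on_text(self, text):
        kind = json.loads(text).get("type")
        if kind == "tts_start":
            self.active = True
            self.chunks = []
        elif kind == "tts_end" and self.active:
            self.active = False
            self.utterances.append(b"".join(self.chunks))
            self.chunks = []

    def on_binary(self, data):
        # Binary frames outside a tts_start/tts_end pair are ignored here.
        if self.active:
            self.chunks.append(data)
```

The client's WebSocket message handler would call `on_text` for text frames and `on_binary` for binary frames.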
error
Sent when an error occurs during processing.
| Field | Type | Description |
|---|---|---|
| type | string | Always "error". |
| sessionId | string | Associated session (if available). |
| error | string | Human-readable error description. |
status
Response to a client status request. Contains connection metrics.
| Field | Type | Description |
|---|---|---|
details.connectedAt | number | Unix ms timestamp of connection start. |
details.bytesReceived | number | Total audio bytes received from client. |
details.bytesSent | number | Total audio bytes sent to client. |
details.uptime | number | Connection duration in milliseconds. |
Keepalive and Timeout
The server sends WebSocket ping frames at a configurable interval (default: 30 seconds). Clients must respond with pong frames (this is handled automatically by most WebSocket libraries).
If a client does not respond to a ping with a pong before the next ping cycle, the server considers the client unresponsive and terminates the connection.
| Setting | Default | Description |
|---|---|---|
| Ping interval | 30,000ms | Time between keepalive pings. |
| Stale detection | 2 cycles | Client terminated after missing one pong. |
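
The stale-detection rule can be modeled as a per-client flag that is cleared when a ping goes out and set again when the pong comes back. The sketch below is one plausible way a server could implement the behavior described; the class `KeepaliveTracker` and its methods are hypothetical and do not describe the actual server code.

```python
class KeepaliveTracker:
    """Track which clients answered the most recent ping."""

    def __init__(self):
        self.alive = {}  # client id -> True if a pong arrived since the last ping

    def add(self, client_id):
        self.alive[client_id] = True

    def on_pong(self, client_id):
        if client_id in self.alive:
            self.alive[client_id] = True

    def tick(self):
        """Run once per ping interval. Returns (to_terminate, to_ping)."""
        # Clients still flagged False never answered the previous ping.
        stale = [cid for cid, ok in self.alive.items() if not ok]
        for cid in stale:
            del self.alive[cid]
        # Clear the flag for everyone else; their pong will set it again.
        for cid in self.alive:
            self.alive[cid] = False
        return stale, list(self.alive)
```

Under this model a silent client is pinged on one cycle and terminated on the next, which matches the two-cycle figure in the table.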
Connection Limits
| Limit | Value | Description |
|---|---|---|
| Max clients | 10 | Maximum simultaneous WebSocket connections. |
| Max frame size | 1 MB | Maximum size of a single WebSocket frame. |
Error Handling
Connection Rejection
| Close Code | Reason | When |
|---|---|---|
| 1001 | Server shutting down | Server is performing graceful shutdown. |
| 1013 | Max clients reached | Too many concurrent connections. |
Runtime Errors
Errors during audio processing (STT failure, LLM timeout, TTS error) are reported via error control messages. The WebSocket connection remains open so the client can retry.
Reconnection
Example: Browser Client
Example: ESP32 Client
For ESP32 firmware using the XIAO ESP32-S3 Sense, see the firmware repository. The ESP32 streams I2S microphone data (PDM, pins 42/41) as raw PCM over WebSocket and plays back TTS audio through an I2S DAC.
Key considerations for ESP32:
- Use binary WebSocket frames for audio data.
- Buffer at least 100ms of audio before sending to reduce frame overhead.
- Handle ping/pong keepalive (most ESP32 WebSocket libraries handle this automatically).
- Implement reconnection with backoff for Wi-Fi instability.
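
Reconnection with backoff is commonly done with exponentially growing delays and an upper cap. The sketch below shows the delay schedule in Python for readability; firmware would implement the same arithmetic in its own language. The function name and every parameter value (1 s base, 30 s cap, 6 attempts) are assumptions, since the protocol does not prescribe a backoff policy.

```python
import random


def backoff_delays(base=1.0, cap=30.0, attempts=6, jitter=0.0, rng=random.random):
    """Return the wait in seconds before each reconnection attempt.

    The delay doubles per attempt up to cap. With jitter > 0, up to
    jitter * delay of random extra time is added so that many devices
    do not reconnect at the same instant.
    """
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        delay += delay * jitter * rng()
        delays.append(delay)
    return delays
```

A client would sleep for each delay in turn between connection attempts and reset the schedule after a successful connection.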