The agtOS voice pipeline uses raw WebSocket for bidirectional audio streaming between the server and clients (ESP32 devices, browsers). This avoids the STUN/TURN/ICE complexity of WebRTC while working reliably on local networks.

Connection URL: ws://<host>:<VOICE_PORT>/audio
Default: ws://localhost:3000/audio

Audio Format

All audio data is transmitted as raw PCM in binary WebSocket frames.
| Property | Value |
| --- | --- |
| Encoding | PCM (signed integer) |
| Bit Depth | 16-bit (Int16) |
| Sample Rate | 16,000 Hz |
| Channels | 1 (mono) |
| Byte Order | Little-endian |
| Max Frame | 1 MB |
Both client-to-server (microphone audio) and server-to-client (TTS audio) use the same format.

Connection Lifecycle

Client                                Server
  |                                      |
  |  --- WebSocket CONNECT /audio --->   |
  |                                      |  (assign clientId, sessionId)
  |  <--- session_ack (JSON) ----------  |
  |                                      |
  |  --- session_start (JSON) -------->  |  (optional: set deviceId)
  |                                      |
  |  --- binary audio frames --------->  |  (microphone PCM data)
  |  --- binary audio frames --------->  |
  |                                      |  (STT processing)
  |                                      |  (LLM reasoning)
  |                                      |  (TTS synthesis)
  |  <--- tts_start (JSON) ----------   |
  |  <--- binary audio frames --------  |  (TTS PCM data)
  |  <--- binary audio frames --------  |
  |  <--- tts_end (JSON) ------------   |
  |                                      |
  |  --- session_end (JSON) --------->  |  (optional: end session)
  |                                      |
  |  --- WebSocket CLOSE ------------>  |
  |                                      |
1. Connect: Client opens a WebSocket connection to ws://<host>:<VOICE_PORT>/audio.
2. Session Ack: Server immediately sends a session_ack message with the assigned sessionId and audio configuration.
3. Session Start (optional): Client sends session_start with an optional deviceId to identify the hardware.
4. Audio Streaming: Client sends binary frames containing raw PCM audio from the microphone. The server processes them through the voice pipeline (VAD, STT, LLM, TTS).
5. TTS Playback: Server sends tts_start, followed by binary PCM audio frames, followed by tts_end.
6. Session End (optional): Client sends session_end to signal the end of the conversation.
7. Disconnect: Either side closes the WebSocket connection.

Client to Server Messages

Binary Frames: Audio Data

Raw PCM audio from the client microphone. Must match the audio format specification (Int16, 16kHz, mono, little-endian). Send audio in chunks as they become available from the microphone. There is no required chunk size, but typical sizes are 320 bytes (10ms) to 3200 bytes (100ms).
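Chunk sizes follow directly from the format (16,000 samples/s x 2 bytes/sample = 32,000 bytes/s). The helpers below are illustrative, not part of the agtOS API:

```javascript
// Illustrative helpers relating chunk duration to byte size for the
// Int16 / 16 kHz / mono format (2 bytes per sample).
const SAMPLE_RATE = 16000;   // samples per second
const BYTES_PER_SAMPLE = 2;  // Int16

function bytesForDuration(ms) {
  return Math.round(SAMPLE_RATE * BYTES_PER_SAMPLE * (ms / 1000));
}

function durationForBytes(bytes) {
  return (bytes / (SAMPLE_RATE * BYTES_PER_SAMPLE)) * 1000;
}

console.log(bytesForDuration(10));   // 320 bytes  (10 ms)
console.log(bytesForDuration(100));  // 3200 bytes (100 ms)
```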

Text Frames: Control Messages

All control messages are JSON objects with a type field.
session_start

Sent after connection to identify the device and begin a voice session.
{
  "type": "session_start",
  "deviceId": "esp32-kitchen-01"
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Must be "session_start". |
| deviceId | string | No | Device identifier for multi-device setups. |
session_end

Signals the end of a voice session. The WebSocket connection remains open.
{
  "type": "session_end"
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Must be "session_end". |
config

Updates runtime configuration for this connection.
{
  "type": "config",
  "config": {
    "vadSensitivity": 0.8,
    "language": "en"
  }
}
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Must be "config". |
| config | object | No | Key-value configuration overrides. |
status

Requests connection status from the server. The server responds with a status control message.
{
  "type": "status"
}
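The client-to-server control messages above can be wrapped in small builder functions. The helper names below are our own, not part of the agtOS API:

```javascript
// Hypothetical builders for the client-to-server control messages.
// Each returns the JSON text to pass to ws.send().
function sessionStart(deviceId) {
  const msg = { type: 'session_start' };
  if (deviceId !== undefined) msg.deviceId = deviceId; // deviceId is optional
  return JSON.stringify(msg);
}

function sessionEnd() {
  return JSON.stringify({ type: 'session_end' });
}

function configUpdate(config) {
  return JSON.stringify({ type: 'config', config });
}

function statusRequest() {
  return JSON.stringify({ type: 'status' });
}

// Usage: ws.send(sessionStart('esp32-kitchen-01'));
```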

Server to Client Messages

Binary Frames: TTS Audio Data

Raw PCM audio synthesized by the TTS engine. Same format as client audio: Int16, 16kHz, mono, little-endian. Sent between tts_start and tts_end control messages.

Text Frames: Control Messages

session_ack

Sent immediately after connection to confirm session assignment and audio configuration.
{
  "type": "session_ack",
  "sessionId": "session-a1b2c3d4e5f6",
  "details": {
    "sampleRate": 16000,
    "channels": 1,
    "bitDepth": 16
  }
}
| Field | Type | Description |
| --- | --- | --- |
| type | string | Always "session_ack". |
| sessionId | string | Server-assigned session ID. |
| details.sampleRate | number | Expected audio sample rate (16000). |
| details.channels | number | Expected channel count (1). |
| details.bitDepth | number | Expected bit depth (16). |
tts_start

Indicates that TTS audio frames will follow. The client should prepare to play audio.
{
  "type": "tts_start",
  "sessionId": "session-a1b2c3d4e5f6"
}
tts_end

Indicates that all TTS audio frames for the current utterance have been sent.
{
  "type": "tts_end",
  "sessionId": "session-a1b2c3d4e5f6"
}
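Because TTS audio arrives as a stream of binary frames bracketed by tts_start and tts_end, a client that prefers whole-utterance playback can buffer between the two markers. A minimal sketch (class and method names are illustrative, not part of agtOS):

```javascript
// Minimal sketch of client-side TTS buffering: collect binary frames
// between tts_start and tts_end, then deliver one Int16Array per utterance.
class TtsCollector {
  constructor(onUtterance) {
    this.onUtterance = onUtterance;
    this.chunks = null; // null = not currently inside an utterance
  }

  handleControl(msg) {
    if (msg.type === 'tts_start') {
      this.chunks = [];
    } else if (msg.type === 'tts_end' && this.chunks !== null) {
      // Concatenate all buffered frames into one contiguous PCM array.
      const total = this.chunks.reduce((n, c) => n + c.length, 0);
      const pcm = new Int16Array(total);
      let offset = 0;
      for (const c of this.chunks) { pcm.set(c, offset); offset += c.length; }
      this.chunks = null;
      this.onUtterance(pcm);
    }
  }

  handleBinary(arrayBuffer) {
    if (this.chunks !== null) this.chunks.push(new Int16Array(arrayBuffer));
  }
}
```

Streaming playback (starting before tts_end arrives) gives lower latency; buffering the whole utterance simplifies scheduling on constrained clients.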
error

Sent when an error occurs during processing.
{
  "type": "error",
  "sessionId": "session-a1b2c3d4e5f6",
  "error": "STT processing failed: model unavailable"
}
| Field | Type | Description |
| --- | --- | --- |
| type | string | Always "error". |
| sessionId | string | Associated session (if available). |
| error | string | Human-readable error description. |
status

Response to a client status request. Contains connection metrics.
{
  "type": "status",
  "sessionId": "session-a1b2c3d4e5f6",
  "details": {
    "connectedAt": 1711612800000,
    "bytesReceived": 1048576,
    "bytesSent": 524288,
    "uptime": 60000
  }
}
| Field | Type | Description |
| --- | --- | --- |
| details.connectedAt | number | Unix ms timestamp of connection start. |
| details.bytesReceived | number | Total audio bytes received from client. |
| details.bytesSent | number | Total audio bytes sent to client. |
| details.uptime | number | Connection duration in milliseconds. |
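The status metrics are enough to sanity-check audio flow. As an illustrative calculation (not an agtOS API), continuous 16 kHz Int16 mono audio is 32,000 bytes/s, i.e. 256 kbit/s in each direction:

```javascript
// Illustrative: derive average throughput (kbit/s) from the status details.
function throughputKbps(details) {
  const seconds = details.uptime / 1000;
  return {
    in: (details.bytesReceived * 8) / seconds / 1000,
    out: (details.bytesSent * 8) / seconds / 1000,
  };
}

// With the example payload above (1 MiB received over 60 s),
// inbound throughput is well below the 256 kbit/s of continuous audio,
// which is expected when the microphone is not streaming constantly.
const t = throughputKbps({ uptime: 60000, bytesReceived: 1048576, bytesSent: 524288 });
console.log(t);
```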

Keepalive and Timeout

The server sends WebSocket ping frames at a configurable interval (default: 30 seconds). Clients must respond with pong frames (this is handled automatically by most WebSocket libraries). If a client does not respond to a ping with a pong before the next ping cycle, the server considers the client unresponsive and terminates the connection.
| Setting | Default | Description |
| --- | --- | --- |
| Ping interval | 30,000 ms | Time between keepalive pings. |
| Stale detection | 2 cycles | Client terminated after missing one pong. |
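The "miss one pong, terminate on the next cycle" rule is the classic isAlive heartbeat pattern. A sketch of the server-side bookkeeping (illustrative, not the actual agtOS implementation):

```javascript
// Sketch of ping/pong stale detection: each ping cycle clears isAlive,
// and a pong restores it. A client still cleared at the next cycle
// missed a full cycle and is terminated.
function onPong(client) {
  client.isAlive = true;
}

function pingCycle(clients, ping, terminate) {
  for (const client of clients) {
    if (!client.isAlive) {
      terminate(client);    // missed the previous cycle's pong
      continue;
    }
    client.isAlive = false; // cleared until the client's next pong
    ping(client);
  }
}

// Run pingCycle on an interval (default 30,000 ms).
```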

Connection Limits

| Limit | Value | Description |
| --- | --- | --- |
| Max clients | 10 | Maximum simultaneous WebSocket connections. |
| Max frame size | 1 MB | Maximum size of a single WebSocket frame. |
When the maximum client count is reached, new connections are rejected with WebSocket close code 1013 (Try Again Later) and the message “Max clients reached”.

Error Handling

Connection Rejection

| Close Code | Reason | When |
| --- | --- | --- |
| 1001 | Server shutting down | Server is performing graceful shutdown. |
| 1013 | Max clients reached | Too many concurrent connections. |

Runtime Errors

Errors during audio processing (STT failure, LLM timeout, TTS error) are reported via error control messages. The WebSocket connection remains open so the client can retry.

Reconnection

Clients should implement reconnection with exponential backoff to handle transient disconnections gracefully.
1. Initial Wait: Wait 1 second after the first disconnection.
2. Backoff: Double the wait time on each subsequent attempt (2s, 4s, 8s…).
3. Cap: Cap the maximum wait at 30 seconds.
4. Reset: Reset the backoff timer after a successful connection.
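The schedule above reduces to a one-line delay function (the name is illustrative):

```javascript
// Backoff schedule: 1 s after the first drop, doubling per attempt,
// capped at 30 s. Reset attempt to 0 after a successful connection.
function backoffDelay(attempt, baseMs = 1000, capMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, capMs);
}

// attempt: 0 -> 1000 ms, 1 -> 2000, 2 -> 4000, ..., 5 and beyond -> 30000 (capped)
```

Adding random jitter to the delay helps avoid synchronized reconnect storms when many devices drop at once (e.g. after a Wi-Fi outage).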

Example: Browser Client

const ws = new WebSocket('ws://localhost:3000/audio');

ws.binaryType = 'arraybuffer';

ws.onopen = () => {
  // Send session start with device identification
  ws.send(JSON.stringify({
    type: 'session_start',
    deviceId: 'browser-main'
  }));
};

ws.onmessage = (event) => {
  if (event.data instanceof ArrayBuffer) {
    // Binary frame: TTS audio data (Int16 PCM, 16kHz, mono)
    const pcmData = new Int16Array(event.data);
    playAudio(pcmData);
  } else {
    // Text frame: control message
    const msg = JSON.parse(event.data);
    switch (msg.type) {
      case 'session_ack':
        console.log('Session:', msg.sessionId);
        break;
      case 'tts_start':
        console.log('TTS playback starting');
        break;
      case 'tts_end':
        console.log('TTS playback complete');
        break;
      case 'error':
        console.error('Server error:', msg.error);
        break;
    }
  }
};

// Stream microphone audio
async function startMicrophone() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const context = new AudioContext({ sampleRate: 16000 });
  const source = context.createMediaStreamSource(stream);
  // Note: ScriptProcessorNode is deprecated; AudioWorklet is preferred in production
  const processor = context.createScriptProcessor(4096, 1, 1);

  source.connect(processor);
  processor.connect(context.destination);

  processor.onaudioprocess = (e) => {
    const float32 = e.inputBuffer.getChannelData(0);
    // Convert Float32 to Int16
    const int16 = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      int16[i] = Math.max(-32768, Math.min(32767, float32[i] * 32768));
    }
    if (ws.readyState === WebSocket.OPEN) {
      ws.send(int16.buffer);
    }
  };
}

Example: ESP32 Client

For ESP32 firmware using the XIAO ESP32-S3 Sense, see the firmware repository. The ESP32 streams I2S microphone data (PDM, pins 42/41) as raw PCM over WebSocket and plays back TTS audio through an I2S DAC.
Key considerations for ESP32:
  • Use binary WebSocket frames for audio data.
  • Buffer at least 100ms of audio before sending to reduce frame overhead.
  • Handle ping/pong keepalive (most ESP32 WebSocket libraries handle this automatically).
  • Implement reconnection with backoff for Wi-Fi instability.