mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-04-29 01:31:41 +00:00

teknium1 d54280ea03 docs: comprehensive documentation for API server, streaming, and Open WebUI

Cherry-picked from PR #828, resolved conflicts with main.

2026-03-11 08:57:35 -07:00

8.4 KiB


sidebar_position	title	description
15	Streaming	Token-by-token live response display across all platforms

Streaming Responses

When enabled, hermes-agent streams LLM responses token-by-token instead of waiting for the full generation. Users see the response typing out live — the same experience as ChatGPT, Claude, or Gemini.

Streaming is disabled by default and can be enabled globally or per-platform.

How It Works

LLM generates tokens → callback fires per token → queue → consumer displays

Telegram/Discord/Slack:
  Token arrives → Accumulate → Every 1.5s, edit the message with new text + ▌ cursor
  Done → Final edit removes cursor

API Server:
  Token arrives → SSE event sent to client immediately
  Done → finish chunk + [DONE]

The agent's internal operation doesn't change — tools still execute normally, memory and skills work as before. Streaming only affects how the final text response is delivered to the user.

Enable Streaming

Option 1: Environment variable

# Enable for all platforms
export HERMES_STREAMING_ENABLED=true
hermes gateway

Option 2: config.yaml

streaming:
  enabled: true     # Master switch

Option 3: Per-platform

streaming:
  enabled: false    # Off by default
  telegram: true    # But on for Telegram
  discord: true     # And Discord
  api_server: true  # And the API server
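
The three options above can be combined into a single lookup. The following is a minimal sketch, not the project's actual resolver: the function name `streaming_enabled` and the `TRUTHY` set are hypothetical, and it assumes that a per-platform key wins when present and that either the config master switch or the environment variable can turn streaming on.

```python
import os
from typing import Any, Mapping, Optional

# Assumption: these strings count as "true" for the environment variable.
TRUTHY = {"1", "true", "yes", "on"}


def streaming_enabled(
    platform: str,
    config: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Hypothetical resolver: per-platform override first, then master switch."""
    env = os.environ if env is None else env
    streaming = config.get("streaming") or {}

    # A per-platform boolean, when present, overrides the master switch.
    override = streaming.get(platform)
    if isinstance(override, bool):
        return override

    # Assumed precedence: config or env var can each enable the master switch.
    env_on = env.get("HERMES_STREAMING_ENABLED", "").strip().lower() in TRUTHY
    return bool(streaming.get("enabled", False)) or env_on


if __name__ == "__main__":
    cfg = {"streaming": {"enabled": False, "telegram": True}}
    print(streaming_enabled("telegram", cfg, env={}))
    print(streaming_enabled("slack", cfg, env={}))
```

With the Option 3 config, Telegram resolves to enabled while an unlisted platform such as Slack inherits the master switch.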

Platform Support

Platform	Streaming Method	Rate Limit	Notes
Telegram	Progressive message editing	~20 edits/min	1.5s edit interval, ▌ cursor
Discord	Progressive message editing	5 edits/5s	1.5s edit interval
Slack	Progressive message editing	~50 calls/min	1.5s edit interval
API Server	SSE (Server-Sent Events)	No limit	Real token-by-token events
WhatsApp	❌ Not supported	—	No message editing API
Home Assistant	❌ Not supported	—	No message editing API
CLI	❌ Not yet implemented	—	KawaiiSpinner provides feedback

Platforms without message editing support automatically fall back to non-streaming (the response appears all at once, as before).

What Users See

Telegram/Discord/Slack

Agent starts working (typing indicator shows)
After ~20 tokens, a message appears with partial text and a ▌ cursor
Every 1.5 seconds, the message is edited with more accumulated text
When the response is complete, the cursor disappears

Tool progress messages still work alongside streaming — tool names/previews appear as before, and the streamed response is shown in a separate message.
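
The accumulate-then-edit behaviour described above can be sketched as a small throttling helper. This is an illustration under stated assumptions, not the gateway's code: the class name `ProgressiveEditor`, the `send_edit` callable, and the injectable `clock` are all hypothetical, and this version only checks the interval when a token arrives (a real consumer would likely also flush on a timer).

```python
import time
from typing import Callable, List, Optional

CURSOR = "▌"


class ProgressiveEditor:
    """Hypothetical helper: buffers tokens and emits throttled message edits."""

    def __init__(
        self,
        send_edit: Callable[[str], None],
        edit_interval: float = 1.5,
        min_tokens: int = 20,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send_edit = send_edit
        self._edit_interval = edit_interval
        self._min_tokens = min_tokens
        self._clock = clock
        self._tokens: List[str] = []
        self._last_edit: Optional[float] = None

    def on_token(self, token: Optional[str]) -> None:
        if token is None:
            # End of stream: final edit shows the full text without the cursor.
            if self._tokens:
                self._send_edit("".join(self._tokens))
            return

        self._tokens.append(token)
        if len(self._tokens) < self._min_tokens:
            return  # Not enough text yet for the first message.

        now = self._clock()
        due = self._last_edit is None or now - self._last_edit >= self._edit_interval
        if due:
            self._send_edit("".join(self._tokens) + CURSOR)
            self._last_edit = now


if __name__ == "__main__":
    editor = ProgressiveEditor(print, edit_interval=0.0, min_tokens=2)
    for tok in ["Here", " is", " the", " answer", None]:
        editor.on_token(tok)
```

Passing the clock in as a parameter keeps the throttling logic testable without real sleeps.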

API Server (frontends like Open WebUI)

When stream: true is set in the request, the API server returns Server-Sent Events:

data: {"choices":[{"delta":{"role":"assistant"}}]}

data: {"choices":[{"delta":{"content":"Here"}}]}

data: {"choices":[{"delta":{"content":" is"}}]}

data: {"choices":[{"delta":{"content":" the"}}]}

data: {"choices":[{"delta":{"content":" answer"}}]}

data: {"choices":[{"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Frontends like Open WebUI display this as live typing.
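
For clients that consume the stream directly, the events above can be reduced to plain text with a few lines of parsing. The sketch below is an illustration only: `iter_sse_text` is a hypothetical helper name, and it assumes each event arrives as a single `data:` line in the shape shown above.

```python
import json
from typing import Iterable, Iterator


def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text content carried by each SSE `data:` line."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # Skip blank separators and non-data fields.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # End-of-stream sentinel.
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


if __name__ == "__main__":
    sample = [
        'data: {"choices":[{"delta":{"role":"assistant"}}]}',
        "",
        'data: {"choices":[{"delta":{"content":"Here"}}]}',
        "",
        'data: {"choices":[{"delta":{"content":" is"}}]}',
        "",
        'data: {"choices":[{"delta":{},"finish_reason":"stop"}]}',
        "",
        "data: [DONE]",
    ]
    print("".join(iter_sse_text(sample)))
```

The role-only and finish chunks carry no `content`, so they are skipped and only the text deltas are yielded.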

How It Works Internally

Architecture

┌─────────────┐    stream_callback(delta)    ┌──────────────────┐
│  LLM API    │ ──────────────────────────► │   queue.Queue()  │
│  (stream)   │    (runs in agent thread)   │   (thread-safe)  │
└─────────────┘                              └────────┬─────────┘
                                                      │
                                       ┌──────────────┼──────────┐
                                       │              │          │
                                 ┌─────▼─────┐ ┌─────▼────┐ ┌──▼──────┐
                                 │  Gateway   │ │ API Svr  │ │  CLI    │
                                 │  edit msg  │ │ SSE evt  │ │ (TODO)  │
                                 └───────────┘ └──────────┘ └─────────┘

AIAgent.__init__ accepts an optional stream_callback function
When set, _interruptible_api_call() routes to _run_streaming_chat_completion() instead of the normal non-streaming path
The streaming method calls the OpenAI API with stream=True, iterates chunks, and calls stream_callback(delta_text) for each text token
Tool call deltas are accumulated silently (no streaming for tool arguments)
When the stream ends, stream_callback(None) signals completion
The method returns a synthetic response object, built from the accumulated deltas, that is compatible with the existing non-streaming code path
If streaming fails for any reason, it falls back to a normal non-streaming API call
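
The steps above can be sketched without a live API connection. This is a simplified illustration, not the body of `_run_streaming_chat_completion()`: `consume_stream` and `fake_chunk` are hypothetical names, the chunks are stand-ins shaped like OpenAI chat completion chunks (`choices[0].delta.content`, `delta.tool_calls`), and the returned object is a plain namespace rather than the project's real response type.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, Iterable, List, Optional


def consume_stream(
    chunks: Iterable[Any],
    stream_callback: Callable[[Optional[str]], None],
) -> SimpleNamespace:
    """Forward text deltas to the callback; accumulate tool calls silently."""
    text_parts: List[str] = []
    tool_calls: Dict[int, Dict[str, str]] = {}
    finish_reason = None

    for chunk in chunks:
        if not chunk.choices:
            continue
        choice = chunk.choices[0]
        delta = choice.delta

        if getattr(delta, "content", None):
            text_parts.append(delta.content)
            stream_callback(delta.content)  # One call per text delta.

        # Tool call fragments are merged by index and never streamed.
        for tc in getattr(delta, "tool_calls", None) or []:
            slot = tool_calls.setdefault(tc.index, {"name": "", "arguments": ""})
            if tc.function.name:
                slot["name"] += tc.function.name
            if tc.function.arguments:
                slot["arguments"] += tc.function.arguments

        if choice.finish_reason:
            finish_reason = choice.finish_reason

    stream_callback(None)  # Signal completion to the consumer.
    return SimpleNamespace(
        content="".join(text_parts),
        tool_calls=[tool_calls[i] for i in sorted(tool_calls)],
        finish_reason=finish_reason,
    )


def fake_chunk(content=None, tool_calls=None, finish_reason=None) -> SimpleNamespace:
    """Build a stand-in chunk for demonstration (not a real API object)."""
    delta = SimpleNamespace(content=content, tool_calls=tool_calls)
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=delta, finish_reason=finish_reason)]
    )


if __name__ == "__main__":
    result = consume_stream(
        [fake_chunk("Hi"), fake_chunk(" there", finish_reason="stop")], print
    )
    print(result.content, result.finish_reason)
```

Returning an object with the same fields as a complete response is what lets the rest of the agent loop stay unchanged.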

Thread Safety

The agent runs in a background thread (via _interruptible_api_call). The consumer (gateway async task, API server SSE writer) runs in the main event loop. A queue.Queue bridges them — it's thread-safe by design.
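
A minimal sketch of that bridge is shown below. It is an illustration only: `producer`, `consume`, and `main` are hypothetical names, the producer stands in for the agent thread, and the consumer uses `asyncio.to_thread` (Python 3.9+) so the blocking `queue.Queue.get()` does not stall the event loop. The real gateway may poll the queue differently.

```python
import asyncio
import queue
import threading
from typing import List, Optional


def producer(q: "queue.Queue[Optional[str]]") -> None:
    """Stand-in for the agent thread: push deltas, then None as the end marker."""
    for token in ["Hello", ",", " world"]:
        q.put(token)
    q.put(None)


async def consume(q: "queue.Queue[Optional[str]]") -> List[str]:
    """Async consumer running in the event loop."""
    received: List[str] = []
    while True:
        # Run the blocking get() in a worker thread to keep the loop responsive.
        token = await asyncio.to_thread(q.get)
        if token is None:
            return received
        received.append(token)


async def main() -> str:
    q: "queue.Queue[Optional[str]]" = queue.Queue()
    thread = threading.Thread(target=producer, args=(q,), daemon=True)
    thread.start()
    tokens = await consume(q)
    thread.join()
    return "".join(tokens)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because `queue.Queue` preserves insertion order with a single producer, the consumer sees the deltas in the order they were generated.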

Graceful Fallback

If the LLM provider doesn't support stream=True or the streaming connection fails, the agent automatically falls back to a non-streaming API call. The user gets the response normally, just without the live typing effect. No error is shown.
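
The fallback can be pictured as a try/except around the streaming call. The sketch below is an assumption about the shape of that logic, not the agent's code: `call_with_fallback` and its two callable parameters are hypothetical, and it does not address text that was already delivered before a mid-stream failure.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)

StreamCallback = Callable[[Optional[str]], None]


def call_with_fallback(
    streaming_call: Callable[[StreamCallback], Any],
    non_streaming_call: Callable[[], Any],
    stream_callback: Optional[StreamCallback] = None,
) -> Any:
    """Try the streaming path when a callback is set; otherwise, or on
    failure, use the normal non-streaming call."""
    if stream_callback is not None:
        try:
            return streaming_call(stream_callback)
        except Exception as exc:  # Any streaming error triggers the fallback.
            logger.warning("Streaming failed, falling back: %s", exc)
    return non_streaming_call()


if __name__ == "__main__":
    def broken_stream(cb: StreamCallback) -> str:
        raise RuntimeError("provider does not support stream=True")

    print(call_with_fallback(broken_stream, lambda: "full response", print))
```

The failure is logged rather than surfaced, which matches the behaviour described above where the user sees no error.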

Configuration Reference

streaming:
  enabled: false          # Master switch (default: off)

  # Per-platform overrides (optional):
  telegram: true          # Enable for Telegram
  discord: true           # Enable for Discord
  slack: true             # Enable for Slack
  api_server: true        # Enable for API server

  # Tuning (optional):
  edit_interval: 1.5      # Seconds between message edits (default: 1.5)
  min_tokens: 20          # Tokens before first display (default: 20)

Variable	Default	Description
`HERMES_STREAMING_ENABLED`	`false`	Master switch via env var
`streaming.enabled`	`false`	Master switch via config
`streaming.<platform>`	(unset)	Per-platform override
`streaming.edit_interval`	`1.5`	Seconds between message edits (Telegram/Discord/Slack)
`streaming.min_tokens`	`20`	Minimum tokens before first message

Interaction with Other Features

Tool Execution

When the agent calls tools (terminal, file operations, web search, etc.), no text tokens are generated — tool arguments are accumulated silently. Tool progress messages continue to work as before. After tools finish, the next LLM call may produce the final text response, which streams normally.

Context Compression

Compression happens between API calls, not during streaming, so the two features do not interact.

Interrupts

If the user sends a new message while streaming, the agent is interrupted. The HTTP connection is closed (stopping token generation), accumulated text is shown as-is, and the new message is processed.

Prompt Caching

Streaming doesn't affect prompt caching — the request is identical, just with stream=True added.

Responses API (Codex)

The Codex/Responses API streaming path also supports the stream_callback. Token deltas from response.output_text.delta events are emitted via the callback.
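
As an illustration of that path, the sketch below filters a sequence of events by type and forwards only the text deltas. The helper name `forward_response_events` is hypothetical, the events are stand-in objects with `type` and `delta` attributes rather than real SDK objects, and the final `stream_callback(None)` is assumed to mirror the chat completions path.

```python
from types import SimpleNamespace
from typing import Any, Callable, Iterable, List, Optional


def forward_response_events(
    events: Iterable[Any],
    stream_callback: Callable[[Optional[str]], None],
) -> str:
    """Emit text deltas via the callback and return the full text."""
    parts: List[str] = []
    for event in events:
        if getattr(event, "type", None) == "response.output_text.delta":
            parts.append(event.delta)
            stream_callback(event.delta)
    stream_callback(None)  # Assumed completion signal, as in the chat path.
    return "".join(parts)


if __name__ == "__main__":
    demo_events = [
        SimpleNamespace(type="response.created"),
        SimpleNamespace(type="response.output_text.delta", delta="Hi"),
        SimpleNamespace(type="response.output_text.delta", delta="!"),
        SimpleNamespace(type="response.completed"),
    ]
    print(forward_response_events(demo_events, lambda d: None))
```

Events of other types pass through untouched, so lifecycle events never reach the consumer as text.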

Troubleshooting

Streaming isn't working

Check the config: streaming.enabled: true in config.yaml or HERMES_STREAMING_ENABLED=true
Check per-platform overrides: a setting such as streaming.telegram: true takes precedence over the master switch for that platform
Restart the gateway after changing config
Check logs for "Streaming failed, falling back" — indicates the provider may not support streaming

Response appears twice

If you see the response both in a progressively-edited message AND as a separate final message, this is a bug. The streaming system should suppress the normal send when tokens were delivered via streaming. Please file an issue.

Messages update too slowly

The default edit interval is 1.5 seconds (to respect platform rate limits). You can lower it in config:

streaming:
  edit_interval: 1.0   # Faster updates (may hit rate limits)

Going below 1.0s risks Telegram rate limiting (429 errors).

No streaming on WhatsApp/Home Assistant

These platforms don't support message editing, so streaming automatically falls back to non-streaming. This is expected behavior.
