docs: comprehensive documentation for API server, streaming, and Open WebUI

Cherry-picked from PR #828, resolved conflicts with main.
2026-04-29 01:31:41 +00:00 · 2026-03-11 08:57:35 -07:00
commit d54280ea03
parent 95d221c31c
3 changed files with 438 additions and 8 deletions
--- a/website/docs/user-guide/features/api-server.md
+++ b/website/docs/user-guide/features/api-server.md
@@ -0,0 +1,217 @@
+---
+sidebar_position: 14
+title: "API Server"
+description: "Expose hermes-agent as an OpenAI-compatible API for any frontend"
+---
+
+# API Server
+
+The API server exposes hermes-agent as an OpenAI-compatible HTTP endpoint. Any frontend that speaks the OpenAI format (Open WebUI, LobeChat, LibreChat, NextChat, ChatBox, and many others) can connect to hermes-agent and use it as a backend.
+
+Your agent handles requests with its full toolset (terminal, file operations, web search, memory, skills) and returns the final response. Tool calls execute invisibly server-side.
+
+## Quick Start
+
+### 1. Enable the API server
+
+Add to `~/.hermes/.env`:
+
+```bash
+API_SERVER_ENABLED=true
+```
+
+### 2. Start the gateway
+
+```bash
+hermes gateway
+```
+
+You'll see:
+
+```
+[API Server] API server listening on http://127.0.0.1:8642
+```
+
+### 3. Connect a frontend
+
+Point any OpenAI-compatible client at `http://localhost:8642/v1`:
+
+```bash
+# Test with curl
+curl http://localhost:8642/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "hermes-agent", "messages": [{"role": "user", "content": "Hello!"}]}'
+```
+
+Or connect Open WebUI, LobeChat, or any other frontend — see the [Open WebUI integration guide](/docs/user-guide/messaging/open-webui) for step-by-step instructions.
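
For scripted access, the same request can be built in Python with only the standard library. This is a minimal sketch rather than an official client: the helper names `build_chat_request` and `send` are illustrative, and the URL assumes the default host and port shown in the Quick Start.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8642/v1"  # default host/port from the Quick Start


def build_chat_request(prompt, api_key=None, base_url=BASE_URL):
    """Build (but do not send) a Chat Completions request."""
    payload = {
        "model": "hermes-agent",
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {"Content-Type": "application/json"}
    if api_key:  # only needed when API_SERVER_KEY is set on the server
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the gateway running, `print(send(build_chat_request("Hello!")))` should print the agent's reply.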
+
+## Endpoints
+
+### POST /v1/chat/completions
+
+Standard OpenAI Chat Completions format. Stateless — the full conversation is included in each request via the `messages` array.
+
+**Request:**
+```json
+{
+  "model": "hermes-agent",
+  "messages": [
+    {"role": "system", "content": "You are a Python expert."},
+    {"role": "user", "content": "Write a fibonacci function"}
+  ],
+  "stream": false
+}
+```
+
+**Response:**
+```json
+{
+  "id": "chatcmpl-abc123",
+  "object": "chat.completion",
+  "created": 1710000000,
+  "model": "hermes-agent",
+  "choices": [{
+    "index": 0,
+    "message": {"role": "assistant", "content": "Here's a fibonacci function..."},
+    "finish_reason": "stop"
+  }],
+  "usage": {"prompt_tokens": 50, "completion_tokens": 200, "total_tokens": 250}
+}
+```
+
+**Streaming** (`"stream": true`): Returns Server-Sent Events (SSE) with token-by-token response chunks. When streaming is enabled in config, tokens are emitted live as the LLM generates them. When disabled, the full response is sent as a single SSE chunk.
+
+### POST /v1/responses
+
+OpenAI Responses API format. Supports server-side conversation state via `previous_response_id` — the server stores full conversation history (including tool calls and results) so multi-turn context is preserved without the client managing it.
+
+**Request:**
+```json
+{
+  "model": "hermes-agent",
+  "input": "What files are in my project?",
+  "instructions": "You are a helpful coding assistant.",
+  "store": true
+}
+```
+
+**Response:**
+```json
+{
+  "id": "resp_abc123",
+  "object": "response",
+  "status": "completed",
+  "model": "hermes-agent",
+  "output": [
+    {"type": "function_call", "name": "terminal", "arguments": "{\"command\": \"ls\"}", "call_id": "call_1"},
+    {"type": "function_call_output", "call_id": "call_1", "output": "README.md src/ tests/"},
+    {"type": "message", "role": "assistant", "content": [{"type": "output_text", "text": "Your project has..."}]}
+  ],
+  "usage": {"input_tokens": 50, "output_tokens": 200, "total_tokens": 250}
+}
+```
+
+#### Multi-turn with previous_response_id
+
+Chain responses to maintain full context (including tool calls) across turns:
+
+```json
+{
+  "input": "Now show me the README",
+  "previous_response_id": "resp_abc123"
+}
+```
+
+The server reconstructs the full conversation from the stored response chain — all previous tool calls and results are preserved.
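
One way to picture this reconstruction is as a walk along a linked list of stored responses. The following is a rough sketch under stated assumptions, not the server's actual code: the layout of `store` and the name `rebuild_history` are hypothetical.

```python
def rebuild_history(store, response_id):
    """Return all conversation items, oldest first, for a response chain.

    `store` maps a response id to a record of the (hypothetical) form
    {"previous_response_id": str or None, "items": list}.
    """
    chain = []
    seen = set()
    current = response_id
    while current is not None:
        if current in seen:  # guard against accidental cycles
            raise ValueError(f"cycle in response chain at {current}")
        seen.add(current)
        record = store[current]  # KeyError if the response is no longer stored
        chain.append(record["items"])
        current = record["previous_response_id"]
    # The walk goes newest to oldest, so reverse before flattening.
    return [item for items in reversed(chain) for item in items]
```

A lookup failure in the middle of the chain would mean an earlier turn is gone, which matters because stored responses can be deleted or evicted.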
+
+#### Named conversations
+
+Use the `conversation` parameter instead of tracking response IDs:
+
+```json
+{"input": "Hello", "conversation": "my-project"}
+{"input": "What's in src/?", "conversation": "my-project"}
+{"input": "Run the tests", "conversation": "my-project"}
+```
+
+The server automatically chains to the latest response in that conversation. This works like the `/title` command for gateway sessions.
+
+### GET /v1/responses/{id}
+
+Retrieve a previously stored response by ID.
+
+### DELETE /v1/responses/{id}
+
+Delete a stored response.
+
+### GET /v1/models
+
+Lists `hermes-agent` as an available model. Required by most frontends for model discovery.
+
+### GET /health
+
+Health check. Returns `{"status": "ok"}`.
+
+## System Prompt Handling
+
+When a frontend sends a `system` message (Chat Completions) or `instructions` field (Responses API), hermes-agent **layers it on top** of its core system prompt. Your agent keeps all its tools, memory, and skills — the frontend's system prompt adds extra instructions.
+
+This means you can customize behavior per-frontend without losing capabilities:
+- Open WebUI system prompt: "You are a Python expert. Always include type hints."
+- The agent still has terminal, file tools, web search, memory, etc.
+
+## Authentication
+
+Bearer token auth via the `Authorization` header:
+
+```
+Authorization: Bearer your-secret-key
+```
+
+Configure the key via `API_SERVER_KEY` env var. If no key is set, all requests are allowed (for local-only use).
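
The rule above (no key means open access, otherwise the bearer token must match) can be sketched as a small check. This is an illustration only: the function name `is_authorized` and the plain-dict `headers` argument are assumptions, and the constant-time comparison is a general precaution rather than something the source states.

```python
import hmac


def is_authorized(headers, expected_key):
    """Return True if the request may proceed.

    `headers` is a dict of request headers; `expected_key` is the value of
    API_SERVER_KEY, or None/"" when no key is configured.
    """
    if not expected_key:
        return True  # no key set: all requests are allowed
    auth = headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking key contents via timing.
    return hmac.compare_digest(token.encode(), expected_key.encode())
```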
+
+## Configuration
+
+### Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `API_SERVER_ENABLED` | `false` | Enable the API server |
+| `API_SERVER_PORT` | `8642` | HTTP server port |
+| `API_SERVER_HOST` | `127.0.0.1` | Bind address (localhost only by default) |
+| `API_SERVER_KEY` | _(none)_ | Bearer token for auth |
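
As an illustration of how these variables and their defaults fit together, the sketch below reads them from an environment mapping. The function name `load_api_server_settings` is hypothetical, and treating only the string `true` (case-insensitive) as enabling is an assumption.

```python
import os


def load_api_server_settings(env=None):
    """Read the API server settings, applying the documented defaults."""
    env = os.environ if env is None else env
    return {
        # Assumption: "true" (case-insensitive) is the enabling value.
        "enabled": env.get("API_SERVER_ENABLED", "false").strip().lower() == "true",
        "port": int(env.get("API_SERVER_PORT", "8642")),
        "host": env.get("API_SERVER_HOST", "127.0.0.1"),
        "key": env.get("API_SERVER_KEY") or None,  # empty string counts as unset
    }
```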
+
+### config.yaml
+
+```yaml
+# Not yet supported — use environment variables.
+# config.yaml support coming in a future release.
+```
+
+## CORS
+
+The API server includes CORS headers on all responses (`Access-Control-Allow-Origin: *`), so browser-based frontends can connect directly.
+
+## Compatible Frontends
+
+Any frontend that supports the OpenAI API format works. Tested/documented integrations:
+
+| Frontend | Stars | Connection |
+|----------|-------|------------|
+| [Open WebUI](/docs/user-guide/messaging/open-webui) | 126k | Full guide available |
+| LobeChat | 73k | Custom provider endpoint |
+| LibreChat | 34k | Custom endpoint in librechat.yaml |
+| AnythingLLM | 56k | Generic OpenAI provider |
+| NextChat | 87k | BASE_URL env var |
+| ChatBox | 39k | API Host setting |
+| Jan | 26k | Remote model config |
+| HF Chat-UI | 8k | OPENAI_BASE_URL |
+| big-AGI | 7k | Custom endpoint |
+| OpenAI Python SDK | — | `OpenAI(base_url="http://localhost:8642/v1")` |
+| curl | — | Direct HTTP requests |
+
+## Limitations
+
+- **Response storage is in-memory** — stored responses (for `previous_response_id`) are lost on gateway restart. Max 100 stored responses (LRU eviction).
+- **No file upload** — vision/document analysis via uploaded files is not yet supported through the API.
+- **Model field is cosmetic** — the `model` field in requests is accepted but the actual LLM model used is configured server-side in config.yaml.
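
To make the first limitation concrete, here is a minimal sketch of an in-memory store with LRU eviction. It is not the server's implementation: the class name `ResponseStore` is hypothetical, and the choice that a read refreshes an entry's recency is an assumption.

```python
from collections import OrderedDict


class ResponseStore:
    """In-memory store that evicts the least recently used response."""

    def __init__(self, max_size=100):
        self.max_size = max_size
        self._items = OrderedDict()

    def put(self, response_id, response):
        self._items[response_id] = response
        self._items.move_to_end(response_id)  # mark as most recently used
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)  # drop the least recently used

    def get(self, response_id):
        response = self._items.get(response_id)
        if response is not None:
            self._items.move_to_end(response_id)  # assumption: reads refresh recency
        return response

    def delete(self, response_id):
        """Remove a response; return True if it was present."""
        return self._items.pop(response_id, None) is not None
```

Because everything lives in a process-local dictionary, nothing survives a restart, which matches the behavior described above.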
--- a/website/docs/user-guide/features/streaming.md
+++ b/website/docs/user-guide/features/streaming.md
@@ -0,0 +1,210 @@
+---
+sidebar_position: 15
+title: "Streaming"
+description: "Token-by-token live response display across all platforms"
+---
+
+# Streaming Responses
+
+When enabled, hermes-agent streams LLM responses token-by-token instead of waiting for the full generation. Users see the response typing out live — the same experience as ChatGPT, Claude, or Gemini.
+
+Streaming is **disabled by default** and can be enabled globally or per-platform.
+
+## How It Works
+
+```
+LLM generates tokens → callback fires per token → queue → consumer displays
+
+Telegram/Discord/Slack:
+  Token arrives → Accumulate → Every 1.5s, edit the message with new text + ▌ cursor
+  Done → Final edit removes cursor
+
+API Server:
+  Token arrives → SSE event sent to client immediately
+  Done → finish chunk + [DONE]
+```
+
+The agent's internal operation doesn't change — tools still execute normally, memory and skills work as before. Streaming only affects how the **final text response** is delivered to the user.
+
+## Enable Streaming
+
+### Option 1: Environment variable
+
+```bash
+# Enable for all platforms
+export HERMES_STREAMING_ENABLED=true
+hermes gateway
+```
+
+### Option 2: config.yaml
+
+```yaml
+streaming:
+  enabled: true     # Master switch
+```
+
+### Option 3: Per-platform
+
+```yaml
+streaming:
+  enabled: false    # Off by default
+  telegram: true    # But on for Telegram
+  discord: true     # And Discord
+  api_server: true  # And the API server
+```
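
The interplay between the master switch and per-platform keys can be sketched as a small lookup. This is a hedged illustration: `streaming_enabled` is a hypothetical name, `config` stands for the parsed `streaming:` mapping, and the rule that an explicit per-platform `false` also wins over the master switch is an assumption.

```python
def streaming_enabled(config, platform):
    """Resolve whether streaming is on for `platform`.

    Assumption: an explicit per-platform value (true or false) wins;
    otherwise the master `enabled` switch applies.
    """
    override = config.get(platform)
    if override is not None:
        return bool(override)
    return bool(config.get("enabled", False))
```

With the Option 3 config above, this returns `True` for `telegram` and `False` for `slack`.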
+
+## Platform Support
+
+| Platform | Streaming Method | Rate Limit | Notes |
+|----------|-----------------|------------|-------|
+| **Telegram** | Progressive message editing | ~20 edits/min | 1.5s edit interval, ▌ cursor |
+| **Discord** | Progressive message editing | 5 edits/5s | 1.5s edit interval |
+| **Slack** | Progressive message editing | ~50 calls/min | 1.5s edit interval |
+| **API Server** | SSE (Server-Sent Events) | No limit | Real token-by-token events |
+| **WhatsApp** | ❌ Not supported | — | No message editing API |
+| **Home Assistant** | ❌ Not supported | — | No message editing API |
+| **CLI** | ❌ Not yet implemented | — | KawaiiSpinner provides feedback |
+
+Platforms without message editing support automatically fall back to non-streaming (the response appears all at once, as before).
+
+## What Users See
+
+### Telegram/Discord/Slack
+
+1. Agent starts working (typing indicator shows)
+2. After ~20 tokens, a message appears with partial text and a ▌ cursor
+3. Every 1.5 seconds, the message is edited with more accumulated text
+4. When the response is complete, the cursor disappears
+
+Tool progress messages still work alongside streaming — tool names/previews appear as before, and the streamed response is shown in a separate message.
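
The timing behavior in the steps above can be modeled as a small throttler that decides when an edit is due. This is a sketch, not the gateway's code: the class `EditThrottler`, its method names, and the exact cursor placement are assumptions. Timestamps are passed in explicitly so the logic stays easy to test.

```python
CURSOR = "▌"


class EditThrottler:
    """Decide when a progressively edited message should be updated."""

    def __init__(self, edit_interval=1.5, min_tokens=20):
        self.edit_interval = edit_interval
        self.min_tokens = min_tokens
        self.tokens = []
        self.last_edit = None  # time of the most recent edit, if any

    def feed(self, token, now):
        """Add a token; return text to display now, or None to wait."""
        self.tokens.append(token)
        if len(self.tokens) < self.min_tokens:
            return None  # not enough text for the first message yet
        if self.last_edit is not None and now - self.last_edit < self.edit_interval:
            return None  # too soon since the previous edit
        self.last_edit = now
        return "".join(self.tokens) + CURSOR

    def finish(self):
        """Return the final text, without the cursor."""
        return "".join(self.tokens)
```

A real consumer would also need to flush on a timer, since this sketch only emits an edit when a new token arrives.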
+
+### API Server (frontends like Open WebUI)
+
+When `stream: true` is set in the request, the API server returns Server-Sent Events:
+
+```
+data: {"choices":[{"delta":{"role":"assistant"}}]}
+
+data: {"choices":[{"delta":{"content":"Here"}}]}
+
+data: {"choices":[{"delta":{"content":" is"}}]}
+
+data: {"choices":[{"delta":{"content":" the"}}]}
+
+data: {"choices":[{"delta":{"content":" answer"}}]}
+
+data: {"choices":[{"delta":{},"finish_reason":"stop"}]}
+
+data: [DONE]
+```
+
+Frontends like Open WebUI display this as live typing.
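
A client that wants the assembled text rather than live typing can fold these events back together. The sketch below parses lines shaped like the sample above; the function name `collect_sse_text` is illustrative, and it handles only `data:` lines, not the full SSE grammar.

```python
import json


def collect_sse_text(lines):
    """Concatenate the `delta.content` pieces from an SSE line stream."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts)
```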
+
+## How It Works Internally
+
+### Architecture
+
+```
+┌─────────────┐    stream_callback(delta)    ┌──────────────────┐
+│  LLM API    │ ──────────────────────────► │   queue.Queue()  │
+│  (stream)   │    (runs in agent thread)   │   (thread-safe)  │
+└─────────────┘                              └────────┬─────────┘
+                                                      │
+                                       ┌──────────────┼──────────┐
+                                       │              │          │
+                                 ┌─────▼─────┐ ┌─────▼────┐ ┌──▼──────┐
+                                 │  Gateway   │ │ API Svr  │ │  CLI    │
+                                 │  edit msg  │ │ SSE evt  │ │ (TODO)  │
+                                 └───────────┘ └──────────┘ └─────────┘
+```
+
+1. `AIAgent.__init__` accepts an optional `stream_callback` function
+2. When set, `_interruptible_api_call()` routes to `_run_streaming_chat_completion()` instead of the normal non-streaming path
+3. The streaming method calls the OpenAI API with `stream=True`, iterates chunks, and calls `stream_callback(delta_text)` for each text token
+4. Tool call deltas are accumulated silently (no streaming for tool arguments)
+5. When the stream ends, `stream_callback(None)` signals completion
+6. The method returns a synthetic response object, assembled from the accumulated chunks, that is compatible with the existing code path
+7. If streaming fails for any reason, it falls back to a normal non-streaming API call
+
+### Thread Safety
+
+The agent runs in a background thread (via `_interruptible_api_call`). The consumer (gateway async task, API server SSE writer) runs in the main event loop. A `queue.Queue` bridges them — it's thread-safe by design.
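
The producer/consumer handoff can be demonstrated in isolation. The sketch below is a simplified stand-in, not the agent's code: `run_producer`, `consume`, and `demo` are hypothetical names, the token list is made up, and the consumer here blocks a regular thread rather than running inside an event loop.

```python
import queue
import threading


def run_producer(tokens, out_queue):
    """Stand-in for the agent thread: push each token, then None."""
    for token in tokens:
        out_queue.put(token)  # plays the role of stream_callback(delta)
    out_queue.put(None)  # completion signal, as in stream_callback(None)


def consume(out_queue, timeout=5.0):
    """Drain the queue until the None sentinel and return the full text."""
    parts = []
    while True:
        item = out_queue.get(timeout=timeout)  # raises queue.Empty on a stall
        if item is None:
            break
        parts.append(item)
    return "".join(parts)


def demo():
    q = queue.Queue()
    worker = threading.Thread(
        target=run_producer, args=(["Here", " is", " the", " answer"], q)
    )
    worker.start()
    text = consume(q)
    worker.join()
    return text
```

An asyncio consumer would need to avoid blocking the loop, for example by polling with `get_nowait()` or by calling `get` through `loop.run_in_executor`. Which approach the gateway uses is not shown here.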
+
+### Graceful Fallback
+
+If the LLM provider doesn't support `stream=True` or the streaming connection fails, the agent automatically falls back to a non-streaming API call. The user gets the response normally, just without the live typing effect. No error is shown.
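
The fallback amounts to a try/except around the streaming path. The following sketch illustrates the pattern only: `call_with_fallback` and its callable arguments are hypothetical, and it does not address what happens when some tokens were already delivered before the failure.

```python
def call_with_fallback(streaming_call, plain_call, on_fallback=None):
    """Try the streaming path; on any failure use the non-streaming path."""
    try:
        return streaming_call()
    except Exception as exc:  # broad on purpose: any streaming failure falls back
        if on_fallback is not None:
            on_fallback(exc)  # e.g. write a "falling back" message to the log
        return plain_call()
```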
+
+## Configuration Reference
+
+```yaml
+streaming:
+  enabled: false          # Master switch (default: off)
+
+  # Per-platform overrides (optional):
+  telegram: true          # Enable for Telegram
+  discord: true           # Enable for Discord
+  slack: true             # Enable for Slack
+  api_server: true        # Enable for API server
+
+  # Tuning (optional):
+  edit_interval: 1.5      # Seconds between message edits (default: 1.5)
+  min_tokens: 20          # Tokens before first display (default: 20)
+```
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `HERMES_STREAMING_ENABLED` | `false` | Master switch via env var |
+| `streaming.enabled` | `false` | Master switch via config |
+| `streaming.<platform>` | _(unset)_ | Per-platform override |
+| `streaming.edit_interval` | `1.5` | Seconds between message edits (Telegram/Discord/Slack) |
+| `streaming.min_tokens` | `20` | Minimum tokens before first message |
+
+## Interaction with Other Features
+
+### Tool Execution
+
+When the agent calls tools (terminal, file operations, web search, etc.), no text tokens are generated — tool arguments are accumulated silently. Tool progress messages continue to work as before. After tools finish, the next LLM call may produce the final text response, which streams normally.
+
+### Context Compression
+
+Compression happens between API calls, not during streaming. No interaction.
+
+### Interrupts
+
+If the user sends a new message while streaming, the agent is interrupted. The HTTP connection is closed (stopping token generation), accumulated text is shown as-is, and the new message is processed.
+
+### Prompt Caching
+
+Streaming doesn't affect prompt caching — the request is identical, just with `stream=True` added.
+
+### Responses API (Codex)
+
+The Codex/Responses API streaming path also supports the `stream_callback`. Token deltas from `response.output_text.delta` events are emitted via the callback.
+
+## Troubleshooting
+
+### Streaming isn't working
+
+1. Check the config: `streaming.enabled: true` in config.yaml or `HERMES_STREAMING_ENABLED=true`
+2. Check per-platform overrides: a setting such as `streaming.telegram: true` takes precedence over the master switch
+3. Restart the gateway after changing config
+4. Check logs for "Streaming failed, falling back" — indicates the provider may not support streaming
+
+### Response appears twice
+
+If you see the response both in a progressively-edited message AND as a separate final message, this is a bug. The streaming system should suppress the normal send when tokens were delivered via streaming. Please file an issue.
+
+### Messages update too slowly
+
+The default edit interval is 1.5 seconds (to respect platform rate limits). You can lower it in config:
+
+```yaml
+streaming:
+  edit_interval: 1.0   # Faster updates (may hit rate limits)
+```
+
+Going below 1.0s risks Telegram rate limiting (429 errors).
+
+### No streaming on WhatsApp/Home Assistant
+
+These platforms don't support message editing, so streaming automatically falls back to non-streaming. This is expected behavior.