mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
docs: comprehensive documentation for API server, streaming, and Open WebUI
- website/docs/user-guide/features/api-server.md — full API server docs: endpoints (chat completions, responses, models), system prompt handling, auth, config, compatible frontends matrix, limitations
- website/docs/user-guide/features/streaming.md — streaming docs: per-platform support matrix, architecture, config reference, troubleshooting, interaction with tools/compression/interrupts
- website/docs/user-guide/messaging/open-webui.md — already existed, step-by-step Open WebUI integration guide
- website/docs/user-guide/messaging/index.md — updated to include API server in architecture diagram and description
This commit is contained in:
parent
5a426e084a
commit
550402116d
3 changed files with 433 additions and 6 deletions
217
website/docs/user-guide/features/api-server.md
Normal file
@@ -0,0 +1,217 @@
---
sidebar_position: 14
title: "API Server"
description: "Expose hermes-agent as an OpenAI-compatible API for any frontend"
---

# API Server

The API server exposes hermes-agent as an OpenAI-compatible HTTP endpoint. Any frontend that speaks the OpenAI format — Open WebUI, LobeChat, LibreChat, NextChat, ChatBox, and hundreds more — can connect to hermes-agent and use it as a backend.

Your agent handles requests with its full toolset (terminal, file operations, web search, memory, skills) and returns the final response. Tool calls execute invisibly server-side.

## Quick Start

### 1. Enable the API server

Add to `~/.hermes/.env`:

```bash
API_SERVER_ENABLED=true
```

### 2. Start the gateway

```bash
hermes gateway
```

You'll see:

```
[API Server] API server listening on http://127.0.0.1:8642
```

### 3. Connect a frontend

Point any OpenAI-compatible client at `http://localhost:8642/v1`:

```bash
# Test with curl
curl http://localhost:8642/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "hermes-agent", "messages": [{"role": "user", "content": "Hello!"}]}'
```

Or connect Open WebUI, LobeChat, or any other frontend — see the [Open WebUI integration guide](/docs/user-guide/messaging/open-webui) for step-by-step instructions.
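The same test request can also be built with Python's standard library. A minimal sketch, assuming the default port and no API key; the `build_chat_request` helper and `API_BASE` constant are illustrative, not part of hermes-agent:

```python
import json
import urllib.request

API_BASE = "http://localhost:8642/v1"  # default API server address

def build_chat_request(messages, model="hermes-agent", api_key=None):
    """Build an OpenAI-style chat completions request for the API server."""
    headers = {"Content-Type": "application/json"}
    if api_key:  # only needed when API_SERVER_KEY is set
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        f"{API_BASE}/chat/completions", data=body, headers=headers, method="POST"
    )

req = build_chat_request([{"role": "user", "content": "Hello!"}])
# urllib.request.urlopen(req) sends it once `hermes gateway` is running
```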
## Endpoints

### POST /v1/chat/completions

Standard OpenAI Chat Completions format. Stateless — the full conversation is included in each request via the `messages` array.

**Request:**

```json
{
  "model": "hermes-agent",
  "messages": [
    {"role": "system", "content": "You are a Python expert."},
    {"role": "user", "content": "Write a fibonacci function"}
  ],
  "stream": false
}
```

**Response:**

```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1710000000,
  "model": "hermes-agent",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "Here's a fibonacci function..."},
    "finish_reason": "stop"
  }],
  "usage": {"prompt_tokens": 50, "completion_tokens": 200, "total_tokens": 250}
}
```

**Streaming** (`"stream": true`): Returns Server-Sent Events (SSE) with token-by-token response chunks. When streaming is enabled in config, tokens are emitted live as the LLM generates them. When disabled, the full response is sent as a single SSE chunk.
### POST /v1/responses

OpenAI Responses API format. Supports server-side conversation state via `previous_response_id` — the server stores full conversation history (including tool calls and results) so multi-turn context is preserved without the client managing it.

**Request:**

```json
{
  "model": "hermes-agent",
  "input": "What files are in my project?",
  "instructions": "You are a helpful coding assistant.",
  "store": true
}
```

**Response:**

```json
{
  "id": "resp_abc123",
  "object": "response",
  "status": "completed",
  "model": "hermes-agent",
  "output": [
    {"type": "function_call", "name": "terminal", "arguments": "{\"command\": \"ls\"}", "call_id": "call_1"},
    {"type": "function_call_output", "call_id": "call_1", "output": "README.md src/ tests/"},
    {"type": "message", "role": "assistant", "content": [{"type": "output_text", "text": "Your project has..."}]}
  ],
  "usage": {"input_tokens": 50, "output_tokens": 200, "total_tokens": 250}
}
```

#### Multi-turn with previous_response_id

Chain responses to maintain full context (including tool calls) across turns:

```json
{
  "input": "Now show me the README",
  "previous_response_id": "resp_abc123"
}
```

The server reconstructs the full conversation from the stored response chain — all previous tool calls and results are preserved.
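Conceptually, the reconstruction walks the chain of stored responses from newest to oldest and replays their items in order. A sketch under assumed storage layout (the `stored` dict and item shapes are illustrative, not hermes-agent's actual schema):

```python
# response_id -> {"previous": parent id or None, "items": stored input/output items}
stored = {}

def reconstruct(response_id):
    """Walk the previous_response_id chain and rebuild the full item list."""
    chain = []
    while response_id is not None:
        resp = stored[response_id]
        chain.append(resp["items"])
        response_id = resp["previous"]
    items = []
    for turn in reversed(chain):  # oldest turn first
        items.extend(turn)
    return items

stored["resp_1"] = {"previous": None, "items": [
    {"type": "message", "role": "user", "content": "What files are in my project?"},
    {"type": "function_call", "name": "terminal", "call_id": "call_1",
     "arguments": "{\"command\": \"ls\"}"},
    {"type": "function_call_output", "call_id": "call_1",
     "output": "README.md src/ tests/"},
]}
stored["resp_2"] = {"previous": "resp_1", "items": [
    {"type": "message", "role": "user", "content": "Now show me the README"},
]}
```

Reconstructing `resp_2` yields all four items, tool call and result included, so the model sees the whole history.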
#### Named conversations

Use the `conversation` parameter instead of tracking response IDs:

```json
{"input": "Hello", "conversation": "my-project"}
{"input": "What's in src/?", "conversation": "my-project"}
{"input": "Run the tests", "conversation": "my-project"}
```

The server automatically chains each request to the latest response in that conversation, similar to how the `/title` command names gateway sessions.
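The auto-chaining amounts to a name-to-latest-response pointer that advances on every turn. A sketch of that bookkeeping (the `handle` function and dict names are hypothetical, for illustration only):

```python
import itertools

responses = {}   # response_id -> previous_response_id (None starts a new thread)
latest = {}      # conversation name -> most recent response_id
_ids = itertools.count(1)

def handle(conversation=None, previous_response_id=None):
    """Resolve the parent response, store the new one, advance the pointer."""
    if conversation is not None:
        previous_response_id = latest.get(conversation)  # auto-chain
    response_id = f"resp_{next(_ids)}"
    responses[response_id] = previous_response_id
    if conversation is not None:
        latest[conversation] = response_id
    return response_id

a = handle(conversation="my-project")  # first turn: no parent
b = handle(conversation="my-project")  # chains to the first turn automatically
```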
### GET /v1/responses/{id}

Retrieve a previously stored response by ID.

### DELETE /v1/responses/{id}

Delete a stored response.

### GET /v1/models

Lists `hermes-agent` as an available model. Required by most frontends for model discovery.

### GET /health

Health check. Returns `{"status": "ok"}`.

## System Prompt Handling

When a frontend sends a `system` message (Chat Completions) or `instructions` field (Responses API), hermes-agent **layers it on top** of its core system prompt. Your agent keeps all its tools, memory, and skills — the frontend's system prompt adds extra instructions.

This means you can customize behavior per-frontend without losing capabilities:

- Open WebUI system prompt: "You are a Python expert. Always include type hints."
- The agent still has terminal, file tools, web search, memory, etc.
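The layering behaves like the following sketch. Both the core prompt text and the separator heading are stand-ins; the real wording is internal to hermes-agent:

```python
# Stand-in for the real core prompt (illustrative only)
CORE_PROMPT = (
    "You are Hermes, an agent with terminal, file, web search, "
    "memory, and skills tools."
)

def layered_prompt(core, frontend_system=None):
    """Layer a frontend-supplied system prompt on top of the core prompt."""
    if not frontend_system:
        return core  # no frontend prompt: core behavior unchanged
    return core + "\n\n# Additional instructions from the client\n" + frontend_system

prompt = layered_prompt(
    CORE_PROMPT, "You are a Python expert. Always include type hints."
)
```

The core prompt always comes first, so tool and memory instructions survive whatever the frontend appends.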
## Authentication

Bearer token auth via the `Authorization` header:

```
Authorization: Bearer your-secret-key
```

Configure the key via the `API_SERVER_KEY` env var. If no key is set, all requests are allowed (intended for local-only use).
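The check is equivalent to this sketch (function name illustrative): compare the header against the configured key, and allow everything when no key is configured:

```python
def is_authorized(headers, configured_key=None):
    """Bearer check: allow all requests when no API_SERVER_KEY is set."""
    if configured_key is None:
        return True  # local-only mode: no auth required
    auth = headers.get("Authorization", "")
    return auth == f"Bearer {configured_key}"
```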
## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `API_SERVER_ENABLED` | `false` | Enable the API server |
| `API_SERVER_PORT` | `8642` | HTTP server port |
| `API_SERVER_HOST` | `127.0.0.1` | Bind address (localhost only by default) |
| `API_SERVER_KEY` | _(none)_ | Bearer token for auth |

### config.yaml

Not yet supported — use environment variables. config.yaml support is coming in a future release.
## CORS

The API server includes CORS headers on all responses (`Access-Control-Allow-Origin: *`), so browser-based frontends can connect directly.

## Compatible Frontends

Any frontend that supports the OpenAI API format works. Tested/documented integrations:

| Frontend | Stars | Connection |
|----------|-------|------------|
| [Open WebUI](/docs/user-guide/messaging/open-webui) | 126k | Full guide available |
| LobeChat | 73k | Custom provider endpoint |
| LibreChat | 34k | Custom endpoint in librechat.yaml |
| AnythingLLM | 56k | Generic OpenAI provider |
| NextChat | 87k | BASE_URL env var |
| ChatBox | 39k | API Host setting |
| Jan | 26k | Remote model config |
| HF Chat-UI | 8k | OPENAI_BASE_URL |
| big-AGI | 7k | Custom endpoint |
| OpenAI Python SDK | — | `OpenAI(base_url="http://localhost:8642/v1")` |
| curl | — | Direct HTTP requests |

## Limitations

- **Response storage is in-memory** — stored responses (for `previous_response_id`) are lost on gateway restart. Max 100 stored responses (LRU eviction).
- **No file upload** — vision/document analysis via uploaded files is not yet supported through the API.
- **Model field is cosmetic** — the `model` field in requests is accepted but the actual LLM model used is configured server-side in config.yaml.
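An in-memory store with LRU eviction and a size cap, as described above, can be sketched with the standard library (the `ResponseStore` class is illustrative; hermes-agent's actual implementation may differ):

```python
from collections import OrderedDict

MAX_STORED = 100  # matches the documented limit

class ResponseStore:
    """In-memory response store with least-recently-used eviction (sketch)."""
    def __init__(self, max_size=MAX_STORED):
        self.max_size = max_size
        self._data = OrderedDict()  # insertion order tracks recency

    def put(self, response_id, response):
        self._data[response_id] = response
        self._data.move_to_end(response_id)  # newest entry is most recent
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)   # evict least recently used

    def get(self, response_id):
        if response_id in self._data:
            self._data.move_to_end(response_id)  # a read refreshes recency
        return self._data.get(response_id)
```

Because eviction is by recency, actively chained conversations stay resident while stale responses fall out first.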
210
website/docs/user-guide/features/streaming.md
Normal file
@@ -0,0 +1,210 @@
---
sidebar_position: 15
title: "Streaming"
description: "Token-by-token live response display across all platforms"
---

# Streaming Responses

When enabled, hermes-agent streams LLM responses token-by-token instead of waiting for the full generation. Users see the response typing out live — the same experience as ChatGPT, Claude, or Gemini.

Streaming is **disabled by default** and can be enabled globally or per-platform.

## How It Works

```
LLM generates tokens → callback fires per token → queue → consumer displays

Telegram/Discord/Slack:
  Token arrives → Accumulate → Every 1.5s, edit the message with new text + ▌ cursor
  Done → Final edit removes cursor

API Server:
  Token arrives → SSE event sent to client immediately
  Done → finish chunk + [DONE]
```

The agent's internal operation doesn't change — tools still execute normally, memory and skills work as before. Streaming only affects how the **final text response** is delivered to the user.
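The token pipeline above can be demonstrated with the standard library: a producer thread stands in for the LLM callback, a `queue.Queue` bridges to the consumer, and `None` is the end-of-stream sentinel (the function names are illustrative):

```python
import queue
import threading

token_queue = queue.Queue()  # thread-safe bridge between producer and consumer

def producer():
    """Stands in for the LLM stream callback running in the agent thread."""
    for token in ["Hello", ",", " world", "!"]:
        token_queue.put(token)
    token_queue.put(None)  # sentinel: stream finished

def consume():
    """Stands in for the gateway/API-server consumer on the other side."""
    parts = []
    while True:
        token = token_queue.get()  # blocks until the producer puts a token
        if token is None:
            break
        parts.append(token)
    return "".join(parts)

threading.Thread(target=producer).start()
text = consume()
```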
## Enable Streaming

### Option 1: Environment variable

```bash
# Enable for all platforms
export HERMES_STREAMING_ENABLED=true
hermes gateway
```

### Option 2: config.yaml

```yaml
streaming:
  enabled: true # Master switch
```

### Option 3: Per-platform

```yaml
streaming:
  enabled: false # Off by default
  telegram: true # But on for Telegram
  discord: true # And Discord
  api_server: true # And the API server
```
## Platform Support

| Platform | Streaming Method | Rate Limit | Notes |
|----------|-----------------|------------|-------|
| **Telegram** | Progressive message editing | ~20 edits/min | 1.5s edit interval, ▌ cursor |
| **Discord** | Progressive message editing | 5 edits/5s | 1.5s edit interval |
| **Slack** | Progressive message editing | ~50 calls/min | 1.5s edit interval |
| **API Server** | SSE (Server-Sent Events) | No limit | Real token-by-token events |
| **WhatsApp** | ❌ Not supported | — | No message editing API |
| **Home Assistant** | ❌ Not supported | — | No message editing API |
| **CLI** | ❌ Not yet implemented | — | KawaiiSpinner provides feedback |

Platforms without message editing support automatically fall back to non-streaming (the response appears all at once, as before).

## What Users See

### Telegram/Discord/Slack

1. Agent starts working (typing indicator shows)
2. After ~20 tokens, a message appears with partial text and a ▌ cursor
3. Every 1.5 seconds, the message is edited with more accumulated text
4. When the response is complete, the cursor disappears

Tool progress messages still work alongside streaming — tool names/previews appear as before, and the streamed response is shown in a separate message.
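The buffer-then-edit loop can be simulated like this. The function is a sketch, not hermes-agent's code, and the interval and token threshold are shrunk so the demo runs quickly; the real defaults are 1.5s and 20 tokens:

```python
import time

def stream_with_edits(tokens, edit_interval=0.05, min_tokens=3):
    """Simulate the progressive-edit loop: buffer tokens, edit on an interval."""
    edits = []
    buffer = []
    last_edit = None
    for token in tokens:
        buffer.append(token)
        now = time.monotonic()
        if len(buffer) >= min_tokens and (
            last_edit is None or now - last_edit >= edit_interval
        ):
            edits.append("".join(buffer) + "▌")  # partial text + cursor
            last_edit = now
        time.sleep(0.01)  # stand-in for token arrival time
    edits.append("".join(buffer))  # final edit removes the cursor
    return edits

edits = stream_with_edits(list("streaming"))
```

Throttling by interval is what keeps the edit count well under the token count, which is how the platform rate limits in the table above are respected.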
### API Server (frontends like Open WebUI)

When `stream: true` is set in the request, the API server returns Server-Sent Events:

```
data: {"choices":[{"delta":{"role":"assistant"}}]}

data: {"choices":[{"delta":{"content":"Here"}}]}

data: {"choices":[{"delta":{"content":" is"}}]}

data: {"choices":[{"delta":{"content":" the"}}]}

data: {"choices":[{"delta":{"content":" answer"}}]}

data: {"choices":[{"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```

Frontends like Open WebUI display this as live typing.
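A client consumes that stream by parsing each `data:` line and concatenating the `delta.content` pieces. A minimal parser for the event lines shown above (the `parse_sse` helper is illustrative):

```python
import json

def parse_sse(lines):
    """Accumulate delta content from chat-completions SSE lines."""
    text = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank separator lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            text.append(delta["content"])
    return "".join(text)

sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Here"}}]}',
    'data: {"choices":[{"delta":{"content":" is"}}]}',
    'data: {"choices":[{"delta":{"content":" the"}}]}',
    'data: {"choices":[{"delta":{"content":" answer"}}]}',
    'data: {"choices":[{"delta":{},"finish_reason":"stop"}]}',
    'data: [DONE]',
]
result = parse_sse(sample)
```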
## How It Works Internally

### Architecture

```
┌─────────────┐  stream_callback(delta)  ┌──────────────────┐
│   LLM API   │ ───────────────────────► │  queue.Queue()   │
│  (stream)   │  (runs in agent thread)  │  (thread-safe)   │
└─────────────┘                          └────────┬─────────┘
                                                  │
                                     ┌────────────┼──────────┐
                                     │            │          │
                               ┌─────▼─────┐ ┌────▼─────┐ ┌──▼──────┐
                               │  Gateway  │ │ API Svr  │ │   CLI   │
                               │ edit msg  │ │ SSE evt  │ │  (TODO) │
                               └───────────┘ └──────────┘ └─────────┘
```

1. `AIAgent.__init__` accepts an optional `stream_callback` function
2. When set, `_interruptible_api_call()` routes to `_run_streaming_chat_completion()` instead of the normal non-streaming path
3. The streaming method calls the OpenAI API with `stream=True`, iterates chunks, and calls `stream_callback(delta_text)` for each text token
4. Tool call deltas are accumulated silently (no streaming for tool arguments)
5. When the stream ends, `stream_callback(None)` signals completion
6. The method returns a fake response object compatible with the existing code path
7. If streaming fails for any reason, it falls back to a normal non-streaming API call
### Thread Safety

The agent runs in a background thread (via `_interruptible_api_call`). The consumer (gateway async task, API server SSE writer) runs in the main event loop. A `queue.Queue` bridges them — it's thread-safe by design.

### Graceful Fallback

If the LLM provider doesn't support `stream=True` or the streaming connection fails, the agent automatically falls back to a non-streaming API call. The user gets the response normally, just without the live typing effect. No error is shown.

## Configuration Reference

```yaml
streaming:
  enabled: false # Master switch (default: off)

  # Per-platform overrides (optional):
  telegram: true # Enable for Telegram
  discord: true # Enable for Discord
  slack: true # Enable for Slack
  api_server: true # Enable for API server

  # Tuning (optional):
  edit_interval: 1.5 # Seconds between message edits (default: 1.5)
  min_tokens: 20 # Tokens before first display (default: 20)
```

| Variable | Default | Description |
|----------|---------|-------------|
| `HERMES_STREAMING_ENABLED` | `false` | Master switch via env var |
| `streaming.enabled` | `false` | Master switch via config |
| `streaming.<platform>` | _(unset)_ | Per-platform override |
| `streaming.edit_interval` | `1.5` | Seconds between Telegram/Discord edits |
| `streaming.min_tokens` | `20` | Minimum tokens before first message |
## Interaction with Other Features

### Tool Execution

When the agent calls tools (terminal, file operations, web search, etc.), no text tokens are generated — tool arguments are accumulated silently. Tool progress messages continue to work as before. After tools finish, the next LLM call may produce the final text response, which streams normally.

### Context Compression

Compression happens between API calls, not during streaming, so the two features don't interact.

### Interrupts

If the user sends a new message while streaming, the agent is interrupted. The HTTP connection is closed (stopping token generation), accumulated text is shown as-is, and the new message is processed.

### Prompt Caching

Streaming doesn't affect prompt caching — the request is identical, just with `stream=True` added.

### Responses API (Codex)

The Codex/Responses API streaming path also supports the `stream_callback`. Token deltas from `response.output_text.delta` events are emitted via the callback.

## Troubleshooting

### Streaming isn't working

1. Check the config: `streaming.enabled: true` in config.yaml or `HERMES_STREAMING_ENABLED=true`
2. Check per-platform settings: `streaming.telegram: true` overrides the master switch
3. Restart the gateway after changing config
4. Check logs for "Streaming failed, falling back" — this indicates the provider may not support streaming

### Response appears twice

If you see the response both in a progressively-edited message AND as a separate final message, this is a bug. The streaming system should suppress the normal send when tokens were delivered via streaming. Please file an issue.

### Messages update too slowly

The default edit interval is 1.5 seconds (to respect platform rate limits). You can lower it in config:

```yaml
streaming:
  edit_interval: 1.0 # Faster updates (may hit rate limits)
```

Going below 1.0s risks Telegram rate limiting (429 errors).

### No streaming on WhatsApp/Home Assistant

These platforms don't support message editing, so streaming automatically falls back to non-streaming. This is expected behavior.
@@ -1,12 +1,12 @@
---
sidebar_position: 1
title: "Messaging Gateway"
-description: "Chat with Hermes from Telegram, Discord, Slack, WhatsApp, or Signal — architecture and setup overview"
+description: "Chat with Hermes from Telegram, Discord, Slack, WhatsApp, Signal, or any OpenAI-compatible frontend — architecture and setup overview"
---

# Messaging Gateway

-Chat with Hermes from Telegram, Discord, Slack, WhatsApp, or Signal. The gateway is a single background process that connects to all your configured platforms, handles sessions, runs cron jobs, and delivers voice messages.
+Chat with Hermes from Telegram, Discord, Slack, WhatsApp, Signal, or any OpenAI-compatible frontend (Open WebUI, LobeChat, etc.). The gateway is a single background process that connects to all your configured platforms, handles sessions, runs cron jobs, and delivers voice messages.

## Architecture

@@ -15,10 +15,10 @@ Chat with Hermes from Telegram, Discord, Slack, WhatsApp, or Signal. The gateway
│                          Hermes Gateway                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
-│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ │
-│ │ Telegram │ │ Discord  │ │ WhatsApp │ │  Slack   │ │ Signal │ │
-│ │ Adapter  │ │ Adapter  │ │ Adapter  │ │ Adapter  │ │ Adapter│ │
-│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └───┬────┘ │
+│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ ┌──────────┐ │
+│ │ Telegram │ │ Discord  │ │ WhatsApp │ │  Slack   │ │ Signal │ │API Server│ │
+│ │ Adapter  │ │ Adapter  │ │ Adapter  │ │ Adapter  │ │ Adapter│ │ (OpenAI) │ │
+│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └───┬────┘ └────┬─────┘ │
│      │            │            │            │           │           │
│      └────────────┼────────────┼────────────┼───────────┘           │
│                   │                                                 │