mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-06-05 07:41:39 +00:00
fix(context): align guidance with 64k minimum
parent 1d5deac346
commit 3b839f4369
6 changed files with 41 additions and 35 deletions
@@ -691,7 +691,7 @@ model:
 default: qwen2.5-coder:32b
 provider: custom
 base_url: http://localhost:11434/v1
-context_length: 32768 # See warning below
+context_length: 64000 # See warning below
 ```

 :::caution Ollama defaults to very low context lengths
@@ -703,22 +703,22 @@ Ollama does **not** use your model's full context window by default. Depending o
 | 24–48 GB | 32,768 tokens |
 | 48+ GB | 256,000 tokens |

-For agent use with tools, **you need at least 16k–32k context**. At 4k, the system prompt + tool schemas alone can fill the window, leaving no room for conversation.
+Hermes Agent requires at least **64,000 tokens** of context for agent use with tools. Smaller windows are rejected at startup because the system prompt, tool schemas, and working conversation state need enough room for reliable multi-step workflows.

 **How to increase it** (pick one):

 ```bash
 # Option 1: Set server-wide via environment variable (recommended)
-OLLAMA_CONTEXT_LENGTH=32768 ollama serve
+OLLAMA_CONTEXT_LENGTH=64000 ollama serve

 # Option 2: For systemd-managed Ollama
 sudo systemctl edit ollama.service
-# Add: Environment="OLLAMA_CONTEXT_LENGTH=32768"
+# Add: Environment="OLLAMA_CONTEXT_LENGTH=64000"
 # Then: sudo systemctl daemon-reload && sudo systemctl restart ollama

 # Option 3: Bake it into a custom model (persistent per-model)
-echo -e "FROM qwen2.5-coder:32b\nPARAMETER num_ctx 32768" > Modelfile
-ollama create qwen2.5-coder-32k -f Modelfile
+echo -e "FROM qwen2.5-coder:32b\nPARAMETER num_ctx 64000" > Modelfile
+ollama create qwen2.5-coder-64k -f Modelfile
 ```

 **You cannot set context length through the OpenAI-compatible API** (`/v1/chat/completions`). It must be configured server-side or via a Modelfile. This is the #1 source of confusion when integrating Ollama with tools like Hermes.
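
The startup rejection described in the hunk above can be illustrated with a minimal sketch. This is not Hermes Agent's actual code: the function name `check_context_length` and the variable `MIN_CONTEXT` are hypothetical, and only the 64,000-token threshold comes from the guidance itself.

```bash
# Hypothetical sketch; not Hermes Agent's real startup code.
MIN_CONTEXT=64000   # minimum stated in the guidance above

# Returns 0 when the configured window is large enough, 1 otherwise.
check_context_length() {
  local ctx=$1
  if [ "$ctx" -lt "$MIN_CONTEXT" ]; then
    echo "context_length ${ctx} is below the ${MIN_CONTEXT}-token minimum" >&2
    return 1
  fi
  return 0
}

check_context_length 64000 && echo "accepted"   # meets the minimum
check_context_length 32768 || echo "rejected"   # the old 32k value is too small
```

A check along these lines would accept the new `64000` values in this commit and refuse the `32768` values they replace.
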
@@ -820,13 +820,13 @@ If responses seem truncated, add `max_tokens` to your requests or set `--default
 cmake -B build && cmake --build build --config Release
 ./build/bin/llama-server \
 --jinja -fa \
--c 32768 \
+-c 64000 \
 -ngl 99 \
 -m models/qwen2.5-coder-32b-instruct-Q4_K_M.gguf \
 --port 8080 --host 0.0.0.0
 ```

-**Context length (`-c`):** Recent builds default to `0` which reads the model's training context from the GGUF metadata. For models with 128k+ training context, this can OOM trying to allocate the full KV cache. Set `-c` explicitly to what you need (32k–64k is a good range for agent use). If using parallel slots (`-np`), the total context is divided among slots — with `-c 32768 -np 4`, each slot only gets 8k.
+**Context length (`-c`):** Recent builds default to `0` which reads the model's training context from the GGUF metadata. For models with 128k+ training context, this can OOM trying to allocate the full KV cache. Set `-c` explicitly to at least 64,000 tokens for Hermes. If using parallel slots (`-np`), the total context is divided among slots — with `-c 64000 -np 4`, each slot only gets 16k, which is below Hermes' minimum per active session.

 Then configure Hermes to point at it:

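
The slot arithmetic in the `-c` note above can be sketched as follows. Both helper names (`per_slot_ctx`, `min_total_ctx`) are hypothetical and exist only for illustration; the sketch assumes the even split across `-np` slots that the note describes.

```bash
# Hypothetical helpers illustrating the -c / -np arithmetic from the note above.

# Context available to each slot when the total is divided evenly.
per_slot_ctx() {
  local total_ctx=$1 slots=$2
  echo $(( total_ctx / slots ))
}

# Smallest total -c that keeps every slot at or above a per-session minimum.
min_total_ctx() {
  local per_session_min=$1 slots=$2
  echo $(( per_session_min * slots ))
}

per_slot_ctx 64000 4     # prints 16000
min_total_ctx 64000 4    # prints 256000
```

Under that assumption, running four parallel slots at the 64,000-token minimum would need `-c 256000`, which has to fit in memory alongside the model weights.
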
@@ -862,7 +862,7 @@ Start the server from the LM Studio app (Developer tab → Start Server), or use

 ```bash
 lms server start # Starts on port 1234
-lms load qwen2.5-coder --context-length 32768
+lms load qwen2.5-coder --context-length 64000
 ```

 Then configure Hermes:
@@ -1044,7 +1044,7 @@ The model outputs something like `{"name": "web_search", "arguments": {...}}` as
 # vLLM: check --max-model-len in startup args
 ```

-**Fix:** Set context to at least **32,768 tokens** for agent use. See each server's section above for the specific flag.
+**Fix:** Set context to at least **64,000 tokens** for agent use. See each server's section above for the specific flag.

 #### "Context limit: 2048 tokens" at startup

@@ -1057,7 +1057,7 @@ model:
 default: your-model
 provider: custom
 base_url: http://localhost:11434/v1
-context_length: 32768
+context_length: 64000
 ```

 #### Responses get cut off mid-sentence
@@ -1198,7 +1198,7 @@ custom_providers:
 base_url: "http://localhost:11434/v1"
 models:
 qwen3.5:27b:
-context_length: 32768
+context_length: 64000
 deepseek-r1:70b:
 context_length: 65536
 ```