hermes-agent/optional-skills/mlops/hermes-atropos-environments/references/usage-patterns.md

Usage Patterns — Testing Environments and Evaluating Models

Pattern 1: Test Your Environment Works (process mode)

Use process mode to verify your environment runs end-to-end before committing. This generates trajectories without needing an Atropos training server.

Before running: Ask the user for their inference setup (see SKILL.md "Inference Setup" section). Replace <BASE_URL>, <MODEL>, and <SERVER_TYPE> below with their chosen values.

Step 1: Run 1 trajectory

cd ~/.hermes/hermes-agent
source venv/bin/activate

python environments/your_env.py process \
  --env.total_steps 1 \
  --env.group_size 1 \
  --env.use_wandb false \
  --env.data_path_to_save_groups /tmp/test_output.jsonl \
  --openai.base_url "<BASE_URL>" \
  --openai.model_name "<MODEL>" \
  --openai.server_type <SERVER_TYPE> \
  --openai.health_check false

Step 2: Verify the output

import json
for line in open("/tmp/test_output.jsonl"):
    data = json.loads(line)
    print(f"Scores: {data.get('scores', [])}")
    print(f"Token sequences: {len(data.get('tokens', []))}")
    # Check messages include tool calls
    for msg_list in data.get("messages", []):
        roles = [m.get("role") for m in msg_list]
        print(f"Roles: {roles}")
        for m in reversed(msg_list):
            if m.get("role") == "assistant" and m.get("content"):
                print(f"Response: {m['content'][:200]}...")
                break

What to check:

  • Scores are not all 0.0 — if so, compute_reward is broken
  • Scores are in [0, 1] — not negative, not >1
  • Messages include "tool" role entries — agent used tools
  • Token sequences are non-empty
  • An HTML visualization is generated next to the .jsonl
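The checklist above can be automated. A minimal sketch, assuming the /tmp/test_output.jsonl path from Step 1 and the field names shown in Step 2:

```python
import json

def check_trajectories(path):
    """Run the Pattern 1 sanity checks over a process-mode .jsonl file."""
    scores, has_tool_role = [], False
    with open(path) as f:
        for line in f:
            data = json.loads(line)
            scores.extend(data.get("scores", []))
            assert data.get("tokens"), "Token sequences are empty"
            for msg_list in data.get("messages", []):
                if any(m.get("role") == "tool" for m in msg_list):
                    has_tool_role = True
    assert scores, "No scores found in the file"
    assert any(s != 0.0 for s in scores), "All scores 0.0 — compute_reward may be broken"
    assert all(0.0 <= s <= 1.0 for s in scores), "Scores outside [0, 1]"
    if not has_tool_role:
        print("Warning: no 'tool' role messages — agent may not be using tools")
    print("All checks passed")
```

The HTML visualization still needs a manual look; this only covers the machine-checkable items.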

Common failures:

  • 'AgentResult' object has no attribute 'X' — accessing a field that doesn't exist. See agentresult-fields.md.
  • Score always 0.0 — reward function erroring silently
  • Score always 1.0 — verification too lenient or not running
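One way to surface the silent-0.0 failure is to wrap the reward function so exceptions are logged instead of swallowed. A sketch — `compute_reward` here stands for whatever your environment's reward function is actually named:

```python
def safe_reward(compute_reward, *args, **kwargs):
    """Wrap a reward function so failures are logged instead of silently scored 0.0."""
    try:
        score = compute_reward(*args, **kwargs)
    except Exception as e:
        # A swallowed exception here is the usual cause of "score always 0.0"
        print(f"compute_reward raised {type(e).__name__}: {e}")
        return 0.0
    if not 0.0 <= score <= 1.0:
        print(f"compute_reward returned out-of-range score: {score}")
    return score
```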

Pattern 2: Evaluate a Model (evaluate mode)

Use evaluate mode to benchmark a model on your environment's eval split. This runs the full agent loop with tools for each eval item.

Step 1: Run evaluation

python environments/your_env.py evaluate \
  --env.eval_size 20 \
  --env.use_wandb false \
  --env.data_dir_to_save_evals /tmp/eval_results \
  --openai.base_url "<BASE_URL>" \
  --openai.model_name "<MODEL>" \
  --openai.server_type <SERVER_TYPE> \
  --openai.health_check false

Step 2: Read results

Stdout shows a lighteval-compatible table:

Evaluation Results: your-env_eval
|Metric          |  Value|
|mean correctness| 0.850 |
|mean reward     | 0.920 |
|mean tool calls | 4.300 |
|n items         | 20    |
Evaluation completed in 367 seconds

JSON results saved to the eval directory:

import json
data = json.load(open("/tmp/eval_results/metrics.json"))
for metric, value in data["results"]["all"].items():
    print(f"{metric}: {value}")

Step 3: Compare models

Run evaluate with different models and compare the metrics.json files.
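A minimal comparison sketch, assuming each run saved a metrics.json with the results layout shown in Step 2:

```python
import json

def compare_models(paths):
    """Print eval metrics side by side for several evaluate runs.

    paths: mapping of model label -> path to that run's metrics.json.
    """
    results = {name: json.load(open(p))["results"]["all"] for name, p in paths.items()}
    metrics = sorted({m for r in results.values() for m in r})
    for metric in metrics:
        row = "  ".join(f"{name}={results[name].get(metric, float('nan')):.3f}"
                        for name in results)
        print(f"{metric}: {row}")
```

Call it with a mapping like {"model-a": "/tmp/eval_results_a/metrics.json", ...} — the per-model directories are up to you.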

What to check:

  • "data_dir_to_save_evals is not set" — you forgot the flag, results won't be saved
  • Tool usage rate = 0 — evaluate() is using chat_completion instead of HermesAgentLoop
  • All scores identical — judge failing, falling back to heuristic
  • Very slow — each item runs a full agent loop (~30-90s). Use --env.eval_size 5 for quick checks.
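The tool-usage check can be scripted from the saved metrics. A sketch, assuming the metrics.json layout from Step 2 and the "mean tool calls" metric name from the sample table:

```python
import json

def check_tool_usage(path):
    """Flag evaluate runs where the agent never called a tool."""
    metrics = json.load(open(path))["results"]["all"]
    mean_tool_calls = metrics.get("mean tool calls", 0.0)
    if mean_tool_calls == 0.0:
        print("WARNING: zero tool calls — evaluate() may be using chat_completion "
              "instead of the agent loop")
    return mean_tool_calls
```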

Pattern 3: Generate Training Data (process mode, larger scale)

Generate trajectory data for offline training or analysis:

python environments/your_env.py process \
  --env.total_steps 50 \
  --env.group_size 4 \
  --env.use_wandb false \
  --env.data_path_to_save_groups data/trajectories.jsonl \
  --openai.base_url "<BASE_URL>" \
  --openai.model_name "<MODEL>" \
  --openai.server_type <SERVER_TYPE> \
  --openai.health_check false

Analyze the distribution:

import json

scores = []
for line in open("data/trajectories.jsonl"):
    data = json.loads(line)
    scores.extend(data.get("scores", []))

if not scores:
    raise SystemExit("No scores found — check the .jsonl path")

print(f"Total: {len(scores)}, Mean: {sum(scores)/len(scores):.3f}")
buckets = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
for bucket in buckets:
    # Assign each score to its nearest bucket (no gaps at 0.1, 0.3, ...)
    count = sum(1 for s in scores if min(buckets, key=lambda b: abs(s - b)) == bucket)
    print(f"  {bucket:.1f}: {'█' * count} ({count})")

What to check:

  • Score distribution has variance — RL needs score variance. All-same scores are useless.
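Variance can be checked directly rather than eyeballed from the histogram. A sketch over the scores list built by the snippet above — the 0.05 threshold is a judgment call, not an Atropos constant:

```python
import statistics

def score_variance_ok(scores, min_std=0.05):
    """Return True if the score distribution has enough spread to carry RL signal."""
    if len(scores) < 2:
        return False
    std = statistics.pstdev(scores)
    print(f"Score std dev: {std:.3f}")
    return std >= min_std
```

If this returns False, tighten or diversify the reward before generating more data — more all-same trajectories won't help.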

Pattern 4: Full RL Training (serve mode)

For actual RL training with Atropos:

# Terminal 1: Start Atropos API server
run-api

# Terminal 2: Start your environment
python environments/your_env.py serve \
  --config environments/your_env/default.yaml

For Phase 2 with VLLM:

# Terminal 1: VLLM server
python -m vllm.entrypoints.openai.api_server --model your-model --port 8000

# Terminal 2: Atropos API
run-api

# Terminal 3: Environment
python environments/your_env.py serve \
  --openai.base_url http://localhost:8000/v1 \
  --openai.model_name your-model \
  --openai.server_type vllm

Pattern 5: Quick Smoke Test

Verify imports and config before spending money on API calls:

from environments.your_env import YourEnv
print(f"Name: {YourEnv.name}")
cfg, servers = YourEnv.config_init()
print(f"Toolsets: {cfg.enabled_toolsets}")
print(f"Server: {servers[0].model_name}")
print("All imports OK")

Timing Expectations

|Mode               |Items|Time per item|Total     |
|-------------------|-----|-------------|----------|
|process (1 item)   |1    |30-90s       |~1 min    |
|evaluate (5 items) |5    |30-90s       |~5 min    |
|evaluate (20 items)|20   |30-90s       |~15-30 min|
|process (50 items) |50   |30-90s       |~30-75 min|

Times are for cloud APIs with Claude Sonnet-class models. Local models may be faster or slower depending on hardware.