mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-30 06:41:51 +00:00
Catch the website docs up to two weeks of merged work (May 4 – May 18, 2026,
roughly 1,080 PRs). The audit found ~50 user-visible features that had landed
in code with no docs footprint, plus a handful of stale pages. This PR closes
every gap the scan turned up.
New pages
- user-guide/features/deliverable-mode.md — extension list, agent triggers,
kanban_complete artifacts pattern, [[as_document]] override (PR #27813).
- developer-guide/web-search-provider-plugin.md — authoring guide modeled on
image-gen-provider-plugin, covering brave_free / ddgs / etc. (PR #25448).
Providers / auth
- Rename "Alibaba Cloud" → "Qwen Cloud (Alibaba DashScope)" everywhere the
display label shows up; provider id stays `alibaba` (PR #24835).
- Document OAuth refresh-token quarantine for xAI / MiniMax / Codex (PRs
#28116 / #28118 / #28119).
- Document Nous JWT minting from refresh token + invalid-refresh quarantine
+ cross-profile shared token store (PRs #27663 / #19712).
- Add `## Microsoft Entra ID authentication (keyless)` section to
azure-foundry guide — DefaultAzureCredential, RBAC, OpenAI + Anthropic
routing details (PR #28101 / #9df9816da).
- Custom providers `api_mode` is now prompted-and-persisted, not just URL
autodetected (PR #25068).
- Delegation honours `api_mode` + auto-detects anthropic_messages base URLs
(PR #26824).
- `x_search` auto-enables when xAI credentials are present (PR #27376).
- Add `xAI Grok OAuth (SuperGrok)` row to providers headline table (PR
#26534).
- NVIDIA NIM billing-origin header is set automatically (PR #26585).
Windows / installer
- `install.ps1`: document `-Commit <sha>` and `-Tag <v>` pin params plus
the BOM-strip / git-retry hardening (PR #28169).
- Document Hermes Desktop thin installer + first-launch bootstrap (PR
#27822).
- Document `dep_ensure` Windows bootstrap (PR #27845).
- Document install-method auto-detection (pip / git / homebrew / nixos) and
the matching update command (PR #27843).
Gateway / messaging
- `/platform list|pause|resume` full description + circuit-breaker
semantics (PR #26600).
- Slack / Matrix / Mattermost get parallel `allowed_channels` /
`allowed_rooms` allowlist sections matching Telegram/Discord/DingTalk
(PR #21251).
- Discord `allow_any_attachment` + `max_attachment_bytes` (config and env
vars) (PR #27245).
- Discord clarify-choice button rendering (PR #25485).
- Telegram `guest_mode` @mention bypass for allowlisted groups (PR
#22759).
- Telegram `notifications` mode (`important` vs `all`) (PR #22793).
- `[[as_document]]` skill / response directive for forcing
document-style media delivery (PR #21210).
CLI / TUI
- `/new [name]` argument (PR #19637).
- `/subgoal` user-supplied criteria appended to `/goal` (PR #25449).
- `/exit --delete` flag confirmation prompts for destructive slash
commands (PR #22687).
- Status-bar additions: ▶ N background indicator (PR #27175), context
compression count (PR #21218), YOLO mode banner+statusbar warning (PR
#26238).
- `display.timestamps` + `docker_extra_args` config keys (PR #23599).
- TUI collapsible startup banner sections (PR #20625).
- `HERMES_SESSION_ID` exported to tool subprocesses (PR #23847).
i18n
- Refresh display.language locale list from 8 → 16 (en, zh, zh-hant, ja,
de, es, fr, tr, uk, af, ko, it, ga, pt, ru, hu) — matches
`agent/i18n.py:SUPPORTED_LANGUAGES`.
Tools / features
- `vision_analyze` native-pixel passthrough for vision-capable callers,
with auxiliary text-describer fallback (PR #22955).
- `session_search` rewrite to the single-shape tool (discovery / scroll /
browse modes) (PRs #27590 / #27840).
- Clarify MCP transport scope: client supports stdio + SSE; embedded
`hermes mcp serve` is stdio-only (PR #21227).
- Web search backends table: add Brave Search (free tier) and DDGS rows
(PR #21337).
- ACP session-scoped edit auto-approval modes (PR #27862).
- Curator rename map in the user-visible per-run summary (PR #22910).
- Prompt caching feature page reference in features/overview.md — Claude
cross-session 1-hour prefix cache on native Anthropic / OpenRouter /
Nous Portal (PR #23828).
- Cron per-job profile parameter (PR #28124).
- `--no-skills` flag for `hermes profile create` (PR #20986).
Build
- Verified with `npm run build` in `website/`; both `en` and `zh-Hans`
locales compile. Remaining broken-link/anchor warnings are pre-existing
(`rl-training.md` from learning-path / overview; the
zh-Hans translation lag the docs skill already calls out).
210 lines
10 KiB
Markdown
210 lines
10 KiB
Markdown
---
|
|
title: Vision & Image Paste
|
|
description: Paste images from your clipboard into the Hermes CLI for multimodal vision analysis.
|
|
sidebar_label: Vision & Image Paste
|
|
sidebar_position: 7
|
|
---
|
|
|
|
# Vision & Image Paste
|
|
|
|
Hermes Agent supports **multimodal vision** — you can paste images from your clipboard directly into the CLI and ask the agent to analyze, describe, or work with them. Images are sent to the model as base64-encoded content blocks, so any vision-capable model can process them.
|
|
|
|
## How It Works
|
|
|
|
1. Copy an image to your clipboard (screenshot, browser image, etc.)
|
|
2. Attach it using one of the methods below
|
|
3. Type your question and press Enter
|
|
4. The image appears as a `[📎 Image #1]` badge above the input
|
|
5. On submit, the image is sent to the model as a vision content block
|
|
|
|
You can attach multiple images before sending — each gets its own badge. Press `Ctrl+C` to clear all attached images.
|
|
|
|
Images are saved to `~/.hermes/images/` as PNG files with timestamped filenames.
|
|
|
|
## Paste Methods
|
|
|
|
How you attach an image depends on your terminal environment. Not all methods work everywhere — here's the full breakdown:
|
|
|
|
### `/paste` Command
|
|
|
|
**The most reliable explicit image-attach fallback.**
|
|
|
|
```
|
|
/paste
|
|
```
|
|
|
|
Type `/paste` and press Enter. Hermes checks your clipboard for an image and attaches it. This is the safest option when your terminal rewrites `Cmd+V`/`Ctrl+V`, or when you copied only an image and there is no bracketed-paste text payload to inspect.
|
|
|
|
### Ctrl+V / Cmd+V
|
|
|
|
Hermes now treats paste as a layered flow:
|
|
- normal text paste first
|
|
- native clipboard / OSC52 text fallback if the terminal did not deliver text cleanly
|
|
- image attach when the clipboard or pasted payload resolves to an image or image path
|
|
|
|
This means pasted macOS screenshot temp paths and `file://...` image URIs can attach immediately instead of sitting in the composer as raw text.
|
|
|
|
:::warning
|
|
If your clipboard has **only an image** (no text), terminals still cannot send binary image bytes directly. Use `/paste` as the explicit image-attach fallback.
|
|
:::
|
|
|
|
### `/terminal-setup` for VS Code / Cursor / Windsurf
|
|
|
|
If you run the TUI inside a local VS Code-family integrated terminal on macOS, Hermes can install the recommended `workbench.action.terminal.sendSequence` bindings for better multiline and undo/redo parity:
|
|
|
|
```text
|
|
/terminal-setup
|
|
```
|
|
|
|
This is especially useful when `Cmd+Enter`, `Cmd+Z`, or `Shift+Cmd+Z` are being intercepted by the IDE. Run it on the local machine only — not inside an SSH session.
|
|
|
|
## Platform Compatibility
|
|
|
|
| Environment | `/paste` | Cmd/Ctrl+V | `/terminal-setup` | Notes |
|
|
|---|:---:|:---:|:---:|---|
|
|
| **macOS Terminal / iTerm2** | ✅ | ✅ | n/a | Best experience — native clipboard + screenshot-path recovery |
|
|
| **Apple Terminal** | ✅ | ✅ | n/a | If Cmd+←/→/⌫ gets rewritten, use Ctrl+A / Ctrl+E / Ctrl+U fallbacks |
|
|
| **Linux X11 desktop** | ✅ | ✅ | n/a | Requires `xclip` (`apt install xclip`) |
|
|
| **Linux Wayland desktop** | ✅ | ✅ | n/a | Requires `wl-paste` (`apt install wl-clipboard`) |
|
|
| **WSL2 (Windows Terminal)** | ✅ | ✅ | n/a | Uses `powershell.exe` — no extra install needed |
|
|
| **VS Code / Cursor / Windsurf (local)** | ✅ | ✅ | ✅ | Recommended for better Cmd+Enter / undo / redo parity |
|
|
| **VS Code / Cursor / Windsurf (SSH)** | ❌² | ❌² | ❌³ | Run `/terminal-setup` on the local machine instead |
|
|
| **SSH terminal (any)** | ❌² | ❌² | n/a | Remote clipboard not accessible |
|
|
|
|
² See [SSH & Remote Sessions](#ssh--remote-sessions) below
|
|
³ The command writes local IDE keybindings and should not be run from the remote host
|
|
|
|
## Platform-Specific Setup
|
|
|
|
### macOS
|
|
|
|
**No setup required.** Hermes uses `osascript` (built into macOS) to read the clipboard. For faster performance, optionally install `pngpaste`:
|
|
|
|
```bash
|
|
brew install pngpaste
|
|
```
|
|
|
|
### Linux (X11)
|
|
|
|
Install `xclip`:
|
|
|
|
```bash
|
|
# Ubuntu/Debian
|
|
sudo apt install xclip
|
|
|
|
# Fedora
|
|
sudo dnf install xclip
|
|
|
|
# Arch
|
|
sudo pacman -S xclip
|
|
```
|
|
|
|
### Linux (Wayland)
|
|
|
|
Modern Linux desktops (Ubuntu 22.04+, Fedora 34+) often use Wayland by default. Install `wl-clipboard`:
|
|
|
|
```bash
|
|
# Ubuntu/Debian
|
|
sudo apt install wl-clipboard
|
|
|
|
# Fedora
|
|
sudo dnf install wl-clipboard
|
|
|
|
# Arch
|
|
sudo pacman -S wl-clipboard
|
|
```
|
|
|
|
:::tip How to check if you're on Wayland
|
|
```bash
|
|
echo $XDG_SESSION_TYPE
|
|
# "wayland" = Wayland, "x11" = X11, "tty" = no display server
|
|
```
|
|
:::
|
|
|
|
### WSL2
|
|
|
|
**No extra setup required.** Hermes detects WSL2 automatically (via `/proc/version`) and uses `powershell.exe` to access the Windows clipboard through .NET's `System.Windows.Forms.Clipboard`. This is built into WSL2's Windows interop — `powershell.exe` is available by default.
|
|
|
|
The clipboard data is transferred as base64-encoded PNG over stdout, so no file path conversion or temp files are needed.
|
|
|
|
:::info WSLg Note
|
|
If you're running WSLg (WSL2 with GUI support), Hermes tries the PowerShell path first, then falls back to `wl-paste`. WSLg's clipboard bridge only supports BMP format for images — Hermes auto-converts BMP to PNG using Pillow (if installed) or ImageMagick's `convert` command.
|
|
:::
|
|
|
|
#### Verify WSL2 clipboard access
|
|
|
|
```bash
|
|
# 1. Check WSL detection
|
|
grep -i microsoft /proc/version
|
|
|
|
# 2. Check PowerShell is accessible
|
|
which powershell.exe
|
|
|
|
# 3. Copy an image, then check
|
|
powershell.exe -NoProfile -Command "Add-Type -AssemblyName System.Windows.Forms; [System.Windows.Forms.Clipboard]::ContainsImage()"
|
|
# Should print "True"
|
|
```
|
|
|
|
## SSH & Remote Sessions
|
|
|
|
**Clipboard image paste does not fully work over SSH.** When you SSH into a remote machine, the Hermes CLI runs on the remote host. Clipboard tools (`xclip`, `wl-paste`, `powershell.exe`, `osascript`) read the clipboard of the machine they run on — which is the remote server, not your local machine. Your local clipboard image is therefore inaccessible from the remote side.
|
|
|
|
Text can sometimes still bridge through terminal paste or OSC52, but image clipboard access and local screenshot temp paths remain tied to the machine running Hermes.
|
|
|
|
### Workarounds for SSH
|
|
|
|
1. **Upload the image file** — Save the image locally, upload it to the remote server via `scp`, VSCode's file explorer (drag-and-drop), or any file transfer method. Then reference it by path. *(A `/attach <filepath>` command is planned for a future release.)*
|
|
|
|
2. **Use a URL** — If the image is accessible online, just paste the URL in your message. The agent can use `vision_analyze` to look at any image URL directly.
|
|
|
|
3. **X11 forwarding** — Connect with `ssh -X` to forward X11. This lets `xclip` on the remote machine access your local X11 clipboard. Requires an X server running locally (XQuartz on macOS, built-in on Linux X11 desktops). Slow for large images.
|
|
|
|
4. **Use a messaging platform** — Send images to Hermes via Telegram, Discord, Slack, or WhatsApp. These platforms handle image upload natively and are not affected by clipboard/terminal limitations.
|
|
|
|
## Why Terminals Can't Paste Images
|
|
|
|
This is a common source of confusion, so here's the technical explanation:
|
|
|
|
Terminals are **text-based** interfaces. When you press Ctrl+V (or Cmd+V), the terminal emulator:
|
|
|
|
1. Reads the clipboard for **text content**
|
|
2. Wraps it in [bracketed paste](https://en.wikipedia.org/wiki/Bracketed-paste) escape sequences
|
|
3. Sends it to the application through the terminal's text stream
|
|
|
|
If the clipboard contains only an image (no text), the terminal has nothing to send. There is no standard terminal escape sequence for binary image data. The terminal simply does nothing.
|
|
|
|
This is why Hermes uses a separate clipboard check — instead of receiving image data through the terminal paste event, it calls OS-level tools (`osascript`, `powershell.exe`, `xclip`, `wl-paste`) directly via subprocess to read the clipboard independently.
|
|
|
|
## Supported Models
|
|
|
|
Image paste works with any vision-capable model. The image is sent as a base64-encoded data URL in the OpenAI vision content format:
|
|
|
|
```json
|
|
{
|
|
"type": "image_url",
|
|
"image_url": {
|
|
"url": "data:image/png;base64,..."
|
|
}
|
|
}
|
|
```
|
|
|
|
Most modern models support this format, including GPT-4 Vision, Claude (with vision), Gemini, and open-source multimodal models served through OpenRouter.
|
|
|
|
## Image Routing (Vision-Capable vs Text-Only Models)
|
|
|
|
When a user attaches an image — from the CLI clipboard, the gateway (Telegram/Discord photo), or any other entry point — Hermes routes it based on whether your current model actually supports vision:
|
|
|
|
| Your model | What happens to the image |
|
|
|---|---|
|
|
| **Vision-capable** (GPT-4V, Claude with vision, Gemini, Qwen-VL, MiMo-VL, etc.) | Sent as **real pixels** using the provider's native image content format above. No text summary layer. |
|
|
| **Text-only** (DeepSeek V3, smaller open-source models, older chat-only endpoints) | Routed through the `vision_analyze` auxiliary tool — an auxiliary vision model describes the image, and the text description is injected into the conversation. |
|
|
|
|
You don't configure this — Hermes looks up your current model's capability in the provider metadata and picks the right path automatically. The practical effect: you can switch between vision and non-vision models mid-session and image handling "just works" without changing your workflow. Text-only models get coherent context about the image rather than a broken multimodal payload they'd have to reject.
|
|
|
|
Which auxiliary model handles the text-description path is configurable under `auxiliary.vision` — see [Auxiliary Models](/docs/user-guide/configuration#auxiliary-models).
|
|
|
|
### `vision_analyze` has the same dual behavior
|
|
|
|
The `vision_analyze` tool itself follows the same routing. When the active main model is vision-capable **and** its provider supports image content inside tool results (currently the Anthropic, OpenAI, Azure-OpenAI, and Gemini 3.x stacks), `vision_analyze` short-circuits the auxiliary describer and returns the raw image pixels as a multimodal tool-result envelope. The main model sees the image natively on its next turn — no aux call, no text-summary information loss, no extra latency.
|
|
|
|
For text-only main models (or providers whose tool-result channel doesn't carry images), `vision_analyze` falls back to the legacy path: it asks the configured auxiliary vision model to describe the image and returns the description as plain text. Either way the calling tool signature is the same — the tool decides which path to take at runtime based on the active model.
|