# Hermes Agent - Future Improvements

> Ideas for enhancing the agent's capabilities, generated from self-analysis of the codebase.

---

## 🚨 HIGH PRIORITY - Immediate Fixes

These items need to be addressed ASAP:

### 1. SUDO Breaking Terminal Tool 🔐 ✅ COMPLETE

- [x] **Problem:** SUDO commands break terminal tool execution (hangs indefinitely)
- [x] **Fix:** Created custom environment wrappers in `tools/terminal_tool.py`
  - `stdin=subprocess.DEVNULL` prevents hanging on interactive prompts
  - Sudo fails gracefully with a clear error if no password is configured
  - Same UX as Claude Code - the agent sees the error and tells the user to run the command themselves
- [x] **All 5 environments now have consistent behavior:**
  - `_LocalEnvironment` - local execution
  - `_DockerEnvironment` - Docker containers
  - `_SingularityEnvironment` - Singularity/Apptainer containers
  - `_ModalEnvironment` - Modal cloud sandboxes
  - `_SSHEnvironment` - remote SSH execution
- [x] **Optional sudo support via `SUDO_PASSWORD` env var:**
  - Shared `_transform_sudo_command()` helper used by all environments
  - If set, auto-transforms `sudo cmd` → pipes the password via `sudo -S`
  - Documented in `.env.example`, `cli-config.yaml`, and the README
  - Works for chained commands: `cmd1 && sudo cmd2`
- [x] **Interactive sudo prompt in CLI mode:**
  - When sudo is detected and no password is configured, prompts the user
  - 45-second timeout (auto-skips if no input)
  - Hidden password input via `getpass` (password not visible)
  - Password cached for the session (don't ask repeatedly)
  - Spinner pauses during the prompt for a clean UX
  - Uses the `HERMES_INTERACTIVE` env var to detect CLI mode

### 2. Fix `browser_get_images` Tool 🖼️ ✅ VERIFIED WORKING

- [x] **Tested:** Tool works correctly on multiple sites
- [x] **Results:** Successfully extracts image URLs, alt text, and dimensions
- [x] **Note:** Some sites (Pixabay, etc.) have Cloudflare bot protection that blocks headless browsers - this is expected behavior, not a bug
### 3. Better Action Logging for Debugging 📝 ✅ COMPLETE

- [x] **Problem:** Need better logging of agent actions for debugging
- [x] **Implementation:**
  - Save full session trajectories to the `logs/` directory as JSON
  - Each session gets a unique file: `session_YYYYMMDD_HHMMSS_UUID.json`
  - Logs all messages and tool calls with inputs/outputs and timestamps
  - Structured JSON format for easy parsing and replay
  - Automatic on CLI runs (configurable)

### 4. Automatic Context Compression 🗜️ ✅ COMPLETE

- [x] **Problem:** Long conversations exceed model context limits, causing errors
- [x] **Solution:** Auto-compress middle turns when approaching the limit
- [x] **Implementation:**
  - Fetches model context lengths from the OpenRouter `/api/v1/models` API (cached 1hr)
  - Tracks actual token usage from API responses (`usage.prompt_tokens`)
  - Triggers at 85% of the model's context limit (configurable)
  - Protects the first 3 turns (system, initial request, first response)
  - Protects the last 4 turns (recent context is most relevant)
  - Summarizes middle turns using a fast model (Gemini Flash)
  - Inserts the summary as a user message; the conversation continues seamlessly
  - If a context error occurs, attempts compression before failing
- [x] **Configuration (cli-config.yaml / env vars):**
  - `CONTEXT_COMPRESSION_ENABLED` (default: true)
  - `CONTEXT_COMPRESSION_THRESHOLD` (default: 0.85 = 85%)
  - `CONTEXT_COMPRESSION_MODEL` (default: google/gemini-2.0-flash-001)
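The trigger logic described in item 4 (threshold, protected head and tail turns) can be sketched as follows — function and parameter names are illustrative, and the actual summarization call is omitted:

```python
def select_turns_to_compress(messages, prompt_tokens, context_limit,
                             threshold=0.85, keep_head=3, keep_tail=4):
    """Return the 'middle' turns to summarize, or None if under budget.

    Sketch of the compression trigger: fires once actual prompt tokens
    (from `usage.prompt_tokens`) cross `threshold` of the model's
    context limit, protecting the first `keep_head` and last
    `keep_tail` turns from compression.
    """
    if prompt_tokens < threshold * context_limit:
        return None  # still under budget, nothing to compress
    middle = messages[keep_head:len(messages) - keep_tail]
    return middle if middle else None
```

The real implementation would pass the returned slice to the fast summarizer model and splice the summary back in as a user message.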
### 5. Stream Thinking Summaries in Real-Time 💭 ⏸️ DEFERRED

- [ ] **Problem:** Thinking/reasoning summaries are not shown while streaming
- [ ] **Complexity:** This is a significant refactor - leaving for later

**OpenRouter Streaming Info:**
- Uses `stream=True` with the OpenAI SDK
- Reasoning arrives in `choices[].delta.reasoning_details` chunks
- Types: `reasoning.summary`, `reasoning.text`, `reasoning.encrypted`
- Tool call arguments stream as partial JSON (need accumulation)
- Items paradigm: the same ID is emitted multiple times with updated content

**Key Challenges:**
- Tool call JSON accumulation (partial `{"query": "wea` → `{"query": "weather"}`)
- Multiple concurrent outputs (thinking + tool calls + text simultaneously)
- State management for partial responses
- Error handling if the connection drops mid-stream
- Deciding when tool calls are "complete" enough to execute

**UX Questions to Resolve:**
- Show raw thinking text or summarized?
- Live expanding text vs. spinner replacement?
- Markdown rendering while streaming?
- How to handle thinking + tool call display simultaneously?

**Implementation Options:**
- New `run_conversation_streaming()` method (keep non-streaming as a fallback)
- Wrapper that handles streaming internally
- Big refactor of the existing `run_conversation()`

**References:**
- https://openrouter.ai/docs/api/reference/streaming
- https://openrouter.ai/docs/guides/best-practices/reasoning-tokens#streaming-response

---

## 1. Subagent Architecture (Context Isolation) 🎯

**Problem:** Long-running tools (terminal commands, browser automation, complex file operations) consume massive context. A single `ls -la` can add hundreds of lines. Browser snapshots, debugging sessions, and iterative terminal work quickly bloat the main conversation, leaving less room for actual reasoning.

**Solution:** The main agent becomes an **orchestrator** that delegates context-heavy tasks to **subagents**.
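The tool-call JSON accumulation challenge noted under "Stream Thinking Summaries" above could be handled roughly like this — a simplified stand-in where each stream delta is reduced to `(index, name_or_None, argument_fragment)`:

```python
import json
from collections import defaultdict


def accumulate_tool_calls(chunks):
    """Accumulate partial tool-call argument strings from stream deltas.

    Sketch only: real `delta.tool_calls` objects carry more fields, and
    the arguments JSON is only guaranteed parseable at end-of-stream,
    which is why parsing happens once at the end.
    """
    names = {}
    buffers = defaultdict(str)
    for index, name, fragment in chunks:
        if name:
            names[index] = name
        if fragment:
            buffers[index] += fragment  # partial JSON, e.g. '{"query": "wea'
    # End of stream: buffers now hold complete JSON per tool call.
    return {i: (names.get(i), json.loads(buf)) for i, buf in buffers.items()}
```

"Complete enough to execute" then reduces to: the stream has finished (or the model emitted a finish reason for that call) and `json.loads` succeeds.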
**Architecture:**

```
┌──────────────────────────────────────────────────────────────┐
│                  ORCHESTRATOR (main agent)                   │
│  - Receives user request                                     │
│  - Plans approach                                            │
│  - Delegates heavy tasks to subagents                        │
│  - Receives summarized results                               │
│  - Maintains clean, focused context                          │
└──────────────────────────────────────────────────────────────┘
         │                    │                    │
         ▼                    ▼                    ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│ TERMINAL AGENT  │  │ BROWSER AGENT   │  │ CODE AGENT      │
│ - terminal tool │  │ - browser tools │  │ - file tools    │
│ - file tools    │  │ - web_search    │  │ - terminal      │
│                 │  │ - web_extract   │  │                 │
│ Isolated context│  │ Isolated context│  │ Isolated context│
│ Returns summary │  │ Returns summary │  │ Returns summary │
└─────────────────┘  └─────────────────┘  └─────────────────┘
```

**How it works:**

1. User asks: "Set up a new Python project with FastAPI and tests"
2. Orchestrator plans: "I need to create files, install deps, write code"
3. Orchestrator calls: `terminal_task(goal="Create venv, install fastapi pytest", context="New project in ~/myapp")`
4. **Subagent spawns** with fresh context and only terminal/file tools
5. Subagent iterates (may take 10+ tool calls, lots of output)
6. Subagent completes → returns summary: "Created venv, installed fastapi==0.109.0, pytest==8.0.0"
7. Orchestrator receives **only the summary** - its context stays clean
8. Orchestrator continues with the next subtask

**Key tools to implement:**
- [ ] `terminal_task(goal, context, cwd?)` - Delegate terminal/shell work
- [ ] `browser_task(goal, context, start_url?)` - Delegate web research/automation
- [ ] `code_task(goal, context, files?)` - Delegate code writing/modification
- [ ] Generic `delegate_task(goal, context, toolsets=[])` - Flexible delegation

**Implementation details:**
- [ ] Subagent uses the same `run_agent.py` but with:
  - Fresh/empty conversation history
  - Limited toolset (only what's needed)
  - Smaller max_iterations (focused task)
  - Task-specific system prompt
- [ ] Subagent returns a structured result:
  ```python
  {
      "success": True,
      "summary": "Installed 3 packages, created 2 files",
      "details": "Optional longer explanation if needed",
      "artifacts": ["~/myapp/requirements.txt", "~/myapp/main.py"],  # Files created
      "errors": []  # Any issues encountered
  }
  ```
- [ ] Orchestrator sees only the summary in its context
- [ ] Full subagent transcript saved separately for debugging

**Benefits:**
- 🧹 **Clean context** - Orchestrator stays focused, doesn't drown in tool output
- 📊 **Better token efficiency** - 50 terminal outputs → 1 summary paragraph
- 🎯 **Focused subagents** - Each agent has just the tools it needs
- 🔄 **Parallel potential** - Independent subtasks could run concurrently
- 🐛 **Easier debugging** - Each subtask has its own isolated transcript

**When to use subagents vs direct tools:**
- **Subagent**: Multi-step tasks, iteration likely, lots of output expected
- **Direct**: Quick one-off commands, simple file reads, user needs to see the output

**Files to modify:** `run_agent.py` (add orchestration mode), new `tools/delegate_tools.py`, new `subagent_runner.py`

---

## 2. Planning & Task Management 📋

**Problem:** The agent handles tasks reactively, without explicit planning. Complex multi-step tasks lack structure, progress tracking, and the ability to decompose work into manageable chunks.
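The structured result from Section 1 maps naturally onto a small dataclass — a sketch whose field names mirror the dict above; the `to_orchestrator_message` helper is hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class SubagentResult:
    """What a subagent hands back to the orchestrator.

    Only `summary` enters the orchestrator's context; the full
    transcript is saved separately for debugging.
    """
    success: bool
    summary: str
    details: str = ""
    artifacts: list = field(default_factory=list)  # e.g. files created
    errors: list = field(default_factory=list)

    def to_orchestrator_message(self) -> str:
        """The one-line form that actually lands in the main context."""
        status = "done" if self.success else "FAILED"
        return f"[subagent {status}] {self.summary}"
```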
**Ideas:**
- [ ] **Task decomposition tool** - Break complex requests into subtasks:
  ```
  User: "Set up a new Python project with FastAPI, tests, and Docker"
  Agent creates plan:
  ├── 1. Create project structure and requirements.txt
  ├── 2. Implement FastAPI app skeleton
  ├── 3. Add pytest configuration and initial tests
  ├── 4. Create Dockerfile and docker-compose.yml
  └── 5. Verify everything works together
  ```
  - Each subtask becomes a trackable unit
  - Agent can report progress: "Completed 3/5 tasks"
- [ ] **Progress checkpoints** - Periodic self-assessment:
  - After N tool calls or time elapsed, pause to evaluate
  - "What have I accomplished? What remains? Am I on track?"
  - Detect if stuck in loops or making no progress
  - Could trigger replanning if the approach isn't working
- [ ] **Explicit plan storage** - Persist the plan in the conversation:
  - Store as structured data (not just in context)
  - Update status as tasks complete
  - User can ask "What's the plan?" or "What's left?"
  - Survives context compression (plans are protected)
- [ ] **Failure recovery with replanning** - When things go wrong:
  - Record what failed and why
  - Revise the plan to work around the issue
  - "Step 3 failed because X, adjusting approach to Y"
  - Prevents repeating failed strategies

**Files to modify:** `run_agent.py` (add planning hooks), new `tools/planning_tool.py`

---

## 3. Tool Composition & Learning 🔧

**Problem:** Tools are atomic. Complex tasks require repeated manual orchestration of the same tool sequences.
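The "explicit plan storage" idea from Section 2 could start as simply as this — a sketch, not the planned `tools/planning_tool.py` API:

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    description: str
    status: str = "pending"  # pending | in_progress | done | failed


@dataclass
class Plan:
    """Structured plan that lives outside the raw conversation text,
    so it can be updated in place and protected from compression."""
    goal: str
    subtasks: list = field(default_factory=list)

    def progress(self) -> str:
        """Answer 'What's left?' in the exact phrasing used above."""
        done = sum(1 for t in self.subtasks if t.status == "done")
        return f"Completed {done}/{len(self.subtasks)} tasks"
```

Serializing this to JSON alongside the session log would let plans survive both compression and restarts.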
**Ideas:**
- [ ] **Macro tools / Tool chains** - Define reusable tool sequences:
  ```yaml
  research_topic:
    description: "Deep research on a topic"
    steps:
      - web_search: {query: "$topic"}
      - web_extract: {urls: "$search_results.urls[:3]"}
      - summarize: {content: "$extracted"}
  ```
  - Could be defined in skills or a new `macros/` directory
  - Agent can invoke a macro as a single tool call
- [ ] **Tool failure patterns** - Learn from failures:
  - Track: tool, input pattern, error type, what worked instead
  - Before calling a tool, check: "Has this pattern failed before?"
  - Persistent across sessions (stored in skills or a separate DB)
- [ ] **Parallel tool execution** - When tools are independent, run them concurrently:
  - Detect independence (no data dependencies between calls)
  - Use `asyncio.gather()` for parallel execution
  - Some tools already have async support; only the orchestration is missing

**Files to modify:** `model_tools.py`, `toolsets.py`, new `tool_macros.py`

---

## 4. Dynamic Skills Expansion 📚

**Problem:** The skills system is elegant but static. Skills must be manually created and added.

**Ideas:**
- [ ] **Skill acquisition from successful tasks** - After completing a complex task:
  - "This approach worked well. Save as a skill?"
  - Extract: goal, steps taken, tools used, key decisions
  - Generate SKILL.md automatically
  - Store in the user's skills directory
- [ ] **Skill templates** - Common patterns that can be parameterized:
  ```markdown
  # Debug {language} Error
  1. Reproduce the error
  2. Search for the error message: `web_search("{error_message} {language}")`
  3. Check common causes: {common_causes}
  4. Apply fix and verify
  ```
- [ ] **Skill chaining** - Combine skills for complex workflows:
  - Skills can reference other skills as dependencies
  - "To do X, first apply skill Y, then skill Z"
  - Directed graph of skill dependencies

**Files to modify:** `tools/skills_tool.py`, `skills/` directory structure, new `skill_generator.py`
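The skill-chaining idea above ("directed graph of skill dependencies") maps directly onto a topological sort; a sketch using only the standard library:

```python
from graphlib import TopologicalSorter


def skill_order(dependencies):
    """Resolve the order in which chained skills should be applied.

    `dependencies` maps skill name -> set of skills it depends on
    ("to do X, first apply Y, then Z"). Dependencies come first in the
    returned order; graphlib raises CycleError on circular chains,
    which doubles as validation for skill graphs.
    """
    return list(TopologicalSorter(dependencies).static_order())
```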
---

## 5. Interactive Clarifying Questions Tool ❓

**Problem:** The agent sometimes makes assumptions or guesses when it should ask the user. Currently it can only ask via text, which gets lost in long outputs.

**Ideas:**
- [ ] **Multiple-choice prompt tool** - Let the agent present structured choices to the user:
  ```
  ask_user_choice(
      question="Should the language switcher enable only German or all languages?",
      choices=[
          "Only enable German - works immediately",
          "Enable all, mark untranslated - show fallback notice",
          "Let me specify something else"
      ]
  )
  ```
  - Renders as an interactive terminal UI with arrow key / Tab navigation
  - User selects an option; the result is returned to the agent
  - Up to 4 choices + an optional free-text option
- [ ] **Implementation:**
  - Use the `inquirer` or `questionary` Python library for rich terminal prompts
  - Tool returns the selected option text (or the user's custom input)
  - **CLI-only** - only works when running via `cli.py` (not API/programmatic use)
  - Graceful fallback: if not in interactive mode, return an error asking the agent to rephrase as text
- [ ] **Use cases:**
  - Clarify ambiguous requirements before starting work
  - Confirm destructive operations with clear options
  - Let the user choose between implementation approaches
  - Checkpoint complex multi-step workflows

**Files to modify:** New `tools/ask_user_tool.py`, `cli.py` (detect interactive mode), `model_tools.py`

---

## 6. Collaborative Problem Solving 🤝

**Problem:** Interaction is command/response. Complex problems benefit from dialogue.

**Ideas:**
- [ ] **Assumption surfacing** - Make implicit assumptions explicit:
  - "I'm assuming you want Python 3.11+. Correct?"
  - "This solution assumes you have sudo access..."
  - Let the user correct course before going down the wrong path
- [ ] **Checkpoint & confirm** - For high-stakes operations:
  - "About to delete 47 files. Here's the list - proceed?"
  - "This will modify your database. Want a backup first?"
  - Configurable threshold for when to ask

**Files to modify:** `run_agent.py`, system prompt configuration

---

## 7. Project-Local Context 💾

**Problem:** Valuable context is lost between sessions.

**Ideas:**
- [ ] **Project awareness** - Remember project-specific context:
  - Store `.hermes/context.md` in the project directory
  - "This is a Django project using PostgreSQL"
  - Coding style preferences, deployment setup, etc.
  - Load automatically when working in that directory
- [ ] **Handoff notes** - Leave notes for future sessions:
  - Write to `.hermes/notes.md` in the project
  - "TODO for next session: finish implementing X"
  - "Known issues: Y doesn't work on Windows"

**Files to modify:** New `project_context.py`, auto-load in `run_agent.py`

---

## 8. Graceful Degradation & Robustness 🛡️

**Problem:** When things go wrong, recovery is limited. The agent should fail gracefully.

**Ideas:**
- [ ] **Fallback chains** - When the primary approach fails, have backups:
  - `web_extract` fails → try `browser_navigate` → try `web_search` for a cached version
  - Define a fallback order per tool type
- [ ] **Partial progress preservation** - Don't lose work on failure:
  - Long task fails midway → save what we've got
  - "I completed 3/5 steps before the error. Here's what I have..."
- [ ] **Self-healing** - Detect and recover from bad states:
  - Browser stuck → close and retry
  - Terminal hung → timeout and reset

**Files to modify:** `model_tools.py`, tool implementations, new `fallback_manager.py`
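The fallback-chain idea above could start this simply — a sketch; the planned `fallback_manager.py` would presumably also record which fallback succeeded, to feed the "tool failure patterns" idea in Section 3:

```python
def run_with_fallbacks(task_input, chain):
    """Try each (name, fn) pair in order until one succeeds.

    Sketch: `chain` might be
    [("web_extract", ...), ("browser_navigate", ...), ("web_search", ...)].
    Returns (winning_tool_name, result); raises only if every
    fallback fails, with all intermediate errors preserved.
    """
    errors = []
    for name, fn in chain:
        try:
            return name, fn(task_input)
        except Exception as exc:  # real code would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all fallbacks failed: " + "; ".join(errors))
```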
---

## 9. Tools & Skills Wishlist 🧰

*Things that would need new tool implementations (can't be done well with current tools):*

### High-Impact
- [ ] **Audio/Video Transcription** 🎬 *(See also: Section 16 for detailed spec)*
  - Transcribe audio files, podcasts, YouTube videos
  - Extract key moments from video
  - Voice memo transcription for messaging integrations
  - *Provider options: Whisper API, Deepgram, local Whisper*
- [ ] **Diagram Rendering** 📊
  - Render Mermaid/PlantUML to actual images
  - The agent can generate the code, but rendering requires an external service or tool
  - "Show me how these components connect" → an actual visual diagram

### Medium-Impact
- [ ] **Canvas / Visual Workspace** 🖼️
  - Agent-controlled visual panel for rendering interactive UI
  - Inspired by OpenClaw's Canvas feature
  - **Capabilities:**
    - `present` / `hide` - Show/hide the canvas panel
    - `navigate` - Load HTML files or URLs into the canvas
    - `eval` - Execute JavaScript in the canvas context
    - `snapshot` - Capture the rendered UI as an image
  - **Use cases:**
    - Display generated HTML/CSS/JS previews
    - Show interactive data visualizations (charts, graphs)
    - Render diagrams (Mermaid → rendered output)
    - Present structured information in a rich format
    - A2UI-style component system for structured agent UI
  - **Implementation options:**
    - Electron-based panel for the CLI
    - WebSocket-connected web app
    - VS Code webview extension
  - *Would let the agent "show" things rather than just describe them*
- [ ] **Document Generation** 📄
  - Create styled PDFs, Word docs, presentations
  - *Basic PDF generation is possible via terminal tools, but limited*
- [ ] **Diff/Patch Tool** 📝
  - Surgical code modifications with preview
  - "Change lines 45-50 to X" without rewriting the whole file
  - Show diffs before applying
  - *Could use `diff`/`patch`, but a native tool would be safer*

### Skills to Create
- [ ] **Domain-specific skill packs:**
  - DevOps/Infrastructure (Terraform, K8s, AWS)
  - Data Science workflows (EDA, model training)
  - Security/pentesting procedures
- [ ] **Framework-specific skills:**
  - React/Vue/Angular patterns
  - Django/Rails/Express conventions
  - Database optimization playbooks
- [ ] **Troubleshooting flowcharts:**
  - "Docker container won't start" → decision tree
  - "Production is slow" → systematic diagnosis

---

## 10. Messaging Platform Integrations 💬 ✅ COMPLETE

**Problem:** The agent currently only works via `cli.py`, which requires direct terminal access. Users may want to interact via messaging apps from their phone or other devices.

**Architecture:**
- `run_agent.py` already accepts a `conversation_history` parameter and returns updated messages ✅
- Needed: persistent session storage, platform monitors, session key resolution

**Implementation approach:**
```
┌──────────────────────────────────────────────────────────────┐
│ Platform Monitor (e.g., telegram_monitor.py)                 │
│  ├─ Long-running daemon connecting to messaging platform     │
│  ├─ On message: resolve session key → load history from disk │
│  ├─ Call run_agent.py with the loaded history                │
│  ├─ Save updated history back to disk (JSONL)                │
│  └─ Send the response back to the platform                   │
└──────────────────────────────────────────────────────────────┘
```

**Platform support (each user sets up their own credentials):**
- [x] **Telegram** - via `python-telegram-bot`
  - Bot token from @BotFather
  - Easiest to set up, good for personal use
- [x] **Discord** - via `discord.py`
  - Bot token from the Discord Developer Portal
  - Can work in servers (group sessions) or DMs
- [x] **WhatsApp** - via a Node.js bridge (whatsapp-web.js/baileys)
  - Requires a Node.js bridge setup
  - More complex, but reaches the most people

**Session management:**
- [x] **Session store** - JSONL persistence per session key
  - `~/.hermes/sessions/{session_id}.jsonl`
  - Session keys: `agent:main:telegram:dm`, `agent:main:discord:group:123`, etc.
- [x] **Session expiry** - Configurable reset policies
  - Daily reset (default 4am) OR idle timeout (default 2 hours)
  - Manual reset via the `/reset` or `/new` command in chat
  - Per-platform and per-type overrides
- [x] **Session continuity** - Conversations persist across messages until reset

**Files created:** `gateway/`, `gateway/platforms/`, `gateway/config.py`, `gateway/session.py`, `gateway/delivery.py`, `gateway/run.py`

**Configuration:**
- Environment variables: `TELEGRAM_BOT_TOKEN`, `DISCORD_BOT_TOKEN`, etc.
- Config file: `~/.hermes/gateway.json`
- CLI commands: `/platforms` to check status, `--gateway` to start

**Dynamic context injection:**
- The agent knows its source platform and chat
- The agent knows the connected platforms and home channels
- The agent can deliver cron outputs to specific platforms

---

## 11. Scheduled Tasks / Cron Jobs ⏰ ✅ COMPLETE

**Problem:** The agent only runs on-demand. Some tasks benefit from scheduled execution (daily summaries, monitoring, reminders).

**Solution Implemented:**
- [x] **Cron-style scheduler** - Run agent turns on a schedule
  - Jobs stored in `~/.hermes/cron/jobs.json`
  - Each job: `{ id, name, prompt, schedule, repeat, enabled, next_run_at, ... }`
  - Built-in scheduler daemon or system cron integration
- [x] **Schedule formats:**
  - Duration: `30m`, `2h`, `1d` (one-shot delay)
  - Interval: `every 30m`, `every 2h` (recurring)
  - Cron expression: `0 9 * * *` (requires the `croniter` package)
  - ISO timestamp: `2026-02-03T14:00:00` (one-shot at a specific time)
- [x] **Repeat options:**
  - `repeat=None` (or omitted): one-shot schedules run once; intervals/cron run forever
  - `repeat=1`: run once, then auto-delete
  - `repeat=N`: run exactly N times, then auto-delete
- [x] **CLI interface:**
  ```bash
  # List scheduled jobs
  /cron
  /cron list

  # Add a one-shot job (runs once in 30 minutes)
  /cron add 30m "Remind me to check the build status"

  # Add a recurring job (every 2 hours)
  /cron add "every 2h" "Check server status at 192.168.1.100"

  # Add a cron expression (daily at 9am)
  /cron add "0 9 * * *" "Generate morning briefing"

  # Remove a job
  /cron remove
  ```
- [x] **Agent self-scheduling tools** (hermes-cli toolset):
  - `schedule_cronjob(prompt, schedule, name?, repeat?)` - Create a scheduled task
  - `list_cronjobs()` - View all scheduled jobs
  - `remove_cronjob(job_id)` - Cancel a job
  - Tool descriptions emphasize: **cronjobs run in isolated sessions with NO context**
- [x] **Daemon modes:**
  ```bash
  # Built-in daemon (checks every 60 seconds)
  python cli.py --cron-daemon

  # Single tick for system cron integration
  python cli.py --cron-tick-once
  ```
- [x] **Output storage:** `~/.hermes/cron/output/{job_id}/{timestamp}.md`

**Files created:** `cron/__init__.py`, `cron/jobs.py`, `cron/scheduler.py`, `tools/cronjob_tools.py`

**Toolset:** `hermes-cli` (the CLI default) includes the cronjob tools; not in batch runner toolsets

---

## 12. Text-to-Speech (TTS) 🔊

**Problem:** The agent can only respond with text. Some users prefer audio responses (accessibility, hands-free use, podcasts).
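The schedule formats from Section 11 could be parsed roughly like this — a sketch covering the duration/interval/ISO cases; cron expressions are left to `croniter` as noted above, and the function name is illustrative:

```python
import re
from datetime import datetime, timedelta


def parse_schedule(spec, now=None):
    """Parse a schedule spec into (next_run, recurring).

    Handles `30m` / `2h` / `1d` one-shot delays, `every 30m` style
    intervals, and ISO timestamps. Cron expressions (`0 9 * * *`)
    are intentionally not handled here.
    """
    now = now or datetime.now()
    units = {"m": "minutes", "h": "hours", "d": "days"}
    recurring = spec.startswith("every ")
    body = spec.removeprefix("every ")
    match = re.fullmatch(r"(\d+)([mhd])", body)
    if match:
        delta = timedelta(**{units[match.group(2)]: int(match.group(1))})
        return now + delta, recurring
    return datetime.fromisoformat(spec), False  # one-shot at a fixed time
```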
**Ideas:**
- [ ] **TTS tool** - Generate audio files from text
  ```python
  tts_generate(text="Here's your summary...", voice="nova", output="summary.mp3")
  ```
  - Returns the path to the generated audio file
  - For messaging integrations: can send as a voice message
- [ ] **Provider options:**
  - Edge TTS (free, good quality, many voices)
  - OpenAI TTS (paid, excellent quality)
  - ElevenLabs (paid, best quality, voice cloning)
  - Local options (Coqui TTS, Bark)
- [ ] **Modes:**
  - On-demand: the user explicitly asks "read this to me"
  - Auto-TTS: configurable to always generate audio for responses
  - Long-text handling: summarize or chunk very long responses
- [ ] **Integration with messaging:**
  - When enabled, can send voice notes instead of/alongside text
  - User preference per channel

**Files to create:** `tools/tts_tool.py`, config in `cli-config.yaml`

---

## 13. Speech-to-Text / Audio Transcription 🎤

**Problem:** Users may want to send voice memos instead of typing. The agent is blind to audio content.

**Ideas:**
- [ ] **Voice memo transcription** - For messaging integrations
  - User sends a voice message → transcribe → process as text
  - Seamless: the user speaks, the agent responds
- [ ] **Audio/video file transcription** - Existing idea, expanded:
  - Transcribe local audio files (mp3, wav, m4a)
  - Transcribe YouTube videos (download audio → transcribe)
  - Extract key moments with timestamps
- [ ] **Provider options:**
  - OpenAI Whisper API (good quality, cheap)
  - Deepgram (fast, good for real-time)
  - Local Whisper (free, runs on GPU)
  - Groq Whisper (fast, free tier available)
- [ ] **Tool interface:**
  ```python
  transcribe(source="audio.mp3")                  # Local file
  transcribe(source="https://youtube.com/...")    # YouTube
  transcribe(source="voice_message", data=bytes)  # Voice memo
  ```

**Files to create:** `tools/transcribe_tool.py`, integrate with the messaging monitors

---

## Priority Order (Suggested)

1. **🎯 Subagent Architecture** - Critical for context management, enables everything else
2. **Memory & Context Management** - Complements subagents for the remaining context
3. **Self-Reflection** - Improves reliability and reduces wasted tool calls
4. **Project-Local Context** - Practical win, keeps useful info across sessions
5. **Messaging Integrations** - Unlocks mobile access and new interaction patterns
6. **Scheduled Tasks / Cron Jobs** - Enables automation, reminders, monitoring
7. **Tool Composition** - Quality of life, builds on other improvements
8. **Dynamic Skills** - Force multiplier for repeated tasks
9. **Interactive Clarifying Questions** - Better UX for ambiguous tasks
10. **TTS / Audio Transcription** - Accessibility, hands-free use

---

## Removed Items (Unrealistic)

The following were removed because they're architecturally impossible:
- ~~Proactive suggestions / Prefetching~~ - The agent only runs on user request; it can't interject
- ~~Clipboard integration~~ - No access to the user's local system clipboard

The following **moved to active TODO** (now possible with the new architecture):
- ~~Session save/restore~~ → See **Messaging Integrations** (session persistence)
- ~~Voice/TTS playback~~ → See **TTS** (can generate audio files, send via messaging)
- ~~Set reminders~~ → See **Scheduled Tasks / Cron Jobs**

The following were removed because they're **already possible**:
- ~~HTTP/API Client~~ → Use `curl` or Python `requests` in the terminal
- ~~Structured Data Manipulation~~ → Use `pandas` in the terminal
- ~~Git-Native Operations~~ → Use the `git` CLI in the terminal
- ~~Symbolic Math~~ → Use `SymPy` in the terminal
- ~~Code Quality Tools~~ → Run linters (`eslint`, `black`, `mypy`) in the terminal
- ~~Testing Framework~~ → Run `pytest`, `jest`, etc. in the terminal
- ~~Translation~~ → The LLM handles this fine, or use translation APIs

---

## 🧪 Brainstorm Ideas (Not Yet Fleshed Out)

*These are early-stage ideas that need more thinking before implementation. Captured here so they don't get lost.*

### Remote/Distributed Execution 🌐

**Concept:** Run the agent on a powerful remote server while interacting from a thin client.

**Why interesting:**
- Run on a beefy GPU server for local LLM inference
- The agent has access to the remote machine's resources (files, tools, internet)
- The user interacts via a lightweight client (phone, low-power laptop)

**Open questions:**
- How does this differ from just SSH-ing in and running cli.py on the remote machine?
- Would need a secure communication channel (WebSocket? gRPC?)
- How to handle tool outputs that reference remote paths?
- Credential management for remote execution
- Latency considerations for interactive use

**Possible architecture:**
```
┌─────────────┐         ┌─────────────────────────┐
│ Thin Client │ ◄─────► │ Remote Hermes Server    │
│ (phone/web) │  WS/API │ - Full agent + tools    │
└─────────────┘         │ - GPU for local LLM     │
                        │ - Access to server files│
                        └─────────────────────────┘
```

**Related to:** Messaging integrations (could be the "server" that the monitors receive from)

---

### Multi-Agent Parallel Execution 🤖🤖

**Concept:** An extension of the Subagent Architecture (Section 1) - run multiple subagents in parallel.

**Why interesting:**
- Independent subtasks don't need to wait for each other
- "Research X while setting up Y" - both run simultaneously
- Faster completion for complex multi-part tasks

**Open questions:**
- How to detect which tasks are truly independent?
- Resource management (API rate limits, concurrent connections)
- How to merge results when parallel tasks have conflicts?
- Cost implications of multiple parallel LLM calls

*Note: Basic subagent delegation (Section 1) should be implemented first; parallel execution is an optimization on top.*

---

### Plugin/Extension System 🔌

**Concept:** Allow users to add custom tools/skills without modifying core code.

**Why interesting:**
- Community contributions
- Organization-specific tools
- Clean separation of core vs. extensions

**Open questions:**
- Security implications of loading arbitrary code
- Versioning and compatibility
- Discovery and installation UX

---

*Last updated: $(date +%Y-%m-%d)* 🤖