* docs: deep audit — fix stale config keys, missing commands, and registry drift Cross-checked ~80 high-impact docs pages (getting-started, reference, top-level user-guide, user-guide/features) against the live registries: hermes_cli/commands.py COMMAND_REGISTRY (slash commands) hermes_cli/auth.py PROVIDER_REGISTRY (providers) hermes_cli/config.py DEFAULT_CONFIG (config keys) toolsets.py TOOLSETS (toolsets) tools/registry.py get_all_tool_names() (tools) python -m hermes_cli.main <subcmd> --help (CLI args) reference/ - cli-commands.md: drop duplicate hermes fallback row + duplicate section, add stepfun/lmstudio to --provider enum, expand auth/mcp/curator subcommand lists to match --help output (status/logout/spotify, login, archive/prune/ list-archived). - slash-commands.md: add missing /sessions and /reload-skills entries + correct the cross-platform Notes line. - tools-reference.md: drop bogus '68 tools' headline, drop fictional 'browser-cdp toolset' (these tools live in 'browser' and are runtime-gated), add missing 'kanban' and 'video' toolset sections, fix MCP example to use the real mcp_<server>_<tool> prefix. - toolsets-reference.md: list browser_cdp/browser_dialog inside the 'browser' row, add missing 'kanban' and 'video' toolset rows, drop the stale '38 tools' count for hermes-cli. - profile-commands.md: add missing install/update/info subcommands, document fish completion. - environment-variables.md: dedupe GMI_API_KEY/GMI_BASE_URL rows (kept the one with the correct gmi-serving.com default). - faq.md: Anthropic/Google/OpenAI examples — direct providers exist (not just via OpenRouter), refresh the OpenAI model list. getting-started/ - installation.md: PortableGit (not MinGit) is what the Windows installer fetches; document the 32-bit MinGit fallback. - installation.md / termux.md: installer prefers .[termux-all] then falls back to .[termux]. - nix-setup.md: Python 3.12 (not 3.11), Node.js 22 (not 20); fix invalid 'nix flake update --flake' invocation. - updating.md: 'hermes backup restore --state pre-update' doesn't exist — point at the snapshot/quick-snapshot flow; correct config key 'updates.pre_update_backup' (was 'update.backup'). user-guide/ - configuration.md: api_max_retries default 3 (not 2); display.runtime_footer is the real key (not display.runtime_metadata_footer); checkpoints defaults enabled=false / max_snapshots=20 (not true / 50). - configuring-models.md: 'hermes model list' / 'hermes model set ...' don't exist — hermes model is interactive only. - tui.md: busy_indicator -> tui_status_indicator with values kaomoji|emoji|unicode|ascii (not kawaii|minimal|dots|wings|none). - security.md: SSH backend keys (TERMINAL_SSH_HOST/USER/KEY) live in .env, not config.yaml. - windows-wsl-quickstart.md: there is no 'hermes api' subcommand — the OpenAI-compatible API server runs inside hermes gateway. user-guide/features/ - computer-use.md: approvals.mode (not security.approval_level); fix broken ./browser-use.md link to ./browser.md. - fallback-providers.md: top-level fallback_providers (not model.fallback_providers); the picker is subcommand-based, not modal. - api-server.md: API_SERVER_* are env vars — write to per-profile .env, not 'hermes config set' which targets YAML. - web-search.md: drop web_crawl as a registered tool (it isn't); deep-crawl modes are exposed through web_extract. - kanban.md: failure_limit default is 2, not '~5'. - plugins.md: drop hard-coded '33 providers' count. - honcho.md: fix unclosed quote in echo HONCHO_API_KEY snippet; document that 'hermes honcho' subcommand is gated on memory.provider=honcho; reconcile subcommand list with actual --help output. - memory-providers.md: legacy 'hermes honcho setup' redirect documented. Verified via 'npm run build' — site builds cleanly; broken-link count went from 149 to 146 (no regressions, fixed a few in passing). * docs: round 2 audit fixes + regenerate skill catalogs Follow-up to the previous commit on this branch: Round 2 manual fixes: - quickstart.md: KIMI_CODING_API_KEY mentioned alongside KIMI_API_KEY; voice-mode and ACP install commands rewritten — bare 'pip install ...' doesn't work for curl-installed setups (no pip on PATH, not in repo dir); replaced with 'cd ~/.hermes/hermes-agent && uv pip install -e ".[voice]"'. ACP already ships in [all] so the curl install includes it. - cli.md / configuration.md: 'auxiliary.compression.model' shown as 'google/gemini-3-flash-preview' (the doc's own claimed default); actual default is empty (= use main model). Reworded as 'leave empty (default) or pin a cheap model'. - built-in-plugins.md: added the bundled 'kanban/dashboard' plugin row that was missing from the table. Regenerated skill catalogs: - ran website/scripts/generate-skill-docs.py to refresh all 163 per-skill pages and both reference catalogs (skills-catalog.md, optional-skills-catalog.md). This adds the entries that were genuinely missing — productivity/teams-meeting-pipeline (bundled), optional/finance/* (entire category — 7 skills: 3-statement-model, comps-analysis, dcf-model, excel-author, lbo-model, merger-model, pptx-author), creative/hyperframes, creative/kanban-video-orchestrator, devops/watchers, productivity/shop-app, research/searxng-search, apple/macos-computer-use — and rewrites every other per-skill page from the current SKILL.md. Most diffs are tiny (one line of refreshed metadata). Validation: - 'npm run build' succeeded. - Broken-link count moved 146 -> 155 — the +9 are zh-Hans translation shells that lag every newly-added skill page (pre-existing pattern). No regressions on any en/ page.
16 KiB
| title | sidebar_label | description |
|---|---|---|
| Dspy — DSPy: declarative LM programs, auto-optimize prompts, RAG | Dspy | DSPy: declarative LM programs, auto-optimize prompts, RAG |
{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}
Dspy
DSPy: declarative LM programs, auto-optimize prompts, RAG.
Skill metadata
| Source | Bundled (installed by default) |
| Path | skills/mlops/research/dspy |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | dspy, openai, anthropic |
| Platforms | linux, macos, windows |
| Tags | Prompt Engineering, DSPy, Declarative Programming, RAG, Agents, Prompt Optimization, LM Programming, Stanford NLP, Automatic Optimization, Modular AI |
Reference: full SKILL.md
:::info The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active. :::
DSPy: Declarative Language Model Programming
When to Use This Skill
Use DSPy when you need to:
- Build complex AI systems with multiple components and workflows
- Program LMs declaratively instead of manual prompt engineering
- Optimize prompts automatically using data-driven methods
- Create modular AI pipelines that are maintainable and portable
- Improve model outputs systematically with optimizers
- Build RAG systems, agents, or classifiers with better reliability
GitHub Stars: 22,000+ | Created By: Stanford NLP
Installation
# Stable release
pip install dspy
# Latest development version
pip install git+https://github.com/stanfordnlp/dspy.git
# With specific LM providers
pip install dspy[openai] # OpenAI
pip install dspy[anthropic] # Anthropic Claude
pip install dspy[all] # All providers
Quick Start
Basic Example: Question Answering
import dspy
# Configure your language model
lm = dspy.Claude(model="claude-sonnet-4-5-20250929")
dspy.settings.configure(lm=lm)
# Define a signature (input → output)
class QA(dspy.Signature):
"""Answer questions with short factual answers."""
question = dspy.InputField()
answer = dspy.OutputField(desc="often between 1 and 5 words")
# Create a module
qa = dspy.Predict(QA)
# Use it
response = qa(question="What is the capital of France?")
print(response.answer) # "Paris"
Chain of Thought Reasoning
import dspy
lm = dspy.Claude(model="claude-sonnet-4-5-20250929")
dspy.settings.configure(lm=lm)
# Use ChainOfThought for better reasoning
class MathProblem(dspy.Signature):
"""Solve math word problems."""
problem = dspy.InputField()
answer = dspy.OutputField(desc="numerical answer")
# ChainOfThought generates reasoning steps automatically
cot = dspy.ChainOfThought(MathProblem)
response = cot(problem="If John has 5 apples and gives 2 to Mary, how many does he have?")
print(response.rationale) # Shows reasoning steps
print(response.answer) # "3"
Core Concepts
1. Signatures
Signatures define the structure of your AI task (inputs → outputs):
# Inline signature (simple)
qa = dspy.Predict("question -> answer")
# Class signature (detailed)
class Summarize(dspy.Signature):
"""Summarize text into key points."""
text = dspy.InputField()
summary = dspy.OutputField(desc="bullet points, 3-5 items")
summarizer = dspy.ChainOfThought(Summarize)
When to use each:
- Inline: Quick prototyping, simple tasks
- Class: Complex tasks, type hints, better documentation
2. Modules
Modules are reusable components that transform inputs to outputs:
dspy.Predict
Basic prediction module:
predictor = dspy.Predict("context, question -> answer")
result = predictor(context="Paris is the capital of France",
question="What is the capital?")
dspy.ChainOfThought
Generates reasoning steps before answering:
cot = dspy.ChainOfThought("question -> answer")
result = cot(question="Why is the sky blue?")
print(result.rationale) # Reasoning steps
print(result.answer) # Final answer
dspy.ReAct
Agent-like reasoning with tools:
from dspy.predict import ReAct
class SearchQA(dspy.Signature):
"""Answer questions using search."""
question = dspy.InputField()
answer = dspy.OutputField()
def search_tool(query: str) -> str:
"""Search Wikipedia."""
# Your search implementation
return results
react = ReAct(SearchQA, tools=[search_tool])
result = react(question="When was Python created?")
dspy.ProgramOfThought
Generates and executes code for reasoning:
pot = dspy.ProgramOfThought("question -> answer")
result = pot(question="What is 15% of 240?")
# Generates: answer = 240 * 0.15
3. Optimizers
Optimizers improve your modules automatically using training data:
BootstrapFewShot
Learns from examples:
from dspy.teleprompt import BootstrapFewShot
# Training data
trainset = [
dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
dspy.Example(question="What is 3+5?", answer="8").with_inputs("question"),
]
# Define metric
def validate_answer(example, pred, trace=None):
return example.answer == pred.answer
# Optimize
optimizer = BootstrapFewShot(metric=validate_answer, max_bootstrapped_demos=3)
optimized_qa = optimizer.compile(qa, trainset=trainset)
# Now optimized_qa performs better!
MIPRO (Most Important Prompt Optimization)
Iteratively improves prompts:
from dspy.teleprompt import MIPRO
optimizer = MIPRO(
metric=validate_answer,
num_candidates=10,
init_temperature=1.0
)
optimized_cot = optimizer.compile(
cot,
trainset=trainset,
num_trials=100
)
BootstrapFinetune
Creates datasets for model fine-tuning:
from dspy.teleprompt import BootstrapFinetune
optimizer = BootstrapFinetune(metric=validate_answer)
optimized_module = optimizer.compile(qa, trainset=trainset)
# Exports training data for fine-tuning
4. Building Complex Systems
Multi-Stage Pipeline
import dspy
class MultiHopQA(dspy.Module):
def __init__(self):
super().__init__()
self.retrieve = dspy.Retrieve(k=3)
self.generate_query = dspy.ChainOfThought("question -> search_query")
self.generate_answer = dspy.ChainOfThought("context, question -> answer")
def forward(self, question):
# Stage 1: Generate search query
search_query = self.generate_query(question=question).search_query
# Stage 2: Retrieve context
passages = self.retrieve(search_query).passages
context = "\n".join(passages)
# Stage 3: Generate answer
answer = self.generate_answer(context=context, question=question).answer
return dspy.Prediction(answer=answer, context=context)
# Use the pipeline
qa_system = MultiHopQA()
result = qa_system(question="Who wrote the book that inspired the movie Blade Runner?")
RAG System with Optimization
import dspy
from dspy.retrieve.chromadb_rm import ChromadbRM
# Configure retriever
retriever = ChromadbRM(
collection_name="documents",
persist_directory="./chroma_db"
)
class RAG(dspy.Module):
def __init__(self, num_passages=3):
super().__init__()
self.retrieve = dspy.Retrieve(k=num_passages)
self.generate = dspy.ChainOfThought("context, question -> answer")
def forward(self, question):
context = self.retrieve(question).passages
return self.generate(context=context, question=question)
# Create and optimize
rag = RAG()
# Optimize with training data
from dspy.teleprompt import BootstrapFewShot
optimizer = BootstrapFewShot(metric=validate_answer)
optimized_rag = optimizer.compile(rag, trainset=trainset)
LM Provider Configuration
Anthropic Claude
import dspy
lm = dspy.Claude(
model="claude-sonnet-4-5-20250929",
api_key="your-api-key", # Or set ANTHROPIC_API_KEY env var
max_tokens=1000,
temperature=0.7
)
dspy.settings.configure(lm=lm)
OpenAI
lm = dspy.OpenAI(
model="gpt-4",
api_key="your-api-key",
max_tokens=1000
)
dspy.settings.configure(lm=lm)
Local Models (Ollama)
lm = dspy.OllamaLocal(
model="llama3.1",
base_url="http://localhost:11434"
)
dspy.settings.configure(lm=lm)
Multiple Models
# Different models for different tasks
cheap_lm = dspy.OpenAI(model="gpt-3.5-turbo")
strong_lm = dspy.Claude(model="claude-sonnet-4-5-20250929")
# Use cheap model for retrieval, strong model for reasoning
with dspy.settings.context(lm=cheap_lm):
context = retriever(question)
with dspy.settings.context(lm=strong_lm):
answer = generator(context=context, question=question)
Common Patterns
Pattern 1: Structured Output
from pydantic import BaseModel, Field
class PersonInfo(BaseModel):
name: str = Field(description="Full name")
age: int = Field(description="Age in years")
occupation: str = Field(description="Current job")
class ExtractPerson(dspy.Signature):
"""Extract person information from text."""
text = dspy.InputField()
person: PersonInfo = dspy.OutputField()
extractor = dspy.TypedPredictor(ExtractPerson)
result = extractor(text="John Doe is a 35-year-old software engineer.")
print(result.person.name) # "John Doe"
print(result.person.age) # 35
Pattern 2: Assertion-Driven Optimization
import dspy
from dspy.primitives.assertions import assert_transform_module, backtrack_handler
class MathQA(dspy.Module):
def __init__(self):
super().__init__()
self.solve = dspy.ChainOfThought("problem -> solution: float")
def forward(self, problem):
solution = self.solve(problem=problem).solution
# Assert solution is numeric
dspy.Assert(
isinstance(float(solution), float),
"Solution must be a number",
backtrack=backtrack_handler
)
return dspy.Prediction(solution=solution)
Pattern 3: Self-Consistency
import dspy
from collections import Counter
class ConsistentQA(dspy.Module):
def __init__(self, num_samples=5):
super().__init__()
self.qa = dspy.ChainOfThought("question -> answer")
self.num_samples = num_samples
def forward(self, question):
# Generate multiple answers
answers = []
for _ in range(self.num_samples):
result = self.qa(question=question)
answers.append(result.answer)
# Return most common answer
most_common = Counter(answers).most_common(1)[0][0]
return dspy.Prediction(answer=most_common)
Pattern 4: Retrieval with Reranking
class RerankedRAG(dspy.Module):
def __init__(self):
super().__init__()
self.retrieve = dspy.Retrieve(k=10)
self.rerank = dspy.Predict("question, passage -> relevance_score: float")
self.answer = dspy.ChainOfThought("context, question -> answer")
def forward(self, question):
# Retrieve candidates
passages = self.retrieve(question).passages
# Rerank passages
scored = []
for passage in passages:
score = float(self.rerank(question=question, passage=passage).relevance_score)
scored.append((score, passage))
# Take top 3
top_passages = [p for _, p in sorted(scored, reverse=True)[:3]]
context = "\n\n".join(top_passages)
# Generate answer
return self.answer(context=context, question=question)
Evaluation and Metrics
Custom Metrics
def exact_match(example, pred, trace=None):
"""Exact match metric."""
return example.answer.lower() == pred.answer.lower()
def f1_score(example, pred, trace=None):
"""F1 score for text overlap."""
pred_tokens = set(pred.answer.lower().split())
gold_tokens = set(example.answer.lower().split())
if not pred_tokens:
return 0.0
precision = len(pred_tokens & gold_tokens) / len(pred_tokens)
recall = len(pred_tokens & gold_tokens) / len(gold_tokens)
if precision + recall == 0:
return 0.0
return 2 * (precision * recall) / (precision + recall)
Evaluation
from dspy.evaluate import Evaluate
# Create evaluator
evaluator = Evaluate(
devset=testset,
metric=exact_match,
num_threads=4,
display_progress=True
)
# Evaluate model
score = evaluator(qa_system)
print(f"Accuracy: {score}")
# Compare optimized vs unoptimized
score_before = evaluator(qa)
score_after = evaluator(optimized_qa)
print(f"Improvement: {score_after - score_before:.2%}")
Best Practices
1. Start Simple, Iterate
# Start with Predict
qa = dspy.Predict("question -> answer")
# Add reasoning if needed
qa = dspy.ChainOfThought("question -> answer")
# Add optimization when you have data
optimized_qa = optimizer.compile(qa, trainset=data)
2. Use Descriptive Signatures
# ❌ Bad: Vague
class Task(dspy.Signature):
input = dspy.InputField()
output = dspy.OutputField()
# ✅ Good: Descriptive
class SummarizeArticle(dspy.Signature):
"""Summarize news articles into 3-5 key points."""
article = dspy.InputField(desc="full article text")
summary = dspy.OutputField(desc="bullet points, 3-5 items")
3. Optimize with Representative Data
# Create diverse training examples
trainset = [
dspy.Example(question="factual", answer="...).with_inputs("question"),
dspy.Example(question="reasoning", answer="...").with_inputs("question"),
dspy.Example(question="calculation", answer="...").with_inputs("question"),
]
# Use validation set for metric
def metric(example, pred, trace=None):
return example.answer in pred.answer
4. Save and Load Optimized Models
# Save
optimized_qa.save("models/qa_v1.json")
# Load
loaded_qa = dspy.ChainOfThought("question -> answer")
loaded_qa.load("models/qa_v1.json")
5. Monitor and Debug
# Enable tracing
dspy.settings.configure(lm=lm, trace=[])
# Run prediction
result = qa(question="...")
# Inspect trace
for call in dspy.settings.trace:
print(f"Prompt: {call['prompt']}")
print(f"Response: {call['response']}")
Comparison to Other Approaches
| Feature | Manual Prompting | LangChain | DSPy |
|---|---|---|---|
| Prompt Engineering | Manual | Manual | Automatic |
| Optimization | Trial & error | None | Data-driven |
| Modularity | Low | Medium | High |
| Type Safety | No | Limited | Yes (Signatures) |
| Portability | Low | Medium | High |
| Learning Curve | Low | Medium | Medium-High |
When to choose DSPy:
- You have training data or can generate it
- You need systematic prompt improvement
- You're building complex multi-stage systems
- You want to optimize across different LMs
When to choose alternatives:
- Quick prototypes (manual prompting)
- Simple chains with existing tools (LangChain)
- Custom optimization logic needed
Resources
- Documentation: https://dspy.ai
- GitHub: https://github.com/stanfordnlp/dspy (22k+ stars)
- Discord: https://discord.gg/XCGy2WDCQB
- Twitter: @DSPyOSS
- Paper: "DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines"
See Also
references/modules.md- Detailed module guide (Predict, ChainOfThought, ReAct, ProgramOfThought)references/optimizers.md- Optimization algorithms (BootstrapFewShot, MIPRO, BootstrapFinetune)references/examples.md- Real-world examples (RAG, agents, classifiers)