docs: sharpen software-development skills

This commit is contained in:
miha 2026-06-18 16:13:42 +02:00 committed by Teknium
parent 74b5cc7ca4
commit 95d970a752
3 changed files with 116 additions and 22 deletions

View file

@ -1,7 +1,7 @@
---
name: hermes-agent-skill-authoring
description: "Author in-repo SKILL.md: frontmatter, validator, structure."
version: 1.0.0
description: "Author in-repo SKILL.md: frontmatter, validator, structure, and writing-quality principles."
version: 1.1.0
author: Hermes Agent
license: MIT
platforms: [linux, macos, windows]
@ -43,7 +43,7 @@ Peer-matched shape used by every skill under `skills/software-development/`:
---
name: my-skill-name # lowercase, hyphens, ≤64 chars (MAX_NAME_LENGTH)
description: Use when <trigger>. <one-line behavior>.
version: 1.0.0
version: 1.1.0
author: Hermes Agent
license: MIT
metadata:
@ -61,6 +61,29 @@ metadata:
- Full SKILL.md: ≤ 100,000 chars (enforced as `MAX_SKILL_CONTENT_CHARS`, ~36k tokens).
- Peer skills in `software-development/` sit at **8-14k chars**. Aim for that range. If you're pushing past 20k, split into `references/*.md` and reference them from SKILL.md.
## Writing Quality Principles
A skill exists to make the agent's process more predictable. Predictability does **not** mean identical output every run; it means the agent reliably follows the same useful discipline.
Use these quality checks when writing or editing any skill:
1. **Optimize for process predictability.** Ask: what behavior should change when this skill loads? If a line does not change behavior, cut it.
2. **Choose the right context load.** A model-invoked Hermes skill pays for its description every turn. Keep descriptions focused on trigger classes and the skill's distinctive behavior. Put details in the body or linked references.
3. **Use an information hierarchy.** Put always-needed steps in `SKILL.md`; put branch-specific or bulky reference material in `references/`, `templates/`, or `scripts/` and point to it only when needed.
4. **End steps with completion criteria.** Each ordered step should say how the agent knows it is done. Good criteria are checkable and, when it matters, exhaustive: "every modified file accounted for" beats "summarize changes."
5. **Co-locate rules with the concept they govern.** Avoid scattering one idea across the file. Keep definition, caveats, examples, and verification near each other.
6. **Use strong leading words.** Prefer compact concepts the model already knows — e.g. "tight loop," "tracer bullet," "root cause," "regression test" — over long repeated explanations. A good leading word saves tokens and anchors behavior.
7. **Prune duplication and no-ops.** Keep each meaning in one source of truth. Sentence by sentence, ask whether the sentence changes agent behavior versus the default. If not, delete it rather than polishing it.
8. **Watch for premature completion.** If agents tend to rush a step, first sharpen that step's completion criterion. Split the sequence only when later steps distract from doing the current step well.
Common quality failures:
- **Premature completion** — the skill lets the agent move on before the work is genuinely done.
- **Duplication** — the same rule appears in multiple places and drifts.
- **Sediment** — stale lines remain because adding felt safer than deleting.
- **Sprawl** — too much always-visible material; push branch-specific reference behind pointers.
- **No-op prose** — generic advice the agent would already follow without the skill.
## Peer-Matched Structure
Every in-repo skill follows roughly:
@ -150,7 +173,11 @@ Pick the closest existing category. Don't invent new top-level categories casual
6. **Expecting the current session to see the new skill.** It won't. The skill loader is initialized at session start. Verify in a fresh session or via `skill_view` using the exact path.
7. **Linking to skills that don't exist in-repo.** `related_skills: [some-user-local-skill]` works for you but breaks for other clones. Prefer only in-repo links.
7. **Letting skills accumulate sediment.** A skill should get shorter or sharper over time. When adding a rule, remove the old wording it replaces; don't layer advice forever.
8. **Writing no-op prose.** "Be careful," "be thorough," and "use best practices" rarely change model behavior. Replace with a checkable completion criterion or a stronger leading word.
9. **Linking to skills that don't exist in-repo.** `related_skills: [some-user-local-skill]` works for you but breaks for other clones. Prefer only in-repo links.
## Verification Checklist
@ -161,5 +188,9 @@ Pick the closest existing category. Don't invent new top-level categories casual
- [ ] Description ≤ 1024 chars and starts with "Use when ..."
- [ ] Total file ≤ 100,000 chars (aim for 8-15k)
- [ ] Structure: `# Title``## Overview``## When to Use` → body → `## Common Pitfalls``## Verification Checklist`
- [ ] Each ordered step has a checkable completion criterion
- [ ] Description is trigger-focused and avoids duplicated body content
- [ ] Bulky or branch-specific reference is progressively disclosed in linked files
- [ ] No-op prose and duplicated rules removed
- [ ] `related_skills` references resolve in-repo (or are explicitly OK to be user-local)
- [ ] `git add skills/<category>/<name>/ && git commit` completed on the intended branch

View file

@ -29,6 +29,12 @@ NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
If you haven't completed Phase 1, you cannot propose fixes.
## The Feedback Loop Rule
The feedback loop is the debugging work. Before reading code to build a theory, create or identify a **tight** command that can go red on the user's exact symptom and green when the bug is fixed. A tight loop is fast, deterministic, agent-runnable, and specific enough to catch this bug — not merely "doesn't crash".
When a clean repro is hard, spend disproportionate effort building the loop. Guessing without a red-capable loop is the failure mode this skill exists to prevent.
## When to Use
Use for ANY technical issue:
@ -70,21 +76,46 @@ You MUST complete each phase before proceeding to the next.
**Action:** Use `read_file` on the relevant source files. Use `search_files` to find the error string in the codebase.
### 2. Reproduce Consistently
### 2. Build a Tight Feedback Loop
- Can you trigger it reliably?
- What are the exact steps?
- Does it happen every time?
- If not reproducible → gather more data, don't guess
- Can you trigger the user's exact symptom with one command?
- Does the command fail for this bug and only pass once the bug is fixed?
- Is it fast enough to run repeatedly?
- Is it deterministic? For flaky bugs, can you raise the reproduction rate high enough to debug?
- If not reproducible → gather more data, don't guess.
**Action:** Use the `terminal` tool to run the failing test or trigger the bug:
**Ways to construct a loop — try in roughly this order:**
1. **Failing test** at the seam that reaches the bug: unit, integration, or end-to-end.
2. **HTTP script / curl** against a running dev server.
3. **CLI invocation** with fixture input, diffing stdout/stderr against expected output.
4. **Headless browser script** (Playwright/Puppeteer) asserting on DOM, console, or network.
5. **Replay a captured trace**: HAR, request payload, event log, queue message, or webhook body.
6. **Throwaway harness** that boots the smallest useful slice of the system and calls the failing path.
7. **Property / fuzz loop** when the bug is intermittent wrong output over a broad input space.
8. **Bisection harness** suitable for `git bisect run` when the bug appeared between two known states.
9. **Differential loop** comparing old vs new version, two configs, two providers, or two datasets.
10. **Human-in-the-loop script** only as a last resort: script the human steps and capture their result so the loop stays structured.
**Tighten the loop once it exists:**
- Make it faster: cache setup, narrow scope, skip unrelated initialization.
- Make the signal sharper: assert the exact symptom, not generic success.
- Make it more deterministic: pin time, seed randomness, isolate filesystem, freeze network.
For non-deterministic bugs, the immediate goal is a higher reproduction rate, not perfection. Run the trigger 100x, parallelize, add stress, narrow timing windows, or inject sleeps. A 50% flake is debuggable; a 1% flake usually is not.
**Action:** Use the `terminal` tool to run the tight loop:
```bash
# Run specific failing test
# Run a specific failing test
pytest tests/test_module.py::test_name -v
# Run with verbose output
pytest tests/test_module.py -v --tb=long
# Or run a scripted repro
python scripts/repro_bug.py
# Or run a high-repetition flaky repro
for i in {1..100}; do pytest tests/test_flake.py::test_name -q || break; done
```
### 3. Check Recent Changes
@ -144,11 +175,13 @@ search_files("variable_name\\s*=", path="src/", file_glob="*.py")
### Phase 1 Completion Checklist
- [ ] Error messages fully read and understood
- [ ] Issue reproduced consistently
- [ ] A tight loop command exists and has been run at least once
- [ ] Loop is red-capable: it asserts the user's exact symptom, not a nearby failure
- [ ] Loop is deterministic, or a flaky bug has a high enough reproduction rate to debug
- [ ] Recent changes identified and reviewed
- [ ] Evidence gathered (logs, state, data flow)
- [ ] Problem isolated to specific component/code
- [ ] Root cause hypothesis formed
- [ ] Root cause hypotheses can be stated and tested
**STOP:** Do not proceed to Phase 2 until you understand WHY it's happening.
@ -158,6 +191,12 @@ search_files("variable_name\\s*=", path="src/", file_glob="*.py")
**Find the pattern before fixing:**
### 0. Minimize the Reproduction
Once the loop is red, shrink the repro to the smallest scenario that still goes red. Cut inputs, callers, config, data, and steps **one at a time**, re-running the loop after each cut. Keep only what is load-bearing for the failure.
Done when removing any remaining element makes the loop go green. A minimal repro narrows the hypothesis space and often becomes the cleanest regression test.
### 1. Find Working Examples
- Locate similar working code in the same codebase
@ -193,17 +232,22 @@ search_files("similar_pattern", path="src/", file_glob="*.py")
**Scientific method:**
### 1. Form a Single Hypothesis
### 1. Form Ranked Falsifiable Hypotheses
- State clearly: "I think X is the root cause because Y"
- Write it down
- Be specific, not vague
- Generate 35 plausible hypotheses before testing any single one.
- Rank them by likelihood and cheapness to falsify.
- State the prediction each hypothesis makes: "If X is the cause, then changing or observing Y should make Z happen."
- Discard or sharpen any hypothesis that does not make a testable prediction.
If the user is present, show the ranked list before testing. They may have domain knowledge that instantly re-ranks it. If the user is AFK, proceed with your ranking.
### 2. Test Minimally
- Make the SMALLEST possible change to test the hypothesis
- One variable at a time
- Don't fix multiple things at once
- Test the highest-ranked hypothesis with the smallest possible probe.
- Change one variable at a time.
- Don't fix multiple things at once.
- Prefer debugger/REPL inspection when available; one breakpoint beats ten logs.
- If you add logs, tag every temporary line with a unique prefix such as `[DEBUG-a4f2]` so cleanup is a single search.
### 3. Verify Before Continuing

View file

@ -175,6 +175,25 @@ Keep tests green throughout. Don't add behavior.
Next failing test for next behavior. One cycle at a time.
## Avoid Horizontal Slices
Do **not** write all tests first and then all implementation. That is horizontal slicing: RED becomes "write a pile of imagined tests" and GREEN becomes "make the pile pass." It produces brittle tests because the tests are designed before the implementation has taught you what behavior and interface actually matter.
Use vertical tracer bullets instead:
```text
WRONG:
RED: test1, test2, test3, test4
GREEN: impl1, impl2, impl3, impl4
RIGHT:
RED→GREEN: test1→impl1
RED→GREEN: test2→impl2
RED→GREEN: test3→impl3
```
A tracer bullet is one end-to-end behavior slice. It proves the path works, teaches you about the interface, and keeps each next test grounded in what you just learned.
## Why Order Matters
**"I'll write tests after to verify it works"**