---
name: research-paper-writing
title: Research Paper Writing Pipeline
description: End-to-end pipeline for writing ML/AI research papers — from experiment design through analysis, drafting, revision, and submission. Covers NeurIPS, ICML, ICLR, ACL, AAAI, COLM. Integrates automated experiment monitoring, statistical analysis, iterative writing, and citation verification.
version: 1.0.0
author: Orchestra Research
license: MIT
dependencies: [semanticscholar, arxiv, habanero, requests, scipy, numpy, matplotlib, SciencePlots]
platforms: [linux, macos]
metadata:
  hermes:
    tags: [Research, Paper Writing, Experiments, ML, AI, NeurIPS, ICML, ICLR, ACL, AAAI, COLM, LaTeX, Citations, Statistical Analysis]
    category: research
    related_skills: [arxiv, ml-paper-writing, subagent-driven-development, plan]
    requires_toolsets: [terminal, files]
---
# Research Paper Writing Pipeline
End-to-end pipeline for producing publication-ready ML/AI research papers targeting **NeurIPS, ICML, ICLR, ACL, AAAI, and COLM**. This skill covers the full research lifecycle: experiment design, execution, monitoring, analysis, paper writing, review, revision, and submission.
This is **not a linear pipeline** — it is an iterative loop. Results trigger new experiments. Reviews trigger new analysis. The agent must handle these feedback loops.
```
                 RESEARCH PAPER PIPELINE

Phase 0: Project Setup ──► Phase 1: Literature Review
      │                          │
      ▼                          ▼
Phase 2: Experiment        Phase 5: Paper Drafting ◄──┐
         Design                  │                    │
      │                          ▼                    │
      ▼                    Phase 6: Self-Review       │
Phase 3: Execution &             & Revision ──────────┘
         Monitoring              │
      │                          ▼
      ▼                    Phase 7: Submission
Phase 4: Analysis ─────►   (feeds back to Phase 2 or 5)
```
---
## When To Use This Skill
Use this skill when:
- **Starting a new research paper** from an existing codebase or idea
- **Designing and running experiments** to support paper claims
- **Writing or revising** any section of a research paper
- **Preparing for submission** to a specific conference
- **Responding to reviews** with additional experiments or revisions
- **Converting** a paper between conference formats
## Core Philosophy
1. **Be proactive.** Deliver complete drafts, not questions. Scientists are busy — produce something concrete they can react to, then iterate.
2. **Never hallucinate citations.** AI-generated citations have ~40% error rate. Always fetch programmatically. Mark unverifiable citations as `[CITATION NEEDED]`.
3. **Paper is a story, not a collection of experiments.** Every paper needs one clear contribution stated in a single sentence. If you can't do that, the paper isn't ready.
4. **Experiments serve claims.** Every experiment must explicitly state which claim it supports. Never run experiments that don't connect to the paper's narrative.
5. **Commit early, commit often.** Every completed experiment batch, every paper draft update — commit with descriptive messages. Git log is the experiment history.
### Proactivity and Collaboration
**Default: Be proactive. Draft first, ask with the draft.**
| Confidence Level | Action |
|-----------------|--------|
| **High** (clear repo, obvious contribution) | Write full draft, deliver, iterate on feedback |
| **Medium** (some ambiguity) | Write draft with flagged uncertainties, continue |
| **Low** (major unknowns) | Ask 1-2 targeted questions via `clarify`, then draft |
Per-section defaults:
| Section | Draft Autonomously? | Flag With Draft |
|---------|-------------------|-----------------|
| Abstract | Yes | "Framed contribution as X — adjust if needed" |
| Introduction | Yes | "Emphasized problem Y — correct if wrong" |
| Methods | Yes | "Included details A, B, C — add missing pieces" |
| Experiments | Yes | "Highlighted results 1, 2, 3 — reorder if needed" |
| Related Work | Yes | "Cited papers X, Y, Z — add any I missed" |
**Block for input only when**: target venue unclear, multiple contradictory framings, results seem incomplete, explicit request to review first.
---
## Phase 0: Project Setup
**Goal**: Establish the workspace, understand existing work, identify the contribution.
### Step 0.1: Explore the Repository
```bash
# Understand project structure
ls -la
find . -name "*.py" | head -30
find . -name "*.md" -o -name "*.txt" | xargs grep -l -i "result\|conclusion\|finding"
```
Look for:
- `README.md` — project overview and claims
- `results/`, `outputs/`, `experiments/` — existing findings
- `configs/` — experimental settings
- `.bib` files — existing citations
- Draft documents or notes
### Step 0.2: Organize the Workspace
Establish a consistent workspace structure:
```
workspace/
  paper/        # LaTeX source, figures, compiled PDFs
  experiments/  # Experiment runner scripts
  code/         # Core method implementation
  results/      # Raw experiment results (auto-generated)
  tasks/        # Task/benchmark definitions
  human_eval/   # Human evaluation materials (if needed)
```
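To scaffold this in one step, a minimal sketch (plain Python, nothing venue-specific):
```python
from pathlib import Path

# Create the workspace skeleton; exist_ok makes re-runs harmless
for d in ("paper", "experiments", "code", "results", "tasks", "human_eval"):
    Path("workspace", d).mkdir(parents=True, exist_ok=True)
```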
### Step 0.3: Set Up Version Control
```bash
git init # if not already
git remote add origin <repo-url>
git checkout -b paper-draft # or main
```
**Git discipline**: Every completed experiment batch gets committed with a descriptive message. Example:
```
Add Monte Carlo constrained results (5 runs, Sonnet 4.6, policy memo task)
Add Haiku baseline comparison: autoreason vs refinement baselines at cheap model tier
```
### Step 0.4: Identify the Contribution
Before writing anything, articulate:
- **The What**: What is the single thing this paper contributes?
- **The Why**: What evidence supports it?
- **The So What**: Why should readers care?
> Propose to the scientist: "Based on my understanding, the main contribution is: [one sentence]. The key results show [Y]. Is this the framing you want?"
### Step 0.5: Create a TODO List
Use the `todo` tool to create a structured project plan:
```
Research Paper TODO:
- [ ] Define one-sentence contribution
- [ ] Literature review (related work + baselines)
- [ ] Design core experiments
- [ ] Run experiments
- [ ] Analyze results
- [ ] Write first draft
- [ ] Self-review (simulate reviewers)
- [ ] Revise based on review
- [ ] Submission prep
```
Update this throughout the project. It serves as the persistent state across sessions.
---
## Phase 1: Literature Review
**Goal**: Find related work, identify baselines, gather citations.
### Step 1.1: Identify Seed Papers
Start from papers already referenced in the codebase:
```bash
# Via terminal:
grep -r "arxiv\|doi\|cite" --include="*.md" --include="*.bib" --include="*.py"
find . -name "*.bib"
```
### Step 1.2: Search for Related Work
**Load the `arxiv` skill** for structured paper discovery: `skill_view("arxiv")`. It provides arXiv REST API search, Semantic Scholar citation graphs, author profiles, and BibTeX generation.
Use `web_search` for broad discovery, `web_extract` for fetching specific papers:
```
# Via web_search:
web_search("[main technique] + [application domain] site:arxiv.org")
web_search("[baseline method] comparison ICML NeurIPS 2024")
# Via web_extract (for specific papers):
web_extract("https://arxiv.org/abs/2303.17651")
```
Additional search queries to try:
```
Search queries:
- "[main technique] + [application domain]"
- "[baseline method] comparison"
- "[problem name] state-of-the-art"
- Author names from existing citations
```
**Recommended**: Install **Exa MCP** for real-time academic search:
```bash
claude mcp add exa -- npx -y mcp-remote "https://mcp.exa.ai/mcp"
```
### Step 1.3: Verify Every Citation
**NEVER generate BibTeX from memory. ALWAYS fetch programmatically.**
For each citation, follow the mandatory 5-step process:
```
Citation Verification (MANDATORY per citation):
1. SEARCH → Query Semantic Scholar or Exa MCP with specific keywords
2. VERIFY → Confirm paper exists in 2+ sources (Semantic Scholar + arXiv/CrossRef)
3. RETRIEVE → Get BibTeX via DOI content negotiation (programmatically, not from memory)
4. VALIDATE → Confirm the claim you're citing actually appears in the paper
5. ADD → Add verified BibTeX to bibliography
If ANY step fails → mark as [CITATION NEEDED], inform scientist
```
```python
# Fetch BibTeX via DOI
import requests
def doi_to_bibtex(doi: str) -> str:
    response = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/x-bibtex"},
    )
    response.raise_for_status()
    return response.text
```
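Step 2 (verify in 2+ sources) can also be scripted. A minimal sketch querying Semantic Scholar and CrossRef by title; treat the exact response fields as assumptions and spot-check the first hit manually:
```python
import requests

def exists_in_two_sources(title: str) -> bool:
    """Rough existence check: both Semantic Scholar and CrossRef return a hit."""
    s2 = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": title, "limit": 1},
    ).json()
    cr = requests.get(
        "https://api.crossref.org/works",
        params={"query.title": title, "rows": 1},
    ).json()
    return bool(s2.get("data")) and bool(cr.get("message", {}).get("items"))
```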
If you cannot verify a citation:
```latex
\cite{PLACEHOLDER_author2024_verify_this} % TODO: Verify this citation exists
```
**Always tell the scientist**: "I've marked [X] citations as placeholders that need verification."
See [references/citation-workflow.md](references/citation-workflow.md) for complete API documentation and the full `CitationManager` class.
### Step 1.4: Organize Related Work
Group papers by methodology, not paper-by-paper:
**Good**: "One line of work uses X's assumption [refs] whereas we use Y's assumption because..."
**Bad**: "Smith et al. introduced X. Jones et al. introduced Y. We combine both."
---
## Phase 2: Experiment Design
**Goal**: Design experiments that directly support paper claims. Every experiment must answer a specific question.
### Step 2.1: Map Claims to Experiments
Create an explicit mapping:
| Claim | Experiment | Expected Evidence |
|-------|-----------|-------------------|
| "Our method outperforms baselines" | Main comparison (Table 1) | Win rate, statistical significance |
| "Effect is larger for weaker models" | Model scaling study | Monotonic improvement curve |
| "Convergence requires scope constraints" | Constrained vs unconstrained | Convergence rate comparison |
**Rule**: If an experiment doesn't map to a claim, don't run it.
### Step 2.2: Design Baselines
Strong baselines are what separates accepted papers from rejected ones. Reviewers will ask: "Did they compare against X?"
Standard baseline categories:
- **Naive baseline**: Simplest possible approach
- **Strong baseline**: Best known existing method
- **Ablation baselines**: Your method minus one component
- **Compute-matched baselines**: Same compute budget, different allocation
### Step 2.3: Define Evaluation Protocol
Before running anything, specify:
- **Metrics**: What you're measuring, direction symbols (higher/lower better)
- **Aggregation**: How results are combined across runs/tasks
- **Statistical tests**: What tests will establish significance
- **Sample sizes**: How many runs/problems/tasks
### Step 2.4: Write Experiment Scripts
Follow these patterns from successful research pipelines:
**Incremental saving** — save results after each step for crash recovery:
```python
# Save after each problem/task (inside the per-task loop)
result_path = f"results/{task}/{strategy}/result.json"
if os.path.exists(result_path):
    continue  # Skip already-completed work
# ... run experiment ...
with open(result_path, 'w') as f:
    json.dump(result, f, indent=2)
```
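A hardening step worth considering (a suggestion, not from the reference files): write each result via a temp file plus atomic rename, so a crash mid-write never leaves a truncated JSON that the skip-if-exists check would later trust:
```python
import json, os, tempfile

def atomic_json_dump(obj, path):
    """Write JSON to a temp file, then rename; readers never see partial files."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(obj, f, indent=2)
    os.replace(tmp, path)  # atomic on POSIX filesystems
```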
**Artifact preservation** — save all intermediate outputs:
```
results/<experiment>/
  <task>/
    <strategy>/
      final_output.md   # Final result
      history.json      # Full trajectory
      pass_01/          # Per-iteration artifacts
        version_a.md
        version_b.md
        critic.md
```
**Separation of concerns** — keep generation, evaluation, and visualization separate:
```
run_experiment.py # Core experiment runner
run_baselines.py # Baseline comparison
run_comparison_judge.py # Blind evaluation
analyze_results.py # Statistical analysis
make_charts.py # Visualization
```
See [references/experiment-patterns.md](references/experiment-patterns.md) for complete design patterns, cron monitoring, and error recovery.
---
## Phase 3: Experiment Execution & Monitoring
**Goal**: Run experiments reliably, monitor progress, recover from failures.
### Step 3.1: Launch Experiments
Use `nohup` for long-running experiments:
```bash
nohup python run_experiment.py --config config.yaml > logs/experiment_01.log 2>&1 &
echo $! # Record the PID
```
**Parallel execution**: Run independent experiments simultaneously, but be aware of API rate limits. 4+ concurrent experiments on the same API will slow each down.
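When driving runs from a single Python orchestrator instead of separate `nohup` processes, a semaphore enforces the concurrency cap. A minimal asyncio sketch; `run_one` is a placeholder for the actual experiment call:
```python
import asyncio, random

async def run_one(task):
    """Placeholder for a single experiment or API call."""
    await asyncio.sleep(random.random())
    return f"{task}: done"

async def run_all(tasks, max_concurrent=3):
    sem = asyncio.Semaphore(max_concurrent)  # cap concurrent API calls
    async def guarded(t):
        async with sem:
            return await run_one(t)
    return await asyncio.gather(*(guarded(t) for t in tasks))

print(asyncio.run(run_all([f"task_{i}" for i in range(8)])))
```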
### Step 3.2: Set Up Monitoring (Cron Pattern)
For long-running experiments, set up periodic status checks. The cron prompt should follow this template:
```
Monitor Prompt Template:
1. Check if process is still running: ps aux | grep <pattern>
2. Read last 30 lines of log: tail -30 <logfile>
3. Check for completed results: ls <result_dir>
4. If results exist, read and report: cat <result_file>
5. If all done, commit: git add -A && git commit -m "<descriptive message>" && git push
6. Report in structured format (tables with key metrics)
7. Answer the key analytical question for this experiment
```
**Silent mode**: If nothing has changed since the last check, respond with `[SILENT]` to suppress notification to the user. Only report when there's news.
### Step 3.3: Handle Failures
Common failure modes and recovery:
| Failure | Detection | Recovery |
|---------|-----------|----------|
| API rate limit / credit exhaustion | 402/429 errors in logs | Wait, then re-run (scripts skip completed work) |
| Process crash | PID gone, incomplete results | Re-run from last checkpoint |
| Timeout on hard problems | Process stuck, no log progress | Kill and skip, note in results |
| Wrong model ID | Errors referencing model name | Fix ID and re-run |
**Key**: Scripts should always check for existing results and skip completed work. This makes re-runs safe and efficient.
### Step 3.4: Commit Completed Results
After each experiment batch completes:
```bash
git add -A
git commit -m "Add <experiment name>: <key finding in 1 line>"
git push
```
---
## Phase 4: Result Analysis
**Goal**: Extract findings, compute statistics, identify the story.
### Step 4.1: Aggregate Results
Write analysis scripts that:
1. Load all result files from a batch
2. Compute per-task and aggregate metrics
3. Generate summary tables
```python
# Standard analysis pattern
import json
import numpy as np
from pathlib import Path

results = {}
for result_file in Path("results/").rglob("result.json"):
    data = json.loads(result_file.read_text())
    strategy = result_file.parent.name
    task = result_file.parent.parent.name
    results.setdefault(strategy, {})[task] = data

# Compute aggregate metrics
for strategy, tasks in results.items():
    scores = [t["score"] for t in tasks.values()]
    print(f"{strategy}: mean={np.mean(scores):.1f}, std={np.std(scores):.1f}")
```
### Step 4.2: Statistical Significance
Always compute:
- **Error bars**: Standard deviation or standard error, specify which
- **Confidence intervals**: 95% CI for key results
- **Pairwise tests**: McNemar's test for comparing two methods
- **Effect sizes**: Cohen's d or h for practical significance
See [references/experiment-patterns.md](references/experiment-patterns.md) for complete implementations of McNemar's test, bootstrapped CIs, and Cohen's h.
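For a quick sanity check before reaching for those implementations, a minimal paired-bootstrap CI over per-task scores (a sketch; `a` and `b` are assumed to be paired score arrays):
```python
import numpy as np

def bootstrap_ci_diff(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """95% CI for mean(a) - mean(b), resampling paired per-task scores."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    idx = rng.integers(0, len(a), size=(n_boot, len(a)))
    diffs = (a[idx] - b[idx]).mean(axis=1)
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

lo, hi = bootstrap_ci_diff([0.9, 0.7, 0.8, 0.6], [0.7, 0.6, 0.7, 0.5])
print(f"95% CI for the difference: [{lo:.3f}, {hi:.3f}]")
```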
### Step 4.3: Identify the Story
After analysis, explicitly answer:
1. **What is the main finding?** State it in one sentence.
2. **What surprised you?** Unexpected results often make the best papers.
3. **What failed?** Failed experiments can be the most informative. Honest reporting of failures strengthens the paper.
4. **What follow-up experiments are needed?** Results often raise new questions.
### Step 4.4: Create Figures and Tables
**Figures**:
- Use vector graphics (PDF) for all plots: `plt.savefig('fig.pdf')`
- Colorblind-safe palettes (Okabe-Ito or Paul Tol)
- Self-contained captions — reader should understand without main text
- No title inside figure — the caption serves this function
**Tables**:
- Use `booktabs` LaTeX package
- Bold best value per metric
- Include direction symbols (higher/lower better)
- Consistent decimal precision
```latex
\usepackage{booktabs}
\begin{tabular}{lcc}
\toprule
Method & Accuracy $\uparrow$ & Latency $\downarrow$ \\
\midrule
Baseline & 85.2 & 45ms \\
\textbf{Ours} & \textbf{92.1} & 38ms \\
\bottomrule
\end{tabular}
```
### Step 4.5: Decide: More Experiments or Write?
| Situation | Action |
|-----------|--------|
| Core claims supported, results significant | Move to Phase 5 (writing) |
| Results inconclusive, need more data | Back to Phase 2 (design) |
| Unexpected finding suggests new direction | Back to Phase 2 (design) |
| Missing one ablation reviewers will ask for | Run it, then Phase 5 |
| All experiments done but some failed | Note failures, move to Phase 5 |
---
## Iterative Refinement: Strategy Selection
Any output in this pipeline — paper drafts, experiment scripts, analysis — can be iteratively refined. The autoreason research provides empirical evidence for when each refinement strategy works and when it fails. Use this section to choose the right approach.
### Quick Decision Table
| Your Situation | Strategy | Why |
|---------------|----------|-----|
| Mid-tier model + constrained task | **Autoreason** | Sweet spot. Generation-evaluation gap is widest. Baselines actively destroy weak model outputs. |
| Mid-tier model + open task | **Autoreason** with scope constraints added | Add fixed facts, structure, or deliverable to bound the improvement space. |
| Frontier model + constrained task | **Autoreason** | Wins 2/3 constrained tasks even at frontier. |
| Frontier model + unconstrained task | **Critique-and-revise** or **single pass** | Autoreason ranks last; the model self-evaluates well enough on its own. |
| Concrete technical task (system design) | **Critique-and-revise** | Direct find-and-fix loop is more efficient. |
| Template-filling task (one correct structure) | **Single pass** or **conservative** | Minimal decision space. Iteration adds no value. |
| Code with test cases | **Autoreason (code variant)** | Structured analysis of *why* it failed before fixing. Recovery rate 62% vs 43%. |
| Very weak model (Llama 8B class) | **Single pass** | Model too weak for diverse candidates. Invest in generation quality. |
### The Generation-Evaluation Gap
**Core insight**: Autoreason's value depends on the gap between a model's generation capability and its self-evaluation capability.
```
Model Tier │ Generation │ Self-Eval │ Gap │ Autoreason Value
──────────────────┼────────────┼───────────┼────────┼─────────────────
Weak (Llama 8B) │ Poor │ Poor │ Small │ None — can't generate diverse candidates
Mid (Haiku 3.5) │ Decent │ Poor │ LARGE │ MAXIMUM — 42/42 perfect Borda
Mid (Gemini Flash)│ Decent │ Moderate │ Large │ High — wins 2/3
Strong (Sonnet 4) │ Good │ Decent │ Medium │ Moderate — wins 3/5
Frontier (S4.6) │ Excellent │ Good │ Small │ Only with constraints
```
This gap is structural, not temporary. As costs drop, today's frontier becomes tomorrow's mid-tier. The sweet spot moves but never disappears.
### Autoreason Loop (Summary)
Each pass produces three candidates from fresh, isolated agents:
1. **Critic** → finds problems in incumbent A (no fixes)
2. **Author B** → revises A based on critique
3. **Synthesizer** → merges A and B (randomized labels)
4. **Judge Panel** → 3 blind CoT judges rank A, B, AB via Borda count
5. **Convergence** → A wins k=2 consecutive passes → done
**Key parameters:**
- k=2 convergence (k=1 premature, k=3 too expensive, no quality gain)
- CoT judges always (3x faster convergence)
- Temperature 0.8 authors, 0.3 judges
- Conservative tiebreak: incumbent wins ties (see the sketch below)
- Every role is a fresh agent with no shared context
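A minimal sketch of the judge-panel aggregation (step 4) with the conservative tiebreak; names like `rankings` are illustrative, not from the reference files:
```python
def borda_winner(rankings, incumbent="A", candidates=("A", "B", "AB")):
    """rankings: one ordering per judge, best candidate first."""
    scores = {c: 0 for c in candidates}
    for ranking in rankings:
        for pos, cand in enumerate(ranking):
            scores[cand] += len(candidates) - 1 - pos  # top rank earns most points
    best = max(scores.values())
    tied = [c for c, s in scores.items() if s == best]
    return incumbent if incumbent in tied else tied[0]  # incumbent wins ties

# Three blind judges rank {A, B, AB}; A wins with 4 Borda points
print(borda_winner([["B", "AB", "A"], ["A", "B", "AB"], ["A", "AB", "B"]]))
```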
### Applying to Paper Drafts
When refining the paper itself through autoreason:
- **Provide ground truth to the critic**: actual experimental data, result JSONs, statistical outputs. Without this, models hallucinate fabricated ablation studies and fake confidence intervals.
- **Use 3 working judges minimum**: A broken judge parser doesn't add noise — it prevents equilibrium entirely.
- **Scope constrain the revision**: "Address these specific weaknesses" not "improve the paper."
### Failure Modes
| Failure | Detection | Fix |
|---------|-----------|-----|
| No convergence (A never wins) | A wins <15% over 20+ passes | Add scope constraints to the task |
| Synthesis drift | Word counts grow unboundedly | Constrain structure and deliverable |
| Degradation below single pass | Baselines score higher than iterated output | Switch to single pass; model may be too weak |
| Overfitting (code) | High public-test pass, low private-test pass | Use structured analysis, not just test feedback |
| Broken judges | Parsing failures reduce panel below 3 | Fix parser before continuing |
See [references/autoreason-methodology.md](references/autoreason-methodology.md) for complete prompts, Borda scoring details, model selection guide, scope constraint design patterns, and compute budget reference.
---
## Phase 5: Paper Drafting
**Goal**: Write a complete, publication-ready paper.
### The Narrative Principle
**The single most critical insight**: Your paper is not a collection of experiments; it's a story with one clear contribution supported by evidence.
Every successful ML paper centers on what Neel Nanda calls "the narrative": a short, rigorous, evidence-based technical story with a takeaway readers care about.
**Three Pillars (must be crystal clear by end of introduction):**
| Pillar | Description | Test |
|--------|-------------|------|
| **The What** | 1-3 specific novel claims | Can you state them in one sentence? |
| **The Why** | Rigorous empirical evidence | Do experiments distinguish your hypothesis from alternatives? |
| **The So What** | Why readers should care | Does this connect to a recognized community problem? |
**If you cannot state your contribution in one sentence, you don't yet have a paper.**
### Time Allocation
Spend approximately **equal time** on each of:
1. The abstract
2. The introduction
3. The figures
4. Everything else combined
**Why?** Most reviewers form judgments before reaching your methods. Readers encounter your paper as: title → abstract → introduction → figures → maybe the rest.
### Writing Workflow
```
Paper Writing Checklist:
- [ ] Step 1: Define the one-sentence contribution
- [ ] Step 2: Draft Figure 1 (core idea or most compelling result)
- [ ] Step 3: Draft abstract (5-sentence formula)
- [ ] Step 4: Draft introduction (1-1.5 pages max)
- [ ] Step 5: Draft methods
- [ ] Step 6: Draft experiments & results
- [ ] Step 7: Draft related work
- [ ] Step 8: Draft conclusion & discussion
- [ ] Step 9: Draft limitations (REQUIRED by all venues)
- [ ] Step 10: Plan appendix (proofs, extra experiments, details)
- [ ] Step 11: Complete paper checklist
- [ ] Step 12: Final review
```
### Step 5.0: Title
The title is the single most-read element of the paper. It determines whether anyone clicks through to the abstract.
**Good titles**:
- State the contribution or finding: "Autoreason: When Iterative LLM Refinement Works and Why It Fails"
- Highlight a surprising result: "Scaling Data-Constrained Language Models" (implies you can)
- Name the method + what it does: "LoRA: Low-Rank Adaptation of Large Language Models"
**Bad titles**:
- Too generic: "An Approach to Improving Language Model Outputs"
- Too long: anything over ~15 words
- Jargon-only: "Asymptotic Convergence of Iterative Stochastic Policy Refinement" (who is this for?)
**Rules**:
- Include your method name if you have one (for citability)
- Include 1-2 keywords reviewers will search for
- Avoid colons unless both halves carry meaning
- Test: would a reviewer know the domain and contribution from the title alone?
### Step 5.1: Abstract (5-Sentence Formula)
From Sebastian Farquhar (DeepMind):
```
1. What you achieved: "We introduce...", "We prove...", "We demonstrate..."
2. Why this is hard and important
3. How you do it (with specialist keywords for discoverability)
4. What evidence you have
5. Your most remarkable number/result
```
**Delete** generic openings like "Large language models have achieved remarkable success..."
### Step 5.2: Figure 1
Figure 1 is the second thing most readers look at (after the abstract). Draft it before writing the introduction; doing so forces you to clarify the core idea.
| Figure 1 Type | When to Use | Example |
|---------------|-------------|---------|
| **Method diagram** | New architecture or pipeline | TikZ flowchart showing your system |
| **Results teaser** | One compelling result tells the whole story | Bar chart: "Ours vs baselines" with clear gap |
| **Problem illustration** | The problem is unintuitive | Before/after showing failure mode you fix |
| **Conceptual diagram** | Abstract contribution needs visual grounding | 2x2 matrix of method properties |
**Rules**: Figure 1 must be understandable without reading any text. The caption alone should communicate the core idea. Use color purposefully; don't just decorate.
### Step 5.3: Introduction (1-1.5 pages max)
Must include:
- Clear problem statement
- Brief approach overview
- 2-4 bullet contribution list (max 1-2 lines each in two-column format)
- Methods should start by page 2-3
### Step 5.4: Methods
Enable reimplementation:
- Conceptual outline or pseudocode
- All hyperparameters listed
- Architectural details sufficient for reproduction
- Present final design decisions; ablations go in experiments
### Step 5.5: Experiments & Results
For each experiment, explicitly state:
- **What claim it supports**
- How it connects to main contribution
- What to observe: "the blue line shows X, which demonstrates Y"
Requirements:
- Error bars with methodology (std dev vs std error)
- Hyperparameter search ranges
- Compute infrastructure (GPU type, total hours)
- Seed-setting methods
### Step 5.6: Related Work
Organize methodologically, not paper-by-paper. Cite generously; reviewers likely authored relevant papers.
### Step 5.7: Limitations (REQUIRED)
All major conferences require this. Honesty helps:
- Reviewers are instructed not to penalize honest limitation acknowledgment
- Pre-empt criticisms by identifying weaknesses first
- Explain why limitations don't undermine core claims
### Step 5.8: Conclusion & Discussion
**Conclusion** (required, 0.5-1 page):
- Restate the contribution in one sentence (different wording from abstract)
- Summarize key findings (2-3 sentences, not a list)
- Implications: what does this mean for the field?
- Future work: 2-3 concrete next steps (not vague "we leave X for future work")
**Discussion** (optional, sometimes combined with conclusion):
- Broader implications beyond immediate results
- Connections to other subfields
- Honest assessment of when the method does and doesn't work
- Practical deployment considerations
**Do NOT** introduce new results or claims in the conclusion.
### Step 5.9: Appendix Strategy
Appendices are unlimited at all major venues and are essential for reproducibility. Structure:
| Appendix Section | What Goes Here |
|-----------------|---------------|
| **Proofs & Derivations** | Full proofs too long for main text. Main text can state theorems with "proof in Appendix A." |
| **Additional Experiments** | Ablations, scaling curves, per-dataset breakdowns, hyperparameter sensitivity |
| **Implementation Details** | Full hyperparameter tables, training details, hardware specs, random seeds |
| **Dataset Documentation** | Data collection process, annotation guidelines, licensing, preprocessing |
| **Prompts & Templates** | Exact prompts used (for LLM-based methods), evaluation templates |
| **Human Evaluation** | Annotation interface screenshots, instructions given to annotators, IRB details |
| **Additional Figures** | Per-task breakdowns, trajectory visualizations, failure case examples |
**Rules**:
- The main paper must be self-contained; reviewers are not required to read appendices
- Never put critical evidence only in the appendix
- Cross-reference: "Full results in Table 5 (Appendix B)" not just "see appendix"
- Use the `\appendix` command, then `\section{Proofs}` etc.; sections after it are lettered automatically
### Page Budget Management
When over the page limit:
| Cut Strategy | Saves | Risk |
|-------------|-------|------|
| Move proofs to appendix | 0.5-2 pages | Low (standard practice) |
| Condense related work | 0.5-1 page | Medium (may miss key citations) |
| Combine tables with subfigures | 0.25-0.5 page | Low (often improves readability) |
| Use `\vspace{-Xpt}` sparingly | 0.1-0.3 page | Low if subtle, high if obvious |
| Remove qualitative examples | 0.5-1 page | Medium (reviewers like examples) |
| Reduce figure sizes | 0.25-0.5 page | High (figures must remain readable) |
**Do NOT**: reduce font size, change margins, remove required sections (limitations, broader impact), or use `\small`/`\footnotesize` for main text.
### Writing Style
**Sentence-level clarity (Gopen & Swan's 7 Principles):**
| Principle | Rule |
|-----------|------|
| Subject-verb proximity | Keep subject and verb close |
| Stress position | Place emphasis at sentence ends |
| Topic position | Put context first, new info after |
| Old before new | Put familiar info before unfamiliar info |
| One unit, one function | Each paragraph makes one point |
| Action in verb | Use verbs, not nominalizations |
| Context before new | Set the stage before presenting new material |
**Word choice (Lipton, Steinhardt):**
- Be specific: "accuracy" not "performance"
- Eliminate hedging: drop "may" unless genuinely uncertain
- Consistent terminology throughout
- Avoid incremental vocabulary: "develop", not "combine"
**Full writing guide with examples**: See [references/writing-guide.md](references/writing-guide.md)
### Using LaTeX Templates
**Always copy the entire template directory first, then write within it.**
```
Template Setup Checklist:
- [ ] Step 1: Copy entire template directory to new project
- [ ] Step 2: Verify template compiles as-is (before any changes)
- [ ] Step 3: Read the template's example content to understand structure
- [ ] Step 4: Replace example content section by section
- [ ] Step 5: Use template macros (check preamble for \newcommand definitions)
- [ ] Step 6: Clean up template artifacts only at the end
```
**Step 1: Copy the Full Template**
```bash
cp -r templates/neurips2025/ ~/papers/my-paper/
cd ~/papers/my-paper/
ls -la # Should see: main.tex, neurips.sty, Makefile, etc.
```
Copy the ENTIRE directory, not just the .tex file. Templates include style files (.sty), bibliography styles (.bst), example content, and Makefiles.
**Step 2: Verify Template Compiles First**
Before making ANY changes:
```bash
latexmk -pdf main.tex
# Or manual: pdflatex main.tex && bibtex main && pdflatex main.tex && pdflatex main.tex
```
If the unmodified template doesn't compile, fix that first (usually missing TeX packages; install via `tlmgr install <package>`).
**Step 3: Keep Template Content as Reference**
Don't immediately delete example content. Comment it out and use as formatting reference:
```latex
% Template example (keep for reference):
% \begin{figure}[t]
% \centering
% \includegraphics[width=0.8\linewidth]{example-image}
% \caption{Template shows caption style}
% \end{figure}
% Your actual figure:
\begin{figure}[t]
\centering
\includegraphics[width=0.8\linewidth]{your-figure.pdf}
\caption{Your caption following the same style.}
\end{figure}
```
**Step 4: Replace Content Section by Section**
Work through systematically: title/authors → abstract → introduction → methods → experiments → related work → conclusion → references → appendix. Compile after each section.
**Step 5: Use Template Macros**
```latex
\usepackage{xspace} % Required by the \eg/\ie macros below
\newcommand{\method}{YourMethodName} % Consistent method naming
\newcommand{\eg}{e.g.,\xspace} % Proper abbreviations
\newcommand{\ie}{i.e.,\xspace}
```
### Template Pitfalls
| Pitfall | Problem | Solution |
|---------|---------|----------|
| Copying only `.tex` file | Missing `.sty`, won't compile | Copy entire directory |
| Modifying `.sty` files | Breaks conference formatting | Never edit style files |
| Adding random packages | Conflicts, breaks template | Only add if necessary |
| Deleting template content early | Lose formatting reference | Keep as comments until done |
| Not compiling frequently | Errors accumulate | Compile after each section |
| Raster PNGs for figures | Blurry in paper | Always use vector PDF via `savefig('fig.pdf')` |
### Quick Template Reference
| Conference | Main File | Style File | Page Limit |
|------------|-----------|------------|------------|
| NeurIPS 2025 | `main.tex` | `neurips.sty` | 9 pages |
| ICML 2026 | `example_paper.tex` | `icml2026.sty` | 8 pages |
| ICLR 2026 | `iclr2026_conference.tex` | `iclr2026_conference.sty` | 9 pages |
| ACL 2025 | `acl_latex.tex` | `acl.sty` | 8 pages (long) |
| AAAI 2026 | `aaai2026-unified-template.tex` | `aaai2026.sty` | 7 pages |
| COLM 2025 | `colm2025_conference.tex` | `colm2025_conference.sty` | 9 pages |
**Universal**: Double-blind, references don't count, appendices unlimited, LaTeX required.
Templates in `templates/` directory. See [templates/README.md](templates/README.md) for compilation setup (VS Code, CLI, Overleaf, other IDEs).
### Tables and Figures
**Tables** use `booktabs` for professional formatting:
```latex
\usepackage{booktabs}
\begin{tabular}{lcc}
\toprule
Method & Accuracy $\uparrow$ & Latency $\downarrow$ \\
\midrule
Baseline & 85.2 & 45ms \\
\textbf{Ours} & \textbf{92.1} & 38ms \\
\bottomrule
\end{tabular}
```
Rules:
- Bold best value per metric
- Include direction symbols ($\uparrow$ higher better, $\downarrow$ lower better)
- Right-align numerical columns
- Consistent decimal precision
**Figures**:
- **Vector graphics** (PDF, EPS) for all plots and diagrams: `plt.savefig('fig.pdf')`
- **Raster** (PNG 600 DPI) only for photographs
- **Colorblind-safe palettes** (Okabe-Ito or Paul Tol)
- Verify **grayscale readability** (8% of men have color vision deficiency)
- **No title inside figure**; the caption serves this function
- **Self-contained captions**; the reader should understand without the main text
### Conference Resubmission
For converting between venues, see Phase 7 (Submission Preparation); it covers the full conversion workflow, page-change table, and post-rejection guidance.
### Professional LaTeX Preamble
Add these packages to any paper for professional quality. They are compatible with all major conference style files:
```latex
% --- Professional Packages (add after conference style file) ---
% Typography
\usepackage{microtype} % Microtypographic improvements (protrusion, expansion)
% Makes text noticeably more polished — always include
% Tables
\usepackage{booktabs} % Professional table rules (\toprule, \midrule, \bottomrule)
\usepackage{siunitx} % Consistent number formatting, decimal alignment
% Usage: \num{12345} → 12,345; \SI{3.5}{GHz} → 3.5 GHz
% Table alignment: S column type for decimal-aligned numbers
% Figures
\usepackage{graphicx} % Include graphics (\includegraphics)
\usepackage{subcaption} % Subfigures with (a), (b), (c) labels
% Usage: \begin{subfigure}{0.48\textwidth} ... \end{subfigure}
% Diagrams and Algorithms
\usepackage{tikz} % Programmable vector diagrams
\usetikzlibrary{arrows.meta, positioning, shapes.geometric, calc, fit, backgrounds}
\usepackage[ruled,vlined]{algorithm2e} % Professional pseudocode
% Alternative: \usepackage{algorithmicx} if template bundles it
% Cross-references
\usepackage{cleveref} % Smart references: \cref{fig:x} → "Figure 1"
% MUST be loaded AFTER hyperref
% Handles: figures, tables, sections, equations, algorithms
% Math (usually included by conference .sty, but verify)
\usepackage{amsmath,amssymb} % AMS math environments and symbols
\usepackage{mathtools} % Extends amsmath (dcases, coloneqq, etc.)
% Colors (for figures and diagrams)
\usepackage{xcolor} % Color management
% Okabe-Ito colorblind-safe palette:
\definecolor{okblue}{HTML}{0072B2}
\definecolor{okorange}{HTML}{E69F00}
\definecolor{okgreen}{HTML}{009E73}
\definecolor{okred}{HTML}{D55E00}
\definecolor{okpurple}{HTML}{CC79A7}
\definecolor{okcyan}{HTML}{56B4E9}
\definecolor{okyellow}{HTML}{F0E442}
```
**Notes:**
- `microtype` is the single highest-impact package for visual quality. It adjusts character spacing at a sub-pixel level. Always include it.
- `siunitx` handles decimal alignment in tables via the `S` column type, eliminating manual spacing.
- `cleveref` must be loaded **after** `hyperref`. Most conference .sty files load hyperref, so put cleveref last.
- Check if the conference template already loads any of these (especially `algorithm`, `amsmath`, `graphicx`). Don't double-load.
### siunitx Table Alignment
`siunitx` makes number-heavy tables significantly more readable:
```latex
\begin{tabular}{l S[table-format=2.1] S[table-format=2.1] S[table-format=2.1]}
\toprule
Method & {Accuracy $\uparrow$} & {F1 $\uparrow$} & {Latency (ms) $\downarrow$} \\
\midrule
Baseline & 85.2 & 83.7 & 45.3 \\
Ablation (no X) & 87.1 & 85.4 & 42.1 \\
\textbf{Ours} & \textbf{92.1} & \textbf{90.8} & \textbf{38.7} \\
\bottomrule
\end{tabular}
```
The `S` column type auto-aligns on the decimal point. Headers in `{}` escape the alignment.
### Subfigures
Standard pattern for side-by-side figures:
```latex
\begin{figure}[t]
\centering
\begin{subfigure}[b]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{fig_results_a.pdf}
\caption{Results on Dataset A.}
\label{fig:results-a}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{fig_results_b.pdf}
\caption{Results on Dataset B.}
\label{fig:results-b}
\end{subfigure}
\caption{Comparison of our method across two datasets. (a) shows the scaling
behavior and (b) shows the ablation results. Both use 5 random seeds.}
\label{fig:results}
\end{figure}
```
Use `\cref{fig:results}` → "Figure 1", `\cref{fig:results-a}` → "Figure 1a".
### Pseudocode with algorithm2e
```latex
\begin{algorithm}[t]
\caption{Iterative Refinement with Judge Panel}
\label{alg:method}
\KwIn{Task $T$, model $M$, judges $J_1 \ldots J_n$, convergence threshold $k$}
\KwOut{Final output $A^*$}
$A \gets M(T)$ \tcp*{Initial generation}
$\text{streak} \gets 0$\;
\While{$\text{streak} < k$}{
$C \gets \text{Critic}(A, T)$ \tcp*{Identify weaknesses}
$B \gets M(T, C)$ \tcp*{Revised version addressing critique}
$AB \gets \text{Synthesize}(A, B)$ \tcp*{Merge best elements}
\ForEach{judge $J_i$}{
$\text{rank}_i \gets J_i(\text{shuffle}(A, B, AB))$ \tcp*{Blind ranking}
}
$\text{winner} \gets \text{BordaCount}(\text{ranks})$\;
\eIf{$\text{winner} = A$}{
$\text{streak} \gets \text{streak} + 1$\;
}{
$A \gets \text{winner}$; $\text{streak} \gets 0$\;
}
}
\Return{$A$}\;
\end{algorithm}
```
### TikZ Diagram Patterns
TikZ is the standard for method diagrams in ML papers. Common patterns:
**Pipeline/Flow Diagram** (most common in ML papers):
```latex
\begin{figure}[t]
\centering
\begin{tikzpicture}[
node distance=1.8cm,
box/.style={rectangle, draw, rounded corners, minimum height=1cm,
minimum width=2cm, align=center, font=\small},
arrow/.style={-{Stealth[length=3mm]}, thick},
]
\node[box, fill=okcyan!20] (input) {Input\\$x$};
\node[box, fill=okblue!20, right of=input] (encoder) {Encoder\\$f_\theta$};
\node[box, fill=okgreen!20, right of=encoder] (latent) {Latent\\$z$};
\node[box, fill=okorange!20, right of=latent] (decoder) {Decoder\\$g_\phi$};
\node[box, fill=okred!20, right of=decoder] (output) {Output\\$\hat{x}$};
\draw[arrow] (input) -- (encoder);
\draw[arrow] (encoder) -- (latent);
\draw[arrow] (latent) -- (decoder);
\draw[arrow] (decoder) -- (output);
\end{tikzpicture}
\caption{Architecture overview. The encoder maps input $x$ to latent
representation $z$, which the decoder reconstructs.}
\label{fig:architecture}
\end{figure}
```
**Comparison/Matrix Diagram** (for showing method variants):
```latex
\begin{tikzpicture}[
cell/.style={rectangle, draw, minimum width=2.5cm, minimum height=1cm,
align=center, font=\small},
header/.style={cell, fill=gray!20, font=\small\bfseries},
]
% Headers
\node[header] at (0, 0) {Method};
\node[header] at (3, 0) {Converges?};
\node[header] at (6, 0) {Quality?};
% Rows
\node[cell] at (0, -1) {Single Pass};
\node[cell, fill=okgreen!15] at (3, -1) {N/A};
\node[cell, fill=okorange!15] at (6, -1) {Baseline};
\node[cell] at (0, -2) {Critique+Revise};
\node[cell, fill=okred!15] at (3, -2) {No};
\node[cell, fill=okred!15] at (6, -2) {Degrades};
\node[cell] at (0, -3) {Ours};
\node[cell, fill=okgreen!15] at (3, -3) {Yes ($k$=2)};
\node[cell, fill=okgreen!15] at (6, -3) {Improves};
\end{tikzpicture}
```
**Iterative Loop Diagram** (for methods with feedback):
```latex
\begin{tikzpicture}[
node distance=2cm,
box/.style={rectangle, draw, rounded corners, minimum height=0.8cm,
minimum width=1.8cm, align=center, font=\small},
arrow/.style={-{Stealth[length=3mm]}, thick},
label/.style={font=\scriptsize, midway, above},
]
\node[box, fill=okblue!20] (gen) {Generator};
\node[box, fill=okred!20, right=2.5cm of gen] (critic) {Critic};
\node[box, fill=okgreen!20, below=1.5cm of $(gen)!0.5!(critic)$] (judge) {Judge Panel};
\draw[arrow] (gen) -- node[label] {output $A$} (critic);
\draw[arrow] (critic) -- node[label, right] {critique $C$} (judge);
\draw[arrow] (judge) -| node[label, left, pos=0.3] {winner} (gen);
\end{tikzpicture}
```
### latexdiff for Revision Tracking
Essential for rebuttals: generates a marked-up PDF showing changes between versions:
```bash
# Install
# macOS: brew install latexdiff (or comes with TeX Live)
# Linux: sudo apt install latexdiff
# Generate diff
latexdiff paper_v1.tex paper_v2.tex > paper_diff.tex
pdflatex paper_diff.tex
# For multi-file projects (with \input{} or \include{})
latexdiff --flatten paper_v1.tex paper_v2.tex > paper_diff.tex
```
This produces a PDF with deletions in red strikethrough and additions in blue, the standard format for rebuttal supplements.
### SciencePlots for matplotlib
Install and use for publication-quality plots:
```bash
pip install SciencePlots
```
```python
import numpy as np
import matplotlib.pyplot as plt
import scienceplots  # registers styles

# Example data so the snippet runs standalone
x = np.linspace(0, 1000, 50)
y = 0.9 - 0.5 * np.exp(-x / 300)
y2 = 0.8 - 0.5 * np.exp(-x / 300)

# Use science style (IEEE-like, clean)
with plt.style.context(['science', 'no-latex']):
    fig, ax = plt.subplots(figsize=(3.5, 2.5))  # Single-column width
    ax.plot(x, y, label='Ours', color='#0072B2')
    ax.plot(x, y2, label='Baseline', color='#D55E00', linestyle='--')
    ax.set_xlabel('Training Steps')
    ax.set_ylabel('Accuracy')
    ax.legend()
    fig.savefig('paper/fig_results.pdf', bbox_inches='tight')

# Available styles: 'science', 'ieee', 'nature', 'science+ieee'
# Add 'no-latex' if LaTeX is not installed on the machine generating plots
```
**Standard figure sizes** (two-column format):
- Single column: `figsize=(3.5, 2.5)` (fits in one column)
- Double column: `figsize=(7.0, 3.0)` (spans both columns)
- Square: `figsize=(3.5, 3.5)` (for heatmaps, confusion matrices)
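To keep sizes consistent across many figures, a tiny convenience helper (an assumption of this guide, not part of SciencePlots or matplotlib):
```python
import matplotlib.pyplot as plt

FIG_SIZES = {
    "single": (3.5, 2.5),  # one column
    "double": (7.0, 3.0),  # full width
    "square": (3.5, 3.5),  # heatmaps, confusion matrices
}

def new_fig(kind="single"):
    """Return (fig, ax) with a standard paper figure size."""
    return plt.subplots(figsize=FIG_SIZES[kind])
```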
---
## Phase 6: Self-Review & Revision
**Goal**: Simulate the review process before submission. Catch weaknesses early.
### Step 6.1: Simulate Reviews
Generate reviews from multiple perspectives using strong models (Opus 4, Sonnet 4.6, Gemini 2.5 Pro). Use the reviewer guidelines from the target venue.
**Review prompt template:**
```
You are an expert reviewer for [VENUE]. Review this paper according to the
official reviewer guidelines. Evaluate:
1. Quality (technical soundness, baselines, claims supported by evidence)
2. Clarity (writing, notation consistency, reproducibility)
3. Significance (impact, importance of the problem)
4. Originality (novelty, new insights)
Provide:
- Summary (2-3 sentences)
- Strengths (bullet list)
- Weaknesses (bullet list, most critical first)
- Questions for authors
- Missing references
- Score (1-6 on NeurIPS scale)
- Confidence (1-5)
```
### Step 6.2: Prioritize Feedback
After collecting reviews, categorize:
| Priority | Action |
|----------|--------|
| **Critical** (technical flaw, missing baseline) | Must fix. May require new experiments (back to Phase 2) |
| **High** (clarity issue, missing ablation) | Should fix in this revision |
| **Medium** (minor writing issues, extra experiments) | Fix if time allows |
| **Low** (style preferences, tangential suggestions) | Note for future work |
### Step 6.3: Revision Cycle
For each critical/high issue:
1. Identify the specific section(s) affected
2. Draft the fix
3. Verify the fix doesn't break other claims
4. Update the paper
5. Re-check against the reviewer's concern
### Step 6.4: Rebuttal Writing
When responding to actual reviews (post-submission), rebuttals are a distinct skill from revision:
**Format**: Point-by-point. For each reviewer concern:
```
> R1-W1: "The paper lacks comparison with Method X."
We thank the reviewer for this suggestion. We have added a comparison with
Method X in Table 3 (revised). Our method outperforms X by 3.2pp on [metric]
(p<0.05). We note that X requires 2x our compute budget.
```
**Rules**:
- Address every concern; reviewers notice if you skip one
- Lead with the strongest responses
- Be concise and direct; reviewers read dozens of rebuttals
- Include new results if you ran experiments during the rebuttal period
- Never be defensive or dismissive, even of weak criticisms
- Use `latexdiff` to generate a marked-up PDF showing changes (see the latexdiff section in Phase 5)
- Thank reviewers for specific, actionable feedback (not generic praise)
**What NOT to do**: "We respectfully disagree" without evidence. "This is out of scope" without explanation. Ignoring a weakness by only responding to strengths.
### Step 6.5: Paper Evolution Tracking
Save snapshots at key milestones:
```
paper/
  paper.tex                    # Current working version
  paper_v1_first_draft.tex     # First complete draft
  paper_v2_post_review.tex     # After simulated review
  paper_v3_pre_submission.tex  # Final before submission
  paper_v4_camera_ready.tex    # Post-acceptance final
```
---
## Phase 7: Submission Preparation
**Goal**: Final checks, formatting, and submission.
### Step 7.1: Conference Checklist
Every venue has mandatory checklists. Complete them carefully; incomplete checklists can result in desk rejection.
See [references/checklists.md](references/checklists.md) for:
- NeurIPS 16-item paper checklist
- ICML broader impact + reproducibility
- ICLR LLM disclosure policy
- ACL mandatory limitations section
- Universal pre-submission checklist
### Step 7.2: Anonymization Checklist
Double-blind review means reviewers cannot know who wrote the paper. Check ALL of these:
```
Anonymization Checklist:
- [ ] No author names or affiliations anywhere in the PDF
- [ ] No acknowledgments section (add after acceptance)
- [ ] Self-citations written in third person: "Smith et al. [1] showed..." not "We previously showed [1]..."
- [ ] No GitHub/GitLab URLs pointing to your personal repos
- [ ] Use Anonymous GitHub (https://anonymous.4open.science/) for code links
- [ ] No institutional logos or identifiers in figures
- [ ] No file metadata containing author names (check PDF properties)
- [ ] No "our previous work" or "in our earlier paper" phrasing
- [ ] Dataset names don't reveal institution (rename if needed)
- [ ] Supplementary materials don't contain identifying information
```
**Common mistakes**: Git commit messages visible in supplementary code, watermarked figures from institutional tools, acknowledgments left in from a previous draft, arXiv preprints posted in violation of the venue's anonymity period.
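The metadata item is easy to check programmatically. A sketch assuming `pypdf` is installed and the compiled paper is `main.pdf`:
```python
from pypdf import PdfReader

meta = PdfReader("main.pdf").metadata or {}
for key in ("/Author", "/Creator", "/Producer", "/Title"):
    # Any author or institution string here deanonymizes the submission
    print(key, "=", meta.get(key))
```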
### Step 7.3: Formatting Verification
```
Pre-Submission Format Check:
- [ ] Page limit respected (excluding references and appendix)
- [ ] All figures are vector (PDF) or high-res raster (600 DPI PNG)
- [ ] All figures readable in grayscale
- [ ] All tables use booktabs
- [ ] References compile correctly (no "?" in citations)
- [ ] No overfull hboxes in critical areas
- [ ] Appendix clearly labeled and separated
- [ ] Required sections present (limitations, broader impact, etc.)
```
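Several of these checks can be scripted against the LaTeX log. A sketch assuming the build log is `main.log`:
```python
from pathlib import Path

log = Path("main.log").read_text(errors="ignore")
for marker in ("LaTeX Warning: Citation", "Overfull \\hbox", "undefined references"):
    hits = [line for line in log.splitlines() if marker in line]
    print(f"{marker!r}: {len(hits)} hit(s)")
    for line in hits[:5]:
        print("  ", line.strip())
```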
### Step 7.4: Final Compilation
```bash
# Clean build
rm -f *.aux *.bbl *.blg *.log *.out *.pdf
latexmk -pdf main.tex
# Or manual
pdflatex main.tex
bibtex main
pdflatex main.tex
pdflatex main.tex
```
### Step 7.5: Conference-Specific Requirements
| Venue | Special Requirements |
|-------|---------------------|
| **NeurIPS** | Paper checklist in appendix, lay summary if accepted |
| **ICML** | Broader Impact Statement (after conclusion, doesn't count toward limit) |
| **ICLR** | LLM disclosure required, reciprocal reviewing agreement |
| **ACL** | Mandatory Limitations section, Responsible NLP checklist |
| **AAAI** | Strict style file; no modifications whatsoever |
| **COLM** | Frame contribution for language model community |
### Step 7.6: Conference Resubmission & Format Conversion
When converting between venues, **never copy LaTeX preambles between templates**:
```bash
# 1. Start fresh with target template
cp -r templates/icml2026/ new_submission/
# 2. Copy ONLY content sections (not preamble)
# - Abstract text, section content, figures, tables, bib entries
# 3. Adjust for page limits
# 4. Add venue-specific required sections
# 5. Update references
```
| From → To | Page Change | Key Adjustments |
|-----------|-------------|-----------------|
| NeurIPS → ICML | 9 → 8 | Cut 1 page, add Broader Impact |
| ICML → ICLR | 8 → 9 | Expand experiments, add LLM disclosure |
| NeurIPS → ACL | 9 → 8 | Restructure for NLP conventions, add Limitations |
| ICLR → AAAI | 9 → 7 | Significant cuts, strict style adherence |
| Any → COLM | varies → 9 | Reframe for language model focus |
When cutting pages: move proofs to appendix, condense related work, combine tables, use subfigures.
When expanding: add ablations, expand limitations, include additional baselines, add qualitative examples.
**After rejection**: Address reviewer concerns in the new version, but don't include a "changes" section or reference the previous submission (blind review).
### Step 7.7: Camera-Ready Preparation (Post-Acceptance)
After acceptance, prepare the camera-ready version:
```
Camera-Ready Checklist:
- [ ] De-anonymize: add author names, affiliations, email addresses
- [ ] Add Acknowledgments section (funding, compute grants, helpful reviewers)
- [ ] Add public code/data URL (real GitHub, not anonymous)
- [ ] Address any mandatory revisions from meta-reviewer
- [ ] Switch template to camera-ready mode (if applicable — e.g., AAAI \anon → \camera)
- [ ] Add copyright notice if required by venue
- [ ] Update any "anonymous" placeholders in text
- [ ] Verify final PDF compiles cleanly
- [ ] Check page limit for camera-ready (sometimes differs from submission)
- [ ] Upload supplementary materials (code, data, appendix) to venue portal
```
---
## Hermes Agent Integration
This skill is designed for the Hermes agent. It uses Hermes tools, delegation, scheduling, and memory for the full research lifecycle.
### Related Skills
Compose this skill with other Hermes skills for specific phases:
| Skill | When to Use | How to Load |
|-------|-------------|-------------|
| **arxiv** | Phase 1 (Literature Review): searching arXiv, generating BibTeX, finding related papers via Semantic Scholar | `skill_view("arxiv")` |
| **subagent-driven-development** | Phase 5 (Drafting): parallel section writing with 2-stage review (spec compliance then quality) | `skill_view("subagent-driven-development")` |
| **plan** | Phase 0 (Setup): creating structured plans before execution. Writes to `.hermes/plans/` | `skill_view("plan")` |
| **qmd** | Phase 1 (Literature): searching local knowledge bases (notes, transcripts, docs) via hybrid BM25+vector search | Install: `skill_manage("install", "qmd")` |
| **diagramming** | Phase 4-5: creating Excalidraw-based figures and architecture diagrams | `skill_view("diagramming")` |
| **data-science** | Phase 4 (Analysis): Jupyter live kernel for interactive analysis and visualization | `skill_view("data-science")` |
**This skill supersedes `ml-paper-writing`**: it contains all of ml-paper-writing's content plus the full experiment/analysis pipeline and autoreason methodology.
### Hermes Tools Reference
| Tool | Usage in This Pipeline |
|------|----------------------|
| **`terminal`** | LaTeX compilation (`latexmk -pdf`), git operations, launching experiments (`nohup python run.py &`), process checks |
| **`process`** | Background experiment management: `process("start", ...)`, `process("poll", pid)`, `process("log", pid)`, `process("kill", pid)` |
| **`execute_code`** | Run Python for citation verification, statistical analysis, data aggregation. Has tool access via RPC. |
| **`read_file`** / **`write_file`** / **`patch`** | Paper editing, experiment scripts, result files. Use `patch` for targeted edits to large .tex files. |
| **`web_search`** | Literature discovery: `web_search("transformer attention mechanism 2024")` |
| **`web_extract`** | Fetch paper content, verify citations: `web_extract("https://arxiv.org/abs/2303.17651")` |
| **`delegate_task`** | **Parallel section drafting**: spawn isolated subagents for each section. Also for concurrent citation verification. |
| **`todo`** | Primary state tracker across sessions. Update after every phase transition. |
| **`memory`** | Persist key decisions across sessions: contribution framing, venue choice, reviewer feedback. |
| **`cronjob`** | Schedule experiment monitoring, deadline countdowns, automated arXiv checks. |
| **`clarify`** | Ask the user targeted questions when blocked (venue choice, contribution framing). |
| **`send_message`** | Notify user when experiments complete or drafts are ready, even if user isn't in chat. |
### Tool Usage Patterns
**Experiment monitoring** (most common):
```
terminal("ps aux | grep <pattern>")
→ terminal("tail -30 <logfile>")
→ terminal("ls results/")
→ execute_code("analyze results JSON, compute metrics")
→ terminal("git add -A && git commit -m '<descriptive message>' && git push")
→ send_message("Experiment complete: <summary>")
```
**Parallel section drafting** (using delegation):
```
delegate_task("Draft the Methods section based on these experiment scripts and configs.
Include: pseudocode, all hyperparameters, architectural details sufficient for
reproduction. Write in LaTeX using the neurips2025 template conventions.")
delegate_task("Draft the Related Work section. Use web_search and web_extract to
find papers. Verify every citation via Semantic Scholar. Group by methodology.")
delegate_task("Draft the Experiments section. Read all result files in results/.
State which claim each experiment supports. Include error bars and significance.")
```
Each delegate runs as a **fresh subagent** with no shared context; provide all necessary information in the prompt. Collect outputs and integrate.
**Citation verification** (using execute_code):
```python
# In execute_code:
from semanticscholar import SemanticScholar
import requests

sch = SemanticScholar()
results = sch.search_paper("attention mechanism transformers", limit=5)
for paper in results:
    doi = (paper.externalIds or {}).get('DOI')  # guard against missing IDs
    if doi:
        bibtex = requests.get(f"https://doi.org/{doi}",
                              headers={"Accept": "application/x-bibtex"}).text
        print(bibtex)
```
### State Management with `memory` and `todo`
**`memory` tool**: persist key decisions (MEMORY.md is bounded to ~2200 chars):
```
memory("add", "Paper: autoreason. Venue: NeurIPS 2025 (9 pages).
Contribution: structured refinement works when generation-evaluation gap is wide.
Key results: Haiku 42/42, Sonnet 3/5, S4.6 constrained 2/3.
Status: Phase 5 — drafting Methods section.")
```
Update memory after major decisions or phase transitions. This persists across sessions.
**`todo` tool**: track granular progress:
```
todo("add", "Design constrained task experiments for Sonnet 4.6")
todo("add", "Run Haiku baseline comparison")
todo("add", "Draft Methods section")
todo("update", id=3, status="in_progress")
todo("update", id=1, status="completed")
```
**Session startup protocol:**
```
1. todo("list") # Check current task list
2. memory("read") # Recall key decisions
3. terminal("git log --oneline -10") # Check recent commits
4. terminal("ps aux | grep python") # Check running experiments
5. terminal("ls results/ | tail -20") # Check for new results
6. Report status to user, ask for direction
```
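Step 5's "check for new results" can be made precise with a short `execute_code` sketch; it assumes a flat `results/` directory inside the git working tree:
```python
# In execute_code: list result files modified since the last commit
import os
import subprocess

last_commit = int(subprocess.check_output(
    ["git", "log", "-1", "--format=%ct"]).strip())
new = [f for f in os.listdir("results")
       if os.path.getmtime(os.path.join("results", f)) > last_commit]
print(f"{len(new)} result file(s) newer than last commit:", sorted(new))
```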
### Cron Monitoring with `cronjob`
Use the `cronjob` tool to schedule periodic experiment checks:
```
cronjob("create", {
"schedule": "*/30 * * * *", # Every 30 minutes
"prompt": "Check experiment status:
1. ps aux | grep run_experiment
2. tail -30 logs/experiment_haiku.log
3. ls results/haiku_baselines/
4. If complete: read results, compute Borda scores,
git add -A && git commit -m 'Add Haiku results' && git push
5. Report: table of results, key finding, next step
6. If nothing changed: respond with [SILENT]"
})
```
**[SILENT] protocol**: When nothing has changed since the last check, respond with exactly `[SILENT]`. This suppresses notification delivery to the user. Only report when there are genuine changes worth knowing about.
**Deadline tracking**:
```
cronjob("create", {
"schedule": "0 9 * * *", # Daily at 9am
"prompt": "NeurIPS 2025 deadline: May 22. Today is {date}.
Days remaining: {compute}.
Check todo list — are we on track?
If <7 days: warn user about remaining tasks."
})
```
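The `{compute}` placeholder is filled in at run time by the scheduled agent; via `execute_code` it is a two-liner (the date mirrors the illustrative deadline in the prompt above):
```python
# In execute_code: days until the configured deadline (illustrative date)
from datetime import date

deadline = date(2025, 5, 22)  # matches the NeurIPS 2025 date in the prompt
print((deadline - date.today()).days, "days remaining")
```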
### Communication Patterns
**When to notify the user** (via `send_message` or direct response):
- Experiment batch completed (with results table)
- Unexpected finding or failure requiring decision
- Draft section ready for review
- Deadline approaching with incomplete tasks
**When NOT to notify:**
- Experiment still running, no new results → `[SILENT]`
- Routine monitoring with no changes → `[SILENT]`
- Intermediate steps that don't need attention
**Report format**: always include structured data:
```
## Experiment: <name>
Status: Complete / Running / Failed
| Task | Method A | Method B | Method C |
|------|---------|---------|---------|
| Task 1 | 85.2 | 82.1 | **89.4** |
Key finding: <one sentence>
Next step: <what happens next>
```
### Decision Points Requiring Human Input
Use `clarify` for targeted questions when genuinely blocked:
| Decision | When to Ask |
|----------|-------------|
| Target venue | Before starting paper (affects page limits, framing) |
| Contribution framing | When multiple valid framings exist |
| Experiment priority | When TODO list has more experiments than time allows |
| Submission readiness | Before final submission |
**Do NOT ask about** (be proactive, make a choice, flag it):
- Word choice, section ordering
- Which specific results to highlight
- Citation completeness (draft with what you find, note gaps)
---
## Reviewer Evaluation Criteria
Understanding what reviewers look for helps focus effort:
| Criterion | What They Check |
|-----------|----------------|
| **Quality** | Technical soundness, well-supported claims, fair baselines |
| **Clarity** | Clear writing, reproducible by experts, consistent notation |
| **Significance** | Community impact, advances understanding |
| **Originality** | New insights (doesn't require new method) |
**Scoring (NeurIPS 6-point scale):**
- 6: Strong Accept (groundbreaking, flawless)
- 5: Accept (technically solid, high impact)
- 4: Borderline Accept (solid, but limited evaluation)
- 3: Borderline Reject (weaknesses outweigh strengths)
- 2: Reject (technical flaws)
- 1: Strong Reject (known results or ethics issues)
See [references/reviewer-guidelines.md](references/reviewer-guidelines.md) for detailed guidelines, common concerns, and rebuttal strategies.
---
## Common Issues and Solutions
| Issue | Solution |
|-------|----------|
| Abstract too generic | Delete the first sentence if it could open any ML paper. Start with your specific contribution. |
| Introduction exceeds 1.5 pages | Split background into Related Work. Front-load contribution bullets. |
| Experiments lack explicit claims | Add: "This experiment tests whether [specific claim]..." before each one. |
| Reviewers find paper hard to follow | Add signposting, use consistent terminology, make figure captions self-contained. |
| Missing statistical significance | Add error bars, number of runs, statistical tests, confidence intervals (see the sketch after this table). |
| Scope creep in experiments | Every experiment must map to a specific claim. Cut experiments that don't. |
| Paper rejected, need to resubmit | See Conference Resubmission in Phase 7. Address reviewer concerns without referencing reviews. |
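For the missing-significance fix, a minimal sketch using `scipy` (already in this skill's dependencies) runs a paired t-test over per-seed scores and reports a confidence interval for the gap; the numbers below are illustrative, not real results:
```python
# Paired comparison of two methods across seeds (illustrative numbers)
import numpy as np
from scipy import stats

method_a = np.array([85.2, 84.9, 85.6, 84.7, 85.1])
method_b = np.array([82.1, 82.8, 81.9, 82.4, 82.0])

t, p = stats.ttest_rel(method_a, method_b)
diff = method_a - method_b
lo, hi = stats.t.interval(0.95, len(diff) - 1,
                          loc=diff.mean(), scale=stats.sem(diff))
print(f"paired t={t:.2f}, p={p:.4f}, 95% CI for the gap: [{lo:.2f}, {hi:.2f}]")
```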
---
## Reference Documents
| Document | Contents |
|----------|----------|
| [references/writing-guide.md](references/writing-guide.md) | Gopen & Swan 7 principles, Perez micro-tips, Lipton word choice, Steinhardt precision, figure design |
| [references/citation-workflow.md](references/citation-workflow.md) | Citation APIs, Python code, CitationManager class, BibTeX management |
| [references/checklists.md](references/checklists.md) | NeurIPS 16-item, ICML, ICLR, ACL requirements, universal pre-submission checklist |
| [references/reviewer-guidelines.md](references/reviewer-guidelines.md) | Evaluation criteria, scoring, common concerns, rebuttal template |
| [references/sources.md](references/sources.md) | Complete bibliography of all writing guides, conference guidelines, APIs |
| [references/experiment-patterns.md](references/experiment-patterns.md) | Experiment design patterns, evaluation protocols, monitoring, error recovery |
| [references/autoreason-methodology.md](references/autoreason-methodology.md) | Autoreason loop, strategy selection, model guide, prompts, scope constraints, Borda scoring |
### LaTeX Templates
Templates in `templates/` for: **NeurIPS 2025**, **ICML 2026**, **ICLR 2026**, **ACL**, **AAAI 2026**, **COLM 2025**.
See [templates/README.md](templates/README.md) for compilation instructions.
### Key External Sources
**Writing Philosophy:**
- [Neel Nanda: How to Write ML Papers](https://www.alignmentforum.org/posts/eJGptPbbFPZGLpjsp/highly-opinionated-advice-on-how-to-write-ml-papers)
- [Sebastian Farquhar: How to Write ML Papers](https://sebastianfarquhar.com/on-research/2024/11/04/how_to_write_ml_papers/)
- [Gopen & Swan: Science of Scientific Writing](https://cseweb.ucsd.edu/~swanson/papers/science-of-writing.pdf)
- [Lipton: Heuristics for Scientific Writing](https://www.approximatelycorrect.com/2018/01/29/heuristics-technical-scientific-writing-machine-learning-perspective/)
- [Perez: Easy Paper Writing Tips](https://ethanperez.net/easy-paper-writing-tips/)
**APIs:** [Semantic Scholar](https://api.semanticscholar.org/api-docs/) | [CrossRef](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) | [arXiv](https://info.arxiv.org/help/api/basics.html)
**Venues:** [NeurIPS](https://neurips.cc/Conferences/2025/PaperInformation/StyleFiles) | [ICML](https://icml.cc/Conferences/2025/AuthorInstructions) | [ICLR](https://iclr.cc/Conferences/2026/AuthorGuide) | [ACL](https://github.com/acl-org/acl-style-files)