--- name: research-paper-writing title: Research Paper Writing Pipeline description: End-to-end pipeline for writing ML/AI research papers — from experiment design through analysis, drafting, revision, and submission. Covers NeurIPS, ICML, ICLR, ACL, AAAI, COLM. Integrates automated experiment monitoring, statistical analysis, iterative writing, and citation verification. version: 1.0.0 author: Orchestra Research license: MIT dependencies: [semanticscholar, arxiv, habanero, requests, scipy, numpy, matplotlib, SciencePlots] platforms: [linux, macos] metadata: hermes: tags: [Research, Paper Writing, Experiments, ML, AI, NeurIPS, ICML, ICLR, ACL, AAAI, COLM, LaTeX, Citations, Statistical Analysis] category: research related_skills: [arxiv, ml-paper-writing, subagent-driven-development, plan] requires_toolsets: [terminal, files] --- # Research Paper Writing Pipeline End-to-end pipeline for producing publication-ready ML/AI research papers targeting **NeurIPS, ICML, ICLR, ACL, AAAI, and COLM**. This skill covers the full research lifecycle: experiment design, execution, monitoring, analysis, paper writing, review, revision, and submission. This is **not a linear pipeline** — it is an iterative loop. Results trigger new experiments. Reviews trigger new analysis. The agent must handle these feedback loops. ``` ┌─────────────────────────────────────────────────────────────┐ │ RESEARCH PAPER PIPELINE │ │ │ │ Phase 0: Project Setup ──► Phase 1: Literature Review │ │ │ │ │ │ ▼ ▼ │ │ Phase 2: Experiment Phase 5: Paper Drafting ◄──┐ │ │ Design │ │ │ │ │ ▼ │ │ │ ▼ Phase 6: Self-Review │ │ │ Phase 3: Execution & & Revision ──────────┘ │ │ Monitoring │ │ │ │ ▼ │ │ ▼ Phase 7: Submission │ │ Phase 4: Analysis ─────► (feeds back to Phase 2 or 5) │ │ │ └─────────────────────────────────────────────────────────────┘ ``` --- ## When To Use This Skill Use this skill when: - **Starting a new research paper** from an existing codebase or idea - **Designing and running experiments** to support paper claims - **Writing or revising** any section of a research paper - **Preparing for submission** to a specific conference - **Responding to reviews** with additional experiments or revisions - **Converting** a paper between conference formats ## Core Philosophy 1. **Be proactive.** Deliver complete drafts, not questions. Scientists are busy — produce something concrete they can react to, then iterate. 2. **Never hallucinate citations.** AI-generated citations have ~40% error rate. Always fetch programmatically. Mark unverifiable citations as `[CITATION NEEDED]`. 3. **Paper is a story, not a collection of experiments.** Every paper needs one clear contribution stated in a single sentence. If you can't do that, the paper isn't ready. 4. **Experiments serve claims.** Every experiment must explicitly state which claim it supports. Never run experiments that don't connect to the paper's narrative. 5. **Commit early, commit often.** Every completed experiment batch, every paper draft update — commit with descriptive messages. Git log is the experiment history. ### Proactivity and Collaboration **Default: Be proactive. Draft first, ask with the draft.** | Confidence Level | Action | |-----------------|--------| | **High** (clear repo, obvious contribution) | Write full draft, deliver, iterate on feedback | | **Medium** (some ambiguity) | Write draft with flagged uncertainties, continue | | **Low** (major unknowns) | Ask 1-2 targeted questions via `clarify`, then draft | | Section | Draft Autonomously? 
| Flag With Draft |
|---------|-------------------|-----------------|
| Abstract | Yes | "Framed contribution as X — adjust if needed" |
| Introduction | Yes | "Emphasized problem Y — correct if wrong" |
| Methods | Yes | "Included details A, B, C — add missing pieces" |
| Experiments | Yes | "Highlighted results 1, 2, 3 — reorder if needed" |
| Related Work | Yes | "Cited papers X, Y, Z — add any I missed" |

**Block for input only when**: target venue unclear, multiple contradictory framings, results seem incomplete, explicit request to review first.

---

## Phase 0: Project Setup

**Goal**: Establish the workspace, understand existing work, identify the contribution.

### Step 0.1: Explore the Repository

```bash
# Understand project structure
ls -la
find . -name "*.py" | head -30
find . -name "*.md" -o -name "*.txt" | xargs grep -l -i "result\|conclusion\|finding"
```

Look for:
- `README.md` — project overview and claims
- `results/`, `outputs/`, `experiments/` — existing findings
- `configs/` — experimental settings
- `.bib` files — existing citations
- Draft documents or notes

### Step 0.2: Organize the Workspace

Establish a consistent workspace structure:

```
workspace/
  paper/        # LaTeX source, figures, compiled PDFs
  experiments/  # Experiment runner scripts
  code/         # Core method implementation
  results/      # Raw experiment results (auto-generated)
  tasks/        # Task/benchmark definitions
  human_eval/   # Human evaluation materials (if needed)
```

### Step 0.3: Set Up Version Control

```bash
git init                         # if not already
git remote add origin <remote-url>
git checkout -b paper-draft      # or main
```

**Git discipline**: Every completed experiment batch gets committed with a descriptive message. Example:

```
Add Monte Carlo constrained results (5 runs, Sonnet 4.6, policy memo task)
Add Haiku baseline comparison: autoreason vs refinement baselines at cheap model tier
```

### Step 0.4: Identify the Contribution

Before writing anything, articulate:
- **The What**: What is the single thing this paper contributes?
- **The Why**: What evidence supports it?
- **The So What**: Why should readers care?

> Propose to the scientist: "Based on my understanding, the main contribution is: [one sentence]. The key results show [Y]. Is this the framing you want?"

### Step 0.5: Create a TODO List

Use the `todo` tool to create a structured project plan:

```
Research Paper TODO:
- [ ] Define one-sentence contribution
- [ ] Literature review (related work + baselines)
- [ ] Design core experiments
- [ ] Run experiments
- [ ] Analyze results
- [ ] Write first draft
- [ ] Self-review (simulate reviewers)
- [ ] Revise based on review
- [ ] Submission prep
```

Update this throughout the project. It serves as the persistent state across sessions.

---

## Phase 1: Literature Review

**Goal**: Find related work, identify baselines, gather citations.

### Step 1.1: Identify Seed Papers

Start from papers already referenced in the codebase:

```bash
# Via terminal:
grep -r "arxiv\|doi\|cite" --include="*.md" --include="*.bib" --include="*.py"
find . -name "*.bib"
```

### Step 1.2: Search for Related Work

**Load the `arxiv` skill** for structured paper discovery: `skill_view("arxiv")`. It provides arXiv REST API search, Semantic Scholar citation graphs, author profiles, and BibTeX generation.
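For quick scripted queries without loading the full skill, the `arxiv` package (already listed in this skill's dependencies) can be called directly — a minimal sketch; the query string is a stand-in for your actual technique and domain keywords:

```python
import arxiv  # listed in this skill's dependencies

# Hypothetical query -- substitute your own keywords
search = arxiv.Search(
    query="iterative refinement language models",
    max_results=10,
    sort_by=arxiv.SortCriterion.Relevance,
)
for result in arxiv.Client().results(search):
    print(result.published.year, result.title)
    print("   ", result.entry_id)
```

Useful for seeding the related-work list before the deeper Semantic Scholar pass below.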
Use `web_search` for broad discovery, `web_extract` for fetching specific papers: ``` # Via web_search: web_search("[main technique] + [application domain] site:arxiv.org") web_search("[baseline method] comparison ICML NeurIPS 2024") # Via web_extract (for specific papers): web_extract("https://arxiv.org/abs/2303.17651") ``` Additional search queries to try: ``` Search queries: - "[main technique] + [application domain]" - "[baseline method] comparison" - "[problem name] state-of-the-art" - Author names from existing citations ``` **Recommended**: Install **Exa MCP** for real-time academic search: ```bash claude mcp add exa -- npx -y mcp-remote "https://mcp.exa.ai/mcp" ``` ### Step 1.3: Verify Every Citation **NEVER generate BibTeX from memory. ALWAYS fetch programmatically.** For each citation, follow the mandatory 5-step process: ``` Citation Verification (MANDATORY per citation): 1. SEARCH → Query Semantic Scholar or Exa MCP with specific keywords 2. VERIFY → Confirm paper exists in 2+ sources (Semantic Scholar + arXiv/CrossRef) 3. RETRIEVE → Get BibTeX via DOI content negotiation (programmatically, not from memory) 4. VALIDATE → Confirm the claim you're citing actually appears in the paper 5. ADD → Add verified BibTeX to bibliography If ANY step fails → mark as [CITATION NEEDED], inform scientist ``` ```python # Fetch BibTeX via DOI import requests def doi_to_bibtex(doi: str) -> str: response = requests.get( f"https://doi.org/{doi}", headers={"Accept": "application/x-bibtex"} ) response.raise_for_status() return response.text ``` If you cannot verify a citation: ```latex \cite{PLACEHOLDER_author2024_verify_this} % TODO: Verify this citation exists ``` **Always tell the scientist**: "I've marked [X] citations as placeholders that need verification." See [references/citation-workflow.md](references/citation-workflow.md) for complete API documentation and the full `CitationManager` class. ### Step 1.4: Organize Related Work Group papers by methodology, not paper-by-paper: **Good**: "One line of work uses X's assumption [refs] whereas we use Y's assumption because..." **Bad**: "Smith et al. introduced X. Jones et al. introduced Y. We combine both." --- ## Phase 2: Experiment Design **Goal**: Design experiments that directly support paper claims. Every experiment must answer a specific question. ### Step 2.1: Map Claims to Experiments Create an explicit mapping: | Claim | Experiment | Expected Evidence | |-------|-----------|-------------------| | "Our method outperforms baselines" | Main comparison (Table 1) | Win rate, statistical significance | | "Effect is larger for weaker models" | Model scaling study | Monotonic improvement curve | | "Convergence requires scope constraints" | Constrained vs unconstrained | Convergence rate comparison | **Rule**: If an experiment doesn't map to a claim, don't run it. ### Step 2.2: Design Baselines Strong baselines are what separates accepted papers from rejected ones. Reviewers will ask: "Did they compare against X?" 
Standard baseline categories:
- **Naive baseline**: Simplest possible approach
- **Strong baseline**: Best known existing method
- **Ablation baselines**: Your method minus one component
- **Compute-matched baselines**: Same compute budget, different allocation

### Step 2.3: Define Evaluation Protocol

Before running anything, specify:
- **Metrics**: What you're measuring, direction symbols (higher/lower better)
- **Aggregation**: How results are combined across runs/tasks
- **Statistical tests**: What tests will establish significance
- **Sample sizes**: How many runs/problems/tasks

### Step 2.4: Write Experiment Scripts

Follow these patterns from successful research pipelines:

**Incremental saving** — save results after each step for crash recovery:

```python
import json, os

# Inside the per-problem/per-task loop:
result_path = f"results/{task}/{strategy}/result.json"
if os.path.exists(result_path):
    continue  # Skip already-completed work
# ... run experiment ...
with open(result_path, 'w') as f:
    json.dump(result, f, indent=2)
```

**Artifact preservation** — save all intermediate outputs:

```
results/
  <task>/
    <strategy>/
      <run>/
        final_output.md   # Final result
        history.json      # Full trajectory
        pass_01/          # Per-iteration artifacts
          version_a.md
          version_b.md
          critic.md
```

**Separation of concerns** — keep generation, evaluation, and visualization separate:

```
run_experiment.py        # Core experiment runner
run_baselines.py         # Baseline comparison
run_comparison_judge.py  # Blind evaluation
analyze_results.py       # Statistical analysis
make_charts.py           # Visualization
```

See [references/experiment-patterns.md](references/experiment-patterns.md) for complete design patterns, cron monitoring, and error recovery.

---

## Phase 3: Experiment Execution & Monitoring

**Goal**: Run experiments reliably, monitor progress, recover from failures.

### Step 3.1: Launch Experiments

Use `nohup` for long-running experiments:

```bash
nohup python run_experiment.py --config config.yaml > logs/experiment_01.log 2>&1 &
echo $!  # Record the PID
```

**Parallel execution**: Run independent experiments simultaneously, but be aware of API rate limits. 4+ concurrent experiments on the same API will slow each down.

### Step 3.2: Set Up Monitoring (Cron Pattern)

For long-running experiments, set up periodic status checks. The cron prompt should follow this template:

```
Monitor Prompt Template:
1. Check if process is still running: ps aux | grep <pid>
2. Read last 30 lines of log: tail -30 <logfile>
3. Check for completed results: ls <results-dir>
4. If results exist, read and report: cat <result files>
5. If all done, commit: git add -A && git commit -m "<message>" && git push
6. Report in structured format (tables with key metrics)
7. Answer the key analytical question for this experiment
```

**Silent mode**: If nothing has changed since the last check, respond with `[SILENT]` to suppress notification to the user. Only report when there's news.

### Step 3.3: Handle Failures

Common failure modes and recovery:

| Failure | Detection | Recovery |
|---------|-----------|----------|
| API rate limit / credit exhaustion | 402/429 errors in logs | Wait, then re-run (scripts skip completed work) |
| Process crash | PID gone, incomplete results | Re-run from last checkpoint |
| Timeout on hard problems | Process stuck, no log progress | Kill and skip, note in results |
| Wrong model ID | Errors referencing model name | Fix ID and re-run |

**Key**: Scripts should always check for existing results and skip completed work. This makes re-runs safe and efficient.
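The monitor template above can also be scripted directly. A minimal sketch in Python — the PID file and log path are illustrative assumptions (it presumes Step 3.1's `echo $!` was redirected to a file), not part of the skill:

```python
import os
from pathlib import Path

# Illustrative paths -- adapt to your layout
pid = int(Path("logs/experiment_01.pid").read_text().strip())

try:
    os.kill(pid, 0)  # signal 0 checks existence without sending anything
    status = "running"
except ProcessLookupError:
    status = "exited"

log_tail = Path("logs/experiment_01.log").read_text().splitlines()[-30:]
n_done = sum(1 for _ in Path("results").rglob("result.json"))

print(f"Process {pid}: {status}; {n_done} results so far")
print("\n".join(log_tail))
```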
### Step 3.4: Commit Completed Results

After each experiment batch completes:

```bash
git add -A
git commit -m "Add <experiment>: <key finding>"
git push
```

---

## Phase 4: Result Analysis

**Goal**: Extract findings, compute statistics, identify the story.

### Step 4.1: Aggregate Results

Write analysis scripts that:
1. Load all result files from a batch
2. Compute per-task and aggregate metrics
3. Generate summary tables

```python
# Standard analysis pattern
import json
from pathlib import Path

import numpy as np

results = {}
for result_file in Path("results/").rglob("result.json"):
    data = json.loads(result_file.read_text())
    strategy = result_file.parent.name
    task = result_file.parent.parent.name
    results.setdefault(strategy, {})[task] = data

# Compute aggregate metrics
for strategy, tasks in results.items():
    scores = [t["score"] for t in tasks.values()]
    print(f"{strategy}: mean={np.mean(scores):.1f}, std={np.std(scores):.1f}")
```

### Step 4.2: Statistical Significance

Always compute:
- **Error bars**: Standard deviation or standard error, specify which
- **Confidence intervals**: 95% CI for key results
- **Pairwise tests**: McNemar's test for comparing two methods
- **Effect sizes**: Cohen's d or h for practical significance

See [references/experiment-patterns.md](references/experiment-patterns.md) for complete implementations of McNemar's test, bootstrapped CIs, and Cohen's h.

### Step 4.3: Identify the Story

After analysis, explicitly answer:
1. **What is the main finding?** State it in one sentence.
2. **What surprised you?** Unexpected results often make the best papers.
3. **What failed?** Failed experiments can be the most informative. Honest reporting of failures strengthens the paper.
4. **What follow-up experiments are needed?** Results often raise new questions.

### Step 4.4: Create Figures and Tables

**Figures**:
- Use vector graphics (PDF) for all plots: `plt.savefig('fig.pdf')`
- Colorblind-safe palettes (Okabe-Ito or Paul Tol)
- Self-contained captions — reader should understand without main text
- No title inside figure — the caption serves this function

**Tables**:
- Use `booktabs` LaTeX package
- Bold best value per metric
- Include direction symbols (higher/lower better)
- Consistent decimal precision

```latex
\usepackage{booktabs}
\begin{tabular}{lcc}
\toprule
Method & Accuracy $\uparrow$ & Latency $\downarrow$ \\
\midrule
Baseline & 85.2 & 45ms \\
\textbf{Ours} & \textbf{92.1} & 38ms \\
\bottomrule
\end{tabular}
```

### Step 4.5: Decide: More Experiments or Write?

| Situation | Action |
|-----------|--------|
| Core claims supported, results significant | Move to Phase 5 (writing) |
| Results inconclusive, need more data | Back to Phase 2 (design) |
| Unexpected finding suggests new direction | Back to Phase 2 (design) |
| Missing one ablation reviewers will ask for | Run it, then Phase 5 |
| All experiments done but some failed | Note failures, move to Phase 5 |

---

## Iterative Refinement: Strategy Selection

Any output in this pipeline — paper drafts, experiment scripts, analysis — can be iteratively refined. The autoreason research provides empirical evidence for when each refinement strategy works and when it fails. Use this section to choose the right approach.

### Quick Decision Table

| Your Situation | Strategy | Why |
|---------------|----------|-----|
| Mid-tier model + constrained task | **Autoreason** | Sweet spot. Generation-evaluation gap is widest. Baselines actively destroy weak model outputs.
| | Mid-tier model + open task | **Autoreason** with scope constraints added | Add fixed facts, structure, or deliverable to bound the improvement space. | | Frontier model + constrained task | **Autoreason** | Wins 2/3 constrained tasks even at frontier. | | Frontier model + unconstrained task | **Critique-and-revise** or **single pass** | Autoreason comes last. Model self-evaluates well enough. | | Concrete technical task (system design) | **Critique-and-revise** | Direct find-and-fix loop is more efficient. | | Template-filling task (one correct structure) | **Single pass** or **conservative** | Minimal decision space. Iteration adds no value. | | Code with test cases | **Autoreason (code variant)** | Structured analysis of *why* it failed before fixing. Recovery rate 62% vs 43%. | | Very weak model (Llama 8B class) | **Single pass** | Model too weak for diverse candidates. Invest in generation quality. | ### The Generation-Evaluation Gap **Core insight**: Autoreason's value depends on the gap between a model's generation capability and its self-evaluation capability. ``` Model Tier │ Generation │ Self-Eval │ Gap │ Autoreason Value ──────────────────┼────────────┼───────────┼────────┼───────────────── Weak (Llama 8B) │ Poor │ Poor │ Small │ None — can't generate diverse candidates Mid (Haiku 3.5) │ Decent │ Poor │ LARGE │ MAXIMUM — 42/42 perfect Borda Mid (Gemini Flash)│ Decent │ Moderate │ Large │ High — wins 2/3 Strong (Sonnet 4) │ Good │ Decent │ Medium │ Moderate — wins 3/5 Frontier (S4.6) │ Excellent │ Good │ Small │ Only with constraints ``` This gap is structural, not temporary. As costs drop, today's frontier becomes tomorrow's mid-tier. The sweet spot moves but never disappears. ### Autoreason Loop (Summary) Each pass produces three candidates from fresh, isolated agents: 1. **Critic** → finds problems in incumbent A (no fixes) 2. **Author B** → revises A based on critique 3. **Synthesizer** → merges A and B (randomized labels) 4. **Judge Panel** → 3 blind CoT judges rank A, B, AB via Borda count 5. **Convergence** → A wins k=2 consecutive passes → done **Key parameters:** - k=2 convergence (k=1 premature, k=3 too expensive, no quality gain) - CoT judges always (3x faster convergence) - Temperature 0.8 authors, 0.3 judges - Conservative tiebreak: incumbent wins ties - Every role is a fresh agent with no shared context ### Applying to Paper Drafts When refining the paper itself through autoreason: - **Provide ground truth to the critic**: actual experimental data, result JSONs, statistical outputs. Without this, models hallucinate fabricated ablation studies and fake confidence intervals. - **Use 3 working judges minimum**: A broken judge parser doesn't add noise — it prevents equilibrium entirely. - **Scope constrain the revision**: "Address these specific weaknesses" not "improve the paper." 
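As a control-flow sketch of the loop summarized above — the model calls (`generate`, `critique`, `synthesize`, the judges) are stubs for your own prompts, and the label handling is simplified relative to the full protocol; names here are assumptions, not the skill's API:

```python
import random

def borda_winner(rankings, incumbent="A"):
    """Aggregate best-first judge rankings; incumbent wins ties (conservative tiebreak)."""
    scores = {}
    for ranking in rankings:
        for pos, cand in enumerate(ranking):
            scores[cand] = scores.get(cand, 0) + len(ranking) - 1 - pos
    return max(scores, key=lambda c: (scores[c], c == incumbent))

def autoreason(task, generate, critique, synthesize, judges, k=2, max_passes=20):
    incumbent = generate(task)                   # initial draft
    streak = 0
    for _ in range(max_passes):
        if streak >= k:
            break                                # A won k consecutive passes
        c = critique(incumbent, task)            # Critic: problems only, no fixes
        revised = generate(task, feedback=c)     # Author B
        merged = synthesize(incumbent, revised)  # Synthesizer
        candidates = {"A": incumbent, "B": revised, "AB": merged}
        rankings = []
        for judge in judges:                     # e.g. 3 blind CoT judges
            labels = list(candidates)
            random.shuffle(labels)               # randomized presentation order
            rankings.append(judge({l: candidates[l] for l in labels}, task))
        winner = borda_winner(rankings)
        if winner == "A":
            streak += 1
        else:
            incumbent, streak = candidates[winner], 0
    return incumbent
```

Each judge stub receives the shuffled candidates and returns a best-first list of labels; everything else (fresh agents per role, temperatures) lives in how you implement the stubs.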
### Failure Modes

| Failure | Detection | Fix |
|---------|-----------|-----|
| No convergence (A never wins) | A wins <15% over 20+ passes | Add scope constraints to the task |
| Synthesis drift | Word counts grow unboundedly | Constrain structure and deliverable |
| Degradation below single pass | Baselines score higher than iterated output | Switch to single pass; model may be too weak |
| Overfitting (code) | High public-test pass, low private-test pass | Use structured analysis, not just test feedback |
| Broken judges | Parsing failures reduce panel below 3 | Fix parser before continuing |

See [references/autoreason-methodology.md](references/autoreason-methodology.md) for complete prompts, Borda scoring details, model selection guide, scope constraint design patterns, and compute budget reference.

---

## Phase 5: Paper Drafting

**Goal**: Write a complete, publication-ready paper.

### The Narrative Principle

**The single most critical insight**: Your paper is not a collection of experiments — it's a story with one clear contribution supported by evidence.

Every successful ML paper centers on what Neel Nanda calls "the narrative": a short, rigorous, evidence-based technical story with a takeaway readers care about.

**Three Pillars (must be crystal clear by end of introduction):**

| Pillar | Description | Test |
|--------|-------------|------|
| **The What** | 1-3 specific novel claims | Can you state them in one sentence? |
| **The Why** | Rigorous empirical evidence | Do experiments distinguish your hypothesis from alternatives? |
| **The So What** | Why readers should care | Does this connect to a recognized community problem? |

**If you cannot state your contribution in one sentence, you don't yet have a paper.**

### Time Allocation

Spend approximately **equal time** on each of:
1. The abstract
2. The introduction
3. The figures
4. Everything else combined

**Why?** Most reviewers form judgments before reaching your methods. Readers encounter your paper as: title → abstract → introduction → figures → maybe the rest.

### Writing Workflow

```
Paper Writing Checklist:
- [ ] Step 1: Define the one-sentence contribution
- [ ] Step 2: Draft Figure 1 (core idea or most compelling result)
- [ ] Step 3: Draft abstract (5-sentence formula)
- [ ] Step 4: Draft introduction (1-1.5 pages max)
- [ ] Step 5: Draft methods
- [ ] Step 6: Draft experiments & results
- [ ] Step 7: Draft related work
- [ ] Step 8: Draft conclusion & discussion
- [ ] Step 9: Draft limitations (REQUIRED by all venues)
- [ ] Step 10: Plan appendix (proofs, extra experiments, details)
- [ ] Step 11: Complete paper checklist
- [ ] Step 12: Final review
```

### Step 5.0: Title

The title is the single most-read element of the paper. It determines whether anyone clicks through to the abstract.

**Good titles**:
- State the contribution or finding: "Autoreason: When Iterative LLM Refinement Works and Why It Fails"
- Highlight a surprising result: "Scaling Data-Constrained Language Models" (implies you can)
- Name the method + what it does: "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (introduces DPO)

**Bad titles**:
- Too generic: "An Approach to Improving Language Model Outputs"
- Too long: anything over ~15 words
- Jargon-only: "Asymptotic Convergence of Iterative Stochastic Policy Refinement" (who is this for?)
**Rules**:
- Include your method name if you have one (for citability)
- Include 1-2 keywords reviewers will search for
- Avoid colons unless both halves carry meaning
- Test: would a reviewer know the domain and contribution from the title alone?

### Step 5.1: Abstract (5-Sentence Formula)

From Sebastian Farquhar (DeepMind):

```
1. What you achieved: "We introduce...", "We prove...", "We demonstrate..."
2. Why this is hard and important
3. How you do it (with specialist keywords for discoverability)
4. What evidence you have
5. Your most remarkable number/result
```

**Delete** generic openings like "Large language models have achieved remarkable success..."

### Step 5.2: Figure 1

Figure 1 is the second thing most readers look at (after abstract). Draft it before writing the introduction — it forces you to clarify the core idea.

| Figure 1 Type | When to Use | Example |
|---------------|-------------|---------|
| **Method diagram** | New architecture or pipeline | TikZ flowchart showing your system |
| **Results teaser** | One compelling result tells the whole story | Bar chart: "Ours vs baselines" with clear gap |
| **Problem illustration** | The problem is unintuitive | Before/after showing failure mode you fix |
| **Conceptual diagram** | Abstract contribution needs visual grounding | 2x2 matrix of method properties |

**Rules**: Figure 1 must be understandable without reading any text. The caption alone should communicate the core idea. Use color purposefully — don't just decorate.

### Step 5.3: Introduction (1-1.5 pages max)

Must include:
- Clear problem statement
- Brief approach overview
- 2-4 bullet contribution list (max 1-2 lines each in two-column format)
- Methods should start by page 2-3

### Step 5.4: Methods

Enable reimplementation:
- Conceptual outline or pseudocode
- All hyperparameters listed
- Architectural details sufficient for reproduction
- Present final design decisions; ablations go in experiments

### Step 5.5: Experiments & Results

For each experiment, explicitly state:
- **What claim it supports**
- How it connects to main contribution
- What to observe: "the blue line shows X, which demonstrates Y"

Requirements:
- Error bars with methodology (std dev vs std error)
- Hyperparameter search ranges
- Compute infrastructure (GPU type, total hours)
- Seed-setting methods

### Step 5.6: Related Work

Organize methodologically, not paper-by-paper. Cite generously — reviewers likely authored relevant papers.

### Step 5.7: Limitations (REQUIRED)

All major conferences require this. Honesty helps:
- Reviewers are instructed not to penalize honest limitation acknowledgment
- Pre-empt criticisms by identifying weaknesses first
- Explain why limitations don't undermine core claims

### Step 5.8: Conclusion & Discussion

**Conclusion** (required, 0.5-1 page):
- Restate the contribution in one sentence (different wording from abstract)
- Summarize key findings (2-3 sentences, not a list)
- Implications: what does this mean for the field?
- Future work: 2-3 concrete next steps (not vague "we leave X for future work")

**Discussion** (optional, sometimes combined with conclusion):
- Broader implications beyond immediate results
- Connections to other subfields
- Honest assessment of when the method does and doesn't work
- Practical deployment considerations

**Do NOT** introduce new results or claims in the conclusion.

### Step 5.9: Appendix Strategy

Appendices are unlimited at all major venues and are essential for reproducibility.
Structure:

| Appendix Section | What Goes Here |
|-----------------|---------------|
| **Proofs & Derivations** | Full proofs too long for main text. Main text can state theorems with "proof in Appendix A." |
| **Additional Experiments** | Ablations, scaling curves, per-dataset breakdowns, hyperparameter sensitivity |
| **Implementation Details** | Full hyperparameter tables, training details, hardware specs, random seeds |
| **Dataset Documentation** | Data collection process, annotation guidelines, licensing, preprocessing |
| **Prompts & Templates** | Exact prompts used (for LLM-based methods), evaluation templates |
| **Human Evaluation** | Annotation interface screenshots, instructions given to annotators, IRB details |
| **Additional Figures** | Per-task breakdowns, trajectory visualizations, failure case examples |

**Rules**:
- The main paper must be self-contained — reviewers are not required to read appendices
- Never put critical evidence only in the appendix
- Cross-reference: "Full results in Table 5 (Appendix B)" not just "see appendix"
- Use the `\appendix` command, then plain `\section{Proofs}` etc. — sections after `\appendix` are lettered automatically (A, B, ...)

### Page Budget Management

When over the page limit:

| Cut Strategy | Saves | Risk |
|-------------|-------|------|
| Move proofs to appendix | 0.5-2 pages | Low — standard practice |
| Condense related work | 0.5-1 page | Medium — may miss key citations |
| Combine tables with subfigures | 0.25-0.5 page | Low — often improves readability |
| Use `\vspace{-Xpt}` sparingly | 0.1-0.3 page | Low if subtle, high if obvious |
| Remove qualitative examples | 0.5-1 page | Medium — reviewers like examples |
| Reduce figure sizes | 0.25-0.5 page | High — figures must remain readable |

**Do NOT**: reduce font size, change margins, remove required sections (limitations, broader impact), or use `\small`/`\footnotesize` for main text.

### Writing Style

**Sentence-level clarity (Gopen & Swan's 7 Principles):**

| Principle | Rule |
|-----------|------|
| Subject-verb proximity | Keep subject and verb close |
| Stress position | Place emphasis at sentence ends |
| Topic position | Put context first, new info after |
| Old before new | Familiar info → unfamiliar info |
| One unit, one function | Each paragraph makes one point |
| Action in verb | Use verbs, not nominalizations |
| Context before new | Set stage before presenting |

**Word choice (Lipton, Steinhardt):**
- Be specific: "accuracy" not "performance"
- Eliminate hedging: drop "may" unless genuinely uncertain
- Consistent terminology throughout
- Avoid incremental vocabulary: "develop", not "combine"

**Full writing guide with examples**: See [references/writing-guide.md](references/writing-guide.md)

### Using LaTeX Templates

**Always copy the entire template directory first, then write within it.**

```
Template Setup Checklist:
- [ ] Step 1: Copy entire template directory to new project
- [ ] Step 2: Verify template compiles as-is (before any changes)
- [ ] Step 3: Read the template's example content to understand structure
- [ ] Step 4: Replace example content section by section
- [ ] Step 5: Use template macros (check preamble for \newcommand definitions)
- [ ] Step 6: Clean up template artifacts only at the end
```

**Step 1: Copy the Full Template**

```bash
cp -r templates/neurips2025/ ~/papers/my-paper/
cd ~/papers/my-paper/
ls -la  # Should see: main.tex, neurips.sty, Makefile, etc.
```

Copy the ENTIRE directory, not just the .tex file. Templates include style files (.sty), bibliography styles (.bst), example content, and Makefiles.
**Step 2: Verify Template Compiles First**

Before making ANY changes:

```bash
latexmk -pdf main.tex
# Or manual: pdflatex main.tex && bibtex main && pdflatex main.tex && pdflatex main.tex
```

If the unmodified template doesn't compile, fix that first (usually missing TeX packages — install via `tlmgr install <package>`).

**Step 3: Keep Template Content as Reference**

Don't immediately delete example content. Comment it out and use as formatting reference:

```latex
% Template example (keep for reference):
% \begin{figure}[t]
%   \centering
%   \includegraphics[width=0.8\linewidth]{example-image}
%   \caption{Template shows caption style}
% \end{figure}

% Your actual figure:
\begin{figure}[t]
  \centering
  \includegraphics[width=0.8\linewidth]{your-figure.pdf}
  \caption{Your caption following the same style.}
\end{figure}
```

**Step 4: Replace Content Section by Section**

Work through systematically: title/authors → abstract → introduction → methods → experiments → related work → conclusion → references → appendix. Compile after each section.

**Step 5: Use Template Macros**

```latex
\newcommand{\method}{YourMethodName}  % Consistent method naming
\newcommand{\eg}{e.g.,\xspace}        % Proper abbreviations (requires the xspace package)
\newcommand{\ie}{i.e.,\xspace}
```

### Template Pitfalls

| Pitfall | Problem | Solution |
|---------|---------|----------|
| Copying only `.tex` file | Missing `.sty`, won't compile | Copy entire directory |
| Modifying `.sty` files | Breaks conference formatting | Never edit style files |
| Adding random packages | Conflicts, breaks template | Only add if necessary |
| Deleting template content early | Lose formatting reference | Keep as comments until done |
| Not compiling frequently | Errors accumulate | Compile after each section |
| Raster PNGs for figures | Blurry in paper | Always use vector PDF via `savefig('fig.pdf')` |

### Quick Template Reference

| Conference | Main File | Style File | Page Limit |
|------------|-----------|------------|------------|
| NeurIPS 2025 | `main.tex` | `neurips.sty` | 9 pages |
| ICML 2026 | `example_paper.tex` | `icml2026.sty` | 8 pages |
| ICLR 2026 | `iclr2026_conference.tex` | `iclr2026_conference.sty` | 9 pages |
| ACL 2025 | `acl_latex.tex` | `acl.sty` | 8 pages (long) |
| AAAI 2026 | `aaai2026-unified-template.tex` | `aaai2026.sty` | 7 pages |
| COLM 2025 | `colm2025_conference.tex` | `colm2025_conference.sty` | 9 pages |

**Universal**: Double-blind, references don't count, appendices unlimited, LaTeX required.

Templates in `templates/` directory. See [templates/README.md](templates/README.md) for compilation setup (VS Code, CLI, Overleaf, other IDEs).
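One habit that pairs well with compile-after-each-section: scan the log for problems that don't stop compilation. A minimal sketch — it greps for the standard LaTeX warning phrasings, and assumes the main file is `main.tex`:

```python
import re
from pathlib import Path

log = Path("main.log").read_text(errors="ignore")

# Undefined citations/references surface as warnings, not errors
for warn in re.findall(r"LaTeX Warning: (?:Citation|Reference)[^\n]*undefined[^\n]*", log):
    print(warn)

# Count layout problems worth a look before submission
print(len(re.findall(r"Overfull \\hbox", log)), "overfull hboxes")
```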
### Tables and Figures **Tables** — use `booktabs` for professional formatting: ```latex \usepackage{booktabs} \begin{tabular}{lcc} \toprule Method & Accuracy $\uparrow$ & Latency $\downarrow$ \\ \midrule Baseline & 85.2 & 45ms \\ \textbf{Ours} & \textbf{92.1} & 38ms \\ \bottomrule \end{tabular} ``` Rules: - Bold best value per metric - Include direction symbols ($\uparrow$ higher better, $\downarrow$ lower better) - Right-align numerical columns - Consistent decimal precision **Figures**: - **Vector graphics** (PDF, EPS) for all plots and diagrams — `plt.savefig('fig.pdf')` - **Raster** (PNG 600 DPI) only for photographs - **Colorblind-safe palettes** (Okabe-Ito or Paul Tol) - Verify **grayscale readability** (8% of men have color vision deficiency) - **No title inside figure** — the caption serves this function - **Self-contained captions** — reader should understand without main text ### Conference Resubmission For converting between venues, see Phase 7 (Submission Preparation) — it covers the full conversion workflow, page-change table, and post-rejection guidance. ### Professional LaTeX Preamble Add these packages to any paper for professional quality. They are compatible with all major conference style files: ```latex % --- Professional Packages (add after conference style file) --- % Typography \usepackage{microtype} % Microtypographic improvements (protrusion, expansion) % Makes text noticeably more polished — always include % Tables \usepackage{booktabs} % Professional table rules (\toprule, \midrule, \bottomrule) \usepackage{siunitx} % Consistent number formatting, decimal alignment % Usage: \num{12345} → 12,345; \SI{3.5}{GHz} → 3.5 GHz % Table alignment: S column type for decimal-aligned numbers % Figures \usepackage{graphicx} % Include graphics (\includegraphics) \usepackage{subcaption} % Subfigures with (a), (b), (c) labels % Usage: \begin{subfigure}{0.48\textwidth} ... \end{subfigure} % Diagrams and Algorithms \usepackage{tikz} % Programmable vector diagrams \usetikzlibrary{arrows.meta, positioning, shapes.geometric, calc, fit, backgrounds} \usepackage[ruled,vlined]{algorithm2e} % Professional pseudocode % Alternative: \usepackage{algorithmicx} if template bundles it % Cross-references \usepackage{cleveref} % Smart references: \cref{fig:x} → "Figure 1" % MUST be loaded AFTER hyperref % Handles: figures, tables, sections, equations, algorithms % Math (usually included by conference .sty, but verify) \usepackage{amsmath,amssymb} % AMS math environments and symbols \usepackage{mathtools} % Extends amsmath (dcases, coloneqq, etc.) % Colors (for figures and diagrams) \usepackage{xcolor} % Color management % Okabe-Ito colorblind-safe palette: \definecolor{okblue}{HTML}{0072B2} \definecolor{okorange}{HTML}{E69F00} \definecolor{okgreen}{HTML}{009E73} \definecolor{okred}{HTML}{D55E00} \definecolor{okpurple}{HTML}{CC79A7} \definecolor{okcyan}{HTML}{56B4E9} \definecolor{okyellow}{HTML}{F0E442} ``` **Notes:** - `microtype` is the single highest-impact package for visual quality. It adjusts character spacing at a sub-pixel level. Always include it. - `siunitx` handles decimal alignment in tables via the `S` column type — eliminates manual spacing. - `cleveref` must be loaded **after** `hyperref`. Most conference .sty files load hyperref, so put cleveref last. - Check if the conference template already loads any of these (especially `algorithm`, `amsmath`, `graphicx`). Don't double-load. 
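Number-heavy tables like the ones in the next section go stale as results change; generating the data rows from the result files avoids transcription errors. A minimal sketch, assuming the Phase 4 results layout (`results/<task>/<strategy>/result.json`) — paths and the `score` field are assumptions:

```python
import json
from collections import defaultdict
from pathlib import Path

import numpy as np

# Collect per-strategy scores from the Phase 4 layout (assumed)
scores = defaultdict(list)
for f in Path("results").rglob("result.json"):
    scores[f.parent.name].append(json.loads(f.read_text())["score"])

rows = [f"{strategy} & {np.mean(s):.1f} \\\\" for strategy, s in sorted(scores.items())]
Path("paper/tab_results.tex").write_text("\n".join(rows) + "\n")
# In the paper, between \midrule and \bottomrule:  \input{tab_results}
```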
### siunitx Table Alignment

`siunitx` makes number-heavy tables significantly more readable:

```latex
\begin{tabular}{l S[table-format=2.1] S[table-format=2.1] S[table-format=2.1]}
\toprule
Method & {Accuracy $\uparrow$} & {F1 $\uparrow$} & {Latency (ms) $\downarrow$} \\
\midrule
Baseline & 85.2 & 83.7 & 45.3 \\
Ablation (no X) & 87.1 & 85.4 & 42.1 \\
\textbf{Ours} & \textbf{92.1} & \textbf{90.8} & \textbf{38.7} \\
\bottomrule
\end{tabular}
```

The `S` column type auto-aligns on the decimal point. Headers in `{}` escape the alignment.

### Subfigures

Standard pattern for side-by-side figures:

```latex
\begin{figure}[t]
  \centering
  \begin{subfigure}[b]{0.48\textwidth}
    \centering
    \includegraphics[width=\textwidth]{fig_results_a.pdf}
    \caption{Results on Dataset A.}
    \label{fig:results-a}
  \end{subfigure}
  \hfill
  \begin{subfigure}[b]{0.48\textwidth}
    \centering
    \includegraphics[width=\textwidth]{fig_results_b.pdf}
    \caption{Results on Dataset B.}
    \label{fig:results-b}
  \end{subfigure}
  \caption{Comparison of our method across two datasets. (a) shows the scaling
  behavior and (b) shows the ablation results. Both use 5 random seeds.}
  \label{fig:results}
\end{figure}
```

Use `\cref{fig:results}` → "Figure 1", `\cref{fig:results-a}` → "Figure 1a".

### Pseudocode with algorithm2e

```latex
\begin{algorithm}[t]
\caption{Iterative Refinement with Judge Panel}
\label{alg:method}
\KwIn{Task $T$, model $M$, judges $J_1 \ldots J_n$, convergence threshold $k$}
\KwOut{Final output $A^*$}
$A \gets M(T)$ \tcp*{Initial generation}
$\text{streak} \gets 0$\;
\While{$\text{streak} < k$}{
  $C \gets \text{Critic}(A, T)$ \tcp*{Identify weaknesses}
  $B \gets M(T, C)$ \tcp*{Revised version addressing critique}
  $AB \gets \text{Synthesize}(A, B)$ \tcp*{Merge best elements}
  \ForEach{judge $J_i$}{
    $\text{rank}_i \gets J_i(\text{shuffle}(A, B, AB))$ \tcp*{Blind ranking}
  }
  $\text{winner} \gets \text{BordaCount}(\text{ranks})$\;
  \eIf{$\text{winner} = A$}{
    $\text{streak} \gets \text{streak} + 1$\;
  }{
    $A \gets \text{winner}$; $\text{streak} \gets 0$\;
  }
}
\Return{$A$}\;
\end{algorithm}
```

### TikZ Diagram Patterns

TikZ is the standard for method diagrams in ML papers. Common patterns:

**Pipeline/Flow Diagram** (most common in ML papers):

```latex
\begin{figure}[t]
\centering
\begin{tikzpicture}[
  node distance=3cm,  % center-to-center; boxes are 2cm wide, so this leaves a 1cm gap
  box/.style={rectangle, draw, rounded corners, minimum height=1cm,
              minimum width=2cm, align=center, font=\small},
  arrow/.style={-{Stealth[length=3mm]}, thick},
]
\node[box, fill=okcyan!20] (input) {Input\\$x$};
\node[box, fill=okblue!20, right of=input] (encoder) {Encoder\\$f_\theta$};
\node[box, fill=okgreen!20, right of=encoder] (latent) {Latent\\$z$};
\node[box, fill=okorange!20, right of=latent] (decoder) {Decoder\\$g_\phi$};
\node[box, fill=okred!20, right of=decoder] (output) {Output\\$\hat{x}$};
\draw[arrow] (input) -- (encoder);
\draw[arrow] (encoder) -- (latent);
\draw[arrow] (latent) -- (decoder);
\draw[arrow] (decoder) -- (output);
\end{tikzpicture}
\caption{Architecture overview.
The encoder maps input $x$ to latent representation $z$, which the decoder reconstructs.}
\label{fig:architecture}
\end{figure}
```

**Comparison/Matrix Diagram** (for showing method variants):

```latex
\begin{tikzpicture}[
  cell/.style={rectangle, draw, minimum width=2.5cm, minimum height=1cm,
               align=center, font=\small},
  header/.style={cell, fill=gray!20, font=\small\bfseries},
]
% Headers
\node[header] at (0, 0) {Method};
\node[header] at (3, 0) {Converges?};
\node[header] at (6, 0) {Quality?};
% Rows
\node[cell] at (0, -1) {Single Pass};
\node[cell, fill=okgreen!15] at (3, -1) {N/A};
\node[cell, fill=okorange!15] at (6, -1) {Baseline};
\node[cell] at (0, -2) {Critique+Revise};
\node[cell, fill=okred!15] at (3, -2) {No};
\node[cell, fill=okred!15] at (6, -2) {Degrades};
\node[cell] at (0, -3) {Ours};
\node[cell, fill=okgreen!15] at (3, -3) {Yes ($k$=2)};
\node[cell, fill=okgreen!15] at (6, -3) {Improves};
\end{tikzpicture}
```

**Iterative Loop Diagram** (for methods with feedback):

```latex
\begin{tikzpicture}[
  node distance=2cm,
  box/.style={rectangle, draw, rounded corners, minimum height=0.8cm,
              minimum width=1.8cm, align=center, font=\small},
  arrow/.style={-{Stealth[length=3mm]}, thick},
  label/.style={font=\scriptsize, midway, above},
]
\node[box, fill=okblue!20] (gen) {Generator};
\node[box, fill=okred!20, right=2.5cm of gen] (critic) {Critic};
\node[box, fill=okgreen!20, below=1.5cm of $(gen)!0.5!(critic)$] (judge) {Judge Panel};
\draw[arrow] (gen) -- node[label] {output $A$} (critic);
\draw[arrow] (critic) -- node[label, right] {critique $C$} (judge);
\draw[arrow] (judge) -| node[label, left, pos=0.3] {winner} (gen);
\end{tikzpicture}
```

### latexdiff for Revision Tracking

Essential for rebuttals — generates a marked-up PDF showing changes between versions:

```bash
# Install
# macOS: brew install latexdiff (or comes with TeX Live)
# Linux: sudo apt install latexdiff

# Generate diff
latexdiff paper_v1.tex paper_v2.tex > paper_diff.tex
pdflatex paper_diff.tex

# For multi-file projects (with \input{} or \include{})
latexdiff --flatten paper_v1.tex paper_v2.tex > paper_diff.tex
```

This produces a PDF with deletions in red strikethrough and additions in blue — standard format for rebuttal supplements.

### SciencePlots for matplotlib

Install and use for publication-quality plots:

```bash
pip install SciencePlots
```

```python
import matplotlib.pyplot as plt
import numpy as np
import scienceplots  # registers styles

# Demo data -- replace with your results
x = np.linspace(0, 1000, 50)
y, y2 = np.sqrt(x) / 35, np.sqrt(x) / 40

# Use science style (IEEE-like, clean)
with plt.style.context(['science', 'no-latex']):
    fig, ax = plt.subplots(figsize=(3.5, 2.5))  # Single-column width
    ax.plot(x, y, label='Ours', color='#0072B2')
    ax.plot(x, y2, label='Baseline', color='#D55E00', linestyle='--')
    ax.set_xlabel('Training Steps')
    ax.set_ylabel('Accuracy')
    ax.legend()
    fig.savefig('paper/fig_results.pdf', bbox_inches='tight')

# Available styles: 'science', 'ieee', 'nature' -- combine by passing a list,
# e.g. plt.style.context(['science', 'ieee'])
# Add 'no-latex' if LaTeX is not installed on the machine generating plots
```

**Standard figure sizes** (two-column format):
- Single column: `figsize=(3.5, 2.5)` — fits in one column
- Double column: `figsize=(7.0, 3.0)` — spans both columns
- Square: `figsize=(3.5, 3.5)` — for heatmaps, confusion matrices

---

## Phase 6: Self-Review & Revision

**Goal**: Simulate the review process before submission. Catch weaknesses early.

### Step 6.1: Simulate Reviews

Generate reviews from multiple perspectives using strong models (Opus 4, Sonnet 4.6, Gemini 2.5 Pro). Use the reviewer guidelines from the target venue.
**Review prompt template:**

```
You are an expert reviewer for [VENUE]. Review this paper according to
the official reviewer guidelines.

Evaluate:
1. Quality (technical soundness, baselines, claims supported by evidence)
2. Clarity (writing, notation consistency, reproducibility)
3. Significance (impact, importance of the problem)
4. Originality (novelty, new insights)

Provide:
- Summary (2-3 sentences)
- Strengths (bullet list)
- Weaknesses (bullet list, most critical first)
- Questions for authors
- Missing references
- Score (1-6 on NeurIPS scale)
- Confidence (1-5)
```

### Step 6.2: Prioritize Feedback

After collecting reviews, categorize:

| Priority | Action |
|----------|--------|
| **Critical** (technical flaw, missing baseline) | Must fix. May require new experiments → back to Phase 2 |
| **High** (clarity issue, missing ablation) | Should fix in this revision |
| **Medium** (minor writing issues, extra experiments) | Fix if time allows |
| **Low** (style preferences, tangential suggestions) | Note for future work |

### Step 6.3: Revision Cycle

For each critical/high issue:
1. Identify the specific section(s) affected
2. Draft the fix
3. Verify the fix doesn't break other claims
4. Update the paper
5. Re-check against the reviewer's concern

### Step 6.4: Rebuttal Writing

When responding to actual reviews (post-submission), rebuttals are a distinct skill from revision:

**Format**: Point-by-point. For each reviewer concern:

```
> R1-W1: "The paper lacks comparison with Method X."

We thank the reviewer for this suggestion. We have added a comparison
with Method X in Table 3 (revised). Our method outperforms X by 3.2pp
on [metric] (p<0.05). We note that X requires 2x our compute budget.
```

**Rules**:
- Address every concern — reviewers notice if you skip one
- Lead with the strongest responses
- Be concise and direct — reviewers read dozens of rebuttals
- Include new results if you ran experiments during the rebuttal period
- Never be defensive or dismissive, even of weak criticisms
- Use `latexdiff` to generate a marked-up PDF showing changes (see "latexdiff for Revision Tracking" in Phase 5)
- Thank reviewers for specific, actionable feedback (not generic praise)

**What NOT to do**: "We respectfully disagree" without evidence. "This is out of scope" without explanation. Ignoring a weakness by only responding to strengths.

### Step 6.5: Paper Evolution Tracking

Save snapshots at key milestones:

```
paper/
  paper.tex                     # Current working version
  paper_v1_first_draft.tex      # First complete draft
  paper_v2_post_review.tex      # After simulated review
  paper_v3_pre_submission.tex   # Final before submission
  paper_v4_camera_ready.tex     # Post-acceptance final
```

---

## Phase 7: Submission Preparation

**Goal**: Final checks, formatting, and submission.

### Step 7.1: Conference Checklist

Every venue has mandatory checklists. Complete them carefully — incomplete checklists can result in desk rejection.

See [references/checklists.md](references/checklists.md) for:
- NeurIPS 16-item paper checklist
- ICML broader impact + reproducibility
- ICLR LLM disclosure policy
- ACL mandatory limitations section
- Universal pre-submission checklist

### Step 7.2: Anonymization Checklist

Double-blind review means reviewers cannot know who wrote the paper. Check ALL of these:

```
Anonymization Checklist:
- [ ] No author names or affiliations anywhere in the PDF
- [ ] No acknowledgments section (add after acceptance)
- [ ] Self-citations written in third person: "Smith et al. [1] showed..." not "We previously showed [1]..."
- [ ] No GitHub/GitLab URLs pointing to your personal repos
- [ ] Use Anonymous GitHub (https://anonymous.4open.science/) for code links
- [ ] No institutional logos or identifiers in figures
- [ ] No file metadata containing author names (check PDF properties)
- [ ] No "our previous work" or "in our earlier paper" phrasing
- [ ] Dataset names don't reveal institution (rename if needed)
- [ ] Supplementary materials don't contain identifying information
```

**Common mistakes**: Git commit messages visible in supplementary code, watermarked figures from institutional tools, acknowledgments left in from a previous draft, arXiv preprint posted before anonymity period.

### Step 7.3: Formatting Verification

```
Pre-Submission Format Check:
- [ ] Page limit respected (excluding references and appendix)
- [ ] All figures are vector (PDF) or high-res raster (600 DPI PNG)
- [ ] All figures readable in grayscale
- [ ] All tables use booktabs
- [ ] References compile correctly (no "?" in citations)
- [ ] No overfull hboxes in critical areas
- [ ] Appendix clearly labeled and separated
- [ ] Required sections present (limitations, broader impact, etc.)
```

### Step 7.4: Final Compilation

```bash
# Clean build
rm -f *.aux *.bbl *.blg *.log *.out *.pdf
latexmk -pdf main.tex

# Or manual
pdflatex main.tex
bibtex main
pdflatex main.tex
pdflatex main.tex
```

### Step 7.5: Conference-Specific Requirements

| Venue | Special Requirements |
|-------|---------------------|
| **NeurIPS** | Paper checklist in appendix, lay summary if accepted |
| **ICML** | Broader Impact Statement (after conclusion, doesn't count toward limit) |
| **ICLR** | LLM disclosure required, reciprocal reviewing agreement |
| **ACL** | Mandatory Limitations section, Responsible NLP checklist |
| **AAAI** | Strict style file — no modifications whatsoever |
| **COLM** | Frame contribution for language model community |

### Step 7.6: Conference Resubmission & Format Conversion

When converting between venues, **never copy LaTeX preambles between templates**:

```bash
# 1. Start fresh with target template
cp -r templates/icml2026/ new_submission/
# 2. Copy ONLY content sections (not preamble)
#    - Abstract text, section content, figures, tables, bib entries
# 3. Adjust for page limits
# 4. Add venue-specific required sections
# 5. Update references
```

| From → To | Page Change | Key Adjustments |
|-----------|-------------|-----------------|
| NeurIPS → ICML | 9 → 8 | Cut 1 page, add Broader Impact |
| ICML → ICLR | 8 → 9 | Expand experiments, add LLM disclosure |
| NeurIPS → ACL | 9 → 8 | Restructure for NLP conventions, add Limitations |
| ICLR → AAAI | 9 → 7 | Significant cuts, strict style adherence |
| Any → COLM | varies → 9 | Reframe for language model focus |

When cutting pages: move proofs to appendix, condense related work, combine tables, use subfigures. When expanding: add ablations, expand limitations, include additional baselines, add qualitative examples.

**After rejection**: Address reviewer concerns in the new version, but don't include a "changes" section or reference the previous submission (blind review).
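Before uploading any submission, new or converted, a mechanical sweep over the sources catches most of the Step 7.2 slip-ups. A heuristic sketch, not a substitute for the checklist — extend the pattern list with your own name, lab, and grant IDs:

```python
import re
from pathlib import Path

# Heuristic patterns for common de-anonymization leaks (extend as needed)
leak = re.compile(
    r"github\.com/|gitlab\.com/|acknowledg|our (?:previous|earlier) (?:work|paper)",
    re.IGNORECASE,
)

for tex in Path(".").rglob("*.tex"):
    for lineno, line in enumerate(tex.read_text(errors="ignore").splitlines(), 1):
        if leak.search(line):
            print(f"{tex}:{lineno}: {line.strip()[:90]}")
```

Also check the compiled PDF's metadata (the "No file metadata" checklist item) via your PDF viewer's properties dialog.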
### Step 7.7: Camera-Ready Preparation (Post-Acceptance) After acceptance, prepare the camera-ready version: ``` Camera-Ready Checklist: - [ ] De-anonymize: add author names, affiliations, email addresses - [ ] Add Acknowledgments section (funding, compute grants, helpful reviewers) - [ ] Add public code/data URL (real GitHub, not anonymous) - [ ] Address any mandatory revisions from meta-reviewer - [ ] Switch template to camera-ready mode (if applicable — e.g., AAAI \anon → \camera) - [ ] Add copyright notice if required by venue - [ ] Update any "anonymous" placeholders in text - [ ] Verify final PDF compiles cleanly - [ ] Check page limit for camera-ready (sometimes differs from submission) - [ ] Upload supplementary materials (code, data, appendix) to venue portal ``` --- ## Hermes Agent Integration This skill is designed for the Hermes agent. It uses Hermes tools, delegation, scheduling, and memory for the full research lifecycle. ### Related Skills Compose this skill with other Hermes skills for specific phases: | Skill | When to Use | How to Load | |-------|-------------|-------------| | **arxiv** | Phase 1 (Literature Review): searching arXiv, generating BibTeX, finding related papers via Semantic Scholar | `skill_view("arxiv")` | | **subagent-driven-development** | Phase 5 (Drafting): parallel section writing with 2-stage review (spec compliance then quality) | `skill_view("subagent-driven-development")` | | **plan** | Phase 0 (Setup): creating structured plans before execution. Writes to `.hermes/plans/` | `skill_view("plan")` | | **qmd** | Phase 1 (Literature): searching local knowledge bases (notes, transcripts, docs) via hybrid BM25+vector search | Install: `skill_manage("install", "qmd")` | | **diagramming** | Phase 4-5: creating Excalidraw-based figures and architecture diagrams | `skill_view("diagramming")` | | **data-science** | Phase 4 (Analysis): Jupyter live kernel for interactive analysis and visualization | `skill_view("data-science")` | **This skill supersedes `ml-paper-writing`** — it contains all of ml-paper-writing's content plus the full experiment/analysis pipeline and autoreason methodology. ### Hermes Tools Reference | Tool | Usage in This Pipeline | |------|----------------------| | **`terminal`** | LaTeX compilation (`latexmk -pdf`), git operations, launching experiments (`nohup python run.py &`), process checks | | **`process`** | Background experiment management: `process("start", ...)`, `process("poll", pid)`, `process("log", pid)`, `process("kill", pid)` | | **`execute_code`** | Run Python for citation verification, statistical analysis, data aggregation. Has tool access via RPC. | | **`read_file`** / **`write_file`** / **`patch`** | Paper editing, experiment scripts, result files. Use `patch` for targeted edits to large .tex files. | | **`web_search`** | Literature discovery: `web_search("transformer attention mechanism 2024")` | | **`web_extract`** | Fetch paper content, verify citations: `web_extract("https://arxiv.org/abs/2303.17651")` | | **`delegate_task`** | **Parallel section drafting** — spawn isolated subagents for each section. Also for concurrent citation verification. | | **`todo`** | Primary state tracker across sessions. Update after every phase transition. | | **`memory`** | Persist key decisions across sessions: contribution framing, venue choice, reviewer feedback. | | **`cronjob`** | Schedule experiment monitoring, deadline countdowns, automated arXiv checks. 
|
| **`clarify`** | Ask the user targeted questions when blocked (venue choice, contribution framing). |
| **`send_message`** | Notify user when experiments complete or drafts are ready, even if user isn't in chat. |

### Tool Usage Patterns

**Experiment monitoring** (most common):

```
terminal("ps aux | grep <pid>") →
terminal("tail -30 <logfile>") →
terminal("ls results/") →
execute_code("analyze results JSON, compute metrics") →
terminal("git add -A && git commit -m '<message>' && git push") →
send_message("Experiment complete: <results summary>")
```

**Parallel section drafting** (using delegation):

```
delegate_task("Draft the Methods section based on these experiment scripts and configs.
  Include: pseudocode, all hyperparameters, architectural details sufficient for
  reproduction. Write in LaTeX using the neurips2025 template conventions.")

delegate_task("Draft the Related Work section. Use web_search and web_extract to find
  papers. Verify every citation via Semantic Scholar. Group by methodology.")

delegate_task("Draft the Experiments section. Read all result files in results/.
  State which claim each experiment supports. Include error bars and significance.")
```

Each delegate runs as a **fresh subagent** with no shared context — provide all necessary information in the prompt. Collect outputs and integrate.

**Citation verification** (using execute_code):

```python
# In execute_code:
from semanticscholar import SemanticScholar
import requests

sch = SemanticScholar()
results = sch.search_paper("attention mechanism transformers", limit=5)
for paper in results:
    doi = (paper.externalIds or {}).get('DOI')  # externalIds can be absent
    if doi:
        bibtex = requests.get(
            f"https://doi.org/{doi}",
            headers={"Accept": "application/x-bibtex"},
        ).text
        print(bibtex)
```

### State Management with `memory` and `todo`

**`memory` tool** — persist key decisions (bounded: ~2200 chars for MEMORY.md):

```
memory("add", "Paper: autoreason. Venue: NeurIPS 2025 (9 pages). Contribution:
structured refinement works when generation-evaluation gap is wide. Key results:
Haiku 42/42, Sonnet 3/5, S4.6 constrained 2/3. Status: Phase 5 — drafting Methods section.")
```

Update memory after major decisions or phase transitions. This persists across sessions.

**`todo` tool** — track granular progress:

```
todo("add", "Design constrained task experiments for Sonnet 4.6")
todo("add", "Run Haiku baseline comparison")
todo("add", "Draft Methods section")
todo("update", id=3, status="in_progress")
todo("update", id=1, status="completed")
```

**Session startup protocol:**

```
1. todo("list")                         # Check current task list
2. memory("read")                       # Recall key decisions
3. terminal("git log --oneline -10")    # Check recent commits
4. terminal("ps aux | grep python")     # Check running experiments
5. terminal("ls results/ | tail -20")   # Check for new results
6. Report status to user, ask for direction
```

### Cron Monitoring with `cronjob`

Use the `cronjob` tool to schedule periodic experiment checks:

```
cronjob("create", {
  "schedule": "*/30 * * * *",  # Every 30 minutes
  "prompt": "Check experiment status:
    1. ps aux | grep run_experiment
    2. tail -30 logs/experiment_haiku.log
    3. ls results/haiku_baselines/
    4. If complete: read results, compute Borda scores,
       git add -A && git commit -m 'Add Haiku results' && git push
    5. Report: table of results, key finding, next step
    6. If nothing changed: respond with [SILENT]"
})
```

**[SILENT] protocol**: When nothing has changed since the last check, respond with exactly `[SILENT]`. This suppresses notification delivery to the user.
Only report when there are genuine changes worth knowing about.

**Deadline tracking**:

```
cronjob("create", {
  "schedule": "0 9 * * *",  # Daily at 9am
  "prompt": "NeurIPS 2025 deadline: May 22. Today is {date}. Days remaining: {compute}.
    Check todo list — are we on track? If <7 days: warn user about remaining tasks."
})
```

### Communication Patterns

**When to notify the user** (via `send_message` or direct response):
- Experiment batch completed (with results table)
- Unexpected finding or failure requiring decision
- Draft section ready for review
- Deadline approaching with incomplete tasks

**When NOT to notify:**
- Experiment still running, no new results → `[SILENT]`
- Routine monitoring with no changes → `[SILENT]`
- Intermediate steps that don't need attention

**Report format** — always include structured data:

```
## Experiment: <name>
Status: Complete / Running / Failed

| Task | Method A | Method B | Method C |
|------|---------|---------|---------|
| Task 1 | 85.2 | 82.1 | **89.4** |

Key finding: <one sentence>
Next step: <proposed action>
```

### Decision Points Requiring Human Input

Use `clarify` for targeted questions when genuinely blocked:

| Decision | When to Ask |
|----------|-------------|
| Target venue | Before starting paper (affects page limits, framing) |
| Contribution framing | When multiple valid framings exist |
| Experiment priority | When TODO list has more experiments than time allows |
| Submission readiness | Before final submission |

**Do NOT ask about** (be proactive, make a choice, flag it):
- Word choice, section ordering
- Which specific results to highlight
- Citation completeness (draft with what you find, note gaps)

---

## Reviewer Evaluation Criteria

Understanding what reviewers look for helps focus effort:

| Criterion | What They Check |
|-----------|----------------|
| **Quality** | Technical soundness, well-supported claims, fair baselines |
| **Clarity** | Clear writing, reproducible by experts, consistent notation |
| **Significance** | Community impact, advances understanding |
| **Originality** | New insights (doesn't require new method) |

**Scoring (NeurIPS 6-point scale):**
- 6: Strong Accept — groundbreaking, flawless
- 5: Accept — technically solid, high impact
- 4: Borderline Accept — solid, limited evaluation
- 3: Borderline Reject — weaknesses outweigh
- 2: Reject — technical flaws
- 1: Strong Reject — known results or ethics issues

See [references/reviewer-guidelines.md](references/reviewer-guidelines.md) for detailed guidelines, common concerns, and rebuttal strategies.

---

## Common Issues and Solutions

| Issue | Solution |
|-------|----------|
| Abstract too generic | Delete first sentence if it could prepend any ML paper. Start with your specific contribution. |
| Introduction exceeds 1.5 pages | Split background into Related Work. Front-load contribution bullets. |
| Experiments lack explicit claims | Add: "This experiment tests whether [specific claim]..." before each one. |
| Reviewers find paper hard to follow | Add signposting, use consistent terminology, make figure captions self-contained. |
| Missing statistical significance | Add error bars, number of runs, statistical tests, confidence intervals. |
| Scope creep in experiments | Every experiment must map to a specific claim. Cut experiments that don't. |
| Paper rejected, need to resubmit | See Conference Resubmission in Phase 7. Address reviewer concerns without referencing reviews.
| --- ## Reference Documents | Document | Contents | |----------|----------| | [references/writing-guide.md](references/writing-guide.md) | Gopen & Swan 7 principles, Perez micro-tips, Lipton word choice, Steinhardt precision, figure design | | [references/citation-workflow.md](references/citation-workflow.md) | Citation APIs, Python code, CitationManager class, BibTeX management | | [references/checklists.md](references/checklists.md) | NeurIPS 16-item, ICML, ICLR, ACL requirements, universal pre-submission checklist | | [references/reviewer-guidelines.md](references/reviewer-guidelines.md) | Evaluation criteria, scoring, common concerns, rebuttal template | | [references/sources.md](references/sources.md) | Complete bibliography of all writing guides, conference guidelines, APIs | | [references/experiment-patterns.md](references/experiment-patterns.md) | Experiment design patterns, evaluation protocols, monitoring, error recovery | | [references/autoreason-methodology.md](references/autoreason-methodology.md) | Autoreason loop, strategy selection, model guide, prompts, scope constraints, Borda scoring | ### LaTeX Templates Templates in `templates/` for: **NeurIPS 2025**, **ICML 2026**, **ICLR 2026**, **ACL**, **AAAI 2026**, **COLM 2025**. See [templates/README.md](templates/README.md) for compilation instructions. ### Key External Sources **Writing Philosophy:** - [Neel Nanda: How to Write ML Papers](https://www.alignmentforum.org/posts/eJGptPbbFPZGLpjsp/highly-opinionated-advice-on-how-to-write-ml-papers) - [Sebastian Farquhar: How to Write ML Papers](https://sebastianfarquhar.com/on-research/2024/11/04/how_to_write_ml_papers/) - [Gopen & Swan: Science of Scientific Writing](https://cseweb.ucsd.edu/~swanson/papers/science-of-writing.pdf) - [Lipton: Heuristics for Scientific Writing](https://www.approximatelycorrect.com/2018/01/29/heuristics-technical-scientific-writing-machine-learning-perspective/) - [Perez: Easy Paper Writing Tips](https://ethanperez.net/easy-paper-writing-tips/) **APIs:** [Semantic Scholar](https://api.semanticscholar.org/api-docs/) | [CrossRef](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) | [arXiv](https://info.arxiv.org/help/api/basics.html) **Venues:** [NeurIPS](https://neurips.cc/Conferences/2025/PaperInformation/StyleFiles) | [ICML](https://icml.cc/Conferences/2025/AuthorInstructions) | [ICLR](https://iclr.cc/Conferences/2026/AuthorGuide) | [ACL](https://github.com/acl-org/acl-style-files)