feat: add OpenThoughts-TBLite evaluation environment and configuration files

Introduced a new evaluation environment for OpenThoughts-TBLite, including the main evaluation script, configuration YAML, and README documentation. This environment provides a faster alternative to Terminal-Bench 2.0, featuring 100 difficulty-calibrated tasks for terminal agents. The setup allows for easy evaluation and configuration, enhancing the benchmarking capabilities for terminal agents.
2026-04-25 00:51:20 +00:00 · 2026-03-04 11:38:32 +00:00 · 2026-03-04 11:38:32 +00:00 · 0ea6c34325
commit 0ea6c34325
parent 3db3d60368
4 changed files with 231 additions and 0 deletions
--- a/environments/benchmarks/tblite/README.md
+++ b/environments/benchmarks/tblite/README.md
@ -0,0 +1,73 @@
+# OpenThoughts-TBLite Evaluation Environment
+
+This environment evaluates terminal agents on the [OpenThoughts-TBLite](https://huggingface.co/datasets/open-thoughts/OpenThoughts-TBLite) benchmark, a difficulty-calibrated subset of [Terminal-Bench 2.0](https://www.tbench.ai/leaderboard/terminal-bench/2.0).
+
+## Source
+
+OpenThoughts-TBLite was created by the [OpenThoughts](https://www.openthoughts.ai/) Agent team in collaboration with [Snorkel AI](https://snorkel.ai/) and [Bespoke Labs](https://bespokelabs.ai/). The original dataset and documentation live at:
+
+- **Dataset (source):** [open-thoughts/OpenThoughts-TBLite](https://huggingface.co/datasets/open-thoughts/OpenThoughts-TBLite)
+- **GitHub:** [open-thoughts/OpenThoughts-TBLite](https://github.com/open-thoughts/OpenThoughts-TBLite)
+- **Blog post:** [openthoughts.ai/blog/openthoughts-tblite](https://www.openthoughts.ai/blog/openthoughts-tblite)
+
+## Our Dataset
+
+We converted the source into the same schema used by our Terminal-Bench 2.0 environment (pre-built Docker Hub images, base64-encoded test tarballs, etc.) and published it as:
+
+- **Dataset (ours):** [NousResearch/openthoughts-tblite](https://huggingface.co/datasets/NousResearch/openthoughts-tblite)
+- **Docker images:** `nousresearch/tblite-<task-name>:latest` on Docker Hub (100 images)
+
+The conversion script is at `scripts/prepare_tblite_dataset.py`.
+
+## Why TBLite?
+
+Terminal-Bench 2.0 is one of the strongest frontier evaluations for terminal agents, but when a model scores near the floor (e.g., Qwen 3 8B at <1%), many changes look identical in aggregate score. TBLite addresses this by calibrating task difficulty using Claude Haiku 4.5 as a reference:
+
+| Difficulty | Pass Rate Range | Tasks |
+|------------|----------------|-------|
+| Easy       | >= 70%         | 40    |
+| Medium     | 40-69%         | 26    |
+| Hard       | 10-39%         | 26    |
+| Extreme    | < 10%          | 8     |
+
+This gives enough solvable tasks to detect small improvements quickly, while preserving enough hard tasks to avoid saturation. The correlation between TBLite and TB2 scores is **r = 0.911**.
+
+TBLite also runs 2.6-8x faster than the full TB2, making it practical for iteration loops.
+
+## Usage
+
+```bash
+# Run the full benchmark
+python environments/benchmarks/tblite/tblite_env.py evaluate
+
+# Filter to specific tasks
+python environments/benchmarks/tblite/tblite_env.py evaluate \
+    --env.task_filter "broken-python,pandas-etl"
+
+# Use a different model
+python environments/benchmarks/tblite/tblite_env.py evaluate \
+    --server.model_name "qwen/qwen3-30b"
+```
+
+## Architecture
+
+`TBLiteEvalEnv` is a thin subclass of `TerminalBench2EvalEnv`. All evaluation logic (agent loop, Docker sandbox management, test verification, metrics) is inherited. Only the defaults differ:
+
+| Setting        | TB2                              | TBLite                                  |
+|----------------|----------------------------------|-----------------------------------------|
+| Dataset        | `NousResearch/terminal-bench-2`  | `NousResearch/openthoughts-tblite`      |
+| Tasks          | 89                               | 100                                     |
+| Task timeout   | 1800s (30 min)                   | 1200s (20 min)                          |
+| Wandb name     | `terminal-bench-2`               | `openthoughts-tblite`                   |
+
+## Citation
+
+```bibtex
+@software{OpenThoughts-TBLite,
+  author = {OpenThoughts-Agent team, Snorkel AI, Bespoke Labs},
+  month = Feb,
+  title = {{OpenThoughts-TBLite: A High-Signal Benchmark for Iterating on Terminal Agents}},
+  howpublished = {https://www.openthoughts.ai/blog/openthoughts-tblite},
+  year = {2026}
+}
+```