From 6bd43111d10f976977ee30bb74fbf277c79665d7 Mon Sep 17 00:00:00 2001
From: Teknium <127238744+teknium1@users.noreply.github.com>
Date: Tue, 19 May 2026 20:02:52 -0700
Subject: [PATCH] perf(terminal): adaptive subprocess poll cuts ~195ms off
 every tool call (#29006)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

`_wait_for_process()` was sleeping for a fixed 200ms between polls of
the subprocess exit status. For commands that complete in <50ms (echo,
pwd, date, cat short files, write_file with small content, read_file
with small content), the agent was stuck waiting for the next 200ms
tick to notice the process had exited. That floor was the dominant
component of per-tool latency for typical short commands.

Replace with adaptive backoff: start at 5ms, multiply by 1.5 each
iteration, capped at 200ms. Fast commands (the common case) return in
~6ms; long-running commands (builds, tests, sleeps) reach the 200ms
steady-state poll rate after 10 ramp-up sleeps (~570ms total) and pay
identical CPU after that.
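
The ramp can be checked with a few lines of Python. This is a
standalone sketch, not code from `base.py`: `backoff_schedule` is a
hypothetical helper, and its defaults simply mirror the 5ms start, 1.5x
factor, and 200ms cap described above.

```python
def backoff_schedule(start=0.005, factor=1.5, cap=0.2):
    """Return the sleep durations (seconds) used before the cap is reached."""
    sleeps = []
    delay = start
    while delay < cap:
        sleeps.append(delay)
        # Same update rule as the patch: grow by `factor`, never exceed `cap`.
        delay = min(delay * factor, cap)
    return sleeps


ramp = backoff_schedule()
print(len(ramp))            # 10 sleeps shorter than the cap
print(round(sum(ramp), 3))  # 0.567 seconds spent ramping up
```

Every sleep after the ramp is exactly the 200ms cap, which is where the
steady-state behavior matches the old fixed poll.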

Tool-call wall time (deterministic microbench of `echo first`):
  before: median 200ms min 200ms max 200ms
  after:  median   5ms min   5ms max   7ms
  saved:  ~195ms per terminal tool call

End-to-end chat -q with 3 sequential terminal tool calls
(`echo first`, `echo second`, `echo third`):
  before: median 5.73s, min 5.61s
  after:  median 4.64s, min 4.60s
  saved:  ~1100ms wall per turn

Live tmux session: a typical 'write file, read it back' turn now
displays each tool as 0.1s in the spinner (was 0.9s before). The
agent observes the subprocess exit ~200ms faster per call. For chat
workflows that do 4-8 terminal/file calls per turn this saves
800ms-1.5s of pure wall-clock waiting.

Why it's safe:
- Interrupt and timeout checks still fire on every iteration (at least
  as often as the old 5/sec, and more often during the ramp-up)
- Activity callback fires on the same 'due' schedule (`touch_activity_if_due`)
- DEBUG_INTERRUPT heartbeat is unchanged (30s)
- Steady-state poll rate for long-running commands matches the old
  200ms within ~570ms of startup (10 ramp-up sleeps)
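
The shape of the loop these points describe can be sketched outside the
codebase. Everything here is illustrative: `wait_adaptive` and its
`should_stop` callback are hypothetical stand-ins for the real
`_wait_for_process()` and its interrupt check, and the activity callback
and heartbeat are omitted.

```python
import subprocess
import sys
import time


def wait_adaptive(proc, timeout=10.0, should_stop=lambda: False,
                  start=0.005, factor=1.5, cap=0.2):
    """Poll `proc` until it exits, times out, or `should_stop()` is true.

    Returns (returncode, number_of_polls_that_found_it_running).
    """
    deadline = time.monotonic() + timeout
    delay = start
    polls = 0
    while proc.poll() is None:
        polls += 1
        # Stop and timeout checks run on every iteration, before sleeping,
        # so they are evaluated more often while the delay is still small.
        if should_stop() or time.monotonic() >= deadline:
            proc.kill()
            proc.wait()
            break
        time.sleep(delay)
        delay = min(delay * factor, cap)
    return proc.returncode, polls


# A command that exits almost immediately is noticed after a few short polls.
proc = subprocess.Popen([sys.executable, "-c", "pass"])
code, polls = wait_adaptive(proc)
print(code, polls)
```

Because the checks sit ahead of the sleep, shortening the sleep can only
make them run sooner, never later, than with a fixed 200ms interval.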

Tests:
- tests/tools/ — 5246 passed, 22 skipped, 2 pre-existing xdist flakes
  (test_delegate.py::test_depth_limit, test_constants — pass in isolation)
- Live tmux: 2-turn conversation + multiple tool calls, no errors
---
 tools/environments/base.py | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/tools/environments/base.py b/tools/environments/base.py
index 8a53cefb5bf..2666990bf18 100644
--- a/tools/environments/base.py
+++ b/tools/environments/base.py
@@ -609,6 +609,7 @@ class BaseEnvironment(ABC):
             )
 
         try:
+            _poll_sleep = 0.005
             while proc.poll() is None:
                 _iter_count += 1
                 if is_interrupted():
@@ -662,7 +663,17 @@ class BaseEnvironment(ABC):
                     _last_heartbeat = time.monotonic()
                     _cb_was_none = _cb_now_none
 
-                time.sleep(0.2)
+                # Adaptive poll: start at 5ms so fast commands (echo, pwd,
+                # date, cat short files) return in ~6ms instead of being
+                # stuck waiting for the next 200ms tick. Back off
+                # exponentially toward 200ms so long-running commands
+                # (builds, tests, sleeps) don't pay measurable CPU in the
+                # poll loop. For an `echo` this saves ~195ms per tool call;
+                # for a 10s build the steady-state poll rate is identical
+                # to the old behavior.
+                time.sleep(_poll_sleep)
+                if _poll_sleep < 0.2:
+                    _poll_sleep = min(_poll_sleep * 1.5, 0.2)
         except (KeyboardInterrupt, SystemExit):
             # Signal arrived (SIGTERM/SIGHUP/SIGINT) or sys.exit() was called
             # while we were polling.  The local backend spawns subprocesses