hermes-agent/tests/plugins
kshitij db84a78e61
fix(langfuse): complete observability fix — trace I/O, tool outputs, placeholder credentials (closes #22342, #22763) (#26320)
* fix(langfuse): reject placeholder credentials with one-shot warning

When operators leave HERMES_LANGFUSE_PUBLIC_KEY / HERMES_LANGFUSE_SECRET_KEY
at a template value like 'placeholder', 'test-key', or 'your-langfuse-key',
the Langfuse SDK silently accepts the credentials at construction time and
drops every trace at flush time. No warning, no error — just an empty
Langfuse dashboard the operator only notices hours later.

Add prefix-based validation in _get_langfuse() against the documented
'pk-lf-' / 'sk-lf-' prefixes that Langfuse always issues server-side.
Anything else fires a single warning naming the offending env var(s)
with a log-safe value preview (full string for short placeholders so the
operator knows which template they left in place; truncated for long
values so a real secret pasted into the wrong field never hits the log),
then short-circuits via the existing _INIT_FAILED cache so the warning
fires once per process, not once per hook invocation.

The check sits after the 'Langfuse is None' SDK-installed guard so hosts
without the optional langfuse SDK don't see misleading 'set real keys'
hints when the actionable fix is 'pip install langfuse'. Missing
credentials remains the documented opt-out path and stays silent — no
log noise for unconfigured installs.

Fixes #22763
Fixes #23823

* fix(langfuse): use actual API request messages for generation input

on_pre_llm_request previously used the messages kwarg alone, which
could be None when Hermes passes the payload via request_messages,
conversation_history, or user_message instead. Add _coerce_request_messages
to pick the first available list across all variants, falling back to a
synthetic user message. Generations now show the real outbound payload
rather than an empty input.

* fix(langfuse): record tool call outputs in traces

Tool observations showed input (arguments) but output was always
undefined. Root cause: when tool_call_id is empty, pre_tool_call stored
observations under a unique time-based key that post_tool_call could
never reconstruct, so every tool span was closed without output by the
_finish_trace sweep.

Fix pre/post matching by routing empty-tool_call_id tools through a
per-name FIFO queue (pending_tools_by_name) instead of the time-based
key. Tools with a tool_call_id continue to use the id-keyed dict.

Also:
 - Preserve OpenAI-style nested function shape in serialized tool calls
   so Langfuse renders name/arguments correctly
 - Keep name + tool_call_id on role:tool messages for proper pairing
 - Backfill tool results onto the matching turn_tool_calls entry so the
   generation's tool-call record carries the result alongside arguments
 - Coerce request messages from whichever field the runtime provides
   (request_messages, messages, conversation_history, user_message)

* fix(langfuse): salvage-review polish — drop dead is_first_turn, shallow-copy request_messages, real threaded FIFO test

Self-review of the combined #22345 + #23831 salvage surfaced three issues
worth fixing in the same PR rather than as follow-ups:

1. Drop is_first_turn from the pre_api_request hook. The boolean expression
   `not bool(conversation_history)` was wrong: conversation_history is
   reassigned to None mid-run after compression (5 sites in run_agent.py),
   so the value flips False -> True mid-conversation on every post-compression
   API call. The langfuse plugin never consumed it, so the kwarg was both
   misleading AND dead.

2. Replace copy.deepcopy(request_messages) with shallow list() copy. The
   pre_api_request hook contract discards return values (invoke_hook never
   writes back to api_kwargs), and the langfuse plugin's _serialize_messages
   already builds its own snapshot dicts via _safe_value. A deepcopy on every
   API call would walk every tool result and base64 image — significant
   overhead for no real isolation benefit. Shallow copy of the outer list
   protects against later mutations of api_messages without paying for the
   inner-dict walk.

3. Rename test_empty_tool_call_id_concurrent_fifo_order ->
   test_empty_tool_call_id_observations_are_fifo_within_tool_name and add a
   real test_threaded_post_calls_preserve_fifo_under_lock that spawns 8
   threads behind a barrier to actually exercise _STATE_LOCK on the
   pending_tools_by_name queue. The original test was sequential and only
   validated Python list semantics; this one validates the lock discipline.

4. Fix stale 'Cleared by reset_cache_for_tests()' comment on _INIT_FAILED —
   that function does not exist. Tests reload the module via sys.modules.pop
   + importlib.import_module instead.

Tests: 37 langfuse plugin tests pass, 658 plugin tests overall pass.

---------

Co-authored-by: xxxigm <tuancanhnguyen706@gmail.com>
Co-authored-by: Brian Conklin <brian@dralth.com>
2026-05-15 05:04:02 -07:00
..
image_gen test(xai-image): regression-guard literal '1k'/'2k' resolution payload 2026-05-09 13:07:46 -07:00
memory fix(memory): skip OpenViking upload symlinks 2026-05-14 07:48:03 -07:00
video_gen feat(video_gen): unified video_generate tool with pluggable provider backends (#25126) 2026-05-13 16:39:41 -07:00
web feat(web): firecrawl plugin natively supports crawl; delete legacy inline path 2026-05-13 22:31:28 -07:00
__init__.py fix: mem0 API v2 compat, prefetch context fencing, secret redaction (#5423) 2026-04-05 22:43:33 -07:00
test_achievements_plugin.py fix: stabilize CI — TS widen, sys.modules restore, WS subscriber race (#17836) 2026-04-30 01:34:08 -07:00
test_disk_cleanup_plugin.py feat(plugins): make all plugins opt-in by default 2026-04-20 04:46:45 -07:00
test_google_meet_audio.py feat(plugins): google_meet \u2014 join, transcribe, speak, follow up (#16364) 2026-04-27 06:22:25 -07:00
test_google_meet_node.py feat(plugins): google_meet \u2014 join, transcribe, speak, follow up (#16364) 2026-04-27 06:22:25 -07:00
test_google_meet_plugin.py feat(plugins): google_meet \u2014 join, transcribe, speak, follow up (#16364) 2026-04-27 06:22:25 -07:00
test_google_meet_realtime.py feat(plugins): google_meet \u2014 join, transcribe, speak, follow up (#16364) 2026-04-27 06:22:25 -07:00
test_kanban_dashboard_plugin.py test(kanban): remove stale t.summary assertion from search test 2026-05-10 21:44:37 -07:00
test_langfuse_plugin.py fix(langfuse): complete observability fix — trace I/O, tool outputs, placeholder credentials (closes #22342, #22763) (#26320) 2026-05-15 05:04:02 -07:00
test_retaindb_plugin.py test: speed up slow tests (backoff + subprocess + IMDS network) (#11797) 2026-04-17 14:21:22 -07:00
test_teams_pipeline_plugin.py fix(teams-pipeline): fill in missing delivery URL in adapter-reuse test 2026-05-08 12:00:09 -07:00