fix(gateway): unify MEDIA: extraction extension set + close the unknown-ext black hole (#34517) (#34844)

MEDIA:<path> tags for .md/.json/.yaml/.xml/.html and other document
extensions were silently dropped. extract_media() carried a narrow
extension allowlist that omitted them, while extract_local_files()
had a broad one. The dispatch sites then ran an unconditional
re.sub(r'MEDIA:\\s*\\S+', '') that stripped the tag from the body even
when extract_media had not matched it — so extract_local_files (broad
list) ran on text where the path was already gone, and the file was
delivered by neither path.

- Add MEDIA_DELIVERY_EXTS in gateway/platforms/base.py as the single
  source of truth; extract_media and extract_local_files both derive
  their extension set from it (no more drift).
- Replace the loose MEDIA cleanup at the non-streaming dispatch site
  (base.py) and the streaming consumer (stream_consumer.py) with the
  shared, extension-anchored MEDIA_TAG_CLEANUP_RE. A MEDIA: tag with an
  unknown extension is left in the body so the bare-path detector can
  still pick it up instead of being black-holed.
- Chain cleaned text through extract_media -> extract_images ->
  extract_local_files in run.py's post-stream media delivery (it was
  dropping the cleaned text and rescanning raw text with MEDIA: tags).
- Regression tests covering both halves: previously-dropped extensions
  now extract, and unknown-ext paths survive the cleanup.

Consolidates the MEDIA extension-allowlist PR cluster.

Co-authored-by: Bartok9 <259807879+Bartok9@users.noreply.github.com>
Co-authored-by: banditburai <123342691+banditburai@users.noreply.github.com>
Co-authored-by: Kyzcreig <9063726+Kyzcreig@users.noreply.github.com>
This commit is contained in:
Teknium 2026-05-29 13:24:01 -07:00 committed by GitHub
parent 0dc0c5ea6b
commit 781604ce4c
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
5 changed files with 147 additions and 29 deletions

View file

@ -525,6 +525,8 @@ AUTHOR_MAP = {
"barnacleboy.jezzahehn@agentmail.to": "JezzaHehn",
"254021826+dodo-reach@users.noreply.github.com": "dodo-reach",
"259807879+Bartok9@users.noreply.github.com": "Bartok9",
"123342691+banditburai@users.noreply.github.com": "banditburai",
"9063726+Kyzcreig@users.noreply.github.com": "Kyzcreig",
"270082434+crayfish-ai@users.noreply.github.com": "crayfish-ai",
"241404605+MestreY0d4-Uninter@users.noreply.github.com": "MestreY0d4-Uninter",
"268667990+Roy-oss1@users.noreply.github.com": "Roy-oss1",