fix: use UTF-16 length for Telegram stream consumer message splitting

The stream consumer measured message length with Python's len(), which
counts Unicode code points, but Telegram's limit is counted in UTF-16 code
units. Characters outside the Basic Multilingual Plane (most emoji, for
example) take two UTF-16 code units each, so messages containing them could
exceed Telegram's 4096-unit limit, resulting in truncated messages with
formatting artifacts.
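
A minimal sketch of the mismatch, for illustration only. The commit names
utf16_len but does not show its body, so the implementation below is an
assumption, not the project's actual helper:

```python
def utf16_len(text: str) -> int:
    """Count UTF-16 code units in text.

    Assumed implementation: encode to UTF-16 (little endian, no BOM) and
    divide the byte count by two. "surrogatepass" keeps lone surrogates
    from raising during encoding.
    """
    return len(text.encode("utf-16-le", "surrogatepass")) // 2


sample = "ok \U0001F600"      # three BMP characters plus one emoji
code_points = len(sample)      # counts the emoji once
code_units = utf16_len(sample) # counts the emoji as a surrogate pair
```

For text made only of BMP characters the two measures agree, which is why
the overflow only shows up on responses that contain emoji or other
supplementary-plane characters.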

Changes:
- Add message_len_fn property to BasePlatformAdapter (defaults to len)
- Override in TelegramAdapter to return utf16_len
- Stream consumer uses adapter.message_len_fn for:
  - safe_limit calculation
  - overflow detection
  - truncate_message calls
  - split point calculation (via _custom_unit_to_cp)
  - fallback final send chunking
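
The split point step can be sketched as follows. This is a hedged
illustration, not the code from this commit: prefix_within_budget and
split_message are hypothetical names, the utf16_len body is assumed, and the
real _custom_unit_to_cp may work differently. The idea is to find the largest
code-point index whose prefix still fits the platform's limit when measured
with the adapter's length function:

```python
from typing import Callable, List


def utf16_len(text: str) -> int:
    """Assumed implementation: number of UTF-16 code units in text."""
    return len(text.encode("utf-16-le", "surrogatepass")) // 2


def prefix_within_budget(
    text: str, budget: int, len_fn: Callable[[str], int] = len
) -> int:
    """Return the largest code-point index i with len_fn(text[:i]) <= budget.

    Hypothetical helper. Binary search works because the measured length of
    a prefix never decreases as the prefix grows.
    """
    lo, hi = 0, len(text)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if len_fn(text[:mid]) <= budget:
            lo = mid
        else:
            hi = mid - 1
    return lo


def split_message(
    text: str, limit: int, len_fn: Callable[[str], int] = len
) -> List[str]:
    """Split text into chunks whose len_fn measure never exceeds limit."""
    chunks: List[str] = []
    while text:
        cut = prefix_within_budget(text, limit, len_fn)
        if cut == 0:
            raise ValueError("limit is too small to hold a single character")
        chunks.append(text[:cut])
        text = text[cut:]
    return chunks


emoji_text = "\U0001F600" * 5                        # 5 code points, 10 code units
by_units = split_message(emoji_text, 4, utf16_len)   # respects a 4-unit limit
by_points = split_message(emoji_text, 4)             # len() lets 8 units through
```

Slicing happens in code points, so a chunk boundary never lands inside a
surrogate pair, while the budget check itself uses the platform's unit.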

Fixes truncated messages with black square artifacts on Telegram when
the model generates responses containing characters outside the Basic
Multilingual Plane, such as emoji.
Aubrey Freeman III 2026-04-16 13:05:22 -05:00 committed by Teknium
parent c5f1f863ac
commit c0da5d09a6
3 changed files with 54 additions and 12 deletions

@@ -1311,6 +1311,15 @@ class BasePlatformAdapter(ABC):
        # _keep_typing skips send_typing when the chat_id is in this set.
        self._typing_paused: set = set()

    @property
    def message_len_fn(self) -> Callable[[str], int]:
        """Return the length function for measuring message size on this platform.

        Override in adapters whose platform counts characters differently from
        Python ``len`` (e.g. Telegram counts UTF-16 code units).
        """
        return len

    @property
    def has_fatal_error(self) -> bool:
        return self._fatal_error_message is not None

@@ -283,6 +283,11 @@ class TelegramAdapter(BasePlatformAdapter):
    MEDIA_GROUP_WAIT_SECONDS = 0.8
    _GENERAL_TOPIC_THREAD_ID = "1"

    @property
    def message_len_fn(self):
        """Telegram measures message length in UTF-16 code units."""
        return utf16_len

    def __init__(self, config: PlatformConfig):
        super().__init__(config, Platform.TELEGRAM)
        self._app: Optional[Application] = None