feat(agents): add prompt-compaction middleware for McpClient #2055
Mgczacki wants to merge 2 commits into
Conversation
Caps the prompt the agent sends to its LLM so the conversation history
never grows unbounded. Runs as a langchain AgentMiddleware via
create_agent(middleware=...), so the size bound becomes an invariant of
the agent loop — `before_model` fires before every model call, including
intra-turn re-invocations (model -> tool -> tool result -> model).
Two-stage compaction:
1. Strip image content blocks from older messages (replace with a small
text placeholder).
2. If still over target, summarize older messages into a single
SystemMessage and keep the most recent turns verbatim.
The current turn (latest dimos_turn group + any trailing untagged
messages, i.e. in-flight tool calls) is preserved untouched — never
compacted, never image-stripped.
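The two-stage flow above can be sketched as follows. This is illustrative only — `Msg`, `count_tokens`, and the helper logic are simplified stand-ins for the real langchain message types and middleware internals, not the PR's implementation:

```python
from dataclasses import dataclass

@dataclass
class Msg:
    content: str
    images: int = 0   # number of image blocks in this message
    turn: int = 0     # stand-in for additional_kwargs["dimos_turn"]

def count_tokens(msgs):
    # pessimistic placeholder: ~3 chars/token plus a flat 1000 per image
    return sum(len(m.content) // 3 + 1000 * m.images for m in msgs)

def before_model(msgs, threshold, target):
    if count_tokens(msgs) <= threshold:
        return None  # below threshold: no-op
    # everything in the latest turn is sacred and never touched
    cur = [m for m in msgs if m.turn == msgs[-1].turn]
    older = [m for m in msgs if m.turn != msgs[-1].turn]
    # stage 1: replace image blocks in older messages with a text placeholder
    older = [Msg(m.content + (" [image removed]" if m.images else ""), 0, m.turn)
             for m in older]
    if count_tokens(older + cur) <= target:
        return older + cur
    # stage 2: fold older turns into a single summary message
    summary = Msg("summary: " + " | ".join(m.content[:20] for m in older))
    return [summary] + cur
```

In the real middleware the return value is applied through langgraph's `add_messages` reducer (via a `RemoveMessage` sentinel) rather than returned as a plain list.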
Configuration via McpClientConfig fields, env-driven by default:
- AGENT_COMPACTION_THRESHOLD: trigger size (default 40000)
- AGENT_COMPACTION_TARGET: size after compaction (default 3000)
- AGENT_COMPACTION_SUMMARY_SIZE: generated summary size (default 1000)
- AGENT_COMPACTION_MODEL: optional separate summarizer model
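A minimal sketch of how such env-driven defaults can be wired. The real code loads these through pydantic `Field(default_factory=...)` on `McpClientConfig`; the `CompactionConfig` class and attribute names below are hypothetical:

```python
import os

def _env_int(name: str, default: int) -> int:
    # falls back to the default when the variable is unset or empty
    v = os.environ.get(name)
    return int(v) if v else default

class CompactionConfig:
    """Illustrative stand-in for the McpClientConfig compaction fields."""
    def __init__(self):
        self.threshold = _env_int("AGENT_COMPACTION_THRESHOLD", 40000)
        self.target = _env_int("AGENT_COMPACTION_TARGET", 3000)
        self.summary_size = _env_int("AGENT_COMPACTION_SUMMARY_SIZE", 1000)
        # None means: reuse the agent's own model as summarizer
        self.model = os.environ.get("AGENT_COMPACTION_MODEL")
```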
Also includes:
- Per-message turn tagging via additional_kwargs["dimos_turn"], stamped
in McpClient._process_message so compaction can group/score by turn.
- McpClient._history mirror updated to honor langgraph's add_messages
reducer semantics (RemoveMessage(id=REMOVE_ALL_MESSAGES) sentinel) so
the local history doesn't accrete pre-compaction state.
- Token counter is a pessimistic placeholder (3 chars/token,
1000/image), memoized on each message for O(new-only) recompute cost.
Designed to be swapped for a real tokenizer later without touching
callers.
- 15 pytest cases (hermetic, no API key needed), including two
integration tests that drive a real create_agent loop and prove
compaction can fire mid-turn between a tool result and the next
model call.
Defaults are intentionally conservative so the feature is on by default
without changing behavior for short sessions.
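The memoized placeholder counter mentioned above can be sketched like this, assuming dict-based messages with a `kwargs` stand-in for `additional_kwargs` (not the actual implementation):

```python
def count_message_tokens(msg: dict) -> int:
    # memoized: a message's count is computed once, so recounting the whole
    # history only pays for messages that haven't been seen yet
    cached = msg.setdefault("kwargs", {}).get("dimos_tokens")
    if cached is not None:
        return cached
    tokens = 0
    for block in msg["content"]:
        if block.get("type") == "image":
            tokens += 1000  # flat pessimistic cost per image block
        else:
            tokens += len(block.get("text", "")) // 3  # ~3 chars per token
    msg["kwargs"]["dimos_tokens"] = tokens
    return tokens
```

Because the estimate only needs to be pessimistic (never undercount badly enough to overflow), swapping in a real tokenizer later changes nothing for callers.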
Greptile Summary
Confidence Score: 5/5 — safe to merge; the compaction logic is well-reasoned, hermetically tested, and defaults are conservative enough to opt out if something unexpected arises in production. The core algorithm is correct and well-tested across 15 unit and integration cases. The _apply_messages_update history-sync logic handles the REMOVE_ALL_MESSAGES sentinel correctly and suppresses duplicate publishes. The only findings are an edge-case gap in untagged-message boundary alignment (normal agent flow is unaffected since all messages are tagged), an unused helper method, and a silent or-fallback for zero-value env vars. None of these affect the happy path. No files require special attention; all findings are confined to edge cases and dead code.
Sequence Diagram
sequenceDiagram
participant U as User
participant MC as McpClient
participant G as LangGraph agent
participant MW as DimosCompactionMiddleware
participant LLM as Chat model
participant H as _history
U->>MC: HumanMessage
MC->>MC: increment turn, tag message
MC->>H: append and publish
MC->>G: stream(history)
loop each model call in the agent loop
G->>MW: before_model(state)
alt total tokens <= threshold
MW-->>G: None (no-op)
else stage 1 image strip suffices
MW-->>G: RemoveMessage + stripped + current_turn
else stage 2 summarize
MW->>LLM: invoke transcript
LLM-->>MW: summary text
MW-->>G: RemoveMessage + protected + SummaryMsg + keep + current_turn
end
G->>LLM: invoke compacted messages
LLM-->>G: AIMessage
G-->>MC: stream update
MC->>MC: _apply_messages_update
MC->>H: rebuild history, publish new messages only
end
Reviews (2): last reviewed commit "fix(compaction): address Greptile review..."
```python
def _env_int(name: str) -> int | None:
    v = os.environ.get(name)
    return int(v) if v else None
```
_env_int calls int(v) without a try/except, so a non-numeric value like AGENT_COMPACTION_THRESHOLD=abc raises a bare ValueError deep inside pydantic's default_factory during config construction, producing an unhelpful traceback with no mention of which env var is at fault.
Suggested change:

```python
def _env_int(name: str) -> int | None:
    v = os.environ.get(name)
    if not v:
        return None
    try:
        return int(v)
    except ValueError:
        raise ValueError(f"Environment variable {name!r} must be an integer, got {v!r}") from None
```
- McpClient._apply_messages_update: dedupe publish on compaction replay. When the middleware emits [RemoveMessage, protected..., summary, keep..., current_turn...], the protected/keep/current messages are the same Python objects that were already published when they first arrived. Skip publish+print for any iter_msg whose id() was in the pre-wipe history; only the genuinely-new summary (and later AIMessages from the agent node in subsequent stream updates) get republished. Identified by Greptile P1.
- McpClient._env_int: re-raise a labeled ValueError when the env var value isn't a valid integer, so misconfiguration surfaces with the offending name instead of a bare pydantic traceback. Identified by Greptile P2.
- DimosCompactionMiddleware._static_tokens: drop the per-call hash computation. Inputs (system_prompt, tool_schemas) are bound at __init__ and never mutate, so a simple None-check on the cache is sufficient. Identified by Greptile P2.
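The reducer-mirroring behavior behind `_apply_messages_update` can be sketched like this. `REMOVE_ALL` and the dict message shape are stand-ins for langgraph's `RemoveMessage(id=REMOVE_ALL_MESSAGES)` sentinel and langchain message objects:

```python
REMOVE_ALL = "__remove_all__"  # stand-in for the REMOVE_ALL_MESSAGES sentinel

def apply_messages_update(history: list[dict], update: list[dict]) -> list[dict]:
    for msg in update:
        if msg.get("remove") == REMOVE_ALL:
            history = []  # wipe: everything that follows is the new history
        elif msg.get("remove"):
            # specific-id removal
            history = [m for m in history if m["id"] != msg["remove"]]
        else:
            existing = {m["id"]: i for i, m in enumerate(history)}
            if msg["id"] in existing:
                history[existing[msg["id"]]] = msg  # same id: replace in place
            else:
                history = history + [msg]           # new id: append
    return history
```

The real method additionally skips re-publishing any message whose identity was already in the pre-wipe history, so only genuinely new messages (like the summary) reach subscribers.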
Summary
Closes #1899
Caps the prompt the dimos agent sends to its LLM so the conversation history
never grows unbounded. Implemented as a langchain AgentMiddleware plugged into
create_agent(middleware=...). Because the hook (before_model) fires before
every model invocation, the input-size bound becomes an invariant of the agent
loop — including intra-turn re-invocations (model → tool → tool result → model).
On long sessions the middleware quietly summarizes older turns once it detects
an oversized prompt. Behavior is unchanged for short sessions.
Concepts
dimos_turn
A new integer tag attached to each message's additional_kwargs dict.
Incremented once per McpClient._process_message call — that is, once per
user-facing turn (a human input from agent-send, or a tool-stream
notification that wakes the agent). Every message that flows through during
that turn — the input HumanMessage, intermediate AIMessages with tool_calls,
the resulting ToolMessages, the final AIMessage — all get stamped with the
same turn number.
This is what lets compaction:
- keep whole turns together (compaction selects entire turns, never partial ones — no orphan tool_call_id references).
- find the current turn (the latest dimos_turn group plus any trailing untagged in-flight messages from the agent loop) and preserve it untouched regardless of threshold.
- keep the most recent turns verbatim (rather than message-count keep-N-most-recent strategies).
dimos_turn is metadata only — it lives in additional_kwargs, which
providers ignore but langchain serialization preserves. The compaction
summary itself is tagged with the max turn it covers (plus
dimos_compacted: True), so re-compaction folds the prior summary into the
next one cleanly.
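A rough sketch of the tagging and the boundary walk, with dict messages and simplified names standing in for the real `_tag_turn` / `_current_turn_start` helpers:

```python
def tag_turn(msg: dict, turn: int) -> dict:
    # stamp the message with the current turn number
    msg.setdefault("additional_kwargs", {})["dimos_turn"] = turn
    return msg

def current_turn_start(msgs: list[dict]) -> int:
    """Index where the current (sacred) turn begins."""
    i = len(msgs)
    # trailing untagged messages (in-flight tool calls) belong to the current turn
    while i > 0 and msgs[i - 1].get("additional_kwargs", {}).get("dimos_turn") is None:
        i -= 1
    if i == 0:
        return 0
    latest = msgs[i - 1]["additional_kwargs"]["dimos_turn"]
    # then walk back through the contiguous latest-turn group
    while i > 0 and msgs[i - 1].get("additional_kwargs", {}).get("dimos_turn") == latest:
        i -= 1
    return i
```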
Current turn is sacred
_current_turn_start walks from the end of the message list to find the
boundary of the latest turn. Everything from that boundary forward is never
compacted — no image strip, no summary touch. This protects:
- in-flight tool calls still awaiting their ToolMessage responses
How it works
Two-stage compaction inside before_model:
1. Strip images in messages older than the current turn. Image content
blocks are replaced with a small text placeholder. If this alone gets us
below target_tokens, we stop here.
As to why I decided to strip images: LLMs' visual reasoning capabilities are
currently noticeably worse than their text reasoning. Additionally, the way the
agent loop is set up right now means the model sees the image at the
beginning of a new turn, and it tends to give a description of what's in the image.
This description is detailed enough for reasoning about the content of the image,
but it also causes a secondary effect: the model, when considering the image, tends
to anchor its perception (even if the image is available in chat history) to the
comment it gave at that moment. Keeping images that were already observed therefore
seems like a waste of tokens that we can save, since we are already going to cause a
cache burst with our compaction process.
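Stage 1 might look roughly like this (hypothetical names; real image blocks follow langchain's content-block schema, which is richer than the dicts below):

```python
def strip_images(content: list[dict]) -> tuple[list[dict], bool]:
    """Replace image blocks with a small text placeholder.

    Returns the new block list and whether anything was stripped.
    """
    out, stripped = [], False
    for block in content:
        if block.get("type") in ("image", "image_url"):
            out.append({"type": "text", "text": "[image removed during compaction]"})
            stripped = True
        else:
            out.append(block)
    return out, stripped
```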
2. Summarize older messages into a single SystemMessage while keeping
the most recent turns verbatim. The summarizer LLM is configurable;
defaults to reusing the agent's own model. Output is hard-capped via
summarizer.bind(max_tokens=summary_size_tokens).
See it in action
A public Langfuse trace captured with deliberately small defaults so
compaction fires after a handful of turns:
https://us.cloud.langfuse.com/project/cmp23t80n09ooad08jnw1lksy/traces/887630cfbf49bb97f1c5b4d2cc980ad1?observation=b73fcf77cb4f2dc5&timestamp=2026-05-12T07:54:34.311Z
Use the trace timeline to see the prompt that hits the LLM at each
agent-turn-N span — older turns get folded into a single summary
SystemMessage and the agent continues with a shrunk prompt.
Configuration
All on by default via McpClientConfig, env-driven:

| Env var | Config field | Default |
| --- | --- | --- |
| AGENT_COMPACTION_THRESHOLD | agent_compaction_threshold | 40000 |
| AGENT_COMPACTION_TARGET | agent_compaction_target | 3000 |
| AGENT_COMPACTION_SUMMARY_SIZE | agent_compaction_summary_size | 1000 |
| AGENT_COMPACTION_MODEL | agent_compaction_model | None (reuses agent's model) |

Why a middleware
Two reasons, both documented in compaction_middleware.py's module docstring:
1. Compacting McpClient._history directly would only fire once per user turn,
leaving every intra-turn re-invocation unprotected. Middleware fires before
each model call.
2. before_model vs after_model/wrap_model_call: before_model is the
minimal-intervention hook. after_model is too late (the model already
errored on overflow); wrap_model_call conflates compaction with the
model-call concerns (retries, error shaping, tool dispatch).
Changes
New files
- dimos/agents/compaction_middleware.py — DimosCompactionMiddleware class
(subclass of langchain.agents.middleware.AgentMiddleware), a placeholder
token counter (3 chars/token, 1000 tokens/image; memoized in
additional_kwargs["dimos_tokens"] for O(new-only) recompute), a static
token cache for system_prompt + tool schemas, and the algorithm helpers
(_strip_images, _split, _current_turn_start, _summarize).
- dimos/agents/test_compaction_middleware.py — 15 pytest cases, hermetic
(no API key needed). Coverage includes:
  - before_model no-op below threshold
  - integration tests that drive a real create_agent loop with a
RecordingFakeAgent and assert: (a) the agent node receives a compacted
prompt (proves langgraph's add_messages reducer interprets the
RemoveMessage(REMOVE_ALL_MESSAGES) sentinel correctly), and (b) compaction
can fire mid-turn between a tool result and the next model call.
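The hermetic-test pattern can be sketched without langchain at all. `RecordingFakeModel` and the loop below are simplified stand-ins for the suite's `RecordingFakeAgent` and the real create_agent wiring — the point is recording each prompt so a test can assert compaction fired before the model ever saw the history:

```python
class RecordingFakeModel:
    """Records every prompt it receives so tests can assert on compaction."""
    def __init__(self):
        self.prompts: list[list] = []

    def invoke(self, messages):
        self.prompts.append(list(messages))
        return {"role": "ai", "content": "ok"}

def run_agent_loop(model, before_model, history, steps=2):
    # toy agent loop: compact, call model, append a tool result, repeat
    for _ in range(steps):
        compacted = before_model(history)
        if compacted is not None:
            history = compacted  # reducer applies the replacement
        model.invoke(history)
        history = history + [{"role": "tool", "content": "result"}]
    return history
```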
Modified: dimos/agents/mcp/mcp_client.py
- McpClientConfig reads the env vars in the table above, via _env_int/_env_str
helpers loaded through pydantic Field(default_factory=...).
- New _turn: int counter on McpClient (incremented at the top of
_process_message), and a new module-level _tag_turn(message, turn) helper
that stamps additional_kwargs["dimos_turn"]. Every message flowing through
a turn gets stamped — the incoming HumanMessage first, then every message
emitted by the state graph.
- New _apply_messages_update method that mirrors langgraph's add_messages
reducer semantics locally — honors RemoveMessage(id=REMOVE_ALL_MESSAGES) as
"wipe history, use what follows" and specific-id RemoveMessage as targeted
removal. This keeps McpClient._history in sync with the graph's internal
state even when the middleware replaces the entire message list.
- In on_system_modules, construct the summarizer (either via
init_chat_model(agent_compaction_model), or init_chat_model(model) if the
agent's model is a string, or reuse the agent's model object), build the
middleware with the system prompt and tool JSON schemas
(t.args_schema.model_json_schema()), and pass it as
create_agent(..., middleware=middleware).
- Handle middleware no-op updates that yield {node: None} instead of
{node: {"messages": [...]}}, which would previously crash with
'NoneType' object has no attribute 'get'.
Modified:
.gitignore
Adds MUJOCO_LOG.TXT (MuJoCo runtime artifact written to the repo root on
every sim run; should never be committed).
Test plan
- uv run pytest dimos/agents/test_compaction_middleware.py -v — 15/15 pass.
- uv run mypy dimos/agents/compaction_middleware.py dimos/agents/test_compaction_middleware.py dimos/agents/mcp/mcp_client.py — clean.
- dimos --simulation run unitree-go2-agentic with
AGENT_COMPACTION_THRESHOLD=2000: drive the agent until the threshold
is crossed, confirm a "Compaction fired (summarize)" log line appears,
and the next prompt sent to the LLM contains the summary SystemMessage
instead of the older turns.
Known limitations
Documented in the module docstring as "Known limitations":
- Progressive disclosure with a content store is the right long-term answer.
- A very long session could exceed the summarizer model's own context window.
Mitigation deferred to a follow-up (chunked summarization).
- @retry(on_exception=Exception) is intentionally broad because the
summarizer is duck-typed; permanent errors cost up to 3 attempts + 1s of
sleeps before propagating.
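One possible shape for the deferred chunked-summarization follow-up (purely hypothetical, not part of this PR): summarize the transcript in windows that each fit the summarizer's context, then summarize the partial summaries.

```python
def chunked_summarize(transcript: list[str], summarize, max_chars: int) -> str:
    """Map-reduce summarization: chunk, summarize each chunk, then combine.

    `summarize` is any callable str -> str (e.g. a bound summarizer model).
    """
    chunks, cur, size = [], [], 0
    for msg in transcript:
        if size + len(msg) > max_chars and cur:
            chunks.append(summarize("\n".join(cur)))  # flush current window
            cur, size = [], 0
        cur.append(msg)
        size += len(msg)
    if cur:
        chunks.append(summarize("\n".join(cur)))
    # reduce step: fold partial summaries into one
    return chunks[0] if len(chunks) == 1 else summarize("\n".join(chunks))
```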