
feat(agents): add prompt-compaction middleware for McpClient#2055

Open
Mgczacki wants to merge 2 commits into dimensionalOS:main from Mgczacki:feat/agent-prompt-compaction
Open

feat(agents): add prompt-compaction middleware for McpClient#2055
Mgczacki wants to merge 2 commits into
dimensionalOS:mainfrom
Mgczacki:feat/agent-prompt-compaction

Conversation

@Mgczacki

@Mgczacki Mgczacki commented May 12, 2026

Summary

Closes #1899

Caps the prompt the dimos agent sends to its LLM so the conversation history
never grows unbounded. Implemented as a langchain AgentMiddleware plugged into
create_agent(middleware=...). Because the hook (before_model) fires before
every model invocation, the input-size bound becomes an invariant of the agent
loop — including intra-turn re-invocations (model → tool → tool result → model).

On long sessions the middleware quietly summarizes older turns once it detects
an oversized prompt. Behavior is unchanged for short sessions.

Concepts

dimos_turn

A new integer tag attached to each message's additional_kwargs dict.
Incremented once per McpClient._process_message call — that is, once per
user-facing turn (a human input from agent-send, or a tool-stream
notification that wakes the agent). Every message that flows through during
that turn — the input HumanMessage, intermediate AIMessages with
tool_calls, the resulting ToolMessages, the final AIMessage — all get
stamped with the same turn number.

This is what lets compaction:

  1. Group messages by turn so tool_call/tool_response pairs always travel
    together (compaction selects entire turns, never partial ones — no orphan
    tool_call_id references).
  2. Identify the current turn (the latest tag value plus any trailing
    untagged in-flight messages from the agent loop) and preserve it untouched
    regardless of threshold.
  3. Score / inspect the history per-turn for future heuristics (e.g.,
    keep-N-most-recent strategies).

dimos_turn is metadata only — it lives in additional_kwargs, which
providers ignore but langchain serialization preserves. The compaction
summary itself is tagged with the max turn it covers (plus
dimos_compacted: True), so re-compaction folds the prior summary into the
next one cleanly.
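The tagging-and-grouping idea above can be sketched standalone. This is an illustrative toy, not the real implementation: `Msg` is a minimal stand-in for langchain's `BaseMessage`, and `tag_turn` mimics what the PR's module-level helper is described as doing.

```python
from dataclasses import dataclass, field
from itertools import groupby

@dataclass
class Msg:  # minimal stand-in for a langchain BaseMessage
    content: str
    additional_kwargs: dict = field(default_factory=dict)

def tag_turn(message: Msg, turn: int) -> Msg:
    # Metadata only: providers ignore additional_kwargs,
    # but langchain serialization preserves it.
    message.additional_kwargs["dimos_turn"] = turn
    return message

history = [tag_turn(Msg("hi"), 1), tag_turn(Msg("hello"), 1), tag_turn(Msg("go"), 2)]
# Grouping by the tag is what keeps tool_call/tool_response pairs together:
by_turn = {t: [m.content for m in g]
           for t, g in groupby(history, key=lambda m: m.additional_kwargs["dimos_turn"])}
print(by_turn)  # {1: ['hi', 'hello'], 2: ['go']}
```

Because compaction selects whole groups from this mapping, a `ToolMessage` can never be separated from the `AIMessage` that issued its `tool_call_id`.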

Current turn is sacred

_current_turn_start walks from the end of the message list to find the
boundary of the latest turn. Everything from that boundary forward is never
compacted — no image strip, no summary touch. This protects:

  • The user's current query
  • In-progress tool calls and their pending ToolMessage responses
  • Fresh images from perception that the user might be asking about right now
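A minimal sketch of the boundary walk described above (names and message shape are assumptions; the real `_current_turn_start` lives in `compaction_middleware.py`):

```python
from dataclasses import dataclass, field

@dataclass
class Msg:  # stand-in for a langchain BaseMessage
    content: str
    additional_kwargs: dict = field(default_factory=dict)

def current_turn_start(messages: list[Msg]) -> int:
    # Walk from the end to find the latest explicit dimos_turn tag,
    # skipping trailing untagged in-flight messages from the agent loop.
    latest = None
    for m in reversed(messages):
        t = m.additional_kwargs.get("dimos_turn")
        if t is not None:
            latest = t
            break
    if latest is None:
        return 0  # nothing tagged: protect everything
    # Boundary = first message of the latest turn; everything from there on
    # (including the trailing untagged messages) is never compacted.
    for i, m in enumerate(messages):
        if m.additional_kwargs.get("dimos_turn") == latest:
            return i
    return 0

msgs = [Msg("q1", {"dimos_turn": 1}), Msg("a1", {"dimos_turn": 1}),
        Msg("q2", {"dimos_turn": 2}), Msg("tool call")]  # trailing untagged
print(current_turn_start(msgs))  # 2
```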

How it works

Two-stage compaction inside before_model:

  1. Strip images in messages older than the current turn. Image content
    blocks are replaced with a small text placeholder. If this alone gets us
    below target_tokens, we stop here.

    Caveat: this is an incomplete solution. Dropping the image with only
    a "[image removed]" placeholder is destructive as the model can no
    longer refer back to that perception. A more principled design would
    follow progressive disclosure: keep the image addressable in a content
    store and replace the inline block with a reference (e.g.,
    [image: ref://…]) plus a tool the agent can call to re-fetch it on
    demand. I am deferring this decision as it needs a broader agent-harness
    conversation about content addressability.

As for why I decided to strip images: LLMs' visual reasoning is currently noticeably weaker than their text reasoning. Additionally, the way the agent loop is set up right now, the model sees each image at the beginning of a new turn and tends to produce a description of its contents. That description is detailed enough to reason about the image, but it has a secondary effect: the model anchors its perception to the description it gave at that moment, even when the image remains available in the chat history. Keeping already-observed images around therefore wastes tokens we can reclaim, especially since the compaction process is already going to cause a cache burst.

  2. Summarize older messages into a single SystemMessage while keeping
    the most recent turns verbatim. The summarizer LLM is configurable;
    defaults to reusing the agent's own model. Output is hard-capped via
    summarizer.bind(max_tokens=summary_size_tokens).
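The two-stage decision logic can be condensed into a runnable toy. Everything here is a stand-in (strings for messages, a crude 3-chars/token count, a lambda summarizer); only the control flow mirrors what the PR describes happening inside `before_model`:

```python
def compact(messages, threshold, target, summarize):
    def count(msgs):  # placeholder heuristic, mirroring ~3 chars/token
        return sum(len(m) // 3 for m in msgs)
    if count(messages) <= threshold:
        return messages                                # no-op below threshold
    older, current = messages[:-1], messages[-1:]      # current turn is sacred
    older = [m.replace("<image>", "[image removed]") for m in older]  # stage 1
    if count(older + current) <= target:
        return older + current                         # image strip sufficed
    return [f"[summary] {summarize(older)}"] + current  # stage 2

out = compact(["<image>" * 50, "old text " * 50, "current query"],
              threshold=10, target=20,
              summarize=lambda ms: "prior turns summarized")
print(out)  # current turn survives verbatim at the end
```

The real middleware additionally emits a `RemoveMessage(REMOVE_ALL_MESSAGES)` sentinel so langgraph replaces the state's message list rather than appending to it.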

See it in action

A public Langfuse trace captured with deliberately small defaults so
compaction fires after a handful of turns:

https://us.cloud.langfuse.com/project/cmp23t80n09ooad08jnw1lksy/traces/887630cfbf49bb97f1c5b4d2cc980ad1?observation=b73fcf77cb4f2dc5&timestamp=2026-05-12T07:54:34.311Z

Use the trace timeline to see the prompt that hits the LLM at each
agent-turn-N span — older turns get folded into a single summary
SystemMessage and the agent continues with a shrunk prompt.

Configuration

All on by default via McpClientConfig, env-driven:

Env var Field Default
AGENT_COMPACTION_THRESHOLD agent_compaction_threshold 40000
AGENT_COMPACTION_TARGET agent_compaction_target 3000
AGENT_COMPACTION_SUMMARY_SIZE agent_compaction_summary_size 1000
AGENT_COMPACTION_MODEL agent_compaction_model None (reuses agent's model)
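A hedged sketch of the env-driven defaulting pattern the table describes, using a plain dataclass in place of the real pydantic `McpClientConfig` (field and env-var names are from the PR; the helper shape is an assumption):

```python
from __future__ import annotations
import os
from dataclasses import dataclass, field

def _env_int(name: str, default: int) -> int:
    v = os.environ.get(name)
    # Note: `if v` treats an explicit "0" the same as unset — the or-fallback
    # edge case the review below flags for the real implementation.
    return int(v) if v else default

@dataclass
class CompactionConfig:  # stand-in for the McpClientConfig fields
    agent_compaction_threshold: int = field(
        default_factory=lambda: _env_int("AGENT_COMPACTION_THRESHOLD", 40000))
    agent_compaction_target: int = field(
        default_factory=lambda: _env_int("AGENT_COMPACTION_TARGET", 3000))
    agent_compaction_summary_size: int = field(
        default_factory=lambda: _env_int("AGENT_COMPACTION_SUMMARY_SIZE", 1000))
    agent_compaction_model: str | None = field(
        default_factory=lambda: os.environ.get("AGENT_COMPACTION_MODEL"))

cfg = CompactionConfig()
```

Using `default_factory` (rather than a module-level default) means the env vars are re-read each time a config object is constructed, which keeps tests hermetic.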

Why a middleware

Two reasons, both documented in compaction_middleware.py's module docstring:

  1. Middleware vs preprocessing. External preprocessing on _history
    would only fire once per user turn, leaving every intra-turn re-invocation
    unprotected. Middleware fires before each model call.
  2. before_model vs after_model / wrap_model_call. before_model is
    the minimal-intervention hook. after_model is too late (the model
    already errored on overflow); wrap_model_call conflates compaction with
    the model-call concerns (retries, error shaping, tool dispatch).

Changes

New files

  • dimos/agents/compaction_middleware.py — DimosCompactionMiddleware
    class (subclass of langchain.agents.middleware.AgentMiddleware),
    placeholder token counter (3 chars/token, 1000 tokens/image; memoized in
    additional_kwargs["dimos_tokens"] for O(new-only) recompute), static
    token cache for system_prompt + tool schemas, and the algorithm helpers
    (_strip_images, _split, _current_turn_start, _summarize).
  • dimos/agents/test_compaction_middleware.py — 15 pytest cases,
    hermetic (no API key needed). Coverage includes:
    • Token counter unit tests (text, image, memoization, static cache)
    • before_model no-op below threshold
    • Stage 1 alone suffices (image strip only)
    • Stage 2 summarization with FakeListChatModel summarizer
    • Protected SystemMessage prefix preserved
    • Mid-list untagged messages get summarized (not protected)
    • Prior summary re-folded into the next summary (no stacking)
    • Most-recent turns kept verbatim
    • Tool-call/tool-response pairs never split across summarize/keep boundary
    • Summarizer failure propagates after retries
    • Two integration tests that drive a real create_agent loop with a
      RecordingFakeAgent and assert: (a) the agent node receives a compacted
      prompt (proves langgraph's add_messages reducer interprets the
      RemoveMessage(REMOVE_ALL_MESSAGES) sentinel correctly), and
      (b) compaction can fire mid-turn between a tool result and the next
      model call.
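The placeholder token counter described above can be sketched as follows. The message shape and content-block keys are stand-ins loosely modeled on langchain's multimodal content format; the 3-chars/token and 1000-tokens/image constants are the ones stated in the PR:

```python
from dataclasses import dataclass, field

@dataclass
class Msg:
    content: object  # str, or a list of content blocks (dicts)
    additional_kwargs: dict = field(default_factory=dict)

def count_tokens(m: Msg) -> int:
    cached = m.additional_kwargs.get("dimos_tokens")
    if cached is not None:
        return cached  # memoized: only new messages pay the recompute cost
    if isinstance(m.content, str):
        n = len(m.content) // 3          # pessimistic ~3 chars/token
    else:
        n = 0
        for block in m.content:
            if block.get("type") == "image_url":
                n += 1000                # flat pessimistic cost per image
            else:
                n += len(str(block.get("text", ""))) // 3
    m.additional_kwargs["dimos_tokens"] = n
    return n

msg = Msg([{"type": "text", "text": "x" * 30}, {"type": "image_url", "image_url": "..."}])
print(count_tokens(msg))  # 1010
```

Because the cache lives in `additional_kwargs`, swapping in a real tokenizer later only changes the miss path; callers are untouched.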

Modified: dimos/agents/mcp/mcp_client.py

  • Config: four new fields on McpClientConfig reading the env vars in
    the table above. _env_int / _env_str helpers loaded via pydantic
    Field(default_factory=...).
  • Turn tagging: new _turn: int counter on McpClient (incremented at
    the top of _process_message), and a new module-level
    _tag_turn(message, turn) helper that stamps
    additional_kwargs["dimos_turn"]. Every message flowing through a turn
    gets stamped — the incoming HumanMessage first, then every message
    emitted by the state graph.
  • History sync: new _apply_messages_update method that mirrors
    langgraph's add_messages reducer semantics locally — honors
    RemoveMessage(id=REMOVE_ALL_MESSAGES) as "wipe history, use what
    follows" and specific-id RemoveMessage as targeted removal. This keeps
    McpClient._history in sync with the graph's internal state even when the
    middleware replaces the entire message list.
  • Middleware wiring: in on_system_modules, construct the summarizer
    (either via init_chat_model(agent_compaction_model), or
    init_chat_model(model) if the agent's model is a string, or reuse the
    agent's model object), build the middleware with the system prompt and
    tool JSON schemas (t.args_schema.model_json_schema()), and pass it as
    create_agent(..., middleware=middleware).
  • Robustness in the stream loop: the worker thread now guards against
    middleware no-op updates that yield {node: None} instead of
    {node: {"messages": [...]}}, which would previously crash with
    'NoneType' object has no attribute 'get'.
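The reducer-mirroring behavior of `_apply_messages_update` can be approximated with plain dicts. The sentinel value and message shape here are stand-ins for the real langchain/langgraph types (`RemoveMessage`, `REMOVE_ALL_MESSAGES`); only the semantics — "wipe then rebuild" vs. targeted removal vs. append — follow the PR's description:

```python
REMOVE_ALL_MESSAGES = "__remove_all__"  # stand-in for langgraph's sentinel

def apply_messages_update(history, update):
    out = list(history)
    for m in update:
        if m.get("remove_id") == REMOVE_ALL_MESSAGES:
            out = []                     # wipe: what follows becomes the history
        elif m.get("remove_id") is not None:
            out = [h for h in out if h["id"] != m["remove_id"]]  # targeted removal
        else:
            out.append(m)                # ordinary append, as add_messages does
    return out

hist = [{"id": "a", "text": "old"}, {"id": "b", "text": "older"}]
new = apply_messages_update(hist, [{"remove_id": REMOVE_ALL_MESSAGES},
                                   {"id": "s", "text": "[summary]"}])
print([m["id"] for m in new])  # ['s']
```

Keeping this logic identical to langgraph's reducer is what prevents `_history` from accreting pre-compaction state after the middleware replaces the whole list.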

Modified: .gitignore

Adds MUJOCO_LOG.TXT (MuJoCo runtime artifact written to the repo root on
every sim run; should never be committed).

Test plan

  • uv run pytest dimos/agents/test_compaction_middleware.py -v — 15/15
    pass.
  • uv run mypy dimos/agents/compaction_middleware.py dimos/agents/test_compaction_middleware.py dimos/agents/mcp/mcp_client.py — clean.
  • Live verification: dimos --simulation run unitree-go2-agentic with
    AGENT_COMPACTION_THRESHOLD=2000, drive the agent until the threshold
    is crossed, confirm a Compaction fired (summarize) log line appears
    and the next prompt sent to the LLM contains the summary
    SystemMessage instead of the older turns.

Known limitations

Documented in the module docstring as "Known limitations":

  1. Image stripping is destructive — see caveat under stage 1 above.
    Progressive disclosure with a content store is the right long-term answer.
  2. Summarizer transcript size is unbounded — a first-ever compaction on a
    very long session could exceed the summarizer model's own context window.
    Mitigation deferred to a follow-up (chunked summarization).
  3. @retry(on_exception=Exception) is intentionally broad because the
    summarizer is duck-typed; permanent errors cost up to 3 attempts + 1s of
    sleeps before propagating.
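To make the cost in limitation 3 concrete, here is a minimal sketch of a broad retry decorator with that shape — 3 attempts, brief sleeps, permanent errors still propagating. The real `@retry` decorator and its parameters are assumptions; this is not the project's implementation:

```python
import time

def retry(attempts: int = 3, delay: float = 0.5):
    def deco(fn):
        def wrapper(*args, **kwargs):
            for i in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if i == attempts - 1:
                        raise  # permanent errors still surface, just slowly
                    time.sleep(delay)
        return wrapper
    return deco

calls = {"n": 0}

@retry(attempts=3, delay=0.0)
def always_fails():
    calls["n"] += 1
    raise RuntimeError("summarizer down")

try:
    always_fails()
except RuntimeError:
    pass
print(calls["n"])  # 3 — a permanently broken summarizer costs all attempts
```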

Caps the prompt the agent sends to its LLM so the conversation history
never grows unbounded. Runs as a langchain AgentMiddleware via
create_agent(middleware=...), so the size bound becomes an invariant of
the agent loop — `before_model` fires before every model call, including
intra-turn re-invocations (model -> tool -> tool result -> model).

Two-stage compaction:
  1. Strip image content blocks from older messages (replace with a small
     text placeholder).
  2. If still over target, summarize older messages into a single
     SystemMessage and keep the most recent turns verbatim.

The current turn (latest dimos_turn group + any trailing untagged
messages, i.e. in-flight tool calls) is preserved untouched — never
compacted, never image-stripped.

Configuration via McpClientConfig fields, env-driven by default:
  AGENT_COMPACTION_THRESHOLD     trigger size           (default 40000)
  AGENT_COMPACTION_TARGET        size after compaction  (default 3000)
  AGENT_COMPACTION_SUMMARY_SIZE  generated summary size (default 1000)
  AGENT_COMPACTION_MODEL         optional separate summarizer model

Also includes:

- Per-message turn tagging via additional_kwargs["dimos_turn"], stamped
  in McpClient._process_message so compaction can group/score by turn.
- McpClient._history mirror updated to honor langgraph's add_messages
  reducer semantics (RemoveMessage(id=REMOVE_ALL_MESSAGES) sentinel) so
  the local history doesn't accrete pre-compaction state.
- Token counter is a pessimistic placeholder (3 chars/token,
  1000/image), memoized on each message for O(new-only) recompute cost.
  Designed to be swapped for a real tokenizer later without touching
  callers.
- 15 pytest cases (hermetic, no API key needed), including two
  integration tests that drive a real create_agent loop and prove
  compaction can fire mid-turn between a tool result and the next
  model call.

Defaults are intentionally conservative so the feature is on by default
without changing behavior for short sessions.
@greptile-apps
Contributor

greptile-apps Bot commented May 12, 2026

Greptile Summary

This PR adds a DimosCompactionMiddleware that caps the agent's prompt size before every LLM call, preventing unbounded history growth. It is wired into McpClient via create_agent(middleware=...) and operates in two stages: strip images from older turns, then summarize those turns into a single SystemMessage if still over budget.

  • New compaction_middleware.py: DimosCompactionMiddleware with memoized token counting, two-stage compaction, dimos_turn-aware boundary alignment to keep tool-call/tool-response pairs coherent, and 15 hermetic pytest cases covering unit behaviour, integration with a real agent graph, re-compaction folding, and failure propagation.
  • mcp_client.py updates: Four new env-driven McpClientConfig fields, a _turn counter that stamps every message with a dimos_turn tag, a new _apply_messages_update method that mirrors LangGraph's add_messages reducer locally and suppresses duplicate publishes after a compaction wipe, and a guard in the stream loop for middleware no-op updates.
  • .gitignore: Adds MUJOCO_LOG.TXT to exclude the MuJoCo runtime artifact.

Confidence Score: 5/5

Safe to merge; the compaction logic is well-reasoned, hermetically tested, and defaults are conservative enough to opt out if something unexpected arises in production.

The core algorithm is correct and well-tested across 15 unit and integration cases. The _apply_messages_update history-sync logic handles the REMOVE_ALL_MESSAGES sentinel correctly and suppresses duplicate publishes. The only findings are an edge-case gap in untagged-message boundary alignment (normal agent flow is unaffected since all messages are tagged), an unused helper method, and a silent or-fallback for zero-value env vars. None of these affect the happy path.

No files require special attention; all findings are confined to edge cases and dead code.

Important Files Changed

Filename Overview
dimos/agents/compaction_middleware.py New middleware implementing two-stage prompt compaction (image strip, then summarize); algorithm is well-designed and thoroughly tested, with minor dead code and an edge-case gap in turn-boundary alignment for untagged messages.
dimos/agents/mcp/mcp_client.py Adds turn tagging, compaction middleware wiring, and _apply_messages_update for history sync; the or-fallback pattern in Field default factories silently swallows an explicit 0 env var value for the three integer config fields.
dimos/agents/test_compaction_middleware.py Comprehensive hermetic test suite covering unit behaviour, integration with a real create_agent graph, tool-call coherence, re-compaction folding, and failure propagation; no issues found.
.gitignore Adds MUJOCO_LOG.TXT to prevent the MuJoCo runtime log from being committed; trivial, correct change.

Sequence Diagram

sequenceDiagram
    participant U as User
    participant MC as McpClient
    participant G as LangGraph agent
    participant MW as DimosCompactionMiddleware
    participant LLM as Chat model
    participant H as _history

    U->>MC: HumanMessage
    MC->>MC: increment turn, tag message
    MC->>H: append and publish
    MC->>G: stream(history)

    loop each model call in the agent loop
        G->>MW: before_model(state)
        alt "total tokens <= threshold"
            MW-->>G: None (no-op)
        else stage 1 image strip suffices
            MW-->>G: RemoveMessage + stripped + current_turn
        else stage 2 summarize
            MW->>LLM: invoke transcript
            LLM-->>MW: summary text
            MW-->>G: RemoveMessage + protected + SummaryMsg + keep + current_turn
        end
        G->>LLM: invoke compacted messages
        LLM-->>G: AIMessage
        G-->>MC: stream update
        MC->>MC: _apply_messages_update
        MC->>H: rebuild history, publish new messages only
    end


Comment thread dimos/agents/mcp/mcp_client.py Outdated
Comment on lines +49 to +51
def _env_int(name: str) -> int | None:
    v = os.environ.get(name)
    return int(v) if v else None

P2 _env_int calls int(v) without a try/except, so a non-numeric value like AGENT_COMPACTION_THRESHOLD=abc raises a bare ValueError deep inside pydantic's default_factory during config construction, producing an unhelpful traceback with no mention of which env var is at fault.

Suggested change

def _env_int(name: str) -> int | None:
    v = os.environ.get(name)
    if not v:
        return None
    try:
        return int(v)
    except ValueError:
        raise ValueError(f"Environment variable {name!r} must be an integer, got {v!r}") from None

Comment thread dimos/agents/compaction_middleware.py
- McpClient._apply_messages_update: dedupe publish on compaction replay.
  When the middleware emits [RemoveMessage, protected..., summary,
  keep..., current_turn...], the protected/keep/current messages are the
  same Python objects that were already published when they first arrived.
  Skip publish+print for any iter_msg whose id() was in the pre-wipe
  history; only the genuinely-new summary (and later AIMessages from the
  agent node in subsequent stream updates) get republished. Identified by
  Greptile P1.

- McpClient._env_int: re-raise a labeled ValueError when the env var
  value isn't a valid integer, so misconfiguration surfaces with the
  offending name instead of a bare pydantic traceback. Identified by
  Greptile P2.

- DimosCompactionMiddleware._static_tokens: drop the per-call hash
  computation. Inputs (system_prompt, tool_schemas) are bound at
  __init__ and never mutate, so a simple None-check on the cache is
  sufficient. Identified by Greptile P2.
Development

Successfully merging this pull request may close these issues.

Agent Compaction