Skip to content

Fix/issue 817 idle timeout log level#824

Open
lutz-grex wants to merge 2 commits intomodelcontextprotocol:mainfrom
lutz-grex:fix/issue-817-idle-timeout-log-level
Open

Fix/issue 817 idle timeout log level#824
lutz-grex wants to merge 2 commits intomodelcontextprotocol:mainfrom
lutz-grex:fix/issue-817-idle-timeout-log-level

Conversation

@lutz-grex
Copy link
Copy Markdown
Contributor

Idle keep-alive timeout is normal zombie-session cleanup, not a transport failure. This PR stops it from producing ERROR-level log lines that are indistinguishable from real transport fatals.

Motivation and Context

When LocalSessionWorker hits its configured keep-alive timeout (default 5 min), it returns WorkerQuitReason::Fatal, which the transport worker logs at tracing::error!. The SessionConfig::keep_alive docstring explicitly
frames this as a safety net for silently-dropped HTTP connections — normal lifecycle cleanup, not an error condition.

This causes every idle reap to emit:

ERROR rmcp::transport::worker: worker quit with fatal: keep alive timeout after 300000ms, when poll next session event

Operators who wire ERROR-level alerts get paged on something that is normal by design. Downstream consumers cannot filter idle reaps without also silencing real transport fatals.

Additionally, the post-exit cleanup path in close_session could fail with SessionServiceTerminated when the worker had already exited due to idle timeout, producing a second spurious error.

Closes #817

How Has This Been Tested?

  • Added an integration test (test_keep_alive_timeout_does_not_emit_error_log) that spins up a Streamable HTTP server with a 200ms keep-alive, connects a client, waits for idle reap, and asserts: no ERROR-level logs are
    emitted, and a DEBUG log with IdleTimeout is present.
  • Added a second integration test (test_explicit_close_on_live_session_succeeds) that verifies close_session on a still-alive worker succeeds without error.
  • Existing tests pass locally (cargo test).

Breaking Changes

None. WorkerQuitReason is #[non_exhaustive], so the new IdleTimeout variant is additive. The removed LocalSessionWorkerError::KeepAliveTimeout variant was not part of the public API.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Documentation
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have added or updated documentation as needed

Additional context

This implements Option A from #817: a dedicated WorkerQuitReason::IdleTimeout(Duration) variant that is logged at debug level alongside the other expected-exit arms (Cancelled, TransportClosed, HandlerTerminated). True
transport failures continue to flow through Fatal and keep their ERROR severity.

Changes across two commits:

  1. d41d3c8 — Add WorkerQuitReason::IdleTimeout, return it from the keep-alive arm, log at debug. Remove unused KeepAliveTimeout error variant.
  2. 2463369 — Swallow SessionServiceTerminated in close_session when the worker has already exited, preventing a spurious error during post-exit cleanup.

Idle keep-alive timeout is normal zombie-session cleanup, not a transport failure.

Route it through a dedicated WorkerQuitReason::IdleTimeout variant.

Log it at debug level instead of treating it as a fatal error.

Remove the unused LocalSessionWorkerError::KeepAliveTimeout variant.

Closes modelcontextprotocol#817
Swallow SessionServiceTerminated in close_session when the worker has already exited.
This prevents a spurious ERROR log during the post-exit cleanup path in
spawn_session_worker.
@lutz-grex lutz-grex requested a review from a team as a code owner April 30, 2026 12:18
@github-actions github-actions Bot added T-test Testing related changes T-core Core library changes T-transport Transport layer changes labels Apr 30, 2026
Comment on lines -931 to -932
#[error("keep alive timeout after {}ms", _0.as_millis())]
KeepAliveTimeout(Duration),
Copy link
Copy Markdown
Member

@DaleSeo DaleSeo May 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removed LocalSessionWorkerError::KeepAliveTimeout variant was not part of the public API.

I think it's reachable via fully pub module chain:
rmcp::transport::streamable_http_server::session::local::LocalSessionWorkerError::KeepAliveTimeout

Could we keep the variant with #[deprecated] to avoid a breaking change?

TransportClosed,
#[error("Handler terminated")]
HandlerTerminated,
#[error("Worker idle timeout ({}ms)", _0.as_millis())]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[error("Worker idle timeout ({}ms)", _0.as_millis())]
#[error("Worker idle timeout after {}ms", _0.as_millis())]


let error_events: Vec<_> = captured
.iter()
.filter(|e| e.level == tracing::Level::ERROR)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we also filter errors by target? Any unrelated subsystem that logs at ERROR would flake this test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

T-core Core library changes T-test Testing related changes T-transport Transport layer changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

LocalSessionWorker idle keep-alive timeout emits tracing::error! for normal zombie-session reaping

2 participants