Cap reconnect attempts in PeerIceModule (default 5) #86
juerghegglin wants to merge 1 commit into FAForever:master from
Conversation
PeerIceModule.onConnectionLost() previously scheduled a new initiateIce()
on every failure with no upper bound on attempts. When a peer truly
disappears (their machine dies, FA closes, or a P2P route in the mesh
breaks unrecoverably), the loop runs until local FA shutdown: every
~11 seconds, gather STUN+TURN candidates, send them via RPC, wait 6s for
a reply that never comes, repeat.
Real-world data from a deadlocked 16-player session: one peer drop led
to 112 ICE re-initiations across 20 minutes for a single peer that was
never coming back. During that whole window the local game generates
state-change RPC notifications, telemetry events, and STUN/TURN load
that contributes nothing to recovery. In a P2P mesh, this can also
contribute CPU pressure that delays echo replies on healthy peers,
inducing further false-positive timeouts on otherwise-good links.
This change adds a counter for consecutive failed reconnect attempts (see the sketch after this list):
- reset to 0 in startIce() when CONNECTED is reached;
- reset to 0 in onConnectionLost() when the previous state was
CONNECTED (a fresh disconnect deserves a fresh retry budget);
- incremented on every retry attempt;
- once the counter exceeds MAX_RECONNECT_ATTEMPTS (default 5),
the retry loop stops and the peer stays in DISCONNECTED.
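For orientation, here is a minimal self-contained sketch of that counter lifecycle. The symbol names (consecutiveFailedAttempts, MAX_RECONNECT_ATTEMPTS, previousState, CONNECTED) are taken from the review comment quoted later in this thread; the class body itself is illustrative, not the verbatim patch.

```java
// Illustrative sketch only: mirrors the counter semantics described above,
// not the actual PeerIceModule diff.
class RetryPolicySketch {
    enum IceState { NEW, CONNECTED, DISCONNECTED }

    static final int MAX_RECONNECT_ATTEMPTS = 5;

    private int consecutiveFailedAttempts = 0;
    private IceState iceState = IceState.NEW;

    // Candidate gathering/exchange elided; reaching CONNECTED clears the budget.
    void startIce() {
        iceState = IceState.CONNECTED;
        consecutiveFailedAttempts = 0;
    }

    // Returns true when another initiateIce() retry would be scheduled, false
    // once the cap is exhausted. (The boolean return is for illustration only;
    // the real method is presumably void.)
    boolean onConnectionLost() {
        IceState previousState = iceState;
        iceState = IceState.DISCONNECTED;

        if (previousState == IceState.CONNECTED) {
            // A fresh disconnect deserves a fresh retry budget.
            consecutiveFailedAttempts = 0;
        }

        consecutiveFailedAttempts++;
        if (consecutiveFailedAttempts > MAX_RECONNECT_ATTEMPTS) {
            return false; // give up: the peer stays DISCONNECTED
        }
        return true; // here the real code schedules the next initiateIce()
    }
}
```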
Recovery path is unchanged: the FAF client's manual reconnect button
(PR FAForever#61) calls GameSession.reconnectToPeer(), which creates a fresh
Peer and a fresh PeerIceModule (counter starts at 0 again), so the user
retains explicit control. Answerer-mode peers also re-engage normally
when they receive a new offer via RPC.
5 attempts at ~11s each gives roughly 55s of automatic retry before
giving up: generous enough for transient flaps (logs from a separate
session show a peer flapping with 3 successful reconnects in 26s, well
within budget), short enough that a truly-gone peer doesn't burn 20
minutes of background work.
📝 Walkthrough
The pull request introduces a capped ICE reconnection policy in PeerIceModule.
Changes: ICE Reconnection Cap
🧹 Nitpick comments (1)
ice-adapter/src/main/java/com/faforever/iceadapter/ice/PeerIceModule.java (1)
451-464: 🏗️ Heavy lift: Add automated boundary tests for retry cap/reset behavior.
This async branch is easy to regress; please add tests for: (1) retries 1..5 scheduled, (2) retry 6 blocked, (3) counter reset after reconnect success, and (4) immediate-vs-delayed retry timing by previous state.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ice-adapter/src/main/java/com/faforever/iceadapter/ice/PeerIceModule.java` around lines 451 - 464, Add automated unit tests in the PeerIceModule test suite covering retry cap and reset: simulate peer.isLocalOffer() true and assert consecutiveFailedAttempts increments and that initiateIce is scheduled for attempts 1..MAX_RECONNECT_ATTEMPTS (use a controllable executor or mock IceAdapter.getExecutor()), assert that attempt MAX_RECONNECT_ATTEMPTS+1 does not schedule initiateIce (blocked), verify that a successful reconnect resets consecutiveFailedAttempts to 0, and verify delayMs is 0 when previousState == CONNECTED and 5000 otherwise by asserting which delayedExecutor delay is used; target symbols: PeerIceModule, consecutiveFailedAttempts, MAX_RECONNECT_ATTEMPTS, initiateIce, previousState, CONNECTED.
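For illustration, boundary tests along the lines suggested above could look like the following, written against the retry-policy sketch shown earlier in this thread. JUnit Jupiter is assumed; these are not the author's branch tests, and exercising the real PeerIceModule would additionally require mocking its collaborators (e.g. a controllable executor).

```java
import static org.junit.jupiter.api.Assertions.assertFalse;
import static org.junit.jupiter.api.Assertions.assertTrue;

import org.junit.jupiter.api.Test;

class RetryPolicySketchTest {

    @Test
    void attemptsOneThroughFiveScheduleRetries_sixthIsBlocked() {
        RetryPolicySketch m = new RetryPolicySketch();
        for (int attempt = 1; attempt <= 5; attempt++) {
            assertTrue(m.onConnectionLost(), "attempt " + attempt + " should schedule a retry");
        }
        assertFalse(m.onConnectionLost(), "attempt 6 must be blocked by the cap");
    }

    @Test
    void reachingConnectedResetsTheBudget() {
        RetryPolicySketch m = new RetryPolicySketch();
        for (int i = 0; i < 5; i++) m.onConnectionLost(); // exhaust the budget
        m.startIce();                                     // reaches CONNECTED, counter back to 0
        assertTrue(m.onConnectionLost(), "fresh budget after a successful reconnect");
    }
}
```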
Acknowledging the suggestion and proposing to defer it.
I'd rather propose JUnit onboarding separately so maintainers can accept (or reject) the test framework on its own merits, independent of accepting this fix. The boundary tests are written and passing locally on a separate branch — happy to open a follow-up PR introducing JUnit Jupiter + these tests (plus equivalent regression tests for #84/#85) if there's appetite for unit-testing infrastructure in this module. In the meantime the change is small enough to review by inspection: the entire retry-decision block fits on one screen and the counter semantics are isolated to the existing `onConnectionLost()` path.
Problem
`PeerIceModule.onConnectionLost()` schedules a new `initiateIce()` on every failure with no upper bound on attempts. When a peer truly disappears mid-game — their machine crashes, FA closes, or a P2P route in the mesh breaks in a way that won't recover — the retry loop runs until local FA shutdown.

Real-world evidence

A deadlocked 16-player session: a single peer drop produced 112 ICE re-initiations across 20 minutes for one peer that never returned. The local FA's lockstep took ~10 s to mark the peer defeated and continue (per game-engine state in `JsonStats`), but the ice-adapter spent the next ~20 minutes retrying into the void:

- state notifications (`onIceConnectionStateChanged`: gathering / awaitingCandidates / disconnected) sent to the FAF client

A separate teammate's log from a different session shows a peer flapping with 3 successful reconnects in 26 seconds — those recoveries happen well within a 5-attempt budget, so this cap does not regress the transient-flap recovery case that the existing retry loop is designed to handle.
Change
Adds a counter for consecutive failed reconnect attempts:
- Reset to 0 in `startIce()` when `CONNECTED` is reached. A successful connection clears the budget.
- Reset to 0 in `onConnectionLost()` when the previous state was `CONNECTED` (a fresh disconnect deserves a fresh retry budget).
- Once `attempt > MAX_RECONNECT_ATTEMPTS` (default 5), `onConnectionLost()` returns without scheduling a new `initiateIce()`. State stays `DISCONNECTED`.

Total automatic-retry budget: about 55 seconds (5 attempts × ~11 s each). The immediate-vs-delayed scheduling under the cap is sketched below.
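That scheduling could look roughly like this. The 0 ms / 5000 ms values come from the review comment above; the helper shape is an assumption, not the actual diff.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class RetryScheduleSketch {
    // Retry immediately after a fresh disconnect (previous state CONNECTED),
    // otherwise back off 5 s before re-running initiateIce(). Values per the
    // review comment above; the wiring here is illustrative.
    static void scheduleRetry(boolean previousStateWasConnected, Runnable initiateIce) {
        long delayMs = previousStateWasConnected ? 0 : 5000;
        CompletableFuture.runAsync(
                initiateIce,
                CompletableFuture.delayedExecutor(delayMs, TimeUnit.MILLISECONDS));
    }
}
```

A 5 s backoff plus the ~6 s wait for an answer lines up with the ~11 s per-attempt cycle described in the problem statement.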
Recovery semantics (unchanged)
- The FAF client's manual reconnect button (#61) calls `GameSession.reconnectToPeer()`, which disconnects and re-`connectToPeer()`s. That creates a fresh `Peer` and a fresh `PeerIceModule` — counter starts at 0 again. Users retain explicit control to retry; a schematic follows below.
- Answerer-mode peers still re-engage when a new offer arrives via `onIceMessageReceived()`, regardless of cap state — that path goes through `initiateIce()` directly, not the retry scheduler.
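A schematic of that recovery path, for orientation only: every name except `reconnectToPeer()`, `connectToPeer()`, `Peer`, and `PeerIceModule` is hypothetical, and the real signatures likely differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class GameSessionSketch {
    // Hypothetical stand-in for the real Peer/PeerIceModule wiring.
    record PeerSketch(String remoteLogin, boolean localOffer) {
        void close() { /* tear down the ICE agent; the exhausted counter goes with it */ }
    }

    private final Map<Integer, PeerSketch> peers = new ConcurrentHashMap<>();

    // Manual reconnect (FAF client button, #61): drop the stale Peer, then
    // connect again. The replacement PeerIceModule starts with its counter at
    // 0, so the full automatic-retry budget is available once more.
    void reconnectToPeer(int remotePlayerId) {
        PeerSketch stale = peers.remove(remotePlayerId);
        if (stale == null) {
            return;
        }
        stale.close();
        connectToPeer(stale.remoteLogin(), remotePlayerId, stale.localOffer());
    }

    void connectToPeer(String remoteLogin, int remotePlayerId, boolean localOffer) {
        peers.put(remotePlayerId, new PeerSketch(remoteLogin, localOffer));
    }
}
```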
Risk

The `MAX_RECONNECT_ATTEMPTS = 5` constant is conservative. Could be made configurable via `IceOptions` in a follow-up if maintainers want tournament-mode tuning. Kept hardcoded here for minimal scope.

Verification
- `:ice-adapter:check` (compile + spotlessCheck) passes after `spotlessApply`.
- `ice-adapter` has no JUnit infrastructure yet, and bringing it in alongside this change would expand scope. Happy to follow up with a regression test (and the JUnit infrastructure for it) if maintainers want.

Co-authored with Claude (Anthropic); reviewed and locally verified by me.