Cap reconnect attempts in PeerIceModule (default 5)#86

Open
juerghegglin wants to merge 1 commit into FAForever:master from juerghegglin:fix/cap-reconnect-attempts

Conversation

juerghegglin commented May 6, 2026

Problem

PeerIceModule.onConnectionLost() schedules a new initiateIce() on every failure with no upper bound on attempts. When a peer truly disappears mid-game — their machine crashes, FA closes, or a P2P route in the mesh breaks in a way that won't recover — the retry loop runs until local FA shutdown:

  • every ~11 seconds: gather STUN + TURN candidates, send them via RPC to the lobby server, wait 6 s for the peer's response, fail, schedule another attempt
  • in a 16-player game with one bad link, this is one P2P session burning CPU/network/log indefinitely while the simulation has long since moved past

Real-world evidence

A deadlocked 16-player session: a single peer drop produced 112 ICE re-initiations across 20 minutes for one peer that never returned. The local FA's lockstep took ~10 s to mark the peer defeated and continue (per game-engine state in JsonStats), but the ice-adapter spent the next ~20 minutes retrying into the void:

  • 100+ state-change notifications (onIceConnectionStateChanged: gathering / awaitingCandidates / disconnected) sent to the FAF client
  • Continuous STUN/TURN harvesting (5 servers per attempt × 112 attempts = ~560 round trips)
  • Telemetry websocket churn coupled to each state change

A separate teammate's log from a different session shows a peer flapping with 3 successful reconnects in 26 seconds — those recoveries happen well within a 5-attempt budget, so this cap does not regress the transient-flap recovery case that the existing retry loop is designed to handle.

Change

Adds a counter for consecutive failed reconnect attempts:

  • Reset to 0 in startIce() when CONNECTED is reached. A successful connection clears the budget.
  • Reset to 0 in onConnectionLost() when the previous state was CONNECTED (a fresh disconnect deserves a fresh retry budget).
  • Incremented on every retry attempt.
  • Once the counter exceeds MAX_RECONNECT_ATTEMPTS (default 5), onConnectionLost() returns without scheduling a new initiateIce(); the state stays DISCONNECTED.

Total automatic-retry budget: about 55 seconds (5 attempts × ~11 s each).
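The counter semantics described above can be sketched as a standalone model (class, field, and method names are assumptions mirroring this PR description, not the actual PeerIceModule code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal model of the capped-retry decision described above.
class ReconnectBudget {
    static final int MAX_RECONNECT_ATTEMPTS = 5;
    private final AtomicInteger consecutiveFailedAttempts = new AtomicInteger();

    // startIce() reaching CONNECTED clears the budget.
    void onConnected() {
        consecutiveFailedAttempts.set(0);
    }

    // A fresh disconnect (previous state was CONNECTED) also resets it.
    void onFreshDisconnect() {
        consecutiveFailedAttempts.set(0);
    }

    // Called once per retry attempt; true while another initiateIce()
    // should still be scheduled.
    boolean shouldScheduleRetry() {
        return consecutiveFailedAttempts.incrementAndGet() <= MAX_RECONNECT_ATTEMPTS;
    }
}
```

Under this model, attempts 1 through 5 schedule a retry and attempt 6 is blocked until either a successful connection or a fresh disconnect resets the counter.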

Recovery semantics (unchanged)

  • The FAF client's manual reconnect button (#61) calls GameSession.reconnectToPeer(), which disconnects and then calls connectToPeer() again. That creates a fresh Peer and a fresh PeerIceModule, so the counter starts at 0 and users retain explicit control to retry.
  • Answerer-mode peers continue to react to new offers received via onIceMessageReceived() regardless of cap state — that path goes through initiateIce() directly, not the retry scheduler.

Risk

  • The MAX_RECONNECT_ATTEMPTS = 5 constant is conservative. Could be made configurable via IceOptions in a follow-up if maintainers want tournament-mode tuning. Kept hardcoded here for minimal scope.
  • Behavior is unchanged for any peer that successfully reconnects within 5 attempts (which is every recovering peer in the available real-world logs).
  • Behavior changes for peers that never recover: the adapter now stops retrying after ~55 s instead of looping forever.

Verification

  • Local :ice-adapter:check (compile + spotlessCheck) passes after spotlessApply.
  • No JUnit tests added — ice-adapter has no JUnit infrastructure yet, and bringing it in alongside this change would expand scope. Happy to follow up with a regression test (and the JUnit infrastructure for it) if maintainers want.

Co-authored with Claude (Anthropic); reviewed and locally verified by me.

Summary by CodeRabbit

Bug Fixes

  • Implemented a limit of 5 consecutive reconnection attempts to prevent infinite connection loops and improve system stability.
  • Enhanced connection recovery with intelligent retry delays and improved connection state tracking for better resilience during network disruptions.
  • Added notifications to inform users when maximum reconnection attempts are reached.

PeerIceModule.onConnectionLost() previously scheduled a new initiateIce()
on every failure with no upper bound on attempts. When a peer truly
disappears (their machine dies, FA closes, or a P2P route in the mesh
breaks unrecoverably), the loop runs until the local FA shuts down: every
~11 seconds, gather STUN+TURN candidates, send them via RPC, wait 6s for
a reply that never comes, repeat.

Real-world data from a deadlocked 16-player session: one peer drop led
to 112 ICE re-initiations across 20 minutes for a single peer that was
never coming back. During that whole window the local game generates
state-change RPC notifications, telemetry events, and STUN/TURN load
that contributes nothing to recovery. In a P2P mesh, this can also
contribute CPU pressure that delays echo replies on healthy peers,
inducing further false-positive timeouts on otherwise-good links.

This change adds a counter for consecutive failed reconnect attempts:
  - reset to 0 in startIce() when CONNECTED is reached;
  - reset to 0 in onConnectionLost() when the previous state was
    CONNECTED (a fresh disconnect deserves a fresh retry budget);
  - incremented on every retry attempt;
  - once the counter exceeds MAX_RECONNECT_ATTEMPTS (default 5),
    the retry loop stops and the peer stays in DISCONNECTED.

Recovery path is unchanged: the FAF client's manual reconnect button
(PR FAForever#61) calls GameSession.reconnectToPeer(), which creates a fresh
Peer and a fresh PeerIceModule (counter starts at 0 again), so the user
retains explicit control. Answerer-mode peers also re-engage normally
when they receive a new offer via RPC.

5 attempts at ~11s each gives roughly 55s of automatic retry before
giving up: generous enough for transient flaps (logs from a separate
session show a peer flapping with 3 successful reconnects in 26s, well
within budget), short enough that a truly-gone peer doesn't burn 20
minutes of background work.
coderabbitai Bot commented May 6, 2026

📝 Walkthrough

Walkthrough

The pull request introduces a capped ICE reconnection policy in PeerIceModule that limits consecutive reconnection attempts to 5 and applies strategic delays before reattempts based on prior connection state.

Changes

ICE Reconnection Cap

All changes are in ice-adapter/src/main/java/com/faforever/iceadapter/ice/PeerIceModule.java:

  • State & Constants: MAX_RECONNECT_ATTEMPTS constant set to 5, and a consecutiveFailedAttempts AtomicInteger field added to track reconnection attempts.
  • Connection Establishment: when a connection is established, consecutiveFailedAttempts is reset to 0.
  • Reconnection Logic: enhanced onConnectionLost() increments the attempt counter, enforces the cap, logs a warning when the max is reached, and schedules reattempts with conditional delays (0 ms if previously connected, 5 seconds otherwise).
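The conditional delay mentioned in the reconnection-logic summary could be modeled as follows (the 0 ms / 5 s values come from this review summary and should be treated as assumptions about the patch, not verified constants):

```java
// Sketch of the conditional retry delay from the summary above.
// Assumption: immediate retry after a drop from CONNECTED, 5 s otherwise.
class RetryDelay {
    static long delayMillis(boolean previouslyConnected) {
        // A peer that just fell out of CONNECTED gets an immediate retry;
        // a peer that never stabilized waits before the next attempt.
        return previouslyConnected ? 0L : 5_000L;
    }
}
```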

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes



coderabbitai Bot left a comment
🧹 Nitpick comments (1)
ice-adapter/src/main/java/com/faforever/iceadapter/ice/PeerIceModule.java (1)

451-464: 🏗️ Heavy lift

Add automated boundary tests for retry cap/reset behavior.

This async branch is easy to regress; please add tests for: (1) retries 1..5 scheduled, (2) retry 6 blocked, (3) counter reset after reconnect success, and (4) immediate-vs-delayed retry timing by previous state.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ice-adapter/src/main/java/com/faforever/iceadapter/ice/PeerIceModule.java`
around lines 451 - 464, Add automated unit tests in the PeerIceModule test suite
covering retry cap and reset: simulate peer.isLocalOffer() true and assert
consecutiveFailedAttempts increments and that initiateIce is scheduled for
attempts 1..MAX_RECONNECT_ATTEMPTS (use a controllable executor or mock
IceAdapter.getExecutor()), assert that attempt MAX_RECONNECT_ATTEMPTS+1 does not
schedule initiateIce (blocked), verify that a successful reconnect resets
consecutiveFailedAttempts to 0, and verify delayMs is 0 when previousState ==
CONNECTED and 5000 otherwise by asserting which delayedExecutor delay is used;
target symbols: PeerIceModule, consecutiveFailedAttempts,
MAX_RECONNECT_ATTEMPTS, initiateIce, previousState, CONNECTED.


📥 Commits

Reviewing files that changed from the base of the PR and between d49f9fe and f08236f.

📒 Files selected for processing (1)
  • ice-adapter/src/main/java/com/faforever/iceadapter/ice/PeerIceModule.java

juerghegglin (Author) commented

Acknowledging the suggestion and proposing to defer it.

ice-adapter currently has no JUnit infrastructure: no junit-jupiter testImplementation, no useJUnitPlatform() in build.gradle, and src/test/java contains only the legacy main()-style integration helpers (IceTest.java, SignallingServer.java). Adding the proposed boundary tests would also require adding the test framework, which expands this PR from a focused 1-file behavior change into a project-wide infrastructure addition.

I'd rather propose JUnit onboarding separately so maintainers can accept (or reject) the test framework on its own merits, independent of accepting this fix. The boundary tests are written and passing locally on a separate branch — happy to open a follow-up PR introducing JUnit Jupiter + these tests (plus equivalent regression tests for #84/#85) if there's appetite for unit-testing infrastructure in this module.

In the meantime the change is small enough to review by inspection: the entire retry-decision block fits on one screen and the counter semantics are isolated to the existing lockLostConnection critical section.
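For reference, the cap/reset boundary conditions the reviewer asks about can be exercised without JUnit, in the same main()-style convention as the existing test helpers. This is a sketch that models only the counter semantics, not the real PeerIceModule:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Framework-free boundary check in the module's existing main()-style
// convention. Names and the increment-then-compare shape are assumptions.
class ReconnectCapCheck {
    static final int MAX_RECONNECT_ATTEMPTS = 5;

    public static void main(String[] args) {
        AtomicInteger attempts = new AtomicInteger();

        // Attempts 1..5 must be allowed.
        for (int i = 1; i <= MAX_RECONNECT_ATTEMPTS; i++) {
            if (attempts.incrementAndGet() > MAX_RECONNECT_ATTEMPTS) {
                throw new AssertionError("attempt " + i + " wrongly blocked");
            }
        }
        // Attempt 6 must be blocked.
        if (attempts.incrementAndGet() <= MAX_RECONNECT_ATTEMPTS) {
            throw new AssertionError("attempt 6 should be blocked");
        }
        // A successful reconnect resets the budget.
        attempts.set(0);
        if (attempts.incrementAndGet() > MAX_RECONNECT_ATTEMPTS) {
            throw new AssertionError("retry after reset should be allowed");
        }
        System.out.println("reconnect-cap boundary check passed");
    }
}
```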
