Skip to content

SOLR-18174 Fix test race condition failure in AsyncTrackerSemaphoreLeakTest#4331

Open
janhoy wants to merge 1 commit intoapache:mainfrom
janhoy:fix/async-tracker-semaphore-test-race
Open

SOLR-18174 Fix test race condition failure in AsyncTrackerSemaphoreLeakTest#4331
janhoy wants to merge 1 commit intoapache:mainfrom
janhoy:fix/async-tracker-semaphore-test-race

Conversation

@janhoy
Copy link
Copy Markdown
Contributor

@janhoy janhoy commented Apr 24, 2026

https://issues.apache.org/jira/browse/SOLR-18174 / #4236

Problem description

A seldom test failure in CI:

java.lang.AssertionError: All permits should be restored after retries complete expected:<40> but was:<39>               
at __randomizedtesting.SeedInfo.seed([7610D4B8595A5B28:A5B02F53333203AA]:0)                                              
at org.junit.Assert.fail(Assert.java:89)                                                                                 
at org.junit.Assert.failNotEquals(Assert.java:835)                                                                       
at org.junit.Assert.assertEquals(Assert.java:647)                                                                        
at org.apache.solr.handler.component.AsyncTrackerSemaphoreLeakTest.testSemaphoreLeakOnLBRetry(AsyncTrackerSemaphoreLeakT 
est.java:209)                                                                                                            

Root cause analysis by Claude:

A thread race between two concurrent paths after fakeServer.rstAll():

  • Executor thread: apiFuture is completed (from onHeaders dispatching to executor, or from onFailure dispatching
    future.completeExceptionally). allOf(futures).get() waits on this.
  • Jetty IO thread: asyncTracker.completeListener.onComplete() → available.release(). This fires after the response
    listeners (onHeaders/onFailure), meaning it can lag behind the executor thread.

allOf(futures).get() can return with all futures done while 1 (or more) completeListeners on the IO thread haven't had
a chance to call available.release() yet. The permits ARE eventually restored — the test just measured too eagerly.

The fix

After allOf().get(), poll the permit count for up to 5 seconds (in 10ms increments) to let the IO thread catch up
before asserting. This makes the test robust without modifying any production code.

This fix needs back-porting to branch_10x and branch_9x, which also experienced these failures

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adjusts AsyncTrackerSemaphoreLeakTest to avoid a race where request futures complete before Jetty’s IO-thread onComplete callback releases semaphore permits, causing a transient (but test-failing) permit deficit.

Changes:

  • After CompletableFuture.allOf(...).get(), polls asyncTrackerAvailablePermits() for up to 5 seconds to allow IO-thread releases to catch up before asserting.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants