[fix](be) Fix exchange receiver dependency race #62777

Merged
yiguolei merged 1 commit into apache:master from mrhhsg:fix_exchange on Apr 28, 2026
Conversation

@mrhhsg
Member

@mrhhsg mrhhsg commented Apr 24, 2026

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Data or EOS could arrive after the receiver is registered but before its source dependency is installed. In that window, ready notification was lost and the exchange source could remain blocked. Recheck the queue state when setting the dependency and protect channel turn-off checks with the instance mutex.
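The race and the recheck described above can be illustrated with a minimal sketch. This is not the Doris implementation: `DependencySketch` and `SenderQueueSketch` are hypothetical stand-ins, the member names mirror those mentioned in the review (`_lock`, `_block_queue`, `_num_remaining_senders`, `_is_cancelled`, `_source_dependency`), and treating cancellation as a ready condition is an assumption.

```cpp
#include <cassert>
#include <deque>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a pipeline dependency; not the Doris class.
struct DependencySketch {
    bool ready = false;
    void set_ready() { ready = true; }
};

// Simplified, hypothetical sender queue illustrating "recheck on install".
class SenderQueueSketch {
public:
    explicit SenderQueueSketch(int num_senders) : _num_remaining_senders(num_senders) {}

    // Data may arrive before the dependency is installed; in that case
    // there is nothing to notify yet, so the notification would be lost
    // unless set_dependency() rechecks the state.
    void add_block(std::string block) {
        std::lock_guard<std::mutex> guard(_lock);
        _block_queue.push_back(std::move(block));
        if (_source_dependency) _source_dependency->set_ready();
    }

    // One sender reports EOS; the last EOS makes the source ready.
    void decrement_senders() {
        std::lock_guard<std::mutex> guard(_lock);
        if (_num_remaining_senders > 0) --_num_remaining_senders;
        if (_num_remaining_senders == 0 && _source_dependency) {
            _source_dependency->set_ready();
        }
    }

    // Install the dependency and recheck the queue state under the same lock,
    // so data or EOS that arrived earlier still marks the dependency ready.
    void set_dependency(std::shared_ptr<DependencySketch> dep) {
        assert(dep != nullptr);
        std::lock_guard<std::mutex> guard(_lock);
        _source_dependency = std::move(dep);
        // Assumption: cancellation also unblocks the source.
        if (!_block_queue.empty() || _num_remaining_senders == 0 || _is_cancelled) {
            _source_dependency->set_ready();
        }
    }

private:
    std::mutex _lock;
    std::deque<std::string> _block_queue;
    int _num_remaining_senders;
    bool _is_cancelled = false;
    std::shared_ptr<DependencySketch> _source_dependency;
};

// Usage: EOS arrives before the dependency is installed.
inline void usage_example() {
    auto dep = std::make_shared<DependencySketch>();
    SenderQueueSketch queue(1);
    queue.decrement_senders();  // EOS first: no dependency to notify yet
    queue.set_dependency(dep);  // the recheck marks it ready
    assert(dep->ready);
}
```

Without the recheck in `set_dependency()`, the early EOS in `usage_example()` would leave `dep->ready` false and the source would stay blocked.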

Release note

Fix a potential query hang caused by an exchange receiver dependency ready-notification race.

Check List (For Author)

  • Test: Unit Test / Manual test
    • Unit Test: ./run-be-ut.sh --run --filter=DataStreamRecvrTest.TestEosBeforeSetDependency:DataStreamRecvrTest.TestDataBeforeSetDependencyWithRemainingSenders
    • Manual test: build-support/check-format.sh
    • Static analysis attempted: build-support/run-clang-tidy.sh --build-dir be/ut_build_ASAN failed because clang-tidy could not analyze the files due to environment/pre-existing diagnostics, including missing stddef.h from system/libstdc++ headers and an existing unmatched NOLINTEND in be/src/core/types.h
  • Behavior changed: Yes. Exchange source dependencies are now marked ready if queued data or EOS arrived before set_dependency().
  • Does this need documentation: No

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@mrhhsg
Member Author

mrhhsg commented Apr 24, 2026

/review

Contributor

@github-actions github-actions Bot left a comment

No blocking findings in the reviewed changes.

Critical Checkpoints

  • Goal and correctness: The PR fixes the lost ready-notification window between receiver registration and set_dependency(). SenderQueue::set_dependency() now rechecks queue and EOS state under _lock, and the two added unit tests cover both "data before dependency" and "EOS before dependency". I found no remaining correctness issue in the modified flow.
  • Scope: The change is small and limited to the receiver readiness race, the sink buffer mutex protection, and targeted tests.
  • Concurrency: The modified receiver path now uses SenderQueue::_lock consistently for _source_dependency, _block_queue, _num_remaining_senders, and _is_cancelled when deciding readiness. The sink buffer now reads rpc_channel_is_turn_off under instance_data.mutex, removing an unsynchronized access. I found no new lock-order or deadlock issue.
  • Lifecycle: No new special lifecycle or static-initialization risk. The existing create-and-register ordering window in VDataStreamMgr is the race being fixed.
  • Config and compatibility: No new config, protocol, symbol, or storage compatibility concern.
  • Parallel paths: I checked analogous dependency-based queue code and did not find another path with the same externally visible registration race in this PR scope.
  • Conditions and comments: The new recheck in set_dependency() is documented and matches the actual race window.
  • Test coverage: The new BE unit tests are appropriate for the receiver race. There is no dedicated test for the sink-buffer mutex move, but I did not find a behavioral regression there. I attempted to run ./run-be-ut.sh --run --filter=DataStreamRecvrTest.TestEosBeforeSetDependency:DataStreamRecvrTest.TestDataBeforeSetDependencyWithRemainingSenders in this runner, but the build failed before compilation because thirdparty/installed/bin/protoc is missing.
  • Observability: The added teardown log is acceptable and does not hide failures.
  • Transactions, persistence, data writes, and FE-to-BE variable passing: Not applicable.
  • Performance: No meaningful regression; the added locking work is negligible.
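The sink-buffer point in the Concurrency item can be sketched as follows. This is only an illustration under assumptions: `InstanceDataSketch`, `try_enqueue`, `turn_off_channel`, and `queued_requests` are hypothetical names, and only `rpc_channel_is_turn_off` and `instance_data.mutex` come from the review text. The idea shown is that the flag is read and written under the same mutex, so the check and the enqueue cannot interleave with a concurrent turn-off.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical per-instance state; not the Doris struct.
struct InstanceDataSketch {
    std::mutex mutex;
    bool rpc_channel_is_turn_off = false;
    int queued_requests = 0;  // hypothetical counter standing in for a queue
};

// Returns true if the request was queued, false if the channel is off.
// The flag is read while holding instance_data.mutex.
inline bool try_enqueue(InstanceDataSketch& instance_data) {
    std::lock_guard<std::mutex> guard(instance_data.mutex);
    if (instance_data.rpc_channel_is_turn_off) {
        return false;
    }
    ++instance_data.queued_requests;
    return true;
}

// Turns the channel off under the same mutex used by readers.
inline void turn_off_channel(InstanceDataSketch& instance_data) {
    std::lock_guard<std::mutex> guard(instance_data.mutex);
    instance_data.rpc_channel_is_turn_off = true;
}

// Usage: requests are accepted until the channel is turned off.
inline void sink_usage_example() {
    InstanceDataSketch instance_data;
    assert(try_enqueue(instance_data));
    turn_off_channel(instance_data);
    assert(!try_enqueue(instance_data));
}
```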

User Focus

  • No additional user-provided review focus was supplied.

@mrhhsg
Member Author

mrhhsg commented Apr 24, 2026

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 64.71% (11/17) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.31% (20385/38238)
Line Coverage 36.85% (192020/521089)
Region Coverage 33.18% (149390/450268)
Branch Coverage 34.29% (65333/190556)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 88.24% (15/17) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.45% (26777/37477)
Line Coverage 53.81% (279642/519700)
Region Coverage 47.19% (215069/455765)
Branch Coverage 50.46% (97271/192778)

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 64.71% (11/17) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.35% (20400/38241)
Line Coverage 36.88% (192237/521196)
Region Coverage 33.19% (149509/450465)
Branch Coverage 34.32% (65433/190648)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 88.24% (15/17) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.44% (26773/37477)
Line Coverage 53.76% (279395/519700)
Region Coverage 47.15% (214880/455765)
Branch Coverage 50.41% (97189/192778)

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 64.71% (11/17) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.36% (20414/38256)
Line Coverage 36.89% (192296/521255)
Region Coverage 33.22% (149665/450480)
Branch Coverage 34.32% (65435/190648)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 88.24% (15/17) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.44% (26773/37477)
Line Coverage 53.76% (279400/519700)
Region Coverage 47.16% (214934/455765)
Branch Coverage 50.42% (97190/192778)

@github-actions github-actions Bot added the "approved" label (indicates a PR has been approved by one committer) on Apr 27, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@mrhhsg
Member Author

mrhhsg commented Apr 27, 2026

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 50.00% (6/12) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.34% (20417/38275)
Line Coverage 36.90% (192471/521563)
Region Coverage 33.22% (149766/450798)
Branch Coverage 34.34% (65516/190776)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 83.33% (10/12) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.92% (26971/37500)
Line Coverage 55.13% (286826/520268)
Region Coverage 52.12% (237194/455124)
Branch Coverage 53.53% (102507/191506)

@yiguolei yiguolei merged commit bc82bbb into apache:master Apr 28, 2026
30 of 32 checks passed
github-actions Bot pushed a commit that referenced this pull request Apr 28, 2026
@mrhhsg mrhhsg deleted the fix_exchange branch April 28, 2026 09:00
yiguolei pushed a commit that referenced this pull request Apr 29, 2026
…62885)

Cherry-picked from #62777

Co-authored-by: Jerry Hu <hushenggang@selectdb.com>
Labels

approved (indicates a PR has been approved by one committer), dev/4.1.1-merged, reviewed

4 participants