Skip to content

Conversation

@xzhangxian1008
Copy link
Contributor

@xzhangxian1008 xzhangxian1008 commented Jan 14, 2026

What problem does this PR solve?

Issue Number: close #10636

Problem Summary:

What is changed and how it works?

In `waitForBlockAvailableForTest`, CTEReader will wait the wake of `cv_for_test`. However, we need to check if there are any available blocks before entering the wait or it will wait forever when there is no more block to be pushed.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Fix the issue that TestCTE.Concurrent may hang forever

@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-triage-completed labels Jan 14, 2026
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Jan 14, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zanmato1984 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 14, 2026
@coderabbitai
Copy link

coderabbitai bot commented Jan 14, 2026

📝 Walkthrough

Walkthrough

The pull request refactors synchronization primitives in the CTE (Common Table Expression) operator to prevent deadlocks. A shared read-write lock replaces per-partition mutexes for protecting a broader set of state variables. Lock acquisition order is carefully adjusted during blocking waits to avoid circular wait conditions.

Changes

Cohort / File(s) Summary
Synchronization primitive refactoring
dbms/src/Operators/CTE.h
Added public getRWLockForTest() accessor and private std::shared_mutex rw_lock field to protect multiple state variables (next_cte_reader_id, is_eof, is_cancelled, get_resp, resp, err_msg, sink_exit_num, source_exit_num, registered_sink_num). Removed per-partition test mutex initialization.
Partition test mutex removal
dbms/src/Operators/CTEPartition.h
Removed mu_for_test public member field (unique_ptrstd::mutex) from struct. Other synchronization primitives (mu, cv_for_test) retained.
Lock acquisition pattern updates
dbms/src/Operators/CTEReader.cpp
Updated to use shared rw-lock from CTE with careful lock ordering during blocking waits. Replaced checkBlockAvailable call with checkBlockAvailableNoLock to avoid holding partition lock during availability checks. Adjusted waiting loop to release rw-lock before condition variable wait, then re-acquire locks in deadlock-safe order.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

approved, lgtm

Suggested reviewers

  • JaySon-Huang
  • wshwsh12
  • JinheLin

Poem

🐰 A lock for each partition caused a tangled mess,
But shared mutexes bring harmony and blessed access,
TSAN's complaints fade as deadlocks take their final bow,
CTE's symphony plays in perfect order now! 🎼

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main objective of the PR: fixing a hang issue in the TestCTE.Concurrent test.
Linked Issues check ✅ Passed The PR addresses the hang issue in TestCTE.Concurrent by implementing synchronization mechanism improvements: replacing per-partition mutex with shared read-write lock in CTE, and adjusting lock acquisition/release patterns to prevent deadlocks.
Out of Scope Changes check ✅ Passed All code changes (CTE.h, CTEPartition.h, CTEReader.cpp) directly relate to fixing the TestCTE.Concurrent hang issue through improved synchronization logic, with no unrelated modifications detected.
Description check ✅ Passed The PR description follows the template with all required sections completed, including issue reference, problem summary, code changes explanation, test checklist, side effects assessment, and release note.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@xzhangxian1008
Copy link
Contributor Author

/cc @windtalker @gengliqi

@ti-chi-bot ti-chi-bot bot requested review from gengliqi and windtalker January 15, 2026 01:43
@xzhangxian1008
Copy link
Contributor Author

/run-check-issue-triage-complete


CTEPartition & getPartitionForTest(size_t partition_idx) { return this->partitions[partition_idx]; }

std::shared_mutex * getRWLockForTest() { return &(this->rw_lock); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using a reference instead of a pointer?

// For example: in `CTE::tryGetBlockAt`, we will lock rw_lock first then lock partition.mu.
// If locking partition.mu first here, `CTE::tryGetBlockAt` may have locked rw_lock. Then
// each of them needs to lock the other lock, but the other lock has been locked now.
rw_lock.lock();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a sleep above this line of code?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fail TSAN test: TestCTE.Concurrent hang forever

2 participants