Hash join buffering on probe side #19761
Conversation
run benchmarks

🤖 Hi @gabotechs, thanks for the request (#19761 (comment)).

run benchmarks

🤖: Benchmark completed.

run benchmark tpcds tpch10

🤖: Benchmark script failed with exit code 1.

run benchmark tpch10

🤖: Benchmark completed.

run benchmark tpch

🤖: Benchmark completed.

run benchmark tpcds

🤖: Benchmark script failed with exit code 1.

🤔 the tpcds benchmark command seems broken
It does seem that some queries get a significant slowdown... I think this needs further investigation.

So in summary: >10% faster on average, but with some slowdowns.
What might be a good thing to try: currently, a high part of the cost of the
We should be able to do the hash % mod of the build side during the join, avoiding the need for a When this is implemented, we might want to look at
Tested it locally and it does not seem to have a significant impact. Even doing an unbounded buffering has no significant impact. What I get from that is that the main driver for speed is the fact that something forces the probe side
This can indeed reduce the impact of any issue affecting build-side creation speed. I would not say "hiding", as the issues should still be visible; it's just that overall query latency is no longer the best metric to discover them.

I do expect buffering to have a positive impact even if all the optimizations you mentioned are shipped. Buffering has a much greater impact in real scenarios, where the IO component is way heavier because data might be stored in a bucket or behind a remote resource like an API; I was actually surprised to see a non-negligible impact even when running benchmarks against local files, like the ones reported in this PR. Regardless of the order of events, this PR still needs work: it should not imply slowdowns in any of the current benchmarks.
I've just dug a bit more into why the slowdowns happen: buffering is rendering the dynamic filters useless. Disabling buffering if a dynamic filter attempts to be pushed down through a

Benchmark results for TPC-DS with `BufferExec` removed if a dynamic filter tried to cross it
```rust
// If there is a dynamic filter being pushed down through this node, we don't want to buffer,
// we prefer to give a chance to the dynamic filter to be populated with something rather
// than eagerly polling data with an empty dynamic filter.
let mut has_dynamic_filter = false;
for parent_filter in &child_pushdown_result.parent_filters {
    if is_dynamic_physical_expr(&parent_filter.filter) {
        has_dynamic_filter = true;
    }
}
if has_dynamic_filter {
    let mut result = FilterPushdownPropagation::if_all(child_pushdown_result);
    result.updated_node = Some(Arc::clone(self.input()));
    Ok(result)
} else {
    Ok(FilterPushdownPropagation::if_all(child_pushdown_result))
}
```
@adriangb am I doing this right? what I mainly want is: if at any point a dynamic filter is pushed down through this node, then I want this node to disappear.
Yes this looks right to me! I'm not sure this was tested as part of the optimizer rule but "swap with my child" would be a good test to add (especially if this is broken).
I was under the impression that I should be doing something like this instead:

```rust
for parent_filter in &child_pushdown_result.parent_filters {
    if is_dynamic_physical_expr(&parent_filter.filter)
        && matches!(parent_filter.all(), PushedDown::Yes)
    {
        has_dynamic_filter = true;
    }
}
```

But `parent_filter.all()` is always `PushedDown::No` even if the dynamic filter is getting properly pushed down. Is that expected?
Yeah, that is normal if you have `datafusion.execution.parquet.pushdown_filters = false` (the default). It means the scan node may hold onto a reference and use it for e.g. stats scanning, but it does not promise to apply it perfectly. I.e., as far as the caller is concerned, it was not pushed down. You could use `DynamicFilterPhysicalExpr::is_used`, which @LiaCastaneda introduced recently.
run benchmark tpcds tpch

🤖: Benchmark completed.

🤖: Benchmark completed.
Which issue does this PR close?
It does not close any issue, but it's related to:
Rationale for this change
This is a PR from a batch of PRs that attempt to improve performance in hash joins:
It adds a new `BufferExec` node at the top of the probe side of hash joins so that some work is eagerly performed before the build side of the hash join is completely finished.

Why should this speed up joins?
In order to better understand the impact of this PR, it's useful to understand how streams work in Rust: creating a stream does not perform any work; progress is only made when the stream gets polled.

This means that whenever we call `.execute()` on an `ExecutionPlan` (like the probe side of a join), nothing happens; not even the most basic TCP connections or system calls are performed. Instead, all this work is delayed as much as possible until the first poll is made to the stream, losing the opportunity to make some early progress.

This gets worse when multiple hash joins are chained together: they get executed in cascade, like domino pieces, which has the benefit of a small memory footprint but underutilizes the machine's resources for executing the query faster.
NOTE: still don't know if this improves the benchmarks, just experimenting for now
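The idea of eagerly driving the probe side can be sketched with std threads and a bounded channel. Note that `buffer` below is a made-up helper for illustration, not a DataFusion API, and the real `BufferExec` operates on async streams of record batches rather than iterators:

```rust
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

// Eagerly pull items from `input` on a background thread, so the
// consumer finds them already produced. The bounded channel caps how
// many items sit in the buffer, limiting the memory footprint.
fn buffer<I>(input: I, capacity: usize) -> Receiver<I::Item>
where
    I: Iterator + Send + 'static,
    I::Item: Send + 'static,
{
    let (tx, rx) = sync_channel(capacity);
    thread::spawn(move || {
        for item in input {
            if tx.send(item).is_err() {
                break; // receiver dropped: stop buffering
            }
        }
    });
    rx
}

fn main() {
    // The probe side starts producing immediately, even if the consumer
    // (the join) is still busy building its hash table.
    let rx = buffer(0..4, 2);
    let collected: Vec<i32> = rx.iter().collect();
    assert_eq!(collected, vec![0, 1, 2, 3]);
}
```

The channel bound mirrors why bounded buffering keeps memory usage small while still letting the probe side make early progress.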
What changes are included in this PR?
Adds a new `HashJoinBuffering` physical optimizer rule that will idempotently place `BufferExec` nodes on the probe side of hash joins:

Are these changes tested?
Yes, by existing tests.
Are there any user-facing changes?
Yes, users will see a new `BufferExec` being placed at the top of the probe side of each hash join. (Still unsure about whether the default mode should be enabled.)

Results
Warning

I'm very skeptical about these benchmark numbers since they were run on my laptop; take them with a grain of salt. They should be re-run in a more controlled environment.