@gabotechs gabotechs commented Jan 12, 2026

Which issue does this PR close?

It does not close any issue, but it's related to:

Rationale for this change

This is a PR from a batch of PRs that attempt to improve performance in hash joins:

It adds the new BufferExec node at the top of the probe side of hash joins so that some work is eagerly performed before the build side of the hash join is completely finished.

Why should this speed up joins?

To understand the impact of this PR, it helps to recall how streams work in Rust: creating a stream does not perform any work; progress is made only when the stream is polled.

This means that whenever we call .execute() on an ExecutionPlan (like the probe side of a join), nothing happens: not even the most basic TCP connections or system calls are performed. Instead, all this work is delayed until the first poll is made to the stream, losing the opportunity to make some early progress.
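The laziness described above can be observed with the standard library alone. The following is a minimal sketch, not DataFusion code: `OpenSource` and `NoopWaker` are hypothetical names, and a plain `Future` is used as a stand-in because the `Stream` trait lives outside `std`. The same "no work until polled" rule is what applies to the stream returned by `.execute()`.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for the first piece of work a probe-side stream
/// would do (opening a file, a TCP connection, ...).
struct OpenSource {
    started: Arc<AtomicBool>,
}

impl Future for OpenSource {
    type Output = ();

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        // The work only happens here, when somebody polls.
        self.started.store(true, Ordering::SeqCst);
        Poll::Ready(())
    }
}

/// Waker that does nothing; enough to poll a future by hand.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Returns whether the work had started (before the first poll, after it).
fn started_before_and_after_poll() -> (bool, bool) {
    let started = Arc::new(AtomicBool::new(false));
    let mut fut = OpenSource { started: Arc::clone(&started) };

    // Constructing the future ran none of its body.
    let before = started.load(Ordering::SeqCst);

    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let _ = Pin::new(&mut fut).poll(&mut cx);

    let after = started.load(Ordering::SeqCst);
    (before, after)
}
```

Calling `started_before_and_after_poll()` yields `(false, true)`: the flag flips only once the future is polled.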

This gets worse when multiple hash joins are chained together: they execute in cascade, like falling dominoes, which has the benefit of leaving a small memory footprint but underutilizes the machine's resources that could be used to execute the query faster.
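The idea of letting the probe side make progress while the build side is still running can be sketched with a bounded buffer. This is only an analogy under stated assumptions: it uses OS threads and `std::sync::mpsc::sync_channel` instead of async tasks, `join_with_probe_buffer` is a hypothetical name, and it says nothing about how BufferExec is actually implemented.

```rust
use std::collections::HashSet;
use std::sync::mpsc::sync_channel;
use std::thread;

/// Toy inner join on integer keys. Probe rows are produced eagerly into a
/// bounded buffer while the build side is still being hashed.
fn join_with_probe_buffer(
    build_keys: Vec<u32>,
    probe_rows: Vec<u32>,
    capacity: usize,
) -> Vec<u32> {
    let (tx, rx) = sync_channel::<u32>(capacity);

    // Start producing probe rows right away. The bounded channel caps how far
    // ahead the producer can run, which bounds the extra memory used.
    let producer = thread::spawn(move || {
        for row in probe_rows {
            if tx.send(row).is_err() {
                break; // consumer went away
            }
        }
    });

    // Build side: runs concurrently with the producer above.
    let table: HashSet<u32> = build_keys.into_iter().collect();

    // Probe phase: the first rows are already waiting in the buffer.
    let matches: Vec<u32> = rx.iter().filter(|row| table.contains(row)).collect();
    producer.join().expect("probe producer panicked");
    matches
}
```

The `capacity` argument is the knob that trades memory for overlap: a larger buffer lets more probe-side work complete before the build side finishes.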

NOTE: still don't know if this improves the benchmarks, just experimenting for now

What changes are included in this PR?

Adds a new HashJoinBuffering physical optimizer rule that idempotently places BufferExec nodes on the probe side of hash joins:

            ┌───────────────────┐
            │   HashJoinExec    │
            └─────▲────────▲────┘
          ┌───────┘        └─────────┐
          │                          │
 ┌────────────────┐         ┌─────────────────┐
 │   Build side   │       + │   BufferExec    │
 └────────────────┘         └────────▲────────┘
                                     │
                            ┌────────┴────────┐
                            │   Probe side    │
                            └─────────────────┘
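What "idempotently" means for a rule of this shape can be sketched on a toy plan tree. The `Plan` enum and `add_probe_buffers` function below are hypothetical and only illustrate the rewrite in the diagram; they are not the DataFusion `ExecutionPlan` API or the actual HashJoinBuffering implementation.

```rust
/// Toy physical plan, just enough to express the diagram above.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(&'static str),
    Buffer(Box<Plan>),
    HashJoin { build: Box<Plan>, probe: Box<Plan> },
}

/// Wraps the probe side of every hash join in a `Buffer` node, unless one is
/// already there. Applying the rule twice gives the same plan as applying it once.
fn add_probe_buffers(plan: Plan) -> Plan {
    match plan {
        Plan::Scan(name) => Plan::Scan(name),
        Plan::Buffer(input) => Plan::Buffer(Box::new(add_probe_buffers(*input))),
        Plan::HashJoin { build, probe } => {
            let build = add_probe_buffers(*build);
            let probe = match add_probe_buffers(*probe) {
                // Already buffered: leave it alone so the rule stays idempotent.
                already @ Plan::Buffer(_) => already,
                other => Plan::Buffer(Box::new(other)),
            };
            Plan::HashJoin {
                build: Box::new(build),
                probe: Box::new(probe),
            }
        }
    }
}
```

Running the rule on a join of two scans wraps only the probe-side scan, and running it again on the result leaves the plan unchanged.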

Are these changes tested?

Yes, by existing tests.

Are there any user-facing changes?

Yes, users will see a new BufferExec placed at the top of the probe side of each hash join. (Still unsure whether this should be enabled by default.)


Results

Warning

I'm very skeptical about these benchmarks, which were run on my laptop. Take them with a grain of salt; they should be rerun in a more controlled environment.

Comparing main and hash-join-buffering-on-probe-side
--------------------
Benchmark tpcds_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃       main ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │   37.80 ms │                          19.07 ms │ +1.98x faster │
│ QQuery 2  │  130.36 ms │                          54.25 ms │ +2.40x faster │
│ QQuery 3  │   99.05 ms │                          90.99 ms │ +1.09x faster │
│ QQuery 4  │  894.61 ms │                         340.70 ms │ +2.63x faster │
│ QQuery 5  │  151.16 ms │                         147.84 ms │     no change │
│ QQuery 6  │  566.37 ms │                         513.89 ms │ +1.10x faster │
│ QQuery 7  │  290.12 ms │                         248.25 ms │ +1.17x faster │
│ QQuery 8  │   97.46 ms │                          90.59 ms │ +1.08x faster │
│ QQuery 9  │   88.59 ms │                          94.18 ms │  1.06x slower │
│ QQuery 10 │   85.89 ms │                          48.71 ms │ +1.76x faster │
│ QQuery 11 │  567.85 ms │                         180.30 ms │ +3.15x faster │
│ QQuery 12 │   35.66 ms │                          32.78 ms │ +1.09x faster │
│ QQuery 13 │  313.89 ms │                         312.86 ms │     no change │
│ QQuery 14 │  741.51 ms │                         367.39 ms │ +2.02x faster │
│ QQuery 15 │   23.11 ms │                          49.44 ms │  2.14x slower │
│ QQuery 16 │   32.72 ms │                         109.53 ms │  3.35x slower │
│ QQuery 17 │  220.05 ms │                         160.70 ms │ +1.37x faster │
│ QQuery 18 │  114.36 ms │                         162.51 ms │  1.42x slower │
│ QQuery 19 │  133.50 ms │                         123.87 ms │ +1.08x faster │
│ QQuery 20 │   12.37 ms │                          52.66 ms │  4.26x slower │
│ QQuery 21 │   15.53 ms │                         132.58 ms │  8.54x slower │
│ QQuery 22 │  288.69 ms │                         375.91 ms │  1.30x slower │
│ QQuery 23 │  772.46 ms │                         488.07 ms │ +1.58x faster │
│ QQuery 24 │  340.42 ms │                         287.51 ms │ +1.18x faster │
│ QQuery 25 │  307.77 ms │                         195.09 ms │ +1.58x faster │
│ QQuery 26 │   81.78 ms │                         123.89 ms │  1.51x slower │
│ QQuery 27 │  297.72 ms │                         240.88 ms │ +1.24x faster │
│ QQuery 28 │  127.20 ms │                         127.28 ms │     no change │
│ QQuery 29 │  261.03 ms │                         161.52 ms │ +1.62x faster │
│ QQuery 30 │   35.53 ms │                          26.18 ms │ +1.36x faster │
│ QQuery 31 │  120.02 ms │                         101.47 ms │ +1.18x faster │
│ QQuery 32 │   48.49 ms │                          43.37 ms │ +1.12x faster │
│ QQuery 33 │  112.83 ms │                         110.45 ms │     no change │
│ QQuery 34 │   85.92 ms │                          80.71 ms │ +1.06x faster │
│ QQuery 35 │   81.94 ms │                          51.65 ms │ +1.59x faster │
│ QQuery 36 │  165.56 ms │                         168.79 ms │     no change │
│ QQuery 37 │  153.98 ms │                         155.81 ms │     no change │
│ QQuery 38 │   60.75 ms │                          53.06 ms │ +1.14x faster │
│ QQuery 39 │   81.49 ms │                         294.01 ms │  3.61x slower │
│ QQuery 40 │   87.94 ms │                          76.12 ms │ +1.16x faster │
│ QQuery 41 │   10.61 ms │                           9.61 ms │ +1.10x faster │
│ QQuery 42 │   89.63 ms │                          88.33 ms │     no change │
│ QQuery 43 │   69.61 ms │                          63.42 ms │ +1.10x faster │
│ QQuery 44 │    9.08 ms │                           7.78 ms │ +1.17x faster │
│ QQuery 45 │   53.17 ms │                          32.19 ms │ +1.65x faster │
│ QQuery 46 │  175.44 ms │                         167.41 ms │     no change │
│ QQuery 47 │  478.10 ms │                         123.03 ms │ +3.89x faster │
│ QQuery 48 │  224.20 ms │                         212.88 ms │ +1.05x faster │
│ QQuery 49 │  206.10 ms │                         200.87 ms │     no change │
│ QQuery 50 │  176.44 ms │                         141.12 ms │ +1.25x faster │
│ QQuery 51 │  141.42 ms │                         105.32 ms │ +1.34x faster │
│ QQuery 52 │   90.66 ms │                          89.26 ms │     no change │
│ QQuery 53 │   89.56 ms │                          83.37 ms │ +1.07x faster │
│ QQuery 54 │  123.43 ms │                         119.06 ms │     no change │
│ QQuery 55 │   88.73 ms │                          90.23 ms │     no change │
│ QQuery 56 │  114.66 ms │                         112.92 ms │     no change │
│ QQuery 57 │  131.64 ms │                          69.73 ms │ +1.89x faster │
│ QQuery 58 │  228.01 ms │                         127.59 ms │ +1.79x faster │
│ QQuery 59 │  169.17 ms │                         127.03 ms │ +1.33x faster │
│ QQuery 60 │  118.92 ms │                         115.28 ms │     no change │
│ QQuery 61 │  149.06 ms │                         147.06 ms │     no change │
│ QQuery 62 │  441.11 ms │                         433.50 ms │     no change │
│ QQuery 63 │   95.44 ms │                          85.84 ms │ +1.11x faster │
│ QQuery 64 │  606.32 ms │                         442.72 ms │ +1.37x faster │
│ QQuery 65 │  208.68 ms │                          91.03 ms │ +2.29x faster │
│ QQuery 66 │  188.17 ms │                         177.41 ms │ +1.06x faster │
│ QQuery 67 │  249.91 ms │                         234.31 ms │ +1.07x faster │
│ QQuery 68 │  235.92 ms │                         224.15 ms │     no change │
│ QQuery 69 │   89.95 ms │                          46.44 ms │ +1.94x faster │
│ QQuery 70 │  278.67 ms │                         203.35 ms │ +1.37x faster │
│ QQuery 71 │  109.23 ms │                         109.86 ms │     no change │
│ QQuery 72 │  508.24 ms │                         391.84 ms │ +1.30x faster │
│ QQuery 73 │   90.02 ms │                          78.49 ms │ +1.15x faster │
│ QQuery 74 │  373.75 ms │                         112.90 ms │ +3.31x faster │
│ QQuery 75 │  227.43 ms │                         172.97 ms │ +1.31x faster │
│ QQuery 76 │  116.42 ms │                         110.72 ms │     no change │
│ QQuery 77 │  170.31 ms │                         144.66 ms │ +1.18x faster │
│ QQuery 78 │  422.27 ms │                         245.42 ms │ +1.72x faster │
│ QQuery 79 │  190.47 ms │                         166.21 ms │ +1.15x faster │
│ QQuery 80 │  265.88 ms │                         242.36 ms │ +1.10x faster │
│ QQuery 81 │   23.05 ms │                          17.96 ms │ +1.28x faster │
│ QQuery 82 │  173.94 ms │                         162.41 ms │ +1.07x faster │
│ QQuery 83 │   40.37 ms │                          18.62 ms │ +2.17x faster │
│ QQuery 84 │   40.52 ms │                          26.07 ms │ +1.55x faster │
│ QQuery 85 │  138.45 ms │                          71.38 ms │ +1.94x faster │
│ QQuery 86 │   30.41 ms │                          28.27 ms │ +1.08x faster │
│ QQuery 87 │   62.64 ms │                          54.20 ms │ +1.16x faster │
│ QQuery 88 │   84.50 ms │                          74.60 ms │ +1.13x faster │
│ QQuery 89 │  108.95 ms │                          89.03 ms │ +1.22x faster │
│ QQuery 90 │   19.19 ms │                          16.36 ms │ +1.17x faster │
│ QQuery 91 │   53.45 ms │                          34.82 ms │ +1.54x faster │
│ QQuery 92 │   49.13 ms │                          25.47 ms │ +1.93x faster │
│ QQuery 93 │  151.86 ms │                         134.34 ms │ +1.13x faster │
│ QQuery 94 │   52.94 ms │                          46.45 ms │ +1.14x faster │
│ QQuery 95 │  125.23 ms │                          50.85 ms │ +2.46x faster │
│ QQuery 96 │   59.70 ms │                          54.86 ms │ +1.09x faster │
│ QQuery 97 │   99.90 ms │                          71.00 ms │ +1.41x faster │
│ QQuery 98 │  129.60 ms │                         111.11 ms │ +1.17x faster │
│ QQuery 99 │ 4562.37 ms │                        4353.70 ms │     no change │
└───────────┴────────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)                                │ 21975.53ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 17884.01ms │
│ Average Time (main)                              │   221.98ms │
│ Average Time (hash-join-buffering-on-probe-side) │   180.65ms │
│ Queries Faster                                   │         70 │
│ Queries Slower                                   │          9 │
│ Queries with No Change                           │         20 │
│ Queries with Failure                             │          0 │
└──────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃      main ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │  44.90 ms │                          40.62 ms │ +1.11x faster │
│ QQuery 2  │  18.76 ms │                          12.43 ms │ +1.51x faster │
│ QQuery 3  │  28.97 ms │                          23.39 ms │ +1.24x faster │
│ QQuery 4  │  17.85 ms │                          16.29 ms │ +1.10x faster │
│ QQuery 5  │  93.97 ms │                          43.91 ms │ +2.14x faster │
│ QQuery 6  │  17.08 ms │                          17.50 ms │     no change │
│ QQuery 7  │  90.73 ms │                          46.86 ms │ +1.94x faster │
│ QQuery 8  │  85.72 ms │                          36.05 ms │ +2.38x faster │
│ QQuery 9  │  74.19 ms │                          43.14 ms │ +1.72x faster │
│ QQuery 10 │  89.22 ms │                          39.76 ms │ +2.24x faster │
│ QQuery 11 │  13.64 ms │                           9.49 ms │ +1.44x faster │
│ QQuery 12 │  53.55 ms │                          28.44 ms │ +1.88x faster │
│ QQuery 13 │  20.46 ms │                          20.60 ms │     no change │
│ QQuery 14 │  44.52 ms │                          22.86 ms │ +1.95x faster │
│ QQuery 15 │  33.20 ms │                          27.10 ms │ +1.22x faster │
│ QQuery 16 │  12.82 ms │                          11.75 ms │ +1.09x faster │
│ QQuery 17 │  82.07 ms │                          50.03 ms │ +1.64x faster │
│ QQuery 18 │ 109.41 ms │                          62.02 ms │ +1.76x faster │
│ QQuery 19 │  39.01 ms │                          34.62 ms │ +1.13x faster │
│ QQuery 20 │  53.24 ms │                          26.53 ms │ +2.01x faster │
│ QQuery 21 │  76.87 ms │                          53.66 ms │ +1.43x faster │
│ QQuery 22 │   9.18 ms │                           8.46 ms │ +1.09x faster │
└───────────┴───────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)                                │ 1109.37ms │
│ Total Time (hash-join-buffering-on-probe-side)   │  675.51ms │
│ Average Time (main)                              │   50.43ms │
│ Average Time (hash-join-buffering-on-probe-side) │   30.71ms │
│ Queries Faster                                   │        20 │
│ Queries Slower                                   │         0 │
│ Queries with No Change                           │         2 │
│ Queries with Failure                             │         0 │
└──────────────────────────────────────────────────┴───────────┘
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃      main ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │ 333.88 ms │                         333.10 ms │     no change │
│ QQuery 2  │ 149.56 ms │                          95.79 ms │ +1.56x faster │
│ QQuery 3  │ 291.89 ms │                         272.45 ms │ +1.07x faster │
│ QQuery 4  │ 115.77 ms │                         116.32 ms │     no change │
│ QQuery 5  │ 435.41 ms │                         408.67 ms │ +1.07x faster │
│ QQuery 6  │ 122.00 ms │                         119.41 ms │     no change │
│ QQuery 7  │ 597.53 ms │                         554.64 ms │ +1.08x faster │
│ QQuery 8  │ 505.06 ms │                         447.98 ms │ +1.13x faster │
│ QQuery 9  │ 718.08 ms │                         664.75 ms │ +1.08x faster │
│ QQuery 10 │ 355.45 ms │                         318.31 ms │ +1.12x faster │
│ QQuery 11 │ 117.63 ms │                          87.23 ms │ +1.35x faster │
│ QQuery 12 │ 229.20 ms │                         197.97 ms │ +1.16x faster │
│ QQuery 13 │ 250.32 ms │                         219.43 ms │ +1.14x faster │
│ QQuery 14 │ 197.94 ms │                         173.28 ms │ +1.14x faster │
│ QQuery 15 │ 318.42 ms │                         288.27 ms │ +1.10x faster │
│ QQuery 16 │  85.11 ms │                          66.98 ms │ +1.27x faster │
│ QQuery 17 │ 723.73 ms │                         667.37 ms │ +1.08x faster │
│ QQuery 18 │ 794.77 ms │                         726.88 ms │ +1.09x faster │
│ QQuery 19 │ 320.78 ms │                         292.61 ms │ +1.10x faster │
│ QQuery 20 │ 293.52 ms │                         258.06 ms │ +1.14x faster │
│ QQuery 21 │ 786.11 ms │                         732.63 ms │ +1.07x faster │
│ QQuery 22 │  84.85 ms │                          79.90 ms │ +1.06x faster │
└───────────┴───────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)                                │ 7827.02ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 7122.04ms │
│ Average Time (main)                              │  355.77ms │
│ Average Time (hash-join-buffering-on-probe-side) │  323.73ms │
│ Queries Faster                                   │        19 │
│ Queries Slower                                   │         0 │
│ Queries with No Change                           │         3 │
│ Queries with Failure                             │         0 │
└──────────────────────────────────────────────────┴───────────┘
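For reading the "Change" column: the rows above appear consistent with a 5% noise threshold on the relative difference, with the speedup printed as a ratio of the two timings. The sketch below reproduces that labeling under this assumption; `classify` is a hypothetical helper and not the benchmark script itself.

```rust
/// Labels a pair of timings the way the tables above appear to.
/// Assumption: differences within 5% of the baseline count as noise.
fn classify(baseline_ms: f64, candidate_ms: f64) -> String {
    const NOISE: f64 = 0.05;
    let relative = (candidate_ms - baseline_ms) / baseline_ms;
    if relative.abs() <= NOISE {
        "no change".to_string()
    } else if relative < 0.0 {
        format!("+{:.2}x faster", baseline_ms / candidate_ms)
    } else {
        format!("{:.2}x slower", candidate_ms / baseline_ms)
    }
}
```

For example, TPC-DS query 68 (235.92 ms to 224.15 ms) is a 4.99% improvement and lands on "no change", while query 48 (224.20 ms to 212.88 ms) is a 5.05% improvement and is reported as "+1.05x faster".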

@github-actions github-actions bot added the optimizer, core, sqllogictest, common, execution, proto, datasource, and physical-plan labels Jan 12, 2026
@gabotechs
Contributor Author

run benchmarks

@alamb-ghbot

🤖 Hi @gabotechs, thanks for the request (#19761 (comment)). scrape_comments.py only responds to whitelisted users. Allowed users: Dandandan, Omega359, adriangb, alamb, comphead, geoffreyclaude, klion26, rluvaton, xudong963, zhuqi-lucas.

@github-actions github-actions bot added the documentation label Jan 12, 2026
@gabotechs
Contributor Author

run benchmarks

@alamb-ghbot

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing hash-join-buffering-on-probe-side (3e4660b) to 0c5c97b diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Comparing HEAD and hash-join-buffering-on-probe-side
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query    ┃        HEAD ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │  2479.50 ms │                        2365.91 ms │     no change │
│ QQuery 1 │   933.04 ms │                         961.61 ms │     no change │
│ QQuery 2 │  2128.72 ms │                        1828.41 ms │ +1.16x faster │
│ QQuery 3 │  1140.67 ms │                        1106.77 ms │     no change │
│ QQuery 4 │  2349.73 ms │                        2265.79 ms │     no change │
│ QQuery 5 │ 28477.94 ms │                       27819.90 ms │     no change │
│ QQuery 6 │  3913.85 ms │                        3886.72 ms │     no change │
│ QQuery 7 │  2907.17 ms │                        2857.38 ms │     no change │
└──────────┴─────────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                │ 44330.62ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 43092.50ms │
│ Average Time (HEAD)                              │  5541.33ms │
│ Average Time (hash-join-buffering-on-probe-side) │  5386.56ms │
│ Queries Faster                                   │          1 │
│ Queries Slower                                   │          0 │
│ Queries with No Change                           │          7 │
│ Queries with Failure                             │          0 │
└──────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃        HEAD ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │     1.91 ms │                           1.94 ms │     no change │
│ QQuery 1  │    50.86 ms │                          51.03 ms │     no change │
│ QQuery 2  │   129.07 ms │                         131.06 ms │     no change │
│ QQuery 3  │   151.75 ms │                         154.89 ms │     no change │
│ QQuery 4  │  1070.04 ms │                        1218.71 ms │  1.14x slower │
│ QQuery 5  │  1377.65 ms │                        1501.78 ms │  1.09x slower │
│ QQuery 6  │     1.82 ms │                           1.87 ms │     no change │
│ QQuery 7  │    56.03 ms │                          61.22 ms │  1.09x slower │
│ QQuery 8  │  1423.84 ms │                        1561.18 ms │  1.10x slower │
│ QQuery 9  │  1748.54 ms │                        1871.82 ms │  1.07x slower │
│ QQuery 10 │   343.11 ms │                         350.58 ms │     no change │
│ QQuery 11 │   390.93 ms │                         400.26 ms │     no change │
│ QQuery 12 │  1249.28 ms │                        1460.10 ms │  1.17x slower │
│ QQuery 13 │  1916.12 ms │                        2067.22 ms │  1.08x slower │
│ QQuery 14 │  1214.64 ms │                        1359.01 ms │  1.12x slower │
│ QQuery 15 │  1224.35 ms │                        1382.17 ms │  1.13x slower │
│ QQuery 16 │  2587.35 ms │                        2651.10 ms │     no change │
│ QQuery 17 │  2481.42 ms │                        2645.83 ms │  1.07x slower │
│ QQuery 18 │  6019.63 ms │                        4969.84 ms │ +1.21x faster │
│ QQuery 19 │   118.04 ms │                         122.91 ms │     no change │
│ QQuery 20 │  1977.36 ms │                        1907.42 ms │     no change │
│ QQuery 21 │  2282.79 ms │                        2227.74 ms │     no change │
│ QQuery 22 │  4147.94 ms │                        3809.68 ms │ +1.09x faster │
│ QQuery 23 │ 18037.69 ms │                       12405.70 ms │ +1.45x faster │
│ QQuery 24 │   203.52 ms │                         236.74 ms │  1.16x slower │
│ QQuery 25 │   482.62 ms │                         517.70 ms │  1.07x slower │
│ QQuery 26 │   218.15 ms │                         233.60 ms │  1.07x slower │
│ QQuery 27 │  2805.96 ms │                        2772.92 ms │     no change │
│ QQuery 28 │ 22174.76 ms │                       21847.75 ms │     no change │
│ QQuery 29 │   977.94 ms │                         952.08 ms │     no change │
│ QQuery 30 │  1315.68 ms │                        1336.40 ms │     no change │
│ QQuery 31 │  1366.16 ms │                        1421.09 ms │     no change │
│ QQuery 32 │  5155.78 ms │                        4350.28 ms │ +1.19x faster │
│ QQuery 33 │  5715.37 ms │                        5687.38 ms │     no change │
│ QQuery 34 │  6016.46 ms │                        5853.35 ms │     no change │
│ QQuery 35 │  1918.61 ms │                        2098.03 ms │  1.09x slower │
│ QQuery 36 │    67.22 ms │                          70.35 ms │     no change │
│ QQuery 37 │    45.47 ms │                          49.54 ms │  1.09x slower │
│ QQuery 38 │    65.42 ms │                          68.24 ms │     no change │
│ QQuery 39 │   104.44 ms │                         111.22 ms │  1.06x slower │
│ QQuery 40 │    27.46 ms │                          27.62 ms │     no change │
│ QQuery 41 │    23.04 ms │                          24.38 ms │  1.06x slower │
│ QQuery 42 │    19.89 ms │                          21.71 ms │  1.09x slower │
└───────────┴─────────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                │ 98706.12ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 91995.42ms │
│ Average Time (HEAD)                              │  2295.49ms │
│ Average Time (hash-join-buffering-on-probe-side) │  2139.43ms │
│ Queries Faster                                   │          4 │
│ Queries Slower                                   │         18 │
│ Queries with No Change                           │         21 │
│ Queries with Failure                             │          0 │
└──────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃      HEAD ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │ 140.65 ms │                         101.97 ms │ +1.38x faster │
│ QQuery 2  │  37.21 ms │                          30.95 ms │ +1.20x faster │
│ QQuery 3  │  44.92 ms │                          32.31 ms │ +1.39x faster │
│ QQuery 4  │  31.87 ms │                          30.19 ms │ +1.06x faster │
│ QQuery 5  │  92.53 ms │                          94.55 ms │     no change │
│ QQuery 6  │  21.01 ms │                          20.99 ms │     no change │
│ QQuery 7  │ 157.97 ms │                         165.53 ms │     no change │
│ QQuery 8  │  41.01 ms │                          35.06 ms │ +1.17x faster │
│ QQuery 9  │ 102.50 ms │                          93.90 ms │ +1.09x faster │
│ QQuery 10 │  68.82 ms │                          67.90 ms │     no change │
│ QQuery 11 │  19.57 ms │                          17.92 ms │ +1.09x faster │
│ QQuery 12 │  52.47 ms │                          54.41 ms │     no change │
│ QQuery 13 │  50.52 ms │                          47.74 ms │ +1.06x faster │
│ QQuery 14 │  15.26 ms │                          15.25 ms │     no change │
│ QQuery 15 │  31.19 ms │                          30.51 ms │     no change │
│ QQuery 16 │  30.26 ms │                          28.22 ms │ +1.07x faster │
│ QQuery 17 │ 144.19 ms │                         150.21 ms │     no change │
│ QQuery 18 │ 286.83 ms │                         262.07 ms │ +1.09x faster │
│ QQuery 19 │  40.60 ms │                          41.31 ms │     no change │
│ QQuery 20 │  57.30 ms │                          56.06 ms │     no change │
│ QQuery 21 │ 188.92 ms │                         179.62 ms │     no change │
│ QQuery 22 │  22.42 ms │                          22.15 ms │     no change │
└───────────┴───────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                                │ 1678.03ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 1578.79ms │
│ Average Time (HEAD)                              │   76.27ms │
│ Average Time (hash-join-buffering-on-probe-side) │   71.76ms │
│ Queries Faster                                   │        10 │
│ Queries Slower                                   │         0 │
│ Queries with No Change                           │        12 │
│ Queries with Failure                             │         0 │
└──────────────────────────────────────────────────┴───────────┘

@gabotechs
Contributor Author

run benchmark tpcds tpch10

@alamb-ghbot

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing hash-join-buffering-on-probe-side (3e4660b) to 0c5c97b diff using: tpcds
Results will be posted here when complete

@alamb-ghbot

Benchmark script failed with exit code 1.

Last 10 lines of output:

BRANCH_NAME: HEAD
DATA_DIR: /home/alamb/arrow-datafusion/benchmarks/data
RESULTS_DIR: /home/alamb/arrow-datafusion/benchmarks/results/HEAD
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************

Please prepare TPC-DS data first by following instructions:
  ./bench.sh data tpcds

@gabotechs
Contributor Author

run benchmark tpch10

@alamb-ghbot

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing hash-join-buffering-on-probe-side (3e4660b) to 0c5c97b diff using: tpch10
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Comparing HEAD and hash-join-buffering-on-probe-side
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query     ┃ HEAD ┃ hash-join-buffering-on-probe-side ┃       Change ┃
┡━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1  │ FAIL │                              FAIL │ incomparable │
│ QQuery 2  │ FAIL │                              FAIL │ incomparable │
│ QQuery 3  │ FAIL │                              FAIL │ incomparable │
│ QQuery 4  │ FAIL │                              FAIL │ incomparable │
│ QQuery 5  │ FAIL │                              FAIL │ incomparable │
│ QQuery 6  │ FAIL │                              FAIL │ incomparable │
│ QQuery 7  │ FAIL │                              FAIL │ incomparable │
│ QQuery 8  │ FAIL │                              FAIL │ incomparable │
│ QQuery 9  │ FAIL │                              FAIL │ incomparable │
│ QQuery 10 │ FAIL │                              FAIL │ incomparable │
│ QQuery 11 │ FAIL │                              FAIL │ incomparable │
│ QQuery 12 │ FAIL │                              FAIL │ incomparable │
│ QQuery 13 │ FAIL │                              FAIL │ incomparable │
│ QQuery 14 │ FAIL │                              FAIL │ incomparable │
│ QQuery 15 │ FAIL │                              FAIL │ incomparable │
│ QQuery 16 │ FAIL │                              FAIL │ incomparable │
│ QQuery 17 │ FAIL │                              FAIL │ incomparable │
│ QQuery 18 │ FAIL │                              FAIL │ incomparable │
│ QQuery 19 │ FAIL │                              FAIL │ incomparable │
│ QQuery 20 │ FAIL │                              FAIL │ incomparable │
│ QQuery 21 │ FAIL │                              FAIL │ incomparable │
│ QQuery 22 │ FAIL │                              FAIL │ incomparable │
└───────────┴──────┴───────────────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃ Benchmark Summary                                ┃        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ Total Time (HEAD)                                │ 0.00ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 0.00ms │
│ Average Time (HEAD)                              │ 0.00ms │
│ Average Time (hash-join-buffering-on-probe-side) │ 0.00ms │
│ Queries Faster                                   │      0 │
│ Queries Slower                                   │      0 │
│ Queries with No Change                           │      0 │
│ Queries with Failure                             │     22 │
└──────────────────────────────────────────────────┴────────┘

@gabotechs
Contributor Author

run benchmark tpch

@alamb-ghbot

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing hash-join-buffering-on-probe-side (3e4660b) to 0c5c97b diff using: tpch
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

Comparing HEAD and hash-join-buffering-on-probe-side
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃      HEAD ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │ 186.54 ms │                         180.81 ms │     no change │
│ QQuery 2  │  92.79 ms │                          48.71 ms │ +1.90x faster │
│ QQuery 3  │ 129.28 ms │                         106.07 ms │ +1.22x faster │
│ QQuery 4  │  80.78 ms │                          74.64 ms │ +1.08x faster │
│ QQuery 5  │ 186.74 ms │                         163.71 ms │ +1.14x faster │
│ QQuery 6  │  70.54 ms │                          66.87 ms │ +1.06x faster │
│ QQuery 7  │ 222.50 ms │                         194.54 ms │ +1.14x faster │
│ QQuery 8  │ 175.16 ms │                         125.23 ms │ +1.40x faster │
│ QQuery 9  │ 231.17 ms │                         174.24 ms │ +1.33x faster │
│ QQuery 10 │ 190.18 ms │                         148.84 ms │ +1.28x faster │
│ QQuery 11 │  70.01 ms │                          46.31 ms │ +1.51x faster │
│ QQuery 12 │ 120.18 ms │                         109.09 ms │ +1.10x faster │
│ QQuery 13 │ 219.34 ms │                         204.01 ms │ +1.08x faster │
│ QQuery 14 │  95.98 ms │                          88.23 ms │ +1.09x faster │
│ QQuery 15 │ 132.46 ms │                         100.40 ms │ +1.32x faster │
│ QQuery 16 │  64.09 ms │                          46.41 ms │ +1.38x faster │
│ QQuery 17 │ 280.98 ms │                         211.97 ms │ +1.33x faster │
│ QQuery 18 │ 332.62 ms │                         271.65 ms │ +1.22x faster │
│ QQuery 19 │ 140.44 ms │                         130.87 ms │ +1.07x faster │
│ QQuery 20 │ 135.30 ms │                         100.57 ms │ +1.35x faster │
│ QQuery 21 │ 265.90 ms │                         234.12 ms │ +1.14x faster │
│ QQuery 22 │  41.36 ms │                          37.33 ms │ +1.11x faster │
└───────────┴───────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                                │ 3464.36ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 2864.63ms │
│ Average Time (HEAD)                              │  157.47ms │
│ Average Time (hash-join-buffering-on-probe-side) │  130.21ms │
│ Queries Faster                                   │        21 │
│ Queries Slower                                   │         0 │
│ Queries with No Change                           │         1 │
│ Queries with Failure                             │         0 │
└──────────────────────────────────────────────────┴───────────┘

@gabotechs
Contributor Author

run benchmark tpcds

@alamb-ghbot

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing hash-join-buffering-on-probe-side (3e4660b) to 0c5c97b diff using: tpcds
Results will be posted here when complete

@alamb-ghbot

Benchmark script failed with exit code 1.

Last 10 lines of output:

BRANCH_NAME: HEAD
DATA_DIR: /home/alamb/arrow-datafusion/benchmarks/data
RESULTS_DIR: /home/alamb/arrow-datafusion/benchmarks/results/HEAD
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************

Please prepare TPC-DS data first by following instructions:
  ./bench.sh data tpcds

@gabotechs
Contributor Author

🤔 the tpcds benchmark command seems broken

@gabotechs force-pushed the hash-join-buffering-on-probe-side branch from 3e4660b to cdc6ad1 on January 13, 2026 09:51
@gabotechs
Contributor Author

It does seem that some queries get a significant slowdown... I think this needs further investigation.

@Dandandan
Contributor

So in summary: more than 10% faster on average, but with some slowdowns.

  • I am wondering what the speedup would be if we just load a single batch per partition instead of based on size - perhaps the speedup is similar with reduced overhead?
  • While I am excited about the speedup, I wonder if this is not mostly "hiding" some "issues" with our current join implementation: hashing and concatenation are currently single-threaded for CollectLeft, and both could be parallelized. Also, we could lower the overhead of the parallel hash join and reduce the threshold so that more of the query runs in parallel, avoiding time spent waiting on a step that does not parallelize well.

@Dandandan
Contributor

Dandandan commented Jan 13, 2026

What might be a good thing to try:

Currently, a large part of the cost of the Partitioned join is repartitioning the entire build side of the join, which currently copies all of the columns twice(!). This makes the hash join slower if this side is large/wide. We could avoid one copy in RepartitionExec (using the arrow-rs coalesce API when fully implemented), but not both.

CollectLeft avoids this cost at the right side, but the building phase is single-threaded, which greatly limits the parallelism in the query.

We should be able to do the hash % mod of the build side during the join, avoiding the need for a RepartitionExec passing the indices of the matching partitions to the left sides - which should greatly reduce the overhead of the repartitioning.

When this is implemented, we might want to look at hash_join_single_partition_threshold and hash_join_single_partition_threshold_rows again which could be reduced to make most joins run fully in parallel.
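
The "hash % mod during the join" idea above can be illustrated with a minimal sketch. It only computes, for each build-side row, which partition it belongs to and records the row index, so no column data is copied. The function name `partition_indices` is hypothetical, and `DefaultHasher` from the standard library is a stand-in assumption for whatever hashing DataFusion actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: for each row, compute `hash(key) % num_partitions`
/// and record the row index under that partition. Only indices are stored;
/// the key column itself is never copied.
fn partition_indices(keys: &[i64], num_partitions: usize) -> Vec<Vec<usize>> {
    let mut out = vec![Vec::new(); num_partitions];
    for (row, key) in keys.iter().enumerate() {
        // Stand-in hasher; a real engine would use its own hash function.
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let partition = (hasher.finish() % num_partitions as u64) as usize;
        out[partition].push(row);
    }
    out
}

fn main() {
    let keys = [7i64, 42, 7, -3, 42, 100];
    let parts = partition_indices(&keys, 4);
    for (p, rows) in parts.iter().enumerate() {
        println!("partition {p}: rows {rows:?}");
    }
}
```

Rows with equal keys always land in the same partition, which is the property a partitioned hash join relies on. Whether passing such index lists around is cheaper than a RepartitionExec in practice is exactly the open question raised in the comment.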

@gabotechs
Contributor Author

I am wondering what the speedup would be if we just load a single batch per partition instead of based on size - perhaps the speedup is similar with reduced overhead?

Tested it locally and it does not seem to have a significant impact. Even unbounded buffering has no significant impact. What I take from that is that the main driver for speed is the fact that something forces the probe-side RecordBatchStreams to make progress, whether or not that implies buffering actual record batches.
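
A minimal sketch of that observation, using only the standard library rather than DataFusion's async streams: a lazy source does nothing until it is pulled, while a background thread with a bounded channel starts draining it immediately. The names `probe_side` and `buffer` are hypothetical and are not the PR's `BufferExec` implementation; the bounded `sync_channel` is an assumed stand-in for a size-limited buffer.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{sync_channel, Receiver};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for a probe-side stream: it is lazy, so the counter
/// only increases when somebody actually pulls a batch.
fn probe_side(produced: Arc<AtomicUsize>) -> impl Iterator<Item = u64> + Send + 'static {
    (0..8u64).map(move |batch| {
        produced.fetch_add(1, Ordering::SeqCst);
        batch
    })
}

/// Hypothetical buffering wrapper: starts pulling from `input` on a background
/// thread right away, holding at most `capacity` batches in a bounded channel.
fn buffer<I>(input: I, capacity: usize) -> Receiver<u64>
where
    I: Iterator<Item = u64> + Send + 'static,
{
    let (tx, rx) = sync_channel(capacity);
    thread::spawn(move || {
        for batch in input {
            // Blocks while the buffer is full; stops if the consumer went away.
            if tx.send(batch).is_err() {
                break;
            }
        }
    });
    rx
}

fn main() {
    let produced = Arc::new(AtomicUsize::new(0));
    let lazy = probe_side(Arc::clone(&produced));
    // Creating the lazy source performed no work at all.
    assert_eq!(produced.load(Ordering::SeqCst), 0);

    let rx = buffer(lazy, 2);
    // Without receiving anything, wait until the producer has made progress.
    while produced.load(Ordering::SeqCst) == 0 {
        thread::yield_now();
    }
    let batches: Vec<u64> = rx.iter().collect();
    println!("received {} batches", batches.len());
}
```

The point of the sketch is the `while` loop: progress happens even though the consumer has not asked for a single batch yet, which is the effect described above as the main driver of the speedup.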

While I am excited about the speedup, I wonder if this is not mostly "hiding" some "issues" with our current join implementation: hashing and concatenation are currently single-threaded for CollectLeft, and both could be parallelized. Also, we could lower the overhead of the parallel hash join and reduce the threshold so that more of the query runs in parallel, avoiding time spent waiting on a step that does not parallelize well.

This can indeed reduce the impact of any issue affecting build side creation speed. I would not say "hiding" as the issues should still be visible, just that the overall query latency is no longer the best metric to discover those.

@gabotechs
Contributor Author

gabotechs commented Jan 13, 2026

When this is implemented, we might want to look at hash_join_single_partition_threshold and hash_join_single_partition_threshold_rows again which could be reduced to make most joins run fully in parallel.

I do expect buffering to have a positive impact even if all the optimizations you mentioned are shipped. Buffering has a much greater impact in real scenarios, where the IO component is much heavier because data might be stored in a bucket or in a remote resource like an API. I was actually surprised to see a non-negligible impact when running benchmarks against local files, like the ones reported in this PR.

Regardless of the order of events, this PR still needs work: it should not cause slowdowns in any of the current benchmarks.

@gabotechs
Contributor Author

gabotechs commented Jan 13, 2026

I've just dug a bit more into the cause of the slowdowns: buffering is rendering the dynamic filters useless. Disabling buffering if a dynamic filter attempts to be pushed down through a BufferExec solves the slowdowns, but it does not yield speedups that are as big.
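
The interaction can be sketched with a toy model, assuming a dynamic filter is just a pair of key bounds that the build side publishes once it finishes. The `DynamicFilter` alias and the `scan` function are hypothetical and only illustrate the timing problem; they are not DataFusion APIs.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a dynamic filter: empty at first (everything
/// passes), later populated with the min/max join key seen on the build side.
type DynamicFilter = Arc<Mutex<Option<(i64, i64)>>>;

/// Hypothetical scan that consults the filter at the moment it is polled.
fn scan(rows: &[i64], filter: &DynamicFilter) -> Vec<i64> {
    let bounds = *filter.lock().unwrap();
    rows.iter()
        .copied()
        .filter(|v| match bounds {
            Some((lo, hi)) => (lo..=hi).contains(v),
            None => true,
        })
        .collect()
}

fn main() {
    let probe_rows: Vec<i64> = (0..1000).collect();
    let filter: DynamicFilter = Arc::new(Mutex::new(None));

    // Eager buffering: the probe side is polled before the build side is done,
    // so the filter is still empty and every row is read.
    let eager = scan(&probe_rows, &filter);

    // The build side finishes and publishes its key bounds.
    *filter.lock().unwrap() = Some((10, 19));

    // Lazy execution: the probe side is polled only now, with a useful filter.
    let lazy = scan(&probe_rows, &filter);

    println!("eager read {} rows, lazy read {} rows", eager.len(), lazy.len());
}
```

In this toy model the eager scan reads every row while the lazy one reads only the rows inside the published bounds, which matches the trade-off described above: polling early gains concurrency but loses the pruning a populated filter would provide.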

Benchmark results for TPC-DS with `BufferExec` removed if a dynamic filter tried to cross it
┏━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃       main ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │   37.80 ms │                          39.24 ms │     no change │
│ QQuery 2  │  130.36 ms │                         133.30 ms │     no change │
│ QQuery 3  │   99.05 ms │                         101.77 ms │     no change │
│ QQuery 4  │  894.61 ms │                         936.15 ms │     no change │
│ QQuery 5  │  151.16 ms │                         154.67 ms │     no change │
│ QQuery 6  │  566.37 ms │                         571.07 ms │     no change │
│ QQuery 7  │  290.12 ms │                         300.14 ms │     no change │
│ QQuery 8  │   97.46 ms │                          97.43 ms │     no change │
│ QQuery 9  │   88.59 ms │                          94.02 ms │  1.06x slower │
│ QQuery 10 │   85.89 ms │                          54.36 ms │ +1.58x faster │
│ QQuery 11 │  567.85 ms │                         609.94 ms │  1.07x slower │
│ QQuery 12 │   35.66 ms │                          38.42 ms │  1.08x slower │
│ QQuery 13 │  313.89 ms │                         324.31 ms │     no change │
│ QQuery 14 │  741.51 ms │                         609.64 ms │ +1.22x faster │
│ QQuery 15 │   23.11 ms │                          26.78 ms │  1.16x slower │
│ QQuery 16 │   32.72 ms │                          29.25 ms │ +1.12x faster │
│ QQuery 17 │  220.05 ms │                         220.04 ms │     no change │
│ QQuery 18 │  114.36 ms │                         115.66 ms │     no change │
│ QQuery 19 │  133.50 ms │                         134.75 ms │     no change │
│ QQuery 20 │   12.37 ms │                          12.09 ms │     no change │
│ QQuery 21 │   15.53 ms │                          16.90 ms │  1.09x slower │
│ QQuery 22 │  288.69 ms │                         303.75 ms │  1.05x slower │
│ QQuery 23 │  772.46 ms │                         566.41 ms │ +1.36x faster │
│ QQuery 24 │  340.42 ms │                         353.78 ms │     no change │
│ QQuery 25 │  307.77 ms │                         303.73 ms │     no change │
│ QQuery 26 │   81.78 ms │                          83.28 ms │     no change │
│ QQuery 27 │  297.72 ms │                         287.01 ms │     no change │
│ QQuery 28 │  127.20 ms │                         128.65 ms │     no change │
│ QQuery 29 │  261.03 ms │                         264.57 ms │     no change │
│ QQuery 30 │   35.53 ms │                          35.40 ms │     no change │
│ QQuery 31 │  120.02 ms │                         124.84 ms │     no change │
│ QQuery 32 │   48.49 ms │                          50.59 ms │     no change │
│ QQuery 33 │  112.83 ms │                         116.26 ms │     no change │
│ QQuery 34 │   85.92 ms │                          88.66 ms │     no change │
│ QQuery 35 │   81.94 ms │                          57.09 ms │ +1.44x faster │
│ QQuery 36 │  165.56 ms │                         169.67 ms │     no change │
│ QQuery 37 │  153.98 ms │                         160.90 ms │     no change │
│ QQuery 38 │   60.75 ms │                          61.72 ms │     no change │
│ QQuery 39 │   81.49 ms │                          83.97 ms │     no change │
│ QQuery 40 │   87.94 ms │                          68.86 ms │ +1.28x faster │
│ QQuery 41 │   10.61 ms │                           9.65 ms │ +1.10x faster │
│ QQuery 42 │   89.63 ms │                          92.19 ms │     no change │
│ QQuery 43 │   69.61 ms │                          72.19 ms │     no change │
│ QQuery 44 │    9.08 ms │                           9.57 ms │  1.05x slower │
│ QQuery 45 │   53.17 ms │                          52.21 ms │     no change │
│ QQuery 46 │  175.44 ms │                         181.36 ms │     no change │
│ QQuery 47 │  478.10 ms │                         500.48 ms │     no change │
│ QQuery 48 │  224.20 ms │                         233.67 ms │     no change │
│ QQuery 49 │  206.10 ms │                         194.55 ms │ +1.06x faster │
│ QQuery 50 │  176.44 ms │                         181.54 ms │     no change │
│ QQuery 51 │  141.42 ms │                         112.74 ms │ +1.25x faster │
│ QQuery 52 │   90.66 ms │                          95.63 ms │  1.05x slower │
│ QQuery 53 │   89.56 ms │                          93.86 ms │     no change │
│ QQuery 54 │  123.43 ms │                         133.27 ms │  1.08x slower │
│ QQuery 55 │   88.73 ms │                          92.72 ms │     no change │
│ QQuery 56 │  114.66 ms │                         117.86 ms │     no change │
│ QQuery 57 │  131.64 ms │                         132.08 ms │     no change │
│ QQuery 58 │  228.01 ms │                         225.07 ms │     no change │
│ QQuery 59 │  169.17 ms │                         161.65 ms │     no change │
│ QQuery 60 │  118.92 ms │                         117.22 ms │     no change │
│ QQuery 61 │  149.06 ms │                         145.92 ms │     no change │
│ QQuery 62 │  441.11 ms │                         419.63 ms │     no change │
│ QQuery 63 │   95.44 ms │                          91.75 ms │     no change │
│ QQuery 64 │  606.32 ms │                         600.93 ms │     no change │
│ QQuery 65 │  208.68 ms │                         193.18 ms │ +1.08x faster │
│ QQuery 66 │  188.17 ms │                         185.86 ms │     no change │
│ QQuery 67 │  249.91 ms │                         241.50 ms │     no change │
│ QQuery 68 │  235.92 ms │                         224.96 ms │     no change │
│ QQuery 69 │   89.95 ms │                          54.48 ms │ +1.65x faster │
│ QQuery 70 │  278.67 ms │                         248.38 ms │ +1.12x faster │
│ QQuery 71 │  109.23 ms │                         112.11 ms │     no change │
│ QQuery 72 │  508.24 ms │                         492.46 ms │     no change │
│ QQuery 73 │   90.02 ms │                          84.98 ms │ +1.06x faster │
│ QQuery 74 │  373.75 ms │                         377.10 ms │     no change │
│ QQuery 75 │  227.43 ms │                         217.26 ms │     no change │
│ QQuery 76 │  116.42 ms │                         116.46 ms │     no change │
│ QQuery 77 │  170.31 ms │                         148.75 ms │ +1.14x faster │
│ QQuery 78 │  422.27 ms │                         215.66 ms │ +1.96x faster │
│ QQuery 79 │  190.47 ms │                         184.73 ms │     no change │
│ QQuery 80 │  265.88 ms │                         218.52 ms │ +1.22x faster │
│ QQuery 81 │   23.05 ms │                          22.41 ms │     no change │
│ QQuery 82 │  173.94 ms │                         171.55 ms │     no change │
│ QQuery 83 │   40.37 ms │                          36.72 ms │ +1.10x faster │
│ QQuery 84 │   40.52 ms │                          37.96 ms │ +1.07x faster │
│ QQuery 85 │  138.45 ms │                         136.80 ms │     no change │
│ QQuery 86 │   30.41 ms │                          32.06 ms │  1.05x slower │
│ QQuery 87 │   62.64 ms │                          62.08 ms │     no change │
│ QQuery 88 │   84.50 ms │                          83.80 ms │     no change │
│ QQuery 89 │  108.95 ms │                         103.50 ms │ +1.05x faster │
│ QQuery 90 │   19.19 ms │                          18.80 ms │     no change │
│ QQuery 91 │   53.45 ms │                          51.47 ms │     no change │
│ QQuery 92 │   49.13 ms │                          50.11 ms │     no change │
│ QQuery 93 │  151.86 ms │                         110.14 ms │ +1.38x faster │
│ QQuery 94 │   52.94 ms │                          49.08 ms │ +1.08x faster │
│ QQuery 95 │  125.23 ms │                          62.45 ms │ +2.01x faster │
│ QQuery 96 │   59.70 ms │                          61.53 ms │     no change │
│ QQuery 97 │   99.90 ms │                          78.41 ms │ +1.27x faster │
│ QQuery 98 │  129.60 ms │                         123.68 ms │     no change │
│ QQuery 99 │ 4562.37 ms │                        4393.38 ms │     no change │
└───────────┴────────────┴───────────────────────────────────┴───────────────┘

@gabotechs force-pushed the hash-join-buffering-on-probe-side branch from cdc6ad1 to 09c6b68 on January 13, 2026 16:55
Comment on lines +265 to +280
// If there is a dynamic filter being pushed down through this node, we don't want to buffer,
// we prefer to give a chance to the dynamic filter to be populated with something rather
// than eagerly polling data with an empty dynamic filter.
let mut has_dynamic_filter = false;
for parent_filter in &child_pushdown_result.parent_filters {
    if is_dynamic_physical_expr(&parent_filter.filter) {
        has_dynamic_filter = true;
    }
}
if has_dynamic_filter {
    let mut result = FilterPushdownPropagation::if_all(child_pushdown_result);
    result.updated_node = Some(Arc::clone(self.input()));
    Ok(result)
} else {
    Ok(FilterPushdownPropagation::if_all(child_pushdown_result))
}
Contributor Author

@adriangb am I doing this right? What I mainly want is: if at any point a dynamic filter is pushed down through this node, then I want this node to disappear.

Contributor

Yes this looks right to me! I'm not sure this was tested as part of the optimizer rule but "swap with my child" would be a good test to add (especially if this is broken).

Contributor Author

I was under the impression that I should be doing something like this instead:

        for parent_filter in &child_pushdown_result.parent_filters {
            if is_dynamic_physical_expr(&parent_filter.filter)
+               && matches!(parent_filter.all(), PushedDown::Yes)
            {
                has_dynamic_filter = true;
            }
        }

But parent_filter.all() is always PushedDown::No even if the dynamic filter is getting properly pushed down. Is that expected?

Contributor

@adriangb adriangb Jan 13, 2026

Yeah, that is normal if you have datafusion.execution.parquet.pushdown_filters = false (the default). It means the scan node may hold onto a reference and use it for e.g. stats scanning, but does not promise to apply it perfectly. I.e. as far as the caller is concerned, it was not pushed down. You could use DynamicFilterPhysicalExpr::is_used, which @LiaCastaneda introduced recently.

@Dandandan
Contributor

run benchmark tpcds tpch

@alamb-ghbot

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing hash-join-buffering-on-probe-side (09c6b68) to 617700d diff using: tpcds
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

Comparing HEAD and hash-join-buffering-on-probe-side
--------------------
Benchmark tpcds_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃        HEAD ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │    74.05 ms │                          73.38 ms │     no change │
│ QQuery 2  │   218.26 ms │                         209.73 ms │     no change │
│ QQuery 3  │   165.49 ms │                         163.67 ms │     no change │
│ QQuery 4  │  1861.71 ms │                        1874.24 ms │     no change │
│ QQuery 5  │   285.73 ms │                         300.28 ms │  1.05x slower │
│ QQuery 6  │  1472.93 ms │                        1448.11 ms │     no change │
│ QQuery 7  │   503.37 ms │                         496.29 ms │     no change │
│ QQuery 8  │   173.73 ms │                         174.29 ms │     no change │
│ QQuery 9  │   285.51 ms │                         307.66 ms │  1.08x slower │
│ QQuery 10 │   183.00 ms │                         117.73 ms │ +1.55x faster │
│ QQuery 11 │  1294.36 ms │                        1270.53 ms │     no change │
│ QQuery 12 │    69.20 ms │                          71.22 ms │     no change │
│ QQuery 13 │   542.70 ms │                         551.64 ms │     no change │
│ QQuery 14 │  1884.53 ms │                        1756.24 ms │ +1.07x faster │
│ QQuery 15 │    31.71 ms │                          29.97 ms │ +1.06x faster │
│ QQuery 16 │    64.42 ms │                          59.80 ms │ +1.08x faster │
│ QQuery 17 │   360.18 ms │                         357.71 ms │     no change │
│ QQuery 18 │   196.74 ms │                         195.24 ms │     no change │
│ QQuery 19 │   229.70 ms │                         229.59 ms │     no change │
│ QQuery 20 │    25.92 ms │                          27.74 ms │  1.07x slower │
│ QQuery 21 │    41.28 ms │                          39.05 ms │ +1.06x faster │
│ QQuery 22 │   743.61 ms │                         741.41 ms │     no change │
│ QQuery 23 │  1767.23 ms │                        1657.41 ms │ +1.07x faster │
│ QQuery 24 │   642.03 ms │                         656.26 ms │     no change │
│ QQuery 25 │   522.37 ms │                         508.21 ms │     no change │
│ QQuery 26 │   128.66 ms │                         128.20 ms │     no change │
│ QQuery 27 │   482.54 ms │                         500.63 ms │     no change │
│ QQuery 28 │   298.95 ms │                         309.29 ms │     no change │
│ QQuery 29 │   445.84 ms │                         443.27 ms │     no change │
│ QQuery 30 │    77.16 ms │                          73.76 ms │     no change │
│ QQuery 31 │   324.66 ms │                         312.51 ms │     no change │
│ QQuery 32 │    84.41 ms │                          86.78 ms │     no change │
│ QQuery 33 │   208.46 ms │                         211.76 ms │     no change │
│ QQuery 34 │   163.67 ms │                         162.72 ms │     no change │
│ QQuery 35 │   181.54 ms │                         119.53 ms │ +1.52x faster │
│ QQuery 36 │   291.03 ms │                         286.07 ms │     no change │
│ QQuery 37 │   256.05 ms │                         258.80 ms │     no change │
│ QQuery 38 │   158.65 ms │                         139.16 ms │ +1.14x faster │
│ QQuery 39 │   210.25 ms │                         206.04 ms │     no change │
│ QQuery 40 │   171.06 ms │                         142.19 ms │ +1.20x faster │
│ QQuery 41 │    23.33 ms │                          21.66 ms │ +1.08x faster │
│ QQuery 42 │   146.92 ms │                         146.55 ms │     no change │
│ QQuery 43 │   127.85 ms │                         129.69 ms │     no change │
│ QQuery 44 │    29.17 ms │                          28.56 ms │     no change │
│ QQuery 45 │    84.91 ms │                          85.12 ms │     no change │
│ QQuery 46 │   324.91 ms │                         323.20 ms │     no change │
│ QQuery 47 │  1031.07 ms │                        1025.94 ms │     no change │
│ QQuery 48 │   404.18 ms │                         420.39 ms │     no change │
│ QQuery 49 │   374.50 ms │                         374.27 ms │     no change │
│ QQuery 50 │   340.58 ms │                         332.54 ms │     no change │
│ QQuery 51 │   306.04 ms │                         253.80 ms │ +1.21x faster │
│ QQuery 52 │   148.39 ms │                         149.45 ms │     no change │
│ QQuery 53 │   151.67 ms │                         154.70 ms │     no change │
│ QQuery 54 │   217.07 ms │                         220.03 ms │     no change │
│ QQuery 55 │   146.19 ms │                         146.63 ms │     no change │
│ QQuery 56 │   208.12 ms │                         209.59 ms │     no change │
│ QQuery 57 │   303.31 ms │                         300.59 ms │     no change │
│ QQuery 58 │   476.31 ms │                         468.90 ms │     no change │
│ QQuery 59 │   294.14 ms │                         294.35 ms │     no change │
│ QQuery 60 │   215.81 ms │                         214.81 ms │     no change │
│ QQuery 61 │   249.83 ms │                         253.01 ms │     no change │
│ QQuery 62 │  1259.37 ms │                        1295.54 ms │     no change │
│ QQuery 63 │   151.50 ms │                         155.02 ms │     no change │
│ QQuery 64 │  1155.61 ms │                        1160.08 ms │     no change │
│ QQuery 65 │   354.55 ms │                         359.56 ms │     no change │
│ QQuery 66 │   409.87 ms │                         394.80 ms │     no change │
│ QQuery 67 │   537.39 ms │                         548.74 ms │     no change │
│ QQuery 68 │   372.75 ms │                         377.37 ms │     no change │
│ QQuery 69 │   173.30 ms │                         110.46 ms │ +1.57x faster │
│ QQuery 70 │   499.70 ms │                         466.82 ms │ +1.07x faster │
│ QQuery 71 │   188.82 ms │                         186.35 ms │     no change │
│ QQuery 72 │  2075.95 ms │                        2071.90 ms │     no change │
│ QQuery 73 │   157.96 ms │                         158.78 ms │     no change │
│ QQuery 74 │   825.15 ms │                         806.82 ms │     no change │
│ QQuery 75 │   417.20 ms │                         397.24 ms │     no change │
│ QQuery 76 │   184.59 ms │                         189.72 ms │     no change │
│ QQuery 77 │   291.67 ms │                         268.15 ms │ +1.09x faster │
│ QQuery 78 │   936.40 ms │                         633.80 ms │ +1.48x faster │
│ QQuery 79 │   328.77 ms │                         328.65 ms │     no change │
│ QQuery 80 │   507.27 ms │                         452.20 ms │ +1.12x faster │
│ QQuery 81 │    52.59 ms │                          53.51 ms │     no change │
│ QQuery 82 │   283.58 ms │                         290.41 ms │     no change │
│ QQuery 83 │    80.37 ms │                          70.04 ms │ +1.15x faster │
│ QQuery 84 │    68.35 ms │                          71.32 ms │     no change │
│ QQuery 85 │   223.05 ms │                         230.03 ms │     no change │
│ QQuery 86 │    59.15 ms │                          60.09 ms │     no change │
│ QQuery 87 │   156.30 ms │                         137.52 ms │ +1.14x faster │
│ QQuery 88 │   274.62 ms │                         270.70 ms │     no change │
│ QQuery 89 │   172.28 ms │                         174.50 ms │     no change │
│ QQuery 90 │    47.63 ms │                          45.64 ms │     no change │
│ QQuery 91 │    97.30 ms │                          97.02 ms │     no change │
│ QQuery 92 │    84.23 ms │                          85.08 ms │     no change │
│ QQuery 93 │   263.45 ms │                         207.53 ms │ +1.27x faster │
│ QQuery 94 │    92.90 ms │                          89.52 ms │     no change │
│ QQuery 95 │   244.80 ms │                         178.52 ms │ +1.37x faster │
│ QQuery 96 │   115.93 ms │                         117.26 ms │     no change │
│ QQuery 97 │   188.56 ms │                         154.61 ms │ +1.22x faster │
│ QQuery 98 │   218.24 ms │                         215.56 ms │     no change │
│ QQuery 99 │ 14155.30 ms │                       14127.07 ms │     no change │
└───────────┴─────────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                                │ 50433.41ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 49289.83ms │
│ Average Time (HEAD)                              │   509.43ms │
│ Average Time (hash-join-buffering-on-probe-side) │   497.88ms │
│ Queries Faster                                   │         21 │
│ Queries Slower                                   │          3 │
│ Queries with No Change                           │         75 │
│ Queries with Failure                             │          0 │
└──────────────────────────────────────────────────┴────────────┘

@alamb-ghbot

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing hash-join-buffering-on-probe-side (09c6b68) to 617700d diff using: tpch
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

Comparing HEAD and hash-join-buffering-on-probe-side
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃      HEAD ┃ hash-join-buffering-on-probe-side ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1  │ 180.57 ms │                         176.46 ms │     no change │
│ QQuery 2  │  89.12 ms │                          84.08 ms │ +1.06x faster │
│ QQuery 3  │ 128.55 ms │                         124.28 ms │     no change │
│ QQuery 4  │  81.28 ms │                          70.77 ms │ +1.15x faster │
│ QQuery 5  │ 176.57 ms │                         178.40 ms │     no change │
│ QQuery 6  │  66.91 ms │                          66.10 ms │     no change │
│ QQuery 7  │ 207.12 ms │                         206.37 ms │     no change │
│ QQuery 8  │ 165.42 ms │                         170.96 ms │     no change │
│ QQuery 9  │ 228.04 ms │                         222.14 ms │     no change │
│ QQuery 10 │ 185.13 ms │                         180.64 ms │     no change │
│ QQuery 11 │  63.15 ms │                          59.49 ms │ +1.06x faster │
│ QQuery 12 │ 116.15 ms │                         114.64 ms │     no change │
│ QQuery 13 │ 210.81 ms │                         196.82 ms │ +1.07x faster │
│ QQuery 14 │  94.86 ms │                          96.01 ms │     no change │
│ QQuery 15 │ 128.09 ms │                         127.08 ms │     no change │
│ QQuery 16 │  60.71 ms │                          53.27 ms │ +1.14x faster │
│ QQuery 17 │ 260.69 ms │                         259.69 ms │     no change │
│ QQuery 18 │ 311.87 ms │                         275.89 ms │ +1.13x faster │
│ QQuery 19 │ 135.83 ms │                         138.88 ms │     no change │
│ QQuery 20 │ 130.95 ms │                         119.96 ms │ +1.09x faster │
│ QQuery 21 │ 263.09 ms │                         244.16 ms │ +1.08x faster │
│ QQuery 22 │  40.63 ms │                          34.67 ms │ +1.17x faster │
└───────────┴───────────┴───────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                                ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                                │ 3325.53ms │
│ Total Time (hash-join-buffering-on-probe-side)   │ 3200.77ms │
│ Average Time (HEAD)                              │  151.16ms │
│ Average Time (hash-join-buffering-on-probe-side) │  145.49ms │
│ Queries Faster                                   │         9 │
│ Queries Slower                                   │         0 │
│ Queries with No Change                           │        13 │
│ Queries with Failure                             │         0 │
└──────────────────────────────────────────────────┴───────────┘

Labels

common, core, datasource, documentation, execution, optimizer, physical-plan, proto, sqllogictest

4 participants