-
Notifications
You must be signed in to change notification settings - Fork 702
Fixed rows count in pg-deeplake. #3117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes the row count reporting in pg-deeplake by correcting deeplake_relation_size to return the actual physical storage size (1 block) instead of incorrectly converting total dataset bytes to blocks. This ensures PostgreSQL's ANALYZE command properly updates pg_class.reltuples statistics, which customer tooling relies on.
Changes:
- Modified
deeplake_relation_sizeto return exactlyBLCKSZ(1 block) matching the physical storage created inrelation_set_new_node - Updated references from
num_rows()tonum_total_rows()throughout the codebase to ensure accurate row counts including uncommitted/staged data - Enhanced
deeplake_scan_analyze_next_blockto properly consume blocks from the read stream, ensuring PostgreSQL's BlockSampler correctly tracks sampled blocks
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| postgres/tests/py_tests/test_statistics_integration.py | Added comprehensive tests validating pg_class.reltuples updates correctly after ANALYZE for both regular and small tables |
| cpp/deeplake_pg/timing_guard.hpp | New header file defining a timing guard utility class for performance instrumentation |
| cpp/deeplake_pg/timing_guard.cpp | Implementation of timing guard utility for logging operation durations |
| cpp/deeplake_pg/table_scan_impl.hpp | Updated to use num_total_rows() and removed empty loop |
| cpp/deeplake_pg/table_scan.hpp | Added get_num_rows() getter method |
| cpp/deeplake_pg/table_data_impl.hpp | Enhanced row count tracking with comments explaining refresh behavior and delete handling |
| cpp/deeplake_pg/table_am.cpp | Core fix: corrected deeplake_relation_size, updated ANALYZE block handling, and added physical storage creation |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| get_session_start_time(); | ||
|
|
||
| std::ostringstream filename; | ||
| filename << "/home/ubuntu/deeplake_timing_" << std::this_thread::get_id() << ".log"; |
Copilot
AI
Jan 19, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hardcoded path /home/ubuntu/ is not portable. Consider using an environment variable, a configurable path, or a system temp directory to ensure the code works across different environments and users.
|
|
||
| uint64_t deeplake_relation_size(Relation rel, ForkNumber) | ||
| uint64_t deeplake_relation_size(Relation rel, ForkNumber fork) | ||
| { |
Copilot
AI
Jan 19, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parameter fork is declared but never used in the function body. If it's required by the function signature/interface, consider adding a comment explaining why it's unused or use a cast to void to indicate intentional non-use.
| { | |
| { | |
| /* ForkNumber is part of the table AM API but is not used by Deeplake. */ | |
| (void)fork; |



🚀 🚀 Pull Request
Impact
Description
Things to be aware of
Things to worry about
Additional Context