Feature/persistent table registration #22
v0 of the Cascades-style optimizer.
- Exhaustive optimization: expression and group returns only when the subgraph is optimized.
- Applying enforcer rules and adding generated expressions to the memo table.
- Special termination logic is required when the child has the same group + physical requirement as the parent.
- Exhaustive exploration when applying rules, generate all bindings before doing the transform, but only expand based on specified rule patterns.
Signed-off-by: Yuchen Liang <yuchenl3@andrew.cmu.edu>
## Problem
Some methods are added to the IR as experimental features. We also got feedback from the dev meeting that the rule seems hard to read (or long). We would like to clean up these rough edges.
## Summary of changes
- Eliminate `try_bind_ref_xxx` and use `try_borrow`.
- Add `borrow_raw_parts` so we always refer to `$node_name` instead of `$ref_name`.
- Plumb through property methods to use shorthand.
**_TODO:_** Pattern builder can also be generated by macros.
Signed-off-by: Yuchen Liang <yuchenl3@andrew.cmu.edu>
Pull request overview
This PR implements a comprehensive persistent table registration system with multi-schema support, transforming OptD's catalog from memory-only to a production-ready persistent metadata store backed by DuckLake. The implementation adds:
- Persistent external table storage with snapshot-based versioning and time-travel queries
- Multi-schema support with schema creation/deletion and qualified table names
- Service layer for async catalog operations with thread-safe access
- DataFusion integration for lazy-loading tables from the persistent catalog
- CLI enhancements with DDL interception and automatic statistics extraction
Key Changes
- Removed unused `optd/storage` module
- Implemented complete `Catalog` trait with DuckLakeCatalog backend using DuckDB
- Added async `CatalogService` with mpsc-based request handling
- Integrated catalog providers into DataFusion for transparent table loading
- Enhanced CLI with CREATE/DROP interception and auto-statistics support
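
The mpsc-based request handling listed above can be pictured with a minimal sketch. This is not the PR's `CatalogService`: it uses blocking `std::sync::mpsc` and a worker thread instead of an async runtime, an in-memory map stands in for the DuckDB-backed catalog, and `Request`, `Handle`, and `spawn_service` are hypothetical names chosen for illustration.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Requests understood by the service; each query carries its own reply channel.
enum Request {
    Register { name: String, location: String, reply: Sender<Result<(), String>> },
    Get { name: String, reply: Sender<Option<String>> },
    Shutdown,
}

/// Cloneable handle; every clone sends into the same request queue.
#[derive(Clone)]
struct Handle {
    tx: Sender<Request>,
}

impl Handle {
    /// Registers a table, failing if the name is taken or the service is gone.
    fn register(&self, name: &str, location: &str) -> Result<(), String> {
        let (reply, rx) = mpsc::channel();
        self.tx
            .send(Request::Register { name: name.into(), location: location.into(), reply })
            .map_err(|e| e.to_string())?;
        rx.recv().map_err(|e| e.to_string())?
    }

    /// Looks up a table location; `None` if unknown or the service is gone.
    fn get(&self, name: &str) -> Option<String> {
        let (reply, rx) = mpsc::channel();
        self.tx.send(Request::Get { name: name.into(), reply }).ok()?;
        rx.recv().ok()?
    }

    /// Asks the worker to stop after the requests already queued.
    fn shutdown(&self) {
        let _ = self.tx.send(Request::Shutdown);
    }
}

/// Spawns the single worker that owns the (stand-in) catalog state.
fn spawn_service() -> (Handle, JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Request>();
    let worker = thread::spawn(move || {
        let mut tables: HashMap<String, String> = HashMap::new();
        for req in rx {
            match req {
                Request::Register { name, location, reply } => {
                    let result = if tables.contains_key(&name) {
                        Err(format!("table '{name}' already exists"))
                    } else {
                        tables.insert(name, location);
                        Ok(())
                    };
                    let _ = reply.send(result);
                }
                Request::Get { name, reply } => {
                    let _ = reply.send(tables.get(&name).cloned());
                }
                Request::Shutdown => break,
            }
        }
    });
    (Handle { tx }, worker)
}
```

Because only the worker touches the state, callers need no locks: thread safety comes from funnelling every operation through one queue.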
Reviewed changes
Copilot reviewed 33 out of 35 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| optd/storage/* | Removed unused storage module scaffolding |
| optd/catalog/src/lib.rs | Core catalog implementation with 2000+ lines of DuckDB-backed persistence |
| optd/catalog/src/service.rs | Async service layer with mpsc channels for thread-safe catalog access |
| optd/catalog/tests/*.rs | Comprehensive test suite (91 tests) covering all catalog features |
| connectors/datafusion/src/catalog.rs | DataFusion integration for lazy table loading from catalog |
| connectors/datafusion/src/table.rs | OptdTableProvider wrapper for future statistics integration |
| cli/tests/*.rs | CLI integration tests for DDL operations and statistics |
| Cargo.toml, .gitignore | Dependency and configuration updates |
The implementation is well-structured with comprehensive test coverage and follows good architectural patterns. The code is production-ready.
Persistent External Table Registration with Multi-Schema Support
Implement catalog-driven persistence for external tables, enabling cross-session table definitions, automatic statistics extraction, time-travel queries, and multi-schema support. This feature transforms OptD's catalog from a memory-only registry to a production-ready persistent metadata store backed by DuckLake. This pull request overlaps with and builds on top of #16 in many aspects.
Architecture
Optd Catalog (`optd/catalog/`)
Enhanced catalog service with persistent external table management and multi-schema support. Below is the code architecture:
- `Catalog` trait: Core trait defining catalog operations, including:
  - `register_external_table`, `get_external_table`, `list_external_tables`, `drop_external_table`;
  - `get_external_table_at_snapshot`, `list_external_tables_at_snapshot`, `list_snapshots`;
  - `create_schema`, `list_schemas`, `drop_schema`;
  - `set_table_statistics`, `table_statistics`, `update_table_column_stats`.
- `DuckLakeCatalog`: Implements the `Catalog` trait with persistent metadata storage in DuckDB:
  - `optd_external_table`: Table registration metadata with soft-delete support (begin/end snapshots);
  - `optd_external_table_options`: Key-value options storage (compression, format-specific settings).
- `CatalogService` & `CatalogServiceHandle`: Thread-safe async service layer with `mpsc`.
DataFusion Connector (`connectors/datafusion/src/`)
Integration layer enabling lazy-loading of persistent tables into DataFusion:
- `OptdCatalogProviderList`: Wraps DataFusion's catalog list, propagates `CatalogServiceHandle` to enable catalog-backed table discovery;
- `OptdCatalogProvider`: Wraps DataFusion catalog provider with schema discovery;
- `OptdSchemaProvider`: Schema provider with lazy table loading:
  - … `TableProvider` from metadata (CSV/Parquet/JSON);
  - … (`schema.table`).
- `OptdTableProvider`: Wrapper for tables to empower future statistics integration.
- `EmptySchemaProvider`: Minimal schema provider for catalog-only schemas.
CLI (`cli/src/`)
Enhanced CLI with CREATE/DROP interception and automatic statistics for testing and validation:
- `OptdCliSessionContext`: Custom SessionContext that intercepts DDL operations: … (`schema.table`) for multi-schema support.
- Time-travel UDTFs (`udtf.rs`):
  - `list_snapshots()`: Shows all catalog snapshots with timestamps;
  - `table_at_snapshot(snapshot_id)`: Lists tables visible at specific snapshot.
- Eager loading (`main.rs`): `populate_external_tables()` function (for testing and demo).
- Auto-statistics (`auto_stats.rs`): Automatic statistics extraction (for testing and demo): … (`OPTD_AUTO_STATS_*`).
Database Schema
New tables in DuckLake metadata store:
Key Features
1. Persistent External Tables
Tables registered via `CREATE EXTERNAL TABLE` survive CLI restarts:
2. Multi-Schema Support
Create and use multiple schemas for table organization:
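
As a rough illustration of how qualified names (`schema.table`) might be resolved, here is a minimal sketch. It is an assumption, not the PR's code: `TableRef`, `parse_table_ref`, and the `public` default schema are hypothetical, and quoted identifiers are deliberately not handled.

```rust
/// Hypothetical default schema used when a name has no `schema.` prefix.
const DEFAULT_SCHEMA: &str = "public";

/// A table reference split into its schema and table parts.
#[derive(Debug, PartialEq, Eq)]
struct TableRef {
    schema: String,
    table: String,
}

/// Parses `table` or `schema.table`; rejects empty parts and extra dots.
fn parse_table_ref(name: &str) -> Result<TableRef, String> {
    let parts: Vec<&str> = name.split('.').collect();
    let (schema, table) = match parts.as_slice() {
        // Unqualified name: fall back to the default schema.
        [table] => (DEFAULT_SCHEMA, *table),
        // Qualified name: use the schema as written.
        [schema, table] => (*schema, *table),
        _ => return Err(format!("unsupported table name: '{name}'")),
    };
    if schema.is_empty() || table.is_empty() {
        return Err(format!("empty identifier in '{name}'"));
    }
    Ok(TableRef { schema: schema.to_string(), table: table.to_string() })
}
```

Under these assumptions, `parse_table_ref("sales.orders")` yields schema `sales` and table `orders`, while a bare `orders` resolves against the default schema.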
3. Time-Travel Queries
Query historical table states:
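
The soft-delete bookkeeping (begin/end snapshots) behind time travel can be sketched as follows. This is a minimal in-memory model under stated assumptions, not the DuckDB-backed implementation: `TableRow`, `drop_table`, and `tables_at` are hypothetical names, and the end snapshot is assumed to be exclusive (a table dropped at snapshot 3 is last visible at snapshot 2).

```rust
/// One registration row; `end_snapshot` is `None` while the table is live.
#[derive(Debug, Clone)]
struct TableRow {
    name: String,
    begin_snapshot: u64,
    end_snapshot: Option<u64>,
}

impl TableRow {
    /// Visible if created at or before `snapshot` and not yet dropped by then.
    fn visible_at(&self, snapshot: u64) -> bool {
        self.begin_snapshot <= snapshot
            && self.end_snapshot.map_or(true, |end| snapshot < end)
    }
}

/// Soft delete: mark the live row as ended instead of removing it.
/// Returns `false` if no live row with that name exists.
fn drop_table(rows: &mut [TableRow], name: &str, snapshot: u64) -> bool {
    match rows
        .iter_mut()
        .find(|r| r.name == name && r.end_snapshot.is_none())
    {
        Some(row) => {
            row.end_snapshot = Some(snapshot);
            true
        }
        None => false,
    }
}

/// Names of all tables visible at the given snapshot, in insertion order.
fn tables_at(rows: &[TableRow], snapshot: u64) -> Vec<&str> {
    rows.iter()
        .filter(|r| r.visible_at(snapshot))
        .map(|r| r.name.as_str())
        .collect()
}
```

Because dropped rows are kept, a query at an older snapshot still sees the table, which is what makes historical listings possible without copying the catalog per snapshot.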
Testing
Optd Catalog (`optd/catalog/tests/`)
- `external_tables_tests.rs` (16 tests): Registration, retrieval, listing, soft-delete, persistence across connections;
- `schema_tests.rs` (17 tests): Schema creation/listing/deletion, multi-schema isolation, time-travel, qualified names;
- `statistics_tests.rs` (27 tests): Statistics CRUD, versioning, snapshot isolation, concurrent updates, edge cases;
- `service_tests.rs` (18 tests): Service lifecycle, concurrent operations, handle cloning, shutdown;
- `catalog_error_tests.rs` (13 tests): Error handling, invalid operations, concurrent modifications.
DataFusion Connector (`connectors/datafusion/tests/`)
- `integration_test.rs` (16 tests): Catalog provider wrapping, schema retrieval, multi-schema isolation, service integration, snapshot/schema retrieval through connector;
- `table_loading_test.rs` (8 tests): Lazy loading, caching, format support (Parquet/CSV/JSON), error handling for missing tables/unsupported formats.
CLI (`cli/tests/`)
- `auto_stats_tests.rs` (6 tests): Automatic statistics extraction from Parquet/CSV file metadata;
- `catalog_service_integration.rs` (2 tests): Catalog service lifecycle and handle management;
- `comprehensive_table_tests.rs` (8 tests): Multi-format table operations (Parquet, CSV, JSON);
- `drop_table_tests.rs` (2 tests): DROP TABLE persistence and error handling;
- `error_handling_tests.rs` (10 tests): Invalid CREATE/DROP operations, file not found, schema errors;
- `udtf_tests.rs` (10 tests): User-defined table functions (`list_snapshots`, `list_tables_at_snapshot`), UDTF edge cases (empty snapshots, multiple formats, DROP table reflection, semicolon handling);
- `cross_session_tests.rs` (3 tests): Table persistence across CLI sessions, concurrent access;
- `eager_loading_tests.rs` (2 tests): Startup table population vs. lazy loading behavior;
- `multi_schema_tests.rs` (7 tests): Multiple tables/schemas, JOINs, DROP TABLE, schema isolation validation;
- `statistics_retrieval_tests.rs` (2 tests): Statistics availability after CREATE, snapshot time-travel validation.
Environment Variables