@rayhhome
Collaborator

Persistent External Table Registration with Multi-Schema Support

Implement catalog-driven persistence for external tables, enabling cross-session table definitions, automatic statistics extraction, time-travel queries, and multi-schema support. This feature transforms OptD's catalog from a memory-only registry into a production-ready persistent metadata store backed by DuckLake. This pull request overlaps with, and builds on, #16.

Architecture

Optd Catalog (optd/catalog/)

Enhanced catalog service with persistent external table management and multi-schema support. Below is the code architecture:

  • Catalog trait: Core trait defining catalog operations including:

    • External table lifecycle (register_external_table, get_external_table, list_external_tables, drop_external_table);
    • Time-travel queries (get_external_table_at_snapshot, list_external_tables_at_snapshot, list_snapshots);
    • Schema management (create_schema, list_schemas, drop_schema);
    • Statistics operations (set_table_statistics, table_statistics, update_table_column_stats).
  • DuckLakeCatalog: Implements the Catalog trait with persistent metadata storage in DuckDB:

    • optd_external_table: Table registration metadata with soft-delete support (begin/end snapshots);
    • optd_external_table_options: Key-value options storage (compression, format-specific settings);
    • Snapshot-based versioning: All operations create new snapshots for time-travel and audit trails;
    • Multi-schema support: Schema-qualified table names with "main" as default schema.
  • CatalogService & CatalogServiceHandle: Thread-safe async service layer built on an mpsc channel:

    • Service runs in background Tokio task with mpsc channel;
    • Handle is cloneable for multi-threaded access;
    • All catalog operations exposed through async methods;
    • Graceful shutdown with cleanup.
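The service/handle split described above can be sketched as follows. This is a minimal illustration, not the PR's code: it uses `std::sync::mpsc` and an OS thread instead of Tokio so that it stays dependency-free, and the `Request` variants, `Handle`, and `spawn_service` are hypothetical names. A plain `HashMap` stands in for the DuckDB-backed catalog.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Requests a handle can send to the background service (hypothetical set).
enum Request {
    Register { name: String, location: String, reply: Sender<()> },
    Get { name: String, reply: Sender<Option<String>> },
    Shutdown,
}

/// Cloneable handle; every clone sends to the same background worker.
#[derive(Clone)]
struct Handle {
    tx: Sender<Request>,
}

impl Handle {
    /// Registers a table and blocks until the service acknowledges it.
    fn register(&self, name: &str, location: &str) {
        let (reply, rx) = mpsc::channel();
        let req = Request::Register { name: name.into(), location: location.into(), reply };
        self.tx.send(req).expect("service stopped");
        rx.recv().expect("service dropped the reply");
    }

    /// Looks up a table's location, or `None` if it is not registered.
    fn get(&self, name: &str) -> Option<String> {
        let (reply, rx) = mpsc::channel();
        self.tx.send(Request::Get { name: name.into(), reply }).expect("service stopped");
        rx.recv().expect("service dropped the reply")
    }

    /// Asks the worker to stop; errors are ignored if it already exited.
    fn shutdown(&self) {
        let _ = self.tx.send(Request::Shutdown);
    }
}

/// Spawns the worker that owns all catalog state and serves requests in order.
fn spawn_service() -> (Handle, JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Request>();
    let worker = thread::spawn(move || {
        // Stand-in for the persistent catalog: table name -> location.
        let mut tables: HashMap<String, String> = HashMap::new();
        for req in rx {
            match req {
                Request::Register { name, location, reply } => {
                    tables.insert(name, location);
                    let _ = reply.send(());
                }
                Request::Get { name, reply } => {
                    let _ = reply.send(tables.get(&name).cloned());
                }
                Request::Shutdown => break,
            }
        }
    });
    (Handle { tx }, worker)
}
```

Because only the worker touches the state, callers need no locks; each request carries its own reply channel, which is one common way to get request/response semantics over an mpsc queue.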

DataFusion Connector (connectors/datafusion/src/)

Integration layer enabling lazy-loading of persistent tables into DataFusion:

  • OptdCatalogProviderList: Wraps DataFusion's catalog list, propagates CatalogServiceHandle to enable catalog-backed table discovery;

  • OptdCatalogProvider: Wraps DataFusion catalog provider with schema discovery:

    • Merges schemas from DataFusion's in-memory catalog and OptD persistent catalog;
    • Maps DataFusion "public" schema to OptD "main" schema;
    • Creates schema providers on-demand for persistent schemas.
  • OptdSchemaProvider: Schema provider with lazy table loading:

    • Queries catalog service when table not found in DataFusion's memory;
    • Reconstructs TableProvider from metadata (CSV/Parquet/JSON);
    • Caches loaded tables in DataFusion's registry;
    • Supports schema-qualified table names (schema.table).
  • OptdTableProvider: Wrapper around table providers that enables future statistics integration.

  • EmptySchemaProvider: Minimal schema provider for catalog-only schemas.
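The lazy-loading path described above (check memory first, fall back to the persistent catalog, then cache) can be sketched without DataFusion. This is an illustration under assumptions: `TableMeta`, `LazySchema`, and `catalog_schema_name` are hypothetical names, and a closure stands in for the catalog service call.

```rust
use std::collections::HashMap;

/// Metadata the persistent catalog is assumed to return (hypothetical shape).
#[derive(Clone, Debug, PartialEq)]
struct TableMeta {
    location: String,
    file_format: String,
}

/// Maps a DataFusion schema name to the catalog schema name ("public" -> "main").
fn catalog_schema_name(datafusion_schema: &str) -> &str {
    if datafusion_schema == "public" { "main" } else { datafusion_schema }
}

/// Schema provider sketch: in-memory cache first, persistent catalog on a miss.
struct LazySchema<F: Fn(&str, &str) -> Option<TableMeta>> {
    schema: String,
    cache: HashMap<String, TableMeta>,
    catalog_lookup: F,
    /// Counts catalog round trips so the caching effect is observable.
    catalog_hits: usize,
}

impl<F: Fn(&str, &str) -> Option<TableMeta>> LazySchema<F> {
    fn new(schema: &str, catalog_lookup: F) -> Self {
        LazySchema {
            schema: schema.to_string(),
            cache: HashMap::new(),
            catalog_lookup,
            catalog_hits: 0,
        }
    }

    /// Returns the table's metadata, loading it from the catalog at most once.
    fn table(&mut self, name: &str) -> Option<TableMeta> {
        if let Some(meta) = self.cache.get(name) {
            return Some(meta.clone());
        }
        // Cache miss: ask the catalog, using the mapped schema name.
        let meta = (self.catalog_lookup)(catalog_schema_name(&self.schema), name)?;
        self.catalog_hits += 1;
        self.cache.insert(name.to_string(), meta.clone());
        Some(meta)
    }
}
```

In this sketch a missing table is not cached, so a table registered later by another session would still be found on the next lookup; whether the connector behaves that way is not stated in the description.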

CLI (cli/src/)

Enhanced CLI with CREATE/DROP interception and automatic statistics for testing and validation:

  • OptdCliSessionContext: Custom SessionContext that intercepts DDL operations:

    • CREATE EXTERNAL TABLE: Registers table in both DataFusion memory and persistent catalog;
    • DROP TABLE: Soft-deletes from catalog (end_snapshot set, data preserved);
    • Parses schema-qualified names (schema.table) for multi-schema support.
  • Time-travel UDTFs (udtf.rs):

    • list_snapshots(): Shows all catalog snapshots with timestamps;
    • table_at_snapshot(snapshot_id): Lists tables visible at specific snapshot;
    • Registered automatically at CLI startup.
  • Eager loading (main.rs): populate_external_tables() function (for testing and demo):

    • Loads all catalog tables into DataFusion on startup;
    • Enables SHOW TABLES to display persistent tables immediately;
    • Reconstructs TableProvider from metadata without re-reading files.
  • Auto-statistics (auto_stats.rs): Automatic statistics extraction (for testing and demo):

    • Parquet: Fast metadata-based extraction (row count, column stats, min/max, null count);
    • CSV/JSON: Configurable sampling (disabled by default due to cost);
    • Environment variables for fine-grained control (OPTD_AUTO_STATS_*);
    • Stores statistics in catalog immediately after table creation.
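The schema-qualified name handling mentioned in the DDL interception bullets could look roughly like the sketch below. `parse_table_name` is a hypothetical helper; it assumes unquoted identifiers with at most one dot and falls back to the "main" schema, and it does not attempt quoted identifiers or catalog prefixes.

```rust
/// Splits `table` or `schema.table` into `(schema, table)`.
/// Assumption: unquoted identifiers only; an unqualified name maps to "main".
fn parse_table_name(name: &str) -> Result<(String, String), String> {
    let parts: Vec<&str> = name.split('.').collect();
    match parts.as_slice() {
        // Bare table name: use the default schema.
        [table] if !table.is_empty() => Ok(("main".to_string(), table.to_string())),
        // Exactly one dot with non-empty halves: schema-qualified name.
        [schema, table] if !schema.is_empty() && !table.is_empty() => {
            Ok((schema.to_string(), table.to_string()))
        }
        // Empty parts or more than one dot are rejected in this sketch.
        _ => Err(format!("invalid table name: {name}")),
    }
}
```

Returning the schema explicitly, even for unqualified names, keeps every later catalog call schema-aware without special-casing the default.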

Database Schema

New tables in DuckLake metadata store:

-- External table registry
CREATE TABLE optd_external_table (
    table_id BIGINT PRIMARY KEY,
    schema_id BIGINT NOT NULL,
    table_name VARCHAR NOT NULL,
    location VARCHAR NOT NULL,
    file_format VARCHAR NOT NULL,
    compression VARCHAR,
    begin_snapshot BIGINT NOT NULL,
    end_snapshot BIGINT,  -- NULL = active; non-NULL = soft-deleted at that snapshot
    created_at TIMESTAMP DEFAULT NOW()
);

-- Table options (key-value pairs)
CREATE TABLE optd_external_table_options (
    table_id BIGINT NOT NULL,
    option_key VARCHAR NOT NULL,
    option_value VARCHAR NOT NULL,
    PRIMARY KEY (table_id, option_key)
);

-- Indexes for performance (optional)
CREATE INDEX idx_optd_external_table_schema 
    ON optd_external_table(schema_id, table_name, end_snapshot);
CREATE INDEX idx_optd_external_table_snapshot
    ON optd_external_table(begin_snapshot, end_snapshot);
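The `begin_snapshot`/`end_snapshot` pair is what makes soft-delete and time travel possible: a row is never removed, only closed. The sketch below shows one plausible visibility rule. It assumes a half-open interval (visible from `begin_snapshot` inclusive up to `end_snapshot` exclusive); the description does not state the exact boundary semantics, and `TableRow` and `tables_at_snapshot` are hypothetical names.

```rust
/// One row of `optd_external_table`, reduced to the versioning columns.
struct TableRow {
    table_name: &'static str,
    begin_snapshot: i64,
    /// `None` mirrors SQL NULL: the table is still active.
    end_snapshot: Option<i64>,
}

impl TableRow {
    /// Assumed rule: visible from `begin_snapshot` (inclusive)
    /// until `end_snapshot` (exclusive), or forever if never dropped.
    fn visible_at(&self, snapshot: i64) -> bool {
        self.begin_snapshot <= snapshot
            && self.end_snapshot.map_or(true, |end| snapshot < end)
    }
}

/// Names of all tables visible at the given snapshot, in row order.
fn tables_at_snapshot(rows: &[TableRow], snapshot: i64) -> Vec<&'static str> {
    rows.iter()
        .filter(|row| row.visible_at(snapshot))
        .map(|row| row.table_name)
        .collect()
}
```

In SQL terms the same filter would be `begin_snapshot <= ? AND (end_snapshot IS NULL OR ? < end_snapshot)`, which is the kind of predicate the second index above would serve.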

Key Features

1. Persistent External Tables

Tables registered via CREATE EXTERNAL TABLE survive CLI restarts:

-- Session 1
CREATE EXTERNAL TABLE users STORED AS PARQUET LOCATION 'users.parquet';

-- Session 2 (new CLI instance)
SELECT * FROM users;  -- Works! Lazy-loaded from catalog

2. Multi-Schema Support

Create and use multiple schemas for table organization:

CREATE SCHEMA analytics;
CREATE EXTERNAL TABLE analytics.events STORED AS PARQUET LOCATION 'events.parquet';
SELECT * FROM analytics.events;

-- Join across schemas
SELECT * FROM users JOIN analytics.events ON users.id = events.user_id;

3. Time-Travel Queries

Query historical table states:

-- List all snapshots
SELECT * FROM list_snapshots();

-- See tables at specific snapshot
SELECT * FROM table_at_snapshot(42);

-- Recover dropped table (via metadata, Phase 12)

Testing

Optd Catalog (optd/catalog/tests/)

  • external_tables_tests.rs (16 tests): Registration, retrieval, listing, soft-delete, persistence across connections;
  • schema_tests.rs (17 tests): Schema creation/listing/deletion, multi-schema isolation, time-travel, qualified names;
  • statistics_tests.rs (27 tests): Statistics CRUD, versioning, snapshot isolation, concurrent updates, edge cases;
  • service_tests.rs (18 tests): Service lifecycle, concurrent operations, handle cloning, shutdown;
  • catalog_error_tests.rs (13 tests): Error handling, invalid operations, concurrent modifications.

DataFusion Connector (connectors/datafusion/tests/)

  • integration_test.rs (16 tests): Catalog provider wrapping, schema retrieval, multi-schema isolation, service integration, snapshot/schema retrieval through connector;
  • table_loading_test.rs (8 tests): Lazy loading, caching, format support (Parquet/CSV/JSON), error handling for missing tables/unsupported formats.

CLI (cli/tests/)

  • auto_stats_tests.rs (6 tests): Automatic statistics extraction from Parquet/CSV file metadata;
  • catalog_service_integration.rs (2 tests): Catalog service lifecycle and handle management;
  • comprehensive_table_tests.rs (8 tests): Multi-format table operations (Parquet, CSV, JSON);
  • drop_table_tests.rs (2 tests): DROP TABLE persistence and error handling;
  • error_handling_tests.rs (10 tests): Invalid CREATE/DROP operations, file not found, schema errors;
  • udtf_tests.rs (10 tests): User-defined table functions (list_snapshots, list_tables_at_snapshot), UDTF edge cases (empty snapshots, multiple formats, DROP table reflection, semicolon handling);
  • cross_session_tests.rs (3 tests): Table persistence across CLI sessions, concurrent access;
  • eager_loading_tests.rs (2 tests): Startup table population vs. lazy loading behavior;
  • multi_schema_tests.rs (7 tests): Multiple tables/schemas, JOINs, DROP TABLE, schema isolation validation;
  • statistics_retrieval_tests.rs (2 tests): Statistics availability after CREATE, snapshot time-travel validation;
  • Unit tests (3 tests): Core CLI functionality.

Environment Variables

# Auto-statistics configuration
export OPTD_AUTO_STATS=true                # Enable auto-stats (default: true)
export OPTD_AUTO_STATS_PARQUET=true        # Parquet metadata (default: true)
export OPTD_AUTO_STATS_CSV=false           # CSV sampling (default: false)
export OPTD_AUTO_STATS_JSON=false          # JSON sampling (default: false)
export OPTD_AUTO_STATS_SAMPLE_SIZE=10000   # Sample size (default: 10000)
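A minimal sketch of how these variables might be read into a config struct, using the defaults listed above. `AutoStatsConfig`, `parse_bool`, `config_from`, and `config_from_env` are hypothetical names, and falling back to the default on unset or unrecognized values is an assumption rather than documented behavior.

```rust
use std::env;

/// Auto-statistics settings with the defaults listed above.
#[derive(Debug, PartialEq)]
struct AutoStatsConfig {
    enabled: bool,
    parquet: bool,
    csv: bool,
    json: bool,
    sample_size: usize,
}

/// Parses "true"/"false" (case-insensitive); anything else yields the default.
fn parse_bool(value: Option<String>, default: bool) -> bool {
    match value.as_deref().map(str::trim) {
        Some(v) if v.eq_ignore_ascii_case("true") => true,
        Some(v) if v.eq_ignore_ascii_case("false") => false,
        _ => default,
    }
}

/// Builds the config from any key -> value lookup, which keeps it testable.
fn config_from<F: Fn(&str) -> Option<String>>(lookup: F) -> AutoStatsConfig {
    AutoStatsConfig {
        enabled: parse_bool(lookup("OPTD_AUTO_STATS"), true),
        parquet: parse_bool(lookup("OPTD_AUTO_STATS_PARQUET"), true),
        csv: parse_bool(lookup("OPTD_AUTO_STATS_CSV"), false),
        json: parse_bool(lookup("OPTD_AUTO_STATS_JSON"), false),
        sample_size: lookup("OPTD_AUTO_STATS_SAMPLE_SIZE")
            .and_then(|v| v.trim().parse().ok())
            .unwrap_or(10_000),
    }
}

/// Reads the real process environment.
fn config_from_env() -> AutoStatsConfig {
    config_from(|key| env::var(key).ok())
}
```

Taking the lookup as a closure means tests can supply values directly instead of mutating the process environment, which is shared across threads.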

rayhhome and others added 30 commits August 5, 2025 18:33
v0 of the Cascades-style optimizer.

- Exhaustive optimization: expression and group returns only when the
subgraph is optimized.
- Applying enforcer rules and adding generated expressions to the memo
table.
- Special termination logic is required when the child has the same
group + physical requirement as the parent.
- Exhaustive exploration when applying rules, generate all bindings
before doing the transform, but only expand based on specified rule
patterns.

Signed-off-by: Yuchen Liang <yuchenl3@andrew.cmu.edu>
## Problem

Some methods were added to the IR as experimental features. We also got
feedback from the dev meeting that the rule seems hard to read (or
long). We would like to clean up these rough edges.

## Summary of changes

- Eliminate `try_bind_ref_xxx` and use `try_borrow`.
- Add `borrow_raw_parts` so we always refer to `$node_name` instead of
`$ref_name`.
- Plumb through property methods to use shorthand.

**_TODO:_** Pattern builder can also be generated by macros.

---------

Signed-off-by: Yuchen Liang <yuchenl3@andrew.cmu.edu>
Copilot AI review requested due to automatic review settings December 17, 2025 10:39
Copilot AI left a comment

Pull request overview

This PR implements a comprehensive persistent table registration system with multi-schema support, transforming OptD's catalog from memory-only to a production-ready persistent metadata store backed by DuckLake. The implementation adds:

  • Persistent external table storage with snapshot-based versioning and time-travel queries
  • Multi-schema support with schema creation/deletion and qualified table names
  • Service layer for async catalog operations with thread-safe access
  • DataFusion integration for lazy-loading tables from the persistent catalog
  • CLI enhancements with DDL interception and automatic statistics extraction

Key Changes

  • Removed unused optd/storage module
  • Implemented complete Catalog trait with DuckLakeCatalog backend using DuckDB
  • Added async CatalogService with mpsc-based request handling
  • Integrated catalog providers into DataFusion for transparent table loading
  • Enhanced CLI with CREATE/DROP interception and auto-statistics support

Reviewed changes

Copilot reviewed 33 out of 35 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| optd/storage/* | Removed unused storage module scaffolding |
| optd/catalog/src/lib.rs | Core catalog implementation with 2000+ lines of DuckDB-backed persistence |
| optd/catalog/src/service.rs | Async service layer with mpsc channels for thread-safe catalog access |
| optd/catalog/tests/*.rs | Comprehensive test suite (91 tests) covering all catalog features |
| connectors/datafusion/src/catalog.rs | DataFusion integration for lazy table loading from catalog |
| connectors/datafusion/src/table.rs | OptdTableProvider wrapper for future statistics integration |
| cli/tests/*.rs | CLI integration tests for DDL operations and statistics |
| Cargo.toml, .gitignore | Dependency and configuration updates |

The implementation is well-structured with comprehensive test coverage and follows good architectural patterns. The code is production-ready.


@rayhhome rayhhome requested a review from yliang412 December 17, 2025 18:33