Conversation

@majin1102
Contributor

@majin1102 majin1102 commented Jan 11, 2026

This is the first step of #4779

Closes #5712

@majin1102 majin1102 marked this pull request as draft January 11, 2026 14:51
@github-actions github-actions bot added the enhancement New feature or request label Jan 11, 2026
@majin1102 majin1102 force-pushed the namespace-datafusion-rust branch from 6239d73 to fc9064b Compare January 11, 2026 15:27
@majin1102 majin1102 force-pushed the namespace-datafusion-rust branch from fc9064b to 3e79cb1 Compare January 11, 2026 15:28
@codecov

codecov bot commented Jan 11, 2026

@majin1102 majin1102 force-pushed the namespace-datafusion-rust branch from 9a7f365 to 958e528 Compare January 12, 2026 04:01
@majin1102
Contributor Author

Hi @wjones127 @jackye1995 @yanghua @westonpace @Xuanwo,

This PR is driven by #4779. In our production scenario, we’d like to leverage DataFusion to deliver a better SQL experience that supports search, analytics, and AI-related operators (e.g., via UDFs/UDTFs).

I’ve already discussed this with @jackye1995 offline, and we agreed to try this with a dedicated crate for the DataFusion namespace integration. This PR introduces the foundational approach.

I’ll follow up with additional issues covering next steps—such as Python bindings, URL-based tables, SQL extensions, etc.

Please take a look when you have time!

@majin1102 majin1102 marked this pull request as ready for review January 13, 2026 07:51
Contributor

@wjones127 wjones127 left a comment

This is cool. I have a few suggestions.

Comment on lines 6 to 11
/// Convert a Lance error into a DataFusion error.
///
/// This keeps all Lance-specific error formatting in a single place.
pub fn to_datafusion_error<E: std::fmt::Display>(err: E) -> DataFusionError {
    DataFusionError::Execution(err.to_string())
}
Contributor

I think we should preserve the original error using DataFusionError::External instead of converting to a string. This lets the caller turn the DataFusion error back into the original Lance error.

You can see how this is already partially done in #5606
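To make the difference concrete, here is a minimal, std-only sketch of the two conversions. `LanceError` and `DataFusionError` below are local stand-ins written for illustration (the real types have many more variants), and `to_datafusion_error_lossy` and `as_lance_error` are hypothetical helper names, not part of this PR or of #5606. The only thing assumed about DataFusion is that its `External` variant carries a boxed `dyn Error + Send + Sync`.

```rust
use std::error::Error;
use std::fmt;

// Stand-in for a Lance error type (hypothetical, for illustration only).
#[derive(Debug, PartialEq)]
enum LanceError {
    NotFound(String),
}

impl fmt::Display for LanceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LanceError::NotFound(name) => write!(f, "not found: {}", name),
        }
    }
}

impl Error for LanceError {}

// Stand-in mirroring two variants of DataFusion's error enum.
#[derive(Debug)]
enum DataFusionError {
    Execution(String),
    External(Box<dyn Error + Send + Sync>),
}

// The approach in the diff: the error is flattened to a string,
// so the original type is gone.
fn to_datafusion_error_lossy<E: fmt::Display>(err: E) -> DataFusionError {
    DataFusionError::Execution(err.to_string())
}

// The suggested approach: box the original error so it can be recovered.
fn to_datafusion_error<E>(err: E) -> DataFusionError
where
    E: Error + Send + Sync + 'static,
{
    DataFusionError::External(Box::new(err))
}

// Try to get the original Lance error back out of a DataFusion error.
fn as_lance_error(err: &DataFusionError) -> Option<&LanceError> {
    match err {
        DataFusionError::External(inner) => inner.downcast_ref::<LanceError>(),
        DataFusionError::Execution(_) => None,
    }
}
```

With the string conversion, `as_lance_error` always returns `None`; with the boxed conversion, the caller can match on the original `LanceError` variant again.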


/// A dynamic [`CatalogProviderList`] that maps Lance namespaces to catalogs.
///
/// The underlying namespace must be a four-level namespace. It is explicitly configured
Contributor

must be a four-level namespace

Why is this necessary? I wonder if we can work around that.

For example, if it's a three-level namespace, it could be:

DEFAULT > LVL1 > LVL2 > LVL3

And then if it's a two-level namespace, it could be:

DEFAULT > DEFAULT > LVL1 > LVL2

There might be some other standard name besides default that would make more sense (maybe what other DataFusion plugins do), but you get the idea.
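The padding idea above could be sketched roughly as follows. This is only an illustration of the suggestion, not code from the PR: `pad_namespace` is a hypothetical helper, and `"DEFAULT"` is the placeholder name used in the comment rather than an agreed convention.

```rust
/// Left-pad a namespace path with `placeholder` so it always has `depth` levels.
/// Returns `None` if the path is already deeper than `depth`.
/// (Hypothetical helper, for illustration only.)
fn pad_namespace(path: &[&str], depth: usize, placeholder: &str) -> Option<Vec<String>> {
    if path.len() > depth {
        return None;
    }
    let missing = depth - path.len();
    // Start with the placeholder levels, then append the real levels.
    let mut padded: Vec<String> = std::iter::repeat(placeholder.to_string())
        .take(missing)
        .collect();
    padded.extend(path.iter().map(|level| level.to_string()));
    Some(padded)
}
```

For instance, `pad_namespace(&["LVL1", "LVL2"], 4, "DEFAULT")` gives the path `DEFAULT > DEFAULT > LVL1 > LVL2`.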

Contributor Author

@majin1102 majin1102 Jan 17, 2026

I think the docs might have been somewhat misleading (I have updated them). Let me clarify:

  1. First, DataFusion only has three-level metadata. The CatalogProviderList is optional. As mentioned, we use SessionBuilder::with_root to configure a four-level namespace (let’s call this the root). In this case, all three-level child namespaces under the root are automatically registered as LanceCatalogProviders. This four-level namespace acts like a “catalog of catalogs” and is purely for convenience—instead of calling add_catalog() for each catalog individually.

  2. We can always use SessionBuilder::add_catalog to manually register a catalog provider, regardless of whether the four-level namespace is configured.

  3. I think what you’re really concerned about might be whether we can write queries against two- or three-level tables (e.g., SELECT * FROM db.tb or SELECT * FROM tb) when the four-level namespace is in use. The answer is yes. To do this, we need to configure the default catalog and schema names. If we don’t set them explicitly, they default to "datafusion" (for catalog) and "public" (for schema). That means db.tb would be interpreted as datafusion.db.tb, and tb as datafusion.public.tb.

  4. I agree this is an important point. I’ve added methods to configure the default catalog and schema, and included examples in the tests.
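The resolution rule in point 3 can be sketched as below. This is a simplified, std-only illustration rather than DataFusion's actual resolver: `ResolvedTable` and `resolve_table` are hypothetical names, and quoted identifiers are ignored.

```rust
/// A fully qualified table name (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
struct ResolvedTable {
    catalog: String,
    schema: String,
    table: String,
}

/// Resolve `tb`, `db.tb`, or `cat.db.tb` against the configured defaults.
/// Returns `None` for empty parts or more than three parts.
fn resolve_table(
    reference: &str,
    default_catalog: &str,
    default_schema: &str,
) -> Option<ResolvedTable> {
    let parts: Vec<&str> = reference.split('.').collect();
    if parts.iter().any(|part| part.is_empty()) {
        return None;
    }
    // Missing leading levels are filled in from the defaults.
    let (catalog, schema, table) = match parts.as_slice() {
        [table] => (default_catalog, default_schema, *table),
        [schema, table] => (default_catalog, *schema, *table),
        [catalog, schema, table] => (*catalog, *schema, *table),
        _ => return None,
    };
    Some(ResolvedTable {
        catalog: catalog.to_string(),
        schema: schema.to_string(),
        table: table.to_string(),
    })
}
```

With the defaults `"datafusion"` and `"public"`, `db.tb` resolves to `datafusion.db.tb` and `tb` to `datafusion.public.tb`. Configuring a different default catalog and schema changes only the two default arguments.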

@wjones127 wjones127 self-assigned this Jan 14, 2026
majin1102 and others added 2 commits January 15, 2026 11:51
Co-authored-by: Will Jones <willjones127@gmail.com>
@majin1102 majin1102 force-pushed the namespace-datafusion-rust branch from 89032bb to c9a6d3c Compare January 17, 2026 12:37
@majin1102
Contributor Author

Ready for another review

Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

Expose Lance tables in DataFusion via namespace

2 participants