-
Notifications
You must be signed in to change notification settings - Fork 526
feat(rust): add datafusion catalog_provider through namespace #5686
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
6239d73 to
fc9064b
Compare
fc9064b to
3e79cb1
Compare
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
9a7f365 to
958e528
Compare
|
Hi @wjones127 @jackye1995 @yanghua @westonpace @Xuanwo , This PR is driven by #4779. In our production scenario, we’d like to leverage DataFusion to deliver a better SQL experience that supports search, analytics, and AI-related operators (e.g., via UDFs/UDTFs). I’ve already discussed this with @jackye1995 offline, and we agreed to try this with a dedicated crate for the DataFusion namespace integration. This PR introduces the foundational approach. I’ll follow up with additional issues covering next steps—such as Python bindings, URL-based tables, SQL extensions, etc. Please take a look when you have time! |
wjones127
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is cool. I have a few suggestions.
| /// Convert a Lance error into a DataFusion error. | ||
| /// | ||
| /// This keeps all Lance-specific error formatting in a single place. | ||
| pub fn to_datafusion_error<E: std::fmt::Display>(err: E) -> DataFusionError { | ||
| DataFusionError::Execution(err.to_string()) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should preserve the original error using DataFusionError::External instead of converting to a string. This lets the caller turn the DataFusion error back into the original Lance error.
You can see how this is already partially done in #5606
|
|
||
| /// A dynamic [`CatalogProviderList`] that maps Lance namespaces to catalogs. | ||
| /// | ||
| /// The underlying namespace must be a four-level namespace. It is explicitly configured |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
must be a four-level namespace
Why is this necessary? I wonder if we can work around that.
For example, it's a three level namespace, it could be:
DEFAULT > LVL1 > LVL2 > LVL3
And then if it's a two level namespace, it could be
DEFAULT > DEFAULT > LVL1 > LVL2
There might be some other standard name besides default that would make more sense (maybe what other DataFusion plugins do), but you get the idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the docs might be somewhat misleading(have updated it). Let me clarify:
-
First, DataFusion only has three-level metadata. The CatalogProviderList is optional. As mentioned, we use
SessionBuilder::with_rootto configure a four-level namespace (let’s call this the root). In this case, all three-level child namespaces under the root are automatically registered asLanceCatalogProviders. This four-level namespace acts like a “catalog of catalogs” and is purely for convenience—instead of callingadd_catalog()for each catalog individually. -
We can always use S
essionBuilder::add_catalogto manually register a catalog provider, regardless of whether the four-level namespace is configured. -
I think what you’re really concerned about might be whether we can write queries against two- or three-level tables (e.g.,
SELECT * FROM db.tb or SELECT * FROM tb) when the four-level namespace is in use. The answer is yes. To do this, we need to configure the default catalog and schema names. If we don’t set them explicitly, they default to "datafusion" (for catalog) and "public" (for schema). That meansdb.tbwould be interpreted asdatafusion.db.tb, andtbas datafusion.public.tb. -
I agree this is an important point. I’ve added methods to configure the default catalog and schema, and included examples in the tests.
Co-authored-by: Will Jones <willjones127@gmail.com>
89032bb to
c9a6d3c
Compare
|
Ready for another review |
This is the first step of #4779
Close #5712