fix: avoid dataset enumeration on GCS/S3 for query execution #89

beinan · 2026-01-19T01:46:05Z

Summary

Skip expensive list_datasets() enumeration when specific table names are provided to load_tables()
Extract referenced tables from Cypher query via node_labels() and relationship_types()
Compute dataset paths directly from root path instead of enumerating all datasets
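
As a rough illustration of the approach described above (not the code from this PR), the sketch below collects table names from a parsed query and builds one path per table by string joining, so no listing call is needed. `ParsedQuery`, `referenced_tables`, `dataset_paths`, the `<root>/<table>.lance` layout, and the bucket name are all assumptions made for the example; only the method names `node_labels()` and `relationship_types()` come from the description.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedQuery:
    """Hypothetical stand-in for a parsed Cypher query (not the real class)."""

    labels: tuple[str, ...]
    rel_types: tuple[str, ...]

    def node_labels(self) -> list[str]:
        return list(self.labels)

    def relationship_types(self) -> list[str]:
        return list(self.rel_types)


def referenced_tables(query: ParsedQuery) -> list[str]:
    """Collect the table names a query touches, de-duplicated, order preserved."""
    seen: dict[str, None] = {}
    for name in [*query.node_labels(), *query.relationship_types()]:
        seen.setdefault(name, None)
    return list(seen)


def dataset_paths(root: str, tables: list[str], suffix: str = ".lance") -> dict[str, str]:
    """Build one path per table by joining strings; nothing is listed remotely.

    The "<root>/<table><suffix>" layout is an assumption for this sketch.
    """
    base = root.rstrip("/")
    return {name: f"{base}/{name}{suffix}" for name in tables}


# Usage: a query touching two node labels and one relationship type
# resolves to three paths, regardless of how many datasets sit under the root.
query = ParsedQuery(labels=("Person", "Company"), rel_types=("WORKS_AT",))
paths = dataset_paths("gs://example-bucket/graph/", referenced_tables(query))
```

The point of the design is that the cost scales with the number of tables the query references rather than with the number of datasets stored under the root.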

Performance Impact

For queries on GCS/S3 with 20+ datasets:

Before: Enumerate all datasets (slow on object storage), then load all 20+ of them (~10s)
After: Compute paths only for the referenced tables (2-3) and load only those (~1s)

Test plan

All 72 existing tests pass
Manual testing on GCS with 20+ datasets to confirm latency improvement

Fixes #87

🤖 Generated with Claude Code

Previously, every query would enumerate all datasets on cloud storage and load all of them, causing ~10s latency with 20+ datasets on GCS. Now the query parser extracts which tables are actually referenced (via node_labels() and relationship_types()), and only those specific datasets are loaded. Paths are computed directly from the root path without enumeration.

Fixes lance-format#87

Co-Authored-By: Claude <noreply@anthropic.com>

ChunxuTang approved these changes Jan 19, 2026


beinan merged commit 2ba1faa into lance-format:main Jan 19, 2026
2 checks passed
