@beinan beinan commented Jan 19, 2026

Summary

  • Skip expensive list_datasets() enumeration when specific table names are provided to load_tables()
  • Extract referenced tables from Cypher query via node_labels() and relationship_types()
  • Compute dataset paths directly from root path instead of enumerating all datasets
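The two steps above (find the referenced tables, then build their paths) can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the regexes are a crude, hypothetical stand-in for the parser's node_labels() and relationship_types(), and the `<root>/<name>.lance` layout and the `gs://bucket/graph/` root are assumptions made for the example.

```python
import re

# Hypothetical stand-ins for the parser's node_labels() / relationship_types().
# A real Cypher parser handles far more syntax than these two patterns.
NODE_LABEL = re.compile(r"\(\s*\w*\s*:\s*(\w+)")  # (p:Person) -> Person
REL_TYPE = re.compile(r"\[\s*\w*\s*:\s*(\w+)")    # [r:KNOWS]  -> KNOWS


def referenced_tables(query: str) -> list[str]:
    """Return the distinct node labels and relationship types in a query."""
    names = NODE_LABEL.findall(query) + REL_TYPE.findall(query)
    return sorted(set(names))


def dataset_paths(root: str, tables: list[str]) -> dict[str, str]:
    """Build one path per table from the root, with no listing call.

    Assumes each table lives at <root>/<name>.lance (an assumed layout).
    """
    base = root.rstrip("/")
    return {name: f"{base}/{name}.lance" for name in tables}


query = "MATCH (p:Person)-[r:KNOWS]->(f:Person) RETURN p.name, f.name"
tables = referenced_tables(query)
paths = dataset_paths("gs://bucket/graph/", tables)
print(tables)
print(paths)
```

Because the paths are derived purely from strings, no request goes to cloud storage until the referenced datasets are actually opened.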

Performance Impact

For queries on GCS/S3 with 20+ datasets:

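  • Before: every query enumerates all datasets under the root (a slow listing call on cloud storage), then loads all 20+ datasets (~10s latency)
  • After: paths are computed for only the referenced tables (typically 2-3), and only those datasets are loaded (~1s)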
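To make the before/after difference concrete, here is a toy model that counts remote operations instead of measuring time. `FakeStore`, `load_all`, and `load_referenced` are hypothetical names invented for this sketch; they only mimic the shape of the old and new loading paths and say nothing about real GCS/S3 latencies.

```python
class FakeStore:
    """In-memory stand-in for a cloud bucket that counts remote calls."""

    def __init__(self, names):
        self.names = list(names)
        self.list_calls = 0
        self.open_calls = 0

    def list_datasets(self):
        # Models the expensive enumeration of every dataset under the root.
        self.list_calls += 1
        return list(self.names)

    def open(self, name):
        # Models loading a single dataset.
        self.open_calls += 1
        return f"<dataset {name}>"


def load_all(store):
    """Old behaviour: enumerate everything, then open everything."""
    return {name: store.open(name) for name in store.list_datasets()}


def load_referenced(store, tables):
    """New behaviour: open only the tables the query refers to."""
    return {name: store.open(name) for name in tables}


before = FakeStore(f"table_{i}" for i in range(20))
load_all(before)

after = FakeStore(f"table_{i}" for i in range(20))
load_referenced(after, ["table_3", "table_7"])

print(before.list_calls, before.open_calls)  # 1 20
print(after.list_calls, after.open_calls)    # 0 2
```

In this model the work on the new path scales with the number of tables a query references, not with the number of datasets stored under the root.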

Test plan

  • All 72 existing tests pass
  • Manual testing on GCS with 20+ datasets to confirm latency improvement

Fixes #87

🤖 Generated with Claude Code

Previously, every query would enumerate all datasets on cloud storage
and load all of them, causing ~10s latency with 20+ datasets on GCS.

Now the query parser extracts which tables are actually referenced
(via node_labels() and relationship_types()), and only those specific
datasets are loaded. Paths are computed directly from the root path
without enumeration.

Fixes lance-format#87

Co-Authored-By: Claude <noreply@anthropic.com>
@beinan beinan merged commit 2ba1faa into lance-format:main Jan 19, 2026
2 checks passed

Linked issue: Performance issue when there is 20+ datasets on GCS