feat: introduce MemWAL writer #5709

touch-of-grey · 2026-01-14T00:29:24Z

Based on the draft shared by @jackye1995, cleaned up and published for review. I also added a custom fix: RegionWriter now uses Arc<EpochGuard> instead of EpochGuard to avoid unnecessarily incrementing the epoch.

Some of the code belongs to the reader but is hard to separate out of this PR, so I marked it as dead code for now.
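
For readers following the Arc<EpochGuard> point, here is a minimal sketch of the idea. It assumes a hypothetical guard whose constructor claims a new epoch by incrementing a shared counter; `EpochGuard::claim`, the counter, and this `RegionWriter` shape are illustrative stand-ins, not the types in this PR.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical guard: constructing one bumps the shared epoch counter.
pub struct EpochGuard {
    epoch: u64,
}

impl EpochGuard {
    /// Claims a new epoch by incrementing the counter (illustrative only).
    pub fn claim(counter: &AtomicU64) -> Self {
        let epoch = counter.fetch_add(1, Ordering::SeqCst) + 1;
        Self { epoch }
    }

    pub fn epoch(&self) -> u64 {
        self.epoch
    }
}

/// Hypothetical writer that shares an existing guard instead of claiming its own.
pub struct RegionWriter {
    guard: Arc<EpochGuard>,
}

impl RegionWriter {
    /// Takes a shared handle; cloning the Arc does not touch the epoch counter.
    pub fn new(guard: Arc<EpochGuard>) -> Self {
        Self { guard }
    }

    pub fn epoch(&self) -> u64 {
        self.guard.epoch()
    }
}
```

Handing each writer an `Arc::clone` of one guard keeps the counter at a single claimed epoch, whereas constructing a fresh guard per writer would claim a new epoch each time.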

jackye1995 · 2026-01-14T00:30:30Z

Thanks for the fast turnaround! I will take a look tonight. Meanwhile, I think this code path deserves a benchmark; can you add one?

codecov · 2026-01-14T01:08:30Z

Codecov Report

❌ Patch coverage is 65.07744% with 1150 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance/src/dataset/mem_wal/write/indexes.rs	52.23%	369 Missing and 15 partials ⚠️
rust/lance/src/dataset/mem_wal/write/flush.rs	63.44%	168 Missing and 21 partials ⚠️
rust/lance/src/dataset/mem_wal/write/memtable.rs	65.36%	111 Missing and 13 partials ⚠️
rust/lance/src/dataset/mem_wal/write/writer.rs	74.80%	85 Missing and 12 partials ⚠️
rust/lance/src/dataset/mem_wal/api.rs	0.00%	73 Missing ⚠️
rust/lance/src/dataset/mem_wal/manifest.rs	75.09%	54 Missing and 10 partials ⚠️
rust/lance/src/dataset/mem_wal/dispatcher.rs	61.11%	59 Missing and 4 partials ⚠️
rust/lance/src/dataset/mem_wal/config.rs	0.00%	46 Missing ⚠️
rust/lance/src/dataset/mem_wal/write/wal.rs	83.70%	30 Missing and 14 partials ⚠️
.../lance/src/dataset/mem_wal/write/fragment_store.rs	80.00%	26 Missing and 7 partials ⚠️
... and 2 more

Implements a Memory Write-Ahead Log (MemWAL) system for Lance with:
- LSM-tree based write path with WAL, MemTable, and SST layers
- Support for maintaining BTree, IVF-PQ, and FTS indexes during writes
- Configurable durable/nondurable and sync/async index modes
- Benchmark showing 19+ Melem/s throughput with batch size 100

Co-Authored-By: Jack Ye <yezhaoqin@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

jackye1995 · 2026-01-16T05:18:56Z

rust/lance-io/src/object_store.rs

+    /// (e.g., another writer already wrote the same WAL entry).
+    ///
+    /// Returns `Err` with `AlreadyExists` if the destination file exists.
+    pub async fn rename_if_not_exists(&self, from: &Path, to: &Path) -> Result<()> {


Why do we need rename_if_not_exists and copy_if_not_exists? We can implement the rename with the existing object_store APIs; see how commit.rs does the atomic manifest commit for local storage.
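
As a rough illustration of "create only if absent" on a local filesystem, here is a sketch that uses only `std::fs`. It is not the commit.rs or object_store code; `publish_if_absent` and the staging-file layout are assumptions made for this example.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::Path;

/// Publishes `bytes` at `dest` only if `dest` does not exist yet.
/// Hypothetical helper for illustration; not part of Lance or object_store.
pub fn publish_if_absent(staging: &Path, dest: &Path, bytes: &[u8]) -> io::Result<()> {
    // Write the full payload to a staging file first, so readers never
    // observe a partially written destination.
    let mut file = fs::File::create(staging)?;
    file.write_all(bytes)?;
    file.sync_all()?;
    drop(file);

    // Creating a hard link fails with `AlreadyExists` when `dest` is taken,
    // which gives the "if not exists" check and the publish in one step.
    let linked = fs::hard_link(staging, dest);

    // Best-effort cleanup of the staging file, whether or not the link succeeded.
    let _ = fs::remove_file(staging);
    linked
}
```

If two writers race on the same destination, one call succeeds and the other gets `io::ErrorKind::AlreadyExists`, which the caller can map to a commit conflict.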

jackye1995 · 2026-01-16T05:21:10Z

rust/lance/src/dataset/mem_wal/api.rs

+        if pk_fields.is_empty() {
+            return Err(Error::invalid_input(
+                "MemWAL requires a primary key on the dataset. \
+                 Define a primary key using the 'lance-schema:unenforced-primary-key' field metadata.",


nit: Arrow field metadata

jackye1995 · 2026-01-16T05:22:34Z

rust/lance/src/dataset/mem_wal/api.rs

+#[derive(Debug, Clone, Default)]
+pub struct MemWalConfig {
+    /// Region specification for partitioning writes.
+    pub region_specs: Vec<RegionSpec>,


this should just take a single RegionSpec to begin with, and it should be optional (you can create a MemWAL index without region spec).

We can in the future add APIs like add_region_spec (we can add a TODO for those, don't need to add now)
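
A possible shape for this suggestion is sketched below. The `RegionSpec` here is a stand-in with a made-up field, since its real definition is not shown in this thread.

```rust
/// Hypothetical stand-in for the real RegionSpec; the field is illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct RegionSpec {
    pub partition_columns: Vec<String>,
}

#[derive(Debug, Clone, Default)]
pub struct MemWalConfig {
    /// Optional region specification for partitioning writes.
    /// `None` means the MemWAL index is created without a region spec.
    pub region_spec: Option<RegionSpec>,
    // TODO: add an `add_region_spec` API once multiple specs are supported.
}
```

With `Default` derived, `MemWalConfig::default()` yields a config with no region spec, which matches the "optional to begin with" request.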

jackye1995 · 2026-01-16T05:24:24Z

rust/lance/src/dataset/mem_wal/api.rs

+    ///
+    /// This opens the vector index and extracts the IVF model and product
+    /// quantizer needed for in-memory index maintenance.
+    async fn load_vector_index_config(


can these just be functions, not methods within Dataset?

jackye1995 · 2026-01-16T05:25:20Z

rust/lance/src/dataset/mem_wal/config.rs

+    /// When false:
+    /// - Index updates are deferred
+    /// - New data may not appear in index-accelerated queries immediately
+    pub indexed_writes: bool,


This name is confusing, because it could be read as indexed vs. not indexed. What about sync_indexed_write?

jackye1995 · 2026-01-16T05:27:11Z

rust/lance/src/dataset/mem_wal/manifest.rs

+            })?;
+
+        // Best-effort update version hint
+        self.write_version_hint(version).await;


this should log a warning if failed
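
One way to keep the step best-effort while still surfacing failures is sketched here. The `warn` callback stands in for whatever logging macro the crate uses (for example `log::warn!`), and `best_effort` is a hypothetical helper, not code from this PR.

```rust
use std::io;

/// Runs a best-effort step: a failure is reported through `warn` but never
/// propagated to the caller. Returns whether the step succeeded.
pub fn best_effort<F, W>(what: &str, step: F, mut warn: W) -> bool
where
    F: FnOnce() -> io::Result<()>,
    W: FnMut(String),
{
    match step() {
        Ok(()) => true,
        Err(e) => {
            // In real code this would go to the logger at warning level.
            warn(format!("{} failed (ignored): {}", what, e));
            false
        }
    }
}
```

The commit itself still succeeds when the hint write fails; the warning only makes the failure visible to operators.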

jackye1995 · 2026-01-16T05:27:53Z

rust/lance/src/dataset/mem_wal/manifest.rs

+            ..Default::default()
+        };
+
+        self.object_store


this should also differentiate between local and cloud, since local will use a rename to ensure atomicity?

jackye1995 · 2026-01-16T05:28:32Z

rust/lance/src/dataset/mem_wal/manifest.rs

+        }
+
+        // Parallel scan forward with batches of HEAD requests
+        let batch_size = 8;


this should be a config in region writer config

batch size can just be 2 as default
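
To make the batch-size trade-off concrete, here is a sketch of a batched forward scan. It assumes manifest versions are contiguous and uses an `exists` closure in place of real HEAD requests; `scan_forward` and its signature are illustrative, not the function in manifest.rs.

```rust
use std::thread;

/// Finds the highest existing version at or after `known`, probing
/// `batch_size` consecutive versions concurrently per round.
/// `exists` stands in for a HEAD request against the object store.
pub fn scan_forward<F>(known: u64, batch_size: usize, exists: F) -> u64
where
    F: Fn(u64) -> bool + Sync,
{
    assert!(batch_size > 0, "batch_size must be positive");
    let exists = &exists;
    let mut latest = known;
    loop {
        let start = latest + 1;
        // Probe one batch of consecutive versions in parallel.
        let found: Vec<bool> = thread::scope(|s| {
            let handles: Vec<_> = (0..batch_size as u64)
                .map(|i| s.spawn(move || exists(start + i)))
                .collect();
            handles.into_iter().map(|h| h.join().unwrap()).collect()
        });
        // Versions are assumed contiguous, so stop counting at the first gap.
        let hits = found.iter().take_while(|&&ok| ok).count() as u64;
        latest += hits;
        if hits < batch_size as u64 {
            return latest;
        }
    }
}
```

A larger batch finds a far-ahead version in fewer rounds but issues more wasted probes when the hint is already current, which is the usual case; that is the argument for a small default such as 2.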

- Remove unnecessary rename_if_not_exists/copy_if_not_exists wrappers
- Change MemWalConfig.region_specs to optional single region_spec
- Move load_vector_index_config to standalone function
- Rename indexed_writes to sync_indexed_write for clarity
- Add warning log comment for manifest update failures
- Differentiate local vs cloud manifest writes (rename vs PUT-IF-NOT-EXISTS)
- Make manifest_scan_batch_size configurable with default 2

Co-Authored-By: Jack Ye <yezhaoqin@gmail.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

touch-of-grey · 2026-01-16T05:46:26Z

All review comments have been addressed in commit cda38c0:

✅ Removed rename_if_not_exists/copy_if_not_exists wrappers - using existing object_store APIs directly
✅ Fixed "Arrow field metadata" typo
✅ Changed region_specs: Vec<RegionSpec> to region_spec: Option<RegionSpec>, added TODO for add_region_spec() API
✅ Moved load_vector_index_config and load_ivf_pq_components to standalone functions
✅ Renamed indexed_writes to sync_indexed_write for clarity
✅ Added comment clarifying that version hint failures are logged as warnings
✅ Differentiated local vs cloud manifest writes (local uses temp file + atomic rename, cloud uses PUT-IF-NOT-EXISTS)
✅ Made manifest_scan_batch_size configurable in RegionWriterConfig with default of 2

Replace Vec<MemTableFragment> with an actual Lance in-memory Dataset for queryable storage. Each write now:
1. Writes RecordBatch to in-memory Dataset (creates new version)
2. Appends to WAL buffer for durability
3. Updates in-memory indexes

Key changes:
- memtable.rs: Use Dataset instead of Vec<MemTableFragment>
  - insert() is now async and writes to memory:// Dataset
  - Added scan_batches() method to read from Dataset
  - Simplified WAL flush tracking with HashSet<usize>
- writer.rs: Update put() to handle async insert
  - Clone batch for WAL before async insert
  - Remove separate indexes field from WriterState
- flush.rs: Read from Dataset via scan_batches()
- write.rs: Remove MemTableFragment export

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Eliminate batch cloning by using the in-memory Dataset as single source of truth. The WAL buffer now tracks fragment IDs only (no batch data), and scans fragments from the Dataset during flush.

Changes:
- WAL buffer: Remove batch storage, track fragment IDs as cursor
- MemTable: Add scan_fragments_by_ids() for scanning specific fragments
- Writer: Remove batch cloning, pass MemTable reference to WAL flush

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
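
The write path after this change (insert into the MemTable, record the fragment ID in the WAL buffer, update the in-memory index, then read rows back from the MemTable at flush time) can be sketched roughly as follows. Plain vectors stand in for the in-memory Dataset, and every type and field here is a simplified assumption; only the name scan_fragments_by_ids comes from the commit message.

```rust
use std::collections::BTreeMap;

/// (primary key, payload); an illustrative stand-in for a RecordBatch row.
pub type Row = (u64, String);

#[derive(Default)]
pub struct MemTable {
    /// Each inserted batch becomes one fragment; its ID is its position.
    fragments: Vec<Vec<Row>>,
}

impl MemTable {
    pub fn insert(&mut self, batch: Vec<Row>) -> usize {
        self.fragments.push(batch);
        self.fragments.len() - 1
    }

    pub fn scan_fragments_by_ids(&self, ids: &[usize]) -> Vec<&Row> {
        ids.iter().flat_map(|&id| self.fragments[id].iter()).collect()
    }
}

#[derive(Default)]
pub struct WalBuffer {
    /// Fragment IDs not yet flushed; no batch data is stored here.
    pending: Vec<usize>,
}

impl WalBuffer {
    pub fn append(&mut self, fragment_id: usize) {
        self.pending.push(fragment_id);
    }

    /// Drains the pending IDs and reads their rows back from the MemTable.
    pub fn flush<'a>(&mut self, memtable: &'a MemTable) -> Vec<&'a Row> {
        let ids = std::mem::take(&mut self.pending);
        memtable.scan_fragments_by_ids(&ids)
    }
}

#[derive(Default)]
pub struct Writer {
    memtable: MemTable,
    wal: WalBuffer,
    /// Maps primary key to the fragment holding its latest value.
    pk_index: BTreeMap<u64, usize>,
}

impl Writer {
    pub fn put(&mut self, batch: Vec<Row>) {
        let keys: Vec<u64> = batch.iter().map(|row| row.0).collect();
        let id = self.memtable.insert(batch); // 1. store the batch once
        self.wal.append(id); // 2. WAL records only the fragment ID
        for key in keys {
            self.pk_index.insert(key, id); // 3. update the in-memory index
        }
    }

    pub fn flush_wal(&mut self) -> Vec<&Row> {
        self.wal.flush(&self.memtable)
    }

    pub fn fragment_of(&self, key: u64) -> Option<usize> {
        self.pk_index.get(&key).copied()
    }
}
```

Because the batch is moved into the MemTable and the WAL buffer keeps only IDs, `put` never clones row data.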

Replace batch collection with direct streaming from the in-memory Dataset scanner. Instead of collecting all batches into a Vec and then creating a RecordBatchIterator, we now:
1. Get DatasetRecordBatchStream from dataset.scan().try_into_stream()
2. Convert to SendableRecordBatchStream (implements StreamingWriteSource)
3. Pass directly to InsertBuilder.execute_stream()

This eliminates the intermediate Vec<RecordBatch> allocation and reduces memory pressure during flush operations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- WAL entries are now stored as Lance datasets instead of Arrow IPC streams
- Schema: LargeBinary (file_bytes) + Binary (fragment_bytes) with writer_epoch metadata
- Enables direct file byte copying from in-memory store (no re-encoding)
- Simplified flush logic by removing temp file + rename approach
- Added WalEntryData::read() for WAL replay

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add FragmentStore for O(1) append operations (avoids manifest growth)
- Replace BTreeMemIndex with SkipListMemIndex using crossbeam-skiplist
- Add TTL-based Dataset caching in MemTable for eventual consistency
- Store original RecordBatches for efficient Dataset reconstruction

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
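
The TTL-based caching mentioned above could look roughly like this. It is a generic sketch with a hypothetical `TtlCache` type, not the MemTable implementation; the caller passes `now` explicitly so the expiry logic is easy to reason about.

```rust
use std::time::{Duration, Instant};

/// Caches one expensive-to-build value and rebuilds it once it is older
/// than `ttl`. Hypothetical helper for illustration only.
pub struct TtlCache<T> {
    ttl: Duration,
    entry: Option<(Instant, T)>,
}

impl<T: Clone> TtlCache<T> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entry: None }
    }

    /// Returns the cached value while it is fresh; otherwise calls `rebuild`
    /// and caches the result with `now` as its build time.
    pub fn get_or_rebuild(&mut self, now: Instant, rebuild: impl FnOnce() -> T) -> T {
        if let Some((built_at, value)) = &self.entry {
            if now.duration_since(*built_at) < self.ttl {
                return value.clone();
            }
        }
        let value = rebuild();
        self.entry = Some((now, value.clone()));
        value
    }
}
```

Readers may see a snapshot that is up to `ttl` old, which is the eventual-consistency trade-off: writes append cheaply, and the queryable view is rebuilt at most once per TTL window.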

github-actions bot added the enhancement (New feature or request) label Jan 14, 2026

jackye1995 self-requested a review January 14, 2026 00:30

touch-of-grey force-pushed the lsm-writer branch from 4fae180 to eb46adc Compare January 16, 2026 05:15

jackye1995 reviewed Jan 16, 2026

touch-of-grey and others added 5 commits January 15, 2026 22:10
