
Conversation


@matthew-levan matthew-levan commented Jan 2, 2026

This PR replaces LMDB with book, a custom append-only file-based event log persistence layer tailored to Urbit's sequential access patterns.

Motivation

Unlimited event size

LMDB's general-purpose key-value store features (random access, transactions) are unnecessary overhead for Urbit's strictly append-only event log. Because LMDB stores data in a B+tree, reducing the log's size on disk is impossible, and its maximum value size caps individual events at 4GB or less. This new API provides a simpler, more focused solution.

Faster writes

Additionally, write speeds with book should exceed LMDB's, removing a potential bottleneck should we approach one after integrating SKA with the core operating function.

Implementation

Events are stored in book.log, preceded by an immutable header:

/* u3_book_head: on-disk file header (64 bytes)
*/
typedef struct _u3_book_head {
  c3_w mag_w;      //  magic number: 0x424f4f4b ("BOOK")
  c3_w ver_w;      //  format version: 1
  c3_d fir_d;      //  first event number in file
} u3_book_head;

On disk, events are written as deeds: jam buffers sandwiched between heads and tails:

/* u3_book_deed_head: on-disk deed header
*/
typedef struct _u3_book_deed_head {
  c3_d len_d;    //  payload size (mug + jam)
  c3_l mug_l;    //  mug/hash
} u3_book_deed_head;

/* u3_book_deed_tail: on-disk deed trailer
*/
typedef struct _u3_book_deed_tail {
  c3_w crc_w;    //  CRC32 checksum
  c3_d let_d;    //  length trailer (validates len_d)
} u3_book_deed_tail;

/* u3_book_deed: complete on-disk event record
**
**   NB: not used directly for I/O due to variable-length jam data
**   actual format: deed_head | jam_data | deed_tail
*/
typedef struct _u3_book_deed {
  u3_book_deed_head hed_u;
  // c3_y jam_y[];  //  variable-length jam data
  u3_book_deed_tail tal_u;
} u3_book_deed;

Reeds are used to represent deeds in memory during I/O:

/* u3_book_reed: in-memory event record representation for I/O
*/
typedef struct _u3_book_reed {
  c3_d  len_d;    //  total payload size
  c3_l  mug_l;    //  mug/hash
  c3_y* jam_y;    //  jam data (caller owns, len = len_d - 4)
  c3_w  crc_w;    //  CRC32 checksum
} u3_book_reed;

Finally, a u3_book structure is used for read, write, and other operations, both internally and in the disk.c API:

/* u3_book: event log handle
*/
typedef struct _u3_book {
  c3_i         fid_i;      //  file descriptor for book.log
  c3_i         met_i;      //  file descriptor for meta.bin
  c3_c*        pax_c;      //  file path to book.log
  u3_book_head hed_u;      //  cached header (immutable)
  c3_d         las_d;      //  cached last event number
  c3_d         off_d;      //  cached append offset
} u3_book;

The pread and pwrite syscalls are used for thread safety and stateless operation (no cursor position tracking).

Features:

  • Automatic crash recovery via file scanning and truncation
  • Embedded key-value metadata storage
  • Iterator API for sequential reads (u3_book_walk_*)
  • ACID guarantees at the event-batch level (not per event)
  • Functional partial and full replay support (via play -f)
  • Thread-safe, stateless operation (no cursor position tracking) via pread and pwrite, maintaining the existing libuv async patterns
  • Drop-in replacement for LMDB (API mirrors u3_lmdb_* functions)

Testing

Tests focus on failure mode, edge case, and recovery scenarios.

Run: zig build book-test

Compatibility

This PR changes how events are stored in future epochs, but it continues to use LMDB to store global pier metadata in the top-level log directory ($pier/.urb/log/data.mdb). This ensures that helpful error messages can be printed even when users attempt to boot their book-style piers with old binaries. Note that the top-level metadata should be considered canonical; as far as I can tell, however, the metadata stored within epochs (meta.bin, as of this PR) remains consistent with it.

To-do

  • Migrations from v1 and v2 epoch versions
  • Failure mode tests

@dozreg-toplud
Contributor

Overall looks good, couple of comments:

@dozreg-toplud

_book_scan_end iterates over every event in the file, validating them and the event count in the header, and it is called on every event log load (including on every boot) to locate the append offset.

On my laptop it took 0.320835 seconds to iterate over 119054 events. On ~dozreg-toplud (far from the busiest ship on the network) there are epochs with ~20M events. This means that it would take around a minute to just read the event log in order to boot.

Surely the last offset should just be stored in the header, and _book_scan_end should be reserved for corruption recovery. With that we could also iterate from end to the start of the iterator range in u3_book_walk_init whenever it would make sense: the deeds already have sizes in their tails.

@dozreg-toplud

_book_scan_end will also attempt to truncate all events after a corrupted event was encountered. Is this desirable?

@matthew-levan matthew-levan marked this pull request as ready for review January 23, 2026 01:37
@matthew-levan matthew-levan requested a review from a team as a code owner January 23, 2026 01:37