Metadata layer in cluster. #2346

numinnex · 2025-11-13T18:57:38Z

numinnex
Nov 13, 2025
Collaborator

Related issue: #2339

As part of the clustering functionality, we need to implement SMR (State Machine Replication) for our metadata (streams, topics, consumer groups, users, etc.).
Currently, I am thinking about a few different options for how to model parts of the abstraction that we need.

State machine (or state machines)
There are a few options. One is to have a single state machine that covers all metadata functionality, or to have multiple state machines (one each for streams, topics, consumer groups, etc.). Each approach has its pros and cons.
Single state machine:
Pros:

Simpler mental model and implementation - no need to reason about multiple state machines, side effects, and potential interactions between them.
One atomic snapshot.

Cons:

Worse scalability (both performance and extensibility).
Since there is one global snapshot, fast-forwarding new/lagging replicas goes through a single chokepoint.

Mux state machine:
Pros:

Inversion of all the cons of the single state machine approach. Additionally, with multiple state machines we can defer deserialization to each state machine, avoiding eagerly deserializing the command payload. We can also version snapshots for each state machine independently.

Cons:

More complex mental model and implementation - need to reason about interactions and side effects between state machines.
Snapshot coordination becomes more complex.

JournalLog (where we will store uncommitted and committed ops)
There are a few options as well, but I think it's much clearer what to do here than with the STM. I think an on-disk circular buffer (considering that we can calculate an upper bound size for each metadata command) that is snapshotable is more than enough for our use-case.
Snapshot
Given the design choice of JournalLog, I think we need a full count-based snapshotting mechanism that will use sequence as its revision number. The snapshot needs to be stored in 3 copies (for redundancy) and needs to be composable from smaller snapshots (individual STMs in case of MuxStateMachine).
Maybe it's actually a good idea to take incremental snapshots of each STM and create a global snapshot that would be a full snapshot of the metadata? Something worth exploring if we choose the MuxStateMachine variant.

Those are the main points that I would like to hear your input on, if you have anything that you think is worth considering as well, feel free to include it in your comments.

chiradip · 2025-11-15T15:14:30Z

chiradip
Nov 15, 2025

Thanks Greg — great write-up. A couple of focused points from building clustered metadata systems with SMR (Raft/VSR):

Since Iggy is using VSR, start with one VSR group + one ordered log

A single VSR log keeps all metadata ops (streams, topics, consumer groups, users) linearizable and naturally atomic.
It avoids every cross-SM coordination problem that appears the moment you introduce multiple VSR groups.

You can still keep logical namespaces inside the state machine — just not separate consensus groups yet.

Make snapshots modular, not multi-group

The “MuxStateMachine” idea is useful inside one VSR log, but running multiple SMRs now creates 2PC-style coordination for deletes/renames/permissions.
Instead: one snapshot manifest with per-domain sub-snapshots. Easy to install, easy to stream, and future-proof if you ever need to split domains.

The journal needs real segmented logs

A circular buffer with “upper-bounded ops” will break in production when metadata spikes.
Use append-only segmented logs + snapshot-triggered truncation + streaming snapshot install for follower catch-up.

Define semantics first

Before choosing architecture, clarify:

Is metadata strictly linearizable?
Should cross-domain metadata changes be atomic?
What’s the recovery path for a cold follower?

These decisions determine whether multiple SMs are even viable long-term.

TL;DR

Greg — the cleanest and safest path is:

one VSR group
one log
logical domain separation inside the SM
modular snapshot format

It gives you correctness, simple recovery, atomic metadata ops, and a clean way to scale later without committing to multi-group coordination now.

Happy to dig deeper if helpful.

0 replies

numinnex · 2025-11-15T18:48:15Z

numinnex
Nov 15, 2025
Collaborator Author

Thanks Chiradip.

Yeah after thinking about it more, I agree that a single log, with one atomic snapshot is the easiest path to achieve strict serializability, considering that metadata defines all the necessary elements required for the clients to operate the server, SS is a must.

logical domain separation inside the SM
modular snapshot format

Yeah those can be achieved by having MuxStateMachine, while still having one atomic snapshot, seems that redpanda does exactly that.

A circular buffer with “upper-bounded ops” will break in production when metadata spikes.
What do you mean by that ?

What I've thought is, since our metadata ops have an "upperbound" size (which isn't large it's few thousands bytes maximum), we can easily estimate the size of the log and limit it's size by the snapshot interval -- log_size = snapshot_interval * max_op_size, and utilize the snapshot sequence number to distinguish between the "junk" log entries and "new" log entries, without paying the cost of physically truncating the log, the log entries become sort of "copy-on-write". Tigerbeetle uses that approach for their log and it seems to work really great (they even add an extra "lag" when snapshotting the log, to allow for extra pipelinening), I don't think so we need to go this far, as metadata isn't the "hot" path of our system.

0 replies

chiradip · 2025-11-15T21:01:58Z

chiradip
Nov 15, 2025

Greg — I pushed through the tradeoffs and I think we should prefer segmented append-only logs + snapshot-triggered truncation over a bounded circular buffer for metadata.

Reason: the circular ring requires strict timing guarantees (snapshot must finish before wrap) and a bunch of brittle guardrails (preemptive snapshot triggers, throttling, overflow spill, complex crash recovery). Those are doable but add a lot of correctness surface area. Segmented logs tolerate follower lag and crashes far more naturally, simplify recovery, and are easier to test/operate.

If we accept slightly more IO/manifest work, we get drastically simpler correctness. We can still keep logical domain namespaces and modular snapshots (MuxStateMachine inside one VSR group) and can optimize later if real measurements demand it. Happy to sketch segment format + compaction/truncation rules if that helps.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Metadata layer in cluster. #2346

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Metadata layer in cluster. #2346

Uh oh!

numinnex Nov 13, 2025 Collaborator

Replies: 3 comments

Uh oh!

chiradip Nov 15, 2025

Uh oh!

numinnex Nov 15, 2025 Collaborator Author

Uh oh!

chiradip Nov 15, 2025

numinnex
Nov 13, 2025
Collaborator

chiradip
Nov 15, 2025

numinnex
Nov 15, 2025
Collaborator Author

chiradip
Nov 15, 2025