Replies: 3 comments
-
|
Thanks Greg — great write-up. A couple of focused points from building clustered metadata systems with SMR (Raft/VSR):
A single VSR log keeps all metadata ops (streams, topics, consumer groups, users) linearizable and naturally atomic. You can still keep logical namespaces inside the state machine — just not separate consensus groups yet.
The “MuxStateMachine” idea is useful inside one VSR log, but running multiple SMRs now creates 2PC-style coordination for deletes/renames/permissions.
A circular buffer with “upper-bounded ops” will break in production when metadata spikes.
Before choosing architecture, clarify:
These decisions determine whether multiple SMs are even viable long-term. TL;DR Greg — the cleanest and safest path is:
It gives you correctness, simple recovery, atomic metadata ops, and a clean way to scale later without committing to multi-group coordination now. Happy to dig deeper if helpful. |
Beta Was this translation helpful? Give feedback.
-
|
Thanks Chiradip. Yeah after thinking about it more, I agree that a single log, with one atomic snapshot is the easiest path to achieve strict serializability, considering that metadata defines all the necessary elements required for the clients to operate the server, SS is a must.
Yeah those can be achieved by having MuxStateMachine, while still having one atomic snapshot, seems that redpanda does exactly that.
What I've thought is, since our metadata ops have an "upperbound" size (which isn't large it's few thousands bytes maximum), we can easily estimate the size of the log and limit it's size by the snapshot interval -- log_size = snapshot_interval * max_op_size, and utilize the snapshot sequence number to distinguish between the "junk" log entries and "new" log entries, without paying the cost of physically truncating the log, the log entries become sort of "copy-on-write". Tigerbeetle uses that approach for their log and it seems to work really great (they even add an extra "lag" when snapshotting the log, to allow for extra pipelinening), I don't think so we need to go this far, as metadata isn't the "hot" path of our system. |
Beta Was this translation helpful? Give feedback.
-
|
Greg — I pushed through the tradeoffs and I think we should prefer segmented append-only logs + snapshot-triggered truncation over a bounded circular buffer for metadata. Reason: the circular ring requires strict timing guarantees (snapshot must finish before wrap) and a bunch of brittle guardrails (preemptive snapshot triggers, throttling, overflow spill, complex crash recovery). Those are doable but add a lot of correctness surface area. Segmented logs tolerate follower lag and crashes far more naturally, simplify recovery, and are easier to test/operate. If we accept slightly more IO/manifest work, we get drastically simpler correctness. We can still keep logical domain namespaces and modular snapshots (MuxStateMachine inside one VSR group) and can optimize later if real measurements demand it. Happy to sketch segment format + compaction/truncation rules if that helps. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Related issue: #2339
As part of the clustering functionality, we need to implement SMR (State Machine Replication) for our metadata (streams, topics, consumer groups, users, etc.).
Currently, I am thinking about a few different options for how to model parts of the abstraction that we need.
There are a few options. One is to have a single state machine that covers all metadata functionality, or to have multiple state machines (one each for streams, topics, consumer groups, etc.). Each approach has its pros and cons.
Single state machine:
Pros:
Simpler mental model and implementation - no need to reason about multiple state machines, side effects, and potential interactions between them.
One atomic snapshot.
Cons:
Worse scalability (both performance and extensibility).
Since there is one global snapshot, fast-forwarding new/lagging replicas goes through a single chokepoint.
Mux state machine:
Pros:
Inversion of all the cons of the single state machine approach. Additionally, with multiple state machines we can defer deserialization to each state machine, avoiding eagerly deserializing the command payload. We can also version snapshots for each state machine independently.
Cons:
More complex mental model and implementation - need to reason about interactions and side effects between state machines.
Snapshot coordination becomes more complex.
JournalLog (where we will store uncommitted and committed ops)
There are a few options as well, but I think it's much clearer what to do here than with the STM. I think an on-disk circular buffer (considering that we can calculate an upper bound size for each metadata command) that is snapshotable is more than enough for our use-case.
Snapshot
Given the design choice of JournalLog, I think we need a full count-based snapshotting mechanism that will use sequence as its revision number. The snapshot needs to be stored in 3 copies (for redundancy) and needs to be composable from smaller snapshots (individual STMs in case of MuxStateMachine).
Maybe it's actually a good idea to take incremental snapshots of each STM and create a global snapshot that would be a full snapshot of the metadata? Something worth exploring if we choose the MuxStateMachine variant.
Those are the main points that I would like to hear your input on, if you have anything that you think is worth considering as well, feel free to include it in your comments.
Beta Was this translation helpful? Give feedback.
All reactions