DRAFT: MDEV-34705: Storing binlog in InnoDB #3775

knielsen · 2025-01-17T13:13:11Z

Draft pull request for work-in-progress on the MDEV-34705 binlog-in-engine feature.

A new option --binlog-storage-engine=ENGINE moves the binlog implementation into the storage engine, for supporting engines (currently only InnoDB).

InnoDB implements the binlog files as a new type of tablespace, and uses its redo log to make the binlog crash-safe without the overhead and complexity of two-phase commit.

CLAassistant · 2025-01-17T13:13:19Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ bnestere
❌ knielsen

bnestere · 2025-02-11T23:39:53Z

Hi @knielsen!

This is very cool to see. I haven’t gone through with a fine-toothed comb, but I’d say overall I have a pretty good understanding of the idea, and have some high-level questions to start (nothing code-related yet, as there are still many things in flux, and I feel any code comments I’d make would be superseded in time anyway)

A few questions on the subject of the out-of-band data to start:

1. By using the forest of perfectly balanced trees to index prior OOB writes, my understanding is that the main advantage we get here is faster reads (as opposed to just a stack, with each node referencing the last written node), which I would think would be most noticeable for semi-sync setups, once that is implemented. Is there more to be gained? (e.g., I don’t think it would reduce the memory footprint, as every node will be eventually referenced in order to reconstruct the full event anyway).
2. I wonder if each binary log file should have a footer to summarize the file dependencies, e.g. if there are OOB transactions, what is the earliest binlog file needed to be able to reconstruct all transactions in this log.

2a) Likewise, for a transaction with OOB data, I wonder if we should file the earlier binlog file needed to reconstruct that transaction

2b) My thought is that this would help on reporting problems off-the-bat, e.g. mysqlbinlog could report missing binlog files immediately, instead of writing data up to a transaction with missing OOB chunks.

2c) My thought on a use-case for this is for really large transactions, and (when eventually supported) 2-phase XA transactions, where the PREPARE may happen long in the past.
3. (This is nothing to address now, but more to consider in the design to support more things in the long term): I don’t think the master-slave protocol supports this now, but perhaps we could parallelize OOB reads, where the dump-thread would be sending OOB chunks while another thread is buffering them. Just mentioning this for consideration in the current design to be extensible for such things.
4. I wonder if OOB binlogging should be optional. E.g., say some user’s workload is rollback heavy, that could result in binlogs bloated with garbage.

Then generally to the API, your API seems to be encapsulated from the existing MYSQL_BIN_LOG/Event_log/TC_LOG APIs (which makes sense, as yours is more of an engine-plugin, vs the legacy approach is the handler). Though a few thoughts come from this:

5. Are we losing the ability for the binary log (stored in innodb) to function as a transaction coordinator? We’ve previously talked about that being the case, where multi-engine transactions aren’t yet supported. Do you have any new thoughts on how that would (eventually) be implemented?
6. Your binlog-in-engine API functions are hard-coded into the handler type. Could it instead be a separate class, perhaps used to initialize the handler? I wonder if that design would make the transaction coordinator piece easier to implement (e.g. perhaps via a shared in-engine binlog/context that multiple handlers use).
7. Instead of having MYSQL_BIN_LOG use a lot of conditionals reliant around opt_binlog_engine_hton, I wonder if it can be refactored to prefer some composition-based strategy which is set-up at initialization time (e.g. just give it the right class instance).

knielsen · 2025-02-12T10:14:01Z

Hi Brandon, thanks for taking a look at the patch and for the initial comments/questions!

Brandon Nesterenko writes:

idea, and have some high-level questions to start (nothing code-related yet, as there are still many things in flux, and I feel any code comments I’d make would be superseded in time anyway)

Yes, agree.

A few questions on the subject of the out-of-band data to start: 1. By using the forest of perfectly balanced trees to index prior OOB writes, my understanding is that the main advantage we get here is faster reads (as opposed to just a stack, with each node referencing the last written node), which I would think would be most noticeable for semi-sync setups, once that is implemented.

The OOB is needed to:

1. Preserve the strict append-only way of writing the binlog data (so no back-patching to update forward pointers in previous nodes).
2. Avoid having to read all pages from the file backwards through the stack one by one before we can even send the first byte to the slave, which could be very slow for large event groups (I've heard of users with 50 GB event groups .oO).
3. Have an algorithm that needs O(log(N)) in-memory space as opposed to O(N), where N is the size of the event group, and does not need to read every page twice.

So yes, basically faster reads, but also to be scalable to large event groups (i.e. gigabytes), which can easily happen for row-based.

Is there more to be gained? (e.g., I don’t think it would reduce the memory footprint, as every node will be eventually referenced in-order to reconstruct the full event anyway).

The reduction in memory footprint is mainly to not have to keep a list of _all_ nodes in memory (O(N)) while traversing the stack, instead needing only to keep the path from root to current leaf (O(log(N))). Replication in general tries hard to not restrict the size of event groups/transactions to available memory. An additional goal is to reduce I/O, including having to read the file backwards which I fear could be inefficient on some systems.
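To make the O(log(N)) argument concrete, here is a minimal sketch of one way to build an append-only forest of perfect binary trees and read it back in write order. It is an illustration only: the node layout (`left`, `right`, `prev_root`), the `OobWriter` type, and the in-memory vector standing in for the binlog file are all assumptions, not the layout used by the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical node: written once at the end of the log, never patched later.
struct OobNode {
  int64_t left = -1;       // root of the older child subtree, or -1
  int64_t right = -1;      // root of the newer child subtree, or -1
  int64_t prev_root = -1;  // root of the preceding tree in the forest, or -1
  int payload = 0;         // stands in for a chunk of event data
};

struct OobWriter {
  std::vector<OobNode> log;  // stands in for the append-only file
  // (root index, height) of each tree, oldest first; O(log N) entries.
  std::vector<std::pair<int64_t, int> > forest;

  void append(int payload) {
    OobNode n;
    n.payload = payload;
    int height = 0;
    const size_t k = forest.size();
    if (k >= 2 && forest[k - 1].second == forest[k - 2].second) {
      // The two newest trees have equal height: the new node becomes
      // their parent, forming one perfect tree that is one level taller.
      n.left = forest[k - 2].first;
      n.right = forest[k - 1].first;
      height = forest[k - 1].second + 1;
      forest.pop_back();
      forest.pop_back();
    }
    if (!forest.empty()) n.prev_root = forest.back().first;
    forest.push_back(std::make_pair(static_cast<int64_t>(log.size()), height));
    log.push_back(n);
  }
};

// Reads all payloads in write order, starting from the last node only.
// Memory use is the list of tree roots plus one root-to-leaf path.
std::vector<int> read_in_order(const std::vector<OobNode>& log,
                               size_t* max_stack_depth = nullptr) {
  std::vector<int> out;
  size_t deepest = 0;
  std::vector<int64_t> roots;  // newest tree first
  for (int64_t r = static_cast<int64_t>(log.size()) - 1; r >= 0;
       r = log[r].prev_root)
    roots.push_back(r);
  for (size_t t = roots.size(); t-- > 0;) {  // oldest tree first
    // Post-order walk: children were written before their parent.
    std::vector<std::pair<int64_t, int> > stack;
    stack.push_back(std::make_pair(roots[t], 0));
    while (!stack.empty()) {
      deepest = std::max(deepest, stack.size());
      const int64_t idx = stack.back().first;
      assert(idx >= 0 && static_cast<size_t>(idx) < log.size());
      const OobNode& n = log[idx];
      const int step = stack.back().second++;
      if (step == 0) {
        if (n.left >= 0) stack.push_back(std::make_pair(n.left, 0));
      } else if (step == 1) {
        if (n.right >= 0) stack.push_back(std::make_pair(n.right, 0));
      } else {
        out.push_back(n.payload);
        stack.pop_back();
      }
    }
  }
  if (max_stack_depth) *max_stack_depth = deepest;
  return out;
}
```

In this sketch the writer only ever appends, and both the writer's `forest` vector and the reader's stack stay logarithmic in the number of nodes, which matches the properties listed in the reply above.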

2. I wonder if each binary log file should have a footer to summarize the file dependencies, e.g. if there are OOB transactions, what is the earliest binlog file needed to be able to reconstruct all transactions in this log.

Yes, this is the plan. I think it can go in the file header (as opposed to a footer), but this is still ToDo: to implement, we will see what is needed.

2a) Likewise, for a transaction with OOB data, I wonder if we should file the earlier binlog file needed to reconstruct that transaction

Right, I think that is already there, if I understood you correctly? In-memory we have handler_binlog_event_group_info::engine_ptr where we have binlog_oob_context::first_node_file_no. And on disk, the commit record for the transaction has a pointer to the first OOB node.

2b) My thought is that this would help on reporting problems off-the-bat, e.g. mysqlbinlog could report missing binlog files immediately, instead of writing data up-to a transaction with missing OOB chunks.

Not sure this is possible in the general case. If we start reading from binlog 10, and its header says it references back to binlog 8, and binlog 8 (and 9) are available, then yes, we can be sure we have everything we will need. But the converse is not true. Even if binlog 8 is missing we can not be sure we will need it just from reading the file 10 header. We might be starting somewhere in the middle of file 10, after any references back to 8. Or we might be starting from a multi-domain GTID position where each domain starts at different positions in binlog 10. Or the OOB data in binlog 8 belongs to a transaction that ended up rolling back and so will never be needed.
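The asymmetry described above can be sketched as a check with two outcomes. This is only an illustration: the `earliest_ref` header field and the function name are hypothetical, since the header format is still a ToDo in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

enum class OobCheck {
  Complete,  // every file that could be referenced is present
  Unknown    // a file is missing, but it may never actually be needed
};

// start_file_no: the binlog file we start reading from.
// earliest_ref:  hypothetical header field of that file, naming the
//                earliest file any of its OOB references point back to.
OobCheck check_oob_files(uint64_t start_file_no, uint64_t earliest_ref,
                         const std::set<uint64_t>& available) {
  assert(earliest_ref <= start_file_no);
  for (uint64_t f = earliest_ref; f <= start_file_no; ++f)
    if (available.count(f) == 0)
      return OobCheck::Unknown;  // cannot conclude failure from the header alone
  return OobCheck::Complete;
}
```

A tool could report success early on `Complete`, but on `Unknown` it would still have to read on, since the start position, the GTID domains, or a rollback may make the missing file irrelevant.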

2c) My thought on a use-case for this is for really large transactions, and (when eventually supported) 2-phase XA transactions, where the PREPARE may happen long in the past.

Yes. A future use case could be the ability to start optimistically replicating large transactions on the slave before it has committed on the master.

3. (This is nothing to address now, but more to consider in the design to support more things in the long term): I don’t think the master-slave protocol supports this now, but perhaps we could parallelize OOB reads, where the dump-thread would be sending OOB chunks while another thread is buffering them. Just mentioning this for consideration in the current design to be extensible for such things.

Right. For the current master-slave protocol, we need to send the whole event group as one consecutive stream of bytes, which is why we have the forest-of-trees to efficiently do that. This could be very interesting to think about extending to sending OOB data early to the slave (if that is what you meant with "another thread buffering them"?). I agree it would be very good to have more ideas for this to make the initial design suitably extensible.

4. I wonder if OOB binlogging should be optional. E.g., say some user’s workload is rollback heavy, that could result in binlogs bloated with garbage.

Yes, I thought about that too, it should be easy to implement. Mostly just omit the code that spills transaction cache data directly to the engine instead of to a tmpfile, eg. skip this:

  trx_cache.cache_log.write_function= binlog_spill_to_engine;

InnoDB has a limit of 1MB (or something like that) on mini-transactions. So event groups larger than that still need to be split into multiple records. But we can optionally keep all data in the transaction cache tmpfile until commit, then hold LOCK_log and write all the pieces consecutively to the binlog, so there is no more interleaving of oob data. (And then we can use a normal commit record without any OOB nodes).
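The splitting step mentioned above could look roughly like the following sketch. The function name is hypothetical, and the record limit is a parameter because the exact mini-transaction limit is only given approximately in the reply.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits an event group of `total` bytes into consecutive (offset, length)
// pieces, each no larger than `max_record` bytes.
std::vector<std::pair<size_t, size_t> > split_into_records(size_t total,
                                                           size_t max_record) {
  assert(max_record > 0);
  std::vector<std::pair<size_t, size_t> > pieces;
  for (size_t off = 0; off < total; off += max_record)
    pieces.push_back(std::make_pair(off, std::min(max_record, total - off)));
  return pieces;
}
```

With OOB disabled, all pieces of one event group would be written back to back while holding LOCK_log, so no other transaction's data is interleaved between them.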

Then generally to the API, your API seems to be encapsulated from the existing MYSQL_BIN_LOG/Event_log/TC_LOG APIs (which makes sense, as yours is more of an engine-plugin, vs the legacy approach is the handler). Though a few thoughts come from this:

Right. My idea is to focus on the first release of this (already huge) project to get a stable functionality on all the user-facing parts. Thus, if possible, I would keep the existing logic on the server level during commit (LOCK_log, queue_for_group_commit(), ...). Then in the next version we can focus on improving scalability even more by avoiding a lot of this machinery in the common case where there is no multi-engine transaction. But details of this are still ToDo, and more changes might be needed to get a functional first release.

5. Are we losing the ability for the binary log (stored in innodb) to function as a transaction coordinator? We’ve previously talked about that being the case, where multi-engine transactions aren’t yet supported. Do you have any new thoughts on how that would (eventually) be implemented?

I think the new binlog will work more or less the same way as the existing one wrt. being the transaction coordinator for multi-engine transactions. Thus, in the old binlog:

  RocksDB.prepare(); InnoDB.prepare(); binlog.write(); RocksDB.commit(); InnoDB.commit();

In the new binlog, the InnoDB prepare() is omitted:

  RocksDB.prepare(); InnoDB.binlog_and_commit(); RocksDB.commit();

The XA crash recovery algorithm then needs to be able to read from the new binlog implementation to find RocksDB prepared transactions and see if they were binlogged or not. We still have the XID events needed for this. I would avoid the complexity of the binlog checkpoints for the new binlog case, and just require the extra engines to fsync() again in their commit, for simplicity.

The challenge is mostly if we have any functioning RocksDB or other transactional engine to test against? My (limited) understanding is that there are not really any resources to maintain RocksDB currently. So it might make sense to just disable multi-engine transactions in the first version, to focus on getting the more important use-cases stable first (and not accidentally turn MDEV-34705 into a fix-RocksDB-XA-bugs project...).

6. Your binlog-in-engine API functions are hard-coded into the handler type. Could it instead be a separate class, perhaps used to initialize the handler? I wonder if that design would make the transaction coordinator piece easier to implement (e.g. perhaps via a shared in-engine binlog/context that multiple handlers use).

Yes, at least I do foresee changes to precisely how the extensions to the storage engine API will look, even if the underlying functionality would stay much the same. I deliberately tried not to spend too much time on this so far, as I want to solicit input for any preference on how the storage API should look. Eg. Serg traditionally was much involved in storage API extensions, Sergei Petrunia also knows a lot about it from RocksDB work. Adding handlerton methods is the traditional way, though I did add the class handler_binlog_reader() as some small step towards a more C++ interface. Note that traditionally, storage engines could be written in C (InnoDB was pure C for many years, except handler/ha_innodb.cc perhaps). I am not sure what the preferred style is nowadays, I am open to using any preferred style, or to some new better style if there are no existing preferences.

7. Instead of having MYSQL_BIN_LOG use a lot of conditionals reliant around opt_binlog_engine_hton, I wonder if it can be refactored to prefer some composition-based strategy which is set-up at initialization time (e.g. just give it the right class instance).

Definitely, this needs to be improved in the current patch. My approach is to start with this mess-of-conditionals to get an overview of all the parts of the server-layer code that are affected. But then something needs to be done to get a cleaner interface, using composition and/or whatever is needed, still a ToDo. This might also be extended in a follow-up patch to do more general refactor/cleanup of the interfaces around binlog, which has become quite convoluted and hard to work with. I know Monty has ideas for splitting the relay log and the binlog into separate classes, to similarly avoid many of the is_relay_log() conditionals.

Again, thanks for comments, and happy to answer any follow-up questions as always.

- Kristian.
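As a rough illustration of the composition-based approach discussed in point 7, the backend can be chosen once at initialization so that the commit path needs no per-call conditionals. All class and method names below are hypothetical and much simpler than MYSQL_BIN_LOG; the string return values exist only to make the dispatch observable.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical interface for "where the binlog data goes".
struct BinlogBackend {
  virtual ~BinlogBackend() {}
  virtual std::string write_event_group(const std::string& data) = 0;
};

// Stands in for the legacy file-based binlog.
struct LegacyFileBackend : BinlogBackend {
  std::string write_event_group(const std::string& data) override {
    return "file:" + data;
  }
};

// Stands in for a binlog implemented inside a storage engine.
struct EngineBackend : BinlogBackend {
  std::string write_event_group(const std::string& data) override {
    return "engine:" + data;
  }
};

class BinlogFacade {
  std::unique_ptr<BinlogBackend> backend_;

 public:
  // The only place that looks at the configuration.
  explicit BinlogFacade(bool engine_binlog_enabled)
      : backend_(engine_binlog_enabled
                     ? std::unique_ptr<BinlogBackend>(new EngineBackend())
                     : std::unique_ptr<BinlogBackend>(new LegacyFileBackend())) {}

  std::string commit(const std::string& event_group) {
    assert(backend_ != nullptr);
    return backend_->write_event_group(event_group);  // no conditionals here
  }
};
```

The same pattern could in principle separate relay log and binlog behavior, which is the kind of split attributed to Monty above.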

janlindstrom · 2025-02-12T11:49:14Z

@knielsen Before I can do any meaningful analysis of this implementation, I think a higher-level intro would be nice. The transactional behavior is especially interesting to me.

* Currently, there is 2PC between binlog, innodb and wsrep. Is this somehow going to change?
* There is GTID and there is XID where wsrep overwriting XID with wsrep XID during 2PC prepare, this most likely is going to change, but do we really need all these?
* Normally something is written to binlog at commit as transaction is replicated on commit, but streaming replication is sending fragments of transactions even before commit, is this still possible?
* wsrep writes its XID on InnoDB rollback segments, can we get one single XYZ that would identify transaction GTID, XID, node UUID etc ?

knielsen · 2025-02-12T13:19:31Z

Jan Lindström writes:

@knielsen Before I can do any meaningful analysis of this implementation, I think higher level intro would be nice. Especially transactional behavior is interesting to me.

* Currently, there is 2PC between binlog, innodb and wsrep. Is this somehow going to change?

Initially, I think nothing is going to change for Galera/wsrep. Galera will continue to use the existing binlog implementation, and will not support --binlog-storage-engine=innodb. Longer-term, I do believe it could be beneficial for Galera to support using the new binlog format. Having the binlog write part integrated with the InnoDB should both improve performance and reduce complexity, just as it does for other binlog-related code. The binlog events in event groups (ie. write-sets) remain unchanged with this implementation. Is it so that Galera can run without binlog enabled at all? In that mode, the old or new binlog format might make very little change at all, if any. To eventually enable Galera to use the new binlog format when it is running with binlog enabled will require that Galera is properly integrated into the MariaDB server layer wrt. binlog, group commit, GTID allocation, and transaction coordinator. As has been discussed before, Galera needs to take over the role of transaction coordinator, implement the TC_LOG interface and take over the role of binlogging that is currently done by the MYSQL_BIN_LOG subclass of TC_LOG. We do not want to carry over the current endless problems with Galera inconsistency with binlog GTIDs, nor do we want the mess of tons of #ifdef WITH_WSREP / IF_WSREP() in the new binlog code.

* There is GTID and there is XID where wsrep overwriting XID with wsrep XID during 2PC prepare, this most likely is going to change, but do we really need all these?

I am not familiar with "wsrep overwriting XID with wsrep XID". But this is not going to change for Galera initially: the existing binlog will remain, and the new binlog implementation will be optional.

* Normally something is written to binlog at commit as transaction is replicated on commit, but streaming replication is sending fragments of transactions even before commit, is this still possible?

I am not familiar with what streaming replication is (it sounds like it is Galera sending partial write-sets around the cluster during transaction execution to reduce stall around commit). There is something similar in the new binlog implementation, where large transactions are written in smaller pieces interleaved with other transactions during execution and before commit. This is referred to as "out-of-band" (OOB) binlogging.

* wsrep writes its XID on InnoDB rollback segments, can we get one single XYZ that would identify transaction GTID, XID, node UUID etc ?

GTID will remain, it is the very core of MariaDB replication and binlogging. I do not think XID is needed in the new binlog in the normal case where only InnoDB is used in a transaction, but it will still be needed for multi-engine transactions (eg. InnoDB+RocksDB). If Galera would also use GTID as the single identifier, by implementing TC_LOG as mentioned above, it should be beneficial. But this is optional whether Galera devs want to spend effort on this. It will be a big change I suppose and probably diverge from how Galera works in the MySQL codebase.

- Kristian.

dr-m · 2025-07-18T11:07:33Z

sql/log.cc

+  if (opt_binlog_engine_hton && value)
+  {
+    sql_print_information("Value of binlog_checksum forced to NONE since binlog_storage_engine is enabled, and InnoDB uses its own superior checksumming of pages");
+    value= 0;
+  }


I’m not sure it is helpful to mention InnoDB here. It would make sense to use the same binlog block format, no matter which storage engine is taking care of the durability (that is, the write-ahead logging).

knielsen · 2025-07-18T15:49:22Z

Marko Mäkelä writes:

@dr-m commented on this pull request.

> +  if (opt_binlog_engine_hton && value)
> +  {
> +    sql_print_information("Value of binlog_checksum forced to NONE since binlog_storage_engine is enabled, and InnoDB uses its own superior checksumming of pages");
> +    value= 0;
> +  }

I’m not sure it is helpful to mention InnoDB here. It would make sense to use the same binlog block format, no matter which storage engine is taking care of the durability (that is, the write-ahead logging).

Agree, this is on the server layer, and in the future it could be using a different storage engine's binlog implementation. I'll reword it in a generic way.

- Kristian.

bnestere · 2025-07-19T20:20:05Z

storage/innobase/include/fsp_binlog.h

 extern uint64_t last_created_binlog_file_no;
-extern std::atomic<uint64_t> binlog_cur_written_offset[2];
-extern std::atomic<uint64_t> binlog_cur_end_offset[2];
+extern std::atomic<uint64_t> binlog_cur_durable_offset[4];


@knielsen just skimming some of your latest commits, not a formal review, but one thought here would be to have sub-classes for a durable vs non-durable reader, instead of the same reader which is always set up for durable & non-durable reads with a flag in the constructor that tells it which mode to use.

Adjust an assertion that checks that no unspilled savepoints are left after spilling the trx cache as OOB data to the engine binlog. If the savepoint is at the very end of the trx cache when we spill, there is no need to spill that particular savepoint to the engine (and we do not do so); the savepoint can still be rolled back to by truncating the in-memory part of the cache.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>

When the tablespace is closed during shutdown, it waits for the last record written in the tablespace to be durable in the InnoDB redo log. There was an obvious mistake/race in the code in function ibb_wait_durable_offset(), which did not check for the waited-for condition before doing the wait. Thus if the last record in the binlog file became durable between the check in fsp_binlog_tablespace_close() and the wait in ibb_wait_durable_offset(), it would wait for new data to be written; this could cause a hang.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
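The class of bug described in this commit message (waiting without first checking the waited-for condition) is commonly avoided by testing the predicate under the same mutex before blocking. Below is a generic sketch using the C++ standard library; it is not the InnoDB code, and `DurableTracker` and its members are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

struct DurableTracker {
  std::mutex m;
  std::condition_variable cv;
  uint64_t durable_offset = 0;  // highest offset known to be durable

  // Called when the redo log reports that data up to `offset` is durable.
  void advance(uint64_t offset) {
    {
      std::lock_guard<std::mutex> lk(m);
      if (offset > durable_offset) durable_offset = offset;
    }
    cv.notify_all();
  }

  // Returns once everything up to `target` is durable. The predicate form
  // of wait() evaluates the condition before blocking, so a target that is
  // already durable returns immediately instead of waiting for a new write.
  void wait_durable(uint64_t target) {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [this, target] { return durable_offset >= target; });
    assert(lk.owns_lock());
  }
};
```

Because the check and the wait happen under one lock, an update that lands between the caller's earlier check and the wait cannot be missed.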

When an empty user XA transaction is committed (or rolled back), we do not need to binlog any real transaction, but we still need to binlog a rollback record to clear the XID for reuse and free the prepare record from purge and from recovery.

Also fix a bug: if adding to the InnoDB binlog internal XID hash fails (eg. duplicate), we must still ensure that the written XA prepare record is entered into the pending LSN fifo, so we can track when it becomes durable (there was a shutdown hang possible if a prepare record was last in the binlog and XID insert failed so the record was never marked durable).

Bugs found in RQG runs.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>

1. Handle RELEASE SAVEPOINT, removing any released savepoint from the list of savepoints pending in the cache. Also fix a bug in the server layer; RELEASE SAVEPOINT removes the specified savepoint _and_ any later savepoints; engines were not informed of the removal of the later ones, if any.
2. Fix a bug when spilling non-transactional statement data inside of a transaction using savepoints. The spill of the statement cache must not spill any savepoints, those apply only to the trx cache.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
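The RELEASE SAVEPOINT semantics from item 1 (the named savepoint and every later one are removed) can be sketched on a simple pending list. The `Savepoint` struct and function name are hypothetical and do not reflect the patch's data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Savepoint {
  std::string name;
  size_t cache_offset;  // position in the trx cache when it was set
};

// Removes the savepoint called `name` and all savepoints set after it.
// Returns how many entries were removed (0 if the name is not pending).
size_t release_savepoint(std::vector<Savepoint>& pending,
                         const std::string& name) {
  for (size_t i = pending.size(); i-- > 0;) {  // newest first
    if (pending[i].name == name) {
      const size_t removed = pending.size() - i;
      pending.erase(pending.begin() + static_cast<std::ptrdiff_t>(i),
                    pending.end());
      assert(pending.size() == i);
      return removed;
    }
  }
  return 0;
}
```

A caller that tracks savepoints per engine would need to apply the same truncation there, which is the notification that the commit message says was missing.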

If the user manually copies an engine-implemented binlog file and runs mysqlbinlog on it, any following or oob-referenced files may not be available to read from. Treat this as end-of-file rather than an error (so we can output at least any part of the file that is available). But still output a message about the failure to open the file, to give some indication why the dump stopped at that point.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>

…pl_sync.inc Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>

When using the InnoDB-implemented binlog with another transactional storage engine, or with explicit user XA transactions, recover such transactions consistently from the binlog at server startup.

When a transaction is prepared with an XID, the binlog records a "prepare" record containing the XID and link to the out-of-band replication event data. When a previously prepared transaction is committed, the commit record links to the oob data referenced from the prepare record, and the record is preceded by an "XA complete" record containing the XID. If instead a prepared transaction is rolled back, just an "XA complete" record is binlogged with the XID and a "rollback" flag.

While any prepared XA transactions are active, maintain in-memory reference counts in each binlog file, and in each binlog file record the file_no of the earliest binlog file containing any XID records of still active transactions.

When the server restarts (possibly after crash), look up the file_no of the earliest binlog file that may contain active XID records, if any. Scan the binlogs from that point and record any XID prepare or complete records. For any XID prepare record, record oob data and reference count, recovering the in-memory state present before the server restart. Return a hash to the server layer containing each active XID in the binlog and its state (prepared, committed, rolled back).

On the server layer, ask each engine for a list of pending XID in prepared state. If the binlog state of an XID is committed, commit in the engine. If the binlog state is rolled back or is missing, roll back in the engine. If the binlog state is prepared, _and_ all participating engines have the transaction prepared also, then leave the transaction prepared. If a binlog prepared transaction is missing from an engine, then roll it back in any other engines and in the binlog (this is to handle a crash in the middle of an XA PREPARE).

The result is that multi-engine (or non-InnoDB) transactions, as well as user XA transactions, will be recovered after a crash consistent with the binlog content.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
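The server-layer decision rules in the commit message above can be summarized as a small pure function. This is a sketch of the stated rules only; the enum and function names are hypothetical and the real recovery code has to handle far more state.

```cpp
#include <cassert>

// State of an XID as recovered from the binlog scan.
enum class BinlogXidState { Missing, Prepared, Committed, RolledBack };

// What to do with an engine transaction found in prepared state.
enum class RecoveryAction { Commit, Rollback, KeepPrepared };

// prepared_in_all_engines: true if every participating engine still has
// the transaction prepared (only relevant when the binlog says Prepared).
RecoveryAction decide_recovery(BinlogXidState binlog_state,
                               bool prepared_in_all_engines) {
  switch (binlog_state) {
    case BinlogXidState::Committed:
      return RecoveryAction::Commit;
    case BinlogXidState::Prepared:
      // A crash in the middle of XA PREPARE can leave some engine without
      // the transaction; then it is rolled back everywhere.
      return prepared_in_all_engines ? RecoveryAction::KeepPrepared
                                     : RecoveryAction::Rollback;
    case BinlogXidState::RolledBack:
    case BinlogXidState::Missing:
      return RecoveryAction::Rollback;
  }
  assert(false && "unhandled BinlogXidState");
  return RecoveryAction::Rollback;
}
```

Keeping the rules in one table-like function makes it easy to see that the binlog is the single source of truth and the engines only follow it.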

For CREATE TEMPORARY TABLE ... SELECT, InnoDB had code to not start a new transaction for the CREATE TEMPORARY (correct). But the code that handled failure for the SELECT part (ha_innobase::extra(HA_EXTRA_ABORT_ALTER_COPY)) was missing a check for CREATE TEMPORARY, so it would roll back the entire transaction, which is wrong, and could lead to inconsistency with binlog or other engines in the same transaction.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>

This happened when the first OOB record of an event group spans two binlog files, say N and N+1. The reference counting would wrongly attribute the OOB to N+1, allowing N to be purged while it was still needed. This for example could cause server restart to fail when it tries to recover the GTID state from N+1, unable to follow OOB references to N because it was purged before the server restart.

Fix by:
- Incrementing the OOB refcount _before_ binlogging the first OOB record.
- Decrementing the refcount only _after_ binlogging is complete.
- Protecting from purge any files referenced from the active file.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
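The three-part fix above can be illustrated with a simplified per-file reference counter. This is a hypothetical sketch, not the InnoDB implementation: `OobRefcounts`, its members, and the way the active file's references are tracked are all assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

struct OobRefcounts {
  // file_no -> number of active event groups whose first OOB record
  // starts in that file.
  std::map<uint64_t, uint64_t> refs;
  uint64_t active_file_no = 0;
  // Hypothetical: earliest file referenced from the active file.
  uint64_t earliest_ref_from_active = 0;

  // Call BEFORE writing the first OOB record, with the file in which the
  // write starts, so a record spilling into the next file is still
  // attributed to the file where it begins.
  void begin_group(uint64_t start_file_no) { ++refs[start_file_no]; }

  // Call only AFTER the event group has been completely binlogged.
  void end_group(uint64_t start_file_no) {
    std::map<uint64_t, uint64_t>::iterator it = refs.find(start_file_no);
    assert(it != refs.end() && it->second > 0);
    if (--it->second == 0) refs.erase(it);
  }

  bool can_purge(uint64_t file_no) const {
    return file_no < active_file_no &&
           file_no < earliest_ref_from_active &&  // not referenced from active file
           refs.count(file_no) == 0;              // no active group starts here
  }
};
```

In this sketch a file N stays protected from the moment a group is about to write its first record there until that group has finished, even if the record itself ends in N+1.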