Skip to content

Conversation

@knielsen
Copy link
Member

Draft pull request for work-in-progress on the MDEV-34075 binlog-in-engine feature

A new option --binlog-storage-engine=ENGINE moves the binlog implementation into the storage engine, for supporting engines (currently only InnoDB).

InnoDB implements the binlog files as a new type of tablespace, and uses its redo log to make the binlog crash-safe without the overhead and complexity of two-phase commit.

@CLAassistant
Copy link

CLAassistant commented Jan 17, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ bnestere
❌ knielsen
You have signed the CLA already but the status is still pending? Let us recheck it.

@knielsen knielsen force-pushed the knielsen_binlog_in_engine branch 2 times, most recently from aa64f73 to 18932be Compare January 17, 2025 20:07
@cvicentiu cvicentiu added the MariaDB Foundation Pull requests created by MariaDB Foundation label Jan 22, 2025
@bnestere bnestere self-assigned this Feb 8, 2025
@bnestere bnestere self-requested a review February 8, 2025 15:34
@bnestere
Copy link
Contributor

Hi @knielsen!

This is very cool to see. I haven’t gone through with a fine-toothed comb, but I’d say overall I have a pretty good understanding of the idea, and have some high-level questions to start (nothing code-related yet, as there are still many things in flux, and I feel any code comments I’d make would be superseded in time anyway)

A few questions on the subject of the out-of-band data to start:

  1. By using the forest of perfectly balanced trees to index prior OOB writes, my understanding is that the main advantage we get here is faster reads (as opposed to just a stack, with each node referencing the last written node), which I would think would be most noticeable for semi-sync setups, once that is implemented. Is there more to be gained? (e.g., I don’t think it would reduce the memory footprint, as every node will be eventually referenced in-order to reconstruct the full event anyway).

  2. I wonder if each binary log file should have a footer to summarize the file dependencies, e.g. if there are OOB transactions, what is the earliest binlog file needed to be able to reconstruct all transactions in this log.

    2a) Likewise, for a transaction with OOB data, I wonder if we should file the earlier binlog file needed to reconstruct that transaction

    2b) My thought is that this would help on reporting problems off-the-bat, e.g. mysqlbinlog could report missing binlog files immediately, instead of writing data up-to a transaction with missing OOB chunks.

    2c) My thought on a use-case for this is for really large transactions, and (when eventually supported) 2-phase XA
    transactions, where the PREPARE may happen long in the past.

  3. (This is nothing to address now, but more to consider in the design to support more things in the long term): I don’t think the master-slave protocol supports this now, but perhaps we could parallelize OOB reads, where the dump-thread would be sending OOB chunks while another thread is buffering them. Just mentioning this for consideration in the current design to be extensible for such things.

  4. I wonder if OOB binlogging should be optional. E.g., say some user’s workload is rollback heavy, that could result in binlogs bloated with garbage.

Then generally to the API, your API seems to be encapsulated from the existing MYSQL_BIN_LOG/Event_log/TC_LOG APIs (which makes sense, as yours is more of an engine-plugin, vs the legacy approach is the handler). Though a few thoughts come from this:

  1. Are we losing the ability for the binary log (stored in innodb) to function as a transaction coordinator? We’ve previously talked about that being the case, where multi-engine transactions aren’t yet supported. Do you have any new thoughts on how that would (eventually) be implemented?

  2. Your binlog-in-engine API functions are hard-coded into the handler type. Could it instead be a separate class, perhaps used to initialize the handler? I wonder if that design would make the transaction coordinator piece easier to implement (e.g. perhaps via a shared in-engine binlog/context that multiple handlers use).

  3. Instead of having MYSQL_BIN_LOG use a lot of conditionals reliant around opt_binlog_engine_hton, I wonder if it can be refactored to prefer some composition-based strategy which is set-up at initialization time (e.g. just give it the right class instance).

@knielsen
Copy link
Member Author

knielsen commented Feb 12, 2025 via email

@janlindstrom
Copy link
Contributor

@knielsen Before I can do any meaningful analysis of this implementation, I think higher level intro would be nice. Especially transactional behavior is interesting to me.

  • Currently, there is 2PC between binlog, innodb and wsrep. Is this somehow going to change?
  • There is GTID and there is XID where wsrep overwriting XID with wsrep XID during 2PC prepare, this most likely is going to change, but do we really need all these?
  • Normally something is written to binlog at commit as transaction is replicated on commit, but streaming replication is sending fragments of transactions even before commit, is this still possible?
  • wsrep writes its XID on InnoDB rollback segments, can we get one single XYZ that would identify transaction GTID, XID, node UUID etc ?

@knielsen
Copy link
Member Author

knielsen commented Feb 12, 2025 via email

@knielsen knielsen force-pushed the knielsen_binlog_in_engine branch from 407471e to 061fe31 Compare March 2, 2025 10:41
@knielsen knielsen force-pushed the knielsen_binlog_in_engine branch from 37a89db to 0dc288e Compare March 12, 2025 17:10
@knielsen knielsen force-pushed the knielsen_binlog_in_engine branch 5 times, most recently from 4fa615f to 6a3f998 Compare March 22, 2025 20:19
@bnestere bnestere added the Replication Patches involved in replication label Mar 25, 2025
@knielsen knielsen force-pushed the knielsen_binlog_in_engine branch 2 times, most recently from 8f6b69d to 0327708 Compare April 7, 2025 19:46
@knielsen knielsen force-pushed the knielsen_binlog_in_engine branch from 27d8eaa to 55a69e8 Compare May 14, 2025 10:56
@knielsen knielsen force-pushed the knielsen_binlog_in_engine branch from dc07708 to 30bbac9 Compare June 1, 2025 19:50
@knielsen knielsen force-pushed the knielsen_binlog_in_engine branch 4 times, most recently from 78253de to d7a3ece Compare July 3, 2025 12:47
Comment on lines 13626 to 14489
if (opt_binlog_engine_hton && value)
{
sql_print_information("Value of binlog_checksum forced to NONE since binlog_storage_engine is enabled, and InnoDB uses its own superior checksumming of pages");
value= 0;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not sure it is helpful to mention InnoDB here. It would make sense to use the same binlog block format, no matter which storage engine is taking care of the durability (that is, the write-ahead logging).

@knielsen knielsen force-pushed the knielsen_binlog_in_engine branch from d06fb59 to 2b0547d Compare July 18, 2025 14:00
@knielsen
Copy link
Member Author

knielsen commented Jul 18, 2025 via email

extern uint64_t last_created_binlog_file_no;
extern std::atomic<uint64_t> binlog_cur_written_offset[2];
extern std::atomic<uint64_t> binlog_cur_end_offset[2];
extern std::atomic<uint64_t> binlog_cur_durable_offset[4];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@knielsen just skimming some of your latest commits, not a formal review, but one thought here would be to have sub-classes for a durable vs non-durable reader, instead of the same reader which is always set up for durable & non-durable reads with a flag in the constructor that tells it which mode to use.

Adjust an assertion that checks that no unspilled savepoints are left after
spilling the trx cache as OOB data to the engine binlog.

If the savepoint is at the very end of the trx cache when we spill, there is
no need to spill that particular savepoint to the engine (and we do not do
so); the savepoint can still be rolled back to by truncating the in-memory
part of the cache.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
When the tablespace is closed during shutdown, it waits for the last record
written in the tablespace to be durable in the InnoDB redo log.

There was an obvious mistake/race in the code in function
ibb_wait_durable_offset(), which did not check for the waited-for condition
before doing the wait.

Thus if the last record in the binlog file became durable between the check
in fsp_binlog_tablespace_close() and the wait in ibb_wait_durable_offset(),
it would wait for new data to be written; this could cause a hang.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
When an empty user XA transaction is committed (or rolled back), we do
not need to binlog any real transaction, but we still need to binlog a
rollback record to clear the XID for reuse and free the prepare record from
purge and from recovery.

Also fix a bug that in case of error adding to the innodb binlog internal
xid hash (eg. duplicate), we must still ensure that the written XA prepare
record is entered into the pending LSN fifo, so we can track when it becomes
durable (there was a shutdown hang possible if a prepare record was last in
the binlog and XID insert failed so the record was never marked durable).

Bugs found in RQG runs.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
1. Handle RELEASE SAVEPOINT, removing any released savepoint from the list
of savepoints pending in the cache. Also fix a bug in the server layer;
RELEASE SAVEPOINT removes the specified savepoint _and_ any later
savepoints; engines were not informed of the removal of the later ones, if
any.

2. Fix a bug when spilling non-transactional statement data inside of a
transaction using savepoints. The spill of the statement cache must not
spill any savepoints, those apply only to the trx cache.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
If the user copies manually an engine-implemented binlog file and runs
mysqlbinlog on it, any following or oob-referenced files may not be
available to read from. Treat this as end-of-file rather than an error
(so we can output at least any part of the file that is available).
But still output a message about the failure to open the file, to give
some indication why the dump stopped at that point.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
…pl_sync.inc

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
When using the InnoDB-implemented binlog with another transactional storage
engine, or with explicit user XA transactions, recover such transactions
consistently from the binlog at server startup.

When a transaction is prepared with an XID, the binlog records a "prepare"
record containing the XID and link to the out-of-band replication event
data.

When a previously prepared transaction is committed, the commit record links
to the oob data referenced from the prepare record, and the record is
preceeded by an "XA complete" record containing the XID.

If instead a prepared transaction is rolled back, just an "XA complete"
record is binlogged with the XID and a "rollback" flag.

While any prepared XA transactions are active, maintain in-memory reference
counts in each binlog file, and in each binlog file record the file_no of
the earliest binlog file containing any XID records of still active
transactions.

When the server restarts (possibly after crash), look up the file_no of the
earliest binlog file that may contain active XID records, if any. Scan the
binlogs from that point and record any XID prepare or complete records.

For any XID prepare record, record oob data and reference count, recovering
the in-memory state present before the server restart. Return a hash to the
server layer containing each active XID in the binlog and its state
(prepared, committed, rolled back).

On the server layer, ask each engine for a list of pending XID in prepared
state. If the binlog state of an XID is committed, commit in the engine. If
the binlog state is rolled back or is missing, roll back in the engine. If
the binlog state is prepared, _and_ all participating engines have the
transaction prepared also, then leave the transaction prepared. If a binlog
prepared transaction is missing from an engine, then roll it back in any
other engines and in the binlog (this is to handle a crash in the middle of
an XA PREPARE).

The result is that multi-engine (or non-InnoDB) transactions, as well as
user XA transactions, will be recovered after a crash consisent with the
binlog content.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
For CREATE TEMPORARY TABLE ... SELECT, InnoDB had code to not start a new
transaction for the CREATE TEMPORARY (correct). But the code that handled
failure for the SELECT part (ha_innobase::extra(HA_EXTRA_ABORT_ALTER_COPY))
was missing a check for CREATE TEMPORARY, so it would roll back the entire
transaction, which is wrong, and could lead to inconsistency with binlog or
other engines in the same transaction.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
This happened when the first OOB record of an event group spans two binlog
files, say N and N+1. The reference counting would wrongly attribute the OOB
to N+1, allowing N to be purged while it was still needed.

This for example could cause server restart to fail when it tries to recover
the GTID state from N+1, unable to follow OOB references to N because it was
purged before the server restart.

Fix by:
 - Increment OOB refcount _before_ binlogging the first OOB record.
 - Decrement refcount only _after binlogging complete.
 - Protecting from purge any files referenced from the active file.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Add user documentation for the new binlog implementation. And add error messages for the remaining configuration options that are not available with the new binlog.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The GTID position at slave connect is found from the GTID state records
written at the start and at every --innodb-binlog-state-interval bytes of
the binlog files. There was a bug that for a binlog group commit, the binlog
state written was the one corresponding to the last GTID in the group,
regardless of where during the binlogging of the group it was written. Thus,
it could mistakenly write a GTID state record of 0-1-10, say, followed by a
lower GTID 0-1-9. This could cause a slave connecting at 0-1-10 to receive
an extra GTID 0-1-9, and replication would diverge.

Fix by maintaining a full GTID binlog state inside the engine binlog, same
as is done for the differential state.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
If binlog files are deleted or otherwise unreadable during server restart,
don't make the server unstartable. Instead, start up, recovering what is
available, but complaining in the error log.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
…ackup

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
…ithout "ddl" mark on the GTID

This patch fixes that ALTER TABLE can call wakeup_subsequent_commits() too
early and allow following event groups to commit out-of-order in parallel
replication. Fixed by calling suspend_subsequent_commits() at the start
of the ALTER.

Could be seen as an assertion:
  !tmp_gco->next_gco || tmp_gco->last_sub_id > sub_id

(Normally this is prevented because an ALTER TABLE will run in its own GCO,
and thus no following event groups can even start; however the missing DDL
mark caused by MDEV-38429 made this visible. And calling
wakeup_subsequent_commits() too early is wrong in any case).

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
…ithout "ddl" mark on the GTID

When ddl log recovery needs to binlog during the crash recovery, the
GTID was binlogged without the required "ddl" marker. This caused wrong
behaviour on the slave when using parallel replication.

Fixed by explicitly marking the "current statement" as DDL when binlogging
in ddl log crash recovery.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The error handling path forgot to unlock the LOCK_log mutex, hanging the
server or causing assertion mysql_mutex_assert_not_owner.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
FLUSH BINARY LOGS before dumping, to make sure the file is on disk and not
get different mysqlbinlog output depending on timing.

Treat completely empty (all zeros) file the same as file with the header
page written but no events yet.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
There was a race where a new GTID could be allocated (but not written to the
binlog)during the FLUSH, so that the GTID state written at the start of the
new binlog file was incorrect. This in turn could lead to duplicate GTID
being sent to the slave if it happens to reconnect at that exact point.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
…f InnoDB redo log

The code for binlogging out-of-band data was missing an appropriate call to
log_free_check(). This call is needed to throttle write activity and wait
for an InnoDB checkpoint, when the redo log is too small (or otherwise has
insufficient space available) to accomodate the write activity.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
SAVEPOINT inside a trigger doesn't work correctly. Setting a savepoint
inside a trigger somehow loses the implicit savepoint set at transaction
start, so that the partial changes are left if the statement later fails.
Referencing an existing savepoint claims the savepoint does not exist (and
it is in any case very unclear what exactly it should mean to rollback to a
savepoint from the middle of a statement, or set in the middle of a prior
statement).

These problems are independent of binlog-in-engine, but in the new binlog
implementation we are trying to make things work more correctly and
robustly, so let's disallow use of savepoints inside triggers. The new
binlog is off by default, so backwards compatibility is less of a concern,
though arguably disallowing savepoints in triggers would be better done
unconditionally.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Don't use (and crash on) any --binlog-directory option specified for
--backup, always use the value fetched from the running server.

Ensure a slash in-between path components when using a relative path for options
such as --innodb-undo-directory and --binlog-directory.

Clarify the description of the --binlog-directory option.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
@knielsen knielsen force-pushed the knielsen_binlog_in_engine branch from f17a54f to 94a302d Compare January 16, 2026 22:41
…nlog_dir

Remove the --innodb-undo-directory=undos command-line argument from the
test, as it causes failures when the test suite is run from distro package
and the test directory is not writeable, and it's not relevant for what is
being tested in that test case.

Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

MariaDB Foundation Pull requests created by MariaDB Foundation Replication Patches involved in replication

Development

Successfully merging this pull request may close these issues.

7 participants