DRAFT: MDEV-34705: Storing binlog in InnoDB #3775
base: 11.4
Hi @knielsen! This is very cool to see. I haven’t gone through with a fine-toothed comb, but I’d say overall I have a pretty good understanding of the idea, and have some high-level questions to start (nothing code-related yet, as there are still many things in flux, and I feel any code comments I’d make would be superseded in time anyway) A few questions on the subject of the out-of-band data to start:
Then generally to the API, your API seems to be encapsulated from the existing MYSQL_BIN_LOG/Event_log/TC_LOG APIs (which makes sense, as yours is more of an engine-plugin, vs the legacy approach is the handler). Though a few thoughts come from this:
Hi Brandon, thanks for taking a look at the patch and for initial
comments/questions!
Brandon Nesterenko ***@***.***> writes:
idea, and have some high-level questions to start (nothing
code-related yet, as there are still many things in flux, and I feel
any code comments I’d make would be superseded in time anyway)
Yes, agree.
A few questions on the subject of the out-of-band data to start:
1. By using the forest of perfectly balanced trees to index prior OOB
writes, my understanding is that the main advantage we get here is
faster reads (as opposed to just a stack, with each node referencing
the last written node), which I would think would be most noticeable
for semi-sync setups, once that is implemented.
The OOB is needed to:
1. Preserve the strict append-only way of writing the binlog data (so no
back-patching to update forward pointers in previous nodes).
2. Avoid having to read all pages from the file backwards through the stack
one by one before we can even send the first byte to the slave, which could
be very slow for large event groups (I've heard of users with 50 GB event
groups .oO).
3. Have an algorithm that needs O(log(N)) in-memory space as opposed to
O(N), where N is the size of the event group, and does not need to read
every page twice.
So yes, basically faster reads, but also to be scalable to large event
groups (ie. gigabytes) which can easily happen for row-based.
Is there more to be
gained? (e.g., I don’t think it would reduce the memory footprint, as
every node will be eventually referenced in-order to reconstruct the
full event anyway).
The reduction in memory footprint is mainly to not have to keep a list of
_all_ nodes in memory (O(N)) while traversing the stack, instead needing
only to keep the path from root to current leaf (O(log(N))). Replication in
general tries hard to not restrict the size of event groups/transactions to
available memory. An additional goal is to reduce I/O, including having to
read the file backwards which I fear could be inefficient on some systems.
2. I wonder if each binary log file should have a footer to summarize
the file dependencies, e.g. if there are OOB transactions, what is the
earliest binlog file needed to be able to reconstruct all transactions
in this log.
Yes, this is the plan. I think it can go in the file header (as opposed to
a footer), but this is still to be implemented; we will see what is needed.
2a) Likewise, for a transaction with OOB data, I wonder if we
should note the earliest binlog file needed to reconstruct that
transaction
Right, I think that is already there, if I understood you correctly?
In-memory we have handler_binlog_event_group_info::engine_ptr where we have
binlog_oob_context::first_node_file_no.
And on-disk, the commit record for the transaction has a pointer to the
first OOB node.
2b) My thought is that this would help on reporting problems
off-the-bat, e.g. mysqlbinlog could report missing binlog files
immediately, instead of writing data up-to a transaction with missing
OOB chunks.
Not sure this is possible in the general case.
If we start reading from binlog 10, and its header says it references back
to binlog 8, and binlog 8 (and 9) are available, then yes, we can be sure we
have everything we will need.
But the converse is not true. Even if binlog 8 is missing we can not be sure
we will need it just from reading the file 10 header. We might be starting
somewhere in the middle of file 10, after any references back to 8. Or we
might be starting from a multi-domain GTID position where each domain starts
at different positions in binlog 10. Or the OOB data in binlog 8 belongs to
a transaction that ended up rolling back and so will never be needed.
2c) My thought on a use-case for this is for really large
transactions, and (when eventually supported) 2-phase XA
transactions, where the PREPARE may happen long in the past.
Yes.
A future use case could be the ability to start optimistically replicating
large transactions on the slave before it has committed on the master.
3. (This is nothing to address now, but more to consider in the design
to support more things in the long term): I don’t think the
master-slave protocol supports this now, but perhaps we could
parallelize OOB reads, where the dump-thread would be sending OOB
chunks while another thread is buffering them. Just mentioning this
for consideration in the current design to be extensible for such
things.
Right. For the current master-slave protocol, we need to send the whole
event group as one consecutive stream of bytes, which is why we have the
forest-of-trees to efficiently do that. This could be very interesting to
think about extending to sending OOB data early to the slave (if that is
what you meant with "another thread buffering them"?). I agree it would be
very good to have more ideas for this to make the initial design suitably
extensible.
4. I wonder if OOB binlogging should be optional. E.g., say some
user’s workload is rollback heavy, that could result in binlogs
bloated with garbage.
Yes, I thought about that too, it should be easy to implement. Mostly just
omit the code that spills transaction cache data directly to the engine
instead of to a tmpfile, eg. skip this:
trx_cache.cache_log.write_function= binlog_spill_to_engine;
InnoDB has a limit of 1MB (or something like that) on mini-transactions. So
event groups larger than that still need to be split into multiple records.
But we can optionally keep all data in the transaction cache tmpfile until
commit, then hold LOCK_log and write all the pieces consecutively to the
binlog, so there is no more interleaving of oob data. (And then we can use a
normal commit record without any OOB nodes).
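A minimal sketch of how the optional mode could look (invented names; only trx_cache.cache_log.write_function and binlog_spill_to_engine come from the patch, and the 1MB mini-transaction limit is the approximate figure mentioned above): with the spill hook unset, data stays buffered until commit and is then split into consecutive mtr-sized records with no OOB interleaving.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumed approximate InnoDB mini-transaction size limit.
static const size_t MTR_LIMIT= 1 << 20;

// Hypothetical stand-in for the transaction cache; the real server
// buffers into a tmpfile rather than a string.
struct TrxCache {
  std::string buffered;
  void (*write_function)(TrxCache*, const char*, size_t)= nullptr;

  void write(const char *data, size_t len)
  {
    if (write_function)
      write_function(this, data, len);   // spill directly to engine as OOB
    else
      buffered.append(data, len);        // keep everything until commit
  }
};

// At commit with spilling disabled: write the cache as consecutive
// records, each at most MTR_LIMIT bytes.
static std::vector<size_t> commit_chunks(const TrxCache &c)
{
  std::vector<size_t> sizes;
  for (size_t off= 0; off < c.buffered.size(); off+= MTR_LIMIT)
    sizes.push_back(std::min(MTR_LIMIT, c.buffered.size() - off));
  return sizes;
}
```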
Then generally to the API, your API seems to be encapsulated from the
existing MYSQL_BIN_LOG/Event_log/TC_LOG APIs (which makes sense, as
yours is more of an engine-plugin, vs the legacy approach is the
handler). Though a few thoughts come from this:
Right.
My idea is to focus, for the first release of this (already huge) project,
on getting stable functionality on all the user-facing parts. Thus, if possible,
I would keep the existing logic on the server level during commit (LOCK_log,
queue_for_group_commit(), ...). Then in the next version we can focus on
improving scalability even more by avoiding a lot of this machinery in the
common case where there is no multi-engine transaction.
But the details of this are still a ToDo; more changes might be needed to
get a functional first release.
5. Are we losing the ability for the binary log (stored in innodb) to
function as a transaction coordinator? We’ve previously talked about
that being the case, where multi-engine transactions aren’t yet
supported. Do you have any new thoughts on how that would (eventually)
be implemented?
I think the new binlog will work more or less the same way as the existing
one wrt. being the transaction coordinator for multi-engine transactions.
Thus, in the old binlog:
RocksDB.prepare(); InnoDB.prepare(); binlog.write(); RocksDB.commit(); InnoDB.commit();
In the new binlog, the InnoDB prepare() is omitted:
RocksDB.prepare(); InnoDB.binlog_and_commit(); RocksDB.commit();
The XA crash recovery algorithm then needs to be able to read from the new
binlog implementation to find RocksDB prepared transactions and see if they
were binlogged or not. We still have the XID events needed for this. I would
avoid the complexity of the binlog checkpoints for the new binlog case, and
just require the extra engines to fsync() again in their commit, for
simplicity.
The challenge is mostly whether we have any functioning RocksDB or other
transactional engine to test against. My (limited) understanding is that
there are not really any resources to maintain RocksDB currently. So it
might make sense to just disable multi-engine transactions in the first
version, to focus on getting the more important use-cases stable first (and
not accidentally turn MDEV-34705 into a fix-RocksDB-XA-bugs project...).
6. Your binlog-in-engine API functions are hard-coded into the handler
type. Could it instead be a separate class, perhaps used to initialize
the handler? I wonder if that design would make the transaction
coordinator piece easier to implement (e.g. perhaps via a shared
in-engine binlog/context that multiple handlers use).
Yes, at least I do foresee changes to precisely how the extensions to the
storage engine API will look, even if the underlying functionality would
stay much the same. I deliberately tried not to spend too much time on this
so far, as I want to solicit input for any preference on how the storage API
should look. Eg. Serg traditionally was much involved in storage API
extensions, Sergei Petrunia also knows a lot about it from RocksDB work.
Adding handlerton methods is the traditional way, though I did add the class
handler_binlog_reader() as some small step towards a more C++ interface.
Note that traditionally, storage engines could be written in C (InnoDB was
pure C for many years, except handler/ha_innodb.cc perhaps). I am not sure
what the preferred style is nowadays, I am open to using any preferred
style, or to some new better style if there are no existing preferences.
7. Instead of having MYSQL_BIN_LOG use a lot of conditionals reliant
around opt_binlog_engine_hton, I wonder if it can be refactored to
prefer some composition-based strategy which is set-up at
initialization time (e.g. just give it the right class instance).
Definitely, this needs to be improved in the current patch.
My approach is to start with this mess-of-conditionals to get an overview of
all the parts of the server-layer code that are affected. But then something
needs to be done to get a cleaner interface, using composition and/or whatever
is needed, still a ToDo.
This might also be extended in a follow-up patch to do more general
refactor/cleanup of the interfaces around binlog, which has become quite
convoluted and hard to work with. I know Monty has ideas for splitting the
relay log and the binlog into separate classes, to similarly avoid many of
the is_relay_log() conditionals.
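A sketch of what the composition-based setup could look like (hypothetical names only, not the actual refactoring): the backend is chosen once at initialization, and virtual dispatch replaces the per-call opt_binlog_engine_hton conditionals.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical backend interface; in the real server this would carry the
// binlog operations that currently branch on opt_binlog_engine_hton.
struct BinlogBackend {
  virtual ~BinlogBackend()= default;
  virtual std::string name() const= 0;
};

struct LegacyFileBinlog : BinlogBackend {
  std::string name() const override { return "legacy-file"; }
};

struct EngineBinlog : BinlogBackend {
  std::string name() const override { return "engine"; }
};

// Decided once at server startup; the rest of the code just calls through
// the interface with no conditionals.
static std::unique_ptr<BinlogBackend> make_backend(bool engine_binlog)
{
  if (engine_binlog)
    return std::unique_ptr<BinlogBackend>(new EngineBinlog());
  return std::unique_ptr<BinlogBackend>(new LegacyFileBinlog());
}
```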
Again, thanks for comments, and happy to answer any follow-up questions as
always.
- Kristian.
@knielsen Before I can do any meaningful analysis of this implementation, I think higher level intro would be nice. Especially transactional behavior is interesting to me.
Jan Lindström ***@***.***> writes:
@knielsen Before I can do any meaningful analysis of this
implementation, I think higher level intro would be nice. Especially
transactional behavior is interesting to me.
* Currently, there is 2PC between binlog, innodb and wsrep. Is this somehow going to change?
Initially, I think nothing is going to change for Galera/wsrep. Galera will
continue to use the existing binlog implementation, and will not support
--binlog-storage-engine=innodb.
Longer-term, I do believe it could be beneficial for Galera to support using
the new binlog format. Having the binlog write part integrated with the
InnoDB should both improve performance and reduce complexity, just as it
does for other binlog-related code.
The binlog events in event groups (ie. write-sets) remain unchanged with
this implementation. Is it so that Galera can run without binlog enabled at
all? In that mode, the old or new binlog format might make very little
difference at all, if any.
To eventually enable Galera to use the new binlog format when it is running
with binlog enabled will require that Galera is properly integrated into the
MariaDB server layer wrt. binlog, group commit, GTID allocation, and
transaction coordinator. As has been discussed before, Galera needs to take
over the role of transaction coordinator, implement the TC_LOG interface and
take over the role of binlogging that is currently done by the MYSQL_BIN_LOG
subclass of TC_LOG. We do not want to carry over the current endless
problems with Galera inconsistency with binlog GTIDs, nor do we want the
mess of tons of #ifdef WITH_WSREP / IF_WSREP() in the new binlog code.
* There is GTID and there is XID where wsrep overwriting XID with
wsrep XID during 2PC prepare, this most likely is going to change, but
do we really need all these?
I am not familiar with "wsrep overwriting XID with wsrep XID". But this
is not going to change for Galera initially, the existing binlog will
remain, the new binlog implementation will be optional.
* Normally something is written to binlog at commit as transaction is
replicated on commit, but streaming replication is sending fragments
of transactions even before commit, is this still possible?
I am not familiar with what streaming replication is (it sounds like it is
Galera sending partial write-sets around the cluster during transaction
execution to reduce stall around commit).
There is something similar in the new binlog implementation, where large
transactions are written in smaller pieces interleaved with other
transactions during execution and before commit. This is referred to as
"out-of-band" (OOB) binlogging.
* wsrep writes its XID on InnoDB rollback segments, can we get one
single XYZ that would identify transaction GTID, XID, node UUID etc ?
GTID will remain, it is the very core of MariaDB replication and binlogging.
I do not think XID is needed in the new binlog in the normal case where only
InnoDB is used in a transaction, but it will still be needed for
multi-engine transactions (eg. InnoDB+RocksDB).
If Galera would also use GTID as the single identifier, by implementing
TC_LOG as mentioned above, it should be beneficial. But this is optional
whether Galera devs want to spend effort on this. It would be a big change,
I suppose, and would probably diverge from how Galera works in the MySQL codebase.
- Kristian.
if (opt_binlog_engine_hton && value)
{
  sql_print_information("Value of binlog_checksum forced to NONE since binlog_storage_engine is enabled, and InnoDB uses its own superior checksumming of pages");
  value= 0;
}
I’m not sure it is helpful to mention InnoDB here. It would make sense to use the same binlog block format, no matter which storage engine is taking care of the durability (that is, the write-ahead logging).
Marko Mäkelä ***@***.***> writes:
@dr-m commented on this pull request.
> + if (opt_binlog_engine_hton && value)
+ {
+ sql_print_information("Value of binlog_checksum forced to NONE since
binlog_storage_engine is enabled, and InnoDB uses its own superior
checksumming of pages");
+ value= 0;
+ }
I’m not sure it is helpful to mention InnoDB here. It would make sense
to use the same binlog block format, no matter which storage engine is
taking care of the durability (that is, the write-ahead logging).
Agree, this is on the server layer, and could in the future be using a
different storage engine binlog implementation, I'll reword in a generic
way.
- Kristian.
extern uint64_t last_created_binlog_file_no;
extern std::atomic<uint64_t> binlog_cur_written_offset[2];
extern std::atomic<uint64_t> binlog_cur_end_offset[2];
extern std::atomic<uint64_t> binlog_cur_durable_offset[4];
@knielsen just skimming some of your latest commits, not a formal review, but one thought here would be to have sub-classes for a durable vs non-durable reader, instead of the same reader which is always set up for durable & non-durable reads with a flag in the constructor that tells it which mode to use.
Adjust an assertion that checks that no unspilled savepoints are left after spilling the trx cache as OOB data to the engine binlog. If the savepoint is at the very end of the trx cache when we spill, there is no need to spill that particular savepoint to the engine (and we do not do so); the savepoint can still be rolled back to by truncating the in-memory part of the cache. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
When the tablespace is closed during shutdown, it waits for the last record written in the tablespace to be durable in the InnoDB redo log. There was an obvious mistake/race in the code in function ibb_wait_durable_offset(), which did not check for the waited-for condition before doing the wait. Thus if the last record in the binlog file became durable between the check in fsp_binlog_tablespace_close() and the wait in ibb_wait_durable_offset(), it would wait for new data to be written; this could cause a hang. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
When an empty user XA transaction is committed (or rolled back), we do not need to binlog any real transaction, but we still need to binlog a rollback record to clear the XID for reuse and free the prepare record from purge and from recovery. Also fix a bug that in case of error adding to the innodb binlog internal xid hash (eg. duplicate), we must still ensure that the written XA prepare record is entered into the pending LSN fifo, so we can track when it becomes durable (there was a shutdown hang possible if a prepare record was last in the binlog and XID insert failed so the record was never marked durable). Bugs found in RQG runs. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
1. Handle RELEASE SAVEPOINT, removing any released savepoint from the list of savepoints pending in the cache. Also fix a bug in the server layer; RELEASE SAVEPOINT removes the specified savepoint _and_ any later savepoints; engines were not informed of the removal of the later ones, if any. 2. Fix a bug when spilling non-transactional statement data inside of a transaction using savepoints. The spill of the statement cache must not spill any savepoints, those apply only to the trx cache. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
If the user copies manually an engine-implemented binlog file and runs mysqlbinlog on it, any following or oob-referenced files may not be available to read from. Treat this as end-of-file rather than an error (so we can output at least any part of the file that is available). But still output a message about the failure to open the file, to give some indication why the dump stopped at that point. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
…pl_sync.inc Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
When using the InnoDB-implemented binlog with another transactional storage engine, or with explicit user XA transactions, recover such transactions consistently from the binlog at server startup.

When a transaction is prepared with an XID, the binlog records a "prepare" record containing the XID and a link to the out-of-band replication event data. When a previously prepared transaction is committed, the commit record links to the oob data referenced from the prepare record, and the record is preceded by an "XA complete" record containing the XID. If instead a prepared transaction is rolled back, just an "XA complete" record is binlogged with the XID and a "rollback" flag.

While any prepared XA transactions are active, maintain in-memory reference counts in each binlog file, and in each binlog file record the file_no of the earliest binlog file containing any XID records of still active transactions.

When the server restarts (possibly after a crash), look up the file_no of the earliest binlog file that may contain active XID records, if any. Scan the binlogs from that point and record any XID prepare or complete records. For any XID prepare record, record oob data and reference count, recovering the in-memory state present before the server restart. Return a hash to the server layer containing each active XID in the binlog and its state (prepared, committed, rolled back).

On the server layer, ask each engine for a list of pending XIDs in prepared state. If the binlog state of an XID is committed, commit in the engine. If the binlog state is rolled back or is missing, roll back in the engine. If the binlog state is prepared, _and_ all participating engines also have the transaction prepared, then leave the transaction prepared. If a binlog-prepared transaction is missing from an engine, then roll it back in any other engines and in the binlog (this handles a crash in the middle of an XA PREPARE).
The result is that multi-engine (or non-InnoDB) transactions, as well as user XA transactions, will be recovered after a crash consistent with the binlog content. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
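The recovery decision rules from the commit message above can be sketched as follows (illustrative names only, not the real server code): the binlog scan yields a map from XID to binlog state, each engine reports its still-prepared XIDs, and the server applies one rule per XID.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical model of the XA recovery reconciliation:
//  - binlog-committed            -> commit in the engine
//  - binlog-rolled-back/missing  -> roll back in the engine
//  - binlog-prepared             -> keep prepared only if every
//    participating engine also has it prepared; otherwise roll back
//    everywhere (crash in the middle of XA PREPARE).
enum class BinlogState { PREPARED, COMMITTED, ROLLED_BACK };
enum class Action { COMMIT, ROLLBACK, KEEP_PREPARED };

static Action recover_xid(const std::string &xid,
                          const std::map<std::string, BinlogState> &binlog,
                          bool prepared_in_all_engines)
{
  std::map<std::string, BinlogState>::const_iterator it= binlog.find(xid);
  if (it == binlog.end() || it->second == BinlogState::ROLLED_BACK)
    return Action::ROLLBACK;               // never binlogged, or aborted
  if (it->second == BinlogState::COMMITTED)
    return Action::COMMIT;
  return prepared_in_all_engines ? Action::KEEP_PREPARED : Action::ROLLBACK;
}
```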
For CREATE TEMPORARY TABLE ... SELECT, InnoDB had code to not start a new transaction for the CREATE TEMPORARY (correct). But the code that handled failure for the SELECT part (ha_innobase::extra(HA_EXTRA_ABORT_ALTER_COPY)) was missing a check for CREATE TEMPORARY, so it would roll back the entire transaction, which is wrong, and could lead to inconsistency with binlog or other engines in the same transaction. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
This happened when the first OOB record of an event group spans two binlog files, say N and N+1. The reference counting would wrongly attribute the OOB to N+1, allowing N to be purged while it was still needed. This could for example cause server restart to fail when it tries to recover the GTID state from N+1, unable to follow OOB references to N because it was purged before the server restart. Fix by: - Incrementing the OOB refcount _before_ binlogging the first OOB record. - Decrementing the refcount only _after_ binlogging is complete. - Protecting from purge any files referenced from the active file. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Add user documentation for the new binlog implementation. And add error messages for the remaining configuration options that are not available with the new binlog. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The GTID position at slave connect is found from the GTID state records written at the start and at every --innodb-binlog-state-interval bytes of the binlog files. There was a bug that for a binlog group commit, the binlog state written was the one corresponding to the last GTID in the group, regardless of where during the binlogging of the group it was written. Thus, it could mistakenly write a GTID state record of 0-1-10, say, followed by a lower GTID 0-1-9. This could cause a slave connecting at 0-1-10 to receive an extra GTID 0-1-9, and replication would diverge. Fix by maintaining a full GTID binlog state inside the engine binlog, same as is done for the differential state. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
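The fix described above can be sketched as maintaining the full GTID binlog state incrementally as each GTID is written (hypothetical names, not the real code): a state record emitted at any point then reflects exactly the GTIDs binlogged so far, never a later GTID from the same group commit.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

// Minimal model of a GTID as used in MariaDB replication: D-S-N.
struct Gtid { uint32_t domain_id, server_id; uint64_t seq_no; };

// Hypothetical in-engine binlog state: for each domain, the last GTID
// actually written to the binlog so far.
struct GtidBinlogState {
  std::map<uint32_t, std::pair<uint32_t, uint64_t> > state;

  void update(const Gtid &g)        // called as each GTID is binlogged
  { state[g.domain_id]= std::make_pair(g.server_id, g.seq_no); }

  uint64_t last_seq_no(uint32_t domain) const
  {
    std::map<uint32_t, std::pair<uint32_t, uint64_t> >::const_iterator it=
      state.find(domain);
    return it == state.end() ? 0 : it->second.second;
  }
};
```

A state record written between GTIDs 0-1-9 and 0-1-10 of one group commit then correctly reports 0-1-9, so a slave connecting at that record cannot receive 0-1-9 twice.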
If binlog files are deleted or otherwise unreadable during server restart, don't make the server unstartable. Instead, start up, recovering what is available, but complaining in the error log. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
…ackup Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
…ithout "ddl" mark on the GTID This patch fixes that ALTER TABLE can call wakeup_subsequent_commits() too early and allow following event groups to commit out-of-order in parallel replication. Fixed by calling suspend_subsequent_commits() at the start of the ALTER. Could be seen as an assertion: !tmp_gco->next_gco || tmp_gco->last_sub_id > sub_id (Normally this is prevented because an ALTER TABLE will run in its own GCO, and thus no following event groups can even start; however the missing DDL mark caused by MDEV-38429 made this visible. And calling wakeup_subsequent_commits() too early is wrong in any case). Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
…ithout "ddl" mark on the GTID When ddl log recovery needs to binlog during the crash recovery, the GTID was binlogged without the required "ddl" marker. This caused wrong behaviour on the slave when using parallel replication. Fixed by explicitly marking the "current statement" as DDL when binlogging in ddl log crash recovery. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
The error handling path forgot to unlock the LOCK_log mutex, hanging the server or causing assertion mysql_mutex_assert_not_owner. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
FLUSH BINARY LOGS before dumping, to make sure the file is on disk and not get different mysqlbinlog output depending on timing. Treat completely empty (all zeros) file the same as file with the header page written but no events yet. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
There was a race where a new GTID could be allocated (but not written to the binlog) during the FLUSH, so that the GTID state written at the start of the new binlog file was incorrect. This in turn could lead to a duplicate GTID being sent to the slave if it happens to reconnect at that exact point. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
…f InnoDB redo log The code for binlogging out-of-band data was missing an appropriate call to log_free_check(). This call is needed to throttle write activity and wait for an InnoDB checkpoint, when the redo log is too small (or otherwise has insufficient space available) to accommodate the write activity. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
SAVEPOINT inside a trigger doesn't work correctly. Setting a savepoint inside a trigger somehow loses the implicit savepoint set at transaction start, so that the partial changes are left if the statement later fails. Referencing an existing savepoint claims the savepoint does not exist (and it is in any case very unclear what exactly it should mean to rollback to a savepoint from the middle of a statement, or set in the middle of a prior statement). These problems are independent of binlog-in-engine, but in the new binlog implementation we are trying to make things work more correctly and robustly, so let's disallow use of savepoints inside triggers. The new binlog is off by default, so backwards compatibility is less of a concern, though arguably disallowing savepoints in triggers would be better done unconditionally. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Don't use (and crash on) any --binlog-directory option specified for --backup, always use the value fetched from the running server. Ensure a slash in-between path components when using a relative path for options such as --innodb-undo-directory and --binlog-directory. Clarify the description of the --binlog-directory option. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
…nlog_dir Remove the --innodb-undo-directory=undos command-line argument from the test, as it causes failures when the test suite is run from distro package and the test directory is not writeable, and it's not relevant for what is being tested in that test case. Signed-off-by: Kristian Nielsen <knielsen@knielsen-hq.org>
Draft pull request for work-in-progress on the MDEV-34705 binlog-in-engine feature
A new option --binlog-storage-engine=ENGINE moves the binlog implementation into the storage engine, for supporting engines (currently only InnoDB).
InnoDB implements the binlog files as a new type of tablespace, and uses its redo log to make the binlog crash-safe without the overhead and complexity of two-phase commit.