参数引发的复制中断:max_binlog_cache_size 导致 SQL 线程异常的复现与分析

📅 2026/6/30 16:38:26
参数引发的复制中断:max_binlog_cache_size 导致 SQL 线程异常的复现与分析
问题发生某个客户的一套数据库主库为 MySQL 5.7.44从库为 GreatSQL 8.0.32-27从库的 SQL 线程中断报错如下2026-05-07T21:53:42.15185608:00 145 [ERROR] [MY-010584] [Repl] Slave SQL for channel : Worker 1 failed executing transaction 18c68804-62e7-11f1-846e-00163ed55da5:3425315180 at master log binlog.022823, end_log_pos 127908803; Could not execute Write_rows event on table ort.bm_cdr; Writing one row to the row-based binary log failed, Error_code: 1534; handler error HA_ERR_RBR_LOGGING_FAILED; the events master log FIRST, end_log_pos 627908803, Error_code: MY-001534 2026-05-07T21:53:42.24598208:00 144 [Warning] [MY-010584] [Repl] Slave SQL for channel : ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: MY-001756报错信息表示主库在回放事务 18c68804-62e7-11f1-846e-00163ed55da5:3425315180 时失败了失败位置发生在执行 Write_rows 事件写入从库本地 row-based binlog 的过程中最终导致 SQL 线程终止。问题排查看到Could not execute Write_rows event on table容易误以为是表数据写入失败或数据冲突。Writing one row to the row-based binary log failedError_code: 1534handler error HA_ERR_RBR_LOGGING_FAILED这说明故障与从库写本地 binlog 有关。该从库开启了本地 binlog并记录复制回放产生的更新即开启了log_slave_updates/log_replica_updates。因此SQL 线程不仅要把主库事务回放到表中还要把该事务再次写入从库自己的 binlog如果从库 binlog 事务缓存上限过小就可能在回放大事务时失败。对比主库和从库参数后发现关键差异如下主库 max_binlog_cache_size 4G 从库 max_binlog_cache_size 10M主库允许生成较大的事务 binlog而从库写本地 binlog 时只允许最多使用 10M 的事务缓存。将从库max_binlog_cache_size调整为与主库一致后再启动 SQL 线程复制恢复正常。随后使用mysqlbinlog解析故障 GTID 对应的 binlog并与业务侧确认该 GTID 确实是一个大事务事务产生的 binlog 体量远超从库 10M 的max_binlog_cache_size限制。我们知道通常当max_binlog_cache_size设置过小时会有 Error_code: MY-001197 的报错那什么时候会出现仅有 Error_code: MY-001534 的报错呢本文整理复现过程如下。实验验证测试环境参数配置搭建主从异步复制环境主库端参数配置如下binlog_cache_size 32768 max_binlog_cache_size 4294967296 #4G max_binlog_size 1073741824 binlog_row_image FULL binlog_rows_query_log_events ON从库端参数配置如下binlog_cache_size 32768 max_binlog_cache_size 10485760 #10M max_binlog_size 1073741824 binlog_row_image FULL binlog_rows_query_log_events ON log_replica_updates ON复现该问题的关键前提是从库开启本地 binlog并记录复制回放的更新。如果从库不写本地 binlog通常不会在 SQL 线程回放时触发本地max_binlog_cache_size限制。创建测试表在主库创建测试表并插入数据mysql CREATE TABLE test.t1 (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, c2 VARCHAR(255), c3 VARCHAR(255), c4 VARCHAR(255), c5 VARCHAR(255)); mysql INSERT INTO test.t1 VALUES (NULL, REPEAT(a, 250), REPEAT(b, 250), REPEAT(c,250), REPEAT(d, 250)); -- 持续执行如下语句14次 mysql INSERT INTO test.t1 (c2,c3,c4,c5) SELECT c2,c3,c4,c5 FROM t1; mysql SELECT COUNT(*) FROM test.t1; ---------- | count(*) | ---------- | 16384 | ---------- 1 row in set (0.04 sec)实验一单语句大事务触发错误 1197 1534主库执行如下INSERT语句插入10319条数据这时产生的 binlog 未超过从库max_binlog_cache_size10M的限制查看主从复制状态没有异常且主从数据一致。mysql INSERT INTO test.t1 (c2,c3,c4,c5) SELECT c2,c3,c4,c5 FROM t1 LIMIT 10319; Query OK, 10319 rows affected (0.62 sec) Records: 10319 Duplicates: 0 Warnings: 0主库执行如下INSERT语句插入10320条数据这时产生的binlog已经超过从库max_binlog_cache_size10M的限制。mysql INSERT INTO test.t1 (c2,c3,c4,c5) SELECT c2,c3,c4,c5 FROM t1 LIMIT 10320; Query OK, 10320 rows affected (0.62 sec) Records: 10320 Duplicates: 0 Warnings: 0再次查看主从复制状态SQL 线程已经异常终止error.log 中打印错误信息如下2026-06-10T11:19:02.36658308:00 212 [ERROR] [MY-010584] [Repl] Slave SQL for channel : Worker 1 failed executing transaction 18c68804-62e7-11f1-846e-00163ed55da5:85 at master log binlog.000001, end_log_pos 755113733; Could not execute Write_rows event on table test.t1; Multi-statement transaction required more than max_binlog_cache_size bytes of storage; increase this mysqld variable and try again, Error_code: 1197; Writing one row to the row-based binary log failed, Error_code: 1534; handler error HA_ERR_RBR_LOGGING_FAILED; the events master log binlog.000001, end_log_pos 755113733, Error_code: MY-001197 2026-06-10T11:19:02.54879008:00 211 [Warning] [MY-010584] [Repl] Slave SQL for channel : ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: MY-001756可以看到错误日志中明确提示max_binlog_cache_size过小错误码为11971534。此时调大max_binlog_cache_size再 START SLAVE 问题即可解决。实验二多语句事务只暴露错误 1534在主库开启事务依次执行以下语句mysql BEGIN; Query OK, 0 rows affected (0.00 sec) mysql INSERT INTO test.t1 (c2,c3,c4,c5) SELECT c2,c3,c4,c5 FROM t1 LIMIT 10319; Query OK, 10319 rows affected (0.25 sec) Records: 10319 Duplicates: 0 Warnings: 0 mysql INSERT INTO test.t1 VALUES (NULL, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa); Query OK, 1 row affected (0.00 sec) mysql COMMIT; Query OK, 0 rows affected (0.06 sec)此时观察从库主从复制状态显示 SQL 线程状态异常error.log 中信息如下可以看到只有1534报错码没有提示max_binlog_cache_size过小的信息。2026-06-10T11:34:24.33334208:00 232 [ERROR] [MY-010584] [Repl] Slave SQL for channel : Worker 1 failed executing transaction 18c68804-62e7-11f1-846e-00163ed55da5:86 at master log binlog.000001, end_log_pos 771775512; Could not execute Write_rows event on table test.t1; Writing one row to the row-based binary log failed, Error_code: 1534; handler error HA_ERR_RBR_LOGGING_FAILED; the events master log FIRST, end_log_pos 771775512, Error_code: MY-001534 2026-06-10T11:34:24.53735308:00 231 [Warning] [MY-010584] [Repl] Slave SQL for channel : ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: MY-001756实验结论以上两种情况都是由于从库端max_binlog_cache_size设置过小导致SQL线程异常终止。前者是批量插入binlog Event 总量超过max_binlog_cache_size限制时数据库会抛出明确的 1197 错误缓存超限并伴随 1534 错误单行写入失败。后者是最后一个单行插入语句生成的 Row Event 过大超出了当前 binlog Cache 剩余可用空间导致这一行无法写入从而仅触发 1534 错误。虽然 error.log 记录的信息不同但都属于同一故障。总结从库也会受 binlog cache 限制。只要从库开启本地 binlog 并记录复制更新SQL 线程回放事务时就要写本地 binlog因此会受从库max_binlog_cache_size限制。1534** 不是数据冲突。** 当日志中出现Writing one row to the row-based binary log failed和HA_ERR_RBR_LOGGING_FAILED时应优先检查 row-based binlog 写入链路包括max_binlog_cache_size、磁盘空间、临时目录、本地 binlog 配置等。主库和从库关键参数应保持基线一致。尤其是max_binlog_cache_size、binlog_format、binlog_row_image、max_allowed_packet、lower_case_table_names等会影响复制兼容性或回放行为的参数建议在变更前后做自动化比对。大事务是根因风险。调大max_binlog_cache_size可以恢复复制但不能消除大事务带来的 binlog 膨胀、复制延迟、回滚成本、备份恢复耗时等问题。批量 DML 应尽量拆分成小事务分批提交。不要混淆max_binlog_cache_size和max_binlog_size。前者限制单事务写 binlog 可使用的缓存上限后者只是 binlog 文件滚动的期望大小。一个大事务不会被拆到多个 binlog 文件中。阈值不要照搬。本文中 10319/10320 行只是当前表结构、字符集、row image、binlog 配置和版本组合下的临界点。换成其他表结构或参数后临界点会变化。