MySQL – Fixing: Got fatal error 1236 from master when reading data from binary log: log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master;
If you spend all day every day looking after hundreds of MySQL servers, the chances are that you have encountered this problem. And if your servers are configured properly (i.e. the master and the slaves share the same configuration), the error is almost certainly spurious and doesn’t mean what it says it means. So why does it happen? It isn’t entirely clear – it could be a consequence of a hardware glitch, in-flight data corruption in the network layer (though between TCP checksums and binlog checksums it is vanishingly unlikely to pass through undetected), or a really difficult-to-reproduce MySQL bug. Either way it is rare – we see it maybe 2-3 times a year across hundreds of database servers in various environments.
But it does happen. And when it does, the only way to be completely sure that the slave is sound is to re-initialize the entire data set – and that can be painful when you have terabytes of data. Of course, most of the MySQL and MariaDB database servers we look after run on ZFS with hourly or daily snapshots enabled. In those cases, we can simply roll back to the most recent snapshot taken before the error arose, and that almost always fixes the problem. But those who haven’t had their infrastructure set up with snapshots like that don’t have that shortcut available.
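If the slave does run on ZFS, the rollback itself is quick. A minimal sketch, assuming a hypothetical dataset name tank/mysql and a hypothetical snapshot called @hourly-before-error (your dataset layout, snapshot naming and service manager will differ):

# Stop MySQL so the datadir is quiescent before rolling back
systemctl stop mysql

# List snapshots and pick the most recent one taken before the error appeared
zfs list -t snapshot -o name,creation tank/mysql

# Roll the dataset back; -r destroys any snapshots newer than the chosen one
zfs rollback -r tank/mysql@hourly-before-error

# Bring MySQL back up and let replication resume from the snapshot's coordinates
systemctl start mysql
mysql -e "START SLAVE;"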
So let’s explore what you might be able to do to avoid, or at least postpone, the slave re-initialization.
Working Through an Actual Example
Say it happens to you and you see this in your SHOW SLAVE STATUS\G output:
Exec_Master_Log_Pos: 123516843
Seconds_Behind_Master: NULL
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master; the first event 'mysql-bin.140409' at 123516843, the last event read from './mysql-bin.140409' at 123, the last byte read from './mysql-bin.140409' at 123516862.'
The first thing that looks fishy is that the difference between Exec_Master_Log_Pos and the last byte read from the binlog is tiny – and therefore highly unlikely to be anywhere near max_allowed_packet transaction size.
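A quick back-of-the-envelope check on the positions quoted in the error message above makes the point:

# Last byte read minus the position the slave had executed up to:
#   123516862 - 123516843 = 19 bytes
# max_allowed_packet is normally measured in megabytes, so a 19-byte gap
# cannot correspond to an oversized log event.
echo $((123516862 - 123516843))   # prints 19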
Looking at the binlog on the master, it quickly becomes obvious that both of those binlog positions are actually bogus:
mysqlbinlog mysql-bin.140409 | grep '^# at ' | grep -P 'at 1235[12]\d{4}$'
# at 123512938
# at 123521149
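Neither of the positions quoted in the error appears as an event boundary; the real events around it start at 123512938 and 123521149. Before re-pointing the slave it can be worth looking at what that event range actually contains – a sketch using standard mysqlbinlog options:

# Dump only the events between the two real boundaries that bracket the bogus position
mysqlbinlog --start-position=123512938 --stop-position=123521149 mysql-bin.140409 | less

# With binlog_format=ROW, -vv additionally decodes the row images into pseudo-SQL
mysqlbinlog -vv --start-position=123512938 --stop-position=123521149 mysql-bin.140409 | less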
So what we might be able to get away with is resetting the slave replication coordinates:
STOP SLAVE;
RESET SLAVE;
CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.140409', MASTER_LOG_POS=123512938, MASTER_HOST='master-server', MASTER_USER='master-user', MASTER_PASSWORD='master-password';
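After the CHANGE MASTER, replication still needs to be started again and watched. A minimal follow-up, run on the slave:

# Resume replication from the earlier, known-good event boundary
mysql -e "START SLAVE;"

# Keep an eye on the thread states and any new error
mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_SQL_Errno|Last_SQL_Error'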
The reason we go for the value before rather than the value after is that in most cases, if there is a primary key and we are using binlog_format=ROW, replication will break on a missing row (on a DELETE) or a duplicate key (on an INSERT), and we can skip that transaction and resume. If we went straight for the first value after the reported error, we could miss a transaction and potentially wouldn’t know until something breaks later on, or until we can check with pt-table-checksum – and that can take hours or days with terabytes of data.
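When replication does break on such a duplicate-key or missing-row error, that single, already-applied transaction can be skipped. A minimal sketch for classic file/position replication (with GTID-based replication you would inject an empty transaction for the offending GTID instead):

# Confirm it really is the expected duplicate-key / missing-row error first
mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Last_SQL_Errno|Last_SQL_Error'

# Skip exactly one event group (the transaction we already have) and resume
mysql -e "SET GLOBAL sql_slave_skip_counter = 1; START SLAVE;"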
Errors Often Cluster
Importantly – if you had a glitch like this occur, then it is clear that something is amiss. Don’t just apply this fix and assume everything else on the affected server is OK. At the very least, run pt-table-checksum against it to make sure that the data is consistent. Servers that exhibit spurious errors rarely do it only once.
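A sketch of such a check, run from the master, assuming a hypothetical checksum user and the default percona.checksums results table (credentials and connection options will differ in your environment):

# Checksum the tables on the master; the checksum queries replicate to the slaves
pt-table-checksum --replicate=percona.checksums \
    h=master-server,u=checksum-user,p=checksum-password

# Then, on the suspect slave, list any chunks that differ from the master
mysql -e "SELECT db, tbl, chunk, this_cnt, master_cnt, this_crc, master_crc
          FROM percona.checksums
          WHERE master_cnt <> this_cnt OR master_crc <> this_crc
             OR ISNULL(master_crc) <> ISNULL(this_crc);"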