MySQL – Fixing: Got fatal error 1236 from master when reading data from binary log: log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master;

If you spend all day every day looking after hundreds of MySQL servers, the chances are that you have encountered this problem. And if your servers are configured properly (i.e. the master and the slaves share the same configuration), the error is almost certainly spurious and doesn’t mean what it says it means. So why does it happen? It isn’t entirely clear – it could be a consequence of a hardware glitch, in-flight data corruption in the network layer (though between TCP checksums and binlog checksums, corruption is vanishingly unlikely to pass through undetected), or a really difficult-to-reproduce MySQL bug (this happens very rarely; we see it maybe 2-3 times a year across hundreds of database servers in a variety of environments).

But it does happen. And when it does, the only way to be completely sure that the slave is sound is to re-initialize the entire data set – and that can be painful when you have terabytes of data. Of course, most of the MySQL and MariaDB database servers we look after run on ZFS with hourly or daily snapshots enabled. In those cases, we can simply roll back to the most recent snapshot taken before the error arose, and that almost always fixes the problem. Those who haven’t set up their infrastructure that way don’t have such an easy way out.
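For completeness, the rollback path is roughly this – a sketch assuming the slave’s datadir lives on a hypothetical ZFS dataset called tank/mysql and that the chosen snapshot (name is a placeholder) predates the error:

systemctl stop mariadb                           # stop the slave before touching the datadir
zfs list -t snapshot tank/mysql                  # find the most recent snapshot taken before the error
zfs rollback -r tank/mysql@hourly-2024-01-01-00  # -r discards any snapshots newer than the target
systemctl start mariadb                          # replication resumes from the coordinates saved in the snapshot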

So let’s explore what you might be able to do to avoid, or at least postpone, re-initializing the slave.

Working Through an Actual Example

Suppose it happens to you and you see this in your SHOW SLAVE STATUS\G output:

Exec_Master_Log_Pos: 123516843
Seconds_Behind_Master: NULL
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master; the first event 'mysql-bin.140409' at 123516843, the last event read from './mysql-bin.140409' at 123, the last byte read from './mysql-bin.140409' at 123516862.'

The first thing that looks fishy is that the difference between Exec_Master_Log_Pos and the last byte read from the binlog is tiny (123516862 - 123516843 = 19 bytes) – far too small to be a transaction anywhere near the max_allowed_packet limit.
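Since the error explicitly blames max_allowed_packet, a quick sanity check at this point is to confirm that the setting really does match on both servers (host names below are placeholders):

mysql -h master-server -e "SHOW VARIABLES LIKE 'max_allowed_packet';"
mysql -h slave-server -e "SHOW VARIABLES LIKE 'max_allowed_packet';"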

Looking at the binlog on the master, it quickly becomes obvious that both of those binlog positions are actually bogus:

mysqlbinlog mysql-bin.140409 | grep '^# at ' | grep -P 'at 1235\d{5}$'
# at 123512938
# at 123521149
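If you want to inspect what actually sits between those two neighbouring events before repointing the slave, mysqlbinlog can dump just that region of the log:

mysqlbinlog --start-position=123512938 --stop-position=123521149 mysql-bin.140409 | less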

So what we might be able to get away with is resetting the slave replication coordinates:

STOP SLAVE;
RESET SLAVE;
CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.140409', MASTER_LOG_POS=123512938, MASTER_HOST='master-server', MASTER_USER='master-user', MASTER_PASSWORD='master-password';
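Don’t forget to start the slave again afterwards and confirm that both replication threads come back up:

START SLAVE;
SHOW SLAVE STATUS\G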

The reason we go for the value before rather than the value after is that, in most cases, if there is a primary key and we are using binlog_format=ROW, replication will break on a missing row (on a DELETE) or a duplicate key (on an INSERT), and we can skip the transaction and resume. If we went straight for the first value after the reported error, we could miss a transaction and potentially not know until something breaks later on, or until we can check with pt-table-checksum – and that can take hours or days with terabytes of data.
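If replication then stops on the expected duplicate-key or missing-row error, the transaction can be stepped over with the skip counter (this applies to classic binlog-coordinate replication; GTID setups need a different approach):

STOP SLAVE;
SET GLOBAL sql_slave_skip_counter = 1;  -- skip one event group
START SLAVE;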

Errors Often Cluster

Importantly – if a glitch like this occurred, then it is clear that something is amiss. Don’t just apply this fix and assume everything else is OK on the affected server. At the very least, run pt-table-checksum against it to make sure the data is consistent. Servers that exhibit spurious errors rarely do so only once.
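A minimal sketch of such a run, assuming you execute it on the master with a user that can write to the default percona.checksums table (host, user, and password are placeholders); the standard diff query from the Percona documentation, run on the slave afterwards, shows which tables differ:

pt-table-checksum --replicate=percona.checksums h=master-server,u=checksum_user,p=secret

# then, on the slave:
SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks
FROM percona.checksums
WHERE (
    master_cnt <> this_cnt
    OR master_crc <> this_crc
    OR ISNULL(master_crc) <> ISNULL(this_crc))
GROUP BY db, tbl;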

Update

In newer versions of MariaDB, the same problem can also manifest with the following error:

Slave I/O: Replication event checksum verification failed while reading from network, Internal MariaDB error code: 1743

The way to establish whether you actually have a corrupted binlog is to take the Exec_Master_Log_Pos from the output of SHOW SLAVE STATUS\G and compare it against the output of mysqlbinlog on the corresponding binary log. If mysqlbinlog doesn’t show a transaction that begins at the value given by Exec_Master_Log_Pos, then you are looking at a manifestation of the same problem described above. The circumstances under which this happens can be strange: we have recently seen a case where two separate downstream servers both errored out in the same way, with the exact same value of Exec_Master_Log_Pos, while other downstream servers replicating from the same source were unaffected. It is not clear where a single error could originate in a way that affects multiple (but not all) downstream servers at the same log execution position.
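Concretely, the check looks something like this, using the example file and position from earlier (substitute whatever your slave reports):

# on the slave: note the file and position the SQL thread stopped at
mysql -e 'SHOW SLAVE STATUS\G' | grep -E 'Relay_Master_Log_File|Exec_Master_Log_Pos'

# on the master: check whether an event really begins at that offset
mysqlbinlog mysql-bin.140409 | grep -c '^# at 123516843$'

# 0 means no event starts there, i.e. the position is bogus and this is
# the same spurious-corruption problem described above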