Firelit Design: April 2011

Tuesday, April 5, 2011

How to fix MySql Replication Error 1236

We have some websites that use two database servers in master-master replication. Recently one of the servers died and had to be resurrected. (They are cloud based and SoftLayer doesn't seem to have their cloud-based offering thing nailed down, yet. Every week one of them will go down and be unresponsive until tech support does some magic.)

After one of the servers was brought back up the other server would not connect. It "Slave_IO_Running: No" and "Seconds_Behind_Master: null" which means it was not playing nicely.

First, I went to the MySql log files, which for this server is found at /var/log/mysql.log and looked at the last few messages here by running "tail /var/log/mysql.log" from the command prompt. This revealed the error number (server_errno=1236). It also had the following info:

Got fatal error 1236: 'Client requested master to start replication from impossible position' from master when reading data from binary log

Just before this entry in the mysql.log it indicates the file and position that it is trying to read. So, with this data in hand, I headed over to the master server and took a look at the bin logs. They are located in /var/lib/mysql/. Here I took a look at the file in question using the mysqlbinlog utility. I used the following command to check out the bin log. Obviously, you'll have to replace the position and file name with the position and file indicated in your mysql.log.

mysqlbinlog --offset=128694 mysql-bin.000013

And, this is what I saw here, among other things:

Warning: this binlog was not closed properly. Most probably mysqld crashed writing it.

Well, that explains things! When the server crashed the bin log was not closed properly. This is easy to fix. Going back to the slave server, I stopped the slave, reset the bin log position and started the server.

STOP SLAVE;
CHANGE MASTER TO MASTER_LOG_POS = 0;
CHANGE MASTER TO MASTER_LOG_FILE = 'mysql-bin.000014';
START SLAVE;

I simply pointed the slave to the start of the next bin log. It started right up with no problem.

Update 5/21/13 - An anonymous commentor made a good point about potentially losing data using the above technique. If you are in a situation like I am (master-master replication with 100's of gigs of non-critical data) this is really the only way of getting back up without significant down time. But if you are in a master-slave configuration with a manageable data set or its critical you slave doesn't miss any data, you should probably dump the master database and re-create the slave database to make sure you didn't miss anything.