This is the first article in the ongoing series we decided to call “DBAs’ Impossible Tales”. As you can imagine, our MariaDB and MySQL support engineers come up against all kinds of mind-boggling, seemingly impossible problems on a daily basis. In these articles we will talk about some of the more interesting and unusual cases we have encountered. So while you are no doubt getting ready to tell us “That word you keep using – I do not think it means what you think it means.“, let’s begin.
The Mystery of the Freezing MariaDB Database
The Client had a peculiar problem that had been plaguing them for a long time. Approximately every 8 hours, their MariaDB database server would simply freeze and stop responding. The mysqld process was still running, but it was doing nothing. It wasn’t accessible via the unix domain socket. It was not responding to network connections. It was using no CPU, no network I/O, no disk I/O. It had just stopped – as if it were suspended, only it wasn’t in a suspended state.
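A useful first-pass check in cases like this (a generic sketch, not The Client’s exact procedure) is to look at the scheduler state of every thread in the process. A thread stuck in “D” (uninterruptible sleep) is blocked inside the kernel, which fits the “alive but doing nothing” picture:

```shell
#!/bin/sh
# Print "<tid> <state>" for every thread of a process.
# States: R = running, S = interruptible sleep, D = uninterruptible sleep.
pid_states() {
    pid="$1"
    for t in /proc/"$pid"/task/*; do
        awk -v tid="${t##*/}" '/^State:/ { print tid, $2 }' "$t/status"
    done
}

# Usage against a hung server (pidof mysqld is illustrative):
# pid_states "$(pidof mysqld)"
```

If most of the server’s threads are parked in D state with no disk I/O in flight, the process is waiting on something external to it rather than doing work of its own.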
As a first action point, we deployed the Shattered Silicon Monitoring system for MySQL and MariaDB to capture telemetry leading up to the next freeze. We reviewed the configuration, and although there were a few telltale signs that the systems were sub-optimally configured, there was nothing that could plausibly cause the entire database to lock up solid and become so completely unresponsive.
The next lockup came, and the verdict from captured telemetry was that… there was nothing obviously anomalous preceding the freeze. No spike in activity, network I/O, disk I/O or memory usage. The Mystery deepened. The version of MariaDB they were running was years beyond EOL and not fully patched even within that major release, but we have worked with that version extensively over the years, and it was a very solid release branch with very few serious bugs, especially of this severity. There was nothing obvious in dmesg. There was even less in syslog.
In fact, there was so little in syslog that it was conspicuous. It was as if it had not captured anything in days – since not long after the last server reboot, carried out by The Client’s internal systems support staff in their attempts to get the database server working again. Could this be related?
During our configuration review, we had noticed that The Client had audit logging enabled on their database, with the output set to syslog. The plot thickened. It looked like syslog was experiencing a problem that made it stop logging; once the buffer filled up and the database could no longer send audit log data, the database stalled until it could push the audit log entry through before allowing anything else to happen.
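For illustration, audit logging routed to syslog via the MariaDB server_audit plugin typically looks something like this in my.cnf (the variable names are the plugin’s; the specific settings shown here are an assumption, not The Client’s actual configuration):

```ini
[mysqld]
plugin_load_add = server_audit        ; load the server_audit plugin
server_audit_logging = ON             ; enable audit logging
server_audit_output_type = syslog     ; send audit entries to the local syslog daemon
server_audit_events = CONNECT,QUERY   ; illustrative event selection
```

With syslog output, every audited event has to be handed off to the syslog daemon; if the daemon stops draining its socket, those hand-offs eventually block, and the database threads stall along with them.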
We changed the audit log output to file instead of syslog, and the database never froze again. Syslog still stopped logging after a while, but it no longer affected the database. It also turned out that the rsyslog they were using came from a custom repository rather than being the standard distribution-shipped version. After The Client’s own engineers confirmed that the reasons for running a non-standard rsyslog were long lost to time, we downgraded it to the distribution-supplied one, and that concluded the solution.
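A sketch of the change, again assuming the server_audit plugin (the file path and rotation size are illustrative values, not The Client’s):

```ini
; At runtime, from a client session:
;   SET GLOBAL server_audit_output_type = 'file';
; Persisted in my.cnf so it survives restarts:
[mysqld]
server_audit_output_type = file
server_audit_file_path = /var/log/mysql/server_audit.log
server_audit_file_rotate_size = 1000000   ; rotate at ~1 MB
```

Writing to a local file takes the syslog daemon out of the critical path entirely: even if logging to the file slows down, it cannot be wedged by a stalled external process in the same way.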
So, in the end – the database locking up was not actually a database problem. One of The Client’s engineers described the fix as “finding a needle in a field of haystacks”. This is why it is important to engage database experts who are broadly experienced and whose skills also cover everything adjacent to databases.
After that, we carried out an incremental upgrade process across several major releases to a newer MariaDB version that had not yet reached EOL. It is worth mentioning that MySQL and MariaDB upgrades have to be handled with care. Several on-disk format incompatibilities were introduced in minor point releases over the years, so unless you have experience dealing with this, you should probably consult a MySQL / MariaDB consultant for advice before you dive into the process.
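The incremental path can be sketched as a loop over intermediate major versions. The dry-run planner below only prints the steps; the version chain and commands are illustrative, not The Client’s actual upgrade path (note that `mariadb-upgrade` is the newer name for `mysql_upgrade` on older releases):

```shell
#!/bin/sh
# Dry-run planner: print the steps of an incremental MariaDB upgrade,
# one major version at a time. Versions below are illustrative.
plan_upgrade() {
    prev="$1"; shift
    for ver in "$@"; do
        echo "backup:  full backup of the $prev datadir"
        echo "install: MariaDB $ver packages"
        echo "run:     mariadb-upgrade   # rewrites system tables for $ver"
        echo "verify:  application smoke tests on $ver"
        prev="$ver"
    done
}

plan_upgrade 10.1 10.2 10.3 10.4 10.5
```

The key point the loop encodes is that each hop gets its own backup and its own verification pass, so a format incompatibility surfaces one major version at a time rather than all at once.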