Commit 72b4ba32 authored by Leigh B Stoller's avatar Leigh B Stoller

Quick fix for watchdog/backup interaction; use a script lock.

From Slack:

What I notice is that mysqldump is read-locking all of the tables for a
long time. This time gets longer and longer, of course, as the DB gets
bigger. Last night enough stuff backed up (trying to get various write
locks) that we hit the 500 thread limit. I only know this because mysql
prints "killing 501 threads" at 2:03am. Which makes me wonder if our
thread limit is too small (but it seems like it would have to be much
bigger) or if our backup strategy is inappropriate for how big the DB is
and how busy the system is. But to be clear, I am not even sure if
mysqld throws in the towel when it hits 500 threads; I am in the midst
of reading obtuse mysql documentation. There are a bunch of other
error messages that I do not understand yet.

I can reproduce this in my elabinelab with a 10 line perl script. Two
problems: 1) we do not use the permission system, so we cannot use
dynamic permissions, which means that the single thread that is left
for just this case can be used by anyone, and so the server is fully
out of threads. And 2) the Emulab mysql watchdog then cannot perform its
query, so it thinks mysqld has gone catatonic and kills it, right in
the middle of the backup. Yuck * 2.

And if anyone is curious about a more typical approach: "If you want to
do this for MyISAM or mixed tables without any downtime from locking the
tables, you can set up a slave database, and take your snapshots from
there. Setting up the slave database, unfortunately, causes some
downtime to export the live database, but once it's running, you should
be able to lock its tables, and export using the methods others have
described. When this is happening, it will lag behind the master, but
won't stop the master from updating its tables, and will catch up as
soon as the backup is complete."
parent ebadc7d1
#!/usr/bin/perl -w
-# Copyright (c) 2000-2017 University of Utah and the Flux Group.
+# Copyright (c) 2000-2018 University of Utah and the Flux Group.
@@ -72,6 +72,7 @@ my $BACKUPDAYS = "30";
my $extension;
my @updatefiles = ();
my $dohotcopy = 0;
my $locked = 0;
my $dbname = "mysql";
my $dbuser = "root";
my $dbpass;
@@ -150,6 +151,13 @@ if ($clean) {
if (! $opsmode) {
    if (TBScriptLock("backup", 0, 60) != TBSCRIPTLOCK_OKAY()) {
	fatal("Could not get the lock after a long time!\n");
    }
    $locked = 1;
}
# Create a temporary name for a log file and untaint it.
@@ -339,6 +347,8 @@ if ($dohotcopy && -e "$BACKUPDIR/tbdb") {
system("$SETSITEVAR web/message -")
if (!$opsmode);
TBScriptUnlock()
    if ($locked);
exit 0;
sub fatal($) {
@@ -349,5 +359,7 @@ sub fatal($) {
SENDMAIL($TBOPS, "DB Backup Failed", $msg, undef, undef, ($logname));
system("$SETSITEVAR web/message -");
TBScriptUnlock()
    if ($locked);
@@ -65,6 +65,7 @@ use libtestbed;
# Protos
sub TryQuery();
sub RestartMysqld();
sub notify($);
# Daemonize;
if (!$debug) {
@@ -97,6 +98,17 @@ setsid();
# Loop forever ...
while (1) {
    # We do not want to run this while the database is backing up, since
    # we might run out of connections and think the mysqld is dead. This
    # is not a great approach, but the proper approach to database backup
    # needs more work than I have time for today.
    if (TBScriptLock("backup", 0, 30 * 60) != TBSCRIPTLOCK_OKAY()) {
	notify("Could not get the backup lock after a long time!\n");
    }
if (TryQuery() < 0) {
if (!$paused);
@@ -104,7 +116,7 @@ while (1) {
else {
$paused = 0;