    New version of the portal monitor that is specific to the Mothership. · 2a5cbb2a
    Leigh Stoller authored
    This version is intended to replace the old autostatus monitor on bas,
    except for monitoring the Mothership itself. We also notify the Slack
    channel like the autostatus version. Driven from the apt_aggregates
    table in the DB, we do the following.
    
    1. fping all the boss nodes.
    
    2. fping all the ops nodes and dboxen. Aside: there are two special
       cases for now that will eventually come from the database: 1)
       powder wireless aggregates do not have a public ops node, and 2) the
       dboxen are hardwired into a table at the top of the file.
    
    3. Check all the DNS servers. Unlike autostatus (which just
       checks that port 53 is listening), we do an actual lookup at the
       server. This is done with dig @ the boss node with recursion turned
       off. At the moment this is a serialized test of all the DNS servers;
       we might need to change that later. I've lowered the timeout, and if
       things are operational 99% of the time (which I expect), then this
       will be okay until we get a couple of dozen aggregates to test.
    
       Note that this test is skipped if the boss is not pingable in the
       first step, so in general this test will not be a bottleneck.
    
    4. Check all the CMs with a GetVersion() call. As with the DNS check, we
       skip this if the boss does not ping. This test *is* done in parallel
       using ParRun() since it's slower and the most likely to time out when
       the CM is busy. The timeout is 20 seconds. This seems to be the best
       balance between too much email and not hanging for too long on any
       one aggregate.
    
    5. Send email and Slack notifications. The current loop runs every 60
       seconds, and each test has to fail twice in a row before it is marked
       as a failure and a notification is sent. We also send a 24 hour
       update for anything that is still down.
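
The notification rule in step 5 (two consecutive failures before alerting, plus a 24 hour update while still down) can be sketched as a small state machine. This is only an illustration in Python, not the monitor's actual code: `CheckState`, `update`, and the constant names are hypothetical, and the "up" recovery event is an assumption that the commit message does not mention.

```python
from dataclasses import dataclass

FAIL_THRESHOLD = 2          # consecutive failures before notifying (from step 5)
REMINDER_SECS = 24 * 3600   # interval for the "still down" update (from step 5)


@dataclass
class CheckState:
    """Per-test, per-aggregate bookkeeping (hypothetical structure)."""
    failures: int = 0           # consecutive failures seen so far
    down: bool = False          # True once a failure has been reported
    last_notified: float = 0.0  # timestamp of the last notification


def update(state, ok, now):
    """Record one test result; return 'down', 'reminder', 'up', or None."""
    if ok:
        state.failures = 0
        if state.down:
            # Assumption: recovery is reported; the source does not say so.
            state.down = False
            return "up"
        return None
    state.failures += 1
    if not state.down:
        if state.failures >= FAIL_THRESHOLD:
            state.down = True
            state.last_notified = now
            return "down"
        return None
    if now - state.last_notified >= REMINDER_SECS:
        state.last_notified = now
        return "reminder"
    return None
```

With a 60 second loop, a single transient failure followed by a success produces no notification, which is the point of requiring two failures in a row.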
    
    At the moment, the full set of tests takes 15 seconds on our seven
    aggregates when they are all up. Will need more tuning later, as the
    number of aggregates goes up.
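
To illustrate why the parallel CM check in step 4 keeps the total run time bounded as aggregates are added, here is a rough Python sketch. It does not use ParRun() or the real GetVersion() call; `run_checks` and the `check` callable are hypothetical stand-ins, and a thread pool with one overall deadline is an assumed substitute for whatever ParRun() does internally.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def run_checks(targets, check, timeout=20.0):
    """Run check(target) for every target in parallel.

    Returns {target: bool}. A target counts as failed if its check
    raised, returned a false value, or did not finish within `timeout`
    seconds (20 seconds is the value quoted in step 4).
    """
    if not targets:
        return {}
    pool = ThreadPoolExecutor(max_workers=len(targets))
    futures = {pool.submit(check, t): t for t in targets}
    # One shared deadline: the whole batch takes at most `timeout` seconds.
    done, _ = wait(futures, timeout=timeout)
    # Do not block on stragglers; requires Python 3.9+ for cancel_futures.
    pool.shutdown(wait=False, cancel_futures=True)
    results = {}
    for fut, target in futures.items():
        finished_ok = fut in done and fut.exception() is None
        results[target] = bool(finished_ok and fut.result())
    return results
```

A caller would pass only the aggregates whose boss node answered the ping in step 1, mirroring the skip rule described above. Because the checks share one deadline, the worst case stays near the timeout rather than growing with the number of aggregates.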
Name
account
apache
apt
assign
autoconf
autofs
backend
bugdb
cdrom
clientside
collab
daikon
db
delay
dhcpd
discvr
doc
event
firewall
flash
fwrules
hw_config
hyperviewer
image-test
install
ipod
mobile
mote
named
node_usage
ntpd
os
patches
pelab
protogeni
pxe
rc.d
robots
rpms
security
sensors
sql
ssl
sysadmin
tbsetup
testsuite
tip
tmcd
tools
utils
vis
wiki
www
xmlrpc
.gitattributes
.gitignore
.gitmodules
.loc-ignore
AGPL-COPYING
GNUmakefile.in
GNUmakerules
GPL-COPYING
LGPL-COPYING
MOVED-TO-WIKI
Makeconf.in
README
TODO
TODO.plab
VERSION
WEBtemplate.in
config.h.in
configure
configure.ac
defs-apt
defs-cloudlab-clemson
defs-cloudlab-utah
defs-cloudlab-wisc
defs-default
defs-duerig-emulab
defs-elabinelab
defs-example
defs-gtw-apt
defs-gtw-emulab
defs-johnsond-emulab
defs-kwebb-apt
defs-kwebb-cloudlab
defs-kwebb-emulab
defs-mike-emulab
defs-onelab
defs-ricci-emulab
defs-stoller-apt
defs-stoller-emulab
defs-stoller-home
defs-stoller-lbsdb
defs-uky
defs-utahclient
defs-wbsun-emulab
defs-wide
pnet-favicon.ico