Commit 2a5cbb2a authored by Leigh Stoller

New version of the portal monitor that is specific to the Mothership.

This version is intended to replace the old autostatus monitor on bas,
except for monitoring the Mothership itself. We also notify the Slack
channel like the autostatus version. Driven from the apt_aggregates
table in the DB, we do the following.

1. fping all the boss nodes.

2. fping all the ops nodes and dboxen. Aside: there are two special
   cases for now that will eventually come from the database: 1)
   powder wireless aggregates do not have a public ops node, and 2) the
   dboxen are hardwired into a table at the top of the file.

3. Check all the DNS servers. Unlike autostatus (which just
   checks that port 53 is listening), we do an actual lookup at the
   server. This is done with dig @ the boss node with recursion turned
   off. At the moment this is a serialized test of all the DNS servers;
   we might need to change that later. I've lowered the timeout, and if
   things are operational 99% of the time (which I expect), then this
   will be okay until we get a couple of dozen aggregates to test.

   Note that this test is skipped if the boss is not pingable in the
   first step, so in general this test will not be a bottleneck.

4. Check all the CMs with a GetVersion() call. As with the DNS check, we
   skip this if the boss does not ping. This test *is* done in parallel
   using ParRun() since it's slower and the most likely to time out when
   the CM is busy. The timeout is 20 seconds. This seems to be the best
   balance between too much email and not hanging for too long on any
   one aggregate.

5. Send email and Slack notifications. The loop currently runs every 60
   seconds, and each test has to fail twice in a row before it is marked
   as a failure and a notification is sent. We also send a 24 hour
   update for anything that is still down.
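
The gating in steps 1, 3 and 5 can be sketched in shell as follows. This is
an illustration only, not the portal_monitor code: the function names, the
per-host state files, and the dig option values are all assumptions. It
shows the DNS lookup being skipped when the boss does not ping, and a
notification being raised only on the second consecutive failure.

```shell
#!/bin/sh
# Sketch only: names, state layout and option values are assumptions.

# Directory holding one consecutive-failure counter per monitored host.
STATE_DIR="${STATE_DIR:-$(mktemp -d)}"

# boss_pings HOST: succeeds when fping reports the host as reachable.
boss_pings() {
    fping -q "$1" >/dev/null 2>&1
}

# dns_ok SERVER NAME: non-recursive lookup sent directly to SERVER.
# Succeeds only if the server returned at least one answer.
dns_ok() {
    answer=$(dig "@$1" "$2" A +norecurse +short +time=2 +tries=1 2>/dev/null)
    [ -n "$answer" ]
}

# check_boss HOST NAME: prints "ping", "dns" or "ok".
# The DNS test is skipped entirely when the ping test fails.
check_boss() {
    if ! boss_pings "$1"; then
        echo ping
    elif ! dns_ok "$1" "$2"; then
        echo dns
    else
        echo ok
    fi
}

# record_result KEY RESULT: RESULT is "ok" or the name of a failed test.
# Prints "notify" exactly when the failure count reaches two, else "quiet".
record_result() {
    file="$STATE_DIR/$1"
    count=0
    if [ -f "$file" ]; then
        count=$(cat "$file")
    fi
    if [ "$2" = "ok" ]; then
        count=0
    else
        count=$((count + 1))
    fi
    echo "$count" > "$file"
    if [ "$count" -eq 2 ]; then
        echo notify
    else
        echo quiet
    fi
}
```

A loop body could then call something like
`record_result "$host" "$(check_boss "$host" "$name")"` once per pass. The
24 hour reminder for hosts that stay down is not modeled here.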

At the moment, the full set of tests takes 15 seconds on our seven
aggregates when they are all up. Will need more tuning later, as the
number of aggregates goes up.
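
The total run time is bounded by the slowest check, which is why the CM
check runs in parallel with a fixed timeout. A rough shell sketch of that
pattern follows. It is not the ParRun() implementation; run_parallel is a
hypothetical helper, and it assumes a timeout(1) utility is installed.

```shell
#!/bin/sh
# Sketch only: run_parallel is a hypothetical helper, not ParRun().

# run_parallel LIMIT OUTDIR CMD...
# Runs every CMD (one shell string each) in the background, each limited
# to LIMIT seconds, and writes the exit status of the Nth command to
# OUTDIR/N. timeout(1) reports status 124 when the limit is hit.
run_parallel() {
    limit=$1
    outdir=$2
    shift 2
    i=0
    for cmd in "$@"; do
        i=$((i + 1))
        (
            timeout "$limit" sh -c "$cmd" >/dev/null 2>&1
            echo $? > "$outdir/$i"
        ) &
    done
    # Block until every background check has finished or timed out.
    wait
}
```

With a 20 second limit, the whole batch takes roughly 20 seconds in the
worst case, no matter how many CMs are busy at once.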
parent 3dcc45bc
@@ -8,6 +8,8 @@
 # BEFORE: apache22
 # KEYWORD: shutdown
+MAINSITE="@TBMAINSITE@"
 case "$1" in
 start|faststart|quietstart|onestart|forcestart)
 #
@@ -157,7 +159,7 @@ case "$1" in
 @prefix@/sbin/cnetwatch
 fi
-if [ -x @prefix@/sbin/portal_monitor ]; then
+if [ $MAINSITE == "1" -a -x @prefix@/sbin/portal_monitor ]; then
 echo -n " portal_monitor"
 @prefix@/sbin/portal_monitor
 fi
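
The new guard starts portal_monitor only on the main site. One possible
variant of the same test is sketched below. It is a suggestion, not what the
commit contains, and should_start is a hypothetical name. Quoting the
variable keeps the expression well-formed if MAINSITE expands to nothing,
and `=` is the string comparison that POSIX specifies for test.

```shell
#!/bin/sh
# Sketch only: should_start is a hypothetical helper.

# should_start MAINSITE PATH
# Succeeds only when MAINSITE is exactly "1" and PATH is executable.
should_start() {
    [ "$1" = "1" ] && [ -x "$2" ]
}
```

In the rc script it could be used as
`if should_start "$MAINSITE" @prefix@/sbin/portal_monitor; then ... fi`.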