- Oct 30, 2018
David Johnson authored
(No existing code ever checked the return value from TBSetNodeEventState until libosload_virtnode started to do so, to retry failed event sends under high load in large-scale vnode experiments. libevent, Node, and libdb used conflicting return conventions; this sets them right.)
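The commit does not describe the retry logic in libosload_virtnode. The sketch below is an assumption: `send` is a hypothetical callable standing in for an event send such as TBSetNodeEventState, and it returns True on success once the return conventions agree. The attempt count and delays are made-up defaults. It shows why a consistent return value matters: the loop can only retry if it can tell failure from success.

```python
import time
from typing import Callable


def send_with_retry(send: Callable[[], bool], attempts: int = 5,
                    delay: float = 0.5, backoff: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call send() until it reports success or the attempts run out.

    Assumes (hypothetically) that send() returns True on success and
    False on failure, i.e. a single consistent return convention.
    """
    for attempt in range(attempts):
        if send():
            return True
        if attempt < attempts - 1:
            # Back off so a heavily loaded event system gets time to recover.
            sleep(delay)
            delay *= backoff
    return False
```

Injecting `sleep` keeps the function testable without real delays.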
-
David Johnson authored
-
David Johnson authored
This is necessary for clusters that run an ARP lockdown on boss. This eluded me for a long time. None of the documented ways to set the MAC address of an endpoint on container create work (they only work on post-create network attach). You have to use some special, weird, undocumented magic.
-
David Johnson authored
(Most of these got lost in some other commit storm, I believe. The firewall fixes are new, for newer Dockers that drop traffic by default.)
-
Leigh B Stoller authored
-
- Oct 29, 2018
Leigh B Stoller authored
with a modern implementation.
-
Leigh B Stoller authored
days, extended check locks up the system for too long and it never ever finds any problems. So do a "medium" check instead, which runs 5 times faster.
-
- Oct 26, 2018
David Johnson authored
-
David Johnson authored
-
David Johnson authored
-
David Johnson authored
-
David Johnson authored
-
Mike Hibler authored
Turns out we have not been installing (via slicefix) the local site certs on nodes after they have been imaged. We haven't noticed because we don't usually use SSL-enabled tmcd. Leigh noticed because we do use it in the script that locks down ARP entries.
-
Leigh B Stoller authored
that and point them to the Geni portal to fix it.
-
Leigh B Stoller authored
-
Leigh B Stoller authored
* Respect the default branch at the origin. GitLab/GitHub allow you to set the default branch on a repo, which we were ignoring, always using master. Now we ask the remote for the default branch when we clone/update the repo and set that locally. Like GitLab/GitHub, we mark the default branch in the branch list with a "default" badge so the user knows.
* Changes to the timer that asks whether the repohash has changed (via a push hook). This has a race in it, and I have solved part of it. It is not a serious problem, just a UI annoyance I am working on fixing. Added a cheesy mechanism to make sure the timer is not running at the same time the user clicks on Update().
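The commit does not say how the remote is asked for its default branch. As an assumption, one way with stock git is `git ls-remote --symref origin HEAD`, whose output includes a line like `ref: refs/heads/main<TAB>HEAD`. The helper names below are hypothetical; only the parsing is exercised without a repository.

```python
import subprocess
from typing import Optional


def parse_symref_head(output: str) -> Optional[str]:
    """Extract the default branch from `git ls-remote --symref <remote> HEAD` output."""
    prefix, suffix, heads = "ref: ", "\tHEAD", "refs/heads/"
    for line in output.splitlines():
        if line.startswith(prefix) and line.endswith(suffix):
            ref = line[len(prefix):-len(suffix)]
            if ref.startswith(heads):
                return ref[len(heads):]
    return None  # remote did not advertise a symbolic HEAD


def remote_default_branch(repo_dir: str, remote: str = "origin") -> Optional[str]:
    """Ask the remote (hypothetical helper) which branch its HEAD points at."""
    out = subprocess.run(
        ["git", "ls-remote", "--symref", remote, "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return parse_symref_head(out)
```

Recording the answer locally could then be done with `git remote set-head origin <branch>`, or with `git remote set-head origin --auto`, which queries the remote itself.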
-
Leigh B Stoller authored
it is a jail, and its MAC is the same as boss.
-
- Oct 25, 2018
Aleksander Maricq authored
-
Leigh B Stoller authored
-
David Johnson authored
(Also, add support for the user to change the container entrypoint at runtime. Note also that the server side now stores the entrypoint/cmd/env attributes as base64url-encoded virt_node_attributes, so that we can just use the existing table_regex for those values.) We add a new runit service (/etc/service/dockerentrypoint) to clientside/tmcc/linux/docker/dockerfiles/common to handle the entrypoint/cmd/env/workingdir/user emulation. From the comments: Docker's semantics for ENTRYPOINT/CMD vary depending on whether those values are specified as arrays of strings, or simply as single strings (which must be interpreted by /bin/sh -c). Handling all the quoting possibilities in the shell is a major pain. So, this script handles the basic stuff (in particular, sourcing env vars, because we want the shell to interpret them!), then execs our Perl companion script (run.pl) to deal with the entrypoint/command files that libvnode_docker::emulabizeImage and libvnode_docker::vnodeCreate populated. libvnode_docker creates these single-line files in /etc/emulab/docker as either string:hexstr(<entrypoint-or-cmd-string>), or array:hexstr(a[0]),hexstr(a[1])... . This allows us to preserve the original type of the image's entrypoint/cmd as well as the runtime entrypoint/cmd, and to preserve the exact bytes for the eventual final call to exec. The static files built into an emulabized image are /etc/emulab/docker/{entrypoint.image,cmd.image}, and those created dynamically at runtime if the user changes the entrypoint or cmd are bind-mounted to /etc/emulab/docker/{entrypoint.runtime,cmd.runtime}. Given the presence (or absence!) of those files, this script implements the emulation based upon their content.
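The file format above is only described in outline, so this is a sketch under assumptions, not run.pl: `hexstr` is taken to mean hex of the raw bytes, and array elements are comma-separated with no spaces. The argv assembly follows Docker's documented ENTRYPOINT/CMD interaction (a shell-form ENTRYPOINT ignores CMD; an exec-form ENTRYPOINT gets CMD appended, and a shell-form CMD becomes `/bin/sh -c ...`). Choosing the `.runtime` file over the `.image` file is not modeled.

```python
from typing import List, Optional, Union

# bytes = string (shell) form, list of bytes = array (exec) form
Value = Union[bytes, List[bytes]]


def encode(value: Value) -> str:
    """Serialize to the single-line string:/array: format, keeping exact bytes."""
    if isinstance(value, bytes):
        return "string:" + value.hex()
    return "array:" + ",".join(item.hex() for item in value)


def decode(line: str) -> Value:
    """Parse one line back into its original type (hex may be either case)."""
    kind, sep, payload = line.strip().partition(":")
    if not sep:
        raise ValueError("missing type prefix")
    if kind == "string":
        return bytes.fromhex(payload)
    if kind == "array":
        # Caveat: [] and [b""] both encode as "array:"; this decodes it as [].
        return [bytes.fromhex(h) for h in payload.split(",")] if payload else []
    raise ValueError(f"unknown type {kind!r}")


def as_argv(value: Value) -> List[bytes]:
    # String form is handed to the shell; array form is exec'd verbatim.
    if isinstance(value, bytes):
        return [b"/bin/sh", b"-c", value]
    return list(value)


def final_argv(entrypoint: Optional[Value], cmd: Optional[Value]) -> List[bytes]:
    if entrypoint is None:
        return as_argv(cmd) if cmd is not None else []
    if isinstance(entrypoint, bytes):
        return as_argv(entrypoint)  # shell-form ENTRYPOINT: CMD is ignored
    return as_argv(entrypoint) + (as_argv(cmd) if cmd is not None else [])
```

Keeping the values as bytes end to end is what lets the final exec see exactly what the image or user specified, with no quoting round trip through the shell.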
-
David Johnson authored
-
David Johnson authored
-
Mike Hibler authored
-
Leigh B Stoller authored
-
Leigh B Stoller authored
-
Mike Hibler authored
-
Mike Hibler authored
The full port is fixed at version 0.29.1. The latest version that was wrapped, version 0.30.1, has problems with unicode to "string" conversions. This explicitly caused an exception from the m2crypto SWIG stubs for libssl. Even after fixing that, we still could not verify a certificate due to apparently missing chars in strings.
-
- Oct 24, 2018
Leigh B Stoller authored
* When deleting a LAN when there is only one interface left, we need to go back and delete the interface from the last node. Otherwise it's a malformed rspec (which we have been ignoring), but it was passing through to the manifest, which made it a malformed manifest.
* But a later bug was causing that now-removed interface to sneak back in via the old copy of the manifest in the database.
* Also fix a bug that was causing multiple versions of the site_info element to get inserted during an update.
* Remove code that updates the manifest in the DB; use the existing Aggregate->UpdateManifest() method instead.
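To illustrate the first fix, here is a sketch over a hypothetical, heavily simplified rspec model (plain dicts, not Emulab's real data structures): when removing a member leaves a LAN with a single interface, both the LAN and that last interface are dropped, so nothing dangling reaches the manifest.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Rspec:
    """Hypothetical, simplified rspec model for illustration only."""
    # node name -> interface ids defined on that node
    node_ifaces: Dict[str, Set[str]] = field(default_factory=dict)
    # LAN name -> (node, interface id) members
    lans: Dict[str, List[Tuple[str, str]]] = field(default_factory=dict)


def remove_lan_member(spec: Rspec, lan: str, node: str, iface: str) -> None:
    """Remove one member from a LAN, cleaning up a LAN left with one interface."""
    members = spec.lans[lan]
    members.remove((node, iface))
    spec.node_ifaces[node].discard(iface)
    if len(members) < 2:
        # A one-member LAN is malformed: delete the LAN and the leftover
        # interface on the last node, so it cannot leak into the manifest.
        for last_node, last_iface in members:
            spec.node_ifaces[last_node].discard(last_iface)
        del spec.lans[lan]
```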
-
Mike Hibler authored
Avoid gratuitous serial line signal changes when opening the USB device for the Arduino. Otherwise, the Arduino will reset its state.
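The commit does not say which signal changes were suppressed or how. As an assumption, one common POSIX technique is to clear the HUPCL flag so that closing the port does not drop DTR, since many Arduino boards auto-reset when DTR toggles. The shell equivalent is `stty -F <dev> -hupcl` with GNU stty (`stty -f` on FreeBSD). Note that on some systems the very first open still raises DTR; this only stops later close/open cycles from toggling it. The function names are hypothetical.

```python
import os
import termios


def clear_hupcl(fd: int) -> None:
    """Stop the tty driver from dropping DTR/RTS when the last close happens."""
    attrs = termios.tcgetattr(fd)      # [iflag, oflag, cflag, lflag, ispeed, ospeed, cc]
    attrs[2] &= ~termios.HUPCL         # clear "hang up on close" in cflag
    termios.tcsetattr(fd, termios.TCSANOW, attrs)


def open_serial_no_reset(path: str) -> int:
    """Open a serial device without making it our controlling tty, then clear HUPCL."""
    fd = os.open(path, os.O_RDWR | os.O_NOCTTY)
    clear_hupcl(fd)
    return fd
```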
-
Leigh B Stoller authored
experiment running that uses that profile. A small bug here prevented the Terminate button from getting enabled. In general, though, I wonder if we should not allow a profile to be deleted while it's instantiated. :-)
-
- Oct 23, 2018
Leigh B Stoller authored
-
Leigh B Stoller authored
rolling.
-
Leigh B Stoller authored
when getting a termination signal. Also add an option to not redefine the HUP handler, which is needed for the portal_monitor, which uses the HUP signal to reopen the logfile (from syslogd).
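The commit's own code and language are not shown here. The following Python sketch illustrates the idea under that caveat: termination signals run a cleanup hook and exit, and an option (named `redefine_hup` here, a hypothetical name) leaves SIGHUP alone so a daemon like the portal_monitor can keep using it to reopen its logfile.

```python
import signal
import sys
from typing import Callable


def install_handlers(cleanup: Callable[[], None], redefine_hup: bool = True) -> None:
    """Run cleanup() and exit on termination signals.

    With redefine_hup=False, SIGHUP is not touched, so any existing
    handler (e.g. one that reopens a logfile) stays in place.
    """
    def on_terminate(signum, frame):
        cleanup()
        sys.exit(128 + signum)  # conventional "killed by signal" exit status

    signals = [signal.SIGTERM, signal.SIGINT]
    if redefine_hup:
        signals.append(signal.SIGHUP)
    for sig in signals:
        signal.signal(sig, on_terminate)
```

A monitor would first install its own SIGHUP handler that reopens the log, then call `install_handlers(cleanup, redefine_hup=False)`.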
-
Leigh B Stoller authored
-
Leigh B Stoller authored
This version is intended to replace the old autostatus monitor on bas, except for monitoring the Mothership itself. We also notify the Slack channel like the autostatus version. Driven from the apt_aggregates table in the DB, we do the following:
1. fping all the boss nodes.
2. fping all the ops nodes and dboxen. Aside: there are two special cases for now that will eventually come from the database: 1) powder wireless aggregates do not have a public ops node, and 2) the dboxen are hardwired into a table at the top of the file.
3. Check all the DNS servers. Unlike autostatus (which just checks that port 53 is listening), we do an actual lookup at the server. This is done with dig @ the boss node with recursion turned off. At the moment this is a serialized test of all the DNS servers; we might need to change that later. I've lowered the timeout, and if things are operational 99% of the time (which I expect), then this will be okay until we have a couple of dozen aggregates to test. Note that this test is skipped if the boss is not pingable in the first step, so in general this test will not be a bottleneck.
4. Check all the CMs with a GetVersion() call. As with the DNS check, we skip this if the boss does not ping. This test *is* done in parallel using ParRun(), since it's slower and the most likely to time out when the CM is busy. The timeout is 20 seconds. This seems to be the best balance between too much email and not hanging for too long on any one aggregate.
5. Send email and Slack notifications. The current loop runs every 60 seconds, and each test has to fail twice in a row before we mark it as a failure and send a notification. We also send a 24-hour update for anything that is still down.
At the moment, the full set of tests takes 15 seconds on our seven aggregates when they are all up. It will need more tuning later, as the number of aggregates goes up.
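Step 5 is a small per-test state machine. A sketch under stated assumptions: the class and function names are hypothetical, times are in seconds, and because the commit does not say whether recovery is announced, this version just clears the state when a test passes again.

```python
from dataclasses import dataclass
from typing import Optional

FAIL_THRESHOLD = 2          # consecutive failures before alerting
REMIND_SECS = 24 * 60 * 60  # re-notify while still down


@dataclass
class CheckState:
    consecutive_failures: int = 0
    down: bool = False
    last_notified: Optional[float] = None


def record(state: CheckState, ok: bool, now: float) -> Optional[str]:
    """Feed one check result; return a notification kind, or None."""
    if ok:
        # Assumption: recovery is not announced; just reset.
        state.consecutive_failures = 0
        state.down = False
        state.last_notified = None
        return None
    state.consecutive_failures += 1
    if not state.down:
        if state.consecutive_failures >= FAIL_THRESHOLD:
            state.down = True
            state.last_notified = now
            return "down"
        return None  # a single failure is not reported
    if state.last_notified is not None and now - state.last_notified >= REMIND_SECS:
        state.last_notified = now
        return "still down"
    return None
```

Requiring two consecutive failures suppresses alerts for one-off blips (e.g. a single GetVersion() timeout on a busy CM) at the cost of one extra loop of detection latency.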
-
Leigh B Stoller authored
-
Leigh B Stoller authored
-
Leigh B Stoller authored
current experiment if there is one. This is convenient.
-
Leigh B Stoller authored
-
Leigh B Stoller authored
when CRLs are enabled. This used to be the default but is now an option we need to turn on.
-