- 07 Nov, 2018 2 commits
-
-
Leigh B Stoller authored
From Slack: What I notice is that mysqldump is read locking all of the tables for a long time. This time gets longer and longer, of course, as the DB gets bigger. Last night enough stuff backed up (trying to get various write locks) that we hit the 500 thread limit. I only know this because mysql prints "killing 501" threads at 2:03am. Which makes me wonder if our thread limit is too small (but it seems like it would have to be much bigger) or if our backup strategy is inappropriate for how big the DB is and how busy the system is. But to be clear, I am not even sure if mysqld throws in the towel when it hits 500 threads; I am in the midst of reading obtuse mysql documentation. There are a bunch of other error messages that I do not understand yet.
I can reproduce this in my elabinelab with a 10 line perl script. Two problems: 1) we do not use the permission system, so we cannot use dynamic permissions, which means that the single thread that is left for just this case can be used by anyone, and so the server is fully out of threads. And 2) the Emulab mysql watchdog then cannot perform its query, so it thinks mysqld has gone catatonic and kills it, right in the middle of the backup. Yuck * 2.
And if anyone is curious about a more typical approach: "If you want to do this for MyISAM or mixed tables without any downtime from locking the tables, you can set up a slave database, and take your snapshots from there. Setting up the slave database, unfortunately, causes some downtime to export the live database, but once it's running, you should be able to lock its tables, and export using the methods others have described. When this is happening, it will lag behind the master, but won't stop the master from updating its tables, and will catch up as soon as the backup is complete"
-
Leigh B Stoller authored
like a mellanox switch.
-
- 06 Nov, 2018 4 commits
-
-
David Johnson authored
(We don't want systemd sending them SIGTERM before bootvnodes can get them!)
-
David Johnson authored
-
David Johnson authored
-
David Johnson authored
(For the vhost cases we have today, it makes more sense to let the cluster pick a reasonable default OS for the vhost, when the user declares a vhost but doesn't set an image for it. The default OS is almost certain to be unable to host the hosted nodes anyway.)
-
- 05 Nov, 2018 6 commits
-
-
Leigh B Stoller authored
* The primary problem with the mellanox is that the install image does a kexec out of ONIE into Linux, spends 30+ minutes doing stuff, and then reboots. This throws the reload state machine out of whack because we do not get a chance to send the RELOADDONE state. So ... some changes to rc.testbed and rc.reload on the USB dongle: the ONIE MFS sends RELOADING and writes a flag file to the ONIE partition on the "disk" (not the usb). Then comes the kexec into MLNX, the install happens, and the switch reboots. The next boot into ONIE sees the flag file, erases it, and sends RELOADDONE. It waits for a bit, and then continues on the normal path. This abuses stated in that there are whiny messages in the stated log file, but I am immune to stated whining.
* Another item of note is that the switch DHCPs, but only to get the IP info; there is no ability to give it an initial config file like we can with the Dell switches. The main problem here is that the switch comes up with its default login/password, which is obviously well known because it's in the manual. That means there is a window where the switch is vulnerable, but since we block the switches from the public side, this is not a serious problem. As soon as we can get in (sshd is running) we log in and update the config with passwords, keys, etc.
* Other changes to the machine dependent osload library module. I had done some of this before switching to the Dells way back when, but it needed to be updated/completed.
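The flag-file handshake described in the first bullet can be sketched roughly as follows. This is only an illustration in Python, not the actual rc.testbed/rc.reload code; the flag file name, the `send_state` callback, and the returned action strings are all hypothetical.

```python
import os

# Hypothetical flag file name; the real name on the ONIE partition is not
# given in the commit message.
FLAG_NAME = "reload_in_progress"


def onie_boot(flag_dir, send_state):
    """Model one boot into ONIE and return the next action.

    flag_dir   -- directory standing in for the ONIE partition on the "disk"
    send_state -- callable that reports a state name to the state machine
                  (an assumption standing in for the real event send)
    """
    flag = os.path.join(flag_dir, FLAG_NAME)
    if os.path.exists(flag):
        # Second boot: the install already ran and rebooted us.
        os.remove(flag)
        send_state("RELOADDONE")
        return "continue-normal-boot"
    # First boot: announce the reload and leave a marker for the next boot.
    send_state("RELOADING")
    with open(flag, "w") as f:
        f.write("1\n")
    return "kexec-into-installer"
```

Calling `onie_boot` twice against the same directory reports RELOADING on the first call and RELOADDONE on the second, and leaves no flag file behind.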
-
Leigh B Stoller authored
empty testbed test. Prior to this commit, we were not invoking the empty testbed case consistently. Now we do, but that exposed another problem: reporting that error to the Portal in a meaningful way. Basically, we can report a different error code for an impossible to map error, but then we lose the info we store now about what the actual failure was (which we show to the user with additional helpful info). Since we cannot (easily) change the Geni API for CreateSliver(), I have elected to continue the practice of returning the specific error codes (which also go into the database for long term historical info), and add more helpful text for the Portal user that explains clearly that the mapping is impossible on the target cluster. This extra text also goes into the database in the attached message field, so we can come back later and post process if we decide to do something different.
-
Leigh B Stoller authored
-
Leigh B Stoller authored
-
Leigh B Stoller authored
-
Leigh B Stoller authored
-
- 30 Oct, 2018 5 commits
-
-
David Johnson authored
(No existing code ever checked the return value from TBSetNodeEventState until libosload_virtnode started to do so, to retry failed event sends under high load in large-scale vnode experiments. libevent, Node, and libdb alternate return conventions; this sets them right.)
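A retry loop of the kind mentioned above only works when the send call's return value means the same thing everywhere. The following is a minimal Python sketch of that idea, not the libosload_virtnode code; the `send` callable, its "True means success" convention, and the retry count are assumptions made for illustration.

```python
import time


def send_with_retry(send, retries=3, delay=0.0):
    """Call send() until it reports success or the attempts run out.

    Assumes send() returns a truthy value on success and a falsy value on
    failure; that convention is hypothetical, the point being that the
    caller must be able to trust it.
    Returns the attempt number that succeeded, or 0 if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        if send():
            return attempt
        if attempt < retries:
            time.sleep(delay)  # back off briefly before the next attempt
    return 0
```

If the callee instead returned 0 on success in some code paths, this loop would retry successful sends and give up on failed ones, which is the kind of mismatch the commit describes fixing.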
-
David Johnson authored
-
David Johnson authored
This is necessary for clusters that run an arp lockdown on boss. This eluded me for a long time. None of the documented ways to set the mac address of an endpoint on container create work (they only work on post-create network attach). You have to use some special, weird, undocumented magic.
-
David Johnson authored
(Most of these got lost in some other commit storm, I believe. The firewall fixes are new, for newer Dockers that drop traffic by default.)
-
Leigh B Stoller authored
-
- 29 Oct, 2018 2 commits
-
-
Leigh B Stoller authored
with a modern implementation.
-
Leigh B Stoller authored
days, extended check locks up the system for too long and it never ever finds any problems. So do a "medium" check instead, which runs 5 times faster.
-
- 26 Oct, 2018 10 commits
-
-
David Johnson authored
-
David Johnson authored
-
David Johnson authored
-
David Johnson authored
-
David Johnson authored
-
Mike Hibler authored
Turns out we have not been installing (via slicefix) the local site certs on nodes after they have been imaged. We haven't noticed because we don't usually use SSL-enabled tmcd. Leigh noticed because we do use it in the script that locks down ARP entries.
-
Leigh B Stoller authored
that and point them to the Geni portal to fix it.
-
Leigh B Stoller authored
-
Leigh B Stoller authored
* Respect default branch at the origin; gitlab/github allows you to set the default branch on the repo, which we were ignoring, always using master. Now, we ask the remote for the default branch when we clone/update the repo and set that locally. Like gitlab/github, mark the default branch in the branchlist with a "default" badge so the user knows.
* Changes to the timer that is asking if the repohash has changed (via a push hook). This has a race in it, and I have solved part of it. It is not a serious problem, just a UI annoyance I am working on fixing. Added a cheesy mechanism to make sure the timer is not running at the same time the user clicks on Update().
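One way to ask a remote for its default branch is `git ls-remote --symref <url> HEAD`, which prints a `ref: refs/heads/<name>` line for HEAD. The commit message does not say which mechanism the actual code uses, so the Python sketch below is only an assumption; the function names and the badge text format are hypothetical.

```python
import subprocess


def parse_default_branch(ls_remote_output):
    """Extract the branch name from `git ls-remote --symref <url> HEAD` output.

    The symref line looks like "ref: refs/heads/<name>\tHEAD".
    Returns None when no such line is present.
    """
    for line in ls_remote_output.splitlines():
        if line.startswith("ref: ") and line.endswith("\tHEAD"):
            ref = line[len("ref: "):-len("\tHEAD")]
            prefix = "refs/heads/"
            if ref.startswith(prefix):
                return ref[len(prefix):]
    return None


def remote_default_branch(url):
    """Query the remote; needs git on PATH and network access to url."""
    out = subprocess.run(
        ["git", "ls-remote", "--symref", url, "HEAD"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_default_branch(out)


def badge_branches(branches, default):
    """Label the default branch, roughly like a "default" badge in a list."""
    return [b + " [default]" if b == default else b for b in branches]
```

When the remote reports no symref for HEAD, `parse_default_branch` returns None, and a caller could then fall back to master as before.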
-
Leigh B Stoller authored
it is a jail, and its mac is the same as boss.
-
- 25 Oct, 2018 10 commits
-
-
Aleksander Maricq authored
-
Leigh B Stoller authored
-
David Johnson authored
(Also, add support for user to change container entrypoint at runtime. Note also that the server side now stores the entrypoint/cmd/env attributes as base64url-encoded virt_node_attributes, so that we can just use the existing table_regex for those values.)
We add a new runit service (/etc/service/dockerentrypoint) to clientside/tmcc/linux/docker/dockerfiles/common to handle the entrypoint/cmd/env/workingdir/user emulation.
From the comments: Docker's semantics for ENTRYPOINT/CMD vary depending on if those values are specified as arrays of string, or simply as single strings (which must be interpreted by /bin/sh -c). Handling all the quoting possibilities in the shell is a major pain. So, this script handles the basic stuff (in particular, sourcing env vars, because we want the shell to interpret them!) -- then execs our perl companion script (run.pl) to deal with the entrypoint/command files that libvnode_docker::emulabizeImage and libvnode_docker::vnodeCreate populated. libvnode_docker creates these single-line files in /etc/emulab/docker as either string:hexstr(<entrypoint-or-cmd-string>), or array:hexstr(a[0]),hexstr(a[1])... . This allows us to preserve the original type of the image's entrypoint/cmd as well as the runtime entrypoint/cmd, and to preserve the exact bytes for the eventual final call to exec. The static files built into an emulabized image are /etc/emulab/docker/{entrypoint.image,cmd.image}, and those created dynamically at runtime if the user changes the entrypoint or cmd are bind-mounted to /etc/emulab/docker/{entrypoint.runtime,cmd.runtime}. Given the presence (or absence!) of those files, this script implements the emulation, based upon the content in those files.
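The single-line `string:` / `array:` file format described above can be illustrated with a small Python round trip. This is a sketch only; the real writer is libvnode_docker and the real reader is run.pl, both Perl. It assumes lowercase hex, UTF-8 text, and a bare comma separator, none of which the commit message spells out.

```python
def encode_command(value):
    """Encode an entrypoint/cmd as one line, keeping its original type.

    A str becomes "string:<hex>"; a list of str becomes
    "array:<hex>,<hex>,...". Hex case and the UTF-8 choice are assumptions.
    """
    if isinstance(value, str):
        return "string:" + value.encode("utf-8").hex()
    return "array:" + ",".join(item.encode("utf-8").hex() for item in value)


def decode_command(line):
    """Invert encode_command, returning a str or a list of str."""
    kind, _, payload = line.strip().partition(":")
    if kind == "string":
        return bytes.fromhex(payload).decode("utf-8")
    if kind == "array":
        if payload == "":
            # In this sketch an empty payload means an empty array, so a
            # one-element array holding "" cannot be told apart from it.
            return []
        return [bytes.fromhex(p).decode("utf-8") for p in payload.split(",")]
    raise ValueError("unknown command type: %r" % kind)
```

Because every argument is hex-encoded separately, quoting and whitespace survive untouched, and the decoder knows whether to hand the result to `/bin/sh -c` (string form) or to exec it directly (array form).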
-
David Johnson authored
-
David Johnson authored
-
Mike Hibler authored
-
Leigh B Stoller authored
-
Leigh B Stoller authored
-
Mike Hibler authored
-
Mike Hibler authored
The full port is fixed at version 0.29.1. The latest version that was wrapped, version 0.30.1, has problems with unicode to "string" conversions. This explicitly caused an exception from the m2crypto SWIG stubs for libssl. Even after fixing that, we still could not verify a certificate due to apparent missing chars in strings.
-
- 24 Oct, 2018 1 commit
-
-
Leigh B Stoller authored
* When deleting a lan and there is only one interface left, need to go back and delete the interface from the last node. Else it's a malformed rspec (which we have been ignoring), but it was passing through to the manifest, which made it a malformed manifest.
* But a later bug was causing that now removed interface to sneak back in via the old copy of the manifest in the database.
* Also fix a bug that was causing multiple versions of the site_info element to get inserted during an update.
* Remove code that updates the manifest in the DB, use the existing Aggregate->UpdateManifest() method instead.
-