- 08 Feb, 2019 1 commit
Leigh Stoller authored
* Use an Emulab feature to control which projects use the new ppwizard and geni-lib code. The feature is applied to the profile's project; who is instantiating, and which project it is instantiated into, do not really matter, since the incompatible changes are also associated with the profile.
* Run both versions of the ppwizard side by side, and flip between them when the user is using the profile picker.
* The new version of geni-lib is /usr/testbed/opsdir/lib/geni-lib.new; we tell the geni-lib jail to use that directory when on the new path.
* All of this is temporary.
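As a rough illustration of the gating described above, a minimal sketch in Python; only the geni-lib.new path comes from the commit, while the feature name, the lookup helper, and the legacy path are assumptions:

```python
# Hypothetical sketch of the per-project gate: pick which geni-lib tree the
# jail should use based on a feature attached to the profile's project.
# Only /usr/testbed/opsdir/lib/geni-lib.new comes from the commit text.

OLD_GENILIB_DIR = "/usr/testbed/lib/geni-lib"               # assumed legacy location
NEW_GENILIB_DIR = "/usr/testbed/opsdir/lib/geni-lib.new"    # from the commit

# Stand-in for the real EmulabFeatures lookup (assumed data source).
FEATURE_PROJECTS = {"new-ppwizard": {"someproject"}}

def feature_enabled(feature: str, project: str) -> bool:
    return project in FEATURE_PROJECTS.get(feature, set())

def genilib_dir_for_profile(profile_project: str) -> str:
    # Keyed off the profile's project: who instantiates, and into which
    # project, does not matter, since the incompatible changes travel
    # with the profile.
    if feature_enabled("new-ppwizard", profile_project):
        return NEW_GENILIB_DIR
    return OLD_GENILIB_DIR
```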
- 07 Feb, 2019 1 commit
Leigh Stoller authored
- 22 Jan, 2019 1 commit
Leigh Stoller authored
- 11 Jan, 2019 2 commits
Leigh Stoller authored
Leigh Stoller authored
David is going to add this to prepare so we do not create images with stale repos on them.
- 14 Dec, 2018 1 commit
Leigh Stoller authored
- 13 Dec, 2018 1 commit
Leigh Stoller authored
into a whacked out state cause of earlier errors. Be careful.
- 30 Nov, 2018 2 commits
Leigh Stoller authored
project leader) to give write (editing) permission to everyone else in the project.
Leigh Stoller authored
- 28 Nov, 2018 1 commit
Leigh Stoller authored
licensing later.
- 08 Nov, 2018 1 commit
Leigh Stoller authored
a user error.
- 26 Oct, 2018 1 commit
Leigh Stoller authored
* Respect the default branch at the origin; gitlab/github let you set the default branch on a repo, which we were ignoring, always using master. Now we ask the remote for the default branch when we clone/update the repo and set that locally. Like gitlab/github, mark the default branch in the branch list with a "default" badge so the user knows.
* Changes to the timer that asks whether the repo hash has changed (via a push hook); this has a race in it, and I have solved part of it. It is not a serious problem, just a UI annoyance I am working on fixing. Added a cheesy mechanism to make sure the timer is not running at the same time the user clicks Update().
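One way to ask a remote for its default branch is the symbolic HEAD ref, via `git ls-remote --symref origin HEAD`; a small sketch of that approach (the helper name and the fallback are assumptions, not the actual code):

```python
# Sketch: ask the remote for its default branch via the symbolic HEAD ref.
# `git ls-remote --symref origin HEAD` prints a line like:
#   ref: refs/heads/main	HEAD
import subprocess

def remote_default_branch(repo_dir: str, remote: str = "origin") -> str:
    out = subprocess.run(
        ["git", "-C", repo_dir, "ls-remote", "--symref", remote, "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if line.startswith("ref:"):
            ref = line.split()[1]             # e.g. "refs/heads/main"
            return ref[len("refs/heads/"):]
    return "master"                           # old behavior as a fallback
```

On the local clone, `git remote set-head origin --auto` records the same answer under refs/remotes/origin/HEAD.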
- 25 Oct, 2018 1 commit
Leigh Stoller authored
- 23 Oct, 2018 5 commits
Leigh Stoller authored
rolling.
Leigh Stoller authored
This version is intended to replace the old autostatus monitor on bas, except for monitoring the Mothership itself. We also notify the Slack channel, like the autostatus version. Driven from the apt_aggregates table in the DB, we do the following:
1. fping all the boss nodes.
2. fping all the ops nodes and dboxen. Aside: there are two special cases for now that will eventually come from the database: 1) powder wireless aggregates do not have a public ops node, and 2) the dboxen are hardwired into a table at the top of the file.
3. Check all the DNS servers. Unlike autostatus (which just checks that port 53 is listening), we do an actual lookup at the server. This is done with dig @ the boss node with recursion turned off. At the moment this is a serialized test of all the DNS servers; that might need to change later. I've lowered the timeout, and if things are operational 99% of the time (which I expect), this will be okay until we get a couple of dozen aggregates to test. Note that this test is skipped if the boss is not pingable in the first step, so in general it will not be a bottleneck.
4. Check all the CMs with a GetVersion() call. As with the DNS check, we skip this if the boss does not ping. This test *is* done in parallel using ParRun(), since it is slower and the most likely to time out when the CM is busy. The timeout is 20 seconds, which seems to be the best balance between too much email and not hanging too long on any one aggregate.
5. Send email and Slack notifications. The current loop runs every 60 seconds, and each test has to fail twice in a row before we mark it as a failure and send notification. Also send a 24-hour update for anything that is still down.
At the moment, the full set of tests takes 15 seconds on our seven aggregates when they are all up. Will need more tuning later as the number of aggregates goes up.
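A condensed sketch of the per-aggregate checks described above, as a Python stand-in for the real Perl daemon; the fping/dig/GetVersion steps follow the text, but the helper names and aggregate record layout are assumptions, and the certificate handling the real CM call needs is omitted:

```python
# Sketch of the per-aggregate checks (not the real daemon).
import socket
import subprocess
import xmlrpc.client

def pingable(host: str) -> bool:
    # fping exits 0 when the target answers.
    return subprocess.run(["fping", "-q", host], capture_output=True).returncode == 0

def dns_ok(boss: str, name: str = "boss") -> bool:
    # Lookup directly at the aggregate's server with recursion turned off,
    # so we test the server itself, not just that port 53 is open.
    r = subprocess.run(["dig", f"@{boss}", "+norecurse", "+time=3", "+tries=1", name],
                       capture_output=True)
    return r.returncode == 0

def cm_ok(cm_url: str, timeout: int = 20) -> bool:
    socket.setdefaulttimeout(timeout)
    try:
        xmlrpc.client.ServerProxy(cm_url).GetVersion()
        return True
    except Exception:
        return False

def check_aggregate(agg: dict) -> dict:
    status = {"boss": pingable(agg["boss"]), "ops": pingable(agg["ops"])}
    if status["boss"]:
        status["dns"] = dns_ok(agg["boss"])
        status["cm"] = cm_ok(agg["cm_url"])   # the real daemon runs CM checks in parallel (ParRun)
    else:
        # Skip DNS and CM checks when boss does not even ping.
        status["dns"] = status["cm"] = False
    return status
```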
Leigh Stoller authored
Leigh Stoller authored
current experiment if there is one. This is convenient.
Leigh Stoller authored
- 08 Oct, 2018 1 commit
Leigh Stoller authored
- 01 Oct, 2018 2 commits
Leigh Stoller authored
1. Split the resource stuff (where we ask for an advertisement and process it) into a separate script, since that takes a long time to cycle through because of the size of the ads from the big clusters.
2. On the monitor, distinguish offline (nologins) from actually being down.
3. Add a table to store changes in status, so we can see over time how much of the time the aggregates are usable.
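A small sketch of the status-change bookkeeping in point 3, recording a row only on transitions so usable time can be computed later; the table and column names are made up, not the real schema:

```python
# Sketch: log a row only when an aggregate's status actually changes.
import sqlite3
import time

SCHEMA = """CREATE TABLE IF NOT EXISTS aggregate_status_log
            (urn TEXT, status TEXT, stamp INTEGER)"""

def record_status(db: sqlite3.Connection, urn: str, status: str) -> None:
    row = db.execute(
        "SELECT status FROM aggregate_status_log WHERE urn = ? "
        "ORDER BY stamp DESC LIMIT 1", (urn,)).fetchone()
    if row is None or row[0] != status:          # record transitions only
        db.execute("INSERT INTO aggregate_status_log VALUES (?, ?, ?)",
                   (urn, status, int(time.time())))
        db.commit()

# Example: "up" -> "offline" (nologins) -> "down" each get exactly one row.
db = sqlite3.connect(":memory:")
db.execute(SCHEMA)
record_status(db, "urn:publicid:IDN+example+authority+cm", "up")
```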
Leigh Stoller authored
terminate instead.
- 28 Sep, 2018 6 commits
Leigh Stoller authored
Leigh Stoller authored
Leigh Stoller authored
Leigh Stoller authored
Leigh Stoller authored
Terminate and/or Freeze. So now we can send email to users about an experiment, with the mail coming from the system and not from us personally.
Leigh Stoller authored
like a long time, but let's try to avoid flapping, especially on the POWDER fixed nodes. Might revisit with a per-aggregate period setting. Send mail only once per day (and when the daemon starts), and send email when the aggregate is alive again. This closes issue #425.
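A rough sketch of the resulting notification policy, combining this commit with the monitor commit above (fail twice in a row before reporting, at most one reminder per day while still down, and a recovery note); the state fields and function names are assumptions, the real daemon is Perl:

```python
# Sketch of the down/up notification throttle.
import time

DAY = 24 * 3600

def update(state: dict, healthy: bool, notify, now: float = None) -> None:
    now = time.time() if now is None else now
    if healthy:
        if state.get("down"):
            notify("aggregate is alive again")
        state.update(down=False, failures=0)
        return
    state["failures"] = state.get("failures", 0) + 1
    if state["failures"] < 2:                 # tolerate a single blip (anti-flapping)
        return
    if not state.get("down") or now - state.get("last_mail", 0) >= DAY:
        notify("aggregate is down")           # e.g. email plus the Slack channel
        state.update(down=True, last_mail=now)
```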
- 26 Sep, 2018 1 commit
Leigh Stoller authored
- 21 Sep, 2018 2 commits
Leigh Stoller authored
with cause and optionally freeze the user. "Cause" means you can paste in a block of text that is emailed to the user.
Leigh Stoller authored
This is actually implemented in a backend Perl script, so you can do the search from the command line, but that would be silly, right?
- 17 Sep, 2018 2 commits
Leigh Stoller authored
Leigh Stoller authored
- 04 Sep, 2018 2 commits
Leigh Stoller authored
Leigh Stoller authored
via stdout.
- 13 Aug, 2018 1 commit
Leigh Stoller authored
- 09 Aug, 2018 1 commit
Leigh Stoller authored
- 08 Aug, 2018 2 commits
Leigh Stoller authored
Leigh Stoller authored
* I started out to add just deferred aggregates: those that are offline when starting an experiment (and marked in the apt_aggregates table as being deferrable). When an aggregate is offline, we add an entry to the new apt_deferred_aggregates table and periodically retry to start the missing slivers. To accomplish this, I split create_instance into two scripts: the first creates the instance in the DB, and the second (create_slivers) creates slivers for the instance. The daemon calls create_slivers for any instances in the deferred table, until all deferred aggregates are resolved. On the UI side, there are various changes to allow experiments to be partially created. For example, we used to wait until we had all the manifests before showing the topology; now we show the topology on the first manifest and add the others as they come in. Various parts of the UI had to change to deal with missing aggregates, and I am sure I did not get them all.
* And then once I had that, I realized that "scheduled" experiments were an "easy" addition; they are just a degenerate case of deferred. For this I added some new slots to the tables to hold the scheduled start time, and added a started stamp so we can distinguish between the time it was created and the time it was actually started. Lots of data. On the UI side, there is a new fourth step on the instantiate page to give the user a choice of immediate or scheduled start. I moved the experiment duration to this step. I was originally going to add a calendar choice for termination, but I did not want to change the existing 16-hour max duration policy yet.
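A bare-bones sketch of what the daemon side of deferred and scheduled starts might look like; the apt_deferred_aggregates table and the create_slivers script come from the commit, while the query shape, column names, and polling interval are assumptions:

```python
# Sketch of the retry loop for deferred/scheduled instances (not the real daemon).
import subprocess
import time

def pending_instances(db):
    # Instances with deferred aggregates, plus scheduled instances whose
    # start time has arrived (scheduling is a degenerate case of deferral).
    return db.execute(
        """SELECT DISTINCT instance_uuid FROM apt_deferred_aggregates
           UNION
           SELECT uuid FROM apt_instances
            WHERE scheduled_start IS NOT NULL
              AND started IS NULL
              AND scheduled_start <= NOW()""").fetchall()

def daemon_loop(db, interval: int = 300):
    while True:
        for (uuid,) in pending_instances(db):
            # create_slivers retries whatever is still missing for the
            # instance; it becomes a no-op once everything is created.
            subprocess.run(["create_slivers", uuid])
        time.sleep(interval)
```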
- 07 Aug, 2018 2 commits
Leigh Stoller authored
Leigh Stoller authored