- 14 Dec, 2018 1 commit
Leigh B Stoller authored

- 13 Dec, 2018 1 commit
Leigh B Stoller authored
into a whacked out state because of earlier errors. Be careful.

- 30 Nov, 2018 2 commits
Leigh B Stoller authored
project leader) to give write (editing) permission to everyone else in the project.

Leigh B Stoller authored

- 28 Nov, 2018 1 commit
Leigh B Stoller authored
licensing later.

- 08 Nov, 2018 1 commit
Leigh B Stoller authored
a user error.

- 26 Oct, 2018 1 commit
Leigh B Stoller authored
* Respect the default branch at the origin; gitlab/github allows you to set the default branch on a repo, which we were ignoring, always using master. Now we ask the remote for the default branch when we clone/update the repo and set that locally (see the sketch below). Like gitlab/github, mark the default branch in the branch list with a "default" badge so the user knows.
* Changes to the timer that asks if the repohash has changed (via a push hook); this has a race in it, and I have solved part of it. It is not a serious problem, just a UI annoyance I am working on fixing. Added a cheesy mechanism to make sure the timer is not running at the same time the user clicks Update().
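A minimal sketch of one way to ask the remote for its default branch using plain git; this is an illustration only, not the portal's actual (Perl-side) code:

```python
import subprocess

def remote_default_branch(repo_dir, remote="origin"):
    # "git ls-remote --symref <remote> HEAD" prints a line such as
    # "ref: refs/heads/main<TAB>HEAD", naming the branch remote HEAD points at.
    out = subprocess.run(
        ["git", "ls-remote", "--symref", remote, "HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    prefix = "refs/heads/"
    for line in out.splitlines():
        if line.startswith("ref:"):
            ref = line.split()[1]
            return ref[len(prefix):] if ref.startswith(prefix) else ref
    return "master"  # fall back to the old hardwired assumption
```
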
- 25 Oct, 2018 1 commit
Leigh B Stoller authored

- 23 Oct, 2018 5 commits
Leigh B Stoller authored
rolling.

Leigh B Stoller authored
This version is intended to replace the old autostatus monitor on bas, except for monitoring the Mothership itself. We also notify the Slack channel like the autostatus version. Driven from the apt_aggregates table in the DB, we do the following (a rough sketch of the loop is below):

1. fping all the boss nodes.
2. fping all the ops nodes and dboxen. Aside: there are two special cases for now that will eventually come from the database; 1) powder wireless aggregates do not have a public ops node, and 2) the dboxen are hardwired into a table at the top of the file.
3. Check all the DNS servers. Different from autostatus (which just checks that port 53 is listening), we do an actual lookup at the server. This is done with dig @ the boss node with recursion turned off. At the moment this is a serialized test of all the DNS servers; might need to change that later. I've lowered the timeout, and if things are operational 99% of the time (which I expect), then this will be okay until we get a couple of dozen aggregates to test. Note that this test is skipped if the boss is not pingable in the first step, so in general this test will not be a bottleneck.
4. Check all the CMs with a GetVersion() call. As with the DNS check, we skip this if the boss does not ping. This test *is* done in parallel using ParRun() since it's slower and the most likely to time out when the CM is busy. The timeout is 20 seconds, which seems to be the best balance between too much email and not hanging for too long on any one aggregate.
5. Send email and Slack notifications. The current loop is every 60 seconds, and each test has to fail twice in a row before marking it as a failure and sending notification. Also send a 24-hour update for anything that is still down.

At the moment, the full set of tests takes 15 seconds on our seven aggregates when they are all up. Will need more tuning later as the number of aggregates goes up.
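A rough Python sketch of that check-and-debounce loop, under stated assumptions: the real daemon is a Perl script driven from apt_aggregates and runs the CM check in parallel via ParRun(); here the aggregate list shape, the GetVersion() probe, and the notify hook are stand-ins.

```python
import subprocess, time

LOOP_SECONDS   = 60    # poll interval described above
FAIL_THRESHOLD = 2     # a check must fail twice in a row to count

def pingable(host):
    # fping exits non-zero when the host does not answer
    return subprocess.run(["fping", "-q", host],
                          capture_output=True).returncode == 0

def dns_ok(boss):
    # Real lookup at the server itself, recursion off, short timeout,
    # roughly "dig @boss <name> +norecurse +time=3"
    r = subprocess.run(["dig", "@" + boss, boss, "+norecurse", "+time=3"],
                       capture_output=True, text=True)
    return r.returncode == 0 and "ANSWER SECTION" in r.stdout

failures = {}   # (aggregate, check) -> consecutive failure count

def record(aggregate, check, ok, notify):
    key = (aggregate, check)
    failures[key] = 0 if ok else failures.get(key, 0) + 1
    if failures[key] == FAIL_THRESHOLD:
        notify("%s: %s check failing" % (aggregate, check))

def monitor(aggregates, check_cm, notify):
    # aggregates: list of dicts with "name", "boss", "ops" (stand-in shape);
    # check_cm: callable doing the GetVersion() probe; notify: email/Slack hook.
    while True:
        for agg in aggregates:
            boss_up = pingable(agg["boss"])
            record(agg["name"], "boss ping", boss_up, notify)
            record(agg["name"], "ops ping", pingable(agg["ops"]), notify)
            if boss_up:   # DNS and CM checks are skipped if boss is down
                record(agg["name"], "dns", dns_ok(agg["boss"]), notify)
                record(agg["name"], "cm", check_cm(agg), notify)
        time.sleep(LOOP_SECONDS)
```

The fail-twice debounce is what keeps a single transient fping or dig hiccup from generating mail.
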
Leigh B Stoller authored

Leigh B Stoller authored
current experiment if there is one. This is convenient.

Leigh B Stoller authored

- 08 Oct, 2018 1 commit
Leigh B Stoller authored

- 01 Oct, 2018 2 commits
Leigh B Stoller authored
1. Split the resource stuff (where we ask for an advertisement and process it) into a separate script, since that takes a long time to cycle through because of the size of the ads from the big clusters.
2. On the monitor, distinguish offline (nologins) from actually being down.
3. Add a table to store changes in status so we can see over time how much time the aggregates are usable (see the sketch below).
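A sketch of the status-change idea: only transitions are stored, so time spent in each state can be summed later. The table and column names here are hypothetical, and sqlite stands in for the real DB.

```python
import sqlite3, time

# Hypothetical stand-in for the real table; only *changes* in status are
# logged, so per-state durations can be computed afterwards.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE aggregate_status_log
              (aggregate TEXT, status TEXT, stamp REAL)""")

def record_status(aggregate, status):
    row = db.execute(
        "SELECT status FROM aggregate_status_log WHERE aggregate = ? "
        "ORDER BY stamp DESC LIMIT 1", (aggregate,)).fetchone()
    if row is None or row[0] != status:          # only log transitions
        db.execute("INSERT INTO aggregate_status_log VALUES (?, ?, ?)",
                   (aggregate, status, time.time()))
        db.commit()

record_status("cluster-a", "up")        # logged
record_status("cluster-a", "up")        # ignored, no change
record_status("cluster-a", "offline")   # logged (nologins, not down)
```
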
Leigh B Stoller authored
terminate instead.

- 28 Sep, 2018 6 commits
Leigh B Stoller authored

Leigh B Stoller authored

Leigh B Stoller authored

Leigh B Stoller authored

Leigh B Stoller authored
Terminate and/or Freeze. So now we can send email to users about an experiment that comes from the system and not from us personally.

Leigh B Stoller authored
like a long time, but let's try to avoid flapping, especially on the POWDER fixed nodes. Might revisit with a per-aggregate period setting. Send mail only once per day (and when the daemon starts); send email when the aggregate is alive again (see the sketch of this mail policy below). This closes issue #425.
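A small sketch of that mail policy, assuming a simple in-memory map of aggregates currently considered down; the real logic lives in the monitoring daemon.

```python
import time

DAY = 24 * 3600

class MailPolicy:
    def __init__(self, send):
        self.send = send          # email/Slack hook
        self.last_mail = {}       # aggregate -> time of last "down" mail

    def report(self, aggregate, is_up):
        now = time.time()
        if is_up:
            if aggregate in self.last_mail:       # was down, now alive again
                self.send("%s is alive again" % aggregate)
                del self.last_mail[aggregate]
        elif aggregate not in self.last_mail:     # first failure: mail once
            self.last_mail[aggregate] = now
            self.send("%s appears to be down" % aggregate)
        elif now - self.last_mail[aggregate] >= DAY:
            self.last_mail[aggregate] = now       # still down: daily update
            self.send("%s is still down (24 hour update)" % aggregate)
```
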
- 26 Sep, 2018 1 commit
Leigh B Stoller authored

- 21 Sep, 2018 2 commits
Leigh B Stoller authored
with cause and optionally freeze the user. "Cause" means you can paste in a block of text that is emailed to the user.

Leigh B Stoller authored
This is actually implemented in a backend Perl script, so you can do the search from the command line, but that would be silly, right?

- 17 Sep, 2018 2 commits
Leigh B Stoller authored

Leigh B Stoller authored

- 04 Sep, 2018 2 commits
Leigh B Stoller authored

Leigh B Stoller authored
via stdout.

- 13 Aug, 2018 1 commit
Leigh B Stoller authored

- 09 Aug, 2018 1 commit
Leigh B Stoller authored

- 08 Aug, 2018 2 commits
Leigh B Stoller authored

Leigh B Stoller authored
* I started out to add just deferred aggregates: those that are offline when starting an experiment (and marked in the apt_aggregates table as being deferrable). When an aggregate is offline, we add an entry to the new apt_deferred_aggregates table and periodically retry to start the missing slivers. To accomplish this, I split create_instance into two scripts: the first creates the instance in the DB, and the second (create_slivers) creates the slivers for the instance. The daemon calls create_slivers for any instances in the deferred table until all deferred aggregates are resolved (see the retry sketch below). On the UI side, there are various changes to deal with allowing experiments to be partially created. For example, we used to wait until we had all the manifests before showing the topology; now we show the topo on the first manifest and add the rest as they come in. Various parts of the UI had to change to deal with missing aggregates; I am sure I did not get them all.
* And then once I had that, I realized that "scheduled" experiments were an "easy" addition; it is just a degenerate case of deferred. For this I added some new slots to the tables to hold the scheduled start time, and added a started stamp so we can distinguish between the time an experiment was created and the time it was actually started. Lots of data. On the UI side, there is a new fourth step on the instantiate page to give the user a choice of immediate or scheduled start. I moved the experiment duration to this step. I was originally going to add a calendar choice for termination, but I did not want to change the existing 16 hour max duration policy, yet.
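A sketch of that retry loop, under stated assumptions: the instance column name in apt_deferred_aggregates and the create_slivers calling convention are guesses for illustration; the real daemon and scripts are Perl.

```python
import time

def retry_deferred(db, create_slivers, interval=300):
    """Periodically call create_slivers for every instance that still has
    rows in apt_deferred_aggregates (column name here is an assumption);
    create_slivers is expected to clear the rows for aggregates it starts."""
    while True:
        rows = db.execute(
            "SELECT DISTINCT instance_uuid FROM apt_deferred_aggregates"
        ).fetchall()
        for (uuid,) in rows:
            create_slivers(uuid)
        time.sleep(interval)
```
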
- 07 Aug, 2018 2 commits
Leigh B Stoller authored

Leigh B Stoller authored

- 30 Jul, 2018 2 commits
Leigh B Stoller authored
case we have to ask the clearinghouse.

Leigh B Stoller authored
so that the SCS will accept it. Oh, and I learned that the SCS is sensitive to the order of elements in the XML file! The link *must* be after the nodes, or the SCS refuses to do anything with it. Sheesh.
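For illustration, one defensive way to enforce that ordering before handing the request to the SCS; this sketch uses simplified tag matching and is not the portal's actual code.

```python
import xml.etree.ElementTree as ET

def links_after_nodes(rspec_xml):
    # Reorder the top-level children so every <link> follows all <node>s,
    # since the SCS refuses a request whose links come first.
    root = ET.fromstring(rspec_xml)
    children = list(root)
    nodes  = [e for e in children if e.tag.endswith("node")]
    links  = [e for e in children if e.tag.endswith("link")]
    others = [e for e in children if e not in nodes and e not in links]
    for e in children:
        root.remove(e)
    for e in others + nodes + links:
        root.append(e)
    return ET.tostring(root, encoding="unicode")
```
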
- 16 Jul, 2018 2 commits
Leigh B Stoller authored
1. The primary change is to the Create Image modal; we now allow users to optionally specify a description for the image. This needed to be plumbed all the way through to the GeniCM CreateImage() API. Since the modal is getting kinda overloaded, I rearranged things a bit and changed the argument checking and error handling. I think this is the limit of what we want to do on this modal; we need a better UI in the future.
2. Of course, if we let users set descriptions, let's show them on the image listing page. While I was there, I made the list look more like the classic image list: show the image name and project, and put the URN in a tooltip, since in general the URN is noisy to look at.
3. And while I was messing with the image listing, I noticed that we were not deleting profiles like we said we would. The problem is that when we form the image list, we know the profile versions that can be deleted, but when the user actually clicks to delete, I was trying to regenerate that decision without asking the cluster for the info again. So instead, just pass the version list through from the web UI.

Leigh B Stoller authored

- 09 Jul, 2018 1 commit
Leigh B Stoller authored
* Add the portal URL to the existing emulab extension that tells the CM the CreateSliver() is coming from the Portal. Always send this info, not just for the Emulab Portal.
* Stash that info in the geni slice data structure so we can add links back to the portal status page for current slices.
* Add routines to generate a portal URL for the history entries, since we will not have those links for historical slices (see the sketch below). Add links back to the portal on the showslice and slice history pages.
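A minimal sketch of generating such a link; the base URL, page name, and query parameter are hypothetical stand-ins, not the portal's actual routes.

```python
from urllib.parse import urlencode

PORTAL_BASE = "https://portal.example.org"      # hypothetical base URL

def portal_status_url(slice_uuid):
    # Link back to the portal status page for a slice, usable both for
    # current slices and for history entries written after the fact.
    return "%s/status.php?%s" % (PORTAL_BASE, urlencode({"uuid": slice_uuid}))
```
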