- 07 Feb, 2019 1 commit
Leigh Stoller authored
- 28 Sep, 2018 1 commit
Leigh Stoller authored
- 08 Aug, 2018 1 commit
Leigh Stoller authored
* I started out to add just deferred aggregates: those that are offline when starting an experiment (and marked in the apt_aggregates table as being deferrable). When an aggregate is offline, we add an entry to the new apt_deferred_aggregates table and periodically retry to start the missing slivers. To accomplish this, I split create_instance into two scripts: the first creates the instance in the DB, and the second (create_slivers) creates the slivers for the instance. The daemon calls create_slivers for any instances in the deferred table until all deferred aggregates are resolved. On the UI side, there are various changes to allow experiments to be partially created. For example, we used to wait until we had all the manifests before showing the topology; now we show the topology as soon as the first manifest arrives and add the others as they come in. Various parts of the UI had to change to deal with missing aggregates, and I am sure I did not get them all.
* Once I had that, I realized that "scheduled" experiments were an "easy" addition; they are just a degenerate case of deferred. For this I added new slots to the tables to hold the scheduled start time, plus a started stamp so we can distinguish between the time an experiment was created and the time it was actually started. Lots of data. On the UI side, there is a new fourth step on the instantiate page that gives the user a choice of immediate or scheduled start, and I moved the experiment duration to this step. I was originally going to add a calendar choice for termination, but I did not want to change the existing 16 hour max duration policy yet.
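(For illustration only: a rough Python sketch of the daemon pass over the deferred table. The helper names, retry interval, and create_slivers invocation are assumptions; the actual Portal code is Perl.)

```python
# Sketch only: helper names, retry interval, and the create_slivers
# invocation are assumptions; the real daemon and scripts are Perl.
import subprocess
import time

RETRY_INTERVAL = 300  # seconds between passes (assumed value)

def retry_deferred(fetch_deferred, mark_resolved):
    """One pass over apt_deferred_aggregates-style rows."""
    for instance_uuid, aggregate_urn in fetch_deferred():
        # create_slivers is the second half of the old create_instance:
        # it creates slivers only for aggregates still missing them.
        rc = subprocess.call(["create_slivers", "-a", aggregate_urn, instance_uuid])
        if rc == 0:
            mark_resolved(instance_uuid, aggregate_urn)

def run(fetch_deferred, mark_resolved):
    while True:
        retry_deferred(fetch_deferred, mark_resolved)
        time.sleep(RETRY_INTERVAL)
```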
- 09 Jul, 2018 1 commit
Leigh Stoller authored
* Add the portal URL to the existing emulab extension that tells the CM that the CreateSliver() is coming from the Portal. Always send this info, not just for the Emulab Portal.
* Stash that info in the geni slice data structure so we can add links back to the portal status page for current slices.
* Add routines to generate a portal URL for the history entries, since we will not have those links for historical slices. Add links back to the portal on the showslice and slice history pages.
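(For illustration only: a hedged Python sketch of generating Portal links for current and historical slices. The URL layout and page names are assumptions; the actual routines are Perl.)

```python
# Sketch: URL layout and page names are assumptions, not the real Portal paths.
from urllib.parse import urlencode

def portal_status_url(portal_base, slice_uuid):
    """Link back to the Portal status page for a current slice."""
    return f"{portal_base}/status.php?{urlencode({'uuid': slice_uuid})}"

def portal_history_url(portal_base, slice_uuid):
    """Generated link for a historical slice that never stored one."""
    return f"{portal_base}/history.php?{urlencode({'uuid': slice_uuid})}"
```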
- 12 Jun, 2018 1 commit
Leigh Stoller authored
- 14 May, 2018 1 commit
Leigh Stoller authored
with no rspec, nor try to start an experiment with no rspec.
- 12 May, 2018 1 commit
Leigh Stoller authored
- 30 Apr, 2018 1 commit
Leigh Stoller authored
show the user, there was a path in which this was not happening.
- 26 Apr, 2018 1 commit
Leigh Stoller authored
- 14 Mar, 2018 1 commit
Leigh Stoller authored
something different in the Portal. Ditto when we fail on an empty testbed, although it appears we never get that anymore.
- 06 Mar, 2018 1 commit
Leigh Stoller authored
- 20 Feb, 2018 1 commit
Leigh Stoller authored
so that we can still kill the child and the parent can clean things up. 2) Bump the failsafe timeout to 7200.
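(For illustration only: a minimal Python sketch of the parent-kills-child-with-failsafe-timeout pattern this commit adjusts. The 7200 value comes from the message; everything else, including treating it as seconds, is an assumption. The real code is Perl.)

```python
# Sketch of running a child under a failsafe timeout so the parent can
# still kill it and clean up; the value and its units are assumptions.
import subprocess

FAILSAFE_TIMEOUT = 7200  # bumped value from the commit message

def run_with_failsafe(cmd, cleanup):
    child = subprocess.Popen(cmd)
    try:
        child.wait(timeout=FAILSAFE_TIMEOUT)
    except subprocess.TimeoutExpired:
        # Parent kills the child, then cleans things up itself.
        child.kill()
        child.wait()
        cleanup()
    return child.returncode
```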
- 16 Feb, 2018 2 commits
Leigh Stoller authored
Leigh Stoller authored
I spent a fair amount of time improving error handling along the RPC path, as well as making the code more consistent across the various files. Also be more consistent in how the web interface invokes the backend and gets errors back, specifically for errors that are generated when talking to a remote cluster.

Add checks before every RPC to make sure the cluster is not disabled in the database. Also check that we can actually reach the cluster, and that the cluster is not offline (NoLogins()), before we try to do anything. I might have to relax this a bit, but in general it takes a couple of seconds to check, which is a small fraction of what most RPCs take. Return precise errors to the web interface for clusters that are not available, and show them to the user.

Use webtasks more consistently between the web interface and backend scripts. Watch specifically for scripts that exit abnormally (exit before setting the exitcode in the webtask), which always means an internal failure; do not show those to users. Show just those RPC errors that would make sense to users; stop spewing script output to the user and send it just to tbops via the email that is already generated when a backend script fails fatally. But do not spew email for clusters that are not reachable or are offline. Ditto for several other cases that were generating mail to tbops instead of just showing the user a meaningful error message.

Stop using ParRun for single site experiments; that is 99% of experiments.

For create_instance, a new "async" mode tells CreateSliver() to return before the first mapper run, which typically happens very quickly. Then watch for errors, or for the manifest with Resolve, or for the slice to disappear. I expect this to be bounded, so we do not need to worry so much about timing this wait out (which is a problem on very big topologies). When we see the manifest, the RedeemTicket() part of the CreateSliver() is done and we are into the StartSliver() phase. For the StartSliver() phase, watch for errors and show them to users; previously we mostly lost those errors and just sent the experiment into the failed state. I am still working on this.
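(For illustration only: a rough Python sketch of the async create-then-poll flow described above. The rpc_createsliver/rpc_resolve names, the "asynchronous" flag, and the return shapes are stand-ins; the real path runs through Perl scripts and the GENI AM API.)

```python
# Sketch only: rpc_createsliver/rpc_resolve stand in for the real GENI
# AM calls made from Perl; names and return shapes are assumptions.
import time

POLL_INTERVAL = 15  # seconds between Resolve polls (assumed)

def create_and_wait(cm, slice_urn, rspec, credentials):
    # "async" mode: CreateSliver() returns before the first mapper run.
    cm.rpc_createsliver(slice_urn, rspec, credentials, asynchronous=True)

    # RedeemTicket phase: poll until the manifest shows up, an error is
    # reported, or the slice disappears entirely.
    while True:
        info = cm.rpc_resolve(slice_urn, credentials)
        if info is None:
            raise RuntimeError("slice disappeared at the cluster")
        if info.get("error"):
            raise RuntimeError(info["error"])   # show this to the user
        if info.get("manifest"):
            return info["manifest"]             # now in the StartSliver phase
        time.sleep(POLL_INTERVAL)
```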
- 22 Jan, 2018 2 commits
Leigh Stoller authored
CreateSliver() RPC is going to take a really long time and is going to time out. This would cause the experiment to fail at the Portal. Now we continue on to WaitForSlivers() and wait for the manifest to appear, which signals that the CreateSliver() has finally finished, and then we go into normal waitmode.
Leigh Stoller authored
- 04 Dec, 2017 1 commit
Leigh Stoller authored
* New tables to store policies for users and projects/groups. At the moment there is only one policy (with an associated reason): disabled. This allows us to mark projects/groups/users with enable/disable flags. Note that policies are applied consecutively, so you can disable extensions for a project but enable them for a user in that project.
* Apply extensions when experiments are created; send mail to the audit log when policies cause extensions to be disabled.
* New driver script (manage_extensions) to change the policy tables.
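(For illustration only: a small Python sketch of the consecutive policy application described above, where a later, more specific entry can re-enable what an earlier one disabled. The key layout and example names are assumptions; the real code is Perl.)

```python
# Sketch: consecutive policy application, most specific entry wins.
# Key layout and example names are assumptions; the real code is Perl.
def extensions_disabled(policies, project, group, user):
    """policies maps (kind, name) -> bool 'disabled' flag.

    Applied consecutively: project, then group, then user, so a user-level
    entry can re-enable extensions that a project-level entry disabled.
    """
    disabled = False
    for key in [("project", project), ("group", group), ("user", user)]:
        if key in policies:
            disabled = policies[key]
    return disabled

# Example: extensions disabled for the project, re-enabled for one user.
policies = {("project", "testproj"): True, ("user", "alice"): False}
assert extensions_disabled(policies, "testproj", "testproj", "alice") is False
assert extensions_disabled(policies, "testproj", "testproj", "bob") is True
```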
- 27 Nov, 2017 1 commit
Leigh Stoller authored
could let a guest user gain entry is now gone. I will clean up straggling guest code over time. This closes issue #352.
- 23 Oct, 2017 1 commit
Leigh Stoller authored
- 14 Aug, 2017 1 commit
Leigh Stoller authored
- 10 Jul, 2017 1 commit
Leigh Stoller authored
- 30 May, 2017 1 commit
Leigh Stoller authored
In the beginning, the number and size of experiments was small, so storing the entire slice/sliver status blob as JSON in the web task was fine, even though we had to lock tables to prevent races between the event updates and the local polling. But lately the size of those JSON blobs has gotten huge, and the lock is bogging things down; we cannot keep up with the number of events coming from all the clusters and get really far behind. So I have moved the status blobs out of the per-instance web task and into new tables, one per slice and one per node (sliver). This keeps the blobs very small and thus the lock time very short, so now we can keep up with the event stream. If this problem comes back as we grow, we can switch to InnoDB for the per-sliver table and do row locking instead of table locking, but I do not think that will happen.
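(For illustration only: a Python sketch contrasting the old single-blob update with the new per-sliver rows. The table and column names are assumptions; the real code is Perl against MySQL.)

```python
# Sketch of the change in update granularity; schema names are assumptions.
import json

def old_update(db, instance_uuid, status_blob):
    # Old way: rewrite one ever-growing JSON blob in the web task,
    # holding a table lock for the whole (large) write.
    db.query("UPDATE web_tasks SET task_data=%s WHERE object_uuid=%s",
             (json.dumps(status_blob), instance_uuid))

def new_update(db, slice_uuid, sliver_urn, status):
    # New way: one small row per sliver, so each event touches only a
    # tiny record and the lock is held very briefly.
    db.query("REPLACE INTO sliver_status (slice_uuid, sliver_urn, status) "
             "VALUES (%s, %s, %s)", (slice_uuid, sliver_urn, status))
```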
- 10 May, 2017 1 commit
Leigh Stoller authored
- 02 May, 2017 1 commit
Leigh Stoller authored
1. Okay, 10-15 seconds for me, which is the same as forever.
2. Do not sort in PHP; sort in javascript and let the client burn those cycles instead of the poor overworked boss.
3. Store the global lastused/usecount in the apt_profiles table so that we do not have to compute it every time for each profile.
4. Compute the user's lastused/usecount for each profile in a single query and build a local array. Cuts out hundreds of queries.
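(For illustration only: a Python sketch of item 4, replacing per-profile queries with one aggregate query. The table and column names are assumptions; the real page is PHP.)

```python
# Sketch of item 4: one aggregate query instead of one query per profile.
# Table/column names are assumptions.
def per_profile_stats(db, uid):
    """Return {profile_id: (lastused, usecount)} for one user in a single query."""
    rows = db.query(
        "SELECT profile_id, MAX(created) AS lastused, COUNT(*) AS usecount "
        "  FROM apt_instance_history WHERE creator=%s GROUP BY profile_id",
        (uid,))
    return {r["profile_id"]: (r["lastused"], r["usecount"]) for r in rows}
```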
- 24 Mar, 2017 1 commit
Leigh Stoller authored
- 10 Mar, 2017 1 commit
Leigh Stoller authored
Right now I am doing this as an extra operation; it could be rolled into the initial createsliver.
- 06 Mar, 2017 1 commit
Leigh Stoller authored
to May 1st.
- 15 Feb, 2017 1 commit
Leigh Stoller authored
behind that is locked and cannot be terminated.
- 09 Feb, 2017 2 commits
Leigh Stoller authored
Leigh Stoller authored
coded to 16 hours, need to fix that.
- 25 Jan, 2017 2 commits
Leigh Stoller authored
before instantiating.
Leigh Stoller authored
when they have leaked.
- 20 Jan, 2017 2 commits
Leigh Stoller authored
Leigh Stoller authored
- 29 Dec, 2016 1 commit
Leigh Stoller authored
- 28 Dec, 2016 1 commit
Leigh Stoller authored
remove are run on ops via the proxy script. Other smaller operations, like getting the source, logs, and commit info, are currently run on boss via NFS to /usr/testbed/opsdir/repositories, which holds one repository per profile. Lots of work still to do on optimizing repositories.
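(For illustration only: a Python sketch of the split described above, with heavy git operations proxied to ops and small read-only ones done on boss over NFS. The "proxy-git" command name and the exact arguments are assumptions; only the /usr/testbed/opsdir/repositories path comes from the message.)

```python
# Sketch of the boss/ops split; the proxy command name is an assumption.
import subprocess

REPO_DIR = "/usr/testbed/opsdir/repositories"  # one repository per profile

def heavy_op(profile, op, *args):
    # clone/update/remove go through the proxy script on ops.
    return subprocess.check_call(["ssh", "ops", "proxy-git", op, profile, *args])

def read_log(profile, n=10):
    # Smaller operations (source, logs, commit info) run on boss via NFS.
    return subprocess.check_output(
        ["git", "-C", f"{REPO_DIR}/{profile}", "log", f"-{n}", "--oneline"],
        text=True)
```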
- 29 Nov, 2016 1 commit
Leigh Stoller authored
1. Kill canceled instances; we allow users to "terminate" an instance while it is booting up, but we have to pend that until the lock is released. We do this with a canceled flag, similar to the Classic interface. But I never committed the apt_daemon changes that look for canceled instances and kill them!
2. Look for stale st/lt datasets and delete them. A stale dataset is one that no longer exists at the remote cluster (because its expiration was reached and it was reaped). We do not get a notification at the Portal, so those dangling dataset descriptors sit around confusing people (okay, confusing me and others of a similar vintage).
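(For illustration only: a Python sketch of the two apt_daemon passes described above. The object fields and helper names are assumptions; the real daemon is Perl.)

```python
# Sketch of the two daemon passes; fields and helpers are assumptions.
def kill_canceled(instances):
    for inst in instances:
        # The canceled flag is set while the instance is locked (booting);
        # the daemon reaps the instance once the lock has been released.
        if inst.canceled and not inst.locked:
            inst.terminate()

def reap_stale_datasets(datasets, cluster_has_dataset):
    for ds in datasets:
        # Stale: the remote cluster expired and reaped it, but the Portal
        # never got a notification, so the descriptor dangles locally.
        if not cluster_has_dataset(ds.cluster, ds.urn):
            ds.delete()
```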
- 06 Oct, 2016 1 commit
Leigh Stoller authored
OSinfo and Image into a single object for the benefit of the Perl code. The database tables have not changed, though.
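(For illustration only: a toy Python sketch of presenting two unchanged table rows behind one object. The real combined OSinfo/Image object is Perl, and the field handling here is an assumption.)

```python
# Toy sketch only: the real combined OSinfo/Image object is Perl, and the
# row layout is an assumption. The underlying tables are unchanged.
class OSImage:
    """Facade presenting the separate os_info and images rows as one object."""
    def __init__(self, osinfo_row, image_row):
        self.osinfo = osinfo_row   # row from the os_info table
        self.image = image_row     # row from the images table

    def field(self, name):
        # Look in either underlying row; the tables themselves did not change.
        if name in self.osinfo:
            return self.osinfo[name]
        if name in self.image:
            return self.image[name]
        raise KeyError(name)
```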
- 28 Sep, 2016 1 commit
Leigh Stoller authored
- 31 Aug, 2016 1 commit
Leigh Stoller authored