- 16 Feb, 2018 13 commits
-
-
Leigh B Stoller authored
-
Leigh B Stoller authored
-
Leigh B Stoller authored
-
Leigh B Stoller authored
I spent a fair amount of time improving error handling along the RPC path, as well as making the code more consistent across the various files. Also be more consistent in how the web interface invokes the backend and gets errors back, specifically for errors that are generated when talking to a remote cluster.
Add checks before every RPC to make sure the cluster is not disabled in the database. Also check that we can actually reach the cluster, and that the cluster is not offline (NoLogins()) before we try to do anything. I might have to relax this a bit, but in general it takes a couple of seconds to check, which is a small fraction of what most RPCs take. Return precise errors for clusters that are not available to the web interface, and show them to the user.
Use webtasks more consistently between the web interface and backend scripts. Watch specifically for scripts that exit abnormally (exit before setting the exitcode in the webtask), which always means an internal failure; do not show those to users.
Show just those RPC errors that would make sense to users, and stop spewing script output to the user; send it just to tbops via the email that is already generated when a backend script fails fatally. But do not spew email for clusters that are not reachable or are offline. Ditto for several other cases that were generating mail to tbops instead of just showing the user a meaningful error message.
Stop using ParRun for single-site experiments, which are 99% of experiments.
For create_instance, add a new "async" mode that tells CreateSliver() to return before the first mapper run, so it typically returns very quickly. Then watch for errors, or for the manifest with Resolve, or for the slice to disappear. I expect this to be bounded, so we do not need to worry so much about timing this wait out (which is a problem on very big topologies). When we see the manifest, the RedeemTicket() part of the CreateSliver is done and we are into the StartSliver() phase.
For the StartSliver phase, watch for errors and show them to users; previously we mostly lost those errors and just sent the experiment into the failed state. I am still working on this.
-
Leigh B Stoller authored
-
Leigh B Stoller authored
-
Leigh B Stoller authored
can we connect to it (using GetVersion()).
-
Leigh B Stoller authored
-
Leigh B Stoller authored
readonly to make it more clear that the input is ignored (the box was already cleared).
-
Leigh B Stoller authored
-
Leigh B Stoller authored
-
Leigh B Stoller authored
-
Leigh B Stoller authored
the listing without going to the edit page. Also hide the approve/deny icons on approved reservations.
-
- 15 Feb, 2018 1 commit
-
-
Mike Hibler authored
Because Interface.pm caches DB state, we might not pick up on this if someone just changes the DB directly (who would do such a thing?).
-
- 14 Feb, 2018 2 commits
-
-
Mike Hibler authored
Discovered when reloading/reinitializing the Apt shared node hosts.
-
Leigh B Stoller authored
-
- 12 Feb, 2018 2 commits
-
-
Mike Hibler authored
So we don't generate unnecessary entries in the dhcpd.conf file.
-
Leigh B Stoller authored
-
- 09 Feb, 2018 7 commits
-
-
David Johnson authored
-
David Johnson authored
-
David Johnson authored
-
Leigh B Stoller authored
-
Leigh B Stoller authored
-
Leigh B Stoller authored
going to leave that state, and it causes the Portal to paint them in yellow since they are not "ready". A better approach would be to move nodes out of updating_users after some amount of time has passed. Maybe on the next trip through this part of the forest.
-
Leigh B Stoller authored
images and do not send isalive. This stuff is pretty old and crufty! Remove a bunch of obsolete code while I was here.
-
- 08 Feb, 2018 4 commits
-
-
David Johnson authored
-
David Johnson authored
-
David Johnson authored
-
David Johnson authored
(Some containers are deprivileged and cannot even run setpriority. Traditional nice is helpful to linktest, because even if its setpriority call fails, it still runs the command. busybox nice just fails (which is arguably the right behavior, but it's not consistent). So now we do a quick trial of the platform nice to see what it can do, and only use it if it succeeds on /bin/true.)
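The trial described above, running the platform nice once on /bin/true and only using it if that succeeds, might look roughly like the following sketch. It is not the linktest code: the function names `nice_usable` and `wrap_with_nice` and the default niceness value are assumptions made for illustration.

```python
import subprocess
from typing import List


def nice_usable(niceness: int = 5, probe: str = "/bin/true") -> bool:
    """Return True only if `nice -n <niceness> <probe>` exits with status 0."""
    try:
        result = subprocess.run(
            ["nice", "-n", str(niceness), probe],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # nice itself is missing or cannot be executed on this platform.
        return False
    # A nice that refuses to run the command when it cannot change the
    # priority (the busybox behavior described above) exits non-zero here.
    return result.returncode == 0


def wrap_with_nice(cmd: List[str], niceness: int, usable: bool) -> List[str]:
    """Prefix cmd with nice only when the earlier trial succeeded."""
    if not usable:
        return list(cmd)
    return ["nice", "-n", str(niceness), *cmd]
```

A caller would run the trial once, for example `usable = nice_usable()`, and then pass that result to `wrap_with_nice` for every command, so a platform whose nice fails falls back to running the command unmodified.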
-
- 07 Feb, 2018 3 commits
-
-
David Johnson authored
-
David Johnson authored
-
Leigh B Stoller authored
the graphs), do not throw a fatal error. Geni users can get into this situation easily, but they can still look at the cluster status page.
-
- 06 Feb, 2018 4 commits
-
-
David Johnson authored
(clientside makefiles assume its presence)
-
David Johnson authored
-
David Johnson authored
-
Leigh B Stoller authored
in more detail in a later commit.
-
- 05 Feb, 2018 3 commits
-
-
Mike Hibler authored
-
Leigh B Stoller authored
We never use this stuff, we should kill it.
-
Leigh B Stoller authored
-
- 03 Feb, 2018 1 commit
-
-
David Johnson authored
-