    A lot of work on the RPC code, among other things. · 56f6d601
    Leigh B Stoller authored
    I spent a fair amount of time improving error handling along the RPC path,
    as well as making the code more consistent across the various files. Also
    be more consistent in how the web interface invokes the backend and gets
    errors back, specifically for errors that are generated when talking to a
    remote cluster.
    
    Add checks before every RPC to make sure the cluster is not disabled in
    the database. Also check that we can actually reach the cluster, and
    that the cluster is not offline (NoLogins()) before we try to do
    anything. I might have to relax this a bit, but in general it takes a
    couple of seconds to check, which is a small fraction of what most RPCs
    take. Return precise errors for clusters that are not available to the
    web interface, and show them to the user.
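
The three pre-RPC checks described above could be sketched roughly as follows. This is a minimal illustration in Python, not the actual backend code; the `Cluster` record, its field names, and the `check_cluster` and `call_rpc` helpers are all hypothetical, and only the order of the checks (disabled, reachable, offline) comes from the text.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Cluster:
    # Hypothetical record; the field names are illustrative, not the real schema.
    name: str
    disabled: bool                 # "disabled" flag as stored in the database
    ping: Callable[[], bool]       # returns True if the cluster answers at all
    no_logins: Callable[[], bool]  # returns True if the cluster reports itself offline


def check_cluster(cluster: Cluster) -> Optional[str]:
    """Return a precise user-facing error, or None if the RPC may proceed."""
    if cluster.disabled:
        return f"{cluster.name} is disabled"
    if not cluster.ping():
        return f"{cluster.name} is not reachable"
    if cluster.no_logins():
        return f"{cluster.name} is offline"
    return None


def call_rpc(cluster: Cluster, rpc: Callable[[], Any]) -> dict:
    """Run the checks first; only invoke the RPC when the cluster is usable."""
    error = check_cluster(cluster)
    if error is not None:
        return {"ok": False, "error": error}
    return {"ok": True, "value": rpc()}
```

The cheapest check (the database flag) runs first, so a disabled cluster is rejected without any network traffic.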
    
    Use webtasks more consistently between the web interface and backend
    scripts. Watch specifically for scripts that exit abnormally (exit
    before setting the exitcode in the webtask), which always means an
    internal failure; do not show those to users.
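
One way to picture the abnormal-exit rule is the sketch below. It assumes, purely for illustration, that a webtask can be read as a dictionary with an `exitcode` key that is missing until the script sets it and an `output` key carrying the user-facing error; the real webtask interface may differ, and `user_message` is a hypothetical helper.

```python
from typing import Optional

# Generic text shown instead of internal details (wording is illustrative).
INTERNAL_ERROR = "Internal error"


def user_message(webtask: dict) -> Optional[str]:
    """Decide what, if anything, the web interface should show the user."""
    exitcode = webtask.get("exitcode")
    if exitcode is None:
        # The script exited before recording an exit code: always an
        # internal failure, so never expose its details to the user.
        return INTERNAL_ERROR
    if exitcode == 0:
        return None  # success, nothing to report
    # The script failed deliberately and left an error meant for the user.
    return webtask.get("output") or INTERNAL_ERROR
```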
    
    Show just those RPC errors that would make sense to users. Stop spewing
    script output to the user; send it just to tbops via the email that is
    already generated when a backend script fails fatally.
    
    But do not spew email for clusters that are not reachable or are
    offline. Ditto for several other cases that were generating mail to
    tbops instead of just showing the user a meaningful error message.
    
    Stop using ParRun for single-site experiments, which are 99% of experiments.
    
    For create_instance, add a new "async" mode that tells CreateSliver() to
    return before the first mapper run, so it typically returns very quickly.
    Then watch for errors or for the manifest with Resolve or for the slice
    to disappear. I expect this to be bounded and so we do not need to worry
    so much about timing this wait out (which is a problem on very big
    topologies). When we see the manifest, the RedeemTicket() part of the
    CreateSliver is done and now we are into the StartSliver() phase.
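
The wait that follows an async CreateSliver() might look something like this sketch. It is an assumption-laden illustration, not the real create_instance logic: `wait_for_manifest`, the `resolve` and `slice_exists` callables, and the dictionary keys `error` and `manifest` are hypothetical stand-ins for the Resolve call and the slice lookup. The poll limit is optional because the text expects the wait to be bounded.

```python
import time
from typing import Callable, Optional, Tuple


def wait_for_manifest(
    resolve: Callable[[], dict],
    slice_exists: Callable[[], bool],
    interval: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, Optional[str]]:
    """Poll until an error, a manifest, or slice disappearance is seen."""
    polls = 0
    while max_polls is None or polls < max_polls:
        if not slice_exists():
            return ("gone", None)            # the slice disappeared
        result = resolve()
        if result.get("error"):
            return ("error", result["error"])
        if result.get("manifest"):
            # RedeemTicket() part is done; StartSliver() phase begins.
            return ("manifest", result["manifest"])
        polls += 1
        sleep(interval)
    return ("timeout", None)
```

Passing `sleep` as a parameter keeps the loop testable without real delays.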
    
    For the StartSliver phase, watch for errors and show them to users;
    previously we mostly lost those errors and just sent the experiment into
    the failed state. I am still working on this.
Name
account
apache
apt
assign
autoconf
autofs
backend
bugdb
cdrom
clientside
collab
daikon
db
delay
dhcpd
discvr
doc
event
firewall
flash
fwrules
hw_config
hyperviewer
image-test
install
ipod
mobile
mote
named
node_usage
ntpd
os
patches
pelab
protogeni
pxe
rc.d
robots
rpms
security
sensors
sql
ssl
sysadmin
tbsetup
testsuite
tip
tmcd
tools
utils
vis
wiki
www
xmlrpc
.gitattributes
.gitignore
.gitmodules
.loc-ignore
AGPL-COPYING
GNUmakefile.in
GNUmakerules
GPL-COPYING
LGPL-COPYING
MOVED-TO-WIKI
Makeconf.in
README
TODO
TODO.plab
VERSION
WEBtemplate.in
config.h.in
configure
configure.ac
defs-apt
defs-cloudlab-clemson
defs-cloudlab-utah
defs-cloudlab-wisc
defs-default
defs-duerig-emulab
defs-elabinelab
defs-example
defs-gtw-apt
defs-gtw-emulab
defs-johnsond-emulab
defs-kwebb-apt
defs-kwebb-cloudlab
defs-kwebb-emulab
defs-mike-emulab
defs-onelab
defs-ricci-emulab
defs-stoller-apt
defs-stoller-emulab
defs-stoller-home
defs-stoller-lbsdb
defs-uky
defs-utahclient
defs-wbsun-emulab
defs-wide
pnet-favicon.ico