Commit 56f6d601 authored by Leigh Stoller

A lot of work on the RPC code, among other things.

I spent a fair amount of time improving error handling along the RPC path,
as well as making the code more consistent across the various files. Also
be more consistent in how the web interface invokes the backend and gets
errors back, specifically for errors that are generated when talking to a
remote cluster.

Add checks before every RPC to make sure the cluster is not disabled in
the database. Also check that we can actually reach the cluster, and
that the cluster is not offline (NoLogins()) before we try to do
anything. I might have to relax this a bit, but in general it takes a
couple of seconds to check, which is a small fraction of what most RPCs
take. Return precise errors for clusters that are not available to the
web interface, and show them to the user.
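
The three pre-RPC checks described above can be sketched roughly as follows. This is only an illustration, not the code in this commit: check_cluster, the hash fields disabled, reachable and nologins, and the CLUSTER_* constants are all made-up names, and the real code consults the database and the remote cluster rather than a hash. The sketch borrows the calling convention of the CheckStatus(\$errmsg) calls in the diff (non-zero return, message passed back through a reference).

```perl
use strict;
use warnings;

# Hypothetical status codes, for illustration only.
use constant {
    CLUSTER_OK          => 0,
    CLUSTER_DISABLED    => 1,
    CLUSTER_UNREACHABLE => 2,
    CLUSTER_OFFLINE     => 3,
};

# $cluster is a plain hash standing in for the database record plus the
# result of a live probe; all field names are assumptions.
# Returns 0 if the RPC may proceed, otherwise a non-zero code with a
# user-presentable message stored through $perrmsg.
sub check_cluster
{
    my ($cluster, $perrmsg) = @_;
    my $name = $cluster->{name};

    if ($cluster->{disabled}) {
        $$perrmsg = "$name is disabled";
        return CLUSTER_DISABLED;
    }
    if (!$cluster->{reachable}) {
        $$perrmsg = "$name is not reachable";
        return CLUSTER_UNREACHABLE;
    }
    if ($cluster->{nologins}) {
        $$perrmsg = "$name is temporarily offline";
        return CLUSTER_OFFLINE;
    }
    return CLUSTER_OK;
}

# Usage: bail out with a precise message before attempting the RPC.
my $errmsg;
my $cluster = { name => "cluster1", disabled => 0,
                reachable => 1, nologins => 1 };
if (check_cluster($cluster, \$errmsg)) {
    print STDERR "$errmsg\n";
}
```

The checks run cheapest first, so a cluster that is disabled in the database is rejected without any network traffic.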

Use webtasks more consistently between the web interface and backend
scripts. Watch specifically for scripts that exit abnormally (exit
before setting the exitcode in the webtask), which always means an
internal failure; do not show those to users.
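
A minimal sketch of the abnormal-exit rule, assuming the webtask can be modeled as a hash with optional exitcode and output fields. The function name classify_webtask and the field names are hypothetical; the real webtask object is not shown in this commit page.

```perl
use strict;
use warnings;

# Decide what the web interface should do with a finished backend script.
# $task stands in for a webtask record (field names are assumptions).
# Returns a (kind, message) pair where kind is "ok", "user" or "internal".
sub classify_webtask
{
    my ($task) = @_;

    # No exit code recorded: the script exited before reporting, which
    # is treated as an internal failure and hidden from the user.
    return ("internal", "Internal error")
        if (!defined($task->{exitcode}));

    return ("ok", "")
        if ($task->{exitcode} == 0);

    # The script exited deliberately, so its message is meant to be seen.
    my $mesg = defined($task->{output}) ? $task->{output}
                                        : "Operation failed";
    return ("user", $mesg);
}

# Usage: a task that never recorded an exit code.
my ($kind, $mesg) = classify_webtask({ output => "stack trace ..." });
print "$kind: $mesg\n";
```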

Show just those RPC errors that would make sense to users. Stop spewing
script output to the user; send it just to tbops via the email that is
already generated when a backend script fails fatally.

But do not spew email for clusters that are not reachable or are
offline. Ditto for several other cases that were generating mail to
tbops instead of just showing the user a meaningful error message.

Stop using ParRun for single site experiments, which are 99% of experiments.

For create_instance, add a new "async" mode that tells CreateSliver() to
return before the first mapper run, which typically happens very quickly.
Then watch for errors or for the manifest with Resolve or for the slice
to disappear. I expect this to be bounded and so we do not need to worry
so much about timing this wait out (which is a problem on very big
topologies). When we see the manifest, the RedeemTicket() part of the
CreateSliver is done and now we are into the StartSliver() phase.
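
The wait described above can be sketched as a simple polling loop. This is an illustration under assumptions, not the commit's code: wait_for_manifest, the $poll callback (standing in for a Resolve call), and the state fields error, gone and manifest are all hypothetical. The $max_polls cap is a safety net added for the sketch; the message above expects the wait to be bounded anyway.

```perl
use strict;
use warnings;

# Poll until the manifest shows up, an error is reported, or the slice
# disappears. $poll is a code reference returning a hash reference that
# describes the current state (field names are assumptions).
# Returns a (what, detail) pair.
sub wait_for_manifest
{
    my ($poll, $max_polls, $interval) = @_;

    for (my $i = 0; $i < $max_polls; $i++) {
        my $state = $poll->();

        return ("error", $state->{error})
            if (defined($state->{error}));
        return ("gone", "slice disappeared")
            if ($state->{gone});
        # Manifest present: the RedeemTicket() part is done.
        return ("manifest", $state->{manifest})
            if (defined($state->{manifest}));

        sleep($interval)
            if ($interval > 0);
    }
    return ("timeout", "gave up waiting for the manifest");
}

# Usage with canned states instead of a real cluster.
my @states = ({}, {}, { manifest => "<rspec/>" });
my ($what, $detail) =
    wait_for_manifest(sub { return shift(@states) || {}; }, 10, 0);
print "$what\n";
```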

For the StartSliver phase, watch for errors and show them to users.
Previously we mostly lost those errors and just sent the experiment into
the failed state. I am still working on this.
parent 6b6fdafa
#!/usr/bin/perl -wT
#
# Copyright (c) 2007-2017 University of Utah and the Flux Group.
# Copyright (c) 2007-2018 University of Utah and the Flux Group.
#
# {{{EMULAB-LICENSE
#
@@ -574,6 +574,18 @@ sub GetCertificate($)
return $cert;
}
# Helper functions for below.
sub ContextError()
{
return GeniResponse->Create(GENIRESPONSE_ERROR(), undef,
"Could not generate context for RPC");
}
sub CredentialError()
{
return GeniResponse->Create(GENIRESPONSE_ERROR(), undef,
"Could not generate credentials for RPC");
}
#
# Create a dataset on the remote aggregate.
#
@@ -584,13 +596,14 @@ sub CreateDataset($)
my $geniuser = $self->GetGeniUser();
my $context = APT_Geni::GeniContext();
my $cert = $self->GetCertificate();
return undef
return ContextError()
if (! (defined($geniuser) && defined($authority) &&
defined($context) && defined($cert)));
my ($credential, $speaksfor_credential) =
APT_Geni::GenCredentials($cert, $geniuser, ["blockstores"]);
return undef
return CredentialError
if (! (defined($speaksfor_credential) &&
defined($credential)));
@@ -624,13 +637,13 @@ sub DeleteDataset($)
my $geniuser = $self->GetGeniUser();
my $context = APT_Geni::GeniContext();
my $cert = $self->GetCertificate();
return undef
return ContextError()
if (! (defined($geniuser) && defined($authority) &&
defined($context) && defined($cert)));
my ($credential, $speaksfor_credential) =
APT_Geni::GenCredentials($cert, $geniuser, ["blockstores"], 1);
return undef
return CredentialError()
if (!defined($credential));
my $credentials = [$credential->asString()];
@@ -657,13 +670,13 @@ sub ModifyDataset($)
my $geniuser = $self->GetGeniUser();
my $context = APT_Geni::GeniContext();
my $cert = $self->GetCertificate();
return undef
return ContextError()
if (! (defined($geniuser) && defined($authority) &&
defined($context) && defined($cert)));
my ($credential, $speaksfor_credential) =
APT_Geni::GenCredentials($cert, $geniuser, ["blockstores"], 1);
return undef
return CredentialError()
if (!defined($credential));
my $credentials = [$credential->asString()];
@@ -692,13 +705,13 @@ sub ExtendDataset($)
my $geniuser = $self->GetGeniUser();
my $context = APT_Geni::GeniContext();
my $cert = $self->GetCertificate();
return undef
return ContextError()
if (! (defined($geniuser) && defined($authority) &&
defined($context) && defined($cert)));
my ($credential, $speaksfor_credential) =
APT_Geni::GenCredentials($cert, $geniuser, ["blockstores"], 1);
return undef
return CredentialError()
if (!defined($credential));
my $credentials = [$credential->asString()];
@@ -726,13 +739,13 @@ sub DescribeDataset($)
my $geniuser = $self->GetGeniUser();
my $context = APT_Geni::GeniContext();
my $cert = $self->GetCertificate();
return undef
return ContextError()
if (! (defined($geniuser) && defined($authority) &&
defined($context) && defined($cert)));
my ($credential, $speaksfor_credential) =
APT_Geni::GenCredentials($cert, $geniuser, ["blockstores"], 1);
return undef
return CredentialError()
if (!defined($credential));
my $credentials = [$credential->asString()];
@@ -759,13 +772,13 @@ sub GetCredential($)
my $geniuser = $self->GetGeniUser();
my $context = APT_Geni::GeniContext();
my $cert = $self->GetCertificate();
return undef
return ContextError()
if (! (defined($geniuser) && defined($authority) &&
defined($context) && defined($cert)));
my ($credential) =
APT_Geni::GenAuthCredential($cert, ["blockstores"]);
return undef
return CredentialError()
if (!defined($credential));
my $args = {
@@ -789,13 +802,13 @@ sub ApproveDataset($)
my $geniuser = $self->GetGeniUser();
my $context = APT_Geni::GeniContext();
my $cert = $self->GetCertificate();
return undef
return ContextError()
if (! (defined($geniuser) && defined($authority) &&
defined($context) && defined($cert)));
my ($credential) =
APT_Geni::GenAuthCredential($cert, ["admin"]);
return undef
return CredentialError()
if (!defined($credential));
my $args = {
@@ -81,12 +81,14 @@ use GeniXML;
use GeniUser;
use APT_Geni;
use APT_Profile;
use APT_Aggregate;
# Protos
sub fatal($);
sub UserError($);
sub DoListImages();
sub DoDeleteImage();
sub ExitWithError($);
#
# Parse command arguments. Once we return from getopts, all that should be
@@ -162,9 +164,17 @@ sub DoListImages()
# Shorten default timeout.
Genixmlrpc->SetTimeout(90);
# Lets do a cluster check to make sure its reachable.
my $aggregate = APT_Aggregate->Lookup($aggregate_urn);
if (!defined($aggregate)) {
fatal("No such aggregate");
}
if ($aggregate->CheckStatus(\$errmsg)) {
UserError($errmsg);
}
my $authority = GeniAuthority->Lookup($aggregate_urn);
if (!defined($authority)) {
fatal("No such aggregate");
fatal("No authority for aggregate");
}
my $cmurl = $authority->url();
if ($usemydevtree) {
@@ -184,23 +194,8 @@ sub DoListImages()
my $response = Genixmlrpc::CallMethod($cmurl, undef, "ListImages", $args);
if ($response->code() != GENIRESPONSE_SUCCESS) {
if ($response->output()) {
print STDERR $response->output() . "\n";
if (defined($webtask)) {
$webtask->output($response->output());
}
}
else {
print STDERR "Operation failed, returned " .
$response->code() . "\n";
if (defined($webtask)) {
$webtask->output("Operation failed");
}
}
if (defined($webtask)) {
$webtask->Exited($response->code());
}
exit($response->code());
print STDERR $response->error() . "\n";
ExitWithError($response);
}
#
# We get back a flat list, which can include mulitple versions of
@@ -526,9 +521,17 @@ sub DoDeleteImage()
# Shorten default timeout.
Genixmlrpc->SetTimeout(90);
# Lets do a cluster check to make sure its reachable.
my $aggregate = APT_Aggregate->Lookup($aggregate_urn);
if (!defined($aggregate)) {
fatal("No such aggregate");
}
if ($aggregate->CheckStatus(\$errmsg)) {
UserError($errmsg);
}
my $authority = GeniAuthority->Lookup($aggregate_urn);
if (!defined($authority)) {
fatal("No such aggregate");
fatal("No authority for aggregate");
}
my $cmurl = $authority->url();
if ($usemydevtree) {
@@ -610,3 +613,33 @@ sub escapeshellarg($)
return $str;
}
#
# These are errors which the user might need to see. Some errors are
# exceptions though, and those we want to treat as internal errors.
#
sub ExitWithError($)
{
my ($response) = @_;
my $mesg = $response->error();
my $code = $response->code();
#
# In general, these errors are to be expected by the caller.
#
if ($code == GENIRESPONSE_REFUSED ||
$code == GENIRESPONSE_SEARCHFAILED ||
$code == GENIRESPONSE_SERVER_UNAVAILABLE ||
$code == GENIRESPONSE_NETWORK_ERROR ||
$code == GENIRESPONSE_BUSY) {
if (defined($webtask)) {
$webtask->output($mesg);
$webtask->Exited($code);
}
print STDERR "*** $0:\n".
" $mesg\n";
exit(1);
}
fatal($mesg);
}
#!/usr/bin/perl -w
#
# Copyright (c) 2008-2017 University of Utah and the Flux Group.
# Copyright (c) 2008-2018 University of Utah and the Flux Group.
#
# {{{GENIPUBLIC-LICENSE
#
@@ -251,7 +251,8 @@ sub Start($$)
# Check for NoLogins; return XMLRPC
#
if (NoLogins()) {
return XMLError(503, "CM temporarily offline; please try again later");
return XMLError(HTTP_SERVICE_UNAVAILABLE(),
"CM temporarily offline; please try again later");
}
if (0) {
# For timing.
@@ -367,10 +368,10 @@ sub Start($$)
# Check Emulab users table.
#
my $user = User->Lookup($hrn->id());
return XMLError(XMLRPC_APPLICATION_ERROR(),
return XMLError(GENIRESPONSE_REFUSED(),
"Not a valid local user. Who are you really?")
if (!defined($user));
return XMLError(XMLRPC_APPLICATION_ERROR(),
return XMLError(GENIRESPONSE_FORBIDDEN(),
"Your account is no longer active!")
if ($user->status() ne USERSTATUS_ACTIVE());
}
@@ -379,7 +380,7 @@ sub Start($$)
# So we know who/what we are acting as, in case we have to make a
# callout RPC. The "context" is a global variable.
#
return XMLError(XMLRPC_SERVER_ERROR(),
return XMLError(XMLRPC_SYSTEM_ERROR(),
"There is no certificate for this server")
if (!exists($MODULEDEFS->{"CERTIFICATE"}));
@@ -421,7 +422,7 @@ sub Start($$)
return XMLError(XMLRPC_PARSE_ERROR(), "error decoding RPC:\n" . "$@");
}
if ($call->{'type'} ne 'call') {
return XMLError(XMLRPC_APPLICATION_ERROR(),
return XMLError(XMLRPC_PARSE_ERROR(),
"expected RPC methodCall, got $call->{'type'}");
}
my $method = $call->{'method_name'};
@@ -548,7 +549,7 @@ sub Start($$)
#
$rpcerror = $iserror = 1;
print STDERR "Error executing RPC method $method:\n" . $@ . "\n\n";
$response = $decoder->encode_fault(XMLRPC_SERVER_ERROR(),
$response = $decoder->encode_fault(XMLRPC_SYSTEM_ERROR(),
"Internal Error executing $method");
print STDERR "-------------- Request -----------------\n";
@@ -716,6 +717,11 @@ sub CreateLogFile($)
{
my ($fname) = @_;
# So tbops people can read the files ...
if (!chmod(0664, $fname)) {
print STDERR "Could not chmod $fname\n";
return undef;
}
my $group = Group->Lookup($GENIGROUP, $GENIGROUP);
if (!defined($group)) {
print STDERR "Could not lookup group $GENIGROUP";
#!/usr/bin/perl -w
#
# Copyright (c) 2008-2016, 2018 University of Utah and the Flux Group.
# Copyright (c) 2008-2018 University of Utah and the Flux Group.
#
# {{{GENIPUBLIC-LICENSE
#
@@ -152,9 +152,10 @@ sub GetContext($)
sub SetTimeout($$)
{
my ($class, $to) = @_;
my $old = $timeout;
$timeout = $to;
return 0;
return $old;
}
BEGIN {
@@ -215,9 +216,14 @@ sub CallMethodInternal($$$$@)
# But must have a context;
if (!defined($context)) {
print STDERR "Must provide an rpc context\n";
return GeniResponse->new(GENIRESPONSE_RPCERROR, -1,
return GeniResponse->new(GENIRESPONSE_RPCERROR, XMLRPC_SYSTEM_ERROR,
"Must provide an rpc context");
}
# Testing mode.
if (0) {
return GeniResponse->new(GENIRESPONSE_RPCERROR, XMLRPC_SYSTEM_ERROR,
"Testing mode!");
}
# Callback to write the data, when streaming to a file.
my $callback = sub {
@@ -242,7 +248,7 @@ sub CallMethodInternal($$$$@)
else {
print STDERR
"Could not determine what version of FreeBSD you are running!\n";
return GeniResponse->new(GENIRESPONSE_RPCERROR, -1,
return GeniResponse->new(GENIRESPONSE_RPCERROR, XMLRPC_SYSTEM_ERROR,
"Could not determine what version of FreeBSD you are running!");
}
@@ -338,14 +344,68 @@ sub CallMethodInternal($$$$@)
delete($ENV{'HTTPS_PKCS12_FILE'});
delete($ENV{'HTTPS_PKCS12_PASSWORD'});
if ($debug > 1 || ($debug && !$hresp->is_success())) {
if ($debug > 1) {
print STDERR "xml response: " . $hresp->as_string();
print STDERR "\n";
print STDERR "------------------\n";
}
if (!$hresp->is_success()) {
return GeniResponse->new(GENIRESPONSE_RPCERROR,
$hresp->code(), $hresp->message());
my $code = $hresp->code();
my $message = $hresp->message();
if ($debug > 1) {
print STDERR "RPC Failure $code, $message\n";
print STDERR "------------------\n";
}
if ($code == HTTP_INTERNAL_SERVER_ERROR()) {
#
# We get here for what seems to be for one of three reasons:
#
# 1. Unable to reach the server. We do not know why, we just
# cannot connect.
# 2. The connection times out. We do not know where it timed
# out but typically it is because the server is taking too
# long to answer. Note that the connection has probably been
# successful and the server is working away. But we do not
# know that for sure.
# 3. A total server error, either in apache or in the backend
# scripts that are invoked.
#
# Sadly, we have to look at the string to know, which makes all
# this pretty damn fragile.
#
# The first two errors are not something we can do much to
# fix, but in general the user does not care, he just needs to
# know the request cannot be completed cause of a network
# error. So turn that into an error that the caller knows to
# pass through without generating (tons of) email.
#
if ($message =~ /read timeout/i) {
return GeniResponse->new(GENIRESPONSE_NETWORK_ERROR,
GENIRESPONSE_NETWORK_ERROR_TIMEDOUT,
"Timed out talking to server");
}
if ($message =~ /Can\'t connect to/i ||
# In case this changes to proper english
$message =~ /Cannot connect to/i) {
return GeniResponse->new(GENIRESPONSE_NETWORK_ERROR,
GENIRESPONSE_NETWORK_ERROR_NOCONNECT,
"Cannot connect to server");
}
#
# The third one is bad, we want to make sure we whine about
# it, but do not send a bunch of gibberish to the user.
#
if ($message =~ /Internal Server Error/i) {
return GeniResponse->new(GENIRESPONSE_SERVERERROR,
$code, $message);
}
}
#
# Otherwise bad news, we want to whine.
#
return GeniResponse->new(GENIRESPONSE_RPCERROR, $code, $message);
}
# Streamed the data okay, we are done.
if (defined($fp) && !defined($xmlgoo)) {
@@ -376,18 +436,26 @@ sub CallMethodInternal($$$$@)
if (!ref($goo)) {
print STDERR "Error in XMLRPC parse: $xmlgoo\n";
$code = GENIRESPONSE_RPCERROR();
$value = undef;
$value = XMLRPC_SYSTEM_ERROR;
$output = "Could not parse XMLRPC return value: $xmlgoo";
}
elsif ($goo->value()->is_fault()
|| (ref($goo->value()) && UNIVERSAL::isa($goo->value(),"HASH")
&& exists($goo->value()->{'faultCode'}))) {
$code = $goo->value()->{"faultCode"}->value;
$code = GENIRESPONSE_RPCERROR();
$value = $goo->value()->{"faultCode"}->value;
$output = $goo->value()->{"faultString"}->value;
# EXO returns a bad fault structure.
if (!$code) {
$code = $value = GENIRESPONSE_ERROR();
$value = GENIRESPONSE_ERROR();
}
#
# Negative values are XMLRPC errors, these are bad and we want
# to whine. Positive are different, look to see if they are one
# of the ones we expect our servers to generate and convert.
#
if ($value == HTTP_SERVICE_UNAVAILABLE()) {
$code = GENIRESPONSE_SERVER_UNAVAILABLE();
}
}
elsif (! (ref($goo->value()) && UNIVERSAL::isa($goo->value(),"HASH")
@@ -413,18 +481,24 @@ sub CallMethodInternal($$$$@)
}
else {
$value = $goo->value()->{'value'}->value;
$logurl = $goo->value()->{'protogeni_error_url'}->value
if (exists($goo->value()->{'protogeni_error_url'}));
}
$output = $goo->value()->{'output'}->value
if (exists($goo->value()->{'output'}));
$logurl = $goo->value()->{'protogeni_error_url'}->value
if (exists($goo->value()->{'protogeni_error_url'}));
}
#
# For consistency, make sure there is a subcode for RPCERROR.
#
if ($code == GENIRESPONSE_RPCERROR && !defined($value)) {
$value = XMLRPC_SYSTEM_ERROR;
}
if ($debug > 1 && $code) {
print STDERR "CallMethod: $method failed: $code";
print STDERR ", $output\n" if (defined($output) && $output ne "");
}
if ($code == GENIRESPONSE_SERVER_UNAVAILABLE) {
$code = GENIRESPONSE_RPCERROR;
if ($debug > 1 && $code == GENIRESPONSE_RPCERROR) {
print STDERR "RPC Failure $value, $output\n";
}
return GeniResponse->new($code, $value, $output, $logurl);
@@ -81,7 +81,7 @@ BEGIN {
if ($1 > @PROTOGENI_MAXSERVERLOAD@) {
sleep(5);
my $decoder = Frontier::RPC2->new();
my $string = $decoder->encode_fault(503,
my $string = $decoder->encode_fault(HTTP_SERVICE_UNAVAILABLE(),
"Server is WAY too busy; please try again later");
print "Content-Type: text/xml \n\n";
print $string;
@@ -191,14 +191,16 @@ sub XMLError($$)
# Check for NoLogins; return XMLRPC
#
if (NoLogins()) {
XMLError(503, "CM temporarily offline; please try again later");
XMLError(HTTP_SERVICE_UNAVAILABLE(),
"Server temporarily offline; please try again later");
}
#
# Sanity check.
#
if ($EUID != 0) {
XMLError(503, "Server configuration error; please try again later");
XMLError(XMLRPC_SYSTEM_ERROR,
"Server configuration error; please try again later");
}
#
@@ -294,13 +296,13 @@ XMLError(XMLRPC_APPLICATION_ERROR(),
my ($authority, $type, $id) = GeniHRN::Parse($GENIURN);
if ($type eq "user" && GeniHRN::Authoritative($GENIURN, "@OURDOMAIN@")) {
#
# Check Emulab users table.
# Check Emulab users table. These are permission errors, not XML errors.
#
my $user = User->Lookup($id);
XMLError(XMLRPC_APPLICATION_ERROR(),
XMLError(GENIRESPONSE_REFUSED(),
"Not a valid local user. Who are you really?")
if (!defined($user));
XMLError(XMLRPC_APPLICATION_ERROR(),
XMLError(GENIRESPONSE_FORBIDDEN(),
"Your account is no longer active!")
if ($user->status() ne "active");
}
@@ -324,12 +326,12 @@ my $return = do $file;
if (!defined($return)) {
SENDMAIL($TBOPS, "Error loading module",
($@ ? $@ : ($! ? $! : Dumper(%ENV))));
XMLError(XMLRPC_APPLICATION_ERROR(), "Internal error loading module");
XMLError(XMLRPC_SYSTEM_ERROR(), "Internal error loading module");
}
if (!(defined($GENI_METHODS) && defined($EMULAB_PEMFILE))) {
SENDMAIL($TBOPS, "Error loading module $MODULE",
"No definition for GENI_METHODS or EMULAB_PEMFILE");
XMLError(XMLRPC_APPLICATION_ERROR(),
XMLError(XMLRPC_SYSTEM_ERROR(),
"Internal error loading module; missing definitions");
}
@@ -531,7 +533,8 @@ $SIG{__WARN__} = sub {
$SIG{__DIE__} = sub {
my $message = shift;
if ($warned) {
die($message);
print STDERR $message;
exit(-1);
}
else {
confess($message);
@@ -570,7 +573,7 @@ if ($@) {
XMLError(XMLRPC_PARSE_ERROR(), "error decoding RPC:\n" . "$@");
}
if ($call->{'type'} ne 'call') {
XMLError(XMLRPC_APPLICATION_ERROR(),
XMLError(XMLRPC_PARSE_ERROR(),
"expected RPC methodCall, got $call->{'type'}");
}
my $method = $call->{'method_name'};
@@ -710,9 +713,10 @@ if ($@) {
#
$rpcerror = $iserror = 1;
print STDERR "Error executing RPC method $method:\n" . $@ . "\n\n";
$response = $decoder->encode_fault(XMLRPC_SERVER_ERROR(),
"Internal Error executing $method");
$response = $decoder->encode_fault(XMLRPC_SYSTEM_ERROR(),
"Internal Error executing $method:\n" . $@ . "\n");
push(@metadata, ["LogURL", $logurl]);
$logfile->SetMetadata(\@metadata, 0);
foreach my $foo (@metadata) {
my ($key,$val) = @{$foo};