- 16 Jan, 2018 4 commits
-
-
Leigh B Stoller authored
socket buffer sizes, let's use that to reduce the send buffer size and bump up the receive size (since we get a LOT of events).
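A minimal sketch of that kind of buffer tuning, using the standard `SO_SNDBUF` and `SO_RCVBUF` socket options. The helper name `tune_buffers` and both sizes are assumptions for illustration, not values from this commit, and the kernel may round or clamp whatever it is given.

```python
import socket


def tune_buffers(sock, send_bytes=16 * 1024, recv_bytes=1024 * 1024):
    """Shrink the send buffer and grow the receive buffer on a socket.

    The default sizes are illustrative assumptions. Returns the
    (send, recv) sizes the kernel reports after the change.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, send_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, recv_bytes)
    return (
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
        sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
    )


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        print(tune_buffers(s))
```

Reading the values back matters because the applied size is often not the requested one.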
-
Leigh B Stoller authored
* Big change to how events are forwarded to the Portal. Originally we subscribed to events from the local pubsubd, transformed them to Geni events, then sent them back to the local pubsubd, where pubsub_forward would pick them up and forward them to the Portal SSL pubsubd. Now we send them directly to the Portal SSL pubsubd, which reduces the load on the main pubsubd, which was throwing errors because of too much load (to be specific, the subscribers were not keeping up, which causes pubsubd to throw errors back to the sender). We can do this because pubsub and the event system now include the SSL entrypoints. As an aside, pubsub_forward is multi-threaded while igevent_daemon is not, so we might have to play some tricks similar to stated.
* Clean up configure definitions as described in commit 621253f2.
* Various debugging changes to make it possible to run an alternate igevent daemon out of my devel tree for debugging. Basically, the main igevent daemon ignores all events for a specific slice, while my igevent daemon ignores all the other events and processes just the ones for my specific slice.
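The slice split between the main and devel daemons can be sketched roughly as follows. The function and parameter names are hypothetical, and the behavior when no debug slice is configured is an assumption; the point is only that each event is handled by exactly one of the two daemons.

```python
def should_process(event_slice, debug_slice, is_devel_daemon):
    """Decide whether this daemon handles an event for `event_slice`.

    debug_slice: the slice reserved for the devel daemon, or None when
    no split is configured (assumed: the main daemon handles everything).
    """
    if debug_slice is None:
        return not is_devel_daemon
    if is_devel_daemon:
        # Devel daemon: only events for the one reserved slice.
        return event_slice == debug_slice
    # Main daemon: everything except the reserved slice.
    return event_slice != debug_slice
```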
-
Leigh B Stoller authored
pubsubd. The main point is that instead of being able to run the SSL pubsubd at the Mothership only, any site can be a Portal and needs to run it. So for example my elabinelab is a real Portal, which is very handy for testing.
-
Leigh B Stoller authored
-
- 12 Jan, 2018 5 commits
-
-
Mike Hibler authored
-
Gary Wong authored
-
Gary Wong authored
-
Gary Wong authored
-
- 11 Jan, 2018 3 commits
-
-
Leigh B Stoller authored
-
David Johnson authored
I had a disk image containing unmodifiable binary software that would overwrite dhcpcd's sane copy of /etc/resolv.conf, at a nondeterministic point in time, with something completely bogus. That screwed up startcmdstatus reports; this helps out with that case (in combination with other custom scripting that returns /etc/resolv.conf to sanity). Note though that we only retry infinitely once runstartup has successfully gone to the background; up until then, we're limited to about a minute's worth of retries. Likewise, we don't retry forever if runstartup itself experiences an error. We only retry forever if we actually have a status to send.
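The retry policy described here could look something like the sketch below. Every name is hypothetical, and the retry count and delay are assumptions chosen to add up to roughly a minute; this is not the actual runstartup code.

```python
import time


def report_status(send, have_status, in_background, runstartup_failed,
                  bounded_retries=12, delay=5.0, sleep=time.sleep):
    """Try to deliver a status report; return True once send() succeeds.

    Retries are unbounded only when all three conditions hold: there is
    a status to send, runstartup is in the background, and runstartup
    itself did not fail. Otherwise at most `bounded_retries` attempts
    are made (12 attempts 5 s apart is about a minute; both assumed).
    """
    unbounded = have_status and in_background and not runstartup_failed
    attempts = 0
    while True:
        if send():
            return True
        attempts += 1
        if not unbounded and attempts >= bounded_retries:
            return False
        sleep(delay)
```

Passing `sleep` as a parameter keeps the policy testable without real delays.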
-
Leigh B Stoller authored
was really inefficient. At about 1200 nodes each successive node was taking over a second to process. Much better now. Also note that this commit disables all Jacks when the number of nodes is greater than 200. Temporary.
-
- 10 Jan, 2018 1 commit
-
-
Mike Hibler authored
-
- 09 Jan, 2018 7 commits
-
-
Mike Hibler authored
If we support provenance but not deltas, then we do not use the newer create-versioned-image when creating images from Xen vnodes. However, we had a bug in that path where we would not pass the imageid argument to the old script, so the image was spewed to stdout and ended up in the logfile.
-
David Johnson authored
-
Leigh B Stoller authored
-
Leigh B Stoller authored
events that need to be transformed into portal events. This cuts the number of events received by 50 percent, which keeps pubsub from backing up on sending the events back. Helpful for really big experiments (1000+ nodes).
-
Leigh B Stoller authored
-
Leigh B Stoller authored
-
Leigh B Stoller authored
-
- 08 Jan, 2018 1 commit
-
-
David Johnson authored
If the TBScriptLock caller provides a debug message, it will be stored in a file, and other blocked TBScriptLock callers will get (possibly slightly racy) info about who holds the lock. Then, use this in libvnode_xen to get some info about long calls to xl (create|halt|reboot|etc). Also enable lockdebug in libvnode_xen for now.
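As a rough illustration of the idea (not the actual TBScriptLock implementation), a lock holder can write its debug message to a side file that blocked callers read. The class name, the `.debug` suffix, and the message format below are all assumptions; the sketch uses `fcntl.flock`, so it is Unix-only.

```python
import fcntl
import os


class ScriptLock:
    """Exclusive file lock that records a holder message in a side file."""

    def __init__(self, path):
        self.path = path
        self.info_path = path + ".debug"  # hypothetical naming scheme
        self.fd = None

    def try_acquire(self, message=None):
        """Return (True, None) on success, else (False, holder_info)."""
        fd = os.open(self.path, os.O_CREAT | os.O_RDWR, 0o644)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)
            return False, self._read_holder()
        self.fd = fd
        self._clear_info()  # drop anything left by a crashed holder
        if message is not None:
            with open(self.info_path, "w") as f:
                f.write(f"pid {os.getpid()}: {message}\n")
        return True, None

    def _read_holder(self):
        # Slightly racy: the holder may release between our failed
        # flock() and this read, so the text can be stale or missing.
        try:
            with open(self.info_path) as f:
                return f.read().strip()
        except FileNotFoundError:
            return None

    def _clear_info(self):
        try:
            os.unlink(self.info_path)
        except FileNotFoundError:
            pass

    def release(self):
        if self.fd is None:
            return
        self._clear_info()
        fcntl.flock(self.fd, fcntl.LOCK_UN)
        os.close(self.fd)
        self.fd = None
```

A caller wrapping a long `xl` invocation would pass the command as the message, so anyone blocked on the lock can see what it is waiting for.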
-
- 04 Jan, 2018 6 commits
-
-
Mike Hibler authored
-
Mike Hibler authored
This is really only needed on the mothership.
-
Mike Hibler authored
-
Leigh B Stoller authored
-
Leigh B Stoller authored
cluster is not reachable or it's BUSY. Just mark it canceled so the apt_daemon will pick it up.
-
Leigh B Stoller authored
-
- 03 Jan, 2018 1 commit
-
-
Mike Hibler authored
-
- 02 Jan, 2018 7 commits
-
-
Mike Hibler authored
-
Leigh B Stoller authored
the extend modal, tell the user they cannot extend their experiment until the cluster is back online, and to try again later. This is a stopgap; we probably need a better way to handle transient failure in contacting clusters when doing extensions.
-
Leigh B Stoller authored
an experiment from the cluster (say, because it's offline).
-
Leigh B Stoller authored
-
Leigh B Stoller authored
-
Leigh B Stoller authored
-
Leigh B Stoller authored
-
- 01 Jan, 2018 4 commits
-
-
Leigh B Stoller authored
-
Leigh B Stoller authored
-
Leigh B Stoller authored
1. Reservation system now groks experiment lockdown and swappable. When swapping in, lockdown and swappable mean the expected end of the experiment is never.
2. Reservation library now handles changes to lockdown, swappable, and autoswap (timeout). editexp now hands these changes off to a new script called manage_expsettings, which can be called by hand since we might need to force a change (I am not changing the classic UI; if a change is not allowed by the res system, we have to do it by hand).
3. Minor fixes to reservation library.
-
Leigh B Stoller authored
-
- 30 Dec, 2017 1 commit
-
-
Mike Hibler authored
-