emulab-devel issues
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues
2020-06-03T09:17:26-06:00

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/554
The FreeBSD `metis` port has undergone a major, possibly incompatible revision
2020-06-03T09:17:26-06:00
Mike Hibler

From the Slack thread:
> FYI, the FreeBSD `metis4` port is gone in the latest quarterly port set. There is now a `metis` port, which is Metis version 5. Anyone care to speculate whether that is going to cause problems? I thought it was used with `assign`, but apparently it is only `assign_prepass` and `ipassign`.
@ricci thought they were no longer used, though I verified that we have classic experiments (none in the last 10 years) that will use them due to magical DB settings in the `experiments` table. @stoller says `ipassign` is not used through the portal path, though `assign_prepass` (via `mapper`) might be.
The new port does cause problems:
> It looks like at least `assign_prepass` is still used. It uses the `kmetis` command line tool and that tool is now gone. According to the manual (http://glaros.dtc.umn.edu/gkhome/fetch/sw/metis/manual.pdf, search for "kmetis"), `gpmetis` is the direct replacement command, but I don't have any idea if it behaves the same without any additional options.
> I should add that `ipassign` is also technically still used, but it has to be specified explicitly in an ns file and there are only a handful of experiments that do that--none swapped in in the last 10 years. But we don't have a better solution for large, complex topos, we just punt on them now. `ipassign` links with `metis` libraries and we have already discovered that they moved include files around which breaks the build. Who knows what has changed in the API.
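The tool lookup question raised in the thread (whether `kmetis` is still installed, or only its documented successor `gpmetis`) could be handled with a small fallback probe. The sketch below is Python for illustration only and is not the actual `assign_prepass` code; `find_partitioner` and `METIS_TOOLS` are hypothetical names. It only locates a binary. It does not establish that `gpmetis` accepts the same arguments or produces equivalent output, which is exactly the open question above.

```python
import shutil
from typing import Callable, Optional, Sequence

# Candidate partitioner names in order of preference (an assumption of this
# sketch): `kmetis` is the Metis 4 tool, `gpmetis` is the replacement named
# in the Metis 5 manual.
METIS_TOOLS = ("kmetis", "gpmetis")


def find_partitioner(
    candidates: Sequence[str] = METIS_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    tool = find_partitioner()
    print(tool if tool else "no Metis partitioner found on PATH")
```

The `which` parameter exists only so the lookup can be exercised without either tool installed.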
> Looks like mapper has to be invoked with "-x", or the experiment flagged with "useprepass", in order for assign_prepass to be called. There are no instances of the former, and though there are three experiments with the "useprepass" column set, I see no path in our code through which that field can ever be set!

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/530
create_image -s on boss fails
2020-03-26T14:20:26-06:00
chuck cranor

This is Mike's favorite bug! It predates flux gitlab, so I couldn't file an issue on it when I first encountered it. Now I can...
running "create_image -s" (uses ssh) on boss causes the new disk image binary to be emailed to your account instead of saved on disk. it's too large, so the mail system rejects it with:
<pre>
The original message was received at Wed, 29 Jan 2020 14:54:46 -0500 (EST)
from localhost [127.0.0.1]
----- The following addresses had permanent fatal errors -----
<chuck@ece.cmu.edu>
(reason: 552 5.2.3 Message size exceeds fixed maximum message size
+(52428800))
----- Transcript of session follows -----
... while talking to dept-mx-03.andrew.cmu.edu.:
>>> MAIL From:<chuck@boss.narwhal.pdl.cmu.edu> SIZE=928020875
<<< 552 5.2.3 Message size exceeds fixed maximum message size (52428800)
554 5.0.0 Service unavailable
</pre>
one possible solution is just to remove "-s" from create_image, as the non-ssh options work.
another solution is this:
<pre>
diff -r -u baseline/utils/create_image.in orca/utils/create_image.in
--- baseline/utils/create_image.in 2019-05-22 17:03:13.000000000 -0400
+++ orca/utils/create_image.in 2019-05-22 17:02:25.000000000 -0400
@@ -1224,7 +1224,7 @@
#
my $SAVEUID = $UID;
$EUID = $UID = 0;
- $result = run_with_ssh($command, undef);
+ $result = run_with_ssh($command, $filename);
$EUID = $UID = $SAVEUID;
if ($result eq "setupfailed") {
goto done;
</pre>
but Mike was worried that it might break(?) something else if applied.

Assignee: Mike Hibler

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/519
SonicWall firewall issues with vnode to pnode communication
2024-01-02T11:11:09-07:00
Mike Hibler

Since the firmware upgrade (I think), we are once again having problems with routing between the control net (155.98.36.x) and the vnode control net (172.16.x.x). We are once again dropping these packets:
```
DROPPED, Drop Code: 710(Packet dropped - drop bounce same link pkt)
```
Packets necessarily need to bounce off of the firewall to route between the two. The vnode control net is a "secondary subnet" in the control net zone on the firewall. We had a problem with this initially, but after putting in the needed ARP and routing hacks, it started working. As far as I can tell, those hacks are still in place.

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/516
Delay-and-loss tolerant interactive access to endpoint nodes
2020-09-15T15:56:36-06:00
Robert Ricci (ricci@cs.utah.edu)

Users will inevitably want interactive access to endpoints: both fixed and mobile. Protocols like VNC already perform poorly to the fixed endpoints (over wifi), and this will be much worse for the mobile ones.
Some possible avenues to pursue:
* Mosh for shell access (downside: requires user to install client-side software)
* Low-bandwidth alternatives to VNC or X11

Assignee: Robert Ricci (ricci@cs.utah.edu)

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/509
Frisbee-ing from subbosses to virtual nodes with non-public IPs
2020-01-27T08:08:19-07:00
Mike Hibler

Since subbosses are on the node control net segment and may need to serve vnodes with 172.16.0.0 addresses, they need to have an alias on their control net for that network. Can we get away with simply adding a route for 172.16.0.0/12 to the control net interface? Will IGMP snooping on the switch work in that case?

Assignee: Mike Hibler

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/507
Wonky port counters on Utah Cloudlab hp090 and hp091
2022-06-27T08:31:26-06:00
Mike Hibler

We get frequent "excessive traffic" reports for hp090 and hp091 w.r.t. transmitted packets, e.g.:
```
Node:port Expt Pkts/sec Mb/sec When
hp091:eth2 cops-PG0/tapir 6732659 6 2019-07-02 08:59:48 for 638 sec
```
That is almost 7M pps at 6 Mbps for about 10 minutes, which would indicate that each packet was, on average, less than one bit. There appears to be some discrepancy between the packet counters on the Dell S4048-ON control switch, whether they are read via the CLI or SNMP. For example, hp091's switch port shows:
```
Input Statistics:
1536874051 packets, 162913777286 bytes
1242179 64-byte pkts, 1512599412 over 64-byte pkts, 4230559 over 127-byte pkts
587130 over 255-byte pkts, 2005787 over 511-byte pkts, 16208984 over 1023-byte pkts
198618 Multicasts, 10872 Broadcasts, 289299473137 Unicasts
0 runts, 0 giants, 256 throttles
0 CRC, 0 overrun, 0 discarded
```
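The figures in the two listings above can be cross-checked with a few lines of arithmetic. This is only a sketch using numbers copied from the report and the switch output; the variable names are made up for illustration, and the Mb/sec column is assumed to mean 10^6 bits per second.

```python
# Figures copied from the "excessive traffic" report quoted above.
report_pps = 6_732_659           # Pkts/sec column
report_bps = 6 * 1_000_000       # Mb/sec column, assumed to be 10**6 bits/sec

# Figures copied from the switch port "Input Statistics" quoted above.
total_packets = 1_536_874_051
size_buckets = [1_242_179, 1_512_599_412, 4_230_559,
                587_130, 2_005_787, 16_208_984]
cast_counters = {"multicast": 198_618, "broadcast": 10_872,
                 "unicast": 289_299_473_137}

# Average packet size implied by the traffic report, in bits.
bits_per_packet = report_bps / report_pps

# The per-size buckets and the per-cast counters should each add up to the
# packet total if the counters are mutually consistent.
bucket_sum = sum(size_buckets)
cast_sum = sum(cast_counters.values())
unicast_ratio = cast_counters["unicast"] / total_packets

print(f"bits per packet: {bits_per_packet:.2f}")
print(f"size buckets sum to total: {bucket_sum == total_packets}")
print(f"cast counters / total: {cast_sum / total_packets:.1f}")
print(f"unicast / total: {unicast_ratio:.1f}")
```

The six size buckets add up to exactly the reported packet total, so the total and the size histogram agree with each other; it is the cast counters that do not.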
Note that the first line shows 1536874051 total packets and yet the unicast packet count is almost 200x larger. It is those individual counters that we collect with `portstats` and the control net monitor program.

Assignee: Mike Hibler

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/505
Moonshot chassis firmware upgrades
2019-08-29T11:01:43-06:00
Mike Hibler

There have been a number of upgrades to HP Moonshot firmware since we got those chassis. I have made half-hearted attempts over the years to upgrade the 13 chassis, but they are in a patchwork state right now. The various "component packs" of firmware can be found in `/usr/testbed/www/downloads/Moonshot/`. See the oddly named `commands` file in that directory for the current state of the chassis.
Mostly we only care about the Chassis iLO and any changes that might make SOL to the cartridges more robust. We still need to "reset" the chassis manager (CM) periodically to get around `ipmitool` failures due to resource shortages.

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/504
`slothd` run too soon?
2019-05-24T16:38:41-06:00
Mike Hibler

I happened to notice when I was setting up some vlan devices on a client that `slothd` complains:
```
...
Starting slothd usage detector
slothd: Failed to get experiment net iface name for MAC
slothd: Failed to get experiment net iface name for MAC
slothd: Failed to get experiment net iface name for MAC
...
```
because the interfaces in question are vlan devices that have not been created yet (on FreeBSD) because `rc.config` (which runs `rc.ifconfig`) has not been run yet.
At this point I am not sure if `slothd` will pick these up later or whether we just don't report stats for the vlan interfaces or whether this is just a problem on FreeBSD.
We run `slothd` as early as we do so that it gets run on nodes that are free. I think this dates back to the days when free nodes sat in an OS instead of in pxeboot.

Assignee: Mike Hibler

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/497
Possible future `grub2pxe` bug to fix.
2019-04-29T13:07:56-06:00
Mike Hibler

An email from at Stanford pointed out the boot-time error message on the m510 nodes:
```
error: can't find command `pxe'
```
which I got side-tracked looking into. Somewhere between Ryan's original Grub 1.97 and the official Grub 2.x that we use now, they got rid of the `pxe` command. We were/are using this in our PXE-loaded grub.cfg to enable a DB specification of `os_info_versions.path` of the form _server_:path. The "pxe -s" command would set the TFTP server used by the PXE command.
I think what this means is that we can no longer load an MFS (the only instance where the path column is not NULL) from anywhere but the PXE (aka DHCP) server. I don't even recall where/why we would have needed this feature (we don't do it anywhere today), so probably not a big deal.

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/490
Build Randomization into CloudLab Benchmarks
2020-07-29T14:32:19-06:00
Aleksander Maricq

* [x] Build support for "random_order" flag and order generation into orchestration.
* [x] Add support for "random_order" and "ordering" fields into database.
* [x] Split up operations in the second set of memory benchmarks (leave STREAM as a single invocation?)
* [x] Construct a network test regulator to be run on the destination hosts (clients will take a lock when they run a network test, waiting clients will queue.)
* [x] Write up an ordered list of functions comprising every possible configuration in the order we currently run them (that will be randomized when random_order=TRUE.)
* [x] Change over "random_order" from a static "FALSE" to a probability (10% chance to start.)

Assignee: Aleksander Maricq

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/488
Subsume Apt into Cloudlab Utah
2019-08-29T14:23:55-06:00
Mike Hibler

There has been talk (by me) of getting rid of Apt as a distinct cluster and just bringing the nodes into CloudLab Utah. The plusses:
* one less cluster to manage
* infrastructure hardware is getting old: dbox node has failed and control node has failed and been resurrected once
Minuses:
* brings CloudLab cluster up to around 1000 nodes, potential scaling problems
* assimilation may be painful, e.g., joining the experiment fabrics at sufficient BW
* the CHPC conundrum, how to retain access to just the old Apt nodes in the CloudLab context

Assignee: Mike Hibler

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/481
Performance tuning for the new boss and ops nodes
2019-08-30T06:47:47-06:00
Mike Hibler

Now that the upgrade is done (#407), there is a great deal of tuning of the network, ZFS and NFS that can be done to take advantage of the large RAM, 40Gb network, and fast disks. See for example:
https://calomel.org/freebsd_network_tuning.html

Assignee: Mike Hibler

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/469
User-visible inventory control
2020-04-01T15:15:48-06:00
Robert Ricci (ricci@cs.utah.edu)

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/456
Harden Hadoop profiles against automated cryptocurrency compromises
2019-08-29T14:25:07-06:00
Gary Wong

This needs to be done in such a way that:
1. Profiles we maintain are adequately secure out of the box
1. Other users can easily apply the same techniques to their own profiles, whether derived from ours or independent. (So that when we say "fix your experiment or we'll kick you out", it's reasonable for them to comply.)
* [x] Add `iptables` glue to block most network traffic by default and allow whitelists. This needs to be image independent (see requirements above); the approach will be to use geni-lib, profile parameters, and install/execute services so as to be readily portable. This is applicable beyond just Hadoop.
* [ ] Add `nginx` reverse proxying with HTTP basic authentication to permit restricted access to the web services blocked above.
* [ ] Think about ways to make the `nginx` basic auth generic so it too can be reused in non-Hadoop profiles.
* [ ] Document all of this on a Wiki page with an example users can copy-and-paste into their profiles.

Assignee: Gary Wong

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/454
Fix CA vs User cert expiration problem
2018-12-12T06:42:10-07:00
Leigh Stoller

Our certs expire in five years now. But the problem is that our
CA cert is more recent, so it expires before the user certs.
Which is bad, need to do something about that before too long.
The APT cert expires in 13 months. I need to make our CA certs
last longer, and set the expiration on user certs earlier than
the CA cert. And then regen all the certs for users.

Assignee: Leigh Stoller

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/443
Scripting to add/remove BYOD devices
2020-03-31T16:24:25-06:00
Robert Ricci (ricci@cs.utah.edu)

BYOD devices are typically gray/black-box hardware (and sometimes software). A user shows up with a device that at minimum has a control network interface and some ability to be rebooted. We have a variety of support for them: osid opmode/features, image boot/load timeout controls, node/type attributes. We have support for a few different types of management/console interfaces (perhaps we could package/automate this a bit, but not sure it's worth it). Then of course we support a variety of power control methods, including IPMI/ilo/drac, etc. These devices may not be imageable and may not have our clientside installed; that's fine and supported. If they support experiment ethernet network interfaces that plug into our fabric, we can isolate those into links.
The goal here is not to add new gray/black-box config/setup modes during experiment runtime -- but rather to better script the addition/removal of (temporary) user BYODs.
Here's what we already have: `addstack`, `addswitch`, `addwire`, `addmanagementiface`, `addinterface`.
There are also `addspecialdevice` and `addspecialiface`, but at least the former should be extended to handle a few more necessary things (osid, os features); right now it assumes the presence of the GENERIC osid. Finally, there is `newscript`, which can take XML descriptions of new nodes (and their ifaces/ifacetypes/wires).
My sense is that by extending `addspecialdevice` and `addspecialiface` as mentioned above, and by ensuring that BYOD types can optionally only be used by their owners in their project(s), we can basically call this done for the initial case. (I'm sure when we actually start adding these things more regularly, we'll want to add further options to the `addspecialdevice` script to set various node/type attributes, etc.)

Assignee: David Johnson (johnsond@flux.utah.edu)

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/442
Disk image caching on AMs
2020-07-14T16:16:08-06:00
Robert Ricci (ricci@cs.utah.edu)

The control network connectivity and bandwidth of different types of end points and base stations requires rethinking of how we distribute and collect the potentially large images for experiment nodes. If and how images are cached on the end point control nodes is an issue.

Assignee: Mike Hibler

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/441
"Lite" version of boss
2018-09-25T15:49:47-06:00
Robert Ricci (ricci@cs.utah.edu)

Prune down the list of services and daemons that run on boss (ops is already bare bones) and more generally shrink the boss footprint.

Assignee: Leigh Stoller

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/438
VPN concentrator for control network
2019-03-26T11:24:39-06:00
Robert Ricci (ricci@cs.utah.edu)

This is about setting up a physical machine, probably in MEB or the DDC, to act as an openvpn server to the various POWDER aggregates; configuring its openvpn software; and setting up appropriate routing/firewalling to the mothership (et al). (The aggregate (client) side of this issue is being discussed in https://gitlab.flux.utah.edu/emulab/emulab-devel/issues/439).
Subtasks:
- [x] @hibler is going to obtain a new /22 from campus and have them route it to the MEB firewall
- [x] @hibler or @kwebb configure the firewall with the routes for the concentrated /29s to point to a gateway address on the VPN outside all those /29s
- [x] @johnsond will set up a physical VPN concentrator box, probably running Ubuntu 18.04.
- [x] @mike or @kwebb will set up a path from the firewall to the concentrator, and from the concentrator to the mothership control net.
- [x] @johnsond is going to write a profile that is a mockup of (most of) the software, including the failover stuff (wired to start, then wireless using a nuc), to validate the design (this is happening in https://gitlab.flux.utah.edu/johnsond/powder-vpn)
- [x] @johnsond needs to turn the scripts from https://gitlab.flux.utah.edu/johnsond/powder-vpn into a single script on the concentrator; this is trivial.
~~- [ ] @johnsond needs to tweak the concentrator's configuration to move to the "scalable", one openvpn server process per client (aggregate) -- and adapt his profile's scripts to add configuration for each new aggregate.~~ (Given that UConnect bandwidth is what it is, we decided that there is currently no need to move to the scalable design.)

Assignee: Dan Reading

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/437
Out of band control for NUCs
2020-03-31T16:25:27-06:00
Robert Ricci (ricci@cs.utah.edu)

Since we will not have the ready physical access we have always had with Emulab/CloudLab resources, we would really, really like to be able to remotely access the control node for base stations and other end points. (I'm not going out on the roof of the Medical Tower in a blizzard to power cycle a NUC!) Specifically we are concerned with maintaining control of NUC control nodes: at the very least the ability to reboot the node and ideally the ability to diagnose via the console.
On the hardware-side, NUCs have watchdog timers and a LAN-based management interface, but it is not yet clear how accessible the watchdog timers are and in some cases we know the LAN-based MI will not help us.
On the software-side, we may be able to mitigate problems by running the base CloudLab control servers inside of a VM on the NUC. This is assuming that the most common cause of failure would be a lock-up/freak-out of the FreeBSD boss/ops and not a hardware failure of the physical box.

Assignee: Mike Hibler