# emulab issues
https://gitlab.flux.utah.edu/groups/emulab/-/issues

## #632: Get our content out of the Plone wiki
*Mike Hibler · 2021-09-03 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/632*

We have been keeping Plone on life support through a couple of boss/ops upgrades. After the latest, it is time to pull the plug, since nobody wants to convert it to python3. So we need to get our content out of there and loaded somewhere else (gitlab wiki?). If only I had remembered this *before* we converted ops...
So now we have to move the current installation over to a machine with python2, either an elabinelab or else just move it somewhere like ops.utah.cloudlab.us. Then figure out a way to extract the useful content. Then figure out a way to get that content into something else in a reasonable form.
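If a frozen, browsable copy is good enough for the "extract the useful content" step, a static mirror might be the path of least resistance. A minimal sketch, assuming the python2 instance ends up reachable somewhere like ops.utah.cloudlab.us (the path is a placeholder):

```
# Crawl the wiki into a static tree we can serve or import elsewhere.
# --mirror recurses with timestamping; --convert-links rewrites links
# for local browsing; --page-requisites grabs CSS/images.
wget --mirror --convert-links --page-requisites --no-parent \
     https://ops.utah.cloudlab.us/wiki/
```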
I expect this falls on @hibler or @stoller.

## #631: Reservation time search can return reservations that don't work
*Robert Ricci <ricci@cs.utah.edu> · 2021-07-15 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/631 · assigned to Leigh Stoller*

I heard from a user that they used the feature of the reservation request page where they put in a machine type and number of days to search for a start time, but when they clicked to request the reservation, they were told it didn't fit.
I don't know the reason; eg. it's possible an experiment swapped in or got extended between when the search ran and when they requested it, but it seems like it might be worth taking a look to make sure we don't have any obvious potential bugs.

## #630: Get the 32-port 100Gb Barefoot switch up and running
*Mike Hibler · 2021-09-03 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/630*

Brent wants to use this, so we need to get it integrated in the testbed and wired up to some nodes.
The short-term plan is to connect up 8 of the new c6525-100g nodes directly as they have a second, unused 100Gb port.
* [x] Rack the switch in V05
* [x] Wire up the nodes
* [ ] Apply a feature to the nodes so they can be specified in a profile (and possibly to dis-favor them for normal use)
* [ ] Add the switch to the DB and make the management interface accessible in a safe way
Down the road:
* [ ] Ability to reload the switch OS

## #629: Replace the storage server at Clemson.
*Mike Hibler · 2023-06-09 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/629 · assigned to Mike Hibler*

Spinning this one off from #567 as well. That issue says:
```
Clemson: 50TB on one server, but in two zpools of 43 and 7TB. 5.8TB and 1.3TB free. Again a couple of
possibilities. The first would be to commandeer another one of the first-gen storage node, giving us
another 50TB. A more intriguing possibility would be to take over one of the dss7500 nodes with 45 HDDs
and 270TB. That would solve the space issue for some time to come but would take one of only two of
those machines. I only suggest it because those machines are almost never used right now, which is
a waste.
```
This has come to the forefront because the smaller zpool of 1TB disks has a slowly failing disk. ZFS can deal with it, but it takes long enough to retry an operation that the iSCSI client times out, leading to "disk errors" and a corrupted filesystem. `smartctl` confirms lots of corrected errors and that the disk is in the "pre-fail" state. Unfortunately, so is every other disk in that zpool, so I am not sure there is much point in replacing just the one. We need to evacuate that pool.
Short term I am going to clear out dead datasets and move everything left on the small zpool to the larger one. That one has 4TB drives that are just as old, but seem to be holding up better.
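For the evacuation itself, a recursive snapshot send is probably the path of least resistance. A sketch, with hypothetical pool/dataset names:

```
# Snapshot everything on the failing 1TB-disk pool, then replicate it
# to the healthier 4TB-disk pool.
zfs snapshot -r tank-small/exports@evac
zfs send -R tank-small/exports@evac | zfs receive -F tank-big/exports
# Once verified, the old datasets can be destroyed to retire the pool.
zfs destroy -r tank-small/exports
```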
But it is time to seriously consider taking over one of the dss7500 nodes.

## #628: Upgrade mothership boss and ops to FreeBSD 12.2
*Mike Hibler · 2021-09-03 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/628 · assigned to Mike Hibler*

What I will do on my summer vacation...
I opt to skip upgrading to 11.4, because that is already obsolete. So let's jump ahead for once. We have a couple of installs running 12.2 already, so much of the boilerplate has been taken care of (see #611):
- [x] Upgrade packages to 2021Q2 at least. Big issue here is the need to move to python3
- [x] Upgrade Emulab sources to work with python3 (see #599)
- [x] Work out instructions for upgrading from 11.3. See `install/README-upgrade-11.3-12.2.txt` for the basic steps (the generic FreeBSD path is sketched after this list).
- [x] Make sure SOL consoles are working.
- [x] Figure out how to do a recoverable backup.
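For reference, the generic binary-upgrade path looks roughly like the following; the boss/ops-specific steps are what the README above documents:

```
freebsd-update -r 12.2-RELEASE upgrade   # fetch and merge the new release
freebsd-update install                   # install the new kernel
shutdown -r now
freebsd-update install                   # after reboot: install new userland
pkg upgrade                              # move packages to the new (python3) set
freebsd-update install                   # final pass removes old libraries
```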
There will be lots of agony dealing with the mothership-specific aspects. Some that come to mind:
- [x] (ops) `iocage` or some mechanism for managing the genilib jails
- [x] (both) `bareos` and `znapzend` (ops only) for backups
- [x] (ops) `bulkmailer`, `procmail`, and other mail software
- [x] (boss) `gnuplot` for assorted graph-y things
- [x] (boss) VNC stuff for @stoller's remote x-server :-)
- [x] (boss) Powder specific stuff (TBD)
- [x] (boss) Master portal related stuff (TBD)
- [ ] (ops) `plone`, the undead wiki
Other stuff:
- [x] Send out a message about 0.5-1 day downtime.
- [x] Stand up a web server to serve up an "Under construction" web page for our front pages (Emulab/Cloudlab/Powder).

## #627: More storage for Utah Cloudlab cluster
*Mike Hibler · 2021-10-24 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/627 · assigned to Mike Hibler*

Spinning this off from the more general #567. So sayeth that issue:
```
Utah: 43.5TB in one server, 6.7TB free. We have a couple of options for more space here.
One is the DriveScale chassis which have over 100TB between them but needs more work to
integrate with the model. The other is to wire up the Apt half of the Dell storage box.
This would add a second server with 36TB. Requires running a 40Gb link from the Apt side
of the pod over to bighp1.
```
These are still the primary hardware solutions, with a couple of updates:
* DriveScale got bought up, so we don't have to worry about using their SW anymore! The JBOD is pretty basic I think, as is the controller node it connects with (just a PC running Linux), so we could probably just get it running with FreeNAS as a regular storage server.
* The "steal Apt's storage server" solution is more practical because we never put the Apt server back in operation after rebuilding the storage box. Right now it is just an oversized FreeNAS testing box.
* A cheaper and faster(?)-to-implement alternative to running a 40Gb fiber around the pod would be to just export the current Apt RAID volumes out the Cloudlab SATA ports. This leaves the second storage server with just its 1TB of fast local storage, which should be enough for testing. It does limit the potential throughput we could get from two servers with twice as much SATA and network BW.
* We could also consider populating more of the slots in the MD3260; we are using only 25 of the 60 bays. Dell-certified drives from harddrivesdirect.com range from $235 (qty 1) to $181 (qty 10) each for 4TB models. 6TB models range from $218 to $202, though we would need to double-check that they are compatible.
On the software side, I have discovered that the ZFS zvol volume size (`-V` when creating) is a space limit from the perspective of the user, not a limit on how much space the zvol will consume. While I knew this at some level, I failed to appreciate just how much overhead ZFS can introduce on top of that, including metadata such as RAID parity and checksums, and fragmentation due to misalignment of the various blocksizes. Apparently the recommendation for iSCSI on zvols is to not allocate more than 50% of the capacity. In our most extreme example, we have a 15TB dataset which is using 21TB of disk space, even though the overlaid Linux filesystem is only using 7TB. Potential improvements here (a command-level sketch follows the list):
* Use "thin" zvols. This really doesn't solve anything though, it just allows us to overbook storage and probably everyone will run out of space in a truly ugly way down the road.
* Turn on compression for the zvols. This is one way of implementing that "don't allocate the full disk capacity", but of course depends on the nature of the data and doesn't address the metadata overhead.
* Switch to using iSCSI volumes on top of ZFS filesystem files. This is what a number of people say you should do, as apparently it has fewer "misalignment" problems and performs better overall.
* Turn on the "discard" option when we create ext4 filesystems, so that they TRIM. This will help when the overlayed filesystem has a lot of free space.
* Use something other than ZFS? This would be a lot of work to implement at this point.
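A command-level sketch of the first four options (pool/volume names and the client device are hypothetical):

```
# "Thin" zvol with compression: -s skips the space reservation, lz4 is cheap.
zfs create -s -V 10T -o compression=lz4 tank/iscsi-vol1

# File-backed alternative: a plain dataset holding extent files for the
# iSCSI target instead of a zvol.
zfs create -o compression=lz4 tank/iscsi-extents
truncate -s 10T /tank/iscsi-extents/vol1

# On the Linux client, mount ext4 with online TRIM so freed blocks are
# returned to the backing store (discard at mkfs time only trims once).
mkfs.ext4 /dev/sdX
mount -o discard /dev/sdX /mnt
```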
## #626: Management interface flakiness at Wisconsin
*Mike Hibler · 2021-06-19 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/626 · assigned to Mike Hibler*

There are lots of things going on here, possibly related.
* Management interfaces (CIMCs) become unresponsive entirely, cannot ssh in or even ping. Sometimes they recover themselves after a while. Perhaps they reboot out of the blue?
* IPMI on the management interface becomes unresponsive. In this state you can ssh in to the CIMC, but IPMI operations fail with a variety of errors, ranging from reporting "out of resources" to not responding at all. Again, time will often heal all wounds or, failing that, a swift kick in the ass (rebooting the CIMC) will.
* Garbage output to the IPMI SOL console. Not actually garbage, it is VT100 cursor positioning sequences (see #624).
I suspect that the underlying issue here is a combination of an older, possibly buggy IPMI implementation, the fact that IPMI is UDP-based and apparently stateful, and our possibly over-aggressive use of the protocol.
Suggested actions:
* Investigate the `power` command IPMI module to see if there is something wrong or if there is anything to improve.
* Investigate the `capture` command's use of `ipmitool` to see if there is something wrong or if there is anything to improve.
* Update the CIMC firmware?
* Turn off "console redirection always" in the BIOS (see #624).
* Periodic monitoring with optional ass-kicking of unresponsive interfaces (a sketch follows the list).
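A sketch of what that monitoring might look like; this is the generic `ipmitool` incantation, not an existing Emulab script, and host, credentials, and mail alias are placeholders:

```
BMC=cimc-nodeXX
IPMI="ipmitool -I lanplus -H $BMC -U admin -P secret"
if ! $IPMI chassis power status >/dev/null 2>&1; then
    # IPMI is wedged: try a BMC cold reset, else flag it for a manual kick.
    $IPMI mc reset cold ||
        echo "$BMC: IPMI unresponsive" | mail -s "CIMC check" testbed-ops
fi
```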
## #624: Turn off BIOS redirection after boot on Wisconsin c220g2 nodes
*Mike Hibler · 2021-06-19 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/624*

I happened to notice that the console logs for the c220g2 nodes seem to be growing constantly, though not alarmingly. Looking at the logs, there are tons of 8-character VT100 cursor positioning sequences in there. These appear to be coming from the node at a rate of about 2 per second, and they happen even when the node is sitting at PXEWAIT.

Since they appear to be coming from the BIOS, even when not in the BIOS, I tried changing the "BIOS console redirection after POST" setting from "always" to "bootloader". I have never seen the latter before, but it seems to do what we need: turn off BIOS console output once control is handed to the boot loader. We continue to get boot loader and kernel redirection because we have both configured to use the serial port, which is passed via SOL.
This needs to be changed on all c220g2 nodes.

## #623: 'Stops' in the extension slider
*Robert Ricci <ricci@cs.utah.edu> · 2021-06-10 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/623*

Anecdotally, it seems that some users may end up asking for longer extensions than they intend to because of the way the extension slider works. eg. if they are intending to request an amount of time that will be auto-approved, it's hard to stop the slider at exactly that point, meaning they may ask for a few hours or days that they didn't intend to. We should consider whether we can or should make it easier to ask for the exact amounts of time that hit various thresholds, to save users from having to try to fine-tune the slider; this may also lower the burden on us, since there would be fewer small extension requests to process.

## #622: Larger node console
*Robert Ricci <ricci@cs.utah.edu> · 2021-06-24 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/622 · assigned to Leigh Stoller*

On the experiment status page, the node console tab is fixed-size. For people who make extensive use of the serial console, it would be helpful to be able to either expand it, or pop it out into its own (browser) tab/window so they can resize as needed.

## #621: re-IP the `utah.cloudlab.us` space
*Mike Hibler · 2021-06-29 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/621 · assigned to Mike Hibler*

With the addition of Phase III, we are almost out of IPv4 addresses in 128.110.152.0/22.
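For scale: the old /22 holds 1,024 addresses (128.110.152.0 through 128.110.155.255), while the replacement /21 doubles that to 2,048 (128.110.216.0 through 128.110.223.255).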
@ricci requested and received a new /21, 128.110.216.0/21, with the promise that we will return the old IP space within a month. Well, that was more than a month ago, so we (@stoller and @hibler in particular) need to get crackin' with the move.

## #620: Do something with DriveScale hardware
*Mike Hibler · 2021-04-13 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/620*

Since DriveScale got bought and we can no longer use their software, we have a whole bunch of disk space that we should make use of somehow:
* **2 x 72 drive JBOD boxes**. These are just disk boxes with external SAS interfaces that we should be able to hook up to a server and use in the traditional Emulab storage server way. The one box we have mounted is fully populated with 1-3TB drives and has around 120TB; the other is also fully populated, and a spot check showed 1-2TB drives, so probably upwards of 100TB again.
* **24 drive NVMe chassis**. This is actually a server with 24 NVMe slots and 4 x 100Gb network interfaces. We have not been able to figure out exactly what it is, or how to boot it into anything but its internal OS drive, though we have not tried very hard. It is currently populated with 16 x 6.4TB NVMe drives that were purchased with CARES funding, so we need to at least get the drives into use.

## #619: Slow SOL consoles
*Mike Hibler · 2021-06-09 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/619*

IPMI SOL on some Dell nodes (Clemson R7525, Utah R6525) can periodically get really, really, really slow. We are talking 1970s-era acoustic coupler slow here, like 110 baud-ish.
This originally happened on the Clemson AMD nodes, but we have also been experiencing it on our more recent AMD nodes. It is not just with our `capture` console proxy, but also with plain `ipmitool` or `ipmiconsole` (`freeipmi`) SOL access. Sometimes a power cycle will fix it for a while. I have noticed a couple of times that the slowness is preceded by a disconnect of the node from IPMI.
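For reproducing outside of `capture`, either of these exercises the same SOL path (host and credentials are placeholders):

```
ipmitool -I lanplus -H bmc-nodeXX -U admin -P secret sol activate
ipmiconsole -h bmc-nodeXX -u admin -p secret   # freeipmi equivalent
```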
Is it an AMD thing? Legacy BIOS related?

## #618: Design for Raspberry Pi carrier boards
*Robert Ricci <ricci@cs.utah.edu> · 2021-11-24 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/618 · assigned to Robert Ricci*

This ticket collects requirements and notes for the design of the Raspberry Pi CM4 carrier boards for CloudLab.
In contrast to other carrier boards, this one will be designed for density, dual onboard networks, and manageability. The goal is to get as many CM4s per board as is practical given physical constraints and the availability of things like Ethernet ports on the on-board chips. Our current design attempts have aimed for _N_ = 5 or 8 experiment nodes per board.
In addition to the experiment nodes, there will be one onboard control node that will handle booting the experiment nodes, power control over them, configuring the onboard switches, etc.
The goal is for each board to have the minimum number of external connectors (power, Ethernet, USB, etc.) while meeting the requirements below.
These boards are designed to be used with CM4s with onboard eMMC (size TBD, but likely 16GB), no WiFi, and RAM TBD (but possibly 4GB), eg. part number `CM4104016`.
### Requirements for each experiment node (CM4 socket - _N_ of them):
- [ ] Power control (via access to the `GLOBAL_EN` pin, connected to a `GPIO` on the control node - note: this is at 5V, so we likely want something between it and the `GPIO` to prevent leakage current; see the sketch after this list)
- [ ] Control over boot from onboard eMMC or rpiboot (USB) (via the `nRPI_BOOT` pin, connected to a `GPIO` on the control node)
- [ ] Control over `EEPROM` write enable/disable (via the `EEPROM_nWP` pin, connected to a `GPIO` on the control node)
- [ ] Serial console (through built-in `UART` pins (`GPIO14/TXD0` and `GPIO15/RXD0`) connected to a `UART` on the control node)
- [ ] USB client device for rpiboot/flashing (eg `USB_N`/`USB_P` pins) connected to a USB hub/port on the control node
- [ ] Experiment network: builtin Ethernet connected to an onboard switch
- [ ] Control network: Ethernet via PCIe Ethernet controller (eg. `RTL811H`) connected to an onboard switch
- [ ] LEDs for debugging (exact set not yet determined; these are some possibilities):
  - [ ] Power via `CM4_3.3V`, which is off when `GLOBAL_EN` is pulled low
  - [ ] Status of `nRPI_BOOT` and/or `EEPROM_nWP` pins
  - [ ] `Pi_nLED_Activity`
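To make the power-control requirement concrete, here is a hypothetical sketch from the control node's side, assuming a Linux control node with libgpiod and one node's `GLOBAL_EN` buffer wired to GPIO line 23 (both numbers made up):

```
# Hold GLOBAL_EN low for ~2 seconds to power the node off; when the line
# is released, the pull-up brings GLOBAL_EN high again and it powers up.
gpioset --mode=time --sec=2 gpiochip0 23=0
```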
### Requirements for the control node:
- [ ] Sufficient `GPIO`s for functions listed above (may require an `I2C` `GPIO` expander depending on _N_)
- [ ] Sufficient USB 2.0 ports to act as a host for all _N_ experiment nodes via an onboard hub
- [ ] Sufficient `UART`s (via USB or `I2C` UART adapter) for all _N_ experiment nodes
- [ ] Its own RJ45 connector on the board connected to its builtin Ethernet for out of band control
- [ ] `SPI` or `I2C` connections to onboard switches to configure them
- [ ] Connection to the onboard control switch (unless this would cost us a node, in which case we could handle it by giving the control node its own jack on the board)
- [ ] Power control (pins to control from another machine, plus an onboard reset button)
- [ ] Serial console (via either USB or directly exposed RS232)
- [ ] Some way to reflash, though this can be rare (eg. by swapping an SD card)
- [ ] Sufficient storage to cache disk images locally
- [ ] Same debugging LEDs as experiment nodes
### Requirements for each switch (2 of them, one experiment and one control)
- [ ] At least _N+1_ gigabit Ethernet ports (one for each node, plus one to connect off-board - needs _N+2_ if we want an onboard connection to the control node)
- [ ] `SPI` or `I2C` control from control node, ability to manipulate VLANs
- [ ] Reset or power control via `GPIO` pin on the control node
- [ ] RJ45 connector for off-board connectivity
### Requirements for the board in general:
- [ ] Ideally a single power connector, either barrel or ATX - will need 5V for the CM4s and 3.3V for various other devices
### Questions remaining:
- [ ] Can we connect `GLOBAL_EN`, `nRPI_BOOT`, `EEPROM_nWP` directly to `GPIO`s on the controller, or do we need a transistor or voltage divider for better safety and/or reliability?
- [ ] See questions regarding network and management interfaces for control node
## Current design at Geppetto/Upverter
I don't see a way to share designs without the other person having an account, so make an account at https://geppetto.gumstix.com/ and let me (@ricci) know what email address you signed up with.
The current design hosts _N_=5 nodes plus one control node. It's at the maximum board size for the tool: 22.8 cm x 15.2 cm. Everything is mounted flat on the top of the board, so it should fit easily in 1U (1.75 inches) with plenty of room for heatsinks.
### Things that are not ideal:
- [ ] Power distribution: as designed, it is powered at 36V and needs two barrel jacks to reach the rated capacity of about 88W. It requires a 5V/5A regulator for each CM4, plus a bunch of 3V/1.5A regulators for other components (which have to be fed from the 5V regulators because they don't take 36V input). A much simpler (and cheaper) design should be possible: power the whole thing at 5V if we can handle the current, use a smaller number of larger regulators, etc. Possibly see if we can use an ATX power supply since it provides both 5V and 3.3V
### Big things missing:
- [ ] Geppetto does not have a PCIe Ethernet chip. I've put USB ones in as placeholders, but we'd need to get them to add one. I've tested the `RTL811H` and it seems to work well, though it has the downside that the driver is out of tree.
- [ ] I don't see a way to put a single transistor or `74xx06` between the `GPIO` and the `GLOBAL_EN` line as noted elsewhere in this ticket.
- [ ] I haven't added 'out of band' flashing, power cycling, etc. for the control node; see the question elsewhere in the ticket.
### Small things missing:
- [ ] Geppetto does not currently provide access to the `GLOBAL_EN`, `nRPI_BOOT` and `EEPROM_nWP` pins on the CM4 connector. I'm guessing this is easy for them to add
- [ ] I have not put status LEDs or mounting holes on yet (including mounting holes for the CM4s)
### Factors limiting number of CM4s:
- [ ] Board space: the board is currently not completely full, and it should be possible to squeeze a few more on there physically, especially if the power distribution messiness can be fixed; the regulators currently take up quite a bit of space. There are also options to mount the CM4 on an SODIMM adapter board (like the CM3), but this would cost more and run into potential problems fitting into 1U. I think their adapter boards also don't expose all of the lines we need.
- [ ] Switch ports: the board currently uses switches that have 5 ports, plus 2 that are to be used for uplinks. So, to add more nodes, we need to either:
* Get Geppetto to add a larger version of the same switch with more ports (which will involve engineering costs)
* Add 2 more switches, and connect them together on the board in a way that will introduce bottlenecks and complexity
* Add 2 more switches, and give each its own uplink port. This doubles the number of Ethernet cables and switch ports we need
- [ ] USB hubs: the USB hub I'm currently using only has 7 device/client ports so we'd need to daisy chain a few together to get enough ports
- [ ] `GPIO` pins: with 3 pins needed on the control node per experiment node, we would likely use up all available pins and would need to add more through an `I2C` GPIO expander or similar.
- [ ] Power and heat density: I have not even conclusively verified that we are OK at this density, and of course more density is even harder

## #616: Associating user accounts with publications
*Robert Ricci <ricci@cs.utah.edu> · 2021-10-21 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/616 · assigned to Leigh Stoller*

For the purposes of reporting to NSF and producing a public bibliography for the various facilities that make up our testbeds, it would be helpful to be able to cross-reference our userbase with publication databases. Options for publication databases include:
* [DBLP](https://dblp.uni-trier.de/)
* [Google Scholar](https://scholar.google.com/)
* [Microsoft Academic Search](https://academic.microsoft.com/home)
At this point, I am leaning towards DBLP: it's a big (very big) database we can download and process ourselves as we see fit, as opposed to the other two, which are web services that we'd have to interact with via searches. My understanding is that MS Academic Search has an API, which Google Scholar does not, but Google Scholar will search through the contents of papers, which MS Academic Search will not, meaning that Google Scholar is much better for searching for papers that mention our name or URLs. DBLP also has the advantage of being focused on CS, and seems to contain more structured data about papers and people that we can use to generate bibliographies.
What I'm imagining the workflow looks like (to be repeated periodically; step 1 is sketched after the list):
1. Download the latest DBLP database
2. Match users from our own DB to people in the DBLP database. This is necessarily going to be a bit fuzzy, likely based on email address, name, and institution. Likely, record this matching in our own DB so that any matching we ever have to do manually gets saved. Also, we should probably make this a one-to-many mapping, as these kinds of databases do tend to inadvertently split people's records (eg. maybe they have separate records for Kobus and Jacobus)
3. Find all papers in the DB by those people: in the initial run, look for papers since the testbed was "open"; in subsequent runs, just look at ones since we last checked
4. Optional: automatically download the paper, if available, run text extraction, and look for keywords like the name of the testbed, names of hardware types, etc.
5. Check each paper to see if it used the facility. We can do this ourselves (possibly assisted by automatic search for keywords) and/or ask the users themselves. In my experience, doing this manually goes pretty fast, assuming you have a link to the paper: you open it up, Ctrl-F for the name of the facility, and it's usually immediately obvious whether they are describing it in the context of evaluation. To ask users, we'd provide them with a list of the papers we know about and ask them yes/no: did you use our testbed to evaluate this, and maybe, if they are in more than one project, which project is it related to. We'd need to do something to avoid bothering too many people in duplicate about the same papers.
6. Record all of this, and create list views for the whole testbed, per user, per project, and maybe per publication venue, institution, etc. If the database we use somehow marks research areas, that would be an interesting criterion too.
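Step 1, at least, is trivial; dblp publishes the whole thing as a single XML dump:

```
curl -LO https://dblp.org/xml/dblp.xml.gz   # the full dump
curl -LO https://dblp.org/xml/dblp.dtd      # needed to expand entities when parsing
gunzip -k dblp.xml.gz                       # several GB uncompressed; use a streaming parser
```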
One possible place to look for code to parse DBLP data could be [CSrankings](https://github.com/emeryberger/CSRankings), which also has some additional metadata that could be useful, like mappings of faculty to institutions. @johnsond might also have some relevant code.

## #615: Manual route for the control network
*Leigh Stoller · 2021-10-21 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/615 · assigned to Leigh Stoller*

On the Mothership we have the problem of the supermassive not being able to route the endpoint subnets. We could solve this if we could send routes to the nodes with tmcd, but that would require "manual", and that turns off auto route calculation with dijkstra. We need a combo mode.
@hibler and @johnsond also request that I try to remove the Boost dependency.

## #614: Image server type list
*Leigh Stoller · 2021-02-08 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/614 · assigned to Leigh Stoller*

We have the situation that Ubuntu 16 does not work on D6515 nodes, but at the moment the list of X86 types we assign to nodes in the Image server is fixed. We need to start using the types_not_working.

## #613: Allow leases (datasets) to be extended while they are still valid
*Mike Hibler · 2021-01-13 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/613 · assigned to Mike Hibler*

Currently we only let you extend one if it has reached the grace state.

## #612: Limited updates to frisbee/imagezip to increase image size and improve distribution speed
*Mike Hibler · 2020-12-09 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/612 · assigned to Mike Hibler*

The goal here would be to make it practical to use frisbee to distribute 10+GB images for Cloudlab3 and possibly Powder.
In some situations right now, you could probably tar, scp, untar a big filesystem of data faster than using frisbee to lay down an image. Some of the tasks:
* Support 64-bit block numbers in imagezip. Right now, you cannot image a filesystem larger than 2TB, no matter how little valid data it has in it. This is mostly an issue of maintaining backward compatibility and I have largely completed this step.
* Tune frisbee for something more modern than a 100Mb link. With a 10Gb server link, it would be reasonable to try to achieve 1Gb/s per image. That would allow us to distribute a 10GB image in something like 80-90 seconds. Ideally, we could crank it higher than that, say up to 10Gb/s. There are any number of problems here that might bleed into the imagezip image format. Look at increasing the chunksize (from 1MB) or blocksize (from 1KB) to get more data on the wire faster. We have a number of constants related to allowing multiple outstanding requests; they need to be adjusted or eliminated. Flow control has been a bugaboo for a long time; revisit it (again!), maybe considering taking advantage of a lossless Ethernet control network and ECN.
* Improve the frisbee server. We should multithread it so that it can pre-read image data, and it should cache that data. On our current servers, it would be reasonable to read and cache an entire 1GB-ish image.
* Improve unicast support. Right now, an image server cannot support more than a single unicast client at a time due to the naive way we support unicast. There are many cases where unicast distribution of images is reasonable. We still have situations today where a multicast client may take 20-30 seconds to hook up with the server, we could unicast the image 10x over in that time period!
Emphasis on **limited** here, since I have oodles of TODO files describing things that could/should be done to imagezip and frisbee.

## #611: Server-side OS update
*Mike Hibler · 2021-09-03 · https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/611 · assigned to Mike Hibler*

It is past time to move on from FreeBSD 11.3. The candidates are either 11.4 or the newly released 12.2. I would prefer the latter, since the former is probably the end of the road for 11.x. I will have to try an upgrade to the latter to see how much effort is involved.
However, the OS is less of an issue than moving on to a more current port set. In theory, python 2.7 is gone at the end of this year (see issue #599). Just trying to build server package sets with only python 3 was an adventure, never mind getting the Emulab software to build and work.
Another side effect of a new port set is that they have moved on to swig4, and there isn't even a swig3 port anymore. The only thing that broke immediately was the `abac` package, which doesn't seem to be supported anymore and was written to work with swig2.
I would dearly love to have this resolved soon before we start rolling out Powder MEs in quantity. Updating those later would be a *massive* pain in the ass.
The tasks I know of:
* [x] Create a python3-based package set.
* [x] Get Emulab server-side scripts working with python3.
* [x] Work out a conversion process from 11.x to (probably) 12.2.
* [x] Get `abac` (and others?) working with swig4.
* [ ] Fix clientside scripts to work with python3.