emulab issues
https://gitlab.flux.utah.edu/groups/emulab/-/issues

Issue #659: Fix up Apt node BIOSes (Mike Hibler, 2022-03-02)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/659

While attempting to collect hwinfo from all Apt nodes, I noticed two anomalies with the Apt (in particular `r320`) nodes:
* No setup password set
* The lifecycle controller is disabled; did we intentionally do that?
Because of the first, we should make a pass over the BIOSes and make sure they are configured as we expect.

Issue #658: Time synchronization at Cloudlab clusters (Mike Hibler, 2022-02-24)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/658

A recent question on the users list asked about time synchronization between the clusters, which got me thinking about this again.
All of the nodes at a cluster use a local NTP time server (`ntp1`), which by convention is `ops`. We also stash away the "drift" value from each node (via the watchdog) and use the latest saved value to initialize the drift file when a node is imaged. The various cluster NTP servers use a range of upstream servers and NTP pools, but are not directly peered with each other. We do seem to keep reasonable time between the cluster NTP servers at least, generally within 1-5ms.
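The save/restore step itself is tiny; here is a sketch of the restore side, with the file path and the saved value as illustrative stand-ins for whatever the client-side scripts and DB plumbing actually use:

```shell
# Seed the NTP drift file from the last value the watchdog stashed away.
# ntpd reads this file once at startup as a single frequency offset in ppm.
# Both variables are illustrative: the real path is typically
# /var/db/ntpd.drift on FreeBSD or /var/lib/ntp/ntp.drift on Linux,
# and the value would come from the testbed database.
DRIFT_FILE="${DRIFT_FILE:-${TMPDIR:-/tmp}/ntpd.drift}"
SAVED_DRIFT="${SAVED_DRIFT:--12.704}"   # ppm, hypothetical saved value
printf '%s\n' "$SAVED_DRIFT" > "$DRIFT_FILE"
echo "seeded $DRIFT_FILE with $SAVED_DRIFT ppm"
```

Seeding the file just saves ntpd the hours it can otherwise take to re-estimate the frequency error from scratch after imaging.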
Some questions:
* Is saving/restoring the drift value still a good thing to do?
* Should we be using PTP?
* Any chance of getting a GPS receiver at the main clusters?
* Should we use `chrony`, which is "aimed at ordinary computers, which are unstable, go into sleep mode or have intermittent connection to the Internet. chrony is also designed for virtual machines, a much more unstable environment"? I think current Ubuntu images already use it.
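For spot-checking how far a node is from its server, the offset column of the `ntpq -p` billboard is the relevant number. A sketch with canned output so the parsing is visible; in practice you would feed it live `ntpq -pn ntp1` output, and the peer line shown here is made up:

```shell
# Extract the offset (in ms) of the currently selected peer (the '*' line)
# from ntpq's billboard. Columns after the two header lines are:
# remote refid st t when poll reach delay offset jitter.
offset_ms=$(awk 'NR > 2 && $1 ~ /^\*/ { print $9 }' <<'EOF'
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ops.emulab.net  198.60.22.240    2 u   33   64  377    0.312    1.874   0.201
EOF
)
echo "offset to local server: ${offset_ms} ms"
```

Running the same check against each cluster's `ntp1` from one vantage point would give a rough picture of the 1-5ms spread mentioned above.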
At the very least, we should probably move the `ntp1` alias off of `ops`, which is a VM at all but Emulab, and onto the control node instead, where there would be a more stable clock.

Issue #657: Emulab storage servers acting flaky (Mike Hibler, 2022-02-08)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/657

During the recently completed storage upgrade of the storage box (#656), both of the SAS-attached storage servers exhibited flaky behavior. At one time or another, both rebooted suddenly, and during reboots (expected or not) both had a tendency to hang as the OS was coming up. Additionally, `dbox2` was showing some
```
Processor #0x2d Asserted IERR.
```
errors. I could find no documentation about this, but there were statements that this is likely an error detected by the processor and not an error with the processor itself.

Assigned: Mike Hibler

Issue #655: Configuration of special devices through adjacent nodes in an experiment (Kirk Webb, 2022-01-13)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/655

It would be useful to be able to configure (load images, etc.) special devices as part of experiment instantiation. Here, "special devices" means anything that doesn't come up under PXE boot control for setup. Examples include USRP X310 software defined radios, and previously we also had COTS "nano" eNodeB devices from ip.access that required configuration specific to each experiment.
One way to handle these devices is to have network-adjacent nodes in the experiment do the configuration work. As an example, X310 radios are almost always paired with a single compute node. This node could be identified during instantiation, and appropriate commands could be scheduled to run on it to load firmware images, set configuration parameters (if required), etc. Such proxying of device setup would allow us to specify disk images to load and other configuration steps from the profile script in a first-class manner.

Assigned: Leigh Stoller

Issue #654: Speed up the Emulab database (Mike Hibler, 2021-12-06)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/654

It is becoming increasingly clear, if it wasn't already, that the database is one of our primary bottlenecks for allowing instantiation of large numbers of experiments at once. The options are either to speed up our mysql setup/schema or to switch to a different DB.
For the former, we can further attempt to optimize our MyISAM tables:
* https://dev.mysql.com/doc/refman/5.7/en/optimizing-myisam.html
which is straightforward but will provide minimal payoff. We could switch to InnoDB, which supports better parallelism, but at a non-trivial conversion cost:
* https://dev.mysql.com/doc/refman/5.7/en/converting-tables-to-innodb.html
or we could try clustering or replication:
* https://dev.mysql.com/doc/refman/5.7/en/mysql-cluster.html
* https://dev.mysql.com/doc/refman/5.7/en/replication.html
but I am not sure those make sense in our environment, which doesn't need to scale _that_ far and has a very small infrastructure footprint (one or two servers).
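The InnoDB conversion itself is mechanical: find the MyISAM tables and rewrite them. A sketch that only generates the `ALTER` statements; the table names below are illustrative, and the `information_schema` query in the comment is how the real list would be pulled (assuming the usual `tbdb` database name):

```shell
# Generate ALTER statements to convert MyISAM tables to InnoDB.
# In practice the table list would come from something like:
#   mysql -N -e "SELECT table_name FROM information_schema.tables
#                WHERE table_schema = 'tbdb' AND engine = 'MyISAM'"
# The three table names here are made up for illustration.
sql=$(for t in nodes reserved experiments; do
  printf 'ALTER TABLE `%s` ENGINE=InnoDB;\n' "$t"
done)
printf '%s\n' "$sql"
```

The generated statements could then be reviewed and fed back to `mysql` during a maintenance window; each `ALTER ... ENGINE=InnoDB` rebuilds the table, so it is slow on big tables.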
Switching databases would be a lot more work, with no guarantee of better performance. MariaDB:
* https://mariadb.org/
is a fork of mysql and claims to be faster/better/stronger. It would probably be the easiest to transition to. PostgreSQL:
* https://www.postgresql.org/
is more featureful and better for very large DBs, but seems like overkill for us. The transition is likely to be extremely painful as well.

Issue #651: Reservations and VTypes problem (Leigh Stoller, 2021-10-28)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/651

I noticed today that using "powder-compute" (a global vtype) does not play well with the reservation pre checks in the mapper.

Assigned: Leigh Stoller

Issue #650: Update storage servers FreeNAS to the latest release (TrueNAS Core) (Mike Hibler, 2022-02-08)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/650

Our current storage servers are running the dead-end FreeNAS 11 (FreeBSD 11 based). We need to update them to TrueNAS Core version 12 (FreeBSD 12 based). The biggest hurdle is that they have done away with REST API v1.0 in favor of 2.0, which is considerably different.
The storage servers that need upgrading:
* [x] Emulab dbox1 and dbox2
* [ ] Cloudlab Utah dbox2
* [ ] Cloudlab Clemson dbox
* [ ] Cloudlab Wisconsin dbox

Assigned: Mike Hibler

Issue #649: Improve storage server disk usage (Mike Hibler, 2022-02-08)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/649

This comes from another ticket (#627) but is not specific to that storage server.
The general problem is that we run out of space well before what we think is the capacity of the zpool. I discovered that the ZFS zvol volume size (`-V` when creating) is a space limit from the perspective of the user, not a limit on how much space the zvol will consume. While knowing this at some level, I failed to appreciate just how much overhead ZFS can introduce on top of that, including metadata such as RAID parity and checksums, and fragmentation due to misalignment of the various block sizes. Apparently the recommendation for iSCSI on zvols is to not allocate more than 50% of the capacity. In our most extreme example, we have a 15TB dataset which is using 21TB of disk space, even though the overlaid Linux filesystem is only using 7TB. Potential improvements here:
* Use "thin" zvols. This really doesn't solve anything though; it just allows us to overbook storage, and probably everyone will run out of space in a truly ugly way down the road.
* Turn on compression for the zvols. This is one way of implementing the "don't allocate the full disk capacity" advice, but of course it depends on the nature of the data and doesn't address the metadata overhead.
* Switch to using iSCSI volumes on top of ZFS filesystem files. This is what a number of people say you should do, as apparently it has fewer "misalignment" problems and performs better overall.
* Turn on the "discard" option when we create ext4 filesystems, so that they TRIM. This will help when the overlaid filesystem has a lot of free space.
* Use something other than ZFS? This would be a lot of work to implement at this point.

Assigned: Mike Hibler

Issue #648: Stop using the uuid for sharing profiles (Leigh Stoller, 2021-10-22)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/648

This came up while I was in Utah. We should stop using the uuid of a profile for sharing it (when not public), since that makes it impossible to revoke.

Assigned: Leigh Stoller

Issue #645: Making parameter sets more prominent (Robert Ricci, 2021-09-29)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/645

Some ideas to make parameter sets more prominent. Not sure we should do all of these, but here is a list for brainstorming, in rough order of the experiment instantiation process:
- On the first step of the instantiate page and the profile picker, indicate which profiles the user has a paramset for
- In the parameterize step, give them an option to load parameter sets for the profile (and maybe those made by others that they have used recently?)
- On the finalize page, give them a button to save a paramset (probably presented as some kind of save/share option) (of course, only for a parameterized profile)
- On the experiment status page, something similar. This should happen even for failed experiments, to enable a use case where, if there weren't enough nodes or whatever, you have an easy way back later to swap in quickly without re-filling everything.
- A 'recents' menu somewhere (in the main menu in the header?) that takes you right to the middle of the instantiate wizard, with the profile selected and the params filled out

Assigned: Leigh Stoller

Issue #643: Make profile sharing more prominent (Robert Ricci, 2021-09-24)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/643

Though users can share profiles publicly in a few different ways, they seem less aware of this than we would like. I have a few ideas on how we might improve this.
* ![Screenshot_2021-09-24_14-15-34](/uploads/62ccb4f71c836c79f387a99b7109c560/Screenshot_2021-09-24_14-15-34.png) I don't think the Share button is as prominent as it could be - it's currently at the bottom, and the same color as other buttons. I think it could stand to be a more distinct color (the bootstrap success color (green)?). Also, when a profile has a long description, it gets hidden off the bottom of the screen, so I think putting it in the box on the left would be better.
* If you don't own a profile, there is no clear indication that it *can* be made public. In the screenshot above, I'm looking at another project member's profile, and notice that this is not suggested to me at all. If, for example, a student is making profiles and a faculty member is taking care of releasing software, they may not even realize that it's possible to make them public. I'm not sure we want to allow project members to make each other's profiles public (though maybe project leads?), but maybe we could at least have a greyed-out 'make public' button with a tooltip explaining that the owner has to do it. Hmmm - actually, now that I look at it, I don't get an indication when looking at my own profiles that I can make them public without clicking "Edit" - which is different from how most share UIs work these days.
* Speaking of how share UIs work, here is a comparison between ours and Google Docs. I would not say that I really like the way Google Docs does it, but it does have the advantage that you can change the settings right from the popover. It also does a concise job of explaining what the sharing options mean.
![Screenshot_2021-09-24_14-14-48](/uploads/7a52ca7633ef0b764a139c82d0ba0a51/Screenshot_2021-09-24_14-14-48.png)
![Screenshot_2021-09-24_14-17-50](/uploads/f35171ef753f340935d3adce34e22510/Screenshot_2021-09-24_14-17-50.png)
![Screenshot_2021-09-24_14-18-45](/uploads/388837bdc5c64311745a1bc8b492e650/Screenshot_2021-09-24_14-18-45.png)
* For some reason, people seem to think that they can't make profiles public if they use custom disk images. When they toggle public on something, maybe we can include a message to the effect that this makes their disk image public too.

Assigned: Leigh Stoller

Issue #642: Unify path that nodes take to get into `hwdown` (Mike Hibler, 2021-09-22)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/642

Right now, depending on how nodes find their way into `hwdown`, they can be in different states.
For one, years ago I added a sitevar, `reload/hwdownaction`, that defines what we do with a node when a reload fails and we move it into `hwdown`: one of do nothing, reboot the node into the admin MFS, or power it off. But, as the name implies, this is only done if it is the reload daemon that puts the node in `hwdown`. If it gets there via the checknodes daemon, or via an explicit `sched_reserve` or `nalloc`, then nothing special is done.
For another, whether NFS filesystems should be available and mounted likewise depends on how the node gets into `hwdown`, or more accurately, whether `exports_setup` gets run on that path.
So, we should put some code in the `Node.pm` module or maybe just write an explicit script that will put a node into `hwdown`, taking care of all the magic necessary to ensure it is cleanly removed from wherever it is and put into a consistent state.

Assigned: Mike Hibler

Issue #641: Show/warn users about pending (unapproved) reservation requests (Kirk Webb, 2021-10-21)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/641

For scarce/single resources that are in demand, users can easily step on each other with reservation requests. This is because they have no visibility into pending requests. I propose that we show pending reservation requests on the "available resources" views, probably using a different color/marking to distinguish them. Additionally, we should email a warning to users that submit requests that overlap with existing unapproved requests.
Finally, the "search" button on the reservation request page should take pending reservations into account when looking for an available window.

Assigned: Leigh Stoller

Issue #640: Fix clientside scripts to work with python3 (Mike Hibler, 2021-09-03)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/640

We have fixed up the server side (#611), along with a few of the client-side scripts that are used on `ops`, but we should finish the job.

Issue #638: Reload Topology button broken (Leigh Stoller, 2021-09-13)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/638

The reload topology button loses the nodes in the manifest somehow. It is okay after a page reload, so it must be something in the javascript.

Assigned: Leigh Stoller

Issue #636: Mysql performance on boss is terrible! (Leigh Stoller, 2021-10-21)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/636

Since the 12.2 upgrade I have noticed a lot of very slow-loading web pages. Looking at the mysql slow-query log, there are an enormous number of queries like this one, which should have been close to instant.
It's bothering me enough that I am going to have to dig into it.
```
# Time: 2021-08-20T19:35:18.826539Z
# User@Host: skip-grants user[instantiate.php] @ localhost [] Id: 31469255
# Query_time: 5.324131 Lock_time: 0.002438 Rows_sent: 1152 Rows_examined: 29629
SET timestamp=1629488118;
select p.uuid,p.name,p.pid,v.creator,p.profileid, p.usecount,f.marked from apt_profiles as p left join apt_profile_versions as v on v.profileid=p.profileid and v.version=p.version left join group_membership as g on g.uid_idx='926619' and g.pid_idx=v.pid_idx and g.pid_idx=g.gid_idx left join apt_profile_favorites as f on f.profileid=p.profileid and f.uid_idx='926619' where locked is null and p.disabled=0 and v.disabled=0 and (p.public=1 or p.shared=1 or v.creator_idx='926619' or g.uid_idx is not null );
```

Assigned: Leigh Stoller

Issue #634: FreeBSD: "older" fsck version is incompatible with "newer" versions of UFS (Mike Hibler, 2021-08-11)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/634

While we were setting up an elabinelab for testing the new firewall, we used prebuilt full-disk images of 12.2-based boss and ops nodes and put those down on a couple of d430s. Upon booting, we got all kinds of cylinder group checksum errors from both kernels. After many (many) bad theories, we discovered that the FreeBSD 10 version of fsck in the MFS will fix up bad summary information, which is actually metadata in a FreeBSD 12 filesystem.
I only know for sure that this is a problem between FreeBSD 10 and 12; I haven't tracked down when the incompatibility actually happened. Hence the vague "older" and "newer" in the title.
This needs to be fixed in anything that deals with FreeBSD filesystems. Possibly it is as simple as switching to a FreeBSD 12 version of fsck, assuming that version does not screw up filesystems for older versions of FreeBSD. This will affect the FreeBSD- and Linux-based MFSes as well as the blockstore code in FreeBSD.
We were here once before, with FreeBSD 4 and 5 I think. We didn't really have a satisfactory fix in that case either, I think. That was many generations of Mike ago, though.

Assigned: Mike Hibler

Issue #633: Fix medusa segfaults (Mike Hibler, 2021-09-03)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/633

We run `medusa` on boss against experiment nodes to look for authentication issues in VNC. The version we use, 2.2 from 2016, is the latest released version but has core-dump issues. @stoller has tried tracking these down and so have I. The latest upgrade on boss and ops (to FreeBSD 12.2) seems to have exacerbated the problem and put it back on my radar.

Assigned: Mike Hibler

Issue #632: Get our content out of the Plone wiki (Mike Hibler, 2021-09-03)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/632

We have been keeping Plone on life support through a couple of boss/ops upgrades. After the latest, it is time to pull the plug, since nobody wants to convert it to python3. So we need to get our content out of there and loaded somewhere else (gitlab wiki?). If only I had remembered this *before* we converted ops...
So now we have to move the current installation over to a machine with python2: either an elabinelab, or else just somewhere like ops.utah.cloudlab.us. Then we need to figure out a way to extract the useful content, and a way to get that content into something else in a reasonable form.
I expect this falls on @hibler or @stoller.

Issue #630: Get the 32-port 100Gb Barefoot switch up and running (Mike Hibler, 2021-09-03)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/630

Brent wants to use this, so we need to get it integrated in the testbed and wired up to some nodes.
The short-term plan is to connect up 8 of the new c6525-100g nodes directly as they have a second, unused 100Gb port.
* [x] Rack the switch in V05
* [x] Wire up the nodes
* [ ] Apply a feature to the nodes so they can be specified in a profile (and possibly to dis-favor them for normal use)
* [ ] Add the switch to the DB and make the management interface accessible in a safe way
Down the road:
* [ ] Ability to reload the switch OS