emulab-devel issues

emulab-devel issues https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues 2021-10-22T14:44:49-06:00 https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/648 Stop using the uuid for sharing profiles. 2021-10-22T14:44:49-06:00 Leigh Stoller

Stop using the uuid for sharing profiles.

This came up while I was in Utah. We should stop using the uuid of a profile for sharing it (when not public) since that makes it impossible to revoke. status:active Leigh Stoller Leigh Stoller https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/647 Include reservation justification text in admin email 2021-10-21T07:01:51-06:00 Kirk Webb

Include reservation justification text in admin email

Feature request: I'd like to see the user's justification text in the email notification messages for reservation requests. For example, this would have helped this past weekend when a user asked if the start time could be moved to sometime Sunday. I saw that there was a reservation request during the weekend, but didn't look at it until this (Monday) morning. Leigh Stoller Leigh Stoller https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/646 Deal with CA root certificate expiration fallout 2022-01-06T15:23:44-07:00 Mike Hibler

Deal with CA root certificate expiration fallout

On 09/30/2021 the root "DST Root CA X3" certificate expired. A new certificate ("ISRG Root X1") was in place well in advance, but OpenSSL 1.0.2 (and others) still try to chain through the old certificate. See [this blog post](https://www.openssl.org/blog/blog/2021/09/13/LetsEncryptRootCertExpire/). This affects not only our servers, but all standard and custom images as well.

Things we gotta do:

* [x] Fix all boss/ops/dbox/whatever nodes that need HTTPS service from anyone.
* [x] Make sure our client images going forward **do not** include the DST certificate and **do** include the replacement.
* [x] Add `slicefix` magic to fix up custom images based on our supported images (Ubuntu 16+, CentOS 7+, FreeBSD 11+).
* [ ] Have a plan for older images (instructions for how users can fix them?).

The fix is pretty straightforward for at least Ubuntu and FreeBSD: just remove the invalid certificate from the right places. I will note that Ubuntu 14 does not include the replacement certificate, so a fix is harder...if we choose to try to do something about older images.

cloudlab powder status:active Mike Hibler Mike Hibler https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/645 Making parameter sets more prominent 2021-09-29T12:06:43-06:00 Robert Ricci ricci@cs.utah.edu
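For the "remove the invalid certificate" step in issue 646 (CA root certificate expiration), here is a minimal sketch of filtering one certificate out of a PEM bundle by fingerprint. It is not the `slicefix` logic. The function names are hypothetical, the caller must supply the SHA-256 fingerprint of the certificate to drop (none is given in the issue), and the sketch assumes the bundle is plain concatenated PEM blocks. Any comments or other text between blocks are discarded.

```python
import base64
import hashlib
import re

# Matches one PEM certificate block and captures its base64 payload.
PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----\s*(.*?)\s*-----END CERTIFICATE-----",
    re.DOTALL,
)


def sha256_fingerprint(b64_body):
    """Hex SHA-256 digest of the DER bytes encoded in a PEM body."""
    der = base64.b64decode("".join(b64_body.split()))
    return hashlib.sha256(der).hexdigest()


def drop_certs(bundle_text, unwanted):
    """Return (filtered_bundle, removed_count).

    `unwanted` is a set of lowercase hex SHA-256 fingerprints (an assumption
    of this sketch; pick whatever identifier your tooling reports).
    """
    kept = []
    removed = 0
    for match in PEM_BLOCK.finditer(bundle_text):
        if sha256_fingerprint(match.group(1)) in unwanted:
            removed += 1
        else:
            kept.append(match.group(0))
    return "\n".join(kept) + ("\n" if kept else ""), removed
```

Where the bundle lives on each OS, and whether a rehash step is needed afterwards, differs between distributions and is not covered here.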

Making parameter sets more prominent

Some ideas to make parameter sets more prominent. Not sure we should do all, but a list for brainstorming, in rough order of the experiment instantiation process:

- On the first step of the instantiate page and the profile picker, indicate which profiles the user has a paramset for
- In the parameterize step, give them an option to load parameter sets for the profile (and maybe those made by others that they have used recently?)
- On the finalize page, give them a button to save a paramset (probably presented as some kind of save/share option) (of course, only for a parameterized profile)
- On the experiment status page, something similar. This should happen even for failed experiments, to enable a use case where, if there weren't enough nodes or whatever, you have an easy way back later to swap in quickly without re-filling everything.
- A 'recents' menu somewhere (in the main menu in the header?) that takes you right to the middle of the instantiate wizard, with the profile selected and the params filled out

cloudlab status:active Leigh Stoller Leigh Stoller https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/644 Public page listing public profiles 2021-10-21T07:03:53-06:00 Robert Ricci ricci@cs.utah.edu

Public page listing public profiles

If we remember correctly, we made it possible to view the profile page for public profiles even if you are not logged in. We should similarly make a page that's visible to non-logged-in people that lists all of the public profiles. Maybe with thumbnails of the topos (if we still generate those) and an abbreviated version of the short description. Sorted by all-time instantiations or recent instantiations. cloudlab status:active Leigh Stoller Leigh Stoller https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/643 Make profile sharing more prominent 2021-09-24T14:49:05-06:00 Robert Ricci ricci@cs.utah.edu

Make profile sharing more prominent

Though users can share profiles publicly in a few different ways, they seem less aware of this than we would like. I have a few ideas on how we might improve this.

* ![Screenshot_2021-09-24_14-15-34](/uploads/62ccb4f71c836c79f387a99b7109c560/Screenshot_2021-09-24_14-15-34.png) I don't think the Share button is as prominent as it could be - it's currently at the bottom, and the same color as other buttons. I think it could stand to be a more distinct color (the bootstrap success color (green)?). Also, when a profile has a long description, the button gets hidden off the bottom of the screen, so I think putting it in the box on the left would be better.
* If you don't own a profile, there is no clear indication that it *can* be made public. In the screenshot above, I'm looking at another project-member's profile, and notice that this is not suggested to me at all. If, for example, a student is making profiles, and a faculty member is taking care of releasing software, they may not even realize that it's possible to make it public. I'm not sure we want to allow project members to make each others' profiles public (though maybe project leads?), but maybe we could at least have a greyed-out 'make public' button with a tooltip explaining that the owner has to do it. Hmmm - actually now that I look at it, I don't get an indication when looking at my own profiles that I can make them public, without clicking "Edit" - which is different from how most Share UIs work these days.
* Speaking of how share UIs work, here is a comparison between ours and Google Docs. I would not say that I really like the way Google Docs does it, but it does have the advantage that you can change the settings right from the popover. It also does a concise job of explaining what the sharing options mean. ![Screenshot_2021-09-24_14-14-48](/uploads/7a52ca7633ef0b764a139c82d0ba0a51/Screenshot_2021-09-24_14-14-48.png) ![Screenshot_2021-09-24_14-17-50](/uploads/f35171ef753f340935d3adce34e22510/Screenshot_2021-09-24_14-17-50.png) ![Screenshot_2021-09-24_14-18-45](/uploads/388837bdc5c64311745a1bc8b492e650/Screenshot_2021-09-24_14-18-45.png)
* For some reason, people seem to think that they can't make profiles public if they use custom disk images. When they toggle public on something, maybe we can include a message to the effect that this makes their disk image public too.

cloudlab status:active Leigh Stoller Leigh Stoller https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/642 Unify path that nodes take to get into `hwdown` 2021-09-22T11:02:32-06:00 Mike Hibler

Unify path that nodes take to get into `hwdown`

Right now, depending on how nodes find their way into `hwdown`, they can be in different states.

For one, years ago I had added a sitevar `reload/hwdownaction` that could define what we do with a node when reload failed and we moved the node into `hwdown`: one of do nothing, reboot the node into the admin MFS, or power it off. But, as the name implies, this is only done if it is the reload daemon that puts the node in `hwdown`. If it gets there via the checknodes daemon, or via an explicit `sched_reserve` or `nalloc`, then nothing special is done.

For another, whether NFS filesystems should be available and mounted likewise depends on how the node gets into `hwdown`, or more accurately, whether `exports_setup` gets run on that path.

So, we should put some code in the `Node.pm` module or maybe just write an explicit script that will put a node into `hwdown`, taking care of all the magic necessary to ensure it is cleanly removed from wherever it is and put into a consistent state.

cloudlab status:active Mike Hibler Mike Hibler https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/641 Show/warn users about pending (unapproved) reservation requests 2021-10-21T18:05:31-06:00 Kirk Webb

Show/warn users about pending (unapproved) reservation requests

For scarce/single resources that are in demand, users can easily step on each other with reservation requests. This is because they have no visibility into pending requests. I propose that we show pending reservation requests on the "available resources" views, probably using a different color/marking to distinguish them. Additionally, we should email a warning to users that submit requests that overlap with existing unapproved requests. Finally, the "search" button on the reservation request page should take into account pending reservations when looking for an available window. Leigh Stoller Leigh Stoller https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/640 Fix clientside scripts to work with python3. 2021-09-03T13:54:54-06:00 Mike Hibler
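The overlap warning proposed in issue 641 (pending reservation requests) comes down to an interval-intersection test against unapproved requests. A minimal sketch follows. The `Request` record and both function names are hypothetical and do not reflect the actual reservation schema. It also assumes that any time overlap on the same node type is worth a warning, ignoring node counts and capacity.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Request:
    """Hypothetical reservation request record (not the real schema)."""
    user: str
    node_type: str
    start: datetime
    end: datetime
    approved: bool = False


def overlaps(a, b):
    """True when two requests for the same node type share any time.

    Intervals are treated as half-open [start, end), so a request that
    begins exactly when another ends does not count as an overlap.
    """
    return a.node_type == b.node_type and a.start < b.end and b.start < a.end


def pending_conflicts(new, existing):
    """Unapproved existing requests that overlap the new request."""
    return [r for r in existing if not r.approved and overlaps(new, r)]
```

The same predicate could feed both the email warning and the "search" button, with the search treating pending windows as busy.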

Fix clientside scripts to work with python3.

We have fixed up the server side (#611) along with a few of the client-side scripts that are used on `ops`, but we should finish the job. cloudlab status:inactive https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/639 Slow disk IO on Wisconsin boss and ops VMs 2022-12-23T08:43:56-07:00 Mike Hibler

Slow disk IO on Wisconsin boss and ops VMs

I don't even remember the details here. Spun this off from (#482) so that I can close that issue. cloudlab status:active Mike Hibler Mike Hibler https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/638 Reload Topology button broken 2021-09-13T05:02:53-06:00 Leigh Stoller

Reload Topology button broken

The reload topology button loses the nodes in the manifest somehow. Okay after page reload, so must be something in the javascript. cloudlab status:active Leigh Stoller Leigh Stoller https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/637 Repo Update button works once but then does nothing, need to page reload 2021-10-21T07:05:14-06:00 Leigh Stoller

Repo Update button works once but then does nothing, need to page reload

No big deal, but I am not home to put a postit on my monitor. cloudlab status:active Leigh Stoller Leigh Stoller https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/636 Mysql performance on boss is terrible! 2021-10-21T07:07:54-06:00 Leigh Stoller

Mysql performance on boss is terrible!

Since the 12.2 upgrade I have noticed a lot of very slow loading web pages. Looking at the mysql slow queries log there are an enormous number of ones like this, that should have been close to instant. It's bothering me enough that I am going to have to dig into it.

```
# Time: 2021-08-20T19:35:18.826539Z
# User@Host: skip-grants user[instantiate.php] @ localhost [] Id: 31469255
# Query_time: 5.324131 Lock_time: 0.002438 Rows_sent: 1152 Rows_examined: 29629
SET timestamp=1629488118;
select p.uuid,p.name,p.pid,v.creator,p.profileid,
  p.usecount,f.marked
  from apt_profiles as p
  left join apt_profile_versions as v on v.profileid=p.profileid and v.version=p.version
  left join group_membership as g on g.uid_idx='926619' and g.pid_idx=v.pid_idx and g.pid_idx=g.gid_idx
  left join apt_profile_favorites as f on f.profileid=p.profileid and f.uid_idx='926619'
  where locked is null and p.disabled=0 and v.disabled=0 and
    (p.public=1 or p.shared=1 or v.creator_idx='926619' or g.uid_idx is not null );
```

cloudlab status:active Leigh Stoller Leigh Stoller https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/635 `frisbeed` eating up CPU 2021-08-29T20:12:14-06:00 Mike Hibler
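To get a feel for how many slow-log entries look like the one quoted in issue 636 (Mysql performance on boss), a small script can pull the statistics line out of each entry. This is a rough sketch that assumes the `# Query_time: ... Lock_time: ... Rows_sent: ... Rows_examined: ...` layout shown in that excerpt. The function name and the one-second default threshold are made up for illustration.

```python
import re

# Matches the per-entry statistics line in the format shown in the issue.
STATS = re.compile(
    r"#\s*Query_time:\s*(?P<query_time>[\d.]+)\s+"
    r"Lock_time:\s*(?P<lock_time>[\d.]+)\s+"
    r"Rows_sent:\s*(?P<rows_sent>\d+)\s+"
    r"Rows_examined:\s*(?P<rows_examined>\d+)"
)


def slow_entries(log_text, threshold=1.0):
    """Return stats dicts for entries whose Query_time is >= threshold seconds."""
    entries = []
    for m in STATS.finditer(log_text):
        query_time = float(m.group("query_time"))
        if query_time >= threshold:
            entries.append({
                "query_time": query_time,
                "lock_time": float(m.group("lock_time")),
                "rows_sent": int(m.group("rows_sent")),
                "rows_examined": int(m.group("rows_examined")),
            })
    return entries
```

Sorting the result by `query_time`, or grouping by the script named in the `User@Host` line, would be natural next steps. Neither is shown here.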

`frisbeed` eating up CPU

New for FreeBSD 12.2! When `frisbeed` finishes serving clients, it eats up 100% of a CPU til it dies. Doing some mutex op repeatedly:

```
...
72175 frisbeed CALL _umtx_op(0x800289f10,UMTX_OP_WAIT_UINT_PRIVATE,0,0x18,0x7fffffffe488)
72175 frisbeed RET _umtx_op -1 errno 60 Operation timed out
72175 frisbeed CALL _umtx_op(0x800289f10,UMTX_OP_WAIT_UINT_PRIVATE,0,0x18,0x7fffffffe488)
72175 frisbeed RET _umtx_op -1 errno 60 Operation timed out
...
```

cloudlab status:active Mike Hibler Mike Hibler https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/634 FreeBSD: "older" fsck version is incompatible with "newer" versions of UFS 2021-08-11T10:34:07-06:00 Mike Hibler

FreeBSD: "older" fsck version is incompatible with "newer" versions of UFS

While we were setting up an elabinelab for testing the new firewall, we used prebuilt full-disk images of 12.2-based boss and ops nodes and put those down on a couple of d430s. Upon booting, we got all kinds of cylinder group checksum errors from both kernels. After many (many) bad theories we discovered that the FreeBSD 10 version of fsck in the MFS will fix up bad summary information, which is actually metadata in a FreeBSD 12 filesystem.

I only know for sure that this is a problem between FreeBSD 10 and 12; I haven't tracked this down to see when the incompatibility really happened. Hence the vague "older" and "newer" in the title.

This needs to be fixed in anything that deals with FreeBSD filesystems. Possibly this is as simple as switching to a FreeBSD 12 version of fsck--assuming that version does not screw up filesystems for older versions of FreeBSD. This will affect the FreeBSD and Linux based MFSes as well as the blockstore code in FreeBSD.

We were here once before, with FreeBSD 4 and 5 I think. Didn't really have a satisfactory fix in that case. I think. That was many generations of Mike ago though.

cloudlab status:active Mike Hibler Mike Hibler https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/633 Fix medusa segfaults 2021-09-03T13:42:56-06:00 Mike Hibler

Fix medusa segfaults

We run `medusa` on boss against experiment nodes to look for authentication issues in VNC. The version we use, 2.2 from 2016, is the latest released version but has core-dump issues. @stoller has tried tracking these down and so have I. The latest upgrade on boss and ops (to FreeBSD 12.2) seems to have exacerbated the problem and put it back on my radar. cloudlab status:active Mike Hibler Mike Hibler https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/632 Get our content out of the Plone wiki 2021-09-03T13:52:31-06:00 Mike Hibler

Get our content out of the Plone wiki

We have been keeping Plone on life support through a couple of boss/ops upgrades. After the latest, it is time to pull the plug since nobody wants to convert it to python3. So we need to get our content out of there and loaded somewhere else. (gitlab wiki?) If only I had remembered this *before* we converted ops... So now we have to move the current installation over to a machine with python2, either an elabinelab or else just move it somewhere like ops.utah.cloudlab.us. Then figure out a way to extract the useful content. Then figure out a way to get that content into something else in a reasonable form. I expect this falls on @hibler or @stoller. cloudlab status:inactive https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/631 Reservation time search can return reservations that don't work 2021-07-15T15:45:11-06:00 Robert Ricci ricci@cs.utah.edu

Reservation time search can return reservations that don't work

I heard from a user that they used the feature of the reservation request page where they put in a machine type and number of days to search for a start time, but when they clicked to request the reservation, they were told it didn't fit. I don't know the reason; e.g., it's possible an experiment swapped in or got extended between when the search ran and when they requested it, but it seems like it might be worth taking a look to make sure we don't have any obvious potential bugs. cloudlab Leigh Stoller Leigh Stoller https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/630 Get the 32-port 100Gb Barefoot switch up and running 2021-09-03T13:57:08-06:00 Mike Hibler

Get the 32-port 100Gb Barefoot switch up and running

Brent wants to use this, so we need to get it integrated in the testbed and wired up to some nodes. The short-term plan is to connect up 8 of the new c6525-100g nodes directly as they have a second, unused 100Gb port.

* [x] Rack the switch in V05
* [x] Wire up the nodes
* [ ] Apply a feature to the nodes so they can be specified in a profile (and possibly to dis-favor them for normal use)
* [ ] Add the switch to the DB and make the management interface accessible in a safe way

Down the road:

* [ ] Ability to reload the switch OS

cloudlab status:active https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/629 Replace the storage server at Clemson. 2023-06-09T08:43:20-06:00 Mike Hibler

Replace the storage server at Clemson.

Spinning this one off from #567 as well. That issue says:

```
Clemson: 50TB on one server, but in two zpools of 43 and 7TB. 5.8TB and 1.3TB free. Again a couple of possibilities. The first would be to commandeer another one of the first-gen storage node, giving us another 50TB. A more intriguing possibility would be to take over one of the dss7500 nodes with 45 HDDs and 270TB. That would solve the space issue for some time to come but would take one of only two of those machines. I only suggest it because those machines are almost never used right now, which is a waste.
```

This has come to the forefront because the smaller zpool of 1TB disks has a slowly failing disk that ZFS can deal with, but it takes long enough for it to retry an operation that the iSCSI client times out, leading to "disk errors" and a corrupted filesystem. `smartctl` confirms lots of corrected errors and that it is in the "pre-fail" state. Unfortunately, so is every other disk in that zpool, so I am not sure there is much point in replacing the one disk. We need to evacuate that pool.

Short term I am going to clear out dead datasets and move everything left on the small zpool to the larger one. That one has 4TB drives that are just as old, but seem to be holding up better. But it is time to seriously consider taking over one of the dss7500 nodes.

cloudlab status:active Mike Hibler Mike Hibler
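A quick back-of-the-envelope check of the short-term plan in issue 629, using only the numbers quoted from #567 (which may be stale and are rounded): the small pool holds roughly 7 - 1.3 = 5.7TB, while the large pool has 5.8TB free, so the move only barely fits before any dead datasets are cleared. The sketch below does that arithmetic in whole GB to avoid floating-point noise. The 1TB = 1000GB conversion and the names are assumptions for illustration.

```python
# Pool figures quoted in the issue, converted to GB (assuming 1TB = 1000GB).
POOLS_GB = {
    "large": {"size": 43_000, "free": 5_800},
    "small": {"size": 7_000, "free": 1_300},
}


def used_gb(pool):
    """Space currently occupied in a pool."""
    return pool["size"] - pool["free"]


def headroom_after_move_gb(src, dst):
    """Free space left in `dst` if everything in `src` were moved there."""
    return dst["free"] - used_gb(src)


headroom = headroom_after_move_gb(POOLS_GB["small"], POOLS_GB["large"])
print(f"small pool in use: {used_gb(POOLS_GB['small'])} GB")
print(f"headroom on large pool after move: {headroom} GB")
```

With the quoted figures the headroom comes out at about 100GB, which fits with clearing out dead datasets first and with taking the dss7500 option seriously.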