emulab-devel issues
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/214
Simple interfaces to blockstores. (Mike Hibler, 2018-05-25)

We want to make common storage use cases easy to configure, particularly via Jacks.
First we need to identify those common use cases, e.g., "scratch" space or "a shared filesystem". Then we can propose interfaces and ensure that the underlying blockstore mechanisms exist.

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/327
Color code based on adjusted free values (Jonathon Duerig, 2017-11-06)

Change color coding in the cluster picker to follow the adjusted free values that don't count reserved nodes as fully 'free'. Consult with @gtw to see where this adjusted information is coming in via the XMLRPC call.

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/329
Fix Image import at Moonshot cluster (Leigh Stoller, 2018-01-11)

If the imported image does not include an architecture definition in the metadata, we do not know which node types or architecture to assign locally. Normally not a problem, we just assign all local node types, which works everywhere except Moonshot.
We had a bunch of imported images marked to run on m400 and m510. It was easy to fix all the images up and update the image server, since in our universe, all imported images are by definition x86 images. But we need to do two things:
1) Make sure we set the arch for all node types and images on our clusters.
2) Do something for when we just do not get an architecture.

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/367
Start a ZFS snapshot schedule on other cluster `fs` nodes (Mike Hibler, 2020-01-27; assigned to Dan Reading)

In lieu of actual backups, we should use `znapzend` to create and maintain a set of ZFS snapshots for /proj and /users.
On the mothership we are keeping: daily snapshots for a week, weekly snapshots for a month.
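That schedule can be expressed as a znapzend plan (format is `retention=>interval`). A sketch only; the pool/dataset name `z/proj` is made up, and flags should be checked against the znapzendzetup man page:

```shell
# Keep daily snapshots for 7 days and weekly snapshots for 30 days on
# the dataset backing /proj. The dataset name (z/proj) is hypothetical.
znapzendzetup create --recursive --tsformat='%Y-%m-%d-%H%M%S' \
    SRC '7d=>1d,30d=>7d' z/proj
```

znapzend stores the plan in ZFS properties on the dataset, so the schedule survives reboots once the znapzend daemon is running.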

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/370
Jacks slowness (?) (David Johnson, 2018-01-25; assigned to Jonathon Duerig)

I am encountering more Jacks slowness when I attempt to instantiate https://www.cloudlab.us/show-profile.php?uuid=6a1598a1-cef5-11e7-b179-90e2ba22fee4 when I change the number of `VBG VMs per host` to 20; `Physical Hosts` to 10; and set `Number VM hosts plugged into each VBG VM` to 0 instead of 1.
Seems like it takes 4-5 minutes on my lightly-loaded, 32GB RAM desktop to get to a fully-rendered Finalize frame in the wizard. I didn't try to separate out the cost of the constraint checker vs. the renderer. Then there is a long delay on the status page when transitioning from the 'provisioning' state to the 'booting' state, even once CreateSliver is obviously a long way down the road. @stoller suspects that part of that delay is rendering the Topology View.
I need some way around this in the next couple or three weeks, even if we can't look at the root cause prior to that. My ideas are things like a profile parameter/metadata bit, UI option, or instantiate URL param to disable Jacks rendering. Of course, if we allow disabling of Jacks rendering, then we need to ensure that the same Actions can be performed from the Node List tab as can be done from the Topology View tab. I wonder, would it also be easy to revert to the legacy Emulab renderer for large experiments, or when Jacks rendering has been disabled? Presumably we generate the classic experiment picture at the CM, and would just have to get it back to the portal and dump it into the Topology View tab. Anyway, maybe something like that could be a stopgap?

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/381
Come up with a current, comprehensive document for using storage in Emulab/CloudLab (Mike Hibler, 2020-07-03)

We have assorted wiki/FAQ pages about how to acquire and use storage in Emulab and CloudLab, but it is pretty scattered and mostly out of date.
* https://wiki.emulab.net/wiki/kb28
  Provides pointers to several others, but also talks about RON and /scratch (remember those?)
* https://wiki.emulab.net/wiki/kb81
  Talks about the NFS structure (for Emulab) and some dos and don'ts.
* https://wiki.emulab.net/wiki/kb55
  Low-level instructions for creating an extra FS (vintage RHL9 and FBSD410).
* https://wiki.emulab.net/Emulab/wiki/Tutorial#CustomOS
  Original tutorial text on creating custom images.
* https://wiki.emulab.net/wiki/EmulabStorage
  @kwebb's original document about the blockstore system.
* https://wiki.emulab.net/wiki/kb28b
  @hibler's most recent guidance on what storage to use when. Probably the closest to what I have in mind, but too high level.
We need to answer questions about specific scenarios and provide pointers to profiles that do it. Examples:
* I need more local disk space, how do I get it?
* I need to log or output data at a high rate, where do I put it?
* I need a lot of persistent storage, how do I do it?
* I need a lot of persistent storage shared between my experiment nodes, how do I do it?
* Can I set up my own NFS server? HDFS? Ceph setup?
* I have a bunch of VM images that I need in my experiment, where do I put them?
* etc.

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/397
Hardware infrastructure upgrades for CloudLab (Mike Hibler, 2018-11-01)

A few semi-related things that need to be done:
* Subboss for node imaging (frisbee). There are occasional protracted delays when starting up frisbee sessions. Not a huge problem, but if we can get a dedicated 10Gb link directly on the node control net, it will help. I have a dual-port X520 card, and we can probably use one of the old 1U HP boxes or a 2U Dell box for this purpose. Just need to make sure it has a PCIe 2.0 (?) slot. The ulterior motive for a subboss box is...
* Provide a machine to hook the USB based serial port hub (for user allocatable switch consoles) to. The hub works fine under FreeBSD (11.1) which is what the subboss will be. See #382, last few comments.
* Get dedicated 10Gb links into the boss and ops VMs. Right now they (along with the control node) share a single 10Gb link. The server has a dual-port 10Gb card, and we can use a spare 1Gb interface for the control interface for the control node itself. This is the way the mothership control node is set up.
* Get the filesave box hooked up and running. This box is intended primarily as an offsite backup for Emulab/Flux, but we have designed things such that it will also back up the critical CloudLab/Apt servers. The box is set up, we just need to get it down there and running. This will hook in via 1-2 10Gb interfaces.
* Scare up enough 10Gb ports on `bighp1` to do all this. We currently have two broken out 10Gb ports. The layout:
+ 2/0/19:1: currently control node, use for passthrough to boss (requires re-config of boss VM, need to have control node 1Gb and ops 10Gb broken out first)
+ 2/0/19:2: currently unused, use for passthrough to ops (can be setup anytime, requires reconfig of ops VM)
+ 2/0/19:3: currently unused, use for ms-subboss (can be setup anytime)
+ 2/0/19:4: currently unused, use for ms-procurve (maybe issues here since this link is currently fiber, have to see if Dell DAC cable will work on Procurve switch, I have reason to believe it will not--test this first)
+ 2/0/20:1: currently the fiber uplink to ms-procurve, move to 2/0/19:4 and use this port for filesave box.
+ 2/0/20:2: currently unused, second link to filesave box?
+ 2/0/20:3: currently unused, future expansion
We only have one 3m breakout cable right now; we will have to order a second. I am hoping that bighp1 will be okay with the Dell DAC cable. We do have 1m HP (Mellanox?) breakouts. Those are too short to use here, but maybe we could swap one of the 3m netscout<->node breakouts for a one meter for nearby nodes.

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/402
Integrate CONFIRM into portal (Robert Ricci, 2019-08-29; assigned to Dmitry Duplyakin)

The first step is to identify places in the portal (and maybe documentation) where we should link to CONFIRM.
Some possibilities:
* Link on the experiment status page to CONFIRM with the experiment's nodes pre-selected
* Link from some node type or status pages
* Link in the hardware chapter of the manual

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/473
CloudLab hardware page is out of date (Eric Eide, 2020-03-26; assigned to Robert Ricci)

The online description of CloudLab's hardware resources is out of date: https://cloudlab.us/hardware.php
It describes the Phase II hardware as "the remainder will be built in about a year."
It refers to CloudLab in the future tense: "CloudLab will be distributed infrastructure...", "CloudLab will be federated with a wealth of existing research infrastructure..."

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/507
Wonky port counters on Utah Cloudlab hp090 and hp091 (Mike Hibler, 2022-06-27)

We get frequent "excessive traffic" reports for hp090 and hp091 w.r.t. transmitted packets, e.g.:
```
Node:port Expt Pkts/sec Mb/sec When
hp091:eth2 cops-PG0/tapir 6732659 6 2019-07-02 08:59:48 for 638 sec
```
almost 7M pps at 6 Mb/s for 10 minutes. This would indicate that each packet was, on average, less than one bit long. There appears to be some discrepancy among the packet counters on the Dell S4048-ON control switch, whether read via the CLI or SNMP. For example, hp091's switch port shows:
```
Input Statistics:
1536874051 packets, 162913777286 bytes
1242179 64-byte pkts, 1512599412 over 64-byte pkts, 4230559 over 127-byte pkts
587130 over 255-byte pkts, 2005787 over 511-byte pkts, 16208984 over 1023-byte pkts
198618 Multicasts, 10872 Broadcasts, 289299473137 Unicasts
0 runts, 0 giants, 256 throttles
0 CRC, 0 overrun, 0 discarded
```
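Spelling out the arithmetic behind both oddities, using the numbers from the report and the counter dump above:

```shell
# From the excessive-traffic report: 6 Mb/sec at 6732659 pkts/sec.
awk 'BEGIN { printf "%.2f bits/packet\n", 6e6 / 6732659 }'           # 0.89

# From the Input Statistics: total packets vs. the Unicasts counter.
awk 'BEGIN { printf "%.0fx unicasts vs total\n",
             289299473137 / 1536874051 }'                            # 188x
```

Under one bit per packet is physically impossible, so at least one of the two counters feeding the report must be wrong.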
Note that the first line shows 1536874051 total packets and yet the unicast packet count is almost 200x larger. It is those individual counters that we collect with `portstats` and the control net monitor program.

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/540
More comprehensive disk cleaning (Mike Hibler, 2023-09-22)

As we (mostly Powder) pick up more industrial (non-academic) users, zeroing of disks between experiments will probably become more important. We have a path for doing this, and @kwebb uses this in his "tainting" of experiment nodes (originally for PhantomNet), but it is an expensive operation right now because it requires frisbee to write zeros to all free blocks.
There are other things we can do:
* Use the "block erase" support that many (most? all?) SSDs and NVMe devices provide. We use this currently in conjunction with our TRIM support and, at least for the devices we have, it requires less than a minute to erase up to 500GB devices. Presumably this is because it only marks all the blocks for erasure and does the work in the background.
* Make use of SEDs (self-encrypting disks). We have talked about this since NCR days, but if you just change the encryption key for disks between experiments, you have effectively erased the old content. I think we have some of these disks, and there is FreeBSD/Linux support for manipulating these.
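For reference, the fast-erase paths look roughly like this on Linux. A sketch only: device names are placeholders, and every command here destroys data:

```shell
# DESTRUCTIVE -- device names below are placeholders.

# Discard every block on a device that supports TRIM/deallocate:
blkdiscard /dev/sdX

# NVMe secure format: --ses=1 erases user data; --ses=2 requests a
# cryptographic erase (the SED-style key-change path) on drives that
# support it:
nvme format /dev/nvme0n1 --ses=1
```

Both finish in seconds to minutes regardless of device size, since the work happens inside the drive rather than over the bus.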
Another consideration is whether we erase *all* disks between experiments. That could be really, really painful on those Clemson nodes with 40+ HDs...

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/542
Portal based policies (Leigh Stoller, 2020-04-24)

We need a way to restrict node/types on a portal basis. We have kicked around ideas
like extending the group_policy tables or adding a portal_policies table. @hibler is
super interested in this ticket.

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/547
Trying to use emulab-xen sliver type fails at Cloudlab Utah cause of image aliases (Leigh Stoller, 2020-05-04)

Using /tmp/stitcher.hWZoPG for stitcher
Stitcher command: /usr/testbed/gcf/src/stitcher.py --fileDir /tmp/stitcher.hWZoPG --cred /tmp/stitcher.hWZoPG/speaksforcred.xml --slicecredfile /tmp/stitcher.hWZoPG/slicecred.xml --usercredfile /tmp/stitcher.hWZoPG/slicecred.xml --al2scredfile /tmp/stitcher.hWZoPG/al2scred.xml --debug --GetVersionCacheName=/tmp/stitcher.hWZoPG/get_version_cache.json --AggNickCacheName=/tmp/stitcher.hWZoPG/agg_nick_cache --scsURL http://scs.scs.scs.emulab.net:8081/geni/xmlrpc --speaksfor urn:publicid:IDN+emulab.net+user+thedeu2e -V3 allocate urn:publicid:IDN+emulab.net:sdnnfvlab+slice+attempt5 /tmp/stitcher.hWZoPG/rspec.xml
Allocation of slivers in slice urn:publicid:IDN+emulab.net:sdnnfvlab+slice+attempt5 at utah-clab3 failed: Error from Aggregate: code 2. protogeni AM code: 28: *** WARNING: mapper:
*** nodejailosid: Could not map [ImageAlias emulab-ops,UBUNTU16-64-STD
*** 123456] on [vnode:Relay-node]
*** ERROR: mapper:
*** Can't call method "osid" on an undefined value at
*** /usr/testbed/lib/libvtop_test.pm line 2510.
(PG log url - look here for details on any failures: https://www.utah.cloudlab.us/spewlogfile.php3?logfile=f754c12071ae77e24945263997708068)

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/553
Nodecheck bugs (Dan Reading, 2020-05-22)

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/556
Make a tipserv machine for IPMI-based consoles (Mike Hibler, 2023-12-17)

Since we started using SOL for node consoles, we have been running captures on the boss node since it has access to the management network. In the case of the mothership boss, and even more so the Utah cloudlab boss, that is on the order of 500-1000 capture instances.
There is no reason that capture has to run on boss for these (well, unless we cut some corners in the IPMI capture and assumed we could access the DB directly for info). We could have a separate node (VM?) handle this, it just has to be on the private segment of the control network. The question is whether the trade-off of load vs. convenience of access (e.g., to logfiles) is worth it.

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/560
Ensure datasets are not busy when taking a snapshot (Mike Hibler, 2020-07-03)

This is a very common failure mode for image-backed datasets:
```
About to: '/usr/testbed/bin/sshtb -n -host c220g5-111012 /usr/local/bin/create-versioned-image METHOD=frisbee SERVER=128.104.222.9 IMAGENAME=praxis-PG0/bench-setup:0 BSNAME=bs IZOPTS=N' as uid 0
c220g5-111012: started image capture for '/.amd_mnt/ops.wisc.cloudlab.us/proj/praxis-PG0/images/bench-setup/bench-setup.ndz.tmp', waiting up to 90 minutes total or 8 minutes idle.
umount: /benchdata: target is busy.
Could not unmount /dev/mapper/emulab-bs!
Could not parse all arguments
FAILED: Returned error code 2 generating image ...
```
We want to unmount the filesystem to get a consistent snapshot of the filesystem, but the user has a process active on the dataset at that time.
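Finding out who is holding the mount busy boils down to asking which PIDs have their cwd or an open fd under the mountpoint. `fuser -vm /benchdata` is the standard tool; the logic it implements can be sketched as a /proc scan (Linux only):

```shell
# procs_using MOUNTPOINT: print PIDs of processes whose cwd or an open
# fd points under MOUNTPOINT. A sketch of what `fuser -m` does;
# entries we cannot read (other users' processes) are silently skipped.
procs_using() {
    mp=$1
    for p in /proc/[0-9]*; do
        for link in "$p/cwd" "$p"/fd/*; do
            case $(readlink "$link" 2>/dev/null) in
                "$mp"|"$mp"/*) echo "${p#/proc/}"; break ;;
            esac
        done
    done
}
```

From there, `fuser -km MOUNTPOINT` is the heavy hammer (SIGKILL everything found), and `umount -l` is the lazy-detach alternative that sidesteps the hunt entirely.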
Things to look at:
* attempting to locate all such processes and killing them
* doing a forcible unmount
* shutting down the machine to single-user
* identifying the situation in advance and refusing to snapshot

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/567
More storage for CloudLab storage servers (Mike Hibler, 2022-10-06)

After spending quite a lot of time scrambling to come up with 10TB on one of our blockstore servers, I came to the conclusion that we are "under-storaged" if we are serious about the blockstore mechanism and allocation of 10+TB datasets. The situation:
* Utah: 43.5TB in one server, **6.7TB free**. We have a couple of options for more space here. One is the DriveScale chassis, which have over 100TB between them but need more work to integrate with the model. The other is to wire up the Apt half of the Dell storage box. This would add a second server with 36TB. Requires running a 40Gb link from the Apt side of the pod over to `bighp1`.
* Clemson: 50TB on one server, but in two zpools of 43 and 7TB. **5.8TB and 1.3TB free**. Again, a couple of possibilities. The first would be to commandeer another one of the first-gen storage nodes, giving us another 50TB. A more intriguing possibility would be to take over one of the `dss7500` nodes with 45 HDDs and 270TB. That would solve the space issue for some time to come but would take one of only two of those machines. I only suggest it because those machines are almost never used right now, which is a waste.
* Wisc: 32TB on one server. **10TB free**. The only option for more space here would be to take over another one or two `c240g1` nodes, the only ones with significant storage. That would get another 32TB per server.
* UMass: no storage servers, no current plan.
* OneLab: no storage servers, no current plan.
* Emulab: 92TB on two storage servers. Each server has 40TB for persistent and 6TB for ephemeral. There is about **9.7TB free on each**.
* Apt: 36TB in one server, all available, have been using for testing. The intent was to move this storage server + disk to CloudLab Utah.

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/578
Per experiment, cross aggregate root ssh (Leigh Stoller, 2020-08-17)

Regarding experiment wide ssh, we currently generate a per-aggregate root ssh key pair and optionally push that out to all nodes in an experiment, but it is a different key pair for each aggregate in the experiment.
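For reference, getting an ssh public key out of an existing RSA private key (such as a per-experiment ssl key) is mechanically a one-liner, since `ssh-keygen -y` reads PEM private keys directly. A sketch with throwaway file names:

```shell
# Generate a stand-in for the per-experiment ssl private key, then
# derive the ssh-format public half from it. File names are made up.
openssl genrsa -out /tmp/expt_key.pem 2048 2>/dev/null
chmod 600 /tmp/expt_key.pem

# ssh-keygen -y prints the corresponding public key:
ssh-keygen -y -f /tmp/expt_key.pem > /tmp/expt_key.pub
cat /tmp/expt_key.pub
```

So deriving the same ssh key pair at every aggregate only requires that each aggregate hold the same private key, which the portal already ships to them.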
Instead, we should derive an ssh key from the ssl private key that we generate for every portal experiment and send over to the aggregates. Then all the nodes across all aggregates in the experiment would be able to root ssh to each other.

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/595
mkextrafs failure on second 1TB disk (Mike Hibler, 2020-10-06)

The command:
```
sudo /usr/testbed/bin/mkextrafs -s 1 -r /dev/sdb /mnt1
```
on our CentOS 7 image creates a filesystem that is only 27GB instead of 900+GB.
The culprit seems to be the line:
```
echo '2048,1953525168' | sfdisk --force /dev/sdb
```
which should correctly create the 900+GB partition 1, but instead says:
```
[root@node1 ~]# echo '2048,1953525168' | sfdisk --force /dev/sdb
Checking that no-one is using this disk right now ...
OK
Disk /dev/sdb: 121601 cylinders, 255 heads, 63 sectors/track
Old situation:
Units: cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0
Device Boot Start End #cyls #blocks Id System
/dev/sdb1 2048 5520- 3473- 27896024 83 Linux
/dev/sdb2 0 - 0 0 0 Empty
/dev/sdb3 0 - 0 0 0 Empty
/dev/sdb4 0 - 0 0 83 Linux
Warning: given size (1953525168) exceeds max allowable size (119553)
New situation:
Units: cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0
Device Boot Start End #cyls #blocks Id System
/dev/sdb1 2048 1953527215 1953525168 15691690911960 83 Linux
/dev/sdb2 0 - 0 0 0 Empty
/dev/sdb3 0 - 0 0 0 Empty
/dev/sdb4 0 - 0 0 0 Empty
Warning: partition 1 extends past end of disk
Successfully wrote the new partition table
Re-reading the partition table ...
If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes: dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
```
and then produced the tiny partition instead, which appears to be some sort of default.
Note that this is only a 1TB drive and should not trigger the DOS partition 2TB problem. But I would not doubt that it is related.
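The "Units: cylinders of 8225280 bytes" lines in the output are a plausible tell: old sfdisk (pre-util-linux 2.26, as on CentOS 7) defaults to cylinder units, with `-uS` being the usual flag to force sectors. Treating 1953525168 as a cylinder count reproduces the bogus block count in the output exactly (a quick consistency check, not a verified root cause):

```shell
# sfdisk said: "Units: cylinders of 8225280 bytes, blocks of 1024 bytes"
# (8225280 = 255 heads * 63 sectors * 512 bytes).
# If 1953525168 -- which we meant as 512-byte sectors -- is read as
# CYLINDERS, the resulting size in 1024-byte blocks is:
echo $(( 1953525168 * 8225280 / 1024 ))    # 15691690911960, the bogus #blocks

# The disk really is only ~1TB:
echo $(( 1953525168 * 512 / 1000000000 ))  # 1000 (GB)
```

If that is the cause, the fix would be passing the size in explicit sector units rather than adjusting the arithmetic.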
Note also that we parse out the disk size from fdisk (1953525168) and then want to use that size, offset by 2048, to create the new partition. Seems like we should be subtracting 2048 from the size before we do that. I tried that manually, but it did not seem to affect the outcome. But maybe there is some rounding going on that caused it to still be too big.

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/598
IG Event daemon stops delivering events (Leigh Stoller, 2021-10-21)

After a long time running, the IG event daemon will stop posting events back to the Mothership. Typically happens after a period of instability in the network, and it will be able to reconnect most of the time. But sometimes it just wigs out.
I've tried to track it down, but no success.
I think a simple workaround is to wrap it with daemon_wrapper, and when it gets itself into this state, just exit and let the wrapper restart it. Well, perhaps a bit more complicated, but something along these lines.
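Something along these lines, as a sketch. This is not the real daemon_wrapper; the function, its knobs, and the daemon path in the comment are all made up for illustration:

```shell
# run_with_restart CMD [ARGS...]: run CMD and restart it whenever it
# exits. MAX_RESTARTS=0 means restart forever; RESTART_DELAY throttles
# the respawn loop so a crash-looping daemon does not spin.
run_with_restart() {
    n=0
    while :; do
        "$@" || echo "daemon exited with status $?; restarting" >&2
        n=$((n + 1))
        if [ "${MAX_RESTARTS:-0}" -gt 0 ] && [ "$n" -ge "${MAX_RESTARTS:-0}" ]; then
            break
        fi
        sleep "${RESTART_DELAY:-5}"
    done
}

# e.g.: run_with_restart /usr/testbed/sbin/igevent_daemon ...
```

The daemon itself would still need the "detect the stuck state and exit" logic; the wrapper only handles the restart half.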