emulab-devel issues
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues
2020-07-23T17:31:36-06:00

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/574
Allow ADB connections across aggregates in the same experiment
2020-07-23T17:31:36-06:00
Kirk Webb

We restrict access to ADB on UE phones based on a "target" host/IP set for such nodes in profiles (using `iptables` on the phone host). For phones running "development" Android builds, this is important since no authentication is required to connect. Note: The current mechanism only allows users to specify a single host/IP.

The ADB target can be a node in the same experiment as the UE so long as both are in the same aggregate. When this is the case, the node's user-assigned (logical) name can be specified in the profile as the ADB target. During experiment setup, such a binding will be dynamically resolved to the control IP address of the physical machine allocated as the ADB target node.

This dynamic binding is not possible across aggregates within the same experiment because aggregate namespaces and mappings are not shared. Since users are interested in having the ADB target in the same experiment as phones in a separate aggregate, we are looking for ways to allow for this.

Assignee: Kirk Webb

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/570
Develop code to monitor sensor on biccmpb CPU board
2020-08-26T13:21:53-06:00
Alex Orange

Develop the code to interface with the sensors on the Bus Integrated Control Clocking Monitoring Power Board (biccmpb).
This board will be made 10 more times to go into the fixed endpoints with roughly if not exactly the same code as the buses for monitoring.
Sensors:
* Voltage Sensors (power supply)
* Current Sensors
* Isolated Voltage/Current Sensor (new power board)
* Fans: tachometer and speed control
* GPSDO NMEA String
* GPSDO 10 MHz (new board) and PPS to MCU
* RS232 UARTs (2)
* SPI MAC FLASH
* SI7006-A20 (Humidity sensor)
* Moisture sensor
* Cooler fault
* MCP9808
* Charge/Discharge on battery unit
* BMX160 (IMU)
* TCA9539-Q1 (GPIO expander)
* NUC reset

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/569
Allow project leader to delete any dataset in their project
2020-07-22T06:54:10-06:00
Leigh Stoller

Currently only the owner or admins can delete a dataset.

Assignee: Leigh Stoller

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/567
More storage for CloudLab storage servers
2022-10-06T12:05:57-06:00
Mike Hibler

After spending quite a lot of time scrambling to come up with 10TB on one of our blockstore servers, I came to the conclusion that we are "under-storaged" if we are serious about the blockstore mechanism and allocation of 10+TB datasets. The situation:
* Utah: 43.5TB in one server, **6.7TB free**. We have a couple of options for more space here. One is the DriveScale chassis, which have over 100TB between them but need more work to integrate with the model. The other is to wire up the Apt half of the Dell storage box. This would add a second server with 36TB. It requires running a 40Gb link from the Apt side of the pod over to `bighp1`.
* Clemson: 50TB on one server, but in two zpools of 43 and 7TB. **5.8TB and 1.3TB free**. Again a couple of possibilities. The first would be to commandeer another one of the first-gen storage nodes, giving us another 50TB. A more intriguing possibility would be to take over one of the `dss7500` nodes with 45 HDDs and 270TB. That would solve the space issue for some time to come but would take one of only two of those machines. I only suggest it because those machines are almost never used right now, which is a waste.
* Wisc: 32TB on one server. **10TB free**. The only option for more space here would be to take over another one or two `c240g1` nodes, the only ones with significant storage. That would get another 32TB per server.
* UMass: no storage servers, no current plan.
* OneLab: no storage servers, no current plan.
* Emulab: 92TB on two storage servers. Each server has 40TB for persistent and 6TB for ephemeral. There is about **9.7TB free on each**.
* Apt: 36TB in one server, all available, have been using for testing. The intent was to move this storage server + disk to CloudLab Utah.

Assignee: Mike Hibler

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/565
Remove project datasets when we remove a project
2020-06-27T09:43:29-06:00
Mike Hibler

I got this:
```
2 boss.emulab.net> showlease -a
Could not find group JPF-Doop/JPF-Doop!
Could not find group Maline/Maline!
Could not find group JPF-Doop/JPF-Doop!
Could not find group Maline/Maline!
Could not find group Clover/Clover!
Could not find group Clover/Clover!
...
```
and those, I believe, are old projects that Zvonimir got rid of.

Assignee: Mike Hibler

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/564
Missing local blockstore desires
2020-06-29T15:30:07-06:00
Leigh Stoller

Noticed this morning, when a user tried to create a local blockstore that was too big for the node (m400, disk size from the previous century). Turns out that we need to add a ?+disk_xxx desire for local blockstores, which is currently done in the NS parser on the Classic path. So we need to generate that in geni-lib or in the CM.

Assignee: Leigh Stoller

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/561
Fix snmpit error after DeleteNodes()
2020-06-26T10:14:43-06:00
Leigh Stoller

Seen this a couple of times, deleting all nodes but a single node, which means deleting the vlan, results in this error:
```
stack::findVlans calling ms-chassis9-switchb
*** snmpit:
No vlanid 67208 in the DB!
Failed to setup vlans: Failed to synchronize vlans
```
See https://www.utah.cloudlab.us/spewlogfile.php3?logfile=1b1d96adb75be1ec12725cda2145a751

Assignee: Leigh Stoller

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/560
Ensure datasets are not busy when taking a snapshot
2020-07-03T09:46:00-06:00
Mike Hibler

This is a very common failure mode for image-backed datasets:
```
About to: '/usr/testbed/bin/sshtb -n -host c220g5-111012 /usr/local/bin/create-versioned-image METHOD=frisbee SERVER=128.104.222.9 IMAGENAME=praxis-PG0/bench-setup:0 BSNAME=bs IZOPTS=N' as uid 0
c220g5-111012: started image capture for '/.amd_mnt/ops.wisc.cloudlab.us/proj/praxis-PG0/images/bench-setup/bench-setup.ndz.tmp', waiting up to 90 minutes total or 8 minutes idle.
umount: /benchdata: target is busy.
Could not unmount /dev/mapper/emulab-bs!
Could not parse all arguments
FAILED: Returned error code 2 generating image ...
```
We want to unmount the filesystem to get a consistent snapshot of it, but the user has a process active on the dataset at that time.
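To make the "target is busy" case concrete, here is a minimal sketch of how the offending processes could be found on a Linux node by walking `/proc`. It is only an illustration, not what `create-versioned-image` does today: the function names `path_is_under` and `pids_using` are hypothetical, it only looks at each process's working directory, root, and open file descriptors (so memory-mapped files and nested mounts are missed), and it needs to run as root to see other users' processes. The result is roughly what `fuser -m` would report.

```python
import os

def path_is_under(path, mount_point):
    """True if path is the mount point itself or lies beneath it."""
    mount_point = mount_point.rstrip("/") or "/"
    if mount_point == "/":
        return path.startswith("/")
    return path == mount_point or path.startswith(mount_point + "/")

def pids_using(mount_point, proc_root="/proc"):
    """Map pid -> paths (cwd, root, open fds) that live under mount_point.

    proc_root is a parameter only so the function can be pointed at a
    fake tree for testing; on a real node it stays "/proc".
    """
    busy = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        base = os.path.join(proc_root, entry)
        links = [os.path.join(base, "cwd"), os.path.join(base, "root")]
        try:
            fd_dir = os.path.join(base, "fd")
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # process exited, or we lack permission to list its fds
        for link in links:
            try:
                target = os.readlink(link)
            except OSError:
                continue
            if path_is_under(target, mount_point):
                busy.setdefault(int(entry), []).append(target)
    return busy
```

A pre-snapshot check could call `pids_using("/benchdata")` and either refuse to snapshot or report the PIDs when the result is non-empty; both choices correspond to options in the list that follows.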
Things to look at:
* attempting to locate all such processes and killing them
* doing a forcible unmount
* shutting down the machine to single-user
* identifying the situation in advance and refusing to snapshot

Assignee: Mike Hibler

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/556
Make a tipserv machine for IPMI-based consoles
2023-12-17T10:22:02-07:00
Mike Hibler

Since we started using SOL for node consoles, we have been running captures on the boss node since it has access to the management network. In the case of the mothership boss, and even more so the Utah cloudlab boss, that is on the order of 500-1000 capture instances.
There is no reason that capture has to run on boss for these (well, unless we cut some corners in the IPMI capture and assumed we could access the DB directly for info). We could have a separate node (VM?) handle this; it just has to be on the private segment of the control network. The question is whether the trade-off of load vs. convenience of access (e.g., to logfiles) is worth it.

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/555
Eliminate tipserver mounts on ops
2020-06-08T09:12:42-06:00
Mike Hibler

Our remaining `tipserv*` nodes are getting old and flaky and if one of them goes down for an extended period of time, it (eventually) takes `ops` with it because we NFS mount the tipserv nodes on ops. Even though the mounts are interruptible (intr) and set to time out (soft), neither seems to work. A 30 minute reboot of ops because a tipserv node fails is not acceptable.
Why are tipserv nodes mounted on ops? Originally, it was so that users could get to the log/run files for captures. But now that users can no longer log in to ops directly, maybe that is not needed. @stoller confirms that the portal interface accesses logs by `ssh`ing directly to the tipserv nodes. The remaining concern is that `ops` does directly access the `.acl` file for authentication. A reasonably simple solution for this is to reverse the mounts so that tipserv nodes mount `ops` instead of the other way around. We would want to isolate the `.acl` files so that they are the only thing being exported to `ops`; we do not want to have every tipserv node writing the console logs themselves across NFS to `ops`. `ops` does not need more NFS load.

Assignee: Mike Hibler

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/554
The FreeBSD `metis` port has undergone a major, possibly incompatible revision
2020-06-03T09:17:26-06:00
Mike Hibler

From the Slack thread:
> FYI, the FreeBSD `metis4` port is gone in the latest quarterly port set. There is now a `metis` port, which is Metis version 5. Anyone care to speculate whether that is going to cause problems? I thought it was used with `assign`, but apparently it is only `assign_prepass` and `ipassign`.
@ricci thought they were no longer used, though I verified that we have classic experiments (none in the last 10 years) that will use them due to magical DB settings in the `experiments` table. @stoller says `ipassign` is not used through the portal path, though `assign_prepass` (via `mapper`) might be.
The new port does cause problems:
> It looks like at least `assign_prepass` is still used. It uses the `kmetis` command line tool and that tool is now gone. According to the manual (http://glaros.dtc.umn.edu/gkhome/fetch/sw/metis/manual.pdf, search for "kmetis"), `gpmetis` is the direct replacement command, but I don't have any idea if it behaves the same without any additional options.
> I should add that `ipassign` is also technically still used, but it has to be specified explicitly in an ns file and there are only a handful of experiments that do that--none swapped in in the last 10 years. But we don't have a better solution for large, complex topos, we just punt on them now. `ipassign` links with `metis` libraries and we have already discovered that they moved include files around which breaks the build. Who knows what has changed in the API.
> Looks like mapper has to be invoked with "-x", or the experiment flagged with "useprepass", in order for assign_prepass to be called. There are no instances of the former, and though there are three experiments with the "useprepass" column set, I see no path in our code through which that field can ever be set!

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/553
Nodecheck bugs
2020-05-22T17:12:09-06:00
Dan Reading

Assignee: Dan Reading

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/547
Trying to use emulab-xen sliver type fails at Cloudlab Utah cause of image aliases
2020-05-04T11:35:50-06:00
Leigh Stoller

Using /tmp/stitcher.hWZoPG for stitcher
Stitcher command: /usr/testbed/gcf/src/stitcher.py --fileDir /tmp/stitcher.hWZoPG --cred /tmp/stitcher.hWZoPG/speaksforcred.xml --slicecredfile /tmp/stitcher.hWZoPG/slicecred.xml --usercredfile /tmp/stitcher.hWZoPG/slicecred.xml --al2scredfile /tmp/stitcher.hWZoPG/al2scred.xml --debug --GetVersionCacheName=/tmp/stitcher.hWZoPG/get_version_cache.json --AggNickCacheName=/tmp/stitcher.hWZoPG/agg_nick_cache --scsURL http://scs.scs.scs.emulab.net:8081/geni/xmlrpc --speaksfor urn:publicid:IDN+emulab.net+user+thedeu2e -V3 allocate urn:publicid:IDN+emulab.net:sdnnfvlab+slice+attempt5 /tmp/stitcher.hWZoPG/rspec.xml
Allocation of slivers in slice urn:publicid:IDN+emulab.net:sdnnfvlab+slice+attempt5 at utah-clab3 failed: Error from Aggregate: code 2. protogeni AM code: 28: *** WARNING: mapper:
*** nodejailosid: Could not map [ImageAlias emulab-ops,UBUNTU16-64-STD
*** 123456] on [vnode:Relay-node]
*** ERROR: mapper:
*** Can't call method "osid" on an undefined value at
*** /usr/testbed/lib/libvtop_test.pm line 2510.
(PG log url - look here for details on any failures: https://www.utah.cloudlab.us/spewlogfile.php3?logfile=f754c12071ae77e24945263997708068)..

Assignee: Leigh Stoller

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/542
Portal based policies
2020-04-24T13:01:02-06:00
Leigh Stoller

We need a way to restrict node/types on a portal basis. We have kicked around ideas
like extending the group_policy tables or adding a portal_policies table. @hibler is
super interested in this ticket.

Assignee: Leigh Stoller

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/540
More comprehensive disk cleaning
2023-09-22T07:33:01-06:00
Mike Hibler

As we (mostly Powder) pick up more industrial (non-academic) users, zeroing of disks between experiments will probably become more important. We have a path for doing this, and @kwebb uses this in his "tainting" of experiment nodes (originally for PhantomNet), but it is an expensive operation right now because it requires frisbee to write zeros to all free blocks.
There are other things we can do:
* Use the "block erase" support that many (most? all?) SSDs and NVMe devices support. We use this currently in conjunction with our TRIM support and, at least for devices we have, requires less than a minute to erase up to 500GB devices. Presumably this is because it only marks all the blocks for erasure and does the work in the background.
* Make use of SEDs (self-encrypting disks). We have talked about this since NCR days, but if you just change the encryption key for disks between experiments, you have effectively erased the old content. I think we have some of these disks, and there is FreeBSD/Linux support for manipulating these.
Another consideration is whether we erase *all* disks between experiments. That could be really, really painful on those Clemson nodes with 40+ HDs...

Assignee: Mike Hibler

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/538
Watchdog for "shutting off" RF if we lose contact for an extended period of time
2020-07-14T16:07:28-06:00
Robert Ricci <ricci@cs.utah.edu>

The idea is to shut off RF transmissions if we lose contact with central control for a long period of time, since we can't tell if the panic button has been pressed.
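A minimal sketch of the watchdog logic, assuming the endpoint can poll periodically and has some callback that disables RF. The class name `ContactWatchdog`, the `on_timeout` callback, and the timeout value are all hypothetical; how contact with central control is detected, and how RF is actually shut off, are left open here.

```python
import time

class ContactWatchdog:
    """Fire a callback once if central control has been silent too long."""

    def __init__(self, timeout_s, on_timeout, clock=time.monotonic):
        self.timeout_s = timeout_s      # allowed silence, in seconds
        self.on_timeout = on_timeout    # hypothetical hook that disables RF
        self.clock = clock              # injectable so tests can fake time
        self.last_contact = clock()
        self.tripped = False

    def heard_from_control(self):
        """Record contact and re-arm the watchdog.

        This does not turn RF back on; re-enabling is assumed to be a
        separate, deliberate action.
        """
        self.last_contact = self.clock()
        self.tripped = False

    def poll(self):
        """Call periodically; returns True while the watchdog is tripped."""
        silent_for = self.clock() - self.last_contact
        if not self.tripped and silent_for > self.timeout_s:
            self.tripped = True
            self.on_timeout()  # fires once per outage, not on every poll
        return self.tripped
```

Using a monotonic clock avoids false trips or missed trips when the wall clock is stepped, which matters on nodes that may lose time synchronization along with their control connection.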
@alexo suggested that we control the RF switch to move the inputs to an unamplified output, as a simple way of preventing significant radiation from occurring.

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/534
Lock a project down to a particular Cluster
2020-03-26T14:15:08-06:00
Robert Ricci <ricci@cs.utah.edu>

The Mass site is going to want to allow in commercial users, who should only be able to use their cluster. After talking this over with Mike Zink, the simplest way to do this will probably be to make a project that they manage membership for, and limit the project to using only one cluster.

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/530
create_image -s on boss fails
2020-03-26T14:20:26-06:00
chuck cranor

This is Mike's favorite bug! It predates flux gitlab, so I couldn't file an issue on it when I first encountered it. Now I can...
Running "create_image -s" (uses ssh) on boss causes the new disk image binary to be emailed to your account instead of saved on disk. It's too large, so the mail system rejects it with:
<pre>
The original message was received at Wed, 29 Jan 2020 14:54:46 -0500 (EST)
from localhost [127.0.0.1]
----- The following addresses had permanent fatal errors -----
<chuck@ece.cmu.edu>
(reason: 552 5.2.3 Message size exceeds fixed maximum message size
+(52428800))
----- Transcript of session follows -----
... while talking to dept-mx-03.andrew.cmu.edu.:
>>> MAIL From:<chuck@boss.narwhal.pdl.cmu.edu> SIZE=928020875
<<< 552 5.2.3 Message size exceeds fixed maximum message size (52428800)
554 5.0.0 Service unavailable
</pre>
one possible solution is just to remove "-s" from create_image, as the non-ssh options work.
another solution is this:
<pre>
diff -r -u baseline/utils/create_image.in orca/utils/create_image.in
--- baseline/utils/create_image.in 2019-05-22 17:03:13.000000000 -0400
+++ orca/utils/create_image.in 2019-05-22 17:02:25.000000000 -0400
@@ -1224,7 +1224,7 @@
#
my $SAVEUID = $UID;
$EUID = $UID = 0;
- $result = run_with_ssh($command, undef);
+ $result = run_with_ssh($command, $filename);
$EUID = $UID = $SAVEUID;
if ($result eq "setupfailed") {
goto done;
</pre>
but Mike was worried that it might break(?) something else if applied.

Assignee: Mike Hibler

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/524
Prevent deletion of image files in /proj/pid/images
2020-01-02T14:54:09-07:00
Leigh Stoller

This comes up all the time; users deleting image files and directories in the images directory, leaving a dangling descriptor. We can use the sunlink flag to prevent it, but we cannot set/unset flags via NFS, so it is more than a trivial change.

Assignee: Leigh Stoller

https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/517
Representing wireless spectrum licenses
2020-07-14T16:05:54-06:00
Robert Ricci <ricci@cs.utah.edu>

We need to have a way to represent what spectrum users are allowed to use. We already have the other side of this (what they are requesting) represented. What we need now are at least the following:
* Spectrum that everyone is allowed to use all the time (e.g., ISM)
* Spectrum that people are allowed to use as a result of licenses we own (e.g., innovation zone)
* Spectrum that can be used only if we make some external check (e.g., checking to see if they posted the details of the experiment to the FCC website for the program license)
* Spectrum that other parties have access to, e.g., due to licenses their companies hold
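The four cases above could be modeled along these lines. This is only a sketch to make the categories concrete: the names `Basis`, `Grant`, `grant_allows`, and `may_use` are hypothetical, a request is assumed to fit inside a single grant (no stitching of adjacent ranges), and spectrum covered by our own licenses is treated as simply allowed, which the real rules may not permit.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, FrozenSet, Iterable, Optional

class Basis(Enum):
    OPEN = "open"                        # everyone, all the time (e.g., ISM)
    TESTBED_LICENSE = "testbed_license"  # licenses we own
    EXTERNAL_CHECK = "external_check"    # allowed only if an outside check passes
    THIRD_PARTY = "third_party"          # licenses held by other organizations

@dataclass(frozen=True)
class Grant:
    low_mhz: float
    high_mhz: float
    basis: Basis
    holders: FrozenSet[str] = frozenset()          # used by THIRD_PARTY
    check: Optional[Callable[[str], bool]] = None  # used by EXTERNAL_CHECK

def grant_allows(grant: Grant, low: float, high: float,
                 experiment: str, orgs: Iterable[str]) -> bool:
    """True if this one grant covers [low, high] for this experiment."""
    if low < grant.low_mhz or high > grant.high_mhz:
        return False
    if grant.basis in (Basis.OPEN, Basis.TESTBED_LICENSE):
        return True
    if grant.basis is Basis.EXTERNAL_CHECK:
        # The check receives the experiment id; what it verifies is site policy.
        return grant.check is not None and grant.check(experiment)
    return bool(grant.holders & set(orgs))  # THIRD_PARTY

def may_use(grants: Iterable[Grant], low: float, high: float,
            experiment: str, orgs: Iterable[str]) -> bool:
    """True if any single grant permits the requested range."""
    return any(grant_allows(g, low, high, experiment, orgs) for g in grants)
```

For example, a grant list might hold `Grant(2400.0, 2483.5, Basis.OPEN)` for the 2.4 GHz ISM band alongside third-party and externally checked entries; the ranges for those would come from the actual licenses, and none are assumed here.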
This is partially blocked on seeing what the rules for the Innovation Zone will be.

Assignee: Gary Wong