emulab-devel issues
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues

#688 Allow explicit user snapshots of persistent blockstores (remote datasets)
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/688 (Mike Hibler; updated 2024-02-02)

Right now, snapshots of a blockstore are only made when an experiment containing a node with a RW mapping of that blockstore is terminated. This might be a bit cumbersome for workflows in which the "master" of a dataset is being updated on a regular basis (instantiate, update, terminate).
We could allow users to make an explicit snapshot of a blockstore while a RW mapping exists. The workflow would be: the user clicks a button somewhere in the GUI, which invokes a script on boss that `ssh`s over to the node with the RW mapping and unmounts any filesystem associated with the dataset, then makes the DB call on boss to create a snapshot of the zvol, and finally `ssh`s to the node again to remount any filesystem.
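A minimal sketch of what that boss-side script might do, ignoring the DB bookkeeping; the node, storage server, zvol, and mountpoint names are illustrative, not actual Emulab paths:

```
#!/bin/sh
# Sketch: quiesce-snapshot-remount flow for a persistent blockstore.
# NODE, STORAGE, ZVOL, and MNT are illustrative values.
NODE="pc1.example.net"        # node holding the RW mapping
STORAGE="storage.example.net" # server hosting the zvol
ZVOL="zpool0/dataset-foo"     # zvol backing the blockstore
MNT="/mnt/dataset"            # where the node has it mounted

# 1. Quiesce: unmount the filesystem on the node so the snapshot is consistent.
ssh "$NODE" "sync && umount $MNT" || exit 1

# 2. Replace the single user snapshot if possible ("-d" defers the destroy
#    if a clone still holds it), then snapshot the zvol.
ssh "$STORAGE" "zfs destroy -d $ZVOL@user 2>/dev/null; zfs snapshot $ZVOL@user"

# 3. Resume: remount the filesystem on the node.
ssh "$NODE" "mount $MNT"
```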
I would not allow more than a single snapshot, which would basically be "the most recent snapshot" that new RO and clone mappings use. If the user then creates another snapshot, it replaces the previous one if possible (i.e., if the old snapshot is not in active use). I don't want to start accounting for multiple user snapshots and providing ways to map particular ones. I view this as just a shortcut for the terminate/reinstantiate model.

#687 More comprehensive node cleaning
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/687 (Mike Hibler; updated 2023-09-22)

We are running into more and more instances of node-hosted devices that have their own persistent state that should really be reset between experiments. We have long talked about doing this for BIOS settings (#234, #652), but new "threats" are emerging:
* BlueField NICs or other processing units that have an SoC. We have run into problems on the Clemson nodes with BF2 NICs, where users can load images onto the card that prevent it from presenting properly to the BIOS or a naive OS. This leads to long waits, ending with timeouts, at initialization time, and ultimately causes timeouts in `stated` which can result in nodes winding up in `hwdown`.
* NVMe devices. The standard allows for partitioning of the physical devices into logical units and for changing the format of the drives from 512-byte logical blocks to 4K blocks. The former is mostly just confusing to the user; the latter can actually prevent nodes from being imaged by `frisbee`, as the `imagezip` format allows for regions smaller than 4K that need to be written.
The latter could be handled from the MFS pretty easily: there are small utilities for both Linux (`nvme`) and FreeBSD (`nvmecontrol`) that can remove logical devices and put the drives in the correct format, all in short order. The former involves a much more complex set of NVIDIA tools and a complete Linux image for the card, and the reset process can take 20+ minutes.
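For the NVMe case, a rough sketch of what the Linux MFS could run; the device names, namespace ID, and the assumption that LBA format 0 is the 512-byte format are all illustrative (the FreeBSD equivalent would use `nvmecontrol`):

```
#!/bin/sh
# Sketch: reset an NVMe drive to a single namespace with 512B logical blocks.
# /dev/nvme0 and LBA format index 0 are illustrative; check `nvme id-ns -H`
# for the index that corresponds to 512B data size on the actual drive.

# Delete a stray extra namespace (nsid 2 here) if one exists.
nvme delete-ns /dev/nvme0 -n 2

# Show the supported LBA formats for namespace 1.
nvme id-ns /dev/nvme0n1 -H | grep "LBA Format"

# Reformat namespace 1 with LBA format 0 (512B blocks on most drives).
nvme format /dev/nvme0n1 --lbaf=0
```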
Since these are (at the moment) pretty rare events, our thinking is that we will have a "recovery" path, similar to the "hardware checkup" path, that will run nodes through a custom Linux image with all the necessary tools to fix these and other problems. Nodes could be run through this path if they fail to load a known standard image, or we can run them through on demand. @stoller has such a custom image that he constructed specifically for the BF2 case.
Note that this is related to #540, which covers proper cleaning of the contents of storage devices.

#686 Handle blockstores "correctly" during experiment modify
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/686 (Mike Hibler; updated 2023-10-11)
Since @stoller got experiment modify working via the portal (have I mentioned how awesome this is?) I need to figure out how to handle blockstores in a sane manner.
First, it should be pretty easy to make this work for remote blockstores: we just have to detach from them before the modify and reattach to whatever is in the experiment after the modify.
Local blockstores are the issue. The cleanest approach would just be to destroy any existing blockstores before the modify and then recreate blockstores after the modify. However, this would likely not sit well with users unless they were explicitly making a change to the blockstore configuration as part of their modify operation. If they were just, say, adding a node, then wiping out their local blockstores on all other nodes is probably not a reasonable thing to do. Unfortunately, this may be what I do for the current "reconfig" target; I had better go fix that!
There is code in place today to save the current config at boot time and, on reboot, read in that config, adding or removing any blockstores that appear or disappear. But that is not going to work unless the reconfig involves a reboot. It also won't work if the boot disk is reloaded as part of the modify. We either need to store the configuration info in /proj, or maybe take advantage of the fact that volume managers like LVM and ZFS can reconstruct their config from on-disk info.
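For the on-disk reconstruction option, a sketch of what the client side could run after a modify; this relies only on standard LVM/ZFS discovery commands, and assumes the volumes were left intact:

```
#!/bin/sh
# Sketch: rediscover local blockstores from on-disk metadata after a modify.
# Works because LVM and ZFS both record their configuration on the disks
# themselves.

# ZFS: list any exported/forgotten pools on the local disks, then import them.
zpool import
zpool import -a -f

# LVM: rescan devices and activate any volume groups found.
vgscan
vgchange -a y
```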
#685 Ensure Linux MFSes provide necessary `/dev/disk/by-X` symlinks
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/685 (David Johnson; updated 2023-06-22)

As discussed in https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/233#note_37815, we need to extend our Linux MFS `mdev`-based environment to support the classic Recovery MFS use case: reinstall a borked bootloader via chroot to host. This used to work fine, but modern grubs need additional dynamically-generated symlinks in /dev. To quote from the linked comment: "our recovery MFS chroot-and-reinstall-bootloader no longer works on Ubuntu 22. The package postinst and/or bootloader install scripts now require the /dev/disk/by-\* symlinks, and we save space in the MFS by using busybox mdev."
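A minimal sketch of how the MFS could synthesize the symlinks that grub's install scripts want, standing in for what udev would normally create. This assumes the full util-linux `blkid` (the busybox applet is more limited) and that the device glob covers the disks we care about:

```
#!/bin/sh
# Sketch: generate /dev/disk/by-uuid and /dev/disk/by-partuuid symlinks
# in an mdev-based MFS.
mkdir -p /dev/disk/by-uuid /dev/disk/by-partuuid

for dev in /dev/sd?* /dev/nvme?n?p?; do
    [ -b "$dev" ] || continue
    uuid=$(blkid -s UUID -o value "$dev")
    [ -n "$uuid" ] && ln -sf "../..${dev#/dev}" "/dev/disk/by-uuid/$uuid"
    puuid=$(blkid -s PARTUUID -o value "$dev")
    [ -n "$puuid" ] && ln -sf "../..${dev#/dev}" "/dev/disk/by-partuuid/$puuid"
done
```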
#683 Boss/ops hardware upgrades at Cloudlab Clemson
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/683 (Mike Hibler; updated 2023-06-09)

This is nearly identical to #669 and #670. A related issue is #629.
We need to do everything here:
* [ ] both: figure out hardware, either buying new machines or repurposing new-ish nodes.
* [ ] ops: make preliminary copy of current ops ZFS data over to new ops (see the sketch after this list)
* [ ] both: make preliminary copy of /usr/testbed over to new machines, making sure services are disabled on the new machines
* [ ] figure out what if any DB state needs to be updated to reflect the change
* [ ] schedule downtime
* [ ] make sure @stoller and Scott/Dennis are around to share the pain :-)
* [ ] make the final transition
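For the preliminary ZFS copy, a sketch of the usual snapshot-and-send approach; pool/dataset names and the `new-ops` hostname are illustrative:

```
#!/bin/sh
# Sketch: seed the new ops with a copy of the old ops ZFS data, then
# incrementally catch up again just before the final transition.

# Initial full copy while the old ops is still live.
zfs snapshot -r z/users@seed
zfs send -R z/users@seed | ssh new-ops zfs recv -F z/users

# During the downtime window, send only what changed since the seed.
zfs snapshot -r z/users@final
zfs send -R -i @seed z/users@final | ssh new-ops zfs recv -F z/users
```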
#682 long waits, or timeouts, Clemson r650s
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/682 (reported by Dan Reading, assigned to Mike Hibler; updated 2023-05-17)

I've been reloading r650s at Clemson a lot recently and I see _long_ waits at "Waiting for server to reboot us ..."
_experiment_ `https://www.cloudlab.us/status.php?uuid=25b9ab9b-f4c3-11ed-9f39-e4434b2381fc`
In this experiment I asked two r650s to load UINTAH22-64-STD; both nodes ended with "Exited (2)".
Looking at the console log for clnode263, frisbee loaded emulab-ops/UBUNTU22-64-STD:1:
```
Wed May 17 08:58:19 MDT 2023: slicefix run(s) done
Waiting for server to reboot us ...
[ 123.144978] IPOD: got type=6, code=6, iplen=666, host=130.127.132.51^M
[ 123.151442] IPOD: reboot forced by 130.127.132.51...
Press `ESC' to enter the GRUB menu...
Checking DHCP data for bootinfo server address...not found, using DHCP
server 130.127.132.51
Requesting bootinfo data from 130.127.132.51....
error 28 while waiting for response
error: failed to receive bootinfo response.
Booting `Force Recovery MFS (Linux) Boot'
error: can't find command `pxe'.
Trying to load GRUB config file /tftpboot/recovery_linux/grub.cfg...
```
I tried `os_load -i UBUNTU22-64-STD clnode265` on the other node in the experiment; it loaded the image OK.
Just seems to be an overly long wait at **Waiting for server to reboot us ...**

#680 A better grub2pxe
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/680 (Mike Hibler; updated 2023-04-17)
`grub2pxe` is our "Linux MFS" replacement for `pxeboot` from FreeBSD-world. Since it is the future (and has been for at least 10 years now) there are a couple of things that could/should be improved:
* We should get our changes synced up with some suitably recent version of mainstream Grub. In the past I have merged our changes (bootinfo, FreeBSD kernel loading fixes, TFTP improvements). My most recent attempt (last year sometime) resulted in a grub2pxe that didn't work right. So this needs to get sorted.
* Come up with the minimum set of grub2pxe-loaded `grub.cfg` files that will handle loading FreeBSD or Linux MFSes, FreeBSD or Linux on-disk images, and UEFI (GPT) or legacy BIOS (MBR) format images on all our node types with their various quirks. I started down this path as part of #233, meticulously testing all possible combos for each node type, but after about a month of off-and-on work, I gave up and just started tweaking the installed versions 'til they worked, without going back and integrating the changes and retesting working combos. All in the name of "let's just get this done already!" The result is a trail of highly similar versions that could be reduced and consolidated.
* Grub has support for network transports other than TFTP, in particular HTTP. Once grub2pxe has been loaded, we could download the kernel and MFS image much more efficiently with HTTP. For one, it is TCP and not UDP, so we get some congestion control. For another, there is likely a lot more web server development going on than TFTP server development, in particular for handling lots of simultaneous clients. Finally, putting a random-access disk-based boot loader on top of TFTP results in particularly horrible behavior when it has to do a seek: seeking backward means starting over at the beginning and doing sequential block-by-block transfers of data (that you throw away) 'til you reach the correct new location. Trying to change the behavior of the boot loader to avoid backward seeks is one of the set of highly invasive, custom changes to Grub we have been carrying around.
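For reference, a sketch of what an HTTP-based MFS load might look like in a grub2pxe-loaded `grub.cfg`, assuming grub is built with its `http` module; the server name, paths, and kernel argument are illustrative:

```
# Sketch: fetch kernel and MFS over HTTP instead of TFTP.
# (boss.example.net and the /tftpboot paths are illustrative.)
insmod http
menuentry "Linux MFS (HTTP)" {
    set root=(http,boss.example.net)
    linux  /tftpboot/linux-mfs/vmlinuz boot=live
    initrd /tftpboot/linux-mfs/initramfs.gz
}
```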
#677 Cloudlab display problem in narrow window
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/677 (reported by Dan Reading, assigned to Leigh Stoller; updated 2023-03-16)

Monitor is 1920x1200 in vertical orientation.
Don't see the drop-downs Experiments, Storage, Docs, and <user> if the window is 1121W x 1283H. If the window width is increased to 1541W, then the drop-downs appear.
If the browser window is decreased to 75%, then the drop-downs appear.

#676 Reservations, extensions, and on-demand experiments coexistence
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/676 (reported by Kirk Webb, assigned to Leigh Stoller; updated 2023-03-16)

Experiment extensions push resource allocations forward, sometimes clashing with unapproved reservations, which seek to carve out specific future time slots for resource use. Extensions are currently granted, with automatic approval, for existing experiments for up to 1 week from the initial start of the experiment (up to an additional week can be requested later, for a total of 2 weeks of auto-approved extension). Some reservations are automatically granted and some are auto-approved, depending on the resources and time period requested. Reservations for resources that require admin approval may be submitted with a start time as soon as 9 AM the next business day. If a reservation requiring admin approval is posted in the period between reviews (longer on weekends), a subsequent automatically-approved extension involving those same resources can cause this reservation request to no longer be viable. The resulting resource contention usually results in the reservation request being denied, with a note explaining the situation and requesting that the user make a new request. This is a fairly frequent problem, especially for reservations starting on Monday mornings with a weekend in between admin reviews.
On POWDER, where admin-approved reservations are required for some resources, this extension-reservation clash is often confusing and frustrating for users. To try and ease this tension, we came to a consensus in a recent team meeting to make the following policy (and associated enforcement) changes:
* Increase the minimum start time for new reservation requests requiring admin approval to 2 business days.
- Starting no earlier than 9 AM on the second business day.
* Decrease the maximum auto-approved extension request duration to 24 hours from the current time.
- Additional auto-approved extensions can be granted for up to 24 hours from the current time over 2 weeks.
- Longer extensions can still be requested, but will require admin approval.
Taken together, these changes will force auto-approved extensions to stay clear of pending reservation requests. I will write up an announcement to send to users explaining the changes in reservation and extension policy. I'm only suggesting this change for POWDER, but of course it could be applied elsewhere.

#674 Unclean shutdown of iSCSI blockstores
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/674 (Mike Hibler; updated 2023-02-13)

When an experiment terminates, iSCSI blockstores with mounted filesystems do not get unmounted cleanly. Since the nodes in an experiment do not get rebooted (I don't think) until the node winds up in `reloading`, the blockstore shutdown script's unmounting of remote blockstores at that time will not work, since the experiment VLANs have already been torn down. The result is that not all data may be sync'ed to the storage server.
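A sketch of the kind of teardown that would need to run on each node while the experiment VLANs are still up (Linux/open-iscsi flavor; the mountpoint and target name are illustrative placeholders):

```
#!/bin/sh
# Sketch: cleanly detach an iSCSI blockstore before the experiment
# VLANs are torn down. Mountpoint and target name are illustrative.
MNT=/mnt/blockstore
TARGET="iqn.2000-10.net.example:lease-1234"

sync
umount "$MNT"                           # flush and release the filesystem
iscsiadm -m node -T "$TARGET" --logout  # tear down the iSCSI session
```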
#671 Enable SEV on Cloudlab AMD nodes
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/671 (Mike Hibler; updated 2022-12-22)

This only applies to Clemson `r6525` nodes right now, but I am creating this ticket to make note of what to do for future nodes.
To enable AMD SEV, enable the following (on Dell machines):
* IOMMU
* Kernel DMA Protection
* Secure Memory Encryption
* Secure Nested Paging
* SNP Memory Coverage
and set "Minimum SEV non-ES ASID" to a value greater than one. It appears to actually be a maximum based on the description in [the vSphere docs](https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.security.doc/GUID-757E2B37-C9D0-416A-AA38-088009C75C56.html). They say set it to N+1 if you want N VMs, so we could set it to something like 17 or 33 maybe.https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/666Periodic retries for nodes in `hwdown`2022-07-11T11:12:17-06:00Mike HiblerPeriodic retries for nodes in `hwdown`We spend a considerable amount of time on an ongoing basis dealing with nodes in the `hwdown` experiment. By the time we diagnose such a node, quite often the problem has disappeared or we discover the problem is easily fixable. Worse, i...We spend a considerable amount of time on an ongoing basis dealing with nodes in the `hwdown` experiment. By the time we diagnose such a node, quite often the problem has disappeared or we discover the problem is easily fixable. Worse, if we don't notice a node in there quickly enough it can get pushed down the list by more recent failures and can fall off our radar, winding up in `hwdown` for months.
#666 Periodic retries for nodes in `hwdown`
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/666 (Mike Hibler; updated 2022-07-11)

We spend a considerable amount of time on an ongoing basis dealing with nodes in the `hwdown` experiment. By the time we diagnose such a node, quite often the problem has disappeared or we discover the problem is easily fixable. Worse, if we don't notice a node in there quickly enough, it can get pushed down the list by more recent failures and fall off our radar, winding up in `hwdown` for months.

So... @eeide asks, "Can we do better?" Maybe periodically releasing nodes from `hwdown` to give them a second chance (which is oftentimes what we do manually when we don't have time to diagnose).
Note that this is mostly an issue for nodes of the less popular types (e.g., `pc3000`s). `hwdown`ing of nodes of frequently used types (e.g., those with GPUs) will cause overbooking problems in the reservation system, or users will complain and we will act quickly.

#665 Expiring projects and users
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/665 (Leigh Stoller; updated 2022-06-23)
Just a place to record some notes about what we want for this. @ricci mentioned that class projects should expire after some period of time. Some questions that come to mind:
- How long?
- Is the project deleted? We do not have project archiving at this time, and for history purposes we would need to add it.
- Are the users deleted or made inactive? Probably not the leader.

#663 Node type permissions checks and enhancements
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/663 (reported by Kirk Webb, assigned to Leigh Stoller; updated 2022-05-09)
As discussed in the POWDER Platform meeting today, I would like two enhancements to the node type checks that are performed:
1) Per-portal node type access
For this, I'd like to be able to, e.g., grant use of all d740 nodes and x310 radios to all POWDER Portal users. Leigh points out that this can be done at the mothership. I take this to mean it can't be done (easily) across aggregates. That should work fine for what is needed here.
2) Node type checks when submitting reservations
When a reservation is submitted, the system should check that the user has rights to use the types in the reservation. We do this check at swap-in, but not for reservations. This would prevent users from getting stopped cold at the last second when trying to swap in with an approved reservation for a type they haven't been granted access to.

#660 A better mechanism for moving data between concurrent experiments
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/660 (Mike Hibler; updated 2022-03-25)

Looking at some of our recent control net traffic abuse, there seems to be a trend where people move a large amount of data between nodes in different experiments. In one case, they have a semi-persistent single-node experiment, and then as they get GPU nodes one at a time, they copy their data over to those. Then (I think) they copy the data back before the GPU node expires. In another case last night, they were moving data (I think) from the NVMe drives in an older experiment onto the NVMe drives in a newer experiment, I assume because the older experiment was going to expire soon. The former is probably about getting some sort of continuity when they can only get GPU nodes for short periods of time. The latter is maybe because the two experiments overlap and they cannot use a persistent dataset RW in two experiments at once.
Anyway, there does seem to be a desire to move data between concurrent experiments. Right now, the easiest way to do that is over the control net, either directly with `scp` or indirectly with NFS (/proj). I am pondering whether there is a better way. Possibilities:
* Use a shared vlan between the experiments, where they could just do their `scp` over an experiment net (see the sketch after this list).
* Expose a shared filesystem abstraction via the "blockstore" mechanism. Again that would use the experiment fabric, but would put load on shared infrastructure.
* Eliminate the need for multiple experiments by making it easy to add and remove nodes from an experiment. Then they could have something like an `m400` "NFS server" node (with a blockstore) for their data and add/subtract "good" nodes in a LAN to do the actual work.
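For the shared-vlan option, once both experiments attach a link to the same shared vlan, the copy itself is just an `scp` over experiment-net addresses instead of the control net; the addresses and paths here are illustrative:

```
# Sketch: from a node in experiment A, copy to a node in experiment B over
# the shared vlan. 10.10.10.2 stands in for the peer's experiment-net
# address on the shared vlan; /mnt/data is an illustrative path.
scp -r /mnt/data/ 10.10.10.2:/mnt/data/
```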
#659 Fix up Apt node BIOSes
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/659 (Mike Hibler; updated 2022-03-02)

While attempting to collect hwinfo from all Apt nodes, I noticed two anomalies with the Apt (in particular `r320`) nodes.
* No setup password set
* The lifecycle controller is disabled, did we intentionally do that?
Because of the first, we should make a pass over the BIOSes and make sure they are configured as we expect.

#658 Time synchronization at Cloudlab clusters
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/658 (Mike Hibler; updated 2022-02-24)
A recent question on the users list asked about time synchronization between the clusters which got me thinking about this again.
All of the nodes at a cluster use a local (`ntp1`) NTP time server which by convention is `ops`. We also stash away the "drift" value from each node (via the watchdog) and use the latest saved value to initialize the drift file when a node is imaged. The various cluster NTP servers use a range of upstream servers and NTP pools, but are not directly connected ("peers"). We seem to keep reasonable time between the cluster NTP servers at least, generally around 1-5ms.
Some questions:
* Is saving/restoring the drift value still a good thing to do?
* Should we be using PTP?
* Any chance of getting a GPS receiver at the main clusters?
* Should we use `chrony`, which is "aimed at ordinary computers, which are unstable, go into sleep mode or have intermittent connection to the Internet. chrony is also designed for virtual machines, a much more unstable environment"? I think current Ubuntu images already use it.
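As a starting point, we could quantify the current inter-cluster skew with something like the following, run from each site; `ntpdate -q` only queries and does not set the clock, and the hostnames are illustrative:

```
#!/bin/sh
# Sketch: measure offset from each cluster's ntp1 server without
# adjusting the local clock. Hostnames are illustrative.
for srv in ntp1.emulab.net ntp1.utah.cloudlab.us ntp1.clemson.cloudlab.us; do
    ntpdate -q "$srv"
done

# If chrony is in use locally, this shows the estimated offset and drift:
chronyc tracking
```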
At the very least, we should probably move the `ntp1` alias off of `ops`, which is a VM at all sites but Emulab, and onto the control node instead, where there would be a more stable clock.

#657 Emulab storage servers acting flaky
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/657 (Mike Hibler; updated 2022-02-08)

During the recently completed storage upgrade of the storage box (#656), both of the SAS-attached storage servers exhibited flaky behavior. At one time or another, both rebooted suddenly, and during reboots (expected or not) both had a tendency to hang as the OS was coming up. Additionally, `dbox2` was showing some
```
Processor #0x2d Asserted IERR.
```
errors. I could find no documentation about this, but there were statements that this is likely an error detected by the processor and not an error with the processor itself.

#655 Configuration of special devices through adjacent nodes in an experiment
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/655 (reported by Kirk Webb, assigned to Leigh Stoller; updated 2022-01-13)

It would be useful to be able to configure (load images, etc.) special devices as part of experiment instantiation. Here, "special devices" means anything that doesn't come up under PXE boot control for setup. Examples include USRP X310 software-defined radios; previously we also had COTS "nano" eNodeB devices from ip.access that required configuration specific to each experiment.
One way to handle these devices is to have network-adjacent nodes in the experiment do the configuration work. As an example, X310 radios are almost always paired with a single compute node. This node could be identified during instantiation, and appropriate commands could be scheduled to run on it to load firmware images, set configuration parameters (if required), etc. Such proxying of device setup would allow us to specify disk images to load and other configuration steps from the profile script in a first-class manner.
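For the X310 case, the proxied setup on the paired compute node could boil down to something like this UHD invocation; the radio address and FPGA image path are illustrative:

```
#!/bin/sh
# Sketch: from the network-adjacent compute node, flash the paired X310's
# FPGA image, then verify the radio. Address and image path are illustrative.
ADDR=192.168.40.2

uhd_image_loader --args="type=x300,addr=$ADDR" \
                 --fpga-path=/usr/share/uhd/images/usrp_x310_fpga_HG.bit
uhd_usrp_probe --args="addr=$ADDR"
```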
#654 Speed up the Emulab database
https://gitlab.flux.utah.edu/emulab/emulab-devel/-/issues/654 (Mike Hibler; updated 2021-12-06)

It is becoming increasingly clear, if it wasn't already, that the database is one of our primary bottlenecks for allowing instantiation of large numbers of experiments at once. The options are either to speed up our mysql setup/schema or to switch to a different DB.

For the former, we can further attempt to optimize our MyISAM tables:
* https://dev.mysql.com/doc/refman/5.7/en/optimizing-myisam.html
which is straightforward but will provide minimal payoff. Or we can switch to InnoDB, which supports better parallelism, but at a non-trivial cost to convert:
* https://dev.mysql.com/doc/refman/5.7/en/converting-tables-to-innodb.html
or we could try clustering or replication:
* https://dev.mysql.com/doc/refman/5.7/en/mysql-cluster.html
* https://dev.mysql.com/doc/refman/5.7/en/replication.html
but I am not sure that those make sense in our environment, which doesn't need to scale _that_ far and has a very small footprint infrastructure-wise (one or two servers).
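Before considering a different DB entirely: if we went the InnoDB route, the mechanical part of the conversion is scriptable. A sketch, assuming the testbed database is named `tbdb`; this should be run against a copy first, since some tables may rely on MyISAM-specific behavior:

```
#!/bin/sh
# Sketch: generate and apply ALTER statements converting every MyISAM table
# in the testbed DB to InnoDB. Test on a copy of the DB first.
DB=tbdb

mysql -N -e "SELECT CONCAT('ALTER TABLE \`', table_name, '\` ENGINE=InnoDB;')
             FROM information_schema.tables
             WHERE table_schema='$DB' AND engine='MyISAM';" | mysql "$DB"
```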
Switching databases would be a lot more work, with no guarantee of better performance. MariaDB:
* https://mariadb.org/
is a fork of mysql and claims to be faster/better/stronger. It would probably be the easiest to transition to. PostgreSQL:
* https://www.postgresql.org/
is more featureful and better for very large DBs, but seems like overkill for us. The transition is likely to be extremely painful as well.