On the destructive tendencies of `prepare`
This is prompted most recently by issue #303 (closed), but we (well, I anyway) have long had concerns about whether a node we take an image of should come back up afterward as though nothing had happened. That behavior is at odds with the desire to have all experiment-specific and even cluster-specific state removed from the image being taken.
A great deal of what we remove/undo in the `prepare` script prior to taking an image gets replaced by our startup scripts when the node comes back up, but not everything. Basically, any one-time action done by `slicefix` at swapin time (as opposed to done every reboot by the startup scripts) is a potential problem.
The current bugaboo is the cleaning of SSH state from an image. If we remove the host keys (`/etc/ssh/ssh_host_*`) then at the next reboot, the node will either generate new host keys (FreeBSD) or wind up with no host keys (Linux, at least Ubuntu). The latter is particularly bad because then even boss root cannot ssh into the node.
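To make the Linux failure mode concrete, here is a minimal sketch of a boot-time check that would notice the missing host keys and regenerate them. This is not what our startup scripts do today; the function names and the idea of running it at boot are assumptions for illustration. The only real external command used is `ssh-keygen -A`, which generates any default host key types that do not already exist.

```python
import glob
import os
import subprocess


def missing_host_keys(ssh_dir="/etc/ssh"):
    """Return True if ssh_dir contains no private host key files."""
    # Private keys are named like ssh_host_rsa_key, ssh_host_ed25519_key, ...
    candidates = glob.glob(os.path.join(ssh_dir, "ssh_host_*_key"))
    return len(candidates) == 0


def ensure_host_keys(ssh_dir="/etc/ssh", regenerate=None):
    """Regenerate host keys if none exist. Returns True if regeneration ran.

    `regenerate` is a hypothetical hook so the check can be exercised
    without touching the real system.
    """
    if not missing_host_keys(ssh_dir):
        return False
    if regenerate is None:
        # `ssh-keygen -A` writes to the system default key location, so
        # ssh_dir only controls where we look, not where keys are created.
        def regenerate():
            subprocess.run(["ssh-keygen", "-A"], check=True)
    regenerate()
    return True
```

Something along these lines, run on every boot rather than once at swapin, would turn the "no host keys" case into the FreeBSD-like "new host keys" case, which at least leaves the node reachable.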
On numerous occasions we have declared this destructive behavior to be okay, based on our typical usage of swapping in single node experiments for the express purpose of creating a new image and then swapping them out afterward.
But this is not desirable behavior when a user is snapshotting one or more nodes for the purposes of checkpointing or creating a backup and then continuing on. However, I have only circumstantial evidence that people actually do this (e.g., the presence of a user image with 100+ versions, created frequently).
One compromise position here would be to have a "prepare harder" option that we only use when we are creating images for export to other sites. That is, only use this option when updating system images. The normal prepare mode would avoid these destructive behaviors. Really though, we would have to do it for any image that is "global" to avoid "trojan images" (see the whole Alice/Bob thing in #303 (closed)).
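A rough sketch of how that mode selection might look, just to pin down the proposal. Everything here is hypothetical: the mode names, the step names, and the `image_is_global` / `for_export` inputs are placeholders, not existing `prepare` options or database fields.

```python
from enum import Enum


class PrepareMode(Enum):
    NORMAL = "normal"   # leave the node usable after the image is taken
    HARDER = "harder"   # scrub everything, node may need a reload afterward


# Hypothetical step names, standing in for whatever prepare actually does.
NONDESTRUCTIVE_STEPS = ["clear_logs", "clear_experiment_state"]
DESTRUCTIVE_STEPS = ["remove_ssh_host_keys"]


def choose_mode(image_is_global, for_export=False):
    """Global or exported images always get the destructive treatment."""
    if image_is_global or for_export:
        return PrepareMode.HARDER
    return PrepareMode.NORMAL


def steps_for(mode):
    """Return the ordered list of cleanup steps for the given mode."""
    steps = list(NONDESTRUCTIVE_STEPS)
    if mode is PrepareMode.HARDER:
        steps += DESTRUCTIVE_STEPS
    return steps
```

The point of keying on "global" rather than on an explicit flag alone is that a user cannot opt out of the scrub for an image other people will boot.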
Thoughts?