A better mechanism for moving data between concurrent experiments
Looking at some of our recent control net traffic abuse, the common thread is people moving large amounts of data between nodes in different experiments. In one case, they have a semi-persistent single-node experiment, and as they get GPU nodes one at a time, they copy their data over to those and then (I think) copy it back before the GPU node expires. In another case last night, they were moving data (I think) from the NVMe drives in an older experiment onto the NVMe drives in a newer one, presumably because the older experiment was about to expire. The former is probably about getting some continuity when they can only get GPU nodes for short periods of time. The latter is maybe because the two experiments overlap and they cannot use a persistent dataset RW in two experiments at once.
Anyway, there does seem to be a desire to move data between concurrent experiments. Right now, the easiest way to do that is over the control net, either directly with scp
or indirectly with NFS (/proj). I am pondering whether there is a better way. Possibilities:
- Use a shared vlan between the experiments so they could do their scp over an experiment net instead (first sketch below).
- Expose a shared filesystem abstraction via the "blockstore" mechanism (second sketch below). That would again use the experiment fabric, but would put load on shared infrastructure.
- Eliminate the need for multiple experiments by making it easy to add and remove nodes from an experiment. Then they could have something like an m400 "NFS server" node (with a blockstore) for their data and add/subtract "good" nodes in a LAN to do the actual work (third sketch below).