Such a brutal ElabinElab hack ... When trying to swapin an actual
experiment from the web interface, I ran into another control network problem, this time in bootinfo. When a node is sitting free, it waits in pxeboot for a bootinfo packet from boss to tell it what to do (this is different from when the node is allocated, where bootinfo tells it what to do in a reply to the initial request). In the PXEWAIT case, we *send* it a packet, addressed to its *control network* address, which in the inner DB is on the inner control network. But of course PXE is really using the outer control network, so packets addressed to the inner control network are never seen by pxeboot.

This is the only (known) case of this happening, and rather than try for some general, overengineered solution, I did something unusual and put in a hack, ifdefed for ELABINELAB (meaning, it's an inner elab). I know, you're thinking, how could he have done such a thing, it's so unlike him! Well, it was damn easy! Anyway, this little hack checks the DB for an interface tagged as role='outer_ctrl' and uses that IP instead of the inner control network IP. When I create the inner DB from the outer DB, I was already leaving the outer control network in place so that bootinfo could find the proper node (again, because the bootinfo request packets are coming from the outer control network, and so the source IP would not otherwise match any node in the DB).

I'd like to say that this is the last problem with swapin, but I see in my other window that the event scheduler failed to start on inner ops with some silly ssh permission denied error. What's that all about?
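For anyone reading along, here is a rough sketch of the address selection the hack performs. It is not the actual bootinfo code (that change is ifdefed for ELABINELAB inside bootinfo itself); it is a small Python illustration under stated assumptions. The `INTERFACES` rows, the helper names `lookup_ip` and `pxewait_target`, the `'ctrl'` role name for the inner control interface, the example addresses, and the fallback to the inner address when no `outer_ctrl` row exists are all hypothetical. Only the `role='outer_ctrl'` tag comes from the description above.

```python
from typing import Optional

# Hypothetical stand-in for the interface rows in the inner DB.
# Only the 'outer_ctrl' role name comes from the commit message;
# 'ctrl' and the addresses are made up for illustration.
INTERFACES = [
    {"node_id": "pc1", "role": "ctrl", "ip": "10.1.0.11"},
    {"node_id": "pc1", "role": "outer_ctrl", "ip": "192.0.2.11"},
    {"node_id": "pc2", "role": "ctrl", "ip": "10.1.0.12"},
]


def lookup_ip(node_id: str, role: str, interfaces=INTERFACES) -> Optional[str]:
    """Return the IP of the node's interface with the given role, if any."""
    for row in interfaces:
        if row["node_id"] == node_id and row["role"] == role:
            return row["ip"]
    return None


def pxewait_target(node_id: str, elabinelab: bool,
                   interfaces=INTERFACES) -> Optional[str]:
    """Pick the address to send the PXEWAIT wakeup packet to.

    In an inner elab, prefer the interface tagged 'outer_ctrl', since
    pxeboot is listening on the outer control network. Falling back to
    the inner control address is an assumption of this sketch.
    """
    if elabinelab:
        outer = lookup_ip(node_id, "outer_ctrl", interfaces)
        if outer is not None:
            return outer
    return lookup_ip(node_id, "ctrl", interfaces)
```

With the rows above, `pxewait_target("pc1", elabinelab=True)` picks the outer address, while the same call with `elabinelab=False` picks the inner one.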