-
Leigh B. Stoller authored
* Add boot_errno to the nodes table so that nodes can report in a subcode to indicate what went wrong. At present, we do not report any real error codes; that is going to take some time to work out since it will reqiure a bunch of changes to the boot scripts. * Add new table node_bootlogs to store logs provided by the nodes. Not a full console log, but a log of the tmcd client side part. We can make it a full log if we want though; just means mucking about with the boot phase a bit. * Add new state transition to NORMALv2 and PCVM state machines. "TBFAILED" is a new state that is sent (after TBSETUP) if a node fails somewhere in the tmcd client side. * Change TBNodeStateWait() to take a list of states (instead of single state) and an optional pass by reference parameter to return the actual state that the node landed in. Change all calls to TBNodeStateWait() of course. * Change os_setup (and libreboot in wait mode) to look for both TBFAILED and ISUP. If a TBFAILED event is seen, we can terminate the wait early and not retry os_setup on physical nodes (although still retry virtual nodes). The nice thing about this is that the wait should terminate much earlier (rather then waiting for timeout), especially for virtual nodes which can take a really long time when there are a couple of hundred. * Add new routines dobooterrno() and dobootlog() to tmcd. Bump version number and increase the buffer size to allow for the larger packets that a console log wikk generate (added MAXTMCDPACKET variable, set to 0x4000). * Add new -f option to tmcc to specify a datafile to send along as the last argument to tmcd. This is more pleasing then trying to send a console log in on the command line. For example: "tmcc -f /tmp/log BOOTLOG" will send a BOOTLOG command along with the contents of /tmp/log. Also close the write side of the pipe so that server sees EOF on read. See aside comment below. * Changes to rc.bootsetup: 1. Use perl tricks to capture all output, duping to the console and to a log file in /var/emulab/logs. 2. On any error, send a status code (boot_errno) and the bootlog to tmcd. 3. Generate a TBFAILED state transition. * Changes to rc.injail: 1. Same as rc.bootsetup, but do not send log files; that would pummel boss. Leave them on the physical node. * Change vnodesetup (which calls mkjail) to watch for any error and send a TBFAILED state transition. This should catch almost all errors, and dramatically reduce waiting when something fails. * Changes to rc.cdboot are essentially the same as rc.bootsetup, although a bootlog is sent all the time (success or failure), and I do not generate a boot_errno yet. Also, instead of TBFAILED, generate a PXEFAILED state since the CDROM is actually operating within the PXEFBSD opmode. I have yet to work this into the rest of the system though; waiting to get a new CD built and actually experiment with it. * Add new menu option and web page to display the node bootlog. We store only the lastest bootlog, but maybe someday store more then one. Display boot_errno on node page. Aside: I made a big mistake in the tmcd protocol; I did not envision passing more then a small amount of data (one fragment) and so I do not include a record terminator (ie: close of the write side on the client sends EOF) or a size field at the beginning. No big deal since small requests are sent in one fragment and the server sees the entire thing. Well, with a large console log, that will end up as multiple fragments, and the server will often not get the entire thing on the first read, and there are no subsequent reads (with no EOF or known size, it would block forever). Well, fixing this in a backwards compatable manner (for old images) was way too much pain. Instead, tmcc now closes the write side, and the server does subsequent reads *only* in the new dobbootlog() routine. Note that it *is* possible to fix this in a backwards compatable manner, but I did not want to go down that path just yet.
Leigh B. Stoller authored* Add boot_errno to the nodes table so that nodes can report in a subcode to indicate what went wrong. At present, we do not report any real error codes; that is going to take some time to work out since it will reqiure a bunch of changes to the boot scripts. * Add new table node_bootlogs to store logs provided by the nodes. Not a full console log, but a log of the tmcd client side part. We can make it a full log if we want though; just means mucking about with the boot phase a bit. * Add new state transition to NORMALv2 and PCVM state machines. "TBFAILED" is a new state that is sent (after TBSETUP) if a node fails somewhere in the tmcd client side. * Change TBNodeStateWait() to take a list of states (instead of single state) and an optional pass by reference parameter to return the actual state that the node landed in. Change all calls to TBNodeStateWait() of course. * Change os_setup (and libreboot in wait mode) to look for both TBFAILED and ISUP. If a TBFAILED event is seen, we can terminate the wait early and not retry os_setup on physical nodes (although still retry virtual nodes). The nice thing about this is that the wait should terminate much earlier (rather then waiting for timeout), especially for virtual nodes which can take a really long time when there are a couple of hundred. * Add new routines dobooterrno() and dobootlog() to tmcd. Bump version number and increase the buffer size to allow for the larger packets that a console log wikk generate (added MAXTMCDPACKET variable, set to 0x4000). * Add new -f option to tmcc to specify a datafile to send along as the last argument to tmcd. This is more pleasing then trying to send a console log in on the command line. For example: "tmcc -f /tmp/log BOOTLOG" will send a BOOTLOG command along with the contents of /tmp/log. Also close the write side of the pipe so that server sees EOF on read. See aside comment below. * Changes to rc.bootsetup: 1. Use perl tricks to capture all output, duping to the console and to a log file in /var/emulab/logs. 2. On any error, send a status code (boot_errno) and the bootlog to tmcd. 3. Generate a TBFAILED state transition. * Changes to rc.injail: 1. Same as rc.bootsetup, but do not send log files; that would pummel boss. Leave them on the physical node. * Change vnodesetup (which calls mkjail) to watch for any error and send a TBFAILED state transition. This should catch almost all errors, and dramatically reduce waiting when something fails. * Changes to rc.cdboot are essentially the same as rc.bootsetup, although a bootlog is sent all the time (success or failure), and I do not generate a boot_errno yet. Also, instead of TBFAILED, generate a PXEFAILED state since the CDROM is actually operating within the PXEFBSD opmode. I have yet to work this into the rest of the system though; waiting to get a new CD built and actually experiment with it. * Add new menu option and web page to display the node bootlog. We store only the lastest bootlog, but maybe someday store more then one. Display boot_errno on node page. Aside: I made a big mistake in the tmcd protocol; I did not envision passing more then a small amount of data (one fragment) and so I do not include a record terminator (ie: close of the write side on the client sends EOF) or a size field at the beginning. No big deal since small requests are sent in one fragment and the server sees the entire thing. Well, with a large console log, that will end up as multiple fragments, and the server will often not get the entire thing on the first read, and there are no subsequent reads (with no EOF or known size, it would block forever). Well, fixing this in a backwards compatable manner (for old images) was way too much pain. Instead, tmcc now closes the write side, and the server does subsequent reads *only* in the new dobbootlog() routine. Note that it *is* possible to fix this in a backwards compatable manner, but I did not want to go down that path just yet.