    sparc64: Kill spurious NMI watchdog triggers by increasing limit to 30 seconds. · e6617c6e
    David S. Miller authored
    
    
    This is a compromise and a temporary workaround for bootup NMI
    watchdog triggers some people see with qla2xxx devices present.
    
    This happens when, for example:
    
    CPU 0 is in the driver init and looping submitting mailbox commands to
    load the firmware, then waiting for completion.
    
    CPU 1 is receiving the device interrupts.  CPU 1 is where the NMI
    watchdog triggers.
    
    CPU 0 is submitting mailbox commands fast enough that by the time CPU
    1 returns from the device interrupt handler, a new one is pending.
    This sequence runs for more than 5 seconds.
    
    The problematic case is CPU 1's timer interrupt running when the
    barrage of device interrupts begins.  Then we have:
    
    	timer interrupt
    	return for softirq checking
    	pending, thus enable interrupts
    
    		 qla2xxx interrupt
    		 return
    		 qla2xxx interrupt
    		 return
    		 ... 5+ seconds pass
    		 final qla2xxx interrupt for fw load
    		 return
    
    	run timer softirq
    	return
    
    At some point in the multi-second qla2xxx interrupt storm we trigger
    the NMI watchdog on CPU 1 from the NMI interrupt handler.
    
    The timer softirq, once we get back to running it, is smart enough to
    run the timer work enough times to make up for the missed timer
    interrupts.
    
    However, the NMI watchdogs (both x86 and sparc) use the timer
    interrupt count to notice the cpu is wedged.  But in the above
    scenario we'll receive only one such timer interrupt even if we last
    all the way back to running the timer softirq.
    
    The default watchdog trigger point is only 5 seconds, which is pretty
    low (the softwatchdog triggers at 60 seconds).  So increase it to 30
    seconds for now.
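
The effect of the limit can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel's watchdog code: `struct wd_state`, `wd_nmi_tick()`, `wd_fires_during_storm()` and the `nmi_hz` parameter are hypothetical names chosen for illustration. The model assumes the watchdog fires once the timer interrupt count has stayed unchanged for `limit_secs * nmi_hz` consecutive NMI ticks.

```c
#include <stdbool.h>

/* Hypothetical per-cpu watchdog state; an illustrative model only. */
struct wd_state {
	unsigned long last_timer_count; /* timer irq count seen at the previous NMI */
	unsigned int stalled_ticks;     /* consecutive NMIs with no new timer irq */
};

/*
 * One NMI watchdog tick.  Returns true if the watchdog would fire,
 * i.e. the timer interrupt count has not moved for limit_secs seconds.
 */
static bool wd_nmi_tick(struct wd_state *s, unsigned long timer_count,
			unsigned int limit_secs, unsigned int nmi_hz)
{
	if (timer_count != s->last_timer_count) {
		/* The cpu took a timer interrupt since the last NMI. */
		s->last_timer_count = timer_count;
		s->stalled_ticks = 0;
		return false;
	}
	s->stalled_ticks++;
	return s->stalled_ticks >= limit_secs * nmi_hz;
}

/*
 * Model the interrupt storm: the timer interrupt count stays frozen
 * for storm_secs because the timer softirq never gets to run.
 */
static bool wd_fires_during_storm(unsigned int storm_secs,
				  unsigned int limit_secs,
				  unsigned int nmi_hz)
{
	/* Only the single timer interrupt that preceded the storm is counted. */
	unsigned long timer_count = 1;
	struct wd_state s = { .last_timer_count = 1, .stalled_ticks = 0 };

	for (unsigned int tick = 0; tick < storm_secs * nmi_hz; tick++) {
		if (wd_nmi_tick(&s, timer_count, limit_secs, nmi_hz))
			return true;
	}
	return false;
}
```

In this model a 6 second storm trips a 5 second limit but not a 30 second one, so `wd_fires_during_storm(6, 5, 1)` is true while `wd_fires_during_storm(6, 30, 1)` is false.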
    
    Signed-off-by: David S. Miller <davem@davemloft.net>