ExtremeSwitching (Other)


CPU Health Check has failed

  • 1.  CPU Health Check has failed

    Posted 06-19-2017 09:32
    Hello, a couple of days ago we lost communication with an 8-slot X460-48p switch stack. We were alerted to it by a NetSight alert:

    Cpu HealthCheck has failed. Slot ExtremeXOS (Stack) version 15.3.1.4 v1531b4-patch1-44 by release-manager on Fri Sep 5 16:29:36 EDT 2014 Error Type 7 Action hardwareFail(4) Retries autoRecovery(5)

    I was able to log into the stack of (8) X460-48p's. However, only slot 1 had a role (Master); switches 2 through 8 had a role of (None). A reboot cleared the issue. I booted into the other partition, which has a 16 code (I had that planned already). Looking back at our records, we had the same message roughly a year ago, and things went down then too. Is there something I can tweak so that a slot with a problem would recover by itself?

    Or is this maybe a known issue with the 15.3.1.4 patch 1-44 code? Or maybe it's a hardware issue?

    Thank you

    Sarah

    configure slot 1 module X460-48p
    configure sys-recovery-level slot 1 reset
    configure slot 2 module X460-48p
    configure sys-recovery-level slot 2 reset
    configure slot 3 module X460-48p
    configure sys-recovery-level slot 3 reset
    configure slot 4 module X460-48p
    configure sys-recovery-level slot 4 reset
    configure slot 5 module X460-48p
    configure sys-recovery-level slot 5 reset
    configure slot 6 module X460-48p
    configure sys-recovery-level slot 6 reset
    configure slot 7 module X460-48p
    configure sys-recovery-level slot 7 reset
    configure slot 8 module X460-48p
    configure sys-recovery-level slot 8 reset





  • 2.  RE: CPU Health Check has failed

    Posted 06-19-2017 13:05
    Hi Sarah,

    Did you happen to try logging into one of the non-master nodes during the failure? I'm curious what role they reported for themselves at the time.

    Also, did you check 'show slot'? I'd like to know what the status of the non-master nodes was.

    'Show log' from both the master and one of the failed nodes may be helpful as well, but since it was rebooted and the issue was a few days ago, there's a possibility we may have lost the logs during the failure.
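
    If it happens again, a rough sequence for capturing this before rebooting might look like the following. This is only a sketch, assuming standard ExtremeXOS SummitStack commands; exact syntax can differ between releases, and slot 2 is just an example of a failed node.

    ```
    # On the master: stack membership, roles, and per-slot state
    show stacking
    show slot

    # On the master: current log, then the messages saved to NVRAM
    show log
    show log messages nvram

    # Open a session to a failed node (slot 2 as an example), then repeat the show commands there
    telnet slot 2
    ```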


  • 3.  RE: CPU Health Check has failed

    Posted 06-19-2017 13:21
    Hi Brandon,

    Thanks for the reply. I only ran the show stacking command (slot 1 was active and Master; the rest had slot number assignments but no role), not show slot. I didn't think to try to telnet into the other slots to check.

    There are some messages still in NVRAM, for example from slot 2. They all indicate no master (all slots):

    06/18/2017 08:08:19.03