topic CPU Health Check has failed in ExtremeSwitching (Other)

CPU Health Check has failed

Sarah_Seidl — Mon, 19 Jun 2017 16:32:00 GMT

Hello, a couple of days ago we lost communication to an 8-slot 460-48p switch stack and were alerted to it by a NetSight alert:

Cpu HealthCheck has failed. Slot ExtremeXOS (Stack) version 15.3.1.4 v1531b4-patch1-44 by release-manager on Fri Sep 5 16:29:36 EDT 2014 Error Type 7 Action hardwareFail(4) Retries autoRecovery(5)

I was able to log into the stack of (8) 460-48p's. However, only slot 1 had a role (Master). Switches 2 through 8 had a role of (None). A reboot cleared the issue up. I booted into the other partition, which has a 16 code (had that planned already). Looking back at our records, we had the same message roughly a year ago and things went down then too. Is there something I can tweak so that if a slot has a problem it recovers by itself?

Or is this maybe a known issue with the 15.3.1.4 patch 1-44 code? Or maybe it's a hardware issue?

Thank you

Sarah

configure slot 1 module X460-48p
configure sys-recovery-level slot 1 reset
configure slot 2 module X460-48p
configure sys-recovery-level slot 2 reset
configure slot 3 module X460-48p
configure sys-recovery-level slot 3 reset
configure slot 4 module X460-48p
configure sys-recovery-level slot 4 reset
configure slot 5 module X460-48p
configure sys-recovery-level slot 5 reset
configure slot 6 module X460-48p
configure sys-recovery-level slot 6 reset
configure slot 7 module X460-48p
configure sys-recovery-level slot 7 reset
configure slot 8 module X460-48p
configure sys-recovery-level slot 8 reset

RE: CPU Health Check has failed

BrandonC — Mon, 19 Jun 2017 20:05:00 GMT

Hi Sarah,

Did you happen to try logging into one of the non-master nodes during the failure? I'm curious what they saw their role as at the time.

Also, did you check 'show slot'? I'd like to know what the status of the non-master nodes was.

'Show log' from both the master and one of the failed nodes may be helpful as well, but since it was rebooted and the issue was a few days ago, there's a possibility we may have lost the logs during the failure.

RE: CPU Health Check has failed

Sarah_Seidl — Mon, 19 Jun 2017 20:21:00 GMT

Hi Brandon,

Thanks for the reply. I only ran the show stacking command, not show slot (slot 1 was active and master; the rest had number assignments but no role). I didn't think to try to telnet into the other slots to see.

There are still some messages in NVRAM, for example from slot 2. They indicate no master for all slots:

06/18/2017 08:08:19.03 Slot-2: Slot-3 FAILED (1) No Master
06/18/2017 08:08:19.03 Slot-2: Slot-5 FAILED (1) No Master
06/18/2017 08:08:19.03 Slot-2: Slot-7 FAILED (1) No Master
06/18/2017 08:08:19.02 Slot-2: Slot-4 FAILED (1) No Master
06/18/2017 08:08:19.02 Slot-2: Slot-2 FAILED (1) No Master
06/18/2017 08:08:18.49 Slot-2: Slot-1 FAILED (1)
06/18/2017 08:08:18.48 Slot-2: Slot-6 FAILED (1) No Master
06/18/2017 08:08:18.46 Slot-2: Node State[3] = FAIL (No Master)
06/18/2017 08:08:18.46 Slot-2: PRIMARY NODE (Slot-1) DOWN
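
For anyone who wants to summarize this kind of NVRAM output across several slots, here is a minimal Python sketch. It assumes the log lines follow exactly the layout quoted above (date, time, reporting slot, subject slot, `FAILED (code)`, optional reason). The regular expression and the `parse_failures` helper are illustrative names of my own, not part of any Extreme tool.

```python
import re

# Log lines copied from the slot 2 NVRAM excerpt above.
LOG = """\
06/18/2017 08:08:19.03 Slot-2: Slot-3 FAILED (1) No Master
06/18/2017 08:08:19.03 Slot-2: Slot-5 FAILED (1) No Master
06/18/2017 08:08:19.03 Slot-2: Slot-7 FAILED (1) No Master
06/18/2017 08:08:19.02 Slot-2: Slot-4 FAILED (1) No Master
06/18/2017 08:08:19.02 Slot-2: Slot-2 FAILED (1) No Master
06/18/2017 08:08:18.49 Slot-2: Slot-1 FAILED (1)
06/18/2017 08:08:18.48 Slot-2: Slot-6 FAILED (1) No Master
06/18/2017 08:08:18.46 Slot-2: Node State[3] = FAIL (No Master)
06/18/2017 08:08:18.46 Slot-2: PRIMARY NODE (Slot-1) DOWN
"""

# Assumed line layout (inferred only from the excerpt above):
# <date> <time> <reporter>: <subject> FAILED (<code>) [reason]
FAILED_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{2}) "
    r"(?P<reporter>Slot-\d+): (?P<subject>Slot-\d+) "
    r"FAILED \((?P<code>\d+)\)\s*(?P<reason>.*)$"
)


def parse_failures(text):
    """Return one dict per 'Slot-N FAILED' line; other lines are skipped."""
    failures = []
    for line in text.splitlines():
        match = FAILED_RE.match(line.strip())
        if match:
            failures.append(match.groupdict())
    return failures


failures = parse_failures(LOG)

# Slots reported as failed specifically because no master was present.
no_master = sorted(f["subject"] for f in failures if f["reason"] == "No Master")

# Slots reported as failed without a stated reason.
no_reason = sorted(f["subject"] for f in failures if f["reason"] == "")

print("Failed with 'No Master':", ", ".join(no_master))
print("Failed with no reason given:", ", ".join(no_reason))
```

Separating the slots that failed with "No Master" from the one that failed without a reason makes it easier to see that, from slot 2's point of view, slot 1 went down first and the others followed.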