CPU Health Check has failed
‎06-19-2017 09:32 AM
Hello, a couple of days ago we lost communication with an 8-slot X460-48p switch stack. We were alerted to it by a NetSight alert:
Cpu HealthCheck has failed. Slot ExtremeXOS (Stack) version 15.3.1.4 v1531b4-patch1-44 by release-manager on Fri Sep 5 16:29:36 EDT 2014 Error Type 7 Action hardwareFail(4) Retries autoRecovery(5)
I was able to log into the stack of eight X460-48p's; however, only slot 1 had a role (Master). Slots 2 through 8 had a role of None. A reboot cleared the issue up. I booted into the other partition, which has 16.x code (I had that planned already). Looking back at our records, we had the same message roughly a year ago, and things went down then too. Is there something I can tweak so that if a slot has a problem it recovers by itself?
Or is this a known issue with the 15.3.1.4 patch 1-44 code? Or maybe it's a hardware issue?
Thank you
Sarah
configure slot 1 module X460-48p
configure sys-recovery-level slot 1 reset
configure slot 2 module X460-48p
configure sys-recovery-level slot 2 reset
configure slot 3 module X460-48p
configure sys-recovery-level slot 3 reset
configure slot 4 module X460-48p
configure sys-recovery-level slot 4 reset
configure slot 5 module X460-48p
configure sys-recovery-level slot 5 reset
configure slot 6 module X460-48p
configure sys-recovery-level slot 6 reset
configure slot 7 module X460-48p
configure sys-recovery-level slot 7 reset
configure slot 8 module X460-48p
configure sys-recovery-level slot 8 reset
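For context, the sys-recovery-level setting controls what ExtremeXOS does when it detects a hardware fault on a slot; "reset" reboots the affected slot instead of leaving it in a failed state, which is the kind of self-recovery asked about above. A minimal way to confirm the stack's state after applying these commands, using only the show commands discussed in this thread (output will vary by stack):

show stacking
show slot
show log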
2 Replies
‎06-19-2017 01:21 PM
Hi Brandon,
Thanks for the reply. I only ran the show stacking command (slot 1 was active and master; the rest had number assignments but no role), not show slot. I didn't think to try to telnet into the other slots to see.
There are some messages still in NVRAM, for example from slot 2; they all indicate no master (all slots):
06/18/2017 08:08:19.03 Slot-2: Slot-3 FAILED (1) No Master
06/18/2017 08:08:19.03 Slot-2: Slot-5 FAILED (1) No Master
06/18/2017 08:08:19.03 Slot-2: Slot-7 FAILED (1) No Master
06/18/2017 08:08:19.02 Slot-2: Slot-4 FAILED (1) No Master
06/18/2017 08:08:19.02 Slot-2: Slot-2 FAILED (1) No Master
06/18/2017 08:08:18.49 Slot-2: Slot-1 FAILED (1)
06/18/2017 08:08:18.48 Slot-2: Slot-6 FAILED (1) No Master
06/18/2017 08:08:18.46 Slot-2: Node State[3] = FAIL (No Master)
06/18/2017 08:08:18.46 Slot-2: PRIMARY NODE (Slot-1) DOWN
‎06-19-2017 01:05 PM
Hi Sarah,
Did you happen to try logging into one of the non-master nodes during the failure? I'm curious what they reported their role as while this was happening.
Also, did you check 'show slot'? I'd like to know what the status of the non-master nodes was.
'show log' output from both the master and one of the failed nodes may be helpful as well, but since the stack was rebooted and the issue was a few days ago, there's a possibility we lost the logs from the failure.
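If it happens again, that data could be gathered in one pass from the master's CLI. A sketch of the sequence (slot 2 is just an example of a non-master node, reachable over the stack since telnetting into other slots was mentioned above):

show stacking
show slot
show log
telnet slot 2
(then on that node: show stacking, show log)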
