X670 Stack failure upon stack member loss

  • Problem
  • Updated 3 years ago
  • Solved
Hi Everyone

We recently lost one switch in our stack of two X670 switches, running XOS version 15.7.1.4, to a power failure. When the power came back and the switch booted, the whole stack immediately crashed and rebooted. I have pasted the log events from around that time below (newest first). I would be grateful if anyone could shed some light on the problem:


08/19/2015 18:20:46.56 <Warn:EPM.reboot> Slot-1: Rebooting with reason User requested switch reboot
08/19/2015 18:09:02.70 <Warn:STP.DomainEnable> Slot-2: STP domain s2 enabled
08/19/2015 18:09:02.64 <Warn:STP.DomainEnable> Slot-2: STP domain s1 enabled
08/19/2015 18:09:02.55 <Warn:STP.DomainEnable> Slot-2: STP domain s0 enabled
08/19/2015 18:08:40.95 <Warn:STP.DomainEnable> Slot-1: STP domain s2 enabled
08/19/2015 18:08:40.31 <Warn:STP.DomainEnable> Slot-1: STP domain s1 enabled
08/19/2015 18:08:40.30 <Warn:STP.DomainEnable> Slot-1: STP domain s0 enabled
08/19/2015 18:07:09.08 <Warn:EPM.UnexpctRebootDtect> Slot-2: Booting after System Failure.
08/19/2015 18:06:58.33 <Warn:EPM.UnexpctRebootDtect> Slot-1: Booting after System Failure.
08/19/2015 18:04:31.51 <Warn:DM.Warning> Slot-1: Slot-2 unexpectedly VLAN_SYNCED from EMPTY
08/19/2015 18:04:31.49 <Warn:HAL.Card.Warning> Slot-1: Unexpected state transition for Slot-2 oldState EMPTY newState VLAN_SYNC_DONE
08/19/2015 18:04:31.32 <Warn:DM.Warning> Slot-1: BACKUP NODE (Slot-2) DOWN
08/19/2015 18:04:31.30 <Erro:HAL.LAG.Error> Slot-1: Failed to create load sharing group 1:31 (tid 5) on slot 2 unit 9: Invalid unit.
08/19/2015 18:04:31.30 <Erro:HAL.LAG.Error> Slot-1: Failed to create load sharing group 1:30 (tid 4) on slot 2 unit 9: Invalid unit.
08/19/2015 18:04:31.30 <Erro:HAL.LAG.Error> Slot-1: Failed to create load sharing group 1:28 (tid 3) on slot 2 unit 9: Invalid unit.
08/19/2015 18:04:31.30 <Erro:HAL.LAG.Error> Slot-1: Failed to create load sharing group 1:26 (tid 2) on slot 2 unit 9: Invalid unit.
08/19/2015 18:04:31.30 <Erro:HAL.LAG.Error> Slot-1: Failed to create load sharing group 2:48 (tid 1) on slot 2 unit 9: Invalid unit.
08/19/2015 18:04:31.30 <Erro:HAL.LAG.Error> Slot-1: Failed to create load sharing group 1:48 (tid 0) on slot 2 unit 9: Invalid unit.
08/19/2015 18:04:31.30 <Erro:HAL.LAG.CfgNoUcastHashFail> Slot-1: Failed to configure hash algorithm for non-unicast packets for link aggregation groups on slot 2 unit 0: Invalid unit
08/19/2015 18:04:31.30 <Erro:HAL.LAG.Error> Slot-1: Failed to obtain trunk HW info, Invalid unit
08/19/2015 18:04:31.29 <Erro:HAL.Port.Error> Slot-1: Port preferred-medium fiber failed for port 48 rv = -3
08/19/2015 18:04:31.29 <Erro:HAL.Port.Error> Slot-1: Port preferred-medium fiber failed for port 47 rv = -3
08/19/2015 18:04:31.29 <Erro:HAL.Port.Error> Slot-1: Port preferred-medium fiber failed for port 46 rv = -3
08/19/2015 18:04:31.29 <Erro:HAL.Port.Error> Slot-1: Port preferred-medium fiber failed for port 45 rv = -3
08/19/2015 18:04:31.29 <Erro:HAL.WRED.CfgFail> Slot-1: Failed to configure WRED parameters on the underlying Hardware on port 2:53 qosprofile 1 with Error: Conduit failure
08/19/2015 18:04:29.66 <Warn:STP.DomainEnable> Slot-1: STP domain s2 enabled
08/19/2015 18:04:29.35 <Warn:STP.DomainEnable> Slot-1: STP domain s1 enabled
08/19/2015 18:04:29.21 <Warn:STP.DomainEnable> Slot-1: STP domain s0 enabled
08/19/2015 18:04:06.60 <Erro:cm.loadErr> Slot-1: Failed to load configuration: timed out (after 150 seconds) while waiting for all applications to get ready to load configuration on MASTER ( eaps is still not ready yet).
08/19/2015 18:00:26.96 <Warn:EPM.UnexpctRebootDtect> Slot-2: Booting after System Failure.
08/19/2015 17:57:41.15 <Warn:EPM.all_shutdown> Slot-1: Shutting down all processes
08/19/2015 17:57:41.15 <Warn:DM.Warning> Slot-1: Slot-1 unexpectedly ACL_SYNCED from FAILED
08/19/2015 17:57:41.15 <Warn:DM.Warning> Slot-1: Slot-1 FAILED (1) Not In Sync
08/19/2015 17:57:41.04 <Erro:cm.sys.msgDrop> Slot-1: Dropped CM_MSG_CHKP_LOAD_ACK: Length -1 Peer 1 (backup)
08/19/2015 17:57:41.04 <Warn:DM.Warning> Slot-1: cfgmgr cannot write msg_id 1 to MASTER connection 0
08/19/2015 17:57:40.75 <Erro:DM.Error> Slot-1: Node State[4] = FAIL (Not In Sync)
08/19/2015 17:57:40.75 <Warn:DM.Warning> Slot-1: NM: Old Primary's state is UNKNOWN
08/19/2015 17:57:40.75 <Warn:DM.Warning> Slot-1: PRIMARY NODE (Slot-2) DOWN
08/19/2015 17:57:38.55 <Warn:STP.DomainEnable> Slot-1: STP domain s2 enabled
08/19/2015 17:57:38.41 <Warn:STP.DomainEnable> Slot-1: STP domain s1 enabled
08/19/2015 17:57:38.19 <Warn:STP.DomainEnable> Slot-1: STP domain s0 enabled
08/19/2015 17:56:39.07 <Warn:EPM.UnexpctRebootDtect> Slot-1: Booting after System Failure.

Many Thanks

Simon
Simon Vosper


Posted 3 years ago

Stephen Williams, Employee

Simon,

From the logs provided, it looks like slot 1 had problems syncing with slot 2 while it was booting. There might be a process crash in there somewhere, but I don't see it. I would open a case with GTAC and provide the data below.

From both slots:

show log
show log messages nvram
show
ls /usr/local/tmp   # If you see a file like "core.ProcessName.ProcessId.gz", please TFTP it off the switch for GTAC.
show debug system-dump
show stacking configuration
show stacking
show slot
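
For a first pass before sending the data to GTAC, it can help to put the pasted log in oldest-first order and pull out the node-down and error entries. The Python below is only a rough sketch: the line layout is inferred from the log in this thread, and the names `parse_log`, `node_down_events`, and `SAMPLE` are made up for illustration and are not part of any Extreme tooling.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line layout, inferred from the log pasted in this thread:
#   MM/DD/YYYY HH:MM:SS.cc <Severity:Component.Event> Slot-N: message
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\.\d{2}) "
    r"<(?P<sev>\w+):(?P<comp>[\w.]+)> "
    r"(?P<slot>Slot-\d+): (?P<msg>.*)$"
)

# Matches the "PRIMARY NODE (Slot-2) DOWN" style messages seen in the log.
NODE_DOWN_RE = re.compile(r"(PRIMARY|BACKUP) NODE \(Slot-\d+\) DOWN")

# A few lines copied from the log above, newest first as the switch printed them.
SAMPLE = """\
08/19/2015 18:04:31.32 <Warn:DM.Warning> Slot-1: BACKUP NODE (Slot-2) DOWN
08/19/2015 18:04:31.30 <Erro:HAL.LAG.Error> Slot-1: Failed to obtain trunk HW info, Invalid unit
08/19/2015 17:57:40.75 <Erro:DM.Error> Slot-1: Node State[4] = FAIL (Not In Sync)
08/19/2015 17:57:40.75 <Warn:DM.Warning> Slot-1: PRIMARY NODE (Slot-2) DOWN
08/19/2015 17:56:39.07 <Warn:EPM.UnexpctRebootDtect> Slot-1: Booting after System Failure.
"""


def parse_log(text):
    """Parse log text and return entries oldest-first; unmatched lines are skipped."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        entry = match.groupdict()
        entry["time"] = datetime.strptime(entry["ts"], "%m/%d/%Y %H:%M:%S.%f")
        entries.append(entry)
    # Assumes the input is newest-first: reversing before the stable sort
    # keeps entries that share a timestamp in their likely original order.
    return sorted(reversed(entries), key=lambda e: e["time"])


def node_down_events(entries):
    """Return only the entries reporting a primary or backup node going down."""
    return [e for e in entries if NODE_DOWN_RE.search(e["msg"])]


entries = parse_log(SAMPLE)
for e in entries:
    print(e["ts"], e["sev"], e["slot"], e["msg"])

print("Node-down events:", [e["msg"] for e in node_down_events(entries)])
print("Entries per severity:", dict(Counter(e["sev"] for e in entries)))
```

Running the same functions over the full `show log` output from each slot gives a timeline that is easier to compare side by side.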

Simon Vosper

Thanks Stephen

I have logged a call with GTAC.