topic Stack switch fails and brings the whole LAN down in ExtremeSwitching (EXOS/Switch Engine)

Stack switch fails and brings the whole LAN down

Steven_Marriott — Fri, 15 Sep 2017 13:14:00 GMT

I have x440-48T switch stacks on my local LAN. Twice in the last 6 months the LAN has been brought to a standstill by a failure on a stack, and each time it was a different stack. No one can connect from outside; the whole network is down. We have LACP enabled on all stacks.

This is the error I see. After a reboot of the stack it is fine again, but I don't understand why one failed switch would bring the whole network to a standstill.

09/11/2017 09:07:07.74 Slot-2: perfTimer Execution time of Timer Thread select (38): Min: 0.0 sec Avg: 0.868 sec Max: 1.10 sec
09/11/2017 09:07:07.74 Slot-2: perfTimer Execution time of Timer Thread select (38): Last Execution: 1.10 sec
09/11/2017 09:06:45.35 Slot-2: snmpMaster initialization complete
09/11/2017 09:06:44.01 Slot-2: **** telnetd started *****
09/11/2017 09:06:41.60 Slot-2: DOS protect application started successfully
09/11/2017 09:06:41.49 Slot-2: **** tftpd started *****
09/11/2017 09:06:37.55 Slot-2: snmpSubagent initialization complete
09/11/2017 09:06:37.44 Slot-2: Network Login framework has been initialized
09/11/2017 09:06:34.83 Slot-2: Slot-2 being Powered ON
09/11/2017 09:06:34.73 Slot-2: Node State[1] = INIT
09/11/2017 09:06:34.15 Slot-2: Hal initialization done.
09/11/2017 09:06:33.67 Slot-2: Module in Slot-2 is inserted
09/11/2017 09:06:32.57 Slot-2: Starting hal initialization ....
09/11/2017 09:06:28.39 Slot-2: telnetd listening on port 23

09/11/2017 09:06:19.52 Slot-2: The Node Manager (NM) has started processing.
09/11/2017 09:06:19.14 Slot-2: DM started
09/11/2017 09:06:18.44 Slot-2: EPM Started
09/11/2017 09:06:18.43 Slot-2: Booting after System Failure.
09/11/2017 09:06:17.06 Slot-2: Changing to watchdog warm reset mode
09/11/2017 06:35:15.24 Slot-2: Failed to send SNTP request to server 10.0.100.21
09/11/2017 06:35:15.19 Slot-2: Failed to send SNTP request to server 10.0.100.20
09/11/2017 06:17:14.95 Slot-2: Shutting down all processes
09/11/2017 06:17:14.92 Slot-2: Slot-2 FAILED (1) Backup lost
09/11/2017 06:17:14.51 Slot-1: BACKUP is NOT in SYNC
09/11/2017 06:17:14.50 Slot-1: BACKUP NODE (Slot-2) DOWN
09/11/2017 06:17:14.48 Slot-2: Node State[4] = FAIL (Backup lost)
09/11/2017 06:17:14.48 Slot-2: MASTER decided that I am not BACKUP anymore
09/11/2017 06:17:14.48 Slot-2: BACKUP NODE (Slot-2) DOWN
05/15/2017 17:29:10.14 Slot-1: Disabling port 1:48. Auto re-enable port after 30 seconds
05/15/2017 17:29:10.14 Slot-1: Disabling port 1:48. Auto re-enable port after 30 seconds

RE: Stack switch fails and brings the whole LAN down

Ariyakudi_Srini — Fri, 15 Sep 2017 13:38:00 GMT

Hi Steven,

A few things to consider here.

The stack seems to have detected a loop, and ELRP seems to have disabled port 1:48 in the stack.

05/15/2017 17:29:10.14 Slot-1: Disabling port 1:48. Auto re-enable port after 30 seconds

This event is followed by the Slot-2 reboot, after which the log records the messages below:

09/11/2017 09:07:07.74 Slot-2: perfTimer Execution time of Timer Thread select (38): Min: 0.0 sec Avg: 0.868 sec Max: 1.10 sec
09/11/2017 09:07:07.74 Slot-2: perfTimer Execution time of Timer Thread select (38): Last Execution: 1.10 sec

Please check this GTAC article about the above log messages:

https://gtacknowledge.extremenetworks.com/articles/Q_A/What-are-perfTimer-Execution-messages

As for the loop, it is very likely that it could stop the switch from processing traffic by flooding the switch CPU with a huge volume of broadcasts, bringing the switch almost to a standstill.

The log message indicates that ELRP broke the loop by disabling the port.

Has the ELRP PDU timer been changed by any chance?

Also, please check the output of "top" on the switch at the time of the freeze, using the console/serial cable, to see whether the CLI is functional and whether the "bcmRx" process or any other process spikes.

Thank You,

RE: Stack switch fails and brings the whole LAN down

Steven_Marriott — Fri, 15 Sep 2017 13:53:00 GMT

Sorry, I copied earlier log messages that have nothing to do with this issue. The logs for this issue start here.

As I said in my previous post, why would this stack take the whole network down?

09/11/2017 09:06:19.52 Slot-2: The Node Manager (NM) has started processing.
09/11/2017 09:06:19.14 Slot-2: DM started
09/11/2017 09:06:18.44 Slot-2: EPM Started
09/11/2017 09:06:18.43 Slot-2: Booting after System Failure.
09/11/2017 09:06:17.06 Slot-2: Changing to watchdog warm reset mode
09/11/2017 06:35:15.24 Slot-2: Failed to send SNTP request to server 10.0.100.21
09/11/2017 06:35:15.19 Slot-2: Failed to send SNTP request to server 10.0.100.20
09/11/2017 06:17:14.95 Slot-2: Shutting down all processes
09/11/2017 06:17:14.92 Slot-2: Slot-2 FAILED (1) Backup lost
09/11/2017 06:17:14.51 Slot-1: BACKUP is NOT in SYNC
09/11/2017 06:17:14.50 Slot-1: BACKUP NODE (Slot-2) DOWN
09/11/2017 06:17:14.48 Slot-2: Node State[4] = FAIL (Backup lost)
09/11/2017 06:17:14.48 Slot-2: MASTER decided that I am not BACKUP anymore
09/11/2017 06:17:14.48 Slot-2: BACKUP NODE (Slot-2) DOWN
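
The log above is pasted newest-first, which makes the sequence of events easy to misread. As a reading aid, here is a minimal Python sketch that re-sorts a few of those lines chronologically and measures the time between the Slot-2 failure and the reboot. It assumes the "MM/DD/YYYY HH:MM:SS.ff Slot-N: message" layout seen in this post; the helper names (parse_line, timeline, first_match) are made up for illustration and are not part of any EXOS tooling.

```python
from datetime import datetime

# A few log lines copied from the post above, in the order they were pasted
# (newest first).
LOG = """\
09/11/2017 09:06:18.43 Slot-2: Booting after System Failure.
09/11/2017 09:06:17.06 Slot-2: Changing to watchdog warm reset mode
09/11/2017 06:35:15.24 Slot-2: Failed to send SNTP request to server 10.0.100.21
09/11/2017 06:17:14.95 Slot-2: Shutting down all processes
09/11/2017 06:17:14.92 Slot-2: Slot-2 FAILED (1) Backup lost
09/11/2017 06:17:14.48 Slot-2: MASTER decided that I am not BACKUP anymore
"""


def parse_line(line):
    """Split one log line into (timestamp, slot, message)."""
    date_part, time_part, rest = line.split(" ", 2)
    stamp = datetime.strptime(f"{date_part} {time_part}", "%m/%d/%Y %H:%M:%S.%f")
    slot, message = rest.split(": ", 1)
    return stamp, slot, message


def timeline(text):
    """Return all non-empty log lines as tuples, oldest first."""
    entries = [parse_line(line) for line in text.splitlines() if line.strip()]
    return sorted(entries, key=lambda entry: entry[0])


def first_match(entries, needle):
    """Return the earliest entry whose message contains `needle` (case-sensitive)."""
    return next(entry for entry in entries if needle in entry[2])


entries = timeline(LOG)
failed = first_match(entries, "FAILED")
booted = first_match(entries, "Booting after System Failure")
gap = booted[0] - failed[0]

for stamp, slot, message in entries:
    print(stamp.isoformat(sep=" ", timespec="milliseconds"), slot, message)
print("Time between failure and reboot:", gap)
```

Read oldest-first, Slot-2 is marked FAILED at 06:17:14 and only logs "Booting after System Failure" at 09:06:18, roughly 2 hours 49 minutes later.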

Thank you for your quick response.

RE: Stack switch fails and brings the whole LAN down

Ariyakudi_Srini — Fri, 15 Sep 2017 16:10:00 GMT

Hi Steven,

From the logs available above, there is nothing substantial enough to reach a conclusion about the stack freeze.

Did you have the opportunity to check the master switch CLI over the console? Does it also not respond?
What EXOS version is the stack running?

You mentioned that the issue seems to resolve after a reboot. Do you mean a power cycle of the stack units, or a reboot of the stack from the CLI?

Thank You,