X460G2 (Stack) - stack node crash

  • Problem
  • Updated 3 years ago
  • Solved
Today one node crashed in a two-node X460G2-48p-10G4 stack, and things began to become unresponsive. After checking the Chalet GUI and connecting through the serial port, I saw that node 1 was unresponsive.

The error occurred on ExtremeXOS 15.7.1.4; I have since installed 15.7.2.9.

Event logs:

2015-09-07 19:24:04.46 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 111 sec, kepc 0xffffffff805fa5f4(__cond_resched+0x20/0x44) uepc 0x2acdb150.
2015-09-07 19:24:03.48 <Crit:Kern.Card.Alert> Slot-1: 2acdb164 00000000 nop
2015-09-07 19:24:03.48 <Crit:Kern.Card.Alert> Slot-1: 2acdb160 8f8393ac lw v1,-27732(gp)
2015-09-07 19:24:03.48 <Crit:Kern.Card.Alert> Slot-1: 2acdb158 7c03e83b Unknown at 0x2acdb158, 0x7c03e83b, op 31
2015-09-07 19:24:03.48 <Crit:Kern.Card.Alert> Slot-1: 2acdb154 00408021 addu s0,v0,zero
2015-09-07 19:24:03.48 <Crit:Kern.Card.Alert> Slot-1: 2acdb150 <10e00008>beq a3,zero,0x2acdb174
2015-09-07 19:24:03.48 <Crit:Kern.Card.Alert> Slot-1: 2acdb14c 0000000c syscall 0
2015-09-07 19:24:03.48 <Crit:Kern.Card.Alert> Slot-1: 2acdb15c 00601021 addu v0,v1,zero
2015-09-07 19:24:03.42 <Crit:Kern.Card.Alert> Slot-1: 2acdb148 24020fa7
2015-09-07 19:24:03.42 <Crit:Kern.Card.Alert> Slot-1: 2acdb144 02003021 addu a2,s0,zero
2015-09-07 19:24:03.42 <Crit:Kern.Card.Alert> Slot-1: Code:
2015-09-07 19:24:03.42 <Crit:Kern.Card.Alert> Slot-1:
2015-09-07 19:24:03.42 <Crit:Kern.Card.Alert> Slot-1: Process epm pid 1141 died with signal 6
2015-09-07 19:24:03.42 <Crit:Kern.Card.Emergency> Slot-1: Application watchdog killing process 1141(epm) in state 1.
2015-09-07 19:24:03.41 <Crit:Kern.Card.Critical> Slot-1: App timer for index 0 app: (epm) expired, delta 12031 timeout: 120000
2015-09-07 19:23:53.70 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 111 sec, kepc 0xffffffff802dcca8(do_wait+0x2d0/0x478) uepc 0x2acdb150.
2015-09-07 19:23:43.62 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 101 sec, kepc 0xffffffff802dcca8(do_wait+0x2d0/0x478) uepc 0x2acdb150.
2015-09-07 19:23:33.51 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 90 sec, kepc 0xffffffff802dcca8(do_wait+0x2d0/0x478) uepc 0x2acdb150.
2015-09-07 19:23:23.36 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 80 sec, kepc 0xffffffff802dcca8(do_wait+0x2d0/0x478) uepc 0x2acdb150.
2015-09-07 19:23:13.56 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 70 sec, kepc 0xffffffff802dcca8(do_wait+0x2d0/0x478) uepc 0x2acdb150.
2015-09-07 19:23:03.34 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 60 sec, kepc 0xffffffff802dcca8(do_wait+0x2d0/0x478) uepc 0x2acdb150.
2015-09-07 19:22:53.04 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 50 sec, kepc 0xffffffff802dcca8(do_wait+0x2d0/0x478) uepc 0x2acdb150.
2015-09-07 19:22:42.90 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 40 sec, kepc 0xffffffff802dcca8(do_wait+0x2d0/0x478) uepc 0x2acdb150.
2015-09-07 19:22:32.83 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 30 sec, kepc 0xffffffff802dcca8(do_wait+0x2d0/0x478) uepc 0x2acdb150.
2015-09-07 19:22:22.68 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 20 sec, kepc 0xffffffff802dcca8(do_wait+0x2d0/0x478) uepc 0x2acdb150.
2015-09-07 19:22:02.36 <Warn:EPM.cpu> Slot-1: CPU utilization monitor: process epm consumes 99 % CPU
2015-09-07 19:21:57.60 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 60 sec, kepc 0xffffffff805fee1c(schedule_timeout+0x64/0xe0) uepc 0x2aaec2e8.
2015-09-07 19:21:47.48 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 50 sec, kepc 0xffffffff805fee1c(schedule_timeout+0x64/0xe0) uepc 0x2aaec2e8.
2015-09-07 19:21:37.35 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 40 sec, kepc 0xffffffff805fee1c(schedule_timeout+0x64/0xe0) uepc 0x2aaec2e8.
2015-09-07 19:21:27.22 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 30 sec, kepc 0xffffffff805fee1c(schedule_timeout+0x64/0xe0) uepc 0x2aaec2e8.
2015-09-07 19:21:17.09 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 20 sec, kepc 0xffffffff805fee1c(schedule_timeout+0x64/0xe0) uepc 0x2aaec2e8.
2015-09-07 19:20:53.76 <Warn:EPM.Msg.hello_rate> Slot-1: Process mrp sends hello too often, expected once in 5 secs
2015-09-07 19:20:53.76 <Warn:EPM.hello_rate> Slot-1: Received hellos from process mrp 2 more often then expected 3
2015-09-07 19:20:53.76 <Warn:EPM.Msg.hello_rate> Slot-1: Process elsm sends hello too often, expected once in 10 secs
2015-09-07 19:20:53.76 <Warn:EPM.hello_rate> Slot-1: Received hellos from process elsm 2 more often then expected 3
2015-09-07 19:20:53.76 <Warn:EPM.Msg.hello_rate> Slot-1: Process mcmgr sends hello too often, expected once in 10 secs
2015-09-07 19:20:53.76 <Warn:EPM.hello_rate> Slot-1: Received hellos from process mcmgr 2 more often then expected 3
2015-09-07 19:20:47.92 <Warn:EPM.Msg.hello_rate> Slot-1: Process mrp sends hello too often, expected once in 5 secs
2015-09-07 19:20:47.92 <Warn:EPM.hello_rate> Slot-1: Received hellos from process mrp 2 more often then expected 3
2015-09-07 19:20:47.50 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 30 sec, kepc 0xffffffff805fee1c(schedule_timeout+0x64/0xe0) uepc 0x2aaec2e8.
2015-09-07 19:20:37.34 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 20 sec, kepc 0xffffffff805fee1c(schedule_timeout+0x64/0xe0) uepc 0x2aaec2e8.
2015-09-07 19:19:58.79 <Warn:EPM.Msg.hello_rate> Slot-1: Process mrp sends hello too often, expected once in 5 secs
2015-09-07 19:19:58.79 <Warn:EPM.hello_rate> Slot-1: Received hellos from process mrp 2 more often then expected 3
2015-09-07 19:19:24.23 <Warn:EPM.Msg.hello_rate> Slot-1: Process mcmgr sends hello too often, expected once in 10 secs
2015-09-07 19:19:24.23 <Warn:EPM.hello_rate> Slot-1: Received hellos from process mcmgr 2 more often then expected 3
2015-09-07 19:19:23.80 <Warn:EPM.hello_rate> Slot-1: Received hellos from process elsm 2 more often then expected 3
2015-09-07 19:19:23.80 <Warn:EPM.Msg.hello_rate> Slot-1: Process elsm sends hello too often, expected once in 10 secs
2015-09-07 19:19:08.82 <Warn:EPM.hello_rate> Slot-1: Received hellos from process mrp 2 more often then expected 3
2015-09-07 19:19:08.82 <Warn:EPM.Msg.hello_rate> Slot-1: Process mrp sends hello too often, expected once in 5 secs
2015-09-07 19:19:05.23 <Crit:Kern.Card.Emergency> Slot-1: Epm application wdg timer warning - 20 sec, kepc 0xffffffff802dcca8(do_wait+0x2d0/0x478) uepc 0x2acdb150.
2015-09-07 19:18:31.06 <Warn:EPM.hello_rate> Slot-1: Received hellos from process mrp 2 more often then expected 3
2015-09-07 19:18:31.06 <Warn:EPM.Msg.hello_rate> Slot-1: Process mrp sends hello too often, expected once in 5 secs
2015-09-07 19:18:31.06 <Warn:EPM.hello_rate> Slot-1: Received hellos from process mrp 2 more often then expected 3
2015-09-07 19:18:31.06 <Warn:EPM.Msg.hello_rate> Slot-1: Process elsm sends hello too often, expected once in 10 secs
2015-09-07 19:18:31.06 <Warn:EPM.hello_rate> Slot-1: Received hellos from process elsm 2 more often then expected 3
2015-09-07 19:18:31.06 <Warn:EPM.Msg.hello_rate> Slot-1: Process mcmgr sends hello too often, expected once in 10 secs
2015-09-07 19:18:31.06 <Warn:EPM.Msg.hello_rate> Slot-1: Process mrp sends hello too often, expected once in 5 secs
2015-09-07 19:18:31.06 <Warn:EPM.hello_rate> Slot-1: Received hellos from process mcmgr 2 more often then expected 3
Admin ZML

Posted 3 years ago

Patrick Voss, Alum

From the looks of it, you ran into a process crash. Can you paste the output of "ls" and "ls internal-memory"? We may be able to assist you here, but ultimately a GTAC case may have to be opened to see what can be done.
Admin ZML

I have opened a GTAC case and will post the outcome once it is solved.
Admin ZML

GTAC support answered as follows:



Hello,

My name is Christopher and this case has just been escalated to me.

From the show tech information I can see that process epm crashed on slot 1 on the 7th of September at 19:24:02.
I can also see additional memory depletion messages caused by process climaster following this process crash, at 19:27:16, 19:27:22, and 19:27:29.

I can see that you have web HTTP enabled. Can you tell me whether you are using the web interface of this switch?

Given your comment that you were running EXOS 15.7.1 at this point: there is a known issue (xos0062016) in this version of code that causes reboots due to memory depletion of process CliMaster, so (with the web interface enabled) I'm quite certain that this is the cause of your reboot. Process EPM is responsible for handling all the running processes, and I'm quite certain that it crashed because the known issue left it without sufficient memory. This would explain the memory depletion messages showing up right after the process crash.

xos0062016 has been fixed in EXOS 15.7.2, which coincidentally is the version that you have already upgraded to.

Kind regards,

Christopher Henrich
EMEA TAC Sr. Escalation Support Engineer / Extreme Networks
Drew C., Community Manager

Thanks for coming back to update the thread.  I've marked this post as "solved."