BD 8810 Restarted, Lost 30% of Its MAC Addresses, and Is Now Not Learning New MAC Addresses

  • Problem
  • Updated 8 months ago
  • Solved
The BD 8810 rebooted on its own and lost about 30% of its previously learned MAC addresses. Right now it is not learning any new MAC addresses.


# sh fdb stats
Total: 253 Static: 0 Perm: 0 Dyn: 253 Dropped: 0
FDB Aging time: 300
FDB VPLS Aging time: 300
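
One rough way to check whether the switch is learning again is to capture this summary line twice, a few minutes apart, and compare the dynamic counts. The sketch below is only an illustration: `parse_fdb_stats` and `is_learning` are hypothetical helper names, not part of any Extreme tool. It also assumes that a growing `Dyn` count means new addresses are being learned, which is only approximate because entries also age out (300 seconds here).

```python
import re

def parse_fdb_stats(line):
    """Parse a 'Total: N Static: N ...' summary line into a dict of ints."""
    return {key: int(value) for key, value in re.findall(r"(\w+):\s*(\d+)", line)}

def is_learning(earlier, later):
    """Return True if the dynamic entry count grew between two samples."""
    return later["Dyn"] > earlier["Dyn"]

# Summary line copied from the output above.
sample = "Total: 253 Static: 0 Perm: 0 Dyn: 253 Dropped: 0"
stats = parse_fdb_stats(sample)
print(stats)

# Comparing a sample with itself shows no growth, as expected.
print(is_learning(stats, stats))
```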



# sh log
02/23/2018 16:47:04.07 <Warn:VRRP.UnkVR> No VR found on VLAN x with VR Id 1
02/23/2018 16:47:04.06 <Warn:VRRP.UnkVR> No VR found on VLAN x with VR Id 1
02/23/2018 16:47:03.64 <Warn:VRRP.UnkVR> No VR found on VLAN y with VR Id 1
02/23/2018 16:47:03.63 <Warn:VRRP.UnkVR> No VR found on VLAN y with VR Id 1



# sh switch detail 
System Type:      BD-8810
SysHealth check:  Enabled (Normal)
Recovery Mode:    All
System Watchdog:  Enabled
Boot Count:       6
Next Reboot:      None scheduled
System UpTime:    2 days 16 hours 40 minutes 49 seconds
Slot:             MSM-A *                      MSM-B                  
                  ------------------------     ------------------------
Current State:    MASTER                       BACKUP (In Sync)       
Image Selected:   primary                      primary                
Image Booted:     primary                      primary                
Primary ver:      12.3.3.6                     12.3.3.6               
Secondary ver:    12.3.3.6                     12.3.3.6               
Config Selected:  primary.cfg                  primary.cfg            
Config Booted:    Factory Default              Factory Default        
primary.cfg       Created by ExtremeXOS version 12.3.3.6
                  946439 bytes saved on Thu Feb 22 09:27:21 2018




# sh version detail
Chassis     : 800392-00-03 1113G-02926 Rev 3.0
Slot-1      : 800224-00-05 1125G-00022 Rev 5.0 BootROM: 1.0.3.9    IMG: 12.3.3.6 
Slot-2      : 800224-00-05 1125G-00023 Rev 5.0 BootROM: 1.0.3.9    IMG: 12.3.3.6 
Slot-3      : 800224-00-05 1125G-00024 Rev 5.0 BootROM: 1.0.3.9    IMG: 12.3.3.6 
Slot-4      :
Slot-5      : 800232-00-06 1119G-00581 Rev 6.0 BootROM: 1.0.3.9    IMG: 12.3.3.6 
Slot-6      : 800232-00-06 1113G-00074 Rev 6.0 BootROM: 1.0.3.9    IMG: 12.3.3.6 
Slot-7      :
Slot-8      :
Slot-9      :
Slot-10     :
MSM-A       : 800314-00-01 1115G-00282 Rev 1.0 BootROM: 1.0.4.2    IMG: 12.3.3.6 
MSM-B       : 800314-00-01 1111G-02015 Rev 1.0 BootROM: 1.0.4.2    IMG: 12.3.3.6 
PSUCTRL-1   : 450306-00-03 1113G-02776 Rev 3.0 BootROM: 2.18     
PSUCTRL-2   : 450306-00-03 1113G-02826 Rev 3.0 BootROM: 2.18     
PSU-1       : PS 2350 4300-00146 1124J-01187 Rev 5.0
PSU-2       : PS 2350 4300-00146 1124J-01208 Rev 5.0
PSU-3       : PS 2350 4300-00146 1124J-01199 Rev 5.0
PSU-4       :
PSU-5       :
PSU-6       :
Image   : ExtremeXOS version 12.3.3.6 v1233b6 by release-manager
BootROM : 1.0.4.2
Diagnostics : 1.5

BGP


Posted 9 months ago


Ron Huygens, Employee

This box is running a very old version of EXOS. I suggest that you start by upgrading to the latest recommended version.
https://extremeportal.force.com/ExtrArticleDetail?n=000002378&q=recommended%20version

This will most likely solve your issue.

BGP

Thank you very much for the swift response. I would just like to know what you want me to upgrade: the ExtremeXOS image, the BootROM, or both?

I would also like to understand this behavior of the switch. Does it mean that when the OS gets too old it stops learning MAC addresses?

Thank you again, you have already been of great help.

EtherMAN, Embassador

I have a question: did you load a config from the CLI after the reboot? You are running on the factory default config but have saved to your primary. You may not have all of the config loaded, and some things may not have been saved from previous changes. The MAC table only contains what the switch sees connections for on VLANs or VMANs, so if you have clients connected to ports with no VLAN provisioned, the table will be smaller. You can also look at the system dump to see what may have caused the reboot.

Config Selected:  primary.cfg                  primary.cfg             
Config Booted:    Factory Default              Factory Default         
primary.cfg       Created by ExtremeXOS version 12.3.3.6
                  946439 bytes saved on Thu Feb 22 09:27:21 2018
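
The mismatch in the output above can also be checked mechanically. This is a minimal sketch, assuming the `show switch detail` text has been saved to a string and that the per-MSM columns are separated by two or more spaces, as they are in this output. The function names are hypothetical.

```python
import re

# The two relevant lines, copied from the output above.
SHOW_SWITCH = """\
Config Selected:  primary.cfg                  primary.cfg
Config Booted:    Factory Default              Factory Default
"""

def config_fields(output):
    """Map each 'Config ...' label to its list of per-MSM values."""
    fields = {}
    for line in output.splitlines():
        label, sep, rest = line.partition(":")
        if sep and label.startswith("Config "):
            # Columns are split on runs of 2+ spaces so "Factory Default" stays whole.
            fields[label.strip()] = re.split(r"\s{2,}", rest.strip())
    return fields

def booted_differs_from_selected(output):
    """Return one bool per MSM: True where the booted config is not the selected one."""
    fields = config_fields(output)
    return [
        selected != booted
        for selected, booted in zip(fields["Config Selected"], fields["Config Booted"])
    ]

print(booted_differs_from_selected(SHOW_SWITCH))
```

A `True` for an MSM means that MSM came up on something other than the selected configuration file, which is the situation described here.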



He did not say the reboot was due to old software, but it is a good idea to keep things current. If this was a very static configuration and you were running for years without issues, then unless a specific bug caused the reboot and is fixed by new code, you still need to find out what caused the reboot.

The command to see the system dump:

show debug system-dump

BGP

Hi EtherMAN, I really appreciate your prompt feedback, as this is a very urgent issue. I have copied the system dump and pasted it here. I can see that the reason for failure is a process crash and that the process is nodemgr. Would you know any reason why this process would cause a reboot? I am guessing the switch went to factory default, judging from the output of "show switch".

And no, I didn't load any config from the CLI after the reboot. I still do not know what caused the reboot because, as you said, the device has been running for years without issues.


# show debug system-dump
===============================================
            MSM-A system dump information
===============================================
core_dump_info storage: 8/3072 used [EMPTY]
failure: process crash
time: Wed Feb 21 00:20:01 2018
process nodemgr
pid 619
signal 6
$0 : z0=00000000 at=fefefeff v0=00000000 v1=00000001
$4 : a0=0000026b a1=00000006 a2=00000001 a3=00000000
$8 : t0=00000002 t1=2b500028 t2=2b500450 t3=2b500028
$12: t4=00000001 t5=000001d9 t6=000007e2 t7=2abc8414
$16: s0=0000026b s1=00000402 s2=00000006 s3=2abdb344
$20: s4=100007b4 s5=00412cf0 s6=100238e4 s7=100238e0
$24: t8=00000113 t9=2b22af80
$28: gp=2b3a9b40 sp=7f7ff890 s8=00000193 ra=2ab4b1f4
Hi : 00000285
Lo : 00033829
epc  : 2b22af94    Not tainted
Status: 00001f13
Cause : 00808020
 7f7ff890: 00000000 2aba07d0 2b27e704 2b500010 2aba07d0 2b39d1d0 00000006 2abdb180
 7f7ff8b0: 00000163 2aba07d0 2ab4bae4 2ab4bac8 00000000 00000000 2abdb2c4 2abdb180
 7f7ff8d0: 2aba07d0 2abdb344 2abdb2c4 2aba07d0 2b22cec8 2b22cf3c 2aba07d0 2ab4f10c
 7f7ff8f0: 7f7ffa20 7f7ffa48 2b3a9b40 00412cf0 2aba07d0 000000a5 00000000 000000a5
 7f7ff910: 2b500638 000000a5 000000a5 2aba07d0 2b27c48c 00000000 ffffffff 000000a5
 7f7ff930: 2b3a9b40 2b27dae8 2b3a9b40 2b500508 2b39f928 000000a5 2b500638 000000a5
 7f7ff950: 2b3a9b40 2b27b4b8 2b27bb64 00000000 2aba07d0 00000000 2aba07d0 2b39f850
 7f7ff970: 7f7ffc00 2b39f850 00000001 2abdb344 2aba07d0 2aba07d0 2aba07d0 2ab48390
 7f7ff990: 00000020 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 7f7ff9b0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 7f7ff9d0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 7f7ff9f0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 7f7ffa10: 2b3a9b40 2b223040 2ab21140 2b34d6c0 7fff7aef 2b34d6e8 2abdb180 00000163
 7f7ffa30: 2abdb2c4 2b34d6e8 2abdb344 2aace46c 2b3a9b40 2abdb2c4 2b500638 2abdb2c4
 7f7ffa50: 2abdb2c4 10023f70 2abdb180 2abdb344 2b3a9b40 2abc8414 00000000 00000000
 7f7ffa70: 00000000 00000000 2abdb344 2abdb180 00000163 2abdb2c4 2abdb344 0034f2b0
log: ...  transaction through muxes was interrupted, clean up
log: <7>Opening app watchdog timer, instance: 1
log: <7>Application watchdog timer is not cleanup, instance: 1
log: <7>Opening app watchdog timer, instance: 1
log: <7>Application watchdog timer is not cleanup, instance: 1

Text segment map
  0x00400000-0x00414000 /exos/bin/nodemgr
  0x2aac0000-0x2aada000 /lib/ld-2.2.5.so
  0x2ab40000-0x2ab52000 /lib/libpthread-0.9.so
  0x2abc0000-0x2abdd000 /exos/lib/libcommon.so
  0x2ac40000-0x2ac71000 /exos/lib/libipml.so
  0x2acc0000-0x2acd2000 /exos/lib/libepm.so
  0x2ad40000-0x2ad4f000 /exos/lib/libds.so
  0x2adc0000-0x2ae1c000 /exos/lib/libdm.so
  0x2ae80000-0x2ae82000 /exos/lib/libnm.so
  0x2af00000-0x2af06000 /exos/lib/libcli.so
  0x2af80000-0x2afa4000 /exos/lib/libexpat.so
  0x2b000000-0x2b018000 /exos/lib/libcmbackend.so
  0x2b080000-0x2b08c000 /exos/lib/libems.so
  0x2b100000-0x2b117000 /exos/lib/libdispatch.so
  0x2b180000-0x2b18c000 /exos/lib/libwkninfo.so
  0x2b200000-0x2b35d000 /lib/libc-2.2.5.so
  0x2b3c0000-0x2b3c3000 /lib/libdl-2.2.5.so
  0x2b440000-0x2b441000 /exos/lib/libusertrace.so
failure: process crash
time: Wed Feb 21 00:20:02 2018
process nodemgr
pid 403
signal 6
$0 : z0=00000000 at=10001f00 v0=00000004 v1=00000001
$4 : a0=00001091 a1=00000009 a2=7fff7650 a3=00000001
$8 : t0=00001f00 t1=00000000 t2=00000000 t3=8032a060
$12: t4=886c8480 t5=886c8500 t6=886c8400 t7=00000058
$16: s0=7fff77c0 s1=1000ec50 s2=7fff7880 s3=7fff7758
$20: s4=00000009 s5=7fff7880 s6=10011660 s7=00000000
$24: t8=00000000 t9=2b2f4130
$28: gp=2b3a9b40 sp=7fff7600 s8=2b156e40 ra=2ac6b7d0
Hi : 0000007f
Lo : ef9db29d
epc  : 2b2f4144    Not tainted
Status: 00001f13
Cause : 00808020
 7fff7600: 2add0c6c 2add0be4 2b3a9b40 2abc6b78 0000012c 10014ce8 2acb8800 00000400
 7fff7620: 2acb8800 2ac650d0 10032002 2b114b1c 2b115144 2b114cc0 100115c0 2b114b1c
 7fff7640: 2aba07d0 2b115144 2acb8800 2ac24830 2aba07d0 2b103504 2ad194c0 2ae647d0
 7fff7660: 2aba07d0 10024e0c 00000000 2aba07d0 7fff77c0 1000ec50 7fff7880 7fff7748
 7fff7680: 10011660 00000000 2b156e80 10011660 2acb8800 2ac695cc 000000c1 000001a0
 7fff76a0: 2b15eac0 00000000 7fff77c0 7fff7758 1000c270 2b156ec0 2acb8800 00000002
 7fff76c0: 2aba07d0 10011660 1000bb90 100115c0 2b114b1c 2b114ca4 2b114cc0 10032002
 7fff76e0: 2b156e50 00000000 2aba07d0 000001bd 2ab48464 2ac24830 2b104dfc 2b104820
 7fff7700: 2aba07d0 0000002c 2aba07d0 2b156e80 2b156e50 2aba07d0 2b10470c 2b1045b4
 7fff7720: 2b156e80 10011660 2b15eac0 2b156e40 00000000 00000af0 00000000 2b156e40
 7fff7740: 0bf40218 0039ada0 0bf40218 0039ada0 00000000 00000000 00000038 2b156e40
 7fff7760: 00000001 2b156e80 00000038 2b156e40 000f423f 2b156e40 2acb8800 2b10576c
 7fff7780: 2b2826ac 00000009 2aba07d0 2add0be4 7fff77c0 2b103504 2ae5dae0 10011450
 7fff77a0: 7fff7878 10011454 2b15eac0 2b15eac0 0bf40218 0039ada0 0bf40218 000ef420
 7fff77c0: 00000000 000493e0 65706d48 656c6c6f 54696d65 72000000 00000000 00000000
 7fff77e0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
log: ... 
log: <6>  0x2b200000-0x2b35d000  /lib/libc-2.2.5.so
log: <6>  0x2b3c0000-0x2b3c3000  /lib/libdl-2.2.5.so
log: <6>  0x2b440000-0x2b441000  /exos/lib/libusertrace.so.0.0
log: <6>*****
log: <6>Start core dump: pid 619 (nodemgr) signal 6
log: <6>End core dump: pid 619 (nodemgr) signal 6

Text segment map
  0x00400000-0x00414000 /exos/bin/nodemgr
  0x2aac0000-0x2aada000 /lib/ld-2.2.5.so
  0x2ab40000-0x2ab52000 /lib/libpthread-0.9.so
  0x2abc0000-0x2abdd000 /exos/lib/libcommon.so
  0x2ac40000-0x2ac71000 /exos/lib/libipml.so
  0x2acc0000-0x2acd2000 /exos/lib/libepm.so
  0x2ad40000-0x2ad4f000 /exos/lib/libds.so
  0x2adc0000-0x2ae1c000 /exos/lib/libdm.so
  0x2ae80000-0x2ae82000 /exos/lib/libnm.so
  0x2af00000-0x2af06000 /exos/lib/libcli.so
  0x2af80000-0x2afa4000 /exos/lib/libexpat.so
  0x2b000000-0x2b018000 /exos/lib/libcmbackend.so
  0x2b080000-0x2b08c000 /exos/lib/libems.so
  0x2b100000-0x2b117000 /exos/lib/libdispatch.so
  0x2b180000-0x2b18c000 /exos/lib/libwkninfo.so
  0x2b200000-0x2b35d000 /lib/libc-2.2.5.so
  0x2b3c0000-0x2b3c3000 /lib/libdl-2.2.5.so
  0x2b440000-0x2b441000 /exos/lib/libusertrace.so
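
To read the dump a little further, the `epc` (faulting program counter) and `ra` (return address) values can be matched against the text segment map to see which binary each address falls in. The sketch below does this for both crash records. It uses only a subset of the map above, and `load_segments` and `resolve` are hypothetical helper names. Signal 6 is SIGABRT on Linux, so an `epc` inside libc would be consistent with the process aborting itself, although the sketch cannot confirm which function was running.

```python
import re

# Subset of the "Text segment map" above (the same ranges appear in both records).
TEXT_SEGMENT_MAP = """\
  0x00400000-0x00414000 /exos/bin/nodemgr
  0x2ab40000-0x2ab52000 /lib/libpthread-0.9.so
  0x2ac40000-0x2ac71000 /exos/lib/libipml.so
  0x2b200000-0x2b35d000 /lib/libc-2.2.5.so
"""

def load_segments(text):
    """Parse 'start-end path' lines into (start, end, path) tuples."""
    pattern = re.compile(r"(0x[0-9a-fA-F]+)-(0x[0-9a-fA-F]+)\s+(\S+)")
    return [(int(a, 16), int(b, 16), path) for a, b, path in pattern.findall(text)]

def resolve(address, segments):
    """Return (path, offset) for the segment containing address, or None."""
    for start, end, path in segments:
        if start <= address < end:
            return path, address - start
    return None

segments = load_segments(TEXT_SEGMENT_MAP)

# Register values copied from the two crash records above.
registers = [
    ("pid 619 epc", 0x2B22AF94),
    ("pid 619 ra", 0x2AB4B1F4),
    ("pid 403 epc", 0x2B2F4144),
    ("pid 403 ra", 0x2AC6B7D0),
]

for label, address in registers:
    hit = resolve(address, segments)
    where = f"{hit[0]} + {hit[1]:#x}" if hit else "not in the listed segments"
    print(f"{label}: {address:#010x} -> {where}")
```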

Ron Huygens, Employee

The crash dump points to memory depletion.
If you say that the switch was running for a very long time, you may have hit CR# xos0042592. The effect of this is that the Node Manager process consumes excessive CPU when the system uptime reaches 994 days. Ultimately it will crash due to memory depletion. This is fixed in EXOS 12.5.3 and up.
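
As a quick sanity check on that theory, the two numbers above can be worked through: the latest boot date that would put the first crash at or past 994 days of uptime, and whether the running image is older than the fixed release. This is a rough sketch under stated assumptions. The crash time is taken from the first record in the system dump, the real uptime before the crash is not known from this thread, and `parse_version` is a hypothetical helper that simply compares dotted numbers.

```python
from datetime import datetime, timedelta

# First crash record in the system dump: "Wed Feb 21 00:20:01 2018".
CRASH_TIME = datetime(2018, 2, 21, 0, 20, 1)
# Uptime at which the CR says the problem begins.
UPTIME_THRESHOLD = timedelta(days=994)

def parse_version(text):
    """Turn a dotted version such as '12.3.3.6' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.split("."))

# If the switch booted on or before this moment, it had at least 994 days of uptime
# when it crashed.
latest_boot_for_threshold = CRASH_TIME - UPTIME_THRESHOLD
print("Boot at or before:", latest_boot_for_threshold)

running = parse_version("12.3.3.6")   # from "show version detail"
fixed_in = parse_version("12.5.3")    # release named as containing the fix
print("Running image predates the fix:", running < fixed_in)
```

If the switch is known to have been rebooted more recently than the printed date, this CR would be a less likely explanation for the crash.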

As EtherMAN correctly pointed out, the switch is running with the default config, and that may be the reason why you see decreased performance. Actually, I am surprised that anything is working at all.

Again, I suggest upgrading to a current version to avoid the possible memory depletion, and make sure you use the correct config: "use config primary"
With the current information we cannot determine why the switch chose to use the factory default config. If you want to have that investigated, it is best to open a case through the support portal.

BGP

Hi Ron, thank you very much. Your reply went a long way in helping me understand what was going on.

I really appreciate it.