MSM-B: CPU 0: Kernel thread was stuck, and an issue with OSPF

  • Question
  • Updated 3 years ago
  • Answered
We recently upgraded a BD8806 from EXOS 12.5 to 15.7. Unfortunately, the first upgrade attempt was interrupted; the second attempt was successful. After the reboot, the following errors appeared in the logs:

04/26/2016 09:57:29.77 <Crit:Kern.Alert> MSM-B: CPU 0: Kernel thread was stuck for 2.74 seconds, jiffies: 3012700
04/26/2016 09:57:29.77 <Crit:Kern.Alert> MSM-B: CPU 1: Kernel thread was stuck for 2.32 seconds, jiffies: 3012674
04/26/2016 09:57:29.77 <Crit:Kern.Alert> MSM-B: CPU 1: soft watchdog expiration warning EPC 8016633c(__rcu_pending+0x0/0x94) at 2 seconds.
04/26/2016 09:57:29.77 <Crit:Kern.Alert> MSM-B: CPU 0: soft watchdog expiration warning EPC 80105df0(cpu_idle+0x3c/0x80) at 2 seconds.
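
For anyone who wants to pull these events out of a longer log, here is a minimal Python sketch. It assumes one log entry per line in the format quoted above; the names `STUCK_RE` and `parse_stuck_events` are made up for illustration and are not part of EXOS.

```python
import re

# Pattern for the "Kernel thread was stuck" entries quoted in this post.
# Assumption: one entry per line, fields laid out exactly as shown above.
STUCK_RE = re.compile(
    r"(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\.\d{2}) "
    r"<(?P<sev>[^>]+)> (?P<slot>MSM-[AB]): CPU (?P<cpu>\d+): "
    r"Kernel thread was stuck for (?P<secs>\d+\.\d+) seconds, "
    r"jiffies: (?P<jiffies>\d+)"
)

def parse_stuck_events(text):
    """Return one dict per stuck-thread entry found in `text`."""
    events = []
    for m in STUCK_RE.finditer(text):
        events.append({
            "timestamp": m.group("ts"),
            "slot": m.group("slot"),
            "cpu": int(m.group("cpu")),
            "seconds": float(m.group("secs")),
            "jiffies": int(m.group("jiffies")),
        })
    return events

sample = (
    "04/26/2016 09:57:29.77 <Crit:Kern.Alert> MSM-B: CPU 0: "
    "Kernel thread was stuck for 2.74 seconds, jiffies: 3012700\n"
    "04/26/2016 09:57:29.77 <Crit:Kern.Alert> MSM-B: CPU 1: "
    "Kernel thread was stuck for 2.32 seconds, jiffies: 3012674\n"
)

events = parse_stuck_events(sample)
for e in events:
    print(e["slot"], "CPU", e["cpu"], "stuck for", e["seconds"], "s")
```

Run against a full log, this gives a quick count of how often each CPU on each MSM reported a stall.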

Since then, all my OSPF neighbors have been unstable, even after removing MSM-B. After rolling back to 12.5, OSPF became stable again. Is this a bug, and do I have to do a fresh install of EXOS on MSM-B or on both MSMs?

Note: The OSPF neighbor relationship is not with the switch; the BD8806 is only an L2 VLAN in the path (NodeA[ospf]-----BD8806(L2vlan)------Router[ospf]).

debug on MSM-B
===============================================
            MSM-B system dump information
===============================================
core_dump_info storage: 8/3072 used [EMPTY]
failure: process crash
time: Tue Apr 26 01:23:10 2016
process hal
pid 1331
signal 10
$0 : z0=00000000 at=10001f00 v0=00408e3c v1=004043b0
$4 : a0=0041bc7c a1=004003a4 a2=00000399 a3=0040924c
$8 : t0=7fff727c t1=2aac8504 t2=00000080 t3=f0000000
$12: t4=000014c0 t5=ffffffff t6=00000000 t7=7fff7198
$16: s0=00000000 s1=2aad1b04 s2=00000022 s3=a1cfb68c
$20: s4=00000000 s5=00000000 s6=2aad9ef8 s7=050e7db4
$24: t8=2aada2a8 t9=2aab2d9c
$28: gp=2aae2000 sp=7fff7178 s8=2aab8000 ra=2aab3704
Hi : 00000399
Lo : 0000b704
epc   : 2aab36d4    Tainted: P          
Status: 00001f13
Cause : 00808008
 7fff7178: 00000001 2aacd800 7fff7160 2aab7bd4 7fff7308 2aab7e24 2aae2000 2aacd478
 7fff7198: 2aad15b8 00000000 00000000 00000001 000012a3 00000000 2aaca000 2c512f4d
 7fff71b8: 0041ccbc 2c512ba8 00000000 2aaccd80 7fff71a8 2aab7bd4 2aab2d9c 0040924c
 7fff71d8: 7fff727c 2aac8504 7fff732c 2c512f4d 00000000 00000000 00000001 2aad15b8
 7fff71f8: 2aad1844 a1cfb68c 7fff7208 2aab3a24 00000001 2aacc308 7fff71f0 2aab7bd4
 7fff7218: 7fff7270 2aaca230 00000000 00000000 00000001 00000000 00000000 2aad15b8
 7fff7238: 2aae2000 2aacbc00 7fff7220 2aab7bd4 7fff7308 2aab7e24 00000001 2aacb878
 7fff7258: ffffffff 2aad15b8 2aae2000 0000000d 2aad15b8 00000000 00000000 00000000
 7fff7278: 2c512f18 0b7268a5 00000000 2aacb180 7fff7268 2aab7bd4 2aad1068 00000001
 7fff7298: 00000000 2aacae00 2aae2000 2aab7bd4 00000037 2c52a110 2c513536 2aad15b8
 7fff72b8: 2c5134c0 00000001 00000002 2c512ba8 7fff72d0 2aab5670 2c51223c 2aab7bd4
 7fff72d8: 2aad15b8 2aad1844 00000000 00000000 00000001 00000000 2aae2000 2aadabc0
 7fff72f8: 2aaa8645 2aabd1e0 ffffffff 00000000 2aae2000 00000000 2aae2000 00000001
 7fff7318: 00000b50 2aada2a8 2aad2cd0 2aabd6f4 7fff73d4 2c512ba8 00000000 00000022
 7fff7338: 2aae2000 2f746f6f 2aada2a8 00000001 2c512f18 00000001 00000000 2aaa895c
 7fff7358: 0000001c 00000000 00000000 2aaca000 00000001 2aabe0d0 2aab543c 2aad15b8
log: ... 2 notice: (1008) check_node_data: wrong data CRC in data node at 0x00337620: read 0xd10c9294, calculated 0xbcb7794b.
log: <4>Data CRC 040e30d4 != calculated CRC 1b75f785 for node at 0033ec40
log: <4>Data CRC 040e30d4 != calculated CRC 1b75f785 for node at 0033ec40

Text segment map
  0x00400000-0x005ba000  /exos/bin/hal
  0x2b030000-0x2b049000  /exos/lib/libpibutil.so.0.0
  0x2aaa8000-0x2aaca000  /lib/ld-2.13.so
  0x2aadc000-0x2aadf000  /lib/libdl-2.13.so
  0x2aaf0000-0x2aafa000  /exos/lib/libhal.so
  0x2ab0c000-0x2ab1f000  /exos/lib/libcommon.so
  0x2ab30000-0x2ab35000  /exos/lib/libcli.so
  0x2ab46000-0x2ab9a000  /exos/lib/libvlan.so
  0x2abb2000-0x2abcb000  /exos/lib/libcmbackend.so
  0x2abdc000-0x2abdf000  /exos/lib/libipv6.so
  0x2abf0000-0x2ac0a000  /exos/lib/librtmgrc.so
  0x2ac1c000-0x2ac20000  /exos/lib/libsnmpclient.so
  0x2ac30000-0x2ac43000  /exos/lib/libacl.so
  0x2ac54000-0x2ac62000  /exos/lib/libfdb.so
  0x2ac74000-0x2aceb000  /exos/lib/libaspen.so
  0x2ad54000-0x2af77000  /exos/lib/libpib.so
  0x2b05a000-0x2b091000  /exos/lib/libaspenshared.so
  0x2b0aa000-0x2b119000  /exos/lib/libaspensm.so
  0x2b27e000-0x2b28d000  /exos/lib/libaspensvc.so
  0x2b2b0000-0x2b2b5000  /exos/lib/libaspenutil.so
  0x2b2c6000-0x2b2cb000  /exos/lib/libsummitbcmicm.so
  0x2b84c000-0x2bce1000  /exos/lib/libbcmplat.so
  0x2c0de000-0x2c148000  /exos/lib/libcorediags.so
  0x2c16c000-0x2c171000  /exos/lib/libaspendiags.so
  0x2c182000-0x2c1e3000  /exos/lib/libstratadiag.so
  0x2c206000-0x2c214000  /exos/lib/libaspenpoe.so
  0x2c224000-0x2c22c000  /exos/lib/libpibdiag.so
  0x2c23c000-0x2c254000  /lib/libpthread-2.13.so
  0x2c2fe000-0x2c35b000  /exos/lib/libdispatch.so
  0x2c374000-0x2c381000  /exos/lib/libwkninfo.so
  0x2c394000-0x2c4fa000  /lib/libc-2.13.so
Build directory: /data3/release-manager/v15_7_1_4/aspen_msm
failure: process crash
time: Tue Apr 26 01:23:15 2016
process xmld
pid 1468
signal 11
$0 : z0=00000000 at=10001f00 v0=00000003 v1=ffffffff
$4 : a0=0000432b a1=00000025 a2=00000025 a3=00000000
$8 : t0=0000007a t1=2aac8504 t2=00000050 t3=f0000000
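
One way to read the hal dump above is to look up the `epc` and `ra` register values in the text segment map, to see which binary or library each address falls in. The sketch below does that lookup in Python. It copies only a subset of the map and assumes each range is half-open (start inclusive, end exclusive); `SEGMENTS` and `locate` are illustrative names, not an EXOS tool.

```python
# Subset of the "Text segment map" entries from the hal dump above.
# Assumption: each range is [start, end), i.e. the end address is exclusive.
SEGMENTS = [
    (0x00400000, 0x005BA000, "/exos/bin/hal"),
    (0x2AAA8000, 0x2AACA000, "/lib/ld-2.13.so"),
    (0x2AADC000, 0x2AADF000, "/lib/libdl-2.13.so"),
    (0x2AAF0000, 0x2AAFA000, "/exos/lib/libhal.so"),
    (0x2C394000, 0x2C4FA000, "/lib/libc-2.13.so"),
]

def locate(addr):
    """Return (path, offset) for the segment containing addr, or None."""
    for start, end, path in SEGMENTS:
        if start <= addr < end:
            return path, addr - start
    return None

# epc and ra as printed in the hal register dump.
for name, addr in (("epc", 0x2AAB36D4), ("ra", 0x2AAB3704)):
    hit = locate(addr)
    if hit is None:
        print(f"{name}={addr:#010x}: not in any listed segment")
    else:
        path, offset = hit
        print(f"{name}={addr:#010x}: {path} + {offset:#x}")
```

With the values from the dump, both addresses land in the `/lib/ld-2.13.so` range. That only locates the faulting address; it does not by itself identify the root cause.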
  
Njanyana Buthelezi


Posted 3 years ago

Henrique, Employee

Hi Njanyana,

To review the hal process crash and try to identify the root cause, I suggest opening a GTAC case.

The article below explains how to upload debug/coredump information to a TFTP server:

How to upload debug information to a TFTP server

I believe that "Kernel thread was stuck" is related to the hal process crash, even though the log timestamps and the crash time do not match (based on the outputs provided).
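
To put a number on that mismatch, here is a small Python sketch that subtracts the hal crash time from the log timestamp. It assumes both timestamps come from the same clock and time zone, which the outputs do not confirm, and it expects an English locale for the day and month names.

```python
from datetime import datetime

# Timestamp of the "Kernel thread was stuck" log entries.
log_ts = datetime.strptime("04/26/2016 09:57:29.77", "%m/%d/%Y %H:%M:%S.%f")

# Crash time of the hal process from the MSM-B system dump.
crash_ts = datetime.strptime("Tue Apr 26 01:23:10 2016", "%a %b %d %H:%M:%S %Y")

# Assumption: same clock and time zone for both sources.
gap = log_ts - crash_ts
print("log entry is", gap, "after the hal crash")
```

Under that assumption the log entries come roughly eight and a half hours after the recorded crash, so the two may belong to different boots of MSM-B.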

Please take a look at both articles below regarding the upgrade instructions and some additional information:

https://gtacknowledge.extremenetworks.com/articles/Q_A/Upgrade-from-12-X-to-15-X
https://gtacknowledge.extremenetworks.com/articles/How_To/Best-practices-during-the-upgrade-of-an-EX...

Upgrading from 12.5.x or 12.6.x to 15.X should not require any intermediate SW upgrade as long as all the hardware supports the newer version of EXOS.

Regarding the OSPF instability, is it related to the neighbor relationship? Which 15.7 version did you try?
Njanyana Buthelezi


Yes, the OSPF instability is related to the neighbor relationship. This is what happens on my end device:

Apr 26 13:01:48 X  rpd[1691]: RPD_OSPF_NBRDOWN: OSPF neighbor x.x.x.x (realm ospf-v2 ge-0/2/0.1953 area 0.0.0.2) state changed from Full to Init due to 1WayRcvd (event reason: neighbor is in one-way mode)
Apr 26 13:01:50 X  rpd[1691]: RPD_OSPF_NBRDOWN: OSPF neighbor x.x.x.x (realm ospf-v2 ge-0/2/0.667 area 0.0.0.20) state changed from Full to Init due to 1WayRcvd (event reason: neighbor is in one-way mode)
Apr 26 13:01:54 X  rpd[1691]: RPD_OSPF_NBRDOWN: OSPF neighbor x.x.x.x(realm ospf-v2 ge-0/2/0.630 area 0.0.0.10) state changed from Full to Init due to 1WayRcvd (event reason: neighbor is in one-way mode)
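
To see which interfaces and areas are flapping, the `RPD_OSPF_NBRDOWN` messages above can be summarized with a short Python sketch like the one below. The regular expression is written against these three lines only and is an assumption about the message layout, not a general Junos log parser; `NBRDOWN_RE` and `parse_nbrdown` are illustrative names.

```python
import re
from collections import Counter

# Matches the RPD_OSPF_NBRDOWN lines quoted above. The space before
# "(realm" is optional because one of the lines omits it.
NBRDOWN_RE = re.compile(
    r"RPD_OSPF_NBRDOWN: OSPF neighbor (?P<nbr>\S+?)\s*"
    r"\(realm (?P<realm>\S+) (?P<ifname>\S+) area (?P<area>[\d.]+)\) "
    r"state changed from (?P<old>\w+) to (?P<new>\w+) due to (?P<event>\w+)"
)

def parse_nbrdown(lines):
    """Return a dict of named fields for each matching log line."""
    return [m.groupdict() for m in map(NBRDOWN_RE.search, lines) if m]

log_lines = [
    "Apr 26 13:01:48 X  rpd[1691]: RPD_OSPF_NBRDOWN: OSPF neighbor x.x.x.x (realm ospf-v2 ge-0/2/0.1953 area 0.0.0.2) state changed from Full to Init due to 1WayRcvd (event reason: neighbor is in one-way mode)",
    "Apr 26 13:01:50 X  rpd[1691]: RPD_OSPF_NBRDOWN: OSPF neighbor x.x.x.x (realm ospf-v2 ge-0/2/0.667 area 0.0.0.20) state changed from Full to Init due to 1WayRcvd (event reason: neighbor is in one-way mode)",
    "Apr 26 13:01:54 X  rpd[1691]: RPD_OSPF_NBRDOWN: OSPF neighbor x.x.x.x(realm ospf-v2 ge-0/2/0.630 area 0.0.0.10) state changed from Full to Init due to 1WayRcvd (event reason: neighbor is in one-way mode)",
]

events = parse_nbrdown(log_lines)

# Count drops per (interface, area) and per triggering event.
per_interface = Counter((e["ifname"], e["area"]) for e in events)
per_event = Counter(e["event"] for e in events)

for (ifname, area), count in per_interface.items():
    print(ifname, "area", area, "drops:", count)
print(dict(per_event))
```

All three drops here are Full to Init on a `1WayRcvd` event, each on a different logical interface and area.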


On 12.5, all the neighbors are established (Full). The two EXOS versions on the switch are:

Slot:             MSM-A *                      MSM-B                  
                  ------------------------     ------------------------
Current State:    MASTER                       BACKUP (In Sync)       

Image Selected:   primary                      primary                
Image Booted:     primary                      primary                
Primary ver:      12.5.4.5                     12.5.4.5               
Secondary ver:    15.7.1.4                     15.7.1.4   

 
The upgrade procedure was followed, except for the unfortunate installation interruption, which I think caused the crash/corruption on MSM-B, since it was the backup during installation.

I think I'll have to install EXOS from the BootROM on that particular MSM.


           

Henrique, Employee

Hi Njanyana,

I did a quick review of the EXOS 15.7 release notes and found some OSPF bugs fixed in 15.7.3.1-patch1-3. However, I'm not sure whether those are related to the issue you have faced. One of them is related to OSPF neighbors, but whether it applies depends on your OSPF configuration.
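
As a quick check that the 15.7.1.4 image in this thread predates that fix release, here is a minimal Python sketch that compares only the dotted numeric part of each version string. Ignoring the patch suffix is a simplification for illustration, and `base_version` is a made-up helper name.

```python
def base_version(version):
    """Return the dotted numeric part of a version string as a tuple of ints.

    Anything after the first "-" (for example a patch suffix) is ignored.
    """
    numeric = version.split("-", 1)[0]
    return tuple(int(part) for part in numeric.split("."))

installed = base_version("15.7.1.4")          # secondary image in this thread
fixed_in = base_version("15.7.3.1-patch1-3")  # release mentioned above

if installed < fixed_in:
    print("installed image is older than the release with the OSPF fixes")
else:
    print("installed image is at or beyond the release with the OSPF fixes")
```

Tuples compare field by field, so 15.7.1.4 sorts before 15.7.3.1 without any string tricks.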

One option would be to upgrade the BD8k with only MSM-A installed and, if everything works as expected, plug in MSM-B and run the "synchronize" command.

You can also review the EXOS firmware recommendations in the link below:

http://documentation.extremenetworks.com/hw_sw_compatibility/HardwareSoftwareCompatibility/r_exos-re...