BD-12804: Slot (GM-20XTR) turns off with strange Erro:HAL.Card.Error

  • Question
  • Updated 12 months ago
  • Answered
Today, after 2 years of uptime, one of the slots on our BD-12804 turned off for no apparent reason. We got some strange log entries just before the incident:
06/29/2017 20:06:29.22 <Warn:DM.Warning> MSM-A: Slot-1 FAILED (2) cartmanPollMBReady-594: cartman4 on slot 1 (1 errors):Mailbox Polling Timeour
06/29/2017 20:06:29.22 <Warn:DM.Warning> MSM-A: Slot-1, Error 12: cartmanPollMBReady-594: cartman4 on slot 1 (1 errors):Mailbox Polling Timeou)
06/29/2017 20:06:29.21 <Erro:HAL.Card.Error> MSM-A: cartmanPollMBReady-594: cartman4 on slot 1 (1 errors):Mailbox Polling Timeout(reg 705=87)
After that, all the ports on the slot started to go down:
06/29/2017 20:06:29.64 <Info:vlan.msgs.portLinkStateDown> MSM-A: Port 1:5 link down
06/29/2017 20:06:29.64 <Info:vlan.msgs.portLinkStateDown> MSM-A: Port 1:4 link down
06/29/2017 20:06:29.23 <Info:LACP.RemPortFromAggr> MSM-A: Remove port 1:3 from aggregator
06/29/2017 20:06:29.23 <Info:LACP.RemPortFromAggr> MSM-A: Remove port 1:2 from aggregator
06/29/2017 20:06:29.23 <Info:LACP.RemPortFromAggr> MSM-A: Remove port 1:1 from aggregator
06/29/2017 20:06:29.22 <Info:vlan.dbg.info> MSM-A: Port 1:3 is Down, remove from aggregator 1:1
06/29/2017 20:06:29.22 <Info:vlan.msgs.portLinkStateDown> MSM-A: Port 1:3 link down
06/29/2017 20:06:29.22 <Info:vlan.dbg.info> MSM-A: Port 1:2 is Down, remove from aggregator 1:1
06/29/2017 20:06:29.22 <Info:vlan.msgs.portLinkStateDown> MSM-A: Port 1:2 link down
06/29/2017 20:06:29.22 <Info:vlan.dbg.info> MSM-A: Port 1:1 is Down, remove from aggregator 1:1
06/29/2017 20:06:29.22 <Info:vlan.msgs.portLinkStateDown> MSM-A: Port 1:1 link down
I have not found any references to this problem on the Internet, and the logs look really strange to me. I have not found any mention of PollMBReady or Mailbox Polling Timeout in the documentation. We do not even have any mailboxes in the configuration of the BD-12804.

Our equipment:
Chassis     : 804023-00-09 06135-01409 Rev 9.0
Slot-1      : 804032-00-06 06284-00059 Rev 6.0
Slot-5      : 804032-00-06 0721F-00331 Rev 6.0
Slot-6      : 804032-00-06 0720F-00670 Rev 6.0
MSM-A       : 804047-00-07 0711F-00084 Rev 7.0 BootROM: 1.0.0.3    IMG: 12.6.2.10 
PSUCTRL-1   : 700087-00-07 06105-00862 Rev 7.0 BootROM: 2.13      
PSUCTRL-2   : 700087-00-07 06105-00911 Rev 7.0 BootROM: 2.13      
PSU-1       : PS 2336 4300-00145 0722K-30342 Rev 10.0
PSU-2       : PS 2336 4300-00137 0502J-03684 Rev 7.0
PSU-3       : PS 2336 4300-00137 0519J-05462 Rev 7.0
Image   : ExtremeXOS version 12.6.2.10 v1262b10 by release-manager
          on Thu Sep 29 17:48:22 EDT 2011
BootROM : 1.0.0.3

Any ideas? After a restart the chassis works as perfectly as ever, but I am afraid the problem will repeat, and I don't understand what was wrong with our Slot-1 (GM-20XTR).
ilyinilyas


Posted 12 months ago

Nick Yakimenko

This looks like it could be a hardware issue, e.g. capacitors damaged by overheating.

Drew C., Community Manager

I did some searching and found a few instances of this that were resolved with software updates, but that was in 12.0 and 12.1 versions, so 12.6 should be okay. I see a later instance where an RMA was requested for the blade and no trouble was found at the repair facility. It's hard for me to say with certainty what caused this, but you'll want to monitor for sure. Keep in mind that you're dealing with 11+ year old equipment :)
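
For the monitoring part, here is a minimal sketch in Python, assuming the switch log is exported as plain text in the same layout as the excerpts posted above (timestamp, `<Severity:Component>`, node, message). The layout is inferred only from those excerpts, and the names `parse_line` and `summarize` are hypothetical helpers, not part of any Extreme Networks tool. It reports which slots logged a `HAL.Card.Error` and how many port link-down events each slot had.

```python
import re
from collections import Counter
from datetime import datetime
from typing import NamedTuple, Optional

# Assumed layout, inferred only from the log excerpts in this thread:
# MM/DD/YYYY HH:MM:SS.cc <Severity:Component> Node: message
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\.\d{2}) "
    r"<(\w+):([\w.]+)> ([\w-]+): (.*)$"
)
SLOT_RE = re.compile(r"on slot (\d+)")
PORT_DOWN_RE = re.compile(r"Port (\d+):(\d+) link down")


class Entry(NamedTuple):
    when: datetime
    severity: str
    component: str
    node: str
    message: str


def parse_line(line: str) -> Optional[Entry]:
    """Split one log line into fields; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp, severity, component, node, message = m.groups()
    when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S.%f")
    return Entry(when, severity, component, node, message)


def summarize(lines):
    """Return (set of slots with HAL.Card.Error, link-down count per slot)."""
    fault_slots = set()
    ports_down = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip anything that is not a recognizable log line
        if entry.component == "HAL.Card.Error":
            m = SLOT_RE.search(entry.message)
            if m:
                fault_slots.add(int(m.group(1)))
        m = PORT_DOWN_RE.search(entry.message)
        if m:
            ports_down[int(m.group(1))] += 1
    return fault_slots, ports_down


if __name__ == "__main__":
    sample = [
        "06/29/2017 20:06:29.64 <Info:vlan.msgs.portLinkStateDown> MSM-A: Port 1:5 link down",
        "06/29/2017 20:06:29.21 <Erro:HAL.Card.Error> MSM-A: cartmanPollMBReady-594: "
        "cartman4 on slot 1 (1 errors):Mailbox Polling Timeout(reg 705=87)",
    ]
    faults, down = summarize(sample)
    print("slots with card errors:", sorted(faults))
    print("link-down events per slot:", dict(down))
```

Run periodically against the exported log, something like this could flag a repeat of the card error early; the matching patterns would need adjusting if your log export looks different.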

EtherMAN, Embassador

If this happens again, and since you will be rebooting to clear it anyway, you may want to run extended diagnostics on slot 1 and the MSM to see whether any issues show up. Be warned, though: if the diagnostics find bad memory or other faulty hardware, they may take the bad card offline, so I would only do this if you have a spare. Also be sure to have a backup of the config if you run them on the MSM... We only had one 12k in our network and, if I recall, the diagnostics take about 5 or 6 minutes per card and you have to run them one card at a time.