C5/C3/B5/B3-Series Firmware Resolution for DMA Error reset

  • Article
  • Updated 3 years ago
  • (Edited)
Article ID: 16165 

C5-Series; firmware through, through,
C3-Series; firmware through
B5-Series; firmware through, through,
B3-Series; firmware through

An L2 table parity error is misreported as a DMA error.
DMA-type errors appear in the current.log, followed by a unit reboot event.

The current.log (5487) displays DMA-type errors (14007); for example:

<160>Apr 8 06:35:10 SIM[89329680]: hwutils.c(4455) 37651 %% Unit 1 DMA regs:PCIMEM_START(0x055cb8a0) SBUS_START(0x07a01000) ENTRY_CNT(0x00001000) CFG(0x0004011c) SBUS_ADDR(0x07a01000) CMIC_SCHAN_CTRL(0x00000000) CMIC_DMA_STAT(0x00082012) CMIC_IRQ_STAT(0x60000102) rv(0xfffffff5) LINE(2986)
<160>Apr 8 06:35:11 SIM[89329680]: hwutils.c(4476) 37653 %% PCI Status for CPU=0x20a0
<160>Apr 8 06:35:11 SIM[89329680]: hwutils.c(4470) 37655 %% PCI Status for Device 0x14e4:0xb620=0x02a0
<160>Apr 8 06:35:11 SIM[89329680]: hwutils.c(4483) 37659 %% MPC85xx DMA/PCI register dump
<160>Apr 8 06:35:11 SIM[89329680]: hwutils.c(4502) 37661 %% DGSR(0x00000000) ERR_DR(0x80000040) ERR_ATTRIB(0x001fa001) ERR_ADDR(0x00000000) ERR_EXT_ADDR(0x00000000) ERR_DL(0x00000000) ERR_DH(0x00000000)
<160>Apr 8 06:35:11 SIM[89329680]: hwutils.c(4518) 37663 %% 1:PEX_ERR_DR(0x00000000) PEX_ERR_CAP_STAT(0x00000000) PEX_ERR_CAP_R0(0x00000000) PEX_ERR_CAP_R1(0x00000000) PEX_ERR_CAP_R2(0x00000000) PEX_ERR_CAP_R3(0x00000000)
<160>Apr 8 06:35:11 SIM[89329680]: broad_hpc_drv.c(2686) 37669 %% _soc_xgs3_mem_dma: L2_ENTRY.ipipe0 failed(NAK), unit 1
<160>Apr 8 06:35:11 SIM[89329680]: broad_hpc_drv.c(2686) 37670 %% soc_l2x_thread: Too many errors
<160>Apr 8 06:35:11 DRIVER[89329680]: hwutils.c(4237) 37671 %% soc_l2x_thread unit = 1: DMA failed too many times
<160>Apr 8 06:35:11 SIM[89329680]: hwutils.c(4238) 37672 %% soc_l2x_thread unit = 1: DMA failed too many times
<160>Apr 8 06:35:21 SIM[172323408]: hwutils.c(3223) 37673 %% Error(0x6c327800)
<160>Apr 8 06:35:28 SIM[65830416]: hwutils.c(4715) 37675 %% ERROR:Code exception:Watchdog no longer being serviced.

The current.log (5487) goes on to display a task suspension line which identifies the event as one of three known varieties:

  • From C5-Series 14739: "Task C5IntProc(0x<address>) is suspended with error 2, creating file sysDmpxMmmddyy.z"
  • From B5-Series 14793: "Task IntProc(0x<address>) is suspended with error 2, creating file sysDmpxMmmddyy.z"
  • From C3/B3-Series 14755: "Task CPLD_Status(0x<address>) is suspended with error 2, creating file sysDmpxMmmddyy.z"

The passage of a high-energy particle can flip a bit in table memory. The flipped bit is detected as a memory parity error, which in turn causes the table DMA to fail. The rate at which these errors have occurred is within the range predicted for this class of silicon.
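To illustrate how a single bit transition shows up as a parity error, here is a small sketch using one even-parity bit per stored word. This is an assumption for illustration only: the article does not describe the parity scheme the switch silicon actually uses, and the word value and function names are hypothetical.

```python
def even_parity(word: int) -> int:
    """Return 1 if the number of set bits in word is odd, else 0."""
    return bin(word).count("1") % 2


def store(word: int):
    """Model a table write: keep the word together with its parity bit."""
    return word, even_parity(word)


def parity_ok(word: int, parity: int) -> bool:
    """Model a table read: recompute parity and compare with the stored bit."""
    return even_parity(word) == parity


entry, parity = store(0b10110010)   # hypothetical table entry
intact = parity_ok(entry, parity)   # stored word is unchanged

flipped = entry ^ (1 << 3)          # a single bit changes state in memory
corrupted_detected = not parity_ok(flipped, parity)
```

A single parity bit detects any odd number of flipped bits but cannot say which bit changed, which is consistent with a recovery approach that discards the affected entry rather than repairing it.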

The sequence of events described above will in all likelihood never occur on any given unit. Across a broad deployment of many such units, however, it may well be experienced somewhere in the network.
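The difference between per-unit and fleet-wide likelihood can be made concrete with a short calculation. The sketch below assumes that units fail independently and uses a made-up per-unit probability; the article gives no actual error rate, so both numbers are hypothetical.

```python
def p_at_least_one(p_unit: float, units: int) -> float:
    """Probability that at least one of `units` independent units sees the event,
    given each unit sees it with probability `p_unit` over the same period."""
    return 1.0 - (1.0 - p_unit) ** units


# Hypothetical figures, chosen only to show the shape of the effect.
p_single = p_at_least_one(0.001, 1)      # one unit
p_fleet = p_at_least_one(0.001, 5000)    # a deployment of 5000 units
```

With these assumed inputs the chance for a single unit stays at one in a thousand, while the chance that some unit in the fleet is affected exceeds 99 percent.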

For the C5, C3, B5, or B3, upgrade to firmware 6.61 or higher.
For the C5 or B5, upgrade to firmware 6.71 or higher.
For the C5 or B5, upgrade to firmware 6.81 or higher.
Release notes state, in the 'Firmware Changes and Enhancements' section:
16086    Attempt to recover from a L2 table DMA error that previously resulted in a reset with a log entry of: "soc_l2x_thread DMA failed too many times". On an L2 Table DMA failure we will now walk the table to find the corrupted entry and remove it. The expected warning message is: "warning soc_l2x_thread: Bad L2 table entry found. Recovering".

Upon detection of a parity error, the affected table entry is removed and a set of new messages is logged; for example:

<160>May 15 16:04:31 SIM[99694928]: broad_hpc_drv.c(2686) 710 % warning soc_l2x_thread: DMA failed. Attempting recovery
<160>May 15 16:04:31 SIM[99694928]: broad_hpc_drv.c(2686) 711 % warning soc_l2x_thread: Bad L2 table entry found. Recovering
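After the upgrade, the two warning messages above are the evidence that a parity event occurred and was handled without a reset. A minimal sketch for counting them in an exported log follows; the function name `summarize_recovery` and the returned dictionary keys are hypothetical, and only the two message strings come from the example above.

```python
ATTEMPT_MARKER = "soc_l2x_thread: DMA failed. Attempting recovery"
RECOVERED_MARKER = "soc_l2x_thread: Bad L2 table entry found. Recovering"


def summarize_recovery(lines):
    """Count recovery attempts and completed recoveries in an iterable of log lines."""
    attempts = 0
    recovered = 0
    for line in lines:
        if ATTEMPT_MARKER in line:
            attempts += 1
        elif RECOVERED_MARKER in line:
            recovered += 1
    return {
        "attempts": attempts,
        "recovered": recovered,
        # Attempts with no matching "Recovering" line may deserve a closer look.
        "unmatched_attempts": max(attempts - recovered, 0),
    }


# The two example lines from this article.
example = [
    "<160>May 15 16:04:31 SIM[99694928]: broad_hpc_drv.c(2686) 710 % warning soc_l2x_thread: DMA failed. Attempting recovery",
    "<160>May 15 16:04:31 SIM[99694928]: broad_hpc_drv.c(2686) 711 % warning soc_l2x_thread: Bad L2 table entry found. Recovering",
]
summary = summarize_recovery(example)
```

Treating an attempt without a matching recovery line as suspicious is a design choice of this sketch, not documented firmware behavior.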

With this fix the unit no longer resets. Note, however, that all traffic flowing through that unit is briefly forwarded using the soft path (that is, by the CPU) while the problematic table entry is being cleared.

FAQ User, Official Rep

Posted 3 years ago
