C5/C3/B5/B3-Series Firmware Resolution for DMA Error reset


Article ID: 16165

Products
C5-Series; firmware 6.42.10.0016 through 6.61.12.0005, 6.71.01.0067 through 6.71.04.0004, 6.81.01.0027
C3-Series; firmware 6.42.10.0016 through 6.61.12.0005
B5-Series; firmware 6.42.10.0016 through 6.61.12.0005, 6.71.01.0067 through 6.71.04.0004, 6.81.01.0027
B3-Series; firmware 6.42.10.0016 through 6.61.12.0005
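
As a rough aid for checking a unit against the list above, the following sketch compares a firmware string with the affected ranges. It assumes the ranges are inclusive and that field-by-field numeric comparison matches release order; the names `AFFECTED`, `parse_version`, and `is_affected` are illustrative and not part of any vendor tool.

```python
# Affected firmware ranges, copied from the Products list in this article.
_COMMON = [("6.42.10.0016", "6.61.12.0005")]
_C5_B5_EXTRA = [("6.71.01.0067", "6.71.04.0004"),
                ("6.81.01.0027", "6.81.01.0027")]
AFFECTED = {
    "C5": _COMMON + _C5_B5_EXTRA,
    "C3": _COMMON,
    "B5": _COMMON + _C5_B5_EXTRA,
    "B3": _COMMON,
}


def parse_version(version):
    """Turn '6.61.12.0005' into (6, 61, 12, 5) so fields compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_affected(series, version):
    """Hypothetical helper: True if version lies inside a listed range (inclusive)."""
    v = parse_version(version)
    return any(parse_version(low) <= v <= parse_version(high)
               for low, high in AFFECTED.get(series, []))


print(is_affected("C5", "6.61.12.0005"))  # True
print(is_affected("C5", "6.61.13.0006"))  # False
```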

Symptoms
L2 Table parity error misreported as DMA error.
DMA-type errors display in the current.log, followed by a unit reboot event.

The current.log (5487) displays DMA-type errors (14007); for example:

<160> Apr 8 06:35:10 0.0.0.0-1 SIM[89329680]: hwutils.c(4455) 37651 %% Unit 1 DMA regs:PCIMEM_START(0x055cb8a0) SBUS_START(0x07a01000) ENTRY_CNT(0x00001000) CFG(0x0004011c) SBUS_ADDR(0x07a01000) CMIC_SCHAN_CTRL(0x00000000) CMIC_DMA_STAT(0x00082012) CMIC_IRQ_STAT(0x60000102) rv(0xfffffff5) LINE(2986)
<160> Apr 8 06:35:11 0.0.0.0-1 SIM[89329680]: hwutils.c(4476) 37653 %% PCI Status for CPU=0x20a0
<160> Apr 8 06:35:11 0.0.0.0-1 SIM[89329680]: hwutils.c(4470) 37655 %% PCI Status for Device 0x14e4:0xb620=0x02a0
<160> Apr 8 06:35:11 0.0.0.0-1 SIM[89329680]: hwutils.c(4483) 37659 %% MPC85xx DMA/PCI register dump
<160> Apr 8 06:35:11 0.0.0.0-1 SIM[89329680]: hwutils.c(4502) 37661 %% DGSR(0x00000000) ERR_DR(0x80000040) ERR_ATTRIB(0x001fa001) ERR_ADDR(0x00000000) ERR_EXT_ADDR(0x00000000) ERR_DL(0x00000000) ERR_DH(0x00000000)
<160> Apr 8 06:35:11 0.0.0.0-1 SIM[89329680]: hwutils.c(4518) 37663 %% 1:PEX_ERR_DR(0x00000000) PEX_ERR_CAP_STAT(0x00000000) PEX_ERR_CAP_R0(0x00000000) PEX_ERR_CAP_R1(0x00000000) PEX_ERR_CAP_R2(0x00000000) PEX_ERR_CAP_R3(0x00000000)
<160> Apr 8 06:35:11 0.0.0.0-1 SIM[89329680]: broad_hpc_drv.c(2686) 37669 %% _soc_xgs3_mem_dma: L2_ENTRY.ipipe0 failed(NAK), unit 1
<160> Apr 8 06:35:11 0.0.0.0-1 SIM[89329680]: broad_hpc_drv.c(2686) 37670 %% soc_l2x_thread: Too many errors
<160> Apr 8 06:35:11 0.0.0.0-1 DRIVER[89329680]: hwutils.c(4237) 37671 %% soc_l2x_thread unit = 1: DMA failed too many times
<160> Apr 8 06:35:11 0.0.0.0-1 SIM[89329680]: hwutils.c(4238) 37672 %% soc_l2x_thread unit = 1: DMA failed too many times
<160> Apr 8 06:35:21 0.0.0.0-1 SIM[172323408]: hwutils.c(3223) 37673 %% Error(0x6c327800)
<160> Apr 8 06:35:28 0.0.0.0-1 SIM[65830416]: hwutils.c(4715) 37675 %% ERROR:Code exception:Watchdog no longer being serviced.


The current.log (5487) then displays a task suspension line, which identifies the event as one of three known varieties:
  • From C5-Series 14739: "Task C5IntProc(0x<address>) is suspended with error 2, creating file sysDmpxMmmddyy.z"
  • From B5-Series 14793: "Task IntProc(0x<address>) is suspended with error 2, creating file sysDmpxMmmddyy.z"
  • From C3/B3-Series 14755: "Task CPLD_Status(0x<address>) is suspended with error 2, creating file sysDmpxMmmddyy.z"

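
The symptom sequence described above (repeated DMA failures followed by a task suspension) can be searched for in a saved copy of current.log. This is a minimal sketch, not a vendor utility: the regular expressions are derived only from the example messages in this article, the address is assumed to be hexadecimal, and `find_reset_signature` is a hypothetical name.

```python
import re

# Patterns derived from the example log lines quoted in this article.
DMA_FAIL = re.compile(r"soc_l2x_thread unit = \d+: DMA failed too many times")
# Assumption: <address> is printed as hexadecimal digits.
TASK_SUSPEND = re.compile(
    r"Task (C5IntProc|IntProc|CPLD_Status)\(0x[0-9a-fA-F]+\) "
    r"is suspended with error 2"
)


def find_reset_signature(lines):
    """Return (dma_failure_count, suspended_task_name_or_None).

    The task name is recorded only if a suspension line appears after
    at least one 'DMA failed too many times' line.
    """
    dma_hits = 0
    task = None
    for line in lines:
        if DMA_FAIL.search(line):
            dma_hits += 1
        match = TASK_SUSPEND.search(line)
        if match and dma_hits and task is None:
            task = match.group(1)
    return dma_hits, task


# Illustrative input: the address 0x1234abcd is made up for this example.
sample = [
    "Apr 8 06:35:11 0.0.0.0-1 DRIVER[89329680]: hwutils.c(4237) 37671 %% "
    "soc_l2x_thread unit = 1: DMA failed too many times",
    "Task C5IntProc(0x1234abcd) is suspended with error 2, "
    "creating file sysDmpxMmmddyy.z",
]
print(find_reset_signature(sample))  # (1, 'C5IntProc')
```

A result with a non-zero count and a task name suggests the log matches the pattern in this article; a match is only a hint, since the sketch has not been validated against other log variants.
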
Cause
The passage of a high-energy particle can flip a bit in table memory. The flipped bit is detected as a memory parity error, which in turn causes the table DMA to fail. The observed rate of these errors is within the range predicted for this class of silicon.

This sequence of events will in all likelihood never occur on any given unit, but in a broad deployment of many such units it may well occur somewhere in the network.

Solution
For the C5, C3, B5, or B3, upgrade to 6.61 firmware 6.61.13.0006 or higher.
For the C5 or B5, upgrade to 6.71 firmware 6.71.05.0008 or higher.
For the C5 or B5, upgrade to 6.81 firmware 6.81.02.0007 or higher.
Release notes state, in the 'Firmware Changes and Enhancements' section:
16086: Attempt to recover from a L2 table DMA error that previously resulted in a reset with a log entry of: "soc_l2x_thread DMA failed too many times". On an L2 Table DMA failure we will now walk the table to find the corrupted entry and remove it. The expected warning message is: "warning soc_l2x_thread: Bad L2 table entry found. Recovering".
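
The upgrade thresholds in the Solution list can be expressed as a small lookup keyed by firmware train. The sketch below assumes "or higher" means a higher build within the same 6.61, 6.71, or 6.81 train; `FIXED` and `has_fix` are illustrative names, and trains not named in this article return None rather than a guess.

```python
# Minimum fixed builds from the Solution section, keyed by (major, minor) train.
FIXED = {
    (6, 61): ("6.61.13.0006", {"C5", "C3", "B5", "B3"}),
    (6, 71): ("6.71.05.0008", {"C5", "B5"}),
    (6, 81): ("6.81.02.0007", {"C5", "B5"}),
}


def parse_version(version):
    """Turn '6.71.05.0008' into (6, 71, 5, 8) so fields compare numerically."""
    return tuple(int(part) for part in version.split("."))


def has_fix(series, version):
    """Hypothetical helper.

    Returns True/False when the article names a fixed build for this
    series and train, or None when it does not (for example, a 6.42
    build, which would need to move to a listed train).
    """
    v = parse_version(version)
    entry = FIXED.get(v[:2])
    if entry is None:
        return None
    fixed_build, series_covered = entry
    if series not in series_covered:
        return None
    return v >= parse_version(fixed_build)


print(has_fix("C5", "6.71.04.0004"))  # False
print(has_fix("B3", "6.61.13.0006"))  # True
```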


Upon detection of a parity error, the affected table entry is removed and a set of new messages is logged; for example:

<160> May 15 16:04:31 0.0.0.0-1 SIM[99694928]: broad_hpc_drv.c(2686) 710 % warning soc_l2x_thread: DMA failed. Attempting recovery
<160> May 15 16:04:31 0.0.0.0-1 SIM[99694928]: broad_hpc_drv.c(2686) 711 % warning soc_l2x_thread: Bad L2 table entry found. Recovering


With this fix the unit no longer resets. Note, however, that while the problematic table entry is being cleared, all traffic flowing through that unit is briefly forwarded using the soft path (roughly, by the CPU).
