C5/C3/B5/B3-Series Firmware Resolution for DMA Error reset
‎12-23-2014 02:43 PM
Article ID: 16165
Products
C5-Series; firmware 6.42.10.0016 through 6.61.12.0005, 6.71.01.0067 through 6.71.04.0004, 6.81.01.0027
C3-Series; firmware 6.42.10.0016 through 6.61.12.0005
B5-Series; firmware 6.42.10.0016 through 6.61.12.0005, 6.71.01.0067 through 6.71.04.0004, 6.81.01.0027
B3-Series; firmware 6.42.10.0016 through 6.61.12.0005
Symptoms
L2 Table parity error misreported as DMA error.
DMA-type errors display in the current.log, followed by a unit reboot event.
The current.log (5487) displays DMA-type errors (14007); for example:
<160> Apr 8 06:35:10 0.0.0.0-1 SIM[89329680]: hwutils.c(4455) 37651 %% Unit 1 DMA regs:PCIMEM_START(0x055cb8a0) SBUS_START(0x07a01000) ENTRY_CNT(0x00001000) CFG(0x0004011c) SBUS_ADDR(0x07a01000) CMIC_SCHAN_CTRL(0x00000000) CMIC_DMA_STAT(0x00082012) CMIC_IRQ_STAT(0x60000102) rv(0xfffffff5) LINE(2986)
<160> Apr 8 06:35:11 0.0.0.0-1 SIM[89329680]: hwutils.c(4476) 37653 %% PCI Status for CPU=0x20a0
<160> Apr 8 06:35:11 0.0.0.0-1 SIM[89329680]: hwutils.c(4470) 37655 %% PCI Status for Device 0x14e4:0xb620=0x02a0
<160> Apr 8 06:35:11 0.0.0.0-1 SIM[89329680]: hwutils.c(4483) 37659 %% MPC85xx DMA/PCI register dump
<160> Apr 8 06:35:11 0.0.0.0-1 SIM[89329680]: hwutils.c(4502) 37661 %% DGSR(0x00000000) ERR_DR(0x80000040) ERR_ATTRIB(0x001fa001) ERR_ADDR(0x00000000) ERR_EXT_ADDR(0x00000000) ERR_DL(0x00000000) ERR_DH(0x00000000)
<160> Apr 8 06:35:11 0.0.0.0-1 SIM[89329680]: hwutils.c(4518) 37663 %% 1:PEX_ERR_DR(0x00000000) PEX_ERR_CAP_STAT(0x00000000) PEX_ERR_CAP_R0(0x00000000) PEX_ERR_CAP_R1(0x00000000) PEX_ERR_CAP_R2(0x00000000) PEX_ERR_CAP_R3(0x00000000)
<160> Apr 8 06:35:11 0.0.0.0-1 SIM[89329680]: broad_hpc_drv.c(2686) 37669 %% _soc_xgs3_mem_dma: L2_ENTRY.ipipe0 failed(NAK), unit 1
<160> Apr 8 06:35:11 0.0.0.0-1 SIM[89329680]: broad_hpc_drv.c(2686) 37670 %% soc_l2x_thread: Too many errors
<160> Apr 8 06:35:11 0.0.0.0-1 DRIVER[89329680]: hwutils.c(4237) 37671 %% soc_l2x_thread unit = 1: DMA failed too many times
<160> Apr 8 06:35:11 0.0.0.0-1 SIM[89329680]: hwutils.c(4238) 37672 %% soc_l2x_thread unit = 1: DMA failed too many times
<160> Apr 8 06:35:21 0.0.0.0-1 SIM[172323408]: hwutils.c(3223) 37673 %% Error(0x6c327800)
<160> Apr 8 06:35:28 0.0.0.0-1 SIM[65830416]: hwutils.c(4715) 37675 %% ERROR:Code exception:Watchdog no longer being serviced.
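As a rough aid for spotting this signature in a saved copy of current.log, the following Python sketch flags the two key lines from the example above: the "DMA failed too many times" message and the watchdog exception that follows it. The function name find_dma_reset_signature and the idea of passing the log as a list of lines are illustrative assumptions, not part of any vendor tool; only the matched message text comes from the sample log.

```python
import re

# Message text copied from the sample log above; everything else is illustrative.
DMA_FAILED = re.compile(r"soc_l2x_thread unit = (\d+): DMA failed too many times")
WATCHDOG = re.compile(r"Watchdog no longer being serviced")


def find_dma_reset_signature(lines):
    """Return (units, watchdog_seen).

    units: set of unit numbers that logged the DMA failure message.
    watchdog_seen: True if a watchdog exception appears after such a failure.
    """
    units = set()
    watchdog_seen = False
    for line in lines:
        match = DMA_FAILED.search(line)
        if match:
            units.add(int(match.group(1)))
        elif units and WATCHDOG.search(line):
            watchdog_seen = True
    return units, watchdog_seen


sample = [
    "Apr 8 06:35:11 0.0.0.0-1 DRIVER[89329680]: hwutils.c(4237) 37671 %% "
    "soc_l2x_thread unit = 1: DMA failed too many times",
    "Apr 8 06:35:28 0.0.0.0-1 SIM[65830416]: hwutils.c(4715) 37675 %% "
    "ERROR:Code exception:Watchdog no longer being serviced.",
]
units, watchdog_seen = find_dma_reset_signature(sample)
print(units, watchdog_seen)
```

A match on both conditions only suggests that the log resembles the example; the task suspension line and firmware version should still be checked.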
The current.log (5487) goes on to display a task suspension line which identifies the event as one of three known varieties:
- From C5-Series 14739: "Task C5IntProc(0x<address>) is suspended with error 2, creating file sysDmpxMmmddyy.z"
- From B5-Series 14793: "Task IntProc(0x<address>) is suspended with error 2, creating file sysDmpxMmmddyy.z"
- From C3/B3-Series 14755: "Task CPLD_Status(0x<address>) is suspended with error 2, creating file sysDmpxMmmddyy.z"
Cause
A passing high-energy particle can flip a bit in table memory. The flipped bit is detected as a memory parity error, which in turn causes the table DMA to fail. The observed rate of these errors is within the norms predicted for this class of silicon.
This sequence of events will in all likelihood never occur on any given unit, but across a broad deployment of many such units it may well be experienced somewhere in the network.
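The gap between per-unit and network-wide likelihood can be illustrated with a little arithmetic. The Python sketch below assumes that units fail independently and uses a purely hypothetical per-unit probability; the article gives no actual rate, so the number is a placeholder chosen only to show the shape of the effect.

```python
def fleet_event_probability(per_unit_probability, unit_count):
    """Probability that at least one of unit_count independent units
    experiences the event, given the same per-unit probability."""
    return 1.0 - (1.0 - per_unit_probability) ** unit_count


# Hypothetical value, NOT taken from the article or from any vendor data.
p_unit = 0.001
for units in (1, 100, 1000, 5000):
    print(units, round(fleet_event_probability(p_unit, units), 3))
```

With this placeholder value, an event that is very unlikely on any single unit becomes more likely than not somewhere in a deployment of a thousand units.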
Solution
For the C5, C3, B5, or B3, upgrade to 6.61 firmware 6.61.13.0006 or higher.
For the C5 or B5, upgrade to 6.71 firmware 6.71.05.0008 or higher.
For the C5 or B5, upgrade to 6.81 firmware 6.81.02.0007 or higher.
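To check a given unit against these ranges, firmware strings can be compared field by field as integers. The following Python sketch encodes the affected ranges from the Products section and the fixed releases listed above. The function names and table layout are illustrative assumptions, and any result should be confirmed against the release notes.

```python
def parse_version(text):
    """Turn '6.61.12.0005' into (6, 61, 12, 5) so tuples compare numerically."""
    return tuple(int(part) for part in text.split("."))


# Affected ranges per series, as listed under Products (inclusive bounds).
_COMMON = [("6.42.10.0016", "6.61.12.0005")]
_C5_B5_ONLY = [("6.71.01.0067", "6.71.04.0004"), ("6.81.01.0027", "6.81.01.0027")]
AFFECTED = {
    "C5": _COMMON + _C5_B5_ONLY,
    "B5": _COMMON + _C5_B5_ONLY,
    "C3": _COMMON,
    "B3": _COMMON,
}

# Fixed releases offered for each series, as listed under Solution.
FIXED_RELEASES = {
    "C5": ["6.61.13.0006", "6.71.05.0008", "6.81.02.0007"],
    "B5": ["6.61.13.0006", "6.71.05.0008", "6.81.02.0007"],
    "C3": ["6.61.13.0006"],
    "B3": ["6.61.13.0006"],
}


def is_affected(series, firmware):
    """True if the firmware falls inside any affected range for the series."""
    version = parse_version(firmware)
    return any(
        parse_version(low) <= version <= parse_version(high)
        for low, high in AFFECTED[series]
    )


for series, firmware in [("C3", "6.61.12.0005"), ("C5", "6.71.05.0008")]:
    if is_affected(series, firmware):
        print(series, firmware, "affected; fixed in", FIXED_RELEASES[series])
    else:
        print(series, firmware, "not in an affected range")
```

Comparing parsed tuples rather than raw strings avoids mis-ordering fields with leading zeros, such as 0005 and 0016.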
Release notes state, in the 'Firmware Changes and Enhancements' section:
16086 Attempt to recover from a L2 table DMA error that previously resulted in a reset with a log entry of: "soc_l2x_thread DMA failed too many times". On an L2 Table DMA failure we will now walk the table to find the corrupted entry and remove it. The expected warning message is: "warning soc_l2x_thread: Bad L2 table entry found. Recovering".
Upon detection of a parity error, the affected table entry is removed and a set of new messages is logged; for example:
<160> May 15 16:04:31 0.0.0.0-1 SIM[99694928]: broad_hpc_drv.c(2686) 710 % warning soc_l2x_thread: DMA failed. Attempting recovery
<160> May 15 16:04:31 0.0.0.0-1 SIM[99694928]: broad_hpc_drv.c(2686) 711 % warning soc_l2x_thread: Bad L2 table entry found. Recovering
With this fix the unit no longer resets. Note, however, that all traffic flowing through that unit is briefly forwarded via the soft path (that is, by the CPU) while the problematic table entry is being cleared.
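After upgrading, a saved log can be checked to confirm that each DMA failure was followed by a recovery message. This Python sketch pairs the two warning messages shown in the example above. The function name count_recoveries and its return shape are illustrative assumptions, and it relies only on the message text quoted in this article.

```python
# Message text copied from the post-fix log example; the rest is illustrative.
ATTEMPT = "soc_l2x_thread: DMA failed. Attempting recovery"
RECOVERED = "soc_l2x_thread: Bad L2 table entry found. Recovering"


def count_recoveries(lines):
    """Return (recovered, unresolved).

    recovered: recovery attempts later followed by a recovery message.
    unresolved: recovery attempts with no recovery message after them.
    """
    pending = 0
    recovered = 0
    for line in lines:
        if ATTEMPT in line:
            pending += 1
        elif RECOVERED in line and pending:
            pending -= 1
            recovered += 1
    return recovered, pending


sample = [
    "May 15 16:04:31 0.0.0.0-1 SIM[99694928]: broad_hpc_drv.c(2686) 710 % "
    "warning soc_l2x_thread: DMA failed. Attempting recovery",
    "May 15 16:04:31 0.0.0.0-1 SIM[99694928]: broad_hpc_drv.c(2686) 711 % "
    "warning soc_l2x_thread: Bad L2 table entry found. Recovering",
]
print(count_recoveries(sample))
```

A non-zero unresolved count would be a reason to look more closely at the surrounding log entries rather than proof of a fault.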
