C5 high CPU utilization - ipMapForwardingTask

Jason_Wisniewsk
New Contributor
I have a C5 stack that is giving me some grief. It is the core L3 for a medium-sized network; the stack consists of two C5Gs and one C5K.

Every so often, when adding new hardware to the network, the CPU on the device goes nuts, and the only resolution is to disconnect trunk ports more or less at random, which essentially resets STP.

Today we added a new HP stack to the mix to act as an L2 switch for a VM network. This all went fine: the uplinks are trunked on both sides and we have a good link. I plugged in a VM server without issue. I then plugged in a simple DHCP device (an APC PDU) and it completely brought down the network. The CPU went to 95% and pretty much all traffic stopped. The process breakdown is below:

Total CPU Utilization:

Switch  CPU   5 sec   1 min   5 min
-------------------------------------
   3     1     95%     96%     96%

Switch:3 CPU:1

TID        Name                  5Sec     1Min     5Min
----------------------------------------------------------
3eb5430    tNet0                 0.20%    0.17%    0.13%
3f53ea0    tXbdService           0.00%    0.08%    0.02%
4713b20    osapiTimer            2.20%    2.16%    2.13%
4a79ff0    bcmL2X.0              0.60%    0.53%    0.57%
4b26eb0    bcmCNTR.0             1.00%    0.94%    0.96%
4b9f490    bcmTX                 1.00%    1.01%    1.19%
53b9f40    bcmRX                16.00%   15.57%   16.38%
54042f0    bcmATP-TX            25.60%   22.90%   23.34%
54097f0    bcmATP-RX             0.00%    0.08%    0.14%
59fb7f0    MAC Send Task         0.20%    0.20%    0.20%
5a0ccf0    MAC Age Task          0.20%    0.06%    0.05%
6e02f30    bcmLINK.0             0.40%    0.40%    0.40%
90e38d0    osapiMemMon           2.20%    2.47%    2.63%
91177f0    SysIdleTask           2.40%    1.64%    1.74%
920dce0    C5IntProc             0.00%    0.11%    0.07%
9dfe8b0    hapiRxTask            2.00%    1.81%    1.86%
9e33d40    tEmWeb                0.40%    0.32%    0.18%
b61e280    EDB BXS Req           0.00%    4.58%    2.32%
b763a90    SNMPTask              0.00%    1.30%    0.68%
b7ab5d0    RMONTask              0.00%    0.31%    1.24%
e2f2e30    dot1s_timer_task      1.00%    1.00%    1.00%
106fa4a0   fftpTask              0.00%    0.04%    0.01%
10793cc0   ipMapForwardingTask  42.60%   39.87%   40.37%
10c3a880   ARP Timer             0.20%    0.03%    0.00%
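
For what it's worth, the per-task output is regular enough to watch with a short script instead of eyeballing it during an outage. A rough Python sketch, assuming the output above has been saved to a plain text file (the filename and the 10% threshold are just placeholders):

import re
import sys

# Matches rows like: "10793cc0 ipMapForwardingTask 42.60% 39.87% 40.37%"
ROW = re.compile(r"^\s*([0-9a-f]+)\s+(.+?)\s+([\d.]+)%\s+([\d.]+)%\s+([\d.]+)%\s*$")

def hot_tasks(lines, threshold=10.0):
    """Yield (task, 5sec, 1min, 5min) for tasks at or above the 5-second threshold."""
    for line in lines:
        m = ROW.match(line)
        if m:
            tid, name, five_sec, one_min, five_min = m.groups()
            if float(five_sec) >= threshold:
                yield name, float(five_sec), float(one_min), float(five_min)

if __name__ == "__main__":
    # Usage: python hot_tasks.py cpu_dump.txt
    with open(sys.argv[1]) as f:
        for name, s5, m1, m5 in hot_tasks(f):
            print(f"{name:25s} 5s={s5:5.1f}%  1m={m1:5.1f}%  5m={m5:5.1f}%")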

And this is what we saw in the logs. There was a topo change, but it had happened almost 2 hours before.

<166>Mar 1 07:21:45 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2257 %% Setting Port(130) instance(4095) State: DISCARDING
<166>Mar 1 07:21:45 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2258 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 07:21:45 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2259 %% Setting Port(130) instance(0) State: DISCARDING
<166>Mar 1 07:21:45 10.10.1.1-3 DOT1S[238044000]: dot1s_sm.c(4253) 2260 %% Setting Port(130) Role: ROLE_DESIGNATED | STP Port(130) | Int Cost(2000) | Ext Cost(2000)
<166>Mar 1 07:21:47 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1360) 2261 %% Setting Port(130) instance(0) State: LEARNING
<166>Mar 1 07:21:47 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1424) 2262 %% Setting Port(130) instance(0) State: FORWARDING
<166>Mar 1 09:00:27 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2274 %% Setting Port(130) instance(0) State: DISCARDING
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2277 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2278 %% Setting Port(123) instance(0) State: DISCARDING
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2279 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2280 %% Setting Port(124) instance(0) State: DISCARDING
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1346) 2281 %% Setting Port(445) instance(4095) State: DISABLED
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1346) 2282 %% Setting Port(446) instance(4095) State: DISABLED
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2283 %% Setting Port(123) instance(4095) State: DISCARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2284 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2285 %% Setting Port(123) instance(0) State: DISCARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_sm.c(4253) 2286 %% Setting Port(123) Role: ROLE_DESIGNATED | STP Port(123) | Int Cost(20000) | Ext Cost(20000)
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1360) 2287 %% Setting Port(123) instance(0) State: LEARNING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1424) 2288 %% Setting Port(123) instance(0) State: FORWARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1503) 2289 %% Setting Port(445) instance(4095) State: MANUAL_FORWARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1503) 2290 %% Setting Port(446) instance(4095) State: MANUAL_FORWARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2291 %% Setting Port(124) instance(4095) State: DISCARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2292 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2293 %% Setting Port(124) instance(0) State: DISCARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_sm.c(4253) 2294 %% Setting Port(124) Role: ROLE_DESIGNATED | STP Port(124) | Int Cost(20000) | Ext Cost(20000)
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1360) 2295 %% Setting Port(124) instance(0) State: LEARNING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1424) 2296 %% Setting Port(124) instance(0) State: FORWARDING
<166>Mar 1 09:04:52 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2297 %% Setting Port(130) instance(4095) State: DISCARDING
<166>Mar 1 09:04:52 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2298 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 09:04:52 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2299 %% Setting Port(130) instance(0) State: DISCARDING
<166>Mar 1 09:04:52 10.10.1.1-3 DOT1S[238044000]: dot1s_sm.c(4253) 2300 %% Setting Port(130) Role: ROLE_DESIGNATED | STP Port(130) | Int Cost(2000) | Ext Cost(2000)
<166>Mar 1 09:04:54 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1360) 2301 %% Setting Port(130) instance(0) State: LEARNING
<166>Mar 1 09:04:54 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1424) 2302 %% Setting Port(130) instance(0) State: FORWARDING
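
In case it's useful to anyone, the DOT1S entries are regular enough to boil down with a few lines of Python instead of reading them one by one. This is only a sketch written against the exact format shown above; the regex would need adjusting for a different syslog prefix:

import re
import sys
from collections import Counter

# Pattern written against the DOT1S lines above, e.g.
# "... dot1s_ih.c(1485) 2283 %% Setting Port(123) instance(0) State: DISCARDING"
STATE = re.compile(r"Setting Port\((\d+)\) instance\((\d+)\) State: (\w+)")

def summarize(lines):
    """Count state transitions per port so repeat offenders stand out."""
    counts = Counter()
    for line in lines:
        m = STATE.search(line)
        if m:
            port, instance, state = m.groups()
            counts[(port, state)] += 1
    return counts

if __name__ == "__main__":
    # Usage: python dot1s_summary.py syslog.txt
    for (port, state), n in summarize(open(sys.argv[1])).most_common():
        print(f"port {port:>4}  {state:<18} x{n}")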

We had this happen in the past with a single junk Netgear switch that someone had plugged in at their desk. It turned out to cause the same symptoms every day when the person booted it up. There was no loop anywhere on the switch at that time.

Any ideas on where to begin?
26 REPLIES

Jason_Wisniewsk
New Contributor
In the spirit of causing me more premature grey hair, it just happened again, but this time it was so bad I had to console in just to reach the device. It did give me a chance to test my fix theory: disconnecting a third, completely separate uplink to a switch brought the processor right back to normal.

We do have an NLB device, but its nodes have stayed alive through the power outages, and I also have its MAC/IP listed statically in the ARP table.

Regarding your STP suggestions, those have been enabled previously to try to combat this.

It's driving me to drink.

Mike_D
Extreme Employee

Hello Jason,

re: head scratch

My good idea meter is leaning towards 'E' as well.

When a root cause is hard to pin down, there is often more than one problem involved.
In your case that seems likely, given the different symptoms described: the first incident was related to STP, but this last one seems different. Confusing for sure.

From the bottom of the toolbox:
NLB (Microsoft Network Load Balancing) traffic - a unicast IP bound to a multicast MAC - often causes problems for switching and would tickle ipMap and bcmRX. The switch can be tuned to handle load balancing better, but the addresses must be known and configured.
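
If you want a quick way to spot that pattern, the giveaway is a unicast IP resolving to a MAC with the multicast bit set (the low bit of the first octet). A rough sketch against an exported ARP table - the one-entry-per-line, colon-separated MAC format here is just an assumption for illustration:

def is_multicast_mac(mac):
    """True if the I/G bit (low bit of the first octet) is set - the NLB multicast-MAC signature."""
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x01)

def suspect_arp_entries(lines):
    """Yield (ip, mac) pairs where the resolved MAC has the multicast bit set."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            ip, mac = parts[0], parts[1]
            if ":" in mac and is_multicast_mac(mac):
                yield ip, mac

# Example: NLB multicast mode typically uses MACs starting 03:bf
print(list(suspect_arp_entries(["10.10.5.20 03:bf:0a:0a:05:14"])))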

IP multicast (224.x.x.x - 239.x.x.x) is also forwarded/handled by ipMap. If multicast routing is configured, watch for fault conditions associated with it...

Since STP was involved in at least one instance, work toward forcing spanning tree out of the equation by setting SpanGuard to 'enabled' globally and setting admin edge to 'true' for all edge ports.

If root priority is not already forced, consider giving this switch the lowest bridge priority value (i.e., the highest priority) so it wins the STP root election. This will cause a service outage for a period, but it may be the single safest move for your network's stability.

Best regards,
Mike



Jason_Wisniewsk
New Contributor
The situation occurred again today, but this time there was no planned or structured network topology change. The CPU spiked and the network came down. Since I was available, I spent a few minutes troubleshooting while it was occurring. I had been syslogging at level 7 to a demo NetSight unit I installed, and it showed absolutely nothing occurring at the time beyond my login and the commands I ran. "Show spantree stats" showed no topo changes. I decided to sniff the uplink to the new switch, which I assumed was causing the problem, and saw nothing out of the ordinary - just a few packets per second, nothing on the level I would expect in this situation.

My next step was to get things working again to keep the company running, so I shut down a random non-edge port on my L3 and then re-enabled it. Immediately I saw a topo increment and the CPU went right back to normal. This was not a port I normally kick offline; I chose it because I wanted to see whether just any topology change would calm the situation, and it does.

It has me scratching my head.

Jason_Wisniewsk
New Contributor
Hi Mike-

Thank you for the great information, I really appreciate it.

I have attempted to re-create the problem a couple of times now and have had no success. This leads me to believe that there was a change somewhere on the network between my initial disasters and now. I looked through our Change Management and we have made no topo moves in the recent past outside of my work, which makes me think there is/was a rogue switch somewhere nearby that was plugged in at the time but is no longer. I traced back all of the neighbors, looked for multiple MACs, etc., and found nothing out of the ordinary. I even grabbed a demo of NetSight to do a visual deep dive and it found nothing odd.

Mike_D
Extreme Employee

Hello Jason,

I received a response regarding ipMapForwarding.
This process does indeed handle forwarding of IP packets at the CPU. In addition, the task handles installing the route into hardware, which is necessary to offload future traffic with the same signature (e.g., packets from the same UDP stream or TCP conversation).
Most days the described model results in a tiny fraction of your traffic hitting the CPU, while the large majority forwards in hardware, fully optimized.
One reason the impact is as severe as you've described during failure mode is the priority nature of this hardware handoff. The CPU tries to forward the packet and program the hardware/ASIC; next packet, same routine. The situation quickly spins up to critical levels if there are problems with the L2 or L3 topology: loops, FDB thrashing, packet reflection, and so on. Each of those items is a capable network killer on its own, but the negative impact is compounded when the system is unable to leverage the advantage gained by forwarding in hardware.
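
To make the model concrete, here is a toy illustration (my own sketch, not the actual switch code) of why a churning topology keeps everything on the CPU path:

# Toy model of the slow-path/fast-path split described above.
hardware_table = {}   # stands in for the ASIC's programmed routes

def forward(packet):
    flow = (packet["src"], packet["dst"], packet["proto"])
    if flow in hardware_table:
        return "hardware"          # forwarded by the ASIC; the CPU never sees it
    # Slow path: the CPU (ipMapForwardingTask) forwards the packet...
    # ...and programs the flow into hardware so the next one is offloaded.
    hardware_table[flow] = True
    return "cpu"

def topology_change():
    # A topology change invalidates the programmed entries;
    # constant churn keeps every packet on the CPU path.
    hardware_table.clear()

pkt = {"src": "10.10.5.20", "dst": "10.10.1.50", "proto": "tcp"}
print(forward(pkt))  # "cpu"      - first packet of the flow
print(forward(pkt))  # "hardware" - subsequent packets are offloaded
topology_change()
print(forward(pkt))  # "cpu"      - back to the slow path after the change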

I hope the clarification helps your cause.

Oh, and one more thing. As I mentioned earlier in the thread, I poked at our documentation and picked nearby brains before forwarding your question to a technical contact suggested by a local developer. The request was sent last Thursday.
Thing is - the answer was in my email before close of business the same day.
I missed his response until our follow-up exchange.

So another mystery out of the way. It was the support guy - in the switch room - with the brain cramp.

I hope the info helps manage those crazy minutes of service outage with more direction and less confusion, and I apologize if any opportunities were missed due to the mix-up.

Best regards,
Mike


