Extreme Networks

Jason_Wisniewsk · ‎03-01-2016

I have a C5 that is giving me some grief. It is the core L3 for a medium-sized network. There are 2 C5Gs and 1 C5K stacked.

Every so often when adding new hardware to the network the CPU goes nuts on the device and the only resolution is to randomly disconnect trunk ports to reset STP, essentially.

Today we added a new HP stack to the mix to act as an L2 for a VM network. This all went fine. The uplinks are trunked on both sides and we have a good link. I plugged in a VM server without issue. I then plugged in a simple DHCP device (APC PDU) and it completely brought down the network. CPU went to 95% and brought down pretty much all traffic. The process breakdown is below:

Total CPU Utilization:
Switch   CPU   5 sec   1 min   5 min
-------------------------------------------------
3        1     95%     96%     96%

Switch:3 CPU:1

TID Name 5Sec 1Min 5Min
----------------------------------------------------------
3eb5430 tNet0 0.20% 0.17% 0.13%
3f53ea0 tXbdService 0.00% 0.08% 0.02%
4713b20 osapiTimer 2.20% 2.16% 2.13%
4a79ff0 bcmL2X.0 0.60% 0.53% 0.57%
4b26eb0 bcmCNTR.0 1.00% 0.94% 0.96%
4b9f490 bcmTX 1.00% 1.01% 1.19%
53b9f40 bcmRX 16.00% 15.57% 16.38%
54042f0 bcmATP-TX 25.60% 22.90% 23.34%
54097f0 bcmATP-RX 0.00% 0.08% 0.14%
59fb7f0 MAC Send Task 0.20% 0.20% 0.20%
5a0ccf0 MAC Age Task 0.20% 0.06% 0.05%
6e02f30 bcmLINK.0 0.40% 0.40% 0.40%
90e38d0 osapiMemMon 2.20% 2.47% 2.63%
91177f0 SysIdleTask 2.40% 1.64% 1.74%
920dce0 C5IntProc 0.00% 0.11% 0.07%
9dfe8b0 hapiRxTask 2.00% 1.81% 1.86%
9e33d40 tEmWeb 0.40% 0.32% 0.18%
b61e280 EDB BXS Req 0.00% 4.58% 2.32%
b763a90 SNMPTask 0.00% 1.30% 0.68%
b7ab5d0 RMONTask 0.00% 0.31% 1.24%
e2f2e30 dot1s_timer_task 1.00% 1.00% 1.00%
106fa4a0 fftpTask 0.00% 0.04% 0.01%
10793cc0 ipMapForwardingTask 42.60% 39.87% 40.37%
10c3a880 ARP Timer 0.20% 0.03% 0.00%
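For anyone triaging a dump like the one above, a minimal Python sketch for ranking the tasks by their 5-second figure. It assumes the table has been pasted into a string with one task per line in the same column order as shown; the function name and the `SAMPLE` subset are illustrative only, not part of any switch tooling.

```python
import re

# Matches one row of the process table: hex TID, task name, three percentages.
# Header, separator and the "Total CPU" rows do not match and are skipped.
ROW = re.compile(r"^([0-9a-f]+)\s+(.+?)\s+(\d+\.\d+)%\s+(\d+\.\d+)%\s+(\d+\.\d+)%$")


def top_tasks(text, n=3):
    """Return the n busiest tasks as (name, five_sec, one_min, five_min)."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append(
                (m.group(2), float(m.group(3)), float(m.group(4)), float(m.group(5)))
            )
    # Sort on the 5-second column, busiest first.
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]


# A few rows copied from the output above (not the full table).
SAMPLE = """\
53b9f40 bcmRX 16.00% 15.57% 16.38%
54042f0 bcmATP-TX 25.60% 22.90% 23.34%
59fb7f0 MAC Send Task 0.20% 0.20% 0.20%
91177f0 SysIdleTask 2.40% 1.64% 1.74%
10793cc0 ipMapForwardingTask 42.60% 39.87% 40.37%
"""

if __name__ == "__main__":
    for name, s5, m1, m5 in top_tasks(SAMPLE):
        print(f"{name:<22} {s5:6.2f}% {m1:6.2f}% {m5:6.2f}%")
```

On this capture the ranking puts ipMapForwardingTask first, followed by bcmATP-TX and bcmRX, which together account for most of the load.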

And this is what we saw in the logs. There was a topo change, but it had happened almost 2 hours before.

<166>Mar 1 07:21:45 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2257 %% Setting Port(130) instance(4095) State: DISCARDING
<166>Mar 1 07:21:45 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2258 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 07:21:45 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2259 %% Setting Port(130) instance(0) State: DISCARDING
<166>Mar 1 07:21:45 10.10.1.1-3 DOT1S[238044000]: dot1s_sm.c(4253) 2260 %% Setting Port(130) Role: ROLE_DESIGNATED | STP Port(130) | Int Cost(2000) | Ext Cost(2000)
<166>Mar 1 07:21:47 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1360) 2261 %% Setting Port(130) instance(0) State: LEARNING
<166>Mar 1 07:21:47 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1424) 2262 %% Setting Port(130) instance(0) State: FORWARDING
<166>Mar 1 09:00:27 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2274 %% Setting Port(130) instance(0) State: DISCARDING
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2277 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2278 %% Setting Port(123) instance(0) State: DISCARDING
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2279 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2280 %% Setting Port(124) instance(0) State: DISCARDING
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1346) 2281 %% Setting Port(445) instance(4095) State: DISABLED
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1346) 2282 %% Setting Port(446) instance(4095) State: DISABLED
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2283 %% Setting Port(123) instance(4095) State: DISCARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2284 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2285 %% Setting Port(123) instance(0) State: DISCARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_sm.c(4253) 2286 %% Setting Port(123) Role: ROLE_DESIGNATED | STP Port(123) | Int Cost(20000) | Ext Cost(20000)
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1360) 2287 %% Setting Port(123) instance(0) State: LEARNING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1424) 2288 %% Setting Port(123) instance(0) State: FORWARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1503) 2289 %% Setting Port(445) instance(4095) State: MANUAL_FORWARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1503) 2290 %% Setting Port(446) instance(4095) State: MANUAL_FORWARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2291 %% Setting Port(124) instance(4095) State: DISCARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2292 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2293 %% Setting Port(124) instance(0) State: DISCARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_sm.c(4253) 2294 %% Setting Port(124) Role: ROLE_DESIGNATED | STP Port(124) | Int Cost(20000) | Ext Cost(20000)
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1360) 2295 %% Setting Port(124) instance(0) State: LEARNING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1424) 2296 %% Setting Port(124) instance(0) State: FORWARDING
<166>Mar 1 09:04:52 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2297 %% Setting Port(130) instance(4095) State: DISCARDING
<166>Mar 1 09:04:52 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2298 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 09:04:52 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2299 %% Setting Port(130) instance(0) State: DISCARDING
<166>Mar 1 09:04:52 10.10.1.1-3 DOT1S[238044000]: dot1s_sm.c(4253) 2300 %% Setting Port(130) Role: ROLE_DESIGNATED | STP Port(130) | Int Cost(2000) | Ext Cost(2000)
<166>Mar 1 09:04:54 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1360) 2301 %% Setting Port(130) instance(0) State: LEARNING
<166>Mar 1 09:04:54 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1424) 2302 %% Setting Port(130) instance(0) State: FORWARDING
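To line up which ports flapped and when, the DOT1S entries can be grouped per port. The following is a rough Python sketch, assuming the syslog text is available as a string in the format shown above; the regular expression and the function name are my own and only cover the "Setting Port(...) instance(...) State:" lines.

```python
import re
from collections import defaultdict

# Pulls the timestamp, port number, instance and new state out of a DOT1S line.
# Role lines and dot1sMstpTx() lines are ignored.
STATE = re.compile(
    r"(\w{3}\s+\d+\s+\d\d:\d\d:\d\d).*?"
    r"Setting Port\((\d+)\) instance\((\d+)\) State: (\w+)"
)


def port_timeline(log_text):
    """Map port -> list of (timestamp, instance, state) in log order."""
    timeline = defaultdict(list)
    for m in STATE.finditer(log_text):
        ts, port, inst, state = m.groups()
        timeline[int(port)].append((ts, int(inst), state))
    return dict(timeline)


# Four entries copied from the log above.
SAMPLE = """\
<166>Mar 1 09:00:27 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2274 %% Setting Port(130) instance(0) State: DISCARDING
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2278 %% Setting Port(123) instance(0) State: DISCARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1424) 2288 %% Setting Port(123) instance(0) State: FORWARDING
<166>Mar 1 09:04:54 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1424) 2302 %% Setting Port(130) instance(0) State: FORWARDING
"""

if __name__ == "__main__":
    for port, events in sorted(port_timeline(SAMPLE).items()):
        print(port, events)
```

Read this way, the capture shows port 130 going to DISCARDING at 09:00:27 and not returning to FORWARDING until 09:04:54, with ports 123 and 124 cycling in between.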

We had this happen in the past with a single Netgear junk switch that someone had plugged in at their desk. It turned out that it would cause the same symptoms every day when the person would boot it up. There was no loop anywhere on the switch at that time.

Any ideas on where to begin?

Mike_D · ‎03-09-2016

Hello Jason,

Jeez, I was hoping someone in the community would tell us about ipMapForwarding.

Seems like a trivial request, I know, but a detailed list of the processes and sub-processes owned by a CLI-level entry like ipMapForwarding has always been out of reach for the legacy Enterasys fixed-port products.

When you asked about ipmap, I consulted several peers in support, but nobody had the answer. I checked with a development contact about the process, and the question was forwarded to that guy's contact. No word back yet.
I wouldn't count on that data to solve this case; I don't know how long it will take.
I can tell you I'll continue making a nuisance of myself tracking this down.

In the meantime, here's my best educated guess:
We're probably looking at soft forwarding at the IP layer.

Traffic using IP helper/relay agent to traverse routed networks is soft forwarded.

But there would need to be a fault condition, typically external, sending far more UDP/broadcast traffic than normal to crush the box.

I believe directed broadcast (if present in the config) would be handled in the soft path as well.
The soft path refers to traffic forwarded by software (the CPU) rather than by the optimized hardware-based data plane path. It's a lot of work. Under a heavy traffic load, or even a medium one, soft forwarding can easily crush a switch.

I hate to send a best-guess answer. I'd rather wait and get you a real answer, but since neither of us controls the crushed-network timeline, I'll send it with the warning that there's a chance I'm adding to your confusion.

Best regards,
Mike

Jason_Wisniewsk · ‎03-03-2016

Thanks for the reply, Mike.

I have a case open currently and am just waiting for some ideas from the engineer assigned. I did a bunch of testing today and could not reproduce the issue, which concerns me a great deal. I did walk through the network switch by switch and found no rogue hubs/switches. This leads me to believe that someone is plugging in a rogue device somewhere at certain times, which would explain the CPU spike when the client device was plugged into the switch: it really wasn't that device, just bad timing.

Can anyone confirm what ipMapForwardingTask is?

Obviously my testing when there is an issue is usually limited because the entire network comes to a halt: all the VoIP phones reboot and crash, there is no internet, and the L3 becomes almost impossible to manage. The CLI takes seconds per keystroke. I would love for it to occur early in the morning, when I can spend more than 2 minutes furiously trying to get things back in order.

Mike_D · ‎03-02-2016

Hello Jason,

The Netgear mention caught my attention. Though often not STP-enabled, these are infamous for allowing or introducing loops while suppressing or eating BPDUs. This may or may not be the same root cause as your current/recent condition, but the symptom (at least the net effect) is critical. You may get lucky here on the HUB and run into a fellow community member with the same experience and same root cause. If not, you'll probably need to take action while the network is suffering real-time impact.

In this case I would open a case with GTAC for a more detailed fw/config/topology review. Check the latest release notes for fixes or known issues that may be related. Make an action plan in case no solution is readily available. A few favorites:

- Port ingress and egress counters. Watch for high broadcast/multicast/flood counts.
show port counters ge.x.x

- Review RMON history
show rmon history ge.x.x

The nature of the L2 stats seen should give your analysis further direction.

- review pause-packet counts with snapshots of show port flowcontrol

- continue to correlate times/events from the log file and show system utilization process

- continue to monitor spanning tree stats - I like show spantree debug

- Take a trace at the new end station if possible (on the off-chance it's not a network element).

- Take a trace from the switch side - a mirror is not likely necessary. With this sort of system-level problem I prefer to plug a Wireshark station into another port on the switch and manually configure port egress to match that of the new device, then listen for bcast/mcast/flood packets. Follow your nose from here: IPv6 ND, IPv4 unicast flood, pipes clogged by looping bcast/mcast, etc.

I say a mirror is likely not needed, with the exception of MAC-level protocols. If STP is thrashing you would probably benefit from a port mirror, as the spanning tree protocol packets might not otherwise make it beyond the ingress MAC.
set port mirroring create ge.1.1 ge.1.2
where ge.1.1 is mirror-from and ge.1.2 is mirror-to. To remove the mirror afterwards:
clear port mirroring ge.1.1 ge.1.2

Note that pause frames won't be mirrored. These are consumed by the MAC - pre-mirror magic. Use counters to monitor pause/flow control.
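Since most of the checks above come down to comparing counter snapshots over time, here is a minimal Python sketch of that comparison. It does not parse any switch output: the port names, counter values and threshold are made up for illustration, and the two dictionaries are assumed to have been filled in by hand (or by your own script) from two readings of the same cumulative counter taken a known number of seconds apart.

```python
def counter_rates(before, after, interval_s):
    """Per-port packets/second between two snapshots of a cumulative counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rates = {}
    for port, new in after.items():
        old = before.get(port)
        if old is None or new < old:
            # Port missing from the first snapshot, or the counter was
            # cleared or wrapped in between: no meaningful rate.
            continue
        rates[port] = (new - old) / interval_s
    return rates


def noisy_ports(rates, threshold_pps):
    """Ports whose rate exceeds the threshold, busiest first."""
    hot = [p for p, r in rates.items() if r > threshold_pps]
    return sorted(hot, key=lambda p: rates[p], reverse=True)


if __name__ == "__main__":
    # Hypothetical broadcast counters read 60 seconds apart.
    before = {"ge.1.1": 1000, "ge.1.2": 5000, "ge.3.10": 200}
    after = {"ge.1.1": 1600, "ge.1.2": 605000, "ge.3.10": 100}
    rates = counter_rates(before, after, 60)
    print(rates)
    print(noisy_ports(rates, threshold_pps=1000))
```

The same delta approach works for pause-frame counts; what counts as "too high" depends on the network, so the threshold here is only a placeholder.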

If you don't get to the bottom of this, your best bet is to prepare.

Hope that helps,
Mike

Jason_Wisniewsk · ‎03-02-2016

"no ip proxy-arp" is a default setting, as it turns out, so that was already in effect.

Jason_Wisniewsk · ‎03-01-2016

I don't see that last one as being an available command. Redirects were already disabled; however, I had not set "no ip proxy-arp" on each interface. Doing that now.

C5 high CPU utilization - ipMapForwardingTask