
C5 high CPU utilization - ipMapForwardingTask


I have a C5 that is giving me some grief. It is the core L3 for a medium-sized network; there are two C5Gs and one C5K stacked.

Every so often, when adding new hardware to the network, the CPU on the device goes nuts, and the only resolution is essentially to disconnect trunk ports at random to reset STP.

Today we added a new HP stack to the mix to act as an L2 for a VM network. This all went fine: the uplinks are trunked on both sides and we have a good link. I plugged in a VM server without issue. I then plugged in a simple DHCP device (an APC PDU) and it completely brought down the network - the CPU went to 95% and pretty much all traffic stopped. The process breakdown is below:

Total CPU Utilization:
Switch   CPU    5 sec    1 min    5 min
---------------------------------------
   3      1      95%      96%      96%

Switch:3 CPU:1

TID        Name                   5Sec     1Min     5Min
---------------------------------------------------------
3eb5430    tNet0                  0.20%    0.17%    0.13%
3f53ea0    tXbdService            0.00%    0.08%    0.02%
4713b20    osapiTimer             2.20%    2.16%    2.13%
4a79ff0    bcmL2X.0               0.60%    0.53%    0.57%
4b26eb0    bcmCNTR.0              1.00%    0.94%    0.96%
4b9f490    bcmTX                  1.00%    1.01%    1.19%
53b9f40    bcmRX                 16.00%   15.57%   16.38%
54042f0    bcmATP-TX             25.60%   22.90%   23.34%
54097f0    bcmATP-RX              0.00%    0.08%    0.14%
59fb7f0    MAC Send Task          0.20%    0.20%    0.20%
5a0ccf0    MAC Age Task           0.20%    0.06%    0.05%
6e02f30    bcmLINK.0              0.40%    0.40%    0.40%
90e38d0    osapiMemMon            2.20%    2.47%    2.63%
91177f0    SysIdleTask            2.40%    1.64%    1.74%
920dce0    C5IntProc              0.00%    0.11%    0.07%
9dfe8b0    hapiRxTask             2.00%    1.81%    1.86%
9e33d40    tEmWeb                 0.40%    0.32%    0.18%
b61e280    EDB BXS Req            0.00%    4.58%    2.32%
b763a90    SNMPTask               0.00%    1.30%    0.68%
b7ab5d0    RMONTask               0.00%    0.31%    1.24%
e2f2e30    dot1s_timer_task       1.00%    1.00%    1.00%
106fa4a0   fftpTask               0.00%    0.04%    0.01%
10793cc0   ipMapForwardingTask   42.60%   39.87%   40.37%
10c3a880   ARP Timer              0.20%    0.03%    0.00%

And this is what we saw in the logs. There was a topo change, but it had happened almost 2 hours before.

<166>Mar 1 07:21:45 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2257 %% Setting Port(130) instance(4095) State: DISCARDING
<166>Mar 1 07:21:45 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2258 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 07:21:45 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2259 %% Setting Port(130) instance(0) State: DISCARDING
<166>Mar 1 07:21:45 10.10.1.1-3 DOT1S[238044000]: dot1s_sm.c(4253) 2260 %% Setting Port(130) Role: ROLE_DESIGNATED | STP Port(130) | Int Cost(2000) | Ext Cost(2000)
<166>Mar 1 07:21:47 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1360) 2261 %% Setting Port(130) instance(0) State: LEARNING
<166>Mar 1 07:21:47 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1424) 2262 %% Setting Port(130) instance(0) State: FORWARDING
<166>Mar 1 09:00:27 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2274 %% Setting Port(130) instance(0) State: DISCARDING
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2277 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2278 %% Setting Port(123) instance(0) State: DISCARDING
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2279 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2280 %% Setting Port(124) instance(0) State: DISCARDING
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1346) 2281 %% Setting Port(445) instance(4095) State: DISABLED
<166>Mar 1 09:03:31 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1346) 2282 %% Setting Port(446) instance(4095) State: DISABLED
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2283 %% Setting Port(123) instance(4095) State: DISCARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2284 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2285 %% Setting Port(123) instance(0) State: DISCARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_sm.c(4253) 2286 %% Setting Port(123) Role: ROLE_DESIGNATED | STP Port(123) | Int Cost(20000) | Ext Cost(20000)
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1360) 2287 %% Setting Port(123) instance(0) State: LEARNING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1424) 2288 %% Setting Port(123) instance(0) State: FORWARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1503) 2289 %% Setting Port(445) instance(4095) State: MANUAL_FORWARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1503) 2290 %% Setting Port(446) instance(4095) State: MANUAL_FORWARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2291 %% Setting Port(124) instance(4095) State: DISCARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2292 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2293 %% Setting Port(124) instance(0) State: DISCARDING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_sm.c(4253) 2294 %% Setting Port(124) Role: ROLE_DESIGNATED | STP Port(124) | Int Cost(20000) | Ext Cost(20000)
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1360) 2295 %% Setting Port(124) instance(0) State: LEARNING
<166>Mar 1 09:03:40 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1424) 2296 %% Setting Port(124) instance(0) State: FORWARDING
<166>Mar 1 09:04:52 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2297 %% Setting Port(130) instance(4095) State: DISCARDING
<166>Mar 1 09:04:52 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2298 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 09:04:52 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2299 %% Setting Port(130) instance(0) State: DISCARDING
<166>Mar 1 09:04:52 10.10.1.1-3 DOT1S[238044000]: dot1s_sm.c(4253) 2300 %% Setting Port(130) Role: ROLE_DESIGNATED | STP Port(130) | Int Cost(2000) | Ext Cost(2000)
<166>Mar 1 09:04:54 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1360) 2301 %% Setting Port(130) instance(0) State: LEARNING
<166>Mar 1 09:04:54 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1424) 2302 %% Setting Port(130) instance(0) State: FORWARDING

We had this happen in the past with a single junk Netgear switch that someone had plugged in at their desk. It turned out it would cause the same symptoms every day when the person booted it up, and there was no loop anywhere on the switch at the time.

Any ideas on where to begin?


Well, I lied - it was not the APC PDU. It is any random device plugged into the edge switch. Sometimes one causes an issue; sometimes the exact same device does not. All on the same VLANs.

Here is the log output at the time it happened:

<166>Mar 1 11:17:15 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2342 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 11:17:15 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2343 %% Setting Port(130) instance(0) State: DISCARDING
<164>Mar 1 11:17:37 10.10.1.1-3 USER_MGR[1]: 2344 %% User:admin(su) logged in from 10.10.10.16(telnet)
<166>Mar 1 11:17:50 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2345 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 11:17:50 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2346 %% Setting Port(123) instance(0) State: DISCARDING
<166>Mar 1 11:17:50 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2347 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 11:17:50 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2348 %% Setting Port(124) instance(0) State: DISCARDING
<166>Mar 1 11:17:50 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1346) 2349 %% Setting Port(445) instance(4095) State: DISABLED
<166>Mar 1 11:17:50 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1346) 2350 %% Setting Port(446) instance(4095) State: DISABLED
<166>Mar 1 11:17:58 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2351 %% Setting Port(123) instance(4095) State: DISCARDING
<166>Mar 1 11:17:58 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2352 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 11:17:58 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2353 %% Setting Port(123) instance(0) State: DISCARDING
<166>Mar 1 11:17:58 10.10.1.1-3 DOT1S[238044000]: dot1s_sm.c(4253) 2354 %% Setting Port(123) Role: ROLE_DESIGNATED | STP Port(123) | Int Cost(20000) | Ext Cost(20000)
<166>Mar 1 11:17:58 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1360) 2355 %% Setting Port(123) instance(0) State: LEARNING
<166>Mar 1 11:17:58 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1424) 2356 %% Setting Port(123) instance(0) State: FORWARDING
<166>Mar 1 11:17:58 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1503) 2357 %% Setting Port(445) instance(4095) State: MANUAL_FORWARDING
<166>Mar 1 11:17:58 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1503) 2358 %% Setting Port(446) instance(4095) State: MANUAL_FORWARDING
<166>Mar 1 11:17:59 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2359 %% Setting Port(124) instance(4095) State: DISCARDING
<166>Mar 1 11:17:59 10.10.1.1-3 DOT1S[238044000]: dot1s_txrx.c(485) 2360 %% dot1sMstpTx(): CIST Role Disabled
<166>Mar 1 11:17:59 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1485) 2361 %% Setting Port(124) instance(0) State: DISCARDING
<166>Mar 1 11:17:59 10.10.1.1-3 DOT1S[238044000]: dot1s_sm.c(4253) 2362 %% Setting Port(124) Role: ROLE_DESIGNATED | STP Port(124) | Int Cost(20000) | Ext Cost(20000)
<166>Mar 1 11:17:59 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1360) 2363 %% Setting Port(124) instance(0) State: LEARNING
<166>Mar 1 11:17:59 10.10.1.1-3 DOT1S[238044000]: dot1s_ih.c(1424) 2364 %% Setting Port(124) instance(0) State: FORWARDING
On the routed interfaces for those VLANs, set this:
  • no ip proxy-arp
  • no ip redirects
  • no ip icmp unreachable

I don't see that last one as being an available command. Redirects were already disabled; however, I had not set "no ip proxy-arp" on each interface. Doing that now.

As it turns out, "no ip proxy-arp" is the default setting, so that was already in effect.
Hello Jason,

The Netgear mention caught my attention. Though often not STP-enabled, these are infamous for allowing or introducing loops while suppressing or eating BPDUs. This may or may not be the same root cause as your current/recent condition, but the symptom (or at least the net effect) is critical. You may get lucky here on the Hub and run into a fellow community member with the same experience and same root cause. If not, you'll probably need to take action while the network is suffering real-time impact.

In this case I would open a case with GTAC for a more detailed fw/config/topology review. Check the latest release notes for fixes or known issues that may be related. Make an action plan in case no solution is readily available. A few favorites:

- Port ingress and egress counters; watch for high broadcast/multicast/flood counts.
show port counters ge.x.x

- Review RMON history
show rmon history ge.x.x

The nature of the L2 stats seen should give your analysis further direction.

- Review pause-packet counts with snapshots of show port flowcontrol.

- Continue to correlate times/events from the log file and show system utilization process.

- Continue to monitor spanning tree stats - I like show spantree debug.

- Take a trace at the new end station if possible (on the off chance it's not a network element).

- Take a trace from the switch side - a mirror is likely not necessary. With this sort of system-level problem I prefer to plug a Wireshark station into another port on the switch and manually configure port egress to match that of the new device, then listen for bcast/mcast/flood packets. Follow your nose from there: IPv6 ND, IPv4 unicast flooding, pipes clogged by looping bcast/mcast, etc.

I say a mirror is likely not needed - with the exception of MAC-level protocols. If STP is thrashing, you would probably benefit from a port mirror, as the spanning tree protocol packets might not otherwise make it beyond the ingress MAC:

set port mirroring create ge.1.1 ge.1.2

where ge.1.1 is the mirror-from port and ge.1.2 is the mirror-to port. To remove it afterwards:

clear port mirroring ge.1.1 ge.1.2

Note that pause frames won't be mirrored - these are consumed by the MAC, pre-mirror magic. Use counters to monitor pause/flow control.

If you don't get to the bottom of this right away, your best bet is to be prepared.

Hope that helps,
Mike
Thanks for the info, Mike.

I have a case open currently and am just waiting for some ideas from the engineer assigned. I did a bunch of testing today and I could not reproduce the issue, which concerns me a great deal. I walked through the network switch by switch and found no rogue hubs/switches. This leads me to believe that someone is plugging in a rogue device somewhere at certain times, which would explain the CPU spike when the client device was plugged into the switch - it really wasn't that device, just bad timing.

Can anyone confirm what ipMapForwardingTask is?

Obviously my testing when there is an issue is limited, because the entire network comes to a halt: all the VoIP phones reboot and crash, there is no internet, and the L3 becomes almost impossible to manage - the CLI takes seconds per keystroke. I would love for it to occur early in the morning, when I could spend more than two minutes furiously trying to get things back in order 😞.
Hello Jason,

Jeeze I was hoping someone in the community would tell us about ipMapForwarding.

Seems like a trivial request, I know - but a detailed list of the processes and sub-processes owned by a CLI-level entry like ipMapForwarding has always been out of reach for the legacy Enterasys fixed-port products.

When you asked about ipmap, I consulted several peers from support, but nobody had the answer. I checked with a development contact about the process. The question was forwarded to that guy's contact. No word back yet.
I wouldn't count on that data to solve this case, and I don't know how long it will take.
I can tell you I'll continue making a nuisance of myself tracking this down.

Meantime, here's my best educated guess:
We're probably looking at soft forwarding at the IP layer.

Traffic using an IP helper/relay agent to traverse routed networks is soft-forwarded.

But there would need to be a fault condition, typically external, sending far more UDP/broadcast traffic than normal to crush the box.

I believe directed broadcast (if present in the config) would be handled in the soft path as well.
The "soft path" here is traffic forwarded by software (the CPU) rather than by the optimized hardware-based data plane. It's a lot of work. With lots of traffic - or even a medium traffic load - soft forwarding can easily crush a switch.
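If you want to sanity-check those, a quick look at the routing config will show how much relay and directed-broadcast traffic could even be in play (interface command syntax varies by firmware, so confirm against the routing configuration guide):

show running-config

Look for ip helper-address entries, and check whether any routed interface has directed broadcast enabled; where it isn't needed, "no ip directed-broadcast" on that interface should take it out of the picture.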

I hate to send a best-guess answer - I'd rather wait and get you a real one - but since neither of us controls the crushed-network timeline, I'll risk it, with the warning that there's a chance I'm only adding to your confusion.

Best regards,
Mike
Hello Jason,

I received a response regarding ipMapForwarding.
This process does indeed handle forwarding of IP packets at the CPU. In addition, the task handles installing the route into hardware - necessary to offload future traffic with the same signature (e.g. packets from the same UDP stream or TCP conversation).
Most days this model results in a tiny fraction of your traffic hitting the CPU while the large majority forwards in hardware - optimized.
One reason the impact is as severe as you've described during failure mode is the priority nature of this hardware handoff. The CPU tries to forward the packet and program the hardware/ASIC; next packet, same routine. The situation quickly spins up to critical levels if there are problems with the L2 or L3 topology: loops, FDB thrashing, packet reflection, etc. Each of those is a capable network killer on its own, but the negative impact is compounded when the system is unable to leverage the advantage gained by forwarding in hardware.
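As a rough way to see this in action the next time it spikes, a few snapshots taken a minute or so apart should show whether the load is ingress-driven (substitute your own port strings for ge.x.x):

show system utilization process
show port counters ge.x.x
show spantree stats

If ipMapForwardingTask and bcmRX climb together while a particular port's broadcast/flood counters race upward, that port is where I would start the trace.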

I hope the clarification helps your cause.

Oh, and one more thing. As I mentioned earlier in the thread, I poked at our documentation and picked nearby brains before forwarding your question to a technical contact suggested by a local developer. The request was sent last Thursday.
Thing is - the answer was in my email before close of business the same day.
I missed his response until our follow-up exchange.

So another mystery out of the way: it was the support guy - in the switch room - with the brain cramp.

I hope the info helps manage those crazy minutes of service outage with more direction and less confusion, and I apologize if opportunities were missed due to the mix-up.

Best regards,
Mike
Hi Mike-

Thank you for the great information - I really appreciate it.

I have attempted to re-create the problem a couple of times now and have had no success. This leads me to believe that there was a change somewhere on the network between my initial disasters and now. I looked through our change management and we have made no topo moves in the recent past outside of my own work, which makes me think there is/was a rogue switch somewhere nearby that was plugged in at the time but is no longer. I traced back all of the neighbors, looked for multiple MACs, etc., and found nothing out of the ordinary. I even grabbed a demo of NetSight to do a visual deep dive and it found nothing odd.

The situation occurred again today, but this time there was no planned or structured network topology change. The CPU spiked and the network came down. Since I was available, I spent a few minutes troubleshooting while it was occurring. I had been syslogging at level 7 to a demo NetSight unit I installed, and it showed absolutely nothing happening at the time outside of my login and the commands I ran. "Show spantree stats" showed no topo changes. I decided to sniff the uplink to the new switch, which I assumed was causing the problem, and it showed nothing out of the ordinary - just a few packets per second, nothing on the level I would expect in this situation.

My next step was to get things working again to keep the company running, so I shut down a random non-edge port on my L3 and then re-enabled it. Immediately I saw a topo increment and the CPU went right back to normal. This was not a port I normally kick offline - I chose it because I wanted to see whether any topology change would calm the situation, and it does.

It has me scratching my head.
Hello Jason,

re: head scratch

My good-idea meter is leaning towards 'E' as well.

Often more than one problem is involved when root cause is hard to pin down.
In your case that seems likely, given the different symptoms described: the first was related to STP, but this last one seems different. Confusing for sure.

From the bottom of the toolbox:
NLB (Microsoft Network Load Balancing) traffic - unicast IP with a multicast MAC - often causes problems for switching and would tickle ipMap and bcmRX. The switch can be tuned to better handle load balancing, but the addresses must be known and configured.

IP Multicast (224.x.x.x-239.x.x.x) is also forwarded/handled by ipMap. If configured, watch for fault conditions associated with multicast routing...

Since STP was involved in at least one instance, work toward forcing spanning tree out of the equation by setting SpanGuard to 'enabled' globally and setting adminedge to 'true' on all edge ports.

If not already configured to force root priority, consider configuring this switch with the lowest priority value (highest priority) for STP root election. This will cause a service outage for a period, but it may be the single safest move for your network's stability.
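For reference, the relevant knobs look roughly like this - syntax from memory, and the port string and priority value are only placeholders, so double-check against the configuration guide for your firmware:

set spantree spanguard enable
set spantree adminedge ge.2.1-48 true
set spantree priority 0 0

where ge.2.1-48 would be your edge-port range and the trailing 0 is the SID.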

Best regards,
Mike
In the spirit of causing me more premature grey hair, it just happened again, but this time it was so bad I had to console in just to get to the device. It did give me a chance to test my fix theory: disconnecting a third, completely different uplink to a switch caused the processor to go right back to normal.

We do have an NLB device, but its nodes have been staying alive during power outages. I also have the MAC/IP listed statically in the ARP table.

Regarding your STP suggestions, those have been enabled previously to try and combat this.

It's driving me to drink.
Sounds like you might want to break out Wireshark and do some captures. Are you using any default routes? Also, what is upstream from the C?
The addition of an ARP entry will allow routing of the NLB traffic, but to protect the L2 switch and maximize throughput there's a setting called unicast-as-multicast, in which you name the MAC address and the egress VLAN and ports. This statically builds a hardware path where normally the multicast MAC hits the flood path, since the address can't be learned in the FDB.

set mac unicast-as-multicast enable

set mac multicast 03-bf-xx-xx-xx-xx ge.1.1-10

Hang in there. Keep brainstorming.

I agree with all of Jeremy's input. Wireshark should be attached to the network and ready for action.

Easy to say from my position, but troubleshoots always look boogery and bizarre while you're in the middle of them. But it's always something. You'll eventually find that something, and then it will make perfect sense - and generally you'll have learned something by then.
What are your thoughts on how to prepare for a packet sniff? Build a port that has every VLAN tagged to it and throw the sniffer on that one? Building it one VLAN at a time will take a long time, as I have approximately 30 in this location.
Jason

The set vlan port command will allow you to add egress for each VLAN to a particular port - tagged or untagged.

(vs. the set port vlan command, which sets PVID and egress)

I wouldn't bother with tagging - the traffic source tends to be easy to identify, and tags can confuse some laptops. But I guess that's a chocolate-or-vanilla call.

That's how I'd do it to start with. Traffic doing this sort of damage in many cases floods a VLAN, and this approach throws a wide net. You should see little traffic - mostly ARP - in a healthy environment.

Regards

Mike
I have the L3 configured as a central hub with spokes going out to each of my IDFs, plus a few VLANs for physical vs. virtual, etc., so a packet sniffer on one of my VLANs will only capture those packets, unless we are seeing spillover into multiple VLANs. My gut feeling is that we aren't, though.

'set vlan port' isn't a command I recognize on the B/C, but I am working from memory currently. I did mean to set up a port with every VLAN egressed to it, which would then be sniffed via Wireshark.
Hi Jason,
You're right - the command for egress is set vlan egress (with tag options).
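For example, to stand up a sniffer port - the VLAN list and port string here are only placeholders, so adjust them to your own and double-check the syntax in the configuration guide:

set vlan egress 10,20,30 ge.3.47 untagged

then hang the Wireshark station off ge.3.47 and watch what floods in.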

As far as the trace goes, I definitely think you should follow your gut here. If you don't think the IDFs are introducing reflection or flood behavior, I'm in no position to fault your reasoning.
As net admin, your time should be used to its best purpose; right now that's the shortest path to discovering root cause and a solution, and intuition is often a big part of that.

Looks like my input on the thread has reached the point of diminishing returns.
Worst case, my posts may keep others from adding fresh perspectives - so I'll pipe down.
You will get to the bottom of this if you keep at it - sooner rather than later, I predict. Good luck,

Mike
Try this: from a computer on your LAN, run a traceroute to an RFC 1918 (private) IP address that doesn't exist.

I know it sounds weird, but just give it a try and let me know the results. I am just going on a hunch here.

Edit: When I say "an IP that doesn't exist", I mean pick an IP in a subnet that you aren't using and shouldn't technically have an L3 interface for.
Hi Mike-

There is no such thing as diminishing returns when it comes to problem solving - any input is MUCH appreciated. I hope this discussion gets indexed and assists others in the future. Problems that appear impossible to solve are no fun.
Hi Jeremy-

Here are the outputs. Note that 10.10.3.100 is my firewall cluster and 10.10.11.1 is the L3 with the CPU issue. I ran a few just for a complete overview. None of these target IPs fall anywhere in my subnets.

C:\Users\jwisniewski>tracert -d 10.3.1.1
Tracing route to 10.3.1.1 over a maximum of 30 hops

1 1 ms 1 ms 2 ms 10.10.11.1
2 <1 ms <1 ms <1 ms 10.10.3.100
3 1 ms 1 ms 2 ms 74.126.23.89
4 2 ms 2 ms 2 ms 216.234.118.33
5 2 ms 2 ms 2 ms 216.234.96.1
6 ^C
C:\Users\jwisniewski>ping 10.244.244.244

Pinging 10.244.244.244 with 32 bytes of data:
Control-C
^C

C:\Users\jwisniewski>tracert -d 10.244.244.244

Tracing route to 10.244.244.244 over a maximum of 30 hops

1 2 ms 1 ms 1 ms 10.10.11.1
2 1 ms <1 ms <1 ms 74.126.4.9
3 2 ms 1 ms * 74.126.23.89
4 3 ms 3 ms 3 ms 216.234.118.33
5 2 ms 2 ms 2 ms 216.234.96.1
6 * ^C
C:\Users\jwisniewski>tracert -d 10.230.24.32

Tracing route to 10.230.24.32 over a maximum of 30 hops

1 2 ms 1 ms 1 ms 10.10.11.1
2 <1 ms <1 ms <1 ms 10.10.3.100
3 1 ms 2 ms 1 ms 74.126.23.89
4 2 ms 2 ms 2 ms 216.234.118.33
5 3 ms 2 ms 2 ms 216.234.96.1
6 ^C
C:\Users\jwisniewski>

Also, some further info: we had another power outage that brought down the building after hours. This did not cause a problem at all, which has been confirmed by my SNMP monitoring tool. That leads me to two potential thoughts:

1. The device isn't online after hours.
2. The network is so calm after hours that whatever traffic tries to pass manages to do so without issue.
Hmm, I was thinking it could be a routing loop. This is just what I do, but for any destinations I don't have routes for, I create a generic black-hole route.

Also, you can use 'set dos-control ...' and maybe get some idea of what is going on - surely the C will identify this as a DoS-style event. Usually it logs the events.
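To illustrate the black-hole idea (Cisco-style syntax purely as an example - I'm not certain the C's routing CLI accepts a null next hop, so check the routing configuration guide, and the prefix is just a placeholder):

ip route 10.0.0.0 255.0.0.0 Null0

That way traffic for unused private space dies at the L3 instead of following the default route out to the firewall and beyond, which is what your traceroutes show happening now.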
Sadly, no set dos-control here. Extreme GTAC did suggest using flow control (currently disabled), but according to the fixed switching config guide this only works in a non-auto/auto world.

I do have Wireshark running at this point. Now it is a waiting game.
Ever solve the issue?
In fact, no. It hadn't happened again until just a few weeks ago, for seemingly no reason at all. Same symptoms and same fix as before.
