Question

X440 90+% CPU usage

  • 5 September 2019
  • 13 replies
  • 220 views

Userlevel 6
We have multiple closets at an EDU customer's high school, many of which are seeing 90+% CPU spikes on a regular basis. I have read the articles about 8-20% CPU usage and the CLI-related CPU levels of 30+%.

The customer has been experiencing random network issues for the second day in a row, and it is hard to argue against the CPU usage being related. Yes, I understand switching should not be affected by CPU usage.

These are stacks of 3-6 X440-G2-48p.
They are running summitX-22.5.1.7-patch1-7.
ELRP is configured:
configure elrp-client periodic vlan NOLOOP ports all disable-port duration 600
configure elrp-client disable-port exclude ports 1:52
NOLOOP is on every port on every stack.
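For reference, the ELRP side can be sanity-checked with (going from memory, so verify the syntax on your release):
show elrp
show elrp disabled-ports
The first shows the ELRP client status per VLAN; the second should list any ports ELRP has actually taken down.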

13 replies

Userlevel 5
Hi David

What process is spiking?
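If you're not sure which process it is, I believe
show cpu-monitoring
will give you the per-process CPU utilization (going from memory, so double-check the syntax on 22.5).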

Thanks
Brad
Userlevel 6
Sorry, I meant to paste from the log. The process is hal:

09/05/2019 09:21:59.80 CPU utilization monitor: process hal consumes 92 % CPU
09/05/2019 09:21:54.81 CPU utilization monitor: process hal consumes 93 % CPU
09/05/2019 09:13:33.16 The maximum number of neighbors supported on port 4:35 was exceeded.
09/05/2019 09:07:19.83 CPU utilization monitor: process hal consumes 91 % CPU
09/05/2019 08:51:54.79 CPU utilization monitor: process hal consumes 92 % CPU
09/05/2019 08:36:54.79 CPU utilization monitor: process hal consumes 93 % CPU
09/05/2019 08:21:54.81 CPU utilization monitor: process hal consumes 92 % CPU
09/05/2019 08:06:59.82 CPU utilization monitor: process hal consumes 94 % CPU
09/05/2019 08:06:54.83 CPU utilization monitor: process hal consumes 91 % CPU
09/05/2019 08:00:57.26 The maximum number of neighbors supported on port 5:40 was exceeded.
09/05/2019 08:00:05.80 The maximum number of neighbors supported on port 4:35 was exceeded.
09/05/2019 07:51:59.83 CPU utilization monitor: process hal consumes 92 % CPU
09/05/2019 07:51:54.82 CPU utilization monitor: process hal consumes 91 % CPU
09/05/2019 07:45:04.26 The maximum number of neighbors supported on port 5:40 was exceeded.
09/05/2019 07:44:15.56 The maximum number of neighbors supported on port 4:35 was exceeded.
09/05/2019 07:21:59.79 CPU utilization monitor: process hal consumes 93 % CPU
09/05/2019 05:52:43.34 Setting hwclock time to system time, and broadcasting time
09/05/2019 01:21:27.34 Setting hwclock time to system time, and broadcasting time
09/04/2019 22:51:59.65 CPU utilization monitor: process hal consumes 91 % CPU
09/04/2019 21:06:59.62 CPU utilization monitor: process hal consumes 94 % CPU
09/04/2019 20:51:59.65 CPU utilization monitor: process hal consumes 94 % CPU
09/04/2019 20:50:11.40 Setting hwclock time to system time, and broadcasting time
09/04/2019 20:36:59.52 CPU utilization monitor: process hal consumes 95 % CPU
09/04/2019 20:21:59.54 CPU utilization monitor: process hal consumes 96 % CPU
09/04/2019 20:06:59.53 CPU utilization monitor: process hal consumes 95 % CPU
09/04/2019 19:51:59.54 CPU utilization monitor: process hal consumes 94 % CPU
Userlevel 5
If you use the following commands, do you see MACs moving back and forth?
enable log debug-mode
configure log filter defaultfilter add event fdb.macmove

We can also try "run script spath-stats.py" and see if anything jumps out there.
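Once you're done, remember to back it out again; mirroring the commands above, that should be something like
configure log filter defaultfilter delete event fdb.macmove
disable log debug-mode
(double-check the exact keywords, I'm typing those from memory).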
Userlevel 6
So that reveals guest wireless clients moving between local ports and the uplink port.

09/05/2019 09:46:18.57 MAC E0:89:7E:CE:40:64 on VLAN 913 moved from port 4:30 to port 1:52
09/05/2019 09:46:15.79 MAC 38:53:9C:A7:72:C3 on VLAN 913 moved from port 4:14 to port 4:22
09/05/2019 09:46:14.70 MAC E0:33:8E:BF:B5:F6 on VLAN 913 moved from port 4:30 to port 1:52
09/05/2019 09:46:14.62 MAC 38:53:9C:A7:72:C3 on VLAN 913 moved from port 4:24 to port 4:14
09/05/2019 09:46:10.99 MAC 38:53:9C:A7:72:C3 on VLAN 913 moved from port 1:52 to port 4:24
09/05/2019 09:46:07.58 MAC FC:18:3C:83:7F:95 on VLAN 913 moved from port 4:30 to port 1:52
09/05/2019 09:46:07.10 MAC FC:18:3C:76:ED:2C on VLAN 913 moved from port 4:28 to port 1:52
09/05/2019 09:46:05.40 MAC 38:53:9C:A7:72:C3 on VLAN 913 moved from port 4:12 to port 1:52
09/05/2019 09:46:02.82 MAC 38:53:9C:A7:72:C3 on VLAN 913 moved from port 4:30 to port 4:12
09/05/2019 09:45:36.17 MAC 70:3C:69:07:B1:4D on VLAN 913 moved from port 4:28 to port 4:30

Great command, by the way.
Userlevel 5
Some movement is normal as clients roam from AP to AP. But if they are going back and forth a lot, it can signify that the MU (mobile unit) is sitting between two APs and the signal strength is such that it connects to both in rapid succession.
Userlevel 5
It might be best to call GTAC to see if we can get a tcpdump from the CPU to see what's actually causing the load.
Userlevel 6
I found that the IPv4 MCast entries in the L3 Hash Table were running between 1750 and 1800 against the theoretical max of 2048.

I ran "configure forwarding ipmc lookup-key group-vlan" and the entries are down to 8-20. Since this change was made, the CPU utilization messages have stopped and the customer issues appear to have stopped.
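For anyone wanting to check their own stacks before making the change: I believe "show iproute reserved-entries statistics" breaks out the L3 hash table usage per slot, including the IPv4 multicast entries, so you can see how close you are to the 2048 limit. Verify the exact syntax on your code train.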

We will continue monitoring...
Userlevel 6
The CPU utilization messages have gone away, but the customer is still reporting intermittent issues with web traffic.
We are waiting to hear if they see anything on their web filter or firewall.
Userlevel 2
Hi!

configure forwarding ipmc lookup-key group-vlan used to be more of a pain on the non-G2 X440, but it can certainly affect the G2 too. It still has very small tables, even if they are orders of magnitude better than on the old X440... I'd start monitoring the switches with some kind of ping tracer to see where the problem starts. Do you have in-band management on the X440-G2s? If so, setting up smokeping (made by Tobias Oetiker, the same guy who made MRTG and RRDtool) on a Linux PC (or VM) could let you trace where and when the problems occur. You have a lot of work in front of you finding the cause, but if you're lucky, tools like this can help. A TAC case will certainly be helpful too, as suggested.

Simply pinging a lot of switches (if management is in-band) might be a good starting point, but smokeping is my favourite! Remember to ping with a payload of 1472 bytes, as that produces 1500-byte IP packets (1472 + 8 bytes ICMP header + 20 bytes IP header). If you use a smaller packet size, MTU problems may not be detected (not that this seems to be the case here, but...).
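On a Linux box that is something along the lines of
ping -M do -s 1472 <switch-ip>
where -s 1472 sets the ICMP payload size and -M do sets the don't-fragment bit, so MTU problems show up as errors instead of being silently fragmented. (<switch-ip> here is just a placeholder for your in-band management address.)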

Check optical levels if you have fiber links: "show ports transceiver information".

Check for RX errors (and collisions and TX errors while you're at it), mainly on uplinks. You should have a tool to monitor that along with utilization.
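From the CLI, "show ports rxerrors" and "show ports txerrors" should give you those counters per port for a quick look before a monitoring tool is in place (again, verify the syntax on your release).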

Check for congestion (tail drops, buffers overflowing): "show ports qosmonitor congestion".
You'll be surprised to see how soon some switches drop packets. I've seen heavy tail drops on an X460 (I think) at 22% utilization (as reported by the CLI's 2-second update).

/Fredrik
Userlevel 6
Thanks for the reply.

So far we have found an ATA spewing garbage all over the network. (It has been removed.)
We also found the customer configured their new WiFi to dump unauthenticated users onto the VLAN we are using for NoLoop. (Those users have been moved.)
We also discovered the new WiFi (Meraki) is bridging broadcast traffic between the staff and student VLANs. This was a big issue for wireless users: they would get DHCP responses from the servers on both VLANs and then have to pick one. The bridged traffic never made it to the wire, it was only in the air. The customer currently has a case open with Meraki to determine the cause and correct the VLAN bridging.

Some of my reading for this brought up shared packet buffers again. On the X440-G2 I believe each port is allowed 25% of the shared packet buffer by default.

What are everyone's thoughts and recommendations on adjusting this?
I keep contemplating moving it to 30 or 35% on the ports where we see congestion.
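For reference, the knob I have been looking at is, if I remember the syntax right:
configure ports <port_list> shared-packet-buffer <percent>
i.e. bumping just the congested user-facing ports rather than the whole stack. Please correct me if that's not the right command.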

Thanks,
Userlevel 2
Try the increased buffers out! Also, calculate whether an increase from 25 to 35% will actually accommodate another full-length packet or not. Not all packets are 1500/1518 bytes, of course, but it may be that the port buffer memory is always allocated at that size. The TAC will know. Port buffer memory is implemented differently on almost every model, so I can't remember how the X440-G2 is built, but it certainly doesn't have any deep buffer pools...
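Purely as an illustration with made-up numbers (the real X440-G2 figures will differ, so ask the TAC): if the default 25% share on a port happened to work out to 40 KB, going to 35% would add about 16 KB, which is room for roughly ten more 1518-byte frames; and if the memory really is allocated in full-frame-sized chunks, that is all you gain even when the actual packets are smaller.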

On congested links, adding another port or two in a LAG might do the trick, if possible of course. QoS might be a way to go as well.

/Fredrik
Userlevel 6
In this environment all the congestion we see is on user-facing ports. The uplinks are all running clean.
Userlevel 2
You don't have any congestion on the core or distribution ports facing the access layer?
