BGP Process taking up all the CPU

Running several BlackDiamond 10Ks with ExtremeXOS version 15.1.4.3 patch1-4. The routers randomly become very slow running any BGP-related commands, with the dcbgp process at 50+% CPU for hours on end. There is no flapping of BGP neighbors, and the boxes still seem to route traffic OK, but we have had to reboot after seeing this go on for over 24 hours at times. We are currently receiving about 250,000 routes. Is this a known issue? Any advice or thoughts on what might be causing this?
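For reference, this is roughly how we are watching it (standard EXOS monitoring commands; exact output varies by platform and release):

  show cpu-monitoring
  show process bgp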
Cavan

Posted 4 years ago

Sumit Tokle, Alum

Please go ahead and open a case with TAC; this needs more troubleshooting.
PARTHIBAN CHINNAYA, Alum

See if it is possible to reduce the routing table, and terminate any BGP sessions you do not need. Also terminate processes that are not needed on the switch: if it is a core switch and netlogin, ISIS, and VRRP are not in use, terminate those processes (see the example below).
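For example, assuming those features really are unused on the box (syntax as I recall it from the EXOS CLI; double-check it on your release before terminating anything):

  terminate process netlogin graceful
  terminate process isis graceful
  terminate process vrrp graceful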
Dave E Martin

This problem has happened to us on Summit 480s and 460s. We have opened several TAC cases over it, and it has not been resolved.

We have found:

The more BGP routes you have, the more likely it is to occur.

The more BGP peers you have (such as if the switch is serving as a route reflector), the more likely it is to occur.

If a BGP peer goes up or down while the switch is still processing a previous peer going up or down, it is more likely to occur.

If a policy change occurs while BGP is still processing updates, it might occur.

The problem is hard to reproduce consistently.

It occurred to us once on a 460 with only two BGP peers and 2,500 or so routes. Usually it occurs on our 480s with several hundred thousand routes.

You might be able to fix it by disabling and then re-enabling one or some of your BGP peers (or peer groups, if you are using them) and then waiting about an hour (at least on a Summit 480 with full Internet routes). Alternatively, you can restart the bgp process (note the impact these actions may have on your network).
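Roughly, the commands involved are the following (the peer address is just a placeholder; double-check the exact syntax for your EXOS release):

  disable bgp neighbor 192.0.2.1
  enable bgp neighbor 192.0.2.1
  restart process bgp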

Essentially, you can run "show bgp route summary" (and/or "show bgp route ipv6 summary") and wait until the counters stabilize; at that point, CPU usage should drop to normal. If it doesn't, try again (stopping/starting BGP peers or BGP itself, or rebooting).
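In other words, something like this, repeated until the route counters stop changing between runs (again, verify the exact command names on your release):

  show bgp route summary
  show bgp route ipv6 summary
  show cpu-monitoring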

This has been a very frustrating problem, and I feel vindicated to hear that it is happening to someone else. It has happened throughout many 15.x versions.

Dave E Martin


I'll add that on a 480 with full Internet routes from multiple providers, changing a policy caused it to go to 50% CPU (presumably the 480 has two cores?) for about 20 to 40 minutes as it worked through applying the policy change and propagating it out to its peers. If the policy was updated again before this process was complete, it would end up "stuck" at 50% CPU until a reboot or a restart of BGP or its peers. This 480 was serving as a reflector for several dozen peers. Typically, when it got stuck, we could disable the reflector peer group, wait until CPU dropped to normal, then re-enable the peer group and verify after an hour or so that CPU had dropped back to normal.
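When it gets stuck after a policy change, the recovery sequence we use looks roughly like this (the peer-group name below is a placeholder for your reflector-client group; double-check the syntax for your EXOS release):

  disable bgp peer-group rr-clients
  (wait until CPU drops back to normal)
  enable bgp peer-group rr-clients
  (wait an hour or so, then verify)
  show bgp route summary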

Cavan

Thanks for that. Have you tried terminating unused processes like netlogin, ISIS, VRRP, etc., as the other user in this thread recommended?