02-13-2020 08:07 AM
Hi. I'm preparing a pair of X670-G1 switches for MLAG in production and am testing some failure scenarios in the lab. Since the production switches work in a heavy L2 load environment, there is a real chance of a broadcast/multicast storm.
The X670 platform's control plane protection design looks very poor compared to similar Huawei or Juniper Trident+ devices.
I'm running the 16.1/16.2 train.
Digging into the XOS guide gives me the following:
Normally, the X670 will send the following packets to the CPU:
0 : Broadcast and IPv6 packets
1 : sFlow packets
2 : vMAC destined packets (VRRP MAC and ESRP MAC)
3 : L3 Miss packets (ARP request not resolved) or L2 Miss packets (Software MAC learning)
4 : Multicast traffic not hitting hardware ipmc table (224.0.0.0/4 normal IP multicast packets neither IGMP nor PIM)
5 : ARP reply packets or packets destined for switch itself
6 : IGMP or PIM packets
7 : Packets whose TOS field is "0xc0" and Ethertype is "0x0800", or STP, EAPS, EDP, OSPF packets
Mitigation:
As has been said, XOS does not have any centralised control plane policy like other vendors do. You cannot set the rate of ARP or any other protocol hitting the CPU.
1) Storm control on XOS 16.x is very coarse and can't precisely control the rate of BUM traffic; this is related to the 15.625 ms time slots. So if I have a rate of 300 pps on a port, setting a flood control rate of 10000 will still drop a small number of packets. This helps, but not much.
This behavior is fixed in the 22.x train, but since I'm on G1 devices I can't upgrade, so no luck here.
2) Yes, I know about dos-protect, but it will not help me in case of a storm. Too many source-destination pairs.
3) Policy. Yes, I can use options like deny-cpu matching broadcast and IPv6. But I have a complex scenario with an L3 ring, MPLS/VPLS, ERPS and OSPF running. Maintaining this with such a policy installed would be poor design.
After some testing, I have some questions:
1) Why does the switch send IPv6 Neighbor Advertisement, Neighbor Solicitation, etc. frames to the CPU when I don't have any IPv6 interface configured? Can I change this behavior?
2) Same for ARP broadcasts: I don't have any L3 interfaces in those VLANs.
Currently I have the following rates hitting the CPU on the production switches:
MC_PERQ_PKT(0).cpu0 : 35,948,519,436 +7,961,581 1,064/s
MC_PERQ_PKT(3).cpu0 : 1,844,892,096 +1,545,150 1/s
MC_PERQ_PKT(4).cpu0 : 28,464,148 +10,055
MC_PERQ_PKT(5).cpu0 : 3,008,692,593 +714,726 80/s
MC_PERQ_PKT(6).cpu0 : 191,770,235 +67,442 10/s
MC_PERQ_PKT(7).cpu0 : 2,289,556,745 +667,560 88/s
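To see at a glance which traffic class dominates, the counters above can be joined with the queue list quoted earlier. The following is only a rough sketch: it assumes the three columns are a running total, the increment since the last read, and a per-second rate, and it treats a missing rate column as 0. The names `CPU_QUEUES`, `parse_counters` and `SAMPLE` are made up for this example.

```python
import re

# Queue descriptions shortened from the list quoted earlier in this post.
CPU_QUEUES = {
    0: "broadcast and IPv6",
    1: "sFlow",
    2: "vMAC destined (VRRP/ESRP)",
    3: "L3 miss / L2 miss",
    4: "multicast not in hardware ipmc table",
    5: "ARP reply / destined for switch",
    6: "IGMP or PIM",
    7: "TOS 0xc0, STP, EAPS, EDP, OSPF",
}

# Assumed column layout: total, +delta, optional rate/s.
LINE_RE = re.compile(
    r"MC_PERQ_PKT\((\d+)\)\.cpu0\s*:\s*([\d,]+)\s+\+([\d,]+)(?:\s+([\d,]+)/s)?"
)

def to_int(text):
    """Convert a comma-grouped number such as '1,064' to an int."""
    return int(text.replace(",", ""))

def parse_counters(text):
    """Return {queue: (total, delta, rate)}; rate is 0 when the column is absent."""
    result = {}
    for m in LINE_RE.finditer(text):
        rate = to_int(m.group(4)) if m.group(4) else 0
        result[int(m.group(1))] = (to_int(m.group(2)), to_int(m.group(3)), rate)
    return result

SAMPLE = """\
MC_PERQ_PKT(0).cpu0 : 35,948,519,436 +7,961,581 1,064/s
MC_PERQ_PKT(3).cpu0 : 1,844,892,096 +1,545,150 1/s
MC_PERQ_PKT(4).cpu0 : 28,464,148 +10,055
MC_PERQ_PKT(5).cpu0 : 3,008,692,593 +714,726 80/s
MC_PERQ_PKT(6).cpu0 : 191,770,235 +67,442 10/s
MC_PERQ_PKT(7).cpu0 : 2,289,556,745 +667,560 88/s
"""

if __name__ == "__main__":
    counters = parse_counters(SAMPLE)
    total_rate = sum(rate for _, _, rate in counters.values())
    for queue, (_, _, rate) in sorted(counters.items()):
        share = 100 * rate / total_rate if total_rate else 0
        print(f"queue {queue} ({CPU_QUEUES[queue]}): {rate}/s, {share:.1f}%")
```

On this sample, queue 0 (broadcast and IPv6) accounts for most of the per-second rate, which matches the concern about broadcast load above.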
Will be glad to hear and advice about CPU protection on X670 platform.
03-05-2021 02:31 PM
I believe the token bucket algorithm was replaced here:
In ExtremeXOS 16.2.5-Patch1-22 (Apr 2020) storm-control fix was announced:
xos0063205 Even though the traffic rate is below the configured flood rate limit, traffic is dropped.
03-01-2021 06:32 PM
You should be aware that in some of the old platforms, rate limiting works very differently from what at least I expect and it can give surprising results. The switch tops up the token bucket and counts packets every 1/64000 seconds (which equals 15.625 microseconds). This is true for X440, X460 and also X670(-G1) as it’s the same generation platform.
“To process one packet, 64000 tokens are required.”
https://extremeportal.force.com/ExtrArticleDetail?an=000083176
The article gets very technical and complicated very quickly. Thus, you cannot say “I want to pass 200 pps”. You can only say “I want to top up the token bucket with, say, 200 tokens every 1/64000 seconds”. You do this by writing:
configure port 1 rate-limit flood broadcast 200
It will then take 64000/200 = 320 top-ups, i.e. 320/64000 seconds, to fill the token bucket enough to let one packet through, that is 5 ms. If a broadcast packet is received after 4.99 ms, it will be blocked.
It’s not entirely clear, but it seems the bucket is emptied every second. That means that if you have a rate of 200 configured and don’t receive any broadcasts for 998 ms, you will have enough tokens to pass 199 packets back-to-back (no pause between them). This, however, will not be true if the 199 packets happen to arrive shortly after the bucket is emptied. So regardless of what limit you set, you can end up with multiple packets being allowed sometimes and blocked at other times, depending on when in the second they happen to arrive.
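The timing dependence described above can be reproduced with a toy model. This is only a sketch of the behaviour as I understand it, not the switch's actual implementation: it assumes the bucket gains `rate` tokens every 1/64000 s, a packet costs 64000 tokens, the bucket is emptied exactly once per second, and there is no cap on the bucket size. The function name `simulate` and its parameters are made up for this example.

```python
TICKS_PER_MS = 64           # bucket is topped up every 1/64000 s
TOKENS_PER_PACKET = 64_000  # "To process one packet, 64000 tokens are required."

def simulate(rate, arrivals_ms, reset_ms=1000):
    """Return [(arrival_ms, passed)] for packets arriving at the given times.

    Assumption: the bucket is emptied every reset_ms and has no size cap.
    """
    results = []
    window = None  # index of the current reset period
    spent = 0      # tokens consumed since the last reset
    for t in sorted(arrivals_ms):
        if t // reset_ms != window:
            # A reset happened since the previous packet: start from empty.
            window = t // reset_ms
            spent = 0
        # Tokens added since the last reset.
        earned = rate * TICKS_PER_MS * (t % reset_ms)
        passed = earned - spent >= TOKENS_PER_PACKET
        if passed:
            spent += TOKENS_PER_PACKET
        results.append((t, passed))
    return results

if __name__ == "__main__":
    # The same 199-packet burst, late in one second vs. just after a reset.
    late = simulate(200, [998] * 199)
    early = simulate(200, [1010] * 199)
    print(sum(p for _, p in late), sum(p for _, p in early))  # prints "199 2"
```

With the same configured rate of 200, the burst arriving 998 ms after a reset passes in full, while the identical burst arriving 10 ms after a reset is almost entirely dropped.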
I know, it’s complicated...
If your X670 can use EXOS 21, the algorithms have changed fundamentally and rate limiting will work more like one would expect.
02-15-2021 12:42 PM
In ExtremeXOS 16.2.5-Patch1-22 (Apr 2020) storm-control fix was announced:
xos0063205 Even though the traffic rate is below the configured flood rate limit, traffic is dropped.
But there is also an issue with that fix:
Port config:
configure port 1 rate-limit flood broadcast 500 out-actions log disable-port
configure port 1 rate-limit flood multicast 500 out-actions log disable-port
configure port 1 rate-limit flood unknown-destmac 500 out-actions log
So if a host connected to port 1 floods unknown unicast, this traffic should just be rate limited to 500 pps.
Actually, "disable-port" is triggered instead of just "log".
Just tested and confirmed in the lab on 16.2.5.4 16.2.5.4-patch1-29.
02-13-2020 11:57 AM
I have just tested this in the lab. The "redirect-vlan" action has no effect; the packets reach the CPU anyway, in another queue.
The packets are matched and the counters are growing.
In the dump I can see broadcast ARP.
MC_PERQ_PKT(2).cpu0 : 4,511,330 +1,311,178 38,188/s
Tested with the “deny-cpu” action: packets are just flooded with no CPU hit.
A 15 Mpps storm just passes through, with around 15% CPU usage.
But, as I said, putting such a policy on all ports is poor design in a complex environment.
Any more thoughts?