Extreme Networks

Chris1 · ‎06-20-2017

I believe my issue is related to xos0053644, but information on that specific issue seems to be limited to https://extremeportal.force.com/ExtrArticleDetail?an=000075074, with no mention in the release notes etc., unless I'm missing a way to search release notes/version history in bulk for bug numbers.

I'm also not 100% sure this is the issue. We do periodically have problems with OSPF at some sites, but I'm unsure which of my switches is the root cause.

Our layout is:
Remote Site Switch (ospf x450/x460-g2 etc) -> FIBER -> MLAG-Aggregation (l2-transit x670) -> FIBER -> MLAG-Core switches (ospf x670-g2)

The one page I found says "temporary/short outages" on OSPF, but honestly we've seen OSPF outages of many hours or days at some sites, and it doesn't happen at all the sites.

Should I just disable the to-cpu on all ports of our aggregation switches? Is there a drawback to doing that if those switches only carry QinQ and VLANs, with no L3 beyond the in-band management IP? Can I run the commands on a live network, or will they affect traffic flow on the transit switch?

I'm having trouble understanding the problem. There are 4 solutions listed, but no real explanation of how to figure out which switches are actually the issue, or what the drawback of each solution is.

None of the options seem to be meant for the actual OSPF switches (the ones with the IP interfaces), so is the issue only on switches that are layer 2 transit switches?

Chris1 · ‎06-20-2017

So on my 2 main aggregation switches I ran the clears you recommended, and I also ran the commands recommended on that Extreme page about multicast packet loss:

enable igmp snooping forward-mcrouter-only
configure forwarding ipmc local-network-range fast-path

I even followed your recommendation and fully disabled IGMP snooping, but was still stuck, with sites dropping in and out of idle.

I guess the next option, unless Extreme or you have another recommendation, is upgrading from these releases to 21.x/22.x, as I'm really starting to get the feeling that 15.6 was just a buggy branch, and my agg switches and core switches are running 15.6.2.12 (no patch).

EtherMAN · ‎06-20-2017

Chris, disabling IGMP snooping should not break any routing. All disabling does is stop the switch from pruning back any mcast traffic; it treats mcast like a broadcast packet, so it forwards all mcast, including router announcements, to all ports in the VLAN the routers are in...

If it makes you feel better you can do it during a window, but we have done it under an outage scenario with 89K MACs and 2K VLANs on those 8900s at peak without issues. You can also clear the tables I listed and see if that gives you some relief. If you have one broken and can do it one switch at a time until it starts working, you can maybe narrow down the culprit.

When all else fails and nothing we try restores the router adjacencies, we have had to delete the VLAN or VMAN and reprovision it to clean the hung table. We have never had to reboot to fix this.

Also, what cards are you running on the 8900s? XL cards and the MSM 128 need to match up. If you put one of the c cards in a chassis, the whole chassis will drop down to the lesser card. Same thing for the MSM. You cut your processing power in half by only running one card.

My problem has been that we don't have any visibility into or access to customers' routers, so when they report a problem I have to get them back up right away and have limited time to troubleshoot this kind of issue. We know there is an issue, but it is impossible to replicate on demand... We will go 2 or 3 months with no issues. With us it is always EIGRP because they are a Cisco shop... It has been one of those things we are all aware of, including the NOC, and know how to fix quickly when it is reported. It also always seems to be the smaller, less-used services, 1 to 5 Mbps, not any of the larger ones...

Disabling the L2-CPU has not seemed to make much difference with this problem. It may make your messages go away, but not passing mcast router announcements is something different, I believe.

Chris1 · ‎06-20-2017

I thought disabling IGMP completely would break VRRP and OSPF (the MLAG ISC isn't mcast). And yes, it's an mcast issue, because ping always works between the L3 interfaces of the backbones. Does disabling IGMP disrupt traffic when you do it on the layer 2 aggregation? Should I do it during a maintenance window? Also, do I need to reboot the switch to clear the corrupted hardware entry, or will disabling the L2-CPU for all ports be enough? The bug/issue is annoying, as it seems super random.

EtherMAN · ‎06-20-2017

Chris, we are a metro carrier running on a layer 2 core of 8900s, and this has been a sporadic issue for years. The best workaround we have found is to disable IGMP snooping on your layer-2-only switches as much as possible. If you are not running video that needs to be pruned, then just kill it 100%...

Fix/workaround to help you find which switch you may be having issues with (the theory is that an mcast entry gets corrupted in hardware and is then not forwarded)... You need to make sure that your two routers can at least ping their outer IP interfaces... If not, then you have other issues.

One switch at a time, when you have OSPF or router adjacency issues:
clear igmp snooping
clear fdb
clear ipmc fdb

Check to see if your routers regain their adjacencies after each switch you clear in the path, until you find the one that was at fault... GTAC will tell you which code you need to be running for the switch and setup you have. 15. had some issues for sure.
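
As a rough illustration of the one-switch-at-a-time approach, here is a minimal Python sketch. It is not an Extreme tool: `find_faulty_switch`, `clear_tables` and `adjacency_is_up` are hypothetical names, and how you actually run the clears or check the neighbor state (SSH, console, by hand) is left to the callbacks. The adjacency check should wait long enough for the routers to re-form before answering.

```python
from typing import Callable, Iterable, Optional


def find_faulty_switch(
    path: Iterable[str],
    clear_tables: Callable[[str], None],
    adjacency_is_up: Callable[[], bool],
) -> Optional[str]:
    """Clear one switch at a time; return the first switch whose clear
    brings the router adjacency back, or None if none of them does.

    clear_tables(switch) is assumed to run the clears on that switch.
    adjacency_is_up() is assumed to report whether the routers at both
    ends of the path have regained their adjacency.
    """
    if adjacency_is_up():
        return None  # nothing is broken right now, so nothing to isolate
    for switch in path:
        clear_tables(switch)
        if adjacency_is_up():
            return switch  # the last switch cleared is the likely culprit
    return None


if __name__ == "__main__":
    # Simulated run: pretend the hung entry lives on "agg-2" (made-up name).
    stuck_on = {"switch": "agg-2"}

    def fake_clear(switch: str) -> None:
        if stuck_on["switch"] == switch:
            stuck_on["switch"] = None

    def fake_check() -> bool:
        return stuck_on["switch"] is None

    print(find_faulty_switch(["agg-1", "agg-2", "core-1"], fake_clear, fake_check))
```

Stopping at the first switch that restores the adjacency matters: if you clear every switch in the path at once, the problem goes away but you never learn which one was at fault.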

We have found that this seems to be a very random thing, usually triggered after a topology event where you have an EAPS failover. It seems to happen when we have a port in the rings that flaps multiple times in a short time period.

Good luck. We have never found this to be a lack of resources; there are always open buckets in the memory for more entries. By default, unless you have an ACL in place to block 224.0.0.0/24, all modern layer 2 switches should always forward the mcast traffic from router mcast IPs, period. The CPU only moves it to hardware the first time... so good luck with your efforts... I will be tracking this one closely too for new info or ideas.
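
For reference, the router mcast groups discussed in this thread all sit inside 224.0.0.0/24. The small Python sketch below just checks that with the standard `ipaddress` module; the group addresses are the well-known ones for OSPF, EIGRP and VRRP, while the dictionary and function names are made up for illustration.

```python
import ipaddress

# The range mentioned above; traffic to these groups is link-local.
LOCAL_NETWORK_CONTROL = ipaddress.ip_network("224.0.0.0/24")

# Well-known router groups relevant to this thread.
ROUTER_GROUPS = {
    "OSPF AllSPFRouters": "224.0.0.5",
    "OSPF AllDRouters": "224.0.0.6",
    "EIGRP": "224.0.0.10",
    "VRRP": "224.0.0.18",
}


def in_local_control_block(addr: str) -> bool:
    """Return True if addr falls inside 224.0.0.0/24."""
    return ipaddress.ip_address(addr) in LOCAL_NETWORK_CONTROL


if __name__ == "__main__":
    for name, group in ROUTER_GROUPS.items():
        print(name, group, in_local_control_block(group))
```

An ACL or a corrupted entry that affects this range would hit OSPF, EIGRP and VRRP hellos alike while leaving unicast ping between the routers untouched.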

Chris1 · ‎06-20-2017

I will test it next time it goes down to find which switch is doing it. Is there a list of affected firmware versions? Is it actually a bug that's fixed in a future release? Or is it something we just have to apply workarounds for, such as the doc linked?

In addition, I just noticed we have a few sites that have one neighbor in FULL and one in EX_START, and on the core side it's the same: one FULL and one EX_START. I'm not sure if it's related or the same issue. In the EX_START case I checked, and I see 224.0.0.5 on the VLAN from what appears to be all switches.