Multicast packet loss

  • Problem
  • Updated 1 year ago
  • Not a Problem
I believe my issue is related to xos0053644, but information on that specific issue seems to be limited to https://gtacknowledge.extremenetworks.com/articles/Solution/Local-multicast-224-0-0-x-packet-loss and there is no mention of it in the release notes. Unless I'm missing a way to search release notes or version history in bulk for bug numbers?

I'm also not 100% sure this is the issue. We periodically have OSPF problems at some sites, but I'm unsure which of my switches is the root cause.

Our layout is:
Remote site switch (OSPF, x450/x460-g2 etc.) -> FIBER -> MLAG aggregation (L2 transit, x670) -> FIBER -> MLAG core switches (OSPF, x670-g2)

The one page I found says "temporary/short outages" for OSPF, but honestly we've seen OSPF outages of many hours or days at some sites, and it doesn't happen at all the sites.

Should I just disable to-cpu on all ports of our aggregation switches? Is there a drawback to doing that if those switches only do QinQ and VLANs, with no L3 beyond the in-band management IP? Can I run the commands on a live network, or will it affect traffic flow on the transit switch?

I'm having trouble understanding the problem. There are 4 solutions listed, but no real explanation of how to figure out which switches are actually the issue, or what the drawback of each solution is.

None of the options seem to be meant for the actual OSPF switches (the ones with the IP interfaces), so is the issue only on switches that are layer 2 transit switches?

Chris


Posted 1 year ago


Balaji, Employee

Chris, 

You can configure to-cpu off on a switch that is purely L2 and is not going to participate in any Layer 3 protocols.

We need a better understanding of the behavior before we can suggest a solution.

Could you explain in detail what issue you are observing with OSPF?

Thanks

Chris

As an example, right now one of our sites shows one neighbor in FULL, while the other neighbor doesn't show up at all.

The two core switches are part of the same area and run VRRP and OSPF on the VLAN.

One of them shows FULL/FULL for both neighbors; the other core switch shows INIT/FULL.

So the core switches seem to be talking to each other fine, but randomly one of them refuses to bring up OSPF and is stuck in INIT towards the remote site.
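
For reference, here is a minimal sketch of the rule behind the INIT state as described in the OSPF specification (RFC 2328): a router keeps a neighbor in Init when it has received a hello from that neighbor but does not find its own router ID in that hello's neighbor list. The function name and router IDs are hypothetical and only illustrate the rule; this is not how any particular switch implements it.

```python
def state_after_hello(my_router_id, neighbors_listed_in_hello):
    """Return the neighbor state implied by one received OSPF hello.

    Simplified: only distinguishes one-way reception (Init) from
    two-way reception (2-Way or a later state).
    """
    if my_router_id in neighbors_listed_in_hello:
        return "2-Way or higher"
    return "Init"


# Hypothetical router IDs: this core switch is 10.0.0.2.
# The remote site's hello lists only the other core switch, which
# suggests our own hellos are not arriving at the remote site.
print(state_after_hello("10.0.0.2", ["10.0.0.1"]))              # Init
print(state_after_hello("10.0.0.2", ["10.0.0.1", "10.0.0.2"]))  # 2-Way or higher
```

In other words, INIT on one side points at hellos being lost in one direction only, which fits a multicast forwarding problem somewhere in the L2 path.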

 

Balaji, Employee

The hellos from the router that is in the Init state are not reaching its neighbor, but it does see the hellos from its neighbor. You need to trace the path from the Init router to its neighbor, run "show igmp snooping vlan vlan-name" on each switch, and look for the 224.0.0.5 entry in the senders list to find where in the path the entry from one of the neighbors is missing.
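
The hop-by-hop check can be sketched roughly as follows. This is only an illustration: it assumes you have already copied the sender IPs for the 224.0.0.5 group off each switch by hand (it does not parse any CLI output), and the switch names and IP addresses are hypothetical. Exactly what each switch lists as senders is something to confirm on your own equipment.

```python
OSPF_ALL_ROUTERS = "224.0.0.5"  # group to look up in the senders list on each switch


def first_hop_missing_sender(path, senders_by_switch, router_ip):
    """Return the first switch on the path whose sender list lacks router_ip.

    path is ordered from the Init router towards its neighbor.
    senders_by_switch maps switch name -> set of sender IPs noted for
    the OSPF_ALL_ROUTERS group on that switch. Returns None when every
    switch on the path lists the router.
    """
    for switch in path:
        if router_ip not in senders_by_switch.get(switch, set()):
            return switch
    return None


# Hypothetical data: 10.1.1.2 is the Init router, 10.1.1.9 is the remote site.
path = ["core-2", "agg-1", "agg-2", "remote-site"]
senders = {
    "core-2": {"10.1.1.2", "10.1.1.9"},
    "agg-1": {"10.1.1.2", "10.1.1.9"},
    "agg-2": {"10.1.1.9"},
    "remote-site": {"10.1.1.9"},
}
print(first_hop_missing_sender(path, senders, "10.1.1.2"))  # agg-2
```

The first switch that no longer lists the router is the one to look at more closely, along with the link feeding it.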

Chris

I will test that the next time it goes down, to find which switch is doing it. Is there a list of affected firmware versions? Is it actually a bug that's fixed in a later release, or is it something we just have to apply workarounds for, such as those in the linked doc?

In addition, I just noticed we have a few sites that have one neighbor in FULL and one in EX_START, and on the core side it's the same: one FULL and one EX_START. I'm not sure if it's related or the same issue. In the EX_START case I checked, and I see 224.0.0.5 on the VLAN from what appears to be all switches.

EtherMAN, Embassador

Chris, we are a metro carrier running on a layer 2 core of 8900s, and this has been a sporadic issue for years. The best workaround we have found is to disable IGMP snooping on your layer 2 only switches as much as possible. If you are not running video that needs to be pruned, then just kill it 100%...

Here is a fix/workaround to help you find which switch is causing the problem (the theory is that an mcast entry gets corrupted in hardware and is then not forwarded). First make sure that your two routers can at least ping each other's IP interfaces. If not, then you have other issues.

When you have OSPF or router adjacency issues, run the following on one switch at a time:
clear igmp snooping  
clear fdb
clear ipmc fdb

Check whether your routers regain their adjacencies after each switch you clear in the path, until you find the one that was at fault. GTAC will tell you which code you need to be running for the switches and setup you have. 15. had some issues for sure.
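
The one-switch-at-a-time procedure could be scripted roughly like this. It is a sketch only: `run_command` and `adjacency_is_up` are hypothetical callbacks you would have to supply yourself (for example via your own SSH tooling and a check of the router's neighbor table), and a real check would need to wait long enough for hellos to be exchanged before deciding.

```python
CLEAR_COMMANDS = ["clear igmp snooping", "clear fdb", "clear ipmc fdb"]


def find_faulty_switch(path, run_command, adjacency_is_up):
    """Clear the tables on one switch at a time and stop at the first
    switch after which the routers regain their adjacency.

    run_command(switch, command) and adjacency_is_up() are supplied by
    the caller; both are placeholders in this sketch.
    Returns the switch name, or None if clearing every switch did not help.
    """
    for switch in path:
        for command in CLEAR_COMMANDS:
            run_command(switch, command)
        if adjacency_is_up():
            return switch
    return None


# Simulated run with hypothetical switch names: the adjacency only
# comes back once "agg-2" has been cleared.
cleared = []


def fake_run(switch, command):
    cleared.append((switch, command))


def fake_adjacency_is_up():
    return any(switch == "agg-2" for switch, _ in cleared)


print(find_faulty_switch(["agg-1", "agg-2", "core-1"], fake_run, fake_adjacency_is_up))  # agg-2
```

Stopping at the first switch that restores the adjacency is what narrows down the culprit; clearing everything at once would fix the outage but tell you nothing.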

We have found that this seems to be a very random thing and is usually triggered after a topology event where you have an EAPS failover. It seems to happen when a port in the rings flaps multiple times in a short time period.

Good luck. We have never found this to be a lack of resources; there are always open buckets in memory for more entries. By default, unless you have an ACL in place to block 224.0.0.0/24, all modern layer 2 switches should always forward the mcast traffic sent to the router mcast IPs, period. The CPU only moves it to hardware the first time... so good luck with your efforts. I will be tracking this one closely too for new info or ideas.
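
As a small aid for checking which groups fall in that 224.0.0.0/24 range, and what destination MAC to look for in an FDB or packet capture, here is a sketch using Python's standard `ipaddress` module. The group assignments listed are the standard IANA ones; the helper names are made up for this example.

```python
import ipaddress

LOCAL_NETWORK_CONTROL = ipaddress.ip_network("224.0.0.0/24")

# Standard IANA assignments for the routing protocols mentioned in this thread.
WELL_KNOWN_GROUPS = {
    "224.0.0.5": "OSPF AllSPFRouters",
    "224.0.0.6": "OSPF AllDRouters",
    "224.0.0.10": "EIGRP",
    "224.0.0.18": "VRRP",
}


def is_local_control(group):
    """True if the group is in the link-local 224.0.0.0/24 block."""
    return ipaddress.ip_address(group) in LOCAL_NETWORK_CONTROL


def multicast_mac(group):
    """Map an IPv4 multicast group to its Ethernet MAC:
    the 01:00:5e prefix followed by the low 23 bits of the address."""
    low23 = int(ipaddress.ip_address(group)) & 0x7FFFFF
    return "01:00:5e:%02x:%02x:%02x" % (low23 >> 16, (low23 >> 8) & 0xFF, low23 & 0xFF)


for group, name in WELL_KNOWN_GROUPS.items():
    print(group, name, is_local_control(group), multicast_mac(group))

print(is_local_control("239.1.1.1"))  # False
print(multicast_mac("224.0.0.5"))     # 01:00:5e:00:00:05
```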
   

Chris

I thought disabling IGMP completely would break VRRP and OSPF (MLAG ISC isn't mcast).

And yes, it's an mcast issue, because ping always works between the L3 interfaces of the backbones.

Does disabling IGMP disrupt traffic when you do it on the layer 2 aggregation? Should I do it during a maintenance window?
Also, do I need to reboot the switch to clear the corrupted hardware entry, or will disabling the L2-CPU for all ports be enough?

The bug/issue is annoying, as it seems super random.

EtherMAN, Embassador

Chris, disabling IGMP snooping should not break any routing. All disabling does is stop the switch from pruning back any mcast traffic; it treats mcast like a broadcast packet, so it forwards all mcast, including router announcements, to all ports in the VLAN the routers are in...

If it makes you feel better you can do it during a window, but we have done it during an outage on those 8900s, at peak, with 89K MACs and 2K VLANs, without issues. You can also clear the tables I listed and see if that gives you some relief. If you have one adjacency broken and can do it one switch at a time until it starts working, then you can maybe narrow down the culprit.

When all else fails and nothing we try restores the router adjacencies, we have had to delete the VLAN or VMAN and reprovision it to clean the hung table. We have never had to reboot to fix this.

Also, what cards are you running on the 8900s? XL cards and the MSM 128 need to match up. If you put one of the c cards in a chassis, the whole chassis will drop down to the lesser card. Same thing for the MSM. You cut your processing power in half by only running one card.

My problem has been that we don't have any visibility into or access to customers' routers, so when they report a problem I have to get them back up right away and have limited time to troubleshoot this kind of issue. We know there is an issue, but it is impossible to replicate on demand... We will go 2 or 3 months with no issues. With us it is always EIGRP because they are a Cisco shop. It has been one of those things we are all aware of, including the NOC, and know how to fix quickly when it is reported. It also always seems to be smaller, less used services of 1 to 5 Mbps, not any of the larger ones...

Disabling the L2-CPU has not seemed to make much difference with this problem. It may make your messages go away, but not passing mcast router announcements is something different, I believe.

   

Chris

On my two main aggregation switches I ran the clears you recommended, and I also ran the commands recommended on the Extreme page on multicast packet loss:

enable igmp snooping forward-mcrouter-only
configure forwarding ipmc local-network-range fast-path

I even followed your recommendation and did a full disable igmp snooping,

but I was still stuck, with sites dropping in and out of idle.

I guess the next option, unless Extreme or you have another recommendation, is upgrading from these releases to 21.x/22.x. I'm really starting to get the feeling that 15.6 was just a buggy branch, and my agg switches and core switches are running 15.6.2.12 (no patch).

EtherMAN, Embassador

Chris, sorry to hear you are still having issues... Here is a link to the recommended code: http://www.extremenetworks.com/extreme-hardwaresoftware-compatibility-recommendation-matrices/softwa...  It seems you only did the 8900s in your core; are there other layer 2 only switches in the path? There is a chance that you have a resource issue in the blocks and tables, and I would start working with GTAC, if you have not already, to open a case and see if you can figure this out. I can tell you already, though, that they will ask you to get to current code before they do very much of anything. If there are other layer 2 switches in your path, you may need to clear their tables too. Clearing the tables does not affect traffic, so it can't hurt.

You said yours are coming and going, right? This is a bit different from what we have seen in the past. Once we see the issue on a VLAN, the routers will not find their neighbors until we intervene and clear the tables. They go down and stay down. Since yours are coming and going, you may indeed have another issue.

EtherMAN, Embassador

Chris, just wondering if you resolved this or are still fighting Mcast problems between your routers? 

Chris

Yes, still having issues. As a note, we don't have 8900s; we have all x460-g2s and x670-g2s.

I plan to upgrade to the latest recommended patches soon, but due to some traffic engineering issues our redundancy isn't fully redundant at the moment, so I'm trying to get that fixed before I do any reboots or upgrades. We created temporary static routes to back up the OSPF routes until we get the problem solved, so it isn't affecting customer traffic.

I've seen both sites that drop out and stay out seemingly forever and others that drop randomly, but I don't want to open a GTAC case until I upgrade the affected routers and my core to the recommended versions, to avoid reporting something that's already been fixed there.

EtherMAN, Embassador

Makes sense... I am very interested in how this proceeds and what the final fix is. We will be moving away from the 8900 in the next year or so and going to 670 G2s and 870s. I am in the same boat as you. I would love to reboot and upgrade our 8900s, but there are just too many single links and critical services. Some have been running over 1200 days, so they are way overdue...

If we have any additional insight in this I will drop you all an update too.