Question

Network disruption VSP8600



I’m running four VSP8600s in an SPBM configuration. Today we experienced a network disruption although no configuration changes had been made. The logs are full of entries like these:

************************************************************************************
Command Execution Time: Tue Oct 06 11:26:49 2020 CEST
************************************************************************************
1 2020-10-06T11:24:33.770+02:00 kreuz IO4 - 0x00138537 - 0004e001 DYNAMIC CLEAR GlobalRouter COP-SW INFO VIST peer mac b4:2d:56:9c:7a:11 on VID 2422 is learnt on non-IST MltId-1, Pointing record back to IST port.Total Peer Mac Move Count: 8051
1 2020-10-06T11:24:30.533+02:00 kreuz IO2 - 0x00138537 - 0004e001 DYNAMIC CLEAR GlobalRouter COP-SW INFO VIST peer mac b4:2d:56:9c:7a:36 on VID 426 is learnt on non-IST MltId-1, Pointing record back to IST port.Total Peer Mac Move Count: 16644
1 2020-10-06T11:22:35.621+02:00 kreuz IO3 - 0x00138537 - 0004e001 DYNAMIC CLEAR GlobalRouter COP-SW INFO VIST peer mac b4:2d:56:9c:7a:2b on VID 419 is learnt on non-IST MltId-1, Pointing record back to IST port.Total Peer Mac Move Count: 13707

...

1 2020-10-06T10:42:31.049+02:00 kreuz IO5 - 0x00138537 - 0004e001 DYNAMIC CLEAR GlobalRouter COP-SW INFO VIST peer mac b4:2d:56:9c:7a:28 on VID 205 is learnt on non-IST MltId-1, Pointing record back to IST port.Total Peer Mac Move Count: 9389
1 2020-10-06T10:42:31.035+02:00 kreuz IO6 - 0x00138537 - 0004e001 DYNAMIC CLEAR GlobalRouter COP-SW INFO VIST peer mac b4:2d:56:9c:7a:28 on VID 205 is learnt on non-IST MltId-1, Pointing record back to IST port.Total Peer Mac Move Count: 631

...

1 2020-10-06T10:42:22.788+02:00 kreuz CP1 - 0x00004619 - 01900001 DYNAMIC CLEAR GlobalRouter SNMP INFO Smlt Link Up Trap(SmltId=133)
1 2020-10-06T10:42:22.788+02:00 kreuz CP1 - 0x0000000a - 01900001.133 DYNAMIC CLEAR GlobalRouter SW INFO SMLT 133 Link is UP
1 2020-10-06T10:42:17.262+02:00 kreuz CP1 - 0x0000461a - 01900001 DYNAMIC SET GlobalRouter SNMP INFO Smlt Link Down Trap(SmltId=133)
1 2020-10-06T10:42:17.261+02:00 kreuz CP1 - 0x00000009 - 01900001.133 DYNAMIC SET GlobalRouter SW INFO SMLT 133 Link is DOWN

...

1 2020-10-06T10:04:03.768+02:00 kreuz IO3 - 0x00138537 - 0004e001 DYNAMIC CLEAR GlobalRouter COP-SW INFO VIST peer mac b4:2d:56:9c:7a:2b on VID 419 is learnt on non-IST MltId-1, Pointing record back to IST port.Total Peer Mac Move Count: 12529
1 2020-10-06T10:03:48.600+02:00 kreuz IO4 - 0x00138537 - 0004e001 DYNAMIC CLEAR GlobalRouter COP-SW INFO VIST peer mac b4:2d:56:9c:7a:11 on VID 2422 is learnt on non-IST MltId-1, Pointing record back to IST port.Total Peer Mac Move Count: 7347

What could be causing this problem, and how can I debug it further? My log only goes back about one hour.


17 replies

Userlevel 3

The reason for this message is typically a network loop. The switch detects that it has learnt its peer’s MAC address over a normal or SMLT UNI port, and so it moves the peer MAC back to the vIST, which is where it should learn it from.

Make sure you protect yourself against loops with the recommended measures, such as SLPP, SLPP Guard and BPDU Guard.
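
To see which slots, VLANs and peer MACs are affected and how fast the move counter grows, the quoted log lines can be summarized offline. The following is only a sketch: the regular expression is derived from the three sample lines in the question, and the function name `summarize` is made up for illustration, not part of any Extreme tooling.

```python
import re

# Pattern derived from the log lines quoted in the question (an assumption,
# not an official log format specification).
PEER_MAC_MOVE = re.compile(
    r"^\d+\s+(?P<ts>\S+)\s+(?P<host>\S+)\s+(?P<slot>\S+)\s+-\s.*"
    r"VIST peer mac (?P<mac>[0-9a-f:]{17}) on VID (?P<vid>\d+) "
    r"is learnt on non-IST MltId-(?P<mlt>\d+).*"
    r"Total Peer Mac Move Count: (?P<count>\d+)"
)

SAMPLE_LOG = """
1 2020-10-06T11:24:33.770+02:00 kreuz IO4 - 0x00138537 - 0004e001 DYNAMIC CLEAR GlobalRouter COP-SW INFO VIST peer mac b4:2d:56:9c:7a:11 on VID 2422 is learnt on non-IST MltId-1, Pointing record back to IST port.Total Peer Mac Move Count: 8051
1 2020-10-06T11:24:30.533+02:00 kreuz IO2 - 0x00138537 - 0004e001 DYNAMIC CLEAR GlobalRouter COP-SW INFO VIST peer mac b4:2d:56:9c:7a:36 on VID 426 is learnt on non-IST MltId-1, Pointing record back to IST port.Total Peer Mac Move Count: 16644
1 2020-10-06T10:03:48.600+02:00 kreuz IO4 - 0x00138537 - 0004e001 DYNAMIC CLEAR GlobalRouter COP-SW INFO VIST peer mac b4:2d:56:9c:7a:11 on VID 2422 is learnt on non-IST MltId-1, Pointing record back to IST port.Total Peer Mac Move Count: 7347
"""


def summarize(lines):
    """Return {(slot, vid, mac): (lowest count, highest count, messages seen)}."""
    stats = {}
    for line in lines:
        match = PEER_MAC_MOVE.search(line.strip())
        if not match:
            continue  # not a peer MAC move message
        key = (match["slot"], int(match["vid"]), match["mac"])
        count = int(match["count"])
        low, high, seen = stats.get(key, (count, count, 0))
        stats[key] = (min(low, count), max(high, count), seen + 1)
    return stats


if __name__ == "__main__":
    for key, (low, high, seen) in sorted(summarize(SAMPLE_LOG.splitlines()).items()):
        print(key, "count range:", low, "to", high, "messages:", seen)
```

A counter that keeps climbing on the same VID and slot points to where the peer MAC keeps reappearing, which narrows down the VLANs worth checking first.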

 

Roger 

As far as I understand, the virtual IST configuration only consists of a VLAN ID and the loopback IP of the IST peer. MLT 1 is a trunk between the two vIST peers, so how come the log says it is a non-IST MLT?

I’m using loop detection on all switches that are connected to the VSPs. I have had this error a few times over the last months, and after rebooting all four VSPs it goes away for a few weeks. Honestly, it feels more like a bug than a real loop.

Is there a way to debug this occurrence further? Unfortunately the log doesn’t show me anything from the start of this error because it only covers the last 1-2 hours. Is there a way to increase the log size?

Userlevel 5

brms,

 

The following info from both IST members is needed to check if a loop is possible on your infra:

show mlt

show isis interface

show isis adja

show slpp

show spanning-tree config

show spanning-tree mstp port role

For the vIST:

show virtual-ist

show vlan members

show i-sid vlan

 

The logs are still present on the flash memory of your VSP. Just connect with sftp or perform an “ls” on the VSP to see them.

The name starts with “log”…
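
Once those files have been copied off the switch, the first occurrence of the message can be located with a small script. This is a sketch under assumptions: it assumes the files are plain text in the same format as the lines quoted in the question, that they sit in one local directory, and the function names are made up for illustration.

```python
from datetime import datetime
from pathlib import Path

MARKER = "is learnt on non-IST"  # text taken from the quoted log message


def iter_log_files(directory):
    """Yield (file name, line) for every file whose name starts with 'log'."""
    for path in sorted(Path(directory).glob("log*")):
        if path.is_file():
            with path.open(errors="replace") as handle:
                for line in handle:
                    yield path.name, line.rstrip("\n")


def earliest_occurrence(named_lines):
    """Return (timestamp, file name, line) of the oldest matching entry, or None."""
    best = None
    for name, line in named_lines:
        if MARKER not in line:
            continue
        fields = line.split()
        try:
            # Second field is the ISO 8601 timestamp in the quoted log lines.
            stamp = datetime.fromisoformat(fields[1])
        except (IndexError, ValueError):
            continue  # line does not follow the assumed layout
        if best is None or stamp < best[0]:
            best = (stamp, name, line)
    return best


if __name__ == "__main__":
    # "." is a placeholder for the directory the log files were copied to.
    print(earliest_occurrence(iter_log_files(".")))
```

The entries logged just before that first timestamp (link flaps, SMLT state changes) are the ones worth reading closely.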

 

As personal advice, in such a situation (network disruption) I would open a GTAC case in parallel with my own investigation. There is a lot of information that can only be captured right after the issue occurs.

Regards

Mig

Userlevel 3

The switch reports the error on the link it saw the issue on. For this error to be reported, the peer MAC (on any VLAN) must have been seen on a UNI (non-vIST NNI) port. I doubt that there is a bug in this regard, as the message is only triggered when an actual peer MAC move had to be executed by the switch. What is connected on the reported link?

 

Roger

Thanks in advance. I already created a ticket with our partner, unfortunately communication with them is rather slow and fruitless. Here are the outputs of the commands:

show mlt: https://pastebin.com/MG8yP18m

MLT-1 is an MLT between the two VSPs at each location (pik/kreuz and karo/herz), which are configured as a vIST pair. VLANs 4051 and 4052 are the B-VLANs.

show isis interface: https://pastebin.com/cPyu31AX

show isis adj: https://pastebin.com/Aj6WXRPM

show slpp: https://pastebin.com/Y0y51ARF

show spanning-tree config: https://pastebin.com/g3eFKEdH

show spanning-tree mst port role: https://pastebin.com/rQC7AMwS

show virtual-ist: https://pastebin.com/s4gcekmL

show vlan members: https://pastebin.com/4XKhPu3A

show i-sid elan (show i-sid vlan isn’t a valid command): https://pastebin.com/8547Hi6e

As far as I know, VLAN 4054 doesn’t need to be assigned to the MLT that forms the vIST, as the VLAN connectivity is handled by SPBM. Maybe that’s the problem?

Userlevel 3

From what you shared, something looks wrong. Based on your data, the NNI link to the peer switch should be MLT-1, but the switch thinks MLT-1 is the culprit. Can you check how the two vIST peers see each other? I assume there are more NNI links than just MLT-1.

Do you have a parallel non-NNI link between the two nodes connected as well (even without any “looping” VLANs)?

I think you need someone from support to look into this.

Roger

The four VSPs are located at two locations. The two VSPs at each location are connected directly via MLT 1, and via 1/1 and 1/5 to the VSPs at the other location. All four uplinks (1/1,1/2,5/1,5/2) are configured as NNI links. There should be no non-NNI links directly between the VSPs. One possibility could be a connected EXOS switch that doesn’t have LACP configured. In that case there might be an indirect non-NNI link?

“Based on your data, the NNI link to the peer switch should be MLT-1, but the switch thinks MLT-1 is the culprit.”

That’s what I’m wondering too. Since MLT 1 is an NNI link and should serve the vIST VLAN, it is strange that the log entries call MLT-1 non-IST.

I’m not really sure how to do this: “Can you check how the two vist-peers see each other”. Any idea?

Userlevel 5

brms, 

 

just wondering, what’s the output of:

show ip route | include 172.28.7

show vlan i-sid | include 4054

show isis spbm i-sid all | include 4054

Mig

 

VOSS on the 8600 doesn’t support include :)

Here is the desired output:

show ip route on first vsp (pik). 172.28.72.1/2 are the vist-ips of the 2 VSPs at the other location:

172.28.72.0     255.255.255.0   karo                 GlobalRouter     10     4051     ISIS 0   IBSE 7  
172.28.72.0     255.255.255.0   herz                 GlobalRouter     10     4051     ISIS 0   IBSE 7  
172.28.72.0     255.255.255.0   karo                 GlobalRouter     10     4052     ISIS 0   IBSE 7  
172.28.72.0     255.255.255.0   herz                 GlobalRouter     10     4052     ISIS 0   IBSE 7  
172.28.73.0     255.255.255.0   172.28.73.1          -                1      4054     LOC  0   DB   0 

show ip route on second vsp (kreuz: the one with the log entries):

172.28.72.0     255.255.255.0   karo                 GlobalRouter     10     4051     ISIS 0   IBSE 7  
172.28.72.0     255.255.255.0   herz                 GlobalRouter     10     4051     ISIS 0   IBSE 7  
172.28.72.0     255.255.255.0   karo                 GlobalRouter     10     4052     ISIS 0   IBSE 7  
172.28.72.0     255.255.255.0   herz                 GlobalRouter     10     4052     ISIS 0   IBSE 7  
172.28.73.0     255.255.255.0   172.28.73.2          -                1      4054     LOC  0   DB   0

 

show vlan i-sid on both vist-peers:

4054       104054

 

show isis spbm i-sid all on pik:

104054    0.3d.b3       4051   00db.face.0003       config       pik

104054    0.4d.b4       4052   00db.face.0004       discover   kreuz

show isis spbm i-sid all on kreuz:

104054    0.3d.b3       4051   00db.face.0003       discover   pik

104054    0.4d.b4       4052   00db.face.0004       config       kreuz

 

BTW, I forgot to mention: the problem only occurs at location 2; location 1 doesn’t show the same errors. The configs should be as identical as possible.

Userlevel 5

brms,

 

I like neither the fact that the vIST subnet is distributed in the ISIS routing table nor the fact that it is a /24.

I suppose that you use the same VLAN ID and i-sid on the four VSPs. Could you confirm?

Best practice is to use a /30 that is not redistributed into the routing table, and a different i-sid for each vIST cluster.
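
To illustrate the subnet size point: a /30 has room for exactly the two vIST peers, while a /24 leaves many unused addresses. A minimal check with Python’s standard `ipaddress` module, using the 172.28.73.0 network from the route output earlier in the thread (the /30 prefix length is the proposed value, not the current configuration):

```python
import ipaddress


def usable_hosts(cidr):
    """Return the usable host addresses of a network as strings."""
    return [str(host) for host in ipaddress.ip_network(cidr).hosts()]


if __name__ == "__main__":
    print(usable_hosts("172.28.73.0/30"))       # ['172.28.73.1', '172.28.73.2']
    print(len(usable_hosts("172.28.73.0/24")))  # 254
```

The two addresses of the /30 match the local addresses shown for VLAN 4054 on pik and kreuz, so in this example the prefix length could shrink without renumbering the peers.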

Mig

In the past I had the same VLAN for both vISTs because I didn’t know better. After I noticed log entries about the VLAN, I changed it from 4053 to 4054 on the VSPs that are now causing problems. Can the subnet size be a problem? In the FDB table of these subnets I don’t see any MACs other than those of the VSPs.

problematic VSPs: show interface vlan 4054: https://pastebin.com/ek4VSnuc

working VSPs: show interface vlan 4053: https://pastebin.com/mNtsfmaX

How can I disable route redistribution for the vIST VLAN?

Since I had to resolve the immediate problem, I just rebooted the switches with the CLI command “reset”. The one with the log errors now tells me that a core dump has been saved. Does this indicate a hardware error?

Userlevel 5

brms,

 

I see the following points to be worked out:

  • Change the ISIS metrics of your ISIS interfaces: the MLT should have half the cost of interfaces 1/1 or 1/5.
    • For 10G links: MLT = 100, interfaces 1/1 and 1/5 = 200
  • I would enable SLPP on all the C-VLANs.
  • Could you confirm the value of the i-sid used for the vIST on the different switches?
    • It should be unique per cluster.
  • I would change the subnet to a /30, although using a /24 shouldn’t cause any issue.
  • Avoid redistributing the vIST subnets into ISIS/OSPF/other: https://gtacknowledge.extremenetworks.com/pkb_mobile#article/How_To/kA12T0000004QhGSAU/s
  • Ensure that you don’t use the vIST subnet for anything other than the vIST (not as a next hop, not for SNMP access, etc.).
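
For the metric suggestion in the first bullet, the effect on path costs can be sketched with a small shortest-path calculation. The topology below is an assumption pieced together from this thread (one MLT inside each pair, and every VSP linked to both VSPs at the other location); it is not taken from the actual `show isis` output, and all function names are made up for illustration.

```python
import heapq


def shortest_costs(links, source):
    """Dijkstra over undirected links given as {(node_a, node_b): metric}."""
    graph = {}
    for (a, b), metric in links.items():
        graph.setdefault(a, []).append((b, metric))
        graph.setdefault(b, []).append((a, metric))
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, metric in graph.get(node, []):
            new_cost = cost + metric
            if new_cost < dist.get(neighbour, float("inf")):
                dist[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return dist


def topology(mlt_metric, uplink_metric):
    """Assumed four-node layout: two vIST pairs, fully meshed between sites."""
    return {
        ("pik", "kreuz"): mlt_metric,
        ("karo", "herz"): mlt_metric,
        ("pik", "karo"): uplink_metric,
        ("pik", "herz"): uplink_metric,
        ("kreuz", "karo"): uplink_metric,
        ("kreuz", "herz"): uplink_metric,
    }


def detour_cost(links, a, b):
    """Cost from a to b when the direct link between them is ignored."""
    reduced = {pair: m for pair, m in links.items() if set(pair) != {a, b}}
    return shortest_costs(reduced, a).get(b)


if __name__ == "__main__":
    for name, links in (("all 10", topology(10, 10)),
                        ("MLT 100 / uplinks 200", topology(100, 200))):
        direct = shortest_costs(links, "pik")["kreuz"]
        print(name, "direct:", direct, "detour:", detour_cost(links, "pik", "kreuz"))
```

Under these assumptions the detour between the peers via the other site costs twice the direct MLT path with flat metrics and four times as much with the suggested values.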

Mig

“change the isis metrics of your isis interfaces: the MLT should have the cost of the interfaces 1/1 or 1/5 divided by 2. If it is 10G links: MLT=100, interfaces 1/1,1/5 = 200”

At the moment all three are at a metric of 10. I should change MLT-1 to 5 with the following commands, right? Can this be done without interrupting anything, or do I need to disable ISIS first?

interface mlt 1
isis spbm 1 l1-metric 5

“I would enable SLPP on all the C-VLANs”

Since we don’t have a BEB/BCB configuration, just the four VSP cores with directly connected EXOS switches (with FA) and servers, I’m not sure which SLPP method to use. Do the cores count as access SMLT in my case? What would you recommend: SLPP per port or per VLAN?

 

“Could you confirm the value of the i-sid used on the different switches for the vIST? It should be unique per cluster”

vist-cluster 1 has vlanid 4053 and isid 104053

vist-cluster 2 has vlanid 4054 and isid 104054

 

“I would change the subnet to /30 but this shouldn’t cause any issue using a /24”

Last time I changed the IP of the vIST VLAN, the whole system crashed. As far as I remember, I needed to disable ISIS before changing anything. What precautions do I need to take before changing the IP? Is it enough to disable both ports belonging to MLT 1 before changing the IP?

 

“avoid the redistribution of the vIST subnets in ISIS/OSPF/other: https://gtacknowledge.extremenetworks.com/pkb_mobile#article/How_To/kA12T0000004QhGSAU/s”

Wow, this looks rather unintuitive. I will give it a try.

 

“Ensure that you don’t use the vIST subnet for other purposes than VIST (not as next hop, not as SNMP access, etc)”

As far as I know we don’t use the vIST IPs anywhere, but I will check that.

 

Thank you very much for your help!

Userlevel 5

brms,

As those operations are quite intrusive, I would insist on having the support of the partner.

 

Cheers

 

Mig

Userlevel 3

BRMS - can you please share your support case number that was opened for the MAC move issue? Please send it directly to me at rlapuh@extremenetworks.com

Thanks Roger

I’ll ask our partner whether they have already opened a support case with Extreme. As soon as I get the case number I’ll send it to you. Thanks again for all the help!
