VIP can't see device in same switch and VLAN with LACP/LAG problem to the core.

  • 5 December 2016
  • 3 replies

Userlevel 4
Strange issue that I suspect is related to a possible LACP sharing method mismatch between Avaya ERS5xxx and XOS on x440s. Eight x440s connect back to a Nortel/Avaya ERS core using LACP on ports 47/48. There is also a stack of five x450G2s used as a voice stack (tagged voice VLAN over an LACP trunk to the core, where VMware hosts a virtual IPOffice). That stack has 4 ports in LACP to the core.

I was having an issue where phones (30) could not connect to IPOffice during provisioning. I could not ping a phone, but could see it in the FDB. I added a VIP in the voice VLAN on the x450 stack and still could not ping the phone, although I could see it in the IPARP cache and in the FDB.

I looked in the ARP cache from the core side, and the phone showed up via trunk 28 (the LACP trunk back to the VoIP stack). I also looked in our firewall (connected to the core on that VLAN, but not routing or anything), and it too had the phone in its ARP cache.

On a whim, I disabled 2 of the 4 ports on the core side. Two of the ports were in switch 1 and two in switch 4; I disabled the two in switch 4. After about 5 minutes, the ping from the VIP to the phone began working. So...

What role would LACP play in a switch where a VIP is directly tied to the VLAN of a device in the same switch/stack? While troubleshooting, I found a similar problem on one of the x440s, where one VDI machine could ping a printer but another VDI machine on the same VM host could not. In this case the path was across the core (VDI/VM -- CORE -- X440 via LACP -- PRINTER), but the same fix worked: I turned off one of the two ports in the LACP trunk on the core side.

This leads me to think it's a LAG mismatch, but that still does not explain the scenario where the VIP could not ping a local device in the same VLAN. Thoughts?


3 replies

Userlevel 7
Hi Eric,

The role of LACP is to determine which ports belong to which LAG (port sharing). If both sides of a LAG show the same active ports, LACP has done its job.

Load sharing is a local decision of the switch forwarding a frame through a LAG. The remote side of the LAG needs to accept the frame on every active link of this LAG.
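
A minimal Python sketch of why load sharing is a local decision. Everything in it is an assumption made for illustration: the port names are made up, CRC32 stands in for whatever hardware hash each switch really uses, and the choice of "MAC fields on the core, IP fields on the stack" is hypothetical, not the actual ERS or EXOS behaviour.

```python
import zlib

def pick_link(links, fields):
    """Map a frame to one LAG member by hashing the selected header fields."""
    key = "|".join(fields).encode()
    return links[zlib.crc32(key) % len(links)]

# Hypothetical names for the two ends of the same four-link LAG.
CORE_PORTS = ["core 1/1", "core 1/2", "core 4/1", "core 4/2"]
STACK_PORTS = ["stack 1:1", "stack 1:2", "stack 2:1", "stack 2:2"]

def core_egress(frame):
    # Assumption: the core hashes on MAC addresses only.
    return pick_link(CORE_PORTS, [frame["src_mac"], frame["dst_mac"]])

def stack_egress(frame):
    # Assumption: the stack hashes on IP addresses only.
    return pick_link(STACK_PORTS, [frame["src_ip"], frame["dst_ip"]])

# An ICMP echo sent from the core side and the reply sent back by the stack.
echo = {
    "src_mac": "00:00:5e:00:53:01", "dst_mac": "00:00:5e:00:53:02",
    "src_ip": "192.0.2.10", "dst_ip": "192.0.2.20",
}
reply = {
    "src_mac": echo["dst_mac"], "dst_mac": echo["src_mac"],
    "src_ip": echo["dst_ip"], "dst_ip": echo["src_ip"],
}

print("echo leaves the core on:  ", core_egress(echo))
print("reply leaves the stack on:", stack_egress(reply))
```

Each sender picks its own egress member without consulting the other end, so the two directions of one conversation can travel over different physical links. That is fine as long as the receiver accepts frames on every active member.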

It is not normal that you need to disable ports of a LAG to fix connectivity issues. You should consider opening cases with both Extreme and Avaya to investigate this.

If you want to investigate yourself, I would suggest using port mirrors to find the ports traversed by the frames needed for communication (e.g. ARP request & response, ICMP echo & echo reply). On EXOS, you are mirroring on the physical port, so you see which of the links is actually used.

Userlevel 4
Thanks Erik. I was remote to the client when running through the various scenarios, so I was not able to connect a packet capture device to a mirror (definitely planned). I was thinking I had just overlooked something, but after reading a lot about even "mismatched" LACP/LAGs, my understanding is that they may not load balance as efficiently as possible but should still work. This is odd due to the nature of the fix that restored ICMP from the VIP to the device. I will for certain run a trace, see what gives, and report back to this thread. Very odd...

Userlevel 7
Hi Eric,

By mismatched you mean different load sharing algorithms on the two ends of a LAG, I assume. That is no problem as long as the input values to the algorithm provide enough entropy. Using the same load sharing algorithm on all LAGs can lead to hash polarization; see e.g. "Uneven load sharing of traffic being forwarded through several subsequent load sharing groups" (from GTAC Knowledge) or "CEF Polarization" (from Cisco).
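
Hash polarization can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular switch computes its hash: SHA-256 stands in for the hardware hash, the flow keys are made-up address pairs, and the `salt` parameter is a hypothetical way to model a second switch using a different algorithm.

```python
import hashlib
from collections import Counter

def link_index(flow, n_links, salt=""):
    """Hash a flow key onto one of n_links; salt models a different algorithm."""
    digest = hashlib.sha256((salt + flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_links

# Made-up flow keys (source -> destination address pairs).
flows = [f"10.0.0.{i}->10.0.1.{j}" for i in range(1, 51) for j in range(1, 5)]

# First LAG (2 links): only flows hashed to link 0 reach the next switch.
via_link0 = [f for f in flows if link_index(f, 2) == 0]

# Second LAG (2 links) with the SAME hash: every flow lands on link 0 again,
# because these flows were selected precisely for hashing to 0.
same_hash = Counter(link_index(f, 2) for f in via_link0)

# Second LAG with a DIFFERENT hash (modelled by the salt): flows spread out.
other_hash = Counter(link_index(f, 2, salt="tier2") for f in via_link0)

print("same algorithm on both LAGs:", dict(same_hash))
print("different algorithm:        ", dict(other_hash))
```

With the same algorithm at both stages, the second LAG only ever uses one member. That is uneven, but frames are still delivered, which matches the point that a mismatch (or a match) in algorithms affects balance rather than connectivity.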