MLAG with VMware or HyperV - tons of DUP packets


Eric_Burke
New Contributor III
We deployed our first MLAG scenario with two X670s at the core and VMware 6.0 downstream. We saw unusual amounts of what appeared to be packet loss, overall slowness, and VMs showing online, then offline, etc. A packet trace revealed a ton of duplicate packets and retransmissions; when we took the second peer offline, these ceased. We've since confirmed that both sides of each MLAG are using the L3 algorithm (IP hash on the vSwitch), removed all unused/standby adapters from the vSwitch, and ensured that beacon probing was off. We otherwise followed the 2012 white paper from Extreme on deploying MLAG in an ESXi environment. Not sure if all of the duplicate traffic is expected in this config or if we're doing something wrong. Ideas?
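For reference, here's roughly what we have on each side; port numbers, the peer name, vSwitch name and uplink names are placeholders, not our exact values:

On each X670 (single-port static sharing group with the L3 hash, then the MLAG port):

enable sharing 1 grouping 1 algorithm address-based L3
enable mlag port 1 peer "core-peer" id 1

On each ESXi host (IP hash teaming, link-based failure detection so beacon probing stays off, only active uplinks listed):

esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --load-balancing=iphash --failure-detection=link --active-uplinks=vmnic0,vmnic1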

Eric_Burke
New Contributor III
Okay, so we did some extensive testing this morning: two 440s as MLAG peers, and a VMware host with one NIC to each switch (port 1 on each), both NICs in the same vSwitch. We left teaming at the default (route based on originating port ID) and configured MLAG port 1 on each switch. This matches the old 2012 best-practices document for the most part. We don't have Enterprise Plus, so we cannot use LAG/LACP.
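Roughly what the MLAG side of the test looked like on each 440 (peer name, peer IP and VR are placeholders; ISC VLAN/port config omitted):

create mlag peer "peer440"
configure mlag peer "peer440" ipaddress 10.0.0.2 vr VR-Default
enable mlag port 1 peer "peer440" id 1
show mlag peer
show mlag ports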

Findings:

- No observed IPs on one NIC in VMware
- If we disable a port on one switch, pings to the mgmt address from a node continue, but actual application access (like hitting the host via HTTP) fails most of the time
- The switch with the inactive port cannot ping the mgmt address across the ISC via the switch that still has an active port
- With IP hash, similar issues (also duplicate packets in Wireshark) - left it that way for the next test.

If we remove sharing on the switch (which is set up as a single port on each switch, port 1) but leave IP hash on VMware and keep MLAG enabled (MLAG 1 on both, but no sharing underneath), it works as expected.

If we remove MLAG entirely (peers / ISC still up) and leave port 1 on each switch as a simple tagged trunk to each pNIC, it works as it did with MLAG. Not sure MLAG is helping in this scenario.
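For clarity, the two variations that worked look roughly like this on each switch (VLAN name, port and peer names are placeholders):

Variation 1 - MLAG still enabled on port 1, no sharing group underneath:
disable sharing 1
enable mlag port 1 peer "peer440" id 1

Variation 2 - no MLAG at all, port 1 is just a tagged trunk to the pNIC:
disable mlag port 1
configure vlan "Servers" add ports 1 tagged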

So our plan (tentatively) is to use separate links to each core switch for the VMware hosts (plain trunks, no MLAG), knowing that since VMware is load balancing, it will result in traffic across the ISC (which is way oversized in our scenario, so not a big deal). For dual-connected Windows servers, we'll instead use LACP with both sharing and MLAG, with a hash type of "address" on the server and L3 on the LACP side of the Extreme. For downstream user switches, we'll do the same: MLAG, LACP, L3 on both sides. In theory (as I understand it), this will result in only one link actively handling traffic unless one core fails; then the address table will move to the remaining peer.
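Roughly what we have in mind for the server-facing ports on each core switch, plus the Windows teaming side; port numbers, names and the exact load-balancing value are placeholders/assumptions on my part:

enable sharing 21 grouping 21 algorithm address-based L3 lacp
enable mlag port 21 peer "core-peer" id 21

And on the Windows server, something like:

New-NetLbfoTeam -Name "ServerTeam" -TeamMembers "NIC1","NIC2" -TeamingMode Lacp -LoadBalancingAlgorithm IPAddresses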

Wish us luck!

Ty_Kolff
New Contributor II
That sounds like a good plan. We used MLAG on all of our other connections (primarily IDFs and other switch stacks), just not on the VMware servers.

simon_bingham
New Contributor II
Here's an idea:
You get something similar when one end is an aggregation and the other is not.

Imagine a 4-port aggregation: the end that is not an aggregation loops 3 copies of each frame back into the aggregation, so you see 4 of every packet if you Wireshark it.

Simon

Eric_Burke
New Contributor III
Thanks Simon. Agreed, that's what had us thinking that the MLAGs were improperly configured (on one end or the other). We were pretty confident in the Extreme side, but not on the VMware side. Reading their article made it seem that we'd made a mistake in the method of aggregating (wrong hash, beacon probing originally on), but later tests showed the same results. I feel like I'm missing something: to me, a LAG is simply two uplinks active in the same vSwitch, but other comments are leading me to think there is an added layer to making an actual LAG on the VMware side. Am I missing something?
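To make sure we're describing the same thing, here's roughly how the two vSwitch teaming modes we've tried get set (vSwitch name is a placeholder); as far as I can tell there is no separate "create a LAG" step on a standard vSwitch, just the teaming policy:

Default (route based on originating virtual port ID):
esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --load-balancing=portid

IP hash:
esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --load-balancing=iphash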
