Dual X670V Stacks, MLAG, and VMware ESX

Pro/con design considerations for configuring (2) 2-node X670V stacks, with emphasis on availability and performance.

Prior to submitting for budget approval, I was hoping to get feedback from anyone with experience configuring MLAGs with X670V-48t switches and VMware. I'm currently running out of available ports, and instead of adding just one X670V to my stack, I'm considering adding a separate stack and configuring MLAGs to our ESX hypervisors with standard vSwitches. Right now I would have to shut down the entire server/storage footprint to update the EXOS software on our single 2-node stack. I would be grateful to hear comments in favor of or in opposition to the configuration below.

End result (see the config sketch below):
(2) 2-node X670V stacks
MLAG: Stack A port 1:1 with Stack B port 1:1 for generic server traffic
MLAG: Stack A port 1:17 with Stack B port 1:17 for NFS storage traffic
MLAG: Stack A port 1:33 with Stack B port 1:33 for management traffic
Server environment is all VMware, with 4x 10G and 4x 1G NICs per host
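
For reference, a minimal EXOS sketch of the switch-side peering this design implies; the ISC port 2:48, the VLAN tag, the peer name "stack-b", and the 10.99.99.0/30 addresses are illustrative assumptions, not part of the plan:

  # On Stack A: inter-switch connection (ISC) VLAN used for the MLAG peering
  create vlan isc
  configure vlan isc tag 4094
  configure vlan isc add ports 2:48 tagged
  configure vlan isc ipaddress 10.99.99.1/30

  # MLAG peer pointing at Stack B, then the server-facing MLAG ports
  # (1:1 generic server traffic, 1:17 NFS storage, 1:33 management)
  create mlag peer stack-b
  configure mlag peer stack-b ipaddress 10.99.99.2
  enable mlag port 1:1 peer stack-b id 101
  enable mlag port 1:17 peer stack-b id 117
  enable mlag port 1:33 peer stack-b id 133

Stack B would mirror this with the ISC addresses swapped and the same MLAG IDs on its matching ports.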


Thanks!
Scott Benne

Posted 2 years ago

Paul Russo, Alum

Hey Scott

I am a big fan of using MLAG for the reason you mentioned above. MLAG gives you complete failover redundancy plus the additional bandwidth of the LAG from the end station.

The only con of MLAG is that it needs more configuration than a stack, but I think the added redundancy is well worth it.

I hope that helps.

P
Erik Auerswald, Embassador

Hi Scott,

Please note that the ESXi Standard vSwitch cannot use LACP, so you would need static LAGs (port sharing without LACP, or possibly single physical ports) to connect the ESXi servers via MLAG.
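
As a hedged sketch of that combination (one NIC per stack, so each switch carries a single-member static LAG tied together by MLAG; the port numbers, peer name, and vSwitch name are assumptions):

  # EXOS, on each stack: 'enable sharing' without the 'lacp' keyword = static LAG
  enable sharing 1:1 grouping 1:1 algorithm address-based L3
  enable mlag port 1:1 peer stack-b id 101

  # ESXi side: a static EtherChannel needs 'Route based on IP hash'
  esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --load-balancing=iphash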

ESXi does not need to use a LAG for the vSwitch uplinks. If you use a load-balancing mechanism that keeps all flows from one VM on one uplink (e.g. based on source MAC, or based on the originating port of the vSwitch), you can connect an ESXi server's uplinks active/active to different switches. The switches just need to be in the same layer 2 domain (same VLANs).
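
On a Standard vSwitch that policy can be set from the CLI, e.g. (vSwitch name assumed; 'portid' is the default originating-port policy, 'mac' the source-MAC hash):

  # Pin each VM's flows to one uplink so no switch-side LAG is needed
  esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --load-balancing=portid
  # alternative: route based on source MAC hash
  esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --load-balancing=mac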

The Distributed vSwitch is needed to use LACP for ESXi uplinks (it requires the Enterprise Plus license level). Load Based Teaming (LBT), preferred by many VMware admins, requires the Distributed vSwitch as well.

Erik
Scott Benne

Thanks Paul and Erik for the quick responses! Regarding NFS storage and static LAGs: if an active flow's LAG member goes down, does the vSwitch recover gracefully onto the other LAG member, or is there a chance of data corruption?
Erik Auerswald, Embassador

Data corruption caused by network problems should be prevented by NFS itself, modulo bugs in the implementations.

In my experience, NFS is quite robust. My experience in this regard pertains primarily to classical UNIX and GNU/Linux implementations, as opposed to VMware and storage vendors.
Ty Kolff

I recently did some testing with this scenario, and we found that the ESX host worked better when it was just plugged into each of the X670s with no MLAG configuration whatsoever.
Paul Russo, Alum

Hey Ty

To be clear: are you saying that a LAG into one switch, when you lose a link, behaves better than a LAG split across an MLAG pair?

If so, how would you handle redundancy?

Thanks
P
Ty Kolff

No, I tested plugging one NIC into each X670 in a pair of MLAG/VRRP cores. It worked best when we just plugged a NIC into each core on a port with no MLAG or LAG configuration whatsoever.

Note we did not team the NICs together. The VMware guys I talked to didn't recommend teaming the NICs on the ESX host.
Erik Auerswald, Embassador

Hi,

It is my impression as well that the VMware guys do not like using LAGs, whether static or with LACP. The networking guys, on the other hand, want to use LAGs all the time. ;-)

The VMware vSwitch is not a software Ethernet switch; it is something similar, but different. It uses the concept of uplinks that connect the vSwitch to the network. A frame entering on one uplink is never sent out of another uplink, only to virtual ports. Thus redundant uplinks work without grouping them into a LAG.

Failover time in a LAG is usually determined by the time needed to detect a link-down situation. LACP (with 30-second hellos and a 90-second hold time by default) is not used as the primary failover mechanism.
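
Where LACP is in play, EXOS can switch a LAG to the short timeout (1-second hellos, 3-second expiry) to speed up detection; a sketch, with the LAG master port assumed:

  configure sharing 1:1 lacp timeout short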

Not using a LAG on VMware still means link-down detection is the signal to fail over.
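
On a Standard vSwitch the failure-detection policy can be set explicitly; a sketch, with the vSwitch name assumed:

  # link-state based failure detection ('beacon' probing is the alternative)
  esxcli network vswitch standard policy failover set --vswitch-name=vSwitch0 --failure-detection=link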

I would prefer the use of LAGs, but that is from the network point of view, not the VMware one.

Erik