How to fix the unicast issue in VMware

  • Question
  • Updated 2 years ago
  • Answered
Basically, any unicast communication from one VM to another is not being received when they sit on separate hosts. This is true for NLB as well as a new service we have introduced to our network, Dynamics AX AOS clustering. When the AOS servers are on different hosts, performance is severely degraded and the application becomes unusable. Has anyone else come across this type of scenario? VMware has a couple of KB articles outlining this problem, but I would like to get another perspective on the issue.



Eddie Brown


Posted 2 years ago


Grosjean, Stephane, Employee

When you say "unicast" issue in VMware, are you referring to "unicast NLB mode" (since you mention NLB), or something else? I'm not familiar with AOS clustering; does it use a similar trick to NLB?

"Unicast NLB mode" results in flooding, which is not efficient, but it should work. What platforms are you using?

Mike D, Alum

Hello Eddie,

It may turn out that the unicast behavior, NLB operation, and AOS are all suffering from the same root cause. I'm not schooled on AX AOS specifics, so I think it will be helpful to pull these symptoms apart and discuss them as isolated protocol behaviors rather than as a batch. If you could fill in some of the blanks as we go along, I'd appreciate your insights.

We sometimes see NLB and similar load-share/virtualization schemes experience performance problems at L2.

As I believe you implied, and as previously noted, clustering setups often depend on flooding to deliver packets to all cluster nodes. I would agree this is not efficient. Each client-to-server (cluster-destined) packet has a dmac that never gets learned in the forwarding table, and the switch's ability to deliver high-speed data is limited under these conditions: each packet in the file transfer has to go to the CPU, the "slow path" or "soft path". Traffic taking the slow path is strictly rate-limited to avoid overrunning critical resources, which makes it a pretty good candidate for the root cause of the sickly forwarding performance you're seeing.
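To make the flooding mechanics concrete, here is a toy Python model of L2 MAC learning (an illustration only, not any vendor's implementation; the MAC values are made up, using an NLB-style 02:bf prefix): the cluster MAC never appears as a source address, so every frame addressed to it is flooded.

```python
# Toy model of L2 switch MAC learning: the switch learns source MACs per
# port, and floods frames whose destination MAC was never seen as a source.

class ToySwitch:
    def __init__(self, ports):
        self.ports = set(ports)
        self.cam = {}  # mac -> port, learned from source addresses only

    def forward(self, in_port, smac, dmac):
        self.cam[smac] = in_port          # learn the sender
        if dmac in self.cam:
            return {self.cam[dmac]}       # fast path: single egress port
        # unknown unicast: flood to every other port (the slow/soft path
        # on platforms that punt replication to the CPU)
        return self.ports - {in_port}

sw = ToySwitch(ports=[1, 2, 3, 4])
# A normal host pair converges to unicast forwarding after one exchange:
sw.forward(1, "aa:aa", "bb:bb")                       # flooded, bb:bb unknown
assert sw.forward(2, "bb:bb", "aa:aa") == {1}         # aa:aa learned on port 1
# An NLB cluster MAC is never used as a source, so client frames addressed
# to it flood on every single packet:
assert sw.forward(1, "aa:aa", "02:bf:01:02:03:04") == {2, 3, 4}
```

The point of the sketch is the last assertion: no exchange ever teaches the switch where the cluster MAC lives, so the flooding never converges.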

I'm making several assumptions, and it's obviously not an open-and-shut case.
One way to identify this condition is to keep an eye on the switch CPU. One side effect of slow-path forwarding is elevated CPU. Not wild resource depletion (thanks to the rate limits), but you should see a jump that correlates with the file transfer timing if this is your culprit.

If so, there's configuration tuning in the EOS switch that will improve things. Some of this depends on your setup (the end-to-end traffic path, routing or switching?, how the cluster has been set up, switch type, firmware revision, and so on), but we should be able to get your cluster back on track after a bit of optimizing.

The commands at L2 include the global 'enable' command, followed by calling out each MAC address used by the cluster and associating the VLAN ports where the cluster hosts live. This statically creates a hardware, or fast-path, forwarding entry.

  • Global: set mac unicast-as-multicast enable turns the feature on.
  • VLAN/port/MAC specific: calls out a unicast MAC address, instructing the switch to treat packets matching this dmac as multicast.

Since the unicast MAC is never used as an smac by the cluster, you would get flood behavior even with no configuration, but this config tells the switch to replicate the packet and assigns/scopes the cluster's host ports.

1. Use the set mac multicast command, in any command mode, to specify the MAC address to be treated as a multicast address, along with the VLAN and egress port(s) to use.

2. Use the set mac unicast-as-multicast command, in any command mode, to enable static unicast MAC addresses to be treated as multicast addresses on this device.

The following commands enable the unicast-as-multicast feature on this device and verify it:

System(rw)->set mac unicast-as-multicast enable
System(rw)->show mac unicast-as-multicast
Unicast as multicast: enabled

There's at least an average chance I'm off the mark here.  
If so, we should start the discussion at a different point.  

If the scenario sounds like a fit, and/or you have questions about switch NLB configuration or about routing the traffic (vs. L2 switching), update here so we can take one of our symptoms off the table.

Regards,
Mike

Eddie Brown

Very informative post; I appreciate your time.

For the purpose of this troubleshooting, it would be entirely L2 switching. OSPF is enabled and routing packets between my core switches; however, everything regarding this conversation resides on one VLAN and one switch.

From a Microsoft NLB perspective, the unicast-to-multicast trick for those specific MAC addresses sounds quite promising. However, the hardware switch isn't the one doing the flooding. Will changing those MAC addresses to be identified as multicast at the hardware level then update the virtual level to send multicast requests? I may be a little confused by this functionality.

From an AOS clustering level, I will be monitoring CPU utilization after moving a couple of the servers to different hosts. I will update here once I can say for sure it is affecting CPU. I should mention that I am no AX expert, and I have an expert looking into how the cluster handles its load balancing. He should be getting back to me very soon.

Again thanks for your time. 
I will be in touch.

Erik Auerswald, Embassador

The servers still use unicast MAC addresses, and the switch does not change the Ethernet frame.

The "unicast-as-multicast" feature in S-Series and similar switches changes how the switch treats a unicast MAC address. If the unicast MAC is not found in the CAM table, a second lookup for that MAC address with the multicast bit set to 1 is done. If this multicast MAC address is found in the MAC table, the frame is forwarded out the respective ports. Otherwise it is treated as an unknown unicast and flooded in the VLAN.
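That lookup order can be sketched in a few lines of Python (an illustration of the logic only, not the actual S-Series implementation; the MAC values and port numbers are made up, using the NLB-style 02:bf prefix). The I/G "multicast" bit is the least-significant bit of the first octet of a MAC address.

```python
# Sketch of the unicast-as-multicast lookup order described above:
# unicast hit first, then a second lookup with the multicast bit set,
# then unknown-unicast flooding as the fallback.

def set_multicast_bit(mac: str) -> str:
    """Set the I/G bit (LSB of the first octet) of a colon-separated MAC."""
    octets = mac.split(":")
    octets[0] = f"{int(octets[0], 16) | 0x01:02x}"
    return ":".join(octets)

def lookup(cam: dict, dmac: str):
    """Return the egress port set for dmac, or 'flood' if unknown."""
    if dmac in cam:
        return cam[dmac]              # normal unicast hit
    mcast = set_multicast_bit(dmac)
    if mcast in cam:
        return cam[mcast]             # second lookup, multicast bit set
    return "flood"                    # unknown unicast: flood the VLAN

# Static entry (as configured with 'set mac multicast') for the cluster
# MAC with the multicast bit set, scoped to the cluster host ports:
cam = {"03:bf:c0:a8:00:0a": {5, 6}}
assert set_multicast_bit("02:bf:c0:a8:00:0a") == "03:bf:c0:a8:00:0a"
assert lookup(cam, "02:bf:c0:a8:00:0a") == {5, 6}      # replicated to 5 and 6
assert lookup(cam, "aa:bb:cc:dd:ee:ff") == "flood"     # still unknown unicast
```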

Mike D, Alum

Before I forget - from this location
https://msdn.microsoft.com/en-us/library/bb742455.aspx

check this clip out:
/snip/
Network Load Balancing's unicast mode has the side effect of disabling communication between cluster hosts using the cluster adapters. Since outgoing packets for another cluster host are sent to the same MAC address as the sender, these packets are looped back within the sender by the network stack and never reach the wire. This limitation can be avoided by adding a second network adapter card to each cluster host. In this configuration, Network Load Balancing is bound to the network adapter on the subnet that receives incoming client requests, and the other adapter is typically placed on a separate, local subnet for communication between cluster hosts and with back-end file and database servers. Network Load Balancing only uses the cluster adapter for its heartbeat and remote control traffic.
/snip/

That described behavior sounds like it may be a fit for some of the unicast results you've experienced.
If it sounds right and your verification tests pan out, maybe we can take down another symptom today.


Regarding the traffic we're talking about as multicast:
I normally identify traffic associated with the old Class D address block as "IP multicast". The dmac of IP multicast is derived from the IP multicast group address in the packet. Familiar behaviors such as you've described (join, leave, query) are normally part of this environment.
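For reference, that group-to-dmac mapping (the fixed OUI 01:00:5e plus the low 23 bits of the IPv4 group address, per RFC 1112) can be sketched as:

```python
# Map an IPv4 multicast group address to its Ethernet dmac (RFC 1112):
# 01:00:5e plus the low 23 bits of the group address.

import ipaddress

def mcast_mac(group: str) -> str:
    addr = int(ipaddress.IPv4Address(group))
    low23 = addr & 0x7FFFFF           # keep only the low 23 bits
    return "01:00:5e:%02x:%02x:%02x" % (
        (low23 >> 16) & 0xFF, (low23 >> 8) & 0xFF, low23 & 0xFF)

assert mcast_mac("239.1.2.3") == "01:00:5e:01:02:03"
```

Note that because only 23 of the 28 variable group-address bits survive, 32 different groups share each dmac, which is why this traffic is identified by group address rather than by MAC alone.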

On the NLB side, the IP address (the VIP?) is typically a Class A/B/C unicast address. Like Erik says, the tricky business when configuring the cluster MAC address in the switch is to force the static VLAN/MAC/port relationships into the forwarding behavior so the switch treats the traffic as MAC multicast. The traffic should flood within the port scope you configure.
Another option is to not call out ports in the static switch config. Traffic will then flood to all ports with VLAN egress, which is probably what was happening anyway, so no real loss; the difference is, hopefully, optimized handling of the traffic by the switch after the new config.

About CPU:
You no doubt have benchmarks in your environment that serve as quick health checks for your network.
As a 'tell' for an improperly handled flooding condition, I would normally use the switch CPU as a guide. It's not definitive as a diagnostic, but significant soft-path traffic will normally leave tracks in the switch OS. The show system utilization process table output will include the switch packet processing task. This should be your canary during NLB performance tests, from the switch's perspective.

So maybe we're on the right track. I hope we're able to eventually bring the various behaviors back in, identifying each as a known quantity.

Jeez, I really need to find a way to have this discussion with 95% fewer words; the long story just adds confusion.
Maybe links to documentation would improve usability, and would also improve my technical accuracy.

PS: It's been a long time since I boned up on NLB behavior. Thing is, Microsoft used to include instructions for IP multicast/IGMP support, but I never heard of a network making that sort of config work.
Things may have changed for the better since then.
Yes, the next chapter will have fewer words and more links :)

Regards,

Mike 
(Edited)