MLAG vs Stacking Explained

This question comes up a lot, and I've been guilty of asking it myself. I was recently asked to define the differences between the two, and in the process wrote the following piece. In sharing it I hope it helps answer some of those queries.

Most of this is my own interpretation, and I have drawn on information from other posts, so feel free to correct or append any element.

1        Introduction

This article doesn't go into any detail on the configuration of MLAG or Stacking, as both are already well documented. What it intends to do instead is give an informed explanation of the differences between the two: the pros and cons, and the different approaches to scaling and mixed-protocol designs, so that better design decisions can be made.

2         MLAG

There are descriptions of MLAG (Multi-Chassis Link Aggregation) on the internet where other vendors frame it in terms of features like Virtual Switching or Virtual Chassis Bonding. That is absolutely not the case in the Extreme Networks sense, and the differences between these are detailed further on.

MLAG is the ability of switches to appear as a single switch at layer 2, so that bundles of links in the form of LAGs can be diversely connected to each switch and appear as one. LAGs are typically created North & South i.e. between host and switch, whereas MLAG is created and expanded in an East & West direction.

MLAG itself is not standardised, so you wouldn't necessarily be able to create an MLAG between differing vendors, but once the MLAG is created between switches, the North & South LAG connection is the same regardless of what's attaching, exactly as you would traditionally have seen and implemented.
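To make the relationship concrete, here is a minimal sketch in EXOS-style syntax. The peer name, port numbers and address are invented for illustration, and the ISC VLAN between the peers is assumed to already exist.

```
# On each core: define the MLAG peer, reachable over the ISC
create mlag peer "core-peer"
configure mlag peer "core-peer" ipaddress 10.0.0.2 vr VR-Default

# Dual-homed attachment: configuring the same MLAG id on both peers
# makes the two diverse links appear as a single LAG to the attached device
enable sharing 10 grouping 10 algorithm address-based L3
enable mlag port 10 peer "core-peer" id 100
```

The attached host or switch just runs an ordinary LAG/LACP bundle; it has no idea two chassis are on the other end.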

MLAG vs Stacking / Bonding / virtual switching etc. is something that comes up a lot, and to understand what the differences are it helps to understand a little what’s going on under the hood.

3         Stacking

The term stacking is something you would expect at the edge, but it isn't something you would typically want to associate with the core. To get around this, vendors created stacking-like technologies for the core, such as virtual chassis bonding, but these are essentially doing the same thing! The difference is that some additional measures are used to safeguard the integrity of the chassis bond; here are a few:

  • Use of multiple physical links bundled into a channel for the Virtual-Switch Link (VSL)
  • An independent keep-alive link between the switches
  • The LFR (Link Failure Response) protocol

The primary issue with the stacking type approach is the virtual device partitioning, and how it handles layer 3 problems. In essence the architecture takes X number of boxes and turns off all but one brain (control plane). The switching subsystems are then distributed, essentially adding complexity. With one control plane it becomes possible to implement link aggregation across the multiple switches: the primary control plane receives all control packets and directly controls the switching fabric for the whole system.

In summary, it's having this single control plane that is often in contention when looking at stacking / bonding type approaches at the core, whereas with MLAG the control, management and data planes are all separate.

There are, though, pros and cons to using stacking and/or MLAG; indeed, using MLAG and stacking at the same time can even be considered complementary!

4         MLAG vs Stacking Approaches 

So let's start by taking two approaches to the same problem and decide which might be better: two geographically separate sites, each requiring a pair of core switches for resiliency.

The first is using MLAG: 

  • The design entails four cores, with a pair in each location MLAG’ed together. The geo-separate locations are shown by the dotted red line.
  • All four cores have a common VRRP VRID 100 for all the common VLANs shared between them.
  • Each pair of cores have their own VRRP VRID 10 & 20 for only VLANs in those areas i.e VRID 10 for location A (one core pair) and VRID 20 for location B (the other core pair).
  • Fabric routing mode is enabled on all VLANs (more on this later).
  • OSPF is enabled on all VLANs as passive, except the /30 between each MLAG pair (across the ISL link).
  • OSPF is configured for broadcast and using a /29 address, where each core has its own IP address (Layer 3). At Layer 2 each pair of MLAG’ed cores at each location is joined by a single LAG.
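As a rough sketch of the VRRP layering in the bullets above (EXOS-style syntax; the VLAN names and the priority value are invented, only the VRIDs come from the design):

```
# Common VRID 100 on a VLAN shared by all four cores
create vrrp vlan v-shared vrid 100
configure vrrp vlan v-shared vrid 100 priority 200
enable vrrp vlan v-shared vrid 100

# Location-specific VRID 10, configured only on the location A pair
create vrrp vlan v-locA vrid 10
enable vrrp vlan v-locA vrid 10
```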


The second approach uses stacking. The design is similar but simpler; that appeal alone might be a draw for many, but the pros and cons will assist in the overall decision.

As you can see it looks almost the same from a topology perspective as the MLAG approach, but is simply tilted round so that the cores forming each stack are split across the geographic locations rather than being in the same place. Like the MLAG approach, you can do inline upgrades one stack at a time without affecting service.


4.1       MLAG Approach

4.1.1        Pros

  • It offers (by a fair margin) the most optimal approach, for the reasons given in the following bullets.
  • Traffic is more evenly distributed to each of the switches through the use of LAG hashing.
  • Each switch is independently able to forward / route traffic without passing it to a master switch (separate control planes)
  • Better control of traffic distribution across links via LAG hashing algorithms
  • Can simply bundle more links into the LAGs to increase bandwidth for North & South
  • Can simply bundle more links into the ISC (LAG) to increase bandwidth for East & West
  • Can offer a more economical approach by using more cost effective links and simply adding more as and when required for LAGs or ISC
  • The approach offers more stability over stacking failures i.e. there are dual management and control planes.
  • This approach seems more logical for switches that are more geographically separate. Although possible with V-Stacking, it's probably not a good idea to have stack members joined so far apart, given that the likelihood of error increases with distance.
  • Ability to upgrade one switch at a time without affecting service.
  • Expand port capacity beyond the limits of stacking by simply adding another switch East or West, creating another MLAG (additional ISL) to the new switch.
  • This is not official, but with fully resilient connections, fabric routing mode, VRRP and OSPF, I have been able to disconnect all the connections (ISL) between both cores and have the network continue running. This couldn't be maintained long term, but I'm adding it as a pro, as a differentiating factor from stacking.

4.1.2        Cons

  • More complex to configure.
  • Each switch is configured / maintained individually.
  • Current MLAG doesn't support spanning tree (be aware if connecting to, say, an existing EOS network that might be using MSTP)
  • If using Fabric Routing Mode, ARP learning is still carried out by the VRRP master, which will cause some traffic to traverse the ISC. The traffic on the ISC link is still minimal with all devices dual-connected, but it's added here for reference.
4.2       Stacking Approach

4.2.1        Pros

  • Configuration is a lot simpler and easier to manage
  • Possibly easier to add more ports by adding an additional switch to the stack
  • Possibly OK approach for smaller sites.
  • Makes sense to be used at the edge where the control plane services are not required for the full functioning of the network.
  • Can stack multiple different types of switches together
4.2.2        Cons
  • Limited to number of switches that can be added to the stack or bond
  • Single control and management plane
  • More inter-switch communication, as opposed to the ISC for MLAG.
  • Stack cannot do inline upgrades (HA upgrades possible with bonding)
  • Not able to add more bandwidth to stacking (but you can for bonding).
4.3       Which approach is best?

So in this particular scenario it would seem to make sense to use MLAG. In other cases, though, stacking could be preferred, for the following reasons:

  • Simplicity
  • The additional measures used for stacking / VSB are adequate to compensate for concerns over a single control plane
  • The switches are in close proximity to each other, further reducing geographic-diversity concerns
  • Inter-switch bandwidth is not a concern
  • Bandwidth distribution to switches less of a concern (master / backup)
  • Small site
  • Multiple different types of switches in a stack for different connections
  • Require the use of spanning-tree

So the decision to use one or the other simply comes down to weighing up the options and understanding your network.

5         Additional Information

5.1       Extending MLAG

Here is an example where you have an existing MLAG solution but have run out of ports and need to expand. With MLAG you simply add another MLAG pair in an East & West direction. With stacking you are limited in how far you can expand due to the common control plane: normally 2 switches for bonding, or 8 for stacking, although with stacking you want to stick to 4-5 as best you can.
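As a sketch of that East & West expansion (EXOS-style syntax; all names, ports and addresses are invented), the new pair is MLAG'ed together in the usual way, and its uplink to the existing pair is just another dual-homed LAG:

```
# On each switch of the new pair: peer with its partner, as for the first pair
create mlag peer "core-peer-2"
configure mlag peer "core-peer-2" ipaddress 10.0.1.2 vr VR-Default

# Uplink towards the existing pair presented as one more MLAG port,
# so the existing cores see a single dual-homed LAG arriving
enable sharing 20 grouping 20 algorithm address-based L3
enable mlag port 20 peer "core-peer-2" id 200
```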


5.2       Best Practice OSPF / VRRP / Fabric Routing

Whenever configuring MLAG I always configure OSPF and Fabric Routing mode. OSPF is more relevant if a route might appear on one switch and not another, or if you have multiple routers in the mix as in the example above. It's not always necessary, but I think it's good practice, and configuring it from the start eases any future growth of the network.
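A sketch of the OSPF arrangement described earlier (EXOS-style syntax; the VLAN names are invented for the example):

```
# Advertise the user VLAN subnets without forming adjacencies on them
configure ospf add vlan v-shared area 0.0.0.0 passive

# The /30 VLAN across the ISL stays active so the cores form an adjacency
configure ospf add vlan v-isl area 0.0.0.0
enable ospf
```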

Directly in line with OSPF, I always enable fabric routing mode as another best-practice feature; more detail is given below.

Enabling fabric routing mode provides a kind of active-active approach when used in conjunction with VRRP, and by proxy offers optimal efficiency when used with MLAG. Effectively it shares the VRRP virtual MAC address between the routers so that each router can answer independently to requests sent to the default gateway. The VRRP master will still have to respond to ARP requests, so that traffic could still go over the ISL, but after that the immediate upstream switch / router is able to respond directly.
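In EXOS this is a single per-VRID setting (the VLAN name and VRID here are illustrative):

```
# Allow this router to forward traffic sent to the VRRP virtual MAC
# even when it is a backup, rather than relaying it to the master
configure vrrp vlan v-shared vrid 100 fabric-routing on
```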

5.3       MLAG & Stacking

There are circumstances where stacking and MLAG might make sense when used in conjunction with each other. An example of this is that stacking can combine multiple different switch types, so in the core you might want one switch with all SFP ports and another with copper. What you could do is stack the two different switch types together and then MLAG the resulting stacks together, as per below:

5.4       Inter-Switch Traffic Flow

In relation to the S/K series, below is an example of the traffic flow that might be expected when using stacking / bonding, and as can be observed this is not always optimal. This is in direct contrast to MLAG, where there would be minimal traffic traversing the interlink (ISL).

Fortunately with Extreme Networks VSS there is a configurable option to make this more optimal.

set lacp outportLocalPreference [none | weak | strong | all-local]


None = Do not prefer LAG ports based on chassis
Weak = Use a weak preference towards ports on the local chassis
Strong = Use a strong preference towards ports on the local chassis
All-local = Force all packets onto local chassis ports, if possible

Martin Flammia


Posted 2 months ago


Bin, Employee

Hi Martin,
Thank you so much for this post. Very nice article!

Grosjean, Stephane, Employee

Impressive article. I haven't read it all yet, but this looks very good. One thing to add, from my overview, is that a stack behaves similarly to the VSS with LAG, i.e. a flow may use the stack link to egress a LAG because hashing has no local knowledge. Fortunately, since 21.1 (I think) you have a CLI option to prefer a local port over hashing.