We're a bit new to using MLAG and I sense we may have implemented a flawed design, but I'm curious to get feedback on what might have caused the issue we found (and oddly, the resolution).
We have two X670s running as an MLAG core with dual downstream LACP links to all edge switches/stacks in our environment. We have a pair of SonicWALL firewalls connected to the core in high-availability mode; basically a shared MAC advertised by the "active" member. We also have two ISP/WAN VLANs, each containing one interface from each firewall along with that ISP's router.
If you look at the diagram, all was working okay and FW1 was active. That FW had an issue and FW2 took over. When it did, FW2 lost visibility of the ISP1 connection, which in turn caused the firewall to fail over the internet traffic to ISP2 alone (FW2 could not ping ISP1's default gateway, so it moved all traffic to ISP2, which it could see fine).
We got an alert that IPs on ISP1 had become unreachable from the outside, since the firewall is responsible for NAT'ing traffic to those IPs and it could no longer be reached on ISP1. We expected that the traffic would simply traverse the ISC from ISP1 to FW2, but it did not. Oddly, simply resetting (disable/enable) the CORE2 switch port connected to FW2's ISP1 interface restored service. FW2 could now see ISP1 and our external alerts cleared.
The question - why? My expectation was that FW2 would send a GARP when it took over and that this would flood across the ISC and basically shift the location of that MAC in the MLAG FDB. That did not seem to happen, but physically resetting the core-side port of that connection did cause it to happen. Theories? Is this just a bad design?
I would have expected that the firewall MAC address is learned on the correct port after firewall failover, if the now active firewall sends frames with that MAC as source, and the other firewall stopped sending frames with that source MAC. I do not see why that should not work.
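To make that expectation concrete, here is a minimal sketch in Python of the learning behavior described above. It is a toy model, not EXOS code: the `Switch` class, the port names, and the placeholder MAC are all made up for illustration, and the rule that the peer ends up pointing at the ISC is simply the behavior assumed in this thread.

```python
# Toy model of MAC learning on two MLAG peers; not EXOS code.
# Port names and the MAC below are hypothetical placeholders.

class Switch:
    def __init__(self, name):
        self.name = name
        self.fdb = {}  # MAC address -> port where it was last seen

    def learn(self, mac, port):
        """Record (or move) the port on which `mac` was seen as a source."""
        self.fdb[mac] = port


def frame_from_firewall(attached, peer, mac, fw_port):
    """Model a frame with source `mac` arriving on `attached` at `fw_port`.

    Assumption: the attached switch learns the MAC on its local port and
    the peer ends up pointing at the ISC for that MAC.
    """
    attached.learn(mac, fw_port)
    peer.learn(mac, "ISC")


core1, core2 = Switch("CORE1"), Switch("CORE2")
VMAC = "00:00:5e:00:53:01"  # placeholder for the shared HA MAC

# FW1 is active and sends from the shared MAC on its CORE1 port.
frame_from_firewall(core1, core2, VMAC, "fw1-isp1")
before = (core1.fdb[VMAC], core2.fdb[VMAC])

# Failover: FW2 now sends from the same MAC (e.g. a GARP) on its CORE2 port.
frame_from_firewall(core2, core1, VMAC, "fw2-isp1")
after = (core1.fdb[VMAC], core2.fdb[VMAC])

print("before failover:", before)
print("after failover: ", after)
```

In this model the entry only moves when a frame from the new active firewall is actually seen. If that frame never arrives, or is not learned, both tables keep pointing toward FW1, which would match the symptom described.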
The problem in the design is that the ISC needs to provide sufficient bandwidth since about half the traffic to and from the firewall cluster needs to traverse the ISC, if you have additional switches connected to the core via MLAG. But a design like yours should work, I'd say.
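As a rough back-of-the-envelope check of the "about half" estimate, the arithmetic can be sketched in Python. The traffic figures are hypothetical, the 50% share assumes the downstream LAGs hash evenly across both core switches, and the ISC capacity assumes two 10 Gb/s links as mentioned later in this thread.

```python
# Rough estimate only; all traffic figures are hypothetical.

def isc_load_gbps(firewall_traffic_gbps, share_via_peer=0.5):
    """Traffic expected to cross the ISC, assuming `share_via_peer` of the
    firewall traffic lands on the core switch without the active firewall."""
    return firewall_traffic_gbps * share_via_peer


ISC_CAPACITY_GBPS = 2 * 10  # assumption: two 10 Gb/s ISC links

for traffic in (2, 10, 40):  # hypothetical firewall throughput in Gb/s
    load = isc_load_gbps(traffic)
    fits = load <= ISC_CAPACITY_GBPS
    print(f"{traffic} Gb/s of firewall traffic -> {load} Gb/s on ISC, fits: {fits}")
```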
All the above assumes that the firewalls are not connected to MLAG ports. The two firewalls of the firewall cluster are single-homed devices, and the firewall ports on the two switches must not be configured as ports of an MLAG.
Thanks for the feedback, Erik. We're running dual 10Gb ports for the ISC and the firewall ports are NOT configured as MLAG ports on the core. That's why I was confused by the behavior. As you noted, I expected to see the connection switch when the old firewall stopped sending frames with the virtual MAC, and in fact it did, but only after cycling the switch port that the secondary firewall was connected to. I'm thinking if it occurs again I should dump the various forwarding databases to see if I can garner anything else. Not sure if that would have helped or not at this point...
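If the forwarding databases do get dumped next time, comparing a snapshot taken before the port reset with one taken after should show exactly which entries moved. A minimal Python sketch of that comparison follows. It assumes the dumps have already been turned into MAC-to-port dictionaries by hand or by a separate script; the MACs and port names are hypothetical, and nothing here parses real switch output.

```python
# Compare two FDB snapshots (MAC -> port). Sample data is hypothetical.

def fdb_changes(before, after):
    """Return {mac: (old_port, new_port)} for every entry that differs.

    A port of None means the MAC was absent from that snapshot.
    """
    changes = {}
    for mac in sorted(set(before) | set(after)):
        old, new = before.get(mac), after.get(mac)
        if old != new:
            changes[mac] = (old, new)
    return changes


VMAC = "00:00:5e:00:53:01"    # placeholder for the firewall HA MAC
ROUTER = "00:00:5e:00:53:02"  # placeholder for the ISP1 router MAC

core1_before = {VMAC: "fw1-isp1", ROUTER: "isp1-rtr"}  # stale after failover
core1_after = {VMAC: "ISC", ROUTER: "isp1-rtr"}        # after the port reset

for mac, (old, new) in fdb_changes(core1_before, core1_after).items():
    print(f"{mac}: {old} -> {new}")
```

Running the same comparison on both cores would show whether only one side held a stale entry for the shared MAC.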
For ISC ports, conventional Source Address Table / Filtering DataBase Learning is disabled by default and should remain disabled while not in failover mode.
My guess is that the firewall HA failover was basically the same as moving a single-homed device from one MLAG peer to the other. This does not necessarily go well with the FDB, and in your case - again, a guess - Core 1 did not learn that the MAC address of the active firewall is now reachable via the ISC. It did learn it, on the other hand, once the port was reset.
Thanks for the thoughts, Carsten. I don't believe SonicWALL allows for LACP to be used when the firewalls are deployed in an HA pair. If I understand it correctly, it uses their PortShield technology to combine links via LACP, and PortShield has to be disabled in order for HA to be enabled. Kind of a catch-22: you could put an extension switch off the core using LACP (like we do for user switches), but then we lose the HA capabilities of the firewalls, as they're now tied to a single point of failure. Putting two of them downstream (extension sw1 to core 1&2, extension sw2 to core 1&2) would add a lot of expense but might be the only way to accomplish it. Would be nice to simply take advantage of the redundant core/firewall without having to add an entirely new layer in between.