We're fairly new to MLAG, and I suspect we may have implemented a flawed design, but I'd like feedback on what might have caused the issue we hit (and, oddly, why the fix worked).
We have two X670s running as an MLAG core, with dual downstream LACP links to all edge switches/stacks in our environment. We have a pair of SonicWALL firewalls connected to the core in high availability; basically a shared MAC advertised by the "active" member. We also have two ISP/WAN VLANs; each firewall has an interface in each VLAN, alongside that ISP's router.
If you look at the diagram, all was working okay and FW1 was active. FW1 then had an issue and FW2 took over. When it did, FW2 lost visibility of the ISP1 connection, which in turn caused the firewall to fail over internet traffic to ISP2 alone (FW2 could not ping ISP1's default gateway, so it moved all traffic to ISP2, which it could see fine).
We got an alert that IPs on ISP1 had become unreachable from the outside, since the firewall is responsible for NAT'ing traffic to those IPs and the firewall was no longer reachable over ISP1. We expected that the traffic would simply traverse the ISC from ISP1 to FW2, but it did not. Oddly, simply resetting (disable/enable) the CORE2 switch port connected to FW2's ISP1 interface restored service. FW2 could now see ISP1 and our external alerts cleared.
The question - why? My expectation was that FW2 would send a GARP when it took over, and that this would propagate across the ISC and shift the location of the shared MAC in the MLAG FDB. That did not seem to happen, but physically resetting the core-side port of that connection did make it happen. Theories? Is this just a bad design?
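To make that expectation concrete, here is a toy Python model of plain source-MAC learning as seen from CORE2. It is only a sketch of what I assumed would happen, not how EXOS or MLAG actually maintains its FDB. The class `ToyFdb`, the port names, the MAC address, and the VLAN ID are all made up for illustration, and it assumes FW1 hangs off CORE1 so that CORE2 initially sees the shared MAC via the ISC.

```python
class ToyFdb:
    """Toy layer-2 forwarding table: (vlan, mac) -> port. Not EXOS behaviour."""

    def __init__(self):
        self.entries = {}

    def learn(self, vlan, src_mac, port):
        # Source-MAC learning: the most recent frame wins, so a MAC can "move".
        self.entries[(vlan, src_mac)] = port

    def lookup(self, vlan, mac):
        # None means "unknown destination", which a real switch would flood.
        return self.entries.get((vlan, mac))

    def flush_port(self, port):
        # A link-down event drops every entry learned on that port.
        self.entries = {k: p for k, p in self.entries.items() if p != port}


SHARED_MAC = "02:00:00:00:00:01"  # hypothetical shared HA MAC
ISP1_VLAN = 10                    # hypothetical VLAN ID

core2 = ToyFdb()

# FW1 is active (assumed to be attached to CORE1), so CORE2 hears the MAC on the ISC.
core2.learn(ISP1_VLAN, SHARED_MAC, "isc")
before = core2.lookup(ISP1_VLAN, SHARED_MAC)

# FW2 takes over and sends a GARP into CORE2's local port: the entry should move.
core2.learn(ISP1_VLAN, SHARED_MAC, "fw2-isp1")
after = core2.lookup(ISP1_VLAN, SHARED_MAC)

# Disabling the port flushes whatever was learned on it.
core2.flush_port("fw2-isp1")
flushed = core2.lookup(ISP1_VLAN, SHARED_MAC)

print(before, after, flushed)  # isc fw2-isp1 None
```

In this model a single GARP is enough to move the entry, which is why the behaviour we saw surprised me. Note also that in the toy a port flush only removes entries learned on that same port, so it would not by itself clear a stale entry pointing at the ISC.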
Thanks in advance for any thoughts...
BigRic