I have a scenario where an Enterasys SSA switch stops IPv4 forwarding (or NAT, or both) and requires a reboot to recover. To be clear up front, this behavior was induced by a misconfigured Windows host that introduced a routing loop, and it has not been observed on any production system or network configuration. The problem is that once the switch is in the non-functioning state, I cannot see any difference in the switch's command-line output between the working and non-working states. That is, the output of show run, show mac, show arp, show interface, show ip nat bindings summary, show spantree portstate, and show spantree lp all looks the same as before the switch stopped forwarding.
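In case it helps anyone reproduce the comparison, here is a minimal Python sketch of how two saved captures of the same show command could be diffed mechanically rather than by eye. It assumes the CLI output has already been saved as text (for example from a terminal session log); the function name and the sample strings are placeholders of mine, not real switch output. Counters and timers in some commands will of course differ between any two captures, so those lines would need to be ignored.

```python
import difflib


def diff_captures(before: str, after: str, label: str = "show output") -> list:
    """Return unified-diff lines between two saved CLI captures.

    An empty list means the two captures are textually identical.
    """
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile=label + " (working)",
            tofile=label + " (not forwarding)",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Placeholder text only, NOT real switch output.
    working = "line A\nline B\nline C"
    broken = "line A\nline X\nline C"
    for line in diff_captures(working, broken, "show arp"):
        print(line)
```

Running it once per command against a "before" and an "after" capture gives a quick confirmation that nothing visible in the CLI has changed.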
The scenario is as follows. I have a lab test network with two SSA switches with similar configurations. Each has two VLANs: VLAN 2 and VLAN 50 on one, and VLAN 2 and VLAN 55 on the other. The two VLAN 2s are completely separate networks with no connection between them. VLAN 50 and VLAN 55 of the SSAs are connected to a router which acts as the gateway for those two subnets (this router is just simulating the infrastructure between two sites to which we will move parts of the setup for demonstration purposes). Both SSAs have spantree stpmode set to none. Each SSA has one static route that sets the router in the middle as the next hop for the route to the other end. No default gateway is configured on either switch. Both have NAT rules (ip nat inside source static x y, where VLAN 2 is the "inside" VLAN and the other VLAN is the "outside" VLAN) to map the addresses of two hosts in VLAN 2 to addresses in VLAN 50 (or 55). On both sides, VLAN 2 happens to have VRRP enabled, and the VRRP IP address is the gateway for devices on VLAN 2. However, high availability is not required for the demonstration, so the partner switch on either side is absent. This is probably not relevant to the problem here and is only stated for completeness. There is no VRRP on the VLAN 50/55 side.
So far so good: a Windows 7 PC on VLAN 2 on one side and the PC on the other side can ping each other, each referencing the NAT address of the PC at the other end. The problem occurred when I tried to capture traffic between one of the SSAs and the router in the middle from a mirror port. I happened to use another NIC on the same VLAN 2 workstation that was pinging the other end. Unknown to me, the workstation had IP routing enabled, and the IPv4 protocol was enabled on the NIC used for capture. So as soon as Wireshark enabled promiscuous mode, the mirrored traffic with the NATed source IP address was pumped back from VLAN 55 to VLAN 2. This introduces a routing loop and causes the ping to report both successes and TTL-expired errors. Each ping starts with a TTL of 128 and is forwarded 64 times by the SSA, with the TTL decremented by 2 on every loop (once by the SSA and once by the PC). After a few seconds, the SSA switch stops forwarding IP packets in either direction. Removing the loop, either by stopping the Wireshark capture or by physically disconnecting the cable from the mirror port, does not bring the switch back to a state where it forwards IP packets. As far as I can see, all ports (except the mirror port) are in the forwarding state.
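The TTL arithmetic above can be checked with a small Python sketch. This is a simplified model of my own, not anything taken from the switch: it assumes each lap of the loop consists of exactly two routing hops (the SSA, then the PC), that each hop decrements the TTL by one, and that a hop drops the packet instead of forwarding it when the decrement reaches zero.

```python
def count_switch_forwards(initial_ttl: int = 128):
    """Model one packet circulating switch -> PC -> switch -> ...

    Returns (number of times the switch forwarded the packet,
             name of the hop where the TTL finally expired).
    """
    ttl = initial_ttl
    forwards = 0
    while True:
        # Hop 1: the switch routes (and NATs) the packet.
        ttl -= 1
        if ttl == 0:
            return forwards, "switch"
        forwards += 1
        # Hop 2: the misconfigured PC routes the mirrored copy back.
        ttl -= 1
        if ttl == 0:
            return forwards, "pc"


if __name__ == "__main__":
    forwards, expired_at = count_switch_forwards(128)
    print(forwards, expired_at)
```

Under these assumptions, a single echo request with an initial TTL of 128 is forwarded by the switch 64 times before the TTL expires at the PC, so every ping is multiplied 64-fold through the switch while the loop is present.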
"clear arp all" did not help, and the only way to recover that I know of is to reboot the switch. What is interesting is that only one of my two SSAs exhibits this behaviour. The other one never stops forwarding packets, even while the routing loop is present, and carries on normally once the loop is removed.
Both SSAs report same hardware and firmware revisions.
SSA Chassis(su)->show version
Copyright (c) 2013 by Enterasys Networks, Inc.
Slot   Model            Serial #             Versions
------ ---------------- -------------------- -------------------------
1      SSA-G8018-0652                        Hw: 2
                                             Bp: 01.03.02
                                             Fw: 08.11.04.0005
I am posting here hoping to get some suggestions on: 1) any CLI commands that may give me more information on why the switch stopped forwarding; 2) ways to recover without rebooting, if any; and 3) a way to tell if the switch gets into such a state during normal operations.
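On point 3, failing anything on the switch itself, the only fallback I can think of is an external reachability probe through the switch. Below is a minimal Python sketch of that idea. It is not an Enterasys feature: the function names and the threshold of three consecutive failures are my own assumptions, and the probe simply calls the operating system's ping command (-n on Windows, -c elsewhere) against a host on the far side of the switch.

```python
import platform
import subprocess
from typing import Iterable, Optional


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Send one echo request with the OS ping command; True on exit code 0.

    Caveat: on Windows, ping can exit with 0 even when the reply is
    "Destination host unreachable", so treat this as a rough probe only.
    """
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def first_outage(results: Iterable[bool], threshold: int = 3) -> Optional[int]:
    """Return the index of the probe at which `threshold` consecutive
    failures were reached, or None if that never happened."""
    consecutive = 0
    for index, ok in enumerate(results):
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= threshold:
            return index
    return None


if __name__ == "__main__":
    # Canned results for illustration; no network traffic is generated here.
    history = [True, True, False, True, False, False, False]
    print(first_outage(history))
```

In practice the history would be built by calling ping_once periodically against the NAT address of the far-end host, and raising an alert when first_outage returns a value. Requiring several consecutive failures avoids alerting on a single lost ping.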
As SSA switches are recommended for use with our products, obviously any potential issue is of interest.
My next step is to identify whether the problem is with one specific device alone. It would be really strange for it to be a hardware issue because apart from this scenario the switch seems to work fine.
Thanks!