03-19-2024 10:27 AM
x460-G2 stack version 31.7.2.28-patch1-8
Tricky issue to describe: We have two internet connections and use a policy file to route traffic (based on layer 3 VLAN) to the connections for balancing. There are small twice-daily updates to the policy file for 'bed time rules'. Occasionally we need to make significant policy changes, like for maintenance or if one internet connection is faulty. For the second time since this firmware was applied (June 2023), the whole stack got flaky after a major policy file switch. It's like all traffic starts going to the default route instead of the proper local VLAN, which we can see using traceroutes. We can't even access the stack via the network and have to console in. We have to do a soft reboot of the stack. After the reboot everything is fine, so we know the policy file is OK.
Assuming the switch config is OK and the policy file is OK (since everything is good after reboot), what could be happening? Is there something more I can check while the issue is occurring? Are there proper commands to run while applying the policy file (we just do a check policy, then a refresh policy, after replacing the file)? Is there a 'flush' option? Is there an issue with this firmware?
This is production so it's extremely difficult to test (pun intended) and our window for FW updates is limited unless we have good reason to believe there would be a fix.
03-21-2024 05:51 AM
Without more information, my guess is that changing the routes in the hardware tables goes wrong or takes a long time. During that time, traffic is sent to the CPU for forwarding instead of being forwarded in hardware, and that makes the CPU too busy.
Another issue could be a long install of the policy. Do you use refresh policy or uninstall/install? Refresh policy can take significantly longer than running unconfigure access-list and then configuring the same ACL again.
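As a rough sketch of the two approaches, assuming the policy is applied as an ingress ACL on a VLAN (that binding is a guess, as are the placeholder names <policy-name> and <vlan-name>; use whatever your config actually has):

```
# What you do today: validate the new file, then refresh in place
check policy <policy-name>
refresh policy <policy-name>

# Alternative: remove the ACL binding and apply it again
check policy <policy-name>
unconfigure access-list <policy-name> vlan <vlan-name> ingress
configure access-list <policy-name> vlan <vlan-name> ingress

# From the console while the problem is occurring
show cpu-monitoring
show access-list
show iproute
```

Keep in mind that between the unconfigure and the configure no policy is applied, so traffic follows the normal routing table for that moment. If show cpu-monitoring shows the CPU pegged during the problem, that would support the software-forwarding theory.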