Apple Devices Causing intermittent network outages

  • Problem
  • Updated 2 years ago
  • Solved
We have recently started rolling out Apple devices at one of our locations, approximately 1200 iPads. They connect via an Aerohive wireless solution, which has one point of entry to my network at a B5 switch that also does layer 3 for that site to our WAN. Since the iPads started ramping up, we have seen frequent intermittent network outages: devices on the rest of the LAN, in different VLANs, stop communicating and then come back within 5-10 seconds. This happens anywhere from 1-50 minutes during every hour of the day, including overnight when no users are onsite.

After extensive troubleshooting, we still have not identified the cause of the problem, but we believe it may be something with the B5G124-48P2. When the problem happens, I cannot reach my default gateway from one of my switches at the local site; the gateway is the IP address of that VLAN interface configured on the B5. Since it affects all VLANs, we suspect the B5. The routing is all static and RIP is disabled. The system CPU is not excessive while this is happening: it is usually around 26-45%, though I have seen it spike very briefly to 60-80%. I have debug logging enabled and see no logs indicating that system resources are taxed. The only recurring log entry I get is "DHCPRELAY[265701448]: relay_main.c(315) 568089 This is from manager 1 %% Request could not be relayed to Server". I have also checked my DHCP server; it supports all 14 of my sites with no issues anywhere else. I checked the MAC tables and we usually sit at just under 4000 entries, but according to the specs the B5 can support up to 24,000.

I appreciate any help or suggestions.
Thomas Randolph

Posted 3 years ago

Aguilar, William, Employee

It must be AeroHive ... it wouldn't happen with ExtremeWireless :-).  Just kidding.  Ultimately the best thing to do is to work with GTAC but one area you could look at is the effect that Apple Bonjour is having on the network.  Bonjour is a zero-touch discovery protocol used by Apple devices which is great for the home but it doesn't scale in enterprises.  It uses multicast as a discovery mechanism which is very expensive on a wireless network and also puts strain on switches because it has to be processed in software in the CPU (= high CPU utilization).  AH has some controls for Bonjour and one thing you could try to do is block all Bonjour at the APs on one of the sites to see if it addresses the issue.  Or you can try to drop the multicast at the switch before it is processed in the CPU to see if that helps.  Again, the best thing to do is to work with GTAC to isolate the issue but it is something to consider.

Good luck with AH, and keep an eye out for what we are doing with ExtremeWireless next time you are looking to upgrade your Wi-Fi network.

Thanks,


Will 

Christoph

How about port utilisation?
Maybe it's a broadcast or multicast issue?

Thomas Randolph

We have run multiple packet captures with nothing standing out. Port utilization is very low. We were told that the Bonjour Gateway was disabled on the Aerohive side. The thing is that I get reports of their switches going down too. "Down" is relative: they basically cannot reach their default gateway for a short period of time.

Erik Auerswald, Embassador

I have recently seen high switch CPU usage (100% for about half an hour) on EXOS based switches after rebooting connected Aerohive APs. Blocking mDNS from reaching the switch CPU dropped the CPU usage below 100%, but it was still quite high.

Packet captures showed a significant increase in the following three frame / packet types affecting switch CPUs sent by the Aerohive access points:
  1. mDNS requests
  2. Some layer 2 broadcasts probably used for Aerohive AP discovery
  3. Gratuitous ARP replies
Strangely the access points rejected every received mDNS answer with an ICMP Port Unreachable message, but continued sending requests.

The Bonjour gateway was disabled on the Aerohive APs, but the access points generate their own mDNS requests.

It usually takes hours for the switch CPU usage to drop to the normal values observed in the steady state network.

See the GTAC Knowledge article "How can I block mDNS with an ACL using MAC addresses" for info on an ACL to mitigate mDNS impacts on EXOS switches.
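
To make the three frame types listed above concrete, here is a minimal Python sketch that tallies them from capture summaries. It assumes the capture has already been exported into plain dictionaries; the field names (`dst_mac`, `ethertype`, `udp_dst_port`, `arp_sender_ip`, `arp_target_ip`) are hypothetical and not the output format of any particular tool. It relies only on well-known facts: mDNS uses UDP port 5353 with multicast MAC 01:00:5e:00:00:fb (IPv4) or 33:33:00:00:00:fb (IPv6), and a gratuitous ARP carries the same sender and target IP address.

```python
from collections import Counter

MDNS_MACS = {"01:00:5e:00:00:fb", "33:33:00:00:00:fb"}  # IPv4 / IPv6 mDNS groups
BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"
MDNS_PORT = 5353


def classify(frame):
    """Label one frame summary (a dict with a hypothetical schema)."""
    dst = frame["dst_mac"].lower()
    if frame.get("ethertype") == "arp":
        # Gratuitous ARP: sender and target protocol addresses are identical.
        if frame.get("arp_sender_ip") == frame.get("arp_target_ip"):
            return "gratuitous_arp"
        return "arp"
    if dst in MDNS_MACS and frame.get("udp_dst_port") == MDNS_PORT:
        return "mdns"
    if dst == BROADCAST_MAC:
        return "l2_broadcast"
    return "other"


def summarize(frames):
    """Count frames per label."""
    return Counter(classify(f) for f in frames)


# Made-up sample data, for illustration only.
sample = [
    {"dst_mac": "01:00:5E:00:00:FB", "ethertype": "ipv4", "udp_dst_port": 5353},
    {"dst_mac": "ff:ff:ff:ff:ff:ff", "ethertype": "arp",
     "arp_sender_ip": "10.0.0.5", "arp_target_ip": "10.0.0.5"},
    {"dst_mac": "ff:ff:ff:ff:ff:ff", "ethertype": "other"},
    {"dst_mac": "00:11:22:33:44:55", "ethertype": "ipv4", "udp_dst_port": 443},
]

if __name__ == "__main__":
    print(summarize(sample))
```

Comparing these counts between a quiet period and an outage window is one way to see whether any of the three types actually spikes.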

Joseph Burnsworth

Could you tell us what REV of code is on the B5?

Thomas Randolph

Hw:BCM56514 REV 1
Bp:02.02.51
Fw:06.81.05.0003
BuFw:06.81.01.0027
PoE:2_1

Jeremy, Embassador

That's the version of code we are running. It's been rock solid so far.

Joseph Burnsworth

How does the Aerohive handle all of the mDNS from the iPads? It sounds like the multicast is overrunning the switch. Are the APs and the wireless networks on different VLANs from the workstations?

Jeremy, Embassador

Have you checked port counters?  Do a show rmon stats ge.1.9 for example to look at that port.  However, I would take a look at your uplink first.  Also, have you checked spanning-tree?  Run show spantree stats active and show spantree stats.  Do you see any topology changes when this is happening?  If you are using 802.1d and the network has to reconverge because of some spantree event, it could take several seconds up to 40-50 seconds to stabilize.  
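
As a rough check on the reconvergence figures mentioned above, here is a small Python sketch of the legacy 802.1D timing arithmetic. It assumes the standard default timers (max age 20 s, forward delay 15 s); the function names are made up for illustration, and the actual timers on a given switch may differ.

```python
def stp_indirect_failure_seconds(max_age=20, forward_delay=15):
    """Worst case after an indirect failure: stored BPDU information must
    age out (max age), then the port spends one forward delay in listening
    and one in learning before it forwards."""
    return max_age + 2 * forward_delay


def stp_direct_failure_seconds(forward_delay=15):
    """After a direct link failure there is no max-age wait, only the
    listening and learning states."""
    return 2 * forward_delay


if __name__ == "__main__":
    print("indirect failure:", stp_indirect_failure_seconds(), "s")
    print("direct failure:", stp_direct_failure_seconds(), "s")
```

With default timers this gives 50 s and 30 s, so topology changes counted during the outages would point toward spanning tree, while outages that are much shorter than that fit classic 802.1D reconvergence less well.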

Thomas Randolph

The rmon stats are clean: 0 errors or drops. We checked spantree too. We actually disabled it as a test, but that did not change the result.

Jason Parker, Employee

I do agree that you may want to reach out to the GTAC and open a ticket.
I would be glad to assist if needed.

Let's start with a quick question:
Does the AP support pause frames?

To find out from the switch CLI, run the command
show port advertise
and look for pause under the remote section.
If a Yes is seen, then the AP supports pause.

If yes, then run the command
show port flowcontrol port# or show txqmonitor flowcontrol port#
If the ports show rx numbers, then it is more likely that the AP is overwhelmed by the load along with some unwanted traffic it is processing (maybe multicast), or that a small change to the switch config is needed.
GTAC can assist.

An option as a workaround (if the numbers increment) is to run this command:
Warning: the command will drop the link, clear the option to advertise that the port supports pause (default on all ports), and then renegotiate speed/duplex.
clear port advertise port# pause
Now check your connection.

Here is an article that may help you:
https://gtacknowledge.extremenetworks...
Jason Parker

Thomas Randolph

Thanks for the info. The Aerohive APs are not attached to the B5, Aerohive just hands off to us on a copper gig port from their switch. It is a managed wireless solution. I did check the advertise status and it shows no for everything on all the 1000Base-T ports.

Bill Handler

Just to chime in on the information already provided with an additional possibility/question...

Is it possible that there is an IP conflict?  We chased an issue at a customer site for months.  Our situation was slightly different, but had a similar impact.

There was a BYOD device (an iPad) that would come onto the network and keep its DHCP address from the owner's home network. The device would not relinquish that IP for a few cycles, even though it was connected to a wireless network with a different IP space.

As an unfortunate coincidence, the IP that this device would not relinquish happened to be the IP address of the gateway.  As a result, this poisoned the ARP tables on a few devices and caused devices to not be able to reach the gateway... 
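
The kind of conflict described above can be looked for in a packet capture by checking whether any IP address is claimed by more than one MAC address. Below is a minimal Python sketch of that check. It assumes the (IP, MAC) pairs have already been extracted from ARP traffic by some other means; the function name and the sample addresses are made up for illustration.

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Return {ip: [macs]} for every IP seen with more than one MAC.

    observations: iterable of (ip, mac) pairs, for example the sender
    fields of ARP packets pulled from a capture.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Made-up sample: a second device answers for the gateway address.
sample = [
    ("192.168.1.1", "00:1f:45:aa:bb:cc"),   # the real gateway
    ("192.168.1.50", "a4:5e:60:11:22:33"),
    ("192.168.1.1", "A4:5E:60:44:55:66"),   # a client claiming the gateway IP
    ("192.168.1.1", "00:1F:45:AA:BB:CC"),   # same gateway, different letter case
]

if __name__ == "__main__":
    for ip, macs in find_ip_conflicts(sample).items():
        print(f"{ip} is claimed by: {', '.join(macs)}")
```

If the gateway address shows up in the result, the second MAC identifies the device to track down.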

Thomas Randolph

That is definitely a weird one; however, our problem persists through the night with no users onsite. Our users are also not allowed to take iPads offsite. I do have a GTAC case open, but wanted to cover my bases with the community's help. Thanks.

Jason Parker, Employee

Another thought:
What mgbic are you using?

Try
set logging local console enable file enable
set logging default severity 7
Remove and reinsert the mgbic.
Are there any messages in the show logging buffer output?

How about egress?

show vlan portinfo port port#
Also try it on the uplink to the DHCP and RADIUS servers.

Thomas Randolph

The ports we are using are copper Ethernet. No gbics involved

Thomas Randolph

Current update: we found out that for some reason the Layer 3 ARP cache is not holding records. The default ARP cache timeout is 4 hours, yet our network switches show ages of only 0-5 minutes on their ARP entries, and no ARP entry is older than 10 minutes for anything. Also, when an entry drops out of the list, it can take from a few seconds up to an hour to repopulate. The number of entries in the cache is just over 2000, so not even close to the max. It is the same with the MAC address tables. We are waiting for a replacement device from Extreme support to see if that fixes the issue.
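
The observation above (a 4-hour timeout, but no entry older than 10 minutes) can be turned into a simple check. The Python sketch below assumes the entry ages have already been collected as seconds, that age counts from when the entry was last learned, and that the switch has been up longer than the timeout. The 10% threshold and the function name are arbitrary choices for illustration, not anything taken from the switch documentation.

```python
ARP_TIMEOUT_S = 4 * 60 * 60  # the 4-hour timeout quoted in the post


def looks_flushed(ages_s, timeout_s=ARP_TIMEOUT_S, threshold=0.1):
    """Return True if even the oldest entry is far younger than the timeout.

    ages_s: ARP entry ages in seconds.
    threshold: fraction of the timeout below which the oldest entry is
    treated as suspicious (an assumed value).
    """
    if not ages_s:
        return False  # nothing to judge
    return max(ages_s) / timeout_s < threshold


# Made-up ages resembling the report: nothing older than 10 minutes.
sample_ages = [0, 45, 120, 300, 590]

if __name__ == "__main__":
    print("oldest entry:", max(sample_ages), "s of", ARP_TIMEOUT_S, "s")
    print("cache looks flushed:", looks_flushed(sample_ages))
```

Running the same check at regular intervals would also show whether the cache is being cleared at the moments the outages occur.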

Arison Mercado

Hi Thomas,

Any luck on this? I've been battling this for over a year and it seems to get worse and worse. And yes, I do have random clients that connect to the wireless but cannot reach their gateway, and that doesn't just happen on Apple devices.