Solved

Understanding high "Retry Percentage" values

  • 19 September 2019
  • 6 replies
  • 1423 views

Userlevel 1
Hi all -

I'm writing from a College environment where we have just over 850 WiNG APs. We are currently running WiNG 5.9.5.0-007R. We are ~2 weeks into our semester and we have just recently started receiving many complaints about network speeds/reliability/etc...

While digging into this, I stumbled on some extremely high retry percentages (see two examples below):



My questions are:

  • What is an acceptable retry percentage?
  • Is it reasonable to expect a retry percentage of close to 0 on most (all) devices?
  • What could be causing this? Admittedly, I don't actively look at retry percentages, so I don't know if these are "normal" numbers on my network
  • Could this be a problem with version 5.9.5.0-007R? These complaints are new (we didn't have them last school year or during the summer), and we did upgrade to 5.9.5.0-007R in mid-July.
Thanks!

Max

Best answer by Chris Kelly 19 September 2019, 15:58

Max,

(long response...because there's no easy/short answer to this)

First answer is that 'acceptable' retry rates are generally defined as under 20% for non-critical applications. There is no official value though, just industry accepted norms.

As far as retries being zero, pretty much the ONLY time you'd ever see this is in a thought experiment or a lab environment. In the real world, you're always going to see some number of retries. The goal is to simply minimize them to acceptable values for the applications in use (guest traffic vs VoWLAN for example).

In any case though, anything over 25% being seen on a consistent or sustained basis raises a red flag for me (but this value will naturally fluctuate and it's not unusual to periodically spike). Sometimes there's things you can do about it, sometimes not. Some things will simply be out of your control. All you can do is 'fix' the things that you ARE able to control.
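To make the "sustained vs. periodic spike" distinction concrete, here is a minimal Python sketch. The 20% and 25% figures are the rules of thumb quoted above; the function name, the sample lists, and the choice of "three polls in a row" as the meaning of "sustained" are assumptions made up for illustration, not anything WiNG defines.

```python
from typing import Sequence

ACCEPTABLE_PCT = 20.0  # rule of thumb from the answer above, not an official limit
RED_FLAG_PCT = 25.0    # sustained values above this deserve a closer look


def sustained_red_flag(samples: Sequence[float], min_consecutive: int = 3) -> bool:
    """Return True if at least `min_consecutive` samples in a row exceed RED_FLAG_PCT.

    `min_consecutive` is an arbitrary illustrative choice for what "sustained" means.
    """
    run = 0
    for pct in samples:
        # Extend the current run of high samples, or reset it on a normal sample.
        run = run + 1 if pct > RED_FLAG_PCT else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical retry percentages polled at regular intervals for two clients.
spiky = [8.0, 31.0, 9.0, 12.0, 40.0, 7.0]
sustained = [18.0, 27.0, 33.0, 41.0, 29.0]

print(sustained_red_flag(spiky))      # False: isolated spikes only
print(sustained_red_flag(sustained))  # True: three or more high samples in a row
```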

To analyze this and determine root cause is likely not trivial - with the assumption that it's not WiNG code related, which I doubt it is. Retries simply occur when the sender is not able to detect/see an ACK for the frame that it transmitted (for those frame transmissions that actually require an ACK). This failure to see an ACK can be caused by many things - and that is where the challenge lies.

Looking at these two examples does help to start ruling out some potential causes though.
In both cases, you have extremely high SNR and RSSI, and a very low noise floor. Fantastic. These three things alone would normally dictate a very healthy environment that should allow for maximum data rates to be used.

BUT.....let's take a look at things that can cause retries:

1) Non-802.11 interference (occurring on the same frequencies as your devices, of course) leading to frame corruption which the receiver recognizes. In this case, the original transmitter's frames end up not being received so the receiver obviously never sends an ACK...or the transmitted ACK is corrupted and therefore the ACK is interpreted as never being received. In both cases, retries occur.

2) Frame collisions (Typically caused by a hidden node, OBSS situation, or adjacent channel traffic)

3) Extremely high airtime contention - causing the transmitter of the ACK to have to wait SO long to transmit the ACK that the other device finally gives up and interprets the situation as the other device not ever having received the frame and so it 'retries'.

4) Low transmit power devices - or devices too distant from the AP. In this case, a client device may be operating on the fringe edge of the AP's ability to decode the client's preamble. This would likely lead to many of the ACK frames being received by the AP being too weak and thus not decode-able and therefore the AP thinks the client never ACK'd. This situation is usually in this direction (client to AP) because client devices are MUCH lower powered and have almost no antenna gain - whereas APs can transmit at much higher power levels and have better antennas - overall, much higher EIRP capable. Bottom line, the clients are the weak-link in the chain when it comes to wifi.
  • This is also related to the sticky-client problem where a client associates to an AP initially but then as it moves, it doesn't roam properly to the next AP that is much closer. In that scenario, the client is then having to communicate with an AP that is very distant - leading to the problems just described.
5) This one is user-induced. If you configure a setup such that the BASIC rates are unrealistically high, the ACK frames (which are control frames) will be sent at the configured BASIC rate(s). So if those BASIC rates aren't achievable or just barely are, then the success rates will be low.

6) Client side 802.11 driver issues. It happens.

There are other causes, but they get more corner-case related. These 6 are really the most common and likely.
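Cause 5 is easier to see with a small example. The Python sketch below assumes the usual 802.11 behavior as I understand it: an ACK goes out at the highest BASIC rate that is less than or equal to the rate of the frame being acknowledged. The function name and the rate sets are hypothetical, and the fallback branch is a simplification for illustration only.

```python
from typing import Iterable


def ack_rate_mbps(frame_rate_mbps: float, basic_rates_mbps: Iterable[float]) -> float:
    """Pick the rate an ACK would be sent at for a frame received at `frame_rate_mbps`.

    Assumption: highest BASIC rate <= the data frame's rate. If no BASIC rate
    qualifies, this sketch simply falls back to the lowest BASIC rate.
    """
    basic = sorted(basic_rates_mbps)
    candidates = [r for r in basic if r <= frame_rate_mbps]
    return max(candidates) if candidates else basic[0]


conservative_basic = [6, 12, 24]  # hypothetical BASIC rate set
aggressive_basic = [36, 48]       # hypothetical, unrealistically high BASIC rate set

# A data frame sent at 54 Mbps gets ACKed at very different rates:
print(ack_rate_mbps(54, conservative_basic))  # 24
print(ack_rate_mbps(54, aggressive_basic))    # 48
```

The higher the ACK rate, the better the signal has to be for the ACK to decode, so an aggressive BASIC set means more missed ACKs (and more retries) for clients on the fringe.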

In your examples, the Error rates are zero. This indicates to me that there are no issues with corrupted CRC values. If there were, the frames would be discarded and would be considered an error.

The SNR, RSSI, and noise levels seen on the AP are so good that you're not dealing with a 'distance' related issue (are there APs located in the same room as the devices??) At these levels, the devices shouldn't have any issues with even the highest BASIC rates being set.

What can't be accounted for here is CCC (co-channel contention) related problems - basically meaning high channel utilization - airtime is too busy to allow devices to communicate in a timely manner.
  • Something I cannot tell from the screenshots is if this is for devices operating on 2.4GHz or 5GHz. If it's 2.4GHz, there's a much higher likelihood that high channel utilization is the problem. And unfortunately, there's not much you can do about it, especially if the clients are not 5GHz capable. At the very least, if your deployment is set up well, ensure that you disable all non-OFDM data rates. Do not allow 1, 2, 5.5, 11 Mbps data rates at all. If you have devices that need them, those users will be affected, but that also means that those devices are VERY old (pre-11n). Doing this will help ensure that traffic is moving more quickly and will help free up airtime.
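A rough back-of-the-envelope calculation shows why the lowest rates cost so much airtime. This Python sketch only counts the time to send the payload bits at a given rate; it deliberately ignores preambles, headers, inter-frame spacing, and ACKs, so treat the numbers as a lower bound and not a real airtime model. The 1500-byte payload is an arbitrary example.

```python
def payload_airtime_us(payload_bytes: int, rate_mbps: float) -> float:
    """Microseconds needed to send the payload bits alone at `rate_mbps`.

    1 Mbps is 1 bit per microsecond, so bits / rate gives microseconds.
    """
    return payload_bytes * 8 / rate_mbps


payload = 1500  # bytes, arbitrary example size
for rate in (1, 2, 5.5, 11, 24, 54):
    print(f"{rate:>4} Mbps: {payload_airtime_us(payload, rate):8.1f} us")
```

At 1 Mbps that payload occupies the channel for 12,000 us, versus roughly 222 us at 54 Mbps, and nobody else on the channel can transmit during that time.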
There's also the potential for an OBSS situation, which can't be seen here (or hidden node). Both cases will lead to two or more devices thinking that the airwaves are clear for them to transmit...at the same time. With this, you end up with collisions in the air...leading to frame corruption. But, back to zero Error rate, that wouldn't seem to be the case here.

Are the complaints seen everywhere (or are the very high retry rates seen as occurring all over...or are they relegated to just certain areas?)
Seeing these very high retry rates on any of the devices in the **same** area that are using 5GHz? (THIS would be an interesting answer to see)


Forgot to mention one thing - The retry rates can sometimes be skewed. WiNG reports the values as a percentage but you can have cases where there is VERY little traffic for a device and have normal retries...which ends up, because of the simple math, looking like there's an issue because the values are so high. Where the values are legitimate though is where there's a normal/decent amount of traffic associated with the device. To see if this is maybe the culprit, you'll have to look at the traffic stats for these devices.
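The "simple math" problem looks like this in a minimal Python sketch. It assumes the percentage is retries divided by transmitted frames, which may not be exactly how WiNG computes it. The function names, the frame counts, and the 1000-frame cutoff are made up for illustration.

```python
from typing import Optional


def retry_pct(retries: int, frames: int) -> Optional[float]:
    """Retry percentage, assumed here to be retries / transmitted frames * 100."""
    if frames == 0:
        return None  # no traffic at all, so no meaningful percentage
    return 100.0 * retries / frames


def enough_traffic(frames: int, min_frames: int = 1000) -> bool:
    """Arbitrary illustrative cutoff for trusting the percentage."""
    return frames >= min_frames


# Hypothetical clients: (name, retries, frames transmitted)
clients = [("idle-phone", 6, 10), ("busy-laptop", 1200, 20000)]

for name, retries, frames in clients:
    pct = retry_pct(retries, frames)
    note = "" if enough_traffic(frames) else " (too little traffic to judge)"
    print(f"{name}: {pct:.1f}%{note}")
```

The idle phone shows 60% from just 6 retried frames, while the busy laptop's 6% rests on 20,000 frames and is the number worth trusting.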

6 replies

Userlevel 1
Chris,

Wow! Thank you for such a thorough and thoughtful response!

A few things I can respond with:

We have the following Radio Rates setup in our WLAN profiles:



Because we were told it was better to do it in our AP profiles:





I'm under the assumption that my radio rates are setup to best practice. Can you confirm?

Also, we've been interested in setting up band steering/load balancing for quite some time, but have never quite figured it out. Maybe this is the time to do that to help clear out the 2.4 spectrum? Do I need to set up both load balancing (in the WLAN profile) and an SBC strategy (in the AP profiles):





Thanks again!

Max
Userlevel 6
Max,

Data rates mostly look good. The difference is that you defined them at the actual radio level, whereas you could also define them on a per-SSID basis. No big deal. It just means that the rates you've selected will apply to ANY WLAN you have mapped to the radio(s). I can't tell you if the rates you've selected are best...because it depends on your actual deployment, as well as the actual wireless clients and their capabilities.

But, from a general perspective, yes, they do look good. One change - On the 2.4GHz radio, add 24Mbps as both Basic and Supported. Not sure why that rate isn't being offered at all - maybe a mistake?

The idea is that you want the clients and the APs to be able to use the highest possible rates that they *can* use. This keeps your airtime contention down (at least, the airtime that YOU and your devices contribute to - but obviously can't affect APs and devices that are NOT under your control and are operating in your airspace).

As for the load balancing / band-steering capabilities - IMO, it really depends on the situation. Also, it's a known potential problem that certain wireless clients don't take well to the behaviors that APs use to attempt to influence which band a client should use. Most new(er) devices today have 5GHz capability.
Here's an existing GTAC article on band-steering setup.

You have many options though when setting this up (take a look at Profiles->(AP profile)->Advanced->Client Load Balancing). You can balance the number of users across the APs, across a single AP's radios, across channels, set up ratios, etc. It gets pretty involved, and honestly is too lengthy of a topic to discuss here.
But, before you dive into the deep end on this topic, you need to have a clearly defined goal for what you actually WANT to accomplish. This functionality isn't one that you just enable because you think that it's a good thing to have turned on/best practice. The related features exist because there are situations where it's NEEDED and you're needing to solve a problem.

But yes, you DO need to enable the option in the WLAN(s) themselves (which then provides a couple of very base level options) and then you begin configuring everything in the AP Profile CLB section. But fair warning, if you don't know exactly what you're doing and don't set up these settings properly, you could very likely make things worse.

What I'd want to look at next is the client distributions. What radios (bands) are clients using?
Go to Statistics->(Select an RFDomain in the tree where complaints are coming from)->Inventory
Also found at:
Dashboard->(Select RFDomain)->Inventory Tab

Look at the widget for "Clients by Band".
11ac means only 5GHz
11a means only 5GHz
11an means only 5GHz
The last three entries in the widget mean only 2.4GHz
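If you'd rather get a single number out of the "Clients by Band" widget, the counts can be rolled up by band. In this Python sketch the three 5GHz labels come from the list above; the 2.4GHz labels (11b, 11bg, 11bgn) are my assumption about what the remaining widget entries are called, and the client counts are invented for illustration.

```python
from collections import Counter
from typing import Dict

# 5GHz labels are from the post above; the 2.4GHz labels are assumed names.
BAND_BY_RADIO_TYPE = {
    "11ac": "5GHz",
    "11a": "5GHz",
    "11an": "5GHz",
    "11b": "2.4GHz",
    "11bg": "2.4GHz",
    "11bgn": "2.4GHz",
}


def band_share(clients_by_type: Dict[str, int]) -> Dict[str, float]:
    """Return the percentage of clients on each band."""
    totals: Counter = Counter()
    for radio_type, count in clients_by_type.items():
        totals[BAND_BY_RADIO_TYPE[radio_type]] += count
    all_clients = sum(totals.values())
    return {band: 100.0 * n / all_clients for band, n in totals.items()}


# Hypothetical counts read off the widget for one RF Domain.
counts = {"11ac": 600, "11an": 120, "11a": 30, "11bgn": 200, "11bg": 40, "11b": 10}
print(band_share(counts))  # {'5GHz': 75.0, '2.4GHz': 25.0}
```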

You want to see as many users as possible using the 5GHz band, primarily because 2.4GHz channels are most likely to have extremely high contention levels and interference (both ACI and non-802.11-based). Unfortunately nowadays, 2.4GHz needs to be treated as a best-effort band for wireless devices. Some users will even go to the lengths of creating two identical copies of their WLAN Profile (one applied to 2.4 and the other applied to 5GHz radios) with the only difference being the SSIDs: 'WLAN1-SLOW' and 'WLAN1-FAST'. Users see both and guess which one they want to use? 🙂 But this makes it so that if there ARE any 2.4GHz-only devices out there, they can still connect.

You can also see a different representation of this same info in the widget "Clients by Channel". This one though breaks it down by actual channel number, which isn't as important.

Also, in addition to the Inventory section, look in the Radio->RF Statistics section and select one of the RF-Domains in the tree where you see complaints. Look at the last column for RF Quality Index? See any listed as Poor or Fair Quality?
Userlevel 1
Chris -

My overall goal with SBC would be to help steer devices toward the 5GHz spectrum and lessen the load/utilization of the 2.4GHz spectrum.

Below are some of my residence halls:





And some of my academic buildings:





These are stats from only 6 of our 20 RF Domains, but this shows ~75% of connected devices are on 5GHz and the other ~25% are on 2.4GHz. Perhaps that is fairly well distributed on its own and I don't need to worry about band steering?

Lastly, yes, I see some Poor and Fair quality RF Quality Indexes, but the overwhelming majority are Excellent and Good (and N/A?).



Max
Userlevel 6
So then a good number of devices are already using 5GHz. That's good news.
The question then is....are those remaining 2.4GHz connections because those devices are choosing to use 2.4GHz or because they don't have 5GHz radios? If it's the latter, then there's nothing you can do except try to do your part to make the 2.4GHz networks work as well as possible (proper AP placement, proper power/channel planning, optimized data-rate configurations, etc).

The Poor And Fair rated networks many times are biased by the fact that they are under-utilized (very small numbers of users many times mean very small amounts of traffic - which can sort of unfairly influence how the networks end up getting rated). I can't tell if this is the case with the screenshot though. But begin by taking a look at the entries with the Poor ratings and see if there's a correlation between the lowest ratings and if it's a 2.4 or 5GHz WLAN entry. If all/most of the Poor entries are 2.4GHz networks, then I would think it's safe to chalk it up to the 2.4GHz band just being prone to being bad.

With this being a school, the client device types are pretty much out of your control, unless you have the authority to dictate what devices are allowed to connect (registration process or something like that). In this case, you're going to have to deal with (for likely several more years) devices that are 2.4GHz-only capable. You have the choice then of either trying to accommodate them as best as possible (as mentioned above) or just saying that they're not supported (and disabling the 2.4GHz radios). A third alternative is to somehow disseminate a message indicating that connections made on the 2.4GHz network are 'best-effort' only and that performance problems won't be addressed...or something to that effect. :)

Overall though, the 5GHz networks seen here look pretty healthy. I wouldn't expect to hear that complaints are coming from any of those networks.
Userlevel 1
Chris -

Thank you very much. This was all very helpful!

Max
