
Non-SAN on-prem Aerohive NG hardware requirement. How did you implement?


k_berrien
New Contributor

After moving from Classic to NG in our on-prem environment (about a month ago), we started seeing issues: unable to push out configs, errors in data reporting (number of clients connected, etc.). A call to support turned up an obscure and rarely mentioned hardware requirement for NG on-prem: no SAN support; the appliance must run on a directly connected drive, preferably SSD.

 

My colleagues are skeptical; in our experience this is a highly unusual requirement for a VM environment. How did any of you address this specifically?

 

thanks,


aprice
New Contributor

Hi Kevin.

That's good/interesting to know. While working on other stuff (see below if you're bored) I proactively tuned my settings down to the same values your SE/post suggested, in order to "eliminate" HMVA congestion from our possible root causes. That has really decimated my client stats, which makes them pretty awkward to glance at or troubleshoot with, but I can live without the application data. Once I get the other bits worked out I may try to go back to 10-minute stat intervals, abandon KDDR (I've never needed it), and probably skip application info until the software backend is...fixed.

 

After getting HMVA stabilized, and with some creativity on our Cisco UCS blades to get a local SSD into one of them, we still have some pretty big issues. So while I eliminated this one thanks to this thread, I'm only one or two steps down the path to a fix. The first thing I encountered is that our HTTPS certificate, of all things, destroyed our ability to push configs to the APs. A call to support reached a tech who'd seen that before; he pointed to the cert, deleted it from our VA, and we could push configs again. That, in turn, broke the API usage we rely on to generate Wi-Fi keys...so that's offline for now and people have to write me an email to get a key.

 

Those two problems aside, I'm curious about one aspect of your particular installation: what APs are you using, and what HiveOS do you have on them?

 

I'm still running into massive connectivity and performance problems on our AP230s with 8.4r11. I updated to this build a few weeks ago to try to get better Network 360 analytics for troubleshooting the other connectivity problems, but I think it may have its own problems and I'm trying to sort out what to do. Ironically, 8.4r11 contains a config option to specify a syslog UDP port, which 8.2r6 (our previous version) does not. So I was able to troubleshoot some issues only by running a version that seems to cause those issues. And, it turns out 8.2.x, 10.x, and maybe 8.4.x have a known bug that causes the 5GHz radio to flap if beamforming is enabled...which I discovered accidentally by 1) reviewing my logs and 2) searching for help with said logging. I'm basically in a problem loop right now. It's awesome.

 

Anyway, I'd be curious what your model and OS versions are since those seem to be stable for you with 3x the active APs I've got. Clearly something is still (freshly?) amiss in my setup, and we're only two weeks out from final exams.

 

(Fun side note: I think Elasticsearch in HMVA is configured to keep indexes open in perpetuity and needs manual intervention to delete older data. I can tell you from personal experience, having bricked my little one-node ES cluster this way, that this is a great way to wreck Elasticsearch. Indexes should be closed, or at least moved to a warm state [a newer ES feature], once they're no longer being written to. Left open, they eat memory and add massive time to system startup as every index is initialized. If too many are open and your config isn't tuned, it can actually trip the disk "high water mark" and shut down ALL indexing until you manually clear the block on every impacted index.

 

So Aerohive/Extreme, if you're reading this thread, I really hope your next HMVA release includes Elasticsearch enhancements, a newer ES version, and some index template tuning! Or that this already exists and the VAMS UI just needs some phrasing work to make that clear.)
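For anyone else who wants to poke at this, here's roughly the kind of manual cleanup I mean, sketched in Python against the standard Elasticsearch REST API. Big caveats: it assumes the HMVA exposes ES locally on port 9200 with no auth, which I have not confirmed about the appliance internals, and the 90-day retention window is just a number I made up.

# Rough sketch, not a supported procedure: list indices, clear the
# flood-stage read-only block, and close anything older than a cutoff
# so it stops eating heap and slowing startup.
# Assumes ES is reachable at http://localhost:9200 with no auth.
import requests
from datetime import datetime, timedelta

ES = "http://localhost:9200"
KEEP_DAYS = 90  # hypothetical retention window

# 1) List indices with creation dates (epoch milliseconds) and on-disk size.
indices = requests.get(
    f"{ES}/_cat/indices",
    params={"format": "json", "h": "index,creation.date,store.size"},
).json()

# 2) Clear the read-only-allow-delete block that the flood-stage
#    disk watermark sets on affected indices.
requests.put(
    f"{ES}/_all/_settings",
    json={"index.blocks.read_only_allow_delete": None},
).raise_for_status()

# 3) Close indices older than the retention window (skipping system indices).
cutoff = datetime.utcnow() - timedelta(days=KEEP_DAYS)
for idx in indices:
    created = datetime.utcfromtimestamp(int(idx["creation.date"]) / 1000)
    if created < cutoff and not idx["index"].startswith("."):
        requests.post(f"{ES}/{idx['index']}/_close").raise_for_status()
        print(f"closed {idx['index']} (created {created:%Y-%m-%d}, {idx['store.size']})")

On an ES version new enough for index lifecycle management you'd express the same thing as an ILM policy instead of a script, but the manual route is what I've used to un-wedge my own cluster.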

k_berrien
New Contributor

Alan, I just read your post and figured I'd drop in with an update. Now, many months after disabling the analytics, we continue to be issue-free.

 

However, what was supposed to be a temporary fix now appears to be permanent. We're still running 12.8.3.3-NGVAFEB19, and I've asked, through the same channels and the same Aerohive staff who assisted with our "temporary fix", one simple question: CAN WE UPGRADE? I haven't gotten a response since posing it about a month ago. I get the feeling either this is a major sticky issue (or point of denial) within the company, or the company has changed.

 

We certainly share your disappointment. Every shop has those two or three products they'll rave about when asked, and Aerohive was one of those for us. But before any hardware upgrade (or even as a precaution while resolving our present state), we would likely walk into product demos thinking we almost HAVE to change platforms. Our situation is similar to yours: we've invested in staff and VM infrastructure that far exceeds the requirements of every OTHER product our schools or city needs, and setting that investment aside to re-purchase the same thing as a cloud service isn't in our budget either.

aprice
New Contributor

This seems to be the thread matching the issues we're seeing, so I'm glad to find a group suffering the same thing we are. Though I wish none of us were going through this with what used to be a pretty solid product.

 

We upgraded to 19.5.1.7 and hit a known issue where the tables/indexes aren't properly cleaned out during the upgrade. That was messing with our stats collection, and it was taking out our PPSK services whenever HM stopped servicing much of anything. I could restart all services from the appliance manager or reboot the VA and we'd be back for about a week. Support's official fix was to redeploy the VA from scratch and import a copy of the VHM. I did that and am still experiencing major problems.

 

I just opened my third support ticket related to this as well, and this time the first question was whether we were using a SAN. We are, and I got the "we don't support that." I'm currently pursuing help by asking for information on the exact error we receive when trying to push any kind of config or update ("Could not download the captive web portal file. CWP files abnormal when checking cwp files on AFS."). Even if they won't help fix my VA, they should at least be able to explain the error message, and I can work from there.

 

We're running 200 APs with ~5,000 daily unique clients, also as an academic institution (a small residential four-year college). We're heavy users of PPSK, so when HM goes weird it really causes problems for all of our students. I haven't changed the settings Kevin Berrien posted from their defaults, so that's a possible path to a dirty fix. But I don't find it acceptable that a "local SSD" magically fixes all of this. You don't build a VMware cluster around local SSDs instead of a SAN. We see sub-millisecond latency on HM writes and most reads, much as John Wagrowski's post contends, so it should be just about as fast as a local SATA/SAS SSD. In fact, our SAN analytics show HM is only doing about 100-150 IOPS on average; that's pretty meager compared to the demands of larger applications (even though HM is our biggest-IOPS VM overall).

 

I also run Elasticsearch myself for logging and reporting, ingesting 30,000-40,000 events per minute into a system backed by shared spinning disk. The fact that HM is exploding with 200 devices is nuts, and I'd love to pull some stats out of the HM ES instance to see what's going on in the indexes.
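If I ever get shell access to the VA (a big if; I don't know whether Extreme exposes the embedded ES at all), the first thing I'd run is something like the Python below, again assuming it listens on localhost:9200 without auth:

# Rough sketch: dump per-index document counts and on-disk size,
# largest first, to see where the data is piling up.
# Assumes ES answers on http://localhost:9200 with no auth.
import requests

ES = "http://localhost:9200"
rows = requests.get(
    f"{ES}/_cat/indices",
    params={
        "format": "json",
        "bytes": "mb",
        "h": "index,status,docs.count,store.size",
        "s": "store.size:desc",
    },
).json()

for row in rows[:20]:  # top 20 indices by on-disk size
    print(f"{row['index']:<50} {row['status']:<7} "
          f"{row['docs.count'] or '-':>12} docs  {row['store.size'] or '-':>8} MB")

That alone would tell us whether the bloat is in client stats, application data, or something else entirely.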

 

Once I get a bit farther, our account SE is definitely getting a note with some feedback. If I enjoyed conspiracy theories, I'd say this is a play to end on-prem HM and force everyone to the cloud. Which I'd be happy to do, but we literally can't afford it, so such a move would force me to find a new manufacturer. I've been an Aerohive champion since I selected it in 2012/2013, but this HMNG debacle has me on the ropes. I find that really quite disappointing.

wagrowski
New Contributor

I just received this requirement from Aerohive support after opening a third ticket on a similar issue, where HiveManager stops receiving data from the APs, all my dashboards show "Data unavailable," and I get a "cannot get required device list" error when trying to pull up APs. A reboot fixes the issue for maybe a month or a little longer. We've been on HiveManager NG since about September of 2015, and I've been having issues since probably version 12.8.1.2. The last upgrade I did, on July 3rd, was a fresh install of 19.5.1.7; everything was fine for about two months, then it happened again about a month ago, but I didn't have time to open a ticket and just rebooted to fix it. Then it happened again yesterday, I opened a ticket today, and got that answer.

 

Of course I went livid. Blaming SAN storage (tiered enterprise SSD with multiple 10Gb links) and suggesting it may cause problems and data corruption struck me as an unacceptable response to the issue we're having. We've been running MSSQL, Exchange, and even Oracle on AIX for years, and none of them have issues with being on a SAN. And despite the big hyperconverged push by the Dells, Nutanixes, etc., I'm sure that even HiveManager Online, hosted on AWS, has some portions sitting on SAN storage somewhere. HiveManager NG worked great for us for about two years and then started having these problems after a certain update, while all these other, probably bigger, applications are working fine. That tells me something broke in HiveManager and they just don't know what.

k_berrien
New Contributor

A month in now, we're averaging 4,000 connections a day on 600 APs and things remain the same as my post above.

 
