I've been asked to improve the Alarms configuration in XMC for a customer whose network is EXOS-based (currently 22.6).
They need their support team to react ASAP in case of any issues, outages etc. related to the network services and connectivity.
As it's my first serious fun with alarms in XMC, I'd like to ask you for some advice on this. Perhaps some of you already have a best-practice config for the Alarms section?
The customer uses EXOS for a 2-tier network with stacks at the edge and a firewall performing routing, DHCP relay etc. The switches mainly provide PoE, AAA (Policy + RFC 3580 from EAC), telemetry and RSTP. All kinds of typical devices are connected: PCs, APs, phones, servers, cameras, even some physical access control devices.
The customer wishes to have reasonable alarms configured with urgent ones also sent via e-mail.
I've tried to walk through the EMS Messages Catalog, but unfortunately it lists only event types, not all the possible message strings, so it's hard to decide which ones to use as alarm criteria in XMC.
Here's what I came up with. I'd really appreciate it if you helped me sort this out:
- All syslogs with Warning, Error or Critical severity raising an alarm.
- All syslogs with Critical severity sending an e-mail additionally.
- NAC-related alarms for AAA status awareness.
- Particular types of syslogs that raise an alarm with e-mail as an action:
  - temperature: HAL.Sys.ShutDwnTempRangeExcd, HAL.Sys.TempWarning, HAL.Sys.TempCritical, HAL.Sys.FanTrayFail,
  - general HW failures: HAL.Msg.Critical/Error/Warning, POE.Critical/Error/Warning, DM.DsblSlotShutDown, ds.oom (?), ds.pcfg_init_fail (?),
  - STP loop detected: STP.DsblPortLoopDtect,
  - interface errors: DM.SensorAlarmDtect (regarding transceiver operation; Rx/Tx errors don't have any 'excessive rate' logs, do they?),
  - high resource utilization: EPM.cpu (?), or rather stats-based alarms in XMC itself; HAL.Card.HwTblThrshldExcd, HAL.Card.L2L3HwTblThrshldExcd, HAL.Card.AclHwTblThrshldExcd,
  - stack topology errors: HAL.Stacking.Critical/Error/Warning, NM.NodeStateFail,
  - STP errors/events: STP.DsblPortBrdgDtect, STP.InBPDU.DropRxNonSTPPort, STP.System.InitFail/AllocMemFail/InsNodeFail.
- Particular alarm-raising events without e-mail as an action:
  - vlan.msgs.PortLinkFlapActLogEvent (too many of these from some end systems at the moment),
  - vlan.msgs.FldRateOutActLogEvent, with 10 kpps as the threshold, to report an excessive BUM traffic rate.
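
To make the plan above easier to review before entering it in XMC, here is a minimal Python sketch of the same rules as a lookup. It is not XMC configuration and does not use any XMC API; the function name `plan_actions` and the "alarm"/"email" labels are made up for illustration. The event names are copied from the list, with the Critical/Error/Warning shorthand expanded. It also assumes that the "without e-mail" list overrides the Critical-severity e-mail rule, which is my reading of the plan, not something confirmed.

```python
# Sketch only: models the alarm plan from the post, not real XMC config.
ALARM_SEVERITIES = {"Warning", "Error", "Critical"}  # raise an alarm
EMAIL_SEVERITIES = {"Critical"}                      # alarm plus e-mail

# Events that should always raise an alarm with e-mail (from the list above).
EMAIL_EVENTS = {
    "HAL.Sys.ShutDwnTempRangeExcd", "HAL.Sys.TempWarning",
    "HAL.Sys.TempCritical", "HAL.Sys.FanTrayFail",
    "HAL.Msg.Critical", "HAL.Msg.Error", "HAL.Msg.Warning",
    "POE.Critical", "POE.Error", "POE.Warning",
    "DM.DsblSlotShutDown",
    "ds.oom", "ds.pcfg_init_fail",  # both marked (?) in the post
    "STP.DsblPortLoopDtect",
    "DM.SensorAlarmDtect",
    "HAL.Card.HwTblThrshldExcd", "HAL.Card.L2L3HwTblThrshldExcd",
    "HAL.Card.AclHwTblThrshldExcd",
    "HAL.Stacking.Critical", "HAL.Stacking.Error", "HAL.Stacking.Warning",
    "NM.NodeStateFail",
    "STP.DsblPortBrdgDtect", "STP.InBPDU.DropRxNonSTPPort",
    "STP.System.InitFail", "STP.System.AllocMemFail",
    "STP.System.InsNodeFail",
}

# Events that raise an alarm but never send e-mail (assumed to override
# the severity-based e-mail rule).
ALARM_ONLY_EVENTS = {
    "vlan.msgs.PortLinkFlapActLogEvent",
    "vlan.msgs.FldRateOutActLogEvent",
}


def plan_actions(event: str, severity: str) -> set:
    """Return the planned actions for one syslog event (hypothetical helper)."""
    if event in ALARM_ONLY_EVENTS:
        return {"alarm"}
    actions = set()
    if severity in ALARM_SEVERITIES or event in EMAIL_EVENTS:
        actions.add("alarm")
    if severity in EMAIL_SEVERITIES or event in EMAIL_EVENTS:
        actions.add("email")
    return actions


if __name__ == "__main__":
    for ev, sev in [("HAL.Sys.TempWarning", "Warning"),
                    ("vlan.msgs.PortLinkFlapActLogEvent", "Info")]:
        print(ev, sev, sorted(plan_actions(ev, sev)))
```

Laying it out this way mostly helps spot overlaps, e.g. a listed event that would already be caught by the generic severity rules.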
BTW2: For some reason, their 18.104.22.168 XMC only has 'Workflow Dashboard' in the Tasks section. What is wrong here?