Environment
— Core: Extreme Networks 7520-48Y-8C running EXOS 33.5.2.118-patch3-1
— Access: Extreme Networks 5520 / 5720 series
— Link: 25G SFP28 fiber between core and access switches
Problem
After any power outage (accidental or planned maintenance), the 25G SFP28 ports on the 7520 core switch that connect to 5520/5720 access switches via fiber go down and never come back up. No CLI command recovers them. The only workaround is physically moving the fiber cable to a different port on the 7520.
We opened a TAC case and received patch 33.5.2.118-patch3-1, but the issue persists. We then reverse-engineered the firmware binary (.xos) and found two critical bugs.
Root cause — what we found inside the firmware
Bug 1: tx_disable asserted on every cold boot
Inside extr/hw-config/Extreme/7520/ports.yaml (extracted from rootfs.xos), all 48 SFP28 ports have this in their module_init_ops:
    module_init_ops:
      - device: port_cpld_1
        register: 0x1
        bitmask: 0x0040   # tx_disable bit
        value: 1          # ← asserts TX_DISABLE on every cold boot

On cold boot EXOS applies these init ops first, asserting tx_disable=1 on all 48 SFP28 ports via the PORT CPLDs. EXOS is supposed to clear this bit during port init. If there is a race condition (remote switch still booting) or EXOS crashes mid-init, the bit stays latched and the port is physically dead. No CLI can recover it because the bit is set at CPLD hardware level.
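To make the latch behaviour concrete, here is a minimal Python sketch of how a bitmask/value init op would act on a register as a read-modify-write. It runs against a simulated integer, not real hardware. The function name `apply_init_op` and the 16-bit register width are assumptions for illustration; only the bitmask 0x0040 and value 1 come from the ports.yaml fragment.

```python
def apply_init_op(reg_value, bitmask, value):
    """Read-modify-write: set the masked bits if value is 1, clear them if 0.

    The 16-bit width is an assumption based on the 4-digit mask in ports.yaml.
    """
    if value:
        return reg_value | bitmask
    return reg_value & ~bitmask & 0xFFFF


TX_DISABLE_MASK = 0x0040  # bit 6, from the ports.yaml fragment

reg = 0x0000                                   # simulated register, laser enabled
reg = apply_init_op(reg, TX_DISABLE_MASK, 1)   # what the cold-boot init op does
print(hex(reg), bool(reg & TX_DISABLE_MASK))   # 0x40 True

reg = apply_init_op(reg, TX_DISABLE_MASK, 0)   # the clear that port init should perform
print(hex(reg), bool(reg & TX_DISABLE_MASK))   # 0x0 False
```

If the second write never happens (race or crash during init), the register keeps the value 0x40 and the transmitter stays off, which would match the symptom described.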
Bug 2: FEC mode not specified (CL74 vs CL91)
The 25G_PORT definition has Fec: true but no explicit FEC type. EXOS negotiates at runtime, causing mismatches between the 7520 and 5520/5720 access switches.
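As a rough way to audit for this, the sketch below compares the FEC mode configured on each end of a link and reports the pairs that disagree. It is a simplified model, not EXOS behaviour: the link labels, the `fec_mismatches` helper, and the sample configuration are all hypothetical. The only fixed facts are that CL74 is BASE-R (FireCode) FEC and CL91 is RS-FEC, and that both ends have to run the same mode for the link to come up.

```python
from enum import Enum


class Fec(Enum):
    OFF = "off"
    CL74 = "cl74"  # BASE-R (FireCode) FEC
    CL91 = "cl91"  # RS-FEC


def fec_mismatches(links):
    """Return the links whose two ends are configured with different FEC modes.

    `links` maps a link label to a (core_fec, access_fec) tuple.
    """
    return {name: ends for name, ends in links.items() if ends[0] != ends[1]}


# Hypothetical configuration, for illustration only.
links = {
    "7520:1 <-> 5520-A": (Fec.CL91, Fec.CL74),
    "7520:2 <-> 5720-B": (Fec.CL91, Fec.CL91),
}

for name, (core, access) in fec_mismatches(links).items():
    print(f"{name}: core={core.value} access={access.value}")
```

In practice the two columns would be filled in from the FEC settings shown on the 7520 and on each access switch.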
Field test results
We changed FEC configuration (CL74 / CL91) explicitly on the affected ports:
— Most 25G SFP28 ports came back up immediately → FEC mismatch confirmed (Bug 2).
— Ports 1 and 5 remained down even after the FEC fix. Both use bitmask 0x0040 (bit 6 of the low byte) for tx_disable in PORT_CPLD_1 registers 0x1 and 0x2 respectively. This specific bit position appears to latch high and does not clear through normal EXOS port enable/disable cycles → Bug 1 confirmed as an independent failure.
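For anyone who can get a raw read-back of the PORT_CPLD_1 registers, a small helper like the one below would show whether the tx_disable bit is still latched on these two ports. This is only a sketch: the port-to-register mapping is taken from the finding above, while the function names and the sample dump values are made up for illustration and are not real hardware data.

```python
# (register, bitmask) per port, from the finding above: ports 1 and 5 both use
# bitmask 0x0040, in PORT_CPLD_1 registers 0x1 and 0x2 respectively.
TX_DISABLE_MAP = {
    1: (0x1, 0x0040),
    5: (0x2, 0x0040),
}


def bit_positions(mask):
    """Return the zero-based positions of the bits set in mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


def tx_disabled_ports(register_dump):
    """Return the ports whose tx_disable bit is set in a {register: value} dump."""
    return sorted(
        port
        for port, (reg, mask) in TX_DISABLE_MAP.items()
        if register_dump.get(reg, 0) & mask
    )


print(bit_positions(0x0040))  # [6]

# Hypothetical read-back values, not real hardware data.
print(tx_disabled_ports({0x1: 0x0040, 0x2: 0x0000}))  # [1]
```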
Questions
1. Has anyone else hit this on 7520 + 55xx / 57xx topologies with 25G fiber after a power cycle?
2. Is there a way to write directly to PORT_CPLD registers from the EXOS CLI to manually clear the tx_disable bit without rebooting? Something like:
    # Looking for something equivalent to:
    debug hal port 1 write cpld tx_disable 0
    # or any undocumented EXOS debug command to clear CPLD bits directly
3. Any experience with FEC auto-negotiation issues between 7520 and 5520/5720? Which FEC mode (CL74 or CL91) should be set on both sides for 25G SR fiber?
4. Is anyone aware of a fixed firmware version that addresses the tx_disable module_init_ops bug in ports.yaml?