The FCC released a report today on the December 27, 2018, 911 outage that originated in CenturyLink’s network. The findings are the latest example of the risks involved in what the FCC calls the telecom “tech transition.”
The 911 outage studied in the report was what is known as a “sunny day” outage – one not caused by weather or another natural disaster. Sunny day outages have become increasingly common as the telecom industry modernizes networks to centralize certain network functionality and to rely more heavily on internet protocol communications.
The FCC calls this shift the “tech transition.” And while many aspects of the transition are good, including the ability to operate networks more efficiently, there is a downside: The impact of small problems can be magnified, as occurred during the CenturyLink 911 outage.
CenturyLink 911 Outage
As the FCC explained in a press release, the CenturyLink 911 outage resulted from a fiber network outage that lasted for almost 37 hours.
“As many as 22 million customers across 39 states were affected, including approximately 17 million customers across 29 states who lacked reliable access to 911. At least 886 calls to 911 were not delivered,” the FCC said.
As the FCC explained in the report released today, the outage originated in a switching module that spontaneously generated four malformed management packets.
To complicate matters, the switching vendor had a proprietary management channel enabled by default. The channel is designed to allow for fast automatic rerouting of traffic during a failure by enabling line modules to send packets directly to other connected nodes without receiving network management instructions about how to route traffic. According to the report, CenturyLink had never used the channel, but that didn’t stop the channel from sending the malformed packets through the network.
“The exponentially increasing transmittal of malformed packets resulted in a never-ending feedback loop that consumed processing power in the affected nodes, which in turn disrupted the ability of the nodes to maintain internal synchronization,” the FCC explained in the report. And “[w]ithout this internal synchronization, the nodes’ capacity to route and transmit data failed,” causing multiple outages.
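The growth pattern the FCC describes can be illustrated with a toy model. The sketch below is not the vendor's protocol or CenturyLink's topology; it assumes a hypothetical four-node full mesh in which every node forwards each packet it receives to all of its neighbors. The `hop_limit` parameter is an assumed safeguard added purely for contrast and is not something the article says the equipment had.

```python
# Toy model only: node names, topology, and hop_limit are illustrative assumptions.
FULL_MESH = {n: [p for p in "ABCD" if p != n] for n in "ABCD"}


def simulate_flood(neighbors, seed_node, initial_packets, rounds, hop_limit=None):
    """Return the number of packets in flight after each forwarding round.

    Each packet is represented by its remaining hop budget
    (None means the packet never expires).
    """
    in_flight = {node: [] for node in neighbors}
    in_flight[seed_node] = [hop_limit] * initial_packets
    history = []
    for _ in range(rounds):
        next_flight = {node: [] for node in neighbors}
        for node, packets in in_flight.items():
            for hops in packets:
                if hops is not None:
                    if hops == 0:
                        continue  # hop budget exhausted: drop the packet
                    hops -= 1
                # Forward a copy to every neighbor, as in an unchecked flood.
                for peer in neighbors[node]:
                    next_flight[peer].append(hops)
        in_flight = next_flight
        history.append(sum(len(p) for p in in_flight.values()))
    return history
```

In this model, four seed packets with no expiry become 12, then 36, then 108 packets after three rounds, tripling each time. With an assumed `hop_limit=2`, the same flood dies out after the second round.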
Those outages, in turn, affected several providers that route 911 calls to public safety answering points.
The FCC identified several best practices that it says could have prevented the CenturyLink 911 outage or at least mitigated its effects, including:
- Turning off or disabling system features that are not in use
- Including memory and processor utilization alarms in network monitoring and regularly auditing and evaluating these alarms
- Having standard operating procedures for network repair that address cases where normal network monitoring procedures are inoperable or otherwise unavailable
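The second recommendation, utilization alarms, can be sketched in a few lines. This is a minimal illustration rather than any real monitoring product's API: the `Sample` type, the threshold values and the three-sample "sustained" rule are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    node: str        # hypothetical node identifier
    cpu_pct: float   # processor utilization, 0-100
    mem_pct: float   # memory utilization, 0-100


def utilization_alarms(samples, cpu_limit=85.0, mem_limit=90.0, sustained=3):
    """Return nodes that stayed over a limit for `sustained` consecutive samples.

    Samples are assumed to arrive in time order. Limits are illustrative defaults.
    """
    streaks = {}   # node -> count of consecutive over-limit samples
    alarmed = []
    for s in samples:
        over = s.cpu_pct >= cpu_limit or s.mem_pct >= mem_limit
        streaks[s.node] = (streaks.get(s.node, 0) + 1) if over else 0
        # Fire once, at the moment the streak reaches the threshold.
        if streaks[s.node] == sustained:
            alarmed.append(s.node)
    return alarmed
```

Requiring several consecutive over-limit samples is one way to avoid paging operators for a brief spike while still catching the kind of sustained processor load the report describes.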
Other sunny day 911 outages have been triggered by glitches in software used by a 911 connectivity provider to route calls to public safety answering points, as well as by glitches in networks operated by AT&T, T-Mobile and Verizon.
Virtually every time a 911 outage has occurred, the FCC has issued new recommendations for improving 911 reliability. But as today’s report illustrates, modern telecom networks are so complex that it has been difficult to envision and guard against every possible cause of a 911 outage.