Yesterday, Facebook users around the world were hit with an unexpected and lengthy service outage that took down all of the company’s apps, including WhatsApp, Instagram, and Messenger.
Since then, Facebook has released two blog posts explaining what happened.
The company published the first post late Monday evening, outlining what caused the major issue.
According to Santosh Janardhan, Facebook’s VP of infrastructure, “the main cause of this outage was a botched configuration update,” with the issues occurring “on the backbone routers that coordinate network traffic across Facebook’s data centers.”
The network traffic disruption not only interrupted services on Facebook-owned apps like WhatsApp,
but it also “impacted many of the internal tools and systems we use in our day-to-day operations, hampering our attempts to quickly diagnose and resolve the problem,” according to Janardhan.
Late this afternoon, Facebook issued a new, more detailed blog post outlining exactly what went wrong.
Janardhan says in it that the “backbone” he mentioned before “is the network Facebook has constructed to connect all our computing facilities,” and that this network also connects all of Facebook’s data centers across the world via physical wires and cables.
These data centers are in charge of storing information, maintaining the platform, and connecting Facebook’s network to the rest of the internet.
“Routers control the data flow between all of these computers, determining where to send all incoming and departing data. Our engineers frequently need to take part of the backbone offline for maintenance, perhaps to repair a fiber line, add extra capacity, or update the software on the router itself, as part of the intensive day-to-day task of maintaining this infrastructure,” Janardhan explained.
However, during a routine maintenance job yesterday, “a command was issued to assess the availability of global backbone capacity,”
but it “unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally” and cutting them off from the internet.
To make matters worse, a bug in the audit tool that is supposed to catch mistakes like this meant the problem went unnoticed.
Two other related pieces of internet architecture were caught up in the failure: Facebook’s Domain Name System (DNS) servers and the Border Gateway Protocol (BGP), which advertises the Facebook DNS to the rest of the internet.
“As a result, our DNS servers became unavailable, even though they were still active. The rest of the internet was unable to locate our servers,” Janardhan added.
“Many of the internal tools we’d normally use to examine and fix outages like this were broken by the total loss of DNS.”
All of that seems complicated, so here’s everything you need to know about DNS, BGP, and what happened at Facebook in layman’s terms.
LET’S DISCUSS DNS AND BGP.
Let’s start with DNS servers and the Border Gateway Protocol (BGP). So, what are they exactly?
The DNS is commonly referred to as the internet’s address book or phonebook.
“What it does is it converts a domain name that is supposed to be human-readable—something like Google.com or Facebook.com—into an IP address, which is a series of numbers,” says Justine Sherry, an assistant professor at Carnegie Mellon University.
“And that’s a lot like your street address. So it’s like 5000 Forbes Avenue vs. Carnegie Mellon University.”
This phonebook feature is significant since it’s the service that takes the human-readable domain name (facebook.com) you type into your search bar and informs the internet how to direct you to the server you want to talk to.
After all, typing the letters facebook.com into a web browser is far easier than remembering and entering numbers.
“Importantly, that phonebook is distributed,” Sherry continues, “so Facebook kinda owns a slice of that phonebook stating ‘we are Facebook.com, and here are our addresses.’”
“When I typed in the URL, I got an error that said NXDOMAIN, and it was the DNS telling me, ‘I don’t know what that domain name is; it doesn’t point to any address for me.’”
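To make the phonebook analogy concrete, here is a minimal sketch (my own illustration, not anything from Facebook’s post) that uses Python’s standard library to do what a browser does behind the scenes: ask the local DNS resolver to translate a domain name into IP addresses, and show the kind of failure that surfaces when the phonebook has no usable answer. The second domain name is made up purely for the example.

```python
import socket

def lookup(domain: str) -> None:
    """Ask the system's DNS resolver to translate a domain name into IP addresses."""
    try:
        # getaddrinfo performs the DNS lookup a browser would do behind the scenes
        results = socket.getaddrinfo(domain, 443, proto=socket.IPPROTO_TCP)
        addresses = sorted({entry[4][0] for entry in results})
        print(f"{domain} -> {', '.join(addresses)}")
    except socket.gaierror as err:
        # Roughly what users saw during the outage: the "phonebook" returns no usable entry
        print(f"{domain}: lookup failed ({err})")

lookup("facebook.com")
lookup("this-domain-does-not-exist.example")  # hypothetical name, fails to resolve
```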
Then there’s BGP (Border Gateway Protocol).
“Think of BGP as the internet’s version of Google Maps,” Sherry explains. “That’s the device that tells you, if I have an address, how do I get there. It’s intended to allow multiple networks from different companies, such as Facebook, Google, Comcast, Sprint, and AT&T, to share their routes.”
According to Ethan Katz-Bassett, an associate professor at Columbia University, the Border Gateway Protocol (BGP) establishes a path for requests for access to reach the Facebook DNS server (it’s called that because it runs at the borders of networks like Facebook and Google).
As a result of the misconfiguration, Facebook’s BGP routers stopped advertising a route to the Facebook DNS servers, so requests for those servers would simply “go missing” at the edge of the sender’s network.
“The [Facebook] infrastructure was structured in such a way that if a router can’t communicate with a data center, the DNS route is withdrawn,” Katz-Bassett writes.
“This may be acceptable behavior when a single router is down, but it disconnects the entire network when multiple routers are down at the same time.”
Sherry compares BGP to the interstate highway system, saying, “This is the glue that holds the roadways of the various states together. Facebook pulled several routes and announced that they didn’t have any routes to get to their phonebook.”
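BGP itself runs between routers rather than in application code, but a toy model can show what withdrawing a route means in practice. The sketch below is purely illustrative (the prefix and path descriptions are invented, not Facebook’s actual announcements): once the entry pointing at Facebook’s network is withdrawn from the routing table, traffic bound for those addresses has nowhere to go.

```python
# Toy illustration of BGP route advertisement and withdrawal (not real BGP).
routing_table = {
    "157.240.0.0/16": "path via Facebook's backbone",  # example prefix, for illustration only
    "8.8.8.0/24": "path via Google",
}

def forward(prefix: str) -> str:
    """Return the route for a destination, or report that traffic is dropped."""
    route = routing_table.get(prefix)
    return route if route else f"no route to {prefix}: packets 'go missing' at the network edge"

print(forward("157.240.0.0/16"))   # reachable while the route is advertised

# The withdrawal: Facebook's routers stopped announcing routes to its own network
routing_table.pop("157.240.0.0/16")

print(forward("157.240.0.0/16"))   # now nothing on the internet knows how to get there
```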
WHY WERE ENGINEERS REQUIRED TO VISIT THE CALIFORNIA DATA CENTER?
During yesterday’s outage, the internet’s mapping system effectively deleted all routes to Facebook,
which meant that not only regular consumers but also Facebook staff were unable to access it (at least, not remotely).
According to Sherry, Facebook likely tied all of its digital badge cards to an internal database hosted on its own servers, and to its DNS, to keep track of who has access to the building.
The card key system also ceased working when their DNS and servers went down.
Engineers usually do not need to be physically near servers when working with them.
They can access and operate on the machines remotely, over the internet.
In this case, though, remote access was gone, so the only way to regain control was to physically go in and plug a monitor into those servers.
According to Facebook, engineers were dispatched to the actual data centers to diagnose and restart the systems.
Janardhan wrote in his blog that “this took time because these facilities are designed with high levels of physical and system security in mind.”
“They are tough to gain access to, and once inside, the hardware and routers are designed to be impossible to modify even if you have physical access to them.”
They reactivated services gradually to avoid power surges or breakdowns.
Before Facebook came back online, many computer scientists observed overloads and backlogs in internet infrastructure.
Cloudflare reported that its DNS service received 30 times more queries than usual.
That’s because if your web browser tries to load Facebook or Instagram and fails, it will try again.
“People were continuously calling the phonebook and asking, ‘Where is Facebook? What happened to Facebook? Where is Facebook?’” Sherry says.
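The query storm Cloudflare saw follows from that retry behavior. Here is a rough back-of-the-envelope sketch with invented numbers (nothing here comes from Cloudflare’s data): if caching normally absorbs most lookups but no answers can be cached during an outage, and every failed lookup is retried a couple of times, the load on public resolvers multiplies quickly.

```python
# Back-of-the-envelope model of DNS retry amplification (illustrative numbers only).
normal_lookups_per_minute = 1_000_000  # hypothetical baseline rate of lookups from clients
retries_per_failed_lookup = 2          # browsers and operating systems retry failed lookups
cache_hit_ratio = 0.9                  # normally most answers are cached and never reach the resolver

# Normal conditions: caching absorbs most lookups before they hit the resolver.
normal_load = normal_lookups_per_minute * (1 - cache_hit_ratio)

# Outage conditions: no answers arrive, so nothing is cached and every attempt is retried.
outage_load = normal_lookups_per_minute * (1 + retries_per_failed_lookup)

print(f"normal resolver load: {normal_load:,.0f} queries/minute")
print(f"outage resolver load: {outage_load:,.0f} queries/minute")
print(f"amplification: roughly {outage_load / normal_load:.0f}x")
```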
THE FUTURE OF FACEBOOK AND THE STILL UNANSWERED QUESTIONS
An issue like this would not have been as pervasive a decade ago, when WhatsApp, Instagram, and Facebook were all stand-alone businesses with their own infrastructure.
“And now they’re all the same product,” Sherry explains. “We witnessed companies that were unable to operate yesterday.”
Many local businesses that used Facebook and Instagram to reach customers could no longer do so.
According to Bloomberg, rival chat apps like Signal, Telegram, and even Twitter attracted millions of users.
“A couple of times a year, we witness these incidents where significant parts of the internet fail,” Sherry says. “Sometimes it’s BGP, other times it’s DNS, and other times it’s some obscure Amazon storage system. Every firm, every business, every organisation now relies on only a handful of companies, a handful of technical products, and when these things fail, they have massive cascade impacts throughout the internet and across other industries.”
For Sherry, the outage’s biggest impact was the loss of WhatsApp, an app she uses to communicate with relatives.
“WhatsApp is a cell phone service in many parts of the world,” she explains.
Still, Sherry notes that several independent services going down simultaneously is unlikely: it’s rare for both Facebook and Twitter to fail at once, or for Google Chat and Facebook Messenger to have technical trouble at the same time.
“However, as more systems centralize and merge, we are becoming increasingly exposed to extremely large-scale disruptions with catastrophic consequences,” she warns.
“We saw cascade failures yesterday because all of Facebook’s services (even to the entry control on their doors) were reliant on a single, centralized system.”
Sherry points out that the technical community has long believed that centralized, combined systems are not the best architecture.
“The safest thing to do is keep things distinct so that if one system breaks, it’s a small, local failure rather than a complete global outage,” she explains.
“As a result, this quest for ‘one entity to handle it all’ renders us more vulnerable to enormous, catastrophic crises that could have been avoided.”