Facebook’s Outage May Have Been Caused by a Bug in an Audit Tool
Facebook’s outage lasted roughly six hours on Monday before services were restored worldwide. The cause was a configuration change on the backbone routers that coordinate network traffic between the company’s data centers. The change cascaded through the network and took every Facebook service offline. While the outage traced back to a single mistake, it raises broader questions about the resilience of the company’s entire network.
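The mechanics of that cascade can be sketched in miniature. The following is an illustrative simulation, not Facebook's actual tooling: once backbone routers withdraw the BGP routes covering a prefix, every address inside it becomes unreachable, including the authoritative DNS servers that the rest of the internet needs to find Facebook at all. The prefix and address below are hypothetical stand-ins.

```python
import ipaddress

class RoutingTable:
    """Toy model of prefix reachability under BGP announce/withdraw."""

    def __init__(self):
        self.prefixes = set()

    def announce(self, prefix: str) -> None:
        self.prefixes.add(ipaddress.ip_network(prefix))

    def withdraw(self, prefix: str) -> None:
        self.prefixes.discard(ipaddress.ip_network(prefix))

    def reachable(self, address: str) -> bool:
        ip = ipaddress.ip_address(address)
        return any(ip in p for p in self.prefixes)

table = RoutingTable()
table.announce("198.51.100.0/24")        # hypothetical nameserver prefix
print(table.reachable("198.51.100.12"))  # True: DNS is reachable
table.withdraw("198.51.100.0/24")        # the bad configuration change
print(table.reachable("198.51.100.12"))  # False: DNS (and everything behind it) goes dark
```

The key point the sketch illustrates is that DNS failure was a downstream symptom: the nameservers were fine, but no route on the internet led to them.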
Bug in audit tool
It appears that a bug in an audit tool contributed to the Facebook outage. Facebook has stated that this tool, which is supposed to vet commands like the one in question before they run, failed to stop the faulty command because of a bug. Identifying that root cause and fixing it matters, because Facebook’s systems are designed to catch exactly this kind of mistake and trace it to its source as quickly as possible; this time, the safety net itself failed.
The outage began when Facebook engineers issued a command during a routine maintenance job, intended to assess backbone capacity, that instead took the company’s data centers around the world offline. The internal audit tool that should have blocked the command failed to do so because of the bug. The outage left employees without access to email, internal tools, and even their work passes; some reports described the Facebook headquarters as being in “meltdown mode.”
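To make the audit-tool failure concrete, here is an entirely hypothetical sketch of a pre-flight command check. Facebook has not published the tool's logic; one plausible failure mode, shown below, is a checker whose pattern list is too narrow, so an equally destructive command phrased differently slips through. All names and patterns here are invented for illustration.

```python
# Hypothetical pre-flight audit check, not Facebook's actual tool.
DANGEROUS_PATTERNS = ("withdraw", "shutdown")

def audit(command: str) -> bool:
    """Return True if the command is judged safe to run."""
    return not any(p in command.lower() for p in DANGEROUS_PATTERNS)

# The check catches the obvious phrasing of a destructive command...
assert audit("withdraw all backbone routes") is False
# ...but the "bug": an alias the pattern list doesn't know about
# sails through, even though its effect is just as destructive.
assert audit("drain-capacity --scope=global") is True
```

A checker like this fails open: any command it does not recognize is treated as safe, which is the opposite of what you want guarding a backbone.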
DDoS attack
The outage affected billions of Facebook users, whose traffic accounts for a substantial chunk of the Internet’s. When Facebook’s authoritative DNS servers became unreachable, the recursive resolvers run by DNS providers and ISPs took a beating. Because DNS is built to be resilient, a resolver that gets no answer tries the next listed nameserver rather than giving up, and clients keep retrying failed lookups; with four nameservers in the list, even a single attempt against each quadruples the DNS query volume.
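The amplification described above is simple arithmetic. The back-of-envelope sketch below assumes a resolver that tries every listed nameserver once before failing; real resolvers often retry each server more than once, and impatient clients then repeat the whole lookup, so the multiplier compounds further.

```python
def queries_per_lookup(nameservers: int, attempts_per_server: int) -> int:
    """Upstream queries a resolver sends for one lookup when every
    authoritative nameserver times out (no answer to cache)."""
    return nameservers * attempts_per_server

healthy = 1  # one query, answered, then served from cache
failed = queries_per_lookup(nameservers=4, attempts_per_server=1)
print(failed // healthy)  # → 4: a quadrupling, before clients retry on top
```

This is why a failing zone is more expensive for resolvers than a healthy one: failure defeats caching and triggers every fallback path at once.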
Although the scale of the outage was unprecedented, early speculation blamed a DDoS attack. Facebook, however, has said the outage was its own mistake rather than malicious traffic, and external monitoring pointed to BGP route withdrawals, not attack traffic. The company did not say exactly how many users were affected, nor, while the incident was ongoing, when the problem would be resolved.
Maintenance mistakes
The outage caused problems for users and employees alike, and while it is still being investigated, the root cause was a straightforward combination of BGP and DNS failures set off by a configuration mistake. Facebook has no way of telling exactly how many people were affected, but services used by billions of people were unreachable for hours. The outage also put employees on the front lines of the incident. The company tries to prevent such incidents as much as possible, but this time its safeguards failed.
The trigger was a simple configuration change: an engineer fat-fingered the wrong command during maintenance, setting off a series of technical failures. The internal auditing tool that should have flagged the change did not catch it, and the faulty command was executed across Facebook’s backbone routers, taking down the company’s entire network and even rendering engineers’ access cards useless.
Lack of multi-location networks
The Facebook outage stemmed from a BGP update gone wrong, and the resulting loss of connectivity prevented remote users from reverting the change. Because Facebook’s servers could no longer reach the internet, even people with network, physical, and logical access struggled to get in. At first it looked like a DNS issue, but later reports established that the DNS failure was a symptom of the underlying network outage.
The response was hampered by the loss of Facebook’s Workplace system, which three-quarters of its employees use. Without it, the technical team at the main data center was cut off from help: they had trouble accessing secure rooms and reverting the faulty configuration on key systems. The outage even brought down the messaging service used by remote-working staff, making it harder to reach the people needed to diagnose the problem.