1/n OK, let me explain what's going on with the Facebook right now.
2/ First, let's talk "routing". The Internet is a mesh of routers that forward packets. Packets go from source through a series of routers until they reach their destination.
3/ Routers in the core of the Internet need to know the location of every IP address on the Internet. That way, they know which direction to forward a packet so that it reaches its destination.
4/ IPv4 addresses are 32 bits in size, so there are roughly 4 billion possible IPv4 addresses. In theory, Internet routers could track each address separately, but that would take a lot of work. So instead, routers track IP addresses by subnet -- a range of addresses at the same location.
5/ The average subnet is about 4 thousand addresses in size. This means instead of 4 billion entries in a routing table, there are only around a million. Note that subnets vary in size: some are much larger, some are smaller.
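The arithmetic behind "around a million" can be checked in a few lines. This is only a back-of-the-envelope sketch, and it assumes "4 thousand" means exactly 4096 (2**12) addresses per subnet:

```python
# Back-of-the-envelope routing table size.
TOTAL_IPV4 = 2 ** 32   # every possible 32-bit address
AVG_SUBNET = 4096      # assumption: "4 thousand" taken as 2**12

routes = TOTAL_IPV4 // AVG_SUBNET
print(f"{TOTAL_IPV4:,} addresses / {AVG_SUBNET:,} per subnet = {routes:,} routes")
```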
6/ An IP address is thus split into two sections. The bits at the front (on average, the first 20 bits) are the ones that are meaningful to routers. This is the "prefix". The remaining bits are only used once the packet reaches the target subnet.
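Here is a small sketch of that split using Python's standard `ipaddress` module. The address and the 20-bit prefix length are made up for illustration (203.0.113.77 is from a documentation range), and `split_address` is a hypothetical helper, not anything a router actually runs:

```python
import ipaddress

def split_address(addr, prefix_len):
    """Split an IPv4 address into (prefix bits, host bits) as integers."""
    value = int(ipaddress.IPv4Address(addr))
    host_bits = 32 - prefix_len
    prefix = value >> host_bits              # the part routers care about
    host = value & ((1 << host_bits) - 1)    # only used inside the subnet
    return prefix, host

# Hypothetical example address with a 20-bit prefix.
prefix, host = split_address("203.0.113.77", 20)
subnet_start = ipaddress.IPv4Address(prefix << 12)
print(subnet_start, host)
```

The same subnet can be written as `ipaddress.ip_network("203.0.112.0/20")`, which covers 4096 addresses.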
7/ When you own address space, you must advertise your network prefix to your neighboring routers, who in turn announce your route to their neighboring routers, and so on until the entire Internet knows your location.
8/ If you stop announcing the location of your prefix, well, then routers on the Internet stop forwarding packets to your network. They forget about you, and packets sent to you go nowhere.
Don't do that.
9/ Facebook did that.
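To make "packets go nowhere" concrete, here's a toy forwarding table. It is a sketch only: real routers exchange routes over BGP, and `ToyRouter`, the prefix 198.51.100.0/24, and the next hop "peer-A" are all invented for illustration. It just mimics the effect of announcing and then withdrawing a prefix:

```python
import ipaddress

class ToyRouter:
    """Toy forwarding table; not BGP, only the before/after effect."""

    def __init__(self):
        self.routes = {}  # ip_network -> next hop name

    def announce(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def withdraw(self, prefix):
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, addr):
        """Return the next hop for addr via longest-prefix match, or None."""
        ip = ipaddress.ip_address(addr)
        matches = [net for net in self.routes if ip in net]
        if not matches:
            return None  # no route: the packet is dropped
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]

router = ToyRouter()
router.announce("198.51.100.0/24", "peer-A")   # hypothetical prefix and peer
before = router.lookup("198.51.100.10")
router.withdraw("198.51.100.0/24")
after = router.lookup("198.51.100.10")
print(before, after)
```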
10/ Now let's talk about DNS. We generally don't refer to machines on the Internet by their IP address; we refer to them by name, like "facebook.com".
Our apps use DNS underneath to convert names to IP addresses.
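Roughly what an app does underneath, as a minimal sketch: it hands the name to the operating system's resolver and gets addresses back. `resolve` is a hypothetical helper name; `socket.getaddrinfo` is the standard-library call doing the work.

```python
import socket

def resolve(name):
    """Return the sorted IP addresses the system resolver reports for name."""
    infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    # Each entry ends with a sockaddr tuple whose first item is the address.
    return sorted({info[4][0] for info in infos})
```

When the lookup can't be answered, `getaddrinfo` raises `socket.gaierror`, which is presumably what apps calling something like `resolve("facebook.com")` were running into.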
11/ Router announcements (using BGP) and name lookups (using DNS) represent the logical structure of the Internet. When things fail, it's usually DNS and sometimes BGP.
12/ Facebook put its DNS servers inside its own address space (instead of locating them elsewhere). Thus, because of the BGP problem, there's also a DNS problem.
13/ Not that it matters. If the DNS servers were functioning, they'd simply point to the offline address space, so attempts to contact Facebook still wouldn't work for routing reasons. But right now, the unreachability shows up as a DNS failure.
14/ DNS works by contacting intermediate servers called "resolvers". Yours is likely your local ISP's resolver (like 75.75.75.75 for Comcast) or a public resolver like Google's 8.8.8.8 or Cloudflare's 1.1.1.1.
15/ Since a billion people are running multiple apps trying to reach the Facebook every minute, they are now overloading resolvers trying to get an answer. And getting none, since Facebook's servers are down.
16/ A property of DNS is that successful answers are "cached" for a period of time. Once you get an answer, you probably won't ask again for some time. The amount of time is included in the response, such as "this answer is good for another hour".
17/ When lookups fail, there's no good caching of that failure. Indeed, you'll likely want to know the answer as soon as Facebook's servers become available again, rather than waiting hours or days before trying to look up facebook.com again.
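A sketch of why that asymmetry turns into load. This follows the thread's simplification (successes cached for their TTL, failures not cached at all); real resolvers may cache some failures briefly. `TtlCache`, `lookup`, and the `query_upstream` callback are hypothetical names, and 192.0.2.1 is a documentation address:

```python
import time

class TtlCache:
    """Caches successful answers until their time-to-live runs out."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.entries = {}  # name -> (address, expires_at)

    def get(self, name):
        entry = self.entries.get(name)
        if entry is not None and entry[1] > self.clock():
            return entry[0]
        self.entries.pop(name, None)  # missing or expired
        return None

    def put(self, name, address, ttl_seconds):
        self.entries[name] = (address, self.clock() + ttl_seconds)

def lookup(cache, name, query_upstream):
    """query_upstream(name) returns (address, ttl_seconds), or None on failure."""
    cached = cache.get(name)
    if cached is not None:
        return cached
    answer = query_upstream(name)
    if answer is None:
        return None  # failure is not cached, so the next call asks again
    address, ttl = answer
    cache.put(name, address, ttl)
    return address

upstream_calls = []

def working(name):
    upstream_calls.append(name)
    return ("192.0.2.1", 3600)

def broken(name):
    upstream_calls.append(name)
    return None

cache = TtlCache()
for _ in range(3):
    lookup(cache, "up.example", working)    # one upstream query, then cache hits
for _ in range(3):
    lookup(cache, "down.example", broken)   # every attempt goes upstream
print(upstream_calls.count("up.example"), upstream_calls.count("down.example"))
```

Multiply the second number by a billion users with retrying apps and you get the resolver load described here.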
18/ So resolvers are now heavily loaded, meaning that a Facebook failure is causing failures throughout the rest of the Internet as DNS fails.
19/ But here's the FUN part: it's not just Facebook's DNS that's failed, but their internal network as well. Reports are that employees are locked out of their own buildings.
rawstory.com
20/ You can imagine right now that employees are now applying sledgehammers to concrete walls to make a hole so they can get into the server rooms to fix the BGP problem, because their badges don't work.
21/ In theory, fixing this BGP problem should be quick. But when your entire infrastructure depends on itself, there are a lot of impediments to fixing the core problem.
22/ It's easy to point fingers and laugh, but in truth, problems at this scale are huge, and answers only obvious in hindsight.
23/ Cloudflare has a post on this: blog.cloudflare.com
24/ Ah, not sledgehammers, but angle grinders.
25/ Ah, apparently they did have physical security issues, but did not destroy things, so this story is less fun.