A Framework To Fast Reroute Traffic Upon Remote Outages

Doctoral Thesis

Abstract

Nowadays, so many services, including critical ones, rely on the Internet that even a few minutes of connectivity disruption make customers unhappy and cause sizeable financial losses for companies. Ensuring that customers are always connected to the Internet is thus a top priority for Internet service providers. However, this is harder than one may think because the Internet is often subject to network outages.
Network outages are a headache for network operators because they are unpredictable, can occur in any of the 70,000 independently operated networks composing the Internet, and can affect users’ connectivity network-wide. Far too often, the only way to restore connectivity upon an outage is to wait until (i) BGP, the glue of the Internet, converges; and (ii) the routers update their forwarding decisions accordingly. Unfortunately, these two processes work on a per-destination basis and are thus inherently slow given the ever-increasing number of destinations in the Internet. It is therefore no surprise that network operators still experience minutes of downtime upon outages.
In this dissertation, we tackle the problem of fast connectivity recovery upon outages occurring in remote networks, without requiring network operators to change the standards, manufacture new devices, or cooperate with each other. The final result of our work is Snap, a framework that network operators can deploy on their routers and that allows them to quickly detect outages and reroute traffic to working alternative paths that comply with the configured routing policies. Snap’s design follows a two-step recipe. First, it uses an outage inference algorithm that builds on new fundamental results and, instead of waiting for the slow control-plane (BGP) notifications, analyzes the fast data-plane signals. Second, it uses a rerouting scheme that allows routers to quickly reroute all the affected traffic to alternative paths circumventing the outage.
Snap’s design takes advantage of the recent advances in network programmability and relies on a hardware-software codesign. To be fast, Snap collects data-plane signals at line rate using programmable switches (e.g., Tofino). The switches then mirror the signals to a controller, which accurately infers remote outages and triggers traffic rerouting. We implemented Snap in P4_16 and Python and show its effectiveness in many real-world situations. Our results indicate that Snap can restore connectivity within only a few seconds, which is much faster than the few minutes often needed by traditional routers.

People

Dr. Thomas Holterbach
PhD student
2016–2021

BibTeX

@PHDTHESIS{holterbach2021framework,
	copyright = {In Copyright - Non-Commercial Use Permitted},
	year = {2021},
	date = {2021-12-03},
	volume = {191},
	type = {Doctoral Thesis},
	journal = {TIK Schriftenreihe},
	author = {Holterbach, Thomas},
	size = {165 p.},
	abstract = {Nowadays, so many services, including critical ones, rely on the Internet that even a few minutes of connectivity disruption make customers unhappy and cause sizeable financial losses for companies. Ensuring that customers are always connected to the Internet is thus a top priority for Internet service providers. However, this is harder than one may think because the Internet is often subject to network outages. Network outages are a headache for network operators because they are unpredictable, can occur in any of the 70,000 independently operated networks composing the Internet, and can affect users’ connectivity network-wide. Far too often, the only way to restore connectivity upon an outage is to wait until (i) BGP, the glue of the Internet, converges; and (ii) the routers update their forwarding decisions accordingly. Unfortunately, these two processes work on a per-destination basis and are thus inherently slow given the ever-increasing number of destinations in the Internet. It is therefore no surprise that network operators still experience minutes of downtime upon outages. In this dissertation, we tackle the problem of fast connectivity recovery upon outages occurring in remote networks, without requiring network operators to change the standards, manufacture new devices, or cooperate with each other. The final result of our work is Snap, a framework that network operators can deploy on their routers and that allows them to quickly detect outages and reroute traffic to working alternative paths that comply with the configured routing policies. Snap’s design follows a two-step recipe. First, it uses an outage inference algorithm that builds on new fundamental results and, instead of waiting for the slow control-plane (BGP) notifications, analyzes the fast data-plane signals. Second, it uses a rerouting scheme that allows routers to quickly reroute all the affected traffic to alternative paths circumventing the outage. Snap’s design takes advantage of the recent advances in network programmability and relies on a hardware-software codesign. To be fast, Snap collects data-plane signals at line rate using programmable switches (e.g., Tofino). The switches then mirror the signals to a controller, which accurately infers remote outages and triggers traffic rerouting. We implemented Snap in P4\_16 and Python and show its effectiveness in many real-world situations. Our results indicate that Snap can restore connectivity within only a few seconds, which is much faster than the few minutes often needed by traditional routers.},
	keywords = {Fast Reroute; Internet Routing},
	language = {en},
	address = {Zurich},
	publisher = {ETH Zurich},
	DOI = {10.3929/ethz-b-000521609},
	title = {A Framework To Fast Reroute Traffic Upon Remote Outages},
	school = {ETH Zurich}
}

Research Collection: 20.500.11850/521609