Transient Forwarding Anomalies and How to Find Them
Abstract
Analyzing transient reachability violations, which occur while routing protocols are re-converging, helps improve network availability and offer more precise SLAs. The key challenge is to analyze transient violations accurately (they can be short-lived), for all affected destination prefixes, and practically, that is, without degrading the network’s performance. Existing approaches miss at least one of these goals: measurement approaches are accurate, but only for the prefixes they can probe or observe traffic for, while techniques that estimate the convergence time use the same crude proxy for all prefixes. To achieve all three goals, we present TRIX, a system that infers transient violation times for BGP events from logged routing events or collected BGP messages. TRIX’s key insight is that we do not need to probe all destinations if we use available information to infer the router-local forwarding state for all destinations and reconstruct the network-wide violations from that router-level state. However, the logged events contain control-plane information that is inaccurate in both the content and the timing of the forwarding updates, while reconstructing network-wide violations requires reasoning about the flow of traffic through the network. TRIX solves these challenges by simulating the BGP control plane, modeling the FIB-update rate, and combining the state across routers with propagation delays. To evaluate TRIX, we implement a testbed that relies on a programmable switch and uses 12 real routers. Our evaluation shows that TRIX’s inferred reachability violation times are on average within 13–25 ms of the ground truth, and that inference scales to large networks.
BibTeX
@ARTICLE{schmid2025transient,
copyright = {Creative Commons Attribution 4.0 International},
doi = {10.3929/ethz-b-000747785},
year = {2025},
month = jun,
volume = {3},
type = {Journal Article},
institution = {EC},
journal = {Proceedings of the ACM on Networking},
author = {Schmid, Roland and Schneider, Tibor and Fragkouli, Georgia and Vanbever, Laurent},
size = {23 p.},
abstract = {Analyzing transient reachability violations, which occur while routing protocols are re-converging, helps improve network availability and offer more precise SLAs. The key challenge is to analyze transient violations accurately (they can be short-lived), for all affected destination prefixes, and practically, that is, without degrading the network's performance. Existing approaches miss at least one of these goals: measurement approaches are accurate, but only for the prefixes they can probe or observe traffic for, while techniques that estimate the convergence time use the same crude proxy for all prefixes. To achieve all three goals, we present TRIX, a system that infers transient violation times for BGP events from logged routing events or collected BGP messages. TRIX's key insight is that we do not need to probe all destinations if we use available information to infer the router-local forwarding state for all destinations and reconstruct the network-wide violations from that router-level state. However, the logged events contain control-plane information that is inaccurate in both the content and the timing of the forwarding updates, while reconstructing network-wide violations requires reasoning about the flow of traffic through the network. TRIX solves these challenges by simulating the BGP control plane, modeling the FIB-update rate, and combining the state across routers with propagation delays. To evaluate TRIX, we implement a testbed that relies on a programmable switch and uses 12 real routers. Our evaluation shows that TRIX's inferred reachability violation times are on average within 13--25 ms of the ground truth, and that inference scales to large networks.},
issn = {2834-5509},
keywords = {Forwarding anomalies; Routing convergence; Transient violations},
language = {en},
publisher = {Association for Computing Machinery},
number = {CoNEXT2},
title = {Transient Forwarding Anomalies and How to Find Them},
pages = {10}
}
Research Collection: 20.500.11850/747785