Or: A new protocol actually /did/ improve our routing.
Bidirectional Forwarding Detection (BFD) is a protocol that allows detecting faults in links or routes. This is similar to GRE keep-alives, but is actually supported on real routers. Contrary to traditional link-state detection, BFD works on the next-hop IP address; so one can detect failures of some peers that do not affect the link state.
Internet links fail. This is a truism as old as Internet links. When a link fails, traffic gets dropped until the failure is detected and traffic can be re-routed. Detection of failures can be quite tricky however, since they are not always directly visible. Most systems use link state or a form of keep-alives for detection of failures. Link state detection does not help when there are active devices between a router and the other system, such as a switch or long distance links which use MPLS. The in-protocol BGP timers can also be quite long (a common default is 90 seconds) which is a lot of traffic when one are sending 10Gbps or even faster rates.
BFD is a new protocol that exists outside of existing routing protocols, but can communicate the status to all protocols. This allows for a single keep-alive to detect the health of a single link, without having to depend on a keep-alive in each and every protocol being used. As this is part of the “parent” interface, this does not introduce another layer in the network configuration. And since the link-state is only per next-hop IP, one can mix and match BFD and non-BFD neighbours on the same interface. This is extremely useful for routers connected to an Internet Exchange Point, which can have hundreds of peers spread over 10 or more physical locations.
A clever description of this is described in a draft RFC, which introduces automagic configuration of BFD between parties allowing for stronger resilience when there are many potential neighbouring networks without the overhead of manual configuration.
I will be discussing of the implementation of the BFD protocol for OpenBSD, problems discovered in both the protocol and network stack, use cases and production experience.
Slidedeck: PDF