What is going on with BGP

A brief overview of developments in the IETF working groups related to BGP evolution. The view is current as of mid-2023, in the timeframe between IETF meetings 116 and 117, and looks back several years to cover recently published documents. The overview is given from the perspective of protocol mechanics and recommended operational considerations, and is not directly related to implementation aspects of specific platforms – for that you would need to consult your vendors’ documentation. It is not expected that all of the functionality described here will be universally productized, and different vendors will implement their own deviations and extensions as required by the market. This is not an end-to-end overview of BGP; instead it focuses on specific protocol changes, and it is assumed that the reader has a sufficient understanding of the foundations of BGP and its supporting machinery. It is a high-level overview and does not go deep into the specifics; pointers to documents are provided for a more detailed view into the topics under discussion. This part covers the core protocol and mechanisms specific to the IPv4 unicast and IPv6 unicast AFs.

Deprecation of AS path aggregation sets (draft-ietf-idr-deprecate-as-set-confed-set). When aggregating multiple prefixes with different ASNs into a shorter covering prefix, besides other aggregation related path attributes, the ASN identifiers of component prefixes are contained in an unordered structure which has a different type than the ordered ASN sequence of the AS path attribute. The need for such a structure is to avoid possible propagation loops due to missing information on which ASNs the update has traversed previously. However such an approach obfuscates the real origin and prefix lengths of component prefixes and as a result is directly incompatible with the developments in global routing security mechanisms. Also it is yet another influencing aspect into possible attribute packing conflicts due to different interpretations of what an unordered set of ASNs in fact means. Therefore aggregation resulting in generation of ASN sets (AS_SET and AS_CONFED_SET segments in AS path attribute) is deprecated and should not be used. Receipt of an update carrying such segments should be treated as a withdraw due to a recoverable error (RFC7606), and no announcements carrying ASN set segments can be advertised. This does not deprecate the aggregation of component prefixes as such, but only the generation of ASN set segments. Both aggregation within an AS and proxy aggregation can be deployed as designed, and the fact of aggregation is indicated by the AGGREGATOR path attribute carrying the ASN and RID of the node that performed the aggregation – same as before, just omitting the addition of ASN set segment. The loop avoidance is ensured by controlling the advertisement of the resulting covering aggregate prefix – it must not be advertised to any of the origins of component prefixes.

Overall this is not a new concept: RFC6472 recommended against advertising ASN sets but did not change the behaviour of the receiving side. The number of ASN sets seen in the global routing system is small enough (such cases do exist, but they are a clear exception or simply a failure to clean up the configuration) to justify a stricter set of rules that removes the ambiguity in interpreting the prefix origin.
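
The receive-side rule can be illustrated with a minimal sketch. It assumes 4-octet ASN encoding in the AS path attribute and uses the segment type codes from RFC4271 and RFC5065; the function names are hypothetical and not taken from any implementation.

```python
import struct

# AS_PATH segment type codes (RFC 4271 and RFC 5065).
AS_SET, AS_SEQUENCE, AS_CONFED_SEQUENCE, AS_CONFED_SET = 1, 2, 3, 4
DEPRECATED_SEGMENTS = {AS_SET, AS_CONFED_SET}


def parse_as_path(value: bytes) -> list[tuple[int, list[int]]]:
    """Split an AS_PATH attribute value into (segment_type, [asn, ...]) pairs.

    Assumption: every ASN is encoded in 4 octets.
    """
    segments, offset = [], 0
    while offset < len(value):
        seg_type, count = struct.unpack_from("!BB", value, offset)
        offset += 2
        asns = list(struct.unpack_from(f"!{count}I", value, offset))
        offset += 4 * count
        segments.append((seg_type, asns))
    return segments


def classify_update(as_path_value: bytes) -> str:
    """Return 'treat-as-withdraw' if any set segment is present, else 'accept'."""
    for seg_type, _ in parse_as_path(as_path_value):
        if seg_type in DEPRECATED_SEGMENTS:
            return "treat-as-withdraw"
    return "accept"
```

A path consisting only of AS_SEQUENCE segments is accepted, while the same path followed by an AS_SET segment is classified as a withdraw.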

Extended messages (RFC 8654). BGP message size is limited to 4096 octets (PDU size should not be confused with link and packet layer MTU sizes or with transport window size), and that might not be enough in some cases. BGP PDUs do not have any mechanism for fragmentation, and therefore a set of path attributes that does not fit into a message cannot be advertised at all. In addition, attribute packing is an efficient way of speeding up BGP convergence, and the PDU size limitation puts an upper bound on how many NLRIs can be advertised together with a shared set of path attributes. New address families may carry larger NLRI elements and contain more or larger attributes, and therefore the 4K octet limit may not be enough. The BGP message encoding allows for a larger size; the 4096 octet limit is a historical one that is now being lifted. The extended messages mechanism defines a new capability that needs to be configured and exchanged between the peers, and if both sides agree, they can use messages of up to 65535 octets for BGP signalling. The Open message is excluded for backwards compatibility reasons, and a large keepalive message just does not make practical sense. Update, notification, refresh signalling, and potentially other newly defined messages may use this mechanism for exchanging both larger objects and larger numbers of objects. Of particular note is the handling of the notification message – while the overall size of the message may now be larger, so can the size of the notification data, and the message as a whole cannot exceed the negotiated limit of 4096 or 65535 octets.
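
The size rules above reduce to a small decision function. The sketch below is illustrative only; the function and constant names are hypothetical, and the 19 octet header plus two octets of error code and subcode are taken from the base BGP message format.

```python
BGP_HEADER_LENGTH = 19
BGP_MAX_MESSAGE = 4096
BGP_MAX_EXTENDED_MESSAGE = 65535

# BGP message type codes (RFC 4271, RFC 2918).
OPEN, UPDATE, NOTIFICATION, KEEPALIVE, ROUTE_REFRESH = 1, 2, 3, 4, 5


def max_message_length(msg_type: int, local_ext: bool, remote_ext: bool) -> int:
    """Largest message of the given type that may be sent on this session."""
    if msg_type in (OPEN, KEEPALIVE):
        # Open and keepalive never use the extended size.
        return BGP_MAX_MESSAGE
    if local_ext and remote_ext:
        return BGP_MAX_EXTENDED_MESSAGE
    return BGP_MAX_MESSAGE


def max_notification_data(local_ext: bool, remote_ext: bool) -> int:
    """Room left for notification data after header, error code and subcode."""
    limit = max_message_length(NOTIFICATION, local_ext, remote_ext)
    return limit - BGP_HEADER_LENGTH - 2
```

With the capability agreed on both sides, an update may be up to 65535 octets and notification data up to 65514 octets; without it the figures are 4096 and 4075.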

Extended optional parameters (RFC9072). The BGP Open message carries capabilities – a set of parameters that are exchanged between peers for finding out common operational modes and their corresponding parameters. The encoding used historically had a single octet length field both for the total length of all parameters and for the length of each individual parameter, and therefore practically limited the number of capabilities and their parameters that can be exchanged. Given the trends of an increasing number of address families and their configuration parameters, usability features carrying human readable information such as host names and version information, and supported BGP feature indicators, a mechanism that allows for a larger capability size is needed. The overall approach of this extension is to define a new optional parameter type (value 255) that acts as an indicator that a new container format is used for encoding individual optional parameters (a BGP capability is a type of optional parameter). No changes are made to the actual optional parameters carried, it is just a container that is larger. This change is unambiguous to a speaker that does not understand the new encoding – it will result in an error indicating the presence of an unsupported optional parameter. Since the Open message itself remains limited to 4096 octets, this mechanism allows for slightly under 4K octets of usable space for optional open parameters – should be enough for everyone.
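
A rough sketch of the two container formats follows. It assumes the layout as understood from RFC9072: the classic single octet length field and the following type octet are both set to 255, then a two octet total length follows, and each parameter carries a two octet length. The function name and the rule for choosing between the formats are illustrative assumptions.

```python
import struct

EXTENDED_MARKER = 255  # used in both the classic length octet and the type octet


def encode_optional_parameters(params: list[tuple[int, bytes]]) -> bytes:
    """Encode (type, value) pairs as the optional parameters part of an Open.

    The classic format is used when everything fits into single octet
    lengths; otherwise the extended container is used.
    """
    if all(len(value) <= 255 for _, value in params):
        body = b"".join(struct.pack("!BB", t, len(v)) + v for t, v in params)
        if len(body) <= 255:
            return struct.pack("!B", len(body)) + body
    # Extended container: same parameters, two octet length fields.
    body = b"".join(struct.pack("!BH", t, len(v)) + v for t, v in params)
    return struct.pack("!BBH", EXTENDED_MARKER, EXTENDED_MARKER, len(body)) + body
```

A single small capability parameter keeps the classic format, while a 300 octet parameter value switches the whole block to the extended container.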

Dynamic capabilities (draft-ietf-idr-dynamic-cap). Capabilities provide an ability for parametrizing a set of BGP operational parameters during initial session startup. Once negotiated, capabilities and their corresponding parameters stay constant for the duration of the BGP session without any protocol-level ability to adjust it if needed. Dynamic capabilities introduce a two-way capability parameter synchronization mechanism without requiring a session bringdown. This is implemented by means of a new protocol message that carries a list of potentially negotiable capabilities and a request-response type of negotiation between the peers. While this mechanism by itself allows for renegotiation of any BGP capabilities after the initial session establishment, not all of such negotiations would be practical or even technically feasible. Enabling a new address family without bringing down an already established session might be both practical and easy to implement, while disabling an already established address family may result in logical dependency conflicts that would render the remaining address families unusable. Changing already negotiated timer parameters is easy, while enabling functionality such as additional paths may not be technically feasible – those all are implementation dependent aspects and would limit the practical breadth of applicability of dynamic capabilities.

Send hold timer (draft-ietf-idr-bgp-sendholdtimer). BGP transport session termination results in withdrawal of all received updates – which is a practical way of clearing out the state that has become stale. However, transport session liveness is controlled by the operation of the receive side of the connection – a timeout in receiving of any message from the remote peer is treated as transport failure. It is not the case for the transmit side of the connection though, and a scenario where a remote peer happily sends out periodic keepalives but fails to process any incoming messages from the local peer would result in keeping the potentially stale received information on a local peer. The concept of a send hold timer tracks the local transport endpoint activity on the transmit side and if no locally generated messages can be sent for a timer interval, local peer will initiate a session teardown and clear out all the state received from the remote peer.
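
The transmit-side tracking can be sketched as a small timer object. This is an illustration under assumptions, not the procedure from the draft: the class name, the `interval` parameter and the injected `clock` are hypothetical, and "progress" is taken to mean that the transport accepted at least one octet of a locally generated message.

```python
import time


class SendHoldTimer:
    """Tracks how long the local side has been unable to send anything."""

    def __init__(self, interval: float, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last_progress = clock()

    def on_bytes_sent(self, count: int) -> None:
        # Restart the timer whenever the transport accepted some data.
        if count > 0:
            self.last_progress = self.clock()

    def expired(self) -> bool:
        # True means: tear the session down and clear the state
        # received from the remote peer.
        return self.clock() - self.last_progress >= self.interval
```

The session loop would call `on_bytes_sent()` after each successful write and check `expired()` periodically, independently of the receive-side hold timer.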

Optimal route reflection (RFC 9107). Reflectors represent a universally used mechanism for reducing the amount of state to be transferred throughout the AS, their usage patterns are well understood, as well as some limitations. This proposal addresses one of them – the path selection as performed by the reflector is not necessarily an optimal one if that same selection were to be performed by the reflector client itself. If BGP and forwarding topologies are not congruent (which is the case in many reflector deployments), the path selection will be influenced by the metrics relevant and observed by the reflector itself and not by the reflector client. Given that the number of paths to be selected from is limited by a reflector, a client has no means for choosing what would have been a more optimal path from its own perspective. From the perspective of a reflector the differences would lie in the per-client path selection (which may be different due to different policies being used for different clients), and interpretation of IGP metric from the perspective of a particular client and not of a reflector itself. There are no changes required to be performed on the client side for this mechanism to work, while a reflector would need to have a more detailed visibility into the IGP topology for deriving a proper next hop cost for a prefix from a perspective of a reflector client. In addition a path selection process on a reflector would need to be performed individually per client or per group of clients sharing the same reflection policy.
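
The difference between the reflector's and the client's view can be shown with a toy IGP topology. The sketch below only compares IGP cost to the next hop, which is one step of the path selection process; the node names, link costs and function names are made up for illustration.

```python
import heapq


def igp_costs(graph: dict, source: str) -> dict:
    """Shortest path costs from source to every reachable node (Dijkstra)."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neigh, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(neigh, float("inf")):
                dist[neigh] = nd
                heapq.heappush(heap, (nd, neigh))
    return dist


def best_next_hop(graph: dict, vantage_point: str, next_hops: list) -> str:
    """Pick the next hop with the lowest IGP cost as seen from vantage_point."""
    costs = igp_costs(graph, vantage_point)
    return min(next_hops, key=lambda nh: costs.get(nh, float("inf")))


# Hypothetical topology: reflector RR, client C, exit routers E1 and E2.
TOPOLOGY = {
    "RR": {"E1": 1, "E2": 10, "C": 10},
    "E1": {"RR": 1},
    "E2": {"RR": 10, "C": 1},
    "C": {"RR": 10, "E2": 1},
}
```

In this topology `best_next_hop(TOPOLOGY, "RR", ["E1", "E2"])` picks E1, while the same call with `"C"` as the vantage point picks E2, which is the path an optimal reflector would advertise to that client.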

Wide BGP communities (draft-ietf-idr-wide-bgp-communities). Standard communities, extended communities, large communities – haven’t we got enough of different flavours of communities for expressing the policy constructs? It appears that we in fact haven’t. The limiting factor appears to be the ability to express actions and parameters for those actions in a reasonably scalable and extensible way. Standard and large communities form a functionally equivalent pair, with large communities primarily addressing the 32 bit ASN clean signalling; extended communities mostly deal with address families other than global unicast and also lack sufficient flexibility for carrying information elements that have semantic interpretation differing from a single plain 32 bit field.

The underlying base for wide communities is a new BGP path attribute that acts as a container for community sub-objects, and wide communities are the first actual user of this container infrastructure. Container attribute itself is optional transitive, and individual sub-objects may have a finer level of transitivity control, thus allowing for more controllable attribute propagation.

The overall format of wide communities is Community:Source ASN:Target ASN:Parameters, with Community being the actual community value, Source ASN indicating the AS that is originating this community, Target ASN indicating the AS that is supposed to interpret and react to the community, followed by a set of parameters of variable length and format that are interpreted in the context of the community and target ASN namespaces. Wide communities also define a parameterized matching mechanism for indicating whether the community should or should not be acted upon, based on a set of criteria matching or not matching the specific parameters. The actual parameters used for wide community policy include ASN lists, IPv4 and IPv6 prefix lists, uint32 and IEEE 754 fp32 number lists, a neighbour class list, a UTF-8 string, and also a user-defined binary object having a free interpretation.

Extended admin shutdown communication (RFC 9003). RFC8203 defined a mechanism for sending a human-readable message with several Notification Cease subcodes, as a form of “in-band” message channel for BGP session shutdown. The initial message length limit of 128 octets appeared to be too restrictive in the context of multibyte encodings, therefore this extension raises the limit to 255 octets (but not necessarily characters!). The message must be carried in a valid UTF-8 encoding, and it is not for the receiving router to try to make sense of it – it is to be presented to the operator as is.
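
The octets versus characters distinction matters when a message has to be truncated: cutting at the octet limit may split a multibyte character and produce invalid UTF-8. A minimal sketch, assuming the notification data consists of a single length octet followed by the message (the function name is hypothetical):

```python
MAX_SHUTDOWN_OCTETS = 255


def encode_shutdown_message(text: str, limit: int = MAX_SHUTDOWN_OCTETS) -> bytes:
    """Return a length octet followed by the UTF-8 message, truncated safely."""
    data = text.encode("utf-8")
    if len(data) > limit:
        # Cut at the octet limit, then drop a trailing partial character
        # so that the result stays valid UTF-8.
        data = data[:limit].decode("utf-8", errors="ignore").encode("utf-8")
    return bytes([len(data)]) + data
```

A message of 200 two-octet characters is 400 octets long; truncation keeps 127 whole characters (254 octets) instead of 255 octets ending in half a character.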

Cease notification due to BFD session going down (RFC9384). BFD is a universally deployed mechanism for tracking data plane liveness, and BFD session state can also be propagated into control plane protocols for speeding up their reaction to data plane failures. BGP is not an exception to this mechanism, and it just works. When a BGP session goes down for some reason, the peer sends out a Notification message with a Cease error code that provides some information on the reason why the session is going down. This extension defines a subcode indicating that the session was brought down because the underlying BFD session went down. It may well be the case that the remote peer will not be able to receive this notification, but the local peer will retain an indication that the session was brought down due to BFD failure.

BGP graceful restart for notification messages (RFC 8538). BGP graceful restart (RFC4724) allows for retaining the forwarding state while the actual BGP session is restarting – except for when the session was brought down due to a notification. Having an ability to retain the state in case of protocol errors that are recoverable in some way appears to be of value for maintaining the stability and reducing the amount of state that needs to be propagated around. The core of this extension is exactly this – for some error scenarios GR retains the forwarding state while BGP sessions recover after the error that resulted in notification being sent. Not all errors can be treated this way – only those that are of a temporary nature, such as a remote peer running out of resources or transport having intermittent problems.

Long-lived BGP graceful restart (draft-ietf-idr-long-lived-gr). BGP graceful restart functionality (RFC4724) has been around for a while and is widely deployed. There is a limit on the time for which stale routing information can be retained by the BGP speaker, and that limit has an upper bound of 4095 seconds – for the simple reason that the field used for signalling that value is 12 bits wide. While an interval of more than an hour for a remote peer to come back up might seem reasonably long, there are use cases where it would be beneficial to retain stale routing information for a longer time, primarily to avoid propagating withdrawals for the prefixes affected by the GR, at the cost and risk of potentially blackholing traffic destined to those prefixes. Extending the timer value namespace is trivial, and that is precisely what is done by this specification. Keeping stale routing information for extended periods of time might be a bit more dangerous, therefore prefixes covered by the long-lived GR are also marked by a specific community value indicating that they should be treated as a last resort for best path selection, and they can be propagated only to peers that also support long-lived GR.
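
The timer arithmetic and the last-resort treatment can be sketched as follows. The community value 65535:6 for marking stale paths is an assumption based on the well-known community registry, and the path representation, the `rank` function and the helper names are hypothetical.

```python
RESTART_TIME_BITS = 12
MAX_RESTART_TIME = 2**RESTART_TIME_BITS - 1  # 4095 seconds, about 68 minutes

LLGR_STALE = (65535, 6)  # assumed well-known community marking stale paths


def select_best(paths: list, rank):
    """Pick the best path, treating stale-marked paths as a last resort.

    rank is the ordinary preference function (lower is better); a stale
    path wins only when no path without the marker is available.
    """
    return min(paths, key=lambda p: (LLGR_STALE in p["communities"], rank(p)))


def may_propagate(path: dict, peer_supports_llgr: bool) -> bool:
    """Stale-marked paths go only to peers that also support long-lived GR."""
    return peer_supports_llgr or LLGR_STALE not in path["communities"]
```

Even when the stale path has a shorter AS path, a fresh path is preferred; the stale one is used only if nothing else remains.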

Peer roles (RFC9234). BGP does not define a semantic relationship between peers – that is just a session over which prefixes can be advertised. What is the relationship of those peers and whether they in fact are allowed to exchange any routes learned from various sources historically was not tied to the actual BGP session and was left to the domain of policy implementation. This extension defines a set of mechanisms for specifying and validating the roles of BGP peers and the prefixes they are allowed to announce. Peers have roles defined based on their type and place in the network (a customer, a provider, a route server, a generic peer among the others) which get exchanged during the session establishment. If roles disagree the session is not allowed to come up in the first place; if roles allow for session to be established then depending on the actual role routes advertised get an additional path attribute that indicates the class of an advertising peer. Quite a similar mechanism for route leak detection can be implemented by using community tagging – and while the overall logic of peer roles is the same, the implementation is different: community needs to be acted upon by the policy, and if the policy is not configured properly or simply does not exist at all then such method will not work at all. A dedicated path attribute can be advertised and acted upon by the implementation not based on a user configurable policy and be active all the time regardless of the policy configuration. Policy can be applied on top of what is allowed in based on the peer roles, but peer roles have priority. This mechanism is implemented as a capability for session establishment with a set of role pairs that are allowed, and also as an additional path attribute that gets added to advertised routes and can be acted upon by the policy of the receiving peer.
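
The role pairing and the handling of the additional path attribute can be sketched as below. The rules follow the procedures as understood from RFC9234, where the attribute is called Only to Customer (OTC) and carries an AS number; the role strings and function names are illustrative assumptions.

```python
# Role pairs (local, remote) for which the session is allowed to come up.
ALLOWED_ROLE_PAIRS = {
    ("provider", "customer"), ("customer", "provider"),
    ("rs", "rs-client"), ("rs-client", "rs"),
    ("peer", "peer"),
}


def roles_compatible(local_role: str, remote_role: str) -> bool:
    return (local_role, remote_role) in ALLOWED_ROLE_PAIRS


def ingress_check(remote_role: str, remote_asn: int, otc):
    """Return (accepted, otc) for a route received from a neighbour.

    otc is the AS number carried in the attribute, or None if absent.
    """
    if otc is not None and remote_role in ("customer", "rs-client"):
        return False, otc  # leak: marked route coming back up
    if otc is not None and remote_role == "peer" and otc != remote_asn:
        return False, otc  # leak: marked by someone other than the peer
    if otc is None and remote_role in ("provider", "peer", "rs"):
        otc = remote_asn   # mark on behalf of the sender
    return True, otc


def may_advertise(remote_role: str, otc) -> bool:
    """A marked route must not go to providers, peers or route servers."""
    return not (otc is not None and remote_role in ("provider", "peer", "rs"))
```

These checks run regardless of the configured policy, which is the main difference from the community tagging approach.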

Well-known large communities (draft-heitz-idr-wklc). Large communities enjoy a wide deployment, and that same wide deployment starts to bring in some requirements for an additional functionality, specifically limiting propagation scope and avoiding conflicts in assigning function identifiers. Large communities are a direct replacement of standard BGP communities and as a result they do not have any structure of interpretation of the values carried – it is just an AS number in the Global Administrator field. The proposal is to reserve a set of ASNs for the purpose of being used as well-known large community identifiers and also for control of propagation scope, leaving a total of 10 octets for use within a context of a particular WKLC. This appears to be quite controversial, both on the aspect of reserving a significant portion of 32 bit ASN space, and also trying to control the propagation of a transitive path attribute based on a value of that same attribute. However, the functionality appears to be needed, and it might be a right time to restart a discussion on defining an equivalent of extended communities for 32 bit ASN clean operation.

AS path prepending (draft-ietf-grow-as-path-prepending). More does not necessarily mean better. AS path prepending as a policy mechanism to influence path selection is well known and is universally deployed. Increasing the length of AS path by prepending ASN more than once decreases the probability of such path being selected as best. The whole question is how much to prepend in order to achieve the intended depreference of a path, especially in the conditions of everyone else effectively doing the same, and not to become too vulnerable to intentionally crafted shorter AS paths. It is not uncommon to see paths with multiples of tens of prepends in the global routing table – that is not surprising as AS path is a transitive attribute and it is strictly not allowed to remove ASNs from the received path. This can lead to a situation when any other – including intentionally crafted – AS path is seen as more preferred due to having a shorter length, and prepending yet more times only makes such types of attacks easier. Signalling of the policy intent to the remote peers is recommended to be implemented via specifically allocated communities or other path attributes, and prepending should be used only when there are no other alternatives available.

Maximum prefix limits (draft-sas-idr-maxprefix-inbound, draft-sas-idr-maxprefix-outbound). The number of prefixes received by the peer and advertised to other peers depends on the role and the place in the network of a particular BGP speaker. Generally those numbers are quite specifically bounded – a leaf site is not supposed to advertise a full global table, and a peer in the exchange should not advertise substantially more prefixes than it originates. A counter based mechanism for controlling the number of received prefixes is supported by virtually all BGP implementations. There are some deficiencies though – it may be that a large number of prefixes received from the remote peer will be rejected by the inbound policy, and while keeping such rejected prefixes is a handy optimization for speeding up the convergence in case of inbound policy changes, the storage and processing of such rejected prefixes does not come for free. In addition, refresh signalling could be used for requesting only a specific set of prefixes to be readvertised if needed. Another aspect is the number of prefixes accepted by the inbound policy – for various reasons, including operator errors, the number of such prefixes may end up being larger than it should be. On the sending side, there is also a need for a mechanism to limit the number of prefixes past the outbound policy that will get advertised to a remote peer. This results in three separate counters – inbound pre-policy, inbound post-policy, and outbound post-policy – that together can control the tolerable number of prefixes at various points of the BGP speaker processing. A twin set of documents defines a recommended session shutdown and notification of the remote side with an appropriate cease error to indicate the specific reason.
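
The three counters can be modelled in a few lines. This is only a sketch of the bookkeeping; the class, field and function names are hypothetical, and the action taken on a violation (session shutdown with a cease notification) is left out.

```python
from dataclasses import dataclass


@dataclass
class PrefixLimits:
    inbound_pre_policy: int    # everything received from the peer
    inbound_post_policy: int   # what the inbound policy accepted
    outbound_post_policy: int  # what the outbound policy lets out


def exceeded_limits(limits: PrefixLimits, received: int, accepted: int,
                    advertised: int) -> list:
    """Return the names of all counters that are above their limit."""
    checks = [
        ("inbound pre-policy", received, limits.inbound_pre_policy),
        ("inbound post-policy", accepted, limits.inbound_post_policy),
        ("outbound post-policy", advertised, limits.outbound_post_policy),
    ]
    return [name for name, count, limit in checks if count > limit]
```

A non-empty result would be the trigger for tearing the session down and telling the remote side which limit was hit.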

BMP support for local RIB monitoring (draft-ietf-grow-bmp-local-rib). BMP provides a view into input and output BGP RIBs but lacks a mechanism to transport all local RIB routes (all local routes meaning the full view of the local RIB after the path selection process, not only the routes imported into BGP context from other sources). Portions of such a view can be derived from the information available via input RIB monitoring (likely requiring coordinated monitoring of multiple nodes), however that may result in a notable amount of data to be filtered through, would require access to entities and state outside of BGP context, and may still fail to provide the specific sequence of events that happened during topology convergence. A dedicated mechanism for exporting information about all the prefixes contained in the local RIB is defined by means of a new type of BMP peer logically representing the contents of a local RIB. From the perspective of BMP protocol processing there are no logical changes to the usual operation of BMP route monitoring functionality – local RIB routes would be represented as being received from an emulated peer bound to a specific instance of a local RIB on a node.

More BMP TLVs (draft-ietf-grow-bmp-tlv). This is a framework document for defining an extensibility mechanism for BMP route monitoring messages in order to be able to carry additional and structured information within BMP. Initially BMP defined a fixed packet format for route monitoring messages, and deployment experience and evolving uses of BMP have indicated a need to convey additional information elements that were not thought of at the initial design time of BMP or may be specific to a particular implementation. BMP has a TLV based encoding mechanism from the start for most of its messages, but not for route monitoring. Therefore this simple extension mechanism allows for a TLV based encoding to be used for all BMP messages.

Autonomous System Provider Authorization (draft-ietf-sidrops-aspa-profile, draft-ietf-sidrops-aspa-verification). Origin validation provides a practical and reasonable level of verification of origination of prefixes, but the propagation path of those prefixes once originated is difficult to validate and protect from both unintended errors and malicious attacks. The tree-like nature of BGP peerings may be used for building sets of adjacency lists, treating one AS as a customer, and its peers as providers. This provides a 1:n relationship between a particular AS and its peers, and a distributed collection of such relationships forms a foundation for AS path validation in terms of checking whether adjacent pairs of AS numbers in fact have a customer-provider relationship. ASPA object is yet another object type in the RPKI infrastructure, and therefore existing mechanisms of RPKI used for origin validation would require only modest extensions in order to distribute path validation information. The validation outcome from the perspective of routing policy would result in already familiar states of “valid”, “invalid”, and “unknown” related to AS path attributes contained in received announcements.
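
The pairwise check can be sketched for the simplest case only, a route received from a customer, where every hop from the origin towards the receiver is expected to be a customer to provider hop. This is a deliberately simplified illustration and not the verification procedure from the drafts: the `aspa` dictionary (customer ASN mapped to its set of provider ASNs), the function names and the state strings are assumptions, and routes received from providers are not covered.

```python
def collapse_prepends(as_path: list) -> list:
    """Remove consecutive duplicates caused by prepending."""
    out = []
    for asn in as_path:
        if not out or out[-1] != asn:
            out.append(asn)
    return out


def hop_check(aspa: dict, customer: int, provider: int) -> str:
    providers = aspa.get(customer)
    if providers is None:
        return "no-attestation"
    return "provider" if provider in providers else "not-provider"


def verify_upstream(aspa: dict, as_path: list) -> str:
    """as_path is ordered as received: nearest AS first, origin last."""
    hops = list(reversed(collapse_prepends(as_path)))  # origin first
    results = [hop_check(aspa, hops[i], hops[i + 1])
               for i in range(len(hops) - 1)]
    if "not-provider" in results:
        return "invalid"
    if "no-attestation" in results:
        return "unknown"
    return "valid"
```

A single hop that contradicts an existing attestation makes the path invalid, while a hop with no attestation at all only degrades the result to unknown.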

RTR extensions (draft-ietf-sidrops-8210bis). RTR is a protocol used between a router and a cache for distribution of RPKI data objects. ASPA brings in a requirement for distributing AS path attestation objects to routers, and it is a simple extension to RTR for carrying yet another object type – and this set of extensions defines RTR protocol version 2. The overall protocol mechanics is equivalent to the lower versions (version 0 is specified in RFC6810, and version 1 is specified in RFC8210), and is based on the concept of a router controlling the pull in of the validation information instead of a cache pushing it out towards the router. Operation starts with a maximum supported version and can negotiate a lower backwards compatible protocol version if required during the session startup, and stays at the negotiated version for a lifetime of a session. Deployment experience has also identified several synchronization corner cases in the cache content transfer in the presence of dynamic changes of that content. ROAs for longer prefixes should be advertised before ROAs for corresponding shorter covering prefixes, and multiple ROAs for the same prefix should be advertised consecutively.
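
The ordering recommendation at the end can be met with a plain sort. The sketch below assumes ROA data is available as (prefix, maximum length, origin ASN) tuples; the tuple layout and function name are hypothetical.

```python
import ipaddress


def order_roas(roas: list) -> list:
    """Order ROA entries so that longer prefixes come before shorter ones.

    Sorting by descending prefix length guarantees that a more specific
    prefix precedes every prefix covering it, and entries for the same
    prefix share the same leading key and therefore end up adjacent.
    """
    def key(roa):
        prefix, max_len, origin = roa
        net = ipaddress.ip_network(prefix)
        return (net.version, -net.prefixlen, int(net.network_address),
                max_len, origin)
    return sorted(roas, key=key)
```

For example, two entries for 10.0.0.0/8 and one for 10.1.0.0/16 come out with the /16 first and the two /8 entries next to each other.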
