.. _fastconvergence:

Fast Convergence
------------------------------------

This is an excellent description of the topic:

`BGP PIC <https://tools.ietf.org/html/draft-ietf-rtgwg-bgp-pic-12>`_

but if you're interested in my take, keep reading...

First some definitions:

- Convergence: When a FIB is forwarding all packets correctly based
  on the network topology (i.e. doing what the routing control plane
  has instructed it to do), then it is said to be 'converged'.
  Not being in a converged state is [hopefully] a transient state,
  when either the topology change (e.g. a link failure) has not been
  observed or processed by the routing control plane, or the FIB
  is still processing routing updates. Convergence is the act of
  getting to the converged state.
- Fast: In the shortest time possible. There are no absolute limits
  placed on how short this must be, although there is one number often
  mentioned. Apparently the human ear can detect loss/delay/jitter in
  VoIP of 50ms, therefore network failures should last no longer than
  this, and some technologies (notably loop-free alternate fast
  reroute) are designed to converge within this time. However, it is
  generally accepted that it is not possible to converge a FIB with
  tens of millions of routes in this time scale; the industry
  'standard' is sub-second.

Converging the FIB quickly is thus a matter of:

- discovering that something is down
- updating as few objects as possible
- determining which objects to update as efficiently as possible
- updating each object as quickly as possible

We'll discuss each in turn.
All output came from VPP version 21.01rc0. In what follows I use IPv4
prefixes, addresses and IPv4 host-length masks; however, exactly the
same applies to IPv6.
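
If you are following along, you can check what version your own build
reports with the ``show version`` CLI (a standard VPP command; your
output will differ from the 21.01rc0 used here):

.. code-block:: console

  DBGvpp# show version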


Failure Detection
^^^^^^^^^^^^^^^^^

The two common forms (we'll see others later on) of failure detection
are:

- link down
- BFD

The FIB needs to hook into these notifications to trigger
convergence.

Whenever an interface goes down, VPP issues a callback to all
registered clients. The adjacency code is such a client. The adjacency
is a leaf node in the FIB control-plane graph (containing fib_path_t,
fib_entry_t, etc.). A back-walk from the adjacency will trigger a
re-resolution of the paths.

FIB is a client of BFD in order to receive BFD notifications. BFD
comes in two flavours: single and multi hop. Single hop is to protect
a specific peer on an interface; such peers are modelled by an
adjacency. Multi hop is to protect a peer on an unspecified interface
(i.e. a remote peer); this peer is represented by a host-prefix
**fib_entry_t**. In both cases FIB will add a delegate to the
**ip_adjacency_t** or **fib_entry_t** that represents the association
to the BFD session. If the BFD session signals up/down then a backwalk
can be triggered from the object to trigger re-resolution and hence
convergence.
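
The shape of this mechanism is easy to see in miniature. Below is a
self-contained sketch in C - emphatically *not* the real VPP API, all
names are hypothetical - of the pattern just described: a delegate ties
a BFD session to the object it protects, and a state change triggers a
back-walk that re-resolves all dependents:

.. code-block:: c

  #include <stdio.h>

  #define MAX_CHILDREN 8

  /* a node in the FIB control-plane graph (cf. fib_entry_t et al.) */
  typedef struct fib_node {
      const char *name;
      int resolved;
      struct fib_node *children[MAX_CHILDREN]; /* dependents */
      int n_children;
  } fib_node_t;

  /* the back-walk: visit each child and ask it to re-resolve */
  static void fib_back_walk(fib_node_t *node)
  {
      for (int i = 0; i < node->n_children; i++) {
          fib_node_t *child = node->children[i];
          printf("re-resolving %s\n", child->name);
          fib_back_walk(child); /* keep walking towards the routes */
      }
  }

  /* the delegate: associates a BFD session with the object
   * (adjacency or host-prefix entry) that it protects */
  typedef struct {
      int session_up;
      fib_node_t *protected_object;
  } bfd_delegate_t;

  static void bfd_state_change(bfd_delegate_t *del, int up)
  {
      del->session_up = up;
      del->protected_object->resolved = up;
      fib_back_walk(del->protected_object); /* trigger convergence */
  }

  int main(void)
  {
      fib_node_t adj   = { "adjacency 10.0.0.1 GigE0/0/0", 1, {0}, 0 };
      fib_node_t path  = { "path via 10.0.0.1", 1, {0}, 0 };
      fib_node_t entry = { "entry 1.1.1.1/32", 1, {0}, 0 };

      adj.children[adj.n_children++]   = &path;
      path.children[path.n_children++] = &entry;

      bfd_delegate_t del = { 1, &adj };
      bfd_state_change(&del, 0); /* BFD down: back-walk re-resolves */
      return 0;
  }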


Few Updates
^^^^^^^^^^^

In order to talk about what 'a few' is, we have to leave the realm of
the FIB as an abstract graph-based object DB and move into the
concrete representation of forwarding in a large network. Large
networks are built in layers; it's how you scale them. We'll take
here a hypothetical service provider (SP) network, but the concepts
apply equally to data center leaf-spines. This is a rudimentary
description, but it should serve our purpose.

An SP manages a BGP autonomous system (AS). The SP's goal is both to
attract traffic into its network to serve its customers, and to
serve transit traffic passing through it; we'll consider the latter here.
The SP's network is all the devices in that AS. These
devices are split into those at the edge (provider edge (PE) routers),
which peer with routers in other SP networks,
and those in the core (termed provider (P) routers). Both the PE and P
routers run the IGP (usually OSPF or ISIS). Only the reachability of the devices
in the AS is advertised in the IGP - thus the scale (i.e. the number
of routes) in the IGP is 'small' - only the number of
devices that the SP has (typically not more than a few 10k).
PE routers run BGP; they have external BGP sessions to devices in
other ASs and internal BGP sessions to devices in the same AS. BGP is
used to advertise the routes to *all* networks on the internet - at
the time of writing this number is approaching 900k IPv4 routes; hopefully by
the time you are reading this the number of IPv6 routes has caught up ...
If we include the additional routes the SP carries to offer VPN services to its
customers, the number of BGP routes can grow to the tens of millions.

BGP scale thus exceeds IGP scale by two orders of magnitude... pause for
a moment and let that sink in...

A comparison of BGP and an IGP is way, way beyond the scope of this
documentation (and frankly beyond me), so we'll note only the
difference in the form of the routes they present to FIB. A routing
protocol will produce routes that specify the prefixes that are
reachable through its peers. A good IGP
is link state based; it forms peerings to other devices over these
links, hence its routes specify links/interfaces. In
FIB nomenclature this means an IGP produces routes that are
attached-nexthop, e.g.:

.. code-block:: console

  ip route add 1.1.1.1/32 via 10.0.0.1 GigEthernet0/0/0

BGP on the other hand forms peerings only to neighbours; it does not
know, nor care, what interface is used to reach the peer. In FIB
nomenclature therefore BGP produces recursive routes, e.g.:

.. code-block:: console

  ip route add 8.0.0.0/16 via 1.1.1.1

where 1.1.1.1 is the BGP peer. It's no accident in this example that
1.1.1.1/32 happens to be the route the IGP advertised... BGP installs
routes for prefixes reachable via other BGP peers, and the IGP installs
the routes to those BGP peers.

This has been a very long-winded way of describing why the scale of
recursive routes is two orders of magnitude greater than that of
non-recursive/attached-nexthop routes.

If we step back for a moment and recall why we've crawled down this
rabbit hole: we're trying to determine what 'a few' updates means.
Does it include all those recursive routes? Probably not... let's
keep crawling.

We started this chapter with an abstract description of convergence;
let's now make that more real. In the event of a network failure an SP
is interested in moving to an alternate forwarding path as quickly as
possible. If there is no alternate path, and even a converged FIB will drop
the packet, then who cares how fast it converges? In other words, the
interesting convergence scenarios are the scenarios where the network has
alternate paths.

PIC Core
^^^^^^^^

First let's consider alternate paths in the IGP, e.g.:

.. code-block:: console

  ip route add 1.1.1.1/32 via 10.0.0.2 GigEthernet0/0/0
  ip route add 1.1.1.1/32 via 10.0.1.2 GigEthernet0/0/1

this gives us in the FIB:

.. code-block:: console

  DBGvpp# sh ip fib 1.1.1.1/32
  ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, default-route:1, ]
  1.1.1.1/32 fib:0 index:15 locks:2
    API refs:1 src-flags:added,contributing,active,
      path-list:[23] locks:2 flags:shared, uPRF-list:22 len:2 itfs:[1, 2, ]
        path:[27] pl-index:23 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
          10.0.0.2 GigEthernet0/0/0
        [@0]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001111111111dead000000000800
        path:[28] pl-index:23 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
          10.0.1.2 GigEthernet0/0/1
        [@0]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800

    forwarding: unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:22 to:[0:0]]
        [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001111111111dead000000000800
        [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800

There is ECMP across the two paths. Note that the instance/index of the
load-balance present in the forwarding graph is 17.

Let's add a BGP route via this peer:

.. code-block:: console

  ip route add 8.0.0.0/16 via 1.1.1.1

in the FIB we see:

.. code-block:: console

  DBGvpp# sh ip fib 8.0.0.0/16
  ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:1, default-route:1, ]
  8.0.0.0/16 fib:0 index:18 locks:2
    API refs:1 src-flags:added,contributing,active,
      path-list:[24] locks:2 flags:shared, uPRF-list:21 len:2 itfs:[1, 2, ]
        path:[29] pl-index:24 ip4 weight=1 pref=0 recursive: oper-flags:resolved,
          via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]

    forwarding: unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:20 buckets:1 uRPF:21 to:[0:0]]
        [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:22 to:[0:0]]
          [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001111111111dead000000000800
          [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800

the load-balance object used by this route is index 20, but note that
the next load-balance in the chain is index 17, i.e. it is exactly
the same instance that appears in the forwarding chain for the IGP
route. So in the forwarding plane the packet first encounters
load-balance object 20 (which it will use in ip4-lookup) and then
number 17 (in ip4-load-balance).
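
As an aside, you can inspect a load-balance object directly given its
index (assuming your build includes the DPO show commands):

.. code-block:: console

  DBGvpp# show load-balance 17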

What's the significance? Let's shut down one of those IGP paths:

.. code-block:: console

  DBGvpp# set in state GigEthernet0/0/0 down

the resulting update to the IGP route is:

.. code-block:: console

  DBGvpp# sh ip fib 1.1.1.1/32
  ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:1, default-route:1, ]
  1.1.1.1/32 fib:0 index:15 locks:4
    API refs:1 src-flags:added,contributing,active,
      path-list:[23] locks:2 flags:shared, uPRF-list:25 len:2 itfs:[1, 2, ]
        path:[27] pl-index:23 ip4 weight=1 pref=0 attached-nexthop:
          10.0.0.2 GigEthernet0/0/0
        [@0]: arp-ipv4: via 10.0.0.2 GigEthernet0/0/0
        path:[28] pl-index:23 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved,
          10.0.1.2 GigEthernet0/0/1
        [@0]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800

    recursive-resolution refs:1 src-flags:added, cover:-1

    forwarding: unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:17 buckets:1 uRPF:25 to:[0:0]]
        [0] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800


notice that the path via 10.0.0.2 is no longer flagged as resolved,
and the forwarding chain does not contain this path as a
choice. However, the key thing to note is that the load-balance
instance is still index 17, i.e. it has been modified, not
exchanged. In the FIB vernacular we say it has been 'in-place
modified', a somewhat linguistically redundant expression, but one that serves
to emphasise that it was changed whilst still being part of the graph; it
was never at any point removed from the graph and re-added, and it was
modified without the worker barrier lock held.

Still don't see the significance? In order to converge around the
failure of the IGP link it was not necessary to update load-balance
object number 20! It was not necessary to update the recursive
route. I.e. convergence is achieved without updating any recursive
routes; it is only necessary to update the affected IGP routes. This is
the definition of 'a few'. We call this 'prefix independent
convergence' (PIC), which should really be called 'recursive prefix
independent convergence', but it isn't...

How was the trick done? As with all problems in computer science, it
was solved by a layer of misdirection, I mean indirection. The
indirection is the load-balance that belongs to the IGP route. By
keeping this object in the forwarding graph and updating it in place,
we get PIC. The alternative design would be to collapse the two layers of
load-balancing into one, which would improve forwarding performance
but would come at the cost of prefix dependent convergence. No doubt
there are situations where a VPP deployment would favour forwarding
performance over convergence; you know the drill, contributions welcome.
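
To see the indirection in miniature, here is a self-contained sketch
(plain C, hypothetical names, not VPP's actual types): two chained
load-balances, where repairing the inner one in place converges every
recursive route stacked on it without touching their own objects:

.. code-block:: c

  #include <stdio.h>

  typedef struct { const char *via; } adjacency_t;

  typedef struct load_balance {
      int n_buckets;
      const void *buckets[2]; /* adjacencies, or nested load-balances */
      int is_final;           /* 1 if buckets hold adjacencies */
  } load_balance_t;

  /* forwarding: resolve through the chain of load-balances */
  static const adjacency_t *lookup(const load_balance_t *lb,
                                   unsigned flow_hash)
  {
      while (!lb->is_final)
          lb = lb->buckets[flow_hash % lb->n_buckets];
      return lb->buckets[flow_hash % lb->n_buckets];
  }

  int main(void)
  {
      adjacency_t ge0 = { "10.0.0.2 GigEthernet0/0/0" };
      adjacency_t ge1 = { "10.0.1.2 GigEthernet0/0/1" };

      /* 'LB 17': owned by the IGP route 1.1.1.1/32, ECMP over 2 links */
      load_balance_t lb17 = { 2, { &ge0, &ge1 }, 1 };
      /* 'LB 20': owned by the BGP route 8.0.0.0/16, recurses via LB 17;
       * millions of recursive routes could each have an LB like this */
      load_balance_t lb20 = { 1, { &lb17 }, 0 };

      printf("before: %s\n", lookup(&lb20, 0)->via);

      /* GigEthernet0/0/0 fails: modify LB 17 in place; LB 20 (and any
       * other recursive LB stacked on LB 17) is never touched */
      lb17.n_buckets = 1;
      lb17.buckets[0] = &ge1;

      printf("after:  %s\n", lookup(&lb20, 0)->via);
      return 0;
  }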

This failure scenario is known as PIC core, since it's one of the IGP's
core links that has failed.

iBGP PIC Edge
^^^^^^^^^^^^^

Next, let's consider alternate paths in BGP, e.g.:

.. code-block:: console

  ip route add 8.0.0.0/16 via 1.1.1.1
  ip route add 8.0.0.0/16 via 1.1.1.2

the 8.0.0.0/16 prefix is reachable via two BGP next-hops (two PEs).

Our FIB now also contains:

.. code-block:: console

  DBGvpp# sh ip fib 8.0.0.0/16
  ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:2, default-route:1, ]
  8.0.0.0/16 fib:0 index:18 locks:2
    API refs:1 src-flags:added,contributing,active,
      path-list:[15] locks:2 flags:shared, uPRF-list:11 len:2 itfs:[1, 2, ]
        path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved,
          via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
        path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved,
          via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-load-balance:12]

    forwarding: unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:20 buckets:2 uRPF:11 to:[0:0]]
        [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:1 uRPF:25 to:[0:0]]
          [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800
          [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
        [1] [@12]: dpo-load-balance: [proto:ip4 index:12 buckets:1 uRPF:13 to:[0:0]]
          [0] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800

The first load-balance (LB) in the forwarding graph is index 20 (the astute
reader will note this is the same index as in the previous
section; I am adding paths to the same route, so the load-balance is
in-place modified again). Each choice in LB 20 is another LB
contributed by the IGP route through which the route's paths recurse.

So what's the equivalent in BGP to a link down in the IGP? An IGP link
down means it loses its peering out of that link, so the equivalent in
BGP is the loss of the peering and thus the loss of reachability to
the peer. This is signalled by the IGP withdrawing the route to the
peer. But "Wait wait wait", I hear you say ... "just because the IGP
withdraws 1.1.1.1/32 doesn't mean I can't reach 1.1.1.1, perhaps there
is a less specific route that gives reachability to 1.1.1.1". Indeed
there may be. So a little more on BGP network design. I know it's like
a bad detective novel where the author drip feeds you the plot... When
describing iBGP peerings one 'always' describes the peer using one of
its loopback addresses. Why? A loopback interface
never goes down (unless you admin it down yourself); some muppet can't
accidentally cut through the loopback cable whilst digging up the
street. And what subnet mask length does a prefix have on a loopback
interface? It's 'always' a /32. Why? Because there's no cable to connect
any other devices. This choice justifies there 'always' being a /32
route for the BGP peer. But what prevents there being a less
specific? Nothing.
Now clearly if the BGP peer crashes then the /32 for its loopback is
going to be removed from the IGP, but what will withdraw the less
specific? Nothing.

So, in order to rely on the withdrawal of the /32 for the peer as the
signal that the peer is down, and thus as the trigger to converge
the FIB, we need to force FIB to recurse only via
the /32 and not via a less specific. This is called a 'recursion
constraint'. In this case the constraint is 'recurse via host',
i.e. for IPv4 use a /32.
So we need to update our route additions from before:

.. code-block:: console

  ip route add 8.0.0.0/16 via 1.1.1.1 resolve-via-host
  ip route add 8.0.0.0/16 via 1.1.1.2 resolve-via-host

checking the FIB output is left as an exercise for the reader. I hope
you're doing these configs as you read. There's little change in the
output; you'll see some extra flags on the paths.

Now let's add the less specific, just for fun:

.. code-block:: console

  ip route add 1.1.1.0/28 via 10.0.0.2 GigEthernet0/0/0

nothing changes in the resolution of 8.0.0.0/16.

Now withdraw the route to 1.1.1.2/32:

.. code-block:: console

  ip route del 1.1.1.2/32 via 10.0.0.2 GigEthernet0/0/0

In the FIB we see:

.. code-block:: console

  DBGvpp# sh ip fib 8.0.0.0/32
  ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:2, default-route:1, ]
  8.0.0.0/16 fib:0 index:18 locks:2
    API refs:1 src-flags:added,contributing,active,
      path-list:[15] locks:2 flags:shared, uPRF-list:13 len:2 itfs:[1, 2, ]
        path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
          via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
        path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: cfg-flags:resolve-host,
          via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-drop:0]

    forwarding: unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:20 buckets:1 uRPF:13 to:[0:0]]
        [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:27 to:[0:0]]
          [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800
          [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800

the path via 1.1.1.2 is unresolved, because the recursion constraint
prevents the path resolving via 1.1.1.0/28. LB index 20
has been updated to remove the unresolved path.

Job done? Not quite! Why not?

Let's re-examine the goals of this chapter. We wanted to update 'a
few' objects, which we have defined as not all the millions of
recursive routes. Did we update a recursive route here? We sure did,
when we modified LB index 20. So WTF?? Where's the indirection object that can
be modified so that the LBs for the recursive routes are not
modified? It's not there.... WTF?

OK, so the great detective has assembled all the suspects in the
drawing room and only now does he drop the bomb: the FIB knows the
scale. We talked above about what the scale **can** be, worst case
scenario, but that's not necessarily what it is in this hypothetical
(your) deployment. The FIB knows how many recursive routes
depend on a /32, and it can thus make its own determination of the
definition of 'a few'. In other words, if there are only 'a few'
recursive prefixes that depend on a /32 then it will update them
synchronously (and we'll discuss what synchronously means a bit more later).

So what does FIB consider to be 'a few'? Let's add more routes and
find out.

.. code-block:: console

  DBGvpp# ip route add 8.1.0.0/16 via 1.1.1.2 resolve-via-host via 1.1.1.1 resolve-via-host
  ...
  DBGvpp# ip route add 8.63.0.0/16 via 1.1.1.2 resolve-via-host via 1.1.1.1 resolve-via-host

and we see:

.. code-block:: console

  DBGvpp# sh ip fib 8.8.0.0
  ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:4, default-route:1, ]
  8.8.0.0/16 fib:0 index:77 locks:2
    API refs:1 src-flags:added,contributing,active,
      path-list:[15] locks:128 flags:shared,popular, uPRF-list:28 len:2 itfs:[1, 2, ]
        path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
          via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
        path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
          via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-load-balance:12]

    forwarding: unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:79 buckets:2 uRPF:28 flags:[uses-map] to:[0:0]]
          load-balance-map: index:0 buckets:2
             index:    0    1
               map:    0    1
        [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:27 to:[0:0]]
          [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800
          [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800
        [1] [@12]: dpo-load-balance: [proto:ip4 index:12 buckets:1 uRPF:18 to:[0:0]]
          [0] [@3]: arp-ipv4: via 10.0.1.2 GigEthernet0/0/0

Two elements to note here: the path-list has the 'popular' flag and
there is a load-balance map in the forwarding path.

'popular' in this case means that the path-list has passed the limit
of 'a few' in the number of children it has.

Here are the children:

.. code-block:: console

  DBGvpp# sh fib path-list 15
  path-list:[15] locks:128 flags:shared,popular, uPRF-list:28 len:2 itfs:[1, 2, ]
    path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
      via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
    path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
      via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-load-balance:12]
  children:{entry:18}{entry:21}{entry:22}{entry:23}{entry:25}{entry:26}{entry:27}{entry:28}{entry:29}{entry:30}{entry:31}{entry:32}{entry:33}{entry:34}{entry:35}{entry:36}{entry:37}{entry:38}{entry:39}{entry:40}{entry:41}{entry:42}{entry:43}{entry:44}{entry:45}{entry:46}{entry:47}{entry:48}{entry:49}{entry:50}{entry:51}{entry:52}{entry:53}{entry:54}{entry:55}{entry:56}{entry:57}{entry:58}{entry:59}{entry:60}{entry:61}{entry:62}{entry:63}{entry:64}{entry:65}{entry:66}{entry:67}{entry:68}{entry:69}{entry:70}{entry:71}{entry:72}{entry:73}{entry:74}{entry:75}{entry:76}{entry:77}{entry:78}{entry:79}{entry:80}{entry:81}{entry:82}{entry:83}{entry:84}

64 children makes a path-list popular. The number is fixed (there is no API to
change it); its value is an attempt to balance the forwarding cost of
the indirection against the convergence gain.

Popular path-lists contribute the load-balance map; this is the
missing indirection object. Its indirection happens when choosing the
bucket in the LB. The packet's flow-hash is taken 'mod number of
buckets' to give the 'candidate bucket', then the map translates the
candidate into the bucket actually used. You can see in the example above
that no change occurs, i.e. if the flow-hash mod n chooses bucket 1
then it gets bucket 1.

Why is this useful? The path-list is shared (you can convince
yourself of this if you look at each of the 8.x.0.0/16 routes we
added) and all of these routes use the same load-balance map; therefore, to
converge all the recursive routes, we need only change the map and
we're good. We again get PIC.
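
Arithmetically the indirection is tiny. A sketch (hypothetical names,
not VPP's data structures) of bucket selection before and after the
map is repaired:

.. code-block:: c

  #include <stdio.h>

  enum { N_BUCKETS = 2 };

  /* one small map shared by every route using this path-list;
   * identity mapping while all paths are up */
  static int lb_map[N_BUCKETS] = { 0, 1 };

  static int choose_bucket(unsigned flow_hash)
  {
      int candidate = flow_hash % N_BUCKETS; /* candidate bucket */
      return lb_map[candidate];              /* the indirection */
  }

  int main(void)
  {
      printf("flow 7 -> bucket %d\n", choose_bucket(7)); /* bucket 1 */

      /* the path in bucket 1 fails: repoint its map slot at a
       * surviving bucket; no per-route load-balance is touched */
      lb_map[1] = 0;

      printf("flow 7 -> bucket %d\n", choose_bucket(7)); /* bucket 0 */
      return 0;
  }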

OK, who's still awake? If you're thinking there's more to this story,
you're right. Keep reading.

This failure scenario is called iBGP PIC edge. It's 'edge' because it
refers to the loss of an edge device, and iBGP because the device was
an iBGP peer (we learn routes to iBGP peers in the IGP). There is a similar eBGP
PIC edge scenario, but this is left as an exercise for the reader (hint:
there are other recursion constraints - see the RFC).

Which Objects
^^^^^^^^^^^^^

The next topic on our list of how to converge quickly was to
find, as efficiently as possible, the objects that need to be updated
when a convergence event happens. If you haven't realised by now that the FIB is an
object graph, then can I politely suggest you go back and start from
the beginning ...

Finding the objects affected by a change is simply a matter of walking
from the parent (the object affected) to its children. These
dependencies are maintained precisely for this reason.

So is fast convergence just a matter of walking the graph? Yes and
no. The question to ask yourself is this: "in the case of iBGP PIC edge,
when the /32 is withdrawn, what is the list of objects that need to be
updated, and in particular, in what order should they be updated
to obtain the best convergence time?" Think breadth v. depth first.

... ponder for a while ...

For iBGP PIC edge we said it's the path-list that provides the
indirection through the load-balance map. Hence once all path-lists
are updated we are converged; thereafter, at our leisure, we can
update the child recursive prefixes. Is that breadth or depth first?

It's breadth first.

Breadth-first walks are achieved by spawning an async walk of the
branch of the graph that we don't want to traverse synchronously. Withdrawing
the /32 triggers a synchronous walk of the children of the /32 route; we want
a synchronous walk because we want to converge ASAP. This synchronous
walk will encounter path-lists in the /32 route's child dependent list.
These path-lists (and their LB maps) will be updated. If a path-list is
popular, then it will spawn an async walk of the path-list's child
dependent routes; if not, it will walk those routes synchronously. So the walk
effectively proceeds breadth first across the path-lists, then returns
to the start to do the affected routes.
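
In pseudo-C the ordering looks something like this (a sketch with
hypothetical names, not the actual walk implementation):

.. code-block:: c

  #include <stdio.h>

  #define MAX 8

  typedef struct {
      const char *name;
      int popular;               /* has more than 'a few' children? */
      const char *children[MAX]; /* dependent recursive routes */
      int n_children;
  } path_list_t;

  /* queue of deferred (asynchronous) walks */
  static path_list_t *async_q[MAX];
  static int async_n;

  static void update_route(const char *r)
  {
      printf("  update route %s\n", r);
  }

  static void converge(path_list_t **pls, int n)
  {
      /* synchronous: converge every path-list (and LB map) first */
      for (int i = 0; i < n; i++) {
          printf("update path-list %s and its LB map\n", pls[i]->name);
          if (pls[i]->popular)
              async_q[async_n++] = pls[i]; /* defer the deep branch */
          else
              for (int j = 0; j < pls[i]->n_children; j++)
                  update_route(pls[i]->children[j]);
      }
      /* forwarding is now converged; finish the rest at our leisure */
      for (int i = 0; i < async_n; i++)
          for (int j = 0; j < async_q[i]->n_children; j++)
              update_route(async_q[i]->children[j]);
  }

  int main(void)
  {
      path_list_t popular = { "PL-15", 1, { "8.0.0.0/16", "8.1.0.0/16" }, 2 };
      path_list_t small   = { "PL-24", 0, { "9.0.0.0/16" }, 1 };
      path_list_t *pls[]  = { &popular, &small };

      converge(pls, 2); /* as if triggered by withdrawing the /32 */
      return 0;
  }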

Now the story is complete. The murderer is revealed.

Let's withdraw one of the IGP routes.

.. code-block:: console

  DBGvpp# ip route del 1.1.1.2/32 via 10.0.1.2 GigEthernet0/0/1

  DBGvpp# sh ip fib 8.8.0.0
  ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:4, default-route:1, ]
  8.8.0.0/16 fib:0 index:77 locks:2
    API refs:1 src-flags:added,contributing,active,
      path-list:[15] locks:128 flags:shared,popular, uPRF-list:18 len:2 itfs:[1, 2, ]
        path:[17] pl-index:15 ip4 weight=1 pref=0 recursive: oper-flags:resolved, cfg-flags:resolve-host,
          via 1.1.1.1 in fib:0 via-fib:15 via-dpo:[dpo-load-balance:17]
        path:[15] pl-index:15 ip4 weight=1 pref=0 recursive: cfg-flags:resolve-host,
          via 1.1.1.2 in fib:0 via-fib:10 via-dpo:[dpo-drop:0]

    forwarding: unicast-ip4-chain
      [@0]: dpo-load-balance: [proto:ip4 index:79 buckets:1 uRPF:18 to:[0:0]]
        [0] [@12]: dpo-load-balance: [proto:ip4 index:17 buckets:2 uRPF:27 to:[0:0]]
          [0] [@5]: ipv4 via 10.0.0.2 GigEthernet0/0/0: mtu:9000 next:3 001122334455dead000000000800
          [1] [@5]: ipv4 via 10.0.1.2 GigEthernet0/0/1: mtu:9000 next:4 001111111111dead000000010800

the LB map has gone, since the prefix now only has one path. You'll
need to be a CLI ninja if you want to catch the output showing the LB
map in its transient state of:

.. code-block:: console

  load-balance-map: index:0 buckets:2
     index:    0    1
       map:    0    0

but it happens. Trust me. I've got tests and everything.

On the final topic of how to converge quickly, 'make each update fast',
there are no tricks.