Neale Ranns | 0bfe5d8 | 2016-08-25 15:29:12 +0100 | [diff] [blame] | 1 | /* |
| 2 | * Copyright (c) 2016 Cisco and/or its affiliates. |
| 3 | * Licensed under the Apache License, Version 2.0 (the "License"); |
| 4 | * you may not use this file except in compliance with the License. |
| 5 | * You may obtain a copy of the License at: |
| 6 | * |
| 7 | * http://www.apache.org/licenses/LICENSE-2.0 |
| 8 | * |
| 9 | * Unless required by applicable law or agreed to in writing, software |
| 10 | * distributed under the License is distributed on an "AS IS" BASIS, |
| 11 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 12 | * See the License for the specific language governing permissions and |
| 13 | * limitations under the License. |
| 14 | */ |
| 15 | /** |
| 16 | * \brief |
| 17 | * A IP v4/6 independent FIB. |
| 18 | * |
| 19 | * The main functions provided by the FIB are as follows; |
| 20 | * |
| 21 | * - source priorities |
| 22 | * |
| 23 | * A route can be added to the FIB by more than entity or source. Sources |
| 24 | * include, but are not limited to, API, CLI, LISP, MAP, etc (for the full list |
| 25 | * see fib_entry.h). Each source provides the forwarding information (FI) that |
| 26 | * is has determined as required for that route. Since each source determines the |
| 27 | * FI using different best path and loop prevention algorithms, it is not |
| 28 | * correct for the FI of multiple sources to be combined. Instead the FIB must |
| 29 | * choose to use the FI from only one source. This choose is based on a static |
| 30 | * priority assignment. For example; |
| 31 | * IF a prefix is added as a result of interface configuration: |
| 32 | * set interface address 192.168.1.1/24 GigE0 |
| 33 | * and then it is also added from the CLI |
| 34 | * ip route 192.168.1.1/32 via 2.2.2.2/32 |
| 35 | * then the 'interface' source will prevail, and the route will remain as |
| 36 | * 'local'. |
| 37 | * The requirement of the FIB is to always install the FI from the winning |
| 38 | * source and thus to maintain the FI added by losing sources so it can be |
| 39 | * installed should the winning source be withdrawn. |
| 40 | * |
| 41 | * - adj-fib maintenance |
| 42 | * |
| 43 | * When ARP or ND discover a neighbour on a link an adjacency forms for the |
| 44 | * address of that neighbour. It is also required to insert a route in the |
| 45 | * appropriate FIB table, corresponding to the VRF for the link, an entry for |
| 46 | * that neighbour. This entry is often referred to as an adj-fib. Adj-fibs |
| 47 | * have a dedicated source; 'ADJ'. |
| 48 | * The priority of the ADJ source is lower than most. This is so the following |
| 49 | * config; |
| 50 | * set interface address 192.168.1.1/32 GigE0 |
| 51 | * ip arp 192.168.1.2 GigE0 dead.dead.dead |
| 52 | * ip route add 192.168.1.2 via 10.10.10.10 GigE1 |
| 53 | * will forward traffic for 192.168.1.2 via GigE1. That is the route added |
| 54 | * by the control plane is favoured over the adjacency discovered by ARP. |
| 55 | * The control plane, with its associated authentication, is considered the |
| 56 | * authoritative source. |
| 57 | * To counter the nefarious addition of adj-fib, through the nefarious injection |
| 58 | * of adjacencies, the FIB is also required to ensure that only adj-fibs whose |
| 59 | * less specific covering prefix is connected are installed in forwarding. This |
| 60 | * requires the use of 'cover tracking', where a route maintains a dependency |
| 61 | * relationship with the route that is its less specific cover. When this cover |
| 62 | * changes (i.e. there is a new covering route) or the forwarding information |
| 63 | * of the cover changes, then the covered route is notified. |
| 64 | * |
| 65 | * Overlapping sub-nets are not supported, so no adj-fib has multiple paths. |
| 66 | * The control plane is expected to remove a prefix configured for an interface |
| 67 | * before the interface changes VRF. |
| 68 | * So while the following config is accepted: |
| 69 | * set interface address 192.168.1.1/32 GigE0 |
| 70 | * ip arp 192.168.1.2 GigE0 dead.dead.dead |
| 71 | * set interface ip table GigE0 2 |
| 72 | * it does not result in the desired behaviour. |
| 73 | * |
| 74 | * - attached export. |
| 75 | * |
| 76 | * Further to adj-fib maintenance above consider the following config: |
| 77 | * set interface address 192.168.1.1/24 GigE0 |
| 78 | * ip route add table 2 192.168.1.0/24 GigE0 |
| 79 | * Traffic destined for 192.168.1.2 in table 2 will generate an ARP request |
| 80 | * on GigE0. However, since GigE0 is in table 0, all adj-fibs will be added in |
| 81 | * FIB 0. Hence all hosts in the sub-net are unreachable from table 2. To resolve |
| 82 | * this, all adj-fib and local prefixes are exported (i.e. copied) from the |
| 83 | * 'export' table 0, to the 'import' table 2. There can be many import tables |
| 84 | * for a single export table. |
| 85 | * |
| 86 | * - recursive route resolution |
| 87 | * |
| 88 | * A recursive route is of the form: |
| 89 | * 1.1.1.1/32 via 10.10.10.10 |
| 90 | * i.e. a route for which no egress interface is provided. In order to forward |
| 91 | * traffic to 1.1.1.1/32 the FIB must therefore first determine how to forward |
| 92 | * traffic to 10.10.10.10/32. This is recursive resolution. |
| 93 | * Recursive resolution, just like normal resolution, proceeds via a longest |
| 94 | * prefix match for the 'via-address' 10.10.10.10. Note it is only possible |
| 95 | * to add routes via an address (i.e. a /32 or /128) not via a shorter mask |
| 96 | * prefix. There is no use case for the latter. |
| 97 | * Since recursive resolution proceeds via a longest prefix match, the entry |
| 98 | * in the FIB that will resolve the recursive route, termed the via-entry, may |
| 99 | * change as other routes are added to the FIB. Consider the recursive |
| 100 | * route shown above, and this non-recursive route: |
| 101 | * 10.10.10.0/24 via 192.168.16.1 GigE0 |
| 102 | * The entry for 10.10.10.0/24 is thus the resolving via-entry. If this entry is |
| 103 | * modified, to say; |
| 104 | * 10.10.10.0/24 via 192.16.1.3 GigE0 |
| 105 | * Then packet for 1.1.1.1/32 must also be sent to the new next-hop. |
| 106 | * Now consider the addition of; |
| 107 | * 10.10.10.0/28 via 192.168.16.2 GigE0 |
| 108 | * The more specific /28 is a better longest prefix match and thus becomes the |
| 109 | * via-entry. Removal of the /28 means the resolution will revert to the /24. |
| 110 | * The tracking to the changes in recursive resolution is the requirement of |
| 111 | * the FIB. When the forwarding information of the via-entry changes a back-walk |
| 112 | * is used to update dependent recursive routes. When new routes are added to |
| 113 | * the table the cover tracking feature provides the necessary notifications to |
| 114 | * the via-entry routes. |
| 115 | * The adjacency constructed for 1.1.1.1/32 will be a recursive adjacency |
| 116 | * whose next adjacency will be contributed from the via-entry. Maintaining |
| 117 | * the validity of this recursive adjacency is a requirement of the FIB. |
| 118 | * |
| 119 | * - recursive loop avoidance |
| 120 | * |
| 121 | * Consider this set of routes: |
| 122 | * 1.1.1.1/32 via 2.2.2.2 |
| 123 | * 2.2.2.2/32 via 3.3.3.3 |
| 124 | * 3.3.3.3/32 via 1.1.1.1 |
| 125 | * this is termed a recursion loop - all of the routes in the loop are |
| 126 | * unresolved in so far as they do not have a resolving adjacency, but each |
| 127 | * is resolved because the via-entry is known. It is important here to note |
| 128 | * the distinction between the control-plane objects and the data-plane objects |
| 129 | * (more details in the implementation section). The control plane objects must |
| 130 | * allow the loop to form (i.e. the graph becomes cyclic), however, the |
| 131 | * data-plane absolutely must not allow the loop to form, otherwise the packet |
| 132 | * would loop indefinitely and never egress the device - meltdown would follow. |
| 133 | * The control plane must allow the loop to form, because when the loop breaks, |
| 134 | * all members of the loop need to be updated. Forming the loop allows the |
| 135 | * dependencies to be correctly setup to allow this to happen. |
| 136 | * There is no limit to the depth of recursion supported by VPP so: |
| 137 | * 9.9.9.100/32 via 9.9.9.99 |
| 138 | * 9.9.9.99/32 via 9.9.9.98 |
| 139 | * 9.9.9.98/32 via 9.9.9.97 |
| 140 | * ... turtles, turtles, turtles ... |
| 141 | * 9.9.9.1/32 via 10.10.10.10 Gig0 |
| 142 | * is supported to as many layers of turtles is desired, however, when |
| 143 | * back-walking a graph (in this case from 9.9.9.1/32 up toward 9.9.9.100/32) |
| 144 | * a FIB needs to differentiate the case where the recursion is deep versus |
| 145 | * the case where the recursion is looped. A simple method, employed by VPP FIB, |
| 146 | * is to limit the number of steps. VPP FIB limit is 16. Typical BGP scenarios |
| 147 | * in the wild do not exceed 3 (BGP Inter-AS option C). |
| 148 | * |
| 149 | * - Fast Convergence |
| 150 | * |
| 151 | * After a network topology change, the 'convergence' time, is the time taken |
| 152 | * for the router to complete a transition to forward traffic using the new |
| 153 | * topology. The convergence time is therefore a summation of the time to; |
| 154 | * - detect the failure. |
| 155 | * - calculate the new 'best path' information |
| 156 | * - download the new best paths to the data-plane. |
| 157 | * - install those best best in data-plane forwarding. |
| 158 | * The last two points are of relevance to VPP architecture. The download API is |
| 159 | * binary and batch, details are not discussed here. There is no HW component to |
| 160 | * programme, installation time is bounded by the memory allocation and table |
| 161 | * lookup and insert access times. |
| 162 | * |
| 163 | * 'Fast' convergence refers to a set of technologies that a FIB can employ to |
| 164 | * completely or partially restore forwarding whilst the convergence actions |
| 165 | * listed above are ongoing. Fast convergence technologies are further |
| 166 | * sub-divided into Prefix Independent Convergence (PIC) and Loop Free |
| 167 | * Alternate path Fast re-route (LFA-FRR or sometimes called IP-FRR) which |
| 168 | * affect recursive and non-recursive routes respectively. |
| 169 | * |
| 170 | * LFA-FRR |
| 171 | * |
| 172 | * Consider the network topology below: |
| 173 | * |
| 174 | * C |
| 175 | * / \ |
| 176 | * X -- A --- B - Y |
| 177 | * | | |
| 178 | * D F |
| 179 | * \ / |
| 180 | * E |
| 181 | * |
| 182 | * all links are equal cost, traffic is passing from X to Y. the best path is |
| 183 | * X-A-B-Y. There are two alternative paths, one via C and one via E. An |
| 184 | * alternate path is considered to be loop free if no other router on that path |
| 185 | * would forward the traffic back to the sender. Consider router C, its best |
| 186 | * path to Y is via B, so if A were to send traffic destined to Y to C, then C |
| 187 | * would forward that traffic to B - this is a loop-free alternate path. In |
| 188 | * contrast consider router D. D's shortest path to Y is via A, so if A were to |
| 189 | * send traffic destined to Y via D, then D would send it back to A; this is |
| 190 | * not a loop-free alternate path. There are several points of note; |
| 191 | * - we are considering the pre-failure routing topology |
| 192 | * - any equal-cost multi-path between A and B is also a LFA path. |
| 193 | * - in order for A to calculate LFA paths it must be aware of the best-path |
| 194 | * to Y from the perspective of D. These calculations are thus limited to |
| 195 | * routing protocols that have a full view of the network topology, i.e. |
| 196 | * link-state DB protocols like OSPF or an SDN controller. LFA protected |
| 197 | * prefixes are thus non-recursive. |
| 198 | * |
| 199 | * LFA is specified as a 1 to 1 redundancy; a primary path has only one LFA |
| 200 | * (a.k.a. backup) path. To my knowledge this limitation is one of complexity |
| 201 | * in the calculation of and capacity planning using a 1-n redundancy. |
| 202 | * |
| 203 | * In the event that the link A-B fails, the alternate path via C can be used. |
| 204 | * In order to provide 'fast' failover in the event of a failure, the control |
| 205 | * plane will download both the primary and the backup path to the FIB. It is |
| 206 | * then a requirement of the FIB to perform the failover (a.k.a cutover) from |
| 207 | * the primary to the backup path as quickly as possible, and particularly |
| 208 | * without any other control-plane intervention. The expectation is cutover is |
| 209 | * less than 50 milli-seconds - a value allegedly from the VOIP QoS. Note that |
| 210 | * cutover time still includes the fault detection time, which in a vitalised |
| 211 | * environment could be the dominant factor. Failure detection can be either a |
| 212 | * link down, which will affect multiple paths on a multi-access interface, or |
| 213 | * via a specific path heartbeat (i.e. BFD). |
| 214 | * At this time VPP does not support LFA, that is it does not support the |
| 215 | * installation of a primary and backup path[s] for a route. However, it does |
| 216 | * support ECMP, and VPP FIB is designed to quickly remove failed paths from |
| 217 | * the ECMP set, however, it does not insert shared objects specific to the |
| 218 | * protected resource into the forwarding object graph, since this would incur |
| 219 | * a forwarding/performance cost. Failover time is thus route number dependent. |
| 220 | * Details are provided in the implementation section below. |
| 221 | * |
| 222 | * PIC |
| 223 | * |
| 224 | * PIC refers to the concept that the converge time should be independent of |
| 225 | * the number of prefixes/routes that are affected by the failure. PIC is |
| 226 | * therefore most appropriate when considering networks with large number of |
| 227 | * prefixes, i.e. BGP networks and thus recursive prefixes. There are several |
| 228 | * flavours of PIC covering different locations of protection and failure |
| 229 | * scenarios. An outline is given below, see the literature for more details: |
| 230 | * |
| 231 | * Y/16 - CE1 -- PE1---\ |
| 232 | * | \ P1---\ |
| 233 | * | \ PE3 -- CE3 - X/16 |
| 234 | * | - P2---/ |
| 235 | * Y/16 - CE2 -- PE2---/ |
| 236 | * |
| 237 | * CE = customer edge, PE = provider edge. external-BGP runs between customer |
| 238 | * and provider, internal-BGP runs between provider and provider. |
| 239 | * |
| 240 | * 1) iBGP PIC-core: consider traffic from CE1 to X/16 via CE3. On PE1 there is |
| 241 | * are routes; |
| 242 | * X/16 (and hundreds of thousands of others like it) |
| 243 | * via PE3 |
| 244 | * and |
| 245 | * PE3/32 (its loopback address) |
| 246 | * via 10.0.0.1 Link0 (this is P1) |
| 247 | * via 10.1.1.1 Link1 (this is P2) |
| 248 | * the failure is the loss of link0 or link1 |
| 249 | * As in all PIC scenarios, in order to provide prefix independent convergence |
| 250 | * it must be that the route for X/16 (and all other routes via PE3) do not |
| 251 | * need to be updated in the FIB. The FIB therefore needs to update a single |
| 252 | * object that is shared by all routes - once this shared object is updated, |
| 253 | * then all routes using it will be instantly updated to use the new forwarding |
| 254 | * information. In this case the shared object is the resolving route via PE3. |
| 255 | * Once the route via PE3 is updated via IGP (OSPF) convergence, then all |
| 256 | * recursive routes that resolve through it are also updated. VPP FIB |
| 257 | * implements this scenario via a recursive-adjacency. the X/16 and it sibling |
| 258 | * routes share a recursive-adjacency that links to/points at/stacks on the |
| 259 | * normal adjacency contributed by the route for PE3. Once this shared |
| 260 | * recursive adj is re-linked then all routes are switched to using the new |
| 261 | * forwarding information. This is shown below; |
| 262 | * |
| 263 | * pre-failure; |
| 264 | * X/16 --> R-ADJ-1 --> ADJ-1-PE3 (multi-path via P1 and P2) |
| 265 | * |
| 266 | * post-failure: |
| 267 | * X/16 --> R-ADJ-1 --> ADJ-2-PE3 (single path via P1) |
| 268 | * |
| 269 | * note that R-ADJ-1 (the recursive adj) remains in the forwarding graph, |
| 270 | * therefore X/16 (and all its siblings) is not updated. |
| 271 | * X/16 and its siblings share the recursive adj since they share the same |
| 272 | * path-list. It is the path-list object that contributes the recursive-adj |
| 273 | * (see next section for more details) |
| 274 | * |
| 275 | * |
| 276 | * 2) iBGP PIC-edge; Traffic from CE3 to Y/16. On PE3 there is are routes; |
| 277 | * Y/16 (and hundreds of thousands of others like it) |
| 278 | * via PE1 |
| 279 | * via PE2 |
| 280 | * and |
| 281 | * PE1/32 (PE1's loopback address) |
| 282 | * via 10.0.2.2 Link0 (this is P1) |
| 283 | * PE2/32 (PE2's loopback address) |
| 284 | * via 10.0.3.3 Link1 (this is P2) |
| 285 | * |
| 286 | * the failure is the loss of reachability to PE2. this could be either the |
| 287 | * loss of the link P2-PE2 or the loss of the node PE2. This is detected either |
| 288 | * by the withdrawal of the PE2's loopback route or by some form of failure |
| 289 | * detection (i.e. BFD). |
| 290 | * VPP FIB again provides PIC via the use of the shared recursive-adj. Y/16 and |
| 291 | * its siblings will again share a path-list for the list {PE1,PE2}, this |
| 292 | * path-list will contribute a multi-path-recursive-adj, i.e. a multi-path-adj |
| 293 | * with each choice therein being another adj; |
| 294 | * |
| 295 | * Y/16 -> RM-ADJ --> ADJ1 (for PE1) |
| 296 | * --> ADJ2 (for PE2) |
| 297 | * |
| 298 | * when the route for PE1 is withdrawn then the multi-path-recursive-adjacency |
| 299 | * is updated to be; |
| 300 | * |
| 301 | * Y/16 --> RM-ADJ --> ADJ1 (for PE1) |
| 302 | * --> ADJ1 (for PE1) |
| 303 | * |
| 304 | * that is both choices in the ECMP set are the same and thus all traffic is |
| 305 | * forwarded to PE1. Eventually the control plane will download a route update |
| 306 | * for Y/16 to be via PE1 only. At that time the situation will be: |
| 307 | * |
| 308 | * Y/16 -> R-ADJ --> ADJ1 (for PE1) |
| 309 | * |
| 310 | * In the scenario above we assumed that PE1 and PE2 are ECMP for Y/16. eBGP |
| 311 | * PIC core is also specified for the case were one PE is primary and the other |
| 312 | * backup - VPP FIB does not support that case at this time. |
| 313 | * |
| 314 | * 3) eBGP PIC Edge; Traffic from CE3 to Y/16. On PE1 there is are routes; |
| 315 | * Y/16 (and hundreds of thousands of others like it) |
| 316 | * via CE1 (primary) |
| 317 | * via PE2 (backup) |
| 318 | * and |
| 319 | * CE1 (this is an adj-fib) |
| 320 | * via 11.0.0.1 Link0 (this is CE1) << this is an adj-fib |
| 321 | * PE2 (PE2's loopback address) |
| 322 | * via 10.0.5.5 Link1 (this is link PE1-PE2) |
| 323 | * the failure is the loss of link0 to CE1. The failure can be detected by FIB |
| 324 | * either as a link down event or by the control plane withdrawing the connected |
| 325 | * prefix on the link0 (say 10.0.5.4/30). The latter works because the resolving |
| 326 | * entry is an adj-fib, so removing the connected will withdraw the adj-fib, and |
| 327 | * hence the recursive path becomes unresolved. The former is faster, |
| 328 | * particularly in the case of Inter-AS option A where there are many VLAN |
| 329 | * sub-interfaces on the PE-CE link, one for each VRF, and so the control plane |
| 330 | * must remove the connected prefix for each sub-interface to trigger PIC in |
| 331 | * each VRF. Note though that total PIC cutover time will depend on VRF scale |
| 332 | * with either trigger. |
| 333 | * Primary and backup paths in this eBGP PIC-edge scenario are calculated by |
| 334 | * BGP. Each peer is configured to always advertise its best external path to |
| 335 | * its iBGP peers. Backup paths therefore send traffic from the PE back into the |
| 336 | * core to an alternate PE. A PE may have multiple external paths, i.e. multiple |
| 337 | * directly connected CEs, it may also have multiple backup PEs, however there |
| 338 | * is no correlation between the two, so unlike LFA-FRR, the redundancy model is |
| 339 | * N-M; N primary paths are backed-up by M backup paths - only when all primary |
| 340 | * paths fail, then the cutover is performed onto the M backup paths. Note that |
| 341 | * PE2 must be suitably configured to forward traffic on its external path that |
| 342 | * was received from PE1. VPP FIB does not support external-internal-BGP (eiBGP) |
| 343 | * load-balancing. |
| 344 | * |
| 345 | * As with LFA-FRR the use of primary and backup paths is not currently |
| 346 | * supported, however, the use of a recursive-multi-path-adj, and a suitably |
| 347 | * constrained hashing algorithm to choose from the primary or backup path sets, |
| 348 | * would again provide the necessary shared object and hence the prefix scale |
| 349 | * independent cutover. |
| 350 | * |
| 351 | * Astute readers will recognise that both of the eBGP PIC scenarios refer only |
| 352 | * to a BGP free core. |
| 353 | * |
| 354 | * Fast convergence implementation options come in two flavours: |
| 355 | * 1) Insert switches into the data-path. The switch represents the protected |
| 356 | * resource. If the switch is 'on' the primary path is taken, otherwise |
| 357 | * the backup path is taken. Testing the switch in the data-path comes with |
| 358 | * an associated performance cost. A given packet may encounter more than |
| 359 | * one protected resource as it is forwarded. This approach minimises |
| 360 | * cutover times as packets will be forwarded on the backup path as soon |
| 361 | * as the protected resource is detected to be down and the single switch |
| 362 | * is tripped. However, it comes at a performance cost, which increases |
| 363 | * with each shared resource a packet encounters in the data-path. |
| 364 | * This approach is thus best suited to LFA-FRR where the protected routes |
| 365 | * are non-recursive (i.e. encounter few shared resources) and the |
| 366 | * expectation on cutover times is more stringent (<50msecs). |
| 367 | * 2) Update shared objects. Identify objects in the data-path, that are |
| 368 | * required to be present whether or not fast convergence is required (i.e. |
| 369 | * adjacencies) that can be shared by multiple routes. Create a dependency |
| 370 | * between these objects at the protected resource. When the protected |
| 371 | * resource fails, each of the shared objects is updated in a way that all |
| 372 | * users of it see a consistent change. This approach incurs no performance |
| 373 | * penalty as the data-path structure is unchanged, however, the cutover |
| 374 | * times are longer as more work is required when the resource fails. This |
| 375 | * scheme is thus more appropriate to recursive prefixes (where the packet |
| 376 | * will encounter multiple protected resources) and to fast-convergence |
| 377 | * technologies where the cutover times are less stringent (i.e. PIC). |
| 378 | * |
| 379 | * Implementation: |
| 380 | * --------------- |
| 381 | * |
| 382 | * Due to the requirements outlined above, not all routes known to FIB |
| 383 | * (e.g. adj-fibs) are installed in forwarding. However, should circumstances |
| 384 | * change, those routes will need to be added. This adds the requirement that |
| 385 | * a FIB maintains two tables per-VRF, per-AF (where a 'table' is indexed by |
| 386 | * prefix); the forwarding and non-forwarding tables. |
| 387 | * |
| 388 | * For DP speed in VPP we want the lookup in the forwarding table to directly |
| 389 | * result in the ADJ. So the two tables; one contains all the routes (a |
| 390 | * lookup therein yields a fib_entry_t), the other contains only the forwarding |
| 391 | * routes (a lookup therein yields an ip_adjacency_t). The latter is used by the |
| 392 | * DP. |
| 393 | * This trades memory for forwarding performance. A good trade-off in VPP's |
| 394 | * expected operating environments. |
| 395 | * |
| 396 | * Note these tables are keyed only by the prefix (and since there 2 two |
| 397 | * per-VRF, implicitly by the VRF too). The key for an adjacency is the |
| 398 | * tuple:{next-hop, address (and it's AF), interface, link/ether-type}. |
| 399 | * consider this curious, but allowed, config; |
| 400 | * |
| 401 | * set int ip addr 10.0.0.1/24 Gig0 |
| 402 | * set ip arp Gig0 10.0.0.2 dead.dead.dead |
| 403 | * # a host in that sub-net is routed via a better next hop (say it avoids a |
| 404 | * # big L2 domain) |
| 405 | * ip route add 10.0.0.2 Gig1 192.168.1.1 |
| 406 | * # this recursive should go via Gig1 |
| 407 | * ip route add 1.1.1.1/32 via 10.0.0.2 |
| 408 | * # this non-recursive should go via Gig0 |
| 409 | * ip route add 2.2.2.2/32 via Gig0 10.0.0.2 |
| 410 | * |
| 411 | * for the last route, the lookup for the path (via {Gig0, 10.0.0.2}) in the |
| 412 | * prefix table would not yield the correct result. To fix this we need a |
| 413 | * separate table for the adjacencies. |
| 414 | * |
| 415 | * - FIB data structures; |
| 416 | * |
| 417 | * fib_entry_t: |
| 418 | * - a representation of a route. |
| 419 | * - has a prefix. |
| 420 | * - it maintains an array of path-lists that have been contributed by the |
| 421 | * different sources |
| 422 | * - install an adjacency in the forwarding table contributed by the best |
| 423 | * source's path-list. |
| 424 | * |
| 425 | * fib_path_list_t: |
| 426 | * - a list of paths |
| 427 | * - path-lists may be shared between FIB entries. The path-lists are thus |
| 428 | * kept in a DB. The key is the combined description of the paths. We share |
| 429 | * path-lists when it will aid convergence to do so. Adding path-lists to |
| 430 | * this DB that are never shared, or are not shared by prefixes that are |
| 431 | * not subject to PIC, will increase the size of the DB unnecessarily and |
| 432 | * may lead to increased search times due to hash collisions. |
| 433 | * - the path-list contributes the appropriate adj for the entry in the |
| 434 | * forwarding table. The adj can be 'normal', multi-path or recursive, |
| 435 | * depending on the number of paths and their types. |
| 436 | * - since path-lists are shared there is only one instance of the multi-path |
| 437 | * adj that they [may] create. As such multi-path adjacencies do not need a |
| 438 | * separate DB. |
| 439 | * The path-list with recursive paths and the recursive adjacency that it |
| 440 | * contributes forms the backbone of the fast convergence architecture (as |
| 441 | * described previously). |
| 442 | * |
| 443 | * fib_path_t: |
| 444 | * - a description of how to forward the traffic (i.e. via {Gig1, K}). |
| 445 | * - the path describes the intent on how to forward. This differs from how |
| 446 | * the path resolves. I.e. it might not be resolved at all (since the |
| 447 | * interface is deleted or down). |
| 448 | * - paths have different types, most notably recursive or non-recursive. |
| 449 | * - a fib_path_t will contribute the appropriate adjacency object. It is from |
| 450 | * these contributions that the DP graph/chain for the route is built. |
| 451 | * - if the path is recursive and a recursion loop is detected, then the path |
| 452 | * will contribute the special DROP adjacency. This way, whilst the control |
| 453 | * plane graph is looped, the data-plane graph does not. |
| 454 | * |
| 455 | * we build a graph of these objects; |
| 456 | * |
| 457 | * fib_entry_t -> fib_path_list_t -> fib_path_t -> ... |
| 458 | * |
| 459 | * for recursive paths: |
| 460 | * |
| 461 | * fib_path_t -> fib_entry_t -> .... |
| 462 | * |
| 463 | * for non-recursive paths |
| 464 | * |
| 465 | * fib_path_t -> ip_adjacency_t -> interface |
| 466 | * |
| 467 | * These objects, which constitute the 'control plane' part of the FIB are used |
| 468 | * to represent the resolution of a route. As a whole this is referred to as the |
| 469 | * control plane graph. There is a separate DP graph to represent the forwarding |
| 470 | * of a packet. In the DP graph each object represents an action that is applied |
| 471 | * to a packet as it traverses the graph. For example, a lookup of a IP address |
| 472 | * in the forwarding table could result in the following graph: |
| 473 | * |
| 474 | * recursive-adj --> multi-path-adj --> interface_A |
| 475 | * --> interface_B |
| 476 | * |
| 477 | * A packet traversing this FIB DP graph would thus also traverse a VPP node |
| 478 | * graph of: |
| 479 | * |
| 480 | * ipX_recursive --> ipX_rewrite --> interface_A_tx --> etc |
| 481 | * |
| 482 | * The taxonomy of objects in a FIB graph is as follows, consider; |
| 483 | * |
| 484 | * A --> |
| 485 | * B --> D |
| 486 | * C --> |
| 487 | * |
| 488 | * Where A,B and C are (for example) routes that resolve through D. |
| 489 | * parent; D is the parent of A, B, and C. |
| 490 | * children: A, B, and C are children of D. |
| 491 | * sibling: A, B and C are siblings of one another. |
| 492 | * |
| 493 | * All shared objects in the FIB are reference counted. Users of these objects |
| 494 | * are thus expected to use the add_lock/unlock semantics (as one would |
| 495 | * normally use malloc/free). |
| 496 | * |
| 497 | * WALKS |
| 498 | * |
| 499 | * It is necessary to walk/traverse the graph forwards (entry to interface) to |
| 500 | * perform a collapse or build a recursive adj and backwards (interface |
| 501 | * to entry) to perform updates, i.e. when interface state changes or when |
| 502 | * recursive route resolution updates occur. |
| 503 | * A forward walk follows simply by navigating an object's parent pointer to |
| 504 | * access its parent object. For objects with multiple parents (e.g. a |
| 505 | * path-list), each parent is walked in turn. |
| 506 | * To support back-walks direct dependencies are maintained between objects, |
| 507 | * i.e. in the relationship, {A, B, C} --> D, then object D will maintain a list |
| 508 | * of 'pointers' to its children {A, B, C}. Bare C-language pointers are not |
| 509 | * allowed, so a pointer is described in terms of an object type (i.e. entry, |
| 510 | * path-list, etc) and index - this allows the object to be retrieved from the |
| 511 | * appropriate pool. A list is maintained to achieve fast convergence at scale. |
| 512 | * When there are millions or recursive prefixes, it is very inefficient to |
| 513 | * blindly walk the tables looking for entries that were affected by a given |
| 514 | * topology change. The lowest hanging fruit when optimising is to remove |
| 515 | * actions that are not required, so all back-walks only traverse objects that |
| 516 | * are directly affected by the change. |
| 517 | * |
| 518 | * PIC Core and fast-reroute rely on FIB reacting quickly to an interface |
| 519 | * state change to update the multi-path-adjacencies that use this interface. |
| 520 | * An example graph is shown below: |
| 521 | * |
| 522 | * E_a --> |
| 523 | * E_b --> PL_2 --> P_a --> Interface_A |
| 524 | * ... --> P_c -\ |
| 525 | * E_k --> \ |
| 526 | * Interface_K |
| 527 | * / |
| 528 | * E_l --> / |
| 529 | * E_m --> PL_1 --> P_d -/ |
| 530 | * ... --> P_f --> Interface_F |
| 531 | * E_z --> |
| 532 | * |
| 533 | * E = fib_entry_t |
| 534 | * PL = fib_path_list_t |
| 535 | * P = fib_path_t |
| 536 | * The subscripts are arbitrary and serve only to distinguish object instances. |
| 537 | * This CP graph result in the following DP graph: |
| 538 | * |
| 539 | * M-ADJ-2 --> Interface_A |
| 540 | * \ |
| 541 | * -> Interface_K |
| 542 | * / |
| 543 | * M-ADJ-1 --> Interface_F |
| 544 | * |
| 545 | * M-ADJ = multi-path-adjacency. |
| 546 | * |
| 547 | * When interface K goes down a back-walk is started over its dependants in the |
| 548 | * control plane graph. This back-walk will reach PL_1 and PL_2 and result in |
| 549 | * the calculation of new adjacencies that have interface K removed. The walk |
| 550 | * will continue to the entry objects and thus the forwarding table is updated |
| 551 | * for each prefix with the new adjacency. The DP graph then becomes: |
| 552 | * |
| 553 | * ADJ-3 --> Interface_A |
| 554 | * |
| 555 | * ADJ-4 --> Interface_F |
| 556 | * |
| 557 | * The eBGP PIC scenarios described above relied on the update of a path-list's |
| 558 | * recursive-adjacency to provide the shared point of cutover. This is shown |
| 559 | * below |
| 560 | * |
| 561 | * E_a --> |
| 562 | * E_b --> PL_2 --> P_a --> E_44 --> PL_a --> P_b --> Interface_A |
| 563 | * ... --> P_c -\ |
| 564 | * E_k --> \ |
| 565 | * \ |
| 566 | * E_1 --> PL_k -> P_k --> Interface_K |
| 567 | * / |
| 568 | * E_l --> / |
| 569 | * E_m --> PL_1 --> P_d -/ |
| 570 | * ... --> P_f --> E_55 --> PL_e --> P_e --> Interface_E |
| 571 | * E_z --> |
| 572 | * |
| 573 | * The failure scenario is the removal of entry E_1 and thus the paths P_c and |
| 574 | * P_d become unresolved. To achieve PIC the two shared recursive path-lists, |
| 575 | * PL_1 and PL_2 must be updated to remove E_1 from the recursive-multi-path- |
| 576 | * adjacencies that they contribute, before any entry E_a to E_z is updated. |
| 577 | * This means that as the update propagates backwards (right to left) in the |
| 578 | * graph it must do so breadth first not depth first. Note this approach leads |
| 579 | * to convergence times that are dependent on the number of path-list and so |
| 580 | * the number of combinations of egress PEs - this is desirable as this |
| 581 | * scale is considerably lower than the number of prefixes. |
| 582 | * |
| 583 | * If we consider another section of the graph that is similar to the one |
| 584 | * shown above where there is another prefix E_2 in a similar position to E_1 |
| 585 | * and so also has many dependent children. It is reasonable to expect that a |
| 586 | * particular network failure may simultaneously render E_1 and E_2 unreachable. |
| 587 | * This means that the update to withdraw E_2 is download immediately after the |
| 588 | * update to withdraw E_1. It is a requirement on the FIB to not spend large |
| 589 | * amounts of time in a back-walk whilst processing the update for E_1, i.e. the |
| 590 | * back-walk must not reach as far as E_a and its siblings. Therefore, after the |
| 591 | * back-walk has traversed one generation (breadth first) to update all the |
| 592 | * path-lists it should be suspended/back-ground and further updates allowed |
| 593 | * to be handled. Once the update queue is empty, the suspended walks can be |
| 594 | * resumed. Note that in the case that multiple updates affect the same entry |
| 595 | * (say E_1) then this will trigger multiple similar walks, these are merged, |
| 596 | * so each child is updated only once. |
| 597 | * In the presence of more layers of recursion PIC is still a desirable |
| 598 | * feature. Consider an extension to the diagram above, where more recursive |
| 599 | * routes (E_100 -> E_200) are added as children of E_a: |
| 600 | * |
| 601 | * E_100 --> |
| 602 | * E_101 --> PL_3 --> P_j-\ |
| 603 | * ... \ |
| 604 | * E_199 --> E_a --> |
| 605 | * E_b --> PL_2 --> P_a --> E_44 --> ...etc.. |
| 606 | * ... --> P_c -\ |
| 607 | * E_k \ |
| 608 | * E_1 --> ...etc.. |
| 609 | * / |
| 610 | * E_l --> / |
| 611 | * E_m --> PL_1 --> P_d -/ |
| 612 | * ... --> P_e --> E_55 --> ...etc.. |
| 613 | * E_z --> |
| 614 | * |
| 615 | * To achieve PIC for the routes E_100->E_199, PL_3 needs to be updated before |
| 616 | * E_b -> E_z, a breadth first traversal at each level would not achieve this. |
| 617 | * Instead the walk must proceed intelligently. Children on PL_2 are sorted so |
| 618 | * those Entry objects that themselves have children appear first in the list, |
| 619 | * those without later. When an entry object is walked that has children, a |
| 620 | * walk of its children is pushed to the front background queue. The back |
| 621 | * ground queue is a priority queue. As the breadth first traversal proceeds |
| 622 | * across the dependent entry object E_a to E_k, when the first entry that does |
| 623 | * not have children is reached (E_b), the walk is suspended and placed at the |
| 624 | * back of the queue. Following this prioritisation method shared path-list |
| 625 | * updates are performed before all non-resolving entry objects. |
| 626 | * The CPU/core/thread that handles the updates is the same thread that handles |
| 627 | * the back-walks. Handling updates has a higher priority than making walk |
| 628 | * progress, so a walk is required to be interruptable/suspendable when new |
| 629 | * updates are available. |
| 630 | * !!! TODO - this section describes how walks should be not how they are !!! |
| 631 | * |
| 632 | * In the diagram above E_100 is an IP route, however, VPP has no restrictions |
| 633 | * on the type of object that can be a dependent of a FIB entry. Children of |
| 634 | * a FIB entry can be (and are) GRE & VXLAN tunnels endpoints, L2VPN LSPs etc. |
| 635 | * By including all object types into the graph and extending the back-walk, we |
| 636 | * can thus deliver fast convergence to technologies that overlay on an IP |
| 637 | * network. |
| 638 | * |
| 639 | * If having read all the above carefully you are still thinking; 'i don't need |
| 640 | * all this %&$* i have a route only I know about and I just need to jam it in', |
| 641 | * then fib_table_entry_special_add() is your only friend. |
| 642 | */ |
| 643 | |
| 644 | #ifndef __FIB_H__ |
| 645 | #define __FIB_H__ |
| 646 | |
| 647 | #include <vnet/fib/fib_table.h> |
| 648 | #include <vnet/fib/fib_entry.h> |
Neale Ranns | 0bfe5d8 | 2016-08-25 15:29:12 +0100 | [diff] [blame] | 649 | |
| 650 | #endif |