Blame - src/vnet/fib/fib.h - fdio/vpp

Neale Ranns	0bfe5d8	2016-08-25 15:29:12 +0100	1	/*
				2	* Copyright (c) 2016 Cisco and/or its affiliates.
				3	* Licensed under the Apache License, Version 2.0 (the "License");
				4	* you may not use this file except in compliance with the License.
				5	* You may obtain a copy of the License at:
				6	*
				7	* http://www.apache.org/licenses/LICENSE-2.0
				8	*
				9	* Unless required by applicable law or agreed to in writing, software
				10	* distributed under the License is distributed on an "AS IS" BASIS,
				11	* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
				12	* See the License for the specific language governing permissions and
				13	* limitations under the License.
				14	*/
				15	/**
				16	* \brief
				17	* An IP v4/6 independent FIB.
				18	*
				19	* The main functions provided by the FIB are as follows;
				20	*
				21	* - source priorities
				22	*
				23	* A route can be added to the FIB by more than one entity or source. Sources
				24	* include, but are not limited to, API, CLI, LISP, MAP, etc (for the full list
				25	* see fib_entry.h). Each source provides the forwarding information (FI) that
				26	* it has determined as required for that route. Since each source determines the
				27	* FI using different best path and loop prevention algorithms, it is not
				28	* correct for the FI of multiple sources to be combined. Instead the FIB must
				29	* choose to use the FI from only one source. This choice is based on a static
				30	* priority assignment. For example;
				31	* If a prefix is added as a result of interface configuration:
				32	* set interface address 192.168.1.1/24 GigE0
				33	* and then it is also added from the CLI
				34	* ip route 192.168.1.1/32 via 2.2.2.2/32
				35	* then the 'interface' source will prevail, and the route will remain as
				36	* 'local'.
				37	* The requirement of the FIB is to always install the FI from the winning
				38	* source and thus to maintain the FI added by losing sources so it can be
				39	* installed should the winning source be withdrawn.
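The source-priority rule above can be sketched as follows. This is a minimal illustration, not the VPP implementation: the `src_t` values, their ordering, and the `route_*` helpers are hypothetical names chosen for the example (the real source list is in fib_entry.h). Every source keeps its own FI, and only the highest-priority source present is selected.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical source identifiers, highest priority first.
// The real list and its ordering live in fib_entry.h.
typedef enum {
    SRC_INTERFACE = 0,
    SRC_API,
    SRC_CLI,
    SRC_ADJ,
    SRC_N
} src_t;

// One route: each source keeps its own FI, only the best one is installed.
typedef struct {
    int present[SRC_N]; // non-zero if the source has contributed FI
    int fi[SRC_N];      // opaque stand-in for the forwarding information
} route_t;

static void route_source_add(route_t *r, src_t s, int fi)
{
    assert((int) s >= 0 && (int) s < SRC_N);
    r->present[s] = 1;
    r->fi[s] = fi;
}

static void route_source_remove(route_t *r, src_t s)
{
    assert((int) s >= 0 && (int) s < SRC_N);
    r->present[s] = 0;
}

// Returns the winning source, or SRC_N if no source has contributed.
static src_t route_best_source(const route_t *r)
{
    for (int s = 0; s < SRC_N; s++) {
        if (r->present[s])
            return (src_t) s;
    }
    return SRC_N;
}
```

Because the losing source's FI stays in the `fi` array, withdrawing the winner only requires re-running `route_best_source`; nothing has to be re-learned from the control plane.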
				40	*
				41	* - adj-fib maintenance
				42	*
				43	* When ARP or ND discover a neighbour on a link, an adjacency forms for the
				44	* address of that neighbour. It is also required to insert, in the
				45	* appropriate FIB table, corresponding to the VRF for the link, an entry for
				46	* that neighbour. This entry is often referred to as an adj-fib. Adj-fibs
				47	* have a dedicated source; 'ADJ'.
				48	* The priority of the ADJ source is lower than most. This is so the following
				49	* config;
				50	* set interface address 192.168.1.1/32 GigE0
				51	* ip arp 192.168.1.2 GigE0 dead.dead.dead
				52	* ip route add 192.168.1.2 via 10.10.10.10 GigE1
				53	* will forward traffic for 192.168.1.2 via GigE1. That is the route added
				54	* by the control plane is favoured over the adjacency discovered by ARP.
				55	* The control plane, with its associated authentication, is considered the
				56	* authoritative source.
				57	* To counter the nefarious addition of adj-fib, through the nefarious injection
				58	* of adjacencies, the FIB is also required to ensure that only adj-fibs whose
				59	* less specific covering prefix is connected are installed in forwarding. This
				60	* requires the use of 'cover tracking', where a route maintains a dependency
				61	* relationship with the route that is its less specific cover. When this cover
				62	* changes (i.e. there is a new covering route) or the forwarding information
				63	* of the cover changes, then the covered route is notified.
				64	*
				65	* Overlapping sub-nets are not supported, so no adj-fib has multiple paths.
				66	* The control plane is expected to remove a prefix configured for an interface
				67	* before the interface changes VRF.
				68	* So while the following config is accepted:
				69	* set interface address 192.168.1.1/32 GigE0
				70	* ip arp 192.168.1.2 GigE0 dead.dead.dead
				71	* set interface ip table GigE0 2
				72	* it does not result in the desired behaviour.
				73	*
				74	* - attached export.
				75	*
				76	* Further to adj-fib maintenance above consider the following config:
				77	* set interface address 192.168.1.1/24 GigE0
				78	* ip route add table 2 192.168.1.0/24 GigE0
				79	* Traffic destined for 192.168.1.2 in table 2 will generate an ARP request
				80	* on GigE0. However, since GigE0 is in table 0, all adj-fibs will be added in
				81	* FIB 0. Hence all hosts in the sub-net are unreachable from table 2. To resolve
				82	* this, all adj-fib and local prefixes are exported (i.e. copied) from the
				83	* 'export' table 0, to the 'import' table 2. There can be many import tables
				84	* for a single export table.
				85	*
				86	* - recursive route resolution
				87	*
				88	* A recursive route is of the form:
				89	* 1.1.1.1/32 via 10.10.10.10
				90	* i.e. a route for which no egress interface is provided. In order to forward
				91	* traffic to 1.1.1.1/32 the FIB must therefore first determine how to forward
				92	* traffic to 10.10.10.10/32. This is recursive resolution.
				93	* Recursive resolution, just like normal resolution, proceeds via a longest
				94	* prefix match for the 'via-address' 10.10.10.10. Note it is only possible
				95	* to add routes via an address (i.e. a /32 or /128) not via a shorter mask
				96	* prefix. There is no use case for the latter.
				97	* Since recursive resolution proceeds via a longest prefix match, the entry
				98	* in the FIB that will resolve the recursive route, termed the via-entry, may
				99	* change as other routes are added to the FIB. Consider the recursive
				100	* route shown above, and this non-recursive route:
				101	* 10.10.10.0/24 via 192.168.16.1 GigE0
				102	* The entry for 10.10.10.0/24 is thus the resolving via-entry. If this entry is
				103	* modified, to say;
				104	* 10.10.10.0/24 via 192.16.1.3 GigE0
				105	* then packets for 1.1.1.1/32 must also be sent to the new next-hop.
				106	* Now consider the addition of;
				107	* 10.10.10.0/28 via 192.168.16.2 GigE0
				108	* The more specific /28 is a better longest prefix match and thus becomes the
				109	* via-entry. Removal of the /28 means the resolution will revert to the /24.
				110	* Tracking the changes in recursive resolution is a requirement of
				111	* the FIB. When the forwarding information of the via-entry changes a back-walk
				112	* is used to update dependent recursive routes. When new routes are added to
				113	* the table the cover tracking feature provides the necessary notifications to
				114	* the via-entry routes.
				115	* The adjacency constructed for 1.1.1.1/32 will be a recursive adjacency
				116	* whose next adjacency will be contributed from the via-entry. Maintaining
				117	* the validity of this recursive adjacency is a requirement of the FIB.
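To make the via-entry selection concrete, here is a small sketch of a longest prefix match over a flat IPv4 table. It is an illustration under simplifying assumptions: the linear scan, `prefix_t`, `lpm` and the `IP4` macro are all hypothetical and bear no relation to the lookup structures VPP actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Build a host-order IPv4 address from dotted-quad components.
#define IP4(a, b, c, d) \
    (((uint32_t) (a) << 24) | ((uint32_t) (b) << 16) | \
     ((uint32_t) (c) << 8) | (uint32_t) (d))

typedef struct {
    uint32_t addr; // network address, host byte order
    uint8_t len;   // mask length, 0..32
} prefix_t;

static uint32_t mask_of(uint8_t len)
{
    assert(len <= 32);
    return len == 0 ? 0u : 0xFFFFFFFFu << (32 - len);
}

static int prefix_covers(const prefix_t *p, uint32_t a)
{
    uint32_t m = mask_of(p->len);
    return (a & m) == (p->addr & m);
}

// Returns the index of the longest matching prefix (the via-entry for
// address a), or -1 if no prefix in the table covers it.
static int lpm(const prefix_t *table, size_t n, uint32_t a)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (prefix_covers(&table[i], a) &&
            (best < 0 || table[i].len > table[best].len))
            best = (int) i;
    }
    return best;
}
```

With only 10.10.10.0/24 in the table, the via-address 10.10.10.10 resolves through the /24; once 10.10.10.0/28 is added, the same lookup returns the /28, which is exactly the change the FIB has to notice and propagate to 1.1.1.1/32.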
				118	*
				119	* - recursive loop avoidance
				120	*
				121	* Consider this set of routes:
				122	* 1.1.1.1/32 via 2.2.2.2
				123	* 2.2.2.2/32 via 3.3.3.3
				124	* 3.3.3.3/32 via 1.1.1.1
				125	* this is termed a recursion loop - all of the routes in the loop are
				126	* unresolved in so far as they do not have a resolving adjacency, but each
				127	* is resolved because the via-entry is known. It is important here to note
				128	* the distinction between the control-plane objects and the data-plane objects
				129	* (more details in the implementation section). The control plane objects must
				130	* allow the loop to form (i.e. the graph becomes cyclic), however, the
				131	* data-plane absolutely must not allow the loop to form, otherwise the packet
				132	* would loop indefinitely and never egress the device - meltdown would follow.
				133	* The control plane must allow the loop to form, because when the loop breaks,
				134	* all members of the loop need to be updated. Forming the loop allows the
				135	* dependencies to be correctly set up to allow this to happen.
				136	* There is no limit to the depth of recursion supported by VPP so:
				137	* 9.9.9.100/32 via 9.9.9.99
				138	* 9.9.9.99/32 via 9.9.9.98
				139	* 9.9.9.98/32 via 9.9.9.97
				140	* ... turtles, turtles, turtles ...
				141	* 9.9.9.1/32 via 10.10.10.10 Gig0
				142	* is supported to as many layers of turtles as desired, however, when
				143	* back-walking a graph (in this case from 9.9.9.1/32 up toward 9.9.9.100/32)
				144	* a FIB needs to differentiate the case where the recursion is deep versus
				145	* the case where the recursion is looped. A simple method, employed by VPP FIB,
				146	* is to limit the number of steps. VPP FIB limit is 16. Typical BGP scenarios
				147	* in the wild do not exceed 3 (BGP Inter-AS option C).
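The step-limit idea can be sketched like this. The limit of 16 comes from the text above; everything else (the `child` array encoding, `back_walk`, `WALK_MAX_STEPS`) is a hypothetical simplification in which each route has at most one dependent child.

```c
#include <assert.h>
#include <stddef.h>

enum { WALK_MAX_STEPS = 16 }; // limit quoted in the text above

// child[i] is the index of the route that resolves via route i,
// or -1 if nothing depends on route i.
// Returns the number of routes visited, or -1 if the step limit was hit,
// which the caller treats as a recursion loop.
static int back_walk(const int *child, int start)
{
    assert(child != NULL && start >= 0);
    int steps = 0;
    for (int i = start; i != -1; i = child[i]) {
        if (steps == WALK_MAX_STEPS)
            return -1; // too deep, or looped: give up
        steps++;
    }
    return steps;
}
```

A limit of this kind cannot tell a 17-deep chain from a genuine loop, which is the trade-off the text accepts given that typical deployments stay far below the limit.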
				148	*
				149	* - Fast Convergence
				150	*
				151	* After a network topology change, the 'convergence' time is the time taken
				152	* for the router to complete a transition to forward traffic using the new
				153	* topology. The convergence time is therefore a summation of the time to;
				154	* - detect the failure.
				155	* - calculate the new 'best path' information
				156	* - download the new best paths to the data-plane.
				157	* - install those best paths in data-plane forwarding.
				158	* The last two points are of relevance to VPP architecture. The download API is
				159	* binary and batch, details are not discussed here. There is no HW component to
				160	* programme, installation time is bounded by the memory allocation and table
				161	* lookup and insert access times.
				162	*
				163	* 'Fast' convergence refers to a set of technologies that a FIB can employ to
				164	* completely or partially restore forwarding whilst the convergence actions
				165	* listed above are ongoing. Fast convergence technologies are further
				166	* sub-divided into Prefix Independent Convergence (PIC) and Loop Free
				167	* Alternate path Fast re-route (LFA-FRR or sometimes called IP-FRR) which
				168	* affect recursive and non-recursive routes respectively.
				169	*
				170	* LFA-FRR
				171	*
				172	* Consider the network topology below:
				173	*
				174	*           C
				175	*         /   \
				176	*   X -- A --- B - Y
				177	*        |     |
				178	*        D     F
				179	*         \   /
				180	*           E
				181	*
				182	* All links are equal cost, traffic is passing from X to Y. The best path is
				183	* X-A-B-Y. There are two alternative paths, one via C and one via E. An
				184	* alternate path is considered to be loop free if no other router on that path
				185	* would forward the traffic back to the sender. Consider router C, its best
				186	* path to Y is via B, so if A were to send traffic destined to Y to C, then C
				187	* would forward that traffic to B - this is a loop-free alternate path. In
				188	* contrast consider router D. D's shortest path to Y is via A, so if A were to
				189	* send traffic destined to Y via D, then D would send it back to A; this is
				190	* not a loop-free alternate path. There are several points of note;
				191	* - we are considering the pre-failure routing topology
				192	* - any equal-cost multi-path between A and B is also a LFA path.
				193	* - in order for A to calculate LFA paths it must be aware of the best-path
				194	* to Y from the perspective of D. These calculations are thus limited to
				195	* routing protocols that have a full view of the network topology, i.e.
				196	* link-state DB protocols like OSPF or an SDN controller. LFA protected
				197	* prefixes are thus non-recursive.
				198	*
				199	* LFA is specified as a 1 to 1 redundancy; a primary path has only one LFA
				200	* (a.k.a. backup) path. To my knowledge this limitation is one of complexity
				201	* in the calculation of and capacity planning using a 1-n redundancy.
				202	*
				203	* In the event that the link A-B fails, the alternate path via C can be used.
				204	* In order to provide 'fast' failover in the event of a failure, the control
				205	* plane will download both the primary and the backup path to the FIB. It is
				206	* then a requirement of the FIB to perform the failover (a.k.a cutover) from
				207	* the primary to the backup path as quickly as possible, and particularly
				208	* without any other control-plane intervention. The expectation is cutover is
				209	* less than 50 milliseconds - a value allegedly from the VOIP QoS. Note that
				210	* cutover time still includes the fault detection time, which in a virtualised
				211	* environment could be the dominant factor. Failure detection can be either a
				212	* link down, which will affect multiple paths on a multi-access interface, or
				213	* via a specific path heartbeat (i.e. BFD).
				214	* At this time VPP does not support LFA, that is it does not support the
				215	* installation of a primary and backup path[s] for a route. However, it does
				216	* support ECMP, and VPP FIB is designed to quickly remove failed paths from
				217	* the ECMP set, however, it does not insert shared objects specific to the
				218	* protected resource into the forwarding object graph, since this would incur
				219	* a forwarding/performance cost. Failover time is thus route number dependent.
				220	* Details are provided in the implementation section below.
				221	*
				222	* PIC
				223	*
				224	* PIC refers to the concept that the convergence time should be independent of
				225	* the number of prefixes/routes that are affected by the failure. PIC is
				226	* therefore most appropriate when considering networks with a large number of
				227	* prefixes, i.e. BGP networks and thus recursive prefixes. There are several
				228	* flavours of PIC covering different locations of protection and failure
				229	* scenarios. An outline is given below, see the literature for more details:
				230	*
				231	*   Y/16 - CE1 -- PE1---\
				232	*           | \          P1---\
				233	*           |  \               PE3 -- CE3 - X/16
				234	*           |   -        P2---/
				235	*   Y/16 - CE2 -- PE2---/
				236	*
				237	* CE = customer edge, PE = provider edge. external-BGP runs between customer
				238	* and provider, internal-BGP runs between provider and provider.
				239	*
				240	* 1) iBGP PIC-core: consider traffic from CE1 to X/16 via CE3. On PE1 there
				241	* are routes;
				242	* X/16 (and hundreds of thousands of others like it)
				243	* via PE3
				244	* and
				245	* PE3/32 (its loopback address)
				246	* via 10.0.0.1 Link0 (this is P1)
				247	* via 10.1.1.1 Link1 (this is P2)
				248	* the failure is the loss of link0 or link1
				249	* As in all PIC scenarios, in order to provide prefix independent convergence
				250	* it must be that the route for X/16 (and all other routes via PE3) do not
				251	* need to be updated in the FIB. The FIB therefore needs to update a single
				252	* object that is shared by all routes - once this shared object is updated,
				253	* then all routes using it will be instantly updated to use the new forwarding
				254	* information. In this case the shared object is the resolving route via PE3.
				255	* Once the route via PE3 is updated via IGP (OSPF) convergence, then all
				256	* recursive routes that resolve through it are also updated. VPP FIB
				257	* implements this scenario via a recursive-adjacency. The X/16 and its sibling
				258	* routes share a recursive-adjacency that links to/points at/stacks on the
				259	* normal adjacency contributed by the route for PE3. Once this shared
				260	* recursive adj is re-linked then all routes are switched to using the new
				261	* forwarding information. This is shown below;
				262	*
				263	* pre-failure;
				264	* X/16 --> R-ADJ-1 --> ADJ-1-PE3 (multi-path via P1 and P2)
				265	*
				266	* post-failure:
				267	* X/16 --> R-ADJ-1 --> ADJ-2-PE3 (single path via P1)
				268	*
				269	* note that R-ADJ-1 (the recursive adj) remains in the forwarding graph,
				270	* therefore X/16 (and all its siblings) is not updated.
				271	* X/16 and its siblings share the recursive adj since they share the same
				272	* path-list. It is the path-list object that contributes the recursive-adj
				273	* (see next section for more details)
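The shared-object argument can be illustrated with a small sketch. It assumes plain C pointers purely for brevity, and the `adj_t`, `r_adj_t`, `route_t` and `r_adj_relink` names are hypothetical; they are not the VPP types. The point is only that many routes reference one recursive adjacency, so a single re-link changes what all of them forward through.

```c
#include <assert.h>
#include <stddef.h>

// Stand-in for the adjacency contributed by the route for PE3.
typedef struct {
    int n_paths; // e.g. 2 = multi-path via P1 and P2, 1 = single path
} adj_t;

// The shared recursive adjacency: the one object updated on failure.
typedef struct {
    const adj_t *next;
} r_adj_t;

// A recursive route such as X/16; it is never touched during cutover.
typedef struct {
    const r_adj_t *via;
} route_t;

// Number of paths a packet for this route can currently use.
static int route_n_paths(const route_t *r)
{
    return r->via->next->n_paths;
}

// The single update that switches every route sharing the r_adj.
static void r_adj_relink(r_adj_t *ra, const adj_t *new_next)
{
    assert(ra != NULL && new_next != NULL);
    ra->next = new_next;
}
```

The cost of the cutover in this model is one pointer store, regardless of whether ten or several hundred thousand routes point at the shared recursive adjacency.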
				274	*
				275	*
				276	* 2) iBGP PIC-edge; Traffic from CE3 to Y/16. On PE3 there are routes;
				277	* Y/16 (and hundreds of thousands of others like it)
				278	* via PE1
				279	* via PE2
				280	* and
				281	* PE1/32 (PE1's loopback address)
				282	* via 10.0.2.2 Link0 (this is P1)
				283	* PE2/32 (PE2's loopback address)
				284	* via 10.0.3.3 Link1 (this is P2)
				285	*
				286	* the failure is the loss of reachability to PE2. this could be either the
				287	* loss of the link P2-PE2 or the loss of the node PE2. This is detected either
				288	* by the withdrawal of the PE2's loopback route or by some form of failure
				289	* detection (i.e. BFD).
				290	* VPP FIB again provides PIC via the use of the shared recursive-adj. Y/16 and
				291	* its siblings will again share a path-list for the list {PE1,PE2}, this
				292	* path-list will contribute a multi-path-recursive-adj, i.e. a multi-path-adj
				293	* with each choice therein being another adj;
				294	*
				295	* Y/16 -> RM-ADJ --> ADJ1 (for PE1)
				296	* --> ADJ2 (for PE2)
				297	*
				298	* when the route for PE2 is withdrawn then the multi-path-recursive-adjacency
				299	* is updated to be;
				300	*
				301	* Y/16 --> RM-ADJ --> ADJ1 (for PE1)
				302	* --> ADJ1 (for PE1)
				303	*
				304	* that is both choices in the ECMP set are the same and thus all traffic is
				305	* forwarded to PE1. Eventually the control plane will download a route update
				306	* for Y/16 to be via PE1 only. At that time the situation will be:
				307	*
				308	* Y/16 -> R-ADJ --> ADJ1 (for PE1)
				309	*
				310	* In the scenario above we assumed that PE1 and PE2 are ECMP for Y/16. eBGP
				311	* PIC core is also specified for the case where one PE is primary and the other
				312	* backup - VPP FIB does not support that case at this time.
				313	*
				314	* 3) eBGP PIC Edge; Traffic from CE3 to Y/16. On PE1 there are routes;
				315	* Y/16 (and hundreds of thousands of others like it)
				316	* via CE1 (primary)
				317	* via PE2 (backup)
				318	* and
				319	* CE1 (this is an adj-fib)
				320	* via 11.0.0.1 Link0 (this is CE1) << this is an adj-fib
				321	* PE2 (PE2's loopback address)
				322	* via 10.0.5.5 Link1 (this is link PE1-PE2)
				323	* the failure is the loss of link0 to CE1. The failure can be detected by FIB
				324	* either as a link down event or by the control plane withdrawing the connected
				325	* prefix on the link0 (say 10.0.5.4/30). The latter works because the resolving
				326	* entry is an adj-fib, so removing the connected will withdraw the adj-fib, and
				327	* hence the recursive path becomes unresolved. The former is faster,
				328	* particularly in the case of Inter-AS option A where there are many VLAN
				329	* sub-interfaces on the PE-CE link, one for each VRF, and so the control plane
				330	* must remove the connected prefix for each sub-interface to trigger PIC in
				331	* each VRF. Note though that total PIC cutover time will depend on VRF scale
				332	* with either trigger.
				333	* Primary and backup paths in this eBGP PIC-edge scenario are calculated by
				334	* BGP. Each peer is configured to always advertise its best external path to
				335	* its iBGP peers. Backup paths therefore send traffic from the PE back into the
				336	* core to an alternate PE. A PE may have multiple external paths, i.e. multiple
				337	* directly connected CEs, it may also have multiple backup PEs, however there
				338	* is no correlation between the two, so unlike LFA-FRR, the redundancy model is
				339	* N-M; N primary paths are backed-up by M backup paths - only when all primary
				340	* paths fail, then the cutover is performed onto the M backup paths. Note that
				341	* PE2 must be suitably configured to forward traffic on its external path that
				342	* was received from PE1. VPP FIB does not support external-internal-BGP (eiBGP)
				343	* load-balancing.
				344	*
				345	* As with LFA-FRR the use of primary and backup paths is not currently
				346	* supported, however, the use of a recursive-multi-path-adj, and a suitably
				347	* constrained hashing algorithm to choose from the primary or backup path sets,
				348	* would again provide the necessary shared object and hence the prefix scale
				349	* independent cutover.
				350	*
				351	* Astute readers will recognise that both of the eBGP PIC scenarios refer only
				352	* to a BGP free core.
				353	*
				354	* Fast convergence implementation options come in two flavours:
				355	* 1) Insert switches into the data-path. The switch represents the protected
				356	* resource. If the switch is 'on' the primary path is taken, otherwise
				357	* the backup path is taken. Testing the switch in the data-path comes with
				358	* an associated performance cost. A given packet may encounter more than
				359	* one protected resource as it is forwarded. This approach minimises
				360	* cutover times as packets will be forwarded on the backup path as soon
				361	* as the protected resource is detected to be down and the single switch
				362	* is tripped. However, it comes at a performance cost, which increases
				363	* with each shared resource a packet encounters in the data-path.
				364	* This approach is thus best suited to LFA-FRR where the protected routes
				365	* are non-recursive (i.e. encounter few shared resources) and the
				366	* expectation on cutover times is more stringent (<50msecs).
				367	* 2) Update shared objects. Identify objects in the data-path, that are
				368	* required to be present whether or not fast convergence is required (i.e.
				369	* adjacencies) that can be shared by multiple routes. Create a dependency
				370	* between these objects at the protected resource. When the protected
				371	* resource fails, each of the shared objects is updated in a way that all
				372	* users of it see a consistent change. This approach incurs no performance
				373	* penalty as the data-path structure is unchanged, however, the cutover
				374	* times are longer as more work is required when the resource fails. This
				375	* scheme is thus more appropriate to recursive prefixes (where the packet
				376	* will encounter multiple protected resources) and to fast-convergence
				377	* technologies where the cutover times are less stringent (i.e. PIC).
				378	*
				379	* Implementation:
				380	* ---------------
				381	*
				382	* Due to the requirements outlined above, not all routes known to FIB
				383	* (e.g. adj-fibs) are installed in forwarding. However, should circumstances
				384	* change, those routes will need to be added. This adds the requirement that
				385	* a FIB maintains two tables per-VRF, per-AF (where a 'table' is indexed by
				386	* prefix); the forwarding and non-forwarding tables.
				387	*
				388	* For DP speed in VPP we want the lookup in the forwarding table to directly
				389	* result in the ADJ. So there are two tables; one contains all the routes (a
				390	* lookup therein yields a fib_entry_t), the other contains only the forwarding
				391	* routes (a lookup therein yields an ip_adjacency_t). The latter is used by the
				392	* DP.
				393	* This trades memory for forwarding performance. A good trade-off in VPP's
				394	* expected operating environments.
				395	*
				396	* Note these tables are keyed only by the prefix (and since there are two
				397	* per-VRF, implicitly by the VRF too). The key for an adjacency is the
				398	* tuple:{next-hop, address (and its AF), interface, link/ether-type}.
				399	* Consider this curious, but allowed, config;
				400	*
				401	* set int ip addr 10.0.0.1/24 Gig0
				402	* set ip arp Gig0 10.0.0.2 dead.dead.dead
				403	* # a host in that sub-net is routed via a better next hop (say it avoids a
				404	* # big L2 domain)
				405	* ip route add 10.0.0.2 Gig1 192.168.1.1
				406	* # this recursive should go via Gig1
				407	* ip route add 1.1.1.1/32 via 10.0.0.2
				408	* # this non-recursive should go via Gig0
				409	* ip route add 2.2.2.2/32 via Gig0 10.0.0.2
				410	*
				411	* for the last route, the lookup for the path (via {Gig0, 10.0.0.2}) in the
				412	* prefix table would not yield the correct result. To fix this we need a
				413	* separate table for the adjacencies.
				414	*
				415	* - FIB data structures;
				416	*
				417	* fib_entry_t:
				418	* - a representation of a route.
				419	* - has a prefix.
				420	* - it maintains an array of path-lists that have been contributed by the
				421	* different sources
				422	* - installs an adjacency in the forwarding table contributed by the best
				423	* source's path-list.
				424	*
				425	* fib_path_list_t:
				426	* - a list of paths
				427	* - path-lists may be shared between FIB entries. The path-lists are thus
				428	* kept in a DB. The key is the combined description of the paths. We share
				429	* path-lists when it will aid convergence to do so. Adding path-lists to
				430	* this DB that are never shared, or are not shared by prefixes that are
				431	* not subject to PIC, will increase the size of the DB unnecessarily and
				432	* may lead to increased search times due to hash collisions.
				433	* - the path-list contributes the appropriate adj for the entry in the
				434	* forwarding table. The adj can be 'normal', multi-path or recursive,
				435	* depending on the number of paths and their types.
				436	* - since path-lists are shared there is only one instance of the multi-path
				437	* adj that they [may] create. As such multi-path adjacencies do not need a
				438	* separate DB.
				439	* The path-list with recursive paths and the recursive adjacency that it
				440	* contributes forms the backbone of the fast convergence architecture (as
				441	* described previously).
				442	*
				443	* fib_path_t:
				444	* - a description of how to forward the traffic (i.e. via {Gig1, K}).
				445	* - the path describes the intent on how to forward. This differs from how
				446	* the path resolves. I.e. it might not be resolved at all (since the
				447	* interface is deleted or down).
				448	* - paths have different types, most notably recursive or non-recursive.
				449	* - a fib_path_t will contribute the appropriate adjacency object. It is from
				450	* these contributions that the DP graph/chain for the route is built.
				451	* - if the path is recursive and a recursion loop is detected, then the path
				452	* will contribute the special DROP adjacency. This way, whilst the control
				453	* plane graph is looped, the data-plane graph is not.
				454	*
				455	* we build a graph of these objects;
				456	*
				457	* fib_entry_t -> fib_path_list_t -> fib_path_t -> ...
				458	*
				459	* for recursive paths:
				460	*
				461	* fib_path_t -> fib_entry_t -> ....
				462	*
				463	* for non-recursive paths
				464	*
				465	* fib_path_t -> ip_adjacency_t -> interface
				466	*
				467	* These objects, which constitute the 'control plane' part of the FIB are used
				468	* to represent the resolution of a route. As a whole this is referred to as the
				469	* control plane graph. There is a separate DP graph to represent the forwarding
				470	* of a packet. In the DP graph each object represents an action that is applied
				471	* to a packet as it traverses the graph. For example, a lookup of an IP address
				472	* in the forwarding table could result in the following graph:
				473	*
				474	* recursive-adj --> multi-path-adj --> interface_A
				475	* --> interface_B
				476	*
				477	* A packet traversing this FIB DP graph would thus also traverse a VPP node
				478	* graph of:
				479	*
				480	* ipX_recursive --> ipX_rewrite --> interface_A_tx --> etc
				481	*
				482	* The taxonomy of objects in a FIB graph is as follows, consider;
				483	*
				484	* A -->
				485	* B --> D
				486	* C -->
				487	*
				488	* Where A,B and C are (for example) routes that resolve through D.
				489	* parent; D is the parent of A, B, and C.
				490	* children: A, B, and C are children of D.
				491	* sibling: A, B and C are siblings of one another.
				492	*
				493	* All shared objects in the FIB are reference counted. Users of these objects
				494	* are thus expected to use the add_lock/unlock semantics (as one would
				495	* normally use malloc/free).
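The lock/unlock discipline mentioned above might look like the following sketch. The `shared_obj_t` type and its helpers are hypothetical and only illustrate the reference-counting pattern; they are not the FIB's own functions.

```c
#include <assert.h>

typedef struct {
    unsigned locks; // number of users currently holding a reference
    int live;       // non-zero while the object is still allocated
} shared_obj_t;

static void shared_obj_init(shared_obj_t *o)
{
    o->locks = 0;
    o->live = 1;
}

// Take a reference; every user of a shared object does this first.
static void shared_obj_lock(shared_obj_t *o)
{
    assert(o->live);
    o->locks++;
}

// Drop a reference; the last unlock releases the object.
static void shared_obj_unlock(shared_obj_t *o)
{
    assert(o->live && o->locks > 0);
    if (--o->locks == 0)
        o->live = 0; // a real implementation would return it to its pool
}
```

As with malloc/free, each lock must be paired with exactly one unlock; a shared path-list, for example, would stay alive until the last entry using it lets go.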
				496	*
				497	* WALKS
				498	*
				499	* It is necessary to walk/traverse the graph forwards (entry to interface) to
				500	* perform a collapse or build a recursive adj and backwards (interface
				501	* to entry) to perform updates, i.e. when interface state changes or when
				502	* recursive route resolution updates occur.
				503	* A forward walk follows simply by navigating an object's parent pointer to
				504	* access its parent object. For objects with multiple parents (e.g. a
				505	* path-list), each parent is walked in turn.
				506	* To support back-walks direct dependencies are maintained between objects,
				507	* i.e. in the relationship, {A, B, C} --> D, then object D will maintain a list
				508	* of 'pointers' to its children {A, B, C}. Bare C-language pointers are not
				509	* allowed, so a pointer is described in terms of an object type (i.e. entry,
				510	* path-list, etc) and index - this allows the object to be retrieved from the
				511	* appropriate pool. A list is maintained to achieve fast convergence at scale.
				512	* When there are millions of recursive prefixes, it is very inefficient to
				513	* blindly walk the tables looking for entries that were affected by a given
				514	* topology change. The lowest hanging fruit when optimising is to remove
				515	* actions that are not required, so all back-walks only traverse objects that
				516	* are directly affected by the change.
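The "object type plus index" form of pointer described above can be sketched as below. The names (`node_type_t`, `node_ptr_t`, `node_get`) and the fixed-size static pools are assumptions made for the example; they only show how such a reference is resolved against the pool for its type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { NODE_ENTRY, NODE_PATH_LIST } node_type_t;

// A 'pointer' to a graph object: which pool, and which slot in it.
typedef struct {
    node_type_t type;
    uint32_t index;
} node_ptr_t;

typedef struct { int prefix_id; } entry_t;
typedef struct { int n_paths; } path_list_t;

enum { POOL_SIZE = 8 };
static entry_t entry_pool[POOL_SIZE];
static path_list_t path_list_pool[POOL_SIZE];

static node_ptr_t node_ptr_make(node_type_t type, uint32_t index)
{
    assert(index < (uint32_t) POOL_SIZE);
    node_ptr_t p = { type, index };
    return p;
}

// Resolve a typed index to the object it names, or NULL if out of range.
static void *node_get(node_ptr_t p)
{
    if (p.index >= (uint32_t) POOL_SIZE)
        return NULL;
    switch (p.type) {
    case NODE_ENTRY:
        return &entry_pool[p.index];
    case NODE_PATH_LIST:
        return &path_list_pool[p.index];
    }
    return NULL;
}
```

One common reason to prefer indices over bare pointers is that a reference stays valid if the pool's backing memory is reallocated; the text itself does not state the motivation, so treat that as an assumption.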
				517	*
				518	* PIC Core and fast-reroute rely on FIB reacting quickly to an interface
				519	* state change to update the multi-path-adjacencies that use this interface.
				520	* An example graph is shown below:
				521	*
				522	*   E_a -->
				523	*   E_b --> PL_2 --> P_a --> Interface_A
				524	*   ...          --> P_c -\
				525	*   E_k -->                \
				526	*                           Interface_K
				527	*                          /
				528	*   E_l -->               /
				529	*   E_m --> PL_1 --> P_d -/
				530	*   ...          --> P_f --> Interface_F
				531	*   E_z -->
				532	*
				533	* E = fib_entry_t
				534	* PL = fib_path_list_t
				535	* P = fib_path_t
				536	* The subscripts are arbitrary and serve only to distinguish object instances.
				537	* This CP graph results in the following DP graph:
				538	*
				539	*   M-ADJ-2 --> Interface_A
				540	*           \
				541	*            -> Interface_K
				542	*           /
				543	*   M-ADJ-1 --> Interface_F
				544	*
				545	* M-ADJ = multi-path-adjacency.
				546	*
				547	* When interface K goes down a back-walk is started over its dependants in the
				548	* control plane graph. This back-walk will reach PL_1 and PL_2 and result in
				549	* the calculation of new adjacencies that have interface K removed. The walk
				550	* will continue to the entry objects and thus the forwarding table is updated
				551	* for each prefix with the new adjacency. The DP graph then becomes:
				552	*
				553	* ADJ-3 --> Interface_A
				554	*
				555	* ADJ-4 --> Interface_F
				556	*
				557	* The eBGP PIC scenarios described above relied on the update of a path-list's
				558	* recursive-adjacency to provide the shared point of cutover. This is shown
				559	* below
				560	*
				561	*   E_a -->
				562	*   E_b --> PL_2 --> P_a --> E_44 --> PL_a --> P_b --> Interface_A
				563	*   ...          --> P_c -\
				564	*   E_k -->                \
				565	*                           \
				566	*                            E_1 --> PL_k -> P_k --> Interface_K
				567	*                           /
				568	*   E_l -->                /
				569	*   E_m --> PL_1 --> P_d -/
				570	*   ...          --> P_f --> E_55 --> PL_e --> P_e --> Interface_E
				571	*   E_z -->
				572	*
				573	* The failure scenario is the removal of entry E_1, which leaves the paths P_c
				574	* and P_d unresolved. To achieve PIC, the two shared recursive path-lists, PL_1
				575	* and PL_2, must be updated to remove E_1 from the
				576	* recursive-multi-path-adjacencies that they contribute before any entry E_a
				577	* to E_z is updated. This means that, as the update propagates backwards (right
				578	* to left) in the graph, it must do so breadth first, not depth first. Note
				579	* that this approach leads to convergence times that depend on the number of
				580	* path-lists, and so on the number of combinations of egress PEs. This is
				581	* desirable, as that scale is considerably lower than the number of prefixes.
				582	*
				583	* Now consider another section of the graph, similar to the one shown above,
				584	* in which another prefix E_2 sits in a similar position to E_1 and so also
				585	* has many dependent children. It is reasonable to expect that a particular
				586	* network failure may simultaneously render E_1 and E_2 unreachable. This
				587	* means that the update to withdraw E_2 is downloaded immediately after the
				588	* update to withdraw E_1. It is a requirement on the FIB to not spend large
				589	* amounts of time in a back-walk whilst processing the update for E_1, i.e. the
				590	* back-walk must not reach as far as E_a and its siblings. Therefore, after the
				591	* back-walk has traversed one generation (breadth first) to update all the
				592	* path-lists, it should be suspended (backgrounded) and further updates allowed
				593	* to be handled. Once the update queue is empty, the suspended walks can be
				594	* resumed. Note that when multiple updates affect the same entry (say E_1),
				595	* they trigger multiple similar walks; these are merged so that each child is
				596	* updated only once.
				597	* In the presence of more layers of recursion, PIC is still a desirable
				598	* feature. Consider an extension to the diagram above, where more recursive
				599	* routes (E_100 -> E_199) are added as children of E_a:
				600	*
				601	*   E_100 -->
				602	*   E_101 --> PL_3 --> P_j-\
				603	*   ...                     \
				604	*   E_199 -->                E_a -->
				605	*                            E_b --> PL_2 --> P_a --> E_44 --> ...etc..
				606	*                            ...          --> P_c -\
				607	*                            E_k                    \
				608	*                                                    E_1 --> ...etc..
				609	*                                                   /
				610	*                            E_l -->               /
				611	*                            E_m --> PL_1 --> P_d -/
				612	*                            ...          --> P_e --> E_55 --> ...etc..
				613	*                            E_z -->
				614	*
				615	* To achieve PIC for the routes E_100 -> E_199, PL_3 needs to be updated before
				616	* E_b -> E_z; a breadth first traversal at each level would not achieve this.
				617	* Instead the walk must proceed intelligently. Children on PL_2 are sorted so
				618	* those entry objects that themselves have children appear first in the list,
				619	* those without later. When an entry object is walked that has children, a
				620	* walk of its children is pushed to the front of the background queue. The
				621	* background queue is a priority queue. As the breadth first traversal proceeds
				622	* across the dependent entry objects E_a to E_k, when the first entry that does
				623	* not have children is reached (E_b), the walk is suspended and placed at the
				624	* back of the queue. Following this prioritisation method, shared path-list
				625	* updates are performed before all non-resolving entry objects.
				626	* The CPU/core/thread that handles the updates is the same thread that handles
				627	* the back-walks. Handling updates has a higher priority than making walk
				628	* progress, so a walk is required to be interruptible/suspendable when new
				629	* updates are available.
				630	* !!! TODO - this section describes how walks should be not how they are !!!
				631	*
				632	* In the diagram above E_100 is an IP route, however, VPP has no restrictions
				633	* on the type of object that can be a dependent of a FIB entry. Children of
				634	* a FIB entry can be (and are) GRE & VXLAN tunnel endpoints, L2VPN LSPs etc.
				635	* By including all object types into the graph and extending the back-walk, we
				636	* can thus deliver fast convergence to technologies that overlay on an IP
				637	* network.
				638	*
				639	* If, having read all the above carefully, you are still thinking 'I don't need
				640	* all this %&$*, I have a route only I know about and I just need to jam it in',
				641	* then fib_table_entry_special_add() is your only friend.
				642	*/
				643
				644	#ifndef __FIB_H__
				645	#define __FIB_H__
				646
				647	#include <vnet/fib/fib_table.h>
				648	#include <vnet/fib/fib_entry.h>
Neale Ranns	0bfe5d8	2016-08-25 15:29:12 +0100	[diff] [blame]	649
				650	#endif
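
The breadth-first back-walk described in the comment above can be illustrated with a small standalone sketch. This is not VPP code: toy_node_t, toy_add_child(), toy_back_walk() and toy_demo_order() are hypothetical names invented for this example, and the real FIB uses its own node and walk types. The sketch assumes a graph in which E_1 has the dependents PL_2 and PL_1, which in turn have the dependent entries E_a, E_b and E_l, E_m. It shows only two properties from the text: the shared path-lists are visited before any dependent entry, and a walk limited to one generation stops after the path-lists.

```c
#include <stddef.h>
#include <string.h>

enum { TOY_MAX_CHILDREN = 8, TOY_MAX_NODES = 16 };

/* Hypothetical stand-in for any object in the FIB dependency graph. */
typedef struct toy_node {
    const char *name;
    struct toy_node *children[TOY_MAX_CHILDREN]; /* dependents reached by a back-walk */
    size_t n_children;
    int visited; /* set when first reached, so a dependent is updated only once */
} toy_node_t;

void toy_add_child(toy_node_t *parent, toy_node_t *child)
{
    if (parent->n_children < TOY_MAX_CHILDREN)
        parent->children[parent->n_children++] = child;
}

/*
 * Breadth-first back-walk from start, limited to max_gen generations.
 * Visited dependents are written to order[] (at most max of them) and
 * the number written is returned.
 */
size_t toy_back_walk(toy_node_t *start, unsigned max_gen,
                     toy_node_t **order, size_t max)
{
    toy_node_t *queue[TOY_MAX_NODES];
    unsigned gen[TOY_MAX_NODES];
    size_t head = 0, tail = 0, n = 0;

    start->visited = 1;
    queue[tail] = start;
    gen[tail++] = 0;

    while (head < tail) {
        toy_node_t *node = queue[head];
        unsigned g = gen[head++];

        if (g == max_gen)
            continue; /* a real walk would be suspended here and resumed later */

        for (size_t i = 0; i < node->n_children; i++) {
            toy_node_t *child = node->children[i];

            if (child->visited || n == max || tail == TOY_MAX_NODES)
                continue;
            child->visited = 1;
            order[n++] = child;
            queue[tail] = child;
            gen[tail++] = g + 1;
        }
    }
    return n;
}

/* Builds the assumed example graph, walks it from E_1 and returns the
 * visit order as a space-separated string held in a static buffer. */
const char *toy_demo_order(unsigned max_gen)
{
    static char buf[128];
    toy_node_t e_1 = { .name = "E_1" };
    toy_node_t pl_2 = { .name = "PL_2" }, pl_1 = { .name = "PL_1" };
    toy_node_t e_a = { .name = "E_a" }, e_b = { .name = "E_b" };
    toy_node_t e_l = { .name = "E_l" }, e_m = { .name = "E_m" };
    toy_node_t *order[TOY_MAX_NODES];
    size_t n;

    toy_add_child(&e_1, &pl_2);
    toy_add_child(&e_1, &pl_1);
    toy_add_child(&pl_2, &e_a);
    toy_add_child(&pl_2, &e_b);
    toy_add_child(&pl_1, &e_l);
    toy_add_child(&pl_1, &e_m);

    n = toy_back_walk(&e_1, max_gen, order, TOY_MAX_NODES);

    buf[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        if (i > 0)
            strcat(buf, " ");
        strcat(buf, order[i]->name);
    }
    return buf;
}
```

Calling toy_demo_order(1) walks a single generation and visits only PL_2 and PL_1, which corresponds to the point at which the text says the walk should be suspended. A larger limit lets the same walk continue on to the entries. The priority ordering of the background queue is deliberately left out of this sketch.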