Neale Ranns | dfd3954 | 2020-11-09 10:09:42 +0000 | [diff] [blame] | 1 | .. _marknsweep: |
| 2 | |
| 3 | Mark and Sweep |
| 4 | -------------- |
| 5 | |
| 6 | The mark and sweep procedures, in FIB and in other subsystems, are |
| 7 | built for the purpose of recovering from a control plane crash. |
| 8 | |
| 9 | In routing if the control plane (CP) crashes, when it restarts, the network |
| 10 | topology may have changed. This means that some of the routes that |
| 11 | were programmed in the FIB may no longer be needed, and perhaps some |
| 12 | new ones are. If the CP were simply to insert all the new routes it |
| 13 | learned after it restarts, then FIB could be left with old routes that |
| 14 | never get removed, this would be bigly bad. |
| 15 | |
| 16 | At a high level the requirement is to delete routes from the old set |
| 17 | that are not present in the new set; 'delete the diff' as it might |
| 18 | be colloquially known. |
| 19 | |
| 20 | How should the control plane determine the old set? It could |
| 21 | conceivably read back the FIB from VPP. But this presents two |
| 22 | problems, firstly, it could be a large set of routes, numbering in the |
| 23 | millions, this is not an efficient mechanism and not one one wants to |
| 24 | perform at a point when the router is trying to converge |
| 25 | ASAP. Secondly it represents a 'source of truth' inversion. The |
| 26 | routing plane is the source of truth, not forwarding. Routing should |
| 27 | not receive its 'input' from the layers below. Thirdly, on a practical |
| 28 | note, the reading of VPP data structures to glean this sort of |
| 29 | accurate information, would only happen in this scenario, i.e. it's |
| 30 | not well tested and therefore not particularly reliable (see point 2). |
| 31 | |
| 32 | Enter 'mark and sweep' or m-n-s (not to be confused with the retail |
| 33 | giant) as it's affectionately known. |
| 34 | |
| 35 | The Mark and Sweep algorithm proceeds in three steps: |
| 36 | |
| 37 | - Step 1; the CP declares to VPP that it wants to begin the process |
| 38 | (i.e. it has just restarted). At this point VPP will iterate through |
| 39 | all the objects that the CP owns and 'mark' then as being |
| 40 | stale. This process effectively declares a new 'epoch', a barrier in |
| 41 | time that separates the old objects from the new. |
| 42 | - Step 2; The CP downloads all of its new objects. If one of these new |
| 43 | CP objects matches (has the same key as) an existing object, then |
| 44 | the CP add is considered an update, and the object's stale state is |
| 45 | removed. |
| 46 | - Step 3: The CP declares it has 'converged'; it has no more updates |
| 47 | to give (at this time). VPP will then again iterate through all the |
| 48 | CP's objects and remove those that do not belong to the new epoch, |
| 49 | i.e. those that are still marked stale. |
| 50 | |
| 51 | After step 3, the CP and VPP databases are in sync. |
| 52 | |
| 53 | The cost of the process was to download all the new routes again. This |
| 54 | is a highly-tuned and well-tested scenario. |
| 55 | |
| 56 | In VPP we use the synonym 'replace' to describe the mark-n-sweep |
| 57 | action in the API. We use this term because it refers to the goals of |
| 58 | the algorithm at a high level - the CP wants to replace the old DB |
| 59 | with a new one - but it does not specify the algorithm by which that |
| 60 | is achieved. One could equally perform this task by constructing a |
| 61 | brand new DB in VPP, and then swapping them when the CP |
| 62 | converges. Other subsystems may employ that approach, but FIB does |
| 63 | not. Updates are typically faster than adds, since the update is |
| 64 | likely a no-op, whereas a separate add would require the memory |
| 65 | allocator, which is the long pole in FIB additions. Additionally, it requires |
| 66 | twice the memory for a moment in time, which could be prohibitive when |
| 67 | the FIB is large. |
| 68 | |