John DeNisco | 06dcd45 | 2018-07-26 12:45:10 -0400 | [diff] [blame] | 1 | :orphan: |
| 2 | |
| 3 | .. _what-is-vector-packet-processing: |
| 4 | |
| 5 | ================================= |
| 6 | What is vector packet processing? |
| 7 | ================================= |
| 8 | |
| 9 | FD.io VPP is developed using vector packet processing concepts, as opposed to |
| 10 | scalar packet processing, these concepts are explained in the following sections. |
| 11 | |
| 12 | Vector packet processing is a common approach among high performance `Userspace |
| 13 | <https://en.wikipedia.org/wiki/User_space>`_ packet processing applications such |
| 14 | as developed with FD.io VPP and `DPDK |
| 15 | <https://en.wikipedia.org/wiki/Data_Plane_Development_Kit>`_. The scalar based |
| 16 | aproach tends to be favoured by Operating System `Kernel |
| 17 | <https://en.wikipedia.org/wiki/Kernel_(operating_system)>`_ Network Stacks and |
| 18 | Userspace stacks that don't have strict performance requirements. |
| 19 | |
| 20 | **Scalar Packet Processing** |
| 21 | |
| 22 | A scalar packet processing network stack typically processes one packet at a |
| 23 | time: an interrupt handling function takes a single packet from a Network |
| 24 | Inteface, and processes it through a set of functions: fooA calls fooB calls |
| 25 | fooC and so on. |
| 26 | |
| 27 | .. code-block:: none |
| 28 | |
| 29 | +---> fooA(packet1) +---> fooB(packet1) +---> fooC(packet1) |
| 30 | +---> fooA(packet2) +---> fooB(packet2) +---> fooC(packet2) |
| 31 | ... |
| 32 | +---> fooA(packet3) +---> fooB(packet3) +---> fooC(packet3) |
| 33 | |
| 34 | |
| 35 | Scalar packet processing is simple, but inefficent in these ways: |
| 36 | |
| 37 | * When the code path length exceeds the size of the Microprocessor's instruction |
| 38 | cache (I-cache), `thrashing |
| 39 | <https://en.wikipedia.org/wiki/Thrashing_(computer_science)>`_ occurs as the |
| 40 | Microprocessor is continually loading new instructions. In this model, each |
| 41 | packet incurs an identical set of I-cache misses. |
| 42 | * The associated deep call stack will also add load-store-unit pressure as |
| 43 | stack-locals fall out of the Microprocessor's Layer 1 Data Cache (D-cache). |
| 44 | |
| 45 | **Vector Packet Processing** |
| 46 | |
| 47 | In contrast, a vector packet processing network stack processes multiple packets |
| 48 | at a time, called 'vectors of packets' or simply a 'vector'. An interrupt |
| 49 | handling function takes the vector of packets from a Network Inteface, and |
| 50 | processes the vector through a set of functions: fooA calls fooB calls fooC and |
| 51 | so on. |
| 52 | |
| 53 | .. code-block:: none |
| 54 | |
| 55 | +---> fooA([packet1, +---> fooB([packet1, +---> fooC([packet1, +---> |
| 56 | packet2, packet2, packet2, |
| 57 | ... ... ... |
| 58 | packet256]) packet256]) packet256]) |
| 59 | |
| 60 | This approach fixes: |
| 61 | |
| 62 | * The I-cache thrashing problem described above, by ammoritizing the cost of |
| 63 | I-cache loads across multiple packets. |
| 64 | |
| 65 | * The ineffeciences associated with the deep call stack by recieving vectors |
| 66 | of up to 256 packets at a time from the Network Interface, and processes them |
| 67 | using a directed graph of node. The graph scheduler invokes one node dispatch |
| 68 | function at a time, restricting stack depth to a few stack frames. |
| 69 | |
| 70 | The further optimizations that this approaches enables are pipelining and |
| 71 | prefetching to minimize read latency on table data and parallelize packet loads |
| 72 | needed to process packets. |
| 73 | |