.. _scalar_vector:

==================================
Scalar vs Vector packet processing
==================================

FD.io VPP is developed using vector packet processing, as opposed to
scalar packet processing.

Vector packet processing is a common approach among high performance packet
processing applications such as FD.io VPP and `DPDK <https://en.wikipedia.org/wiki/Data_Plane_Development_Kit>`_.
The scalar-based approach tends to be favoured by network stacks that
don't necessarily have strict performance requirements.

**Scalar Packet Processing**

A scalar packet processing network stack typically processes one packet at a
time: an interrupt handling function takes a single packet from a Network
Interface, and processes it through a set of functions: fooA calls fooB calls
fooC and so on.

.. code-block:: none

   +---> fooA(packet1) +---> fooB(packet1) +---> fooC(packet1)
   +---> fooA(packet2) +---> fooB(packet2) +---> fooC(packet2)
   ...
   +---> fooA(packet3) +---> fooB(packet3) +---> fooC(packet3)

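The pattern can be sketched in C. The code below only illustrates the call
shape and is not VPP code; the packet type, the handler name and the
fooA/fooB/fooC bodies are hypothetical stand-ins.

.. code-block:: c

   /* Illustrative scalar-processing sketch (not VPP code).  Each packet
    * walks the entire call chain before the next packet is touched, so the
    * full instruction footprint is re-fetched for every packet. */
   #include <stdint.h>

   typedef struct { uint8_t data[2048]; uint32_t len; } packet_t;

   /* Each foo function stands for one processing stage (parse, lookup,
    * rewrite/transmit); here they simply forward the packet. */
   static void fooC (packet_t *p) { (void) p; }
   static void fooB (packet_t *p) { fooC (p); }
   static void fooA (packet_t *p) { fooB (p); }

   /* Hypothetical interrupt handler: one packet per invocation. */
   void rx_interrupt_handler (packet_t *p)
   {
     fooA (p);   /* fooA -> fooB -> fooC, one packet at a time */
   }
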
Scalar packet processing is simple, but inefficient in these ways:

* When the code path length exceeds the size of the Microprocessor's instruction
  cache (I-cache), `thrashing
  <https://en.wikipedia.org/wiki/Thrashing_(computer_science)>`_ occurs as the
  Microprocessor is continually loading new instructions. In this model, each
  packet incurs an identical set of I-cache misses.
* The associated deep call stack will also add load-store-unit pressure as
  stack-locals fall out of the Microprocessor's Layer 1 Data Cache (D-cache).

**Vector Packet Processing**

In contrast, a vector packet processing network stack processes multiple packets
at a time, called a 'vector of packets' or simply a 'vector'. An interrupt
handling function takes a vector of packets from a Network Interface, and
processes the vector through a set of functions: fooA calls fooB calls fooC and
so on.

.. code-block:: none

   +---> fooA([packet1, +---> fooB([packet1, +---> fooC([packet1, +--->
               packet2,             packet2,             packet2,
               ...                  ...                  ...
               packet256])          packet256])          packet256])

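A corresponding C sketch is shown below; the names and types are again
hypothetical and the code only illustrates the shape, assuming the driver
hands over an array of up to 256 packets at once.

.. code-block:: c

   /* Illustrative vector-processing sketch (not VPP code).  Each stage
    * loops over the whole vector, so its instructions stay warm in the
    * I-cache for up to 256 packets instead of being re-fetched per packet. */
   #include <stdint.h>

   typedef struct { uint8_t data[2048]; uint32_t len; } packet_t;

   static void
   fooC (packet_t **pkts, uint32_t n)
   {
     for (uint32_t i = 0; i < n; i++)
       (void) pkts[i];               /* e.g. enqueue pkts[i] for transmit */
   }

   static void
   fooB (packet_t **pkts, uint32_t n)
   {
     for (uint32_t i = 0; i < n; i++)
       (void) pkts[i];               /* e.g. table lookup for pkts[i] */
     fooC (pkts, n);
   }

   static void
   fooA (packet_t **pkts, uint32_t n)
   {
     for (uint32_t i = 0; i < n; i++)
       (void) pkts[i];               /* e.g. parse headers of pkts[i] */
     fooB (pkts, n);
   }

   /* Hypothetical handler: receives the whole vector (up to 256 packets). */
   void
   rx_vector_handler (packet_t **pkts, uint32_t n_packets)
   {
     fooA (pkts, n_packets);
   }
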
This approach fixes:

* The I-cache thrashing problem described above, by amortizing the cost of
  I-cache loads across multiple packets.

* The inefficiencies associated with the deep call stack, by receiving vectors
  of up to 256 packets at a time from the Network Interface and processing them
  using a directed graph of nodes. The graph scheduler invokes one node dispatch
  function at a time, restricting stack depth to a few stack frames; a sketch of
  this dispatch pattern follows the list.

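A rough sketch of that dispatch pattern is shown below. It uses hypothetical
types rather than VPP's real vlib structures: the scheduler loop, not the nodes
themselves, moves each vector from node to node, so the call stack never grows
beyond the scheduler plus a single dispatch function.

.. code-block:: c

   /* Illustrative graph-scheduler sketch (hypothetical types, not VPP's
    * vlib API).  Nodes never call each other directly; the scheduler hands
    * each node a frame of packets and then returns to this loop. */
   #include <stdint.h>

   #define FRAME_SIZE 256

   typedef struct
   {
     uint32_t buffer_index[FRAME_SIZE];  /* packets carried by this frame */
     uint32_t n_packets;
   } frame_t;

   typedef struct node node_t;
   struct node
   {
     const char *name;
     /* Process one frame, return the number of packets handled. */
     uint32_t (*dispatch) (node_t *self, frame_t *frame);
   };

   /* One scheduler pass: every pending (node, frame) pair is dispatched
    * from here, keeping stack depth at the scheduler plus one node function. */
   void
   dispatch_pending (node_t **nodes, frame_t **frames, uint32_t n_pending)
   {
     for (uint32_t i = 0; i < n_pending; i++)
       nodes[i]->dispatch (nodes[i], frames[i]);
   }
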
The further optimizations that this approach enables are pipelining and
prefetching, which minimize the read latency on table data and parallelize the
packet loads needed to process packets.
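
As an illustration of the prefetching idea (VPP uses its own prefetch helpers;
this sketch simply uses the compiler's ``__builtin_prefetch``), the per-packet
loop can request the data of packet ``i + 1`` from memory while packet ``i`` is
still being processed:

.. code-block:: c

   /* Illustrative prefetch sketch (not VPP's actual macros).  While packet
    * i is processed, the cache line holding packet i + 1's data is already
    * being fetched, hiding part of the memory read latency. */
   #include <stdint.h>

   typedef struct { uint8_t data[2048]; uint32_t len; } packet_t;

   static void process_one (packet_t *p) { (void) p; /* parse, lookup, rewrite */ }

   void
   process_vector (packet_t **pkts, uint32_t n_packets)
   {
     for (uint32_t i = 0; i < n_packets; i++)
       {
         if (i + 1 < n_packets)
           __builtin_prefetch (pkts[i + 1]->data, 0 /* read */, 3 /* high locality */);
         process_one (pkts[i]);
       }
   }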

Press next for more on Packet Processing Graphs.