Nathan Skrzypczak | 9ad39c0 | 2021-08-19 11:38:06 +0200 | 1 | VLIB (Vector Processing Library) |
| 2 | ================================ |
| 3 | |
| 4 | The files associated with vlib are located in the ./src/{vlib, vlibapi, |
| 5 | vlibmemory} folders. These libraries provide vector processing support |
| 6 | including graph-node scheduling, reliable multicast support, |
| 7 | ultra-lightweight cooperative multi-tasking threads, a CLI, plug-in .DLL |
| 8 | support, physical memory and Linux epoll support. Parts of this library |
| 9 | embody US Patent 7,961,636. |
| 10 | |
| 11 | Init function discovery |
| 12 | ----------------------- |
| 13 | |
| 14 | vlib applications register for various [initialization] events by |
| 15 | placing structures and \__attribute__((constructor)) functions into the |
| 16 | image. At appropriate times, the vlib framework walks |
| 17 | constructor-generated singly-linked structure lists, performs a |
| 18 | topological sort based on specified constraints, and calls the indicated |
| 19 | functions. Vlib applications create graph nodes, add CLI functions, |
| 20 | start cooperative multi-tasking threads, and so on, using this mechanism. |
| 21 | |
| 22 | vlib applications invariably include a number of VLIB_INIT_FUNCTION |
| 23 | (my_init_function) macros. |
| 24 | |
| 25 | Each init / configure / etc. function has the return type clib_error_t |
| 26 | \*. Make sure that the function returns 0 if all is well, otherwise the |
| 27 | framework will announce an error and exit. |
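
The registration scheme described above can be sketched in stand-alone C. This is a minimal illustration, not the vlib implementation: every ``demo_`` name and the ``DEMO_INIT_FUNCTION`` macro are hypothetical stand-ins, and the sketch assumes a compiler that supports ``__attribute__((constructor))`` (GCC or Clang).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real types live in vlib and vppinfra. */
typedef struct demo_error { const char *what; } demo_error_t;
typedef demo_error_t *(*demo_init_fn_t) (void);

typedef struct demo_init_elt
{
  struct demo_init_elt *next;   /* singly-linked registration list */
  demo_init_fn_t fn;
  const char *name;
} demo_init_elt_t;

static demo_init_elt_t *demo_init_list; /* head of the list */

/* Emit a registration record plus a constructor which links the
   record into the list before main() runs. */
#define DEMO_INIT_FUNCTION(x)                                          \
  static demo_init_elt_t demo_init_elt_##x = { NULL, x, #x };          \
  static void __attribute__ ((constructor)) demo_register_##x (void)   \
  {                                                                    \
    demo_init_elt_##x.next = demo_init_list;                           \
    demo_init_list = &demo_init_elt_##x;                               \
  }

static int demo_things_initialized;

static demo_error_t *
my_init_function (void)
{
  demo_things_initialized = 1;
  return 0;                     /* 0 means "all is well" */
}
DEMO_INIT_FUNCTION (my_init_function)

/* Walk the list; stop at the first function which reports an error. */
static demo_error_t *
demo_call_init_functions (void)
{
  demo_init_elt_t *e;
  for (e = demo_init_list; e; e = e->next)
    {
      demo_error_t *err;
      assert (e->fn);
      err = e->fn ();
      if (err)
        return err;
    }
  return 0;
}
```

Calling ``demo_call_init_functions ()`` from ``main`` runs every registered function and returns 0 if none of them failed. The sketch omits the ordering step; it only shows how the list gets built without any central table of init functions.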
| 28 | |
| 29 | vlib applications must link against vppinfra, and often link against |
| 30 | other libraries such as VNET. In the latter case, it may be necessary to |
| 31 | explicitly reference symbol(s); otherwise, large portions of the library |
| 32 | may be AWOL at runtime. |
| 33 | |
| 34 | Init function construction and constraint specification |
| 35 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 36 | |
| 37 | It’s easy to add an init function: |
| 38 | |
| 39 | .. code:: c |
| 40 | |
| 41 | static clib_error_t *my_init_function (vlib_main_t *vm) |
| 42 | { |
| 43 | /* ... initialize things ... */ |
| 44 | |
| 45 | return 0; // or return clib_error_return (0, "BROKEN!"); |
| 46 | } |
| 47 | VLIB_INIT_FUNCTION(my_init_function); |
| 48 | |
| 49 | As given, my_init_function will be executed “at some point,” but with no |
| 50 | ordering guarantees. |
| 51 | |
| 52 | Specifying ordering constraints is easy: |
| 53 | |
| 54 | .. code:: c |
| 55 | |
| 56 | VLIB_INIT_FUNCTION(my_init_function) = |
| 57 | { |
| 58 | .runs_before = VLIB_INITS("we_run_before_function_1", |
| 59 | "we_run_before_function_2"), |
| 60 | .runs_after = VLIB_INITS("we_run_after_function_1", |
| 61 | "we_run_after_function_2"), |
| 62 | }; |
| 63 | |
| 64 | It’s also easy to specify bulk ordering constraints of the form “a then |
| 65 | b then c then d”: |
| 66 | |
| 67 | .. code:: c |
| 68 | |
| 69 | VLIB_INIT_FUNCTION(my_init_function) = |
| 70 | { |
| 71 | .init_order = VLIB_INITS("a", "b", "c", "d"), |
| 72 | }; |
| 73 | |
| 74 | It’s OK to specify all three sorts of ordering constraints for a single |
| 75 | init function, although it’s hard to imagine why it would be necessary. |
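
To make the effect of the constraints concrete, here is a small stand-alone sketch of a topological sort over "runs before" relations. It is an illustration under stated assumptions, not the algorithm vlib uses: the ``demo_`` names, the fixed-size adjacency matrix and the repeated-scan strategy are all hypothetical choices made for brevity. A ``runs_after`` constraint is simply recorded as the reverse "runs before" edge.

```c
#include <assert.h>
#include <string.h>

#define DEMO_MAX_FNS 16

/* before[i][j] != 0 means "function i must run before function j". */
typedef struct
{
  const char *names[DEMO_MAX_FNS];
  unsigned char before[DEMO_MAX_FNS][DEMO_MAX_FNS];
  int n_fns;
} demo_init_graph_t;

/* Return the index of a name, adding it if necessary. */
static int
demo_find_or_add (demo_init_graph_t *g, const char *name)
{
  int i;
  for (i = 0; i < g->n_fns; i++)
    if (strcmp (g->names[i], name) == 0)
      return i;
  assert (g->n_fns < DEMO_MAX_FNS);
  g->names[g->n_fns] = name;
  return g->n_fns++;
}

/* Record the constraint "first runs before second". */
static void
demo_add_constraint (demo_init_graph_t *g, const char *first,
                     const char *second)
{
  int a = demo_find_or_add (g, first);
  int b = demo_find_or_add (g, second);
  g->before[a][b] = 1;
}

/* Repeatedly emit any function whose predecessors have all been
   emitted. Returns the number of functions placed in order[]; a
   result smaller than n_fns means the constraints contain a cycle. */
static int
demo_topological_sort (const demo_init_graph_t *g, int *order)
{
  int emitted[DEMO_MAX_FNS] = { 0 };
  int n_out = 0, progress = 1, i, j;

  while (progress)
    {
      progress = 0;
      for (j = 0; j < g->n_fns; j++)
        {
          int blocked = emitted[j];
          for (i = 0; i < g->n_fns && !blocked; i++)
            blocked = g->before[i][j] && !emitted[i];
          if (!blocked)
            {
              emitted[j] = 1;
              order[n_out++] = j;
              progress = 1;
            }
        }
    }
  return n_out;
}
```

Functions with no constraints at all come out in whatever order the scan happens to visit them, which matches the "at some point, but with no ordering guarantees" behavior described earlier.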
| 76 | |
| 77 | Node Graph Initialization |
| 78 | ------------------------- |
| 79 | |
| 80 | vlib packet-processing applications invariably define a set of graph |
| 81 | nodes to process packets. |
| 82 | |
| 83 | One constructs a vlib_node_registration_t, most often via the |
| 84 | VLIB_REGISTER_NODE macro. At runtime, the framework processes the set of |
| 85 | such registrations into a directed graph. It is easy enough to add nodes |
| 86 | to the graph at runtime. The framework does not support removing nodes. |
| 87 | |
| 88 | vlib provides several types of vector-processing graph nodes, primarily |
| 89 | to control framework dispatch behaviors. The type member of the |
| 90 | vlib_node_registration_t functions as follows: |
| 91 | |
| 92 | - VLIB_NODE_TYPE_PRE_INPUT - run before all other node types |
| 93 | - VLIB_NODE_TYPE_INPUT - run as often as possible, after pre_input |
| 94 | nodes |
| 95 | - VLIB_NODE_TYPE_INTERNAL - only when explicitly made runnable by |
| 96 | adding pending frames for processing |
| 97 | - VLIB_NODE_TYPE_PROCESS - only when explicitly made runnable. |
| 98 | “Process” nodes are actually cooperative multi-tasking threads. They |
| 99 | **must** explicitly suspend after a reasonably short period of time. |
| 100 | |
| 101 | For a precise understanding of the graph node dispatcher, please read |
| 102 | ./src/vlib/main.c:vlib_main_loop. |
| 103 | |
| 104 | Graph node dispatcher |
| 105 | --------------------- |
| 106 | |
| 107 | vlib_main_loop() dispatches graph nodes. The basic vector processing |
| 108 | algorithm is diabolically simple, but may not be obvious from even a |
| 109 | long stare at the code. Here’s how it works: some input node, or set of |
| 110 | input nodes, produces a vector of work to process. The graph node |
| 111 | dispatcher pushes the work vector through the directed graph, |
| 112 | subdividing it as needed, until the original work vector has been |
| 113 | completely processed. At that point, the process recurs. |
| 114 | |
| 115 | This scheme yields a stable equilibrium in frame size, by construction. |
| 116 | Here’s why: as the frame size increases, the per-frame-element |
| 117 | processing time decreases. There are several related forces at work; the |
| 118 | simplest to describe is the effect of vector processing on the CPU L1 |
| 119 | I-cache. The first frame element [packet] processed by a given node |
| 120 | warms up the node dispatch function in the L1 I-cache. All subsequent |
| 121 | frame elements profit. As we increase the number of frame elements, the |
| 122 | cost per element goes down. |
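
The amortization argument can be written down as a toy cost model. The function below is a sketch under an assumed model, not a measurement: it supposes that each frame pays a fixed overhead (for example, warming the I-cache) which is shared by all of its elements, plus a constant marginal cost per element. The name ``demo_cost_per_element`` and the units are hypothetical.

```c
#include <assert.h>

/* Hypothetical cost model: a fixed per-frame overhead is shared by
   all n_vectors elements, and each element also pays its own
   marginal cost. Units are arbitrary. */
static double
demo_cost_per_element (double fixed_per_frame, double marginal,
                       unsigned n_vectors)
{
  assert (n_vectors > 0);
  return fixed_per_frame / n_vectors + marginal;
}
```

With a fixed overhead of 100 and a marginal cost of 10, a one-element frame costs 110 per element while a 100-element frame costs 11 per element. If the dispatcher falls behind, frames get larger, the cost per element drops, and the dispatcher catches up; that feedback is the source of the equilibrium.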
| 123 | |
| 124 | Under light load, it is a crazy waste of CPU cycles to run the graph |
| 125 | node dispatcher flat-out. So, the graph node dispatcher arranges to wait |
| 126 | for work by sitting in a timed epoll wait if the prevailing frame size |
| 127 | is low. The scheme has a certain amount of hysteresis to avoid |
| 128 | constantly toggling back and forth between interrupt and polling mode. |
| 129 | Although the graph dispatcher supports interrupt and polling modes, our |
| 130 | current default device drivers do not. |
| 131 | |
| 132 | The graph node scheduler uses a hierarchical timer wheel to reschedule |
| 133 | process nodes upon timer expiration. |
| 134 | |
| 135 | Graph dispatcher internals |
| 136 | -------------------------- |
| 137 | |
| 138 | This section may be safely skipped. It’s not necessary to understand |
| 139 | graph dispatcher internals to create graph nodes. |
| 140 | |
| 141 | Vector Data Structure |
| 142 | --------------------- |
| 143 | |
| 144 | In vpp / vlib, we represent vectors as instances of the vlib_frame_t |
| 145 | type: |
| 146 | |
| 147 | .. code:: c |
| 148 | |
| 149 | typedef struct vlib_frame_t |
| 150 | { |
| 151 | /* Frame flags. */ |
| 152 | u16 flags; |
| 153 | |
| 154 | /* Number of scalar bytes in arguments. */ |
| 155 | u8 scalar_size; |
| 156 | |
| 157 | /* Number of bytes per vector argument. */ |
| 158 | u8 vector_size; |
| 159 | |
| 160 | /* Number of vector elements currently in frame. */ |
| 161 | u16 n_vectors; |
| 162 | |
| 163 | /* Scalar and vector arguments to next node. */ |
| 164 | u8 arguments[0]; |
| 165 | } vlib_frame_t; |
| 166 | |
| 167 | Note that one *could* construct all kinds of vectors - including vectors |
| 168 | with some associated scalar data - using this structure. In the vpp |
| 169 | application, vectors typically use a 4-byte vector element size, and |
| 170 | zero bytes’ worth of associated per-frame scalar data. |
| 171 | |
| 172 | Frames are always allocated on CLIB_CACHE_LINE_BYTES boundaries. Frames |
| 173 | have u32 indices which make use of the alignment property, so the |
| 174 | maximum feasible main heap offset of a frame is CLIB_CACHE_LINE_BYTES \* |
| 175 | 0xFFFFFFFF. With 64-byte cache lines, that is about 64 \* 4G = 256 Gbytes. |
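
The index arithmetic is easy to check in isolation. The sketch below assumes a 64-byte cache line and uses hypothetical ``demo_`` helper names; it is not the vlib code, just the same calculation spelled out.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_CACHE_LINE_BYTES 64u       /* assumed; platform dependent */

/* Convert a cache-line-aligned heap offset to a 32-bit frame index. */
static uint32_t
demo_frame_index_from_offset (uint64_t heap_offset)
{
  assert (heap_offset % DEMO_CACHE_LINE_BYTES == 0);
  assert (heap_offset / DEMO_CACHE_LINE_BYTES <= UINT32_MAX);
  return (uint32_t) (heap_offset / DEMO_CACHE_LINE_BYTES);
}

/* Convert a frame index back to a heap offset. */
static uint64_t
demo_frame_offset_from_index (uint32_t frame_index)
{
  return (uint64_t) frame_index * DEMO_CACHE_LINE_BYTES;
}
```

Because the low six bits of an aligned offset are always zero, storing the offset divided by the cache line size lets a 32-bit index reach 64 times further than a raw 32-bit byte offset would.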
| 176 | |
| 177 | Scheduling Vectors |
| 178 | ------------------ |
| 179 | |
| 180 | As you can see, vectors are not directly associated with graph nodes. We |
| 181 | represent that association in a couple of ways. The simplest is the |
| 182 | vlib_pending_frame_t: |
| 183 | |
| 184 | .. code:: c |
| 185 | |
| 186 | /* A frame pending dispatch by main loop. */ |
| 187 | typedef struct |
| 188 | { |
| 189 | /* Node and runtime for this frame. */ |
| 190 | u32 node_runtime_index; |
| 191 | |
| 192 | /* Frame index (in the heap). */ |
| 193 | u32 frame_index; |
| 194 | |
| 195 | /* Start of next frames for this node. */ |
| 196 | u32 next_frame_index; |
| 197 | |
| 198 | /* Special value for next_frame_index when there is no next frame. */ |
| 199 | #define VLIB_PENDING_FRAME_NO_NEXT_FRAME ((u32) ~0) |
| 200 | } vlib_pending_frame_t; |
| 201 | |
| 202 | Here is the code in …/src/vlib/main.c:vlib_main_or_worker_loop() which |
| 203 | processes frames: |
| 204 | |
| 205 | .. code:: c |
| 206 | |
| 207 | /* |
| 208 | * Input nodes may have added work to the pending vector. |
| 209 | * Process pending vector until there is nothing left. |
| 210 | * All pending vectors will be processed from input -> output. |
| 211 | */ |
| 212 | for (i = 0; i < _vec_len (nm->pending_frames); i++) |
| 213 | cpu_time_now = dispatch_pending_node (vm, i, cpu_time_now); |
| 214 | /* Reset pending vector for next iteration. */ |
| 215 | |
| 216 | The pending frame node_runtime_index associates the frame with the node |
| 217 | which will process it. |
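
One detail of that loop is worth spelling out: the pending vector may grow while it is being walked, because dispatching one node typically enqueues work for the next. The stand-alone model below illustrates this with hypothetical ``demo_`` types and a made-up four-node chain; it is a sketch of the loop shape, not the vlib dispatcher.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_PENDING 64

typedef struct
{
  uint32_t node_runtime_index;  /* node which will process the frame */
  uint32_t frame_index;
} demo_pending_frame_t;

typedef struct
{
  demo_pending_frame_t pending[DEMO_MAX_PENDING];
  uint32_t n_pending;
  uint32_t n_dispatched;
} demo_node_main_t;

static void
demo_add_pending (demo_node_main_t *nm, uint32_t node, uint32_t frame)
{
  assert (nm->n_pending < DEMO_MAX_PENDING);
  nm->pending[nm->n_pending].node_runtime_index = node;
  nm->pending[nm->n_pending].frame_index = frame;
  nm->n_pending++;
}

/* Toy dispatch: node k hands its frame to node k + 1 until a
   hypothetical last node (index 3) is reached. */
static void
demo_dispatch_pending (demo_node_main_t *nm, uint32_t i)
{
  demo_pending_frame_t p = nm->pending[i];      /* copy the entry */
  nm->n_dispatched++;
  if (p.node_runtime_index < 3)
    demo_add_pending (nm, p.node_runtime_index + 1, p.frame_index);
}

static void
demo_process_pending (demo_node_main_t *nm)
{
  uint32_t i;
  /* n_pending is re-read on every iteration, so work added during
     dispatch is handled in the same pass. */
  for (i = 0; i < nm->n_pending; i++)
    demo_dispatch_pending (nm, i);
  nm->n_pending = 0;            /* reset for the next iteration */
}
```

Starting with a single pending frame for node 0, one call to ``demo_process_pending`` dispatches four nodes in input-to-output order and leaves the pending vector empty.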
| 218 | |
| 219 | Complications |
| 220 | ------------- |
| 221 | |
| 222 | Fasten your seatbelt. Here’s where the story - and the data structures - |
| 223 | become quite complicated… |
| 224 | |
| 225 | At 100,000 feet: vpp uses a directed graph, not a directed *acyclic* |
| 226 | graph. It’s really quite normal for a packet to visit ip[46]-lookup |
| 227 | multiple times. The worst-case: a graph node which enqueues packets to |
| 228 | itself. |
| 229 | |
| 230 | To deal with this issue, the graph dispatcher must force allocation of a |
| 231 | new frame if the current graph node’s dispatch function happens to |
| 232 | enqueue a packet back to itself. |
| 233 | |
| 234 | There are no guarantees that a pending frame will be processed |
| 235 | immediately, which means that more packets may be added to the |
| 236 | underlying vlib_frame_t after it has been attached to a |
| 237 | vlib_pending_frame_t. Care must be taken to allocate new frames and |
| 238 | pending frames if a (pending_frame, frame) pair fills. |
| 239 | |
| 240 | Next frames, next frame ownership |
| 241 | --------------------------------- |
| 242 | |
| 243 | The vlib_next_frame_t is the last key graph dispatcher data structure: |
| 244 | |
| 245 | .. code:: c |
| 246 | |
| 247 | typedef struct |
| 248 | { |
| 249 | /* Frame index. */ |
| 250 | u32 frame_index; |
| 251 | |
| 252 | /* Node runtime for this next. */ |
| 253 | u32 node_runtime_index; |
| 254 | |
| 255 | /* Next frame flags. */ |
| 256 | u32 flags; |
| 257 | |
| 258 | /* Reflects node frame-used flag for this next. */ |
| 259 | #define VLIB_FRAME_NO_FREE_AFTER_DISPATCH \ |
| 260 | VLIB_NODE_FLAG_FRAME_NO_FREE_AFTER_DISPATCH |
| 261 | |
| 262 | /* This next frame owns enqueue to node |
| 263 | corresponding to node_runtime_index. */ |
| 264 | #define VLIB_FRAME_OWNER (1 << 15) |
| 265 | |
| 266 | /* Set when frame has been allocated for this next. */ |
| 267 | #define VLIB_FRAME_IS_ALLOCATED VLIB_NODE_FLAG_IS_OUTPUT |
| 268 | |
| 269 | /* Set when frame has been added to pending vector. */ |
| 270 | #define VLIB_FRAME_PENDING VLIB_NODE_FLAG_IS_DROP |
| 271 | |
| 272 | /* Set when frame is to be freed after dispatch. */ |
| 273 | #define VLIB_FRAME_FREE_AFTER_DISPATCH VLIB_NODE_FLAG_IS_PUNT |
| 274 | |
| 275 | /* Set when frame has traced packets. */ |
| 276 | #define VLIB_FRAME_TRACE VLIB_NODE_FLAG_TRACE |
| 277 | |
| 278 | /* Number of vectors enqueue to this next since last overflow. */ |
| 279 | u32 vectors_since_last_overflow; |
| 280 | } vlib_next_frame_t; |
| 281 | |
| 282 | Graph node dispatch functions call vlib_get_next_frame (…) to set “(u32 |
| 283 | \*)to_next” to the right place in the vlib_frame_t corresponding to the |
| 284 | ith arc (aka next0) from the current node to the indicated next node. |
| 285 | |
| 286 | After some scuffling around - two levels of macros - processing reaches |
| 287 | vlib_get_next_frame_internal (…). Get-next-frame-internal digs up the |
| 288 | vlib_next_frame_t corresponding to the desired graph arc. |
| 289 | |
| 290 | The next frame data structure amounts to a graph-arc-centric frame |
| 291 | cache. Once a node finishes adding elements to a frame, the frame acquires |
| 292 | a vlib_pending_frame_t and ends up on the graph dispatcher’s run-queue. But |
| 293 | there’s no guarantee that more vector elements won’t be added to the |
| 294 | underlying frame from the same (source_node, next_index) arc or from a |
| 295 | different (source_node, next_index) arc. |
| 296 | |
| 297 | Maintaining consistency of the arc-to-frame cache is necessary. The |
| 298 | first step in maintaining consistency is to make sure that only one |
| 299 | graph node at a time thinks it “owns” the target vlib_frame_t. |
| 300 | |
| 301 | Back to the graph node dispatch function. In the usual case, a certain |
| 302 | number of packets will be added to the vlib_frame_t acquired by calling |
| 303 | vlib_get_next_frame (…). |
| 304 | |
| 305 | Before a dispatch function returns, it’s required to call |
| 306 | vlib_put_next_frame (…) for all of the graph arcs it actually used. This |
| 307 | action adds a vlib_pending_frame_t to the graph dispatcher’s pending |
| 308 | frame vector. |
| 309 | |
| 310 | vlib_put_next_frame makes a note in the pending frame of the frame |
| 311 | index, and also of the vlib_next_frame_t index. |
| 312 | |
| 313 | dispatch_pending_node actions |
| 314 | ----------------------------- |
| 315 | |
| 316 | The main graph dispatch loop calls dispatch_pending_node as shown above. |
| 317 | |
| 318 | dispatch_pending_node recovers the pending frame, and the graph node |
| 319 | runtime / dispatch function. Further, it recovers the next_frame |
| 320 | currently associated with the vlib_frame_t, and detaches the |
| 321 | vlib_frame_t from the next_frame. |
| 322 | |
| 323 | In …/src/vlib/main.c:dispatch_pending_node(…), note this stanza: |
| 324 | |
| 325 | .. code:: c |
| 326 | |
| 327 | /* Force allocation of new frame while current frame is being |
| 328 | dispatched. */ |
| 329 | restore_frame_index = ~0; |
| 330 | if (nf->frame_index == p->frame_index) |
| 331 | { |
| 332 | nf->frame_index = ~0; |
| 333 | nf->flags &= ~VLIB_FRAME_IS_ALLOCATED; |
| 334 | if (!(n->flags & VLIB_NODE_FLAG_FRAME_NO_FREE_AFTER_DISPATCH)) |
| 335 | restore_frame_index = p->frame_index; |
| 336 | } |
| 337 | |
| 338 | dispatch_pending_node is worth a hard stare due to the several |
| 339 | second-order optimizations it implements. Almost as an afterthought, it |
| 340 | calls dispatch_node which actually calls the graph node dispatch |
| 341 | function. |
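
The effect of detaching the frame before dispatch can be modelled in a few lines. This is a deliberately reduced sketch: the ``demo_`` names and the counter-based frame allocator are hypothetical, and the flag handling from the real stanza is left out.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NO_FRAME (~(uint32_t) 0)

typedef struct
{
  uint32_t frame_index;         /* cached frame for this arc, or ~0 */
} demo_next_frame_t;

static uint32_t demo_next_free_frame_index = 100;      /* toy allocator */

/* Return the arc's cached frame, allocating one if there is none. */
static uint32_t
demo_get_next_frame (demo_next_frame_t *nf)
{
  assert (nf);
  if (nf->frame_index == DEMO_NO_FRAME)
    nf->frame_index = demo_next_free_frame_index++;
  return nf->frame_index;
}

/* Before dispatching frame_being_dispatched, detach it from the arc
   cache so that enqueues made during dispatch cannot land in it. */
static void
demo_detach_before_dispatch (demo_next_frame_t *nf,
                             uint32_t frame_being_dispatched)
{
  if (nf->frame_index == frame_being_dispatched)
    nf->frame_index = DEMO_NO_FRAME;
}
```

After the detach, the next ``demo_get_next_frame`` call on the same arc returns a fresh frame index. That is what keeps a node which enqueues to itself from appending to the very frame it is in the middle of processing.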
| 342 | |
| 343 | Process / thread model |
| 344 | ---------------------- |
| 345 | |
| 346 | vlib provides an ultra-lightweight cooperative multi-tasking thread |
| 347 | model. The graph node scheduler invokes these processes in much the same |
| 348 | way as traditional vector-processing run-to-completion graph nodes; |
| 349 | plus-or-minus a setjmp/longjmp pair required to switch stacks. Simply |
| 350 | set the vlib_node_registration_t type field to VLIB_NODE_TYPE_PROCESS. |
| 351 | Yes, process is a misnomer. These are cooperative multi-tasking threads. |
| 352 | |
| 353 | As of this writing, the default stack size is 1<<15 = 32 KB. Initialize |
| 354 | the node registration’s process_log2_n_stack_bytes member as needed. The |
| 355 | graph node dispatcher makes some effort to detect stack overrun, e.g. by |
| 356 | mapping a no-access page below each thread stack. |
| 357 | |
| 358 | Process node dispatch functions are expected to be “while(1) { }” loops |
| 359 | which suspend when not otherwise occupied, and which must not run for |
| 360 | unreasonably long periods of time. |
| 361 | |
| 362 | “Unreasonably long” is an application-dependent concept. Over the years, |
| 363 | we have constructed frame-size sensitive control-plane nodes which will |
| 364 | use a much higher fraction of the available CPU bandwidth when the frame |
| 365 | size is low. The classic example: modifying forwarding tables. So long |
| 366 | as the table-builder leaves the forwarding tables in a valid state, one |
| 367 | can suspend the table builder to avoid dropping packets as a result of |
| 368 | control-plane activity. |
| 369 | |
| 370 | Process nodes can suspend for fixed amounts of time, or until another |
| 371 | entity signals an event, or both. See the next section for a description |
| 372 | of the vlib process event mechanism. |
| 373 | |
| 374 | When running in vlib process context, one must pay strict attention to |
| 375 | loop invariant issues. If one walks a data structure and calls a |
| 376 | function which may suspend, one had best know by construction that it |
| 377 | cannot change. Often, it’s best to simply make a snapshot copy of a data |
| 378 | structure, walk the copy at leisure, then free the copy. |
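
The snapshot approach looks like this in plain C. The sketch uses ``malloc`` and ``memcpy`` rather than vppinfra vectors, and ``demo_may_suspend`` is a hypothetical stand-in for any call during which another process could modify the live table.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy n elements so the walk cannot be disturbed by changes made
   to the original. Returns NULL on allocation failure or if n is 0. */
static int *
demo_snapshot (const int *table, size_t n)
{
  int *copy;
  if (n == 0)
    return NULL;
  copy = malloc (n * sizeof (copy[0]));
  if (copy)
    memcpy (copy, table, n * sizeof (copy[0]));
  return copy;
}

/* Hypothetical stand-in for a call which may suspend: while we are
   "away", some other process overwrites the live table. */
static void
demo_may_suspend (int *live_table, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
    live_table[i] = -1;
}

/* Sum the table as it looked when the walk started. */
static long
demo_walk_snapshot (int *live_table, size_t n)
{
  long sum = 0;
  size_t i;
  int *copy = demo_snapshot (live_table, n);

  if (!copy)
    return 0;
  for (i = 0; i < n; i++)
    {
      demo_may_suspend (live_table, n); /* live table changes here */
      sum += copy[i];                   /* the snapshot does not */
    }
  free (copy);
  return sum;
}
```

The walk sees the values present when the snapshot was taken, even though the live table has been overwritten by the time the loop finishes.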
| 379 | |
| 380 | Process events |
| 381 | -------------- |
| 382 | |
| 383 | The vlib process event mechanism API is extremely lightweight and easy |
| 384 | to use. Here is a typical example: |
| 385 | |
| 386 | .. code:: c |
| 387 | |
| 388 | vlib_main_t *vm = &vlib_global_main; |
| 389 | uword event_type, * event_data = 0; |
| 390 | |
| 391 | while (1) |
| 392 | { |
| 393 | vlib_process_wait_for_event_or_clock (vm, 5.0 /* seconds */); |
| 394 | |
| 395 | event_type = vlib_process_get_events (vm, &event_data); |
| 396 | |
| 397 | switch (event_type) { |
| 398 | case EVENT1: |
| 399 | handle_event1s (event_data); |
| 400 | break; |
| 401 | |
| 402 | case EVENT2: |
| 403 | handle_event2s (event_data); |
| 404 | break; |
| 405 | |
| 406 | case ~0: /* 5-second idle/periodic */ |
| 407 | handle_idle (); |
| 408 | break; |
| 409 | |
| 410 | default: /* bug! */ |
| 411 | ASSERT (0); |
| 412 | } |
| 413 | |
| 414 | vec_reset_length(event_data); |
| 415 | } |
| 416 | |
| 417 | In this example, the VLIB process node waits for an event to occur, or |
| 418 | for 5 seconds to elapse. The code demuxes on the event type, calling the |
| 419 | appropriate handler function. Each call to vlib_process_get_events |
| 420 | returns a vector of per-event-type data passed to successive |
| 421 | vlib_process_signal_event calls; it is a serious error to process only |
| 422 | event_data[0]. |
| 423 | |
| 424 | Resetting the event_data vector-length to 0 [instead of calling |
| 425 | vec_free] means that the event scheme doesn’t burn cycles continuously |
| 426 | allocating and freeing the event data vector. This is a common vppinfra |
| 427 | / vlib coding pattern, well worth using when appropriate. |
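
Both points, walking every element and resetting the length instead of freeing, can be shown with a small growable array. The ``demo_vec_t`` type below is a hypothetical stand-in for a vppinfra vector, written with ``realloc`` so that the sketch compiles on its own.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable array standing in for a vppinfra vector. */
typedef struct
{
  uintptr_t *data;
  size_t len;                   /* elements in use */
  size_t cap;                   /* elements allocated */
  unsigned n_allocs;            /* number of realloc calls so far */
} demo_vec_t;

static void
demo_vec_add (demo_vec_t *v, uintptr_t x)
{
  if (v->len == v->cap)
    {
      size_t new_cap = v->cap ? 2 * v->cap : 4;
      uintptr_t *p = realloc (v->data, new_cap * sizeof (p[0]));
      assert (p);               /* sketch only: real code reports this */
      v->data = p;
      v->cap = new_cap;
      v->n_allocs++;
    }
  v->data[v->len++] = x;
}

/* Forget the contents but keep the memory for the next round. */
static void
demo_vec_reset_length (demo_vec_t *v)
{
  v->len = 0;
}

/* Handle every queued event, not just element 0. */
static uintptr_t
demo_handle_events (const demo_vec_t *event_data)
{
  uintptr_t sum = 0;
  size_t i;
  for (i = 0; i < event_data->len; i++)
    sum += event_data->data[i];
  return sum;
}
```

Filling, handling and resetting the same ``demo_vec_t`` over several rounds allocates memory only once, as long as no round needs more room than an earlier one did.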
| 428 | |
| 429 | Signaling an event is easy, for example: |
| 430 | |
| 431 | .. code:: c |
| 432 | |
| 433 | vlib_process_signal_event (vm, process_node_index, EVENT1, |
| 434 | (uword)arbitrary_event1_data); /* and so forth */ |
| 435 | |
| 436 | One can obtain the process node index either by construction - dig it out |
| 437 | of the appropriate vlib_node_registration_t - or by finding the |
| 438 | vlib_node_t with vlib_get_node_by_name(…). |
| 439 | |
| 440 | Buffers |
| 441 | ------- |
| 442 | |
| 443 | vlib buffering solves the usual set of packet-processing problems, |
| 444 | albeit at high performance. Key in terms of performance: one ordinarily |
| 445 | allocates / frees N buffers at a time rather than one at a time. Except |
| 446 | when operating directly on a specific buffer, one deals with buffers by |
| 447 | index, not by pointer. |
| 448 | |
| 449 | Packet-processing frames are u32[] arrays, not vlib_buffer_t[] arrays. |
| 450 | |
| 451 | Packets comprise one or more vlib buffers, chained together as required. |
| 452 | Multiple particle sizes are supported; hardware input nodes simply ask |
| 453 | for the required size(s). Coalescing support is available. For obvious |
| 454 | reasons one is discouraged from writing one’s own wild and wacky buffer |
| 455 | chain traversal code. |
| 456 | |
| 457 | vlib buffer headers are allocated immediately prior to the buffer data |
| 458 | area. In typical packet processing this saves a dependent read wait: |
| 459 | given a buffer’s address, one can prefetch the buffer header [metadata] |
| 460 | at the same time as the first cache line of buffer data. |
| 461 | |
| 462 | Buffer header metadata (vlib_buffer_t) includes the usual rewrite |
| 463 | expansion space, a current_data offset, RX and TX interface indices, |
| 464 | packet trace information, and opaque areas. |
| 465 | |
| 466 | The opaque data is intended to control packet processing in arbitrary |
| 467 | subgraph-dependent ways. The programmer shoulders responsibility for |
| 468 | data lifetime analysis, type-checking, etc. |
| 469 | |
| 470 | Buffers have reference-counts in support of e.g. multicast replication. |
| 471 | |
| 472 | Shared-memory message API |
| 473 | ------------------------- |
| 474 | |
| 475 | Local control-plane and application processes interact with the vpp |
| 476 | dataplane via asynchronous message-passing in shared memory over |
| 477 | unidirectional queues. The same application APIs are available via |
| 478 | sockets. |
| 479 | |
| 480 | Capturing API traces and replaying them in a simulation environment |
| 481 | requires a disciplined approach to the problem. This seems like a |
| 482 | make-work task, but it is not. When something goes wrong in the |
| 483 | control-plane after 300,000 or 3,000,000 operations, high-speed replay |
| 484 | of the events leading up to the accident is a huge win. |
| 485 | |
| 486 | The shared-memory message API message allocator vl_api_msg_alloc uses a |
| 487 | particularly cute trick. Since messages are processed in order, we try |
| 488 | to allocate message buffering from a set of fixed-size, preallocated |
| 489 | rings. Each ring item has a “busy” bit. Freeing one of the preallocated |
| 490 | message buffers merely requires the message consumer to clear the busy |
| 491 | bit. No locking required. |
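
A reduced model of the busy-bit scheme follows. It is a sketch, not the message allocator itself: the ``demo_`` names, the ring and message sizes, and the use of C11 atomics are assumptions, and it supposes a single allocating thread. The fallback path for a ring which is still busy is not shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_RING_SIZE 4
#define DEMO_MSG_BYTES 64

typedef struct
{
  atomic_int busy;              /* set on alloc, cleared by consumer */
  unsigned char data[DEMO_MSG_BYTES];
} demo_ring_elt_t;

typedef struct
{
  demo_ring_elt_t elts[DEMO_RING_SIZE];
  unsigned head;                /* next slot the allocator will try */
} demo_ring_t;

/* Allocate the next slot in ring order. Returns NULL when that slot
   is still busy; a caller would then fall back to a slower, general
   purpose allocator. */
static void *
demo_msg_alloc (demo_ring_t *r)
{
  demo_ring_elt_t *e = &r->elts[r->head];
  if (atomic_load_explicit (&e->busy, memory_order_acquire))
    return NULL;
  atomic_store_explicit (&e->busy, 1, memory_order_relaxed);
  r->head = (r->head + 1) % DEMO_RING_SIZE;
  return e->data;
}

/* Free a ring message: the consumer only clears the busy flag. */
static void
demo_msg_free (void *msg)
{
  demo_ring_elt_t *e =
    (demo_ring_elt_t *) ((unsigned char *) msg
                         - offsetof (demo_ring_elt_t, data));
  assert (atomic_load_explicit (&e->busy, memory_order_relaxed));
  atomic_store_explicit (&e->busy, 0, memory_order_release);
}
```

The trick depends on in-order processing: by the time the allocator wraps around to a slot, the consumer has normally finished with it, so the busy check almost always succeeds.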
| 492 | |
| 493 | Debug CLI |
| 494 | --------- |
| 495 | |
| 496 | Adding debug CLI commands to VLIB applications is very simple. |
| 497 | |
| 498 | Here is a complete example: |
| 499 | |
| 500 | .. code:: c |
| 501 | |
| 502 | static clib_error_t * |
| 503 | show_ip_tuple_match (vlib_main_t * vm, |
| 504 | unformat_input_t * input, |
| 505 | vlib_cli_command_t * cmd) |
| 506 | { |
| 507 | vlib_cli_output (vm, "%U\n", format_ip_tuple_match_tables, &routing_main); |
| 508 | return 0; |
| 509 | } |
| 510 | |
| 511 | static VLIB_CLI_COMMAND (show_ip_tuple_command) = |
| 512 | { |
| 513 | .path = "show ip tuple match", |
| 514 | .short_help = "Show ip 5-tuple match-and-broadcast tables", |
| 515 | .function = show_ip_tuple_match, |
| 516 | }; |
| 517 | |
| 518 | This example implements the “show ip tuple match” debug CLI command. In |
| 519 | ordinary usage, the vlib CLI is available via the “vppctl” application, |
| 520 | which sends traffic to a named pipe. One can configure debug CLI telnet |
| 521 | access on a configurable port. |
| 522 | |
| 523 | The CLI implementation has an output redirection facility which makes it |
| 524 | simple to deliver CLI output via shared-memory API messaging. |
| 525 | |
| 526 | Particularly for debug or “show tech support” type commands, it would be |
| 527 | wasteful to write vlib application code to pack binary data, write more |
| 528 | code elsewhere to unpack the data and finally print the answer. If a |
| 529 | certain cli command has the potential to hurt packet processing |
| 530 | performance by running for too long, do the work incrementally in a |
| 531 | process node. The client can wait. |
| 532 | |
| 533 | Macro expansion |
| 534 | ~~~~~~~~~~~~~~~ |
| 535 | |
| 536 | The vpp debug CLI engine includes a recursive macro expander. This is |
| 537 | quite useful for factoring out address and/or interface name specifics: |
| 538 | |
| 539 | :: |
| 540 | |
| 541 | define ip1 192.168.1.1/24 |
| 542 | define ip2 192.168.2.1/24 |
| 543 | define iface1 GigabitEthernet3/0/0 |
| 544 | define iface2 loop1 |
| 545 | |
| 546 | set int ip address $iface1 $ip1 |
| 547 | set int ip address $iface2 $(ip2) |
| 548 | |
| 549 | undefine ip1 |
| 550 | undefine ip2 |
| 551 | undefine iface1 |
| 552 | undefine iface2 |
| 553 | |
| 554 | Each socket (or telnet) debug CLI session has its own macro tables. All |
| 555 | debug CLI sessions which use CLI_INBAND binary API messages share a |
| 556 | single table. |
| 557 | |
| 558 | The macro expander recognizes circular definitions: |
| 559 | |
| 560 | :: |
| 561 | |
| 562 | define foo \$(bar) |
| 563 | define bar \$(mumble) |
| 564 | define mumble \$(foo) |
| 565 | |
| 566 | At 8 levels of recursion, the macro expander throws up its hands and |
| 567 | replies “CIRCULAR.” |
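
The depth limit is the whole trick, as a reduced sketch shows. The expander below is hypothetical and far simpler than the one in vpp: a value is either plain text or exactly one reference of the form ``$(name)``, unknown names are left unexpanded, and all ``demo_`` names are made up for the illustration.

```c
#include <assert.h>
#include <string.h>

#define DEMO_MAX_DEPTH 8

typedef struct
{
  const char *name;
  const char *value;
} demo_macro_t;

/* Find the value of the macro whose name matches name[0..name_len). */
static const char *
demo_lookup (const demo_macro_t *table, int n, const char *name,
             size_t name_len)
{
  int i;
  for (i = 0; i < n; i++)
    if (strlen (table[i].name) == name_len
        && strncmp (table[i].name, name, name_len) == 0)
      return table[i].value;
  return NULL;
}

/* Simplified expander: follow a chain of $(name) references, giving
   up after DEMO_MAX_DEPTH nested lookups. */
static const char *
demo_expand (const demo_macro_t *table, int n, const char *value,
             int depth)
{
  size_t len = strlen (value);
  const char *next;

  if (len < 4 || value[0] != '$' || value[1] != '('
      || value[len - 1] != ')')
    return value;               /* plain text: nothing to expand */
  if (depth >= DEMO_MAX_DEPTH)
    return "CIRCULAR";
  next = demo_lookup (table, n, value + 2, len - 3);
  if (!next)
    return value;               /* unknown name: leave it alone */
  return demo_expand (table, n, next, depth + 1);
}
```

Calling ``demo_expand (table, n, "$(foo)", 0)`` on the three circular definitions above yields "CIRCULAR", while a chain which ends in plain text yields that text. No cycle detection is needed; the expander just counts how deep it has gone.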
| 568 | |
| 569 | Macro-related debug CLI commands |
| 570 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 571 | |
| 572 | In addition to the “define” and “undefine” debug CLI commands, use “show |
| 573 | macro [noevaluate]” to dump the macro table. The “echo” debug CLI |
| 574 | command will evaluate and print its argument: |
| 575 | |
| 576 | :: |
| 577 | |
| 578 | vpp# define foo This\ Is\ Foo |
| 579 | vpp# echo $foo |
| 580 | This Is Foo |
| 581 | |
| 582 | Handing off buffers between threads |
| 583 | ----------------------------------- |
| 584 | |
| 585 | Vlib includes an easy-to-use mechanism for handing off buffers between |
| 586 | worker threads. A typical use-case: software ingress flow hashing. At a |
| 587 | high level, one creates a per-worker-thread queue which sends packets to |
| 588 | a specific graph node in the indicated worker thread. With the queue in |
| 589 | hand, enqueue packets to the worker thread of your choice. |
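
For the flow-hashing use-case, the choice of worker thread usually comes down to hashing the flow key and reducing the result modulo the number of workers. The sketch below assumes an IPv4 5-tuple key and uses an FNV-1a style hash; the key layout, the hash and the ``demo_`` names are illustrative assumptions, not what any particular vpp node does.

```c
#include <assert.h>
#include <stdint.h>

typedef struct
{
  uint32_t src, dst;
  uint16_t sport, dport;
  uint8_t proto;
} demo_flow_key_t;

/* FNV-1a step applied to the four bytes of x, low byte first. */
static uint32_t
demo_mix (uint32_t h, uint32_t x)
{
  int i;
  for (i = 0; i < 4; i++)
    {
      h ^= (x >> (8 * i)) & 0xff;
      h *= 16777619u;           /* FNV prime */
    }
  return h;
}

static uint32_t
demo_flow_hash (const demo_flow_key_t *k)
{
  uint32_t h = 2166136261u;     /* FNV offset basis */
  h = demo_mix (h, k->src);
  h = demo_mix (h, k->dst);
  h = demo_mix (h, ((uint32_t) k->sport << 16) | k->dport);
  h = demo_mix (h, k->proto);
  return h;
}

/* Map a flow to one of n_workers worker threads. Worker thread
   indices are assumed to start at first_worker_index. */
static uint16_t
demo_pick_thread (const demo_flow_key_t *k, uint16_t first_worker_index,
                  uint16_t n_workers)
{
  assert (n_workers > 0);
  return (uint16_t) (first_worker_index + demo_flow_hash (k) % n_workers);
}
```

The property which matters is that every packet of a given flow maps to the same thread, so per-flow state never has to be shared between workers.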
| 590 | |
| 591 | Initialize a handoff queue |
| 592 | ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 593 | |
| 594 | Simply call vlib_frame_queue_main_init: |
| 595 | |
| 596 | .. code:: c |
| 597 | |
| 598 | main_ptr->frame_queue_index |
| 599 | = vlib_frame_queue_main_init (dest_node.index, frame_queue_size); |
| 600 | |
| 601 | frame_queue_size means what it says: the number of frames which may be |
| 602 | queued. Since frames contain 1…256 packets, frame_queue_size should be a |
| 603 | reasonably small number (32…64). If the frame queue producer(s) are |
| 604 | faster than the frame queue consumer(s), congestion will occur. We suggest |
| 605 | letting the enqueue operator deal with queue congestion, as shown in the |
| 606 | enqueue example below. |
| 607 | |
| 608 | Under the floorboards, vlib_frame_queue_main_init creates an input queue |
| 609 | for each worker thread. |
| 610 | |
| 611 | Please do NOT create frame queues until it’s clear that they will be |
| 612 | used. Although the main dispatch loop is reasonably smart about how |
| 613 | often it polls the (entire set of) frame queues, polling unused frame |
| 614 | queues is a waste of clock cycles. |
| 615 | |
| 616 | Hand off packets |
| 617 | ~~~~~~~~~~~~~~~~ |
| 618 | |
| 619 | The actual handoff mechanics are simple, and integrate nicely with a |
| 620 | typical graph-node dispatch function: |
| 621 | |
| 622 | .. code:: c |
| 623 | |
| 624 | always_inline uword |
| 625 | do_handoff_inline (vlib_main_t * vm, |
| 626 | vlib_node_runtime_t * node, vlib_frame_t * frame, |
| 627 | int is_ip4, int is_trace) |
| 628 | { |
| 629 | u32 n_left_from, *from; |
| 630 | vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b; |
| 631 | u16 thread_indices [VLIB_FRAME_SIZE]; |
| 632 | u16 nexts[VLIB_FRAME_SIZE], *next; |
| 633 | u32 n_enq; |
| 634 | htest_main_t *hmp = &htest_main; |
| 635 | int i; |
| 636 | |
| 637 | from = vlib_frame_vector_args (frame); |
| 638 | n_left_from = frame->n_vectors; |
| 639 | |
| 640 | vlib_get_buffers (vm, from, bufs, n_left_from); |
| 641 | next = nexts; |
| 642 | b = bufs; |
| 643 | |
| 644 | /* |
| 645 | * Typical frame traversal loop, details vary with |
| 646 | * use case. Make sure to set thread_indices[i] with |
| 647 | * the desired destination thread index. You may |
| 648 | * or may not bother to set next[i]. |
| 649 | */ |
| 650 | |
| 651 | for (i = 0; i < frame->n_vectors; i++) |
| 652 | { |
| 653 | <snip> |
| 654 | /* Pick a thread to handle this packet */ |
| 655 | thread_indices[i] = f (packet_data_or_whatever); |
| 656 | <snip> |
| 657 | |
| 658 | b += 1; |
| 659 | next += 1; |
| 660 | n_left_from -= 1; |
| 661 | } |
| 662 | |
| 663 | /* Enqueue buffers to threads */ |
| 664 | n_enq = |
| 665 | vlib_buffer_enqueue_to_thread (vm, node, hmp->frame_queue_index, |
| 666 | from, thread_indices, frame->n_vectors, |
| 667 | 1 /* drop on congestion */); |
| 668 | /* Typical counters */ |
| 669 | if (n_enq < frame->n_vectors) |
| 670 | vlib_node_increment_counter (vm, node->node_index, |
| 671 | XXX_ERROR_CONGESTION_DROP, |
| 672 | frame->n_vectors - n_enq); |
| 673 | vlib_node_increment_counter (vm, node->node_index, |
| 674 | XXX_ERROR_HANDED_OFF, n_enq); |
| 675 | return frame->n_vectors; |
| 676 | } |
| 677 | |
| 678 | Notes about calling vlib_buffer_enqueue_to_thread(…): |
| 679 | |
| 680 | - If you pass “drop on congestion” non-zero, all packets in the inbound |
| 681 | frame will be consumed one way or the other. This is the recommended |
| 682 | setting. |
| 683 | |
| 684 | - In the drop-on-congestion case, please don’t try to “help” in the |
| 685 | enqueue node by freeing dropped packets, or by pushing them to |
| 686 | “error-drop.” Either of those actions would be a severe error. |
| 687 | |
| 688 | - It’s perfectly OK to enqueue packets to the current thread. |
| 689 | |
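
The drop-on-congestion contract can be modelled with a tiny bounded queue. This is a simplified sketch: the real handoff queues hold frames rather than individual buffer indices, and ``demo_queue_t`` and ``demo_enqueue`` are hypothetical names. What it shows is the accounting: the return value counts what was enqueued, and the remainder has already been consumed by the queue layer.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_QUEUE_SLOTS 4

typedef struct
{
  uint32_t buffers[DEMO_QUEUE_SLOTS];
  uint32_t n_queued;
  uint32_t n_dropped;
} demo_queue_t;

/* Enqueue up to n buffer indices. On congestion the remaining
   buffers are either dropped here (drop_on_congestion != 0) or left
   for the caller to deal with. Returns the number enqueued. */
static uint32_t
demo_enqueue (demo_queue_t *q, const uint32_t *from, uint32_t n,
              int drop_on_congestion)
{
  uint32_t n_enq = 0;
  assert (q && from);
  while (n_enq < n && q->n_queued < DEMO_QUEUE_SLOTS)
    q->buffers[q->n_queued++] = from[n_enq++];
  if (drop_on_congestion)
    q->n_dropped += n - n_enq;  /* the queue layer consumes the rest */
  return n_enq;
}
```

Since the dropped buffers have already been accounted for inside ``demo_enqueue``, a caller which also freed them would free them twice. The caller's only job is to bump its counters by ``n - n_enq``.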
| 690 | Handoff Demo Plugin |
| 691 | ------------------- |
| 692 | |
| 693 | Check out the sample (plugin) example in …/src/examples/handoffdemo. If |
| 694 | you want to build the handoff demo plugin: |
| 695 | |
| 696 | :: |
| 697 | |
| 698 | $ cd .../src/plugins |
| 699 | $ ln -s ../examples/handoffdemo |
| 700 | |
| 701 | This plugin provides a simple example of how to hand off packets between |
| 702 | threads. We used it to debug packet-tracer handoff tracing support. |
| 703 | |
| 704 | Packet generator input script |
| 705 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 706 | |
| 707 | :: |
| 708 | |
| 709 | packet-generator new { |
| 710 | name x |
| 711 | limit 5 |
| 712 | size 128-128 |
| 713 | interface local0 |
| 714 | node handoffdemo-1 |
| 715 | data { |
| 716 | incrementing 30 |
| 717 | } |
| 718 | } |
| 719 | |
| 720 | Start vpp with 2 worker threads |
| 721 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 722 | |
| 723 | The demo plugin hands packets from worker 1 to worker 2. |
| 724 | |
| 725 | Enable tracing, and start the packet generator |
| 726 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 727 | |
| 728 | :: |
| 729 | |
| 730 | trace add pg-input 100 |
| 731 | packet-generator enable |
| 732 | |
| 733 | Sample Run |
| 734 | ~~~~~~~~~~ |
| 735 | |
| 736 | :: |
| 737 | |
| 738 | DBGvpp# ex /tmp/pg_input_script |
| 739 | DBGvpp# pa en |
| 740 | DBGvpp# sh err |
| 741 | Count Node Reason |
| 742 | 5 handoffdemo-1 packets handed off processed |
| 743 | 5 handoffdemo-2 completed packets |
| 744 | DBGvpp# show run |
| 745 | Thread 1 vpp_wk_0 (lcore 0) |
| 746 | Time 133.9, average vectors/node 5.00, last 128 main loops 0.00 per node 0.00 |
| 747 | vector rates in 3.7331e-2, out 0.0000e0, drop 0.0000e0, punt 0.0000e0 |
| 748 | Name              State     Calls  Vectors  Suspends  Clocks  Vectors/Call
| 749 | handoffdemo-1     active        1        5         0  4.76e3          5.00
| 750 | pg-input          disabled      2        5         0  5.58e4          2.50
| 751 | unix-epoll-input  polling   22760        0         0  2.14e7          0.00
| 752 | --------------- |
| 753 | Thread 2 vpp_wk_1 (lcore 2) |
| 754 | Time 133.9, average vectors/node 5.00, last 128 main loops 0.00 per node 0.00 |
| 755 | vector rates in 0.0000e0, out 0.0000e0, drop 3.7331e-2, punt 0.0000e0 |
| 756 | Name              State     Calls  Vectors  Suspends  Clocks  Vectors/Call
| 757 | drop              active        1        5         0  1.35e4          5.00
| 758 | error-drop        active        1        5         0  2.52e4          5.00
| 759 | handoffdemo-2     active        1        5         0  2.56e4          5.00
| 760 | unix-epoll-input  polling   22406        0         0  2.18e7          0.00
| 761 | |
| 762 | Enable the packet tracer and run it again… |
| 763 | |
| 764 | :: |
| 765 | |
| 766 | DBGvpp# trace add pg-input 100 |
| 767 | DBGvpp# pa en |
| 768 | DBGvpp# sh trace |
| 770 | ------------------- Start of thread 0 vpp_main ------------------- |
| 771 | No packets in trace buffer |
| 772 | ------------------- Start of thread 1 vpp_wk_0 ------------------- |
| 773 | Packet 1 |
| 774 | |
| 775 | 00:06:50:520688: pg-input |
| 776 | stream x, 128 bytes, 0 sw_if_index |
| 777 | current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000000 |
| 778 | 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000 |
| 779 | 00000020: 0000000000000000000000000000000000000000000000000000000000000000 |
| 780 | 00000040: 0000000000000000000000000000000000000000000000000000000000000000 |
| 781 | 00000060: 0000000000000000000000000000000000000000000000000000000000000000 |
| 782 | 00:06:50:520762: handoffdemo-1 |
| 783 | HANDOFFDEMO: current thread 1 |
| 784 | |
| 785 | Packet 2 |
| 786 | |
| 787 | 00:06:50:520688: pg-input |
| 788 | stream x, 128 bytes, 0 sw_if_index |
| 789 | current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000001 |
| 790 | 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000 |
| 791 | 00000020: 0000000000000000000000000000000000000000000000000000000000000000 |
| 792 | 00000040: 0000000000000000000000000000000000000000000000000000000000000000 |
| 793 | 00000060: 0000000000000000000000000000000000000000000000000000000000000000 |
| 794 | 00:06:50:520762: handoffdemo-1 |
| 795 | HANDOFFDEMO: current thread 1 |
| 796 | |
| 797 | Packet 3 |
| 798 | |
| 799 | 00:06:50:520688: pg-input |
| 800 | stream x, 128 bytes, 0 sw_if_index |
| 801 | current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000002 |
| 802 | 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000 |
| 803 | 00000020: 0000000000000000000000000000000000000000000000000000000000000000 |
| 804 | 00000040: 0000000000000000000000000000000000000000000000000000000000000000 |
| 805 | 00000060: 0000000000000000000000000000000000000000000000000000000000000000 |
| 806 | 00:06:50:520762: handoffdemo-1 |
| 807 | HANDOFFDEMO: current thread 1 |
| 808 | |
| 809 | Packet 4 |
| 810 | |
| 811 | 00:06:50:520688: pg-input |
| 812 | stream x, 128 bytes, 0 sw_if_index |
| 813 | current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000003 |
| 814 | 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000 |
| 815 | 00000020: 0000000000000000000000000000000000000000000000000000000000000000 |
| 816 | 00000040: 0000000000000000000000000000000000000000000000000000000000000000 |
| 817 | 00000060: 0000000000000000000000000000000000000000000000000000000000000000 |
| 818 | 00:06:50:520762: handoffdemo-1 |
| 819 | HANDOFFDEMO: current thread 1 |
| 820 | |
| 821 | Packet 5 |
| 822 | |
| 823 | 00:06:50:520688: pg-input |
| 824 | stream x, 128 bytes, 0 sw_if_index |
| 825 | current data 0, length 128, buffer-pool 0, ref-count 1, trace handle 0x1000004 |
| 826 | 00000000: 000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d0000 |
| 827 | 00000020: 0000000000000000000000000000000000000000000000000000000000000000 |
| 828 | 00000040: 0000000000000000000000000000000000000000000000000000000000000000 |
| 829 | 00000060: 0000000000000000000000000000000000000000000000000000000000000000 |
| 830 | 00:06:50:520762: handoffdemo-1 |
| 831 | HANDOFFDEMO: current thread 1 |
| 832 | |
| 833 | ------------------- Start of thread 2 vpp_wk_1 ------------------- |
| 834 | Packet 1 |
| 835 | |
| 836 | 00:06:50:520796: handoff_trace |
| 837 | HANDED-OFF: from thread 1 trace index 0 |
| 838 | 00:06:50:520796: handoffdemo-2 |
| 839 | HANDOFFDEMO: current thread 2 |
| 840 | 00:06:50:520867: error-drop |
| 841 | rx:local0 |
| 842 | 00:06:50:520914: drop |
| 843 | handoffdemo-2: completed packets |
| 844 | |
| 845 | Packet 2 |
| 846 | |
| 847 | 00:06:50:520796: handoff_trace |
| 848 | HANDED-OFF: from thread 1 trace index 1 |
| 849 | 00:06:50:520796: handoffdemo-2 |
| 850 | HANDOFFDEMO: current thread 2 |
| 851 | 00:06:50:520867: error-drop |
| 852 | rx:local0 |
| 853 | 00:06:50:520914: drop |
| 854 | handoffdemo-2: completed packets |
| 855 | |
| 856 | Packet 3 |
| 857 | |
| 858 | 00:06:50:520796: handoff_trace |
| 859 | HANDED-OFF: from thread 1 trace index 2 |
| 860 | 00:06:50:520796: handoffdemo-2 |
| 861 | HANDOFFDEMO: current thread 2 |
| 862 | 00:06:50:520867: error-drop |
| 863 | rx:local0 |
| 864 | 00:06:50:520914: drop |
| 865 | handoffdemo-2: completed packets |
| 866 | |
| 867 | Packet 4 |
| 868 | |
| 869 | 00:06:50:520796: handoff_trace |
| 870 | HANDED-OFF: from thread 1 trace index 3 |
| 871 | 00:06:50:520796: handoffdemo-2 |
| 872 | HANDOFFDEMO: current thread 2 |
| 873 | 00:06:50:520867: error-drop |
| 874 | rx:local0 |
| 875 | 00:06:50:520914: drop |
| 876 | handoffdemo-2: completed packets |
| 877 | |
| 878 | Packet 5 |
| 879 | |
| 880 | 00:06:50:520796: handoff_trace |
| 881 | HANDED-OFF: from thread 1 trace index 4 |
| 882 | 00:06:50:520796: handoffdemo-2 |
| 883 | HANDOFFDEMO: current thread 2 |
| 884 | 00:06:50:520867: error-drop |
| 885 | rx:local0 |
| 886 | 00:06:50:520914: drop |
| 887 | handoffdemo-2: completed packets |
| 888 | DBGvpp# |