blob: b4e1065f81e013860ef709a09d4ed4abd4181a77 [file] [log] [blame]
Nathan Skrzypczak9ad39c02021-08-19 11:38:06 +02001VPPINFRA (Infrastructure)
2=========================
3
4The files associated with the VPP Infrastructure layer are located in
5the ``./src/vppinfra`` folder.
6
7VPPinfra is a collection of basic c-library services, quite sufficient
8to build standalone programs to run directly on bare metal. It also
9provides high-performance dynamic arrays, hashes, bitmaps,
10high-precision real-time clock support, fine-grained event-logging, and
11data structure serialization.
12
13One fair comment / fair warning about vppinfra: you can't always tell a
14macro from an inline function from an ordinary function simply by name.
15Macros are used to avoid function calls in the typical case, and to
16cause (intentional) side-effects.
17
18Vppinfra has been around for almost 20 years and tends not to change
19frequently. The VPP Infrastructure layer contains the following
20functions:
21
22Vectors
23-------
24
25Vppinfra vectors are ubiquitous dynamically resized arrays with by user
26defined "headers". Many vpppinfra data structures (e.g. hash, heap,
27pool) are vectors with various different headers.
28
29The memory layout looks like this:
30
31::
32
33 User header (optional, uword aligned)
34 Alignment padding (if needed)
35 Vector length in elements
36 User's pointer -> Vector element 0
37 Vector element 1
38 ...
39 Vector element N-1
40
41As shown above, the vector APIs deal with pointers to the 0th element of
42a vector. Null pointers are valid vectors of length zero.
43
44To avoid thrashing the memory allocator, one often resets the length of
45a vector to zero while retaining the memory allocation. Set the vector
46length field to zero via the vec_reset_length(v) macro. [Use the macro!
47Its smart about NULL pointers.]
48
49Typically, the user header is not present. User headers allow for other
50data structures to be built atop vppinfra vectors. Users may specify the
51alignment for first data element of a vector via the [vec]()*_aligned
52macros.
53
54Vector elements can be any C type e.g. (int, double, struct bar). This
55is also true for data types built atop vectors (e.gheap, pool, etc.).
56Many macros have \_a variants supporting alignment of vector elements
57and \_h variants supporting non-zero-length vector headers. The \_ha
58variants support both. Additionally cacheline alignment within a vector
59element structure can be specified using the
60``[CLIB_CACHE_LINE_ALIGN_MARK]()`` macro.
61
62Inconsistent usage of header and/or alignment related macro variants
63will cause delayed, confusing failures.
64
65Standard programming error: memorize a pointer to the ith element of a
66vector, and then expand the vector. Vectors expand by 3/2, so such code
67may appear to work for a period of time. Correct code almost always
68memorizes vector **indices** which are invariant across reallocations.
69
70In typical application images, one supplies a set of global functions
71designed to be called from gdb. Here are a few examples:
72
73- vl(v) - prints vec_len(v)
74- pe(p) - prints pool_elts(p)
75- pifi(p, index) - prints pool_is_free_index(p, index)
76- debug_hex_bytes (p, nbytes) - hex memory dump nbytes starting at p
77
78Use the show gdb debug CLI command to print the current set.
79
80Bitmaps
81-------
82
83Vppinfra bitmaps are dynamic, built using the vppinfra vector APIs.
84Quite handy for a variety jobs.
85
86Pools
87-----
88
89Vppinfra pools combine vectors and bitmaps to rapidly allocate and free
90fixed-size data structures with independent lifetimes. Pools are perfect
91for allocating per-session structures.
92
93Hashes
94------
95
96Vppinfra provides several hash flavors. Data plane problems involving
97packet classification / session lookup often use
98./src/vppinfra/bihash_template.[ch] bounded-index extensible hashes.
99These templates are instantiated multiple times, to efficiently service
100different fixed-key sizes.
101
102Bihashes are thread-safe. Read-locking is not required. A simple
103spin-lock ensures that only one thread writes an entry at a time.
104
105The original vppinfra hash implementation in ./src/vppinfra/hash.[ch]
106are simple to use, and are often used in control-plane code which needs
107exact-string-matching.
108
109In either case, one almost always looks up a key in a hash table to
110obtain an index in a related vector or pool. The APIs are simple enough,
111but one must take care when using the unmanaged arbitrary-sized key
112variant. Hash_set_mem (hash_table, key_pointer, value) memorizes
113key_pointer. It is usually a bad mistake to pass the address of a vector
114element as the second argument to hash_set_mem. It is perfectly fine to
115memorize constant string addresses in the text segment.
116
117Timekeeping
118-----------
119
120Vppinfra includes high-precision, low-cost timing services. The datatype
121clib_time_t and associated functions reside in ./src/vppinfra/time.[ch].
122Call clib_time_init (clib_time_t \*cp) to initialize the clib_time_t
123object.
124
125Clib_time_init(…) can use a variety of different ways to establish the
126hardware clock frequency. At the end of the day, vppinfra timekeeping
127takes the attitude that the operating systems clock is the closest
128thing to a gold standard it has handy.
129
130When properly configured, NTP maintains kernel clock synchronization
131with a highly accurate off-premises reference clock. Notwithstanding
132network propagation delays, a synchronized NTP client will keep the
133kernel clock accurate to within 50ms or so.
134
135Why should one care? Simply put, oscillators used to generate CPU ticks
136arent super accurate. They work pretty well, but a 0.1% error wouldnt
137be out of the question. Thats a minute and a halfs worth of error in 1
138day. The error changes constantly, due to temperature variation, and a
139host of other physical factors.
140
141Its far too expensive to use system calls for timing, so were left
142with the problem of continuously adjusting our view of the CPU tick
143registers clocks_per_second parameter.
144
145The clock rate adjustment algorithm measures the number of cpu ticks and
146the gold standard reference time across an interval of approximately
14716 seconds. We calculate clocks_per_second for the interval: use rdtsc
148(on x86_64) and a system call to get the latest cpu tick count and the
149kernels latest nanosecond timestamp. We subtract the previous interval
150end values, and use exponential smoothing to merge the new clock rate
151sample into the clocks_per_second parameter.
152
153As of this writing, we maintain the clock rate by way of the following
154first-order differential equation:
155
156.. code:: c
157
158 clocks_per_second(t) = clocks_per_second(t-1) * K + sample_cps(t)*(1-K)
159 where K = e**(-1.0/3.75);
160
161This yields a per observation half-life of 1 minute. Empirically, the
162clock rate converges within 5 minutes, and appears to maintain
163near-perfect agreement with the kernel clock in the face of ongoing NTP
164time adjustments.
165
166See ./src/vppinfra/time.c:clib_time_verify_frequency(…) to look at the
167rate adjustment algorithm. The code rejects frequency samples
168corresponding to the sort of adjustment which might occur if someone
169changes the gold standard kernel clock by several seconds.
170
171Monotonic timebase support
172~~~~~~~~~~~~~~~~~~~~~~~~~~
173
174Particularly during system initialization, the gold standard system
175reference clock can change by a large amount, in an instant. Its not a
176best practice to yank the reference clock - in either direction - by
177hours or days. In fact, some poorly-constructed use-cases do so.
178
179To deal with this reality, clib_time_now(…) returns the number of
180seconds since vpp started, *guaranteed to be monotonically increasing,
181no matter what happens to the system reference clock*.
182
183This is first-order important, to avoid breaking every active timer in
184the system. The vpp host stack alone may account for tens of millions of
185active timers. Its utterly impractical to track down and fix timers, so
186we must deal with the issue at the timebase level.
187
188Heres how it works. Prior to adjusting the clock rate, we collect the
189kernel reference clock and the cpu clock:
190
191.. code:: c
192
193 /* Ask the kernel and the CPU what time it is... */
194 now_reference = unix_time_now ();
195 now_clock = clib_cpu_time_now ();
196
197Compute changes for both clocks since the last rate adjustment, roughly
19815 seconds ago:
199
200.. code:: c
201
202 /* Compute change in the reference clock */
203 delta_reference = now_reference - c->last_verify_reference_time;
204
205 /* And change in the CPU clock */
206 delta_clock_in_seconds = (f64) (now_clock - c->last_verify_cpu_time) *
207 c->seconds_per_clock;
208
209Delta_reference is key. Almost 100% of the time, delta_reference and
210delta_clock_in_seconds are identical modulo one system-call time.
211However, NTP or a privileged user can yank the system reference time -
212in either direction - by an hour, a day, or a decade.
213
214As described above, clib_time_now(…) must return monotonically
215increasing answers to the question how long has it been since vpp
216started, in seconds.” To do that, the clock rate adjustment algorithm
217begins by recomputing the initial reference time:
218
219.. code:: c
220
221 c->init_reference_time += (delta_reference - delta_clock_in_seconds);
222
223Its easy to convince yourself that if the reference clock changes by
22415.000000 seconds and the cpu clock tick time changes by 15.000000
225seconds, the initial reference time wont change.
226
227If, on the other hand, delta_reference is -86400.0 and delta clock is
22815.0 - reference time jumped backwards by exactly one day in a 15-second
229rate update interval - we add -86415.0 to the initial reference time.
230
231Given the corrected initial reference time, we recompute the total
232number of cpu ticks which have occurred since the corrected initial
233reference time, at the current clock tick rate:
234
235.. code:: c
236
237 c->total_cpu_time = (now_reference - c->init_reference_time)
238 * c->clocks_per_second;
239
240Timebase precision
241~~~~~~~~~~~~~~~~~~
242
243Cognoscenti may notice that vlib/clib_time_now(…) return a 64-bit
244floating-point value; the number of seconds since vpp started.
245
246Please see `this Wikipedia
247article <https://en.wikipedia.org/wiki/Double-precision_floating-point_format>`__
248for more information. C double-precision floating point numbers (called
249f64 in the vpp code base) have a 53-bit effective mantissa, and can
250accurately represent 15 decimal digits worth of precision.
251
252There are 315,360,000.000001 seconds in ten years plus one microsecond.
253That string has exactly 15 decimal digits. The vpp time base retains 1us
254precision for roughly 30 years.
255
256vlib/clib_time_now do *not* provide precision in excess of 1e-6 seconds.
257If necessary, please use clib_cpu_time_now(…) for direct access to the
258CPU clock-cycle counter. Note that the number of CPU clock cycles per
259second varies significantly across CPU architectures.
260
261Timer Wheels
262------------
263
264Vppinfra includes configurable timer wheel support. See the source code
265in …/src/vppinfra/tw_timer_template.[ch], as well as a considerable
266number of template instances defined in …/src/vppinfra/tw_timer\_.[ch].
267
268Instantiation of tw_timer_template.h generates named structures to
269implement specific timer wheel geometries. Choices include: number of
270timer wheels (currently, 1 or 2), number of slots per ring (a power of
271two), and the number of timers per object handle”.
272
273Internally, user object/timer handles are 32-bit integers, so if one
274selects 16 timers/object (4 bits), the resulting timer wheel handle is
275limited to 2**28 objects.
276
277Here are the specific settings required to generate a single 2048 slot
278wheel which supports 2 timers per object:
279
280.. code:: c
281
282 #define TW_TIMER_WHEELS 1
283 #define TW_SLOTS_PER_RING 2048
284 #define TW_RING_SHIFT 11
285 #define TW_RING_MASK (TW_SLOTS_PER_RING -1)
286 #define TW_TIMERS_PER_OBJECT 2
287 #define LOG2_TW_TIMERS_PER_OBJECT 1
288 #define TW_SUFFIX _2t_1w_2048sl
289 #define TW_FAST_WHEEL_BITMAP 0
290 #define TW_TIMER_ALLOW_DUPLICATE_STOP 0
291
292See tw_timer_2t_1w_2048sl.h for a complete example.
293
294tw_timer_template.h is not intended to be #included directly. Client
295codes can include multiple timer geometry header files, although extreme
296caution would required to use the TW and TWT macros in such a case.
297
298API usage examples
299~~~~~~~~~~~~~~~~~~
300
301The unit test code in …/src/vppinfra/test_tw_timer.c provides a concrete
302API usage example. It uses a synthetic clock to rapidly exercise the
303underlying tw_timer_expire_timers(…) template.
304
305There are not many API routines to call.
306
307Initialize a two-timer, single 2048-slot wheel w/ a 1-second timer granularity
308^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
309
310.. code:: c
311
312 tw_timer_wheel_init_2t_1w_2048sl (&tm->single_wheel,
313 expired_timer_single_callback,
314 1.0 / * timer interval * / );
315
316Start a timer
317^^^^^^^^^^^^^
318
319.. code:: c
320
321 handle = tw_timer_start_2t_1w_2048sl (&tm->single_wheel, elt_index,
322 [0 | 1] / * timer id * / ,
323 expiration_time_in_u32_ticks);
324
325Stop a timer
326^^^^^^^^^^^^
327
328.. code:: c
329
330 tw_timer_stop_2t_1w_2048sl (&tm->single_wheel, handle);
331
332An expired timer callback
333^^^^^^^^^^^^^^^^^^^^^^^^^
334
335.. code:: c
336
337 static void
338 expired_timer_single_callback (u32 * expired_timers)
339 {
340 int i;
341 u32 pool_index, timer_id;
342 tw_timer_test_elt_t *e;
343 tw_timer_test_main_t *tm = &tw_timer_test_main;
344
345 for (i = 0; i < vec_len (expired_timers);
346 {
347 pool_index = expired_timers[i] & 0x7FFFFFFF;
348 timer_id = expired_timers[i] >> 31;
349
350 ASSERT (timer_id == 1);
351
352 e = pool_elt_at_index (tm->test_elts, pool_index);
353
354 if (e->expected_to_expire != tm->single_wheel.current_tick)
355 {
356 fformat (stdout, "[%d] expired at %d not %d\n",
357 e - tm->test_elts, tm->single_wheel.current_tick,
358 e->expected_to_expire);
359 }
360 pool_put (tm->test_elts, e);
361 }
362 }
363
364We use wheel timers extensively in the vpp host stack. Each TCP session
365needs 5 timers, so supporting 10 million flows requires up to 50 million
366concurrent timers.
367
368Timers rarely expire, so its of utmost important that stopping and
369restarting a timer costs as few clock cycles as possible.
370
371Stopping a timer costs a doubly-linked list dequeue. Starting a timer
372involves modular arithmetic to determine the correct timer wheel and
373slot, and a list head enqueue.
374
375Expired timer processing generally involves bulk link-list retirement
376with user callback presentation. Some additional complexity at wheel
377wrap time, to relocate timers from slower-turning timer wheels into
378faster-turning wheels.
379
380Format
381------
382
383Vppinfra format is roughly equivalent to printf.
384
385Format has a few properties worth mentioning. Formats first argument is
386a (u8 \*) vector to which it appends the result of the current format
387operation. Chaining calls is very easy:
388
389.. code:: c
390
391 u8 * result;
392
393 result = format (0, "junk = %d, ", junk);
394 result = format (result, "more junk = %d\n", more_junk);
395
396As previously noted, NULL pointers are perfectly proper 0-length
397vectors. Format returns a (u8 \*) vector, **not** a C-string. If you
398wish to print a (u8 \*) vector, use the “%v format string. If you need
399a (u8 \*) vector which is also a proper C-string, either of these
400schemes may be used:
401
402.. code:: c
403
404 vec_add1 (result, 0)
405 or
406 result = format (result, "<whatever>%c", 0);
407
408Remember to vec_free() the result if appropriate. Be careful not to pass
409format an uninitialized (u8 \*).
410
411Format implements a particularly handy user-format scheme via the “%U
412format specification. For example:
413
414.. code:: c
415
416 u8 * format_junk (u8 * s, va_list *va)
417 {
418 junk = va_arg (va, u32);
419 s = format (s, "%s", junk);
420 return s;
421 }
422
423 result = format (0, "junk = %U, format_junk, "This is some junk");
424
425format_junk() can invoke other user-format functions if desired. The
426programmer shoulders responsibility for argument type-checking. It is
427typical for user format functions to blow up spectacularly if the
428va_arg(va, type) macros don’t match the caller’s idea of reality.
429
430Unformat
431--------
432
433Vppinfra unformat is vaguely related to scanf, but considerably more
434general.
435
436A typical use case involves initializing an unformat_input_t from either
437a C-string or a (u8 \*) vector, then parsing via unformat() as follows:
438
439.. code:: c
440
441 unformat_input_t input;
442 u8 *s = "<some-C-string>";
443
444 unformat_init_string (&input, (char *) s, strlen((char *) s));
445 /* or */
446 unformat_init_vector (&input, <u8-vector>);
447
448Then loop parsing individual elements:
449
450.. code:: c
451
452 while (unformat_check_input (&input) != UNFORMAT_END_OF_INPUT)
453 {
454 if (unformat (&input, "value1 %d", &value1))
455 ;/* unformat sets value1 */
456 else if (unformat (&input, "value2 %d", &value2)
457 ;/* unformat sets value2 */
458 else
459 return clib_error_return (0, "unknown input '%U'",
460 format_unformat_error, input);
461 }
462
463As with format, unformat implements a user-unformat function capability
464via a “%U” user unformat function scheme. Generally, one can trivially
465transform “format (s,”foo %d”, foo) -> “unformat (input,”foo %d”,
466&foo)“.
467
468Unformat implements a couple of handy non-scanf-like format specifiers:
469
470.. code:: c
471
472 unformat (input, "enable %=", &enable, 1 /* defaults to 1 */);
473 unformat (input, "bitzero %|", &mask, (1<<0));
474 unformat (input, "bitone %|", &mask, (1<<1));
475 <etc>
476
477The phrase “enable %=” means “set the supplied variable to the default
478value” if unformat parses the “enable” keyword all by itself. If
479unformat parses “enable 123” set the supplied variable to 123.
480
481We could clean up a number of hand-rolled “verbose” + “verbose %d”
482argument parsing codes using “%=”.
483
484The phrase “bitzero %\|” means “set the specified bit in the supplied
485bitmask” if unformat parses “bitzero”. Although it looks like it could
486be fairly handy, it’s very lightly used in the code base.
487
488``%_`` toggles whether or not to skip input white space.
489
490For transition from skip to no-skip in middle of format string, skip
491input white space. For example, the following:
492
493.. code:: c
494
495 fmt = "%_%d.%d%_->%_%d.%d%_"
496 unformat (input, fmt, &one, &two, &three, &four);
497
498matches input “1.2 -> 3.4”. Without this, the space after -> does not
499get skipped.
500
501
502How to parse a single input line
503~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
504
505Debug CLI command functions MUST NOT accidentally consume input
506belonging to other debug CLI commands. Otherwise, it's impossible to
507script a set of debug CLI commands which "work fine" when issued one
508at a time.
509
510This bit of code is NOT correct:
511
512.. code:: c
513
514 /* Eats script input NOT beloging to it, and chokes! */
515 while (unformat_check_input (input) != UNFORMAT_END_OF_INPUT)
516 {
517 if (unformat (input, ...))
518 ;
519 else if (unformat (input, ...))
520 ;
521 else
522 return clib_error_return (0, "parse error: '%U'",
523 format_unformat_error, input);
524 }
525 }
526
527When executed as part of a script, such a function will return “parse
528error: ‘’” every time, unless it happens to be the last command in the
529script.
530
531Instead, use “unformat_line_input” to consume the rest of a line’s worth
532of input - everything past the path specified in the VLIB_CLI_COMMAND
533declaration.
534
535For example, unformat_line_input with “my_command” set up as shown below
536and user input “my path is clear” will produce an unformat_input_t that
537contains “is clear”.
538
539.. code:: c
540
541 VLIB_CLI_COMMAND (...) = {
542 .path = "my path",
543 };
544
545Here’s a bit of code which shows the required mechanics, in full:
546
547.. code:: c
548
549 static clib_error_t *
550 my_command_fn (vlib_main_t * vm,
551 unformat_input_t * input,
552 vlib_cli_command_t * cmd)
553 {
554 unformat_input_t _line_input, *line_input = &_line_input;
555 u32 this, that;
556 clib_error_t *error = 0;
557
558 if (!unformat_user (input, unformat_line_input, line_input))
559 return 0;
560
561 /*
562 * Here, UNFORMAT_END_OF_INPUT is at the end of the line we consumed,
563 * not at the end of the script...
564 */
565 while (unformat_check_input (line_input) != UNFORMAT_END_OF_INPUT)
566 {
567 if (unformat (line_input, "this %u", &this))
568 ;
569 else if (unformat (line_input, "that %u", &that))
570 ;
571 else
572 {
573 error = clib_error_return (0, "parse error: '%U'",
574 format_unformat_error, line_input);
575 goto done;
576 }
577 }
578
579 <do something based on "this" and "that", etc>
580
581 done:
582 unformat_free (line_input);
583 return error;
584 }
585 VLIB_CLI_COMMAND (my_command, static) = {
586 .path = "my path",
587 .function = my_command_fn",
588 };
589
590Vppinfra errors and warnings
591----------------------------
592
593Many functions within the vpp dataplane have return-values of type
594clib_error_t \*. Clib_error_ts are arbitrary strings with a bit of
595metadata [fatal, warning] and are easy to announce. Returning a NULL
596clib_error_t \* indicates A-OK, no error.”
597
598Clib_warning(format-args) is a handy way to add debugging output; clib
599warnings prepend function:line info to unambiguously locate the message
600source. Clib_unix_warning() adds perror()-style Linux system-call
601information. In production images, clib_warnings result in syslog
602entries.
603
604Serialization
605-------------
606
607Vppinfra serialization support allows the programmer to easily serialize
608and unserialize complex data structures.
609
610The underlying primitive serialize/unserialize functions use network
611byte-order, so there are no structural issues serializing on a
612little-endian host and unserializing on a big-endian host.