Blame - docs/developer/corearchitecture/infrastructure.rst - fdio/vpp

Nathan Skrzypczak	9ad39c0	2021-08-19 11:38:06 +0200	[diff] [blame]	1	VPPINFRA (Infrastructure)
				2	=========================
				3
				4	The files associated with the VPP Infrastructure layer are located in
				5	the ``./src/vppinfra`` folder.
				6
				7	VPPinfra is a collection of basic c-library services, quite sufficient
				8	to build standalone programs to run directly on bare metal. It also
				9	provides high-performance dynamic arrays, hashes, bitmaps,
				10	high-precision real-time clock support, fine-grained event-logging, and
				11	data structure serialization.
				12
				13	One fair comment / fair warning about vppinfra: you can't always tell a
				14	macro from an inline function from an ordinary function simply by name.
				15	Macros are used to avoid function calls in the typical case, and to
				16	cause (intentional) side-effects.
				17
				18	Vppinfra has been around for almost 20 years and tends not to change
				19	frequently. The VPP Infrastructure layer contains the following
				20	functions:
				21
				22	Vectors
				23	-------
				24
Vppinfra vectors are ubiquitous, dynamically resized arrays with optional
user-defined "headers". Many vppinfra data structures (e.g. hash, heap,
pool) are vectors with various different headers.
				28
				29	The memory layout looks like this:
				30
::

                      User header (optional, uword aligned)
                      Alignment padding (if needed)
                      Vector length in elements
   User's pointer ->  Vector element 0
                      Vector element 1
                      ...
                      Vector element N-1
				40
				41	As shown above, the vector APIs deal with pointers to the 0th element of
				42	a vector. Null pointers are valid vectors of length zero.
				43
				44	To avoid thrashing the memory allocator, one often resets the length of
				45	a vector to zero while retaining the memory allocation. Set the vector
				46	length field to zero via the vec_reset_length(v) macro. [Use the macro!
				47	It’s smart about NULL pointers.]
				48
Typically, the user header is not present. User headers allow for other
data structures to be built atop vppinfra vectors. Users may specify the
alignment for the first data element of a vector via the
``vec_*_aligned`` macros.

Vector elements can be any C type (e.g. int, double, struct bar). This
is also true for data types built atop vectors (e.g. heap, pool, etc.).
Many macros have \_a variants supporting alignment of vector elements
and \_h variants supporting non-zero-length vector headers. The \_ha
variants support both. Additionally, cacheline alignment within a vector
element structure can be specified using the
``CLIB_CACHE_LINE_ALIGN_MARK`` macro.
				61
				62	Inconsistent usage of header and/or alignment related macro variants
				63	will cause delayed, confusing failures.
				64
				65	Standard programming error: memorize a pointer to the ith element of a
				66	vector, and then expand the vector. Vectors expand by 3/2, so such code
				67	may appear to work for a period of time. Correct code almost always
				68	memorizes vector indices which are invariant across reallocations.
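
The layout and the index-versus-pointer rule can be illustrated with a
minimal standalone sketch. It is not the vppinfra implementation: the
``sk_`` names, the two-field header and the exact growth policy are
assumptions made for illustration only. The caller holds a pointer to
element 0, the bookkeeping lives just before it, and NULL is a valid
empty vector.

.. code:: c

   #include <assert.h>
   #include <stddef.h>
   #include <stdlib.h>

   /* Hypothetical header stored immediately before element 0. */
   typedef struct
   {
     size_t len;			/* elements in use */
     size_t cap;			/* elements allocated */
   } sk_vec_hdr_t;

   static sk_vec_hdr_t *
   sk_hdr (int *v)
   {
     return (sk_vec_hdr_t *) v - 1;
   }

   /* NULL is a valid vector of length zero. */
   static size_t
   sk_vec_len (int *v)
   {
     return v ? sk_hdr (v)->len : 0;
   }

   /* Append one element; returns the (possibly moved) element-0 pointer. */
   static int *
   sk_vec_add1 (int *v, int value)
   {
     size_t len = sk_vec_len (v);
     size_t cap = v ? sk_hdr (v)->cap : 0;

     if (len == cap)
       {
         /* Grow by roughly 3/2, as described in the text. */
         size_t new_cap = cap ? cap + cap / 2 + 1 : 4;
         sk_vec_hdr_t *h = realloc (v ? sk_hdr (v) : NULL,
                                    sizeof (*h) + new_cap * sizeof (int));
         if (!h)
           abort ();
         h->cap = new_cap;
         v = (int *) (h + 1);
       }
     sk_hdr (v)->len = len + 1;
     v[len] = value;
     return v;
   }

   /* Keep the allocation, forget the contents. Safe on NULL. */
   static void
   sk_vec_reset_length (int *v)
   {
     if (v)
       sk_hdr (v)->len = 0;
   }

   static void
   sk_vec_free (int *v)
   {
     if (v)
       free (sk_hdr (v));
   }

   /* Remember an index, not a pointer, across growth. */
   static int
   sk_example (void)
   {
     int *v = NULL;
     size_t remembered_index = 2;
     int value;

     for (int i = 0; i < 1000; i++)
       v = sk_vec_add1 (v, i * 10);	/* may move the vector several times */
     value = v[remembered_index];	/* indices survive reallocation */
     sk_vec_free (v);
     return value;
   }

Had ``sk_example`` saved ``&v[2]`` before the loop instead of the index,
the saved pointer could refer to freed memory after the first
reallocation.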
				69
				70	In typical application images, one supplies a set of global functions
				71	designed to be called from gdb. Here are a few examples:
				72
				73	- vl(v) - prints vec_len(v)
				74	- pe(p) - prints pool_elts(p)
				75	- pifi(p, index) - prints pool_is_free_index(p, index)
				76	- debug_hex_bytes (p, nbytes) - hex memory dump nbytes starting at p
				77
				78	Use the “show gdb” debug CLI command to print the current set.
				79
				80	Bitmaps
				81	-------
				82
Vppinfra bitmaps are dynamic, built using the vppinfra vector APIs.
They are quite handy for a variety of jobs.
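
As a rough illustration of what "dynamic" means here, the following
standalone sketch grows its word array on demand when a bit is set. The
``sk_bitmap_*`` names are hypothetical and the code uses plain
``realloc`` rather than the vppinfra vector APIs.

.. code:: c

   #include <assert.h>
   #include <stdint.h>
   #include <stdlib.h>
   #include <string.h>

   typedef struct
   {
     uint64_t *words;		/* grows as higher bits are set */
     size_t n_words;
   } sk_bitmap_t;

   /* Set a bit, extending (and zero-filling) the word array if needed. */
   static void
   sk_bitmap_set (sk_bitmap_t *b, size_t bit)
   {
     size_t word = bit / 64;

     if (word >= b->n_words)
       {
         uint64_t *w = realloc (b->words, (word + 1) * sizeof (*w));
         if (!w)
           abort ();
         memset (w + b->n_words, 0, (word + 1 - b->n_words) * sizeof (*w));
         b->words = w;
         b->n_words = word + 1;
       }
     b->words[word] |= (uint64_t) 1 << (bit % 64);
   }

   /* Bits beyond the allocated words read as zero. */
   static int
   sk_bitmap_get (const sk_bitmap_t *b, size_t bit)
   {
     size_t word = bit / 64;

     return word < b->n_words && ((b->words[word] >> (bit % 64)) & 1);
   }

A zero-initialized ``sk_bitmap_t`` is an empty bitmap, mirroring the way
a NULL pointer is a valid empty vector.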
				85
				86	Pools
				87	-----
				88
				89	Vppinfra pools combine vectors and bitmaps to rapidly allocate and free
				90	fixed-size data structures with independent lifetimes. Pools are perfect
				91	for allocating per-session structures.
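
A minimal sketch of the idea, assuming a fixed capacity to keep it short:
elements live in an array, freed indices are recorded in a bitmap and
pushed on a reuse stack, so both allocation and free are constant-time.
All ``sk_pool_*`` names are hypothetical; the real pool macros are built
on vppinfra vectors and grow dynamically.

.. code:: c

   #include <assert.h>
   #include <stdint.h>

   #define SK_POOL_MAX 64	/* fixed capacity: an assumption of this sketch */

   typedef struct
   {
     int session_id;
   } sk_elt_t;

   typedef struct
   {
     sk_elt_t elts[SK_POOL_MAX];
     uint64_t free_bitmap;			/* bit i set => index i is free */
     uint32_t free_indices[SK_POOL_MAX];	/* stack of reusable indices */
     uint32_t n_free;
     uint32_t high_water;			/* indices below this were handed out */
   } sk_pool_t;

   /* Allocate an element; returns its index, or -1 if the pool is full. */
   static int
   sk_pool_get (sk_pool_t *p)
   {
     if (p->n_free > 0)
       {
         uint32_t i = p->free_indices[--p->n_free];
         p->free_bitmap &= ~((uint64_t) 1 << i);
         return (int) i;
       }
     if (p->high_water == SK_POOL_MAX)
       return -1;
     return (int) p->high_water++;
   }

   /* Free an element by index. */
   static void
   sk_pool_put (sk_pool_t *p, uint32_t i)
   {
     p->free_bitmap |= (uint64_t) 1 << i;
     p->free_indices[p->n_free++] = i;
   }

   static int
   sk_pool_is_free_index (const sk_pool_t *p, uint32_t i)
   {
     return i >= p->high_water || ((p->free_bitmap >> i) & 1);
   }

Callers hold on to the returned index rather than a pointer, for the same
reason given for vectors above.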
				92
				93	Hashes
				94	------
				95
				96	Vppinfra provides several hash flavors. Data plane problems involving
				97	packet classification / session lookup often use
				98	./src/vppinfra/bihash_template.[ch] bounded-index extensible hashes.
				99	These templates are instantiated multiple times, to efficiently service
				100	different fixed-key sizes.
				101
				102	Bihashes are thread-safe. Read-locking is not required. A simple
				103	spin-lock ensures that only one thread writes an entry at a time.
				104
The original vppinfra hash implementation in ./src/vppinfra/hash.[ch]
is simple to use, and is often used in control-plane code which needs
exact-string-matching.

In either case, one almost always looks up a key in a hash table to
obtain an index in a related vector or pool. The APIs are simple enough,
but one must take care when using the unmanaged arbitrary-sized key
variant. ``hash_set_mem (hash_table, key_pointer, value)`` memorizes
key_pointer. It is usually a bad mistake to pass the address of a vector
element as the second argument to hash_set_mem: the vector may be
reallocated later, leaving the hash table with a stale key pointer. It is
perfectly fine to memorize constant string addresses in the text segment.
				116
				117	Timekeeping
				118	-----------
				119
				120	Vppinfra includes high-precision, low-cost timing services. The datatype
				121	clib_time_t and associated functions reside in ./src/vppinfra/time.[ch].
				122	Call clib_time_init (clib_time_t \*cp) to initialize the clib_time_t
				123	object.
				124
``clib_time_init(…)`` can use a variety of different ways to establish
the hardware clock frequency. At the end of the day, vppinfra timekeeping
takes the attitude that the operating system’s clock is the closest
thing to a gold standard it has handy.
				129
				130	When properly configured, NTP maintains kernel clock synchronization
				131	with a highly accurate off-premises reference clock. Notwithstanding
				132	network propagation delays, a synchronized NTP client will keep the
				133	kernel clock accurate to within 50ms or so.
				134
				135	Why should one care? Simply put, oscillators used to generate CPU ticks
				136	aren’t super accurate. They work pretty well, but a 0.1% error wouldn’t
				137	be out of the question. That’s a minute and a half’s worth of error in 1
				138	day. The error changes constantly, due to temperature variation, and a
				139	host of other physical factors.
				140
				141	It’s far too expensive to use system calls for timing, so we’re left
				142	with the problem of continuously adjusting our view of the CPU tick
				143	register’s clocks_per_second parameter.
				144
				145	The clock rate adjustment algorithm measures the number of cpu ticks and
				146	the “gold standard” reference time across an interval of approximately
				147	16 seconds. We calculate clocks_per_second for the interval: use rdtsc
				148	(on x86_64) and a system call to get the latest cpu tick count and the
				149	kernel’s latest nanosecond timestamp. We subtract the previous interval
				150	end values, and use exponential smoothing to merge the new clock rate
				151	sample into the clocks_per_second parameter.
				152
As of this writing, we maintain the clock rate by way of the following
first-order difference equation:

.. code:: c

   clocks_per_second(t) = clocks_per_second(t-1) * K + sample_cps(t)*(1-K)
   where K = e**(-1.0/3.75);
				160
With K = e**(-1.0/3.75), the filter's time constant is 3.75 observations,
or about 1 minute at roughly 16 seconds per observation (the
corresponding half-life is about 42 seconds). Empirically, the
clock rate converges within 5 minutes, and appears to maintain
near-perfect agreement with the kernel clock in the face of ongoing NTP
time adjustments.
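
The smoothing step can be tried out in isolation. The sketch below is
standalone C, not the code in time.c: the ``sk_`` names are hypothetical,
and K is hard-coded as an approximation of e**(-1.0/3.75) so the example
does not depend on the math library.

.. code:: c

   #include <assert.h>

   /* Approximately e**(-1.0/3.75); hard-coded for this sketch. */
   #define SK_DAMPING 0.7659283

   /* Merge one new clock-rate sample into the running estimate. */
   static double
   sk_smooth_cps (double clocks_per_second, double sample_cps)
   {
     return clocks_per_second * SK_DAMPING + sample_cps * (1.0 - SK_DAMPING);
   }

   /* Apply n identical samples, one per measurement interval. */
   static double
   sk_converge (double start_cps, double true_cps, int n)
   {
     double cps = start_cps;

     for (int i = 0; i < n; i++)
       cps = sk_smooth_cps (cps, true_cps);
     return cps;
   }

Starting from an estimate that is 0.1% low, each update removes about 23%
of the remaining error, so twenty updates (a little over five minutes at
16 seconds each) leave a relative error of a few parts per million.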
				165
				166	See ./src/vppinfra/time.c:clib_time_verify_frequency(…) to look at the
				167	rate adjustment algorithm. The code rejects frequency samples
				168	corresponding to the sort of adjustment which might occur if someone
				169	changes the gold standard kernel clock by several seconds.
				170
				171	Monotonic timebase support
				172	~~~~~~~~~~~~~~~~~~~~~~~~~~
				173
				174	Particularly during system initialization, the “gold standard” system
				175	reference clock can change by a large amount, in an instant. It’s not a
				176	best practice to yank the reference clock - in either direction - by
				177	hours or days. In fact, some poorly-constructed use-cases do so.
				178
				179	To deal with this reality, clib_time_now(…) returns the number of
				180	seconds since vpp started, *guaranteed to be monotonically increasing,
				181	no matter what happens to the system reference clock*.
				182
				183	This is first-order important, to avoid breaking every active timer in
				184	the system. The vpp host stack alone may account for tens of millions of
				185	active timers. It’s utterly impractical to track down and fix timers, so
				186	we must deal with the issue at the timebase level.
				187
				188	Here’s how it works. Prior to adjusting the clock rate, we collect the
				189	kernel reference clock and the cpu clock:
				190
.. code:: c

   /* Ask the kernel and the CPU what time it is... */
   now_reference = unix_time_now ();
   now_clock = clib_cpu_time_now ();
				196
				197	Compute changes for both clocks since the last rate adjustment, roughly
				198	15 seconds ago:
				199
.. code:: c

   /* Compute change in the reference clock */
   delta_reference = now_reference - c->last_verify_reference_time;

   /* And change in the CPU clock */
   delta_clock_in_seconds = (f64) (now_clock - c->last_verify_cpu_time) *
     c->seconds_per_clock;
				208
``delta_reference`` is key. Almost 100% of the time, delta_reference and
delta_clock_in_seconds are identical modulo one system-call time.
However, NTP or a privileged user can yank the system reference time -
in either direction - by an hour, a day, or a decade.
				213
				214	As described above, clib_time_now(…) must return monotonically
				215	increasing answers to the question “how long has it been since vpp
				216	started, in seconds.” To do that, the clock rate adjustment algorithm
				217	begins by recomputing the initial reference time:
				218
.. code:: c

   c->init_reference_time += (delta_reference - delta_clock_in_seconds);
				222
				223	It’s easy to convince yourself that if the reference clock changes by
				224	15.000000 seconds and the cpu clock tick time changes by 15.000000
				225	seconds, the initial reference time won’t change.
				226
				227	If, on the other hand, delta_reference is -86400.0 and delta clock is
				228	15.0 - reference time jumped backwards by exactly one day in a 15-second
				229	rate update interval - we add -86415.0 to the initial reference time.
				230
				231	Given the corrected initial reference time, we recompute the total
				232	number of cpu ticks which have occurred since the corrected initial
				233	reference time, at the current clock tick rate:
				234
.. code:: c

   c->total_cpu_time = (now_reference - c->init_reference_time)
     * c->clocks_per_second;
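
To see why the correction keeps elapsed time monotonic, here is a
standalone sketch of the bookkeeping. The ``sk_`` names are hypothetical,
and the CPU clock is passed in already converted to seconds to keep the
arithmetic visible.

.. code:: c

   #include <assert.h>

   typedef struct
   {
     double init_reference_time;
     double last_verify_reference_time;
     double last_verify_cpu_seconds;
   } sk_time_t;

   /*
    * One rate-adjustment step. Returns seconds since start, measured
    * against the corrected initial reference time.
    */
   static double
   sk_verify (sk_time_t *c, double now_reference, double now_cpu_seconds)
   {
     double delta_reference = now_reference - c->last_verify_reference_time;
     double delta_clock = now_cpu_seconds - c->last_verify_cpu_seconds;

     /* Absorb any reference-clock jump into the initial reference time. */
     c->init_reference_time += delta_reference - delta_clock;

     c->last_verify_reference_time = now_reference;
     c->last_verify_cpu_seconds = now_cpu_seconds;
     return now_reference - c->init_reference_time;
   }

With a start time of 1000.0, a normal update at reference time 1015.0
reports 15 seconds elapsed. If the reference clock then jumps back by a
day, so that the next update sees -85385.0 while the CPU clock has
advanced another 15 seconds, the initial reference time moves by -86415.0
and the reported elapsed time is 30 seconds.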
				239
				240	Timebase precision
				241	~~~~~~~~~~~~~~~~~~
				242
				243	Cognoscenti may notice that vlib/clib_time_now(…) return a 64-bit
				244	floating-point value; the number of seconds since vpp started.
				245
				246	Please see `this Wikipedia
				247	article <https://en.wikipedia.org/wiki/Double-precision_floating-point_format>`__
				248	for more information. C double-precision floating point numbers (called
				249	f64 in the vpp code base) have a 53-bit effective mantissa, and can
				250	accurately represent 15 decimal digits’ worth of precision.
				251
				252	There are 315,360,000.000001 seconds in ten years plus one microsecond.
				253	That string has exactly 15 decimal digits. The vpp time base retains 1us
				254	precision for roughly 30 years.
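
The precision claim is easy to check empirically. The sketch below
(standalone C, hypothetical ``sk_`` name) adds one microsecond to a
timestamp and tests whether the increment is still visible to within half
a microsecond.

.. code:: c

   #include <assert.h>

   /* Returns 1 if adding 1us to "seconds" changes it by roughly 1us. */
   static int
   sk_microsecond_survives (double seconds)
   {
     /* volatile forces the sum to be rounded to a 64-bit double */
     volatile double later = seconds + 1e-6;
     double diff = later - seconds;

     return diff > 0.5e-6 && diff < 1.5e-6;
   }

At ten years (315,360,000 seconds) and at thirty years the increment
survives; at a much larger value such as 1e12 seconds the spacing between
adjacent doubles exceeds a microsecond and the increment is lost.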
				255
				256	vlib/clib_time_now do not provide precision in excess of 1e-6 seconds.
				257	If necessary, please use clib_cpu_time_now(…) for direct access to the
				258	CPU clock-cycle counter. Note that the number of CPU clock cycles per
				259	second varies significantly across CPU architectures.
				260
				261	Timer Wheels
				262	------------
				263
Vppinfra includes configurable timer wheel support. See the source code
in …/src/vppinfra/tw_timer_template.[ch], as well as a considerable
number of template instances defined in …/src/vppinfra/tw_timer\_\*.[ch].
				267
				268	Instantiation of tw_timer_template.h generates named structures to
				269	implement specific timer wheel geometries. Choices include: number of
				270	timer wheels (currently, 1 or 2), number of slots per ring (a power of
				271	two), and the number of timers per “object handle”.
				272
				273	Internally, user object/timer handles are 32-bit integers, so if one
				274	selects 16 timers/object (4 bits), the resulting timer wheel handle is
				275	limited to 2**28 objects.
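
The arithmetic behind that limit can be sketched as follows. The field
layout (timer id in the high bits, object index in the low bits) is an
assumption chosen to match the masks used in the expired-timer callback
example in this section; the ``sk_`` names are hypothetical.

.. code:: c

   #include <assert.h>
   #include <stdint.h>

   #define SK_LOG2_TIMERS_PER_OBJECT 4	/* 16 timers per object */
   #define SK_INDEX_BITS (32 - SK_LOG2_TIMERS_PER_OBJECT)
   #define SK_INDEX_MASK (((uint32_t) 1 << SK_INDEX_BITS) - 1)

   /* Pack an object index and a timer id into one 32-bit handle. */
   static uint32_t
   sk_make_handle (uint32_t object_index, uint32_t timer_id)
   {
     return (timer_id << SK_INDEX_BITS) | (object_index & SK_INDEX_MASK);
   }

   static uint32_t
   sk_handle_object_index (uint32_t handle)
   {
     return handle & SK_INDEX_MASK;
   }

   static uint32_t
   sk_handle_timer_id (uint32_t handle)
   {
     return handle >> SK_INDEX_BITS;
   }

With 4 bits reserved for the timer id, 28 bits remain for the object
index, hence at most 2**28 distinct objects.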
				276
				277	Here are the specific settings required to generate a single 2048 slot
				278	wheel which supports 2 timers per object:
				279
.. code:: c

   #define TW_TIMER_WHEELS 1
   #define TW_SLOTS_PER_RING 2048
   #define TW_RING_SHIFT 11
   #define TW_RING_MASK (TW_SLOTS_PER_RING -1)
   #define TW_TIMERS_PER_OBJECT 2
   #define LOG2_TW_TIMERS_PER_OBJECT 1
   #define TW_SUFFIX _2t_1w_2048sl
   #define TW_FAST_WHEEL_BITMAP 0
   #define TW_TIMER_ALLOW_DUPLICATE_STOP 0
				291
				292	See tw_timer_2t_1w_2048sl.h for a complete example.
				293
tw_timer_template.h is not intended to be #included directly. Client
code can include multiple timer geometry header files, although extreme
caution would be required to use the TW and TWT macros in such a case.
				297
				298	API usage examples
				299	~~~~~~~~~~~~~~~~~~
				300
				301	The unit test code in …/src/vppinfra/test_tw_timer.c provides a concrete
				302	API usage example. It uses a synthetic clock to rapidly exercise the
				303	underlying tw_timer_expire_timers(…) template.
				304
				305	There are not many API routines to call.
				306
				307	Initialize a two-timer, single 2048-slot wheel w/ a 1-second timer granularity
				308	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
				309
.. code:: c

   tw_timer_wheel_init_2t_1w_2048sl (&tm->single_wheel,
                                     expired_timer_single_callback,
                                     1.0 /* timer interval */ );
				315
				316	Start a timer
				317	^^^^^^^^^^^^^
				318
.. code:: c

   handle = tw_timer_start_2t_1w_2048sl (&tm->single_wheel, elt_index,
                                         [0 | 1] /* timer id */ ,
                                         expiration_time_in_u32_ticks);
				324
				325	Stop a timer
				326	^^^^^^^^^^^^
				327
.. code:: c

   tw_timer_stop_2t_1w_2048sl (&tm->single_wheel, handle);
				331
				332	An expired timer callback
				333	^^^^^^^^^^^^^^^^^^^^^^^^^
				334
.. code:: c

   static void
   expired_timer_single_callback (u32 * expired_timers)
   {
     int i;
     u32 pool_index, timer_id;
     tw_timer_test_elt_t *e;
     tw_timer_test_main_t *tm = &tw_timer_test_main;

     for (i = 0; i < vec_len (expired_timers); i++)
       {
         /* Low 31 bits: pool index. High bit: timer id. */
         pool_index = expired_timers[i] & 0x7FFFFFFF;
         timer_id = expired_timers[i] >> 31;

         ASSERT (timer_id == 1);

         e = pool_elt_at_index (tm->test_elts, pool_index);

         if (e->expected_to_expire != tm->single_wheel.current_tick)
           {
             fformat (stdout, "[%d] expired at %d not %d\n",
                      e - tm->test_elts, tm->single_wheel.current_tick,
                      e->expected_to_expire);
           }
         pool_put (tm->test_elts, e);
       }
   }
				363
				364	We use wheel timers extensively in the vpp host stack. Each TCP session
				365	needs 5 timers, so supporting 10 million flows requires up to 50 million
				366	concurrent timers.
				367
Timers rarely expire, so it’s of utmost importance that stopping and
restarting a timer costs as few clock cycles as possible.
				370
				371	Stopping a timer costs a doubly-linked list dequeue. Starting a timer
				372	involves modular arithmetic to determine the correct timer wheel and
				373	slot, and a list head enqueue.
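
The cost model described above can be sketched for a single wheel. This
is a standalone illustration with hypothetical ``sk_`` names, using
pointer-linked lists and assuming intervals shorter than one wheel
revolution; it is not the tw_timer_template code.

.. code:: c

   #include <assert.h>
   #include <stdint.h>

   #define SK_SLOTS 2048			/* must be a power of two */
   #define SK_MASK (SK_SLOTS - 1)

   typedef struct sk_timer
   {
     struct sk_timer *next, *prev;
   } sk_timer_t;

   typedef struct
   {
     sk_timer_t slots[SK_SLOTS];	/* circular list heads, one per slot */
     uint32_t current_tick;
   } sk_wheel_t;

   static void
   sk_wheel_init (sk_wheel_t *w)
   {
     for (int i = 0; i < SK_SLOTS; i++)
       w->slots[i].next = w->slots[i].prev = &w->slots[i];
     w->current_tick = 0;
   }

   /* Start: modular arithmetic picks the slot, then a list-head enqueue. */
   static uint32_t
   sk_timer_start (sk_wheel_t *w, sk_timer_t *t, uint32_t interval)
   {
     uint32_t slot = (w->current_tick + interval) & SK_MASK;
     sk_timer_t *head = &w->slots[slot];

     t->next = head->next;
     t->prev = head;
     head->next->prev = t;
     head->next = t;
     return slot;
   }

   /* Stop: a constant-time doubly-linked list dequeue. */
   static void
   sk_timer_stop (sk_timer_t *t)
   {
     t->prev->next = t->next;
     t->next->prev = t->prev;
     t->next = t->prev = t;
   }

   static int
   sk_slot_is_empty (const sk_wheel_t *w, uint32_t slot)
   {
     return w->slots[slot].next == &w->slots[slot];
   }

Neither operation walks a list or touches any timer other than its
immediate neighbours, which is what keeps stop and restart cheap.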
				374
Expired timer processing generally involves bulk linked-list retirement
with user callback presentation. There is some additional complexity at
wheel wrap time, to relocate timers from slower-turning timer wheels into
faster-turning wheels.
				379
				380	Format
				381	------
				382
				383	Vppinfra format is roughly equivalent to printf.
				384
				385	Format has a few properties worth mentioning. Format’s first argument is
				386	a (u8 \*) vector to which it appends the result of the current format
				387	operation. Chaining calls is very easy:
				388
.. code:: c

   u8 * result;

   result = format (0, "junk = %d, ", junk);
   result = format (result, "more junk = %d\n", more_junk);
				395
As previously noted, NULL pointers are perfectly proper 0-length
vectors. Format returns a (u8 \*) vector, **not** a C-string. If you
wish to print a (u8 \*) vector, use the “%v” format string. If you need
a (u8 \*) vector which is also a proper C-string, either of these
schemes may be used:
				401
.. code:: c

   vec_add1 (result, 0);
   /* or */
   result = format (result, "<whatever>%c", 0);
				407
				408	Remember to vec_free() the result if appropriate. Be careful not to pass
				409	format an uninitialized (u8 \*).
				410
				411	Format implements a particularly handy user-format scheme via the “%U”
				412	format specification. For example:
				413
.. code:: c

   u8 * format_junk (u8 * s, va_list *va)
   {
     /* Fetch the argument with the same type the caller passed */
     char *junk = va_arg (*va, char *);
     s = format (s, "%s", junk);
     return s;
   }

   result = format (0, "junk = %U", format_junk, "This is some junk");
				424
				425	format_junk() can invoke other user-format functions if desired. The
				426	programmer shoulders responsibility for argument type-checking. It is
				427	typical for user format functions to blow up spectacularly if the
				428	va_arg(va, type) macros don’t match the caller’s idea of reality.
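
The "%U" mechanism is easiest to understand from a toy formatter. The
sketch below is standalone C and not vppinfra code: the ``sk_`` names are
hypothetical, it handles only "%d" and "%U", and unlike format it builds
NUL-terminated C-strings with ``realloc``. What it does share with the
scheme described above is the calling convention: the user function
receives the output so far plus a pointer to the caller's ``va_list`` and
pulls its own arguments from it.

.. code:: c

   #include <assert.h>
   #include <stdarg.h>
   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>

   typedef char *(*sk_format_fn_t) (char *s, va_list *va);

   /* Append n bytes of text to the heap string s (s may be NULL). */
   static char *
   sk_append (char *s, const char *text, size_t n)
   {
     size_t old = s ? strlen (s) : 0;
     char *r = realloc (s, old + n + 1);

     if (!r)
       abort ();
     memcpy (r + old, text, n);
     r[old + n] = 0;
     return r;
   }

   /* Toy formatter: appends to s, understands only %d and %U. */
   static char *
   sk_format (char *s, const char *fmt, ...)
   {
     va_list va;

     va_start (va, fmt);
     for (const char *p = fmt; *p; p++)
       {
         if (p[0] == '%' && p[1] == 'U')
           {
             /* Next argument is the user function; it consumes the rest. */
             sk_format_fn_t fn = va_arg (va, sk_format_fn_t);
             s = fn (s, &va);
             p++;
           }
         else if (p[0] == '%' && p[1] == 'd')
           {
             char tmp[32];
             int n = snprintf (tmp, sizeof (tmp), "%d", va_arg (va, int));
             s = sk_append (s, tmp, (size_t) n);
             p++;
           }
         else
           s = sk_append (s, p, 1);
       }
     va_end (va);
     return s;
   }

   /* A user format function: consumes two ints from the caller's list. */
   static char *
   sk_format_pair (char *s, va_list *va)
   {
     int a = va_arg (*va, int);
     int b = va_arg (*va, int);

     return sk_format (s, "(%d, %d)", a, b);
   }

Nothing checks that the two values following ``sk_format_pair`` in a call
really are ints; as with the real "%U", a mismatch between the call site
and the ``va_arg`` types is undefined behaviour.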
				429
				430	Unformat
				431	--------
				432
				433	Vppinfra unformat is vaguely related to scanf, but considerably more
				434	general.
				435
				436	A typical use case involves initializing an unformat_input_t from either
				437	a C-string or a (u8 \*) vector, then parsing via unformat() as follows:
				438
.. code:: c

   unformat_input_t input;
   u8 *s = (u8 *) "<some-C-string>";

   unformat_init_string (&input, (char *) s, strlen ((char *) s));
   /* or */
   unformat_init_vector (&input, <u8-vector>);
				447
				448	Then loop parsing individual elements:
				449
.. code:: c

   while (unformat_check_input (&input) != UNFORMAT_END_OF_INPUT)
     {
       if (unformat (&input, "value1 %d", &value1))
         ;/* unformat sets value1 */
       else if (unformat (&input, "value2 %d", &value2))
         ;/* unformat sets value2 */
       else
         return clib_error_return (0, "unknown input '%U'",
                                   format_unformat_error, &input);
     }
				462
As with format, unformat implements a user-unformat function capability
via a “%U” user unformat function scheme. Generally, one can trivially
transform ``format (s, "foo %d", foo)`` into
``unformat (input, "foo %d", &foo)``.
				467
				468	Unformat implements a couple of handy non-scanf-like format specifiers:
				469
.. code:: c

   unformat (input, "enable %=", &enable, 1 /* defaults to 1 */);
   unformat (input, "bitzero %|", &mask, (1<<0));
   unformat (input, "bitone %|", &mask, (1<<1));
   /* etc. */
				476
				477	The phrase “enable %=” means “set the supplied variable to the default
				478	value” if unformat parses the “enable” keyword all by itself. If
				479	unformat parses “enable 123” set the supplied variable to 123.
				480
				481	We could clean up a number of hand-rolled “verbose” + “verbose %d”
				482	argument parsing codes using “%=”.
				483
The phrase “bitzero %|” means “set the specified bit in the supplied
bitmask” if unformat parses “bitzero”. Although it looks like it could
be fairly handy, it’s very lightly used in the code base.
				487
				488	``%_`` toggles whether or not to skip input white space.
				489
				490	For transition from skip to no-skip in middle of format string, skip
				491	input white space. For example, the following:
				492
.. code:: c

   const char *fmt = "%_%d.%d%_->%_%d.%d%_";
   unformat (input, fmt, &one, &two, &three, &four);
				497
				498	matches input “1.2 -> 3.4”. Without this, the space after -> does not
				499	get skipped.
				500
				501
				502	How to parse a single input line
				503	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				504
				505	Debug CLI command functions MUST NOT accidentally consume input
				506	belonging to other debug CLI commands. Otherwise, it's impossible to
				507	script a set of debug CLI commands which "work fine" when issued one
				508	at a time.
				509
				510	This bit of code is NOT correct:
				511
.. code:: c

   /* Eats script input NOT belonging to it, and chokes! */
   while (unformat_check_input (input) != UNFORMAT_END_OF_INPUT)
     {
       if (unformat (input, ...))
         ;
       else if (unformat (input, ...))
         ;
       else
         return clib_error_return (0, "parse error: '%U'",
                                   format_unformat_error, input);
     }
				526
				527	When executed as part of a script, such a function will return “parse
				528	error: ‘’” every time, unless it happens to be the last command in the
				529	script.
				530
				531	Instead, use “unformat_line_input” to consume the rest of a line’s worth
				532	of input - everything past the path specified in the VLIB_CLI_COMMAND
				533	declaration.
				534
				535	For example, unformat_line_input with “my_command” set up as shown below
				536	and user input “my path is clear” will produce an unformat_input_t that
				537	contains “is clear”.
				538
.. code:: c

   VLIB_CLI_COMMAND (...) = {
     .path = "my path",
   };
				544
				545	Here’s a bit of code which shows the required mechanics, in full:
				546
.. code:: c

   static clib_error_t *
   my_command_fn (vlib_main_t * vm,
                  unformat_input_t * input,
                  vlib_cli_command_t * cmd)
   {
     unformat_input_t _line_input, *line_input = &_line_input;
     u32 this, that;
     clib_error_t *error = 0;

     /* Consume the rest of this command's line; nothing to do if empty */
     if (!unformat_user (input, unformat_line_input, line_input))
       return 0;

     /*
      * Here, UNFORMAT_END_OF_INPUT is at the end of the line we consumed,
      * not at the end of the script...
      */
     while (unformat_check_input (line_input) != UNFORMAT_END_OF_INPUT)
       {
         if (unformat (line_input, "this %u", &this))
           ;
         else if (unformat (line_input, "that %u", &that))
           ;
         else
           {
             error = clib_error_return (0, "parse error: '%U'",
                                        format_unformat_error, line_input);
             goto done;
           }
       }

     /* <do something based on "this" and "that", etc.> */

   done:
     unformat_free (line_input);
     return error;
   }

   VLIB_CLI_COMMAND (my_command, static) = {
     .path = "my path",
     .function = my_command_fn,
   };
				589
				590	Vppinfra errors and warnings
				591	----------------------------
				592
Many functions within the vpp dataplane have return values of type
``clib_error_t *``. A clib_error_t is an arbitrary string with a bit of
metadata [fatal, warning] and is easy to announce. Returning a NULL
``clib_error_t *`` indicates “A-OK, no error.”

``clib_warning(format-args)`` is a handy way to add debugging output; clib
warnings prepend function:line info to unambiguously locate the message
source. ``clib_unix_warning()`` adds perror()-style Linux system-call
information. In production images, clib_warnings result in syslog
entries.
				603
				604	Serialization
				605	-------------
				606
				607	Vppinfra serialization support allows the programmer to easily serialize
				608	and unserialize complex data structures.
				609
				610	The underlying primitive serialize/unserialize functions use network
				611	byte-order, so there are no structural issues serializing on a
				612	little-endian host and unserializing on a big-endian host.
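
The byte-order point can be illustrated with a standalone sketch for a
single 32-bit value. The ``sk_`` helper names are hypothetical and are
not the vppinfra serialize API; the sketch only shows why writing bytes
in a fixed (network, i.e. big-endian) order makes the encoding
independent of host endianness.

.. code:: c

   #include <assert.h>
   #include <stdint.h>

   /* Write x most-significant byte first, regardless of host byte order. */
   static void
   sk_serialize_u32 (uint8_t *out, uint32_t x)
   {
     out[0] = (uint8_t) (x >> 24);
     out[1] = (uint8_t) (x >> 16);
     out[2] = (uint8_t) (x >> 8);
     out[3] = (uint8_t) x;
   }

   /* Rebuild the value from bytes; shifts are endian-neutral. */
   static uint32_t
   sk_unserialize_u32 (const uint8_t *in)
   {
     return ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16)
       | ((uint32_t) in[2] << 8) | (uint32_t) in[3];
   }

Because both directions use shifts rather than copying the in-memory
representation, the same four bytes decode to the same value on any host.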