Bounded-index Extensible Hashing (bihash)
=========================================

Vpp uses bounded-index extensible hashing to solve a variety of
exact-match (key, value) lookup problems. Benefits of the current
implementation:

- Very high record count scaling, tested to 100,000,000 records
- Lookup performance degrades gracefully as the number of records
  increases
- No reader locking required
- Template implementation; it’s easy to support arbitrary (key,value)
  types

Bounded-index extensible hashing has been widely used in databases for
decades.

Bihash uses a two-level data structure:

::

    +-----------------+
    | bucket-0        |
    |  log2_size      |
    |  backing store  |
    +-----------------+
    | bucket-1        |
    |  log2_size      |           +--------------------------------+
    |  backing store  | --------> | KVP_PER_PAGE * key-value-pairs |
    +-----------------+           |  page 0                        |
          ...                     +--------------------------------+
    +-----------------+           | KVP_PER_PAGE * key-value-pairs |
    | bucket-2**N-1   |           |  page 1                        |
    |  log2_size      |           +--------------------------------+
    |  backing store  |                          ---
    +-----------------+           +--------------------------------+
                                  | KVP_PER_PAGE * key-value-pairs |
                                  |  page 2**log2_size - 1         |
                                  +--------------------------------+

Discussion of the algorithm
---------------------------

This structure has a couple of major advantages. In practice, each
bucket entry fits into a 64-bit integer. Coincidentally, vpp’s target
CPU architectures support 64-bit atomic operations. When modifying the
contents of a specific bucket, we do the following:

- Make a working copy of the bucket’s backing storage
- Atomically swap a pointer to the working copy into the bucket array
- Change the original backing store data
- Atomically swap back to the original

So, no reader locking is required to search a bihash table.

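
The sequence above can be sketched conceptually as follows. This is
*not* the actual vppinfra code; the type, field layout, and function
names here are invented for illustration (and the u64 typedef comes
from vppinfra). The point is that the whole bucket header fits in a
single 64-bit word, so each step is one atomic store which readers
observe either entirely before or entirely after.

.. code:: c

   /* Conceptual sketch only -- not the vppinfra implementation. */
   typedef union
   {
     struct
     {
       u64 offset : 37;     /* illustrative field layout */
       u64 log2_pages : 8;
       u64 unused : 19;
     };
     u64 as_u64;            /* the whole bucket header in one word */
   } my_bucket_t;

   static void
   my_update_bucket (my_bucket_t * b, u64 working_copy_offset)
   {
     my_bucket_t original, working;

     original.as_u64 = b->as_u64;          /* snapshot the bucket */
     working = original;
     working.offset = working_copy_offset; /* point at the working copy */

     /* Readers now follow the working copy... */
     __atomic_store_n (&b->as_u64, working.as_u64, __ATOMIC_RELEASE);

     /* ...while we modify the original backing store data here... */

     /* ...and finally swap back to the (now updated) original. */
     __atomic_store_n (&b->as_u64, original.as_u64, __ATOMIC_RELEASE);
   }
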
At lookup time, the implementation computes a key hash code. We use the
least-significant N bits of the hash to select the bucket.

With the bucket in hand, we learn log2 (nBackingPages) for the selected
bucket. At this point, we use the next log2_size bits from the hash code
to select the specific backing page in which the (key,value) pair will
be found.

Net result: we search **one** backing page, not 2**log2_size pages. This
is a key property of the algorithm.
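
As a conceptual sketch only (again, not the vppinfra implementation;
every name below is illustrative), the selection logic described above
amounts to:

.. code:: c

   /* Conceptual sketch of bucket and page selection. nbuckets is a
      power of two, 2**log2_nbuckets. */
   u64 hash = compute_hash (key);

   /* The least-significant N bits select the bucket... */
   u32 bucket_index = hash & (nbuckets - 1);
   my_bucket_t *b = &buckets[bucket_index];

   /* ...and the next log2_size bits select the one backing page
      which can contain the (key,value) pair. */
   u64 page_hash = hash >> log2_nbuckets;
   u32 page_index = page_hash & ((1 << b->log2_pages) - 1);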

When sufficient collisions occur to fill the backing pages for a given
bucket, we double the bucket size, rehash, and deal the bucket contents
into a double-sized set of backing pages. In the future, we may
represent the size as a linear combination of two powers-of-two, to
increase space efficiency.

To solve the “jackpot case” where a set of records collide under hashing
in a bad way, the implementation will fall back to linear search across
2**log2_size backing pages on a per-bucket basis.

To maintain *space* efficiency, we should configure the bucket array so
that backing pages are effectively utilized. Lookup performance tends to
change *very little* if the bucket array is too small or too large.

Bihash depends on selecting an effective hash function. If one were to
use a truly broken hash function such as “return 1ULL”, bihash would
still work, but it would be equivalent to a poorly-programmed linear
search.

We often use cpu intrinsic functions - think crc32 - to rapidly compute
a hash code which has decent statistics.
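
For example, on x86 a hash along these lines is cheap to compute. This
is an illustration only, assuming SSE4.2 support and vppinfra’s u64
typedef; the shipped templates define their own hash functions.

.. code:: c

   #include <x86intrin.h>

   /* Illustration: hash an 8-byte key with the SSE4.2 crc32
      instruction. Decent statistics, and only a few cycles. */
   static inline u64
   my_hash_u64 (u64 key)
   {
     return _mm_crc32_u64 (0xffffffffULL, key);
   }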

Bihash Cookbook
---------------

Using current (key,value) template instance types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It’s quite easy to use one of the template instance types. As of this
writing, …/src/vppinfra provides pre-built templates for 8, 16, 20, 24,
40, and 48 byte keys, u8 \* vector keys, and 8 byte values.

See …/src/vppinfra/{bihash\_<key-size>\_8}.h

To define the data types, #include a specific template instance, most
often in a subsystem header file:

.. code:: c

   #include <vppinfra/bihash_8_8.h>

If you’re building a standalone application, you’ll need to define the
various functions by #including the method implementation file in a C
source file.

The core vpp engine currently uses most if not all of the known bihash
types, so you probably won’t need to #include the method implementation
file.

.. code:: c

   #include <vppinfra/bihash_template.c>

Add an instance of the selected bihash data structure to e.g. a “main_t”
structure:

.. code:: c

   typedef struct
   {
     ...
     BVT (clib_bihash) hash_table;
     /* or */
     clib_bihash_8_8_t hash_table;
     ...
   } my_main_t;

The BV macro concatenates its argument with the value of the preprocessor
symbol BIHASH_TYPE. The BVT macro concatenates its argument with the
value of BIHASH_TYPE and the fixed string “_t”. So in the above example,
BVT (clib_bihash) generates “clib_bihash_8_8_t”.

If you’re sure you won’t decide to change the template / type name
later, it’s perfectly OK to code “clib_bihash_8_8_t” and so forth.

In fact, if you #include multiple template instances in a single source
file, you **must** use fully-enumerated type names. The macros stand no
chance of working.

Initializing a bihash table
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Call the init function as shown. As a rough guide, pick a number of
buckets which is approximately
number_of_expected_records/BIHASH_KVP_PER_PAGE from the relevant
template instance header file. See the previous discussion.

The amount of memory selected should easily contain all of the records,
with a generous allowance for hash collisions. Bihash memory is
allocated separately from the main heap, and won’t cost anything except
kernel PTEs until touched, so it’s OK to be reasonably generous.

For example:

.. code:: c

   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;

   h = &mm->hash_table;

   clib_bihash_init_8_8 (h, "test", (u32) number_of_buckets,
                         (uword) memory_size);

Add or delete a key/value pair
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use BV(clib_bihash_add_del), or the explicit type variant:

.. code:: c

   clib_bihash_kv_8_8_t kv;
   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;

   h = &mm->hash_table;
   kv.key = key_to_add_or_delete;
   kv.value = value_to_add_or_delete;

   clib_bihash_add_del_8_8 (h, &kv, is_add /* 1=add, 0=delete */);

In the delete case, kv.value is irrelevant. To change the value
associated with an existing (key,value) pair, simply re-add the [new]
pair.

Simple search
~~~~~~~~~~~~~

The simplest possible (key, value) search goes like so:

.. code:: c

   clib_bihash_kv_8_8_t search_kv, return_kv;
   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;

   h = &mm->hash_table;
   search_kv.key = key_to_search_for;

   if (clib_bihash_search_8_8 (h, &search_kv, &return_kv) < 0)
     key_not_found();
   else
     key_found();

Note that it’s perfectly fine to collect the lookup result in the
search (key,value) object:

.. code:: c

   if (clib_bihash_search_8_8 (h, &search_kv, &search_kv))
     key_not_found();
   /* etc. */

Bihash vector processing
~~~~~~~~~~~~~~~~~~~~~~~~

When processing a vector of packets which need a certain lookup
performed, it’s worth the trouble to compute the key hash, and prefetch
the correct bucket ahead of time.

Here’s a sketch of one way to write the required code:

Dual-loop:

- 6 packets ahead, prefetch 2x vlib_buffer_t’s and 2x packet data
  required to form the record keys
- 4 packets ahead, form 2x record keys and call BV(clib_bihash_hash) or
  the explicit hash function to calculate the record hashes. Call 2x
  BV(clib_bihash_prefetch_bucket) to prefetch the buckets
- 2 packets ahead, call 2x BV(clib_bihash_prefetch_data) to prefetch 2x
  (key,value) data pages
- In the processing section, call 2x
  BV(clib_bihash_search_inline_with_hash) to perform the search

Programmer’s choice whether to stash the hash code somewhere in
vnet_buffer(b) metadata, or to use local variables.
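
As an illustration only, here is a much-simplified sketch for an 8_8
table. It runs the stages back-to-back for a single pair of packets
rather than staggering them across the pipeline as described above;
my_make_key() is a placeholder for whatever code builds a search key
from packet data, and h, b0 and b1 (the table and two packet buffers)
are assumed to be in scope.

.. code:: c

   clib_bihash_kv_8_8_t kv0, kv1;
   u64 hash0, hash1;
   int rv0, rv1;

   /* Build the two search keys; my_make_key() is a placeholder. */
   my_make_key (&kv0, b0);
   my_make_key (&kv1, b1);

   /* Compute the record hashes once, and prefetch the buckets... */
   hash0 = clib_bihash_hash_8_8 (&kv0);
   hash1 = clib_bihash_hash_8_8 (&kv1);
   clib_bihash_prefetch_bucket_8_8 (h, hash0);
   clib_bihash_prefetch_bucket_8_8 (h, hash1);

   /* ...then prefetch the (key,value) data pages... */
   clib_bihash_prefetch_data_8_8 (h, hash0);
   clib_bihash_prefetch_data_8_8 (h, hash1);

   /* ...and finally search, reusing the hashes computed above.
      rv < 0 means not found; on success the kv is updated with the
      stored (key,value) pair. */
   rv0 = clib_bihash_search_inline_with_hash_8_8 (h, hash0, &kv0);
   rv1 = clib_bihash_search_inline_with_hash_8_8 (h, hash1, &kv1);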

Single-loop:

- Use simple search as shown above.

Walking a bihash table
~~~~~~~~~~~~~~~~~~~~~~

A fairly common scenario to build “show” commands involves walking a
bihash table. It’s simple enough:

.. code:: c

   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;
   void callback_fn (clib_bihash_kv_8_8_t *, void *);

   h = &mm->hash_table;

   BV(clib_bihash_foreach_key_value_pair) (h, callback_fn, (void *) arg);

To nobody’s great surprise, clib_bihash_foreach_key_value_pair iterates
across the entire table, calling callback_fn with active entries.
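
For illustration, a minimal callback matching the signature declared
above might look like the following; my_walk_ctx_t and the counting
logic are invented for the example.

.. code:: c

   typedef struct
   {
     u64 active_entries;   /* hypothetical per-walk context */
   } my_walk_ctx_t;

   static void
   callback_fn (clib_bihash_kv_8_8_t * kvp, void *arg)
   {
     my_walk_ctx_t *ctx = arg;

     /* Inspect kvp->key / kvp->value as required; here we just count. */
     ctx->active_entries += 1;
   }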

Bihash table iteration safety
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The iterator template “clib_bihash_foreach_key_value_pair” must be used
with a certain amount of care. For one thing, the iterator template does
*not* take the bihash hash table writer lock. If your use-case requires
it, lock the table.

For another, the iterator template is not safe under all conditions:

- It’s **OK to delete** bihash table entries during a table-walk. The
  iterator checks whether the current bucket has been freed after each
  *callback_fn(…)* invocation.

- It is **not OK to add** entries during a table-walk.

The add-during-walk case involves a jackpot: while processing a
key-value-pair in a particular bucket, add a certain number of entries.
By bad luck, assume that one or more of the added entries causes the
**current bucket** to split-and-rehash.

Since we rehash KVPs to different pages based on what amounts to a
different hash function, either of these things can go wrong:

- We may revisit previously-visited entries. Depending on how one coded
  the use-case, we could end up in a recursive-add situation.

- We may skip entries that have not been visited.

One could build an add-safe iterator, at a significant cost in
performance: copy the entire bucket, and walk the copy.

It’s hard to imagine a worthwhile add-during-walk use-case in the first
place, let alone one which couldn’t be implemented by walking the table
without modifying it, then adding a set of records.

Creating a new template instance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Creating a new template is easy. Use one of the existing templates as a
model, and make the obvious changes. The hash and key_compare methods
are performance-critical in multiple senses.

If the key compare method is slow, every lookup will be slow. If the
hash function is slow, same story. If the hash function has poor
statistical properties, space efficiency will suffer. In the limit, a
bad enough hash function will cause large portions of the table to
revert to linear search.

Use of the best available vector unit is well worth the trouble in the
hash and key_compare functions.
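
For instance, a 16-byte key compare can be written branch-free with SSE
instructions. The sketch below is an illustration only (x86-specific,
SSE4.1, assuming vppinfra’s u8 typedef), not code taken from an existing
template.

.. code:: c

   #include <x86intrin.h>

   /* Illustration: compare two 16-byte keys with SSE. Returns
      non-zero when the keys are equal. */
   static inline int
   my_key_compare_16 (const u8 * a, const u8 * b)
   {
     __m128i xa = _mm_loadu_si128 ((const __m128i *) a);
     __m128i xb = _mm_loadu_si128 ((const __m128i *) b);
     __m128i diff = _mm_xor_si128 (xa, xb);

     /* 1 iff every bit of diff is zero, i.e. the keys match. */
     return _mm_test_all_zeros (diff, diff);
   }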