Bounded-index Extensible Hashing (bihash)
=========================================
Vpp uses bounded-index extensible hashing to solve a variety of
exact-match (key, value) lookup problems. Benefits of the current
implementation:
- Very high record count scaling, tested to 100,000,000 records.
- Lookup performance degrades gracefully as the number of records
increases
- No reader locking required
- Template implementation; it's easy to support arbitrary (key,value)
  types
Bounded-index extensible hashing has been widely used in databases for
decades.
Bihash uses a two-level data structure:
::
+-----------------+
| bucket-0 |
| log2_size |
| backing store |
+-----------------+
| bucket-1 |
| log2_size | +--------------------------------+
| backing store | --------> | KVP_PER_PAGE * key-value-pairs |
+-----------------+ | page 0 |
... +--------------------------------+
+-----------------+ | KVP_PER_PAGE * key-value-pairs |
| bucket-2**N-1 | | page 1 |
| log2_size | +--------------------------------+
| backing store | ---
+-----------------+ +--------------------------------+
| KVP_PER_PAGE * key-value-pairs |
| page 2**(log2(size)) - 1 |
+--------------------------------+
Discussion of the algorithm
---------------------------
This structure has a couple of major advantages. In practice, each
bucket entry fits into a 64-bit integer. Coincidentally, VPP's target
CPU architectures support 64-bit atomic operations. When modifying the
contents of a specific bucket, we do the following:
- Make a working copy of the bucket's backing storage
- Atomically swap a pointer to the working copy into the bucket array
- Change the original backing store data
- Atomically swap back to the original
So, no reader locking is required to search a bihash table.
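The following is a minimal, hypothetical sketch of that writer-side
discipline, using GCC atomic builtins and an invented bucket type rather
than the actual bihash internals:

.. code:: c

   /* Illustration only: not the real bihash bucket layout. */
   typedef struct
   {
     void *backing;              /* what readers dereference */
   } example_bucket_t;

   static void
   example_update_bucket (example_bucket_t *b, void *working_copy)
   {
     /* 1. caller has populated working_copy from b->backing */
     /* 2. atomically point readers at the working copy */
     void *original = __atomic_exchange_n (&b->backing, working_copy,
                                           __ATOMIC_RELEASE);
     /* 3. ... modify the original backing store here ... */
     /* 4. atomically swap the (now updated) original back in */
     __atomic_store_n (&b->backing, original, __ATOMIC_RELEASE);
   }
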
At lookup time, the implementation computes a key hash code. We use the
least-significant N bits of the hash to select the bucket.
With the bucket in hand, we learn log2 (nBackingPages) for the selected
bucket. At this point, we use the next log2_size bits from the hash code
to select the specific backing page in which the (key,value) pair will
be found.
Net result: we search **one** backing page, not 2**log2_size pages. This
is a key property of the algorithm.
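In sketch form, with hypothetical names (sketch_bucket_t, log2_nbuckets,
and log2_pages are stand-ins, not the real bihash field names), the two
index computations look like this:

.. code:: c

   #include <stdint.h>

   /* Illustration only: the low bits of the hash pick the bucket, and
      the next log2_pages bits pick the single backing page to search. */
   typedef struct
   {
     uint8_t log2_pages;         /* log2 (number of backing pages) */
     /* ... backing-store pointer, etc. ... */
   } sketch_bucket_t;

   static inline uint32_t
   sketch_page_index (sketch_bucket_t *buckets, uint32_t log2_nbuckets,
                      uint64_t hash, uint32_t *bucket_index)
   {
     /* least-significant log2_nbuckets bits select the bucket... */
     *bucket_index = hash & ((1ULL << log2_nbuckets) - 1);
     /* ...and the next log2_pages bits select the page within it */
     return (hash >> log2_nbuckets) &
            ((1u << buckets[*bucket_index].log2_pages) - 1);
   }

The selected page is then scanned linearly for an exact key match, over
at most KVP_PER_PAGE slots.
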
When sufficient collisions occur to fill the backing pages for a given
bucket, we double the bucket size, rehash, and deal the bucket contents
into a double-sized set of backing pages. In the future, we may
represent the size as a linear combination of two powers-of-two, to
increase space efficiency.
To solve the jackpot case where a set of records collide under hashing
in a bad way, the implementation will fall back to linear search across
2**log2_size backing pages on a per-bucket basis.
To maintain *space* efficiency, we should configure the bucket array so
that backing pages are effectively utilized. Lookup performance tends to
change *very little* if the bucket array is too small or too large.
Bihash depends on selecting an effective hash function. If one were to
use a truly broken hash function such as "return 1ULL;", bihash would
still work, but it would be equivalent to poorly-programmed linear
search.
We often use CPU intrinsic functions - think crc32 - to rapidly compute
a hash code which has decent statistics.
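For example, on x86 a statistically reasonable 8-byte key hash can be
built on the SSE4.2 CRC32 intrinsic; this is an illustration compiled
with -msse4.2, not the bihash-internal hash code:

.. code:: c

   #include <nmmintrin.h>        /* SSE4.2 CRC32 intrinsics */
   #include <stdint.h>

   /* Illustration only: hash an 8-byte key with the hardware CRC32-C
      instruction - a few cycles, decent statistics.                  */
   static inline uint64_t
   example_hash_8 (uint64_t key)
   {
     return _mm_crc32_u64 (0, key);
   }
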
Bihash Cookbook
---------------
Using current (key,value) template instance types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It's quite easy to use one of the template instance types. As of this
writing, …/src/vppinfra provides pre-built templates for 8, 16, 20, 24,
40, and 48 byte keys, u8 \* vector keys, and 8 byte values.
See …/src/vppinfra/{bihash\_\*\_8}.h
To define the data types, #include a specific template instance, most
often in a subsystem header file:
.. code:: c
#include <vppinfra/bihash_8_8.h>
If you're building a standalone application, you'll need to define the
various functions by #including the method implementation file in a C
source file.

The core vpp engine currently uses most if not all of the known bihash
types, so you probably won't need to #include the method implementation
file.
.. code:: c
#include <vppinfra/bihash_template.c>
Add an instance of the selected bihash data structure to, e.g., a main_t
structure:
.. code:: c
typedef struct
{
...
BVT (clib_bihash) hash_table;
or
clib_bihash_8_8_t hash_table;
...
} my_main_t;
The BV macro concatenates its argument with the value of the
preprocessor symbol BIHASH_TYPE. The BVT macro concatenates its argument
with the value of BIHASH_TYPE and the fixed-string "_t". So in the above
example, BVT (clib_bihash) generates "clib_bihash_8_8_t".
If you're sure you won't decide to change the template / type name
later, it's perfectly OK to code clib_bihash_8_8_t and so forth.
In fact, if you #include multiple template instances in a single source
file, you **must** use fully-enumerated type names. The macros stand no
chance of working.
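For example, with bihash_8_8.h included and given an initialized table h
and a key-value pair kv (as in the snippets below), the macro form and
the fully-enumerated form name the same function:

.. code:: c

   /* These two calls are equivalent when BIHASH_TYPE is _8_8 */
   BV (clib_bihash_add_del) (h, &kv, 1 /* is_add */);
   clib_bihash_add_del_8_8 (h, &kv, 1 /* is_add */);
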
Initializing a bihash table
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Call the init function as shown. As a rough guide, pick a number of
buckets which is approximately
number_of_expected_records/BIHASH_KVP_PER_PAGE from the relevant
template instance header-file. See previous discussion.
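As a rough worked example, if BIHASH_KVP_PER_PAGE is 4 in the chosen
template instance, a table expected to hold roughly 1,000,000 records
calls for about 1,000,000 / 4 = 250,000 buckets.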
The amount of memory selected should easily contain all of the records,
with a generous allowance for hash collisions. Bihash memory is
allocated separately from the main heap, and won't cost anything except
kernel PTEs until touched, so it's OK to be reasonably generous.
For example:
.. code:: c
my_main_t *mm = &my_main;
clib_bihash_8_8_t *h;
h = &mm->hash_table;
clib_bihash_init_8_8 (h, "test", (u32) number_of_buckets,
(uword) memory_size);
Add or delete a key/value pair
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Use BV(clib_bihash_add_del), or the explicit type variant:
.. code:: c
   clib_bihash_kv_8_8_t kv;
   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;
   h = &mm->hash_table;
kv.key = key_to_add_or_delete;
kv.value = value_to_add_or_delete;
clib_bihash_add_del_8_8 (h, &kv, is_add /* 1=add, 0=delete */);
In the delete case, kv.value is irrelevant. To change the value
associated with an existing (key,value) pair, simply re-add the [new]
pair.
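For instance, reusing h and kv from the snippet above, overwriting the
value bound to an existing key is just another add (new_value here is a
hypothetical replacement value):

.. code:: c

   kv.key = key_to_add_or_delete;
   kv.value = new_value;
   clib_bihash_add_del_8_8 (h, &kv, 1 /* is_add: overwrites the value */);
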
Simple search
~~~~~~~~~~~~~
The simplest possible (key, value) search goes like so:
.. code:: c
   clib_bihash_kv_8_8_t search_kv, return_kv;
   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;
   h = &mm->hash_table;
search_kv.key = key_to_add_or_delete;
if (clib_bihash_search_8_8 (h, &search_kv, &return_kv) < 0)
key_not_found();
else
key_found();
Note that it's perfectly fine to collect the lookup result in the
search (key,value) object itself:
.. code:: c
if (clib_bihash_search_8_8 (h, &search_kv, &search_kv))
key_not_found();
etc.
Bihash vector processing
~~~~~~~~~~~~~~~~~~~~~~~~
When processing a vector of packets which need a certain lookup
performed, it's worth the trouble to compute the key hash, and prefetch
the correct bucket ahead of time.

Here's a sketch of one way to write the required code:
Dual-loop:

- 6 packets ahead, prefetch 2x vlib_buffer_t's and 2x packet data
  required to form the record keys
- 4 packets ahead, form 2x record keys and call BV(clib_bihash_hash) or
  the explicit hash function to calculate the record hashes. Call 2x
  BV(clib_bihash_prefetch_bucket) to prefetch the buckets
- 2 packets ahead, call 2x BV(clib_bihash_prefetch_data) to prefetch 2x
  (key,value) data pages.
- In the processing section, call 2x
  BV(clib_bihash_search_inline_with_hash) to perform the search

Programmer's choice whether to stash the hash code somewhere in
vnet_buffer(b) metadata, or to use local variables.

Single-loop:

- Use simple search as shown above.
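Below is a deliberately simplified, hypothetical sketch of the dual-loop
lookup sequence above: it operates on a plain array of keys instead of a
vlib frame and issues the prefetches inline rather than several packets
ahead, so it shows the API sequence rather than the real pipelining:

.. code:: c

   /* Simplified sketch: hash, prefetch bucket and data, then search
      with the precomputed hash. A real dual-loop issues the prefetches
      several packets ahead of the search, as described above.        */
   static void
   example_lookup_batch (clib_bihash_8_8_t *h, u64 *keys, u64 *results,
                         u32 n_keys)
   {
     u32 i;

     for (i = 0; i < n_keys; i++)
       {
         clib_bihash_kv_8_8_t kv;
         u64 hash;

         kv.key = keys[i];
         hash = clib_bihash_hash_8_8 (&kv);

         clib_bihash_prefetch_bucket_8_8 (h, hash);
         clib_bihash_prefetch_data_8_8 (h, hash);

         if (clib_bihash_search_inline_with_hash_8_8 (h, hash, &kv) < 0)
           results[i] = ~0ULL;   /* not found */
         else
           results[i] = kv.value;
       }
   }
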
Walking a bihash table
~~~~~~~~~~~~~~~~~~~~~~
A fairly common scenario when building show commands involves walking a
bihash table. It's simple enough:
.. code:: c
my_main_t *mm = &my_main;
clib_bihash_8_8_t *h;
void callback_fn (clib_bihash_kv_8_8_t *, void *);
h = &mm->hash_table;
BV(clib_bihash_foreach_key_value_pair) (h, callback_fn, (void *) arg);
To nobody's great surprise: clib_bihash_foreach_key_value_pair iterates
across the entire table, calling callback_fn with active entries.
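For illustration, a minimal callback_fn matching the signature shown
above might simply count the active entries; here arg is assumed to
point at a u64 counter supplied by the caller:

.. code:: c

   /* Hypothetical walk callback: counts active entries. 'arg' is the
      (void *) context pointer passed to the foreach call.            */
   void
   callback_fn (clib_bihash_kv_8_8_t *kvp, void *arg)
   {
     u64 *n_entries = (u64 *) arg;

     /* kvp->key and kvp->value are valid for the callback's duration */
     n_entries[0] += 1;
   }
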
Bihash table iteration safety
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The iterator template clib_bihash_foreach_key_value_pair must be used
with a certain amount of care. For one thing, the iterator template does
*not* take the bihash hash table writer lock. If your use-case requires
it, lock the table.
For another, the iterator template is not safe under all conditions:
- It's **OK to delete** bihash table entries during a table-walk. The
  iterator checks whether the current bucket has been freed after each
  *callback_fn(…)* invocation.
- It is **not OK to add** entries during a table-walk.
The add-during-walk case involves a jackpot: while processing a
key-value-pair in a particular bucket, add a certain number of entries.
By bad luck, assume that one or more of the added entries causes the
**current bucket** to split-and-rehash.
Since we rehash KVPs to different pages based on what amounts to a
different hash function, either of these things can go wrong:
- We may revisit previously-visited entries. Depending on how one coded
the use-case, we could end up in a recursive-add situation.
- We may skip entries that have not been visited
One could build an add-safe iterator, at a significant cost in
performance: copy the entire bucket, and walk the copy.
It's hard to imagine a worthwhile add-during-walk use-case in the first
place, let alone one which couldn't be implemented by walking the table
without modifying it, then adding a set of records.
Creating a new template instance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Creating a new template is easy. Use one of the existing templates as a
model, and make the obvious changes. The hash and key_compare methods
are performance-critical in multiple senses.
If the key compare method is slow, every lookup will be slow. If the
hash function is slow, same story. If the hash function has poor
statistical properties, space efficiency will suffer. In the limit, a
bad enough hash function will cause large portions of the table to
revert to linear search.
Use of the best available vector unit is well worth the trouble in the
hash and key_compare functions.
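As an illustration of the kind of code involved - not the actual
template internals - a 16-byte exact-match key compare might use SSE2
intrinsics along these lines:

.. code:: c

   #include <emmintrin.h>        /* SSE2 */
   #include <stdint.h>

   /* Hypothetical 16-byte key compare: one vector load per key, a
      byte-wise compare, and a movemask test for "all 16 bytes equal" */
   static inline int
   example_key_compare_16 (const uint8_t *a, const uint8_t *b)
   {
     __m128i va = _mm_loadu_si128 ((const __m128i *) a);
     __m128i vb = _mm_loadu_si128 ((const __m128i *) b);

     return _mm_movemask_epi8 (_mm_cmpeq_epi8 (va, vb)) == 0xffff;
   }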