Bounded-index Extensible Hashing (bihash)
=========================================

Vpp uses bounded-index extensible hashing to solve a variety of
exact-match (key, value) lookup problems. Benefits of the current
implementation:

- Very high record count scaling, tested to 100,000,000 records.
- Lookup performance degrades gracefully as the number of records
  increases
- No reader locking required
- Template implementation; it's easy to support arbitrary (key,value)
  types

Bounded-index extensible hashing has been widely used in databases for
decades.

Bihash uses a two-level data structure:

::

       +-----------------+
       | bucket-0        |
       |  log2_size      |
       |  backing store  |
       +-----------------+
       | bucket-1        |
       |  log2_size      |           +--------------------------------+
       |  backing store  | --------> | KVP_PER_PAGE * key-value-pairs |
       +-----------------+           |  page 0                        |
            ...                      +--------------------------------+
       +-----------------+           | KVP_PER_PAGE * key-value-pairs |
       | bucket-2**N-1   |           |  page 1                        |
       |  log2_size      |           +--------------------------------+
       |  backing store  |                          ---
       +-----------------+           +--------------------------------+
                                     | KVP_PER_PAGE * key-value-pairs |
                                     |  page 2**(log2_size) - 1       |
                                     +--------------------------------+

Discussion of the algorithm
---------------------------

This structure has a couple of major advantages. In practice, each
bucket entry fits into a 64-bit integer. Coincidentally, vpp's target
CPU architectures support 64-bit atomic operations. When modifying the
contents of a specific bucket, we do the following:

- Make a working copy of the bucket's backing storage
- Atomically swap a pointer to the working copy into the bucket array
- Change the original backing store data
- Atomically swap back to the original

So, no reader locking is required to search a bihash table.
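
A minimal sketch of that copy-on-write sequence, written with GCC/Clang
__atomic builtins rather than the actual vppinfra primitives; the
fake_bucket_t layout and the modify_bucket helper are illustrative only,
not the real clib_bihash internals:

.. code:: c

   #include <stdint.h>
   #include <string.h>

   /* Illustrative only: a bucket entry is a single 64-bit word that
      names the current backing store (plus log2_size in the real code). */
   typedef struct { uint64_t as_u64; } fake_bucket_t;

   static void
   modify_bucket (fake_bucket_t * b, void *original_pages,
                  void *working_copy, size_t pages_bytes)
   {
     /* 1. Make a working copy of the bucket's backing storage */
     memcpy (working_copy, original_pages, pages_bytes);

     /* 2. Atomically swap a pointer to the working copy into the bucket */
     __atomic_store_n (&b->as_u64, (uint64_t) (uintptr_t) working_copy,
                       __ATOMIC_RELEASE);

     /* 3. Change the original backing store data */
     /* ... add, delete or rearrange (key,value) pairs here ... */

     /* 4. Atomically swap back to the original */
     __atomic_store_n (&b->as_u64, (uint64_t) (uintptr_t) original_pages,
                       __ATOMIC_RELEASE);
   }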

At lookup time, the implementation computes a key hash code. We use the
least-significant N bits of the hash to select the bucket.

With the bucket in hand, we learn log2 (nBackingPages) for the selected
bucket. At this point, we use the next log2_size bits from the hash code
to select the specific backing page in which the (key,value) pair will
be found.

Net result: we search **one** backing page, not 2**log2_size pages. This
is a key property of the algorithm.
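
Expressed as a small C sketch (the split_hash helper and its parameter
names are illustrative, not the actual vppinfra field names):

.. code:: c

   #include <stdint.h>

   /* Illustrative only: split a 64-bit hash code into a bucket index
      and a backing-page index, as described above. */
   static inline void
   split_hash (uint64_t hash, uint32_t log2_nbuckets, uint32_t log2_size,
               uint32_t * bucket_index, uint32_t * page_index)
   {
     *bucket_index = hash & ((1ULL << log2_nbuckets) - 1);
     *page_index = (hash >> log2_nbuckets) & ((1ULL << log2_size) - 1);
   }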

When sufficient collisions occur to fill the backing pages for a given
bucket, we double the bucket size, rehash, and deal the bucket contents
into a double-sized set of backing pages. In the future, we may
represent the size as a linear combination of two powers-of-two, to
increase space efficiency.

To solve the jackpot case where a set of records collide under hashing
in a bad way, the implementation will fall back to linear search across
2**log2_size backing pages on a per-bucket basis.

To maintain *space* efficiency, we should configure the bucket array so
that backing pages are effectively utilized. Lookup performance tends to
change *very little* if the bucket array is too small or too large.

Bihash depends on selecting an effective hash function. If one were to
use a truly broken hash function such as "return 1ULL;", bihash would
still work, but it would be equivalent to a poorly-programmed linear
search.

We often use CPU intrinsic functions - think crc32 - to rapidly compute
a hash code which has decent statistics.
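
For instance, a hash function in that style for an 8-byte key might look
like the sketch below. It assumes the clib_crc32c and clib_xxhash helpers
from vppinfra are available, and the example_hash_8_byte_key name is
hypothetical:

.. code:: c

   #include <vppinfra/clib.h>
   #include <vppinfra/crc32.h>
   #include <vppinfra/xxhash.h>

   /* Sketch only: hash an 8-byte key with the hardware crc32c helper
      when it is available, otherwise fall back to a software hash. */
   static inline u64
   example_hash_8_byte_key (u64 key)
   {
   #ifdef clib_crc32c_uses_intrinsics
     return clib_crc32c ((u8 *) &key, sizeof (key));
   #else
     return clib_xxhash (key);
   #endif
   }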

Bihash Cookbook
---------------

Using current (key,value) template instance types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It's quite easy to use one of the template instance types. As of this
writing, …/src/vppinfra provides pre-built templates for 8, 16, 20, 24,
40, and 48 byte keys, u8 \* vector keys, and 8 byte values.

See …/src/vppinfra/bihash\_<key-size>\_8.h

To define the data types, #include a specific template instance, most
often in a subsystem header file:

.. code:: c

   #include <vppinfra/bihash_8_8.h>

If you're building a standalone application, you'll need to define the
various functions by #including the method implementation file in a C
source file.

The core vpp engine currently uses most if not all of the known bihash
types, so you probably won't need to #include the method implementation
file.

.. code:: c

   #include <vppinfra/bihash_template.c>

Add an instance of the selected bihash data structure to, e.g., a main_t
structure:

.. code:: c

   typedef struct
   {
     ...
     BVT (clib_bihash) hash_table;
     or
     clib_bihash_8_8_t hash_table;
     ...
   } my_main_t;

The BV macro concatenates its argument with the value of the
preprocessor symbol BIHASH_TYPE. The BVT macro concatenates its argument
with the value of BIHASH_TYPE and the fixed string "_t". So in the above
example, BVT (clib_bihash) generates "clib_bihash_8_8_t".
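
Concretely, with bihash_8_8.h included (it defines BIHASH_TYPE as _8_8),
the expansions look like this:

.. code:: c

   /* With BIHASH_TYPE defined as _8_8 by <vppinfra/bihash_8_8.h>: */
   BV (clib_bihash_add_del)   /* expands to clib_bihash_add_del_8_8 */
   BV (clib_bihash_init)      /* expands to clib_bihash_init_8_8    */
   BVT (clib_bihash)          /* expands to clib_bihash_8_8_t       */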

If you're sure you won't decide to change the template / type name
later, it's perfectly OK to code clib_bihash_8_8_t and so forth.

In fact, if you #include multiple template instances in a single source
file, you **must** use fully-enumerated type names. The macros stand no
chance of working.

Initializing a bihash table
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Call the init function as shown. As a rough guide, pick a number of
buckets which is approximately
number_of_expected_records/BIHASH_KVP_PER_PAGE from the relevant
template instance header-file. See previous discussion.

The amount of memory selected should easily contain all of the records,
with a generous allowance for hash collisions. Bihash memory is
allocated separately from the main heap, and won't cost anything except
kernel PTEs until touched, so it's OK to be reasonably generous.

For example:

.. code:: c

   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;

   h = &mm->hash_table;

   clib_bihash_init_8_8 (h, "test", (u32) number_of_buckets,
                         (uword) memory_size);
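
As a rough worked example, assuming BIHASH_KVP_PER_PAGE is 4 in the
selected template header: for an expected one million records, a
number_of_buckets of roughly 1,000,000 / 4 = 250,000 is a sensible
starting point (the init code rounds the bucket count to a power of
two), and a memory_size of a few hundred megabytes costs nothing beyond
kernel PTEs until the pages are actually touched.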

Add or delete a key/value pair
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use BV(clib_bihash_add_del), or the explicit type variant:

.. code:: c

   clib_bihash_kv_8_8_t kv;
   clib_bihash_8_8_t *h;
   my_main_t *mm = &my_main;

   h = &mm->hash_table;
   kv.key = key_to_add_or_delete;
   kv.value = value_to_add_or_delete;

   clib_bihash_add_del_8_8 (h, &kv, is_add /* 1=add, 0=delete */);

In the delete case, kv.value is irrelevant. To change the value
associated with an existing (key,value) pair, simply re-add the [new]
pair.

Simple search
~~~~~~~~~~~~~

The simplest possible (key, value) search goes like so:

.. code:: c

   clib_bihash_kv_8_8_t search_kv, return_kv;
   clib_bihash_8_8_t *h;
   my_main_t *mm = &my_main;

   h = &mm->hash_table;
   search_kv.key = key_to_search_for;

   if (clib_bihash_search_8_8 (h, &search_kv, &return_kv) < 0)
     key_not_found();
   else
     key_found();

Note that it's perfectly fine to collect the lookup result into the
search key/value object:

.. code:: c

   if (clib_bihash_search_8_8 (h, &search_kv, &search_kv))
     key_not_found();
   etc.

Bihash vector processing
~~~~~~~~~~~~~~~~~~~~~~~~

When processing a vector of packets which need a certain lookup
performed, it's worth the trouble to compute the key hash, and prefetch
the correct bucket ahead of time.

Here's a sketch of one way to write the required code:

Dual-loop:

- 6 packets ahead, prefetch 2x vlib_buffer_t's and 2x packet data
  required to form the record keys
- 4 packets ahead, form 2x record keys and call BV(clib_bihash_hash) or
  the explicit hash function to calculate the record hashes. Call 2x
  BV(clib_bihash_prefetch_bucket) to prefetch the buckets
- 2 packets ahead, call 2x BV(clib_bihash_prefetch_data) to prefetch 2x
  (key,value) data pages.
- In the processing section, call 2x
  BV(clib_bihash_search_inline_with_hash) to perform the search

Programmer's choice whether to stash the hash code somewhere in
vnet_buffer(b) metadata, or to use local variables.
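
The sketch below is a deliberately simplified version of that idea: it
operates on an array of pre-extracted 8-byte keys rather than on vlib
buffers, so the buffer and packet-data prefetch stages are omitted, and
it uses two passes instead of a true dual-loop. The bulk_lookup
function, the BULK_MAX limit, and the miss convention are illustrative
assumptions, not vppinfra APIs; the clib_bihash\_\* calls are the ones
named above.

.. code:: c

   #include <vppinfra/bihash_8_8.h>

   #define BULK_MAX 256

   /* Illustrative only: look up n_keys 8-byte keys (n_keys <= BULK_MAX),
      prefetching buckets and (key,value) data pages ahead of the
      actual searches. */
   static void
   bulk_lookup (clib_bihash_8_8_t * h, u64 * keys, u64 * values, u32 n_keys)
   {
     u64 hashes[BULK_MAX];
     clib_bihash_kv_8_8_t kv;
     u32 i;

     /* Pass 1: compute the record hashes and prefetch the buckets */
     for (i = 0; i < n_keys; i++)
       {
         kv.key = keys[i];
         hashes[i] = clib_bihash_hash_8_8 (&kv);
         clib_bihash_prefetch_bucket_8_8 (h, hashes[i]);
       }

     /* Pass 2: prefetch (key,value) data a couple of lookups ahead,
        then search with the precomputed hash */
     for (i = 0; i < n_keys; i++)
       {
         if (i + 2 < n_keys)
           clib_bihash_prefetch_data_8_8 (h, hashes[i + 2]);

         kv.key = keys[i];
         if (clib_bihash_search_inline_with_hash_8_8 (h, hashes[i], &kv) < 0)
           values[i] = ~0ULL;          /* miss: illustrative convention */
         else
           values[i] = kv.value;
       }
   }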

Single-loop:

- Use simple search as shown above.

Walking a bihash table
~~~~~~~~~~~~~~~~~~~~~~

A fairly common scenario to build show commands involves walking a
bihash table. It's simple enough:

.. code:: c

   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;
   void callback_fn (clib_bihash_kv_8_8_t *, void *);

   h = &mm->hash_table;

   BV(clib_bihash_foreach_key_value_pair) (h, callback_fn, (void *) arg);

To nobody's great surprise: clib_bihash_foreach_key_value_pair iterates
across the entire table, calling callback_fn with active entries.
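
A callback matching the signature declared above might look like this
sketch, which simply counts active entries through the arg pointer:

.. code:: c

   /* Sketch: count active entries; matches the callback signature
      declared in the walk example above. */
   void
   callback_fn (clib_bihash_kv_8_8_t * kvp, void *arg)
   {
     u64 *count = (u64 *) arg;

     /* kvp->key and kvp->value are valid here */
     *count += 1;
   }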

Bihash table iteration safety
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The iterator template clib_bihash_foreach_key_value_pair must be used
with a certain amount of care. For one thing, the iterator template does
*not* take the bihash hash table writer lock. If your use-case requires
it, lock the table.

For another, the iterator template is not safe under all conditions:

- It's **OK to delete** bihash table entries during a table-walk. The
  iterator checks whether the current bucket has been freed after each
  *callback_fn(…)* invocation.

- It is **not OK to add** entries during a table-walk.

The add-during-walk case involves a jackpot: while processing a
key-value-pair in a particular bucket, add a certain number of entries.
By luck, assume that one or more of the added entries causes the
**current bucket** to split-and-rehash.

Since we rehash KVPs to different pages based on what amounts to a
different hash function, either of these things can go wrong:

- We may revisit previously-visited entries. Depending on how one coded
  the use-case, we could end up in a recursive-add situation.

- We may skip entries that have not been visited.

One could build an add-safe iterator, at a significant cost in
performance: copy the entire bucket, and walk the copy.

It's hard to imagine a worthwhile add-during-walk use-case in the first
place; let alone one which couldn't be implemented by walking the table
without modifying it, then adding a set of records.

Creating a new template instance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Creating a new template is easy. Use one of the existing templates as a
model, and make the obvious changes. The hash and key_compare methods
are performance-critical in multiple senses.

If the key compare method is slow, every lookup will be slow. If the
hash function is slow, same story. If the hash function has poor
statistical properties, space efficiency will suffer. In the limit, a
bad enough hash function will cause large portions of the table to
revert to linear search.

Use of the best available vector unit is well worth the trouble in the
hash and key_compare functions.
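
To make the performance point concrete, one plausible shape for a
key_compare method, shown here for a hypothetical 16-byte key, is a
branch-free scalar compare; on CPUs with suitable vector units the same
operation can be done with a single wide load and compare. The function
name below is illustrative, not an existing vppinfra symbol:

.. code:: c

   #include <vppinfra/clib.h>

   /* Hypothetical key_compare for a 16-byte key: branch-free, and a
      natural candidate for a vector-unit implementation. */
   static inline int
   example_bihash_key_compare_16 (u64 * a, u64 * b)
   {
     return ((a[0] ^ b[0]) | (a[1] ^ b[1])) == 0;
   }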