Bounded-index Extensible Hashing (bihash)
=========================================

Vpp uses bounded-index extensible hashing to solve a variety of
exact-match (key, value) lookup problems. Benefits of the current
implementation:

- Very high record count scaling, tested to 100,000,000 records
- Lookup performance degrades gracefully as the number of records
  increases
- No reader locking required
- Template implementation; it’s easy to support arbitrary (key,value)
  types

Bounded-index extensible hashing has been widely used in databases for
decades.

Bihash uses a two-level data structure:

::

    +-----------------+
    | bucket-0        |
    |  log2_size      |
    |  backing store  |
    +-----------------+
    | bucket-1        |
    |  log2_size      |           +--------------------------------+
    |  backing store  | --------> | KVP_PER_PAGE * key-value-pairs |
    +-----------------+           |  page 0                        |
          ...                     +--------------------------------+
    +-----------------+           | KVP_PER_PAGE * key-value-pairs |
    | bucket-2**N-1   |           |  page 1                        |
    |  log2_size      |           +--------------------------------+
    |  backing store  |                          ---
    +-----------------+           +--------------------------------+
                                  | KVP_PER_PAGE * key-value-pairs |
                                  |  page 2**log2_size - 1         |
                                  +--------------------------------+

Discussion of the algorithm
---------------------------

This structure has a couple of major advantages. In practice, each
bucket entry fits into a 64-bit integer. Coincidentally, vpp’s target
CPU architectures support 64-bit atomic operations. When modifying the
contents of a specific bucket, we do the following:

- Make a working copy of the bucket’s backing storage
- Atomically swap a pointer to the working copy into the bucket array
- Change the original backing store data
- Atomically swap back to the original

So, no reader locking is required to search a bihash table.

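
The sequence above can be sketched conceptually as follows. This is
*not* the actual vppinfra code; the type, field layout, and function
names here are invented for illustration (and the u64 typedef comes
from vppinfra). The point is that the whole bucket header fits in a
single 64-bit word, so each step is one atomic store which readers
observe either entirely before or entirely after.

.. code:: c

   /* Conceptual sketch only -- not the vppinfra implementation. */
   typedef union
   {
     struct
     {
       u64 offset : 37;     /* illustrative field layout */
       u64 log2_pages : 8;
       u64 unused : 19;
     };
     u64 as_u64;            /* the whole bucket header in one word */
   } my_bucket_t;

   static void
   my_update_bucket (my_bucket_t * b, u64 working_copy_offset)
   {
     my_bucket_t original, working;

     original.as_u64 = b->as_u64;          /* snapshot the bucket */
     working = original;
     working.offset = working_copy_offset; /* point at the working copy */

     /* Readers now follow the working copy... */
     __atomic_store_n (&b->as_u64, working.as_u64, __ATOMIC_RELEASE);

     /* ...while we modify the original backing store data here... */

     /* ...and finally swap back to the (now updated) original. */
     __atomic_store_n (&b->as_u64, original.as_u64, __ATOMIC_RELEASE);
   }
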
At lookup time, the implementation computes a key hash code. We use the
least-significant N bits of the hash to select the bucket.

With the bucket in hand, we learn log2 (nBackingPages) for the selected
bucket. At this point, we use the next log2_size bits from the hash code
to select the specific backing page in which the (key,value) pair will
be found.

Net result: we search **one** backing page, not 2**log2_size pages. This
is a key property of the algorithm.
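
As a conceptual sketch only (again, not the vppinfra implementation;
every name below is illustrative), the selection logic described above
amounts to:

.. code:: c

   /* Conceptual sketch of bucket and page selection. nbuckets is a
      power of two, 2**log2_nbuckets. */
   u64 hash = compute_hash (key);

   /* The least-significant N bits select the bucket... */
   u32 bucket_index = hash & (nbuckets - 1);
   my_bucket_t *b = &buckets[bucket_index];

   /* ...and the next log2_size bits select the one backing page
      which can contain the (key,value) pair. */
   u64 page_hash = hash >> log2_nbuckets;
   u32 page_index = page_hash & ((1 << b->log2_pages) - 1);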

When sufficient collisions occur to fill the backing pages for a given
bucket, we double the bucket size, rehash, and deal the bucket contents
into a double-sized set of backing pages. In the future, we may
represent the size as a linear combination of two powers-of-two, to
increase space efficiency.

To solve the “jackpot case” where a set of records collide under hashing
in a bad way, the implementation will fall back to linear search across
2**log2_size backing pages on a per-bucket basis.

To maintain *space* efficiency, we should configure the bucket array so
that backing pages are effectively utilized. Lookup performance tends to
change *very little* if the bucket array is too small or too large.

Bihash depends on selecting an effective hash function. If one were to
use a truly broken hash function such as “return 1ULL”, bihash would
still work, but it would be equivalent to a poorly-programmed linear
search.

We often use cpu intrinsic functions - think crc32 - to rapidly compute
a hash code which has decent statistics.
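
For example, on x86 a hash along these lines is cheap to compute. This
is an illustration only, assuming SSE4.2 support and vppinfra’s u64
typedef; the shipped templates define their own hash functions.

.. code:: c

   #include <x86intrin.h>

   /* Illustration: hash an 8-byte key with the SSE4.2 crc32
      instruction. Decent statistics, and only a few cycles. */
   static inline u64
   my_hash_u64 (u64 key)
   {
     return _mm_crc32_u64 (0xffffffffULL, key);
   }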

Bihash Cookbook
---------------

Using current (key,value) template instance types
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It’s quite easy to use one of the template instance types. As of this
writing, …/src/vppinfra provides pre-built templates for 8, 16, 20, 24,
40, and 48 byte keys, u8 \* vector keys, and 8 byte values.

See …/src/vppinfra/{bihash\_<key-size>\_8}.h

To define the data types, #include a specific template instance, most
often in a subsystem header file:

.. code:: c

   #include <vppinfra/bihash_8_8.h>

If you’re building a standalone application, you’ll need to define the
various functions by #including the method implementation file in a C
source file.

The core vpp engine currently uses most if not all of the known bihash
types, so you probably won’t need to #include the method implementation
file.

.. code:: c

   #include <vppinfra/bihash_template.c>

Add an instance of the selected bihash data structure to e.g. a “main_t”
structure:

.. code:: c

   typedef struct
   {
     ...
     BVT (clib_bihash) hash_table;
     /* or */
     clib_bihash_8_8_t hash_table;
     ...
   } my_main_t;

The BV macro concatenates its argument with the value of the preprocessor
symbol BIHASH_TYPE. The BVT macro concatenates its argument with the
value of BIHASH_TYPE and the fixed string “_t”. So in the above example,
BVT (clib_bihash) generates “clib_bihash_8_8_t”.

If you’re sure you won’t decide to change the template / type name
later, it’s perfectly OK to code “clib_bihash_8_8_t” and so forth.

In fact, if you #include multiple template instances in a single source
file, you **must** use fully-enumerated type names. The macros stand no
chance of working.

Initializing a bihash table
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Call the init function as shown. As a rough guide, pick a number of
buckets which is approximately
number_of_expected_records/BIHASH_KVP_PER_PAGE from the relevant
template instance header file. See the previous discussion.

The amount of memory selected should easily contain all of the records,
with a generous allowance for hash collisions. Bihash memory is
allocated separately from the main heap, and won’t cost anything except
kernel PTEs until touched, so it’s OK to be reasonably generous.

For example:

.. code:: c

   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;

   h = &mm->hash_table;

   clib_bihash_init_8_8 (h, "test", (u32) number_of_buckets,
                         (uword) memory_size);

Add or delete a key/value pair
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use BV(clib_bihash_add_del), or the explicit type variant:

.. code:: c

   clib_bihash_kv_8_8_t kv;
   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;

   h = &mm->hash_table;
   kv.key = key_to_add_or_delete;
   kv.value = value_to_add_or_delete;

   clib_bihash_add_del_8_8 (h, &kv, is_add /* 1=add, 0=delete */);

In the delete case, kv.value is irrelevant. To change the value
associated with an existing (key,value) pair, simply re-add the [new]
pair.

Simple search
~~~~~~~~~~~~~

The simplest possible (key, value) search goes like so:

.. code:: c

   clib_bihash_kv_8_8_t search_kv, return_kv;
   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;

   h = &mm->hash_table;
   search_kv.key = key_to_search_for;

   if (clib_bihash_search_8_8 (h, &search_kv, &return_kv) < 0)
     key_not_found();
   else
     key_found();

Note that it’s perfectly fine to collect the lookup result in the
search (key,value) object:

.. code:: c

   if (clib_bihash_search_8_8 (h, &search_kv, &search_kv))
     key_not_found();
   /* etc. */

Bihash vector processing
~~~~~~~~~~~~~~~~~~~~~~~~

When processing a vector of packets which need a certain lookup
performed, it’s worth the trouble to compute the key hash, and prefetch
the correct bucket ahead of time.

Here’s a sketch of one way to write the required code:

Dual-loop:

- 6 packets ahead, prefetch 2x vlib_buffer_t’s and 2x packet data
  required to form the record keys
- 4 packets ahead, form 2x record keys and call BV(clib_bihash_hash) or
  the explicit hash function to calculate the record hashes. Call 2x
  BV(clib_bihash_prefetch_bucket) to prefetch the buckets
- 2 packets ahead, call 2x BV(clib_bihash_prefetch_data) to prefetch 2x
  (key,value) data pages
- In the processing section, call 2x
  BV(clib_bihash_search_inline_with_hash) to perform the search

Programmer’s choice whether to stash the hash code somewhere in
vnet_buffer(b) metadata, or to use local variables.
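
As an illustration only, here is a much-simplified sketch for an 8_8
table. It runs the stages back-to-back for a single pair of packets
rather than staggering them across the pipeline as described above;
my_make_key() is a placeholder for whatever code builds a search key
from packet data, and h, b0 and b1 (the table and two packet buffers)
are assumed to be in scope.

.. code:: c

   clib_bihash_kv_8_8_t kv0, kv1;
   u64 hash0, hash1;
   int rv0, rv1;

   /* Build the two search keys; my_make_key() is a placeholder. */
   my_make_key (&kv0, b0);
   my_make_key (&kv1, b1);

   /* Compute the record hashes once, and prefetch the buckets... */
   hash0 = clib_bihash_hash_8_8 (&kv0);
   hash1 = clib_bihash_hash_8_8 (&kv1);
   clib_bihash_prefetch_bucket_8_8 (h, hash0);
   clib_bihash_prefetch_bucket_8_8 (h, hash1);

   /* ...then prefetch the (key,value) data pages... */
   clib_bihash_prefetch_data_8_8 (h, hash0);
   clib_bihash_prefetch_data_8_8 (h, hash1);

   /* ...and finally search, reusing the hashes computed above.
      rv < 0 means not found; on success the kv is updated with the
      stored (key,value) pair. */
   rv0 = clib_bihash_search_inline_with_hash_8_8 (h, hash0, &kv0);
   rv1 = clib_bihash_search_inline_with_hash_8_8 (h, hash1, &kv1);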

Single-loop:

- Use simple search as shown above.

Walking a bihash table
~~~~~~~~~~~~~~~~~~~~~~

A fairly common scenario to build “show” commands involves walking a
bihash table. It’s simple enough:

.. code:: c

   my_main_t *mm = &my_main;
   clib_bihash_8_8_t *h;
   void callback_fn (clib_bihash_kv_8_8_t *, void *);

   h = &mm->hash_table;

   BV(clib_bihash_foreach_key_value_pair) (h, callback_fn, (void *) arg);

To nobody’s great surprise, clib_bihash_foreach_key_value_pair iterates
across the entire table, calling callback_fn with active entries.
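
For illustration, a minimal callback matching the signature declared
above might look like the following; my_walk_ctx_t and the counting
logic are invented for the example.

.. code:: c

   typedef struct
   {
     u64 active_entries;   /* hypothetical per-walk context */
   } my_walk_ctx_t;

   static void
   callback_fn (clib_bihash_kv_8_8_t * kvp, void *arg)
   {
     my_walk_ctx_t *ctx = arg;

     /* Inspect kvp->key / kvp->value as required; here we just count. */
     ctx->active_entries += 1;
   }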

Bihash table iteration safety
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The iterator template “clib_bihash_foreach_key_value_pair” must be used
with a certain amount of care. For one thing, the iterator template does
*not* take the bihash hash table writer lock. If your use-case requires
it, lock the table.

For another, the iterator template is not safe under all conditions:

- It’s **OK to delete** bihash table entries during a table-walk. The
  iterator checks whether the current bucket has been freed after each
  *callback_fn(…)* invocation.

- It is **not OK to add** entries during a table-walk.

The add-during-walk case involves a jackpot: while processing a
key-value-pair in a particular bucket, add a certain number of entries.
By bad luck, assume that one or more of the added entries causes the
**current bucket** to split-and-rehash.

Since we rehash KVPs to different pages based on what amounts to a
different hash function, either of these things can go wrong:

- We may revisit previously-visited entries. Depending on how one coded
  the use-case, we could end up in a recursive-add situation.

- We may skip entries that have not been visited.

One could build an add-safe iterator, at a significant cost in
performance: copy the entire bucket, and walk the copy.

It’s hard to imagine a worthwhile add-during-walk use-case in the first
place, let alone one which couldn’t be implemented by walking the table
without modifying it, then adding a set of records.

Creating a new template instance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Creating a new template is easy. Use one of the existing templates as a
model, and make the obvious changes. The hash and key_compare methods
are performance-critical in multiple senses.

If the key compare method is slow, every lookup will be slow. If the
hash function is slow, same story. If the hash function has poor
statistical properties, space efficiency will suffer. In the limit, a
bad enough hash function will cause large portions of the table to
revert to linear search.

Use of the best available vector unit is well worth the trouble in the
hash and key_compare functions.
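
For instance, a 16-byte key compare can be written branch-free with SSE
instructions. The sketch below is an illustration only (x86-specific,
SSE4.1, assuming vppinfra’s u8 typedef), not code taken from an existing
template.

.. code:: c

   #include <x86intrin.h>

   /* Illustration: compare two 16-byte keys with SSE. Returns
      non-zero when the keys are equal. */
   static inline int
   my_key_compare_16 (const u8 * a, const u8 * b)
   {
     __m128i xa = _mm_loadu_si128 ((const __m128i *) a);
     __m128i xb = _mm_loadu_si128 ((const __m128i *) b);
     __m128i diff = _mm_xor_si128 (xa, xb);

     /* 1 iff every bit of diff is zero, i.e. the keys match. */
     return _mm_test_all_zeros (diff, diff);
   }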