=== vnet classifier theory of operation ===

The vnet classifier trades off simplicity against performance and scale
characteristics. At a certain level, it's a dumb robot. Given an
incoming packet, it searches an ordered list of (mask, match) tables. If
the classifier finds a matching entry, it takes the indicated action. If
not, it takes a last-resort action.

We use the 128-bit vector unit to match or hash 16 octets at a time.
For hardware backward compatibility, the code does not [currently] use
256-bit (32-octet) vector instructions.

Effective use of the classifier centers around building table lists
which "hit" as soon as practicable. In many cases, established
sessions hit in the first table. In this mode of operation, the
classifier easily processes multiple MPPS / core - even with millions
of sessions in the database. Searching 357 tables on a regular basis
will neatly solve the halting problem.

==== Basic operation ====

The classifier mask-and-match operation proceeds as follows. Given a
starting classifier table index, lay hands on the indicated mask
vector. When building tables, we arrange for the mask to obey
16-octet (vector-unit) alignment.

We know that the first octet of packet data starts on a cache-line
boundary. Further, it's reasonably likely that folks won't want to use
the generalized classifier on the L2 header, preferring to decode the
Ethertype manually. That scheme makes it easy to select among ip4 /
ip6 / MPLS, etc. classifier table sets.

A no-vlan-tag L2 header is 14 octets long. A typical ipv4 header
begins with the octets 0x4500: version=4, header_length=5, DSCP=0,
ECN=0. Together these make up the first 16-octet vector of the packet.
If one doesn't intend to classify on (DSCP, ECN) - the typical
case - we program the classifier to skip the first 16-octet vector.

To classify untagged ipv4 packets on source address (packet octets
26 through 29, which fall in the second vector), we program the
classifier to skip one vector, and mask-and-match one vector.
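
The skip / match arithmetic for the untagged ipv4 source-address case can be sketched in plain C. This is a minimal illustration, not classifier code: the struct `ip4_src_mask_t` and the helper `build_ip4_src_mask` are hypothetical names, and the sketch assumes a 14-octet L2 header followed by a standard ipv4 header with the source address at header offset 12.

```c
#include <stdint.h>
#include <string.h>

enum
{
  VECTOR_OCTETS = 16,		/* one mask-and-match unit */
  L2_HEADER_OCTETS = 14,	/* untagged ethernet header */
  IP4_SRC_OFFSET = 12,		/* source address offset within the ipv4 header */
  IP4_SRC_OCTETS = 4
};

/* Hypothetical container for the values a control plane would compute. */
typedef struct
{
  uint32_t skip_n_vectors;
  uint32_t match_n_vectors;
  uint8_t mask[VECTOR_OCTETS];	/* one match vector suffices in this case */
} ip4_src_mask_t;

static ip4_src_mask_t
build_ip4_src_mask (void)
{
  ip4_src_mask_t m;
  uint32_t first = L2_HEADER_OCTETS + IP4_SRC_OFFSET;	/* packet octet 26 */
  uint32_t last = first + IP4_SRC_OCTETS - 1;		/* packet octet 29 */
  uint32_t i;

  memset (&m, 0, sizeof (m));
  /* Whole vectors that precede the first interesting octet are skipped. */
  m.skip_n_vectors = first / VECTOR_OCTETS;
  m.match_n_vectors = last / VECTOR_OCTETS - m.skip_n_vectors + 1;

  /* Mask offsets are relative to the first matched vector, not the packet. */
  for (i = first; i <= last; i++)
    m.mask[i - m.skip_n_vectors * VECTOR_OCTETS] = 0xff;
  return m;
}
```

With these assumptions the helper yields skip_n_vectors = 1, match_n_vectors = 1, and mask octets 10 through 13 set, which is the "skip one vector, mask-and-match one vector" configuration described above.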

The basic mask-and-match operation looks like this:

    switch (t->match_n_vectors)
      {
      case 1:
        result = (data[0 + t->skip_n_vectors] & mask[0]) ^ key[0];
        break;

      case 2:
        result = (data[0 + t->skip_n_vectors] & mask[0]) ^ key[0];
        result |= (data[1 + t->skip_n_vectors] & mask[1]) ^ key[1];
        break;

        <etc>
      }

    result_mask = u32x4_zero_byte_mask (result);
    if (result_mask == 0xffff)
      return (v);

Net of setup, it costs a couple of clock cycles to mask-and-match 16
octets.
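
For readers without the vector types at hand, the same comparison can be written octet by octet. The function below is a portable sketch under that assumption; `mask_and_match` is a hypothetical name, and the real code operates on 16-octet vectors rather than single octets.

```c
#include <stddef.h>
#include <stdint.h>

enum { VECTOR_OCTETS = 16 };

/* Scalar equivalent of the vector code: returns 1 when
   (data & mask) == key over every matched octet, else 0.
   data points at the first octet of packet data; mask and key
   each hold match_n_vectors * 16 octets. */
static int
mask_and_match (const uint8_t * data, const uint8_t * mask,
		const uint8_t * key, uint32_t skip_n_vectors,
		uint32_t match_n_vectors)
{
  const uint8_t *p = data + (size_t) skip_n_vectors * VECTOR_OCTETS;
  size_t n = (size_t) match_n_vectors * VECTOR_OCTETS;
  uint8_t result = 0;
  size_t i;

  /* Accumulate differences; any nonzero bit means a mismatch. */
  for (i = 0; i < n; i++)
    result |= (uint8_t) ((p[i] & mask[i]) ^ key[i]);

  return result == 0;
}
```

As in the vector version, masked-out octets contribute nothing to the result, so the key must be zero wherever the mask is zero.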

At the risk of belaboring an obvious point, the control-plane
'''must''' pay attention to detail. When skipping one (or more)
vectors, masks and matches must reflect that decision. See
.../vnet/vnet/classify/vnet_classify.c:unformat_classify_[mask|match]. Note
that vec_validate (xxx, 13) creates a 14-element vector.

==== Creating a classifier table ====

To create a new classifier table via the control-plane API, send a
"classify_add_del_table" message. The underlying action routine,
vnet_classify_add_del_table(...), is located in
.../vnet/vnet/classify/vnet_classify.c, and has the following
prototype:

    int vnet_classify_add_del_table (vnet_classify_main_t * cm,
                                     u8 * mask,
                                     u32 nbuckets,
                                     u32 memory_size,
                                     u32 skip,
                                     u32 match,
                                     u32 next_table_index,
                                     u32 miss_next_index,
                                     u32 * table_index,
                                     int is_add)

Pass cm = &vnet_classify_main if calling this routine directly. Mask,
skip(_n_vectors) and match(_n_vectors) are as described above. Mask
need not be aligned, but it must be match*16 octets in length. To
avoid having your head explode, be absolutely certain that '''only'''
the bits you intend to match on are set.

The classifier uses thread-safe, no-reader-locking-required
bounded-index extensible hashing. Nbuckets is the [fixed] size of the
hash bucket vector. The algorithm works in constant time regardless of
hash collisions, but wastes space when the bucket array is too
small. A good rule of thumb: let nbuckets = approximate number of
entries expected.

At a significant cost in complexity, it would be possible to resize the
bucket array dynamically. We have no plans to implement that function.

Each classifier table has its own clib mheap memory allocation
arena. To pick the memory_size parameter, note that each classifier
table entry needs 16*(1 + match_n_vectors) bytes. Within reason, aim a
bit high. Clib mheap memory uses o/s level virtual memory - not wired
or hugetlb memory - so it's best not to scrimp on size.
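
The sizing rule works out as follows. This is only a sketch of the arithmetic: the helper names and the `headroom_factor` parameter are hypothetical, chosen here to express "aim a bit high", and the result still has to fit the u32 memory_size parameter.

```c
#include <stdint.h>

/* Per-entry cost stated above: 16 * (1 + match_n_vectors) bytes. */
static uint64_t
classifier_entry_bytes (uint32_t match_n_vectors)
{
  return 16ULL * (1ULL + match_n_vectors);
}

/* Hypothetical sizing helper: expected entries times per-entry cost,
   multiplied by a caller-chosen headroom factor (an assumption, not a
   classifier parameter). */
static uint64_t
suggest_memory_size (uint64_t expected_entries, uint32_t match_n_vectors,
		     uint32_t headroom_factor)
{
  return expected_entries * classifier_entry_bytes (match_n_vectors)
    * headroom_factor;
}
```

For example, one million sessions in a table with match_n_vectors = 1 need 32 bytes each; with a headroom factor of 2 the sketch suggests 64,000,000 bytes.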

The "next_table_index" parameter is as described: the pool index in
vnet_classify_main.tables of the next table to search. Code ~0 to
indicate the end of the table list. 0 is a valid table index!

We often create classification tables in reverse order -
last-table-searched to first-table-searched - so we can easily set
this parameter. Of course, one can manually adjust the data structure
after-the-fact.
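
The reverse-order idiom is easier to see with a toy model. Everything below (the `toy_` types and functions, `END_OF_TABLE_LIST`, `MAX_TABLES`) is a hypothetical stand-in for the real table pool, assuming only that each new table receives the next free index and records the index of the table to search after it.

```c
#include <assert.h>
#include <stdint.h>

#define END_OF_TABLE_LIST (~0u)	/* "code ~0" to end the list */
enum { MAX_TABLES = 8 };

typedef struct
{
  uint32_t next_table_index;
} toy_table_t;

typedef struct
{
  toy_table_t tables[MAX_TABLES];
  uint32_t n_tables;
} toy_classifier_t;

/* Stand-in for table creation: returns the new table's index. */
static uint32_t
toy_add_table (toy_classifier_t * cm, uint32_t next_table_index)
{
  uint32_t index;

  assert (cm->n_tables < MAX_TABLES);
  index = cm->n_tables++;
  cm->tables[index].next_table_index = next_table_index;
  return index;
}

/* Create n tables, last-searched first, so each new table can point
   at the one created just before it. Returns the first table to search. */
static uint32_t
toy_build_chain (toy_classifier_t * cm, uint32_t n)
{
  uint32_t next = END_OF_TABLE_LIST;
  uint32_t i;

  for (i = 0; i < n; i++)
    next = toy_add_table (cm, next);
  return next;
}

/* Walk the chain from the first table to the end marker. */
static uint32_t
toy_chain_length (const toy_classifier_t * cm, uint32_t first)
{
  uint32_t len = 0;
  uint32_t t;

  for (t = first; t != END_OF_TABLE_LIST; t = cm->tables[t].next_table_index)
    len++;
  return len;
}
```

In this model a three-table chain ends up searched in the order 2, 1, 0. Table 0 is the last one searched, which is why 0 cannot double as the end-of-list marker.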

Specific classifier client nodes - for example,
.../vnet/vnet/classify/ip_classify.c - interpret the "miss_next_index"
parameter as a vpp graph-node next index. When packet classification
fails to produce a match, ip_classify_inline sends packets to the
indicated disposition. A classifier application might program this
parameter to send packets which don't match an existing session to a
"first-sign-of-life, create-new-session" node.

Finally, the is_add parameter indicates whether to add or delete the
indicated table. The delete case implicitly terminates all sessions
with extreme prejudice, by freeing the specified clib mheap.

==== Creating a classifier session ====

To create a new classifier session via the control-plane API, send a
"classify_add_del_session" message. The underlying action routine,
vnet_classify_add_del_session(...), is located in
.../vnet/vnet/classify/vnet_classify.c, and has the following
prototype:

    int vnet_classify_add_del_session (vnet_classify_main_t * cm,
                                       u32 table_index,
                                       u8 * match,
                                       u32 hit_next_index,
                                       u32 opaque_index,
                                       i32 advance,
                                       int is_add)

Pass cm = &vnet_classify_main if calling this routine directly. The
table_index parameter specifies the table which receives the new
session, or which contains the session to delete, depending on is_add.

Match is the key for the indicated session. It need not be aligned,
but it must be table->match_n_vectors*16 octets in length. As a
courtesy, vnet_classify_add_del_session applies the table's mask to
the stored key-value. In this way, one can create a session by passing
unmasked (packet_data + offset) as the "match" parameter, and end up
with unconfusing session keys.
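
The masking courtesy amounts to an octet-wise AND of the caller's match data with the table mask. The sketch below shows that operation in isolation; `apply_table_mask` is a hypothetical name, not the routine used inside vnet_classify_add_del_session.

```c
#include <stddef.h>
#include <stdint.h>

enum { VECTOR_OCTETS = 16 };

/* Store (match & mask) so that octets the table does not classify on
   are zero in the session key. All three buffers hold
   match_n_vectors * 16 octets. */
static void
apply_table_mask (uint8_t * stored_key, const uint8_t * match,
		  const uint8_t * mask, uint32_t match_n_vectors)
{
  size_t n = (size_t) match_n_vectors * VECTOR_OCTETS;
  size_t i;

  for (i = 0; i < n; i++)
    stored_key[i] = match[i] & mask[i];
}
```

Because the lookup path computes (data & mask) ^ key, a stored key with stray bits outside the mask could never match; masking at session-creation time rules that out.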

Specific classifier client nodes - for example,
.../vnet/vnet/classify/ip_classify.c - interpret the per-session
hit_next_index parameter as a vpp graph-node next index. When packet
classification produces a match, ip_classify_inline sends packets to
the indicated disposition.

ip4/6_classify place the per-session opaque_index parameter into
vnet_buffer(b)->l2_classify.opaque_index; a slight misnomer, but
anyhow classifier applications can send session-hit packets to
specific graph nodes, with useful values in buffer metadata. Depending
on the required semantics, we send known-session traffic to a certain
node, with e.g. a session pool index in buffer metadata. It's totally
up to the control-plane and the specific use-case.

Finally, nodes such as ip4/6-classify apply the advance parameter as a
[signed!] argument to vlib_buffer_advance(...), to "consume" a
networking layer. Example: if we classify incoming tunneled IP packets
by (inner) source/dest address and source/dest port, we might choose
to decapsulate and reencapsulate the inner packet. In such a case,
program the advance parameter to perform the tunnel decapsulation, and
program hit_next_index to send traffic to a node which uses
e.g. opaque_index to output traffic on a specific tunnel interface.