Contiv/VPP Network Operation
============================

This document describes the network operation of the Contiv/VPP k8s
network plugin. It elaborates on the operation and configuration options
of Contiv IPAM, as well as the details of how VPP gets programmed by the
Contiv/VPP control plane.

The following picture shows a 2-node k8s deployment of Contiv/VPP, with a
VXLAN tunnel established between the nodes to forward inter-node POD
traffic. The IPAM options are depicted on Node 1, whereas the VPP
programming is depicted on Node 2.

.. figure:: /_images/contiv-networking.png
   :alt: contiv-networking.png

   Contiv/VPP Architecture

Contiv/VPP IPAM (IP Address Management)
---------------------------------------

IPAM in Contiv/VPP is based on the concept of **Node ID**. The Node ID
is a number that uniquely identifies a node in the k8s cluster. The
first node is assigned the ID of 1, the second node 2, etc. If a node
leaves the cluster, its ID is released back to the pool and will be
re-used by the next node.

The Node ID is used to calculate per-node IP subnets for PODs and other
internal subnets that need to be unique on each node. Apart from the
Node ID, the input for IPAM calculations is a set of config knobs, which
can be specified in the ``IPAMConfig`` section of the `Contiv/VPP
deployment YAML <../../../k8s/contiv-vpp.yaml>`__ (a worked example of
the calculations follows the list below):

- **PodSubnetCIDR** (default ``10.1.0.0/16``): each POD gets an IP
  address assigned from this range. The size of this range (default
  ``/16``) dictates the upper limit of the POD count for the entire k8s
  cluster (65536 PODs by default).

- **PodNetworkPrefixLen** (default ``24``): prefix length of the
  per-node dedicated POD subnet. From the allocatable range defined in
  ``PodSubnetCIDR``, this value dictates the size of the allocation for
  each node. With the default value (``24``), each node gets a ``/24``
  slice of the ``PodSubnetCIDR``, and the Node ID selects which slice
  belongs to the node. With ``PodSubnetCIDR = 10.1.0.0/16``,
  ``PodNetworkPrefixLen = 24`` and ``NodeID = 5``, the resulting POD
  subnet for the node would be ``10.1.5.0/24``.

- **PodIfIPCIDR** (default ``10.2.1.0/24``): VPP-internal addresses
  used to put the VPP interfaces facing the PODs into L3 mode. This IP
  range is reused on each node and is therefore never addressable from
  outside of the node itself. The only requirement is that this subnet
  must not collide with any other IPAM subnet.

- **VPPHostSubnetCIDR** (default ``172.30.0.0/16``): used for
  addressing the interconnect of VPP with the Linux network stack
  within the same node. Since this subnet needs to be unique on each
  node, the Node ID is used to determine the actual subnet used on the
  node, in combination with ``VPPHostNetworkPrefixLen``,
  ``PodSubnetCIDR`` and ``PodNetworkPrefixLen``.

- **VPPHostNetworkPrefixLen** (default ``24``): used to calculate the
  subnet for addressing the interconnect of VPP with the Linux network
  stack, within the same node. With
  ``VPPHostSubnetCIDR = 172.30.0.0/16``,
  ``VPPHostNetworkPrefixLen = 24`` and ``NodeID = 5`` the resulting
  subnet for the node would be ``172.30.5.0/24``.

- **NodeInterconnectCIDR** (default ``192.168.16.0/24``): range for the
  addresses assigned to the data plane interfaces managed by VPP.
  Unless DHCP is used (``NodeInterconnectDHCP = True``), the Contiv/VPP
  control plane automatically assigns an IP address from this range to
  the DPDK-managed ethernet interface bound to VPP on each node. The
  actual IP address will be calculated from the Node ID (e.g., with
  ``NodeInterconnectCIDR = 192.168.16.0/24`` and ``NodeID = 5``, the
  resulting IP address assigned to the ethernet interface on VPP will
  be ``192.168.16.5``).

- **NodeInterconnectDHCP** (default ``False``): when set to ``True``,
  the IP addresses of the VPP-managed data plane interfaces are
  assigned by DHCP instead of by the Contiv/VPP control plane from
  ``NodeInterconnectCIDR``. A DHCP server must be running in the
  network to which the data plane interface is connected. If
  ``NodeInterconnectDHCP = True``, ``NodeInterconnectCIDR`` is ignored.

- **VxlanCIDR** (default ``192.168.30.0/24``): in order to provide
  inter-node POD to POD connectivity via any underlay network (not
  necessarily an L2 network), Contiv/VPP sets up a VXLAN tunnel overlay
  between each pair of nodes within the cluster. Each node needs a
  unique IP address for its VXLAN BVI interface. This IP address is
  automatically calculated from the Node ID (e.g., with
  ``VxlanCIDR = 192.168.30.0/24`` and ``NodeID = 5``, the resulting IP
  address assigned to the VXLAN BVI interface will be
  ``192.168.30.5``).

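As a worked example of the rules above, the following Python sketch (purely
illustrative, not part of Contiv/VPP) derives the per-node subnets and
addresses from the Node ID, assuming the default config values:

.. code-block:: python

   import ipaddress

   # Default IPAM config values listed above.
   POD_SUBNET_CIDR = ipaddress.ip_network("10.1.0.0/16")
   POD_NETWORK_PREFIX_LEN = 24
   VPP_HOST_SUBNET_CIDR = ipaddress.ip_network("172.30.0.0/16")
   VPP_HOST_NETWORK_PREFIX_LEN = 24
   NODE_INTERCONNECT_CIDR = ipaddress.ip_network("192.168.16.0/24")
   VXLAN_CIDR = ipaddress.ip_network("192.168.30.0/24")

   def node_pod_subnet(node_id):
       # The node_id-th /PodNetworkPrefixLen slice of PodSubnetCIDR.
       return list(POD_SUBNET_CIDR.subnets(new_prefix=POD_NETWORK_PREFIX_LEN))[node_id]

   def node_vpp_host_subnet(node_id):
       return list(VPP_HOST_SUBNET_CIDR.subnets(new_prefix=VPP_HOST_NETWORK_PREFIX_LEN))[node_id]

   def node_interconnect_ip(node_id):
       # The host part of the address equals the Node ID.
       return NODE_INTERCONNECT_CIDR.network_address + node_id

   def node_vxlan_bvi_ip(node_id):
       return VXLAN_CIDR.network_address + node_id

   node_id = 5
   print(node_pod_subnet(node_id))       # 10.1.5.0/24
   print(node_vpp_host_subnet(node_id))  # 172.30.5.0/24
   print(node_interconnect_ip(node_id))  # 192.168.16.5
   print(node_vxlan_bvi_ip(node_id))     # 192.168.30.5
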
VPP Programming
---------------

This section describes how the Contiv/VPP control plane programs VPP,
based on the events it receives from k8s. This section is not necessary
for understanding basic Contiv/VPP operation, but is very useful for
debugging purposes.

Contiv/VPP currently uses a single VRF to forward the traffic between
PODs on a node, PODs on different nodes, the host network stack, and the
DPDK-managed data plane interface. The forwarding between all of them is
purely L3-based, even for communication between two PODs on the same
node.

DPDK-Managed Data Interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In order to allow inter-node communication between PODs on different
nodes and between PODs and the outside world, Contiv/VPP uses data plane
interfaces bound to VPP using DPDK. Each node should have one “main” VPP
interface, which is unbound from the host network stack and bound to
VPP. The Contiv/VPP control plane automatically configures the interface
either via DHCP, or with a statically assigned address (see the
``NodeInterconnectCIDR`` and ``NodeInterconnectDHCP`` YAML settings).

PODs on the Same Node
~~~~~~~~~~~~~~~~~~~~~

PODs are connected to VPP using virtio-based TAP interfaces created by
VPP, with the POD end of the interface placed into the POD container
network namespace. Each POD is assigned an IP address from the
``PodSubnetCIDR``. The allocated IP is configured with the prefix length
``/32``. Additionally, a static route pointing towards VPP is
configured in the POD network namespace. The prefix length ``/32`` means
that all IP traffic from the POD is forwarded via the default route,
i.e. to VPP. To get rid of unnecessary broadcasts between the POD and
VPP, a static ARP entry is configured for the gateway IP in the POD
namespace, as well as for the POD IP on VPP. Both ends of the TAP
interface have a static (non-default) MAC address applied.

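The POD-side setup described above can be illustrated with the equivalent
``ip`` commands. The interface name and all addresses/MACs in the sketch
below are hypothetical examples; Contiv/VPP applies this configuration
through its own APIs rather than by running these commands:

.. code-block:: python

   # Hypothetical example values: a POD IP from the node's POD subnet and a
   # gateway on the VPP end of the TAP (addressed from PodIfIPCIDR).
   pod_if = "eth0"                   # POD end of the TAP interface
   pod_ip = "10.1.5.3"               # from the node's POD subnet
   gw_ip = "10.2.1.1"                # VPP end of the TAP (PodIfIPCIDR)
   gw_mac = "02:fe:00:00:00:01"      # static MAC of the VPP end

   commands = [
       # The POD address is installed with a /32 prefix length.
       f"ip addr add {pod_ip}/32 dev {pod_if}",
       # All traffic leaves the POD via the default route pointing to VPP.
       f"ip route add default via {gw_ip} dev {pod_if} onlink",
       # A static ARP entry for the gateway avoids broadcasts towards VPP.
       f"ip neigh add {gw_ip} lladdr {gw_mac} dev {pod_if} nud permanent",
   ]
   print("\n".join(commands))
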
PODs with hostNetwork=true
~~~~~~~~~~~~~~~~~~~~~~~~~~

PODs with the ``hostNetwork=true`` attribute are not placed into a
separate network namespace; instead, they use the main host Linux network
namespace and are therefore not directly connected to VPP. They rely on
the interconnection between VPP and the host Linux network stack, which
is described in the next section. Note that when these PODs access a
service IP, their network communication is NATed in Linux (by iptables
rules programmed by kube-proxy), as opposed to in VPP, which is the case
for PODs connected to VPP directly.

Linux Host Network Stack
~~~~~~~~~~~~~~~~~~~~~~~~

In order to interconnect the Linux host network stack with VPP (to allow
access to the cluster resources from the host itself, as well as for the
PODs with ``hostNetwork=true``), VPP creates a TAP interface between VPP
and the main network namespace. The TAP interface is configured with IP
addresses from the ``VPPHostSubnetCIDR`` range, with ``.1`` in the last
octet on the VPP side, and ``.2`` on the host side. The name of the host
interface is ``vpp1``. The host has the following static routes pointing
to VPP configured:

- A route to the whole ``PodSubnetCIDR``, to route traffic targeting
  PODs towards VPP.
- A route to ``ServiceCIDR`` (default ``10.96.0.0/12``), to route
  service-IP-targeted traffic that has not been translated by kube-proxy
  for some reason towards VPP.

The host also has a static ARP entry configured for the IP of the VPP
end of the TAP interface, to get rid of unnecessary broadcasts between
the main network namespace and VPP.

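A sketch of the resulting host-side addressing and routes for ``NodeID = 5``
with the default config values (illustrative only; Contiv/VPP installs these
routes itself):

.. code-block:: python

   import ipaddress

   VPP_HOST_SUBNET_CIDR = ipaddress.ip_network("172.30.0.0/16")
   POD_SUBNET_CIDR = "10.1.0.0/16"
   SERVICE_CIDR = "10.96.0.0/12"

   node_id = 5
   host_subnet = list(VPP_HOST_SUBNET_CIDR.subnets(new_prefix=24))[node_id]
   vpp_side_ip = host_subnet.network_address + 1    # 172.30.5.1 (VPP end of the TAP)
   host_side_ip = host_subnet.network_address + 2   # 172.30.5.2 (host end, interface vpp1)

   # Static routes installed in the host network stack, both via the VPP end:
   print(f"ip route add {POD_SUBNET_CIDR} via {vpp_side_ip}")  # POD traffic -> VPP
   print(f"ip route add {SERVICE_CIDR} via {vpp_side_ip}")     # untranslated service traffic -> VPP
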
VXLANs to Other Nodes
~~~~~~~~~~~~~~~~~~~~~

In order to provide inter-node POD to POD connectivity via any underlay
network (not necessarily an L2 network), Contiv/VPP sets up a VXLAN
tunnel overlay between each pair of nodes within the cluster (full mesh).

All VXLAN tunnels are terminated in one bridge domain on each VPP. The
bridge domain has learning and flooding disabled; its l2fib contains a
static entry for each VXLAN tunnel. Each bridge domain has a BVI
interface, which interconnects the bridge domain with the main VRF (L3
forwarding). This interface needs a unique IP address, which is assigned
from the ``VxlanCIDR`` as described above.

The main VRF contains several static routes that point to the BVI IP
addresses of the other nodes. For each remote node, there is a route to
its POD subnet and to its VPP host subnet, as well as a route to its
management IP address. For each of these routes, the next hop IP is the
BVI interface IP of the remote node, and the traffic goes out via the
BVI interface of the local node.

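An illustrative enumeration of these routes, assuming the default IPAM
config and a 3-node cluster (the route to each remote node's management IP
is omitted, since it is not derived from the IPAM config):

.. code-block:: python

   import ipaddress

   POD_SUBNET_CIDR = ipaddress.ip_network("10.1.0.0/16")
   VPP_HOST_SUBNET_CIDR = ipaddress.ip_network("172.30.0.0/16")
   VXLAN_CIDR = ipaddress.ip_network("192.168.30.0/24")

   def remote_routes(local_id, all_ids):
       """Static routes the local node holds towards every other node."""
       routes = []
       for node_id in all_ids:
           if node_id == local_id:
               continue
           bvi_ip = VXLAN_CIDR.network_address + node_id       # remote VXLAN BVI IP
           pod_net = list(POD_SUBNET_CIDR.subnets(new_prefix=24))[node_id]
           host_net = list(VPP_HOST_SUBNET_CIDR.subnets(new_prefix=24))[node_id]
           routes.append((str(pod_net), str(bvi_ip)))           # remote POD subnet
           routes.append((str(host_net), str(bvi_ip)))          # remote VPP host subnet
       return routes

   for prefix, next_hop in remote_routes(local_id=1, all_ids=[1, 2, 3]):
       print(f"{prefix} via {next_hop}")
   # 10.1.2.0/24 via 192.168.30.2
   # 172.30.2.0/24 via 192.168.30.2
   # 10.1.3.0/24 via 192.168.30.3
   # 172.30.3.0/24 via 192.168.30.3
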
The VXLAN tunnels and the static routes pointing to them are
added/deleted on each VPP, whenever a node is added/deleted in the k8s
cluster.

More Info
~~~~~~~~~

Please refer to the `Packet Flow Dev Guide <../dev-guide/PACKET_FLOW.html>`__
for a more detailed description of the paths traversed by request and
response packets inside a Contiv/VPP Kubernetes cluster in different
situations.