| Contiv/VPP Network Operation |
| ============================ |
| |
| This document describes the network operation of the Contiv/VPP k8s |
| network plugin. It covers the operation and configuration options of |
| the Contiv IPAM and explains how VPP is programmed by the Contiv/VPP |
| control plane. |
| |
| The following picture shows a 2-node k8s deployment of Contiv/VPP, with |
| a VXLAN tunnel established between the nodes to forward inter-node POD |
| traffic. The IPAM options are depicted on Node 1, whereas the VPP |
| programming is depicted on Node 2. |
| |
| .. figure:: /_images/contiv-networking.png |
| :alt: contiv-networking.png |
| |
| Contiv/VPP Architecture |
| |
| Contiv/VPP IPAM (IP Address Management) |
| --------------------------------------- |
| |
| IPAM in Contiv/VPP is based on the concept of **Node ID**. The Node ID |
| is a number that uniquely identifies a node in the k8s cluster. The |
| first node is assigned the ID of 1, the second node 2, etc. If a node |
| leaves the cluster, its ID is released back to the pool and will be |
| re-used by the next node. |
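
The allocation behaviour described above can be sketched as a small ID pool. This is only an illustration: the ``NodeIDPool`` class and the node names are hypothetical, and handing out the lowest free ID first is an assumption, since the text only states that a released ID is reused by the next node.

```python
import heapq


class NodeIDPool:
    """Toy allocator (hypothetical): hands out the lowest free ID, starting at 1."""

    def __init__(self):
        self._next_unused = 1   # smallest ID that has never been handed out
        self._released = []     # min-heap of IDs returned by departed nodes
        self._by_node = {}      # node name -> assigned ID

    def allocate(self, node_name):
        # A node that already has an ID keeps it.
        if node_name in self._by_node:
            return self._by_node[node_name]
        if self._released:
            # Assumption: prefer the lowest released ID over a fresh one.
            node_id = heapq.heappop(self._released)
        else:
            node_id = self._next_unused
            self._next_unused += 1
        self._by_node[node_name] = node_id
        return node_id

    def release(self, node_name):
        # Return the ID of a departing node to the pool.
        node_id = self._by_node.pop(node_name)
        heapq.heappush(self._released, node_id)


pool = NodeIDPool()
ids = [pool.allocate(name) for name in ("node-a", "node-b", "node-c")]
pool.release("node-b")              # node-b leaves the cluster
reused = pool.allocate("node-d")    # the next node picks up the freed ID
print(ids, reused)
```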
| |
| The Node ID is used to calculate per-node IP subnets for PODs and other |
| internal subnets that need to be unique on each node. Apart from the |
| Node ID, the input for IPAM calculations is a set of config knobs, which |
| can be specified in the ``IPAMConfig`` section of the |
| `Contiv/VPP deployment YAML <../../../k8s/contiv-vpp.yaml>`_: |
| |
| - **PodSubnetCIDR** (default ``10.1.0.0/16``): each POD gets an IP |
| address assigned from this range. The size of this range (default |
| ``/16``) dictates the upper limit of the POD count for the entire k8s |
| cluster (default 65536 PODs). |
| |
| - **PodNetworkPrefixLen** (default ``24``): prefix length of the POD |
| subnet dedicated to each node. It determines how large a slice of the |
| range defined in ``PodSubnetCIDR`` is allocated to each node. With the |
| default value (``24``), each node gets a ``/24`` slice of the |
| ``PodSubnetCIDR``. The Node ID selects which slice a node receives. |
| With ``PodSubnetCIDR = 10.1.0.0/16``, ``PodNetworkPrefixLen = 24`` and |
| ``NodeID = 5``, the resulting POD subnet for the node would be |
| ``10.1.5.0/24``. |
| |
| - **PodIfIPCIDR** (default ``10.2.1.0/24``): VPP-internal addresses |
| used to put the VPP interfaces facing the PODs into L3 mode. This IP |
| range is reused on each node, so it is never addressable from outside |
| the node itself. The only requirement is that this subnet must not |
| collide with any other IPAM subnet. |
| |
| - **VPPHostSubnetCIDR** (default ``172.30.0.0/16``): used for |
| addressing the interconnect of VPP with the Linux network stack |
| within the same node. Since this subnet needs to be unique on each |
| node, the Node ID is used, in combination with |
| ``VPPHostNetworkPrefixLen``, ``PodSubnetCIDR`` and |
| ``PodNetworkPrefixLen``, to determine the actual subnet used on the |
| node. |
| |
| - **VPPHostNetworkPrefixLen** (default ``24``): used to calculate the |
| subnet for addressing the interconnect of VPP with the Linux network |
| stack, within the same node. With |
| ``VPPHostSubnetCIDR = 172.30.0.0/16``, |
| ``VPPHostNetworkPrefixLen = 24`` and ``NodeID = 5`` the resulting |
| subnet for the node would be ``172.30.5.0/24``. |
| |
| - **NodeInterconnectCIDR** (default ``192.168.16.0/24``): range for the |
| addresses assigned to the data plane interfaces managed by VPP. |
| Unless DHCP is used (``NodeInterconnectDHCP = True``), the Contiv/VPP |
| control plane automatically assigns an IP address from this range to |
| the DPDK-managed Ethernet interface bound to VPP on each node. The |
| actual IP address is calculated from the Node ID (e.g., with |
| ``NodeInterconnectCIDR = 192.168.16.0/24`` and ``NodeID = 5``, the |
| resulting IP address assigned to the Ethernet interface on VPP will |
| be ``192.168.16.5``). |
| |
| - **NodeInterconnectDHCP** (default ``False``): when enabled, the IP |
| addresses of the data plane interfaces managed by VPP are assigned by |
| DHCP instead of being allocated from ``NodeInterconnectCIDR`` by the |
| Contiv/VPP control plane. A DHCP server must be running in the |
| network to which the data plane interface is connected. If |
| ``NodeInterconnectDHCP = True``, ``NodeInterconnectCIDR`` is ignored. |
| |
| - **VxlanCIDR** (default ``192.168.30.0/24``): in order to provide |
| inter-node POD to POD connectivity via any underlay network (not |
| necessarily an L2 network), Contiv/VPP sets up a VXLAN tunnel overlay |
| between each pair of nodes within the cluster. Each node needs a |
| unique IP address for its VXLAN BVI interface. This IP address is |
| automatically calculated from the Node ID (e.g., with |
| ``VxlanCIDR = 192.168.30.0/24`` and ``NodeID = 5``, the resulting IP |
| address assigned to the VXLAN BVI interface will be |
| ``192.168.30.5``). |
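
The per-node calculations in the list above can be reproduced with Python's standard ``ipaddress`` module. This is a minimal sketch that mirrors the worked examples in this section, not the actual Contiv/VPP implementation: the helper names ``node_subnet`` and ``node_address`` are hypothetical, only IPv4 is handled, and using the Node ID directly as the subnet index or host offset is an assumption taken from the examples.

```python
import ipaddress


def node_subnet(cidr, prefix_len, node_id):
    """Return the subnet of size /prefix_len inside cidr at index node_id."""
    net = ipaddress.IPv4Network(cidr)
    if prefix_len < net.prefixlen:
        raise ValueError("per-node prefix must not be shorter than the CIDR prefix")
    slots = 2 ** (prefix_len - net.prefixlen)   # how many per-node subnets fit
    if not 0 <= node_id < slots:
        raise ValueError("node ID does not fit into the configured range")
    size = 2 ** (32 - prefix_len)               # addresses per node subnet
    base = int(net.network_address) + node_id * size
    return ipaddress.IPv4Network((base, prefix_len))


def node_address(cidr, node_id):
    """Return the host address at offset node_id inside cidr."""
    net = ipaddress.IPv4Network(cidr)
    # Exclude the network and broadcast addresses.
    if not 0 < node_id < net.num_addresses - 1:
        raise ValueError("node ID does not fit into the configured range")
    return net.network_address + node_id


node_id = 5
pod_subnet = node_subnet("10.1.0.0/16", 24, node_id)        # PodSubnetCIDR
host_subnet = node_subnet("172.30.0.0/16", 24, node_id)     # VPPHostSubnetCIDR
interconnect_ip = node_address("192.168.16.0/24", node_id)  # NodeInterconnectCIDR
vxlan_bvi_ip = node_address("192.168.30.0/24", node_id)     # VxlanCIDR

print(pod_subnet, host_subnet, interconnect_ip, vxlan_bvi_ip)
```

With the default settings and ``NodeID = 5``, the sketch yields the same values as the examples given for each option.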
| |
| VPP Programming |
| --------------- |
| |
| This section describes how the Contiv/VPP control plane programs VPP, |
| based on the events it receives from k8s. This section is not |
| necessary for understanding basic Contiv/VPP operation, but it is very |
| useful for debugging purposes. |
| |
| Contiv/VPP currently uses a single VRF to forward the traffic between |
| PODs on a node, PODs on different nodes, the host network stack, and |
| the DPDK-managed data plane interface. The forwarding between each of |
| them is purely L3-based, even for communication between 2 PODs within |
| the same node. |
| |
| DPDK-Managed Data Interface |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| In order to allow inter-node communication between PODs on different |
| nodes and between PODs and the outside world, Contiv/VPP uses data |
| plane interfaces bound to VPP using DPDK. Each node should have one |
| “main” VPP interface, which is unbound from the host network stack and |
| bound to VPP. The Contiv/VPP control plane automatically configures the |
| interface either via DHCP, or with a statically assigned address (see |
| the ``NodeInterconnectCIDR`` and ``NodeInterconnectDHCP`` YAML |
| settings). |
| |
| PODs on the Same Node |
| ~~~~~~~~~~~~~~~~~~~~~ |
| |
| PODs are connected to VPP using virtio-based TAP interfaces created by |
| VPP, with the POD end of the interface placed into the POD container |
| network namespace. Each POD is assigned an IP address from the |
| ``PodSubnetCIDR``. The allocated IP is configured with the prefix length |
| ``/32``. Additionally, a static route pointing towards VPP is |
| configured in the POD network namespace. With the ``/32`` prefix length, |
| no other address is on-link, so all IP traffic is forwarded via the |
| default route to VPP. To avoid unnecessary broadcasts between the POD |
| and VPP, a static ARP entry is configured for the gateway IP in the POD |
| namespace, as well as for the POD IP on VPP. Both ends of the TAP |
| interface have a static (non-default) MAC address applied. |
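
The effect of the ``/32`` prefix length can be checked with Python's ``ipaddress`` module. The addresses below are hypothetical examples taken from a ``10.1.5.0/24`` node subnet. The sketch only shows which destinations count as on-link for the POD interface; it does not configure anything.

```python
import ipaddress

# Hypothetical POD address, configured with /32 as described above.
pod_if = ipaddress.ip_interface("10.1.5.7/32")
# Hypothetical second POD on the same node.
peer = ipaddress.ip_address("10.1.5.8")

# With /32, the interface's network contains only the POD's own address,
# so the peer is not on-link and traffic to it must go through the gateway.
on_link_with_32 = peer in pod_if.network

# For comparison: with /24 the peer would be on-link, and the POD would
# try to resolve it directly with ARP instead of sending the packet to VPP.
on_link_with_24 = peer in ipaddress.ip_interface("10.1.5.7/24").network

print(pod_if.network.num_addresses, on_link_with_32, on_link_with_24)
```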
| |
| PODs with hostNetwork=true |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| PODs with a ``hostNetwork=true`` attribute are not placed into a |
| separate network namespace. Instead, they use the main host Linux |
| network namespace; therefore, they are not directly connected to VPP. |
| They rely on the interconnection between VPP and the host Linux network |
| stack, which is described in the next section. Note that when these |
| PODs access a service IP, their network communication is NATed in |
| Linux (by iptables rules programmed by kube-proxy) rather than in VPP, |
| which is the case for the PODs connected to VPP directly. |
| |
| Linux Host Network Stack |
| ~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| In order to interconnect the Linux host network stack with VPP (to allow |
| access to the cluster resources from the host itself, as well as for the |
| PODs with ``hostNetwork=true``), VPP creates a TAP interface between VPP |
| and the main network namespace. The TAP interface is configured with IP |
| addresses from the ``VPPHostSubnetCIDR`` range, with ``.1`` in the last |
| octet on the VPP side and ``.2`` on the host side. The name of the host |
| interface is ``vpp1``. The host has the following static routes |
| pointing to VPP: |
| |
| - A route to the whole ``PodSubnetCIDR``, to route traffic targeting |
| PODs towards VPP. |
| |
| - A route to ``ServiceCIDR`` (default ``10.96.0.0/12``), to route |
| service IP targeted traffic that has not been translated by |
| kube-proxy for some reason towards VPP. |
| |
| The host also has a static ARP entry configured for the IP of the |
| VPP-end TAP interface, to avoid unnecessary broadcasts between the main |
| network namespace and VPP. |
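
The host-side configuration described above can be summarised as a small plan derived from the node's VPP host subnet. This is a sketch only: the ``host_interconnect`` function and the layout of the returned dictionary are hypothetical, the default CIDRs are the ones quoted in this document, and the result merely lists what would be configured.

```python
import ipaddress


def host_interconnect(vpp_host_subnet,
                      pod_subnet_cidr="10.1.0.0/16",
                      service_cidr="10.96.0.0/12"):
    """Describe the VPP <-> Linux host interconnect for one node."""
    net = ipaddress.ip_network(vpp_host_subnet)
    vpp_ip = net.network_address + 1    # .1 on the VPP side of the TAP
    host_ip = net.network_address + 2   # .2 on the host side of the TAP
    return {
        "host_interface": "vpp1",
        "vpp_side": str(vpp_ip),
        "host_side": str(host_ip),
        # Static routes in the host namespace, both pointing to VPP.
        "host_routes": [
            (str(ipaddress.ip_network(pod_subnet_cidr)), str(vpp_ip)),
            (str(ipaddress.ip_network(service_cidr)), str(vpp_ip)),
        ],
        # The host keeps a static ARP entry for this IP.
        "host_static_arp_for": str(vpp_ip),
    }


# VPP host subnet of the node with NodeID = 5 under the default settings.
plan = host_interconnect("172.30.5.0/24")
for key, value in plan.items():
    print(key, value)
```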
| |
| VXLANs to Other Nodes |
| ~~~~~~~~~~~~~~~~~~~~~ |
| |
| In order to provide inter-node POD to POD connectivity via any underlay |
| network (not necessarily an L2 network), Contiv/VPP sets up a VXLAN |
| tunnel overlay between each pair of nodes within the cluster (full |
| mesh). |
| |
| All VXLAN tunnels are terminated in one bridge domain on each VPP. The |
| bridge domain has learning and flooding disabled, and its l2fib |
| contains a static entry for each VXLAN tunnel. Each bridge domain has a |
| BVI interface, which interconnects the bridge domain with the main VRF |
| (L3 forwarding). This interface needs a unique IP address, which is |
| assigned from the ``VxlanCIDR`` as described above. |
| |
| The main VRF contains several static routes that point to the BVI IP |
| addresses of other nodes. For each remote node, there is a route to its |
| POD subnet and its VPP host subnet, as well as a route to its |
| management IP address. For each of these routes, the next hop IP is the |
| BVI interface IP of the remote node, reached via the BVI interface of |
| the local node. |
| |
| The VXLAN tunnels and the static routes pointing to them are added or |
| deleted on each VPP whenever a node is added to or deleted from the k8s |
| cluster. |
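
The full mesh and the per-node static routes can be enumerated as follows. This is a sketch under stated assumptions rather than the control plane's code: the function names and the management IPs are hypothetical, only IPv4 is handled, the default CIDRs from the IPAM section are used, and the Node ID is mapped to subnets and BVI addresses as in the IPAM examples.

```python
import ipaddress
from itertools import combinations


def nth_subnet(cidr, prefix_len, index):
    """Return the subnet of size /prefix_len inside cidr at the given index."""
    net = ipaddress.IPv4Network(cidr)
    base = int(net.network_address) + index * 2 ** (32 - prefix_len)
    return ipaddress.IPv4Network((base, prefix_len))


def vxlan_tunnels(node_ids):
    """One tunnel per pair of nodes (full mesh)."""
    return list(combinations(sorted(node_ids), 2))


def remote_routes(local_id, mgmt_ips,
                  pod_cidr="10.1.0.0/16", host_cidr="172.30.0.0/16",
                  vxlan_cidr="192.168.30.0/24", prefix_len=24):
    """Static routes in the main VRF of node local_id, as (destination, next hop)."""
    vxlan_base = ipaddress.IPv4Network(vxlan_cidr).network_address
    routes = []
    for remote_id in sorted(mgmt_ips):
        if remote_id == local_id:
            continue
        # Next hop is the remote node's BVI IP, reached via the local BVI.
        next_hop = str(vxlan_base + remote_id)
        routes.append((str(nth_subnet(pod_cidr, prefix_len, remote_id)), next_hop))
        routes.append((str(nth_subnet(host_cidr, prefix_len, remote_id)), next_hop))
        routes.append((mgmt_ips[remote_id] + "/32", next_hop))
    return routes


# Hypothetical management IPs keyed by Node ID.
mgmt_ips = {1: "10.0.0.1", 2: "10.0.0.2", 3: "10.0.0.3"}
tunnels = vxlan_tunnels(mgmt_ips)
routes = remote_routes(1, mgmt_ips)
print(tunnels)
for destination, next_hop in routes:
    print(destination, "via", next_hop)
```

For a cluster of ``n`` nodes this gives ``n * (n - 1) / 2`` tunnels in total, and three routes per remote node on each VPP.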
| |
| More Info |
| ~~~~~~~~~ |
| |
| Please refer to the |
| `Packet Flow Dev Guide <../dev-guide/PACKET_FLOW.html>`_ for a more |
| detailed description of the paths traversed by request and response |
| packets inside a Contiv/VPP Kubernetes cluster in different situations. |