Contiv/VPP Network Operation
============================
This document describes the network operation of the Contiv/VPP k8s
network plugin. It covers the operation and configuration options of
Contiv IPAM, and describes how VPP is programmed by the Contiv/VPP
control plane.
The following picture shows a 2-node k8s deployment of Contiv/VPP, with a
VXLAN tunnel established between the nodes to forward inter-node POD
traffic. The IPAM options are depicted on Node 1, whereas the VPP
programming is depicted on Node 2.

.. figure:: /_images/contiv-networking.png
   :alt: contiv-networking.png

   Contiv/VPP Architecture
Contiv/VPP IPAM (IP Address Management)
---------------------------------------
IPAM in Contiv/VPP is based on the concept of **Node ID**. The Node ID
is a number that uniquely identifies a node in the k8s cluster. The
first node is assigned the ID of 1, the second node 2, etc. If a node
leaves the cluster, its ID is released back to the pool and will be
re-used by the next node.
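
The allocate-and-reuse behaviour described above can be sketched as a
small pool. This is only an illustration: the ``NodeIDPool`` class and
the node names are hypothetical, and the sketch assumes that a new node
receives the lowest free ID, which this document does not state
explicitly.

.. code-block:: python

   class NodeIDPool:
       """Toy allocator that hands out the lowest free positive integer."""

       def __init__(self):
           self._ids = {}  # node name -> Node ID

       def allocate(self, node_name):
           # A node that is already known keeps its existing ID.
           if node_name in self._ids:
               return self._ids[node_name]
           used = set(self._ids.values())
           node_id = 1
           while node_id in used:
               node_id += 1
           self._ids[node_name] = node_id
           return node_id

       def release(self, node_name):
           # The freed ID becomes available to the next allocate() call.
           self._ids.pop(node_name, None)


   pool = NodeIDPool()
   ids = [pool.allocate(name) for name in ("node-a", "node-b", "node-c")]
   pool.release("node-b")            # node-b leaves the cluster
   reused = pool.allocate("node-d")  # node-d joins and takes over ID 2
   print(ids, reused)

In this sketch the three initial nodes get IDs 1, 2 and 3, and the node
that joins after ``node-b`` has left is assigned the released ID 2.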
The Node ID is used to calculate per-node IP subnets for PODs and other
internal subnets that need to be unique on each node. Apart from the
Node ID, the input for IPAM calculations is a set of config knobs, which
can be specified in the ``IPAMConfig`` section of the `Contiv/VPP
deployment YAML <../../../k8s/contiv-vpp.yaml>`__:
- **PodSubnetCIDR** (default ``10.1.0.0/16``): each pod gets an IP
  address assigned from this range. The size of this range (default
  ``/16``) dictates the upper limit of the POD count for the entire k8s
  cluster (default 65536 PODs).
- **PodNetworkPrefixLen** (default ``24``): prefix length of the
  per-node dedicated POD subnet. This value dictates how much of the
  allocatable range defined in ``PodSubnetCIDR`` is allocated to each
  node. With the default value (``24``), each node has a ``/24`` slice
  of the ``PodSubnetCIDR``, and the Node ID selects which slice the
  node gets. For example, with ``PodSubnetCIDR = 10.1.0.0/16``,
  ``PodNetworkPrefixLen = 24`` and ``NodeID = 5``, the resulting POD
  subnet for the node is ``10.1.5.0/24``.
- **PodIfIPCIDR** (default ``10.2.1.0/24``): VPP-internal addresses
  used to put the VPP interfaces facing towards the PODs into L3 mode.
  This IP range is reused on each node, so it is never addressable
  outside of the node itself. The only requirement is that this subnet
  must not collide with any other IPAM subnet.
- **VPPHostSubnetCIDR** (default ``172.30.0.0/16``): used for
addressing the interconnect of VPP with the Linux network stack,
within the same node. Since this subnet needs to be unique on each
node, the Node ID is used to determine the actual subnet used on the
node with the combination of ``VPPHostNetworkPrefixLen``,
``PodSubnetCIDR`` and ``PodNetworkPrefixLen``.
- **VPPHostNetworkPrefixLen** (default ``24``): used to calculate the
subnet for addressing the interconnect of VPP with the Linux network
stack, within the same node. With
``VPPHostSubnetCIDR = 172.30.0.0/16``,
``VPPHostNetworkPrefixLen = 24`` and ``NodeID = 5`` the resulting
subnet for the node would be ``172.30.5.0/24``.
- **NodeInterconnectCIDR** (default ``192.168.16.0/24``): range for the
  addresses assigned to the data plane interfaces managed by VPP.
  Unless DHCP is used (``NodeInterconnectDHCP = True``), the Contiv/VPP
  control plane automatically assigns an IP address from this range to
  the DPDK-managed Ethernet interface bound to VPP on each node. The
  actual IP address is calculated from the Node ID (e.g., with
  ``NodeInterconnectCIDR = 192.168.16.0/24`` and ``NodeID = 5``, the
  resulting IP address assigned to the Ethernet interface on VPP will
  be ``192.168.16.5``).
- **NodeInterconnectDHCP** (default ``False``): when set to ``True``,
  the IP addresses of the data plane interfaces managed by VPP are
  assigned by DHCP instead of being assigned from
  ``NodeInterconnectCIDR`` by the Contiv/VPP control plane. A DHCP
  server must be running in the network to which the data plane
  interface is connected. If ``NodeInterconnectDHCP = True``,
  ``NodeInterconnectCIDR`` is ignored.
- **VxlanCIDR** (default ``192.168.30.0/24``): in order to provide
  inter-node POD to POD connectivity via any underlay network (not
  necessarily an L2 network), Contiv/VPP sets up a VXLAN tunnel overlay
  between each pair of nodes within the cluster. Each node needs a
  unique IP address for its VXLAN BVI interface. This IP address is
  automatically calculated from the Node ID (e.g., with
  ``VxlanCIDR = 192.168.30.0/24`` and ``NodeID = 5``, the resulting IP
  address assigned to the VXLAN BVI interface will be
  ``192.168.30.5``).
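
The per-node calculations in the list above can be reproduced with
Python's standard ``ipaddress`` module. This is a minimal sketch that
matches the examples given for ``NodeID = 5``; the helper names
``node_subnet`` and ``node_address`` are hypothetical, and the actual
Contiv/VPP IPAM code may handle edge cases (such as Node IDs that do not
fit into the configured range) differently.

.. code-block:: python

   import ipaddress


   def node_subnet(cidr, prefix_len, node_id):
       """Return the per-node slice of `cidr` selected by `node_id` (IPv4)."""
       base = ipaddress.ip_network(cidr)
       # Each slice spans 2**(32 - prefix_len) addresses.
       offset = node_id << (32 - prefix_len)
       subnet = ipaddress.ip_network(
           (int(base.network_address) + offset, prefix_len)
       )
       if not subnet.subnet_of(base):
           raise ValueError(f"node ID {node_id} does not fit into {cidr}")
       return subnet


   def node_address(cidr, node_id):
       """Return the single address inside `cidr` selected by `node_id`."""
       base = ipaddress.ip_network(cidr)
       address = base.network_address + node_id
       if address not in base:
           raise ValueError(f"node ID {node_id} does not fit into {cidr}")
       return address


   node_id = 5
   print(node_subnet("10.1.0.0/16", 24, node_id))     # 10.1.5.0/24
   print(node_subnet("172.30.0.0/16", 24, node_id))   # 172.30.5.0/24
   print(node_address("192.168.16.0/24", node_id))    # 192.168.16.5
   print(node_address("192.168.30.0/24", node_id))    # 192.168.30.5

The four printed values correspond to the POD subnet, the VPP-host
subnet, the node interconnect IP and the VXLAN BVI IP of node 5.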
VPP Programming
---------------
This section describes how the Contiv/VPP control plane programs VPP,
based on the events it receives from k8s. This section is not
required for understanding basic Contiv/VPP operation, but is very
useful for debugging purposes.
Contiv/VPP currently uses a single VRF to forward the traffic between
PODs on a node, PODs on different nodes, the host network stack, and
the DPDK-managed dataplane interface. The forwarding between each of
them is purely L3-based, even for communication between 2 PODs within
the same node.
DPDK-Managed Data Interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~
In order to allow inter-node communication between PODs on different
nodes and between PODs and the outside world, Contiv/VPP uses data-plane
interfaces bound to VPP using DPDK. Each node should have one main VPP
interface, which is unbound from the host network stack and bound to
VPP. The Contiv/VPP control plane automatically configures the interface
either via DHCP, or with a statically assigned address (see the
``NodeInterconnectCIDR`` and ``NodeInterconnectDHCP`` YAML settings).
PODs on the Same Node
~~~~~~~~~~~~~~~~~~~~~
PODs are connected to VPP using virtio-based TAP interfaces created by
VPP, with the POD-end of the interface placed into the POD container
network namespace. Each POD is assigned an IP address from the
``PodSubnetCIDR``. The allocated IP is configured with the prefix length
``/32``. Additionally, a static route pointing towards VPP is
configured in the POD network namespace. The prefix length ``/32`` means
that all IP traffic is forwarded via the default route, that is, to VPP.
To get rid of unnecessary broadcasts between the POD and VPP, a static
ARP entry is configured for the gateway IP in the POD namespace, as well
as for the POD IP on VPP. Both ends of the TAP interface have a static
(non-default) MAC address applied.
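
For debugging, it can help to picture the POD-side configuration as the
roughly equivalent ``iproute2`` commands. The sketch below only builds
those command strings. The function name ``pod_side_commands``, the
interface name ``eth0``, and the gateway IP and MAC values are
assumptions made for illustration; this document does not specify them,
nor does it state that the control plane uses these commands to apply
the configuration.

.. code-block:: python

   def pod_side_commands(pod_ip, gateway_ip, gateway_mac, ifname="eth0"):
       """Build iproute2 commands mirroring the POD namespace setup."""
       return [
           # POD address with a /32 prefix: no on-link neighbours.
           f"ip addr add {pod_ip}/32 dev {ifname}",
           # Make the gateway reachable directly over the TAP interface.
           f"ip route add {gateway_ip} dev {ifname} scope link",
           # Send all other traffic to VPP.
           f"ip route add default via {gateway_ip} dev {ifname}",
           # Static ARP entry, so the POD never broadcasts for the gateway.
           f"ip neigh add {gateway_ip} lladdr {gateway_mac} "
           f"dev {ifname} nud permanent",
       ]


   # Hypothetical values for a POD on the node with NodeID = 5.
   commands = pod_side_commands("10.1.5.3", "10.1.5.1", "02:00:00:00:00:01")
   for command in commands:
       print(command)

Comparing this list with the output of ``ip addr``, ``ip route`` and
``ip neigh`` inside a POD network namespace is one way to check that
the POD was wired as described.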
PODs with hostNetwork=true
~~~~~~~~~~~~~~~~~~~~~~~~~~
PODs with a ``hostNetwork=true`` attribute are not placed into a
separate network namespace. Instead, they use the main host Linux
network namespace; therefore, they are not directly connected to VPP.
They rely on the interconnection between VPP and the host Linux network
stack, which is described in the next section. Note that when these PODs
access a service IP, their network communication is NATed in Linux (by
iptables rules programmed by kube-proxy) rather than in VPP, which is
where NAT happens for the PODs connected to VPP directly.
Linux Host Network Stack
~~~~~~~~~~~~~~~~~~~~~~~~
In order to interconnect the Linux host network stack with VPP (to allow
access to the cluster resources from the host itself, as well as for the
PODs with ``hostNetwork=true``), VPP creates a TAP interface between VPP
and the main network namespace. The TAP interface is configured with IP
addresses from the ``VPPHostSubnetCIDR`` range, with ``.1`` in the
last octet on the VPP side, and ``.2`` on the host side. The name of
the host interface is ``vpp1``. The host has the following static
routes pointing to VPP:

- A route to the whole ``PodSubnetCIDR``, to route traffic targeting
  PODs towards VPP.
- A route to ``ServiceCIDR`` (default ``10.96.0.0/12``), to route
  service IP targeted traffic that has not been translated by
  kube-proxy for some reason towards VPP.

The host also has a static ARP entry configured for the IP of the
VPP-end TAP interface, to get rid of unnecessary broadcasts between the
main network namespace and VPP.
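
The addressing of the VPP-to-host interconnect can be sketched as
follows, using the default values quoted in this document. The function
name ``host_interconnect`` is hypothetical, and using the VPP-side TAP
address as the next hop of the host routes is an assumption; the text
above only says that the routes point to VPP.

.. code-block:: python

   import ipaddress


   def host_interconnect(node_id,
                         vpp_host_cidr="172.30.0.0/16",
                         vpp_host_prefix_len=24,
                         pod_cidr="10.1.0.0/16",
                         service_cidr="10.96.0.0/12"):
       """Derive TAP addresses and host routes for one node (IPv4)."""
       base = ipaddress.ip_network(vpp_host_cidr)
       # Select the per-node slice of VPPHostSubnetCIDR by Node ID.
       offset = node_id << (32 - vpp_host_prefix_len)
       subnet = ipaddress.ip_network(
           (int(base.network_address) + offset, vpp_host_prefix_len)
       )
       vpp_ip = subnet.network_address + 1   # VPP side of the TAP
       host_ip = subnet.network_address + 2  # host side (interface vpp1)
       routes = [
           f"{cidr} via {vpp_ip} dev vpp1"
           for cidr in (pod_cidr, service_cidr)
       ]
       return {
           "vpp_ip": str(vpp_ip),
           "host_ip": str(host_ip),
           "host_routes": routes,
       }


   config = host_interconnect(node_id=5)
   print(config["vpp_ip"], config["host_ip"])
   for route in config["host_routes"]:
       print(route)

For ``NodeID = 5`` this yields ``172.30.5.1`` on the VPP side and
``172.30.5.2`` on the host side, with both host routes using
``172.30.5.1`` as the next hop.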
VXLANs to Other Nodes
~~~~~~~~~~~~~~~~~~~~~
In order to provide inter-node POD to POD connectivity via any underlay
network (not necessarily an L2 network), Contiv/VPP sets up a VXLAN
tunnel overlay between each pair of nodes within the cluster (full mesh).
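
To get a feel for how the full mesh grows with the cluster size, the
node pairs can be enumerated with ``itertools.combinations``. This is
an illustrative sketch only; the function name ``vxlan_mesh`` is
hypothetical.

.. code-block:: python

   from itertools import combinations


   def vxlan_mesh(node_ids):
       """Return every unordered pair of nodes that shares a tunnel."""
       return list(combinations(sorted(node_ids), 2))


   pairs = vxlan_mesh([1, 2, 3, 4])
   # Number of tunnels terminated on node 1 (one per remote node).
   tunnels_on_node_1 = sum(1 for pair in pairs if 1 in pair)
   print(len(pairs), tunnels_on_node_1)

A cluster of ``n`` nodes therefore has ``n * (n - 1) / 2`` node pairs,
and each node terminates ``n - 1`` tunnels: 6 pairs and 3 tunnels per
node in the 4-node example.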
All VXLAN tunnels are terminated in one bridge domain on each VPP. The
bridge domain has learning and flooding disabled, and the l2fib of the
bridge domain contains a static entry for each VXLAN tunnel. Each bridge
domain has a BVI interface, which interconnects the bridge domain with
the main VRF (L3 forwarding). This interface needs a unique IP address,
which is assigned from the ``VxlanCIDR`` as described above.
The main VRF contains several static routes that point to the BVI IP
addresses of other nodes. For each remote node, there is a route to the
POD subnet and the VPP-host subnet of that node, as well as a route to
its management IP address. For each of these routes, the next hop IP
is the BVI interface IP of the remote node, which is reached via the BVI
interface of the local node.
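
The static routes installed for one remote node can be sketched as
below, reusing the default IPAM values from this document. The function
name ``remote_node_routes`` and the management IP ``10.0.0.12`` are
hypothetical, and the sketch assumes the per-node subnets are derived
from the Node ID as in the IPAM examples.

.. code-block:: python

   import ipaddress


   def remote_node_routes(remote_id, mgmt_ip,
                          pod_cidr="10.1.0.0/16", pod_prefix_len=24,
                          vpp_host_cidr="172.30.0.0/16",
                          vpp_host_prefix_len=24,
                          vxlan_cidr="192.168.30.0/24"):
       """Return (destination, next hop) pairs for one remote node."""

       def node_slice(cidr, prefix_len):
           # Per-node slice of `cidr`, selected by the remote Node ID.
           base = ipaddress.ip_network(cidr)
           offset = remote_id << (32 - prefix_len)
           return ipaddress.ip_network(
               (int(base.network_address) + offset, prefix_len)
           )

       # The next hop is the BVI IP of the remote node.
       next_hop = ipaddress.ip_network(vxlan_cidr).network_address + remote_id
       destinations = [
           node_slice(pod_cidr, pod_prefix_len),
           node_slice(vpp_host_cidr, vpp_host_prefix_len),
           ipaddress.ip_network(f"{mgmt_ip}/32"),
       ]
       return [(str(dst), str(next_hop)) for dst in destinations]


   routes = remote_node_routes(remote_id=2, mgmt_ip="10.0.0.12")
   for destination, next_hop in routes:
       print(f"{destination} via {next_hop}")

For a remote node with ``NodeID = 2`` this gives routes to
``10.1.2.0/24``, ``172.30.2.0/24`` and ``10.0.0.12/32``, all via
``192.168.30.2``.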
The VXLAN tunnels and the static routes pointing to them are
added/deleted on each VPP, whenever a node is added/deleted in the k8s
cluster.
More Info
~~~~~~~~~
Please refer to the `Packet Flow Dev
Guide <../dev-guide/PACKET_FLOW.html>`__ for a more detailed description
of the paths traversed by request and response packets inside a
Contiv/VPP Kubernetes cluster under different situations.