Deployment description: Briefly describe the deployment: where the issue was spotted, the number of k8s nodes, and whether DHCP/STN/TAP is used.
Logs: Attach the corresponding logs, at least from the vswitch pods.
VPP config: Attach the output of the `show` commands listed below.
Since Contiv-VPP can be used with different configurations, it is helpful to attach the config that was applied: either the values.yaml passed to the helm chart, or the corresponding part of the deployment yaml file. For example:
```yaml
contiv.yaml: |-
  TCPstackDisabled: true
  UseTAPInterfaces: true
  TAPInterfaceVersion: 2
  NatExternalTraffic: true
  MTUSize: 1500
  IPAMConfig:
    PodSubnetCIDR: 10.1.0.0/16
    PodNetworkPrefixLen: 24
    PodIfIPCIDR: 10.2.1.0/24
    VPPHostSubnetCIDR: 172.30.0.0/16
    VPPHostNetworkPrefixLen: 24
    NodeInterconnectCIDR: 192.168.16.0/24
    VxlanCIDR: 192.168.30.0/24
    NodeInterconnectDHCP: False
```
Information that might be helpful:
kubectl get pods -o wide --all-namespaces
The most essential thing that needs to be done when debugging and reporting an issue in Contiv-VPP is collecting the logs from the contiv-vpp vswitch containers.
In order to collect the logs from individual vswitches in the cluster, connect to the master node and then find the POD names of the individual vswitch containers:
```
$ kubectl get pods --all-namespaces | grep vswitch
kube-system   contiv-vswitch-lqxfp   2/2       Running   0          1h
kube-system   contiv-vswitch-q6kwt   2/2       Running   0          1h
```
Then run the following command, replacing `<pod name>` with the actual POD name:
$ kubectl logs <pod name> -n kube-system -c contiv-vswitch
Redirect the output to a file to save the logs, for example:
kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-master.txt
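If the cluster has many nodes, a small loop saves time. A minimal sketch (not part of the official tooling), assuming kubectl is configured on the master node:

```bash
# Collect logs from every vswitch pod in one pass; one output file per pod.
for pod in $(kubectl get pods -n kube-system -o name | grep contiv-vswitch); do
  name=${pod#pod/}
  kubectl logs "$name" -n kube-system -c contiv-vswitch > "logs-$name.txt"
done
```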
If the kubectl-based approach above does not work, you can still collect the same logs using the plain docker command. For that, you need to connect to each individual node in the k8s cluster and find the container ID of the vswitch container:
```
$ docker ps | grep contivvpp/vswitch
b682b5837e52   contivvpp/vswitch   "/usr/bin/supervisor…"   2 hours ago   Up 2 hours   k8s_contiv-vswitch_contiv-vswitch-q6kwt_kube-system_d09b6210-2903-11e8-b6c9-08002723b076_0
```
Now use the ID from the first column to dump the logs into the logs-master.txt file:
$ docker logs b682b5837e52 > logs-master.txt
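If you prefer not to copy the ID by hand, docker can filter by image. A hypothetical one-liner, assuming a single vswitch container runs on the node:

```bash
# Find the vswitch container by its image and dump its logs.
CID=$(docker ps -q --filter ancestor=contivvpp/vswitch | head -n 1)
docker logs "$CID" > logs-master.txt
```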
In order to debug an issue, it is good to start by grepping the logs for the `level=error` string, for example:
$ cat logs-master.txt | grep level=error
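A few lines of context around each error often make the failure sequence easier to follow; plain GNU grep supports this:

```bash
# Show 2 lines before and 5 lines after each error entry.
grep -B 2 -A 5 'level=error' logs-master.txt
```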
Also, some bugs may cause VPP or the contiv-agent to crash. To check whether a process has crashed, grep for the string `exit`, for example:

```
$ cat logs-master.txt | grep exit
2018-03-20 06:03:45,948 INFO exited: vpp (terminated by SIGABRT (core dumped); not expected)
2018-03-20 06:03:48,948 WARN received SIGTERM indicating exit request
```
In STN (Steal The NIC) deployment scenarios, you often need to collect and review the logs from the STN daemon. This needs to be done on each node:
$ docker logs contiv-stn > logs-stn-master.txt
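The per-node collection can be scripted over ssh. A sketch with placeholder node names (replace them with your actual hosts):

```bash
# Collect the STN daemon logs from each node into a separate file.
for node in k8s-master k8s-worker1 k8s-worker2; do
  ssh "$node" docker logs contiv-stn > "logs-stn-$node.txt" 2>&1
done
```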
If the vswitch is crashing in a loop (indicated by an increasing number in the `RESTARTS` column of the `kubectl get pods --all-namespaces` output), `kubectl logs` or `docker logs` would only give us the logs of the latest incarnation of the vswitch. That might not contain the root cause of the very first crash, so in order to debug it, we need to disable the k8s health-check probes so that the vswitch is not restarted after the very first crash. This can be done by commenting out the `readinessProbe` and `livenessProbe` in the contiv-vpp deployment YAML:
```diff
diff --git a/k8s/contiv-vpp.yaml b/k8s/contiv-vpp.yaml
index 3676047..ffa4473 100644
--- a/k8s/contiv-vpp.yaml
+++ b/k8s/contiv-vpp.yaml
@@ -224,18 +224,18 @@ spec:
           ports:
             # readiness + liveness probe
             - containerPort: 9999
-          readinessProbe:
-            httpGet:
-              path: /readiness
-              port: 9999
-            periodSeconds: 1
-            initialDelaySeconds: 15
-          livenessProbe:
-            httpGet:
-              path: /liveness
-              port: 9999
-            periodSeconds: 1
-            initialDelaySeconds: 60
+          # readinessProbe:
+          #   httpGet:
+          #     path: /readiness
+          #     port: 9999
+          #   periodSeconds: 1
+          #   initialDelaySeconds: 15
+          # livenessProbe:
+          #   httpGet:
+          #     path: /liveness
+          #     port: 9999
+          #   periodSeconds: 1
+          #   initialDelaySeconds: 60
           env:
             - name: MICROSERVICE_LABEL
               valueFrom:
```
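As a lighter-weight alternative (standard kubectl behavior, not specific to Contiv-VPP), the logs of the immediately previous incarnation of a crashed container can be fetched without disabling the probes; this will not reach back to the very first crash, but may be enough when every crash looks the same:

```bash
# Fetch the logs of the previous (crashed) container instance.
kubectl logs <pod name> -n kube-system -c contiv-vswitch --previous > logs-previous.txt
```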
If VPP is the crashing process, please follow the CORE_FILES guide and provide the coredump file.
Inspect the following areas:
```
vpp# sh int addr
GigabitEthernet0/9/0 (up):
  192.168.16.1/24
local0 (dn):
loop0 (up):
  l2 bridge bd_id 1 bvi shg 0
  192.168.30.1/24
tapcli-0 (up):
  172.30.1.1/24
```
```
vpp# sh ip fib
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] locks:[src:(nil):2, src:adjacency:3, src:default-route:1, ]
0.0.0.0/0
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:0 to:[7:552]]
    [0] [@0]: dpo-drop ip4
0.0.0.0/32
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:2 buckets:1 uRPF:1 to:[0:0]]
    [0] [@0]: dpo-drop ip4
...
255.255.255.255/32
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:5 buckets:1 uRPF:4 to:[0:0]]
    [0] [@0]: dpo-drop ip4
```
```
vpp# sh ip arp
    Time           IP4       Flags      Ethernet              Interface
    728.6616   192.168.16.2     D    08:00:27:9c:0e:9f GigabitEthernet0/8/0
    542.7045   192.168.30.2     S    1a:2b:3c:4d:5e:02 loop0
      1.4241   172.30.1.2       D    86:41:d5:92:fd:24 tapcli-0
     15.2485   10.1.1.2        SN    00:00:00:00:00:02 tapcli-1
    739.2339   10.1.1.3        SN    00:00:00:00:00:02 tapcli-2
    739.4119   10.1.1.4        SN    00:00:00:00:00:02 tapcli-3
```
```
DBGvpp# sh nat44 addresses
NAT44 pool addresses:
192.168.16.10
  tenant VRF independent
  0 busy udp ports
  0 busy tcp ports
  0 busy icmp ports
NAT44 twice-nat pool addresses:
```
```
vpp# sh nat44 static mappings
NAT44 static mappings:
 tcp local 192.168.42.1:6443 external 10.96.0.1:443 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 192.168.42.2:32379 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 192.168.16.2:32379 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 192.168.42.1:32379 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 192.168.16.1:32379 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 10.109.143.39:12379 vrf 0  out2in-only
 udp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0  out2in-only
 tcp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0  out2in-only
```
```
vpp# sh nat44 interfaces
NAT44 interfaces:
 loop0 in out
 GigabitEthernet0/9/0 out
 tapcli-0 in out
```
```
vpp# sh nat44 sessions
NAT44 sessions:
  192.168.20.2: 0 dynamic translations, 3 static translations
  10.1.1.3: 0 dynamic translations, 0 static translations
  10.1.1.4: 0 dynamic translations, 0 static translations
  10.1.1.2: 0 dynamic translations, 6 static translations
  10.1.2.18: 0 dynamic translations, 2 static translations
```
vpp# sh acl-plugin acl
```
vpp# sh stn rules
- rule_index: 0
  address: 10.1.10.47
  iface: tapcli-0 (2)
  next_node: tapcli-0-output (410)
```
vpp# sh errors
vpp# sh vxlan tunnels
vpp# sh hardware-interfaces
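All of the above state can be captured in one pass. A sketch, assuming vppctl is available inside the vswitch container and `<vswitch-pod>` is replaced with an actual pod name:

```bash
# Dump each VPP area into a single file, one section per show command.
POD=<vswitch-pod>
for cmd in "sh int addr" "sh ip fib" "sh ip arp" "sh nat44 addresses" \
           "sh nat44 static mappings" "sh nat44 interfaces" "sh nat44 sessions" \
           "sh acl-plugin acl" "sh stn rules" "sh errors" \
           "sh vxlan tunnels" "sh hardware-interfaces"; do
  echo "== vpp# $cmd ==" >> vpp-state.txt
  kubectl exec "$POD" -n kube-system -c contiv-vswitch -- vppctl "$cmd" >> vpp-state.txt
done
```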
contiv-vpp-bug-report.sh is an example of a script that may be a useful starting point for gathering the above information using kubectl.
Limitations:
Prerequisites:
To enable logging in to a node without a password, copy your public key to the node:
ssh-copy-id <user-id>@<node-name-or-ip-address>
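If you do not have a key pair yet, generate one first and then re-run the command above (a standard OpenSSH step this guide assumes):

```bash
# Generate an RSA key pair; accept the defaults for a passwordless setup.
ssh-keygen -t rsa -b 4096
```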
To enable running sudo without a password for a given user, enter:
$ sudo visudo
Append the following entry to allow the given user to run all commands without a password:
<userid> ALL=(ALL) NOPASSWD:ALL
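Alternatively, the same entry can go into a drop-in file, which avoids editing the main sudoers file (a standard sudo mechanism, not shown in the original guide):

```bash
# Create a sudoers drop-in for the user; sudoers.d files should be mode 0440.
echo '<userid> ALL=(ALL) NOPASSWD:ALL' | sudo tee /etc/sudoers.d/<userid>
sudo chmod 0440 /etc/sudoers.d/<userid>
```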
You can also add user `<user-id>` to group `sudo` and edit the `sudo` entry as follows:
```
# Allow members of group sudo to execute any command
%sudo   ALL=(ALL:ALL) NOPASSWD:ALL
```
Add user `<user-id>` to group `<group-id>` as follows:
sudo adduser <user-id> <group-id>
or as follows:
sudo usermod -a -G <group-id> <user-id>
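To confirm the membership took effect (it applies at the next login):

```bash
# List the groups the user belongs to.
groups <user-id>
```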
The script can be used to collect data from the Contiv-VPP test bed created with Vagrant. To collect debug information from this test bed, perform the following steps:
vagrant ssh-config > vagrant-ssh.conf
./contiv-vpp-bug-report.sh -u vagrant -m k8s-master -f <path-to-your-vagrant-ssh-config-file>/vagrant-ssh.conf