Debugging and Reporting Bugs in Contiv-VPP
==========================================

Bug Report Structure
--------------------

- `Deployment description <#describe-deployment>`__: Briefly describes
  the deployment where the issue was spotted: the number of k8s nodes
  and whether DHCP, STN, or TAP is used.

- `Logs <#collecting-the-logs>`__: Attach the corresponding logs, at
  least from the vswitch pods.

- `VPP config <#inspect-vpp-config>`__: Attach the output of the show
  commands.

- `Basic Collection Example <#basic-example>`__

Describe Deployment
~~~~~~~~~~~~~~~~~~~

Since contiv-vpp can be used with different configurations, it is
helpful to attach the config that was applied. Either attach
``values.yaml`` passed to the helm chart, or attach the `corresponding
part <https://github.com/contiv/vpp/blob/42b3bfbe8735508667b1e7f1928109a65dfd5261/k8s/contiv-vpp.yaml#L24-L38>`__
of the deployment yaml file.

.. code:: yaml

     contiv.yaml: |-
       TCPstackDisabled: true
       UseTAPInterfaces: true
       TAPInterfaceVersion: 2
       NatExternalTraffic: true
       MTUSize: 1500
       IPAMConfig:
         PodSubnetCIDR: 10.1.0.0/16
         PodNetworkPrefixLen: 24
         PodIfIPCIDR: 10.2.1.0/24
         VPPHostSubnetCIDR: 172.30.0.0/16
         VPPHostNetworkPrefixLen: 24
         NodeInterconnectCIDR: 192.168.16.0/24
         VxlanCIDR: 192.168.30.0/24
         NodeInterconnectDHCP: False

Information that might be helpful:

- Whether node IPs are statically assigned or DHCP is used
- Whether STN is enabled
- Version of the TAP interfaces used
- Output of ``kubectl get pods -o wide --all-namespaces``
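
If the original ``values.yaml`` or deployment yaml is no longer at hand,
the applied configuration can usually be pulled from the cluster itself.
A minimal sketch, assuming the default ConfigMap name ``contiv-agent-cfg``
in the ``kube-system`` namespace (the name may differ between releases):

::

   # Dump the currently deployed contiv configuration
   $ kubectl get configmap contiv-agent-cfg -n kube-system -o yaml > contiv-config.yaml

   # Overview of all pods and the nodes they run on
   $ kubectl get pods -o wide --all-namespaces > pods.txt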

Collecting the Logs
~~~~~~~~~~~~~~~~~~~

The most essential step when debugging and **reporting an issue** in
Contiv-VPP is **collecting the logs from the contiv-vpp vswitch
containers**.

a) Collecting Vswitch Logs Using kubectl
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to collect the logs from individual vswitches in the cluster,
connect to the master node and then find the POD names of the individual
vswitch containers:

::

   $ kubectl get pods --all-namespaces | grep vswitch
   kube-system   contiv-vswitch-lqxfp   2/2   Running   0   1h
   kube-system   contiv-vswitch-q6kwt   2/2   Running   0   1h

Then run the following command, with *pod name* replaced by the actual
POD name:

::

   $ kubectl logs <pod name> -n kube-system -c contiv-vswitch

Redirect the output to a file to save the logs, for example:

::

   $ kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-master.txt
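
To grab the logs from every vswitch pod in one pass, the two steps above
can be combined into a small loop. A minimal sketch, assuming ``kubectl``
access to the cluster from the node where it is run:

::

   # Collect logs from all contiv-vswitch pods into per-pod files
   for pod in $(kubectl get pods -n kube-system -o name | grep contiv-vswitch); do
     kubectl logs "$pod" -n kube-system -c contiv-vswitch > "logs-${pod#pod/}.txt"
   done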

b) Collecting Vswitch Logs Using Docker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If option a) does not work, you can still collect the same logs using
the plain docker command. To do that, connect to each individual node
in the k8s cluster and find the container ID of the vswitch container:

::

   $ docker ps | grep contivvpp/vswitch
   b682b5837e52   contivvpp/vswitch   "/usr/bin/supervisor…"   2 hours ago   Up 2 hours   k8s_contiv-vswitch_contiv-vswitch-q6kwt_kube-system_d09b6210-2903-11e8-b6c9-08002723b076_0

Now use the ID from the first column to dump the logs into the
``logs-master.txt`` file:

::

   $ docker logs b682b5837e52 > logs-master.txt
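
If only a single vswitch container is running on the node, the container
ID lookup and the log dump can be combined into one command. A minimal
sketch, assuming the image name matches ``contivvpp/vswitch``:

::

   # Dump the logs of the (single) vswitch container on this node
   $ docker logs $(docker ps -q --filter ancestor=contivvpp/vswitch) > logs-master.txt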

Reviewing the Vswitch Logs
^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to debug an issue, it is good to start by grepping the logs for
the ``level=error`` string, for example:

::

   $ cat logs-master.txt | grep level=error

Some bugs can also cause VPP or the contiv-agent to crash. To check
whether a process has crashed, grep for the string ``exit``, for example:

::

   $ cat logs-master.txt | grep exit
   2018-03-20 06:03:45,948 INFO exited: vpp (terminated by SIGABRT (core dumped); not expected)
   2018-03-20 06:03:48,948 WARN received SIGTERM indicating exit request
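
Both checks can be run over all collected log files at once, for example:

::

   $ grep -E 'level=error|exit' logs-*.txt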

Collecting the STN Daemon Logs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In STN (Steal The NIC) deployment scenarios, you often need to collect
and review the logs from the STN daemon. This needs to be done on each
node:

::

   $ docker logs contiv-stn > logs-stn-master.txt
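
The STN daemon logs have to be collected on each node individually, e.g.
over ssh. A minimal sketch, assuming passwordless ssh access (see the
prerequisites below) and hypothetical node names:

::

   # Collect the STN daemon logs from every node over ssh
   for node in k8s-master k8s-worker1 k8s-worker2; do
     ssh <user-id>@"$node" docker logs contiv-stn > "logs-stn-$node.txt" 2>&1
   done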

Collecting Logs in Case of Crash Loop
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the vswitch is crashing in a loop (indicated by an increasing number
in the ``RESTARTS`` column of the ``kubectl get pods --all-namespaces``
output), ``kubectl logs`` or ``docker logs`` would only give us the logs
of the latest incarnation of the vswitch. These might not contain the
root cause of the very first crash, so in order to debug it, we need to
disable the k8s health check probes so that the vswitch is not restarted
after the very first crash. This can be done by commenting out the
``readinessProbe`` and ``livenessProbe`` in the contiv-vpp deployment
YAML:

.. code:: diff

   diff --git a/k8s/contiv-vpp.yaml b/k8s/contiv-vpp.yaml
   index 3676047..ffa4473 100644
   --- a/k8s/contiv-vpp.yaml
   +++ b/k8s/contiv-vpp.yaml
   @@ -224,18 +224,18 @@ spec:
              ports:
                # readiness + liveness probe
                - containerPort: 9999
   -          readinessProbe:
   -            httpGet:
   -              path: /readiness
   -              port: 9999
   -            periodSeconds: 1
   -            initialDelaySeconds: 15
   -          livenessProbe:
   -            httpGet:
   -              path: /liveness
   -              port: 9999
   -            periodSeconds: 1
   -            initialDelaySeconds: 60
   +          # readinessProbe:
   +          #   httpGet:
   +          #     path: /readiness
   +          #     port: 9999
   +          #   periodSeconds: 1
   +          #   initialDelaySeconds: 15
   +          # livenessProbe:
   +          #   httpGet:
   +          #     path: /liveness
   +          #     port: 9999
   +          #   periodSeconds: 1
   +          #   initialDelaySeconds: 60
              env:
                - name: MICROSERVICE_LABEL
                  valueFrom:
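
If only the most recent crash is of interest, the logs of the previous
(crashed) container instance can also be retrieved without disabling the
probes, using the ``--previous`` flag of ``kubectl logs``:

::

   $ kubectl logs <pod name> -n kube-system -c contiv-vswitch --previous > logs-previous.txt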

If VPP is the crashing process, please follow the
`CORE_FILES <CORE_FILES.html>`__ guide and provide the coredump file.

Inspect VPP Config
~~~~~~~~~~~~~~~~~~

Inspect the following areas (a sketch that captures all of these outputs
at once follows at the end of this section):

- Configured interfaces (issues related to basic node/pod connectivity):

::

   vpp# sh int addr
   GigabitEthernet0/9/0 (up):
     192.168.16.1/24
   local0 (dn):
   loop0 (up):
     l2 bridge bd_id 1 bvi shg 0
     192.168.30.1/24
   tapcli-0 (up):
     172.30.1.1/24

- IP forwarding table:

::

   vpp# sh ip fib
   ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] locks:[src:(nil):2, src:adjacency:3, src:default-route:1, ]
   0.0.0.0/0
     unicast-ip4-chain
     [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:0 to:[7:552]]
       [0] [@0]: dpo-drop ip4
   0.0.0.0/32
     unicast-ip4-chain
     [@0]: dpo-load-balance: [proto:ip4 index:2 buckets:1 uRPF:1 to:[0:0]]
       [0] [@0]: dpo-drop ip4

   ...
   ...

   255.255.255.255/32
     unicast-ip4-chain
     [@0]: dpo-load-balance: [proto:ip4 index:5 buckets:1 uRPF:4 to:[0:0]]
       [0] [@0]: dpo-drop ip4

- ARP table:

::

   vpp# sh ip arp
       Time           IP4       Flags      Ethernet              Interface
    728.6616  192.168.16.2         D    08:00:27:9c:0e:9f GigabitEthernet0/8/0
    542.7045  192.168.30.2         S    1a:2b:3c:4d:5e:02 loop0
      1.4241    172.30.1.2         D    86:41:d5:92:fd:24 tapcli-0
     15.2485      10.1.1.2        SN    00:00:00:00:00:02 tapcli-1
    739.2339      10.1.1.3        SN    00:00:00:00:00:02 tapcli-2
    739.4119      10.1.1.4        SN    00:00:00:00:00:02 tapcli-3

- NAT configuration (issues related to services):

::

   DBGvpp# sh nat44 addresses
   NAT44 pool addresses:
   192.168.16.10
     tenant VRF independent
     0 busy udp ports
     0 busy tcp ports
     0 busy icmp ports
   NAT44 twice-nat pool addresses:

::

   vpp# sh nat44 static mappings
   NAT44 static mappings:
    tcp local 192.168.42.1:6443 external 10.96.0.1:443 vrf 0 out2in-only
    tcp local 192.168.42.1:12379 external 192.168.42.2:32379 vrf 0 out2in-only
    tcp local 192.168.42.1:12379 external 192.168.16.2:32379 vrf 0 out2in-only
    tcp local 192.168.42.1:12379 external 192.168.42.1:32379 vrf 0 out2in-only
    tcp local 192.168.42.1:12379 external 192.168.16.1:32379 vrf 0 out2in-only
    tcp local 192.168.42.1:12379 external 10.109.143.39:12379 vrf 0 out2in-only
    udp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only
    tcp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only

::

   vpp# sh nat44 interfaces
   NAT44 interfaces:
    loop0 in out
    GigabitEthernet0/9/0 out
    tapcli-0 in out

::

   vpp# sh nat44 sessions
   NAT44 sessions:
     192.168.20.2: 0 dynamic translations, 3 static translations
     10.1.1.3: 0 dynamic translations, 0 static translations
     10.1.1.4: 0 dynamic translations, 0 static translations
     10.1.1.2: 0 dynamic translations, 6 static translations
     10.1.2.18: 0 dynamic translations, 2 static translations

- ACL config (issues related to policies):

::

   vpp# sh acl-plugin acl

- Steal the NIC (STN) config (issues related to host connectivity
  when STN is active):

::

   vpp# sh stn rules
   - rule_index: 0
     address: 10.1.10.47
     iface: tapcli-0 (2)
     next_node: tapcli-0-output (410)

- Errors:

::

   vpp# sh errors

- Vxlan tunnels:

::

   vpp# sh vxlan tunnels

- Hardware interface information:

::

   vpp# sh hardware-interfaces
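
All of the show commands above can be captured in one go from outside of
the VPP CLI. A minimal sketch, assuming ``vppctl`` is available inside
the contiv-vswitch container (the pod name is hypothetical; use the
names from ``kubectl get pods``):

::

   # Capture the VPP state listed above from one vswitch pod
   POD=contiv-vswitch-lqxfp   # replace with the actual pod name
   for cmd in "sh int addr" "sh ip fib" "sh ip arp" "sh nat44 addresses" \
              "sh nat44 static mappings" "sh nat44 interfaces" "sh nat44 sessions" \
              "sh acl-plugin acl" "sh stn rules" "sh errors" "sh vxlan tunnels" \
              "sh hardware-interfaces"; do
     echo "== $cmd ==" >> vpp-config-master.txt
     kubectl exec "$POD" -n kube-system -c contiv-vswitch -- vppctl "$cmd" >> vpp-config-master.txt
   done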

Basic Example
~~~~~~~~~~~~~

`contiv-vpp-bug-report.sh <https://github.com/contiv/vpp/tree/master/scripts/contiv-vpp-bug-report.sh>`__
is an example of a script that may be a useful starting point for
gathering the above information using kubectl.

Limitations:

- The script does not include the STN daemon logs, nor does it handle
  the special case of a crash loop.

Prerequisites:

- The user specified in the script must have passwordless access to all
  nodes in the cluster; on each node in the cluster, the user must have
  passwordless access to sudo.

Setting up Prerequisites
^^^^^^^^^^^^^^^^^^^^^^^^

To enable logging into a node without a password, copy your public key
to that node:

::

   ssh-copy-id <user-id>@<node-name-or-ip-address>
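
When the cluster has more than a couple of nodes, the key can be
distributed in a loop. A minimal sketch with hypothetical node names:

::

   for node in k8s-master k8s-worker1 k8s-worker2; do
     ssh-copy-id <user-id>@"$node"
   done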

To enable running sudo without a password for a given user, enter:

::

   $ sudo visudo

Append the following entry to allow the given user to run ALL commands
without a password:

::

   <userid> ALL=(ALL) NOPASSWD:ALL

You can also add user ``<user-id>`` to the ``sudo`` group and edit the
``sudo`` entry as follows:

::

   # Allow members of group sudo to execute any command
   %sudo   ALL=(ALL:ALL) NOPASSWD:ALL

Add user ``<user-id>`` to group ``<group-id>`` as follows:

::

   sudo adduser <user-id> <group-id>

or as follows:

::

   sudo usermod -a -G <group-id> <user-id>

Working with the Contiv-VPP Vagrant Test Bed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The script can be used to collect data from the `Contiv-VPP test bed
created with
Vagrant <https://github.com/contiv/vpp/blob/master/vagrant/README.md>`__.
To collect debug information from this Contiv-VPP test bed, do the
following:

- In the directory where you created your vagrant test bed, run:

::

   vagrant ssh-config > vagrant-ssh.conf

- To collect the debug information, run:

::

   ./contiv-vpp-bug-report.sh -u vagrant -m k8s-master -f <path-to-your-vagrant-ssh-config-file>/vagrant-ssh.conf
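
The same ``vagrant-ssh.conf`` file can also be used to verify
connectivity to the test bed before running the script, assuming the
master VM is named ``k8s-master`` as in the command above:

::

   $ ssh -F vagrant-ssh.conf k8s-master hostname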