Debugging and Reporting Bugs in Contiv-VPP
==========================================

Bug Report Structure
--------------------

- `Deployment description <#describe-deployment>`__: Briefly describes
  the deployment where the issue was spotted: the number of k8s nodes
  and whether DHCP, STN, or TAP is used.

- `Logs <#collecting-the-logs>`__: Attach the corresponding logs, at
  least from the vswitch pods.

- `VPP config <#inspect-vpp-config>`__: Attach the output of the show
  commands.

- `Basic Collection Example <#basic-example>`__

Describe Deployment
~~~~~~~~~~~~~~~~~~~

Since contiv-vpp can be used with different configurations, it is
helpful to attach the config that was applied. Either attach
``values.yaml`` passed to the helm chart, or attach the `corresponding
part <https://github.com/contiv/vpp/blob/42b3bfbe8735508667b1e7f1928109a65dfd5261/k8s/contiv-vpp.yaml#L24-L38>`__
of the deployment yaml file.

.. code:: yaml

     contiv.yaml: |-
       TCPstackDisabled: true
       UseTAPInterfaces: true
       TAPInterfaceVersion: 2
       NatExternalTraffic: true
       MTUSize: 1500
       IPAMConfig:
         PodSubnetCIDR: 10.1.0.0/16
         PodNetworkPrefixLen: 24
         PodIfIPCIDR: 10.2.1.0/24
         VPPHostSubnetCIDR: 172.30.0.0/16
         VPPHostNetworkPrefixLen: 24
         NodeInterconnectCIDR: 192.168.16.0/24
         VxlanCIDR: 192.168.30.0/24
         NodeInterconnectDHCP: False

Information that might be helpful:

- Whether node IPs are statically assigned or DHCP is used
- Whether STN is enabled
- Version of the TAP interfaces used
- Output of ``kubectl get pods -o wide --all-namespaces``
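
If the original ``values.yaml`` or deployment yaml is no longer at hand,
the applied configuration can usually be pulled from the cluster itself.
A minimal sketch, assuming the default ConfigMap name ``contiv-agent-cfg``
in the ``kube-system`` namespace (the name may differ between releases):

::

   # Dump the currently deployed contiv configuration
   $ kubectl get configmap contiv-agent-cfg -n kube-system -o yaml > contiv-config.yaml

   # Overview of all pods and the nodes they run on
   $ kubectl get pods -o wide --all-namespaces > pods.txt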

Collecting the Logs
~~~~~~~~~~~~~~~~~~~

The most essential step when debugging and **reporting an issue** in
Contiv-VPP is **collecting the logs from the contiv-vpp vswitch
containers**.

a) Collecting Vswitch Logs Using kubectl
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to collect the logs from individual vswitches in the cluster,
connect to the master node and then find the POD names of the individual
vswitch containers:

::

   $ kubectl get pods --all-namespaces | grep vswitch
   kube-system   contiv-vswitch-lqxfp   2/2   Running   0   1h
   kube-system   contiv-vswitch-q6kwt   2/2   Running   0   1h

Then run the following command, with *pod name* replaced by the actual
POD name:

::

   $ kubectl logs <pod name> -n kube-system -c contiv-vswitch

Redirect the output to a file to save the logs, for example:

::

   $ kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-master.txt
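
To grab the logs from every vswitch pod in one pass, the two steps above
can be combined into a small loop. A minimal sketch, assuming ``kubectl``
access to the cluster from the node where it is run:

::

   # Collect logs from all contiv-vswitch pods into per-pod files
   for pod in $(kubectl get pods -n kube-system -o name | grep contiv-vswitch); do
     kubectl logs "$pod" -n kube-system -c contiv-vswitch > "logs-${pod#pod/}.txt"
   done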

b) Collecting Vswitch Logs Using Docker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If option a) does not work, you can still collect the same logs using
the plain docker command. To do that, connect to each individual node
in the k8s cluster and find the container ID of the vswitch container:

::

   $ docker ps | grep contivvpp/vswitch
   b682b5837e52   contivvpp/vswitch   "/usr/bin/supervisor…"   2 hours ago   Up 2 hours   k8s_contiv-vswitch_contiv-vswitch-q6kwt_kube-system_d09b6210-2903-11e8-b6c9-08002723b076_0

Now use the ID from the first column to dump the logs into the
``logs-master.txt`` file:

::

   $ docker logs b682b5837e52 > logs-master.txt
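
If only a single vswitch container is running on the node, the container
ID lookup and the log dump can be combined into one command. A minimal
sketch, assuming the image name matches ``contivvpp/vswitch``:

::

   # Dump the logs of the (single) vswitch container on this node
   $ docker logs $(docker ps -q --filter ancestor=contivvpp/vswitch) > logs-master.txt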

Reviewing the Vswitch Logs
^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to debug an issue, it is good to start by grepping the logs for
the ``level=error`` string, for example:

::

   $ cat logs-master.txt | grep level=error

Some bugs can also cause VPP or the contiv-agent to crash. To check
whether a process has crashed, grep for the string ``exit``, for example:

::

   $ cat logs-master.txt | grep exit
   2018-03-20 06:03:45,948 INFO exited: vpp (terminated by SIGABRT (core dumped); not expected)
   2018-03-20 06:03:48,948 WARN received SIGTERM indicating exit request
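
Both checks can be run over all collected log files at once, for example:

::

   $ grep -E 'level=error|exit' logs-*.txt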

Collecting the STN Daemon Logs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In STN (Steal The NIC) deployment scenarios, you often need to collect
and review the logs from the STN daemon. This needs to be done on each
node:

::

   $ docker logs contiv-stn > logs-stn-master.txt
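
The STN daemon logs have to be collected on each node individually, e.g.
over ssh. A minimal sketch, assuming passwordless ssh access (see the
prerequisites below) and hypothetical node names:

::

   # Collect the STN daemon logs from every node over ssh
   for node in k8s-master k8s-worker1 k8s-worker2; do
     ssh <user-id>@"$node" docker logs contiv-stn > "logs-stn-$node.txt" 2>&1
   done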

Collecting Logs in Case of Crash Loop
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the vswitch is crashing in a loop (indicated by an increasing number
in the ``RESTARTS`` column of the ``kubectl get pods --all-namespaces``
output), ``kubectl logs`` or ``docker logs`` would only give us the logs
of the latest incarnation of the vswitch. These might not contain the
root cause of the very first crash, so in order to debug it, we need to
disable the k8s health check probes so that the vswitch is not restarted
after the very first crash. This can be done by commenting out the
``readinessProbe`` and ``livenessProbe`` in the contiv-vpp deployment
YAML:

.. code:: diff

   diff --git a/k8s/contiv-vpp.yaml b/k8s/contiv-vpp.yaml
   index 3676047..ffa4473 100644
   --- a/k8s/contiv-vpp.yaml
   +++ b/k8s/contiv-vpp.yaml
   @@ -224,18 +224,18 @@ spec:
              ports:
                # readiness + liveness probe
                - containerPort: 9999
   -          readinessProbe:
   -            httpGet:
   -              path: /readiness
   -              port: 9999
   -            periodSeconds: 1
   -            initialDelaySeconds: 15
   -          livenessProbe:
   -            httpGet:
   -              path: /liveness
   -              port: 9999
   -            periodSeconds: 1
   -            initialDelaySeconds: 60
   +          # readinessProbe:
   +          #   httpGet:
   +          #     path: /readiness
   +          #     port: 9999
   +          #   periodSeconds: 1
   +          #   initialDelaySeconds: 15
   +          # livenessProbe:
   +          #   httpGet:
   +          #     path: /liveness
   +          #     port: 9999
   +          #   periodSeconds: 1
   +          #   initialDelaySeconds: 60
              env:
                - name: MICROSERVICE_LABEL
                  valueFrom:
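
If only the most recent crash is of interest, the logs of the previous
(crashed) container instance can also be retrieved without disabling the
probes, using the ``--previous`` flag of ``kubectl logs``:

::

   $ kubectl logs <pod name> -n kube-system -c contiv-vswitch --previous > logs-previous.txt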

If VPP is the crashing process, please follow the
`CORE_FILES <CORE_FILES.html>`__ guide and provide the coredump file.

Inspect VPP Config
~~~~~~~~~~~~~~~~~~

Inspect the following areas (a sketch that captures all of these outputs
at once follows at the end of this section):

- Configured interfaces (issues related to basic node/pod connectivity):

::

   vpp# sh int addr
   GigabitEthernet0/9/0 (up):
     192.168.16.1/24
   local0 (dn):
   loop0 (up):
     l2 bridge bd_id 1 bvi shg 0
     192.168.30.1/24
   tapcli-0 (up):
     172.30.1.1/24

- IP forwarding table:

::

   vpp# sh ip fib
   ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] locks:[src:(nil):2, src:adjacency:3, src:default-route:1, ]
   0.0.0.0/0
     unicast-ip4-chain
     [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:0 to:[7:552]]
       [0] [@0]: dpo-drop ip4
   0.0.0.0/32
     unicast-ip4-chain
     [@0]: dpo-load-balance: [proto:ip4 index:2 buckets:1 uRPF:1 to:[0:0]]
       [0] [@0]: dpo-drop ip4

   ...
   ...

   255.255.255.255/32
     unicast-ip4-chain
     [@0]: dpo-load-balance: [proto:ip4 index:5 buckets:1 uRPF:4 to:[0:0]]
       [0] [@0]: dpo-drop ip4

- ARP table:

::

   vpp# sh ip arp
       Time           IP4       Flags      Ethernet              Interface
    728.6616  192.168.16.2         D    08:00:27:9c:0e:9f GigabitEthernet0/8/0
    542.7045  192.168.30.2         S    1a:2b:3c:4d:5e:02 loop0
      1.4241    172.30.1.2         D    86:41:d5:92:fd:24 tapcli-0
     15.2485      10.1.1.2        SN    00:00:00:00:00:02 tapcli-1
    739.2339      10.1.1.3        SN    00:00:00:00:00:02 tapcli-2
    739.4119      10.1.1.4        SN    00:00:00:00:00:02 tapcli-3

- NAT configuration (issues related to services):

::

   DBGvpp# sh nat44 addresses
   NAT44 pool addresses:
   192.168.16.10
     tenant VRF independent
     0 busy udp ports
     0 busy tcp ports
     0 busy icmp ports
   NAT44 twice-nat pool addresses:

::

   vpp# sh nat44 static mappings
   NAT44 static mappings:
    tcp local 192.168.42.1:6443 external 10.96.0.1:443 vrf 0 out2in-only
    tcp local 192.168.42.1:12379 external 192.168.42.2:32379 vrf 0 out2in-only
    tcp local 192.168.42.1:12379 external 192.168.16.2:32379 vrf 0 out2in-only
    tcp local 192.168.42.1:12379 external 192.168.42.1:32379 vrf 0 out2in-only
    tcp local 192.168.42.1:12379 external 192.168.16.1:32379 vrf 0 out2in-only
    tcp local 192.168.42.1:12379 external 10.109.143.39:12379 vrf 0 out2in-only
    udp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only
    tcp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only

::

   vpp# sh nat44 interfaces
   NAT44 interfaces:
    loop0 in out
    GigabitEthernet0/9/0 out
    tapcli-0 in out

::

   vpp# sh nat44 sessions
   NAT44 sessions:
     192.168.20.2: 0 dynamic translations, 3 static translations
     10.1.1.3: 0 dynamic translations, 0 static translations
     10.1.1.4: 0 dynamic translations, 0 static translations
     10.1.1.2: 0 dynamic translations, 6 static translations
     10.1.2.18: 0 dynamic translations, 2 static translations

- ACL config (issues related to policies):

::

   vpp# sh acl-plugin acl

- Steal the NIC (STN) config (issues related to host connectivity
  when STN is active):

::

   vpp# sh stn rules
   - rule_index: 0
     address: 10.1.10.47
     iface: tapcli-0 (2)
     next_node: tapcli-0-output (410)

- Errors:

::

   vpp# sh errors

- Vxlan tunnels:

::

   vpp# sh vxlan tunnels

- Hardware interface information:

::

   vpp# sh hardware-interfaces
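
All of the show commands above can be captured in one go from outside of
the VPP CLI. A minimal sketch, assuming ``vppctl`` is available inside
the contiv-vswitch container (the pod name is hypothetical; use the
names from ``kubectl get pods``):

::

   # Capture the VPP state listed above from one vswitch pod
   POD=contiv-vswitch-lqxfp   # replace with the actual pod name
   for cmd in "sh int addr" "sh ip fib" "sh ip arp" "sh nat44 addresses" \
              "sh nat44 static mappings" "sh nat44 interfaces" "sh nat44 sessions" \
              "sh acl-plugin acl" "sh stn rules" "sh errors" "sh vxlan tunnels" \
              "sh hardware-interfaces"; do
     echo "== $cmd ==" >> vpp-config-master.txt
     kubectl exec "$POD" -n kube-system -c contiv-vswitch -- vppctl "$cmd" >> vpp-config-master.txt
   done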

Basic Example
~~~~~~~~~~~~~

`contiv-vpp-bug-report.sh <https://github.com/contiv/vpp/tree/master/scripts/contiv-vpp-bug-report.sh>`__
is an example of a script that may be a useful starting point for
gathering the above information using kubectl.

Limitations:

- The script does not include the STN daemon logs, nor does it handle
  the special case of a crash loop.

Prerequisites:

- The user specified in the script must have passwordless access to all
  nodes in the cluster; on each node in the cluster, the user must have
  passwordless access to sudo.

Setting up Prerequisites
^^^^^^^^^^^^^^^^^^^^^^^^

To enable logging into a node without a password, copy your public key
to that node:

::

   ssh-copy-id <user-id>@<node-name-or-ip-address>
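
When the cluster has more than a couple of nodes, the key can be
distributed in a loop. A minimal sketch with hypothetical node names:

::

   for node in k8s-master k8s-worker1 k8s-worker2; do
     ssh-copy-id <user-id>@"$node"
   done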

To enable running sudo without a password for a given user, enter:

::

   $ sudo visudo

Append the following entry to allow the given user to run ALL commands
without a password:

::

   <userid> ALL=(ALL) NOPASSWD:ALL

You can also add user ``<user-id>`` to the ``sudo`` group and edit the
``sudo`` entry as follows:

::

   # Allow members of group sudo to execute any command
   %sudo   ALL=(ALL:ALL) NOPASSWD:ALL

Add user ``<user-id>`` to group ``<group-id>`` as follows:

::

   sudo adduser <user-id> <group-id>

or as follows:

::

   sudo usermod -a -G <group-id> <user-id>

Working with the Contiv-VPP Vagrant Test Bed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The script can be used to collect data from the `Contiv-VPP test bed
created with
Vagrant <https://github.com/contiv/vpp/blob/master/vagrant/README.md>`__.
To collect debug information from this Contiv-VPP test bed, do the
following:

- In the directory where you created your vagrant test bed, run:

::

   vagrant ssh-config > vagrant-ssh.conf

- To collect the debug information, run:

::

   ./contiv-vpp-bug-report.sh -u vagrant -m k8s-master -f <path-to-your-vagrant-ssh-config-file>/vagrant-ssh.conf
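
The same ``vagrant-ssh.conf`` file can also be used to verify
connectivity to the test bed before running the script, assuming the
master VM is named ``k8s-master`` as in the command above:

::

   $ ssh -F vagrant-ssh.conf k8s-master hostname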