Debugging and Reporting Bugs in Contiv-VPP
==========================================

Bug Report Structure
--------------------

- `Deployment description <#describe-deployment>`__: Briefly describes
  the deployment where the issue was spotted, the number of k8s nodes,
  and whether DHCP/STN/TAP is used.

- `Logs <#collecting-the-logs>`__: Attach the corresponding logs, at
  least from the vswitch pods.

- `VPP config <#inspect-vpp-config>`__: Attach the output of the VPP
  show commands.

- `Basic Collection Example <#basic-example>`__

Describe Deployment
~~~~~~~~~~~~~~~~~~~

Since Contiv-VPP can be used in different configurations, it is
helpful to attach the configuration that was applied. Either attach the
``values.yaml`` passed to the helm chart, or attach the `corresponding
part <https://github.com/contiv/vpp/blob/42b3bfbe8735508667b1e7f1928109a65dfd5261/k8s/contiv-vpp.yaml#L24-L38>`__
from the deployment YAML file:

.. code:: yaml

   contiv.yaml: |-
     TCPstackDisabled: true
     UseTAPInterfaces: true
     TAPInterfaceVersion: 2
     NatExternalTraffic: true
     MTUSize: 1500
     IPAMConfig:
       PodSubnetCIDR: 10.1.0.0/16
       PodNetworkPrefixLen: 24
       PodIfIPCIDR: 10.2.1.0/24
       VPPHostSubnetCIDR: 172.30.0.0/16
       VPPHostNetworkPrefixLen: 24
       NodeInterconnectCIDR: 192.168.16.0/24
       VxlanCIDR: 192.168.30.0/24
       NodeInterconnectDHCP: False

Information that might be helpful (see the sketch below for one way to
collect it):

- whether node IPs are statically assigned or DHCP is used,
- whether STN is enabled,
- the version of TAP interfaces used,
- the output of ``kubectl get pods -o wide --all-namespaces``.

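A quick way to capture this information into files that can be attached
to the bug report; a minimal sketch, assuming the default
``contiv-agent-cfg`` ConfigMap name used by the standard contiv-vpp
deployment:

::

   # Sketch: capture basic deployment information for the bug report.
   kubectl get nodes -o wide > nodes.txt
   kubectl get pods -o wide --all-namespaces > pods.txt
   # The deployed contiv.yaml lives in a ConfigMap (name assumed here).
   kubectl get configmap contiv-agent-cfg -n kube-system -o yaml > contiv-config.yaml
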
Collecting the Logs
~~~~~~~~~~~~~~~~~~~

The most essential step when debugging and **reporting an issue** in
Contiv-VPP is **collecting the logs from the contiv-vpp vswitch
containers**.

a) Collecting Vswitch Logs Using kubectl
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to collect the logs from individual vswitches in the cluster,
connect to the master node and then find the POD names of the individual
vswitch containers:

::

   $ kubectl get pods --all-namespaces | grep vswitch
   kube-system   contiv-vswitch-lqxfp   2/2       Running   0          1h
   kube-system   contiv-vswitch-q6kwt   2/2       Running   0          1h

Then run the following command, with *pod name* replaced by the actual
POD name:

::

   $ kubectl logs <pod name> -n kube-system -c contiv-vswitch

Redirect the output to a file to save the logs, for example:

::

   kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-master.txt

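To grab the vswitch logs from all nodes in one go, a small loop may
help; a sketch, assuming the pods follow the default ``contiv-vswitch-*``
naming shown above:

::

   # Sketch: save the vswitch logs of every node into a separate file.
   for pod in $(kubectl get pods -n kube-system -o name | grep contiv-vswitch | cut -d/ -f2); do
     kubectl logs "$pod" -n kube-system -c contiv-vswitch > "logs-$pod.txt"
     # For a pod that has already restarted, `kubectl logs --previous` may also be useful.
   done
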
b) Collecting Vswitch Logs Using Docker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If option a) does not work, then you can still collect the same logs
using the plain docker command. For that, you need to connect to each
individual node in the k8s cluster, and find the container ID of the
vswitch container:

::

   $ docker ps | grep contivvpp/vswitch
   b682b5837e52   contivvpp/vswitch   "/usr/bin/supervisor…"   2 hours ago   Up 2 hours   k8s_contiv-vswitch_contiv-vswitch-q6kwt_kube-system_d09b6210-2903-11e8-b6c9-08002723b076_0

Now use the ID from the first column to dump the logs into the
``logs-master.txt`` file:

::

   $ docker logs b682b5837e52 > logs-master.txt

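To avoid copying the container ID by hand, the container can also be
selected by name; a sketch, assuming the Kubernetes-managed container
name contains ``k8s_contiv-vswitch`` as in the listing above:

::

   # Sketch: dump the vswitch logs without manually looking up the container ID.
   docker logs $(docker ps -q --filter "name=k8s_contiv-vswitch") > logs-master.txt
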
Reviewing the Vswitch Logs
^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to debug an issue, it is good to start by grepping the logs for
the ``level=error`` string, for example:

::

   $ cat logs-master.txt | grep level=error

Also, some bugs may cause VPP or the contiv-agent to crash. To check
whether a process has crashed, grep for the string ``exit``, for example:

::

   $ cat logs-master.txt | grep exit
   2018-03-20 06:03:45,948 INFO exited: vpp (terminated by SIGABRT (core dumped); not expected)
   2018-03-20 06:03:48,948 WARN received SIGTERM indicating exit request

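Both checks can be combined into a single pass over the log file, for
example:

::

   $ grep -E 'level=error|exit' logs-master.txt
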
Collecting the STN Daemon Logs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In STN (Steal The NIC) deployment scenarios, you often need to collect
and review the logs from the STN daemon. This needs to be done on each
node:

::

   $ docker logs contiv-stn > logs-stn-master.txt

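Because the STN daemon runs directly in Docker on every node, the same
command has to be repeated per node; a sketch, where ``<user-id>`` and
the node names ``node1 node2`` are placeholders (prepend ``sudo`` to
``docker`` if the user is not in the ``docker`` group):

::

   # Sketch: collect the STN daemon logs from all nodes over SSH.
   for node in node1 node2; do
     ssh <user-id>@$node "docker logs contiv-stn" > "logs-stn-$node.txt"
   done
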
Collecting Logs in Case of Crash Loop
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the vswitch is crashing in a loop (which can be recognized by an
increasing number in the ``RESTARTS`` column of the
``kubectl get pods --all-namespaces`` output), ``kubectl logs`` or
``docker logs`` would only give us the logs of the latest incarnation of
the vswitch. That might not contain the root cause of the very first
crash, so in order to debug that, we need to disable the k8s health
check probes so that the vswitch is not restarted after the very first
crash. This can be done by commenting out the ``readinessProbe`` and
``livenessProbe`` in the contiv-vpp deployment YAML:

.. code:: diff

   diff --git a/k8s/contiv-vpp.yaml b/k8s/contiv-vpp.yaml
   index 3676047..ffa4473 100644
   --- a/k8s/contiv-vpp.yaml
   +++ b/k8s/contiv-vpp.yaml
   @@ -224,18 +224,18 @@ spec:
              ports:
                # readiness + liveness probe
                - containerPort: 9999
   -          readinessProbe:
   -            httpGet:
   -              path: /readiness
   -              port: 9999
   -            periodSeconds: 1
   -            initialDelaySeconds: 15
   -          livenessProbe:
   -            httpGet:
   -              path: /liveness
   -              port: 9999
   -            periodSeconds: 1
   -            initialDelaySeconds: 60
   +          # readinessProbe:
   +          #   httpGet:
   +          #     path: /readiness
   +          #     port: 9999
   +          #   periodSeconds: 1
   +          #   initialDelaySeconds: 15
   +          # livenessProbe:
   +          #   httpGet:
   +          #     path: /liveness
   +          #     port: 9999
   +          #   periodSeconds: 1
   +          #   initialDelaySeconds: 60
              env:
                - name: MICROSERVICE_LABEL
                  valueFrom:

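If you prefer not to edit the YAML file, the probes can also be removed
from an already deployed cluster with ``kubectl patch``; a sketch,
assuming the vswitch runs as the ``contiv-vswitch`` DaemonSet in
``kube-system`` and that the contiv-vswitch container is the first
container in the pod spec:

::

   # Sketch: strip the probes from the running DaemonSet (triggers one rollout of the vswitch pods).
   kubectl patch daemonset contiv-vswitch -n kube-system --type=json -p='[
     {"op": "remove", "path": "/spec/template/spec/containers/0/readinessProbe"},
     {"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'
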
If VPP is the crashing process, please follow the
`CORE_FILES <CORE_FILES.html>`__ guide and provide the coredump file.

Inspect VPP Config
~~~~~~~~~~~~~~~~~~

Inspect the following areas:

- Configured interfaces (issues related to basic node/pod connectivity):

::

   vpp# sh int addr
   GigabitEthernet0/9/0 (up):
     192.168.16.1/24
   local0 (dn):
   loop0 (up):
     l2 bridge bd_id 1 bvi shg 0
     192.168.30.1/24
   tapcli-0 (up):
     172.30.1.1/24

- IP forwarding table:

::

   vpp# sh ip fib
   ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] locks:[src:(nil):2, src:adjacency:3, src:default-route:1, ]
   0.0.0.0/0
     unicast-ip4-chain
       [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:0 to:[7:552]]
         [0] [@0]: dpo-drop ip4
   0.0.0.0/32
     unicast-ip4-chain
       [@0]: dpo-load-balance: [proto:ip4 index:2 buckets:1 uRPF:1 to:[0:0]]
         [0] [@0]: dpo-drop ip4

   ...
   ...

   255.255.255.255/32
     unicast-ip4-chain
       [@0]: dpo-load-balance: [proto:ip4 index:5 buckets:1 uRPF:4 to:[0:0]]
         [0] [@0]: dpo-drop ip4

- ARP Table:

::

   vpp# sh ip arp
       Time      IP4              Flags   Ethernet            Interface
   728.6616      192.168.16.2     D       08:00:27:9c:0e:9f   GigabitEthernet0/8/0
   542.7045      192.168.30.2     S       1a:2b:3c:4d:5e:02   loop0
     1.4241      172.30.1.2       D       86:41:d5:92:fd:24   tapcli-0
    15.2485      10.1.1.2         SN      00:00:00:00:00:02   tapcli-1
   739.2339      10.1.1.3         SN      00:00:00:00:00:02   tapcli-2
   739.4119      10.1.1.4         SN      00:00:00:00:00:02   tapcli-3

- NAT configuration (issues related to services):

::

   DBGvpp# sh nat44 addresses
   NAT44 pool addresses:
   192.168.16.10
     tenant VRF independent
     0 busy udp ports
     0 busy tcp ports
     0 busy icmp ports
   NAT44 twice-nat pool addresses:

::

   vpp# sh nat44 static mappings
   NAT44 static mappings:
   tcp local 192.168.42.1:6443 external 10.96.0.1:443 vrf 0 out2in-only
   tcp local 192.168.42.1:12379 external 192.168.42.2:32379 vrf 0 out2in-only
   tcp local 192.168.42.1:12379 external 192.168.16.2:32379 vrf 0 out2in-only
   tcp local 192.168.42.1:12379 external 192.168.42.1:32379 vrf 0 out2in-only
   tcp local 192.168.42.1:12379 external 192.168.16.1:32379 vrf 0 out2in-only
   tcp local 192.168.42.1:12379 external 10.109.143.39:12379 vrf 0 out2in-only
   udp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only
   tcp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0 out2in-only

::

   vpp# sh nat44 interfaces
   NAT44 interfaces:
   loop0 in out
   GigabitEthernet0/9/0 out
   tapcli-0 in out

::

   vpp# sh nat44 sessions
   NAT44 sessions:
     192.168.20.2: 0 dynamic translations, 3 static translations
     10.1.1.3: 0 dynamic translations, 0 static translations
     10.1.1.4: 0 dynamic translations, 0 static translations
     10.1.1.2: 0 dynamic translations, 6 static translations
     10.1.2.18: 0 dynamic translations, 2 static translations

- ACL config (issues related to policies):

::

   vpp# sh acl-plugin acl

- “Steal the NIC (STN)” config (issues related to host connectivity
  when STN is active):

::

   vpp# sh stn rules
   - rule_index: 0
     address: 10.1.10.47
     iface: tapcli-0 (2)
     next_node: tapcli-0-output (410)

- Errors:

::

   vpp# sh errors

- Vxlan tunnels:

::

   vpp# sh vxlan tunnels

- Hardware interface information:

::

   vpp# sh hardware-interfaces

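All of these commands can also be run non-interactively from outside the
VPP CLI; a sketch using ``kubectl exec``, assuming the default
``contiv-vswitch-*`` pod naming and that ``vppctl`` is available inside
the contiv-vswitch container:

::

   # Sketch: dump a few VPP show commands from every vswitch pod into per-pod files.
   for pod in $(kubectl get pods -n kube-system -o name | grep contiv-vswitch | cut -d/ -f2); do
     for cmd in "sh int addr" "sh ip fib" "sh ip arp" "sh nat44 static mappings" "sh errors"; do
       echo "=== $pod: $cmd ===" >> "vpp-config-$pod.txt"
       kubectl exec -n kube-system "$pod" -c contiv-vswitch -- vppctl $cmd >> "vpp-config-$pod.txt"
     done
   done
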
Basic Example
~~~~~~~~~~~~~

`contiv-vpp-bug-report.sh <https://github.com/contiv/vpp/tree/master/scripts/contiv-vpp-bug-report.sh>`__
is an example of a script that may be a useful starting point for
gathering the above information using kubectl.

Limitations:

- The script does not collect the STN daemon logs, nor does it handle
  the special case of a crash loop.

Prerequisites:

- The user specified in the script must have passwordless SSH access to
  all nodes in the cluster, and passwordless ``sudo`` on each node.

Setting up Prerequisites
^^^^^^^^^^^^^^^^^^^^^^^^

To enable logging into a node without a password, copy your public key
to that node:

::

   ssh-copy-id <user-id>@<node-name-or-ip-address>

To enable running sudo without a password for a given user, enter:

::

   $ sudo visudo

Append the following entry to allow the given user to run all commands
without a password:

::

   <userid> ALL=(ALL) NOPASSWD:ALL

You can also add user ``<user-id>`` to group ``sudo`` and edit the
``sudo`` entry as follows:

::

   # Allow members of group sudo to execute any command
   %sudo ALL=(ALL:ALL) NOPASSWD:ALL

Add user ``<user-id>`` to group ``<group-id>`` as follows:

::

   sudo adduser <user-id> <group-id>

or as follows:

::

   usermod -a -G <group-id> <user-id>

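Before running the bug-report script, it may be worth verifying both
prerequisites in one shot; a sketch, where ``<user-id>`` and the node
names ``node1 node2`` are placeholders:

::

   # Sketch: verify passwordless SSH and passwordless sudo on each node.
   for node in node1 node2; do
     ssh -o BatchMode=yes <user-id>@$node 'sudo -n true' \
       && echo "$node: OK" \
       || echo "$node: passwordless SSH or sudo is not set up"
   done
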
Working with the Contiv-VPP Vagrant Test Bed
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The script can be used to collect data from the `Contiv-VPP test bed
created with
Vagrant <https://github.com/contiv/vpp/blob/master/vagrant/README.md>`__.
To collect debug information from this Contiv-VPP test bed, perform the
following steps:

- In the directory where you created your vagrant test bed, do:

::

   vagrant ssh-config > vagrant-ssh.conf

- To collect the debug information, do:

::

   ./contiv-vpp-bug-report.sh -u vagrant -m k8s-master -f <path-to-your-vagrant-ssh-config-file>/vagrant-ssh.conf