| .. This work is licensed under a |
| Creative Commons Attribution 4.0 International License. |
| .. _integration-s3p: |
| |
| Stability/Resiliency |
| ==================== |
| |
| .. important:: |
| The Release stability has been evaluated by: |
| |
| - The daily Honolulu CI/CD chain |
| - Stability tests |
| - Resiliency tests |
| |
.. note::
   The scope of these tests remains limited and does not provide a full set of
   KPIs to determine the limits and the dimensioning of the ONAP solution.
| |
| CI results |
| ---------- |
| |
As usual, a daily CI chain dedicated to the release is created after RC0.
The Honolulu chain was created on the 6th of April 2021.
| |
| The daily results can be found in `LF daily results web site |
| <https://logs.onap.org/onap-integration/daily/onap_daily_pod4_honolulu/2021-04/>`_. |
| |
| Infrastructure Healthcheck Tests |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
These tests cover the Kubernetes and Helm checks on the ONAP cluster.
| |
The global expected criterion is **75%**.
The onap-k8s and onap-k8s-teardown tests, which provide a snapshot of the onap
namespace in Kubernetes, as well as the onap-helm tests, are expected to be PASS.
| |
The nodeport_check_certs test is expected to fail. Even though tremendous
progress has been made in this area, some certificates (unmaintained, upstream
or integration robot pods) are still not correct due to bad certificate issuers
(Root CA certificate not valid) or excessively long validity periods. Most of
the certificates have been installed using cert-manager and will be easily
renewable.
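
For the certificates handled by cert-manager, readiness and expiry can be
checked directly in the cluster. A minimal check, assuming the cert-manager
CRDs are deployed (the certificate name is a placeholder)::

  # list the cert-manager certificates and their readiness
  $ kubectl get certificates -n onap
  # show expiry and renewal dates for a given certificate
  $ kubectl describe certificate <certificate-name> -n onap | \
      grep -E 'Not After|Renewal Time'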
| |
| .. image:: files/s3p/honolulu_daily_infrastructure_healthcheck.png |
| :align: center |
| |
| Healthcheck Tests |
| ~~~~~~~~~~~~~~~~~ |
| |
These tests are the traditional robot healthcheck tests plus additional tests
dealing with a single component.
| |
Some tests (basic_onboard, basic_cds) may fail occasionally because the
startup of the SDC is sometimes not fully completed.

The same test is run as the first step of the smoke tests and usually passes.
The mechanism used to detect that all the components are fully operational
could be improved, as timer-based solutions are not robust enough.
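
A more robust alternative to fixed timers is to poll the cluster state until
all the pods report readiness, for instance (a minimal sketch; completed job
pods may need additional filtering)::

  # wait up to 10 minutes for all the onap pods to be Ready
  $ kubectl wait -n onap --for=condition=Ready pods --all --timeout=600s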
| |
| The expectation is **100% OK**. |
| |
| .. image:: files/s3p/honolulu_daily_healthcheck.png |
| :align: center |
| |
| Smoke Tests |
| ~~~~~~~~~~~ |
| |
These tests are end-to-end, automated use case tests.
See the :ref:`Integration Test page <integration-tests>` for details.
| |
| The expectation is **100% OK**. |
| |
| .. figure:: files/s3p/honolulu_daily_smoke.png |
| :align: center |
| |
An error has been detected on the SDNC, preventing the basic_vm_macro test from
working. See `SDNC-1529 <https://jira.onap.org/browse/SDNC-1529>`_ for details.
We may also notice that SO timeouts occurred more frequently than in Guilin.
See `SO-3584 <https://jira.onap.org/browse/SO-3584>`_ for details.
| |
| Security Tests |
| ~~~~~~~~~~~~~~ |
| |
These tests deal with security.
See the :ref:`Integration Test page <integration-tests>` for details.
| |
The expectation is **66% OK**. The criterion is met.

It may even be higher, as the 2 failing tests are almost correct:

- The unlimited pod test still fails due to a testing pod (DCAE-tca).
- The nonssl test fails due to so and so-etsi-sol003-adapter, which were
  supposed to be managed with the ingress (not possible for this release) and
  got a waiver in Frankfurt. The pods cds-blueprints-processor-http and aws-web
  are used for tests (a probe illustrating this check is sketched below).
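
As an illustration, the spirit of the nonssl check can be reproduced by
probing the exposed NodePorts (a sketch only; <node-ip> is a placeholder for
the address of one of the cluster nodes)::

  # flag NodePort services answering in plain HTTP
  for port in $(kubectl get svc -n onap \
      -o jsonpath='{range .items[*].spec.ports[*]}{.nodePort}{"\n"}{end}' | sort -u); do
    if curl -s --connect-timeout 2 "http://<node-ip>:${port}/" > /dev/null; then
      echo "NodePort ${port} answers in plain HTTP"
    fi
  done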
| |
| .. figure:: files/s3p/honolulu_daily_security.png |
| :align: center |
| |
| Resiliency tests |
| ---------------- |
| |
| The goal of the resiliency testing was to evaluate the capability of the |
| Honolulu solution to survive a stop or restart of a Kubernetes control or |
| worker node. |
| |
| Controller node resiliency |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
By default the ONAP solution is installed with 3 controllers for high
availability. The test for controller resiliency can be described as follows
(a command sketch is given after the list):
| |
| - Run tests: check that they are PASS |
| - Stop a controller node: check that the node appears in NotReady state |
| - Run tests: check that they are PASS |
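
On an OpenStack-based lab, the stop step can be performed as follows (a
sketch; the server name depends on the deployment)::

  # stop one control plane node
  $ openstack server stop control02-onap-honolulu
  # check that Kubernetes reports it as NotReady
  $ kubectl get nodes control02-onap-honolulu
  NAME                      STATUS     ROLES    AGE   VERSION
  control02-onap-honolulu   NotReady   master   18h   v1.19.9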
| |
2 tests were performed on the Honolulu weekly lab. No problem was observed on
controller shutdown; the tests were still PASS with a stopped controller node.
| |
More details can be found in `TEST-309 <https://jira.onap.org/browse/TEST-309>`_.
| |
| Worker node resiliency |
| ~~~~~~~~~~~~~~~~~~~~~~ |
| |
In the community weekly lab, the ONAP pods are distributed over 12 workers. The
goal of the test was to evaluate the behavior of the pods on a worker restart
(disaster scenario assuming that the node was moved accidentally from Ready to
NotReady state).
The initial conditions of such tests may differ from one run to another, as the
Kubernetes scheduler does not distribute the pods on the same workers from one
installation to the next.
| |
| The test procedure can be described as follows: |
| |
| - Run tests: check that they are PASS (Healthcheck and basic_vm used) |
| - Check that all the workers are in ready state |
| :: |
| $ kubectl get nodes |
| NAME STATUS ROLES AGE VERSION |
| compute01-onap-honolulu Ready <none> 18h v1.19.9 |
| compute02-onap-honolulu Ready <none> 18h v1.19.9 |
| compute03-onap-honolulu Ready <none> 18h v1.19.9 |
| compute04-onap-honolulu Ready <none> 18h v1.19.9 |
| compute05-onap-honolulu Ready <none> 18h v1.19.9 |
| compute06-onap-honolulu Ready <none> 18h v1.19.9 |
| compute07-onap-honolulu Ready <none> 18h v1.19.9 |
| compute08-onap-honolulu Ready <none> 18h v1.19.9 |
| compute09-onap-honolulu Ready <none> 18h v1.19.9 |
| compute10-onap-honolulu Ready <none> 18h v1.19.9 |
| compute11-onap-honolulu Ready <none> 18h v1.19.9 |
| compute12-onap-honolulu Ready <none> 18h v1.19.9 |
| control01-onap-honolulu Ready master 18h v1.19.9 |
| control02-onap-honolulu Ready master 18h v1.19.9 |
| control03-onap-honolulu Ready master 18h v1.19.9 |
| |
| - Select a worker, list the impacted pods |
| :: |
| $ kubectl get pod -n onap --field-selector spec.nodeName=compute01-onap-honolulu |
| NAME READY STATUS RESTARTS AGE |
| onap-aaf-fs-7b6648db7f-shcn5 1/1 Running 1 22h |
| onap-aaf-oauth-5896545fb7-x6grg 1/1 Running 1 22h |
| onap-aaf-sms-quorumclient-2 1/1 Running 1 22h |
| onap-aai-modelloader-86d95c994b-87tsh 2/2 Running 2 22h |
| onap-aai-schema-service-75575cb488-7fxs4 2/2 Running 2 22h |
| onap-appc-cdt-58cb4766b6-vl78q 1/1 Running 1 22h |
| onap-appc-db-0 2/2 Running 4 22h |
| onap-appc-dgbuilder-5bb94d46bd-h2gbs 1/1 Running 1 22h |
| onap-awx-0 4/4 Running 4 22h |
| onap-cassandra-1 1/1 Running 1 22h |
| onap-cds-blueprints-processor-76f8b9b5c7-hb5bg 1/1 Running 1 22h |
| onap-dmaap-dr-db-1 2/2 Running 5 22h |
| onap-ejbca-6cbdb7d6dd-hmw6z 1/1 Running 1 22h |
| onap-kube2msb-858f46f95c-jws4m 1/1 Running 1 22h |
| onap-message-router-0 1/1 Running 1 22h |
| onap-message-router-kafka-0 1/1 Running 1 22h |
| onap-message-router-kafka-1 1/1 Running 1 22h |
| onap-message-router-kafka-2 1/1 Running 1 22h |
| onap-message-router-zookeeper-0 1/1 Running 1 22h |
| onap-multicloud-794c6dffc8-bfwr8 2/2 Running 2 22h |
| onap-multicloud-starlingx-58f6b86c55-mff89 3/3 Running 3 22h |
| onap-multicloud-vio-584d556876-87lxn 2/2 Running 2 22h |
| onap-music-cassandra-0 1/1 Running 1 22h |
| onap-netbox-nginx-8667d6675d-vszhb 1/1 Running 2 22h |
| onap-policy-api-6dbf8485d7-k7cpv 1/1 Running 1 22h |
| onap-policy-clamp-be-6d77597477-4mffk 1/1 Running 1 22h |
| onap-policy-pap-785bd79759-xxhvx 1/1 Running 1 22h |
| onap-policy-xacml-pdp-7d8fd58d59-d4m7g 1/1 Running 6 22h |
| onap-sdc-be-5f99c6c644-dcdz8 2/2 Running 2 22h |
| onap-sdc-fe-7577d58fb5-kwxpj 2/2 Running 2 22h |
| onap-sdc-wfd-fe-6997567759-gl9g6 2/2 Running 2 22h |
| onap-sdnc-dgbuilder-564d6475fd-xwwrz 1/1 Running 1 22h |
| onap-sdnrdb-master-0 1/1 Running 1 22h |
| onap-so-admin-cockpit-6c5b44694-h4d2n 1/1 Running 1 21h |
| onap-so-etsi-sol003-adapter-c9bf4464-pwn97 1/1 Running 1 21h |
| onap-so-sdc-controller-6899b98b8b-hfgvc 2/2 Running 2 21h |
| onap-vfc-mariadb-1 2/2 Running 4 21h |
| onap-vfc-nslcm-6c67677546-xcvl2 2/2 Running 2 21h |
| onap-vfc-vnflcm-78ff4d8778-sgtv6 2/2 Running 2 21h |
| onap-vfc-vnfres-6c96f9ff5b-swq5z 2/2 Running 2 21h |
| |
- Stop the worker (shut down the machine for bare metal, or the VM if
  Kubernetes is installed on top of an OpenStack solution)
- Wait for the completion of the pod eviction procedure (5 minutes; the origin
  of this value is explained after the procedure)
| :: |
| $ kubectl get nodes |
| NAME STATUS ROLES AGE VERSION |
| compute01-onap-honolulu NotReady <none> 18h v1.19.9 |
| compute02-onap-honolulu Ready <none> 18h v1.19.9 |
| compute03-onap-honolulu Ready <none> 18h v1.19.9 |
| compute04-onap-honolulu Ready <none> 18h v1.19.9 |
| compute05-onap-honolulu Ready <none> 18h v1.19.9 |
| compute06-onap-honolulu Ready <none> 18h v1.19.9 |
| compute07-onap-honolulu Ready <none> 18h v1.19.9 |
| compute08-onap-honolulu Ready <none> 18h v1.19.9 |
| compute09-onap-honolulu Ready <none> 18h v1.19.9 |
| compute10-onap-honolulu Ready <none> 18h v1.19.9 |
| compute11-onap-honolulu Ready <none> 18h v1.19.9 |
| compute12-onap-honolulu Ready <none> 18h v1.19.9 |
| control01-onap-honolulu Ready master 18h v1.19.9 |
| control02-onap-honolulu Ready master 18h v1.19.9 |
| control03-onap-honolulu Ready master 18h v1.19.9 |
| |
| - Run the tests: check that they are PASS |
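
The 5 minutes of the eviction step correspond to the default tolerations set
on the pods. They can be verified on any ONAP pod (pod name and output given
for illustration; the output format varies with the kubectl version)::

  $ kubectl get pod -n onap <pod-name> \
      -o jsonpath='{.spec.tolerations[?(@.key=="node.kubernetes.io/not-ready")]}'
  {"effect":"NoExecute","key":"node.kubernetes.io/not-ready","operator":"Exists","tolerationSeconds":300}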
| |
.. warning::
   In these conditions, **the tests will never be PASS**. In fact several
   components will remain in Init state.
   A procedure is required to ensure a clean restart.
| |
List the non-running pods::
| |
| $ kubectl get pods -n onap --field-selector status.phase!=Running | grep -v Completed |
| NAME READY STATUS RESTARTS AGE |
| onap-appc-dgbuilder-5bb94d46bd-sxmmc 0/1 Init:3/4 15 156m |
| onap-cds-blueprints-processor-76f8b9b5c7-m7nmb 0/1 Init:1/3 0 156m |
| onap-portal-app-595bd6cd95-bkswr 0/2 Init:0/4 84 23h |
| onap-portal-db-config-6s75n 0/2 Error 0 23h |
| onap-portal-db-config-7trzx 0/2 Error 0 23h |
| onap-portal-db-config-jt2jl 0/2 Error 0 23h |
| onap-portal-db-config-mjr5q 0/2 Error 0 23h |
| onap-portal-db-config-qxvdt 0/2 Error 0 23h |
| onap-portal-db-config-z8c5n 0/2 Error 0 23h |
| onap-sdc-be-5f99c6c644-kplqx 0/2 Init:2/5 14 156 |
| onap-vfc-nslcm-6c67677546-86mmj 0/2 Init:0/1 15 156m |
| onap-vfc-vnflcm-78ff4d8778-h968x 0/2 Init:0/1 15 156m |
| onap-vfc-vnfres-6c96f9ff5b-kt9rz 0/2 Init:0/1 15 156m |
| |
Some pods are not rescheduled (e.g. onap-awx-0 and onap-cassandra-1 above)
because they are part of a statefulset. List the statefulset objects::
| |
| $ kubectl get statefulsets.apps -n onap | grep -v "1/1" | grep -v "3/3" |
| NAME READY AGE |
| onap-aaf-sms-quorumclient 2/3 24h |
| onap-appc-db 2/3 24h |
| onap-awx 0/1 24h |
| onap-cassandra 2/3 24h |
| onap-dmaap-dr-db 2/3 24h |
| onap-message-router 0/1 24h |
| onap-message-router-kafka 0/3 24h |
| onap-message-router-zookeeper 2/3 24h |
| onap-music-cassandra 2/3 24h |
| onap-sdnrdb-master 2/3 24h |
| onap-vfc-mariadb 2/3 24h |
| |
For the pods that are part of a statefulset, a forced deletion is required.
As an example, if we consider the statefulset onap-sdnrdb-master, we must
follow this procedure::
| |
| $ kubectl get pods -n onap -o wide |grep onap-sdnrdb-master |
| onap-sdnrdb-master-0 1/1 Terminating 1 24h 10.42.3.92 node1 |
| onap-sdnrdb-master-1 1/1 Running 1 24h 10.42.1.122 node2 |
| onap-sdnrdb-master-2 1/1 Running 1 24h 10.42.2.134 node3 |
| |
| $ kubectl delete -n onap pod onap-sdnrdb-master-0 --force |
| warning: Immediate deletion does not wait for confirmation that the running |
| resource has been terminated. The resource may continue to run on the cluster |
| indefinitely. |
| pod "onap-sdnrdb-master-0" force deleted |
| |
| $ kubectl get pods |grep onap-sdnrdb-master |
| onap-sdnrdb-master-0 0/1 PodInitializing 0 11s |
| onap-sdnrdb-master-1 1/1 Running 1 24h |
| onap-sdnrdb-master-2 1/1 Running 1 24h |
| |
| $ kubectl get pods |grep onap-sdnrdb-master |
| onap-sdnrdb-master-0 1/1 Running 0 43s |
| onap-sdnrdb-master-1 1/1 Running 1 24h |
| onap-sdnrdb-master-2 1/1 Running 1 24h |
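
The same force deletion can be applied in one pass to all the pods of the
onap namespace stuck in Terminating state (a hedged helper; review the pod
list before deleting)::

  $ kubectl get pods -n onap --no-headers | awk '$3 == "Terminating" {print $1}' | \
      xargs -r kubectl delete pod -n onap --force --grace-period=0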
| |
Once all the statefulsets are properly restarted, the other components continue
their restart properly.
Once the restart of the pods is completed, the tests are PASS.
| |
.. important::

   K8s node reboots/shutdowns reveal some deficiencies in ONAP components with
   regard to their availability, as measured by the healthcheck results. Some
   pods may still fail to initialize after a reboot/shutdown (pod rescheduled).

   However the cluster as a whole behaves as expected: pods are rescheduled
   after a node shutdown (except the pods belonging to a statefulset, which
   need to be deleted forcibly, a normal Kubernetes behavior).

   On a rebooted node, provided its downtime does not exceed the eviction
   timeout, pods are restarted once the node is available again.
| |
| Please see `Integration Resiliency page <https://jira.onap.org/browse/TEST-308>`_ |
| for details. |
| |
| Stability tests |
| --------------- |
| |
| Three stability tests have been performed in Honolulu: |
| |
| - SDC stability test |
| - Simple instantiation test (basic_vm) |
| - Parallel instantiation test |
| |
| SDC stability test |
| ~~~~~~~~~~~~~~~~~~ |
| |
In this test, we consider the basic_onboard automated test and run 5
simultaneous onboarding procedures in parallel for 72 hours.

The basic_onboard test consists of the following steps (a parallel-launch
sketch is given after the list):
| |
| - [SDC] VendorOnboardStep: Onboard vendor in SDC. |
| - [SDC] YamlTemplateVspOnboardStep: Onboard vsp described in YAML file in SDC. |
| - [SDC] YamlTemplateVfOnboardStep: Onboard vf described in YAML file in SDC. |
| - [SDC] YamlTemplateServiceOnboardStep: Onboard service described in YAML file |
| in SDC. |
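
The 5 parallel onboarding loops can be orchestrated with a simple wrapper of
this kind (a sketch; ``run_basic_onboard.sh`` stands for a hypothetical
launcher of the basic_onboard scenario)::

  # run 5 onboarding loops in parallel for 72 hours
  for i in $(seq 1 5); do
    (
      end=$(( $(date +%s) + 72 * 3600 ))
      while [ "$(date +%s)" -lt "$end" ]; do
        ./run_basic_onboard.sh  # hypothetical scenario launcher
      done
    ) &
  done
  wait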
| |
The test was initiated on the Honolulu weekly lab on the 19th of April 2021.

As already observed in the daily/weekly/gating chains, we got race conditions
on some tests (`INT-1918 <https://jira.onap.org/browse/INT-1918>`_).
| |
The success rate is above 95% for the first 100 model uploads and remains above
80% until more than 500 models have been onboarded.
| |
We may also notice that the test duration increases continuously over time. At
the beginning a test takes about 200s; 24h later the same test takes around
1000s.
Finally, after 36h, the SDC systematically answers with an HTTP 500 error code,
which explains the linear decrease of the success rate.
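
The degradation can be observed with a periodic probe of the SDC backend. A
hedged sketch (the service name and healthcheck path correspond to a default
OOM deployment and may differ)::

  # report the HTTP code and duration of the SDC backend healthcheck
  $ kubectl -n onap port-forward svc/sdc-be 8080:8080 &
  $ curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' \
      http://localhost:8080/sdc2/rest/healthCheck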
| |
The following graphs provide a good view of the SDC stability test.
| |
| .. image:: files/s3p/honolulu_sdc_stability.png |
| :align: center |
| |
.. important::
   SDC can support the onboarding of a few hundred models.
   The onboarding duration increases linearly with the number of onboarded
   models.
   After a while, the SDC is no longer usable.
   No major cluster resource issues have been detected during the test. The
   memory consumption is however relatively high considering the load.
| |
| .. image:: files/s3p/honolulu_sdc_stability_resources.png |
| :align: center |
| |
| |
| Simple stability test |
| ~~~~~~~~~~~~~~~~~~~~~ |
| |
This test consists in running the basic_vm test continuously for 72 hours.
| |
| We observe the cluster metrics as well as the evolution of the test duration. |
| |
The basic_vm test is described in the :ref:`Integration Test page <integration-tests>`.
| |
The basic_vm test consists of the following steps:
| |
| - [SDC] VendorOnboardStep: Onboard vendor in SDC. |
| - [SDC] YamlTemplateVspOnboardStep: Onboard vsp described in YAML file in SDC. |
| - [SDC] YamlTemplateVfOnboardStep: Onboard vf described in YAML file in SDC. |
| - [SDC] YamlTemplateServiceOnboardStep: Onboard service described in YAML file |
| in SDC. |
| - [AAI] RegisterCloudRegionStep: Register cloud region. |
| - [AAI] ComplexCreateStep: Create complex. |
| - [AAI] LinkCloudRegionToComplexStep: Connect cloud region with complex. |
| - [AAI] CustomerCreateStep: Create customer. |
| - [AAI] CustomerServiceSubscriptionCreateStep: Create customer's service |
| subscription. |
| - [AAI] ConnectServiceSubToCloudRegionStep: Connect service subscription with |
| cloud region. |
- [SO] YamlTemplateServiceAlaCarteInstantiateStep: Instantiate service described
  in YAML using the SO a la carte method.
- [SO] YamlTemplateVnfAlaCarteInstantiateStep: Instantiate vnf described in YAML
  using the SO a la carte method.
- [SO] YamlTemplateVfModuleAlaCarteInstantiateStep: Instantiate VF module
  described in YAML using the SO a la carte method.
| |
The test was initiated on the Honolulu weekly lab on the 26th of April 2021.
This test was run after the test described in the next section.
A first error occurred after a few hours (mariadb-galera), then the system
automatically recovered for some hours before a full crash of the MariaDB
Galera cluster.
| |
| :: |
| |
| debian@control01-onap-honolulu:~$ kubectl get pod -n onap |grep mariadb-galera |
| onap-mariadb-galera-0 1/2 CrashLoopBackOff 625 5d16h |
| onap-mariadb-galera-1 1/2 CrashLoopBackOff 1134 5d16h |
| onap-mariadb-galera-2 1/2 CrashLoopBackOff 407 5d16h |
| |
| |
It was unfortunately not possible to collect the root cause (the logs of the
first restart of onap-mariadb-galera-1).
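
For future runs, the logs of a restarted container can be captured with the
``--previous`` flag of kubectl (the container name is an assumption)::

  # logs of the previous instance of the crashed container
  $ kubectl logs -n onap onap-mariadb-galera-1 -c mariadb-galera --previous

As only the most recent restart is kept, the logs of the very first failure
are lost after hundreds of restarts.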
| |
Community members reported that they have already faced such issues and
suggested deploying a single MariaDB instance instead of using MariaDB Galera.
Moreover, in Honolulu there were some changes in order to align with the
Camunda (SO) requirements for MariaDB Galera.
| |
During the limited valid window, the success rate was about 78% (85% for the
same test in Guilin).
The duration of the test remains very variable, as already reported in Guilin
(`SO-3419 <https://jira.onap.org/browse/SO-3419>`_). The duration of the same
test may vary from 500s to 2500s, as illustrated in the following graph:
| |
| .. image:: files/s3p/honolulu_so_stability_1_duration.png |
| :align: center |
| |
The changes in MariaDB Galera seem to have introduced some issues leading to
more unexpected timeouts.
| A troubleshooting campaign has been launched to evaluate possible evolutions in |
| this area. |
| |
| Parallel instantiations stability test |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
Still based on basic_vm, 5 instantiation attempts are done simultaneously on
the ONAP solution for 48 hours.
| |
| The results can be described as follows: |
| |
| .. image:: files/s3p/honolulu_so_stability_5.png |
| :align: center |
| |
For this test, we had to restart the SDNC once. The last failures are due to
a certificate infrastructure issue and are independent of ONAP.
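
Such a restart can be done at the Kubernetes level, for instance (resource
type and name as deployed by OOM, to be adapted to the cluster)::

  # restart the SDNC pods and wait for completion
  $ kubectl rollout restart statefulset onap-sdnc -n onap
  $ kubectl rollout status statefulset onap-sdnc -n onap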
| |
| Cluster metrics |
| ~~~~~~~~~~~~~~~ |
| |
.. important::
   No major cluster resource issues have been detected in the cluster metrics.
| |
| The metrics of the ONAP cluster have been recorded over the full week of |
| stability tests: |
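
A point-in-time equivalent of such metrics can be obtained with
metrics-server, if deployed (a quick check, not a replacement for the
recorded series)::

  # node level CPU/memory snapshot
  $ kubectl top nodes
  # top 10 onap pods by memory consumption
  $ kubectl top pods -n onap --sort-by=memory | head -11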
| |
.. csv-table:: Cluster CPU
| :file: ./files/csv/stability_cluster_metric_cpu.csv |
| :widths: 20,20,20,20,20 |
| :delim: ; |
| :header-rows: 1 |
| |
| .. image:: files/s3p/honolulu_weekly_cpu.png |
| :align: center |
| |
| .. image:: files/s3p/honolulu_weekly_memory.png |
| :align: center |
| |
| The Top Ten for CPU consumption is given in the table below: |
| |
| .. csv-table:: CPU |
| :file: ./files/csv/stability_top10_cpu.csv |
| :widths: 20,15,15,20,15,15 |
| :delim: ; |
| :header-rows: 1 |
| |
CPU consumption is negligible and is not a dimensioning factor. It shall be
reconsidered for use cases including extensive computation (loops, optimization
algorithms).
| |
| The Top Ten for Memory consumption is given in the table below: |
| |
| .. csv-table:: Memory |
| :file: ./files/csv/stability_top10_memory.csv |
| :widths: 20,15,15,20,15,15 |
| :delim: ; |
| :header-rows: 1 |
| |
Unsurprisingly, the Cassandra databases are using most of the memory.
| |
| The Top Ten for Network consumption is given in the table below: |
| |
| .. csv-table:: Network |
| :file: ./files/csv/stability_top10_net.csv |
| :widths: 10,15,15,15,15,15,15 |
| :delim: ; |
| :header-rows: 1 |