.. _integration-s3p:

ONAP Maturity Testing Notes
---------------------------

Historically, the integration team executed specific stability and resilience
tests on each target release. For Frankfurt, a stability test was executed.
The Openlab, based on Frankfurt RC0 dockers, was also observed over a long
period to evaluate the overall stability.
Finally, the daily CI chain created at Frankfurt RC0 was also a valuable
indicator for estimating the stability of the solution.

No resilience or stress tests were executed due to a lack of resources
and the late availability of the release. The testing strategy shall be amended
in Guilin; several requirements have been created to improve the S3P testing
domain.

Stability
=========

ONAP stability was tested through a 72-hour test.
The intent of the 72-hour stability test is not to exhaustively test all
functions but to run a steady load against the system and look for issues such
as memory leaks that cannot be found in the short-duration install and
functional testing performed during the development cycle.

Integration stability testing verifies that the ONAP platform remains fully
functional after running for an extended amount of time.
This is done by repeatedly running tests against an ONAP instance for a period
of 72 hours.

::

    The 72 hour stability run result was PASS

    The onboard and instantiate tests ran for over 115 hours before environment
    issues stopped the test. There were errors due to both tooling and
    environment issues.

    The overall memory utilization only grew about 2% on the worker nodes
    despite the environment issues. Interestingly, the kubernetes orchestration
    node memory grew more, which could mean we are overdriving the APIs in some
    fashion.

    We did not limit other tenant activities in Windriver during this test run
    and we saw the impact from things like the re-installation of SB00 in the
    tenant and general network latency impacts that caused openstack to be
    slower to instantiate.
    For future stability runs we should go back to the process of shutting down
    non-critical tenants in the test environment to free up host resources for
    the test run (or find other ways to prevent other testing from affecting
    the stability run).

    The control loop tests were 100% successful and the cycle time for the loop
    was fairly consistent despite the environment issues. Future control loop
    stability tests should consider doing more policy edit type activities and
    running more control loops if host resources are available. The 10 second
    VES telemetry event is quite aggressive, so we are sending more load into
    the VES collector and TCA engine during onset events than would be typical;
    adding additional loops should factor that in. The jenkins jobs ran fairly
    well, although the instantiate Demo vFWCL took longer than usual, which
    should be factored into future test planning.


Methodology
~~~~~~~~~~~

The Stability Test has two main components:

- Running the "ete stability72hr" Robot suite periodically. This test suite
  verifies that ONAP can instantiate vDNS, vFWCL, and vVG.
- Setting up the vFW Closed Loop to remain running, then checking periodically
  that the closed loop functionality is still working.

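The periodic execution described above can be sketched as a simple loop. This is
only an illustration, not the actual Jenkins or Robot tooling: the function
``run_periodically``, its parameters, and the one-hour interval in the usage
example are assumptions made for this sketch (the interval used during the real
run is not stated here). The clock and sleep functions are injectable so that
the loop can be exercised without waiting 72 hours.

```python
import time
from typing import Callable, List


def run_periodically(
    check: Callable[[], bool],
    duration_s: float,
    interval_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bool]:
    """Call `check` every `interval_s` seconds until `duration_s` has elapsed.

    `check` is a hypothetical callable standing in for one test iteration
    (for example one run of a test suite); it returns True on success.
    """
    results: List[bool] = []
    start = clock()
    while clock() - start < duration_s:
        results.append(check())
        sleep(interval_s)
    return results
```

With a real test callable, ``run_periodically(check, 72 * 3600, 3600)`` would
run one iteration per hour for 72 hours and return the list of outcomes.
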
The integration-longevity tenant in the Intel/Windriver environment was used
for the 72 hour tests.

The onap-ci job for "Project windriver-longevity-release-manual" was used for
the deployment, with the OOM branch set to frankfurt and the Integration branch
set to master. Integration master was used so we could catch the latest updates
to integration scripts and VNF heat templates.

The Jenkins job needs a couple of updates for each release:

- Set the integration branch to 'origin/master'
- Modify the parameters to deploy.sh to specify "-i master" and "-o frankfurt"
  to get integration master and oom frankfurt clones onto the NFS server.

The path for robot logs on dockerdata-nfs changed in Frankfurt, so
/dev-robot/ becomes /dev/robot

.. note::
   For the Frankfurt release, the stability test was executed on a
   Kubernetes infrastructure based on El Alto recommendations. The Kubernetes
   version was 1.15.3 (Frankfurt recommendation: 1.15.11) and the Helm version
   was 2.14.2 (Frankfurt recommendation: 2.16.6). However, the ONAP dockers
   were updated to Frankfurt RC2 candidate versions. The results are
   informative and can be compared with previous campaigns. The stability tests
   used robot container image 1.6.1-STAGING-20200519T201214Z. The robot
   container was patched to use GRA_API since VNF_API has been deprecated.

Shakedown consists of creating temporary tags stability72hrvLB,
stability72hrvVG and stability72hrVFWCL to make sure each sub-test ran
successfully (including cleanup) in the environment before the Jenkins job
started with the higher-level testsuite tag stability72hr, which covers all
three test types.

Clean out the old build jobs using a Jenkins console script (Manage Jenkins):

::

    // Name of the Jenkins job whose build history is cleared
    def jobName = "windriver-longevity-stability72hr"
    def job = Jenkins.instance.getItem(jobName)
    // Delete every recorded build, then restart build numbering at 1
    job.getBuilds().each { it.delete() }
    job.nextBuildNumber = 1
    job.save()


appc.properties was updated to apply the fix for DMaaP message processing, so
that it calls http://localhost:8181 for the streams update.

Results: 100% PASS
~~~~~~~~~~~~~~~~~~

=================== ======== ========== ======== ========= =========
Test Case           Attempts Env Issues Failures Successes Pass Rate
=================== ======== ========== ======== ========= =========
Stability 72 hours  77       19         0        58        100%
vFW Closed Loop     60       0          0        100       100%
Total               137      19         0        158       100%
=================== ======== ========== ======== ========= =========

Detailed results can be found at https://wiki.onap.org/display/DW/Frankfurt+Stability+Run+Notes

.. note::
   - Overall results were good. All of the test failures were due to
     issues with the unstable environment and tooling framework.
   - JIRAs were created for readiness/liveness probe issues found while
     testing under the unstable environment. Patches applied to oom and
     testsuite during the testing helped reduce test failures due to
     environment and tooling framework issues.
   - The vFW Closed Loop test was very stable and self recovered from
     environment issues.

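The pass rate in the "Stability 72 hours" row is consistent with excluding the
attempts lost to environment issues before computing the ratio. The sketch
below works through that arithmetic. The formula and the function name
``pass_rate`` are assumptions inferred from that single row, not a documented
definition used by the test tooling.

```python
def pass_rate(attempts: int, env_issues: int, successes: int) -> float:
    """Percentage of successes among attempts not lost to environment issues.

    Assumed formula: successes / (attempts - env_issues) * 100.
    """
    valid_attempts = attempts - env_issues
    if valid_attempts <= 0:
        raise ValueError("no valid attempts left after removing env issues")
    return 100.0 * successes / valid_attempts


# "Stability 72 hours" row: 77 attempts, 19 environment issues, 58 successes
print(pass_rate(77, 19, 58))  # 100.0
```
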
Resources overview
~~~~~~~~~~~~~~~~~~

============ ======================= =========== ========== ==========
Date         #1 CPU                  #1 RAM      CPU*       RAM**
============ ======================= =========== ========== ==========
May 20 18:45 dcae-tca-analytics:511m appc:2901Mi 1649       36092
May 21 12:33 dcae-tca-analytics:664m appc:2901Mi 1605       38221
May 22 09:35 dcae-tca-analytics:425m appc:2837Mi 1459       38488
May 23 11:01 cassandra-1:371m        appc:2849Mi 1829       39431
============ ======================= =========== ========== ==========

.. note::
   - Results are given by the command "kubectl -n onap top pods | sort -rn -k 3
     | head -20"
   - CPU*: sum of the top 20 CPU consumption (millicores)
   - RAM**: sum of the top 20 RAM consumption (Mi)

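The aggregation behind the CPU* and RAM** columns can be approximated offline
from captured command output. The sketch below is an illustration under stated
assumptions: it expects three whitespace-separated columns (pod name, CPU in
``m``, memory in ``Mi``), it ranks pods by the third column as ``sort -rn -k 3``
does, and it sums both CPU and memory over the same top rows. The function
names and the sample pod names are hypothetical.

```python
from typing import List, Tuple

Row = Tuple[str, int, int]  # (pod name, CPU in millicores, memory in Mi)


def parse_top_pods(text: str) -> List[Row]:
    """Parse captured `kubectl top pods` output, skipping the header line."""
    rows: List[Row] = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != 3 or fields[0] == "NAME":
            continue
        name, cpu, mem = fields
        # Assumes CPU values end in "m" and memory values end in "Mi"
        rows.append((name, int(cpu[:-1]), int(mem[:-2])))
    return rows


def top_n_summary(rows: List[Row], n: int = 20) -> Tuple[int, int]:
    """Sum CPU and memory over the n pods with the highest memory usage."""
    top = sorted(rows, key=lambda row: row[2], reverse=True)[:n]
    return sum(row[1] for row in top), sum(row[2] for row in top)


# Hypothetical sample, not taken from the actual test run
SAMPLE = """\
NAME    CPU(cores)   MEMORY(bytes)
pod-a   511m         2901Mi
pod-b   120m         1500Mi
pod-c   30m          200Mi
"""

print(top_n_summary(parse_top_pods(SAMPLE), n=2))  # (631, 4401)
```
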
CI results
==========

A daily Frankfurt CI chain has been created after RC0.

The evolution of the full healthcheck test suite can be described as follows:

|image1|

The full healthcheck testsuite verifies the status of each component. It is
composed of 47 tests. The success rate from the 9th to the 28th of May never
dropped below 95%.

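To put the 95% figure in concrete terms, the short calculation below shows how
many of the 47 healthcheck tests may fail while the success rate stays at or
above a given threshold. The helper ``max_failures`` is a hypothetical name
introduced for this illustration; integer arithmetic is used to avoid
floating-point rounding.

```python
def max_failures(total_tests: int, min_success_pct: int) -> int:
    """Largest number of failing tests that keeps the success rate
    at or above `min_success_pct` percent."""
    # Ceiling of total_tests * min_success_pct / 100, using integers only
    required_passes = (total_tests * min_success_pct + 99) // 100
    return total_tests - required_passes


print(max_failures(47, 95))  # 2
```

In other words, with 47 tests a success rate of at least 95% corresponds to no
more than two failing tests on a given day.
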
Four test categories were defined:

- infrastructure healthcheck: test of the ONAP Kubernetes cluster and Helm
  chart status
- healthcheck tests: verification of the components in the target deployment
  environment
- smoke tests: basic VM tests (including onboarding/distribution/instantiation)
  and automated use cases (pnf-registrate, hvves, 5gbulkpm)
- security tests

The security target (66% for Frankfurt) was reached after RC1. A regression
due to the automation of the hvves use case (which triggered the exposure of a
public HTTP port) was fixed on the 28th of May.

|image2|

Orange Openlab
==============

The Orange Openlab is a community lab targeting ONAP end users. It provides
ONAP and cloud resources for discovering ONAP.
A Frankfurt pre-RC0 version was installed at the beginning of May. The usual
gating test suite was run daily in addition to the traffic generated by the lab
users. VM instantiation has been working well without any reinstallation
over the last 27 days.

Resilience
==========

The resilience test executed in El Alto was not performed in Frankfurt.

.. |image1| image:: files/s3p/daily_frankfurt1.png
   :width: 6.5in

.. |image2| image:: files/s3p/daily_frankfurt2.png
   :width: 6.5in