.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0

.. _feature-sm-label:

*************************
Feature: State Management
*************************

.. contents::
    :depth: 2

The State Management Feature provides:

- Node-level health monitoring
- Health monitoring of dependency nodes (the nodes on which a particular node depends)
- Ability to lock/unlock a node and suspend or resume all application processing
- Ability to suspend application processing on a node that is disabled or in a standby state
- Interworking/Coordination of state values
- Support for ITU X.731 states and state transitions for:

  - Administrative State
  - Operational State
  - Availability Status
  - Standby Status


Enabling and Disabling Feature State Management
===============================================

The State Management Feature is enabled from the command line, while logged in as the policy user, after the feature properties file has been configured (see the Properties section). From the command line:

- > features status - Lists the status of features
- > features enable state-management - Enables the State Management Feature
- > features disable state-management - Disables the State Management Feature

The Drools PDP must be stopped before features are enabled or disabled, and restarted afterwards.

    .. code-block:: bash
       :caption: Enabling State Management Feature

       policy@hyperion-4:/opt/app/policy$ policy stop
       [drools-pdp-controllers]
        L []: Stopping Policy Management... Policy Management (pid=354) is stopping... Policy Management has stopped.
       policy@hyperion-4:/opt/app/policy$ features enable state-management
       name                       version         status
       ----                       -------         ------
       controlloop-utils          1.1.0-SNAPSHOT  disabled
       healthcheck                1.1.0-SNAPSHOT  disabled
       test-transaction           1.1.0-SNAPSHOT  disabled
       eelf                       1.1.0-SNAPSHOT  disabled
       state-management           1.1.0-SNAPSHOT  enabled
       active-standby-management  1.1.0-SNAPSHOT  disabled
       session-persistence        1.1.0-SNAPSHOT  disabled

Description Details
~~~~~~~~~~~~~~~~~~~

State Model
"""""""""""

The state model follows the ITU X.731 standard for state management. The supported state values are:

    **Administrative State:**
        - Locked - All application transaction processing is prohibited
        - Unlocked - Application transaction processing is allowed

    **Administrative State Transitions:**
        - The transition from Unlocked to Locked is triggered with a Lock operation
        - The transition from Locked to Unlocked is triggered with an Unlock operation

    **Operational State:**
        - Enabled - The node is healthy and able to process application transactions
        - Disabled - The node is not healthy and not able to process application transactions

    **Operational State Transitions:**
        - The transition from Enabled to Disabled is triggered with a disableFailed or disableDependency operation
        - The transition from Disabled to Enabled is triggered with an enableNotFailed or enableNoDependency operation, once no failure or dependency condition remains (see the Availability Status transitions)

    **Availability Status:**
        - Null - The Operational State is Enabled
        - Failed - The Operational State is Disabled because the node is no longer healthy
        - Dependency - The Operational State is Disabled because all members of a dependency group are disabled
        - Dependency.Failed - The Operational State is Disabled because the node is no longer healthy and all members of a dependency group are disabled

    **Availability Status Transitions:**
        - The transition from Null to Failed is triggered with a disableFailed operation
        - The transition from Null to Dependency is triggered with a disableDependency operation
        - The transition from Failed to Dependency.Failed is triggered with a disableDependency operation
        - The transition from Dependency to Dependency.Failed is triggered with a disableFailed operation
        - The transition from Dependency.Failed to Failed is triggered with an enableNoDependency operation
        - The transition from Dependency.Failed to Dependency is triggered with an enableNotFailed operation
        - The transition from Failed to Null is triggered with an enableNotFailed operation
        - The transition from Dependency to Null is triggered with an enableNoDependency operation

    **Standby Status:**
        - Null - The node does not support active-standby behavior
        - ProvidingService - The node is actively providing application transaction service
        - HotStandby - The node is capable of providing application transaction service, but is currently waiting to be promoted
        - ColdStandby - The node is not capable of providing application service because of a failure

    **Standby Status Transitions:**
        - The transition from Null to HotStandby is triggered by a Demote operation when the Operational State is Enabled
        - The transition from Null to ColdStandby is triggered by a Demote operation when the Operational State is Disabled
        - The transition from ColdStandby to HotStandby is triggered by a transition of the Operational State from Disabled to Enabled
        - The transition from HotStandby to ColdStandby is triggered by a transition of the Operational State from Enabled to Disabled
        - The transition from ProvidingService to ColdStandby is triggered by a transition of the Operational State from Enabled to Disabled
        - The transition from HotStandby to ProvidingService is triggered by a Promote operation
        - The transition from ProvidingService to HotStandby is triggered by a Demote operation

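The Availability Status transitions listed above can be read as a small lookup table. The following is a minimal sketch in Python, not the feature's own implementation; the names ``AVAILABILITY_TRANSITIONS``, ``apply_operation`` and ``operational_state`` are hypothetical, and leaving the status unchanged for unlisted combinations is an assumption made for the sketch.

.. code-block:: python

    # Illustrative model of the Availability Status transitions listed above.
    # This is not the feature's code; all names here are hypothetical.

    NULL, FAILED, DEPENDENCY, DEPENDENCY_FAILED = (
        "null", "failed", "dependency", "dependency.failed")

    # (current availability status, operation) -> new availability status
    AVAILABILITY_TRANSITIONS = {
        (NULL, "disableFailed"): FAILED,
        (NULL, "disableDependency"): DEPENDENCY,
        (FAILED, "disableDependency"): DEPENDENCY_FAILED,
        (DEPENDENCY, "disableFailed"): DEPENDENCY_FAILED,
        (DEPENDENCY_FAILED, "enableNoDependency"): FAILED,
        (DEPENDENCY_FAILED, "enableNotFailed"): DEPENDENCY,
        (FAILED, "enableNotFailed"): NULL,
        (DEPENDENCY, "enableNoDependency"): NULL,
    }


    def apply_operation(avail_status, operation):
        """Return the new status. Assumption: unlisted pairs change nothing."""
        return AVAILABILITY_TRANSITIONS.get((avail_status, operation), avail_status)


    def operational_state(avail_status):
        """The Operational State is Enabled only when the status is Null."""
        return "enabled" if avail_status == NULL else "disabled"


    status = NULL
    for op in ("disableFailed", "disableDependency", "enableNotFailed"):
        status = apply_operation(status, op)
        print(op, "->", status, "/", operational_state(status))

The loop walks a node through a failure, a dependency outage and a local recovery; the node stays disabled at the end because the dependency condition has not yet been cleared.
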
Database
~~~~~~~~

The State Management feature creates a StateManagement database with three tables:

    **StateManagementEntity** - This table has the following columns:
        - **id** - Automatically created unique identifier
        - **resourceName** - The unique identifier for a node
        - **adminState** - The Administrative State
        - **opState** - The Operational State
        - **availStatus** - The Availability Status
        - **standbyStatus** - The Standby Status
        - **created_Date** - The timestamp the resource entry was created
        - **modifiedDate** - The timestamp the resource entry was last modified

    **ForwardProgressEntity** - This table has the following columns:
        - **forwardProgressId** - Automatically created unique identifier
        - **resourceName** - The unique identifier for a node
        - **fpc_count** - A forward progress counter which is periodically incremented if the node is healthy
        - **created_date** - The timestamp the resource entry was created
        - **last_updated** - The timestamp the resource entry was last updated

    **ResourceRegistrationEntity** - This table has the following columns:
        - **ResourceRegistrationId** - Automatically created unique identifier
        - **resourceName** - The unique identifier for a node
        - **resourceUrl** - The JMX URL used to check the health of a node
        - **site** - The name of the site in which the resource resides
        - **nodeType** - The type of the node (i.e., pdp_xacml, pdp_drools, pap, pap_admin, logparser, brms_gateway, astra_gateway, elk_server, pypdp)
        - **created_date** - The timestamp the resource entry was created
        - **last_updated** - The timestamp the resource entry was last updated

Node Health Monitoring
~~~~~~~~~~~~~~~~~~~~~~

**Application Monitoring**

    Application monitoring can be implemented using the *startTransaction()* and *endTransaction()* methods. Whenever a transaction is started, the *startTransaction()* method is called. If the node is locked, disabled or in a hot/cold standby state, the method will throw an exception. Otherwise, it resets the timer which triggers the default *testTransaction()* method.

    When a transaction completes, calling *endTransaction()* increments the forward progress counter in the *ForwardProgressEntity* DB table. As long as this counter is updating, the integrity monitor will assume the node is healthy/sane.

    If the *startTransaction()* method is not called within a provisioned period of time, a timer will expire which calls the *testTransaction()* method. The default implementation of this method simply increments the forward progress counter. The *testTransaction()* method may be overridden to perform a more meaningful test of system sanity, if desired.

    If the forward progress counter stops incrementing, the integrity monitoring routine will assume the node application has lost sanity and it will trigger a *statechange* (disableFailed) to cause the operational state to become disabled and the availability status attribute to become failed. Once the forward progress counter again begins incrementing, the operational state will return to enabled.

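The forward progress logic described above can be sketched as follows. This is a minimal in-memory illustration in Python under stated assumptions, not the *IntegrityMonitor* code: the class ``ForwardProgressMonitor`` and its methods are hypothetical, the counter is kept in memory rather than in the *ForwardProgressEntity* table, and the clock is passed in explicitly so the example is deterministic. Only the parameter name ``max_fpc_update_interval`` is borrowed from the feature properties file.

.. code-block:: python

    # Minimal in-memory sketch of forward-progress monitoring. Hypothetical
    # names; the real monitor persists the counter in ForwardProgressEntity.

    class ForwardProgressMonitor:
        def __init__(self, max_fpc_update_interval=120):
            self.max_interval = max_fpc_update_interval  # seconds
            self.fpc_count = 0
            self.last_updated = 0.0
            self.op_state = "enabled"
            self.avail_status = "null"

        def end_transaction(self, now):
            """A completed transaction increments the forward progress counter."""
            self.fpc_count += 1
            self.last_updated = now

        def test_transaction(self, now):
            """Default test transaction: simply increment the counter."""
            self.end_transaction(now)

        def audit(self, now):
            """Mark the node disabled/failed if the counter is stale.

            Simplification: dependency-related statuses are ignored here.
            """
            if now - self.last_updated > self.max_interval:
                self.op_state, self.avail_status = "disabled", "failed"
            else:
                self.op_state, self.avail_status = "enabled", "null"
            return self.op_state


    monitor = ForwardProgressMonitor(max_fpc_update_interval=120)
    monitor.end_transaction(now=10)
    first = monitor.audit(now=60)       # counter updated 50 s ago
    second = monitor.audit(now=200)     # counter stale for 190 s
    monitor.test_transaction(now=205)
    third = monitor.audit(now=210)      # counter is moving again
    print(first, second, third)

The three audits show the node staying enabled while the counter advances, being disabled once the counter has been stale for longer than the configured interval, and returning to enabled after a test transaction moves the counter again.
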
**Application Monitoring with AllSeemsWell**

    The IntegrityMonitor class provides a facility for applications to directly control updates of the forwardprogressentity table. As previously described, *startTransaction()* and *endTransaction()* are provided to monitor the forward progress of transactions. This, however, does not monitor things such as internal threads that may be blocked or die. An example is the feature-state-management *DroolsPdpElectionHandler.run()* method.

    The *run()* method is monitored by a timer task, *checkWaitTimer()*. If the *run()* method is stalled for an extended period of time, the *checkWaitTimer()* method will call *StateManagementFeature.allSeemsWell(<className>, <AllSeemsWell State>, <String message>)* with the AllSeemsWell state of Boolean.FALSE.

    The IntegrityMonitor instance owned by StateManagementFeature will then store an entry in the allSeemsWellMap and block updates of the forwardprogressentity table. This, in turn, will cause the Drools PDP operational state to be set to “disabled” and availability status to be set to “failed”.

    Once the blocking condition is cleared, *checkWaitTimer()* will again call the *allSeemsWell()* method, this time with an AllSeemsWell state of Boolean.TRUE. This will cause the IntegrityMonitor to remove the entry for that className from the allSeemsWellMap and allow updating of the forwardprogressentity table, so long as there are no other entries in the map.

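The bookkeeping described above can be illustrated with a short Python sketch. It is a stand-in for the Java classes, not their actual code: ``AllSeemsWellTracker`` and ``update_forward_progress`` are hypothetical names, and only the behavior stated in the text (an entry in the map blocks counter updates, removing the last entry allows them again) is modelled.

.. code-block:: python

    # Sketch of the allSeemsWell bookkeeping described above. Hypothetical
    # Python stand-in for the Java IntegrityMonitor; only the map is modelled.

    class AllSeemsWellTracker:
        def __init__(self):
            self.all_seems_well_map = {}   # class name -> last failure message
            self.fpc_count = 0

        def all_seems_well(self, class_name, state, message):
            """Record (state False) or clear (state True) a blocking condition."""
            if state:
                self.all_seems_well_map.pop(class_name, None)
            else:
                self.all_seems_well_map[class_name] = message

        def update_forward_progress(self):
            """Increment the counter only while no entry blocks updates."""
            if self.all_seems_well_map:
                return False
            self.fpc_count += 1
            return True


    tracker = AllSeemsWellTracker()
    tracker.update_forward_progress()
    tracker.all_seems_well("DroolsPdpElectionHandler", False, "run() is stalled")
    blocked = tracker.update_forward_progress()
    tracker.all_seems_well("DroolsPdpElectionHandler", True, "run() has resumed")
    resumed = tracker.update_forward_progress()
    print(blocked, resumed, tracker.fpc_count)

While the entry for the stalled class is present, the counter does not advance, which is what eventually lets the forward progress audit mark the node as disabled/failed.
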
**Dependency Monitoring**

    When a Drools PDP (or other node using the *IntegrityMonitor* policy/common module) is dependent upon other nodes to perform its function, those other nodes can be defined as dependencies in the properties file. In order for the dependency algorithm to function, the other nodes must also be running the *IntegrityMonitor*. Periodically the Drools PDP will check the state of its dependencies. If all nodes of a given type have failed, the Drools PDP will declare that it can no longer function and change the operational state to disabled and the availability status to dependency.

    In addition to checking other policy node types, the *IntegrityMonitor* periodically calls a *subsystemTest()* method. In Drools PDP, *subsystemTest()* has been overridden to execute an audit of the Database and of the Maven Repository. If the audit is unable to verify the function of either the DB or the Maven Repository, the Drools PDP will declare that it can no longer function and change the operational state to disabled and the availability status to dependency.

    When a failed dependency returns to normal operation, the *IntegrityMonitor* will change the operational state to enabled and availability status to null.

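The dependency rule above (a group counts as failed only when every member is disabled) can be sketched in Python as follows. The function ``failed_groups`` is hypothetical and is not part of the *IntegrityMonitor*; treating a resource with no known state as "not disabled" is an assumption made for the sketch. The resource names are taken from the example in the properties file.

.. code-block:: python

    # Sketch of the dependency check: a group has failed only when all of its
    # members are disabled. Hypothetical helper, not the IntegrityMonitor API.

    def failed_groups(dependency_groups, op_states):
        """Return the groups in which every member is disabled.

        Assumption: a resource with no recorded state is not counted as disabled.
        """
        return [group for group in dependency_groups
                if group and all(op_states.get(name) == "disabled" for name in group)]


    groups = [["site_1.astra_1", "site_1.astra_2"],
              ["site_1.brms_1", "site_1.brms_2"]]
    states = {"site_1.astra_1": "disabled", "site_1.astra_2": "enabled",
              "site_1.brms_1": "disabled", "site_1.brms_2": "disabled"}

    down = failed_groups(groups, states)
    avail_status = "dependency" if down else "null"
    print(down, avail_status)

Here the first group still has one enabled member, so only the second group is reported and the local node would move to the dependency availability status.
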
**External Health Monitoring Interface**

    The Drools PDP has an HTTP test interface which, when called, will return 200 if all seems well and 500 otherwise. The test interface URL is defined in the properties file.


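An external monitor could poll the test interface along the following lines. This is a hedged sketch in Python using only the standard library; the helpers ``is_healthy`` and ``probe`` are hypothetical, and the URL to call is whatever the deployment configures in the properties file (no particular path is assumed here).

.. code-block:: python

    # Sketch of an external health probe for the HTTP test interface.
    # Hypothetical helpers; the URL must come from the deployment's properties.
    import urllib.error
    import urllib.request


    def is_healthy(status_code):
        """Per the text above: 200 means all seems well, anything else does not."""
        return status_code == 200


    def probe(url, timeout=5.0):
        """Return the HTTP status code of the test interface, or None if unreachable."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.status
        except urllib.error.HTTPError as err:
            return err.code          # e.g. 500 when the node is not healthy
        except (urllib.error.URLError, OSError):
            return None              # no HTTP response at all


    print(is_healthy(200), is_healthy(500), is_healthy(None))

A caller would pass the configured test URL to ``probe()`` and feed the result to ``is_healthy()``; an unreachable node yields ``None`` and is therefore reported as not healthy.
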
Site Manager
~~~~~~~~~~~~

The Site Manager is not deployed with the Drools PDP, but it is available in the policy/common repository in the site-manager directory.
The Site Manager provides a lock/unlock interface for nodes and a way to display node information and status.

The following is from the README file included with the Site Manager.

    .. code-block:: bash
       :caption: Site Manager README extract

       Before using 'siteManager', the file 'siteManager.properties' needs to be
       edited to configure the parameters used to access the database:

           javax.persistence.jdbc.driver - typically 'org.mariadb.jdbc.Driver'

           javax.persistence.jdbc.url - URL referring to the database,
               which typically has the form: 'jdbc:mariadb://<host>:<port>/<db>'
               ('<db>' is probably 'xacml' in this case)

           javax.persistence.jdbc.user - the user id for accessing the database

           javax.persistence.jdbc.password - password for accessing the database

       Once the properties file has been updated, the 'siteManager' script can be
       invoked as follows:

           siteManager show [ -s <site> | -r <resourceName> ] :
               display node information (Site, NodeType, ResourceName, AdminState,
               OpState, AvailStatus, StandbyStatus)

           siteManager setAdminState { -s <site> | -r <resourceName> } <new-state> :
               update admin state on selected nodes

           siteManager lock { -s <site> | -r <resourceName> } :
               lock selected nodes

           siteManager unlock { -s <site> | -r <resourceName> } :
               unlock selected nodes

Note that the 'siteManager' script assumes that the script,
'site-manager-${project.version}.jar' file and 'siteManager.properties' file
are all in the same directory. If the files are separated, the 'siteManager'
script will need to be modified so it can locate the jar and properties files.


Properties
~~~~~~~~~~

The feature-state-management.properties file controls the function of the State Management Feature. In general, the properties have adequate descriptions in the file. Parameters which must be replaced prior to usage are indicated thus: ${{parameter to be replaced}}.

    .. code-block:: bash
       :caption: feature-state-management.properties

       # DB properties
       javax.persistence.jdbc.driver=org.mariadb.jdbc.Driver
       javax.persistence.jdbc.url=jdbc:mariadb://${{SQL_HOST}}:3306/statemanagement
       javax.persistence.jdbc.user=${{SQL_USER}}
       javax.persistence.jdbc.password=${{SQL_PASSWORD}}

       # DroolsPDPIntegrityMonitor Properties
       # Test interface host and port defaults may be overwritten here
       http.server.services.TEST.host=0.0.0.0
       http.server.services.TEST.port=9981
       #These properties will default to the following if no other values are provided:
       # http.server.services.TEST.restClasses=org.onap.policy.drools.statemanagement.IntegrityMonitorRestManager
       # http.server.services.TEST.managed=false
       # http.server.services.TEST.swagger=true

       #IntegrityMonitor Properties

       # Must be unique across the system
       resource.name=pdp1
       # Name of the site in which this node is hosted
       site_name=site1
       # Forward Progress Monitor update interval seconds
       fp_monitor_interval=30
       # Failed counter threshold before failover
       failed_counter_threshold=3
       # Interval between test transactions when no traffic seconds
       test_trans_interval=10
       # Interval between writes of the FPC to the DB seconds
       write_fpc_interval=5
       # Node type Note: Make sure you don't leave any trailing spaces, or you'll get an 'invalid node type' error!
       node_type=pdp_drools
       # Dependency groups are groups of resources upon which a node operational state is dependent upon.
       # Each group is a comma-separated list of resource names and groups are separated by a semicolon. For example:
       # dependency_groups=site_1.astra_1,site_1.astra_2;site_1.brms_1,site_1.brms_2;site_1.logparser_1;site_1.pypdp_1
       dependency_groups=
       # When set to true, dependent health checks are performed by using JMX to invoke test() on the dependent.
       # The default false is to use state checks for health.
       test_via_jmx=true
       # This is the max number of seconds beyond which a non incrementing FPC is considered a failure
       max_fpc_update_interval=120
       # Run the state audit every 60 seconds (60000 ms). The state audit finds stale DB entries in the
       # forwardprogressentity table and marks the node as disabled/failed in the statemanagemententity
       # table. NOTE! It will only run on nodes that have a standbystatus = providingservice.
       # A value of <= 0 will turn off the state audit.
       state_audit_interval_ms=60000
       # The refresh state audit is run every (default) 10 minutes (600000 ms) to clean up any state corruption in the
       # DB statemanagemententity table. It only refreshes the DB state entry for the local node. That is, it does not
       # refresh the state of any other nodes. A value <= 0 will turn the audit off. Any other value will override
       # the default of 600000 ms.
       refresh_state_audit_interval_ms=600000

       # Repository audit properties
       # Assume it's the releaseRepository that needs to be audited,
       # because that's the one BRMGW will publish to.
       repository.audit.id=${{releaseRepositoryID}}
       repository.audit.url=${{releaseRepositoryUrl}}
       repository.audit.username=${{repositoryUsername}}
       repository.audit.password=${{repositoryPassword}}
       repository2.audit.id=${{releaseRepository2ID}}
       repository2.audit.url=${{releaseRepository2Url}}
       repository2.audit.username=${{repositoryUsername2}}
       repository2.audit.password=${{repositoryPassword2}}

       # Repository Audit Properties
       # Flag to control the execution of the subsystemTest for the Nexus Maven repository
       repository.audit.is.active=false
       repository.audit.ignore.errors=true
       repository.audit.interval_sec=86400
       repository.audit.failure.threshold=3

       # DB Audit Properties
       # Flag to control the execution of the subsystemTest for the Database
       db.audit.is.active=false


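The ``dependency_groups`` format described in the file comments (comma-separated resource names, groups separated by semicolons) can be parsed as in the Python sketch below. The function ``parse_dependency_groups`` is hypothetical and is shown only to make the format concrete; stripping surrounding whitespace and skipping empty entries are choices made for the sketch and do not describe how the feature itself parses the property.

.. code-block:: python

    # Sketch of parsing the dependency_groups property value.
    # Hypothetical helper; whitespace handling is an assumption of the sketch.

    def parse_dependency_groups(value):
        """Split 'a,b;c' into [['a', 'b'], ['c']], ignoring empty entries."""
        groups = []
        for raw_group in value.split(";"):
            members = [name.strip() for name in raw_group.split(",") if name.strip()]
            if members:
                groups.append(members)
        return groups


    example = ("site_1.astra_1,site_1.astra_2;site_1.brms_1,site_1.brms_2;"
               "site_1.logparser_1;site_1.pypdp_1")
    parsed = parse_dependency_groups(example)
    print(parsed)
    print(parse_dependency_groups(""))   # the default, empty property

The example value from the properties file yields four groups, two with two members and two with a single member, while the default empty value yields no groups at all.
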
End of Document

.. SSNote: Wiki page ref. https://wiki.onap.org/display/DW/Feature+State+Management
