.. Blame: Saryu Shah, commit 355b579, 2017-11-10 17:02:32 +0000
.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0

*************************
Feature: State Management
*************************

.. contents::
    :depth: 3

Summary
^^^^^^^
The State Management Feature provides:

- Node-level health monitoring
- Monitoring the health of dependency nodes - nodes on which a particular node is dependent
- Ability to lock/unlock a node and suspend or resume all application processing
- Ability to suspend application processing on a node that is disabled or in a standby state
- Interworking/Coordination of state values
- Support for ITU X.731 states and state transitions for:

  - Administrative State
  - Operational State
  - Availability Status
  - Standby Status

Usage
^^^^^

Enabling and Disabling Feature State Management
-----------------------------------------------

The State Management Feature is enabled from the command line while logged in as the policy user, after the feature properties file has been configured (see the Description Details section). From the command line:

- > features status - Lists the status of features
- > features enable state-management - Enables the State Management Feature
- > features disable state-management - Disables the State Management Feature

The Drools PDP must be stopped before features are enabled or disabled, and restarted afterwards.

.. code-block:: bash
   :caption: Enabling State Management Feature

   policy@hyperion-4:/opt/app/policy$ policy stop
   [drools-pdp-controllers]
    L []: Stopping Policy Management... Policy Management (pid=354) is stopping... Policy Management has stopped.
   policy@hyperion-4:/opt/app/policy$ features enable state-management
   name                       version         status
   ----                       -------         ------
   controlloop-utils          1.1.0-SNAPSHOT  disabled
   healthcheck                1.1.0-SNAPSHOT  disabled
   test-transaction           1.1.0-SNAPSHOT  disabled
   eelf                       1.1.0-SNAPSHOT  disabled
   state-management           1.1.0-SNAPSHOT  enabled
   active-standby-management  1.1.0-SNAPSHOT  disabled
   session-persistence        1.1.0-SNAPSHOT  disabled

Description Details
^^^^^^^^^^^^^^^^^^^

State Model
-----------

The state model follows the ITU X.731 standard for state management. The supported state values and transitions are:

**Administrative State:**

- Locked - All application transaction processing is prohibited
- Unlocked - Application transaction processing is allowed

**Administrative State Transitions:**

- The transition from Unlocked to Locked is triggered with a Lock operation
- The transition from Locked to Unlocked is triggered with an Unlock operation

**Operational State:**

- Enabled - The node is healthy and able to process application transactions
- Disabled - The node is not healthy and not able to process application transactions

**Operational State Transitions:**

- The transition from Enabled to Disabled is triggered with a disableFailed or disableDependency operation
- The transition from Disabled to Enabled is triggered with an enableNotFailed or enableNoDependency operation, once the Availability Status has returned to Null (see the Availability Status Transitions)

**Availability Status:**

- Null - The Operational State is Enabled
- Failed - The Operational State is Disabled because the node is no longer healthy
- Dependency - The Operational State is Disabled because all members of a dependency group are disabled
- Dependency.Failed - The Operational State is Disabled because the node is no longer healthy and all members of a dependency group are disabled

**Availability Status Transitions:**

- The transition from Null to Failed is triggered with a disableFailed operation
- The transition from Null to Dependency is triggered with a disableDependency operation
- The transition from Failed to Dependency.Failed is triggered with a disableDependency operation
- The transition from Dependency to Dependency.Failed is triggered with a disableFailed operation
- The transition from Dependency.Failed to Failed is triggered with an enableNoDependency operation
- The transition from Dependency.Failed to Dependency is triggered with an enableNotFailed operation
- The transition from Failed to Null is triggered with an enableNotFailed operation
- The transition from Dependency to Null is triggered with an enableNoDependency operation

**Standby Status:**

- Null - The node does not support active-standby behavior
- ProvidingService - The node is actively providing application transaction service
- HotStandby - The node is capable of providing application transaction service, but is currently waiting to be promoted
- ColdStandby - The node is not capable of providing application service because of a failure

**Standby Status Transitions:**

- The transition from Null to HotStandby is triggered by a demote operation when the Operational State is Enabled
- The transition from Null to ColdStandby is triggered by a demote operation when the Operational State is Disabled
- The transition from ColdStandby to HotStandby is triggered by a transition of the Operational State from Disabled to Enabled
- The transition from HotStandby to ColdStandby is triggered by a transition of the Operational State from Enabled to Disabled
- The transition from ProvidingService to ColdStandby is triggered by a transition of the Operational State from Enabled to Disabled
- The transition from HotStandby to ProvidingService is triggered by a Promote operation
- The transition from ProvidingService to HotStandby is triggered by a Demote operation
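
The Availability Status transitions listed above form a small state machine, and a lookup table is one compact way to express them. The following Python sketch is illustrative only. It is not taken from the policy/common code, the function names are hypothetical, and leaving the status unchanged for an operation that is not listed for the current status is an assumption of the sketch.

.. code-block:: python

   """Sketch of the Availability Status transitions listed above (not the ONAP code)."""

   # (current availability status, operation) -> next availability status.
   # "null" is the only status in which the Operational State is Enabled.
   TRANSITIONS = {
       ("null", "disableFailed"): "failed",
       ("null", "disableDependency"): "dependency",
       ("failed", "disableDependency"): "dependency.failed",
       ("dependency", "disableFailed"): "dependency.failed",
       ("dependency.failed", "enableNoDependency"): "failed",
       ("dependency.failed", "enableNotFailed"): "dependency",
       ("failed", "enableNotFailed"): "null",
       ("dependency", "enableNoDependency"): "null",
   }


   def apply_operation(avail_status, operation):
       """Return the next status; unlisted pairs leave it unchanged (an assumption)."""
       return TRANSITIONS.get((avail_status, operation), avail_status)


   def operational_state(avail_status):
       """The Operational State is Enabled only while the Availability Status is null."""
       return "enabled" if avail_status == "null" else "disabled"


   if __name__ == "__main__":
       status = "null"
       for op in ("disableFailed", "disableDependency",
                  "enableNotFailed", "enableNoDependency"):
           status = apply_operation(status, op)
           print(op, "->", status, "/", operational_state(status))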

Database
--------

The State Management feature creates a StateManagement database having three tables:

**StateManagementEntity** - This table has the following columns:

- **id** - Automatically created unique identifier
- **resourceName** - The unique identifier for a node
- **adminState** - The Administrative State
- **opState** - The Operational State
- **availStatus** - The Availability Status
- **standbyStatus** - The Standby Status
- **created_Date** - The timestamp the resource entry was created
- **modifiedDate** - The timestamp the resource entry was last modified

**ForwardProgressEntity** - This table has the following columns:

- **forwardProgressId** - Automatically created unique identifier
- **resourceName** - The unique identifier for a node
- **fpc_count** - A forward progress counter which is periodically incremented if the node is healthy
- **created_date** - The timestamp the resource entry was created
- **last_updated** - The timestamp the resource entry was last updated

**ResourceRegistrationEntity** - This table has the following columns:

- **ResourceRegistrationId** - Automatically created unique identifier
- **resourceName** - The unique identifier for a node
- **resourceUrl** - The JMX URL used to check the health of a node
- **site** - The name of the site in which the resource resides
- **nodeType** - The type of the node (i.e., pdp_xacml, pdp_drools, pap, pap_admin, logparser, brms_gateway, astra_gateway, elk_server, pypdp)
- **created_date** - The timestamp the resource entry was created
- **last_updated** - The timestamp the resource entry was last updated
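
The *last_updated* column of the *ForwardProgressEntity* table is what allows a stale entry to be recognized. The Python sketch below shows one way such a check could look. It assumes the rows have already been read into memory, the class and function names are hypothetical, and the 120-second default mirrors the *max_fpc_update_interval* value shown in the Properties section. It is not the audit code used by the feature.

.. code-block:: python

   """Sketch of a stale forward-progress check (illustrative, not the ONAP code)."""
   from dataclasses import dataclass
   from datetime import datetime, timedelta


   @dataclass
   class ForwardProgressRow:
       """Mirrors three ForwardProgressEntity columns; the Python types are assumptions."""
       resource_name: str
       fpc_count: int
       last_updated: datetime


   def find_stale_resources(rows, now, max_fpc_update_interval_s=120):
       """Return resource names whose row was not updated within the allowed interval."""
       limit = timedelta(seconds=max_fpc_update_interval_s)
       return [row.resource_name for row in rows if now - row.last_updated > limit]


   if __name__ == "__main__":
       now = datetime(2017, 11, 10, 17, 0, 0)
       rows = [
           ForwardProgressRow("pdp1", 42, now - timedelta(seconds=10)),
           ForwardProgressRow("pdp2", 17, now - timedelta(seconds=300)),
       ]
       # pdp2 was last updated 300 s ago, which exceeds the 120 s limit.
       print(find_stale_resources(rows, now))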

Node Health Monitoring
----------------------

**Application Monitoring**

    Application monitoring can be implemented using the *startTransaction()* and *endTransaction()* methods. Whenever a transaction is started, the *startTransaction()* method is called. If the node is locked, disabled or in a hot/cold standby state, the method will throw an exception. Otherwise, it resets the timer which triggers the default *testTransaction()* method.

    When a transaction completes, calling *endTransaction()* increments the forward progress counter in the *ForwardProgressEntity* DB table. As long as this counter is updating, the integrity monitor will assume the node is healthy/sane.

    If the *startTransaction()* method is not called within a provisioned period of time, a timer will expire which calls the *testTransaction()* method. The default implementation of this method simply increments the forward progress counter. The *testTransaction()* method may be overridden to perform a more meaningful test of system sanity, if desired.

    If the forward progress counter stops incrementing, the integrity monitoring routine will assume the node application has lost sanity and it will trigger a *statechange* (disableFailed) to cause the operational state to become disabled and the availability status attribute to become failed. Once the forward progress counter again begins incrementing, the operational state will return to enabled.
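
The interplay between the transaction methods and the forward progress counter can be sketched as follows. This Python model is a simplification for illustration and is not the *IntegrityMonitor* API: the class, its attributes and the idea of counting consecutive checks without progress are assumptions, and timers and database writes are left out.

.. code-block:: python

   """Sketch of the application-monitoring flow (not the IntegrityMonitor API)."""


   class NodeUnavailableError(Exception):
       """Raised when a transaction is started on a node that may not process it."""


   class TransactionMonitor:
       def __init__(self, failed_counter_threshold=3):
           self.admin_state = "unlocked"
           self.op_state = "enabled"
           self.standby_status = "null"
           self.fpc = 0                  # forward progress counter
           self._last_seen_fpc = 0
           self._missed_checks = 0
           self._threshold = failed_counter_threshold

       def start_transaction(self):
           """Refuse work when locked, disabled, or in hot/cold standby."""
           if (self.admin_state == "locked" or self.op_state == "disabled"
                   or self.standby_status in ("hotstandby", "coldstandby")):
               raise NodeUnavailableError("node cannot process transactions")
           # A real implementation would also reset the test-transaction timer here.

       def end_transaction(self):
           """A completed transaction counts as forward progress."""
           self.fpc += 1

       def test_transaction(self):
           """Default sanity test: simply advance the counter."""
           self.fpc += 1

       def check_forward_progress(self):
           """Periodic check: disable after too many checks without progress."""
           if self.fpc != self._last_seen_fpc:
               self._last_seen_fpc = self.fpc
               self._missed_checks = 0
               self.op_state = "enabled"
           else:
               self._missed_checks += 1
               if self._missed_checks >= self._threshold:
                   self.op_state = "disabled"   # corresponds to disableFailed
           return self.op_state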

**Dependency Monitoring**

    When a Drools PDP (or other node using the *IntegrityMonitor* policy/common module) is dependent upon other nodes to perform its function, those other nodes can be defined as dependencies in the properties file (the *dependency_groups* property). In order for the dependency algorithm to function, the other nodes must also be running the *IntegrityMonitor*. Periodically the Drools PDP will check the state of its dependencies. If all members of a dependency group have failed, the Drools PDP will declare that it can no longer function and change the operational state to disabled and the availability status to dependency.

    In addition to checking other policy nodes, the *IntegrityMonitor* periodically calls a *subsystemTest()* method. In the Drools PDP, *subsystemTest()* has been overridden to execute an audit of the database and of the Maven repository. If the audit is unable to verify the function of either the database or the Maven repository, the Drools PDP will declare that it can no longer function and change the operational state to disabled and the availability status to dependency.

    When a failed dependency returns to normal operation, the *IntegrityMonitor* will change the operational state to enabled and the availability status to null.
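
The dependency rule (a group has failed only when every one of its members is disabled) can be sketched in a few lines. The Python below is an illustration rather than the *IntegrityMonitor* algorithm. It parses a value in the *dependency_groups* format used in the properties file, the function names are hypothetical, and treating a resource with no recorded state as disabled is an assumption.

.. code-block:: python

   """Sketch of dependency-group evaluation (illustrative; not the IntegrityMonitor code)."""


   def parse_dependency_groups(value):
       """Split 'a,b;c' into [['a', 'b'], ['c']]: ';' separates groups, ',' members."""
       groups = []
       for group in value.split(";"):
           members = [name.strip() for name in group.split(",") if name.strip()]
           if members:
               groups.append(members)
       return groups


   def failed_groups(groups, op_states):
       """Return the groups in which every member is disabled.

       A resource with no recorded state is treated as disabled (an assumption).
       """
       return [
           group for group in groups
           if all(op_states.get(name, "disabled") == "disabled" for name in group)
       ]


   if __name__ == "__main__":
       groups = parse_dependency_groups(
           "site_1.astra_1,site_1.astra_2;site_1.brms_1,site_1.brms_2")
       states = {
           "site_1.astra_1": "disabled",
           "site_1.astra_2": "enabled",
           "site_1.brms_1": "disabled",
           "site_1.brms_2": "disabled",
       }
       # One healthy astra node keeps its group alive; the brms group has failed.
       print(failed_groups(groups, states))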

**External Health Monitoring Interface**

    The Drools PDP has an HTTP test interface which, when called, returns 200 if all seems well and 500 otherwise. The host and port of the test interface are defined in the properties file.
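
An external monitor only needs to interpret the status code. The sketch below is a hypothetical Python client using the standard library. The function name is illustrative, the URL path must be taken from the actual deployment because this document does not state it, and treating an unreachable node as unhealthy is an assumption. It could be called as *node_is_healthy("http://<host>:9981/<path>")*, where 9981 is the default port from the properties file.

.. code-block:: python

   """Sketch of a client for the HTTP test interface (the URL path is not specified here)."""
   import urllib.error
   import urllib.request


   def node_is_healthy(url, opener=urllib.request.urlopen, timeout=5.0):
       """Return True only for an HTTP 200 answer; 500 or no answer counts as unhealthy."""
       try:
           with opener(url, timeout=timeout) as response:
               return response.status == 200
       except urllib.error.HTTPError:
           # urllib raises HTTPError for 4xx/5xx answers such as the 500 described above.
           return False
       except (urllib.error.URLError, OSError):
           # Connection refused, timeout, DNS failure: treated as unhealthy (an assumption).
           return False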


Site Manager
------------

The Site Manager is not deployed with the Drools PDP, but it is available in the policy/common repository in the site-manager directory.
The Site Manager provides a lock/unlock interface for nodes and a way to display node information and status.

The following is from the README file included with the Site Manager.

.. code-block:: bash
   :caption: Site Manager README extract

   Before using 'siteManager', the file 'siteManager.properties' needs to be
   edited to configure the parameters used to access the database:

       javax.persistence.jdbc.driver - typically 'org.mariadb.jdbc.Driver'

       javax.persistence.jdbc.url - URL referring to the database,
           which typically has the form: 'jdbc:mariadb://<host>:<port>/<db>'
           ('<db>' is probably 'xacml' in this case)

       javax.persistence.jdbc.user - the user id for accessing the database

       javax.persistence.jdbc.password - password for accessing the database

   Once the properties file has been updated, the 'siteManager' script can be
   invoked as follows:

       siteManager show [ -s <site> | -r <resourceName> ] :
           display node information (Site, NodeType, ResourceName, AdminState,
           OpState, AvailStatus, StandbyStatus)

       siteManager setAdminState { -s <site> | -r <resourceName> } <new-state> :
           update admin state on selected nodes

       siteManager lock { -s <site> | -r <resourceName> } :
           lock selected nodes

       siteManager unlock { -s <site> | -r <resourceName> } :
           unlock selected nodes

Note that the 'siteManager' script assumes that the script,
'site-manager-${project.version}.jar' file and 'siteManager.properties' file
are all in the same directory. If the files are separated, the 'siteManager'
script will need to be modified so it can locate the jar and properties files.


Properties
----------

The feature-state-management.properties file controls the function of the State Management Feature. In general, the properties have adequate descriptions in the file. Parameters which must be replaced prior to usage are indicated thus: ${{parameter to be replaced}}.

.. code-block:: bash
   :caption: feature-state-management.properties

   # DB properties
   javax.persistence.jdbc.driver=org.mariadb.jdbc.Driver
   javax.persistence.jdbc.url=jdbc:mariadb://${{SQL_HOST}}:3306/statemanagement
   javax.persistence.jdbc.user=${{SQL_USER}}
   javax.persistence.jdbc.password=${{SQL_PASSWORD}}

   # DroolsPDPIntegrityMonitor Properties
   # Test interface host and port defaults may be overwritten here
   http.server.services.TEST.host=0.0.0.0
   http.server.services.TEST.port=9981
   #These properties will default to the following if no other values are provided:
   # http.server.services.TEST.restClasses=org.onap.policy.drools.statemanagement.IntegrityMonitorRestManager
   # http.server.services.TEST.managed=false
   # http.server.services.TEST.swagger=true

   #IntegrityMonitor Properties

   # Must be unique across the system
   resource.name=pdp1
   # Name of the site in which this node is hosted
   site_name=site1
   # Forward Progress Monitor update interval seconds
   fp_monitor_interval=30
   # Failed counter threshold before failover
   failed_counter_threshold=3
   # Interval between test transactions when no traffic seconds
   test_trans_interval=10
   # Interval between writes of the FPC to the DB seconds
   write_fpc_interval=5
   # Node type Note: Make sure you don't leave any trailing spaces, or you'll get an 'invalid node type' error!
   node_type=pdp_drools
   # Dependency groups are groups of resources upon which a node operational state is dependent upon.
   # Each group is a comma-separated list of resource names and groups are separated by a semicolon. For example:
   # dependency_groups=site_1.astra_1,site_1.astra_2;site_1.brms_1,site_1.brms_2;site_1.logparser_1;site_1.pypdp_1
   dependency_groups=
   # When set to true, dependent health checks are performed by using JMX to invoke test() on the dependent.
   # The default false is to use state checks for health.
   test_via_jmx=true
   # This is the max number of seconds beyond which a non incrementing FPC is considered a failure
   max_fpc_update_interval=120
   # Run the state audit every 60 seconds (60000 ms). The state audit finds stale DB entries in the
   # forwardprogressentity table and marks the node as disabled/failed in the statemanagemententity
   # table. NOTE! It will only run on nodes that have a standbystatus = providingservice.
   # A value of <= 0 will turn off the state audit.
   state_audit_interval_ms=60000
   # The refresh state audit is run every (default) 10 minutes (600000 ms) to clean up any state corruption in the
   # DB statemanagemententity table. It only refreshes the DB state entry for the local node. That is, it does not
   # refresh the state of any other nodes. A value <= 0 will turn the audit off. Any other value will override
   # the default of 600000 ms.
   refresh_state_audit_interval_ms=600000


   # Repository audit properties
   # Assume it's the releaseRepository that needs to be audited,
   # because that's the one BRMGW will publish to.
   repository.audit.id=${{releaseRepositoryID}}
   repository.audit.url=${{releaseRepositoryUrl}}
   repository.audit.username=${{repositoryUsername}}
   repository.audit.password=${{repositoryPassword}}
   repository2.audit.id=${{releaseRepository2ID}}
   repository2.audit.url=${{releaseRepository2Url}}
   repository2.audit.username=${{repositoryUsername2}}
   repository2.audit.password=${{repositoryPassword2}}

   # Repository Audit Properties
   # Flag to control the execution of the subsystemTest for the Nexus Maven repository
   repository.audit.is.active=false
   repository.audit.ignore.errors=true
   repository.audit.interval_sec=86400
   repository.audit.failure.threshold=3

   # DB Audit Properties
   # Flag to control the execution of the subsystemTest for the Database
   db.audit.is.active=false
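
Because every ${{...}} parameter has to be replaced before the feature is used, a quick scan for leftovers can catch an incomplete configuration. The Python sketch below is a hypothetical helper and not part of the feature. It reports placeholders only on active key=value lines, and the sample values (such as *policy_user*) are invented for the illustration.

.. code-block:: python

   """Sketch: list ${{...}} placeholders still present in a properties file (illustrative)."""
   import re

   PLACEHOLDER = re.compile(r"\$\{\{([^}]+)\}\}")


   def unreplaced_placeholders(properties_text):
       """Return (key, placeholder name) pairs for active lines that still hold a placeholder."""
       found = []
       for line in properties_text.splitlines():
           stripped = line.strip()
           if not stripped or stripped.startswith("#") or "=" not in stripped:
               continue  # skip blanks, comments, and lines without a key=value pair
           key, value = stripped.split("=", 1)
           for name in PLACEHOLDER.findall(value):
               found.append((key.strip(), name))
       return found


   if __name__ == "__main__":
       sample = "\n".join([
           "javax.persistence.jdbc.url=jdbc:mariadb://${{SQL_HOST}}:3306/statemanagement",
           "javax.persistence.jdbc.user=policy_user",       # invented sample value
           "# repository.audit.id=${{releaseRepositoryID}}",  # commented out, so ignored
       ])
       print(unreplaced_placeholders(sample))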


End of Document

.. SSNote: Wiki page ref. https://wiki.onap.org/display/DW/Feature+State+Management