.. This work is licensed under a
.. Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0

.. _prometheus-metrics:

Prometheus Metrics support in Policy Framework Components
#########################################################

.. contents::
    :depth: 3

This page explains the Prometheus metrics exposed by different Policy Framework components.


1. Context
==========

Collecting application metrics is the first step towards gaining insights into Policy Framework services and infrastructure from the point of view of Availability, Performance, Reliability and Scalability.

The goal of monitoring is to meet the following operational needs:

1. Monitoring via dashboards: Provide visual aids that display health and key metrics for use by Operations.
2. Alerting: Something is broken and the issue must be addressed immediately, or something might break soon and proactive measures must be taken to avoid such a situation.
3. Conducting retrospective analysis: Rich information that is readily available to better troubleshoot issues.
4. Analyzing trends: How fast is the usage growing? What does the incoming traffic look like? This helps assess scaling needs to meet forecasted demand.

The principles outlined in the `Four Golden Signals <https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals>`__ developed by Google Site Reliability Engineers have been adopted to define the key metrics for Policy Framework.

- Request Rate: Number of requests per second served by Policy services.
- Event Processing rate: Number of requests/events per second processed by the PDPs.
- Errors: Number of those requests/events processed that are failing.
- Latency/Duration: Amount of time those requests take and, for PDPs, the relevant event processing times.
- Saturation: Measures the degree of fullness or % utilization of a service, emphasizing the resources that are most constrained: CPU, memory, I/O, and custom metrics by domain.


2. Policy Framework Key metrics
===============================

System Metrics common across all Policy components
--------------------------------------------------

These standard metrics have been exposed via a Prometheus endpoint since the Istanbul release and can be categorized as follows:

CPU Usage
*********

CPU usage percentage can be derived from *"system_cpu_usage"* for Spring Boot applications and from *"process_cpu_seconds_total"* for non-Spring Boot applications using `PromQL <https://prometheus.io/docs/prometheus/latest/querying/basics/>`__.
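
As a rough illustration of the non-Spring Boot case, the sketch below computes the average CPU usage percentage between two samples of a cumulative CPU-seconds counter such as *"process_cpu_seconds_total"*. The function name and the sample values are hypothetical, and counter resets (for example after a restart) are not handled.

.. code-block:: python

    def cpu_usage_percent(cpu_seconds_start, cpu_seconds_end, interval_seconds):
        """Average CPU usage (percent of one core) between two counter samples."""
        if interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")
        # The counter is cumulative, so the delta is the CPU time consumed.
        consumed = cpu_seconds_end - cpu_seconds_start
        return 100.0 * consumed / interval_seconds


    # Hypothetical samples taken 60 seconds apart.
    print(cpu_usage_percent(120.0, 135.0, 60.0))  # 25.0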

Process uptime
**************

The process uptime in seconds is available via *"process_uptime_seconds"* for Spring Boot applications. For other applications it can be derived from *"process_start_time_seconds"*, which reports the process start time as a Unix epoch timestamp.

The status of the applications is available via the standard *"up"* metric.
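
For components that only expose a start timestamp, uptime has to be computed by subtracting that timestamp from the current time. A minimal sketch of that arithmetic follows; the function name and the timestamps are hypothetical.

.. code-block:: python

    import time


    def uptime_seconds(process_start_time_seconds, now=None):
        """Uptime derived from a start timestamp given in Unix epoch seconds."""
        if now is None:
            now = time.time()
        return now - process_start_time_seconds


    # Hypothetical values: started at epoch 1650000000, sampled one hour later.
    print(uptime_seconds(1650000000.0, now=1650003600.0))  # 3600.0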

JVM memory metrics
******************

These metrics begin with the prefix *"jvm_memory_"*.
Many series are available; one of the key metrics to monitor is the total heap memory usage, e.g. *sum(jvm_memory_used_bytes{area="heap"})*.

`PromQL <https://prometheus.io/docs/prometheus/latest/querying/basics/>`__ can be leveraged to represent the total or rate of memory usage.
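
To make the heap aggregation concrete, the sketch below performs the same sum outside Prometheus, directly on a scraped text payload. It is a simplified parser written for illustration only: the scrape excerpt and its "id" label values are made up, and label values containing escaped quotes are not handled.

.. code-block:: python

    import re

    # One sample line of the Prometheus text format: name{labels} value
    SAMPLE = re.compile(
        r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'  # metric name
        r'(?:\{(?P<labels>.*)\})?'              # optional label set
        r'\s+(?P<value>\S+)'                    # sample value
    )


    def total_heap_used_bytes(exposition_text):
        """Sum jvm_memory_used_bytes over all series labelled area="heap"."""
        total = 0.0
        for line in exposition_text.splitlines():
            match = SAMPLE.match(line.strip())
            if not match or match.group("name") != "jvm_memory_used_bytes":
                continue
            if 'area="heap"' in (match.group("labels") or ""):
                total += float(match.group("value"))
        return total


    # Hypothetical scrape excerpt.
    SCRAPE = """\
    # TYPE jvm_memory_used_bytes gauge
    jvm_memory_used_bytes{area="heap",id="Eden Space"} 1000.0
    jvm_memory_used_bytes{area="heap",id="Tenured Gen"} 2500.0
    jvm_memory_used_bytes{area="nonheap",id="Metaspace"} 4000.0
    """
    print(total_heap_used_bytes(SCRAPE))  # 3500.0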

JVM thread metrics
******************

These metrics begin with the prefix *"jvm_threads_"*. Key series to monitor are:

- *"jvm_threads_live_threads"* (Spring Boot apps), or *"jvm_threads_current"* (non-Spring Boot) shows the total number of live threads, including daemon and non-daemon threads
- *"jvm_threads_peak_threads"* (Spring Boot apps), or *"jvm_threads_peak"* (non-Spring Boot) shows the peak total number of threads since the JVM started
- *"jvm_threads_states_threads"* (Spring Boot apps), or *"jvm_threads_state"* (non-Spring Boot) shows the number of threads by thread state

JVM garbage collection metrics
******************************

There are many garbage collection metrics, with the prefix *"jvm_gc_"*, that give deep insight into how the JVM is managing memory. They can be broadly categorized into:

- Pause duration: *"jvm_gc_pause_"* for Spring Boot applications tells us how long GC took. For non-Spring Boot applications, the collection duration metrics *"jvm_gc_collection_"* provide the same information.
- Memory pool size increase: can be assessed using *"jvm_gc_memory_allocated_bytes_total"* and *"jvm_gc_memory_promoted_bytes_total"* for Spring Boot applications.

Average garbage collection time and rate of garbage collection per second are key metrics to monitor.
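
Both figures can be derived from two samples of a cumulative sum and count pair. The sketch below assumes the pause metrics are exposed in that form (for example as *"jvm_gc_pause_seconds_sum"* and *"jvm_gc_pause_seconds_count"*, names that are an assumption and not listed on this page); the function name and sample values are hypothetical.

.. code-block:: python

    def gc_stats(sum_start, sum_end, count_start, count_end, interval_seconds):
        """Return (average pause in seconds, collections per second)."""
        collections = count_end - count_start
        pause_time = sum_end - sum_start
        # Avoid dividing by zero when no collection happened in the interval.
        average_pause = pause_time / collections if collections else 0.0
        return average_pause, collections / interval_seconds


    # Hypothetical samples taken 60 seconds apart.
    average, rate = gc_stats(
        sum_start=2.0, sum_end=2.5,
        count_start=100, count_end=110,
        interval_seconds=60.0,
    )
    print(average, rate)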


Key metrics for Policy API
--------------------------

+------------------------------------+--------------------------------------------------+--------------------------------------+
| Metric name                        | Metric description                               | Metric labels                        |
+====================================+==================================================+======================================+
| process_uptime_seconds             | Uptime of policy-api application in seconds.     |                                      |
+------------------------------------+--------------------------------------------------+--------------------------------------+
| http_server_requests_seconds_count | Number of API requests filtered by URI, REST     | "exception": any exception string;   |
|                                    | method and response status among other labels    | "method": REST method used;          |
|                                    |                                                  | "outcome": response status string;   |
|                                    |                                                  | "status": HTTP response status code; |
|                                    |                                                  | "uri": REST endpoint invoked         |
+------------------------------------+--------------------------------------------------+--------------------------------------+
| http_server_requests_seconds_sum   | Total time taken for API requests filtered by    | "exception": any exception string;   |
|                                    | URI, REST method and response status among other | "method": REST method used;          |
|                                    | labels                                           | "outcome": response status string;   |
|                                    |                                                  | "status": HTTP response status code; |
|                                    |                                                  | "uri": REST endpoint invoked         |
+------------------------------------+--------------------------------------------------+--------------------------------------+
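
The "status" label makes it possible to approximate the Errors signal from the request counter. The sketch below computes the share of server-side errors from request counts already grouped by status code; the function name and the counts are hypothetical.

.. code-block:: python

    def server_error_ratio(count_by_status):
        """Fraction of requests whose HTTP status code is in the 5xx range."""
        total = sum(count_by_status.values())
        if total == 0:
            return 0.0
        errors = sum(
            count for status, count in count_by_status.items()
            if status.startswith("5")
        )
        return errors / total


    # Hypothetical request counts grouped by the "status" label.
    print(server_error_ratio({"200": 90, "404": 6, "500": 4}))  # 0.04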

Key metrics for Policy PAP
--------------------------

+------------------------------------+--------------------------------------------------+--------------------------------------+
| Metric name                        | Metric description                               | Metric labels                        |
+====================================+==================================================+======================================+
| process_uptime_seconds             | Uptime of policy-pap application in seconds.     |                                      |
+------------------------------------+--------------------------------------------------+--------------------------------------+
| http_server_requests_seconds_count | Number of API requests filtered by URI, REST     | "exception": any exception string;   |
|                                    | method and response status among other labels    | "method": REST method used;          |
|                                    |                                                  | "outcome": response status string;   |
|                                    |                                                  | "status": HTTP response status code; |
|                                    |                                                  | "uri": REST endpoint invoked         |
+------------------------------------+--------------------------------------------------+--------------------------------------+
| http_server_requests_seconds_sum   | Total time taken for API requests filtered by    | "exception": any exception string;   |
|                                    | URI, REST method and response status among other | "method": REST method used;          |
|                                    | labels                                           | "outcome": response status string;   |
|                                    |                                                  | "status": HTTP response status code; |
|                                    |                                                  | "uri": REST endpoint invoked         |
+------------------------------------+--------------------------------------------------+--------------------------------------+
| pap_policy_deployments             | Number of TOSCA policy deploy/undeploy           | "operation": deploy or undeploy;     |
|                                    | operations                                       | "status": SUCCESS, FAILURE or TOTAL  |
+------------------------------------+--------------------------------------------------+--------------------------------------+

Key metrics for APEX-PDP
------------------------

+--------------------------------------------+--------------------------------------------------+-----------------------------------------------+
| Metric name                                | Metric description                               | Metric labels                                 |
+============================================+==================================================+===============================================+
| process_start_time_seconds                 | Start time of the apex-pdp process (Unix epoch,  |                                               |
|                                            | seconds), from which uptime can be derived       |                                               |
+--------------------------------------------+--------------------------------------------------+-----------------------------------------------+
| pdpa_policy_deployments_total              | Number of TOSCA policy deploy/undeploy           | "operation": deploy or undeploy;              |
|                                            | operations                                       | "status": SUCCESS, FAILURE or TOTAL           |
+--------------------------------------------+--------------------------------------------------+-----------------------------------------------+
| pdpa_policy_executions_total               | Number of TOSCA policy executions                | "status": SUCCESS, FAILURE or TOTAL           |
+--------------------------------------------+--------------------------------------------------+-----------------------------------------------+
| pdpa_engine_state                          | State of APEX engine                             | "engine_instance_id": ID of the engine thread |
+--------------------------------------------+--------------------------------------------------+-----------------------------------------------+
| pdpa_engine_last_start_timestamp_epoch     | Epoch timestamp at which the engine was last     | "engine_instance_id": ID of the engine thread |
|                                            | started; uptime can be derived from it           |                                               |
+--------------------------------------------+--------------------------------------------------+-----------------------------------------------+
| pdpa_engine_event_executions               | Number of APEX event executions per engine       | "engine_instance_id": ID of the engine thread |
|                                            | thread                                           |                                               |
+--------------------------------------------+--------------------------------------------------+-----------------------------------------------+
| pdpa_engine_average_execution_time_seconds | Average time taken to execute an APEX policy in  | "engine_instance_id": ID of the engine thread |
|                                            | seconds                                          |                                               |
+--------------------------------------------+--------------------------------------------------+-----------------------------------------------+

Key metrics for XACML PDP
-------------------------

+-------------------------------+--------------------------------------------------+--------------------------------------------------+
| Metric name                   | Metric description                               | Metric labels                                    |
+===============================+==================================================+==================================================+
| process_start_time_seconds    | Start time of the xacml-pdp process (Unix epoch, |                                                  |
|                               | seconds), from which uptime can be derived       |                                                  |
+-------------------------------+--------------------------------------------------+--------------------------------------------------+
| pdpx_policy_deployments_total | Counts the total number of deployment operations | "deploy": counts the number of successful or     |
|                               |                                                  | failed deploys;                                  |
|                               |                                                  | "undeploy": counts the number of successful or   |
|                               |                                                  | failed undeploys                                 |
+-------------------------------+--------------------------------------------------+--------------------------------------------------+
| pdpx_policy_decisions_total   | Counts the total number of decisions             | "permit": counts the number of permit decisions; |
|                               |                                                  | "deny": counts the number of deny decisions;     |
|                               |                                                  | "indeterminant": counts the number of            |
|                               |                                                  | indeterminate decisions;                         |
|                               |                                                  | "not_applicable": counts the number of not       |
|                               |                                                  | applicable decisions                             |
+-------------------------------+--------------------------------------------------+--------------------------------------------------+
| logback_appender_total        | Counts the log entries                           | "level": counts entries per log level            |
+-------------------------------+--------------------------------------------------+--------------------------------------------------+

Key metrics for Drools PDP
--------------------------

+----------------------------------------------+--------------------------------------------------+---------------------------------------+
| Metric name                                  | Metric description                               | Metric labels                         |
+==============================================+==================================================+=======================================+
| process_start_time_seconds                   | Start time of the policy-drools-pdp component    |                                       |
|                                              | (Unix epoch, seconds); uptime is derived from it |                                       |
+----------------------------------------------+--------------------------------------------------+---------------------------------------+
| pdpd_policy_deployments_total                | Count of policy deployments                      | "operation": deploy or undeploy;      |
|                                              |                                                  | "status": SUCCESS or FAILURE          |
+----------------------------------------------+--------------------------------------------------+---------------------------------------+
| pdpd_policy_executions_latency_seconds_count | Count of policy executions                       | "controller", "controlloop", "policy" |
+----------------------------------------------+--------------------------------------------------+---------------------------------------+
| pdpd_policy_executions_latency_seconds_sum   | Sum of policy execution latency in seconds       | "controller", "controlloop", "policy" |
+----------------------------------------------+--------------------------------------------------+---------------------------------------+
| logback_appender_total                       | Count of log entries                             | "level"                               |
+----------------------------------------------+--------------------------------------------------+---------------------------------------+

Key metrics for Policy Distribution
-----------------------------------

+-----------------------------------+------------------------------------------------------+
| Metric name                       | Metric description                                   |
+===================================+======================================================+
| total_distribution_received_count | Total number of distributions received               |
+-----------------------------------+------------------------------------------------------+
| distribution_success_count        | Total number of distributions successfully processed |
+-----------------------------------+------------------------------------------------------+
| distribution_failure_count        | Total number of distribution failures                |
+-----------------------------------+------------------------------------------------------+
| total_download_received_count     | Total number of downloads received                   |
+-----------------------------------+------------------------------------------------------+
| download_success_count            | Total number of downloads successfully processed     |
+-----------------------------------+------------------------------------------------------+
| download_failure_count            | Total number of download failures                    |
+-----------------------------------+------------------------------------------------------+


3. OOM changes to enable Prometheus monitoring for Policy Framework
===================================================================

Policy Framework uses the ServiceMonitor custom resource definition (CRD) to allow Prometheus to monitor the services it exposes. Label selection determines which services are monitored.
For label management and troubleshooting, refer to the `Prometheus operator <https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/troubleshooting.md#overview-of-servicemonitor-tagging-and-related-elements>`__ documentation.

`OOM charts <https://github.com/onap/oom/tree/master/kubernetes/policy/components>`__ for policy include a ServiceMonitor, and its properties can be overridden based on the deployment specifics.
192`OOM charts <https://github.com/onap/oom/tree/master/kubernetes/policy/components>`__ for policy include ServiceMonitor and properties can be overrided based on the deployment specifics.