.. This work is licensed under a
.. Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0

.. _prometheus-metrics:

Prometheus Metrics support in Policy Framework Components
#########################################################

.. contents::
    :depth: 3

This page explains the Prometheus metrics exposed by the different Policy Framework components.

1. Context
==========

Collecting application metrics is the first step towards gaining insight into Policy Framework services and infrastructure from the point of view of availability, performance, reliability and scalability.

Monitoring serves the following operational needs:

1. Monitoring via dashboards: provide visual aids that display health and key metrics for use by Operations.
2. Alerting: something is broken and the issue must be addressed immediately, or something might break soon and proactive measures must be taken to avoid it.
3. Conducting retrospective analysis: rich information that is readily available to better troubleshoot issues.
4. Analyzing trends: how fast is usage growing? What does the incoming traffic look like? This helps assess scaling needs to meet forecasted demand.

The principles outlined in the `Four Golden Signals <https://sre.google/sre-book/monitoring-distributed-systems/#xref_monitoring_golden-signals>`__ developed by Google Site Reliability Engineers have been adopted to define the key metrics for Policy Framework.

- Request rate: number of requests per second served by Policy services.
- Event processing rate: number of requests/events per second processed by the PDPs.
- Errors: number of those requests/events that fail.
- Latency/Duration: amount of time those requests take and, for PDPs, the relevant event processing times.
- Saturation: the degree of fullness, or percentage utilization, of a service, emphasizing the resources that are most constrained: CPU, memory, I/O, and custom metrics by domain.


2. Policy Framework Key metrics
===============================

System Metrics common across all Policy components
--------------------------------------------------

These standard metrics have been exposed via a Prometheus endpoint since the Istanbul release and can be categorized as below:

CPU Usage
*********

CPU usage percentage can be derived from *"system_cpu_usage"* for Spring Boot applications and from *"process_cpu_seconds_total"* for non-Spring Boot applications using `PromQL <https://prometheus.io/docs/prometheus/latest/querying/basics/>`__.

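As an illustrative sketch of that derivation, the queries below turn both metrics into a percentage. The 5-minute window and the grouping by ``instance`` are assumptions made for illustration, not values taken from the Policy Framework configuration.

```promql
# Spring Boot components: system_cpu_usage is a gauge between 0 and 1.
avg by (instance) (system_cpu_usage) * 100

# Non-Spring Boot components: process_cpu_seconds_total is a counter of CPU
# seconds, so its per-second rate is the number of cores in use
# (100 corresponds to one fully used core).
rate(process_cpu_seconds_total[5m]) * 100
```

The two results are not directly comparable: the first describes the whole host or container, the second only the JVM process.
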
Process uptime
**************

The process uptime in seconds is available via *"process_uptime_seconds"* for Spring Boot applications. For other applications it can be derived from *"process_start_time_seconds"*, which reports the process start time as a Unix epoch timestamp.

Status of the applications is available via the standard *"up"* metric.

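For components that only expose a start time, uptime can be computed at query time. A minimal sketch, assuming the metric is reported in epoch seconds as is conventional for this metric name:

```promql
# Uptime in seconds for components exposing process_start_time_seconds.
time() - process_start_time_seconds

# Scrape targets that Prometheus currently cannot reach.
up == 0
```
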
JVM memory metrics
******************

These metrics begin with the prefix *"jvm_memory_"*.
There is a lot of data here; however, one of the key metrics to monitor is the total heap memory usage, e.g. *sum(jvm_memory_used_bytes{area="heap"})*.

`PromQL <https://prometheus.io/docs/prometheus/latest/querying/basics/>`__ can be leveraged to represent the total or rate of memory usage.

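The following sketch shows one way to express both the total and the rate of change of heap usage. It reuses the metric name from the example above; the 5-minute window and the grouping by ``instance`` are assumptions.

```promql
# Total heap memory in use, per instance.
sum by (instance) (jvm_memory_used_bytes{area="heap"})

# Rate of change of heap usage in bytes per second over the last 5 minutes.
# deriv() is used because the metric is a gauge, not a counter.
sum by (instance) (deriv(jvm_memory_used_bytes{area="heap"}[5m]))
```
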
JVM thread metrics
******************

These metrics begin with the prefix *"jvm_threads_"*. Some of the key data to monitor are:

- *"jvm_threads_live_threads"* (Spring Boot apps) or *"jvm_threads_current"* (non-Spring Boot) shows the total number of live threads, including daemon and non-daemon threads.
- *"jvm_threads_peak_threads"* (Spring Boot apps) or *"jvm_threads_peak"* (non-Spring Boot) shows the peak total number of threads since the JVM started.
- *"jvm_threads_states_threads"* (Spring Boot apps) or *"jvm_threads_state"* (non-Spring Boot) shows the number of threads by thread state.

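One possible way to chart thread states is sketched below. Both queries rely only on the ``state`` label carried by the per-state metrics; summing across all scraped instances is an illustrative choice.

```promql
# Spring Boot components: threads grouped by thread state.
sum by (state) (jvm_threads_states_threads)

# Non-Spring Boot components: the same view with the alternative metric name.
sum by (state) (jvm_threads_state)
```
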
JVM garbage collection metrics
******************************

There are many garbage collection metrics, with the prefix *"jvm_gc_"*, that give deep insight into how the JVM is managing memory. They can be broadly categorized into:

- Pause duration: the *"jvm_gc_pause_"* metrics for Spring Boot applications show how long GC took. For non-Spring Boot applications, the collection duration metrics *"jvm_gc_collection_"* provide the same information.
- Memory pool size increase: this can be assessed using *"jvm_gc_memory_allocated_bytes_total"* and *"jvm_gc_memory_promoted_bytes_total"* for Spring Boot applications.

Average garbage collection time and rate of garbage collection per second are key metrics to monitor.


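The two key figures named above can be sketched as follows for Spring Boot components. The 5-minute window is an assumption; for non-Spring Boot components the same pattern would apply to the *"jvm_gc_collection_seconds_sum"* and *"jvm_gc_collection_seconds_count"* series.

```promql
# Average GC pause in seconds over the last 5 minutes.
rate(jvm_gc_pause_seconds_sum[5m]) / rate(jvm_gc_pause_seconds_count[5m])

# Garbage collections per second over the last 5 minutes.
rate(jvm_gc_pause_seconds_count[5m])
```
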
Key metrics for Policy API
--------------------------

+------------------------------------+------------------------------------------------+----------------------------------------+
| Metric name                        | Metric description                             | Metric labels                          |
+====================================+================================================+========================================+
| process_uptime_seconds             | Uptime of policy-api application in seconds.   |                                        |
+------------------------------------+------------------------------------------------+----------------------------------------+
| http_server_requests_seconds_count | Number of API requests, filterable by URI,     | "exception": any exception string;     |
|                                    | REST method and response status, among         | "method": REST method used;            |
|                                    | other labels.                                  | "outcome": response status string;     |
|                                    |                                                | "status": HTTP response status code;   |
|                                    |                                                | "uri": REST endpoint invoked           |
+------------------------------------+------------------------------------------------+----------------------------------------+
| http_server_requests_seconds_sum   | Total time taken by API requests, in seconds,  | "exception": any exception string;     |
|                                    | filterable by URI, REST method and response    | "method": REST method used;            |
|                                    | status, among other labels.                    | "outcome": response status string;     |
|                                    |                                                | "status": HTTP response status code;   |
|                                    |                                                | "uri": REST endpoint invoked           |
+------------------------------------+------------------------------------------------+----------------------------------------+

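As a hedged sketch, the request metrics in the table above map onto three of the golden signals as follows. The 5-minute window and the choice of 5xx responses as "errors" are assumptions; in a real deployment a selector on the scrape job would normally be added so that only policy-api series are matched.

```promql
# Request rate: requests per second, per endpoint and REST method.
sum by (uri, method) (rate(http_server_requests_seconds_count[5m]))

# Errors: fraction of requests answered with a 5xx status code.
sum(rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
  / sum(rate(http_server_requests_seconds_count[5m]))

# Latency: average request duration in seconds, per endpoint.
sum by (uri) (rate(http_server_requests_seconds_sum[5m]))
  / sum by (uri) (rate(http_server_requests_seconds_count[5m]))
```
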
Key metrics for Policy PAP
--------------------------

+------------------------------------+------------------------------------------------+----------------------------------------+
| Metric name                        | Metric description                             | Metric labels                          |
+====================================+================================================+========================================+
| process_uptime_seconds             | Uptime of policy-pap application in seconds.   |                                        |
+------------------------------------+------------------------------------------------+----------------------------------------+
| http_server_requests_seconds_count | Number of API requests, filterable by URI,     | "exception": any exception string;     |
|                                    | REST method and response status, among         | "method": REST method used;            |
|                                    | other labels.                                  | "outcome": response status string;     |
|                                    |                                                | "status": HTTP response status code;   |
|                                    |                                                | "uri": REST endpoint invoked           |
+------------------------------------+------------------------------------------------+----------------------------------------+
| http_server_requests_seconds_sum   | Total time taken by API requests, in seconds,  | "exception": any exception string;     |
|                                    | filterable by URI, REST method and response    | "method": REST method used;            |
|                                    | status, among other labels.                    | "outcome": response status string;     |
|                                    |                                                | "status": HTTP response status code;   |
|                                    |                                                | "uri": REST endpoint invoked           |
+------------------------------------+------------------------------------------------+----------------------------------------+
| pap_policy_deployments             | Number of TOSCA policy deploy/undeploy         | "operation": deploy or undeploy;       |
|                                    | operations.                                    | "status": SUCCESS, FAILURE or TOTAL    |
+------------------------------------+------------------------------------------------+----------------------------------------+

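As a hedged illustration, the query below counts failed deploy operations over the last hour. It uses the metric name and label values exactly as listed in the table above; the 1-hour window is an assumption, and the name actually exposed by a given deployment may differ (for example by a ``_total`` suffix), so it should be checked against the scrape output.

```promql
# Failed deploy operations during the last hour.
sum(increase(pap_policy_deployments{operation="deploy", status="FAILURE"}[1h]))
```

Because TOTAL is listed as a status value alongside SUCCESS and FAILURE, a query that sums over every status would likely count each operation more than once; filtering on a single status avoids this.
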
Key metrics for APEX-PDP
------------------------

+--------------------------------------------+------------------------------------------------+----------------------------------------+
| Metric name                                | Metric description                             | Metric labels                          |
+============================================+================================================+========================================+
| process_start_time_seconds                 | Start time of the apex-pdp process in epoch    |                                        |
|                                            | seconds, from which uptime can be derived.     |                                        |
+--------------------------------------------+------------------------------------------------+----------------------------------------+
| pdpa_policy_deployments_total              | Number of TOSCA policy deploy/undeploy         | "operation": deploy or undeploy;       |
|                                            | operations.                                    | "status": SUCCESS, FAILURE or TOTAL    |
+--------------------------------------------+------------------------------------------------+----------------------------------------+
| pdpa_policy_executions_total               | Number of TOSCA policy executions.             | "status": SUCCESS, FAILURE or TOTAL    |
+--------------------------------------------+------------------------------------------------+----------------------------------------+
| pdpa_engine_state                          | State of the APEX engine.                      | "engine_instance_id": ID of the engine |
|                                            |                                                | thread                                 |
+--------------------------------------------+------------------------------------------------+----------------------------------------+
| pdpa_engine_last_start_timestamp_epoch     | Epoch timestamp at which the engine was last   | "engine_instance_id": ID of the engine |
|                                            | started, from which uptime can be derived.     | thread                                 |
+--------------------------------------------+------------------------------------------------+----------------------------------------+
| pdpa_engine_event_executions               | Number of APEX events executed, per engine     | "engine_instance_id": ID of the engine |
|                                            | thread.                                        | thread                                 |
+--------------------------------------------+------------------------------------------------+----------------------------------------+
| pdpa_engine_average_execution_time_seconds | Average time taken to execute an APEX policy,  | "engine_instance_id": ID of the engine |
|                                            | in seconds.                                    | thread                                 |
+--------------------------------------------+------------------------------------------------+----------------------------------------+

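The execution counter in the table above can be turned into an error ratio. A minimal sketch, assuming the status label values are exactly as listed and using an illustrative 5-minute window:

```promql
# Fraction of policy executions that failed over the last 5 minutes.
sum(rate(pdpa_policy_executions_total{status="FAILURE"}[5m]))
  / sum(rate(pdpa_policy_executions_total{status="TOTAL"}[5m]))
```

The expression returns no usable value while no executions are happening, since the denominator is then zero.
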
Key metrics for XACML PDP
-------------------------

+--------------------------------+------------------------------------------------+--------------------------------------------+
| Metric name                    | Metric description                             | Metric labels                              |
+================================+================================================+============================================+
| process_start_time_seconds     | Start time of the xacml-pdp process in epoch   |                                            |
|                                | seconds, from which uptime can be derived.     |                                            |
+--------------------------------+------------------------------------------------+--------------------------------------------+
| pdpx_policy_deployments_total  | Total number of deployment operations.         | "deploy": number of successful or failed   |
|                                |                                                | deploys; "undeploy": number of successful  |
|                                |                                                | or failed undeploys.                       |
+--------------------------------+------------------------------------------------+--------------------------------------------+
| pdpx_policy_decisions_total    | Total number of decisions.                     | "permit": number of permit decisions;      |
|                                |                                                | "deny": number of deny decisions;          |
|                                |                                                | "indeterminant": number of indeterminate   |
|                                |                                                | decisions; "not_applicable": number of     |
|                                |                                                | not-applicable decisions.                  |
+--------------------------------+------------------------------------------------+--------------------------------------------+
| logback_appender_total         | Number of log entries.                         | "level": counts on a per log level basis.  |
+--------------------------------+------------------------------------------------+--------------------------------------------+

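The log counter in the table above lends itself to a simple rate per log level. This is a sketch only; the 5-minute window is an assumption, and the query avoids hard-coding any particular ``level`` value.

```promql
# Log entries per second, broken down by log level.
sum by (level) (rate(logback_appender_total[5m]))
```

A sustained rise in the series for the higher-severity levels is a reasonable candidate for an alert rule.
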
Key metrics for Drools PDP
--------------------------

+----------------------------------------------+------------------------------------------------+----------------------------------------+
| Metric name                                  | Metric description                             | Metric labels                          |
+==============================================+================================================+========================================+
| process_start_time_seconds                   | Start time of the policy-drools-pdp process    |                                        |
|                                              | in epoch seconds, from which uptime can be     |                                        |
|                                              | derived.                                       |                                        |
+----------------------------------------------+------------------------------------------------+----------------------------------------+
| pdpd_policy_deployments_total                | Count of policy deployments.                   | "operation": deploy or undeploy;       |
|                                              |                                                | "status": SUCCESS or FAILURE           |
+----------------------------------------------+------------------------------------------------+----------------------------------------+
| pdpd_policy_executions_latency_seconds_count | Count of policy executions.                    | "controller", "controlloop", "policy"  |
+----------------------------------------------+------------------------------------------------+----------------------------------------+
| pdpd_policy_executions_latency_seconds_sum   | Total policy execution latency, in seconds.    | "controller", "controlloop", "policy"  |
+----------------------------------------------+------------------------------------------------+----------------------------------------+
| logback_appender_total                       | Count of log entries.                          | "level"                                |
+----------------------------------------------+------------------------------------------------+----------------------------------------+

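The latency count and sum series in the table above combine into an average execution latency. A minimal sketch, assuming a 5-minute window and grouping by two of the listed labels:

```promql
# Average policy execution latency in seconds, per controller and policy.
sum by (controller, policy) (rate(pdpd_policy_executions_latency_seconds_sum[5m]))
  / sum by (controller, policy) (rate(pdpd_policy_executions_latency_seconds_count[5m]))
```
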
Key metrics for Policy Distribution
-----------------------------------

+------------------------------------+--------------------------------------------------------+
| Metric name                        | Metric description                                     |
+====================================+========================================================+
| total_distribution_received_count  | Total number of distributions received.                |
+------------------------------------+--------------------------------------------------------+
| distribution_success_count         | Total number of distributions successfully processed.  |
+------------------------------------+--------------------------------------------------------+
| distribution_failure_count         | Total number of distribution failures.                 |
+------------------------------------+--------------------------------------------------------+
| total_download_received_count      | Total number of downloads received.                    |
+------------------------------------+--------------------------------------------------------+
| download_success_count             | Total number of downloads successfully processed.      |
+------------------------------------+--------------------------------------------------------+
| download_failure_count             | Total number of download failures.                     |
+------------------------------------+--------------------------------------------------------+


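As a rough sketch, the counts in the table above can be combined into failure ratios. The queries assume that the paired series carry identical label sets so that the division matches them one to one; this has not been verified against an actual scrape.

```promql
# Share of received distributions that failed (cumulative values).
distribution_failure_count / total_distribution_received_count

# Share of received downloads that failed (cumulative values).
download_failure_count / total_download_received_count
```
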
3. OOM changes to enable Prometheus monitoring for Policy Framework
===================================================================

Policy Framework uses the ServiceMonitor custom resource definition (CRD) to allow Prometheus to monitor the services it exposes. Label selection determines which services are monitored.
For label management and troubleshooting, refer to the `Prometheus operator <https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/troubleshooting.md#overview-of-servicemonitor-tagging-and-related-elements>`__ documentation.

The `OOM charts <https://github.com/onap/oom/tree/master/kubernetes/policy/components>`__ for policy include a ServiceMonitor, and its properties can be overridden based on the deployment specifics.