.. This work is licensed under a Creative Commons Attribution 4.0 International License.
.. http://creativecommons.org/licenses/by/4.0

.. _healthcheck_and_monitoring:

Healthcheck and Monitoring
==========================

Healthcheck
-----------
A small HTTP service for healthchecks runs inside the HV-VES docker container. The healthcheck port can be configured
at deployment using the ``--health-check-api-port`` command line option or the ``VESHV_HEALTHCHECK_API_PORT`` environment variable (for details see :ref:`deployment`).

This service exposes the endpoint **GET /health/ready**, which returns **HTTP 200 OK** when HV-VES is healthy
and ready for connections. Otherwise it returns **HTTP 503 Service Unavailable** with a short description of why HV-VES is unhealthy.

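
The readiness check described above can be exercised from a script. The following is a minimal sketch, not part of HV-VES: the helper name ``check_ready`` and the base URL passed to it are assumptions made for illustration, and the port must match the one configured at deployment.

.. code-block:: python

    import urllib.error
    import urllib.request


    def check_ready(base_url, timeout=2.0):
        """Query GET /health/ready and return a (ready, detail) tuple."""
        url = base_url.rstrip("/") + "/health/ready"
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                body = response.read().decode("utf-8", "replace")
                return response.status == 200, body
        except urllib.error.HTTPError as error:
            # Non-2xx responses such as 503 are raised as HTTPError;
            # the response body carries the reason reported by the service.
            return False, error.read().decode("utf-8", "replace")
        except OSError as error:
            # Covers URLError, refused connections and timeouts.
            return False, str(error)

A call such as ``check_ready("http://localhost:6060")`` (host and port are placeholders) returns ``(True, ...)`` only for an **HTTP 200 OK** answer. Any other status, or an unreachable service, is reported as not ready together with the returned reason.
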

Monitoring
----------
The HV-VES collector exposes metrics that can be collected at runtime. For this purpose the HV-VES application provides the endpoint **GET /monitoring/prometheus**,
which returns **HTTP 200 OK** with the metrics data in the response body. The data is in a format readable by the Prometheus service.
The Prometheus endpoint shares a port with the healthcheck service.

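
Outside of a Prometheus server, the endpoint can also be read directly. The sketch below is an illustration only: ``fetch_metrics`` and ``parse_prometheus_text`` are hypothetical helper names, and the parser is a simplification that handles plain ``name{labels} value`` lines and skips comment lines rather than implementing the full Prometheus text format.

.. code-block:: python

    import urllib.request


    def parse_prometheus_text(text):
        """Map each series (metric name plus optional labels) to its float value."""
        samples = {}
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and HELP/TYPE comments
            if "}" in line:
                # Labels may contain spaces, so split after the closing brace.
                split_at = line.rindex("}") + 1
                series, rest = line[:split_at], line[split_at:]
            else:
                series, _, rest = line.partition(" ")
            fields = rest.split()
            if fields:
                samples[series] = float(fields[0])
        return samples


    def fetch_metrics(base_url, timeout=2.0):
        """Read GET /monitoring/prometheus and return the parsed samples."""
        url = base_url.rstrip("/") + "/monitoring/prometheus"
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return parse_prometheus_text(response.read().decode("utf-8"))

Because the endpoint shares a port with the healthcheck service, the same base URL can be used for both. Labelled series keep their label set in the dictionary key, so they are looked up by the full ``name{...}`` string.
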
Metrics provided by HV-VES:

+-----------------------------------------------+--------------+------------------------------------------------------------------------------------------+
| Name of metric                                | Unit         | Description                                                                              |
+===============================================+==============+==========================================================================================+
| hvves_clients_rejected_cause_total            | cause/piece  | number of rejected clients grouped by cause                                              |
+-----------------------------------------------+--------------+------------------------------------------------------------------------------------------+
| hvves_clients_rejected_total                  | piece        | total number of rejected clients                                                         |
+-----------------------------------------------+--------------+------------------------------------------------------------------------------------------+
| hvves_connections_active                      | piece        | number of currently active connections                                                   |
+-----------------------------------------------+--------------+------------------------------------------------------------------------------------------+
| hvves_connections_total                       | piece        | total number of connections                                                              |
+-----------------------------------------------+--------------+------------------------------------------------------------------------------------------+
| hvves_data_received_bytes_total               | bytes        | total number of received bytes                                                           |
+-----------------------------------------------+--------------+------------------------------------------------------------------------------------------+
| hvves_disconnections_total                    | piece        | total number of disconnections                                                           |
+-----------------------------------------------+--------------+------------------------------------------------------------------------------------------+
| hvves_messages_dropped_cause_total            | cause/piece  | number of dropped messages grouped by cause                                              |
+-----------------------------------------------+--------------+------------------------------------------------------------------------------------------+
| hvves_messages_dropped_total                  | piece        | total number of dropped messages                                                         |
+-----------------------------------------------+--------------+-----------------------------------+------------------------------------------------------+
| hvves_messages_latency_seconds_bucket         | seconds      | latency is the time between       | cumulative counters for latency occurrences          |
+-----------------------------------------------+--------------+ message.header.lastEpochMicrosec  +------------------------------------------------------+
| hvves_messages_latency_seconds_count          | piece        | and time when data has been sent  | counter for number of latency occurrences            |
+-----------------------------------------------+--------------+ from HV-VES to Kafka              +------------------------------------------------------+
| hvves_messages_latency_seconds_max            | seconds      |                                   | maximum observed latency                             |
+-----------------------------------------------+--------------+                                   +------------------------------------------------------+
| hvves_messages_latency_seconds_sum            | seconds      |                                   | sum of latency values of all messages                |
+-----------------------------------------------+--------------+-----------------------------------+------------------------------------------------------+
| hvves_messages_processing_time_seconds_bucket | seconds      | processing time is time measured  | cumulative counters for processing time occurrences  |
+-----------------------------------------------+--------------+ between decoding of WTP message   +------------------------------------------------------+
| hvves_messages_processing_time_seconds_count  | piece        | and time when data has been sent  | counter for number of processing time occurrences    |
+-----------------------------------------------+--------------+ from HV-VES to Kafka              +------------------------------------------------------+
| hvves_messages_processing_time_seconds_max    | seconds      |                                   | maximum observed processing time                     |
+-----------------------------------------------+--------------+                                   +------------------------------------------------------+
| hvves_messages_processing_time_seconds_sum    | seconds      |                                   | sum of processing times of all messages              |
+-----------------------------------------------+--------------+-----------------------------------+------------------------------------------------------+
| hvves_messages_received_payload_bytes_total   | bytes        | total number of received payload bytes                                                   |
+-----------------------------------------------+--------------+------------------------------------------------------------------------------------------+
| hvves_messages_received_total                 | piece        | total number of received messages                                                        |
+-----------------------------------------------+--------------+------------------------------------------------------------------------------------------+
| hvves_messages_sent_topic_total               | topic/piece  | number of sent messages grouped by topic                                                 |
+-----------------------------------------------+--------------+------------------------------------------------------------------------------------------+
| hvves_messages_sent_total                     | piece        | number of sent messages                                                                  |
+-----------------------------------------------+--------------+------------------------------------------------------------------------------------------+

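
The ``_sum`` and ``_count`` series listed above can be combined into an average, and the dropped and received totals into a drop ratio. The following sketch assumes that the samples are already available as a dictionary of series name to value and that these series carry no labels. The helper names ``mean_seconds`` and ``dropped_ratio`` are hypothetical.

.. code-block:: python

    def mean_seconds(samples, base_name):
        """Average of a timing metric since start-up: <base>_sum / <base>_count."""
        total = samples.get(base_name + "_sum")
        count = samples.get(base_name + "_count")
        if total is None or not count:
            return None  # metric missing or nothing observed yet
        return total / count


    def dropped_ratio(samples):
        """Share of received messages that were dropped."""
        received = samples.get("hvves_messages_received_total")
        if not received:
            return None  # avoid dividing by zero before any traffic arrives
        return samples.get("hvves_messages_dropped_total", 0.0) / received

For example, ``mean_seconds(samples, "hvves_messages_latency_seconds")`` gives the mean latency over the whole lifetime of the process, not over a recent time window. Windowed values are usually computed on the Prometheus side from successive scrapes.
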
JVM metrics:

- jvm_buffer_memory_used_bytes
- jvm_classes_unloaded_total
- jvm_gc_memory_promoted_bytes_total
- jvm_buffer_total_capacity_bytes
- jvm_threads_live
- jvm_classes_loaded
- jvm_gc_memory_allocated_bytes_total
- jvm_threads_daemon
- jvm_buffer_count
- jvm_gc_pause_seconds_count
- jvm_gc_pause_seconds_sum
- jvm_gc_pause_seconds_max
- jvm_gc_max_data_size_bytes
- jvm_memory_committed_bytes
- jvm_gc_live_data_size_bytes
- jvm_memory_max_bytes
- jvm_memory_used_bytes
- jvm_threads_peak

Sample response for **GET /monitoring/prometheus**:

.. literalinclude:: metrics_sample_response.txt