Alerts


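The following sections list the built-in Prometheus alerting rules, grouped by the rule file they are defined in; each heading shows the file path and the group name inside it. On disk, every rule file wraps its rules in the standard groups: layout. Below is a minimal sketch of that layout with a hypothetical example rule (not one of the shipped alerts), plus the promtool command that validates a file:

groups:
- name: abgw
  rules:
  - alert: Example alert            # hypothetical rule for illustration only
    expr: up{job="abgw"} == 0
    for: 5m
    labels:
      component: abgw
      severity: warning
    annotations:
      summary: Service is not responding.

# Validate a rule file before reloading Prometheus:
#   promtool check rules /var/lib/prometheus/alerts/abgw.rules
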
/var/lib/prometheus/alerts/abgw.rules > abgw
Attempt to use migrated accounts (0 active)
alert: Attempt to use migrated accounts
expr: instance:abgw_inst_outdated:count > 0
for: 1m
labels:
  component: abgw
  severity: warning
annotations:
  summary: One or more attempts to use migrated accounts have been detected in the last 24 hours. Please contact the technical support.
Backup storage CRL is not up to date (0 active)
alert: Backup storage CRL is not up to date
expr: label_join(time() - instance_path_reg_type:abgw_next_certificate_expiration:min > 86400 * 2 and instance_path_reg_type:abgw_next_certificate_expiration:min{path!~".*root\\.crl$",type="crl"}, "object_id", "-", "reg_name", "type", "path")
labels:
  component: abgw
  severity: warning
annotations:
  summary: 'The CRL has not been updated for more than 2 days. Path: {{$labels.path}}. Registration name: {{$labels.reg_name}}.'
Backup storage SSL certificate has expired (0 active)
alert: Backup storage SSL certificate has expired
expr: label_join(instance_path_reg_type:abgw_next_certificate_expiration:min - time() < 0 and instance_path_reg_type:abgw_next_certificate_expiration:min{type!="crl"}, "object_id", "-", "reg_name", "type", "path")
labels:
  component: abgw
  severity: critical
annotations:
  summary: 'The {{$labels.type}} certificate has expired. Path: {{$labels.path}}. Registration name: {{$labels.reg_name}}.'
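
The certificate and CRL alerts above use PromQL's label_join to synthesize the object_id label out of existing labels. A minimal sketch of its behavior, using hypothetical label values:

# label_join(v, dst_label, separator, src_label_1, src_label_2, ...)
label_join(instance_path_reg_type:abgw_next_certificate_expiration:min, "object_id", "-", "reg_name", "type", "path")
# a series with reg_name="reg1", type="crl", path="/certs/main.crl" (hypothetical values)
# gains the label object_id="reg1-crl-/certs/main.crl"; all other labels stay intact.
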
Backup storage SSL certificate will expire in less than 14 days (0 active)
Backup storage SSL certificate will expire in less than 21 days (0 active)
Backup storage SSL certificate will expire in less than 7 days (0 active)
Backup storage has high replica open error rate (0 active)
alert: Backup storage has high replica open error rate
expr: err:abgw_file_replica_open_errs:rate5m{err!="OK"} / on() group_left() sum(err:abgw_file_replica_open_errs:rate5m) > 0.05
for: 1m
labels:
  component: abgw
  severity: error
annotations:
  summary: The rate of "{{$labels.err}}" errors when opening replica files in the backup storage is higher than 5%.
Backup storage has high replica removal error rate (0 active)
alert: Backup storage has high replica removal error rate
expr: err:abgw_rm_file_push_errs:rate5m{err!="OK"} / on() group_left() sum(err:abgw_rm_file_push_errs:rate5m) > 0.05
for: 1m
labels:
  component: abgw
  severity: error
annotations:
  summary: The rate of "{{$labels.err}}" errors when removing secondary replica files in the backup storage is higher than 5%.
Backup storage has high replica write error rate (0 active)
alert: Backup storage has high replica write error rate
expr: err:abgw_push_replica_errs:rate5m{err!="OK"} / on() group_left() sum(err:abgw_push_replica_errs:rate5m) > 0.05
for: 1m
labels:
  component: abgw
  severity: error
annotations:
  summary: The rate of "{{$labels.err}}" errors when writing replica files in the backup storage is higher than 5%.
Backup storage service is down (0 active)
alert: Backup storage service is down
expr: label_replace(node_systemd_unit_state{name=~"abgw-kvstore-proxy.service|abgw-setting.service|vstorage-abgw.service",state="active"}, "name", "$1", "name", "(.*)\\.service") != 1 and on(node) backend_node_abgw == 1
for: 5m
labels:
  component: abgw
  object_id: '{{ $labels.name }} - {{ $labels.instance }}'
  severity: critical
annotations:
  summary: Service {{ $labels.name }} is down on host {{ $labels.instance }}.
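
The service-down alerts in this listing rely on label_replace to strip the .service suffix from systemd unit names before they are reported. A sketch of the call, with explanatory comments:

# label_replace(v, dst_label, replacement, src_label, regex)
label_replace(node_systemd_unit_state, "name", "$1", "name", "(.*)\\.service")
# name="vstorage-abgw.service"  ->  name="vstorage-abgw" ($1 is the first capture group);
# series whose name label does not match the regex pass through unchanged.
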
Backup storage throttling is activated (0 active)
alert: Backup storage throttling is activated
expr: job:abgw_append_throttle_delay_ms:rate5m != 0
for: 1m
labels:
  component: abgw
  severity: warning
annotations:
  summary: Backup storage started to throttle write operations due to the lack of free space. Visit https://kb.acronis.com/content/62823 to learn how to troubleshoot this issue.
Different number of collaborating backup storage services (0 active)
Found data file with inconsistent last_sync_offset (0 active)
alert: Found data file with inconsistent last_sync_offset
expr: sum by(path) (changes(abgw_file_sync_offset_mismatch_errs_total{job="abgw"}[3h])) != 0
for: 1m
labels:
  component: abgw
  severity: error
annotations:
  summary: The "{{ $labels.path }}" file's size is less than last_sync_offset stored in info file.
Storage I/O error (0 active)
alert: Storage I/O error
expr: instance:abgw_io_errors:count > 0
for: 1m
labels:
  component: abgw
  severity: error
annotations:
  summary: One or more errors have been detected during storage I/O operations in the last 24 hours. Please contact the technical support.
/var/lib/prometheus/alerts/backend.rules > Backend
Backend service is down (0 active)
alert: Backend service is down
expr: label_replace(sum by(name) (node_systemd_unit_state{name=~"vstorage-ui-backend.service",state="active"}), "name", "$1", "name", "(.*)\\.service") == 0
for: 5m
labels:
  component: cluster
  object_id: '{{ $labels.name }}'
  severity: critical
annotations:
  summary: Service {{$labels.name}} is down.
Changes to the management database are not replicated (0 active)
alert: Changes to the management database are not replicated
expr: db_replication_status == 2 and on(node) softwareupdates_node_state{state!~"updat.*"} == 1
for: 10m
labels:
  component: node
  severity: critical
annotations:
  summary: Changes to the management database are not replicated to the node "{{ $labels.host }}" because it is offline. Check the node's state and connectivity.
Changes to the management database are not replicated (0 active)
alert: Changes to the management database are not replicated
expr: db_replication_status == 1
labels:
  component: node
  severity: critical
annotations:
  summary: Changes to the management database are not replicated to the node "{{ $labels.host }}". Please contact the technical support.
Cluster update failed (0 active)
alert: Cluster update failed
expr: count(softwareupdates_cluster_info{state="failed"}) + count(softwareupdates_node_info{state="idle"}) == count(up{job="node"}) + 1
labels:
  component: cluster
  severity: critical
annotations:
  summary: Update failed for the cluster.
Compute cluster has failed (0 active)
alert: Compute cluster has failed
expr: compute_status == 2
labels:
  component: compute
  severity: critical
annotations:
  summary: Compute cluster has failed. Unable to manage virtual machines.
Compute node service is down (0 active)
alert: Compute node service is down
expr: label_replace(node_systemd_unit_state{name=~"openvswitch.service|ovs-vswitchd.service|ovsdb-server.service",state="active"}, "name", "$1", "name", "(.*)\\.service") != 1 and on(node) backend_node_compute == 1
for: 5m
labels:
  component: compute
  object_id: '{{ $labels.name }} - {{ $labels.instance }}'
  severity: critical
annotations:
  summary: Service {{ $labels.name }} is down on host {{ $labels.instance }}.
Critical node service is down (0 active)
alert: Critical node service is down
expr: label_replace(node_systemd_unit_state{name=~"nginx.service|vcmmd.service|vstorage-ui-agent.service",state="active"}, "name", "$1", "name", "(.*)\\.service") != 1
for: 5m
labels:
  component: node
  object_id: '{{ $labels.name }} - {{ $labels.instance }}'
  severity: critical
annotations:
  summary: Service {{ $labels.name }} is down on host {{ $labels.instance }}.
Disk SMART warning (0 active)
alert: Disk SMART warning
expr: backend_node_disk_status{role!="unassigned",smart_status="failed"}
labels:
  component: cluster
  object_id: '{{ $labels.device }}-{{ $labels.serial_number }}-{{ $labels.instance }}'
  severity: error
annotations:
  summary: Disk "{{ $labels.device }}" ({{ $labels.serial_number }}) on node "{{ $labels.instance }}" has failed a S.M.A.R.T. check.
Disk error (0 active)
alert: Disk error
expr: backend_node_disk_status{disk_status=~"unavail|failed",role!="unassigned"}
labels:
  component: cluster
  object_id: '{{ $labels.device }}-{{ $labels.serial_number }}-{{ $labels.instance }}'
  severity: error
annotations:
  summary: Disk "{{ $labels.device }}" ({{ $labels.serial_number }}) has failed on node "{{ $labels.instance }}".
Entering maintenance for update failed (0 active)
alert: Entering maintenance for update failed
expr: softwareupdates_node_info{state="maintenance_failed"} * on(node) group_left(instance) up{job="node"}
labels:
  component: node
  severity: critical
annotations:
  summary: Entering maintenance failed while updating the node {{$labels.instance}}.
High availability for the admin panel must be configured (0 active)
alert: High availability for the admin panel must be configured
expr: count by(cluster_id) (backend_node_master) >= 3 and on(cluster_id) backend_ha_up == 0
for: 15m
labels:
  component: cluster
  severity: error
annotations:
  summary: |
    Configure high availability for the admin panel in SETTINGS > System settings > Management node high availability. Otherwise the admin panel will be a single point of failure.
High availability service is down (0 active)
alert: High availability service is down
expr: label_replace(node_systemd_unit_state{name=~"vstorage-ui-backend-raftor.service",state="active"}, "name", "$1", "name", "(.*)\\.service") != 1 and on(node) backend_node_management == 1 and on() backend_ha_up == 1
for: 5m
labels:
  component: cluster
  object_id: '{{ $labels.name }}'
  severity: critical
annotations:
  summary: Service {{$labels.name}} is down on host {{$labels.instance}}.
Identity provider connection error (0 active)
alert: Identity provider connection error
expr: backend_idp_error{error_type="connection_error"} == 1
for: 10m
labels:
  component: cluster
  severity: error
annotations:
  summary: Unable to connect to identity provider "{{ $labels.idp_name }}" in domain "{{ $labels.domain_name }}".
Identity provider validation error (0 active)
alert: Identity provider validation error
expr: backend_idp_error{error_type="validation_error"} == 1
labels:
  component: cluster
  severity: error
annotations:
  summary: Invalid identity provider configuration "{{ $labels.idp_name }}" in domain "{{ $labels.domain_name }}".
Incompatible hardware detected (0 active)
alert: Incompatible hardware detected
expr: backend_node_cpu_info{iommu="False",model=~"AMD EPYC.*"} * on(node_id) group_right(model) label_join(backend_node_nic_info{model=~"MT27800 Family \\[ConnectX\\-5\\]",type=~"Infiniband controller|Ethernet controller"}, "nic_model", "", "model")
for: 10m
labels:
  component: node
  severity: warning
annotations:
  summary: '{{- if query "backend_vendor_info{vendor=''acronis''}" -}} Incompatible hardware detected on node {{$labels.node_id}}: {{$labels.model}} & {{$labels.nic_model}}. Using Mellanox and AMD may lead to data loss. Please double-check that SR-IOV is properly enabled. Visit https://kb.acronis.com/content/64948 to learn how to troubleshoot this issue. {{- else if query "backend_vendor_info{vendor=''virtuozzo''}" -}} Incompatible hardware detected on node {{$labels.node_id}}: {{$labels.model}} & {{$labels.nic_model}}. Using Mellanox and AMD may lead to data loss. Please double-check that SR-IOV is properly enabled. Visit https://support.virtuozzo.com/hc/en-us/articles/19764365143953 to learn how to troubleshoot this issue. {{- end -}}'
Kafka SSL CA certificate has expired (0 active)
alert: Kafka SSL CA certificate has expired
expr: kafka_ssl_ca_cert_expire_in_days <= 0
labels:
  component: compute
  severity: critical
annotations:
  summary: Kafka SSL CA certificate has expired. Please renew the certificate.
Kafka SSL CA certificate will expire in less than 30 days (0 active)
alert: Kafka SSL CA certificate will expire in less than 30 days
expr: (kafka_ssl_ca_cert_expire_in_days > 0) <= 30
labels:
  component: compute
  severity: warning
  value: '{{ $value }}'
annotations:
  summary: Kafka SSL CA certificate will expire in {{ $value }} days. Please renew the certificate.
Kafka SSL client certificate has expired (0 active)
alert: Kafka SSL client certificate has expired
expr: kafka_ssl_client_cert_expire_in_days <= 0
labels:
  component: compute
  severity: critical
annotations:
  summary: Kafka SSL client certificate has expired. Please renew the certificate.
Kafka SSL client certificate will expire in less than 30 days (0 active)
alert: Kafka SSL client certificate will expire in less than 30 days
expr: (kafka_ssl_client_cert_expire_in_days > 0) <= 30
labels:
  component: compute
  severity: warning
  value: '{{ $value }}'
annotations:
  summary: Kafka SSL client certificate will expire in {{ $value }} days. Please renew the certificate.
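
The expiry alerts use a chained-comparison idiom: a PromQL comparison without the bool modifier filters series instead of returning 0/1, so two comparisons in a row select a value band. A sketch with comments:

(kafka_ssl_client_cert_expire_in_days > 0) <= 30
# "> 0"  keeps only series with a positive value, so already expired certificates drop out
# "<= 30" then keeps, of those, only series with at most 30 days left
# net effect: the alert fires when 0 < days_left <= 30, and $value still holds the days left
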
Kernel is outdated (0 active)
alert: Kernel is outdated
expr: backend_node_kernel_outdated == 1 and on(node) softwareupdates_node_state{state="uptodate"} == 1
labels:
  component: node
  severity: warning
annotations:
  summary: Node "{{ $labels.instance }}" is not running the latest kernel.
License expired (0 active)
alert: License expired
expr: cluster_license_info{status=~"(expired|invalid|error|inactive)"} == 1
for: 30m
labels:
  component: cluster
  severity: critical
annotations:
  summary: The license of cluster "{{$labels.cluster_name}}" has expired. Contact your reseller to update your license immediately!
License is not loaded (0 active)
alert: License is not loaded
expr: cluster_license_info{status="unknown"} == 1
for: 30m
labels:
  component: cluster
  severity: warning
annotations:
  summary: The license is not loaded.
License is not updated (0 active)
alert: License is not updated
expr: cluster_license_info{expire_in_days=~"[7-9]|(1[0-9])|20",is_spla="False"} == 1
for: 30m
labels:
  component: cluster
  severity: warning
annotations:
  summary: The license cannot be updated automatically and will expire in less than 21 days. Check the cluster connectivity to the license server or contact the technical support.
License will expire soon (0 active)
alert: License will expire soon
expr: cluster_license_info{expire_in_days=~"[1-6]"} == 1
for: 30m
labels:
  component: cluster
  severity: critical
annotations:
  summary: The license has not been updated automatically and will expire in less than 7 days. Check the cluster connectivity to the license server and contact the technical support immediately.
Management node HA has four nodes (0 active)
alert: Management node HA has four nodes
expr: count(backend_node_ha == 1) == 4
for: 10m
labels:
  component: cluster
  severity: warning
annotations:
  summary: The management node HA configuration has four nodes. It is recommended to include three or five nodes.
Management node backup does not exist (0 active)
alert: Management node backup does not exist
expr: backend_database_backup_age{last_backup_date="None"} == 0
labels:
  component: cluster
  severity: error
annotations:
  summary: The last management node backup has failed or does not exist!
Management node backup is old (0 active)
alert: Management node backup is old
expr: (backend_database_backup_age > 0) < 3
for: 1h
labels:
  component: cluster
  severity: warning
  value: '{{ $value }}'
annotations:
  summary: Management node backup is older than {{ $value }} day(s).
Management node backup is too old (0 active)
alert: Management node backup is too old
expr: backend_database_backup_age > 3
labels:
  component: cluster
  severity: error
  value: '{{ $value }}'
annotations:
  summary: Management node backup is older than {{ $value }} days.
Management node service is down (0 active)
alert: Management node service is down
expr: label_replace(node_systemd_unit_state{name=~"alertmanager.service|prometheus.service|postgresql.service|pgbouncer.service",state="active"}, "name", "$1", "name", "(.*)\\.service") != 1 and on(node) backend_node_management == 1
for: 5m
labels:
  component: cluster
  object_id: '{{ $labels.name }} - {{ $labels.instance }}'
  severity: critical
annotations:
  summary: Service {{ $labels.name }} is down on host {{ $labels.instance }}.
Management panel SSL certificate has expired (0 active)
alert: Management panel SSL certificate has expired
expr: backend_ui_ssl_cert_expire_in_days < 1
labels:
  component: cluster
  severity: critical
annotations:
  summary: The SSL certificate for the admin and self-service panels has expired. Renew the certificate, as described in the product documentation, or contact the technical support.
Management panel SSL certificate will expire in less than 30 days (0 active)
alert: Management panel SSL certificate will expire in less than 30 days
expr: (backend_ui_ssl_cert_expire_in_days > 7) <= 30
labels:
  component: cluster
  severity: warning
  value: '{{ $value }}'
annotations:
  summary: The SSL certificate for the admin and self-service panels will expire in {{ $value }} days. Renew the certificate, as described in the product documentation, or contact the technical support.
Management panel SSL certificate will expire in less than 7 days (0 active)
alert: Management panel SSL certificate will expire in less than 7 days
expr: (backend_ui_ssl_cert_expire_in_days > 0) <= 7
labels:
  component: cluster
  severity: critical
  value: '{{ $value }}'
annotations:
  summary: The SSL certificate for the admin and self-service panels will expire in {{ $value }} days. Renew the certificate, as described in the product documentation, or contact the technical support.
Multiple update checks failed (0 active)
alert: Multiple update checks failed
expr: (sum_over_time(softwareupdates_node_info{state="check_failed"}[3d]) and softwareupdates_node_info{state="check_failed"}) * on(node) group_left(instance) up{job="node"} / (60 * 24 * 3) >= 0.9
labels:
  component: node
  severity: critical
annotations:
  summary: Update checks failed multiple times on the node {{$labels.instance}}. Please check access to the update repository.
Network connectivity failed (0 active)
alert: Network connectivity failed
expr: sum by(network_name) (increase(network_connectivity_received_packets_total[10m])) == 0 and sum by(network_name) (increase(network_connectivity_sent_packets_total[10m])) > 0 and on(network_name) label_replace(cluster_network_info_total, "network_name", "$1", "network", "(.*)")
labels:
  component: node
  object_id: '{{$labels.network_name}}'
  severity: critical
annotations:
  summary: No network traffic has been detected via network "{{$labels.network_name}}" from all nodes.
Node crash detected (0 active)
alert: Node crash detected
expr: hci_compute_node_crashed_fenced == 1
for: 30s
labels:
  component: node
  severity: critical
annotations:
  summary: Node {{$labels.hostname}} crashed, which started the VM evacuation.
Node failed to return to operation (0 active)
alert: Node failed to return to operation
expr: hci_compute_node_crashed_fenced == 1 and on(node) backend_node_online
for: 30m
labels:
  component: node
  severity: warning
annotations:
  summary: Node {{$labels.hostname}} has failed to automatically return to operation within 30 minutes after a crash. Check the node's hardware, and then try returning it to operation manually.
Node had a fenced state for 1 hour (0 active)
alert: Node had a fenced state for 1 hour
expr: sum_over_time(hci_compute_node_crashed_fenced[2h]) > 60
for: 5m
labels:
  component: compute
  severity: critical
annotations:
  summary: Node {{$labels.hostname}} with ID {{$labels.node}} has been in the fenced state for at least 1 hour during the last 2 hours.
Node has no internet access (0 active)
alert: Node has no internet access
expr: backend_node_internet_connected == 0
for: 10m
labels:
  component: node
  severity: warning
annotations:
  summary: Node "{{ $labels.instance }}" cannot reach the repository. Ensure the node has a working internet connection.
Node network MTU packet loss (0 active)
alert: Node network MTU packet loss
expr: sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="mtu"}[10m])) > 15 and (sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="mtu"}[10m])) - sum by(src_host, dest_host, network_name) (increase(network_connectivity_received_packets_total{probe_type="mtu"}[10m]))) > 5
labels:
  component: node
  object_id: '{{$labels.network_name}}-{{$labels.src_host}}-{{$labels.dest_host}}'
  severity: warning
annotations:
  summary: Node "{{$labels.src_host}}" has a problem with network connectivity to node "{{$labels.dest_host}}" via network "{{$labels.network_name}}" due to the loss of some MTU-sized packets.
Node network connectivity problem (0 active)
alert: Node network connectivity problem
expr: sum by(src_host, dest_host, network_name) (increase(network_connectivity_received_packets_total{probe_type="ord"}[10m])) == 0 and sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="ord"}[10m])) > 0 and on(dest_host) label_replace(sum_over_time(softwareupdates_node_state{state="rebooting"}[10m]) * on(node) group_left(hostname) (backend_node_online + 1), "dest_host", "$1", "hostname", "(.*)") == 0 and on(dest_host) label_replace(backend_node_online, "dest_host", "$1", "hostname", "(.*)") == 1 and on(network_name) label_replace(cluster_network_info_total, "network_name", "$1", "network", "(.*)")
labels:
  component: node
  object_id: '{{$labels.network_name}}-{{$labels.src_host}}-{{$labels.dest_host}}'
  severity: critical
annotations:
  summary: Node "{{$labels.src_host}}" has no network connectivity to node "{{$labels.dest_host}}" via network "{{$labels.network_name}}".
Node network packet loss (0 active)
alert: Node network packet loss
expr: sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="ord"}[10m])) > 15 and (sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="ord"}[10m])) - sum by(src_host, dest_host, network_name) (increase(network_connectivity_received_packets_total{probe_type="ord"}[10m]))) > 5
labels:
  component: node
  object_id: '{{$labels.network_name}}-{{$labels.src_host}}-{{$labels.dest_host}}'
  severity: warning
annotations:
  summary: Node "{{$labels.src_host}}" has a problem with network connectivity to node "{{$labels.dest_host}}" via network "{{$labels.network_name}}" due to the loss of some packets.
Node network persistent MTU packet loss (0 active)
alert: Node network persistent MTU packet loss
expr: sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="mtu"}[2h])) > 450 and (sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="mtu"}[2h])) - sum by(src_host, dest_host, network_name) (increase(network_connectivity_received_packets_total{probe_type="mtu"}[2h]))) > 50
labels:
  component: node
  object_id: '{{$labels.network_name}}-{{$labels.src_host}}-{{$labels.dest_host}}'
  severity: warning
annotations:
  summary: Node "{{$labels.src_host}}" has a problem with network connectivity to node "{{$labels.dest_host}}" via network "{{$labels.network_name}}" due to the persistent loss of some MTU-sized packets over the last two hours.
Node network persistent packet loss (0 active)
alert: Node network persistent packet loss
expr: sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="ord"}[2h])) > 450 and (sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="ord"}[2h])) - sum by(src_host, dest_host, network_name) (increase(network_connectivity_received_packets_total{probe_type="ord"}[2h]))) > 50
labels:
  component: node
  object_id: '{{$labels.network_name}}-{{$labels.src_host}}-{{$labels.dest_host}}'
  severity: warning
annotations:
  summary: Node "{{$labels.src_host}}" has a problem with network connectivity to node "{{$labels.dest_host}}" via network "{{$labels.network_name}}" due to the persistent loss of some packets over the last two hours.
Node network unstable connectivity (0 active)
alert: Node network unstable connectivity
expr: sum by(src_host, dest_host, network_name) (increase(network_connectivity_received_packets_total{probe_type="mtu"}[10m])) == 0 and sum by(src_host, dest_host, network_name) (increase(network_connectivity_received_packets_total{probe_type="ord"}[10m])) > 0 and sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="mtu"}[10m])) > 0
labels:
  component: node
  object_id: '{{$labels.network_name}}-{{$labels.src_host}}-{{$labels.dest_host}}'
  severity: critical
annotations:
  summary: Node "{{$labels.src_host}}" has a problem with network connectivity to node "{{$labels.dest_host}}" via network "{{$labels.network_name}}" due to the loss of all MTU-sized packets.
Node service is down (0 active)
alert: Node service is down
expr: label_replace(node_systemd_unit_state{name=~"mtail.service|chronyd.service|multipathd.service|sshd.service|disp-helper.service|abrtd.service|abrt-oops.service",state="active"}, "name", "$1", "name", "(.*)\\.service") != 1
for: 5m
labels:
  component: node
  object_id: '{{ $labels.name }} - {{ $labels.instance }}'
  severity: warning
annotations:
  summary: Service {{ $labels.name }} is down on host {{ $labels.instance }}.
Node update failed (0 active)
alert: Node update failed
expr: softwareupdates_node_info{state="update_failed"} * on(node) group_left(instance) up{job="node"}
labels:
  component: node
  severity: critical
annotations:
  summary: Software update failed on the node {{$labels.instance}}.
Primary management node service is down (0 active)
Software updates exist (0 active)
alert: Software updates exist
expr: sum by(job, version, available_version) (softwareupdates_node_available) > 0 or sum by(job, version, available_version) (softwareupdates_management_panel_available) > 0
labels:
  component: cluster
  object_id: '{{ $labels.available_version }}'
  severity: info
annotations:
  summary: 'Software updates exist for the cluster. Available version: {{$labels.available_version}}.'
Unable to apply SPLA license (0 active)
alert: Unable to apply SPLA license
expr: cluster_spla_last_action_days{action_type="update_key"} > 1
labels:
  component: cluster
  severity: error
annotations:
  summary: Unable to apply SPLA license for the cluster. Contact your reseller to solve this issue.
Unable to get space usage (0 active)
alert: Unable to get space usage
expr: cluster_spla_last_action_days{action_type="get_usage"} > 1
labels:
  component: cluster
  severity: error
annotations:
  summary: Unable to get space usage for the cluster.
Unable to push space usage statistics (0 active)
alert: Unable to push space usage statistics
expr: cluster_spla_last_action_days{action_type="report"} > 1
labels:
  component: cluster
  severity: warning
annotations:
  summary: Unable to push space usage statistics for the cluster. Check the internet connection on the management node.
Update check failed (0 active)
alert: Update check failed
expr: softwareupdates_node_info{state="check_failed"} * on(node) group_left(instance) up{job="node"} and on() backend_node_internet_connected == 1
for: 10m
labels:
  component: node
  object_id: '{{ $labels.node }}'
  severity: warning
annotations:
  summary: Update check failed on the node {{$labels.instance}}. Please check access to the update repository.
Update download failed (0 active)
alert: Update download failed
expr: softwareupdates_node_info{state="download_failed"} * on(node) group_left(instance) up{job="node"}
labels:
  component: node
  severity: critical
annotations:
  summary: Update download failed on the node {{$labels.instance}}.
Update failed (0 active)
alert: Update failed
expr: sum by(job, instance, version, available_version) (softwareupdates_node_info{state="update_ctrl_plane_failed"})
labels:
  component: node
  severity: critical
annotations:
  summary: Update failed for the management panel and compute API.
/var/lib/prometheus/alerts/common.rules > common
Cluster is out of licensed space (0 active)
alert: Cluster is out of licensed space
expr: round((cluster_logical_free_space_size_bytes / cluster_logical_total_space_size_bytes) * 100, 0.01) < 0.1
for: 5m
labels:
  component: cluster
  severity: critical
annotations:
  summary: Сluster "{{ $labels.cluster_name }}" has run out of storage space allowed by license. No more data can be written. Please contact your reseller to update your license immediately!
Cluster is out of physical space (0 active)
alert: Cluster is out of physical space
expr: round((job:mdsd_cluster_raw_space_free:sum / job:mdsd_cluster_raw_space_total:sum) * 100, 0.01) < 10
for: 5m
labels:
  component: cluster
  severity: critical
annotations:
  summary: Cluster has just {{ $value }}% of physical storage space left. You may want to free some space or add more storage capacity.
  value: '{{ $value }}'
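
The round() calls in these space alerts take the optional second argument, to_nearest: the value is rounded to the nearest multiple of it. A one-line sketch with comments:

round((job:mdsd_cluster_raw_space_free:sum / job:mdsd_cluster_raw_space_total:sum) * 100, 0.01)
# to_nearest=0.01 keeps two decimal places in the reported percentage;
# the default to_nearest of 1 would round to a whole percent.
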
Compute node disk is out of space (0 active)
Connection tracking table is full (0 active)
alert: Connection tracking table is full
expr: increase(kernel_conntrack_table_full_total[10m]) > 0
labels:
  component: node
  severity: critical
annotations:
  summary: The kernel connection tracking table on node {{ $labels.instance}} has reached its maximum capacity. This may lead to network issues.
Disk is out of space (0 active)
Disk is running out of space (0 active)
alert: Disk is running out of space
expr: round(node_filesystem_free_bytes{job="node",mountpoint="/"} / node_filesystem_size_bytes{job="node",mountpoint="/"} * 100, 0.01) < 10 or node_filesystem_free_bytes{job="node",mountpoint="/"} < 5 * 1024 ^ 3
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Root partition on node "{{ $labels.instance }}" is running out of space.
Four metadata services in cluster (0 active)
alert: Four metadata services in cluster
expr: count(cluster_mdsd_info) == 4 and count(cluster_mdsd_info) <= count(backend_node_master)
for: 5m
labels:
  component: cluster
  severity: warning
annotations:
  summary: Cluster has four metadata services. This configuration slows down the cluster performance and does not improve its availability. For a cluster of four nodes, it is enough to configure three MDSes. Delete an extra MDS from one of the cluster nodes.
Infrastructure interface has high receive packet drop rate (0 active)
alert: Infrastructure interface has high receive packet drop rate
expr: ((rate(node_network_receive_drop_total{device!~"lo|tap.*",job="node"}[5m])) / (rate(node_network_receive_packets_total{device!~"lo|tap.*",job="node"}[5m]) != 0)) * 100 > 5
for: 10m
labels:
  component: node
  object_id: '{{ $labels.device }}'
  severity: critical
annotations:
  summary: |
    Network interface {{ $labels.device }} on node {{ $labels.instance }} has a receive packet drop rate higher than 5%. Please check the connectivity of the physical network devices.
Infrastructure interface has high transmit packet drop rate (0 active)
alert: Infrastructure interface has high transmit packet drop rate
expr: ((rate(node_network_transmit_drop_total{device!~"lo|tap.*",job="node"}[5m])) / (rate(node_network_transmit_packets_total{device!~"lo|tap.*",job="node"}[5m]) != 0)) * 100 > 5
for: 10m
labels:
  component: node
  object_id: '{{ $labels.device }}'
  severity: critical
annotations:
  summary: |
    Network interface {{ $labels.device }} on node {{ $labels.instance }} has a transmit packet drop rate higher than 5%. Please check the connectivity of the physical network devices.
Licensed storage capacity is critically low (0 active)
alert: Licensed storage capacity is critically low
expr: (job:mdsd_fs_allocated_size_bytes:sum >= job:mdsd_cluster_licensed_space_bytes:sum * 0.9) / 1024 ^ 3
for: 5m
labels:
  component: cluster
  severity: critical
annotations:
  summary: '{{- if query "backend_vendor_info{vendor=''acronis''}" -}} Licensed storage capacity is critically low: the cluster has reached 90% of its licensed storage capacity. Please switch to the SPLA licensing model. {{- else if query "backend_vendor_info{vendor=''virtuozzo''}" -}} Cluster has reached 90% of licensed storage capacity. {{- end -}}'
Licensed storage capacity is low (0 active)
alert: Licensed storage capacity is low
expr: ((job:mdsd_fs_allocated_size_bytes:sum > job:mdsd_cluster_licensed_space_bytes:sum * 0.8) < job:mdsd_cluster_licensed_space_bytes:sum * 0.9) / 1024 ^ 3
for: 5m
labels:
  component: cluster
  severity: warning
annotations:
  summary: '{{- if query "backend_vendor_info{vendor=''acronis''}" -}} Licensed storage capacity is low: the cluster has reached 80% of its licensed storage capacity. Please switch to the SPLA licensing model. {{- else if query "backend_vendor_info{vendor=''virtuozzo''}" -}} Cluster has reached 80% of licensed storage capacity. {{- end -}}'
Low network interface speed (0 active)
alert: Low network interface speed
expr: node_network_speed_bytes > 0 and node_network_speed_bytes / 125000 < 1000 and on(device, node) cluster_network_info_total{network!="<unspecified>"}
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Network interface "{{ $labels.device}}" on node "{{ $labels.instance }}" has speed lower than the minimally required 1 Gbps.
MTU mismatch (0 active)
More than one metadata service per node (0 active)
alert: More than one metadata service per node
expr: count by(node, hostname) (cluster_mdsd_info * on(node) group_left(hostname) backend_node_master) > 1
for: 5m
labels:
  component: cluster
  severity: warning
annotations:
  summary: Node "{{ $labels.hostname }}" has more than one metadata service located on it. It is recommended to have only one metadata service per node. Delete the extra metadata services from this node and create them on other nodes instead.
Network bond is not redundant (0 active)
alert: Network bond is not redundant
expr: node_bonding_slaves - node_bonding_active > 0
for: 5m
labels:
  component: node
  severity: critical
  value: '{{ $value }}'
annotations:
  summary: Network bond {{ $labels.master }} on node {{ $labels.instance }} is missing {{ $value }} subordinate interface(s).
Network interface half duplex (0 active)
alert: Network interface half duplex
expr: node_network_info{duplex="half",operstate="up"} and on(device, node) cluster_network_info_total{network!="<unspecified>"}
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Network interface "{{$labels.device}}" on node "{{$labels.instance}}" is not in full duplex mode.
Network interface is flapping (0 active)
alert: Network interface is flapping
expr: round(increase(node_network_carrier_changes_total{job="node"}[15m])) > 5
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Network interface {{$labels.device}} on node {{$labels.instance}} is flapping.
Node got offline too many times (0 active)
alert: Node got offline too many times
expr: changes(backend_node_online[1h]) > (3 * 2)
for: 5m
labels:
  component: node
  severity: critical
annotations:
  summary: Node "{{ $labels.hostname }}" got offline too many times for the last hour.
Node has critically high swap usage (0 active)
Node has high CPU usage (0 active)
alert: Node has high CPU usage
expr: round(100 - (avg by(instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) * 100)) > 90
for: 15m
labels:
  component: node
  severity: critical
annotations:
  summary: Node {{ $labels.instance}} has CPU usage higher than 90%. The current value is {{ $value }}.
  value: '{{ $value }}'
Node has high disk I/O usage (0 active)
alert: Node has high disk I/O usage
expr: round(rate(node_disk_io_time_seconds_total{device=~".+",job="node"}[2m]) * 100) > 85
for: 15m
labels:
  component: node
  severity: critical
annotations:
  summary: Disk /dev/{{$labels.device}} on node {{$labels.instance}} has I/O usage higher than 85%. The current value is {{ $value }}.
  value: '{{ $value }}'
Node has high memory usage (0 active)
Node has high receive packet error rate (0 active)
alert: Node has high receive packet error rate
expr: instance_device:node_network_receive_errs:rate5m{device!="br-int"} > 1000
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Node {{ $labels.instance }} has a high receive packet error rate ({{ humanize $value }}). Please check the node network settings.
  value: '{{ $value }}'
Node has high receive packet loss rate (0 active)
alert: Node has high receive packet loss rate
expr: instance_device:node_network_receive_drop:rate5m{device!="br-int"} > 1000
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Node {{ $labels.instance }} has a high receive packet loss rate ({{ humanize $value }}). Please check the node network settings.
  value: '{{ $value }}'
Node has high swap usage (0 active)
Node has high transmit packet error rate (0 active)
alert: Node has high transmit packet error rate
expr: instance_device:node_network_transmit_errs:rate5m{device!="br-int"} > 1000
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Node {{ $labels.instance }} has a high transmit packet error rate ({{ humanize $value }}). Please check the node network settings.
  value: '{{ $value }}'
Node has high transmit packet loss rate (0 active)
alert: Node has high transmit packet loss rate
expr: instance_device:node_network_transmit_drop:rate5m{device!="br-int"} > 1000
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Node {{ $labels.instance }} has a high transmit packet loss rate ({{ humanize $value }}). Please check the node network settings.
  value: '{{ $value }}'
Node is offline (0 active)
alert: Node is offline
expr: label_replace(backend_node_online, "hostname", "$1", "hostname", "([^.]*).*") == 0 unless on(node) softwareupdates_node_state{state=~"updating|rebooting"} == 1
for: 5m
labels:
  component: node
  severity: critical
annotations:
  summary: Node {{ $labels.hostname }} with ID {{ $labels.node }} is offline.
Node time critically unsynced (0 active)
alert: Node time critically unsynced
expr: floor(abs(backend_node_time_seconds{role="node"} - scalar(backend_node_time_seconds{role="backend"})) > 30)
for: 6m
labels:
  component: node
  severity: critical
annotations:
  summary: Time on node {{$labels.instance}} is critically unsynced, differing from the time on the backend node by more than {{ $value }} seconds.
  value: '{{ $value }}'
Node time not synced (0 active)
alert: Node time not synced
expr: floor((abs(backend_node_time_seconds{role="node"} - scalar(backend_node_time_seconds{role="backend"})) > 5) and (abs(backend_node_time_seconds{role="node"} - scalar(backend_node_time_seconds{role="backend"})) < 30))
for: 6m
labels:
  component: node
  severity: warning
annotations:
  summary: Time on node {{$labels.instance}} differs from the time on the backend node by more than {{ $value }} seconds.
  value: '{{ $value }}'
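
Both time-sync alerts compare every node against the single backend node by collapsing the backend series with scalar(). A sketch of the mechanics:

abs(backend_node_time_seconds{role="node"} - scalar(backend_node_time_seconds{role="backend"})) > 30
# scalar() turns a one-element vector into a plain number (or NaN if there is not
# exactly one backend series), so the subtraction applies to every node series
# without label matching; the "> 30" comparison then keeps only unsynced nodes.
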
Not enough cluster nodes (0 active)
alert: Not enough cluster nodes
expr: sum by(cluster_id) (backend_node_online) < 3
for: 5m
labels:
  component: cluster
  severity: warning
  value: '{{ $value }}'
annotations:
  summary: Cluster has only {{ $value }} node(s) instead of the recommended minimum of 3. Add more nodes to the cluster.
Not enough metadata disks (0 active)
alert: Not enough metadata disks
expr: count(cluster_mdsd_disk_info) == 2
for: 5m
labels:
  component: cluster
  severity: warning
annotations:
  summary: Cluster requires more disks with the metadata role. Losing one more MDS will halt cluster operation.
Not enough storage disks (0 active)
alert: Not enough storage disks
expr: cluster_min_req_redundancy_number{failure_domain="host"} > scalar(count(count by(node) (cluster_csd_info))) or cluster_min_req_redundancy_number{failure_domain="disk"} > scalar(count(cluster_csd_info))
for: 5m
labels:
  component: cluster
  object_id: '{{ $labels.service }}'
  severity: warning
annotations:
  summary: Cluster requires more disks with the storage role to be able to provide the required level of redundancy for the '{{ $labels.service }}' service.
Only one metadata disk in cluster (0 active)
alert: Only one metadata disk in cluster
expr: count(cluster_mdsd_disk_info) == 1
for: 5m
labels:
  component: cluster
  severity: warning
annotations:
  summary: Cluster has only one MDS. There is only one disk with the metadata role at the moment. Losing this disk will completely destroy all cluster data, irrespective of the redundancy scheme.
Over five metadata services in cluster (0 active)
alert: Over five metadata services in cluster
expr: count(cluster_mdsd_info) > 5 and count(cluster_mdsd_info) <= count(backend_node_master)
for: 5m
labels:
  component: cluster
  severity: warning
annotations:
  summary: Cluster has more than five metadata services. This configuration slows down the cluster performance and does not improve its availability. For a large cluster, it is enough to configure five MDSes. Delete extra MDSes from the cluster nodes.
Shaman service is down (0 active)
alert: Shaman service is down
expr: up{job="shaman"} == 0
for: 5m
labels:
  component: node
  object_id: '{{ $labels.instance }}'
  severity: critical
annotations:
  summary: Shaman service is down on host {{$labels.instance}}.
Software RAID is not fully synced (0 active)
alert: Software RAID is not fully synced
expr: round(((node_md_blocks_synced / node_md_blocks) * 100) < 100) and on() (node_md_state{state="active"} == 1)
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Software RAID {{ $labels.device }} on node {{ $labels.instance }} is only {{ $value }}% synced.
  value: '{{ $value }}'
Systemd service is flapping (0 active)
alert: Systemd service is flapping
expr: changes(node_systemd_unit_state{state="failed"}[5m]) > 5 or (changes(node_systemd_unit_state{state="failed"}[1h]) > 15 unless changes(node_systemd_unit_state{state="failed"}[30m]) < 7)
for: 5m
labels:
  component: node
  severity: critical
annotations:
  summary: Systemd service {{ $labels.name }} on node {{ $labels.instance }} has changed its state more than 5 times in 5 minutes or 15 times in one hour.
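
Flapping is detected with changes(), which counts how many times a series' value changed within the lookback window. A sketch of the rule's two thresholds, with comments:

changes(node_systemd_unit_state{state="failed"}[5m]) > 5            # burst: more than 5 state changes in 5 minutes
or (
  changes(node_systemd_unit_state{state="failed"}[1h]) > 15         # slow flapping over one hour...
  unless changes(node_systemd_unit_state{state="failed"}[30m]) < 7  # ...unless the last 30 minutes were mostly quiet
)
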
Zero storage disks (0 active)
alert: Zero storage disks
expr: absent(cluster_csd_info) and on() count(up{job="mds"}) > 0 and on() up{job="backend"} == 1
for: 5m
labels:
  component: cluster
  severity: warning
annotations:
  summary: Cluster has zero disks with the storage role and cannot provide the required level of redundancy.
/var/lib/prometheus/alerts/disk_smart_attrs.rules > Smart Disks Attributes
SMART Media Wearout critical (0 active)
alert: SMART Media Wearout critical
expr: smart_media_wearout_indicator{value="normalized"} <= 5 or smart_nvme_percent_used >= 95 or smart_percent_lifetime_remain{value="normalized"} <= 5 or smart_ssd_life_left{value="normalized"} <= 5
for: 5m
labels:
  component: node
  object_id: '{{ $labels.instance }}-{{ $labels.disk }}'
  severity: critical
annotations:
  summary: Disk {{ $labels.disk }} on node {{ $labels.instance }} is worn out and will fail soon. Consider replacement.
SMART Media Wearout warning (0 active)
/var/lib/prometheus/alerts/docker.rules > docker_service_common_alerts
Docker service is down (0 active)
alert: Docker service is down
expr: (node_systemd_unit_state{name="docker.service",state=~"failed|inactive"} * on(node) group_left() (backend_node_compute == 1)) == 1 and on(node) softwareupdates_node_state{state!~"updat.*"} == 1
for: 5m
labels:
  component: compute
  severity: critical
annotations:
  summary: Docker service is down on host {{$labels.instance}}.
/var/lib/prometheus/alerts/dr.rules > dr outage
Hybrid DR agent could not access compute services (0 active)
alert: Hybrid DR agent could not access compute services
expr: label_replace(runvm_agent_health{subsystem="ComputeAccess"} == 0, "object_id", "$1", "vmid", "(.*)")
for: 5m
labels:
  component: dr
  severity: critical
annotations:
  summary: Hybrid DR agent on the virtual machine {{$labels.vmid}} can't perform any actions in the infrastructure. Visit https://kb.acronis.com/content/70244 to learn how to troubleshoot this issue.
Hybrid DR agent is unavailable (0 active)
alert: Hybrid DR agent is unavailable
expr: label_replace(runvm_infra_agent unless on(uuid) runvm_agent_info, "object_id", "$1", "uuid", "(.*)")
for: 5m
labels:
  component: dr
  severity: critical
annotations:
  summary: Hybrid DR agent on the virtual machine {{$labels.uuid}} is unavailable. Visit https://kb.acronis.com/content/70244 to learn how to troubleshoot this issue.
Hybrid DR database is unavailable (0 active)
alert: Hybrid DR database is unavailable
expr: label_replace((up{job="virtual",service="postgres"} == 0) and on(vmid) up{service="dr-infra-manager"}, "object_id", "$1", "vmid", "(.*)")
for: 5m
labels:
  component: dr
  severity: critical
annotations:
  summary: Hybrid DR database on virtual machine {{$labels.vmid}} is unavailable. Operation of Hybrid DR infrastructure is not possible. Visit https://kb.acronis.com/content/70244 to learn how to troubleshoot this issue.
Hybrid DR update is available (0 active)
alert: Hybrid DR update is available
expr: sum by(uuid, name, version, available_version) (runvm_infra_outdated{product="dr",type="aci"}) > 0
for: 5m
labels:
  component: dr
  severity: critical
annotations:
  summary: Hybrid DR {{$labels.available_version}} is now available. Install the update as soon as possible; otherwise, your product functionality might be limited.
/var/lib/prometheus/alerts/ipsec.rules > ipsec
Enabling IPv6 mode takes too much time (0 active)
alert: Enabling IPv6 mode takes too much time
expr: switch_ipv6_task_running > 0
for: 1h
labels:
  component: cluster
  severity: warning
annotations:
  summary: The operation to enable the IPv6 mode has been running for more than 1 hour. Please contact the technical support.
Enabling traffic encryption takes too much time (0 active)
alert: Enabling traffic encryption takes too much time
expr: network_ipsec_task_running > 0
for: 1h
labels:
  component: cluster
  severity: warning
annotations:
  summary: The operation to enable traffic encryption has been running for more than 1 hour. Please contact the technical support.
Node IPsec certificate has expired (0 active)
alert: Node IPsec certificate has expired
expr: network_ipsec_cert_days_to_expire <= 0
labels:
  component: cluster
  severity: critical
annotations:
  summary: IPsec certificate for node {{$labels.instance}} with ID {{$labels.node}} has expired. Renew the certificate, as described in the product documentation, or contact the technical support.
Node IPsec certificate will expire in less than 7 days (0 active)
alert: Node IPsec certificate will expire in less than 7 days
expr: (network_ipsec_cert_days_to_expire > 0) <= 7
labels:
  component: cluster
  severity: warning
annotations:
  summary: IPsec certificate for node {{$labels.instance}} with ID {{$labels.node}} will expire in less than 7 days. Renew the certificate, as described in the product documentation, or contact the technical support.
System configuration is not optimal for traffic encryption (0 active)
alert: System configuration is not optimal for traffic encryption
expr: sum by(cluster_id) (network_ipsec_enabled) > 0 and sum by(cluster_id) (node_ipv6_not_configured) > 0
for: 1m
labels:
  component: cluster
  severity: warning
annotations:
  summary: Traffic encryption is enabled, but the storage network is not in the IPv6 mode. Enable the IPv6 configuration, as described in the product documentation.
/var/lib/prometheus/alerts/iscsi.rules > ISCSI
iSCSI has failed volumes (0 active)
alert: iSCSI has failed volumes
expr: label_replace(vstorage_target_manager_volume_failed, "object_id", "$1", "volume_id", "(.*)") > 0
for: 5m
labels:
  component: iSCSI
  severity: critical
annotations:
  summary: The volume {{$labels.volume_id}} has failed. Please contact the technical support.
iSCSI redundancy warning (0 active)
alert: iSCSI redundancy warning
expr: storage_redundancy_threshold{failure_domain="disk",type="iscsi"} > 0 and storage_redundancy_threshold{failure_domain="disk",type="iscsi"} <= scalar(count(backend_node_master))
for: 10m
labels:
  component: iSCSI
  severity: warning
annotations:
  summary: |
    iSCSI LUN {{ $labels.service }} of target group {{ $labels.group }} is set to failure domain "disk" even though there are enough available nodes. It is recommended to set the failure domain to "host" so that the LUN can survive host failures in addition to disk failures.
/var/lib/prometheus/alerts/kernel.rules > kernel_alerts
Block device error (0 active)
Blocked system task (0 active)
alert: Blocked system task
expr: increase(kernel_message_error{reason="blocked task"}[5m]) > 0 or kernel_message_error{reason="blocked task"} == 1 and on(node, reason, src) count_over_time(kernel_message_error{reason="blocked task"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: node
  object_id: '{{ $labels.src }}'
  severity: warning
annotations:
  summary: Blocked task '{{ $labels.src }}' detected on node '{{ $labels.instance }}'. This may occur due to an incorrect system configuration or malfunctioning hardware. Please contact support.
CPU throttled (0 active)
MCE hardware error (0 active)
OOM killer triggered (0 active)
alert: OOM killer triggered
expr: increase(kernel_message_error{reason="Out Of Memory"}[5m]) > 0 or kernel_message_error{reason="Out Of Memory"} == 1 and on(node, reason, src) count_over_time(kernel_message_error{reason="Out Of Memory"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: node
  object_id: '{{ $labels.node }}-{{ $labels.src }}'
  severity: warning
annotations:
  summary: OOM killer has been triggered on node '{{ $labels.instance }}' for process '{{ $labels.src }}'. Investigate memory usage immediately.
SCSI disk failure (0 active)
alert: SCSI disk failure
expr: increase(kernel_scsi_failures_total[5m]) > 0 or kernel_scsi_failures_total == 1 and on(node, device, err) count_over_time(kernel_scsi_failures_total[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: node
  object_id: '{{ $labels.device }}'
  severity: critical
annotations:
  summary: One or more '{{ $labels.err }}' errors detected for device '{{ $labels.device }}' on node {{ $labels.instance }}.
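
The kernel alerts above (and the nova_compute alerts in the next file) share one pattern for counters derived from mtail logs: increase() catches growth of an existing series, while the or branch appears to cover a series that has just been created, when increase() has too little history to report. A generic sketch, with the label matching simplified, using a hypothetical counter some_error_total:

increase(some_error_total[5m]) > 0                            # the counter grew: the error repeated
or some_error_total == 1                                      # a brand-new series at its first value...
  and on(node) count_over_time(some_error_total[6m]) < 3      # ...with almost no history in the window yet
  and on(node) count_over_time(up{job="mtail"}[6m]) == 3      # ...while mtail itself was being scraped normally
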
/var/lib/prometheus/alerts/nova_compute.rules > nova_compute_alerts
Conflict updating instance (0 active)
alert: Conflict updating instance
expr: increase(nova_compute_vm_error{reason="conflict updating"}[5m]) > 0 or nova_compute_vm_error{reason="conflict updating"} == 1 and on(reason) count_over_time(nova_compute_vm_error{reason="conflict updating"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: compute
  object_id: '{{ $labels.instance }}'
  severity: warning
annotations:
  summary: 'One or more ''UnexpectedTaskStateError_Remote'' errors detected. Conflict updating VM instance: {{ $labels.instance }}.'
Failed to power off VM (0 active)
alert: Failed to power off VM
expr: increase(nova_compute_vm_error{reason="power off"}[5m]) > 0 or nova_compute_vm_error{reason="power off"} == 1 and on(reason) count_over_time(nova_compute_vm_error{reason="power off"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: compute
  object_id: '{{ $labels.instance }}'
  severity: warning
annotations:
  summary: One or more failed attempts to power off VM '{{ $labels.instance }}' are detected.
Migration job conflict (0 active)
Temporary snapshot exists (0 active)
alert: Temporary snapshot exists
expr: increase(nova_compute_vm_error{reason="temporary snapshot"}[5m]) > 0 or nova_compute_vm_error{reason="temporary snapshot"} == 1 and on(reason) count_over_time(nova_compute_vm_error{reason="temporary snapshot"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: compute
  object_id: '{{ $labels.instance }}'
  severity: warning
annotations:
  summary: Various operations with VMs might be blocked because temporary snapshots left over from live migration have been detected.
VM Device or resource busy (0 active)
alert: VM Device or resource busy
expr: increase(nova_compute_vm_error{reason="device busy"}[5m]) > 0 or nova_compute_vm_error{reason="device busy"} == 1 and on(reason) count_over_time(nova_compute_vm_error{reason="device busy"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: compute
  object_id: '{{ $labels.instance }}'
  severity: warning
annotations:
  summary: One or more VM disk I/O operations have failed because the device is busy.
VM Invalid volume exception (0 active)
alert: VM Invalid volume exception
expr: increase(nova_compute_vm_error{reason="invalid volume"}[5m]) > 0 or nova_compute_vm_error{reason="invalid volume"} == 1 and on(reason) count_over_time(nova_compute_vm_error{reason="invalid volume"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: compute
  object_id: '{{ $labels.instance }}'
  severity: warning
annotations:
  summary: One or more operations with invalid volume detected for VM '{{ $labels.instance }}'.
Virtual machine error (0 active)
alert: Virtual machine error
expr: increase(nova_compute_vm_error{reason=~"lock|unsupported configuration|storage file access|dbus|timeout|operation unsupported|qemu crash|live migration|binding"}[5m]) > 0 or nova_compute_vm_error{reason=~"lock|unsupported configuration|storage file access|dbus|timeout|operation unsupported|qemu crash|live migration|binding"} == 1 and on(instance, reason) count_over_time(nova_compute_vm_error{reason=~"lock|unsupported configuration|storage file access|dbus|timeout|operation unsupported|qemu crash|live migration|binding"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: compute
  object_id: '{{ $labels.instance }}-{{ $labels.reason }}'
  severity: warning
annotations:
  summary: 'One or more ''{{ $labels.reason }}'' errors detected for VM instance: {{ $labels.instance }}.'
Virtualization management service error (0 active)
alert: Virtualization management service error
expr: increase(nova_compute_libvirt_error{reason=~"lock|connection"}[5m]) > 0 or nova_compute_libvirt_error{reason=~"lock|connection"} == 1 and on(node, reason) count_over_time(nova_compute_libvirt_error{reason=~"lock|connection"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: compute
  object_id: '{{ $labels.node }}-{{ $labels.reason }}'
  severity: critical
annotations:
  summary: One or more virtualization management service '{{ $labels.reason }}' errors detected on node '{{ $labels.instance }}'.
/var/lib/prometheus/alerts/openstack_cluster.rules > openstack_cluster_alerts
Backup plan failed (1 active)
alert: Backup plan failed
expr: openstack_freezer_backup_plan_status == 1
for: 20m
labels:
  component: compute
  object_id: '{{$labels.id}}'
  severity: warning
annotations:
  summary: Backup plan {{$labels.name}} for compute volumes has failed three consecutive times.
Cluster is out of memory (0 active)
Cluster is out of vCPU resources (0 active)
Cluster is running out of memory (0 active)
Cluster is running out of vCPU resources (0 active)
Kubernetes cluster update failed (0 active)
alert: Kubernetes cluster update failed
expr: openstack_container_infra_cluster_status == 4
for: 5m
labels:
  component: compute
  object_id: '{{ $labels.uuid }}'
  severity: warning
annotations:
  summary: Kubernetes cluster with ID "{{ $labels.uuid }}" has the "{{ $labels.status }}" status.
Licensed core limit exceeded (0 active)
alert: Licensed core limit exceeded
expr: (sum(openstack_nova_phys_cores_available) - on() licensed_core_number) > 0 and on() licensed_core_number > 0
for: 10m
labels:
  component: compute
  severity: critical
annotations:
  core_number: '{{ printf `licensed_core_number`|query|first|value }}'
  core_number_used: '{{ printf `sum(openstack_nova_phys_cores_available)`|query|first|value }}'
  summary: |
    The number of physical cores used in the cluster is "{{ $labels.core_number_used }}", which exceeds the licensed core limit of "{{ $labels.core_number }}".
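
The core_number annotations above are built with Prometheus's notification template functions: query runs a PromQL expression at evaluation time, first takes the first sample of the result, and value renders that sample's numeric value. A sketch of the same pipeline on one of the metrics used above:

'{{ printf `sum(openstack_nova_phys_cores_available)` | query | first | value }}'
# printf produces the query string, query evaluates it against Prometheus,
# first picks the first returned sample, value extracts its float value.
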
Load balancer error (0 active)
alert: Load balancer error
expr: openstack_loadbalancer_loadbalancer_status{provisioning_status="ERROR"}
labels:
  component: compute
  object_id: '{{$labels.id}}'
  severity: warning
annotations:
  summary: |
    Load balancer with ID "{{$labels.id}}" has the 'ERROR' provisioning status. Please check the Octavia service logs or contact the technical support.
Load balancer is stuck in pending state (0 active)
alert: Load balancer is stuck in pending state
expr: openstack_loadbalancer_loadbalancer_status{is_stale="true"}
labels:
  component: compute
  object_id: '{{$labels.id}}'
  severity: error
annotations:
  summary: |
    Load balancer with ID "{{$labels.id}}" is stuck with the "{{$labels.provisioning_status}}" status. Ensure that the load balancer configuration is consistent and perform a failover.
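The failover suggested in the summary can be triggered with the Octavia client; a sketch, where <lb-id> is a placeholder for the ID from the alert:

    # Review the provisioning status first, then fail the load balancer over
    openstack loadbalancer show <lb-id>
    openstack loadbalancer failover <lb-id>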
Neutron bridge mapping not found (0 active)
alert: Neutron bridge mapping not found
expr: label_replace(openstack_neutron_network_bridge_mapping * on(hostname) group_left(node) (backend_node_compute), "object_id", "$1", "provider_physical_network", "(.*)") == 0
for: 20m
labels:
  component: compute
  severity: critical
annotations:
  summary: |
    Physical network "{{$labels.provider_physical_network}}" is not found in the bridge mapping on node "{{$labels.hostname}}". Virtual network "{{$labels.network_name}}" on this node is most likely not functioning. Please contact the technical support.
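One way to verify the mapping on the affected node (the configuration path is an assumption; deployments may keep it elsewhere):

    # Bridges actually present on the node
    ovs-vsctl list-br
    # Provider-network-to-bridge mappings configured for the agent
    grep -r bridge_mappings /etc/neutron/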
Unrecognized DHCP servers detected (0 active)
alert: Unrecognized DHCP servers detected
expr: group by(network_id) (neutron_network_dhcp_reply_count >= 3) and on(network_id) (count by(network_id) (neutron_network_dhcp_reply_count >= 3) >= 2) and on() (backend_ha_up == 1)
for: 10m
labels:
  component: compute
  object_id: '{{$labels.network_id}}'
  severity: warning
annotations:
  summary: |
    Built-in DHCP service for virtual network "{{$labels.network_id}}" may be malfunctioning. Please ensure that virtual machines are receiving correct DHCP addresses or contact the technical support.
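To locate the unexpected DHCP server, traffic on the affected network can be captured; a sketch, run on a node attached to that network, with <interface> as a placeholder:

    # Watch for DHCP offers and note the server addresses that answer
    tcpdump -ni <interface> 'port 67 or port 68'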
Unrecognized DHCP servers detected from node (0 active)
alert: Unrecognized DHCP servers detected from node
expr: neutron_network_dhcp_reply_count >= 3 and on(network_id) (count by(network_id) (neutron_network_dhcp_reply_count >= 3) < 2) and on() (backend_ha_up == 1)
for: 10m
labels:
  component: compute
  object_id: '{{$labels.network_id}}'
  severity: warning
annotations:
  summary: |
    Built-in DHCP service for virtual network "{{$labels.network_id}}" may be malfunctioning on node "{{$labels.host}}". Please ensure that virtual machines are receiving correct DHCP addresses or contact the technical support.
Virtual DHCP server HA degraded (0 active)
alert: Virtual DHCP server HA degraded
expr: group by(network_id) (neutron_network_dhcp_reply_count == 1) and on(network_id) (count by(network_id) (neutron_network_dhcp_reply_count == 1) >= 2) and on() (backend_ha_up == 1)
for: 10m
labels:
  component: compute
  object_id: '{{$labels.network_id}}'
  severity: warning
annotations:
  summary: |
    Only one built-in DHCP server for virtual network "{{$labels.network_id}}" is reachable from cluster nodes. DHCP high availability has entered the degraded state. Please check the neutron-dhcp-agent service or contact the technical support.
Virtual DHCP server HA degraded on node (0 active)
alert: Virtual DHCP server HA degraded on node
expr: neutron_network_dhcp_reply_count == 1 and on(network_id) (count by(network_id) (neutron_network_dhcp_reply_count == 1) < 2) and on() (backend_ha_up == 1)
for: 10m
labels:
  component: compute
  object_id: '{{$labels.network_id}}'
  severity: warning
annotations:
  summary: |
    Only one built-in DHCP server for virtual network "{{$labels.network_id}}" is reachable from node "{{$labels.host}}". DHCP high availability has entered the degraded state. Please check the neutron-dhcp-agent service or contact the technical support.
Virtual DHCP server is unavailable (0 active)
alert: Virtual DHCP server is unavailable
expr: group by(network_id) (neutron_network_dhcp_reply_count == 0) and on(network_id) (count by(network_id) (neutron_network_dhcp_reply_count == 0) >= 2) and on() (backend_ha_up == 1)
for: 10m
labels:
  component: compute
  object_id: '{{$labels.network_id}}'
  severity: warning
annotations:
  summary: |
    Built-in DHCP server for virtual network "{{$labels.network_id}}" is not available from cluster nodes. Please check the neutron-dhcp-agent service or contact the technical support.
Virtual DHCP server is unavailable from node (0 active)
alert: Virtual DHCP server is unavailable from node
expr: neutron_network_dhcp_reply_count == 0 and on(network_id) (count by(network_id) (neutron_network_dhcp_reply_count == 0) < 2) and on() (backend_ha_up == 1)
for: 10m
labels:
  component: compute
  object_id: '{{$labels.network_id}}'
  severity: warning
annotations:
  summary: |
    Built-in DHCP server for virtual network "{{$labels.network_id}}" is not available from node "{{$labels.host}}". Please check the neutron-dhcp-agent service or contact the technical support.
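For the DHCP alerts above, the agents themselves are worth checking; the systemd unit name is an assumption for this deployment:

    # Confirm the DHCP agents are reported alive
    openstack network agent list --agent-type dhcp
    # Inspect the agent service on the affected node
    systemctl status neutron-dhcp-agent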
Virtual machine error (0 active)
alert: Virtual machine error
expr: label_replace(openstack_nova_server_status{status="ERROR"}, "object_id", "$1", "id", "(.*)")
labels:
  component: compute
  severity: critical
annotations:
  summary: Virtual machine {{$labels.name}} with ID {{$labels.id}} is in the 'Error' state.
Virtual machine has crashed (0 active)
alert: Virtual machine has crashed
expr: (libvirt_domain_info_state == 5 and libvirt_domain_info_state_reason == 3) or (libvirt_domain_info_state == 6 and libvirt_domain_info_state_reason == 1) or (libvirt_domain_info_state == 1 and libvirt_domain_info_state_reason == 9) or (libvirt_domain_info_state == 3 and libvirt_domain_info_state_reason == 10)
for: 10m
labels:
  component: compute
  object_id: '{{$labels.domain_uuid}}'
  severity: critical
annotations:
  summary: |
    Virtual machine with ID {{$labels.domain_uuid}} in project {{$labels.project_name}} has crashed. Restart the VM.
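The restart suggested in the summary can be done with the standard client, where <vm-uuid> is a placeholder for the ID from the alert:

    # Hard-reboot the crashed VM
    openstack server reboot --hard <vm-uuid>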
Virtual machine is not responding (0 active)
alert: Virtual machine is not responding
expr: sum by(project_name, name, domain_uuid) (instance_domain:libvirt_domain_block_stats_read_bytes:rate5m) == 0 and sum by(project_name, name, domain_uuid) (instance_domain:libvirt_domain_block_stats_write_bytes:rate5m) == 0 and sum by(project_name, name, domain_uuid) (instance_domain:libvirt_domain_interface_stats_receive_bytes:rate5m) == 0 and sum by(project_name, name, domain_uuid) (instance_domain:libvirt_domain_interface_stats_transmit_bytes:rate5m) == 0 and sum by(project_name, name, domain_uuid) (instance_domain:libvirt_domain_info_cpu_time_seconds:rate5m) > 0.1
for: 10m
labels:
  component: compute
  object_id: '{{$labels.domain_uuid}}'
  severity: critical
annotations:
  summary: |
    Virtual machine {{$labels.name}} in project {{$labels.project_name}} has stopped responding. Consider restarting the VM.
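Before restarting, the console output may show why the guest stopped responding; <vm-uuid> is a placeholder:

    # Review the last lines of the VM console, then reboot if needed
    openstack console log show --lines 100 <vm-uuid>
    openstack server reboot --hard <vm-uuid>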
Virtual machine state mismatch (0 active)
alert: Virtual machine state mismatch
expr: label_join((count_over_time(nova:libvirt:server:diff[2h]) > 60) and (nova:libvirt:server:diff), "object_id", "", "id")
for: 10m
labels:
  component: compute
  severity: critical
annotations:
  summary: State of virtual machine {{$labels.name}} with ID {{$labels.id}} differs between the Nova database and the libvirt configuration.
Virtual network port check failed (0 active)
alert: Virtual network port check failed
expr: neutron_port_status_failed{check!="dhcp",device_owner!="network:dhcp"} == 1 unless on(device_id) label_join(openstack_nova_server_status{status="SHELVED_OFFLOADED"}, "device_id", "", "uuid")
for: 10m
labels:
  component: compute
  object_id: '{{$labels.port_id}}'
  severity: critical
annotations:
  summary: Neutron port with ID {{$labels.port_id}} failed the {{$labels.check}} check. The port type is {{$labels.device_owner}}, with owner ID {{$labels.device_id}}.
Virtual network port check failed (0 active)
alert: Virtual network port check failed
expr: neutron_port_status_failed{check="dhcp"} == 1 unless on(device_id) label_join(openstack_nova_server_status{status="SHELVED_OFFLOADED"}, "device_id", "", "uuid")
for: 10m
labels:
  component: compute
  object_id: '{{$labels.port_id}}'
  severity: info
annotations:
  summary: Neutron port with ID {{$labels.port_id}} failed the {{$labels.check}} check. The port type is {{$labels.device_owner}}, with owner ID {{$labels.device_id}}.
Virtual network port check failed (0 active)
alert: Virtual network port check failed
expr: neutron_port_status_failed{check!="dhcp",device_owner="network:dhcp"} == 1
for: 10m
labels:
  component: compute
  object_id: '{{$labels.port_id}}'
  severity: warning
annotations:
  summary: Neutron port with ID {{$labels.port_id}} failed the {{$labels.check}} check. The port type is {{$labels.device_owner}}, with owner ID {{$labels.device_id}}.
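For any of the three variants above, the failing port and its owner can be inspected with the standard client; IDs are placeholders from the alert:

    # Inspect the port, its binding, and its status
    openstack port show <port-id>
    # If the owner is a VM, check its state as well
    openstack server show <device-id>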
Virtual router HA has more than one active L3 agent (0 active)
alert: Virtual router HA has more than one active L3 agent
expr: count by(ha_state, router_id) (openstack_neutron_l3_agent_of_router{ha_state="active"}) > 1
for: 10m
labels:
  component: compute
  object_id: '{{$labels.router_id}}'
  severity: critical
annotations:
  summary: |
    Virtual router HA with ID {{$labels.router_id}} has more than one active L3 agent. Please contact the technical support.
Virtual router HA has no active L3 agent (0 active)
alert: Virtual router HA has no active L3 agent
expr: count by(router_id) (openstack_neutron_l3_agent_of_router) - on(router_id) count by(router_id) (openstack_neutron_l3_agent_of_router{ha_state!~"active"}) == 0
for: 10m
labels:
  component: compute
  object_id: '{{$labels.router_id}}'
  severity: critical
annotations:
  summary: |
    Virtual router HA with ID {{$labels.router_id}} has no active L3 agent. Please contact the technical support.
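The agents hosting a router and their HA states can be listed directly; <router-id> is a placeholder, and the HA state column depends on the client version:

    # List L3 agents hosting the router, including HA state
    openstack network agent list --router <router-id> --long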
Virtual router SNAT-related port has invalid host binding (0 active)
alert: Virtual router SNAT-related port has invalid host binding
expr: openstack_neutron_port{device_owner="network:router_centralized_snat"} and on(device_id, binding_host_id) (label_replace(label_replace(openstack_neutron_l3_agent_of_router{ha_state="standby"}, "device_id", "$1", "router_id", "(.+)"), "binding_host_id", "$1", "agent_host", "(.+)"))
for: 10m
labels:
  component: compute
  object_id: '{{$labels.uuid}}'
  severity: critical
annotations:
  summary: |
    Virtual router SNAT-related port with ID {{$labels.uuid}} is bound to the Standby HA router node. Please contact the technical support.
Virtual router gateway port has invalid host binding (0 active)
alert: Virtual router gateway port has invalid host binding
expr: openstack_neutron_port{device_owner="network:router_gateway"} and on(device_id, binding_host_id) (label_replace(label_replace(openstack_neutron_l3_agent_of_router{ha_state="standby"}, "device_id", "$1", "router_id", "(.+)"), "binding_host_id", "$1", "agent_host", "(.+)"))
for: 10m
labels:
  component: compute
  object_id: '{{$labels.uuid}}'
  severity: critical
annotations:
  summary: |
    Virtual router gateway port with ID {{$labels.uuid}} is bound to the Standby HA router node. Please contact the technical support.
Volume attachment details mismatch (0 active)
alert: Volume attachment details mismatch
expr: label_replace((count_over_time(cinder:libvirt:volume:diff[2h]) > 60) and (cinder:libvirt:volume:diff), "object_id", "$1", "volume_id", "(.*)")
for: 10m
labels:
  component: compute
  severity: critical
annotations:
  summary: Attachment details for volume with ID {{$labels.volume_id}} differ between the Nova database and the libvirt configuration. This may also indicate an uncommitted temporary snapshot.
Volume has incorrect status (0 active)
alert: Volume has incorrect status
expr: openstack_cinder_volume_gb{status=~"error|error_deleting|error_managing|error_restoring|error_backing-up|error_extending"}
for: 10m
labels:
  component: compute
  object_id: '{{ $labels.id }}'
  severity: critical
annotations:
  summary: Volume {{ $labels.id }} has the {{ $labels.status }} status.
Volume is stuck in transitional state (0 active)
alert: Volume is stuck in transitional state
expr: openstack_cinder_volume_gb{status=~"attaching|detaching|deleting|extending|reserved"}
for: 15m
labels:
  component: compute
  object_id: '{{ $labels.id }}'
  severity: warning
annotations:
  summary: Volume {{ $labels.id }} is stuck with the {{ $labels.status }} status for more than 15 minutes.
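For either volume alert, the volume can be inspected and, if the backend is known to be consistent, its state reset by an administrator; <volume-id> is a placeholder, and resetting the state does not roll back an in-flight operation, so use it with care:

    # Inspect the volume first
    openstack volume show <volume-id>
    # Reset the state only after confirming nothing is still running against it
    openstack volume set --state available <volume-id>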
/var/lib/prometheus/alerts/openstack_domain.rules > openstack_exporter_domain_limits
Domain is out of memory (0 active)
alert: Domain is out of memory
expr: round(openstack_nova_limits_memory_used{is_domain="true"} / (openstack_nova_limits_memory_max{is_domain="true"} > 0) * 100) >= 80 < 95
for: 10m
labels:
  component: compute
  object_id: '{{ $labels.tenant_id }}'
  severity: info
annotations:
  summary: Domain {{ $labels.tenant }} has reached {{ $value }}% of the memory allocation limit.
  value: '{{ $value }}'
Domain is out of memory (0 active)
alert: Domain is out of memory
expr: round(openstack_nova_limits_memory_used{is_domain="true"} / (openstack_nova_limits_memory_max{is_domain="true"} > 0) * 100) >= 95
for: 10m
labels:
  component: compute
  object_id: '{{ $labels.tenant_id }}'
  severity: warning
annotations:
  summary: Domain {{ $labels.tenant }} has reached {{ $value }}% of the memory allocation limit.
  value: '{{ $value }}'
Domain is out of storage policy space (0 active)
alert: Domain is out of storage policy space
expr: round(openstack_cinder_limits_volume_storage_policy_used_gb{is_domain="true",volume_type!=""} / (openstack_cinder_limits_volume_storage_policy_max_gb{is_domain="true",volume_type!=""} > 0) * 100) >= 80 < 95
for: 10m
labels:
  component: compute
  object_id: '{{ $labels.tenant_id }}-{{ $labels.volume_type }}'
  severity: info
annotations:
  summary: Domain {{ $labels.tenant }} has reached {{ $value }}% of the {{ $labels.volume_type }} storage policy allocation limit.
  value: '{{ $value }}'
Domain is out of storage policy space (0 active)
alert: Domain is out of storage policy space
expr: round(openstack_cinder_limits_volume_storage_policy_used_gb{is_domain="true",volume_type!=""} / (openstack_cinder_limits_volume_storage_policy_max_gb{is_domain="true",volume_type!=""} > 0) * 100) >= 95
for: 10m
labels:
  component: compute
  object_id: '{{ $labels.tenant_id }}-{{ $labels.volume_type }}'
  severity: warning
annotations:
  summary: Domain {{ $labels.tenant }} has reached {{ $value }}% of the {{ $labels.volume_type }} storage policy allocation limit.
  value: '{{ $value }}'
Domain is out of vCPU resources (0 active)
alert: Domain is out of vCPU resources
expr: round(openstack_nova_limits_vcpus_used{is_domain="true"} / (openstack_nova_limits_vcpus_max{is_domain="true"} > 0) * 100) >= 80 < 95
for: 10m
labels:
  component: compute
  object_id: '{{ $labels.tenant_id }}'
  severity: info
annotations:
  summary: Domain {{ $labels.tenant }} has reached {{ $value }}% of the vCPU allocation limit.
  value: '{{ $value }}'
Domain is out of vCPU resources (0 active)
alert: Domain is out of vCPU resources
expr: round(openstack_nova_limits_vcpus_used{is_domain="true"} / (openstack_nova_limits_vcpus_max{is_domain="true"} > 0) * 100) >= 95
for: 10m
labels:
  component: compute
  object_id: '{{ $labels.tenant_id }}'
  severity: warning
annotations:
  summary: Domain {{ $labels.tenant }} has reached {{ $value }}% of the vCPU allocation limit.
  value: '{{ $value }}'
/var/lib/prometheus/alerts/openstack_node.rules > openstack_node_alerts
Extra RAM reservation detected for compute placement service (0 active)
alert: Extra RAM reservation detected for compute placement service
expr: (sum by(host) (label_replace(openstack_placement_resource_usage{resourcetype="MEMORY_MB"}, "host", "$1", "hostname", "(.*).vstoragedomain"))) / 1024 - (sum by(host) (label_replace(libvirt_domain_info_memory_usage_bytes, "host", "$1", "instance", "(.*)"))) / (1024 * 1024 * 1024) > 0
for: 30m
labels:
  component: compute
  object_id: '{{ $labels.host }}'
  severity: warning
  value: '{{ $value }}'
annotations:
  summary: Extra VM registrations consuming '{{ $value }}' GiB of RAM detected for the compute placement service on node '{{ $labels.host }}'.
Extra RAM reservation detected on hypervisor node (0 active)
alert: Extra RAM reservation detected on hypervisor node
expr: abs((sum by(host) (label_replace(openstack_placement_resource_usage{resourcetype="MEMORY_MB"}, "host", "$1", "hostname", "(.*).vstoragedomain"))) / 1024 - (sum by(host) (label_replace(libvirt_domain_info_memory_usage_bytes, "host", "$1", "instance", "(.*)"))) / (1024 * 1024 * 1024) < 0)
for: 30m
labels:
  component: compute
  object_id: '{{ $labels.host }}'
  severity: info
  value: '{{ $value }}'
annotations:
  summary: Extra VM registrations consuming '{{ $value }}' GiB of RAM detected on hypervisor node '{{ $labels.host }}'.
Extra vCPU reservation detected for compute placement service (0 active)
alert: Extra vCPU reservation detected for compute placement service
expr: sum by(host) (label_replace(openstack_placement_resource_usage{resourcetype="VCPU"}, "host", "$1", "hostname", "(.*).vstoragedomain")) - sum by(host) (label_replace(libvirt_domain_info_virtual_cpus, "host", "$1", "instance", "(.*)")) > 0
for: 30m
labels:
  component: compute
  object_id: '{{ $labels.host }}'
  severity: warning
  value: '{{ $value }}'
annotations:
  summary: Extra VM registrations consuming '{{ $value }}' vCPUs detected for the compute placement service on node '{{ $labels.host }}'.
Extra vCPU reservation detected on hypervisor node (0 active)
alert: Extra vCPU reservation detected on hypervisor node
expr: abs(sum by(host) (label_replace(openstack_placement_resource_usage{resourcetype="VCPU"}, "host", "$1", "hostname", "(.*).vstoragedomain")) - sum by(host) (label_replace(libvirt_domain_info_virtual_cpus, "host", "$1", "instance", "(.*)")) < 0)
for: 30m
labels:
  component: compute
  object_id: '{{ $labels.host }}'
  severity: info
  value: '{{ $value }}'
annotations:
  summary: Extra VM registrations consuming '{{ $value }}' vCPUs detected on hypervisor node '{{ $labels.host }}'.
Libvirt service is down (0 active)
alert: Libvirt service is down
expr: libvirt_up == 0 and on(node) backend_node_compute == 1
for: 10m
labels:
  component: compute
  severity: critical
annotations:
  summary: Libvirt service is not responding on node {{$labels.instance}} with ID {{$labels.node}}. Check the service state. If the service cannot start, contact the technical support.
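A minimal check-and-restart sequence on the affected node:

    # Check the daemon and its recent log, restart, then confirm domains are listed
    systemctl status libvirtd
    journalctl -u libvirtd -n 50 --no-pager
    systemctl restart libvirtd
    virsh list --all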
Node is out of memory (0 active)
Node is out of vCPU resources (0 active)
Node is running out of memory (0 active)
Node is running out of vCPU resources (0 active)
/var/lib/prometheus/alerts/openstack_projects.rules > openstack_exporter_project_limits
Project is out of floating IP addresses (34 active)
Labels State Active Since Value
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-118058" tenant_id="fe2fb846d21946fe971ffdc58d3b4021" firing 2025-11-14 08:01:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-119280" tenant_id="62d89b6e807a414eb88837cf967de6bf" firing 2025-11-14 08:01:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122213" tenant_id="2479f7bbbdd048a892cdce5b6041eb0d" firing 2025-12-20 15:20:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-118012" tenant_id="f7f4e7c84bb5421c90476e8b62d62be1" firing 2025-11-14 08:01:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122063" tenant_id="2856b90c9ce848949da775ba7190125e" firing 2025-12-16 17:00:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122519" tenant_id="3a831eead8904a1ba173172f9027a382" firing 2026-01-02 10:30:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123020" tenant_id="76589b5ba6db4544b76639a965c48087" firing 2026-01-02 12:00:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123263" tenant_id="1fc00ed8917c41268eaf1c18fc4475b2" firing 2026-01-06 05:20:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123458" tenant_id="337091c1254e4845844471037bb1d0a9" firing 2026-01-08 15:10:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122459" tenant_id="125599669f7e4a9bb3707ee1766754be" firing 2025-12-23 09:40:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122803" tenant_id="080d78c2d3e7422e97236dd562b464fd" firing 2025-12-29 10:00:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123018" tenant_id="d23d4718205143cbaa38899d58fb0902" firing 2026-01-02 12:20:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123021" tenant_id="313f523a7fc549898f4ff3272cc8edcb" firing 2026-01-03 15:00:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123275" tenant_id="655bf94757b34ce28393e3588e0a7559" firing 2026-01-06 17:50:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-118139" tenant_id="7d1f8f08904e4d6caf5285894da50d1b" firing 2025-11-14 08:01:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-32926" tenant_id="8890bce87bc64d43b0c0b033f94bf8cc" firing 2025-12-22 16:00:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122499" tenant_id="84d97362dcbf40898dea3c822e2b304b" firing 2025-12-24 05:30:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-51475" tenant_id="a497d633d803473ebb9de92f9171563f" firing 2026-01-02 05:00:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123197" tenant_id="6b8a46bf990d4e409d598b327b65f532" firing 2026-01-08 02:00:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123373" tenant_id="2620ec59745f4ba69ba8e5bb97d7f800" firing 2026-01-08 16:10:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-121352" tenant_id="e7f50c5cd5b54155b6b152eff2bf71e8" firing 2025-12-11 16:00:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-118480" tenant_id="e050246871f84196a42823491e6bd39b" firing 2025-11-14 08:01:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-97031" tenant_id="1fa7097b647240aa9d2028c31f8bbdbb" firing 2025-12-30 05:10:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122988" tenant_id="f96617b0cf214c81bd2507e838afcb80" firing 2026-01-02 10:20:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123417" tenant_id="9b52e5e261784b1baea412c1ee882ad0" firing 2026-01-08 10:20:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-118481" tenant_id="8e577430b32a4881983777a3f69f771a" firing 2025-11-14 08:01:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122427" tenant_id="9a0d4e86a55d44d4bb7ef3a884482e62" firing 2025-12-23 06:40:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123405" tenant_id="f13f6749dbd04b13a51665e54460fb10" firing 2026-01-08 10:00:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123452" tenant_id="0855a604f4a84ffe9cd42c6ac060cbd6" firing 2026-01-08 15:10:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-118320" tenant_id="c4cd1dc496624de5ba89fbcad7983cc0" firing 2025-11-14 08:01:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-120691" tenant_id="2bd0411d89c747219d257253006c19be" firing 2025-12-12 04:50:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122473" tenant_id="56371d58bd434873a84158a8d68baad6" firing 2025-12-31 05:00:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-118892" tenant_id="a559965cfc17408f863683f27ce8a425" firing 2025-11-30 11:20:04.388696542 +0000 UTC 100
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122468" tenant_id="db4140291890405c919048a43ebdd1cf" firing 2026-01-02 15:50:04.388696542 +0000 UTC 100
Project is out of memory (27 active)
alert: Project is out of memory
expr: label_replace(openstack_nova_limits_memory_used{is_domain="false"} / (openstack_nova_limits_memory_max{is_domain="false"} > 0) * 100 > 95, "object_id", "$1", "tenant_id", "(.*)")
for: 10m
labels:
  component: compute
  severity: info
annotations:
  summary: Project {{$labels.tenant}} has reached 95% of the memory allocation limit.
Labels State Active Since Value
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="fe2fb846d21946fe971ffdc58d3b4021" severity="info" tenant="project-118058" tenant_id="fe2fb846d21946fe971ffdc58d3b4021" firing 2025-10-24 12:10:04 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="655bf94757b34ce28393e3588e0a7559" severity="info" tenant="project-123275" tenant_id="655bf94757b34ce28393e3588e0a7559" firing 2026-01-06 17:40:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="8e577430b32a4881983777a3f69f771a" severity="info" tenant="project-118481" tenant_id="8e577430b32a4881983777a3f69f771a" firing 2025-11-02 04:20:04 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="2479f7bbbdd048a892cdce5b6041eb0d" severity="info" tenant="project-122213" tenant_id="2479f7bbbdd048a892cdce5b6041eb0d" firing 2025-12-20 15:20:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="125599669f7e4a9bb3707ee1766754be" severity="info" tenant="project-122459" tenant_id="125599669f7e4a9bb3707ee1766754be" firing 2025-12-23 14:20:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="6b8a46bf990d4e409d598b327b65f532" severity="info" tenant="project-123197" tenant_id="6b8a46bf990d4e409d598b327b65f532" firing 2026-01-05 10:20:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="56371d58bd434873a84158a8d68baad6" severity="info" tenant="project-122473" tenant_id="56371d58bd434873a84158a8d68baad6" firing 2025-12-31 08:20:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="7d1f8f08904e4d6caf5285894da50d1b" severity="info" tenant="project-118139" tenant_id="7d1f8f08904e4d6caf5285894da50d1b" firing 2025-10-25 11:40:04 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="62d89b6e807a414eb88837cf967de6bf" severity="info" tenant="project-119280" tenant_id="62d89b6e807a414eb88837cf967de6bf" firing 2025-11-12 02:00:04 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="2bd0411d89c747219d257253006c19be" severity="info" tenant="project-120691" tenant_id="2bd0411d89c747219d257253006c19be" firing 2025-12-30 03:30:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="3a831eead8904a1ba173172f9027a382" severity="info" tenant="project-122519" tenant_id="3a831eead8904a1ba173172f9027a382" firing 2026-01-02 15:40:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="e050246871f84196a42823491e6bd39b" severity="info" tenant="project-118480" tenant_id="e050246871f84196a42823491e6bd39b" firing 2025-11-04 05:00:04 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="2856b90c9ce848949da775ba7190125e" severity="info" tenant="project-122063" tenant_id="2856b90c9ce848949da775ba7190125e" firing 2025-12-16 17:00:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="2620ec59745f4ba69ba8e5bb97d7f800" severity="info" tenant="project-123373" tenant_id="2620ec59745f4ba69ba8e5bb97d7f800" firing 2026-01-08 16:10:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="f7f4e7c84bb5421c90476e8b62d62be1" severity="info" tenant="project-118012" tenant_id="f7f4e7c84bb5421c90476e8b62d62be1" firing 2025-10-28 10:50:04 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="080d78c2d3e7422e97236dd562b464fd" severity="info" tenant="project-122803" tenant_id="080d78c2d3e7422e97236dd562b464fd" firing 2025-12-29 09:50:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="337091c1254e4845844471037bb1d0a9" severity="info" tenant="project-123458" tenant_id="337091c1254e4845844471037bb1d0a9" firing 2026-01-08 15:00:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="c4cd1dc496624de5ba89fbcad7983cc0" severity="info" tenant="project-118320" tenant_id="c4cd1dc496624de5ba89fbcad7983cc0" firing 2025-10-28 15:10:04 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="e7f50c5cd5b54155b6b152eff2bf71e8" severity="info" tenant="project-121352" tenant_id="e7f50c5cd5b54155b6b152eff2bf71e8" firing 2025-12-11 16:00:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="84d97362dcbf40898dea3c822e2b304b" severity="info" tenant="project-122499" tenant_id="84d97362dcbf40898dea3c822e2b304b" firing 2025-12-24 12:00:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="76589b5ba6db4544b76639a965c48087" severity="info" tenant="project-123020" tenant_id="76589b5ba6db4544b76639a965c48087" firing 2026-01-02 12:40:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="1679394c56df4dcd8e1c998f9c614d66" severity="info" tenant="project-123276" tenant_id="1679394c56df4dcd8e1c998f9c614d66" firing 2026-01-07 06:20:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="9b52e5e261784b1baea412c1ee882ad0" severity="info" tenant="project-123417" tenant_id="9b52e5e261784b1baea412c1ee882ad0" firing 2026-01-08 09:40:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="9a0d4e86a55d44d4bb7ef3a884482e62" severity="info" tenant="project-122427" tenant_id="9a0d4e86a55d44d4bb7ef3a884482e62" firing 2025-12-23 06:30:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="f96617b0cf214c81bd2507e838afcb80" severity="info" tenant="project-122988" tenant_id="f96617b0cf214c81bd2507e838afcb80" firing 2026-01-02 10:20:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="d23d4718205143cbaa38899d58fb0902" severity="info" tenant="project-123018" tenant_id="d23d4718205143cbaa38899d58fb0902" firing 2026-01-02 14:40:04.388696542 +0000 UTC 100
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="1fc00ed8917c41268eaf1c18fc4475b2" severity="info" tenant="project-123263" tenant_id="1fc00ed8917c41268eaf1c18fc4475b2" firing 2026-01-06 05:20:04.388696542 +0000 UTC 100
Project is out of vCPU resources (28 active)
alert: Project is out of vCPU resources
expr: label_replace(openstack_nova_limits_vcpus_used{is_domain="false"} / (openstack_nova_limits_vcpus_max{is_domain="false"} > 0) * 100 > 95, "object_id", "$1", "tenant_id", "(.*)")
for: 10m
labels:
  component: compute
  severity: info
annotations:
  summary: Project {{$labels.tenant}} has reached 95% of the vCPU allocation limit.
Labels State Active Since Value
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="337091c1254e4845844471037bb1d0a9" severity="info" tenant="project-123458" tenant_id="337091c1254e4845844471037bb1d0a9" firing 2026-01-08 15:00:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="7d1f8f08904e4d6caf5285894da50d1b" severity="info" tenant="project-118139" tenant_id="7d1f8f08904e4d6caf5285894da50d1b" firing 2025-10-25 11:40:04 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="8e577430b32a4881983777a3f69f771a" severity="info" tenant="project-118481" tenant_id="8e577430b32a4881983777a3f69f771a" firing 2025-11-02 04:20:04 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="9a0d4e86a55d44d4bb7ef3a884482e62" severity="info" tenant="project-122427" tenant_id="9a0d4e86a55d44d4bb7ef3a884482e62" firing 2025-12-23 06:30:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="84d97362dcbf40898dea3c822e2b304b" severity="info" tenant="project-122499" tenant_id="84d97362dcbf40898dea3c822e2b304b" firing 2025-12-24 12:00:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="3a831eead8904a1ba173172f9027a382" severity="info" tenant="project-122519" tenant_id="3a831eead8904a1ba173172f9027a382" firing 2026-01-02 15:40:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="655bf94757b34ce28393e3588e0a7559" severity="info" tenant="project-123275" tenant_id="655bf94757b34ce28393e3588e0a7559" firing 2026-01-06 17:40:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="fe2fb846d21946fe971ffdc58d3b4021" severity="info" tenant="project-118058" tenant_id="fe2fb846d21946fe971ffdc58d3b4021" firing 2025-10-24 12:10:04 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="c4cd1dc496624de5ba89fbcad7983cc0" severity="info" tenant="project-118320" tenant_id="c4cd1dc496624de5ba89fbcad7983cc0" firing 2025-10-28 15:10:04 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="1679394c56df4dcd8e1c998f9c614d66" severity="info" tenant="project-123276" tenant_id="1679394c56df4dcd8e1c998f9c614d66" firing 2026-01-07 06:20:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="9b52e5e261784b1baea412c1ee882ad0" severity="info" tenant="project-123417" tenant_id="9b52e5e261784b1baea412c1ee882ad0" firing 2026-01-08 09:40:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="e7f50c5cd5b54155b6b152eff2bf71e8" severity="info" tenant="project-121352" tenant_id="e7f50c5cd5b54155b6b152eff2bf71e8" firing 2025-12-11 16:00:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="56371d58bd434873a84158a8d68baad6" severity="info" tenant="project-122473" tenant_id="56371d58bd434873a84158a8d68baad6" firing 2025-12-31 08:20:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="f96617b0cf214c81bd2507e838afcb80" severity="info" tenant="project-122988" tenant_id="f96617b0cf214c81bd2507e838afcb80" firing 2026-01-02 10:20:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="6b8a46bf990d4e409d598b327b65f532" severity="info" tenant="project-123197" tenant_id="6b8a46bf990d4e409d598b327b65f532" firing 2026-01-05 10:20:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="d23d4718205143cbaa38899d58fb0902" severity="info" tenant="project-123018" tenant_id="d23d4718205143cbaa38899d58fb0902" firing 2026-01-02 14:40:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="8890bce87bc64d43b0c0b033f94bf8cc" severity="info" tenant="project-32926" tenant_id="8890bce87bc64d43b0c0b033f94bf8cc" firing 2026-01-03 10:40:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="2620ec59745f4ba69ba8e5bb97d7f800" severity="info" tenant="project-123373" tenant_id="2620ec59745f4ba69ba8e5bb97d7f800" firing 2026-01-08 16:10:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="f7f4e7c84bb5421c90476e8b62d62be1" severity="info" tenant="project-118012" tenant_id="f7f4e7c84bb5421c90476e8b62d62be1" firing 2025-10-28 10:50:04 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="62d89b6e807a414eb88837cf967de6bf" severity="info" tenant="project-119280" tenant_id="62d89b6e807a414eb88837cf967de6bf" firing 2025-11-12 02:00:04 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="76589b5ba6db4544b76639a965c48087" severity="info" tenant="project-123020" tenant_id="76589b5ba6db4544b76639a965c48087" firing 2026-01-02 12:40:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="1fc00ed8917c41268eaf1c18fc4475b2" severity="info" tenant="project-123263" tenant_id="1fc00ed8917c41268eaf1c18fc4475b2" firing 2026-01-06 05:20:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="e050246871f84196a42823491e6bd39b" severity="info" tenant="project-118480" tenant_id="e050246871f84196a42823491e6bd39b" firing 2025-11-04 05:00:04 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="2479f7bbbdd048a892cdce5b6041eb0d" severity="info" tenant="project-122213" tenant_id="2479f7bbbdd048a892cdce5b6041eb0d" firing 2025-12-20 15:20:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="125599669f7e4a9bb3707ee1766754be" severity="info" tenant="project-122459" tenant_id="125599669f7e4a9bb3707ee1766754be" firing 2025-12-23 14:20:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="080d78c2d3e7422e97236dd562b464fd" severity="info" tenant="project-122803" tenant_id="080d78c2d3e7422e97236dd562b464fd" firing 2025-12-29 09:50:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="2bd0411d89c747219d257253006c19be" severity="info" tenant="project-120691" tenant_id="2bd0411d89c747219d257253006c19be" firing 2025-12-30 03:30:04.388696542 +0000 UTC 100
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="2856b90c9ce848949da775ba7190125e" severity="info" tenant="project-122063" tenant_id="2856b90c9ce848949da775ba7190125e" firing 2025-12-16 17:00:04.388696542 +0000 UTC 100
Network is out of IP addresses (0 active)
alert: Network is out of IP addresses
expr: label_replace((openstack_neutron_network_ip_availabilities_used > 0) / (openstack_neutron_network_ip_availabilities_total) * 100 * on(project_id) group_left(name) label_replace(openstack_identity_project_info{name=~"admin|service"}, "project_id", "$1", "id", "(.*)") > 95, "object_id", "$1", "network_id", "(.*)")
for: 5m
labels:
  component: compute
  severity: info
annotations:
  summary: Network {{$labels.network_name}} with ID {{$labels.network_id}} in project {{$labels.name}} has reached 95% of the IP address allocation limit.
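Per-subnet usage for the exhausted network can be shown directly; <network-id> is a placeholder:

    # Show total and used IPs per subnet in the network
    openstack ip availability show <network-id>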
Project is out of storage policy space (0 active)
alert: Project is out of storage policy space
expr: label_join(openstack_cinder_limits_volume_storage_policy_used_gb{is_domain="false",volume_type!=""} / (openstack_cinder_limits_volume_storage_policy_max_gb{is_domain="false",volume_type!=""} > 0) * 100 > 95, "object_id", "-", "tenant_id", "volume_type")
for: 5m
labels:
  component: compute
  severity: info
annotations:
  summary: Project {{$labels.tenant}} has reached 95% of the {{$labels.volume_type}} storage policy allocation limit.
/var/lib/prometheus/alerts/openstack_services.rules > openstack_service_is_down
All OpenStack service API upstreams are down (0 active)
High request error rate for OpenStack API requests detected (0 active)
alert: High request error rate for OpenStack API requests detected
expr: label_replace(sum by(instance, log_file) (rate(openstack_request_count{status=~"5.."}[1h])) / sum by(instance, log_file) (rate(openstack_request_count[1h])), "object_id", "$1", "log_file", "(.*).log") * 100 > 5
for: 10m
labels:
  component: compute
  severity: warning
annotations:
  summary: A request error rate above 5% has been detected for {{$labels.object_id}} over the last hour. Check the resource usage of {{$labels.object_id}}.
Keystone API service is down (0 active)
alert: Keystone API service is down
expr: (openstack_service_up{service=~"keystone.*"} == 0) and on(cluster_id) (backend_ha_reconfigure == 0) and on(cluster_id) (backend_compute_reconfigure == 0) and on(cluster_id) (backend_compute_deploy == 0)
for: 10m
labels:
  component: compute
  object_id: '{{$labels.service}}'
  severity: critical
annotations:
  summary: '{{$labels.service}} API service is down.'
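A quick liveness probe for Keystone from the management node:

    # If this fails, the identity service cannot authenticate requests
    openstack token issue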
OpenStack Cinder Scheduler is down (0 active)
OpenStack Cinder Volume agent is down (0 active)
alert: OpenStack Cinder Volume agent is down
expr: sum without(uuid) (label_replace(label_replace(openstack_cinder_agent_state{adminState="enabled",service="cinder-volume"}, "nodename", "$1", "hostname", "(.*vstoragedomain).*"), "storage_name", "$1", "hostname", ".*@(.*)") * on(nodename) group_left(node) node_uname_info{job="node"} * on(node) group_left() (backend_node_management == 1) * on(node) group_left(instance) up{job="node"} + on(node) group_left() max by(node) (softwareupdates_node_state{state=~"updating|rebooting"}) + scalar(backend_ha_reconfigure == bool 1)) == 0
for: 10m
labels:
  component: compute
  object_id: '{{$labels.instance}}-{{$labels.storage_name}}'
  severity: critical
annotations:
  summary: OpenStack Block Storage (Cinder) Volume agent is down on host {{$labels.instance}} for storage {{$labels.storage_name}}.
OpenStack Neutron DHCP agent is down (0 active)
OpenStack Neutron L3 agent is down (0 active)
OpenStack Neutron Metadata agent is down (0 active)
OpenStack Neutron OpenvSwitch agent is down (0 active)
OpenStack Nova Compute is down (0 active)
OpenStack Nova Conductor is down (0 active)
OpenStack Nova Scheduler is down (0 active)
OpenStack Octavia HealthManager service is down (0 active)
OpenStack Octavia Housekeeping service is down (0 active)
OpenStack Octavia Provisioning Worker is down (0 active)
OpenStack service API upstream is down (0 active)
/var/lib/prometheus/alerts/pcs.rules > PCS
Possible lack of allocatable space (6 active)
alert: Possible lack of allocatable space
expr: (cluster_space_ok_without_node == 0) * on(node) group_right() backend_node_online
labels:
  component: storage
  object_id: '{{ $labels.node }}'
  severity: warning
annotations:
  summary: Losing node {{ $labels.hostname }} will lead to a lack of allocatable space or failure domains in the storage cluster. Add more storage disks or nodes to the cluster, depending on your failure domain configuration.
Labels State Active Since Value
alertname="Possible lack of allocatable space" cluster_id="1" component="storage" hostname="mbfn3-cp01.vstoragedomain" instance="backend-api.svc" job="backend" node="af89cb54-f750-2f58-9267-1d004773625a" object_id="af89cb54-f750-2f58-9267-1d004773625a" severity="warning" firing 2025-11-14 08:01:57.787488231 +0000 UTC 0
alertname="Possible lack of allocatable space" cluster_id="1" component="storage" hostname="mbfn3-cp02.vstoragedomain" instance="backend-api.svc" job="backend" node="1f578260-8804-bdef-1a71-b3f85f76038d" object_id="1f578260-8804-bdef-1a71-b3f85f76038d" severity="warning" firing 2025-11-14 08:01:57.787488231 +0000 UTC 0
alertname="Possible lack of allocatable space" cluster_id="1" component="storage" hostname="mbfn3-cp03.vstoragedomain" instance="backend-api.svc" job="backend" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="warning" firing 2025-11-14 08:01:57.787488231 +0000 UTC 0
alertname="Possible lack of allocatable space" cluster_id="1" component="storage" hostname="mbfn3-s3stg01.vstoragedomain" instance="backend-api.svc" job="backend" node="028df1f9-05b4-d9c8-6cf1-17e418bf9bf4" object_id="028df1f9-05b4-d9c8-6cf1-17e418bf9bf4" severity="warning" firing 2025-11-14 08:01:57.787488231 +0000 UTC 0
alertname="Possible lack of allocatable space" cluster_id="1" component="storage" hostname="mbfn3-s3stg02.vstoragedomain" instance="backend-api.svc" job="backend" node="88cff5b6-d365-3356-d662-766ae503e58a" object_id="88cff5b6-d365-3356-d662-766ae503e58a" severity="warning" firing 2025-11-14 08:01:57.787488231 +0000 UTC 0
alertname="Possible lack of allocatable space" cluster_id="1" component="storage" hostname="mbfn3-s3stg03.vstoragedomain" instance="backend-api.svc" job="backend" node="d56f49c6-e411-4d5d-0911-f9e5d5923b49" object_id="d56f49c6-e411-4d5d-0911-f9e5d5923b49" severity="warning" firing 2025-11-14 08:01:57.787488231 +0000 UTC 0
CS has excessive journal size (0 active)
alert: CS has excessive journal size
expr: cluster_csd_journal_size{journal_type="inner_cache"} * on(csid) group_left(instance, device) cluster_csd_disk_info > 512
for: 10m
labels:
  component: storage
  object_id: '{{ $labels.csid }}'
  severity: warning
  value: '{{ $value }}'
annotations:
  summary: The journal on CS#{{ $labels.csid }} on host {{ $labels.instance }}, disk {{ $labels.device }}, is {{ $value }} MiB. The recommended size is 256 MiB.
CS has inconsistent encryption settings (0 active)
alert: CS has inconsistent encryption settings
expr: count by(tier) (count by(tier, encryption) (cluster_csd_journal_size * on(csid) group_left(instance) cluster_csd_disk_info * on(csid) group_left(tier) cluster_csd_info)) > 1
for: 10m
labels:
  component: storage
  severity: warning
annotations:
  summary: Encryption is disabled for some CSes in tier {{ $labels.tier }} but enabled for others on the same tier.
CS journal device shared across multiple tiers (0 active)
alert: CS journal device shared across multiple tiers
expr: count by(instance, device) (count by(instance, device, tier) (cluster_csd_journal_size{journal_type="external_cache"} * on(csid) group_left(tier) (cluster_csd_info) * on(node) group_left(instance) (up{job="node"}))) >= 2
for: 10m
labels:
  component: storage
  object_id: '{{ $labels.instance }}-{{ $labels.device }}'
  severity: warning
annotations:
  summary: CSes from multiple tiers are using the same journal device '{{ $labels.device }}' on node '{{ $labels.instance }}'.
CS missing journal configuration (0 active)
alert: CS missing journal configuration
expr: cluster_csd_disk_info unless on(csid) cluster_csd_journal_size and on() count(up{job="mds"}) > 0
for: 10m
labels:
  component: storage
  object_id: '{{ $labels.csid }}'
  severity: warning
annotations:
  summary: The journal is not configured for CS#{{ $labels.csid }} on node {{ $labels.instance }}.
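Note: the affected services can be listed with the unguarded form of the expression above (a sketch; each returned series is a CS without a configured journal):
  cluster_csd_disk_info unless on(csid) cluster_csd_journal_size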
Cluster has blocked or slow replication (0 active)
alert: Cluster has blocked or slow replication
expr: increase(mdsd_cluster_replication_stuck_chunks[5m]) > 0 or increase(mdsd_cluster_replication_touts_total[5m]) > 0
for: 1m
labels:
  component: core storage
  severity: critical
annotations:
  summary: Chunk replication is blocked or too slow.
Cluster has critically high number of chunks (0 active)
alert: Cluster has critically high number of chunks
expr: job:mdsd_fs_chunk_maps:sum >= 1.5e+07
for: 1m
labels:
  component: core storage
  severity: critical
annotations:
  summary: There are too many chunks in the cluster, which slows down the metadata service.
Cluster has critically high number of files (0 active)
alert: Cluster has critically high number of files
expr: job:mdsd_fs_files:sum >= 1e+07
for: 1m
labels:
  component: core storage
  severity: critical
annotations:
  summary: There are too many files in the cluster, which slows down the metadata service.
Cluster has failed chunk services (0 active)
alert: Cluster has failed chunk services
expr: sum by(csid) (mdsd_cs_status{status=~"failed|failed rel"}) * on(csid) group_right() label_replace(cluster_csd_disk_info, "object_id", "$1", "csid", "(.*)") > 0
for: 5m
labels:
  component: core storage
  object_id: '{{ $labels.csid }}'
  severity: warning
annotations:
  summary: 'Chunk service #{{ $labels.csid }} is in the ''failed'' state on node {{ $labels.instance }}. Replace the disk or contact the technical support.'
Cluster has failed mount points (0 active)
alert: Cluster has failed mount points
expr: job:up_not_being_updated:count{job="fused"} - job:up_not_being_updated_with_restart:count{job="fused"} > 0
for: 1m
labels:
  component: core storage
  severity: critical
annotations:
  summary: Some mount points stopped working and need to be recovered.
Cluster has offline chunk services (0 active)
alert: Cluster has offline chunk services
expr: sum by(csid) (mdsd_cs_status{status="offline"}) * on(csid) group_right() label_replace(cluster_csd_disk_info, "object_id", "$1", "csid", "(.*)") > 0
for: 5m
labels:
  component: core storage
  object_id: '{{ $labels.csid }}'
  severity: warning
annotations:
  summary: 'Chunk service #{{ $labels.csid }} is in the ''offline'' state on node {{ $labels.instance }}. Check and restart it.'
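Note: all CSes in a bad state can be listed in one go (a sketch combining the status filters of this alert and of 'Cluster has failed chunk services'):
  sum by(csid) (mdsd_cs_status{status=~"failed|failed rel|offline"})
    * on(csid) group_right() cluster_csd_disk_info > 0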
Cluster has too many chunks (0 active)
alert: Cluster has too many chunks
expr: (job:mdsd_fs_chunk_maps:sum > 1e+07) < 1.5e+07
for: 1m
labels:
  component: core storage
  severity: warning
annotations:
  summary: There are too many chunks in the cluster, which slows down the metadata service.
Cluster has too many files (0 active)
alert: Cluster has too many files
expr: (job:mdsd_fs_files:sum > 4e+06) < 1e+07
for: 1m
labels:
  component: core storage
  severity: warning
annotations:
  summary: There are too many files in the cluster, which slows down the metadata service.
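Note: current totals can be checked against the thresholds of the four alerts above (warning at 1e+07 chunks and 4e+06 files, critical at 1.5e+07 chunks and 1e+07 files) by querying the recording rules themselves:
  job:mdsd_fs_chunk_maps:sum
  job:mdsd_fs_files:sum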
Cluster has unavailable metadata services (0 active)
alert: Cluster has unavailable metadata services
expr: up{job="mds"} unless on(mdsid) (job:up_with_restart{job="mds"} == 1 or job:up_with_restart{job="mds"} == bool 0 and on(node) (instance:being_updated))
for: 5m
labels:
  component: core storage
  object_id: '{{ $labels.mdsid }}'
  severity: warning
annotations:
  summary: 'Metadata service #{{ $labels.mdsid }} is offline or has failed on node {{ $labels.instance }}. Check and restart it.'
Cluster is out of physical space on tier (0 active)
alert: Cluster is out of physical space on tier
expr: label_replace(sum by(tier) (mdsd_cluster_free_space_bytes) / sum by(tier) (mdsd_cluster_space_bytes), "object_id", "tier-$1", "tier", "(.*)") < 0.1
for: 5m
labels:
  component: core storage
  severity: critical
annotations:
  summary: There is not enough free physical space on storage tier {{ $labels.tier }}.
Cluster is running out of physical space on tier (0 active)
alert: Cluster is running out of physical space on tier
expr: label_replace(sum by(tier) (mdsd_cluster_free_space_bytes) / sum by(tier) (mdsd_cluster_space_bytes), "object_id", "tier-$1", "tier", "(.*)") < 0.2
for: 5m
labels:
  component: core storage
  severity: warning
annotations:
  summary: There is little free physical space left on storage tier {{ $labels.tier }}.
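Note: the per-tier fill level behind both space alerts can be checked directly (a sketch reusing the rules' own expression; values below 0.2 trigger the warning, below 0.1 the critical alert):
  sum by(tier) (mdsd_cluster_free_space_bytes)
    / sum by(tier) (mdsd_cluster_space_bytes)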
Core storage service is down (0 active)
alert: Core storage service is down
expr: label_replace(node_systemd_unit_state{name="vstorage-disks-monitor.service",state="active"}, "name", "$1", "name", "(.*)\\.service") != 1 and on(node) backend_node_master == 1 and on() backend_virtual_cluster == 0
for: 5m
labels:
  component: storage
  object_id: '{{ $labels.name }} - {{ $labels.instance }}'
  severity: warning
annotations:
  summary: Service {{ $labels.name }} is down on host {{ $labels.instance }}.
Disk cache settings are not optimal (0 active)
Master metadata service changes too often (0 active)
alert: Master metadata service changes too often
expr: topk(1, mdsd_is_master) and (delta(mdsd_master_uptime[1h]) < 300000) and on(node) softwareupdates_node_state{state!~"updat.*"} == 1
for: 10m
labels:
  component: core storage
  severity: warning
annotations:
  summary: Master metadata service has changed more than once in 5 minutes.
Metadata service has critically high commit latency (0 active)
alert: Metadata service has critically high commit latency
expr: histogram_quantile(0.95, instance_le:rjournal_commit_duration_seconds_bucket:rate5m{job="mds"}) >= 5
for: 1m
labels:
  component: core storage
  severity: critical
annotations:
  summary: Metadata service on {{$labels.instance}} has the 95th percentile commit latency higher than 5 seconds.
Metadata service has high CPU usage (0 active)
alert: Metadata service has high CPU usage
expr: (sum by(instance) (rate(process_cpu_seconds_total{job="mds"}[5m])) * 100) > 80
for: 1m
labels:
  component: core storage
  severity: warning
annotations:
  summary: Metadata service on {{$labels.instance}} has CPU usage higher than 80%. The service may be overloaded.
Metadata service has high commit latency (0 active)
alert: Metadata service has high commit latency
expr: 5 > histogram_quantile(0.95, instance_le:rjournal_commit_duration_seconds_bucket:rate5m{job="mds"}) > 1
for: 1m
labels:
  component: core storage
  severity: warning
annotations:
  summary: Metadata service on {{$labels.instance}} has the 95th percentile commit latency higher than 1 second.
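Note: the p95 commit latency behind both MDS latency alerts can be graphed as is (a sketch; values between 1 and 5 seconds raise the warning, 5 seconds or more the critical alert):
  histogram_quantile(0.95,
    instance_le:rjournal_commit_duration_seconds_bucket:rate5m{job="mds"})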
Node has failed map requests (0 active)
alert: Node has failed map requests
expr: fused_maps_failed > 0 or rate(fused_map_failures_total[5m]) > 0
for: 1m
labels:
  component: core storage
  severity: critical
annotations:
  summary: Some map requests on {{$labels.instance}} have failed.
Node has stuck I/O requests (0 active)
alert: Node has stuck I/O requests
expr: fused_stuck_reqs_30s > 0 or fused_stuck_reqs_10s > 0
for: 1m
labels:
  component: core storage
  severity: critical
annotations:
  summary: Some I/O requests are stuck on {{$labels.instance}}.
Number of CSes per device does not match configuration (0 active)
alert: Number of CSes per device does not match configuration
expr: label_replace(backend_node_online == 1, "host", "$1", "hostname", "([^.]*).*") and on(node) (count by(device, instance, node, tier) (cluster_csd_info) - on(tier) group_left() (cluster_cs_per_tier_info)) != 0
for: 10m
labels:
  component: storage
  severity: warning
annotations:
  summary: The number of CSes per device on node {{$labels.host}} with ID {{$labels.node}} does not match the configuration. Check your disk configuration.
Reached "node crash per hour" threshold (0 active)
alert: Reached "node crash per hour" threshold
expr: shaman_node_crash_threshold == 1
for: 5m
labels:
  component: node
  severity: critical
annotations:
  summary: '{{- if query "backend_vendor_info{vendor=''acronis''}" -}} Node {{$labels.hostname}} with shaman node id {{$labels.client_node}} has reached the "node crash per hour" threshold. Visit https://kb.acronis.com/content/68797 to learn how to troubleshoot this issue. {{- else if query "backend_vendor_info{vendor=''virtuozzo''}" -}} Node {{$labels.hostname}} with shaman node id {{$labels.client_node}} has reached the "node crash per hour" threshold. {{- end -}}'
Storage disk is unresponsive (0 active)
alert: Storage disk is unresponsive
expr: sum by(csid) (mdsd_cs_status{status="ill"}) * on(csid) group_right() label_replace(cluster_csd_disk_info, "object_id", "$1", "csid", "(.*)") > 0
for: 1m
labels:
  component: core storage
  object_id: '{{ $labels.csid }}'
  severity: warning
annotations:
  summary: Disk '{{$labels.device}}' (CS#{{$labels.csid}}) on node {{$labels.instance}} is unresponsive. Check or replace this disk.
/var/lib/prometheus/alerts/postgres.rules > postgresql database size
PostgreSQL database size is greater than 30 GB (0 active)
alert: PostgreSQL database size is greater than 30 GB
expr: pg_database_size_bytes > 3e+10
for: 10m
labels:
  component: cluster
  object_id: postgresql_exporter
  severity: critical
annotations:
  summary: PostgreSQL database "{{$labels.datname}}" on node "{{$labels.instance}}" is greater than 30 GB in size. Verify that deleted entries are archived or contact the technical support.
PostgreSQL database uses more than 50% of node root partition (0 active)
alert: PostgreSQL database uses more than 50% of node root partition
expr: (sum by(node, instance) (pg_database_size_bytes) / min by(node, instance) (node_filesystem_size_bytes{job="node",mountpoint="/"})) * 100 > 50
for: 10m
labels:
  component: cluster
  object_id: postgresql_exporter
  severity: warning
annotations:
  summary: PostgreSQL databases on node "{{$labels.instance}}" with ID "{{$labels.node}}" use more than 50% of node root partition. Verify that deleted entries are archived or contact the technical support.
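Note: database sizes can be reviewed ahead of either alert (a sketch; the critical threshold of 3e+10 bytes corresponds to 30 GB, so the output here is in GB, largest first):
  sort_desc(pg_database_size_bytes / 1e9)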
/var/lib/prometheus/alerts/rabbitmq.rules > rabbitmq_alerts
RabbitMQ node is down (0 active)
RabbitMQ split brain detected (0 active)
alert: RabbitMQ split brain detected
expr: count(count by(job) (rabbitmq_queues)) > 1
for: 10m
labels:
  component: compute
  severity: critical
annotations:
  summary: RabbitMQ cluster has experienced a split brain due to a network partition.
/var/lib/prometheus/alerts/s3.rules > S3
FSMDS service failed to start (0 active)
alert: FSMDS service failed to start
expr: increase(ostor_svc_start_failed_count_total{service="fs"}[5m]) > 10
for: 1m
labels:
  component: NFS
  severity: critical
annotations:
  summary: Object storage agent failed to start file service on {{$labels.instance}}.
NFS service failed to start (0 active)
alert: NFS service failed to start
expr: increase(ostor_svc_start_failed_count_total{service=~"os|ns|s3gw",storage_type="NFS"}[5m]) > 10
for: 1m
labels:
  component: NFS
  severity: critical
annotations:
  summary: Object storage agent failed to start {{$labels.job}} ({{$labels.svc_id}}) on {{$labels.instance}}.
NFS service has unavailable FS services (0 active)
alert: NFS service has unavailable FS services
expr: count by(instance) (up{job="fs"}) > sum by(instance) (up{job="fs"})
for: 1m
labels:
  component: NFS
  severity: warning
annotations:
  summary: Some File services are not running on {{$labels.instance}}. Check the service status in the command-line interface.
NFS service is experiencing many network problems (0 active)
alert: NFS service is experiencing many network problems
expr: instance_vol_svc:rpc_errors_total:rate5m{job=~"fs|os",vol_id=~"02.*"} > (10 / (5 * 60))
for: 2m
labels:
  component: NFS
  object_id: '{{$labels.svc_id}}-{{$labels.instance}}'
  severity: critical
annotations:
  summary: NFS service ({{$labels.job}}, {{$labels.svc_id}}) on {{$labels.instance}} has many RPC errors. Check your network configuration.
NFS service is experiencing some network problems (0 active)
alert: NFS service is experiencing some network problems
expr: instance_vol_svc:rpc_errors_total:rate5m{job=~"fs|os",vol_id=~"02.*"} > (5 / (5 * 60)) and instance_vol_svc:rpc_errors_total:rate5m{job=~"fs|os",vol_id=~"02.*"} <= (10 / (5 * 60))
for: 2m
labels:
  component: NFS
  object_id: '{{$labels.svc_id}}-{{$labels.instance}}'
  severity: warning
annotations:
  summary: NFS service ({{$labels.job}}, {{$labels.svc_id}}) on {{$labels.instance}} has some RPC errors. Check your network configuration.
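Note: the thresholds in the two alerts above are plain error budgets: 5 / (5 * 60) and 10 / (5 * 60) errors per second, i.e. 5 and 10 RPC errors per 5-minute window (about 0.017 and 0.033 errors per second). The raw rate can be inspected directly (a sketch; the same arithmetic applies to the S3 variants of these alerts further below):
  instance_vol_svc:rpc_errors_total:rate5m{job=~"fs|os",vol_id=~"02.*"}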
Name service has critically high commit latency (0 active)
alert: Name service has critically high commit latency
expr: histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc:ostor_commit_latency_us_bucket:rate5m{job="ns"})) >= 1e+07
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: Name service ({{$labels.svc_id}}) on {{$labels.instance}} has the median commit latency higher than 10 seconds. Check the storage performance.
Name service has critically high request latency (0 active)
alert: Name service has critically high request latency
expr: histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc_req:ostor_ns_req_latency_ms_bucket:rate5m)) >= 5000
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: Name service ({{$labels.svc_id}}) on {{$labels.instance}} has the median request latency higher than 5 seconds.
Name service has high commit latency (0 active)
alert: Name service has high commit latency
expr: (histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc:ostor_commit_latency_us_bucket:rate5m{job="ns"})) > 1e+06) < 1e+07
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: Name service ({{$labels.svc_id}}) on {{$labels.instance}} has the median commit latency higher than 1 second. Check the storage performance.
Name service has high request latency (0 active)
alert: Name service has high request latency
expr: histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc_req:ostor_ns_req_latency_ms_bucket:rate5m)) > 1000
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: Name service ({{$labels.svc_id}}) on {{$labels.instance}} has the median request latency higher than 1 second.
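Note: the commit-latency metric is recorded in microseconds, so the thresholds 1e+06 and 1e+07 in the alerts above correspond to 1 and 10 seconds. The median can be viewed in seconds with (a sketch; swap job="ns" for job="os" to cover the Object service alerts below):
  histogram_quantile(0.5, sum by(instance, svc_id, le)
    (instance_vol_svc:ostor_commit_latency_us_bucket:rate5m{job="ns"})) / 1e6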
Object service has critically high commit latency (0 active)
alert: Object service has critically high commit latency
expr: histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc:ostor_commit_latency_us_bucket:rate5m{job="os"})) >= 1e+07
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: Object service ({{$labels.svc_id}}) on {{$labels.instance}} has the median commit latency higher than 10 seconds. Check the storage performance.
Object service has critically high request latency (0 active)
alert: Object service has critically high request latency
expr: histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc_req:ostor_os_req_latency_ms_bucket:rate5m)) >= 5000
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: Object service ({{$labels.svc_id}}) on {{$labels.instance}} has the median request latency higher than 5 seconds.
Object service has high commit latency (0 active)
alert: Object service has high commit latency
expr: (histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc:ostor_commit_latency_us_bucket:rate5m{job="os"})) > 1e+06) < 1e+07
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: Object service ({{$labels.svc_id}}) on {{$labels.instance}} has the median commit latency higher than 1 second. Check the storage performance.
Object service has high request latency (0 active)
alert: Object service has high request latency
expr: (histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc_req:ostor_os_req_latency_ms_bucket:rate5m)) > 1000) < 5000
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: Object service ({{$labels.svc_id}}) on {{$labels.instance}} has the median request latency higher than 1 second.
Object storage account control service is offline (0 active)
alert: Object storage account control service is offline
expr: up{job="acc"} == 0
for: 5m
labels:
  component: S3
  severity: critical
annotations:
  summary: Object storage account control service is down on host {{$labels.instance}}.
Object storage agent is frozen for a long time (0 active)
alert: Object storage agent is frozen for a long time
expr: increase(pcs_process_inactive_seconds_total{job="ostor"}[5m]) > 0
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: Object storage agent on {{$labels.instance}} has the event loop inactive for more than 1 minute.
Object storage agent is not connected to configuration service (0 active)
alert: Object storage agent is not connected to configuration service
expr: increase(ostor_svc_registry_cfg_failed_total[5m]) > 3 and on(node) (instance:not_being_updated)
for: 5m
labels:
  component: S3
  severity: critical
annotations:
  summary: Object storage agent failed to connect to the configuration service on {{$labels.instance}}.
Object storage agent is offline (0 active)
alert: Object storage agent is offline
expr: up{job="ostor"} == 0
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: Object storage agent is offline on {{$labels.instance}}.
S3 Gateway service has critically high CPU usage (0 active)
alert: S3 Gateway service has critically high CPU usage
expr: (sum by(instance, svc_id) (rate(process_cpu_seconds_total{job="s3gw"}[5m])) * 100) >= 90
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: S3 Gateway service ({{$labels.svc_id}}) on {{$labels.instance}} has CPU usage higher than 90%. The service may be overloaded.
S3 Gateway service has critically high GET request latency (0 active)
alert: S3 Gateway service has critically high GET request latency
expr: histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc:ostor_s3gw_get_req_latency_ms_bucket:rate5m)) >= 5000
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: S3 Gateway service ({{$labels.svc_id}}) on {{$labels.instance}} has the median GET request latency higher than 5 seconds.
S3 Gateway service has critically high cancel request rate (0 active)
alert: S3 Gateway service has critically high cancel request rate
expr: (sum by(svc_id, instance) (instance_vol_svc:ostor_s3gw_req:rate5m)) > 1 and ((sum by(svc_id, instance) (instance_vol_svc:ostor_s3gw_req_cancelled:rate5m)) / (sum by(svc_id, instance) (instance_vol_svc:ostor_s3gw_req:rate5m))) * 100 >= 30 and (sum by(svc_id, instance) (instance_vol_svc:ostor_s3gw_req:rate5m)) > (30 / 300)
for: 3m
labels:
  component: S3
  severity: critical
annotations:
  summary: S3 Gateway service ({{$labels.svc_id}}) on {{$labels.instance}} has the cancel request rate higher than 30%. It may be caused by connectivity issues, request timeouts, or a small limit for pending requests.
S3 Gateway service has high CPU usage (0 active)
alert: S3 Gateway service has high CPU usage
expr: ((sum by(instance, svc_id) (rate(process_cpu_seconds_total{job="s3gw"}[5m])) * 100) > 75) < 90
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: S3 Gateway service ({{$labels.svc_id}}) on {{$labels.instance}} has CPU usage higher than 75%. The service may be overloaded.
S3 Gateway service has high GET request latency (0 active)
alert: S3 Gateway service has high GET request latency
expr: (histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc:ostor_s3gw_get_req_latency_ms_bucket:rate5m)) > 1000) < 5000
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: S3 Gateway service ({{$labels.svc_id}}) on {{$labels.instance}} has the median GET request latency higher than 1 second.
S3 Gateway service has high cancel request rate (0 active)
alert: S3 Gateway service has high cancel request rate
expr: 30 > (sum by(svc_id, instance) (instance_vol_svc:ostor_s3gw_req:rate5m)) > 1 and ((sum by(svc_id, instance) (instance_vol_svc:ostor_s3gw_req_cancelled:rate5m)) / (sum by(svc_id, instance) (instance_vol_svc:ostor_s3gw_req:rate5m))) * 100 > 5 and (sum by(svc_id, instance) (instance_vol_svc:ostor_s3gw_req:rate5m)) > (30 / 300)
for: 3m
labels:
  component: S3
  severity: warning
annotations:
  summary: S3 Gateway service ({{$labels.svc_id}}) on {{$labels.instance}} has the cancel request rate higher than 5%. It may be caused by connectivity issues, request timeouts, or a small limit for pending requests.
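Note: the trailing (30 / 300) term in both cancel-rate alerts is a traffic floor of 0.1 requests per second, so idle gateways do not alert. The ratio itself can be graphed with (a sketch reusing the alerts' recording rules):
  (sum by(svc_id, instance) (instance_vol_svc:ostor_s3gw_req_cancelled:rate5m)
    / sum by(svc_id, instance) (instance_vol_svc:ostor_s3gw_req:rate5m)) * 100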
S3 Gateway service has too many failed requests (0 active)
alert: S3 Gateway service has too many failed requests
expr: ((sum by(instance, svc_id) (instance_vol_svc:ostor_req_server_err:rate5m)) / (sum by(instance, svc_id) (instance_vol_svc:ostor_s3gw_req:rate5m))) * 100 > 5
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: S3 Gateway service ({{$labels.svc_id}}) on {{$labels.instance}} has more than 5% of requests failing with a server error (5XX status code).
S3 NDS service has critically high notification processing error rate (0 active)
alert: S3 NDS service has critically high notification processing error rate
expr: ((sum by(svc_id, instance) (instance_vol_svc:ostor_nds_error_total:rate5m)) / (sum by(svc_id, instance) (instance_vol_svc:ostor_nds_total:rate5m))) * 100 >= 15
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: S3 NDS service ({{$labels.svc_id}}) on {{$labels.instance}} has the notification processing error rate higher than 15%. It may be caused by connectivity issues, request timeouts, or an S3 topic misconfiguration.
S3 NDS service has high notification deletion error rate (0 active)
alert: S3 NDS service has high notification deletion error rate
expr: ((sum by(svc_id, instance) (instance_vol_svc:ostor_nds_delete_error_total:rate5m)) / (sum by(svc_id, instance) (instance_vol_svc:ostor_nds_total:rate5m))) * 100 > 5
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: S3 NDS service ({{$labels.svc_id}}) on {{$labels.instance}} has the notification deletion error rate higher than 5%. It may be caused by a storage misconfiguration, storage performance degradation, or other storage issues.
S3 NDS service has high notification processing error rate (0 active)
alert: S3 NDS service has high notification processing error rate
expr: 15 > ((sum by(svc_id, instance) (instance_vol_svc:ostor_nds_error_total:rate5m)) / (sum by(svc_id, instance) (instance_vol_svc:ostor_nds_total:rate5m))) * 100 >= 5
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: S3 NDS service ({{$labels.svc_id}}) on {{$labels.instance}} has the notification processing error rate higher than 5%. It may be caused by connectivity issues, request timeouts, or an S3 topic misconfiguration.
S3 NDS service has high notification repetition rate (0 active)
alert: S3 NDS service has high notification repetition rate
expr: ((sum by(svc_id, instance) (instance_vol_svc:ostor_nds_repeat_total:rate5m)) / (sum by(svc_id, instance) (instance_vol_svc:ostor_nds_total:rate5m))) * 100 > 5
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: S3 NDS service ({{$labels.svc_id}}) on {{$labels.instance}} has the notification repetition rate higher than 5%. It may be caused by a storage misconfiguration or other storage issues.
S3 NDS service has too many messages in simultaneous processing (0 active)
alert: S3 NDS service has too many messages in simultaneous processing
expr: nds_endpoint_process_count > 1000
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: S3 NDS service ({{$labels.svc_id}}) on {{$labels.instance}} has a lot of notifications in simultaneous processing on the endpoint. It may be caused by connectivity issues or an S3 topic misconfiguration.
S3 NDS service has too many staged unprocessed notifications (0 active)
alert: S3 NDS service has too many staged unprocessed notifications
expr: nds_staged_messages_count > 1000
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: S3 NDS service ({{$labels.svc_id}}) on {{$labels.instance}} has a lot of unprocessed notifications staged on the storage. It may be caused by connectivity or storage issues.
S3 cluster has too many open file descriptors (0 active)
alert: S3 cluster has too many open file descriptors
expr: (sum by(instance) (process_open_fds{job=~"gr|acc|s3gw|ns|os|ostor"})) > 9000
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: There are more than 9000 open file descriptors on {{$labels.instance}}. Please contact the technical support.
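Note: per-node descriptor usage can be ranked against the 9000 limit with (a sketch using the alert's own selector):
  sort_desc(sum by(instance) (process_open_fds{job=~"gr|acc|s3gw|ns|os|ostor"}))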
S3 cluster has unavailable Geo-replication services (0 active)
alert: S3 cluster has unavailable Geo-replication services
expr: count by(instance) (up{job="gr"}) > sum by(instance) (up{job="gr"} == 1 or (up{job="gr"} == bool 0 and on(instance) (instance:being_updated)))
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: Some Geo-replication services are not running on {{$labels.instance}}. Check the service status in the command-line interface.
S3 cluster has unavailable S3 Gateway services (0 active)
alert: S3 cluster has unavailable S3 Gateway services
expr: count by(instance) (up{job="s3gw"}) > sum by(instance) (up{job="s3gw"})
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: Some S3 Gateway services are not running on {{$labels.instance}}. Check the service status in the command-line interface.
S3 cluster has unavailable name services (0 active)
alert: S3 cluster has unavailable name services
expr: count by(instance) (up{job="ns"}) > sum by(instance) (up{job="ns"} == 1 or (up{job="ns"} == bool 0 and on(instance) (instance:being_updated)))
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: Some Name services are not running on {{$labels.instance}}. Check the service status in the command-line interface.
S3 cluster has unavailable object services (0 active)
alert: S3 cluster has unavailable object services
expr: count by(instance) (up{job="os"}) > sum by(instance) (up{job="os"} == 1 or (up{job="os"} == bool 0 and on(instance) (instance:being_updated)))
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: Some Object services are not running on {{$labels.instance}}. Check the service status in the command-line interface.
S3 cluster misconfiguration (0 active)
alert: S3 cluster misconfiguration
expr: count(up{job="ostor"}) > 1 and count(ostor_svc_registry_cfg_failed_total) < 2
labels:
  component: S3
  severity: error
annotations:
  summary: |
    {{ if query "backend_vendor_info{vendor='acronis'}" }} The S3 cluster configuration is not highly available. If one S3 node fails, the entire S3 cluster may become non-operational. To ensure high availability, update the S3 cluster configuration, as described in the Knowledge Base at https://kb.acronis.com/node/64033 {{ else if query "backend_vendor_info{vendor='virtuozzo'}" }} The S3 cluster configuration is not highly available. If one S3 node fails, the entire S3 cluster may become non-operational. To ensure high availability, update the S3 cluster configuration, as described in the Knowledge Base at https://support.virtuozzo.com/hc/en-us/articles/27536517316753-Virtuozzo-Hybrid-Infrastructure-Alert-S3-cluster-misconfiguration {{ end }}
S3 node is in the automatic maintenance mode (0 active)
alert: S3 node is in the automatic maintenance mode
expr: auto_maintenance_status > 0
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: '{{- if query "backend_vendor_info{vendor=''acronis''}" -}} S3 services have been evacuated from {{$labels.instance}} because of too many failed S3 requests. Check the service logs. Visit https://kb.acronis.com/content/72408 to learn how to troubleshoot this issue. {{- else if query "backend_vendor_info{vendor=''virtuozzo''}" -}} S3 services have been evacuated from {{$labels.instance}} because of too many failed S3 requests. Check the service logs. {{- end -}}'
S3 redundancy warning (0 active)
alert: S3 redundancy warning
expr: storage_redundancy_threshold{failure_domain="disk",type="s3"} > 0 and storage_redundancy_threshold{failure_domain="disk",type="s3"} <= scalar(count(backend_node_master))
for: 10m
labels:
  component: S3
  severity: warning
annotations:
  summary: |
    S3 is set to failure domain "disk" even though there are enough available nodes. It is recommended to set the failure domain to "host" so that S3 can survive host failures in addition to disk failures.
S3 service failed to start (0 active)
alert: S3 service failed to start
expr: increase(ostor_svc_start_failed_count_total{service=~"os|ns|s3gw",storage_type="S3"}[5m]) > 10
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: Object storage agent failed to start {{$labels.job}} ({{$labels.svc_id}}) on {{$labels.instance}}.
S3 service is experiencing many network problems (0 active)
alert: S3 service is experiencing many network problems
expr: instance_vol_svc:rpc_errors_total:rate5m{job=~"s3gw|os|ns",vol_id=~"01.*"} > (10 / (5 * 60))
for: 2m
labels:
  component: S3
  object_id: '{{$labels.svc_id}}-{{$labels.instance}}'
  severity: critical
annotations:
  summary: S3 service ({{$labels.job}}, {{$labels.svc_id}}) on {{$labels.instance}} has many RPC errors. Check your network configuration.
S3 service is experiencing some network problems (0 active)
alert: S3 service is experiencing some network problems
expr: instance_vol_svc:rpc_errors_total:rate5m{job=~"s3gw|os|ns",vol_id=~"01.*"} > (5 / (5 * 60)) and instance_vol_svc:rpc_errors_total:rate5m{job=~"s3gw|os|ns",vol_id=~"01.*"} <= (10 / (5 * 60))
for: 2m
labels:
  component: S3
  object_id: '{{$labels.svc_id}}-{{$labels.instance}}'
  severity: warning
annotations:
  summary: S3 service ({{$labels.job}}, {{$labels.svc_id}}) on {{$labels.instance}} has some RPC errors. Check your network configuration.
S3 service is frozen for a long time (0 active)
alert: S3 service is frozen for a long time
expr: increase(pcs_process_inactive_seconds_total{job=~"s3gw|os|ns"}[5m]) > 0
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: S3 service ({{$labels.job}}, {{$labels.svc_id}}) on {{$labels.instance}} has the event loop inactive for more than 1 minute.