Alerts

Inactive (269) Pending (0) Firing (5)

/var/lib/prometheus/alerts/abgw.rules > abgw

Attempt to use migrated accounts (0 active)

alert: Attempt to use migrated accounts
expr: instance:abgw_inst_outdated:count > 0
for: 1m
labels:
  component: abgw
  severity: warning
annotations:
  summary: One or more attempts to use migrated accounts detected for the last 24 hours. Please contact the technical support.

Backup storage CRL is not up to date (0 active)

alert: Backup storage CRL is not up to date
expr: label_join(time() - instance_path_reg_type:abgw_next_certificate_expiration:min > 86400 * 2 and instance_path_reg_type:abgw_next_certificate_expiration:min{path!~".*root\\.crl$",type="crl"}, "object_id", "-", "reg_name", "type", "path")
labels:
  component: abgw
  severity: warning
annotations:
  summary: 'The CRL has not been updated for more than 2 days. Path: {{$labels.path}}. Registration name: {{$labels.reg_name}}.'

Backup storage SSL certificate has expired (0 active)

alert: Backup storage SSL certificate has expired
expr: label_join(instance_path_reg_type:abgw_next_certificate_expiration:min - time() < 0 and instance_path_reg_type:abgw_next_certificate_expiration:min{type!="crl"}, "object_id", "-", "reg_name", "type", "path")
labels:
  component: abgw
  severity: critical
annotations:
  summary: 'The {{$labels.type}} certificate has expired. Path: {{$labels.path}}. Registration name: {{$labels.reg_name}}.'

Backup storage SSL certificate will expire in less than 14 days (0 active)

alert: Backup storage SSL certificate will expire in less than 14 days
expr: label_join(instance_path_reg_type:abgw_next_certificate_expiration:min - time() < 86400 * 14 and instance_path_reg_type:abgw_next_certificate_expiration:min - time() >= 86400 * 7 and instance_path_reg_type:abgw_next_certificate_expiration:min{type!="crl"}, "object_id", "-", "reg_name", "type", "path")
labels:
  component: abgw
  severity: warning
annotations:
  summary: 'The {{$labels.type}} certificate will expire in less than 14 days. Path: {{$labels.path}}. Registration name: {{$labels.reg_name}}.'

Backup storage SSL certificate will expire in less than 21 days (0 active)

alert: Backup storage SSL certificate will expire in less than 21 days
expr: label_join(instance_path_reg_type:abgw_next_certificate_expiration:min - time() < 86400 * 21 and instance_path_reg_type:abgw_next_certificate_expiration:min - time() >= 86400 * 14 and instance_path_reg_type:abgw_next_certificate_expiration:min{type!="crl"}, "object_id", "-", "reg_name", "type", "path")
labels:
  component: abgw
  severity: info
annotations:
  summary: 'The {{$labels.type}} certificate will expire in less than 21 days. Path: {{$labels.path}}. Registration name: {{$labels.reg_name}}.'

Backup storage SSL certificate will expire in less than 7 days (0 active)

alert: Backup storage SSL certificate will expire in less than 7 days
expr: label_join(instance_path_reg_type:abgw_next_certificate_expiration:min - time() < 86400 * 7 and instance_path_reg_type:abgw_next_certificate_expiration:min - time() >= 0 and instance_path_reg_type:abgw_next_certificate_expiration:min{type!="crl"}, "object_id", "-", "reg_name", "type", "path")
labels:
  component: abgw
  severity: critical
annotations:
  summary: 'The {{$labels.type}} certificate will expire in less than 7 days. Path: {{$labels.path}}. Registration name: {{$labels.reg_name}}.'

Backup storage has high replica open error rate (0 active)

alert: Backup storage has high replica open error rate
expr: err:abgw_file_replica_open_errs:rate5m{err!="OK"} / on() group_left() sum(err:abgw_file_replica_open_errs:rate5m) > 0.05
for: 1m
labels:
  component: abgw
  severity: error
annotations:
  summary: Backup storage has the error rate when opening replica files "{{$labels.err}}" higher than 5%.

Backup storage has high replica removal error rate (0 active)

alert: Backup storage has high replica removal error rate
expr: err:abgw_rm_file_push_errs:rate5m{err!="OK"} / on() group_left() sum(err:abgw_rm_file_push_errs:rate5m) > 0.05
for: 1m
labels:
  component: abgw
  severity: error
annotations:
  summary: Backup storage has the error rate when removing secondary replica files "{{$labels.err}}" higher than 5%.

Backup storage has high replica write error rate (0 active)

alert: Backup storage has high replica write error rate
expr: err:abgw_push_replica_errs:rate5m{err!="OK"} / on() group_left() sum(err:abgw_push_replica_errs:rate5m) > 0.05
for: 1m
labels:
  component: abgw
  severity: error
annotations:
  summary: Backup storage has the error rate when writing replica files "{{$labels.err}}" higher than 5%.

Backup storage service is down (0 active)

alert: Backup storage service is down
expr: label_replace(node_systemd_unit_state{name=~"abgw-kvstore-proxy.service|abgw-setting.service|vstorage-abgw.service",state="active"}, "name", "$1", "name", "(.*)\\.service") != 1 and on(node) backend_node_abgw == 1
for: 5m
labels:
  component: abgw
  object_id: '{{ $labels.name }} - {{ $labels.instance }}'
  severity: critical
annotations:
  summary: Service {{ $labels.name }} is down on host {{ $labels.instance }}.

Backup storage throttling is activated (0 active)

alert: Backup storage throttling is activated
expr: job:abgw_append_throttle_delay_ms:rate5m != 0
for: 1m
labels:
  component: abgw
  severity: warning
annotations:
  summary: Backup storage started to throttle write operations due to the lack of free space. Visit https://kb.acronis.com/content/62823 to learn how to troubleshoot this issue.

Different number of collaborating backup storage services (0 active)

alert: Different number of collaborating backup storage services
expr: sum(abgw_neighbors{type="count"}) != 0 and count(count(abgw_neighbors{type="count"} != scalar(count(abgw_neighbors{type="count"}))) or (((count(abgw_neighbors{type="index"}) - 1) * count(abgw_neighbors{type="index"}) / 2) != sum(abgw_neighbors{type="index"}))) > 0
for: 30m
labels:
  component: abgw
  severity: error
annotations:
  summary: Some backup storage services report a different number of collaborating services. Please contact the technical support.

Found data file with inconsistent last_sync_offset (0 active)

alert: Found data file with inconsistent last_sync_offset
expr: sum by(path) (changes(abgw_file_sync_offset_mismatch_errs_total{job="abgw"}[3h])) != 0
for: 1m
labels:
  component: abgw
  severity: error
annotations:
  summary: The "{{ $labels.path }}" file's size is less than last_sync_offset stored in info file.

Storage I/O error (0 active)

alert: Storage I/O error
expr: instance:abgw_io_errors:count > 0
for: 1m
labels:
  component: abgw
  severity: error
annotations:
  summary: One or more errors detected during storage I/O operations for the last 24 hours. Please contact the technical support.

/var/lib/prometheus/alerts/backend.rules > Backend

Backend service is down (0 active)

alert: Backend service is down
expr: label_replace(sum by(name) (node_systemd_unit_state{name=~"vstorage-ui-backend.service",state="active"}), "name", "$1", "name", "(.*)\\.service") == 0
for: 5m
labels:
  component: cluster
  object_id: '{{ $labels.name }}'
  severity: critical
annotations:
  summary: Service {{$labels.name}} is down.

Changes to the management database are not replicated (0 active)

alert: Changes to the management database are not replicated
expr: db_replication_status == 2 and on(node) softwareupdates_node_state{state!~"updat.*"} == 1
for: 10m
labels:
  component: node
  severity: critical
annotations:
  summary: Changes to the management database are not replicated to the node "{{ $labels.host }}" because it is offline. Check the node's state and connectivity.

Changes to the management database are not replicated (0 active)

alert: Changes to the management database are not replicated
expr: db_replication_status == 1
labels:
  component: node
  severity: critical
annotations:
  summary: Changes to the management database are not replicated to the node "{{ $labels.host }}". Please contact the technical support.

Cluster update failed (0 active)

alert: Cluster update failed
expr: count(softwareupdates_cluster_info{state="failed"}) + count(softwareupdates_node_info{state="idle"}) == count(up{job="node"}) + 1
labels:
  component: cluster
  severity: critical
annotations:
  summary: Update failed for the cluster.

Compute cluster has failed (0 active)

alert: Compute cluster has failed
expr: compute_status == 2
labels:
  component: compute
  severity: critical
annotations:
  summary: Compute cluster has failed. Unable to manage virtual machines.

Compute node service is down (0 active)

alert: Compute node service is down
expr: label_replace(node_systemd_unit_state{name=~"openvswitch.service|ovs-vswitchd.service|ovsdb-server.service",state="active"}, "name", "$1", "name", "(.*)\\.service") != 1 and on(node) backend_node_compute == 1
for: 5m
labels:
  component: compute
  object_id: '{{ $labels.name }} - {{ $labels.instance }}'
  severity: critical
annotations:
  summary: Service {{ $labels.name }} is down on host {{ $labels.instance }}.

Critical node service is down (0 active)

alert: Critical node service is down
expr: label_replace(node_systemd_unit_state{name=~"nginx.service|vcmmd.service|vstorage-ui-agent.service",state="active"}, "name", "$1", "name", "(.*)\\.service") != 1
for: 5m
labels:
  component: node
  object_id: '{{ $labels.name }} - {{ $labels.instance }}'
  severity: critical
annotations:
  summary: Service {{ $labels.name }} is down on host {{ $labels.instance }}.

Disk SMART warning (0 active)

alert: Disk SMART warning
expr: backend_node_disk_status{role!="unassigned",smart_status="failed"}
labels:
  component: cluster
  object_id: '{{ $labels.device }}-{{ $labels.serial_number }}-{{ $labels.instance }}'
  severity: error
annotations:
  summary: Disk "{{ $labels.device }}" ({{ $labels.serial_number }}) on node "{{ $labels.instance }}" has failed a S.M.A.R.T. check.

Disk error (0 active)

alert: Disk error
expr: backend_node_disk_status{disk_status=~"unavail|failed",role!="unassigned"}
labels:
  component: cluster
  object_id: '{{ $labels.device }}-{{ $labels.serial_number }}-{{ $labels.instance }}'
  severity: error
annotations:
  summary: Disk "{{ $labels.device }}" ({{ $labels.serial_number }}) has failed on node "{{ $labels.instance }}".

Entering maintenance for update failed (0 active)

alert: Entering maintenance for update failed
expr: softwareupdates_node_info{state="maintenance_failed"} * on(node) group_left(instance) up{job="node"}
labels:
  component: node
  severity: critical
annotations:
  summary: Entering maintenance failed while updating the node {{$labels.instance}}.

High availability for the admin panel must be configured (0 active)

alert: High availability for the admin panel must be configured
expr: count by(cluster_id) (backend_node_master) >= 3 and on(cluster_id) backend_ha_up == 0
for: 15m
labels:
  component: cluster
  severity: error
annotations:
  summary: |
    Configure high availability for the admin panel in SETTINGS > System settings > Management node high availability. Otherwise the admin panel will be a single point of failure.

High availability service is down (0 active)

alert: High availability service is down
expr: label_replace(node_systemd_unit_state{name=~"vstorage-ui-backend-raftor.service",state="active"}, "name", "$1", "name", "(.*)\\.service") != 1 and on(node) backend_node_management == 1 and on() backend_ha_up == 1
for: 5m
labels:
  component: cluster
  object_id: '{{ $labels.name }}'
  severity: critical
annotations:
  summary: Service {{$labels.name}} is down on host {{$labels.instance}}.

Identity provider connection error (0 active)

alert: Identity provider connection error
expr: backend_idp_error{error_type="connection_error"} == 1
for: 10m
labels:
  component: cluster
  severity: error
annotations:
  summary: Unable to connect to identity provider "{{ $labels.idp_name }}" in domain "{{ $labels.domain_name }}".

Identity provider validation error (0 active)

alert: Identity provider validation error
expr: backend_idp_error{error_type="validation_error"} == 1
labels:
  component: cluster
  severity: error
annotations:
  summary: Invalid identity provider configuration "{{ $labels.idp_name }}" in domain "{{ $labels.domain_name }}".

Incompatible hardware detected (0 active)

alert: Incompatible hardware detected
expr: backend_node_cpu_info{iommu="False",model=~"AMD EPYC.*"} * on(node_id) group_right(model) label_join(backend_node_nic_info{model=~"MT27800 Family \\[ConnectX\\-5\\]",type=~"Infiniband controller|Ethernet controller"}, "nic_model", "", "model")
for: 10m
labels:
  component: node
  severity: warning
annotations:
  summary: '{{- if query "backend_vendor_info{vendor='acronis'}" -}} Incompatible hardware detected on node {{$labels.node_id}}: {{$labels.model}} & {{$labels.nic_model}}. Using Mellanox and AMD may lead to data loss. Please double check that SR-IOV is properly enabled. Visit https://kb.acronis.com/content/64948 to learn how to troubleshoot this issue. {{- else if query "backend_vendor_info{vendor='virtuozzo'}" -}} Incompatible hardware detected on node {{$labels.node_id}}: {{$labels.model}} & {{$labels.nic_model}}. Using Mellanox and AMD may lead to data loss. Please double check that SR-IOV is properly enabled. Visit https://support.virtuozzo.com/hc/en-us/articles/19764365143953 to learn how to troubleshoot this issue. {{- end -}}'

Kafka SSL CA certificate has expired (0 active)

alert: Kafka SSL CA certificate has expired
expr: kafka_ssl_ca_cert_expire_in_days <= 0
labels:
  component: compute
  severity: critical
annotations:
  summary: Kafka SSL CA certificate has expired. Please renew the certificate.

Kafka SSL CA certificate will expire in less than 30 days (0 active)

alert: Kafka SSL CA certificate will expire in less than 30 days
expr: (kafka_ssl_ca_cert_expire_in_days > 0) <= 30
labels:
  component: compute
  severity: warning
  value: '{{ $value }}'
annotations:
  summary: Kafka SSL CA certificate will expire in {{ $value }} days. Please renew the certificate.

Kafka SSL client certificate has expired (0 active)

alert: Kafka SSL client certificate has expired
expr: kafka_ssl_client_cert_expire_in_days <= 0
labels:
  component: compute
  severity: critical
annotations:
  summary: Kafka SSL client certificate has expired. Please renew the certificate.

Kafka SSL client certificate will expire in less than 30 days (0 active)

alert: Kafka SSL client certificate will expire in less than 30 days
expr: (kafka_ssl_client_cert_expire_in_days > 0) <= 30
labels:
  component: compute
  severity: warning
  value: '{{ $value }}'
annotations:
  summary: Kafka SSL client certificate will expire in {{ $value }} days. Please renew the certificate.

Kernel is outdated (0 active)

alert: Kernel is outdated
expr: backend_node_kernel_outdated == 1 and on(node) softwareupdates_node_state{state="uptodate"} == 1
labels:
  component: node
  severity: warning
annotations:
  summary: Node "{{ $labels.instance }}" is not running the latest kernel.

License expired (0 active)

alert: License expired
expr: cluster_license_info{status=~"(expired|invalid|error|inactive)"} == 1
for: 30m
labels:
  component: cluster
  severity: critical
annotations:
  summary: The license of cluster "{{$labels.cluster_name}}" has expired. Сontact your reseller to update your license immediately!

License is not loaded (0 active)

alert: License is not loaded
expr: cluster_license_info{status="unknown"} == 1
for: 30m
labels:
  component: cluster
  severity: warning
annotations:
  summary: License is not loaded

License is not updated (0 active)

alert: License is not updated
expr: cluster_license_info{expire_in_days=~"[7-9]|(1[0-9])|20",is_spla="False"} == 1
for: 30m
labels:
  component: cluster
  severity: warning
annotations:
  summary: The license cannot be updated automatically and will expire in less than 21 days. Check the cluster connectivity to the license server or contact the technical support.

License will expire soon (0 active)

alert: License will expire soon
expr: cluster_license_info{expire_in_days=~"[1-6]"} == 1
for: 30m
labels:
  component: cluster
  severity: critical
annotations:
  summary: The license has not been updated automatically and will expire in less than 7 days. Check the cluster connectivity to the license server and contact the technical support immediately.

Management node HA has four nodes (0 active)

alert: Management node HA has four nodes
expr: count(backend_node_ha == 1) == 4
for: 10m
labels:
  component: cluster
  severity: warning
annotations:
  summary: The management node HA configuration has four nodes. It is recommended to have three or five nodes included.

Management node backup does not exist (0 active)

alert: Management node backup does not exist
expr: backend_database_backup_age{last_backup_date="None"} == 0
labels:
  component: cluster
  severity: error
annotations:
  summary: The last management node backup has failed or does not exist!

Management node backup is old (0 active)

alert: Management node backup is old
expr: (backend_database_backup_age > 0) < 3
for: 1h
labels:
  component: cluster
  severity: warning
  value: '{{ $value }}'
annotations:
  summary: Management node backup is older than {{ $value }} day

Management node backup is too old (0 active)

alert: Management node backup is too old
expr: backend_database_backup_age > 3
labels:
  component: cluster
  severity: error
  value: '{{ $value }}'
annotations:
  summary: Management node backup is older than {{ $value }} day

Management node service is down (0 active)

alert: Management node service is down
expr: label_replace(node_systemd_unit_state{name=~"alertmanager.service|prometheus.service|postgresql.service|pgbouncer.service",state="active"}, "name", "$1", "name", "(.*)\\.service") != 1 and on(node) backend_node_management == 1
for: 5m
labels:
  component: cluster
  object_id: '{{ $labels.name }} - {{ $labels.instance }}'
  severity: critical
annotations:
  summary: Service {{ $labels.name }} is down on host {{ $labels.instance }}.

Management panel SSL certificate has expired (0 active)

alert: Management panel SSL certificate has expired
expr: backend_ui_ssl_cert_expire_in_days < 1
labels:
  component: cluster
  severity: critical
annotations:
  summary: The SSL certificate for the admin and self-service panels has expired. Renew the certificate, as described in the product documentation, or contact the technical support.

Management panel SSL certificate will expire in less than 30 days (0 active)

alert: Management panel SSL certificate will expire in less than 30 days
expr: (backend_ui_ssl_cert_expire_in_days > 7) <= 30
labels:
  component: cluster
  severity: warning
  value: '{{ $value }}'
annotations:
  summary: The SSL certificate for the admin and self-service panels will expire in {{ $value }} days. Renew the certificate, as described in the product documentation, or contact the technical support.

Management panel SSL certificate will expire in less than 7 days (0 active)

alert: Management panel SSL certificate will expire in less than 7 days
expr: (backend_ui_ssl_cert_expire_in_days > 0) <= 7
labels:
  component: cluster
  severity: critical
  value: '{{ $value }}'
annotations:
  summary: The SSL certificate for the admin and self-service panels will expire in {{ $value }} days. Renew the certificate, as described in the product documentation, or contact the technical support.

Multiple update checks failed (0 active)

alert: Multiple update checks failed
expr: (sum_over_time(softwareupdates_node_info{state="check_failed"}[3d]) and softwareupdates_node_info{state="check_failed"}) * on(node) group_left(instance) up{job="node"} / (60 * 24 * 3) >= 0.9
labels:
  component: node
  severity: critical
annotations:
  summary: Update checks failed multiple times on the node {{$labels.instance}}. Please check access to the update repository.

Network connectivity failed (0 active)

alert: Network connectivity failed
expr: sum by(network_name) (increase(network_connectivity_received_packets_total[10m])) == 0 and sum by(network_name) (increase(network_connectivity_sent_packets_total[10m])) > 0 and on(network_name) label_replace(cluster_network_info_total, "network_name", "$1", "network", "(.*)")
labels:
  component: node
  object_id: '{{$labels.network_name}}'
  severity: critical
annotations:
  summary: No network traffic has been detected via network "{{$labels.network_name}}" from all nodes.

Node crash detected (0 active)

alert: Node crash detected
expr: hci_compute_node_crashed_fenced == 1
for: 30s
labels:
  component: node
  severity: critical
annotations:
  summary: Node {{$labels.hostname}} crashed, which started the VM evacuation.

Node failed to return to operation (0 active)

alert: Node failed to return to operation
expr: hci_compute_node_crashed_fenced == 1 and on(node) backend_node_online
for: 30m
labels:
  component: node
  severity: warning
annotations:
  summary: Node {{$labels.hostname}} has failed to automatically return to operation within 30 minutes after a crash. Check the node's hardware, and then try returning it to operation manually.

Node had a fenced state for 1 hour (0 active)

alert: Node had a fenced state for 1 hour
expr: sum_over_time(hci_compute_node_crashed_fenced[2h]) > 60
for: 5m
labels:
  component: compute
  severity: critical
annotations:
  summary: For the last 2 hours node {{$labels.hostname}} with ID {{$labels.node}} had a fenced state at least for 1 hour

Node has no internet access (0 active)

alert: Node has no internet access
expr: backend_node_internet_connected == 0
for: 10m
labels:
  component: node
  severity: warning
annotations:
  summary: Node "{{ $labels.instance }}" cannot reach the repository. Ensure the node has a working internet connection.

Node network MTU packet loss (0 active)

alert: Node network MTU packet loss
expr: sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="mtu"}[10m])) > 15 and (sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="mtu"}[10m])) - sum by(src_host, dest_host, network_name) (increase(network_connectivity_received_packets_total{probe_type="mtu"}[10m]))) > 5
labels:
  component: node
  object_id: '{{$labels.network_name}}-{{$labels.src_host}}-{{$labels.dest_host}}'
  severity: warning
annotations:
  summary: Node "{{$labels.src_host}}" has a problem with network connectivity to node "{{$labels.dest_host}}" via network "{{$labels.network_name}}" due to the loss of some MTU-sized packets.

Node network connectivity problem (0 active)

alert: Node network connectivity problem
expr: sum by(src_host, dest_host, network_name) (increase(network_connectivity_received_packets_total{probe_type="ord"}[10m])) == 0 and sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="ord"}[10m])) > 0 and on(dest_host) label_replace(sum_over_time(softwareupdates_node_state{state="rebooting"}[10m]) * on(node) group_left(hostname) (backend_node_online + 1), "dest_host", "$1", "hostname", "(.*)") == 0 and on(dest_host) label_replace(backend_node_online, "dest_host", "$1", "hostname", "(.*)") == 1 and on(network_name) label_replace(cluster_network_info_total, "network_name", "$1", "network", "(.*)")
labels:
  component: node
  object_id: '{{$labels.network_name}}-{{$labels.src_host}}-{{$labels.dest_host}}'
  severity: critical
annotations:
  summary: Node "{{$labels.src_host}}" has no network connectivity to node "{{$labels.dest_host}}" via network "{{$labels.network_name}}".

Node network packet loss (0 active)

alert: Node network packet loss
expr: sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="ord"}[10m])) > 15 and (sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="ord"}[10m])) - sum by(src_host, dest_host, network_name) (increase(network_connectivity_received_packets_total{probe_type="ord"}[10m]))) > 5
labels:
  component: node
  object_id: '{{$labels.network_name}}-{{$labels.src_host}}-{{$labels.dest_host}}'
  severity: warning
annotations:
  summary: Node "{{$labels.src_host}}" has a problem with network connectivity to node "{{$labels.dest_host}}" via network "{{$labels.network_name}}" due to the loss of some packets.

Node network persistent MTU packet loss (0 active)

alert: Node network persistent MTU packet loss
expr: sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="mtu"}[2h])) > 450 and (sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="mtu"}[2h])) - sum by(src_host, dest_host, network_name) (increase(network_connectivity_received_packets_total{probe_type="mtu"}[2h]))) > 50
labels:
  component: node
  object_id: '{{$labels.network_name}}-{{$labels.src_host}}-{{$labels.dest_host}}'
  severity: warning
annotations:
  summary: Node "{{$labels.src_host}}" has a problem with network connectivity to node "{{$labels.dest_host}}" via network "{{$labels.network_name}}" due to the persistent loss of some MTU-sized packets over the last two hours.

Node network persistent packet loss (0 active)

alert: Node network persistent packet loss
expr: sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="ord"}[2h])) > 450 and (sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="ord"}[2h])) - sum by(src_host, dest_host, network_name) (increase(network_connectivity_received_packets_total{probe_type="ord"}[2h]))) > 50
labels:
  component: node
  object_id: '{{$labels.network_name}}-{{$labels.src_host}}-{{$labels.dest_host}}'
  severity: warning
annotations:
  summary: Node "{{$labels.src_host}}" has a problem with network connectivity to node "{{$labels.dest_host}}" via network "{{$labels.network_name}}" due to the persistent loss of some packets over the last two hours.

Node network unstable connectivity (0 active)

alert: Node network unstable connectivity
expr: sum by(src_host, dest_host, network_name) (increase(network_connectivity_received_packets_total{probe_type="mtu"}[10m])) == 0 and sum by(src_host, dest_host, network_name) (increase(network_connectivity_received_packets_total{probe_type="ord"}[10m])) > 0 and sum by(src_host, dest_host, network_name) (increase(network_connectivity_sent_packets_total{probe_type="mtu"}[10m])) > 0
labels:
  component: node
  object_id: '{{$labels.network_name}}-{{$labels.src_host}}-{{$labels.dest_host}}'
  severity: critical
annotations:
  summary: Node "{{$labels.src_host}}" has a problem with network connectivity to node "{{$labels.dest_host}}" via network "{{$labels.network_name}}" due to the loss of all MTU-sized packets.

Node service is down (0 active)

alert: Node service is down
expr: label_replace(node_systemd_unit_state{name=~"mtail.service|chronyd.service|multipathd.service|sshd.service|disp-helper.service|abrtd.service|abrt-oops.service",state="active"}, "name", "$1", "name", "(.*)\\.service") != 1
for: 5m
labels:
  component: node
  object_id: '{{ $labels.name }} - {{ $labels.instance }}'
  severity: warning
annotations:
  summary: Service {{ $labels.name }} is down on host {{ $labels.instance }}.

Node update failed (0 active)

alert: Node update failed
expr: softwareupdates_node_info{state="update_failed"} * on(node) group_left(instance) up{job="node"}
labels:
  component: node
  severity: critical
annotations:
  summary: Software update failed on the node {{$labels.instance}}.

Primary management node service is down (0 active)

alert: Primary management node service is down
expr: label_replace(node_systemd_unit_state{name=~"vstorage-ui-backend-s3-proxy.service|vstorage-ui-backend-consul.service|vstorage-ui-backend-celery_periodic.service",state="active"}, "name", "$1", "name", "(.*)\\.service") != 1 and on(node) backend_node_master == 1
for: 5m
labels:
  component: cluster
  object_id: '{{ $labels.name }}'
  severity: critical
annotations:
  summary: Service {{$labels.name}} is down on host {{$labels.instance}}.

Software updates exist (0 active)

alert: Software updates exist
expr: sum by(job, version, available_version) (softwareupdates_node_available) > 0 or sum by(job, version, available_version) (softwareupdates_management_panel_available) > 0
labels:
  component: cluster
  object_id: '{{ $labels.available_version }}'
  severity: info
annotations:
  summary: 'Software updates exist for the cluster. Available version: {{$labels.available_version}}.'

Unable to apply SPLA license (0 active)

alert: Unable to apply SPLA license
expr: cluster_spla_last_action_days{action_type="update_key"} > 1
labels:
  component: cluster
  severity: error
annotations:
  summary: Unable to apply SPLA license for the cluster. Сontact your reseller to solve this issue.

Unable to get space usage (0 active)

alert: Unable to get space usage
expr: cluster_spla_last_action_days{action_type="get_usage"} > 1
labels:
  component: cluster
  severity: error
annotations:
  summary: Unable to get space usage for the cluster.

Unable to push space usage statistics (0 active)

alert: Unable to push space usage statistics
expr: cluster_spla_last_action_days{action_type="report"} > 1
labels:
  component: cluster
  severity: warning
annotations:
  summary: Unable to push space usage statistics for the cluster. Check the internet connection on the management node.

Update check failed (0 active)

alert: Update check failed
expr: softwareupdates_node_info{state="check_failed"} * on(node) group_left(instance) up{job="node"} and on() backend_node_internet_connected == 1
for: 10m
labels:
  component: node
  object_id: '{{ $labels.node }}'
  severity: warning
annotations:
  summary: Update check failed on the node {{$labels.instance}}. Please check access to the update repository.

Update download failed (0 active)

alert: Update download failed
expr: softwareupdates_node_info{state="download_failed"} * on(node) group_left(instance) up{job="node"}
labels:
  component: node
  severity: critical
annotations:
  summary: Update download failed on the node {{$labels.instance}}.

Update failed (0 active)

alert: Update failed
expr: sum by(job, instance, version, available_version) (softwareupdates_node_info{state="update_ctrl_plane_failed"})
labels:
  component: node
  severity: critical
annotations:
  summary: Update failed for the management panel and compute API.

/var/lib/prometheus/alerts/common.rules > common

Cluster is out of licensed space (0 active)

alert: Cluster is out of licensed space
expr: round((cluster_logical_free_space_size_bytes / cluster_logical_total_space_size_bytes) * 100, 0.01) < 0.1
for: 5m
labels:
  component: cluster
  severity: critical
annotations:
  summary: Сluster "{{ $labels.cluster_name }}" has run out of storage space allowed by license. No more data can be written. Please contact your reseller to update your license immediately!

Cluster is out of physical space (0 active)

alert: Cluster is out of physical space
expr: round((job:mdsd_cluster_raw_space_free:sum / job:mdsd_cluster_raw_space_total:sum) * 100, 0.01) < 10
for: 5m
labels:
  component: cluster
  severity: critical
annotations:
  summary: Cluster has just {{ $value }}% of physical storage space left. You may want to free some space or add more storage capacity.
  value: '{{ $value }}'

Compute node disk is out of space (0 active)

alert: Compute node disk is out of space
expr: (node_filesystem_free_bytes{job="node",mountpoint="/"} and on(node) (backend_node_compute_worker == 1)) < 1024 ^ 3 or (node_filesystem_free_bytes{job="node",mountpoint="/"} and on(node) (backend_node_compute_controller == 1)) < 10 * 1024 ^ 3
for: 5m
labels:
  component: node
  severity: critical
annotations:
  summary: Root partition on compute node "{{ $labels.instance }}" is running out of space.

Connection tracking table is full (0 active)

alert: Connection tracking table is full
expr: increase(kernel_conntrack_table_full_total[10m]) > 0
labels:
  component: node
  severity: critical
annotations:
  summary: The kernel connection tracking table on node {{ $labels.instance}} has reached its maximum capacity. This may lead to network issues.

Disk is out of space (0 active)

alert: Disk is out of space
expr: (round(node_filesystem_free_bytes{job="node",mountpoint="/"} / node_filesystem_size_bytes{job="node",mountpoint="/"} * 100, 0.01) and on(node) ((backend_node_compute == 1 and backend_node_compute_controller == 0) or backend_node_compute == 0)) < 5
for: 5m
labels:
  component: node
  severity: critical
annotations:
  summary: Root partition on node "{{ $labels.instance }}" is running out of space

Disk is running out of space (0 active)

alert: Disk is running out of space
expr: round(node_filesystem_free_bytes{job="node",mountpoint="/"} / node_filesystem_size_bytes{job="node",mountpoint="/"} * 100, 0.01) < 10 or node_filesystem_free_bytes{job="node",mountpoint="/"} < 5 * 1024 ^ 3
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Root partition on node "{{ $labels.instance }}" is running out of space

Four metadata services in cluster (0 active)

alert: Four metadata services in cluster
expr: count(cluster_mdsd_info) == 4 and count(cluster_mdsd_info) <= count(backend_node_master)
for: 5m
labels:
  component: cluster
  severity: warning
annotations:
  summary: Cluster has four metadata services. This configuration slows down the cluster performance and does not improve its availability. For a cluster of four nodes, it is enough to configure three MDSes. Delete an extra MDS from one of the cluster nodes.

Infrastructure interface has high receive packet drop rate (0 active)

alert: Infrastructure interface has high receive packet drop rate
expr: ((rate(node_network_receive_drop_total{device!~"lo|tap.*",job="node"}[5m])) / (rate(node_network_receive_packets_total{device!~"lo|tap.*",job="node"}[5m]) != 0)) * 100 > 5
for: 10m
labels:
  component: node
  object_id: '{{ $labels.device }}'
  severity: critical
annotations:
  summary: |
    Network interface {{ $labels.device }} on node {{ $labels.instance }} has receive packet drop rate higher than 5%. Please check physical network devices connectivity.

Infrastructure interface has high transmit packet drop rate (0 active)

alert: Infrastructure interface has high transmit packet drop rate
expr: ((rate(node_network_transmit_drop_total{device!~"lo|tap.*",job="node"}[5m])) / (rate(node_network_transmit_packets_total{device!~"lo|tap.*",job="node"}[5m]) != 0)) * 100 > 5
for: 10m
labels:
  component: node
  object_id: '{{ $labels.device }}'
  severity: critical
annotations:
  summary: |
    Network interface {{ $labels.device }} on node {{ $labels.instance }} has transmit packet drop rate higher than 5%. Please check physical network devices connectivity.

Licensed storage capacity is critically low (0 active)

alert: Licensed storage capacity is critically low
expr: (job:mdsd_fs_allocated_size_bytes:sum >= job:mdsd_cluster_licensed_space_bytes:sum * 0.9) / 1024 ^ 3
for: 5m
labels:
  component: cluster
  severity: critical
annotations:
  summary: '{{- if query "backend_vendor_info{vendor='acronis'}" -}} Licensed storage capacity is critically low as the cluster has reached 90% of licensed storage capacity. Please switch to the SPLA licensing model. {{- else if query "backend_vendor_info{vendor='virtuozzo'}" -}} Cluster has reached 90% of licensed storage capacity. {{- end -}}'

Licensed storage capacity is low (0 active)

alert: Licensed storage capacity is low
expr: ((job:mdsd_fs_allocated_size_bytes:sum > job:mdsd_cluster_licensed_space_bytes:sum * 0.8) < job:mdsd_cluster_licensed_space_bytes:sum * 0.9) / 1024 ^ 3
for: 5m
labels:
  component: cluster
  severity: warning
annotations:
  summary: '{{- if query "backend_vendor_info{vendor='acronis'}" -}} Licensed storage capacity is low as the cluster has reached 80% of licensed storage capacity. Please switch to the SPLA licensing model. {{- else if query "backend_vendor_info{vendor='virtuozzo'}" -}} Cluster has reached 80% of licensed storage capacity. {{- end -}}'

Low network interface speed (0 active)

alert: Low network interface speed
expr: node_network_speed_bytes > 0 and node_network_speed_bytes / 125000 < 1000 and on(device, node) cluster_network_info_total{network!="<unspecified>"}
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Network interface "{{ $labels.device}}" on node "{{ $labels.instance }}" has speed lower than the minimally required 1 Gbps.

MTU mismatch (0 active)

alert: MTU mismatch
expr: count without(mtu) (count by(mtu, network) (count_values without(job) ("mtu", node_network_mtu_bytes{job="node"}) * on(node, device) group_left(network) count without(job) (cluster_network_info_created{network!="<unspecified>"}))) > 1
for: 5m
labels:
  component: node
  severity: critical
annotations:
  summary: Network {{$labels.network}} has assigned interfaces with different MTU.

More than one metadata service per node (0 active)

alert: More than one metadata service per node
expr: count by(node, hostname) (cluster_mdsd_info * on(node) group_left(hostname) backend_node_master) > 1
for: 5m
labels:
  component: cluster
  severity: warning
annotations:
  summary: Node "{{ $labels.hostname }}" has more than one metadata service located on it. It is recommended to have only one metadata service per node. Delete the extra metadata services from this node and create them on other nodes instead.

Network bond is not redundant (0 active)

alert: Network bond is not redundant
expr: node_bonding_slaves - node_bonding_active > 0
for: 5m
labels:
  component: node
  severity: critical
  value: '{{ $value }}'
annotations:
  summary: Network bond {{ $labels.master }} on node {{ $labels.instance }} is missing {{ $value }} subordinate interface(s).

Network interface half duplex (0 active)

alert: Network interface half duplex
expr: node_network_info{duplex="half",operstate="up"} and on(device, node) cluster_network_info_total{network!="<unspecified>"}
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Network interface "{{$labels.device}}" on node "{{$labels.instance}}" is not in full duplex mode.

Network interface is flapping (0 active)

alert: Network interface is flapping
expr: round(increase(node_network_carrier_changes_total{job="node"}[15m])) > 5
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Network interface {{$labels.device}} on node {{$labels.instance}} is flapping.

Node got offline too many times (0 active)

alert: Node got offline too many times
expr: changes(backend_node_online[1h]) > (3 * 2)
for: 5m
labels:
  component: node
  severity: critical
annotations:
  summary: Node "{{ $labels.hostname }}" got offline too many times for the last hour.

Node has critically high swap usage (0 active)

alert: Node has critically high swap usage
expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 >= 95 and (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90 and (rate(node_vmstat_pswpin[5m]) + rate(node_vmstat_pswpout[5m])) > 100 and delta(node_memory_SwapFree_bytes[5m]) < 0
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Node {{$labels.instance}} has critically high swap usage, exceeding 95%. The current value is {{ $value }}.
  value: '{{ $value }}'

Node has high CPU usage (0 active)

alert: Node has high CPU usage
expr: round(100 - (avg by(instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[5m])) * 100)) > 90
for: 15m
labels:
  component: node
  severity: critical
annotations:
  summary: Node {{ $labels.instance}} has CPU usage higher than 90%. The current value is {{ $value }}.
  value: '{{ $value }}'

Node has high disk I/O usage (0 active)

alert: Node has high disk I/O usage
expr: round(rate(node_disk_io_time_seconds_total{device=~".+",job="node"}[2m]) * 100) > 85
for: 15m
labels:
  component: node
  severity: critical
annotations:
  summary: Disk /dev/{{$labels.device}} on node {{$labels.instance}} has I/O usage higher than 85%. The current value is {{ $value }}.
  value: '{{ $value }}'

Node has high memory usage (0 active)

alert: Node has high memory usage
expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 95 and on(instance) label_replace(openstack_placement_resource_allocation_ratio{resourcetype="MEMORY_MB"}, "instance", "$1", "hostname", "(.*).vstoragedomain") == 1 and (rate(node_vmstat_pswpin[5m]) + rate(node_vmstat_pswpout[5m])) > 100
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Node {{$labels.instance}} has memory usage higher than 95%. The current value is {{ $value }}.
  value: '{{ $value }}'

Node has high receive packet error rate (0 active)

alert: Node has high receive packet error rate
expr: instance_device:node_network_receive_errs:rate5m{device!="br-int"} > 1000
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Node {{ $labels.instance }} has ({{ humanize $value }}) receive packet error rate. Please check node network settings.
  value: '{{ $value }}'

Node has high receive packet loss rate (0 active)

alert: Node has high receive packet loss rate
expr: instance_device:node_network_receive_drop:rate5m{device!="br-int"} > 1000
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Node {{ $labels.instance }} has ({{ humanize $value }}) receive packet loss rate. Please check node network settings.
  value: '{{ $value }}'

Node has high swap usage (0 active)

alert: Node has high swap usage
expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 > 70 and (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 < 95 and (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90 and (rate(node_vmstat_pswpin[5m]) + rate(node_vmstat_pswpout[5m])) > 100 and delta(node_memory_SwapFree_bytes[5m]) < 0
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Node {{$labels.instance}} has swap usage higher than 70%. The current value is {{ $value }}.
  value: '{{ $value }}'

Node has high transmit packet error rate (0 active)

alert: Node has high transmit packet error rate
expr: instance_device:node_network_transmit_errs:rate5m{device!="br-int"} > 1000
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Node {{ $labels.instance }} has ({{ humanize $value }}) transmit packet error rate. Please check node network settings.
  value: '{{ $value }}'

Node has high transmit packet loss rate (0 active)

alert: Node has high transmit packet loss rate
expr: instance_device:node_network_transmit_drop:rate5m{device!="br-int"} > 1000
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Node {{ $labels.instance }} has ({{ humanize $value }}) transmit packet loss rate. Please check node network settings.
  value: '{{ $value }}'

Node is offline (0 active)

alert: Node is offline
expr: label_replace(backend_node_online, "hostname", "$1", "hostname", "([^.]*).*") == 0 unless on(node) softwareupdates_node_state{state=~"updating|rebooting"} == 1
for: 5m
labels:
  component: node
  severity: critical
annotations:
  summary: Node {{ $labels.hostname }} with ID {{ $labels.node }} is offline.

Node time critically unsynced (0 active)

alert: Node time critically unsynced
expr: floor(abs(backend_node_time_seconds{role="node"} - scalar(backend_node_time_seconds{role="backend"})) > 30)
for: 6m
labels:
  component: node
  severity: critical
annotations:
  summary: Time on node {{$labels.instance}} is critically unsynced, differing from the time on backend node by more than {{ $value }} seconds.
  value: '{{ $value }}'

Node time not synced (0 active)

alert: Node time not synced
expr: floor((abs(backend_node_time_seconds{role="node"} - scalar(backend_node_time_seconds{role="backend"})) > 5) and (abs(backend_node_time_seconds{role="node"} - scalar(backend_node_time_seconds{role="backend"})) < 30))
for: 6m
labels:
  component: node
  severity: warning
annotations:
  summary: Time on node {{$labels.instance}} differs from the time on backend node by more than {{ $value }} seconds.
  value: '{{ $value }}'

Not enough cluster nodes (0 active)

alert: Not enough cluster nodes
expr: sum by(cluster_id) (backend_node_online) < 3
for: 5m
labels:
  component: cluster
  severity: warning
  value: '{{ $value }}'
annotations:
  summary: Cluster has only {{ $value }} node(s) instead of the recommended minimum of 3. Add more nodes to the cluster.

Not enough metadata disks (0 active)

alert: Not enough metadata disks
expr: count(cluster_mdsd_disk_info) == 2
for: 5m
labels:
  component: cluster
  severity: warning
annotations:
  summary: Cluster requires more disks with the metadata role. Losing one more MDS will halt cluster operation.

Not enough storage disks (0 active)

alert: Not enough storage disks
expr: cluster_min_req_redundancy_number{failure_domain="host"} > scalar(count(count by(node) (cluster_csd_info))) or cluster_min_req_redundancy_number{failure_domain="disk"} > scalar(count(cluster_csd_info))
for: 5m
labels:
  component: cluster
  object_id: '{{ $labels.service }}'
  severity: warning
annotations:
  summary: Cluster requires more disks with the storage role to be able to provide the required level of redundancy for '{{ $labels.service }}' service.

Only one metadata disk in cluster (0 active)

alert: Only one metadata disk in cluster
expr: count(cluster_mdsd_disk_info) == 1
for: 5m
labels:
  component: cluster
  severity: warning
annotations:
  summary: Cluster has only one MDS. There is only one disk with the metadata role at the moment. Losing this disk will completely destroy all cluster data irrespective of the redundancy schema.

Over five metadata services in cluster (0 active)

alert: Over five metadata services in cluster
expr: count(cluster_mdsd_info) > 5 and count(cluster_mdsd_info) <= count(backend_node_master)
for: 5m
labels:
  component: cluster
  severity: warning
annotations:
  summary: Cluster has more than five metadata services. This configuration slows down the cluster performance and does not improve its availability. For a large cluster, it is enough to configure five MDSes. Delete extra MDSes from the cluster nodes.

Shaman service is down (0 active)

alert: Shaman service is down
expr: up{job="shaman"} == 0
for: 5m
labels:
  component: node
  object_id: '{{ $labels.instance }}'
  severity: critical
annotations:
  summary: Shaman service is down on host {{$labels.instance}}.

Software RAID is not fully synced (0 active)

alert: Software RAID is not fully synced
expr: round(((node_md_blocks_synced / node_md_blocks) * 100) < 100) and on() (node_md_state{state="active"} == 1)
for: 5m
labels:
  component: node
  severity: warning
annotations:
  summary: Software RAID {{ $labels.device }} on node {{ $labels.instance }} is {{ $value }} synced.
  value: '{{ $value }}'

Systemd service is flapping (0 active)

alert: Systemd service is flapping
expr: changes(node_systemd_unit_state{state="failed"}[5m]) > 5 or (changes(node_systemd_unit_state{state="failed"}[1h]) > 15 unless changes(node_systemd_unit_state{state="failed"}[30m]) < 7)
for: 5m
labels:
  component: node
  severity: critical
annotations:
  summary: Systemd service {{ $labels.name }} on node {{ $labels.instance }} has changed its state more than 5 times in 5 minutes or 15 times in one hour.

Zero storage disks (0 active)

alert: Zero storage disks
expr: absent(cluster_csd_info) and on() count(up{job="mds"}) > 0 and on() up{job="backend"} == 1
for: 5m
labels:
  component: cluster
  severity: warning
annotations:
  summary: Cluster has zero disks with the storage role and cannot provide the required level of redundancy.

/var/lib/prometheus/alerts/disk_smart_attrs.rules > Smart Disks Attributes

SMART Media Wearout critical (0 active)

alert: SMART Media Wearout critical
expr: smart_media_wearout_indicator{value="normalized"} <= 5 or smart_nvme_percent_used >= 95 or smart_percent_lifetime_remain{value="normalized"} <= 5 or smart_ssd_life_left{value="normalized"} <= 5
for: 5m
labels:
  component: node
  object_id: '{{ $labels.instance }}-{{ $labels.disk }}'
  severity: critical
annotations:
  summary: Disk {{ $labels.disk }} on node {{ $labels.instance }} is worn out and will fail soon. Consider replacement.

SMART Media Wearout warning (0 active)

alert: SMART Media Wearout warning
expr: (smart_media_wearout_indicator{value="normalized"} > 5 and smart_media_wearout_indicator{value="normalized"} < 10) or (smart_nvme_percent_used < 95 and smart_nvme_percent_used > 90) or (smart_percent_lifetime_remain{value="normalized"} > 5 and smart_percent_lifetime_remain{value="normalized"} < 10) or (smart_ssd_life_left{value="normalized"} > 5 and smart_ssd_life_left{value="normalized"} < 10)
for: 5m
labels:
  component: node
  object_id: '{{ $labels.instance }}-{{ $labels.disk }}'
  severity: warning
annotations:
  summary: Disk {{ $labels.disk }} on node {{ $labels.instance }} is almost worn out and may fail soon. Consider replacement.

/var/lib/prometheus/alerts/docker.rules > docker_service_common_alerts

Docker service is down (0 active)

alert: Docker service is down
expr: (node_systemd_unit_state{name="docker.service",state=~"failed|inactive"} * on(node) group_left() (backend_node_compute == 1)) == 1 and on(node) softwareupdates_node_state{state!~"updat.*"} == 1
for: 5m
labels:
  component: compute
  severity: critical
annotations:
  summary: Docker service is down on host {{$labels.instance}}.

/var/lib/prometheus/alerts/dr.rules > dr outage

Hybrid DR agent could not access compute services (0 active)

alert: Hybrid DR agent could not access compute services
expr: label_replace(runvm_agent_health{subsystem="ComputeAccess"} == 0, "object_id", "$1", "vmid", "(.*)")
for: 5m
labels:
  component: dr
  severity: critical
annotations:
  summary: Hybrid DR agent on the virtual machine {{$labels.vmid}} can't perform any actions in the infrastructure. Visit https://kb.acronis.com/content/70244 to learn how to troubleshoot this issue.

Hybrid DR agent is unavailable (0 active)

alert: Hybrid DR agent is unavailable
expr: label_replace(runvm_infra_agent unless on(uuid) runvm_agent_info, "object_id", "$1", "uuid", "(.*)")
for: 5m
labels:
  component: dr
  severity: critical
annotations:
  summary: Hybrid DR agent on the virtual machine {{$labels.uuid}} is unavailable. Visit https://kb.acronis.com/content/70244 to learn how to troubleshoot this issue.

Hybrid DR database is unavailable (0 active)

alert: Hybrid DR database is unavailable
expr: label_replace((up{job="virtual",service="postgres"} == 0) and on(vmid) up{service="dr-infra-manager"}, "object_id", "$1", "vmid", "(.*)")
for: 5m
labels:
  component: dr
  severity: critical
annotations:
  summary: Hybrid DR database on virtual machine {{$labels.vmid}} is unavailable. Operation of Hybrid DR infrastructure is not possible. Visit https://kb.acronis.com/content/70244 to learn how to troubleshoot this issue.

Hybrid DR update is available (0 active)

alert: Hybrid DR update is available
expr: sum by(uuid, name, version, available_version) (runvm_infra_outdated{product="dr",type="aci"}) > 0
for: 5m
labels:
  component: dr
  severity: critical
annotations:
  summary: Hybrid DR {{$labels.available_version}} is now available. Install the update as soon as possible. Otherwise your product functionality might be limited

/var/lib/prometheus/alerts/ipsec.rules > ipsec

Enabling IPv6 mode takes too much time (0 active)

alert: Enabling IPv6 mode takes too much time
expr: switch_ipv6_task_running > 0
for: 1h
labels:
  component: cluster
  severity: warning
annotations:
  summary: Operation to enable the IPv6 mode is running for more than 1 hour. Please contact the technical support.

Enabling traffic encryption takes too much time (0 active)

alert: Enabling traffic encryption takes too much time
expr: network_ipsec_task_running > 0
for: 1h
labels:
  component: cluster
  severity: warning
annotations:
  summary: Operation to enable traffic encryption is running for more than 1 hour. Please contact the technical support.

Node IPsec certificate has expired (0 active)

alert: Node IPsec certificate has expired
expr: network_ipsec_cert_days_to_expire <= 0
labels:
  component: cluster
  severity: critical
annotations:
  summary: IPsec certificate for node {{$labels.instance}} with ID {{$labels.node}} has expired. Renew the certificate, as described in the product documentation, or contact the technical support.

Node IPsec certificate will expire in less than 7 days (0 active)

alert: Node IPsec certificate will expire in less than 7 days
expr: (network_ipsec_cert_days_to_expire > 0) <= 7
labels:
  component: cluster
  severity: warning
annotations:
  summary: IPsec certificate for node {{$labels.instance}} with ID {{$labels.node}} will expire in less than 7 days. Renew the certificate, as described in the product documentation, or contact the technical support.

System configuration is not optimal for traffic encryption (0 active)

alert: System configuration is not optimal for traffic encryption
expr: sum by(cluster_id) (network_ipsec_enabled) > 0 and sum by(cluster_id) (node_ipv6_not_configured) > 0
for: 1m
labels:
  component: cluster
  severity: warning
annotations:
  summary: Traffic encryption is enabled but the storage network is not in the IPv6 mode. Switch on the IPv6 configuration, as described in the product documentation.

/var/lib/prometheus/alerts/iscsi.rules > ISCSI

iSCSI has failed volumes (0 active)

alert: iSCSI has failed volumes
expr: label_replace(vstorage_target_manager_volume_failed, "object_id", "$1", "volume_id", "(.*)") > 0
for: 5m
labels:
  component: iSCSI
  severity: critical
annotations:
  summary: The volume {{$labels.volume_id}} has failed. Please contact the technical support.

iSCSI redundancy warning (0 active)

alert: iSCSI redundancy warning
expr: storage_redundancy_threshold{failure_domain="disk",type="iscsi"} > 0 and storage_redundancy_threshold{failure_domain="disk",type="iscsi"} <= scalar(count(backend_node_master))
for: 10m
labels:
  component: iSCSI
  severity: warning
annotations:
  summary: |
    iSCSI LUN {{ $labels.service }} of target group {{ $labels.group }} is set to failure domain "disk" even though there are enough available nodes. It is recommended to set the failure domain to "host" so that the LUN can survive host failures in addition to disk failures.

/var/lib/prometheus/alerts/kernel.rules > kernel_alerts

Block device error (0 active)

alert: Block device error
expr: increase(kernel_message_error{reason="block device I/O error"}[5m]) > 0 or kernel_message_error{reason="block device I/O error"} == 1 and on(node, reason, src) count_over_time(kernel_message_error{reason="block device I/O error"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: node
  object_id: '{{ $labels.src }}'
  severity: critical
annotations:
  summary: Block device error detected for '{{ $labels.src }}' on '{{ $labels.instance }}' node.

Blocked system task (0 active)

alert: Blocked system task
expr: increase(kernel_message_error{reason="blocked task"}[5m]) > 0 or kernel_message_error{reason="blocked task"} == 1 and on(node, reason, src) count_over_time(kernel_message_error{reason="blocked task"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: node
  object_id: '{{ $labels.src }}'
  severity: warning
annotations:
  summary: Blocked task '{{ $labels.src }}' detected on '{{ $labels.instance }}' node. This may occur due to incorrect system configuration or malfunctioning hardware. Please contact support.

CPU throttled (0 active)

alert: CPU throttled
expr: increase(kernel_message_error{reason="clock throttled",src="temperature"}[5m]) > 0 or kernel_message_error{reason="clock throttled",src="temperature"} == 1 and on(node, reason, src) count_over_time(kernel_message_error{reason="clock throttled",src="temperature"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: node
  object_id: '{{ $labels.node }}'
  severity: warning
annotations:
  summary: CPU throttling was detected on '{{ $labels.instance }}' node due to high temperature or other hardware issues.

MCE hardware error (0 active)

alert: MCE hardware error
expr: increase(kernel_message_error{reason="hardware error",src="MCE"}[5m]) > 0 or kernel_message_error{reason="hardware error",src="MCE"} == 1 and on(node, reason, src) count_over_time(kernel_message_error{reason="hardware error",src="MCE"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: node
  object_id: '{{ $labels.node }}'
  severity: warning
annotations:
  summary: Machine check events logged on '{{ $labels.instance }}' node.

OOM killer triggered (0 active)

alert: OOM killer triggered
expr: increase(kernel_message_error{reason="Out Of Memory"}[5m]) > 0 or kernel_message_error{reason="Out Of Memory"} == 1 and on(node, reason, src) count_over_time(kernel_message_error{reason="Out Of Memory"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: node
  object_id: '{{ $labels.node }}-{{ $labels.src }}'
  severity: warning
annotations:
  summary: OOM killer has been triggered on node '{{ $labels.instance }}' for process '{{ $labels.src }}'. Investigate memory usage immediately.

SCSI disk failure (0 active)

alert: SCSI disk failure
expr: increase(kernel_scsi_failures_total[5m]) > 0 or kernel_scsi_failures_total == 1 and on(node, device, err) count_over_time(kernel_scsi_failures_total[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: node
  object_id: '{{ $labels.device }}'
  severity: critical
annotations:
  summary: One or more errors '{{ $labels.err }}' detected for device '{{ $labels.device }}' on {{ $labels.instance }} node.

/var/lib/prometheus/alerts/nova_compute.rules > nova_compute_alerts

Conflict updating instance (0 active)

alert: Conflict updating instance
expr: increase(nova_compute_vm_error{reason="conflict updating"}[5m]) > 0 or nova_compute_vm_error{reason="conflict updating"} == 1 and on(reason) count_over_time(nova_compute_vm_error{reason="conflict updating"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: compute
  object_id: '{{ $labels.instance }}'
  severity: warning
annotations:
  summary: 'One or more 'UnexpectedTaskStateError_Remote' errors detected. Conflict updating VM instance: {{ $labels.instance }}'

Failed to power off VM (0 active)

alert: Failed to power off VM
expr: increase(nova_compute_vm_error{reason="power off"}[5m]) > 0 or nova_compute_vm_error{reason="power off"} == 1 and on(reason) count_over_time(nova_compute_vm_error{reason="power off"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: compute
  object_id: '{{ $labels.instance }}'
  severity: warning
annotations:
  summary: One or more failed attempts to power off VM '{{ $labels.instance }}' are detected.

Migration job conflict (0 active)

alert: Migration job conflict
expr: increase(nova_compute_vm_error{reason="migration job conflict"}[5m]) > 0 or nova_compute_vm_error{reason="migration job conflict"} == 1 and on(reason) count_over_time(nova_compute_vm_error{reason="migration job conflict"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: compute
  object_id: '{{ $labels.instance }}'
  severity: warning
annotations:
  summary: 'Another migration job is already running for VM instance: {{ $labels.instance }}'

Temporary snapshot exists (0 active)

alert: Temporary snapshot exists
expr: increase(nova_compute_vm_error{reason="temporary snapshot"}[5m]) > 0 or nova_compute_vm_error{reason="temporary snapshot"} == 1 and on(reason) count_over_time(nova_compute_vm_error{reason="temporary snapshot"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: compute
  object_id: '{{ $labels.instance }}'
  severity: warning
annotations:
  summary: Different operations with VMs might be blocked because temporary snapshot live migration leftovers are detected.

VM Device or resource busy (0 active)

alert: VM Device or resource busy
expr: increase(nova_compute_vm_error{reason="device busy"}[5m]) > 0 or nova_compute_vm_error{reason="device busy"} == 1 and on(reason) count_over_time(nova_compute_vm_error{reason="device busy"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: compute
  object_id: '{{ $labels.instance }}'
  severity: warning
annotations:
  summary: One or more failed VM disk I/O operations detected because device is busy.

VM Invalid volume exception (0 active)

alert: VM Invalid volume exception
expr: increase(nova_compute_vm_error{reason="invalid volume"}[5m]) > 0 or nova_compute_vm_error{reason="invalid volume"} == 1 and on(reason) count_over_time(nova_compute_vm_error{reason="invalid volume"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: compute
  object_id: '{{ $labels.instance }}'
  severity: warning
annotations:
  summary: One or more operations with invalid volume detected for VM '{{ $labels.instance }}'.

Virtual machine error (0 active)

alert: Virtual machine error
expr: increase(nova_compute_vm_error{reason=~"lock|unsupported configuration|storage file access|dbus|timeout|operation unsupported|qemu crash|live migration|binding"}[5m]) > 0 or nova_compute_vm_error{reason=~"lock|unsupported configuration|storage file access|dbus|timeout|operation unsupported|qemu crash|live migration|binding"} == 1 and on(instance, reason) count_over_time(nova_compute_vm_error{reason=~"lock|unsupported configuration|storage file access|dbus|timeout|operation unsupported|qemu crash|live migration|binding"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: compute
  object_id: '{{ $labels.instance }}-{{ $labels.reason }}'
  severity: warning
annotations:
  summary: 'One or more '{{ $labels.reason }}' errors detected for VM instance: {{ $labels.instance }}.'

Virtualization management service error (0 active)

alert: Virtualization management service error
expr: increase(nova_compute_libvirt_error{reason=~"lock|connection"}[5m]) > 0 or nova_compute_libvirt_error{reason=~"lock|connection"} == 1 and on(node, reason) count_over_time(nova_compute_libvirt_error{reason=~"lock|connection"}[6m]) < 3 and on(node) count_over_time(up{job="mtail"}[6m]) == 3
labels:
  component: compute
  object_id: '{{ $labels.node }}-{{ $labels.reason }}'
  severity: critical
annotations:
  summary: One or more virtualization management service '{{ $labels.reason }}' errors detected on '{{ $labels.instance }}' node.

/var/lib/prometheus/alerts/openstack_cluster.rules > openstack_cluster_alerts

Backup plan failed (1 active)

alert: Backup plan failed
expr: openstack_freezer_backup_plan_status == 1
for: 20m
labels:
  component: compute
  object_id: '{{$labels.id}}'
  severity: warning
annotations:
  summary: Backup plan {{$labels.name}} for compute volumes has three consecutive failures.

Labels	State	Active Since	Value
alertname="Backup plan failed" component="compute" domain_id="921940ee9d6c465daa555fcc3e3764a2" id="325b2a1b1abe4b848c2bd3c07d25d807" instance="compute-api.svc" job="openstack_exporter" name="db" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="325b2a1b1abe4b848c2bd3c07d25d807" project_id="328a7224779a42db9f51a05a91054adc" severity="warning" status="scheduled"	firing	2025-12-09 02:10:13.030437186 +0000 UTC	1
Annotations
summary Backup plan db for compute volumes has three consecutive failures.

Cluster is out of memory (0 active)

alert: Cluster is out of memory
expr: sum(openstack_placement_resource_usage{resourcetype="MEMORY_MB"}) / sum(openstack_placement_resource_allocation_ratio{resourcetype="MEMORY_MB"} * (openstack_placement_resource_total{resourcetype="MEMORY_MB"} - openstack_placement_resource_reserved{resourcetype="MEMORY_MB"})) * 100 > 95
for: 10m
labels:
  component: compute
  object_id: openstack_exporter
  severity: warning
annotations:
  summary: Cluster has reached 95% of the memory allocation limit.

Cluster is out of vCPU resources (0 active)

alert: Cluster is out of vCPU resources
expr: sum(openstack_placement_resource_usage{resourcetype="VCPU"}) / sum(openstack_placement_resource_allocation_ratio{resourcetype="VCPU"} * (openstack_placement_resource_total{resourcetype="VCPU"} - openstack_placement_resource_reserved{resourcetype="VCPU"})) * 100 > 95
for: 10m
labels:
  component: compute
  object_id: openstack_exporter
  severity: warning
annotations:
  summary: Cluster has reached 95% of the vCPU allocation limit.

Cluster is running out of memory (0 active)

alert: Cluster is running out of memory
expr: 80 < sum(openstack_placement_resource_usage{resourcetype="MEMORY_MB"}) / sum(openstack_placement_resource_allocation_ratio{resourcetype="MEMORY_MB"} * (openstack_placement_resource_total{resourcetype="MEMORY_MB"} - openstack_placement_resource_reserved{resourcetype="MEMORY_MB"})) * 100 < 95
for: 10m
labels:
  component: compute
  object_id: openstack_exporter
  severity: info
annotations:
  summary: Cluster has reached 80% of the memory allocation limit.

Cluster is running out of vCPU resources (0 active)

alert: Cluster is running out of vCPU resources
expr: 80 < sum(openstack_placement_resource_usage{resourcetype="VCPU"}) / sum(openstack_placement_resource_allocation_ratio{resourcetype="VCPU"} * (openstack_placement_resource_total{resourcetype="VCPU"} - openstack_placement_resource_reserved{resourcetype="VCPU"})) * 100 < 95
for: 10m
labels:
  component: compute
  object_id: openstack_exporter
  severity: info
annotations:
  summary: Cluster has reached 80% of the vCPU allocation limit.

Kubernetes cluster update failed (0 active)

alert: Kubernetes cluster update failed
expr: openstack_container_infra_cluster_status == 4
for: 5m
labels:
  component: compute
  object_id: '{{ $labels.uuid }}'
  severity: warning
annotations:
  summary: Kubernetes cluster with ID "{{ $labels.uuid }}" has the "{{ $labels.status }}" status.

Licensed core limit exceeded (0 active)

alert: Licensed core limit exceeded
expr: (sum(openstack_nova_phys_cores_available) - on() licensed_core_number) > 0 and on() licensed_core_number > 0
for: 10m
labels:
  component: compute
  severity: critical
annotations:
  core_number: '{{ printf `licensed_core_number`|query|first|value }}'
  core_number_used: '{{ printf `sum(openstack_nova_phys_cores_available)`|query|first|value }}'
  summary: |
    Number of physical cores used in the cluster is "{{ $labels.core_number_used }}", which exceeds the licensed core limit of "{{ $labels.core_number }}".

Load balancer error (0 active)

alert: Load balancer error
expr: openstack_loadbalancer_loadbalancer_status{provisioning_status="ERROR"}
labels:
  component: compute
  object_id: '{{$labels.id}}'
  severity: warning
annotations:
  summary: |
    Load balancer with ID "{{$labels.id}}" has the 'ERROR' provisioning status. Please check the Octavia service logs or contact the technical support.

Load balancer is stuck in pending state (0 active)

alert: Load balancer is stuck in pending state
expr: openstack_loadbalancer_loadbalancer_status{is_stale="true"}
labels:
  component: compute
  object_id: '{{$labels.id}}'
  severity: error
annotations:
  summary: |
    Load balancer with ID "{{$labels.id}}" is stuck with the "{{$labels.provisioning_status}}" status. Ensure that the load balancer configuration is consistent and perform a failover.

Neutron bridge mapping not found (0 active)

alert: Neutron bridge mapping not found
expr: label_replace(openstack_neutron_network_bridge_mapping * on(hostname) group_left(node) (backend_node_compute), "object_id", "$1", "provider_physical_network", "(.*)") == 0
for: 20m
labels:
  component: compute
  severity: critical
annotations:
  summary: |
    Physical network "{{$labels.provider_physical_network}}" is not found in the bridge mapping on node "{{$labels.hostname}}". Virtual network "{{$labels.network_name}}" on this node is most likely not functioning. Please contact the technical support.

Unrecognized DHCP servers detected (0 active)

alert: Unrecognized DHCP servers detected
expr: group by(network_id) (neutron_network_dhcp_reply_count >= 3) and on(network_id) (count by(network_id) (neutron_network_dhcp_reply_count >= 3) >= 2) and on() (backend_ha_up == 1)
for: 10m
labels:
  component: compute
  object_id: '{{$labels.network_id}}'
  severity: warning
annotations:
  summary: |
    Built-in DHCP service for virtual network "{{$labels.network_id}}" may be malfunctioning. Please ensure that virtual machines are receiving correct DHCP addresses or contact the technical support.

Unrecognized DHCP servers detected from node (0 active)

alert: Unrecognized DHCP servers detected from node
expr: neutron_network_dhcp_reply_count >= 3 and on(network_id) (count by(network_id) (neutron_network_dhcp_reply_count >= 3) < 2) and on() (backend_ha_up == 1)
for: 10m
labels:
  component: compute
  object_id: '{{$labels.network_id}}'
  severity: warning
annotations:
  summary: |
    Built-in DHCP service for virtual network "{{$labels.network_id}}" may be malfunctioning on node "{{$labels.host}}". Please ensure that virtual machines are receiving correct DHCP addresses or contact the technical support.

Virtual DHCP server HA degraded (0 active)

alert: Virtual DHCP server HA degraded
expr: group by(network_id) (neutron_network_dhcp_reply_count == 1) and on(network_id) (count by(network_id) (neutron_network_dhcp_reply_count == 1) >= 2) and on() (backend_ha_up == 1)
for: 10m
labels:
  component: compute
  object_id: '{{$labels.network_id}}'
  severity: warning
annotations:
  summary: |
    Only one built-in DHCP server for virtual network "{{$labels.network_id}}" is reachable from cluster nodes. DHCP high availability entered the degraded state. Please check the neutron-dhcp-agent service or contact the technical support.

Virtual DHCP server HA degraded on node (0 active)

alert: Virtual DHCP server HA degraded on node
expr: neutron_network_dhcp_reply_count == 1 and on(network_id) (count by(network_id) (neutron_network_dhcp_reply_count == 1) < 2) and on() (backend_ha_up == 1)
for: 10m
labels:
  component: compute
  object_id: '{{$labels.network_id}}'
  severity: warning
annotations:
  summary: |
    Only one built-in DHCP server for virtual network "{{$labels.network_id}}" is reachable from node "{{$labels.host}}". DHCP high availability entered the degraded state. Please check the neutron-dhcp-agent service or contact the technical support.

Virtual DHCP server is unavailable (0 active)

alert: Virtual DHCP server is unavailable
expr: group by(network_id) (neutron_network_dhcp_reply_count == 0) and on(network_id) (count by(network_id) (neutron_network_dhcp_reply_count == 0) >= 2) and on() (backend_ha_up == 1)
for: 10m
labels:
  component: compute
  object_id: '{{$labels.network_id}}'
  severity: warning
annotations:
  summary: |
    Built-in DHCP server for virtual network "{{$labels.network_id}}" is not available from cluster nodes. Please check the neutron-dhcp-agent service or contact the technical support.

Virtual DHCP server is unavailable from node (0 active)

alert: Virtual DHCP server is unavailable from node
expr: neutron_network_dhcp_reply_count == 0 and on(network_id) (count by(network_id) (neutron_network_dhcp_reply_count == 0) < 2) and on() (backend_ha_up == 1)
for: 10m
labels:
  component: compute
  object_id: '{{$labels.network_id}}'
  severity: warning
annotations:
  summary: |
    Built-in DHCP server for virtual network "{{$labels.network_id}}" is not available from node "{{$labels.host}}". Please check the neutron-dhcp-agent service or contact the technical support.

Virtual machine error (0 active)

alert: Virtual machine error
expr: label_replace(openstack_nova_server_status{status="ERROR"}, "object_id", "$1", "id", "(.*)")
labels:
  component: compute
  severity: critical
annotations:
  summary: Virtual machine {{$labels.name}} with ID {{$labels.id}} is in the 'Error' state.

Virtual machine has crashed (0 active)

alert: Virtual machine has crashed
expr: (libvirt_domain_info_state == 5 and libvirt_domain_info_state_reason == 3) or (libvirt_domain_info_state == 6 and libvirt_domain_info_state_reason == 1) or (libvirt_domain_info_state == 1 and libvirt_domain_info_state_reason == 9) or (libvirt_domain_info_state == 3 and libvirt_domain_info_state_reason == 10)
for: 10m
labels:
  component: compute
  object_id: '{{$labels.domain_uuid}}'
  severity: critical
annotations:
  summary: |
    Virtual machine with ID {{$labels.domain_uuid}} in project {{$labels.project_name}} has crashed. Restart the VM.

Virtual machine is not responding (0 active)

alert: Virtual machine is not responding
expr: sum by(project_name, name, domain_uuid) (instance_domain:libvirt_domain_block_stats_read_bytes:rate5m) == 0 and sum by(project_name, name, domain_uuid) (instance_domain:libvirt_domain_block_stats_write_bytes:rate5m) == 0 and sum by(project_name, name, domain_uuid) (instance_domain:libvirt_domain_interface_stats_receive_bytes:rate5m) == 0 and sum by(project_name, name, domain_uuid) (instance_domain:libvirt_domain_interface_stats_transmit_bytes:rate5m) == 0 and sum by(project_name, name, domain_uuid) (instance_domain:libvirt_domain_info_cpu_time_seconds:rate5m) > 0.1
for: 10m
labels:
  component: compute
  object_id: '{{$labels.domain_uuid}}'
  severity: critical
annotations:
  summary: |
    Virtual machine {{$labels.name}} in project {{$labels.project_name}} has stopped responding. Consider VM restart.

Virtual machine state mismatch (0 active)

alert: Virtual machine state mismatch
expr: label_join((count_over_time(nova:libvirt:server:diff[2h]) > 60) and (nova:libvirt:server:diff), "object_id", "", "id")
for: 10m
labels:
  component: compute
  severity: critical
annotations:
  summary: State of virtual machine {{$labels.name}} with ID {{$labels.id}} differs in the Nova databases and libvirt configuration.

Virtual network port check failed (0 active)

alert: Virtual network port check failed
expr: neutron_port_status_failed{check!="dhcp",device_owner!="network:dhcp"} == 1 unless on(device_id) label_join(openstack_nova_server_status{status="SHELVED_OFFLOADED"}, "device_id", "", "uuid")
for: 10m
labels:
  component: compute
  object_id: '{{$labels.port_id}}'
  severity: critical
annotations:
  summary: Neutron port with ID {{$labels.port_id}} failed {{$labels.check}} check. The port type is {{$labels.device_owner}} with owner ID {{$labels.device_id}}

Virtual network port check failed (0 active)

alert: Virtual network port check failed
expr: neutron_port_status_failed{check="dhcp"} == 1 unless on(device_id) label_join(openstack_nova_server_status{status="SHELVED_OFFLOADED"}, "device_id", "", "uuid")
for: 10m
labels:
  component: compute
  object_id: '{{$labels.port_id}}'
  severity: info
annotations:
  summary: Neutron port with ID {{$labels.port_id}} failed {{$labels.check}} check. The port type is {{$labels.device_owner}} with owner ID {{$labels.device_id}}

Virtual network port check failed (0 active)

alert: Virtual network port check failed
expr: neutron_port_status_failed{check!="dhcp",device_owner="network:dhcp"} == 1
for: 10m
labels:
  component: compute
  object_id: '{{$labels.port_id}}'
  severity: warning
annotations:
  summary: Neutron port with ID {{$labels.port_id}} failed {{$labels.check}} check. The port type is {{$labels.device_owner}} with owner ID {{$labels.device_id}}

Virtual router HA has more than one active L3 agent (0 active)

alert: Virtual router HA has more than one active L3 agent
expr: count by(ha_state, router_id) (openstack_neutron_l3_agent_of_router{ha_state="active"}) > 1
for: 10m
labels:
  component: compute
  object_id: '{{$labels.router_id}}'
  severity: critical
annotations:
  summary: |
    Virtual router HA with ID {{$labels.router_id}} has more than one active L3 agent. Please contact the technical support.

Virtual router HA has no active L3 agent (0 active)

alert: Virtual router HA has no active L3 agent
expr: count by(router_id) (openstack_neutron_l3_agent_of_router) - on(router_id) count by(router_id) (openstack_neutron_l3_agent_of_router{ha_state!~"active"}) == 0
for: 10m
labels:
  component: compute
  object_id: '{{$labels.router_id}}'
  severity: critical
annotations:
  summary: |
    Virtual router HA with ID {{$labels.router_id}} has no active L3 agent. Please contact the technical support.

Virtual router SNAT-related port has invalid host binding (0 active)

alert: Virtual router SNAT-related port has invalid host binding
expr: openstack_neutron_port{device_owner="network:router_centralized_snat"} and on(device_id, binding_host_id) (label_replace(label_replace(openstack_neutron_l3_agent_of_router{ha_state="standby"}, "device_id", "$1", "router_id", "(.+)"), "binding_host_id", "$1", "agent_host", "(.+)"))
for: 10m
labels:
  component: compute
  object_id: '{{$labels.uuid}}'
  severity: critical
annotations:
  summary: |
    Virtual router SNAT-related port with ID {{$labels.uuid}} is bound to the Standby HA router node. Please contact the technical support.

Virtual router gateway port has invalid host binding (0 active)

alert: Virtual router gateway port has invalid host binding
expr: openstack_neutron_port{device_owner="network:router_gateway"} and on(device_id, binding_host_id) (label_replace(label_replace(openstack_neutron_l3_agent_of_router{ha_state="standby"}, "device_id", "$1", "router_id", "(.+)"), "binding_host_id", "$1", "agent_host", "(.+)"))
for: 10m
labels:
  component: compute
  object_id: '{{$labels.uuid}}'
  severity: critical
annotations:
  summary: |
    Virtual router gateway port with ID {{$labels.uuid}} is bound to the Standby HA router node. Please contact the technical support.

Volume attachment details mismatch (0 active)

alert: Volume attachment details mismatch
expr: label_replace((count_over_time(cinder:libvirt:volume:diff[2h]) > 60) and (cinder:libvirt:volume:diff), "object_id", "$1", "volume_id", "(.*)")
for: 10m
labels:
  component: compute
  severity: critical
annotations:
  summary: Attachment details for volume with ID {{$labels.volume_id}} differ in the Nova and libvirt databases. Additionally, this may indicate the existence of an uncommitted temporary snapshot.

Volume has incorrect status (0 active)

alert: Volume has incorrect status
expr: openstack_cinder_volume_gb{status=~"error|error_deleting|error_managing|error_restoring|error_backing-up|error_extending"}
for: 10m
labels:
  component: compute
  object_id: '{{ $labels.id }}'
  severity: critical
annotations:
  summary: Volume {{ $labels.id }} has the {{ $labels.status }} status.

Volume is stuck in transitional state (0 active)

alert: Volume is stuck in transitional state
expr: openstack_cinder_volume_gb{status=~"attaching|detaching|deleting|extending|reserved"}
for: 15m
labels:
  component: compute
  object_id: '{{ $labels.id }}'
  severity: warning
annotations:
  summary: Volume {{ $labels.id }} is stuck with the {{ $labels.status }} status for more than 5 minutes.

/var/lib/prometheus/alerts/openstack_domain.rules > openstack_exporter_domain_limits

Domain is out of memory (0 active)

alert: Domain is out of memory
expr: round(openstack_nova_limits_memory_used{is_domain="true"} / (openstack_nova_limits_memory_max{is_domain="true"} > 0) * 100) >= 80 < 95
for: 10m
labels:
  component: compute
  object_id: '{{ $labels.tenant_id }}'
  severity: info
annotations:
  summary: Domain {{ $labels.tenant }} has reached {{ $value }}% of the memory allocation limit.
  value: '{{ $value }}'

Domain is out of memory (0 active)

alert: Domain is out of memory
expr: round(openstack_nova_limits_memory_used{is_domain="true"} / (openstack_nova_limits_memory_max{is_domain="true"} > 0) * 100) >= 95
for: 10m
labels:
  component: compute
  object_id: '{{ $labels.tenant_id }}'
  severity: warning
annotations:
  summary: Domain {{ $labels.tenant }} has reached {{ $value }}% of the memory allocation limit.
  value: '{{ $value }}'

Domain is out of storage policy space (0 active)

alert: Domain is out of storage policy space
expr: round(openstack_cinder_limits_volume_storage_policy_used_gb{is_domain="true",volume_type!=""} / (openstack_cinder_limits_volume_storage_policy_max_gb{is_domain="true",volume_type!=""} > 0) * 100) >= 80 < 95
for: 10m
labels:
  component: compute
  object_id: '{{ $labels.tenant_id }}-{{ $labels.volume_type }}'
  severity: info
annotations:
  summary: Domain {{ $labels.tenant }} has reached {{ $value }}% of the {{ $labels.volume_type }} storage policy allocation limit.
  value: '{{ $value }}'

Domain is out of storage policy space (0 active)

alert: Domain is out of storage policy space
expr: round(openstack_cinder_limits_volume_storage_policy_used_gb{is_domain="true",volume_type!=""} / (openstack_cinder_limits_volume_storage_policy_max_gb{is_domain="true",volume_type!=""} > 0) * 100) > 95
for: 10m
labels:
  component: compute
  object_id: '{{ $labels.tenant_id }}-{{ $labels.volume_type }}'
  severity: warning
annotations:
  summary: Domain {{ $labels.tenant }} has reached {{ $value }}% of the {{ $labels.volume_type }} storage policy allocation limit.
  value: '{{ $value }}'

Domain is out of vCPU resources (0 active)

alert: Domain is out of vCPU resources
expr: round(openstack_nova_limits_vcpus_used{is_domain="true"} / (openstack_nova_limits_vcpus_max{is_domain="true"} > 0) * 100) >= 80 < 95
for: 10m
labels:
  component: compute
  object_id: '{{ $labels.tenant_id }}'
  severity: info
annotations:
  summary: Domain {{ $labels.tenant }} has reached {{ $value }}% of the vCPU allocation limit.
  value: '{{ $value }}'

Domain is out of vCPU resources (0 active)

alert: Domain is out of vCPU resources
expr: round(openstack_nova_limits_vcpus_used{is_domain="true"} / (openstack_nova_limits_vcpus_max{is_domain="true"} > 0) * 100) >= 95
for: 10m
labels:
  component: compute
  object_id: '{{ $labels.tenant_id }}'
  severity: warning
annotations:
  summary: Domain {{ $labels.tenant }} has reached {{ $value }}% of the vCPU allocation limit.
  value: '{{ $value }}'

/var/lib/prometheus/alerts/openstack_node.rules > openstack_node_alerts

Extra RAM reservation detected for compute placement service (0 active)

alert: Extra RAM reservation detected for compute placement service
expr: (sum by(host) (label_replace(openstack_placement_resource_usage{resourcetype="MEMORY_MB"}, "host", "$1", "hostname", "(.*).vstoragedomain"))) / 1024 - (sum by(host) (label_replace(libvirt_domain_info_memory_usage_bytes, "host", "$1", "instance", "(.*)"))) / (1024 * 1024 * 1024) > 0
for: 30m
labels:
  component: compute
  object_id: '{{ $labels.host }}'
  severity: warning
  value: '{{ $value }}'
annotations:
  summary: Extra VM registrations consuming '{{ $value }}' GiB of RAM detected for the compute placement service on node '{{ $labels.host }}'.

Extra RAM reservation detected on hypervisor node (0 active)

alert: Extra RAM reservation detected on hypervisor node
expr: abs((sum by(host) (label_replace(openstack_placement_resource_usage{resourcetype="MEMORY_MB"}, "host", "$1", "hostname", "(.*).vstoragedomain"))) / 1024 - (sum by(host) (label_replace(libvirt_domain_info_memory_usage_bytes, "host", "$1", "instance", "(.*)"))) / (1024 * 1024 * 1024) < 0)
for: 30m
labels:
  component: compute
  object_id: '{{ $labels.host }}'
  severity: info
  value: '{{ $value }}'
annotations:
  summary: Extra VM registrations consuming '{{ $value }}' GiB of RAM detected on hypervisor node '{{ $labels.host }}'.

Extra vCPU reservation detected for compute placement service (0 active)

alert: Extra vCPU reservation detected for compute placement service
expr: sum by(host) (label_replace(openstack_placement_resource_usage{resourcetype="VCPU"}, "host", "$1", "hostname", "(.*).vstoragedomain")) - sum by(host) (label_replace(libvirt_domain_info_virtual_cpus, "host", "$1", "instance", "(.*)")) > 0
for: 30m
labels:
  component: compute
  object_id: '{{ $labels.host }}'
  severity: warning
  value: '{{ $value }}'
annotations:
  summary: Extra VM registrations consuming '{{ $value }}' vCPUs detected for the compute placement service on node '{{ $labels.host }}'.

Extra vCPU reservation detected on hypervisor node (0 active)

alert: Extra vCPU reservation detected on hypervisor node
expr: abs(sum by(host) (label_replace(openstack_placement_resource_usage{resourcetype="VCPU"}, "host", "$1", "hostname", "(.*).vstoragedomain")) - sum by(host) (label_replace(libvirt_domain_info_virtual_cpus, "host", "$1", "instance", "(.*)")) < 0)
for: 30m
labels:
  component: compute
  object_id: '{{ $labels.host }}'
  severity: info
  value: '{{ $value }}'
annotations:
  summary: Extra VM registrations consuming '{{ $value }}' vCPUs detected on hypervisor node '{{ $labels.host }}'.

Libvirt service is down (0 active)

alert: Libvirt service is down
expr: libvirt_up == 0 and on(node) backend_node_compute == 1
for: 10m
labels:
  component: compute
  severity: critical
annotations:
  summary: Libvirt service is not responding on node {{$labels.instance}} with ID {{$labels.node}}. Check the service state. If the service cannot start, contact the technical support.

Node is out of memory (0 active)

alert: Node is out of memory
expr: (openstack_placement_resource_usage{resourcetype="MEMORY_MB"} / (openstack_placement_resource_allocation_ratio{resourcetype="MEMORY_MB"} * (openstack_placement_resource_total{resourcetype="MEMORY_MB"} - openstack_placement_resource_reserved{resourcetype="MEMORY_MB"})) * 100 * on(hostname) group_left(node, instance) label_replace(node_uname_info{job="node"}, "hostname", "$1", "nodename", "(.*)")) > 95
for: 10m
labels:
  component: compute
  severity: info
annotations:
  summary: Node {{$labels.instance}} with ID {{$labels.node}} has reached 95% of the memory allocation limit.

Node is out of vCPU resources (0 active)

alert: Node is out of vCPU resources
expr: (openstack_placement_resource_usage{resourcetype="VCPU"} / (openstack_placement_resource_allocation_ratio{resourcetype="VCPU"} * (openstack_placement_resource_total{resourcetype="VCPU"} - openstack_placement_resource_reserved{resourcetype="VCPU"})) * 100 * on(hostname) group_left(node, instance) label_replace(node_uname_info{job="node"}, "hostname", "$1", "nodename", "(.*)")) > 95
for: 10m
labels:
  component: compute
  severity: info
annotations:
  summary: Node {{$labels.instance}} with ID {{$labels.node}} has reached 95% of the vCPU allocation limit.

Node is running out of memory (0 active)

alert: Node is running out of memory
expr: 80 < (openstack_placement_resource_usage{resourcetype="MEMORY_MB"} / (openstack_placement_resource_allocation_ratio{resourcetype="MEMORY_MB"} * (openstack_placement_resource_total{resourcetype="MEMORY_MB"} - openstack_placement_resource_reserved{resourcetype="MEMORY_MB"})) * 100 * on(hostname) group_left(node, instance) label_replace(node_uname_info{job="node"}, "hostname", "$1", "nodename", "(.*)")) < 95
for: 10m
labels:
  component: compute
  severity: info
annotations:
  summary: Node {{$labels.instance}} with ID {{$labels.node}} has reached 80% of the memory allocation limit.

Node is running out of vCPU resources (0 active)

alert: Node is running out of vCPU resources
expr: 80 < (openstack_placement_resource_usage{resourcetype="VCPU"} / (openstack_placement_resource_allocation_ratio{resourcetype="VCPU"} * (openstack_placement_resource_total{resourcetype="VCPU"} - openstack_placement_resource_reserved{resourcetype="VCPU"})) * 100 * on(hostname) group_left(node, instance) label_replace(node_uname_info{job="node"}, "hostname", "$1", "nodename", "(.*)")) < 95
for: 10m
labels:
  component: compute
  severity: info
annotations:
  summary: Node {{$labels.instance}} with ID {{$labels.node}} has reached 80% of the vCPU limit.

/var/lib/prometheus/alerts/openstack_projects.rules > openstack_exporter_project_limits

Project is out of floating IP addresses (34 active)

alert: Project is out of floating IP addresses
expr: label_replace((openstack_neutron_quotas_floating_ips_used) / (openstack_neutron_quotas_floating_ips_limit > 0) * 100 * on(tenant_id) group_left() label_replace(openstack_identity_project_info, "tenant_id", "$1", "id", "(.*)") > 95, "object_id", "$1", "network_id", "(.*)")
for: 5m
labels:
  component: compute
  severity: info
annotations:
  summary: Project {{$labels.tenant}} has reached 95% of the floating IP address allocation limit.

Labels	State	Active Since	Value
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-118058" tenant_id="fe2fb846d21946fe971ffdc58d3b4021"	firing	2025-11-14 08:01:04.388696542 +0000 UTC	100
Annotations
summary Project project-118058 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-119280" tenant_id="62d89b6e807a414eb88837cf967de6bf"	firing	2025-11-14 08:01:04.388696542 +0000 UTC	100
Annotations
summary Project project-119280 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122213" tenant_id="2479f7bbbdd048a892cdce5b6041eb0d"	firing	2025-12-20 15:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-122213 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-118012" tenant_id="f7f4e7c84bb5421c90476e8b62d62be1"	firing	2025-11-14 08:01:04.388696542 +0000 UTC	100
Annotations
summary Project project-118012 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122063" tenant_id="2856b90c9ce848949da775ba7190125e"	firing	2025-12-16 17:00:04.388696542 +0000 UTC	100
Annotations
summary Project project-122063 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122519" tenant_id="3a831eead8904a1ba173172f9027a382"	firing	2026-01-02 10:30:04.388696542 +0000 UTC	100
Annotations
summary Project project-122519 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123020" tenant_id="76589b5ba6db4544b76639a965c48087"	firing	2026-01-02 12:00:04.388696542 +0000 UTC	100
Annotations
summary Project project-123020 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123263" tenant_id="1fc00ed8917c41268eaf1c18fc4475b2"	firing	2026-01-06 05:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-123263 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123458" tenant_id="337091c1254e4845844471037bb1d0a9"	firing	2026-01-08 15:10:04.388696542 +0000 UTC	100
Annotations
summary Project project-123458 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122459" tenant_id="125599669f7e4a9bb3707ee1766754be"	firing	2025-12-23 09:40:04.388696542 +0000 UTC	100
Annotations
summary Project project-122459 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122803" tenant_id="080d78c2d3e7422e97236dd562b464fd"	firing	2025-12-29 10:00:04.388696542 +0000 UTC	100
Annotations
summary Project project-122803 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123018" tenant_id="d23d4718205143cbaa38899d58fb0902"	firing	2026-01-02 12:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-123018 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123021" tenant_id="313f523a7fc549898f4ff3272cc8edcb"	firing	2026-01-03 15:00:04.388696542 +0000 UTC	100
Annotations
summary Project project-123021 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123275" tenant_id="655bf94757b34ce28393e3588e0a7559"	firing	2026-01-06 17:50:04.388696542 +0000 UTC	100
Annotations
summary Project project-123275 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-118139" tenant_id="7d1f8f08904e4d6caf5285894da50d1b"	firing	2025-11-14 08:01:04.388696542 +0000 UTC	100
Annotations
summary Project project-118139 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-32926" tenant_id="8890bce87bc64d43b0c0b033f94bf8cc"	firing	2025-12-22 16:00:04.388696542 +0000 UTC	100
Annotations
summary Project project-32926 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122499" tenant_id="84d97362dcbf40898dea3c822e2b304b"	firing	2025-12-24 05:30:04.388696542 +0000 UTC	100
Annotations
summary Project project-122499 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-51475" tenant_id="a497d633d803473ebb9de92f9171563f"	firing	2026-01-02 05:00:04.388696542 +0000 UTC	100
Annotations
summary Project project-51475 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123197" tenant_id="6b8a46bf990d4e409d598b327b65f532"	firing	2026-01-08 02:00:04.388696542 +0000 UTC	100
Annotations
summary Project project-123197 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123373" tenant_id="2620ec59745f4ba69ba8e5bb97d7f800"	firing	2026-01-08 16:10:04.388696542 +0000 UTC	100
Annotations
summary Project project-123373 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-121352" tenant_id="e7f50c5cd5b54155b6b152eff2bf71e8"	firing	2025-12-11 16:00:04.388696542 +0000 UTC	100
Annotations
summary Project project-121352 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-118480" tenant_id="e050246871f84196a42823491e6bd39b"	firing	2025-11-14 08:01:04.388696542 +0000 UTC	100
Annotations
summary Project project-118480 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-97031" tenant_id="1fa7097b647240aa9d2028c31f8bbdbb"	firing	2025-12-30 05:10:04.388696542 +0000 UTC	100
Annotations
summary Project project-97031 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122988" tenant_id="f96617b0cf214c81bd2507e838afcb80"	firing	2026-01-02 10:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-122988 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123417" tenant_id="9b52e5e261784b1baea412c1ee882ad0"	firing	2026-01-08 10:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-123417 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-118481" tenant_id="8e577430b32a4881983777a3f69f771a"	firing	2025-11-14 08:01:04.388696542 +0000 UTC	100
Annotations
summary Project project-118481 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122427" tenant_id="9a0d4e86a55d44d4bb7ef3a884482e62"	firing	2025-12-23 06:40:04.388696542 +0000 UTC	100
Annotations
summary Project project-122427 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123405" tenant_id="f13f6749dbd04b13a51665e54460fb10"	firing	2026-01-08 10:00:04.388696542 +0000 UTC	100
Annotations
summary Project project-123405 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-123452" tenant_id="0855a604f4a84ffe9cd42c6ac060cbd6"	firing	2026-01-08 15:10:04.388696542 +0000 UTC	100
Annotations
summary Project project-123452 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-118320" tenant_id="c4cd1dc496624de5ba89fbcad7983cc0"	firing	2025-11-14 08:01:04.388696542 +0000 UTC	100
Annotations
summary Project project-118320 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-120691" tenant_id="2bd0411d89c747219d257253006c19be"	firing	2025-12-12 04:50:04.388696542 +0000 UTC	100
Annotations
summary Project project-120691 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122473" tenant_id="56371d58bd434873a84158a8d68baad6"	firing	2025-12-31 05:00:04.388696542 +0000 UTC	100
Annotations
summary Project project-122473 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-118892" tenant_id="a559965cfc17408f863683f27ce8a425"	firing	2025-11-30 11:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-118892 has reached 95% of the floating IP address allocation limit.
alertname="Project is out of floating IP addresses" component="compute" instance="compute-api.svc" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="info" tenant="project-122468" tenant_id="db4140291890405c919048a43ebdd1cf"	firing	2026-01-02 15:50:04.388696542 +0000 UTC	100
Annotations
summary Project project-122468 has reached 95% of the floating IP address allocation limit.

Project is out of memory (27 active)

alert: Project is out of memory
expr: label_replace(openstack_nova_limits_memory_used{is_domain="false"} / (openstack_nova_limits_memory_max{is_domain="false"} > 0) * 100 > 95, "object_id", "$1", "tenant_id", "(.*)")
for: 10m
labels:
  component: compute
  severity: info
annotations:
  summary: Project {{$labels.tenant}} has reached 95% of the memory allocation limit.

Labels	State	Active Since	Value
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="fe2fb846d21946fe971ffdc58d3b4021" severity="info" tenant="project-118058" tenant_id="fe2fb846d21946fe971ffdc58d3b4021"	firing	2025-10-24 12:10:04 +0000 UTC	100
Annotations
summary Project project-118058 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="655bf94757b34ce28393e3588e0a7559" severity="info" tenant="project-123275" tenant_id="655bf94757b34ce28393e3588e0a7559"	firing	2026-01-06 17:40:04.388696542 +0000 UTC	100
Annotations
summary Project project-123275 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="8e577430b32a4881983777a3f69f771a" severity="info" tenant="project-118481" tenant_id="8e577430b32a4881983777a3f69f771a"	firing	2025-11-02 04:20:04 +0000 UTC	100
Annotations
summary Project project-118481 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="2479f7bbbdd048a892cdce5b6041eb0d" severity="info" tenant="project-122213" tenant_id="2479f7bbbdd048a892cdce5b6041eb0d"	firing	2025-12-20 15:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-122213 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="125599669f7e4a9bb3707ee1766754be" severity="info" tenant="project-122459" tenant_id="125599669f7e4a9bb3707ee1766754be"	firing	2025-12-23 14:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-122459 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="6b8a46bf990d4e409d598b327b65f532" severity="info" tenant="project-123197" tenant_id="6b8a46bf990d4e409d598b327b65f532"	firing	2026-01-05 10:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-123197 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="56371d58bd434873a84158a8d68baad6" severity="info" tenant="project-122473" tenant_id="56371d58bd434873a84158a8d68baad6"	firing	2025-12-31 08:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-122473 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="7d1f8f08904e4d6caf5285894da50d1b" severity="info" tenant="project-118139" tenant_id="7d1f8f08904e4d6caf5285894da50d1b"	firing	2025-10-25 11:40:04 +0000 UTC	100
Annotations
summary Project project-118139 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="62d89b6e807a414eb88837cf967de6bf" severity="info" tenant="project-119280" tenant_id="62d89b6e807a414eb88837cf967de6bf"	firing	2025-11-12 02:00:04 +0000 UTC	100
Annotations
summary Project project-119280 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="2bd0411d89c747219d257253006c19be" severity="info" tenant="project-120691" tenant_id="2bd0411d89c747219d257253006c19be"	firing	2025-12-30 03:30:04.388696542 +0000 UTC	100
Annotations
summary Project project-120691 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="3a831eead8904a1ba173172f9027a382" severity="info" tenant="project-122519" tenant_id="3a831eead8904a1ba173172f9027a382"	firing	2026-01-02 15:40:04.388696542 +0000 UTC	100
Annotations
summary Project project-122519 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="e050246871f84196a42823491e6bd39b" severity="info" tenant="project-118480" tenant_id="e050246871f84196a42823491e6bd39b"	firing	2025-11-04 05:00:04 +0000 UTC	100
Annotations
summary Project project-118480 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="2856b90c9ce848949da775ba7190125e" severity="info" tenant="project-122063" tenant_id="2856b90c9ce848949da775ba7190125e"	firing	2025-12-16 17:00:04.388696542 +0000 UTC	100
Annotations
summary Project project-122063 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="2620ec59745f4ba69ba8e5bb97d7f800" severity="info" tenant="project-123373" tenant_id="2620ec59745f4ba69ba8e5bb97d7f800"	firing	2026-01-08 16:10:04.388696542 +0000 UTC	100
Annotations
summary Project project-123373 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="f7f4e7c84bb5421c90476e8b62d62be1" severity="info" tenant="project-118012" tenant_id="f7f4e7c84bb5421c90476e8b62d62be1"	firing	2025-10-28 10:50:04 +0000 UTC	100
Annotations
summary Project project-118012 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="080d78c2d3e7422e97236dd562b464fd" severity="info" tenant="project-122803" tenant_id="080d78c2d3e7422e97236dd562b464fd"	firing	2025-12-29 09:50:04.388696542 +0000 UTC	100
Annotations
summary Project project-122803 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="337091c1254e4845844471037bb1d0a9" severity="info" tenant="project-123458" tenant_id="337091c1254e4845844471037bb1d0a9"	firing	2026-01-08 15:00:04.388696542 +0000 UTC	100
Annotations
summary Project project-123458 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="c4cd1dc496624de5ba89fbcad7983cc0" severity="info" tenant="project-118320" tenant_id="c4cd1dc496624de5ba89fbcad7983cc0"	firing	2025-10-28 15:10:04 +0000 UTC	100
Annotations
summary Project project-118320 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="e7f50c5cd5b54155b6b152eff2bf71e8" severity="info" tenant="project-121352" tenant_id="e7f50c5cd5b54155b6b152eff2bf71e8"	firing	2025-12-11 16:00:04.388696542 +0000 UTC	100
Annotations
summary Project project-121352 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="84d97362dcbf40898dea3c822e2b304b" severity="info" tenant="project-122499" tenant_id="84d97362dcbf40898dea3c822e2b304b"	firing	2025-12-24 12:00:04.388696542 +0000 UTC	100
Annotations
summary Project project-122499 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="76589b5ba6db4544b76639a965c48087" severity="info" tenant="project-123020" tenant_id="76589b5ba6db4544b76639a965c48087"	firing	2026-01-02 12:40:04.388696542 +0000 UTC	100
Annotations
summary Project project-123020 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="1679394c56df4dcd8e1c998f9c614d66" severity="info" tenant="project-123276" tenant_id="1679394c56df4dcd8e1c998f9c614d66"	firing	2026-01-07 06:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-123276 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="9b52e5e261784b1baea412c1ee882ad0" severity="info" tenant="project-123417" tenant_id="9b52e5e261784b1baea412c1ee882ad0"	firing	2026-01-08 09:40:04.388696542 +0000 UTC	100
Annotations
summary Project project-123417 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="9a0d4e86a55d44d4bb7ef3a884482e62" severity="info" tenant="project-122427" tenant_id="9a0d4e86a55d44d4bb7ef3a884482e62"	firing	2025-12-23 06:30:04.388696542 +0000 UTC	100
Annotations
summary Project project-122427 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="f96617b0cf214c81bd2507e838afcb80" severity="info" tenant="project-122988" tenant_id="f96617b0cf214c81bd2507e838afcb80"	firing	2026-01-02 10:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-122988 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="d23d4718205143cbaa38899d58fb0902" severity="info" tenant="project-123018" tenant_id="d23d4718205143cbaa38899d58fb0902"	firing	2026-01-02 14:40:04.388696542 +0000 UTC	100
Annotations
summary Project project-123018 has reached 95% of the memory allocation limit.
alertname="Project is out of memory" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="1fc00ed8917c41268eaf1c18fc4475b2" severity="info" tenant="project-123263" tenant_id="1fc00ed8917c41268eaf1c18fc4475b2"	firing	2026-01-06 05:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-123263 has reached 95% of the memory allocation limit.

Project is out of vCPU resources (28 active)

alert: Project is out of vCPU resources
expr: label_replace(openstack_nova_limits_vcpus_used{is_domain="false"} / (openstack_nova_limits_vcpus_max{is_domain="false"} > 0) * 100 > 95, "object_id", "$1", "tenant_id", "(.*)")
for: 10m
labels:
  component: compute
  severity: info
annotations:
  summary: Project {{$labels.tenant}} has reached 95% of the vCPU allocation limit.

Labels	State	Active Since	Value
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="337091c1254e4845844471037bb1d0a9" severity="info" tenant="project-123458" tenant_id="337091c1254e4845844471037bb1d0a9"	firing	2026-01-08 15:00:04.388696542 +0000 UTC	100
Annotations
summary Project project-123458 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="7d1f8f08904e4d6caf5285894da50d1b" severity="info" tenant="project-118139" tenant_id="7d1f8f08904e4d6caf5285894da50d1b"	firing	2025-10-25 11:40:04 +0000 UTC	100
Annotations
summary Project project-118139 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="8e577430b32a4881983777a3f69f771a" severity="info" tenant="project-118481" tenant_id="8e577430b32a4881983777a3f69f771a"	firing	2025-11-02 04:20:04 +0000 UTC	100
Annotations
summary Project project-118481 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="9a0d4e86a55d44d4bb7ef3a884482e62" severity="info" tenant="project-122427" tenant_id="9a0d4e86a55d44d4bb7ef3a884482e62"	firing	2025-12-23 06:30:04.388696542 +0000 UTC	100
Annotations
summary Project project-122427 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="84d97362dcbf40898dea3c822e2b304b" severity="info" tenant="project-122499" tenant_id="84d97362dcbf40898dea3c822e2b304b"	firing	2025-12-24 12:00:04.388696542 +0000 UTC	100
Annotations
summary Project project-122499 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="3a831eead8904a1ba173172f9027a382" severity="info" tenant="project-122519" tenant_id="3a831eead8904a1ba173172f9027a382"	firing	2026-01-02 15:40:04.388696542 +0000 UTC	100
Annotations
summary Project project-122519 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="655bf94757b34ce28393e3588e0a7559" severity="info" tenant="project-123275" tenant_id="655bf94757b34ce28393e3588e0a7559"	firing	2026-01-06 17:40:04.388696542 +0000 UTC	100
Annotations
summary Project project-123275 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="fe2fb846d21946fe971ffdc58d3b4021" severity="info" tenant="project-118058" tenant_id="fe2fb846d21946fe971ffdc58d3b4021"	firing	2025-10-24 12:10:04 +0000 UTC	100
Annotations
summary Project project-118058 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="c4cd1dc496624de5ba89fbcad7983cc0" severity="info" tenant="project-118320" tenant_id="c4cd1dc496624de5ba89fbcad7983cc0"	firing	2025-10-28 15:10:04 +0000 UTC	100
Annotations
summary Project project-118320 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="1679394c56df4dcd8e1c998f9c614d66" severity="info" tenant="project-123276" tenant_id="1679394c56df4dcd8e1c998f9c614d66"	firing	2026-01-07 06:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-123276 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="9b52e5e261784b1baea412c1ee882ad0" severity="info" tenant="project-123417" tenant_id="9b52e5e261784b1baea412c1ee882ad0"	firing	2026-01-08 09:40:04.388696542 +0000 UTC	100
Annotations
summary Project project-123417 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="e7f50c5cd5b54155b6b152eff2bf71e8" severity="info" tenant="project-121352" tenant_id="e7f50c5cd5b54155b6b152eff2bf71e8"	firing	2025-12-11 16:00:04.388696542 +0000 UTC	100
Annotations
summary Project project-121352 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="56371d58bd434873a84158a8d68baad6" severity="info" tenant="project-122473" tenant_id="56371d58bd434873a84158a8d68baad6"	firing	2025-12-31 08:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-122473 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="f96617b0cf214c81bd2507e838afcb80" severity="info" tenant="project-122988" tenant_id="f96617b0cf214c81bd2507e838afcb80"	firing	2026-01-02 10:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-122988 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="6b8a46bf990d4e409d598b327b65f532" severity="info" tenant="project-123197" tenant_id="6b8a46bf990d4e409d598b327b65f532"	firing	2026-01-05 10:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-123197 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="d23d4718205143cbaa38899d58fb0902" severity="info" tenant="project-123018" tenant_id="d23d4718205143cbaa38899d58fb0902"	firing	2026-01-02 14:40:04.388696542 +0000 UTC	100
Annotations
summary Project project-123018 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="8890bce87bc64d43b0c0b033f94bf8cc" severity="info" tenant="project-32926" tenant_id="8890bce87bc64d43b0c0b033f94bf8cc"	firing	2026-01-03 10:40:04.388696542 +0000 UTC	100
Annotations
summary Project project-32926 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="2620ec59745f4ba69ba8e5bb97d7f800" severity="info" tenant="project-123373" tenant_id="2620ec59745f4ba69ba8e5bb97d7f800"	firing	2026-01-08 16:10:04.388696542 +0000 UTC	100
Annotations
summary Project project-123373 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="f7f4e7c84bb5421c90476e8b62d62be1" severity="info" tenant="project-118012" tenant_id="f7f4e7c84bb5421c90476e8b62d62be1"	firing	2025-10-28 10:50:04 +0000 UTC	100
Annotations
summary Project project-118012 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="62d89b6e807a414eb88837cf967de6bf" severity="info" tenant="project-119280" tenant_id="62d89b6e807a414eb88837cf967de6bf"	firing	2025-11-12 02:00:04 +0000 UTC	100
Annotations
summary Project project-119280 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="76589b5ba6db4544b76639a965c48087" severity="info" tenant="project-123020" tenant_id="76589b5ba6db4544b76639a965c48087"	firing	2026-01-02 12:40:04.388696542 +0000 UTC	100
Annotations
summary Project project-123020 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="1fc00ed8917c41268eaf1c18fc4475b2" severity="info" tenant="project-123263" tenant_id="1fc00ed8917c41268eaf1c18fc4475b2"	firing	2026-01-06 05:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-123263 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="e050246871f84196a42823491e6bd39b" severity="info" tenant="project-118480" tenant_id="e050246871f84196a42823491e6bd39b"	firing	2025-11-04 05:00:04 +0000 UTC	100
Annotations
summary Project project-118480 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="2479f7bbbdd048a892cdce5b6041eb0d" severity="info" tenant="project-122213" tenant_id="2479f7bbbdd048a892cdce5b6041eb0d"	firing	2025-12-20 15:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-122213 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="125599669f7e4a9bb3707ee1766754be" severity="info" tenant="project-122459" tenant_id="125599669f7e4a9bb3707ee1766754be"	firing	2025-12-23 14:20:04.388696542 +0000 UTC	100
Annotations
summary Project project-122459 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="080d78c2d3e7422e97236dd562b464fd" severity="info" tenant="project-122803" tenant_id="080d78c2d3e7422e97236dd562b464fd"	firing	2025-12-29 09:50:04.388696542 +0000 UTC	100
Annotations
summary Project project-122803 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="2bd0411d89c747219d257253006c19be" severity="info" tenant="project-120691" tenant_id="2bd0411d89c747219d257253006c19be"	firing	2025-12-30 03:30:04.388696542 +0000 UTC	100
Annotations
summary Project project-120691 has reached 95% of the vCPU allocation limit.
alertname="Project is out of vCPU resources" component="compute" instance="compute-api.svc" is_domain="false" job="openstack_exporter" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="2856b90c9ce848949da775ba7190125e" severity="info" tenant="project-122063" tenant_id="2856b90c9ce848949da775ba7190125e"	firing	2025-12-16 17:00:04.388696542 +0000 UTC	100
Annotations
summary Project project-122063 has reached 95% of the vCPU allocation limit.

Network is out of IP addresses (0 active)

alert: Network is out of IP addresses
expr: label_replace((openstack_neutron_network_ip_availabilities_used > 0) / (openstack_neutron_network_ip_availabilities_total) * 100 * on(project_id) group_left(name) label_replace(openstack_identity_project_info{name=~"admin|service"}, "project_id", "$1", "id", "(.*)") > 95, "object_id", "$1", "network_id", "(.*)")
for: 5m
labels:
  component: compute
  severity: info
annotations:
  summary: Network {{$labels.network_name}} with ID {{$labels.network_id}} in project {{$labels.name}} has reached 95% of the IP address allocation limit.

Project is out of storage policy space (0 active)

alert: Project is out of storage policy space
expr: label_join(openstack_cinder_limits_volume_storage_policy_used_gb{is_domain="false",volume_type!=""} / (openstack_cinder_limits_volume_storage_policy_max_gb{is_domain="false",volume_type!=""} > 0) * 100 > 95, "object_id", "-", "tenant_id", "volume_type")
for: 5m
labels:
  component: compute
  severity: info
annotations:
  summary: Project {{$labels.tenant}} has reached 95% of the {{$labels.volume_type}} storage policy allocation limit.

/var/lib/prometheus/alerts/openstack_services.rules > openstack_service_is_down

All OpenStack service API upstreams are down (0 active)

alert: All OpenStack service API upstreams are down
expr: (sum by(cluster_id, service) (openstack_service_up{service!~"keystone.*"}) <= 0) and on(cluster_id) (backend_ha_reconfigure == 0) and on(cluster_id) (backend_compute_reconfigure == 0) and on(cluster_id) (backend_compute_deploy == 0) and on() (compute_cluster_state{state="reconfiguring"} == 0)
for: 10m
labels:
  component: compute
  object_id: '{{$labels.service}}'
  severity: critical
annotations:
  summary: All OpenStack {{$labels.service}} API upstreams are down.

High request error rate for OpenStack API requests detected (0 active)

alert: High request error rate for OpenStack API requests detected
expr: label_replace(sum by(instance, log_file) (rate(openstack_request_count{status=~"5.."}[1h])) / sum by(instance, log_file) (rate(openstack_request_count[1h])), "object_id", "$1", "log_file", "(.*).log") * 100 > 5
for: 10m
labels:
  component: compute
  severity: warning
annotations:
  summary: Request error rate more than 5% detected for {{$labels.object_id}} for the last 1 hour. Check {{$labels.object_id}} resource usage.

Keystone API service is down (0 active)

alert: Keystone API service is down
expr: (openstack_service_up{service=~"keystone.*"} == 0) and on(cluster_id) (backend_ha_reconfigure == 0) and on(cluster_id) (backend_compute_reconfigure == 0) and on(cluster_id) (backend_compute_deploy == 0)
for: 10m
labels:
  component: compute
  object_id: '{{$labels.service}}'
  severity: critical
annotations:
  summary: '{{$labels.service}} API service is down.'

OpenStack Cinder Scheduler is down (0 active)

alert: OpenStack Cinder Scheduler is down
expr: sum without(uuid) (label_replace(openstack_cinder_agent_state{adminState="enabled",service="cinder-scheduler"}, "nodename", "$1", "hostname", "(.*vstoragedomain).*") * on(nodename) group_left(node) node_uname_info{job="node"} * on(node) group_left() (backend_node_management == 1) * on(node) group_left(instance) up{job="node"} + on(node) group_left() max by(node) (softwareupdates_node_state{state=~"updating|rebooting"}) + scalar(backend_ha_reconfigure == bool 1)) == 0
for: 10m
labels:
  component: compute
  object_id: '{{$labels.instance}}'
  severity: critical
annotations:
  summary: OpenStack Block Storage (Cinder) Scheduler agent is down on host {{$labels.instance}}.

OpenStack Cinder Volume agent is down (0 active)

alert: OpenStack Cinder Volume agent is down
expr: sum without(uuid) (label_replace(label_replace(openstack_cinder_agent_state{adminState="enabled",service="cinder-volume"}, "nodename", "$1", "hostname", "(.*vstoragedomain).*"), "storage_name", "$1", "hostname", ".*@(.*)") * on(nodename) group_left(node) node_uname_info{job="node"} * on(node) group_left() (backend_node_management == 1) * on(node) group_left(instance) up{job="node"} + on(node) group_left() max by(node) (softwareupdates_node_state{state=~"updating|rebooting"}) + scalar(backend_ha_reconfigure == bool 1)) == 0
for: 10m
labels:
  component: compute
  object_id: '{{$labels.instance}}-{{$labels.storage_name}}'
  severity: critical
annotations:
  summary: OpenStack Block Storage (Cinder) Volume agent is down on host {{$labels.instance}} for storage {{$labels.storage_name}}.

OpenStack Neutron DHCP agent is down (0 active)

alert: OpenStack Neutron DHCP agent is down
expr: (label_replace(openstack_neutron_agent_state{service="neutron-dhcp-agent"}, "nodename", "$1", "hostname", "(.*vstoragedomain).*") * on(nodename) group_left(node) node_uname_info{job="node"} * on(node) group_left() (backend_node_management == 1) * on(node) group_left(instance) up{job="node"} + on(node) group_left() max by(node) (softwareupdates_node_state{state=~"updating|rebooting"}) + scalar(backend_ha_reconfigure == bool 1)) == 0
for: 10m
labels:
  component: compute
  object_id: '{{$labels.instance}}'
  severity: critical
annotations:
  summary: OpenStack Networking (Neutron) DHCP agent is down on host {{$labels.instance}}.

OpenStack Neutron L3 agent is down (0 active)

alert: OpenStack Neutron L3 agent is down
expr: (label_replace(openstack_neutron_agent_state{service="neutron-l3-agent"}, "nodename", "$1", "hostname", "(.*vstoragedomain).*") * on(nodename) group_left(node) node_uname_info{job="node"} * on(node) group_left() (backend_node_compute == 1) * on(node) group_left(instance) up{job="node"} + on(node) group_left() max by(node) (softwareupdates_node_state{state=~"updating|rebooting"}) + scalar(backend_ha_reconfigure == bool 1)) == 0
for: 10m
labels:
  component: compute
  object_id: '{{$labels.instance}}'
  severity: critical
annotations:
  summary: OpenStack Networking (Neutron) L3 agent is down on host {{$labels.instance}}.

OpenStack Neutron Metadata agent is down (0 active)

alert: OpenStack Neutron Metadata agent is down
expr: (label_replace(openstack_neutron_agent_state{service="neutron-metadata-agent"}, "nodename", "$1", "hostname", "(.*vstoragedomain).*") * on(nodename) group_left(node) node_uname_info{job="node"} * on(node) group_left() (backend_node_compute == 1) * on(node) group_left(instance) up{job="node"} + on(node) group_left() max by(node) (softwareupdates_node_state{state=~"updating|rebooting"}) + scalar(backend_ha_reconfigure == bool 1)) == 0
for: 10m
labels:
  component: compute
  object_id: '{{$labels.instance}}'
  severity: critical
annotations:
  summary: OpenStack Networking (Neutron) Metadata agent is down on host {{$labels.instance}}.

OpenStack Neutron OpenvSwitch agent is down (0 active)

alert: OpenStack Neutron OpenvSwitch agent is down
expr: (label_replace(openstack_neutron_agent_state{service="neutron-openvswitch-agent"}, "nodename", "$1", "hostname", "(.*vstoragedomain).*") * on(nodename) group_left(node) node_uname_info{job="node"} * on(node) group_left() (backend_node_compute == 1) * on(node) group_left(instance) up{job="node"} + on(node) group_left() max by(node) (softwareupdates_node_state{state=~"updating|rebooting"}) + scalar(backend_ha_reconfigure == bool 1)) == 0
for: 10m
labels:
  component: compute
  object_id: '{{$labels.instance}}'
  severity: critical
annotations:
  summary: OpenStack Networking (Neutron) OpenvSwitch agent is down on host {{$labels.instance}}.

OpenStack Nova Compute is down (0 active)

alert: OpenStack Nova Compute is down
expr: (label_replace(openstack_nova_agent_state{adminState="enabled",service="nova-compute"}, "nodename", "$1", "hostname", "(.*vstoragedomain).*") * on(nodename) group_left(node) node_uname_info{job="node"} * on(node) group_left() (backend_node_management == 1) * on(node) group_left(instance) up{job="node"} + on(node) group_left() max by(node) (softwareupdates_node_state{state=~"updating|rebooting"}) + scalar(backend_ha_reconfigure == bool 1)) == 0
for: 10m
labels:
  component: compute
  object_id: '{{$labels.instance}}'
  severity: critical
annotations:
  summary: OpenStack Compute (Nova) agent is down on host {{$labels.instance}}.

OpenStack Nova Conductor is down (0 active)

alert: OpenStack Nova Conductor is down
expr: (label_replace(openstack_nova_agent_state{adminState="enabled",service="nova-conductor"}, "nodename", "$1", "hostname", "(.*vstoragedomain).*") * on(nodename) group_left(node) node_uname_info{job="node"} * on(node) group_left() (backend_node_management == 1) * on(node) group_left(instance) up{job="node"} + on(node) group_left() max by(node) (softwareupdates_node_state{state=~"updating|rebooting"}) + scalar(backend_ha_reconfigure == bool 1)) == 0
for: 10m
labels:
  component: compute
  object_id: '{{$labels.instance}}'
  severity: critical
annotations:
  summary: OpenStack Compute (Nova) Conductor agent is down on host {{$labels.instance}}.

OpenStack Nova Scheduler is down (0 active)

alert: OpenStack Nova Scheduler is down
expr: (label_replace(openstack_nova_agent_state{adminState="enabled",service="nova-scheduler"}, "nodename", "$1", "hostname", "(.*vstoragedomain).*") * on(nodename) group_left(node) node_uname_info{job="node"} * on(node) group_left() (backend_node_management == 1) * on(node) group_left(instance) up{job="node"} + on(node) group_left() max by(node) (softwareupdates_node_state{state=~"updating|rebooting"}) + scalar(backend_ha_reconfigure == bool 1)) == 0
for: 10m
labels:
  component: compute
  object_id: '{{$labels.instance}}'
  severity: critical
annotations:
  summary: OpenStack Compute (Nova) Scheduler agent is down on host {{$labels.instance}}.

OpenStack Octavia HealthManager service is down (0 active)

alert: OpenStack Octavia HealthManager service is down
expr: (label_replace(openstack_loadbalancer_service_state{service="octavia_health_manager"}, "nodename", "$1", "hostname", "(.*vstoragedomain).*") * on(nodename) group_left(node) node_uname_info{job="node"} * on(node) group_left() (backend_node_compute_controller == 1) * on(node) group_left(instance) up{job="node"} + on(node) group_left() max by(node) (softwareupdates_node_state{state=~"updating|rebooting"}) + scalar(backend_ha_reconfigure == bool 1)) == 0
for: 10m
labels:
  component: compute
  object_id: '{{$labels.instance}}'
  severity: critical
annotations:
  summary: OpenStack Loadbalancing (Octavia) health manager service is down on host {{$labels.instance}}.

OpenStack Octavia Housekeeping service is down (0 active)

alert: OpenStack Octavia Housekeeping service is down
expr: (label_replace(openstack_loadbalancer_service_state{service="octavia_housekeeping"}, "nodename", "$1", "hostname", "(.*vstoragedomain).*") * on(nodename) group_left(node) node_uname_info{job="node"} * on(node) group_left() (backend_node_compute_controller == 1) * on(node) group_left(instance) up{job="node"} + on(node) group_left() max by(node) (softwareupdates_node_state{state=~"updating|rebooting"}) + scalar(backend_ha_reconfigure == bool 1)) == 0
for: 10m
labels:
  component: compute
  object_id: '{{$labels.instance}}'
  severity: critical
annotations:
  summary: OpenStack Loadbalancing (Octavia) housekeeping service is down on host {{$labels.instance}}.

OpenStack Octavia Provisioning Worker is down (0 active)

alert: OpenStack Octavia Provisioning Worker is down
expr: (label_replace(openstack_loadbalancer_service_state{service="octavia_worker"}, "nodename", "$1", "hostname", "(.*vstoragedomain).*") * on(nodename) group_left(node) node_uname_info{job="node"} * on(node) group_left() (backend_node_compute_controller == 1) * on(node) group_left(instance) up{job="node"} + on(node) group_left() max by(node) (softwareupdates_node_state{state=~"updating|rebooting"}) + scalar(backend_ha_reconfigure == bool 1)) == 0
for: 10m
labels:
  component: compute
  object_id: '{{$labels.instance}}'
  severity: critical
annotations:
  summary: OpenStack Loadbalancing (Octavia) provisioning worker is down on host {{$labels.instance}}.

OpenStack service API upstream is down (0 active)

alert: OpenStack service API upstream is down
expr: (sum by(cluster_id, service) (openstack_service_up{service!~"keystone.*|gnocchi*"}) > 0 < scalar(sum by(cluster_id) (backend_node_compute_controller == 1))) and on(cluster_id) (backend_ha_reconfigure == 0) and on(cluster_id) (backend_compute_reconfigure == 0) and on(cluster_id) (backend_compute_deploy == 0)
for: 10m
labels:
  component: compute
  object_id: '{{$labels.service}}'
  severity: warning
annotations:
  summary: One or more OpenStack {{$labels.service}} API upstreams are down.

/var/lib/prometheus/alerts/pcs.rules > PCS

Possible lack of allocatable space (6 active)

alert: Possible lack of allocatable space
expr: (cluster_space_ok_without_node == 0) * on(node) group_right() backend_node_online
labels:
  component: storage
  object_id: '{{ $labels.node }}'
  severity: warning
annotations:
  summary: Losing node {{ $labels.hostname }} will lead to the lack of allocatable space or failure domains in the storage cluster. Add more storage disks or nodes to the cluster, depending on your failure domain configuration.

Labels	State	Active Since	Value
alertname="Possible lack of allocatable space" cluster_id="1" component="storage" hostname="mbfn3-cp01.vstoragedomain" instance="backend-api.svc" job="backend" node="af89cb54-f750-2f58-9267-1d004773625a" object_id="af89cb54-f750-2f58-9267-1d004773625a" severity="warning"	firing	2025-11-14 08:01:57.787488231 +0000 UTC	0
Annotations
summary Losing node mbfn3-cp01.vstoragedomain will lead to the lack of allocatable space or failure domains in the storage cluster. Add more storage disks or nodes to the cluster, depending on your failure domain configuration.
alertname="Possible lack of allocatable space" cluster_id="1" component="storage" hostname="mbfn3-cp02.vstoragedomain" instance="backend-api.svc" job="backend" node="1f578260-8804-bdef-1a71-b3f85f76038d" object_id="1f578260-8804-bdef-1a71-b3f85f76038d" severity="warning"	firing	2025-11-14 08:01:57.787488231 +0000 UTC	0
Annotations
summary Losing node mbfn3-cp02.vstoragedomain will lead to the lack of allocatable space or failure domains in the storage cluster. Add more storage disks or nodes to the cluster, depending on your failure domain configuration.
alertname="Possible lack of allocatable space" cluster_id="1" component="storage" hostname="mbfn3-cp03.vstoragedomain" instance="backend-api.svc" job="backend" node="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" object_id="9d776fe4-dac5-ea49-b5d4-fa73c5b2d9b4" severity="warning"	firing	2025-11-14 08:01:57.787488231 +0000 UTC	0
Annotations
summary Losing node mbfn3-cp03.vstoragedomain will lead to the lack of allocatable space or failure domains in the storage cluster. Add more storage disks or nodes to the cluster, depending on your failure domain configuration.
alertname="Possible lack of allocatable space" cluster_id="1" component="storage" hostname="mbfn3-s3stg01.vstoragedomain" instance="backend-api.svc" job="backend" node="028df1f9-05b4-d9c8-6cf1-17e418bf9bf4" object_id="028df1f9-05b4-d9c8-6cf1-17e418bf9bf4" severity="warning"	firing	2025-11-14 08:01:57.787488231 +0000 UTC	0
Annotations
summary Losing node mbfn3-s3stg01.vstoragedomain will lead to the lack of allocatable space or failure domains in the storage cluster. Add more storage disks or nodes to the cluster, depending on your failure domain configuration.
alertname="Possible lack of allocatable space" cluster_id="1" component="storage" hostname="mbfn3-s3stg02.vstoragedomain" instance="backend-api.svc" job="backend" node="88cff5b6-d365-3356-d662-766ae503e58a" object_id="88cff5b6-d365-3356-d662-766ae503e58a" severity="warning"	firing	2025-11-14 08:01:57.787488231 +0000 UTC	0
Annotations
summary Losing node mbfn3-s3stg02.vstoragedomain will lead to the lack of allocatable space or failure domains in the storage cluster. Add more storage disks or nodes to the cluster, depending on your failure domain configuration.
alertname="Possible lack of allocatable space" cluster_id="1" component="storage" hostname="mbfn3-s3stg03.vstoragedomain" instance="backend-api.svc" job="backend" node="d56f49c6-e411-4d5d-0911-f9e5d5923b49" object_id="d56f49c6-e411-4d5d-0911-f9e5d5923b49" severity="warning"	firing	2025-11-14 08:01:57.787488231 +0000 UTC	0
Annotations
summary Losing node mbfn3-s3stg03.vstoragedomain will lead to the lack of allocatable space or failure domains in the storage cluster. Add more storage disks or nodes to the cluster, depending on your failure domain configuration.

CS has excessive journal size (0 active)

alert: CS has excessive journal size
expr: cluster_csd_journal_size{journal_type="inner_cache"} * on(csid) group_left(instance, device) cluster_csd_disk_info > 512
for: 10m
labels:
  component: storage
  object_id: '{{ $labels.csid }}'
  severity: warning
  value: '{{ $value }}'
annotations:
  summary: The journal on CS#{{ $labels.csid }} on host {{ $labels.instance }}, disk {{ $labels.device }}, is {{ $value }} MiB. The recommended size is 256 MiB.

CS has inconsistent encryption settings (0 active)

alert: CS has inconsistent encryption settings
expr: count by(tier) (count by(tier, encryption) (cluster_csd_journal_size * on(csid) group_left(instance) cluster_csd_disk_info * on(csid) group_left(tier) cluster_csd_info)) > 1
for: 10m
labels:
  component: storage
  severity: warning
annotations:
  summary: Encryption is disabled for some CSs in tier {{ $labels.tier }} but enabled for others on the same tier.

CS journal device shared across multiple tiers (0 active)

alert: CS journal device shared across multiple tiers
expr: count by(instance, device) (count by(instance, device, tier) (cluster_csd_journal_size{journal_type="external_cache"} * on(csid) group_left(tier) (cluster_csd_info) * on(node) group_left(instance) (up{job="node"}))) >= 2
for: 10m
labels:
  component: storage
  object_id: '{{ $labels.instance }}-{{ $labels.device }}'
  severity: warning
annotations:
  summary: CSes from multiple tiers are using the same journal '{{ $labels.device }}' on '{{ $labels.instance }}' node.

CS missing journal configuration (0 active)

alert: CS missing journal configuration
expr: cluster_csd_disk_info unless on(csid) cluster_csd_journal_size and on() count(up{job="mds"}) > 0
for: 10m
labels:
  component: storage
  object_id: '{{ $labels.csid }}'
  severity: warning
annotations:
  summary: The journal is not configured for CS#{{ $labels.csid }} on node {{ $labels.instance }}.

Cluster has blocked or slow replication (0 active)

alert: Cluster has blocked or slow replication
expr: increase(mdsd_cluster_replication_stuck_chunks[5m]) > 0 or increase(mdsd_cluster_replication_touts_total[5m]) > 0
for: 1m
labels:
  component: core storage
  severity: critical
annotations:
  summary: Chunk replication is blocked or too slow.

Cluster has critically high number of chunks (0 active)

alert: Cluster has critically high number of chunks
expr: job:mdsd_fs_chunk_maps:sum >= 1.5e+07
for: 1m
labels:
  component: core storage
  severity: critical
annotations:
  summary: There are too many chunks in the cluster, which slows down the metadata service.

Cluster has critically high number of files (0 active)

alert: Cluster has critically high number of files
expr: job:mdsd_fs_files:sum >= 1e+07
for: 1m
labels:
  component: core storage
  severity: critical
annotations:
  summary: There are too many files in the cluster, which slows down the metadata service.

Cluster has failed chunk services (0 active)

alert: Cluster has failed chunk services
expr: sum by(csid) (mdsd_cs_status{status=~"failed|failed rel"}) * on(csid) group_right() label_replace(cluster_csd_disk_info, "object_id", "$1", "csid", "(.*)") > 0
for: 5m
labels:
  component: core storage
  object_id: '{{ $labels.csid }}'
  severity: warning
annotations:
  summary: 'Chunk service #{{ $labels.csid }} is in the 'failed' state on node {{ $labels.instance }}. Replace the disk or contact the technical support.'

Cluster has failed mount points (0 active)

alert: Cluster has failed mount points
expr: job:up_not_being_updated:count{job="fused"} - job:up_not_being_updated_with_restart:count{job="fused"} > 0
for: 1m
labels:
  component: core storage
  severity: critical
annotations:
  summary: Some mount points stopped working and need to be recovered.

Cluster has offline chunk services (0 active)

alert: Cluster has offline chunk services
expr: sum by(csid) (mdsd_cs_status{status="offline"}) * on(csid) group_right() label_replace(cluster_csd_disk_info, "object_id", "$1", "csid", "(.*)") > 0
for: 5m
labels:
  component: core storage
  object_id: '{{ $labels.csid }}'
  severity: warning
annotations:
  summary: 'Chunk service #{{ $labels.csid }} is in 'offline' state on {{ $labels.instance }} node. Check and restart it.'

Cluster has too many chunks (0 active)

alert: Cluster has too many chunks
expr: (job:mdsd_fs_chunk_maps:sum > 1e+07) < 1.5e+07
for: 1m
labels:
  component: core storage
  severity: warning
annotations:
  summary: There are too many chunks in the cluster, which slows down the metadata service.

Cluster has too many files (0 active)

alert: Cluster has too many files
expr: (job:mdsd_fs_files:sum > 4e+06) < 1e+07
for: 1m
labels:
  component: core storage
  severity: warning
annotations:
  summary: There are too many files in the cluster, which slows down the metadata service.

Cluster has unavailable metadata services (0 active)

alert: Cluster has unavailable metadata services
expr: up{job="mds"} unless on(mdsid) (job:up_with_restart{job="mds"} == 1 or job:up_with_restart{job="mds"} == bool 0 and on(node) (instance:being_updated))
for: 5m
labels:
  component: core storage
  object_id: '{{ $labels.mdsid }}'
  severity: warning
annotations:
  summary: 'Metadata service #{{ $labels.mdsid }} is offline or has failed on '{{ $labels.instance }}' node. Check and restart it.'

Cluster is out of physical space on tier (0 active)

alert: Cluster is out of physical space on tier
expr: label_replace(sum by(tier) (mdsd_cluster_free_space_bytes) / sum by(tier) (mdsd_cluster_space_bytes), "object_id", "tier-$1", "tier", "(.*)") < 0.1
for: 5m
labels:
  component: core storage
  severity: critical
annotations:
  summary: There is not enough free physical space on storage tier {{ $labels.tier }}

Cluster is running out of physical space on tier (0 active)

alert: Cluster is running out of physical space on tier
expr: label_replace(sum by(tier) (mdsd_cluster_free_space_bytes) / sum by(tier) (mdsd_cluster_space_bytes), "object_id", "tier-$1", "tier", "(.*)") < 0.2
for: 5m
labels:
  component: core storage
  severity: warning
annotations:
  summary: There is little free physical space left on storage tier {{ $labels.tier }}

Core storage service is down (0 active)

alert: Core storage service is down
expr: label_replace(node_systemd_unit_state{name="vstorage-disks-monitor.service",state="active"}, "name", "$1", "name", "(.*)\\.service") != 1 and on(node) backend_node_master == 1 and on() backend_virtual_cluster == 0
for: 5m
labels:
  component: storage
  object_id: '{{ $labels.name }} - {{ $labels.instance }}'
  severity: warning
annotations:
  summary: Service {{ $labels.name }} is down on host {{ $labels.instance }}.

Disk cache settings are not optimal (0 active)

alert: Disk cache settings are not optimal
expr: count by(node, instance, tier) (cluster_csd_journal_size{journal_type="inner_cache"} * on(csid) group_left(tier) cluster_csd_info * on(node) group_left(instance) up{job="node"}) + count by(node, instance, tier) (cluster_csd_journal_size{journal_type="external_cache"} * on(csid) group_left(tier) cluster_csd_info * on(node) group_left(instance) up{job="node"}) >= 2
for: 10m
labels:
  component: storage
  object_id: '{{ $labels.node }}-{{ $labels.tier }}'
  severity: warning
annotations:
  summary: CSes are set up with different cache settings in the same tier '{{ $labels.tier }}' on '{{ $labels.instance }}' node

Master metadata service changes too often (0 active)

alert: Master metadata service changes too often
expr: topk(1, mdsd_is_master) and (delta(mdsd_master_uptime[1h]) < 300000) and on(node) softwareupdates_node_state{state!~"updat.*"} == 1
for: 10m
labels:
  component: core storage
  severity: warning
annotations:
  summary: Master metadata service has changed more than once in 5 minutes.

Metadata service has critically high commit latency (0 active)

alert: Metadata service has critically high commit latency
expr: histogram_quantile(0.95, instance_le:rjournal_commit_duration_seconds_bucket:rate5m{job="mds"}) >= 5
for: 1m
labels:
  component: core storage
  severity: critical
annotations:
  summary: Metadata service on {{$labels.instance}} has the 95th percentile latency higher than 5 seconds.

Metadata service has high CPU usage (0 active)

alert: Metadata service has high CPU usage
expr: (sum by(instance) (rate(process_cpu_seconds_total{job="mds"}[5m])) * 100) > 80
for: 1m
labels:
  component: core storage
  severity: warning
annotations:
  summary: Metadata service on {{$labels.instance}} has CPU usage higher than 80%. The service may be overloaded.

Metadata service has high commit latency (0 active)

alert: Metadata service has high commit latency
expr: 5 > histogram_quantile(0.95, instance_le:rjournal_commit_duration_seconds_bucket:rate5m{job="mds"}) > 1
for: 1m
labels:
  component: core storage
  severity: warning
annotations:
  summary: Metadata service on {{$labels.instance}} has the 95th percentile latency higher than 1 second.

Node has failed map requests (0 active)

alert: Node has failed map requests
expr: fused_maps_failed > 0 or rate(fused_map_failures_total[5m]) > 0
for: 1m
labels:
  component: core storage
  severity: critical
annotations:
  summary: Some map requests on {{$labels.instance}} have failed.

Node has stuck I/O requests (0 active)

alert: Node has stuck I/O requests
expr: fused_stuck_reqs_30s > 0 or fused_stuck_reqs_10s > 0
for: 1m
labels:
  component: core storage
  severity: critical
annotations:
  summary: Some I/O requests are stuck on {{$labels.instance}}.

Number of CSes per device does not match configuration (0 active)

alert: Number of CSes per device does not match configuration
expr: label_replace(backend_node_online == 1, "host", "$1", "hostname", "([^.]*).*") and on(node) (count by(device, instance, node, tier) (cluster_csd_info) - on(tier) group_left() (cluster_cs_per_tier_info)) != 0
for: 10m
labels:
  component: storage
  severity: warning
annotations:
  summary: Number of CSes per device on node {{$labels.host}} with id {{$labels.node}} does not match configuration. Check your disk configuration.

Reached "node crash per hour" threshold (0 active)

alert: Reached "node crash per hour" threshold
expr: shaman_node_crash_threshold == 1
for: 5m
labels:
  component: node
  severity: critical
annotations:
  summary: '{{- if query "backend_vendor_info{vendor='acronis'}" -}} Node {{$labels.hostname}} with shaman node id {{$labels.client_node}} has reached the "node crash per hour" threshold. Visit https://kb.acronis.com/content/68797 to learn how to troubleshoot this issue. {{- else if query "backend_vendor_info{vendor='virtuozzo'}" -}} Node {{$labels.hostname}} with shaman node id {{$labels.client_node}} has reached the "node crash per hour" threshold. {{- end -}}'

Storage disk is unresponsive (0 active)

alert: Storage disk is unresponsive
expr: sum by(csid) (mdsd_cs_status{status="ill"}) * on(csid) group_right() label_replace(cluster_csd_disk_info, "object_id", "$1", "csid", "(.*)") > 0
for: 1m
labels:
  component: core storage
  object_id: '{{ $labels.csid }}'
  severity: warning
annotations:
  summary: Disk '{{$labels.device}}' (CS#{{$labels.csid}}) on node {{$labels.instance}} is unresponsive. Check or replace this disk.

/var/lib/prometheus/alerts/postgres.rules > postgresql database size

PostgreSQL database size is greater than 30 GB (0 active)

alert: PostgreSQL database size is greater than 30 GB
expr: pg_database_size_bytes > 3e+10
for: 10m
labels:
  component: cluster
  object_id: postgresql_exporter
  severity: critical
annotations:
  summary: PostgreSQL database "{{$labels.datname}}" on node "{{$labels.instance}}" is greater than 30 GB in size. Verify that deleted entries are archived or contact the technical support.

PostgreSQL database uses more than 50% of node root partition (0 active)

alert: PostgreSQL database uses more than 50% of node root partition
expr: (sum by(node, instance) (pg_database_size_bytes) / min by(node, instance) (node_filesystem_size_bytes{job="node",mountpoint="/"})) * 100 > 50
for: 10m
labels:
  component: cluster
  object_id: postgresql_exporter
  severity: warning
annotations:
  summary: PostgreSQL databases on node "{{$labels.instance}}" with ID "{{$labels.node}}" use more than 50% of node root partition. Verify that deleted entries are archived or contact the technical support.

/var/lib/prometheus/alerts/rabbitmq.rules > rabbitmq_alerts

RabbitMQ node is down (0 active)

alert: RabbitMQ node is down
expr: backend_compute_deploy == 0 and on(job) backend_compute_reconfigure == 0 and on(job) backend_ha_reconfigure == 0 and on(job) (backend_node_compute_controller == 1 unless on(node) rabbitmq_build_info and on(node) softwareupdates_node_state{state=~"updating.*"} == 0)
for: 10m
labels:
  component: compute
  severity: critical
annotations:
  summary: RabbitMQ cluster has less than 3 nodes running.

RabbitMQ split brain detected (0 active)

alert: RabbitMQ split brain detected
expr: count(count by(job) (rabbitmq_queues)) > 1
for: 10m
labels:
  component: compute
  severity: critical
annotations:
  summary: RabbitMQ cluster has experienced a split brain due to a network partition.

/var/lib/prometheus/alerts/s3.rules > S3

FSMDS service failed to start (0 active)

alert: FSMDS service failed to start
expr: increase(ostor_svc_start_failed_count_total{service="fs"}[5m]) > 10
for: 1m
labels:
  component: NFS
  severity: critical
annotations:
  summary: Object storage agent failed to start file service on {{$labels.instance}}.

NFS service failed to start (0 active)

alert: NFS service failed to start
expr: increase(ostor_svc_start_failed_count_total{service=~"os|ns|s3gw",storage_type="NFS"}[5m]) > 10
for: 1m
labels:
  component: NFS
  severity: critical
annotations:
  summary: Object storage agent failed to start {{$labels.job}}({{$labels.svc_id}}) on {{$labels.instance}}

NFS service has unavailable FS services (0 active)

alert: NFS service has unavailable FS services
expr: count by(instance) (up{job="fs"}) > sum by(instance) (up{job="fs"})
for: 1m
labels:
  component: NFS
  severity: warning
annotations:
  summary: Some File services are not running on {{$labels.instance}}. Check the service status in the command-line interface.

NFS service is experiencing many network problems (0 active)

alert: NFS service is experiencing many network problems
expr: instance_vol_svc:rpc_errors_total:rate5m{job=~"fs|os",vol_id=~"02.*"} > (10 / (5 * 60))
for: 2m
labels:
  component: NFS
  object_id: '{{$labels.svc_id}}-{{$labels.instance}}'
  severity: critical
annotations:
  summary: NFS service ({{$labels.job}}, {{$labels.svc_id}}) on {{$labels.instance}} has many RPC errors. Check your network configuration.

NFS service is experiencing some network problems (0 active)

alert: NFS service is experiencing some network problems
expr: instance_vol_svc:rpc_errors_total:rate5m{job=~"fs|os",vol_id=~"02.*"} > (5 / (5 * 60)) and instance_vol_svc:rpc_errors_total:rate5m{job=~"fs|os",vol_id=~"02.*"} <= (10 / (5 * 60))
for: 2m
labels:
  component: NFS
  object_id: '{{$labels.svc_id}}-{{$labels.instance}}'
  severity: warning
annotations:
  summary: NFS service ({{$labels.job}}, {{$labels.svc_id}}) on {{$labels.instance}} has some RPC errors. Check your network configuration.

Name service has critically high commit latency (0 active)

alert: Name service has critically high commit latency
expr: histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc:ostor_commit_latency_us_bucket:rate5m{job="ns"})) >= 1e+07
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: Name service ({{$labels.svc_id}}) on {{$labels.instance}} has the median commit latency higher than 10 seconds. Check the storage performance.

Name service has critically high request latency (0 active)

alert: Name service has critically high request latency
expr: histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc_req:ostor_ns_req_latency_ms_bucket:rate5m)) >= 5000
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: Name service ({{$labels.svc_id}}) on {{$labels.instance}} has the median request latency higher than 5 seconds.

Name service has high commit latency (0 active)

alert: Name service has high commit latency
expr: (histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc:ostor_commit_latency_us_bucket:rate5m{job="ns"})) > 1e+06) < 1e+07
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: Name service ({{$labels.svc_id}}) on {{$labels.instance}} has the median commit latency higher than 1 second. Check the storage performance.

Name service has high request latency (0 active)

alert: Name service has high request latency
expr: histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc_req:ostor_ns_req_latency_ms_bucket:rate5m)) > 1000
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: Name service ({{$labels.svc_id}}) on {{$labels.instance}} has the median request latency higher than 1 second.

Object service has critically high commit latency (0 active)

alert: Object service has critically high commit latency
expr: histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc:ostor_commit_latency_us_bucket:rate5m{job="os"})) >= 1e+07
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: Object service ({{$labels.svc_id}}) on {{$labels.instance}} has the median commit latency higher than 10 seconds. Check the storage performance.

Object service has critically high request latency (0 active)

alert: Object service has critically high request latency
expr: histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc_req:ostor_os_req_latency_ms_bucket:rate5m)) >= 5000
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: Object service ({{$labels.svc_id}}) on {{$labels.instance}} has the median request latency higher than 5 seconds.

Object service has high commit latency (0 active)

alert: Object service has high commit latency
expr: (histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc:ostor_commit_latency_us_bucket:rate5m{job="os"})) > 1e+06) < 1e+07
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: Object service ({{$labels.svc_id}}) on {{$labels.instance}} has the median commit latency higher than 1 second. Check the storage performance.

Object service has high request latency (0 active)

alert: Object service has high request latency
expr: (histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc_req:ostor_os_req_latency_ms_bucket:rate5m)) > 1000) < 5000
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: Object service ({{$labels.svc_id}}) on {{$labels.instance}} has the median request latency higher than 1 second.

Object storage account control service is offline (0 active)

alert: Object storage account control service is offline
expr: up{job="acc"} == 0
for: 5m
labels:
  component: S3
  severity: critical
annotations:
  summary: Object storage account control service is down on host {{$labels.instance}}.

Object storage agent is frozen for a long time (0 active)

alert: Object storage agent is frozen for a long time
expr: increase(pcs_process_inactive_seconds_total{job="ostor"}[5m]) > 0
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: Object storage agent on {{$labels.instance}} has the event loop inactive for more than 1 minute.

Object storage agent is not connected to configuration service (0 active)

alert: Object storage agent is not connected to configuration service
expr: increase(ostor_svc_registry_cfg_failed_total[5m]) > 3 and on(node) (instance:not_being_updated)
for: 5m
labels:
  component: S3
  severity: critical
annotations:
  summary: Object storage agent failed to connect to the configuration service on {{$labels.instance}}.

Object storage agent is offline (0 active)

alert: Object storage agent is offline
expr: up{job="ostor"} == 0
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: Object storage agent is offline on {{$labels.instance}}.

S3 Gateway service has critically high CPU usage (0 active)

alert: S3 Gateway service has critically high CPU usage
expr: (sum by(instance, svc_id) (rate(process_cpu_seconds_total{job="s3gw"}[5m])) * 100) >= 90
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: S3 Gateway service ({{$labels.svc_id}}) on {{$labels.instance}} has CPU usage higher than 90%. The service may be overloaded.

S3 Gateway service has critically high GET request latency (0 active)

alert: S3 Gateway service has critically high GET request latency
expr: histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc:ostor_s3gw_get_req_latency_ms_bucket:rate5m)) >= 5000
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: S3 Gateway service ({{$labels.svc_id}}) on {{$labels.instance}} has the median GET request latency higher than 5 seconds.

S3 Gateway service has critically high cancel request rate (0 active)

alert: S3 Gateway service has critically high cancel request rate
expr: (sum by(svc_id, instance) (instance_vol_svc:ostor_s3gw_req:rate5m)) > 1 and ((sum by(svc_id, instance) (instance_vol_svc:ostor_s3gw_req_cancelled:rate5m)) / (sum by(svc_id, instance) (instance_vol_svc:ostor_s3gw_req:rate5m))) * 100 >= 30 and (sum by(svc_id, instance) (instance_vol_svc:ostor_s3gw_req:rate5m)) > (30 / 300)
for: 3m
labels:
  component: S3
  severity: critical
annotations:
  summary: S3 Gateway service ({{$labels.svc_id}}) on {{$labels.instance}} has the cancel request rate higher than 30%. It may be caused by connectivity issues, requests timeouts, or a small limit for pending requests.

S3 Gateway service has high CPU usage (0 active)

alert: S3 Gateway service has high CPU usage
expr: ((sum by(instance, svc_id) (rate(process_cpu_seconds_total{job="s3gw"}[5m])) * 100) > 75) < 90
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: S3 Gateway service ({{$labels.svc_id}}) on {{$labels.instance}} has CPU usage higher than 75%. The service may be overloaded.

S3 Gateway service has high GET request latency (0 active)

alert: S3 Gateway service has high GET request latency
expr: (histogram_quantile(0.5, sum by(instance, svc_id, le) (instance_vol_svc:ostor_s3gw_get_req_latency_ms_bucket:rate5m)) > 1000) < 5000
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: S3 Gateway service ({{$labels.svc_id}}) on {{$labels.instance}} has the median GET request latency higher than 1 second.

S3 Gateway service has high cancel request rate (0 active)

alert: S3 Gateway service has high cancel request rate
expr: 30 > (sum by(svc_id, instance) (instance_vol_svc:ostor_s3gw_req:rate5m)) > 1 and ((sum by(svc_id, instance) (instance_vol_svc:ostor_s3gw_req_cancelled:rate5m)) / (sum by(svc_id, instance) (instance_vol_svc:ostor_s3gw_req:rate5m))) * 100 > 5 and (sum by(svc_id, instance) (instance_vol_svc:ostor_s3gw_req:rate5m)) > (30 / 300)
for: 3m
labels:
  component: S3
  severity: warning
annotations:
  summary: S3 Gateway service ({{$labels.svc_id}}) on {{$labels.instance}} has the cancel request rate higher than 5%. It may be caused by connectivity issues, requests timeouts, or a small limit for pending requests.

S3 Gateway service has too many failed requests (0 active)

alert: S3 Gateway service has too many failed requests
expr: ((sum by(instance, svc_id) (instance_vol_svc:ostor_req_server_err:rate5m)) / (sum by(instance, svc_id) (instance_vol_svc:ostor_s3gw_req:rate5m))) * 100 > 5
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: S3 Gateway service ({{$labels.svc_id}}) on {{$labels.instance}} has a lot of failed requests with a server error (5XX status code).

S3 NDS service has critically high notification processing error rate (0 active)

alert: S3 NDS service has critically high notification processing error rate
expr: ((sum by(svc_id, instance) (instance_vol_svc:ostor_nds_error_total:rate5m)) / (sum by(svc_id, instance) (instance_vol_svc:ostor_nds_total:rate5m))) * 100 >= 15
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: S3 NDS service ({{$labels.svc_id}}) on {{$labels.instance}} has the notification processing error rate higher than 15%. It may be caused by connectivity issues, requests timeouts, or an S3 topics misconfiguration.

S3 NDS service has high notification deletion error rate (0 active)

alert: S3 NDS service has high notification deletion error rate
expr: ((sum by(svc_id, instance) (instance_vol_svc:ostor_nds_delete_error_total:rate5m)) / (sum by(svc_id, instance) (instance_vol_svc:ostor_nds_total:rate5m))) * 100 > 5
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: S3 NDS service ({{$labels.svc_id}}) on {{$labels.instance}} has the notification deletion error rate higher than 5%. It may be caused by a storage misconfiguration, storage performance degradation, or other storage issues.

S3 NDS service has high notification processing error rate (0 active)

alert: S3 NDS service has high notification processing error rate
expr: 15 > ((sum by(svc_id, instance) (instance_vol_svc:ostor_nds_error_total:rate5m)) / (sum by(svc_id, instance) (instance_vol_svc:ostor_nds_total:rate5m))) * 100 >= 5
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: S3 NDS service ({{$labels.svc_id}}) on {{$labels.instance}} has the notification processing error rate higher than 5%. It may be caused by connectivity issues, requests timeouts, or an S3 topics misconfiguration.

S3 NDS service has high notification repetition rate (0 active)

alert: S3 NDS service has high notification repetition rate
expr: ((sum by(svc_id, instance) (instance_vol_svc:ostor_nds_repeat_total:rate5m)) / (sum by(svc_id, instance) (instance_vol_svc:ostor_nds_total:rate5m))) * 100 > 5
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: S3 NDS service ({{$labels.svc_id}}) on {{$labels.instance}} has the notification repetition rate higher than 5%. It may be caused by a storage misconfiguration or other storage issues.

S3 NDS service has too many messages in simultaneous processing (0 active)

alert: S3 NDS service has too many messages in simultaneous processing
expr: nds_endpoint_process_count > 1000
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: S3 NDS service ({{$labels.svc_id}}) on {{$labels.instance}} has a lot of notifications in simultaneous processing on the endpoint. It may be caused by connectivity issues or an S3 topics misconfiguration.

S3 NDS service has too many staged unprocessed notifications (0 active)

alert: S3 NDS service has too many staged unprocessed notifications
expr: nds_staged_messages_count > 1000
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: S3 NDS service ({{$labels.svc_id}}) on {{$labels.instance}} has a lot of unprocessed notifications staged on the storage. It may be caused by connectivity or storage issues.

S3 cluster has too many open file descriptors (0 active)

alert: S3 cluster has too many open file descriptors
expr: (sum by(instance) (process_open_fds{job=~"gr|acc|s3gw|ns|os|ostor"})) > 9000
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: There are more than 9000 open file descriptors on {{$labels.instance}}. Please contact the technical support.

S3 cluster has unavailable Geo-replication services (0 active)

alert: S3 cluster has unavailable Geo-replication services
expr: count by(instance) (up{job="gr"}) > sum by(instance) (up{job="gr"} == 1 or (up{job="gr"} == bool 0 and on(instance) (instance:being_updated)))
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: Some Geo-replication services are not running on {{$labels.instance}}. Check the service status in the command-line interface.

S3 cluster has unavailable S3 Gateway services (0 active)

alert: S3 cluster has unavailable S3 Gateway services
expr: count by(instance) (up{job="s3gw"}) > sum by(instance) (up{job="s3gw"})
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: Some S3 Gateway services are not running on {{$labels.instance}}. Check the service status in the command-line interface.

S3 cluster has unavailable name services (0 active)

alert: S3 cluster has unavailable name services
expr: count by(instance) (up{job="ns"}) > sum by(instance) (up{job="ns"} == 1 or (up{job="ns"} == bool 0 and on(instance) (instance:being_updated)))
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: Some Name services are not running on {{$labels.instance}}. Check the service status in the command-line interface.

S3 cluster has unavailable object services (0 active)

alert: S3 cluster has unavailable object services
expr: count by(instance) (up{job="os"}) > sum by(instance) (up{job="os"} == 1 or (up{job="os"} == bool 0 and on(instance) (instance:being_updated)))
for: 1m
labels:
  component: S3
  severity: warning
annotations:
  summary: Some Object services are not running on {{$labels.instance}}. Check the service status in the command-line interface.

S3 cluster misconfiguration (0 active)

alert: S3 cluster misconfiguration
expr: count(up{job="ostor"}) > 1 and count(ostor_svc_registry_cfg_failed_total) < 2
labels:
  component: S3
  severity: error
annotations:
  summary: |
    {{ if query "backend_vendor_info{vendor='acronis'}" }} The S3 cluster configuration is not highly available. If one S3 node fails, the entire S3 cluster may become non-operational. To ensure high availability, update the S3 cluster configuration, as described in the Knowledge Base at https://kb.acronis.com/node/64033 {{ else if query "backend_vendor_info{vendor='virtuozzo'}" }} The S3 cluster configuration is not highly available. If one S3 node fails, the entire S3 cluster may become non-operational. To ensure high availability, update the S3 cluster configuration, as described in the Knowledge Base at https://support.virtuozzo.com/hc/en-us/articles/27536517316753-Virtuozzo-Hybrid-Infrastructure-Alert-S3-cluster-misconfiguration {{ end }}

S3 node is in the automatic maintenance mode (0 active)

alert: S3 node is in the automatic maintenance mode
expr: auto_maintenance_status > 0
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: '{{- if query "backend_vendor_info{vendor='acronis'}" -}} S3 services have been evacuated from {{$labels.instance}} because of too many failed S3 requests. Check the service logs. Visit https://kb.acronis.com/content/72408 to learn how to troubleshoot this issue. {{- else if query "backend_vendor_info{vendor='virtuozzo'}" -}} S3 services have been evacuated from {{$labels.instance}} because of too many failed S3 requests. Check the service logs. {{- end -}}'

S3 redundancy warning (0 active)

alert: S3 redundancy warning
expr: storage_redundancy_threshold{failure_domain="disk",type="s3"} > 0 and storage_redundancy_threshold{failure_domain="disk",type="s3"} <= scalar(count(backend_node_master))
for: 10m
labels:
  component: S3
  severity: warning
annotations:
  summary: |
    S3 is set to failure domain "disk" even though there are enough available nodes. It is recommended to set the failure domain to "host" so that S3 can survive host failures in addition to disk failures.

S3 service failed to start (0 active)

alert: S3 service failed to start
expr: increase(ostor_svc_start_failed_count_total{service=~"os|ns|s3gw",storage_type="S3"}[5m]) > 10
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: Object storage agent failed to start {{$labels.job}}({{$labels.svc_id}}) on {{$labels.instance}}

S3 service is experiencing many network problems (0 active)

alert: S3 service is experiencing many network problems
expr: instance_vol_svc:rpc_errors_total:rate5m{job=~"s3gw|os|ns",vol_id=~"01.*"} > (10 / (5 * 60))
for: 2m
labels:
  component: S3
  object_id: '{{$labels.svc_id}}-{{$labels.instance}}'
  severity: critical
annotations:
  summary: S3 service ({{$labels.job}}, {{$labels.svc_id}}) on {{$labels.instance}} has many RPC errors. Check your network configuration.

S3 service is experiencing some network problems (0 active)

alert: S3 service is experiencing some network problems
expr: instance_vol_svc:rpc_errors_total:rate5m{job=~"s3gw|os|ns",vol_id=~"01.*"} > (5 / (5 * 60)) and instance_vol_svc:rpc_errors_total:rate5m{job=~"s3gw|os|ns",vol_id=~"01.*"} <= (10 / (5 * 60))
for: 2m
labels:
  component: S3
  object_id: '{{$labels.svc_id}}-{{$labels.instance}}'
  severity: warning
annotations:
  summary: S3 service ({{$labels.job}}, {{$labels.svc_id}}) on {{$labels.instance}} has some RPC errors. Check your network configuration.

S3 service is frozen for a long time (0 active)

alert: S3 service is frozen for a long time
expr: increase(pcs_process_inactive_seconds_total{job=~"s3gw|os|ns"}[5m]) > 0
for: 1m
labels:
  component: S3
  severity: critical
annotations:
  summary: S3 service ({{$labels.job}}, {{$labels.svc_id}}) on {{$labels.instance}} has the event loop inactive for more than 1 minute.