Link Search Menu Expand Document

Using the Prometheus metrics you may want to setup alerts on your buildfarm cluster. Below are some example alerts:

High Cpu Queue avg(queue_size{job="your_instance", queue_name="cpu_queue"}) Critical = Triggered when the value is greater than 1000 for 60m Warning = Triggered when the value is greater than 1000 for 30m

High Cluster Utilization avg(execution_slot_usage{job="your_instance", service="buildfarm-worker"}) Critical = Triggered when the value is greater than 85 for 60m Warning = Triggered when the value is greater than 85 for 30m

Multiple Bazel Versions Detected count by (tool_name) (count(dispatched_operations_tools_amount{job="your_instance", service="buildfarm-server"}) without(host_ip, instance)) Warning = Triggered when the value is greater than 1 for 1m

No available workers avg(worker_pool_size{job="your_instance", service="buildfarm-server"}) Critical = Triggered when the value is equal to 0 for 5m Warning = Triggered when the value is equal to 0 for 2m

Stuck CPU Workers

(avg(delta(execution_slot_usage{job="your_instance", service="buildfarm-worker"}[5m])) by (host_ip) == 0) and (avg(execution_slot_usage{job="your_instance", service="buildfarm-worker"}) by (host_ip) > 0) and (avg(execution_slot_usage{job="your_instance", service="buildfarm-worker"}) by (host_ip) < 89)

Critical = Triggered when the value is greater than or equal to 0 for 30m

Key Expirations - Builds May Be Slow

avg(increase(expired_key_total{job="your_instance", service="buildfarm-worker"}[5m])) by (host_ip)

Warning = Triggered when the value is greater than 10000 for 1h

High Prequeue pre_queue_size{job="your_instance", service="buildfarm-server"} Critical = Triggered when the value is greater than 0 for 15m Warning = Triggered when the value is greater than 0 for 5m