About Prometheus Alerting

Let's talk about alerting, something everyone working in SRE and Platform Engineering needs to know. To have the necessary background, you should first read the earlier article on PromQL.

When we build systems and platforms, it is important to keep them available 24/7 as far as we can. However, 100% uptime is not realistic, so we set a more practical goal such as 99.5% uptime. We also need to be able to act quickly when unwanted incidents occur in the system. That is why we keep on-call engineers.

These on-call engineers and SREs (also called production engineers or service engineers, depending on the company, for example Meta or Apple) have to set up alerts so they know what is happening inside their platform or system. In practice alerts come from many kinds of monitoring systems, but in this article we will look at how to set up alerts from a metrics monitoring system like Prometheus.

Setting up an alert requires three parts.

1) We need to know that a metric has crossed a threshold.

2) As soon as that metric crosses the threshold, it has to be treated as an alert and a system has to be notified.

3) There has to be a system that receives the notifications.

For the first point, the Prometheus server already has a feature that continuously evaluates the PromQL expressions you write. For the second, there is Alertmanager, released by the Prometheus team. Because it is written by the same team that writes the Prometheus server, the integration is quite simple. For the third, you can choose whichever notification system you like, such as a pager, Slack, or mail.

First, let's download Alertmanager from the official page and run it. It runs on port 9093 by default, so if you open http://localhost:9093 you will see Alertmanager's Web UI.

wget https://github.com/prometheus/alertmanager/releases/download/v0.28.1/alertmanager-0.28.1.linux-amd64.tar.gz

tar xvzf alertmanager-0.28.1.linux-amd64.tar.gz

cd alertmanager-0.28.1.linux-amd64/

./alertmanager

Next, we will install Node Exporter. Prometheus is pull-based by design, so our applications have to be written to serve metrics whenever the server comes to scrape them. This is called instrumentation. However, when a third-party system is not instrumented at all, or its instrumentation is not in the Prometheus format, we need an adapter. The Linux OS is a well-known example. If we want metrics from the Linux OS, we need a piece of proxy software that reads stats from the OS and exposes them in the Prometheus format. For that proxy software we can use Node Exporter, released by the Prometheus team.
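
To make the adapter idea concrete, here is a minimal sketch of what an exporter does: read a stat from the OS and serve it in the Prometheus text format. It is not how Node Exporter itself is implemented. The metric name `toy_load1`, the port 9200, and the function names are assumptions made up for this illustration, and reading `/proc/loadavg` only works on Linux.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def read_load1(path="/proc/loadavg"):
    """Return the 1-minute load average from a Linux-style loadavg file."""
    with open(path) as f:
        return float(f.read().split()[0])


def render_metrics(load1):
    """Render one gauge in the Prometheus text exposition format."""
    return (
        "# HELP toy_load1 1-minute load average (hypothetical metric).\n"
        "# TYPE toy_load1 gauge\n"
        f"toy_load1 {load1}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(read_load1()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9200):
    """Serve /metrics until interrupted (port 9200 is an arbitrary choice)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the toy exporter. A real setup should use Node Exporter, which covers far more than a single load value.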

To install it, download the build that matches your operating system. The commands below fetch the Linux amd64 build.

wget https://github.com/prometheus/node_exporter/releases/download/v1.9.1/node_exporter-1.9.1.linux-amd64.tar.gz

tar xvzf node_exporter-1.9.1.linux-amd64.tar.gz

cd node_exporter-1.9.1.linux-amd64/

./node_exporter

Node Exporter runs on port 9100 by default, so if you open localhost:9100/metrics you will see a metrics exposition like this.

# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 64152.65
node_cpu_seconds_total{cpu="0",mode="iowait"} 78.66
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 17.75
node_cpu_seconds_total{cpu="0",mode="softirq"} 211.24
node_cpu_seconds_total{cpu="0",mode="steal"} 224.54
node_cpu_seconds_total{cpu="0",mode="system"} 2500.07
node_cpu_seconds_total{cpu="0",mode="user"} 5214.54
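
The exposition format above is plain text, so it is easy to read by hand or by script. The following sketch parses sample lines into a name, a label dictionary, and a value. It is a simplified parser written for this article, not the one Prometheus uses: it skips comment lines, and it does not handle timestamps or escaped quotes inside label values. The function name `parse_samples` is hypothetical.

```python
import re

SAMPLE = '''\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 64152.65
node_cpu_seconds_total{cpu="0",mode="user"} 5214.54
'''

# metric_name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = LINE_RE.match(line)
        if m is None:
            continue  # anything this simplified pattern cannot read
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples
```

For example, `parse_samples(SAMPLE)` returns two samples, one for `mode="idle"` and one for `mode="user"`.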

Next, we will install the Prometheus server. If you are not sure how to install Prometheus, go back and check the earlier installation guide. We will configure the Prometheus server like this.

global:
  scrape_interval: 15s
      
scrape_configs:
- job_name: "nodeexporter"
  static_configs:
  - targets:
    - "localhost:9100"

This configuration tells Prometheus to scrape Node Exporter. Once Prometheus is running with it, you can see Node Exporter under Target health in the Status menu, and if you query the up metric in the PromQL UI you will see something like this.

up{instance="localhost:9100", job="nodeexporter"}       1
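
The same check can be done from a script through the Prometheus HTTP API, whose instant-query endpoint is `/api/v1/query`. The sketch below builds the query URL and picks out the instances whose `up` value is 0 from a response. The base URL `http://localhost:9090` (the default Prometheus port) and the helper names are assumptions for this example.

```python
import json
import urllib.parse
import urllib.request


def query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def down_instances(response):
    """Return instances whose `up` sample is 0 in an instant-query response."""
    if response.get("status") != "success":
        raise ValueError("query failed")
    return [
        r["metric"].get("instance", "")
        for r in response["data"]["result"]
        if float(r["value"][1]) == 0  # value is [timestamp, "string value"]
    ]


def fetch(base_url, promql):
    """Run the query against a live Prometheus server and decode the JSON."""
    with urllib.request.urlopen(query_url(base_url, promql), timeout=5) as resp:
        return json.load(resp)
```

With Prometheus running, `down_instances(fetch("http://localhost:9090", 'up{job="nodeexporter"}'))` should return an empty list while Node Exporter is reachable.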

Next, let's connect Prometheus to Alertmanager. Change the Prometheus configuration file as follows. The alerting block tells Prometheus where Alertmanager is listening, and rule_files tells it that the alert rules live in a file called alerts.yaml.

global:
  scrape_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

rule_files:
- alerts.yaml
      
scrape_configs:
- job_name: "nodeexporter"
  static_configs:
  - targets:
    - "localhost:9100"

In alerts.yaml, create an alert like this. The idea is that if the up metric becomes 0 (in other words, the target becomes unreachable), Prometheus waits 30 seconds and raises the alert if nothing has changed by then.

groups:
- name: nodeexporter.rules
  rules:
  - alert: NodeExporterDown
    expr: up{job="nodeexporter"} == 0
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Node Exporter is down"
      description: "{{$labels.instance}} of job {{$labels.job}} has been down for more than 30 seconds."

Run Prometheus again and open it in the browser. Under Alerts you will find the alert we configured. For now the alert's status is INACTIVE. Once you stop Node Exporter, the up metric becomes 0 at the next scrape, and the alert moves to PENDING at the next rule evaluation that sees that value. This configuration does not set evaluation_interval, so Prometheus uses its default of one minute, and the condition has to hold for the 30 seconds given in for. In practice the status therefore changes from PENDING to FIRING at the following evaluation, about a minute later. At the same time, you will see that an alert has arrived in Alertmanager. That leaves only the final step: connecting a system that receives the notifications.
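
The INACTIVE, PENDING, FIRING sequence can be sketched as a small state machine. This is a simplified model for illustration, not Prometheus code: it assumes one value of `up` per rule evaluation and treats the alert as firing once the condition has held for at least the `for` duration. The function name `alert_states` is hypothetical.

```python
def alert_states(up_values, eval_interval, for_duration):
    """Return the alert state after each rule evaluation.

    up_values[i] is the value of `up` seen at evaluation i, and evaluations
    are eval_interval seconds apart. for_duration is the rule's `for` value
    in seconds.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, up in enumerate(up_values):
        now = i * eval_interval
        if up != 0:
            # Condition is false: the alert resets to inactive.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_duration else "pending")
    return states
```

With a 60-second evaluation interval and `for: 30s`, `alert_states([1, 0, 0, 0, 1], 60, 30)` goes inactive, pending, firing, firing, inactive. Shortening the interval to 15 seconds makes the alert stay pending for two evaluations before it fires.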

We will use Slack for notifications. Every company already has a platform it uses to reach the on-call engineer, so the idea is simply to integrate with whatever messaging software your company uses. I use Slack, so I will do it with Slack. On Slack's app management page, go to Your apps => Create New App and create an application. Then, under Incoming Webhooks for that application, create a webhook. When you create the webhook you have to pick a channel that will receive the messages. I will pick #alerts-prod.

Change alertmanager.yml as follows and restart Alertmanager. The webhook URL is sensitive information, so it is important to store it securely.

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'slack'

receivers:
- name: 'slack'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/ABC/XYZ/pqr'
    channel: '#alerts-prod'
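
Before relying on Alertmanager to deliver messages, it can help to confirm that the webhook itself works. Slack incoming webhooks accept a JSON body with a `text` field, so a quick check might look like the sketch below. To keep the URL out of the script, it is read from an environment variable; the name `SLACK_WEBHOOK_URL` and the helper names are assumptions made for this example.

```python
import json
import os
import urllib.request


def build_request(webhook_url, text):
    """Build the POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_test_message(text="Test message from the alerting setup"):
    """Send one message and return the HTTP status code."""
    # SLACK_WEBHOOK_URL is a hypothetical variable name chosen for this sketch.
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]
    with urllib.request.urlopen(build_request(webhook_url, text), timeout=5) as resp:
        return resp.status
```

After exporting the variable in your shell, calling `send_test_message()` should post one message to the channel you picked for the webhook.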

Restart Node Exporter. In Prometheus you will see the alert's status go back to INACTIVE. If you now stop Node Exporter again, a message will arrive in the Slack channel you picked once the status becomes FIRING. There is one thing I want to add here, and it is a set of questions: if Alertmanager itself stops like this, will all the earlier alerts be lost? Is there any guarantee that alerts reach the target platform? Do we need to put an enterprise-style message queue in front of it as a buffer?

Notification dispatcher services like Alertmanager come with built-in resiliency mechanisms so that alerts (notification jobs) are not lost. Alertmanager writes to disk both the record of alerts it has already sent and the muted alerts that must not be sent. For an HA setup, what it calls peer state can also be stored on disk. On top of that, it includes an in-memory queue and retry logic, so if an alert cannot be delivered to its destination, Alertmanager tries to send it again using a backoff strategy.
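
Retry with backoff is a general pattern, and a minimal version can be sketched as follows. This is an illustration of the idea only, not Alertmanager's implementation; the function name, the doubling factor, and the default delays are all assumptions.

```python
import time


def send_with_backoff(send, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call send() until it succeeds, doubling the wait after each failure.

    Returns the number of attempts used. If every attempt fails, the last
    error is re-raised. `sleep` is injectable so the logic can be exercised
    without actually waiting.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential growth, capped
```

Capping the delay keeps a long outage from pushing the next attempt arbitrarily far into the future.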

So, since the queue is in memory, will notifications be lost if the process crashes or the pod goes down? No, they will not. Prometheus keeps evaluating alert rules at the configured interval, so it keeps sending the alert to Alertmanager until the alert resolves. When the Alertmanager process comes back up, it receives the alert again.
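
A toy simulation shows why the repeated sends make up for lost in-memory state. The class and function below are hypothetical stand-ins, not real Prometheus or Alertmanager components, and the model deliberately ignores everything Alertmanager persists to disk.

```python
class ToyAlertmanager:
    """Holds received alerts in memory only, so a crash forgets them."""

    def __init__(self):
        self.alerts = set()

    def receive(self, alertname):
        self.alerts.add(alertname)

    def crash_and_restart(self):
        self.alerts.clear()


def run_cycles(am, firing, cycles, crash_at=None):
    """Re-send every firing alert on each cycle; optionally crash once.

    Returns what the toy Alertmanager knows at the end of each cycle.
    """
    seen = []
    for cycle in range(cycles):
        if cycle == crash_at:
            am.crash_and_restart()
            seen.append(sorted(am.alerts))  # state right after the restart
            continue
        for name in firing:
            am.receive(name)
        seen.append(sorted(am.alerts))
    return seen
```

Running `run_cycles(ToyAlertmanager(), ["NodeExporterDown"], 3, crash_at=1)` shows the alert present, then gone right after the crash, then back again on the next cycle.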

In the next part, we will look at how to run Prometheus and Alertmanager on a Kubernetes cluster using IaC.