Building a notification service

It is probably no exaggeration to say that notifications have become inseparable from today's application systems. At almost every company it is now routine to use at least something like Firebase and Mailgun to send push notifications and emails. Most of the time, though, the notification logic simply gets written into the service itself, which is a not-so-great shortcut. If only one or two app services in your system need to work with notifications, that kind of interface-based implementation is a perfectly reasonable solution. But once the services that send notifications number in the tens, the client services become temporally coupled to the device token management service and the user preferences service, and there are more moving parts to test. Around Christmas last year, we were facing exactly this problem at work and looking for a better answer.

At that time we had more than 200 microservices. That may sound like a lot, but it isn't. Monzo Bank, a UK bank, has around 2,000. Of those 200 microservices, the top ten or so each sent close to 30K notifications a day on average, and all the other, less notification-heavy services added up to about 15K. So the total was (30,000 * 10) + 15,000 = 315,000, meaning we were processing a little over three hundred thousand notifications a day. Platform throughput at the time was 200K/sec, so compared with a full day of traffic, about 17.28B events (200,000 * 86,400), this was a tiny amount, child's play really. Even so, we understood that it was time to fix three problems: roughly 16% of notifications were being dropped; there was no system-wide standard for reliability mechanisms such as retry and backpressure; and the app services' caches of user info and preferences were trading freshness against latency without anyone being able to tune them well.
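To make the scale comparison concrete, here is a small back-of-the-envelope calculation using only the figures quoted above. The variable names are mine, and the per-second average assumes traffic is spread evenly over the day, which real notification traffic is not.

```python
# Back-of-the-envelope check of the volumes quoted in the text.
HEAVY_SERVICES = 10             # top notification-heavy services
PER_HEAVY_SERVICE = 30_000      # notifications per day, each
OTHER_SERVICES_TOTAL = 15_000   # notifications per day, all other services combined

PLATFORM_EVENTS_PER_SEC = 200_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

notifications_per_day = HEAVY_SERVICES * PER_HEAVY_SERVICE + OTHER_SERVICES_TOTAL
platform_events_per_day = PLATFORM_EVENTS_PER_SEC * SECONDS_PER_DAY

# Share of daily platform traffic, and the average rate if spread evenly.
share = notifications_per_day / platform_events_per_day
avg_per_sec = notifications_per_day / SECONDS_PER_DAY

print(notifications_per_day)    # 315000
print(platform_events_per_day)  # 17280000000
print(f"{share:.4%}")           # 0.0018%
print(f"{avg_per_sec:.2f}")     # 3.65
```

In other words, the difficulty was never raw volume. It was reliability and consistency.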

Our main goal was to split notifications out into a standalone subsystem. There were three functional goals. Number one: the notification flow must be testable and auditable from a single place. Why did the email sent to user 123 at 9:30 a.m. on July 16 fail, how many times was it retried, was it a provider error? We should be able to trace questions like these from one place. Number two: notifications for any individual user must not be lost. They must not be delivered out of order either; they have to go out in sequence, with at-least-once delivery. Number three: for every notification type except email, if too many notifications are being sent to one user, that notification type must be throttled for that user on the server side.

The non-functional goals are straightforward. The subsystem has to be highly available, with 99.99% uptime. Otherwise it would be hard to convince feature owners to let go of the notification code inside their own apps. It also has to be performant: the latency budget for a notification call is a p99 of no more than 20-30 ms. If notifications become a bottleneck, upstream clients risk breaching their SLOs, and we had to consider the possibility of timeouts propagating through the whole service chain and setting off a retry storm.
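The third functional goal can be sketched as a token bucket kept per user and per notification type, with email exempt. This is only an illustration of the idea, not the mechanism we shipped: the class names, the capacity and refill rate, and the channel names "android" and "email" are all assumptions.

```python
import time
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float
    refill_per_sec: float
    tokens: float
    updated_at: float

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerUserThrottle:
    """One bucket per (user, channel); exempt channels are never throttled."""

    def __init__(self, capacity=5.0, refill_per_sec=1.0, exempt=("email",)):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.exempt = set(exempt)
        self._buckets = {}

    def allow(self, user_id, channel, now=None):
        if channel in self.exempt:
            return True
        now = time.monotonic() if now is None else now
        key = (user_id, channel)
        bucket = self._buckets.get(key)
        if bucket is None:
            # New buckets start full so the first burst is allowed.
            bucket = TokenBucket(self.capacity, self.refill_per_sec, self.capacity, now)
            self._buckets[key] = bucket
        return bucket.allow(now)


throttle = PerUserThrottle(capacity=2, refill_per_sec=1.0)
decisions = [throttle.allow("user-123", "android", now=t) for t in (0.0, 0.0, 0.0, 1.0)]
print(decisions)  # [True, True, False, True]
```

A burst of three pushes at the same instant lets two through and throttles the third; one second later the bucket has refilled enough for one more.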

In the first iteration, we used Go to start building a dummy notification service that did not yet contain any provider-specific implementations. We were able to define the gRPC contracts we needed properly. For per-user ordering, we got backend partitioning and consistent-hash load balancing working. We tried this first iteration in production, starting with the app services that carried the least risk. We set up Prometheus and alerting rules. We logged the incoming notification requests and built a rough service profile covering resource consumption, load and traffic metrics. One weakness, though, was that tail latency in the app services went up noticeably. Solving that would require batching notification requests and carefully tuning queueing and buffering. We could have written a library for that, but then we would have needed five client libraries: Java, Go, Rust, Python and JavaScript. On top of that, relying on sticky routing for per-user ordering was too fragile. When you think about it, keeping sticky routing in front of notification backends that hold no meaningful state is a design flaw. We can do better.
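For readers who have not used consistent-hash routing, the sketch below shows the general technique and why it is fragile as an ordering mechanism. It is a generic hash ring with virtual nodes, not our load balancer configuration; the backend names, the SHA-256 hash and the virtual node count are assumptions.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is salted per process.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each backend owns `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first point after the key, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


users = [f"user-{n}" for n in range(1000)]
before = HashRing(["backend-1", "backend-2", "backend-3"])
after = HashRing(["backend-1", "backend-2", "backend-3", "backend-4"])

moved = [u for u in users if before.node_for(u) != after.node_for(u)]
print(f"{len(moved)} of {len(users)} users changed backend after scaling out")
```

Only a fraction of users move when a backend is added, which is the point of consistent hashing. But every user who does move can have requests in flight on two backends at once, so ordering is only as stable as the backend set.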

So in the second iteration we decided to put a buffer between the client app services and the notification server. Most of our app services were already event-driven anyway, and we had a Kafka cluster on Confluent. So we chose to have every client app service append its notification request events to the log layer, partition by partition. Because most services were already using Kafka clients, and with management's backing, the non-critical services had adopted this iteration within two weeks. By reusing our existing producer practices we got idempotency, batching and so on out of the box, we cut the latency, and the sticky routing problem went away as well.
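The reason a partitioned log removes the need for sticky routing is that events with the same key land in the same partition, and a partition is an ordered sequence. The toy model below shows that property in memory. It assumes the request events are keyed by user ID, and it uses SHA-256 as a stand-in for the client's key partitioner (Kafka's Java client hashes keys with murmur2); the partition count and event names are made up.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 6  # assumed partition count for a hypothetical request topic


def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in for the client's key partitioner: same key, same partition.
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


class InMemoryLog:
    """Toy append-only log: one ordered list per partition."""

    def __init__(self, num_partitions: int = NUM_PARTITIONS):
        self.num_partitions = num_partitions
        self.partitions = defaultdict(list)

    def append(self, user_id: str, event: str):
        partition = partition_for(user_id, self.num_partitions)
        self.partitions[partition].append((user_id, event))
        offset = len(self.partitions[partition]) - 1
        return partition, offset


log = InMemoryLog()
for user_id, event in [
    ("user-123", "order_placed"),
    ("user-456", "password_changed"),
    ("user-123", "order_shipped"),
    ("user-123", "order_delivered"),
]:
    log.append(user_id, event)

p = partition_for("user-123")
events_for_123 = [e for uid, e in log.partitions[p] if uid == "user-123"]
print(events_for_123)  # ['order_placed', 'order_shipped', 'order_delivered']
```

Whichever consumer owns that partition sees user-123's events in the order they were appended, regardless of how many notification servers are running.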

In the third iteration we started adding the Android and Email implementations. One thing the logging from the earlier iterations had taught us was that the ratio of Android push notifications to emails was 23:1. The two differ in everything from reliability mechanisms such as retry logic to protocol handling and throttling, so we decided to split them out as separate mini services. That way each one can scale independently. Also, the app services used to call the user preferences service synchronously to read users' opt-out settings and filter notifications. Even with a read-through cache in place to cut down on those sync calls, user preferences had been a very popular service.

That is why, in the third iteration, we also planned for the notification service to directly consume preferences changed events from the user preferences service and info updated events from the users service, and to build a local Redis materialized view from them. Staleness then no longer depended on a cache invalidation timeout as before but on consumer lag, and our AWS ElastiCache cost went down by about 8%. Because notification filtering and enrichment could now be done in one place, internal traffic to the user preferences service dropped by nearly 70%. In this iteration, the notification server queries the Redis MV to process request events. It drops the notifications that should be dropped. For enrichment, it adds tokens, email addresses and any other channel metadata that is needed. It encrypts the message and fans it out again by notification type to the android and mail topics. The Android and Mail consumer groups then consume from those topics and deliver the notifications to the end devices.
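The flow described above (apply change events to a local view, then filter, enrich and fan out) can be sketched as follows. A plain dict stands in for Redis and lists stand in for topics; the event field names and the `UserView` and `process` helpers are hypothetical, and the encryption step is left out for brevity.

```python
from collections import defaultdict


class UserView:
    """Local materialized view built by applying change events in order."""

    def __init__(self):
        self._users = {}

    def apply(self, event: dict) -> None:
        user = self._users.setdefault(
            event["user_id"], {"opt_out": set(), "contact": {}}
        )
        if event["type"] == "preferences_changed":
            user["opt_out"] = set(event["opt_out"])
        elif event["type"] == "info_updated":
            user["contact"].update(event["contact"])

    def get(self, user_id: str):
        return self._users.get(user_id)


def process(request: dict, view: UserView, topics) -> "str | None":
    """Filter, enrich and fan out one request; returns the topic or None if dropped."""
    user = view.get(request["user_id"])
    channel = request["channel"]
    if user is None or channel in user["opt_out"]:
        return None  # unknown user or opted out: drop
    address = user["contact"].get(channel)
    if address is None:
        return None  # nothing to deliver to on this channel: drop
    topics[channel].append({**request, "address": address})
    return channel


view = UserView()
view.apply({"type": "info_updated", "user_id": "user-123",
            "contact": {"android": "device-token-abc", "mail": "u123@example.com"}})
view.apply({"type": "preferences_changed", "user_id": "user-123", "opt_out": ["mail"]})

topics = defaultdict(list)
r1 = process({"user_id": "user-123", "channel": "android", "body": "Your order shipped"}, view, topics)
r2 = process({"user_id": "user-123", "channel": "mail", "body": "Weekly digest"}, view, topics)
print(r1, r2)  # android None
```

The view is only as fresh as the last event applied, which is why staleness becomes a function of consumer lag rather than of a cache TTL.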

In the fourth iteration we added a centralized template service for email templating. We changed the retry mechanism from an in-memory approach to one that writes to dedicated topics. This let us enforce at-least-once semantics end to end within the server-side boundary. Two more consumer services came on board, using APNs for iOS and Twilio for SMS. That is the extensibility payoff of using fan-out. Whether you want to add more notification types or process one particular type in a different way, the pipeline gives you a great deal of flexibility. For anyone interested, I recommend reading the paper Data on the Outside versus Data on the Inside. We also wrote PromQL queries based on consumer lag for a custom metrics adapter. So when more Android push notifications need to be processed, we just look at the fan-out topic and scale out horizontally. There is no need to worry about the other notification types, and it is completely decoupled from ingestion. In the same iteration we also added mechanisms, per consumer group, to throttle and apply backpressure when notification bursts occur.
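Moving retries from memory to dedicated topics means a failed delivery survives a consumer crash, because the retry state lives in the log rather than in the process. Below is a minimal sketch of the routing decision only. The topic names, the three retry tiers and the dead-letter topic are assumptions for illustration, as is the `ProviderError` type; `send` and `publish` are injected so the example runs without a broker.

```python
class ProviderError(Exception):
    """Raised by a (hypothetical) provider client when delivery fails."""


RETRY_TOPICS = ["android.retry.1", "android.retry.2", "android.retry.3"]
DEAD_LETTER_TOPIC = "android.dead-letter"


def handle(message: dict, send, publish) -> str:
    """Try to deliver; on failure, republish to the next retry tier."""
    try:
        send(message["payload"])
        return "delivered"
    except ProviderError as err:
        attempt = message.get("attempt", 0)
        topic = RETRY_TOPICS[attempt] if attempt < len(RETRY_TOPICS) else DEAD_LETTER_TOPIC
        publish(topic, {**message, "attempt": attempt + 1, "last_error": str(err)})
        return topic


published = []


def failing_send(payload):
    raise ProviderError("503 from provider")


message = {"payload": "Your order shipped"}
route = []
for _ in range(4):
    route.append(handle(message, failing_send, lambda t, m: published.append((t, m))))
    message = published[-1][1]  # the retry consumer would pick this up later

print(route)
# ['android.retry.1', 'android.retry.2', 'android.retry.3', 'android.dead-letter']
```

In a real consumer the offset would be committed only after `handle` returns, so a crash in between leads to a redelivery rather than a loss. That is what at-least-once means in practice, and it also means duplicates are possible, so the delivery step should be idempotent.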

The project took four months from start to finish, but every design goal we wanted ended up checked off, and that is down to the engineering effort of SRE and all the feature teams. At this point the notification subsystem has become quite robust, but if you ask whether it is a universal solution, the answer is no. If your company's volume is in the thousands of notifications a day, and your requirements do not call for lossless delivery or ordering guarantees, then simpler options will be far more suitable and sensible: either keep a Notification interface and call notification.send(recipient, message), or write notification jobs to a queue such as RabbitMQ (over AMQP) or SQS and have a worker group consume them and take care of sending the notifications.
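For completeness, the simple option mentioned above might look like the sketch below: an interface with a single send method, plus an in-memory implementation for tests. Only the `notification.send(recipient, message)` shape comes from the text; `InMemoryNotification` and `confirm_order` are hypothetical names.

```python
from typing import Protocol


class Notification(Protocol):
    def send(self, recipient: str, message: str) -> None: ...


class InMemoryNotification:
    """Test double; a real implementation would call Firebase, Mailgun, etc."""

    def __init__(self):
        self.sent = []

    def send(self, recipient: str, message: str) -> None:
        self.sent.append((recipient, message))


def confirm_order(notification: Notification, recipient: str, order_id: str) -> None:
    # Business code depends only on the interface, not on a provider.
    notification.send(recipient, f"Order {order_id} confirmed")


notification = InMemoryNotification()
confirm_order(notification, "user-123", "A1")
print(notification.sent)  # [('user-123', 'Order A1 confirmed')]
```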