Target State · Scalable, Reliable, Available, Elastic, Secure
v3.0 · Proposal · 2026 Target State
Estimated additional cost: ~$400-470/month (~$5K/year)
Savings from optimization (retire Gateway 1, consolidate Aurora, right-size EC2, Reserved Instances): ~$300-500/month
Net impact (excluding Disaster Recovery): near-neutral, with significant improvements in Security, Observability, Reliability, and Developer Experience.
| AWS WAF | Web Application Firewall that sits in front of all gateways. Filters application-layer (HTTP) floods, SQL injection, XSS, and bot traffic before requests reach the services. Currently the gateways are exposed directly, which is a critical gap for any financial system. |
| Istio Service Mesh | Injects a sidecar proxy into every pod on k8s. Provides mutual TLS (mTLS) between all services, the basis for zero-trust networking. Currently services communicate in plaintext within the cluster, which leaves them vulnerable to man-in-the-middle attacks. |
| Secrets Manager | Centralizes all secrets (DB passwords, API keys, tokens) with automatic rotation. Currently secrets live in config files or environment variables — if source code leaks, all credentials are compromised. |
| IAM + RBAC | Fine-grained role-based access control for all AWS resources and k8s namespaces. Enforces the least-privilege principle: each service can access only what it needs. |
| Circuit Breaker | Circuit-breaker pattern (Resilience4j; Hystrix is in maintenance mode) on all cross-service calls, especially LynkiD Gateway → Cloud (via VPN). When a downstream service fails, the circuit opens automatically and returns a fallback response instead of letting the failure cascade. State (closed/open/half-open) is monitored via Prometheus metrics. Critical for the 4 VPN calls from LynkiD Gateway and for Exchange Partner API calls. |
Est. cost: ~$40-65/month (WAF $25-50, Secrets $15, Istio/IAM/Circuit Breaker are free — library-level or open-source)
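The closed/open/half-open behaviour described in the Circuit Breaker row can be sketched as a small state machine. This is a minimal illustration, not the Resilience4j implementation: the class name, the `failure_threshold` and `reset_timeout` parameters, and their default values are all assumptions chosen for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Minimal closed -> open -> half-open state machine (illustrative only)."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds before a trial call is allowed
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.state == "open":
            return fallback()  # fail fast without touching the downstream service
        try:
            result = func()    # normal call when closed, trial call when half-open
        except Exception:
            self._failures += 1
            if self.state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback()
        self._failures = 0
        self._opened_at = None  # a success closes the circuit
        return result
```

Wrapping a VPN call would then look like `breaker.call(lambda: fetch_from_cloud(), lambda: cached_default)`, where both callables are hypothetical. Exporting `breaker.state` as a gauge is one way the Prometheus monitoring mentioned above could be wired up.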
| CloudWatch Logs | Centralized log aggregation for all 30+ services. Currently debugging requires SSH into individual pods — takes 30-60 min per incident. With centralized logs, search across all services in seconds. |
| AWS X-Ray | Distributed tracing across services, especially critical for requests crossing the VPN tunnel (OnPrem↔Cloud). Currently this path is a complete black box — impossible to trace latency or failures. |
| Prometheus + Grafana | Metrics collection, dashboards, and alerting. Know when a service is degrading BEFORE users report it. Set SLA thresholds and get alerted automatically. |
| Health Check API | Centralized health endpoint for all services. Kubernetes liveness/readiness probes ensure unhealthy pods are automatically restarted or removed from load balancing. |
Est. cost: ~$105-155/month (CW Logs $30-50, X-Ray $25, Prometheus+Grafana $50-80 managed or $0 self-hosted)
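As a rough sketch of what a health endpoint might compute before a Kubernetes readiness probe reads it: the function below runs a set of dependency checks and maps the result to an HTTP status. The function name and the check names are hypothetical. Kubernetes HTTP probes treat any status from 200 to 399 as success, so returning 503 is enough to take a pod out of load balancing.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run each dependency check; return an HTTP status and a per-check report."""
    report: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            report[name] = "ok" if check() else "fail"
        except Exception as exc:  # a check that raises counts as a failure
            report[name] = f"error: {exc}"
    healthy = all(status == "ok" for status in report.values())
    return (200 if healthy else 503), report
```

A service would expose this result on a path such as `/healthz` (an assumed name) and point its readiness probe at it. Liveness checks are usually kept simpler so that a slow dependency does not trigger pod restarts.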
| DocumentDB | Replaces the single MongoDB pod. MongoDB currently runs as one pod on k8s; if that pod crashes, ALL notification history is lost and there is no backup. DocumentDB is managed, multi-AZ, backed up automatically, and compatible with the MongoDB API, so few code changes are expected (MongoDB features in use should first be checked against what DocumentDB supports). |
| ElastiCache | Replaces the Redis pod on Cloud. Today, when the Redis pod restarts, the entire cache is lost and a thundering herd of requests hits the database. ElastiCache provides multi-AZ replication with automatic failover, plus snapshots for backup, so a single node restart no longer empties the cache. |
Est. cost: ~$240/month (DocumentDB $140, ElastiCache $100). Replaces existing pod resources.
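The thundering-herd effect mentioned in the ElastiCache row can be illustrated with a deliberately simplified worst-case model: every request in a concurrent burst sees the cache as it was when the burst began, so each request for a missing key falls through to the database. The function name, key format, and request counts are assumptions for illustration only.

```python
from typing import Iterable, Set


def db_reads_for_burst(keys: Iterable[str], cached_keys: Set[str]) -> int:
    """Count requests in one concurrent burst that miss the cache and hit the database."""
    return sum(1 for key in keys if key not in cached_keys)


# 1,000 simultaneous requests spread over 50 hot keys (hypothetical workload)
burst = [f"member:{i % 50}" for i in range(1000)]

print(db_reads_for_burst(burst, set(burst)))  # warm cache: 0
print(db_reads_for_burst(burst, set()))       # cache just wiped by a restart: 1000
```

With a replicated cache, a node failure promotes a replica that already holds the hot keys, which corresponds to the warm case.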
| ArgoCD | GitOps for Kubernetes. Every deployment is a Git commit, which gives a full audit trail and easy rollback (just revert the commit). Currently deployments are manual or semi-automated, prone to human error, and hard to roll back. |
| Cloud Pipeline | AWS CodePipeline for Cloud services. Automated Build → Test → Deploy pipeline. Triggered on PR merge, runs tests, deploys to staging, then production with approval gate. |
| OnPrem Pipeline | Separate pipeline for OnPremise services (Jenkins/GitLab CI). The different deployment target (bare-metal k8s via VPN) requires a different pipeline, but with the same quality gates. |
| SonarQube | Static code analysis runs on every PR. Catches bugs, security vulnerabilities, and code smells BEFORE they reach production. The quality gate blocks the merge if thresholds are not met. |
Est. cost: ~$10-30/month (CodePipeline $10, ArgoCD & SonarQube CE are free/open-source)
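The quality-gate idea in the SonarQube row reduces to comparing a PR's metrics against thresholds and blocking the merge if any check fails. The sketch below is not SonarQube's API; the metric names and threshold values are hypothetical placeholders for whatever the team configures.

```python
from typing import Dict, List

# Hypothetical thresholds for illustration; not taken from the proposal.
THRESHOLDS: Dict[str, float] = {
    "min_coverage": 80.0,
    "max_new_bugs": 0,
    "max_new_vulnerabilities": 0,
}


def gate_failures(metrics: Dict[str, float],
                  thresholds: Dict[str, float] = THRESHOLDS) -> List[str]:
    """Return the reasons a PR fails the gate; an empty list means it may merge."""
    failures: List[str] = []
    if metrics["coverage"] < thresholds["min_coverage"]:
        failures.append("coverage below threshold")
    if metrics["new_bugs"] > thresholds["max_new_bugs"]:
        failures.append("new bugs introduced")
    if metrics["new_vulnerabilities"] > thresholds["max_new_vulnerabilities"]:
        failures.append("new vulnerabilities introduced")
    return failures
```

Applying the same function in both the Cloud and OnPrem pipelines is one way to keep the quality gates identical across the two deployment targets.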
DR for the AWS Cloud is rolled out first (2026); OnPrem DR follows. Strategy: Warm Standby, where the DR region runs at low capacity and scales up on failover.
| Route 53 DNS Failover | Health-check based DNS failover. When the primary region is unhealthy, Route 53 automatically shifts traffic to the DR region. RPO < 1 minute, RTO < 5 minutes. Cost: ~$1/health check/month. |
| Aurora Global Database | Cross-region replication with lag < 1 second. On failover, the DR replica is promoted to primary in < 1 minute. Applies to all Aurora schemas (operator_api, merchant_api, partner_integration, cms_api). Cost: ~$200-300/month for DR instances. |
| S3 Cross-Region Replication | Automatically replicates all S3 objects (images, Excel files, backups) to the DR region. Cost: ~$5-10/month (storage + transfer only). |
| EKS Standby Cluster | Warm standby EKS cluster in the DR region. ArgoCD syncs manifests automatically. Runs minimum replicas (1 pod/service). On failover, Karpenter auto-scales to full capacity. Cost: ~$150-250/month (minimal EC2 + EKS control plane $74). |
Est. DR cost: ~$400-600/month. OnPrem DR (Phase 2) will additionally need a secondary site or cloud-based DR, estimated separately.
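A back-of-the-envelope check of the RTO < 5 minutes target can be sketched as below. It assumes Route 53's default health-check settings (30-second request interval, failure threshold of 3), a 60-second DNS TTL (an assumption, not stated in the proposal), and the < 1 minute Aurora promotion quoted above. The steps are treated as strictly sequential, and Karpenter scale-up time is not included.

```python
def failover_time_seconds(request_interval: int = 30, failure_threshold: int = 3,
                          dns_ttl: int = 60, db_promotion: int = 60) -> int:
    """Rough sequential estimate: detect the outage, let DNS caches expire, promote the DB."""
    detection = request_interval * failure_threshold  # consecutive failed health checks
    return detection + dns_ttl + db_promotion


print(failover_time_seconds())                     # 210 seconds with the defaults above
print(failover_time_seconds(request_interval=10))  # 150 seconds with fast health checks
```

Under these assumptions the target leaves roughly 90 seconds of headroom, so the time Karpenter needs to bring the standby cluster to full capacity is the figure most worth measuring in a failover drill.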
Gateway 1 is marked Legacy but is still running, and is even listed as SEC Compliant. Migrate its util sync functions to Common API Gateway, then decommission it. This saves ~$20-30/month in pod resources and reduces both the maintenance burden and the attack surface.
4 separate Aurora schemas (vpid_operator_api, vpid_merchant_api, linkid_partner_integration, cms_api) can share fewer Aurora instances with schema-level isolation. Saves ~$100-200/month by reducing Aurora instance count.
RabbitMQ runs as a single pod — SPOF for all async flows. Option A: Amazon MQ for RabbitMQ (~$50/month, managed, multi-AZ). Option B: SQS+SNS (~$5/month, serverless). Option B saves most but requires Consumer refactor.
Many scheduled jobs (Agent Jobs, Batch Jobs) run for only a few minutes, yet their pods run continuously. Use the Karpenter autoscaler plus Spot Instances for non-critical batch workloads. Saves 40-60% of EC2 cost for these workloads.
If not already using RI or Savings Plans for RDS/Aurora/EC2, a 1-year commitment saves 30-40% vs On-Demand. For a stable system like LynkiD, this is the single biggest cost reduction opportunity.
VPB Util API exists because "VPB Loyalty API is too rigid to edit." This is a workaround, not architecture. Refactor VPB Loyalty API to be more modular, then merge Util API back. Reduces 1 service + eliminates dual-access to same DB.
Mapped against all 6 pillars of the AWS Well-Architected Framework for a production-grade loyalty platform.
New additions cost: ~$400-470/month (Security + Observability + CI/CD + DB Optimization)
Disaster Recovery cost: ~$400-600/month (Route 53 + Aurora Global + S3 CRR + EKS Standby)
Total new cost: ~$800-1,070/month
Optimization savings: ~$300-500/month (retire GW1 + consolidate Aurora + right-size EC2 + RI/Savings Plans)
Net additional: ~$500-570/month (~$6K-7K/year), pairing low-end costs with low-end savings and high-end costs with high-end savings. This is a reasonable investment for production-grade DR, security, observability, and CI/CD for a financial loyalty platform serving millions of users.
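The cost summary can be re-derived with simple range arithmetic. The sketch below uses only figures stated in this proposal; the helper names are illustrative. It shows two ways to net savings against cost: the pairing that reproduces the stated ~$500-570, and the wider spread that results if cost and savings vary independently.

```python
from typing import Tuple

Range = Tuple[int, int]  # (low, high) in USD per month


def add(*ranges: Range) -> Range:
    """Sum ranges end to end: lows with lows, highs with highs."""
    return sum(r[0] for r in ranges), sum(r[1] for r in ranges)


def net_paired(cost: Range, savings: Range) -> Range:
    """Low cost minus low savings, high cost minus high savings."""
    return cost[0] - savings[0], cost[1] - savings[1]


def net_independent(cost: Range, savings: Range) -> Range:
    """Best case (low cost, high savings) to worst case (high cost, low savings)."""
    return cost[0] - savings[1], cost[1] - savings[0]


sections = add((40, 65), (105, 155), (240, 240), (10, 30))  # per-section estimates
total = add((400, 470), (400, 600))                          # new additions + DR
savings = (300, 500)

print(sections)                         # (395, 490)
print(total)                            # (800, 1070)
print(net_paired(total, savings))       # (500, 570)
print(net_independent(total, savings))  # (300, 770)
```

The per-section estimates sum to $395-490, slightly wider than the ~$400-470 quoted for new additions, so the high end of the budget may merit a second look.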
Summary of the newly proposed components in the 2026 architecture, translated from the Vietnamese version.
| AWS WAF | Web application firewall placed in front of all gateways. Blocks DDoS, SQL injection, and XSS before requests reach the services. The gateways are currently exposed directly, a major risk for a financial system. |
| Istio Service Mesh | Deploys a sidecar proxy next to each pod on k8s. Provides mTLS (encryption) between all services, following a zero-trust model. Services currently communicate in plaintext inside the cluster. |
| Circuit Breaker | Resilience4j pattern on all cross-service calls, especially LynkiD Gateway → Cloud over VPN. When a downstream service fails, the circuit opens automatically and returns a fallback instead of a cascading failure. Monitored via Prometheus (open/half-open/closed state). |
| Secrets Manager | Centrally manages all secrets (DB passwords, API keys) with automatic rotation. Secrets currently live in config files; if the source code leaks, every credential is exposed. |
| IAM + RBAC | Fine-grained, role-based access control for all AWS resources and k8s namespaces. Enforces the least-privilege principle. |
Est. cost: ~$40-65/month
| CloudWatch Logs | Centralizes logs from 30+ services. Debugging currently requires SSH into each pod and takes 30-60 minutes per incident. With centralized logs, a search across all services takes seconds. |
| AWS X-Ray | Distributed tracing across services, especially important for requests that cross the VPN (OnPrem↔Cloud). This path is currently a black box: latency and failures cannot be traced. |
| Prometheus + Grafana | Metrics collection, dashboards, and alerting. Know when a service is degrading BEFORE users report it. Set SLA thresholds and receive alerts automatically. |
| VPN Monitor | Monitors VPN tunnel health: latency, packet loss, uptime. Alerts on degradation. Fails over automatically if the primary tunnel fails. |
Est. cost: ~$105-155/month
| DocumentDB | Replaces the single MongoDB pod. MongoDB currently runs as one pod only; a pod crash means ALL notification history is lost, with no backup. DocumentDB is managed, multi-AZ, backed up automatically, and compatible with the MongoDB API, so almost no code changes are needed. |
| ElastiCache | Replaces the Redis pod on Cloud. Today, when the Redis pod restarts, the entire cache is lost, causing a thundering herd on the DB. ElastiCache provides multi-AZ, auto-failover, and persistent storage, so the cache survives restarts. |
Est. cost: ~$240/month (DocumentDB $140, ElastiCache $100)
| ArgoCD | GitOps for Kubernetes. Each deployment is one Git commit, giving a full audit trail and easy rollback (just revert the commit). Deployments are currently manual, error-prone, and hard to roll back. |
| Cloud Pipeline | AWS CodePipeline for Cloud services. Automated Build → Test → Deploy pipeline. Triggered on PR merge. |
| OnPrem Pipeline | Separate pipeline for OnPremise (Jenkins/GitLab CI). The deployment target differs (bare-metal k8s via VPN), so a separate pipeline is needed. |
| SonarQube | Static code analysis runs on every PR. Catches bugs and security vulnerabilities BEFORE they reach production. The quality gate blocks the merge if thresholds are not met. |
Est. cost: ~$10-30/month
DR for the AWS Cloud is rolled out first (2026); OnPrem DR follows. Strategy: Warm Standby, where the DR region runs at low capacity and scales up on failover.
| Route 53 DNS Failover | Health-check based DNS failover. When the primary region is unhealthy, Route 53 automatically shifts traffic to the DR region. RPO < 1 minute, RTO < 5 minutes. |
| Aurora Global Database | Cross-region replication with lag < 1 second. On failover, the DR replica is promoted to primary in < 1 minute. |
| S3 Cross-Region | Automatically replicates all S3 objects (images, Excel files, backups) to the DR region. |
| EKS Standby | Warm standby EKS cluster in the DR region. ArgoCD syncs manifests automatically. Runs minimum replicas. On failover, Karpenter auto-scales to full capacity. |
Est. DR cost: ~$400-600/month. OnPrem DR (Phase 2) additionally needs a secondary site, estimated separately.
New additions cost: ~$400-470/month (Security + Observability + CI/CD + DB Optimization)
DR cost: ~$400-600/month (Route 53 + Aurora Global + S3 CRR + EKS Standby)
Total new cost: ~$800-1,070/month
Optimization savings: ~$300-500/month (retire GW1 + consolidate Aurora + right-size EC2 + RI)
Net additional cost: ~$500-570/month (~$6K-7K/year), a reasonable investment for production-grade DR, security, observability, and CI/CD for a financial loyalty platform serving millions of users.