LynkiD — System Architecture · PROPOSAL 2026

Target State · Scalable, Reliable, Available, Elastic, Secure

v3.0 · Proposal · 2026 Target State

Legend (component types): API Service, Portal / Web, Scheduled Job, Data Store, Infrastructure, Gateway, External
Legend (connections): Direct, Via VPN, Async (MQ)
Legend (markers): 🆕 NEW 2026, 🛡️ SEC Compliant

📋 Proposal Summary

Estimated additional cost: ~$400-470/month (~$5K/year)

Savings from optimization (retire Gateway 1, consolidate Aurora, right-size EC2, Reserved Instances): ~$300-500/month

Net impact (excluding Disaster Recovery): near-neutral, with significant improvements in Security, Observability, Reliability, and Developer Experience.

🆕 New 2026 Components — Detailed Explanation

🛡️ Security Layer

AWS WAF: Web Application Firewall that sits in front of all gateways. It filters SQL injection, XSS, bot traffic, and application-layer DDoS floods before requests reach the services. The gateways are currently exposed directly, which is a critical gap for any financial system.
Istio Service Mesh: Deploys a sidecar proxy alongside every pod on k8s and provides mutual TLS (mTLS) between all services for zero-trust networking. Services currently communicate in plaintext within the cluster, which leaves them vulnerable to man-in-the-middle attacks.
Secrets Manager: Centralizes all secrets (DB passwords, API keys, tokens) with automatic rotation. Secrets currently live in config files or environment variables, so a source-code leak would compromise every credential.
IAM + RBAC: Fine-grained role-based access control for all AWS resources and k8s namespaces. Enforces the least-privilege principle: each service can access only what it needs.
Circuit Breaker: Resilience4j/Hystrix-style pattern applied to all cross-service calls, especially LynkiD Gateway → Cloud (via VPN). When a downstream service fails, the circuit opens automatically and returns a fallback response instead of letting the failure cascade. Circuit state (open/half-open/closed) is monitored via Prometheus metrics. Critical for the 4 VPN calls from LynkiD Gateway and for Exchange Partner API calls.

Est. cost: ~$40-65/month (WAF $25-50, Secrets Manager $15; Istio, IAM, and the Circuit Breaker add no license cost because they are open-source, library-level, or included with AWS)
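
The closed / open / half-open behaviour described for the Circuit Breaker can be sketched in a few lines. This is an illustrative state machine only, not the Resilience4j API: the class name, `failure_threshold`, `reset_timeout`, and the injectable `clock` are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Minimal closed / open / half-open state machine (illustrative only)."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # assumed default
        self.reset_timeout = reset_timeout          # seconds the circuit stays open
        self._clock = clock                         # injectable so tests can fake time
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, func: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
        state = self.state
        if state == "open":
            # Fail fast: do not touch the downstream service at all.
            return fallback()
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

In practice the `state` value would be exported as a metric so the open/half-open/closed transitions show up on a dashboard; that wiring is left out of the sketch.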

📊 Observability Stack

CloudWatch Logs: Centralized log aggregation for all 30+ services. Debugging currently requires SSH access to individual pods and takes 30-60 minutes per incident. With centralized logs, a search across all services takes seconds.
AWS X-Ray: Distributed tracing across services, especially critical for requests that cross the VPN tunnel (OnPrem↔Cloud). This path is currently a complete black box: latency and failures cannot be traced.
Prometheus + Grafana: Metrics collection, dashboards, and alerting. Detects a degrading service before users report it. SLA thresholds trigger alerts automatically.
Health Check API: Centralized health endpoint for all services. Kubernetes liveness/readiness probes ensure that unhealthy pods are automatically restarted or removed from load balancing.

Est. cost: ~$105-155/month (CloudWatch Logs $30-50, X-Ray $25, Prometheus + Grafana $50-80 managed or $0 self-hosted)
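
As a rough illustration of the Health Check API, the sketch below separates a liveness answer (the process is up) from a readiness answer (its dependencies respond), which is the distinction Kubernetes probes rely on. It uses only the Python standard library; the paths `/healthz` and `/readyz`, the port, and the `checks` dictionary are assumptions for the example, not names taken from the LynkiD services.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

Checks = Dict[str, Callable[[], bool]]


def run_check(check: Callable[[], bool]) -> bool:
    """A dependency check that raises is treated as failed."""
    try:
        return bool(check())
    except Exception:
        return False


def health_response(path: str, checks: Checks) -> Tuple[int, dict]:
    """Return (HTTP status, JSON body) for a probe path."""
    if path == "/healthz":   # liveness: the process can answer at all
        return 200, {"status": "alive"}
    if path == "/readyz":    # readiness: every dependency check passes
        results = {name: run_check(check) for name, check in checks.items()}
        ok = all(results.values())
        return (200 if ok else 503), {
            "status": "ready" if ok else "not ready",
            "checks": results,
        }
    return 404, {"error": "not found"}


def make_handler(checks: Checks):
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            status, body = health_response(self.path, checks)
            payload = json.dumps(body).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return HealthHandler


def serve(checks: Checks, port: int = 8080) -> None:
    """Blocking helper; call it from the service's entry point."""
    HTTPServer(("0.0.0.0", port), make_handler(checks)).serve_forever()
```

A failing readiness check returns 503, which tells the load balancer to stop routing to the pod without restarting it; only a failed liveness probe should trigger a restart.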

💾 Database Optimization

DocumentDB: Replaces the single-pod MongoDB. MongoDB currently runs as a single pod on k8s with no backup, so a pod crash loses all notification history. DocumentDB is managed, multi-AZ, automatically backed up, and MongoDB-compatible, so near-zero code changes are expected.
ElastiCache: Replaces the Redis pod on Cloud. When the Redis pod restarts today, the entire cache is lost and a thundering herd of requests hits the database. ElastiCache provides multi-AZ replication, auto-failover, and persistent storage, so the cache survives restarts.

Est. cost: ~$240/month (DocumentDB $140, ElastiCache $100). Replaces existing pod resources.
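
The effect of losing the cache on restart can be shown with a toy cache-aside model. This is a single-threaded sketch under stated assumptions: `CacheAside`, `simulate_restart`, and the key names are hypothetical, and the in-memory dict merely stands in for Redis.

```python
from typing import Any, Callable, Dict


class CacheAside:
    """Toy cache-aside wrapper that counts how often the database is read."""

    def __init__(self, load_from_db: Callable[[str], Any]) -> None:
        self._load = load_from_db
        self._cache: Dict[str, Any] = {}
        self.db_reads = 0

    def get(self, key: str) -> Any:
        if key not in self._cache:
            # Cache miss: fall through to the database and remember the value.
            self.db_reads += 1
            self._cache[key] = self._load(key)
        return self._cache[key]

    def simulate_restart(self) -> None:
        """Models a cache pod restart with no persistence or replica."""
        self._cache.clear()


keys = [f"member:{i}" for i in range(100)]
cache = CacheAside(load_from_db=lambda key: {"id": key})

for key in keys:              # warm-up: one database read per key
    cache.get(key)
warm_reads = cache.db_reads

for key in keys:              # steady state: served entirely from cache
    cache.get(key)
steady_reads = cache.db_reads - warm_reads

cache.simulate_restart()
for key in keys:              # after the restart every key misses again
    cache.get(key)
post_restart_reads = cache.db_reads - warm_reads - steady_reads

print(warm_reads, steady_reads, post_restart_reads)
```

With real concurrent traffic the post-restart misses arrive at the same time rather than one by one, which is what turns a cold cache into a load spike on the database.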

🚀 CI/CD & GitOps

ArgoCD: GitOps for Kubernetes. Every deployment is a Git commit, which gives a full audit trail and easy rollback (revert the commit). Deployments are currently manual or semi-automated, prone to human error, and hard to roll back.
Cloud Pipeline: AWS CodePipeline for Cloud services. Automated Build → Test → Deploy pipeline, triggered on PR merge: it runs tests, deploys to staging, then deploys to production behind an approval gate.
OnPrem Pipeline: Separate pipeline for OnPremise services (Jenkins/GitLab CI). The deployment target differs (bare-metal k8s via VPN), so a different pipeline is required, but with the same quality gates.
SonarQube: Static code analysis that runs on every PR. Catches bugs, security vulnerabilities, and code smells before they reach production. The quality gate blocks the merge if thresholds are not met.

Est. cost: ~$10-30/month (CodePipeline $10; ArgoCD and SonarQube CE are free/open-source)

🔄 Disaster Recovery — AWS Cloud (Phase 1: 2026)

DR for the AWS Cloud is rolled out first (2026); OnPrem DR follows later. Strategy: Warm Standby. The DR region runs at low capacity and scales up on failover.

Route 53 DNS Failover: Health-check based DNS failover. When the primary region is unhealthy, Route 53 automatically shifts traffic to the DR region. RPO < 1 minute, RTO < 5 minutes. Cost: ~$1/health check/month.
Aurora Global Database: Cross-region replication with lag < 1 second. On failover, the DR replica is promoted to primary in < 1 minute. Applies to all Aurora schemas (operator_api, merchant_api, partner_integration, cms_api). Cost: ~$200-300/month for DR instances.
S3 Cross-Region Replication: Automatically replicates all S3 objects (images, Excel files, backups) to the DR region. Cost: ~$5-10/month (storage + transfer only).
EKS Standby Cluster: Warm standby EKS cluster in the DR region. ArgoCD syncs manifests automatically. Runs minimum replicas (1 pod/service). On failover, Karpenter auto-scales to full capacity. Cost: ~$150-250/month (minimal EC2 + EKS control plane $74).

Est. DR cost: ~$400-600/month. OnPrem DR (Phase 2) will additionally need a secondary site or cloud-based DR, estimated separately.
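
The health-check based failover decision can be modelled as a counter of consecutive failed checks. The sketch below is a simplified illustration, not Route 53 code or configuration: `FailoverRouter`, the region labels, the threshold of 3, and the timing formula (check interval × threshold + DNS TTL) are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverRouter:
    """Routes to the secondary after N consecutive failed health checks."""

    primary: str
    secondary: str
    failure_threshold: int = 3  # assumed value
    _failures: int = field(default=0, init=False, repr=False)

    @property
    def active(self) -> str:
        if self._failures >= self.failure_threshold:
            return self.secondary
        return self.primary

    def observe(self, primary_healthy: bool) -> str:
        """Record one health-check result and return the active target."""
        if primary_healthy:
            self._failures = 0      # a healthy check fails back to primary
        else:
            self._failures += 1
        return self.active


def approx_failover_seconds(check_interval_s: int, failure_threshold: int,
                            dns_ttl_s: int) -> int:
    """Rough time until clients resolve the DR endpoint after an outage."""
    return check_interval_s * failure_threshold + dns_ttl_s


router = FailoverRouter(primary="primary-region", secondary="dr-region")
for healthy in [True, False, False, False]:
    target = router.observe(healthy)
print(target, approx_failover_seconds(10, 3, 60))
```

Under these assumed settings (10-second checks, threshold 3, 60-second TTL) the DNS switch alone takes about 90 seconds, which leaves the remainder of the 5-minute RTO for scaling the standby cluster.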

📉 Optimization & Cost Reduction Proposals

🔄 Retire Gateway 1 (Legacy)

Gateway 1 is marked Legacy but is still running and is even flagged SEC Compliant. Migrate its util sync functions to Common API Gateway, then decommission it. Saves ~$20-30/month in pod resources and reduces both maintenance burden and attack surface.

📦 Consolidate Aurora Schemas

4 separate Aurora schemas (vpid_operator_api, vpid_merchant_api, linkid_partner_integration, cms_api) can share fewer Aurora instances with schema-level isolation. Saves ~$100-200/month by reducing Aurora instance count.

⚡ RabbitMQ → Amazon MQ or SQS

RabbitMQ runs as a single pod, a single point of failure (SPOF) for all async flows. Option A: Amazon MQ for RabbitMQ (~$50/month, managed, multi-AZ). Option B: SQS+SNS (~$5/month, serverless). Option B saves the most but requires refactoring the consumers.

💤 Right-size EC2 + Spot Instances

Many scheduled jobs (Agent Jobs, Batch Jobs) run for minutes but pods are always running. Use Karpenter auto-scaler + Spot Instances for non-critical batch workloads. Saves 40-60% EC2 cost for these workloads.

🏷️ Reserved Instances / Savings Plans

If not already using RI or Savings Plans for RDS/Aurora/EC2, a 1-year commitment saves 30-40% vs On-Demand. For a stable system like LynkiD, this is the single biggest cost reduction opportunity.

🔧 Merge VPB Util API

VPB Util API exists because "VPB Loyalty API is too rigid to edit." This is a workaround, not architecture. Refactor VPB Loyalty API to be more modular, then merge Util API back. Reduces 1 service + eliminates dual-access to same DB.

🏗️ AWS Well-Architected Framework — Full Alignment

Mapped against all 6 pillars of the AWS Well-Architected Framework for a production-grade loyalty platform.

🔒 Security Pillar

  • ✅ AWS WAF — perimeter defense, OWASP Top 10 rules
  • ✅ Istio mTLS — in-cluster encryption, zero-trust
  • ✅ Secrets Manager — credential rotation, no hardcoded secrets
  • ✅ IAM + RBAC — least-privilege, service accounts
  • ✅ Circuit Breaker — prevents auth cascade on downstream failure
  • ⚠️ Add: AWS Shield Advanced — DDoS protection for financial services
  • ⚠️ Add: Amazon GuardDuty for threat detection and anomaly monitoring
  • ⚠️ Add: AWS Config — compliance auditing, drift detection
  • ⚠️ Add: VPC Flow Logs — network traffic analysis
  • ⚠️ Add: AWS CloudTrail — API call auditing for compliance

⚡ Reliability Pillar

  • ✅ Circuit Breaker (Resilience4j) — prevents cascade failures
  • ✅ DocumentDB multi-AZ — replaces single-pod MongoDB
  • ✅ ElastiCache multi-AZ — replaces single Redis pod
  • ✅ Health Check API — liveness/readiness probes
  • ✅ VPN Monitoring — latency, uptime, auto-failover
  • ⚠️ Add: Dead Letter Queue (DLQ) for RabbitMQ — failed messages don't disappear
  • ⚠️ Add: Saga Pattern for Exchange Partners — compensating transactions when partner API fails
  • ⚠️ Add: Multi-AZ RabbitMQ (Amazon MQ) — eliminate MQ single point of failure
  • ⚠️ Add: Backup VPN tunnel — redundant VPN for OnPrem↔Cloud
  • ⚠️ Add: AWS FIS (Fault Injection Service, formerly Fault Injection Simulator) for chaos engineering to test resilience
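
The Dead Letter Queue item in the list above can be sketched as follows. This is a minimal in-memory illustration, not RabbitMQ client code: `process_with_dlq`, `max_attempts`, and the sample handler are hypothetical names, and a real deployment would rely on the broker's own dead-lettering configuration.

```python
from collections import deque
from typing import Any, Callable, Iterable, List, Tuple


def process_with_dlq(messages: Iterable[Any], handler: Callable[[Any], None],
                     max_attempts: int = 3) -> Tuple[List[Any], List[Tuple[Any, str]]]:
    """Retry each failing message up to max_attempts, then dead-letter it."""
    queue = deque((message, 1) for message in messages)
    processed: List[Any] = []
    dead_letters: List[Tuple[Any, str]] = []

    while queue:
        message, attempt = queue.popleft()
        try:
            handler(message)
            processed.append(message)
        except Exception as exc:
            if attempt >= max_attempts:
                # Keep the message and the reason instead of dropping it.
                dead_letters.append((message, repr(exc)))
            else:
                queue.append((message, attempt + 1))  # requeue for a retry

    return processed, dead_letters


attempts = {"count": 0}


def sample_handler(message: str) -> None:
    """Hypothetical consumer that always fails on one poison message."""
    if message == "bad":
        attempts["count"] += 1
        raise ValueError("cannot parse message")


done, dead = process_with_dlq(["a", "bad", "b"], sample_handler)
print(done, dead, attempts["count"])
```

The dead-lettered entries keep the original payload plus the failure reason, so they can be inspected and replayed once the underlying fault is fixed.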

📊 Operational Excellence Pillar

  • ✅ CloudWatch Logs — centralized logging
  • ✅ AWS X-Ray — distributed tracing across VPN
  • ✅ Prometheus + Grafana — metrics, dashboards, alerting
  • ✅ ArgoCD — GitOps, auditable deployments
  • ✅ SonarQube — code quality gates
  • ⚠️ Add: Runbook Automation (AWS SSM) — automated incident response
  • ⚠️ Add: PagerDuty/OpsGenie — on-call rotation, escalation
  • ⚠️ Add: AWS Chatbot — Slack/Teams alerts from CloudWatch
  • ⚠️ Add: Centralized Dashboard — single pane of glass for all services
  • ⚠️ Add: Post-Incident Review process — blameless retrospectives

💰 Cost Optimization Pillar

  • ✅ Right-sizing EC2 + Karpenter auto-scaler
  • ✅ Spot Instances for batch jobs
  • ✅ Reserved Instances / Savings Plans
  • ✅ Consolidate Aurora schemas
  • ✅ Retire Legacy Gateway 1
  • ⚠️ Add: Graviton (ARM) instances, typically up to ~20% cheaper than comparable x86 instances at similar performance (workload-dependent)
  • ⚠️ Add: AWS Cost Explorer + Budgets — ongoing cost monitoring
  • ⚠️ Add: S3 Intelligent-Tiering — auto-optimize storage costs
  • ⚠️ Add: RDS Proxy — connection pooling, reduce DB load

🚀 Performance Efficiency Pillar

  • ✅ ElastiCache — managed caching layer
  • ✅ Cloudflare CDN — static content delivery
  • ⚠️ Add: Amazon CloudFront — CDN for API responses (edge caching)
  • ⚠️ Add: Aurora Read Replicas — offload read traffic from primary
  • ⚠️ Add: API Response Caching at Gateway level
  • ⚠️ Add: Connection Pooling (RDS Proxy) — reduce DB connection overhead
  • ⚠️ Add: Async processing for heavy operations (expand RabbitMQ usage)
  • ⚠️ Add: Database query optimization + indexing review

🌱 Sustainability Pillar

  • ⚠️ Add: Graviton instances — more energy-efficient ARM processors
  • ⚠️ Add: Auto-scaling to zero for non-peak hours (batch jobs)
  • ⚠️ Add: Right-size over-provisioned resources
  • ⚠️ Add: S3 lifecycle policies — auto-archive old data
  • ⚠️ Add: Measure and track carbon footprint via AWS Customer Carbon Footprint Tool

📊 ROI Summary

New additions cost: ~$400-470/month (Security + Observability + CI/CD + DB Optimization)
Disaster Recovery cost: ~$400-600/month (Route 53 + Aurora Global + S3 CRR + EKS Standby)
Total new cost: ~$800-1,070/month
Optimization savings: ~$300-500/month (retire GW1 + consolidate Aurora + right-size EC2 + RI/Savings Plans)
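Net additional: ~$500-570/month (~$6K-7K/year) when low-end costs are paired with low-end savings and high-end with high-end; if costs and savings vary independently, the range widens to ~$300-770/month. A reasonable investment for production-grade DR, security, observability, and CI/CD for a financial loyalty platform serving millions of users.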
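
The ROI figures can be re-derived with simple range arithmetic. The sketch below uses only the per-section estimates stated in this proposal; the helper names are illustrative. It computes the net figure two ways: pairing low with low and high with high, and as a full interval in which costs and savings vary independently.

```python
from typing import Dict, Iterable, Tuple

Range = Tuple[int, int]  # (low, high) in USD per month

# Per-section estimates as stated in the proposal.
NEW_COMPONENTS: Dict[str, Range] = {
    "Security": (40, 65),
    "Observability": (105, 155),
    "Database Optimization": (240, 240),
    "CI/CD & GitOps": (10, 30),
}
HEADLINE_NEW: Range = (400, 470)       # headline figure used in the summary
DISASTER_RECOVERY: Range = (400, 600)
SAVINGS: Range = (300, 500)


def add_ranges(ranges: Iterable[Range]) -> Range:
    """Sum the low ends and the high ends separately."""
    lows, highs = zip(*ranges)
    return sum(lows), sum(highs)


def net_paired(cost: Range, savings: Range) -> Range:
    """Low cost minus low savings, high cost minus high savings."""
    return cost[0] - savings[0], cost[1] - savings[1]


def net_interval(cost: Range, savings: Range) -> Range:
    """Best case (low cost, high savings) to worst case (high cost, low savings)."""
    return cost[0] - savings[1], cost[1] - savings[0]


component_total = add_ranges(NEW_COMPONENTS.values())
total_new = add_ranges([HEADLINE_NEW, DISASTER_RECOVERY])
paired = net_paired(total_new, SAVINGS)
interval = net_interval(total_new, SAVINGS)

print(component_total, total_new, paired, interval)
```

The per-section estimates sum to $395-490/month, slightly wider than the $400-470 headline. The paired calculation reproduces the $500-570 net figure, while the independent interval is $300-770/month.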

🇻🇳 Detailed Explanation (Vietnamese Edition)

Explanations of the new components proposed in the 2026 architecture, from the Vietnamese edition.

🛡️ Security Layer

AWS WAF: Web application firewall placed in front of all Gateways. Blocks DDoS, SQL injection, and XSS before requests reach the services. The Gateways are currently exposed directly, a major risk for a financial system.
Istio Service Mesh: Deploys a sidecar proxy alongside every pod on k8s. Provides mTLS (encryption) between all services, following a zero-trust model. Services currently communicate in plaintext within the cluster.
Circuit Breaker: Resilience4j pattern on all cross-service calls, especially LynkiD Gateway → Cloud via VPN. When a downstream service fails, the circuit opens automatically and returns a fallback instead of a cascading failure. Monitored via Prometheus (open/half-open/closed state).
Secrets Manager: Centrally manages all secrets (DB passwords, API keys) with automatic rotation. Secrets currently live in config files, so a source-code leak would expose every credential.
IAM + RBAC: Fine-grained, role-based access control for all AWS resources and k8s namespaces. Enforces the least-privilege principle.

Estimated cost: ~$40-65/month

📊 Monitoring System (Observability)

CloudWatch Logs: Centralizes logs from 30+ services. Debugging currently requires SSH into each pod and takes 30-60 minutes per incident. With centralized logs, a search across all services takes a few seconds.
AWS X-Ray: Distributed tracing across services, especially important for requests that go through the VPN (OnPrem↔Cloud). This path is currently a black box: latency and failures cannot be traced.
Prometheus + Grafana: Metrics collection, dashboards, and alerting. Shows when a service is degrading before users report it. Set SLA thresholds and receive alerts automatically.
VPN Monitor: Monitors VPN tunnel health: latency, packet loss, uptime. Alerts on degradation. Fails over automatically if the primary tunnel fails.

Estimated cost: ~$105-155/month

💾 Database Optimization

DocumentDB: Replaces the single-pod MongoDB. MongoDB currently runs as one pod only, with no backup, so a pod crash loses ALL notification history. DocumentDB is managed, multi-AZ, automatically backed up, and compatible with the MongoDB API, so almost no code changes are needed.
ElastiCache: Replaces the Redis pod on Cloud. When the Redis pod restarts today, the entire cache is lost, causing a thundering herd on the DB. ElastiCache provides multi-AZ, auto-failover, and persistent storage, so the cache survives restarts.

Estimated cost: ~$240/month (DocumentDB $140, ElastiCache $100)

🚀 CI/CD & GitOps

ArgoCD: GitOps for Kubernetes. Every deployment is one Git commit, which gives a full audit trail and easy rollback (just revert the commit). Deployments are currently manual, error-prone, and hard to roll back.
Cloud Pipeline: AWS CodePipeline for Cloud services. Automated Build → Test → Deploy pipeline, triggered when a PR is merged.
OnPrem Pipeline: Separate pipeline for OnPremise (Jenkins/GitLab CI). The deployment target differs (bare-metal k8s via VPN), so a separate pipeline is needed.
SonarQube: Static code analysis that runs on every PR. Catches bugs and security vulnerabilities before they reach production. The quality gate blocks the merge if thresholds are not met.

Estimated cost: ~$10-30/month

🔄 Disaster Recovery: AWS Cloud (Phase 1: 2026)

DR for the AWS Cloud is rolled out first (2026); OnPrem DR follows later. Strategy: Warm Standby. The DR region runs at low capacity and scales up on failover.

Route 53 DNS Failover: DNS failover based on health checks. When the primary region is unhealthy, Route 53 automatically shifts traffic to the DR region. RPO < 1 minute, RTO < 5 minutes.
Aurora Global Database: Cross-region replication with lag < 1 second. On failover, the DR replica is promoted to primary in < 1 minute.
S3 Cross-Region: Automatically replicates all S3 objects (images, Excel files, backups) to the DR region.
EKS Standby: Warm standby EKS cluster in the DR region. ArgoCD syncs manifests automatically. Runs minimum replicas. On failover, Karpenter auto-scales to full capacity.

Estimated DR cost: ~$400-600/month. OnPrem DR (Phase 2) additionally needs a secondary site, budgeted separately.

📊 ROI Summary

New additions cost: ~$400-470/month (Security + Observability + CI/CD + DB Optimization)
DR cost: ~$400-600/month (Route 53 + Aurora Global + S3 CRR + EKS Standby)
Total new cost: ~$800-1,070/month
Savings from optimization: ~$300-500/month (retire GW1 + consolidate Aurora + right-size EC2 + RI)
Net additional cost: ~$500-570/month (~$6K-7K/year). A reasonable investment for production-grade DR, security, observability, and CI/CD on a financial loyalty platform serving millions of users.