Blue/green storefront on EKS¶
A complete, non-trivial mgtt walkthrough based on a real production e-commerce platform. PHP-FPM behind nginx, blue/green traffic switching, an async queue tier, scheduled cron, and an AWS-managed data layer (RDS + ElastiCache + AmazonMQ + S3 + CloudFront).
This page is a walkthrough, not a reference. It shows how a working engineer reaches the model you see here — including the places where the first attempt was wrong and had to be refined.
On this page¶
- The system
- The model
- Validate
- Scenarios
- Simulate in CI
- At 3am:
mgtt diagnose - Lessons from the first few weeks
The system¶
acme-shop is a storefront serving shop.acme.example. Deployed on EKS with blue/green traffic switching: two identical deployment sets (color: blue, color: green) run side-by-side, and a single Service (acme-shop-svc) points its selector at whichever color is live.
graph LR
CF([Cloudflare CDN])
ALB[ALB — live]
SVC[acme-shop-svc<br/>selector: color=active]
NB[nginx-blue]
NG[nginx-green]
PB[php-fpm-blue]
PG[php-fpm-green]
CB[consumer-blue]
CG[consumer-green]
CRON[cron]
ESO[external-secrets<br/>operator]
SSM[(SSM parameter)]
RDS[(RDS MySQL)]
REDIS[(ElastiCache)]
MQ[(AmazonMQ — RabbitMQ)]
OS[(OpenSearch)]
S3[(S3 media)]
CFM[CloudFront<br/>media]
CF --> ALB --> SVC
SVC -->|active color| NB
SVC -->|active color| NG
NB --> PB
NG --> PG
PB --> RDS & REDIS & MQ & OS & S3 & ESO
PG --> RDS & REDIS & MQ & OS & S3 & ESO
CB --> MQ & RDS & REDIS & ESO
CG --> MQ & RDS & REDIS & ESO
CRON --> RDS & REDIS & MQ & ESO
ESO --> SSM
S3 --> CFM
The stateless tiers (nginx, php-fpm, consumer) come in blue and green pairs. The shared data layer (RDS, Redis, MQ, OpenSearch, S3) is color-agnostic — both colors connect identically. Cron is a singleton; it reads the env config from the same Secret the active color uses.
This is the serving path only. Cluster operators (ArgoCD, Karpenter, ALB Controller), the observability stack (Alloy/Loki/Prometheus/Grafana), and the blue/green direct-QA ingresses are deliberately outside the model — their failures don't directly break HTTP serving, and adding them would multiply enumerated scenarios without diagnostic value.
The model¶
system.model.yaml:
meta:
name: acme-shop
version: "1.0"
providers:
- mgt-tool/kubernetes@^3.0.0
- mgt-tool/aws@^1.0.0
vars:
namespace: default
cluster: acme-shop-stage
region: eu-central-1
env: stage # flip to `prod` on the prod branch
scenarios: none # see note below on enumerated scenarios
components:
# ─── Edge / public entry ────────────────────────────────────────
cloudflare:
type: cdn
depends:
- on: acme-shop-ingress
acme-shop-ingress:
type: ingress
depends:
- on: acme-shop-svc
# ─── Service layer (color-selecting) ────────────────────────────
acme-shop-svc:
type: service
depends:
- on: acme-shop-nginx-blue
- on: acme-shop-nginx-green
# ─── nginx tier (per color) ─────────────────────────────────────
acme-shop-nginx-blue:
type: deployment
depends:
- on: acme-shop-php-fpm-blue
acme-shop-nginx-green:
type: deployment
depends:
- on: acme-shop-php-fpm-green
# ─── php-fpm tier (per color) ───────────────────────────────────
# All backend work lives here; depends on the full shared data
# layer. The app-config Secret selects which color's env is live.
acme-shop-php-fpm-blue:
type: deployment
depends:
- on: external-secrets
- on: rds
- on: redis
- on: mq
- on: opensearch
- on: media-bucket
acme-shop-php-fpm-green:
type: deployment
depends:
- on: external-secrets
- on: rds
- on: redis
- on: mq
- on: opensearch
- on: media-bucket
# ─── Async workers (per color) ──────────────────────────────────
acme-shop-consumer-blue:
type: deployment
depends: [{on: mq}, {on: rds}, {on: redis}, {on: external-secrets}]
acme-shop-consumer-green:
type: deployment
depends: [{on: mq}, {on: rds}, {on: redis}, {on: external-secrets}]
# ─── Cron (singleton, color-agnostic) ───────────────────────────
acme-shop-cron:
type: deployment
depends: [{on: external-secrets}, {on: rds}, {on: redis}, {on: mq}]
scheduled_jobs:
type: business_process
depends: [{on: acme-shop-cron}]
async_jobs:
type: business_process
depends: [{on: acme-shop-consumer-blue}, {on: acme-shop-consumer-green}, {on: mq}]
# ─── In-cluster search ──────────────────────────────────────────
opensearch:
type: deployment
healthy:
- ready_replicas >= 1
# ─── AWS managed data layer ─────────────────────────────────────
rds:
type: rds_instance
providers: [mgt-tool/aws@^1.0.0]
resource: acme-shop-{env}-rds
healthy:
- connection_count < 500
redis:
type: elasticache_cluster
providers: [mgt-tool/aws@^1.0.0]
resource: acme-shop-{env}-redis
healthy:
- available == true
mq:
type: mq_broker
providers: [mgt-tool/aws@^1.0.0]
resource: acme-shop-{env}-mq
healthy:
- queue_depth < 10000
media-bucket:
type: s3_bucket
providers: [mgt-tool/aws@^1.0.0]
resource: acme-shop-stage-media-8f3e9c
cloudfront:
type: cloudfront_distribution
providers: [mgt-tool/aws@^1.0.0]
resource: EABCDEF123456X
depends:
- on: media-bucket
# ─── Config / secrets sources ───────────────────────────────────
ssm-app-config:
type: ssm_parameter
providers: [mgt-tool/aws@^1.0.0]
resource: /platform/defaults/app-config
external-secrets:
type: operator
vars:
namespace: external-secrets
depends:
- on: ssm-app-config
Why it looks like this¶
Kubernetes components use the real resource name as the mgtt key (acme-shop-nginx-blue), because the kubernetes provider's kubectl get <type> <key> probes resolve it directly in default. AWS components use readable short keys (rds, redis, mq) with resource: carrying the real AWS identifier.
resource: acme-shop-{env}-rds — templating from meta.vars.env means this same file ships unchanged for stage and prod; only the branch-specific vars.env flips.
Random suffixes are hardcoded. The S3 bucket acme-shop-stage-media-8f3e9c doesn't follow the {env} pattern (random bucket suffix). Same for CloudFront distribution IDs and SSM paths. Readable key + hardcoded resource is fine — the decoupling still holds.
scenarios: none on the model. The enumerated scenarios.yaml for this system is tens of thousands of chains — ~40MB. Too chunky for git history. Regenerate locally with mgtt model validate --write-scenarios, gitignore the result, and skip the drift check on this model. Hand-authored scenarios stay in scenarios/ and run in CI via mgtt simulate --all.
nginx → php-fpm is a deployment edge, not a service edge. The acme-shop-php-fpm-{blue,green} Services exist in the cluster but share the Deployment's exact name — mgtt requires unique component keys, so the model walks to the deployment directly. The Deployment's ready_replicas / restart_count are what actually diagnose php-fpm health; the svc-level probe wouldn't have added signal.
healthy: overrides replace the type default, they don't merge. Three cases where that matters:
opensearch— the defaultdeploymenttype has many rules; for a single-replica staging deployment, the full set over-constrains.ready_replicas >= 1is the whole signal.redis— theelasticache_clusterdefault includescache_hit_ratio > 80, which trips on idle stage (ratio is 0 when nothing reads). The component-levelavailable == truereplaces the whole rule set. A real Redis outage still flipsavailableto false.mq— droppedconsumer_count > 0from the default. On stage, consumers are scaled to 0 and never attach to the broker, so the rule flagged idle-by-design as an incident.queue_depth < 10000is the real safety bound: a backed-up queue is the symptom regardless of how many consumers are attached.
Business-process components (scheduled_jobs, async_jobs) give the engine a user-visible symptom layer for chains rooted at cron or mq. Treating them as generic components with operator-observable facts (are jobs processed on time, is the queue draining) gives mgtt a terminal to reason from.
External Secrets Operator is modelled, not the Secret it produces. The operator's health is the real signal — a healthy operator implies a fresh Secret. And the view-only IAM policy used by the CI diagnose role excludes secrets, so probing the Secret directly would 403. The operator is probed with deployment_ready only (not crd_registered or webhook_*), because cluster-scoped CRD gets also aren't in the view policy, and this ESO install ships no webhook configuration — setting the vars for probes that would come back forbidden just adds noise.
The operator lives in its own namespace. vars.namespace: external-secrets on that single component overrides the model-wide meta.vars.namespace: default, so the kubectl -n argument in the probe resolves correctly.
Validate¶
$ mgtt model validate
✓ cloudflare 1 dependency valid
✓ acme-shop-ingress 1 dependency valid
✓ acme-shop-svc 2 dependencies valid
✓ acme-shop-nginx-blue 1 dependency valid
✓ acme-shop-nginx-green 1 dependency valid
✓ acme-shop-php-fpm-blue 6 dependencies valid
✓ acme-shop-php-fpm-green 6 dependencies valid
✓ acme-shop-consumer-blue 4 dependencies valid
✓ acme-shop-consumer-green 4 dependencies valid
✓ acme-shop-cron 4 dependencies valid
✓ scheduled_jobs 1 dependency valid
✓ async_jobs 3 dependencies valid
✓ opensearch healthy override valid
✓ rds healthy override valid
✓ redis healthy override valid
✓ mq healthy override valid
✓ media-bucket resolved
✓ cloudfront 1 dependency valid
✓ ssm-app-config resolved
✓ external-secrets 1 dependency valid
20 components · 0 errors · 0 warnings
At this point the model is syntactically correct and every type/fact resolves against the declared providers. Nothing has been said yet about whether it reasons well — that's what scenarios are for.
Scenarios¶
Five scenarios live in scenarios/, each testing a different lesson the engine has to get right.
1. All healthy — no false positives¶
scenarios/all-healthy.yaml injects healthy facts for every component and asserts root_cause: none. The moment this fails, you've added a rule that trips on a healthy system — the consumer_count > 0 mistake gets caught here.
name: all components healthy
inject:
acme-shop-nginx-blue: { ready_replicas: 2, desired_replicas: 2, condition_available: true, restart_count: 0 }
acme-shop-nginx-green: { ready_replicas: 2, desired_replicas: 2, condition_available: true, restart_count: 0 }
acme-shop-php-fpm-blue: { ready_replicas: 3, desired_replicas: 3, condition_available: true, restart_count: 0 }
acme-shop-php-fpm-green: { ready_replicas: 3, desired_replicas: 3, condition_available: true, restart_count: 0 }
acme-shop-consumer-blue: { ready_replicas: 1, desired_replicas: 1, condition_available: true, restart_count: 0 }
acme-shop-consumer-green: { ready_replicas: 1, desired_replicas: 1, condition_available: true, restart_count: 0 }
acme-shop-cron: { ready_replicas: 1, desired_replicas: 1, condition_available: true, restart_count: 0 }
opensearch: { ready_replicas: 1, desired_replicas: 1, condition_available: true, restart_count: 0 }
rds: { available: true, connection_count: 120 }
redis: { available: true, connection_count: 200, cache_hit_ratio: 0.95 }
mq: { available: true, consumer_count: 22, queue_depth: 34 }
media-bucket: { accessible: true }
cloudfront: { enabled: true, deployed: true }
external-secrets: { deployment_ready: true, crd_registered: true }
expect:
root_cause: none
eliminated:
- cloudflare
- external-secrets
- acme-shop-ingress
- acme-shop-nginx-blue
- acme-shop-nginx-green
- acme-shop-php-fpm-blue
- acme-shop-php-fpm-green
- acme-shop-svc
- mq
- opensearch
- rds
- redis
- media-bucket
- ssm-app-config
2. Blue nginx crash-loop, green healthy¶
A bad image was rolled to blue nginx. Blue crash-loops, green is fine. Traffic is currently on blue, so the live path is broken — runbook here is "flip the service selector to green".
The engine sees acme-shop-nginx-blue degraded AND acme-shop-nginx-green healthy. Because acme-shop-svc hard-depends on both, mgtt flags acme-shop-nginx-blue as the root cause of the live-traffic symptom — and acme-shop-nginx-green gets eliminated, confirming the failover target is actually available.
name: blue nginx crash-looping, green healthy
inject:
acme-shop-nginx-blue: { ready_replicas: 0, desired_replicas: 2, condition_available: false, restart_count: 18 }
acme-shop-nginx-green: { ready_replicas: 2, desired_replicas: 2, condition_available: true, restart_count: 0 }
acme-shop-svc: { endpoint_count: 0, selector_match: false }
acme-shop-php-fpm-blue: { ready_replicas: 3, desired_replicas: 3, condition_available: true, restart_count: 0 }
acme-shop-php-fpm-green: { ready_replicas: 3, desired_replicas: 3, condition_available: true, restart_count: 0 }
rds: { available: true, connection_count: 140 }
redis: { available: true, connection_count: 200, cache_hit_ratio: 0.94 }
mq: { available: true, consumer_count: 22, queue_depth: 40 }
opensearch: { ready_replicas: 1, desired_replicas: 1, condition_available: true, restart_count: 0 }
media-bucket: { accessible: true }
external-secrets: { deployment_ready: true, crd_registered: true }
expect:
root_cause: acme-shop-nginx-blue
path: [cloudflare, acme-shop-ingress, acme-shop-svc, acme-shop-nginx-blue]
eliminated: [external-secrets, acme-shop-nginx-green, acme-shop-php-fpm-blue, acme-shop-php-fpm-green, mq, opensearch, rds, redis, media-bucket, ssm-app-config]
3. RDS unavailable — dramatic cascade¶
RDS stops accepting connections. php-fpm (both colors) can't load config or run catalog queries, so pods crash-loop. nginx returns 5xx. Cron and consumers are equally affected. The failure is splattered across half the dependency graph — the engine must trace it to rds, not blame a color.
name: rds unavailable
inject:
rds: { available: false, connection_count: 0 }
redis: { available: true, connection_count: 210, cache_hit_ratio: 0.82 }
mq: { available: true, consumer_count: 0, queue_depth: 4200 }
opensearch: { ready_replicas: 1, desired_replicas: 1, condition_available: true, restart_count: 0 }
media-bucket: { accessible: true }
external-secrets: { deployment_ready: true, crd_registered: true }
acme-shop-php-fpm-blue: { ready_replicas: 0, desired_replicas: 3, condition_available: false, restart_count: 14 }
acme-shop-php-fpm-green: { ready_replicas: 0, desired_replicas: 3, condition_available: false, restart_count: 12 }
acme-shop-consumer-blue: { ready_replicas: 0, desired_replicas: 1, condition_available: false, restart_count: 8 }
acme-shop-consumer-green: { ready_replicas: 0, desired_replicas: 1, condition_available: false, restart_count: 9 }
expect:
root_cause: rds
path: [cloudflare, acme-shop-ingress, acme-shop-svc, acme-shop-nginx-blue, acme-shop-php-fpm-blue, rds]
eliminated: [external-secrets, mq, opensearch, redis, media-bucket, ssm-app-config]
4. Redis unavailable — observability trap¶
This scenario tests whether the engine gets fooled. Redis goes down; sessions evict; cache hit ratio collapses. php-fpm pods stay up (they don't require Redis to start), but every request now hits the database — RDS connection count climbs to 480 (just under the 500 limit). If you probed naively, RDS would look stressed.
The engine has to recognise that redis.available == false is the actual failed predicate, and RDS is still within its healthy: bound. Root cause: redis.
name: redis unavailable
inject:
redis: { available: false, connection_count: 0, cache_hit_ratio: 0.0 }
rds: { available: true, connection_count: 480 } # stressed but within bound
mq: { available: true, consumer_count: 22, queue_depth: 38 }
opensearch: { ready_replicas: 1, desired_replicas: 1, condition_available: true, restart_count: 0 }
media-bucket: { accessible: true }
external-secrets: { deployment_ready: true, crd_registered: true }
acme-shop-php-fpm-blue: { ready_replicas: 3, desired_replicas: 3, condition_available: true, restart_count: 1 }
acme-shop-php-fpm-green: { ready_replicas: 3, desired_replicas: 3, condition_available: true, restart_count: 1 }
expect:
root_cause: redis
path: [cloudflare, acme-shop-ingress, acme-shop-svc, acme-shop-nginx-blue, acme-shop-php-fpm-blue, redis]
eliminated: [external-secrets, acme-shop-nginx-green, acme-shop-php-fpm-green, mq, opensearch, rds, media-bucket, ssm-app-config]
5. External Secrets Operator down — deep chain cut short¶
ESO's Deployment is unavailable, so the in-cluster app-config Secret can't be reconciled from SSM. php-fpm pods on both colors fail to start (mount error on a stale/missing Secret). Cron fails. Consumers can't start. The SSM parameter itself — source of truth — is fine.
The test is whether the engine cuts the chain at external-secrets rather than walking all the way to ssm-app-config and blaming it. A probe to SSM returns "ok", eliminating it. A probe to the operator's deployment returns "not ready" — that's the cut point.
name: external-secrets operator down
inject:
ssm-app-config: { exists: true, version: 7 }
external-secrets: { deployment_ready: false, crd_registered: true, restart_count: 0 }
rds: { available: true, connection_count: 100 }
redis: { available: true, connection_count: 200, cache_hit_ratio: 0.93 }
mq: { available: true, consumer_count: 0, queue_depth: 1800 }
media-bucket: { accessible: true }
opensearch: { ready_replicas: 1, desired_replicas: 1, condition_available: true, restart_count: 0 }
acme-shop-php-fpm-blue: { ready_replicas: 0, desired_replicas: 3, condition_available: false, condition_progressing: false, restart_count: 0 }
acme-shop-php-fpm-green: { ready_replicas: 0, desired_replicas: 3, condition_available: false, condition_progressing: false, restart_count: 0 }
acme-shop-cron: { ready_replicas: 0, desired_replicas: 1, condition_available: false, condition_progressing: false, restart_count: 0 }
expect:
root_cause: external-secrets
path: [cloudflare, acme-shop-ingress, acme-shop-svc, acme-shop-nginx-blue, acme-shop-php-fpm-blue, external-secrets]
eliminated: [mq, opensearch, rds, redis, media-bucket, ssm-app-config]
Note restart_count: 0 on the php-fpm and cron pods — a mount failure stalls the rollout before the container starts, so the pod never crashes. "Not ready, never started" is a different signature from "crash-loop", and the model's facts have to carry that distinction for the engine to cut the chain correctly.
Simulate in CI¶
$ mgtt simulate --all
all components healthy ✓ passed
blue nginx crash-looping, green healthy ✓ passed
rds unavailable ✓ passed
redis unavailable ✓ passed
external-secrets operator down ✓ passed
5/5 scenarios passed
No cluster, no credentials. This runs on every PR in ~400ms:
# .gitlab-ci.yml (or .github/workflows/mgtt.yaml)
validate-model:
image: ghcr.io/mgt-tool/mgtt:latest
script:
- mgtt model validate
- mgtt simulate --all
Every scenario is a test of the model's reasoning. If someone renames rds to primary-db without updating the ingress-tier dependency, scenario 3 fails. If someone re-adds consumer_count > 0 to the mq healthy rule, scenario 1 fails. The scenarios are the guard against silent drift.
At 3am: mgtt diagnose¶
A Cloudflare alert fires: 5xx rate on shop.acme.example just crossed 5%. On-call pastes /diagnose api into Slack; a GitLab job runs:
$ mgtt diagnose --suspect acme-shop-php-fpm-blue --max-probes 10
▶ probe cloudflare enabled ✓ healthy ← eliminated
▶ probe acme-shop-ingress address_assigned ✓ healthy ← eliminated
▶ probe acme-shop-svc endpoint_count ✗ 0 endpoints
▶ probe acme-shop-nginx-blue ready_replicas ✗ 0/2 ready
▶ probe acme-shop-php-fpm-blue ready_replicas ✗ 0/3 ready
▶ probe external-secrets deployment_ready ✗ unhealthy
▶ probe ssm-app-config exists ✓ healthy ← eliminated
▶ probe rds available ✓ healthy ← eliminated
▶ probe redis available ✓ healthy ← eliminated
Root cause: external-secrets.deployment_unavailable
Chain: cloudflare ← acme-shop-ingress ← acme-shop-svc ←
acme-shop-nginx-blue ← acme-shop-php-fpm-blue ← external-secrets
Eliminated: cloudflare, acme-shop-ingress, ssm-app-config, rds, redis
Probes run: 9/10
Partial visibility: none
Nine probes, five branches eliminated, one chain confirmed. The runbook link (not shown) points to docs/runbooks/external-secrets-recovery.md. On-call knows where to look before they've opened a terminal.
The reason diagnose runs this fast is the committed scenarios.yaml: at design time, mgtt model validate --write-scenarios enumerated every plausible failure chain this model can produce, so diagnose prunes whole branches (media-bucket, cloudfront, opensearch, mq, async_jobs, consumer/cron) with a single group-elimination step before the first live probe runs.
Lessons from the first few weeks¶
Real usage of mgtt is a dialogue between the model and the system. These four refinements came from running against staging for a week before the model was trusted:
consumer_count > 0 on the mq_broker tripped on idle stage. Staging often has consumer deployments at zero replicas for cost. The default mq_broker healthy rule flagged this as an incident on every run. The fix: component-level healthy: that replaces the default with queue_depth < 10000, because queue depth is the symptom an operator actually cares about regardless of consumer count. Scenario 1 (all-healthy) is the regression test.
cache_hit_ratio > 80 on idle Redis. Same shape: on a freshly-started stage, the hit ratio is 0 because nothing's reading. The default elasticache_cluster rule flagged Redis as unhealthy while Redis was fine. Replaced with available == true.
Modelling the Secret instead of the operator. The first attempt had a kubernetes.secret component app-config-active as the dependency of php-fpm. But (a) the CI diagnose role uses a view-only IAM policy that excludes secrets — so the probe 403s — and (b) a stale Secret is a symptom, not a cause. The actual cause is the operator that should have refreshed it. Replacing the Secret with external-secrets (an operator component) gave the engine a probeable root cause and eliminated the 403.
Overriding meta.vars.namespace per-component. ESO lives in its own namespace, so the model-wide default namespace fell through to the wrong kubectl -n argument. The fix isn't to split the model — it's a two-line vars: { namespace: external-secrets } on that one component. Any other cross-namespace dependency (ALB Controller, cluster operators) would follow the same pattern.
Each of these refinements was caught by a failing scenario run, not by reading the code. The scenario file is where the engineering judgement lives — the model is the shape, the scenarios are the spec.
Next¶
- Model Schema Reference — every field in
system.model.yaml - Scenario Schema Reference — every field in scenario files
- Multi-File Models — splitting the model when the system has distinct operational moments (serving vs. deploy-time)
- Troubleshooting walkthrough — the runtime loop in more detail