Refactoring Bloated EKS Estates: A 90-Day Cost Anatomy
A field-tested teardown of how a mid-size SaaS team trimmed 38% of their EKS spend without freezing roadmap delivery.
This briefing walks through the architectural inventory we performed for a Series B SaaS team running 14 EKS clusters across two regions. We cover request/limit calibration, node group rightsizing, and why three of their managed services were quietly doubling effective compute. The narrative is paired with the exact dashboards we used during the engagement and the rollback safety net we set up before the first migration window. The advisory retainer behind this briefing also produced a runbook we now reuse with platform teams entering Q3 cost reviews.
What this briefing actually contains
- Detailed cluster-by-cluster spend audit template
- Karpenter vs cluster-autoscaler decision worksheet
- CPU/memory request calibration playbook used in the engagement
- Reserved instance vs Savings Plan trade tree for the workload mix
- Postmortem of the two cost spikes that returned in week 6
- Suggested observability hooks before any rightsizing pass
What you can take into your team
- A reproducible baseline of where compute money actually goes per service
- Confidence that latency budgets survived the rightsizing pass
- A 90-day calendar of follow-up checks the platform team can self-serve
₩4,200,000
The fee covers full access to this briefing, the attached retainer notes, and one follow-up question to the responsible editor. Pricing is informational. Engagements are confirmed in writing during the kickoff conversation.
What we are most often asked about this briefing
About 70% of the calibration logic carries over to other cloud providers. The rightsizing chapters are vendor-neutral, but the Reserved Instance and Savings Plan section is AWS-specific and would need to be reworked for committed-use discounts on GCP or reservations on Azure.
Reviews — including reservations
The Karpenter decision tree was the part I bookmarked. It does not pretend the answer is always Karpenter, which most posts on this topic do.
Honest about what they will not solve for you. We used the rollback safety net section verbatim and it caught a regression in week 2.