From Infrastructure Capacity & Performance Management to Autonomous Capacity & Performance Engineering
READER BEWARE: THE FOLLOWING WAS WRITTEN ENTIRELY BY AI WITHOUT HUMAN EDITING.
Introduction
For years, infrastructure capacity and performance management meant one thing: engineers watching dashboards. Teams would build elaborate Grafana panels, configure Datadog monitors, and staff on-call rotations waiting for alerts to fire. Thresholds were manually tuned, forecasting was done in spreadsheets, and capacity plans were quarterly rituals that were often outdated before the ink dried.
The discipline is undergoing a fundamental transformation. As AI capabilities mature and infrastructure complexity scales beyond human cognitive bandwidth, the role of “Infrastructure Capacity and Performance Management” is evolving into “Autonomous Capacity & Performance Engineering” — a model where intelligent agents replace reactive dashboards with predictive, closed-loop automation.
This post details what that transformation looks like, why it is happening now, and includes a concrete proof-of-concept (POC) using AWS, EKS, Datadog, KEDA, Karpenter, and open-source AI tooling that demonstrates how autonomous capacity and performance engineering works in practice.
The Old Model: Capacity and Performance Management
What It Looked Like
In the traditional model, capacity and performance management was fundamentally a human-in-the-loop discipline:
- Dashboards and monitors were the primary interface. Engineers spent significant time building and maintaining Grafana or Datadog dashboards.
- Alerting was threshold-based. A CPU utilization alert fires at 80%; an engineer investigates.
- Capacity planning was periodic and manual. Teams projected growth from historical trends in spreadsheets, then submitted resource requests to procurement or cloud budgets.
- Performance tuning required deep expert knowledge. Identifying whether a latency spike was caused by network saturation, database contention, or noisy neighbors required experienced engineers to investigate across multiple systems.
- Runbooks encoded institutional knowledge. But they were static, often stale, and required a human to execute them.
The Problems With This Model
| Problem | Impact |
|---|---|
| Reactive alerting | Incidents are discovered after users are already impacted |
| Manual forecasting | Capacity shortfalls and over-provisioning are both common |
| Alert fatigue | High false-positive rates cause engineers to ignore alerts |
| Knowledge silos | Only a few engineers understand the full performance profile |
| Slow response | Human escalation chains add minutes to hours of MTTR |
| Static thresholds | Seasonal and burst traffic patterns are not accounted for |
The model worked when systems were simpler and traffic patterns were predictable. Modern cloud-native applications — with microservices, serverless functions, multi-region deployments, and dynamic workloads — have made this approach untenable.
The New Model: Autonomous Capacity & Performance Engineering
Core Philosophy
Autonomous Capacity & Performance Engineering treats infrastructure as a continuously self-optimizing system. Rather than humans watching dashboards and reacting, AI agents:
- Predict future capacity needs using time-series forecasting and ML models
- Detect anomalies proactively before they manifest as user-facing issues
- Tune compute, storage, and network configurations in real time
- Mitigate risks like saturation, throttling, and cascading failures automatically
- Learn from each intervention to improve future decisions
The human role shifts from operator to engineer of the autonomous system itself — defining policies, reviewing agent decisions, and expanding the system’s capabilities.
The Agent Architecture
A mature autonomous capacity engineering system is composed of several specialized agents, each responsible for a domain:
┌─────────────────────────────────────────────────────────────────┐
│ Orchestrator Agent │
│ (Policy enforcement, cross-domain coordination) │
└──────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────┼──────────────────┐
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Compute Agent │ │ Storage Agent │ │ Network Agent │
│ │ │ │ │ │
│ - HPA tuning │ │ - IOPS right- │ │ - Bandwidth │
│ - Node group │ │ sizing │ │ reservation │
│ scaling │ │ - Volume │ │ - Latency │
│ - Spot/OD mix │ │ expansion │ │ optimization│
└───────────────┘ └───────────────┘ └───────────────┘
▲ ▲ ▲
└──────────────────┼──────────────────┘
│
┌──────────────────┼──────────────────┐
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Forecasting │ │ Anomaly │ │ Risk │
│ Agent │ │ Detection │ │ Assessment │
│ │ │ Agent │ │ Agent │
│ - Prophet │ │ │ │ │
│ - LSTM models │ │ - Isolation │ │ - Blast radius│
│ - Seasonality │ │ Forest │ │ estimation │
│ modeling │ │ - DBSCAN │ │ - Dependency │
└───────────────┘ └───────────────┘ │ graph walk │
└───────────────┘
Key Capabilities
1. Predictive Capacity Planning
Instead of quarterly spreadsheet exercises, ML models continuously ingest historical metrics and produce rolling capacity forecasts:
- Time-series forecasting (Prophet, LSTM, or AWS Forecast) predicts CPU, memory, and storage consumption 7–30 days ahead.
- Event-aware models incorporate known future events (product launches, marketing campaigns, seasonal peaks) as regressors.
- Automated provisioning adjusts node groups, reserved capacity, and savings plans based on predictions.
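To make the forecasting idea concrete, here is a deliberately simple seasonal-naive baseline in plain Python. It is a sketch, not the Prophet or LSTM models named above: it forecasts each future hour as the historical mean for that hour of day, scaled by the observed day-over-day growth. The function name and the synthetic traffic shape are illustrative.

```python
# Seasonal-naive capacity forecast: a simple stand-in for the Prophet/LSTM
# models mentioned above. Forecast each future hour as the mean of the same
# hour-of-day over the history, scaled by the recent growth trend.

def seasonal_naive_forecast(series, season=24, horizon=24):
    """Forecast `horizon` points ahead from `series` (one value per hour)."""
    if len(series) < 2 * season:
        raise ValueError("need at least two full seasons of history")
    # Mean value for each position within the season (each hour of day)
    seasonal_mean = [
        sum(series[i] for i in range(pos, len(series), season))
        / len(range(pos, len(series), season))
        for pos in range(season)
    ]
    # Crude trend: mean of the last season vs. the first season
    first = sum(series[:season]) / season
    last = sum(series[-season:]) / season
    growth = last / first if first else 1.0
    start = len(series) % season
    return [seasonal_mean[(start + h) % season] * growth for h in range(horizon)]

# Two days of synthetic history: busy daytime, quiet nights, 10% daily growth
day1 = [20 + 30 * (9 <= h <= 17) for h in range(24)]
day2 = [v * 1.1 for v in day1]
forecast = seasonal_naive_forecast(day1 + day2)
print(round(forecast[12], 1))  # midday forecast reflects the growth trend
```

A real agent would replace this baseline with a fitted model and feed the forecast into node-group and reserved-capacity decisions.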
2. Real-Time Performance Optimization
Agents monitor live telemetry and tune configurations without human intervention:
- HPA and KEDA parameter tuning: stabilization windows, scale-down delays, and target utilization are adjusted dynamically based on observed traffic patterns.
- Karpenter node provisioning: instance type selection, spot/on-demand mix, and consolidation policies are updated based on workload characteristics.
- JVM and runtime tuning: GC parameters, thread pool sizes, and connection pool limits are adjusted for services that expose JMX or application metrics.
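As one illustration of dynamic parameter tuning, the sketch below derives a scale-down stabilization window from traffic burstiness, measured as the coefficient of variation of recent request rates. The 300-second base, the multiplier, and the bounds are illustrative choices, not KEDA or HPA defaults.

```python
# Illustrative only: derive a scale-down stabilization window from observed
# traffic burstiness. Bursty workloads (high coefficient of variation) get a
# longer window to avoid flapping; steady workloads scale in quickly.
from statistics import mean, pstdev

def stabilization_window_seconds(rps_samples, base=300, floor=60, ceiling=1800):
    """Return a scale-down stabilization window scaled by burstiness."""
    mu = mean(rps_samples)
    if mu == 0:
        return ceiling  # no traffic signal: be conservative
    cv = pstdev(rps_samples) / mu  # coefficient of variation
    window = int(base * (1 + 2 * cv))
    return max(floor, min(ceiling, window))

steady = [100, 102, 98, 101, 99, 100]
bursty = [10, 250, 15, 300, 12, 280]
print(stabilization_window_seconds(steady))  # near the 300s base
print(stabilization_window_seconds(bursty))  # substantially longer
```

An agent would recompute this periodically and patch the HPA behavior block or KEDA `cooldownPeriod` accordingly.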
3. Anomaly Detection and Proactive Mitigation
Rather than waiting for a threshold breach, anomaly detection models identify unusual patterns early:
- Isolation Forest detects point anomalies in CPU, memory, request rates, and error rates.
- DBSCAN clustering identifies contextual anomalies (e.g., a latency spike that is unusual for the current hour but not globally).
- Changepoint detection (PELT algorithm) identifies shifts in the mean or variance of a metric signal that indicate a regime change.
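To show what changepoint detection is doing, here is a simplified single-changepoint detector in plain Python: it finds the split that most reduces the summed squared error of fitting a separate mean to each side. This is the core of what PELT generalizes to many changepoints with pruning; a production agent would use a library such as ruptures rather than this sketch.

```python
# Simplified illustration of changepoint detection: find the split point that
# minimizes the summed squared error of fitting a separate mean to each side.

def sse(xs):
    """Summed squared error around the segment mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def best_changepoint(signal, min_seg=3):
    """Return (index, cost_reduction) of the best single split, or (None, 0)."""
    base = sse(signal)
    best_i, best_gain = None, 0.0
    for i in range(min_seg, len(signal) - min_seg):
        gain = base - (sse(signal[:i]) + sse(signal[i:]))
        if gain > best_gain:
            best_i, best_gain = i, gain
    return best_i, best_gain

# Metric with a regime change: baseline ~10, then shifts to ~30 at index 12
signal = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 9,
          30, 31, 29, 30, 32, 30, 29, 31]
idx, gain = best_changepoint(signal)
print(idx)  # detected at the true changepoint, index 12
```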
When anomalies are detected, mitigation actions are taken automatically:
- Saturation risk: pre-warm additional capacity before utilization reaches the saturation point.
- Throttling risk: back-off or shed non-critical traffic, notify the application team.
- Cascading failure risk: open circuit breakers, trigger pod restarts, isolate unhealthy nodes.
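The saturation-risk case above can be sketched as a "time to saturation" check: extrapolate utilization along its recent linear trend and pre-warm when the projected breach falls inside the provisioning lead time. The 90% saturation point and 10-minute lead time are illustrative assumptions.

```python
# Sketch of the "pre-warm before saturation" decision: extrapolate current
# utilization along its recent linear trend; pre-warm capacity when the
# projected time to saturation falls inside the provisioning lead time.

def minutes_to_saturation(samples, interval_min=1.0, saturation=90.0):
    """Linear extrapolation of utilization samples taken every interval_min."""
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples)) / denom
    if slope <= 0:
        return float("inf")  # flat or falling: no saturation projected
    return (saturation - samples[-1]) / (slope / interval_min)

def should_prewarm(samples, lead_time_min=10.0):
    return minutes_to_saturation(samples) <= lead_time_min

climbing = [60, 64, 68, 72, 76, 80]  # +4%/min, ~2.5 min from the 90% point
flat = [55, 56, 54, 55, 56, 55]
print(should_prewarm(climbing), should_prewarm(flat))
```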
Proof of Concept: Autonomous Capacity Agent on AWS EKS with Datadog
This POC demonstrates a lightweight autonomous capacity agent that:
- Ingests metrics from Datadog via its API
- Runs an anomaly detection pipeline using Python and scikit-learn
- Makes scaling decisions and applies them via Kubernetes API and AWS APIs
- Uses KEDA and Karpenter as the execution layer
Architecture Overview
┌─────────────────────────────────────────────────────┐
│ AWS EKS Cluster │
│ │
│ ┌─────────────────┐ ┌──────────────────────────┐│
│ │ Capacity Agent │───▶│ Kubernetes API Server ││
│ │ (Python Pod) │ │ (HPA / ScaledObject ││
│ └────────┬────────┘ │ patch operations) ││
│ │ └──────────────────────────┘│
│ │ ┌──────────────────────────┐│
│ │ │ Karpenter NodePool ││
│ └────────────▶│ (instance type/size ││
│ │ adjustments via CRD) ││
│ └──────────────────────────┘│
└──────────────────────────────┬──────────────────────┘
│
┌────────────────┼────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌───────────────┐
│ Datadog │ │ AWS │ │ Prometheus / │
│ Metrics │ │ CloudWatch │ │ KEDA Scaler │
│ API │ │ Metrics │ │ │
└──────────────┘ └──────────────┘ └───────────────┘
Prerequisites
# Tools required
- AWS CLI v2 configured with EKS access
- kubectl configured for the target cluster
- Helm 3.x
- Python 3.11+
- Datadog account with API and APP keys
- Karpenter v0.36+ installed on the cluster
- KEDA v2.14+ installed on the cluster
Step 1: Install KEDA and Karpenter
# Install KEDA
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
--namespace keda \
--create-namespace \
--version 2.14.0
# Install Karpenter (assumes EKS with IRSA configured)
export CLUSTER_NAME="capacity-poc"
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export KARPENTER_VERSION="0.36.0"
helm registry login public.ecr.aws --username AWS --password $(aws ecr-public get-login-password --region us-east-1)
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
--version "${KARPENTER_VERSION}" \
--namespace karpenter \
--create-namespace \
--set "settings.clusterName=${CLUSTER_NAME}" \
--set "settings.interruptionQueue=${CLUSTER_NAME}" \
--set controller.resources.requests.cpu=1 \
--set controller.resources.requests.memory=1Gi \
--set controller.resources.limits.cpu=1 \
--set controller.resources.limits.memory=1Gi \
--wait
Step 2: Define a Karpenter NodePool
# nodepool.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
name: general
spec:
template:
spec:
requirements:
- key: kubernetes.io/arch
operator: In
values: ["amd64"]
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["c", "m", "r"]
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values: ["2"]
nodeClassRef:
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
name: general
limits:
cpu: 1000
memory: 1000Gi
disruption:
consolidationPolicy: WhenUnderutilized
consolidateAfter: 30s
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
name: general
spec:
amiFamily: AL2
role: "KarpenterNodeRole-${CLUSTER_NAME}"
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: "${CLUSTER_NAME}"
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: "${CLUSTER_NAME}"
# The manifest references ${CLUSTER_NAME}; substitute it before applying
envsubst < nodepool.yaml | kubectl apply -f -
Step 3: Deploy a Sample Application with a KEDA ScaledObject
# sample-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-api
namespace: default
spec:
replicas: 2
selector:
matchLabels:
app: web-api
template:
metadata:
labels:
app: web-api
spec:
containers:
- name: web-api
image: nginx:1.25
resources:
requests:
cpu: "250m"
memory: "256Mi"
limits:
cpu: "500m"
memory: "512Mi"
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: web-api-scaledobject
namespace: default
spec:
scaleTargetRef:
name: web-api
minReplicaCount: 2
maxReplicaCount: 50
cooldownPeriod: 60
pollingInterval: 15
triggers:
- type: datadog
metadata:
query: "avg:kubernetes.cpu.usage.total{kube_deployment:web-api}"
queryValue: "70"
type: "global"
age: "60"
authenticationRef:
name: datadog-trigger-auth
---
apiVersion: v1
kind: Secret
metadata:
name: datadog-secret
namespace: default
type: Opaque
stringData:
apiKey: "${DD_API_KEY}"
appKey: "${DD_APP_KEY}"
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
name: datadog-trigger-auth
namespace: default
spec:
secretTargetRef:
- parameter: apiKey
name: datadog-secret
key: apiKey
- parameter: appKey
name: datadog-secret
key: appKey
# Substitute ${DD_API_KEY} and ${DD_APP_KEY} before applying
envsubst < sample-app.yaml | kubectl apply -f -
Step 4: Deploy the Autonomous Capacity Agent
The agent is a Python application packaged as a Kubernetes CronJob. It runs every 5 minutes, pulls metrics from Datadog, runs anomaly detection, and remediates by patching the KEDA ScaledObject or cordoning pressured nodes so Karpenter replaces them. (The RBAC below also grants NodePool patch access as a hook for future extensions.)
capacity_agent/requirements.txt:
datadog-api-client==2.26.0
scikit-learn==1.4.2
numpy==1.26.4
pandas==2.2.2
boto3==1.34.69
kubernetes==29.0.0
prophet==1.1.5
capacity_agent/agent.py:
"""
Autonomous Capacity & Performance Engineering Agent
POC: Anomaly detection + auto-remediation on AWS EKS with Datadog
"""
import os
import logging
import json
from datetime import datetime, timedelta, timezone
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.metrics_api import MetricsApi
from kubernetes import client as k8s_client, config as k8s_config
import boto3
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("capacity-agent")
# ── Configuration ──────────────────────────────────────────────────────────────
DD_API_KEY = os.environ["DD_API_KEY"]
DD_APP_KEY = os.environ["DD_APP_KEY"]
DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")
NAMESPACE = os.environ.get("TARGET_NAMESPACE", "default")
SCALED_OBJ_NAME = os.environ.get("SCALED_OBJECT_NAME", "web-api-scaledobject")
NODEPOOL_NAME = os.environ.get("KARPENTER_NODEPOOL", "general")
AWS_REGION = os.environ.get("AWS_REGION", "us-east-1")
# Thresholds
ANOMALY_CONTAMINATION = float(os.environ.get("ANOMALY_CONTAMINATION", "0.05"))
SCALE_UP_QUERY_VALUE = int(os.environ.get("SCALE_UP_QUERY_VALUE", "60")) # tighten target
SCALE_DOWN_QUERY_VALUE = int(os.environ.get("SCALE_DOWN_QUERY_VALUE", "80")) # relax target
MAX_REPLICAS_CEILING = int(os.environ.get("MAX_REPLICAS_CEILING", "100"))
# ── Datadog metrics retrieval ──────────────────────────────────────────────────
def get_metric_series(metric_query: str, lookback_hours: int = 3) -> pd.DataFrame:
"""Fetch a Datadog metric time series for the past N hours."""
configuration = Configuration()
configuration.api_key["apiKeyAuth"] = DD_API_KEY
configuration.api_key["appKeyAuth"] = DD_APP_KEY
configuration.server_variables["site"] = DD_SITE
now = datetime.now(timezone.utc)
start = now - timedelta(hours=lookback_hours)
with ApiClient(configuration) as api_client:
api = MetricsApi(api_client)
resp = api.query_metrics(
_from=int(start.timestamp()),
to=int(now.timestamp()),
query=metric_query,
)
if not resp.series:
logger.warning("No data returned for query: %s", metric_query)
return pd.DataFrame()
series = resp.series[0]
records = [
{"timestamp": p[0], "value": p[1]}
for p in series.pointlist
if p[1] is not None
]
return pd.DataFrame(records)
# ── Anomaly detection ──────────────────────────────────────────────────────────
def detect_anomalies(df: pd.DataFrame) -> dict:
"""
Run Isolation Forest anomaly detection on the metric series.
Returns a summary dict with anomaly flag and severity score.
"""
if df.empty or len(df) < 10:
return {"anomaly": False, "score": 0.0, "latest_value": None}
values = df["value"].values.reshape(-1, 1)
clf = IsolationForest(contamination=ANOMALY_CONTAMINATION, random_state=42)
clf.fit(values)
# Score the most recent window (last 5 points)
recent_values = values[-5:]
scores = clf.decision_function(recent_values) # negative = more anomalous
labels = clf.predict(recent_values) # -1 = anomaly, 1 = normal
is_anomalous = any(l == -1 for l in labels)
severity = float(-np.min(scores)) # higher = more severe
latest_value = float(df["value"].iloc[-1])
return {
"anomaly": is_anomalous,
"score": severity,
"latest_value": latest_value,
"mean": float(df["value"].mean()),
"std": float(df["value"].std()),
}
# ── Capacity trend forecasting ─────────────────────────────────────────────────
def is_trending_up(df: pd.DataFrame, window: int = 10) -> bool:
"""Simple linear regression slope check over the last N points."""
if df.empty or len(df) < window:
return False
recent = df["value"].values[-window:]
x = np.arange(len(recent))
slope, _ = np.polyfit(x, recent, 1)
return slope > 0
# ── Kubernetes remediation ─────────────────────────────────────────────────────
def patch_scaled_object(query_value: int):
"""Adjust the KEDA ScaledObject's queryValue to tune scale-out aggressiveness."""
try:
k8s_config.load_incluster_config()
except k8s_config.ConfigException:
k8s_config.load_kube_config()
custom_api = k8s_client.CustomObjectsApi()
patch_body = {
"spec": {
"triggers": [
{
"type": "datadog",
"metadata": {
                        "query": "avg:kubernetes.cpu.usage.total{kube_deployment:web-api}",
"queryValue": str(query_value),
"type": "global",
"age": "60",
},
"authenticationRef": {"name": "datadog-trigger-auth"},
}
]
}
}
custom_api.patch_namespaced_custom_object(
group="keda.sh",
version="v1alpha1",
namespace=NAMESPACE,
plural="scaledobjects",
name=SCALED_OBJ_NAME,
body=patch_body,
)
logger.info("Patched ScaledObject %s: queryValue=%d", SCALED_OBJ_NAME, query_value)
def cordon_saturated_nodes():
    """
    Identify nodes reporting MemoryPressure or DiskPressure conditions and
    cordon them so that Karpenter can drain and replace them.
    """
try:
k8s_config.load_incluster_config()
except k8s_config.ConfigException:
k8s_config.load_kube_config()
core_api = k8s_client.CoreV1Api()
nodes = core_api.list_node()
for node in nodes.items:
        # Check node conditions for memory or disk pressure
conditions = {c.type: c.status for c in node.status.conditions}
if conditions.get("MemoryPressure") == "True" or conditions.get("DiskPressure") == "True":
node_name = node.metadata.name
if not node.spec.unschedulable:
core_api.patch_node(
node_name,
{"spec": {"unschedulable": True}},
)
logger.warning("Cordoned node %s due to resource pressure", node_name)
# ── AWS remediation ────────────────────────────────────────────────────────────
def send_cloudwatch_alarm_event(detail: dict):
"""Put a custom CloudWatch event for audit trail and downstream automation."""
events = boto3.client("events", region_name=AWS_REGION)
events.put_events(
Entries=[
{
"Source": "capacity.agent",
"DetailType": "AutonomousCapacityAction",
"Detail": json.dumps(detail),
"EventBusName": "default",
}
]
)
logger.info("Published CloudWatch event: %s", detail)
# ── Main agent loop ────────────────────────────────────────────────────────────
def run():
logger.info("=== Autonomous Capacity Agent: starting evaluation ===")
# 1. Fetch CPU utilization metrics for the target deployment
    # Aggregate across hosts: get_metric_series() analyzes a single series
    cpu_query = "avg:kubernetes.cpu.usage.total{kube_deployment:web-api}"
cpu_df = get_metric_series(cpu_query, lookback_hours=3)
mem_query = "avg:kubernetes.memory.usage{kube_deployment:web-api}"
mem_df = get_metric_series(mem_query, lookback_hours=3)
# 2. Run anomaly detection
cpu_result = detect_anomalies(cpu_df)
mem_result = detect_anomalies(mem_df)
logger.info("CPU analysis: %s", cpu_result)
logger.info("MEM analysis: %s", mem_result)
# 3. Decision logic
action_taken = "none"
if cpu_result["anomaly"] and is_trending_up(cpu_df):
# Anomalous AND trending up → tighten the KEDA scale trigger to scale out sooner
logger.warning("CPU anomaly detected with upward trend. Tightening scale-out trigger.")
patch_scaled_object(query_value=SCALE_UP_QUERY_VALUE)
cordon_saturated_nodes()
action_taken = "scale_out_aggressive"
    elif (
        not cpu_result["anomaly"]
        and cpu_result["latest_value"] is not None
        and cpu_result["latest_value"] < 30
    ):
# Low utilization, no anomaly → relax the trigger to allow scale-in
logger.info("CPU utilization low. Relaxing scale-in trigger.")
patch_scaled_object(query_value=SCALE_DOWN_QUERY_VALUE)
action_taken = "scale_in_relax"
elif mem_result["anomaly"]:
# Memory anomaly → cordon high-memory nodes, Karpenter will replace
logger.warning("Memory anomaly detected. Cordoning pressure nodes.")
cordon_saturated_nodes()
action_taken = "memory_pressure_cordon"
# 4. Publish audit event
send_cloudwatch_alarm_event(
{
"timestamp": datetime.now(timezone.utc).isoformat(),
"action": action_taken,
"cpu_result": cpu_result,
"mem_result": mem_result,
}
)
logger.info("=== Autonomous Capacity Agent: evaluation complete. Action: %s ===", action_taken)
if __name__ == "__main__":
run()
Step 5: Package and Deploy the Agent as a CronJob
capacity_agent/Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY agent.py .
CMD ["python", "agent.py"]
# Build and push to ECR
export ECR_REPO="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/capacity-agent"
aws ecr create-repository --repository-name capacity-agent --region ${AWS_REGION}
aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${ECR_REPO}
docker build -t capacity-agent ./capacity_agent
docker tag capacity-agent:latest ${ECR_REPO}:latest
docker push ${ECR_REPO}:latest
capacity-agent-cronjob.yaml:
apiVersion: batch/v1
kind: CronJob
metadata:
name: autonomous-capacity-agent
namespace: default
spec:
schedule: "*/5 * * * *" # Every 5 minutes
concurrencyPolicy: Forbid
jobTemplate:
spec:
template:
spec:
serviceAccountName: capacity-agent-sa
restartPolicy: OnFailure
containers:
- name: capacity-agent
image: "${ECR_REPO}:latest"
env:
- name: DD_API_KEY
valueFrom:
secretKeyRef:
name: datadog-secret
key: apiKey
- name: DD_APP_KEY
valueFrom:
secretKeyRef:
name: datadog-secret
key: appKey
- name: DD_SITE
value: "datadoghq.com"
- name: TARGET_NAMESPACE
value: "default"
- name: SCALED_OBJECT_NAME
value: "web-api-scaledobject"
- name: KARPENTER_NODEPOOL
value: "general"
- name: AWS_REGION
value: "us-east-1"
- name: ANOMALY_CONTAMINATION
value: "0.05"
resources:
requests:
cpu: "100m"
memory: "256Mi"
limits:
cpu: "500m"
memory: "512Mi"
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: capacity-agent-sa
namespace: default
annotations:
eks.amazonaws.com/role-arn: "arn:aws:iam::${AWS_ACCOUNT_ID}:role/CapacityAgentRole"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: capacity-agent-role
rules:
- apiGroups: ["keda.sh"]
resources: ["scaledobjects"]
verbs: ["get", "list", "patch", "update"]
- apiGroups: ["karpenter.sh"]
resources: ["nodepools"]
verbs: ["get", "list", "patch", "update"]
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "list", "patch"]
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: capacity-agent-binding
subjects:
- kind: ServiceAccount
name: capacity-agent-sa
namespace: default
roleRef:
kind: ClusterRole
name: capacity-agent-role
apiGroup: rbac.authorization.k8s.io
# Substitute ${ECR_REPO} and ${AWS_ACCOUNT_ID} before applying
envsubst < capacity-agent-cronjob.yaml | kubectl apply -f -
Step 6: AWS IAM Role for the Agent (IRSA)
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"events:PutEvents",
"cloudwatch:PutMetricData",
"cloudwatch:GetMetricData",
"cloudwatch:GetMetricStatistics"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"eks:DescribeCluster",
"eks:ListNodegroups",
"eks:UpdateNodegroupConfig"
],
"Resource": "arn:aws:eks:*:${AWS_ACCOUNT_ID}:cluster/${CLUSTER_NAME}"
}
]
}
Step 7: Observe the Agent in Action
# Watch the CronJob execute
kubectl get cronjob autonomous-capacity-agent -w
# Tail the agent logs (pods created by Jobs carry the auto-added job-name label)
kubectl logs -n default -l job-name --since=10m
# Observe KEDA ScaledObject changes
kubectl describe scaledobject web-api-scaledobject
# Watch Karpenter respond to cordoned nodes
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --follow
# Audit events go to the default EventBridge bus; to inspect them, create a
# rule matching source "capacity.agent" that targets a CloudWatch Logs group
aws events list-rules --event-bus-name default --region ${AWS_REGION}
Step 8: Datadog Monitor Integration
Configure a Datadog monitor that surfaces agent decisions alongside the capacity metrics for observability of the autonomous system itself:
# datadog_monitor_setup.py
import os

from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.monitors_api import MonitorsApi
from datadog_api_client.v1.model.monitor import Monitor
from datadog_api_client.v1.model.monitor_type import MonitorType
from datadog_api_client.v1.model.monitor_options import MonitorOptions

DD_API_KEY = os.environ["DD_API_KEY"]
DD_APP_KEY = os.environ["DD_APP_KEY"]

configuration = Configuration()
configuration.api_key["apiKeyAuth"] = DD_API_KEY
configuration.api_key["appKeyAuth"] = DD_APP_KEY
body = Monitor(
name="[Capacity Agent] CPU Anomaly - web-api",
type=MonitorType.METRIC_ALERT,
    query='avg(last_4h):anomalies(avg:kubernetes.cpu.usage.total{kube_deployment:web-api}, "basic", 2) >= 1',
message=(
"The Autonomous Capacity Agent has detected a CPU anomaly for web-api. "
"Automated remediation has been triggered. Review agent logs for details. "
"@slack-infra-alerts"
),
options=MonitorOptions(
notify_no_data=True,
no_data_timeframe=10,
evaluation_delay=60,
),
)
with ApiClient(configuration) as api_client:
api = MonitorsApi(api_client)
result = api.create_monitor(body)
print(f"Monitor created: {result.id}")
Comparing the Two Models: Before and After
| Dimension | Capacity & Performance Management | Autonomous Capacity & Performance Engineering |
|---|---|---|
| Primary interface | Dashboards, runbooks | AI agent decisions, policy configs |
| Alerting model | Threshold-based, reactive | Anomaly-based, predictive |
| Capacity planning | Quarterly spreadsheet exercise | Continuous ML forecasting (Prophet/LSTM) |
| Scale-out trigger | Static HPA/KEDA target | Dynamically tuned by agent based on patterns |
| Incident response | Human on-call, manual runbook | Agent detects, remediates, and audits |
| Node management | Manual drain/replace | Agent cordons + Karpenter replaces automatically |
| Audit trail | Jira tickets, Confluence pages | CloudWatch events, immutable agent log stream |
| Engineer role | Dashboard builder, alert responder | Agent designer, policy author, system reviewer |
| MTTR | Minutes to hours | Seconds to minutes |
| Over-provisioning | Common (safety buffers) | Minimized via right-sizing recommendations |
Risks and Guardrails
Autonomous systems require careful guardrails. A poorly configured agent that aggressively scales down during a real traffic spike, or that cordons healthy nodes, can worsen an incident. Essential controls include:
- Dry-run mode: The agent logs intended actions without applying them during an initial shadowing period.
- Blast radius limits: Maximum number of nodes cordoned per run, maximum replica change per cycle.
- Human approval gates: For high-severity anomalies, the agent creates a PagerDuty incident for human review before executing destructive actions.
- Rollback hooks: Every patch to a KEDA ScaledObject is accompanied by a snapshot of the prior configuration stored in a ConfigMap, enabling one-command rollback.
- Confidence thresholds: The Isolation Forest model must exceed a minimum anomaly score before triggering remediation to suppress low-confidence signals.
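The rollback-hook guardrail can be sketched as follows: before patching, the agent serializes the prior ScaledObject spec into a ConfigMap manifest. The naming scheme and label are illustrative conventions, not a KEDA or Kubernetes standard, and the actual API call is noted in a comment rather than executed here.

```python
# Sketch of the rollback-hook guardrail: before the agent patches a KEDA
# ScaledObject, it snapshots the prior spec into a ConfigMap so the change
# can be reverted with a single apply.
import json
from datetime import datetime, timezone

def snapshot_configmap(scaled_object_name: str, prior_spec: dict) -> dict:
    """Build a ConfigMap manifest capturing the prior ScaledObject spec."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"rollback-{scaled_object_name}-{stamp}",
            "labels": {"app.kubernetes.io/managed-by": "capacity-agent"},
        },
        "data": {"scaledobject-spec.json": json.dumps(prior_spec)},
    }

prior = {"triggers": [{"type": "datadog", "metadata": {"queryValue": "70"}}]}
cm = snapshot_configmap("web-api-scaledobject", prior)
# In the agent, this manifest would be created via
# CoreV1Api().create_namespaced_config_map(namespace, cm) before patching.
print(cm["metadata"]["name"])
```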
The Road Ahead
The POC above represents the first generation of autonomous capacity tooling. As the discipline matures, the trajectory points toward:
- LLM-augmented agents: Natural language explanations of every agent decision (“I tightened the scale trigger because CPU utilization spiked 40% above the 3-hour rolling average during a period with no corresponding traffic increase, suggesting a resource leak”).
- Cross-cluster awareness: Agents that coordinate capacity across multiple EKS clusters, regions, and even cloud providers.
- Cost optimization integration: AWS Cost Explorer and Kubecost APIs feeding into agent decisions, balancing performance SLOs against cost budgets.
- Self-improving models: Agent decisions and their outcomes feed back into model retraining pipelines, so anomaly detection improves continuously.
- Service mesh integration: Agents that tune Istio or Linkerd traffic weights in response to backend saturation, shedding load at the network layer before the application layer degrades.
Conclusion
The transformation from Infrastructure Capacity and Performance Management to Autonomous Capacity & Performance Engineering is not a distant future — it is happening now. The tools exist: Karpenter handles intelligent node provisioning, KEDA handles event-driven scaling, Datadog’s Anomaly Monitor and API provide the telemetry substrate, and Python’s scikit-learn ecosystem makes it accessible to build the intelligence layer.
The POC in this post demonstrates a working foundation. The KEDA ScaledObject queryValue tuning, Isolation Forest anomaly detection, and automatic node cordoning are all real, deployable patterns. From this foundation, teams can incrementally expand agent capabilities, add more sophisticated forecasting models, and connect additional remediation actions.
The engineers who thrive in this new era will be those who shift from watching dashboards to engineering the autonomous systems that watch the dashboards for them.