From Infrastructure Capacity & Performance Management to Autonomous Capacity & Performance Engineering

READER BEWARE: THE FOLLOWING WAS WRITTEN ENTIRELY BY AI WITHOUT HUMAN EDITING.

Introduction

For years, infrastructure capacity and performance management meant one thing: engineers watching dashboards. Teams would build elaborate Grafana panels, configure Datadog monitors, and staff on-call rotations waiting for alerts to fire. Thresholds were manually tuned, forecasting was done in spreadsheets, and capacity plans were quarterly rituals that were often outdated before the ink dried.

The discipline is undergoing a fundamental transformation. As AI capabilities mature and infrastructure complexity scales beyond human cognitive bandwidth, the role of “Infrastructure Capacity and Performance Management” is evolving into “Autonomous Capacity & Performance Engineering” — a model where intelligent agents replace reactive dashboards with predictive, closed-loop automation.

This post details what that transformation looks like, why it is happening now, and includes a concrete proof-of-concept (POC) using AWS, EKS, Datadog, KEDA, Karpenter, and open-source AI tooling that demonstrates how autonomous capacity and performance engineering works in practice.


The Old Model: Capacity and Performance Management

What It Looked Like

In the traditional model, capacity and performance management was fundamentally a human-in-the-loop discipline:

  • Dashboards and monitors were the primary interface. Engineers spent significant time building and maintaining Grafana or Datadog dashboards.
  • Alerting was threshold-based. A CPU utilization alert fires at 80%; an engineer investigates.
  • Capacity planning was periodic and manual. Teams projected growth from historical trends in spreadsheets, then submitted resource requests to procurement or cloud budgets.
  • Performance tuning required deep expert knowledge. Identifying whether a latency spike was caused by network saturation, database contention, or noisy neighbors required experienced engineers to investigate across multiple systems.
  • Runbooks encoded institutional knowledge. But they were static, often stale, and required a human to execute them.

The Problems With This Model

Problem              Impact
-------              ------
Reactive alerting    Incidents are discovered after users are already impacted
Manual forecasting   Capacity shortfalls and over-provisioning are both common
Alert fatigue        High false-positive rates cause engineers to ignore alerts
Knowledge silos      Only a few engineers understand the full performance profile
Slow response        Human escalation chains add minutes to hours of MTTR
Static thresholds    Seasonal and burst traffic patterns are not accounted for

The model worked when systems were simpler and traffic patterns were predictable. Modern cloud-native applications — with microservices, serverless functions, multi-region deployments, and dynamic workloads — have made this approach untenable.


The New Model: Autonomous Capacity & Performance Engineering

Core Philosophy

Autonomous Capacity & Performance Engineering treats infrastructure as a continuously self-optimizing system. Rather than humans watching dashboards and reacting, AI agents:

  1. Predict future capacity needs using time-series forecasting and ML models
  2. Detect anomalies proactively before they manifest as user-facing issues
  3. Tune compute, storage, and network configurations in real time
  4. Mitigate risks like saturation, throttling, and cascading failures automatically
  5. Learn from each intervention to improve future decisions

The human role shifts from operator to engineer of the autonomous system itself — defining policies, reviewing agent decisions, and expanding the system’s capabilities.
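At its core, this is a closed observe-decide-act loop. A minimal sketch follows; every name in it (Decision, control_loop, the 1.5x threshold) is illustrative rather than drawn from any real framework, and the threshold check merely stands in for the ML models discussed below:

```python
# Illustrative skeleton of the autonomous loop described above. The fetch
# and act layers are omitted; simple threshold logic stands in for real models.
from dataclasses import dataclass

@dataclass
class Decision:
    action: str        # e.g. "scale_out" or "none"
    reason: str
    confidence: float  # gates whether the action is actually applied

def control_loop(metrics: list[float]) -> Decision:
    """One evaluation cycle: predict, detect anomalies, decide."""
    if not metrics:
        return Decision("none", "no data", 0.0)
    mean = sum(metrics) / len(metrics)
    latest = metrics[-1]
    if latest > 1.5 * mean:  # crude anomaly test standing in for an ML model
        return Decision("scale_out", f"latest {latest:.0f} well above mean {mean:.0f}", 0.9)
    return Decision("none", "within normal range", 0.95)

decision = control_loop([40, 42, 41, 43, 90])
print(decision.action)  # scale_out
```

A real agent would wrap this loop with the policy and approval machinery described later, but the shape — metrics in, a scored decision out — is the same.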

The Agent Architecture

A mature autonomous capacity engineering system is composed of several specialized agents, each responsible for a domain:

┌─────────────────────────────────────────────────────────────────┐
│                   Orchestrator Agent                             │
│          (Policy enforcement, cross-domain coordination)         │
└──────────────────────────┬──────────────────────────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ Compute Agent │  │ Storage Agent │  │ Network Agent │
│               │  │               │  │               │
│ - HPA tuning  │  │ - IOPS right- │  │ - Bandwidth   │
│ - Node group  │  │   sizing      │  │   reservation │
│   scaling     │  │ - Volume      │  │ - Latency     │
│ - Spot/OD mix │  │   expansion   │  │   optimization│
└───────────────┘  └───────────────┘  └───────────────┘
        ▲                  ▲                  ▲
        └──────────────────┼──────────────────┘
                           │
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                  ▼
┌───────────────┐  ┌───────────────┐  ┌───────────────┐
│ Forecasting   │  │ Anomaly       │  │ Risk          │
│ Agent         │  │ Detection     │  │ Assessment    │
│               │  │ Agent         │  │ Agent         │
│ - Prophet     │  │               │  │               │
│ - LSTM models │  │ - Isolation   │  │ - Blast radius│
│ - Seasonality │  │   Forest      │  │   estimation  │
│   modeling    │  │ - DBSCAN      │  │ - Dependency  │
└───────────────┘  └───────────────┘  │   graph walk  │
                                       └───────────────┘

Key Capabilities

1. Predictive Capacity Planning

Instead of quarterly spreadsheet exercises, ML models continuously ingest historical metrics and produce rolling capacity forecasts:

  • Time-series forecasting (Prophet, LSTM, or AWS Forecast) predicts CPU, memory, and storage consumption 7–30 days ahead.
  • Event-aware models incorporate known future events (product launches, marketing campaigns, seasonal peaks) as regressors.
  • Automated provisioning adjusts node groups, reserved capacity, and savings plans based on predictions.
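As a simplified stand-in for a Prophet or LSTM pipeline, the event-aware idea can be sketched with an ordinary least-squares fit over a trend term, day-of-week seasonality, and a binary event regressor. All data below is synthetic, and the model is deliberately crude — the point is only how known future events enter the forecast:

```python
import numpy as np

def forecast(history: np.ndarray, events: np.ndarray, horizon: int,
             future_events: np.ndarray) -> np.ndarray:
    """Fit trend + day-of-week seasonality + event regressor, then extrapolate."""
    n = len(history)
    t = np.arange(n)
    def design(t, ev):
        dow = np.eye(7)[t % 7]                # one-hot day-of-week seasonality
        return np.column_stack([t, ev, dow])  # trend, event flag, seasonality
    coef, *_ = np.linalg.lstsq(design(t, events), history, rcond=None)
    t_future = np.arange(n, n + horizon)
    return design(t_future, future_events) @ coef

# Synthetic example: 8 weeks of daily CPU-core demand with weekend dips and
# one past launch day; forecast the next 7 days with a launch on day 3.
rng = np.random.default_rng(0)
t = np.arange(56)
history = 100 + 0.5 * t + np.where(t % 7 >= 5, -20.0, 0.0) + rng.normal(0, 2, 56)
events = np.zeros(56); events[40] = 1; history[40] += 50
future_events = np.zeros(7); future_events[3] = 1
pred = forecast(history, events, 7, future_events)
print(pred.round(1))
```

The learned event coefficient lifts the day-3 prediction, and the seasonality terms reproduce the weekend dip — the same mechanism Prophet exposes through `add_regressor`, with far more statistical machinery behind it.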

2. Real-Time Performance Optimization

Agents monitor live telemetry and tune configurations without human intervention:

  • HPA and KEDA parameter tuning: stabilization windows, scale-down delays, and target utilization are adjusted dynamically based on observed traffic patterns.
  • Karpenter node provisioning: instance type selection, spot/on-demand mix, and consolidation policies are updated based on workload characteristics.
  • JVM and runtime tuning: GC parameters, thread pool sizes, and connection pool limits are adjusted for services that expose JMX or application metrics.
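As one illustrative heuristic for the first bullet (this mapping is an assumption of the sketch, not a KEDA feature), a scale-down cooldown can be derived from observed traffic burstiness:

```python
import statistics

def suggest_cooldown(request_rates: list[float],
                     min_s: int = 60, max_s: int = 600) -> int:
    """
    Map traffic burstiness (coefficient of variation of the request rate)
    onto a KEDA cooldownPeriod: the burstier the traffic, the longer the
    scale-down delay, so the workload is not scaled in right before the
    next burst.
    """
    if len(request_rates) < 2:
        return max_s  # not enough signal: stay conservative
    mean = statistics.fmean(request_rates)
    if mean == 0:
        return min_s
    cv = statistics.stdev(request_rates) / mean  # burstiness measure
    return int(min_s + (max_s - min_s) * min(cv, 1.0))

print(suggest_cooldown([100.0] * 20))            # steady traffic -> 60
print(suggest_cooldown([10, 500, 12, 480, 11]))  # bursty traffic -> 600
```

An agent would compute this from recent telemetry each cycle and patch the value into the ScaledObject spec, exactly as the POC below does for queryValue.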

3. Anomaly Detection and Proactive Mitigation

Rather than waiting for a threshold breach, anomaly detection models identify unusual patterns early:

  • Isolation Forest detects point anomalies in CPU, memory, request rates, and error rates.
  • DBSCAN clustering identifies contextual anomalies (e.g., a latency spike that is unusual for the current hour but not globally).
  • Changepoint detection (PELT algorithm) identifies shifts in the mean or variance of a metric signal that indicate a regime change.
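The contextual-anomaly case can be illustrated with scikit-learn's DBSCAN on synthetic (hour-of-day, latency) pairs: a 120 ms reading is normal at midday but stands out at 3 AM. The data and parameters below are illustrative only:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic telemetry: latency sits near 50 ms overnight and 120 ms at midday.
hours = rng.integers(0, 24, 300)
latency = np.where((hours >= 10) & (hours <= 14),
                   rng.normal(120.0, 5.0, 300),
                   rng.normal(50.0, 5.0, 300))
# One contextual anomaly: 120 ms at 3 AM is normal globally, abnormal for 3 AM.
hours = np.append(hours, 3)
latency = np.append(latency, 120.0)

X = StandardScaler().fit_transform(np.column_stack([hours, latency]))
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)  # -1 marks noise points
print("3 AM spike flagged as noise:", labels[-1] == -1)
```

Because the hour of day is part of the feature vector, the dense midday 120 ms cluster and the overnight 50 ms cluster are both "normal," while the 3 AM spike falls outside every cluster — something a global threshold on latency alone would miss.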

When anomalies are detected, mitigation actions are taken automatically:

  • Saturation risk: pre-warm additional capacity before utilization reaches the saturation point.
  • Throttling risk: back-off or shed non-critical traffic, notify the application team.
  • Cascading failure risk: open circuit breakers, trigger pod restarts, isolate unhealthy nodes.

Proof of Concept: Autonomous Capacity Agent on AWS EKS with Datadog

This POC demonstrates a lightweight autonomous capacity agent that:

  1. Ingests metrics from Datadog via its API
  2. Runs an anomaly detection pipeline using Python and scikit-learn
  3. Makes scaling decisions and applies them via Kubernetes API and AWS APIs
  4. Uses KEDA and Karpenter as the execution layer

Architecture Overview

┌─────────────────────────────────────────────────────┐
│                  AWS EKS Cluster                     │
│                                                      │
│  ┌─────────────────┐    ┌──────────────────────────┐│
│  │ Capacity Agent  │───▶│  Kubernetes API Server   ││
│  │  (Python Pod)   │    │  (HPA / ScaledObject     ││
│  └────────┬────────┘    │   patch operations)      ││
│           │             └──────────────────────────┘│
│           │             ┌──────────────────────────┐│
│           │             │   Karpenter NodePool     ││
│           └────────────▶│   (instance type/size    ││
│                         │    adjustments via CRD)  ││
│                         └──────────────────────────┘│
└──────────────────────────────┬──────────────────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
     ┌──────────────┐ ┌──────────────┐ ┌───────────────┐
     │   Datadog    │ │   AWS        │ │  Prometheus / │
     │   Metrics    │ │   CloudWatch │ │  KEDA Scaler  │
     │   API        │ │   Metrics    │ │               │
     └──────────────┘ └──────────────┘ └───────────────┘

Prerequisites

# Tools required
- AWS CLI v2 configured with EKS access
- kubectl configured for the target cluster
- Helm 3.x
- Python 3.11+
- Datadog account with API and APP keys
- Karpenter v0.36+ installed on the cluster
- KEDA v2.14+ installed on the cluster

Step 1: Install KEDA and Karpenter

# Install KEDA
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace \
  --version 2.14.0

# Install Karpenter (assumes EKS with IRSA configured)
export CLUSTER_NAME="capacity-poc"
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export KARPENTER_VERSION="0.36.0"

aws ecr-public get-login-password --region us-east-1 | \
  helm registry login public.ecr.aws --username AWS --password-stdin

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "${KARPENTER_VERSION}" \
  --namespace karpenter \
  --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

Step 2: Define a Karpenter NodePool

# nodepool.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: general
  limits:
    cpu: 1000
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    consolidateAfter: 30s
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: general
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-${CLUSTER_NAME}"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
kubectl apply -f nodepool.yaml

Step 3: Deploy a Sample Application with a KEDA ScaledObject

# sample-app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      containers:
        - name: web-api
          image: nginx:1.25
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: web-api-scaledobject
  namespace: default
spec:
  scaleTargetRef:
    name: web-api
  minReplicaCount: 2
  maxReplicaCount: 50
  cooldownPeriod: 60
  pollingInterval: 15
  triggers:
    - type: datadog
      metadata:
        query: "avg:kubernetes.cpu.usage.total{kube_deployment:web-api}"
        queryValue: "70"
        type: "global"
        age: "60"
      authenticationRef:
        name: datadog-trigger-auth
---
apiVersion: v1
kind: Secret
metadata:
  name: datadog-secret
  namespace: default
type: Opaque
stringData:
  apiKey: "${DD_API_KEY}"
  appKey: "${DD_APP_KEY}"
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: datadog-trigger-auth
  namespace: default
spec:
  secretTargetRef:
    - parameter: apiKey
      name: datadog-secret
      key: apiKey
    - parameter: appKey
      name: datadog-secret
      key: appKey
kubectl apply -f sample-app.yaml

Step 4: Deploy the Autonomous Capacity Agent

The agent is a Python application packaged as a Kubernetes CronJob. It runs every 5 minutes, pulls metrics from Datadog, runs anomaly detection, and patches KEDA ScaledObjects or Karpenter NodePools as needed.

capacity_agent/requirements.txt:

datadog-api-client==2.26.0
scikit-learn==1.4.2
numpy==1.26.4
pandas==2.2.2
boto3==1.34.69
kubernetes==29.0.0
prophet==1.1.5

capacity_agent/agent.py:

"""
Autonomous Capacity & Performance Engineering Agent
POC: Anomaly detection + auto-remediation on AWS EKS with Datadog
"""

import os
import logging
import json
from datetime import datetime, timedelta, timezone

import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.metrics_api import MetricsApi
from kubernetes import client as k8s_client, config as k8s_config
import boto3

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("capacity-agent")

# ── Configuration ──────────────────────────────────────────────────────────────

DD_API_KEY  = os.environ["DD_API_KEY"]
DD_APP_KEY  = os.environ["DD_APP_KEY"]
DD_SITE     = os.environ.get("DD_SITE", "datadoghq.com")

NAMESPACE        = os.environ.get("TARGET_NAMESPACE", "default")
SCALED_OBJ_NAME  = os.environ.get("SCALED_OBJECT_NAME", "web-api-scaledobject")
NODEPOOL_NAME    = os.environ.get("KARPENTER_NODEPOOL", "general")
AWS_REGION       = os.environ.get("AWS_REGION", "us-east-1")

# Thresholds
ANOMALY_CONTAMINATION  = float(os.environ.get("ANOMALY_CONTAMINATION", "0.05"))
SCALE_UP_QUERY_VALUE   = int(os.environ.get("SCALE_UP_QUERY_VALUE", "60"))   # tighten target
SCALE_DOWN_QUERY_VALUE = int(os.environ.get("SCALE_DOWN_QUERY_VALUE", "80")) # relax target
MAX_REPLICAS_CEILING   = int(os.environ.get("MAX_REPLICAS_CEILING", "100"))


# ── Datadog metrics retrieval ──────────────────────────────────────────────────

def get_metric_series(metric_query: str, lookback_hours: int = 3) -> pd.DataFrame:
    """Fetch a Datadog metric time series for the past N hours."""
    configuration = Configuration()
    configuration.api_key["apiKeyAuth"] = DD_API_KEY
    configuration.api_key["appKeyAuth"] = DD_APP_KEY
    configuration.server_variables["site"] = DD_SITE

    now     = datetime.now(timezone.utc)
    start   = now - timedelta(hours=lookback_hours)

    with ApiClient(configuration) as api_client:
        api = MetricsApi(api_client)
        resp = api.query_metrics(
            _from=int(start.timestamp()),
            to=int(now.timestamp()),
            query=metric_query,
        )

    if not resp.series:
        logger.warning("No data returned for query: %s", metric_query)
        return pd.DataFrame()

    series = resp.series[0]
    records = [
        {"timestamp": p[0], "value": p[1]}
        for p in series.pointlist
        if p[1] is not None
    ]
    return pd.DataFrame(records)


# ── Anomaly detection ──────────────────────────────────────────────────────────

def detect_anomalies(df: pd.DataFrame) -> dict:
    """
    Run Isolation Forest anomaly detection on the metric series.
    Returns a summary dict with anomaly flag and severity score.
    """
    if df.empty or len(df) < 10:
        return {"anomaly": False, "score": 0.0, "latest_value": None}

    values = df["value"].values.reshape(-1, 1)
    clf = IsolationForest(contamination=ANOMALY_CONTAMINATION, random_state=42)
    clf.fit(values)

    # Score the most recent window (last 5 points)
    recent_values = values[-5:]
    scores  = clf.decision_function(recent_values)   # negative = more anomalous
    labels  = clf.predict(recent_values)              # -1 = anomaly, 1 = normal

    is_anomalous  = any(l == -1 for l in labels)
    severity      = float(-np.min(scores))            # higher = more severe
    latest_value  = float(df["value"].iloc[-1])

    return {
        "anomaly":       is_anomalous,
        "score":         severity,
        "latest_value":  latest_value,
        "mean":          float(df["value"].mean()),
        "std":           float(df["value"].std()),
    }


# ── Capacity trend forecasting ─────────────────────────────────────────────────

def is_trending_up(df: pd.DataFrame, window: int = 10) -> bool:
    """Simple linear regression slope check over the last N points."""
    if df.empty or len(df) < window:
        return False
    recent = df["value"].values[-window:]
    x = np.arange(len(recent))
    slope, _ = np.polyfit(x, recent, 1)
    return slope > 0


# ── Kubernetes remediation ─────────────────────────────────────────────────────

def patch_scaled_object(query_value: int):
    """Adjust the KEDA ScaledObject's queryValue to tune scale-out aggressiveness."""
    try:
        k8s_config.load_incluster_config()
    except k8s_config.ConfigException:
        k8s_config.load_kube_config()

    custom_api = k8s_client.CustomObjectsApi()

    patch_body = {
        "spec": {
            "triggers": [
                {
                    "type": "datadog",
                    "metadata": {
                        "query": "avg:kubernetes.cpu.usage.total{kube_deployment:web-api}",
                        "queryValue": str(query_value),
                        "type": "global",
                        "age": "60",
                    },
                    "authenticationRef": {"name": "datadog-trigger-auth"},
                }
            ]
        }
    }

    custom_api.patch_namespaced_custom_object(
        group="keda.sh",
        version="v1alpha1",
        namespace=NAMESPACE,
        plural="scaledobjects",
        name=SCALED_OBJ_NAME,
        body=patch_body,
    )
    logger.info("Patched ScaledObject %s: queryValue=%d", SCALED_OBJ_NAME, query_value)


def cordon_saturated_nodes():
    """
    Identify nodes reporting MemoryPressure or DiskPressure conditions and
    cordon them so that Karpenter can drain and replace them.
    """
    try:
        k8s_config.load_incluster_config()
    except k8s_config.ConfigException:
        k8s_config.load_kube_config()

    core_api = k8s_client.CoreV1Api()
    nodes    = core_api.list_node()

    for node in nodes.items:
        # Read allocatable and check conditions
        conditions = {c.type: c.status for c in node.status.conditions}
        if conditions.get("MemoryPressure") == "True" or conditions.get("DiskPressure") == "True":
            node_name = node.metadata.name
            if not node.spec.unschedulable:
                core_api.patch_node(
                    node_name,
                    {"spec": {"unschedulable": True}},
                )
                logger.warning("Cordoned node %s due to resource pressure", node_name)


# ── AWS remediation ────────────────────────────────────────────────────────────

def send_cloudwatch_alarm_event(detail: dict):
    """Put a custom CloudWatch event for audit trail and downstream automation."""
    events = boto3.client("events", region_name=AWS_REGION)
    events.put_events(
        Entries=[
            {
                "Source":       "capacity.agent",
                "DetailType":   "AutonomousCapacityAction",
                "Detail":       json.dumps(detail),
                "EventBusName": "default",
            }
        ]
    )
    logger.info("Published CloudWatch event: %s", detail)


# ── Main agent loop ────────────────────────────────────────────────────────────

def run():
    logger.info("=== Autonomous Capacity Agent: starting evaluation ===")

    # 1. Fetch CPU utilization metrics for the target deployment
    cpu_query = "avg:kubernetes.cpu.usage.total{kube_deployment:web-api}"
    cpu_df    = get_metric_series(cpu_query, lookback_hours=3)

    mem_query = "avg:kubernetes.memory.usage{kube_deployment:web-api}"
    mem_df    = get_metric_series(mem_query, lookback_hours=3)

    # 2. Run anomaly detection
    cpu_result = detect_anomalies(cpu_df)
    mem_result = detect_anomalies(mem_df)

    logger.info("CPU analysis: %s", cpu_result)
    logger.info("MEM analysis: %s", mem_result)

    # 3. Decision logic
    action_taken = "none"

    if cpu_result["anomaly"] and is_trending_up(cpu_df):
        # Anomalous AND trending up → tighten the KEDA scale trigger to scale out sooner
        logger.warning("CPU anomaly detected with upward trend. Tightening scale-out trigger.")
        patch_scaled_object(query_value=SCALE_UP_QUERY_VALUE)
        cordon_saturated_nodes()
        action_taken = "scale_out_aggressive"

    elif (not cpu_result["anomaly"]
          and cpu_result["latest_value"] is not None
          and cpu_result["latest_value"] < 30):
        # Low utilization, no anomaly → relax the trigger to allow scale-in
        logger.info("CPU utilization low. Relaxing scale-in trigger.")
        patch_scaled_object(query_value=SCALE_DOWN_QUERY_VALUE)
        action_taken = "scale_in_relax"

    elif mem_result["anomaly"]:
        # Memory anomaly → cordon high-memory nodes, Karpenter will replace
        logger.warning("Memory anomaly detected. Cordoning pressure nodes.")
        cordon_saturated_nodes()
        action_taken = "memory_pressure_cordon"

    # 4. Publish audit event
    send_cloudwatch_alarm_event(
        {
            "timestamp":    datetime.now(timezone.utc).isoformat(),
            "action":       action_taken,
            "cpu_result":   cpu_result,
            "mem_result":   mem_result,
        }
    )

    logger.info("=== Autonomous Capacity Agent: evaluation complete. Action: %s ===", action_taken)


if __name__ == "__main__":
    run()

Step 5: Package and Deploy the Agent as a CronJob

capacity_agent/Dockerfile:

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY agent.py .
CMD ["python", "agent.py"]
# Build and push to ECR
export ECR_REPO="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com/capacity-agent"
aws ecr create-repository --repository-name capacity-agent --region ${AWS_REGION}
aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${ECR_REPO}

docker build -t capacity-agent ./capacity_agent
docker tag capacity-agent:latest ${ECR_REPO}:latest
docker push ${ECR_REPO}:latest

capacity-agent-cronjob.yaml:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: autonomous-capacity-agent
  namespace: default
spec:
  schedule: "*/5 * * * *"   # Every 5 minutes
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: capacity-agent-sa
          restartPolicy: OnFailure
          containers:
            - name: capacity-agent
              image: "${ECR_REPO}:latest"
              env:
                - name: DD_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: datadog-secret
                      key: apiKey
                - name: DD_APP_KEY
                  valueFrom:
                    secretKeyRef:
                      name: datadog-secret
                      key: appKey
                - name: DD_SITE
                  value: "datadoghq.com"
                - name: TARGET_NAMESPACE
                  value: "default"
                - name: SCALED_OBJECT_NAME
                  value: "web-api-scaledobject"
                - name: KARPENTER_NODEPOOL
                  value: "general"
                - name: AWS_REGION
                  value: "us-east-1"
                - name: ANOMALY_CONTAMINATION
                  value: "0.05"
              resources:
                requests:
                  cpu: "100m"
                  memory: "256Mi"
                limits:
                  cpu: "500m"
                  memory: "512Mi"
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: capacity-agent-sa
  namespace: default
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::${AWS_ACCOUNT_ID}:role/CapacityAgentRole"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: capacity-agent-role
rules:
  - apiGroups: ["keda.sh"]
    resources: ["scaledobjects"]
    verbs: ["get", "list", "patch", "update"]
  - apiGroups: ["karpenter.sh"]
    resources: ["nodepools"]
    verbs: ["get", "list", "patch", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "patch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: capacity-agent-binding
subjects:
  - kind: ServiceAccount
    name: capacity-agent-sa
    namespace: default
roleRef:
  kind: ClusterRole
  name: capacity-agent-role
  apiGroup: rbac.authorization.k8s.io
kubectl apply -f capacity-agent-cronjob.yaml

Step 6: AWS IAM Role for the Agent (IRSA)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "events:PutEvents",
        "cloudwatch:PutMetricData",
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "eks:DescribeCluster",
        "eks:ListNodegroups",
        "eks:UpdateNodegroupConfig"
      ],
      "Resource": "arn:aws:eks:*:${AWS_ACCOUNT_ID}:cluster/${CLUSTER_NAME}"
    }
  ]
}

Step 7: Observe the Agent in Action

# Watch the CronJob execute
kubectl get cronjob autonomous-capacity-agent -w

# Tail the logs of the most recent agent job
kubectl logs -n default "$(kubectl get jobs -n default \
  --sort-by=.metadata.creationTimestamp -o name | tail -n 1)"

# Observe KEDA ScaledObject changes
kubectl describe scaledobject web-api-scaledobject

# Watch Karpenter respond to cordoned nodes
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --follow

# Audit events land on the default EventBridge bus; list the rules that capture them
aws events list-rules --event-bus-name default --region ${AWS_REGION}

Step 8: Datadog Monitor Integration

Configure a Datadog monitor that surfaces agent decisions alongside the capacity metrics for observability of the autonomous system itself:

# datadog_monitor_setup.py
import os

from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.monitors_api import MonitorsApi
from datadog_api_client.v1.model.monitor import Monitor
from datadog_api_client.v1.model.monitor_type import MonitorType
from datadog_api_client.v1.model.monitor_options import MonitorOptions

configuration = Configuration()
configuration.api_key["apiKeyAuth"] = os.environ["DD_API_KEY"]
configuration.api_key["appKeyAuth"] = os.environ["DD_APP_KEY"]

body = Monitor(
    name="[Capacity Agent] CPU Anomaly - web-api",
    type=MonitorType.METRIC_ALERT,
    query='anomalies(avg:kubernetes.cpu.usage.total{kube_deployment:web-api}, "basic", 2) >= 1',
    message=(
        "The Autonomous Capacity Agent has detected a CPU anomaly for web-api. "
        "Automated remediation has been triggered. Review agent logs for details. "
        "@slack-infra-alerts"
    ),
    options=MonitorOptions(
        notify_no_data=True,
        no_data_timeframe=10,
        evaluation_delay=60,
    ),
)

with ApiClient(configuration) as api_client:
    api = MonitorsApi(api_client)
    result = api.create_monitor(body)
    print(f"Monitor created: {result.id}")

Comparing the Two Models: Before and After

Dimension          | Capacity & Performance Management  | Autonomous Capacity & Performance Engineering
-------------------|------------------------------------|----------------------------------------------
Primary interface  | Dashboards, runbooks               | AI agent decisions, policy configs
Alerting model     | Threshold-based, reactive          | Anomaly-based, predictive
Capacity planning  | Quarterly spreadsheet exercise     | Continuous ML forecasting (Prophet/LSTM)
Scale-out trigger  | Static HPA/KEDA target             | Dynamically tuned by agent based on patterns
Incident response  | Human on-call, manual runbook      | Agent detects, remediates, and audits
Node management    | Manual drain/replace               | Agent cordons + Karpenter replaces automatically
Audit trail        | Jira tickets, Confluence pages     | CloudWatch events, immutable agent log stream
Engineer role      | Dashboard builder, alert responder | Agent designer, policy author, system reviewer
MTTR               | Minutes to hours                   | Seconds to minutes
Over-provisioning  | Common (safety buffers)            | Minimized via right-sizing recommendations

Risks and Guardrails

Autonomous systems require careful guardrails. A poorly configured agent that aggressively scales down during a real traffic spike, or that cordons healthy nodes, can worsen an incident. Essential controls include:

  • Dry-run mode: The agent logs intended actions without applying them during an initial shadowing period.
  • Blast radius limits: Maximum number of nodes cordoned per run, maximum replica change per cycle.
  • Human approval gates: For high-severity anomalies, the agent creates a PagerDuty incident for human review before executing destructive actions.
  • Rollback hooks: Every patch to a KEDA ScaledObject is accompanied by a snapshot of the prior configuration stored in a ConfigMap, enabling one-command rollback.
  • Confidence thresholds: The Isolation Forest model must exceed a minimum anomaly score before triggering remediation to suppress low-confidence signals.
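A minimal sketch of how the dry-run, blast-radius, and confidence guardrails might compose. `Guardrails` and the `apply_action` callable are hypothetical names for this sketch, not part of the POC above:

```python
import logging

logger = logging.getLogger("capacity-agent.guardrails")

class Guardrails:
    """Wraps remediation actions with dry-run, confidence, and blast-radius checks."""

    def __init__(self, dry_run: bool = True, max_actions_per_run: int = 3,
                 min_confidence: float = 0.7):
        self.dry_run = dry_run
        self.max_actions = max_actions_per_run
        self.min_confidence = min_confidence
        self.actions_taken = 0

    def execute(self, action_name: str, confidence: float, apply_action) -> bool:
        """Run apply_action() only if every guardrail passes; return True if applied."""
        if confidence < self.min_confidence:
            logger.info("skip %s: confidence %.2f below threshold", action_name, confidence)
            return False
        if self.actions_taken >= self.max_actions:
            logger.warning("skip %s: blast-radius limit reached this run", action_name)
            return False
        if self.dry_run:
            logger.info("[dry-run] would apply %s", action_name)
            return False
        apply_action()
        self.actions_taken += 1
        return True

# Shadow period: log intended actions without applying them.
guard = Guardrails(dry_run=True)
applied = guard.execute("cordon_node", confidence=0.9, apply_action=lambda: None)
print(applied)  # False: dry-run mode never applies
```

In the POC, `patch_scaled_object` and `cordon_saturated_nodes` would be passed in as the `apply_action` callables, so every remediation path flows through the same gate.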

The Road Ahead

The POC above represents the first generation of autonomous capacity tooling. As the discipline matures, the trajectory points toward:

  • LLM-augmented agents: Natural language explanations of every agent decision (“I tightened the scale trigger because CPU utilization spiked 40% above the 3-hour rolling average during a period with no corresponding traffic increase, suggesting a resource leak”).
  • Cross-cluster awareness: Agents that coordinate capacity across multiple EKS clusters, regions, and even cloud providers.
  • Cost optimization integration: AWS Cost Explorer and Kubecost APIs feeding into agent decisions, balancing performance SLOs against cost budgets.
  • Self-improving models: Agent decisions and their outcomes feed back into model retraining pipelines, so anomaly detection improves continuously.
  • Service mesh integration: Agents that tune Istio or Linkerd traffic weights in response to backend saturation, shedding load at the network layer before the application layer degrades.

Conclusion

The transformation from Infrastructure Capacity and Performance Management to Autonomous Capacity & Performance Engineering is not a distant future — it is happening now. The tools exist: Karpenter handles intelligent node provisioning, KEDA handles event-driven scaling, Datadog’s Anomaly Monitor and API provide the telemetry substrate, and Python’s scikit-learn ecosystem makes it accessible to build the intelligence layer.

The POC in this post demonstrates a working foundation. The KEDA ScaledObject queryValue tuning, Isolation Forest anomaly detection, and automatic node cordoning are all real, deployable patterns. From this foundation, teams can incrementally expand agent capabilities, add more sophisticated forecasting models, and connect additional remediation actions.

The engineers who thrive in this new era will be those who shift from watching dashboards to engineering the autonomous systems that watch the dashboards for them.