Keda vs Karpenter: A Comprehensive Guide to Kubernetes Autoscaling and Datadog Monitoring

READER BEWARE: THE FOLLOWING WAS WRITTEN ENTIRELY BY AI WITHOUT HUMAN EDITING.

Introduction

Kubernetes autoscaling is a critical capability for running efficient, cost-effective, and responsive workloads. Two popular tools in the Kubernetes autoscaling ecosystem are KEDA (Kubernetes Event-Driven Autoscaler) and Karpenter. While both tools address autoscaling, they operate at different layers and serve complementary purposes. This guide provides a comprehensive comparison of KEDA and Karpenter, including their configurations, YAML examples, and how to monitor them using Datadog.

Overview: KEDA vs Karpenter

KEDA (Kubernetes Event-Driven Autoscaler)

KEDA is an event-driven autoscaler that extends the capabilities of the Kubernetes Horizontal Pod Autoscaler (HPA). It allows scaling based on external event sources such as message queues, databases, HTTP requests, and custom metrics.

Key Characteristics:

  • Scales pods/replicas within a deployment
  • Event-driven scaling from 0 to N
  • Supports 60+ scalers (Azure Queue, AWS SQS, Kafka, Prometheus, etc.)
  • Works alongside existing HPA
  • Lightweight component running in your cluster

Official Documentation: https://keda.sh/docs/

Karpenter

Karpenter is a node-level autoscaler designed to provision just-in-time compute capacity for Kubernetes clusters. It directly provisions nodes based on pending pod requirements.

Key Characteristics:

  • Scales cluster nodes (infrastructure-level)
  • Provisions right-sized nodes based on pod requirements
  • Fast node provisioning (seconds instead of minutes)
  • Bin-packing and consolidation for cost optimization
  • Cloud provider integration (primarily AWS; an Azure provider is available in preview)

Official Documentation: https://karpenter.sh/docs/

Comparison Table

Feature | KEDA | Karpenter
Scaling Level | Pod/Application | Node/Infrastructure
Purpose | Scale application replicas | Provision compute capacity
Triggers | External events/metrics | Pending pods
Scale to Zero | Yes | Yes (node removal)
Cloud Agnostic | Yes | AWS-focused (Azure in preview)
Integration | HPA extension | Replaces Cluster Autoscaler
Typical Use Case | Event-driven workloads | Dynamic node provisioning

Understanding the Autoscaling Layers

Kubernetes autoscaling operates at multiple layers, and understanding these layers is crucial for effective scaling strategies.

Layer 1: Pod-Level Autoscaling (KEDA/HPA)

This layer scales the number of pod replicas based on metrics or events. KEDA enhances HPA by adding event-driven triggers.

┌─────────────────────────────────────────────────────────────┐
│                    Application Layer                         │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐         │
│  │  Pod 1  │  │  Pod 2  │  │  Pod 3  │  │  Pod N  │  ← KEDA │
│  └─────────┘  └─────────┘  └─────────┘  └─────────┘         │
└─────────────────────────────────────────────────────────────┘

Layer 2: Node-Level Autoscaling (Karpenter)

This layer provisions or removes nodes based on workload demand. Karpenter observes pending pods and provisions appropriate nodes.

┌─────────────────────────────────────────────────────────────┐
│                  Infrastructure Layer                        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                   │
│  │  Node 1  │  │  Node 2  │  │  Node N  │  ← Karpenter      │
│  │  ┌────┐  │  │  ┌────┐  │  │  ┌────┐  │                   │
│  │  │Pod │  │  │  │Pod │  │  │  │Pod │  │                   │
│  │  └────┘  │  │  └────┘  │  │  └────┘  │                   │
│  └──────────┘  └──────────┘  └──────────┘                   │
└─────────────────────────────────────────────────────────────┘

Combined Multi-Layer Autoscaling

In production environments, KEDA and Karpenter work together:

  1. KEDA detects events (e.g., queue depth) and scales pods
  2. Pods become pending if nodes lack capacity
  3. Karpenter detects pending pods and provisions nodes
  4. Pods are scheduled on new nodes
  5. When load decreases, KEDA scales down pods
  6. Karpenter consolidates or removes underutilized nodes
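The six steps above can be condensed into a back-of-the-envelope calculation. The sketch below (illustrative numbers, CPU-only sizing) mirrors how KEDA derives a replica count from queue depth and how Karpenter then derives a node count from aggregate pod requests:

```python
import math

def keda_desired_replicas(queue_length, target_per_replica,
                          min_replicas, max_replicas):
    """Step 1: derive a replica count from queue depth (HPA-style ceil)."""
    if queue_length == 0:
        return 0  # idleReplicaCount-style scale to zero
    desired = math.ceil(queue_length / target_per_replica)
    return max(min_replicas, min(desired, max_replicas))

def karpenter_nodes_needed(replicas, cpu_per_pod_m, node_cpu_m):
    """Steps 2-3: pods that cannot fit become pending; nodes are sized
    here from aggregate CPU requests (memory and bin-packing ignored)."""
    return math.ceil(replicas * cpu_per_pod_m / node_cpu_m)

# 120 queued messages, target 10 per replica, 500m pods on 2-vCPU nodes
pods = keda_desired_replicas(120, 10, min_replicas=1, max_replicas=50)
nodes = karpenter_nodes_needed(pods, cpu_per_pod_m=500, node_cpu_m=2000)
```

With these numbers KEDA asks for 12 replicas, and Karpenter needs 3 such nodes to fit them.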

KEDA Configuration

KEDA uses two main Custom Resource Definitions (CRDs): ScaledObject for Deployments/StatefulSets and ScaledJob for Jobs.

Installing KEDA

# Using Helm
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

ScaledObject Configuration

The ScaledObject is the primary CRD for scaling workloads. Here’s a comprehensive example:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: my-application-scaledobject
  namespace: production
  labels:
    app: my-application
spec:
  # Target workload to scale
  scaleTargetRef:
    apiVersion: apps/v1           # API version of the target (optional)
    kind: Deployment              # Resource kind (Deployment, StatefulSet)
    name: my-application          # Name of the target resource
    envSourceContainerName: app   # Container name for env vars (optional)

  # Scaling behavior configuration
  pollingInterval: 30             # How often KEDA checks triggers (seconds)
  cooldownPeriod: 300             # Wait time before scaling down (seconds)
  idleReplicaCount: 0             # Replicas when no events (enables scale to 0)
  minReplicaCount: 1              # Minimum replicas during active scaling
  maxReplicaCount: 100            # Maximum replicas allowed

  # Fallback configuration for scaler errors
  fallback:
    failureThreshold: 3           # Number of failures before fallback
    replicas: 6                   # Replica count during fallback

  # Advanced HPA configuration
  advanced:
    restoreToOriginalReplicaCount: true  # Restore replicas on ScaledObject deletion
    horizontalPodAutoscalerConfig:
      name: my-custom-hpa-name    # Custom HPA name (optional)
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
          - type: Percent
            value: 100
            periodSeconds: 15
        scaleUp:
          stabilizationWindowSeconds: 0
          policies:
          - type: Percent
            value: 100
            periodSeconds: 15
          - type: Pods
            value: 4
            periodSeconds: 15
          selectPolicy: Max

  # Triggers define what metrics drive scaling
  triggers:
  # AWS SQS Queue Trigger
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789/my-queue
      queueLength: "5"            # Target queue length per replica
      awsRegion: us-east-1
      identityOwner: pod          # Use pod's IAM role for authentication
    authenticationRef:
      name: aws-credentials       # Reference to TriggerAuthentication

  # Prometheus Trigger
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      metricName: http_requests_total
      query: sum(rate(http_requests_total{app="my-application"}[2m]))
      threshold: "100"            # Target value per replica

  # Kafka Trigger
  - type: kafka
    metadata:
      bootstrapServers: kafka.kafka.svc:9092
      consumerGroup: my-consumer-group
      topic: my-topic
      lagThreshold: "10"          # Target lag per replica

Configuration Value Explanations

Parameter | Description | Default
pollingInterval | How frequently KEDA queries metric sources (seconds) | 30
cooldownPeriod | Wait time after the last trigger reports active before scaling back to the idle replica count (seconds) | 300
idleReplicaCount | Replicas when no triggers are active; only 0 is currently supported, enabling scale-to-zero | N/A
minReplicaCount | Minimum replicas while scaling is active | 0
maxReplicaCount | Maximum replicas allowed | 100
failureThreshold | Consecutive scaler failures before the fallback replica count is applied | 3
stabilizationWindowSeconds | Window of past recommendations the HPA considers before scaling down | 300
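The fallback parameters are easiest to understand as a small state machine: healthy readings reset the failure counter, and once failureThreshold consecutive scaler failures occur, KEDA pins the workload at the fallback replica count. A rough model (not KEDA's actual code; parameter values are illustrative):

```python
import math

class FallbackTracker:
    """Sketch of KEDA fallback behavior (failureThreshold / replicas)."""
    def __init__(self, failure_threshold, fallback_replicas):
        self.failure_threshold = failure_threshold
        self.fallback_replicas = fallback_replicas
        self.failures = 0

    def desired(self, metric, threshold, min_r, max_r, current):
        if metric is None:                     # scaler query failed
            self.failures += 1
            if self.failures >= self.failure_threshold:
                return self.fallback_replicas  # pinned fallback count
            return current                     # keep last known replicas
        self.failures = 0                      # healthy reading resets count
        return max(min_r, min(math.ceil(metric / threshold), max_r))
```

With failureThreshold 3 and fallback replicas 6, the first two failed polls leave the replica count untouched; the third switches to 6 until a healthy reading arrives.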

TriggerAuthentication Configuration

KEDA uses TriggerAuthentication to securely provide credentials to scalers:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: aws-credentials
  namespace: production
spec:
  # Option 1: Pod Identity (IRSA on AWS)
  podIdentity:
    provider: aws                 # aws, azure, gcp, aws-eks, azure-workload

  # Option 2: Using Kubernetes Secrets
  secretTargetRef:
  - parameter: awsAccessKeyID
    name: aws-secrets
    key: AWS_ACCESS_KEY_ID
  - parameter: awsSecretAccessKey
    name: aws-secrets
    key: AWS_SECRET_ACCESS_KEY

  # Option 3: Environment variables from the scaled workload
  env:
  - parameter: apiKey
    name: MY_API_KEY
    containerName: app

ScaledJob Configuration

For batch workloads, use ScaledJob to create Jobs dynamically:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: batch-processor
  namespace: production
spec:
  jobTargetRef:
    parallelism: 1
    completions: 1
    backoffLimit: 4
    template:
      spec:
        containers:
        - name: processor
          image: my-processor:latest
          command: ["process-batch"]
        restartPolicy: Never

  pollingInterval: 30
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  maxReplicaCount: 50
  scalingStrategy:
    strategy: default             # default, custom, accurate
    # For custom strategy:
    # customScalingQueueLengthDeduction: 1
    # customScalingRunningJobPercentage: "0.5"

  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789/batch-queue
      queueLength: "1"
      awsRegion: us-east-1

Karpenter Configuration

Karpenter uses NodePool (formerly Provisioner) and EC2NodeClass (AWS-specific) CRDs.

Installing Karpenter

# Set environment variables
export KARPENTER_NAMESPACE="kube-system"
export KARPENTER_VERSION="1.0.0"   # use a v1.x release to match the karpenter.sh/v1 CRDs below
export AWS_PARTITION="aws"
export CLUSTER_NAME="my-cluster"
export AWS_REGION="us-east-1"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"

# Using Helm
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "${KARPENTER_VERSION}" \
  --namespace "${KARPENTER_NAMESPACE}" \
  --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi

NodePool Configuration

The NodePool defines constraints and requirements for provisioned nodes:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  # Template for nodes created by this NodePool
  template:
    metadata:
      labels:
        intent: apps
        team: platform
      annotations:
        example.com/owner: "platform-team"

    spec:
      # Node requirements/constraints
      requirements:
      # Instance categories
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]    # Prefer spot instances

      # Instance families
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["m5.large", "m5.xlarge", "m5.2xlarge", "m6i.large", "m6i.xlarge"]

      # Architecture
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64", "arm64"]

      # Operating system
      - key: kubernetes.io/os
        operator: In
        values: ["linux"]

      # Availability zones
      - key: topology.kubernetes.io/zone
        operator: In
        values: ["us-east-1a", "us-east-1b", "us-east-1c"]

      # Reference to EC2NodeClass
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default

      # Expiration: How long nodes live before being rotated
      expireAfter: 720h            # 30 days

      # Termination grace period for pods
      terminationGracePeriod: 48h

  # Disruption settings for node lifecycle management
  disruption:
    # Consolidation policy: WhenEmpty, WhenEmptyOrUnderutilized
    consolidationPolicy: WhenEmptyOrUnderutilized
    # How long to wait before consolidating underutilized nodes
    consolidateAfter: 1m
    # Budgets cap how many nodes can be disrupted at once;
    # the most restrictive currently-active budget wins
    budgets:
    - nodes: "10%"                # At most 10% of nodes at any time
    - nodes: "3"                  # Tighter cap of 3 nodes, active only...
      schedule: "0 9 * * mon-fri" # ...weekdays starting 09:00
      duration: 8h                # ...for 8 hours (business hours)

  # Resource limits for this NodePool
  limits:
    cpu: 1000                     # Max 1000 vCPUs across all nodes
    memory: 2000Gi                # Max 2000 Gi memory

  # Weight for node selection (higher = preferred)
  weight: 100
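Budget evaluation can be sketched as follows: every budget whose schedule window is currently active contributes a cap, and the smallest active cap limits how many nodes may be disrupted simultaneously. A simplified model (Karpenter actually parses the cron schedule; here the activity check is passed in):

```python
import math

def allowed_disruptions(total_nodes, budgets, is_active):
    """Each currently-active budget contributes a cap on simultaneous
    disruptions; the most restrictive active cap wins."""
    caps = []
    for b in budgets:
        if not is_active(b):
            continue
        v = b["nodes"]
        if isinstance(v, str) and v.endswith("%"):
            caps.append(math.floor(total_nodes * int(v[:-1]) / 100))
        else:
            caps.append(int(v))
    return min(caps) if caps else total_nodes

budgets = [{"nodes": "10%"}, {"nodes": "3", "schedule": "0 9 * * mon-fri"}]
# Off-hours: only the 10% budget is active -> 4 of 40 nodes
off_hours = allowed_disruptions(40, budgets, lambda b: "schedule" not in b)
# Business hours: both budgets are active and 3 < 4 wins
on_hours = allowed_disruptions(40, budgets, lambda b: True)
```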

EC2NodeClass Configuration (AWS-Specific)

The EC2NodeClass defines AWS-specific node configuration:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  # AMI selection
  amiFamily: AL2023                # AL2, AL2023, Bottlerocket, Ubuntu, Custom
  amiSelectorTerms:
  - alias: al2023@latest          # Use latest AL2023 AMI

  # Subnet selection
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: "my-cluster"
  # Or select by ID:
  # - id: subnet-0123456789abcdef0

  # Security group selection
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: "my-cluster"

  # IAM instance profile for nodes
  instanceProfile: KarpenterNodeInstanceProfile-my-cluster

  # Instance store settings
  instanceStorePolicy: RAID0       # RAID0 for ephemeral storage

  # EBS block device mappings
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      volumeSize: 100Gi
      volumeType: gp3
      iops: 3000
      throughput: 125
      encrypted: true
      kmsKeyId: "arn:aws:kms:us-east-1:123456789:key/1234-5678-abcd"
      deleteOnTermination: true

  # Additional EBS volume for container storage
  - deviceName: /dev/xvdb
    ebs:
      volumeSize: 200Gi
      volumeType: gp3
      encrypted: true

  # Metadata options for IMDS
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 1     # Restrict to instance only
    httpTokens: required           # Require IMDSv2

  # User data for custom node configuration
  userData: |
    #!/bin/bash
    echo "Custom node initialization"
    # Install additional packages
    yum install -y amazon-ssm-agent
    systemctl enable amazon-ssm-agent
    systemctl start amazon-ssm-agent    

  # Tags applied to EC2 instances
  tags:
    Environment: production
    Team: platform
    Application: kubernetes-workloads

NodePool Configuration Value Explanations

Parameter | Description | Example Values
karpenter.sh/capacity-type | Instance purchase type | spot, on-demand
node.kubernetes.io/instance-type | Allowed EC2 instance types | m5.large, c5.xlarge
kubernetes.io/arch | CPU architecture | amd64, arm64
expireAfter | Maximum node lifetime before rotation | 720h (30 days)
consolidationPolicy | When to consolidate nodes | WhenEmpty, WhenEmptyOrUnderutilized
consolidateAfter | Wait time before consolidation | 1m, 5m, 1h
limits.cpu | Maximum vCPUs for this NodePool | 1000
limits.memory | Maximum memory for this NodePool | 2000Gi
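Karpenter's instance selection is, at its core, a cost-aware fit check against the allowed instance types. The sketch below uses hypothetical prices and ignores daemonset overhead, spot pricing, and availability scoring:

```python
# Hypothetical hourly prices; real selection uses live pricing and capacity.
INSTANCE_TYPES = {
    "m5.large":   {"cpu_m": 2000, "mem_mi": 8192,  "price": 0.096},
    "m5.xlarge":  {"cpu_m": 4000, "mem_mi": 16384, "price": 0.192},
    "m5.2xlarge": {"cpu_m": 8000, "mem_mi": 32768, "price": 0.384},
}

def pick_instance(cpu_m, mem_mi):
    """Cheapest allowed type that fits the aggregate requests of the
    pending pods; returns None when nothing fits."""
    fitting = [(spec["price"], name) for name, spec in INSTANCE_TYPES.items()
               if spec["cpu_m"] >= cpu_m and spec["mem_mi"] >= mem_mi]
    return min(fitting)[1] if fitting else None
```

This is why accurate pod resource requests matter: they directly determine which instance type gets provisioned.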

Multiple NodePools for Different Workloads

You can create multiple NodePools for different workload types:

# NodePool for general workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["m5.large", "m5.xlarge", "m6i.large"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  weight: 50
---
# NodePool for GPU workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    metadata:
      labels:
        workload-type: gpu
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]        # GPU workloads often need stability
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["p3.2xlarge", "p3.8xlarge", "g4dn.xlarge", "g4dn.2xlarge"]
      - key: "nvidia.com/gpu"
        operator: Exists
      taints:
      - key: nvidia.com/gpu
        effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodes
  weight: 100
---
# NodePool for high-memory workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: high-memory
spec:
  template:
    metadata:
      labels:
        workload-type: high-memory
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["r5.xlarge", "r5.2xlarge", "r5.4xlarge", "r6i.xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  weight: 75

Combined KEDA and Karpenter Configuration

Here’s a complete example showing KEDA and Karpenter working together:

# Application Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: queue-processor
  namespace: production
spec:
  replicas: 1
  selector:
    matchLabels:
      app: queue-processor
  template:
    metadata:
      labels:
        app: queue-processor
    spec:
      containers:
      - name: processor
        image: my-processor:v1.0
        resources:
          requests:
            cpu: 500m
            memory: 512Mi
          limits:
            cpu: 1000m
            memory: 1Gi
---
# KEDA ScaledObject for pod scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-processor-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: queue-processor
  pollingInterval: 15
  cooldownPeriod: 300
  idleReplicaCount: 0
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789/processing-queue
      queueLength: "10"
      awsRegion: us-east-1
    authenticationRef:
      name: aws-credentials
---
# Karpenter NodePool for infrastructure scaling
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: processing-workloads
spec:
  template:
    metadata:
      labels:
        workload: queue-processing
    spec:
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot"]
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["c5.large", "c5.xlarge", "c5.2xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 168h
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
  limits:
    cpu: 500
    memory: 1000Gi

Monitoring with Datadog

Effective monitoring is essential for understanding autoscaling behavior. Datadog provides comprehensive monitoring capabilities for both KEDA and Karpenter.

Installing Datadog Agent in Kubernetes

# Using Helm
helm repo add datadog https://helm.datadoghq.com
helm repo update

# Create namespace and secret
kubectl create namespace datadog
kubectl create secret generic datadog-secret \
  --namespace datadog \
  --from-literal=api-key=<YOUR_API_KEY>

# Install with custom values
helm install datadog datadog/datadog \
  --namespace datadog \
  --set datadog.apiKeyExistingSecret=datadog-secret \
  --set datadog.site=datadoghq.com \
  --set datadog.logs.enabled=true \
  --set datadog.logs.containerCollectAll=true \
  --set datadog.apm.enabled=true \
  --set datadog.processAgent.enabled=true \
  --set datadog.kubeStateMetricsEnabled=true \
  --set datadog.clusterAgent.enabled=true \
  --set datadog.clusterAgent.metricsProvider.enabled=true

Datadog Agent Values for Autoscaling Monitoring

# datadog-values.yaml
datadog:
  apiKeyExistingSecret: datadog-secret
  site: datadoghq.com

  # Enable Kubernetes integrations
  kubeStateMetricsCore:
    enabled: true
    collectSecretMetrics: false

  # Enable log collection
  logs:
    enabled: true
    containerCollectAll: true

  # Enable APM
  apm:
    portEnabled: true
    socketEnabled: true

  # Enable process monitoring
  processAgent:
    enabled: true
    processCollection: true

  # Prometheus scraping for KEDA metrics
  prometheusScrape:
    enabled: true
    serviceEndpoints: true

  # Additional configuration for autoscaling metrics
  confd:
    openmetrics.yaml: |-
      ad_identifiers:
        - keda-operator
      init_config:
      instances:
        - prometheus_url: http://%%host%%:8080/metrics
          namespace: keda
          metrics:
            - keda_*      

clusterAgent:
  enabled: true
  metricsProvider:
    enabled: true
    useDatadogMetrics: true

  # External metrics for HPA
  externalMetrics:
    enabled: true

Monitoring KEDA with Datadog

KEDA Prometheus Metrics

KEDA exposes metrics on port 8080 that Datadog can scrape:

# Enable metrics scraping via pod annotations
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keda-operator
  namespace: keda
spec:
  template:
    metadata:
      annotations:
        ad.datadoghq.com/keda-operator.check_names: '["openmetrics"]'
        ad.datadoghq.com/keda-operator.init_configs: '[{}]'
        ad.datadoghq.com/keda-operator.instances: |
          [
            {
              "prometheus_url": "http://%%host%%:8080/metrics",
              "namespace": "keda",
              "metrics": [
                "keda_scaler_active",
                "keda_scaler_metrics_value",
                "keda_scaled_object_errors",
                "keda_trigger_totals",
                "keda_internal_scale_loop_latency",
                "keda_scaler_errors_total"
              ]
            }
          ]          

Key KEDA Metrics to Monitor

Metric | Description | Use Case
keda_scaler_active | Whether a scaler is active (1) or idle (0) | Alert on unexpected idle state
keda_scaler_metrics_value | Current metric value from a scaler | Track scaling trigger values
keda_scaled_object_errors | Error count per ScaledObject | Alert on scaling failures
keda_trigger_totals | Total trigger activations | Understand scaling patterns
keda_internal_scale_loop_latency | Latency of the internal scaling loop | Performance monitoring
keda_scaler_errors_total | Total scaler errors | Reliability monitoring
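Under the hood, Datadog's OpenMetrics check scrapes the plain-text Prometheus exposition format from KEDA's :8080/metrics endpoint. A minimal parser shows what that payload looks like (the sample output is illustrative; real exposition lines can contain escaped characters this sketch ignores):

```python
def parse_prometheus_text(text):
    """Minimal parser for the Prometheus exposition format that KEDA
    serves on :8080/metrics (the text Datadog's OpenMetrics check reads)."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip HELP/TYPE comments
            continue
        name_labels, value = line.rsplit(" ", 1)
        samples[name_labels] = float(value)
    return samples

sample = """\
# HELP keda_scaler_active Indicates whether a scaler is active
# TYPE keda_scaler_active gauge
keda_scaler_active{scaledObject="queue-processor-scaler"} 1
keda_scaler_metrics_value{scaledObject="queue-processor-scaler"} 42
"""
metrics = parse_prometheus_text(sample)
```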

KEDA Dashboard JSON

Create a comprehensive KEDA dashboard in Datadog:

{
  "title": "KEDA Autoscaling Dashboard",
  "description": "Monitor KEDA autoscaling metrics and behaviors",
  "widgets": [
    {
      "definition": {
        "title": "Active Scalers",
        "type": "timeseries",
        "requests": [
          {
            "q": "sum:keda.scaler_active{*} by {scaledObject}",
            "display_type": "bars"
          }
        ]
      }
    },
    {
      "definition": {
        "title": "Scaler Metric Values",
        "type": "timeseries",
        "requests": [
          {
            "q": "avg:keda.scaler_metrics_value{*} by {scaledObject,scaler}",
            "display_type": "line"
          }
        ]
      }
    },
    {
      "definition": {
        "title": "Scaling Errors",
        "type": "timeseries",
        "requests": [
          {
            "q": "sum:keda.scaled_object_errors{*} by {scaledObject}.as_count()",
            "display_type": "bars"
          }
        ]
      }
    },
    {
      "definition": {
        "title": "HPA Replica Count",
        "type": "timeseries",
        "requests": [
          {
            "q": "avg:kubernetes_state.hpa.current_replicas{*} by {hpa}",
            "display_type": "line"
          },
          {
            "q": "avg:kubernetes_state.hpa.desired_replicas{*} by {hpa}",
            "display_type": "line",
            "style": {"line_type": "dashed"}
          }
        ]
      }
    }
  ]
}

Monitoring Karpenter with Datadog

Karpenter Prometheus Metrics

Karpenter exposes detailed metrics about node provisioning:

# Configure Datadog to scrape Karpenter metrics
apiVersion: v1
kind: ConfigMap
metadata:
  name: datadog-config
  namespace: datadog
data:
  openmetrics.yaml: |
    ad_identifiers:
      - karpenter
    init_config:
    instances:
      - prometheus_url: http://%%host%%:8000/metrics
        namespace: karpenter
        metrics:
          - karpenter_nodes_*
          - karpenter_pods_*
          - karpenter_provisioner_*
          - karpenter_nodeclaims_*
          - karpenter_nodepool_*
          - karpenter_cloudprovider_*
          - karpenter_disruption_*    

Key Karpenter Metrics to Monitor

Metric | Description | Use Case
karpenter_nodes_created_total | Total nodes created | Track provisioning activity
karpenter_nodes_terminated_total | Total nodes terminated | Track deprovisioning
karpenter_pods_state | Pods by state (pending, running) | Identify scheduling issues
karpenter_nodepool_usage | Resource usage per NodePool | Capacity planning
karpenter_nodepool_limit | Resource limits per NodePool | Capacity monitoring
karpenter_nodeclaims_created_total | NodeClaim creation count | Provisioning patterns
karpenter_cloudprovider_duration_seconds | Cloud API latency | Performance monitoring
karpenter_disruption_actions_performed_total | Disruption actions taken | Understand node lifecycle
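Metrics ending in _total are monotonic counters, so dashboards should graph their rate of increase (as the .as_count() queries below do) rather than the raw value. A sketch of that rate calculation, which skips deltas across counter resets (a controller restart drops the counter back toward zero):

```python
def counter_rate(samples):
    """Per-second rate of a monotonic counter (e.g.
    karpenter_nodes_created_total) from (timestamp, value) samples."""
    increases = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v1 >= v0:                 # a drop means the process restarted
            increases += v1 - v0
    return increases / (samples[-1][0] - samples[0][0])

# Counter observed every 60s: 10 -> 13 -> 13 -> 2 (reset after a restart)
obs = [(0, 10), (60, 13), (120, 13), (180, 2)]
rate = counter_rate(obs)             # nodes created per second
```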

Karpenter Dashboard JSON

{
  "title": "Karpenter Node Provisioning Dashboard",
  "description": "Monitor Karpenter node provisioning and lifecycle",
  "widgets": [
    {
      "definition": {
        "title": "Nodes Created vs Terminated",
        "type": "timeseries",
        "requests": [
          {
            "q": "sum:karpenter.nodes_created_total{*}.as_count()",
            "display_type": "bars",
            "style": {"palette": "green"}
          },
          {
            "q": "sum:karpenter.nodes_terminated_total{*}.as_count()",
            "display_type": "bars",
            "style": {"palette": "red"}
          }
        ]
      }
    },
    {
      "definition": {
        "title": "NodePool Resource Usage",
        "type": "timeseries",
        "requests": [
          {
            "q": "avg:karpenter.nodepool_usage{resource:cpu} by {nodepool}",
            "display_type": "line"
          }
        ]
      }
    },
    {
      "definition": {
        "title": "NodePool Limits",
        "type": "query_value",
        "requests": [
          {
            "q": "avg:karpenter.nodepool_limit{resource:cpu} by {nodepool}"
          }
        ]
      }
    },
    {
      "definition": {
        "title": "Pending Pods",
        "type": "timeseries",
        "requests": [
          {
            "q": "sum:karpenter.pods_state{state:pending}",
            "display_type": "area"
          }
        ]
      }
    },
    {
      "definition": {
        "title": "Cloud Provider Latency",
        "type": "timeseries",
        "requests": [
          {
            "q": "avg:karpenter.cloudprovider_duration_seconds.avg{*} by {method}",
            "display_type": "line"
          }
        ]
      }
    },
    {
      "definition": {
        "title": "Disruption Actions",
        "type": "timeseries",
        "requests": [
          {
            "q": "sum:karpenter.disruption_actions_performed_total{*} by {action}.as_count()",
            "display_type": "bars"
          }
        ]
      }
    }
  ]
}

Combined Autoscaling Monitoring Dashboard

For comprehensive autoscaling visibility, create a unified dashboard:

{
  "title": "Kubernetes Autoscaling Overview",
  "description": "Combined KEDA and Karpenter autoscaling metrics",
  "widgets": [
    {
      "definition": {
        "title": "Autoscaling Flow",
        "type": "note",
        "content": "**Scaling Flow:** Events → KEDA → Pod Scaling → Pending Pods → Karpenter → Node Provisioning"
      }
    },
    {
      "definition": {
        "title": "KEDA Active Scalers",
        "type": "query_value",
        "requests": [{"q": "sum:keda.scaler_active{*}"}]
      }
    },
    {
      "definition": {
        "title": "Current Replica Count",
        "type": "query_value",
        "requests": [{"q": "sum:kubernetes_state.deployment.replicas_available{*}"}]
      }
    },
    {
      "definition": {
        "title": "Karpenter Node Count",
        "type": "query_value",
        "requests": [{"q": "sum:kubernetes_state.node.count{*}"}]
      }
    },
    {
      "definition": {
        "title": "Pod vs Node Scaling Timeline",
        "type": "timeseries",
        "requests": [
          {
            "q": "sum:kubernetes_state.deployment.replicas{*}",
            "display_type": "line"
          },
          {
            "q": "sum:kubernetes_state.node.count{*}",
            "display_type": "line"
          }
        ]
      }
    }
  ]
}

Datadog Monitors and Alerts

Create alerts for autoscaling issues:

# KEDA Scaling Errors Alert
apiVersion: v1
kind: ConfigMap
metadata:
  name: datadog-monitors
data:
  keda-errors.json: |
    {
      "name": "KEDA Scaling Errors Detected",
      "type": "metric alert",
      "query": "sum(last_5m):sum:keda.scaled_object_errors{*}.as_count() > 5",
      "message": "KEDA has encountered scaling errors. Check ScaledObject configurations and scaler connectivity.\n\n@pagerduty-platform-team",
      "tags": ["team:platform", "component:keda"],
      "priority": 2
    }    

  karpenter-pending-pods.json: |
    {
      "name": "High Pending Pod Count",
      "type": "metric alert",
      "query": "avg(last_5m):sum:karpenter.pods_state{state:pending} > 10",
      "message": "Many pods are pending scheduling. Karpenter may be hitting limits or experiencing provisioning issues.\n\nCheck:\n- NodePool limits\n- EC2 capacity\n- Instance type availability\n\n@slack-platform-alerts",
      "tags": ["team:platform", "component:karpenter"],
      "priority": 2
    }    

  karpenter-provisioning-latency.json: |
    {
      "name": "Karpenter High Provisioning Latency",
      "type": "metric alert",
      "query": "avg(last_10m):avg:karpenter.cloudprovider_duration_seconds.avg{method:create} > 60",
      "message": "Karpenter is experiencing high latency when provisioning nodes. This may indicate AWS API throttling or capacity issues.\n\n@slack-platform-alerts",
      "tags": ["team:platform", "component:karpenter"],
      "priority": 3
    }    

  scale-to-zero-alert.json: |
    {
      "name": "Workload Scaled to Zero",
      "type": "metric alert",
      "query": "avg(last_5m):avg:kubernetes_state.deployment.replicas_available{app:queue-processor} == 0",
      "message": "The queue-processor workload has scaled to zero replicas. This is expected during low traffic but verify no processing backlog exists.\n\n@slack-platform-alerts",
      "tags": ["team:platform", "component:keda"],
      "priority": 4
    }    
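The monitor definitions above can also be created programmatically. Below is a minimal standard-library sketch against Datadog's v1 monitor API (the official datadog-api-client library is the more robust choice); DD_API_KEY and DD_APP_KEY environment variables are assumed to be set before the actual POST:

```python
import json
import os
import urllib.request

def monitor_payload(name, query, message, tags, priority):
    """Assemble a monitor definition matching the JSON bodies above."""
    return {"name": name, "type": "metric alert", "query": query,
            "message": message, "tags": tags, "priority": priority}

def create_monitor(payload):
    """POST the definition to Datadog's v1 monitor endpoint."""
    req = urllib.request.Request(
        "https://api.datadoghq.com/api/v1/monitor",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": os.environ["DD_API_KEY"],
            "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
        },
    )
    urllib.request.urlopen(req)

payload = monitor_payload(
    "KEDA Scaling Errors Detected",
    "sum(last_5m):sum:keda.scaled_object_errors{*}.as_count() > 5",
    "KEDA has encountered scaling errors.",
    ["team:platform", "component:keda"], 2)
```

Keeping monitor definitions in code (or the ConfigMap above) makes them reviewable and reproducible across environments.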

Using Datadog External Metrics with KEDA

Datadog can serve as a metric source for KEDA scaling decisions:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: datadog-driven-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: api-server
  pollingInterval: 30
  cooldownPeriod: 300
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
  - type: datadog
    metadata:
      # Query Datadog metrics directly
      query: "avg:trace.http.request.duration{service:api-server}.as_count()"
      queryValue: "100"           # Target value for scaling
      queryAggregator: "avg"      # avg, sum, min, max
      age: "60"                   # Query age in seconds
      type: "average"             # global, average
    authenticationRef:
      name: datadog-auth
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: datadog-auth
  namespace: production
spec:
  secretTargetRef:
  - parameter: apiKey
    name: datadog-secrets
    key: api-key
  - parameter: appKey
    name: datadog-secrets
    key: app-key
  # Optional: specify Datadog site
  # env:
  # - parameter: datadogSite
  #   name: DD_SITE
  #   containerName: app

Best Practices

KEDA Best Practices

  1. Set Appropriate Cooldown Periods: Prevent rapid scaling oscillations by setting appropriate cooldownPeriod values.

  2. Use Fallback Configuration: Define fallback replicas to maintain availability during scaler failures.

  3. Monitor Scaler Errors: Alert on keda_scaled_object_errors to catch configuration issues early.

  4. Test Scale-to-Zero: Thoroughly test workloads that scale to zero to ensure proper wake-up behavior.

  5. Use TriggerAuthentication: Avoid hardcoding credentials; use TriggerAuthentication with secrets or pod identity.

Karpenter Best Practices

  1. Define Multiple NodePools: Create specialized NodePools for different workload types (GPU, memory-optimized, cost-optimized).

  2. Set Resource Limits: Always define limits in NodePools to control infrastructure costs.

  3. Use Consolidation: Enable consolidation to reduce costs by removing underutilized nodes.

  4. Prefer Spot Instances: Use spot instances for fault-tolerant workloads to reduce costs significantly.

  5. Monitor Pending Pods: Set up alerts for pending pods to catch capacity issues quickly.

Combined Strategy Best Practices

  1. Right-Size Resource Requests: Accurate pod resource requests help Karpenter select optimal instance types.

  2. Coordinate Scaling Parameters: Ensure KEDA cooldown aligns with Karpenter consolidation timing.

  3. Use Pod Disruption Budgets: Protect critical workloads during node consolidation.

  4. Monitor End-to-End: Track metrics from event source through pod scaling to node provisioning.

Conclusion

KEDA and Karpenter are complementary tools that together provide comprehensive Kubernetes autoscaling:

  • KEDA excels at event-driven pod autoscaling, enabling scale-to-zero and scaling based on external metrics
  • Karpenter handles infrastructure-level scaling, provisioning just-right nodes quickly and efficiently
  • Together, they create a responsive, cost-effective autoscaling strategy

When combined with Datadog monitoring, you gain visibility into every layer of your autoscaling infrastructure, enabling proactive management and rapid troubleshooting.

Additional Resources