KEDA vs Karpenter: A Comprehensive Guide to Kubernetes Autoscaling and Datadog Monitoring
READER BEWARE: THE FOLLOWING WAS WRITTEN ENTIRELY BY AI WITHOUT HUMAN EDITING.
Introduction
Kubernetes autoscaling is a critical capability for running efficient, cost-effective, and responsive workloads. Two popular tools in the Kubernetes autoscaling ecosystem are KEDA (Kubernetes Event-Driven Autoscaler) and Karpenter. While both tools address autoscaling, they operate at different layers and serve complementary purposes. This guide provides a comprehensive comparison of KEDA and Karpenter, including their configurations, YAML examples, and how to monitor them using Datadog.
Overview: KEDA vs Karpenter
KEDA (Kubernetes Event-Driven Autoscaler)
KEDA is an event-driven autoscaler that extends Kubernetes Horizontal Pod Autoscaler (HPA) capabilities. It allows scaling based on external event sources such as message queues, databases, HTTP requests, and custom metrics.
Key Characteristics:
- Scales pod replicas within a Deployment, StatefulSet, or other scalable workload
- Event-driven scaling from 0 to N
- Supports 60+ scalers (Azure Queue, AWS SQS, Kafka, Prometheus, etc.)
- Works alongside existing HPA
- Lightweight component running in your cluster
Official Documentation: https://keda.sh/docs/
Karpenter
Karpenter is a node-level autoscaler designed to provision just-in-time compute capacity for Kubernetes clusters. It directly provisions nodes based on pending pod requirements.
Key Characteristics:
- Scales cluster nodes (infrastructure-level)
- Provisions right-sized nodes based on pod requirements
- Fast node provisioning (seconds instead of minutes)
- Bin-packing and consolidation for cost optimization
- Cloud provider integration (primarily AWS; an Azure provider is in preview)
Official Documentation: https://karpenter.sh/docs/
Comparison Table
| Feature | KEDA | Karpenter |
|---|---|---|
| Scaling Level | Pod/Application | Node/Infrastructure |
| Purpose | Scale application replicas | Provision compute capacity |
| Triggers | External events/metrics | Pending pods |
| Scale to Zero | Yes | Yes (node removal) |
| Cloud Agnostic | Yes | AWS-focused (Azure in preview) |
| Integration | HPA extension | Replaces Cluster Autoscaler |
| Typical Use Case | Event-driven workloads | Dynamic node provisioning |
Understanding the Autoscaling Layers
Kubernetes autoscaling operates at multiple layers, and understanding these layers is crucial for effective scaling strategies.
Layer 1: Pod-Level Autoscaling (KEDA/HPA)
This layer scales the number of pod replicas based on metrics or events. KEDA enhances HPA by adding event-driven triggers.
┌─────────────────────────────────────────────────────────────┐
│ Application Layer │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Pod 1 │ │ Pod 2 │ │ Pod 3 │ │ Pod N │ ← KEDA │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────────┘
Layer 2: Node-Level Autoscaling (Karpenter)
This layer provisions or removes nodes based on workload demand. Karpenter observes pending pods and provisions appropriate nodes.
┌─────────────────────────────────────────────────────────────┐
│ Infrastructure Layer │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Node 1 │ │ Node 2 │ │ Node N │ ← Karpenter │
│ │ ┌────┐ │ │ ┌────┐ │ │ ┌────┐ │ │
│ │ │Pod │ │ │ │Pod │ │ │ │Pod │ │ │
│ │ └────┘ │ │ └────┘ │ │ └────┘ │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└─────────────────────────────────────────────────────────────┘
Combined Multi-Layer Autoscaling
In production environments, KEDA and Karpenter work together:
1. KEDA detects events (e.g., queue depth) and scales pods
2. Pods become pending if nodes lack capacity
3. Karpenter detects the pending pods and provisions nodes
4. Pods are scheduled on the new nodes
5. When load decreases, KEDA scales the pods back down
6. Karpenter consolidates or removes underutilized nodes
KEDA Configuration
KEDA uses two main Custom Resource Definitions (CRDs): ScaledObject for Deployments/StatefulSets and ScaledJob for Jobs.
Installing KEDA
# Using Helm
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace
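Verify the installation before creating scalers (standard kubectl checks):
# Check the KEDA operator pods
kubectl get pods -n keda
# Confirm the KEDA CRDs are present
kubectl get crd scaledobjects.keda.sh scaledjobs.keda.sh triggerauthentications.keda.sh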
ScaledObject Configuration
The ScaledObject is the primary CRD for scaling workloads. Here’s a comprehensive example:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: my-application-scaledobject
namespace: production
labels:
app: my-application
spec:
# Target workload to scale
scaleTargetRef:
apiVersion: apps/v1 # API version of the target (optional)
kind: Deployment # Resource kind (Deployment, StatefulSet)
name: my-application # Name of the target resource
envSourceContainerName: app # Container name for env vars (optional)
# Scaling behavior configuration
pollingInterval: 30 # How often KEDA checks triggers (seconds)
  cooldownPeriod: 300               # Wait after the last active trigger before scaling to zero (seconds)
idleReplicaCount: 0 # Replicas when no events (enables scale to 0)
minReplicaCount: 1 # Minimum replicas during active scaling
maxReplicaCount: 100 # Maximum replicas allowed
# Fallback configuration for scaler errors
fallback:
failureThreshold: 3 # Number of failures before fallback
replicas: 6 # Replica count during fallback
# Advanced HPA configuration
advanced:
restoreToOriginalReplicaCount: true # Restore replicas on ScaledObject deletion
horizontalPodAutoscalerConfig:
name: my-custom-hpa-name # Custom HPA name (optional)
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 100
periodSeconds: 15
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 15
- type: Pods
value: 4
periodSeconds: 15
selectPolicy: Max
# Triggers define what metrics drive scaling
triggers:
# AWS SQS Queue Trigger
- type: aws-sqs-queue
metadata:
queueURL: https://sqs.us-east-1.amazonaws.com/123456789/my-queue
queueLength: "5" # Target queue length per replica
awsRegion: us-east-1
identityOwner: pod # Use pod's IAM role for authentication
authenticationRef:
name: aws-credentials # Reference to TriggerAuthentication
# Prometheus Trigger
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring.svc:9090
metricName: http_requests_total
query: sum(rate(http_requests_total{app="my-application"}[2m]))
threshold: "100" # Target value per replica
# Kafka Trigger
- type: kafka
metadata:
bootstrapServers: kafka.kafka.svc:9092
consumerGroup: my-consumer-group
topic: my-topic
lagThreshold: "10" # Target lag per replica
Configuration Value Explanations
| Parameter | Description | Default |
|---|---|---|
| `pollingInterval` | How frequently KEDA queries each trigger's metric source (seconds) | 30 |
| `cooldownPeriod` | Wait after the last trigger reports active before scaling back to zero; applies only to the scale-to-zero transition (seconds) | 300 |
| `idleReplicaCount` | Replica count when no triggers are active (set to 0 to enable scale-to-zero) | N/A |
| `minReplicaCount` | Minimum replicas while scaling is active | 0 |
| `maxReplicaCount` | Maximum replicas allowed | 100 |
| `failureThreshold` | Consecutive scaler failures before the fallback replica count is applied | 3 |
| `stabilizationWindowSeconds` | Window over which past recommendations are considered when scaling | 300 |
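Under the hood, KEDA hands these targets to an HPA, which sizes the workload with the standard average-value formula: desiredReplicas = ceil(currentMetricValue / threshold). For example, with queueLength: "5" and 47 messages visible in the queue, the HPA requests ceil(47 / 5) = 10 replicas, subject to minReplicaCount and maxReplicaCount.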
TriggerAuthentication Configuration
KEDA uses TriggerAuthentication to securely provide credentials to scalers:
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
name: aws-credentials
namespace: production
spec:
# Option 1: Pod Identity (IRSA on AWS)
podIdentity:
provider: aws # aws, azure, gcp, aws-eks, azure-workload
# Option 2: Using Kubernetes Secrets
secretTargetRef:
- parameter: awsAccessKeyID
name: aws-secrets
key: AWS_ACCESS_KEY_ID
- parameter: awsSecretAccessKey
name: aws-secrets
key: AWS_SECRET_ACCESS_KEY
# Option 3: Environment variables from the scaled workload
env:
- parameter: apiKey
name: MY_API_KEY
containerName: app
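For Option 2 above, the secretTargetRef entries expect a Kubernetes Secret shaped roughly like this (a sketch; the key names must match, the values are placeholders):
apiVersion: v1
kind: Secret
metadata:
  name: aws-secrets
  namespace: production
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "<access-key-id>"         # placeholder value
  AWS_SECRET_ACCESS_KEY: "<secret-access-key>" # placeholder value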
ScaledJob Configuration
For batch workloads, use ScaledJob to create Jobs dynamically:
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
name: batch-processor
namespace: production
spec:
jobTargetRef:
parallelism: 1
completions: 1
backoffLimit: 4
template:
spec:
containers:
- name: processor
image: my-processor:latest
command: ["process-batch"]
restartPolicy: Never
pollingInterval: 30
successfulJobsHistoryLimit: 5
failedJobsHistoryLimit: 5
maxReplicaCount: 50
scalingStrategy:
strategy: default # default, custom, accurate
# For custom strategy:
# customScalingQueueLengthDeduction: 1
# customScalingRunningJobPercentage: "0.5"
triggers:
- type: aws-sqs-queue
metadata:
queueURL: https://sqs.us-east-1.amazonaws.com/123456789/batch-queue
queueLength: "1"
awsRegion: us-east-1
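With queueLength: "1", the default strategy creates roughly one Job per visible message, deducting Jobs that are already running, and never more than maxReplicaCount per polling interval.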
Karpenter Configuration
Karpenter uses NodePool (formerly Provisioner) and EC2NodeClass (AWS-specific) CRDs.
Installing Karpenter
# Set environment variables
export KARPENTER_NAMESPACE="kube-system"
export KARPENTER_VERSION="1.0.0"  # the karpenter.sh/v1 resources below require Karpenter v1.0+
export AWS_PARTITION="aws"
export CLUSTER_NAME="my-cluster"
export AWS_REGION="us-east-1"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"
# Using Helm
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
--version "${KARPENTER_VERSION}" \
--namespace "${KARPENTER_NAMESPACE}" \
--create-namespace \
--set "settings.clusterName=${CLUSTER_NAME}" \
--set "settings.interruptionQueue=${CLUSTER_NAME}" \
--set controller.resources.requests.cpu=1 \
--set controller.resources.requests.memory=1Gi \
--set controller.resources.limits.cpu=1 \
--set controller.resources.limits.memory=1Gi
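Confirm the controller is healthy before creating NodePools:
# Check the Karpenter controller pods
kubectl get pods -n "${KARPENTER_NAMESPACE}" -l app.kubernetes.io/name=karpenter
# Follow the controller logs to watch provisioning decisions
kubectl logs -n "${KARPENTER_NAMESPACE}" -l app.kubernetes.io/name=karpenter -f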
NodePool Configuration
The NodePool defines constraints and requirements for provisioned nodes:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: default
spec:
# Template for nodes created by this NodePool
template:
metadata:
labels:
intent: apps
team: platform
annotations:
example.com/owner: "platform-team"
spec:
# Node requirements/constraints
requirements:
# Instance categories
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"] # Prefer spot instances
# Instance families
- key: node.kubernetes.io/instance-type
operator: In
values: ["m5.large", "m5.xlarge", "m5.2xlarge", "m6i.large", "m6i.xlarge"]
# Architecture
- key: kubernetes.io/arch
operator: In
values: ["amd64", "arm64"]
# Operating system
- key: kubernetes.io/os
operator: In
values: ["linux"]
# Availability zones
- key: topology.kubernetes.io/zone
operator: In
values: ["us-east-1a", "us-east-1b", "us-east-1c"]
# Reference to EC2NodeClass
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
# Expiration: How long nodes live before being rotated
expireAfter: 720h # 30 days
# Termination grace period for pods
terminationGracePeriod: 48h
# Disruption settings for node lifecycle management
disruption:
# Consolidation policy: WhenEmpty, WhenEmptyOrUnderutilized
consolidationPolicy: WhenEmptyOrUnderutilized
# How long to wait before consolidating underutilized nodes
consolidateAfter: 1m
# Budget controls how many nodes can be disrupted simultaneously
budgets:
- nodes: "10%" # Disrupt max 10% of nodes at a time
- nodes: "3" # Or max 3 nodes, whichever is lower
schedule: "0 9 * * mon-fri" # During business hours
duration: 8h
# Resource limits for this NodePool
limits:
cpu: 1000 # Max 1000 vCPUs across all nodes
memory: 2000Gi # Max 2000 Gi memory
# Weight for node selection (higher = preferred)
weight: 100
EC2NodeClass Configuration (AWS-Specific)
The EC2NodeClass defines AWS-specific node configuration:
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: default
spec:
# AMI selection
amiFamily: AL2023 # AL2, AL2023, Bottlerocket, Ubuntu, Custom
amiSelectorTerms:
- alias: al2023@latest # Use latest AL2023 AMI
# Subnet selection
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: "my-cluster"
# Or select by ID:
# - id: subnet-0123456789abcdef0
# Security group selection
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: "my-cluster"
# IAM instance profile for nodes
instanceProfile: KarpenterNodeInstanceProfile-my-cluster
# Instance store settings
instanceStorePolicy: RAID0 # RAID0 for ephemeral storage
# EBS block device mappings
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
volumeSize: 100Gi
volumeType: gp3
iops: 3000
throughput: 125
encrypted: true
kmsKeyId: "arn:aws:kms:us-east-1:123456789:key/1234-5678-abcd"
deleteOnTermination: true
# Additional EBS volume for container storage
- deviceName: /dev/xvdb
ebs:
volumeSize: 200Gi
volumeType: gp3
encrypted: true
# Metadata options for IMDS
metadataOptions:
httpEndpoint: enabled
httpProtocolIPv6: disabled
httpPutResponseHopLimit: 1 # Restrict to instance only
httpTokens: required # Require IMDSv2
# User data for custom node configuration
userData: |
#!/bin/bash
echo "Custom node initialization"
# Install additional packages
yum install -y amazon-ssm-agent
systemctl enable amazon-ssm-agent
systemctl start amazon-ssm-agent
# Tags applied to EC2 instances
tags:
Environment: production
Team: platform
Application: kubernetes-workloads
NodePool Configuration Value Explanations
| Parameter | Description | Example Values |
|---|---|---|
| `karpenter.sh/capacity-type` | Instance purchase type | spot, on-demand |
| `node.kubernetes.io/instance-type` | Allowed EC2 instance types | m5.large, c5.xlarge |
| `kubernetes.io/arch` | CPU architecture | amd64, arm64 |
| `expireAfter` | Maximum node lifetime before rotation | 720h (30 days) |
| `consolidationPolicy` | When to consolidate nodes | WhenEmpty, WhenEmptyOrUnderutilized |
| `consolidateAfter` | Wait time before consolidation | 1m, 5m, 1h |
| `limits.cpu` | Maximum total vCPUs across this NodePool's nodes | 1000 |
| `limits.memory` | Maximum total memory across this NodePool's nodes | 2000Gi |
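Note that limits are evaluated against capacity the NodePool has already launched: once its running nodes total limits.cpu or limits.memory, Karpenter stops provisioning from that pool, and any additional pods stay pending until capacity frees up.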
Multiple NodePools for Different Workloads
You can create multiple NodePools for different workload types:
# NodePool for general workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: general
spec:
template:
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
- key: node.kubernetes.io/instance-type
operator: In
values: ["m5.large", "m5.xlarge", "m6i.large"]
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
weight: 50
---
# NodePool for GPU workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: gpu
spec:
template:
metadata:
labels:
workload-type: gpu
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand"] # GPU workloads often need stability
- key: node.kubernetes.io/instance-type
operator: In
values: ["p3.2xlarge", "p3.8xlarge", "g4dn.xlarge", "g4dn.2xlarge"]
- key: "nvidia.com/gpu"
operator: Exists
taints:
- key: nvidia.com/gpu
effect: NoSchedule
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: gpu-nodes
weight: 100
---
# NodePool for high-memory workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: high-memory
spec:
template:
metadata:
labels:
workload-type: high-memory
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["on-demand"]
- key: node.kubernetes.io/instance-type
operator: In
values: ["r5.xlarge", "r5.2xlarge", "r5.4xlarge", "r6i.xlarge"]
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
weight: 75
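To actually land on the gpu NodePool, a pod must tolerate its taint and request GPU resources. A minimal sketch (the pod name, image, and GPU count are illustrative; GPU scheduling assumes the NVIDIA device plugin is installed):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example             # hypothetical workload
  namespace: production
spec:
  nodeSelector:
    workload-type: gpu          # matches the gpu NodePool's label
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule        # tolerates the gpu NodePool's taint
  containers:
    - name: trainer
      image: my-trainer:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1     # one GPU; requires the NVIDIA device plugin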
Combined KEDA and Karpenter Configuration
Here’s a complete example showing KEDA and Karpenter working together:
# Application Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: queue-processor
namespace: production
spec:
replicas: 1
selector:
matchLabels:
app: queue-processor
template:
metadata:
labels:
app: queue-processor
spec:
containers:
- name: processor
image: my-processor:v1.0
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 1000m
memory: 1Gi
---
# KEDA ScaledObject for pod scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: queue-processor-scaler
namespace: production
spec:
scaleTargetRef:
name: queue-processor
pollingInterval: 15
cooldownPeriod: 300
idleReplicaCount: 0
minReplicaCount: 1
maxReplicaCount: 50
triggers:
- type: aws-sqs-queue
metadata:
queueURL: https://sqs.us-east-1.amazonaws.com/123456789/processing-queue
queueLength: "10"
awsRegion: us-east-1
authenticationRef:
name: aws-credentials
---
# Karpenter NodePool for infrastructure scaling
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: processing-workloads
spec:
template:
metadata:
labels:
workload: queue-processing
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot"]
- key: node.kubernetes.io/instance-type
operator: In
values: ["c5.large", "c5.xlarge", "c5.2xlarge"]
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
expireAfter: 168h
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 5m
limits:
cpu: 500
memory: 1000Gi
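The numbers here are intentionally coordinated: at the maximum of 50 replicas, the pods request 50 x 500m = 25 vCPUs, comfortably below the NodePool's 500-vCPU limit, so KEDA can always reach its ceiling without Karpenter refusing to provision.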
Monitoring with Datadog
Effective monitoring is essential for understanding autoscaling behavior. Datadog provides comprehensive monitoring capabilities for both KEDA and Karpenter.
Installing Datadog Agent in Kubernetes
# Using Helm
helm repo add datadog https://helm.datadoghq.com
helm repo update
# Create namespace and secret
kubectl create namespace datadog
kubectl create secret generic datadog-secret \
--namespace datadog \
--from-literal=api-key=<YOUR_API_KEY>
# Install with custom values
helm install datadog datadog/datadog \
--namespace datadog \
--set datadog.apiKeyExistingSecret=datadog-secret \
--set datadog.site=datadoghq.com \
--set datadog.logs.enabled=true \
--set datadog.logs.containerCollectAll=true \
--set datadog.apm.enabled=true \
--set datadog.processAgent.enabled=true \
--set datadog.kubeStateMetricsEnabled=true \
--set datadog.clusterAgent.enabled=true \
--set datadog.clusterAgent.metricsProvider.enabled=true
Datadog Agent Values for Autoscaling Monitoring
# datadog-values.yaml
datadog:
apiKeyExistingSecret: datadog-secret
site: datadoghq.com
# Enable Kubernetes integrations
kubeStateMetricsCore:
enabled: true
collectSecretMetrics: false
# Enable log collection
logs:
enabled: true
containerCollectAll: true
# Enable APM
apm:
portEnabled: true
socketEnabled: true
# Enable process monitoring
processAgent:
enabled: true
processCollection: true
# Prometheus scraping for KEDA metrics
prometheusScrape:
enabled: true
serviceEndpoints: true
# Additional configuration for autoscaling metrics
confd:
keda.yaml: |-
ad_identifiers:
- keda-operator
init_config:
instances:
- prometheus_url: http://%%host%%:8080/metrics
namespace: keda
metrics:
- keda_*
clusterAgent:
enabled: true
metricsProvider:
enabled: true
useDatadogMetrics: true
# External metrics for HPA
externalMetrics:
enabled: true
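Apply the values file to the same Helm release:
helm upgrade --install datadog datadog/datadog \
  --namespace datadog \
  -f datadog-values.yaml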
Monitoring KEDA with Datadog
KEDA Prometheus Metrics
The KEDA operator exposes Prometheus metrics on port 8080, which Datadog can scrape:
# Enable metrics scraping via pod annotations
apiVersion: apps/v1
kind: Deployment
metadata:
name: keda-operator
namespace: keda
spec:
template:
metadata:
annotations:
ad.datadoghq.com/keda-operator.check_names: '["openmetrics"]'
ad.datadoghq.com/keda-operator.init_configs: '[{}]'
ad.datadoghq.com/keda-operator.instances: |
[
{
"prometheus_url": "http://%%host%%:8080/metrics",
"namespace": "keda",
"metrics": [
"keda_scaler_active",
"keda_scaler_metrics_value",
"keda_scaled_object_errors",
"keda_trigger_totals",
"keda_internal_scale_loop_latency",
"keda_scaler_errors_total"
]
}
]
Key KEDA Metrics to Monitor
| Metric | Description | Use Case |
|---|---|---|
| `keda_scaler_active` | Whether a scaler is active (1) or idle (0) | Alert on unexpected idle state |
| `keda_scaler_metrics_value` | Current metric value reported by a scaler | Track scaling trigger values |
| `keda_scaled_object_errors` | Error count per ScaledObject | Alert on scaling failures |
| `keda_trigger_totals` | Total trigger activations | Understand scaling patterns |
| `keda_internal_scale_loop_latency` | Latency of internal scaling loops | Performance monitoring |
| `keda_scaler_errors_total` | Total scaler errors | Reliability monitoring |
KEDA Dashboard JSON
Create a comprehensive KEDA dashboard in Datadog:
{
"title": "KEDA Autoscaling Dashboard",
"description": "Monitor KEDA autoscaling metrics and behaviors",
"widgets": [
{
"definition": {
"title": "Active Scalers",
"type": "timeseries",
"requests": [
{
"q": "sum:keda.scaler_active{*} by {scaledObject}",
"display_type": "bars"
}
]
}
},
{
"definition": {
"title": "Scaler Metric Values",
"type": "timeseries",
"requests": [
{
"q": "avg:keda.scaler_metrics_value{*} by {scaledObject,scaler}",
"display_type": "line"
}
]
}
},
{
"definition": {
"title": "Scaling Errors",
"type": "timeseries",
"requests": [
{
"q": "sum:keda.scaled_object_errors{*} by {scaledObject}.as_count()",
"display_type": "bars"
}
]
}
},
{
"definition": {
"title": "HPA Replica Count",
"type": "timeseries",
"requests": [
{
"q": "avg:kubernetes_state.hpa.current_replicas{*} by {hpa}",
"display_type": "line"
},
{
"q": "avg:kubernetes_state.hpa.desired_replicas{*} by {hpa}",
"display_type": "line",
"style": {"line_type": "dashed"}
}
]
}
}
]
}
Monitoring Karpenter with Datadog
Karpenter Prometheus Metrics
Karpenter exposes detailed metrics about node provisioning:
# Configure Datadog to scrape Karpenter metrics
apiVersion: v1
kind: ConfigMap
metadata:
name: datadog-config
namespace: datadog
data:
karpenter.yaml: |
ad_identifiers:
- karpenter
init_config:
instances:
- prometheus_url: http://%%host%%:8000/metrics
namespace: karpenter
metrics:
- karpenter_nodes_*
- karpenter_pods_*
- karpenter_provisioner_*
- karpenter_nodeclaims_*
- karpenter_nodepool_*
- karpenter_cloudprovider_*
- karpenter_disruption_*
Key Karpenter Metrics to Monitor
| Metric | Description | Use Case |
|---|---|---|
| `karpenter_nodes_created_total` | Total nodes created | Track provisioning activity |
| `karpenter_nodes_terminated_total` | Total nodes terminated | Track deprovisioning |
| `karpenter_pods_state` | Pods by state (pending, running) | Identify scheduling issues |
| `karpenter_nodepool_usage` | Resource usage per NodePool | Capacity planning |
| `karpenter_nodepool_limit` | Resource limits per NodePool | Capacity monitoring |
| `karpenter_nodeclaims_created_total` | NodeClaim creation count | Provisioning patterns |
| `karpenter_cloudprovider_duration_seconds` | Cloud provider API latency | Performance monitoring |
| `karpenter_disruption_actions_performed_total` | Disruption actions taken | Understand node lifecycle |
Karpenter Dashboard JSON
{
"title": "Karpenter Node Provisioning Dashboard",
"description": "Monitor Karpenter node provisioning and lifecycle",
"widgets": [
{
"definition": {
"title": "Nodes Created vs Terminated",
"type": "timeseries",
"requests": [
{
"q": "sum:karpenter.nodes_created_total{*}.as_count()",
"display_type": "bars",
"style": {"palette": "green"}
},
{
"q": "sum:karpenter.nodes_terminated_total{*}.as_count()",
"display_type": "bars",
"style": {"palette": "red"}
}
]
}
},
{
"definition": {
"title": "NodePool Resource Usage",
"type": "timeseries",
"requests": [
{
"q": "avg:karpenter.nodepool_usage{resource:cpu} by {nodepool}",
"display_type": "line"
}
]
}
},
{
"definition": {
"title": "NodePool Limits",
"type": "query_value",
"requests": [
{
"q": "avg:karpenter.nodepool_limit{resource:cpu} by {nodepool}"
}
]
}
},
{
"definition": {
"title": "Pending Pods",
"type": "timeseries",
"requests": [
{
"q": "sum:karpenter.pods_state{state:pending}",
"display_type": "area"
}
]
}
},
{
"definition": {
"title": "Cloud Provider Latency",
"type": "timeseries",
"requests": [
{
"q": "avg:karpenter.cloudprovider_duration_seconds.avg{*} by {method}",
"display_type": "line"
}
]
}
},
{
"definition": {
"title": "Disruption Actions",
"type": "timeseries",
"requests": [
{
"q": "sum:karpenter.disruption_actions_performed_total{*} by {action}.as_count()",
"display_type": "bars"
}
]
}
}
]
}
Combined Autoscaling Monitoring Dashboard
For comprehensive autoscaling visibility, create a unified dashboard:
{
"title": "Kubernetes Autoscaling Overview",
"description": "Combined KEDA and Karpenter autoscaling metrics",
"widgets": [
{
"definition": {
"title": "Autoscaling Flow",
"type": "note",
"content": "**Scaling Flow:** Events → KEDA → Pod Scaling → Pending Pods → Karpenter → Node Provisioning"
}
},
{
"definition": {
"title": "KEDA Active Scalers",
"type": "query_value",
"requests": [{"q": "sum:keda.scaler_active{*}"}]
}
},
{
"definition": {
"title": "Current Replica Count",
"type": "query_value",
"requests": [{"q": "sum:kubernetes_state.deployment.replicas_available{*}"}]
}
},
{
"definition": {
"title": "Karpenter Node Count",
"type": "query_value",
"requests": [{"q": "sum:kubernetes_state.node.count{*}"}]
}
},
{
"definition": {
"title": "Pod vs Node Scaling Timeline",
"type": "timeseries",
"requests": [
{
"q": "sum:kubernetes_state.deployment.replicas{*}",
"display_type": "line"
},
{
"q": "sum:kubernetes_state.node.count{*}",
"display_type": "line"
}
]
}
}
]
}
Datadog Monitors and Alerts
Create alerts for autoscaling issues:
# KEDA Scaling Errors Alert
apiVersion: v1
kind: ConfigMap
metadata:
name: datadog-monitors
data:
keda-errors.json: |
{
"name": "KEDA Scaling Errors Detected",
"type": "metric alert",
"query": "sum(last_5m):sum:keda.scaled_object_errors{*}.as_count() > 5",
"message": "KEDA has encountered scaling errors. Check ScaledObject configurations and scaler connectivity.\n\n@pagerduty-platform-team",
"tags": ["team:platform", "component:keda"],
"priority": 2
}
karpenter-pending-pods.json: |
{
"name": "High Pending Pod Count",
"type": "metric alert",
"query": "avg(last_5m):sum:karpenter.pods_state{state:pending} > 10",
"message": "Many pods are pending scheduling. Karpenter may be hitting limits or experiencing provisioning issues.\n\nCheck:\n- NodePool limits\n- EC2 capacity\n- Instance type availability\n\n@slack-platform-alerts",
"tags": ["team:platform", "component:karpenter"],
"priority": 2
}
karpenter-provisioning-latency.json: |
{
"name": "Karpenter High Provisioning Latency",
"type": "metric alert",
"query": "avg(last_10m):avg:karpenter.cloudprovider_duration_seconds.avg{method:create} > 60",
"message": "Karpenter is experiencing high latency when provisioning nodes. This may indicate AWS API throttling or capacity issues.\n\n@slack-platform-alerts",
"tags": ["team:platform", "component:karpenter"],
"priority": 3
}
scale-to-zero-alert.json: |
{
"name": "Workload Scaled to Zero",
"type": "metric alert",
"query": "avg(last_5m):avg:kubernetes_state.deployment.replicas_available{app:queue-processor} == 0",
"message": "The queue-processor workload has scaled to zero replicas. This is expected during low traffic but verify no processing backlog exists.\n\n@slack-platform-alerts",
"tags": ["team:platform", "component:keda"],
"priority": 4
}
Using Datadog External Metrics with KEDA
Datadog can serve as a metric source for KEDA scaling decisions:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: datadog-driven-scaler
namespace: production
spec:
scaleTargetRef:
name: api-server
pollingInterval: 30
cooldownPeriod: 300
minReplicaCount: 2
maxReplicaCount: 20
triggers:
- type: datadog
metadata:
# Query Datadog metrics directly
query: "avg:trace.http.request.duration{service:api-server}.as_count()"
queryValue: "100" # Target value for scaling
queryAggregator: "avg" # avg, sum, min, max
age: "60" # Query age in seconds
type: "average" # global, average
authenticationRef:
name: datadog-auth
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
name: datadog-auth
namespace: production
spec:
secretTargetRef:
- parameter: apiKey
name: datadog-secrets
key: api-key
- parameter: appKey
name: datadog-secrets
key: app-key
# Optional: specify Datadog site
# env:
# - parameter: datadogSite
# name: DD_SITE
# containerName: app
Best Practices
KEDA Best Practices
- Set Appropriate Cooldown Periods: Prevent rapid scaling oscillations by setting appropriate `cooldownPeriod` values.
- Use Fallback Configuration: Define fallback replicas to maintain availability during scaler failures.
- Monitor Scaler Errors: Alert on `keda_scaled_object_errors` to catch configuration issues early.
- Test Scale-to-Zero: Thoroughly test workloads that scale to zero to ensure proper wake-up behavior.
- Use TriggerAuthentication: Avoid hardcoding credentials; use `TriggerAuthentication` with secrets or pod identity.
Karpenter Best Practices
- Define Multiple NodePools: Create specialized NodePools for different workload types (GPU, memory-optimized, cost-optimized).
- Set Resource Limits: Always define `limits` in NodePools to control infrastructure costs.
- Use Consolidation: Enable consolidation to reduce costs by removing underutilized nodes.
- Prefer Spot Instances: Use spot instances for fault-tolerant workloads to reduce costs significantly.
- Monitor Pending Pods: Set up alerts for pending pods to catch capacity issues quickly.
Combined Strategy Best Practices
- Right-Size Resource Requests: Accurate pod resource requests help Karpenter select optimal instance types.
- Coordinate Scaling Parameters: Ensure KEDA cooldown aligns with Karpenter consolidation timing.
- Use Pod Disruption Budgets: Protect critical workloads during node consolidation (see the sketch below).
- Monitor End-to-End: Track metrics from the event source through pod scaling to node provisioning.
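A minimal PodDisruptionBudget sketch for the queue-processor example above (the threshold is illustrative):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: queue-processor-pdb
  namespace: production
spec:
  minAvailable: 1               # keep at least one replica up during consolidation
  selector:
    matchLabels:
      app: queue-processor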
Conclusion
KEDA and Karpenter are complementary tools that together provide comprehensive Kubernetes autoscaling:
- KEDA excels at event-driven pod autoscaling, enabling scale-to-zero and scaling based on external metrics
- Karpenter handles infrastructure-level scaling, provisioning just-right nodes quickly and efficiently
- Together, they create a responsive, cost-effective autoscaling strategy
When combined with Datadog monitoring, you gain visibility into every layer of your autoscaling infrastructure, enabling proactive management and rapid troubleshooting.