AI-Augmented Application–Infrastructure Alignment: From Manual Middleware to Autonomous Agents
READER BEWARE: THE FOLLOWING WAS WRITTEN ENTIRELY BY AI WITHOUT HUMAN EDITING.
Introduction
For as long as software has run on shared infrastructure, engineers have wrestled with a deceptively hard problem: keeping application code and infrastructure in sync. Not just “does the app deploy?” but “does the app speak to its infrastructure correctly, durably, and under all load conditions?” This mandate — “Application code aligned to infrastructure in support of Platform features” — consumed enormous engineering cycles. Developers instrumented middleware. Platform teams wrote runbooks. SREs chased configuration drift at 2 AM.
That mandate is transforming into something fundamentally different: “AI-Augmented Application–Infrastructure Alignment” — a model where intelligent agents continuously enforce infrastructure compatibility, generate and validate environment contracts, detect configuration drift before it cascades, and proactively remediate inconsistencies between application code and runtime infrastructure.
This post traces that transformation across its full arc, compares the old and new operational models in detail, and closes with a practical proof-of-concept (POC) using AWS EKS and a Ruby on Rails application to demonstrate one concrete dimension of this shift.
Part 1: The Old Contract — Application Code Aligned to Infrastructure
The Core Challenge
In the pre-AI era, aligning an application to its infrastructure was fundamentally a human-coordination problem. The application codebase had to be modified — sometimes extensively — to properly consume, monitor, and integrate with the infrastructure resources it depended on. These modifications were not incidental; they were foundational to platform reliability.
Three forces drove this work:
- Infrastructure opacity — applications could not introspect their runtime environment without explicit instrumentation.
- Configuration volatility — environment variables, secrets, connection strings, and service endpoints changed across environments and over time.
- Observability gaps — without application-level telemetry, infrastructure teams could not right-size resources, detect saturation, or plan capacity.
Characteristic Patterns of the Old Way
1. Middleware Instrumentation for Infrastructure Feedback
The canonical example of application–infrastructure alignment was embedding observability middleware directly into the web application stack. Teams added gems, libraries, or custom Rack layers to capture request rates, error rates, response times, and database query statistics — and then exposed those metrics to infrastructure tooling.
In a Ruby on Rails application, this looked like:
# config/initializers/prometheus.rb
require 'prometheus/client'
require 'prometheus/middleware/collector'
require 'prometheus/middleware/exporter'
# Define metrics (prometheus-client 2.x API)
REGISTRY = Prometheus::Client.registry
HTTP_REQUEST_DURATION = REGISTRY.histogram(
  :http_request_duration_seconds,
  docstring: 'HTTP request duration',
  labels: [:method, :path, :status],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
)
DB_QUERY_DURATION = REGISTRY.histogram(
  :db_query_duration_seconds,
  docstring: 'ActiveRecord query duration',
  labels: [:table, :operation],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5]
)
ACTIVE_CONNECTIONS = REGISTRY.gauge(
  :db_connection_pool_active,
  docstring: 'Active database connections'
)
WAITING_CONNECTIONS = REGISTRY.gauge(
  :db_connection_pool_waiting,
  docstring: 'Waiting database connections'
)
# config/application.rb — inserting middleware into the Rack stack
module MyApp
  class Application < Rails::Application
    config.middleware.use Prometheus::Middleware::Collector
    config.middleware.use Prometheus::Middleware::Exporter
  end
end
# lib/middleware/infrastructure_reporter.rb — custom middleware
class InfrastructureReporter
def initialize(app)
@app = app
end
def call(env)
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
status, headers, body = @app.call(env)
duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
path = env['REQUEST_PATH'] || env['PATH_INFO']
method = env['REQUEST_METHOD']
HTTP_REQUEST_DURATION.observe(
duration,
labels: { method: method, path: sanitize_path(path), status: status }
)
report_connection_pool_stats
    [status, headers, body]
end
private
def report_connection_pool_stats
pool = ActiveRecord::Base.connection_pool
stat = pool.stat
ACTIVE_CONNECTIONS.set(stat[:busy])
WAITING_CONNECTIONS.set(stat[:waiting])
end
def sanitize_path(path)
path.gsub(/\d+/, ':id')
end
end
This instrumentation existed purely to inform infrastructure decisions: should we add more database replicas? Is the connection pool exhausted? Are certain endpoints disproportionately slow under specific infrastructure configurations? The application carried the burden of explaining itself to infrastructure.
2. Manual Environment Contract Management
Applications relied on complex chains of environment variables, often managed through combinations of .env files, Kubernetes ConfigMaps, Secrets, and SSM Parameter Store. Developers had to manually maintain these contracts across environments — and mismatches caused production incidents.
# kubernetes/configmap.yaml — manually maintained environment contract
apiVersion: v1
kind: ConfigMap
metadata:
name: rails-app-config
namespace: production
data:
RAILS_ENV: "production"
DATABASE_POOL: "10"
REDIS_URL: "redis://redis-primary.production.svc.cluster.local:6379"
ELASTICSEARCH_URL: "http://es-cluster.production.svc.cluster.local:9200"
SIDEKIQ_CONCURRENCY: "15"
RAILS_MAX_THREADS: "5"
WEB_CONCURRENCY: "3"
MALLOC_ARENA_MAX: "2"
# config/initializers/connection_validation.rb
# Developers wrote startup checks to catch mismatches early
Rails.application.config.after_initialize do
required_vars = %w[
DATABASE_URL REDIS_URL SECRET_KEY_BASE
SIDEKIQ_CONCURRENCY DATABASE_POOL
]
missing = required_vars.reject { |var| ENV[var].present? }
if missing.any?
raise "Missing required environment variables: #{missing.join(', ')}"
end
# Validate pool sizing coherence
db_pool = ENV.fetch('DATABASE_POOL').to_i
  threads = ENV.fetch('RAILS_MAX_THREADS', '5').to_i
if db_pool < threads
Rails.logger.warn(
"[CONFIG WARNING] DATABASE_POOL (#{db_pool}) < RAILS_MAX_THREADS (#{threads}). " \
"Thread starvation possible under load."
)
end
end
Every time infrastructure changed — a new service endpoint, a Redis cluster migration, a scaling event that required pool adjustment — a developer had to manually update application configuration, validate the contract, and re-deploy.
3. Health Check and Readiness Probe Engineering
Applications had to explicitly implement infrastructure-aware health checks that Kubernetes could use to gate traffic and manage pod lifecycle. This required developers to understand infrastructure topology — not just application logic.
# app/controllers/health_controller.rb
class HealthController < ActionController::Base
protect_from_forgery with: :null_session
# Kubernetes liveness probe — is the app alive at all?
def liveness
render json: { status: 'ok', timestamp: Time.current.iso8601 }, status: :ok
end
# Kubernetes readiness probe — is the app ready to serve traffic?
def readiness
checks = {}
overall_status = :ok
# Check database connectivity
begin
ActiveRecord::Base.connection.execute('SELECT 1')
checks[:database] = { status: 'ok' }
rescue => e
checks[:database] = { status: 'error', message: e.message }
overall_status = :service_unavailable
end
# Check Redis connectivity
begin
redis = Redis.new(url: ENV['REDIS_URL'])
redis.ping
checks[:redis] = { status: 'ok' }
rescue => e
checks[:redis] = { status: 'error', message: e.message }
overall_status = :service_unavailable
end
# Check connection pool health
pool_stat = ActiveRecord::Base.connection_pool.stat
pool_ratio = pool_stat[:busy].to_f / pool_stat[:size]
if pool_ratio > 0.9
checks[:connection_pool] = {
status: 'warning',
busy: pool_stat[:busy],
size: pool_stat[:size]
}
else
checks[:connection_pool] = { status: 'ok', **pool_stat.slice(:busy, :size, :waiting) }
end
render json: { status: overall_status == :ok ? 'ready' : 'not_ready', checks: checks },
status: overall_status
end
end
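The controller above only works if the probe paths are routed; the original snippet omits that wiring, so here is the typical (assumed) routes entry to match:

```ruby
# config/routes.rb — assumed wiring for the probe endpoints above
Rails.application.routes.draw do
  get '/health/liveness',  to: 'health#liveness'
  get '/health/readiness', to: 'health#readiness'
end
```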
# kubernetes/deployment.yaml — health probe configuration maintained by developers
spec:
containers:
- name: rails-app
livenessProbe:
httpGet:
path: /health/liveness
port: 3000
initialDelaySeconds: 30
periodSeconds: 10
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/readiness
port: 3000
initialDelaySeconds: 20
periodSeconds: 5
failureThreshold: 3
4. Connection Pool Tuning as an Artisanal Practice
Sizing database connection pools relative to infrastructure capacity was a manual, iterative process requiring deep knowledge of both application threading models and database server limits.
# config/database.yml — manually tuned connection pool settings
production:
adapter: postgresql
pool: <%= ENV.fetch("DATABASE_POOL") { 10 } %>
checkout_timeout: 5
connect_timeout: 5
variables:
statement_timeout: 30000
url: <%= ENV['DATABASE_URL'] %>
# config/puma.rb — manually coordinated with database pool size
workers ENV.fetch("WEB_CONCURRENCY") { 2 }
threads_count = ENV.fetch("RAILS_MAX_THREADS") { 5 }
threads threads_count, threads_count
# NOTE: DATABASE_POOL must be >= RAILS_MAX_THREADS * WEB_CONCURRENCY
# Currently: 5 threads * 2 workers = 10 connections minimum.
# DATABASE_POOL set to 12 to allow headroom.
# When scaling WEB_CONCURRENCY, DATABASE_POOL must be updated manually.
preload_app!
The comment in that Puma configuration is a fingerprint of the era: a developer leaving a note to their future self because the system had no mechanism to enforce the relationship automatically.
Why the Old Model Hit Its Limits
| Failure Mode | Cause | Impact |
|---|---|---|
| Configuration drift | Manual syncing across environments | Production incidents from staging/prod mismatches |
| Pool exhaustion | Static sizing without runtime awareness | Cascading failures under unexpected load |
| Observability gaps | Incomplete middleware coverage | Infrastructure teams flying blind on capacity |
| Onboarding friction | Implicit tribal knowledge of contracts | Slow developer ramp-up, repeated mistakes |
| Incident fatigue | Human-in-the-loop for every remediation | Overnight escalations for fixable misconfiguration |
Part 2: The New Contract — AI-Augmented Application–Infrastructure Alignment
The Paradigm Shift
In the AI-Augmented model, the relationship between application code and infrastructure is no longer maintained through human discipline and manual instrumentation. Instead, AI agents become the connective tissue — continuously observing both the application layer and infrastructure reality, enforcing contracts, detecting divergence, and remediating drift without waiting for a human to notice.
Application Commit → [AI Compatibility Enforcer] → [Environment Contract Generator] →
[Runtime Drift Detector] → [Proactive Remediation Agent] → [Telemetry Feedback Loop]
Four capabilities define this new model:
Capability 1: AI-Enforced Infrastructure Compatibility
Rather than developers manually consulting runbooks or tribal knowledge to understand what infrastructure constraints their code must satisfy, an AI agent analyzes every code change and flags incompatibilities before they reach production.
This agent understands:
- Threading model implications (a change to puma.rb threads that would cause pool exhaustion)
- Resource consumption patterns (a new background job that will saturate the Sidekiq queue)
- Service dependency changes (a new require that introduces a dependency on an unprovisioned infrastructure service)
- Container resource footprint (an algorithm change that increases memory pressure beyond pod limits)
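As a toy illustration of the first rule, the threading check reduces to arithmetic over proposed configuration values. The function name and finding shape below are invented for illustration; a real enforcer would parse the actual diff and consult live telemetry.

```ruby
# Hypothetical compatibility rule: a proposed puma.rb change must fit the
# provisioned database connection budget. All names here are illustrative.
def check_threading_change(threads:, workers:, pool:, overhead: 2)
  required = threads * workers + overhead
  return { ok: true, required: required } if pool >= required

  { ok: false, required: required,
    finding: "#{threads} threads x #{workers} workers needs #{required} " \
             "connections, but DATABASE_POOL is #{pool}" }
end

# A PR bumping RAILS_MAX_THREADS from 5 to 8 without touching the pool:
check_threading_change(threads: 8, workers: 2, pool: 12)
# => flags a finding: 8 * 2 + 2 = 18 required, only 12 provisioned
```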
Capability 2: AI-Generated and AI-Validated Environment Contracts
Instead of developers manually writing and maintaining ConfigMaps, Secrets, and environment variable documentation, an AI agent generates environment contracts by analyzing the application codebase and cross-referencing them against infrastructure state.
The agent:
- Parses application code to enumerate all environment variable dependencies
- Validates that all referenced variables are provisioned in the target environment
- Detects breaking changes between application version contracts and currently deployed infrastructure
- Generates draft Kubernetes manifests with correctly sized resources based on observed runtime behavior
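The first two steps can be sketched as a static scan; the regex below deliberately covers only the common ENV['X'] and ENV.fetch('X') forms, and a real agent would also handle dynamic lookups, gem internals, and YAML/ERB configuration files.

```ruby
# Sketch: statically scan Ruby source for environment variable references
# and emit a draft contract stanza (simplified, illustration only).
require 'yaml'

ENV_REF = /ENV(?:\.fetch\(|\[)\s*['"]([A-Z0-9_]+)['"]/

# Collect unique variable names referenced in one source string.
def extract_env_vars(source)
  source.scan(ENV_REF).flatten.uniq.sort
end

# Aggregate across sources into a draft contract stanza.
def draft_contract(sources)
  vars = sources.flat_map { |src| extract_env_vars(src) }.uniq.sort
  { 'environment_variables' => { 'required' => vars } }
end

code = <<~RUBY
  redis = Redis.new(url: ENV['REDIS_URL'])
  pool  = ENV.fetch('DATABASE_POOL', '10').to_i
RUBY
puts draft_contract([code]).to_yaml
```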
Capability 3: Continuous Drift Detection
Configuration drift — the gradual divergence between what the application expects and what infrastructure provides — is detected continuously by an agent that watches both sides of the interface simultaneously.
The agent monitors:
- ConfigMap and Secret changes in Kubernetes against application-declared expectations
- Database connection counts against application pool configurations
- Memory and CPU headroom against application resource requests and limits
- Service mesh configurations against application service discovery patterns
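Stripped of the Kubernetes plumbing, every one of these checks is the same comparison: declared expectation versus deployed reality. A pure-data sketch (key names illustrative):

```ruby
# Core of a drift check, reduced to pure data: report every key where
# the deployed value diverges from the declared expectation.
def diff_config(declared, deployed)
  declared.each_with_object([]) do |(key, expected), drift|
    actual = deployed[key]
    drift << { key: key, expected: expected, actual: actual } if actual != expected
  end
end

declared = { 'DATABASE_POOL' => '12', 'SIDEKIQ_CONCURRENCY' => '10' }
deployed = { 'DATABASE_POOL' => '10', 'SIDEKIQ_CONCURRENCY' => '10' }
diff_config(declared, deployed)
# => [{ key: 'DATABASE_POOL', expected: '12', actual: '10' }]
```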
Capability 4: Proactive Remediation
When drift is detected, the AI agent does not simply alert a human. It:
- Classifies the severity of the drift and its blast radius
- Generates a remediation plan with specific configuration changes
- Applies low-risk remediations autonomously (adjusting replicas, updating ConfigMap values, restarting degraded pods)
- Escalates high-risk remediations with a fully formed plan ready for human approval
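That decision logic amounts to a routing policy over findings; the action whitelist below is invented for illustration, not part of any real agent API.

```ruby
# Illustrative escalation policy: auto-apply only reversible,
# namespace-scoped actions; everything else escalates with a plan attached.
LOW_RISK_ACTIONS = %i[patch_configmap scale_replicas restart_pod].freeze

def route_finding(finding)
  if finding[:auto_remediable] && LOW_RISK_ACTIONS.include?(finding[:action])
    { route: :auto_apply, action: finding[:action] }
  else
    { route: :escalate, plan: finding }
  end
end

route_finding(action: :patch_configmap, auto_remediable: true)
# => { route: :auto_apply, action: :patch_configmap }
route_finding(action: :resize_node_group, auto_remediable: false)
# => escalation, with the original finding carried as the plan
```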
Part 3: Comparing the Models
| Dimension | Old Model (Manual Alignment) | New Model (AI-Augmented Alignment) |
|---|---|---|
| Instrumentation | Developers embed middleware manually | AI agent instruments automatically based on gap analysis |
| Environment contracts | Hand-authored ConfigMaps and runbooks | AI-generated from codebase analysis, validated against live infra |
| Drift detection | Scheduled audits or incident-driven discovery | Continuous real-time agent monitoring |
| Remediation | Pager alert → human investigation → manual fix | Autonomous fix for low-risk drift; plan-ready escalation for high-risk |
| Pool/resource sizing | Artisanal tuning from tribal knowledge | AI recommends based on observed runtime telemetry |
| Compatibility checks | Code review (human, inconsistent) | Pre-commit and pre-deploy agent analysis |
| Onboarding | Developer reads wiki, asks teammates | AI agent surfaces constraints in developer’s IDE and PR |
| MTTR for config incidents | Hours (pager → triage → fix → deploy) | Minutes (agent detects → classifies → remediates or escalates) |
| Knowledge location | In developers’ heads and wikis | In agent policies, codebase analysis, and telemetry history |
Part 4: POC — AI Infrastructure Alignment Agent for Ruby on Rails on AWS EKS
This proof-of-concept demonstrates one concrete slice of the AI-Augmented model: an agent that runs inside a Rails application’s lifecycle on AWS EKS, continuously validates infrastructure compatibility, detects configuration drift, and applies or escalates remediations.
Architecture Overview
┌─────────────────────────────────────────────────────────────┐
│ AWS EKS Cluster │
│ │
│ ┌─────────────────────┐ ┌──────────────────────────┐ │
│ │ Rails Application │ │ Alignment Agent Sidecar │ │
│ │ │◄──►│ │ │
│ │ • Puma web server │ │ • Contract validator │ │
│ │ • Sidekiq workers │ │ • Drift detector │ │
│ │ • Health endpoints │ │ • Remediation engine │ │
│ └─────────┬───────────┘ └──────────┬───────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Kubernetes API (ConfigMaps, Secrets, Deployments) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Amazon RDS │ │ ElastiCache │ │ AWS Bedrock │ │
│ │ (PostgreSQL) │ │ (Redis) │ │ (AI Engine) │ │
│ └─────────────────┘ └──────────────┘ └───────────────┘ │
└─────────────────────────────────────────────────────────────┘
Step 1: Rails Application with Declarative Infrastructure Contract
The Rails application declares its infrastructure requirements explicitly using a structured contract file. This is the foundation the AI agent reads from.
# config/infrastructure_contract.yml
# Declarative contract: what this application requires from infrastructure
schema_version: "1.0"
application:
name: my-rails-app
framework: rails
version: "7.1"
compute:
min_replicas: 2
max_replicas: 10
resources:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "1Gi"
cpu: "1000m"
database:
adapter: postgresql
pool_formula: "RAILS_MAX_THREADS * WEB_CONCURRENCY + 2"
min_pool: 5
max_pool: 25
required_extensions:
- uuid-ossp
- pgcrypto
cache:
adapter: redis
required_commands:
- GET
- SET
- EXPIRE
- LPUSH
- BLPOP
environment_variables:
required:
- DATABASE_URL
- REDIS_URL
- SECRET_KEY_BASE
- RAILS_MASTER_KEY
optional_with_defaults:
DATABASE_POOL: "10"
RAILS_MAX_THREADS: "5"
WEB_CONCURRENCY: "2"
SIDEKIQ_CONCURRENCY: "10"
health:
liveness_path: /health/liveness
readiness_path: /health/readiness
startup_timeout_seconds: 60
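Note that pool_formula is stored as a string. The agent code later hard-codes the same arithmetic, but one hedged way to evaluate the formula generically — without reaching for eval — is a tiny whitelist parser. This sketch handles only identifiers, integers, '*', and '+', which is enough for the formula above:

```ruby
# Sketch: evaluate a contract pool_formula such as
# "RAILS_MAX_THREADS * WEB_CONCURRENCY + 2" against an env hash.
def evaluate_pool_formula(formula, env)
  tokens = formula.split(/\s*([*+])\s*/).map do |t|
    case t
    when '*', '+' then t
    when /\A\d+\z/ then t.to_i
    else Integer(env.fetch(t)) # unknown identifiers raise loudly
    end
  end
  # Collapse multiplications first, then sum the remaining terms.
  while (i = tokens.index('*'))
    tokens[(i - 1)..(i + 1)] = tokens[i - 1] * tokens[i + 1]
  end
  tokens.reject { |t| t == '+' }.sum
end

env = { 'RAILS_MAX_THREADS' => '5', 'WEB_CONCURRENCY' => '2' }
evaluate_pool_formula('RAILS_MAX_THREADS * WEB_CONCURRENCY + 2', env)
# => 12
```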
Step 2: Infrastructure Alignment Agent — Core Implementation
The agent runs as a sidecar container in the same Kubernetes pod as the Rails application. It has three primary loops: contract validation, drift detection, and remediation.
# agent/infrastructure_alignment_agent.rb
require 'aws-sdk-bedrockruntime'
require 'aws-sdk-eks'
require 'k8s-ruby'
require 'yaml'
require 'json'
require 'logger'
class InfrastructureAlignmentAgent
DRIFT_CHECK_INTERVAL = 60 # seconds
CONTRACT_FILE = '/app/config/infrastructure_contract.yml'
def initialize
@logger = Logger.new($stdout)
@logger.progname = 'AlignmentAgent'
@contract = YAML.load_file(CONTRACT_FILE)
@k8s = K8s::Client.in_cluster_config
@bedrock = Aws::BedrockRuntime::Client.new(region: ENV['AWS_REGION'])
@namespace = ENV.fetch('KUBERNETES_NAMESPACE', 'default')
@deployment_name = ENV.fetch('DEPLOYMENT_NAME')
@dry_run = ENV.fetch('AGENT_DRY_RUN', 'false') == 'true'
end
def run
@logger.info("Starting Infrastructure Alignment Agent for #{@deployment_name}")
loop do
begin
validate_environment_contract
detect_and_remediate_drift
rescue => e
@logger.error("Agent cycle error: #{e.class}: #{e.message}")
@logger.debug(e.backtrace.join("\n"))
end
sleep(DRIFT_CHECK_INTERVAL)
end
end
private
# ─── CONTRACT VALIDATION ──────────────────────────────────────────────────
def validate_environment_contract
@logger.info("Validating environment contract...")
violations = []
# Check required environment variables
@contract['environment_variables']['required'].each do |var|
unless ENV[var]
violations << { type: :missing_env_var, variable: var, severity: :critical }
end
end
# Validate pool sizing coherence
threads = ENV.fetch('RAILS_MAX_THREADS', '5').to_i
workers = ENV.fetch('WEB_CONCURRENCY', '2').to_i
pool = ENV.fetch('DATABASE_POOL', '10').to_i
required_pool = threads * workers + 2
if pool < required_pool
violations << {
type: :pool_undersized,
current: pool,
required: required_pool,
severity: :high,
context: "#{threads} threads × #{workers} workers + 2 overhead"
}
end
if violations.any?
handle_violations(violations, 'contract_validation')
else
@logger.info("Contract validation: PASS (#{@contract['environment_variables']['required'].length} variables OK)")
end
end
# ─── DRIFT DETECTION ──────────────────────────────────────────────────────
def detect_and_remediate_drift
@logger.info("Running drift detection cycle...")
drift_findings = []
drift_findings.concat(detect_configmap_drift)
drift_findings.concat(detect_resource_drift)
drift_findings.concat(detect_replica_drift)
if drift_findings.any?
@logger.warn("Detected #{drift_findings.length} drift finding(s)")
remediation_plan = generate_ai_remediation_plan(drift_findings)
execute_remediation(remediation_plan, drift_findings)
else
@logger.info("Drift detection: CLEAN — no drift detected")
end
end
def detect_configmap_drift
findings = []
begin
configmap = @k8s.api('v1')
.resource('configmaps', namespace: @namespace)
.get(@deployment_name)
      # k8s-ruby returns symbol keys; normalize to strings for lookup below
      deployed_config = configmap.data.to_h.transform_keys(&:to_s)
# Compare deployed DATABASE_POOL against contract requirement
threads = deployed_config['RAILS_MAX_THREADS'].to_i
workers = deployed_config['WEB_CONCURRENCY'].to_i
deployed_pool = deployed_config['DATABASE_POOL'].to_i
required_pool = threads * workers + 2
if deployed_pool < required_pool
findings << {
type: :configmap_drift,
resource: "ConfigMap/#{@deployment_name}",
field: 'DATABASE_POOL',
current_value: deployed_pool.to_s,
expected_value: required_pool.to_s,
severity: :high,
auto_remediable: true
}
end
rescue K8s::Error::NotFound
findings << {
type: :missing_configmap,
resource: "ConfigMap/#{@deployment_name}",
severity: :critical,
auto_remediable: false
}
end
findings
end
def detect_resource_drift
findings = []
begin
deployment = @k8s.api('apps/v1')
.resource('deployments', namespace: @namespace)
.get(@deployment_name)
container = deployment.spec.template.spec.containers.first
actual_limits = container.resources.limits
contract_limits = @contract.dig('compute', 'resources', 'limits')
# Check memory limit
actual_memory = parse_memory_mi(actual_limits['memory'])
contract_memory = parse_memory_mi(contract_limits['memory'])
if actual_memory < contract_memory
findings << {
type: :resource_drift,
resource: "Deployment/#{@deployment_name}",
field: 'resources.limits.memory',
current_value: actual_limits['memory'],
expected_value: contract_limits['memory'],
severity: :medium,
auto_remediable: false
}
end
rescue => e
@logger.error("Resource drift check failed: #{e.message}")
end
findings
end
def detect_replica_drift
findings = []
begin
deployment = @k8s.api('apps/v1')
.resource('deployments', namespace: @namespace)
.get(@deployment_name)
actual_replicas = deployment.spec.replicas
min_replicas = @contract.dig('compute', 'min_replicas')
if actual_replicas < min_replicas
findings << {
type: :replica_drift,
resource: "Deployment/#{@deployment_name}",
field: 'spec.replicas',
current_value: actual_replicas.to_s,
expected_value: min_replicas.to_s,
severity: :high,
auto_remediable: true
}
end
rescue => e
@logger.error("Replica drift check failed: #{e.message}")
end
findings
end
# ─── AI-POWERED REMEDIATION PLANNING ──────────────────────────────────────
def generate_ai_remediation_plan(findings)
prompt = build_remediation_prompt(findings)
response = @bedrock.invoke_model(
model_id: 'anthropic.claude-3-5-sonnet-20241022-v2:0',
content_type: 'application/json',
accept: 'application/json',
body: JSON.generate({
anthropic_version: 'bedrock-2023-05-31',
max_tokens: 2048,
messages: [
{
role: 'user',
content: prompt
}
]
})
)
result = JSON.parse(response.body.read)
plan_text = result['content'][0]['text']
    # Extract the fenced JSON plan; tolerate an unfenced response
    JSON.parse(plan_text[/```json\n(.*?)\n```/m, 1] || plan_text)
rescue => e
@logger.error("AI remediation planning failed: #{e.message}")
{ actions: [], escalate: true, reason: "AI planning unavailable: #{e.message}" }
end
def build_remediation_prompt(findings)
<<~PROMPT
You are an infrastructure alignment agent for a Ruby on Rails application running on AWS EKS.
You have detected the following drift findings:
#{JSON.pretty_generate(findings)}
The application's infrastructure contract is:
#{JSON.pretty_generate(@contract)}
For each finding, determine:
1. Whether it can be auto-remediated safely (low blast radius, reversible)
2. The exact Kubernetes API actions required to remediate
3. Whether it requires human escalation
Respond ONLY with a JSON remediation plan in this format:
```json
{
"actions": [
{
"finding_type": "configmap_drift",
"resource": "ConfigMap/my-rails-app",
"action": "patch",
"api_group": "v1",
"resource_type": "configmaps",
"patch": { "data": { "DATABASE_POOL": "12" } },
"auto_apply": true,
"rationale": "Pool size below contract minimum. Safe to increase."
}
],
"escalate": false,
"escalation_reason": null,
"summary": "1 auto-remediable finding. Patching ConfigMap DATABASE_POOL."
}
```
PROMPT
end
# ─── REMEDIATION EXECUTION ─────────────────────────────────────────────────
def execute_remediation(plan, findings)
@logger.info("Remediation plan: #{plan['summary']}")
plan['actions'].each do |action|
if action['auto_apply'] && !@dry_run
apply_remediation_action(action)
else
log_manual_action_required(action)
end
end
if plan['escalate']
escalate_to_operations(findings, plan)
end
end
def apply_remediation_action(action)
@logger.info("Applying remediation: #{action['action']} #{action['resource']}")
case action['action']
when 'patch'
@k8s.api(action['api_group'])
.resource(action['resource_type'], namespace: @namespace)
.merge_patch(action['resource'].split('/').last, action['patch'])
@logger.info("✓ Patched #{action['resource']}: #{action['rationale']}")
when 'scale'
@k8s.api('apps/v1')
.resource('deployments', namespace: @namespace)
.merge_patch(@deployment_name, { spec: { replicas: action['replicas'] } })
@logger.info("✓ Scaled #{action['resource']} to #{action['replicas']} replicas")
else
@logger.warn("Unknown action type: #{action['action']} — skipping")
end
rescue => e
@logger.error("Remediation action failed: #{e.message}")
end
def escalate_to_operations(findings, plan)
# In production: post to PagerDuty, Slack, or create a GitHub issue
@logger.warn("ESCALATION REQUIRED: #{plan['escalation_reason']}")
@logger.warn("Findings requiring human review: #{JSON.generate(findings.reject { |f| f[:auto_remediable] })}")
end
def handle_violations(violations, context)
critical = violations.select { |v| v[:severity] == :critical }
if critical.any?
@logger.error("CRITICAL contract violations in #{context}: #{JSON.generate(critical)}")
else
@logger.warn("Contract violations in #{context}: #{JSON.generate(violations)}")
end
end
def parse_memory_mi(memory_str)
case memory_str
when /(\d+)Gi/ then $1.to_i * 1024
when /(\d+)Mi/ then $1.to_i
when /(\d+)Ki/ then $1.to_i / 1024
else 0
end
end
end
# Entry point
InfrastructureAlignmentAgent.new.run
Step 3: Sidecar Container Deployment Manifest
The agent deploys as a sidecar alongside the Rails application, sharing the application’s infrastructure contract file via a ConfigMap volume.
# kubernetes/deployment-with-alignment-agent.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-rails-app
namespace: production
labels:
app: my-rails-app
component: web
spec:
replicas: 2
selector:
matchLabels:
app: my-rails-app
template:
metadata:
labels:
app: my-rails-app
spec:
serviceAccountName: rails-app-alignment-agent
volumes:
- name: infrastructure-contract
configMap:
name: rails-infrastructure-contract
- name: shared-tmp
emptyDir: {}
containers:
# ── Main Rails application container ──────────────────────────────
- name: rails-app
image: ${ECR_REGISTRY}/my-rails-app:${IMAGE_TAG}
ports:
- containerPort: 3000
envFrom:
- configMapRef:
name: my-rails-app
- secretRef:
name: my-rails-app-secrets
resources:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "1Gi"
cpu: "1000m"
livenessProbe:
httpGet:
path: /health/liveness
port: 3000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /health/readiness
port: 3000
initialDelaySeconds: 20
periodSeconds: 5
volumeMounts:
- name: infrastructure-contract
mountPath: /app/config/infrastructure_contract.yml
subPath: infrastructure_contract.yml
# ── AI Alignment Agent sidecar ────────────────────────────────────
- name: alignment-agent
image: ${ECR_REGISTRY}/infrastructure-alignment-agent:${AGENT_TAG}
env:
- name: AWS_REGION
value: "us-east-1"
- name: KUBERNETES_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: DEPLOYMENT_NAME
value: "my-rails-app"
- name: AGENT_DRY_RUN
value: "false"
resources:
requests:
memory: "128Mi"
cpu: "50m"
limits:
memory: "256Mi"
cpu: "200m"
volumeMounts:
- name: infrastructure-contract
mountPath: /app/config/infrastructure_contract.yml
subPath: infrastructure_contract.yml
---
# RBAC: Grant agent permission to read and patch deployment resources
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: alignment-agent-role
namespace: production
rules:
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["get", "list", "watch", "patch"]
- apiGroups: ["apps"]
resources: ["deployments"]
verbs: ["get", "list", "watch", "patch"]
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: alignment-agent-binding
namespace: production
subjects:
- kind: ServiceAccount
name: rails-app-alignment-agent
namespace: production
roleRef:
kind: Role
name: alignment-agent-role
apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
name: rails-app-alignment-agent
namespace: production
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::${AWS_ACCOUNT_ID}:role/rails-app-alignment-agent
Step 4: Terraform for AWS Infrastructure
Provision the underlying AWS infrastructure — EKS cluster, RDS, ElastiCache, and Bedrock IAM permissions — using Terraform.
# terraform/main.tf
terraform {
required_version = ">= 1.6"
required_providers {
aws = { source = "hashicorp/aws", version = "~> 5.0" }
}
}
# ── EKS Cluster ─────────────────────────────────────────────────────────────
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 20.0"
cluster_name = "rails-app-cluster"
cluster_version = "1.31"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
cluster_addons = {
coredns = { most_recent = true }
kube-proxy = { most_recent = true }
vpc-cni = { most_recent = true }
aws-ebs-csi-driver = { most_recent = true }
}
eks_managed_node_groups = {
application = {
instance_types = ["m7i.large"]
min_size = 2
max_size = 10
desired_size = 3
labels = {
workload = "application"
}
}
}
enable_cluster_creator_admin_permissions = true
}
# ── RDS PostgreSQL ───────────────────────────────────────────────────────────
module "db" {
source = "terraform-aws-modules/rds/aws"
version = "~> 6.0"
identifier = "rails-app-db"
engine = "postgres"
engine_version = "16"
instance_class = "db.t4g.medium"
allocated_storage = 100
storage_encrypted = true
db_name = "myapp_production"
username = "myapp"
manage_master_user_password = true
vpc_security_group_ids = [module.rds_sg.security_group_id]
db_subnet_group_name = module.vpc.database_subnet_group_name
# Parameters aligned to Rails connection pool expectations
parameters = [
{ name = "max_connections", value = "200" },
{ name = "shared_buffers", value = "{DBInstanceClassMemory/32768}" }
]
}
# ── ElastiCache Redis ────────────────────────────────────────────────────────
resource "aws_elasticache_replication_group" "redis" {
replication_group_id = "rails-app-redis"
description = "Redis for Rails cache and Sidekiq"
node_type = "cache.t4g.medium"
num_cache_clusters = 2
automatic_failover_enabled = true
at_rest_encryption_enabled = true
transit_encryption_enabled = true
engine_version = "7.1"
}
# ── IAM Role for Alignment Agent (IRSA) ─────────────────────────────────────
module "alignment_agent_irsa" {
source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
version = "~> 5.0"
role_name = "rails-app-alignment-agent"
oidc_providers = {
main = {
provider_arn = module.eks.oidc_provider_arn
namespace_service_accounts = ["production:rails-app-alignment-agent"]
}
}
role_policy_arns = {
bedrock = aws_iam_policy.bedrock_access.arn
}
}
resource "aws_iam_policy" "bedrock_access" {
name = "rails-app-alignment-agent-bedrock"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Effect = "Allow"
Action = ["bedrock:InvokeModel"]
Resource = [
"arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-20241022-v2:0"
]
}
]
})
}
Step 5: GitHub Actions — CI/CD with Pre-Deploy Contract Validation
Before any deployment reaches EKS, a GitHub Actions workflow runs the AI agent in validation-only mode to catch contract violations in the pipeline, before they manifest in production.
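The workflow invokes agent/validate_contract.rb, which the post does not show. A plausible validation-only entry point, reusing the contract's required-variable check, might look like the following; the flag names mirror the workflow invocation, and everything else (function names, output shape) is assumption:

```ruby
#!/usr/bin/env ruby
# agent/validate_contract.rb — hypothetical validation-only entry point.
# Only the environment-variable check is sketched; cluster-side checks
# (kubeconfig context, ConfigMap comparison) are left out.
require 'optparse'
require 'yaml'

def validate_contract(contract_path, environment, env = ENV)
  contract = YAML.load_file(contract_path)
  required = contract.dig('environment_variables', 'required') || []
  missing  = required.reject { |var| env.key?(var) }
  return { ok: true, checked: required.length } if missing.empty?

  { ok: false, environment: environment, missing: missing }
end

if $PROGRAM_NAME == __FILE__ && !ARGV.empty?
  opts = {}
  OptionParser.new do |o|
    o.on('--contract PATH') { |v| opts[:contract] = v }
    o.on('--environment NAME') { |v| opts[:environment] = v }
    o.on('--kubeconfig-context CTX') { |v| opts[:context] = v }
  end.parse!

  result = validate_contract(opts.fetch(:contract), opts.fetch(:environment))
  abort("Missing required variables: #{result[:missing].join(', ')}") unless result[:ok]
  puts "Contract OK for #{opts[:environment]} (#{result[:checked]} variables checked)"
end
```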
# .github/workflows/deploy-rails.yml
name: Deploy Rails App to EKS
on:
push:
branches: [main]
pull_request:
branches: [main]
permissions:
id-token: write
contents: read
jobs:
# ── Contract Validation (runs on PR and push) ──────────────────────────────
validate-infrastructure-contract:
name: Validate Infrastructure Contract
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: us-east-1
- name: Set up Ruby
uses: ruby/setup-ruby@v1
with:
ruby-version: '3.3'
bundler-cache: true
      - name: Update kubeconfig
        run: |
          # Create the eks-staging context that the validation step expects
          aws eks update-kubeconfig \
            --name rails-app-cluster \
            --region us-east-1 \
            --alias eks-staging
- name: Run contract validation
run: |
ruby agent/validate_contract.rb \
--contract config/infrastructure_contract.yml \
--environment staging \
--kubeconfig-context eks-staging
env:
AGENT_DRY_RUN: "true"
AGENT_MODE: "validate_only"
# ── Build and Push ─────────────────────────────────────────────────────────
build:
name: Build and Push Docker Image
runs-on: ubuntu-latest
needs: validate-infrastructure-contract
outputs:
image-tag: ${{ github.sha }}
steps:
- uses: actions/checkout@v4
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: us-east-1
- name: Login to ECR
id: ecr-login
uses: aws-actions/amazon-ecr-login@v2
- name: Build and push Rails image
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: ${{ steps.ecr-login.outputs.registry }}/my-rails-app:${{ github.sha }}
cache-from: type=registry,ref=${{ steps.ecr-login.outputs.registry }}/my-rails-app:cache
cache-to: type=registry,ref=${{ steps.ecr-login.outputs.registry }}/my-rails-app:cache,mode=max
- name: Build and push Alignment Agent image
uses: docker/build-push-action@v5
with:
context: agent/
push: true
tags: ${{ steps.ecr-login.outputs.registry }}/infrastructure-alignment-agent:${{ github.sha }}
# ── Deploy to EKS ──────────────────────────────────────────────────────────
deploy:
name: Deploy to EKS
runs-on: ubuntu-latest
needs: build
if: github.ref == 'refs/heads/main'
steps:
- uses: actions/checkout@v4
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: us-east-1
- name: Update kubeconfig
run: |
aws eks update-kubeconfig \
--name rails-app-cluster \
--region us-east-1
      - name: Login to ECR
        id: ecr-login
        uses: aws-actions/amazon-ecr-login@v2
      - name: Deploy with alignment agent
        run: |
          envsubst < kubernetes/deployment-with-alignment-agent.yaml | kubectl apply -f -
        env:
          ECR_REGISTRY: ${{ steps.ecr-login.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
          AGENT_TAG: ${{ github.sha }}
          AWS_ACCOUNT_ID: ${{ secrets.AWS_ACCOUNT_ID }}
- name: Wait for rollout
run: |
kubectl rollout status deployment/my-rails-app \
-n production \
--timeout=300s
- name: Verify post-deploy contract
run: |
kubectl exec -n production \
deployment/my-rails-app \
-c alignment-agent \
-- ruby /agent/validate_contract.rb --mode post_deploy
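The workflow invokes agent/validate_contract.rb in both validate-only and post-deploy modes, but the script itself is not reproduced here. A minimal sketch of the CLI surface it would need (option names come from the workflow above; the contract shape and validation logic are elided):

```ruby
# agent/validate_contract.rb -- minimal CLI sketch, not the full validator.
require "optparse"
require "yaml"

options = { mode: ENV.fetch("AGENT_MODE", "validate_only") }
OptionParser.new do |opts|
  opts.on("--contract PATH")           { |v| options[:contract] = v }
  opts.on("--environment NAME")        { |v| options[:environment] = v }
  opts.on("--kubeconfig-context NAME") { |v| options[:context] = v }
  opts.on("--mode MODE")               { |v| options[:mode] = v }
end.parse!

contract = options[:contract] ? YAML.load_file(options[:contract]) : {}
dry_run  = ENV["AGENT_DRY_RUN"] == "true"
puts "Validating contract (mode=#{options[:mode]}, dry_run=#{dry_run})"
# ...compare `contract` against live cluster state here, and exit non-zero
# on any violation so the CI job fails before the deploy job runs.
```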
Step 6: Agent Dockerfile
# agent/Dockerfile
FROM ruby:3.3-slim
WORKDIR /agent
RUN apt-get update && apt-get install -y --no-install-recommends \
ca-certificates \
curl \
&& rm -rf /var/lib/apt/lists/*
COPY Gemfile Gemfile.lock ./
RUN bundle config set --local without 'development test' && \
    bundle config set --local frozen true && \
    bundle install
COPY . .
CMD ["ruby", "infrastructure_alignment_agent.rb"]
# agent/Gemfile
source 'https://rubygems.org'
gem 'aws-sdk-bedrockruntime', '~> 1.0'
gem 'aws-sdk-eks', '~> 1.0'
gem 'k8s-ruby', '~> 0.16'
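The container's entrypoint, infrastructure_alignment_agent.rb, is the 60-second reconcile loop the rest of this post describes. A minimal sketch of its shape, where load_contract, observe_cluster_state, and remediate are hypothetical helpers standing in for the YAML-contract and k8s-ruby plumbing:

```ruby
# Return the contract keys whose observed value diverges from the desired one.
def detect_drift(desired, observed)
  desired.each_with_object({}) do |(key, want), drift|
    have = observed[key]
    drift[key] = { expected: want, actual: have } if have != want
  end
end

# Reconcile loop: compare contract to cluster state, remediate, sleep, repeat.
def reconcile_forever(interval: 60)
  loop do
    desired  = load_contract          # parse config/infrastructure_contract.yml
    observed = observe_cluster_state  # query the Kubernetes API via k8s-ruby
    drift    = detect_drift(desired, observed)
    remediate(drift) unless drift.empty?  # patch ConfigMaps, scale deployments
    sleep interval
  end
end
```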
What This POC Demonstrates
| Old Pattern | POC Replacement |
|---|---|
| Developer writes pool size comment in puma.rb | Agent detects pool/thread mismatch, patches ConfigMap |
| SRE checks replica count manually after alert | Agent detects replica drift, scales deployment directly |
| Developer validates env vars at app boot | Agent continuously validates contract every 60 seconds |
| Human writes remediation plan after incident | AWS Bedrock generates structured remediation plan at detection time |
| Middleware captures stats for humans to review | Agent observes Kubernetes state and acts autonomously |
Conclusion: The Evolving Role of the Application–Infrastructure Engineer
The transformation from “Application code aligned to infrastructure” to “AI-Augmented Application–Infrastructure Alignment” does not replace engineering judgment; it multiplies the leverage of that judgment.
In the old model, engineering judgment was applied once, at configuration time, and then slowly decayed as infrastructure evolved around static application code. Middleware was added. Runbooks were written. Connection pools were tuned. And then the cycle repeated with the next incident.
In the new model, engineering judgment is applied to designing the agent’s decision framework — the infrastructure contract, the drift policies, the remediation boundaries, the escalation thresholds. The agent applies that judgment continuously, 24 hours a day, across every dimension of the application–infrastructure interface simultaneously.
The middleware example is instructive. In the old world, a developer spent a sprint embedding Prometheus collectors into the Rack stack so that an operations engineer could, weeks later, use those metrics to make a capacity decision. In the new world, an alignment agent observes application behavior, correlates it against infrastructure contracts, detects the emerging capacity gap, generates a scaling recommendation backed by telemetry, and either applies it automatically or presents it to an engineer with full context and a one-click approval path.
The sprint collapses into seconds. The decision improves. The engineer focuses on the next hard problem.
That is the promise — and increasingly, the reality — of AI-Augmented Application–Infrastructure Alignment.