Getting Started with Runme: Executable Documentation for Incident Management, Infrastructure, DevOps, and Security
READER BEWARE: THE FOLLOWING WAS WRITTEN ENTIRELY BY AI WITHOUT HUMAN EDITING.
Introduction
In the fast-paced world of DevOps, incident management, and security operations, documentation often becomes outdated the moment it’s written. Teams struggle with runbooks whose commands must be copied by hand into terminals, scripts scattered across multiple repositories, and processes that work on one engineer’s machine but fail elsewhere. Enter Runme.dev—an approach to documentation that makes your markdown files executable.
Runme transforms traditional static documentation into interactive, executable runbooks. Rather than copying commands from documentation into your terminal, Runme allows you to run commands directly from your markdown files with a single click, all while maintaining the context and explanations that make documentation valuable.
This guide is designed for teams just starting to adopt Runme, with a focus on practical use cases in incident management, infrastructure operations, DevOps workflows, and security operations. We’ll explore how Runme integrates with external systems like AWS, how authentication works, where code executes, and how state persists across sessions.
What is Runme?
Runme is a tool that bridges the gap between documentation and execution. It works as:
- VS Code Extension: The primary interface, turning your VS Code editor into an interactive notebook experience for markdown files
- CLI Tool: A command-line interface for running markdown-based runbooks in CI/CD pipelines or directly from the terminal
- Notebook Interface: A cell-based execution environment similar to Jupyter notebooks, but for operational tasks
At its core, Runme parses markdown files and makes code blocks executable. Instead of:
## Restart the service
Copy and run this command:
\`\`\`bash
kubectl rollout restart deployment/my-app -n production
\`\`\`
With Runme, users click a “Run” button next to the code block, and the command executes in a managed environment with proper context, logging, and state tracking.
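To make this concrete, the sketch below writes a minimal runbook with one named cell to /tmp. The fence characters are held in a variable only to avoid nesting code fences inside this example, and the `runme` invocation is shown as a comment because exact CLI flags vary by version:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a minimal runbook with one named cell. The fence string is kept
# in a variable only so this example does not nest code fences.
FENCE='```'
cat > /tmp/hello-runbook.md << EOF
# Hello Runbook

${FENCE}sh {"name":"greet"}
echo "Hello from Runme"
${FENCE}
EOF

# With the Runme CLI installed, the named cell can then be executed,
# e.g. (illustrative; flags vary by version):
#   runme run greet --chdir /tmp

grep -q '"name":"greet"' /tmp/hello-runbook.md && echo "runbook written"
```

Opening the same file in VS Code with the Runme extension shows the `greet` cell with its own Run button.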
Understanding Runme’s Architecture
Before diving into use cases, it’s crucial to understand three fundamental aspects of Runme: authentication, runtime environment, and persistence.
Authentication: How Runme Handles Credentials
Runme itself does not store or manage credentials; commands simply run with whatever credentials your shell environment already provides. This design keeps Runme out of the credential-handling path, an important property for sensitive operations.
Local Execution Model
When you run commands through Runme, they execute in your local shell environment with your existing credentials. This means:
- AWS Credentials: Commands use your configured AWS CLI credentials (~/.aws/credentials, environment variables, or SSO sessions)
- Kubernetes Contexts: kubectl commands use your current kubeconfig context (~/.kube/config)
- SSH Keys: SSH-based operations use your local SSH agent and keys
- API Tokens: Environment variables and credential files on your system are accessible
Example: If you have AWS SSO configured with multiple profiles:
# This command runs in your local shell with your AWS credentials
aws s3 ls --profile production
Runme executes this exactly as if you typed it in your terminal, using the credentials associated with the production profile.
Environment-Based Authentication
Runme supports environment variables within notebook cells, allowing you to:
- Set context-specific variables: Define AWS profiles, Kubernetes namespaces, or API endpoints per runbook
- Inherit from parent shell: Commands inherit environment variables from the shell where Runme was launched
- Scope credentials per cell: Different cells can use different credential contexts
Example runbook with environment configuration:
## Configuration
\`\`\`bash {"name":"config"}
export AWS_PROFILE=production
export AWS_REGION=us-east-1
export KUBE_CONTEXT=prod-eks-cluster
\`\`\`
## Check AWS Resources
\`\`\`bash {"name":"check-aws"}
# This uses the AWS_PROFILE set above
aws ec2 describe-instances --region $AWS_REGION
\`\`\`
## Check Kubernetes Pods
\`\`\`bash {"name":"check-k8s"}
kubectl config use-context $KUBE_CONTEXT
kubectl get pods -n production
\`\`\`
Integration with Credential Managers
Because cells simply invoke your local CLIs, Runme works with enterprise credential management systems:
- AWS SSO: Run aws sso login in a Runme cell before AWS operations
- HashiCorp Vault: Use Vault CLI commands to fetch secrets dynamically
- 1Password/LastPass: Use CLI tools to inject secrets at runtime
- Cloud IAM: Leverage cloud provider IAM roles when running in cloud environments
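As a sketch of the dynamic-secret pattern, the cell below captures a Vault secret into an environment variable at runtime. The mount path, field name, and placeholder fallback are illustrative, not taken from any particular Vault setup:

```shell
#!/usr/bin/env bash
# Fetch a secret at cell runtime instead of hardcoding it in the runbook.
# `vault kv get` is the HashiCorp Vault CLI; the path and field below
# are examples only.
fetch_db_password() {
  # Fall back to a placeholder when the CLI is unavailable, so the
  # runbook degrades gracefully on machines without Vault.
  vault kv get -field=password secret/prod/db 2>/dev/null || echo "PLACEHOLDER"
}

export DB_PASSWORD="$(fetch_db_password)"

# Never echo the secret itself; log only that it was set.
echo "DB_PASSWORD set (${#DB_PASSWORD} chars)"
```

Because cells share a session, later cells can use $DB_PASSWORD without re-fetching it.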
Runtime: Where Does Code Execute?
Understanding where your code runs is crucial for security and operational planning.
Local Execution (Default)
By default, Runme executes commands on your local machine in a shell session. This means:
- File system access: Commands can read/write files on your local disk
- Network access: Commands make network requests from your IP address
- Process isolation: Each cell runs as a subprocess of the Runme process
- Resource limits: Commands are subject to your machine’s CPU, memory, and network limits
Security Implications:
- Commands have the same permissions as your user account
- Malicious runbooks could potentially harm your system (always review before running)
- Network policies and firewalls apply as they would for any local process
Shell Sessions and State
Runme manages shell sessions intelligently:
- Persistent Sessions: By default, cells share a single shell session, meaning environment variables and directory changes persist between cells
- Named Sessions: You can create multiple named sessions to isolate different contexts
- Session Cleanup: Sessions terminate when you close the runbook or explicitly end them
Example showing session persistence:
## Navigate to Project Directory
\`\`\`bash
cd /opt/projects/my-app
export APP_ENV=production
\`\`\`
## Build Application
\`\`\`bash
# This runs in the same directory and has access to APP_ENV
npm run build:$APP_ENV
\`\`\`
Remote Execution (Advanced)
While Runme primarily runs locally, it can orchestrate remote execution:
- SSH into remote hosts: Use standard SSH commands in cells
- Cloud shell integration: Execute commands in AWS CloudShell, GCP Cloud Shell, or Azure Cloud Shell
- Container execution: Run commands inside Docker containers or Kubernetes pods
- CI/CD integration: Runme CLI can run runbooks in CI/CD pipeline runners
Example remote execution pattern:
## Execute on Production Server
\`\`\`bash
ssh production-server << 'EOF'
cd /var/www/app
sudo systemctl restart nginx
curl -sf http://localhost/health
EOF
\`\`\`
## Execute in Kubernetes Pod
\`\`\`bash
kubectl exec -n production deploy/api-server -- \
python manage.py check_health
\`\`\`
Persistence: How State is Maintained
Runme provides multiple persistence mechanisms to maintain context across time and sessions.
Cell Output History
Every cell execution is logged with:
- Stdout/stderr capture: Complete output from commands
- Exit codes: Success/failure status
- Execution timestamps: When commands ran
- Execution duration: How long commands took
This history is stored in:
- VS Code: In-memory during the session, with optional disk persistence
- Runme CLI: Output can be saved to files or streamed to logging systems
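When exporting history to an external logging system, the same per-cell metadata can be approximated in plain shell. The field names here are illustrative, not Runme's internal format:

```shell
#!/usr/bin/env bash
# Record roughly what Runme logs for a cell: output, exit code,
# timestamp, and duration.
start_ts=$(date +%s)
output=$(echo "deploy step ok")   # stand-in for the cell's actual command
exit_code=$?
end_ts=$(date +%s)

printf 'ts=%s exit=%d duration=%ds output=%s\n' \
  "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$exit_code" "$((end_ts - start_ts))" "$output"
```

A line like this per cell, appended to a file or shipped to a log aggregator, gives a durable audit trail even outside VS Code.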
Environment Variables and Context
Runme can persist context between sessions through:
- Markdown front matter: Store variables in YAML front matter at the top of runbooks
- Environment files: Load .env files or export variables to persist state
- Named cells: Reference outputs from previous cells by name
- Session files: Export session state to files for later restoration
Example with persistent configuration:
---
runme:
  version: v3
shell: bash
env:
  AWS_PROFILE: production
  AWS_REGION: us-west-2
  CLUSTER_NAME: prod-eks-01
---
# Production Operations Runbook
## Configuration is loaded automatically from front matter
\`\`\`bash {"name":"verify-config"}
echo "AWS Profile: $AWS_PROFILE"
echo "Region: $AWS_REGION"
echo "Cluster: $CLUSTER_NAME"
\`\`\`
State Management Patterns
For complex workflows, consider these patterns:
- Checkpoint cells: Save state to files that subsequent cells can load
- Idempotent operations: Design commands to be safely re-runnable
- State verification cells: Include cells that check system state before operations
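A minimal sketch of the checkpoint and idempotency patterns combined (the state directory and step names are illustrative):

```shell
#!/usr/bin/env bash
set -euo pipefail

STATE_DIR=/tmp/runbook-state
mkdir -p "$STATE_DIR"              # idempotent: no-op when it already exists

run_once() {
  local step="$1"; shift
  if [ -f "$STATE_DIR/$step.done" ]; then
    echo "skip: $step already completed"
  else
    "$@"                           # run the step's command
    touch "$STATE_DIR/$step.done"  # checkpoint for re-runs and later cells
    echo "done: $step"
  fi
}

run_once scale-up echo "scaling deployment..."
run_once scale-up echo "scaling deployment..."   # second call is a no-op
```

Re-running the cell after a partial failure skips completed steps, which is exactly the behavior you want mid-incident.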
Example stateful incident response:
## Step 1: Capture Incident Context
\`\`\`bash {"name":"capture-context"}
INCIDENT_ID="INC-$(date +%Y%m%d-%H%M%S)"
INCIDENT_DIR="/tmp/incidents/$INCIDENT_ID"
mkdir -p "$INCIDENT_DIR"
echo "Incident ID: $INCIDENT_ID" | tee "$INCIDENT_DIR/metadata.txt"
echo "Started: $(date)" | tee -a "$INCIDENT_DIR/metadata.txt"
# Export for other cells
export INCIDENT_ID
export INCIDENT_DIR
\`\`\`
## Step 2: Gather Logs (uses context from Step 1)
\`\`\`bash {"name":"gather-logs"}
kubectl logs -n production -l app=api --tail=1000 \
> "$INCIDENT_DIR/api-logs.txt"
aws logs tail /aws/ecs/api-service --since 1h \
> "$INCIDENT_DIR/ecs-logs.txt"
\`\`\`
## Step 3: Generate Report (persists to disk)
\`\`\`bash {"name":"generate-report"}
cat > "$INCIDENT_DIR/summary.md" << EOF
# Incident Report: $INCIDENT_ID
Started: $(date)
Status: In Progress
## Symptoms Observed
- API latency increased to >2s
- Error rate at 5%
## Data Collected
- API pod logs: $(wc -l < "$INCIDENT_DIR/api-logs.txt") lines
- ECS task logs: $(wc -l < "$INCIDENT_DIR/ecs-logs.txt") lines
EOF
echo "Report saved to $INCIDENT_DIR/summary.md"
\`\`\`
Use Case 1: Incident Management
Incident response requires speed, accuracy, and clear communication. Runme transforms incident runbooks from static documents into interactive response tools.
Incident Response Runbook
Here’s a complete incident response runbook demonstrating Runme’s capabilities:
# API Service Incident Response Runbook
## 1. Initial Assessment
### Check Service Health
\`\`\`bash {"name":"health-check"}
# Check endpoint availability
curl -sf https://api.example.com/health || echo "❌ Health check failed"
# Check response time
time curl -sf https://api.example.com/health > /dev/null
\`\`\`
### Check Error Rates
\`\`\`bash {"name":"error-rates"}
# Query last 5 minutes of errors from CloudWatch
aws cloudwatch get-metric-statistics \
--namespace "API/Production" \
--metric-name ErrorRate \
--start-time $(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%S) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
--period 60 \
--statistics Average,Maximum \
--dimensions Name=Environment,Value=production
\`\`\`
## 2. Identify Root Cause
### Check Pod Status
\`\`\`bash {"name":"pod-status"}
kubectl get pods -n production -l app=api-service -o wide
\`\`\`
### Review Recent Logs
\`\`\`bash {"name":"recent-logs"}
# Get logs from last 10 minutes
kubectl logs -n production -l app=api-service \
--since=10m \
--tail=100 \
| grep -i "error\|exception\|fatal"
\`\`\`
### Check Dependencies
\`\`\`bash {"name":"check-deps"}
# Database connectivity
kubectl exec -n production deploy/api-service -- \
timeout 5 nc -zv postgres-service 5432
# Redis connectivity
kubectl exec -n production deploy/api-service -- \
timeout 5 redis-cli -h redis-service ping
\`\`\`
### Check Recent Deployments
\`\`\`bash {"name":"recent-deploys"}
# Check rollout history
kubectl rollout history deployment/api-service -n production
# Check recent events
kubectl get events -n production \
--sort-by='.lastTimestamp' \
--field-selector involvedObject.name=api-service \
| tail -20
\`\`\`
## 3. Mitigation Actions
### Scale Up Pods
\`\`\`bash {"name":"scale-up"}
# Increase replica count
kubectl scale deployment/api-service -n production --replicas=10
# Wait for new pods to be ready
kubectl wait --for=condition=Ready pod \
-l app=api-service \
-n production \
--timeout=300s
\`\`\`
### Restart Deployment (if needed)
\`\`\`bash {"name":"restart-deployment"}
# Rolling restart
kubectl rollout restart deployment/api-service -n production
# Monitor rollout status
kubectl rollout status deployment/api-service -n production
\`\`\`
### Rollback to Previous Version (if regression)
\`\`\`bash {"name":"rollback"}
# Rollback to previous revision
kubectl rollout undo deployment/api-service -n production
# Verify rollback
kubectl rollout status deployment/api-service -n production
\`\`\`
## 4. Verification
### Verify Service Recovery
\`\`\`bash {"name":"verify-recovery"}
# Check health endpoint
for i in {1..5}; do
echo "Attempt $i:"
curl -sf https://api.example.com/health && echo "✓ OK" || echo "✗ Failed"
sleep 2
done
\`\`\`
### Monitor Error Rates Post-Fix
\`\`\`bash {"name":"monitor-errors"}
# Real-time error monitoring for 1 minute
timeout 60 watch -n 5 'kubectl logs -n production -l app=api-service --tail=50 | grep -c ERROR'
\`\`\`
## 5. Documentation
### Generate Incident Report
\`\`\`bash {"name":"incident-report"}
INCIDENT_ID="INC-$(date +%Y%m%d-%H%M%S)"
cat > "/tmp/incident-$INCIDENT_ID.md" << EOF
# Incident Report
**Incident ID**: $INCIDENT_ID
**Date**: $(date)
**Duration**: [TO BE FILLED]
**Severity**: [TO BE FILLED]
## Timeline
- $(date): Incident detected
- $(date): Initial assessment completed
- $(date): Mitigation applied
- $(date): Service recovered
## Root Cause
[TO BE FILLED AFTER INVESTIGATION]
## Actions Taken
1. Scaled deployment from 5 to 10 replicas
2. Restarted pods
3. Verified service health
## Next Steps
- [ ] Post-incident review
- [ ] Update monitoring alerts
- [ ] Document lessons learned
EOF
echo "Report created: /tmp/incident-$INCIDENT_ID.md"
cat "/tmp/incident-$INCIDENT_ID.md"
\`\`\`
Benefits for Incident Management
- Speed: One-click execution eliminates typing errors and command lookup time
- Consistency: Everyone follows the same tested procedure
- Audit Trail: Complete log of actions taken during incident response
- Collaboration: Team members can see what commands were run and their results
- Learning: New team members can execute runbooks with guidance, building expertise
- Version Control: Runbooks are versioned in Git, with history of improvements
Use Case 2: Infrastructure Management
Infrastructure teams manage cloud resources, server configurations, and deployment pipelines. Runme makes infrastructure operations repeatable and auditable.
AWS Infrastructure Management Runbook
# AWS Infrastructure Audit and Management
## Environment Setup
\`\`\`bash {"name":"setup-env"}
export AWS_PROFILE=production
export AWS_REGION=us-east-1
export ENVIRONMENT=production
# Verify credentials
aws sts get-caller-identity
\`\`\`
## 1. Inventory Check
### EC2 Instances
\`\`\`bash {"name":"ec2-inventory"}
# List all EC2 instances with key details
aws ec2 describe-instances \
--region $AWS_REGION \
--query 'Reservations[].Instances[].[InstanceId,InstanceType,State.Name,Tags[?Key==`Name`].Value|[0],PrivateIpAddress,LaunchTime]' \
--output table
\`\`\`
### RDS Databases
\`\`\`bash {"name":"rds-inventory"}
# List all RDS instances
aws rds describe-db-instances \
--region $AWS_REGION \
--query 'DBInstances[].[DBInstanceIdentifier,Engine,EngineVersion,DBInstanceClass,DBInstanceStatus,AllocatedStorage]' \
--output table
\`\`\`
### EKS Clusters
\`\`\`bash {"name":"eks-inventory"}
# List EKS clusters
aws eks list-clusters --region $AWS_REGION
# Get details for each cluster
for cluster in $(aws eks list-clusters --region $AWS_REGION --query 'clusters[]' --output text); do
echo -e "\n=== Cluster: $cluster ==="
aws eks describe-cluster --name $cluster --region $AWS_REGION \
--query 'cluster.[status,version,endpoint]' \
--output table
done
\`\`\`
## 2. Cost Analysis
### Monthly Cost by Service
\`\`\`bash {"name":"cost-analysis"}
# Get cost breakdown for current month
START_DATE=$(date +%Y-%m-01)
END_DATE=$(date +%Y-%m-%d)
aws ce get-cost-and-usage \
--time-period Start=$START_DATE,End=$END_DATE \
--granularity MONTHLY \
--metrics "UnblendedCost" \
--group-by Type=DIMENSION,Key=SERVICE \
--query 'ResultsByTime[0].Groups[].[Keys[0],Metrics.UnblendedCost.Amount]' \
--output table
\`\`\`
### Identify Unused Resources
\`\`\`bash {"name":"unused-resources"}
# Find unattached EBS volumes
echo "=== Unattached EBS Volumes ==="
aws ec2 describe-volumes \
--region $AWS_REGION \
--filters Name=status,Values=available \
--query 'Volumes[].[VolumeId,Size,VolumeType,CreateTime]' \
--output table
# Find unused Elastic IPs
echo -e "\n=== Unassociated Elastic IPs ==="
aws ec2 describe-addresses \
--region $AWS_REGION \
--query 'Addresses[?AssociationId==null].[PublicIp,AllocationId]' \
--output table
\`\`\`
## 3. Security Audit
### Check Security Groups
\`\`\`bash {"name":"security-groups"}
# Find security groups with overly permissive rules (0.0.0.0/0)
aws ec2 describe-security-groups \
--region $AWS_REGION \
--query 'SecurityGroups[?IpPermissions[?IpRanges[?CidrIp==`0.0.0.0/0`]]].{ID:GroupId,Name:GroupName,VPC:VpcId}' \
--output table
\`\`\`
### Check IAM Password Policy
\`\`\`bash {"name":"iam-policy"}
# Verify password policy meets requirements
aws iam get-account-password-policy
\`\`\`
### Check S3 Bucket Encryption
\`\`\`bash {"name":"s3-encryption"}
# Check which buckets lack encryption
for bucket in $(aws s3api list-buckets --query 'Buckets[].Name' --output text); do
encryption=$(aws s3api get-bucket-encryption --bucket $bucket 2>&1)
if echo "$encryption" | grep -q "ServerSideEncryptionConfigurationNotFoundError"; then
echo "❌ $bucket: No encryption"
else
echo "✓ $bucket: Encrypted"
fi
done
\`\`\`
## 4. Maintenance Operations
### Update Auto Scaling Groups
\`\`\`bash {"name":"update-asg"}
ASG_NAME="production-api-asg"
# Get current configuration
aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names $ASG_NAME \
--query 'AutoScalingGroups[0].[MinSize,MaxSize,DesiredCapacity]' \
--output table
# Update desired capacity (uncomment to execute)
# aws autoscaling set-desired-capacity \
# --auto-scaling-group-name $ASG_NAME \
# --desired-capacity 5
\`\`\`
### Rotate Access Keys (Audit)
\`\`\`bash {"name":"access-key-audit"}
# List access keys older than 90 days
for user in $(aws iam list-users --query 'Users[].UserName' --output text); do
echo -e "\n=== User: $user ==="
aws iam list-access-keys --user-name $user \
--query 'AccessKeyMetadata[].[AccessKeyId,CreateDate,Status]' \
--output table
done
\`\`\`
### Snapshot Critical Volumes
\`\`\`bash {"name":"snapshot-volumes"}
# Create snapshots of production volumes
VOLUME_IDS=$(aws ec2 describe-volumes \
--region $AWS_REGION \
--filters "Name=tag:Environment,Values=production" "Name=tag:Backup,Values=true" \
--query 'Volumes[].VolumeId' \
--output text)
for volume in $VOLUME_IDS; do
echo "Creating snapshot for $volume..."
aws ec2 create-snapshot \
--volume-id $volume \
--description "Manual backup $(date +%Y-%m-%d)" \
--tag-specifications "ResourceType=snapshot,Tags=[{Key=CreatedBy,Value=Runme},{Key=Date,Value=$(date +%Y-%m-%d)}]"
done
\`\`\`
## 5. Compliance Reporting
### Generate Compliance Report
\`\`\`bash {"name":"compliance-report"}
REPORT_FILE="/tmp/aws-compliance-$(date +%Y%m%d).txt"
{
echo "AWS Infrastructure Compliance Report"
echo "Generated: $(date)"
echo "Account: $(aws sts get-caller-identity --query Account --output text)"
echo "Region: $AWS_REGION"
echo ""
echo "=== CloudTrail Status ==="
aws cloudtrail describe-trails --region $AWS_REGION
echo -e "\n=== Config Recorder Status ==="
aws configservice describe-configuration-recorder-status --region $AWS_REGION
echo -e "\n=== GuardDuty Status ==="
aws guardduty list-detectors --region $AWS_REGION
echo -e "\n=== Security Hub Status ==="
aws securityhub describe-hub --region $AWS_REGION 2>&1
} > "$REPORT_FILE"
echo "Compliance report saved to: $REPORT_FILE"
cat "$REPORT_FILE"
\`\`\`
Infrastructure Benefits
- Consistency: Infrastructure operations follow standardized procedures
- Safety: Review commands before execution, with clear documentation
- Efficiency: Complex multi-step operations in a single runbook
- Knowledge Sharing: Junior engineers can run runbooks written by senior engineers
- Compliance: Auditable record of who ran what commands and when
Use Case 3: DevOps Workflows
DevOps teams orchestrate deployments, manage CI/CD pipelines, and maintain development environments. Runme streamlines these workflows.
Deployment Runbook
# Production Deployment Runbook - API Service v2.5.0
## Pre-Deployment Checklist
### Verify Prerequisites
\`\`\`bash {"name":"verify-prereqs"}
echo "Checking prerequisites..."
# Verify kubectl access
kubectl cluster-info | grep "Kubernetes control plane"
# Verify AWS access
aws sts get-caller-identity
# Verify Docker registry access
docker login registry.example.com
echo "✓ All prerequisites met"
\`\`\`
### Backup Current Configuration
\`\`\`bash {"name":"backup-config"}
BACKUP_DIR="/tmp/deployment-backup-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"
# Backup current Kubernetes manifests
kubectl get deployment api-service -n production -o yaml > "$BACKUP_DIR/deployment.yaml"
kubectl get service api-service -n production -o yaml > "$BACKUP_DIR/service.yaml"
kubectl get configmap api-config -n production -o yaml > "$BACKUP_DIR/configmap.yaml"
echo "✓ Backup saved to $BACKUP_DIR"
export BACKUP_DIR
\`\`\`
## Deployment Steps
### 1. Build and Push Docker Image
\`\`\`bash {"name":"build-image"}
VERSION="2.5.0"
IMAGE_NAME="registry.example.com/api-service"
IMAGE_TAG="$VERSION"
# Build image
cd /path/to/api-service
docker build -t "$IMAGE_NAME:$IMAGE_TAG" .
# Tag as latest
docker tag "$IMAGE_NAME:$IMAGE_TAG" "$IMAGE_NAME:latest"
# Push to registry
docker push "$IMAGE_NAME:$IMAGE_TAG"
docker push "$IMAGE_NAME:latest"
echo "✓ Image pushed: $IMAGE_NAME:$IMAGE_TAG"
\`\`\`
### 2. Update Kubernetes Manifests
\`\`\`bash {"name":"update-manifests"}
VERSION="2.5.0"
IMAGE_NAME="registry.example.com/api-service:$VERSION"
# Update deployment with new image
kubectl set image deployment/api-service \
api-service="$IMAGE_NAME" \
-n production
# Annotate with change cause
kubectl annotate deployment/api-service \
kubernetes.io/change-cause="Deploy version $VERSION" \
-n production
echo "✓ Deployment updated to $VERSION"
\`\`\`
### 3. Monitor Rollout
\`\`\`bash {"name":"monitor-rollout"}
# Watch rollout status
kubectl rollout status deployment/api-service -n production --timeout=5m
# Verify new pods are running
kubectl get pods -n production -l app=api-service -o wide
echo "✓ Rollout completed successfully"
\`\`\`
### 4. Smoke Tests
\`\`\`bash {"name":"smoke-tests"}
# Get service endpoint
SERVICE_URL="https://api.example.com"
# Test health endpoint
echo "Testing health endpoint..."
curl -sf "$SERVICE_URL/health" | jq '.'
# Test version endpoint
echo -e "\nTesting version endpoint..."
curl -sf "$SERVICE_URL/version" | jq '.'
# Test sample API call
echo -e "\nTesting sample API call..."
curl -sf "$SERVICE_URL/api/v1/status" | jq '.'
echo "✓ All smoke tests passed"
\`\`\`
### 5. Performance Validation
\`\`\`bash {"name":"performance-test"}
SERVICE_URL="https://api.example.com"
# Run quick load test
echo "Running performance test (100 requests)..."
ab -n 100 -c 10 "$SERVICE_URL/api/v1/status"
# Check response times
echo -e "\nChecking p95 response time..."
# (Results from ab command above)
\`\`\`
## Post-Deployment
### Update Monitoring
\`\`\`bash {"name":"update-monitoring"}
# Add annotation to Datadog
DEPLOY_TIME=$(date +%s)
VERSION="2.5.0"
curl -X POST "https://api.datadoghq.com/api/v1/events" \
-H "DD-API-KEY: $DATADOG_API_KEY" \
-H "Content-Type: application/json" \
-d @- << EOF
{
"title": "API Service Deployed",
"text": "Version $VERSION deployed to production",
"priority": "normal",
"tags": ["environment:production", "service:api", "version:$VERSION"],
"alert_type": "info"
}
EOF
echo "✓ Monitoring updated"
\`\`\`
### Notify Team
\`\`\`bash {"name":"notify-team"}
VERSION="2.5.0"
DEPLOY_TIME=$(date)
# Post to Slack
curl -X POST "$SLACK_WEBHOOK_URL" \
-H "Content-Type: application/json" \
-d @- << EOF
{
"text": "✅ Production Deployment Complete",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*API Service v$VERSION* deployed to production\n*Time:* $DEPLOY_TIME\n*Status:* ✅ Success"
}
}
]
}
EOF
echo "✓ Team notified"
\`\`\`
## Rollback Procedure (If Needed)
### Quick Rollback
\`\`\`bash {"name":"rollback"}
echo "⚠️ Initiating rollback..."
# Rollback to previous revision
kubectl rollout undo deployment/api-service -n production
# Wait for rollback to complete
kubectl rollout status deployment/api-service -n production
# Restore backed up config if needed
if [ -n "$BACKUP_DIR" ]; then
kubectl apply -f "$BACKUP_DIR/"
fi
echo "✅ Rollback completed"
\`\`\`
DevOps Workflow Benefits
- Reproducibility: Same deployment process every time
- Visibility: Everyone can see the deployment steps and current status
- Safety: Built-in checkpoints and rollback procedures
- Speed: One-click deployments instead of manual command execution
- Training: New team members learn the deployment process by running runbooks
Use Case 4: Security Operations
Security teams need to respond to threats, audit systems, and maintain compliance. Runme provides secure, auditable security operations.
Security Incident Response Runbook
# Security Incident Response - Compromised AWS Account
## Phase 1: Containment
### Immediate Actions - Stop Active Threat
\`\`\`bash {"name":"emergency-stop"}
# Set up incident tracking
INCIDENT_ID="SEC-$(date +%Y%m%d-%H%M%S)"
INCIDENT_DIR="/tmp/security-incident-$INCIDENT_ID"
mkdir -p "$INCIDENT_DIR"
echo "Security Incident: $INCIDENT_ID" | tee "$INCIDENT_DIR/timeline.txt"
echo "Started: $(date)" | tee -a "$INCIDENT_DIR/timeline.txt"
export INCIDENT_ID
export INCIDENT_DIR
\`\`\`
### Disable Compromised User Access
\`\`\`bash {"name":"disable-user"}
COMPROMISED_USER="suspicious-user"
# Disable console access
aws iam delete-login-profile --user-name "$COMPROMISED_USER" 2>/dev/null
# Deactivate all access keys
aws iam list-access-keys --user-name "$COMPROMISED_USER" \
--query 'AccessKeyMetadata[].AccessKeyId' \
--output text | tr '\t' '\n' | while read -r key; do
aws iam update-access-key --user-name "$COMPROMISED_USER" --access-key-id "$key" --status Inactive
echo "$(date): Deactivated access key $key for $COMPROMISED_USER" | tee -a "$INCIDENT_DIR/timeline.txt"
done
echo "✓ User access disabled"
\`\`\`
### Revoke Active Sessions
\`\`\`bash {"name":"revoke-sessions"}
COMPROMISED_USER="suspicious-user"
# Attach policy to deny all actions
cat > /tmp/deny-all-policy.json << 'EOF'
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Deny",
"Action": "*",
"Resource": "*"
}
]
}
EOF
# Create and attach deny policy
POLICY_ARN=$(aws iam create-policy \
--policy-name "DenyAll-$INCIDENT_ID" \
--policy-document file:///tmp/deny-all-policy.json \
--query 'Policy.Arn' \
--output text)
aws iam attach-user-policy --user-name "$COMPROMISED_USER" --policy-arn "$POLICY_ARN"
echo "$(date): Attached deny-all policy to $COMPROMISED_USER" | tee -a "$INCIDENT_DIR/timeline.txt"
echo "✓ Active sessions effectively revoked"
\`\`\`
## Phase 2: Investigation
### Collect CloudTrail Logs
\`\`\`bash {"name":"collect-cloudtrail"}
COMPROMISED_USER="suspicious-user"
START_TIME=$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S)
END_TIME=$(date -u +%Y-%m-%dT%H:%M:%S)
# Query CloudTrail for user activity
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=Username,AttributeValue="$COMPROMISED_USER" \
--start-time "$START_TIME" \
--end-time "$END_TIME" \
--max-results 50 \
> "$INCIDENT_DIR/cloudtrail-events.json"
# Extract key information
jq -r '.Events[] | "\(.EventTime) \(.EventName) \(.SourceIPAddress)"' \
"$INCIDENT_DIR/cloudtrail-events.json" \
| tee "$INCIDENT_DIR/event-summary.txt"
echo "✓ CloudTrail logs collected"
\`\`\`
### Identify Affected Resources
\`\`\`bash {"name":"identify-resources"}
# List resources created by compromised user in last 24 hours
echo "=== EC2 Instances ===" | tee -a "$INCIDENT_DIR/affected-resources.txt"
aws ec2 describe-instances \
--filters "Name=tag:CreatedBy,Values=$COMPROMISED_USER" \
--query 'Reservations[].Instances[].[InstanceId,LaunchTime,State.Name]' \
--output table | tee -a "$INCIDENT_DIR/affected-resources.txt"
echo -e "\n=== S3 Buckets ===" | tee -a "$INCIDENT_DIR/affected-resources.txt"
for bucket in $(aws s3api list-buckets --query 'Buckets[].Name' --output text); do
tags=$(aws s3api get-bucket-tagging --bucket $bucket 2>/dev/null || echo "")
if echo "$tags" | grep -q "$COMPROMISED_USER"; then
echo "$bucket" | tee -a "$INCIDENT_DIR/affected-resources.txt"
fi
done
echo -e "\n=== IAM Resources ===" | tee -a "$INCIDENT_DIR/affected-resources.txt"
aws iam list-users --query "Users[?contains(UserName, '$COMPROMISED_USER')]" \
| tee -a "$INCIDENT_DIR/affected-resources.txt"
echo "✓ Affected resources identified"
\`\`\`
### Check for Data Exfiltration
\`\`\`bash {"name":"check-exfiltration"}
# Query CloudTrail for potential data exfiltration events
jq -r '.Events[] | select(.EventName=="GetObject" or .EventName=="DownloadDBSnapshot" or .EventName=="CreateSnapshot") | "\(.EventTime) \(.EventName) \(.Resources[0].ResourceName)"' \
"$INCIDENT_DIR/cloudtrail-events.json" \
| tee "$INCIDENT_DIR/potential-exfiltration.txt"
# Check VPC Flow Logs for unusual outbound traffic
echo -e "\n=== Checking VPC Flow Logs ==="
# (Requires VPC Flow Logs to be enabled and stored in CloudWatch/S3)
aws logs filter-log-events \
--log-group-name "/aws/vpc/flowlogs" \
--start-time $(date -d '24 hours ago' +%s)000 \
--filter-pattern "[version, account, eni, source, destination, srcport, destport, protocol, packets, bytes, start, end, action=ACCEPT, status]" \
--query 'events[*].message' \
--output text \
> "$INCIDENT_DIR/vpc-outbound-traffic.txt"
# Note: flow log records do not label direction; filter the captured lines
# by destination address against your VPC CIDR to isolate outbound traffic
echo "✓ Exfiltration check complete"
\`\`\`
## Phase 3: Eradication
### Terminate Unauthorized Resources
\`\`\`bash {"name":"terminate-resources"}
# Terminate EC2 instances created by compromised user
INSTANCE_IDS=$(aws ec2 describe-instances \
--filters "Name=tag:CreatedBy,Values=$COMPROMISED_USER" "Name=instance-state-name,Values=running" \
--query 'Reservations[].Instances[].InstanceId' \
--output text)
if [ -n "$INSTANCE_IDS" ]; then
echo "Terminating instances: $INSTANCE_IDS"
aws ec2 terminate-instances --instance-ids $INSTANCE_IDS
echo "$(date): Terminated instances: $INSTANCE_IDS" | tee -a "$INCIDENT_DIR/timeline.txt"
else
echo "No unauthorized instances found"
fi
echo "✓ Unauthorized resources terminated"
\`\`\`
### Remove Malicious IAM Policies
\`\`\`bash {"name":"remove-policies"}
# List and detach suspicious policies
aws iam list-policies --scope Local \
--query "Policies[?contains(PolicyName, 'temp') || contains(PolicyName, 'test')]" \
--output json > "$INCIDENT_DIR/suspicious-policies.json"
# Review and manually remove if confirmed malicious
cat "$INCIDENT_DIR/suspicious-policies.json"
echo "⚠️ Review suspicious policies before removal"
\`\`\`
### Rotate Credentials
\`\`\`bash {"name":"rotate-credentials"}
# Force rotation of potentially exposed credentials
echo "=== Credentials to Rotate ===" | tee "$INCIDENT_DIR/credential-rotation.txt"
# List IAM users who may have been compromised
aws iam list-users --query 'Users[].UserName' --output text | tr '\t' '\n' | while read -r user; do
last_used=$(aws iam get-user --user-name "$user" --query 'User.PasswordLastUsed' --output text 2>/dev/null)
if [ "$last_used" != "None" ]; then
echo "User: $user - Last password use: $last_used" | tee -a "$INCIDENT_DIR/credential-rotation.txt"
fi
done
echo -e "\n⚠️ Manually rotate credentials for affected users"
echo "⚠️ Consider rotating AWS root account credentials"
\`\`\`
## Phase 4: Recovery
### Restore Access for Legitimate Users
\`\`\`bash {"name":"restore-access"}
# Remove deny-all policy after threat is contained
aws iam detach-user-policy \
--user-name "$COMPROMISED_USER" \
--policy-arn "$POLICY_ARN"
aws iam delete-policy --policy-arn "$POLICY_ARN"
echo "$(date): Removed containment policy" | tee -a "$INCIDENT_DIR/timeline.txt"
echo "✓ Containment policy removed (verify threat eliminated first)"
\`\`\`
### Implement Additional Security Controls
\`\`\`bash {"name":"security-controls"}
# Enable MFA requirement for sensitive operations
cat > /tmp/require-mfa-policy.json << 'EOF'
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Deny",
"Action": "*",
"Resource": "*",
"Condition": {
"BoolIfExists": {
"aws:MultiFactorAuthPresent": "false"
}
}
}
]
}
EOF
# Create MFA requirement policy
aws iam create-policy \
--policy-name "RequireMFA" \
--policy-document file:///tmp/require-mfa-policy.json
echo "✓ MFA-requirement policy created (attach it to users or groups to enforce)"
\`\`\`
## Phase 5: Post-Incident
### Generate Incident Report
\`\`\`bash {"name":"generate-report"}
cat > "$INCIDENT_DIR/incident-report.md" << EOF
# Security Incident Report: $INCIDENT_ID
## Executive Summary
**Incident Type**: Compromised AWS Account
**Detected**: $(head -2 "$INCIDENT_DIR/timeline.txt" | tail -1)
**Contained**: $(date)
**Severity**: HIGH
## Timeline
~~~
$(cat "$INCIDENT_DIR/timeline.txt")
~~~
## Impact Assessment
- **Compromised User**: $COMPROMISED_USER
- **Affected Resources**: $(wc -l < "$INCIDENT_DIR/affected-resources.txt") resources identified
- **Data Exfiltration**: See $INCIDENT_DIR/potential-exfiltration.txt
## Actions Taken
1. Disabled compromised user access
2. Revoked active sessions
3. Collected forensic evidence
4. Terminated unauthorized resources
5. Rotated credentials
6. Implemented additional security controls
## Root Cause
[TO BE COMPLETED AFTER FULL INVESTIGATION]
## Preventive Measures
- Implement MFA for all users
- Enable AWS CloudTrail across all regions
- Configure GuardDuty for threat detection
- Review and update IAM policies
- Implement least-privilege access
## Lessons Learned
[TO BE COMPLETED IN POST-INCIDENT REVIEW]
## Evidence Location
All evidence stored in: $INCIDENT_DIR
EOF
echo "✓ Incident report generated: $INCIDENT_DIR/incident-report.md"
cat "$INCIDENT_DIR/incident-report.md"
\`\`\`
### Archive Evidence
\`\`\`bash {"name":"archive-evidence"}
# Create encrypted archive of evidence
ARCHIVE_FILE="/secure/evidence/incident-$INCIDENT_ID.tar.gz.gpg"
mkdir -p "$(dirname "$ARCHIVE_FILE")"
tar -czf - "$INCIDENT_DIR" | gpg --encrypt --recipient security-team@example.com > "$ARCHIVE_FILE"
echo "✓ Evidence archived: $ARCHIVE_FILE"
echo "$(date): Evidence archived to $ARCHIVE_FILE" | tee -a "$INCIDENT_DIR/timeline.txt"
\`\`\`
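Encryption protects confidentiality, but a checksum manifest additionally lets you demonstrate that evidence was not altered after collection. A minimal sketch, using a scratch directory as a stand-in for `$INCIDENT_DIR`:

```shell
# Build a sha256 manifest of every evidence file, then verify it
EVIDENCE_DIR=$(mktemp -d)              # stand-in for "$INCIDENT_DIR" in this sketch
echo "sample evidence" > "$EVIDENCE_DIR/timeline.txt"

MANIFEST=$(mktemp)                     # keep the manifest outside the tree it covers
find "$EVIDENCE_DIR" -type f -exec sha256sum {} + > "$MANIFEST"
sha256sum -c "$MANIFEST"               # prints "<file>: OK" for each entry
```

Storing the manifest alongside the encrypted archive (or signing it) strengthens the chain of custody for audits.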
## Security Operations Benefits
- Speed: Rapid response to security incidents with pre-tested procedures
- Forensics: Complete audit trail of actions taken during incident response
- Consistency: Standard operating procedures followed every time
- Collaboration: Security team can work together using same runbooks
- Compliance: Demonstrates security incident response capability for audits
## Integration with External Systems
Runme excels at orchestrating interactions with external systems. Let’s explore common integration patterns.
### AWS Integration
Runme uses the AWS CLI, which respects standard AWS credential mechanisms:
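For instance, a profile like the `production` one used below is typically defined in the shared config file. The following sketch writes a sample SSO profile to a scratch file; every value is a placeholder, not a real account:

```shell
# A sample ~/.aws/config SSO profile entry (written to a scratch file here)
CONFIG_FILE=$(mktemp)
cat > "$CONFIG_FILE" << 'EOF'
[profile production]
sso_start_url  = https://example.awsapps.com/start
sso_region     = us-east-1
sso_account_id = 123456789012
sso_role_name  = ReadOnlyAccess
region         = us-east-1
output         = json
EOF
grep '^\[profile production\]' "$CONFIG_FILE"
```

Because Runme shells out to the AWS CLI, anything that works in your terminal (profiles, environment variables, SSO sessions) works unchanged in a runbook cell.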
## AWS SSO Authentication
\`\`\`bash {"name":"aws-sso-login"}
# Authenticate with AWS SSO
aws sso login --profile production
# Verify authentication
aws sts get-caller-identity --profile production
\`\`\`
## Multi-Account Operations
\`\`\`bash {"name":"multi-account"}
# Iterate through multiple AWS accounts
for profile in dev staging production; do
echo -e "\n=== Account: $profile ==="
aws s3 ls --profile $profile
done
\`\`\`
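When sweeping accounts like this, one failing profile should not abort the whole loop. A sketch of a more resilient pattern, with a stand-in command in place of the real AWS call:

```shell
# Keep sweeping remaining accounts when one profile fails, then summarize
failed=""
for profile in dev staging production; do
  # stand-in for the real call: aws s3 ls --profile "$profile"
  if [ "$profile" = "staging" ]; then false; else true; fi \
    || { failed="$failed $profile"; continue; }
  echo "OK: $profile"
done
if [ -n "$failed" ]; then
  echo "Failed profiles:$failed"
fi
```

Collecting failures and reporting them at the end gives a complete picture even when one account's credentials have expired.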
## AssumeRole for Cross-Account Access
\`\`\`bash {"name":"assume-role"}
# Assume role in another account
ROLE_ARN="arn:aws:iam::123456789012:role/CrossAccountAdmin"
# Get temporary credentials
CREDENTIALS=$(aws sts assume-role \
--role-arn "$ROLE_ARN" \
--role-session-name "runme-session" \
--query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
--output text)
# Export credentials for subsequent commands
read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN <<< "$CREDENTIALS"
export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
# Use assumed role
aws ec2 describe-instances
\`\`\`
### Kubernetes Integration
## Switch Contexts
\`\`\`bash {"name":"switch-context"}
# List available contexts
kubectl config get-contexts
# Switch to production cluster
kubectl config use-context prod-eks-cluster
# Verify current context
kubectl config current-context
\`\`\`
## Multi-Cluster Operations
\`\`\`bash {"name":"multi-cluster"}
# Run command across all clusters
for context in $(kubectl config get-contexts -o name); do
echo -e "\n=== Cluster: $context ==="
kubectl --context=$context get nodes -o wide
done
\`\`\`
### API Integration with Authentication
## OAuth 2.0 Flow
\`\`\`bash {"name":"oauth-flow"}
# Get OAuth token
TOKEN=$(curl -sf -X POST https://auth.example.com/oauth/token \
-H "Content-Type: application/json" \
-d '{"client_id":"'$CLIENT_ID'","client_secret":"'$CLIENT_SECRET'","grant_type":"client_credentials"}' \
| jq -r '.access_token')
export API_TOKEN=$TOKEN
echo "✓ Authenticated"
\`\`\`
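Client-credentials tokens expire, so cells that run much later in a session may need to re-authenticate. A sketch that tracks expiry locally; the 3600-second lifetime is a hypothetical value that would normally come from the response's `expires_in` field:

```shell
# Remember when the token expires; refresh only when under 60s remain
EXPIRES_IN=3600                               # hypothetical expires_in from the response
TOKEN_EXPIRY=$(( $(date +%s) + EXPIRES_IN ))

token_is_fresh() {
  [ $(( TOKEN_EXPIRY - $(date +%s) )) -gt 60 ]
}

if token_is_fresh; then
  echo "reusing existing token"
else
  echo "re-running the authentication cell"
fi
```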
## Use Token for API Calls
\`\`\`bash {"name":"api-calls"}
# Make authenticated API call
curl -sf -X GET https://api.example.com/v1/resources \
-H "Authorization: Bearer $API_TOKEN" \
| jq '.'
\`\`\`
### Database Interactions
## PostgreSQL Query
\`\`\`bash {"name":"postgres-query"}
# Connect and query database
PGPASSWORD=$DB_PASSWORD psql -h postgres.example.com -U admin -d production << 'EOF'
SELECT
table_name,
pg_size_pretty(pg_total_relation_size(table_name::text)) as size
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY pg_total_relation_size(table_name::text) DESC
LIMIT 10;
EOF
\`\`\`
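Passing the password through `PGPASSWORD` leaves it visible in the process environment; libpq's password file is a common alternative. A sketch using a scratch file and placeholder values:

```shell
# libpq reads host:port:database:user:password lines from a pgpass file (must be mode 0600)
PGPASS_FILE=$(mktemp)
echo 'postgres.example.com:5432:production:admin:REDACTED' > "$PGPASS_FILE"
chmod 600 "$PGPASS_FILE"

# Then connect without exporting the password:
#   PGPASSFILE="$PGPASS_FILE" psql -h postgres.example.com -U admin -d production -w
grep -c ':' "$PGPASS_FILE"
```

libpq ignores a password file whose permissions are broader than 0600, so the `chmod` is not optional.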
## Getting Started: Adopting Runme in Your Organization
### Installation
VS Code Extension (Recommended for beginners):
- Open VS Code
- Go to Extensions (Ctrl+Shift+X / Cmd+Shift+X)
- Search for “Runme”
- Click Install
- Open any .md file to see Runme in action
CLI Installation:
\`\`\`bash
# macOS
brew install runme
# Linux
curl -fsSL https://download.runme.dev/install.sh | sh
# Verify installation
runme --version
\`\`\`
### Creating Your First Runbook
- Create a markdown file: ops-runbook.md
- Add code blocks with commands
- Open in VS Code with Runme extension
- Click “Run” buttons to execute
Example first runbook:
# My First Runbook
## Check System Status
\`\`\`bash
date
whoami
uname -a
\`\`\`
## List Running Processes
\`\`\`bash
ps aux | head -10
\`\`\`
## Check Disk Usage
\`\`\`bash
df -h
\`\`\`
### Best Practices for Adoption
- Start Small: Begin with simple operational tasks
- Version Control: Store runbooks in Git alongside code
- Document Context: Add explanatory text between code blocks
- Use Named Cells: Give cells meaningful names for better logs
- Test Thoroughly: Run runbooks in safe environments before production use
- Review Before Running: Always review commands before execution
- Set up Sessions: Use named sessions to isolate different workflows
- Add Safety Checks: Include verification steps before destructive operations
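The last two points can be combined in a small guard cell: before any destructive step, require the operator to type the target's name back. A sketch in plain bash (the piped answer simulates interactive input for demonstration; in a live session you would just call the function):

```shell
# Refuse to continue unless the operator retypes the exact resource name
confirm_destructive() {
  local resource="$1" answer
  read -r -p "Type '$resource' to confirm: " answer
  [ "$answer" = "$resource" ]
}

# Simulated operator input for the sketch
if echo "prod-db" | confirm_destructive "prod-db"; then
  echo "confirmed, proceeding"
else
  echo "aborted"
fi
```

Typing the resource name, rather than answering y/n, makes it much harder to confirm the wrong target on autopilot.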
### Runbook Template
---
runme:
version: v3
shell: bash
env:
ENVIRONMENT: staging
---
# [Task Name] Runbook
**Purpose**: [Brief description]
**Owner**: [Team/Person]
**Last Updated**: [Date]
## Prerequisites
- [ ] Access to [system/service]
- [ ] Required tools installed
- [ ] Credentials configured
## Safety Checks
### Verify Environment
\`\`\`bash {"name":"verify-env"}
echo "Environment: $ENVIRONMENT"
echo "Current user: $(whoami)"
echo "Current directory: $(pwd)"
# Add checks specific to your task
\`\`\`
## Procedure
### Step 1: [Description]
\`\`\`bash {"name":"step-1"}
# Your commands here
\`\`\`
### Step 2: [Description]
\`\`\`bash {"name":"step-2"}
# Your commands here
\`\`\`
## Verification
### Verify Results
\`\`\`bash {"name":"verify"}
# Commands to verify success
\`\`\`
## Rollback (If Needed)
### Rollback Procedure
\`\`\`bash {"name":"rollback"}
# Commands to rollback changes
\`\`\`
## Next Steps
- [ ] Update monitoring
- [ ] Notify stakeholders
- [ ] Document any issues
## Security Considerations
- Credential Storage: Never hardcode credentials in runbooks. Use environment variables or credential managers
- Access Control: Use Git permissions to control who can modify runbooks
- Audit Trail: Enable logging to track runbook executions
- Code Review: Require peer review for runbook changes
- Sensitive Operations: Add confirmation prompts for destructive operations
- Encryption: Encrypt runbooks containing sensitive information
- Least Privilege: Run runbooks with minimal required permissions
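A lightweight way to enforce the first point is a pre-commit scan for obvious hardcoded credentials in runbooks. A grep-based sketch; the patterns are illustrative and far from exhaustive, and the sample key is AWS's documented example key, not a real credential:

```shell
# Flag likely hardcoded credentials in runbook files
scan_for_secrets() {
  grep -nE 'AKIA[0-9A-Z]{16}|aws_secret_access_key|BEGIN (RSA|OPENSSH) PRIVATE KEY' "$@"
}

RUNBOOK=$(mktemp)
printf 'echo deploy\nexport AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE\n' > "$RUNBOOK"

if scan_for_secrets "$RUNBOOK"; then
  echo "potential secret found; fix before committing"
fi
```

For real use, a dedicated scanner wired into a Git pre-commit hook covers far more patterns than a handful of regexes.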
## Conclusion
Runme represents a paradigm shift in operational documentation. By making markdown files executable, it eliminates the gap between documentation and reality. For teams managing incidents, infrastructure, DevOps workflows, and security operations, Runme provides:
- Speed: Execute complex procedures with one click
- Consistency: Everyone follows the same tested procedures
- Auditability: Complete record of what was executed and when
- Collaboration: Share operational knowledge through versioned runbooks
- Safety: Review before execution, with built-in rollback procedures
The key to success with Runme is starting small, building comprehensive runbooks gradually, and fostering a culture where executable documentation becomes the norm. As your team gains experience, you’ll find that Runme becomes an indispensable tool for operational excellence.
Whether you’re responding to a production incident at 2 AM, deploying a critical security patch, or onboarding a new team member, Runme ensures that your operations are fast, reliable, and well-documented.
Start your journey with Runme today by converting your most frequently used runbook into an executable markdown file. Your future self (and your team) will thank you.