AWS CloudWatch
AWS's observability platform. It collects metrics, logs, traces, and events from AWS services and from your own applications. Core for production monitoring and the MLA-C01 exam.
Core Components
| Component | What it is | Analogy |
|---|---|---|
| Metrics | Time-series numerical data | Datadog metrics |
| Logs | Text log streams | ELK stack |
| Alarms | Trigger on metric threshold | PagerDuty alert |
| Events / EventBridge | Rule-based event routing | Cron + webhook |
| Dashboards | Visualization | Grafana |
| Logs Insights | Query language for logs (pipe-based, SQL-like) | Kibana Discover |
| Container Insights | ECS/EKS metrics | — |
| Lambda Insights | Per-function system metrics (memory, CPU, network) | — |
Metrics
Standard Metrics (free, auto-collected)
- EC2: CPUUtilization, NetworkIn/Out, DiskRead/WriteOps
- RDS: DatabaseConnections, FreeStorageSpace, ReadLatency
- Lambda: Invocations, Errors, Duration, Throttles, ConcurrentExecutions
- SQS: NumberOfMessagesSent, ApproximateNumberOfMessagesVisible
Custom Metrics
import boto3
cw = boto3.client('cloudwatch')
# Publish custom metric
cw.put_metric_data(
Namespace='MyApp/API',
MetricData=[{
'MetricName': 'RequestLatency',
'Value': 150.5,
'Unit': 'Milliseconds',
'Dimensions': [
{'Name': 'Endpoint', 'Value': '/users'},
{'Name': 'Environment', 'Value': 'production'}
]
}]
)
Resolution
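- Standard resolution: 1-minute granularity, the default for custom metrics. EC2 basic monitoring publishes at 5-minute intervals; 1-minute EC2 data requires detailed monitoring (additional cost).
- High resolution: 1-second granularity for custom metrics (additional cost). Use it for Lambda-backed or real-time workloads where 1-minute aggregates hide short spikes.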
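Resolution is chosen per datapoint at publish time. The following is a minimal sketch, assuming hypothetical helpers named `build_datum` and `publish` and a boto3 CloudWatch client supplied by the caller; `StorageResolution` is the `put_metric_data` field that selects 1-second (`1`) or 1-minute (`60`) storage.

```python
def build_datum(name, value, unit="Milliseconds", high_resolution=False, dimensions=None):
    """Build one MetricData entry for put_metric_data (hypothetical helper)."""
    datum = {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        # 1 = high resolution (1-second), 60 = standard resolution (1-minute)
        "StorageResolution": 1 if high_resolution else 60,
    }
    if dimensions:
        datum["Dimensions"] = [{"Name": k, "Value": v} for k, v in dimensions.items()]
    return datum


def publish(cloudwatch, namespace, datum):
    """Send a single datum; `cloudwatch` is a boto3 CloudWatch client."""
    cloudwatch.put_metric_data(Namespace=namespace, MetricData=[datum])
```

Typical use would be `publish(boto3.client("cloudwatch"), "MyApp/API", build_datum("RequestLatency", 150.5, high_resolution=True))`.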
Alarms
Three states: OK | ALARM | INSUFFICIENT_DATA
cw.put_metric_alarm(
AlarmName='High-CPU-EC2',
MetricName='CPUUtilization',
Namespace='AWS/EC2',
Statistic='Average',
Period=300, # 5 minutes
EvaluationPeriods=2, # 2 consecutive periods
Threshold=90.0,
ComparisonOperator='GreaterThanThreshold',
Dimensions=[{'Name': 'InstanceId', 'Value': 'i-1234567890'}],
AlarmActions=['arn:aws:sns:us-east-1:123456789012:alert-topic'], # SNS notification (placeholder account ID)
TreatMissingData='missing'
)
Alarm actions:
- SNS notification → email, SMS, PagerDuty webhook
- Auto Scaling action (scale out/in)
- EC2 action (stop, terminate, reboot, recover)
- Lambda invocation
Composite Alarm: Combine multiple alarms with AND/OR logic.
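A composite alarm is defined by a rule expression over existing alarms. A rough sketch, assuming hypothetical helper names (`composite_rule`, `create_composite_alarm`) and child alarms that already exist; the expression uses the `ALARM("name")` form accepted by `put_composite_alarm`.

```python
def composite_rule(alarm_names, operator="AND"):
    """Join child alarms into an AlarmRule such as ALARM("a") AND ALARM("b")."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    if not alarm_names:
        raise ValueError("at least one child alarm is required")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


def create_composite_alarm(cloudwatch, name, child_alarms, sns_topic_arn, operator="AND"):
    """`cloudwatch` is a boto3 CloudWatch client; child alarms must already exist."""
    cloudwatch.put_composite_alarm(
        AlarmName=name,
        AlarmRule=composite_rule(child_alarms, operator),
        AlarmActions=[sns_topic_arn],
    )
```

With AND, the composite only fires when every child is in ALARM, which is one way to page on "high CPU and high latency" rather than on each signal separately.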
CloudWatch Logs
Log Groups and Streams
Log Group: /aws/lambda/my-function ← one per Lambda function
└── Log Stream: 2026/04/16/[$LATEST]abc123 ← one per execution environment (reused across invocations)
├── log event: timestamp + message
└── log event: timestamp + message
Sending Logs from Code
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
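def handler(event, context):
    # Lambda forwards stdout/stderr (and so logging output) to CloudWatch Logs automatically
    logger.info("Processing request %s", event.get("id"))
    try:
        pass  # database call goes here
    except Exception:
        # logger.exception attaches the active traceback; it must run inside an except block
        logger.exception("Failed to connect to DB")
        raise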
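Logs Insights discovers fields automatically when a log event is a JSON object, so structured output makes later queries simpler. A minimal sketch of a JSON formatter follows; the class name `JsonFormatter`, the `make_logger` helper, and the `request_id` field are illustrative assumptions, not part of any AWS library.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # `request_id` is only set when passed via `extra=`
            "request_id": getattr(record, "request_id", None),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def make_logger(name="app"):
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on reuse
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.propagate = False  # keep the root logger from printing a second copy
    return logger
```

A call such as `make_logger().info("Processing request", extra={"request_id": "abc"})` would then emit a line that can be filtered by `level` or `request_id` in a query.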
Retention Policy
Default: never expire (costs money). Set retention:
logs = boto3.client('logs')
logs.put_retention_policy(
logGroupName='/aws/lambda/my-function',
retentionInDays=30
)
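Because the default applies to every new log group, it can help to sweep an account for groups that never expire. A sketch under the assumption that the caller passes a boto3 CloudWatch Logs client; the function name `apply_default_retention` is hypothetical.

```python
def apply_default_retention(logs, days=30):
    """Set `days` retention on every log group that never expires.

    `logs` is a boto3 CloudWatch Logs client. Returns the names that were changed.
    """
    changed = []
    paginator = logs.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page["logGroups"]:
            # Groups without a retention policy have no `retentionInDays` key
            if "retentionInDays" not in group:
                logs.put_retention_policy(
                    logGroupName=group["logGroupName"], retentionInDays=days
                )
                changed.append(group["logGroupName"])
    return changed
```

Note that `retentionInDays` only accepts specific values (for example 1, 7, 14, 30, 90, 365), so an arbitrary number may be rejected by the API.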
CloudWatch Logs Insights
SQL-like query language for log analysis:
# Find errors in last hour (the time range is set in the console or the API call)
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20
# Average Lambda duration per function (grouped by log group, one per function)
filter @type = "REPORT"
| stats avg(@duration) as avgDuration by @log
| sort avgDuration desc
# Count 4xx/5xx errors
filter @message like /statusCode/
| parse @message '"statusCode": *,' as statusCode
| filter statusCode >= 400
| stats count(*) as errors by statusCode
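Queries can also be run programmatically. Insights queries are asynchronous: `start_query` returns an ID and `get_query_results` is polled until the status is `Complete`. The sketch below assumes a boto3 CloudWatch Logs client is passed in; the function name and the polling defaults are illustrative choices.

```python
import time


def run_insights_query(logs, log_group, query, lookback_seconds=3600,
                       poll_seconds=1.0, max_polls=60):
    """Run a Logs Insights query and return its rows as dicts.

    `logs` is a boto3 CloudWatch Logs client.
    """
    end = int(time.time())
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=end - lookback_seconds,  # epoch seconds
        endTime=end,
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} pairs
            return [{c["field"]: c["value"] for c in row} for row in response["results"]]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("query did not complete in time")
```

The one-hour lookback mirrors the "errors in last hour" query; the time window lives in the API call, not in the query text.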
EventBridge (formerly CloudWatch Events)
Event-driven routing. Route events from AWS services to targets.
Rule: schedule (cron) OR event pattern
Target: Lambda, SQS, SNS, Step Functions, ECS task, ...
Examples:
- Every day at 9am UTC → Lambda (cleanup job)
- EC2 state change (stopped) → SNS alert
- S3 PutObject in raw/ → Lambda (ingest pipeline)
- CodePipeline failure → Slack webhook
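# Cron trigger for Lambda (the rule name 'daily-cleanup' is a placeholder)
events = boto3.client('events')
events.put_rule(
    Name='daily-cleanup',
    ScheduleExpression='cron(0 9 * * ? *)',  # 9am UTC daily
    State='ENABLED'
)
events.put_targets(
    Rule='daily-cleanup',
    Targets=[{'Id': 'cleanup', 'Arn': 'arn:aws:lambda:...'}]
)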
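Rules can match an event pattern instead of a schedule. The following sketch covers the "EC2 state change (stopped) → SNS alert" example; the helper names, the rule name, and the target ID `notify` are assumptions, while the `source` and `detail-type` values are the ones EC2 emits for instance state changes.

```python
import json


def ec2_state_pattern(states=("stopped",)):
    """Event pattern matching EC2 instance state changes."""
    return {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance State-change Notification"],
        "detail": {"state": list(states)},
    }


def create_state_change_rule(events, rule_name, sns_topic_arn, states=("stopped",)):
    """`events` is a boto3 EventBridge client; names and ARNs come from the caller."""
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(ec2_state_pattern(states)),
        State="ENABLED",
    )
    events.put_targets(Rule=rule_name, Targets=[{"Id": "notify", "Arn": sns_topic_arn}])
```

The SNS topic's access policy also has to allow EventBridge to publish to it, which this sketch does not set up.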
Container Insights
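For ECS/EKS. Collects CPU, memory, disk, and network metrics per task/pod. On ECS it is enabled as a cluster or account setting; on EKS it requires the CloudWatch agent (for example, through the CloudWatch Observability add-on).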
Key Metrics for ML Systems (MLA-C01)
| System | Key Metrics |
|---|---|
| SageMaker Training | TrainingJobStatus, ObjectiveMetricValue, CPUUtilization |
| SageMaker Endpoint | Invocations, ModelLatency, OverheadLatency, Invocation4XXErrors, Invocation5XXErrors |
| SageMaker Ground Truth | TotalLabeled, HumanLabeledObjectCount |
| Feature Store | RecordsFetched, RecordIngested |
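
As an illustration of alarming on one of these metrics, here is a sketch that builds `put_metric_alarm` arguments for endpoint `ModelLatency`. The helper name, the alarm naming scheme, and the period settings are assumptions; the namespace and the `EndpointName`/`VariantName` dimensions are those SageMaker uses for endpoint invocation metrics.

```python
def model_latency_alarm(endpoint_name, variant_name, threshold_ms, sns_topic_arn):
    """Keyword arguments for put_metric_alarm on a SageMaker endpoint variant."""
    return {
        "AlarmName": f"{endpoint_name}-{variant_name}-ModelLatency",
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "ExtendedStatistic": "p99",
        "Period": 60,
        "EvaluationPeriods": 5,
        # ModelLatency is reported in microseconds, so convert from milliseconds
        "Threshold": threshold_ms * 1000.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }
```

The result would be passed as `cw.put_metric_alarm(**model_latency_alarm(...))`. The unit conversion is the easy part to get wrong: a 200 ms target is a threshold of 200000.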
Interview Talking Points
"How do you monitor a Lambda function?"
CloudWatch auto-captures Lambda logs (stdout → log group). Set up alarms on Errors metric (> 0) and Duration (near timeout). Use Lambda Insights for memory utilization. Query logs with Insights for error patterns.
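The two alarms mentioned in this answer can be sketched as follows. The helper name `lambda_alarms`, the alarm names, and the 80% cutoff are illustrative assumptions; `Errors` and `Duration` are standard metrics in the `AWS/Lambda` namespace.

```python
def lambda_alarms(function_name, timeout_seconds, sns_topic_arn):
    """Return put_metric_alarm kwargs for an error alarm and a near-timeout alarm."""
    common = {
        "Namespace": "AWS/Lambda",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Period": 60,
        "EvaluationPeriods": 1,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no invocations means nothing to alert on
        "AlarmActions": [sns_topic_arn],
    }
    errors = dict(
        common,
        AlarmName=f"{function_name}-errors",
        MetricName="Errors",
        Statistic="Sum",
        Threshold=0.0,  # any error in the period
    )
    duration = dict(
        common,
        AlarmName=f"{function_name}-near-timeout",
        MetricName="Duration",
        Statistic="Maximum",
        # Duration is in milliseconds; alert at 80% of the configured timeout
        Threshold=timeout_seconds * 1000 * 0.8,
    )
    return [errors, duration]
```

Each dict would be applied with `cw.put_metric_alarm(**alarm)`.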
"How would you set up alerting for a microservices system?"
Custom metrics for business KPIs (order failures, latency p99), CloudWatch Alarms on those plus standard AWS metrics, Composite Alarms to reduce noise, SNS → PagerDuty/Slack. Dashboards per service in CloudWatch.
Related
- [[AWS/Lambda]] — Lambda logs auto-sent to CloudWatch
- [[AWS/RDS]] — RDS metrics in CloudWatch
- [[AWS/SQS & SNS]] — EventBridge → SNS for alerting
- [[Distributed Systems Concepts]] — observability as a pillar