AWS CloudWatch

AWS's observability platform. Collects metrics, logs, traces, and events from all AWS services. Core for production monitoring and the MLA-C01 exam.


Core Components

| Component | What it is | Analogy |
|---|---|---|
| Metrics | Time-series numerical data | Datadog metrics |
| Logs | Text log streams | ELK stack |
| Alarms | Trigger on metric threshold | PagerDuty alert |
| Events / EventBridge | Rule-based event routing | Cron + webhook |
| Dashboards | Visualization | Grafana |
| Logs Insights | Log query language (like SQL) | Kibana Discover |
| Container Insights | ECS/EKS metrics | |
| Lambda Insights | Per-function metrics | |

Metrics

Standard Metrics (free, auto-collected)

  • EC2: CPUUtilization, NetworkIn/Out, DiskRead/WriteOps
  • RDS: DatabaseConnections, FreeStorageSpace, ReadLatency
  • Lambda: Invocations, Errors, Duration, Throttles, ConcurrentExecutions
  • SQS: NumberOfMessagesSent, ApproximateNumberOfMessagesVisible

Custom Metrics

import boto3

cw = boto3.client('cloudwatch')

# Publish custom metric
cw.put_metric_data(
    Namespace='MyApp/API',
    MetricData=[{
        'MetricName': 'RequestLatency',
        'Value': 150.5,
        'Unit': 'Milliseconds',
        'Dimensions': [
            {'Name': 'Endpoint', 'Value': '/users'},
            {'Name': 'Environment', 'Value': 'production'}
        ]
    }]
)

Resolution

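  • Standard resolution: 1-minute granularity (the default for custom metrics). EC2 basic monitoring publishes every 5 minutes; 1-minute EC2 data requires detailed monitoring (additional cost)
  • High resolution: 1-second granularity (additional cost), set with StorageResolution=1. Use it for short-lived or bursty workloads where a 1-minute average would hide spikes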
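A minimal sketch of publishing a high-resolution datapoint, reusing the `MyApp/API` namespace and `RequestLatency` metric from the custom-metric example. The helper names `build_latency_datum` and `publish_latency` are hypothetical; the client is passed in so the functions work with `boto3.client('cloudwatch')` or a test double.

```python
from datetime import datetime, timezone


def build_latency_datum(endpoint, latency_ms, high_resolution=False):
    """Build one MetricData entry for put_metric_data (hypothetical helper)."""
    return {
        "MetricName": "RequestLatency",
        "Timestamp": datetime.now(timezone.utc),
        "Value": float(latency_ms),
        "Unit": "Milliseconds",
        "Dimensions": [{"Name": "Endpoint", "Value": endpoint}],
        # 1 = high resolution (1-second), 60 = standard resolution (1-minute)
        "StorageResolution": 1 if high_resolution else 60,
    }


def publish_latency(cloudwatch, endpoint, latency_ms, high_resolution=False):
    """Publish the datapoint through any client exposing put_metric_data."""
    datum = build_latency_datum(endpoint, latency_ms, high_resolution)
    return cloudwatch.put_metric_data(Namespace="MyApp/API", MetricData=[datum])
```

Typical use would be `publish_latency(boto3.client('cloudwatch'), '/users', 150.5, high_resolution=True)`.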

Alarms

Three states: OK | ALARM | INSUFFICIENT_DATA

cw.put_metric_alarm(
    AlarmName='High-CPU-EC2',
    MetricName='CPUUtilization',
    Namespace='AWS/EC2',
    Statistic='Average',
    Period=300,              # 5 minutes
    EvaluationPeriods=2,     # 2 consecutive periods
    Threshold=90.0,
    ComparisonOperator='GreaterThanThreshold',
    Dimensions=[{'Name': 'InstanceId', 'Value': 'i-1234567890'}],
    AlarmActions=['arn:aws:sns:us-east-1:123:alert-topic'],  # SNS notification
    TreatMissingData='missing'
)
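To make the interplay of Period and EvaluationPeriods concrete, here is a deliberately simplified model of the evaluation, not CloudWatch's actual algorithm. It assumes one aggregated value per period, ignores TreatMissingData and "M out of N" settings, and the function name `alarm_state` is hypothetical.

```python
def alarm_state(period_values, threshold, evaluation_periods):
    """Simplified model: ALARM when the most recent N period values all exceed the threshold."""
    if len(period_values) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    recent = period_values[-evaluation_periods:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"


# 5-minute CPU averages, threshold 90, 2 consecutive periods (as in the alarm above)
print(alarm_state([50.0, 95.0, 92.0], 90.0, 2))
print(alarm_state([95.0, 50.0], 90.0, 2))
```

With these inputs the first call reports ALARM (the last two periods both exceed 90) and the second reports OK (only one of the last two does), so a single spike does not page anyone.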

Alarm actions:

  • SNS notification → email, SMS, PagerDuty webhook
  • Auto Scaling action (scale out/in)
  • EC2 action (stop, terminate, reboot)
  • Lambda invocation

Composite Alarm: Combine multiple alarms with AND/OR logic.
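A sketch of creating a composite alarm with boto3's `put_composite_alarm`. The helper names and the child alarm `High-Latency` are hypothetical; `High-CPU-EC2` is the alarm defined above. The client is passed in rather than created inside the function.

```python
def alarm_rule(operator, alarm_names):
    """Join child alarms into an AlarmRule expression such as ALARM("a") AND ALARM("b")."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    return f" {operator} ".join(f'ALARM("{name}")' for name in alarm_names)


def create_composite_alarm(cloudwatch, name, operator, alarm_names, sns_topic_arn):
    """Create a composite alarm that notifies only when the combined rule is true."""
    return cloudwatch.put_composite_alarm(
        AlarmName=name,
        AlarmRule=alarm_rule(operator, alarm_names),
        AlarmActions=[sns_topic_arn],
    )
```

Using AND here means a page goes out only when both child alarms are in ALARM at the same time, which is how composite alarms cut noise.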


CloudWatch Logs

Log Groups and Streams

Log Group: /aws/lambda/my-function      ← one per Lambda function
  └── Log Stream: 2026/04/16/[$LATEST]abc123  ← one per execution environment (reused across invocations)
      ├── log event: timestamp + message
      └── log event: timestamp + message

Sending Logs from Code

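import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def handler(event, context):
    # Lambda forwards stdout/stderr and logging output to CloudWatch Logs automatically
    logger.info("Processing request", extra={"request_id": event.get("id")})
    try:
        ...  # application work, e.g. a database call
    except Exception:
        # exc_info=True attaches the traceback, so call it inside an except block
        logger.error("Failed to connect to DB", exc_info=True)
        raise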
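Fields passed through `extra` only reach the log line if the formatter emits them. One way to do that is a small JSON formatter; this is a sketch using only the standard library, and the class name `JsonFormatter` and its field layout are assumptions, not a Lambda or CloudWatch API. Logs Insights discovers fields in JSON log lines automatically, which makes them filterable by name.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (hypothetical field layout)."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via extra={...} become attributes on the record
        request_id = getattr(record, "request_id", None)
        if request_id is not None:
            payload["request_id"] = request_id
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_json_logging(level=logging.INFO):
    """Attach the JSON formatter to every handler on the root logger."""
    root = logging.getLogger()
    root.setLevel(level)
    if not root.handlers:
        root.addHandler(logging.StreamHandler())
    for handler in root.handlers:
        handler.setFormatter(JsonFormatter())
    return root
```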

Retention Policy

Default: never expire (costs money). Set retention:

logs = boto3.client('logs')
logs.put_retention_policy(
    logGroupName='/aws/lambda/my-function',
    retentionInDays=30
)
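Since new log groups default to never expiring, a periodic sweep can catch the ones that were missed. A sketch, assuming the function name `apply_default_retention` (hypothetical) and a client passed in from `boto3.client('logs')`:

```python
def apply_default_retention(logs_client, days=30):
    """Set a retention policy on every log group that currently never expires."""
    updated = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page["logGroups"]:
            # Groups without a retention policy have no retentionInDays key
            if "retentionInDays" not in group:
                logs_client.put_retention_policy(
                    logGroupName=group["logGroupName"],
                    retentionInDays=days,
                )
                updated.append(group["logGroupName"])
    return updated
```

The function returns the names it changed, which is convenient to log from a scheduled cleanup job.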

CloudWatch Logs Insights

SQL-like query language for log analysis:

# Find errors (the time range, e.g. the last hour, is set on the query, not in the query text)
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 20

# Average Lambda duration by function (run against Lambda log groups; @log identifies the log group)
filter @type = "REPORT"
| stats avg(@duration) as avgDuration by @log
| sort avgDuration desc

# Count 4xx/5xx errors
filter @message like /statusCode/
| parse @message '"statusCode": *,' as statusCode
| filter statusCode >= 400
| stats count(*) as errors by statusCode
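Queries like these can also be run programmatically. Logs Insights queries are asynchronous: `start_query` returns a query ID and `get_query_results` is polled until the query finishes. A sketch, where the function name `run_insights_query` and the polling defaults are assumptions:

```python
import time

TERMINAL_STATES = {"Complete", "Failed", "Cancelled", "Timeout"}


def run_insights_query(logs_client, log_group, query, start_epoch, end_epoch,
                       poll_seconds=1.0, max_polls=60):
    """Run a Logs Insights query and return rows as {field: value} dicts."""
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_epoch,   # epoch seconds
        endTime=end_epoch,
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        if response["status"] in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"query {query_id} did not finish after {max_polls} polls")

    if response["status"] != "Complete":
        raise RuntimeError(f"query {query_id} ended with status {response['status']}")

    # Each result row is a list of {"field": ..., "value": ...} cells
    return [{cell["field"]: cell["value"] for cell in row} for row in response["results"]]
```

Values come back as strings, so numeric fields such as counts need converting before arithmetic.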

EventBridge (formerly CloudWatch Events)

Event-driven routing. Route events from AWS services to targets.

Rule: schedule (cron) OR event pattern
Target: Lambda, SQS, SNS, Step Functions, ECS task, ...

Examples:
- Every day at 9am UTC → Lambda (cleanup job)
- EC2 state change (stopped) → SNS alert
- S3 PutObject in raw/ → Lambda (ingest pipeline)
- CodePipeline failure → Slack webhook

Cron trigger for Lambda (rule definition; fires at 9am UTC daily):

{
    "ScheduleExpression": "cron(0 9 * * ? *)",
    "State": "ENABLED",
    "Targets": [{"Id": "cleanup", "Arn": "arn:aws:lambda:..."}]
}
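The two rule types can be sketched with boto3's `put_rule` and `put_targets`. The function names are hypothetical and the client is passed in (for example `boto3.client('events')`). The event pattern shown is the one for the "EC2 state change (stopped)" example above.

```python
import json


def schedule_lambda(events_client, rule_name, lambda_arn, schedule="cron(0 9 * * ? *)"):
    """Create a scheduled rule and point it at a Lambda function."""
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule,
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "cleanup", "Arn": lambda_arn}],
    )
    return rule["RuleArn"]


def ec2_stopped_pattern():
    """Event pattern matching EC2 instances entering the stopped state."""
    return json.dumps({
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance State-change Notification"],
        "detail": {"state": ["stopped"]},
    })


def alert_on_ec2_stop(events_client, rule_name, sns_topic_arn):
    """Create an event-pattern rule that forwards matching events to an SNS topic."""
    rule = events_client.put_rule(
        Name=rule_name,
        EventPattern=ec2_stopped_pattern(),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "alert", "Arn": sns_topic_arn}],
    )
    return rule["RuleArn"]
```

Creating the rule is not sufficient by itself: the Lambda function needs a resource-based permission allowing `events.amazonaws.com` to invoke it, and the SNS topic policy must allow EventBridge to publish.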

Container Insights

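For ECS and EKS: collects CPU, memory, disk, and network metrics per task or pod. On EKS this requires the CloudWatch agent running in the cluster; on ECS it is switched on as a cluster or account setting, with the agent needed only for instance-level metrics on EC2-backed clusters.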
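For ECS, the cluster-level switch can be flipped through the ECS API. A minimal sketch using `update_cluster_settings`; the function name is hypothetical and the client is passed in (for example `boto3.client('ecs')`):

```python
def enable_container_insights(ecs_client, cluster_name):
    """Turn on Container Insights for one ECS cluster and return its settings."""
    response = ecs_client.update_cluster_settings(
        cluster=cluster_name,
        settings=[{"name": "containerInsights", "value": "enabled"}],
    )
    return response["cluster"]["settings"]
```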


Key Metrics for ML Systems (MLA-C01)

| System | Key Metrics |
|---|---|
| SageMaker Training | TrainingJobStatus, ObjectiveMetricValue, CPUUtilization |
| SageMaker Endpoint | Invocations, ModelLatency, OverheadLatency, Invocation4XXErrors, Invocation5XXErrors |
| SageMaker Ground Truth | TotalLabeled, HumanLabeledObjectCount |
| Feature Store | RecordsFetched, RecordIngested |
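As an example of reading one of these, the sketch below pulls recent `ModelLatency` for an endpoint with `get_metric_statistics`. The function name is hypothetical, and `AllTraffic` is assumed as the variant name; substitute the variant your endpoint actually uses. ModelLatency is reported in microseconds, so the helper converts to milliseconds.

```python
from datetime import datetime, timedelta, timezone


def endpoint_latency_ms(cloudwatch, endpoint_name, variant_name="AllTraffic", minutes=60):
    """Return (timestamp, average ModelLatency in ms) pairs, oldest first."""
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="ModelLatency",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=300,               # 5-minute buckets
        Statistics=["Average"],
    )
    # Datapoints are not guaranteed to arrive in time order
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    # ModelLatency is in microseconds; convert to milliseconds
    return [(p["Timestamp"], p["Average"] / 1000.0) for p in points]
```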

Interview Talking Points

"How do you monitor a Lambda function?" CloudWatch auto-captures Lambda logs (stdout → log group). Set up alarms on Errors metric (> 0) and Duration (near timeout). Use Lambda Insights for memory utilization. Query logs with Insights for error patterns.
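The two alarms in that answer could be sketched as follows. The helper names, the alarm naming scheme, and the 80% duration fraction are assumptions; Lambda's Duration metric is in milliseconds, so the timeout is converted before applying the fraction.

```python
def lambda_alarm_specs(function_name, timeout_seconds, sns_topic_arn, duration_fraction=0.8):
    """Build put_metric_alarm arguments for an Errors alarm and a near-timeout Duration alarm."""
    shared = {
        "Namespace": "AWS/Lambda",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Period": 60,
        "EvaluationPeriods": 1,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
        # No invocations means no datapoints; treat that as healthy
        "TreatMissingData": "notBreaching",
    }
    errors = dict(
        shared,
        AlarmName=f"{function_name}-errors",
        MetricName="Errors",
        Statistic="Sum",
        Threshold=0.0,
    )
    duration = dict(
        shared,
        AlarmName=f"{function_name}-duration",
        MetricName="Duration",
        Statistic="Maximum",
        # Duration is in milliseconds
        Threshold=timeout_seconds * 1000.0 * duration_fraction,
    )
    return [errors, duration]


def create_lambda_alarms(cloudwatch, function_name, timeout_seconds, sns_topic_arn):
    """Create both alarms through any client exposing put_metric_alarm."""
    specs = lambda_alarm_specs(function_name, timeout_seconds, sns_topic_arn)
    for spec in specs:
        cloudwatch.put_metric_alarm(**spec)
    return [spec["AlarmName"] for spec in specs]
```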

"How would you set up alerting for a microservices system?" Custom metrics for business KPIs (order failures, latency p99), CloudWatch Alarms on those + standard AWS metrics, Composite Alarms to reduce noise, SNS → PagerDuty/Slack. Dashboards per service in CloudWatch.


Related

  • [[AWS/Lambda]] — Lambda logs auto-sent to CloudWatch
  • [[AWS/RDS]] — RDS metrics in CloudWatch
  • [[AWS/SQS & SNS]] — EventBridge → SNS for alerting
  • [[Distributed Systems Concepts]] — observability as a pillar