Configure alerting rules and notifications
You are a monitoring engineer. The user wants to configure alerting rules and notifications to trigger actions when metrics cross thresholds.
What to check first
- Review your monitoring stack (Prometheus, Grafana, AlertManager, or your cloud provider's native solution)
- Run `curl http://localhost:9090/api/v1/alerts` to see current active alerts in Prometheus
- Verify notification channels are configured (email, Slack, PagerDuty, webhooks)
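If `jq` is available, the alert list is easier to scan when filtered down to name and state. A quick sketch, assuming Prometheus is listening on its default localhost:9090:

```sh
# List active alerts by name and state (firing/pending); assumes jq is installed
curl -s http://localhost:9090/api/v1/alerts \
  | jq '.data.alerts[] | {name: .labels.alertname, state: .state}'
```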
Steps
- Define alert conditions using PromQL expressions: write queries that return non-zero values when the alert should fire (e.g., `up{job="api"} == 0` for down services)
- Create alert rules in `prometheus.yml` or a separate rules file such as `alert_rules.yml`, specifying `alert`, `expr`, `for`, and `annotations`
- Set the `for` duration to prevent flapping: `5m` means the condition must persist for 5 minutes before the alert fires
- Add `labels` to categorize alerts by severity (`critical`, `warning`) and team ownership
- Configure `annotations` with `summary` and `description` templates, using `{{ $labels.instance }}` to pass context
- Point Prometheus at AlertManager via the `alerting` block (Prometheus, not AlertManager, loads the rules file), and set AlertManager's `global.resolve_timeout` (default `5m`)
- Define notification routes in AlertManager's `alertmanager.yml` with `receiver`, `group_by`, and `group_wait` settings (a routing sketch follows the Code section)
- Test rule syntax with `promtool check rules /path/to/rules.yml` before deploying; a unit-test sketch follows this list
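Beyond syntax checking, `promtool test rules` can assert that a rule actually fires on synthetic data before it ever reaches production. A minimal sketch for the ServiceDown rule shown in the Code section; the file name, instance label, and series values here are illustrative:

```yaml
# alert_rules_test.yml  (run with: promtool test rules alert_rules_test.yml)
rule_files:
  - alert_rules.yml
evaluation_interval: 15s
tests:
  - interval: 1m
    input_series:
      # Synthetic data: the api target reports down for three samples
      - series: 'up{job="api", instance="api-1:9090"}'
        values: '0 0 0'
    alert_rule_test:
      - eval_time: 2m   # condition has held past the 1m `for` duration
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: platform
              job: api
              instance: api-1:9090
            exp_annotations:
              summary: "api-1:9090 is down"
              description: "Service api on api-1:9090 has been unreachable for 1 minute"
```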
Code
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - 'alert_rules.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'localhost:9093'
```

```yaml
# alert_rules.yml
groups:
  - name: application_alerts
    interval: 15s
    rules:
      - alert: HighErrorRate
        # 5xx responses as a fraction of all requests, so humanizePercentage
        # in the description renders a true percentage
        expr: |
          sum by (job, instance) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job, instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 1m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "{{ $labels.instance }} is down"
          description: "Service {{ $labels.job }} on {{ $labels.instance }} has been unreachable for 1 minute"
      - alert: HighMemoryUsage
        # Source truncates here; the completion below follows the common
        # node_exporter memory pattern, and the 0.9 threshold, for duration,
        # and labels are illustrative, not values from the source
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.9
        for: 10m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
```
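The steps reference notification routing, but the example stops before AlertManager's side of the configuration. A minimal `alertmanager.yml` sketch; the receiver names, Slack webhook URL, channel, PagerDuty key, and grouping values are placeholders, not values from the source:

```yaml
# alertmanager.yml (illustrative sketch; receiver details are placeholders)
global:
  resolve_timeout: 5m

route:
  receiver: default-slack            # fallback for anything unmatched below
  group_by: ['alertname', 'job']     # batch related alerts into one notification
  group_wait: 30s                    # initial delay before the first notification
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical           # page on critical, Slack for the rest
      receiver: pagerduty-critical

receivers:
  - name: default-slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/ME'   # placeholder
        channel: '#alerts'
  - name: pagerduty-critical
    pagerduty_configs:
      - service_key: 'REPLACE_WITH_INTEGRATION_KEY'              # placeholder
```

Validate routing changes with `amtool check-config alertmanager.yml` before reloading.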
Common Pitfalls
- Treating this skill as a one-shot solution; thresholds and `for` durations usually need iteration against real traffic
- Skipping the verification steps; you don't know a rule works until you watch it fire (and resolve) as expected
- Applying this skill without understanding the underlying problem; read the Prometheus alerting docs first
When NOT to Use This Skill
- When a simpler manual approach would take less than 10 minutes
- On critical production systems without testing in staging first
- When you don't have permission or authorization to make these changes
How to Verify It Worked
- Run the verification steps documented above, then confirm the rules appear on Prometheus's /alerts page
- Compare the output against your expected baseline
- Check Prometheus and AlertManager logs for warnings or errors; silent failures are the worst kind
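A command-line pass over the running stack confirms each hop; a sketch assuming default local ports and that `amtool` (shipped with AlertManager) is on the PATH:

```sh
# 1. Rule groups loaded by Prometheus
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[].name'

# 2. Alerts currently held by AlertManager
amtool alert query --alertmanager.url=http://localhost:9093

# 3. Push a synthetic alert through the routing tree (hypothetical name/label)
amtool alert add TestAlert severity=warning --alertmanager.url=http://localhost:9093
```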
Production Considerations
- Test in staging before deploying to production
- Have a rollback plan — every change should be reversible
- Monitor the affected systems for at least 24 hours after the change
Related Monitoring & Logging Skills
Other Claude Code skills in the same category:
- Structured Logging: implement structured logging (Winston, Pino)
- Error Tracking: set up error tracking (Sentry)
- APM Setup: set up Application Performance Monitoring
- Log Rotation: configure log rotation and management
- Health Dashboard: create a health monitoring dashboard
- Distributed Tracing: set up distributed tracing
- Metrics Collector: implement custom metrics collection
- Uptime Monitor: set up uptime monitoring