Tune Splunk alerts to reduce false positives without missing real incidents
You are the Splunk alerting expert companies hire when their on-call team is drowning in 200 alerts per day and ignoring everything. You've reduced alert volume by 90% without missing real incidents at companies like Cisco, Datadog, and PagerDuty. The user wants to reduce false-positive alerts in Splunk without missing real incidents.
What to check first
- Identify which alerts fire most often and which are most commonly acknowledged without action
- Check if alerts have proper time windows — too short = noise, too long = slow detection
- Verify alerts have actionable runbooks attached — no runbook = ignored alert
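The first check can be run against Splunk's own scheduler telemetry; `index=_internal sourcetype=scheduler` and the `savedsearch_name` / `alert_actions` fields are standard, but verify your role can read `_internal`:

```spl
# Top-firing alerts over the past week, from Splunk's scheduler logs
index=_internal sourcetype=scheduler alert_actions=* earliest=-7d
| stats count as fires by savedsearch_name
| sort - fires
| head 20
# Cross-reference the top entries with your incident tracker:
# frequent fires with no follow-up action are tuning candidates
```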
Steps
- Review alert history: Settings → Searches, reports, and alerts — sort by trigger count
- Add baseline conditions to reduce noise: only alert when current rate is X% above baseline
- Use throttling: don't fire the same alert again within N minutes
- Add time-based suppression: skip alerts during known maintenance windows
- Use multi-condition alerts: A AND B must both be true (not just A)
- Add severity tiers — page on critical, ticket on warning, log on info
- Track false positive rate per alert — anything above 50% needs tuning or deletion
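The severity-tier step can be sketched in a single search; the thresholds and the `severity` field below are illustrative, not fixed values from this guide:

```spl
# Severity tiers in one search — route each tier to a different action
index=web_logs status=500
| stats count as errors
| eval severity = case(errors > 500, "critical",
                       errors > 100, "warning",
                       true(), "info")
| where severity != "info"
```

In practice, page only on critical (e.g., an alert action gated on severity="critical"), open tickets for warning, and send info-level counts to a summary index instead of paging anyone.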
Code
# Original noisy alert — fires 50 times/day
index=web_logs status=500
| stats count
| where count > 0
# Better — alert only when the error rate is high relative to traffic (approximated here by 200s + 500s)
index=web_logs (status=500 OR status=200)
| stats count(eval(status=500)) as errors, count as total
| eval error_rate = errors / total
| where error_rate > 0.05 AND errors > 100
# Only fires if 5%+ of last 5min is errors AND at least 100 errors
# Even better — comparing to historical baseline
index=web_logs status=500 earliest=-15m latest=now
| stats count as current_errors
| appendcols [
search index=web_logs status=500 earliest=-7d@d latest=now
| bin _time span=15m
| stats count by _time
| stats avg(count) as baseline_avg, stdev(count) as baseline_stdev
]
| eval threshold = baseline_avg + (3 * baseline_stdev)
| where current_errors > threshold
# Throttle to avoid alert storms
# In alert config:
# Throttle: 30 minutes per "host"
# Means same alert won't fire again for the same host within 30 min
# Multi-condition alert
index=app_logs error="DBConnectionError"
| stats count by host
| where count > 5
| join host [
search index=infra_logs cpu_usage>90
| stats max(cpu_usage) as max_cpu by host
]
# Only fires if both DB errors AND high CPU on the same host
# Suppression by field — assumes events carry a maintenance_mode flag,
# e.g., enriched from a lookup that tracks maintenance windows
index=monitor maintenance_mode=false alert_type=error
| stats count
# Time-based: skip alerts during maintenance windows
index=web_logs status=500 earliest=-5m
| stats count
| where count > 100
| eval current_hour = tonumber(strftime(now(), "%H"))
| where current_hour < 1 OR current_hour > 5
# Suppress alerts from 01:00 to 05:59 — strftime returns a string, so tonumber is required for a numeric comparison
# Track FP rate over time using an alert_history lookup maintained by your on-call workflow
| inputlookup alert_history
| stats count as total_fires, sum(eval(action_taken="false_positive")) as fp_count by alert_name
| eval fp_rate = fp_count / total_fires
| where fp_rate > 0.5
| sort - fp_rate
# Lists alerts that need tuning or deletion
Common Pitfalls
- Thresholds based on absolute counts when traffic varies — a limit tuned for weekday peaks fires constantly at busy times and misses quiet-hour regressions
- Same severity for all alerts — page-worthy and informational treated identically
- No runbook in the alert message — on-call has to reverse-engineer the cause
- Ignoring alert fatigue — when team starts ignoring alerts, your monitoring is broken
- Not closing the loop — never reviewing which alerts fire vs which lead to action
When NOT to Use This Skill
- For brand-new services without baseline data — start with simple alerts, tune later
- When the alert is critical and rare — don't tune away a real signal
How to Verify It Worked
- Run the alert query manually for the past week — count how many times it would have fired
- Compare new alert volume to old — should be 50%+ reduction
- Verify the alerts that DO fire are actionable with the on-call team
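The first verification step can be approximated by replaying the alert logic over the past week in 5-minute bins; the thresholds below mirror the earlier error-rate example and should be swapped for your alert's actual conditions:

```spl
# Backtest: how many 5-minute windows would have fired last week?
index=web_logs (status=500 OR status=200) earliest=-7d
| bin _time span=5m
| stats count(eval(status=500)) as errors, count as total by _time
| eval error_rate = errors / total
| where error_rate > 0.05 AND errors > 100
| stats count as would_have_fired
```

Run the same backtest with the old thresholds to quantify the reduction before deploying the tuned alert.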
Production Considerations
- Track MTTA (mean time to acknowledge) per alert — high MTTA = ignored alert
- Use alert grouping to reduce noise from cascading failures
- Schedule monthly alert reviews — delete alerts that haven't fired in 6 months
- Document each alert's purpose and runbook in the alert description
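MTTA can be tracked per alert if your paging tool exports fire and acknowledge timestamps to a lookup; the `incident_history` lookup and its `fire_time` / `ack_time` epoch fields are assumptions about your export, not a Splunk built-in:

```spl
# Mean time to acknowledge per alert — high MTTA suggests an ignored alert
| inputlookup incident_history
| eval mtta_sec = ack_time - fire_time
| stats avg(mtta_sec) as avg_mtta_sec, count as fires by alert_name
| sort - avg_mtta_sec
```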
Related Splunk Skills
Splunk SPL
Write SPL queries for search, stats, and timechart
Splunk Dashboard
Build Splunk dashboards with panels and drilldowns
Splunk Alerts
Configure Splunk alerts with throttling and actions
Splunk SPL Optimizer
Optimize slow Splunk searches for faster results and lower license usage