
Kafka Monitoring

Monitor Kafka clusters with metrics, consumer lag, and alerting

You are a Kafka operations engineer. The user wants to monitor Kafka clusters by collecting metrics, tracking consumer lag, and setting up alerting.

What to check first

  • Verify JMX is enabled on each broker (e.g. JMX_PORT=9999 in the broker environment); Kafka does not open a remote JMX port by default
  • Check that Prometheus scrape config points to JMX exporter endpoint (default :5556) on each broker
  • Confirm consumer group exists with kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
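
The Prometheus scrape config mentioned above can be sketched as follows; the broker hostnames are hypothetical placeholders, and port 5556 matches this skill's javaagent setting:

```yaml
# prometheus.yml (fragment) - scrape the JMX exporter on each broker
scrape_configs:
  - job_name: "kafka-brokers"
    scrape_interval: 30s
    static_configs:
      - targets:            # hypothetical broker hostnames
          - "kafka-1:5556"
          - "kafka-2:5556"
          - "kafka-3:5556"
```

After reloading Prometheus, each broker should appear as UP on the Targets page.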

Steps

  1. Deploy the Prometheus JMX Exporter on each Kafka broker by downloading jmx_prometheus_javaagent-&lt;version&gt;.jar and creating a config YAML with broker metric filters
  2. Add the agent to broker startup via KAFKA_OPTS (not KAFKA_JMX_OPTS, which Kafka reserves for remote-JMX settings): export KAFKA_OPTS="-javaagent:/path/to/jmx_prometheus_javaagent.jar=5556:jmx_config.yaml"
  3. Configure Prometheus prometheus.yml scrape job to collect metrics from all broker JMX exporter endpoints every 30s
  4. Create Grafana dashboard panels querying the exported under-replicated-partitions, active-controller-count, and incoming-byte-rate metrics; exact names (e.g. kafka_server_replicamanager_underreplicatedpartitions, kafka_controller_kafkacontroller_activecontrollercount) depend on the rules in your jmx_config.yaml
  5. Add consumer lag monitoring by deploying a lag exporter (e.g. kafka_exporter or kafka-lag-exporter; broker JMX does not expose consumer lag) and querying its lag metric, such as kafka_consumergroup_lag
  6. Set up Prometheus alert rules for: lag > 100000 messages, under-replicated partitions > 0, active controller count != 1
  7. Configure AlertManager routing rules to send critical alerts to Slack/PagerDuty webhook URLs
  8. Create a custom dashboard panel with sum by (consumergroup) (kafka_consumergroup_lag) to rank the slowest consumer groups
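
Consumer lag in steps 5 and 8 is simple arithmetic: for each partition, lag is the broker's log-end offset minus the group's committed offset, summed per group. A minimal sketch with made-up offsets (the data and helper are illustrative, not part of any Kafka API):

```python
# Lag per partition = log-end offset - committed offset; summed per group.
# All offsets below are made-up illustration data, not read from a cluster.
log_end_offsets = {("orders", 0): 1500, ("orders", 1): 2200}
committed = {
    "billing-service": {("orders", 0): 1400, ("orders", 1): 2200},
    "audit-service": {("orders", 0): 900, ("orders", 1): 1000},
}

def group_lag(group_offsets, end_offsets):
    """Total lag for one consumer group across all its partitions."""
    return sum(end_offsets[tp] - off for tp, off in group_offsets.items())

lags = {g: group_lag(offs, log_end_offsets) for g, offs in committed.items()}

# Rank slowest consumers, mirroring the sum by (consumergroup) query
for group, lag in sorted(lags.items(), key=lambda kv: -kv[1]):
    print(group, lag)
```

This is the same aggregation the PromQL query performs; the exporter just does the subtraction for you, per partition, on every scrape.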

Code

# jmx_config.yaml - JMX Exporter configuration for Kafka brokers
lowercaseOutputName: true
lowercaseOutputLabelNames: true
whitelistObjectNames:
  - "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"
  - "kafka.server:type=ReplicaManager,name=LeaderCount"
  - "kafka.controller:type=KafkaController,name=ActiveControllerCount"
  - "kafka.server:type=BrokerTopicMetrics,name=*"
  - "kafka.network:type=RequestMetrics,name=*,clientId=*,request=*"
  - "kafka.server:type=DelayedOperationPurgatory,name=*,clientId=*"
rules:
  - pattern: kafka.(\w+)<type=(.+), name=(.+), clientId=(.+), brokerHost=(.+), brokerPort=(.+)><>Value
    name: kafka_$1_$2_$3
    labels:
      clientId: "$4"
      broker: "$5:$6"

Note: the upstream prometheus/jmx_exporter GitHub repo ships a full Kafka example config; the rule above follows its pattern.
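
The alert thresholds from step 6 can be sketched as a Prometheus rules file; the metric names here are assumptions that must match what your exporters actually emit:

```yaml
# kafka_alerts.yml (fragment) - thresholds from step 6; adjust metric
# names to whatever your jmx_config.yaml rules and lag exporter produce
groups:
  - name: kafka
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
        for: 5m
        labels: {severity: critical}
      - alert: KafkaNoSingleActiveController
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
        for: 5m
        labels: {severity: critical}
      - alert: KafkaConsumerLagHigh
        expr: sum by (consumergroup) (kafka_consumergroup_lag) > 100000
        for: 10m
        labels: {severity: warning}
```

Load the file via rule_files in prometheus.yml; AlertManager routing (step 7) can then match on the severity label to pick Slack or PagerDuty receivers.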

Common Pitfalls

  • Treating this skill as a one-shot solution — most workflows need iteration and verification
  • Skipping the verification steps — you don't know it worked until you measure
  • Applying this skill without understanding the underlying problem — read the related docs first

When NOT to Use This Skill

  • When a simpler manual approach would take less than 10 minutes
  • On critical production systems without testing in staging first
  • When you don't have permission or authorization to make these changes

How to Verify It Worked

  • Curl each broker's exporter endpoint (http://broker:5556/metrics) and confirm kafka_* metrics appear
  • Check that the Prometheus Targets page shows every broker as UP, then compare metric values against your expected baseline
  • Check logs for any warnings or errors — silent failures are the worst kind

Production Considerations

  • Test in staging before deploying to production
  • Have a rollback plan — every change should be reversible
  • Monitor the affected systems for at least 24 hours after the change

Quick Info

  • Category: Kafka
  • Difficulty: intermediate
  • Version: 1.0.0
  • Author: Claude Skills Hub
  • Tags: kafka, monitoring, ops

Install command:

curl -o ~/.claude/skills/kafka-monitoring.md https://clskills.in/skills/kafka/kafka-monitoring.md
