Monitor Kafka clusters with metrics, consumer lag, and alerting
You are a Kafka operations engineer. The user wants to monitor Kafka clusters by collecting metrics, tracking consumer lag, and setting up alerting.
What to check first
- Verify the Kafka broker JMX port is open (default 9999) and `KAFKA_JMX_OPTS` is set in the broker config
- Check that the Prometheus scrape config points to the JMX exporter endpoint (default `:5556`) on each broker
- Confirm the consumer group exists with `kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list`
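Once the group is confirmed, `kafka-consumer-groups.sh --describe` reports per-partition lag in its LAG column, which can be totalled with a quick awk pass. A minimal sketch, using a fabricated sample of the describe output (columns: GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG) rather than a live cluster:

```shell
# Sum the LAG column (6th field) of kafka-consumer-groups.sh --describe output.
# The sample lines below are made up for illustration; in practice pipe the
# real command output into the same awk expression.
sample='orders-svc orders 0 1000 1500 500
orders-svc orders 1 2000 2100 100'
echo "$sample" | awk '{lag += $6} END {print "total lag:", lag}'
# prints: total lag: 600
```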
Steps
- Deploy the JMX Exporter on each Kafka broker: download `jmx_exporter_javaagent.jar` and create a config YAML with broker metric filters
- Add the JMX agent to broker startup: set `KAFKA_JMX_OPTS="-javaagent:/path/to/jmx_exporter_javaagent.jar=5556:jmx_config.yaml"`
- Configure a `prometheus.yml` scrape job to collect metrics from all broker JMX exporter endpoints every 30s
- Create Grafana dashboard panels querying `kafka_server_replica_manager_under_replicated_partitions`, `kafka_server_controller_kafka_controller_active_controller_count`, and `kafka_controller_network_request_metrics_incoming_byte_rate`
- Add consumer lag monitoring by querying Prometheus for `kafka_consumer_group_lag_sum` across all consumer groups
- Set up Prometheus alert rules for: lag > 100000 messages, under-replicated partitions > 0, active controller count != 1
- Configure Alertmanager routing rules to send critical alerts to Slack/PagerDuty webhook URLs
- Create a custom dashboard panel with `sum by(consumergroup) (kafka_consumer_group_lag_sum)` to rank the slowest consumers
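The Prometheus side of the steps above can be sketched as a single scrape job. The job name and broker hostnames below are assumptions; substitute your own targets:

```yaml
# prometheus.yml - scrape the JMX exporter on every broker every 30s.
# Hostnames and job_name are placeholders; adjust for your cluster.
scrape_configs:
  - job_name: "kafka-brokers"
    scrape_interval: 30s
    static_configs:
      - targets:
          - "kafka-broker-1:5556"
          - "kafka-broker-2:5556"
          - "kafka-broker-3:5556"
```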
Code
```yaml
# jmx_config.yaml - JMX Exporter configuration for Kafka brokers
lowercaseOutputName: true
lowercaseOutputLabelNames: true
whitelistObjectNames:
  - "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"
  - "kafka.server:type=ReplicaManager,name=LeaderCount"
  - "kafka.controller:type=KafkaController,name=ActiveControllerCount"
  - "kafka.server:type=BrokerTopicMetrics,name=*"
  - "kafka.network:type=RequestMetrics,name=*,clientId=*,request=*"
  - "kafka.server:type=DelayedOperationPurgatory,name=*,clientId=*"
rules:
  - pattern: kafka.(\w+)<type=(.+), name=(.+), clientId=(.+), brokerHost=(.+), brokerPort=(.+)><>Value
    name: kafka_$1_$2_
```
Note: this example was truncated in the source. See the GitHub repo for the latest full version.
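The three alert conditions listed in the steps can be expressed as Prometheus rules. A sketch, assuming the metric names used throughout this skill match what your JMX exporter config actually emits (verify against a real `/metrics` scrape), with `for` durations chosen as placeholders:

```yaml
# kafka_alerts.yml - alert rules for lag, under-replication, and controller count.
# Thresholds follow the steps above; `for` durations are illustrative defaults.
groups:
  - name: kafka
    rules:
      - alert: KafkaConsumerLagHigh
        expr: sum by(consumergroup) (kafka_consumer_group_lag_sum) > 100000
        for: 10m
        labels:
          severity: critical
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replica_manager_under_replicated_partitions > 0
        for: 5m
        labels:
          severity: critical
      - alert: KafkaActiveControllerCountAbnormal
        expr: sum(kafka_server_controller_kafka_controller_active_controller_count) != 1
        for: 5m
        labels:
          severity: critical
```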
Common Pitfalls
- Treating this skill as a one-shot solution — most workflows need iteration and verification
- Skipping the verification steps — you don't know it worked until you measure
- Applying this skill without understanding the underlying problem — read the related docs first
When NOT to Use This Skill
- When a simpler manual approach would take less than 10 minutes
- On critical production systems without testing in staging first
- When you don't have permission or authorization to make these changes
How to Verify It Worked
- Run the verification steps documented above
- Compare the output against your expected baseline
- Check logs for any warnings or errors — silent failures are the worst kind
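To spot-check the `sum by(consumergroup)` ranking without Grafana, the same aggregation can be run over a scraped `/metrics` payload. A sketch with a fabricated payload; it assumes `consumergroup` is the first quoted label on each line (which is what the awk field split relies on), and in practice you would pipe `curl -s broker:5556/metrics | grep kafka_consumer_group_lag_sum` into the same pipeline:

```shell
# Aggregate kafka_consumer_group_lag_sum per consumergroup label, then rank
# groups by total lag (highest first). Payload below is fabricated.
metrics='kafka_consumer_group_lag_sum{consumergroup="orders-svc",topic="orders"} 120000
kafka_consumer_group_lag_sum{consumergroup="orders-svc",topic="audit"} 5000
kafka_consumer_group_lag_sum{consumergroup="billing-svc",topic="invoices"} 300'
echo "$metrics" \
  | awk -F'"' '{n = split($0, a, " "); lag[$2] += a[n]}
               END {for (g in lag) print g, lag[g]}' \
  | sort -k2,2nr
# prints: orders-svc 125000
#         billing-svc 300
```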
Production Considerations
- Test in staging before deploying to production
- Have a rollback plan — every change should be reversible
- Monitor the affected systems for at least 24 hours after the change
Related Kafka Skills
Other Claude Code skills in the same category.
Kafka Producer
Build Kafka producers with serialization, partitioning, and delivery guarantees
Kafka Consumer
Build Kafka consumers with consumer groups, offsets, and error handling
Kafka Streams
Build stream processing applications with Kafka Streams DSL
Kafka Connect
Configure source and sink connectors for data integration
Kafka Schema Registry
Manage Avro/Protobuf schemas with Confluent Schema Registry
Kafka Consumer Group Setup
Configure Kafka consumer groups for parallel processing and fault tolerance