Troubleshooting — Kafka
ZooKeeper loses quorum
Cause: the number of ZooKeeper replicas is insufficient or even, preventing the formation of a majority quorum. A quorum requires a strict majority of replicas (for example, 2 of 3 nodes).
Solution:
- Check the configured ZooKeeper replica count:
  ```shell
  kubectl get kafka -o yaml | grep -A 5 zookeeper
  ```
- Ensure `zookeeper.replicas` is an odd number (3, 5, or 7).
- Check ZooKeeper pod status:
  ```shell
  kubectl get pods -l app.kubernetes.io/component=zookeeper
  ```
- Check available disk space on ZooKeeper volumes (a full disk causes quorum loss):
  ```shell
  kubectl exec <zookeeper-pod> -- df -h /data
  ```
- If needed, increase `zookeeper.size` in your manifest and reapply it.
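The odd-replica-count rule follows directly from the quorum arithmetic. The sketch below (illustrative, not part of any chart) shows why an even ensemble size adds cost without adding fault tolerance:

```python
def quorum_size(replicas: int) -> int:
    """A ZooKeeper ensemble needs a strict majority of replicas to form a quorum."""
    return replicas // 2 + 1

def tolerated_failures(replicas: int) -> int:
    """Number of replicas that can fail while a majority can still be formed."""
    return replicas - quorum_size(replicas)

for n in (3, 4, 5):
    print(f"{n} replicas: quorum={quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 3 replicas: quorum=2, tolerates 1 failure(s)
# 4 replicas: quorum=3, tolerates 1 failure(s)  <- no better than 3
# 5 replicas: quorum=3, tolerates 2 failure(s)
```

A 4-node ensemble tolerates exactly as many failures as a 3-node one, which is why the recommended values are 3, 5, or 7.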
Topic inaccessible or broker unavailable
Cause: one or more Kafka brokers are not functioning properly, or the topic does not have enough synchronized replicas relative to min.insync.replicas.
Solution:
- Check Kafka pod status:
  ```shell
  kubectl get pods -l app.kubernetes.io/component=kafka
  ```
- Inspect events on a failing pod:
  ```shell
  kubectl describe pod <kafka-pod>
  ```
- Verify that the topic's replica count is consistent with the number of available brokers:
  ```shell
  kubectl exec <kafka-pod> -- kafka-topics.sh --describe --topic <topic-name> --bootstrap-server localhost:9092
  ```
- Check storage space (a full volume prevents the broker from operating):
  ```shell
  kubectl exec <kafka-pod> -- df -h /bitnami/kafka
  ```
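The `min.insync.replicas` interaction can be reduced to a single comparison: with `acks=all`, a partition accepts writes only while its in-sync replica set is at least that large. A minimal sketch of that check (the replica counts are illustrative):

```python
def partition_writable(isr_count: int, min_insync_replicas: int) -> bool:
    """With acks=all, a produce request is rejected (NotEnoughReplicas)
    when the partition's in-sync replica set is smaller than
    min.insync.replicas."""
    return isr_count >= min_insync_replicas

# Topic with replication factor 3 and min.insync.replicas=2:
print(partition_writable(3, 2))  # True: all replicas in sync
print(partition_writable(2, 2))  # True: one broker down, still writable
print(partition_writable(1, 2))  # False: two brokers down, writes fail
```

This is why a topic can become write-unavailable even though one broker is still serving it.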
Significant consumer lag
Cause: consumers are not processing messages fast enough compared to the production rate. This can be due to insufficient partitions, too few consumers in the group, or under-provisioned consumers.
Solution:
- Identify consumer group lag:
  ```shell
  kubectl exec <kafka-pod> -- kafka-consumer-groups.sh --describe --group <group-id> --bootstrap-server localhost:9092
  ```
- If lag is spread across many partitions, increase the number of consumers in the group (without exceeding the number of partitions).
- If all partitions have lag, consider increasing the number of partitions for the topic:
  ```yaml
  # kafka.yaml
  topics:
    - name: events
      partitions: 12
      replicas: 3
  ```
- Verify that consumers have sufficient resources (CPU, memory) to process messages.
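The LAG column reported by `kafka-consumer-groups.sh` is simply the partition's log end offset minus the group's committed offset. A sketch of that arithmetic with hypothetical offsets:

```python
def consumer_lag(end_offsets, committed):
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}

end = {0: 1500, 1: 1480, 2: 900}    # hypothetical log end offsets
done = {0: 1500, 1: 1200, 2: 100}   # hypothetical committed offsets

lag = consumer_lag(end, done)
print(lag)                # {0: 0, 1: 280, 2: 800}
print(sum(lag.values()))  # total lag across the group: 1080
```

A pattern like partition 2 above (one partition far behind the others) often points to a hot key or a stuck consumer rather than overall under-provisioning.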
Broker OOMKilled
Cause: the Kafka broker consumes more memory than the allocated limit. This frequently occurs with the nano or micro preset under load.
Solution:
- Check pod events to confirm the OOMKill:
  ```shell
  kubectl describe pod <kafka-pod> | grep -A 5 "Last State"
  ```
- Increase broker memory resources using a higher preset or explicit resources:
  ```yaml
  # kafka.yaml
  kafka:
    replicas: 3
    resources:
      cpu: 2000m
      memory: 4Gi
    size: 20Gi
  ```
- Reapply the manifest:
  ```shell
  kubectl apply -f kafka.yaml
  ```
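When sizing the limit, keep in mind that the broker's JVM heap must leave headroom below the container limit for page cache and off-heap buffers. The sketch below uses a 50% heap fraction, which is a common rule of thumb and an assumption here, not a default of this chart:

```python
UNITS = {"Mi": 1 << 20, "Gi": 1 << 30}

def parse_mem(quantity: str) -> int:
    """Parse a Kubernetes memory quantity like '4Gi' into bytes (Mi/Gi only)."""
    return int(quantity[:-2]) * UNITS[quantity[-2:]]

def suggested_heap(limit: str, fraction: float = 0.5) -> str:
    """Suggest a JVM max heap as a fraction of the container memory limit.
    The 0.5 fraction is a rule-of-thumb assumption; the rest is left for
    page cache and off-heap use."""
    heap_mb = int(parse_mem(limit) * fraction / (1 << 20))
    return f"-Xmx{heap_mb}m"

print(suggested_heap("4Gi"))  # -Xmx2048m
```

A heap set at or near the full container limit is a common cause of the OOMKills described above, because the kernel accounts page cache and JVM overhead against the same limit.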
Duplicate messages
Cause: by default, Kafka operates in at-least-once delivery mode. In case of producer retries or consumer rebalancing, messages may be delivered multiple times.
Solution:
- Producer side: enable idempotence to avoid duplicates during retries:
  ```properties
  enable.idempotence=true
  acks=all
  ```
- Consumer side: implement a deduplication mechanism based on a unique message identifier (key, UUID, etc.).
- For critical use cases, combine `acks=all` and `enable.idempotence=true` on the producer with idempotent processing on the consumer side.
Producer idempotence ensures that a message sent multiple times (due to network retries) is written only once to the partition. Idempotent processing on the consumer side remains necessary to cover rebalancing scenarios.
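The consumer-side deduplication mentioned above can be as simple as tracking already-processed message IDs. A minimal in-memory sketch with bounded memory (a real deployment would persist the seen set, or key it per partition, to survive restarts):

```python
from collections import OrderedDict

class Deduplicator:
    """Drop messages whose unique ID (key, UUID, ...) was already processed.
    Memory is bounded: the oldest IDs are evicted first."""

    def __init__(self, max_ids: int = 10_000):
        self.max_ids = max_ids
        self.seen: "OrderedDict[str, None]" = OrderedDict()

    def accept(self, msg_id: str) -> bool:
        """Return True the first time an ID is seen, False for duplicates."""
        if msg_id in self.seen:
            return False
        self.seen[msg_id] = None
        if len(self.seen) > self.max_ids:
            self.seen.popitem(last=False)  # evict the oldest ID
        return True

dedup = Deduplicator()
print(dedup.accept("evt-42"))  # True: first delivery, process it
print(dedup.accept("evt-42"))  # False: redelivery after a rebalance, skip it
```

This is what "idempotent processing on the consumer side" amounts to in practice: the handler's effect is applied at most once per message ID, even when the broker delivers the message more than once.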