Troubleshooting — Kafka
ZooKeeper loses quorum
Cause: the number of ZooKeeper replicas is insufficient or even, preventing the formation of a majority quorum. A quorum requires a strict majority of replicas (for example, 2 of 3 nodes).
Solution:
- Check the configured ZooKeeper replica count:
  ```shell
  kubectl get kafka -o yaml | grep -A 5 zookeeper
  ```
- Ensure `zookeeper.replicas` is an odd number (3, 5, or 7).
- Check ZooKeeper pod status:
  ```shell
  kubectl get pods -l app.kubernetes.io/component=zookeeper
  ```
- Check available disk space on ZooKeeper volumes (a full disk causes quorum loss):
  ```shell
  kubectl exec <zookeeper-pod> -- df -h /data
  ```
- If needed, increase `zookeeper.size` in your manifest and reapply it.
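The odd-replica-count rule follows directly from the quorum arithmetic. The sketch below (illustrative, not part of any chart) shows why an even ensemble size adds cost without adding fault tolerance:

```python
def quorum_size(replicas: int) -> int:
    """A ZooKeeper ensemble needs a strict majority of replicas to form a quorum."""
    return replicas // 2 + 1

def tolerated_failures(replicas: int) -> int:
    """Number of replicas that can fail while a majority can still be formed."""
    return replicas - quorum_size(replicas)

for n in (3, 4, 5):
    print(f"{n} replicas: quorum={quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 3 replicas: quorum=2, tolerates 1 failure(s)
# 4 replicas: quorum=3, tolerates 1 failure(s)  <- no better than 3
# 5 replicas: quorum=3, tolerates 2 failure(s)
```

A 4-node ensemble tolerates exactly as many failures as a 3-node one, which is why the recommended values are 3, 5, or 7.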
Topic inaccessible or broker unavailable
Cause: one or more Kafka brokers are not functioning properly, or the topic does not have enough synchronized replicas relative to min.insync.replicas.
Solution:
- Check Kafka pod status:
  ```shell
  kubectl get pods -l app.kubernetes.io/component=kafka
  ```
- Inspect events on a failing pod:
  ```shell
  kubectl describe pod <kafka-pod>
  ```
- Verify that the topic's replica count is consistent with the number of available brokers:
  ```shell
  kubectl exec <kafka-pod> -- kafka-topics.sh --describe --topic <topic-name> --bootstrap-server localhost:9092
  ```
- Check storage space (a full volume prevents the broker from operating):
  ```shell
  kubectl exec <kafka-pod> -- df -h /bitnami/kafka
  ```
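The `min.insync.replicas` interaction can be reduced to a single comparison: with `acks=all`, a partition accepts writes only while its in-sync replica set is at least that large. A minimal sketch of that check (the replica counts are illustrative):

```python
def partition_writable(isr_count: int, min_insync_replicas: int) -> bool:
    """With acks=all, a produce request is rejected (NotEnoughReplicas)
    when the partition's in-sync replica set is smaller than
    min.insync.replicas."""
    return isr_count >= min_insync_replicas

# Topic with replication factor 3 and min.insync.replicas=2:
print(partition_writable(3, 2))  # True: all replicas in sync
print(partition_writable(2, 2))  # True: one broker down, still writable
print(partition_writable(1, 2))  # False: two brokers down, writes fail
```

This is why a topic can become write-unavailable even though one broker is still serving it.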
Significant consumer lag
Cause: consumers are not processing messages fast enough compared to the production rate. This can be due to insufficient partitions, too few consumers in the group, or under-provisioned consumers.
Solution:
- Identify consumer group lag:
  ```shell
  kubectl exec <kafka-pod> -- kafka-consumer-groups.sh --describe --group <group-id> --bootstrap-server localhost:9092
  ```
- If lag is spread across many partitions, increase the number of consumers in the group (without exceeding the number of partitions).
- If all partitions have lag, consider increasing the number of partitions for the topic:
  ```yaml
  # kafka.yaml
  topics:
    - name: events
      partitions: 12
      replicas: 3
  ```
- Verify that consumers have sufficient resources (CPU, memory) to process messages.
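The LAG column reported by `kafka-consumer-groups.sh` is simply the partition's log end offset minus the group's committed offset. A sketch of that arithmetic with hypothetical offsets:

```python
def consumer_lag(end_offsets, committed):
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}

end = {0: 1500, 1: 1480, 2: 900}    # hypothetical log end offsets
done = {0: 1500, 1: 1200, 2: 100}   # hypothetical committed offsets

lag = consumer_lag(end, done)
print(lag)                # {0: 0, 1: 280, 2: 800}
print(sum(lag.values()))  # total lag across the group: 1080
```

A pattern like partition 2 above (one partition far behind the others) often points to a hot key or a stuck consumer rather than overall under-provisioning.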
Broker OOMKilled
Cause: the Kafka broker consumes more memory than the allocated limit. This frequently occurs with the nano or micro preset under load.
Solution:
- Check pod events to confirm the OOMKill:
  ```shell
  kubectl describe pod <kafka-pod> | grep -A 5 "Last State"
  ```
- Increase broker memory resources using a higher preset or explicit resources:
  ```yaml
  # kafka.yaml
  kafka:
    replicas: 3
    resources:
      cpu: 2000m
      memory: 4Gi
    size: 20Gi
  ```
- Reapply the manifest:
  ```shell
  kubectl apply -f kafka.yaml
  ```
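When sizing the limit, keep in mind that the broker's JVM heap must leave headroom below the container limit for page cache and off-heap buffers. The sketch below uses a 50% heap fraction, which is a common rule of thumb and an assumption here, not a default of this chart:

```python
UNITS = {"Mi": 1 << 20, "Gi": 1 << 30}

def parse_mem(quantity: str) -> int:
    """Parse a Kubernetes memory quantity like '4Gi' into bytes (Mi/Gi only)."""
    return int(quantity[:-2]) * UNITS[quantity[-2:]]

def suggested_heap(limit: str, fraction: float = 0.5) -> str:
    """Suggest a JVM max heap as a fraction of the container memory limit.
    The 0.5 fraction is a rule-of-thumb assumption; the rest is left for
    page cache and off-heap use."""
    heap_mb = int(parse_mem(limit) * fraction / (1 << 20))
    return f"-Xmx{heap_mb}m"

print(suggested_heap("4Gi"))  # -Xmx2048m
```

A heap set at or near the full container limit is a common cause of the OOMKills described above, because the kernel accounts page cache and JVM overhead against the same limit.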
Duplicate messages
Cause: by default, Kafka operates in at-least-once delivery mode. In case of producer retries or consumer rebalancing, messages may be delivered multiple times.
Solution:
- Producer side: enable idempotence to avoid duplicates during retries:
  ```properties
  enable.idempotence=true
  acks=all
  ```
- Consumer side: implement a deduplication mechanism based on a unique message identifier (key, UUID, etc.).
- For critical use cases, combine `acks=all` and `enable.idempotence=true` on the producer with idempotent processing on the consumer side.
Producer idempotence ensures that a message sent multiple times (due to network retries) is written only once to the partition. Idempotent processing on the consumer side remains necessary to cover rebalancing scenarios.
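The consumer-side deduplication mentioned above can be as simple as tracking already-processed message IDs. A minimal in-memory sketch with bounded memory (a real deployment would persist the seen set, or key it per partition, to survive restarts):

```python
from collections import OrderedDict

class Deduplicator:
    """Drop messages whose unique ID (key, UUID, ...) was already processed.
    Memory is bounded: the oldest IDs are evicted first."""

    def __init__(self, max_ids: int = 10_000):
        self.max_ids = max_ids
        self.seen: "OrderedDict[str, None]" = OrderedDict()

    def accept(self, msg_id: str) -> bool:
        """Return True the first time an ID is seen, False for duplicates."""
        if msg_id in self.seen:
            return False
        self.seen[msg_id] = None
        if len(self.seen) > self.max_ids:
            self.seen.popitem(last=False)  # evict the oldest ID
        return True

dedup = Deduplicator()
print(dedup.accept("evt-42"))  # True: first delivery, process it
print(dedup.accept("evt-42"))  # False: redelivery after a rebalance, skip it
```

This is what "idempotent processing on the consumer side" amounts to in practice: the handler's effect is applied at most once per message ID, even when the broker delivers the message more than once.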