Troubleshooting — ClickHouse
ClickHouse Keeper unstable (even number of replicas)
Cause: the number of ClickHouse Keeper replicas is even (2, 4, etc.), which prevents quorum maintenance. The Raft protocol requires a strict majority to elect a leader, and an even number of nodes does not guarantee this majority in case of a network partition.
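The majority rule can be sketched with a little arithmetic (a hypothetical helper, not part of any tooling): a Raft cluster of n nodes needs floor(n/2) + 1 votes, so 4 replicas tolerate no more node failures than 3 — the extra even node adds cost without adding resilience.

```shell
# Hypothetical helper: how many Keeper node failures a cluster of n replicas survives.
# Raft needs a strict majority, i.e. n/2 + 1 votes (integer division).
fault_tolerance() {
  n=$1
  majority=$(( n / 2 + 1 ))
  echo $(( n - majority ))
}

for n in 2 3 4 5; do
  echo "$n replicas -> tolerates $(fault_tolerance "$n") failure(s)"
done
# 2 replicas -> tolerates 0 failure(s)
# 3 replicas -> tolerates 1 failure(s)
# 4 replicas -> tolerates 1 failure(s)
# 5 replicas -> tolerates 2 failure(s)
```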
Solution:
- Check the current number of Keeper replicas:

  ```bash
  kubectl get pods -l app=clickhouse-keeper-<name>
  ```

- Change the number of replicas to an odd number (3 or 5):

  clickhouse.yaml
  ```yaml
  spec:
    clickhouseKeeper:
      enabled: true
      replicas: 3 # Always odd
  ```

- Apply the change:

  ```bash
  kubectl apply -f clickhouse.yaml
  ```

- Check the Keeper logs to confirm that quorum is restored:

  ```bash
  kubectl logs -l app=clickhouse-keeper-<name>
  ```
Slow queries on large volumes
Cause: the sharding configuration is not optimal, tables are not using the right engines, or allocated resources are insufficient.
Solution:
- Verify that you are using `Distributed` tables to spread queries across all shards.
- Make sure local tables use the `ReplicatedMergeTree` engine with an `ORDER BY` adapted to your most frequent queries.
- Increase the number of shards to distribute the load:

  clickhouse.yaml
  ```yaml
  spec:
    shards: 4 # Increase the number of shards
  ```

- Check allocated resources and increase them if needed:

  ```bash
  kubectl top pod -l app=clickhouse-<name>
  ```

- Analyze slow queries via the system table `query_log`:

  ```sql
  SELECT query, elapsed, read_rows, memory_usage
  FROM system.query_log
  WHERE type = 'QueryFinish'
  ORDER BY elapsed DESC
  LIMIT 10;
  ```
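As an illustration of the first two points, a local replicated table plus a `Distributed` table over it might look like the sketch below. Table and column names are placeholders, and the argument-free `ReplicatedMergeTree` form assumes the cluster defines default replication path macros; adapt the `ORDER BY` to the columns your queries filter on most often.

```sql
-- Local replicated table on each shard; ORDER BY drives the primary index,
-- so put the columns your WHERE clauses filter on first.
CREATE TABLE events_local ON CLUSTER '{cluster}'
(
    event_date Date,
    user_id UInt64,
    payload String
)
ENGINE = ReplicatedMergeTree
ORDER BY (event_date, user_id);

-- Distributed table that fans queries out to every shard.
CREATE TABLE events ON CLUSTER '{cluster}' AS events_local
ENGINE = Distributed('{cluster}', currentDatabase(), events_local, rand());
```

Queries then target `events`, which merges results from the `events_local` replica on each shard.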
Insufficient disk space
Cause: the data volume exceeds the PVC size, or system logs (query_log, query_thread_log) are accumulating too much data.
Solution:
- Increase the data volume size:

  clickhouse.yaml
  ```yaml
  spec:
    size: 50Gi # Increase from the current value
  ```

- Also check the log volume size and adjust it if needed:

  clickhouse.yaml
  ```yaml
  spec:
    logStorageSize: 5Gi # Increase if logs are saturating the volume
  ```

- Reduce system log retention via `logTTL`:

  clickhouse.yaml
  ```yaml
  spec:
    logTTL: 7 # Reduce from 15 to 7 days, for example
  ```

- Review your application data retention policies and drop obsolete partitions.
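To see which tables are actually consuming the space before dropping anything, you can query the `system.parts` table that ClickHouse maintains itself (the `LIMIT` is arbitrary):

```sql
-- Largest tables by on-disk size, active parts only
SELECT
    database,
    table,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC
LIMIT 10;
```

If the system log tables (`query_log`, `query_thread_log`) top this list, lowering `logTTL` is usually the right fix; otherwise look at your application tables' partitions.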
ClickHouse pod stuck in Pending state
Cause: the PersistentVolumeClaim (PVC) cannot bind to a volume, usually because of a non-existent storageClass or exceeded resource quota.
Solution:
- Check the pod status and associated events:

  ```bash
  kubectl describe pod clickhouse-<name>-0-0
  ```

- Check the PVC status:

  ```bash
  kubectl get pvc -l app=clickhouse-<name>
  ```

- Make sure the `storageClass` used is one of the available classes: `local`, `replicated`, or `replicated-async`.
- Check that resource quotas (CPU, memory, storage) have not been reached.
- Fix the configuration in your manifest and reapply it:

  ```bash
  kubectl apply -f clickhouse.yaml
  ```
Cross-shard replication failed
Cause: ClickHouse Keeper is not functional, the network between pods is unstable, or the replicas configuration per shard is incorrect.
Solution:
- Check that ClickHouse Keeper is operational:

  ```bash
  kubectl get pods -l app=clickhouse-keeper-<name>
  ```

- Check the Keeper logs to identify errors:

  ```bash
  kubectl logs -l app=clickhouse-keeper-<name>
  ```

- Verify network connectivity between ClickHouse pods:

  ```bash
  kubectl exec clickhouse-<name>-0-0 -- clickhouse-client --query "SELECT * FROM system.clusters"
  ```

- Make sure the replicas configuration is consistent:

  clickhouse.yaml
  ```yaml
  spec:
    shards: 2
    replicas: 3 # Each shard must have the same number of replicas
    clickhouseKeeper:
      enabled: true
      replicas: 3
  ```

- If Keeper is unstable, restart the Keeper pods and wait for quorum to stabilize.
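To pinpoint which replicas are affected, the `system.replicas` table exposes per-replica health; a query along these lines (the 60-second delay threshold is an arbitrary example) lists replicas that are read-only or lagging:

```sql
-- Replicas that are read-only or lagging; queue_size shows pending replication tasks.
SELECT database, table, is_readonly, absolute_delay, queue_size
FROM system.replicas
WHERE is_readonly OR absolute_delay > 60;
```

A replica stuck in read-only mode typically points back at Keeper connectivity, while a growing `queue_size` with healthy Keeper suggests network or resource pressure between the pods.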