Troubleshooting — ClickHouse
ClickHouse Keeper unstable (even number of replicas)
Cause: the number of ClickHouse Keeper replicas is even (2, 4, etc.), which prevents quorum maintenance. The Raft protocol requires a strict majority to elect a leader, and an even number of nodes does not guarantee this majority in case of a network partition.
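The majority rule can be sketched with a little arithmetic (a hypothetical helper, not part of any tooling): a Raft cluster of n nodes needs floor(n/2) + 1 votes, so 4 replicas tolerate no more node failures than 3 — the extra even node adds cost without adding resilience.

```shell
# Hypothetical helper: how many Keeper node failures a cluster of n replicas survives.
# Raft needs a strict majority, i.e. n/2 + 1 votes (integer division).
fault_tolerance() {
  n=$1
  majority=$(( n / 2 + 1 ))
  echo $(( n - majority ))
}

for n in 2 3 4 5; do
  echo "$n replicas -> tolerates $(fault_tolerance "$n") failure(s)"
done
# 2 replicas -> tolerates 0 failure(s)
# 3 replicas -> tolerates 1 failure(s)
# 4 replicas -> tolerates 1 failure(s)
# 5 replicas -> tolerates 2 failure(s)
```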
Solution:
- Check the current number of Keeper replicas:

  ```bash
  kubectl get pods -l app=clickhouse-keeper-<name>
  ```

- Change the number of replicas to an odd number (3 or 5):

  clickhouse.yaml
  ```yaml
  spec:
    clickhouseKeeper:
      enabled: true
      replicas: 3 # Always odd
  ```

- Apply the change:

  ```bash
  kubectl apply -f clickhouse.yaml
  ```

- Check the Keeper logs to confirm that quorum is restored:

  ```bash
  kubectl logs -l app=clickhouse-keeper-<name>
  ```
Slow queries on large volumes
Cause: the sharding configuration is not optimal, tables are not using the right engines, or allocated resources are insufficient.
Solution:
- Verify that you are using `Distributed` tables to spread queries across all shards.
- Make sure local tables use the `ReplicatedMergeTree` engine with an `ORDER BY` adapted to your most frequent queries.
- Increase the number of shards to distribute the load:

  clickhouse.yaml
  ```yaml
  spec:
    shards: 4 # Increase the number of shards
  ```

- Check allocated resources and increase them if needed:

  ```bash
  kubectl top pod -l app=clickhouse-<name>
  ```

- Analyze slow queries via the system table `query_log`:

  ```sql
  SELECT query, elapsed, read_rows, memory_usage
  FROM system.query_log
  WHERE type = 'QueryFinish'
  ORDER BY elapsed DESC
  LIMIT 10;
  ```
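As an illustration of the first two points, a local replicated table plus a `Distributed` table over it might look like the sketch below. Table and column names are placeholders, and the argument-free `ReplicatedMergeTree` form assumes the cluster defines default replication path macros; adapt the `ORDER BY` to the columns your queries filter on most often.

```sql
-- Local replicated table on each shard; ORDER BY drives the primary index,
-- so put the columns your WHERE clauses filter on first.
CREATE TABLE events_local ON CLUSTER '{cluster}'
(
    event_date Date,
    user_id UInt64,
    payload String
)
ENGINE = ReplicatedMergeTree
ORDER BY (event_date, user_id);

-- Distributed table that fans queries out to every shard.
CREATE TABLE events ON CLUSTER '{cluster}' AS events_local
ENGINE = Distributed('{cluster}', currentDatabase(), events_local, rand());
```

Queries then target `events`, which merges results from the `events_local` replica on each shard.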
Insufficient disk space
Cause: the data volume exceeds the PVC size, or system logs (query_log, query_thread_log) are accumulating too much data.
Solution:
- Increase the data volume size:

  clickhouse.yaml
  ```yaml
  spec:
    size: 50Gi # Increase from the current value
  ```

- Also check the log volume size and adjust it if needed:

  clickhouse.yaml
  ```yaml
  spec:
    logStorageSize: 5Gi # Increase if logs are saturating the volume
  ```

- Reduce system log retention via `logTTL`:

  clickhouse.yaml
  ```yaml
  spec:
    logTTL: 7 # Reduce from 15 to 7 days, for example
  ```

- Review your application data retention policies and drop obsolete partitions.
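To see which tables are actually consuming the space before dropping anything, you can query the `system.parts` table that ClickHouse maintains itself (the `LIMIT` is arbitrary):

```sql
-- Largest tables by on-disk size, active parts only
SELECT
    database,
    table,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC
LIMIT 10;
```

If the system log tables (`query_log`, `query_thread_log`) top this list, lowering `logTTL` is usually the right fix; otherwise look at your application tables' partitions.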
ClickHouse pod stuck in Pending state
Cause: the PersistentVolumeClaim (PVC) cannot bind to a volume, usually because of a non-existent storageClass or exceeded resource quota.
Solution:
- Check the pod status and associated events:

  ```bash
  kubectl describe pod clickhouse-<name>-0-0
  ```

- Check the PVC status:

  ```bash
  kubectl get pvc -l app=clickhouse-<name>
  ```

- Make sure the `storageClass` used is one of the available classes: `local`, `replicated`, or `replicated-async`.
- Check that resource quotas (CPU, memory, storage) have not been reached.
- Fix the configuration in your manifest and reapply it:

  ```bash
  kubectl apply -f clickhouse.yaml
  ```
Cross-shard replication failed
Cause: ClickHouse Keeper is not functional, the network between pods is unstable, or the replicas configuration per shard is incorrect.
Solution:
- Check that ClickHouse Keeper is operational:

  ```bash
  kubectl get pods -l app=clickhouse-keeper-<name>
  ```

- Check the Keeper logs to identify errors:

  ```bash
  kubectl logs -l app=clickhouse-keeper-<name>
  ```

- Verify network connectivity between ClickHouse pods:

  ```bash
  kubectl exec clickhouse-<name>-0-0 -- clickhouse-client --query "SELECT * FROM system.clusters"
  ```

- Make sure the replicas configuration is consistent:

  clickhouse.yaml
  ```yaml
  spec:
    shards: 2
    replicas: 3 # Each shard must have the same number of replicas
    clickhouseKeeper:
      enabled: true
      replicas: 3
  ```

- If Keeper is unstable, restart the Keeper pods and wait for quorum to stabilize.
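To pinpoint which replicas are affected, the `system.replicas` table exposes per-replica health; a query along these lines (the 60-second delay threshold is an arbitrary example) lists replicas that are read-only or lagging:

```sql
-- Replicas that are read-only or lagging; queue_size shows pending replication tasks.
SELECT database, table, is_readonly, absolute_delay, queue_size
FROM system.replicas
WHERE is_readonly OR absolute_delay > 60;
```

A replica stuck in read-only mode typically points back at Keeper connectivity, while a growing `queue_size` with healthy Keeper suggests network or resource pressure between the pods.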