This runbook describes how to freshly create the Zero StatefulSet to avoid the IDX issue. The only state Zero keeps is the `zw` directory, so once that is removed, Zero should come up cleanly.
To work properly with the Alphas, Zero must know the maximum transaction timestamp, max UID, and max namespace ID already used in the Alpha data. So before freshly creating the Zeros, we need to retrieve those values and re-apply them once the Zero pods come back up cleanly.
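For orientation (a sketch, not a procedure step), the three values are exposed by Zero's `/state` endpoint; this assumes they are top-level keys, as in the grep example later in this runbook:

```bash
# Run against any healthy Zero; 6080 is the default Zero HTTP port.
curl -s localhost:6080/state | jq '{maxUID, maxTxnTs, maxNsID}'
# Example shape of the output (values will differ):
# {
#   "maxUID": "669546334271",
#   "maxTxnTs": "41060000",
#   "maxNsID": "41061000"
# }
```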
**NOTE**: This runbook will incur **downtime**.
This procedure is needed when one or more Zero pods cannot join the RAFT group due to idx errors.
Symptoms:
- One or more Zeros not working (crashing, throwing errors, cannot restart, etc.)
- A Zero was removed using `/removeNode` and it is crashing with the error `Error while joining cluster: rpc error: code = Unknown desc = REUSE_RAFTID: Reusing removed id` (see the log check sketch after this list)
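A hedged way to confirm the second symptom from the pod logs (the pod name is a placeholder, and `$NS` is the namespace exported in Step 1 below):

```bash
# Check a crashing Zero's previous container logs for the RAFT id reuse error.
# zerostatefulset-1 is a placeholder; use the pod that is crash-looping.
kubectl logs zerostatefulset-1 -n $NS --previous | grep -i "REUSE_RAFTID" || true
```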
At a high level, the procedure is:
- First, get the leader Zero and retrieve the "max" values.
- Scale down the Alpha and Zero STS, in that order.
- Delete the Zero PVCs.
- Scale the Zero STS to 3 replicas and apply the max values.
- Scale the Alpha STS first to 1 replica and then to 3 replicas.
1. Set the namespace environment variable:
export NS=<namespace>
# example: export NS=cluster-0x12345
2. Find the leader Zero and leader Alpha:
# Exec into any Zero pod
kubectl exec -it zerostatefulset-0 -n $NS -- bash
curl -s localhost:6080/state | jq '{alphas: .groups."1".members[]|select(.leader==true), zeros: .zeros[]|select(.leader==true)}'
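If you prefer not to open an interactive shell, the same lookup can be done in one go; this sketch assumes `jq` is installed on your workstation, since the output is piped there:

```bash
# Same leader lookup without an interactive shell; jq runs locally.
kubectl exec zerostatefulset-0 -n $NS -- curl -s localhost:6080/state \
  | jq '{alphas: .groups."1".members[]|select(.leader==true), zeros: .zeros[]|select(.leader==true)}'
```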
3. Get the max values (maxUID, maxTxnTs, and maxNsID) from the leader Zero and save them to be used later.
Assuming `zerostatefulset-ldr` is the Zero leader from Step 2:
# Exec into the leader Zero
kubectl exec -it zerostatefulset-ldr -n $NS -- bash
curl -s localhost:6080/state | jq | grep '"max'
# Should return max values like the example below:
# "maxUID": "669546334271", "maxTxnTs": "41060000", "maxNsID": "41061000"
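To avoid copy-paste mistakes, the values can also be captured into shell variables on your workstation (a sketch; assumes local `jq` and the leader pod name from Step 2):

```bash
# Capture the three values into shell variables for later use.
STATE=$(kubectl exec zerostatefulset-ldr -n $NS -- curl -s localhost:6080/state)
MAX_UID=$(echo "$STATE" | jq -r '.maxUID')
MAX_TS=$(echo "$STATE" | jq -r '.maxTxnTs')
MAX_NSID=$(echo "$STATE" | jq -r '.maxNsID')
echo "maxUID=$MAX_UID maxTxnTs=$MAX_TS maxNsID=$MAX_NSID"
```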
- Scale down the Alpha STS and then the Zero STS, in that order:
kubectl scale --replicas=0 sts alphastatefulset -n $NS
kubectl scale --replicas=0 sts zerostatefulset -n $NS
- Delete the Zero PVCs:
kubectl delete pvc -l app=dgraph-zero -n $NS
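Before proceeding, it is worth confirming that the pods have terminated and the Zero PVCs are gone (a quick check; the `app=dgraph-zero` label matches the delete command above):

```bash
# Nothing from the two StatefulSets should still be running,
# and no Zero PVCs should remain before proceeding.
kubectl get pods -n $NS
kubectl get pvc -n $NS -l app=dgraph-zero
```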
- Scale the Zero STS back to 3 replicas:
kubectl scale --replicas=3 sts zerostatefulset -n $NS
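Wait until all three Zero pods are ready before applying the max values, for example:

```bash
# Blocks until the Zero StatefulSet reports all 3 replicas ready.
kubectl rollout status sts/zerostatefulset -n $NS
kubectl get pods -n $NS
```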
- Find the leader Zero in the cluster by following Step 2.
- Execute the assign endpoints on the leader Zero with the values from Step 3.
kubectl exec -it zerostatefulset-ldr -n $NS -- bash
# Run the following assign calls, replacing xxxx with at least the corresponding values captured at Step 3 (uids -> maxUID, timestamps -> maxTxnTs, nsids -> maxNsID)
curl "localhost:6080/assign?what=uids&num=xxxx"
curl "localhost:6080/assign?what=timestamps&num=xxxx"
curl "localhost:6080/assign?what=nsids&num=xxxx"
- Verify that the max values are reflected on the Zero leader.
# Exec into the leader Zero
kubectl exec -it zerostatefulset-ldr -n $NS -- bash
curl -s localhost:6080/state | jq | grep '"max'
Verify that the values returned are higher than the values retrieved at Step 3; they are typically incremented by 10,000.
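A sketch of the same check from your workstation, reusing the variables captured in the Step 3 sketch above:

```bash
# Compare the current maximums against the values saved before the rebuild.
NEW_STATE=$(kubectl exec zerostatefulset-ldr -n $NS -- curl -s localhost:6080/state)
echo "maxUID:   saved=$MAX_UID   now=$(echo "$NEW_STATE" | jq -r '.maxUID')"
echo "maxTxnTs: saved=$MAX_TS    now=$(echo "$NEW_STATE" | jq -r '.maxTxnTs')"
echo "maxNsID:  saved=$MAX_NSID  now=$(echo "$NEW_STATE" | jq -r '.maxNsID')"
```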
- Scale the Alpha STS to 1 replica:
kubectl scale --replicas=1 sts alphastatefulset -n $NS
Wait for the schema to lazy-load and for the Alpha to create a snapshot:
kubectl logs alphastatefulset-0 -n $NS -f | grep "CreateSnapshot"
Now scale the Alpha STS to 3 replicas:
kubectl scale --replicas=3 sts alphastatefulset -n $NS
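A sketch of waiting for all three Alphas to become ready:

```bash
# Blocks until the Alpha StatefulSet reports all 3 replicas ready.
kubectl rollout status sts/alphastatefulset -n $NS
kubectl get pods -n $NS
```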
Note:
When a follower Zero fails to come up, it can also be repaired using the method below, but it is better to rebuild Zero (this runbook) than to change the startup script.
- A follower going bad requires an idx value change.
- Remove the node for the bad Zero (`/removeNode`).
- Delete the PVC of the bad Zero.
- Change the Zero STS to assign a new idx value (if the Zero with RAFT id 2 was removed, assign it idx 4, because a removed RAFT id cannot be reused), for example:
idx=$(($ordinal + 1))
if [ "$idx" -eq 2 ]; then
  idx=4
fi
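A slightly fuller, hypothetical sketch of how this might look in the Zero StatefulSet's startup script; the ordinal extraction and the flag that consumes `$idx` are assumptions to adapt to your manifest:

```bash
# Hypothetical sketch of where the idx computation sits in the Zero STS
# startup script; the surrounding dgraph zero command comes from your manifest.
[[ $(hostname) =~ -([0-9]+)$ ]] || exit 1   # StatefulSet pod names end in their ordinal
ordinal=${BASH_REMATCH[1]}
idx=$(($ordinal + 1))
# A removed RAFT id can never be reused: if the Zero with RAFT id 2 was
# removed, hand its replacement a fresh id (4 here) instead.
if [ "$idx" -eq 2 ]; then
  idx=4
fi
echo "starting zero with raft idx=$idx"
# ...then pass $idx to `dgraph zero` via the raft-index flag your Dgraph
# version expects (flag name is an assumption; check your chart/manifest).
```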