Failover and Node Maintenance Guide¶
This page shows you how to keep your services up and running without disruption while performing node maintenance.
Some cloud providers supply an automated upgrade workflow, in which case you don't need to take any action.
We separate this into two scenarios:
- Failover when your load balancer moves your IP automatically
  - This applies to all cloud providers, as well as when using Cilium or another load balancer provider with automated failover capabilities.
- Failover when using the default setup with ServiceLB
  - For small to medium deployments we recommend ServiceLB: it is simple and easy to run, but it requires manual updates of DNS records during maintenance.
Failover - ServiceLB¶
When using ServiceLB, update your DNS records to point to another node.
If you use a single A record, point it at another control plane node, complete the upgrade of the node it previously pointed to, then move it back. If you use DNS round-robin, remove only the A record for the node under maintenance.
Once your DNS no longer resolves to that node, you can proceed with the rest of this guide.
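To double-check what a record currently resolves to, a quick lookup helps; juno.example.com below is a hypothetical stand-in for your actual service hostname:

```bash
# List the A records currently returned for the service hostname
dig +short A juno.example.com
```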
Maintenance/upgrade¶
Juno is designed to enable you to take down a node with minimal impact.
Maintenance/upgrades - workstations¶
Juno is designed to let you take a node down with minimal impact on our services; workstation sessions, however, need to be recreated when upgrading.
At large facilities, you can upgrade nodes in batches to minimize user disruption.
First, decide which nodes to upgrade.
Then, stop scheduling on them.
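On a standard Kubernetes setup, cordoning accomplishes this; the command below is a minimal example:

```bash
# Mark the node unschedulable; pods already running on it are not affected
kubectl cordon <nodeX>
```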
Replace <nodeX> with the name of your node, e.g. workstation01. The name is the short-form hostname and can be validated via kubectl get nodes.
This will ensure no further pods are scheduled on this machine until you explicitly allow it. It will not disrupt existing workloads.
To check what workloads are running on those nodes, run:
# -o wide includes the node name for each pod
kubectl get pods -A -o wide | grep -E -i "<node1>|<node2>|<node3>"
As workstation sessions are powered down and recreated, their users will gradually move to other nodes where scheduling is still enabled.
Once none of the workloads you depend on remain, drain the node so that any leftover pods are rescheduled elsewhere.
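On a standard Kubernetes cluster this is done with kubectl drain; the flags below are a common choice, and <nodeX> is again a placeholder:

```bash
# Evict remaining pods; DaemonSet pods are skipped, and pods using
# emptyDir volumes are allowed to be deleted along with their data
kubectl drain <nodeX> --ignore-daemonsets --delete-emptydir-data
```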
To confirm that nothing is left running on the node, check its pods once more.
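One way to do this, reusing the node name placeholder from above:

```bash
# Lists only pods scheduled on the given node; aside from DaemonSet pods,
# this should come back empty after a successful drain
kubectl get pods -A -o wide --field-selector spec.nodeName=<nodeX>
```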
That's all you need with a default setup.
When using a more complex load balancer scheme, such as Cilium, follow the upstream documentation to avoid disruption. For Cilium in BGP mode, see the upstream maintenance guide.
For Cilium's L2 mode, a drain is sufficient. You only need to shut the node down gracefully, e.g. using the shutdown now command, as opposed to a hard power-off.
Maintenance/upgrades - service nodes¶
When using the default ServiceLB, ensure you have already switched DNS records to exclude the node you are upgrading.
For service nodes, you might see a brief (under 3 seconds) disruption to the API endpoints, but active users will not notice: their workstations keep working while you upgrade.
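If you want to observe that window yourself, a simple poll against a health endpoint works; the hostname and path below are hypothetical, so substitute your own:

```bash
# Print an HTTP status code once a second while the upgrade runs
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" https://juno.example.com/healthz
  sleep 1
done
```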
When you are ready to drain a service/control plane node, proceed as with workstation nodes.
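Assuming the same kubectl-based workflow as above, with <nodeX> standing in for the node's name:

```bash
# Same drain invocation as for workstation nodes
kubectl drain <nodeX> --ignore-daemonsets --delete-emptydir-data
```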
Maintenance/upgrades - bringing the node back into service¶
Once you reboot, the node will not immediately start serving workloads. This is meant to give you time to verify any potentially disruptive OS-level changes.
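You can see this state at a glance; assuming the cordon from earlier in this guide is still in place, the node reports itself as unschedulable:

```bash
# Expect STATUS to show Ready,SchedulingDisabled until you uncordon the node
kubectl get nodes <nodeX>
```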
Once you are done and ready to bring the node back into service, re-enable scheduling on it.
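On a standard Kubernetes setup this is done by uncordoning the node; <nodeX> is again a placeholder:

```bash
# Allow new pods to be scheduled on the node again
kubectl uncordon <nodeX>
```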