Control When to Replace an Unhealthy Node

Cloud service provider relevance: AKS

Ocean lets you control whether and when unhealthy nodes in an AKS cluster are replaced automatically. This capability helps you balance fast recovery from unhealthy nodes against replacements that happen too quickly or without sufficient cause.

When enabled, Ocean continuously monitors node health states, such as NotReady or Unschedulable. Ocean can automatically replace nodes that remain unhealthy beyond a configurable duration threshold.

Automatic replacement is disabled by default. To enable it, set the parameters under cluster.health at the cluster level through the Spot API, using the Create Cluster or Update Cluster endpoint.

The following parameters are available under cluster.health:

cluster.health.shouldReplaceUnhealthyInstances: A boolean flag that enables or disables automatic replacement of unhealthy nodes.
- true – Ocean automatically replaces nodes that remain unhealthy.
- false – Ocean detects unhealthy nodes but does not replace them.

cluster.health.gracePeriod: Defines a grace period (in seconds) during which health checks are suppressed for newly observed nodes. This prevents replacements while nodes are still initializing.
- Minimum value: 60 seconds.
- Default value: 120 seconds.

cluster.health.healthCheckUnhealthyDurationBeforeReplacement: Defines how long (in seconds) a node can remain unhealthy before Ocean replaces it. Once this threshold is exceeded, Ocean triggers an automatic node replacement.
- Minimum value: 120 seconds.
- Default value: 180 seconds.
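
The parameters above can be assembled into a request body before you call the Create Cluster or Update Cluster endpoint. The following Python sketch is illustrative only: the helper name build_health_payload is hypothetical, and the nested JSON shape (a "cluster" object containing a "health" object) is an assumption inferred from the dotted parameter paths, so check it against the Spot API reference. The sketch only builds and validates the body against the documented minimums. It does not send a request.

```python
import json

# Minimums and defaults taken from the parameter descriptions above (seconds).
GRACE_PERIOD_MIN = 60
GRACE_PERIOD_DEFAULT = 120
UNHEALTHY_DURATION_MIN = 120
UNHEALTHY_DURATION_DEFAULT = 180


def build_health_payload(
    should_replace: bool,
    grace_period: int = GRACE_PERIOD_DEFAULT,
    unhealthy_duration: int = UNHEALTHY_DURATION_DEFAULT,
) -> dict:
    """Build a request body fragment holding the cluster.health settings.

    Hypothetical helper: the nesting below is assumed from the dotted
    parameter names and is not confirmed by the API reference.
    """
    if grace_period < GRACE_PERIOD_MIN:
        raise ValueError(
            f"gracePeriod must be at least {GRACE_PERIOD_MIN} seconds"
        )
    if unhealthy_duration < UNHEALTHY_DURATION_MIN:
        raise ValueError(
            "healthCheckUnhealthyDurationBeforeReplacement must be at least "
            f"{UNHEALTHY_DURATION_MIN} seconds"
        )
    return {
        "cluster": {
            "health": {
                "shouldReplaceUnhealthyInstances": should_replace,
                "gracePeriod": grace_period,
                "healthCheckUnhealthyDurationBeforeReplacement": unhealthy_duration,
            }
        }
    }


if __name__ == "__main__":
    # Enable replacement, keep the default grace period, and wait
    # five minutes before an unhealthy node is replaced.
    body = build_health_payload(True, unhealthy_duration=300)
    print(json.dumps(body, indent=2))
```

Validating the minimums on the client side is a design choice of this sketch. It surfaces an out-of-range value before any request is made, but the API remains the authority on which values are accepted.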