Automating Kubernetes Self-Healing with Mirantis K8s and Chaos Engineering Pipelines

Production outages rarely announce themselves politely. Nodes disappear, networks flap, storage stalls, and suddenly your SRE team is firefighting at 2 a.m. The organizations that scale reliably are not the ones that avoid failure but the ones that automate recovery.
This article describes how to combine Mirantis Kubernetes platforms with GitOps and chaos engineering to build a self-healing system: one that detects failures, recovers from them automatically, and keeps improving its resilience without needing a “hero.”
Why self-healing needs GitOps and chaos engineering
Kubernetes can restart pods and reschedule workloads, but real-world failures are messier: node health changes, CNI plugin failures, degraded control plane components, and odd application behavior caused by partial failures. True self-healing requires a combination of three things.
The first is chaos engineering: intentionally injecting failures to validate your assumptions about how the system behaves when something breaks. The second is GitOps, which keeps remediation logic versioned, auditable, and repeatable. The third is a platform that automates those actions at the cluster and infrastructure levels, not just within a namespace.
Mirantis brings these pieces together by integrating Mirantis Kubernetes Engine, Mirantis Container Cloud, and its automated lifecycle management tools with GitOps-based operational practices for large fleets of clusters.
GitOps-driven chaos experiments with ArgoCD and LitmusChaos
A practical pattern is to store chaos experiments and remediation policies directly in Git and deploy them using ArgoCD or Flux. When chaos is triggered, the system observes signals, validates recovery, and automatically reverts or patches.
Example ArgoCD application for LitmusChaos
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-experiments
spec:
  project: default
  source:
    repoURL: https://github.com/org/chaos-gitops
    targetRevision: main
    path: litmus/node-failure
  destination:
    server: https://kubernetes.default.svc
    namespace: litmus
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
This ArgoCD application points directly at your chaos test repository, so any change to a chaos test in Git, including a rollback, is reconciled automatically.
LitmusChaos node failure experiment
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: node-drain-chaos
spec:
  appinfo:
    appns: default
    applabel: app=payments
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: node-drain
      spec:
        components:
          # Experiment tunables are passed as environment variables
          env:
            - name: NODE_LABEL
              value: "node-role.kubernetes.io/worker"
This experiment validates that workloads reschedule correctly when nodes are drained or become unavailable.
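What "reschedule correctly" means can be made explicit as a pass/fail check. The sketch below is a minimal illustration in Python, not part of LitmusChaos: the `PodStatus` type, the `rescheduled_ok` function, and the node and pod names are all hypothetical, and in practice the pod list would come from the Kubernetes API.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class PodStatus:
    """Hypothetical snapshot of one pod after the drain (not a Kubernetes type)."""
    name: str
    node: str
    ready: bool


def rescheduled_ok(pods: Iterable[PodStatus],
                   drained_nodes: Set[str],
                   desired_replicas: int) -> bool:
    """Return True when enough ready pods are running outside the drained nodes."""
    ready_elsewhere = [p for p in pods if p.ready and p.node not in drained_nodes]
    return len(ready_elsewhere) >= desired_replicas


if __name__ == "__main__":
    observed = [
        PodStatus("payments-a", "worker-2", True),
        PodStatus("payments-b", "worker-3", True),
    ]
    print(rescheduled_ok(observed, {"worker-1"}, desired_replicas=2))
```

A check like this only counts ready pods, so a pod that was rescheduled but is still failing its readiness probe does not count as recovered.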
Validating remediation and automatic revert
Injecting chaos alone does not prove that your remediation procedures work. To validate them, pair chaos experiments with health checks and rollback logic.
In Flux, you can accomplish the same goals through health checks and custom overlays created with Kustomize.
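apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: payments-resilience
spec:
  interval: 5m
  # Assumed value: how long Flux waits for the health checks to pass
  timeout: 3m
  path: ./resilience
  prune: true
  # sourceRef is required by Flux; the GitRepository name is a placeholder
  sourceRef:
    kind: GitRepository
    name: chaos-gitops
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: payments
      namespace: default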
If the deployment does not report a healthy status within the configured timeout after a chaos event, Flux marks the reconciliation as failed. That failure signal can then drive remediation, such as reverting to a stable version of the deployment or applying a patch that increases replicas or redirects traffic.
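The idea of a health check bounded by a timeout can be sketched in a few lines of Python. This is an illustration of the pattern only, not how Flux implements it: `wait_until_healthy` and its parameters are hypothetical names, and the `is_healthy` callback stands in for whatever probe you use (for example, a query for the deployment's ready replicas).

```python
import time
from typing import Callable


def wait_until_healthy(is_healthy: Callable[[], bool],
                       timeout_s: float,
                       interval_s: float = 1.0,
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll is_healthy until it returns True or timeout_s elapses.

    Returns True if the target became healthy in time, False otherwise.
    The probe always runs at least once, even with a zero timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if is_healthy():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated probe: unhealthy twice, then healthy.
    results = iter([False, False, True])
    print(wait_until_healthy(lambda: next(results), timeout_s=5, interval_s=0.1))
```

A `False` result is the point at which a pipeline would raise an alert or trigger a revert, so the timeout should be longer than the slowest recovery you consider acceptable.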
How Mirantis simplifies multi-cluster self-healing
Operating self-healing clusters at scale means managing many clusters (sometimes hundreds) consistently across multiple environments. With Mirantis Container Cloud, organizations can centrally manage the lifecycle of multi-cluster environments across bare-metal servers, VMware infrastructure, public clouds, and edge deployments.
The key capabilities behind self-healing clusters are:
- Monitoring the health of your clusters
- Configuring clusters declaratively
- Running automated workflows to recover from node failures
For example, if a node reports repeated kubelet failures or disk-pressure events, the platform can automatically cordon the node so that no new pods are scheduled on it, drain it (evicting its pods so they are rescheduled elsewhere), reprovision it, and rejoin it to the cluster without human intervention. Because the same self-healing logic runs in every cluster, remediation is applied uniformly and starts as soon as the failure is detected.
In customer environments, Mirantis platforms have identified recurring patterns of node instability (for example, repeated “network timeouts” or “kernel panic” events) and triggered automatic remediation, including automated node replacement and CNI reconfiguration. As a result, organizations using Mirantis have seen significant reductions in mean time to repair (MTTR) and fewer issues escalated to their on-call engineering teams.
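The node remediation flow described above (detect repeated failures, then cordon, drain, reprovision, and rejoin) can be sketched as a small decision function. This Python example is an assumption-laden illustration, not Mirantis's implementation: the event reason strings, the threshold of three, and the function name are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List

# Ordered steps taken once a node is judged unhealthy (from the text above).
REMEDIATION_STEPS = ["cordon", "drain", "reprovision", "rejoin"]

# Hypothetical event reasons that count toward remediation.
TRIGGER_REASONS = {"KubeletFailure", "DiskPressure"}


def plan_node_remediation(event_reasons: Iterable[str],
                          threshold: int = 3) -> List[str]:
    """Return the remediation steps if any trigger reason repeats enough times.

    An empty list means the node is left alone.
    """
    counts = Counter(r for r in event_reasons if r in TRIGGER_REASONS)
    if any(n >= threshold for n in counts.values()):
        return list(REMEDIATION_STEPS)
    return []


if __name__ == "__main__":
    recent = ["DiskPressure", "DiskPressure", "ImagePulled", "DiskPressure"]
    for step in plan_node_remediation(recent):
        print(step)
```

Requiring repeated events of the same kind, instead of reacting to a single one, keeps a transient blip from taking a healthy node out of service.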
Connecting self-healing to SRE error budgets
Self-healing should never be uncontrolled, and SRE error budgets provide a natural control mechanism. While plenty of error budget remains, the system can run more aggressive chaos experiments and use them to exercise its self-healing automation. When the error budget is nearing exhaustion, chaos testing should stop, and the self-healing automation should focus on keeping the service stable.
One simple way to do this is to send SLO metrics to Prometheus and use them as conditions for running chaos tests. If the error rate exceeds the agreed threshold, the GitOps pipeline stops deploying new chaos manifests. Gating on SLO metrics aligns resilience testing with business risk and builds trust between product owners and the infrastructure or platform teams.
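The gating rule can be expressed as simple arithmetic on the SLO. The Python sketch below assumes the request counts have already been fetched from Prometheus; the `SloWindow` type, the function names, and the 50% minimum-budget threshold are illustrative assumptions, not values from any Mirantis tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one SLO evaluation window (illustrative type)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: SloWindow) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    allowed_failures = (1.0 - window.slo_target) * window.total_requests
    if allowed_failures <= 0:
        return 0.0  # no budget exists, so nothing remains to spend
    return 1.0 - window.failed_requests / allowed_failures


def chaos_allowed(window: SloWindow, min_budget: float = 0.5) -> bool:
    """Permit new chaos manifests only while enough error budget remains."""
    return error_budget_remaining(window) >= min_budget


if __name__ == "__main__":
    window = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=200)
    print(chaos_allowed(window))
```

With a 99.9% target over one million requests, the budget is about 1,000 failed requests, so 200 failures leave roughly 80% of the budget and chaos testing may proceed. A pipeline step could call `chaos_allowed` before syncing the chaos path in Git.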
Progressive delivery with automated resilience validation
Progressive delivery tools such as Argo Rollouts pair well with chaos-driven self-healing. If a service uses canary rollouts, chaos tests can target only the canary pods, and traffic is automatically shifted back to the stable version if degraded metrics or failed remediation are detected in the canaries.
This process verifies that new application versions are not only functionally correct but also resilient to system failures.
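The rollback decision for a canary under chaos can be sketched as a comparison against the stable baseline. This Python example is a hypothetical illustration and does not use the Argo Rollouts API: the `ServiceMetrics` type, the `evaluate_canary` function, and both thresholds are assumptions chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class ServiceMetrics:
    """Illustrative metrics for one version of the service."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float


def evaluate_canary(canary: ServiceMetrics,
                    baseline: ServiceMetrics,
                    recovered_after_chaos: bool,
                    max_error_delta: float = 0.01,
                    max_latency_ratio: float = 1.5) -> Decision:
    """Roll back if remediation failed or the canary degraded versus baseline."""
    if not recovered_after_chaos:
        return Decision.ROLLBACK
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return Decision.ROLLBACK
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return Decision.ROLLBACK
    return Decision.PROMOTE


if __name__ == "__main__":
    stable = ServiceMetrics(error_rate=0.002, p99_latency_ms=200.0)
    canary = ServiceMetrics(error_rate=0.004, p99_latency_ms=250.0)
    print(evaluate_canary(canary, stable, recovered_after_chaos=True).value)
```

Comparing the canary to the baseline, instead of to fixed absolute limits, keeps the check meaningful when the chaos experiment also affects the stable version.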
Best practices for automated rollback and failover
Designing safe self-healing systems requires discipline.
- Keep remediation declarative and versioned in Git. Avoid imperative scripts that bypass GitOps reconciliation.
- Test chaos in production-like environments first. Staging clusters should mirror production topology, autoscaling, and networking as closely as possible.
- Limit blast radius. Use labels and scopes so chaos targets specific workloads or nodes.
- Make rollback boring. Ensure that reverting to a previous commit reliably restores service without manual cleanup.
- Instrument everything. Logs, metrics, and events are essential for validating that self-healing actually worked.
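As one way to picture the "limit blast radius" practice, the Python sketch below selects chaos targets by label and caps how many may be hit at once. The function name, the label keys, and the 25% cap are hypothetical; real chaos tools expose their own selectors and limits.

```python
import math
from typing import Dict, List


def select_chaos_targets(nodes: Dict[str, Dict[str, str]],
                         selector: Dict[str, str],
                         max_fraction: float = 0.25) -> List[str]:
    """Pick nodes matching every selector label, capped at max_fraction of matches.

    The cap rounds down, so a very small pool yields no targets at all.
    Names are sorted so the selection is deterministic.
    """
    matching = sorted(
        name for name, labels in nodes.items()
        if all(labels.get(key) == value for key, value in selector.items())
    )
    limit = math.floor(len(matching) * max_fraction)
    return matching[:limit]


if __name__ == "__main__":
    fleet = {f"worker-{i}": {"role": "worker"} for i in range(8)}
    fleet["cp-0"] = {"role": "control-plane"}
    print(select_chaos_targets(fleet, {"role": "worker"}))
```

Rounding down is a deliberately conservative choice: with only three matching nodes and a 25% cap, the function selects none, which avoids draining a third of a small pool.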
Building resilient platforms that improve over time
Kubernetes self-healing is more than a cluster that restarts failed workloads. It is an ongoing cycle of injecting failure, observing what happens, remediating, and feeding what you learn back into the system. Mirantis tooling makes it easier to run this cycle across all of your clusters and infrastructure.
GitOps and chaos engineering also give you a systematic way to test and increase the resilience of your systems. By codifying recovery logic in Git and methodically testing it, organizations can turn failure recovery into a competitive advantage. With Mirantis as a partner, users can automate their self-healing processes and achieve “auto-resiliency.”




