Production outages rarely announce themselves politely. Nodes disappear, networks flap, storage stalls, and suddenly your SRE team is firefighting at 2 a.m. The organizations that scale reliably are not the ones that avoid failure but the ones that automate recovery. 

This article describes how to use Mirantis Kubernetes Platforms with GitOps and chaos engineering to build a self-healing system: one that detects failures, recovers from them automatically, and continuously improves its resilience without needing a “hero.”

Why self-healing needs GitOps and chaos engineering

Kubernetes can restart pods and reschedule workloads, but real-world failures are messier: node health changes, CNI plugin failures, degraded control plane components, and odd application behaviour caused by partial failures. Achieving self-healing takes a combination of three things.

The first is the intentional injection of failures through chaos engineering, to validate your assumptions about how the system behaves when something fails. The second is GitOps, which ensures that remediation logic is versioned, auditable, and repeatable. The third is a platform that automates those actions at the cluster and infrastructure levels, not just within a namespace.

Mirantis brings these pieces together by integrating Mirantis Kubernetes Engine, Mirantis Container Cloud, and Mirantis’s automated lifecycle management tools with GitOps-based operational practices for large fleets of clusters.

GitOps-driven chaos experiments with ArgoCD and LitmusChaos

A practical pattern is to store chaos experiments and remediation policies directly in Git and deploy them using ArgoCD or Flux. When chaos is triggered, the system observes signals, validates recovery, and automatically reverts or patches.

Example ArgoCD application for LitmusChaos

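apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: chaos-experiments
  # Argo CD watches Applications in its own namespace; argocd is the default install namespace
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/org/chaos-gitops
    targetRevision: main
    path: litmus/node-failure
  destination:
    server: https://kubernetes.default.svc
    namespace: litmus
  syncPolicy:
    automated:
      # prune removes resources deleted from Git; selfHeal reverts manual drift in the cluster
      prune: true
      selfHeal: true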

This ArgoCD application is tied directly to your chaos test repository, so any change to the chaos tests in Git, including a rollback, is automatically reconciled to the cluster.

LitmusChaos node failure experiment

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: node-drain-chaos
spec:
  # engineState: active starts the experiment; patching it to stop aborts it
  engineState: active
  appinfo:
    appns: default
    applabel: app=payments
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: node-drain
      spec:
        components:
          env:
            # label selector used to pick the node to drain
            - name: NODE_LABEL
              value: "node-role.kubernetes.io/worker"

This experiment validates that workloads reschedule correctly when nodes are drained or become unavailable.

Validating remediation and automatic revert

Injecting chaos does not by itself prove that remediation works. To validate remediation, pair your chaos experiments with health checks and rollback logic.

In Flux, you can do this with health checks and Kustomize overlays.

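apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: payments-resilience
spec:
  interval: 5m
  # timeout bounds how long Flux waits for the health checks below (example value)
  timeout: 3m
  path: ./resilience
  prune: true
  # sourceRef is required; the GitRepository name here is a placeholder
  sourceRef:
    kind: GitRepository
    name: chaos-gitops
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: payments
      namespace: default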

If the deployment does not report a healthy status within the configured timeout after a chaos event, Flux marks the reconciliation as failed. Flux does not roll the deployment back on its own; the failed status is the signal that alerting or follow-up automation can act on, for example by reverting to a stable version in Git or applying a patch such as increasing replicas or redirecting traffic.
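The idea of a health check bounded by a deadline can be illustrated outside Flux with a small Python sketch. This is not how Flux is implemented; `wait_until_healthy`, its parameters, and the injectable `clock` and `sleep` hooks are hypothetical names chosen for illustration.

```python
import time
from typing import Callable


def wait_until_healthy(
    is_healthy: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a health probe until it passes or the deadline expires.

    Returns True if the probe reported healthy before the deadline,
    False otherwise. A False result is the point at which follow-up
    automation (alerting, a Git revert, a patch) would be triggered.
    """
    deadline = clock() + timeout_s
    while True:
        if is_healthy():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

Injecting `clock` and `sleep` keeps the function testable without real waiting; in real use the defaults apply and `is_healthy` would wrap whatever probe you trust, such as a readiness query against the cluster.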

How Mirantis simplifies multi-cluster self-healing

Operating self-healing clusters at scale means consistently managing many clusters, sometimes hundreds, across multiple environments. With Mirantis Container Cloud, organizations can centrally manage the lifecycle of multi-cluster environments across bare-metal servers, VMware infrastructure, public cloud services, and edge deployments.

The key capabilities for self-healing clusters are the following:

  • Monitoring the health of your clusters
  • Configuring clusters declaratively
  • Running automated workflows to recover from node failures

For example, if a node reports repeated kubelet failures or disk-pressure events, Mirantis automatically cordons the node so that no new pods are scheduled onto it, drains it (evicting its pods so they are rescheduled elsewhere), reprovisions it, and rejoins it to the cluster, all without human intervention. Because the same self-healing logic runs in every cluster, remediation is applied uniformly and as soon as a failure is detected.

In customer environments, Mirantis Platforms have identified recurring patterns of node instability, such as repeated network timeouts or kernel panics, and triggered automatic remediation, including automated node replacement and CNI reconfiguration. As a result, organizations using Mirantis have seen significant reductions in mean time to repair (MTTR) and fewer issues escalated to their on-call engineering teams.
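The trigger-then-remediate flow in this example can be sketched as a small decision function. This is an illustration only, not Mirantis's implementation: the reason strings, the threshold of three events, and the function name `plan_remediation` are all assumptions.

```python
from collections import Counter
from typing import Iterable, List

# Ordered steps taken once a node is judged unhealthy.
REMEDIATION_STEPS = ["cordon", "drain", "reprovision", "rejoin"]

# Hypothetical event reasons that count toward remediation.
WATCHED_REASONS = {"KubeletFailure", "DiskPressure"}


def plan_remediation(event_reasons: Iterable[str], threshold: int = 3) -> List[str]:
    """Return the remediation steps for a node, or [] if none are needed.

    A node qualifies when any single watched reason has been seen at
    least `threshold` times in the events passed in. Callers are
    expected to pass only events from the time window they care about.
    """
    counts = Counter(r for r in event_reasons if r in WATCHED_REASONS)
    if any(n >= threshold for n in counts.values()):
        return list(REMEDIATION_STEPS)
    return []
```

Returning an ordered plan rather than executing it keeps the decision separate from the side effects, which makes the policy easy to review and version in Git alongside the rest of the remediation logic.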

Connecting self-healing to SRE error budgets

Self-healing should never be uncontrolled. SRE error budgets provide a natural control mechanism. While plenty of error budget remains, the system can run more aggressive chaos tests and use them to exercise its self-healing automation. When the error budget is nearing exhaustion, chaos testing should stop and the self-healing automation should be used only to keep the service stable.

One simple way to do this is to send SLO metrics to Prometheus and use them as conditions for running chaos tests. If the error rate exceeds the agreed threshold, the GitOps pipeline blocks the deployment of new chaos manifests. Gating on SLO metrics aligns resilience testing with business risk and builds trust between product owners and the infrastructure or platform teams.
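The gating rule can be made concrete with a little arithmetic. The sketch below assumes the request counts have already been fetched (for example from Prometheus) for the SLO window; `SloWindow`, `error_budget_remaining`, `chaos_allowed`, and the 50% cutoff are hypothetical names and values, not part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    slo_target: float     # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int   # requests observed in the SLO window
    failed_requests: int  # failed requests observed in the same window


def error_budget_remaining(window: SloWindow) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means fully spent, negative means overspent.
    """
    allowed_failures = (1.0 - window.slo_target) * window.total_requests
    if allowed_failures <= 0:
        return 0.0  # no budget to spend (100% target or no traffic)
    return 1.0 - window.failed_requests / allowed_failures


def chaos_allowed(window: SloWindow, min_remaining: float = 0.5) -> bool:
    """Permit new chaos experiments only while enough budget remains."""
    return error_budget_remaining(window) >= min_remaining
```

With a 99.9% target over 1,000,000 requests, the budget is about 1,000 failures. At 200 failures roughly 80% of the budget remains and chaos may proceed; at 900 failures only about 10% remains and the gate closes. A pipeline step could call `chaos_allowed` before merging or syncing new chaos manifests.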

Progressive delivery with automated resilience validation

Progressive delivery tools such as Argo Rollouts fit well with chaos-driven self-healing. If a service uses canary rollouts as part of its progressive delivery strategy, chaos testing can be targeted at the canary pods only, and the rollout can automatically shift traffic back to the stable version if degraded metrics or failed remediation are detected in the canary.

This ensures that new application versions are not only functionally correct but also resilient to system failures.
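The promote-or-roll-back decision for a canary under chaos can be sketched as follows. This is a simplified illustration, not the Argo Rollouts analysis engine; `CanaryObservation`, `canary_verdict`, and the one-percentage-point tolerance are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryObservation:
    baseline_error_rate: float   # error rate of the stable version, 0.0 to 1.0
    canary_error_rate: float     # error rate of the canary while chaos runs
    remediation_succeeded: bool  # did self-healing restore the canary?


def canary_verdict(obs: CanaryObservation, max_error_rate_increase: float = 0.01) -> str:
    """Return "rollback" or "promote" for a canary observed under chaos.

    The canary is rolled back if remediation failed, or if its error
    rate exceeds the baseline by more than the allowed increase.
    """
    if not obs.remediation_succeeded:
        return "rollback"
    if obs.canary_error_rate - obs.baseline_error_rate > max_error_rate_increase:
        return "rollback"
    return "promote"
```

Comparing the canary against the stable baseline, rather than against a fixed number, keeps the check meaningful when the whole service is having a noisy day.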

Best practices for automated rollback and failover

Designing safe self-healing systems requires discipline:

  • Keep remediation declarative and versioned in Git. Avoid imperative scripts that bypass GitOps reconciliation.
  • Test chaos in production-like environments first. Staging clusters should mirror production topology, autoscaling, and networking as closely as possible.
  • Limit blast radius. Use labels and scopes so chaos targets specific workloads or nodes.
  • Make rollback boring. Ensure that reverting to a previous commit reliably restores service without manual cleanup.
  • Instrument everything. Logs, metrics, and events are essential for validating that self-healing actually worked.
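As one illustration of limiting blast radius, the sketch below selects chaos targets by label and caps how many of the matching pods may be affected at once. The function name `select_chaos_targets`, the 25% default cap, and the pod names are hypothetical; real chaos tools typically pick targets at random, whereas this sketch takes the first names in sorted order so the result is reproducible.

```python
import math
from typing import Dict, List, Mapping


def select_chaos_targets(
    pods: Mapping[str, Dict[str, str]],
    selector: Dict[str, str],
    max_fraction: float = 0.25,
) -> List[str]:
    """Pick pod names that match every label in `selector`.

    At most `max_fraction` of the matching pods are returned (rounded
    down), but always at least one when anything matches.
    """
    matching = sorted(
        name
        for name, labels in pods.items()
        if all(labels.get(key) == value for key, value in selector.items())
    )
    if not matching:
        return []
    limit = max(1, math.floor(len(matching) * max_fraction))
    return matching[:limit]


pods = {
    "payments-a": {"app": "payments"},
    "payments-b": {"app": "payments"},
    "payments-c": {"app": "payments"},
    "payments-d": {"app": "payments"},
    "cart-a": {"app": "cart"},
}
targets = select_chaos_targets(pods, {"app": "payments"}, max_fraction=0.5)
```

Here `targets` contains two of the four payments pods and never touches the cart workload.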

Building resilient platforms that improve over time

Kubernetes self-healing isn’t just about having a self-healing cluster. It is an ongoing cycle of injecting failure into the system, observing what happens, remediating, and feeding what was learned back into the platform. The tools available from Mirantis make it easy to run these cycles across all your clusters and infrastructure.

In addition, GitOps and chaos engineering give users a systematic way to test and increase the resilience of their systems. By using GitOps to codify recovery logic and then methodically testing it, organisations can turn failure recovery into a competitive advantage. With Mirantis as a partner, users can automate their self-healing processes and achieve “auto-resiliency”.
