Robusta: HolmesGPT for AI-Driven Incident Investigation in the Cloud

An alert or cloud incident is rarely an isolated event. A service alert might fire in Kubernetes, but the cause could be a firewall issue, a bad infrastructure update, a forgotten dependency, or a cloud permission change.
When solving these issues, teams have to jump across logs, source control, and dashboards before they can figure out what has actually failed.
This was the center of Software Plaza’s conversation with Natan Yelen, CEO of Robusta. Twain Taylor of Software Plaza spoke with Yelen about how Robusta moved from Kubernetes monitoring toward AI-led incident investigation with HolmesGPT.
Instead of asking engineers to manually assemble the story of an incident, HolmesGPT takes on the investigation itself: it pulls evidence from the relevant systems, identifies the problem, and suggests a solution. The result is a root-cause narrative tied to the source data.
From Kubernetes alerts to cross-system investigation
Robusta's original mission was to make Kubernetes monitoring easier. However, as Kubernetes operations matured and observability tooling improved, cluster monitoring alone became less differentiated. So Robusta shifted to a more stubborn problem: teams could not work out why an alert fired or which system change was behind it.
The company's move toward AI as its main troubleshooting tool created space for HolmesGPT.
HolmesGPT supports AWS, GCP, Oracle Cloud, OpenShift, IBM mainframes, and other environments, while also connecting to observability platforms, source control systems like GitHub, ITSM tools, and internal knowledge bases such as Confluence and Notion.
The reason for this breadth is that a root cause can sit outside the system that raises the first alarm, so HolmesGPT needs to see the wider picture.
What HolmesGPT needs to see
HolmesGPT is an AI SRE agent, which means that it needs access to more than just logs and metrics.
A useful incident investigation can pull from several layers at once:
- observability data such as logs, metrics, and traces
- cloud and infrastructure context
- recent code and infrastructure changes in source control
- ITSM records and internal documentation
- custom operational data from databases or devices
Once HolmesGPT has this wider context, it can move beyond alert summaries. It can inspect Terraform changes and relate them to cloud configuration, compare changes in service behavior, and connect all of that back to the user-facing symptom.
As a result, HolmesGPT can also suggest fixes and submit pull requests, so teams can move from diagnosis into response.
HolmesGPT is an Apache-licensed CNCF open source project, which gives the engine a wider surface area.
How HolmesGPT keeps AI grounded in evidence
Trust is one of the hardest problems in using language models. The Robusta team faced internal debate of its own, because earlier models such as GPT-3.5 were prone to inaccurate responses.
The solution is to tie the model to evidence. HolmesGPT responses include graphs and charts that refer back to the original data sources, as well as citations linked to specific data points. So the model is not just writing an explanation of what is happening or what you should do; it gathers the evidence from the signals and points back to those data points.
This matters a great deal in observability, where the volume of raw data is too large for a model to consume at once. HolmesGPT handles this by slicing the data into manageable chunks that fit within the model’s context window.
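The chunking idea can be illustrated with a minimal Python sketch. This is not HolmesGPT's implementation: the function name `chunk_lines`, the `max_tokens` budget, and the word-count token estimate are all assumptions made for illustration. A real system would use the model's own tokenizer.

```python
from typing import Iterable, List


def chunk_lines(lines: Iterable[str], max_tokens: int) -> List[List[str]]:
    """Group log lines into chunks whose estimated size fits a token budget.

    Hypothetical helper: tokens are approximated as whitespace-separated
    words, which is only a rough stand-in for a real tokenizer.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for line in lines:
        # Crude estimate: one token per word, and at least one per line.
        cost = max(1, len(line.split()))
        # Close the current chunk if this line would push it over budget.
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        # A single line larger than the budget still gets its own chunk.
        current.append(line)
        used += cost
    if current:
        chunks.append(current)
    return chunks


logs = ["a b c", "d e", "f g h i", "j"]
print(chunk_lines(logs, max_tokens=5))
```

Each chunk can then be summarized or inspected separately, so no single request has to carry the full volume of raw data.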
This is a better match for incident work than a generic chatbot pattern. Teams do not need a fluent answer. They need an answer they can check.
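One way to picture an answer that can be checked is a finding that carries its own evidence. The sketch below is a hypothetical data shape, not HolmesGPT's output format: the `Citation` and `Finding` classes, their fields, and the sample metric query are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Citation:
    source: str  # illustrative label for where the data came from
    query: str   # the query or reference that produced the data point
    value: str   # the observed value the claim rests on


@dataclass
class Finding:
    claim: str
    citations: List[Citation] = field(default_factory=list)


def unsupported_claims(findings: List[Finding]) -> List[str]:
    """Return the claims that cite no evidence and so cannot be checked."""
    return [f.claim for f in findings if not f.citations]


findings = [
    Finding(
        claim="Error rate rose after the last deploy",
        citations=[Citation("metrics", "rate(http_errors_total[5m])", "0.21")],
    ),
    Finding(claim="The database is probably overloaded"),
]
print(unsupported_claims(findings))
```

A reviewer, or an automated gate, could reject any narrative that contains claims with no citation attached.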
Beyond alert triage
The product’s direction is not only reactive troubleshooting. HolmesGPT can also run scheduled analysis, look for issues even when no alert has fired, send reports, and generate dashboards via its servers.
This points to a broader operating model, in which the agent is not present only when a failure is reported or an alert is raised. You can schedule a review of cost or reliability issues and feed the findings back into daily operations.
What this means for platform teams
One of the stronger ideas in the webinar is that platform engineering is shifting toward systems that AI agents can use well, not only systems built for humans. That is a practical design change.
If an AI agent is going to investigate issues, propose code changes, and open pull requests, the surrounding platform has to make those actions inspectable and testable. The webinar points to ephemeral environments, automated screenshots, videos, and tests as part of that loop. The human role moves toward review and approval rather than manual assembly of every fix.
That does not remove the need for engineering judgment. It changes where that judgment is applied. Teams still need to decide what data the agent can access, what actions it can take, what evidence it must return, and when automation stops short of production. The webinar also touches on controlled autonomy, including the idea of agents operating within strict limits when spending or acting on behalf of a team.
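The idea of strict limits can be sketched as a small policy check that runs before an agent acts. Everything here is hypothetical and not taken from HolmesGPT or the webinar: the `AgentPolicy` class, the action names, and the budget figures are assumptions chosen to show the shape of such a guard.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AgentPolicy:
    allowed_actions: FrozenSet[str]  # actions the team has approved
    spend_limit: float               # total budget, in any currency unit


def authorize(policy: AgentPolicy, action: str, cost: float, spent_so_far: float) -> bool:
    """Allow an action only if it is approved and stays within budget."""
    if action not in policy.allowed_actions:
        return False
    return spent_so_far + cost <= policy.spend_limit


policy = AgentPolicy(
    allowed_actions=frozenset({"read_logs", "open_pull_request"}),
    spend_limit=50.0,
)
print(authorize(policy, "open_pull_request", cost=0.0, spent_so_far=0.0))
print(authorize(policy, "deploy_to_production", cost=0.0, spent_so_far=0.0))
```

Keeping the allowed actions and the budget in one immutable object makes the limits easy for a human to review before the agent runs.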
For enterprises, the near-term takeaway is less dramatic and more useful. The best place to evaluate tools like HolmesGPT is the investigation layer: alert triage, cross-system correlation, evidence packaging, and guided remediation. That is where the webinar shows the product working today.
HolmesGPT is not presented as magic. It is presented as a way to compress the search phase of incident response and make the answer easier to verify. For teams dealing with Kubernetes, multi-cloud sprawl, and too many disconnected tools, that is a meaningful shift.
Readers who want the fuller demo and product discussion should check out the interview for more detail.




