Building an AWS governance agent with Strands, Bedrock, and AgentCore

· 12 min read

Looking at an AWS Organization from a governance point of view means checking a lot of places. Cost anomalies, security posture, API activity, resource changes, Health events, configuration drift, relevant AWS announcements. Individually each of these is a dashboard glance. Together, across multiple accounts, the surface is wide enough that something non-obvious can easily go unnoticed, usually the thing that would connect a signal in one domain with one in another.

I wanted to see if an AI agent could do that reading for me: look at the same data I would look at, correlate across domains, and tell me only the things I wouldn’t have spotted on my own in a quick check.

So I built one. I’m a cloud architect, not an AI researcher, and building agentic systems wasn’t part of my day job, so I started from what I knew best: the AWS ecosystem.

Yes, this is an actual bracelet of mine.
Yes, this is an actual bracelet of mine.

Looking around, I found three pieces that together cover the full stack you need for an agent, and it’s worth spending a minute on what each of them actually does, because the names are similar enough that they can easily blur together.

Amazon Bedrock is the model layer. It’s the managed service that gives you access to foundation models (Amazon Nova, Anthropic Claude, Meta Llama, Mistral, Qwen, DeepSeek, and others) through a single API, without having to host the model yourself. Bedrock is where the actual inference happens: you send a prompt, it runs it on the model you chose, and returns a response. If you strip everything else away, Bedrock is the brain.

Strands Agents is the framework layer. An agent is more than a single call to a model: it’s a loop where the model decides to call a tool, receives the tool’s output, reasons about it, decides whether to call another tool, and eventually produces a final answer. That loop, plus the plumbing to define tools, pass them to the model, parse tool calls out of the response, execute them, feed results back, and handle all the edge cases, is what Strands Agents provides. It’s an open source SDK by AWS with native Bedrock integration. You write your agent logic in Python, it takes care of the loop.

AgentCore is the runtime layer. Once you have an agent written with Strands, you still need to run it somewhere: an environment that handles sessions, scales up and down, gives you observability, evaluates outputs, and manages the lifecycle of the agent’s executions. AgentCore is that environment, built specifically for agentic workloads. Agents running inside it get a dedicated microVM per session, structured traces in CloudWatch, integration with AgentCore Evaluations for quality scoring, and other production concerns handled out of the box.

The short version:

PieceLayerWhat it does
Amazon BedrockModelRuns the foundation model that does the reasoning
Strands AgentsFrameworkOrchestrates the reasoning loop (tool calls, responses, final answer)
AgentCoreRuntimeHosts and runs the agent in production, with sessions, observability, evaluation

Unlike the chatbot-style agents most posts and demos focus on, where a user asks something and the agent responds in real time, this one has no interactive interface. At least not yet. It runs once a day per Organization, on a schedule, reasons about the data it has access to, and produces structured JSON with a maximum of five findings.

On paper, it’s a clean architecture. In practice, the first version didn’t survive contact with real data, and what follows is the account of everything I learned along the way, one rabbit hole at a time.

Drinking responsibly.
Drinking responsibly.

The first approach: raw data tools

The first lesson was where an agent’s intelligence actually lives. Framework and model get most of the attention, but in practice they’re only part of the story: an agent is as capable as the tools you give it. The model decides what to do, but it can only act through the tools it has access to, and the quality of those tools, what they return, how much, at what granularity, with which filters, is what determines whether the agent can reason or just shuffles data around.

So tools became the first thing to get right. I started writing them, each one mapped to a question I would have asked if I were looking at the dashboard myself: one for cost summaries, one for cost anomalies, one for security posture, one for active findings, one for API activity, and so on. The first version of each tool was basically a thin wrapper around the data it needed: the cost tool returned daily cost records per service per account for the requested date range, the activity tool returned CloudTrail events grouped by API call and principal, the security tool returned all active findings with their full details. Give the agent everything, let it figure out what matters.

Hitting the context window limit

The agent started up fine, but when invoked, the container would die shortly after. AgentCore kept restarting it, and CloudWatch logs showed the crash-and-restart pattern on every invocation. I didn’t know whether the problem was the model, the prompt, or my own code.

I added defensive handling around the model response so the agent wouldn’t crash on malformed output. That stopped the crashes and the restarts, and the agent now always returned something.

But the problem now was that the output sometimes was empty, and inconsistently so: some runs came back coherent, others came back with nothing. If the failure had been systematic I’d have gone looking for the cause straight away, but the intermittence kept pointing me at things that weren’t the cause. I tried changing the model, and I noticed that the supposed most capable models produced an empty output more often than the smaller models.

The plot thickens.
The plot thickens.

I eventually dug into the raw Bedrock responses and found the signal I’d been missing. The runs that returned nothing had stopReason = max_tokens: the model was being cut off mid-generation because its output hit the token ceiling. The JSON that came back was truncated, my defensive parsing couldn’t make sense of it, and the agent returned nothing.

Most models have context windows in the 128k-200k range; the more capable models produced richer, longer output, so they bumped into the ceiling. The cheaper models almost never did because their output was naturally shorter, which is also why I’d seen more empty responses on the capable ones. I tried also Claude Sonnet 4.6 and Opus 4.6, that have a one-million-token context window, and they handled the data without breaking. But at that rate, each invocation was costing too much.

And the problem also scales with the Organization. More accounts, more services, more CloudTrail events, more security findings: bigger Organizations produce more data per tool call, and they’re also the ones where the agent is most useful because there’s more complexity to navigate. The cost and the value scale together, which is exactly the wrong dynamic.

Redesigning the tools with pre-aggregation

The fix was making the tools smarter about what they return. Instead of giving the model raw records, the tools do the aggregation themselves and return compact summaries.

The cost summary tool, for example, went from returning daily records per service per account (thousands of data points) to returning this:

{
    "total_cost_usd": 14523.47,
    "daily_average_usd": 2074.78,
    "monthly_projection_usd": 62243.40,
    "cost_by_account": [
        {
            "account_id": "123456789012",
            "cost_usd": 8241.33,
            "daily_datapoints": 7,
            "top_services": [
                {"service": "Amazon EC2", "cost_usd": 4120.50},
                {"service": "Amazon RDS", "cost_usd": 1853.22},
                {"service": "AWS Lambda", "cost_usd": 891.04}
            ]
        }
    ],
    "top_services": [
        {"service": "Amazon EC2", "cost_usd": 6842.10, "percentage": 47.1}
    ],
    "accounts_count": 8,
    "period": {"start": "2026-03-01", "end": "2026-03-07"}
}

A few hundred tokens instead of tens of thousands. The model can read it, understand it, and decide where to dig deeper. The same approach for all tools: security posture returns an overall score and finding counts by severity instead of every individual finding. Activity returns aggregated API call counts and user agent distribution instead of event-level records.

The token consumption per invocation (and related costs) dropped dramatically. The agent could run without anyone worrying about the Bedrock bill.

Problem solved, right?

The visibility tradeoff of pre-aggregation

Not exactly. Because by pre-aggregating, I’m making decisions about what the model gets to see. And those decisions have consequences.

The cost tool returns the top 5 services per account. What if the interesting anomaly is in the 6th service? What if the correlation the agent should find involves a service that doesn’t appear in the summary at all because its cost is low in absolute terms, but it showed up for the first time this week?

The security tool returns finding counts by severity. What if the important thing isn’t a new Critical finding (which would appear in the count) but the fact that 15 Medium findings appeared simultaneously on the same account, which individually look unremarkable but together suggest a misconfiguration?

The whole point of having an agent is to find things I wouldn’t spot on my own. But the pre-aggregation is doing exactly what I would do if I were glancing at the console: look at the big numbers, the top services, the high-severity findings. The agent sees what I would see. It doesn’t see the subtle stuff in the long tail of the data, which is where the non-obvious insights live.

This hit me when I tried to articulate what I wanted: “find patterns I haven’t anticipated, but only look at the data I’ve pre-selected for you.” Those two goals are in tension, and I don’t think there’s a clean resolution.

Two-pass investigation: summary, then drill-down

The best compromise I’ve found is a two-pass architecture, encoded in the prompt:

Pass 1: call the summary tools (cost summary, security posture, resource changes) to get the broad picture. These return compact, pre-aggregated data. The model uses this to identify areas that look interesting: a cost anomaly on a specific account, a security score drop in a particular domain, a cluster of new resources that wasn’t expected.

Pass 2: call drill-down tools with specific filters. The cost tool accepts an account_id parameter to return the detailed breakdown for that one account. The security tool accepts severity and date filters. The activity tool can be scoped to a specific time window.

This way the summary tools keep the initial token cost low, and the model decides where to zoom in. The drill-down calls return more data, but for a narrow scope, so the token volume stays manageable. The total across a handful of tool calls (a few broad, a few narrow) stays well within the context window of any reasonably capable model, and the per-invocation cost drops to a level that makes daily runs viable.

The assumption is that the summary is informative enough for the model to know where to drill down. If the anomaly doesn’t show up in the summary (because it’s in the 6th service, or it’s a pattern across multiple Medium findings that the count doesn’t reveal), the model will never investigate it.

The approach has a structural limit: patterns that don’t surface in the aggregated summaries or in the available drill-downs remain invisible to the agent.

Pre-computed correlations: when code beats the model

Among the 17 tools, one takes a different approach. The get_security_cost_correlation tool doesn’t give the model data to analyze. The correlation logic runs in a separate data collection Lambda that pre-computes the analysis: it matches CloudTrail security error spikes (AccessDenied, UnauthorizedAccess, InvalidAccessKeyId) against cost anomalies on the same accounts in nearby time windows, stores the results in DynamoDB, and the tool just reads them. The agent receives a pre-computed correlation score, never the raw data.

If the score is above 70, the model gets something like: “account X had a cost anomaly of 40% on March 3rd, and in the 48 hours before that there was a spike in AccessDenied errors. Correlation score: 82.” The model’s job is to explain this in context and recommend next steps. The pattern was found by code, deterministically, with zero token cost.

This is a fundamentally different architecture: code for finding the pattern, model for narrating and contextualizing. It works perfectly for known correlation patterns. The problem is the word “known.” You have to identify the pattern before you can write the code to detect it. For the security-cost correlation, that pattern is well understood: credential theft often produces a burst of AccessDenied errors followed by a cost spike. I knew to look for this, so I wrote code to find it.

But the agent was supposed to find patterns I haven’t thought of. For that, the model needs to reason over the data, which brings us back to the context window and cost constraints. The deterministic approach and the exploratory approach serve different purposes, and both have a place. The tension between them is the core design challenge of this kind of system.

From tool design to model choice

So, the tools were redesigned. The prompt encodes the two-pass reasoning process. The pre-aggregation keeps costs manageable. The security-cost correlation tool handles one important pattern entirely in code.

But the agent was still producing banalities. The output read like restated dashboard summaries, not cross-domain investigations. My initial suspicion was that the model was the bottleneck: the cheapest model on Bedrock, chosen for budget reasons, might simply not have the reasoning capacity to follow the multi-step investigation the prompt describes.

To test that, I needed to compare models systematically: not by reading one output and forming an opinion or a feeling, but by running the same scenario on multiple models, multiple times, and comparing structured data. That meant building a benchmarking system, which is the next post.