AWS just shipped an AI on-call engineer. We tested it.


AWS FOR THE REAL WORLD
⏱️
Reading time: 12 minutes
🎯
Main Learning: AWS DevOps Agent investigates incidents autonomously across CloudWatch, CloudTrail, and your code. It surfaces evidence brilliantly, but it can confidently point at the wrong root cause, so don't apply its fixes blindly.

Hey Reader 👋🏽

I was in Portugal for the past week. 10 days of tennis, padel, sun and waves 🎾

Highly recommended place!

Our daily lives as software developers have really changed since we started using AI. My own workflow went through three stages:

  1. Copy Paste from ChatGPT
  2. Tab Completion
  3. Agentic IDE

… and honestly I love it.

Every company is adding AI features now. Not all of them make sense.

AWS is also known for shipping new AI features. One of the latest: AWS DevOps Agent.

Tobi tested it in real life. Let's see what it can do!

Sponsored

The dev platform tax just got worse.

Dropbox migrated 1,000 developers to Coder, and the same platform now governs AI agents

Every infra team eventually builds an internal dev platform. Pre-warmed envs. Network isolation. Audit logs. Provisioning glue someone has to maintain. It's a tax most teams pay quietly.

Dropbox paid it for years: a homegrown Python provisioning system, monolithic, painful to extend. Then they switched to Coder. 2 people migrated 1,000 developers in 4 months. Startup times went from ~1 hour to 15 minutes. Cloud costs dropped 30%.

Now AI agents are joining the team. They need the same platform primitives: pre-warmed envs, allow-list controls, audit logs. Building that yourself just got twice as expensive.

  • AI Governance: one governed layer for every LLM call and every agent action. Token logging, attribution, allow-list controls.
  • Coder Agents: label a GitHub issue, a workspace spins up, a PR gets shipped.

Same self-hosted footprint, in your own AWS account.

This issue is sponsored by Coder.

AWS DevOps Agent: your AI SRE is now on call

📚 This Week's Deep Dive

AWS DevOps Agent: Hands-on

Most incidents are not complex.

Something breaks and someone has to figure out what changed. You open CloudWatch, check recent deployments, ask whoever last touched the service. Two hours later, you find a renamed DynamoDB field in last week's commit. The fix? Five minutes.

AWS thinks they have a solution for that.

It's called AWS DevOps Agent, an "always-on, autonomous on-call engineer" that starts investigating the moment an alarm fires. Built on Bedrock AgentCore, generally available since March 2026.

What it actually is

Connect it to your observability stack, code repos, deployment pipelines, and runbooks. When something breaks, it correlates data across all of them to find what changed.

Three things it does:

  • Incident investigation. Alarm fires, agent pulls logs, checks deployments, diffs commits, tells you what changed.
  • Proactive recommendations. Between incidents it analyzes patterns and flags recurring issues before they blow up again.
  • On-demand questions. "Why did checkout latency spike last Tuesday?" It queries your tools and answers.

Test 2: the missing IAM permission

Order processor Lambda missing DynamoDB PutItem permission

An order processor Lambda calls dynamodb:PutItem, but the IAM role is missing the permission. Classic code-review miss.

The moment the alarm fired, the agent posted to Slack with this:

Root cause: Terraform deployment by tobias.schmidt created Lambda function with incomplete IAM permissions. The Terraform configuration is missing the required DynamoDB permissions for the role despite the function's code calling dynamodb:PutItem on table devops-agent-orders.

Not vague. It identified the exact deployment, the exact missing permission, and traced it through CloudTrail timestamps. No human dug through logs.
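To make the failure concrete, here is a minimal sketch of the kind of check that would have caught it in review. The policy contents, account ID, and region are made up for illustration (the test only tells us that PutItem was missing), and the `allows` helper is a heavily simplified stand-in for real IAM evaluation: it looks at Allow statements only and ignores Deny rules and conditions.

```python
from fnmatch import fnmatchcase

# Placeholder ARN: account ID and region are assumptions, not from the test setup.
TABLE_ARN = "arn:aws:dynamodb:eu-central-1:123456789012:table/devops-agent-orders"

# Hypothetical role policy: read access was granted, PutItem was forgotten.
role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:Query"],
            "Resource": TABLE_ARN,
        }
    ],
}


def as_list(value):
    """IAM allows a single string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def allows(policy, action, resource):
    """Simplified check: Allow statements only, no Deny, no conditions."""
    for statement in as_list(policy["Statement"]):
        if statement.get("Effect") != "Allow":
            continue
        # Action names are case-insensitive; both sides support * and ? wildcards.
        action_ok = any(
            fnmatchcase(action.lower(), pattern.lower())
            for pattern in as_list(statement["Action"])
        )
        resource_ok = any(
            fnmatchcase(resource, pattern)
            for pattern in as_list(statement["Resource"])
        )
        if action_ok and resource_ok:
            return True
    return False


print(allows(role_policy, "dynamodb:GetItem", TABLE_ARN))  # True
print(allows(role_policy, "dynamodb:PutItem", TABLE_ARN))  # False
```

The second call is the one the order processor depends on, and it is the one that fails.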

Test 3: where it picked the wrong side to blame

Inventory updater Lambda with wrong TABLE_NAME env var

Lambda env var: TABLE_NAME=devops-agent-inventory-v1. IAM policy: grants UpdateItem on table/devops-agent-inventory (no -v1). Every invocation: AccessDeniedException.

The agent traced everything correctly: env var, IAM, the exact Terraform commit. Then it concluded: fix the IAM policy. Point it at table/devops-agent-inventory-v1.

That fix would resolve the AccessDeniedException and immediately surface a ResourceNotFoundException, because that table doesn't exist.

The actual bug is the env var. TABLE_NAME should be devops-agent-inventory. The IAM policy is correct.

So: it surfaces evidence brilliantly. It does not always reason about which piece of evidence is the actual root cause.
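The reasoning step that was skipped is small: when two names disagree, check which one actually exists before picking a side. Here is a minimal sketch of that rule. The `pick_fix` function and its return strings are our own illustration, not anything the agent exposes, and in a real account the set of existing tables would have to come from somewhere like a DynamoDB ListTables call.

```python
def pick_fix(env_table, policy_table, existing_tables):
    """Decide which side of an env-var/IAM mismatch should change.

    env_table:       table name the Lambda reads from TABLE_NAME
    policy_table:    table name the IAM policy grants access to
    existing_tables: set of table names that really exist
    """
    if env_table == policy_table:
        return "no mismatch"

    env_exists = env_table in existing_tables
    policy_exists = policy_table in existing_tables

    if policy_exists and not env_exists:
        # The policy points at a real table, so the env var is the typo.
        return f"fix env var: set TABLE_NAME={policy_table}"
    if env_exists and not policy_exists:
        # The code targets a real table, so the policy is the stale side.
        return f"fix IAM policy: grant access to table/{env_table}"
    # Both or neither exist: the names alone cannot settle it.
    return "ambiguous: needs a human"


existing = {"devops-agent-inventory"}
print(pick_fix("devops-agent-inventory-v1", "devops-agent-inventory", existing))
# fix env var: set TABLE_NAME=devops-agent-inventory
```

With the Test 3 inputs, the existence check points at the env var, which is the opposite of what the agent recommended.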

Where it falls short

  • No native CloudWatch integration. No alarm action targets DevOps Agent directly. You build a bridge Lambda, sign a payload with HMAC-SHA256, POST to a webhook. Not the one-click setup the product page implies.
  • Pricing is hard to predict. $0.0083/agent-second across all task types. Roughly $30/hour of active time. No cost caps.
  • Azure support is shallow. Cross-cloud is real but early. CloudTrail has no equivalent on the Azure side.
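For a feel of what the bridge Lambda's signing step involves, here is a minimal sketch using only the Python standard library. Everything specific in it is an assumption: the payload fields, the header name, and the choice to sign the raw request body are placeholders, so check the DevOps Agent webhook documentation for the exact format it expects. The sketch stops at building the request and does not send anything.

```python
import hashlib
import hmac
import json


def sign_payload(secret: str, payload: dict) -> tuple[bytes, str]:
    """Serialize the payload and return (body, hex HMAC-SHA256 of the body)."""
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode("utf-8")
    signature = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return body, signature


def verify_signature(secret: str, body: bytes, signature: str) -> bool:
    """What the receiving side would do: recompute and compare in constant time."""
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


# Hypothetical alarm payload; the real field names depend on the webhook contract.
payload = {"alarm": "order-processor-errors", "state": "ALARM"}
body, signature = sign_payload("webhook-secret", payload)

# Hypothetical header name, for illustration only.
headers = {"Content-Type": "application/json", "x-example-signature": signature}

print(verify_signature("webhook-secret", body, signature))         # True
print(verify_signature("webhook-secret", body + b" ", signature))  # False
```

The signature is computed over the exact bytes that get sent, so the body must be serialized once and reused. Re-serializing the dict before the POST can change the bytes and invalidate the signature.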

Worth trying?

If your team spends hours per incident manually correlating logs, deployments, and CloudTrail, start the 2-month free trial. Setup takes an afternoon and after that it runs without maintenance.
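For a rough budget once the trial ends, the $0.0083 per agent-second rate quoted above is easy to turn into numbers. A minimal sketch: the incident count and average active time are made-up inputs, so plug in your own.

```python
RATE_PER_AGENT_SECOND = 0.0083  # USD, the rate quoted in the pricing note


def investigation_cost(active_seconds: float) -> float:
    """Cost in USD for a given amount of active agent time."""
    return round(active_seconds * RATE_PER_AGENT_SECOND, 2)


def monthly_cost(incidents_per_month: int, avg_active_minutes: float) -> float:
    """Estimated monthly cost, assuming every incident takes the average time."""
    return investigation_cost(incidents_per_month * avg_active_minutes * 60)


print(investigation_cost(3600))  # one hour of active time: 29.88
print(monthly_cost(20, 15))      # hypothetical: 20 incidents, 15 minutes each
```

Since there are no cost caps, an estimate like this is the only guardrail you get up front.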

The fear that this replaces DevOps and SRE engineers is real but premature. It's a good investigator, not a replacement for someone who understands the system. Test 3 proved that.

That's a wrap.
Try the trial, but don't follow the agent's recommendations blindly. Test 3 is the warning shot.
See you next week!

Sandro & Tobi

AWS for the Real World

We teach AWS for the real world - not for certifications. Join more than 10,500 developers learning how to build real-world applications on AWS.
