AutoClaw

Playbook: Deploying an Autonomous SRE Agent for GKE on Cloudflare

Step-by-step guide to deploying a Cloudflare Workers-based SRE incident investigation agent that connects to a private GKE cluster and notifies via Google Chat.

Note: This playbook is published on autoclaw.sh, which is primarily focused on OpenClaw. This guide covers a separate project — an SRE agent for Kubernetes infrastructure. We’ve published it here as a practical reference for teams running similar stacks.


Overview

This playbook walks through deploying an autonomous SRE incident investigation agent. The agent:

  1. Receives down alerts from Uptime Kuma via webhook
  2. Deduplicates repeat alerts and posts an acknowledgement to Google Chat
  3. Investigates the affected workloads autonomously, using read-only Kubernetes API calls (pods, logs, events, rollouts, node readiness)
  4. Posts its conclusion to the incident's Google Chat thread: root cause, recommended action, and confidence level

Stack: Cloudflare Workers + Durable Objects + Queues + KV + Workers AI (Gemma 4) + GKE + Uptime Kuma + Google Chat


Prerequisites

  1. A GKE cluster you can administer with kubectl
  2. A Cloudflare account on the Workers Paid plan (Queues requires it), with your domain on Cloudflare DNS
  3. Node.js and wrangler available locally
  4. A running Uptime Kuma instance
  5. A Google Chat space with an incoming webhook URL

Part 1 — GKE: Create a read-only ServiceAccount

Create a ServiceAccount with cluster-wide read permissions (no write, no secrets access).

# sre-agent-rbac.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: sre-agent-ns
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sre-agent
  namespace: sre-agent-ns
---
apiVersion: v1
kind: Secret
metadata:
  name: sre-agent-token
  namespace: sre-agent-ns
  annotations:
    kubernetes.io/service-account.name: sre-agent
type: kubernetes.io/service-account-token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: sre-agent-reader
rules:
  - apiGroups: [""]
    resources: ["namespaces", "pods", "pods/log", "events", "nodes", "services"]
    verbs: ["get", "list"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: sre-agent-reader-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: sre-agent-reader
subjects:
  - kind: ServiceAccount
    name: sre-agent
    namespace: sre-agent-ns

Apply it:

kubectl apply -f sre-agent-rbac.yaml

Extract the bearer token:

kubectl get secret sre-agent-token -n sre-agent-ns \
  -o jsonpath='{.data.token}' | base64 -d

Save this token — you’ll use it as KUBECONFIG_TOKEN.

Get your GKE master endpoint:

kubectl cluster-info
# Kubernetes control plane is running at https://<IP_OR_HOSTNAME>

Part 2 — Cloudflare DNS: expose the GKE API

The GKE master endpoint is on the public internet but addressed by IP. Cloudflare Workers require a hostname. Create a DNS record on your domain:

  Field   Value
  Type    A
  Name    gke-api
  IPv4    <GKE master IP>
  Proxy   Proxied (orange cloud)

Then add a Page Rule for gke-api.yourdomain.com/* that sets SSL to Full.

This lets Cloudflare proxy requests to the GKE API without strict certificate verification (GKE's cert is for its IP, not your subdomain).

The record should be proxied, not DNS-only. DNS-only would expose the GKE IP directly and cause TLS errors from Workers.
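From the Worker's side, everything in this part reduces to building HTTPS requests against the new hostname with the bearer token from Part 1 attached. A minimal sketch in TypeScript (the class and method names are illustrative, not the project's actual source):

```typescript
// Sketch of a read-only Kubernetes API client for Workers. `baseUrl` maps to
// GKE_API_SERVER and `token` to the KUBECONFIG_TOKEN secret set in Part 5.
class KubeHttpClient {
  constructor(private baseUrl: string, private token: string) {}

  // Build the Request separately so it can be inspected without network access.
  buildRequest(path: string): Request {
    return new Request(`${this.baseUrl}${path}`, {
      headers: {
        Authorization: `Bearer ${this.token}`,
        Accept: "application/json",
      },
    });
  }

  async listPods(namespace: string): Promise<unknown> {
    const req = this.buildRequest(`/api/v1/namespaces/${namespace}/pods`);
    const res = await fetch(req);
    if (!res.ok) throw new Error(`GKE API ${res.status}: ${await res.text()}`);
    return res.json();
  }
}
```

Because the hostname is Cloudflare-proxied, this fetch terminates at Cloudflare, which re-encrypts to the GKE master; the Worker never talks to the raw IP.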


Part 3 — Cloudflare: Create infrastructure

Create the required Cloudflare resources:

export CLOUDFLARE_ACCOUNT_ID=<your-account-id>

# KV namespace for deduplication
npx wrangler kv namespace create DEDUPE_KV

# Queues
npx wrangler queues create sre-incident-queue
npx wrangler queues create sre-incident-dlq

Note the KV namespace ID from the output — you’ll need it in wrangler.toml.


Part 4 — Worker: configure wrangler.toml

name = "sre-incident-agent"
main = "src/index.ts"
compatibility_date = "2025-01-01"
compatibility_flags = ["nodejs_compat"]
account_id = "<your-cloudflare-account-id>"

[observability]
enabled = true
head_sampling_rate = 1

[ai]
binding = "AI"

[[durable_objects.bindings]]
name = "INCIDENT_DO"
class_name = "IncidentDO"

[[migrations]]
tag = "v1"
new_sqlite_classes = ["IncidentDO"]

[[queues.producers]]
binding = "INCIDENT_QUEUE"
queue = "sre-incident-queue"

[[queues.consumers]]
queue = "sre-incident-queue"
max_batch_size = 5
max_batch_timeout = 2
max_retries = 3
dead_letter_queue = "sre-incident-dlq"

[[kv_namespaces]]
binding = "DEDUPE_KV"
id = "<kv-namespace-id-from-step-3>"

[vars]
ENVIRONMENT = "production"
GKE_API_SERVER = "https://gke-api.yourdomain.com"
LLM_MODEL = "@cf/google/gemma-4-26b-a4b-it"
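At runtime these bindings surface on the Worker's env object. The interface below is inferred from the config above (a sketch, not the project's actual types), with a small guard that fails fast if a Part 5 secret was never set:

```typescript
// Bindings as declared in wrangler.toml, seen from the Worker. The binding
// value types are left as `unknown` to avoid depending on workers-types here.
interface Env {
  AI: unknown;                      // Workers AI binding
  INCIDENT_DO: unknown;             // Durable Object namespace
  INCIDENT_QUEUE: unknown;          // Queue producer
  DEDUPE_KV: unknown;               // KV namespace
  ENVIRONMENT: string;
  GKE_API_SERVER: string;
  LLM_MODEL: string;
  KUBECONFIG_TOKEN: string;         // secret (Part 5)
  GOOGLE_CHAT_WEBHOOK_URL: string;  // secret (Part 5)
  UPTIME_KUMA_SECRET: string;       // secret (Part 5)
}

// Fail fast on a missing secret rather than erroring mid-investigation.
function assertSecrets(env: Partial<Env>): void {
  const required = ["KUBECONFIG_TOKEN", "GOOGLE_CHAT_WEBHOOK_URL", "UPTIME_KUMA_SECRET"] as const;
  for (const key of required) {
    if (!env[key]) throw new Error(`Missing secret: ${key}`);
  }
}
```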

Part 5 — Worker: set secrets

# GKE ServiceAccount bearer token (from Part 1)
npx wrangler secret put KUBECONFIG_TOKEN

# Google Chat incoming webhook URL
npx wrangler secret put GOOGLE_CHAT_WEBHOOK_URL

# Uptime Kuma webhook bearer token (choose any string, use same value in Kuma)
npx wrangler secret put UPTIME_KUMA_SECRET

Part 6 — Worker: deploy

npm install
npx wrangler deploy

Verify it’s live:

curl https://sre-incident-agent.<your-subdomain>.workers.dev/health
# {"ok":true,"environment":"production"}

Test K8s connectivity (also sends first 10 namespaces to Google Chat):

curl https://sre-incident-agent.<your-subdomain>.workers.dev/test/k8s
# {"ok":true,"message":"Connected to GKE. Found N namespaces..."}
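The /health response shown above needs nothing beyond the ENVIRONMENT var. A handler sketch (route wiring omitted; the actual code in src/index.ts may differ):

```typescript
// Sketch of the /health handler: no dependencies, just echoes the environment
// string that [vars] in wrangler.toml provides.
function handleHealth(environment: string): Response {
  return Response.json({ ok: true, environment });
}
```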

Part 7 — Uptime Kuma: configure notification

In Uptime Kuma → Notifications → Add Notification:

  Field               Value
  Type                Webhook
  Name                SRE Agent
  URL                 https://sre-incident-agent.<subdomain>.workers.dev/webhook/uptime-kuma
  Method              POST
  Content Type        application/json
  Additional Headers  {"Authorization": "Bearer <your-UPTIME_KUMA_SECRET>"}

Hit Test — you should see “Test ping acknowledged” (200 OK). Uptime Kuma test pings send heartbeat: null which the agent handles gracefully without triggering an investigation.

Assign this notification to whichever monitors you want covered.
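The test-ping behavior described above comes down to one branch on the payload. A sketch of the classification (the up-event handling is an assumption; this playbook only exercises down events):

```typescript
// Sketch: distinguish Uptime Kuma test pings from real alerts. Test pings
// arrive with `heartbeat: null`; real heartbeats carry status 0 (down) or 1 (up).
type KumaPayload = {
  heartbeat: { status: number; time: string; msg: string } | null;
  monitor?: { id: number; name: string; url: string };
  msg?: string;
};

function classifyWebhook(p: KumaPayload): "test-ack" | "investigate" | "resolved" {
  if (!p.heartbeat) return "test-ack";  // acknowledge, never start an investigation
  return p.heartbeat.status === 0 ? "investigate" : "resolved";
}
```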


Part 8 — Verify end-to-end

Trigger a fake down event:

curl -X POST https://sre-incident-agent.<subdomain>.workers.dev/webhook/uptime-kuma \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-secret>" \
  -d '{
    "heartbeat": {"status": 0, "time": "2026-01-01T00:00:00Z", "msg": "Test"},
    "monitor": {"id": 1, "name": "my-service", "url": "https://my-service.example.com"},
    "msg": "Test incident"
  }'

Check Google Chat — you should see a thread appear with:

  1. Alert received notification
  2. Investigation step notifications as tool calls fire
  3. A final card with root cause, recommended action, and confidence level
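The threading behavior can be achieved by reusing the incidentId as the Google Chat threadKey; incoming-webhook URLs accept threadKey and messageReplyOption as query parameters. A sketch (function names are illustrative, not the project's source):

```typescript
// Sketch: all messages for one incident land in the same Chat thread by
// passing the incidentId as threadKey on the webhook URL.
function buildChatUrl(webhookUrl: string, incidentId: string): string {
  const url = new URL(webhookUrl);
  url.searchParams.set("threadKey", incidentId);
  url.searchParams.set("messageReplyOption", "REPLY_MESSAGE_FALLBACK_TO_NEW_THREAD");
  return url.toString();
}

async function postChatMessage(webhookUrl: string, incidentId: string, text: string) {
  await fetch(buildChatUrl(webhookUrl, incidentId), {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
}
```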

Check the incident state:

curl https://sre-incident-agent.<subdomain>.workers.dev/incidents/<incidentId>

Architecture reference

Uptime Kuma
    │  POST /webhook/uptime-kuma

Cloudflare Worker (sre-incident-agent)
    │  verify bearer token
    │  deduplicate (KV)
    │  notify Google Chat: "received"
    │  enqueue (Workers Queue)

Queue Consumer (same Worker)
    │  dispatch to Durable Object by incidentId

IncidentDO (Durable Object)
    │  POST /start → schedule alarm

alarm() loop [up to 20 turns, 3s between alarms]

    ├── Gemma 4 (Workers AI)
    │       calls tools: list_pods, get_deployment_status,
    │                     get_recent_events, get_pod_logs,
    │                     get_rollout_status, get_node_readiness

    ├── KubeHttpClient
    │       → https://gke-api.yourdomain.com  (Cloudflare proxy)
    │       → GKE master endpoint             (Full SSL, bearer token)

    └── Google Chat webhook (threaded by incidentId)
            "investigation step N"
            "complete: root cause + recommended action"
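The "deduplicate (KV)" step above can be as small as a get-then-put with a TTL. A sketch; the key shape and the 15-minute window are assumptions, not the project's actual values:

```typescript
// Sketch of KV-based dedupe: at most one investigation per monitor per window.
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, opts?: { expirationTtl?: number }): Promise<void>;
}

async function shouldInvestigate(kv: KVLike, monitorId: number): Promise<boolean> {
  const key = `incident:monitor:${monitorId}`;
  if (await kv.get(key)) return false;             // already being investigated
  await kv.put(key, "1", { expirationTtl: 900 });  // suppress duplicates for 15 min
  return true;
}
```

KV's expirationTtl handles cleanup automatically, so a flapping monitor produces one investigation per window instead of one per heartbeat.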

Security notes

  1. The ServiceAccount is read-only: get/list on workloads, logs, and events; no write verbs and no access to Secrets.
  2. All credentials (KUBECONFIG_TOKEN, GOOGLE_CHAT_WEBHOOK_URL, UPTIME_KUMA_SECRET) live as Worker secrets, never in wrangler.toml or source control.
  3. The /webhook/uptime-kuma endpoint rejects requests that lack the expected bearer token.
  4. Cloudflare proxies the GKE API with SSL set to Full, which encrypts traffic without strict upstream certificate verification; the ServiceAccount token remains the real access control, so guard it accordingly.

Tuning

  Parameter         Location                 Default                        Notes
  Max tool calls    loop.ts MAX_TURNS        20                             Increase for complex clusters
  Alarm interval    IncidentDO.ts            3s                             Rate limit buffer for Workers AI
  Queue batch size  wrangler.toml            5                              Max concurrent incidents processed
  Queue retries     wrangler.toml            3                              Before DLQ
  LLM model         wrangler.toml LLM_MODEL  @cf/google/gemma-4-26b-a4b-it  Any Workers AI model with tool calling
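The MAX_TURNS budget bounds the Durable Object alarm loop. A simplified sketch of that control flow (runTurn stands in for one model call plus tool execution; in the real Worker the gap between turns is a storage alarm, not a sleep):

```typescript
// Sketch of the investigation loop's turn budget: re-run until the model
// declares the investigation done or the budget is exhausted.
const MAX_TURNS = 20;

async function runInvestigation(
  runTurn: (turn: number) => Promise<"continue" | "done">,
): Promise<{ turns: number; finished: boolean }> {
  for (let turn = 1; turn <= MAX_TURNS; turn++) {
    if ((await runTurn(turn)) === "done") return { turns: turn, finished: true };
    // Real Worker: state.storage.setAlarm(Date.now() + 3000) and return here;
    // the 3 s gap between alarms buffers Workers AI rate limits.
  }
  return { turns: MAX_TURNS, finished: false };
}
```

Raising MAX_TURNS gives the model more tool calls for complex clusters at the cost of a longer worst-case investigation (20 turns at 3 s between alarms is already about a minute of pacing alone).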