claude-code·8 min read·18 May 2026

Claude Code observability step 2: alerting rules (UK indie hacker, 2026)

Yesterday's post got Claude Code metrics into Grafana. Today is what you do with them - Prometheus Alertmanager rules for daily cost thresholds, Pool B drain alerts, failed-job alarms, and runaway-loop detection. UK indie hacker setup, GBP costs, four alerts you actually need.

By IdeaStack

The SigNoz post yesterday got the metrics flowing. You have a Grafana dashboard that shows cost-per-session, token-spend-per-model, agent run counts, and lines of code changed. That is the collect half of observability.

The act half is alerting. A dashboard you have to look at is not an alert. The point of the metrics is to wake you up only when something is actually wrong - a daily cost overshoot, a Pool B drain rate that will blow through your monthly cap, a failed job, an agent stuck in a runaway loop. This post is the four alerts every UK indie hacker actually needs, the Prometheus Alertmanager rules that fire them, a free Slack receiver, and how to wire the systemd OnFailure handler into the same pipe.

The four alerts that earn their keep

After running self-hosted observability on a fleet of UK indie hacker Claude agents for a few months, these are the four that actually fire and matter. Everything else is dashboard porn.

Alert	Threshold	Fires when	Action
Daily cost overshoot	GBP per day	Rolling 24h spend > threshold	Check which agent burned tokens
Pool B drain rate	Projected month-end	Projection > 90% of cap	Check spend curve, cap aggressive agents
Failed job	Any non-zero	Agent run exits non-zero	Read journalctl, fix and re-run
Runaway loop	Requests/min	Rate > N for > 2 min	Stop the service, inspect logs

Two of these (daily cost, runaway loop) are tripwires that prevent a quiet GBP 30+ bill at the end of the month. Two (Pool B drain, failed job) are operational signals that something needs human attention this week.

The Prometheus rules

Self-hosted Prometheus on the same Hetzner box from yesterday, scraping the Claude Code OpenTelemetry collector. The rules live in /etc/prometheus/alerts.yml. Reload Prometheus to pick them up.

# /etc/prometheus/alerts.yml
groups:
  - name: claude-code
    interval: 30s
    rules:

      # Daily cost overshoot - rolling 24h spend per agent
      - alert: ClaudeCodeDailyCostOver
        expr: |
          sum by (agent) (
            increase(claude_code_cost_usage_total[24h])
          ) > 1.50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Claude agent {{ $labels.agent }} spent over GBP 1.50 in 24h"
          description: "24h spend: GBP {{ $value }}"

      # Pool B drain rate projection - 7d daily average scaled to 30 days
      - alert: ClaudeCodePoolBDrainHigh
        expr: |
          (
            sum(increase(claude_code_pool_b_cost_total[7d])) / 7
          ) * 30 > 18.00
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Projected Pool B spend > GBP 18 this month"
          description: "Current 7d-avg-projected: GBP {{ $value }} of GBP 20 cap"

      # Failed job - any agent run with non-zero exit
      - alert: ClaudeCodeJobFailed
        expr: |
          increase(claude_code_job_exit_code_nonzero_total[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Claude agent {{ $labels.agent }} failed"
          description: "Check journalctl -u {{ $labels.unit }} --since -5m"

      # Runaway loop - request rate > 60/min for 2 min
      # rate() returns a per-second value, hence the * 60
      - alert: ClaudeCodeRunawayLoop
        expr: |
          rate(claude_code_requests_total[1m]) * 60 > 60
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Claude agent {{ $labels.agent }} runaway loop suspected"
          description: "Request rate {{ $value }}/min for 2+ min - check + stop service"

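The rules assume your collector exposes claude_code_cost_usage_total, claude_code_pool_b_cost_total, claude_code_job_exit_code_nonzero_total, and claude_code_requests_total under exactly those names. Exporters can rename metrics on the way through (adding unit suffixes, for example), so check the collector's /metrics output first - a rule that references a metric that does not exist never fires and never reports an error. Your scrape config in prometheus.yml should already be pointed at the OTEL collector exposing them.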

A note on the cost rule. The Prometheus increase() function over 24h gives you the cumulative spend in the last day, so this is a rolling window rather than a calendar day. Claude Code reports its cost metric in USD, so convert before comparing against a GBP threshold (or read 1.50 as dollars). The GBP 1.50 threshold is what works for a typical UK indie hacker with two or three agents - set it higher if you run heavier agents, lower if your agents are mostly cheap Sonnet calls. The point is to catch an outlier, not to fire on normal usage.
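The Pool B rule above is plain arithmetic: average the last seven days of spend, scale to a 30-day month, compare with GBP 18 (90% of a GBP 20 cap). A minimal sketch of that calculation, useful for sanity-checking a threshold before you commit it. The function names and the GBP 4.55 sample figure are hypothetical, not part of any tool.

```shell
#!/bin/bash
# project_month_end <spend_last_7d>
# Mirrors the rule's expression: (7-day spend / 7) * 30, to two decimals.
project_month_end() {
  awk -v s="$1" 'BEGIN { printf "%.2f\n", s / 7 * 30 }'
}

# over_threshold <projected> <threshold>
# Exit status 0 when projected is strictly greater than threshold.
over_threshold() {
  awk -v p="$1" -v t="$2" 'BEGIN { exit !(p > t) }'
}

# Hypothetical input: GBP 4.55 spent over the last 7 days.
projected=$(project_month_end 4.55)
if over_threshold "$projected" 18.00; then
  echo "projected GBP $projected: alert would fire"
else
  echo "projected GBP $projected: under threshold"
fi
```

Because the projection assumes a flat 30-day month and a steady daily rate, one heavy day inside the 7-day window will keep the projection elevated for a week. The for: 30m clause only filters out brief blips, not that lag.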

The Alertmanager config and the free Slack webhook

Alertmanager turns Prometheus alert events into messages. A minimal config that posts to one Slack channel for everything.

First, the free Slack webhook. Create a Slack workspace (or use an existing one), open https://api.slack.com/apps, click "Create New App", pick "From scratch", name it claude-alerts. In the app settings, enable "Incoming Webhooks", add a webhook to a channel, copy the webhook URL. Total time about 5 minutes.

Then the Alertmanager config at /etc/alertmanager/alertmanager.yml:

# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ["alertname", "agent"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: slack-claude-alerts

receivers:
  - name: slack-claude-alerts
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T0XXXX/B0YYYY/zzzzzzzzzzzzzz"
        channel: "#claude-status"
        title: "{{ .GroupLabels.alertname }} - {{ .GroupLabels.agent }}"
        text: |
          {{ range .Alerts }}
          *Summary:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          {{ end }}
        send_resolved: true

The group_by clause means distinct alerts that share an alertname and agent - three failed jobs from the same agent inside the 30-second group_wait, say - arrive as one Slack message rather than three. The repeat_interval: 4h stops a still-firing alert from spamming the channel - you get one reminder every four hours until you fix the cause or silence the alert in Alertmanager.
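To make the grouping key concrete, here is a rough sketch that buckets a burst of alerts by (alertname, agent) the way group_by does. It only mimics the key, not Alertmanager's timing (group_wait, group_interval). The agent names and the group_alerts helper are made up for illustration.

```shell
#!/bin/bash
# group_alerts: reads "alertname agent" lines on stdin and prints one line
# per distinct (alertname, agent) pair with the number of alerts in it.
group_alerts() {
  sort | uniq -c | awk '{ printf "%s/%s: %d alert(s), 1 message\n", $2, $3, $1 }'
}

# Hypothetical burst: three failures from one agent, one from another.
printf '%s\n' \
  "ClaudeCodeJobFailed morning-brief" \
  "ClaudeCodeJobFailed morning-brief" \
  "ClaudeCodeJobFailed inbox-triage" \
  "ClaudeCodeJobFailed morning-brief" \
  | group_alerts
```

Four alerts collapse into two groups, so Slack would see two messages. Adding a label to group_by splits groups further, and removing one merges them.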

Wire Prometheus to Alertmanager in prometheus.yml:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

rule_files:
  - "/etc/prometheus/alerts.yml"

Reload both with systemctl reload prometheus and systemctl reload alertmanager. From now on every threshold breach lands in #claude-status within 30 seconds of the rule firing.
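A reload with a broken file is easy to miss, so it is worth validating first. A small sketch, assuming promtool (ships with Prometheus) and amtool (ships with Alertmanager) are on the PATH and that both systemd units support reload. The validate_and_reload wrapper is a hypothetical helper, not part of either tool.

```shell
#!/bin/bash
# validate_and_reload: check the rule file and the Alertmanager config,
# and only reload the services if both checks pass.
validate_and_reload() {
  # Paths are the ones used in this post.
  promtool check rules /etc/prometheus/alerts.yml || return 1
  amtool check-config /etc/alertmanager/alertmanager.yml || return 1
  systemctl reload prometheus && systemctl reload alertmanager
}
```

Run validate_and_reload as root after editing either file. If a check fails, nothing is reloaded and the running services keep the previous config.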

The systemd OnFailure handler - same pipe

The systemd OnFailure= handler from the Linux scheduler post should land in the same Slack channel. That way every failure - whether it came from a Prometheus rule (cost overshoot, runaway loop) or from a process crash (segfault, exit non-zero, OOM kill) - funnels into one pane of glass.

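The handler unit at /etc/systemd/system/notify-failure@.service:

[Unit]
Description=Send failure notice to Slack for %i

[Service]
Type=oneshot
EnvironmentFile=/etc/claude/slack-webhook.env
ExecStart=/usr/local/bin/notify-failure.sh %i

The script it calls, at /usr/local/bin/notify-failure.sh (chmod +x):

#!/bin/bash
# /usr/local/bin/notify-failure.sh <unit>
set -euo pipefail

UNIT="$1"
# Last two minutes of the failed unit's journal, capped at 50 lines.
CONTEXT=$(journalctl -u "$UNIT" --since "-2 minutes" --no-pager | tail -50)
TEXT=$(printf '*systemd failure:* %s\n```%s```' "$UNIT" "$CONTEXT")

# jq escapes quotes and newlines in the log text so the JSON stays valid.
PAYLOAD=$(jq -nc \
  --arg channel "#claude-status" \
  --arg text "$TEXT" \
  '{ channel: $channel, text: $text }')

curl -sS -X POST -H "Content-Type: application/json" \
  -d "$PAYLOAD" "$SLACK_WEBHOOK_URL"

The %i is the failed unit name passed through by systemd. The logic sits in a script rather than inline in ExecStart= because systemd applies its own % and $ expansion to command lines before bash sees them, which makes inline quoting fragile. journalctl --since "-2 minutes" grabs the last two minutes of logs for context. jq builds the JSON safely (it handles quotes and newlines in log output). EnvironmentFile= points at a chmod 600 file holding SLACK_WEBHOOK_URL, so the webhook URL stays out of the unit file and the journal.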

Wire it from the main service file:

# /etc/systemd/system/morning-brief.service
# OnFailure= is a [Unit] option; systemd ignores it under [Service].
[Unit]
...
OnFailure=notify-failure@%n.service

[Service]
...

Now a claude -p crash inside morning-brief.service triggers notify-failure@morning-brief.service.service, which posts the last two minutes of journalctl context to #claude-status within seconds. Same channel as the Prometheus alerts.

The operational pattern that drops out: one Slack channel, all signals. Cost overshoot, Pool B drain, runaway loop, process crash, exit non-zero - one mental model for "is anything wrong".

The daily-summary cron - proactive health signal

Tripwires fire on breaches. A daily summary fires once a day regardless, giving you a heartbeat on normal-state spend. Indie hackers who run only tripwires sometimes go three weeks without thinking about the agents, then get a surprise. A 23:55 daily summary is the cheapest insurance policy.

A simple systemd timer that runs a script:

#!/bin/bash
# /usr/local/bin/claude-daily-summary.sh
set -euo pipefail

source /etc/claude/slack-webhook.env

# Run an instant query; print the first value, or 0 if the result is empty.
prom_query() {
  curl -sS "http://localhost:9090/api/v1/query" \
    --data-urlencode "query=$1" \
    | jq -r '.data.result[0].value[1] // "0"'
}

TODAY_COST=$(prom_query 'sum(increase(claude_code_cost_usage_total[24h]))')
TODAY_RUNS=$(prom_query 'sum(increase(claude_code_jobs_total[24h]))')
TODAY_FAILS=$(prom_query 'sum(increase(claude_code_job_exit_code_nonzero_total[24h]))')

# increase() extrapolates, so round counts to whole numbers and cost to pence.
TEXT=$(printf '*Daily summary:* GBP %.2f | %.0f runs | %.0f failures' \
  "$TODAY_COST" "$TODAY_RUNS" "$TODAY_FAILS")

curl -sS -X POST -H "Content-Type: application/json" \
  -d "$(jq -nc --arg channel "#claude-status" --arg text "$TEXT" \
        '{ channel: $channel, text: $text }')" \
  "$SLACK_WEBHOOK_URL"

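A timer activates the service unit of the same name, so it needs a matching claude-daily-summary.service next to it (Type=oneshot, ExecStart=/usr/local/bin/claude-daily-summary.sh). The systemd timer itself lives at /etc/systemd/system/claude-daily-summary.timer: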

[Unit]
Description=Claude daily summary to Slack

[Timer]
OnCalendar=*-*-* 23:55:00
Persistent=true

[Install]
WantedBy=timers.target

Enable: sudo systemctl enable --now claude-daily-summary.timer. From now on every night at 23:55 you get a one-line ping in Slack: "Daily summary: GBP 2.14 | 47 runs | 0 failures". Healthy days look identical, problem days stick out instantly.

The GBP cost line for this whole setup

Total recurring cost of step-2 alerting on top of yesterday's step-1 stack: GBP 0. Everything self-hosted on the same Hetzner box. Slack webhooks free. Alertmanager and Prometheus open source.

The only spend that changed is the Claude spend itself, which should now go down, not up. The runaway-loop and daily-cost alerts catch the failures that used to silently eat your budget. Multiple indie hackers running this setup have saved GBP 20-50/month in caught runaway loops alone.

The honest comparison with managed alternatives. PagerDuty + Datadog for the same alert surface costs around USD 50/mo entry, scaling to hundreds. Better Stack free tier handles up to 3 monitors. Grafana Cloud free tier handles the rules but adds Grafana Cloud lock-in. For a UK indie hacker on the Hetzner VPS, self-host wins - same alert surface, GBP 0 recurring.

The UK indie hacker observability stack - collect plus alert

Joining yesterday and today: the full observability story is two layers. Collect with the Claude Code OpenTelemetry pipeline into Prometheus and Grafana. Alert with Prometheus Alertmanager and the systemd OnFailure handler, both posting to one Slack channel.

The end-state: every agent run is metric-tracked, dashboarded, and tripwire-alerted. Failed jobs ping Slack within seconds with the journalctl context. Cost overshoots fire when an agent's rolling 24h spend crosses the threshold. Pool B drain fires when you are projected to blow through your cap. Runaway loops fire within two minutes. Daily summaries keep you honest on quiet days.

Total recurring infrastructure cost: GBP 3.30/mo for the Hetzner ARM box (already covering all the other agents). Total setup time once you have step 1 in place: about 45 minutes for step 2. The payoff is that you stop worrying about whether the agents are working.

That is the entire shape of production for a UK indie hacker running Claude Code in 2026. A cheap box, systemd timers for the simple jobs, WDK on Vercel for the multi-step ones, OTEL collecting metrics into Grafana, Prometheus alerting into one Slack channel. Five small pieces wired together. Less than an hour of setup beyond the per-agent code. The agents now actually work, and you know when they don't.

Frequently asked

Do I need Alertmanager if Grafana has its own alerting?

You can use Grafana alerting alone for a single-stack indie hacker setup - it shipped a unified alerting system in Grafana 8 and the rules live alongside the dashboards. Alertmanager is the better choice when you are running multiple Prometheus instances, want alert deduplication and grouping across hosts, or already have Alertmanager somewhere for other infrastructure. For a UK indie hacker on a single Hetzner box, Grafana alerting is the simpler default. The rules above work in both - the YAML is similar, only the wrapper differs.

What does Pool B drain rate actually measure?

Pool B is the new programmatic credit pool that Anthropic split out on 15 June 2026 - all `claude -p` and Agent SDK usage draws from it, capped at your subscription price. The drain rate is how fast you are burning through that monthly allowance. The useful alert fires when you are on track to hit your cap before the month ends. The Prometheus rule above takes the average daily spend over the last seven days, scales it to a 30-day month, and fires when that projection exceeds 90% of the cap (GBP 18 of a GBP 20 cap). The point is to catch a runaway agent before it eats your whole allowance, not to alert on normal usage.

How do I get a Slack webhook for free?

Create a free Slack workspace (or use an existing one), go to Slack apps, search for Incoming Webhooks, install it to a channel, copy the webhook URL. The free tier is rate-limited to one message per second per webhook which is more than enough for alert traffic. Paste the URL into your Alertmanager config as the receiver. Total time about 5 minutes. No paid Slack plan required - the webhook tier is free indefinitely and the channel is yours forever.

What is a runaway-loop detector and why do I need one?

A runaway loop is when your Claude Code agent gets stuck calling itself or hitting the same tool repeatedly without terminating. It is the worst possible failure mode because it silently burns tokens at maximum rate until you notice. The detector is a simple Prometheus rule that fires when an agent's request rate exceeds a reasonable upper bound (say 60 requests per minute for an agent that normally does 1-2 per minute). When the alert fires you stop the systemd service and check the logs. It has saved indie hackers from GBP 50 surprise spends more than once and is the alert most setups miss until they get bitten.
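One detail that trips people up: Prometheus rate() returns a per-second average, so 60 requests per minute corresponds to a rate() value of 1, not 60. A quick sketch of the conversion from two counter readings, with made-up sample numbers and hypothetical helper names:

```shell
#!/bin/bash
# per_minute <counter_start> <counter_end> <window_seconds>
# Converts a counter delta over a window into requests per minute.
per_minute() {
  awk -v a="$1" -v b="$2" -v w="$3" 'BEGIN { printf "%.1f\n", (b - a) / w * 60 }'
}

# is_runaway <requests_per_minute> <limit>
# Exit status 0 when the rate is strictly above the limit.
is_runaway() {
  awk -v r="$1" -v l="$2" 'BEGIN { exit !(r > l) }'
}

normal=$(per_minute 1000 1002 60)   # hypothetical healthy agent
stuck=$(per_minute 1000 1090 60)    # hypothetical looping agent
for r in "$normal" "$stuck"; do
  if is_runaway "$r" 60; then
    echo "$r req/min: runaway"
  else
    echo "$r req/min: ok"
  fi
done
```

Set the limit well above the agent's normal rate so a busy but healthy run does not page you. A limit of 60 against a baseline of 1-2 per minute leaves plenty of headroom.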

Can I do all this without running Prometheus myself?

Yes. The same alert rules can be expressed in Grafana Cloud's free tier (10k series, 50GB logs, 14 days retention) or in Vercel's observability integrations. The trade is convenience versus the GBP 0 self-hosted Hetzner stack from yesterday's post. For a UK indie hacker who already owns the Hetzner box, the self-hosted SigNoz or Prometheus + Grafana stack is the cheapest answer. For someone who wants zero-ops, Grafana Cloud free or Better Stack free tier handles the same alert rules with a click-through UI. Pick by where you want the ops burden, not by what is technically possible.
