Concepts

The dashboard is meant to be self-explanatory, but a few behaviors are worth understanding up front: when alerts re-fire, when older data disappears, and what the various paused / disabled states on rules and channels actually mean.

Alert lifecycle

Each rule type has its own trip condition. worker_offline trips after a worker hasn't been heard from for absence_seconds; queue_depth_threshold trips when depth crosses N for duration_seconds; and so on. Those settings answer "when does this rule first fire?"

Once it has fired, two settings shared by every rule type (cooldown_seconds and renotify_seconds) plus an automatic incident-dedupe mechanism answer the separate question: "how often do you want to hear about it after that?" Those are what this section covers.

Cooldown

cooldown_seconds is an anti-jitter floor. After a rule fires, the engine refuses to re-fire it again until cooldown elapses, regardless of whether the underlying condition cleared and re-tripped or stayed broken the whole time. Defaults to 300 seconds (5 min).

Cooldown applies to all re-fires, including fires for what's arguably a brand-new incident. It exists to keep a flapping condition from paging you every evaluation tick.

Incident dedupe

Cooldown alone is not enough for some rule types. A 1-minute cron with beat down would fire roughly every 5 minutes at default cooldown: twelve pings an hour for a single ongoing outage. So on top of cooldown, the engine deduplicates by incident:

beat_miss: fires once per outage. The "incident" is anchored to the schedule's last successful fire; as long as that doesn't advance, additional missed windows don't generate new alerts. The next outage gets a fresh anchor automatically once the schedule recovers.
worker_offline: fires once per offline period. Anchored to the worker's last_seen; a worker that comes back and goes offline again will alert again.
queue_depth_threshold: not deduped by incident. Depth oscillates around thresholds in normal operation, and "still backed up" is usually the customer's operative concern. Cooldown alone applies, and the rule re-fires on every tick past the cooldown window for as long as the breach persists.
task_failure_rate: also not deduped. Failure rates flap; "still failing" is the operative signal.

The practical default for the deduped rule types (beat_miss, worker_offline) is one alert per incident, ever, until it recovers and re-breaks.

Renotify (opt-in reminders)

renotify_seconds is how you opt into reminders during a single open incident. Set it to 14400 (4 hours), say, and the deduped rule types will re-fire every 4 hours that the incident stays open. Leave it unset and you get the single-alert-per-incident default.

Renotify must be larger than cooldown to do anything visible. If you set cooldown=300 and renotify=60, cooldown is the binding constraint and you'll see one ping every 5 minutes regardless. The dashboard doesn't enforce ordering, so set them deliberately.

For the rule types that aren't deduped (queue_depth_threshold, task_failure_rate), renotify is irrelevant; cooldown is the only knob that matters.

Failed deliveries don't count

Both cooldown and renotify gates only consider successful dispatches. If your Discord webhook 500s, the rule keeps re-evaluating on the next tick and tries again; failed dispatches are not anchors. The audit trail still records the failed attempt; it just doesn't reset the timer.

Bounds on rule config

Each rule type has minimum bounds enforced at create time, both server-side and in the form's HTML5 validation:

worker_offline.absence_seconds: minimum 100s. Heartbeats land every 30s and the engine evaluates every 60s; lower values would fire on every tick.
queue_depth_threshold.duration_seconds: minimum 30s, the broker poll cadence.
beat_miss.consecutive_misses: minimum 1.
task_failure_rate.window_seconds: between 600s (10 min) and 21600s (6 hours). Shorter windows rarely accumulate enough samples to evaluate; longer windows dilute spikes too much to be useful.
task_failure_rate.threshold_pct: between 1 and 100.

task_failure_rate also has a hidden minimum-samples gate of 10. Without it, one failure on a quiet queue is 1/1 = 100% failure rate and would page on the first failure.

Retention

Each plan retains historical data for a fixed window:

Free: 7 days
Developer: 30 days
Business: 90 days

An hourly cleanup pass deletes anything older than the cutoff. This applies to task events, worker heartbeats, queue depth samples, beat execution rows, and alert event history. The cutoff is rolling, so the oldest data falls off as new data lands.

Why some charts go blank

The /tasks/ log and the /tasks/breakdown/ page both let you pick a range pill (1h · 24h · 7d · 30d · all). If you pick a range longer than your plan retains, the underlying rows aren't there to chart; they've been pruned.

The dashboard handles this gracefully: a green banner appears above the panel ("Your plan keeps 7 days of history. Older data has been pruned. Upgrade to keep more.") and the chart shows whatever data is actually available within your retention window. The query is automatically clamped, so the visible rows match what the plan provides; you don't get a partially-empty chart.

Business is the top tier; on Business + "all", the banner doesn't appear because there's nowhere higher to upgrade to.

Why a rule or channel might say "paused"

Two states can take a rule or channel out of service automatically. The dashboard renders them with distinct badges so you can tell which is which. (There's no manual on/off toggle: rules and channels are either active or deleted. To stop a rule firing, delete it.)

Paused (plan limit)

The Free plan caps rules at 3 and channels at 1, with email as the only allowed channel type. Paid plans (Developer and Business) are unlimited on both rules and channels, and unlock the other channel types. So in practice, "paused (plan limit)" on a rule or channel only happens after a paid → Free downgrade that leaves you with more rows than Free allows, or with channels of a type Free doesn't permit.

When that happens, the engine doesn't delete your over-cap rules or channels. It pauses them. Paused rows render with a gray "paused" badge at reduced opacity. They keep existing on disk; the engine ignores them. Upgrade back to a paid plan and they automatically un-pause and resume firing; your config survives the round trip.

This is intentional: a downgrade-then-upgrade round trip preserves all your work. We never overwrite your intent (whether you wanted the rule on or off); we only flip a separate plan_paused flag that the engine reads alongside.

When the engine has to choose which rows to pause, it favors keeping rows whose channel can actually deliver. A scenario from a real downgrade: customer has 4 Discord rules + 1 newer email rule, and downgrades to Free (which only allows email). The naive "oldest rules win" heuristic would pause the email rule. The actual heuristic notices that the email rule is the only one whose channel can still deliver under Free, and keeps it.

Auto-disabled (channel keeps failing)

If a notification channel fails to deliver 10 times in a row (wrong webhook URL, expired Slack token, recipient on the email provider's bounce list), the engine flips it to "auto-disabled" and emails you once at the moment of the flip.

The dashboard doesn't expose a "re-enable" or "edit" button for auto-disabled channels. The recovery path is to delete the channel and create a fresh one. Reasoning: 10 consecutive failures means the underlying configuration is most likely broken; re-enabling it against the same broken config just runs through 10 more failures. Forcing a delete-and-recreate makes the recovery deliberate.

Rules pointing at an auto-disabled channel keep evaluating but their fires route to a console fallback (which is to say: the alert event is recorded for audit, but no real notification goes out). The /rules/ page surfaces a red ⚠ inline note next to any rule whose channel can't currently deliver, so you can see at a glance which alerts are silently dead.

Why your worker might say "monitoring paused"

Plans cap the number of workers you can monitor (Free 2, Developer 15, Business 100). When you exceed the cap, the freshest-heartbeat workers stay active and the older ones flip to "monitoring paused", distinct from offline. A returning paused worker can promote itself back to active by being among the freshest-N on its next heartbeat.

Paused workers don't trigger worker_offline alerts. The cap applies continuously, so rotating workers in and out doesn't get you more monitored hosts than your plan allows; only the freshest-N at any moment are active.

Plan limits are caps, not gates. Your over-cap rows aren't deleted, just paused. Re-upgrade and everything you had comes back exactly as it was. The single exception is auto-disabled channels: those need a delete-and-recreate because the underlying configuration is broken.

Reference

What gets monitored

The four signal types CeleryRadar collects (tasks, workers, beat schedules, queue depth) and how each is detected.

Troubleshooting

The handful of issues that come up in practice when wiring up the SDK, and how to resolve each.