Troubleshooting

The handful of issues that come up in practice. If you hit something that isn't here, tell us — we'll add it.

Tasks aren't appearing in the dashboard

Work through these in order:

  1. Is the SDK actually loaded? Restart your worker with verbose logging and look for SDK messages on startup. The SDK is silent on the happy path but logs on every failure.
  2. Is connect() being called? A common mistake is putting celeryradar_sdk.connect() inside a function that never runs at import time. The call has to happen during module load, before Celery starts dispatching signals.
  3. Is the API key right? Wrong key produces 401 responses, which the SDK logs at WARNING level. Tail your worker log for ingest returned 401.
  4. Can the worker reach the ingest endpoint? Try curl https://api.celeryradar.com/ingest/ from inside the worker's environment. Network egress restrictions on locked-down deploys are a frequent culprit.
  5. Did the queue fill up? If the SDK's main queue is full, it logs celeryradar_sdk: ingest queue full; dropped N events at most once a minute. That happens during sustained ingest outages or when the worker is firing more than 1000 events between successful POSTs — neither is normal.
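The drop behavior in step 5 can be sketched in a few lines. This is an illustration of drop-newest semantics with a rate-limited drop log, not the SDK's actual internals; the class name, the default maxsize of 1000, and the once-a-minute interval are taken from the description above.

```python
import queue
import time

class DropNewestEventQueue:
    # Illustrative sketch, not the SDK's real implementation: a bounded
    # queue with drop-newest semantics and a rate-limited drop log.
    def __init__(self, maxsize=1000, log_interval=60.0):
        self._q = queue.Queue(maxsize=maxsize)
        self.dropped = 0          # cumulative drop count
        self._last_log = None
        self._log_interval = log_interval

    def put(self, event):
        try:
            self._q.put_nowait(event)
        except queue.Full:
            # Queue is full: drop the *new* event, keep the old ones.
            self.dropped += 1
            now = time.monotonic()
            if self._last_log is None or now - self._last_log >= self._log_interval:
                print(f"celeryradar_sdk: ingest queue full; "
                      f"dropped {self.dropped} events")
                self._last_log = now

q = DropNewestEventQueue(maxsize=3)
for i in range(5):
    q.put(i)
print(q.dropped)  # 2
```

The point of drop-newest is that the oldest buffered events (usually the ones closest to a task's actual start) survive an outage, while the overflow is counted rather than silently lost.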

Worker shows offline but it's running

Almost always one of two things:

Ephemeral hostnames in containers

In Kubernetes pods, ECS tasks, and Docker containers without an explicit name, socket.gethostname() returns a name that rotates on every restart. Each restart creates a fresh worker row in the dashboard, and the previous row's last_seen ages out — so the row labelled with yesterday's hostname correctly shows offline (it really is offline; that name no longer exists), while your actually-running worker is sitting under a new row with this morning's hostname. The dashboard ends up with one online row and a growing pile of offline ghosts.

Set CELERYRADAR_WORKER_NAME in your deployment manifest to a stable per-deployment value:

# Kubernetes
env:
  - name: CELERYRADAR_WORKER_NAME
    value: "celery-worker-prod"

# docker-compose
environment:
  CELERYRADAR_WORKER_NAME: "celery-worker-prod"

You can delete the ghost rows from the workers page (the dashboard's per-row delete button), or just wait: retention pruning sweeps them once their last heartbeat ages past your plan's retention window.
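The name resolution described above boils down to one lookup. This is our assumption about the precedence order (explicit env var wins, hostname is the fallback), written as a sketch; the helper name is illustrative:

```python
import os
import socket

def resolve_worker_name():
    # Assumed resolution order, matching the fix above: an explicit
    # CELERYRADAR_WORKER_NAME wins; otherwise fall back to the
    # (possibly ephemeral) container hostname.
    return os.environ.get("CELERYRADAR_WORKER_NAME") or socket.gethostname()

os.environ["CELERYRADAR_WORKER_NAME"] = "celery-worker-prod"
print(resolve_worker_name())  # celery-worker-prod
```

With the env var set, every restart of the pod reports under the same worker row instead of minting a new one.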

Heartbeat events are being dropped

If the SDK's ingest queue is filling up during a prolonged ingest outage, heartbeat events can drop along with everything else. The retry queue is designed to catch this (heartbeats use retry=True), but if the outage is long enough, even the retry queue starts dropping its oldest entries. Enable DEBUG logging on the celeryradar_sdk logger and look for send failed lines to confirm the endpoint isn't actually reachable. (The SDK's per-failure messages are at DEBUG level so they don't spam your normal logs.)
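Drop-oldest behavior is easy to picture with Python's deque. This is an illustration of the semantics, not the SDK's actual data structure; the ~100-event capacity comes from the retry-queue figure mentioned elsewhere on this page:

```python
from collections import deque

# A deque with maxlen gives drop-oldest semantics: appending to a full
# deque silently evicts the oldest entry.
retry_queue = deque(maxlen=100)

for seq in range(150):
    retry_queue.append({"event": "heartbeat", "seq": seq})

print(len(retry_queue))       # 100
print(retry_queue[0]["seq"])  # 50 -- the first 50 events were evicted
```

So after a long enough outage, the events that replay on recovery are the most recent ~100, and the earliest heartbeats from the outage window are gone.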

Phantom alerts after a CeleryRadar outage

If our ingest endpoint goes down, your worker can't ship heartbeats. From the alert engine's perspective, that looks identical to your worker actually being offline. Without protection, a CR-side outage would page you about your perfectly-fine workers.

Three layers of protection are in place:

  1. The SDK retry queue holds heartbeats and beat fire events through the outage and replays them after recovery.
  2. The backend's heartbeat upsert uses GREATEST(existing, incoming) semantics so retried backfill events can't push last_seen backward.
  3. The alert engine has a 10-minute startup grace period after worker recovery during which absence-based rules (worker_offline, beat_miss) skip evaluation. This covers the window where retry-queue backfill is still landing.
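Layer 2 is worth spelling out, since it's what makes replayed backfill safe. A minimal sketch of GREATEST(existing, incoming) semantics, with an illustrative in-memory store standing in for the backend's table:

```python
from datetime import datetime, timezone

last_seen = {}  # worker name -> most recent heartbeat timestamp

def upsert_heartbeat(worker, ts):
    # GREATEST(existing, incoming): a replayed backfill event arriving
    # out of order can never move last_seen backward.
    prev = last_seen.get(worker)
    last_seen[worker] = ts if prev is None else max(prev, ts)

upsert_heartbeat("worker-1", datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc))
# A retried heartbeat from earlier in the outage lands late:
upsert_heartbeat("worker-1", datetime(2024, 1, 2, 11, 0, tzinfo=timezone.utc))
print(last_seen["worker-1"].hour)  # 12
```

Without the max(), the late-arriving 11:00 event would rewind last_seen and could re-trigger a worker_offline rule mid-recovery.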

If you're still seeing phantom alerts after a CR-side outage, it usually means the outage was longer than the SDK retry queue's capacity (~100 events of buffer per worker process) and longer than the 10-minute grace window. Email us with the timestamps and we'll dig into it.

Queue depth charts are blank

Three causes:

  1. Unsupported broker. Today, queue depth requires a standard Redis list-mode broker (redis:// or rediss://). RabbitMQ, SQS, Redis Cluster, Sentinel, and Streams are not yet supported. Tasks, workers, and beat schedules still work on all of these — just queue depth doesn't.
  2. The poller can't reach Redis. The poller derives its connection from app.conf.broker_url by default. If your worker reaches Redis through a network path the SDK doesn't have access to (different VPC, scoped credentials), pass an explicit URL to connect(broker_url=...).
  3. The leader is in another process. By design, only one process per account holds the queue depth poller lock at a time. If you're tailing logs from a non-leader process, you won't see the poller's debug messages there. The leader will be one of your worker processes; the lock rotates if it crashes.
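For cause 2, it can help to see what deriving a connection from a broker URL looks like. This is a hypothetical helper, not the SDK's actual parsing code; it just shows the pieces a poller would need to pull out of a redis:// or rediss:// URL:

```python
from urllib.parse import urlparse

def redis_params_from_broker_url(broker_url):
    # Hypothetical helper: split a redis:// or rediss:// broker URL
    # into connection parameters, with Redis's usual defaults.
    parsed = urlparse(broker_url)
    return {
        "host": parsed.hostname or "localhost",
        "port": parsed.port or 6379,
        "db": int((parsed.path or "").lstrip("/") or 0),
        "ssl": parsed.scheme == "rediss",
    }

print(redis_params_from_broker_url("rediss://redis.internal:6380/2"))
# {'host': 'redis.internal', 'port': 6380, 'db': 2, 'ssl': True}
```

If the host the URL resolves to isn't reachable from the SDK's network path, that's exactly the case where passing an explicit connect(broker_url=...) fixes things.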

Channel keeps auto-disabling

Notification channels (Discord webhook, Slack, email) auto-disable themselves after 10 consecutive delivery failures. The dashboard's channels page shows a red "auto-disabled" badge and the account owner gets one email at the moment of the flip.

The most common cause is silent email delivery failure. Resend's API returns success as soon as it accepts an email for delivery, not when it lands in the recipient's inbox. We don't currently process Resend's bounce webhooks, so a typo'd recipient address can silently swallow alerts for a while before Resend's suppression list kicks in and the channel auto-disables. If you've never seen an alert email arrive, click Test on the channels page: it sends a real test email through the same path, so if it doesn't land in your inbox, the recipient address is the problem (delete and recreate the channel to fix it).

The dashboard doesn't expose an "edit" or "re-enable" action for channels — once auto-disabled, a channel stays that way. The recovery path is to delete it and create a fresh channel with the corrected destination.
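The auto-disable rule itself is simple. A sketch of the counter logic as described above (the class name is illustrative, not our backend's): 10 consecutive failures flip the channel off, and any successful delivery resets the streak.

```python
class NotificationChannel:
    # Sketch of the auto-disable rule: 10 *consecutive* delivery
    # failures flip the channel off; any success resets the streak.
    FAILURE_LIMIT = 10

    def __init__(self):
        self.consecutive_failures = 0
        self.enabled = True

    def record_delivery(self, ok):
        if ok:
            self.consecutive_failures = 0
        elif self.enabled:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.FAILURE_LIMIT:
                self.enabled = False  # stays off until deleted/recreated

ch = NotificationChannel()
for _ in range(9):
    ch.record_delivery(False)
ch.record_delivery(True)      # one success resets the failure streak
for _ in range(10):
    ch.record_delivery(False)
print(ch.enabled)  # False
```

Note the "consecutive" part: an intermittently flaky webhook that succeeds now and then will keep resetting the counter and never auto-disable, which is why persistent silence is the stronger signal.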

"This task ran but it's not in the log"

The most likely reasons, in rough order:

  1. Plan retention. Free retains 7 days, Developer 30, Business 90. Older events are pruned hourly. If you're picking the 30d range pill on a Free account, the dashboard shows a banner explaining the cutoff.
  2. The worker process running it didn't have the SDK loaded. If you're running mixed worker pools where some have connect() and some don't, only the connected ones report.
  3. The event was in a queue that filled up during an ingest outage. Drop-newest semantics means new events drop when the queue is full. The SDK logs cumulative drop counts at most once per minute.
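For reason 1, the cutoff math is straightforward. A sketch using the retention windows listed above; the function and dict names are illustrative, and the hourly prune means the boundary is approximate:

```python
from datetime import datetime, timedelta, timezone

# Retention windows per plan, from the list above.
RETENTION_DAYS = {"free": 7, "developer": 30, "business": 90}

def retention_cutoff(plan, now):
    # Events with a timestamp older than this have been pruned
    # (pruning runs hourly, so the boundary is approximate).
    return now - timedelta(days=RETENTION_DAYS[plan])

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(retention_cutoff("free", now).date())       # 2024-05-25
print(retention_cutoff("developer", now).date())  # 2024-05-02
```

If the task you're looking for ran before your plan's cutoff, it isn't hiding; it's been pruned, and upgrading only affects events going forward.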

SDK log levels and what they mean

The SDK uses Python's standard logging framework with the logger name celeryradar_sdk. Configure it like any other logger:

import logging
logging.getLogger('celeryradar_sdk').setLevel(logging.INFO)

Still stuck?

Email [email protected] with:

We read every email.