Troubleshooting
The handful of issues that come up in practice. If you hit something that isn't here, tell us — we'll add it.
Tasks aren't appearing in the dashboard
Work through these in order:
- Is the SDK actually loaded? Restart your worker with verbose logging and look for SDK messages on startup. The SDK is silent on the happy path but logs on every failure.
- Is connect() being called? A common mistake is putting celeryradar_sdk.connect() inside a function that never runs at import time. The call has to happen during module load, before Celery starts dispatching signals. There's a placement sketch after this list.
- Is the API key right? Wrong key produces 401 responses, which the SDK logs at WARNING level. Tail your worker log for "ingest returned 401".
- Can the worker reach the ingest endpoint? Try curl https://api.celeryradar.com/ingest/ from inside the worker's environment. Network egress restrictions on locked-down deploys are a frequent culprit.
- Did the queue fill up? If the SDK's main queue is full, it logs "celeryradar_sdk: ingest queue full; dropped N events" at most once a minute. That happens during sustained ingest outages or when the worker is firing more than 1000 events between successful POSTs — neither is normal.
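If it helps, here's a minimal placement sketch. The module name, app name, and broker URL are illustrative, and any arguments connect() takes in your setup (API key and so on) are omitted; the point is only that the call runs at module level.

```python
# tasks.py (name illustrative): the module your workers import at startup
from celery import Celery

import celeryradar_sdk

app = Celery("myapp", broker="redis://localhost:6379/0")

# Runs at import time, before Celery starts dispatching signals.
# Don't tuck this inside a function or an `if __name__ == "__main__":` block,
# since neither may ever execute in the worker process.
celeryradar_sdk.connect()

@app.task
def add(x, y):
    return x + y
```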
Worker shows offline but it's running
Almost always one of two things:
Ephemeral hostnames in containers
In Kubernetes pods, ECS tasks, and Docker containers without an explicit name, socket.gethostname() returns a name that rotates on every restart. Each restart creates a fresh worker row in the dashboard, and the previous row's last_seen ages out — so the row labelled with yesterday's hostname correctly shows offline (it really is offline; that name no longer exists), while your actually-running worker is sitting under a new row with this morning's hostname. The dashboard ends up with one online row and a growing pile of offline ghosts.
Set CELERYRADAR_WORKER_NAME in your deployment manifest to a stable per-deployment value:
```yaml
# Kubernetes
env:
  - name: CELERYRADAR_WORKER_NAME
    value: "celery-worker-prod"

# docker-compose
environment:
  CELERYRADAR_WORKER_NAME: "celery-worker-prod"
```
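If you want to check what name a given environment would produce, here's a rough sketch of the resolution order described above (an illustration, not the SDK's actual code):

```python
import os
import socket

# An explicit CELERYRADAR_WORKER_NAME wins; otherwise the container hostname is
# used, which rotates on every restart in pods, ECS tasks, and unnamed containers.
worker_name = os.environ.get("CELERYRADAR_WORKER_NAME") or socket.gethostname()
print(worker_name)
```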
You can delete the ghost rows from the workers page (the dashboard's per-row delete button) — or just wait, our retention pruning sweeps them when their last heartbeat ages past your plan's retention window.
Heartbeat events are being dropped
If the SDK's ingest queue backs up during a long ingest outage, heartbeat events can be dropped along with everything else. The retry queue is supposed to catch this — heartbeats use retry=True — but if the outage lasts long enough, even the retry queue starts dropping its oldest entries. Enable DEBUG logging on the celeryradar_sdk logger and look for "send failed" lines to confirm the endpoint isn't actually reachable. (The SDK's per-failure messages are at DEBUG level so they don't spam your normal logs.)
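A minimal way to turn that on in the worker process (the handler and format are up to you; basicConfig is just one option):

```python
import logging

# Make sure some handler exists, then drop the SDK logger to DEBUG so the
# per-failure "send failed" messages become visible.
logging.basicConfig(level=logging.INFO)
logging.getLogger("celeryradar_sdk").setLevel(logging.DEBUG)
```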
Phantom alerts after a CeleryRadar outage
If our ingest endpoint goes down, your worker can't ship heartbeats. From the alert engine's perspective, that looks identical to your worker actually being offline. Without protection, a CR-side outage would page you about your perfectly-fine workers.
Three layers of protection are in place:
- The SDK retry queue holds heartbeats and beat fire events through the outage and replays them after recovery.
- The backend's heartbeat upsert uses GREATEST(existing, incoming) semantics, so retried backfill events can't push last_seen backward (see the sketch after this list).
- The alert engine has a 10-minute startup grace period after worker recovery during which absence-based rules (worker_offline, beat_miss) skip evaluation. This covers the window where retry-queue backfill is still landing.
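To make the newer-timestamp-wins idea concrete, here's a toy illustration (not the backend's actual code):

```python
from datetime import datetime, timezone
from typing import Optional

def merge_last_seen(existing: Optional[datetime], incoming: datetime) -> datetime:
    """Newer timestamp wins, the same idea as GREATEST(existing, incoming)."""
    return incoming if existing is None else max(existing, incoming)

# A replayed (older) heartbeat arriving after a live one leaves last_seen untouched.
live = datetime(2024, 5, 1, 12, 5, tzinfo=timezone.utc)
backfill = datetime(2024, 5, 1, 11, 58, tzinfo=timezone.utc)
assert merge_last_seen(live, backfill) == live
```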
If you're still seeing phantom alerts after a CR-side outage, it usually means the outage was longer than the SDK retry queue's capacity (~100 events of buffer per worker process) and longer than the 10-minute grace window. Email us with the timestamps and we'll dig into it.
Queue depth charts are blank
Three causes:
- Unsupported broker. Today, queue depth requires a standard Redis list-mode broker (redis:// or rediss://). RabbitMQ, SQS, Redis Cluster, Sentinel, and Streams are not yet supported. Tasks, workers, and beat schedules still work on all of these — just queue depth doesn't.
- The poller can't reach Redis. The poller derives its connection from app.conf.broker_url by default. If your worker reaches Redis through a network path the SDK doesn't have access to (a different VPC, scoped credentials), pass an explicit URL to connect(broker_url=...) (see the example after this list).
- The leader is in another process. By design, only one process per account holds the queue depth poller lock at a time. If you're tailing logs from a non-leader process, you won't see the poller's debug messages there. The leader will be one of your worker processes; the lock rotates if it crashes.
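For instance (the URL is a placeholder; any other arguments connect() takes in your setup still apply):

```python
import celeryradar_sdk

# Point the queue depth poller at a Redis endpoint this process can actually
# reach, instead of whatever app.conf.broker_url resolves to here.
celeryradar_sdk.connect(broker_url="rediss://redis.internal.example.com:6379/0")
```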
Channel keeps auto-disabling
Notification channels (Discord webhook, Slack, email) auto-disable themselves after 10 consecutive delivery failures. The dashboard's channels page shows a red "auto-disabled" badge and the account owner gets one email at the moment of the flip.
Common causes:
- Webhook URL was deleted. Discord and Slack invalidate webhook URLs when the channel or workspace integration is deleted. Re-create the integration on their side and paste the new URL (or for Slack, click "Add to Slack" again — OAuth re-issues a fresh webhook).
- Resend rejecting the recipient. For email channels, the auto-disable trips when Resend's API itself returns a 4xx — most often because the recipient address is on Resend's internal suppression list after repeated bounces. Channels can't be edited in place, so the fix is to delete the channel and recreate it with a corrected recipient (or contact us if you believe an address was wrongly suppressed).
The dashboard doesn't expose an "edit" or "re-enable" action for channels — once auto-disabled, a channel stays that way. The recovery path is to delete it and create a fresh channel with the corrected destination.
"This task ran but it's not in the log"
The most likely reasons, in rough order:
- Plan retention. Free retains 7 days, Developer 30, Business 90. Older events are pruned hourly. If you're picking the 30d range pill on a Free account, the dashboard shows a banner explaining the cutoff.
- The worker process running it didn't have the SDK loaded. If you're running mixed worker pools where some have connect() and some don't, only the connected ones report.
- The event was in a queue that filled up during an ingest outage. Drop-newest semantics means new events drop when the queue is full. The SDK logs cumulative drop counts at most once per minute.
SDK log levels and what they mean
The SDK uses Python's standard logging framework with the logger name celeryradar_sdk. Configure it like any other logger:
```python
import logging
logging.getLogger('celeryradar_sdk').setLevel(logging.INFO)
```
- WARNING — non-fatal issues you should know about: ingest queue full and dropping events, ingest endpoint returning 4xx (config error), connect() called twice, queue depth skipping a non-Redis broker, a beat schedule with an unsupported type (solar, clocked).
- DEBUG — high-volume internals: backoff durations after failed sends, poller tick errors, beat re-sync failures, RedBeat enumeration fallbacks, individual schedule normalization failures, pipelined LLEN errors. Off by default.
Still stuck?
Email [email protected] with:
- What you expected to see vs. what you actually see.
- The SDK and Celery versions (pip show celeryradar-sdk celery).
- Your broker (Redis URL scheme is enough — don't include credentials).
- Any relevant lines from your worker log.
We read every email.