How the SDK works

The SDK runs inside your production Celery workers. You should know what it's doing. This page walks through the architecture; the SDK is fully open source.

Architecture

The two design constraints are:

  1. Never slow down your worker. If our endpoint is slow or down, your tasks must run at the same speed they would without us.
  2. Never crash your worker. If the SDK hits an unexpected error, it should log and move on, never propagate.

To get there, the SDK splits into two halves: the foreground hooks that run on Celery's signal threads, and a background drain thread that owns all I/O.

The foreground path

When Celery fires a signal — task_prerun, task_failure, heartbeat_sent, etc. — the SDK's handler builds a payload dict and calls Client.send(), which is just Queue.put_nowait(). That's microseconds. The handler returns; your worker thread continues. There is no HTTP, no DNS, no TLS handshake, and no lock contention on this path. (Tasks with non-trivial arguments do pay a small json.dumps/loads on the worker thread for the 4 KB cap check; everything else is pure dict construction.)
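
In outline, the foreground half looks something like this (a sketch; Client, _client, and the handler signature are illustrative, not the SDK's actual internals):

import queue
import time

class Client:
    """Foreground half of the SDK: owns the bounded queue, does no I/O itself."""

    def __init__(self, maxsize: int = 1000):
        self._queue = queue.Queue(maxsize=maxsize)

    def send(self, event: dict) -> None:
        # Runs on Celery's signal thread: must never block, never raise.
        try:
            self._queue.put_nowait(event)   # microseconds; no HTTP, DNS, or TLS
        except queue.Full:
            pass                            # drop-newest; a throttled log records it

_client = Client()

def on_task_prerun(sender=None, task_id=None, **_):
    # Pure dict construction on the worker thread.
    _client.send({
        "type": "task-started",
        "task_id": task_id,
        "task_name": getattr(sender, "name", None),
        "timestamp": time.time(),
    })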

The background drain thread

A single daemon thread per worker process pulls events off the queue and POSTs them to ingest over a keep-alive HTTPS connection: one TCP/TLS handshake at startup, then only request bodies for the lifetime of the process. Failures (5xx, network error, timeout) drop the event and put the thread to sleep with exponential backoff. The delay doubles per consecutive failure — 2s, 4s, 8s, 16s — and caps at 30s thereafter. 4xx errors drop without backoff because they're config errors that won't get better with retry.
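
The backoff arithmetic fits in a few lines. A sketch, where BASE_DELAY, MAX_DELAY, and post() are illustrative names:

import time

BASE_DELAY = 2.0    # first consecutive failure sleeps 2 s
MAX_DELAY = 30.0    # 2, 4, 8, 16, then capped at 30

def backoff_delay(consecutive_failures: int) -> float:
    return min(BASE_DELAY * 2 ** (consecutive_failures - 1), MAX_DELAY)

def drain_loop(q, post):
    failures = 0
    while True:
        event = q.get()                        # blocks until an event arrives
        try:
            status = post(event)               # POST over the keep-alive connection
        except OSError:
            status = None                      # network error or timeout
        if status == 200:
            failures = 0
        elif status is not None and 400 <= status < 500:
            failures = 0                       # config error: drop, no backoff
        else:
            failures += 1                      # 5xx or network: drop and back off
            time.sleep(backoff_delay(failures))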

Why drop, not retry, on the main path?

A naïve "retry every event" policy compounds losses during real outages: the queue fills, then new events drop too. Drop-and-backoff bounds the badness: events drop fast during downtime, the dashboard goes blank for the outage window, and recovery is instant when the endpoint returns. The retry queue (below) covers the narrow case where dropping a single event would produce a false alert.

Bounded queues

The main queue holds 1000 events; if it fills, new sends are dropped (drop-newest) and a throttled cumulative-drop log line is emitted at most once per minute. The retry queue holds 100 events; on full, the oldest is dropped (drop-oldest, since "most recent state" is what matters for absence alerts).
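
Both drop policies, plus the one-minute log throttle, sketch out like this (names and structure are illustrative):

import logging
import queue
import time

_dropped = 0
_last_drop_log = 0.0

def put_drop_newest(q: queue.Queue, event: dict) -> None:
    """Main-queue policy: on full, drop the incoming event and count it."""
    global _dropped, _last_drop_log
    try:
        q.put_nowait(event)
    except queue.Full:
        _dropped += 1
        now = time.monotonic()
        if now - _last_drop_log >= 60:         # at most one log line per minute
            logging.warning("dropped %d events since last report", _dropped)
            _dropped, _last_drop_log = 0, now

def put_drop_oldest(q: queue.Queue, event: dict) -> None:
    """Retry-queue policy: on full, evict the oldest; latest state wins."""
    while True:
        try:
            q.put_nowait(event)
            return
        except queue.Full:
            try:
                q.get_nowait()                 # make room by discarding the oldest
            except queue.Empty:
                pass                           # raced with the drain thread; retry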

Bounded queues mean the SDK can't OOM your process during a long ingest outage. Backpressure manifests as gaps in your dashboard, not memory growth.

The retry queue

A handful of events (the ones absence alerts key on, chiefly worker-heartbeat and beat-fired) would produce customer-visible false alerts if they were lost during a CeleryRadar-side outage.

For these, the drain thread takes a different path: when the POST fails, the event is pushed onto a small bounded retry queue. The drain thread checks the main queue first (priority) and the retry queue during quiet periods. So during a long outage, the most recent state events backfill once we recover.
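
One way to express "main queue first, retry queue during quiet periods" is a timed get with a non-blocking fallback. A sketch, not the SDK's exact loop:

import queue

def next_event(main_q: queue.Queue, retry_q: queue.Queue, idle_wait: float = 1.0):
    """Prefer fresh events; backfill a retry only when the main queue is quiet."""
    try:
        return main_q.get(timeout=idle_wait)   # priority: new events first
    except queue.Empty:
        try:
            return retry_q.get_nowait()        # quiet period: drain one retry
        except queue.Empty:
            return None                        # nothing to send right now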

Task lifecycle events (task-started, -succeeded, -failed, -retried) deliberately do not retry. They're high-frequency, missing one is tolerable (a gap in the task log), and retrying them would compete with new events for queue capacity during the worst possible moment.

Three layers of redundancy protect against false alerts during a CeleryRadar-side outage: the SDK retry queue (above), the backend's GREATEST-style heartbeat upsert (out-of-order arrivals don't push last_seen backward), and a 10-minute startup grace period on the alert engine that suppresses absence-rule evaluation while in-flight backfill events are still landing. Each layer covers a different failure mode.

Fork safety

Celery's prefork pool forks child worker processes from a parent process. If the SDK's drain thread is mid-flight when the fork happens, the child inherits a stale thread reference and a duplicated TCP socket — two processes thinking they own the same connection. That's a bug class we have to handle explicitly.

The SDK tracks the PID that owns its drain thread and HTTPS connection. On every send(), it compares os.getpid() to the stored PID. If they don't match (we're in a fresh child), it rebuilds: new queue, new stop event, new drain thread, new HTTPS connection. The Redis connection used by the queue depth poller follows the same pattern.
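
A sketch of the PID guard, with illustrative names:

import os
import queue
import threading

class ForkSafeClient:
    """Sketch of the fork guard; method and attribute names are illustrative."""

    def __init__(self):
        self._rebuild()

    def _rebuild(self):
        # Fresh per-process state: never reuse the parent's thread or socket.
        self._owner_pid = os.getpid()
        self._queue = queue.Queue(maxsize=1000)
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def send(self, event: dict) -> None:
        if os.getpid() != self._owner_pid:     # we are in a freshly forked child
            self._rebuild()
        try:
            self._queue.put_nowait(event)
        except queue.Full:
            pass

    def _drain(self):
        while not self._stop.is_set():
            event = self._queue.get()          # then POST over this process's own connection
            ...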

This means importing the SDK in your parent process and forking workers afterward is safe. You don't need to call connect() per-child.

Payload size and PII

Task arguments are captured by default. They're useful — the task detail page shows them, and they're often what you need to reproduce a failure. But they can also contain user data you'd rather not ship to a third party.

Two safeguards apply:

  1. 4 KB cap. The combined args + kwargs payload is serialized once; if it exceeds 4 KB, the values are dropped and replaced with a ['__truncated__', '<n> bytes'] marker. The task detail page surfaces this as a "payload truncated" note. This catches the case where a task takes a DataFrame, a bytes blob, or another large object as an argument — your ingest payloads stay small and your bandwidth bill stays predictable.
  2. capture_args=False. Disables capture entirely. The server still receives task name, state, runtime, retry count, and exception info — just not the arg values themselves. Use this if any of your tasks accept secrets, tokens, raw email bodies, or PII as arguments.

Non-JSON-serializable values (datetimes, custom classes, file handles, etc.) are coerced via repr() rather than dropped. So a task that takes a UUID shows up as UUID('abc-123-…') in the UI rather than blanking out.
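
Putting the cap and the repr() coercion together. A sketch; MAX_ARG_BYTES and the kwargs replacement shape are assumptions, the marker format is from above:

import json

MAX_ARG_BYTES = 4096   # the 4 KB cap (name is illustrative)

def capture_args(args, kwargs):
    """Serialize once on the worker thread; truncate if over the cap."""
    blob = json.dumps([list(args), kwargs], default=repr)  # datetimes, UUIDs -> repr()
    if len(blob.encode("utf-8")) > MAX_ARG_BYTES:
        marker = ["__truncated__", f"{len(blob)} bytes"]
        return marker, {}           # kwargs replacement shape is an assumption
    return json.loads(blob)         # round-trip so downstream sees plain JSON types

# usage: args_out, kwargs_out = capture_args(task_args, task_kwargs)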

Performance characteristics

The numbers above, in one place: the hot path is dict construction plus a Queue.put_nowait, microseconds per event; tasks with non-trivial arguments add one json.dumps/loads round-trip for the 4 KB cap check. Memory is bounded at 1000 main-queue events plus 100 retry-queue events per worker process, so a long ingest outage can't grow your worker's footprint. Each process runs exactly one background daemon thread and holds one keep-alive HTTPS connection; no network I/O ever runs on a worker thread.

Event reference

Every event the SDK sends is a JSON object POSTed to /ingest/ with a Bearer token. The shapes are stable and intentionally minimal.

task-started

{
  "type": "task-started",
  "task_id": "uuid",
  "task_name": "myapp.tasks.send_email",
  "worker": "worker-7d9c",
  "queue": "default",
  "args": [...],
  "kwargs": {...},
  "retries": 0,
  "timestamp": 1714400000.123
}

task-succeeded

{
  "type": "task-succeeded",
  "task_id": "uuid",
  "task_name": "...",
  "worker": "...",
  "runtime": 0.0421,
  "args": [...], "kwargs": {...},
  "retries": 0,
  "timestamp": ...
}

task-failed

{
  "type": "task-failed",
  "task_id": "uuid",
  "task_name": "...",
  "worker": "...",
  "exception": "ValueError('bad input')",
  "traceback": "Traceback (most recent call last):\n  ...",
  "args": [...], "kwargs": {...},
  "retries": 2,
  "timestamp": ...
}

task-retried

Same shape as task-failed. Emitted when Celery's task_retry signal fires.

worker-heartbeat

{
  "type": "worker-heartbeat",
  "hostname": "worker-7d9c",
  "queues": ["default", "high"],
  "timestamp": ...
}

Throttled to one every 30 seconds per worker process (Celery itself emits heartbeats more often).
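
The throttle is a one-comparison gate on the signal handler. A sketch, reusing the illustrative _client from the foreground example above:

import socket
import time

HEARTBEAT_INTERVAL = 30.0   # forward at most one heartbeat per 30 s per process
_last_heartbeat = 0.0

def on_heartbeat_sent(**_):
    global _last_heartbeat
    now = time.monotonic()
    if now - _last_heartbeat < HEARTBEAT_INTERVAL:
        return                    # Celery fires more often; swallow the extras
    _last_heartbeat = now
    _client.send({
        "type": "worker-heartbeat",
        "hostname": socket.gethostname(),
        "queues": [],             # in practice, the worker's consumed queues
        "timestamp": time.time(),
    })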

beat-fired

{
  "type": "beat-fired",
  "task_name": "myapp.tasks.send_digest",
  "timestamp": ...
}

Emitted by the beat process every time it publishes a scheduled task. The dashboard matches the task_name against any active beat schedules to mark the corresponding fire window as satisfied.

schedule-register / schedule-snapshot

Emitted on beat startup and periodically thereafter. register adds individual entries; snapshot sends the full active set so deletions can propagate. Together they keep the dashboard in sync with admin add/delete in django-celery-beat or RedBeat without restarting the beat process.

queue-depth

{
  "type": "queue-depth",
  "timestamp": 1714400070,
  "samples": [
    {"queue_name": "default", "depth": 12},
    {"queue_name": "high",    "depth":  0}
  ]
}

Emitted by the leader-elected poller every 30 seconds. One event per poll, carrying a samples array with the current LLEN of every queue declared on the Celery app. The timestamp is bucketed to the poll interval so concurrent leaders during a lock handoff don't double-write.
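
The bucketing is integer arithmetic. A sketch:

import time

POLL_INTERVAL = 30   # seconds between polls

def bucketed_timestamp(now=None) -> int:
    """Floor to the poll interval: two leaders sampling in the same window
    emit the same timestamp, so the backend upserts rather than double-writes."""
    t = int(time.time() if now is None else now)
    return t - (t % POLL_INTERVAL)

# e.g. bucketed_timestamp(1714400083.4) == 1714400070, matching the payload above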