integration-webhooks

$npx mdskill add BuilderIO/agent-native/integration-webhooks

Enqueue long tasks to SQL and return immediately for webhooks.

  • Prevents serverless timeouts by returning 200 instantly.
  • Requires SQL storage and self-fired HTTP POST triggers.
  • Avoids duplicate runs by acknowledging fast and claiming each task atomically.
  • Ensures cross-platform compatibility across all serverless hosts.

---
name: integration-webhooks
description: >-
  Cross-platform pattern for handling messaging integration webhooks (Slack,
  Telegram, WhatsApp, email, etc.) on serverless hosts. Use when adding a new
  integration adapter, debugging dropped messages, or wiring long-running agent
  work into a webhook handler.
---

# Integration Webhooks

## Rule

Integration webhooks (Slack, Telegram, WhatsApp, email, Google Docs, etc.) must
**enqueue work to SQL and return 200 immediately**, then process the work in a
**separate fresh function execution** kicked off by a self-fired HTTP POST. A
recurring retry job sweeps anything that gets stuck. This pattern works on every
serverless host (Netlify, Vercel, Cloudflare Workers, Fly, Render, Node) without
relying on platform-specific background-execution features.

Do not run agent loops inside the webhook handler itself. Do not rely on
fire-and-forget promises that are still running after a serverless handler
returns; they are killed when the function freezes.

## Why

Messaging platforms expect a 200 response within a tight window. Slack, for
example, retries if it gets no response within 3 seconds, and a retried event
triggers a duplicate agent run. At the same time, an agent loop replying to the
message can take 30–60+ seconds because it may make multiple LLM calls and tool
calls.

Past attempts that don't work cross-host:

- **Fire-and-forget `Promise.then(...)` after returning** — Lambda/Vercel/CF
  freeze the execution context the moment the response goes out. The promise
  is silently killed, the user gets no reply, and there's no error in the
  logs.
- **Netlify Background Functions** — Netlify-only, requires a `-background`
  filename suffix, breaks on every other host.
- **Cloudflare `event.waitUntil()`** — CF Workers only, not portable.
- **Vercel Fluid / `after()`** — Vercel-only, gated behind specific runtimes.
- **A long-lived in-process queue** — fine on a single Node box, but on
  serverless every cold start gets a fresh queue and any pending work is
  lost.

The only universal answer: **persist the work, then trigger a brand new
function execution to do it.** SQL is the queue, a self-webhook is the trigger,
and a recurring job is the safety net.

## The Flow

```
┌──────────┐    1. POST /integrations/:platform/webhook
│ Platform │────────────────────────────────────────────►┌──────────────────┐
└──────────┘                                             │ Webhook handler  │
                                                         │ (function exec 1)│
                                                         └──────────────────┘
                                                                  │
                            2. INSERT INTO integration_pending_tasks
                                 (status='pending', payload=...)
                                                                  │
                            3. fetch(POST /integrations/_process-task)
                                 — fire-and-forget, NO await on body
                                                                  │
                            4. return 200 to platform ◄───────────┘

                                                         ┌──────────────────┐
                          5. POST arrives at processor   │ Processor        │
                             (separate fresh function)   │ (function exec 2)│
                                                         └──────────────────┘
                                                                  │
                            6. claimPendingTask(id) → status='processing'
                            7. runAgentLoop(...) — full timeout budget here
                            8. adapter.sendResponse(...) back to platform
                            9. markTaskCompleted(id)


                          ┌──────────────────────────────────────────────┐
                          │  Recurring job (every 60s) — safety net      │
                          │  Re-fires processor for tasks stuck in       │
                          │  'pending' or 'processing' beyond timeout.   │
                          │  Caps retries at 3 then marks 'failed'.      │
                          └──────────────────────────────────────────────┘
```

The webhook handler does as little as possible. The fresh function execution
that handles `_process-task` gets its own full timeout budget for the agent
loop.
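
Steps 1 through 4 of the flow can be sketched as a small handler. This is a minimal illustration, not the code in `webhook-handler.ts`: the `WebhookDeps` interface, the simplified `IncomingMessage` shape, and `fireProcessor` are hypothetical names introduced here so the sketch runs without a database or network.

```typescript
// Hypothetical, simplified shape; the real IncomingMessage lives in
// packages/core/src/integrations/types.ts and may differ.
interface IncomingMessage {
  platform: string;
  externalThreadId: string;
  text: string;
}

// Hypothetical dependency bundle so the handler can be exercised in isolation.
interface WebhookDeps {
  verify(rawBody: string): Promise<boolean>;
  parse(rawBody: string): Promise<IncomingMessage | null>;
  insertPendingTask(message: IncomingMessage): Promise<string>; // resolves to the task id
  fireProcessor(taskId: string): Promise<unknown>; // POSTs to /_process-task
}

async function handleWebhook(
  rawBody: string,
  deps: WebhookDeps,
): Promise<{ status: number }> {
  // Signature failures return 401.
  if (!(await deps.verify(rawBody))) return { status: 401 };

  const message = await deps.parse(rawBody);
  // Nothing to process: still acknowledge so the platform does not retry.
  if (message === null) return { status: 200 };

  // Step 2: persist first, so the retry job can recover the task
  // even if the dispatch below never leaves the function.
  const taskId = await deps.insertPendingTask(message);

  // Step 3: start the dispatch but do not wait for the processor to finish.
  // Errors are swallowed on purpose; the retry job is the safety net.
  void deps.fireProcessor(taskId).catch(() => {});

  // Step 4: acknowledge immediately.
  return { status: 200 };
}
```

The ordering is the point of the sketch: the insert is awaited, the dispatch is not, so a failed or dropped dispatch leaves a `pending` row behind rather than losing the message.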

## Key Files

| File                                                                    | Purpose                                                                |
| ----------------------------------------------------------------------- | ---------------------------------------------------------------------- |
| `packages/core/src/integrations/plugin.ts`                              | Mounts `/_agent-native/integrations/*` routes                          |
| `packages/core/src/integrations/webhook-handler.ts`                     | Verifies signature, parses, enqueues task, fires processor             |
| `packages/core/src/integrations/pending-tasks-store.ts`                 | SQL queue: `insertPendingTask`, `claimPendingTask`, `markTaskCompleted`, `markTaskFailed` |
| `packages/core/src/integrations/pending-tasks-retry-job.ts`             | Recurring retry sweep (`startPendingTasksRetryJob`, `retryStuckPendingTasks`) |
| `packages/core/src/integrations/types.ts`                               | `PlatformAdapter`, `IncomingMessage`, `OutgoingMessage`                |
| `packages/core/src/integrations/adapters/{slack,telegram,whatsapp,email,google-docs}.ts` | One adapter per platform                                               |

## Routes

All under `/_agent-native/integrations/`:

| Method | Path                       | Purpose                                                       |
| ------ | -------------------------- | ------------------------------------------------------------- |
| POST   | `/:platform/webhook`       | Platform pings this. Verifies, enqueues, returns 200 quickly. |
| POST   | `/_process-task`           | Self-webhook target. Claims a task and runs the agent loop.   |
| GET    | `/status`                  | All integrations status (settings UI).                        |
| GET    | `/:platform/status`        | One platform's status.                                        |
| POST   | `/:platform/enable`        | Enable an integration.                                        |
| POST   | `/:platform/disable`       | Disable an integration.                                       |
| POST   | `/:platform/setup`         | Platform-specific setup (e.g. Telegram webhook registration). |

## SQL Schema

The pending-task queue lives in `integration_pending_tasks`:

```sql
CREATE TABLE IF NOT EXISTS integration_pending_tasks (
  id                 TEXT    PRIMARY KEY,
  platform           TEXT    NOT NULL,
  external_thread_id TEXT    NOT NULL,
  payload            TEXT    NOT NULL,   -- JSON-serialized IncomingMessage
  owner_email        TEXT    NOT NULL,
  org_id             TEXT,
  status             TEXT    NOT NULL,   -- pending | processing | completed | failed
  attempts           INTEGER NOT NULL DEFAULT 0,
  error_message      TEXT,
  created_at         INTEGER NOT NULL,
  updated_at         INTEGER NOT NULL,
  completed_at       INTEGER
);
CREATE INDEX IF NOT EXISTS idx_pending_tasks_status_created
  ON integration_pending_tasks(status, created_at);
```

The store layer creates this lazily on first use via `ensureTable()` and uses
`intType()` from `db/client.ts` so it works on both SQLite and Postgres.

`claimPendingTask` is the critical concurrency primitive: it atomically flips
`pending` → `processing` and increments `attempts`, returning `null` if another
worker beat us to it. Both the initial fire-and-forget call and the retry job
funnel through the same processor endpoint, and `claimPendingTask` is what
prevents the same task from being processed twice.
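
The claim semantics can be modeled without a database. The sketch below is an in-memory stand-in, not the store's actual code: `InMemoryTaskStore`, `TaskRow`, and the SQL in the comment are assumptions made for illustration. It shows only the property described above, that the first caller gets the row and a later caller gets `null`.

```typescript
type TaskStatus = "pending" | "processing" | "completed" | "failed";

// Hypothetical row shape covering only the columns the claim touches.
interface TaskRow {
  id: string;
  status: TaskStatus;
  attempts: number;
  updatedAt: number;
}

// In-memory stand-in for integration_pending_tasks (illustration only).
class InMemoryTaskStore {
  private rows = new Map<string, TaskRow>();

  insert(id: string, now: number): void {
    this.rows.set(id, { id, status: "pending", attempts: 0, updatedAt: now });
  }

  // Models a conditional update along the lines of (assumed, not the real SQL):
  //   UPDATE integration_pending_tasks
  //      SET status = 'processing', attempts = attempts + 1, updated_at = ?
  //    WHERE id = ? AND status = 'pending'
  // The WHERE clause is what makes a second claim a no-op.
  claim(id: string, now: number): TaskRow | null {
    const row = this.rows.get(id);
    if (!row || row.status !== "pending") return null;
    const claimed: TaskRow = {
      ...row,
      status: "processing",
      attempts: row.attempts + 1,
      updatedAt: now,
    };
    this.rows.set(id, claimed);
    return claimed;
  }
}

const store = new InMemoryTaskStore();
store.insert("task-1", 1_000);
const firstClaim = store.claim("task-1", 2_000); // wins: row is now 'processing'
const secondClaim = store.claim("task-1", 2_001); // loses: returns null
```

In a real database the check and the update must happen in one statement (or one transaction) so that two processors cannot both observe `pending`.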

## Adding a New Platform Adapter

1. **Implement `PlatformAdapter`** in `packages/core/src/integrations/adapters/<platform>.ts`:

   ```ts
   export function myPlatformAdapter(): PlatformAdapter {
     return {
       platform: "myplatform",
       label: "MyPlatform",
       getRequiredEnvKeys: () => [
         { name: "MYPLATFORM_TOKEN", label: "MyPlatform Bot Token", scope: "global" },
       ],
       async handleVerification(event) {
         // Platform-specific challenge response, if any
         return { handled: false };
       },
       async verifyWebhook(event) {
         // HMAC / signing-secret check — return false on mismatch
         return true;
       },
       async parseIncomingMessage(event) {
         // Map raw payload → IncomingMessage, or null to ignore
         return null;
       },
       async sendResponse(message, context) {
         // POST back to the platform's API
       },
       formatAgentResponse(text) {
         return { text, platformContext: {} };
       },
       async getStatus(baseUrl) {
         return { platform: "myplatform", label: "MyPlatform", enabled: false, configured: false };
       },
     };
   }
   ```

2. **Register it** in `getDefaultAdapters()` inside `plugin.ts`. The webhook,
   queue, processor, and retry job are shared infrastructure — you do not
   write any of that per-adapter.

3. **Declare required env keys** so the secrets/onboarding UI surfaces them.
   See `secrets` and `onboarding` skills.

4. **Update the platform's webhook URL** to point at
   `${baseUrl}/_agent-native/integrations/<platform>/webhook`. For platforms
   with a registration API (Telegram), implement `POST /:platform/setup`.

The adapter is **only** responsible for:

- platform-specific verification (signatures, challenges)
- payload → `IncomingMessage` mapping
- agent text → platform format
- delivering the response back to the platform

It does **not** know about the queue, the processor, retries, or the agent
loop. Those are handled by the shared webhook handler.
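
For the verification responsibility, a common building block is an HMAC-SHA256 comparison over the raw request body. The helper below is a generic sketch under that assumption: `verifyHmacSignature` is a hypothetical name, and each real platform defines its own signed string, header name, and encoding, so an adapter's `verifyWebhook` would need to follow that platform's documentation rather than this exact form.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helper: checks a hex-encoded HMAC-SHA256 of the raw body.
function verifyHmacSignature(
  rawBody: string,
  signatureHex: string,
  secret: string,
): boolean {
  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(signatureHex, "hex");

  // timingSafeEqual throws when lengths differ, so reject those up front.
  if (received.length !== expected.length) return false;

  // Constant-time comparison avoids leaking how many bytes matched.
  return timingSafeEqual(received, expected);
}
```

The check has to run against the raw body bytes as received; re-serializing parsed JSON can change whitespace or key order and break the signature.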

## Long-Running Agent Work

The processor endpoint runs in a fresh function execution with its own full
timeout (typically 30–60s on Netlify/Vercel, longer on background-friendly
hosts). That budget is dedicated entirely to the agent loop — there is no
platform-side timer racing it.

If a single agent run might exceed the function timeout (large multi-step
plans, deep delegation chains), the agent should:

1. Send an interim acknowledgement back to the platform so the user knows the
   request landed (`adapter.sendResponse({ text: "Working on it..." }, context)`).
2. Persist intermediate state in chat-thread data, application state, or a
   recurring job so the next invocation can pick up where this one left off.

The retry job will only re-fire tasks stuck in `processing` for over 5 minutes,
so a normal long-running reply is safe.

## Cross-Platform Considerations

- **No platform-specific background APIs.** No `waitUntil`, no
  `-background.ts` filenames, no Vercel `after()`. The pattern works
  identically on every host because it only uses `fetch()` and SQL.
- **No assumed runtime.** The processor endpoint is a normal H3 handler under
  `/_agent-native/`. It runs wherever the rest of the framework runs.
- **No persistent in-memory state.** The dedup map in the webhook handler is
  best-effort only; the SQL queue is the source of truth. Any cold start
  loses the dedup map but the queue stays consistent.
- **Postgres + SQLite both supported.** `claimPendingTask` uses `RETURNING` on
  Postgres and a re-read on SQLite. Nothing in the SQL depends on the host.
- **Self-webhook URL resolution.** The processor URL is built from
  `WEBHOOK_BASE_URL`, `APP_URL`, or `URL` env vars (with `localhost:3000` as
  the dev fallback). Templates that change their public URL must keep one of
  these set.
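
The self-webhook URL resolution described in the last bullet might look like the following. This is a sketch, not the framework's code: `resolveProcessorUrl` is a hypothetical name, the precedence order simply follows the order the variables are listed in above, and the `http://` scheme on the localhost fallback is an assumption.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical helper: picks the first non-empty base URL and appends the
// processor route. Precedence and the http:// fallback scheme are assumptions.
function resolveProcessorUrl(env: Env): string {
  const base =
    env["WEBHOOK_BASE_URL"] ||
    env["APP_URL"] ||
    env["URL"] ||
    "http://localhost:3000";

  // Strip trailing slashes so the joined URL never contains "//".
  return `${base.replace(/\/+$/, "")}/_agent-native/integrations/_process-task`;
}
```

In a Node handler this would be called as `resolveProcessorUrl(process.env)`; taking the environment as a parameter keeps the function easy to test.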

## Why Fire-and-Forget on Serverless Is Unreliable

Even though the webhook handler does `fetch(processorUrl, ...)` without
awaiting the response body, that initial dispatch is **not** guaranteed to
complete before the function freezes. In practice it usually does — the TCP
connect + write happens quickly — but the recurring retry job is the safety
net for the cases where:

- The serverless platform froze the handler before the outbound `fetch`
  flushed its bytes.
- The processor function returned a 502 or cold-started slowly enough to time out.
- The processor itself was killed mid-agent-loop (function timeout, container
  shutdown, deploy mid-run).

Tasks stuck in `pending` for >90s or `processing` for >5min get re-fired up to
3 times. After 3 attempts they're marked `failed` permanently so we stop
spamming the processor.

**Never assume the initial fire-and-forget succeeded.** Always rely on the
queue + retry job for at-least-once delivery.
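
The sweep thresholds above reduce to a small decision function. The sketch below is an illustration rather than the code in `pending-tasks-retry-job.ts`: `decideSweepAction` and `SweepCandidate` are hypothetical names, timestamps are assumed to be epoch milliseconds, and staleness is assumed to be measured from `updated_at`.

```typescript
type TaskStatus = "pending" | "processing" | "completed" | "failed";

// Hypothetical view of a queue row; timestamps assumed to be epoch milliseconds.
interface SweepCandidate {
  status: TaskStatus;
  attempts: number;
  updatedAt: number;
}

type SweepAction = "refire" | "fail" | "skip";

const PENDING_STUCK_MS = 90_000; // 'pending' for more than 90s
const PROCESSING_STUCK_MS = 5 * 60_000; // 'processing' for more than 5 minutes
const MAX_ATTEMPTS = 3;

function decideSweepAction(task: SweepCandidate, now: number): SweepAction {
  // Finished rows are never touched by the sweep.
  if (task.status !== "pending" && task.status !== "processing") return "skip";

  const limit = task.status === "pending" ? PENDING_STUCK_MS : PROCESSING_STUCK_MS;
  if (now - task.updatedAt <= limit) return "skip";

  // Stuck: give up after the attempt cap, otherwise re-fire the processor.
  return task.attempts >= MAX_ATTEMPTS ? "fail" : "refire";
}
```

Because re-fired tasks go through the same processor endpoint, the atomic claim still decides which invocation actually runs the agent loop.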

## Debugging Checklist

1. **Platform sent the webhook?** Check the platform's delivery logs (Slack
   admin, Telegram `getWebhookInfo`).
2. **Webhook handler returned 200?** If not, the platform retries — look for
   duplicate task rows. Signature failures return 401.
3. **Task in the queue?** `SELECT * FROM integration_pending_tasks WHERE
   external_thread_id = '...' ORDER BY created_at DESC LIMIT 5`.
4. **Status?** `pending` means the processor never picked it up — check that
   `_process-task` is reachable from the box itself (the self-fetch must work
   over the public URL). `processing` for over 5 minutes means the processor
   died mid-run — the retry job will pick it up.
5. **Failed?** Check `error_message` and `attempts`. After 3 attempts the row
   is parked at `failed` and won't be retried.
6. **Reply not delivered?** The processor likely succeeded but
   `adapter.sendResponse` failed — check the adapter's outbound logs.

## Related Skills

- `server-plugins` — How `/_agent-native/` routes get mounted
- `recurring-jobs` — Pattern the retry job follows
- `actions` — When to use an action vs a webhook
- `secrets` — Registering platform tokens
- `onboarding` — Surfacing setup steps for each platform
- `delegate-to-agent` — How the processor invokes the agent loop
