bigquery

$npx mdskill add BuilderIO/agent-native/bigquery

Query analytics data directly using native SQL.

  • Retrieves metrics, funnels, and user activity instantly.
  • Connects to BigQuery via the Google Cloud client.
  • Executes SQL commands without HTTP workarounds.
  • Returns structured query results for immediate analysis.
SKILL.md
.github/skills/bigqueryView on GitHub ↗
---
name: bigquery
description: >
  Query BigQuery for analytics events, signups, pageviews, subscriptions, and user data.
  Use this skill when the user asks about metrics, funnels, user activity, or customer usage data.
---

# BigQuery Integration

## CRITICAL: BigQuery is a Native Agent Tool

**`bigquery` is available directly in the agent's tool list as a native callable tool.**

- If you see `bigquery` in your available tools — **call it directly with your SQL**. Do not use HTTP workarounds, web-request hacks, or scripts as a substitute.
- The `server/lib/bigquery.ts` description below is the *underlying implementation*. It does **not** mean BigQuery is only accessible via terminal commands or scripts.
- **When uncertain if the tool works, call it — don't reason your way to "it won't work".** Empirically test by calling the tool.
- Scripts (`pnpm action`) and the server lib are for dashboard UI code and CLI use. The agent calls BigQuery directly via its native tool.

> **Behavioral rule**: "When uncertain if a tool works, call it — don't reason your way to 'it won't work'." Escalating to HTTP/web-request hacks before verifying the simple path fails is a bug in agent behavior.

## Connection

- **Client**: `@google-cloud/bigquery` Node.js client
- **Auth**: `GOOGLE_APPLICATION_CREDENTIALS_JSON` env var (JSON credentials string), falls back to Application Default Credentials
- **Project**: `BIGQUERY_PROJECT_ID` env var, defaults to `your-project-id`
- **Caching**: 24-hour in-memory cache (sha256 of SQL), max 200 entries
- **Byte limit**: `maximumBytesBilled: 750GB` per query

## Server Lib

- **File**: `server/lib/bigquery.ts`
- **Key export**: `runQuery(sql: string): Promise<QueryResult>`
- **Return type**: `{ rows, totalRows, schema, bytesProcessed, cached? }` — **NOT an array**, always access `.rows`
- **Placeholder**: `@app_events` resolves to `analytics.events_partitioned` (fully qualified)

## General Table Usage Guidelines

**Always use these canonical tables for specific use cases:**

| Use Case             | Table to Use                         | Key Columns                                                           | Notes                                  |
| -------------------- | ------------------------------------ | --------------------------------------------------------------------- | -------------------------------------- |
| Customer contracts   | `dbt_mart.dim_contracts`             | contract_id, company_id, start_date, end_date, contract_value, status | Canonical source for all contract data |
| HubSpot deals        | `dbt_mart.dim_deals`                 | deal_id, amount, stage_name, is_closed_won, close_date                | NOT `deal_amount` or `deal_stage`      |
| Active subscriptions | `dbt_mart.dim_subscriptions`         | subscription_id, root_id, plan, status, subscription_arr              | Filter `status = 'active'`             |
| Enterprise customers | `dbt_mart.enterprise_companies`      | Joins hubspot_companies + dim_contracts + organizations               | Has health_status, renewal dates       |
| ARR                  | `finance.arr_revenue_tracker_latest` | unique_id, product, plan, status, arr_change, event_date              | arr changes on unique id level         |
| All Traffic          | `dbt_staging_bigquery.all_pageviews` |                                                                       |                                        |

**Schema preferences:**

- Use `dbt_mart.*` for business-level queries (deals, contracts, subscriptions, customers)
- Use `dbt_staging_bigquery.*` for raw event data (pageviews, signups)
- Use `dbt_analytics.*` for reporting views
- **Avoid `dbt_dev.*`** - development schema excluded globally

## Table Map

| Logical Name         | Actual Table                                                    | Key Columns                                                                                                                 |
| -------------------- | --------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- |
| First pageviews      | `dbt_staging_bigquery.first_pageviews`                          | visitor*id, url, referrer, created_date (TIMESTAMP), channel, utm*\*, user_id. **No page_type** — derive from URL.          |
| All pageviews        | `dbt_staging_bigquery.all_pageviews`                            | Has `page_type`, `sub_page_type`, `first_touch_channel`, `session_channel`, `c_referrer`, full utm fields                   |
| Signups              | `dbt_staging_bigquery.signups`                                  | visitor*id, user_id, root_organization_id, utm*\*, signup_url, created_date                                                 |
| Signups (enriched)   | `dbt_analytics.product_signups`                                 | user*id, user_create_d (TIMESTAMP), channel, icp_flag, top_subscription, referrer, utm*\*                                   |
| Blog metadata        | `sigma_materialized.SIGDS_82deb8e2_40f8_4fb4_b3cb_caa011a72d29` | Cryptic column names — see mapping below. 858 rows, deduplicate by blog slug.                                               |
| Blog content (old)   | `test.builder_blog_content`                                     | contentId, name (blog TITLE not author), handle, topic. Only 75 rows, no author. **DO NOT use for author data.**            |
| CRM contacts         | `dbt_mart.dim_hs_contacts`                                      | contact_id (INT64), b_visitor_id, builder_user_id, ql_score, date_entered_mql/sal/s0/s1, lifecycle_stage_name               |
| Deals                | `dbt_mart.dim_deals`                                            | deal_id, amount (not deal_amount), stage_name (not deal_stage), is_closed_won (string), arr_amount, close_date, create_date |
| Subscriptions        | `dbt_mart.dim_subscriptions`                                    | subscription_id, root_id, space_id, subscription_arr, start_date, plan, status                                              |
| Enterprise companies | `dbt_mart.enterprise_companies`                                 | Joins hubspot_companies + dim_contracts + organizations. Has upcoming_renewal_date, health_status, customer_stage.          |
| HubSpot companies    | `dbt_staging.hubspot_companies`                                 | company_name, company_id, company_domain_name, upcoming_renewal_date, root_org_id, current_enterprise_arr                   |

### Sigma Blog Metadata Column Mapping

| Cryptic Column | Meaning                  | Example Values                                                    |
| -------------- | ------------------------ | ----------------------------------------------------------------- |
| `SUOHFYGIOG`   | Blog URL                 | `https://www.example.com/blog/sample-post`                        |
| `H5YIATNDT5`   | Author                   | Jane Doe, Alex Chen, Sam Patel, Taylor Kim                        |
| `ZZJ6XRJAII`   | Publish date (TIMESTAMP) |                                                                   |
| `FTRKLGZM1R`   | Purpose                  | Acquisition, Awareness                                            |
| `IFHWPU1IDO`   | Persona                  | Developers, Product Managers, Engineering Leaders, Designers      |
| `Z52LFY52AK`   | Topic                    | AI, CMS, Web Development, Design                                  |
| `_DGCBJNKLE`   | Sub-type                 | Tooling, Development, Prototyping                                 |
| `JQL-G1QE-B`   | Sub-topic                | AI Design, AI Prototyping, AI Tools                               |

**Deduplication**: Table has duplicates (http:// vs https://). Always deduplicate: `REGEXP_EXTRACT(SUOHFYGIOG, r'/blog/([^/?#]+)')` with ROW_NUMBER or DISTINCT.

### Column Name Differences (bug sources)

| Spec Column             | Actual Column              | Table           |
| ----------------------- | -------------------------- | --------------- |
| `first_pageview_date`   | `created_date` (TIMESTAMP) | first_pageviews |
| `channel` (pageviews)   | `first_touch_channel`      | all_pageviews   |
| `referrer`              | `c_referrer`               | all_pageviews   |
| `referrer_channel`      | `session_channel`          | all_pageviews   |
| `user_create_date`      | `user_create_d`            | product_signups |
| `deal_stage`            | `stage_name`               | dim_deals       |
| `deal_amount`           | `amount`                   | dim_deals       |
| `visitor_id` → contacts | `b_visitor_id`             | dim_hs_contacts |
| `user_id` → contacts    | `builder_user_id`          | dim_hs_contacts |

### Join Paths

- **Visitor → Signup**: `first_pageviews.visitor_id = signups.visitor_id`
- **Visitor → Contact**: `first_pageviews.visitor_id = dim_hs_contacts.b_visitor_id`
- **Signup → Contact**: **CRITICAL: Match on BOTH user_id AND email**
  ```sql
  signups.user_id = dim_hs_contacts.builder_user_id
  AND signups.email = dim_hs_contacts.email
  ```
  Matching on both prevents mismatches from reassigned user IDs or data sync issues.
- **Signup → Subscription**: `signups.root_organization_id = dim_subscriptions.root_id`
- **Contact → Deal**: No direct FK. Use `dbt_intermediate.deal_first_contact` or lifecycle stage dates on dim_hs_contacts.
- **Blog → Author**: `builder_blog_content.contentId = content_with_models.content_id` → `dim_users.name`

### Datasets Overview

`dbt_staging_bigquery` (raw staged), `dbt_analytics` (reporting views), `dbt_mart` (dimensional models), `dbt_intermediate` (joins/transforms), `analytics` (raw events), `finance` (ARR tracking), `sigma_materialized` (Sigma-generated views with UUID names).

## SQL Patterns

### Deriving page_type from URL

```sql
CASE
  WHEN url LIKE '%/blog/%' THEN 'blog'
  WHEN url LIKE '%/docs/%' THEN 'docs'
  WHEN REGEXP_CONTAINS(url, r'example\.com/?(?:\?|$)') THEN 'marketing'
  WHEN url LIKE '%/sign-up%' THEN 'webapp'
  ELSE 'other'
END
```

### Timestamps vs Dates

- `first_pageviews.created_date` is TIMESTAMP — wrap: `TIMESTAMP('2025-11-01')`
- `product_signups.user_create_d` is TIMESTAMP — same
- `dim_hs_contacts.sign_up_time_stamp` is DATE — no wrapping
- Use `DATE(timestamp_col)` before `DATE_TRUNC` to avoid type mismatch

### Avoid double WHERE

```sql
-- WRONG
WHERE date BETWEEN '...' AND '...'
WHERE col IS NOT NULL

-- CORRECT
WHERE col IS NOT NULL AND date BETWEEN '...' AND '...'
```

## Analytics Events (Amplitude)

### Event column vs Name column

Agent chat events use `event` column, NOT `name` column (often NULL). Use `event = 'agent chat message submitted'`.

### Preferred table for agent chat events

Use **Amplitude** (`amplitude.EVENTS_182198`) instead of `@app_events` — smaller, stays within byte limits:

- Use `event_type` (not `event`) and `event_time` (not `createdDate`)
- `rootOrganizationId` and `organizationId` in `event_properties` JSON
- `builderSpaceId` is NULL for agent chat events
- Always use `capTo30Days()` helper for Amplitude time series

### Enterprise customer identification

- `data.isEnterpriseCompany` is **unreliable** — most enterprise events have it false/missing
- **Preferred**: JOIN to `dim_subscriptions` on rootOrganizationId, filter `LOWER(plan) = 'enterprise' AND status = 'active'`

### Key agent chat event types

| Event                                     | Description                              |
| ----------------------------------------- | ---------------------------------------- |
| `agent chat message submitted`            | User sends message in AI chat            |
| `visual editor ai chat message submitted` | User sends message in Visual Editor AI   |
| `agent chat message completed`            | AI response completed                    |
| `agent chat code applied`                 | User applied generated code              |

### Customer agent chat message lookup pipeline

1. HubSpot deal → company → contacts
2. `dim_hs_contacts` (contact_id is INT64) → `builder_user_id`
3. `signups` → `root_organization_id`
4. Amplitude events WHERE `JSON_VALUE(event_properties, '$.rootOrganizationId') IN (...)`
5. **Always filter out internal team emails** — internal SEs show up in customer org events

## Dashboard Data Fetching (CRITICAL)

**NEVER use scripts for dashboard UI data.** Use `useMetricsQuery(queryKey, sql)` with direct BigQuery SQL:

- Define SQL in `queries.ts` alongside the dashboard
- Queries go through authenticated `/api/query` endpoint
- For customer lookups, use CTEs with JOINs to `dim_hs_contacts`
- **Scripts are for CLI/agent use only**
More from BuilderIO/agent-native