site-content-catalog
$ npx mdskill add gooseworks-ai/goose-skills/site-content-catalog
Catalog every page to build complete content inventories.
- Enables SEO audits and content gap analysis.
- Depends on sitemap and blog index crawling.
- Groups content by category and topic cluster.
- Outputs structured JSON with URL, title, and date.
---
name: site-content-catalog
description: >
Crawl a website's sitemap and blog index to build a complete content inventory.
Lists every page with URL, title, publish date, content type, and topic cluster.
Groups content by category and topic. Optionally deep-reads top N pages for
quality analysis and funnel stage tagging. Use before SEO audits, content gap
analysis, or brand voice extraction.
tags: [content, seo]
---
# Site Content Catalog
Crawl a website's sitemap and blog to build a complete content inventory — every page cataloged with URL, title, date, content type, and topic cluster. Groups content by category, identifies publishing patterns, and optionally deep-analyzes top pages.
## Quick Start
```bash
# Basic content inventory
python3 scripts/catalog_content.py --domain "example.com"
# With deep analysis of top 20 pages
python3 scripts/catalog_content.py --domain "example.com" --deep-analyze 20
# Output to specific file
python3 scripts/catalog_content.py --domain "example.com" --output content-inventory.json
```
## Inputs
| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| domain | Yes | — | Domain to catalog (e.g., "example.com") |
| deep-analyze | No | 0 | Number of top pages to deep-read for content analysis |
| output | No | stdout | Path to save JSON output |
| include-non-blog | No | true | Also catalog landing pages, docs, etc. (not just blog) |
## Cost
- **Sitemap/RSS crawling:** Free (direct HTTP requests)
- **Apify sitemap extractor (fallback):** ~$0.50 per site
- **Deep analysis:** Free (WebFetch on individual pages)
## Process
### Phase 1: Discover All Pages
The script attempts multiple methods to find all pages on a site, in order:
#### A) Sitemap.xml
1. Fetch `https://[domain]/sitemap.xml`
2. If it's a sitemap index, recursively fetch all child sitemaps
3. Common alternate locations: `/sitemap_index.xml`, `/sitemap-index.xml`, `/wp-sitemap.xml`
4. Check `robots.txt` for `Sitemap:` directives
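
The sitemap steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of `scripts/catalog_content.py`: the names `parse_sitemap`, `sitemaps_from_robots`, and `collect_urls` are hypothetical, and `fetch` is a caller-supplied function that returns the body of a URL as text (for example a thin wrapper around `requests.get`).

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip the XML namespace from a tag name, e.g. '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text):
    """Return (kind, entries); kind is 'index' or 'urlset', entries are (loc, lastmod)."""
    root = ET.fromstring(xml_text)
    kind = "index" if _local(root.tag) == "sitemapindex" else "urlset"
    entries = []
    for node in root:
        fields = {_local(child.tag): (child.text or "").strip() for child in node}
        if fields.get("loc"):
            entries.append((fields["loc"], fields.get("lastmod") or None))
    return kind, entries


def sitemaps_from_robots(robots_txt):
    """Collect URLs from 'Sitemap:' lines in robots.txt (field name is case-insensitive)."""
    urls = []
    for line in robots_txt.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def collect_urls(start_url, fetch, seen=None):
    """Recursively expand sitemap indexes into a flat list of (loc, lastmod) pages."""
    seen = set() if seen is None else seen
    if start_url in seen:  # guard against indexes that reference each other
        return []
    seen.add(start_url)
    kind, entries = parse_sitemap(fetch(start_url))
    if kind == "urlset":
        return entries
    pages = []
    for loc, _ in entries:
        pages.extend(collect_urls(loc, fetch, seen))
    return pages
```

Passing `fetch` in as a parameter keeps the parsing logic testable without network access; the alternate sitemap locations from step 3 can be tried by calling `collect_urls` on each candidate until one parses.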
#### B) RSS/Atom Feeds
1. Check `/feed`, `/rss`, `/atom.xml`, `/blog/feed`, etc.
2. Extract posts with titles, dates, and URLs
3. RSS typically only surfaces recent content (last 10-50 posts)
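
As a rough illustration of step 2, the sketch below pulls titles, dates, and URLs out of an RSS 2.0 feed using only the standard library. The function name `parse_rss_items` is hypothetical, Atom feeds (which use namespaced `entry` elements) are not handled here, and how the actual script parses feeds is not specified in this document.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_rss_items(xml_text):
    """Return a list of {'url', 'title', 'date'} dicts from an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    posts = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        try:
            # RSS dates use the RFC 822 format, e.g. 'Sat, 15 Nov 2025 09:30:00 GMT'
            date = parsedate_to_datetime(pub).date().isoformat() if pub else None
        except (TypeError, ValueError):
            date = None  # leave the post undated rather than guess
        posts.append({
            "url": (item.findtext("link") or "").strip(),
            "title": (item.findtext("title") or "").strip(),
            "date": date,
        })
    return posts
```

Dates are normalized to `YYYY-MM-DD` so they line up with the `date` field in the JSON output.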
#### C) Blog Index Crawl
1. Fetch `/blog`, `/resources`, `/insights`, `/news`, `/articles`
2. Extract links from the page
3. Follow pagination if present (`/blog/page/2`, `?page=2`, etc.)
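
One way to picture steps 2 and 3 is the sketch below, which separates post links from pagination links on a single index page. It is an assumption-laden illustration: `LinkCollector`, `split_index_links`, and the `PAGINATION` regex are hypothetical, and the regex covers only the two pagination shapes named above.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute href values from <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


# Hypothetical pattern covering '/blog/page/2' and '?page=2' style pagination.
PAGINATION = re.compile(r"/page/\d+/?$|[?&]page=\d+")


def split_index_links(base_url, html):
    """Return (post_links, pagination_links) for one index page, same host only."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    host = urlparse(base_url).netloc
    posts, pages = [], []
    for link in dict.fromkeys(collector.links):  # de-duplicate, keep order
        if urlparse(link).netloc != host:
            continue  # ignore off-site links
        (pages if PAGINATION.search(link) else posts).append(link)
    return posts, pages
```

A crawler would fetch each pagination link in turn and repeat the split until no unseen pagination links remain.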
#### D) Site: Search (fallback)
1. WebSearch: `site:[domain]` to estimate total indexed pages
2. WebSearch: `site:[domain]/blog` to find blog content
3. WebSearch: `site:[domain] intitle:` to discover page title patterns
#### E) Apify Sitemap Extractor (fallback for JS-heavy sites)
- Actor: `onescales/sitemap-url-extractor`
- Use when sitemap.xml is missing and the site is JS-rendered
### Phase 2: Classify Each Page
For each discovered URL, classify by:
#### Content Type
Classify based on URL patterns and page titles:
| Type | URL Patterns | Examples |
|------|-------------|----------|
| `blog-post` | `/blog/`, `/posts/`, `/articles/` | How-to guides, opinion pieces |
| `case-study` | `/case-study/`, `/customers/`, `/success-stories/` | Customer stories |
| `comparison` | `/vs/`, `/compare/`, `/alternative/` | X vs Y pages |
| `landing-page` | `/solutions/`, `/use-cases/`, `/for-/` | Product marketing pages |
| `docs` | `/docs/`, `/help/`, `/documentation/`, `/api/` | Technical documentation |
| `changelog` | `/changelog/`, `/releases/`, `/whats-new/` | Product updates |
| `pricing` | `/pricing/` | Pricing page |
| `about` | `/about/`, `/team/`, `/careers/` | Company pages |
| `legal` | `/privacy/`, `/terms/`, `/security/` | Legal/compliance |
| `resource` | `/resources/`, `/guides/`, `/ebooks/`, `/webinars/` | Gated/downloadable content |
| `glossary` | `/glossary/`, `/dictionary/`, `/terms/` | SEO glossary pages |
| `integration` | `/integrations/`, `/apps/`, `/marketplace/` | Integration pages |
| `other` | — | Anything else |
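
The table above maps naturally onto a first-match rule list. The sketch below is one possible reading, not the script's actual classifier: `RULES` and `classify_url` are hypothetical names, `/for-/` is interpreted as "a path segment starting with `for-`", and title-based classification is left out. Because `/terms/` is listed under both `legal` and `glossary`, a first-match scheme in table order resolves it to `legal`.

```python
from urllib.parse import urlparse

# Rows copied from the table above, in table order; the first match wins.
RULES = [
    ("blog-post", ["/blog/", "/posts/", "/articles/"]),
    ("case-study", ["/case-study/", "/customers/", "/success-stories/"]),
    ("comparison", ["/vs/", "/compare/", "/alternative/"]),
    ("landing-page", ["/solutions/", "/use-cases/", "/for-"]),  # '/for-' is an assumed reading
    ("docs", ["/docs/", "/help/", "/documentation/", "/api/"]),
    ("changelog", ["/changelog/", "/releases/", "/whats-new/"]),
    ("pricing", ["/pricing/"]),
    ("about", ["/about/", "/team/", "/careers/"]),
    ("legal", ["/privacy/", "/terms/", "/security/"]),
    ("resource", ["/resources/", "/guides/", "/ebooks/", "/webinars/"]),
    ("glossary", ["/glossary/", "/dictionary/", "/terms/"]),
    ("integration", ["/integrations/", "/apps/", "/marketplace/"]),
]


def classify_url(url):
    """Return the content type for a URL based on its path, or 'other'."""
    path = urlparse(url).path.lower()
    if not path.endswith("/"):
        path += "/"  # so '/pricing' still matches the '/pricing/' pattern
    for content_type, patterns in RULES:
        if any(pattern in path for pattern in patterns):
            return content_type
    return "other"
```

For example, `classify_url("https://example.com/pricing")` returns `"pricing"`, and a bare homepage URL falls through to `"other"`.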
#### Topic Cluster
Group by extracting topic signals from URL slugs and titles:
- Extract keywords from URL path segments
- Group similar keywords into clusters (e.g., "aws-cost", "cloud-spending", "finops" → "Cloud Cost Management")
- Use simple keyword co-occurrence for clustering
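
A minimal sketch of co-occurrence clustering is shown below, assuming that keywords appearing in the same URL slug belong together and that URLs sharing any linked keyword fall into one cluster. The names `slug_keywords`, `cluster_by_cooccurrence`, and the `STOPWORDS` set are hypothetical. The sketch labels each cluster with its most frequent keyword; mapping those labels to readable names such as "Cloud Cost Management" would need a synonym list that this document does not specify.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Hypothetical stopword list; section names like 'blog' are dropped so they
# do not link every post into a single cluster.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "for", "in", "on", "how",
             "your", "by", "with", "blog", "posts", "articles"}


def slug_keywords(url):
    """Split the URL path into lowercase keywords, dropping stopwords and numbers."""
    tokens = re.split(r"[^a-z0-9]+", urlparse(url).path.lower())
    return [t for t in tokens if t and t not in STOPWORDS and not t.isdigit()]


def cluster_by_cooccurrence(urls):
    """Group URLs whose slugs share keywords, using union-find over keywords."""
    parent = {}

    def find(key):
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    keywords = {url: slug_keywords(url) for url in urls}
    for words in keywords.values():
        for word in words[1:]:
            parent[find(word)] = find(words[0])  # words in one slug co-occur

    groups = defaultdict(list)
    for url, words in keywords.items():
        if words:
            groups[find(words[0])].append(url)

    clusters = {}
    for members in groups.values():
        counts = Counter(w for url in members for w in keywords[url])
        label = counts.most_common(1)[0][0]  # most frequent keyword names the cluster
        clusters[label] = members
    return clusters
```

Linking through shared keywords is deliberately crude and can over-merge on large sites, so a real implementation may want a minimum co-occurrence count before joining two keywords.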
### Phase 3: Analyze Publishing Patterns
From the dated content (primarily blog posts):
- **Total content pieces** by type
- **Publishing frequency:** Posts per month over last 12 months
- **Trend:** Increasing, decreasing, or stable output
- **Recency:** Date of most recent publish
- **Author diversity:** Unique authors (if extractable from RSS)
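
The frequency, trend, and recency figures could be computed along these lines. This is a sketch under stated assumptions: `month_keys` and `publishing_cadence` are hypothetical names, the trend compares the most recent six months against the six before them, and the 20% `tolerance` band for "stable" is an arbitrary choice rather than something this document defines.

```python
from collections import Counter
from datetime import date


def month_keys(today, months=12):
    """Return the last `months` calendar months as 'YYYY-MM' strings, oldest first."""
    keys = []
    year, month = today.year, today.month
    for _ in range(months):
        keys.append(f"{year:04d}-{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return keys[::-1]


def publishing_cadence(iso_dates, today, tolerance=0.2):
    """Summarize posts per month over the last 12 months plus a coarse trend."""
    dated = [d for d in iso_dates if d]          # skip pages without a date
    counts = Counter(d[:7] for d in dated)       # 'YYYY-MM-DD' -> 'YYYY-MM'
    per_month = [counts.get(key, 0) for key in month_keys(today)]
    older, recent = sum(per_month[:6]) / 6, sum(per_month[6:]) / 6
    if recent > older * (1 + tolerance):
        trend = "increasing"
    elif recent < older * (1 - tolerance):
        trend = "decreasing"
    else:
        trend = "stable"
    return {
        "posts_per_month_avg": round(sum(per_month) / len(per_month), 1),
        "trend": trend,
        "most_recent": max(dated, default=None),  # ISO dates sort lexicographically
    }
```

The returned keys mirror the `publishing_cadence` object in the JSON output. Note that the current, partially elapsed month is included in the window, which slightly understates the recent average.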
### Phase 4: Deep Analysis (Optional)
If `--deep-analyze N` is specified, fetch the top N pages (prioritizing blog posts) and extract:
- **Word count** (approximate)
- **Target keyword** (inferred from title + H1 + URL)
- **Funnel stage:** TOFU (awareness), MOFU (consideration), BOFU (decision)
- **Content depth:** Shallow (<500 words), Medium (500-1,499), Deep (1,500+)
- **Has images/video:** Boolean
- **Has CTA:** Boolean (detected by common CTA patterns)
- **Internal links count**
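
Several of these fields can be derived from raw HTML with the standard library alone, as the sketch below suggests. It is illustrative only: `PageStats`, `analyze_page`, and `content_depth` are hypothetical names, the `CTA_PATTERN` phrases are invented examples of "common CTA patterns", the `internal_links` key is an assumed name, and a page of exactly 1,500 words is counted as deep. Target keyword and funnel stage are omitted because they require judgment beyond simple parsing.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical CTA phrases; the real detection patterns are not documented here.
CTA_PATTERN = re.compile(
    r"\b(sign up|get started|book a demo|start (your )?free trial|contact sales)\b", re.I)


class PageStats(HTMLParser):
    """Gather visible text, media presence, and same-host link count."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc
        self.text_parts = []
        self.internal_links = 0
        self.has_media = False
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("img", "video"):
            self.has_media = True
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            if href and urlparse(urljoin(self.page_url, href)).netloc == self.host:
                self.internal_links += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_parts.append(data)


def content_depth(word_count):
    """Bucket a word count into shallow / medium / deep."""
    if word_count < 500:
        return "shallow"
    return "medium" if word_count < 1500 else "deep"


def analyze_page(page_url, html):
    """Return approximate deep-analysis fields for one page."""
    stats = PageStats(page_url)
    stats.feed(html)
    stats.close()
    text = " ".join(stats.text_parts)
    words = len(text.split())
    return {
        "word_count": words,
        "content_depth": content_depth(words),
        "has_images": stats.has_media,
        "has_cta": bool(CTA_PATTERN.search(text)),
        "internal_links": stats.internal_links,
    }
```

The word count includes navigation and footer text, which is one reason the figure is only approximate.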
### Phase 5: Output
#### JSON Output (default)
```json
{
"domain": "example.com",
"crawl_date": "2026-02-25",
"total_pages": 347,
"discovery_methods": ["sitemap.xml", "rss"],
"pages": [
{
"url": "https://example.com/blog/reduce-aws-costs",
"title": "How to Reduce Your AWS Bill by 40%",
"date": "2025-11-15",
"type": "blog-post",
"topic_cluster": "Cloud Cost Optimization",
"deep_analysis": {
"word_count": 2100,
"target_keyword": "reduce aws costs",
"funnel_stage": "TOFU",
"content_depth": "deep",
"has_images": true,
"has_cta": true
}
}
],
"summary": {
"by_type": {"blog-post": 89, "landing-page": 23, "case-study": 12, ...},
"by_topic": {"Cloud Cost Optimization": 34, "FinOps": 18, ...},
"publishing_cadence": {
"posts_per_month_avg": 4.2,
"trend": "increasing",
"most_recent": "2026-02-20"
}
}
}
```
#### Markdown Summary (also generated)
```markdown
# Content Inventory: example.com
**Crawled:** 2026-02-25 | **Total pages:** 347
## Content by Type
| Type | Count | % |
|------|-------|---|
| Blog Posts | 89 | 25.6% |
| Landing Pages | 23 | 6.6% |
| ...
## Content by Topic Cluster
| Topic | Posts | Most Recent |
|-------|-------|-------------|
| Cloud Cost Optimization | 34 | 2026-02-20 |
| ...
## Publishing Cadence
- Average: 4.2 posts/month
- Trend: Increasing (3.1 → 5.4 over last 6 months)
- Most recent: 2026-02-20
## Full Catalog
| # | Date | Type | Topic | Title | URL |
|---|------|------|-------|-------|-----|
| 1 | 2026-02-20 | blog-post | Cloud Cost | How to Reduce... | https://... |
```
## Tips
- **Sitemap.xml is the best source.** Most well-maintained sites have one. A missing sitemap is itself a negative SEO signal.
- **RSS only shows recent content.** If you need the full catalog, sitemap is essential. RSS is supplementary.
- **Deep analysis is optional but valuable.** Use it when feeding into brand-voice-extractor or when you need funnel stage mapping.
- **JS-rendered sites** may need the Apify fallback. Signs: sitemap.xml returns HTML, or blog page returns mostly JavaScript.
- **Combine with seo-domain-analyzer** to overlay traffic data on the content inventory — see which content actually performs.
## Dependencies
- Python 3.8+
- `requests` library (`pip install requests`)
- `APIFY_API_TOKEN` env var (only for Apify fallback mode)