Sigil AI · Production Plan v1

Lead → Qualified → Mock → Permanent URL

The end-to-end harness that turns Phase-0 scraped pool-service leads into live *.pages.dev URLs ready for the outbound email stage. Every agent, prompt, config, and gate documented before a single new mock is built.

Niche pilot: pool_service · DFW · Available: 173 qualified prospects remaining (178 minus 5 done) · Calibration: 5/5 B-wins (locked) · Next ship: 10-prospect batch
Pipeline · One-page mental model

From a row in scored_llm.jsonl to https://<slug>.pages.dev

Five stages. Each stage is idempotent, resumable, and writes a status row to a shared ledger. Nothing is re-done if it's already done.

STAGE 0 — DONE
Lead discovery + qualification
Already ran. 3,042 candidates · 184 qualified for pool. Output is the input to Stage 1.
Inputs Google Places (12 niches × 16 DFW cities)
Outputs data/scored_llm.jsonl · data/scored_no_website.jsonl
Cost paid $33.30 Google Places · $0 LLM (OAuth)
STAGE 1 — PICK
Batch selection + dedup
Read N unprocessed prospects from the scored JSONL. Skip anything already in batch-state.jsonl.
Worker scripts/batch_pick.py --niche pool_service --count 10
Filter pure-niche (name OR domain contains "pool") · score ≥ 0.7 · phone present · NOT in batch-state.jsonl
Outputs list of N place_ids to process · prospect dirs scaffolded
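
A minimal sketch of that filter, assuming the scored rows carry name, domain, score, phone, and place_id keys (the exact schema is an assumption):

    # batch_pick sketch: filter scored leads, skip anything already in the ledger
    import json
    from pathlib import Path

    def load_done_ids(ledger=Path("batch-state.jsonl")) -> set:
        # Any existing ledger row counts as "seen"; status handling is Stage 1's rule.
        if not ledger.exists():
            return set()
        return {json.loads(row)["place_id"] for row in ledger.open() if row.strip()}

    def pick(scored=Path("data/scored_llm.jsonl"), count=10) -> list:
        done, picks = load_done_ids(), []
        for line in scored.open():
            p = json.loads(line)
            pure_niche = ("pool" in p.get("name", "").lower()
                          or "pool" in p.get("domain", "").lower())
            if (pure_niche and p.get("score", 0) >= 0.7
                    and p.get("phone") and p["place_id"] not in done):
                picks.append(p["place_id"])
            if len(picks) == count:
                break
        return picks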
STAGE 2 — IMAGES
Generate 4 images per prospect
Per-prospect prompts tuned by brand register. Skipped if images/manifest.json already exists with 4 valid files.
Worker scripts/batch_images.py
Model gpt-image-2-medium via openai-codex provider (ChatGPT OAuth · $0 marginal)
Style rule "unstaged candid documentary" (DESIGN.md §7)
Outputs 4 .jpg files + manifest.json with vision descriptions per prospect
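
The skip rule can be a pure filesystem check. A sketch, assuming manifest.json lists its four images under a "files" key (that key name is an assumption):

    import json
    from pathlib import Path

    def images_done(prospect_dir: Path) -> bool:
        # True when Stage 2 already produced 4 valid images (idempotency check).
        manifest = prospect_dir / "images" / "manifest.json"
        if not manifest.exists():
            return False
        files = json.loads(manifest.read_text()).get("files", [])  # assumed key
        paths = [prospect_dir / "images" / f for f in files]
        return len(paths) == 4 and all(p.exists() and p.stat().st_size > 0 for p in paths)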
STAGE 3 — RENDER + 3-GATE QA
Build the mock HTML, auto-validate
Existing v3 pipeline. Render via Opus 4.7 XHigh, then pass through Gate 1 (deterministic), Gate 2 (functional), and Gate 3 (vision QA). Surgical retries up to 5 times.
Worker scripts/render_prospect.py <slug> (already built)
Outputs prospects/<slug>/mock/index.html + preview screenshots + qa-log.json
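
Reduced to a control-flow sketch (the gate callables are injected so the snippet stays self-contained; real logic lives in scripts/render_prospect.py, and the per-finding "severity" key is an assumption):

    # Stage-3 loop: FULL render once, SURGICAL (Edit-not-Rewrite) on retries
    MAX_ATTEMPTS = 5
    BLOCKING = {"CRITICAL", "HIGH", "MEDIUM"}

    def render_with_qa(slug, render, gate1, gate2, gate3, route_manual):
        findings = None
        for attempt in range(1, MAX_ATTEMPTS + 1):
            mode = "FULL" if attempt == 1 else "SURGICAL"
            render(slug, mode=mode, findings=findings)
            if not (gate1(slug) and gate2(slug)):   # deterministic + functional gates
                continue
            verdict = gate3(slug)                   # strict-JSON vision review
            findings = [f for f in verdict["findings"]
                        if f["severity"] in BLOCKING]
            if not findings:
                return "ship_ready"
        route_manual(slug, findings)                # fires the Telegram alert
        return "needs_manual_review"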
STAGE 4 — DEPLOY
Cloudflare Pages → permanent URL
One Pages project per prospect. Production URL is stable forever; same URL across redeploys.
Worker scripts/deploy_prospect.py <slug>
Stack npx wrangler@latest pages deploy
URL shape https://sigil-pool-<slug>.pages.dev
Cost $0 (Pages free tier: 500 builds/mo, unlimited bandwidth)
STAGE 5 — LEDGER + PING
Update batch-state · spot-check page · Telegram ping
Write outcome row. Build/refresh a comparison page (current site vs mock URL) for the batch. Send Samar one URL.
Outputs batch-state.jsonl row · spot-check Pages URL · Telegram message
Agents · who does what

Four agents, three of them already running

No subagent zoo. Each agent is a process with a clear input/output contract and a cost.

gpt-5.5
Stage-1 picker (same model that already ran the Stage-0 qualification)
Where it runs: Codex CLI via OAuth (free on Codex Plus subscription)
What it does in this pipeline: nothing new at Stage 1 — pure Python filters on existing scored JSONL.
Future use: personal-hook extraction at outbound stage (deferred).
Cost: $0 marginal
gpt-image-2-medium
Stage-2 image generator
Where it runs: Hermes image_generate tool, provider openai-codex
Auth: ChatGPT OAuth (your Codex subscription) · $0 marginal
Reasoning: n/a (image model)
Prompt style: "unstaged candid documentary, phone snapshot quality, unflattering 10:30 morning angle, harsh shadows, mid-repair, no watermarks, no branding, no text overlays"
Per-prospect prompt: seeded from design-brief.md brand register (Petri = warm Texas earth, Crystal White = clean blue/white, etc.)
Cost: $0 (covered by Codex subscription, same path used for all 5 calibration prospects)
Claude Opus 4.7 XHigh
Stage-3 renderer (build the mock)
Where it runs: claude --print --model claude-opus-4-7 --permission-mode bypassPermissions
Auth: OAuth via Claude Max subscription · $0 marginal
Mode: FULL on first attempt; SURGICAL (Edit not Rewrite) on Gate-3 retries
Inputs: audit.json + design-brief.md + DESIGN.md + image manifest with vision descriptions + current site screenshot
Hard rules: 12 layout rules baked into base prompt (sticky-bar padding, phone-nowrap, ≤8 sections, no invented stats, etc.)
Time: 60-180s per attempt · typical 1-2 attempts to ship-ready
Claude Opus 4.7 (vision)
Stage-3 Gate-3 vision QA reviewer
Where it runs: same claude --print, multimodal mode on preview screenshots
Anchor: owner-perception ("would the pool owner say 'send this'") NOT designer portfolio
Severity model: Critical/High/Medium block · Cosmetic logged only
Sees: current-site screenshot + mock mobile + mock desktop · "clearly better than current?"
Output: strict JSON · {score_out_of_10, clearly_better_than_current, findings[]}
Stage 3 internals · the 3-gate QA harness

Why mocks reach 10/10 before Samar ever sees them

This is the part Samar already approved on 5/5 calibration prospects. Reusing as-is, no changes.

GATE 1 · DETERMINISTIC
Static HTML/CSS checks
  • sticky-bar + body padding-bottom + safe-area-inset
  • phone CTAs have white-space: nowrap OR <br>
  • ≤ 8 <section> elements
  • every <img> has width+height
  • no placeholder rectangles in header
  • no invented stats (only audit.json values)
  • HTML size 8KB-50KB · no embedded base64
  • desktop ≠ stretched mobile
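
A sketch of a few of these checks using only the standard library; the regex-level heuristics are illustrative, not the gate's exact implementation:

    import re
    from pathlib import Path

    def gate1(html_path: Path) -> list:
        html = html_path.read_text()
        failures = []
        if len(re.findall(r"<section\b", html)) > 8:
            failures.append("more than 8 <section> elements")
        if not 8_000 <= len(html.encode()) <= 50_000:
            failures.append("HTML outside the 8KB-50KB band")
        if "base64," in html:
            failures.append("embedded base64 payload")
        for img in re.findall(r"<img\b[^>]*>", html):
            if "width" not in img or "height" not in img:
                failures.append("an <img> is missing width/height")
        return failures  # empty list == gate passed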
GATE 2 · FUNCTIONAL
Playwright at 3 viewports
  • 0 console errors
  • all images naturalWidth > 0 after networkidle
  • no horizontal overflow at 390px
  • page load < 3000ms (mobile)
  • writes preview/{mobile,tablet,desktop}.png
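
A sketch with Playwright's sync API, showing the 390px viewport only; the real gate runs all three viewports and writes all three screenshots:

    from playwright.sync_api import sync_playwright

    def gate2_mobile(url: str) -> list:
        failures = []
        with sync_playwright() as pw:
            browser = pw.chromium.launch()
            page = browser.new_page(viewport={"width": 390, "height": 844})
            page.on("console", lambda m: failures.append(f"console: {m.text}")
                    if m.type == "error" else None)
            page.goto(url, wait_until="networkidle", timeout=3000)  # 3000ms budget
            # Every image must have actually loaded (naturalWidth > 0).
            if not page.evaluate("[...document.images].every(i => i.naturalWidth > 0)"):
                failures.append("image with naturalWidth == 0")
            # No horizontal overflow at the mobile viewport width.
            if page.evaluate("document.documentElement.scrollWidth") > 390:
                failures.append("horizontal overflow at 390px")
            page.screenshot(path="preview/mobile.png", full_page=True)
            browser.close()
        return failures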
GATE 3 · VISION
Owner-anchored review
  • compares mock vs current-site screenshot
  • Critical/High/Medium → SURGICAL retry (Edit-not-Rewrite)
  • Cosmetic → logged to qa-log.json (non-blocking)
  • up to 5 retries before needs_manual_review/
  • fires Telegram alert on manual-review routing
Gate 3 vision prompt (abridged · full version in render_prospect.py)

    You are QA'ing a modernized website MOCK for a pool service owner.
    Not a designer. Owner-operator on his phone between jobs.

    SHIP TEST: would a regular pool-service owner, shown both pages on his
    phone, say "yes, this is clearly better, send this"?

    SEVERITY (anchor every finding to OWNER perception):
    - CRITICAL: page broken (console errors, layout collapse, CTA overlap)
    - HIGH: visible bug owner would notice (phone wraps mid-digits, sticky overlap)
    - MEDIUM: non-designer would feel something off ("FORT / WORTH" split, blank stat cell)
    - COSMETIC: designer would notice, regular person wouldn't (logo distinctiveness, italic overuse)

    If audit.json only has 2 stats, a 2-stat layout is correct. Do NOT flag
    "needs more stats." Inventing data is worse than sparse layout.

    Output strict JSON: {score_out_of_10, clearly_better_than_current, findings[]}
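
On the consumer side, a sketch of how a verdict might be routed. Only the three top-level keys are confirmed by the prompt; the per-finding "severity" field and the qa-log write shape are assumptions:

    import json

    BLOCKING = {"CRITICAL", "HIGH", "MEDIUM"}

    def route_verdict(raw: str):
        v = json.loads(raw)  # model is instructed to emit strict JSON only
        blocking = [f for f in v["findings"] if f.get("severity") in BLOCKING]
        cosmetic = [f for f in v["findings"] if f.get("severity") == "COSMETIC"]
        # Cosmetic findings are logged, never blocking (log shape is illustrative).
        with open("qa-log.json", "a") as fh:
            for f in cosmetic:
                fh.write(json.dumps(f) + "\n")
        ship = v["clearly_better_than_current"] and not blocking
        return ship, blocking  # blocking findings seed the SURGICAL retry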
Stage 4 internals · Cloudflare Pages deploy

One project per prospect, one stable URL forever

Proven path. Token saved, wrangler verified, 13 existing Pages projects already listed via the same credentials.

1
Create the Pages project (idempotent)
npx wrangler@latest pages project create sigil-pool-<slug> --production-branch main · errors silenced if it already exists
2
Deploy the mock directory
npx wrangler@latest pages deploy prospects/pool-<slug>/mock --project-name sigil-pool-<slug> --branch main --commit-dirty=true
3
Capture the production URL
Output URL is always https://sigil-pool-<slug>.pages.dev — stable across redeploys.
4
Write to batch-state.jsonl
Append row: {slug, place_id, status: "ship_ready", url, vision_score, attempts, ts}
Why Pages, not the tunnel: outbound emails get opened days or weeks later. The generally-engineers-predict-coffee.trycloudflare.com tunnel dies when the mini reboots, and it comes back with a different URL anyway. Pages URLs are forever. Free tier covers everything we need.
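
A sketch of what scripts/deploy_prospect.py reduces to: the exact wrangler commands above, wrapped in subprocess calls:

    import subprocess

    def deploy(slug: str) -> str:
        project = f"sigil-pool-{slug}"
        # Step 1: create-if-missing; wrangler errors when the project exists, so swallow it
        subprocess.run(
            ["npx", "wrangler@latest", "pages", "project", "create", project,
             "--production-branch", "main"],
            capture_output=True)
        # Step 2: deploy the mock directory to the production branch
        subprocess.run(
            ["npx", "wrangler@latest", "pages", "deploy",
             f"prospects/pool-{slug}/mock",
             "--project-name", project, "--branch", "main", "--commit-dirty=true"],
            check=True)
        # Step 3: the production URL is deterministic, no output parsing needed
        return f"https://{project}.pages.dev"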
Dedup ledger · don't re-do work

One JSONL file is the source of truth for "what's done"

Field | Type | Example
slug | str | petri-pools
place_id | str | ChIJxxx... (Google)
niche | str | pool_service
status | enum | ship_ready · needs_manual_review · image_failed · render_failed · deploy_failed
vision_score | int | 8
attempts | int | 2
pages_url | str | https://sigil-pool-petri-pools.pages.dev
duration_s | int | 187
cost_usd | float | 0.40 (images only)
ts | iso8601 | 2026-05-16T17:42:00Z
Stage-1 dedup rule: before processing a prospect, check batch-state.jsonl for that place_id. If status is ship_ready, skip. If *_failed, allow re-attempt. If needs_manual_review, skip (don't loop on it).
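
A sketch of the batch_state.py helpers implementing exactly that rule over the append-only JSONL (last row per place_id wins):

    import json, time
    from pathlib import Path

    LEDGER = Path("batch-state.jsonl")

    def latest_rows() -> dict:
        # Map place_id -> most recent ledger row (append-only, last write wins).
        rows = {}
        if LEDGER.exists():
            for line in LEDGER.open():
                if line.strip():
                    r = json.loads(line)
                    rows[r["place_id"]] = r
        return rows

    def should_process(place_id: str) -> bool:
        row = latest_rows().get(place_id)
        if row is None:
            return True                        # never attempted
        if row["status"].endswith("_failed"):
            return True                        # *_failed may be re-attempted
        return False                           # ship_ready / needs_manual_review: skip

    def mark(slug: str, place_id: str, status: str, **extra):
        row = {"slug": slug, "place_id": place_id, "niche": "pool_service",
               "status": status,
               "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()), **extra}
        with LEDGER.open("a") as fh:
            fh.write(json.dumps(row) + "\n")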
The full file map

What gets built (new) vs what already exists

File | Role | Status
scripts/render_prospect.py | Stage 3 renderer + 3-gate QA | DONE
scripts/build_prospect_dirs.py | Stage 1 scaffolder | DONE
scripts/build_image_manifests.py | Stage 2 image gen | DONE
data/scored_llm.jsonl | Stage 0 output | DONE
DESIGN.md (9 sections, 10/10 contract, image rules) | Shared design constraint | DONE
~/.hermes/.env CLOUDFLARE_API_TOKEN + ACCOUNT_ID | Stage 4 auth | DONE
scripts/batch_state.py | Dedup ledger helpers (mark / is_done / pick_next) | NEW
scripts/batch_pick.py | Stage 1: pick N from scored, exclude done | NEW
scripts/deploy_prospect.py | Stage 4: wrangler pages create + deploy | NEW
scripts/batch_run.py | Orchestrator: pick → image → render → deploy → ledger | NEW
scripts/spot_check.py | Build comparison HTML for batch → Pages | NEW
scripts/launch_tmux_batch.sh | tmux wrapper so SSH drops don't kill the run | NEW
cron job (Hermes) — "ping when batch done" | Notification mechanism that actually works | NEW
Resource budget · per batch of 10

What this costs and how long it takes

Resource | Per prospect | Per batch of 10 | Full 173-prospect pool sweep
Image gen ($) | $0 (Codex OAuth) | $0 | $0
Opus rendering ($) | $0 (Claude Max OAuth) | $0 | $0
Vision QA ($) | $0 (Claude Max OAuth) | $0 | $0
Cloudflare Pages ($) | $0 (free tier) | $0 | $0
Wall-clock (concurrency=3) | 3-8 min | ~25 min | ~7 hours
Output | 1 .pages.dev URL | 10 URLs | 173 URLs
Whole pipeline is $0 marginal. Every model call routes through OAuth-subscription credentials you're already paying for (Claude Max, Codex Plus). Cloudflare Pages free tier covers deploys. The only money spent on this whole project so far is the $33.30 Google Places bill at Stage 0.
Decisions you're approving when you say "go"

Five open calls baked into this plan

D1 · Score threshold for "qualified"
Pool counts: 184 at score ≥ 0.7 · 105 at ≥ 0.8. Plan uses ≥ 0.7 (184) for batch eligibility; the LLM judge is conservative enough that this isn't too loose. Pure-niche filter (name OR domain contains "pool") narrows to 178.
D2 · Concurrency limit
Each prospect uses 1 Opus session + 4 image calls. Plan uses concurrency=3 to stay polite with both providers and within Claude Max rate limits. Configurable per-run.
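
A sketch of the D2 fan-out inside batch_run.py; process_prospect is a hypothetical stand-in for the per-prospect image → render → deploy → ledger chain:

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def run_batch(place_ids, process_prospect, concurrency=3):
        results = {}
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            futures = {pool.submit(process_prospect, pid): pid for pid in place_ids}
            for fut in as_completed(futures):
                pid = futures[fut]
                try:
                    results[pid] = fut.result()      # e.g. "ship_ready"
                except Exception as exc:             # crash: record it, keep the batch alive
                    results[pid] = f"failed: {exc}"
        return results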
D3 · Cost ceiling per batch
Every model call is OAuth-subscription ($0 marginal). Plan still ships a --max-calls-per-batch 200 safety rail (rate-limit guardrail, not dollar guardrail). Cron pings if hit.
D4 · Spot-check sample size
After every batch, build a comparison page (current site vs mock URL) for 5 random ship_ready prospects from THAT batch · NOT all 10. Cheaper for you to review.
D5 · Failure routing
Failed prospects go to needs_manual_review/ with the last_findings. Plan does NOT auto-retry across batches. You decide if/when to re-attempt them.
Sequence you'd actually see

From "go" to "10 URLs in Telegram"

1
You say "ship the 10"
Hermes runs bash scripts/launch_tmux_batch.sh --niche pool_service --count 10. Returns immediately with the tmux session name.
2
Cron is armed for completion ping
Hermes schedules a self-polling cron job (every 2 min, max 30 polls). When the tmux session exits, cron sends a Telegram message with the spot-check URL — survives SSH drops, laptop sleep, anything.
3
Batch runs in tmux on the mini
Concurrency=3. Per-prospect log files. Live status in batch-state.jsonl. You can tmux attach -t lead-to-mock-batch any time.
4
Spot-check page is published to Pages
At end of batch, spot_check.py picks 5 random ship_ready prospects, generates A/B HTML (sketched after this list), and deploys to https://sigil-batch-spot-check.pages.dev (overwrites the prior batch's page).
5
Telegram ping arrives in your home channel
Single message: spot-check URL + batch stats (10 ship_ready, 0 manual_review, $4.00 spent, 25 min wall-clock). You click, you review, you say "scale to all 173" or "fix X first."
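
A sketch of the spot_check.py core. The current_url field and the iframe A/B shell are illustrative assumptions; only slug, status, and pages_url come from the ledger schema:

    import json, random
    from pathlib import Path

    def build_spot_check(batch_rows, out=Path("spot-check/index.html"), n=5):
        shipped = [r for r in batch_rows if r["status"] == "ship_ready"]
        sample = random.sample(shipped, min(n, len(shipped)))
        # One <section> per prospect: current site beside the mock.
        cards = "\n".join(
            f'<section><h2>{r["slug"]}</h2>'
            f'<iframe src="{r.get("current_url", "")}" loading="lazy"></iframe>'
            f'<iframe src="{r["pages_url"]}" loading="lazy"></iframe></section>'
            for r in sample)
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(f"<!doctype html><title>Batch spot check</title>\n{cards}")
        return out  # then deploy via the same wrangler path as Stage 4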
Safety · what cannot happen

Hard guardrails, all enforced earlier in this plan:

  • No duplicate work: Stage 1 skips any place_id the ledger already marks ship_ready
  • No infinite retry loops: Gate-3 retries cap at 5, then the prospect parks in needs_manual_review/ with a Telegram alert
  • No silent cross-batch retries: failed prospects stay parked until you re-approve them (D5)
  • No surprise spend: every model call is OAuth-subscription; --max-calls-per-batch 200 is the rate-limit rail (D3)
  • No invented data: Gate 1 rejects any stat not present in audit.json
  • No lost runs: the batch lives in tmux on the mini, so SSH drops can't kill it

After you approve · what I build

Six files, ~90 minutes wall-clock

File | What it does | Est build time
scripts/batch_state.py | Ledger helpers (read/write JSONL, dedup, status enum) | 15 min
scripts/batch_pick.py | Filter scored JSONL, exclude done, write picks | 10 min
scripts/deploy_prospect.py | Wrangler wrapper · create-if-missing + deploy + capture URL | 20 min
scripts/batch_run.py | Orchestrator: pick → for each (image, render, deploy, ledger) with concurrency | 30 min
scripts/spot_check.py | Pick 5 random ship_ready, build A/B HTML, deploy to Pages | 15 min
scripts/launch_tmux_batch.sh + cron arming | tmux session + scheduled cron for completion ping | 10 min
Then I seed batch-state.jsonl with the 5 calibration prospects marked calibration_done (so dedup blocks them), run batch_run.py --count 10, and you wait for one Telegram message. Total wall-clock from go to ping: ~90 min build + ~25 min run.