Sigil AI · Production Plan v1

Lead → Qualified → Mock → Permanent URL

The end-to-end harness that turns Phase-0 scraped pool-service leads into live *.pages.dev URLs ready for the outbound email stage. Every agent, prompt, config, and gate documented before a single new mock is built.

Niche pilot: pool_service · DFW · Available: 173 qualified prospects remaining (178 minus 5 done) · Calibration: 5/5 B-wins (locked) · Next ship: 10-prospect batch
Pipeline · One-page mental model

From a row in scored_llm.jsonl to https://<slug>.pages.dev

Five stages. Each stage is idempotent, resumable, and writes a status row to a shared ledger. Nothing is re-done if it's already done.

STAGE 0 — DONE
Lead discovery + qualification
Already ran. 3,042 candidates · 184 qualified for pool. Output is the input to Stage 1.
Inputs Google Places (12 niches × 16 DFW cities)
Outputs data/scored_llm.jsonl · data/scored_no_website.jsonl
Cost paid $33.30 Google Places · $0 LLM (OAuth)
STAGE 1 — PICK
Batch selection + dedup
Read N unprocessed prospects from the scored JSONL. Skip anything already in batch-state.jsonl.
Worker scripts/batch_pick.py --niche pool_service --count 10
Filter pure-niche (name OR domain contains "pool") · score ≥ 0.7 · phone present · NOT in batch-state.jsonl
Outputs list of N place_ids to process · prospect dirs scaffolded
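
A minimal sketch of that filter, assuming the scored rows carry name, domain, score, phone, and place_id keys (the exact schema is an assumption):

    # batch_pick sketch: filter scored leads, skip anything already in the ledger
    import json
    from pathlib import Path

    def load_done_ids(ledger=Path("batch-state.jsonl")) -> set:
        # Any existing ledger row counts as "seen"; status handling is Stage 1's rule.
        if not ledger.exists():
            return set()
        return {json.loads(row)["place_id"] for row in ledger.open() if row.strip()}

    def pick(scored=Path("data/scored_llm.jsonl"), count=10) -> list:
        done, picks = load_done_ids(), []
        for line in scored.open():
            p = json.loads(line)
            pure_niche = ("pool" in p.get("name", "").lower()
                          or "pool" in p.get("domain", "").lower())
            if (pure_niche and p.get("score", 0) >= 0.7
                    and p.get("phone") and p["place_id"] not in done):
                picks.append(p["place_id"])
            if len(picks) == count:
                break
        return picks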
STAGE 2 — IMAGES
Generate 4 images per prospect
Per-prospect prompts tuned by brand register. Skipped if images/manifest.json already exists with 4 valid files.
Worker scripts/batch_images.py
Model gpt-image-2-medium via openai-codex provider (ChatGPT OAuth · $0 marginal)
Style rule "unstaged candid documentary" (DESIGN.md §7)
Outputs 4 .jpg files + manifest.json with vision descriptions per prospect
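
The skip rule can be a pure filesystem check. A sketch, assuming manifest.json lists its four images under a "files" key (that key name is an assumption):

    import json
    from pathlib import Path

    def images_done(prospect_dir: Path) -> bool:
        # True when Stage 2 already produced 4 valid images (idempotency check).
        manifest = prospect_dir / "images" / "manifest.json"
        if not manifest.exists():
            return False
        files = json.loads(manifest.read_text()).get("files", [])  # assumed key
        paths = [prospect_dir / "images" / f for f in files]
        return len(paths) == 4 and all(p.exists() and p.stat().st_size > 0 for p in paths)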
STAGE 3 — RENDER + 3-GATE QA
Build the mock HTML, auto-validate
Existing v3 pipeline. Render via Opus 4.7 XHigh, then pass through Gate 1 (deterministic), Gate 2 (functional), and Gate 3 (vision QA). Surgical retries up to 5 times.
Worker scripts/render_prospect.py <slug> (already built)
Outputs prospects/<slug>/mock/index.html + preview screenshots + qa-log.json
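
Reduced to a control-flow sketch (the gate callables are injected so the snippet stays self-contained; real logic lives in scripts/render_prospect.py, and the per-finding "severity" key is an assumption):

    # Stage-3 loop: FULL render once, SURGICAL (Edit-not-Rewrite) on retries
    MAX_ATTEMPTS = 5
    BLOCKING = {"CRITICAL", "HIGH", "MEDIUM"}

    def render_with_qa(slug, render, gate1, gate2, gate3, route_manual):
        findings = None
        for attempt in range(1, MAX_ATTEMPTS + 1):
            mode = "FULL" if attempt == 1 else "SURGICAL"
            render(slug, mode=mode, findings=findings)
            if not (gate1(slug) and gate2(slug)):   # deterministic + functional gates
                continue
            verdict = gate3(slug)                   # strict-JSON vision review
            findings = [f for f in verdict["findings"]
                        if f["severity"] in BLOCKING]
            if not findings:
                return "ship_ready"
        route_manual(slug, findings)                # fires the Telegram alert
        return "needs_manual_review"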
STAGE 4 — DEPLOY
Cloudflare Pages → permanent URL
One Pages project per prospect. Production URL is stable forever; same URL across redeploys.
Worker scripts/deploy_prospect.py <slug>
Stack npx wrangler@latest pages deploy
URL shape https://sigil-pool-<slug>.pages.dev
Cost $0 (Pages free tier: 500 builds/mo, unlimited bandwidth)
STAGE 5 — LEDGER + PING
Update batch-state · spot-check page · Telegram ping
Write outcome row. Build/refresh a comparison page (current site vs mock URL) for the batch. Send Samar one URL.
Outputs batch-state.jsonl row · spot-check Pages URL · Telegram message
Agents · who does what

Four agents, three of them already running

No subagent zoo. Each agent is a process with a clear input/output contract and a cost.

gpt-5.5
Stage-1 picker (same model that already ran the Stage-0 qualification)
Where it runs: Codex CLI via OAuth (free on Codex Plus subscription)
What it does in this pipeline: nothing new at Stage 1 — pure Python filters on existing scored JSONL.
Future use: personal-hook extraction at outbound stage (deferred).
Cost: $0 marginal
gpt-image-2-medium
Stage-2 image generator
Where it runs: Hermes image_generate tool, provider openai-codex
Auth: ChatGPT OAuth (your Codex subscription) · $0 marginal
Reasoning: n/a (image model)
Prompt style: "unstaged candid documentary, phone snapshot quality, unflattering 10:30 morning angle, harsh shadows, mid-repair, no watermarks, no branding, no text overlays"
Per-prospect prompt: seeded from design-brief.md brand register (Petri = warm Texas earth, Crystal White = clean blue/white, etc.)
Cost: $0 (covered by Codex subscription, same path used for all 5 calibration prospects)
Claude Opus 4.7 XHigh
Stage-3 renderer (build the mock)
Where it runs: claude --print --model claude-opus-4-7 --permission-mode bypassPermissions
Auth: OAuth via Claude Max subscription · $0 marginal
Mode: FULL on first attempt; SURGICAL (Edit not Rewrite) on Gate-3 retries
Inputs: audit.json + design-brief.md + DESIGN.md + image manifest with vision descriptions + current site screenshot
Hard rules: 12 layout rules baked into base prompt (sticky-bar padding, phone-nowrap, ≤8 sections, no invented stats, etc.)
Time: 60-180s per attempt · typical 1-2 attempts to ship-ready
Claude Opus 4.7 (vision)
Stage-3 Gate-3 vision QA reviewer
Where it runs: same claude --print, multimodal mode on preview screenshots
Anchor: owner-perception ("would the pool owner say 'send this'") NOT designer portfolio
Severity model: Critical/High/Medium block · Cosmetic logged only
Sees: current-site screenshot + mock mobile + mock desktop · "clearly better than current?"
Output: strict JSON · {score_out_of_10, clearly_better_than_current, findings[]}
Stage 3 internals · the 3-gate QA harness

Why mocks reach 10/10 before Samar ever sees them

This is the part Samar already approved on 5/5 calibration prospects. Reusing as-is, no changes.

GATE 1 · DETERMINISTIC
Static HTML/CSS checks
  • sticky-bar + body padding-bottom + safe-area-inset
  • phone CTAs have white-space: nowrap OR <br>
  • ≤ 8 <section> elements
  • every <img> has width+height
  • no placeholder rectangles in header
  • no invented stats (only audit.json values)
  • HTML size 8KB-50KB · no embedded base64
  • desktop ≠ stretched mobile
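
A sketch of a few of these checks using only the standard library; the regex-level heuristics are illustrative, not the gate's exact implementation:

    import re
    from pathlib import Path

    def gate1(html_path: Path) -> list:
        html = html_path.read_text()
        failures = []
        if len(re.findall(r"<section\b", html)) > 8:
            failures.append("more than 8 <section> elements")
        if not 8_000 <= len(html.encode()) <= 50_000:
            failures.append("HTML outside the 8KB-50KB band")
        if "base64," in html:
            failures.append("embedded base64 payload")
        for img in re.findall(r"<img\b[^>]*>", html):
            if "width" not in img or "height" not in img:
                failures.append("an <img> is missing width/height")
        return failures  # empty list == gate passed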
GATE 2 · FUNCTIONAL
Playwright at 3 viewports
  • 0 console errors
  • all images naturalWidth > 0 after networkidle
  • no horizontal overflow at 390px
  • page load < 3000ms (mobile)
  • writes preview/{mobile,tablet,desktop}.png
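
A sketch with Playwright's sync API, showing the 390px viewport only; the real gate runs all three viewports and writes all three screenshots:

    from playwright.sync_api import sync_playwright

    def gate2_mobile(url: str) -> list:
        failures = []
        with sync_playwright() as pw:
            browser = pw.chromium.launch()
            page = browser.new_page(viewport={"width": 390, "height": 844})
            page.on("console", lambda m: failures.append(f"console: {m.text}")
                    if m.type == "error" else None)
            page.goto(url, wait_until="networkidle", timeout=3000)  # 3000ms budget
            # Every image must have actually loaded (naturalWidth > 0).
            if not page.evaluate("[...document.images].every(i => i.naturalWidth > 0)"):
                failures.append("image with naturalWidth == 0")
            # No horizontal overflow at the mobile viewport width.
            if page.evaluate("document.documentElement.scrollWidth") > 390:
                failures.append("horizontal overflow at 390px")
            page.screenshot(path="preview/mobile.png", full_page=True)
            browser.close()
        return failures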
GATE 3 · VISION
Owner-anchored review
  • compares mock vs current-site screenshot
  • Critical/High/Medium → SURGICAL retry (Edit-not-Rewrite)
  • Cosmetic → logged to qa-log.json (non-blocking)
  • up to 5 retries before needs_manual_review/
  • fires Telegram alert on manual-review routing
Gate 3 vision prompt (abridged · full version in render_prospect.py)

    You are QA'ing a modernized website MOCK for a pool service owner.
    Not a designer. Owner-operator on his phone between jobs.

    SHIP TEST: would a regular pool-service owner, shown both pages on his
    phone, say "yes, this is clearly better, send this"?

    SEVERITY (anchor every finding to OWNER perception):
    - CRITICAL: page broken (console errors, layout collapse, CTA overlap)
    - HIGH: visible bug owner would notice (phone wraps mid-digits, sticky overlap)
    - MEDIUM: non-designer would feel something off ("FORT / WORTH" split, blank stat cell)
    - COSMETIC: designer would notice, regular person wouldn't (logo distinctiveness, italic overuse)

    If audit.json only has 2 stats, a 2-stat layout is correct. Do NOT flag
    "needs more stats." Inventing data is worse than sparse layout.

    Output strict JSON: {score_out_of_10, clearly_better_than_current, findings[]}
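
On the consumer side, a sketch of how a verdict might be routed. Only the three top-level keys are confirmed by the prompt; the per-finding "severity" field and the qa-log write shape are assumptions:

    import json

    BLOCKING = {"CRITICAL", "HIGH", "MEDIUM"}

    def route_verdict(raw: str):
        v = json.loads(raw)  # model is instructed to emit strict JSON only
        blocking = [f for f in v["findings"] if f.get("severity") in BLOCKING]
        cosmetic = [f for f in v["findings"] if f.get("severity") == "COSMETIC"]
        # Cosmetic findings are logged, never blocking (log shape is illustrative).
        with open("qa-log.json", "a") as fh:
            for f in cosmetic:
                fh.write(json.dumps(f) + "\n")
        ship = v["clearly_better_than_current"] and not blocking
        return ship, blocking  # blocking findings seed the SURGICAL retry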
Stage 4 internals · Cloudflare Pages deploy

One project per prospect, one stable URL forever

Proven path. Token saved, wrangler verified, 13 existing Pages projects already listed via the same credentials.

1
Create the Pages project (idempotent)
npx wrangler@latest pages project create sigil-pool-<slug> --production-branch main · errors silenced if it already exists
2
Deploy the mock directory
npx wrangler@latest pages deploy prospects/pool-<slug>/mock --project-name sigil-pool-<slug> --branch main --commit-dirty=true
3
Capture the production URL
Output URL is always https://sigil-pool-<slug>.pages.dev — stable across redeploys.
4
Write to batch-state.jsonl
Append row: {slug, place_id, status: "ship_ready", url, vision_score, attempts, ts}
Why Pages, not the tunnel: outbound emails get opened days or weeks later. The generally-engineers-predict-coffee.trycloudflare.com tunnel dies when the mini reboots, and it comes back with a different URL anyway. Pages URLs are forever. Free tier covers everything we need.
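
A sketch of what scripts/deploy_prospect.py reduces to: the exact wrangler commands above, wrapped in subprocess calls:

    import subprocess

    def deploy(slug: str) -> str:
        project = f"sigil-pool-{slug}"
        # Step 1: create-if-missing; wrangler errors when the project exists, so swallow it
        subprocess.run(
            ["npx", "wrangler@latest", "pages", "project", "create", project,
             "--production-branch", "main"],
            capture_output=True)
        # Step 2: deploy the mock directory to the production branch
        subprocess.run(
            ["npx", "wrangler@latest", "pages", "deploy",
             f"prospects/pool-{slug}/mock",
             "--project-name", project, "--branch", "main", "--commit-dirty=true"],
            check=True)
        # Step 3: the production URL is deterministic, no output parsing needed
        return f"https://{project}.pages.dev"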
Dedup ledger · don't re-do work

One JSONL file is the source of truth for "what's done"

Field | Type | Example
slug | str | petri-pools
place_id | str | ChIJxxx... (Google)
niche | str | pool_service
status | enum | ship_ready · needs_manual_review · image_failed · render_failed · deploy_failed
vision_score | int | 8
attempts | int | 2
pages_url | str | https://sigil-pool-petri-pools.pages.dev
duration_s | int | 187
cost_usd | float | 0.40 (images only)
ts | iso8601 | 2026-05-16T17:42:00Z
Stage-1 dedup rule: before processing a prospect, check batch-state.jsonl for that place_id. If status is ship_ready, skip. If *_failed, allow re-attempt. If needs_manual_review, skip (don't loop on it).
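
A sketch of the batch_state.py helpers implementing exactly that rule over the append-only JSONL (last row per place_id wins):

    import json, time
    from pathlib import Path

    LEDGER = Path("batch-state.jsonl")

    def latest_rows() -> dict:
        # Map place_id -> most recent ledger row (append-only, last write wins).
        rows = {}
        if LEDGER.exists():
            for line in LEDGER.open():
                if line.strip():
                    r = json.loads(line)
                    rows[r["place_id"]] = r
        return rows

    def should_process(place_id: str) -> bool:
        row = latest_rows().get(place_id)
        if row is None:
            return True                        # never attempted
        if row["status"].endswith("_failed"):
            return True                        # *_failed may be re-attempted
        return False                           # ship_ready / needs_manual_review: skip

    def mark(slug: str, place_id: str, status: str, **extra):
        row = {"slug": slug, "place_id": place_id, "niche": "pool_service",
               "status": status,
               "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()), **extra}
        with LEDGER.open("a") as fh:
            fh.write(json.dumps(row) + "\n")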
The full file map

What gets built (new) vs what already exists

File | Role | Status
scripts/render_prospect.py | Stage 3 renderer + 3-gate QA | DONE
scripts/build_prospect_dirs.py | Stage 1 scaffolder | DONE
scripts/build_image_manifests.py | Stage 2 image gen | DONE
data/scored_llm.jsonl | Stage 0 output | DONE
DESIGN.md (9 sections, 10/10 contract, image rules) | Shared design constraint | DONE
~/.hermes/.env CLOUDFLARE_API_TOKEN + ACCOUNT_ID | Stage 4 auth | DONE
scripts/batch_state.py | Dedup ledger helpers (mark / is_done / pick_next) | NEW
scripts/batch_pick.py | Stage 1: pick N from scored, exclude done | NEW
scripts/deploy_prospect.py | Stage 4: wrangler pages create + deploy | NEW
scripts/batch_run.py | Orchestrator: pick → image → render → deploy → ledger | NEW
scripts/spot_check.py | Build comparison HTML for batch → Pages | NEW
scripts/launch_tmux_batch.sh | tmux wrapper so SSH drops don't kill the run | NEW
cron job (Hermes) — "ping when batch done" | Notification mechanism that actually works | NEW
Resource budget · per batch of 10

What this costs and how long it takes

Resource | Per prospect | Per batch of 10 | Full 173-prospect pool sweep
Image gen ($) | $0 (Codex OAuth) | $0 | $0
Opus rendering ($) | $0 (Claude Max OAuth) | $0 | $0
Vision QA ($) | $0 (Claude Max OAuth) | $0 | $0
Cloudflare Pages ($) | $0 (free tier) | $0 | $0
Wall-clock (concurrency=3) | 3-8 min | ~25 min | ~7 hours
Output | 1 .pages.dev URL | 10 URLs | 173 URLs
Whole pipeline is $0 marginal. Every model call routes through OAuth-subscription credentials you're already paying for (Claude Max, Codex Plus). Cloudflare Pages free tier covers deploys. The only money spent on this whole project so far is the $33.30 Google Places bill at Stage 0.
Decisions you're approving when you say "go"

Five open calls baked into this plan

D1 · Score threshold for "qualified"
Pool counts: 184 at score ≥ 0.7 · 105 at ≥ 0.8. Plan uses ≥ 0.7 (184) for batch eligibility; the LLM judge is conservative enough that this isn't too loose. Pure-niche filter (name OR domain contains "pool") narrows to 178.
D2 · Concurrency limit
Each prospect uses 1 Opus session + 4 image calls. Plan uses concurrency=3 to stay polite with both providers and within Claude Max rate limits. Configurable per-run.
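
A sketch of the D2 fan-out inside batch_run.py; process_prospect is a hypothetical stand-in for the per-prospect image → render → deploy → ledger chain:

    from concurrent.futures import ThreadPoolExecutor, as_completed

    def run_batch(place_ids, process_prospect, concurrency=3):
        results = {}
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            futures = {pool.submit(process_prospect, pid): pid for pid in place_ids}
            for fut in as_completed(futures):
                pid = futures[fut]
                try:
                    results[pid] = fut.result()      # e.g. "ship_ready"
                except Exception as exc:             # crash: record it, keep the batch alive
                    results[pid] = f"failed: {exc}"
        return results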
D3 · Cost ceiling per batch
Every model call is OAuth-subscription ($0 marginal). Plan still ships a --max-calls-per-batch 200 safety rail (rate-limit guardrail, not dollar guardrail). Cron pings if hit.
D4 · Spot-check sample size
After every batch, build a comparison page (current site vs mock URL) for 5 random ship_ready prospects from THAT batch · NOT all 10. Cheaper for you to review.
D5 · Failure routing
Failed prospects go to needs_manual_review/ with the last_findings. Plan does NOT auto-retry across batches. You decide if/when to re-attempt them.
Sequence you'd actually see

From "go" to "10 URLs in Telegram"

1
You say "ship the 10"
Hermes runs bash scripts/launch_tmux_batch.sh --niche pool_service --count 10. Returns immediately with the tmux session name.
2
Cron is armed for completion ping
Hermes schedules a self-polling cron job (every 2 min, max 30 polls). When the tmux session exits, cron sends a Telegram message with the spot-check URL — survives SSH drops, laptop sleep, anything.
3
Batch runs in tmux on the mini
Concurrency=3. Per-prospect log files. Live status in batch-state.jsonl. You can tmux attach -t lead-to-mock-batch any time.
4
Spot-check page is published to Pages
At end of batch, spot_check.py picks 5 random ship_ready prospects, generates A/B HTML (sketched after this list), and deploys to https://sigil-batch-spot-check.pages.dev (overwrites the prior batch's page).
5
Telegram ping arrives in your home channel
Single message: spot-check URL + batch stats (10 ship_ready, 0 manual_review, $4.00 spent, 25 min wall-clock). You click, you review, you say "scale to all 173" or "fix X first."
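
A sketch of the spot_check.py core. The current_url field and the iframe A/B shell are illustrative assumptions; only slug, status, and pages_url come from the ledger schema:

    import json, random
    from pathlib import Path

    def build_spot_check(batch_rows, out=Path("spot-check/index.html"), n=5):
        shipped = [r for r in batch_rows if r["status"] == "ship_ready"]
        sample = random.sample(shipped, min(n, len(shipped)))
        # One <section> per prospect: current site beside the mock.
        cards = "\n".join(
            f'<section><h2>{r["slug"]}</h2>'
            f'<iframe src="{r.get("current_url", "")}" loading="lazy"></iframe>'
            f'<iframe src="{r["pages_url"]}" loading="lazy"></iframe></section>'
            for r in sample)
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(f"<!doctype html><title>Batch spot check</title>\n{cards}")
        return out  # then deploy via the same wrangler path as Stage 4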
Safety · what cannot happen

Hard guardrails, all enforced earlier in this plan:

  • No duplicate work: Stage 1 skips any place_id the ledger already marks ship_ready
  • No infinite retry loops: Gate-3 retries cap at 5, then the prospect parks in needs_manual_review/ with a Telegram alert
  • No silent cross-batch retries: failed prospects stay parked until you re-approve them (D5)
  • No surprise spend: every model call is OAuth-subscription; --max-calls-per-batch 200 is the rate-limit rail (D3)
  • No invented data: Gate 1 rejects any stat not present in audit.json
  • No lost runs: the batch lives in tmux on the mini, so SSH drops can't kill it

After you approve · what I build

Six files, ~90 minutes wall-clock

File | What it does | Est build time
scripts/batch_state.py | Ledger helpers (read/write JSONL, dedup, status enum) | 15 min
scripts/batch_pick.py | Filter scored JSONL, exclude done, write picks | 10 min
scripts/deploy_prospect.py | Wrangler wrapper · create-if-missing + deploy + capture URL | 20 min
scripts/batch_run.py | Orchestrator: pick → for each (image, render, deploy, ledger) with concurrency | 30 min
scripts/spot_check.py | Pick 5 random ship_ready, build A/B HTML, deploy to Pages | 15 min
scripts/launch_tmux_batch.sh + cron arming | tmux session + scheduled cron for completion ping | 10 min
Then I seed batch-state.jsonl with the 5 calibration prospects marked calibration_done (so dedup blocks them), run batch_run.py --count 10, and you wait for one Telegram message. Total wall-clock from go to ping: ~90 min build + ~25 min run.