/pika:explainer

Generate a ~60–80s URL explainer video: drive a real browser through the URL along a beat-sheet timeline, generate an avatar lipsync of the narration, and composite it all in a 1280×800 macOS Sonoma frame with a 240-pixel inner avatar (246-pixel outer including 3px white stroke ring) at canvas (20, 476) and element-targeted zoom on every mid-section beat. Works on any URL — product pages, docs sites, blog posts, launches. GitHub URLs activate a repo-aware mode (README scan + live-demo detection); all other URLs use a generic page-walkthrough flow.

Usage: /pika:explainer <url> [--focus "angles"] [--avatar <url>] [--voice <id>] [--lipsync-provider pika|kling] [--preview] [--live-url <url>]

Behavior

Defaults — fire fast, no mid-flow confirmation

Use identity-store defaults silently for avatar / voice. Never ask "should I use your avatar?" or "which voice?" before firing. Honor explicit overrides (--avatar, --voice) when supplied; otherwise resolve via identity_avatar_url / identity_voice_id and proceed. See Step 1 for the full resolution waterfall (including the silent fallback when identity returns null).
No mid-flow "type yes to proceed" gates by default. Step 5 preview is opt-in via --preview (for power users testing new avatar/voice combos before the long-pole render); the default flow runs end-to-end without pausing.
Do not solicit --focus either. Make a confident first attempt from page structure; users re-run with --focus "X" if the angle missed.

These defaults match industry standard for media-gen tools (Midjourney / Sora / Runway / HeyGen / Pika.art): submit → render → return. Account credit balance + provider failover (Step 9) are the canonical guardrails.

Local avatar images on Claude Desktop

Claude Desktop can't pass inline-pasted images to MCP tools yet (Anthropic-side limitation). If the user pastes a photo inline, or mentions a local file they want as --avatar, pause Step 1 and kindly send them this — something like:

Heads up — pasted images don't reach MCP tools on Claude Desktop yet (Anthropic limitation). Two easy options for your avatar: - Paste a URL if it's already hosted (Imgur, S3, your site) — fastest - Attach the image file so I can upload it before generation.

When a local file arrives, convert it to a public URL with upload_asset and use the returned public_url as --avatar <url> before Step 1. Already-hosted https://... URLs work as-is and skip this entirely. If no avatar is supplied at all, the identity-store default fires.

Step 0 — Resolve URL (empty-args menu)

Strip flags (--focus, --avatar, --voice, --live-url, --lipsync-provider, --no-captions, --preview, --skip-preview, --yes) and key=value parameters from $ARGUMENTS. If what remains contains no https://... URL (or is empty / whitespace-only), print this menu verbatim as your full response, then stop and wait for the user's next message. Calling a tool here risks recording or explaining the wrong page. If $ARGUMENTS already carries a URL, skip this step silently and proceed to Step 1.

Which URL would you like me to walk through? Works on any of: - A GitHub repo — e.g. https://github.com/anthropics/claude-code (activates repo-aware mode: README scan + live-demo detection) - A product page / launch page — e.g. https://pika.art - A docs site — e.g. https://docs.anthropic.com - A blog post / article URL Output: 1280×800 macOS Sonoma frame with a bottom-left avatar lipsync and element-targeted zoom on every mid-section beat. Default flow runs end-to-end with no confirmation gates — pass --preview if you want a 3-second lipsync sanity check first. Reply with the URL and I'll start. Tip: you don't need to type /pika:explainer — just say things like "walk me through <url>", "make a demo video of <url>", or "explain this repo: <github-url>" and I'll fire this skill automatically.

When the user replies with a URL, treat it as the resolved input and proceed to Step 1. Do not re-prompt.

Step 1 — Parse input + detect mode

Required: url (must be https://...). Optional: --avatar <url> (overrides identity-store default), --voice <minimax-voice-id>, --focus "..." (editorial guidance woven into vo_text), --live-url <url> (force-supply live demo URL — GitHub mode only), --lipsync-provider <pika|kling> (defaults to pika — parrot a2v, ~2-5 min wall-clock, slightly more dramatic head motion. Pass kling for tighter face-centered output at ~5-30 min wall-clock — Kling produces minimal-head-motion presenter shots but is the long-pole stage; reserve for high-stakes renders), --no-captions (skip the Step 11 caption burn — default is captions on), --preview (opt-in to the Step 5 preview gate — ~3s lipsync of "Hi, I'm your presenter" for testing new avatar/voice combos before the long-pole render; default is no preview). --skip-preview and --yes are accepted as no-ops for backward compatibility.

Mode detection:

GitHub mode — URL host is github.com AND path matches /{owner}/{repo} (no further path segments past the repo root). Activates the repo-aware extras: README scan, live-demo detection, GitHub-specific selectors.
Generic-URL mode — anything else (a product page, docs site, blog post, deeper GitHub path like /blob/HEAD/path). Skips the GitHub extras; uses generic CSS selectors and walks through the URL itself.

Avatar resolution (silent — never ask the user):

If --avatar <url> was passed, use it.
Else call mcp__pika__identity_avatar_url. If non-null, use it.
Else (fresh user, no identity avatar set yet) call mcp__pika__generate_image once with prompt "professional presenter, friendly tech narrator, studio portrait, 1:1, natural lighting" and use the returned URL. Do not ask the user "should I generate one?" — just generate silently.

Voice resolution (silent — never ask the user):

If --voice <id> was passed, use it.
Else call mcp__pika__identity_voice_id. If non-null, use it.
Else pick a casual MiniMax speech-2.8-hd preset matching the resolved avatar's apparent gender:

Female-coded avatar → English_PlayfulGirl (warm, casual, clearly female-voiced — verified)
Male-coded avatar → English_Jovialman (warm, casual male)
Unclear / gender-neutral → English_Jovialman (default)

Determine gender from mcp__pika__identity_persona_read (look for a gender / pronouns field) when identity exists; otherwise infer from the resolved avatar image. Do not call analyze_media for this — it's not worth the extra ~30s round-trip. Do not ask the user.

Do NOT use English_FriendlyPerson — despite being categorized under "female" in MiniMax's catalog, its display name is "Friendly Guy" and it reads as male in playback. English_PlayfulGirl is the canonical casual-female pick. Other verified-female alternates: English_Upbeat_Woman, English_LovelyGirl, English_radiant_girl.

The flow below is annotated per step: GitHub-only, Generic-only, or Both modes.

Step 2 — Read source (no MCP call)

Both modes: use Claude's WebFetch on the input URL to pull the page's main content (h1, hero section, headings, primary copy).

GitHub mode additions: also fetch top-level file tree, (best-effort) package.json / pyproject.toml, and GitHub API repo metadata via gh api repos/{owner}/{repo} for homepage, description, language, topics. Detect a candidate live_url in this priority:

User-supplied --live-url.
GitHub API meta.homepage field — set when the maintainer configured the repo's homepage in GitHub settings (matches tarball repo_analyzer.py:66-77).
package.json "homepage" field.
First match in README of https?://[^\s)\"'<>]+(?:vercel\.app|netlify\.app|github\.io|fly\.dev|railway\.app|render\.com|herokuapp\.com|surge\.sh)[^\s)\"'<>]*.
Any other URL in README that the badge area / "Live Demo" / "Project Page" / "Demo" text points at. The allowlist regex above misses arbitrary custom domains (e.g. <project>-project-page.com); when the README explicitly designates a project page, prefer that over the github.io fallback.
GitHub Pages convention https://{owner}.github.io/{repo} — but only if the deep tree contains a frontend signal (one of index.html, App.tsx, App.jsx, App.vue, app.py, main.py).

If no candidate resolves, the beat sheet skips beats 6–7.

Generic-URL mode: the input URL itself is the only URL the beats walk through — no live_url inference, no extra metadata fetches. Skip Step 2.5 and Step 3.0; jump straight to Step 3.

Step 2.5 — Verify `live_url` reachability (GitHub mode only, no MCP call)

If a candidate live_url was selected, verify it serves real content before authoring beats 6–7. Use WebFetch on the candidate and check the response:

If the response status is 4xx / 5xx, drop live_url to None and skip beats 6–7. The github.io fallback in particular is reachable as a hostname but often returns 404 ("There isn't a GitHub Pages site here") for repos that haven't enabled Pages — recording that 404 page wastes ~12s of the explainer on wrong content.
If the response renders the GitHub Pages "404 — There isn't a GitHub Pages site here." template (heuristic: response body contains "There isn't a GitHub Pages site here"), drop live_url and skip beats 6–7.
Otherwise, keep live_url for beats 6–7.

This mirrors the original tarball's requests.head(live_url, timeout=6, allow_redirects=True) reachability gate.

Step 2.6 — Generic-URL pre-flight (Generic-URL mode only, no MCP call)

Before authoring beats for a non-GitHub URL, WebFetch the input URL and inspect the response. This step prevents three common Generic-URL failure modes: (a) recording a captcha / bot-block page instead of content, (b) the cookie/consent banner eating the first ~3 seconds of video, (c) generic CSS selectors missing the page's actual hero / sections.

A. Bot-block / captcha detection — abort if matched:

If the response body contains any of:

"Verify you are human" / "verify you are not a robot"
"captcha" / "CAPTCHA" / "reCAPTCHA"
"403 Forbidden" / "Access Denied"
"Just a moment" + cf-chl-bypass (Cloudflare challenge)
"We're sorry, something went wrong" (Amazon-style bot block)
A <title> or h1 of just "Robot Check" / "Are you a robot?"

→ ABORT with a clear error to the user: "Generic-URL mode can't render this site — the page is showing a bot-detection / captcha challenge under headless Chrome. Try a different URL, or run a real-user version of the page first to verify it loads cleanly."

B. Cookie / consent-banner detection — defuse with extra_css + optional click:

Scan the response for these patterns (case-insensitive):

IDs / classes starting with onetrust-, truste-, cookie-banner, cookie-consent, gdpr-, consent-, cmp-
Buttons matching (?i)accept (all )?cookies / (?i)agree.{0,10}cookies / (?i)i (accept|agree)
Apple-specific banner: id ac-gdpr-banner or class as-globalfooter-curtain
Google consent: [role="dialog"] with text "Before you continue"

If detected, set cookie_banner_present = true. Defense in depth — the recording uses BOTH:

CSS injection (extra_css) in the capture_website call to hide common banners universally — even if the click below misses, the banner is visually gone.
A click timed_action at at_s: 0.0 against the most likely dismissal selector (extracted from the WebFetch DOM, e.g. #onetrust-accept-btn-handler, [aria-label="Accept all" i], button[id="accept"]).

The extra_css payload (use this verbatim — covers ~80% of consent platforms):

#onetrust-banner-sdk, #onetrust-pc-sdk, #onetrust-consent-sdk { display: none !important; }
#truste-consent-track, #truste-consent-content, .truste_box_overlay { display: none !important; }
[id*="gdpr-cookie"], [id*="cookie-consent"], [id*="cookie-banner"] { display: none !important; }
[class*="cookie-banner"], [class*="cookie-consent"], [class*="consent-banner"] { display: none !important; }
[class*="CookieBanner"], [class*="CookieConsent"], [class*="ConsentBanner"] { display: none !important; }
#ac-gdpr-banner, .as-globalfooter-curtain { display: none !important; }  /* Apple */
[role="dialog"][aria-label*="cookie" i], [role="dialog"][aria-label*="consent" i] { display: none !important; }
.cmp-container, .cmp-modal, .cmp-banner { display: none !important; }

C. Real-DOM element identification — emit concrete selectors:

Generic CSS selectors (h1, [class="hero"], section h2) work on semantic / well-marked-up sites but miss obfuscated class names on big-name corporate sites (apple.com uses tile-headline / as-headline-section-title, not hero-). For each beat, prefer the actual DOM elements observed in the WebFetch:

Read the rendered HTML/markdown WebFetch returned. Note the page's actual primary <h1> text and class.
Note the page's section structure (h2 headings + their parent containers).
Note any prominent CTA / signup / pricing element.
Emit zoom_target.selector using the actual class or id observed, falling back to semantic structure (main > section:nth-of-type(N) h2) when class names look auto-generated (Tailwind _1a2b3c, CSS modules module__hero___xYz).

D. SPA / lazy-render detection — bump initial wait:

If the WebFetch response has fewer than 3 visible headings / minimal text content, the page may be SPA-rendered post-domcontentloaded. Emit a longer initial wait action ({type: "wait", at_s: 0.0, ms: 2500}) before any beat fires, instead of the default 600ms settle.

E. --focus is honored when supplied (do not solicit):

Without --focus, select beats from generic structure cues — proceed silently with a confident first attempt. Do not ask the user "what should I focus on?" before firing; users iterate by re-running with --focus "the X feature" if the first pass misses the angle they wanted. With --focus supplied, anchor beat selection on the phrase: uses concrete page sections that match it, ignores irrelevant marketing chrome.

Step 3.0 — Required README section scan (GitHub mode only, no MCP call)

Before authoring the beat sheet, scan the README (case-insensitive, full-text) for any of these section names. If a match is found, you must add a dedicated beat for that section in Step 3, replacing one of the generic beats 4–5 if necessary:

README contains...	Required beat
`how it works`	scroll_to that heading; zoom `article h2:has(#user-content-how-it-works)`
`audio layer` / `audio timeline`	scroll_to the audio-layer diagram; zoom on the rendered figure or its surrounding heading
`claude code` / `mcp integration`	scroll_to that section; zoom `article pre` or `.highlight` (terminal screenshot / code block)
`architecture` / `system design`	scroll_to that section; zoom `article h2:has(#user-content-architecture)`
`features` (when prominent at top)	scroll_to that heading; zoom `article h2:has(#user-content-features)`
`getting started` / `quick start` / `installation`	scroll_to that heading; zoom `article h2:has(#user-content-installation)` (or the matching slug) — falls back to `article pre` if you want the install code block instead
`usage` / `examples`	scroll_to that heading; zoom `article h2:has(#user-content-usage)` (or the matching slug) — or the first code block under it

GitHub heading slug rule: lowercase, spaces → dashes, strip non-[a-z0-9-] characters. So "How it works" → #user-content-how-it-works, "Quick Start" → #user-content-quick-start. GitHub injects the <a id="user-content-{slug}"> anchor inside each rendered <hN>, so hN:has(#user-content-{slug}) reliably grabs the heading element across any GitHub README.

Selector contract: bbox_selector needs to be vanilla CSS that resolves via document.querySelector (capture_website runs the post-action smooth-scroll JS via page.evaluate, which uses the browser's native selector engine). Avoid Playwright extensions like :has-text("..."), text=..., or :visible: those resolve in Playwright's page.query_selector (so the bbox capture finds the element) but silently fail in the smooth-scroll's document.querySelector (so the page never scrolls to the target, and bbox.y ends up at document-Y instead of top - 60 px, which trips Step 8b's bbox.y > recording_viewport.h degenerate filter and falls back to default-position zoom). CSS Level 4 :has(...) is vanilla and supported in modern Chromium.

These sections are the highest-information visuals in most explainer-worthy repos. Missing them produces a generic walkthrough; including them gives the explainer a concrete "show, don't tell" beat. The original tarball SKILL.md flagged the first four with SPECIAL rules in the Gemini prompt; this Step 3.0 promotes them from incidental guidance to a hard requirement and adds three more high-signal headings common in OSS READMEs.

Step 3 — Author beat sheet (main thread, no MCP call)

Write a JSON array of 8–10 beats, with a hard total duration of 65–80 seconds and a hard total word count of 165–200 words (assuming a speaking rate of 2.5 words/sec). Each beat:

{
  "t_start": 0.0,
  "t_end": 7.5,
  "action": { "type": "navigate" | "scroll_to" | "hover", "url": "...", "selector": "..." },
  "zoom_target": { "selector": "...", "description": "..." },
  "vo_text": "exact words to speak — 1 to 2 conversational sentences"
}

Hard constraints (validate before emitting the beat sheet — reject the draft if any fails):

Every beat needs all five fields: t_start, t_end, action (with type and url), zoom_target (with selector), vo_text. Missing fields ⇒ reject and re-author. (Mirrors tarball's github_explainer.py:183-190 validation pass.)
t_start of beat 0 = 0.0; t_end[i] == t_start[i+1] (continuity).
len(vo_text.split()) / 2.5 ≈ t_end - t_start per beat. Aim for ±10% of this estimate; if your draft is denser than 2.5 wps, tighten the vo_text until it fits.
Total t_end of last beat ≤ 80 seconds. (Reference output is 86.5s including intro; lipsync audio is ~83s. Kling avatar/image2video stalls reliably past ~90s of audio under current load — going over 80s risks a 20-min Kling timeout.)
Total spoken word count between 165 and 200 words.
Every beat's zoom_target.selector needs to be a valid CSS selector for the page that beat lands on. GitHub mode prefers GitHub-specific selectors: h1.f1, #readme, article h2, .blob-code-inner, .highlight, .octicon-star, nav. Generic-URL mode prefers robust generic selectors: h1, [role="main"], main, header, nav, .hero, .feature, section h2, [class="cta"], [class="hero"], button, a[href]. Selectors need to resolve on the rendered page after the beat's action settles — verify against the DOM you can see via WebFetch before emitting.
vo_text is 1-2 conversational sentences. Dev voice. No stage directions. No markdown.
action.url is a valid https://... URL when action.type == "navigate"; required.

Self-check before Step 4: verify total_words is in [165, 200] AND total_seconds (= beats[-1].t_end) is in [65, 80]. If either misses bounds, re-author the beat sheet — do not proceed to TTS. (No need to "print" anywhere — this is an internal draft validation; just reject the draft and re-author until it passes.)

Structural skeleton — GitHub mode (load-bearing for the visual contract — match origin, but Step 3.0 overrides if applicable):

Beat 1: navigate repo root, zoom h1.f1 (repo title), hook sentence.
Beats 2–3: navigate to specific source files (https://github.com/{owner}/{repo}/blob/HEAD/<path>), zoom .blob-code-inner or .highlight. Pick files that match the narration's claim — don't navigate to a file you won't talk about.
Beats 4–5: scroll_to README sections, zoom article h2 or #readme. If Step 3.0 surfaced required sections, replace these slots with the required ones.
Beats 6–7 (only if live_url survived Step 2.5): navigate to live_url, zoom nav / h1 / .hero / main / button / .feature.
Beat 8: back to repo root, zoom .octicon-star, outro.

Structural skeleton — Generic-URL mode:

Beat 1: navigate to the input URL, zoom h1 or [class*="hero"] h1 (the page's primary headline), hook sentence.
Beats 2–3: scroll_to the page's hero / value-prop / first feature section. Zoom .hero, [class="hero"], [class="feature"], or section:nth-of-type(1) h2. Pick visible elements the narration references.
Beats 4–5: scroll_to deeper sections — feature lists, screenshots, pricing, social proof. Zoom section h2, [class="feature"] img, [class="testimonial"], [class*="pricing"], or any prominent semantic element on the page.
Beats 6–7: scroll_to CTA / signup / demo embed. Zoom [class="cta"], button, a[class="button"], or [id*="signup"]. (No live-demo navigation in generic mode — the input URL IS the demo.)
Beat 8: scroll_to footer / closing element, zoom footer h2, footer, or back to top with h1. Outro sentence.

If --focus is supplied, weave its angles into vo_text without mutating the structural skeleton. Prefer CSS selectors over text_content in zoom_target.selector — bbox capture is selector-only (see Known gaps).

Step 4 — TTS

Call mcp__pika__generate_speech with provider: "minimax-tts", text: <full vo_text join>, optional voice_id. Capture result.audio_url (the dispatcher returns audio under audio_url, not url) and result.duration_seconds. Voice defaults to identity-store injection in plugin mode.

Stale-voice fallback detection (AGNT-231): the dispatcher retries once with the default Calm_Woman voice on Minimax status_code:2054 (voice id not found — typically a per-agent workspace pointer that Minimax auto-deleted after 7 days of inactivity). On retry success the response carries two extra fields beyond the documented schema (passthrough): voice_id_requested (the planted-but-stale id the worker tried first) and fallback_reason: "invalid_minimax_voice_id". If you see fallback_reason == "invalid_minimax_voice_id" in the response, surface a one-line note to the user along the lines of: "your registered voice expired on Minimax (auto-GC'd after 7 days of inactivity); we used the system default. Re-clone via clone_voice if you want personalization back." The render does NOT fail — it just uses the default voice — so this is informational, not a retry trigger.

Cookie-banner audio padding (Generic-URL mode with cookie_banner_present == true from Step 2.6 §B): prepend MiniMax's pause marker <#1.5#> to the text: argument before calling generate_speech. MiniMax's speech-2.8-hd honors <#N#> as N-second silence; the returned audio_url and duration_seconds include the 1.5s lead-in natively. This aligns the audio with the screen recording's cookie-dismissal +1.5s offset applied in Step 4.5.

Fallback (only if smoke-test shows the marker is ignored on this voice): call generate_speech normally, then mcp__pika__edit_audio_mix to overlay the result onto a 1.5s silent base at offset 1.5s. Then call mcp__pika__analyze_media(url=<padded_audio_url>) to probe the padded duration and rebind duration_seconds = result.duration_seconds before Step 4.5 consumes it. analyze_media is the single authoritative duration probe — do not rely on edit_audio_mix's return payload (its duration field is not contractually guaranteed).

Step 4.5 — Audio length verification + beat-sheet rescale

Applied to audio_duration_seconds post Step 4 (which includes any cookie lead-in pad). End state: beats[].t_start / t_end are absolute wall-clock seconds matching the audio playback timeline. All beats[] mutations happen here; Steps 6 and 8 are read-only consumers.

Gate 1 — Kling stall ceiling (provider cap, raw audio_duration_seconds): If audio_duration_seconds > 90, abort and re-author the beat sheet with a tighter word budget. Kling avatar/image2video stalls past ~90s.

Gate 2 — Degenerate TTS (spoken-content length): Compute narration_duration = audio_duration_seconds - (1.5 if cookie_banner_present else 0.0). If narration_duration < 30s, retry Step 4 once (and recompute narration_duration from the retry's audio). If the retry also returns narration_duration < 30s, abort and investigate — likely failure modes: truncated MiniMax response, silent audio, vo_text not joined correctly.

Gate 3 — Rescale:

narration_duration = audio_duration_seconds - (1.5 if cookie_banner_present else 0.0)
scale = narration_duration / beats[-1].t_end
If scale < 0.5 or scale > 1.5, abort and re-author. Structurally broken TTS (or wildly off word budget); rescaling won't save it.
For each beat: beat.t_start = scale; beat.t_end = scale
If cookie_banner_present: for each beat, beat.t_start += 1.5; beat.t_end += 1.5
Final clamp: beats[-1].t_end = audio_duration_seconds (exact). Guarantees float equality of the invariant regardless of cookie mode or accumulated float drift.

After Gate 3 passes, emit a one-line operator log to surface the scale value for post-run diagnosis:

Rescaled beats by scale=X.XX (audio=Y.YYs, narration_duration=Z.ZZs, cookie_pad=W.Ws)

Advisory (not a gate): scale near 1.0 is ideal. scale > 1.2 means audio is meaningfully slower than predicted — visuals feel "stretched" but stay in-sync. scale < 0.85 means audio is faster — visuals feel "rushed" but in-sync. Both pass the gates; if the user reports "feels off-pace" rather than "out of sync," re-author with a tighter / looser word budget.

Step 5 — Preview gate (opt-in via `--preview`)

Skip Step 5 entirely by default. Proceed directly to Step 6 unless the user explicitly passed --preview — do not generate a preview, do not ask for confirmation. This matches industry standard for media-gen tools (Midjourney / Sora / Runway / HeyGen / Pika.art): submit → render → return; account credit balance + provider failover are the canonical guardrails.

--skip-preview and --yes are accepted as no-ops for backward compatibility — they were the old opt-out flags.

If --preview was supplied:

mcp__pika__generate_speech with text: "Hi, I'm your presenter. Let's explore this repo together." → preview_audio_url.
mcp__pika__generate_lipsync with provider: <resolved_lipsync_provider> (defaults to pika; honor --lipsync-provider kling if supplied), image: <avatar>, audio: preview_audio_url → preview_lipsync_url (bare lipsync, ~3s). Use the same provider here as Step 9 will use for the full audio — the preview's job is to confirm the avatar+voice+provider combo before the long-pole render.
Present to the user verbatim:

Preview ready: <preview_lipsync_url> This confirms the avatar + voice combo. The full render is a long pole (~5–30 min Kling lipsync on the full audio). Reply yes to proceed, or anything else to cancel.

Match ^(yes|go|proceed|confirm|y)$ (case-insensitive). Anything else → STOP, no further MCP calls.

Step 6 — Build `timed_actions` and record

Translate the beat sheet into capture_website timed_actions. One timed_action per beat — set bbox_selector to the beat's zoom_target.selector and capture_website captures the post-action bbox of that element internally (legacy 600 ms settle → smooth-scroll-to-top - 60 px → 1300 ms post-anim → measure, all server-side).

For each beat in order, emit one entry:

navigate beats: {type: "navigate", at_s: <t_start>, url: <action.url>, bbox_selector: <zoom_target.selector>}. The worker navigates, waits to absolute at_s + 0.6 s, scrolls bbox_selector into view, and measures the bbox — all without the caller scheduling a follow-up step.
scroll_to / hover beats: {type: "scroll", at_s: <t_start>, selector: <action.selector or zoom_target.selector>, bbox_selector: <zoom_target.selector>}. The action's own selector drives the page scroll; bbox_selector drives the bbox measurement (it can be the same selector or different — usually the same). (capture_website has no hover; scroll-into-view is the analog.)

Do NOT prepend the eight-step intro scroll-through that the tarball ran. The lipsync audio is timed from t=0 of the beat sheet; a prepended intro shifts the screen recording forward by ~3 s while leaving the audio un-shifted, causing audio/video desync. The capture_website recording begins at t=0 with beat 0's URL already loaded — that's the orientation the tarball's intro scroll provided, minus the desync.

Call mcp__pika__capture_website:

url: <beat 0's action.url>
timed_actions: <the N-element list built above> (one entry per beat)
duration_s: ceil(audio_duration_seconds) — beats[].t_start and t_end have already been rescaled to the TTS audio timeline by Step 4.5, so duration_s is simply the audio length. The old max(...) defense against TTS overrun is no longer needed.

Generic-URL mode additions (per Step 2.6 pre-flight):

extra_css: <the cookie-banner-hiding CSS payload from Step 2.6 §B> — defensive: hides common consent platforms via display: none !important; so even if the optional click misses, the banner is invisible in the recording.
Prepend a wait action {type: "wait", at_s: 0.0, ms: 2500} for SPA / lazy-render pages (per Step 2.6 §D); use 1500ms for "normal" pages. This gives time for hero images to lazy-load, fonts to swap, and scroll-triggered animations to be ready before the first beat fires.
If cookie_banner_present from Step 2.6 §B, also prepend a click action {type: "click", at_s: 0.5, selector: <detected dismissal selector from WebFetch DOM>}. The beats[] array has already been shifted by +1.5s in Step 4.5 to account for the cookie-dismissal lag, and the TTS audio has already been padded with 1.5s of silence (Step 4); no further shifting is required here. Beat 1's timed_action.at_s reads beats[0].t_start directly, which is 1.5 in cookie mode.
No cookie banner action needed if cookie_banner_present == false; just the prepended wait action.

Capture video_url, recording_viewport, action_bboxes. The result returns recording_viewport: {w, h} and action_bboxes: [{idx, selector, found, bbox: {x,y,w,h}}] alongside video_url.

action_bboxes[].idx semantics: the idx field is the position in the input timed_actions array.

GitHub mode: with one timed_action per beat, idx maps 1:1 to beat index — Step 8 uses entry.idx directly as beat_idx.
Generic-URL mode: the prepended wait (and optional cookie-dismissal click) shift the array by 1 or 2. Compute beat_idx = entry.idx - prepend_count where prepend_count is 1 (wait only) or 2 (wait + click). Skip entries where beat_idx < 0 (those are the prepended setup actions, not beats).

The selector field on each entry reports bbox_selector (i.e. zoom_target.selector), not the action's own selector.

Step 7 — Browser chrome

mcp__pika__edit_browser_frame:

video_url: <Step 6 video_url>
url: (live_url if GitHub-mode and survived Step 2.5 else input_url, truncated to 65 chars)
tab_title: <30-char title> — GitHub mode: (meta.description or repo_name or "")[:30]. Generic-URL mode: the page's <title> (from WebFetch in Step 2) or the URL's hostname, truncated to 30 chars. Guard against None/empty.

Returns framed_url (1280×800 Sonoma + chrome).

Step 8 — Build `zoom_keyframes` and apply

Constants:

INTRO_BEATS = 2 — gates by beat-sheet index. Skips zoom on beat indices 0 and 1 ("Beat 1" and "Beat 2" in the structural skeleton above).
HOLD_GAP = 0.6 — seconds of 1.0× before each zoom-in and after each zoom-out.
MIN_BEAT_DUR = 1.5 — beats shorter than this are skipped (no room for a meaningful zoom).
SCALE = 1.35 (precise element-targeted zoom).
FALLBACK_SCALE = 1.25 (default-position fallback when no usable bbox).
FALLBACK_RAMP = 0.4.

Note: beats[].t_start / t_end were rescaled (and cookie-shifted if applicable) to the audio timeline by Step 4.5. HOLD_GAP (0.6s), MIN_BEAT_DUR (1.5s), and the 1.0s interior-interval check all operate on those final values — they are real visual seconds on the rendered video.

edit_browser_frame's inner-content offsets: CONTENT_X=56, CONTENT_Y=108, CONTENT_W=1168, CONTENT_H=637 (verified against the worker's edit_browser_frame/main.py).

Coord transform (recording px → framed px):

cx_framed = 56  + (bbox.x + bbox.w/2) * (1168 / recording_viewport.w)
cy_framed = 108 + (bbox.y + bbox.h/2) * (637  / recording_viewport.h)

Build the zoom list with a per-beat default + bbox override pattern. The legacy rig followed an "every non-intro beat gets a zoom — bbox-derived if available, default-position otherwise" rule. Reproduce that here:

Step 8a — Pre-fill default-position keyframes for every non-intro, long-enough beat.

Constants for the default position:

DEFAULT_CX = 56 + 1168 // 2 (screen center of the framed canvas)
DEFAULT_CY = 108 + 637 // 3 (upper-third of the content area, where most GitHub UI prominence lives)

Walk the beat sheet from index INTRO_BEATS (= 2) to the end. For each beat:

If t_end - t_start < MIN_BEAT_DUR (1.5s), skip — too short for a meaningful zoom.
Compute the keyframe's interior interval as [t_start + HOLD_GAP, t_end - HOLD_GAP]. If that interval is shorter than 1.0s, skip.
Otherwise pre-fill that beat's slot in a per-beat map (call it zoom_keyframes_by_beat[beat_idx]) with {cx: DEFAULT_CX, cy: DEFAULT_CY, scale: FALLBACK_SCALE (1.25), ramp_s: FALLBACK_RAMP (0.4)} plus the trimmed t_start/t_end.

Step 8b — Override with bbox-derived precise zoom where action_bboxes provided a usable measurement.

For each entry in action_bboxes:

beat_idx = entry.idx (since Step 6 emits one timed_action per beat). If beat_idx < INTRO_BEATS, skip.
If entry.found is false, skip.
If the beat isn't already in zoom_keyframes_by_beat (was filtered out in Step 8a by MIN_BEAT_DUR/1.0s rules), skip.
Filter degenerate bboxes: skip if bbox.y > recording_viewport.h (offscreen capture — page didn't scroll the element into view in time) or bbox.h > recording_viewport.h * 1.5 (full-page <main> element — yields a meaningless zoom center).
Compute cx_framed/cy_framed from the bbox center using the recording-px → framed-px transform shown above. Override the beat's slot with {cx: cx_framed, cy: cy_framed, scale: SCALE (1.35), ramp_s: min(0.5, (t_end - t_start) * 0.15)}.

Final list: sort the values of zoom_keyframes_by_beat by t_start to produce the zoom_keyframes array.

This guarantees every non-intro, long-enough beat gets a zoom — precise when bbox capture worked, default-positioned otherwise. Avoids the "flat video for the whole runtime" failure mode.

If len(zoom_keyframes) > 0, call mcp__pika__edit_animate_zoom with video_url: framed_url, zoom_keyframes. Returns zoomed_url. Otherwise (no qualifying beats — should be rare given Step 3's 65-80s constraint) skip and use framed_url as zoomed_url.

Step 9 — Lipsync the full audio

mcp__pika__generate_lipsync:

provider: <resolved_lipsync_provider> — default: pika (parrot a2v). Honor --lipsync-provider kling if explicitly passed.
image: <avatar>
audio: <Step 4 audio_url>
kling-only knobs (since pika-mcp-server BACK-339, 2026-05-10): when provider == "kling", add mode: "pro" and prompt: "talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter" for the polished-presenter feel. Both are silently ignored on pika (parrot has its own driver).

Provider tradeoffs:

Provider	Wall-clock	Head motion	When to use
`pika` (default)	~2–5 min	Slightly more dramatic, naturalistic	Default for most runs — fast iteration, watchable output, ~10× faster than kling
`kling` (opt-in)	~5–30 min	Minimal, face-centered, presenter-style	High-stakes renders where the avatar must read like a polished presenter; tolerate the long pole

Server-side-await covers the call inline; if the response shape is {task_id, status: "queued"}, poll mcp__pika__task_status in a tight loop (no sleep) until the status reaches a terminal state (completed, failed, or cancelled). On completed, capture lipsync_url. On failed / cancelled, fall back to the other provider (kling ↔ pika) per the failover note below.

Failover:

If pika fails (rare — parrot a2v is robust at typical explainer audio lengths) → retry once with provider: "kling".
If kling stalls past the worker's 1200s ceiling (visible as repeated processing status with no completion) → fall back to provider: "pika". Step 4.5's audio-length gate should catch the long-audio case before it gets here, but the failover handles the residual risk.

Why pika is the default:

Speed — typical explainer wall-clock drops from ~10–15 min to ~5–7 min total because lipsync is the long pole.
Quality is good enough — parrot a2v is naturalistic; the slight extra head motion reads as engaging rather than distracting in a 60-80s clip with avatar circle PiP.
Kling-mode-pro polish is mostly invisible inside the 246-pixel circle anyway — face area is too small for the minimal-head-motion difference to register on most viewers.

For the canonical "polished presenter" feel of the original tarball reference output, pass --lipsync-provider kling explicitly.

Step 10 — PiP composite

mcp__pika__edit_pip:

main_video_url: <zoomed_url>
overlay_video_url: <lipsync_url>
shape: "circle"
size_px: 246 ← pixel-pinned 246px outer diameter (240 inner avatar + 3+3 stroke ring); matches tarball's CIRCLE_OUT = CIRCLE_SIZE + STROKE * 2
stroke_width_px: 3
stroke_color: "white"
position_px: {x: 20, y: 476} ← 800 − 246 − 78 for dock clearance (matches tarball's H − CIRCLE_OUT − 78)

Pass size_px, not size; the fields are mutually exclusive. Returns final_url.

Master-duration / audio-source contract (matching tarball github_explainer.py:418-419, 531-533, 578-582): edit_pip uses shortest=1 semantics by default, which means the composite's duration is the shorter of (zoomed screen recording) and (lipsync video). Step 6's duration_s = ceil(audio_duration_seconds) ensures the screen recording length matches the lipsync exactly (Step 4.5 rescaled beats to the audio timeline). The composite duration is set by the lipsync via edit_pip's shortest=1 semantics. Audio comes from the lipsync video's audio track (the lipsync embeds the original TTS audio); the standalone audio_url is not re-mixed. If the lipsync video is shorter than the screen recording (Kling sometimes trims trailing silence), the screen will get cut off at the lipsync end — accept this; the alternative (looping the screen) is worse for explainer content.

Step 11 — Burn captions

Call mcp__pika__add_captions(video_url=<final_url>, style="classic"). classic renders a bottom subtitle bar — the right register for an explainer video (use tiktok / hormozi / karaoke only when the user explicitly asks for word-level highlight). The audio is extracted server-side from the PiP composite's lipsync track, so transcription matches the narration verbatim. Capture the result as captioned_url.

Skip this step only if the user passed --no-captions (parsed in Step 1) — the default is captions on. (Note: /pika:podcast does not burn captions — narration in an explainer is more transcription-friendly than fast two-host dialogue.)

Step 12 — Return

Emit captioned_url (or final_url if Step 11 was skipped) on one line: Done: <url>.

Load-bearing phrases

These anchors preserve the visual contract across page types:

Phrase	Where	Why load-bearing
`vanilla CSS that resolves via document.querySelector`	Selector contract	Keeps scroll, bbox capture, and zoom targeting aligned inside `capture_website`.
`GitHub URLs activate repo-aware mode`	Mode detection	Prevents generic product-page beats from replacing README/code walkthrough beats.
`8-10 beats`, `65-80 seconds`, `165-200 words`	Beat-sheet authoring	Keeps narration, screen recording, lipsync, and captions within the reliable duration envelope.
`all beats[] mutations happen here`	Audio rescale step	Ensures later capture/zoom/composite steps consume one stable timeline.
`extra_css` cookie-banner hiding payload	Generic URL pre-flight	Reduces first-frame banner occlusion when a banner click misses.

Engine choice: Pika lipsync default, Kling opt-in

Default to Pika/parrot lipsync because it is faster and keeps most explainers in a short iteration loop. Use Kling only when the user explicitly requests --lipsync-provider kling or when a high-stakes render needs a more centered presenter look and can tolerate a much longer long-pole stage. Screen capture, browser frame, zoom, PiP, and captions remain deterministic edit/composite steps around that lipsync choice.

Runtime expectations

Typical wall-clock is 5-10 minutes with Pika lipsync, or 10-30+ minutes with Kling lipsync:

Step	Wall clock	Notes
URL read + pre-flight	10-60s	GitHub README scan or generic URL DOM/cookie checks
TTS + audio rescale	30-90s	Beat timing is normalized after actual audio length
Screen recording	60-180s	Depends on page load and navigation count
Browser frame + zooms	1-3 min	Deterministic edit/composite stages
Lipsync	2-5 min Pika / 5-30 min Kling	Kling is opt-in because it is the long pole
PiP + captions	1-3 min	Captions skipped when `--no-captions` is set

Known gaps (carried as follow-up server-side work)

~~Kling avatar mode:"pro" and prompt not exposed.~~ Resolved by pika-mcp-server BACK-339 (PR #186, shipped 2026-05-10): generate_lipsync now wires both mode (e.g. "pro") and prompt end-to-end for the kling provider. To enable polished-presenter mode here, pass --lipsync-provider kling and the Step 9 call should add mode: "pro" plus a prompt like "talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter". Real quality lever for reducing dramatic head motion in the lipsync — no longer a server-side gap.
No caller-controlled white-frame trim on the screen recording. capture_website has internal trim heuristics but doesn't expose them to the caller. Visible as a brief white flash at the start of the explainer when the page is still loading. The 800ms wait action at at_s: 0.0 mitigates this somewhat by giving the page time to paint, but doesn't trim already-recorded white frames. Worker enhancement.
No networkidle wait on per-beat navigation. Tarball uses page.goto(url, wait_until="networkidle", timeout=20000) plus wait_for_timeout(600) after every navigate. capture_website settles to domcontentloaded plus the bbox-capture branch's 600 ms post-action settle (server-side, when bbox_selector is set), but SPA blob pages whose final render happens after domcontentloaded can still get bbox'd against unmounted code blocks. Worker enhancement: expose a wait_until knob on timed_actions[].navigate.
No per-step output-size verification gates. Tarball's verify() helper at github_explainer.py:35-39 checked TTS ≥ 50KB, preview ≥ 100KB, screen ≥ 200KB, lipsync ≥ 500KB, final ≥ 1MB after each step. The MCP path returns URLs only; verifying file size would require an extra mcp__pika__analyze_media call per step (~30s overhead each). Worth adding once user-side latency budget allows it. For now, a downstream-failure cascade (e.g. zero-byte TTS → silent lipsync → blank composite) only surfaces at Step 11.
text_content bbox capture not implemented. capture_website v1 returns action_bboxes only for steps with a CSS selector. text_content-only steps produce no entry. Prefer CSS selectors in zoom_target for guaranteed zoom coverage.
Beat-sheet wording is non-deterministic. Running the same input twice produces different vo_text and different zoom positions. Visual kind is the contract, not pixel-exact reproduction.
Generic-URL mode quality varies by site. Modern indie / SaaS landing pages with semantic markup (<h1> + clear <section> + named class hooks) work well. Big-name corporate sites (apple.com, microsoft.com, amazon.com) hit several known limits: (a) bot detection — the page may serve a degraded version under headless Chrome, or a captcha; Step 2.6 §A aborts on these but the heuristics aren't exhaustive; (b) obfuscated class names — tile-headline instead of hero-title defeats generic selectors; Step 2.6 §C's WebFetch DOM scan helps but isn't perfect; (c) scroll-triggered animations don't play — IntersectionObserver-driven hero reveals fire on real user scrolls, not Playwright's scrollIntoView; the recorded frame may be a static placeholder; (d) lazy-loaded images — picture/source elements with loading="lazy" may not have resolved by the 600ms-or-2500ms settle window; the bbox lands on a transparent placeholder. Workarounds: prefer simpler / smaller marketing pages for launch demos, always pass --focus "the X feature" to anchor beat selection, accept that big-name sites need a follow-up server PR (cookie-banner click retry + wait_until=networkidle + animation-trigger via IntersectionObserver polyfill).
Cookie-banner click is single-attempt. Step 2.6 §B emits one click against the dismissal selector extracted from the WebFetch DOM. If the WebFetch's HTML doesn't include the banner (rendered post-JS) or the selector is wrong, the click silently misses — the extra_css payload is the load-bearing defense. Worker enhancement: support a list of fallback selectors per click action so the worker tries each in order.
Step 8b ↔ Step 6 idx-mapping mismatch in Generic-URL mode. Step 6 maps beat_idx = entry.idx - prepend_count, but Step 8b uses beat_idx = entry.idx naively. Pre-existing bug independent of rescaling.

Auth

If any call returns 401: the user's OAuth token has expired or hasn't been issued. The next authenticated MCP call triggers OAuth automatically (browser opens for @pika.art Google login). For non-interactive environments, set MCP_AUTH_TOKEN.

Examples

GitHub-mode (repo-aware: README scan + live-demo detection):

/pika:explainer https://github.com/leigest519/OpenGame
/pika:explainer https://github.com/anthropics/claude-cookbooks --focus "Claude Code MCP integration"
/pika:explainer https://github.com/openai/whisper --preview (opt-in to the preview gate when testing a new avatar)

Generic-URL mode (any non-GitHub URL — drives through the page directly):

/pika:explainer https://pika.art
/pika:explainer https://vercel.com --focus "the deployment workflow"
/pika:explainer https://docs.anthropic.com/en/docs/claude-code/plugins
/pika:explainer https://your-product-page.com --avatar https://cdn.example.com/me.png --preview