/pika:explainer
Generate a ~60–80s URL explainer video: drive a real browser through the URL along a beat-sheet timeline, generate an avatar lipsync of the narration, and composite it all in a 1280×800 macOS Sonoma frame with a 240-pixel inner avatar (246-pixel outer including 3px white stroke ring) at canvas (20, 476) and element-targeted zoom on every mid-section beat. Works on any URL — product pages, docs sites, blog posts, launches. GitHub URLs activate a repo-aware mode (README scan + live-demo detection); all other URLs use a generic page-walkthrough flow.
Usage: /pika:explainer <url> [--focus "angles"] [--avatar <url>] [--voice <id>] [--lipsync-provider pika|kling] [--preview] [--live-url <url>]
Behavior
Defaults — fire fast, no mid-flow confirmation
- Use identity-store defaults silently for avatar / voice. Never ask "should I use your avatar?" or "which voice?" before firing. Honor explicit overrides (
--avatar,--voice) when supplied; otherwise resolve viaidentity_avatar_url/identity_voice_idand proceed. See Step 1 for the full resolution waterfall (including the silent fallback when identity returns null). - No mid-flow "type yes to proceed" gates by default. Step 5 preview is opt-in via
--preview(for power users testing new avatar/voice combos before the long-pole render); the default flow runs end-to-end without pausing. - Do not solicit
--focuseither. Make a confident first attempt from page structure; users re-run with--focus "X"if the angle missed.
These defaults match industry standard for media-gen tools (Midjourney / Sora / Runway / HeyGen / Pika.art): submit → render → return. Account credit balance + provider failover (Step 9) are the canonical guardrails.
Local avatar images on Claude Desktop
Claude Desktop can't pass inline-pasted images to MCP tools yet (Anthropic-side limitation). If the user pastes a photo inline, or mentions a local file they want as --avatar, pause Step 1 and kindly send them this — something like:
Heads up — pasted images don't reach MCP tools on Claude Desktop yet (Anthropic limitation). Two easy options for your avatar: - Paste a URL if it's already hosted (Imgur, S3, your site) — fastest - Attach the image file so I can upload it before generation.
When a local file arrives, convert it to a public URL with upload_asset and use the returned public_url as --avatar <url> before Step 1. Already-hosted https://... URLs work as-is and skip this entirely. If no avatar is supplied at all, the identity-store default fires.
Step 0 — Resolve URL (empty-args menu)
Strip flags (--focus, --avatar, --voice, --live-url, --lipsync-provider, --no-captions, --preview, --skip-preview, --yes) and key=value parameters from $ARGUMENTS. If what remains contains no https://... URL (or is empty / whitespace-only), print this menu verbatim as your full response, then stop and wait for the user's next message. Calling a tool here risks recording or explaining the wrong page. If $ARGUMENTS already carries a URL, skip this step silently and proceed to Step 1.
Which URL would you like me to walk through? Works on any of: - A GitHub repo — e.g.
https://github.com/anthropics/claude-code(activates repo-aware mode: README scan + live-demo detection) - A product page / launch page — e.g.https://pika.art- A docs site — e.g.https://docs.anthropic.com- A blog post / article URL Output: 1280×800 macOS Sonoma frame with a bottom-left avatar lipsync and element-targeted zoom on every mid-section beat. Default flow runs end-to-end with no confirmation gates — pass--previewif you want a 3-second lipsync sanity check first. Reply with the URL and I'll start. Tip: you don't need to type/pika:explainer— just say things like "walk me through <url>", "make a demo video of <url>", or "explain this repo: <github-url>" and I'll fire this skill automatically.
When the user replies with a URL, treat it as the resolved input and proceed to Step 1. Do not re-prompt.
Step 1 — Parse input + detect mode
Required: url (must be https://...). Optional: --avatar <url> (overrides identity-store default), --voice <minimax-voice-id>, --focus "..." (editorial guidance woven into vo_text), --live-url <url> (force-supply live demo URL — GitHub mode only), --lipsync-provider <pika|kling> (defaults to pika — parrot a2v, ~2-5 min wall-clock, slightly more dramatic head motion. Pass kling for tighter face-centered output at ~5-30 min wall-clock — Kling produces minimal-head-motion presenter shots but is the long-pole stage; reserve for high-stakes renders), --no-captions (skip the Step 11 caption burn — default is captions on), --preview (opt-in to the Step 5 preview gate — ~3s lipsync of "Hi, I'm your presenter" for testing new avatar/voice combos before the long-pole render; default is no preview). --skip-preview and --yes are accepted as no-ops for backward compatibility.
Mode detection:
- GitHub mode — URL host is
github.comAND path matches/{owner}/{repo}(no further path segments past the repo root). Activates the repo-aware extras: README scan, live-demo detection, GitHub-specific selectors. - Generic-URL mode — anything else (a product page, docs site, blog post, deeper GitHub path like
/blob/HEAD/path). Skips the GitHub extras; uses generic CSS selectors and walks through the URL itself.
Avatar resolution (silent — never ask the user):
- If
--avatar <url>was passed, use it. - Else call
mcp__pika__identity_avatar_url. If non-null, use it. - Else (fresh user, no identity avatar set yet) call
mcp__pika__generate_imageonce with prompt"professional presenter, friendly tech narrator, studio portrait, 1:1, natural lighting"and use the returned URL. Do not ask the user "should I generate one?" — just generate silently.
Voice resolution (silent — never ask the user):
- If
--voice <id>was passed, use it. - Else call
mcp__pika__identity_voice_id. If non-null, use it. - Else pick a casual MiniMax
speech-2.8-hdpreset matching the resolved avatar's apparent gender:
- Female-coded avatar →
English_PlayfulGirl(warm, casual, clearly female-voiced — verified) - Male-coded avatar →
English_Jovialman(warm, casual male) - Unclear / gender-neutral →
English_Jovialman(default)
Determine gender from mcp__pika__identity_persona_read (look for a gender / pronouns field) when identity exists; otherwise infer from the resolved avatar image. Do not call analyze_media for this — it's not worth the extra ~30s round-trip. Do not ask the user.
Do NOT use English_FriendlyPerson — despite being categorized under "female" in MiniMax's catalog, its display name is "Friendly Guy" and it reads as male in playback. English_PlayfulGirl is the canonical casual-female pick. Other verified-female alternates: English_Upbeat_Woman, English_LovelyGirl, English_radiant_girl.
The flow below is annotated per step: GitHub-only, Generic-only, or Both modes.
Step 2 — Read source (no MCP call)
Both modes: use Claude's WebFetch on the input URL to pull the page's main content (h1, hero section, headings, primary copy).
GitHub mode additions: also fetch top-level file tree, (best-effort) package.json / pyproject.toml, and GitHub API repo metadata via gh api repos/{owner}/{repo} for homepage, description, language, topics. Detect a candidate live_url in this priority:
- User-supplied
--live-url. - GitHub API
meta.homepagefield — set when the maintainer configured the repo's homepage in GitHub settings (matches tarballrepo_analyzer.py:66-77). package.json"homepage"field.- First match in README of
https?://[^\s)\"'<>]+(?:vercel\.app|netlify\.app|github\.io|fly\.dev|railway\.app|render\.com|herokuapp\.com|surge\.sh)[^\s)\"'<>]*. - Any other URL in README that the badge area / "Live Demo" / "Project Page" / "Demo" text points at. The allowlist regex above misses arbitrary custom domains (e.g.
<project>-project-page.com); when the README explicitly designates a project page, prefer that over the github.io fallback. - GitHub Pages convention
https://{owner}.github.io/{repo}— but only if the deep tree contains a frontend signal (one ofindex.html,App.tsx,App.jsx,App.vue,app.py,main.py).
If no candidate resolves, the beat sheet skips beats 6–7.
Generic-URL mode: the input URL itself is the only URL the beats walk through — no live_url inference, no extra metadata fetches. Skip Step 2.5 and Step 3.0; jump straight to Step 3.
Step 2.5 — Verify live_url reachability (GitHub mode only, no MCP call)
If a candidate live_url was selected, verify it serves real content before authoring beats 6–7. Use WebFetch on the candidate and check the response:
- If the response status is 4xx / 5xx, drop
live_urlto None and skip beats 6–7. The github.io fallback in particular is reachable as a hostname but often returns 404 ("There isn't a GitHub Pages site here") for repos that haven't enabled Pages — recording that 404 page wastes ~12s of the explainer on wrong content. - If the response renders the GitHub Pages "404 — There isn't a GitHub Pages site here." template (heuristic: response body contains
"There isn't a GitHub Pages site here"), droplive_urland skip beats 6–7. - Otherwise, keep
live_urlfor beats 6–7.
This mirrors the original tarball's requests.head(live_url, timeout=6, allow_redirects=True) reachability gate.
Step 2.6 — Generic-URL pre-flight (Generic-URL mode only, no MCP call)
Before authoring beats for a non-GitHub URL, WebFetch the input URL and inspect the response. This step prevents three common Generic-URL failure modes: (a) recording a captcha / bot-block page instead of content, (b) the cookie/consent banner eating the first ~3 seconds of video, (c) generic CSS selectors missing the page's actual hero / sections.
A. Bot-block / captcha detection — abort if matched:
If the response body contains any of:
"Verify you are human"/"verify you are not a robot""captcha"/"CAPTCHA"/"reCAPTCHA""403 Forbidden"/"Access Denied""Just a moment"+cf-chl-bypass(Cloudflare challenge)"We're sorry, something went wrong"(Amazon-style bot block)- A
<title>or h1 of just "Robot Check" / "Are you a robot?"
→ ABORT with a clear error to the user: "Generic-URL mode can't render this site — the page is showing a bot-detection / captcha challenge under headless Chrome. Try a different URL, or run a real-user version of the page first to verify it loads cleanly."
B. Cookie / consent-banner detection — defuse with extra_css + optional click:
Scan the response for these patterns (case-insensitive):
- IDs / classes starting with
onetrust-,truste-,cookie-banner,cookie-consent,gdpr-,consent-,cmp- - Buttons matching
(?i)accept (all )?cookies/(?i)agree.{0,10}cookies/(?i)i (accept|agree) - Apple-specific banner: id
ac-gdpr-banneror classas-globalfooter-curtain - Google consent:
[role="dialog"]with text "Before you continue"
If detected, set cookie_banner_present = true. Defense in depth — the recording uses BOTH:
- CSS injection (
extra_css) in thecapture_websitecall to hide common banners universally — even if the click below misses, the banner is visually gone. - A
clicktimed_actionatat_s: 0.0against the most likely dismissal selector (extracted from the WebFetch DOM, e.g.#onetrust-accept-btn-handler,[aria-label="Accept all" i],button[id="accept"]).
The extra_css payload (use this verbatim — covers ~80% of consent platforms):
#onetrust-banner-sdk, #onetrust-pc-sdk, #onetrust-consent-sdk { display: none !important; }
#truste-consent-track, #truste-consent-content, .truste_box_overlay { display: none !important; }
[id*="gdpr-cookie"], [id*="cookie-consent"], [id*="cookie-banner"] { display: none !important; }
[class*="cookie-banner"], [class*="cookie-consent"], [class*="consent-banner"] { display: none !important; }
[class*="CookieBanner"], [class*="CookieConsent"], [class*="ConsentBanner"] { display: none !important; }
#ac-gdpr-banner, .as-globalfooter-curtain { display: none !important; } /* Apple */
[role="dialog"][aria-label*="cookie" i], [role="dialog"][aria-label*="consent" i] { display: none !important; }
.cmp-container, .cmp-modal, .cmp-banner { display: none !important; }
C. Real-DOM element identification — emit concrete selectors:
Generic CSS selectors (h1, [class="hero"], section h2) work on semantic / well-marked-up sites but miss obfuscated class names on big-name corporate sites (apple.com uses tile-headline / as-headline-section-title, not hero-). For each beat, prefer the actual DOM elements observed in the WebFetch:
- Read the rendered HTML/markdown WebFetch returned. Note the page's actual primary
<h1>text and class. - Note the page's section structure (h2 headings + their parent containers).
- Note any prominent CTA / signup / pricing element.
- Emit
zoom_target.selectorusing the actual class or id observed, falling back to semantic structure (main > section:nth-of-type(N) h2) when class names look auto-generated (Tailwind_1a2b3c, CSS modulesmodule__hero___xYz).
D. SPA / lazy-render detection — bump initial wait:
If the WebFetch response has fewer than 3 visible headings / minimal text content, the page may be SPA-rendered post-domcontentloaded. Emit a longer initial wait action ({type: "wait", at_s: 0.0, ms: 2500}) before any beat fires, instead of the default 600ms settle.
E. --focus is honored when supplied (do not solicit):
Without --focus, select beats from generic structure cues — proceed silently with a confident first attempt. Do not ask the user "what should I focus on?" before firing; users iterate by re-running with --focus "the X feature" if the first pass misses the angle they wanted. With --focus supplied, anchor beat selection on the phrase: uses concrete page sections that match it, ignores irrelevant marketing chrome.
Step 3.0 — Required README section scan (GitHub mode only, no MCP call)
Before authoring the beat sheet, scan the README (case-insensitive, full-text) for any of these section names. If a match is found, you must add a dedicated beat for that section in Step 3, replacing one of the generic beats 4–5 if necessary:
| README contains... | Required beat |
|---|---|
how it works | scroll_to that heading; zoom article h2:has(#user-content-how-it-works) |
audio layer / audio timeline | scroll_to the audio-layer diagram; zoom on the rendered figure or its surrounding heading |
claude code / mcp integration | scroll_to that section; zoom article pre or .highlight (terminal screenshot / code block) |
architecture / system design | scroll_to that section; zoom article h2:has(#user-content-architecture) |
features (when prominent at top) | scroll_to that heading; zoom article h2:has(#user-content-features) |
getting started / quick start / installation | scroll_to that heading; zoom article h2:has(#user-content-installation) (or the matching slug) — falls back to article pre if you want the install code block instead |
usage / examples | scroll_to that heading; zoom article h2:has(#user-content-usage) (or the matching slug) — or the first code block under it |
GitHub heading slug rule: lowercase, spaces → dashes, strip non-[a-z0-9-] characters. So "How it works" → #user-content-how-it-works, "Quick Start" → #user-content-quick-start. GitHub injects the <a id="user-content-{slug}"> anchor inside each rendered <hN>, so hN:has(#user-content-{slug}) reliably grabs the heading element across any GitHub README.
Selector contract: bbox_selector needs to be vanilla CSS that resolves via document.querySelector (capture_website runs the post-action smooth-scroll JS via page.evaluate, which uses the browser's native selector engine). Avoid Playwright extensions like :has-text("..."), text=..., or :visible: those resolve in Playwright's page.query_selector (so the bbox capture finds the element) but silently fail in the smooth-scroll's document.querySelector (so the page never scrolls to the target, and bbox.y ends up at document-Y instead of top - 60 px, which trips Step 8b's bbox.y > recording_viewport.h degenerate filter and falls back to default-position zoom). CSS Level 4 :has(...) is vanilla and supported in modern Chromium.
These sections are the highest-information visuals in most explainer-worthy repos. Missing them produces a generic walkthrough; including them gives the explainer a concrete "show, don't tell" beat. The original tarball SKILL.md flagged the first four with SPECIAL rules in the Gemini prompt; this Step 3.0 promotes them from incidental guidance to a hard requirement and adds three more high-signal headings common in OSS READMEs.
Step 3 — Author beat sheet (main thread, no MCP call)
Write a JSON array of 8–10 beats, with a hard total duration of 65–80 seconds and a hard total word count of 165–200 words (assuming a speaking rate of 2.5 words/sec). Each beat:
{
"t_start": 0.0,
"t_end": 7.5,
"action": { "type": "navigate" | "scroll_to" | "hover", "url": "...", "selector": "..." },
"zoom_target": { "selector": "...", "description": "..." },
"vo_text": "exact words to speak — 1 to 2 conversational sentences"
}
Hard constraints (validate before emitting the beat sheet — reject the draft if any fails):
- Every beat needs all five fields:
t_start,t_end,action(withtypeandurl),zoom_target(withselector),vo_text. Missing fields ⇒ reject and re-author. (Mirrors tarball'sgithub_explainer.py:183-190validation pass.) t_startof beat 0 = 0.0;t_end[i] == t_start[i+1](continuity).len(vo_text.split()) / 2.5≈t_end - t_startper beat. Aim for ±10% of this estimate; if your draft is denser than 2.5 wps, tighten thevo_textuntil it fits.- Total
t_endof last beat ≤ 80 seconds. (Reference output is 86.5s including intro; lipsync audio is ~83s. Kling avatar/image2video stalls reliably past ~90s of audio under current load — going over 80s risks a 20-min Kling timeout.) - Total spoken word count between 165 and 200 words.
- Every beat's
zoom_target.selectorneeds to be a valid CSS selector for the page that beat lands on. GitHub mode prefers GitHub-specific selectors:h1.f1,#readme,article h2,.blob-code-inner,.highlight,.octicon-star,nav. Generic-URL mode prefers robust generic selectors:h1,[role="main"],main,header,nav,.hero,.feature,section h2,[class="cta"],[class="hero"],button,a[href]. Selectors need to resolve on the rendered page after the beat's action settles — verify against the DOM you can see via WebFetch before emitting. vo_textis 1-2 conversational sentences. Dev voice. No stage directions. No markdown.action.urlis a validhttps://...URL whenaction.type == "navigate"; required.
Self-check before Step 4: verify total_words is in [165, 200] AND total_seconds (= beats[-1].t_end) is in [65, 80]. If either misses bounds, re-author the beat sheet — do not proceed to TTS. (No need to "print" anywhere — this is an internal draft validation; just reject the draft and re-author until it passes.)
Structural skeleton — GitHub mode (load-bearing for the visual contract — match origin, but Step 3.0 overrides if applicable):
- Beat 1:
navigaterepo root, zoomh1.f1(repo title), hook sentence. - Beats 2–3:
navigateto specific source files (https://github.com/{owner}/{repo}/blob/HEAD/<path>), zoom.blob-code-inneror.highlight. Pick files that match the narration's claim — don't navigate to a file you won't talk about. - Beats 4–5:
scroll_toREADME sections, zoomarticle h2or#readme. If Step 3.0 surfaced required sections, replace these slots with the required ones. - Beats 6–7 (only if
live_urlsurvived Step 2.5):navigatetolive_url, zoomnav/h1/.hero/main/button/.feature. - Beat 8: back to repo root, zoom
.octicon-star, outro.
Structural skeleton — Generic-URL mode:
- Beat 1:
navigateto the input URL, zoomh1or[class*="hero"] h1(the page's primary headline), hook sentence. - Beats 2–3:
scroll_tothe page's hero / value-prop / first feature section. Zoom.hero,[class="hero"],[class="feature"], orsection:nth-of-type(1) h2. Pick visible elements the narration references. - Beats 4–5:
scroll_todeeper sections — feature lists, screenshots, pricing, social proof. Zoomsection h2,[class="feature"] img,[class="testimonial"],[class*="pricing"], or any prominent semantic element on the page. - Beats 6–7:
scroll_toCTA / signup / demo embed. Zoom[class="cta"],button,a[class="button"], or[id*="signup"]. (No live-demo navigation in generic mode — the input URL IS the demo.) - Beat 8:
scroll_tofooter / closing element, zoomfooter h2,footer, or back to top withh1. Outro sentence.
If --focus is supplied, weave its angles into vo_text without mutating the structural skeleton. Prefer CSS selectors over text_content in zoom_target.selector — bbox capture is selector-only (see Known gaps).
Step 4 — TTS
Call mcp__pika__generate_speech with provider: "minimax-tts", text: <full vo_text join>, optional voice_id. Capture result.audio_url (the dispatcher returns audio under audio_url, not url) and result.duration_seconds. Voice defaults to identity-store injection in plugin mode.
Stale-voice fallback detection (AGNT-231): the dispatcher retries once with the default Calm_Woman voice on Minimax status_code:2054 (voice id not found — typically a per-agent workspace pointer that Minimax auto-deleted after 7 days of inactivity). On retry success the response carries two extra fields beyond the documented schema (passthrough): voice_id_requested (the planted-but-stale id the worker tried first) and fallback_reason: "invalid_minimax_voice_id". If you see fallback_reason == "invalid_minimax_voice_id" in the response, surface a one-line note to the user along the lines of: "your registered voice expired on Minimax (auto-GC'd after 7 days of inactivity); we used the system default. Re-clone via clone_voice if you want personalization back." The render does NOT fail — it just uses the default voice — so this is informational, not a retry trigger.
Cookie-banner audio padding (Generic-URL mode with cookie_banner_present == true from Step 2.6 §B): prepend MiniMax's pause marker <#1.5#> to the text: argument before calling generate_speech. MiniMax's speech-2.8-hd honors <#N#> as N-second silence; the returned audio_url and duration_seconds include the 1.5s lead-in natively. This aligns the audio with the screen recording's cookie-dismissal +1.5s offset applied in Step 4.5.
Fallback (only if smoke-test shows the marker is ignored on this voice): call generate_speech normally, then mcp__pika__edit_audio_mix to overlay the result onto a 1.5s silent base at offset 1.5s. Then call mcp__pika__analyze_media(url=<padded_audio_url>) to probe the padded duration and rebind duration_seconds = result.duration_seconds before Step 4.5 consumes it. analyze_media is the single authoritative duration probe — do not rely on edit_audio_mix's return payload (its duration field is not contractually guaranteed).
Step 4.5 — Audio length verification + beat-sheet rescale
Applied to audio_duration_seconds post Step 4 (which includes any cookie lead-in pad). End state: beats[].t_start / t_end are absolute wall-clock seconds matching the audio playback timeline. All beats[] mutations happen here; Steps 6 and 8 are read-only consumers.
Gate 1 — Kling stall ceiling (provider cap, raw audio_duration_seconds): If audio_duration_seconds > 90, abort and re-author the beat sheet with a tighter word budget. Kling avatar/image2video stalls past ~90s.
Gate 2 — Degenerate TTS (spoken-content length): Compute narration_duration = audio_duration_seconds - (1.5 if cookie_banner_present else 0.0). If narration_duration < 30s, retry Step 4 once (and recompute narration_duration from the retry's audio). If the retry also returns narration_duration < 30s, abort and investigate — likely failure modes: truncated MiniMax response, silent audio, vo_text not joined correctly.
Gate 3 — Rescale:
narration_duration = audio_duration_seconds - (1.5 if cookie_banner_present else 0.0)scale = narration_duration / beats[-1].t_end- If
scale < 0.5orscale > 1.5, abort and re-author. Structurally broken TTS (or wildly off word budget); rescaling won't save it. - For each beat:
beat.t_start = scale; beat.t_end = scale - If
cookie_banner_present: for each beat,beat.t_start += 1.5; beat.t_end += 1.5 - Final clamp:
beats[-1].t_end = audio_duration_seconds(exact). Guarantees float equality of the invariant regardless of cookie mode or accumulated float drift.
After Gate 3 passes, emit a one-line operator log to surface the scale value for post-run diagnosis:
Rescaled beats by scale=X.XX (audio=Y.YYs, narration_duration=Z.ZZs, cookie_pad=W.Ws)
Advisory (not a gate): scale near 1.0 is ideal. scale > 1.2 means audio is meaningfully slower than predicted — visuals feel "stretched" but stay in-sync. scale < 0.85 means audio is faster — visuals feel "rushed" but in-sync. Both pass the gates; if the user reports "feels off-pace" rather than "out of sync," re-author with a tighter / looser word budget.
Step 5 — Preview gate (opt-in via --preview)
Skip Step 5 entirely by default. Proceed directly to Step 6 unless the user explicitly passed --preview — do not generate a preview, do not ask for confirmation. This matches industry standard for media-gen tools (Midjourney / Sora / Runway / HeyGen / Pika.art): submit → render → return; account credit balance + provider failover are the canonical guardrails.
--skip-preview and --yes are accepted as no-ops for backward compatibility — they were the old opt-out flags.
If --preview was supplied:
mcp__pika__generate_speechwithtext: "Hi, I'm your presenter. Let's explore this repo together."→preview_audio_url.mcp__pika__generate_lipsyncwithprovider: <resolved_lipsync_provider>(defaults topika; honor--lipsync-provider klingif supplied),image: <avatar>,audio: preview_audio_url→preview_lipsync_url(bare lipsync, ~3s). Use the same provider here as Step 9 will use for the full audio — the preview's job is to confirm the avatar+voice+provider combo before the long-pole render.- Present to the user verbatim:
Preview ready:
<preview_lipsync_url>This confirms the avatar + voice combo. The full render is a long pole (~5–30 min Kling lipsync on the full audio). Replyyesto proceed, or anything else to cancel.
- Match
^(yes|go|proceed|confirm|y)$(case-insensitive). Anything else → STOP, no further MCP calls.
Step 6 — Build timed_actions and record
Translate the beat sheet into capture_website timed_actions. One timed_action per beat — set bbox_selector to the beat's zoom_target.selector and capture_website captures the post-action bbox of that element internally (legacy 600 ms settle → smooth-scroll-to-top - 60 px → 1300 ms post-anim → measure, all server-side).
For each beat in order, emit one entry:
navigatebeats:{type: "navigate", at_s: <t_start>, url: <action.url>, bbox_selector: <zoom_target.selector>}. The worker navigates, waits to absoluteat_s + 0.6 s, scrollsbbox_selectorinto view, and measures the bbox — all without the caller scheduling a follow-up step.scroll_to/hoverbeats:{type: "scroll", at_s: <t_start>, selector: <action.selector or zoom_target.selector>, bbox_selector: <zoom_target.selector>}. The action's ownselectordrives the page scroll;bbox_selectordrives the bbox measurement (it can be the same selector or different — usually the same). (capture_websitehas nohover; scroll-into-view is the analog.)
Do NOT prepend the eight-step intro scroll-through that the tarball ran. The lipsync audio is timed from t=0 of the beat sheet; a prepended intro shifts the screen recording forward by ~3 s while leaving the audio un-shifted, causing audio/video desync. The capture_website recording begins at t=0 with beat 0's URL already loaded — that's the orientation the tarball's intro scroll provided, minus the desync.
Call mcp__pika__capture_website:
url: <beat 0's action.url>timed_actions: <the N-element list built above>(one entry per beat)duration_s: ceil(audio_duration_seconds)—beats[].t_startandt_endhave already been rescaled to the TTS audio timeline by Step 4.5, soduration_sis simply the audio length. The oldmax(...)defense against TTS overrun is no longer needed.
Generic-URL mode additions (per Step 2.6 pre-flight):
extra_css: <the cookie-banner-hiding CSS payload from Step 2.6 §B>— defensive: hides common consent platforms viadisplay: none !important;so even if the optional click misses, the banner is invisible in the recording.- Prepend a
waitaction{type: "wait", at_s: 0.0, ms: 2500}for SPA / lazy-render pages (per Step 2.6 §D); use 1500ms for "normal" pages. This gives time for hero images to lazy-load, fonts to swap, and scroll-triggered animations to be ready before the first beat fires. - If
cookie_banner_presentfrom Step 2.6 §B, also prepend aclickaction{type: "click", at_s: 0.5, selector: <detected dismissal selector from WebFetch DOM>}. Thebeats[]array has already been shifted by+1.5sin Step 4.5 to account for the cookie-dismissal lag, and the TTS audio has already been padded with 1.5s of silence (Step 4); no further shifting is required here. Beat 1'stimed_action.at_sreadsbeats[0].t_startdirectly, which is 1.5 in cookie mode. - No cookie banner action needed if
cookie_banner_present == false; just the prepended wait action.
Capture video_url, recording_viewport, action_bboxes. The result returns recording_viewport: {w, h} and action_bboxes: [{idx, selector, found, bbox: {x,y,w,h}}] alongside video_url.
action_bboxes[].idx semantics: the idx field is the position in the input timed_actions array.
- GitHub mode: with one timed_action per beat,
idxmaps 1:1 to beat index — Step 8 usesentry.idxdirectly asbeat_idx. - Generic-URL mode: the prepended
wait(and optional cookie-dismissalclick) shift the array by 1 or 2. Computebeat_idx = entry.idx - prepend_countwhereprepend_countis 1 (wait only) or 2 (wait + click). Skip entries wherebeat_idx < 0(those are the prepended setup actions, not beats).
The selector field on each entry reports bbox_selector (i.e. zoom_target.selector), not the action's own selector.
Step 7 — Browser chrome
mcp__pika__edit_browser_frame:
video_url: <Step 6 video_url>url: (live_url if GitHub-mode and survived Step 2.5 else input_url, truncated to 65 chars)tab_title: <30-char title>— GitHub mode:(meta.description or repo_name or "")[:30]. Generic-URL mode: the page's<title>(from WebFetch in Step 2) or the URL's hostname, truncated to 30 chars. Guard againstNone/empty.
Returns framed_url (1280×800 Sonoma + chrome).
Step 8 — Build zoom_keyframes and apply
Constants:
INTRO_BEATS = 2— gates by beat-sheet index. Skips zoom on beat indices 0 and 1 ("Beat 1" and "Beat 2" in the structural skeleton above).HOLD_GAP = 0.6— seconds of 1.0× before each zoom-in and after each zoom-out.MIN_BEAT_DUR = 1.5— beats shorter than this are skipped (no room for a meaningful zoom).SCALE = 1.35(precise element-targeted zoom).FALLBACK_SCALE = 1.25(default-position fallback when no usable bbox).FALLBACK_RAMP = 0.4.
Note: beats[].t_start / t_end were rescaled (and cookie-shifted if applicable) to the audio timeline by Step 4.5. HOLD_GAP (0.6s), MIN_BEAT_DUR (1.5s), and the 1.0s interior-interval check all operate on those final values — they are real visual seconds on the rendered video.
edit_browser_frame's inner-content offsets: CONTENT_X=56, CONTENT_Y=108, CONTENT_W=1168, CONTENT_H=637 (verified against the worker's edit_browser_frame/main.py).
Coord transform (recording px → framed px):
cx_framed = 56 + (bbox.x + bbox.w/2) * (1168 / recording_viewport.w)
cy_framed = 108 + (bbox.y + bbox.h/2) * (637 / recording_viewport.h)
Build the zoom list with a per-beat default + bbox override pattern. The legacy rig followed an "every non-intro beat gets a zoom — bbox-derived if available, default-position otherwise" rule. Reproduce that here:
Step 8a — Pre-fill default-position keyframes for every non-intro, long-enough beat.
Constants for the default position:
DEFAULT_CX = 56 + 1168 // 2(screen center of the framed canvas)DEFAULT_CY = 108 + 637 // 3(upper-third of the content area, where most GitHub UI prominence lives)
Walk the beat sheet from index INTRO_BEATS (= 2) to the end. For each beat:
- If
t_end - t_start < MIN_BEAT_DUR(1.5s), skip — too short for a meaningful zoom. - Compute the keyframe's interior interval as
[t_start + HOLD_GAP, t_end - HOLD_GAP]. If that interval is shorter than 1.0s, skip. - Otherwise pre-fill that beat's slot in a per-beat map (call it
zoom_keyframes_by_beat[beat_idx]) with{cx: DEFAULT_CX, cy: DEFAULT_CY, scale: FALLBACK_SCALE (1.25), ramp_s: FALLBACK_RAMP (0.4)}plus the trimmedt_start/t_end.
Step 8b — Override with bbox-derived precise zoom where action_bboxes provided a usable measurement.
For each entry in action_bboxes:
beat_idx = entry.idx(since Step 6 emits one timed_action per beat). Ifbeat_idx < INTRO_BEATS, skip.- If
entry.foundis false, skip. - If the beat isn't already in
zoom_keyframes_by_beat(was filtered out in Step 8a byMIN_BEAT_DUR/1.0srules), skip. - Filter degenerate bboxes: skip if
bbox.y > recording_viewport.h(offscreen capture — page didn't scroll the element into view in time) orbbox.h > recording_viewport.h * 1.5(full-page<main>element — yields a meaningless zoom center). - Compute
cx_framed/cy_framedfrom the bbox center using the recording-px → framed-px transform shown above. Override the beat's slot with{cx: cx_framed, cy: cy_framed, scale: SCALE (1.35), ramp_s: min(0.5, (t_end - t_start) * 0.15)}.
Final list: sort the values of zoom_keyframes_by_beat by t_start to produce the zoom_keyframes array.
This guarantees every non-intro, long-enough beat gets a zoom — precise when bbox capture worked, default-positioned otherwise. Avoids the "flat video for the whole runtime" failure mode.
If len(zoom_keyframes) > 0, call mcp__pika__edit_animate_zoom with video_url: framed_url, zoom_keyframes. Returns zoomed_url. Otherwise (no qualifying beats — should be rare given Step 3's 65-80s constraint) skip and use framed_url as zoomed_url.
Step 9 — Lipsync the full audio
mcp__pika__generate_lipsync:
provider: <resolved_lipsync_provider>— default:pika(parrot a2v). Honor--lipsync-provider klingif explicitly passed.image: <avatar>audio: <Step 4 audio_url>- kling-only knobs (since
pika-mcp-serverBACK-339, 2026-05-10): whenprovider == "kling", addmode: "pro"andprompt: "talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter"for the polished-presenter feel. Both are silently ignored onpika(parrot has its own driver).
Provider tradeoffs:
| Provider | Wall-clock | Head motion | When to use |
|---|---|---|---|
pika (default) | ~2–5 min | Slightly more dramatic, naturalistic | Default for most runs — fast iteration, watchable output, ~10× faster than kling |
kling (opt-in) | ~5–30 min | Minimal, face-centered, presenter-style | High-stakes renders where the avatar must read like a polished presenter; tolerate the long pole |
Server-side-await covers the call inline; if the response shape is {task_id, status: "queued"}, poll mcp__pika__task_status in a tight loop (no sleep) until the status reaches a terminal state (completed, failed, or cancelled). On completed, capture lipsync_url. On failed / cancelled, fall back to the other provider (kling ↔ pika) per the failover note below.
Failover:
- If
pikafails (rare — parrot a2v is robust at typical explainer audio lengths) → retry once withprovider: "kling". - If
klingstalls past the worker's 1200s ceiling (visible as repeatedprocessingstatus with no completion) → fall back toprovider: "pika". Step 4.5's audio-length gate should catch the long-audio case before it gets here, but the failover handles the residual risk.
Why pika is the default:
- Speed — typical explainer wall-clock drops from ~10–15 min to ~5–7 min total because lipsync is the long pole.
- Quality is good enough — parrot a2v is naturalistic; the slight extra head motion reads as engaging rather than distracting in a 60-80s clip with avatar circle PiP.
- Kling-mode-pro polish is mostly invisible inside the 246-pixel circle anyway — face area is too small for the minimal-head-motion difference to register on most viewers.
For the canonical "polished presenter" feel of the original tarball reference output, pass --lipsync-provider kling explicitly.
Step 10 — PiP composite
mcp__pika__edit_pip:
main_video_url: <zoomed_url>overlay_video_url: <lipsync_url>shape: "circle"size_px: 246← pixel-pinned 246px outer diameter (240 inner avatar + 3+3 stroke ring); matches tarball'sCIRCLE_OUT = CIRCLE_SIZE + STROKE * 2stroke_width_px: 3stroke_color: "white"position_px: {x: 20, y: 476}←800 − 246 − 78for dock clearance (matches tarball'sH − CIRCLE_OUT − 78)
Pass size_px, not size; the fields are mutually exclusive. Returns final_url.
Master-duration / audio-source contract (matching tarball github_explainer.py:418-419, 531-533, 578-582): edit_pip uses shortest=1 semantics by default, which means the composite's duration is the shorter of (zoomed screen recording) and (lipsync video). Step 6's duration_s = ceil(audio_duration_seconds) ensures the screen recording length matches the lipsync exactly (Step 4.5 rescaled beats to the audio timeline). The composite duration is set by the lipsync via edit_pip's shortest=1 semantics. Audio comes from the lipsync video's audio track (the lipsync embeds the original TTS audio); the standalone audio_url is not re-mixed. If the lipsync video is shorter than the screen recording (Kling sometimes trims trailing silence), the screen will get cut off at the lipsync end — accept this; the alternative (looping the screen) is worse for explainer content.
Step 11 — Burn captions
Call mcp__pika__add_captions(video_url=<final_url>, style="classic"). classic renders a bottom subtitle bar — the right register for an explainer video (use tiktok / hormozi / karaoke only when the user explicitly asks for word-level highlight). The audio is extracted server-side from the PiP composite's lipsync track, so transcription matches the narration verbatim. Capture the result as captioned_url.
Skip this step only if the user passed --no-captions (parsed in Step 1) — the default is captions on. (Note: /pika:podcast does not burn captions — narration in an explainer is more transcription-friendly than fast two-host dialogue.)
Step 12 — Return
Emit captioned_url (or final_url if Step 11 was skipped) on one line: Done: <url>.
Load-bearing phrases
These anchors preserve the visual contract across page types:
| Phrase | Where | Why load-bearing |
|---|---|---|
vanilla CSS that resolves via document.querySelector | Selector contract | Keeps scroll, bbox capture, and zoom targeting aligned inside capture_website. |
GitHub URLs activate repo-aware mode | Mode detection | Prevents generic product-page beats from replacing README/code walkthrough beats. |
8-10 beats, 65-80 seconds, 165-200 words | Beat-sheet authoring | Keeps narration, screen recording, lipsync, and captions within the reliable duration envelope. |
all beats[] mutations happen here | Audio rescale step | Ensures later capture/zoom/composite steps consume one stable timeline. |
extra_css cookie-banner hiding payload | Generic URL pre-flight | Reduces first-frame banner occlusion when a banner click misses. |
Engine choice: Pika lipsync default, Kling opt-in
Default to Pika/parrot lipsync because it is faster and keeps most explainers in a short iteration loop. Use Kling only when the user explicitly requests --lipsync-provider kling or when a high-stakes render needs a more centered presenter look and can tolerate a much longer long-pole stage. Screen capture, browser frame, zoom, PiP, and captions remain deterministic edit/composite steps around that lipsync choice.
Runtime expectations
Typical wall-clock is 5-10 minutes with Pika lipsync, or 10-30+ minutes with Kling lipsync:
| Step | Wall clock | Notes |
|---|---|---|
| URL read + pre-flight | 10-60s | GitHub README scan or generic URL DOM/cookie checks |
| TTS + audio rescale | 30-90s | Beat timing is normalized after actual audio length |
| Screen recording | 60-180s | Depends on page load and navigation count |
| Browser frame + zooms | 1-3 min | Deterministic edit/composite stages |
| Lipsync | 2-5 min Pika / 5-30 min Kling | Kling is opt-in because it is the long pole |
| PiP + captions | 1-3 min | Captions skipped when --no-captions is set |
Known gaps (carried as follow-up server-side work)
- ~~Kling avatar
mode:"pro"andpromptnot exposed.~~ Resolved bypika-mcp-serverBACK-339 (PR #186, shipped 2026-05-10):generate_lipsyncnow wires bothmode(e.g."pro") andpromptend-to-end for the kling provider. To enable polished-presenter mode here, pass--lipsync-provider klingand the Step 9 call should addmode: "pro"plus a prompt like"talking head, face centered, mouth syncs to audio, minimal head movement, professional presenter". Real quality lever for reducing dramatic head motion in the lipsync — no longer a server-side gap. - No caller-controlled white-frame trim on the screen recording.
capture_websitehas internal trim heuristics but doesn't expose them to the caller. Visible as a brief white flash at the start of the explainer when the page is still loading. The 800mswaitaction atat_s: 0.0mitigates this somewhat by giving the page time to paint, but doesn't trim already-recorded white frames. Worker enhancement. - No
networkidlewait on per-beat navigation. Tarball usespage.goto(url, wait_until="networkidle", timeout=20000)pluswait_for_timeout(600)after every navigate.capture_websitesettles todomcontentloadedplus the bbox-capture branch's 600 ms post-action settle (server-side, whenbbox_selectoris set), but SPA blob pages whose final render happens afterdomcontentloadedcan still get bbox'd against unmounted code blocks. Worker enhancement: expose await_untilknob ontimed_actions[].navigate. - No per-step output-size verification gates. Tarball's
verify()helper atgithub_explainer.py:35-39checked TTS ≥ 50KB, preview ≥ 100KB, screen ≥ 200KB, lipsync ≥ 500KB, final ≥ 1MB after each step. The MCP path returns URLs only; verifying file size would require an extramcp__pika__analyze_mediacall per step (~30s overhead each). Worth adding once user-side latency budget allows it. For now, a downstream-failure cascade (e.g. zero-byte TTS → silent lipsync → blank composite) only surfaces at Step 11. text_contentbbox capture not implemented.capture_websitev1 returnsaction_bboxesonly for steps with a CSSselector.text_content-only steps produce no entry. Prefer CSS selectors inzoom_targetfor guaranteed zoom coverage.- Beat-sheet wording is non-deterministic. Running the same input twice produces different vo_text and different zoom positions. Visual kind is the contract, not pixel-exact reproduction.
- Generic-URL mode quality varies by site. Modern indie / SaaS landing pages with semantic markup (
<h1>+ clear<section>+ named class hooks) work well. Big-name corporate sites (apple.com, microsoft.com, amazon.com) hit several known limits: (a) bot detection — the page may serve a degraded version under headless Chrome, or a captcha; Step 2.6 §A aborts on these but the heuristics aren't exhaustive; (b) obfuscated class names —tile-headlineinstead ofhero-titledefeats generic selectors; Step 2.6 §C's WebFetch DOM scan helps but isn't perfect; (c) scroll-triggered animations don't play — IntersectionObserver-driven hero reveals fire on real user scrolls, not Playwright'sscrollIntoView; the recorded frame may be a static placeholder; (d) lazy-loaded images — picture/source elements withloading="lazy"may not have resolved by the 600ms-or-2500ms settle window; the bbox lands on a transparent placeholder. Workarounds: prefer simpler / smaller marketing pages for launch demos, always pass--focus "the X feature"to anchor beat selection, accept that big-name sites need a follow-up server PR (cookie-banner click retry +wait_until=networkidle+ animation-trigger viaIntersectionObserverpolyfill). - Cookie-banner click is single-attempt. Step 2.6 §B emits one
clickagainst the dismissal selector extracted from the WebFetch DOM. If the WebFetch's HTML doesn't include the banner (rendered post-JS) or the selector is wrong, the click silently misses — theextra_csspayload is the load-bearing defense. Worker enhancement: support a list of fallback selectors perclickaction so the worker tries each in order. - Step 8b ↔ Step 6
idx-mapping mismatch in Generic-URL mode. Step 6 mapsbeat_idx = entry.idx - prepend_count, but Step 8b usesbeat_idx = entry.idxnaively. Pre-existing bug independent of rescaling.
Auth
If any call returns 401: the user's OAuth token has expired or hasn't been issued. The next authenticated MCP call triggers OAuth automatically (browser opens for @pika.art Google login). For non-interactive environments, set MCP_AUTH_TOKEN.
Examples
GitHub-mode (repo-aware: README scan + live-demo detection):
/pika:explainer https://github.com/leigest519/OpenGame/pika:explainer https://github.com/anthropics/claude-cookbooks --focus "Claude Code MCP integration"/pika:explainer https://github.com/openai/whisper --preview(opt-in to the preview gate when testing a new avatar)
Generic-URL mode (any non-GitHub URL — drives through the page directly):
/pika:explainer https://pika.art/pika:explainer https://vercel.com --focus "the deployment workflow"/pika:explainer https://docs.anthropic.com/en/docs/claude-code/plugins/pika:explainer https://your-product-page.com --avatar https://cdn.example.com/me.png --preview

