Video Analysis

Analyze video files using either native model understanding or frame extraction + transcription.

How It Works

analyze_video(path, question)
      │
      ├─ file_size ≤ threshold (default 20MB)
      │     → Send video to a supports_video model (default Gemini 3.1 Flash Lite)
      │     → Model sees full video natively (best quality)
      │
      └─ file_size > threshold
            → ffmpeg extracts keyframes (scene detection for long videos)
            → Whisper transcribes audio track
            → Returns frame image paths + transcript text
            → Agent feeds these to the current chat model

Quick Start

⚠️ Invocation — do NOT use dotted imports. The directory name contains a hyphen (video-analysis), so from skills.video-analysis.exports import ... is a Python syntax error (- is parsed as minus). This is true for every hyphenated skill, not just this one. Use one of the two patterns below.

Pattern A — from workspace root (recommended for scripts):

cd /data/workspace/skills/video-analysis && \
  python3 -c "from exports import analyze_video; \
    import json; \
    print(json.dumps(analyze_video('output/videos/clip.mp4', \
      question='What happens in this video?'), ensure_ascii=False))"

Note: pass the video path workspace-relative (analyze.py resolves it against WORKSPACE_DIR), even though you cd into the skill dir.

Pattern B — inside a starchild-clawd script:

from core.skill_tools import video_analysis
result = video_analysis.analyze_video("output/videos/clip.mp4",
                                      question="What happens in this video?")

❌ Do NOT exec(open('skills/video-analysis/analyze.py').read()) — analyze.py uses __file__ at import time, which is undefined under exec, so it crashes. Load it by file path with importlib.util.spec_from_file_location if you must avoid both patterns above.

# result keys (same for both patterns):
# Analyze a video — auto-selects native or extraction mode
# result = analyze_video("output/videos/clip.mp4", question="What happens in this video?")

# result keys:
#   success: bool
#   mode: "native" | "extraction"
#
# If mode == "native":
#   analysis: str (model's text response)
#   model: str (which model was used)
#   tokens: {input, output, video, audio}
#
# If mode == "extraction":
#   frame_paths: list[str] (workspace-relative paths to keyframe JPEGs)
#   transcript: str | None (Whisper transcription text)
#   frame_count: int
#   duration_sec: float

Using the Exports

from core.skill_tools import video_analysis

# Full analysis (auto-selects mode)
result = video_analysis.analyze_video("output/videos/my_video.mp4", question="Describe this video")

# Check current config
config = video_analysis.get_config()

# Get video metadata without analyzing
info = video_analysis.get_video_info("output/videos/my_video.mp4")
# → {"duration": 45.2, "size": 12345678, "width": 1920, "height": 1080, "has_audio": true}

Native Mode (small videos)

For videos under the size threshold, the skill sends the full video to a model that supports native video input. The model sees every frame and hears the audio.

Default model: google/gemini-3.1-flash-lite — best price/quality for video.

Model benchmark (6MB clip, vs gemini-3.1-pro-preview baseline):

Model	Tier	Cost	Time	Accuracy	Notes
google/gemini-3.1-flash-lite	budget	~$0.0014	8.1s	~88%	⭐ Default — cheapest + fastest
google/gemini-3.5-flash	std	~$0.0152	11.8s	~85%	More detail, higher cost
qwen/qwen3.6-plus	budget	~$0.0058	44.2s	~95%	Accurate but slow
qwen/qwen3.6-flash	budget	~$0.0027	16.6s	~80%	Misreads subjects sometimes
google/gemini-3.1-pro-preview	std	~$0.0199	19.7s	100%	Baseline (best, most expensive)

flash-lite identifies the full scene, action sequence, and transitions correctly at ~14x lower cost than the Pro baseline. For maximum accuracy (exact character names, fine detail), switch default_model to gemini-3.1-pro-preview or gemini-3.5-flash in config/video-analysis.yaml.

Extraction Mode (large videos)

For videos over the size threshold, the skill extracts keyframes and transcribes audio:

Short videos (≤60s): One frame every N seconds (default: 2s)
Long videos (>60s): Scene-change detection picks visually distinct frames
Audio: Extracted and sent to Whisper for transcription
Max frames: Capped at 30 (configurable) to control cost

The agent receives frame image paths and transcript text, then feeds them to the current chat model as image attachments + context text.

Configuration

Edit config/video-analysis.yaml (in the workspace) to customize. This file is created automatically on first use, only needs the keys you want to override, and survives skill updates.

Do NOT edit skills/video-analysis/config.yaml — that's the factory default and is overwritten on every skill auto-update. The user file overlays it.

Both the standalone skill and the chat "send a video" flow read this same config, so one edit changes the model everywhere. Available keys:

# Model for native video understanding
default_model: google/gemini-3.1-flash-lite

# Size threshold: native (≤) vs extraction (>)
# Set to 0 → always extraction. Set to 100 → always native.
native_size_limit_mb: 20

# Frame extraction settings
extraction:
  max_frames: 30                  # Max keyframes to extract
  short_video_interval_sec: 2     # Frame interval for ≤60s videos
  scene_threshold: 0.3            # Scene detection sensitivity (0.0-1.0)
  transcribe_audio: true          # Whether to Whisper-transcribe audio

Available Video Models

Model	Alias	Tier	Notes
google/gemini-3.1-flash-lite	flash31	budget	⭐ Default, best price/quality
google/gemini-3.5-flash	gemini35	standard	More detail, higher cost
google/gemini-3.1-flash-lite	flash31	budget	Cheapest option
google/gemini-3.1-pro-preview	gemini	standard	Highest quality
qwen/qwen3.6-flash	qwenf	budget	Good alternative
qwen/qwen3.6-plus	qwen	budget	—
minimax/minimax-m3	mm3	standard	—
meta-llama/llama-4-maverick	maverick	standard	—
meta-llama/llama-4-scout	scout	budget	—
xiaomi/mimo-v2.5	mimo	standard	—
z-ai/glm-5v-turbo	glm5v	standard	—
minimax/minimax-m2.7	mm27	budget	Audio-only, no image

Agent Behavior

When the user provides a video file (via upload or file path) and the current chat model does NOT support video:

Call analyze_video(path, question).
If result mode is "native" → return result["analysis"] directly.
If result mode is "extraction" → use result["frame_paths"] as image

references and result["transcript"] as context, then ask the current model to analyze based on the frames + transcript.

When the current model DOES support video, the backend handles it natively via Phase 1 (base64 content block injection) — no need for this skill.

Troubleshooting

Problem	Fix
"File not found"	Check path is workspace-relative (e.g. `output/videos/x.mp4`)
Native mode returns error	Check `default_model` in config/video-analysis.yaml is valid
No audio transcription	Video may have no audio track; check `has_audio` in result
Too few frames extracted	Lower `scene_threshold` in config/video-analysis.yaml (e.g. 0.15)
Too many frames / high cost	Reduce `max_frames` or raise `scene_threshold`

video-analysis