cerase-media MCP

0 starsMITCommunity

Install to Claude Code

This server doesn't publish a one-line install command. Follow the setup in the source repository.

Summary

Provides multimodal understanding tools including OCR, image description, audio transcription, UI screenshot analysis, and screenshot comparison via async tools on a multimodal endpoint.

README.md

cerase-media MCP

First-party multimodal understanding (M-MEDIA-1 = the merge of the former cerase-ocr + cerase-transcriber): five async tools over the multimodal tool-model alias through cerase-litellm, billed per-agent. The last two (analyze_ui, compare_screenshots) are the UX/UI screenshot pair added by M-CERASE-MEDIA-UX — same multimodal endpoint, specialised prompts, no extra dependency.

| Tool | Question it answers | Returns | |---|---|---| | ocr | what is WRITTEN in this image? | {text, model} | | describe_image | what does this image SHOW? | {description, model} | | transcribe | what does this audio say? | {text, model} | | analyze_ui | what's in this UI screenshot? — structured audit of layout, typography, colours, interactive elements, text, visual errors, accessibility, consistency | {analysis, model} | | compare_screenshots | what changed between two screenshots? — before/after visual diff (layout / text / style / new / removed / regressions) | {diff, model} |

Image input is accepted three ways (pick one): path (a file under CERASE_TOOL_WORKSPACE_ROOT), image_url, or image_base64. compare_screenshots takes the two-image variants (path1/image1_url/ image1_base64 and path2/…).

Async by design: the tools are ~100% LLM-wait, so concurrent requests run on parallel I/O lanes inside the single runner container (no per-modality queue). ffmpeg (audio normalisation) runs as an async subprocess.

Env: LITELLM_BASE_URL, LITELLM_MASTER_KEY (scoped service key), CERASE_MULTIMODAL_ALIAS (default multimodal), CERASE_TOOL_WORKSPACE_ROOT (path-traversal guard root).