plugin-computer-use
A persistent stdio MCP server that exposes the Anthropic computer-use action surface (screenshot, click, move, keyboard, clipboard, batch) against the SideButton agent desktop on DISPLAY=:10.
This repo is the scaffold + dispatch core for the Computer Use epic (SCRUM-1399). It is delivered by SCRUM-1397:
- the long-lived stdio MCP server loop (
initialize/tools/list/tools/call), - the ported
computer.pydispatch base (DISPLAY targeting, screenshot →
base64 PNG, coordinate scaling, single-owner lock, xdotool runner),
- the full tool surface declared so
tools/listreturns it, screenshotwired end-to-end as the proof action.
The individual tool bodies land in sibling tickets (SCRUM-1400…1405) and hosting this as a runtime: "service" plugin is SCRUM-1406.
Why a persistent server
The current SideButton plugin model (the-assistant packages/server/src/plugins) spawns a fresh, stateless handler process per tools/call and SIGKILLs it at a 30s timeout. That cannot host the computer-use surface, which needs cross-call state: a held mouse button (left_mouse_down … left_mouse_up), the screenshot→coordinate session, session grants, and holds up to ~100s. So this is a single, long-lived child process that speaks MCP over stdio.
Tool surface
24 tools, grouped by the sibling ticket that owns each body. The capture group (screenshot, zoom, SCRUM-1400), the click group (left_click / right_click / middle_click / double_click / triple_click, SCRUM-1401), the keyboard group (type / key / hold_key, SCRUM-1403), and the clipboard + session group (SCRUM-1404) are implemented; the rest are declared and return a clear pending-owner error until their ticket lands. Full input schemas: docs/computer-use-mcp-tools-schema.md.
| Group | Ticket | Tools | | --- | --- | --- | | capture | SCRUM-1400 | screenshot ✅, zoom ✅ | | click | SCRUM-1401 | left_click ✅, right_click ✅, middle_click ✅, double_click ✅, triple_click ✅ | | move / drag / scroll | SCRUM-1402 | mouse_move, left_click_drag, scroll, left_mouse_down, left_mouse_up | | keyboard | SCRUM-1403 | type ✅, key ✅, hold_key ✅ | | clipboard + session | SCRUM-1404 | read_clipboard ✅, write_clipboard ✅, request_access ✅, list_granted_applications ✅, open_application ✅, switch_display ✅ | | utility / batch | SCRUM-1405 | computer_batch, wait, cursor_position |
Clipboard + session behaviour (SCRUM-1404)
The macOS session/permission model has no XFCE/Xvfb equivalent, so these degrade gracefully instead of erroring — keeping cross-runner (macOS-authored) skills working — while honouring the native grant flags so call shapes match:
request_accessauto-grants the requestedapps(no compositor dialog),
records the clipboardRead / clipboardWrite / systemKeyCombos flags (additive across calls), and returns screenshotFiltering: false.
list_granted_applicationsechoes the allowlist + active grant flags.read_clipboard/write_clipboardshell out to
xclip -selection clipboard, gated on the clipboardRead / clipboardWrite grants (a call without the grant returns an isError result, matching native).
open_applicationis best-effort window focus (wmctrl -a, then
xdotool search --name … windowactivate); the primary target is the single RDP window. With neither binary installed it returns a non-error no-op note.
switch_displayis a no-op on the single Xvfb:10and reports the
current display (accepts "auto").
Surface count. This is the 24-tool surface the epic (SCRUM-1399) specifies. The clipboard + session group follows the explicit enumeration in SCRUM-1404 (
read_clipboard/write_clipboardsplit +list_granted_applications), which is the 2-tool delta over the work plan's interim count of 22.src/tools.pyis the single source of truth;docs/computer-use-mcp-tools-schema.md(AC4) is generated from it.
Bare names + collisions. Names are the canonical Anthropic action ids.
screenshot,type,scroll,wait,clickcollide with core SideButton MCP tools, and the current loader drops the entire plugin on any collision. That is fine standalone (this server owns its namespace); namespacing on aggregation is deferred to SCRUM-1406 (recommended: bare names in the child, prefix/slug-namespace on the host).
Layout
plugin-computer-use/
├── plugin.json # generated service-plugin manifest (proposes runtime:"service")
├── src/
│ ├── server.py # stdio MCP loop: initialize / tools/list / tools/call
│ ├── computer.py # dispatch base (ported computer.py)
│ └── tools.py # canonical tool surface (single source of truth)
├── scripts/
│ └── build_manifest.py # regenerates plugin.json + the schema doc from tools.py
├── tests/ # unittest: dispatch-base unit + stdio round-trip + manifest
├── docs/
│ └── computer-use-mcp-tools-schema.md # generated; the AC4 schema doc
├── run_tests.sh # runs the suite (xvfb-wrapped when no DISPLAY)
├── pyproject.toml # dependency-free, python>=3.10
├── README.md LICENSE .gitignore
Run it standalone
# speak MCP by hand (newline-delimited JSON-RPC):
printf '%s\n' \
'{"jsonrpc":"2.0","id":1,"method":"initialize","params":{}}' \
'{"jsonrpc":"2.0","id":2,"method":"tools/list"}' \
'{"jsonrpc":"2.0","id":3,"method":"tools/call","params":{"name":"screenshot","arguments":{"save_to_disk":true}}}' \
'{"jsonrpc":"2.0","id":4,"method":"tools/call","params":{"name":"zoom","arguments":{"region":[600,300,900,500]}}}' \
'{"jsonrpc":"2.0","id":5,"method":"tools/call","params":{"name":"type","arguments":{"text":"hello"}}}' \
'{"jsonrpc":"2.0","id":6,"method":"tools/call","params":{"name":"key","arguments":{"text":"ctrl+a","repeat":1}}}' \
'{"jsonrpc":"2.0","id":7,"method":"tools/call","params":{"name":"hold_key","arguments":{"text":"shift","duration":2}}}' \
'{"jsonrpc":"2.0","id":8,"method":"tools/call","params":{"name":"left_click","arguments":{"coordinate":[600,300],"text":"ctrl"}}}' \
| DISPLAY=:10 python3 src/server.py
initialize returns the handshake, tools/list the 24-tool surface, the screenshot call a base64 PNG image block (plus a Saved to disk: <path> text block when save_to_disk is set), and zoom a magnified PNG of the region. The keyboard and click calls each return a short text ack (isError:false); they need xdotool on PATH. The left_click maps its [600, 300] against the id:3 screenshot's coordinate session — a click before any screenshot returns a clear no screenshot session yet error instead of clicking blind.
Capture & coordinates
screenshot captures DISPLAY=:10 and, when the measured size matches a model resolution, downscales it (on the live 1920×1080 :10 it returns 1366×768). Each capture records a screenshot → coordinate session: the measured device geometry and the returned image geometry. Coordinates the model returns are in image space (relative to the last screenshot); the server maps them back to device pixels via Computer.to_device(x, y) — the foundation the click/move siblings (SCRUM-1401/1402) consume. Both the downscale and the coordinate mapping are derived from the same measured geometry, so they can never use different bases (the wrong-pixel-click failure mode).
zoom takes region: (x0, y0, x1, y1) in image space, maps it to a device rect, and crops it from a fresh full-resolution capture — genuine magnification, not an upscale of the downscaled screenshot. It is read-only: it never moves the click-coordinate origin (clicks still refer to the last screenshot). If no screenshot has been taken yet, zoom establishes the session lazily.
True 1:1 (no downscale) would require pinning :10 / the RDP window to a model-friendly size — that is provisioning (SCRUM-1396), out of scope here.
Click group (SCRUM-1401)
Pointer clicks at a screenshot-session coordinate. The [x, y] coordinate is image space (relative to the last screenshot) and is mapped to device pixels via Computer.to_device — a click before any screenshot returns a clear no screenshot session yet error (look before you click). Optional text modifier(s) ('ctrl', 'shift+alt', …) are held for the click and always released (keyup in a finally, the same guarantee as hold_key).
| Tool | xdotool | Button | | --- | --- | --- | | left_click | mousemove --sync <dx> <dy> click 1 | left (1) | | right_click | mousemove --sync <dx> <dy> click 3 | right (3) | | middle_click | mousemove --sync <dx> <dy> click 2 | middle (2) | | double_click | mousemove --sync <dx> <dy> click --repeat 2 --delay 100 1 | left ×2 | | triple_click | mousemove --sync <dx> <dy> click --repeat 3 --delay 100 1 | left ×3 |
With a modifier the click is wrapped in keydown -- <text> → click → keyup -- <text>. On-screen pixel accuracy is validated live in SCRUM-1408 (xdotool is absent on the current runner image, so the unit tests assert the device-pixel argv).
Keyboard group (SCRUM-1403)
| Tool | xdotool | Notes | | --- | --- | --- | | type | xdotool type --delay 12 -- <text> | types text at the current focus | | key | xdotool key --repeat <repeat> -- <text> | chords, e.g. ctrl+s; optional repeat (default 1) | | hold_key | keydown -- <text> → sleep <duration> → keyup -- <text> | the hold runs in the persistent server (Python time.sleep), so durations up to ~100s do not trip the per-call subprocess timeout; keyup runs in a finally so a held key/modifier is always released |
Test
./run_tests.sh # uses $DISPLAY if set, else wraps in xvfb-run
# or directly:
DISPLAY=:10 python3 -m unittest discover -s tests -v
tests/test_dispatch_base.py— coordinate-scaling math, the screenshot →
coordinate session + to_device mapping, the measured-basis downscale target, zoom region validation + region→device-rect math, xdotool command construction, single-owner lock, screenshot-backend detection, surface shape, plus live screenshot/zoom + save_to_disk (DISPLAY-gated).
tests/test_stdio_roundtrip.py—initialize→tools/list→tools/call
screenshot (incl. save_to_disk path block) and zoom over a spawned server, plus error paths.
tests/test_manifest.py—plugin.json+ schema doc are present and in sync
with src/tools.py.
The screenshot round-trip needs an X display; run_tests.sh provides one via xvfb-run when $DISPLAY is unset, so AC3 still exercises in headless CI.
System dependencies
System packages (apt), not pip — the plugin install copies no node_modules/venv and runs no build step, so the server is stdlib-only and shells out to:
| Tool | Used for | Notes | | --- | --- | --- | | a screenshot backend | screenshot | gnome-screenshot or scrot or ImageMagick (import/convert). The runner ships ImageMagick. | | xdotool | pointer/keyboard actions; open_application fallback | required by the click/move/keyboard groups (siblings); absent on the runner image. | | xclip | read_clipboard / write_clipboard | already on the runner; grant-gated. | | wmctrl | open_application window focus | best-effort; open_application degrades to a no-op when absent. |
scrot and gnome-screenshot are absent on the runner image, so the screenshot backend falls through to ImageMagick import -window root (verified on DISPLAY=:10). When SCRUM-1407 adds this plugin to the agent-runners catalog, declare xdotool, a screenshot backend, and xclip in its system_deps.
DISPLAY and single-owner
- The server targets the inherited
$DISPLAY, defaulting to:10(the
runner desktop). It never hardcodes a display — the screen-record plugin's bug was capturing a non-existent :1.0.
- It takes a process-lifetime single-owner lock (
flock,
/tmp/sidebutton-computer-use.lock, override with CU_LOCK_PATH) so only one session drives the shared pointer/keyboard; a second instance exits non-zero.
Service-manifest contract (SCRUM-1406)
plugin.json targets the merged runtime: "service" tier: the SideButton server keeps the child alive, discovers its tools via tools/list, and forwards tools/call to it.
{
"name": "computer-use",
"runtime": "service",
"service": {
"command": "python3 src/server.py", // non-empty string; the engine splits on
// whitespace and spawns with cwd=plugin dir
"toolNamespace": "computer_use", // tools surface as computer_use_<tool>
"tools": { // per-tool timeout overrides (ms)
"hold_key": { "timeoutMs": 120000 },
"wait": { "timeoutMs": 120000 }
}
},
"tools": [] // service plugins declare no static tools
}
The loader (
the-assistantpackages/server/src/plugins/loader.ts) recognizes onlycommand/timeoutMs/toolNamespace/toolsunderservice, and hard-rejects the manifest unlesscommandis a non-empty string — an array fails validation and the plugin never loads. Tools are discovered live, so the top-leveltoolsarray is normalized to[]. This repo owns onlyplugin.json; the agent-runners catalog entry +system_depsare SCRUM-1407.
Configuration (env)
| Var | Default | Purpose | | --- | --- | --- | | DISPLAY | :10 | target X display | | CU_WIDTH / CU_HEIGHT | 1920 / 1080 | screen size for coordinate scaling | | CU_SCREENSHOT_DELAY | 2.0 | post-action settle before a screenshot | | CU_LOCK_PATH | /tmp/sidebutton-computer-use.lock | single-owner lock file | | CU_SAVE_DIR | /tmp/sidebutton-computer-use/ | where save_to_disk writes shareable PNGs (host-pruned; saved files are not auto-deleted) |
License
MIT © 2026 SideButton






