remote-gpu-train

Other · Claude Code · by maedmatt

Summary

Run a long training job on a remote shared-GPU box so it survives disconnects, and keep your agent in the loop while it runs.

Install to Claude Code

/plugin install remote-gpu-train@remote-gpu-train

Run in Claude Code. Add the marketplace first with /plugin marketplace add maedmatt/remote-gpu-train if you haven't already.

An agent skill for running long jobs on a remote, shared GPU machine over SSH: the job survives your connection dropping, and the agent can follow it the whole time.

Works with any harness that follows the Agent Skills standard (Claude Code, Codex, and others). It's plain bash plus a short guide that tells an agent how to set it up for your machine.

The problem

You want an agent to train a model (or run any long job) on a remote GPU machine.

ssh host "python train.py" has three traps:

  • It stops the moment the command returns, or when the VPN drops. Hours of compute, gone.
  • The agent can't see what's happening: it waits forever on one command, or returns and forgets the job is running.
  • When it fails there's often nothing to search for: a bad flag can kill the program before it prints anything.

How it works

Start the job in a background tmux session, save its output to a log on the remote, and watch it with a loop that checks whether the session is alive and reads the log for progress. A run that takes hours becomes a few short status lines (RUNNING, the dashboard link, RUN ENDED) the agent reacts to. The job is over when the session disappears, not when an error shows up in the log: a crash, an out-of-memory kill, or a bad flag can end it without printing anything to search for, and watching the session catches those too.
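The launch-and-watch cycle described above can be sketched in a few bash functions. This is a minimal illustration, not the real train.sh: `HOST`, `remote`, `start_run`, `session_alive`, `last_log_line`, and `watch_run` are hypothetical names, and the sketch assumes tmux is installed on the remote and that the tag, log path, and command contain no single quotes.

```shell
#!/usr/bin/env bash
# Sketch only: HOST and every function name here are hypothetical, not the real train.sh.
HOST=${HOST:-gpu-box}   # assumed ssh alias for the remote machine

# Run one command string on the remote.
remote() { ssh "$HOST" "$@"; }

# Start the job in a detached tmux session with all output saved to a log.
# Assumes tag, log path, and command contain no single quotes.
start_run() {
  local tag=$1 log=$2
  shift 2
  remote "tmux new-session -d -s '$tag' '$* > $log 2>&1'"
}

# True while the session still exists on the remote.
session_alive() { remote "tmux has-session -t '$1'" >/dev/null 2>&1; }

# Most recent line of the remote log, used only to show progress.
last_log_line() { remote "tail -n 1 '$1'" 2>/dev/null; }

# Poll until the session disappears; that, not a log match, marks the end.
watch_run() {
  local tag=$1 log=$2 interval=${3:-60}
  while session_alive "$tag"; do
    echo "RUNNING $(last_log_line "$log")"
    sleep "$interval"
  done
  echo "RUN ENDED"
}
```

Something like `start_run smoke logs/smoke.log python train.py` followed by `watch_run smoke logs/smoke.log` would then turn an hours-long run into a handful of status lines. One gap in this sketch: a failed ssh connection (exit status 255) looks the same as a missing session, so a sturdier watcher would probably retry on 255 before reporting that the run ended.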

The rules it follows

The few things that must stay true. Everything else is meant to be changed to fit your setup; a change that breaks one of these has broken the skill.

1. The job survives a dropped connection. Always a background session with output saved to a log. Never run the real job in the foreground of an ssh command.
2. A dead session means it ended; the log means it's progressing. Detect the end by the session disappearing, not by searching for an error. Read the log only to confirm it's moving.
3. The log says what ran. Record the git branch, commit, and uncommitted changes: the remote runs whatever is checked out, not a fixed version.
4. Stay out of others' way on a shared machine. Refuse a GPU that's in use, don't pick one automatically, never touch a job you didn't start.
5. The script stays plain bash. No agent-tool calls inside it. Only how the watcher reports to the model changes from one harness to another.
6. Set it up, don't rewrite it. Adapt through the settings at the top of the script, the launch command, and one search pattern, not by editing the logic.
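Rule 3 is the easiest one to picture in code. As a rough sketch, assuming the function runs inside the remote checkout before the job starts, a log header could be produced like this. `log_header` is a hypothetical name and the layout of the header is an assumption, not what train.sh actually writes.

```shell
#!/usr/bin/env bash
# Sketch only: log_header is a hypothetical helper, not part of the real train.sh.

# Print what is actually checked out: branch, commit, and uncommitted changes.
log_header() {
  echo "branch: $(git rev-parse --abbrev-ref HEAD)"
  echo "commit: $(git rev-parse HEAD)"
  echo "uncommitted:"
  git status --porcelain   # one line per modified or untracked file
}
```

Writing this to the log first (for example `log_header > "$log"`, with the job appending after it) means every log answers "which code produced this?" on its own.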

What's in here

remote-gpu-train/
├── README.md                          # this file: the idea and the rules
├── LICENSE
├── .claude-plugin/marketplace.json    # lets Claude Code install it as a plugin
└── skills/remote-gpu-train/
    ├── SKILL.md                       # how to use it day to day
    ├── train.sh                       # the script (gpus / run / watch / cp / ssh)
    └── ADAPTING.md                    # how an agent sets it up for your machine

Install

With npx skills (vercel-labs/skills), which can install into several harnesses:

npx skills add maedmatt/remote-gpu-train -a claude-code      # or -a codex, etc.
npx skills add maedmatt/remote-gpu-train -g -a claude-code   # -g installs it everywhere (globally)

As a Claude Code plugin:

/plugin marketplace add maedmatt/remote-gpu-train
/plugin install remote-gpu-train@remote-gpu-train

By hand: copy skills/remote-gpu-train/ into ~/.claude/skills/ (or your agent's skills folder).

Use it

On first use in a project the skill sets itself up: an agent reads ADAPTING.md, asks what it needs (ssh name, repo path, environment setup, launch command, the log line that means "running"), fills in train.sh, wires watch to your harness, and verifies it. Then:

train.sh gpus                       show GPUs, mark which are FREE
train.sh run <gpu> <tag> <args...>  start the job on <gpu> in a background session
train.sh watch <tag>                follow it until the run ends
train.sh cp <local> <remote>        copy a file into the repo
train.sh ssh [cmd...]               git pull/status, tail a log, list/kill sessions, anything else
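For a sense of what `gpus` and the "refuse a GPU that's in use" rule might look like, here is a hedged sketch built on `nvidia-smi --query-gpu`. The definition of FREE (memory in use below a threshold) is an assumption, as are the names `FREE_MIB`, `gpu_rows`, `mark_gpus`, and `require_free`; the real script may decide differently.

```shell
#!/usr/bin/env bash
# Sketch only: "free" here means memory.used is under a threshold, which is an
# assumption; the real train.sh may use a different test.
FREE_MIB=${FREE_MIB:-500}   # hypothetical threshold in MiB

# Raw per-GPU rows such as "0, 12" (index, MiB used). Meant to run on the remote.
gpu_rows() {
  nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits
}

# Read rows on stdin and label each GPU FREE or BUSY.
mark_gpus() {
  local idx used
  while IFS=', ' read -r idx used; do
    if [ "$used" -lt "$FREE_MIB" ]; then
      echo "GPU $idx FREE (${used} MiB used)"
    else
      echo "GPU $idx BUSY (${used} MiB used)"
    fi
  done
}

# Succeed only if the GPU the user named is FREE; never fall back to another one.
require_free() {
  local gpu=$1
  if gpu_rows | mark_gpus | grep -q "^GPU $gpu FREE "; then
    return 0
  fi
  echo "GPU $gpu is in use or not found; refusing to start" >&2
  return 1
}
```

The check deliberately fails instead of choosing a different GPU: on a shared machine the person launching the job picks the device, and the script only confirms the choice is safe.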

Beyond GPUs

The core (a background job that keeps running, plus a watcher that reads the log and notices when the session dies) fits any long remote job: a data pipeline, a slow build, a simulation, a batch of evals. Drop gpus and point the launch command at whatever you run.

License

MIT. See LICENSE.
