SLURM
Remote GPU compute platform for clusters managed by SLURM. Jobs are submitted from the TAO service or SDK host to a login node over SSH, staged on a shared filesystem, submitted with sbatch, and executed via srun with Pyxis/Enroot container support.
When to use
Use SLURM when the user has access to a managed GPU cluster, shared Lustre storage, and scheduler-owned GPU allocation. Do not use SLURM for local files that exist only on the agent machine; data and outputs must be reachable from the cluster.
Preflight + SSH
Confirm SLURM_USER and SLURM_HOSTNAME are exported and passwordless SSH to a login host works (ssh -o BatchMode=yes). Optionally install the TAO SDK wrapper for Job handles + S3 wrapping (nvidia-tao-sdk[slurm], on public PyPI). For private nvcr.io images, install ~/.config/enroot/.credentials on the cluster once per (cluster, user): Pyxis/Enroot does not read NGC_KEY from the job env, and without persistent credentials, auth-gated pulls fail with "Could not process JSON input" at job startup. Install it via the printf | ssh heredoc so the NGC_KEY value never lands in shell history, intermediate files, or chat output; never cat/echo the value.
If a preflight check fails, the agent prompts the user to authorize the install/fix via Bash. Pip-installable Python requirements are the exception: install them automatically, then rerun preflight.
See references/slurm-ssh-credentials.md for the full preflight script, the enroot-credentials heredoc, prerequisite key setup (keypair, ssh-copy-id, known_hosts, container key mounts, 2FA handling), and the SSH failure remediation prompt.
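The environment and SSH checks above can be sketched in Python as follows. This is a minimal illustration, not the preflight script from the reference file: the helper names missing_env and ssh_probe_command are hypothetical, and the ConnectTimeout value is an assumption. The probe only builds the command, so the caller decides when to run it.

```python
import os

REQUIRED_ENV = ("SLURM_USER", "SLURM_HOSTNAME")


def missing_env(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV if not env.get(name)]


def ssh_probe_command(user, host, timeout_s=10):
    """Build a non-interactive SSH probe (hypothetical helper).

    BatchMode=yes makes ssh fail instead of prompting for a password,
    which is what the passwordless check relies on.
    """
    return [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", f"ConnectTimeout={timeout_s}",
        f"{user}@{host}",
        "true",
    ]
```

A caller could pass the returned list to subprocess.run and treat a non-zero exit code as a failed preflight check.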
Storage
Use shared-filesystem URIs, not local or file:// paths; tao-core rejects local/file paths for remote backends.
- lustre:///absolute/path for user-provided datasets on Lustre.
- slurm:// paths may appear in microservices metadata and are converted to Lustre paths before the container starts.
Accept either dataset roots (model skills map them to required files) or direct spec-key paths. After SSH succeeds and before generating scripts, test -e each required dataset path from the login host; if it fails, stop and ask for corrected paths or staged data rather than producing scripts that fail in the first training job. See references/slurm-ssh-credentials.md for root vs. direct-spec modes, backend details, and the results-dir default.
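A minimal sketch of the URI rule and the existence check, assuming a lustre:/// URI maps directly to the absolute path that follows the scheme. The function names are hypothetical, and slurm:// conversion is left to the backend because its mapping is not described here.

```python
import shlex


def to_cluster_path(uri):
    """Map a lustre:/// URI to the absolute path seen on the cluster.

    Local and file:// paths are rejected, mirroring the storage rule.
    """
    prefix = "lustre://"
    if not uri.startswith(prefix):
        raise ValueError(f"expected a lustre:/// URI, got {uri!r}")
    path = uri[len(prefix):]
    if not path.startswith("/"):
        raise ValueError(f"lustre URI must carry an absolute path: {uri!r}")
    return path


def exists_check_command(path):
    """Build the remote 'test -e' command for one required dataset path."""
    return f"test -e {shlex.quote(path)}"
```

The command string is meant to be run on the login host over SSH; a non-zero exit status is the signal to stop and ask for corrected paths.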
Container execution
tao-core runs TAO containers through Pyxis/Enroot:
- Stage compact JSON files for specs, environment, and cloud metadata under
<job_dir>/specs, <job_dir>/env, and <job_dir>/meta.
- Optionally convert the Docker image to a cached SQSH image with
srun -n1 -p <conversion_partition> enroot import.
- Write an sbatch script under
<job_dir>/sbatch/job_<job_id>.sbatch.
- Submit it with sbatch --export=ALL <script>.
- Run the container with
srun --container-image=<image> --container-mounts=/lustre.
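To make the shape of the generated script concrete, here is a hedged sketch of an sbatch renderer. It is not the tao-core template: render_sbatch, format_walltime, and the choice of directives are assumptions, although each #SBATCH option and srun flag shown is a standard SLURM or Pyxis option.

```python
def format_walltime(hours):
    """Convert fractional hours to the HH:MM:SS form used by --time."""
    total_minutes = int(round(hours * 60))
    return f"{total_minutes // 60:02d}:{total_minutes % 60:02d}:00"


def render_sbatch(job_name, partition, image, command, nodes=1,
                  gpus_per_node=4, time_hours=4, account=None, requeue=True):
    """Render a small sbatch script (hypothetical layout)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        f"#SBATCH --time={format_walltime(time_hours)}",
    ]
    if account:
        lines.append(f"#SBATCH --account={account}")
    if requeue:
        lines.append("#SBATCH --requeue")
    # The container runs through Pyxis via srun, with /lustre mounted.
    lines.append(
        f"srun --container-image={image} --container-mounts=/lustre {command}"
    )
    return "\n".join(lines) + "\n"
```

The rendered text would be written to <job_dir>/sbatch/job_<job_id>.sbatch and submitted with sbatch --export=ALL.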
Accepted image formats: /path/to/image.sqsh, registry#image:tag, docker://registry#image:tag, and ordinary registry/image:tag (converted to Pyxis form when needed). SQSH conversion is cached by image name; for :latest images the cached SQSH is reused unless force_reconvert_latest is enabled.
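The conversion of an ordinary registry/image:tag reference to Pyxis form can be sketched as below. The helper name and the rule for recognizing a registry host (the first path component contains a dot or a port, or is localhost) are assumptions, not the tao-core logic.

```python
def to_pyxis_image(image):
    """Normalize an image reference for srun --container-image (a sketch)."""
    # SQSH paths and references that already use '#' are passed through.
    if image.endswith(".sqsh") or "#" in image:
        return image
    if image.startswith("docker://"):
        image = image[len("docker://"):]
    first, sep, rest = image.partition("/")
    # Assumption: the first component is a registry only if it looks like a host.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return f"{first}#{rest}"
    return image
```

For example, registry.example.com/team/image:1.0 becomes registry.example.com#team/image:1.0, while /path/to/image.sqsh is returned unchanged.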
Monitoring and cancellation
- Scheduler status comes from the stored SLURM job ID via squeue/sacct;
TAO terminal status comes from status.json in the shared results folder.
- While chat monitoring is enabled, keep polling at the requested interval for
any non-terminal job (PENDING, RUNNING, or otherwise). Do not stop after a fixed elapsed time such as 30 minutes; long queue waits are normal on shared GPU partitions.
- Do not send a final response for a non-terminal SLURM job when chat
monitoring is enabled. A final response is a detach action; use it only if the user asked to detach/stop or the job reached terminal state.
- Logs are read over SSH from
<job_dir>/slurm-logs/<slurm_job_name>-<slurm_job_id>/main.out and .err.
- Cancel by looking up
backend_details.slurm_metadata.slurm_job_id and running
scancel <slurm_job_id> over SSH. Treat missing or already terminated jobs as successful cancellation.
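The polling rule above (keep going at the requested interval, with no elapsed-time cap) can be sketched as a loop that exits only on a terminal scheduler state. The function name, the injected get_state callable, and the exact terminal set are assumptions for illustration; a real monitor would also consult status.json and the retry logic.

```python
import time

# Assumption: scheduler states treated as terminal for this sketch.
TERMINAL_STATES = {
    "COMPLETED", "FAILED", "BOOT_FAIL", "DEADLINE", "OUT_OF_MEMORY",
    "NODE_FAIL", "CANCELLED", "PREEMPTED", "REVOKED", "TIMEOUT",
}


def poll_until_terminal(get_state, interval_s, sleep=time.sleep, on_update=print):
    """Poll with no elapsed-time cap; return the first terminal state seen.

    get_state is a caller-supplied callable (for example, a wrapper around
    squeue/sacct over SSH) that returns the current SLURM state string.
    """
    while True:
        state = get_state()
        on_update(state)
        if state in TERMINAL_STATES:
            return state
        sleep(interval_s)
```

Because the loop has no deadline, a job that sits in PENDING for hours keeps being reported rather than silently dropped.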
Status mapping:
- PENDING -> Pending
- RUNNING or COMPLETING -> Running
- COMPLETED -> check status.json
- FAILED, BOOT_FAIL, DEADLINE, OUT_OF_MEMORY, NODE_FAIL -> retry if logs match retriable infrastructure patterns, otherwise Error
- CANCELLED, PREEMPTED, REVOKED -> Canceled
- TIMEOUT -> Error
- SUSPENDED, STOPPED -> Paused
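The status mapping can be expressed as a small lookup. This is a sketch rather than the tao-core handler: the function name, the "Retry" marker, and the CHECK_STATUS_JSON sentinel are hypothetical. The normalization step reflects that sacct can report states such as "CANCELLED by <uid>".

```python
CHECK_STATUS_JSON = "check status.json"  # hypothetical sentinel

_DIRECT = {
    "PENDING": "Pending",
    "RUNNING": "Running",
    "COMPLETING": "Running",
    "COMPLETED": CHECK_STATUS_JSON,
    "CANCELLED": "Canceled",
    "PREEMPTED": "Canceled",
    "REVOKED": "Canceled",
    "TIMEOUT": "Error",
    "SUSPENDED": "Paused",
    "STOPPED": "Paused",
}
_MAYBE_RETRIABLE = {"FAILED", "BOOT_FAIL", "DEADLINE", "OUT_OF_MEMORY", "NODE_FAIL"}


def map_slurm_state(raw_state, logs_retriable=False):
    """Map a SLURM state string to a TAO-style status (a sketch)."""
    parts = raw_state.split()
    if not parts:
        raise ValueError("empty SLURM state")
    # Strip suffixes such as "by 1234" or a trailing "+".
    state = parts[0].rstrip("+").upper()
    if state in _MAYBE_RETRIABLE:
        return "Retry" if logs_retriable else "Error"
    if state in _DIRECT:
        return _DIRECT[state]
    raise ValueError(f"unmapped SLURM state: {raw_state!r}")
```

Unmapped states raise instead of guessing, so a new scheduler state is surfaced rather than misreported.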
Required inputs
Ask for these in the SLURM intake; see references/slurm-ssh-credentials.md for the full credential list, microservices schema keys, and defaults.
- SLURM_USER (required): SSH username for the login node.
- SLURM_HOSTNAME (required): Comma-separated login hostnames for failover.
- SLURM_PARTITION (required): Partition list for GPU submission. The packaged
default is polar,polar3,polar4,grizzly; these are treated as 4-hour queues.
- SSH_KEY_PATH (preferred, expected before launch): private key for
non-interactive public-key auth. Ask for this first in remediation; prefer it over the SSH_AUTH_SOCK agent-socket fallback.
- SLURM_BASE_RESULTS_DIR (optional): base shared-filesystem path; default
/lustre/fsw/portfolios/edgeai/users/<your-dir> (your per-user Lustre dir).
- SLURM_ACCOUNT (usually required by site policy): account for
#SBATCH --account.
Do not ask for SLURM_ACCOUNT or SLURM_BASE_RESULTS_DIR in the initial intake unless the user says their site requires an account, wants a custom results root, or the workflow cannot proceed without overriding defaults.
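A minimal sketch of how the intake values might be parsed once collected. The function name and returned keys are hypothetical, and falling back to the packaged partition list when SLURM_PARTITION is unset is an assumption about how the default is applied.

```python
PACKAGED_PARTITIONS = "polar,polar3,polar4,grizzly"


def _split_csv(value):
    """Split a comma-separated value, dropping blanks and whitespace."""
    return [item.strip() for item in value.split(",") if item.strip()]


def parse_slurm_intake(env):
    """Validate and normalize SLURM intake values (hypothetical helper)."""
    missing = [
        key for key in ("SLURM_USER", "SLURM_HOSTNAME")
        if not env.get(key, "").strip()
    ]
    if missing:
        raise ValueError("missing required SLURM inputs: " + ", ".join(missing))
    return {
        "user": env["SLURM_USER"].strip(),
        # Several login hosts may be listed for failover.
        "hostnames": _split_csv(env["SLURM_HOSTNAME"]),
        "partitions": _split_csv(env.get("SLURM_PARTITION") or PACKAGED_PARTITIONS),
        "ssh_key_path": env.get("SSH_KEY_PATH") or None,
        "account": env.get("SLURM_ACCOUNT") or None,
        "base_results_dir": env.get("SLURM_BASE_RESULTS_DIR") or None,
    }
```

Optional values stay None when unset, which matches the guidance to ask for them only when the site or workflow requires it.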
Resource defaults
Defaults from tao-core:
- num_nodes: 1
- num_gpus: 4
- max_num_gpus_per_node: 8
- cpus_per_task: 16
- time_hours: 4
- timeout_hours: 3.8
- max_time_hours: 4
- container_mounts: /lustre
- use_requeue: true
- use_sqsh: true
When generating launchers or wrapper scripts for SLURM, set the wall-time defaults explicitly from the packaged platform resource defaults:
export SLURM_TIME_HOURS="${SLURM_TIME_HOURS:-4}"
export SLURM_TIMEOUT_HOURS="${SLURM_TIMEOUT_HOURS:-3.8}"
Do not default to 12 hours on SLURM. If the user supplies a longer SLURM_TIME_HOURS, verify that the selected partition supports it before submitting. For the packaged default partition list polar,polar3,polar4,grizzly, reject requests above 4 hours and ask for a different partition only if the user actually wants a longer wall time.
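The wall-time rule can be sketched as a small resolver. The function name is hypothetical, and rejecting only when every selected partition is in the packaged list is an assumption; for other partitions the caller still has to verify the limit before submitting.

```python
PACKAGED_4H_PARTITIONS = {"polar", "polar3", "polar4", "grizzly"}
PACKAGED_MAX_HOURS = 4.0


def resolve_wall_time(env, partitions):
    """Return (time_hours, timeout_hours) using the packaged defaults.

    Empty values fall back to the defaults, like ${VAR:-default} in shell.
    """
    time_hours = float(env.get("SLURM_TIME_HOURS") or 4)
    timeout_hours = float(env.get("SLURM_TIMEOUT_HOURS") or 3.8)
    on_packaged = set(partitions) <= PACKAGED_4H_PARTITIONS
    if on_packaged and time_hours > PACKAGED_MAX_HOURS:
        raise ValueError(
            f"{time_hours} h exceeds the 4-hour limit of the packaged partitions; "
            "choose a partition that allows a longer wall time"
        )
    return time_hours, timeout_hours
```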
When num_gpus is greater than or equal to max_num_gpus_per_node, the handler treats the request as exclusive per node and computes additional nodes from total GPU count when necessary.
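As an illustration of the node calculation, the sketch below assumes the handler rounds the total GPU count up to whole nodes (ceiling division). The function name and return shape are hypothetical; only the threshold comparison comes from the text above.

```python
import math


def plan_nodes(num_gpus, max_num_gpus_per_node=8, num_nodes=1):
    """Return (num_nodes, exclusive) for a GPU request (a sketch).

    Assumption: extra nodes are derived with ceiling division.
    """
    exclusive = num_gpus >= max_num_gpus_per_node
    if exclusive:
        needed = math.ceil(num_gpus / max_num_gpus_per_node)
        num_nodes = max(num_nodes, needed)
    return num_nodes, exclusive
```

With the packaged defaults, 4 GPUs stays on one shared node, 8 GPUs takes one node exclusively, and 16 GPUs takes two exclusive nodes.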
Multi-node, SDK, and retries
For multi-node jobs (num_nodes > 1), the SDK builds the sbatch directives and exports the PyTorch-distributed rendezvous env vars automatically: WORLD_SIZE, NUM_GPU_PER_NODE, NODE_RANK, MASTER_ADDR, and MASTER_PORT (29500). TAO entrypoints read WORLD_SIZE + NUM_GPU_PER_NODE and build torchrun internally. Cosmos-RL has special multi-node role handling for controller, policy, and rollout workers.
Use Lustre, not S3, for SLURM job inputs. The GPU allocation starts the moment the job is dispatched, so a long s3:// download at the top of the script burns the allocation, can get the job killed for GPU-idle, and is billed either way. Stage training data on the shared filesystem first and reference it as lustre:///.... S3/HF/NGC pre-fetch is fine for small auxiliary inputs (checkpoints, configs), not training datasets. K8s/Brev do not share this scheduler-idle constraint.
Auto-retry of infrastructure failures (NODE_FAIL, BOOT_FAIL, NCCL transport timeouts, CUDA driver init failures, GPU/IB link-down, OOM-killer node reaping, Xid errors) is automatic in the SDK, with a stable user-facing Job.id across retries. Plain training failures surface immediately so a broken spec does not consume the retry budget. #SBATCH --requeue is enabled by default via SLURM_USE_REQUEUE=true.
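The split between retriable infrastructure failures and plain training failures can be sketched as a log classifier. The regular expressions below are illustrative guesses at what such patterns might look like, not the SDK's actual list, and max_retries stands in for the MAX_JOB_RETRIES budget whose value is not given here.

```python
import re

# Hypothetical patterns; the SDK's real list may differ.
RETRIABLE_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"\bNODE_FAIL\b",
        r"\bBOOT_FAIL\b",
        r"NCCL.*timeout",
        r"\bXid\b",
        r"link.?down",
        r"oom-kill",
    )
]


def is_retriable(log_text):
    """True if the log matches any infrastructure-failure pattern."""
    return any(pattern.search(log_text) for pattern in RETRIABLE_PATTERNS)


def should_retry(log_text, attempt, max_retries):
    """Retry only infrastructure failures, and only within the budget."""
    return attempt < max_retries and is_retriable(log_text)
```

A traceback from a broken spec matches none of the patterns, so it surfaces immediately and leaves the retry budget untouched.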
See references/slurm-container-execution.md for the full multi-node env-var/sbatch directive detail and table, cluster requirements, the optional TAO SDK path (SlurmSDK, build_entrypoint, ActionWorkflow) with code, the Lustre-not-S3 rule in full, and the failure-mode checklist; references/slurm-execution-sdk.md covers the MAX_JOB_RETRIES retry budget. When the SDK is in scope, read tao-skill-bank:tao-run-platform for the SlurmSDK kwarg reference.
References
- references/slurm-ssh-credentials.md: preflight script, SSH/key setup, enroot credentials, full credential list, backend details, storage rules, SSH remediation prompt.
- references/slurm-container-execution.md: container execution steps, monitoring, status mapping, cancellation, multi-node detail, SDK use, Lustre-not-S3, auto-retry, failure modes.
- references/slurm-preflight-storage.md: extended preflight/storage notes.
- references/slurm-execution-sdk.md: extended execution/SDK notes.
- references/detailed-guide.md: navigation map for the split references.



