Codex Skill

Audio Transcribe

Transcribe audio files to text with optional diarization and known-speaker hints. Use when a user asks to transcribe speech from audio/video, extract text from recordings, or label speakers in interviews or meetings.

Install

npx skills add https://github.com/openai/skills --skill transcribe

Page Outline

  • Workflow
  • Decision rules
  • Output conventions
  • Dependencies (install if missing)
  • Environment
  • Skill path (set once)
  • CLI quick start
  • Reference map

Audio Transcribe

Transcribe audio using OpenAI, with optional speaker diarization when requested. Prefer the bundled CLI for deterministic, repeatable runs.

Workflow

  • Collect inputs: audio file path(s), desired response format (text/json/diarized_json), optional language hint, and any known speaker references.
  • Verify `OPENAI_API_KEY` is set. If missing, ask the user to set it locally (do not ask them to paste the key).
  • Run the bundled `transcribe_diarize.py` CLI with sensible defaults (fast text transcription); a scripted sketch follows this list.
  • Validate the output: transcription quality, speaker labels, and segment boundaries; iterate with a single targeted change if needed.
  • Save outputs under `output/transcribe/` when working in this repo.
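
The same flow can be scripted. A minimal sketch only: the wrapper below and its timestamp job-id scheme are illustrative, not part of the skill; the CLI path and flags mirror the quick-start examples further down.

import os
import subprocess
import sys
import time
from pathlib import Path

# Skill path convention from the "Skill path" section below (default ~/.codex).
CODEX_HOME = os.environ.get("CODEX_HOME", os.path.expanduser("~/.codex"))
CLI = os.path.join(CODEX_HOME, "skills", "transcribe", "scripts", "transcribe_diarize.py")

def transcribe(audio_path: str) -> Path:
    # Job id is an assumption: a timestamp keeps repeated runs from colliding.
    job_dir = Path("output/transcribe") / time.strftime("%Y%m%d-%H%M%S")
    job_dir.mkdir(parents=True, exist_ok=True)
    out_file = job_dir / (Path(audio_path).stem + ".txt")
    # Fast text transcription is the CLI default, so only --out is passed.
    subprocess.run([sys.executable, CLI, audio_path, "--out", str(out_file)], check=True)
    return out_file

if __name__ == "__main__":
    print(transcribe(sys.argv[1]))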

Decision rules

  • Default to `gpt-4o-mini-transcribe` with `--response-format text` for fast transcription.
  • If the user wants speaker labels or diarization, use `--model gpt-4o-transcribe-diarize --response-format diarized_json` (see the helper sketch after this list).
  • If audio is longer than ~30 seconds, keep `--chunking-strategy auto`.
  • Prompting is not supported for `gpt-4o-transcribe-diarize`.
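
These rules fit in a small helper. A sketch only: the function and its names are illustrative; the model names and response formats are the ones listed above.

def choose_settings(want_speaker_labels: bool) -> dict:
    """Map the decision rules above onto CLI/API parameters."""
    if want_speaker_labels:
        # Diarization path: dedicated model, diarized JSON output.
        # Prompting is not supported for this model.
        return {"model": "gpt-4o-transcribe-diarize", "response_format": "diarized_json"}
    # Default path: fast plain-text transcription.
    return {"model": "gpt-4o-mini-transcribe", "response_format": "text"}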

Output conventions

  • Use `output/transcribe/<job-id>/` for evaluation runs.
  • When transcribing multiple files, pass `--out-dir` rather than a single `--out` path so outputs are not overwritten.

Dependencies (install if missing)

Prefer `uv` for dependency management.

uv pip install openai

If `uv` is unavailable:

python3 -m pip install openai

Environment

  • `OPENAI_API_KEY` must be set for live API calls; a quick check is sketched after this list.
  • If the key is missing, instruct the user to create one in the OpenAI platform UI and export it in their shell.
  • Never ask the user to paste the full key in chat.
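
A minimal guard along these lines fails early with guidance and never touches the key itself (sketch; the exact wording of the hint is up to you):

import os
import sys

# Check presence only; never read back, print, or request the key value.
if not os.environ.get("OPENAI_API_KEY"):
    sys.exit(
        "OPENAI_API_KEY is not set. Create a key in the OpenAI platform UI "
        "and export it in your shell profile, then re-run."
    )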

Skill path (set once)

export CODEX_HOME="${CODEX_HOME:-$HOME/.codex}"
export TRANSCRIBE_CLI="$CODEX_HOME/skills/transcribe/scripts/transcribe_diarize.py"

User-scoped skills install under `$CODEX_HOME/skills` (default: `~/.codex/skills`).

CLI quick start

Single file (fast text default):

python3 "$TRANSCRIBE_CLI" \
  path/to/audio.wav \
  --out transcript.txt

Diarization with known speakers (up to 4):

python3 "$TRANSCRIBE_CLI" \
  meeting.m4a \
  --model gpt-4o-transcribe-diarize \
  --known-speaker "Alice=refs/alice.wav" \
  --known-speaker "Bob=refs/bob.wav" \
  --response-format diarized_json \
  --out-dir output/transcribe/meeting
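
To skim the resulting speaker turns, something like the sketch below works. The output filename and the field names (`segments`, `speaker`, `start`, `end`, `text`) are assumptions here; check `references/api.md` for the actual `diarized_json` schema.

import json
from pathlib import Path

# Assumed location and name of the diarized output for the run above.
result = json.loads(Path("output/transcribe/meeting/meeting.json").read_text())

for seg in result.get("segments", []):
    # Field names are assumptions; see references/api.md for the real schema.
    print(f'[{seg["start"]:7.2f}-{seg["end"]:7.2f}] {seg["speaker"]}: {seg["text"]}')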

Plain text output (explicit):

python3 "$TRANSCRIBE_CLI" \
  interview.mp3 \
  --response-format text \
  --out interview.txt
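
If the bundled CLI is unavailable, the fast-text default can be approximated directly with the OpenAI Python SDK. A minimal sketch; it skips whatever extra handling the CLI adds (chunking strategy, output layout):

from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

with open("interview.mp3", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=audio,
        response_format="text",
    )

print(transcript)

Prefer the bundled CLI for repeatable runs, per the notes at the top of this page.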

Reference map

  • `references/api.md`: supported formats, limits, response formats, and known-speaker notes.
