NCBI Sequence Fetch
Prerequisites
uv: Read theuvskill and follow its Setup instructions to ensure
uv is installed and on PATH.
- User Notification: If LICENSE_NOTIFICATION.txt does not already exist in
this skill directory then (1) prominently notify the user to check the terms at https://www.ncbi.nlm.nih.gov/ and https://www.ncbi.nlm.nih.gov/home/about/policies/, then (2) create the file recording the notification text and timestamp.
.envfile: Make sure the.envfile exists in your home directory.
Create one if it does not exist.
NCBI_API_KEY(optional): Raises the NCBI rate limit from 3 to 10
requests/second. The skill works without it, but a key is recommended if the user plans many queries or encounters a 429 error. The user can obtain one for free by registering at https://www.ncbi.nlm.nih.gov/account/settings/. If the variable is missing from .env, do NOT ask the user to paste it into the chat (this would leak the key into the agent's context). Instead, give the user this command — substituting ENV_FILE with the resolved literal path to the .env file:
printf "Enter NCBI API key (typing hidden): " && read -s key && echo && echo "NCBI_API_KEY=$key" >> "ENV_FILE" && echo "Saved."
The scripts load credentials automatically via dotenv. NEVER read, print, or inspect the .env file or its variables (e.g. no cat, grep, echo, printenv, or os.environ.get on keys). Credentials must stay out of the agent's context.
Core Rules
- Use the Wrapper: ALWAYS execute the provided helper scripts to query the
database rather than accessing the database directly. The scripts automatically enforce the required rate limit gracefully.
- API Key Support: If the user provides an
NCBI_API_KEYin their
environment, the query speed limits are automatically increased significantly.
- Notification: If this skill is used, ensure this is mentioned in the
output.
Overview
Wraps NCBI's Entrez E-utilities (efetch, esearch, elink, esummary) for retrieving protein and nucleotide sequences. Provides 10 subcommands covering the full range of sequence retrieval workflows:
fetch-protein— Direct protein accession lookup (GenPept, RefSeq)fetch-nucleotide— Direct nucleotide accession lookupcds-translate— Fetch CDS and translate to protein (3 methods)search— Free-text search of any NCBI databaseelink— Follow cross-database links (PubMed→Protein, etc.)gene-protein— Search protein by gene name + organismlocus-protein— Search protein by locus tag + organismpubmed-proteins— Find proteins linked to a PubMed articlepatent-search— Extract protein sequences from patentsorganism-length— Last-resort search by organism + exact AA length
Utility Scripts
scripts/ncbi_fetch.py — Single script with subcommands.
All subcommands write structured JSON output. Use --output FILE to save to a file, or omit it to print to stdout. A human-readable summary is always printed to stdout.
1. Fetch Protein by Accession
Fetches protein FASTA from NCBI by accession (XP_, NP_, GenPept, etc.)
uv run scripts/ncbi_fetch.py fetch-protein XP_022033624 -o /tmp/result.json
uv run scripts/ncbi_fetch.py fetch-protein NP_001234567 ABC12345.1
2. Fetch Nucleotide by Accession
Fetches nucleotide FASTA from NCBI by accession.
uv run scripts/ncbi_fetch.py fetch-nucleotide MK034466 -o /tmp/result.json
3. CDS Translate
Fetches a CDS/nucleotide accession and translates to protein sequence. Tries three approaches in order: 1. NCBI's pre-translated CDS protein (fasta_cds_aa)
- GenBank XML CDS annotation translations 3. Raw nucleotide → 6-frame ORF
finding
uv run scripts/ncbi_fetch.py cds-translate MK034466 -o /tmp/result.json
uv run scripts/ncbi_fetch.py cds-translate HQ662330 --target-length 1043
If the accession is a genomic record (not mRNA/CDS), the tool will report is_genomic: true so you can fall back to a homology-based approach instead.
4. Search Any Database
Free-text search using Entrez query syntax. Supports all NCBI databases.
# Search protein database
uv run scripts/ncbi_fetch.py search "WRR4B[Gene Name] AND Arabidopsis[Organism]" \
--database protein --retmax 5 --fetch-sequences
# Search nucleotide database
uv run scripts/ncbi_fetch.py search "Rz2[Gene Name] AND Beta vulgaris[Organism]" \
--database nuccore --retmax 10
# Search with patent filter
uv run scripts/ncbi_fetch.py search "disease resistance AND Solanum[Organism] AND patent[Properties]" \
--database protein --fetch-sequences
# Search by sequence length
uv run scripts/ncbi_fetch.py search '"Oryza sativa"[Organism] AND 1043[SLEN]' \
--database protein --fetch-sequences --retmax 50
5. Cross-Database Links (elink)
Follow NCBI's cross-database links (e.g., PubMed article → linked proteins).
uv run scripts/ncbi_fetch.py elink 24896089 --dbfrom pubmed --db protein \
--fetch-sequences -o /tmp/linked.json
6. Gene + Organism Search
Searches for protein sequences by gene name and organism. Searches NCBI Protein with [Gene Name] and [Organism] qualifiers.
uv run scripts/ncbi_fetch.py gene-protein WRR4B --organism "Arabidopsis thaliana"
uv run scripts/ncbi_fetch.py gene-protein Pikh-2 --organism "Oryza sativa" \
--target-length 1043 -o /tmp/result.json
7. Locus Tag Search
Searches by locus tag in both NCBI Protein and Nuccore databases. Extracts CDS translations from GenBank XML when direct protein hits aren't available.
uv run scripts/ncbi_fetch.py locus-protein At1g56540 --organism "Arabidopsis thaliana"
uv run scripts/ncbi_fetch.py locus-protein Niben101Scf02422g02015.1 \
--organism "Nicotiana benthamiana" -o /tmp/result.json
8. PubMed-Linked Proteins
Finds protein sequences linked to a PubMed article. Searches NCBI Protein by PMID, follows elink PubMed→Protein, and extracts CDS translations from linked Nuccore records.
uv run scripts/ncbi_fetch.py pubmed-proteins 30692254 --identifier WRR4B
uv run scripts/ncbi_fetch.py pubmed-proteins 24896089 --identifier "K2" \
-o /tmp/result.json
9. Patent Sequence Search
Two modes:
By patent number — fetches all protein sequences from a specific patent: bash uv run scripts/ncbi_fetch.py patent-search --patent-number US10123456 -o /tmp/patent.json
By keywords — searches NCBI Protein with patent[Properties] filter: bash uv run scripts/ncbi_fetch.py patent-search --keywords WRR4B Albugo --organism "Arabidopsis thaliana" -o /tmp/patent.json
[!IMPORTANT] Patent convention: In molecular biology patents, SEQ ID NO: 1 is typically the DNA sequence and SEQ ID NO: 2 is the primary protein. Higher SEQ ID NOs are variants or related sequences. Prefer Sequence 2 when selecting the primary protein of interest.
10. Organism + Length Search
Last-resort search when only organism and expected protein length are known. Uses NCBI's [SLEN] filter for exact length matching.
uv run scripts/ncbi_fetch.py organism-length \
--organism "Arabidopsis thaliana" --length 1048 --retmax 50 \
-o /tmp/result.json
[!NOTE] This often returns multiple candidates. Use the JSON output headers to identify the correct protein.
Workflow
Standard Sequence Retrieval Cascade
When trying to find a protein sequence, follow this priority order:
- Direct accession —
fetch-proteinwith GenPept/RefSeq accession - CDS translation —
cds-translatewith nucleotide/CDS accession - PubMed-linked —
pubmed-proteinswith PMID + gene name - Locus lookup —
locus-proteinwith locus tag + organism - Gene + organism —
gene-proteinwith gene name + organism - Patent search —
patent-searchwith patent number or keywords - Organism + length —
organism-lengthas last resort
Interpreting Results
- All subcommands return JSON with a
resultsarray - Each result has
sequence(AA string),length, andheader/metadata - When multiple results are returned, select by:
- Closest match to expected length (
target_length) - Header relevance (matching gene name, "disease resistance" keywords)
- Source priority (RefSeq > GenPept > patent)
Reference
- NCBI E-utilities docs: https://www.ncbi.nlm.nih.gov/books/NBK25499/
- Entrez search syntax: https://www.ncbi.nlm.nih.gov/books/NBK49540/
- Database list: protein, nuccore, gene, pubmed, pmc, biosample, etc.
- Common accession formats:
XP_/NP_— NCBI RefSeq proteinAAAtoAZZ+ digits — GenPept (translated GenBank)MK,MN,HQ, etc. + digits — GenBank nucleotideENSG,ENST,ENSP— Ensembl (useensembl-databaseskill instead)Q,P,O+ digits — UniProt (useuniprot-databaseskill instead)

