OpenTelemetry Skill: A Cognitive Architecture for AI-Assisted Observability Engineering
  
Overview
The opentelemetry-skill is an AI assistant skill designed to help with OpenTelemetry configuration and observability engineering tasks. This skill employs progressive disclosure to optimize context usage and deliver production-ready OpenTelemetry configurations.
This repository contains the source code for the OpenTelemetry Skill tile released by Tessl.
- Published versions: https://tessl.io/registry/o11y-dev/opentelemetry-skill
- Source code: https://github.com/o11y-dev/opentelemetry-skill
Key Features
- Comprehensive Coverage: Specialized reference docs covering collector architecture, security, sampling, AI agents, and compatibility
- Production Focus: Emphasizes stability, security, and cost optimization patterns
- AI Agent Support: Configuration guidance for monitoring AI coding agents alongside traditional applications
- Progressive Loading: Context-aware reference loading prevents information overload
- Continuous Updates: Automated upstream monitoring tracks OpenTelemetry releases and AI agent repositories
Table of Contents
- Key Features
- What Makes This Different?
- Core Features
- Skill Structure
- Installation
- Architecture
- Architecture Patterns
- Usage Examples
- Reference Documentation
- Contrib Components & Example Configs
- Testing & Validation
- Contributing
- Known Limitations
- Roadmap
- Compatibility
- License
- Related Projects
What Makes This Different?
Unlike loading the entire OpenTelemetry documentation into an AI's context (which leads to hallucinations and information overload), this skill acts as a cognitive router:
1. System 2 Thinking: Forces the AI to analyze critical observability signals (throughput, cardinality, resiliency) before generating code
2. Progressive Disclosure: Loads detailed reference materials only when specific topics are triggered
3. Production-First: Prioritizes stability, security, and cost optimization over feature completeness
4. Convention Enforcement: Ensures semantic conventions, proper processor ordering, and architectural best practices
5. AI Agent Support: Includes guidance for observing AI coding agents in production environments
Core Features
- Cognitive Architecture: Meta-knowledge layer that teaches AI how to think about observability
- Cardinality Management: Built-in guards against metric explosion and cost overruns
- Deployment Patterns: DaemonSet vs Gateway vs Sidecar decision matrices for Kubernetes
- Security by Default: PII redaction, TLS, and authentication patterns
- OTTL Transformations: Comprehensive OpenTelemetry Transformation Language guidance with patterns and best practices
- Scaling Strategies: Load balancing with sticky sessions for tail sampling
- Sampling Intelligence: Head vs tail sampling with statistical trade-off analysis
- Meta-Monitoring: Self-observability patterns for collector health
- AI Agent Observability: Configuration guides for monitoring AI coding agents including Claude Code, Gemini CLI, GitHub Copilot, Codex CLI, Qwen Code, Pi Agent, and more via OpenTelemetry
- Test & Validation Framework: TDD-based testing methodology to ensure skill effectiveness
Skill Structure
SKILL.md acts as the cognitive router: a compact instruction set that tells the AI how to reason about observability before generating any output. docs/index.md is the tile's on-demand documentation entrypoint for Tessl, and references/ contains the deep-dive documents that the skill links to when specific topics are triggered.
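As a rough illustration of the routing idea, the sketch below maps trigger keywords to reference files and returns only the files a request actually needs. The keywords and the `select_references` helper are hypothetical; the real triggers are defined in SKILL.md.

```python
# Hypothetical trigger table: keyword -> reference file.
# The real triggers live in SKILL.md; these keywords are illustrative only.
TRIGGERS = {
    "tail sampling": "references/sampling.md",
    "daemonset": "references/architecture.md",
    "gateway": "references/architecture.md",
    "pii": "references/security.md",
    "processor": "references/collector.md",
    "cardinality": "references/instrumentation.md",
}


def select_references(request: str) -> list[str]:
    """Return the reference files whose trigger keyword appears in the request."""
    text = request.lower()
    selected = []
    for keyword, path in TRIGGERS.items():
        if keyword in text and path not in selected:
            selected.append(path)
    return selected


print(select_references("Deploy a gateway in Kubernetes for tail sampling"))
# → ['references/sampling.md', 'references/architecture.md']
```

Only the matched files enter the context; a request that hits no trigger loads nothing beyond the router itself.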
Content Overview
- Packaged reference docs for architecture, collector design, instrumentation, security, sampling, AI agents, and compatibility
- AI coding agent coverage tracked with upstream monitoring
- Production-tested configurations with validation commands
- Kept current: automated upstream monitoring tracks the latest OpenTelemetry releases
Installation
skills.sh
Install this skill with the skills.sh CLI:
npx skills add o11y-dev/opentelemetry-skill
Tessl Registry
Install this tile from the Tessl registry (workspace: o11y-dev):
tessl tile install o11y-dev/opentelemetry-skill
GitHub Copilot
Attach SKILL.md as a custom instructions file, or reference the repository as a Copilot Skill in your Copilot settings: https://github.com/o11y-dev/opentelemetry-skill
Claude
Add SKILL.md to your project knowledge or paste it into your system prompt.
Cursor
Plugin manifests are available in .cursor-plugin/ for use with the Cursor marketplace.
Other AI Systems
Point your agent at SKILL.md as the primary instruction set, with references/ available for context loading.
Architecture
opentelemetry-skill/
├── .claude-plugin/
│   └── marketplace.json      # Plugin metadata
├── .cursor-plugin/
│   ├── marketplace.json      # Cursor marketplace metadata
│   └── plugin.json           # Cursor plugin manifest
├── docs/
│   └── index.md              # Tessl docs entrypoint for bundled references and eval assets
├── SKILL.md                  # Cognitive router (the "brain")
├── README.md                 # This file
├── references/
│   ├── ai-agents.md          # AI agent observability patterns & configurations
│   ├── architecture.md       # Deployment patterns & scaling
│   ├── compatibility.md      # Version-sensitive support and compatibility notes
│   ├── collector.md          # Pipeline configuration & components
│   ├── instrumentation.md    # SDKs & semantic conventions
│   ├── sampling.md           # Sampling strategies
│   ├── security.md           # PII redaction & authentication
│   └── monitoring.md         # Self-monitoring patterns
└── LICENSE                   # Apache 2.0
Architecture Patterns
| Category | Pattern | Description |
|----------|---------|-------------|
| Kubernetes | DaemonSet / Gateway / Sidecar | Choose based on workload type and data volume |
| Serverless | FaaS Extension Layer | Lambda, Azure Functions, GCP with non-blocking export |
| Sampling | Head / Tail Sampling | Trade-off between cost and completeness |
| Security | mTLS + RBAC | Secure cross-network telemetry pipelines |
| AI Agents | Agent Telemetry | Monitor coding agents as first-class services in your observability stack |
Usage Examples
Example 1: Gateway Deployment for Tail Sampling
User: "I need to deploy an OpenTelemetry gateway in Kubernetes for tail sampling."
AI Response (leveraging the skill):
- Asks about throughput to size replicas
- Loads references/architecture.md and references/sampling.md
- Generates a Deployment with the loadbalancing exporter (routing_key: traceID)
- Includes a headless Service for sticky sessions
- Configures the tail_sampling processor with error/latency policies
- Warns about Beta stability level
- Provides validation commands
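Tail sampling only works if every span of a trace reaches the same gateway replica, which is why the routing key is the trace ID. The sketch below illustrates the idea with a small consistent-hash ring. It is a simplified model and not the loadbalancing exporter's actual implementation; the backend addresses are hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, backends: list[str], vnodes: int = 64) -> None:
        # Each backend owns several points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def route(self, trace_id: str) -> str:
        """Pick the first ring point at or after the trace ID's hash, wrapping around."""
        index = bisect.bisect_left(self._points, _hash(trace_id)) % len(self._ring)
        return self._ring[index][1]


# Hypothetical gateway pod addresses, as a headless Service would expose them.
ring = HashRing(["gateway-0:4317", "gateway-1:4317", "gateway-2:4317"])
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

# Every span of the same trace lands on the same replica.
assert ring.route(trace_id) == ring.route(trace_id)
```

With a ring like this, removing one replica only remaps the traces that replica owned, so scaling events disturb fewer in-flight sampling decisions than a plain modulo hash would.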
Example 2: Preventing Data Loss
User: "Ensure we don't lose telemetry data if the backend goes down."
AI Response:
- Loads references/collector.md
- Configures the file_storage extension
- Attaches persistent queues to exporters
- Explains disk space requirements
- Provides PersistentVolume YAML for Kubernetes
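The disk space requirement for a persistent queue can be approximated as export rate times outage window times a safety factor. The sketch below shows that arithmetic; the `queue_disk_bytes` helper, the 1.5 headroom factor, and the example traffic figures are assumptions for illustration, not values from the skill.

```python
def queue_disk_bytes(bytes_per_second: float, outage_seconds: float,
                     headroom: float = 1.5) -> int:
    """Estimate the disk needed to buffer telemetry through one backend outage.

    headroom is an assumed safety factor for serialization overhead and bursts.
    """
    if bytes_per_second < 0 or outage_seconds < 0 or headroom < 1:
        raise ValueError("rate and duration must be non-negative, headroom >= 1")
    return int(bytes_per_second * outage_seconds * headroom)


# Example: 2 MiB/s of export traffic, buffering a 30-minute outage.
needed = queue_disk_bytes(2 * 1024**2, 30 * 60)
print(f"{needed / 1024**3:.2f} GiB")  # → 5.27 GiB
```

The PersistentVolume should be sized from the longest outage you intend to survive, measured against real export throughput rather than a guess.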
Example 3: Cardinality Guard
User: "Add user_id as a metric dimension."
AI Response:
- Blocks the request: "user_id violates the Rule of 100: it's unbounded and will cause metric explosion."
- Suggests alternative: "Use user_id as a trace attribute instead, and create a metric for aggregated user counts."
- Loads references/instrumentation.md to explain cardinality management
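A guard like the one in this example can be sketched as a distinct-value check. The sketch assumes the "Rule of 100" means a metric dimension should take no more than about 100 distinct values; that reading, the `check_dimension` helper, and the sample data are all assumptions.

```python
from collections.abc import Iterable

MAX_DISTINCT_VALUES = 100  # assumed reading of the "Rule of 100"


def check_dimension(name: str, observed_values: Iterable[str],
                    limit: int = MAX_DISTINCT_VALUES) -> tuple[bool, str]:
    """Return (allowed, reason) for using an attribute as a metric dimension."""
    distinct = set()
    for value in observed_values:
        distinct.add(value)
        if len(distinct) > limit:
            # Stop early: an unbounded attribute never needs a full scan.
            return False, f"{name} exceeds {limit} distinct values; keep it on traces instead"
    return True, f"{name} has {len(distinct)} distinct values"


print(check_dimension("http.request.method", ["GET", "POST", "GET"]))
# → (True, 'http.request.method has 2 distinct values')
print(check_dimension("user_id", (f"user-{i}" for i in range(10_000))))
# → (False, 'user_id exceeds 100 distinct values; keep it on traces instead')
```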
Example 4: Reviewing Existing Helm Values
User: "Review this collector Helm values file and tell me what's risky."
AI Response:
- Audits cross-field contradictions, not just YAML syntax
- Compares memory_limiter settings to container memory limits
- Flags scaled tail_sampling without sticky routing
- Questions hostPort on gateway Deployments
- Calls out retry/queue durability gaps, unsafe RWX/EFS-backed file_storage, and rollout-setting conflicts
- Audits metrics temporality conversions such as deltatocumulative when restart or scaling behavior would make the state unsafe
- Corrects OTTL type mismatches and stale semantic convention keys
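One of these cross-field checks, comparing the memory_limiter limit against the container memory limit, might look like the sketch below. The `audit_memory_limiter` helper and the 80% headroom threshold are assumptions for illustration, not the skill's actual rule.

```python
def audit_memory_limiter(limit_mib: int, container_limit_mib: int,
                         max_ratio: float = 0.8) -> list[str]:
    """Return findings for a memory_limiter limit_mib versus the container limit.

    max_ratio is an assumed headroom threshold, not an official recommendation.
    """
    findings = []
    if limit_mib >= container_limit_mib:
        findings.append(
            f"limit_mib ({limit_mib}) is not below the container limit "
            f"({container_limit_mib} MiB): the pod can be OOM-killed "
            "before the limiter starts refusing data"
        )
    elif limit_mib > container_limit_mib * max_ratio:
        findings.append(
            f"limit_mib ({limit_mib}) leaves less than "
            f"{round((1 - max_ratio) * 100)}% headroom under the container limit "
            f"({container_limit_mib} MiB)"
        )
    return findings


for finding in audit_memory_limiter(limit_mib=2048, container_limit_mib=2048):
    print("RISK:", finding)
```

A YAML linter would pass both values; the risk only appears when the two fields are read together.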
See SKILL.md for the full list of progressive disclosure triggers, System 2 thinking signals, core principles, and production-ready configuration defaults.
Reference Documentation
Deep-dive guides are available in the references/ directory:
- ai-agents.md: AI agent observability patterns, per-agent setup guidance, dashboards, and operational caveats
- architecture.md: Deployment patterns, load balancing, Target Allocator
- collector.md: Pipeline anatomy, processor ordering, memory management
- instrumentation.md: SDKs, semantic conventions, cardinality management
- ottl.md: OpenTelemetry Transformation Language syntax, functions, patterns, and best practices
- platforms.md: FaaS (Lambda, Azure, GCP), client-side apps, serverless best practices
- sampling.md: Head vs tail, probabilistic strategies, sticky sessions
- security.md: PII redaction, TLS, extension security
- monitoring.md: Collector metrics, dashboards, alerts
- playbooks.md: Reusable production playbooks distilled from OpenTelemetry blog posts and real-world deployment stories
Contrib Components & Example Configs
The OpenTelemetry Collector Contrib repository contains extended components and curated example configurations. Always verify component stability and pin to released versions (e.g., v0.100.0+) instead of main.
Stability & Registry
- VERSIONING.md: Component stability matrix (Stable/Beta/Alpha/Development)
Component Directories
- Receivers: Entry points for telemetry data
- Processors: Transform, filter, and enrich data
- Exporters: Send data to backends
Key Components (Production-Ready)
- transformprocessor: Apply OTTL transformations
- filterprocessor: Drop spans/metrics based on conditions
- k8sattributesprocessor: Enrich with Kubernetes metadata
- tailsamplingprocessor: Intelligent sampling decisions
- filelogreceiver: Read logs from disk
- loadbalancingexporter: Route to multiple backends with consistent hashing
Example Configurations
- examples/: Curated collector configurations
- Gateway deployments with tail sampling
- Agent/DaemonSet configurations for Kubernetes
- Logging and filelog receiver examples
- Kubernetes attribute enrichment patterns
Best Practice: Always pin to released tags matching your collector version (e.g., v0.100.0+) instead of using main branch for production stability.
Testing & Validation
This skill includes a comprehensive test and validation framework following TDD (Test-Driven Development) principles:
- Structured Tessl evals live in evals/ and are the canonical published scenarios
- tests/baseline-scenarios.md: RED phase support document for baseline behavior capture
- tests/compliance-verification.md: GREEN phase support document for verifying behavior changes
- tests/rationalization-table.md: REFACTOR phase log of agent rationalizations and counters
The testing framework validates that the skill actually changes AI behavior and doesn't allow common anti-patterns. GitHub Actions automatically validates skill structure and content on every change, and the Tessl report workflow posts best-practice review feedback on every pull request.
An additional GitHub Agentic Workflow (.github/workflows/otel-upstream-maintenance.yml) runs weekly to create an upstream maintenance digest issue with recent OpenTelemetry GitHub issues, releases, and blog/community updates for practical repository refreshes.
Contributing
This skill is designed to evolve with the OpenTelemetry ecosystem. Contributions are welcome:
1. Update Reference Docs: As OTel features stabilize, update stability warnings
2. Add Patterns: New deployment architectures (e.g., eBPF-based collection)
3. Expand Examples: Language-specific SDK patterns
4. Improve Triggers: Refine the progressive disclosure logic
Known Limitations
- AI agent trace coverage varies: Claude Code does not emit traces natively; observability relies on opentelemetry-hooks or native logs/metrics. Each agent has different signal coverage.
- Tail sampling memory: Scales with in-flight trace count. Beyond 10k RPS, consider tiered architecture (Agent -> Gateway -> Analysis) rather than single-collector tail sampling.
- OTTL regex transforms: Can impact p99 latency at high span volume. Profile with production traffic before deploying regex-heavy transformations.
- Semantic conventions are evolving: The gen_ai.* namespace is experimental. Attribute names may change in future OpenTelemetry releases.
- Kubernetes version requirements: Native sidecar containers (init containers with restartPolicy: Always) require Kubernetes v1.28+ with the SidecarContainers feature gate, which is enabled by default from v1.29. Earlier versions need traditional sidecar patterns.
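The tail sampling memory limitation above can be put into rough numbers: traces held in memory are approximately the trace arrival rate times the processor's decision_wait. The sketch below is a back-of-the-envelope estimate; the helper name, the per-span size, and the traffic figures are assumptions, and real usage also depends on attribute sizes and collector overhead.

```python
def tail_sampling_memory_bytes(traces_per_second: float,
                               decision_wait_seconds: float,
                               spans_per_trace: float,
                               bytes_per_span: float) -> int:
    """Estimate memory held by traces that are waiting for a sampling decision."""
    in_flight_traces = traces_per_second * decision_wait_seconds
    return int(in_flight_traces * spans_per_trace * bytes_per_span)


# Assumed workload: 10,000 traces/s, 10 s decision wait,
# 8 spans per trace, about 1 KiB per span.
estimate = tail_sampling_memory_bytes(10_000, 10, 8, 1024)
print(f"{estimate / 1024**3:.2f} GiB")  # → 0.76 GiB
```

Doubling either the arrival rate or decision_wait doubles the estimate, which is why a single collector stops being practical at high request rates.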
Roadmap
- Expand AI agent observability coverage as new agents ship native telemetry (Qwen Code, Windsurf, Zed)
- Track OpenTelemetry semantic convention releases for gen_ai namespace stabilization
- Add cost optimization patterns for high-volume agent deployments
- Expand production playbook coverage with new upstream blog posts
- Add eBPF-based collection patterns for auto-instrumentation
- Collector processor stability matrix tracking across releases
Compatibility
Compatibility details move faster than the cognitive-router guidance in SKILL.md. See references/compatibility.md for the current version floors and AI agent support notes.
License
This skill is licensed under the Apache License 2.0. See LICENSE for details.
The OpenTelemetry project itself is a CNCF project licensed under Apache 2.0.
Acknowledgments
- OpenTelemetry Community: For building the foundational observability standard
- monitoringartist: For the collector monitoring dashboards and patterns
Related Projects
- OpenTelemetry Collector - The core collector
- OpenTelemetry Operator - Kubernetes operator for OTel
---
Transform your AI into an observability-focused assistant. Production-ready. AI-agent aware.
Deploy with confidence. Observe with precision.





