Elastic ML Anomaly Detection

Single skill covering all anomaly detection work against Kibana Agent Builder MCP at {KIBANA_URL}/api/agent_builder/mcp. Use the Mode Selector below to pick the right approach for the user's question — modes share the same tool surface and concepts.

Platform

Read path: ES|QL against .ml-anomalies-*, .ml-config, .ml-notifications-*, .ml-annotations-*
Always-available: platform.core.execute_esql (plus additional platform tools for search, index mapping, and documentation — see scripts/agent_builder_constants.json)
ML API spec (if available): .kibana_ai_openapi_spec_elasticsearch — see references/anomaly-detection-openapi-spec-discover.md for the discovery pattern.

Run ad_validate_ml_tool_permissions first when tools return empty/misleading results — missing privileges are the most common cause of false negatives. Full permissions matrix: references/permissions-matrix.md.

Mode Selector

User intent	Mode
"What broke?" / RCA / cross-job / blast radius / influencers / log categories	Investigate
"Why score high/low?" / renormalization / model bounds / forecasts	Explain
Missing docs / memory limit / datafeed stopped / CCS / lifecycle / calendars	Troubleshoot
Create a job / configure a datafeed / start analysis / retrieve results	Manage
Security framing (attack chains, MITRE, exfil)	Investigate + references/security-anomaly-expert.md
Observability/SRE framing (degradation, capacity, deployment regression)	Investigate + references/observability-anomaly-expert.md

When a question spans modes: Investigate → Explain → Troubleshoot. Don't blend mode logic — finish one before moving on.

---

Score Quick Reference

record_score bands: >75 critical · 50–75 warning · 25–50 minor · <25 informational
multi_bucket_impact ≥ 3 → sustained shift (not a transient spike)
initial_record_score >> record_score → renormalization (model saw worse anomalies later)
actual << typical with count/low_count/low_mean → absence/outage, not just low value
Low scores across many jobs > one high score — composite cross-job signal often beats single-detector severity

Full score definitions, renormalization mechanics, and anomaly_score_explanation components: references/score-reference.md.
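
The bands above can be sketched as a small helper. This is a minimal illustration, not part of the skill's tool surface: the function names are hypothetical, and how the exact boundaries (75, 50, 25) are assigned is an assumption, since the quick reference does not say which band owns them.

```python
def score_band(record_score: float) -> str:
    """Map a record_score (0-100) to the band names used in this skill.

    Assumption: boundary values fall into the lower-severity band for 75
    and the higher-severity band for 50 and 25.
    """
    if record_score > 75:
        return "critical"
    if record_score >= 50:
        return "warning"
    if record_score >= 25:
        return "minor"
    return "informational"


def is_sustained(multi_bucket_impact: float) -> bool:
    """multi_bucket_impact of 3 or more signals a sustained shift, not a spike."""
    return multi_bucket_impact >= 3
```

For example, `score_band(82)` falls in the critical band, while a record with `multi_bucket_impact` of 1 would be treated as a transient spike.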

Core concepts

Treat .ml-anomalies-* as three layers, accessed via result_type:

bucket — bucket-level unusualness per bucket_span. anomaly_score is the aggregate across all detectors.
record — finest-grained rows with actual vs typical, probability, record_score, anomaly_score_explanation.

influencer — entity contributions ranked within a bucket (influencer_score).

Read scores this way:

anomaly_score / record_score = current normalized values (move as the model sees new extremes).
initial_anomaly_score / initial_record_score = immutable snapshots from detection time.
Compare actual to typical; use probability for raw likelihood.
Map entities via partition_field_value / by_field_value / over_field_value.
Read multi_bucket_impact (-5 to +5) to separate single-bucket spikes from sustained trends.
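
To make the record layer concrete, here is a hedged sketch that builds an ES|QL query string for the top records of one job. The helper name and its defaults are hypothetical; the field names are the ones listed above, and the sketch assumes they are all mapped in the target cluster. The packaged ES|QL tools may use different templates (see references/investigate-anomaly-esql-tools.md).

```python
def build_record_query(job_id: str, min_score: float = 25, limit: int = 20) -> str:
    """Return an ES|QL query for the highest-scoring record results of one job."""
    # Refuse characters that would break out of the quoted ES|QL string literal.
    if '"' in job_id or "\\" in job_id:
        raise ValueError("job_id must not contain quotes or backslashes")
    return "\n".join([
        "FROM .ml-anomalies-*",
        f'| WHERE result_type == "record" AND job_id == "{job_id}"',
        f"| WHERE record_score >= {min_score}",
        "| KEEP timestamp, job_id, record_score, initial_record_score, "
        "actual, typical, probability, multi_bucket_impact",
        "| SORT record_score DESC",
        f"| LIMIT {limit}",
    ])


print(build_record_query("nginx-error-rate", min_score=50, limit=5))
```

The resulting string is the kind of query that could be handed to platform.core.execute_esql; swapping `"record"` for `"bucket"` or `"influencer"` (and adjusting the kept fields) targets the other two layers.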

---

Mode: Investigate — RCA

When: "what broke?", "which entity caused this?", cross-job correlation, blast radius, attack/cascade chains.

Tool chain

Phase	Tools
Discovery	`ad_get_available_metadata`, `ad_get_jobs`, `ad_discover_related_jobs`, `ad_discover_jobs_by_datafeed_index`
Timeline / scope	`ad_query_anomaly_timeline`
Cross-job / entities	`ad_rca_cross_job_entity_match`, `ad_rca_multi_job_entities`, `ad_rca_entity_profile`
Records / influencers	`ad_query_anomaly_records`, `ad_query_influencers`
RCA depth	`ad_rca_detector_fingerprint`, `ad_rca_correlation`, `ad_rca_blast_radius`, `ad_rca_score_reassessment`
Evidence / categories	`ad_get_job_datafeed_config`, `ad_rca_source_evidence`, `ad_get_categories`, `ad_search_log_category_examples`

Protocol

Follow the 14-step sequence in references/protocols/investigation.md. High level: ad_get_available_metadata → pair ad_discover_jobs_by_datafeed_index with ad_discover_related_jobs → ad_query_anomaly_timeline → rank with ad_rca_multi_job_entities (min_job_count=2) → ad_rca_detector_fingerprint → drill with ad_query_anomaly_records + ad_query_influencers (low min_score=25) → profile with ad_rca_entity_profile → order with ad_rca_correlation → confirm with ad_rca_source_evidence. When by_field_name == "mlcategory", compare with ad_get_categories + paired ad_search_log_category_examples (baseline vs. anomaly window).

Finish with a written RCA: root cause entity · affected jobs · temporal progression · fault class (resource/network/application) · severity · recommended actions. Worked example: references/worked-example.md. Full ES|QL templates and parameters: references/investigate-anomaly-esql-tools.md.

Rules

Multi-job entities are prime suspects; single-job entities are usually victims. Use min_job_count=2.
Earliest anomaly timestamp wins — sort ad_rca_correlation by timestamp; first-appearing entity = origin.
multi_bucket_impact ≥ 3 = sustained behavioral shift, weight higher than transient spikes.
Never close an RCA without ad_rca_source_evidence — raw source documents are ground truth.
Use low min_score (25 or lower) for influencer queries — high thresholds miss correlated entities.
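
The first two rules can be sketched in plain Python over already-fetched records. This is an illustration of the ranking idea only, not how ad_rca_multi_job_entities or ad_rca_correlation are implemented; the input keys `entity`, `job_id`, and `timestamp` are hypothetical names chosen for the example.

```python
from collections import defaultdict


def rank_suspects(records, min_job_count=2):
    """Keep entities seen in at least min_job_count jobs, earliest first.

    records: iterable of dicts with hypothetical keys
    'entity', 'job_id', and 'timestamp' (epoch milliseconds).
    """
    jobs = defaultdict(set)
    first_seen = {}
    for rec in records:
        entity = rec["entity"]
        jobs[entity].add(rec["job_id"])
        first_seen[entity] = min(first_seen.get(entity, rec["timestamp"]), rec["timestamp"])
    suspects = [
        {"entity": e, "job_count": len(j), "first_seen": first_seen[e]}
        for e, j in jobs.items()
        if len(j) >= min_job_count
    ]
    # Earliest anomaly wins; more jobs breaks ties.
    return sorted(suspects, key=lambda s: (s["first_seen"], -s["job_count"]))
```

An entity that appears in only one job is dropped from the ranking, matching the "usually victims" rule; the first element of the result is the candidate origin, which still needs confirmation against source documents.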

---

Mode: Explain — Score / model behavior

When: "why is my score 30/90?", "score dropped overnight", "what is renormalization?", "why wasn't this detected?".

Score types

Field	Scope	Meaning
`record_score`	Single record	Normalized severity after renormalization.
`initial_record_score`	Single record	Score at detection time. Gap vs `record_score` = renormalization drift.
`anomaly_score`	Bucket	Aggregate severity across all detectors in a bucket.
`influencer_score`	Entity × bucket	How anomalous a specific entity is in that bucket.

`anomaly_score_explanation` components

Component	Effect	What it means
`anomaly_length`	↑ score	More consecutive anomalous buckets
`single_bucket_impact`	↑ score	Lower probability → higher impact
`multi_bucket_impact`	↑ score	Sustained pattern contribution
`anomaly_characteristics_impact`	↑ score	Mean shift vs. variance change
`high_variance_penalty`	↓ score	Noisy data → wide bounds → anomaly less surprising
`incomplete_bucket_penalty`	↓ score	Bucket has less data than expected (ingest lag, sparse data)

Why a score looks wrong

Unexpectedly low: high_variance_penalty, renormalization, <3 weeks training for weekly seasonality, bucket_span too large, wrong detector function (mean vs high_mean), incomplete_bucket_penalty, suppression by custom_rules.

Unexpectedly high: insufficient history (early training over-flags), high-cardinality split (too few points per entity), use_null: true on a sparse field.

Tool chain

Purpose	Tools
Records + explanation	`ad_query_anomaly_records` (exact `job_id_pattern`)
Renormalization drift	`ad_rca_score_reassessment` (`score_drift = initial_record_score - record_score`)
Model bounds (visual)	`ad_get_model_plot` — actual outside `model_lower`/`model_upper` = anomaly
Forecast overlap	`ad_get_forecast_results`
Influencer attribution	`ad_query_influencers`
Config & detector	`ad_get_job_datafeed_config` — `bucket_span`, function, `custom_rules`, `use_null`
Categorization	`ad_get_categories`
Model snapshots	`ad_get_model_snapshots`
Structured diagnostic	`ad_wf_troubleshoot_anomaly_score` (full decision tree)

Decision tree (`ad_wf_troubleshoot_anomaly_score`)

ad_get_jobs — ≥3 weeks data for weekly seasonality?
ad_ts_model_memory_health — memory_status healthy?
ad_ts_delayed_data_annotations — no incomplete buckets?
ad_query_anomaly_records — compare record_score vs initial_record_score.
ad_get_job_datafeed_config — bucket_span, detector function, custom_rules, use_null.
ad_get_model_plot — wide bounds → high_variance_penalty.
ad_rca_score_reassessment — renormalization drift across history.
Explain anomaly_score_explanation factors.

Rules

Always show both initial_record_score and record_score — the gap is the renormalization story.
Explain renormalization before diagnosing config — score drift is the most common "score dropped" cause and needs no config change.

actual << typical with count/low_count is an absence anomaly — distinguish outages from value spikes.
high_variance_penalty and incomplete_bucket_penalty explain most "low score" surprises without remediation.
Weekly seasonality needs ≥3 weeks of training data — flag young jobs as the cause.
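
As a rough sketch of the first and third rules, the helper below computes renormalization drift and flags absence anomalies for a single record. The thresholds (`drift_threshold`, `absence_ratio`) are illustrative assumptions, not values defined by Elastic or by this skill, and `actual`/`typical` are treated as single numbers for simplicity.

```python
# Detector functions for which "actual far below typical" means missing data.
ABSENCE_FUNCTIONS = {"count", "low_count", "low_mean"}


def explain_record(record, drift_threshold=20.0, absence_ratio=0.1):
    """Return (score_drift, notes) for one anomaly record given as a dict."""
    notes = []
    # Same definition as ad_rca_score_reassessment: initial minus current.
    drift = record["initial_record_score"] - record["record_score"]
    if drift >= drift_threshold:
        notes.append(f"renormalized: score dropped {drift:.0f} points since detection")
    if (record["function"] in ABSENCE_FUNCTIONS
            and record["actual"] <= absence_ratio * record["typical"]):
        notes.append("absence anomaly: actual is far below typical")
    return drift, notes
```

A record that was scored 90 at detection and now reads 55 yields a drift of 35, which is the number to show the user next to both scores before any configuration is examined.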

For detector function selection details, see references/anomaly-detection-functions.md.

---

Mode: Troubleshoot — Job ops

When: "missing documents", "datafeed stopped", "hard_limit", "results look wrong", lifecycle changes, calendars, CCS.

Common issues → fast paths

Issue	Fast path	Full decision tree
Missing docs / `query_delay` warning	`ad_ts_delayed_data_annotations` → `ad_ts_bucket_event_gaps` → `ad_ts_ingest_latency_estimate` → `ad_update_datafeed_query_delay`	`ad_wf_troubleshoot_query_delay`
Memory `soft_limit` / `hard_limit`	`ad_ts_model_memory_health` → `ad_wf_ts_field_cardinality` → `ad_estimate_memory_requirement` → `ad_update_model_memory_limit`	`ad_wf_troubleshoot_memory_limit`
Datafeed not running / job state	`ad_get_jobs` (state) → `ad_get_job_messages` → `ad_manage_datafeed`	—
CCS / `remote_cluster:` indices	`ad_ts_ccs_diagnostics`	—
Score sanity check	—	`ad_wf_troubleshoot_anomaly_score`

hard_limit corrupts model state and causes downstream missing-doc false alarms (categorizer silently skips events for unknown categories). Fix memory before fixing query_delay.

Memory concepts

Field	Meaning
`model_bytes`	Current memory used
`peak_model_bytes`	High-water mark since job opened
`model_bytes_memory_limit`	Configured `model_memory_limit`
`memory_status`	`ok` / `soft_limit` (pruning) / `hard_limit` (critical)
`total_by_field_count > 100k`	`by_field` cardinality too high — dominant driver
`total_partition_field_count > 10k`	Partition explosion
`total_category_count > 10k`	Too many distinct log patterns

Prefer ad_estimate_memory_requirement (samples cardinality from source, calls Estimate Model Memory API) over heuristics like peak_model_bytes * 1.3 — the heuristic ignores pure influencer and categorization memory.
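
The thresholds in the memory concepts table can be applied mechanically to a model_size_stats document. The following is a minimal sketch under the assumption that the stats have already been fetched into a plain dict; the function name is hypothetical and it is not a substitute for ad_ts_model_memory_health.

```python
# Thresholds copied from the memory concepts table above.
CARDINALITY_LIMITS = {
    "total_by_field_count": (100_000, "by_field cardinality too high"),
    "total_partition_field_count": (10_000, "partition explosion"),
    "total_category_count": (10_000, "too many distinct log patterns"),
}


def memory_findings(stats: dict) -> list[str]:
    """List likely memory drivers found in a model_size_stats-style dict."""
    findings = []
    status = stats.get("memory_status", "ok")
    if status != "ok":
        findings.append(f"memory_status is {status}")
    for field, (limit, message) in CARDINALITY_LIMITS.items():
        # Missing counters are treated as zero rather than as an error.
        if stats.get(field, 0) > limit:
            findings.append(f"{field} > {limit}: {message}")
    return findings
```

The output only points at a likely driver; sizing the new limit is still a job for ad_estimate_memory_requirement.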

Datafeed & timing concepts

query_delay — how far behind real time the datafeed queries. Too small → missing docs; too large → slower alerts. Set to P95 ingest latency + buffer (default 60s–120s).

delayed_data_check_config — how aggressively the datafeed checks for late data.
bucket_span — analysis interval. Align with data granularity and detection window.
frequency — interval between datafeed searches while running in real time. Defaults to the bucket span for short spans, or a fraction of it for longer spans.
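
The "P95 ingest latency + buffer" guidance for query_delay can be sketched as follows. This is an illustration only: the function name, the 30-second buffer, and the 60-second floor are assumptions, and the latency samples are presumed to come from something like ad_ts_ingest_latency_estimate.

```python
import math


def recommend_query_delay(latencies_s, buffer_s=30, floor_s=60):
    """Suggest a query_delay string from observed ingest latencies in seconds."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    # Nearest-rank 95th percentile, computed in integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    p95 = ordered[rank - 1]
    delay = max(floor_s, math.ceil(p95 + buffer_s))
    return f"{delay}s"
```

With latencies spread evenly from 1 to 100 seconds, the P95 is 95 seconds and the suggestion is `"125s"`; with only a few fast samples the floor keeps the result at `"60s"`.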

Lifecycle for config changes (memory limit, query_delay)

Stop datafeed: ad_manage_datafeed (action=_stop)
Close job
Update config: ad_update_model_memory_limit, ad_update_datafeed_query_delay, ad_update_delayed_data_check_config

Open job: ad_open_job
Start datafeed: ad_manage_datafeed (action=_start)
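
For orientation, the five steps above correspond to standard Elasticsearch ML REST calls. The sketch below only builds the ordered plan as data; it does not send anything. The function name is hypothetical, and the assumption that the ad_* tools wrap exactly these endpoints is mine, not something the tool reference states here.

```python
def config_change_plan(job_id, datafeed_id, model_memory_limit=None, query_delay=None):
    """Return ordered (method, path, body) steps for a safe config change."""
    steps = [
        ("POST", f"_ml/datafeeds/{datafeed_id}/_stop", None),
        ("POST", f"_ml/anomaly_detectors/{job_id}/_close", None),
    ]
    if model_memory_limit is not None:
        body = {"analysis_limits": {"model_memory_limit": model_memory_limit}}
        steps.append(("POST", f"_ml/anomaly_detectors/{job_id}/_update", body))
    if query_delay is not None:
        steps.append(("POST", f"_ml/datafeeds/{datafeed_id}/_update", {"query_delay": query_delay}))
    steps += [
        ("POST", f"_ml/anomaly_detectors/{job_id}/_open", None),
        ("POST", f"_ml/datafeeds/{datafeed_id}/_start", None),
    ]
    return steps


for method, path, body in config_change_plan("web-logs", "datafeed-web-logs",
                                             model_memory_limit="512mb"):
    print(method, path, body or "")
```

Building the plan first makes it easy to show the user the full sequence for approval before any step runs.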

Recover a corrupted period without resetting the whole model: ad_revert_model_snapshot.

Tool surface

Category	Tools
Permissions / metadata	`ad_validate_ml_tool_permissions`, `ad_get_available_metadata`, `ad_get_jobs`
Job + datafeed state	`ad_get_job_datafeed_config`, `ad_get_job_messages`, `ad_manage_datafeed`, `ad_preview_datafeed_with_latency`
Timing / missing docs	`ad_ts_delayed_data_annotations`, `ad_ts_bucket_event_gaps`, `ad_ts_ingest_latency_estimate`, `ad_update_datafeed_query_delay`, `ad_update_delayed_data_check_config`, `ad_wf_troubleshoot_query_delay`
Memory	`ad_ts_model_memory_health`, `ad_wf_ts_field_cardinality`, `ad_estimate_memory_requirement`, `ad_update_model_memory_limit`, `ad_wf_troubleshoot_memory_limit`
Model / lifecycle	`ad_get_model_snapshots`, `ad_revert_model_snapshot`, `ad_open_job`, `ad_create_job`
CCS	`ad_ts_ccs_diagnostics`
Calendars	`ad_get_calendar_events`, `ad_create_calendar_event`

Full parameter tables, ES|QL templates, and REST step lists: references/troubleshoot-anomaly-tool-reference.md.

Rules

ad_validate_ml_tool_permissions first — missing privileges produce misleading empty results.
Fix memory before query_delay — hard_limit corrupts state; query_delay fixes on a memory-limited job are wasted.

Stop the datafeed before updating it. Updating a running datafeed is rejected.
Close the job before updating memory limit. Sequence above.
Prefer workflow tools (ad_wf_*) over manually chaining diagnostics for complex decisions.
ad_preview_datafeed_with_latency before starting — confirm the datafeed returns data after config changes.

---

Mode: Manage — Create / configure jobs

When: "set up a job", "create an ML detector", "monitor X over time", "detect rare/unusual/anomalous values".

4-step workflow

PUT  _ml/anomaly_detectors/<job_id>          # 1. Define job        (ad_create_job)
PUT  _ml/datafeeds/datafeed-<job_id>         # 2. Define datafeed   (ad_create_datafeed)
POST _ml/anomaly_detectors/<job_id>/_open    # 3a. Open job         (ad_open_job)
POST _ml/datafeeds/datafeed-<job_id>/_start  # 3b. Start datafeed   (ad_manage_datafeed action=_start)
GET  _ml/anomaly_detectors/<job_id>/results/records  # 4. Read results

Process

Build configs. Parse the user request into job + datafeed JSON with no null fields.
Apply smart defaults:

Field	Default	Override when
`bucket_span`	`"15m"`	User specifies a different span
`time_field`	`"@timestamp"`	User names a different timestamp field
`index`	`"logs-*"`	User specifies an index or pattern
`datafeed_query`	`{"match_all": {}}`	User mentions filters, processes, or time windows
`influencers`	by/over/partition fields from detectors	User adds extra influencer fields
`job_id`	Generated from user description	User provides an explicit ID
`query_delay`	`"60s"`	P95 ingest latency is higher

Choose detector function from user intent — full table in references/anomaly-detection-functions.md:

"high CPU" / "unusually large" → high_mean or high_sum
"rare logins" / "unusual values" → rare (variants below)
"too many requests" / "spike in count" → high_count

rare variants:

Infrequent globally → rare by_field_name: X
Infrequent vs peers → rare by_field_name: X over_field_name: Y
Infrequent per segment → rare by_field_name: X partition_field_name: Y
Infrequent per segment vs peers → rare by_field_name: X over_field_name: Y partition_field_name: Z

Validate. platform.core.get_index_mapping on the target index to verify field existence/types → ad_validate_job_spec. If errors, fix and re-validate (max 3 attempts).

Present and confirm. Show the complete job + datafeed bodies formatted as the exact API calls. Ask for approval once. If feedback, incorporate and re-present (up to 3 rounds).

Deploy. After confirmation: ad_create_job → ad_create_datafeed → ad_open_job → ad_manage_datafeed (action=_start). Report final job_id and datafeed_id.

For batch analysis on historical data, pass start and end to the datafeed start call.

Worked examples (rare-username, DNS exfil, large-downloads) with full JSON bodies and datafeed filters: references/job-creation-recipes.md.
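
As a hedged sketch of the "build configs" and "apply smart defaults" steps, the helper below assembles job and datafeed bodies using the defaults from the table above and the standard Elasticsearch field names (`analysis_config`, `data_description`, `indices`). The helper itself is hypothetical; the recipes file remains the reference for real bodies.

```python
def build_job_and_datafeed(job_id, function, *, field_name=None, by_field_name=None,
                           over_field_name=None, partition_field_name=None,
                           index="logs-*", bucket_span="15m",
                           time_field="@timestamp", query=None, query_delay="60s"):
    """Return (job_body, datafeed_body) dicts with the skill's default values."""
    split_fields = {
        "by_field_name": by_field_name,
        "over_field_name": over_field_name,
        "partition_field_name": partition_field_name,
    }
    detector = {"function": function}
    if field_name is not None:
        detector["field_name"] = field_name
    # Only set keys that have values, so the bodies carry no null fields.
    detector.update({k: v for k, v in split_fields.items() if v is not None})
    # Default influencers: the by/over/partition fields of the detector.
    influencers = [v for v in split_fields.values() if v is not None]
    job = {
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": [detector],
            "influencers": influencers,
        },
        "data_description": {"time_field": time_field},
    }
    datafeed = {
        "job_id": job_id,
        "indices": [index],
        "query": query or {"match_all": {}},
        "query_delay": query_delay,
    }
    return job, datafeed
```

For the "infrequent per segment" rare variant, `build_job_and_datafeed("rare-user-per-host", "rare", by_field_name="user.name", partition_field_name="host.name")` produces a detector with both split fields and lists them as influencers. The field names in that call are examples only and should be checked against the index mapping during the Validate step.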

Rules

Create job before datafeed. Datafeed references job by ID.
Open job before starting datafeed. Start on a closed job is rejected.
query_delay = P95 ingest latency + buffer (60s–120s safe default).
Forecasts require non-population jobs — over_field_name jobs cannot be forecasted; warn before attempting.
by_field_name vs over_field_name: by compares entity to its own history; over compares to peer group in the same bucket. partition_field_name = fully independent sub-model with its own normalization.
bucket_span matches detection granularity — 15m for high-frequency, 1h for operational metrics, 1d for daily patterns. Larger smooths short spikes; smaller increases noise.
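
The forecast rule can be checked before a forecast is attempted. A minimal sketch, assuming the job configuration has been fetched as a dict in the standard `analysis_config.detectors` shape (the function name is hypothetical):

```python
def can_forecast(job_config: dict) -> bool:
    """Return False for population jobs, i.e. any detector with over_field_name."""
    detectors = job_config.get("analysis_config", {}).get("detectors", [])
    return not any(d.get("over_field_name") for d in detectors)
```

If this returns False, warn the user that the job is a population job instead of calling the forecast API and surfacing its error.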

---

Registration (Kibana Agent Builder)

Requires Node.js 18+. Defaults to elastic/changeme when no credentials are supplied.

cd skills/kibana/kibana-anomaly-detection

# tools → workflows → skills
node scripts/kibana-agent-builder.mjs all register --kibana-url http://localhost:5601

# HTTPS with self-signed cert
node scripts/kibana-agent-builder.mjs all register --kibana-url https://localhost:5601 --insecure

all register runs tools register, then workflows register, then skills register. Kibana allows at most five tool_ids per skill; the script fills them by scanning SKILL.md for tool mentions (in document order), then appends ids from references/kibana/tools/esql/*.json until the cap (workflow-only tools omitted by default). If you run skills register alone, run tools register first so those ids exist.
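
The selection rule described above (mentions in document order first, then fallback ids up to the cap) could look roughly like this. It is a sketch of the described behavior, not the code in scripts/kibana-agent-builder.mjs; the function and parameter names are hypothetical.

```python
import re


def select_tool_ids(skill_text, registered_ids, fallback_ids, cap=5):
    """Pick up to `cap` tool ids: first by order of mention, then from fallbacks."""
    positions = {}
    for tool_id in registered_ids:
        # Match the id as a whole token so ad_get_jobs does not match ad_get_jobs_x.
        match = re.search(rf"(?<![\w.]){re.escape(tool_id)}(?!\w)", skill_text)
        if match:
            positions[tool_id] = match.start()
    selected = sorted(positions, key=positions.get)[:cap]
    for tool_id in fallback_ids:
        if len(selected) >= cap:
            break
        if tool_id not in selected:
            selected.append(tool_id)
    return selected
```

Because order of first mention decides which ids make the cut, moving a tool name earlier in SKILL.md is the simplest way to influence which ids are attached to the skill.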

Workflow tool exclusions and prefixes live in scripts/agent_builder_constants.json.

MCP API key permissions:

Kibana: read_onechat, space_read
Index: read, view_index_metadata on .ml-anomalies-*, .ml-annotations-*, .ml-notifications-*, .ml-config
For source evidence: read on source data indices

---

Tool inventory

ES|QL tool specs live under references/kibana/tools/esql/*.json; workflow definitions under references/kibana/workflows/*.yaml. Each Mode section above lists the tools it uses. Full surface: references/tools.md (ES|QL) and references/workflow-tools.md (workflows).

Key system indices

Index	Relevant content
`.ml-anomalies-*`	`record`, `bucket`, `influencer`, `model_plot`, `model_forecast`, `model_snapshot`, `category_definition`, `model_size_stats`
`.ml-config`	job/datafeed documents (visible even for never-run jobs)
`.ml-annotations-*`	delayed data (`event == "delayed_data"`)
`.ml-notifications-*`	job messages (`level`: info/warning/error)

---

Examples

RCA: "Something caused a spike in our error rate at 2pm — what broke?" → Investigate → ad_get_available_metadata → ad_query_anomaly_timeline → ad_rca_cross_job_entity_match → ad_rca_multi_job_entities → RCA report.

Score drop: "My anomaly score went from 90 to 55 — did the model change?" → Explain → ad_rca_score_reassessment for drift → explain renormalization if score_drift is large.

Memory limit: "Job status shows hard_limit and results look wrong." → Troubleshoot → ad_ts_model_memory_health → ad_wf_ts_field_cardinality → ad_estimate_memory_requirement → ad_update_model_memory_limit (lifecycle: stop datafeed → close → update → open → start).

New job: "Detect unusual error rates per host on nginx access logs." → Manage → high_count detector with by_field_name: "host.keyword" → validate → present → deploy.

Multi-mode: "We had an incident last night, scores were high but now low — is the job healthy?" → Investigate the incident → Explain the score drift → Troubleshoot if hard_limit or delayed data is suspected.

---

Guidelines

Pick a mode first. Don't blend RCA logic with score-explanation logic in one response.
ad_validate_ml_tool_permissions first on empty results — privileges are the most common false-negative cause.
Score bands are absolute thresholds: >75 critical, 50–75 warning, 25–50 minor, <25 informational.
Multi-job entities are prime suspects. Use min_job_count=2 in ad_rca_multi_job_entities.
Show initial_record_score alongside record_score — the gap tells the renormalization story.
Fix memory before query_delay. hard_limit invalidates downstream diagnostics.
Stop datafeed → close job → update config → open job → start datafeed for any config change to memory or query delay.

Confirm RCAs with ad_rca_source_evidence. Raw source documents are ground truth.
