Elastic ML Anomaly Detection
Single skill covering all anomaly detection work against Kibana Agent Builder MCP at {KIBANA_URL}/api/agent_builder/mcp. Use the Mode Selector below to pick the right approach for the user's question — modes share the same tool surface and concepts.
Platform
- Read path: ES|QL against .ml-anomalies-*, .ml-config, .ml-notifications-*, .ml-annotations-*
- Always-available: platform.core.execute_esql (plus additional platform tools for search, index mapping, and documentation; see scripts/agent_builder_constants.json)
- ML API spec (if available): .kibana_ai_openapi_spec_elasticsearch; see references/anomaly-detection-openapi-spec-discover.md for the discovery pattern.
- Run ad_validate_ml_tool_permissions first when tools return empty or misleading results: missing privileges are the most common cause of false negatives. Full permissions matrix: references/permissions-matrix.md.
Mode Selector
| User intent | Mode |
|---|---|
| "What broke?" / RCA / cross-job / blast radius / influencers / log categories | Investigate |
| "Why score high/low?" / renormalization / model bounds / forecasts | Explain |
| Missing docs / memory limit / datafeed stopped / CCS / lifecycle / calendars | Troubleshoot |
| Create a job / configure a datafeed / start analysis / retrieve results | Manage |
| Security framing (attack chains, MITRE, exfil) | Investigate + references/security-anomaly-expert.md |
| Observability/SRE framing (degradation, capacity, deployment regression) | Investigate + references/observability-anomaly-expert.md |
When a question spans modes: Investigate → Explain → Troubleshoot. Don't blend mode logic — finish one before moving on.
---
Score Quick Reference
- record_score bands: >75 critical · 50–75 warning · 25–50 minor · <25 informational
- multi_bucket_impact ≥ 3 → sustained shift (not a transient spike)
- initial_record_score >> record_score → renormalization (model saw worse anomalies later)
- actual << typical with count / low_count / low_mean → absence/outage, not just low value
- Low scores across many jobs > one high score: a composite cross-job signal often beats single-detector severity

Full score definitions, renormalization mechanics, and anomaly_score_explanation components: references/score-reference.md.
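The bands above can be applied mechanically when summarizing results. The following is a minimal sketch, not part of the skill's tooling: the function name `score_band` is hypothetical, and the handling of the exact boundary values (75, 50, 25) is an assumption, since the reference only gives ranges.

```python
def score_band(record_score: float) -> str:
    """Map a record_score (0-100) to the band names used in this reference.

    Assumption: ">75" is strict, and the lower edge of each range is inclusive.
    """
    if record_score > 75:
        return "critical"
    if record_score >= 50:
        return "warning"
    if record_score >= 25:
        return "minor"
    return "informational"


# Hypothetical scores, for illustration only.
bands = [score_band(s) for s in (90.0, 75.0, 30.0, 10.0)]
```

Bands describe a single record. The cross-job rule in the last bullet still applies, so a cluster of "minor" records across several jobs may deserve more attention than one isolated "critical" record.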
Core concepts
Treat .ml-anomalies-* as three layers, accessed via result_type:
- bucket: bucket-level unusualness per bucket_span. anomaly_score is the aggregate across all detectors.
- record: finest-grained rows with actual vs typical, probability, record_score, anomaly_score_explanation.
- influencer: entity contributions ranked within a bucket (influencer_score).
Read scores this way:
- anomaly_score / record_score = current normalized values (move as the model sees new extremes).
- initial_anomaly_score / initial_record_score = immutable snapshots from detection time.
- Compare actual to typical; use probability for raw likelihood.
- Map entities via partition_field_value / by_field_value / over_field_value.
- Read multi_bucket_impact (-5 to +5) to separate single-bucket spikes from sustained trends.
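To make the result_type layering concrete, here is a hedged sketch that builds an ES|QL query string for the record layer. The helper name `build_records_query` and its defaults are hypothetical. The sketch only assembles the query text, and how it is passed to platform.core.execute_esql is not shown because that tool's parameters are not described here.

```python
def build_records_query(job_id: str, min_score: float = 25.0, limit: int = 20) -> str:
    """Build an ES|QL query over the record layer of .ml-anomalies-*.

    Hypothetical helper: it returns text only and does not execute anything.
    """
    # Escape backslashes and double quotes so the job ID stays inside the string literal.
    safe_job_id = job_id.replace("\\", "\\\\").replace('"', '\\"')
    return "\n".join([
        "FROM .ml-anomalies-*",
        f'| WHERE result_type == "record" AND job_id == "{safe_job_id}"',
        f"| WHERE record_score >= {min_score}",
        "| KEEP timestamp, job_id, record_score, initial_record_score, actual, typical, probability",
        "| SORT record_score DESC",
        f"| LIMIT {limit}",
    ])


# "nginx-errors" is a made-up job ID used for illustration.
query = build_records_query("nginx-errors")
```

Swapping the result_type literal for "bucket" or "influencer" (and the score field for anomaly_score or influencer_score) would target the other two layers.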
---
Mode: Investigate — RCA
When: "what broke?", "which entity caused this?", cross-job correlation, blast radius, attack/cascade chains.
Tool chain
| Phase | Tools |
|---|---|
| Discovery | ad_get_available_metadata, ad_get_jobs, ad_discover_related_jobs, ad_discover_jobs_by_datafeed_index |
| Timeline / scope | ad_query_anomaly_timeline |
| Cross-job / entities | ad_rca_cross_job_entity_match, ad_rca_multi_job_entities, ad_rca_entity_profile |
| Records / influencers | ad_query_anomaly_records, ad_query_influencers |
| RCA depth | ad_rca_detector_fingerprint, ad_rca_correlation, ad_rca_blast_radius, ad_rca_score_reassessment |
| Evidence / categories | ad_get_job_datafeed_config, ad_rca_source_evidence, ad_get_categories, ad_search_log_category_examples |
Protocol
Follow the 14-step sequence in references/protocols/investigation.md. High level: ad_get_available_metadata → pair ad_discover_jobs_by_datafeed_index with ad_discover_related_jobs → ad_query_anomaly_timeline → rank with ad_rca_multi_job_entities (min_job_count=2) → ad_rca_detector_fingerprint → drill with ad_query_anomaly_records + ad_query_influencers (low min_score=25) → profile with ad_rca_entity_profile → order with ad_rca_correlation → confirm with ad_rca_source_evidence. When by_field_name == "mlcategory", compare with ad_get_categories + paired ad_search_log_category_examples (baseline vs. anomaly window).
Finish with a written RCA: root cause entity · affected jobs · temporal progression · fault class (resource/network/application) · severity · recommended actions. Worked example: references/worked-example.md. Full ES|QL templates and parameters: references/investigate-anomaly-esql-tools.md.
Rules
- Multi-job entities are prime suspects; single-job entities are usually victims. Use min_job_count=2.
- Earliest anomaly timestamp wins: sort ad_rca_correlation by timestamp; first-appearing entity = origin.
- multi_bucket_impact ≥ 3 = sustained behavioral shift; weight it higher than transient spikes.
- Never close an RCA without ad_rca_source_evidence: raw source documents are ground truth.
- Use a low min_score (25 or lower) for influencer queries: high thresholds miss correlated entities.
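The first two rules can be illustrated offline. The sketch below assumes anomaly rows have already been fetched and flattened into dicts with `entity`, `job_id`, and `timestamp` keys. Those key names, the function `rank_suspects`, and the sample rows are all hypothetical and do not reflect the output shape of the ad_rca_* tools.

```python
from collections import defaultdict


def rank_suspects(records, min_job_count=2):
    """Keep entities seen in at least min_job_count jobs, earliest first.

    Returns (entity, job_count, first_seen) tuples.
    """
    jobs_by_entity = defaultdict(set)
    first_seen = {}
    for row in records:
        entity = row["entity"]
        jobs_by_entity[entity].add(row["job_id"])
        ts = row["timestamp"]
        if entity not in first_seen or ts < first_seen[entity]:
            first_seen[entity] = ts
    suspects = [
        (entity, len(jobs), first_seen[entity])
        for entity, jobs in jobs_by_entity.items()
        if len(jobs) >= min_job_count
    ]
    # Earliest first; ties broken by the number of jobs involved.
    return sorted(suspects, key=lambda s: (s[2], -s[1]))


# Made-up sample rows (timestamps are arbitrary integers).
sample = [
    {"entity": "db-01", "job_id": "cpu", "timestamp": 1000},
    {"entity": "db-01", "job_id": "latency", "timestamp": 1300},
    {"entity": "web-02", "job_id": "latency", "timestamp": 1200},
    {"entity": "web-02", "job_id": "errors", "timestamp": 1500},
    {"entity": "cache-03", "job_id": "errors", "timestamp": 900},
]
ranked = rank_suspects(sample)
```

In this sample, cache-03 has the earliest timestamp but appears in only one job, so it is filtered out and db-01 is ranked as the likely origin.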
---
Mode: Explain — Score / model behavior
When: "why is my score 30/90?", "score dropped overnight", "what is renormalization?", "why wasn't this detected?".
Score types
| Field | Scope | Meaning |
|---|---|---|
| record_score | Single record | Normalized severity after renormalization. |
| initial_record_score | Single record | Score at detection time. Gap vs record_score = renormalization drift. |
| anomaly_score | Bucket | Aggregate severity across all detectors in a bucket. |
| influencer_score | Entity × bucket | How anomalous a specific entity is in that bucket. |
anomaly_score_explanation components
| Component | Effect | What it means |
|---|---|---|
| anomaly_length | ↑ score | More consecutive anomalous buckets |
| single_bucket_impact | ↑ score | Lower probability → higher impact |
| multi_bucket_impact | ↑ score | Sustained pattern contribution |
| anomaly_characteristics_impact | ↑ score | Mean shift vs. variance change |
| high_variance_penalty | ↓ score | Noisy data → wide bounds → anomaly less surprising |
| incomplete_bucket_penalty | ↓ score | Bucket has less data than expected (ingest lag, sparse data) |
Why a score looks wrong
- Unexpectedly low: high_variance_penalty, renormalization, <3 weeks training for weekly seasonality, bucket_span too large, wrong detector function (mean vs high_mean), incomplete_bucket_penalty, suppression by custom_rules.
- Unexpectedly high: insufficient history (early training over-flags), high-cardinality split (too few points per entity), use_null: true on a sparse field.
Tool chain
| Purpose | Tools |
|---|---|
| Records + explanation | ad_query_anomaly_records (exact job_id_pattern) |
| Renormalization drift | ad_rca_score_reassessment (score_drift = initial_record_score - record_score) |
| Model bounds (visual) | ad_get_model_plot — actual outside model_lower/model_upper = anomaly |
| Forecast overlap | ad_get_forecast_results |
| Influencer attribution | ad_query_influencers |
| Config & detector | ad_get_job_datafeed_config — bucket_span, function, custom_rules, use_null |
| Categorization | ad_get_categories |
| Model snapshots | ad_get_model_snapshots |
| Structured diagnostic | ad_wf_troubleshoot_anomaly_score (full decision tree) |
Decision tree (ad_wf_troubleshoot_anomaly_score)
- ad_get_jobs: ≥3 weeks of data for weekly seasonality?
- ad_ts_model_memory_health: memory_status healthy?
- ad_ts_delayed_data_annotations: no incomplete buckets?
- ad_query_anomaly_records: compare record_score vs initial_record_score.
- ad_get_job_datafeed_config: bucket_span, detector function, custom_rules, use_null.
- ad_get_model_plot: wide bounds → high_variance_penalty.
- ad_rca_score_reassessment: renormalization drift across history.
- Explain anomaly_score_explanation factors.
Rules
- Always show both initial_record_score and record_score: the gap is the renormalization story.
- Explain renormalization before diagnosing config: score drift is the most common "score dropped" cause and needs no config change.
- actual << typical with count / low_count is an absence anomaly: distinguish outages from value spikes.
- high_variance_penalty and incomplete_bucket_penalty explain most "low score" surprises without remediation.
- Weekly seasonality needs ≥3 weeks of training data: flag young jobs as the cause.
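The drift arithmetic behind the first two rules is simple enough to show directly. This is an illustrative sketch: `score_drift` mirrors the formula quoted in the tool chain table (initial_record_score - record_score), while `describe_drift` and its 20-point threshold are assumptions made for the example, not values defined by the skill.

```python
def score_drift(record: dict) -> float:
    """Renormalization drift: initial_record_score minus the current record_score."""
    return record["initial_record_score"] - record["record_score"]


def describe_drift(record: dict, threshold: float = 20.0) -> str:
    """Phrase the drift for a report. The threshold is a hypothetical cut-off."""
    drift = score_drift(record)
    if drift >= threshold:
        return f"renormalized down by {drift:.0f} points; no config change implied"
    return f"drift of {drift:.0f} points; look at config and data quality next"


# The 90 -> 55 case from the Examples section of this document.
summary = describe_drift({"initial_record_score": 90.0, "record_score": 55.0})
```

Reporting both scores together with the drift keeps the explanation honest: the record was a 90 when detected and is a 55 relative to what the model has seen since.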
For detector function selection details, see references/anomaly-detection-functions.md.
---
Mode: Troubleshoot — Job ops
When: "missing documents", "datafeed stopped", "hard_limit", "results look wrong", lifecycle changes, calendars, CCS.
Common issues → fast paths
| Issue | Fast path | Full decision tree |
|---|---|---|
| Missing docs / query_delay warning | ad_ts_delayed_data_annotations → ad_ts_bucket_event_gaps → ad_ts_ingest_latency_estimate → ad_update_datafeed_query_delay | ad_wf_troubleshoot_query_delay |
| Memory soft_limit / hard_limit | ad_ts_model_memory_health → ad_wf_ts_field_cardinality → ad_estimate_memory_requirement → ad_update_model_memory_limit | ad_wf_troubleshoot_memory_limit |
| Datafeed not running / job state | ad_get_jobs (state) → ad_get_job_messages → ad_manage_datafeed | — |
| CCS / remote_cluster: indices | ad_ts_ccs_diagnostics | — |
| Score sanity check | — | ad_wf_troubleshoot_anomaly_score |
hard_limit corrupts model state and causes downstream missing-doc false alarms (categorizer silently skips events for unknown categories). Fix memory before fixing query_delay.
Memory concepts
| Field | Meaning |
|---|---|
| model_bytes | Current memory used |
| peak_model_bytes | High-water mark since job opened |
| model_bytes_memory_limit | Configured model_memory_limit |
| memory_status | ok / soft_limit (pruning) / hard_limit (critical) |
| total_by_field_count > 100k | by_field cardinality too high (dominant driver) |
| total_partition_field_count > 10k | Partition explosion |
| total_category_count > 10k | Too many distinct log patterns |
Prefer ad_estimate_memory_requirement (samples cardinality from source, calls Estimate Model Memory API) over heuristics like peak_model_bytes * 1.3 — the heuristic ignores pure influencer and categorization memory.
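As a reading aid for the memory concepts table, the sketch below turns a model_size_stats-style dict into a list of findings using the thresholds from that table. The function name `memory_findings`, the message wording, and the sample values are hypothetical. It is a triage aid only and does not replace ad_estimate_memory_requirement.

```python
def memory_findings(stats: dict) -> list[str]:
    """Flag memory_status and cardinality counts against the table's thresholds."""
    findings = []
    status = stats.get("memory_status", "ok")
    if status == "hard_limit":
        findings.append("hard_limit: fix memory before any other diagnosis")
    elif status == "soft_limit":
        findings.append("soft_limit: model is pruning, plan a limit increase")
    if stats.get("total_by_field_count", 0) > 100_000:
        findings.append("by_field cardinality above 100k")
    if stats.get("total_partition_field_count", 0) > 10_000:
        findings.append("partition count above 10k")
    if stats.get("total_category_count", 0) > 10_000:
        findings.append("category count above 10k")
    return findings


# Made-up stats for illustration.
findings = memory_findings({
    "memory_status": "hard_limit",
    "total_by_field_count": 250_000,
    "total_partition_field_count": 40,
    "total_category_count": 0,
})
```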
Datafeed & timing concepts
- query_delay: how far behind real time the datafeed queries. Too small → missing docs; too large → slower alerts. Set to P95 ingest latency + buffer (default 60s–120s).
- delayed_data_check_config: how aggressively the datafeed checks for late data.
- bucket_span: analysis interval. Align with data granularity and detection window.
- frequency: how often the datafeed runs its scheduled query in real time. Defaults to the bucket span for short bucket spans, or a fraction of it for longer ones.
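The "P95 ingest latency + buffer" rule for query_delay can be worked through with a small calculation. This is a sketch under stated assumptions: the latency samples are invented, the 30-second buffer and 60-second floor are illustrative choices, and the nearest-rank percentile is one of several reasonable definitions. In practice the latency figures would come from ad_ts_ingest_latency_estimate.

```python
import math


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("need at least one latency sample")
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def suggest_query_delay(latencies_s, buffer_s=30, floor_s=60):
    """Return a query_delay string: P95 latency plus a buffer, never below the floor."""
    seconds = max(floor_s, math.ceil(p95(latencies_s)) + buffer_s)
    return f"{seconds}s"


# Invented ingest latencies in seconds, with one slow outlier.
slow_pipeline = suggest_query_delay([5, 8, 12, 20, 45, 70, 90, 95, 110, 240])
fast_pipeline = suggest_query_delay(list(range(1, 21)))
```

With only ten samples the nearest-rank P95 lands on the largest value, so the slow outlier dominates the suggestion. A longer sampling window gives a steadier estimate.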
Lifecycle for config changes (memory limit, query_delay)
- Stop datafeed: ad_manage_datafeed (action=_stop)
- Close job
- Update config: ad_update_model_memory_limit, ad_update_datafeed_query_delay, ad_update_delayed_data_check_config
- Open job: ad_open_job
- Start datafeed: ad_manage_datafeed (action=_start)
Recover a corrupted period without resetting the whole model: ad_revert_model_snapshot.
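The lifecycle order above can be captured as an ordered plan before any tool is called. The sketch below is illustrative only. The function `config_change_plan` and the parameter names `job_id` and `datafeed_id` are assumptions, the `datafeed-<job_id>` naming follows the convention used in the Manage section, and the close step is labeled `close_job` because this document does not name a dedicated close tool (the underlying REST call is POST _ml/anomaly_detectors/<job_id>/_close).

```python
def config_change_plan(job_id: str, update_tools: list[str]) -> list[tuple[str, dict]]:
    """Return (step, arguments) pairs in the required lifecycle order."""
    datafeed_id = f"datafeed-{job_id}"  # assumed naming convention
    plan = [
        ("ad_manage_datafeed", {"datafeed_id": datafeed_id, "action": "_stop"}),
        ("close_job", {"job_id": job_id}),  # placeholder step, no tool named in this document
    ]
    # Updates run only while the datafeed is stopped and the job is closed.
    plan += [(tool, {"job_id": job_id}) for tool in update_tools]
    plan += [
        ("ad_open_job", {"job_id": job_id}),
        ("ad_manage_datafeed", {"datafeed_id": datafeed_id, "action": "_start"}),
    ]
    return plan


# "nginx-errors" is a made-up job ID.
plan = config_change_plan("nginx-errors", ["ad_update_model_memory_limit"])
```

Building the plan as data first makes it easy to show the sequence to the user for confirmation before anything is stopped.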
Tool surface
| Category | Tools |
|---|---|
| Permissions / metadata | ad_validate_ml_tool_permissions, ad_get_available_metadata, ad_get_jobs |
| Job + datafeed state | ad_get_job_datafeed_config, ad_get_job_messages, ad_manage_datafeed, ad_preview_datafeed_with_latency |
| Timing / missing docs | ad_ts_delayed_data_annotations, ad_ts_bucket_event_gaps, ad_ts_ingest_latency_estimate, ad_update_datafeed_query_delay, ad_update_delayed_data_check_config, ad_wf_troubleshoot_query_delay |
| Memory | ad_ts_model_memory_health, ad_wf_ts_field_cardinality, ad_estimate_memory_requirement, ad_update_model_memory_limit, ad_wf_troubleshoot_memory_limit |
| Model / lifecycle | ad_get_model_snapshots, ad_revert_model_snapshot, ad_open_job, ad_create_job |
| CCS | ad_ts_ccs_diagnostics |
| Calendars | ad_get_calendar_events, ad_create_calendar_event |
Full parameter tables, ES|QL templates, and REST step lists: references/troubleshoot-anomaly-tool-reference.md.
Rules
- ad_validate_ml_tool_permissions first: missing privileges produce misleading empty results.
- Fix memory before query_delay: hard_limit corrupts state; query_delay fixes on a memory-limited job are wasted.
- Stop the datafeed before updating it. Updating a running datafeed is rejected.
- Close the job before updating memory limit. Sequence above.
- Prefer workflow tools (ad_wf_*) over manually chaining diagnostics for complex decisions.
- ad_preview_datafeed_with_latency before starting: confirm the datafeed returns data after config changes.
---
Mode: Manage — Create / configure jobs
When: "set up a job", "create an ML detector", "monitor X over time", "detect rare/unusual/anomalous values".
4-step workflow
PUT _ml/anomaly_detectors/<job_id> # 1. Define job (ad_create_job)
PUT _ml/datafeeds/datafeed-<job_id> # 2. Define datafeed (ad_create_datafeed)
POST _ml/anomaly_detectors/<job_id>/_open # 3a. Open job (ad_open_job)
POST _ml/datafeeds/datafeed-<job_id>/_start # 3b. Start datafeed (ad_manage_datafeed action=_start)
GET _ml/anomaly_detectors/<job_id>/results/records # 4. Read results
Process
- Build configs. Parse the user request into job + datafeed JSON with no null fields.
- Apply smart defaults:
| Field | Default | Override when |
|---|---|---|
| bucket_span | "15m" | User specifies a different span |
| time_field | "@timestamp" | User names a different timestamp field |
| index | "logs-*" | User specifies an index or pattern |
| datafeed_query | {"match_all": {}} | User mentions filters, processes, or time windows |
| influencers | by/over/partition fields from detectors | User adds extra influencer fields |
| job_id | Generated from user description | User provides an explicit ID |
| query_delay | "60s" | P95 ingest latency is higher |
- Choose detector function from user intent (full table in references/anomaly-detection-functions.md):
  - "high CPU" / "unusually large" → high_mean or high_sum
  - "rare logins" / "unusual values" → rare (variants below)
  - "too many requests" / "spike in count" → high_count

  rare variants:
  - Infrequent globally → rare, by_field_name: X
  - Infrequent vs peers → rare, by_field_name: X, over_field_name: Y
  - Infrequent per segment → rare, by_field_name: X, partition_field_name: Y
  - Infrequent per segment vs peers → rare, by_field_name: X, over_field_name: Y, partition_field_name: Z
- Validate. Run platform.core.get_index_mapping on the target index to verify field existence/types → ad_validate_job_spec. If errors, fix and re-validate (max 3 attempts).
- Present and confirm. Show the complete job + datafeed bodies formatted as the exact API calls. Ask for approval once. If feedback, incorporate and re-present (up to 3 rounds).
- Deploy. After confirmation: ad_create_job → ad_create_datafeed → ad_open_job → ad_manage_datafeed (action=_start). Report final job_id and datafeed_id.
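The "build configs" and "apply smart defaults" steps might look roughly like the sketch below. The helper `build_job_and_datafeed` is hypothetical. The body field names (analysis_config, detectors, data_description, indices, query, query_delay) follow the Elasticsearch create-job and create-datafeed request bodies, and optional detector fields are omitted rather than set to null, in line with the no-null-fields rule. The output is meant to be shown to the user and validated, not submitted directly.

```python
import json


def build_job_and_datafeed(job_id, function, field_name=None, by_field=None,
                           over_field=None, partition_field=None,
                           bucket_span="15m", time_field="@timestamp",
                           index="logs-*", query=None, query_delay="60s"):
    """Return (job_body, datafeed_body) with defaults applied and no null fields."""
    detector = {"function": function}
    optional = {
        "field_name": field_name,
        "by_field_name": by_field,
        "over_field_name": over_field,
        "partition_field_name": partition_field,
    }
    # Only include optional detector keys that were actually provided.
    detector.update({k: v for k, v in optional.items() if v is not None})
    # Default influencers: the by/over/partition fields used by the detector.
    influencers = [f for f in (by_field, over_field, partition_field) if f]
    job = {
        "analysis_config": {
            "bucket_span": bucket_span,
            "detectors": [detector],
            "influencers": influencers,
        },
        "data_description": {"time_field": time_field},
    }
    datafeed = {
        "job_id": job_id,
        "indices": [index],
        "query": query if query is not None else {"match_all": {}},
        "query_delay": query_delay,
    }
    return job, datafeed


# Based on the nginx example in this document; the job ID and index pattern are made up.
job, datafeed = build_job_and_datafeed(
    "nginx-error-rate-per-host", "high_count",
    by_field="host.keyword", index="logs-nginx.access-*",
)
preview = json.dumps(job, indent=2)
```

The job body would go to PUT _ml/anomaly_detectors/<job_id> and the datafeed body to PUT _ml/datafeeds/datafeed-<job_id>, after validation and user confirmation.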
For batch analysis on historical data, pass start and end to the datafeed start call.
Worked examples (rare-username, DNS exfil, large-downloads) with full JSON bodies and datafeed filters: references/job-creation-recipes.md.
Rules
- Create job before datafeed. Datafeed references job by ID.
- Open job before starting datafeed. Start on a closed job is rejected.
- query_delay = P95 ingest latency + buffer (60s–120s safe default).
- Forecasts require non-population jobs: over_field_name jobs cannot be forecasted; warn before attempting.
- by_field_name vs over_field_name: by compares an entity to its own history; over compares it to its peer group in the same bucket. partition_field_name = fully independent sub-model with its own normalization.
- bucket_span matches detection granularity: 15m for high-frequency, 1h for operational metrics, 1d for daily patterns. Larger smooths short spikes; smaller increases noise.
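The forecast rule lends itself to a pre-flight check. The sketch below assumes a job configuration dict shaped like the create-job body (analysis_config.detectors). The function name `can_forecast` and the sample configs are hypothetical.

```python
def can_forecast(job_config: dict) -> bool:
    """Return False when any detector uses over_field_name (population analysis)."""
    detectors = job_config.get("analysis_config", {}).get("detectors", [])
    return not any(d.get("over_field_name") for d in detectors)


# Made-up configs for illustration.
per_host_job = {"analysis_config": {"detectors": [
    {"function": "high_count", "by_field_name": "host.keyword"},
]}}
population_job = {"analysis_config": {"detectors": [
    {"function": "rare", "by_field_name": "process.name", "over_field_name": "host.keyword"},
]}}

checks = (can_forecast(per_host_job), can_forecast(population_job))
```

Running a check like this before attempting a forecast lets the warning reach the user up front instead of surfacing as an API error.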
---
Registration (Kibana Agent Builder)
Requires Node.js 18+. Defaults to elastic/changeme when no credentials are supplied.
cd skills/kibana/kibana-anomaly-detection
# tools → workflows → skills
node scripts/kibana-agent-builder.mjs all register --kibana-url http://localhost:5601
# HTTPS with self-signed cert
node scripts/kibana-agent-builder.mjs all register --kibana-url https://localhost:5601 --insecure
all register runs tools register, then workflows register, then skills register. Kibana allows at most five tool_ids per skill; the script fills them by scanning SKILL.md for tool mentions (in document order), then appends ids from references/kibana/tools/esql/*.json until the cap (workflow-only tools omitted by default). If you run skills register alone, run tools register first so those ids exist.
Workflow tool exclusions and prefixes live in scripts/agent_builder_constants.json.
MCP API key permissions:
- Kibana: read_onechat, space_read
- Index: read, view_index_metadata on .ml-anomalies-*, .ml-annotations-*, .ml-notifications-*, .ml-config
- For source evidence: read on source data indices
---
Tool inventory
ES|QL tool specs live under references/kibana/tools/esql/*.json; workflow definitions under references/kibana/workflows/*.yaml. Each Mode section above lists the tools it uses. Full surface: references/tools.md (ES|QL) and references/workflow-tools.md (workflows).
Key system indices
| Index | Relevant content |
|---|---|
| .ml-anomalies-* | record, bucket, influencer, model_plot, model_forecast, model_snapshot, category_definition, model_size_stats |
| .ml-config | job/datafeed documents (visible even for never-run jobs) |
| .ml-annotations-* | delayed data (event == "delayed_data") |
| .ml-notifications-* | job messages (level: info/warning/error) |
---
Examples
RCA: "Something caused a spike in our error rate at 2pm — what broke?" → Investigate → ad_get_available_metadata → ad_query_anomaly_timeline → ad_rca_cross_job_entity_match → ad_rca_multi_job_entities → RCA report.
Score drop: "My anomaly score went from 90 to 55 — did the model change?" → Explain → ad_rca_score_reassessment for drift → explain renormalization if score_drift is large.
Memory limit: "Job status shows hard_limit and results look wrong." → Troubleshoot → ad_ts_model_memory_health → ad_wf_ts_field_cardinality → ad_estimate_memory_requirement → ad_update_model_memory_limit (lifecycle: stop datafeed → close → update → open → start).
New job: "Detect unusual error rates per host on nginx access logs." → Manage → high_count detector with by_field_name: "host.keyword" → validate → present → deploy.
Multi-mode: "We had an incident last night, scores were high but now low — is the job healthy?" → Investigate the incident → Explain the score drift → Troubleshoot if hard_limit or delayed data is suspected.
---
Guidelines
- Pick a mode first. Don't blend RCA logic with score-explanation logic in one response.
- ad_validate_ml_tool_permissions first on empty results: privileges are the most common false-negative cause.
- Score bands are absolute thresholds: >75 critical, 50–75 warning, 25–50 minor, <25 informational.
- Multi-job entities are prime suspects. Use min_job_count=2 in ad_rca_multi_job_entities.
- Show initial_record_score alongside record_score: the gap tells the renormalization story.
- Fix memory before query_delay. hard_limit invalidates downstream diagnostics.
- Stop datafeed → close job → update config → open job → start datafeed for any config change to memory or query delay.
- Confirm RCAs with ad_rca_source_evidence. Raw source documents are ground truth.
