Measuring What Matters — SOC Metrics for the Agentic Era
Most SOC metrics measure activity. The ones that matter measure outcomes. Here's what to track at 30, 60, and 90 days after implementing the agentic SOC architecture.
There's a problem with most SOC metrics programs.
They measure activity. Alerts reviewed. Incidents closed. Tickets resolved. Mean time to respond. These numbers go up and down week to week, and they tell you almost nothing about whether your organization is actually harder to compromise than it was six months ago.
The series so far covers what to build: threat-informed detection, log architecture, agent automation, graph analytics. This final article covers how to know if it's working. Not through activity metrics, but through outcome metrics that connect directly to the threat model.
The benchmark that anchors everything
Before any other metric, establish this one:
Mean Time to Detect (MTTD) vs. the 12-day MDDR benchmark.
The Microsoft Digital Defense Report (MDDR) 2025 reports 12 days as the average ransomware dwell time: the time between initial compromise and detection. Your MTTD target is below 12 days.
Measuring it: track the estimated compromise date (usually determined through forensic investigation once initial access is confirmed) against the Sentinel incident creation date. For incidents discovered proactively through hunting rather than through an alert firing, use the hunt discovery date as the detection timestamp; proactive discovery is what pulls the average down.
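A minimal sketch of that calculation in KQL, assuming estimated compromise dates are tracked per incident in a Sentinel watchlist named CompromiseDates (a hypothetical name with hypothetical IncidentNumber and CompromiseDate columns; any per-incident store works):

```kusto
// Sketch: rolling MTTD against the 12-day benchmark. Watchlist name and
// columns are illustrative — adapt to wherever you record compromise dates.
let compromise = _GetWatchlist('CompromiseDates')
    | project IncidentKey = tostring(IncidentNumber), CompromiseDate = todatetime(CompromiseDate);
SecurityIncident
| summarize arg_max(TimeGenerated, CreatedTime) by IncidentNumber
| extend IncidentKey = tostring(IncidentNumber)
| join kind=inner compromise on IncidentKey
| extend DwellDays = datetime_diff('day', CreatedTime, CompromiseDate)
| summarize MTTD_days = round(avg(DwellDays), 1), Incidents = count()
| extend VsBenchmark = iff(MTTD_days < 12.0, "Below 12-day MDDR average", "Above benchmark: investigate detection gaps")
```

The inner join intentionally excludes incidents without a recorded compromise date, so the average only reflects incidents you actually investigated forensically.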
MTTD is a lagging indicator. It tells you how you're performing against historical incidents. It doesn't tell you why. The metrics below tell you why.
Log source coverage: the foundation check
Before tuning detection rules or deploying agents, validate that your telemetry is actually flowing. This is the check that the most confident teams skip and then regret during an incident investigation.
Must-have coverage validation
// Run this weekly — validate all four must-have sources have recent data
let check_window = 1d;
let expected = datatable(TableName: string) [
    "SigninLogs",
    "SecurityAlert",
    "AuditLogs",
    "AzureActivity"
];
let observed = union withsource=TableName isfuzzy=true SigninLogs, SecurityAlert, AuditLogs, AzureActivity
    | where TimeGenerated > ago(check_window)
    | summarize LastRecord = max(TimeGenerated) by TableName;
expected
| join kind=leftouter observed on TableName
| extend Status = iff(isnotnull(LastRecord), "✓ Flowing", "⚠ Gap detected")
| project TableName, LastRecord, Status
Adjust the table list to match your environment, then schedule it. The goal is a weekly automated check that flags any table that has stopped receiving data before you need it for an investigation.
Target: 4/4 must-have sources active with data flowing in the last 24 hours. Any gap is a P2 operational incident affecting detection coverage.
Should-have coverage
Track which sources from the Article 2 should-have tier (DNS, VPN, PAM, NDR, proxy) are connected. This isn't binary — partial coverage at each tier is expected as connectors are deployed over months.
Track as a percentage of should-have sources connected and data-validated. Direction matters more than absolute score. 40% this month, 60% next month, 80% in 90 days is a healthy trajectory.
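That percentage can be computed with a variant of the must-have check. In the sketch below the table names are illustrative placeholders; substitute whatever tables your should-have connectors actually write to:

```kusto
// Sketch: should-have coverage percentage over the last 7 days.
// Table names below are examples only — match them to your connectors.
let should_have = datatable(TableName: string) [
    "DnsEvents",          // DNS
    "CommonSecurityLog",  // VPN / proxy via CEF
    "VMConnection"        // network connection telemetry
];
let active = union withsource=TableName isfuzzy=true DnsEvents, CommonSecurityLog, VMConnection
    | where TimeGenerated > ago(7d)
    | distinct TableName;
should_have
| join kind=leftouter active on TableName
| summarize Connected = countif(isnotempty(TableName1)), Total = count()
| extend CoveragePct = round(100.0 * Connected / Total, 1)
```

Run it monthly and chart CoveragePct; the trajectory is the metric, not any single reading.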
Behavioral baseline maturity
This metric is unique to the agentic SOC architecture and doesn't exist in traditional SOC metric frameworks.
The data lake behavioral baselines that power entity triage (Use Case 5), threat hunting (Article 6), and anomaly detection require time to build. They're not useful on day 1. They become meaningful at 30 days, reliable at 90 days, and high-confidence at 180 days.
Track retention depth per critical table:
// Check how far back your critical tables go
union withsource=table_name isfuzzy=true SigninLogs, AuditLogs, DeviceLogonEvents
| summarize earliest_record = min(TimeGenerated) by table_name
| extend days_of_history = datetime_diff('day', now(), earliest_record)
| project table_name, days_of_history,
    baseline_status = case(
        days_of_history < 30, "Insufficient — agents operating without baseline",
        days_of_history < 90, "Building — low-confidence anomaly detection",
        days_of_history < 180, "Reliable — use for triage; extend for hunting",
        "High-confidence — full behavioral baseline available"
    )
Track days_of_history for SigninLogs, AuditLogs, and DeviceLogonEvents monthly. The baseline status label changes over time as your retention depth grows.
Target at 90 days post-implementation: All three critical identity tables at 90+ days of data lake history. At this threshold, the entity triage agent and password spray hunting queries produce reliable, tuned output.
Agent workflow performance
The use cases from Article 3 each have a measurable cycle time. These are the metrics that tell you whether the agent layer is delivering its claimed value — or whether it needs tuning.
| Use Case | Metric | Target |
|---|---|---|
| UC1 — Posture reporting | End-to-end time from script trigger to leadership-ready markdown | < 15 minutes |
| UC2 — Threat context gap analysis | Time from advisory receipt to analyst-reviewed gap report | < 30 minutes |
| UC3 — KQL hunting assistant | Time from hypothesis statement to runnable KQL query | < 5 minutes |
| UC4 — Defender for Cloud posture loop | Open high-severity recommendation count, week-over-week trend | Downward trend |
| UC5 — Entity triage | Time from incident creation to entity triage brief | < 10 minutes |
| UC6 — TI and adversary context | Time from indicator extraction to adversary context brief | < 15 minutes |
Measure these against a pre-agent baseline. If your team was spending 2-3 hours on weekly posture reports before UC1 was deployed, and the cycle time is now 15 minutes, that's the productivity metric. Track the delta, not just the post-implementation number.
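If agent runs are logged with start and end timestamps, those cycle times can be trended directly. A sketch, assuming a hypothetical custom table AgentWorkflowRuns_CL with UseCase_s, StartTime_t, and EndTime_t columns (adapt to however you actually log runs):

```kusto
// Sketch: weekly median cycle time per use case. Table and column names
// are hypothetical — substitute your own run-logging schema.
AgentWorkflowRuns_CL
| extend CycleMinutes = datetime_diff('minute', EndTime_t, StartTime_t)
| summarize p50_minutes = percentile(CycleMinutes, 50) by UseCase_s, Week = startofweek(TimeGenerated)
| order by UseCase_s asc, Week asc
```

The median, not the mean, is the right aggregate here: one stalled run should not mask a healthy typical cycle time.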
EntityAnalyzerAudit_CL discrepancy rate
From the Article 3 guardrails section: every agent verdict must be logged to EntityAnalyzerAudit_CL (or your equivalent custom log table). Track the discrepancy rate — incidents where the agent's verdict and the analyst's final assessment diverged.
A discrepancy doesn't automatically mean the agent was wrong. Sometimes the analyst has context the agent didn't. But a rising discrepancy rate is signal that the agent's data access, prompt design, or cross-validation step needs tuning. Track it monthly.
Target: < 10% of entity triage verdicts result in material analyst override after cross-validation. Higher than that means the agent is either working from incomplete data or the baseline window is too short.
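A sketch of the monthly roll-up, assuming EntityAnalyzerAudit_CL carries both the agent verdict and the analyst's final assessment (the column names here are illustrative; match your schema):

```kusto
// Sketch: monthly discrepancy rate. AgentVerdict_s and AnalystVerdict_s
// are illustrative column names for your audit table.
EntityAnalyzerAudit_CL
| where isnotempty(AnalystVerdict_s)  // only cross-validated verdicts count
| summarize Total = count(), Overrides = countif(AgentVerdict_s != AnalystVerdict_s)
    by Month = startofmonth(TimeGenerated)
| extend DiscrepancyPct = round(100.0 * Overrides / Total, 1)
| extend Status = iff(DiscrepancyPct < 10.0, "Within target", "Tune data access, prompts, or baseline window")
| order by Month asc
```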
Secure Score as a lagging indicator
Microsoft Secure Score is a useful portfolio view, but its limitations as a primary SOC metric are real.
Secure Score goes up when you implement specific controls. It goes up in a predictable way when you first configure a tenant and check all the recommended settings. After the initial configuration burst, Secure Score improvement requires intentional posture work, not just operational SOC work.
What it's good for: tracking whether the organization is regressing on security posture. A Secure Score that drops 15 points between quarterly reviews means something changed — permissions were relaxed, policies were disabled, a configuration drift event occurred. That's worth investigating.
What it's not good for: measuring whether your SOC is detecting faster, or whether your behavioral baselines are deeper, or whether your agent workflows are reducing analyst burden. Those questions require operational metrics, not posture scores.
Track Secure Score as a floor — set a minimum acceptable threshold and alert if it drops below it. Consider it a confirmation that posture work is being done, not a measure of detection effectiveness.
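The floor check can be a simple scheduled query. This sketch assumes score snapshots are exported on a schedule (for example, a Logic App calling the Microsoft Graph secureScores API) into a hypothetical custom table SecureScoreSnapshots_CL:

```kusto
// Sketch: Secure Score floor alert. Table and column names are hypothetical;
// point this at wherever your score snapshots land.
let floor_pct = 65.0; // minimum acceptable posture — set per environment
SecureScoreSnapshots_CL
| summarize arg_max(TimeGenerated, CurrentScore_d, MaxScore_d)
| extend ScorePct = round(100.0 * CurrentScore_d / MaxScore_d, 1)
| where ScorePct < floor_pct // returns a row only when the floor is breached
```

Wired to a scheduled analytics rule, this fires only on regression, which matches the "floor, not target" framing above.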
The 30/60/90 day check-in
Here's what the implementation trajectory should look like, using the metrics above:
Day 30:
- Must-have log sources: 4/4 flowing and data-validated
- Data lake enabled: data lake tier configured on workspace
- UC1 deployed: posture report running weekly, cycle time measured
- MTTD baseline established: calculate average MTTD from last 5 incidents to set a starting baseline
Day 60:
- Should-have coverage: 3+ of 6 should-have sources connected
- Behavioral baseline at 60 days: SigninLogs data lake history > 60 days
- UC5 deployed: entity triage running in assisted mode (agent proposes, analyst reviews)
- EntityAnalyzerAudit_CL logging active: all agent verdicts logged
- Password spray hunt query (T5) running weekly; first findings reviewed
Day 90:
- Should-have coverage: 5+ of 6 should-have sources connected
- Behavioral baseline at 90 days: all three critical tables > 90 days; baselines declared reliable
- Dual-ingest patterns: DNS and DeviceProcessEvents full-volume routed to data lake
- Hunting library: 3+ threat hypotheses with tuned KQL queries
- MTTD trend: compare 90-day post-implementation MTTD against pre-implementation baseline
- Unified portal: primary investigation workflow moved to security.microsoft.com
What "working" looks like
The metrics tell you when individual components are running correctly. Here's the outcome picture that indicates the full architecture is working:
The detection library is improving. New detection rules are being added from hunting findings, not just from vendor content updates. Hunts are finding activity that analytics rules didn't catch. The library is growing because your team is discovering patterns, not just consuming them.
Agent workflows are generating signal beyond their defined scope. UC5 entity triage is producing findings that change the analyst's scope of investigation — the agent noticed something the analyst would have missed without the 30-day behavioral comparison. This happens at least monthly.
Late-discovered forensic investigations are completing. An incident that began three months before detection was successfully investigated using data lake history. The investigation concluded with a reconstructed attack chain, not an incomplete timeline with gaps where logs weren't retained.
MTTD is below 12 days. You're tracking estimated compromise date per incident. The rolling average is consistently below 12 days. Your detection velocity exceeds the average ransomware operator's dwell-time target.
That's the outcome the blueprint was designed to produce.
Executive Summary for Security Leadership
Activity metrics — alerts reviewed, incidents closed, tickets resolved — don't indicate whether detection velocity is improving. The single most important outcome metric is MTTD vs. the 12-day MDDR benchmark. If your SOC is finding ransomware compromises in under 12 days on average, the architecture is working. If not, the gap is somewhere in the detection stack, log coverage, or investigation process.
Log source coverage validation should be automated and run weekly. A table that has stopped ingesting data is a P2 operational incident affecting detection coverage for specific threat categories. Most teams discover log gaps during incident forensics — not before.
Agent workflow cycle time metrics are the primary leading indicator for AI agent program value. If UC1 posture reporting is taking 45 minutes instead of 15 minutes, something is wrong with the data access, prompt design, or output validation step — not with the underlying technology.
Behavioral baselines take 90 days to reach reliable confidence levels. Any organization measuring AI agent triage accuracy before 90 days of data lake history exists is measuring a system that is still warming up. Set the evaluation timeline accordingly.
This quarter: establish MTTD measurement methodology using the last 5 incidents, deploy the must-have coverage validation query as a weekly scheduled alert, and begin tracking UC agent cycle times. Those three give you enough signal to identify where the architecture needs investment versus where it's performing as designed.
Where the series ends
Nine articles. The architecture is complete.
You have the threat model grounded in breach data (pillar + Articles 1-3), the data foundation to detect and investigate it (Articles 4a, 4b, 5), the proactive operations layer on top of it (Articles 6-7), and the measurement framework to know if it's working (this article).
The gap between what attackers do and what SOCs see is structural. It's closed the same way it was opened — through deliberate architectural decisions about what you collect, how long you keep it, how fast you can find it, and who (or what) can query it.
The technology is ready. The data is there if you build for it. Let me know what you implement.
This article is part of the Threat-Informed Defense Series: The Agentic SOC. See the pillar article for the complete framework.