
Building a Threat Hunting Program with KQL and the Sentinel Data Lake

Detection rules catch what you expect. Hunting finds what you don't. Here's how to build a systematic threat hunting program using KQL hypotheses, behavioral baselines, and the Sentinel data lake.


Detection rules catch what you expect. They fire when known-bad patterns appear in your telemetry. They're essential, and they're not enough.

Human-operated ransomware dwell time averages 58 days. That means attackers are in your environment, conducting reconnaissance, escalating privileges, and staging data — all while your detection rules are silent, because nothing they're doing yet matches a known-bad signature.

Threat hunting is the discipline of proactively searching for evidence of active intrusions that detection rules haven't caught. The data lake architecture from Articles 4a, 4b, and 5 is what makes systematic hunting possible at depth. This article covers how to build a hunting program that actually runs.


The hypothesis model

Good hunting starts before you write a single line of KQL.

A hunt hypothesis answers three questions:

  1. What attacker technique am I looking for? Be specific — not "lateral movement" but "Pass the Hash from a workstation to a domain controller via SMB."
  2. What data would show evidence of this technique if it occurred? Name the specific log source and event types.
  3. What distinguishes attacker activity from normal admin behavior? This is the filter that makes the hunt useful rather than just returning every event.

Without a hypothesis, you're not hunting. You're browsing. Browsing produces noise, not findings.

The threat landscape from Article 1 provides your hypothesis backlog. For each of T1-T11, there's a corresponding set of techniques with data sources. Start your hunting program by building hypotheses for the two or three threats most relevant to your environment.
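To keep the backlog consistent, each hypothesis can be recorded as a small structured object that forces answers to all three questions. A minimal sketch in Python; the field names and example values are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field


@dataclass
class HuntHypothesis:
    """One backlog entry, mirroring the three hypothesis questions."""
    technique: str                                     # What attacker technique, stated specifically
    data_sources: list = field(default_factory=list)   # Where evidence would appear if it occurred
    discriminator: str = ""                            # What separates attacker activity from admin noise
    threat_ids: list = field(default_factory=list)     # Mapping back to the T1-T11 threat model


# Hypothetical example entry for the Pass the Hash hunt described above
pth_hunt = HuntHypothesis(
    technique="Pass the Hash from a workstation to a domain controller via SMB",
    data_sources=["DeviceLogonEvents", "SecurityEvent 4624, LogonType 3"],
    discriminator="Interactive accounts making network logons to DCs "
                  "from devices outside the admin jump-host set",
    threat_ids=["T1", "T2"],
)
```

A record like this is also the artifact you hand to the next analyst in step 5 of the promotion workflow: the technique-to-KQL mapping with its rationale attached.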


KQL hunting patterns by threat

These are working KQL patterns for the highest-priority threats in the model. Each is designed to run against the data lake using the historical depth from the retention architecture in Article 4b. Tune the thresholds and time windows for your environment.

T1/T2 — Credential abuse and lateral movement

Hypothesis: Infostealer-harvested credentials are being used for initial access and lateral movement. Look for sign-in patterns that diverge from a user's behavioral baseline — new IPs, new countries, new application access — occurring within a narrow time window (implying credential handoff from access broker to operator).

let baseline_window = 60d;
let alert_window = 7d;
let lookback = baseline_window + alert_window;  // 67 days of history required
// Establish per-user sign-in IP baseline
let user_ip_baseline = SigninLogs
| where TimeGenerated between (ago(lookback) .. ago(alert_window))
| summarize known_ips = make_set(IPAddress), known_countries = make_set(tostring(LocationDetails.countryOrRegion)) by UserPrincipalName;
// Find recent sign-ins from outside baseline
SigninLogs
| where TimeGenerated > ago(alert_window)
| where ResultType == "0"  // Successful sign-in (ResultType is a string column)
| join kind=leftouter user_ip_baseline on UserPrincipalName
| extend known_ips = coalesce(known_ips, dynamic([])), known_countries = coalesce(known_countries, dynamic([]))
| where not(set_has_element(known_ips, IPAddress))
| where not(set_has_element(known_countries, tostring(LocationDetails.countryOrRegion)))
| project TimeGenerated, UserPrincipalName, IPAddress, LocationDetails, AppDisplayName, known_ips
| order by TimeGenerated desc

Why lookback depth matters: a 60-day baseline window plus a 7-day alert window requires at least 67 days of data. Run this against a 30-day retention window and the baseline is too thin to distinguish new locations from normal travel.

T2 — Lateral movement via RDP and SMB

Hypothesis: Pass the Hash or stolen credential lateral movement from a compromised workstation to a domain controller or high-value server, using Windows authentication events.

// Lateral movement: logon type 3 (network) from workstation to DC
// Requires DeviceLogonEvents with DC and server scoping in analytics tier
let dcs = dynamic(["dc01.contoso.com", "dc02.contoso.com"]);  // Replace with your DCs
DeviceLogonEvents
| where TimeGenerated > ago(30d)
| where ActionType == "LogonSuccess"
| where LogonType == "Network"
| where DeviceName in (dcs)
| where InitiatingProcessFileName !in ("lsass.exe", "services.exe", "ntlm.exe")
// Filter out machine accounts (names ending in $)
| where AccountName !endswith "$"
| summarize 
    hop_count = count(),
    source_devices = make_set(RemoteDeviceName),
    first_seen = min(TimeGenerated),
    last_seen = max(TimeGenerated)
    by AccountName
| where hop_count > 3 or array_length(source_devices) > 2
| order by hop_count desc

T5 — Password spray detection

Hypothesis: A credential stuffing or password spray campaign is testing your environment. Look for distributed low-frequency failed authentications that stay below per-account lockout thresholds but sum to significant volume across a short window.

// Password spray: many accounts, few failures each, concentrated time window
let time_window = 1h;
SigninLogs
| where TimeGenerated > ago(7d)
| where ResultType != "0"  // Failed sign-ins (ResultType is a string column)
| where ResultDescription !has "MFA"  // Exclude MFA failures, a different pattern
| summarize 
    failed_accounts = dcount(UserPrincipalName),
    total_failures = count()
    by bin(TimeGenerated, time_window), IPAddress
| where failed_accounts > 10 and total_failures > 20
| extend failures_per_account = todouble(total_failures) / failed_accounts  // force real division
| where failures_per_account < 5  // Low per-account = spray, high per-account = brute force
| order by failed_accounts desc
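The per-account ratio is what separates the two attack shapes: wide and shallow is spray, narrow and deep is brute force. A toy illustration of the heuristic in Python; the function name, threshold, and numbers are invented for illustration:

```python
def classify_failed_auth(total_failures: int, failed_accounts: int,
                         spray_ratio: float = 5.0) -> str:
    """Label a burst of auth failures as spray (wide, shallow) or brute force (narrow, deep)."""
    if failed_accounts == 0:
        return "no-failures"
    # Real division, matching the todouble() cast needed in the KQL version
    failures_per_account = total_failures / failed_accounts
    return "password-spray" if failures_per_account < spray_ratio else "brute-force"


# 120 failures spread across 60 accounts: two tries each, below lockout thresholds
print(classify_failed_auth(total_failures=120, failed_accounts=60))  # password-spray
# 120 failures against 2 accounts: sixty tries each
print(classify_failed_auth(total_failures=120, failed_accounts=2))   # brute-force
```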

T7 — Access broker and third-party anomalies

Hypothesis: A third-party service principal or guest account has accessed resources outside its normal scope. Look for first-time resource access from external identities.

// Third-party first-access to sensitive resources
let lookback = 30d;
let new_access_window = 7d;
// Resources accessed historically
let historical_scope = AuditLogs
| where TimeGenerated between (ago(lookback) .. ago(new_access_window))
| where Category == "ApplicationManagement"
| extend upn = tostring(InitiatedBy.user.userPrincipalName)
| where upn has_any ("#EXT#", "_adm@")
| summarize historical_resources = make_set(tostring(TargetResources[0].id)) by upn;
// New resource access in alert window
AuditLogs
| where TimeGenerated > ago(new_access_window)
| where Category == "ApplicationManagement"
| extend upn = tostring(InitiatedBy.user.userPrincipalName)
| where upn has_any ("#EXT#", "_adm@")
| extend resource_id = tostring(TargetResources[0].id)
| join kind=leftouter historical_scope on upn
| extend historical_resources = coalesce(historical_resources, dynamic([]))
| where not(set_has_element(historical_resources, resource_id))
| project TimeGenerated, upn, resource_id, Operation = OperationName
| order by TimeGenerated desc
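Stripped of the KQL plumbing, this hunt is a set difference across time windows: anything touched in the alert window that is absent from the historical scope. The same logic in plain Python, with invented resource IDs:

```python
# Resources an external identity accessed during the baseline window (invented IDs)
historical = {"res-app-01", "res-app-02", "res-storage-01"}

# Resources the same identity touched during the 7-day alert window
recent = {"res-app-02", "res-keyvault-01", "res-storage-01"}

# First-time access = recent minus historical; these rows are the hunt findings
first_access = recent - historical
print(sorted(first_access))  # ['res-keyvault-01']
```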

T9 — Supply chain: unauthorized build artifact access

Hypothesis: A developer credential has accessed artifact registry entries it has no business need for, or accessed build artifacts associated with sensitive deployments outside normal build windows.

// CI/CD artifact access outside business hours
// Assumes CI/CD logs ingested to data lake table
// Adjust table name and schema to your connector
GitHubAuditLog_CL
| where TimeGenerated > ago(30d)
| where action_s in ("packages.download", "registry.get_scope_token")
| extend hour_of_day = hourofday(TimeGenerated)
| where hour_of_day !between (7 .. 18)  // Outside business hours UTC — adjust for your timezone
| summarize 
    off_hours_access = count(),
    packages = make_set(name_s),
    users = make_set(actor_s)
    by bin(TimeGenerated, 1d)
| where off_hours_access > 5

Jupyter notebooks for behavioral baselines

KQL is excellent for discrete pattern matching. For behavioral analytics — finding anomalies relative to statistical norms across thousands of users over months of data — Python and Jupyter notebooks against the data lake tier are more appropriate.

Microsoft Sentinel supports Jupyter notebooks natively. Open Notebooks from the Sentinel workspace and you get direct access to your data lake telemetry with Python, pandas, scikit-learn, and visualization libraries available without any additional setup.

A baseline use case: authentication anomaly scoring

Build a per-user authentication profile using 90 days of SigninLogs data lake history: typical hours of access, typical applications, typical network locations, typical device count. Compute a distance metric for new authentication events against each user's profile. Score outliers.

from datetime import timedelta

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import IsolationForest
from azure.monitor.query import LogsQueryClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
client = LogsQueryClient(credential)

# Pull 90 days of sign-in data from data lake
query = """
SigninLogs
| where TimeGenerated > ago(90d)
| project UserPrincipalName, IPAddress, AppDisplayName,
          hour_of_day = hourofday(TimeGenerated),
          day_of_week = dayofweek(TimeGenerated) / 1d,
          country = tostring(LocationDetails.countryOrRegion)
"""

workspace_id = "<log-analytics-workspace-id>"  # Your Log Analytics workspace GUID
response = client.query_workspace(workspace_id, query, timespan=timedelta(days=90))
signins = pd.DataFrame(response.tables[0].rows, columns=response.tables[0].columns)

# ... build per-user profiles, fit IsolationForest
# See full implementation in the companion gist

The output is a risk score per authentication event, derived from 90 days of behavioral history. Analysts can query the scoring output for the current day's outliers rather than reviewing every sign-in event manually.
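The elided modeling steps, building feature vectors and fitting an outlier model, might look like the following. This is a sketch against synthetic data, not the notebook's actual implementation; feature choice, the contamination rate, and per-user versus global modeling are all tuning decisions for your environment.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic stand-in for the query result: one row per sign-in event,
# clustered around business hours, weekdays, and a home country
df = pd.DataFrame({
    "hour_of_day": rng.normal(10, 2, 500).clip(0, 23),
    "day_of_week": rng.integers(0, 5, 500),
    "country_code": rng.integers(0, 2, 500),  # label-encoded country
})

model = IsolationForest(contamination=0.01, random_state=0)
model.fit(df)

# Score an in-baseline event and an off-baseline one (3 AM, weekend, new country)
events = pd.DataFrame({"hour_of_day": [10.0, 3.0],
                       "day_of_week": [2, 6],
                       "country_code": [0, 5]})
scores = model.decision_function(events)  # lower score = more anomalous
print(scores[1] < scores[0])  # True: the off-baseline event scores lower
```

In practice the fitted model scores each day's sign-ins on a schedule, and only the lowest-scoring events reach an analyst.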

This is the behavioral baseline data the entity triage agent in Use Case 5 (Article 3) depends on. The agent doesn't compute the baseline itself — it queries the output of a notebook workflow that runs nightly against the data lake.


From manual hunt to repeatable scheduled query

Most hunts start manually: a new threat advisory, a peer's published detection research, or a specific incident that suggests attacker activity you haven't been looking for.

The workflow from manual to repeatable:

  1. Write the hypothesis. Specific technique, data source, distinguishing filter.
  2. Validate against 30 days of data lake history. Does the query return expected results? Does it produce noise that requires additional filtering?
  3. Tune the thresholds. Adjust counts, time windows, and filters until the false positive rate is manageable.
  4. Promote to a scheduled analytics rule if real-time detection is warranted — or keep as a hunting query run weekly against the data lake.
  5. Document the technique-to-KQL mapping so another analyst can run it, understand the rationale, and improve it.

Hunting queries that prove their value get promoted. Hunting queries that return consistent noise get refined or retired. The program compounds over time as your library grows.


Practical starting point

If you don't have a formal hunting program today, start with one hypothesis from the T5 pattern above. Password spray is the highest-frequency initial access pattern after infostealer-delivered credentials. It's detectable, the KQL is straightforward, and you'll find results in most environments within the first run.

Run it against the last 7 days of data. Tune the thresholds. See what you get.

That's the first entry in your hunting library.


Executive Summary for Security Leadership

  • Detection rules protect against the known. Threat hunting finds the unknown — specifically, the credential-based intrusions and lateral movement activity that dwell silently until your detection rules have nothing to fire on. MDDR 2025 reports an average dwell time of 58 days; hunting closes the window between intrusion and detection.

  • The data lake architecture from Articles 4a and 4b is the prerequisite for systematic hunting. The KQL patterns here require 60-90 days of behavioral history to distinguish attacker activity from normal variation. Without data lake retention, hunting baselines are too thin to produce reliable signal.

  • Jupyter notebooks against the data lake enable behavioral anomaly detection that KQL discrete-pattern matching doesn't reach. The authentication anomaly scoring workflow is the foundation for any AI agent entity triage use case. The agent's accuracy is directly proportional to the quality of the behavioral model it queries.

  • Starting a hunting program doesn't require a dedicated analyst. One scheduled hunt query run weekly by the same analyst who handles the alert queue is a hunting program. It compounds over time as the query library grows and findings produce new detection rules.

  • This quarter: deploy the password spray detection query (T5) and schedule it to run weekly. Review the output with the same analyst who reviews the alert queue. Track the findings over 30 days. That's the foundation of a repeatable program.


What's next

Hunting with KQL and notebooks finds intrusions that have already moved into your environment. Graph analytics finds attack paths before they're used — relationships between identities, resources, and permissions that create blast radius risk even before any attacker credential is compromised.

Article 7: Seeing the Attack Path Before the Breach — Graph Analytics in Microsoft Sentinel


This article is part of the Threat-Informed Defense Series: The Agentic SOC. See the pillar article for the complete framework.