If you have watched a few KQL videos, copied some queries from a wiki, and changed a couple of column names without really being sure what you were doing, this guide is for you. The goal here is not to make you fast at KQL. The goal is to give you a mental picture of what is actually happening when you type a query, so that when something does not work the way you expected, you have somewhere to start looking.
Each section builds on the previous one. We start with what a query is as a piece of text, then move to filtering, then to picking columns, then to grouping, and so on. By the time you get to joins, you should already understand that a join is not a magical operator but just one more step in a sequence of table transformations. The exercises at the end are intentionally easy - the point is to wire the new ideas into your hands, not to test you.
Throughout the guide you will see four kinds of highlighted boxes. The blue ones are general points worth pausing on. The amber ones are warnings about things that look right but are wrong. The red ones are outright traps. The purple ones, marked From the field, are stories from real query reviews where something went sideways - those are some of the most valuable bits, because they show you what the warnings actually look like when they happen to you.
The single most important thing to internalize about KQL is this: every query is a sequence of steps, and every step takes a table as input and produces a table as output. There is no concept of looping, no concept of variables in the imperative sense, and almost no concept of "fetching one record." You start with a table, you pass it through a series of transformations, and what comes out the other end is the result you see.
The pipe character - the vertical bar | - is what connects one step to the next. You can
read it out loud as the word "then" or "and then". That single habit will change how queries feel to
you. When you look at a query, do not try to understand the whole thing at once. Read it left to right,
line by line, saying "then" every time you hit a pipe.
Here is the smallest meaningful query you can write:
SecurityEvent
| take 10
That is the whole shape of every KQL query you will ever write. The first line names a table. Every subsequent line that begins with a pipe is one more transformation step. You can have one step or thirty steps; it is all the same pattern.
A query in KQL is sometimes called a "pipeline" for exactly this reason. Each step is fed by the step before it and feeds the step after it. The output of one step is always a table - possibly with different columns, fewer rows, or aggregated values, but always a table. This is why you can stack steps freely: every step speaks the same language to its neighbors.
// this is a comment. There is no multi-line comment syntax. Use comments liberally, for
your own sake, and for the sake of others who may be reading/using your query.
Let us go back to the take 10 query and look at what comes back. The
SecurityEvent table holds Windows Security events forwarded from your domain-joined
machines, things like logons, logon failures, process creation, service installs, and so on. When you
run SecurityEvent | take 10, you get back ten rows from that table. Which ten? That is
actually an important question.
The answer is: any ten. Microsoft's documentation is explicit that take does not guarantee
any particular ordering - it gives you a sample of rows the engine happened to encounter first. This is
fine when you are exploring schema (you just want to see what columns exist and what they look like),
but it is a bad idea any time you actually care which rows you get. There is a different operator,
top, that orders the data first and then gives you the top however-many; I will get to that
one later.
For now, run SecurityEvent | take 10 in your environment and look at the result. You will
probably notice that SecurityEvent has dozens of columns. Most of them are empty for any
given row. This is because SecurityEvent is a wide table that has to accommodate every
possible Windows Security event, but each individual event only fills out a handful of those columns - a
logon event fills in LogonType and IpAddress, while a process-creation event
fills in NewProcessName and CommandLine, and so on. This is normal. Almost
every Sentinel table is built this way.
Before you do anything else with a new table, get used to running
TableName | take 10 against it just to see what columns exist and what the values
look like. Then run TableName | getschema, which gives you a tidy list of column names and
their data types. The two together are the cheapest possible way to orient yourself in a new dataset.
SecurityEvent
| getschema
| project ColumnName, ColumnType
| order by ColumnName asc
Notice the structure of that query. It has three steps. The first step is the table. The second step
asks for the schema, which produces a small table where each row describes one column of
SecurityEvent. The third step keeps only two columns of that schema-description table and
sorts them. The output is a clean alphabetical list of every column with its type. Read that out loud:
"Take SecurityEvent, then describe its schema, then keep only the column name and type, then sort
alphabetically." That is the whole query.
Almost every Sentinel table has a column called TimeGenerated. It is a
datetime value that records when the event was generated. This is the column you will
filter on more than any other, because the data behind these tables is enormous - often billions of rows
- and the engine indexes data by time. If you do not constrain the time range, the engine has to
consider far more data than it needs to, your query takes longer to run, and you may even hit a query
limit (a 10-minute timeout by default in Log Analytics).
The function you will use most for time filtering is ago(). It takes a duration and returns
a datetime that is that far in the past, relative to right now. So ago(1h) means "one hour
ago," ago(7d) means "seven days ago," and so on. The syntax for the duration uses suffixes:
s for seconds, m for minutes, h for hours, d for
days. There is no week or month suffix; you write seven days as 7d and a month as roughly
30d.
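If you want to see what ago() actually returns, the print operator evaluates scalar expressions without touching any table - a quick sketch:

print OneHourAgo = ago(1h), SevenDaysAgo = ago(7d), RoughMonthAgo = ago(30d)
// one row, three datetime columns, each that far in the past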
The conventional first step in any production query is to filter TimeGenerated to the
smallest range that still answers your question:
SecurityEvent
| where TimeGenerated > ago(1h)
| take 10
Time-filter first, before doing anything else. The reason is that the engine pushes time filters down to the storage layer - it never even reads data from outside your time window. By contrast, if you filter on a different column first and then filter on time, the engine has to do work it could have skipped. This will not matter on a tiny query in a tiny lab; it will matter a lot the first time you write a query against a busy production workspace.
It does not matter much whether you write where TimeGenerated > ago(1h) or
where TimeGenerated >= ago(1h). They are nearly identical for almost every realistic
case. What you should NOT do is write
where TimeGenerated > "2026-04-25". That is comparing a datetime to a string and you
will usually get either zero rows back or a confusing error. Always wrap explicit timestamps in either:

datetime(): where TimeGenerated > datetime(2026-04-25)
todatetime(): where TimeGenerated > todatetime(2026-04-25T15:22:00Z)
If you need a fixed window - say, between two specific timestamps - use between:
SecurityEvent
| where TimeGenerated between (datetime(2026-04-20) .. datetime(2026-04-21))
The two dots between the start and end times are the range syntax in KQL. You will see the same
a .. b form again with between (a .. b) for numeric ranges.
where
The where operator is the workhorse of KQL. You will use it more than any other operator.
Its job is to keep only the rows that satisfy a condition; rows that do not satisfy the condition are
dropped. The remaining rows are passed forward to the next step.
The condition can be almost anything that evaluates to true or false: a comparison, a string match, a
check against a list of values, the result of a function, or any combination of those joined with
and and or. Here is a typical example with several conditions chained
together:
SecurityEvent
| where TimeGenerated > ago(24h)
| where EventID == 4625
| where Account !endswith "$"
That query keeps only events from the last 24 hours, only event ID 4625 (Windows failed-logon), and only accounts that do not end with a dollar sign. The dollar sign at the end of an account name is the convention for machine accounts (computer objects in Active Directory), so this query is asking for human-account failed logons.
You might have noticed that there are three separate where clauses instead of one big one.
Both are valid, and the engine is smart enough to combine them internally. Many people prefer one filter
per line, like above, because it makes the query easier to read and easier to comment out individual
conditions while you are debugging. Either of these forms works:
// Style A: one where per line
SecurityEvent
| where TimeGenerated > ago(24h)
| where EventID == 4625
// Style B: combined with "and"
SecurityEvent
| where TimeGenerated > ago(24h) and EventID == 4625
Pick whichever you find easier to read. The performance difference is negligible; the readability difference is real.
Stacking where clauses is an implicit and; the pipe is telling KQL do this
and then do this. When using Style B to one-line your logic, be careful: if you use
or where you meant and, you may filter a lot less than intended.
For numbers, dates, and exact-match strings, use the standard comparison operators: == for
equals, != for not-equals, <, >, <=,
>=. Note that equality is a double equals sign - a single equals sign is the assignment
operator and you will see it inside extend and project-rename later.
To check whether a value is in a list, use in:
SecurityEvent
| where TimeGenerated > ago(1h)
| where EventID in (4624, 4625, 4634, 4648)
That is the same as writing
EventID == 4624 or EventID == 4625 or EventID == 4634 or EventID == 4648, only shorter and
easier to maintain. You can negate it with !in. There is also a case-insensitive variant
in~ that I will return to in the next section, when we talk about strings.
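Both variants in one sketch (the ActionType values here are just illustrative):

// Negated membership: drop common logoff noise
| where EventID !in (4634, 4647)
// Case-insensitive membership for strings
| where ActionType in~ ("logonsuccess", "logonfailed")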
If a list is long or reused in several places, that is what let is for. I will get to let in detail
later, but here is a sneak peek so the idea sits in the back of your mind:
let AuthEvents = dynamic([4624, 4625, 4634, 4648]);
SecurityEvent
| where TimeGenerated > ago(1h)
| where EventID in (AuthEvents)
The dynamic([...]) wrapper turns a comma-separated list into an array value that
in knows how to use. You will see dynamic a lot when we get to columns that
hold structured data.
Most of the columns in security data are strings: account names, file paths, command lines, URLs, IP addresses (yes, often as strings), and so on. KQL gives you a lot of string operators, and the differences between them matter for both correctness and performance. This is also one of the most common places where new KQL users write something that "works" on their dev data and falls over later.
Let us walk through them in order from cheapest to most expensive. The mental picture you should keep is that some operators can use the engine's index, which means they look at very little data, while others have to scan every row, which means they read everything.
== and =~ for exact match
The double-equals operator == is case-sensitive:
"Administrator" == "administrator" is false. The squiggle-equals operator
=~ is case-insensitive: "Administrator" =~ "administrator" is true.
Microsoft's official guidance, taken straight from their best-practices documentation, is to use
=~ rather than wrapping a column in tolower() before comparing. The reason is
that tolower(Account) == "administrator" applies a function to the column, which prevents
the engine from using its index - every row has to be lowercased before comparison. Using
=~ tells the engine "do a case-insensitive comparison" and lets it stay efficient.
// Microsoft recommends this:
| where Account =~ "corp\\bryce"
// Not this - slower because the function blocks index use:
| where tolower(Account) == "corp\\bryce"
Also note the doubled backslash in the string. KQL string literals use the backslash as an escape
character, so a single backslash needs to be written as two backslashes. If you want to avoid this and
write the path more naturally, use a "raw" string by prefixing with the at-sign:
@"corp\bryce".
has versus contains

This is the most important distinction in KQL string operators. They look almost identical and people use them interchangeably, but they behave very differently and one of them is much faster than the other.
The has operator looks for a whole-word match, which means it works against the engine's
word index and is fast. The contains operator looks for the substring anywhere - even in
the middle of another word - which means it cannot use the index and has to scan every row. For columns
like CommandLine, which can be megabytes of text per row across millions of rows, the
difference between has and contains is the difference between a query that
finishes in seconds and one that times out.
The catch is that has only matches whole "terms," where a term is roughly an alphanumeric
run separated by punctuation or whitespace. So CommandLine has "powershell" will match
"powershell.exe -enc ABCD" because powershell is a separate term inside that
string. But CommandLine has "shell" will not match "powershell.exe",
because shell is glued to power with no separator. If you needed to find
"shell" anywhere, including embedded in other words, that is when you would reach for
contains.
// Fast: uses the term index
| where CommandLine has "mimikatz"
// Also fast: case-sensitive variant
| where CommandLine has_cs "Mimikatz"
// Slow: full scan, only use when you really need substring match
| where CommandLine contains "shel"
The default is case-insensitive. If you want case-sensitive, append _cs to almost any
string operator: has_cs, contains_cs, startswith_cs, and so on.
When you filter on dynamic (JSON) columns, Microsoft's best-practices guidance is to apply a
has filter first, and only then dig into the JSON. Their example is "where DynamicColumn has \"Rare value\" | where DynamicColumn.SomeKey == \"Rare value\"" rather than "where DynamicColumn.SomeKey == \"Rare value\"" alone. The first form filters out
most rows cheaply with has and only does JSON parsing on the remainder; the second form
parses every row's JSON.
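The same guidance as a pipeline fragment you can adapt - DynamicColumn and SomeKey are placeholders for your actual column and key:

// Cheap indexed term filter first...
| where DynamicColumn has "Rare value"
// ...then the JSON path comparison runs only on the survivors
| where DynamicColumn.SomeKey == "Rare value"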
has_any and has_all
When you want to match any of a list of terms, do not write a long chain of or conditions.
Use has_any:
let Suspicious = dynamic(["mimikatz", "rubeus", "impacket", "bloodhound"]);
SecurityEvent
| where TimeGenerated > ago(1d)
| where CommandLine has_any (Suspicious)
The has_all variant requires that every term in the list be present. You will use
has_any far more often than has_all. Both still use the term index, so they
are fast.
startswith and endswith
Self-explanatory: startswith matches a prefix and endswith matches a suffix.
Both are reasonably efficient. The classic use case for endswith is filtering out machine
accounts, which always end in $: where Account !endswith "$". For
startswith, a typical use is filtering by a domain prefix:
where Account startswith "CORP\\".
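The two combine naturally - for example, human accounts in one domain:

SecurityEvent
| where TimeGenerated > ago(1h)
| where Account startswith "CORP\\" and Account !endswith "$"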
matches regex

This is the operator of last resort. It supports proper regular expressions and will match anything you can describe in regex syntax, but it cannot use any index - it has to evaluate the regex against every row. Reach for it only when no other operator will do.
// Look for a 32-character hex string anywhere in the URL
| where RemoteUrl matches regex @"\b[a-f0-9]{32}\b"
Notice the @ prefix on the string. That is the raw-string syntax - it tells KQL to treat
the backslashes as literal backslashes rather than escape sequences. You almost always want raw strings
for regex patterns; otherwise you end up writing "\\b[a-f0-9]\\{32\\}\\b" and going slowly
mad.
not function
Certain operators - has_any, to name one - do not let you prepend a
! (e.g. !has_any is not valid KQL). In those cases, you
can use the not() function to reverse the logic of a boolean expression. In more literal
terms, it tells the query to return data "where X is not true".
// Where the following statement is not true (or false)
| where not(Account has_any ('$','NT AUTHORITY','DWM'))
You should use the not() function when you need to negate complex logical expressions,
multiple grouped conditions, or scalar functions that do not have a dedicated negative operator built
into KQL.
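not() also works around any grouped condition, which is handy when the negation applies to the combination rather than to a single operator - a small sketch:

// Keep everything except successful network logons
SecurityEvent
| where TimeGenerated > ago(1h)
| where not(EventID == 4624 and LogonType == 3)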
| Operator | What it does | Speed | Typical use |
|---|---|---|---|
| == | Exact equality, case-sensitive | Fast (indexed) | EventID, numeric IDs, exact known strings |
| =~ | Exact equality, case-insensitive | Fast | Account names, file names where casing varies |
| has | Whole-word substring | Fast (indexed) | Searching CommandLine, RemoteUrl, etc. |
| has_any / has_all | Multiple terms | Fast | Watch lists of suspicious tokens |
| startswith / endswith | Prefix / suffix match | Fast-ish | Domain prefixes, file extensions |
| contains | Any substring, anywhere | Slow (scan) | Last resort when has won't match |
| matches regex | Full regex pattern | Slowest | When you absolutely need a pattern |
| in / !in | Membership in a list | Fast | EventID lists, allow/deny lists |
| not() | Reverses the boolean result | Operator dependent | Complex logical expressions or when there is no dedicated negative operator |
project
Tables in Sentinel can have a hundred columns, most of which are empty for any given event. Once you
have filtered to the rows you care about, the next thing you usually want to do is narrow down to the
columns you care about. That is what project is for. It selects a subset of columns and
discards the rest. As a side effect, it also reorders the columns to whatever order you list them in.
SecurityEvent
| where TimeGenerated > ago(1h) and EventID == 4624
| project TimeGenerated, Computer, Account, IpAddress, LogonType
Why bother projecting? Two reasons. First, your output is much easier to read when you are not staring at 90 empty columns. Second, projecting early makes downstream operations faster - every step from this point on only has to carry five columns through, not ninety. Microsoft's official guidance is to project as early as possible in the pipeline.
There are several variants of project that handle slightly different cases. The basic one
we just saw selects columns. The others are about renaming, dropping, and reordering without losing the
rest. Here is the whole family in one example:
DeviceProcessEvents
| where TimeGenerated > ago(1h)
| project-keep TimeGenerated, DeviceName, AccountName, FileName, ProcessCommandLine, SHA256
// project-keep selects columns like project does, but preserves their
// original order instead of reordering them to match your list
| project-rename Host = DeviceName, User = AccountName, Cmd = ProcessCommandLine
// project-rename changes column names without dropping the others
| project-away SHA256
// project-away removes specific columns and keeps the rest
You will mostly use plain project early on. As your queries grow,
project-rename becomes valuable because it lets you give columns more meaningful names
without rewriting half the query. project-away is useful late in a pipeline when you want
to drop a few "scratch" columns you used for intermediate calculations.
Once a column has been dropped by a project, it is gone for the rest of the pipeline. If
you then write where DroppedColumn == "x" you will get an error saying the column does not
exist. This sounds obvious but trips people up because the original table still has that column - it is
just no longer in the row-set being passed to the next step. Remember that every step's output is a
complete table on its own; you only see what you projected.
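A quick way to see the failure mode for yourself:

SecurityEvent
| where TimeGenerated > ago(1h)
| project TimeGenerated, Computer
| where Account == @"CORP\bryce"   // errors: 'Account' was projected away one step earlier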
extend
Where project picks columns that already exist, extend creates new columns by
computing values from existing ones. This is where you start doing real work - extracting the file name
from a full path, flagging events that match some condition, or computing time differences.
SecurityEvent
| where TimeGenerated > ago(1h) and EventID == 4688
| extend HourOfDay = datetime_part("hour", TimeGenerated)
| extend IsAfterHours = HourOfDay < 7 or HourOfDay > 19
That query adds two new columns: HourOfDay, which extracts the hour from the timestamp, and
IsAfterHours, which is a boolean true/false based on whether the hour falls outside
business hours. Now any later steps in the pipeline can filter or summarize using these new columns. You
can stack as many extend clauses as you need, and you can compute multiple columns in a
single extend by separating them with commas:
extend
HourOfDay = datetime_part("hour", TimeGenerated),
IsAfterHours = datetime_part("hour", TimeGenerated) !between (7 .. 19),
LeafFileName = tostring(split(NewProcessName, "\\")[-1])
That third column is worth pausing on because it shows a pattern you will use often. The
NewProcessName column on a 4688 event holds a full Windows path like
C:\Windows\System32\cmd.exe. The split function breaks that path on the
backslash separator, returning an array. Indexing with [-1] gives you the last element,
which is the file name. Wrapping the whole thing in tostring converts it from a dynamic
value to a plain string. The result is just cmd.exe, which is much easier to filter and
summarize on than the full path.
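You can test the extraction pattern in isolation with print before wiring it into a pipeline:

print LeafFileName = tostring(split(@"C:\Windows\System32\cmd.exe", @"\")[-1])
// returns cmd.exe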
Note that you cannot reference a column inside the same extend statement that creates it. In
the above example, if you attempted to use HourOfDay in the IsAfterHours expression
of that same extend, it would return an error saying the column does not exist. To use an extended column
in another expression, extend it in its own statement first:
| extend HourOfDay = datetime_part("hour", TimeGenerated)
| extend IsAfterHours = HourOfDay !between (7 .. 19)
iff and case
Two functions you will see constantly inside extend are iff and
case. The iff function is a simple if/then/else:
iff(condition, valueIfTrue, valueIfFalse). You can also write iif; there is no
functional difference, it is purely an alias. The case function is a multi-branch version:
case(cond1, val1, cond2, val2, ..., defaultVal).
| extend Severity = case(
EventID == 4625, "medium",
EventID == 4624 and LogonType == 10, "high",
EventID in (4672, 4720, 4732), "high",
"low"
)
Each pair is a condition followed by the value to return when that condition is true. The conditions are evaluated top to bottom and the first match wins. The last argument, with no condition before it, is the default - what gets returned if none of the conditions matched.
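For a simple two-way split, iff is more compact than case - a minimal sketch using the machine-account convention from earlier:

| extend AccountType = iff(Account endswith "$", "machine", "human")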
summarize
Up to this point, every operator we have seen returns a row-by-row transformation of the input. The
output has the same general shape as the input, just with fewer rows or different columns.
summarize is the first operator that fundamentally changes the shape of the data. It groups
rows together and produces one output row per group, replacing the original rows with a summary value.
The simplest use of summarize is just counting all the rows:
SecurityEvent
| where TimeGenerated > ago(1h) and EventID == 4625
| summarize count()
That returns exactly one row with one column - the total number of failed logons in the last hour.
Useful, but boring. The interesting form is summarize ... by ..., which says "give me one
row per distinct value of these columns":
SecurityEvent
| where TimeGenerated > ago(1h) and EventID == 4625
| summarize Failures = count() by Account, Computer
The output has three columns: Account, Computer, and Failures.
There is one row per unique (Account, Computer) pair, and Failures is the
count of failed logons for that pair. Notice the Failures = count() bit - that is naming
the aggregated column. Without the assignment, the column would be named count_, which is
fine but ugly. Always name your aggregates.
Inside the summarize, you can compute several different summary values at once. The most
useful aggregation functions in security work are these:
| Function | What it returns |
|---|---|
| count() | Number of rows in the group. |
| countif(predicate) | Number of rows where the predicate is true. Saves you from running two summarize branches. |
| dcount(column) | Approximate distinct count. Very efficient even on huge data. |
| min(column) / max(column) | Smallest / largest value. Common for first-seen / last-seen times. |
| sum(column) / avg(column) | Total / average of a numeric column. |
| take_any(column) | Any single value of the column from the group. Cheaper than min/max when you just need a sample. |
| arg_max(column, *) | The row with the largest value of column, with * meaning "keep all the other columns from that row too." This is hugely useful and we will return to it. |
| make_set(column) | A deduplicated array of all values of the column in the group. |
| make_list(column) | An ordered array of all values, possibly with duplicates. |
Here is a query that uses several of these together. It produces, for each user account in the last day, the total number of process executions, the number of distinct devices they ran processes on, the first time we saw them, the last time we saw them, and a sample of up to ten distinct file names:
DeviceProcessEvents
| where TimeGenerated > ago(1d)
| where AccountName !endswith "$" and isnotempty(AccountName)
| summarize
Executions = count(),
DeviceCount = dcount(DeviceName),
FirstSeen = min(TimeGenerated),
LastSeen = max(TimeGenerated),
SampleFiles = make_set(FileName, 10)
by AccountName
| order by Executions desc
That last argument inside make_set, the 10, caps the size of the resulting
array. Without a cap, a noisy account could produce a make-set with thousands of file names, which is
rarely what you want. Always cap your make_set and make_list calls.
arg_max trick
Of all the aggregation functions, arg_max is probably the one that will most change how you
write queries once you discover it. The signature is arg_max(column, *), where
column is the value you want to maximize and * means "keep all the other
columns from the winning row." It answers the question "for each group, give me the latest row, with
everything that was on that row."
// Per device, the most recent network adapter info available
DeviceNetworkInfo
| where TimeGenerated > ago(7d)
| summarize arg_max(TimeGenerated, *) by DeviceName
That returns one row per device, with all of DeviceNetworkInfo's columns, taken from the
most recent row for that device. Without arg_max, doing this would require either a join or
a window function. With arg_max, it is one line.
bin()
The bin function is a small but extremely useful one. It takes a value and rounds it down
to the nearest multiple of some unit. For datetime values, that means rounding down to the start of an
hour, day, or any other interval. The killer use is grouping events by time bucket.
SecurityEvent
| where TimeGenerated > ago(24h) and EventID == 4625
| summarize Failures = count() by bin(TimeGenerated, 5m)
| order by TimeGenerated asc
That gives you the count of failed logons in each 5-minute bucket over the last 24 hours, in
chronological order. You can plot the result as a time series in Sentinel directly. Adjust the bucket
size to your question - 1m for fine-grained timelines, 1h for daily patterns,
1d for week-long patterns.
You can combine time bucketing with grouping by other columns:
SecurityEvent
| where TimeGenerated > ago(24h) and EventID in (4624, 4625)
| summarize
Successes = countif(EventID == 4624),
Failures = countif(EventID == 4625)
by Computer, bin(TimeGenerated, 15m)
| order by Computer asc, TimeGenerated asc
Now you have one row per (Computer, 15-minute window) with separate counts of successes and
failures. Notice the countif trick - it counts only the rows where its condition is true.
Without it, you would have to either run two separate summarize queries and join them, or
split the data and union it back, both of which are uglier and slower.
You have already met take, which gives you "any N rows" with no ordering guarantee. The two
cousins of take are top and order by (which is the same as
sort by, just spelled differently).
The order by operator sorts the entire row-set by one or more columns. It takes
asc for ascending or desc for descending; the default is descending. Multiple
columns are separated by commas (order matters):
DeviceProcessEvents
| where TimeGenerated > ago(1h)
| summarize n = count() by DeviceName, FileName
| order by n desc, DeviceName asc
The top operator combines sorting and limiting in one step: it sorts and then takes the top
N. Use it when you only want the top so many rows by some metric:
| top 10 by n desc
That is functionally equivalent to order by n desc | take 10, but it expresses the intent
more clearly. It is also slightly more efficient because the engine can stop early once it has the top
N.
While you are still exploring, it is a good habit to end every query with
| limit 100 (a synonym for take) or | count.
Without it, on an unfamiliar dataset, your query may return gigabytes of rows before you realize you
forgot to filter something. Adding the limit is a cheap insurance policy while you are still feeling out
a new table.
This section is one of the most important in the whole guide, because it explains a class of bug that confuses new and experienced KQL users alike. The same conceptual idea - "what user logged on?" or "what was the logon type?" - is represented differently in different tables, sometimes in subtle ways that will silently produce wrong results if you do not know about them.
The clearest example is SecurityEvent versus DeviceLogonEvents. Both record
logon-related activity. SecurityEvent is sourced from the older Microsoft Monitoring Agent
reading the Windows Security event log; DeviceLogonEvents is sourced from Microsoft
Defender for Endpoint. They overlap heavily but they are not the same table, and the schemas are not
identical.
Look carefully at how each table represents LogonType:
| Table | Column name | Data type | Example value |
|---|---|---|---|
| SecurityEvent | LogonType | int | 2, 3, 10 |
| DeviceLogonEvents | LogonType | string | "Interactive", "Network", "RemoteInteractive" |
If you write where LogonType == 2 against SecurityEvent, you correctly get
interactive logons. If you write the same thing against DeviceLogonEvents, you will get
nothing - not because there are no interactive logons, but because the comparison
"Interactive" == 2 is always false, and KQL will not even warn you. The query runs cleanly
and returns zero rows, which is the worst possible failure mode.
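Side by side, the correct filter for each table:

// SecurityEvent: LogonType is an int
SecurityEvent
| where TimeGenerated > ago(1h) and LogonType == 10
// DeviceLogonEvents: LogonType is a string
DeviceLogonEvents
| where TimeGenerated > ago(1h) and LogonType == "RemoteInteractive"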
The same pattern applies to account names. SecurityEvent has a unified
Account column that looks like "CORP\\bryce" - domain and username glued
together with a backslash. DeviceLogonEvents and DeviceProcessEvents instead
have AccountDomain and AccountName as separate columns, with no backslash
anywhere. If you have a query that filters on Account =~ "corp\\bryce" and you change the
source table to DeviceProcessEvents without updating the filter, you will again get zero
rows and no error.
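If you need the combined form on the Defender side - say, to compare against SecurityEvent - you can rebuild it with strcat:

DeviceProcessEvents
| where TimeGenerated > ago(1h)
| extend Account = strcat(AccountDomain, @"\", AccountName)
| where Account =~ @"corp\bryce"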
From the field: one query I reviewed returned zero rows whenever its EventType input array was empty:
where "FileAccess" in~ (EventType) against an empty EventType array silently
returns nothing. That is a legitimate design choice - it acted as a safety net when run with no scoping
arguments - but it also means the query "works" in the sense that it does not error, while producing a
zero-row output that an analyst could mistake for "no findings." Empty results are not the same as no
findings; they are no answer. Always sanity-check that your filters are not silently filtering away
everything.
The defense against these types of issues is to know the schemas of the tables you query most. Run
TableName | getschema the first time you touch a new table, and again whenever a query
returns zero rows when you expected some. Keep a small reference card for the tables you use most if you
feel it is necessary. The appendix at the end of this document has compact schema cards for the tables
you will encounter day to day.
union and join
So far every query has touched a single table. Real investigations often span multiple tables, because
the data you need is spread across them - logons in one table, process executions in another, network
activity in a third. KQL gives us two operators for combining tables: union, which stacks
tables on top of each other, and join, which links rows from one table to rows from another
based on a key.
union
Use union when you want all the rows from multiple tables together as if they came from
one. The tables do not need to have the same schema; columns that are in one table but not another will
simply be empty for rows that came from the other table. The most common use is when the same conceptual
data lives in two tables - for instance, both SecurityEvent (EventID 4624) and
DeviceLogonEvents record successful logons, but from different sources.
union
(SecurityEvent | where TimeGenerated > ago(1h) and EventID == 4624
| project TimeGenerated, Computer, Account, IpAddress, Source = "SecEvent"),
(DeviceLogonEvents | where TimeGenerated > ago(1h) and ActionType == "LogonSuccess"
| project TimeGenerated, Computer = DeviceName, Account = AccountName, IpAddress = RemoteIP, Source = "DefenderEDR")
| order by TimeGenerated desc
Notice that I projected each side to the same column names before unioning. That is not strictly
required - union will tolerate mismatched columns - but it makes the result much cleaner.
Notice also the added Source column, which lets you tell which table each row came from.
This is a good habit; it costs nothing and you will be glad of it five minutes later when you wonder why
a row looks weird.
join and the trap that catches everyone
The join operator merges two tables based on matching values in one or more columns, like
SQL joins. The general syntax is:
LeftTable
| join kind=<kind> (RightTable) on Key
The kind= parameter is critical. There are several kinds of join (9 to be exact):
inner (only matching rows from both sides), leftouter (all left rows, matching
right or null), leftanti (left rows with no match in right), and a few more. I will cover
the most useful ones in a moment. But first, the trap.
If you write LeftTable | join (RightTable) on Key with no kind= specified, KQL
does not default to a SQL-style inner join. It defaults to a kind called
innerunique, which silently de-duplicates the left side on the join key before joining.
This means duplicate left rows are dropped before you ever see them, and your query returns fewer rows
than you expected, with no error and no warning.
Always specify kind= explicitly. If you want a SQL inner join, write
kind=inner. If you want all left rows, write kind=leftouter. The one thing you
almost certainly do not want is the default.
With that warning recorded, here are the joins you will use most. The mental model that helps is to picture two columns of rows next to each other, and ask: "for each row on the left, what should happen?"
| kind= | For each left row... | Use it for |
|---|---|---|
| inner | Keep it once for every matching right row; drop if no match | True intersection of two sets |
| leftouter | Keep it; matched right columns or null | Enrichment - adding optional context |
| leftanti | Keep it only if there is no match in the right | "First time seen" - events not in your baseline |
| innerunique | De-dup left first, then inner join | The default; usually not what you want |
A very common pattern is "I have some events, and I want to attach extra context from another table."
The right operator for this is almost always kind=leftouter:
// Logons enriched with the most recent device info per device
let RecentDeviceInfo = DeviceNetworkInfo
| where TimeGenerated > ago(7d)
| summarize arg_max(TimeGenerated, *) by DeviceName
| project DeviceName, IPAddresses, MacAddress;
DeviceLogonEvents
| where TimeGenerated > ago(1h) and ActionType == "LogonSuccess"
| join kind=leftouter (RecentDeviceInfo) on DeviceName
| project TimeGenerated, DeviceName, AccountName, LogonType, IPAddresses, MacAddress
Read this slowly. You first define a small lookup table called RecentDeviceInfo with one
row per device, holding only the columns you want to attach. Then you take the actual logon events, do a
left-outer join to that lookup, and project the fields you care about. Each logon event keeps everything
it had before and gains the device's most recent IP and MAC address. If the device is brand new and not
in the lookup table, those fields will be empty rather than dropping the row - that is the "leftouter"
behavior in action.
One of the most useful patterns in detection engineering is "show me the things I have not seen before."
The leftanti join makes this clean:
// Process executions in the last hour that did not appear anywhere in the last 30 days
let Baseline = DeviceProcessEvents
| where TimeGenerated between (ago(30d) .. ago(1h))
| distinct DeviceName, FileName, SHA256;
DeviceProcessEvents
| where TimeGenerated > ago(1h)
| join kind=leftanti (Baseline) on DeviceName, FileName, SHA256
Each row in the recent events is kept only if there is no matching row in the baseline. The result is
the set of "new" things - process executions on a host running with a hash combination that has never
been seen on that host before in the last 30 days. This is much cleaner than trying to do it with
!in against a giant in-memory list.
Sooner or later you will need to extract a value out of a string column that does not have its own
dedicated column. The classic example is the EventData column on
SecurityEvent: it contains the raw XML payload of the underlying Windows Security event,
with all the field-by-field details that did not get promoted to top-level columns. To get at any of
those details, you have to parse.
KQL has several tools for this, ranging from "easy and fast" to "powerful but slow." In rough order:
parse - easy when the data has a stable shape
The parse operator works when the source string follows a stable pattern of literal text
and variable values. You write the literal text in quotes and the variable parts as bare names, and KQL
extracts them. The three examples below cover the most common use cases.

This is the simplest execution of parse: you are looking solely for the value between " from " and " port ". In this case that is the source IP, so we name it accordingly.
Syslog
| where TimeGenerated > ago(1h) and ProcessName == "sshd-session"
| parse SyslogMessage with * " from " SourceIP " port " *
| project TimeGenerated, Computer, SourceIP
This is a similar but more specific parse: instead of extracting one value, you are extracting three. It reads more like a sentence than the previous example - all of the literal strings ("for ", " from ", " port ", " ssh2:") have to exist, in exactly the order you typed them, for the data to be parsed out. Any typographical error, or a SyslogMessage that does not contain the full sentence, means nothing is parsed for that row.
Syslog
| where TimeGenerated > ago(1h) and ProcessName == "sshd-session"
| parse SyslogMessage with * "for " User " from " SourceIP " port " Port " ssh2:" *
| project TimeGenerated, Computer, User, SourceIP, Port
The last example has two subtle differences. First, the * was removed from the end of the parse
statement; do this whenever the value you are parsing out runs to the end of the data,
otherwise you will receive an error for ambiguity. Second, :string and
:int were added to the User and Port variables inside the parse statement.
This casts the values explicitly during extraction; it is not necessary, but depending on the data
set, it can be useful.
Syslog
| where TimeGenerated > ago(1h) and ProcessName == "sshd-session"
| parse SyslogMessage with * "for " User:string " from " SourceIP " port " Port:int " ssh2: " PublicKey
| project TimeGenerated, Computer, User, SourceIP, Port, PublicKey
A leading asterisk means "skip everything up to the next literal," and a trailing asterisk means "ignore
anything after the last literal." The string fragments are the literal text you expect to see, and
User, SourceIP, and Port are the variable parts you want to extract.
Microsoft's official guidance is to prefer parse over multiple extract() calls
when the strings all follow the same format - it is one operator, the engine optimizes it together, and
it is easier to read.
extract and extract_all - when the format varies
When the string is not predictably formatted, regex is your tool. extract pulls out a
single capture group; extract_all pulls out all of them.
| extend Host = extract(@"https?://([^/:]+)", 1, RemoteUrl)
The first argument is the regex pattern, the second argument is which capture group to return (1-indexed), and the third argument is the column to extract from. The result is the captured text or an empty string if there is no match.
For Sysmon events forwarded into the SecurityEvent table, the field-level data lives inside
the EventData string as XML. There are two common ways to get at it. The first is to use
extract with a regex against the raw string, which is robust and easy to reason about. The
second is to use parse_xml followed by mv-apply with bag_pack to
construct a key-value bag of every <Data Name="..."> element. Here is the second
pattern, in the form I have used for Sysmon process-access events:
SecurityEvent
| where TimeGenerated > ago(1h) and EventID == 10
| extend CleanXml = replace_string(
EventData, @' xmlns="http://schemas.microsoft.com/win/2004/08/events/event"', ""
)
| extend ParsedXml = parse_xml(CleanXml)
| extend DataFields = ParsedXml.EventData.Data
| mv-apply Field = DataFields on (
extend Key = tostring(Field['@Name']), Value = Field['#text']
| summarize PackedBag = make_bag(bag_pack(Key, Value))
)
| evaluate bag_unpack(PackedBag, "Unpacked_")
| project TimeGenerated, Computer,
SourceImage = tostring(Unpacked_SourceImage),
TargetImage = tostring(Unpacked_TargetImage),
GrantedAccess = tostring(Unpacked_GrantedAccess),
CallTrace = tostring(Unpacked_CallTrace)
This is more advanced than where you are right now, and that is fine. The reason it appears here is so
you have a complete worked example to come back to once you start touching Sysmon data seriously. Read
the steps and trust that they will become clear with practice. The high-level shape is: parse the XML,
walk over each Data element, build a property bag mapping Name → value, and
then read fields out of the bag by name.
mv-expand
You have just seen mv-apply. Its simpler cousin is mv-expand, which takes a
single row containing an array and turns it into multiple rows, one per element. If a row has columns
Host = "WS01" and Tags = ["a","b","c"], then
mv-expand Tags produces three rows, all with Host = "WS01", and
Tags set to "a", then "b", then "c". This is
invaluable when a column holds a list and you want to filter on individual elements.
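You can reproduce that exact Host/Tags example with datatable, which fabricates a small inline table - useful for testing an operator in isolation:

datatable(Host: string, Tags: dynamic) [
    "WS01", dynamic(["a", "b", "c"])
]
| mv-expand Tags to typeof(string)
// three rows: (WS01, "a"), (WS01, "b"), (WS01, "c")

The same idea on real data: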
DeviceProcessEvents
| where TimeGenerated > ago(1h) and ProcessCommandLine has "powershell"
| extend Tokens = split(ProcessCommandLine, " ")
| mv-expand Token = Tokens to typeof(string)
| where Token matches regex @"^[A-Za-z0-9+/=]{200,}$"
| project TimeGenerated, Token, Tokens
That query splits each PowerShell command line on whitespace, expands one row per token, and then keeps
only tokens that look like long base64-ish blobs - a quick way to find encoded PowerShell payloads.
mv-expand is the bridge between a column-oriented world and a row-oriented one when the
column contains a list.
This section is the most concentrated value in the document. Each item here is a real error I have caught in past query work. Read these in advance and you will recognize them when they happen to you.
This one came up in a review for a MITRE ATT&CK association query and is worth understanding. The setup
was a query with a shared base - BaselineSecEvent - that was used to feed two downstream
sub-queries, FileOrObjectAccessEvents and SystemAccess. The original query
looked something like this in shape:
let FileOrObjectAccessEvents = materialize(
SecurityEvent
| where ... lots of common logic ...
| where EventID in (2, 6, 7, 11, 12, ...)
);
let SystemAccess = materialize(
SecurityEvent
| where ... lots of common logic, again ...
| where EventID in (4624, 4625, 4634, ...)
);
union FileOrObjectAccessEvents, SystemAccess
The issue: the materialize() function forces the query engine to execute a subquery exactly
once, cache its results in memory, and then reuse that cached data for any subsequent references within
the same query. Here, though, there were two separate materializes, so the big SecurityEvent
scan and all its parsing and lookups happened twice - once for each materialized sub-query. The fix was
to materialize the shared base once and let the sub-queries read from it:
let BaselineSecEvent = materialize(
SecurityEvent
| where TimeGenerated between (StartTime .. EndTime)
| where ... all the common logic, run once ...
);
let FileOrObjectAccessEvents = BaselineSecEvent | where EventID in (2, 6, 7, 11, ...);
let SystemAccess = BaselineSecEvent | where EventID in (4624, 4625, 4634, ...);
union FileOrObjectAccessEvents, SystemAccess
The lesson: materialize is cheap to use but expensive to misuse. When you have shared logic
that feeds multiple consumers, materialize the shared piece, not each consumer. Two materializes around
the same expensive base means doing the expensive work twice.
In the same review I caught a subtler issue. There was a lookup table mapping Windows Event IDs to MITRE
ATT&CK techniques. Event 4625 (failed logon) is associated with both T1110.001 (password guessing)
and T1110.002 (password cracking) - two techniques. When the lookup was done with a simple
join, only one of those mappings was kept; the other was silently dropped. The lesson: any
time a lookup key can map to multiple values in your reference table, a join will not return what you
might expect. Either explicitly handle the multi-row case, use make_set to gather all
matches, or restructure the lookup.
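A minimal sketch of the make_set approach, with a hypothetical two-row TechniqueMap standing in for the real reference table:

let TechniqueMap = datatable(EventID: int, Technique: string) [
    4625, "T1110.001",
    4625, "T1110.002"
];
SecurityEvent
| where TimeGenerated > ago(1d) and EventID == 4625
| join kind=inner (TechniqueMap) on EventID
| summarize Techniques = make_set(Technique) by Account, Computer
// Techniques holds ["T1110.001", "T1110.002"] - nothing silently dropped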
The same query had a guard pattern: input scoping arrays like EventType = dynamic([]) with
a filter where "FileAccess" in~ (EventType). When the array was empty, the filter would be
false for every row and the query would return zero results - not an error, just nothing. This is a
legitimate design as a safety net (so the query does not run unbounded), but it also means that "zero
results" can mean either "no findings" or "you forgot to populate the input." Always sanity-check that
your filters are not silently filtering away everything.
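If you instead want an empty input to mean "no filter" rather than "match nothing," make that explicit - a sketch, where MyEvents and its EventType column are hypothetical:

let EventTypes = dynamic([]);
MyEvents
| where array_length(EventTypes) == 0 or EventType in~ (EventTypes)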
The schema gotcha section above covered this in detail, but it is worth repeating in this list.
Comparing EventID == "4624" (string against an int column) works in some KQL contexts and
silently produces zero rows in others. Always match the type. If you are not sure of the type, run
YourTable | getschema.
From the same review: a literal string @"N\A" where the intent was "N/A". The
escape-aware string had a backslash where there should have been a forward slash, so anything you
compared to it would not match. There was no error; the comparison just always failed. When you have a
query that "looks right" but returns nothing, copy-paste a literal value from the data into your filter
rather than retyping it.
To consolidate everything I have covered, let us walk through a non-trivial query as if you had encountered it in a wiki and needed to understand it before running it. Here is the query:
let Window = 5m;
let Fails = SecurityEvent
| where TimeGenerated > ago(24h) and EventID == 4625
| where Account !endswith "$"
| project FailTime = TimeGenerated, Account, IpAddress, Computer;
let Wins = SecurityEvent
| where TimeGenerated > ago(24h) and EventID == 4624
| project WinTime = TimeGenerated, Account, IpAddress, Computer;
Fails
| join kind=inner (Wins) on Account, IpAddress, Computer
| where WinTime > FailTime and WinTime - FailTime <= Window
| summarize
FailCount = count(),
FirstFail = min(FailTime),
Success = min(WinTime)
by Account, IpAddress, Computer
| where FailCount >= 3
| order by Success desc
Walk through it line by line, saying "then" at each pipe. The query starts by defining a constant
Window = 5m - a five-minute time window you will use later. It then defines two named
subqueries with let: Fails is all 4625 failed-logon events from the last 24
hours, with the timestamp renamed to FailTime so you can tell it apart later;
Wins is the same shape but for 4624 successful logons, with WinTime.
Click the "Show solution" button to check if you read/interpreted the query right.
let the Window equal five minutes;
Then let Fails equal SecurityEvent
| where the TimeGenerated is greater than the time was 24 hours ago and the EventID is equal to 4625
| then take Accounts that do not end with "$"
| then project the columns where FailTime equals TimeGenerated, Account, IpAddress, and Computer;
Next, let Wins equal SecurityEvent
| then take events where the TimeGenerated is greater than the time was 24 hours ago and the EventID is equal to 4624
| then project the columns where WinTime equals TimeGenerated, Account, IpAddress, and Computer;
Next take Fails
| then join Fails and Wins where they have matching row values for the columns Account, IpAddress, and Computer
| then take the WinTimes that are greater than the FailTimes and where their difference is less than or equal to five minutes
| then summarize the count of failed logins, the first failure time, and the first successful login time by the Account, IpAddress, and Computer
| then take the FailCounts that are greater than or equal to three
| then order by Success (First successful login time) descending
The actual computation starts at the final, let-less block. Read it: "Take the Fails table,
then inner-join it to the Wins table on the combination of Account, IpAddress, and Computer. Then keep
only matched rows where the win came after the fail and the gap was at most five minutes. Then group by
Account, IpAddress, and Computer, counting fails and grabbing the earliest fail time and earliest
success time per group. Then keep only groups with three or more fails. Then sort by success time, most
recent first."
What this query does, in plain language: it finds cases where the same account, from the same IP, on the same machine, had three or more failed logons within a five-minute window followed by a successful one. That is a classic password-guessing-then-success pattern.
If you can read a query like that comfortably and predict roughly what its output will look like before running it, you are ready for some exercises.
These are intentionally easy. The goal is to use what you just read, not to test you. Each exercise has a worked solution below it. Try the exercise in your own environment first, and only look at the solution after you have something working - even if "working" means "does not error but I am not sure of the output."
Use SecurityEvent. EventID 4624 is a successful logon. Project just the time, computer,
account, and IP address. Sort so the most recent is at the top and limit to 20 rows.
SecurityEvent
| where TimeGenerated > ago(1h)
| where EventID == 4624
| project TimeGenerated, Computer, Account, IpAddress
| order by TimeGenerated desc
| take 20
Three things to notice about this solution. First, the time filter comes before the EventID
filter - time is always first. Second, the project happens before the
order by, which is good practice because it reduces the amount of data the engine
has to sort. Third, I used take at the very end after sorting, which is the safe
pattern. Using top 20 by TimeGenerated desc would also work and is slightly more
efficient.
Use SecurityEvent. EventID 4625 is a failed logon. Group by Account,
count, and only show accounts with at least 5 failures.
SecurityEvent
| where TimeGenerated > ago(24h) and EventID == 4625
| where Account !endswith "$"
| summarize Failures = count() by Account
| where Failures >= 5
| order by Failures desc
The !endswith "$" filter is the convention for excluding machine accounts. Notice
that I filter on the count after the summarize, which is the only place you can - the
Failures column does not exist before that line.
Use DeviceProcessEvents. For each (DeviceName, FileName) pair, give the
first time you saw it, the last time you saw it, and a count of executions.
DeviceProcessEvents
| where TimeGenerated > ago(1h)
| summarize
FirstSeen = min(TimeGenerated),
LastSeen = max(TimeGenerated),
Executions = count()
by DeviceName, FileName
| order by Executions desc
This is the bread-and-butter "what ran where" query. If you wanted only suspicious-looking
executables, add a filter on FileName after the summarize, or filter
InitiatingProcessFileName before the summarize to scope to specific parents. The
point of this exercise is just to get comfortable with multiple aggregates in one summarize.
Use SecurityEvent. Group failures into 5-minute buckets. The result should be one row
per bucket with a count.
SecurityEvent
| where TimeGenerated > ago(6h) and EventID == 4625
| summarize Failures = count() by bin(TimeGenerated, 5m)
| order by TimeGenerated asc
| render timechart
The render timechart at the end is a Sentinel-friendly way of saying "draw this as
a line chart." You can leave it off if you just want the table. Notice that the
bin result is named TimeGenerated by default - the binning preserves
the source column name, which is convenient because it lines up with the chart axis.
Use DeviceProcessEvents and look at ProcessCommandLine. Use
has_any with a list. Project just the time, device, account, and command line.
let Suspicious = dynamic(["Invoke-Expression", "DownloadString", "FromBase64String"]);
DeviceProcessEvents
| where TimeGenerated > ago(1d)
| where FileName =~ "powershell.exe"
| where ProcessCommandLine has_any (Suspicious)
| project TimeGenerated, DeviceName, AccountName, ProcessCommandLine
| order by TimeGenerated desc
Three things to notice. First, has_any takes a parenthesized argument that is the
array. Second, you filter on FileName =~ "powershell.exe" before the
has_any filter - this narrows the rows the more expensive string match has to look
at. Third, we used =~ for the file name because we do not care about case.
Show the last 20 logons from DeviceLogonEvents in the last hour, joined to
DeviceNetworkInfo to show the device's most recent IP address.
let RecentInfo = DeviceNetworkInfo
| where TimeGenerated > ago(7d)
| summarize arg_max(TimeGenerated, *) by DeviceName
| project DeviceName, IPAddresses, MacAddress;
DeviceLogonEvents
| where TimeGenerated > ago(1h) and ActionType == "LogonSuccess"
| join kind=leftouter (RecentInfo) on DeviceName
| project TimeGenerated, DeviceName, AccountName, LogonType, IPAddresses
| order by TimeGenerated desc
| take 20
Two things to notice. First, we explicitly wrote kind=leftouter. You do not want
the default innerunique; we want to keep every logon event whether or not we have a
matching device-info row. Second, the lookup table RecentInfo has exactly one row
per device thanks to the arg_max summarize, so the join cannot fan out into
duplicates.
Use DeviceProcessEvents. Build a 30-day baseline of
(DeviceName, FileName, SHA256), then find executions in the last hour that do not match
anything in the baseline.
let Baseline = DeviceProcessEvents
| where TimeGenerated between (ago(30d) .. ago(1h))
| distinct DeviceName, FileName, SHA256;
DeviceProcessEvents
| where TimeGenerated > ago(1h)
| join kind=leftanti (Baseline) on DeviceName, FileName, SHA256
| project TimeGenerated, DeviceName, AccountName, FileName, ProcessCommandLine, SHA256
| order by TimeGenerated desc
This is the "first time seen on this host" pattern using leftanti. It is one of the
most useful detection idioms in KQL. Notice that the baseline window explicitly stops at
ago(1h), not at now() - we want to compare recent events to a baseline
that does not already include those recent events.
Group these in your head by what they do, not alphabetically:
| Goal | Operators |
|---|---|
| Filter rows | where, distinct, take, top, limit |
| Pick / rename / drop columns | project, project-keep, project-rename, project-away, project-reorder |
| Add computed columns | extend |
| Group and aggregate | summarize ... by ... |
| Sort | order by (synonym sort by) |
| Combine tables | union, join kind=..., lookup |
| Parse strings | parse, parse_json, parse_xml, extract, extract_all, split |
| Work with arrays | mv-expand, mv-apply, make_set, make_list |
| Time math | ago(), now(), bin(), datetime_part(), format_datetime() |
| Render results | render timechart, render barchart, render columnchart |
These are condensed reference cards. Bold columns are the ones you will reach for almost every query.
DeviceLogonEvents: LogonType is a string here ("Interactive", "Network", "RemoteInteractive"...) unlike SecurityEvent. Extra event detail lands in AdditionalFields as JSON.

Syslog: the raw message payload lives in SyslogMessage; you will parse this with parse or extract.
| EventID | Meaning |
|---|---|
| 4624 | An account was successfully logged on |
| 4625 | An account failed to log on |
| 4634 / 4647 | An account was logged off |
| 4648 | A logon was attempted using explicit credentials |
| 4672 | Special privileges assigned to new logon |
| 4688 | A new process has been created |
| 4697 | A service was installed |
| 4720 / 4732 / 4738 | Account created / member added to group / account changed |
| 4768 / 4769 / 4771 | Kerberos TGT / service ticket / pre-auth failure |
| 4776 | NTLM credential validation |
| 1102 | The audit log was cleared |
| Code | Name | What it means |
|---|---|---|
| 2 | Interactive | Console logon (someone typed their password at the keyboard) |
| 3 | Network | Network access to shared resources (most common type) |
| 4 | Batch | Scheduled tasks |
| 5 | Service | Service account starting a service |
| 7 | Unlock | Workstation unlock |
| 8 | NetworkCleartext | Network logon with cleartext credentials (suspicious) |
| 9 | NewCredentials | RunAs with /netonly |
| 10 | RemoteInteractive | RDP |
| 11 | CachedInteractive | Cached domain credentials (laptop offline) |