Recently we had a requirement to provide more than basic CPU threshold queries for Log Analytics.
We have been watching the upcoming dynamic threshold functionality to see if this will cover what we need. However, this appears to only be available for systems running in Azure.
For our on-prem systems, we have developed the following queries to alert when a server is over or under a specific threshold for a specific percentage of the data points within a specific timeframe. Examples:
- Notify when a server is over 90% CPU for more than 70% of the past 10-minute timeframe.
- Notify when a server is over 95% CPU for more than 99% of the past 60-minute timeframe.
- Notify when a server is under 600 Mbytes of available memory for more than 90% of the past 60-minute timeframe.
This blog post shows sample CPU, memory, and disk space threshold queries for virtual machines; however, the same approach can be used for any performance counter in Log Analytics.
Monitoring Processor Health
If we want to look at the CPU usage for a system, we can use a query like this one, which shows a specific system's % CPU over the last hour for each instance of the counter on that system (0, 1, 2, 3, _Total):
let AssessTime = 1h;
Perf
| where CounterName == "% Processor Time" and TimeGenerated > ago(AssessTime) and Computer contains "XYZ"
If we render this data as a stacked column chart by InstanceName, we can see the load on each instance over time.
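The rendering step can also be written into the query itself. The following is a minimal sketch, not necessarily how the original chart was produced: the 5-minute bin size and the AvgCPU column name are assumptions chosen for illustration, and "XYZ" is the same placeholder computer name used above.

```kusto
// Sketch: average each CPU instance into 5-minute bins and chart them stacked.
// The 1h window, the 5m bin size, and "XYZ" are placeholder values.
let AssessTime = 1h;
Perf
| where TimeGenerated > ago(AssessTime)
| where CounterName == "% Processor Time" and Computer contains "XYZ"
| summarize AvgCPU = avg(CounterValue) by bin(TimeGenerated, 5m), InstanceName
| render columnchart with (kind=stacked)
```

Adding `and InstanceName != "_Total"` to the filter would keep the total from being stacked on top of the individual cores, if that reads better for your data.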
Below is the query for the % Processor Time counter. It looks at the “% Processor Time” counter on the “Processor” or “System” object and finds computers whose value was over 90% for more than 70% of the data points in the last 30 minutes.
Next Generation CPU query
let AssessTime = 30m;
let CounterThreshold = 90;
let CounterThresholdPct = 70;
Perf
| where (ObjectName == "Processor" or ObjectName == "System") and CounterName == "% Processor Time" and TimeGenerated > ago(AssessTime)
// PctOver = percentage of samples in the window that exceeded CounterThreshold
| summarize MaxCPU = max(CounterValue), CpuOverLimit = countif(CounterValue > CounterThreshold), PerfInstanceCount = count(), PctOver = round(100.0 * countif(CounterValue > CounterThreshold) / count()) by Computer
| where PctOver > CounterThresholdPct
This CPU query can be adapted to whatever configuration you need. The parameters are:
- AssessTime = How long of a timeframe (10 minutes or 60 minutes in the examples above)
- CounterThreshold = What is the threshold for the counter we are watching (90% CPU in the example above)
- CounterThresholdPct = What percent of the data points in the timeframe must be above the CounterThreshold (70% in the example above)
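As an illustration of adjusting these parameters, the second example from the introduction (over 95% CPU for more than 99% of the past 60 minutes) might be written as follows. This is a sketch that trims the summarize step down to the percentage column only; the filter and parameter names are taken from the query above.

```kusto
// Sketch: over 95% CPU for more than 99% of the samples in the past 60 minutes.
let AssessTime = 60m;
let CounterThreshold = 95;
let CounterThresholdPct = 99;
Perf
| where (ObjectName == "Processor" or ObjectName == "System") and CounterName == "% Processor Time" and TimeGenerated > ago(AssessTime)
// Percentage of samples per computer that exceeded the threshold
| summarize PctOver = round(100.0 * countif(CounterValue > CounterThreshold) / count()) by Computer
| where PctOver > CounterThresholdPct
```

Because PctOver is rounded to a whole number, a setting of 99 only matches computers whose rounded percentage is 100, so it behaves as an "almost every sample" condition.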
A sample result set is shown below (with CounterThreshold and CounterThresholdPct updated so there is sample data):
This query approach only alerts when a counter is above a threshold for a percentage of the data points over a specified timeframe. This should result in a much more targeted alert, i.e., one that answers the question: when is my CPU really a bottleneck?
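To see why this filters out short spikes, here is a small self-contained sketch that runs the same calculation over made-up data instead of the Perf table. The two computer names and all of the sample values are hypothetical.

```kusto
// Sketch with hypothetical data: 10 CPU samples per computer.
let CounterThreshold = 90;
let CounterThresholdPct = 70;
datatable(Computer:string, CounterValue:real)
[
    // srv-sustained: 8 of 10 samples are over 90
    "srv-sustained", 95.0, "srv-sustained", 97.0, "srv-sustained", 92.0, "srv-sustained", 88.0, "srv-sustained", 96.0,
    "srv-sustained", 99.0, "srv-sustained", 93.0, "srv-sustained", 85.0, "srv-sustained", 94.0, "srv-sustained", 98.0,
    // srv-spike: only 1 of 10 samples is over 90
    "srv-spike", 20.0, "srv-spike", 25.0, "srv-spike", 99.0, "srv-spike", 22.0, "srv-spike", 18.0,
    "srv-spike", 30.0, "srv-spike", 27.0, "srv-spike", 21.0, "srv-spike", 24.0, "srv-spike", 19.0
]
| summarize MaxCPU = max(CounterValue), PctOver = round(100.0 * countif(CounterValue > CounterThreshold) / count()) by Computer
| where PctOver > CounterThresholdPct
```

Both computers reach a MaxCPU of 99, so a simple "max over 90" alert would fire for both. With the percentage check, srv-sustained has a PctOver of 80 and is returned, while srv-spike has a PctOver of 10 and is filtered out.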
Monitoring Memory Health
If we want to look at the memory usage for a system we can use a query like this one which shows how a specific system’s available memory looks over the last hour. We can see that the available memory is consistently less than the threshold of 700.
let AssessTime = 1h;
Perf
| where CounterName == "Available MBytes" and TimeGenerated > ago(AssessTime) and Computer contains "XYZ"
Below is a variation of the CPU query rewritten for the available memory counter. This query looks at the “Available MBytes” counter and finds computers whose value was under 700 MB for more than 90% of the data points in the last hour.
Next Generation memory query
let AssessTime = 60m;
let CounterThreshold = 700;
let CounterThresholdPct = 90;
Perf
| where CounterName == "Available MBytes" and TimeGenerated > ago(AssessTime)
// PctUnder = percentage of samples in the window that fell below CounterThreshold
| summarize MinMemory = min(CounterValue), MemoryUnderLimit = countif(CounterValue < CounterThreshold), PerfInstanceCount = count(), PctUnder = round(100.0 * countif(CounterValue < CounterThreshold) / count()) by Computer
| where PctUnder > CounterThresholdPct
A sample result set is shown below:
This query approach only alerts when a counter is below a threshold for a percentage of the data points over a specified timeframe. This should result in a much more targeted alert, i.e., one that answers the question: when is my memory really a bottleneck?
Monitoring Disk Space
The query below shows a similar type of query focused on free disk space.
let AssessTime = 1d;
let CounterThreshold = 5;
let CounterThresholdPct = 70;
Perf
| where ObjectName == "LogicalDisk" and CounterName == "% Free Space" and InstanceName != "_Total" and TimeGenerated > ago(AssessTime)
// PctUnder = percentage of samples per disk that fell below CounterThreshold
| summarize DiskFreeUnderLimit = countif(CounterValue < CounterThreshold), PerfInstanceCount = count(), PctUnder = round(100.0 * countif(CounterValue < CounterThreshold) / count()) by Computer, InstanceName
| where PctUnder > CounterThresholdPct
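The same pattern should carry over to other performance counters by swapping the filter and the comparison. The sketch below is an illustration only: it assumes the “Avg. Disk sec/Read” counter is being collected into your workspace, and the 0.025 second (25 ms) threshold is an arbitrary example value rather than a recommendation.

```kusto
// Sketch: disks whose average read latency was over 25 ms for more than 70% of the past hour.
// Assumes "Avg. Disk sec/Read" is collected; the threshold values are illustrative.
let AssessTime = 60m;
let CounterThreshold = 0.025;
let CounterThresholdPct = 70;
Perf
| where ObjectName == "LogicalDisk" and CounterName == "Avg. Disk sec/Read" and InstanceName != "_Total" and TimeGenerated > ago(AssessTime)
| summarize MaxLatency = max(CounterValue), PctOver = round(100.0 * countif(CounterValue > CounterThreshold) / count()) by Computer, InstanceName
| where PctOver > CounterThresholdPct
```

Only three things change from one counter to the next: the counter filter, the direction of the comparison (over or under), and the grouping columns (Computer alone, or Computer plus InstanceName for per-disk counters).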
Summary:
The sample queries in this blog post (see the “Next Generation CPU query” and “Next Generation memory query” sections) should provide actionable alerting on the CPU and memory KPIs for servers. Additionally, the same pattern can be used for any performance metric that you gather into Log Analytics!
P.S. I owe a huge shout-out to Thomas Forbes for his development of the CPU query contained in this blog post. Way to go Thomas!
UPDATE: Updated on 6/1/21 with functional enhancements to the queries that Thomas put together.