Recently we were faced with a modern version of the “monitor-the-monitor” question.
For background, “monitor-the-monitor” asks: who monitors the monitoring system to verify that it is actually working? The answer to this question is often a second (potentially lighter) monitoring system that acts as a secondary check on the primary one. In this case, we are using Azure Monitor (a Log Analytics workspace, per this article) to provide monitoring for on-prem systems and applications.
The question was raised – what happens if Azure Monitor Logs is not able to ingest data, or if network connectivity fails between the various systems and the Log Analytics workspace in Azure Monitor?
This is an interesting question because data isn’t always ingested from on-prem systems – it can also be gathered from Azure itself or from systems running in Azure.
As a result, there are several different conditions which may need to be checked, which we will discuss in this blog post.
- Has there been a significant decrease in the amount of data written to the workspace? If so, this could indicate either Azure Monitor’s inability to ingest data or, more likely, the loss of network connectivity between the systems and Azure Monitor.
- Has no data been written to the workspace recently? If so, this would most likely indicate Azure Monitor’s inability to ingest data, but if the only resources reporting to the workspace are on-prem systems, it could also be a loss of network connectivity between those systems and Azure Monitor.
- What about the opposite of these situations – a sharp increase in the amount of data over a short period of time? This may indicate an issue with what is being monitored, or the addition of new logs, either of which can have a significant impact on data volume (i.e., cost) for the workspace.
This blog post will showcase some sample queries and alerts to identify the above conditions. For background, the expected timeframe to ingest data into Azure Monitor (a Log Analytics workspace, per this article) is between 2 and 5 minutes.
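If you are curious how close your workspace is to that window, here is a minimal sketch of my own (not from the original queries) that estimates ingestion latency by comparing the built-in ingestion_time() value against TimeGenerated. It assumes the Heartbeat table is populated by at least one agent; any table with recent data would work.
// Hedged sketch: estimate ingestion latency for recent Heartbeat records.
// Assumes at least one agent is reporting; swap in any table with fresh data.
Heartbeat
| where TimeGenerated > ago(1h)
| extend LatencySeconds = datetime_diff('second', ingestion_time(), TimeGenerated)
| summarize AvgLatencySeconds = avg(LatencySeconds), MaxLatencySeconds = max(LatencySeconds)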
A significant decrease in data being ingested by Azure Monitor (Log Analytics workspace):
The query below identifies the pattern of data being ingested by Azure Monitor over the last week and compares it with the data ingested over the last two hours. Note: I am multiplying the comparison by 10 because the alert criteria only supports integers, so a Comparison value of 10 means the last two hours exactly match the weekly average. The alert logic is set so that if the threshold is < 2 (i.e., recent volume is below 20% of the weekly average) it will trigger an alert indicating that the amount of data being sent to Azure Monitor has significantly decreased over the last two hours compared to the last week (or the last day if used in an alert).
let Base = Usage
// Baseline: average record volume over the last week, excluding the most recent two hours
| where TimeGenerated > startofday(ago(7d)) and TimeGenerated < now(-2h)
| summarize AvgVolume = avg(Quantity)
| project AvgVolume, MyKey = "Key";
let LastTwoHours = Usage
// Recent: average record volume over the last two hours
| where TimeGenerated > now(-2h)
| summarize RecentVolume = avg(Quantity)
| project RecentVolume, MyKey = "Key";
Base
| join (LastTwoHours) on MyKey
| project-away MyKey1
// Multiply by 10 so the alert threshold can remain an integer (10 = matches the baseline)
| extend Comparison = (RecentVolume / AvgVolume) * 10
| project Comparison, TimeGenerated = now()
| summarize AggregatedValue = avg(Comparison) by bin(TimeGenerated, 1h)
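If the decrease alert fires and you want to know which table dropped off, here is a hedged variation of my own (not part of the original post) that breaks the same comparison out per data type:
// Hedged variant: baseline-vs-recent ratio per DataType, lowest ratios first,
// to show which tables account for the decrease.
let Base = Usage
| where TimeGenerated > startofday(ago(7d)) and TimeGenerated < now(-2h)
| summarize AvgVolume = avg(Quantity) by DataType;
Usage
| where TimeGenerated > now(-2h)
| summarize RecentVolume = avg(Quantity) by DataType
| join kind=inner (Base) on DataType
| extend Comparison = (RecentVolume / AvgVolume) * 10
| project DataType, Comparison
| order by Comparison asc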
No data is being ingested by Azure Monitor:
If you do not need to be notified quickly when data stops being ingested, the Usage table works, but it appears to aggregate on an hourly basis. See the example below:
// Data volume written in roughly the last hour; the 65-minute window allows
// for Usage's hourly aggregation.
Usage
| where TimeGenerated > now(-65min)
| summarize AggregatedValue = sum(Quantity) by bin(TimeGenerated, 1h)
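To see how far behind the Usage table is actually running, here is a quick sketch of my own (again, not from the original queries):
// How stale is Usage right now? A value near 60 is consistent with hourly aggregation.
Usage
| summarize LatestRecord = max(TimeGenerated)
| extend MinutesBehind = datetime_diff('minute', now(), LatestRecord)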
Additionally, you can search over the last 15 minutes across all data types and alert if the resulting count is 0 (yes, I know “search *” is frowned upon, but this appears to be a valid use-case).
// Count every record of any type ingested in the last 15 minutes.
search *
| where TimeGenerated > now(-15min)
| count
However, the above is not allowed in an alert… so the only workaround I can see for using this query as-is would be Microsoft Flow or Logic Apps running it on a scheduled basis.
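Alternatively, a workaround of my own (not from the original post): an explicit union of the tables you expect avoids the “search” restriction and should be accepted by alert rules, assuming you know which tables your workspace collects – Heartbeat, Event, Syslog, and Perf are used here as placeholders:
// Hedged sketch: count recent records across an explicit list of tables.
// Adjust the table list to match what your workspace actually receives.
union Heartbeat, Event, Syslog, Perf
| where TimeGenerated > now(-15min)
| summarize AggregatedValue = count() by bin(TimeGenerated, 15m)
An alert configured to fire when this query returns no results (or when AggregatedValue is 0) would then cover the “no data” case.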
Significant increase in data being ingested by Azure Monitor:
The query below is the same one used in the first example in this blog post: it identifies the pattern of data being ingested by Azure Monitor over the last week and compares it with the data ingested over the last two hours. Note: I am multiplying the comparison by 10 because the alert criteria only supports integers. The alert logic is set so that if the threshold is > 20 (i.e., recent volume is more than double the weekly average) it will trigger an alert indicating that the amount of data being sent to Azure Monitor has significantly increased over the last two hours compared to the last week (or the last day if used in an alert).
let Base = Usage
// Baseline: average record volume over the last week, excluding the most recent two hours
| where TimeGenerated > startofday(ago(7d)) and TimeGenerated < now(-2h)
| summarize AvgVolume = avg(Quantity)
| project AvgVolume, MyKey = "Key";
let LastTwoHours = Usage
// Recent: average record volume over the last two hours
| where TimeGenerated > now(-2h)
| summarize RecentVolume = avg(Quantity)
| project RecentVolume, MyKey = "Key";
Base
| join (LastTwoHours) on MyKey
| project-away MyKey1
// Multiply by 10 so the alert threshold can remain an integer (10 = matches the baseline)
| extend Comparison = (RecentVolume / AvgVolume) * 10
| project Comparison, TimeGenerated = now()
| summarize AggregatedValue = avg(Comparison) by bin(TimeGenerated, 1h)
Additional readings:
- I ran into this after I built my queries, but it has some solid insights on how to notify when the volume is higher than expected.
- Microsoft also recently put up a blog post on the release of the Azure Monitor status blog, which is directly available here.
Summary: The above queries, combined with alerts, can provide a first step towards using Azure itself to alert when Azure Monitor is not receiving data, or when the amount of data being sent to the workspace has significantly increased or decreased.