In Operations Manager, it is common to send an alert to the server owner when a computer’s heartbeat has failed (Health Service Heartbeat Failure). Alerting on heartbeat failures is important because they indicate a problem with the agent’s ability to communicate with Operations Manager. This condition can occur because the server has shut down or because of an issue with the Operations Manager health service (Microsoft Monitoring Agent). While an agent is not communicating with Operations Manager, no other critical issues on that server will be identified until the agent is back online. For example, if the agent fails and the server then runs out of disk space, there will be no alert for the disk space condition (or at least not until the agent is communicating again).
Alerting on both heartbeat failures and Computer Not Reachable conditions may be a better solution: heartbeat alerts are sent to the Operations Manager administrators (and hopefully auto-remediated with Operations Manager recoveries or an Orchestrator workflow), and Computer Not Reachable alerts are sent to the operating system team and/or the server owners.
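The split between the two audiences can be pictured as a simple lookup. This is only a sketch in Python, not how Operations Manager subscriptions are actually configured: the alert names are taken from this post, while the team names and the `recipients_for` helper are hypothetical placeholders.

```python
# Hypothetical routing table. Alert names come from the article;
# the team names are placeholders, not real Operations Manager objects.
ROUTES = {
    "Health Service Heartbeat Failure": ["scom-admins"],
    "Computer Not Reachable": ["os-team", "server-owners"],
}


def recipients_for(alert_name):
    """Return the teams to notify for an alert name; unknown alerts map to no one."""
    return list(ROUTES.get(alert_name, []))


if __name__ == "__main__":
    for name in ROUTES:
        print(name, "->", ", ".join(recipients_for(name)))
```

In a real deployment the same split would be expressed with notification subscriptions rather than code.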
The challenge: During a recent engagement, I worked with a customer who wanted to alert in off-hours only when a server was down, not when the agent had merely stopped reporting to Operations Manager. The customer also wanted to track availability based on whether a server could be pinged rather than on whether the agent on the server was reporting.
To provide this type of alerting, we used the built-in group for health service watchers shown below (Microsoft.SystemCenter.AgentWatchersGroup, or Health Service Watcher Group):
The Computer Not Reachable monitor:
The Computer Not Reachable monitor is an interesting one. It performs a ping test on servers whose agent has failed to heartbeat (by default, Operations Manager allows three missed heartbeats, with a heartbeat sent every 60 seconds). The summary for the monitor explains it well:
The monitor is shown below as part of the Health Service Watcher within Availability.
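The sequence this monitor follows, missed heartbeats first and a ping second, can be sketched as a small decision function. This is an illustration in Python under stated assumptions, not the monitor's actual implementation: the `classify` function, its return labels, and the injected `ping` callable are all hypothetical, and the 60-second interval and three-heartbeat threshold are the defaults mentioned above.

```python
from typing import Callable, List

HEARTBEAT_INTERVAL_SECONDS = 60  # default interval mentioned in the article
MISSED_HEARTBEATS_ALLOWED = 3    # default threshold mentioned in the article


def classify(seconds_since_heartbeat: float, ping: Callable[[], bool]) -> List[str]:
    """Return the conditions that apply to a watched computer.

    `ping` is a hypothetical probe that returns True when the server answers.
    It is only called once the heartbeat threshold has been exceeded.
    """
    threshold = HEARTBEAT_INTERVAL_SECONDS * MISSED_HEARTBEATS_ALLOWED
    if seconds_since_heartbeat < threshold:
        return []  # agent is still reporting, nothing to raise
    conditions = ["heartbeat_failure"]
    if not ping():
        # The server itself does not respond, not just the agent.
        conditions.append("computer_not_reachable")
    return conditions


if __name__ == "__main__":
    print(classify(30, lambda: True))
    print(classify(200, lambda: True))
    print(classify(200, lambda: False))
```

Passing the probe in as a callable keeps the sketch free of any real network call; a stopped agent on a running server yields only the heartbeat condition, while a powered-off server yields both.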
Custom availability reporting for computer not reachable:
The out-of-the-box availability report is available for an object, not for a specific monitor in Operations Manager. To provide availability reporting, we used a custom availability report which Daniel Savage created 6 years ago (see http://blogs.technet.com/b/momteam/archive/2008/06/26/the-power-of-linked-reports.aspx). This custom availability report allowed us to report on availability for a specific monitor, and it is added to the reporting pane as shown below:
A sample output for this report is shown below:
Note that the report above shows availability for the specific monitor (Computer Not Reachable) rather than for the Health Service Watcher itself.
Summary: By directing the two different alerts generated by Operations Manager to two different groups, we can have one group which focuses on servers that are completely non-responsive and another which focuses on getting Operations Manager agents back online when failures occur. Also, by using the custom availability report shown above, we can easily report on availability based on whether a server can be pinged instead of whether the server agent is functional.
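To make the idea of per-monitor availability concrete, here is a minimal Python sketch of the arithmetic. It is not how the Operations Manager report is computed internally: the `availability_percent` function and the sample outage times are hypothetical, and the outage intervals are assumed not to overlap.

```python
from datetime import datetime


def availability_percent(window_start, window_end, outages):
    """Percentage of the window not covered by outage intervals.

    `outages` is a list of (start, end) datetime pairs, assumed non-overlapping.
    Each outage is clipped to the reporting window before it is counted.
    """
    total = (window_end - window_start).total_seconds()
    if total <= 0:
        raise ValueError("window_end must be after window_start")
    down = 0.0
    for start, end in outages:
        start = max(start, window_start)
        end = min(end, window_end)
        if end > start:
            down += (end - start).total_seconds()
    return 100.0 * (total - down) / total


if __name__ == "__main__":
    day_start = datetime(2014, 1, 1)
    day_end = datetime(2014, 1, 2)
    # Hypothetical data: the server failed to answer ping for 36 minutes.
    unreachable = [(datetime(2014, 1, 1, 3, 0), datetime(2014, 1, 1, 3, 36))]
    print(availability_percent(day_start, day_end, unreachable))  # 97.5
```

Feeding the same function the longer intervals during which the agent was not reporting would give a lower figure, which is the gap between agent-based and ping-based availability.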