Monitoring the health of physical or virtual systems should focus on four key pillars: Availability, Performance, Security, and Configuration (please note, this is a concept that I have shamelessly stolen from my experiences working with System Center Operations Manager for more than a decade). Availability focuses on whether the system is online and accessible. Performance focuses on the key performance indicators (KPIs) that determine how well a system is functioning. Security focuses, logically enough, on how the resource is secured, and Configuration focuses on how the system is configured. Each of these focus areas needs effective alerting and a way to visualize its state (generally done through dashboards). Today’s blog post will focus on two of those areas: Performance and Availability.
Performance monitoring & Simulating performance failures
Performance monitoring for systems focuses on four KPIs: Disk, CPU (Processor), Memory, and Network. To develop effective alerting for the performance of systems, you need a way to test failure conditions to validate that they alert as expected. As an example, if you are monitoring for low disk space conditions, you need a way to push a disk into a warning-level and then an error-level condition to validate that the alert fires as expected. These are the methods I recommend to quickly test the health of each of the four KPIs:
While there are other factors to disk health from a performance perspective, the primary one to focus on is the amount of free disk space that exists on a drive. To cause a failure condition there are two quick tricks I am aware of:
- To quickly fill up a disk for testing of low disk space conditions, open Disk Management, pick the disk, create a VHD in the VHDX format (the option for larger drives), and choose a fixed size. If you size it correctly, it will consume enough space to cause a low disk space condition.
- You can also try this trick from Steve Buchanan using a tool called Philip: Simulate low disk space to trigger SCOM alert | Buchatech.com
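If you would rather script the same idea than click through Disk Management, here is a minimal Python sketch that works out how much space to consume and writes a filler file of that size. The 5% free-space target, the drive letter, and the file name are assumptions for illustration; use whatever thresholds your own alert rules are set to. On a compressed volume a file full of zeros may not consume its full size, so treat this as a rough stand-in for the fixed-size VHDX trick.

```python
import os
import shutil


def bytes_to_fill(total: int, free: int, target_free_pct: float) -> int:
    """Return how many bytes must be consumed so free space drops to target_free_pct."""
    target_free = int(total * target_free_pct / 100)
    return max(0, free - target_free)


def create_filler(path: str, size: int, chunk: int = 64 * 1024 * 1024) -> None:
    """Write `size` bytes of zeros to `path` in chunks so the space is actually allocated."""
    block = b"\0" * min(chunk, max(size, 1))
    remaining = size
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(len(block), remaining)
            f.write(block[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())


# Example usage (hypothetical drive and threshold, uncomment to run):
# usage = shutil.disk_usage("D:\\")
# size = bytes_to_fill(usage.total, usage.free, target_free_pct=5.0)
# create_filler("D:\\filler.bin", size)
```

Delete the filler file when you are done to return the disk to a healthy state and confirm that the alert closes.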
The primary metrics for CPU performance health are % Processor Time and Processor Queue Length. This tool provides a way to quickly generate a high CPU situation:
- There is an easy-to-use tool available online that stress tests the CPU (https://cpux.net/cpu-stress-test-online).
You can also take a page from the gamer’s notebook and use some of the tools that they recommend for stress testing. However, their focus is slightly different: they are pushing systems to identify errors or check specifications, whereas this blog post is only trying to trigger monitoring alerts.
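If you cannot use an online tool on the system you are testing, a high CPU condition is also easy to sketch locally. The Python below is a minimal example, not a replacement for a real stress tool: it starts one busy-loop process per worker so that each one can keep a logical core pegged for a set duration. The function names and the durations are my own choices for illustration.

```python
import multiprocessing as mp
import time


def burn(seconds: float) -> int:
    """Spin in a tight loop for `seconds`; returns the number of iterations completed."""
    deadline = time.perf_counter() + seconds
    iterations = 0
    while time.perf_counter() < deadline:
        iterations += 1
    return iterations


def stress_cpu(seconds: float, workers: int) -> None:
    """Run one busy-loop process per worker so each can saturate a logical core."""
    procs = [mp.Process(target=burn, args=(seconds,)) for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()


# Example usage (assumed values: 5 minutes on every logical core):
# stress_cpu(seconds=300, workers=mp.cpu_count())
```

On Windows, call stress_cpu from under an `if __name__ == "__main__":` guard, since each worker process re-imports the script. Run it for longer than your alert's sampling window, otherwise the spike may be averaged away before the alert fires.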
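For memory, Testlimit from Sysinternals can leak and touch memory to create memory pressure. As an example, this command leaks and touches 4096 MB in a single allocation:
testlimit64.exe -d 4096 -c 1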
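As I understand the Testlimit flags, the command above leaks and touches 4096 MB of memory in one allocation. The same idea can be sketched in Python: allocate memory in chunks and write to every page so that the operating system has to back it with real memory rather than just reserving it. This is a rough illustration under my own assumptions (1 MB chunks, a 4096-byte page size), not a description of how Testlimit works internally.

```python
def hold_memory(megabytes: int, page_size: int = 4096) -> list:
    """Allocate `megabytes` MB in 1 MB chunks and touch every page so it is committed."""
    chunks = []
    for _ in range(megabytes):
        chunk = bytearray(1024 * 1024)
        for offset in range(0, len(chunk), page_size):
            chunk[offset] = 1  # writing to the page forces it to be backed by real memory
        chunks.append(chunk)
    return chunks


# Example usage (assumed size, keep the reference alive while the alert evaluates):
# held = hold_memory(4096)
# input("Memory is held. Press Enter to release it.")
```

The memory is released as soon as the returned list goes out of scope or the process exits, which makes it easy to confirm that the alert clears afterwards.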
SolarWinds provides a free 14-day evaluation of a solution that they call “WAN Killer”. It was really easy to use and worked like a champ to simulate a heavily used network connection.
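If you just need some sustained traffic between two test machines that you own, a very small traffic generator can be sketched in Python. To be clear, this is not how WAN Killer works, and it has none of its controls: it simply sends random UDP datagrams to a host and port for a fixed time. The function name, payload size, host name, and port are all assumptions for illustration.

```python
import os
import socket
import time


def send_udp_load(host: str, port: int, seconds: float, payload_size: int = 1024) -> int:
    """Send random UDP datagrams to host:port for `seconds`; returns total bytes sent."""
    payload = os.urandom(payload_size)
    sent = 0
    deadline = time.monotonic() + seconds
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        while time.monotonic() < deadline:
            sent += sock.sendto(payload, (host, port))
    return sent


# Example usage (hypothetical lab target, uncomment to run):
# total = send_udp_load("testserver01", 9000, seconds=60)
# print(f"Sent {total / 1_000_000:.1f} MB")
```

Only point this at systems and network links that you are responsible for, and watch the network counters on the sending machine to confirm the load is high enough to cross your alert threshold.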
The tools listed above provide a way to simulate conditions where alerts would fire due to an unhealthy state of the core KPIs for systems.
Availability monitoring & Simulating availability failures
Availability monitoring for systems focuses on whether a heartbeat is being received from a system and whether the system can be reached via the network (ping), as well as stability items such as operating system crashes and application crashes.
To simulate heartbeat failures, simply stop the service that is providing the heartbeat. As an example, if you are using Log Analytics to provide heartbeat monitoring, the agent is currently the Microsoft Monitoring Agent. This agent runs as a service that can be stopped to simulate a heartbeat failure.
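Keep in mind that a heartbeat alert usually fires only after the heartbeat has been missing for longer than some threshold, so the service needs to stay stopped at least that long. The logic behind that kind of alert can be sketched in a few lines of Python. The 5-minute threshold and the function name are assumptions for illustration; check what your own alert rule uses.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_missing(
    last_heartbeat: datetime,
    now: datetime,
    threshold: timedelta = timedelta(minutes=5),  # assumed threshold, not a product default
) -> bool:
    """Return True when the gap since the last heartbeat exceeds the alert threshold."""
    return now - last_heartbeat > threshold


# Example usage with fixed timestamps:
last_seen = datetime(2021, 8, 31, 12, 0, tzinfo=timezone.utc)
print(heartbeat_missing(last_seen, datetime(2021, 8, 31, 12, 3, tzinfo=timezone.utc)))
print(heartbeat_missing(last_seen, datetime(2021, 8, 31, 12, 10, tzinfo=timezone.utc)))
```

With the assumed 5-minute threshold, the first check (3 minutes after the last heartbeat) is still healthy and the second (10 minutes after) is the point where an alert would be expected.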
To simulate a ping level failure, either shut down the system or use Windows Firewall to block traffic to the system from the machine you are using to perform the ping testing.
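To confirm from the monitoring machine that the ping failure is really in effect before you wait for the alert, a rough reachability check can be scripted. This Python sketch shells out to the operating system's ping command and treats an exit code of 0 as reachable. The function names and the host name are assumptions for illustration, and the exit code is only a rough signal, so read the ping output yourself if the result looks wrong.

```python
import platform
import subprocess


def ping_command(host: str, count: int = 1, system: str = "") -> list:
    """Build a ping command; Windows uses -n for the count, most other systems use -c."""
    system = system or platform.system()
    flag = "-n" if system == "Windows" else "-c"
    return ["ping", flag, str(count), host]


def is_reachable(host: str) -> bool:
    """Return True when the ping command exits with code 0 (a rough reachability check)."""
    result = subprocess.run(
        ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


# Example usage (hypothetical host name, uncomment to run):
# print(is_reachable("testserver01"))
```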
To simulate an application crash, I found “Bad Application” to be the easiest to use.
Operating System crashes
So far, I haven’t found an effective way to simulate Operating System crashes.
Update 8/31/21: On Twitter, Steve Burkett (@steveburkett) pointed out NotMyFault from Sysinternals! Check it out at: NotMyFault – Windows Sysinternals | Microsoft Docs