I was asked to monitor ONE server every minute to determine its uptime,
instead of the default of every 5 minutes. This applies to just this ONE
server; all others should stay on the default of every 5 minutes.
So far I’m aware that I need to schedule the "Agent HeartBeat Failure
(Analysis)" rule to run every 1 minute in the Data Provider settings.
Before that, I also need to create a computer group (CG) that contains only
this one server, copy the "Agent HeartBeat Failure (Analysis)" script into a
newly created PRG, and assign that PRG to the CG.
My desired result is to get an alert *whenever* the server is noted as down
after *checking each minute*. In other words, I want to make sure the server
is up every single minute.
My question is: am I planning this right, or have I overlooked something?
What about performance between the consolidator and the agent, since a
1-minute communication interval is so short? Will it cause any performance
degradation?
Contributed by: Baelson Duque [MSFT]
Depending on the number of servers you want to start monitoring at this
lower threshold, yes, it will affect the DCAM and the Agent, simply because
we also use the heartbeat as a mechanism for requesting new rules, so there
is a bit of overhead there.
What you would need to do is:
1. Change the configuration you mention for the Agent to HB every 60
   seconds or less; let's call this value AgentHB.
2. Change the Consolidator properties to check agent HB every
   AgentHB + <Sometime>.
3. Set the "Create Data for analysis if no agent HB within" value to
   something < 60 seconds.
4. Copy the following rules and link them to the custom Computer Group
   that you would create:
   - Agent Heartbeat Failed - Single Agent - service down.
   - Agent Heartbeat Failed - Single Agent - Computer off-line.
   - Agent Heartbeat Failed - Single Agent - Undetermined Reason.
   - Agent Heartbeat Failure (Analysis)
5. Modify the Provider on the "Agent Heartbeat Failure (Analysis)" rule to
   run at least every minute.
Now that you know what to do, I'll let you know why it will be difficult to
get it close to one minute.
MOM currently uses a long-winded process before the alert that the computer
is down gets generated. You need to make sure you are getting HBs in way
less than 60 seconds, since in Step 2 the value needs to be a little over
the value you specify in Step 1. If it's the exact same value, you may get
a pseudo-race condition where the two intervals overlap in time, and that
will eventually generate false positive alerts.
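The relationship between the Step 1 and Step 2 values can be sanity-checked before touching any settings. The sketch below is only an illustration in Python, not anything MOM provides: the function name, the parameter names and the 5-second minimum margin are all assumptions made for the example.

```python
def check_heartbeat_timings(agent_hb_s, consolidator_check_s, min_margin_s=5):
    """Return a list of problems with a proposed pair of heartbeat settings.

    agent_hb_s           -- Step 1: how often the agent sends a heartbeat
    consolidator_check_s -- Step 2: how often the consolidator checks for one
    min_margin_s         -- hypothetical safety margin, not a MOM setting
    """
    problems = []
    if agent_hb_s > 60:
        problems.append("agent heartbeat interval is above 60 seconds")
    if consolidator_check_s <= agent_hb_s:
        # Equal or smaller values risk the overlap that causes false alerts.
        problems.append("consolidator check must be longer than the agent heartbeat")
    elif consolidator_check_s - agent_hb_s < min_margin_s:
        problems.append("margin between heartbeat and consolidator check is very small")
    return problems


if __name__ == "__main__":
    print(check_heartbeat_timings(30, 45))  # no problems expected
    print(check_heartbeat_timings(60, 60))  # identical values are flagged
```

An empty list means the two values at least respect the "a little over" rule; it says nothing about whether the agent and DCAM can sustain that rate.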
After you've got Step 1 and Step 2 down, you need to configure the value in
Step 3 so that it generates the data for the script in Step 5 to check.
Once the script in Step 5 detects that a server is potentially down, it
generates an event that one of the rules in Step 4 will alert on.
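To get a feel for why the chain is slow, the stages can be added up. This is a deliberately crude model written in Python for illustration: it assumes each stage may have to wait one full interval before acting, which is an assumption for the sake of the arithmetic and not a description of how MOM actually schedules the work. All names are hypothetical.

```python
def worst_case_alert_delay(consolidator_check_s, no_hb_threshold_s, analysis_rule_s):
    """Rough upper bound, in seconds, between a server going down and the alert.

    consolidator_check_s -- Step 2: consolidator heartbeat check interval
    no_hb_threshold_s    -- Step 3: "no agent HB within" threshold
    analysis_rule_s      -- Step 5: how often the analysis script runs
    Assumes, pessimistically, that every stage waits a full interval.
    """
    return consolidator_check_s + no_hb_threshold_s + analysis_rule_s


if __name__ == "__main__":
    # Example values only: 45 s check, 50 s threshold, 60 s analysis rule.
    print(worst_case_alert_delay(45, 50, 60))  # 155
```

Even with every value at or under a minute, the additive estimate lands well above 60 seconds, which is the point of the paragraph above.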
Stringing all of that together with the right timings is very
challenging. In one environment I once managed, I was able to get
alerts of the computer being offline within 2 minutes, but that was
for a large number of servers (> 500).
If you are monitoring JUST ONE server this way or < 10, you might be
able to get away with tuning the values really low.
Please let me know how your experimentation goes (I'm taking a stab that
you are actually doing this in a lab first ;0)