How Can I Monitor One Server every Minute but the Rest Every 5 Minutes?
I have been asked to monitor ONE server every minute, instead of the default of every 5 minutes, to determine its uptime. This applies to just this ONE server; all others should keep the default 5-minute interval.

So far I'm aware that I need to set the "Agent HeartBeat Failure (Analysis)" rule to run every minute in the Data Provider settings.

Before that, I also need to create a computer group (CG) that contains only this one server, copy or recreate the "Agent HeartBeat Failure (Analysis)" script in a newly created processing rule group (PRG), and assign that PRG to the CG.

My desired result is to get an alert *whenever* the server is found to be down after *checking each minute*. In other words, I want to make sure the server is up every single minute.

My question is: am I planning this right, or have I overlooked something? Also, what about performance between the consolidator and the agent, since a 1-minute communication interval is quite short? Will it cause any performance degradation?

Contributed By: Baelson Duque [MSFT]
Depending on the number of servers you want to monitor at this lower threshold, yes, it will affect the DCAM and the agent, simply because the heartbeat is also the mechanism for requesting new rules, so there is a bit of overhead there.

What you would need to do is:

  1. Change the configuration you mention so that the Agent sends a heartbeat (HB) every 60 seconds or less; let's call this value AgentHB.
     
  2. Change the Consolidator properties to check the agent HB every AgentHB + <Sometime>, that is, slightly longer than AgentHB.
     
  3. Set the "Create Data for analysis if no agent HB within" value to something < 60 seconds.
     
  4. Copy the following rules and link the copies to the custom Computer Group that you create:

- Agent Heartbeat Failed - Single Agent - service down.
- Agent Heartbeat Failed - Single Agent - Computer off-line.
- Agent Heartbeat Failed - Single Agent - Undetermined Reason.
- Agent Heartbeat Failure (Analysis)

  5. Modify the Provider on the "Agent Heartbeat Failure (Analysis)" rule to run at least every minute.

Now that you know what to do, I'll let you know why it will be difficult to get it close to one minute.

MOM currently goes through a long-winded process before it generates the alert that the computer is down. You need to make sure you are getting HBs in well under 60 seconds, because the value in Step 2 needs to be a little over the value you specify in Step 1. If the two values are exactly the same, you may get a pseudo-race condition where the intervals overlap in time and eventually generate false-positive alerts.
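The constraints from Steps 1 to 3 can be written down as a quick sanity check before touching the real settings. This is only a minimal sketch in Python, not anything MOM provides: the function name and its parameters are hypothetical, and the values are assumed to be entered by hand in seconds.

```python
def check_heartbeat_timings(agent_hb_s, consolidator_check_s, no_hb_threshold_s):
    """Return a list of problems with a proposed set of timings (in seconds).

    Hypothetical helper: it only encodes the three rules described in the
    answer and does not read or write any MOM configuration.
    """
    problems = []
    # Step 1: the agent should send a heartbeat every 60 seconds or less.
    if agent_hb_s > 60:
        problems.append("Step 1: agent heartbeat interval is above 60 seconds")
    # Step 2: the consolidator check must be a little longer than AgentHB;
    # an equal (or shorter) value risks the pseudo-race condition.
    if consolidator_check_s <= agent_hb_s:
        problems.append("Step 2: consolidator check is not longer than AgentHB")
    # Step 3: the no-heartbeat threshold should be under 60 seconds.
    if no_hb_threshold_s >= 60:
        problems.append("Step 3: no-heartbeat threshold is not under 60 seconds")
    return problems


# Equal values in Steps 1 and 2 are flagged; adding a margin clears the check.
print(check_heartbeat_timings(30, 30, 45))
print(check_heartbeat_timings(30, 40, 45))
```

How much margin to add in Step 2 is left open in the answer, so the sketch only checks that some margin exists.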

After you've got Step 1 and Step 2 down, you need to configure the value in Step 3 so that it generates the data the script in Step 5 needs for its checks. Once the script in Step 5 detects that a server may be down, it generates an event that one of the rules in Step 4 will alert on.
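To see why the chain is hard to squeeze under a minute, it can help to add up the waits. The Python sketch below is a crude model, not a description of MOM internals: it assumes the server dies right after sending a heartbeat, and that each later stage may have to wait up to one full period before it acts. The function name and the sample values are made up for illustration.

```python
def worst_case_alert_delay(no_hb_threshold_s, consolidator_check_s, analysis_interval_s):
    """Rough upper bound, in seconds, on the time from failure to alert.

    Assumed model (not taken from MOM documentation):
      1. the no-heartbeat threshold must expire first (Step 3),
      2. the consolidator may then take up to one check interval to notice (Step 2),
      3. the analysis script may take up to one provider interval to run (Step 5).
    """
    return no_hb_threshold_s + consolidator_check_s + analysis_interval_s


# Sample values: 45 s threshold, 40 s consolidator check, 60 s analysis provider.
delay = worst_case_alert_delay(45, 40, 60)
print(delay)  # 145
```

Under this model, even settings that each sit at or below a minute add up to more than two minutes in the worst case, so every stage has to be tuned down together.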

Stringing all of that together with the right timings is very challenging. In one environment I once managed, I was able to get alerts that a computer was offline within 2 minutes, but that was for a large number of servers (> 500).

If you are monitoring JUST ONE server this way, or fewer than 10, you might be able to get away with tuning the values really low.

Please let me know how your experimentation goes (I'm taking a stab that you are actually doing this in a lab first ;0)
 

© FAQShop.com 2003 - 2008
