DX Infrastructure Manager

Basic server up/down monitoring? Best practices?

  • 1.  Basic server up/down monitoring? Best practices?

    Posted 02-14-2008 09:53 AM
    I'm fairly new to NimSoft monitoring and I'm really curious about some of the best practices as far as basic server up/down monitoring.  The net_connect probe is an obvious answer using ping.  But if I have a robot installed, I don't necessarily want to add the server to the net_connect probe just to get basic up/down status (it costs an extra license and time).  Plus, sometimes a server will ping but is not actually available; it's stuck halfway up.

    I thought maybe using the services probe and monitor a particular service...

    Or would using another probe make more sense?  Any ideas, common usage or best practices would be greatly appreciated!!


  • 2.  Basic server up/down monitoring? Best practices?

    Posted 02-14-2008 10:29 PM
    I think there is some form of example within the ntevl help documentation about picking up a shutdown event and then clearing it on another event when the server comes back up.

    Bit of advice: don't rely on the 'robot inactive' alarm generated by the hub for the heartbeat. The default interval is 15 minutes and you could wait up to 30 minutes for the alarm to come through, and reducing the interval on the controller has caused me false robot inactive alarms in the past.

    Don't forget you'd only need one net_connect probe, as that'll ping all your servers/devices, and you can mass configure them by dragging in a list of server names and IPs. It is the most reliable method of detecting a server down (although you can get network interfaces still responding to ping while the O/S has crashed...). Just have to weigh up the investment I guess.

  • 3.  Basic server up/down monitoring? Best practices?

    Posted 02-20-2008 05:24 AM

    I agree with Justin that you might want to consider using ping as your failsafe method of finding out when a server goes down.  I recommend monitoring critical services on Windows servers, but if the server goes down, none of the probes will be running.  You are right that ping success is not proof the server is up, but ping failure is a pretty clear sign the server is down.

    Like Justin said, relying on the robot down alarms to tell you when a host is not running can have some difficulties of its own.  You may have to balance response time with false positives.


  • 4.  Basic server up/down monitoring? Best practices?

    Posted 02-23-2008 04:24 AM
    It is a bit of a conundrum - as you say, the server might well respond to a ping but any critical services on it might be offline.

    The best thing to do (from my limited experience with the product) is use the ping as your most basic test, and then pay up for a suitable probe for the application you wish to monitor: for example, one of the SQL probes to just fire a query at a server you know should be up and running, or the url_response probe to expect certain content back from a web site you specify. That way you'll definitely know if an application fails even if the server still responds to ping.

    As has been said, waiting for a 'Robot is inactive' message to come through can be a bit of a wild goose chase: often the problem is spotted, escalated and then fixed before you even see one of those messages. The problem seems to be especially bad if you have a NetWare server because of the way the whole proxy probe "solution" works. I've actually waited over an HOUR once for a NetWare server inactive message to come through...

    In our environment we're primarily using the simple net_connect probe as an early warning and then using the app-specific probes to target troublesome or "known to be flaky" apps/services. Seems to be doing the job so far! :)
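
    As a rough illustration of the layered approach discussed above (ping as the basic test, plus an application-level check), here is a minimal Python sketch. It is not a Nimsoft probe and does not use any Nimsoft API: `tcp_port_open` and `classify` are hypothetical helper names, the ping result is assumed to come from whatever already does your pinging (net_connect, for instance), and the host and port in the demo are placeholders.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def classify(ping_ok: bool, service_ok: bool) -> str:
    """Combine a ping result with a service check into a coarse status.

    - service answers               -> "up" (even if ICMP is filtered)
    - ping answers, service does not -> "degraded" (e.g. stuck half way up)
    - neither answers               -> "down"
    """
    if service_ok:
        return "up"
    if ping_ok:
        return "degraded"
    return "down"


if __name__ == "__main__":
    # Placeholder target: replace with a real server and a port that the
    # application you care about listens on.
    host, port = "127.0.0.1", 80
    ping_ok = True  # assumed to be supplied by an existing ping check
    print(host, classify(ping_ok, tcp_port_open(host, port)))
```

    The point of the middle "degraded" state is the case raised earlier in the thread: the box still answers ping, but the thing you actually care about is not responding.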

  • 5.  Re: Basic server up/down monitoring? Best practices?

    Posted 07-05-2012 06:05 PM

    There really needs to be a better way than setting up so many net_connect probes and profiles.

    Where I work we have a multi-tenant environment.  We have HUB servers in each tenant environment.

    We would have to set up the net_connect probe on every tenant's hub server, and then configure a net_connect profile for every target ROBOT in every tenant hub.  That sucks.


    Seems that when you are logged into the primary hub and have infrastructure manager open, you can hit "F5" to refresh the Domain tree of origins and robots.  If a Robot is down or not responding it turns red fairly instantly.

    Why not have a Primary HUB based probe that alarms to the NAS if any Robot is in a RED state?

    That is what I really want.  It would dynamically include all new robots set up in the future that way.

    And it would just need to poll every 3 minutes or so and return one alarm for each Robot in a RED state.

    Anyone know how to do that?



  • 6.  Re: Basic server up/down monitoring? Best practices?

    Posted 07-05-2012 07:23 PM

    If I understand your idea correctly, that already exists. When a robot goes down, the default behavior for the hub is to send an alarm message. The alarm messages repeat until the robot is up again (every minute I think).


    The trick with robot down alarms is to set the hub update interval on the robot/controller to make sure you find out quickly enough when they go down. If a robot stops cleanly, it should tell the hub it is going down, and you will see it turn red immediately. But if it crashes or network connectivity to the hub is severed, you may not get a robot down alarm for up to 1.5 times the update interval. The default update interval is 15 minutes, meaning the hub will not conclude a robot is down until 22.5 minutes have passed since the last check-in. With a 5-minute update interval, the hub will generate the robot down alarm after 7.5 minutes since the last check-in.
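
    The arithmetic above can be sketched in a few lines of Python. The 1.5 multiplier and the 15- and 5-minute intervals come from this post; the function name is only illustrative and is not part of any Nimsoft tooling.

```python
def worst_case_detection_minutes(update_interval_minutes: float,
                                 factor: float = 1.5) -> float:
    """Longest wait, in minutes since the last check-in, before the hub
    raises a robot down alarm, assuming it waits `factor` times the
    update interval (1.5 as described in the post above)."""
    if update_interval_minutes <= 0:
        raise ValueError("update interval must be positive")
    return update_interval_minutes * factor


if __name__ == "__main__":
    # 15 is the default interval mentioned in the thread; 5 is the
    # shorter interval used as the comparison.
    for interval in (15, 5):
        delay = worst_case_detection_minutes(interval)
        print(f"{interval}-minute interval: alarm within {delay} minutes")
```

    Shortening the interval tightens the bound proportionally, but as noted earlier in the thread, very short intervals have been reported to cause false robot inactive alarms.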

  • 7.  Re: Basic server up/down monitoring? Best practices?

    Posted 07-06-2012 02:46 AM

    Hi Ben,


    Have you had a look at Discovery and Unified Service Manager in the UMP? You can set up monitoring profiles to automatically deploy ping tests to net_connect probes. This may cover what you need.