DX Unified Infrastructure Management


SNMPCollector 2.0 - Thoughts and Experiences

  • 1.  SNMPCollector 2.0 - Thoughts and Experiences

    Posted Jan 16, 2015 04:24 PM

    With the release of SNMPCollector 2.0, we were quick to implement the probe because of its new features and broader support for network devices. However, as with most releases of the probe, it was more complicated and bug-ridden than the documentation let on.

     

    Notable Bugs: The probe does not work on tunneled hubs with robots attached. When configuring the probe you get a 'MONS-002' error and are unable to save the templates, so you cannot monitor anything.

     

    Templating: While the templating process seems to introduce new features, configuring alarms is rather complicated. Features like being able to specify the alert messages are now gone, and you get generic messages that are sometimes hard to decode for those who haven't memorized OID values. You are also unable to specify the sampling count and rearm.


    Benefits: The new discovery filter is definitely nice for a multi-client environment. Also, the new probe is able to monitor many metrics that were unavailable before (e.g. interface metrics for switches, firewalls, etc., depending on the versions).

     

    Does anyone else have any thoughts or want to share their experiences? 



  • 2.  Re: SNMPCollector 2.0 - Thoughts and Experiences

    Posted Jan 16, 2015 05:36 PM

    Awesome timing as we have been playing around with it ourselves the last couple of days, trying to figure out if we can use it, and if so, how.

     

    The conclusion so far is no, we can't.

     

    Some of my preliminary findings. Keep in mind that it might just be because I don't understand the probe fully yet and haven't figured out how to do a specific thing.

     

    Since we run no monitoring on our primary hub, everything is attempted on remote hub(s).

     

    • Dependencies

      snmpcollector 2.0 depends on the latest ppm (v3.03), which will give 
      you the two awesome alarms:

      "Unable to load or save the Time Over Threshold configuration. Activate the alarm_enrichment probe."
      "The prediction_engine probe must be running on the hub nmshub01 in order to configure predictive alarms."

      The alarm source is "ppm" so it makes sense to everyone when you get an alarm on hostname "ppm". Keep in mind that both alarms are critical.

      I am very happy that my monitoring system gives me a critical alarm that informs me I need prediction_engine if I want predictive alarms. I don't want that. I just want to monitor a simple OID...

      Also, the dependency on alarm_enrichment is dubious. Due to the way alarm_enrichment rules are usually named (<0>, <1>, ..), Nimsoft can't really provide a profile without the risk of breaking something for the customer. Also, deploying alarm_enrichment and nas on all remote hubs is something I assume everyone wants to do.

    • Discovery

      snmpcollector appears to be playing a much larger part when it comes to discovery. Something the following alarm clearly indicates:

      "The probe snmpcollector cannot publish discovery results because an attach or post queue for the subject probe_discovery on the hub /nms-lab/nmshub01.lab/nmshub01.lab/hub is not configured."

      Embrace the magic.

    • Custom monitors

      The ability to create custom monitors seems to be very limited. While you can create them and add some OID, you are ONLY able to select:

      Greater than (>)
      Less than (<)

      And you must use numeric values. This hardly seems flexible. What about the OID where you want to check whether the returned value is exactly the one you expect? Or the status OID that should always say "OK"?

    • Alarm texts

      It is my understanding that snmpcollector doesn't actually send alarms. It only sends QoS (through pollagent), and alarms are based on QoS values (which kind of explains the numeric-only tests). Also, even though you configure snmpcollector, the probe named in the alarm is pollagent. Just so it's easy to find things.

      When it comes to the alarm text itself, they are hardly obvious. Here is an example:

      "QOS_REACHABILITY_REACHABILITY = 0.0 from source 172.24.128.6 targeting Reachability has crossed the critical static threshold of 100.0"

      The problem is that SNMP is not responding on that host. Obviously.

      Since everything is QoS, everything is "standard". So you get the same useless alarm text on everything with no way of changing it. Very useful. This includes Static Alarm as well.

    • Tedious

      Adding ad-hoc snmp monitoring is tedious and non-obvious. It seems to have been designed entirely with automatically deploying templates to discovered devices in mind, and for all I know, that might work nicely. We haven't tested that. But manually configuring custom ad-hoc snmp monitoring is very tiresome.

    < /rant >

     

    The icmp probe basically has the same problems. I sincerely hope icmp and snmpcollector aren't a sign of a new trend that is starting. If they are, I am worried.
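    Since the alarm text cannot be changed in the probe, one option is to pick the generic message apart downstream. Below is a minimal sketch that does this for the reachability alarm quoted above. The regular expression is inferred from that single example only, and the function name `parse_qos_alarm` is made up for illustration; other alarm variants may well not match.

```python
import re

# Pattern inferred from the one example alarm quoted in this post.
# It is an assumption, not a documented message format.
ALARM_RE = re.compile(
    r"^(?P<qos>\S+) = (?P<value>-?\d+(?:\.\d+)?) "
    r"from source (?P<source>\S+) "
    r"targeting (?P<target>.+?) "
    r"has crossed the (?P<severity>\w+) (?P<kind>\w+) threshold of "
    r"(?P<threshold>-?\d+(?:\.\d+)?)$"
)


def parse_qos_alarm(text):
    """Return the alarm fields as a dict, or None if the text does not match."""
    match = ALARM_RE.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the two numeric fields so they can be compared directly.
    fields["value"] = float(fields["value"])
    fields["threshold"] = float(fields["threshold"])
    return fields


example = (
    "QOS_REACHABILITY_REACHABILITY = 0.0 from source 172.24.128.6 "
    "targeting Reachability has crossed the critical static "
    "threshold of 100.0"
)
print(parse_qos_alarm(example))
```

    With the fields separated, a friendlier text such as "SNMP not responding on 172.24.128.6" could be generated wherever alarms are post-processed.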



  • 3.  Re: SNMPCollector 2.0 - Thoughts and Experiences

    Posted Jan 16, 2015 09:29 PM

    So I do know what probe_discovery messages are.  They're compressed JSON data describing the devices being monitored by the probe.  The discovery server processes them and updates the CM_ and possibly other tables accordingly.

     

    Acronym soup.

    TNT = the next thing NIS

    TNT2 = NIS2 the next thing 2

    TNT3 = NIS3 ...

     

    I'm fuzzy on where TNT and TNT2 are separated, but up through TNT2, it's all using niscache on the robot. Discovery_server periodically polls, collects, and cleans up some of that cache while publishing details to the CM tables.

     

    The data is a combination of object style thingies.

     

    device = is a device presumably with an IP.

    dev_id = encoded device id, unique identifier for a device

    ci = is typically a component on a device, like eth0

    ci_id = encoded ci id, unique identifier for a ci

    ci_type = this is equivalent to the subsysid assigned in the nas and is something like System.Disk 1.1 or whatever.

    Metric_Item = available metrics under a ci_type.  Octets in, octets out, etc.

    MI_id = numeric value of an MI

    ci_type_id = numeric value of ci_type, aka subsysid

    metric_type_id = combination of ci_type_id:metric_type_id, e.g. 1.1:39 for System.disk:Read In

    metric_id = an encoded measurable instance of a metric_type_id for a specific ci on a specific device, e.g. Network.Interface:OctetsIn eth0
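    To make the relationships between those ids easier to see, here is a rough sketch of the hierarchy as plain Python dataclasses. The class names, field layout, and every id value are invented for illustration; the real encoded ids and the probe's internal structures look different.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    metric_id: str        # encoded id of one measurable instance
    metric_type_id: str   # e.g. "1.1:39" (ci_type_id plus metric number)


@dataclass
class ConfigurationItem:
    ci_id: str            # encoded id, unique per ci
    ci_type_id: str       # numeric ci_type, e.g. "1.1"
    name: str             # e.g. "eth0"
    metrics: list = field(default_factory=list)


@dataclass
class Device:
    dev_id: str           # encoded id, unique per device
    ip: str
    cis: list = field(default_factory=list)


def metric_ids(device):
    """Flatten every metric_id by walking device -> ci -> metric."""
    return [m.metric_id for ci in device.cis for m in ci.metrics]


# All ids below are placeholders, not real encoded values.
disk = ConfigurationItem("ci-0001", "1.1", "disk0", [Metric("met-0001", "1.1:39")])
host = Device("dev-0001", "192.0.2.10", [disk])
print(metric_ids(host))
```

    The point of the sketch is only the nesting: a device owns cis, a ci owns metrics, and each metric carries a type id that starts with its ci's type id.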

     

    This is basically TNT2.  discovery server collects this data, and publishes it to the CM tables.  The formal units and types and associations allow for automated reporting in proper formats associated with the devices and components of the devices being reported on.

     

    The downside is that all of the met_ids, ci_ids, and dev_ids end up as files in a flat directory on the robot, ./niscache, which can become an I/O bottleneck when you are monitoring things like switches.

     

    TNT3 adds the probe_framework, which is the basis for new probes going forward, including snmptoolkit, icmp... and vmware.

     

    It allows the developer of the probe to discover the met_id objects and organize them logically into objects and containers, with some other attributes that allow for automated generation of the configuration "gui", as they keep calling it, in Admin Console.  This device topology is published as a "graph" under the subject probe_discovery.  It's compressed JSON that goes to the discovery_server for processing.
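    To get a feel for how large such a graph can get, here is a small sketch that builds a fake topology, serializes it to JSON, and compresses it. Everything in it is an assumption: the field names are invented, and zlib is used only as a stand-in because the actual compression scheme of probe_discovery messages is not something I know.

```python
import json
import zlib


def build_graph(device_count, cis_per_device):
    """Build a made-up topology: devices that each contain component (ci) entries."""
    return {
        "devices": [
            {
                "dev_id": f"dev-{d:05d}",
                "cis": [
                    {"ci_id": f"ci-{d:05d}-{c:03d}", "name": f"if{c}"}
                    for c in range(cis_per_device)
                ],
            }
            for d in range(device_count)
        ]
    }


def compressed_size(graph):
    """Serialize to compact JSON and return (raw_bytes, compressed_bytes)."""
    raw = json.dumps(graph, separators=(",", ":")).encode("utf-8")
    return len(raw), len(zlib.compress(raw))


raw_len, packed_len = compressed_size(build_graph(200, 48))
print(f"raw={raw_len} bytes, compressed={packed_len} bytes")
```

    Real graphs carry far more attributes per object than this toy, so treat the numbers as a way to compare topologies against each other, not as a size estimate.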

     

    ppm fits in there somewhere.  Maybe something to do with applying the Canonical Topology Description (CTD) to the topology information in a probe to generate the config gui.  I think maybe ppm is a bridge of some sort between cfg, ctd, graph, or old NIS2 probes not built on Probe Framework, but who knows?

     

    The gateway to TNT2 is using the ciopen, cialarm, and metric this-that-the-other functions instead of the much simpler nimalarm ...  For the extra effort, this buys you graphs in USM that are magically configured and associated with your device, even from a custom probe.

     

    TNT3 is all probe framework.  I haven't sifted it all out yet, but the outputs appear to be very much a work in progress.  The whole direction is fairly promising.

     

    Side note: If you have a HUGE vCenter, the compressed graph in probe_discovery messages can exceed 1MB.  This matters because of a bug in hubs prior to the 7.x series: a lazy megabyte, 1000000 bytes, was the hard-coded maximum in an internal hub routine that took messages off the spooler in_ queues and pushed them to the hub for processing.  The internal spooler would accept the message and say OK to the sender, then bork when trying to send it to the hub.  Because the operation used a hard-coded bulk size of 20, you would lose anywhere between zero and 19 other messages as collateral damage, without any alert.
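    A toy simulation of that failure mode, as I understand it, is below. It is not hub code: the function name, the exact comparison against the limit, and the "whole bulk is dropped" behaviour are my reading of the description above, with only the 1000000-byte limit and the bulk size of 20 taken from it.

```python
MAX_MESSAGE_BYTES = 1_000_000  # the "lazy megabyte" described above
BULK_SIZE = 20                 # the hard-coded bulk size described above


def forward_in_bulks(queue):
    """Return (delivered, lost), assuming one oversized message sinks its whole bulk."""
    delivered, lost = [], []
    for start in range(0, len(queue), BULK_SIZE):
        bulk = queue[start:start + BULK_SIZE]
        if any(len(msg) > MAX_MESSAGE_BYTES for msg in bulk):
            lost.extend(bulk)        # the whole bulk fails, and nothing is alerted
        else:
            delivered.extend(bulk)
    return delivered, lost


# 25 small messages, with one oversized message placed at position 7.
queue = [b"x" * 100 for _ in range(25)]
queue[7] = b"x" * (MAX_MESSAGE_BYTES + 1)
delivered, lost = forward_in_bulks(queue)
print(len(delivered), len(lost))
```

    In this run the oversized message takes 19 innocent neighbours with it. How many are lost in practice depends on how full the bulk happens to be, which is where the "zero to 19" range comes from.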



  • 4.  Re: SNMPCollector 2.0 - Thoughts and Experiences

    Posted Jan 16, 2015 10:16 PM
    I think udm_manager and the datomic DB might be linked to that ctd mess. That whole new setup is somewhat mystical.

    -jon


  • 5.  Re: SNMPCollector 2.0 - Thoughts and Experiences

    Posted Jan 18, 2015 10:56 AM

    @jonhcw 

     

    I think the ctd is related to ppm, but ppm is a black box to everyone I've spoken to.

     

    I don't think udm_manager is related to it, but who knows. From what I understand, udm_manager is heavily used by USM and plays a part in datomic keys and discovery.

     

     



  • 6.  Re: SNMPCollector 2.0 - Thoughts and Experiences

    Posted Jan 19, 2015 11:12 AM

    Alright, I was looking at the callbacks, and there is stuff for getting and setting ctd configuration.



  • 7.  Re: SNMPCollector 2.0 - Thoughts and Experiences

    Posted Feb 26, 2015 05:55 PM

    More info on udm_manager has been posted to the wiki:

     

    https://wiki.ca.com/display/UIMPGA/v8.1+UDM+Manager+%28udm_manager%29+-+Admin+Console

     

    One thing has become abundantly clear with 8.1 and anything new that is supposed to replace the "legacy" probes they don't want to support: they are not developing these probes for large enterprise or service provider environments.  The idea of running hundreds of nas instances and then somehow syncing them to support predictive alerting is boggling, and they can't provide any guidance on how that would work because they aren't even trying it.

     

    My guess is they'll start using alarm_enrichment as a hack to rewrite message text on their poorly designed generic qos alarms and expose configuring that "message pool" in admin-console right next to predictive alerting and call it a new feature delivered!

     

    Sad state of things.  Maybe if they had a reference architecture, all of the new programmers they have hacking out code would have some idea of the large-scale and diverse deployment scenarios they need to keep in mind with their designs, and maybe customers wouldn't have to treat deploying the new version like a research project.



  • 8.  Re: SNMPCollector 2.0 - Thoughts and Experiences

    Posted Jan 16, 2015 07:13 PM
    Just wanted to highlight some of the new features of SNMPCollector 2.0 as well:

    Bulk configuration

    OOB template that helps users start monitoring quickly

    Greatly expanded device support and an easy way to see that support

    Easier to use speed overrides, More flexible polling frequencies, etc.


  • 9.  Re: SNMPCollector 2.0 - Thoughts and Experiences

    Posted Jan 16, 2015 07:47 PM

    I agree with most of those, minus the OOB template. It did include some standard metrics but had no alarms configured, which made it useless. Luckily, you can copy the default to a new template and create the alarms (a somewhat odd process at first, so I stuck to statics).



  • 10.  Re: SNMPCollector 2.0 - Thoughts and Experiences

    Posted Jan 21, 2015 12:51 AM

    Hi Dr1993,

     

    Exact same issue over here. Have you found any solution or workarounds yet?

     

    Regards



  • 11.  Re: SNMPCollector 2.0 - Thoughts and Experiences

    Posted Jan 21, 2015 04:07 PM

    I have tested 2.0 as well and, as the others have noted, I found the same issues for the most part.  One other issue I experienced: when creating a template you can select an alert threshold, but when the template gets deployed no alerts are configured.  I also requested a way to copy and/or push templates out.  I was able to copy a template created on one hub to another without an issue; I just could not edit it because of the bug described.  I would also like to set the QoS source to the hostname or the IP address, not just the IP address, as I did not see a way to change the source of the QoS data.



  • 12.  Re: SNMPCollector 2.0 - Thoughts and Experiences

    Posted Jan 21, 2015 05:26 PM

    Because the new probe is able to pull metrics that 1.61 wouldn't (interface statistics on a lot of switches, TCP, etc.), I was somewhat desperate to get it working on a client HUB.

     

    Setup: Primary Hub -> Secondary -> Client Hub -> Robot with SNMPCollector 2.0

     

    Issue: PPM doesn't seem to work with SNMPCollector 2.0 when it's on a remote hub that has a robot attached. When saving any configuration it throws a "MONS-002" error and templates are never actually created. Nimsoft was able to replicate this bug on their end.

     

    Workaround: I installed the probe on a hub without robots attached and it worked. From there I created the templates and copied them over to the client hub/robot. Restart the probe and the template should apply correctly. I haven't seen any issue with metrics being polled yet.

     



  • 13.  Re: SNMPCollector 2.0 - Thoughts and Experiences

    Posted Jan 22, 2015 12:31 PM

    I know that I have deployed snmpcollector on a remote hub, in the lab at least, and configured it using Admin Console on the primary hub. I don't think I had to do anything special. I had to upgrade to the latest ppm on all hubs, but yeah.



  • 14.  Re: SNMPCollector 2.0 - Thoughts and Experiences

    Posted Jan 23, 2015 01:27 PM

    We created a support case for the MONS-002 issue and for not being able to clone/copy templates. It turned out there was an issue with the monitoring_services probe. We received a new version (2.0.2), which fixed the problem for us.

     

    If you run into this issue, the best thing to do is create a support case and ask for version 2.0.2 of the monitoring_services probe.

     

    I have not yet found a way to take the template from one hub and deploy it on another. Does anybody know how to do this? 



  • 15.  Re: SNMPCollector 2.0 - Thoughts and Experiences

    Posted Jan 23, 2015 01:37 PM

    Oh, and we created an idea for being able to choose how the QoS from snmpcollector is published:

     

    snmpcollector : QoS source with "Profile name"

     

    Please promote it if you would like to have this feature.



  • 16.  Re: SNMPCollector 2.0 - Thoughts and Experiences

    Posted Feb 26, 2015 01:15 AM

    We are in the middle of UIM 8.1 training, and the first thing we wanted to learn about was snmpcollector 2.0.

    1. First off, the major thing that threw me off was the naming convention. Why are the machines or devices monitored by the SNMP Collector referred to as "Profiles"? They should be called "Resources" or "Monitored Objects". Secondly, I think the Templates should be called Profiles, and within Profiles you should have different "Templates" for each different filter you can set up in each profile. The naming is totally screwing me up right off the bat.

    2. It is very unintuitive. There really should be a wizard, like the way the Discovery Wizard walks you through the whole process.

    3. I don't see a way for it to load custom MIB files and then create a "Template" off a third-party MIB file. This ability is essential for custom hardware and vendors.

    4. ...