DX Infrastructure Manager


vmware self-monitoring alerts

Daniel Blanco, 02-02-2017 12:36 PM

  • 1.  vmware self-monitoring alerts

    Posted 01-24-2017 11:07 AM

    Hello, with vmware probe v6.72 we started getting flooded with "Self-Monitoring Failures for ..." alerts. Can someone explain what is causing them and how to fix them?

    In previous versions the alerts used to say "Metric ***. Update not available", and for those we would just go into All Monitors under that profile and uncheck the metric so it was no longer monitored.

     

    In version 6.72 that no longer fixes it. I also tried a combination of clearing the nis-cache and restarting the probe, then removing the static items within the probe, but the "Self-Monitoring Failure" alerts keep popping up.

     

    Does anyone have any suggestions?



  • 2.  Re: vmware self-monitoring alerts

    Posted 01-24-2017 11:22 AM

    Example Alert:

    Self-Monitoring Failures for 'vmware1:HOST_CPU_AGGREGATE.usage': Data Collection (1 of 47 failed).  See vmware.log for more details

     

    Then in log we have:

    Jan 23 02:02:17:242 [Data Collector - vmware1, vmware] PERF: DONE: Updating monitors in graph {Seconds=2.075}
    Jan 23 02:02:17:242 [Data Collector - vmware1, vmware] SENDING_ALARM: Self-Monitoring Failures for 'vmware1:HOST_CPU_AGGREGATE.used': Data Collection (1 of 47 failed). See vmware.log for more details
    Jan 23 02:02:17:242 [Data Collector - vmware1, vmware] SENDING_ALARM: Self-Monitoring Failures for 'vmware1:RESOURCE_POOL.CPUOverallUsagePercent': Monitor Correlation (2 of 6 failed), Data Collection (2 of 12 failed). See vmware.log for more details
    Jan 23 02:02:17:242 [Data Collector - vmware1, vmware] SENDING_ALARM: Self-Monitoring Failures for 'vmware1:HOST.Status': Data Collection (10 of 56 failed). See vmware.log for more details
    Jan 23 02:02:17:258 [Data Collector - vmware1, vmware] SENDING_ALARM: Self-Monitoring Failures for 'vmware1:RESOURCE_POOL.MemoryOverallUsagePercent': Monitor Correlation (2 of 6 failed), Data Collection (2 of 12 failed). See vmware.log for more details
    Jan 23 02:02:17:258 [Data Collector - vmware1, vmware] SENDING_ALARM: Self-Monitoring Failures for 'vmware1:HOST_CPU_AGGREGATE.usage': Data Collection (1 of 47 failed). See vmware.log for more details
    Jan 23 02:02:17:258 [Data Collector - vmware1, vmware] SENDING_ALARM: Self-Monitoring Failures for 'vmware1:HOST.Alarm': Data Collection (8 of 8 failed). See vmware.log for more details
    Jan 23 02:02:17:258 [Data Collector - vmware1, vmware] ===== Self-Monitoring Alarm Failures: 6 alarm(s) sent for resource: vmware1 =====
    Note 1: These Self-Monitoring Alarm Failures are aggregated for an element.metric type per resource. Individual failure details or related exceptions should proceed this log entry.
    Note 2: 'Monitor Correlation' failures occur when a monitor does not find it's specific element in the inventory, or no metric value is available for the element.
    With static monitors and changing inventory, these are sometimes expected and may be transitory.
    Note 3: The failure count of 'Data Collection' failures often correlate with Monitor Correlation failures.
    When there are only 'Data Collection' failures, or when they exceed 'Monitor Correlation' failures, that usually indicates a problem in collecting that metric value.
    Some metric values are only available with additional system administration.
    Some metric values are only available for specific element types. For instance one type of storage might have a metric, while another does not.
    Generally it is desirable to understand 'Data Collection' failures for desired metrics, and sometimes the probe needs to be tuned for them
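    The aggregated lines above can be triaged mechanically. The following Python sketch pulls the resource, element.metric and per-category counts out of each SENDING_ALARM line and applies the rule of thumb from Note 3 (more 'Data Collection' than 'Monitor Correlation' failures usually means a collection problem). The regular expressions are assumptions written against the lines quoted in this thread, not a documented log format, so other probe versions may need adjustments.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Patterns written against the SENDING_ALARM lines quoted in this thread (assumed format).
ALARM_RE = re.compile(
    r"SENDING_ALARM: Self-Monitoring Failures for '([^']+)': (.*?)\.\s+See vmware\.log"
)
CATEGORY_RE = re.compile(r"(Monitor Correlation|Data Collection) \((\d+) of (\d+) failed\)")


@dataclass
class SelfMonAlarm:
    resource: str  # e.g. "vmware1"
    metric: str    # element.metric, e.g. "HOST.Status"
    # category -> (failed, total)
    failures: Dict[str, Tuple[int, int]] = field(default_factory=dict)

    def likely_collection_problem(self) -> bool:
        # Heuristic from "Note 3" in the log: only Data Collection failures,
        # or more Data Collection failures than Monitor Correlation failures.
        dc = self.failures.get("Data Collection", (0, 0))[0]
        mc = self.failures.get("Monitor Correlation", (0, 0))[0]
        return dc > mc


def parse_line(line: str) -> Optional[SelfMonAlarm]:
    """Return a SelfMonAlarm for a SENDING_ALARM line, or None for any other line."""
    m = ALARM_RE.search(line)
    if m is None:
        return None
    resource, metric = m.group(1).split(":", 1)
    failures = {cat: (int(f), int(t)) for cat, f, t in CATEGORY_RE.findall(m.group(2))}
    return SelfMonAlarm(resource, metric, failures)


def parse_log(lines: Iterable[str]) -> Iterator[SelfMonAlarm]:
    """Yield one SelfMonAlarm per SENDING_ALARM line; other log lines are skipped."""
    for line in lines:
        alarm = parse_line(line)
        if alarm is not None:
            yield alarm
```

    Fed the log above, this would flag HOST.Alarm (8 of 8 Data Collection failures) and HOST.Status as likely collection problems, while the RESOURCE_POOL entries (2 correlation vs 2 collection failures) look more like transitory correlation misses.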



  • 3.  Re: vmware self-monitoring alerts

    Broadcom Employee
    Posted 01-24-2017 08:23 PM


  • 4.  Re: vmware self-monitoring alerts

    Posted 01-25-2017 11:16 AM

    Hi Dave, we do use Auto Configurations in our profiles. We also have to make a few of the metrics static, as the client does not want us to monitor specific elements that the Auto Configuration template alerts on.

    The items above are NOT static elements. 



  • 5.  Re: vmware self-monitoring alerts

    Posted 02-22-2017 07:43 PM

    Hi Daniel,

     

    Are your vmware probes pointing at the ESX host level or at the top vCenter level?

     

    I experienced the same thing. Initially I had my vmware probes pointing to the ESX hosts so we could load balance and separate monitoring of different host environments. However, I received so many self-monitoring alarms that it was not manageable. I think the issue is that, as VMs move from host to host, the vmware probe can't track what is going on, even though all the ESX hosts are monitored by the same vmware probe. This happens whether you're using static monitors or Auto Configurations.

     

    I changed my setup to use one vmware probe pointing at our vCenter server (we only have one vCenter for our whole environment). This has stopped the self-monitoring alarms. Because the vCenter server is ultimately the highest point that "knows" what is going on in the whole ESX environment, it can track which hosts the VMs are moving to. I'm not a VMware expert at all, but that is my understanding of what is going on.
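    A toy model may make this explanation concrete. It assumes, purely for illustration, that each monitored resource (an ESX host, or the vCenter) keeps its own inventory, and that a monitor only correlates when its VM appears in the inventory of the resource it was configured under. This mirrors Note 2 in the log but is not the probe's actual code, and the host and VM names are made up.

```python
from typing import Dict, List, Set, Tuple

Monitor = Tuple[str, str]  # (resource the monitor was configured under, VM name)


def correlate(monitors: List[Monitor], inventories: Dict[str, Set[str]]) -> List[Monitor]:
    """Return the monitors whose VM is missing from their resource's inventory."""
    return [(res, vm) for res, vm in monitors if vm not in inventories.get(res, set())]


# Probe pointed at two ESX hosts; vm-a was configured while it ran on esx1.
host_monitors = [("esx1", "vm-a"), ("esx2", "vm-b")]
# After vm-a migrates to esx2, esx1 no longer reports it, so its monitor fails correlation.
after_move = {"esx1": set(), "esx2": {"vm-a", "vm-b"}}
print(correlate(host_monitors, after_move))  # [('esx1', 'vm-a')]

# Probe pointed at vCenter: one inventory covers every host, so the move is invisible.
vc_monitors = [("vcenter", "vm-a"), ("vcenter", "vm-b")]
print(correlate(vc_monitors, {"vcenter": {"vm-a", "vm-b"}}))  # []
```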

     

    There is obviously a trade-off between pointing the probe at the ESX hosts versus at the vCenter server. The issue I'm having now is that EVERYTHING in our whole VMware environment is getting detected and loaded into the UIM inventory. There is no easy way to exclude ESX hosts or VMs from monitoring, and that is the headache I'm having now.

     

    I'm interested to know where you're at now with your situation.

     

    Cheers,

    Max.



  • 6.  Re: vmware self-monitoring alerts

    Posted 02-27-2017 05:00 PM

    Hi Max, we're still in the same situation. We always point to the vCenter itself and only very rarely to the host.

    Just this morning I had two of these come in. I looked at the vmware probe's "All Monitors" and have no idea where or why these were generated. We had to set them to invisible so they could be ignored.



  • 7.  Re: vmware self-monitoring alerts

    Posted 05-04-2017 09:56 AM

    For the vCenter account you are using, you should configure restricted views so that the vmware probe can only discover the objects you are interested in monitoring. For example, in our environment we only wanted to monitor the VM infrastructure and not all the individual VM hosts, since we already have robots installed on them. This has greatly improved the performance of the vmware probe.



  • 8.  Re: vmware self-monitoring alerts

    Posted 06-20-2017 11:25 AM

    As an MSP, this is not a viable option. We would need to ask our customers to customize their environment and they might not like that or be willing to do so. Our ability to monitor their environment should not alter the way they design/use vCenter.

     

    BTW, this has been an idea since 2015: 

    VMware Probe - Prevent 'all' VM's from discovery



  • 9.  Re: vmware self-monitoring alerts

    Broadcom Employee
    Posted 06-21-2017 12:11 AM


  • 10.  Re: vmware self-monitoring alerts

    Broadcom Employee
    Posted 06-21-2017 12:13 AM


  • 11.  Re: vmware self-monitoring alerts

    Posted 06-21-2017 08:40 AM

    Thank you for this; however, in most cases, as an MSP, our customers want us to monitor both the hosts and the VMs. Disabling 'show_vms' will not work in this scenario. We need the ability to exclude from discovery the VMs and hosts that don't need to be monitored. Asking customers to use vCenter folders, if they don't already, is not viable either.



  • 12.  Re: vmware self-monitoring alerts

    Broadcom Employee
    Posted 06-21-2017 10:42 PM

    There are 2 contexts.

     

    1. The "Auto Configurations" feature in the IM GUI does not have the ability to filter elements out.

     

    It is possible to overcome this with the Admin Console-based 'bulk configuration' mode.

    This mode uses an Admin Console template (which is different from an IM GUI-based template) and lets you build specific criteria for which elements the template affects.

     

    2. "Unwanted VMs appear in your USM"

     

    Unfortunately, this is by design. With the current behavior, the probe produces inventory data for every VM that is accessible via the credentials you assign to the probe.

    CA development has explored possible options, but has had no luck within the current design so far.



  • 13.  Re: vmware self-monitoring alerts

    Posted 02-02-2017 12:36 PM

    Bumping... Not answered.



  • 14.  Re: vmware self-monitoring alerts

    Posted 02-05-2017 11:43 AM

    We are also getting self-monitoring failures.
    Who can help resolve this?



  • 15.  Re: vmware self-monitoring alerts

    Posted 02-06-2017 11:15 AM

    CA vmware probe developers, please add some sampling logic to these alerts before they are generated: for example, an option in the settings to only alert if the probe failed to read a value for X consecutive polls.

    This is the most annoying alert I have to deal with. When it's generated, I constantly have to re-explain to the helpdesk and clients that it means the probe wasn't able to read that specific metric at the moment it polled the VMware API.
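    The proposed option amounts to a per-metric consecutive-failure counter. A minimal Python sketch, assuming one boolean poll result per metric per collection cycle; ConsecutiveFailureGate and its threshold are hypothetical names, not an existing vmware probe setting:

```python
from collections import defaultdict
from typing import DefaultDict


class ConsecutiveFailureGate:
    """Signal an alert only after `threshold` consecutive failed polls of one metric."""

    def __init__(self, threshold: int):
        if threshold < 1:
            raise ValueError("threshold must be >= 1")
        self.threshold = threshold
        self.streak: DefaultDict[str, int] = defaultdict(int)  # metric -> failures in a row

    def record(self, metric: str, got_value: bool) -> bool:
        """Record one poll result; return True when an alert should be sent."""
        if got_value:
            self.streak[metric] = 0  # any successful read resets the streak
            return False
        self.streak[metric] += 1
        # Alert once, on the poll that reaches the threshold, not on every later failure.
        return self.streak[metric] == self.threshold


gate = ConsecutiveFailureGate(threshold=3)
polls = [False, False, True, False, False, False, False]
print([gate.record("vmware1:HOST.Status", ok) for ok in polls])
# [False, False, False, False, False, True, False]
```

    A single missed read (the transient case described above) never reaches the threshold, so it would not reach the helpdesk.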



  • 16.  Re: vmware self-monitoring alerts

    Posted 02-28-2017 05:32 PM

    DanielBlanco, sorry, I missed this one. I'm passing this on to engineering to look at and comment.



  • 17.  Re: vmware self-monitoring alerts

    Posted 03-01-2017 01:53 PM

    DanielBlanco and all: I did some digging into this issue. While I'm not sure why you would see MORE self-monitoring alarms with 6.72 (we'd have to follow up and dig further), there was a behavior change in 6.72. Based on customer feedback, we started aggregating these alarms with the goal of reducing them. But as you correctly point out, this makes it more difficult to troubleshoot and deselect the problem metric.

     

    Some options to help in the near term:

    1. You can switch back to the previous probe behavior by setting the setup key enable_self_monitoring_alarm_aggregation to false. This will at least let you see the alarms/logs as they used to be.

    2. There is a beta (soon to be GA) update to the VMware probe, version 6.82. We've improved how we handle components (such as a CPU) being removed, and the -1 response we get from the VMware API. This should reduce some self-monitoring alarms.

    3. I like Dan's proposal to allow configuration to ignore a self-monitoring alarm until it has been received a set number of times. We'll look at that and other options, such as better logging, to improve this experience. Dan, I'll set up a follow-up with you to discuss.
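    To see where counts such as "1 of 47 failed" come from, here is a rough Python sketch of per-element.metric aggregation as described in Note 1 of the log. It only imitates the message format seen in this thread; the tuple layout and grouping logic are assumptions, not the probe's implementation.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

Check = Tuple[str, str, str, bool]  # (resource, element.metric, category, failed)


def aggregate(checks: Iterable[Check]) -> List[str]:
    """Collapse individual check results into one message per resource:element.metric."""
    # (resource, metric) -> category -> [failed, total]
    stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for resource, metric, category, failed in checks:
        counts = stats[(resource, metric)][category]
        counts[0] += int(failed)
        counts[1] += 1
    messages = []
    for (resource, metric), categories in stats.items():
        parts = [f"{cat} ({f} of {t} failed)" for cat, (f, t) in categories.items() if f]
        if parts:  # only element.metric groups with at least one failure produce an alarm
            messages.append(
                f"Self-Monitoring Failures for '{resource}:{metric}': " + ", ".join(parts) + "."
            )
    return messages


# 47 CPU checks on one resource, one of which failed, collapse into a single message.
checks = [("vmware1", "HOST_CPU_AGGREGATE.usage", "Data Collection", i == 0) for i in range(47)]
print(aggregate(checks))
```

    With aggregation turned off (option 1 above), each of those failing checks would instead surface individually, which is what makes the problem metric easy to find.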

     

    Thanks,

    Andy



  • 18.  Re: vmware self-monitoring alerts

    Posted 03-03-2017 07:40 AM

    Is there any better solution to avoid these self-monitoring alerts?



  • 19.  Re: vmware self-monitoring alerts

    Posted 03-08-2017 09:50 PM

    Daniel,

     

    I have dealt with several past issues with the exact same errors in the logs. This usually happens when some monitors get stuck and begin generating self-monitoring alarms. In some cases, the vmware probe was set up to use only auto monitors, but these errors still showed up in the logs. The following steps resolved the issue for the customers in my previous cases.

     

    1. Deactivate the vmware probe
    2. Take a backup copy of the entire vmware folder
    3. Delete the probe from IM
    4. Redeploy the probe from the Archive
    5. Copy the config file from the backup vmware folder to the new vmware folder. This ensures that all of the configured virtual machines are recognized by the probe once again.
    6. You may also need to copy over some of the templates from the backup bulk_config folder if custom templates are being used.
    7. Cold start the vmware probe through deactivate/activate for the new files to take effect
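    Steps 2, 5 and 6 are plain file operations on the robot and could be scripted. A hedged Python sketch: the probe folder location is not given in the thread, so both paths are parameters, and the config file name vmware.cfg is an assumption based on the usual <probe>.cfg naming. Steps 1, 3, 4 and 7 still happen in the UIM consoles.

```python
import shutil
from pathlib import Path


def backup_probe_dir(probe_dir: Path, backup_dir: Path) -> None:
    """Step 2: copy the entire vmware probe folder aside (fails if backup_dir exists)."""
    shutil.copytree(probe_dir, backup_dir)


def restore_config(backup_dir: Path, probe_dir: Path,
                   cfg_name: str = "vmware.cfg", templates: bool = False) -> None:
    """Steps 5-6: copy the saved config (and optionally bulk_config) into the redeployed folder."""
    # Overwrites the default config that came with the freshly deployed probe.
    shutil.copy2(backup_dir / cfg_name, probe_dir / cfg_name)
    if templates and (backup_dir / "bulk_config").is_dir():
        # Only needed when custom bulk configuration templates are in use.
        shutil.copytree(backup_dir / "bulk_config", probe_dir / "bulk_config",
                        dirs_exist_ok=True)


# Example (both paths hypothetical; use the probe's real install folder on the robot):
# backup_probe_dir(Path("/path/to/probes/vmware"), Path("/tmp/vmware_backup"))
# ... delete and redeploy the probe from the Archive ...
# restore_config(Path("/tmp/vmware_backup"), Path("/path/to/probes/vmware"), templates=True)
```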

     

     

    I realize this is not an ideal scenario, but it stopped the alarms from being generated, and the errors also stopped showing up in the logs.

     

    Regards,

     

    Ryan Currey



  • 20.  Re: vmware self-monitoring alerts

    Posted 03-10-2017 03:20 PM

    Is there any other solution for this case? I applied the previous troubleshooting steps and the alerts still appear.



  • 21.  Re: vmware self-monitoring alerts

    Posted 03-16-2017 10:36 AM

    I'm opening a case on this now. At one site, 2 of these were generated and there are 0 items in the “All Monitors” section that pertain to the items in the alerts.

    I even tried upgrading to version 6.82 and clearing the alerts, and they still come back.

    Self-Monitoring Failures for '***-vcenter:CLUSTER_COMPUTE_RESOURCE.MemoryUsage': Data Collection (2 of 4 failed).  See vmware.log for more details

    Self-Monitoring Failures for '***-vcenter:CLUSTER_COMPUTE_RESOURCE.CPUusage': Data Collection (2 of 4 failed).  See vmware.log for more details

    and in log:

    Mar 16 09:11:36:379 [Data Collector - ***-vcenter, vmware] SENDING_ALARM: Self-Monitoring Failures for '***-vcenter:CLUSTER_COMPUTE_RESOURCE.CPUusage': Data Collection (2 of 4 failed).  See vmware.log for more details

    Mar 16 09:11:36:379 [Data Collector - ***-vcenter, vmware] SENDING_ALARM: Self-Monitoring Failures for '***-vcenter:CLUSTER_COMPUTE_RESOURCE.MemoryUsage': Data Collection (2 of 4 failed).  See vmware.log for more details

    Mar 16 09:11:36:379 [Data Collector - ***-vcenter, vmware] ===== Self-Monitoring Alarm Failures: 2 alarm(s) sent for resource: ***-vcenter =====

                   Note 1: These Self-Monitoring Alarm Failures are aggregated for an element.metric type per resource.  Individual failure details or related exceptions should proceed this log entry.

                   Note 2: 'Monitor Correlation' failures occur when a monitor does not find it's specific element in the inventory, or no metric value is available for the element.

                           With static monitors and changing inventory, these are sometimes expected and may be transitory.

                   Note 3: The failure count of 'Data Collection' failures often correlate with Monitor Correlation failures.

                           When there are only 'Data Collection' failures, or when they exceed 'Monitor Correlation' failures, that usually indicates a problem in collecting that metric value.

                           Some metric values are only available with additional system administration.

                           Some metric values are only available for specific element types.  For instance one type of storage might have a metric, while another does not.

                           Generally it is desirable to understand 'Data Collection' failures for desired metrics, and sometimes the probe needs to be tuned for them



  • 22.  Re: vmware self-monitoring alerts

    Posted 03-16-2017 11:45 AM

    Hi Dan,

     

    I'll reach out to you offline to find some time to discuss this with one of our VMware probe engineers, so we can better understand why you are seeing so many self-monitoring alarms and how best to address them.

     

    Thanks,

    Andy



  • 23.  Re: vmware self-monitoring alerts

    Posted 03-16-2017 11:11 AM

    Sharing this because it is useful. It needs to be added to the probe's wiki page, as no one would otherwise know about these probe options.

    Tech Tip: UIM - vmware probe - tips for troubleshooting some common vmware problems / alarms 



  • 24.  Re: vmware self-monitoring alerts

    Posted 05-02-2017 01:44 AM

    Thanks, I had the same issue and this saved me a lot of time today.



  • 25.  Re: vmware self-monitoring alerts

    Posted 05-02-2017 10:17 AM

    I have faced this issue too; I deleted the vmware probe and deployed it again, which resolved the issue.

    Phani.Devulapalli, what changes did you make to resolve it?



  • 26.  Re: vmware self-monitoring alerts

    Posted 05-02-2017 07:23 PM

    Here is what I did:

     

    In vmware probe v6.41 or higher, review the "Detached Configuration" folder in the probe's left-hand navigation tree in the Admin Console GUI; it displays resources that have been deleted in VMware vSphere but are still configured in the probe.

     

    Removing these stops the alerts. In my case they were deprecated datastores that are no longer attached to vCenter.

     

    In addition, I changed the severity by setting the setup key self_monitoring_alarm_severity to 2, so that it doesn't create any more tickets for us in SNOW but we can still see the alert in the Alarm Console.



  • 27.  Re: vmware self-monitoring alerts

    Posted 05-02-2017 07:25 PM

    Btw, take a look at the VMware ‘Self-Monitoring’ Alarms section in the Tech Tip for the exact details 

     

    Tech Tip: UIM - vmware probe - tips for troubleshooting some common vmware problems / alarms 



  • 28.  Re: vmware self-monitoring alerts

    Posted 05-02-2017 01:51 PM

    Hi Andy, thanks for the meeting and the explanation.

    So, for everyone's FYI, here is what we ended up doing:

    1. Searched for every vmware probe in our environment.

    2. Created a batch file that updated each vmware probe instance's cfg with the following settings:

     

    pu -u administrator -p PW /NMS/Hub1/Robot1/controller probe_config_set vmware /setup self_monitoring_alarm_severity 2 /NMS/Hub1/Robot1/
    pu -u administrator -p PW /NMS/Hub1/Robot1/controller probe_config_set vmware /setup enable_self_monitoring_alarm true /NMS/Hub1/Robot1/
    pu -u administrator -p PW /NMS/Hub1/Robot1/controller probe_config_set vmware /setup enable_self_monitoring_alarm_aggregation false /NMS/Hub1/Robot1/
    pu -u administrator -p PW /NMS/Hub1/Robot1/vmware _restart

     

    self_monitoring_alarm_severity = 2 // Sets these alerts to WARNING (UIM severity 2; MINOR is 3)

    enable_self_monitoring_alarm = true // Tells the vmware probe to generate the alerts when a metric gets no value returned.

    enable_self_monitoring_alarm_aggregation = false // Mimics the old behavior, where you get one alert per metric failure; true generates the aggregated "Self-Monitoring Failures ... (# of ## failed)" alerts.
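    Step 2 can be generated rather than typed for each robot. This Python sketch reproduces the four pu lines above for any robot address; the argument order is copied verbatim from the batch file, the robot list is hypothetical (the thread does not say how the probes were found in step 1), and the meaning of the trailing robot argument to probe_config_set is not explained in the thread.

```python
from typing import List

# Setup keys and values taken verbatim from the batch file in this post.
SETTINGS = [
    ("self_monitoring_alarm_severity", "2"),
    ("enable_self_monitoring_alarm", "true"),
    ("enable_self_monitoring_alarm_aggregation", "false"),
]


def pu_commands(robot: str, user: str = "administrator", password: str = "PW") -> List[str]:
    """Return the pu lines for one robot address such as /NMS/Hub1/Robot1."""
    base = f"pu -u {user} -p {password}"
    lines = [
        f"{base} {robot}/controller probe_config_set vmware /setup {key} {value} {robot}/"
        for key, value in SETTINGS
    ]
    lines.append(f"{base} {robot}/vmware _restart")  # restart the probe, as the batch file does
    return lines


if __name__ == "__main__":
    robots = ["/NMS/Hub1/Robot1"]  # hypothetical: one entry per robot hosting a vmware probe
    for robot in robots:
        print("\n".join(pu_commands(robot)))
```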

     

    We are disabling the vmware probe's self-monitoring alarm aggregation 'feature' and re-enabling the old behavior, where it alerts on every metric that does not return a value. This is much easier to deal with than the "Self Monitor 4 of 33 failed" messages.

     

    We set the severity of these alerts to Warning (level 2). We can see specifically which metric in the "All Monitors" list the probe is complaining about and remove it if need be. This is much easier for us to address and correct.

     

    On the call, we asked whether the probe can be enhanced to ONLY trigger these alerts on metrics that have ALARM = true. We don't care about QoS-only metrics, and if the probe made this distinction it would be even more useful.
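    The requested behavior can be stated as a simple rule. A minimal Python sketch, assuming each monitor exposes its ALARM and QoS flags; the Monitor class and should_alert function are hypothetical, since per this thread the probe has no such option:

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    key: str             # e.g. "vmware1:HOST.Status" (illustrative)
    alarm_enabled: bool  # monitor has ALARM = true
    qos_enabled: bool    # monitor publishes QoS


def should_alert(monitor: Monitor, value_collected: bool) -> bool:
    """Proposed rule: raise a self-monitoring alert only for alarm-enabled monitors."""
    return (not value_collected) and monitor.alarm_enabled


monitors = [
    Monitor("vmware1:HOST.Status", alarm_enabled=True, qos_enabled=True),
    Monitor("vmware1:HOST_CPU_AGGREGATE.usage", alarm_enabled=False, qos_enabled=True),  # QoS-only
]
# Suppose neither value could be collected on this poll: only the alarm-enabled monitor alerts.
print([m.key for m in monitors if should_alert(m, value_collected=False)])  # ['vmware1:HOST.Status']
```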



  • 29.  Re: vmware self-monitoring alerts

    Posted 05-03-2017 06:41 AM

    Dan,

     

    This helped me a lot. Thanks a lot for sharing this information.

     

    -kag