[DPP] vSphere Subscription and Cloud Services


Week 8 - vSphere Health Part 3: Design Iterations

  • 1.  Week 8 - vSphere Health Part 3: Design Iterations

    Posted Jun 10, 2019 06:00 PM

    Hi, Design Team Participants,

    Thank you for your feedback last week on our iterated health services designs. We got to understand how you normally troubleshoot a vSphere environment and discovered insights we did not know before, all thanks to your comments! Check out our revised designs this week, as well as more questions embedded in the prototype and posted below. Here is the link to the prototype.

    Now that we are at the end of our 8-week design partner program, you may ask - what's next? We would like to have a meeting with all of you next week, from 8-9am on Wednesday, June 19th, to share our learnings and get any additional feedback. Stay tuned for a meeting invite soon!

    Thank you for your participation and we look forward to your feedback this week!

    Best,

    VMware Design Team



  • 2.  RE: Week 8 - vSphere Health Part 3: Design Iterations

    Posted Jun 10, 2019 06:02 PM

    All Questions:

    Q1: A few of you mentioned that the labeled severity (critical or warning) sometimes does not indicate the priority correctly, and you would prioritize the alerts based on your own experience. Can you tell us a little more about how you would prioritize the alerts? What information do you look for when you prioritize?

    Q2: What are the scenarios when you would acknowledge an alert and when you snooze an alert? What is your expectation after snoozing an alert? Apart from acknowledging and snoozing, what are the other actions you would like to take?

    Q3: For this particular issue, what will be your next step if you see a production cluster exceeding resource threshold?

    Q4: Some of you mentioned you would like to see alert information over time. Is the view above what you are looking for? How would this view help you to address this scenario?

    Q5: What would you do next from here?



  • 3.  RE: Week 8 - vSphere Health Part 3: Design Iterations

    Posted Jun 10, 2019 11:27 PM

    Q1: A few of you mentioned that the labeled severity (critical or warning) sometimes does not indicate the priority correctly, and you would prioritize the alerts based on your own experience. Can you tell us a little more about how you would prioritize the alerts? What information do you look for when you prioritize?   In all of our monitoring apps, we've had to reclassify any production incident that is visible to our users as 'critical'.  Essentially, our vCSA could crash but since it isn't customer-facing it isn't critical.  Also, anything that affects test/dev/QA environments is not critical.
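    That reclassification rule - only customer-facing production incidents stay critical - could be sketched roughly like this (hypothetical function and field names for illustration, not anything VMware ships):

```python
def effective_severity(labeled_severity, environment, customer_facing):
    """Downgrade alerts that do not affect customer-facing production.

    A vCSA crash in a non-customer-facing role, or anything in
    test/dev/QA, is reclassified to at most a warning.
    """
    if environment != "production" or not customer_facing:
        return "warning" if labeled_severity == "critical" else labeled_severity
    return labeled_severity
```

The labeled severity is kept only when both conditions (production and customer-facing) hold, which matches the "visible to our users" rule above.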

    Q2: What are the scenarios when you would acknowledge an alert and when you snooze an alert? What is your expectation after snoozing an alert? Apart from acknowledging and snoozing, what are the other actions you would like to take?  To be honest, I almost never use those buttons in vROPS;  I would typically just fix the problem and then check to see if the alert clears.  It would be nice if it had a "reset to green" like vCenter and/or a "Check Status" button to check that an issue is resolved.  I would find those more useful than either of the current options.

    Q3: For this particular issue, what will be your next step if you see a production cluster exceeding resource threshold?   Try to determine if the increase in utilization was a result of growth or if a VM is experiencing abnormal usage.  Then, either move VMs to another cluster or add additional host resources to accommodate the growth.

    Q4: Some of you mentioned you would like to see alert information over time. Is the view above what you are looking for? How would this view help you to address this scenario?  I like that view.  Usually I have to work from memory (e.g. "I think I remember getting this alert before"), but it is much more helpful to see instantly that this cluster has a history/pattern of alerts.  The only change I would suggest is a dropdown to change the scope to 24 hours, 2 days, 7 days, 30 days, etc., just from an ease-of-use perspective.
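    The suggested scope dropdown amounts to filtering the alert history by a selectable trailing window; a minimal sketch (hypothetical names, assuming timestamped alert records):

```python
from datetime import datetime, timedelta

# Hypothetical dropdown options mapped to trailing windows.
SCOPES = {
    "24 hours": timedelta(hours=24),
    "2 days": timedelta(days=2),
    "7 days": timedelta(days=7),
    "30 days": timedelta(days=30),
}

def alerts_in_scope(alerts, scope, now):
    """Keep only the alert records inside the selected trailing window."""
    cutoff = now - SCOPES[scope]
    return [a for a in alerts if a["time"] >= cutoff]
```

Changing the dropdown selection then just re-filters the same history with a different cutoff.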

    Q5: What would you do next from here?   I would look at the cluster in vROPS and check the top memory heavy-hitters to see if one is abnormally high.
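    That heavy-hitter check could be sketched as ranking VMs by usage and flagging statistical outliers (a hypothetical helper; the two-standard-deviation threshold is chosen purely for illustration):

```python
from statistics import mean, stdev

def memory_heavy_hitters(vm_usage_gb, top_n=5):
    """Rank VMs by memory use and flag any whose usage sits more than
    two standard deviations above the cluster mean ("abnormally high").

    vm_usage_gb: mapping of VM name -> memory consumed in GB.
    Returns (name, usage_gb, is_outlier) tuples, highest usage first.
    """
    avg = mean(vm_usage_gb.values())
    sd = stdev(vm_usage_gb.values())
    ranked = sorted(vm_usage_gb.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, gb, gb > avg + 2 * sd) for name, gb in ranked[:top_n]]
```

A single flagged VM points at an abnormal workload; no flags with uniformly high usage points at organic growth instead.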



  • 4.  RE: Week 8 - vSphere Health Part 3: Design Iterations

    Posted Jul 17, 2019 11:14 PM

    Hi SirVesa,

    We are trying to do individual follow-up sessions with some of the participants. I'm having a hard time looking up your email in our DPP participant list because I do not know your legal name. Could you let me know what your email is so that I can send you a meeting invitation?

    Thanks,

    Lynette



  • 5.  RE: Week 8 - vSphere Health Part 3: Design Iterations

    Posted Jul 17, 2019 11:38 PM

    Hi Lynette,

    You can reach me at larry.miller@engie.com.

    Larry C. Miller Jr.

    Engie Resources NA



  • 6.  RE: Week 8 - vSphere Health Part 3: Design Iterations

    Posted Jul 19, 2019 03:38 PM

    Hi SirVesa, I sent you an invite where you can sign up for a 30-min session with us. Let us know if any of the times work for you. Thanks!

    - Lynette



  • 7.  RE: Week 8 - vSphere Health Part 3: Design Iterations

    Posted Jun 11, 2019 07:56 AM

    Q1: A few of you mentioned that the labeled severity (critical or warning) sometimes does not indicate the priority correctly, and you would prioritize the alerts based on your own experience. Can you tell us a little more about how you would prioritize the alerts? What information do you look for when you prioritize?

    For us, alerts are prioritized on a business impact basis, and depending on the level of redundancy of the affected infrastructure. E.g., a downed redundant network link is a warning, while a downed single link is critical when an entire branch loses connectivity.

    Q2: What are the scenarios when you would acknowledge an alert and when you snooze an alert? What is your expectation after snoozing an alert? Apart from acknowledging and snoozing, what are the other actions you would like to take?

    We usually prefer to solve the root cause and wait for the console to get back to green. The reset to green is useful when you want to be notified of temporary issues to be investigated later, e.g. temporarily high latency in a storage path.

    Q3: For this particular issue, what will be your next step if you see a production cluster exceeding resource threshold?

    I would check if any unusual task was running at the time the problem was detected (VM cloning, snapshot revert, ...), then analyze logs in the affected infrastructure, then determine the cause, take any corrective/preventive actions, and reset to green if needed.

    Q4: Some of you mentioned you would like to see alert information over time. Is the view above what you are looking for? How would this view help you to address this scenario?

    Not exactly. I'd like to see something like "This issue was detected 25 times over the last day", to be able to identify design/sizing problems.
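    That "detected N times over the last day" summary is just a count over a trailing window; a minimal sketch (hypothetical names, assuming a list of firing timestamps per issue):

```python
from datetime import datetime, timedelta

def occurrence_summary(alert_times, now, window=timedelta(days=1)):
    """Count how often this issue fired inside the trailing window and
    phrase it the way suggested above."""
    count = sum(1 for t in alert_times if now - window <= t <= now)
    return f"This issue was detected {count} times over the last {window.days} day(s)"
```

A high count over a short window is the signal for a design/sizing problem rather than a one-off event.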

    Q5: What would you do next from here?

    I would go to performance analysis tools (vROps if I have it, or the vCenter performance tab), try to identify the cause(s), and follow the trail of bread crumbs (see answer to Q3).



  • 8.  RE: Week 8 - vSphere Health Part 3: Design Iterations

    Posted Jun 11, 2019 10:00 PM

    Q1: A few of you mentioned that the labeled severity (critical or warning) sometimes does not indicate the priority correctly, and you would prioritize the alerts based on your own experience. Can you tell us a little more about how you would prioritize the alerts? What information do you look for when you prioritize?

    Some alerts, like vSAN disk balance, don't work well for all vSAN deployments; for some clusters it's always going to be on.  Alerts like "haven't tested HCL against online test" aren't a priority, more informational.  I have to read the alert and look at what's really affected.

    Q2: What are the scenarios when you would acknowledge an alert and when you snooze an alert? What is your expectation after snoozing an alert? Apart from acknowledging and snoozing, what are the other actions you would like to take?

    I acknowledge alerts all the time based on high usage, because they don't self-clear.  Also, some things you fix don't self-clear, so I'm always acknowledging alerts to see if they are currently valid.  I would expect that if I snooze an alert, it takes another recurrence to bring it back up, not just re-raise the alarm because it was snoozed.  I would like option buttons for "disable this alert for this host" or for this cluster if you are viewing at the cluster object level.  The next screens showing the line graph are an awesome next step to see occurrence rates.  Maybe a link to "open this in Log Insight" or some other tool.
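    The snooze behavior described here - hidden until a new recurrence, rather than re-raised just because a timer expired - could be modeled like this (a sketch of the suggested semantics, not the actual vCenter alarm implementation):

```python
class SnoozableAlert:
    """Snoozed alerts stay hidden until a *new* occurrence arrives;
    mere expiry of the snooze timer does not re-raise them."""

    def __init__(self):
        self.visible = True
        self.snoozed = False

    def snooze(self):
        self.snoozed = True
        self.visible = False

    def timer_expired(self):
        # Expiry alone does NOT re-raise; wait for a real recurrence.
        self.snoozed = False

    def recur(self):
        # A fresh occurrence clears the snooze and re-raises the alert.
        self.snoozed = False
        self.visible = True
```

The key design choice is in `timer_expired`: it only lifts the suppression, so the operator is not nagged again unless the condition actually fires again.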

    Q3: For this particular issue, what will be your next step if you see a production cluster exceeding resource threshold?

    Sort by VMs, look at the heat map, and see what's causing it.  Then look at that object to see how frequently it is the cause.  If it's everything, then I assume some sort of scan or hotfix reboot/deployment and that the high usage is expected.  If they are new VMs and they don't have a long history line, then I would look for another cluster for that VM.  At worst, I raise a ticket with the app owner to discuss utilization.

    Q4: Some of you mentioned you would like to see alert information over time. Is the view above what you are looking for? How would this view help you to address this scenario?

    Yes, yes, and yes. I like the view with the line graph, and then the bullets of when it occurred and for how long; I like that display.  The top part, where it has CPU capacity for the vCenter-wide view, does nothing for me.  With many clusters, there could be one with 10% free and one with 90% free, so an aggregate percentage doesn't help me at all.

    Q5: What would you do next from here?

    Narrow down to the cluster view and see if it's all hosts or one host.  Go to the VM view, same: is it many or just one?  Then decide on relocation or just wait it out.



  • 9.  RE: Week 8 - vSphere Health Part 3: Design Iterations

    Posted Jun 12, 2019 03:27 PM

    All Questions:
    Q1: A few of you mentioned that the labeled severity (critical or warning) sometimes does not indicate the priority correctly, and
    you would prioritize the alerts based on your own experience. Can you tell us a little more about how you would prioritize the alerts?
    What information do you look for when you prioritize?

    The priority when it comes to alerts for me is all about the impact to the business and the urgency in resolving the issue. For example, those VMs hosting customer-facing services would have the highest priority, given that issues with them could have a legal and negative reputational impact on the business. This also extends through the stack of underpinning services, such as the hosts, storage and networking. Therefore a cluster of hosts supporting production workloads that underpin customer-facing services would get a higher priority than a cluster of hosts supporting internal test and development workloads. But this is still too broad to really prioritize where the focus needs to be.

    Q2: What are the scenarios when you would acknowledge an alert and when you snooze an alert? What is your expectation after snoozing an alert? Apart
    from acknowledging and snoozing, what are the other actions you would like to take?

    Typically I would acknowledge an alert when it involves raising a support request to third-party support. In these types of scenarios it would be great to have the ability to add some additional notes, such as a support case number or local incident ID, to enable others to search out additional information. This would be really useful for our follow-the-sun support teams, as it would enable each shift to identify which issues have been handled as well as how to get further information. The main use case for snoozing an alert is when implementing a change to the infrastructure, whereby some alerts, such as increased CPU or memory use, could occur when updating hosts, for example. Not sure if this has been included, but it would be great to have the option of how long to snooze an alert for.
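    The acknowledge-with-notes idea amounts to attaching a small annotation record to the alert; a sketch under the assumption that alerts are plain records (a hypothetical data model, not an existing vSphere API):

```python
from datetime import datetime, timezone

def acknowledge(alert, case_id, note, who, when=None):
    """Mark an alert acknowledged and attach a searchable paper trail
    (support case number, free-text note) for the next shift."""
    alert["acknowledged"] = True
    alert["ack_info"] = {
        "case_id": case_id,  # e.g. vendor support request number
        "note": note,        # free-text context for the next shift
        "by": who,
        "at": (when or datetime.now(timezone.utc)).isoformat(),
    }
    return alert
```

A follow-the-sun team could then search on `case_id` or `by` to see which issues are already being handled.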

    Q3: For this particular issue, what will be your next step if you see a production cluster exceeding resource threshold?

    First, identify what resource is being exceeded and review changes made to the cluster, such as new VM provisioning, cloning, or an issue with or maintenance of a host. In addition, I would also look to understand what the workload pattern looks like and whether the usage is a one-off (and therefore an issue) or legitimate activity such as an application test. While moving the VMs is a valid response, there are a couple of actions to take before considering this as a resolution.


    Q4: Some of you mentioned you would like to see alert information over time. Is the view above what you are looking for? How would this
    view help you to address this scenario?

    This would be helpful because it would drive a discussion with the application owners to understand what is happening under the hood, and help justify any additional resources required for the VM, which in turn can also drive discussion about additional underlying infrastructure. It would be great to layer in other information, such as what host the VM was running on at the time. The scenario could play out where a VM migrates to a particular host during the day that in turn causes performance issues, and then later in the day migrates to another host where the issues do not exist.

    Q5: What would you do next from here?

    As previously stated, it would be to have conversations with the application owner and support team to understand what (if any) application processes would be running that could cause such memory increases.



  • 10.  RE: Week 8 - vSphere Health Part 3: Design Iterations

    Broadcom Employee
    Posted Jun 13, 2019 08:54 AM

    " CSprehawrote: I would like option buttons for "disable this alert for this host" or for this cluster if you are viewing at the cluster object level. "

    Thanks for the excellent suggestions. We got the same request for disabling alarms on individual inventory objects (hosts, VMs, etc.), and we are delivering this functionality in the next vSphere release. You will be able to disable the alarm not only at the object where it is defined but on each individual child object as well.

    --Antoan, PM vSphere Alarms



  • 11.  RE: Week 8 - vSphere Health Part 3: Design Iterations

    Posted Jun 14, 2019 08:55 AM

    All Questions:

    Q1: A few of you mentioned that the labeled severity (critical or warning) sometimes does not indicate the priority correctly, and you would prioritize the alerts based on your own experience. Can you tell us a little more about how you would prioritize the alerts? What information do you look for when you prioritize? It depends on how you look at the alert, maybe from the top down. Does it affect everything? I.e., a red alert on vCenter to me is "drop tools, let's take a look", while an alert on a VM itself is BAU. I still think you should be able to click on the alert and maybe perform some self-check or self-heal workflow to see if it can be corrected by a reset or something like this.

    Q2: What are the scenarios when you would acknowledge an alert and when you snooze an alert? What is your expectation after snoozing an alert? Apart from acknowledging and snoozing, what are the other actions you would like to take? I never snooze an alert unless it's a bug that's fixed in the next update. TBH, the snooze option is a cheat, and you're only going to get in trouble later on down the line unless VMware recommends snoozing it.

    Q3: For this particular issue, what will be your next step if you see a production cluster exceeding resource threshold? The age-old question; I have had this for over ten years. And it always comes back to two things: 1. buy more hardware, or 2. move VMs off to another cluster.

    Q4: Some of you mentioned you would like to see alert information over time. Is the view above what you are looking for? How would this view help you to address this scenario? I would love to see some trending of alerts, i.e. this host always has nic3 down, this host always gets the TPM warning, etc.

    Q5: What would you do next from here? N/A