rhaaaven wrote:
After reading some threads on Communities I've started to feel inspired to simply...ask for advice when I'm not able to find solution to the matter on my own :smileyhappy:
Nice! That's what it's all about. Anyway, hopefully we can help. This is a cool, but possibly tricky topic. Personally, I only use the 3rd party VDS (1000v), so I don't have production experience with the new Health Check feature of the VMware VDS. However, the fundamentals of reviewing port channel configs, etc. for VMHosts are the same.
The first step would be to generate a CDP report using PowerCLI. This creates a .csv file which you can clean up in Excel and then share with your Network folks. That way there is no ambiguity about which switch ports are under review. Have them check the port channel configs for typos, etc., and make sure they send you the text output showing the configs.
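If Excel isn't handy, the cleanup step can also be scripted. Here's a minimal Python sketch, assuming the CDP export has columns named `VMHost`, `Nic`, `Switch` and `PortId` (those names are my assumption, as are the sample hosts and ports; adjust to whatever your report actually contains). It groups the host NICs by upstream switch so the Network folks get one tidy list per switch.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names; rename to match your actual CDP export.
COLUMNS = ("VMHost", "Nic", "Switch", "PortId")


def group_ports_by_switch(csv_text):
    """Return {switch: sorted list of (port, host, nic)} from CDP report text."""
    grouped = defaultdict(list)
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # Strip stray whitespace that tends to creep into exported reports.
        host, nic, switch, port = ((row[c] or "").strip() for c in COLUMNS)
        if not switch:
            continue  # no CDP neighbor was seen on this NIC
        grouped[switch].append((port, host, nic))
    return {sw: sorted(ports) for sw, ports in grouped.items()}


def format_report(grouped):
    """Render the grouped ports as plain text, one block per switch."""
    lines = []
    for switch in sorted(grouped):
        lines.append(switch)
        for port, host, nic in grouped[switch]:
            lines.append(f"  {port}  <-  {host} {nic}")
    return "\n".join(lines)


# Made-up sample data, only to show the expected shape of the input.
SAMPLE = """VMHost,Nic,Switch,PortId
esx01,vmnic0,sw-a,Gi1/0/1
esx01,vmnic1,sw-b,Gi1/0/1
esx02,vmnic0,sw-a,Gi1/0/2
esx02,vmnic1,,
"""

if __name__ == "__main__":
    print(format_report(group_ports_by_switch(SAMPLE)))
```

NICs with no CDP neighbor are skipped here; you may prefer to list them separately, since a missing neighbor can be a clue in itself.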
The problem is that I've had no problem at all since the beginning of this environment and the third party that is managing our network (switch configuration) is assuring me that no changes were made and that these mismatches have been present since the hosts were first connected.
Is it possible that the health check feature of the VDS was only recently turned on? It's not on by default AFAIK. The issue may have been going on unobserved (speculation). It's also possible that no changes were made to your VMHosts' switch ports, but that something changed in switch-to-switch communication.
Anyway, you may consider obtaining the MAC addresses of the guest OSes that were affected during the loss of ping. Have the network guys look for those MACs on the switches and see if they find anything interesting.
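One practical snag when handing over MACs: vSphere shows them as `00:50:56:ab:cd:ef`, while many switches (Cisco IOS, for example) print them as `0050.56ab.cdef`. A small Python helper, purely as a sketch, can convert a list so the network guys can search for them directly. The function names are made up for illustration.

```python
import re


def normalize_mac(mac):
    """Return the 12 lowercase hex digits of a MAC, whatever separators it used."""
    digits = re.sub(r"[^0-9A-Fa-f]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


def to_dotted(mac):
    """Format as xxxx.xxxx.xxxx, the style many switch CLIs display."""
    d = normalize_mac(mac)
    return ".".join(d[i:i + 4] for i in range(0, 12, 4))


def to_colon(mac):
    """Format as xx:xx:xx:xx:xx:xx, the style vSphere displays."""
    d = normalize_mac(mac)
    return ":".join(d[i:i + 2] for i in range(0, 12, 2))


if __name__ == "__main__":
    # Example address only; substitute the MACs of the affected guests.
    print(to_dotted("00:50:56:AB:CD:EF"))
```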
I assume that increased number of VMs - therefore network traffic - caused this to come out?
That is absolutely possible and I have seen that many times. If your "VM Network" consists of 4 physical NIC ports, you may not observe the issue until the 4th VM is powered on. That's a simplified case, but when the port channel is misconfigured I typically see a percentage of VMs (e.g. 25%) fail to ping when placing them on a bad host.
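To see where a figure like 25% comes from, here's a toy Python model of that simplified case. It assumes each VM gets pinned to one uplink in round-robin order and that exactly one of the four uplinks sits behind a misconfigured switch port. Real placement depends on your teaming policy, so treat this as an illustration only; the names are hypothetical.

```python
def failing_vms(vm_names, uplinks, bad_uplinks):
    """Pin VMs to uplinks round-robin and return the VMs that land on a bad uplink."""
    failing = []
    for index, vm in enumerate(vm_names):
        uplink = uplinks[index % len(uplinks)]
        if uplink in bad_uplinks:
            failing.append(vm)
    return failing


UPLINKS = ["vmnic0", "vmnic1", "vmnic2", "vmnic3"]
BAD = {"vmnic3"}  # assumption: the one NIC behind the misconfigured port

if __name__ == "__main__":
    for count in (3, 4, 20):
        vms = [f"vm{n:02d}" for n in range(1, count + 1)]
        bad = failing_vms(vms, UPLINKS, BAD)
        print(f"{count:2d} VMs: {len(bad)} failing ({len(bad) / count:.0%})")
```

In this model, nothing fails with three VMs, the fourth VM is the first to land on the bad uplink, and at 20 VMs a quarter of them fail. That would explain why the problem only shows up as the VM count grows.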
It will serve you well to document the vCenter and ESXi versions in use, along with the NIC models, firmware and driver versions, number of NICs, etc. If your network team decides to escalate to the switch vendor (or you engage VMware support), you will need to have this info handy. We can help with any questions you have while gathering that info, if desired.