From the GUI, in vCenter: navigate to the vSphere Distributed Switch > Actions > Add and Manage Hosts..., select Manage host networking, and from there make sure you only select the problematic host. On the Manage physical adapters page, switch the vmnic-to-uplink association and then hit Next, Next... (without assigning any portgroup on the VMkernel adapters page).
Make sure you do all this while the host is in maintenance mode, just in case.
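If you want to sanity-check the result from the host side afterwards (assuming SSH/ESXi Shell access on that host; the friendly uplink names are still easiest to confirm in the vCenter UI), this lists which vmnics are attached as uplinks to the vDS on that host and the dvUplink port each one occupies:
# esxcli network vswitch dvs vmware list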
Original Message:
Sent: Jun 11, 2025 12:31 PM
From: jim33boy
Subject: Latency on 4 Node VSAN cluster
@Alexandru Capras - thank you for the reply. I like the 2nd method. From where are you able to modify the vmnic-to-uplink association? Is this done from the GUI or from the CLI? Any help is appreciated.
Thanks
Original Message:
Sent: Jun 11, 2025 09:33 AM
From: Alexandru Capras
Subject: Latency on 4 Node VSAN cluster
Either way, you need to manage the host from the 'global' vDS and assign the new portgroup, or you can simply switch the uplink/vmnic relation on just the problematic host.
Let's say you have vmnic1 = uplink1 (active) and vmnic2 = uplink2 (standby) as per your vSAN dPG configuration. Now, if you change the physical adapters on the one host to vmnic1 = uplink2 and vmnic2 = uplink1 you will have vmnic2 as the active NIC.
If you want to go with the dPG way, you also need to configure the active/standby uplinks for it.
Both methods worked in my lab.
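Whichever way you go, it is worth confirming first which vmk is actually carrying the vSAN traffic on that host (a quick check, assuming shell access):
# esxcli vsan network list
The output shows the vsan-tagged VmkNic name (e.g. vmk2) and its traffic type - that is the interface whose portgroup/uplink behaviour you are about to change.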
Original Message:
Sent: Jun 09, 2025 09:07 PM
From: jim33boy
Subject: Latency on 4 Node VSAN cluster
Hello, apologies for the delay, I appreciate the feedback from all.
I'm planning the maintenance, and I am going to try @TheBobkin's suggestion below.
" then set that vmnic to down state (either via vSphere client or esxcli command), then once it is down it will be using the other vmnic (can confirm in esxtop), then you can retest the vmkpings and validate that there is no packet loss."
This seems simple enough, I will vmotion the VMs to the other 3 nodes, then I will take the NIC down as suggested.
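For reference, if I end up doing the maintenance mode part from the CLI instead of the GUI, my understanding of the esxcli syntax (please correct me if the --vsanmode option differs on our build) is roughly:
# esxcli system maintenanceMode set -e true -m ensureObjectAccessibility
# esxcli system maintenanceMode set -e false
The first command enters vSAN maintenance mode with Ensure Accessibility, the second exits it once the test is finished.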
Aside from this, I was considering switching the active/standby uplinks for the VSAN network on a single host; however, the uplink teaming for a VDS portgroup applies to every host using that portgroup, so modifying the VSAN VDS portgroup would affect all hosts, and I only want to do this for the single host. I wanted to ask the group the following. Considering the VSAN portgroup currently has uplink1 as active and uplink2 as standby, can I create a different VSAN portgroup, perhaps VSAN-DPG2, assign uplink2 as active and uplink1 as standby, and then assign this portgroup to the single host? In this scenario hosts 1-3 would have the VSAN portgroup using uplink1 as active and uplink2 as standby, and only the 4th (problematic) host would have uplink2 as active and uplink1 as standby. In this way I can simply change the portgroup for testing rather than running with reduced network redundancy.
Thanks again
Original Message:
Sent: May 26, 2025 04:54 AM
From: Duncan Epping
Subject: Latency on 4 Node VSAN cluster
Yes, that is a valid point.
Original Message:
Sent: May 23, 2025 05:05 PM
From: TheBobkin
Subject: Latency on 4 Node VSAN cluster
I would respectfully disagree - they should put the node in Maintenance Mode before testing that. The reason is that we have no idea what the state of the other uplink is until it is in use; I have seen it twice when doing exactly this that the other uplink was completely unusable (either due to misconfiguration or some other reason), and when that occurs the node becomes isolated and all VMs on it go down. So having the node in MM with the EA option is advisable, just in case.
"If you ask me, I would suggest getting support to look at the environment and figure out what the root cause is."
Yes, they should, they can ask for me if they like and I am in their geo 😎.
"I wanted to provide more info. I find that on 3 of the 4 VSAN nodes, the clomd.log is not being updated. From what I read this is not normal, do you have any idea if the clomd.log not being updated is normal and if not any suggestions on how to correct that?"
Are you sure it is just clomd.log and not other logs too? Typically if clomd.log stops logging for any reason, the last lines of that are a backtrace from CLOM crashing (and generally indicates what object caused that crash) - check if CLOM is running:
# /etc/init.d/clomd status
# /etc/init.d/clomd start
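A couple of other quick checks (plain shell, nothing vSAN-specific) to confirm whether it is really only clomd.log that has gone quiet and whether there is a crash backtrace at the end of it:
# ls -l /var/log/clomd.log /var/log/vmkernel.log
# tail -20 /var/log/clomd.log
Compare the timestamps of the two logs and look at the last lines of clomd.log for any backtrace before restarting the service.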
"Additionally, at the cluster level and host level, the latency spikes seem to rise across all 4 nodes at the same times and these correlate with outstanding IO spikes. The correlation is common across frontend and backend at both cluster and host level. "
You may be reading this back to front - the OIO increasing is due to IOs not being processed in a timely manner.
"I also see delayed IOPs spikes at the host diskgroup level. The read cache hit rate drops to 70% "
I would advise focusing on ruling out the packet loss issue here and only assessing any further performance issues once that has been fixed.
Original Message:
Sent: May 21, 2025 02:55 AM
From: jim33boy
Subject: Latency on 4 Node VSAN cluster
@TheBobkin, thanks for the reply. The network team once again confirmed no drops or CRC errors on the switch ports. I like your suggestion for switching NIC traffic. I am currently on vmnic1 and will plan to switch this to vmnic0. Do you suggest that I put the hosts in maintenance mode before doing this?
I wanted to provide more info. I find that on 3 of the 4 VSAN nodes, the clomd.log is not being updated. From what I read this is not normal, do you have any idea if the clomd.log not being updated is normal and if not any suggestions on how to correct that?
Additionally, at the cluster level and host level, the latency spikes seem to rise across all 4 nodes at the same times and these correlate with outstanding IO spikes. The correlation is common across frontend and backend at both cluster and host level.
I also see delayed IOPs spikes at the host diskgroup level. The read cache hit rate drops to 70%
Though VM latency is extremely high, the disk group SSD and capacity drive latencies are only showing spikes to 10ms.
If you have any other advice based on the info, I appreciate any feedback.
Original Message:
Sent: May 16, 2025 03:59 PM
From: TheBobkin
Subject: Latency on 4 Node VSAN cluster
@jim33boy, thanks for following up on all the bits I asked clarity about.
That sounds to be purely a networking issue (because Backend would not show as fine if you had disk issues). The fact that it is only being observed when testing vmkping to/from a single node is a strong indicator that it is just some problematic component associated with that node, e.g. SFP/cable/switchport. *Most* of the time I have seen this it has been a cable, much more rarely the others (a couple of instances of seating issues too).
If your network infrastructure team are reticent to replace/check anything without more evidence, I usually try to narrow this down further by confirming which uplink the issue is observed on. If you have an Active/Standby vmnic configuration backing the vsan-tagged vmk, then you can use esxtop with the 'n' option and it will show you the vmnic being actively used for the vsan-tagged vmk. You can then put the node in Maintenance Mode (Ensure Accessibility option), then set that vmnic to down state (either via vSphere client or esxcli command), then once it is down it will be using the other vmnic (can confirm in esxtop), then you can retest the vmkpings and validate that there is no packet loss.
If there is no packet loss then you have halved the number of components that need to be checked and also should be able to run workloads better in that state (if you don't mind reduced network redundancy until that is fixed).
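If you prefer to do the vmnic part from the CLI, the rough outline would be the following (vmnic1 here is just a placeholder - substitute whichever vmnic esxtop shows as active for the vsan-tagged vmk):
# esxcli network nic list
# esxcli network nic down -n vmnic1
# esxcli network nic up -n vmnic1
The first command shows the link state of each vmnic so you can confirm which one you are about to take down (and that it comes back afterwards); the down/up commands administratively disable and re-enable that uplink.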
Original Message:
Sent: May 14, 2025 08:43 PM
From: jim33boy
Subject: Latency on 4 Node VSAN cluster
@terrible_towel - thank you for that, this is a Dell 4 node VSAN cluster so this is encouraging.
@TheBobkin - Thank you for the reply.
Regarding how we become aware of the issue, you stated.. "Where/how are you measuring/reporting this? VM/application users reporting slow or non-responsiveness or from vSAN Performance graphs?"
A combination of both. Our production grinds to a halt, so this is something that is reported by the end users. When the issue occurs, I log in to the VCSA and, for each VM under the Monitoring / vSAN / Performance view, I see the obscene latency numbers. I also set up load detection on the guest OS and this alerts me. From within the guest, iostat shows the disks as 100% utilized while the IOPS and throughput are almost non-existent, which is quite odd. I also collect stats with PowerShell over a week-long period and the latency is consistent.
Regarding vmkping and packet loss, you stated. "This is a serious point of concern as what that means is the test packets are likely failing which means actual data packets are likely being lost/dropped on the network also - ANY amount of packet-loss can have extremely deleterious impact on vSAN performance (as it would for any HCI solution that relies on synchronous replication)."
I agree, though I did leave something out. I am running the vmkping with the -d option so that packets are not fragmented. This is per Broadcom KB 344313, which recommends using the -d switch and the -s switch with an argument of 1472, since we are using 1500 as the MTU. If I use a regular vmkping I get much less packet loss, though still some. That being said, with -d -s 1472 it is 8 to 10% loss. I had the network team check the switch ports on the 10GbE interfaces and they found no drops or CRC errors.
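For completeness, the exact test I am running looks roughly like the below (vmk2 and the target address are placeholders for our vSAN vmkernel interface and a peer host's vSAN IP):
# vmkping -I vmk2 -d -s 1472 -c 100 192.168.1.12
-I selects the vSAN vmkernel interface, -d sets the don't-fragment bit, -s 1472 leaves room for the 28 bytes of IP/ICMP overhead within the 1500 MTU, and -c 100 sends 100 packets so the loss percentage is easy to read off.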
Regarding the onset of poor performance a few months back..You asked this.. "Did that coincide with any network or VM load changes?"
No, the VM environment is relatively static in the number of VMs and compute/memory provisioned. To me this says some sort of physical media issue but I have yet to find anything.
Regarding looking at latency metrics at the cluster level, you asked if we are seeing latency on the frontend, the backend, or both. The frontend latency is the issue; the backend looks normal. Pictures attached.
Regarding the 10% packet loss being from 1 node or not. You asked.."have you narrowed it down to being to/from a single node? "
Yes, actually, there is only 1 node that seems to report loss with vmkping -d -s 1472. I will look into the cable and SFPs for this node.
Thank you for your replies, these are very helpful. Please let me know if anything else comes to mind.
Original Message:
Sent: May 14, 2025 05:43 PM
From: TheBobkin
Subject: Latency on 4 Node VSAN cluster
@jim33boy
"We have a severe latency problem on a 4 node Hybrid VSAN cluster."
Where/how are you measuring/reporting this? VM/application users reporting slow or non-responsiveness or from vSAN Performance graphs?
"Skyline checks are all green excluding the vSAN network alarm for MTU ping check which pops up occasionally."
This is a serious point of concern as what that means is the test packets are likely failing which means actual data packets are likely being lost/dropped on the network also - ANY amount of packet-loss can have extremely deleterious impact on vSAN performance (as it would for any HCI solution that relies on synchronous replication).
"We are using the default 1500 MTU on the VMware side and I have asked that the network ports be confirmed to be using 1500 as well. I believe that they are but I am awaiting confirmation. I have read that the switch being at 9000 and the VMware environment being at 1500 is OK, however I do not know if that is true or not."
The vsan-traffic tagged vmk on each node is the endpoint between each node here - if these are all set consistently (e.g. 1500 OR 9000 MTU on all vsan-tagged vmks) then having the vSwitch/vmnic/switchport MTU higher is not an issue at all.
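A quick way to confirm what each node's vmks are actually set to (run on each host; the switchport side is for your network team to verify):
# esxcli network ip interface list
The MTU value shown for the vsan-tagged vmk is the one that needs to be consistent across all nodes.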
"A few months back we started suffering from extremely poor performance."
Did that coincide with any network or VM load changes?
"The VSAN performance tab for the VMs shows an average read latency of ~ 200ms and an average write latency of ~ 300ms. The latency maximums are hitting over 5000ms."
One of the simplest ways of indicating (99% of the time anyway) whether a vSAN performance issue is due to network or storage (or, the other 1% of the time, vSAN modules/system) is to compare 'VM' (Frontend) to 'Backend' (Storage) latencies at the cluster entity level. If you have high latency on the former but low latency on the latter, then the latency is being incurred on the network; if you have it on both, then the latency is being incurred due to storage/LSOM/disk latencies. Please let us know what you see for both of these and share screenshots if possible.
If you DO see 'Backend' latency in the cluster-level metrics, then the next step is to check 'Backend' latency on every node in the cluster; oftentimes there is a single Disk-Group or disk causing this, so going through each node to narrow it down (and then, once the outlier is found, each Disk-Group and then each disk) is advised.
"when I do a vmkping on the VSAN storage interface, I can lose 10% of the packets at times."
That is a real problem - have you narrowed it down to being to/from a single node? If yes, then you need to narrow it down to which vmnic, and then check this properly from the physical network perspective, e.g. is it an issue with the NIC, cable, or switchport?
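One more data point that can help when pushing back on the physical network side is the NIC's own error counters on the host (vmnic1 below is just a placeholder for whichever vmnic backs the vsan-tagged vmk):
# esxcli network nic stats get -n vmnic1
Rising receive/CRC error counts on the host side combined with a clean switchport at the other end tends to point at the cable or SFP in between.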
Original Message:
Sent: May 14, 2025 01:14 AM
From: jim33boy
Subject: Latency on 4 Node VSAN cluster
Hello,
We have a severe latency problem on a 4 node Hybrid VSAN cluster.
The cluster has 4 nodes, each node with 3 disk groups. Each disk group has 3x 2.4TB 10k RPM HDD and 1x 400GB SSD.
Skyline checks are all green excluding the vSAN network alarm for MTU ping check which pops up occasionally. We are using the default 1500 MTU on the VMware side and I have asked that the network ports be confirmed to be using 1500 as well. I believe that they are but I am awaiting confirmation. I have read that the switch being at 9000 and the VMware environment being at 1500 is OK, however I do not know if that is true or not.
The utilized VSAN storage is under 60%, cpu utilization is under 20%, RAM utilization is under 60%. There are no resync or rebalance operations occurring. The average IOPS for the cluster is under 1000, the average throughput is under 20MiBps.
A few months back we started suffering from extremely poor performance. The VSAN performance tab for the VMs shows an average read latency of ~ 200ms and an average write latency of ~ 300ms. The latency maximums are hitting over 5000ms. I have collected data with get-vsanstat to confirm. The VSAN performance metrics at the host level also show very high latency on average. The disk group read cache hit rate hovers between 60% and high 90s. There are no packet drops on the Host network tab for the host.
I have so far been unable to find the cause of the latency. The IO is so low that it does not make sense. The only non-green Skyline check is the MTU check. I will say that for 1 of the hosts, when I do a vmkping on the VSAN storage interface, I can lose 10% of the packets at times. Does anyone have any advice as to where I can look to locate the cause of the latency based on what I have provided above? Any advice on where to look in general?
Thanks