
Latency on 4 Node VSAN cluster

  • 1.  Latency on 4 Node VSAN cluster

    Posted May 14, 2025 01:14 AM

    Hello,

    We have a severe latency problem on a 4 node Hybrid VSAN cluster. 

    The cluster has 4 nodes, each node with 3 disk groups. Each disk group has 3x 2.4TB 10k RPM HDD and 1x 400GB SSD.

    Skyline checks are all green except for the vSAN network alarm for the MTU ping check, which pops up occasionally. We are using the default 1500 MTU on the VMware side and I have asked that the network ports be confirmed to be using 1500 as well. I believe that they are, but I am awaiting confirmation. I have read that the switch being at 9000 while the VMware environment is at 1500 is OK, however I do not know whether that is true.
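    For reference, the MTU actually in effect on the host side can be confirmed from the ESXi shell. A rough sketch (the last command assumes a distributed switch - for a standard vSwitch it would be 'esxcli network vswitch standard list'):

    # esxcli vsan network list
    # esxcli network ip interface list
    # esxcli network vswitch dvs vmware list

    The first command shows which vmkernel interface is vSAN-tagged, the second shows the MTU configured on each vmk, and the third shows the MTU on the distributed switch itself.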

    The utilized VSAN storage is under 60%, CPU utilization is under 20%, and RAM utilization is under 60%. There are no resync or rebalance operations occurring. The average IOPS for the cluster is under 1000 and the average throughput is under 20 MiB/s.

    A few months back we started suffering from extremely poor performance. The VSAN performance tab for the VMs shows an average read latency of ~ 200ms and an average write latency of ~ 300ms. The latency maximums are hitting over 5000ms. I have collected data with Get-VsanStat to confirm. The VSAN performance metrics at the host level also show very high latency on average. The disk group read cache hit rate hovers between 60% and the high 90s. There are no packet drops on the Host network tab for the host.

    I have so far been unable to find the cause of the latency. The IO is so low that it does not make sense. The only non-green Skyline check is the MTU check. I will say that for 1 of the hosts, when I do a vmkping on the VSAN storage interface, I can lose 10% of the packets at times. Does anyone have any advice as to where I can look to locate the cause of the latency based on what I have provided above? Any advice on where to look in general?
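    For context, the packet-loss test I am running is roughly the following, run from the ESXi shell of one host against the vSAN vmkernel IP of another (vmk1 and the target address are placeholders for my environment):

    # vmkping -I vmk1 -c 100 <peer-vsan-vmk-IP>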

    Thanks



  • 2.  RE: Latency on 4 Node VSAN cluster

    Posted May 14, 2025 08:37 AM

    Hello.

    This is from a while back, but I had this same type of issue in the early days when I had a similar setup (400GB SSD cache disks). After a lot of back and forth between Dell and VMware at the time, we found a firmware upgrade for the SSDs themselves. Immediately after doing that firmware update on the SSDs, our latency was cut by two orders of magnitude.

    Might be worth checking your firmware level on the SSDs. If it just sort of started without any other explanation, maybe you got a firmware update that is bad for VSAN.
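    If it is useful, the firmware revision currently running on each drive can be read from the ESXi shell (the 'Revision' field in the output is the drive firmware version):

    # esxcli storage core device list | grep -iE 'Display Name|Revision'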




  • 3.  RE: Latency on 4 Node VSAN cluster

    Posted May 14, 2025 05:55 PM

    @terrible_towel

    Out of general curiosity, do you by any chance recall the disk model and firmware version with that fix?

    Over the years I have seen multiple advisories/KBs/docs for SSD/NVMe firmware updates that fix a given issue and improve performance (particularly from HPE, who seem to be really good and proactive about documenting such things publicly), but I have never seen a proper before-and-after comparison to see such things with my own eyes.

    Most of the time, if SSDs start having high latency spikes it is because they have run out of spare wear-leveling cells (e.g. SMART shows only a few percent remaining on the 'Media Wearout Indicator'); this is of course more common with smaller SSDs and with those not intended/specced for write-intensive usage. That being said, it is possible firmware issues might influence that in some way.
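    For reference, that SMART/wear data can be pulled per device directly on the host (the device identifier is a placeholder, and the exact attribute names vary by drive vendor):

    # esxcli storage core device smart get -d <device_id>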




  • 4.  RE: Latency on 4 Node VSAN cluster

    Posted May 14, 2025 05:52 PM

    @jim33boy

    "We have a severe latency problem on a 4 node Hybrid VSAN cluster."

    Where/how are you measuring/reporting this? VM/application users reporting slow or non-responsiveness or from vSAN Performance graphs?

    "Skyline checks are all green excluding the vSAN network alarm for MTU ping check which pops up occasionally."

    This is a serious point of concern: if the test packets are failing, then actual data packets are likely being lost/dropped on the network as well - ANY amount of packet loss can have an extremely deleterious impact on vSAN performance (as it would for any HCI solution that relies on synchronous replication).

    "We are using the default 1500 MTU on the VMware side and I have asked that the network ports be confirmed to be using 1500 as well. I believe that they are but I am awaiting confirmation. I have read that the switch being at 9000 and the VMware environment being at 1500 is OK, however I do not know if that is true or not."

    The vsan-traffic-tagged vmk on each node is the endpoint for this traffic between the nodes - if these are all set consistently (e.g. 1500 OR 9000 MTU on all vsan-tagged vmks) then having the vSwitch/vmnic/switchport MTU higher is not an issue at all.

    "A few months back we started suffering from extremely poor performance."

    Did that coincide with any network or VM load changes?

    "The VSAN performance tab for the VMs shows an average read latency of ~ 200ms and an average write latency of ~ 300ms. The latency maximums are hitting over 5000ms."

    One of the simplest ways (99% of the time, anyway) to determine whether a vSAN performance issue is due to network or storage (or, the other 1% of the time, vSAN modules/system) is to compare, at the cluster entity level, 'VM' (Frontend) latency with 'Backend' (Storage) latency. If you have high latency on the former but low latency on the latter, the latency is being incurred on the network; if you have it on both, the latency is being incurred due to storage/LSOM/disk latencies. Please let us know what you see for both of these and share screenshots if possible.

    If you DO see 'Backend' latency in the cluster-level metrics, the next step is to check 'Backend' latency on every node in the cluster. Oftentimes there is a single Disk-Group or disk causing this, so going through each node to narrow it down (and then, once the outlier is found, each Disk-Group and then each disk) is advised.

    "when I do a vmkping on the VSAN storage interface, I can lose 10% of the packets at times."

    That is a real problem - have you narrowed it down to being to/from a single node? If so, you then need to narrow it down to which vmnic, and then check it properly from the physical network perspective, e.g. is it an issue with the NIC, cable or switchport?




  • 5.  RE: Latency on 4 Node VSAN cluster

    Posted May 14, 2025 08:43 PM

    @terrible_towel - thank you for that; this is a Dell 4-node VSAN cluster, so this is encouraging.

    @TheBobkin -  Thank you for the reply.

    Regarding how we became aware of the issue, you asked: "Where/how are you measuring/reporting this? VM/application users reporting slow or non-responsiveness or from vSAN Performance graphs?"

    A combination of both. Our production grinds to a halt, so this is something that is reported by the end users. When the issue occurs, I log in to the VCSA and, for each VM under the Monitor > vSAN > Performance view, I see the obscene latency numbers. I also set up load detection on the guest OS and this alerts me. From within the guest, iostat shows the disks as 100% utilized while the IOPS and throughput are almost non-existent, which is quite odd. I also collect stats with PowerShell over a one-week period and the latency is consistent.
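    For reference, what I am looking at inside the guests (Linux, with iostat from the sysstat package) is roughly this - the tell-tale pattern being %util pinned near 100 while r/s, w/s and throughput stay tiny and await climbs:

    # iostat -x 1 5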

    Regarding vmkping and packet loss, you stated: "This is a serious point of concern: if the test packets are failing, then actual data packets are likely being lost/dropped on the network as well - ANY amount of packet loss can have an extremely deleterious impact on vSAN performance (as it would for any HCI solution that relies on synchronous replication)."

    I agree, though I did leave something out. I am running vmkping with the -d option so that packets are not fragmented. This is per Broadcom KB 344313, which recommends using the -d switch and the -s switch with an argument of 1472, as we are using 1500 as the MTU. If I use a regular vmkping I get much less packet loss, though still some. That being said, with -d -s 1472 it is 8 to 10% loss. I had the network team check the switch ports on the 10GbE interfaces and they found no drops or CRC errors.
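    Concretely, the test looks like this (vmk1 and the target address are placeholders; 1472 is 1500 minus 28 bytes of IP/ICMP header, and -d sets the don't-fragment bit):

    # vmkping -I vmk1 -d -s 1472 -c 100 <peer-vsan-vmk-IP>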

    Regarding the onset of poor performance a few months back, you asked: "Did that coincide with any network or VM load changes?"

    No, the VM environment is relatively static in the number of VMs and the compute/memory provisioned. To me this says some sort of physical media issue, but I have yet to find anything.

    Regarding looking at latency metrics at the cluster level, you asked whether we are seeing latency on the frontend, the backend, or both. The frontend latency is the issue; the backend looks normal. Pictures attached.

    Regarding the 10% packet loss being from 1 node or not, you asked: "have you narrowed it down to being to/from a single node?"

    Yes, actually, there is only 1 node that seems to report loss with vmkping -d -s. I will look into the cable and SFPs for that node.

    Thank you for your replies, these are very helpful. Please let me know if anything else comes to mind.




  • 6.  RE: Latency on 4 Node VSAN cluster

    Posted May 16, 2025 04:13 PM
    @jim33boy, thanks for following up on all the bits I asked clarity about.
     
    That sounds like a purely networking issue (Backend would not show as fine if you had disk issues). That it is only being observed when testing vmkping to/from a single node is a strong indicator that it is just some problematic component associated with that node, e.g. SFP/cable/switchport - *most* of the time I have seen this it was a cable, and much more rarely the others (plus a couple of instances of seating issues).
    If your network infrastructure team are reluctant to replace/check anything without more evidence, I usually try to narrow this down further by confirming which uplink the issue is observed on. If you have an Active/Standby vmnic configuration backing the vsan-tagged vmk, you can use esxtop with the 'n' option and it will show you the vmnic actively being used for that vmk. You can then put the node in Maintenance Mode (Ensure Accessibility option), then set that vmnic to a down state (either via the vSphere Client or an esxcli command); once it is down the vmk will be using the other vmnic (you can confirm in esxtop), and then you can retest the vmkpings and validate whether the packet loss is gone.
    If there is no packet loss then you have halved the number of components that need to be checked, and you should also be able to run workloads better in that state (if you don't mind reduced network redundancy until it is fixed).
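    For reference, a rough sketch of that sequence from the host shell, assuming vmnic1 is the currently active uplink and vmk1 is the vsan-tagged vmk (both are placeholders), and with the host already in Maintenance Mode (Ensure Accessibility):

    # esxcli network nic down -n vmnic1
    # vmkping -I vmk1 -d -s 1472 -c 100 <peer-vsan-vmk-IP>
    # esxcli network nic up -n vmnic1

    esxtop's network view ('n') can be checked before and after the nic down to confirm which vmnic the vmk is actually using.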



  • 7.  RE: Latency on 4 Node VSAN cluster

    Posted May 21, 2025 02:56 AM

    @TheBobkin, thanks for the reply. The network team once again confirmed no drops or CRC errors on the switch ports. I like your suggestion for switching NIC traffic. I am currently on vmnic1 and will plan to switch this to vmnic0. Do you suggest that I put the hosts in maintenance mode before doing this? 

    I wanted to provide more info. I find that on 3 of the 4 VSAN nodes, clomd.log is not being updated. From what I have read this is not normal - do you know whether clomd.log not being updated is expected and, if not, have any suggestions on how to correct it?

    Additionally, at the cluster level and host level, the latency spikes seem to rise across all 4 nodes at the same time, and these correlate with outstanding IO spikes. The correlation is common across frontend and backend at both the cluster and host level.

    I also see delayed IO spikes at the host disk group level. The read cache hit rate drops to 70%.

    Though VM latency is extremely high, the disk groups' SSD and capacity drive latency only shows spikes to 10ms.
    If you have any other advice based on this info, I appreciate any feedback.




  • 8.  RE: Latency on 4 Node VSAN cluster

    Broadcom Employee
    Posted May 23, 2025 02:58 AM

    You don't need to place hosts in maintenance mode to switch traffic.

    Either way, if you have 8-10% packet loss I am not surprised your environment performs poorly - it is a recipe for disaster. If you ask me, I would suggest getting support to look at the environment and figure out what the root cause is.




  • 9.  RE: Latency on 4 Node VSAN cluster

    Posted May 23, 2025 05:07 PM
     
    I would respectfully disagree - they should put the node in Maintenance Mode before testing that. The reason is that we have no idea what the state of the other uplink is until it is in use; I have seen it twice, when doing this for this very reason, that the other uplink was completely unusable (either due to misconfiguration or some other reason), and when that occurs the node becomes isolated and all VMs on it go down. So having the node in MM with the EA option is advisable just in case.
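    For reference, entering Maintenance Mode with the Ensure Accessibility option can also be done from the host shell if preferred (a sketch - the vSphere Client works just as well; the second command exits MM once testing is done):
    # esxcli system maintenanceMode set -e true -m ensureObjectAccessibility
    # esxcli system maintenanceMode set -e false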
     
    "If you ask me, I would suggest getting support to look at the environment and figure out what the root cause is."
    Yes, they should - they can ask for me if they like, I am in their geo 😎.
     
     
    "I wanted to provide more info. I find that on 3 of the 4 VSAN nodes, the clomd.log is not being updated. From what I read this is not normal, do you have any idea if the clomd.log not being updated is normal and if not any suggestions on how to correct that?"
    Are you sure it is just clomd.log and not other logs too? Typically, if clomd.log stops logging for any reason, the last lines of it are a backtrace from CLOM crashing (which generally indicates what object caused the crash) - check whether CLOM is running and start it if it is not:
    # /etc/init.d/clomd status
    # /etc/init.d/clomd start
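    A quick way to see whether it is only clomd.log that has gone stale is to check the log timestamps on the host (on most builds the live logs are under /var/log, which may be a symlink to the scratch/persistent log location depending on the setup):
    # ls -ltr /var/log/ | tail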
     
    "Additionally, at the cluster level and host level, the latency spikes seem to rise across all 4 nodes at the same times and these correlate with outstanding IO spikes. The correlation is common across frontend and backend at both cluster and host level. "
    You may be reading this back to front - the OIO increasing is due to IOs not being processed in a timely manner.
     
    "I also see delayed IOPs spikes at the host diskgroup  level. The read cache hit rate drops to 70% "
    I would advise focusing on ruling out the packet loss issue here and only assessing any further performance issues once that has been fixed.




  • 10.  RE: Latency on 4 Node VSAN cluster

    Broadcom Employee
    Posted May 26, 2025 04:54 AM

    Yes, that is a valid point.




  • 11.  RE: Latency on 4 Node VSAN cluster

    Posted Jun 09, 2025 09:08 PM

    Hello, apologies for the delay, I appreciate the feedback from all.  

    I'm planning the maintenance, and I am going to try @TheBobkin's suggestion below.

    " then set that vmnic to down state (either via vSphere client or esxcli command), then once it is down it will be using the other vmnic (can confirm in esxtop), then you can retest the vmkpings and validate that there is no packet loss."

    This seems simple enough: I will vMotion the VMs to the other 3 nodes, then take the NIC down as suggested.

    Aside from this, I was considering switching the active/standby uplinks for the VSAN network on a single host; however, the uplink teaming for a vDS portgroup is global, so modifying the VSAN vDS portgroup would apply to all hosts, and I only want to do this for the single host. So I wanted to ask the group the following: considering the VSAN portgroup currently has uplink1 as active and uplink2 as standby, can I create a different VSAN portgroup, say VSAN-DPG2, assign uplink2 as active and uplink1 as standby, and then assign this portgroup to the single host? In this scenario hosts 1-3 would keep the VSAN portgroup with uplink1 active and uplink2 standby; only the 4th (problematic) host would have uplink2 active and uplink1 standby. This way I could simply change the portgroup for testing rather than running with reduced network redundancy.

     

    Thanks again




  • 12.  RE: Latency on 4 Node VSAN cluster

    Posted Jun 11, 2025 09:37 AM

    Either way, you would need to manage the host from the 'global' vDS to assign the new portgroup - or you can simply switch the uplink/vmnic mapping on just the problematic host.
    Let's say you have vmnic1 = uplink1 (active) and vmnic2 = uplink2 (standby) as per your vSAN dPG configuration. If you change the physical adapter mapping on that one host to vmnic1 = uplink2 and vmnic2 = uplink1, you will have vmnic2 as the active NIC.

    If you want to go with the dPG way, you also need to configure the active/standby uplinks for it.
    Both methods worked in my lab.
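    If it helps for verification, the host's current view of the dvSwitch and its uplinks can be checked from the shell before and after the change (the vDS name and uplink list will differ per environment), and esxtop's network view ('n') will show which vmnic the vSAN vmk is actually using:

    # esxcli network vswitch dvs vmware list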




  • 13.  RE: Latency on 4 Node VSAN cluster

    Posted Jun 11, 2025 12:32 PM

    @Alexandru Capras - thank you for the reply. I like the 2nd method. From where are you able to modify the vmnic-to-uplink association? Is this from the GUI or from the CLI? Any help is appreciated.

    Thanks




  • 14.  RE: Latency on 4 Node VSAN cluster

    Posted Jun 12, 2025 04:08 AM

    From the GUI (vCenter): navigate to the Distributed Switch > Actions > Add and Manage Hosts... and select 'Manage host networking' - from here, make sure you only select the problematic host. On the 'Manage physical adapters' page, switch the vmnic-to-uplink association and then click Next through the remaining pages (without assigning any portgroup to the VMkernel adapters).
    Make sure you do all this while the host is in maintenance mode, just in case.
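    Once the swap is done, it might be worth re-running the vmkping test from that host before taking it out of maintenance mode, to confirm the loss is gone on the newly active uplink (placeholders as in the earlier posts):

    # vmkping -I vmk1 -d -s 1472 -c 100 <peer-vsan-vmk-IP>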