
Latency on 4 Node VSAN cluster

  • 1.  Latency on 4 Node VSAN cluster

    Posted May 14, 2025 01:14 AM

    Hello,

    We have a severe latency problem on a 4 node Hybrid VSAN cluster. 

    The cluster has 4 nodes, each node with 3 disk groups. Each disk group has 3x 2.4TB 10k RPM HDD and 1x 400GB SSD.

    Skyline checks are all green except for the vSAN network alarm for the MTU ping check, which pops up occasionally. We are using the default 1500 MTU on the VMware side and I have asked that the network ports be confirmed to be using 1500 as well. I believe that they are, but I am awaiting confirmation. I have read that the switch being at 9000 while the VMware environment is at 1500 is OK, however I do not know whether that is true.
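    For reference, the MTU actually in effect on the host side can be confirmed from the ESXi shell. A rough sketch (the last command assumes a distributed switch - for a standard vSwitch it would be 'esxcli network vswitch standard list'):

    # esxcli vsan network list
    # esxcli network ip interface list
    # esxcli network vswitch dvs vmware list

    The first command shows which vmkernel interface is vSAN-tagged, the second shows the MTU configured on each vmk, and the third shows the MTU on the distributed switch itself.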

    The utilized VSAN storage is under 60%, CPU utilization is under 20%, and RAM utilization is under 60%. There are no resync or rebalance operations occurring. The average IOPS for the cluster is under 1000 and the average throughput is under 20 MiB/s.

    A few months back we started suffering from extremely poor performance. The VSAN performance tab for the VMs shows an average read latency of ~ 200ms and an average write latency of ~ 300ms. The latency maximums are hitting over 5000ms. I have collected data with Get-VsanStat to confirm. The VSAN performance metrics at the host level also show very high latency on average. The disk group read cache hit rate hovers between 60% and the high 90s. There are no packet drops on the Host network tab for the host.

    I have so far been unable to find the cause of the latency. The IO is so low that it does not make sense. The only non-green Skyline check is the MTU check. I will say that for 1 of the hosts, when I do a vmkping on the VSAN storage interface, I can lose 10% of the packets at times. Does anyone have any advice as to where I can look to locate the cause of the latency based on what I have provided above? Any advice on where to look in general?
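    For context, the packet-loss test I am running is roughly the following, run from the ESXi shell of one host against the vSAN vmkernel IP of another (vmk1 and the target address are placeholders for my environment):

    # vmkping -I vmk1 -c 100 <peer-vsan-vmk-IP>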

    Thanks



  • 2.  RE: Latency on 4 Node VSAN cluster

    Posted May 14, 2025 08:37 AM

    Hello.

    This is from a while back, but I had this same type of issue in the early days when I had a similar setup (400GB SSD cache disks). After a lot of back and forth between Dell and VMware at the time, we found a firmware upgrade for the SSDs themselves. Immediately after doing that firmware update on the SSDs, our latency was cut by two orders of magnitude.

    Might be worth checking your firmware level on the SSDs. If it just sort of started without any other explanation, maybe you got a firmware update that is bad for VSAN.
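    If it is useful, the firmware revision currently running on each drive can be read from the ESXi shell (the 'Revision' field in the output is the drive firmware version):

    # esxcli storage core device list | grep -iE 'Display Name|Revision'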




  • 3.  RE: Latency on 4 Node VSAN cluster

    Posted May 14, 2025 05:55 PM

    @terrible_towel

    Out of general curiosity, do you by any chance recall the disk model and firmware version with that fix?

    Over the years I have seen multiple advisories/KBs/docs for SSD/NVMe firmware updates that fix a given issue and improve performance (particularly from HPE, who seem to be really good and proactive about documenting such things publicly), but I have never seen a proper before-and-after comparison to see such things with my own eyes.

    Most of the time, if SSDs start having high latency spikes it is because they have run out of spare wear-leveling cells (e.g. SMART shows only a few percent remaining on the 'Media Wearout Indicator'); this is of course more common with smaller SSDs and with those not intended/specced for write-intensive usage. That being said, it is possible firmware issues might influence that in some way.
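    For reference, that SMART/wear data can be pulled per device directly on the host (the device identifier is a placeholder, and the exact attribute names vary by drive vendor):

    # esxcli storage core device smart get -d <device_id>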




  • 4.  RE: Latency on 4 Node VSAN cluster

    Posted May 14, 2025 05:52 PM

    @jim33boy

    "We have a severe latency problem on a 4 node Hybrid VSAN cluster."

    Where/how are you measuring/reporting this? VM/application users reporting slow or non-responsiveness or from vSAN Performance graphs?

    "Skyline checks are all green excluding the vSAN network alarm for MTU ping check which pops up occasionally."

    This is a serious point of concern: if the test packets are failing, then actual data packets are likely being lost/dropped on the network as well - ANY amount of packet loss can have an extremely deleterious impact on vSAN performance (as it would for any HCI solution that relies on synchronous replication).

    "We are using the default 1500 MTU on the VMware side and I have asked that the network ports be confirmed to be using 1500 as well. I believe that they are but I am awaiting confirmation. I have read that the switch being at 9000 and the VMware environment being at 1500 is OK, however I do not know if that is true or not."

    The vsan-traffic-tagged vmk on each node is the endpoint for this traffic between the nodes - if these are all set consistently (e.g. 1500 OR 9000 MTU on all vsan-tagged vmks) then having the vSwitch/vmnic/switchport MTU higher is not an issue at all.

    "A few months back we started suffering from extremely poor performance."

    Did that coincide with any network or VM load changes?

    "The VSAN performance tab for the VMs shows an average read latency of ~ 200ms and an average write latency of ~ 300ms. The latency maximums are hitting over 5000ms."

    One of the simplest ways (99% of the time, anyway) to determine whether a vSAN performance issue is due to network or storage (or, the other 1% of the time, vSAN modules/system) is to compare, at the cluster entity level, 'VM' (Frontend) latency with 'Backend' (Storage) latency. If you have high latency on the former but low latency on the latter, the latency is being incurred on the network; if you have it on both, the latency is being incurred due to storage/LSOM/disk latencies. Please let us know what you see for both of these and share screenshots if possible.

    If you DO see 'Backend' latency in the cluster-level metrics, the next step is to check 'Backend' latency on every node in the cluster. Oftentimes there is a single Disk-Group or disk causing this, so going through each node to narrow it down (and then, once the outlier is found, each Disk-Group and then each disk) is advised.

    "when I do a vmkping on the VSAN storage interface, I can lose 10% of the packets at times."

    That is a real problem - have you narrowed it down to being to/from a single node? If so, you then need to narrow it down to which vmnic, and then check it properly from the physical network perspective, e.g. is it an issue with the NIC, cable or switchport?




  • 5.  RE: Latency on 4 Node VSAN cluster

    Posted May 14, 2025 08:43 PM

    @terrible_towel - thank you for that; this is a Dell 4-node VSAN cluster, so this is encouraging.

    @TheBobkin -  Thank you for the reply.

    Regarding how we became aware of the issue, you asked: "Where/how are you measuring/reporting this? VM/application users reporting slow or non-responsiveness or from vSAN Performance graphs?"

    A combination of both. Our production grinds to a halt, so this is something that is reported by the end users. When the issue occurs, I log in to the VCSA and, for each VM under the Monitor > vSAN > Performance view, I see the obscene latency numbers. I also set up load detection on the guest OS and this alerts me. From within the guest, iostat shows the disks as 100% utilized while the IOPS and throughput are almost non-existent, which is quite odd. I also collect stats with PowerShell over a one-week period and the latency is consistent.
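    For reference, what I am looking at inside the guests (Linux, with iostat from the sysstat package) is roughly this - the tell-tale pattern being %util pinned near 100 while r/s, w/s and throughput stay tiny and await climbs:

    # iostat -x 1 5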

    Regarding vmkping and packet loss, you stated: "This is a serious point of concern: if the test packets are failing, then actual data packets are likely being lost/dropped on the network as well - ANY amount of packet loss can have an extremely deleterious impact on vSAN performance (as it would for any HCI solution that relies on synchronous replication)."

    I agree, though I did leave something out. I am running vmkping with the -d option so that packets are not fragmented. This is per Broadcom KB 344313, which recommends using the -d switch and the -s switch with an argument of 1472, as we are using 1500 as the MTU. If I use a regular vmkping I get much less packet loss, though still some. That being said, with -d -s 1472 it is 8 to 10% loss. I had the network team check the switch ports on the 10GbE interfaces and they found no drops or CRC errors.
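    Concretely, the test looks like this (vmk1 and the target address are placeholders; 1472 is 1500 minus 28 bytes of IP/ICMP header, and -d sets the don't-fragment bit):

    # vmkping -I vmk1 -d -s 1472 -c 100 <peer-vsan-vmk-IP>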

    Regarding the onset of poor performance a few months back, you asked: "Did that coincide with any network or VM load changes?"

    No, the VM environment is relatively static in the number of VMs and the compute/memory provisioned. To me this says some sort of physical media issue, but I have yet to find anything.

    Regarding looking at latency metrics at the cluster level, you asked whether we are seeing latency on the frontend, the backend, or both. The frontend latency is the issue; the backend looks normal. Pictures attached.

    Regarding the 10% packet loss being from 1 node or not, you asked: "have you narrowed it down to being to/from a single node?"

    Yes, actually, there is only 1 node that seems to report loss with vmkping -d -s. I will look into the cable and SFPs for that node.

    Thank you for your replies, these are very helpful. Please let me know if anything else comes to mind.




  • 6.  RE: Latency on 4 Node VSAN cluster

    Posted May 16, 2025 04:13 PM
    @jim33boy, thanks for following up on all the bits I asked clarity about.
     
    That sounds like a purely networking issue (Backend would not show as fine if you had disk issues). That it is only being observed when testing vmkping to/from a single node is a strong indicator that it is just some problematic component associated with that node, e.g. SFP/cable/switchport - *most* of the time I have seen this it was a cable, and much more rarely the others (plus a couple of instances of seating issues).
    If your network infrastructure team are reluctant to replace/check anything without more evidence, I usually try to narrow this down further by confirming which uplink the issue is observed on. If you have an Active/Standby vmnic configuration backing the vsan-tagged vmk, you can use esxtop with the 'n' option and it will show you the vmnic actively being used for that vmk. You can then put the node in Maintenance Mode (Ensure Accessibility option), then set that vmnic to a down state (either via the vSphere Client or an esxcli command); once it is down the vmk will be using the other vmnic (you can confirm in esxtop), and then you can retest the vmkpings and validate whether the packet loss is gone.
    If there is no packet loss then you have halved the number of components that need to be checked, and you should also be able to run workloads better in that state (if you don't mind reduced network redundancy until it is fixed).
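    For reference, a rough sketch of that sequence from the host shell, assuming vmnic1 is the currently active uplink and vmk1 is the vsan-tagged vmk (both are placeholders), and with the host already in Maintenance Mode (Ensure Accessibility):

    # esxcli network nic down -n vmnic1
    # vmkping -I vmk1 -d -s 1472 -c 100 <peer-vsan-vmk-IP>
    # esxcli network nic up -n vmnic1

    esxtop's network view ('n') can be checked before and after the nic down to confirm which vmnic the vmk is actually using.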



  • 7.  RE: Latency on 4 Node VSAN cluster

    Posted May 21, 2025 02:56 AM

    @TheBobkin, thanks for the reply. The network team once again confirmed no drops or CRC errors on the switch ports. I like your suggestion for switching NIC traffic. I am currently on vmnic1 and will plan to switch this to vmnic0. Do you suggest that I put the hosts in maintenance mode before doing this? 

    I wanted to provide more info. I find that on 3 of the 4 VSAN nodes, clomd.log is not being updated. From what I have read this is not normal - do you know whether clomd.log not being updated is expected and, if not, have any suggestions on how to correct it?

    Additionally, at the cluster level and host level, the latency spikes seem to rise across all 4 nodes at the same time, and these correlate with outstanding IO spikes. The correlation is common across frontend and backend at both the cluster and host level.

    I also see delayed IO spikes at the host disk group level. The read cache hit rate drops to 70%.

    Though VM latency is extremely high, the disk groups' SSD and capacity drive latency only shows spikes to 10ms.
    If you have any other advice based on this info, I appreciate any feedback.




  • 8.  RE: Latency on 4 Node VSAN cluster

    Broadcom Employee
    Posted May 23, 2025 02:58 AM

    You don't need to place hosts in maintenance mode to switch traffic.

    Either way, if you have 8-10% packet loss I am not surprised your environment performs poorly - it is a recipe for disaster. If you ask me, I would suggest getting support to look at the environment and figure out what the root cause is.




  • 9.  RE: Latency on 4 Node VSAN cluster

    Posted May 23, 2025 05:07 PM
     
    I would respectfully disagree - they should put the node in Maintenance Mode before testing that. The reason is that we have no idea what the state of the other uplink is until it is in use; I have seen it twice, when doing this for this very reason, that the other uplink was completely unusable (either due to misconfiguration or some other reason), and when that occurs the node becomes isolated and all VMs on it go down. So having the node in MM with the EA option is advisable just in case.
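    For reference, entering Maintenance Mode with the Ensure Accessibility option can also be done from the host shell if preferred (a sketch - the vSphere Client works just as well; the second command exits MM once testing is done):
    # esxcli system maintenanceMode set -e true -m ensureObjectAccessibility
    # esxcli system maintenanceMode set -e false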
     
    "If you ask me, I would suggest getting support to look at the environment and figure out what the root cause is."
    Yes, they should - they can ask for me if they like, I am in their geo 😎.
     
     
    "I wanted to provide more info. I find that on 3 of the 4 VSAN nodes, the clomd.log is not being updated. From what I read this is not normal, do you have any idea if the clomd.log not being updated is normal and if not any suggestions on how to correct that?"
    Are you sure it is just clomd.log and not other logs too? Typically, if clomd.log stops logging for any reason, the last lines of it are a backtrace from CLOM crashing (which generally indicates what object caused the crash) - check whether CLOM is running and start it if it is not:
    # /etc/init.d/clomd status
    # /etc/init.d/clomd start
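    A quick way to see whether it is only clomd.log that has gone stale is to check the log timestamps on the host (on most builds the live logs are under /var/log, which may be a symlink to the scratch/persistent log location depending on the setup):
    # ls -ltr /var/log/ | tail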
     
    "Additionally, at the cluster level and host level, the latency spikes seem to rise across all 4 nodes at the same times and these correlate with outstanding IO spikes. The correlation is common across frontend and backend at both cluster and host level. "
    You may be reading this back to front - the OIO increasing is due to IOs not being processed in a timely manner.
     
    "I also see delayed IOPs spikes at the host diskgroup  level. The read cache hit rate drops to 70% "
    I would advise focusing on ruling out the packet loss issue here and only assessing any further performance issues once that has been fixed.




  • 10.  RE: Latency on 4 Node VSAN cluster

    Broadcom Employee
    Posted May 26, 2025 04:54 AM

    Yes, that is a valid point.




  • 11.  RE: Latency on 4 Node VSAN cluster

    Posted Jun 09, 2025 09:08 PM

    Hello, apologies for the delay, I appreciate the feedback from all.  

    I'm planning the maintenance, and I am going to try @TheBobkin's suggestion below.

    " then set that vmnic to down state (either via vSphere client or esxcli command), then once it is down it will be using the other vmnic (can confirm in esxtop), then you can retest the vmkpings and validate that there is no packet loss."

    This seems simple enough: I will vMotion the VMs to the other 3 nodes, then take the NIC down as suggested.

    Aside from this, I was considering switching the active/standby uplinks for the VSAN network on a single host; however, the uplink teaming for a vDS portgroup is global, so modifying the VSAN vDS portgroup would apply to all hosts, and I only want to do this for the single host. So I wanted to ask the group the following: considering the VSAN portgroup currently has uplink1 as active and uplink2 as standby, can I create a different VSAN portgroup, say VSAN-DPG2, assign uplink2 as active and uplink1 as standby, and then assign this portgroup to the single host? In this scenario hosts 1-3 would keep the VSAN portgroup with uplink1 active and uplink2 standby; only the 4th (problematic) host would have uplink2 active and uplink1 standby. This way I could simply change the portgroup for testing rather than running with reduced network redundancy.

     

    Thanks again




  • 12.  RE: Latency on 4 Node VSAN cluster

    Posted Jun 11, 2025 09:37 AM

    Either way, you would need to manage the host from the 'global' vDS to assign the new portgroup - or you can simply switch the uplink/vmnic mapping on just the problematic host.
    Let's say you have vmnic1 = uplink1 (active) and vmnic2 = uplink2 (standby) as per your vSAN dPG configuration. If you change the physical adapter mapping on that one host to vmnic1 = uplink2 and vmnic2 = uplink1, you will have vmnic2 as the active NIC.

    If you want to go with the dPG way, you also need to configure the active/standby uplinks for it.
    Both methods worked in my lab.
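    If it helps for verification, the host's current view of the dvSwitch and its uplinks can be checked from the shell before and after the change (the vDS name and uplink list will differ per environment), and esxtop's network view ('n') will show which vmnic the vSAN vmk is actually using:

    # esxcli network vswitch dvs vmware list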




  • 13.  RE: Latency on 4 Node VSAN cluster

    Posted Jun 11, 2025 12:32 PM

    @Alexandru Capras - thank you for the reply. I like the 2nd method. From where are you able to modify the vmnic-to-uplink association? Is this from the GUI or from the CLI? Any help is appreciated.

    Thanks




  • 14.  RE: Latency on 4 Node VSAN cluster

    Posted Jun 12, 2025 04:08 AM

    From the GUI (vCenter): navigate to the Distributed Switch > Actions > Add and Manage Hosts... and select 'Manage host networking' - from here, make sure you only select the problematic host. On the 'Manage physical adapters' page, switch the vmnic-to-uplink association and then click Next through the remaining pages (without assigning any portgroup to the VMkernel adapters).
    Make sure you do all this while the host is in maintenance mode, just in case.
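    Once the swap is done, it might be worth re-running the vmkping test from that host before taking it out of maintenance mode, to confirm the loss is gone on the newly active uplink (placeholders as in the earlier posts):

    # vmkping -I vmk1 -d -s 1472 -c 100 <peer-vsan-vmk-IP>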