Hello,
I'll try to explain the issue in English as well as I can.
I migrated an old 5.5 cluster under vCenter Essentials Plus to a new one under vCenter 6.7 Essentials Plus, with 3 HPE ProLiant ESXi hosts and 2 Synology NAS units as shared NFS storage.
Before, it was a VMware cluster under vCenter Essentials Plus with 3 HPE ProLiant Gen8 hosts running ESXi 5.5 U2, each with 6 HPE gigabit NICs, connected to 2 Synology NAS, both exported over NFS v4.1 without the Synology NFS plugin installed.
With that old setup I never had any NFS trouble, but it was not very powerful; maybe the CPU power of the ESXi hosts and the NAS was simply not enough.
The new setup is a VMware cluster under vCenter Essentials Plus with 3 HPE ProLiant Gen10 hosts running ESXi 6.7.0 U3 (build 15160138), each with 6 HPE gigabit NICs, connected to 2 Synology NAS (the existing old one plus a new one of the same model), both exported over NFSv3, with the Synology NFS plugin installed on all ESXi hosts, which is supposed to improve storage performance.
The migration of all the VMs was terrible, because I experienced severe issues: I lost all the shared NFS datastores on both Synology NAS whenever I tried to copy, clone or migrate data.
If the NFS bandwidth usage gets too high, ESXi loses the NFS datastores, and the only way to get them back is to force a hard reboot of the entire HPE ProLiant server.
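When it happens, the datastores just show up as inaccessible on the host; from the ESXi shell you can see it with the standard commands (nothing specific to my setup here):

# List the NFS mounts and whether the host still sees them as accessible
esxcli storage nfs list
# The NFS disconnect / APD events also appear in the vmkernel log
grep -i NFS /var/log/vmkernel.log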
I contacted Synology and HPE for technical support, but they replied that they had never seen this issue before.
HPE told me to update all the firmware; I did, but it didn't get any better.
Synology support took a look at the debug file and confirmed an error:
2020-03-17T10:02:43+01:00 syno_vm mountd[10955]: refused mount request from 192.168.141.33 for /volume1/datastore (/volume1/datastore): illegal port 16237
They gave me a tip: change the permissions on the NFS shared folders to allow NFS connections from ports higher than 1024. I did it and there were no more NFS illegal port errors in the log,
but I had another crash just after that.
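If I understand that Synology setting correctly, it just adds the insecure option to the NFS export on the NAS, which allows client source ports above 1024; over SSH on the NAS it can be checked with (the path in my case is /volume1/datastore, yours will differ):

# The export line for the datastore should now include the "insecure" flag
cat /etc/exports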
HPE had a deeper look at the VMware logs and found something.
Here is a part of an ESXi log:
2020-03-17T16:49:47.365Z cpu5:2097603)StorageApdHandlerEv: 110: Device or filesystem with identifier [cbac65d1-3f9fe9c5] has entered the All Paths Down state.
2020-03-17T16:51:11.365Z cpu0:2098644)WARNING: NFS: 337: Lost connection to the server 192.168.141.12 mount point /volume1/datastore, mounted as cbac65d1-3f9fe9c5-0000-000000000000 ("Syno_backup")
2020-03-17T16:51:11.365Z cpu0:2098644)WARNING: NFS: 337: Lost connection to the server 192.168.141.6 mount point /volume1/datastore, mounted as 25a6dc8f-63bfb2da-0000-000000000000 ("Syno")
2020-03-17T16:51:11.366Z: [vmfsCorrelator] 841914822913us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 192.168.141.12 mount point /volume1/datastore, mounted as cbac65d1-3f9fe9c5-0000-000000000000 ("Syno_backup")
2020-03-17T16:51:11.366Z: [vmfsCorrelator] 841916240720us: [esx.problem.vmfs.nfs.server.disconnect] 192.168.141.12 /volume1/datastore cbac65d1-3f9fe9c5-0000-000000000000 Syno_backup
2020-03-17T16:51:11.366Z: [vmfsCorrelator] 841914823025us: [vob.vmfs.nfs.server.disconnect] Lost connection to the server 192.168.141.6 mount point /volume1/datastore, mounted as 25a6dc8f-63bfb2da-0000-000000000000 ("Syno")
They gave me a VMware KB: https://kb.vmware.com/s/article/2016122
I applied the command from the KB on all ESXi hosts: esxcfg-advcfg -s 64 /NFS/MaxQueueDepth
But it doesn't fix the issue; it's just a workaround that did the job for a while. As soon as the system needs a lot of storage bandwidth, like during a backup or restore, it crashes again.
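In case it helps someone compare, this is how I check the value on each host (the esxcli form should be equivalent to the esxcfg-advcfg one, as far as I know):

# Read the current NFS queue depth
esxcfg-advcfg -g /NFS/MaxQueueDepth
# Same setting through esxcli
esxcli system settings advanced list -o /NFS/MaxQueueDepth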
There is no firewall between the servers and the storage, and no active firewall on the Windows vCenter server.
Each ESXi host has 3 dedicated NICs for the storage vmkernel, and each NAS has 3 dedicated NICs for NFS traffic, all devices at the default MTU of 1500.
If I set MTU 9000 on all devices it gets worse, because I can't open the NFS datastores from ESXi at all any more.
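If anyone wants to check the jumbo frame path, the usual test from the ESXi shell would be something like this (vmk1 is only an example, use the storage vmkernel port of your host):

# 8972 bytes of payload + headers = 9000; -d forbids fragmentation
vmkping -I vmk1 -d -s 8972 192.168.141.12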
I use the same settings for another VMware installation with 3 HPE ProLiant Gen10 hosts running the HPE custom image of ESXi 6.5 U2 and 2 Synology NAS as NFSv3 shared storage, but without the Synology NFS plugin. It's a great, powerful cluster without any trouble.
I also used the same settings on old VMware 5.0 and 4.0 installations, because I had too much trouble with poor iSCSI connections to Synology NAS.
Synology developers have checked the VMware KB articles and, in their expert judgement, this issue is more on the ESXi side than on the NAS side, based on the points below:
1. The workaround, lowering the MaxQueueDepth of the NFS connection, reduces the rate at which the ESXi client sends I/O requests, which suggests the variable factor is on the ESXi side.
2. In the NAS kernel log they didn't see any hung-task call traces, only the illegal port error messages, which again points to the ESXi side.
3. The NFS plugin is designed to improve the copy/clone speed.
For them, the differences are not only the NFS plugin but also the ESXi version, so they can't firmly say the issue comes from the NFS plugin; but based on the points above, they think the problem more likely comes from VMware ESXi.
I have asked them how to uninstall the plugin if needed and am waiting for their reply.
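In the meantime, I suppose the plugin is just a VIB on each host and could be removed like any other one; something like this, but the exact VIB name still has to be confirmed by Synology (synology-nfs-plugin below is only a placeholder):

# Find the exact name of the Synology NFS plugin package
esxcli software vib list | grep -i syno
# Remove it by that name (placeholder here), then reboot the host
esxcli software vib remove -n synology-nfs-plugin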
Thanks for your help if anybody has already had the same bad experience.