Hello everyone,
we are currently experiencing some very strange network problems. For about a week and a half, some of our employees have been complaining about lost Remote Desktop connections to their PCs. Because of the current situation, most of our employees work from home over an OpenVPN server, which runs as a VM on one of our ESXi servers. After a bit of research (a lot of pings to Google's DNS from different machines in our network), we determined that the problem must come from our ESXi server, as we saw the same strange behavior on all VMs running on that server. Here is an excerpt from one of those pings:
[Di 1. Feb 15:04:24 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3596 ttl=117 time=4.50 ms
[Di 1. Feb 15:04:25 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3597 ttl=117 time=4.68 ms
[Di 1. Feb 15:04:26 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3598 ttl=117 time=4.81 ms
[Di 1. Feb 15:04:27 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3599 ttl=117 time=4.74 ms
[Di 1. Feb 15:04:28 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3600 ttl=117 time=4.56 ms
[Di 1. Feb 15:04:29 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3601 ttl=117 time=4.68 ms
[Di 1. Feb 15:04:30 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3602 ttl=117 time=4.54 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3603 ttl=117 time=4.51 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3604 ttl=117 time=4.53 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3605 ttl=117 time=4.69 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3606 ttl=117 time=4.51 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3607 ttl=117 time=4.54 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3608 ttl=117 time=4.52 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3609 ttl=117 time=4.50 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3610 ttl=117 time=4.59 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3611 ttl=117 time=4.72 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3612 ttl=117 time=4.53 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3613 ttl=117 time=4.66 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3614 ttl=117 time=4.55 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3615 ttl=117 time=4.62 ms
[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3616 ttl=117 time=4.71 ms
[Di 1. Feb 15:04:45 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3617 ttl=117 time=5.12 ms
[Di 1. Feb 15:04:46 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3618 ttl=117 time=4.50 ms
[Di 1. Feb 15:04:47 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3619 ttl=117 time=4.55 ms
[Di 1. Feb 15:04:48 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3620 ttl=117 time=4.65 ms
Normally we should get one reply per second, but then the replies suddenly stop for several seconds (in this excerpt, from 15:04:30 to 15:04:44) before all showing up at once. This would explain why our employees are losing their connections on the VPN side.
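In case anyone wants to check their own ping logs for the same pattern, here is a minimal Python sketch that scans timestamped ping output for such gaps. It assumes the line format shown above (a bracketed timestamp containing HH:MM:SS, followed by the usual `icmp_seq=` field) and that the log does not cross midnight. The function name `find_stalls` and the 3-second threshold are my own choices for illustration, nothing official.

```python
import re

# Matches the HH:MM:SS inside the bracketed timestamp and the icmp_seq value.
LINE_RE = re.compile(r"\[.*?(\d{2}):(\d{2}):(\d{2}).*?\]: .*icmp_seq=(\d+)")


def parse(line):
    """Return (seconds_since_midnight, icmp_seq), or None for non-matching lines."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    hours, minutes, seconds, seq = (int(g) for g in m.groups())
    return hours * 3600 + minutes * 60 + seconds, seq


def find_stalls(lines, threshold=3):
    """Report every jump of at least `threshold` seconds between consecutive replies.

    Assumption: all lines belong to the same day (no midnight rollover).
    """
    stalls = []
    prev = None
    for line in lines:
        cur = parse(line)
        if cur is None:
            continue  # skip headers, blank lines, error messages
        if prev is not None and cur[0] - prev[0] >= threshold:
            stalls.append({
                "gap_seconds": cur[0] - prev[0],
                "last_seq_before": prev[1],
                "first_seq_after": cur[1],
            })
        prev = cur
    return stalls


if __name__ == "__main__":
    sample = [
        "[Di 1. Feb 15:04:29 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3601 ttl=117 time=4.68 ms",
        "[Di 1. Feb 15:04:30 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3602 ttl=117 time=4.54 ms",
        "[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3603 ttl=117 time=4.51 ms",
        "[Di 1. Feb 15:04:44 CET 2022]: 64 bytes from 8.8.8.8: icmp_seq=3604 ttl=117 time=4.53 ms",
    ]
    for stall in find_stalls(sample):
        print(stall)
```

Run against the excerpt above, this reports a single 14-second gap between icmp_seq 3602 and 3603.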
I then went on to check the logs/events on the ESXi host and found these messages:
Successfully restored access to volume 5776ab59-b2c08ac9-32d4-0cc47aabf506 (datastore2) following connectivity issues. Monday, 07 February 2022, 18:46:55 +0100 Warning None
Successfully restored access to volume 5776a676-6cd2e353-9860-0cc47aabf506 (datastore1) following connectivity issues. Monday, 07 February 2022, 18:46:55 +0100 Warning None
Lost access to volume 5776ab59-b2c08ac9-32d4-0cc47aabf506 (datastore2) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. Monday, 07 February 2022, 18:46:41 +0100 Warning None
Lost access to volume 5776a676-6cd2e353-9860-0cc47aabf506 (datastore1) due to connectivity issues. Recovery attempt is in progress and outcome will be reported shortly. Monday, 07 February 2022, 18:46:41 +0100 Warning None
In short, the host lost the connection to both datastores and regained it 14 seconds later. Both the timing of these messages and the length of the outage match the network problems.
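To compare the datastore outages with the ping gaps more systematically, the lost/restored events can be paired up per datastore. The following is only a rough Python sketch: it assumes the events have already been copied by hand into (timestamp, datastore, kind) tuples with ISO 8601 timestamps, and the function name `outage_durations` and the "lost"/"restored" labels are made up for this example, not taken from any ESXi tool.

```python
from datetime import datetime


def outage_durations(events):
    """Pair each 'lost' event with the next 'restored' event for the same datastore.

    `events` is an iterable of (iso_timestamp, datastore, kind) tuples, where
    kind is "lost" or "restored" (hypothetical labels, entered by hand).
    Returns a list of (datastore, outage_seconds).
    """
    lost_at = {}
    durations = []
    ordered = sorted(events, key=lambda e: datetime.fromisoformat(e[0]))
    for timestamp, datastore, kind in ordered:
        when = datetime.fromisoformat(timestamp)
        if kind == "lost":
            lost_at[datastore] = when
        elif kind == "restored" and datastore in lost_at:
            start = lost_at.pop(datastore)
            durations.append((datastore, (when - start).total_seconds()))
    return durations


if __name__ == "__main__":
    # The four events quoted above, transcribed manually.
    events = [
        ("2022-02-07T18:46:55+01:00", "datastore2", "restored"),
        ("2022-02-07T18:46:55+01:00", "datastore1", "restored"),
        ("2022-02-07T18:46:41+01:00", "datastore2", "lost"),
        ("2022-02-07T18:46:41+01:00", "datastore1", "lost"),
    ]
    for datastore, seconds in outage_durations(events):
        print(f"{datastore}: {seconds:.0f} s")
```

For the four events above this gives 14 seconds for each datastore, the same length as the gap in the ping excerpt.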
We suspected a hardware problem at first, so we switched to our backup ESXi server, to which we replicate nightly. Every VM we started on the backup server showed the same problem. So a hardware problem now seems highly unlikely, since both systems would have had to develop the same fault all of a sudden.
Later on we moved the VPN VM (and only that VM) back to the old ESXi host, and the problems (at least for the OpenVPN VM) were gone. No further connection problems have been reported.
Have any of you ever experienced similar problems? Does anyone have an idea what is going on? I feel like I am going nuts trying to find a credible explanation and a solution... If I can provide any more information, just let me know.
Thanks in advance and kind regards,
Jan