Hi everyone,
We are currently migrating our large physical database servers to a new VMware infrastructure with multiple new Rocky 9 VMs. Most of the VMs are stable, but we have seen unexplained filesystem crashes on two different hosts so far, running on different ESXi servers. These crashes have only ever appeared on the new database VMs, not on any other servers.
Current setup (the same issues also appeared with Rocky 9.3 before):
ESXi Host: VMware ESXi, 8.0.3, 24022510
OS Version: Rocky Linux 9.4 Blue Onyx
Kernel Version: 5.14.0-427.22.1.el9_4.x86_64
open-vm-tools: 12.3.5.46049 (build-22544099)
CPUs: AMD EPYC 9274F 24-Core Processor
Cores: 48 physical – 48 logical
RAM: 257 GB
We do not see any notable increase in resource usage in VMware or in our monitoring system.
These crashes appear quite frequently on one server, always after roughly 6 to 10 days of uptime. No system logs are available, as the server seems to lose access to the / partition first and, about 15 minutes later, to the /data partition. As a trial, the server has already been cloned and recreated, and also moved to a different ESXi host.
The ESXi log shows these lines:
scsi0:0: aborting cmd 0x335
[...]
scsi0:1: aborting cmd 0x3fa
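In case it helps with comparing the affected VMs, the aborts can be tallied per virtual disk with a few lines of Python. This is only a rough sketch: it assumes every relevant log line looks like the two excerpts above, and the function name `count_aborts` is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as "scsi0:0: aborting cmd 0x335" (format taken from the excerpts above).
ABORT_RE = re.compile(r"scsi(\d+):(\d+): aborting cmd (0x[0-9a-fA-F]+)")


def count_aborts(lines):
    """Return a Counter mapping 'scsiA:B' to the number of aborted commands seen."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            adapter, target, _cmd = match.groups()
            counts[f"scsi{adapter}:{target}"] += 1
    return counts


if __name__ == "__main__":
    # Sample input: only the two lines quoted in this post.
    sample = [
        "scsi0:0: aborting cmd 0x335",
        "scsi0:1: aborting cmd 0x3fa",
    ]
    for device, n in sorted(count_aborts(sample).items()):
        print(f"{device}: {n} aborted command(s)")
```

Feeding it the full log (one line per list entry) would show whether both virtual disks are affected equally or whether one starts aborting earlier than the other.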
Unfortunately, Broadcom Support has not been able to give us any answers regarding these problems.
Console output on the screen of the crashed machine:

The servers are connected to a Fibre Channel storage system through ESXi. The same storage is used by many other servers running various OSes (Rocky 9, SL6.4, and some Alma 8) without any issues.
Has anyone ever experienced something like this? These are some of our most important servers, on which a lot of jobs are run during off-hours, when nobody is available to restart the server. We are really running out of ideas right now.