VMware vCenter

Rocky9 VM - Random crashes and file system issues

SSW Group Virtualization posted Aug 22, 2024 04:46 AM

Hi everyone,

We are currently migrating our large physical database servers to a new VMware infrastructure with multiple new Rocky 9 VMs. Most of the VMs are stable, but so far we have seen unexplained filesystem crashes on 2 different machines on different ESX servers. These crashes have only ever appeared on the new database VMs, not on any other servers.

Current setup, though the same issues appeared with Rocky 9.3 before:

ESXi Host: VMware ESXi, 8.0.3, 24022510
OS Version: Rocky Linux 9.4 Blue Onyx
Kernel Version: 5.14.0-427.22.1.el9_4.x86_64
open-vm-tools: 12.3.5.46049 (build-22544099)
CPUs: AMD EPYC 9274F 24-Core Processor
Cores: 48 physical – 48 logical
RAM: 257 GB

We do not see any noticeable increase in resource usage in VMware or in our monitoring system.

These crashes occur quite frequently on one server, always after roughly 6-10 days of uptime. No system logs are available, as the server seems to lose access to the / partition and, about 15 minutes later, to the /data partition. As a test, the server has already been cloned and recreated, and also moved to a different ESX host.
The ESX log shows these lines:

scsi0:0: aborting cmd 0x335
[...]
scsi0:1: aborting cmd 0x3fa
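
To see whether the aborts cluster on one virtual disk or build up before a crash, the lines above can be tallied per SCSI target. This is a minimal sketch, assuming the log lines keep the exact `scsiX:Y: aborting cmd 0x...` form quoted here; the function name `count_aborts` and the sample input are illustrative, not part of any VMware tool.

```python
import re
from collections import Counter

# Matches lines of the form quoted above, e.g. "scsi0:0: aborting cmd 0x335".
# Assumption: the real log may prefix these with timestamps, so we search
# anywhere in the line rather than anchoring at the start.
ABORT_RE = re.compile(r"(scsi\d+:\d+): aborting cmd (0x[0-9a-fA-F]+)")


def count_aborts(lines):
    """Return a Counter mapping each SCSI target (e.g. 'scsi0:0') to its abort count."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input taken from the two log lines shown in the post.
sample_lines = [
    "scsi0:0: aborting cmd 0x335",
    "scsi0:1: aborting cmd 0x3fa",
]

for target, n in sorted(count_aborts(sample_lines).items()):
    print(f"{target}: {n} aborted command(s)")
```

For a real log, pass an open file object instead of `sample_lines`; the function only needs an iterable of strings.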

Unfortunately Broadcom Support is not able to give us any answer regarding these problems.
stdout on the screen of the crashed machine:

The servers are connected to a Fibre Channel storage system through ESXi. The same storage is used by many servers with multiple OSes (Rocky9, SL6.4, and some Alma8) without any issues on those servers.

Has anyone ever experienced something like this? These are some of our most important servers, on which many jobs are run during off-hours, when nobody is available to restart the server. We are really running out of ideas right now.

MohammadHadi Milani posted Feb 05, 2025 07:39 AM

Hi there,

It sounds like you’re experiencing quite a challenging issue with your Rocky Linux VMs. Here are a couple of suggestions that might help you identify the root cause:

Check Storage Latency:
- Have you checked the storage latency at both the datastore level and inside the VM? High latency could indicate underlying storage issues that might be causing the filesystem crashes.
Use Aria Operations for Monitoring:
- Consider using VMware Aria Operations (formerly vRealize Operations) to monitor your storage and compute resources. This tool can provide in-depth insights and analytics, helping you detect and resolve issues more effectively.
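
For the in-VM side of the latency check, one rough option is to sample `/proc/diskstats` twice and divide the time spent on I/O by the number of completed I/Os, similar in spirit to the await figure reported by `iostat`. The sketch below assumes the standard Linux `/proc/diskstats` field layout; the function names and the 5-second default interval are illustrative choices, and it does not measure datastore-level latency on the ESXi side.

```python
import time


def parse_diskstats(text):
    """Map device name -> (I/Os completed, ms spent on I/O) from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 11:
            continue  # skip short or malformed lines
        ios = int(fields[3]) + int(fields[7])    # reads completed + writes completed
        ms = int(fields[6]) + int(fields[10])    # ms spent reading + ms spent writing
        stats[fields[2]] = (ios, ms)
    return stats


def average_latency_ms(before, after):
    """Average ms per completed I/O between two snapshots, for devices that did I/O."""
    result = {}
    for dev, (ios1, ms1) in after.items():
        ios0, ms0 = before.get(dev, (ios1, ms1))
        delta_ios = ios1 - ios0
        if delta_ios > 0:
            result[dev] = (ms1 - ms0) / delta_ios
    return result


def sample(interval_s=5.0, path="/proc/diskstats"):
    """Take two snapshots interval_s seconds apart and return per-device latency in ms."""
    with open(path) as fh:
        before = parse_diskstats(fh.read())
    time.sleep(interval_s)
    with open(path) as fh:
        after = parse_diskstats(fh.read())
    return average_latency_ms(before, after)
```

Calling `sample()` inside the guest returns a dictionary such as device name to average milliseconds per I/O; logging it periodically to a location outside the affected VM would preserve the values from the minutes before a crash.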

These steps could help you gain more visibility into the problem and potentially resolve the crashes.

Best of luck, and feel free to reach out if you need further assistance!