
A Tale of Ramdisk Space and vMotion Challenges

By aksprasad posted 10 days ago

  
Day-2 tasks on ESXi hosts are usually routine, but sometimes they turn into complex troubleshooting scenarios :)
Recently, I came across an issue that left me scrambling for solutions. The event on the ESXi host that stood out:
2025-02-19T07:00:49.428Z: [VisorfsCorrelator] 2184901822us: [vob.visorfs.ramdisk.inodetable.full] Cannot create file /var/log/.vmsyslogd.err for process vmsyslogd because the inode table of its ramdisk (var) is full.
2025-02-19T07:00:49.429Z: [VisorfsCorrelator] 2184884819us: [esx.problem.visorfs.ramdisk.inodetable.full] The file table of the ramdisk 'var' is full.  As a result, the file /var/log/.vmsyslogd.err could not be created by the application 'vmsyslogd'.
 
No surprise! Indeed the inodes were maxed out!!
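If you still have a PowerCLI session to the vCenter managing the host, the ramdisk inode usage can be pulled remotely. A minimal sketch (the host name is illustrative); check the inode columns for the 'var' ramdisk:
# Assumes an existing Connect-VIServer session to the managing vCenter
$esxcli = Get-EsxCli -VMHost "esxi01.example.com" -V2
# List every ramdisk with its size and inode statistics ('var' is the one to watch here)
$esxcli.system.visorfs.ramdisk.list.Invoke() | Format-Table -AutoSize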
 
 
This was the first clue. Since it was a ramdisk-related issue, I thought the fix would be straightforward: place the ESXi host in Maintenance Mode and reboot it. But things were not as simple as expected. We were greeted with another surprise: we couldn't vMotion any of the VMs off the affected host. The migrations were failing with:
“A general system error occurred: Failed to create journal for vMotion-SRC: Failed to open '/var/lib/vmware/hostd/journal/xxxxxx' for write: There is no space left on the device.”
That's not all! We couldn't even access the system through TSM-SSH or TSM. Attempting to start either of these services caused them to stop immediately. At this point, a forceful reboot of the ESXi host seemed like the only option, but that wasn't acceptable.
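For context, attempting to start those services from vCenter looks roughly like this in PowerCLI (a sketch using the standard Get-VMHostService/Start-VMHostService cmdlets; the host name is illustrative):
$vmhost = Get-VMHost "esxi01.example.com"
# Attempt to start the ESXi Shell (TSM) and SSH (TSM-SSH) services
Get-VMHostService -VMHost $vmhost |
    Where-Object { $_.Key -in "TSM", "TSM-SSH" } |
    Start-VMHostService
In our case, the services flipped right back to stopped moments after starting.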
 
After some research and collaboration with VMware Global Support (GS), we found a lead that pointed to the root cause. According to the article:
  1. When audit logging to local storage is enabled, the audit record storage directory containing the audit files is created, by default at /scratch/auditLog.
  2. The customer then reconfigures or changes the scratch location (after audit logging has been enabled) and reboots the ESXi host.
  3. After the reboot, when the ESXi host comes online, the syslog daemon starts and looks for the audit directory.
  4. Since the scratch partition now points to a different location, vmsyslogd is unable to find the audit directory and initialize audit record storage, causing it to throw an exception and crash.
Engineering is aware of the issue and is working on a fix, which is expected to be included in future releases.
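If you suspect this sequence on one of your hosts, the scratch and audit record settings can be compared from PowerCLI. A minimal sketch, assuming an existing vCenter session (the host name is illustrative, and I'm assuming the auditrecords namespace is exposed through the v2 esxcli interface):
$vmhost = Get-VMHost "esxi01.example.com"
# Configured vs. current scratch location; the two differ until the next reboot
Get-AdvancedSetting -Entity $vmhost -Name "ScratchConfig.ConfiguredScratchLocation"
Get-AdvancedSetting -Entity $vmhost -Name "ScratchConfig.CurrentScratchLocation"
# Audit record configuration; the storage directory should live under scratch
$esxcli = Get-EsxCli -VMHost $vmhost -V2
$esxcli.system.auditrecords.get.Invoke()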
With this knowledge in hand, the solution became a bit clearer. The general suggestion was to create a temporary ramdisk to alleviate the issue. However, creating a ramdisk on /var/run during runtime wasn’t advised. This folder contains essential runtime files, and tampering with it could lead to unexpected behavior, or worse, data corruption.
What if we create a ramdisk on /var/lib/vmware/hostd/journal, the location where vMotion was specifically failing to create a journal file? Now, here's where it got tricky. Since TSM and TSM-SSH weren’t enabled (due to customer policy), I couldn’t directly create a ramdisk using the typical methods.
 
PowerCLI to the rescue! I was able to create a ramdisk on /var/lib/vmware/hostd/journal without needing to rely on TSM or TSM-SSH.
 
Here is the script that helped.
Note that you would need an administrator (or equivalent) user on vCenter to get this to work. The script can be tailored to the use case.
# Prompt for the vCenter and the affected host
$vCenterServer = Read-Host "Enter the FQDN of the vCenter Server that is managing the impacted ESXi"
$impactedEsxiHost = Read-Host "Enter the FQDN of the impacted ESXi Host"
Connect-VIServer $vCenterServer
# Get the v2 esxcli interface for the host
$esxcli = Get-EsxCli -VMHost $impactedEsxiHost -V2
# Build the arguments for 'esxcli system visorfs ramdisk add'
$ramdiskArguments = $esxcli.system.visorfs.ramdisk.add.CreateArgs()
$ramdiskArguments.maxsize = "20"        # maximum size in MB
$ramdiskArguments.minsize = "10"        # reserved (minimum) size in MB
$ramdiskArguments.permissions = "755"
$ramdiskArguments.name = "var-lib-test"
$ramdiskArguments.target = "/var/lib/vmware/hostd/journal"
# Create the ramdisk on the journal path
$esxcli.system.visorfs.ramdisk.add.Invoke($ramdiskArguments)
The workaround worked just as expected, and vMotions went through smoothly.
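One thing worth noting: ramdisks created this way live in memory and do not survive a reboot, so the reboot later in the process cleaned this one up on its own. If you need to remove it sooner, the same namespace has a remove operation; a sketch, reusing the $esxcli object from the script above:
# Remove is keyed on the mount target, not the ramdisk name
$removeArguments = $esxcli.system.visorfs.ramdisk.remove.CreateArgs()
$removeArguments.target = "/var/lib/vmware/hostd/journal"
$esxcli.system.visorfs.ramdisk.remove.Invoke($removeArguments)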
 
After we were able to evacuate all the VMs, we placed the ESXi host in Maintenance Mode and rebooted it. Once the host came back online, we disabled and re-enabled local audit record storage using the commands:
esxcli system auditrecords local disable
esxcli system auditrecords local enable
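If shell access is still locked down by policy at this point, the same toggle should be reachable from PowerCLI through the v2 esxcli interface; a sketch, assuming the auditrecords namespace is exposed there and reusing the variables from the earlier script:
$esxcli = Get-EsxCli -VMHost $impactedEsxiHost -V2
# Equivalent of 'esxcli system auditrecords local disable' followed by 'enable'
$esxcli.system.auditrecords.local.disable.Invoke()
$esxcli.system.auditrecords.local.enable.Invoke()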
With about 5 hosts and 120 critical VMs in production, this issue had the potential to escalate quickly! Time for a coffee!!
Important Disclaimer:
Before you consider using the steps I’ve outlined, please note that these are advanced troubleshooting steps and should not be attempted without an SR (Service Request) with VMware Technical Support. These steps could lead to potential downtime or data corruption if implemented improperly. Always perform tests in a UAT environment and consult GS before implementing solutions in production.