ESXi

  • 1.  Fix corrupted GPT table

    Posted Aug 21, 2018 06:15 PM

    Hello

    Some time ago I noticed a very large number of messages in vmkernel.log similar to:
    Partition: 648: Read from primary gpt table failed on "naa.600a...."


    Almost all datastore devices (completely different datastores and LUNs) are listed in the log.

    The output of the "partedUtil getptbl" command is:

    Error: The primary GPT table is corrupt, but the backup appears OK, so that will be used. Fix primary table ? diskPath (/dev/disks/naa.600a...) diskSize (23622320128) AlternateLBA (1) LastUsableLBA (23622320094)
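The figures in this error message fit the standard GPT layout. Assuming 512-byte logical sectors and the usual 128 partition entries of 128 bytes each (32 sectors), the primary header sits at LBA 1, the backup header at the last LBA, and LastUsableLBA = diskSize - 34. Here 23622320128 - 34 = 23622320094, which matches the message. An AlternateLBA of 1 is consistent with values read from the backup header, which points back to the primary header at LBA 1. A minimal Python sketch of this arithmetic (`gpt_layout` is a hypothetical helper, not a VMware tool):

```python
# Sketch only: assumes 512-byte logical sectors and a standard 128 x 128-byte
# GPT partition entry array (32 sectors). Not part of ESXi or partedUtil.
SECTOR_SIZE = 512
ENTRY_ARRAY_SECTORS = 32


def gpt_layout(disk_size_sectors: int) -> dict:
    """Return the LBAs of the GPT structures for a disk of the given size in sectors."""
    last_lba = disk_size_sectors - 1
    return {
        "protective_mbr": 0,
        "primary_header": 1,
        "primary_entries": (2, 1 + ENTRY_ARRAY_SECTORS),              # LBA 2..33
        "first_usable": 2 + ENTRY_ARRAY_SECTORS,                      # LBA 34
        "last_usable": last_lba - 1 - ENTRY_ARRAY_SECTORS,            # diskSize - 34
        "backup_entries": (last_lba - ENTRY_ARRAY_SECTORS, last_lba - 1),
        "backup_header": last_lba,                                    # diskSize - 1
    }


if __name__ == "__main__":
    # diskSize taken from the error message above
    for name, lba in gpt_layout(23622320128).items():
        print(f"{name:16} {lba}")
```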


    I have tried the "partedUtil fixGpt" command and it fixes the GPT.

    However, I have questions:
    1. How safe is it to use in a production environment?
    2. What are the unpredictable consequences of this command?
    3. What can happen if you ignore these messages?
    4. How can I see what exactly is damaged in the Primary GPT?

    The output of the "partedUtil getptbl" command is the same before and after the fix:

    gpt

    1229833 255 63 19757268992

    P.S. ESXi 6.5 U2; the datastores are connected to the hosts via FC.



  • 2.  RE: Fix corrupted GPT table
    Best Answer

    Posted Aug 21, 2018 07:28 PM

    1. How safe is it to use in a production environment?

    If you receive the error message "The primary GPT table is corrupt, but the backup appears OK, so that will be used."
    as opposed to the message "The primary GPT table is corrupt / missing",
    then this fix is the best thing you can do. It is even better if you first create a backup by dumping the first MB of the volume to another datastore.
    This will allow you to revert the fix in the unlikely case that something goes wrong.

    2. What are the unpredictable consequences of this command?

    In some very rare cases the size of the datastore is reported incorrectly. If you hit such a case, you would not be able to mount the datastore again after a reboot.

    In that case, use the partedUtil commands that show the maximum usable size (such as "partedUtil getUsableSectors"); you should then be able to adjust the partition size accordingly.
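Before rebooting, you could compare the end sector of each partition listed by "partedUtil getptbl" with the last usable sector reported by "partedUtil getUsableSectors". A minimal Python sketch of that check, assuming getUsableSectors prints "first last" and that getptbl prints a label line, a geometry line, then one line per partition starting with "number start end"; verify these formats on your own build. `parse_usable` and `partitions_out_of_range` are hypothetical helpers:

```python
# Sketch only: the partedUtil output formats below are assumptions to verify.
from __future__ import annotations


def parse_usable(line: str) -> tuple[int, int]:
    """Parse 'first last' as assumed to be printed by `partedUtil getUsableSectors`."""
    first, last = (int(x) for x in line.split()[:2])
    return first, last


def partitions_out_of_range(getptbl_output: str, usable_line: str) -> list[int]:
    """Return partition numbers whose start or end lies outside the usable range.

    Assumed getptbl layout: line 1 = label ('gpt'), line 2 = geometry
    ('cylinders heads sectors totalSectors'), then one line per partition
    beginning with '<number> <startSector> <endSector> ...'.
    """
    first, last = parse_usable(usable_line)
    bad = []
    for line in getptbl_output.strip().splitlines()[2:]:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip anything that is not a partition line
        num, start, end = (int(x) for x in fields[:3])
        if start < first or end > last:
            bad.append(num)
    return bad
```

A non-empty result such as [1] would mean partition 1 extends past what the device currently offers, which would fit the rare case described above.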

    I would not recommend running the command while the datastore is under heavy I/O, for example during backups, but other than that I am not aware of any further unpredictable consequences.

    3. What can happen if you ignore these messages?
    In the worst case the backup GPT table gets lost too. Then you would have to recreate the partition from scratch, which is far less desirable but still manageable.
    If both tables are bad and you reboot, you will not be able to mount the datastore without recreating the partition table first.

    4. How can I see what exactly is damaged in the Primary GPT?
    You can run

    hexdump -C /dev/disks/device | less
    That will not be very helpful unless you eat hexdumps for supper.
    A GPT table uses a strict syntax, and if only a few bits are wrong, partedUtil will not display anything at all.
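Because the syntax is so strict, the header can also be checked mechanically instead of by eye. Assuming 512-byte sectors and the GPT header field offsets from the UEFI specification, this Python sketch validates the signature, MyLBA, the header CRC32 and the partition-entry-array CRC32 of the header at LBA 1 in a raw image, such as the 1 MB dump discussed in this thread. `check_gpt_header` is a hypothetical helper, not part of ESXi; run it on a workstation against a copy of the dump, not against the live device.

```python
# Sketch only: assumes 512-byte logical sectors; offsets follow the UEFI GPT header layout.
import struct
import zlib

SECTOR = 512


def check_gpt_header(image: bytes, lba: int = 1) -> list:
    """Return a list of problems in the GPT header at `lba` of a raw disk image.

    An empty list means signature, MyLBA, usable range and both CRC32 values look valid.
    """
    hdr = image[lba * SECTOR:(lba + 1) * SECTOR]
    if len(hdr) < 92:
        return ["image too short for a GPT header"]
    problems = []
    if hdr[0:8] != b"EFI PART":
        problems.append("bad signature")
    header_size, header_crc = struct.unpack_from("<II", hdr, 12)
    if not 92 <= header_size <= SECTOR:
        problems.append(f"implausible header size {header_size}")
        return problems
    # The header CRC32 covers HeaderSize bytes with the CRC field itself zeroed.
    tmp = bytearray(hdr[:header_size])
    tmp[16:20] = bytes(4)
    if zlib.crc32(tmp) & 0xFFFFFFFF != header_crc:
        problems.append("header CRC mismatch")
    my_lba, _alt_lba, first_usable, last_usable = struct.unpack_from("<QQQQ", hdr, 24)
    if my_lba != lba:
        problems.append(f"MyLBA is {my_lba}, expected {lba}")
    if first_usable > last_usable:
        problems.append("FirstUsableLBA is greater than LastUsableLBA")
    (entries_lba,) = struct.unpack_from("<Q", hdr, 72)
    num_entries, entry_size, entries_crc = struct.unpack_from("<III", hdr, 80)
    start = entries_lba * SECTOR
    end = start + num_entries * entry_size
    if end > len(image):
        problems.append("partition entry array lies outside the image")
    elif zlib.crc32(image[start:end]) & 0xFFFFFFFF != entries_crc:
        problems.append("partition entry array CRC mismatch")
    return problems
```

For example, `check_gpt_header(open("dump.bin", "rb").read())` returns an empty list for a healthy primary header, or entries such as "header CRC mismatch" that narrow down which part is damaged. The backup header at the last LBA is not contained in a 1 MB dump, so this only covers the primary table.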
    If you ask because you are surprised that a modern OS would corrupt the partition table at all, consider that ESXi tries to keep information like the partition table in RAM most of the time.
    So unpredictable events like power failures have more severe consequences than you may be used to with an OS like Windows, for example.

    Summary:

    I regard replacing the bad primary table with the healthy backup table as one of the few well-documented and safe options you have when dealing with VMFS problems.

    Ulli



  • 3.  RE: Fix corrupted GPT table

    Posted Aug 21, 2018 08:11 PM

    Thank you for the detailed answer.

    > Even better if you create a backup first by dumping the first MB of the volume to another datastore.
    > This will allow you to revert the fix in the improbable case something goes wrong.

    Can you give an example of how I can dump and then load back the first megabyte from the partition?

    (Maybe this? For the dump: dd if=/vmfs/devices/disks/naa.ID of=/vmfs/volumes/otherDatastore/dump.bin bs=1M count=1)

    > If both tables are bad and you reboot you will not be able to mount the datastore without recreating the partitiontable first.

    Do I understand correctly that when the host is rebooted, the problem with the datastore will only be on this host? Other hosts will continue to work with the datastore without any problems until they are rebooted?



  • 4.  RE: Fix corrupted GPT table

    Posted Aug 21, 2018 08:15 PM

    You got it already!
    > (May be this? For dump: dd if=/vmfs/devices/disks/naa.ID of=/vmfs/volumes/otherDatastore/dump.bin bs=1M count=1)
    That creates the backup. To revert, use:
    dd of=/vmfs/devices/disks/naa.ID if=/vmfs/volumes/otherDatastore/dump.bin bs=1M count=1 conv=notrunc
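If you take one dump before and one after running "partedUtil fixGpt" (the second one, say a hypothetical dump_after.bin, made with the same dd command), comparing the two shows which sectors the fix actually rewrote. With 512-byte sectors you would expect the changes to stay within the protective MBR and the primary GPT area (LBA 0 to 33). A minimal Python sketch; `changed_sectors` is a hypothetical helper:

```python
# Sketch only: assumes 512-byte logical sectors; LBAs are relative to the start of the dumps.
SECTOR = 512


def changed_sectors(before: bytes, after: bytes) -> list:
    """Return the LBAs whose contents differ between two raw dumps."""
    n = max(len(before), len(after))
    changed = []
    for off in range(0, n, SECTOR):
        if before[off:off + SECTOR] != after[off:off + SECTOR]:
            changed.append(off // SECTOR)
    return changed
```

Usage: `changed_sectors(open("dump.bin", "rb").read(), open("dump_after.bin", "rb").read())`.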
    > Do I understand correctly that when the host is rebooted, the problem with the datastore will only be on this host?
    > Other hosts will continue to work with the datastore without any problems until they are rebooted?
    So you have a VMFS volume on shared storage in a cluster?
    This can sometimes have strange effects in a cluster: the situation may look infectious and appear to be spreading across the cluster.
    Keep cool: if possible, isolate the datastore to a single host, apply the fix there and reboot that host. If that is not possible, apply the fix and reboot each host as soon as production allows.
    That said, I have not seen such issues in quite a while; I saw them more frequently with ESXi 5.x.
    Basically, an ESXi host should be able to continue operating if the partition table gets lost after the host has finished booting.



  • 5.  RE: Fix corrupted GPT table

    Posted Aug 22, 2018 04:46 PM

    Thank you again for helping me understand how it works :)



  • 6.  RE: Fix corrupted GPT table

    Posted Feb 20, 2021 10:45 PM

    I'm having similar problems with two of my storage volumes on a Storwize V7000 storage system. Other volumes are fine. I attempted to repair the volumes, and I'm not seeing errors on the ESXi hosts. However, if I attempt to add the storage volume, vSphere says that it will create a NEW datastore and will wipe out the data on the volume.

    Is there any way for me to save the data on the volume? I have multiple VMs stored there.



  • 7.  RE: Fix corrupted GPT table

    Posted Feb 20, 2021 10:54 PM

    Don't create me-too posts for problems like this.
    Create a new post instead and provide as much detail as possible.

    Ulli



  • 8.  RE: Fix corrupted GPT table

    Posted May 16, 2021 08:33 PM

    Have you been able to correct this error? I have the same issue on a V7000: many VMs on the datastore and a corrupt GPT table. I need to recover them ASAP.



  • 9.  RE: Fix corrupted GPT table

    Posted May 22, 2021 02:09 AM

    It sounds like you have a more serious problem: if many VMDKs don't match their nominal size, check whether the VMFS itself is still healthy.

    If you fix a corrupt GPT table using only the original or the backup GPT table, you can easily make matters worse.