SCS KB 1099: Backup of Linux machine fails with "Common I/O error" if partition table is damaged – Acronis SCS

Use Case:

Backup operation on machine running Linux OS

Symptoms:

1) Basic error message stack in Backup Console:

Status:Error

Common I/O error.

2) Details of error message in Backup Console (click on the Details button):

Error code: 21561347
Fields: {"$module":"disk_bundle_lxa64_27751"}
Message: Backup has failed.
------------------------
Error code: 66596
Fields: {"$module":"disk_bundle_lxa64_27751"}
Message: Failed to commit operations.
------------------------
Error code: 458785
Fields: {"$module":"disk_bundle_lxa64_27751"}
Message: Failed to create volume snapshot.
------------------------
Error code: 5832705
Fields: {"$module":"disk_bundle_lxa64_27751","device":"/dev/sda1"}
Message: Common I/O error.

The following entries can be seen in /var/log/messages or in the output of the dmesg command (both are part of the standard system information report):

[Tue Oct 5 20:13:07 2021] ata4.00: Enabling discard_zeroes_data
[Tue Oct 5 20:13:07 2021] GPT:Primary header thinks Alt. header is not at the end of the disk.
[Tue Oct 5 20:13:07 2021] GPT:1855832063 != 1953525167
[Tue Oct 5 20:13:07 2021] GPT:Alternate GPT header not at the end of the disk.
[Tue Oct 5 20:13:07 2021] GPT:1855832063 != 1953525167
[Tue Oct 5 20:13:07 2021] GPT: Use GNU Parted to correct GPT errors.
[Tue Oct 5 20:13:07 2021] sda: sda1 sda2 sda3
[Tue Oct 5 20:13:07 2021] ata5.00: Enabling discard_zeroes_data
[Tue Oct 5 20:13:07 2021] GPT:Primary header thinks Alt. header is not at the end of the disk.
[Tue Oct 5 20:13:07 2021] GPT:1855832063 != 1953525167
[Tue Oct 5 20:13:07 2021] GPT:Alternate GPT header not at the end of the disk.
[Tue Oct 5 20:13:07 2021] GPT:1855832063 != 1953525167
[Tue Oct 5 20:13:07 2021] GPT: Use GNU Parted to correct GPT errors.
[Tue Oct 5 20:13:07 2021] sdb: sdb1 sdb2 sdb3
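
To get a feel for the size of the mismatch reported in the "GPT:1855832063 != 1953525167" lines, the two numbers can be pulled out of the log and compared. The sketch below is illustrative only and rests on two assumptions that are not stated in this article: that the first number is the LBA where the primary header expects the backup header, with the second being the last LBA of the disk as the kernel sees it, and that the disk uses 512-byte logical sectors. The function name gpt_mismatches is hypothetical.

```python
import re

# Matches the kernel's "GPT:<recorded LBA> != <last LBA of disk>" line.
# It does not match text lines such as "GPT:Primary header thinks ...".
GPT_MISMATCH = re.compile(r"GPT:\s*(\d+)\s*!=\s*(\d+)")

SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def gpt_mismatches(log_text):
    """Return (recorded_lba, last_lba, gap_bytes) for each distinct mismatch."""
    seen = []
    for recorded, last in GPT_MISMATCH.findall(log_text):
        gap_bytes = (int(last) - int(recorded)) * SECTOR_BYTES
        item = (int(recorded), int(last), gap_bytes)
        if item not in seen:  # the kernel prints the same pair more than once
            seen.append(item)
    return seen


sample = """\
[Tue Oct 5 20:13:07 2021] GPT:Primary header thinks Alt. header is not at the end of the disk.
[Tue Oct 5 20:13:07 2021] GPT:1855832063 != 1953525167
[Tue Oct 5 20:13:07 2021] GPT:Alternate GPT header not at the end of the disk.
[Tue Oct 5 20:13:07 2021] GPT:1855832063 != 1953525167
"""

for recorded, last, gap in gpt_mismatches(sample):
    print(f"backup header expected at LBA {recorded}, disk ends at LBA {last}, gap {gap} bytes")
```

Under those assumptions, the partition table in this example describes a disk roughly 50 GB smaller than the one it actually sits on.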

What issues can be caused by such partition table errors

This example deals with a GPT partition table, but the possible effects are the same for an MBR partition table.

The effects vary, and some produce obscure or generic, uninformative error messages or warnings. For example, an invalid GPT or MBR partition table can cause the backup engine to attempt to read block addresses that do not physically exist on the disk(s) of the machine, and/or to have an incorrect idea of where partitions (and filesystems) start and end.
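
The "non-existent block addresses" case can be pictured as a simple bounds check: a partition whose start plus length exceeds the number of sectors on the disk points at blocks that are not there. The following is a minimal sketch with hypothetical numbers and a hypothetical function name (out_of_bounds); it is not how the backup engine itself performs the check.

```python
def out_of_bounds(disk_sectors, partitions):
    """Return names of partitions whose last sector lies beyond the disk.

    partitions maps a partition name to (start_sector, sector_count).
    """
    return [
        name
        for name, (start, count) in partitions.items()
        if start + count > disk_sectors
    ]


# Hypothetical layout on a 4,000,000-sector disk:
# sda3 is recorded as ending 1000 sectors past the end of the disk.
example = {
    "sda1": (2048, 1_000_000),
    "sda2": (1_002_048, 2_000_000),
    "sda3": (3_002_048, 998_952),
}

print(out_of_bounds(4_000_000, example))
```

Any read that targets the tail of the flagged partition would address sectors that do not exist, which is one way a generic I/O error can surface.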

Most often, partition table or filesystem anomalies cause backups to fail fairly early, during the first stages of the backup process, when partitions and filesystems are enumerated and their parameters are detected and checked.

Troubleshooting

1. Look for errors concerning disk I/O (e.g. similar to the errors described in "Spotting and fixing hardware I/O errors (disk errors) on Linux"), partition table anomalies, and mounting warnings/errors/anomalies in the places where the Linux kernel writes such messages: the buffer printed by the "dmesg" command and the /var/log/messages.* files (both are included in the standard system information report).

2. In the same locations, the snapapi module might also print some warnings or errors.

3. Try changing the backup plan to force sector-by-sector backup mode (in the protection plan's options). This mode reads the disk end-to-end, block by block, and can gloss over a defective GPT (or MBR) and/or a damaged filesystem, so it may be able to produce a successful entire-machine image backup.

(!) It is critical to ensure there is some form of adequate backup of the machine before attempting any repairs on the filesystem: preferably both a full system image backup and a file-level backup of valuable files/user-data/application-data.

Having a full system image created under forced sector-by-sector mode is one way to ensure such a backup exists; other approaches include but are not limited to:

Taking the machine offline and creating a block-for-block image (saved in a suitable location such as an external disk or network share) using an open-source imaging tool such as CloneZilla, or a combination of dd piped through gzip.
If the machine is a VM, taking snapshots of all of its disks/state using the facilities provided by the hypervisor (practically all hypervisors can take such snapshots). Some hypervisors call them snapshots, others call them Checkpoints, and many hypervisors also offer the option of exporting the snapshot + config metadata of the VM as an OVA bundle.
In some cases, if the machine is a VM and it is stored on some shared storage/networked storage/storage array, the storage may have the ability to do storage-side snapshots as well.
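
The "dd piped through gzip" approach mentioned above amounts to reading the source from the first byte to the last and writing a compressed copy elsewhere. A minimal Python sketch of the same idea follows; the function name image_device and the device and mount paths in the comments are hypothetical, and the demonstration runs against a scratch file rather than a real disk.

```python
import gzip
import os
import shutil
import tempfile


def image_device(source, dest, chunk_bytes=4 * 1024 * 1024):
    """Copy source block-for-block into a gzip-compressed image at dest.

    The source is opened read-only. This is roughly what
    `dd if=source | gzip > dest` does.
    """
    with open(source, "rb") as src, gzip.open(dest, "wb") as out:
        shutil.copyfileobj(src, out, chunk_bytes)


# Real use would look like this (hypothetical paths; requires root, and the
# destination must live on a different disk than the one being imaged):
# image_device("/dev/sda", "/mnt/external/sda.img.gz")

# Safe demonstration on a scratch file instead of a real disk.
with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "fake-disk.bin")
    image = os.path.join(workdir, "fake-disk.img.gz")
    payload = os.urandom(1024 * 1024)
    with open(source, "wb") as f:
        f.write(payload)
    image_device(source, image, chunk_bytes=64 * 1024)
    with gzip.open(image, "rb") as f:
        restored = f.read()

print("image matches source:", restored == payload)
```

Note that this sketch stops with an OSError at the first unreadable block, so on a disk with suspected bad sectors a dedicated imaging tool is likely the better choice.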

4. Try to repair/rebuild the GPT (or MBR) partition table, reboot, and then run fsck from a rescue environment or live ISO on all filesystems to be on the safe side.

5. If attempting repairs of the partition table and running fsck on all filesystems cannot be done soon (due to the machine being a production one, due to procedures to schedule a maintenance window, etc.), then doing the backups under forced sector-by-sector mode can serve as a temporary workaround.
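
The log review in steps 1 and 2 can be partly automated by sorting kernel log lines into rough categories. The sketch below is a starting point only: the regular expressions are illustrative assumptions, since the exact kernel wording varies by version and driver, and the function name classify is hypothetical.

```python
import re

# Illustrative patterns only; real kernel wording varies by version and driver.
PATTERNS = {
    "partition table": re.compile(r"\bGPT:|partition table", re.IGNORECASE),
    "disk I/O": re.compile(r"I/O error", re.IGNORECASE),
    "filesystem": re.compile(
        r"\b(EXT4-fs|XFS|FAT-fs)\b.*\b(error|corrupt\w*)", re.IGNORECASE
    ),
    "snapshot module": re.compile(r"snapapi", re.IGNORECASE),
}


def classify(lines):
    """Group log lines by the first category whose pattern they match."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
                break
    return hits


# Lines taken from the Symptoms section of this article.
sample = [
    "[Tue Oct 5 20:13:07 2021] ata4.00: Enabling discard_zeroes_data",
    "[Tue Oct 5 20:13:07 2021] GPT:Alternate GPT header not at the end of the disk.",
    "[Tue Oct 5 20:13:07 2021] GPT: Use GNU Parted to correct GPT errors.",
    "[Tue Oct 5 20:13:07 2021] sda: sda1 sda2 sda3",
]

for category, matched in classify(sample).items():
    print(category, len(matched))
```

To run it against a real log, pass an open file object (for example, one opened on /var/log/messages) to classify in place of the sample list. An empty result does not prove the disk is healthy, as explained below.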

It's also possible that there are latent I/O read errors / unreadable blocks / bad blocks on the physical disks, and/or a silently corrupted or internally inconsistent filesystem (XFS, EXT4, FAT32, or any other) on some of the volumes/partitions. Please note that the kernel will not necessarily print an explicit message about such issues in the dmesg output or in the /var/log/messages logs. While the Linux kernel as a whole, and its filesystem drivers in particular, perform quite a few consistency checks when they mount a filesystem and also dynamically when they access it, anomalies and corruption CAN slip through the cracks, especially if the corruption is in parts of the filesystem or physical areas of the disk that are rarely or never accessed during normal operation of the machine.

The backup creation process, on the other hand, generally needs to access EVERYTHING on the disks and filesystems, especially when a base full backup is created for the first time. This is why hard-to-spot partition table, filesystem, and physical disk errors may be uncovered only when the user/admin backs up the machine (especially as an entire-machine image) for the first time ever, or for the first time in a while.

Attempts to fix the partition table, and to repair any potential latent filesystem corruption/inconsistency are just that - attempts - and are not in general guaranteed to succeed. Be prepared for the possibility of rebuilding the machine from scratch, with a correct partition table and filesystem(s) without any anomalies, using the existing backups of the machine and its user-data/application-data/configuration-data which you should have at this stage.
