SCS KB 1098: Spotting and fixing hardware I/O errors (disk errors) on Linux – Acronis SCS

Use Case:

This article is an example and reference guide for recognizing when hardware disk input/output (I/O) errors are causing problems for the backup agent. The error messages shown in the backup console and in the PCS logs vary somewhat. Hardware disk errors are not only a problem for creating backups: they are a hidden yet significant danger to the stability and operability of the customer's machine and can easily lead to data loss, so spotting them in time can be crucial.

Symptoms:

Error messages in the Backup Console:

Backup fails with "Common I/O error."
Backup fails with "Cannot read the snapshot of the volume."

Error messages in the mms and/or pcs logs:

Example:

Error code: 21561347
Fields: {"$module":"disk_bundle_lxa64_26077"}
Message: Backup has failed.
------------------------
Error code: 66596
Fields: {"$module":"disk_bundle_lxa64_26077"}
Message: Failed to commit operations.
------------------------
Error code: 458755
Fields: {"$module":"disk_bundle_lxa64_26077"}
Message: Read error.
------------------------
Error code: 5832708
Fields: {"$module":"disk_bundle_lxa64_26077","device":"/dev/mapper/pve-root"}
Message: Cannot read the snapshot of the volume.

Error messages in the Linux kernel logs (/var/log/messages files, output of the dmesg command):

Some examples of I/O-related errors are listed below. This list is not exhaustive.

[11692891.711007] session_stat(service_process,27775): psize=116916224 pstrt=0 mshft=0 ioctls=24172
[11692891.711008] session_stat(service_process,27775): bhpgs=0 bhcnt=0 abhs=22742 fbhs=22742 dbhs=0
[11692891.711009] session_stat(service_process,27775): gpgs=8683 ppgs=8683 emmax=7943 emmin=7447 emcur=0 cached=0
[11692891.711010] session_stat(service_process,27775): rblk=1455488 cblk=161 rcblk=161 rc2blk=0 mcblk=143 rwcolls=10
[11692891.711011] session_stat(service_process,27775): sync=0 async=161 aretr=0 mipr=0 iprcnt=0
[11692891.711012] session_stat(service_process,27775): mbio=0 ioctlcnt=24172 ioctlpid=24171
[11692891.711013] session_stat(service_process,27775): rccalls=1455710 maxrcdepth=26 rcdepthcnts=(0, 0, 0, 0)
[11692891.775953] ata3.00: exception Emask 0x0 SAct 0x70000007 SErr 0x0 action 0x0
[11692891.776527] ata3.00: irq_stat 0x40000008
[11692891.777085] ata3.00: failed command: READ FPDMA QUEUED
[11692891.777612] ata3.00: cmd 60/40:e0:80:7a:a1/05:00:01:00:00/40 tag 28 ncq dma 688128 in
                           res 41/40:00:28:7b:a1/00:00:01:00:00/00 Emask 0x409 (media error) <F>
[11692891.778671] ata3.00: status: { DRDY ERR }
[11692891.779250] ata3.00: error: { UNC }
[11692891.785051] ata3.00: configured for UDMA/133
[11692891.785092] sd 2:0:0:0: [sdc] tag#28 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11692891.785095] sd 2:0:0:0: [sdc] tag#28 Sense Key : Medium Error [current] 
[11692891.785106] sd 2:0:0:0: [sdc] tag#28 Add. Sense: Unrecovered read error - auto reallocate failed
[11692891.785109] sd 2:0:0:0: [sdc] tag#28 CDB: Read(10) 28 00 01 a1 7a 80 00 05 40 00
[11692891.785110] print_req_error: I/O error, dev sdc, sector 27360040 flags 4000
[11692891.785665] ata3: EH complete
.....
[11779259.062858] ata3.00: irq_stat 0x40000008
[11779259.063390] ata3.00: failed command: READ FPDMA QUEUED
[11779259.063922] ata3.00: cmd 60/08:18:00:21:f0/00:00:00:00:00/40 tag 3 ncq dma 4096 in
                           res 41/40:00:04:21:f0/00:00:00:00:00/00 Emask 0x409 (media error) <F>
[11779259.064989] ata3.00: status: { DRDY ERR }
[11779259.065563] ata3.00: error: { UNC }
[11779259.071309] ata3.00: configured for UDMA/133
[11779259.071326] sd 2:0:0:0: [sdc] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[11779259.071330] sd 2:0:0:0: [sdc] tag#3 Sense Key : Medium Error [current] 
[11779259.071332] sd 2:0:0:0: [sdc] tag#3 Add. Sense: Unrecovered read error - auto reallocate failed
[11779259.071336] sd 2:0:0:0: [sdc] tag#3 CDB: Read(10) 28 00 00 f0 21 00 00 00 08 00
[11779259.071339] print_req_error: I/O error, dev sdc, sector 15737092 flags 0
[11779259.071923] ata3: EH complete
[11779286.228963] ata3.00: exception Emask 0x0 SAct 0x10000 SErr 0x0 action 0x0
[11779286.229834] ata3.00: irq_stat 0x40000008
[11779286.230396] ata3.00: failed command: READ FPDMA QUEUED
....
[11779292.509443] ata3.00: status: { DRDY }
[11779292.510016] ata3.00: failed command: WRITE FPDMA QUEUED
[11779292.510595] ata3.00: cmd 61/08:70:58:cf:7b/00:00:0a:00:00/40 tag 14 ncq dma 4096 out
                           res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
[11779292.511730] ata3.00: status: { DRDY }
[11779292.512298] ata3.00: failed command: WRITE FPDMA QUEUED
[11779292.512858] ata3.00: cmd 61/08:78:38:d7:7b/00:00:0a:00:00/40 tag 15 ncq dma 4096 out
                           res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x1 (device error)
....
[11692891.785110] print_req_error: I/O error, dev sdc, sector 27360040 flags 4000
[11759183.206085] print_req_error: I/O error, dev sdc, sector 15736888 flags 0
[11759210.632331] print_req_error: I/O error, dev sdc, sector 27360040 flags 0
[11759211.728279] print_req_error: I/O error, dev sdc, sector 27360040 flags 4000
[11779259.071339] print_req_error: I/O error, dev sdc, sector 15737092 flags 0
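
The I/O error lines above name the device and the failing sector, so counting them per sector shows whether the same spot fails repeatedly. The following is a minimal Python sketch of that idea, not part of any product tooling. The names IO_ERROR_RE, SAMPLE_LINES and count_io_errors are hypothetical, and the regular expression assumes the "I/O error, dev <name>, sector <N>" wording seen in this log; other kernel versions may word the line differently.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpt above:
#   print_req_error: I/O error, dev sdc, sector 27360040 flags 4000
IO_ERROR_RE = re.compile(r"I/O error, dev ([^,\s]+), sector (\d+)")

# The five summary lines from the excerpt, used here as sample input.
SAMPLE_LINES = [
    "[11692891.785110] print_req_error: I/O error, dev sdc, sector 27360040 flags 4000",
    "[11759183.206085] print_req_error: I/O error, dev sdc, sector 15736888 flags 0",
    "[11759210.632331] print_req_error: I/O error, dev sdc, sector 27360040 flags 0",
    "[11759211.728279] print_req_error: I/O error, dev sdc, sector 27360040 flags 4000",
    "[11779259.071339] print_req_error: I/O error, dev sdc, sector 15737092 flags 0",
]


def count_io_errors(lines):
    """Return a Counter mapping (device, sector) to the number of reports."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    # Most frequently reported sectors first.
    for (dev, sector), n in count_io_errors(SAMPLE_LINES).most_common():
        print(f"{dev} sector {sector}: {n} report(s)")
```

On the sample lines, sector 27360040 of sdc is reported three times over a span of many hours, which points to a persistent rather than a transient problem.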

Keywords to look for:

Disk, I/O and storage subsystem errors vary a lot depending on factors such as the Linux kernel version and the exact type of storage controller and storage attachment: some messages look slightly different if, for example, virtual disks are used inside a hypervisor, or if a disk/volume is attached via iSCSI or Fibre Channel. Even so, there are several strings, messages and patterns to look for. This list is not exhaustive:

ata x.yz ... DRDY
ata x.yz failed command
WRITE FPDMA QUEUED
READ FPDMA QUEUED
print_req_error
I/O error ... <the device is normally named, e.g. sda, sdb or sdc> <the sector(s) NNNN that cannot be read or written are usually mentioned>
hostbyte
driverbyte
DRIVER_SENSE
Sense Key : Medium Error
Add. Sense: Unrecovered read error - auto reallocate failed
ata <ID>: EH complete
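
The keywords above can be turned into a simple log filter. The sketch below is one possible way to do it in Python, under stated assumptions: the pattern table KEYWORD_PATTERNS and the function scan_kernel_log are hypothetical names, the regular expressions are derived only from the sample messages in this article, and messages from other kernels or storage stacks may need additional patterns.

```python
import re
from collections import Counter

# Hypothetical pattern table built from the keyword list above.
KEYWORD_PATTERNS = {
    "ata_status_drdy": re.compile(r"ata\d+\.\d+: status: \{ DRDY"),
    "ata_failed_command": re.compile(r"ata\d+(?:\.\d+)?: failed command"),
    "fpdma_queued": re.compile(r"(?:READ|WRITE) FPDMA QUEUED"),
    "io_error": re.compile(r"I/O error, dev [^,\s]+, sector \d+"),
    "driver_sense": re.compile(r"driverbyte=DRIVER_SENSE"),
    "medium_error": re.compile(r"Sense Key : Medium Error"),
    "unrecovered_read": re.compile(r"Unrecovered read error"),
    "eh_complete": re.compile(r"ata\d+: EH complete"),
}


def scan_kernel_log(lines):
    """Count how many lines match each keyword pattern.

    A single line can match more than one pattern and is then
    counted once under each matching name.
    """
    hits = Counter()
    for line in lines:
        for name, pattern in KEYWORD_PATTERNS.items():
            if pattern.search(line):
                hits[name] += 1
    return hits


if __name__ == "__main__":
    import sys

    # Usage (assumed): dmesg | python3 scan.py
    for name, n in sorted(scan_kernel_log(sys.stdin).items()):
        print(f"{name}: {n}")
```

A non-empty result is only a hint that the kernel log deserves a closer look; the surrounding lines still need to be read to identify the device and the kind of failure.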

Impact on backup and/or restore activities:

The impact on backup and recovery activities varies depending on which operation fails, how it fails, and whether it fails every time or only occasionally. A bad area or sector on a disk is not always permanently bad: sometimes the hardware can recover or repair lighter errors on its own, in the background, and sometimes the errors occur only under unfavorable physical conditions such as excessive vibration in the server, computer or datacenter. It also matters what is stored in the problematic sectors or areas of the disk: parts containing critical LVM or file system metadata, the OS bootloader and kernel, or the system swap partition/file/area are usually more important than others.

If the errors are minor and do not affect critical areas, the backup agent's engine may be able to switch to sector-by-sector mode automatically; this behavior can be controlled via the Options sub-menu in the Backup Plan.

However, in practice, in most cases, the I/O errors are serious enough to make even sector-by-sector backups fail (always or intermittently).

Backup creation is affected either during the snapshot creation stage, which is handled by the snapapi26 kernel module, or during the actual reading of data from the snapshot in order to send it to the backup.

Restoring backups to problematic disks usually fails when data in the exact bad spots needs to be overwritten, but if critical LVM or file system metadata is corrupted, unreadable or unwritable, a wide variety of errors and messages may appear.

What to do (reactively AND proactively):

Fix or replace the faulty hardware.
Repair/resync the hardware or software array (if one is in use).
Periodically run fsck in a mode that checks the entire disk surface (all blocks). Consult the Linux manpages (man fsck) on how to do this. The "badblocks" Linux utility can also be used to scan for bad blocks.
If using hardware or software RAID solutions, configure them to periodically scrub or do patrol reads to detect bad/unstable sectors and disks as early as possible.
Use advanced file systems like ZFS and BTRFS, which have native features to detect and (if configured properly) self-heal some of these errors.
Take Entire-machine backups frequently enough.
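
To check whether a sector named in the kernel log is still unreadable, a read-only probe can be sketched as follows. This is an illustrative Python sketch, not a replacement for badblocks or for the vendor's diagnostics. The function name probe_sectors is hypothetical, and the code assumes that the sector numbers in the kernel messages are in 512-byte units and that a failed read surfaces as an EIO error. Reads served from the page cache may succeed without touching the disk, so a clean result does not prove the sector is healthy.

```python
import errno
import os

# Assumption: kernel log sector numbers are expressed in 512-byte units.
SECTOR_SIZE = 512


def probe_sectors(path, first_sector, count=1):
    """Read `count` sectors starting at `first_sector`, read-only.

    Returns the list of sector numbers whose read failed with EIO.
    Any other error (e.g. permission denied) is raised to the caller.
    """
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for sector in range(first_sector, first_sector + count):
            try:
                os.pread(fd, SECTOR_SIZE, sector * SECTOR_SIZE)
            except OSError as exc:
                if exc.errno != errno.EIO:
                    raise
                bad.append(sector)
    finally:
        os.close(fd)
    return bad
```

For example, probe_sectors("/dev/sdc", 27360040, 8), run as root, would probe the sector reported in the sample log and the seven sectors after it. Each failing read may take a long time because of hardware retries, and it adds further error entries to the kernel log.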

Key takeaways:

When backups fail with snapshot errors, unspecified I/O errors, "cannot read..." errors and similar, it is often important to check the Linux kernel logs (/var/log/messages files, dmesg output) for hardware errors. In such cases the kernel logs are much more precise than the fairly high-level (and often generic) error messages that the backup agent reports: it is a user-space application, after all, and it cannot always "see" or interpret low-level errors of the I/O subsystem.
If the hardware cannot read the data (reliably), then the Linux kernel cannot read it either, the backup agent cannot process the data correctly, and the backup will keep failing until the hardware problems are fixed (most often, until the bad disk(s), cable(s), or HBA/RAID card/adapter card are replaced).
The backup engine is NOT designed to back up barely functioning/marginal storage hardware; it is NOT a specialized data recovery/disk repair software tool. Specialized data recovery tools can sometimes extract data from very unstable sectors, using specialized techniques like retrying a non-responding sector tens or hundreds of times.
