How To Test If Your Hard Drive Is Good / Signs of a Dying Hard Drive

I like dd. Although it only reads the disk, a read test of the entire drive will usually reveal whether parts of it are bad. This is worth doing at least once a month: bizarre program behavior, lagginess, crashes and unmounting problems are often caused by a failing disk, and SMART will not necessarily detect or report it.

We must also remember there is never a guarantee. In my experience, ever since drives moved to more and larger platters (1TB and up), hard drives in general have become more prone to errors, problems and failures that I never saw with smaller drives, even after years of use. It's scary how cheap and poor in quality most hard drives are today; they are still the most unreliable component, and they seem to be getting less reliable.

My way to test a drive:

It takes a long time, but since dd reads every sector on the whole drive, this should uncover a dying drive whose problems may not otherwise be noticeable.

dd if=/dev/sdb of=/dev/null
3907029168+0 records in
3907029168+0 records out
2000398934016 bytes (2.0 TB) copied, 28390.9 seconds, 70.5 MB/s
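The read test above can be wrapped in a small function so it is easy to run from a monthly cron job. This is only a sketch, assuming a Linux system with GNU dd: the name read_test and the bs=1M block size are illustrative choices (the run above used dd's default 512-byte blocks), and it relies on dd exiting with a non-zero status when a read fails.

```shell
#!/bin/sh
# read_test DEVICE
# Reads DEVICE from start to end and discards the data.
# read_test is a hypothetical helper name, not a standard tool.
read_test() {
    dev="$1"
    if [ ! -r "$dev" ]; then
        echo "cannot read $dev" >&2
        return 2
    fi
    # bs=1M is an assumption to speed things up; dd stops at the
    # first unreadable block and returns a non-zero exit status.
    if dd if="$dev" of=/dev/null bs=1M; then
        echo "read test passed: $dev"
    else
        echo "read test FAILED: $dev" >&2
        return 1
    fi
}

# Example (as root): read_test /dev/sdb
```

A clean exit only means every block could be read; it is still worth checking dmesg and smartctl afterwards, since the drive may have retried or remapped sectors along the way.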

Here is my new Hitachi 2TB drive after the same test. I also checked dmesg and smartctl, and neither showed any errors:

3907029168+0 records in
3907029168+0 records out
2000398934016 bytes (2.0 TB) copied, 21359.7 seconds, 93.7 MB/s

smartctl -a /dev/sdb
smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     Hitachi HDS5C3020ALA632
Serial Number:    ML4220F318DZ2K
Firmware Version: ML6OA580
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is: ATA-8-ACS revision 4
Local Time is:    Wed Jun 8 19:14:06 2011 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x80)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (23306) seconds.
Offline data collection
capabilities:             (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:     (   1) minutes.
Extended self-test routine
recommended polling time:     ( 255) minutes.
SCT capabilities:            (0x003d)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   054    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       3
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   020    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       6
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       3
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       3
194 Temperature_Celsius     0x0002   176   176   000    Old_age   Always       -       34 (Lifetime Min/Max 25/35)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
    1        0        0 Not_testing
    2        0        0 Not_testing
    3        0        0 Not_testing
    4        0        0 Not_testing
    5        0        0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
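Reading the whole attribute table by eye gets tedious, so here is a rough sketch of a filter that picks out the sector-health counters from smartctl output like the above. It assumes the ten-column attribute layout shown here, with the raw value in the last column; the function name smart_sector_check and the choice of attributes 5, 197 and 198 are illustrative assumptions, not a complete health check.

```shell
#!/bin/sh
# smart_sector_check: reads smartctl attribute output on stdin and
# warns when a sector-related raw counter is above zero.
# Hypothetical helper; assumes the raw value is field 10.
smart_sector_check() {
    awk '
        # Attribute rows have at least 10 fields and start with the ID.
        NF >= 10 && ($1 == 5 || $1 == 197 || $1 == 198) {
            if ($10 + 0 > 0) {
                printf "WARN %s raw=%s\n", $2, $10
                bad = 1
            }
        }
        # Exit status 1 if any counter was non-zero, 0 otherwise.
        END { exit (bad ? 1 : 0) }
    '
}

# Example (as root): smartctl -A /dev/sdb | smart_sector_check
```

On the new Hitachi drive above this would print nothing, since attributes 5, 197 and 198 all have a raw value of 0.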

If there's a problem/Signs of a dying hard drive

You'll get errors like this (your drive is about to die and will be taken offline):

Jun 5 11:56:10 box11 kernel: end_request: I/O error, dev sdb, sector 0
Jun 5 11:56:10 box11 kernel: Buffer I/O error on device sdb, logical block 0
Jun 5 11:56:10 box11 kernel: Buffer I/O error on device sdb, logical block 1
Jun 5 11:56:10 box11 kernel: Buffer I/O error on device sdb, logical block 2
Jun 5 11:56:10 box11 kernel: Buffer I/O error on device sdb, logical block 3
Jun 5 11:56:10 box11 kernel: end_request: I/O error, dev sdb, sector 0

May 31 22:02:50 box11 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
May 31 22:02:50 box11 kernel: end_request: I/O error, dev sdb, sector 0
May 31 22:02:50 box11 kernel: Buffer I/O error on device sdb, logical block 0
May 31 22:02:50 box11 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
May 31 22:02:50 box11 kernel: end_request: I/O error, dev sdb, sector 3907029160

Here's an example of how mdadm RAID reacts to the failed disk (I was also experiencing bizarre database issues, crashes, freezes, etc.):

May 31 12:45:54 box11 kernel: ata4.00: exception Emask 0x40 SAct 0x0 SErr 0x800 action 0x6 frozen
May 31 12:45:54 box11 kernel: ata4: SError: { HostInt }
May 31 12:45:54 box11 kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
May 31 12:45:54 box11 kernel:          res 40/00:04:b8:21:7e/cd:00:0a:00:00/40 Emask 0x44 (timeout)
May 31 12:45:54 box11 kernel: ata4.00: status: { DRDY }
May 31 12:45:54 box11 kernel: ata4: hard resetting link
May 31 12:46:00 box11 kernel: ata4: link is slow to respond, please be patient (ready=0)
May 31 12:46:04 box11 kernel: ata4: softreset failed (device not ready)
May 31 12:46:04 box11 kernel: ata4: hard resetting link
May 31 12:46:10 box11 kernel: ata4: link is slow to respond, please be patient (ready=0)
May 31 12:46:14 box11 kernel: ata4: softreset failed (device not ready)
May 31 12:46:14 box11 kernel: ata4: hard resetting link
May 31 12:46:20 box11 kernel: ata4: link is slow to respond, please be patient (ready=0)
May 31 12:46:54 box11 kernel: ata4: softreset failed (device not ready)
May 31 12:46:54 box11 kernel: ata4: limiting SATA link speed to 1.5 Gbps
May 31 12:46:54 box11 kernel: ata4: hard resetting link
May 31 12:47:01 box11 kernel: ata4: softreset failed (device not ready)
May 31 12:47:03 box11 kernel: ata4: reset failed, giving up
May 31 12:47:03 box11 kernel: ata4.00: disabled
May 31 12:47:03 box11 kernel: sd 3:0:0:0: timing out command, waited 30s
May 31 12:47:03 box11 kernel: ata4: EH complete
May 31 12:47:03 box11 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
May 31 12:47:03 box11 kernel: end_request: I/O error, dev sdb, sector 117210047
May 31 12:47:03 box11 kernel: raid1: Disk failure on sdb1, disabling device.
May 31 12:47:03 box11 kernel:        Operation continuing on 1 devices
May 31 12:47:03 box11 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
May 31 12:47:04 box11 kernel: end_request: I/O error, dev sdb, sector 1152390464
May 31 12:47:04 box11 kernel: raid1: Disk failure on sdb3, disabling device.
May 31 12:47:04 box11 kernel:        Operation continuing on 1 devices
May 31 12:47:04 box11 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
May 31 12:47:04 box11 kernel: end_request: I/O error, dev sdb, sector 749321600
May 31 12:47:04 box11 kernel: raid1: sdb3: rescheduling sector 573506240
May 31 12:47:04 box11 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
May 31 12:47:04 box11 kernel: end_request: I/O error, dev sdb, sector 750128368
May 31 12:47:04 box11 kernel: raid1: sdb3: rescheduling sector 574313008

The first signs of my dying drive were infrequent errors during heavy reads:

May 15 04:46:59 box11 kernel: ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
May 15 04:46:59 box11 kernel: ata2.00: irq_stat 0x40000008
May 15 04:46:59 box11 kernel: ata2.00: cmd 60/80:a0:c0:ad:76/00:00:16:00:00/40 tag 20 ncq 65536 in
May 15 04:46:59 box11 kernel: res 41/40:00:28:ae:76/cd:00:16:00:00/40 Emask 0x409 (media error) <F>
May 15 04:46:59 box11 kernel: ata2.00: status: { DRDY ERR }
May 15 04:46:59 box11 kernel: ata2.00: error: { UNC }
May 15 04:46:59 box11 kernel: ata2.00: configured for UDMA/133
May 15 04:46:59 box11 kernel: ata2: EH complete
May 15 04:46:59 box11 kernel: SCSI device sda: 3907029168 512-byte hdwr sectors (2000399 MB)
May 15 04:46:59 box11 kernel: sda: Write Protect is off
May 15 04:46:59 box11 kernel: SCSI device sda: drive cache: write back
May 15 04:47:01 box11 kernel: ata2.00: exception Emask 0x0 SAct 0x7fcfffdf SErr 0x0 action 0x0
May 15 04:47:13 box11 kernel: ata2.00: irq_stat 0x40000008
May 15 04:47:13 box11 kernel: ata2.00: cmd 60/80:50:c0:ad:76/00:00:16:00:00/40 tag 10 ncq 65536 in
May 15 04:47:13 box11 kernel: res 41/40:00:28:ae:76/cd:00:16:00:00/40 Emask 0x409 (media error) <F>
May 15 04:47:13 box11 kernel: ata2.00: status: { DRDY ERR }
May 15 04:47:13 box11 kernel: ata2.00: error: { UNC }
May 15 04:47:13 box11 kernel: ata2.00: configured for UDMA/133
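The log signatures in the excerpts above can be searched for automatically. The sketch below is only that: the function name disk_log_errors is made up for illustration, and the pattern list is taken from the messages quoted in this article, so other kernels and drivers may word their errors differently.

```shell
#!/bin/sh
# disk_log_errors: reads kernel log text on stdin and prints only the
# lines that match error signatures quoted in this article.
# Hypothetical helper; the pattern list is not exhaustive.
disk_log_errors() {
    grep -E 'I/O error, dev |\(media error\)|error: \{ UNC \}|Disk failure on |SCSI error: return code|reset failed, giving up'
}

# Examples:
#   dmesg | disk_log_errors
#   disk_log_errors < /var/log/messages | wc -l
```

grep exits with status 1 when nothing matches, so the function can also be used directly in an if statement to decide whether to send an alert.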

What if the problem occurs only when writing data?

With Linux, or whatever OS you use, you'll find that you begin to have random crashes. Looking at the console in Linux might offer clues, but this can be hard to track down. I'm about 99% positive that one of my WD EARS drives has been causing crashes on my computers for months. I've realized such a drive can be identified by a high "Multi-Zone Error Rate"; also, if the Load Cycle Count is high (mine is over 1 million), you can expect the drive is on its way out.

The problem is that you can't do a full badblocks or dd write test on an existing partition without destroying data. There's no good way of testing write issues except by taking the drive off-line and/or willfully destroying your data.
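A much weaker but non-destructive check is to write a scratch file onto the mounted filesystem and compare it with its source. The sketch below assumes GNU dd and a Linux /dev/urandom; the name write_verify and the default size are illustrative. It only exercises whatever free space the filesystem hands out, and the read-back may be served from the page cache, so it mainly catches errors the kernel reports at write or fsync time. It is not a substitute for a full-surface write test.

```shell
#!/bin/sh
# write_verify DIR [MIB]
# Writes MIB megabytes of random data to a scratch file in DIR,
# forces it to disk, compares it with the source, then cleans up.
# Hypothetical helper; touches free space only.
write_verify() {
    dir="$1"
    mib="${2:-64}"
    src=$(mktemp) || return 2
    dst="$dir/write_verify.$$.tmp"
    status=0
    # Generate the reference data outside the filesystem under test.
    dd if=/dev/urandom of="$src" bs=1M count="$mib" 2>/dev/null || status=2
    if [ "$status" -eq 0 ]; then
        # conv=fsync flushes to the device before dd exits, so a
        # write error shows up in dd's exit status.
        dd if="$src" of="$dst" bs=1M conv=fsync 2>/dev/null || status=1
    fi
    if [ "$status" -eq 0 ]; then
        cmp -s "$src" "$dst" || status=1
    fi
    rm -f "$src" "$dst"
    if [ "$status" -eq 0 ]; then
        echo "write/verify passed: $dir"
    else
        echo "write/verify FAILED: $dir" >&2
    fi
    return "$status"
}

# Example: write_verify /mnt/data 1024
```

If mktemp places its file on the same drive you are testing, point TMPDIR at a different disk first so the reference copy is not subject to the same faults.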

Here is the smartctl output for the suspect WD EARS drive:

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZ20148885
Firmware Version: 50.0AB50
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Sep 17 08:57:20 2011 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)    Offline data collection activity
                    was suspended by an interrupting command from host.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:          (36600) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 255) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x3035)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   179   174   021    Pre-fail  Always       -       6041
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       19
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       9360
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       13
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       1080578
194 Temperature_Celsius     0x0022   120   109   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       10

I'd also say that if you begin to see a Multi_Zone_Error_Rate higher than 0, you should treat it as a sign of serious write errors that will get worse and ultimately start crashing your system and affecting stability. This is what I found: each of my WD EARS 2TB drives started to have issues as that number crept up. In the example above it started at just 1 or 2 and then climbed, and the crashes got progressively worse.

Some of the crashes produced the following traces, which I believe are due to this dying EARS drive (I'll know in a few weeks, since I'm replacing this drive today):
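The two rules of thumb above can be turned into a quick filter over smartctl output. This is a sketch of my heuristic only, not a vendor-documented failure criterion: the function name wd_wear_check is made up, the default limit of 1000000 load cycles is simply the figure from my drive, and it assumes the raw value is the tenth column as in the table above.

```shell
#!/bin/sh
# wd_wear_check [LOAD_CYCLE_LIMIT]
# Reads smartctl attribute output on stdin and warns when
# Load_Cycle_Count (193) exceeds the limit or
# Multi_Zone_Error_Rate (200) is above zero.
# Hypothetical helper; the default limit is an assumption.
wd_wear_check() {
    awk -v limit="${1:-1000000}" '
        NF >= 10 && $1 == 193 && $10 + 0 > limit + 0 {
            printf "WARN Load_Cycle_Count raw=%s exceeds %s\n", $10, limit
            bad = 1
        }
        NF >= 10 && $1 == 200 && $10 + 0 > 0 {
            printf "WARN Multi_Zone_Error_Rate raw=%s\n", $10
            bad = 1
        }
        # Exit status 1 if either rule fired, 0 otherwise.
        END { exit (bad ? 1 : 0) }
    '
}

# Example (as root): smartctl -A /dev/sda | wd_wear_check
```

Logging the output once a day makes it easy to see whether the Multi_Zone_Error_Rate raw value is creeping up over time, which was the pattern that preceded my crashes.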

[<c0466769>] free_hot_cold_page+0xfc/0x150
[<c04667d1>] __pagevec_free+0x14/0x1a
[<c0468c6b>] release_pages+0x127/0x12f
[<c04692d1>] __pagevec_release+0x15/0x1d
[<c04697db>] __invalidate_mapping_pages+0x120/0x156
[<c0469818>] invalidate_mapping_pages+0x7/0x9
[<c049c22e>] shrink_icache_memory+0xf5/0x295
[<c046aeab>] shrink_slab+0xfb/0x16e
[<c046b271>] kswapd+0x2d7/0x3fb
[<c0436e1f>] autoremove_wake_function+0x0/0x2d
[<c046af9a>] kswapd+0x0/0x3fb
[<c0436d5d>] kthread+0xc0/0xeb
[<c0436c9d>] kthread+0x0/0xeb
[<c062d243>] kernel_thread_helper+0x7/0x10
Code: 43 1c 31 c0 eb 0d 31 d2 89 f1 55 89 f8 e8 74 f0 ff ff 5a 5b 5e 5f 5d c3 55 89 d5 57 89 c7 56 53 8b 70 20 85 f6 0f 84 e9 00 00 00 <8b> 06 3d 75 62 75 62 0f 84 86 00 00 00 50 56 57 68 28 0e 63 c0
EIP: [<c043b5b4>] ub_page_uncharge+0x13/0x101 SS:ESP 0068:f7861df0
Kernel panic - not syncing: Fatal exception

---------

[<c0466769>] free_hot_cold_page+0xfc/0x150
[<c04667d1>] __pagevec_free+0x14/0x1a
[<c0468c37>] release_pages+0xf3/0x12f
[<c04692d1>] __pagevec_release+0x15/0x1d
[<c0469b12>] truncate_inode_pages_range+0xcc/0x260
[<f916c8d3>] journal_stop+0x208/0x213 [jbd]
[<c0469caf>] truncate_inode_pages+0x9/0xe
[<f91a8a57>] ext3_delete_inode+0x13/0xba [ext3]
[<f91a8a44>] ext3_delete_inode+0x0/0xba [ext3]
[<c049b9ea>] generic_delete_inode+0x91/0xfe
[<c049b4c1>] iput+0x67/0x69
[<c0498df5>] d_kill+0x19/0x32
[<c0499f36>] dput+0x19f/0x1ac
[<c0492717>] sys_renameat+0x15f/0x1af
[<c0472410>] remove_vma+0x47/0x4c
[<c0472e3a>] do_munmap+0x19e/0x1ba
[<c0492778>] sys_rename+0x11/0x15
[<c062c4eb>] syscall_call+0x7/0xb
Code: 43 1c 31 c0 eb 0d 31 d2 89 f1 55 89 f8 e8 74 f0 ff ff 5a 5b 5e 5f 5d c3 55 89 d5 57 89 c7 56 53 8b 70 20 85 f6 0f 84 e9 00 00 00 <8b> 06 3d 75 62 75 62 0f 84 86 00 00 00 50 56 57 68 28 0e 63 c0
EIP: [<c043b5b4>] ub_page_uncharge+0x13/0x101 SS:ESP 0068:f7acbd9c
Kernel panic - not syncing: Fatal exception