I like dd for this. It only reads the disk, but a read test of the entire drive will usually uncover whether parts of it are bad. This is a good thing to do at least once a month: bizarre program behavior, lagginess, crashes, unmounting problems and the like are often due to a failing disk, and SMART often won't know about it or indicate a problem.
Remember that there's never a guarantee. In my experience, ever since we moved to larger drives with more platters (1TB and up), hard drives in general have become more prone to errors and failures of a kind I never saw with smaller drives, even after years of use. It's scary how cheap and poor in quality most hard drives are today; they're still the most unreliable component and, from what I've seen, are actually getting less reliable.
My way to test a drive:
It takes a long time, but since dd reads every sector on the whole drive, this should uncover a dying drive whose problems aren't noticeable yet.
dd if=/dev/sdb of=/dev/null
3907029168+0 records in
3907029168+0 records out
2000398934016 bytes (2.0 TB) copied, 28390.9 seconds, 70.5 MB/s
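The same read test can be wrapped in a small helper. This is only a rough sketch: the function name read_test is made up for illustration, and the larger block size is my assumption for cutting per-call overhead compared with dd's default 512-byte blocks. It relies on dd stopping with a non-zero exit status at the first unreadable block.

```shell
#!/bin/sh
# Sketch of a whole-device read test. read_test is a hypothetical helper name.
# On a real drive this needs root and takes hours.
read_test() {
    dev="$1"
    # Read everything and throw the data away (1 MiB blocks).
    # dd exits non-zero if a read fails.
    if dd if="$dev" of=/dev/null bs=1048576 2>/dev/null; then
        echo "read OK: $dev"
    else
        echo "read FAILED: $dev"
        return 1
    fi
}

# Example (not run automatically):
#   read_test /dev/sdb; dmesg | grep -i 'I/O error'
```

Even when dd finishes cleanly, it is still worth looking at dmesg afterwards, since the drive may have retried reads along the way.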
Here is the same test on my new Hitachi 2TB drive. I also checked dmesg and smartctl and found no errors:
3907029168+0 records in
3907029168+0 records out
2000398934016 bytes (2.0 TB) copied, 21359.7 seconds, 93.7 MB/s
smartctl -a /dev/sdb
smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: Hitachi HDS5C3020ALA632
Serial Number: ML4220F318DZ2K
Firmware Version: ML6OA580
User Capacity: 2,000,398,934,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Jun 8 19:14:06 2011 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (23306) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 100 100 054 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 100 100 024 Pre-fail Always - 0
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 3
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 020 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 6
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 3
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 3
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 3
194 Temperature_Celsius 0x0002 176 176 000 Old_age Always - 34 (Lifetime Min/Max 25/35)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
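Most of that output is noise for a quick health check. As a minimal sketch, the attribute table can be filtered down to the sector-health counters. The function name check_sectors is hypothetical, and it assumes the ten-column table layout shown above, with the raw value in column 10.

```shell
#!/bin/sh
# Sketch: flag non-zero raw values for the sector-health attributes in
# "smartctl -A" style output read from stdin. check_sectors is a
# hypothetical name; the column positions assume the table layout above.
check_sectors() {
    awk '
        # Attribute rows start with a numeric ID; the raw value is column 10.
        $1 ~ /^[0-9]+$/ && ($2 == "Reallocated_Sector_Ct" ||
                             $2 == "Current_Pending_Sector" ||
                             $2 == "Offline_Uncorrectable") {
            status = ($10 + 0 > 0) ? "WARN" : "ok"
            printf "%s %s raw=%s\n", status, $2, $10
        }
    '
}

# Example (not run automatically):
#   smartctl -A /dev/sdb | check_sectors
```

On the new Hitachi above, all three counters are 0, so every line would come out as "ok".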
On a failing drive you'll get kernel errors like these (the drive is about to die and will be taken offline):
Jun 5 11:56:10 box11 kernel: end_request: I/O error, dev sdb, sector 0
Jun 5 11:56:10 box11 kernel: Buffer I/O error on device sdb, logical block 0
Jun 5 11:56:10 box11 kernel: Buffer I/O error on device sdb, logical block 1
Jun 5 11:56:10 box11 kernel: Buffer I/O error on device sdb, logical block 2
Jun 5 11:56:10 box11 kernel: Buffer I/O error on device sdb, logical block 3
Jun 5 11:56:10 box11 kernel: end_request: I/O error, dev sdb, sector 0
May 31 22:02:50 box11 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
May 31 22:02:50 box11 kernel: end_request: I/O error, dev sdb, sector 0
May 31 22:02:50 box11 kernel: Buffer I/O error on device sdb, logical block 0
May 31 22:02:50 box11 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
May 31 22:02:50 box11 kernel: end_request: I/O error, dev sdb, sector 3907029160
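To see at a glance which device the errors belong to, the log can be summarized per device. A small sketch, assuming the "end_request: I/O error, dev X" wording shown above (other kernel versions may phrase it differently); io_errors_by_dev is a made-up name, and /var/log/messages is only a guess at where your syslog ends up.

```shell
#!/bin/sh
# Sketch: count "I/O error, dev X" lines per device in kernel log text on
# stdin. io_errors_by_dev is a hypothetical name; the pattern matches the
# end_request lines above and is an assumption for other kernels.
io_errors_by_dev() {
    sed -n 's|.*I/O error, dev \([a-z0-9]*\),.*|\1|p' | sort | uniq -c |
        awk '{ print $2, $1 }'
}

# Examples (not run automatically):
#   dmesg | io_errors_by_dev
#   io_errors_by_dev < /var/log/messages
```

Each output line is a device name followed by its error count; the "Buffer I/O error" lines are deliberately not counted, since they repeat the same failure.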
Here's an example of how mdadm reacts to the failed disk (I was also experiencing bizarre database issues, crashes, freezes, etc.):
May 31 12:45:54 box11 kernel: ata4.00: exception Emask 0x40 SAct 0x0 SErr 0x800 action 0x6 frozen
May 31 12:45:54 box11 kernel: ata4: SError: { HostInt }
May 31 12:45:54 box11 kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
May 31 12:45:54 box11 kernel: res 40/00:04:b8:21:7e/cd:00:0a:00:00/40 Emask 0x44 (timeout)
May 31 12:45:54 box11 kernel: ata4.00: status: { DRDY }
May 31 12:45:54 box11 kernel: ata4: hard resetting link
May 31 12:46:00 box11 kernel: ata4: link is slow to respond, please be patient (ready=0)
May 31 12:46:04 box11 kernel: ata4: softreset failed (device not ready)
May 31 12:46:04 box11 kernel: ata4: hard resetting link
May 31 12:46:10 box11 kernel: ata4: link is slow to respond, please be patient (ready=0)
May 31 12:46:14 box11 kernel: ata4: softreset failed (device not ready)
May 31 12:46:14 box11 kernel: ata4: hard resetting link
May 31 12:46:20 box11 kernel: ata4: link is slow to respond, please be patient (ready=0)
May 31 12:46:54 box11 kernel: ata4: softreset failed (device not ready)
May 31 12:46:54 box11 kernel: ata4: limiting SATA link speed to 1.5 Gbps
May 31 12:46:54 box11 kernel: ata4: hard resetting link
May 31 12:47:01 box11 kernel: ata4: softreset failed (device not ready)
May 31 12:47:03 box11 kernel: ata4: reset failed, giving up
May 31 12:47:03 box11 kernel: ata4.00: disabled
May 31 12:47:03 box11 kernel: sd 3:0:0:0: timing out command, waited 30s
May 31 12:47:03 box11 kernel: ata4: EH complete
May 31 12:47:03 box11 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
May 31 12:47:03 box11 kernel: end_request: I/O error, dev sdb, sector 117210047
May 31 12:47:03 box11 kernel: raid1: Disk failure on sdb1, disabling device.
May 31 12:47:03 box11 kernel: Operation continuing on 1 devices
May 31 12:47:03 box11 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
May 31 12:47:04 box11 kernel: end_request: I/O error, dev sdb, sector 1152390464
May 31 12:47:04 box11 kernel: raid1: Disk failure on sdb3, disabling device.
May 31 12:47:04 box11 kernel: Operation continuing on 1 devices
May 31 12:47:04 box11 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
May 31 12:47:04 box11 kernel: end_request: I/O error, dev sdb, sector 749321600
May 31 12:47:04 box11 kernel: raid1: sdb3: rescheduling sector 573506240
May 31 12:47:04 box11 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
May 31 12:47:04 box11 kernel: end_request: I/O error, dev sdb, sector 750128368
May 31 12:47:04 box11 kernel: raid1: sdb3: rescheduling sector 574313008
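After a log like that, the array keeps running on one disk, which is easy to miss. A rough sketch for spotting degraded arrays, assuming the usual /proc/mdstat layout where each array has a status such as [UU] (healthy) or [U_] (one member missing). degraded_arrays is a hypothetical name.

```shell
#!/bin/sh
# Sketch: list md arrays that are running degraded, given /proc/mdstat text
# on stdin. degraded_arrays is a hypothetical name; the [U_] status format
# is assumed from the usual mdstat layout.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }           # remember the current array
        /\[[U_]*_[U_]*\]/ && name != "" {     # an underscore marks a missing member
            print name
            name = ""
        }
    '
}

# Example (not run automatically):
#   degraded_arrays < /proc/mdstat
```

For more detail on a single array, mdadm --detail on the md device shows which member was marked faulty.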
The first sign of my dying drive was infrequent errors like these during heavy reads:
May 15 04:46:59 box11 kernel: ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
May 15 04:46:59 box11 kernel: ata2.00: irq_stat 0x40000008
May 15 04:46:59 box11 kernel: ata2.00: cmd 60/80:a0:c0:ad:76/00:00:16:00:00/40 tag 20 ncq 65536 in
May 15 04:46:59 box11 kernel: res 41/40:00:28:ae:76/cd:00:16:00:00/40 Emask 0x409 (media error) <F>
May 15 04:46:59 box11 kernel: ata2.00: status: { DRDY ERR }
May 15 04:46:59 box11 kernel: ata2.00: error: { UNC }
May 15 04:46:59 box11 kernel: ata2.00: configured for UDMA/133
May 15 04:46:59 box11 kernel: ata2: EH complete
May 15 04:46:59 box11 kernel: SCSI device sda: 3907029168 512-byte hdwr sectors (2000399 MB)
May 15 04:46:59 box11 kernel: sda: Write Protect is off
May 15 04:46:59 box11 kernel: SCSI device sda: drive cache: write back
May 15 04:47:01 box11 kernel: ata2.00: exception Emask 0x0 SAct 0x7fcfffdf SErr 0x0 action 0x0
May 15 04:47:13 box11 kernel: ata2.00: irq_stat 0x40000008
May 15 04:47:13 box11 kernel: ata2.00: cmd 60/80:50:c0:ad:76/00:00:16:00:00/40 tag 10 ncq 65536 in
May 15 04:47:13 box11 kernel: res 41/40:00:28:ae:76/cd:00:16:00:00/40 Emask 0x409 (media error) <F>
May 15 04:47:13 box11 kernel: ata2.00: status: { DRDY ERR }
May 15 04:47:13 box11 kernel: ata2.00: error: { UNC }
May 15 04:47:13 box11 kernel: ata2.00: configured for UDMA/133
With Linux or whatever OS you run, you'll find that you begin to have random crashes. The Linux console might offer clues, but the cause can be hard to track down. I'm about 99% positive one of my WD EARS drives has been causing crashes on my computers for months. I've realized it can be identified by a high "Multi-Zone Error Rate", and if the Load Cycle Count is also high (mine is over 1 million), you can expect the drive is on the way out.
The problem is that you can't do a full badblocks or dd write test on an existing partition without destroying data. There's no good way of testing write issues except by taking the drive off-line and/or willfully destroying your data.
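One partial option, as far as I know, is the non-destructive read-write mode of badblocks (-n). It still needs the filesystem unmounted, so the drive is effectively offline for the duration. The sketch below refuses to start if the device shows up in the mounts table. The names is_mounted and safe_rw_test are hypothetical, and the check is an assumption-laden one: it does not notice md RAID members or swap, which never appear in /proc/mounts under their own device name.

```shell
#!/bin/sh
# Sketch: refuse to run a read-write surface test while the device is mounted.
# is_mounted and safe_rw_test are hypothetical names. The optional second
# argument (a mounts table path) exists only so the check can be tried on a
# sample file. md members and swap are NOT detected by this check.
is_mounted() {
    dev="$1"
    mounts="${2:-/proc/mounts}"
    # Prefix match so that /dev/sdb also catches /dev/sdb1, /dev/sdb3, ...
    grep -q "^$dev" "$mounts"
}

safe_rw_test() {
    if is_mounted "$1"; then
        echo "refusing: $1 (or a partition on it) is mounted" >&2
        return 1
    fi
    # -n: non-destructive read-write mode, -s: show progress, -v: verbose
    badblocks -nsv "$1"
}

# Example (not run automatically, needs root):
#   safe_rw_test /dev/sdb
```

Even in this mode I would want a current backup first, since a drive that is already failing can get worse under a full write pass.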
=== START OF INFORMATION SECTION ===
Device Model: WDC WD20EARS-00MVWB0
Serial Number: WD-WMAZ20148885
Firmware Version: 50.0AB50
User Capacity: 2,000,398,934,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Sat Sep 17 08:57:20 2011 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (36600) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 179 174 021 Pre-fail Always - 6041
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 19
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 088 088 000 Old_age Always - 9360
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 18
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 13
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 1080578
194 Temperature_Celsius 0x0022 120 109 000 Old_age Always - 30
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 10
I'd also say that if you begin to see a Multi_Zone_Error_Rate higher than 0, consider these serious write errors that will get worse and ultimately start crashing your system and affecting stability. This is what I found: every one of my WD EARS 2TB drives started to have issues as that number crept up. In the above example it started at just 1 or 2 and then got higher, and the crashes got progressively worse.
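That rule of thumb is easy to turn into a check. A rough sketch only: ears_check is a hypothetical name, the default limit of 1000000 is simply the load cycle figure from my own drive rather than any vendor rating, and the column positions assume the attribute table layout shown above.

```shell
#!/bin/sh
# Sketch of the rule of thumb above: warn when Multi_Zone_Error_Rate is above
# 0 or Load_Cycle_Count is above a limit. ears_check is a hypothetical name;
# the default limit is an assumption taken from one drive, not a vendor spec.
ears_check() {
    limit="${1:-1000000}"
    awk -v limit="$limit" '
        $2 == "Multi_Zone_Error_Rate" && $10 + 0 > 0 {
            print "WARN Multi_Zone_Error_Rate raw=" $10; bad = 1
        }
        $2 == "Load_Cycle_Count" && $10 + 0 > limit + 0 {
            print "WARN Load_Cycle_Count raw=" $10; bad = 1
        }
        END { exit bad + 0 }   # non-zero exit status when anything was flagged
    '
}

# Example (not run automatically; /dev/sdX is a placeholder):
#   smartctl -A /dev/sdX | ears_check
```

Running it from cron and keeping the output would also show how fast the number creeps up, which in my case mattered more than the absolute value.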
Some of the errors/crashes showed the following which I believe are due to this dying EARS drive (I'll know in a few weeks since I'm replacing this drive today):
[<c0466769>] free_hot_cold_page+0xfc/0x150
[<c04667d1>] __pagevec_free+0x14/0x1a
[<c0468c6b>] release_pages+0x127/0x12f
[<c04692d1>] __pagevec_release+0x15/0x1d
[<c04697db>] __invalidate_mapping_pages+0x120/0x156
[<c0469818>] invalidate_mapping_pages+0x7/0x9
[<c049c22e>] shrink_icache_memory+0xf5/0x295
[<c046aeab>] shrink_slab+0xfb/0x16e
[<c046b271>] kswapd+0x2d7/0x3fb
[<c0436e1f>] autoremove_wake_function+0x0/0x2d
[<c046af9a>] kswapd+0x0/0x3fb
[<c0436d5d>] kthread+0xc0/0xeb
[<c0436c9d>] kthread+0x0/0xeb
[<c062d243>] kernel_thread_helper+0x7/0x10
Code: 43 1c 31 c0 eb 0d 31 d2 89 f1 55 89 f8 e8 74 f0 ff ff 5a 5b 5e 5f 5d c3 55
89 d5 57 89 c7 56 53 8b 70 20 85 f6 0f 84 e9 00 00 00 <8b> 06 3d 75 62 75 62 0f
84 86 00 00 00 50 56 57 68 28 0e 63 c0
EIP: [<c043b5b4>] ub_page_uncharge+0x13/0x101 SS:ESP 0068:f7861df0
Kernel panic - not syncing: Fatal exception
---------
[<c0466769>] free_hot_cold_page+0xfc/0x150
[<c04667d1>] __pagevec_free+0x14/0x1a
[<c0468c37>] release_pages+0xf3/0x12f
[<c04692d1>] __pagevec_release+0x15/0x1d
[<c0469b12>] truncate_inode_pages_range+0xcc/0x260
[<f916c8d3>] journal_stop+0x208/0x213 [jbd]
[<c0469caf>] truncate_inode_pages+0x9/0xe
[<f91a8a57>] ext3_delete_inode+0x13/0xba [ext3]
[<f91a8a44>] ext3_delete_inode+0x0/0xba [ext3]
[<c049b9ea>] generic_delete_inode+0x91/0xfe
[<c049b4c1>] iput+0x67/0x69
[<c0498df5>] d_kill+0x19/0x32
[<c0499f36>] dput+0x19f/0x1ac
[<c0492717>] sys_renameat+0x15f/0x1af
[<c0472410>] remove_vma+0x47/0x4c
[<c0472e3a>] do_munmap+0x19e/0x1ba
[<c0492778>] sys_rename+0x11/0x15
[<c062c4eb>] syscall_call+0x7/0xb
Code: 43 1c 31 c0 eb 0d 31 d2 89 f1 55 89 f8 e8 74 f0 ff ff 5a 5b 5e 5f 5d c3 55
89 d5 57 89 c7 56 53 8b 70 20 85 f6 0f 84 e9 00 00 00 <8b> 06 3d 75 62 75 62 0f
84 86 00 00 00 50 56 57 68 28 0e 63 c0
EIP: [<c043b5b4>] ub_page_uncharge+0x13/0x101 SS:ESP 0068:f7acbd9c
Kernel panic - not syncing: Fatal exception