How To Test If Your Hard Drive Is Good/Signs of dying hard drive

I like dd for this. It only reads, but a read test of the entire disk will usually uncover a drive that has gone bad in places. It's worth doing at least once a month: a lot of bizarre program behavior, lagginess, crashing and unmounting problems turn out to be caused by a failing disk, and SMART often won't know it or indicate a problem.

We must also remember there's never a guarantee. In my experience, ever since drives moved to larger and more numerous platters (1TB and up), hard drives in general have become more prone to errors, problems and failures that I never saw with smaller drives, even after years of use. It's scary how cheap and poor in quality most hard drives are today; they're still the most unreliable component in a system and, as far as I can tell, are getting less reliable.

My way to test a drive:


It takes a long time, but since dd reads every sector on the whole drive, it should uncover a dying drive that isn't otherwise noticeable.

dd if=/dev/sdb of=/dev/null
3907029168+0 records in
3907029168+0 records out
2000398934016 bytes (2.0 TB) copied, 28390.9 seconds, 70.5 MB/s
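
A minimal sketch of wrapping that read test in a function that reports whether dd finished cleanly. The function name `read_test` and the 1 MiB block size are assumptions for illustration, not part of the original command; dd stops at the first unreadable block and exits non-zero, which is what the function checks.

```shell
#!/bin/sh
# read_test PATH: read every byte of PATH (a block device such as /dev/sdb,
# or any regular file) and throw the data away, like the dd command above.
read_test() {
    target=$1
    # bs=1048576 (1 MiB) is an assumption; the original command used dd's
    # default 512-byte blocks, which is what its record counts reflect.
    if dd if="$target" of=/dev/null bs=1048576; then
        echo "read OK: $target"
    else
        echo "read FAILED: $target"
        return 1
    fi
}
```

Run as root it would be called as `read_test /dev/sdb`. A clean result only means every block was readable this time, so dmesg and smartctl are still worth checking afterwards.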


Here is my new Hitachi 2TB drive tested the same way. Afterwards I checked dmesg and smartctl and found no errors:

3907029168+0 records in
3907029168+0 records out
2000398934016 bytes (2.0 TB) copied, 21359.7 seconds, 93.7 MB/s

smartctl -a /dev/sdb
smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: Hitachi HDS5C3020ALA632
Serial Number: ML4220F318DZ2K
Firmware Version: ML6OA580
User Capacity: 2,000,398,934,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Jun 8 19:14:06 2011 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x80) Offline data collection activity
was never started.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (23306) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 100 100 054 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 100 100 024 Pre-fail Always - 0
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 3
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 020 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 6
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 3
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 3
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 3
194 Temperature_Celsius 0x0002 176 176 000 Old_age Always - 34 (Lifetime Min/Max 25/35)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
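
Rather than reading the whole report every time, the attribute table can be filtered for the sector counters. This is a rough sketch: the function name `smart_warnings` and the choice of these four attributes are my assumptions, and it expects the 10-column table layout shown above (as printed by `smartctl -A`).

```shell
#!/bin/sh
# smart_warnings: read a SMART attribute table on stdin and print
# NAME=RAW for sector-related attributes whose raw value is above zero.
# Prints nothing when all of them are zero.
smart_warnings() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Reallocated_Event_Count|Current_Pending_Sector|Offline_Uncorrectable)$/ &&
        ($10 + 0) > 0 {
            print $2 "=" $10
        }
    '
}
```

Usage would be something like `smartctl -A /dev/sdb | smart_warnings`. For the healthy Hitachi drive above it prints nothing.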

If there's a problem/Signs of a dying hard drive

You'll get errors like this (your drive is about to die and will be taken offline):

Jun 5 11:56:10 box11 kernel: end_request: I/O error, dev sdb, sector 0
Jun 5 11:56:10 box11 kernel: Buffer I/O error on device sdb, logical block 0
Jun 5 11:56:10 box11 kernel: Buffer I/O error on device sdb, logical block 1
Jun 5 11:56:10 box11 kernel: Buffer I/O error on device sdb, logical block 2
Jun 5 11:56:10 box11 kernel: Buffer I/O error on device sdb, logical block 3
Jun 5 11:56:10 box11 kernel: end_request: I/O error, dev sdb, sector 0

May 31 22:02:50 box11 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
May 31 22:02:50 box11 kernel: end_request: I/O error, dev sdb, sector 0
May 31 22:02:50 box11 kernel: Buffer I/O error on device sdb, logical block 0
May 31 22:02:50 box11 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
May 31 22:02:50 box11 kernel: end_request: I/O error, dev sdb, sector 3907029160
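
To check for these after a read test without scrolling through the whole log, the two message formats shown above can be counted with grep. A small sketch; the function name `count_io_errors` is made up for illustration, and other kernel versions may word these messages differently.

```shell
#!/bin/sh
# count_io_errors DEV: read kernel log text on stdin and print how many
# lines report an I/O error for DEV (for example sdb).
count_io_errors() {
    dev=$1
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c -E "(I/O error, dev|Buffer I/O error on device) $dev," || true
}
```

Typical use would be `dmesg | count_io_errors sdb`, or feeding it /var/log/messages. Anything other than 0 deserves a closer look.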

Here's an example of how mdadm reacts to the failed disk (I was also experiencing bizarre database issues, crashes, freezes, etc.):

May 31 12:45:54 box11 kernel: ata4.00: exception Emask 0x40 SAct 0x0 SErr 0x800 action 0x6 frozen
May 31 12:45:54 box11 kernel: ata4: SError: { HostInt }
May 31 12:45:54 box11 kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
May 31 12:45:54 box11 kernel: res 40/00:04:b8:21:7e/cd:00:0a:00:00/40 Emask 0x44 (timeout)
May 31 12:45:54 box11 kernel: ata4.00: status: { DRDY }
May 31 12:45:54 box11 kernel: ata4: hard resetting link
May 31 12:46:00 box11 kernel: ata4: link is slow to respond, please be patient (ready=0)
May 31 12:46:04 box11 kernel: ata4: softreset failed (device not ready)
May 31 12:46:04 box11 kernel: ata4: hard resetting link
May 31 12:46:10 box11 kernel: ata4: link is slow to respond, please be patient (ready=0)
May 31 12:46:14 box11 kernel: ata4: softreset failed (device not ready)
May 31 12:46:14 box11 kernel: ata4: hard resetting link
May 31 12:46:20 box11 kernel: ata4: link is slow to respond, please be patient (ready=0)
May 31 12:46:54 box11 kernel: ata4: softreset failed (device not ready)
May 31 12:46:54 box11 kernel: ata4: limiting SATA link speed to 1.5 Gbps
May 31 12:46:54 box11 kernel: ata4: hard resetting link
May 31 12:47:01 box11 kernel: ata4: softreset failed (device not ready)
May 31 12:47:03 box11 kernel: ata4: reset failed, giving up
May 31 12:47:03 box11 kernel: ata4.00: disabled
May 31 12:47:03 box11 kernel: sd 3:0:0:0: timing out command, waited 30s
May 31 12:47:03 box11 kernel: ata4: EH complete
May 31 12:47:03 box11 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
May 31 12:47:03 box11 kernel: end_request: I/O error, dev sdb, sector 117210047
May 31 12:47:03 box11 kernel: raid1: Disk failure on sdb1, disabling device.
May 31 12:47:03 box11 kernel: Operation continuing on 1 devices
May 31 12:47:03 box11 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
May 31 12:47:04 box11 kernel: end_request: I/O error, dev sdb, sector 1152390464
May 31 12:47:04 box11 kernel: raid1: Disk failure on sdb3, disabling device.
May 31 12:47:04 box11 kernel: Operation continuing on 1 devices
May 31 12:47:04 box11 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
May 31 12:47:04 box11 kernel: end_request: I/O error, dev sdb, sector 749321600
May 31 12:47:04 box11 kernel: raid1: sdb3: rescheduling sector 573506240
May 31 12:47:04 box11 kernel: sd 3:0:0:0: SCSI error: return code = 0x00040000
May 31 12:47:04 box11 kernel: end_request: I/O error, dev sdb, sector 750128368
May 31 12:47:04 box11 kernel: raid1: sdb3: rescheduling sector 574313008
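
Once a member has been kicked out like this, the array keeps running degraded, and that shows up in /proc/mdstat as an underscore in the status field (for example `[U_]` instead of `[UU]`). A hedged sketch of spotting that from a script; the function name `degraded_arrays` is an assumption.

```shell
#!/bin/sh
# degraded_arrays: read /proc/mdstat-style text on stdin and print the
# name of every md array whose status field contains a missing member.
degraded_arrays() {
    awk '
        # Remember which array the following lines belong to.
        /^md[0-9]+ :/ { name = $1 }
        # [UU] means all members are up; an underscore marks a missing one.
        /\[[U_]*_[U_]*\]/ && name != "" { print name }
    '
}
```

It would be run as `degraded_arrays < /proc/mdstat`; no output means no array is missing a member.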

The first sign of my dying drive was infrequent errors like these during heavy reads:

May 15 04:46:59 box11 kernel: ata2.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x0
May 15 04:46:59 box11 kernel: ata2.00: irq_stat 0x40000008
May 15 04:46:59 box11 kernel: ata2.00: cmd 60/80:a0:c0:ad:76/00:00:16:00:00/40 tag 20 ncq 65536 in
May 15 04:46:59 box11 kernel: res 41/40:00:28:ae:76/cd:00:16:00:00/40 Emask 0x409 (media error)
May 15 04:46:59 box11 kernel: ata2.00: status: { DRDY ERR }
May 15 04:46:59 box11 kernel: ata2.00: error: { UNC }
May 15 04:46:59 box11 kernel: ata2.00: configured for UDMA/133
May 15 04:46:59 box11 kernel: ata2: EH complete
May 15 04:46:59 box11 kernel: SCSI device sda: 3907029168 512-byte hdwr sectors (2000399 MB)
May 15 04:46:59 box11 kernel: sda: Write Protect is off
May 15 04:46:59 box11 kernel: SCSI device sda: drive cache: write back
May 15 04:47:01 box11 kernel: ata2.00: exception Emask 0x0 SAct 0x7fcfffdf SErr 0x0 action 0x0
May 15 04:47:13 box11 kernel: ata2.00: irq_stat 0x40000008
May 15 04:47:13 box11 kernel: ata2.00: cmd 60/80:50:c0:ad:76/00:00:16:00:00/40 tag 10 ncq 65536 in
May 15 04:47:13 box11 kernel: res 41/40:00:28:ae:76/cd:00:16:00:00/40 Emask 0x409 (media error)
May 15 04:47:13 box11 kernel: ata2.00: status: { DRDY ERR }
May 15 04:47:13 box11 kernel: ata2.00: error: { UNC }
May 15 04:47:13 box11 kernel: ata2.00: configured for UDMA/133
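
Because these show up only now and then, it helps to tally them per port over a longer stretch of log. A minimal sketch that counts the `error: { UNC }` lines shown above for each ataN.NN device; the function name `unc_by_port` is hypothetical.

```shell
#!/bin/sh
# unc_by_port: read kernel log text on stdin and print, for each ATA
# device (e.g. ata2.00), how many uncorrectable read errors were logged.
unc_by_port() {
    awk '
        index($0, "error: { UNC }") {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^ata[0-9]+\.[0-9]+:$/) {
                    port = $i
                    sub(/:$/, "", port)   # drop the trailing colon
                    count[port]++
                }
            }
        }
        END { for (p in count) print p, count[p] }
    '
}
```

For example, `unc_by_port < /var/log/messages`. A count that keeps growing on the same port points at that drive, or possibly its cable.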

What if the problem occurs only when writing data?

With Linux, or whatever OS you use, you'll find that you begin to have random crashes. The Linux console might offer clues, but this can be hard to track down. I'm about 99% positive that one of my WD EARS drives has been causing crashes on my computers for months, and I've realized it can be identified by a high "Multi-Zone Error Rate". Also, if the Load Cycle Count is high (mine is over 1 million), you can expect the drive is on its way out.

The problem is that you can't do a full badblocks or dd write test on an existing partition without destroying data. There's no good way of testing write issues except by taking the drive off-line and/or willfully destroying your data.
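
What can be done safely on a live filesystem is a partial check: write a scratch file into free space and verify it reads back the same. This is only a sketch and not a substitute for a full-surface write test. It touches only whatever free blocks the filesystem hands out, and the read-back may be served from the page cache rather than the platters. The function name `write_verify` and the default size are assumptions.

```shell
#!/bin/sh
# write_verify DIR [MB]: write MB mebibytes of random data to a scratch
# file in DIR, read it back, and compare checksums. Existing files are
# not touched; the scratch file is removed afterwards.
write_verify() {
    dir=$1
    mb=${2:-16}
    f=$(mktemp "$dir/wtest.XXXXXX") || return 1
    # Checksum the data on its way to disk...
    want=$(dd if=/dev/urandom bs=1048576 count="$mb" 2>/dev/null | tee "$f" | cksum)
    sync
    # ...and again when reading the file back.
    got=$(cksum < "$f")
    rm -f "$f"
    if [ "$want" = "$got" ]; then
        echo "write/read OK"
    else
        echo "write/read MISMATCH"
        return 1
    fi
}
```

Something like `write_verify /mnt/data 1024` would exercise about 1 GiB of free space (the mount point here is just an example). Watch dmesg while it runs, since write failures tend to show up there first.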

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZ20148885
Firmware Version: 50.0AB50
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Sep 17 08:57:20 2011 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)    Offline data collection activity
                    was suspended by an interrupting command from host.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:          (36600) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      ( 255) minutes.
Conveyance self-test routine
recommended polling time:      (   5) minutes.
SCT capabilities:            (0x3035)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   179   174   021    Pre-fail  Always       -       6041
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       19
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       9360
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       18
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       13
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       1080578
194 Temperature_Celsius     0x0022   120   109   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       10 

I'd also say that if you begin to see a Multi_Zone_Error_Rate higher than 0, consider it a sign of serious write errors that will get worse and ultimately start crashing your system and affecting stability. This is what I found: every one of my WD EARS 2TB drives started to have issues as that number crept up. In the above example it started at just 1 or 2 and then climbed, and the crashes got progressively worse.
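
Since the trend matters more than any single reading, it can be worth pulling just the raw value of one attribute so it can be logged over time. A small sketch, assuming the 10-column attribute table layout shown above; the function name `smart_raw` is made up for illustration.

```shell
#!/bin/sh
# smart_raw NAME: read a SMART attribute table on stdin and print the
# raw value (10th column) of the attribute called NAME.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}
```

For example, `smartctl -A /dev/sda | smart_raw Multi_Zone_Error_Rate` from a daily cron job, appended to a log file with the date, would show whether the number is creeping up.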

Some of the errors/crashes showed the following, which I believe is due to this dying EARS drive (I'll know in a few weeks since I'm replacing this drive today):


[] free_hot_cold_page+0xfc/0x150
[] __pagevec_free+0x14/0x1a
[] release_pages+0x127/0x12f
[] __pagevec_release+0x15/0x1d
[] __invalidate_mapping_pages+0x120/0x156
[] invalidate_mapping_pages+0x7/0x9
[] shrink_icache_memory+0xf5/0x295
[] shrink_slab+0xfb/0x16e
[] kswapd+0x2d7/0x3fb
[] autoremove_wake_function+0x0/0x2d
[] kswapd+0x0/0x3fb
[] kthread+0xc0/0xeb
[] kthread+0x0/0xeb
[] kernel_thread_helper+0x7/0x10
Code: 43 1c 31 c0 eb 0d 31 d2 89 f1 55 89 f8 e8 74 f0 ff ff 5a 5b 5e 5f 5d c3 55
89 d5 57 89 c7 56 53 8b 70 20 85 f6 0f 84 e9 00 00 00 <8b> 06 3d 75 62 75 62 0f
84 86 00 00 00 50 56 57 68 28 0e 63 c0
EIP: [] ub_page_uncharge+0x13/0x101 SS:ESP 0068:f7861df0
Kernel panic - not syncing: Fatal exception

---------

[] free_hot_cold_page+0xfc/0x150
[>] __pagevec_free+0x14/0x1a
[>] release_pages+0xf3/0x12f
[>] __pagevec_release+0x15/0x1d
[>] truncate_inode_pages_range+0xcc/0x260
[>] journal_stop+0x208/0x213 [jbd]
[>] truncate_inode_pages+0x9/0xe
[>] ext3_delete_inode+0x13/0xba [ext3]
[>] ext3_delete_inode+0x0/0xba [ext3]
[>] generic_delete_inode+0x91/0xfe
[>] iput+0x67/0x69
[>] d_kill+0x19/0x32
[>] dput+0x19f/0x1ac
[>] sys_renameat+0x15f/0x1af
[>] remove_vma+0x47/0x4c
[>] do_munmap+0x19e/0x1ba
[>] sys_rename+0x11/0x15
[>] syscall_call+0x7/0xb

Code: 43 1c 31 c0 eb 0d 31 d2 89 f1 55 89 f8 e8 74 f0 ff ff 5a 5b 5e 5f 5d c3 55
89 d5 57 89 c7 56 53 8b 70 20 85 f6 0f 84 e9 00 00 00 <8b> 06 3d 75 62 75 62 0f
84 86 00 00 00 50 56 57 68 28 0e 63 c0
EIP: [] ub_page_uncharge+0x13/0x101 SS:ESP 0068:f7acbd9c
Kernel panic - not syncing: Fatal exception

