[NAS] S.M.A.R.T.: the essential tool for checking hard drive health

2022-03-16 20:51 Author: 村雨Mura

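What is SMART, and why does it matter?

SMART is the tool for checking hard drive health. Whether you bought an enterprise drive, a used mining drive, or a cheap drive of doubtful origin, its SMART data gives a direct picture of the drive's health, and therefore of its remaining life.

Hard drives are consumables. However carefully you look after one (dust protection, vibration damping, low temperatures, desiccant, an anti-static bag, and so on), it will degrade over time and will fail one day.

Whether the drive is brand new or has been in service for years, you must read its SMART data, monitor its health through SMART on a regular basis, and be ready to replace it at any time to avoid data loss. It really is that important.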

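For example, I bought a new Seagate Exos and checked it with DiskGenius and HD Tune. It immediately showed three yellow warnings, so it was without doubt a refurbished drive and had to go straight back. (A SMART check does not require reformatting the drive.)

I had tried forcing it into a RAID anyway. It looked normal at first, but after I manually ran several more SMART tests and a few scrubs, TrueNAS stopped accepting it the next day, showed the array as degraded, and flagged the drive as bad.

So for refurbished drives whose SMART data has been altered, running the tests several times and adding some performance testing may expose them early enough to return or exchange. For example: a slow bad-sector scan, then a 3-hour performance test, with 3 SMART long tests interleaved.

Run one SMART check as soon as the drive arrives. If it shows a yellow warning, exchange the drive right away. A clean result does not prove the drive is fine, because the SMART data may have been rewritten.

Next, run a slow bad-sector scan. A refurbished drive normally shows no bad sectors, so this step is a safeguard against low-effort refurbishing where the bad sectors were not even masked. Do nothing else with the drive during the slow scan, or the results will be inaccurate.

Then run a performance test for 1 to 3 hours to see whether heavy read/write I/O turns up any bad blocks. If the drive is refurbished, you want the problem to surface as early as possible. Run another SMART test once this finishes.

Finally, write data. A new drive is going to receive data anyway, so fill it to about half its capacity and run SMART once more to see whether any bad blocks appear.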

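Installing the tool

Most Linux distributions provide a SMART tool called smartmontools (its command is smartctl), which can be installed through the distribution's package manager.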
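The install command itself did not survive in this copy of the article. As a rough sketch, the package is named smartmontools on the common distributions, and the helper below prints a matching install command. The function names and the list of package managers are assumptions made for illustration, not something the article specifies.

```shell
#!/bin/sh
# Print the command that would install smartmontools with a given package
# manager. The set of managers handled here is an assumption for illustration.
smartmontools_install_cmd() {
    case "$1" in
        apt-get) echo "apt-get install -y smartmontools" ;;
        dnf)     echo "dnf install -y smartmontools" ;;
        yum)     echo "yum install -y smartmontools" ;;
        pacman)  echo "pacman -S --noconfirm smartmontools" ;;
        *)       echo "unsupported package manager: $1" >&2; return 1 ;;
    esac
}

# Print the first supported package manager found on PATH, if any.
detect_pkg_manager() {
    for pm in apt-get dnf yum pacman; do
        if command -v "$pm" >/dev/null 2>&1; then
            echo "$pm"
            return 0
        fi
    done
    return 1
}
```

Run the printed command as root, then confirm the installation with `smartctl -V`.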

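A Windows version also exists; search for it yourself.

Using the tool

Most NAS systems ship with it, so it can be used directly. The common commands are covered in the sections that follow.

Difference between long and short:

short is a quick scan and long is a full scan. Recommendation: one short scan per day and one long scan per week.

Difference between foreground and background:

Background is the usual choice, because a foreground test blocks the terminal until it finishes. Either way, the test runs the drive's own built-in SMART self-test routine and does not load the CPU. On my 8 TB drive a long test took a little over an hour. The drive's own estimate appears as the extended self-test polling time in the output of smartctl -c.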
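The "short every day, long once a week" recommendation can be sketched as a small script called once a day from cron or the NAS task scheduler. The function names, and the choice of Sunday for the long test, are assumptions made for illustration.

```shell
#!/bin/sh
# Choose the self-test type for an ISO weekday number (1 = Monday ... 7 = Sunday).
# Putting the long test on Sunday is an arbitrary choice for this sketch.
pick_selftest_type() {
    if [ "$1" -eq 7 ]; then
        echo long
    else
        echo short
    fi
}

# Start today's test on one disk. Without -C, smartctl starts the test in the
# background and returns immediately.
run_scheduled_selftest() {
    disk="$1"
    smartctl -t "$(pick_selftest_type "$(date +%u)")" "$disk"
}
```

A daily job would then call, for example, `run_scheduled_selftest /dev/sdf`. The smartd daemon that ships with smartmontools can schedule self-tests as well, which may be the better fit on a system that already runs it.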

Hands-on

First, see which disks you have.
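The command for this step is missing from this copy of the article. One option is `smartctl --scan`, which prints one line per detected device. The helper below (a hypothetical name) reduces that output to bare device paths.

```shell
#!/bin/sh
# Read `smartctl --scan` output on stdin and print one device path per line.
# Each scan line looks like: /dev/sda -d scsi # /dev/sda, SCSI device
scan_to_devices() {
    awk '$1 ~ /^\/dev\// { print $1 }'
}
```

Typical use is `smartctl --scan | scan_to_devices`. `lsblk -d -o NAME,SIZE,MODEL` is an alternative that also shows size and model.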

Taking the last disk, sdf, as the example, check whether SMART is enabled on it with 'smartctl -i /dev/sdf'. If the output shows Disabled, run 'smartctl -s on /dev/sdf' to enable it.
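The check-then-enable step can be sketched as below. The sketch assumes the wording that smartctl prints for ATA/SATA drives ("SMART support is: Enabled"); the function names are hypothetical.

```shell
#!/bin/sh
# Read `smartctl -i` output on stdin; succeed only if SMART is reported enabled.
# Assumes the ATA/SATA wording "SMART support is: Enabled".
smart_is_enabled() {
    grep -Eq '^SMART support is:[[:space:]]+Enabled'
}

# Enable SMART on a disk only when it is not already enabled.
ensure_smart_enabled() {
    disk="$1"
    if ! smartctl -i "$disk" | smart_is_enabled; then
        smartctl -s on "$disk"
    fi
}
```

For the disk used in this walkthrough the call would be `ensure_smart_enabled /dev/sdf`.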

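Start the test, for example with 'smartctl -t short /dev/sdf' or 'smartctl -t long /dev/sdf'.

When the test finishes, view the results, for example with 'smartctl -l selftest /dev/sdf' or 'smartctl -a /dev/sdf'.

Further notes

Bad-sector scanning can also be done with MHDD, which can attempt repairs as well. For a walkthrough, look for videos on Bilibili or search Baidu.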

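SMART in detail

Original source: https://blog.csdn.net/pansaky/article/details/86650134

Syntax:

    smartctl [options] device

device:

    "/dev/hd[a-t]"    IDE/ATA disks
    "/dev/sd[a-z]"    SCSI devices. Note that SATA disks are accessed through the libata library, so the option "-d ata" may need to be added (recent smartmontools releases normally detect this on their own).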

3.1 [options]:

Options are grouped by type.

3.1.1 Information display options:

    -h    Help.
    -V    Version information.
    -i    Print basic information (device model, serial number, firmware version, ...).
    -a    Print all SMART information for the disk.

3.1.2 Run-time behavior options:

    -q TYPE    Set the quiet mode of the output. TYPE is one of 3 choices:
                   errorsonly    Print only error logs.
                   silent        Print nothing.
                   noserial      Do not print the serial number.

    -d TYPE    Specify the device type. If it is not given, smartctl guesses the type from the device name.

    -T TYPE    Set how tolerant smartctl is of errors, that is, whether it keeps running. TYPE is one of 4 choices:
                   conservative      Exit on any error.
                   normal            Exit if a mandatory SMART command fails.
                   permissive        Ignore one failure of a mandatory SMART command.
                   verypermissive    Ignore all failures of mandatory SMART commands.

    -b TYPE    Set what smartctl does on a checksum error. TYPE is one of 3 choices:
                   warn      Print a warning and continue.
                   exit      Exit smartctl.
                   ignore    Continue without a warning.

    -r TYPE    Intended for smartmontools developers.

    -n POWERMODE    Set whether smartctl goes ahead with the check when the disk is in a power-saving mode. By default the power mode is ignored and the check runs. POWERMODE is one of 4 choices:
                   never      Always check.
                   sleep      Check unless the disk is in sleep mode.
                   standby    Check unless the disk is in sleep or standby mode.
                   idle       Check unless the disk is in sleep, standby, or idle mode.

3.1.3 SMART feature on/off options:

    -s on/off    Enable or disable the disk's SMART feature.
    -o on/off    Enable or disable SMART automatic offline testing, which scans the disk for defects every 4 hours.
    -S on/off    Enable or disable autosave of vendor-specific attributes.

3.1.4 SMART read and display options:

    -H    Report whether the disk is healthy. An unhealthy report means the disk has already failed or is predicted to fail within 24 hours.
    -c    Show the general SMART capabilities the disk supports and their current state.
    -A    Show the vendor-specific SMART attributes the disk supports. Attributes are numbered from 1 to 253 and have specific names.
    -l TYPE    Select which log to show. TYPE is one of 4 choices:
                   error        Show only the error log.
                   selftest     Show only the self-test log.
                   selective    Show only the selective self-test log.
                   directory    Show only the Log Directory.
    -v N,OPTION    Use a vendor-specific presentation when displaying vendor-specific SMART attribute N.
    -F TYPE    Set how smartctl behaves around known but not yet fixed hardware or software bugs.
    -P TYPE    Set whether smartctl applies the parameters stored in its drive database to the disk.

3.1.5 SMART offline test and self-test options:

    -t TEST    Start a test immediately. Can be combined with -C. TEST is one of:
                   offline       Offline test. Can be used on a disk with mounted file systems.
                   short         Short test. Can be used on a disk with mounted file systems.
                   long          Long test. Can be used on a disk with mounted file systems.
                   conveyance    [ATA only] Conveyance self-test. Can be used on a disk with mounted file systems.
                   select,N-M
                   select,N+SIZE    [ATA only] Selective test of part of the disk's LBAs. N is the starting LBA, M is the ending LBA, and SIZE is the number of LBAs to test.
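The two selective forms describe the same kind of span: N+SIZE covers SIZE blocks starting at N, so its last LBA is N + SIZE - 1. The smartctl manual gives select,10-20 and select,10+11 as equivalent. A small sketch of the conversion, with a hypothetical function name:

```shell
#!/bin/sh
# Convert "start LBA + number of LBAs" into the equivalent inclusive N-M range.
# The last LBA of the span is start + size - 1.
select_span_to_range() {
    start="$1"
    size="$2"
    echo "${start}-$((start + size - 1))"
}
```

`smartctl -t "select,$(select_span_to_range 10 11)" /dev/sda` would then test the same blocks as `smartctl -t select,10+11 /dev/sda`.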

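    -C    Run the test in captive mode.
          Notes: (1) -C must be used together with -t, and it has no effect with -t offline.
                 (2) -C keeps the disk very busy, so it is best used on a disk with no mounted file systems.

    -X    Abort a test running in non-captive mode.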

3.2 Common examples

3.2.1 Check the current overall health status

Check the current overall health status of /dev/sda. PASSED means healthy; anything else means the disk has already failed or will fail soon.

    smartctl -H /dev/sda
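For scripting, the PASSED check can be reduced to an exit status. The sketch below assumes the result line that smartctl prints for ATA drives ("test result: PASSED") and the one it prints for SCSI drives ("SMART Health Status: OK"); the function name is hypothetical.

```shell
#!/bin/sh
# Read `smartctl -H` output on stdin; succeed only if the drive reports
# PASSED (ATA wording) or OK (SCSI wording).
smart_health_ok() {
    grep -Eq '(test result: PASSED|SMART Health Status: OK)'
}
```

A monitoring job could run `smartctl -H /dev/sda | smart_health_ok || echo "check /dev/sda"`. As the article notes earlier, a passing result alone does not rule out a doctored drive.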

3.2.2 Show all information

Print all SMART information for /dev/sda.

    smartctl -a /dev/sda

This is equivalent to running, in order:

    smartctl -i /dev/sda
    smartctl -c /dev/sda
    smartctl -A /dev/sda
    smartctl -l error /dev/sda
    smartctl -l selftest /dev/sda
    smartctl -l selective /dev/sda

3.2.3 Turn SMART on or off

Enable or disable SMART on /dev/sda.

    smartctl -s on /dev/sda
    smartctl -s off /dev/sda

To check whether SMART is currently enabled, use the -i option.

    smartctl -i /dev/sda

3.2.4 Offline test

Run an offline test on /dev/sda. Its results are mainly used to update the SMART attributes.

    smartctl -t offline /dev/sda

3.2.5 Short test

Run a short test on /dev/sda.

    smartctl -t short /dev/sda

3.2.5.1 Watching test progress

The -c option shows the progress of the test:

    # smartctl -c /dev/sda
    ...
    Self-test execution status:      ( 242) Self-test routine in progress...
                                            20% of test remaining.
    ...
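To poll progress from a script, the percentage can be pulled out of that output. This is a minimal sketch that assumes the "NN% of test remaining." line format shown above; the function name is hypothetical.

```shell
#!/bin/sh
# Read `smartctl -c` output on stdin and print the percentage of the running
# self-test that is still remaining. Prints nothing when no test is running.
selftest_remaining() {
    sed -n 's/^[[:space:]]*\([0-9][0-9]*\)% of test remaining.*/\1/p'
}
```

`smartctl -c /dev/sda | selftest_remaining` can then be called in a loop with a sleep between calls until the output is empty.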

3.2.5.2 Viewing test results

The -l selftest option shows the recorded test results for /dev/sda:

In entry "# 1", Completed without error means the test finished with no errors.

In entry "# 2", Aborted by host means the test was stopped by the user with 90% still remaining.

    # smartctl -l selftest /dev/sda
    ...
    Num  Test_Description    Status                  Remaining  Lifetime(hours)  LBA_of_first_error
    # 1  Short offline       Completed without error       00%      9535         -
    # 2  Extended offline    Aborted by host               90%      9534         -
    ...
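A script can flag log entries that recorded a failing block by looking at the last column. The sketch below assumes the column layout shown above, where LBA_of_first_error is "-" for entries without a read error; the function name is hypothetical.

```shell
#!/bin/sh
# Read `smartctl -l selftest` output on stdin and print every log entry whose
# LBA_of_first_error column (the last field) holds something other than "-".
selftest_errors() {
    awk '/^# *[0-9]+/ && $NF != "-" { print }'
}
```

With the two entries shown above, `smartctl -l selftest /dev/sda | selftest_errors` prints nothing. An aborted test is not treated as an error here because its last column is still "-".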

3.2.6 View SMART attribute values

The -A option shows the SMART attribute values of /dev/sda.

    smartctl -A /dev/sda
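To watch a few attributes without reading the whole table, the raw value can be extracted by ID. The sketch below assumes the ten-column ATA layout of smartctl -A, with RAW_VALUE as the tenth field. Checking IDs 5, 197 and 198 (reallocated, pending and uncorrectable sectors) is one common choice, not a rule from the article, and the function names are hypothetical.

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print the raw value of one attribute ID.
# Assumes the ATA layout where RAW_VALUE is the tenth column.
smart_raw_value() {
    awk -v id="$1" '$1 == id { print $10 }'
}

# Read `smartctl -A` output on stdin; succeed if attribute 5, 197 or 198
# has a raw value above zero.
has_bad_sector_signs() {
    attrs=$(cat)
    for id in 5 197 198; do
        raw=$(printf '%s\n' "$attrs" | smart_raw_value "$id")
        if [ -n "$raw" ] && [ "$raw" -gt 0 ]; then
            return 0
        fi
    done
    return 1
}
```

Example: `smartctl -A /dev/sda | has_bad_sector_signs && echo "inspect /dev/sda"`.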

3.4 SMART attributes

smartctl -A /dev/sda lists many SMART attributes of the disk, from which you can tell whether the disk is healthy.

The table below explains what each attribute means:

ID | Hex | Attribute name | Description
01 | 0x01 | Read Error Rate | (Vendor specific raw value.) Stores data related to the rate of hardware read errors that occurred when reading data from a disk surface. The raw value has different structure for different vendors and is often not meaningful as a decimal number.
02 | 0x02 | Throughput Performance | Overall (general) throughput performance of a hard disk drive. If the value of this attribute is decreasing there is a high probability that there is a problem with the disk.
03 | 0x03 | Spin-Up Time | Average time of spindle spin up (from zero RPM to fully operational [milliseconds]).
04 | 0x04 | Start/Stop Count | A tally of spindle start/stop cycles. The spindle turns on, and hence the count is increased, both when the hard disk is turned on after having been turned entirely off (disconnected from power source) and when the hard disk returns from having previously been put to sleep mode.
05 | 0x05 | Reallocated Sectors Count | Count of reallocated sectors. When the hard drive finds a read/write/verification error, it marks that sector as “reallocated” and transfers data to a special reserved area (spare area). This process is also known as remapping, and reallocated sectors are called “remaps”. The raw value normally represents a count of the bad sectors that have been found and remapped. Thus, the higher the attribute value, the more sectors the drive has had to reallocate. This allows a drive with bad sectors to continue operation; however, a drive which has had any reallocations at all is significantly more likely to fail in the near future. While primarily used as a metric of the life expectancy of the drive, this number also affects performance. As the count of reallocated sectors increases, the read/write speed tends to become worse because the drive head is forced to seek to the reserved area whenever a remap is accessed. A workaround which will preserve drive speed at the expense of capacity is to create a disk partition over the region which contains remaps and instruct the operating system to not use that partition.
06 | 0x06 | Read Channel Margin | Margin of a channel while reading data. The function of this attribute is not specified.
07 | 0x07 | Seek Error Rate | (Vendor specific raw value.) Rate of seek errors of the magnetic heads. If there is a partial failure in the mechanical positioning system, then seek errors will arise. Such a failure may be due to numerous factors, such as damage to a servo, or thermal widening of the hard disk. The raw value has different structure for different vendors and is often not meaningful as a decimal number.
08 | 0x08 | Seek Time Performance | Average performance of seek operations of the magnetic heads. If this attribute is decreasing, it is a sign of problems in the mechanical subsystem.
09 | 0x09 | Power-On Hours (POH) | Count of hours in power-on state. The raw value of this attribute shows total count of hours (or minutes, or seconds, depending on manufacturer) in power-on state.
10 | 0x0A | Spin Retry Count | Count of retry of spin start attempts. This attribute stores a total count of the spin start attempts to reach the fully operational speed (under the condition that the first attempt was unsuccessful). An increase of this attribute value is a sign of problems in the hard disk mechanical subsystem.
11 | 0x0B | Recalibration Retries or Calibration Retry Count | This attribute indicates the count that recalibration was requested (under the condition that the first attempt was unsuccessful). An increase of this attribute value is a sign of problems in the hard disk mechanical subsystem.
12 | 0x0C | Power Cycle Count | This attribute indicates the count of full hard disk power on/off cycles.
13 | 0x0D | Soft Read Error Rate | Uncorrected read errors reported to the operating system.
180 | 0xB4 | Unused Reserved Block Count Total | “Pre-Fail” attribute used at least in HP devices.
183 | 0xB7 | SATA Downshift Error Count | Western Digital and Samsung attribute.
184 | 0xB8 | End-to-End error / IOEDC | This attribute is a part of Hewlett-Packard’s SMART IV technology, as well as part of other vendors’ IO Error Detection and Correction schemas, and it contains a count of parity errors which occur in the data path to the media via the drive’s cache RAM.
185 | 0xB9 | Head Stability | Western Digital attribute.
186 | 0xBA | Induced Op-Vibration Detection | Western Digital attribute.
187 | 0xBB | Reported Uncorrectable Errors | The count of errors that could not be recovered using hardware ECC.
188 | 0xBC | Command Timeout | The count of aborted operations due to HDD timeout. Normally this attribute value should be equal to zero and if the value is far above zero, then most likely there will be some serious problems with power supply or an oxidized data cable.
189 | 0xBD | High Fly Writes | HDD producers implement a Fly Height Monitor that attempts to provide additional protections for write operations by detecting when a recording head is flying outside its normal operating range. If an unsafe fly height condition is encountered, the write process is stopped, and the information is rewritten or reallocated to a safe region of the hard drive. This attribute indicates the count of these errors detected over the lifetime of the drive. This feature is implemented in most modern Seagate drives and some of Western Digital’s drives, beginning with the WD Enterprise WDE18300 and WDE9180 Ultra2 SCSI hard drives, and will be included on all future WD Enterprise products.
190 | 0xBE | Airflow Temperature (WDC) or Airflow Temperature Celsius (HP) | Airflow temperature on Western Digital HDs (same as temp. [C2], but current value is 50 less for some models. Marked as obsolete.)
191 | 0xBF | G-sense Error Rate | The count of errors resulting from externally-induced shock & vibration.
192 | 0xC0 | Power-off Retract Count or Emergency Retract Cycle Count (Fujitsu) | Count of times the heads are loaded off the media. Heads can be unloaded without actually powering off.
193 | 0xC1 | Load Cycle Count or Load/Unload Cycle Count (Fujitsu) | Count of load/unload cycles into head landing zone position. The typical lifetime rating for laptop (2.5-in) hard drives is 300,000 to 600,000 load cycles. Some laptop drives are programmed to unload the heads whenever there has not been any activity for about five seconds. Many Linux installations write to the file system a few times a minute in the background. As a result, there may be 100 or more load cycles per hour, and the load cycle rating may be exceeded in less than a year.
194 | 0xC2 | Temperature or Temperature Celsius | Current internal temperature.
195 | 0xC3 | Hardware ECC Recovered | (Vendor specific raw value.) The raw value has different structure for different vendors and is often not meaningful as a decimal number.
196 | 0xC4 | Reallocation Event Count | Count of remap operations. The raw value of this attribute shows the total count of attempts to transfer data from reallocated sectors to a spare area. Both successful & unsuccessful attempts are counted.
197 | 0xC5 | Current Pending Sector Count | Count of “unstable” sectors (waiting to be remapped, because of read errors). If an unstable sector is subsequently read successfully, this value is decreased and the sector is not remapped. Read errors on a sector will not remap the sector (since it might be readable later); instead, the drive firmware remembers that the sector needs to be remapped, and remaps it the next time it’s written.
198 | 0xC6 | Uncorrectable Sector Count or Offline Uncorrectable or Off-Line Scan Uncorrectable Sector Count | The total count of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem.
199 | 0xC7 | UltraDMA CRC Error Count | The count of errors in data transfer via the interface cable as determined by ICRC (Interface Cyclic Redundancy Check).
200 | 0xC8 | Multi-Zone Error Rate | The count of errors found when writing a sector. The higher the value, the worse the disk’s mechanical condition is.
200 | 0xC8 | Write Error Rate (Fujitsu) | The total count of errors when writing a sector.
201 | 0xC9 | Soft Read Error Rate or TA Counter Detected | Count of off-track errors.
202 | 0xCA | Data Address Mark errors or TA Counter Increased | Count of Data Address Mark errors (or vendor-specific).
203 | 0xCB | Run Out Cancel | Count of ECC errors.
204 | 0xCC | Soft ECC Correction | Count of errors corrected by software ECC.
205 | 0xCD | Thermal Asperity Rate (TAR) | Count of errors due to high temperature.
206 | 0xCE | Flying Height | Height of heads above the disk surface. A flying height that’s too low increases the chances of a head crash while a flying height that’s too high increases the chances of a read/write error.
207 | 0xCF | Spin High Current | Amount of surge current used to spin up the drive.
208 | 0xD0 | Spin Buzz | Count of buzz routines needed to spin up the drive due to insufficient power.
209 | 0xD1 | Offline Seek Performance | Drive’s seek performance during its internal tests.
210 | 0xD2 | Unknown | (Found in Maxtor 6B200M0 200GB and Maxtor 2R015H1 15GB disks.)
211 | 0xD3 | Vibration During Write | Vibration during write.
212 | 0xD4 | Shock During Write | Shock during write.
220 | 0xDC | Disk Shift | Distance the disk has shifted relative to the spindle (usually due to shock or temperature). Unit of measure is unknown.
222 | 0xDE | Loaded Hours | Time spent operating under data load (movement of magnetic head armature).
223 | 0xDF | Load/Unload Retry Count | Count of times head changes position.
224 | 0xE0 | Load Friction | Resistance caused by friction in mechanical parts while operating.
225 | 0xE1 | Load/Unload Cycle Count | Total count of load cycles.
226 | 0xE2 | Load ‘In’-time | Total time of loading on the magnetic heads actuator (time not spent in parking area).
227 | 0xE3 | Torque Amplification Count | Count of attempts to compensate for platter speed variations.
228 | 0xE4 | Power-Off Retract Cycle | The count of times the magnetic armature was retracted automatically as a result of cutting power.
230 | 0xE6 | GMR Head Amplitude | Amplitude of “thrashing” (distance of repetitive forward/reverse head motion).
231 | 0xE7 | Temperature | Drive temperature.
232 | 0xE8 | Endurance Remaining | Number of physical erase cycles completed on the drive as a percentage of the maximum physical erase cycles the drive is designed to endure.
232 | 0xE8 | Available Reserved Space | Intel SSD reports the amount of available reserved space as a percentage of reserved space in a brand new SSD.
233 | 0xE9 | Power-On Hours | Number of hours elapsed in the power-on state.
233 | 0xE9 | Media Wearout Indicator | Intel SSD reports a normalized value of 100 (when the SSD is new) and declines to a minimum value of 1. It decreases while the NAND erase cycles increase from 0 to the maximum-rated cycles.
240 | 0xF0 | Head Flying Hours | Time while head is positioning.
240 | 0xF0 | Transfer Error Rate (Fujitsu) | Count of times the link is reset during a data transfer.
241 | 0xF1 | Total LBAs Written | Total count of LBAs written.
242 | 0xF2 | Total LBAs Read | Total count of LBAs read. Some S.M.A.R.T. utilities will report a negative number for the raw value since in reality it has 48 bits rather than 32.
250 | 0xFA | Read Error Retry Rate | Count of errors while reading from a disk.
254 | 0xFE | Free Fall Protection | Count of “Free Fall Events” detected.

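3.5 SMART self-test

smartctl -t offline/short/long runs a self-test on the specified disk.

offline:

    This is the default self-test.

short:

    The purpose of the short self-test is to confirm quickly whether the disk is faulty.

    The test consists of many items, all defined by the disk vendor, such as the following:

    a) Electrical tests, which check the disk's internal circuitry. The details are set by the disk vendor, for example:
        A) Cache test.
        B) Read/write circuitry test.
        C) Read/write head test.
    b) Seek and servo tests, which check the disk's ability to seek to and servo on data tracks.
    c) Read/verify tests, which check the disk's ability to read part or all of the disk.

long:

    Known as the extended self-test. It covers the same items as short but takes much longer.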
