Хабрахабр

ZFS: Testing Reliability on Bad Disks

ZFS is famous for its reliability, and I happened to have several battle-worn disks on hand. Let's load some data onto a raidz1 pool, then check the data for integrity and ZFS for resilience in this situation.

The disks are attached to an Adaptec RAID 2805 as "Simple Volume" devices, with raidz1 deployed on top using the default settings of the FreeBSD 11.1 installer.

ZFS configuration

zfs get all zroot
NAME PROPERTY VALUE SOURCE
zroot type filesystem -
zroot creation пт янв. 26 15:40 2018 -
zroot used 75,5G -
zroot available 10,1T -
zroot referenced 128K -
zroot compressratio 1.01x -
zroot mounted yes -
zroot quota none default
zroot reservation none default
zroot recordsize 128K default
zroot mountpoint /zroot local
zroot sharenfs off default
zroot checksum on default
zroot compression lz4 local
zroot atime off local
zroot devices on default
zroot exec on default
zroot setuid on default
zroot readonly off default
zroot jailed off default
zroot snapdir hidden default
zroot aclmode discard default
zroot aclinherit restricted default
zroot canmount on default
zroot xattr off temporary
zroot copies 1 default
zroot version 5 -
zroot utf8only off -
zroot normalization none -
zroot casesensitivity sensitive -
zroot vscan off default
zroot nbmand off default
zroot sharesmb off default
zroot refquota none default
zroot refreservation none default
zroot primarycache all default
zroot secondarycache all default
zroot usedbysnapshots 0 -
zroot usedbydataset 128K -
zroot usedbychildren 75,5G -
zroot usedbyrefreservation 0 -
zroot logbias latency default
zroot dedup off default
zroot mlslabel -
zroot sync standard default
zroot refcompressratio 1.00x -
zroot written 128K -
zroot logicalused 75,8G -
zroot logicalreferenced 11,5K -
zroot volmode default default
zroot filesystem_limit none default
zroot snapshot_limit none default
zroot filesystem_count none default
zroot snapshot_count none default
zroot redundant_metadata all default

S.M.A.R.T. data of the disks

aacd0p4

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD40EFRX-68WT0N0
Serial Number: WD-WCC4E6PN673U
LU WWN Device Id: 5 0014ee 20d6399e3
Firmware Version: 82.00A82
User Capacity: 4 000 787 030 016 bytes [4,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Jan 26 15:25:30 2018 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status command failed: Input/output error
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 296
3 Spin_Up_Time 0x0027 210 208 021 Pre-fail Always - 6483
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 14
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 100 253 000 Old_age Always - 0
9 Power_On_Hours 0x0032 090 090 000 Old_age Always - 7986
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 14
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 10
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 186
194 Temperature_Celsius 0x0022 112 111 000 Old_age Always - 40
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 17
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 55

aacd1p4

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD40EFRX-68WT0N0
Serial Number: WD-WCC4E7LPL4AH
LU WWN Device Id: 5 0014ee 2b8213774
Firmware Version: 82.00A82
User Capacity: 4 000 787 030 016 bytes [4,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Jan 26 15:26:19 2018 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status command failed: Input/output error
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 6043
3 Spin_Up_Time 0x0027 207 190 021 Pre-fail Always - 6616
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 10
5 Reallocated_Sector_Ct 0x0033 173 173 140 Pre-fail Always - 803
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 090 090 000 Old_age Always - 7964
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 10
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 7
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 212
194 Temperature_Celsius 0x0022 108 108 000 Old_age Always - 44
196 Reallocated_Event_Count 0x0032 049 049 000 Old_age Always - 151
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 6
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 101
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 232

aacd2p4

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Green
Device Model: WDC WD40EZRX-22SPEB0
Serial Number: WD-WCC4E4KAK52T
LU WWN Device Id: 5 0014ee 2b6900646
Firmware Version: 80.00A80
User Capacity: 4 000 787 030 016 bytes [4,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Jan 26 15:26:43 2018 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status command failed: Input/output error
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 180 173 021 Pre-fail Always - 7983
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 11
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 087 087 000 Old_age Always - 9777
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 11
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 6
193 Load_Cycle_Count 0x0032 138 138 000 Old_age Always - 186920
194 Temperature_Celsius 0x0022 110 109 000 Old_age Always - 42
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

aacd3p4

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Green
Device Model: WDC WD40EZRX-00SPEB0
Serial Number: WD-WCC4E5ALXUHC
LU WWN Device Id: 5 0014ee 20c16eacf
Firmware Version: 80.00A80
User Capacity: 4 000 787 030 016 bytes [4,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Fri Jan 26 15:27:03 2018 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status command failed: Input/output error
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 394
3 Spin_Up_Time 0x0027 194 179 021 Pre-fail Always - 7258
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 16
5 Reallocated_Sector_Ct 0x0033 195 195 140 Pre-fail Always - 160
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 090 090 000 Old_age Always - 7584
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 16
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 11
193 Load_Cycle_Count 0x0032 001 001 000 Old_age Always - 678600
194 Temperature_Celsius 0x0022 114 112 000 Old_age Always - 38
196 Reallocated_Event_Count 0x0032 121 121 000 Old_age Always - 79
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 44
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 319
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 1
200 Multi_Zone_Error_Rate 0x0008 199 199 000 Old_age Offline - 530
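
To compare the four reports at a glance, the attribute rows can be filtered for non-zero raw values in the counters that usually point at media problems. This is only a rough sketch: the choice of watched attributes is my own assumption rather than a vendor rule, and the helper names `raw_values` and `suspicious` are made up for illustration.

```python
# Attribute names whose non-zero raw value usually signals media trouble.
# This selection is an assumption made for illustration, not a vendor rule.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def raw_values(smart_text):
    """Map attribute name -> raw value for rows of a smartctl attribute table."""
    values = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # A data row has 10 columns: ID, name, flag, value, worst, thresh,
        # type, updated, when_failed, raw value.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                values[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip the row
    return values


def suspicious(smart_text):
    """Return the watched attributes whose raw value is non-zero."""
    raw = raw_values(smart_text)
    return {name: raw[name] for name in WATCHED if raw.get(name, 0) > 0}


# Rows copied from the aacd1p4 and aacd2p4 reports above.
AACD1 = """\
5 Reallocated_Sector_Ct 0x0033 173 173 140 Pre-fail Always - 803
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 6
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 101
"""
AACD2 = """\
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
"""

print("aacd1p4:", suspicious(AACD1))
print("aacd2p4:", suspicious(AACD2))
```

By this measure aacd1p4 and aacd3p4 are the worst of the set, aacd0p4 shows only a few offline-uncorrectable sectors, and aacd2p4 looks clean apart from its high load cycle count.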

Let the torture begin

As test data, I uploaded about 71 GB of assorted multimedia.

A pile of read errors


aacd3: hard error cmd=read 40789032-40789207
aacd3: hard error cmd=read 40789032-40789207
aacd1: hard error cmd=read 40790712-40790967
aacd3: hard error cmd=read 40789560-40789735
aacd1: hard error cmd=read 40794064-40794319
aacd1: hard error cmd=read 40795208-40795287
aacd1: hard error cmd=read 40795824-40796079
aacd1: hard error cmd=read 40796088-40796343
aacd1: hard error cmd=read 40797240-40797423
aacd3: hard error cmd=read 21743840-21744015
aacd1: hard error cmd=read 28502624-28502799
aacd1: hard error cmd=read 28597680-28597855
aacd1: hard error cmd=read 28635368-28635623
aacd1: hard error cmd=read 37340776-37340951
aacd1: hard error cmd=read 37342712-37342887
aacd1: hard error cmd=read 37347808-37348063
aacd1: hard error cmd=read 37348072-37348327
aacd1: hard error cmd=read 37352168-37352343
aacd1: hard error cmd=read 37359472-37359647
aacd1: hard error cmd=read 37365576-37365831
aacd1: hard error cmd=read 37372960-37373215
aacd1: hard error cmd=read 37373488-37373743
aacd1: hard error cmd=read 37380608-37380863
aacd1: hard error cmd=read 37381136-37381391
aacd1: hard error cmd=read 37382984-37383239
aacd1: hard error cmd=read 57577976-57577999
aacd1: hard error cmd=read 4606480-4606495
aacd1: hard error cmd=read 7811867664-7811867679
aacd1: hard error cmd=read 7811868176-7811868191
aac0: COMMAND 0xfffffe0000e97690 (TYPE 502) TIMEOUT AFTER 137 SECONDS
aac0: COMMAND 0xfffffe0000e91650 (TYPE 502) TIMEOUT AFTER 137 SECONDS
aac0: COMMAND 0xfffffe0000e92d10 (TYPE 502) TIMEOUT AFTER 137 SECONDS
aac0: WARNING! Controller is no longer running! code= 0xbcc90100
aacd3: hard error cmd=read 40785088-40785343
aacd3: hard error cmd=read 40785352-40785607
aacd3: hard error cmd=read 40785616-40785871
aacd3: hard error cmd=read 40788240-40788495
aacd3: hard error cmd=read 40783592-40783847
aacd3: hard error cmd=read 40784648-40784903
aacd3: hard error cmd=read 40785176-40785431
aacd3: hard error cmd=read 40785440-40785695
aacd3: hard error cmd=read 21743928-21744103
aacd1: hard error cmd=read 25407280-25407535
aacd1: hard error cmd=read 28507712-28507967
aacd1: hard error cmd=read 37322056-37322311
aacd1: hard error cmd=read 37344208-37344383
aacd1: hard error cmd=read 37348160-37348415
aacd1: hard error cmd=read 37373488-37373743
aacd1: hard error cmd=read 37380696-37380951
aacd1: hard error cmd=read 37383072-37383327
aacd1: hard error cmd=read 37383776-37384031
aacd1: hard error cmd=read 37395312-37395487
aacd1: hard error cmd=read 37426368-37426623
aacd1: hard error cmd=read 40682424-40682679
aacd1: hard error cmd=read 40702816-40703071
aacd1: hard error cmd=read 40725472-40725647
aacd1: hard error cmd=read 40760224-40760479
aacd1: hard error cmd=read 40761280-40761535
aacd1: hard error cmd=read 40764536-40764711
aacd1: hard error cmd=read 40772144-40772399
aacd1: hard error cmd=read 40774520-40774775
aacd1: hard error cmd=read 40778304-40778559
aacd3: hard error cmd=read 40783592-40783847
aacd3: hard error cmd=read 40784648-40784903
aacd3: hard error cmd=read 40785176-40785431
aacd3: hard error cmd=read 40785440-40785695
aacd1: hard error cmd=read 40785792-40785879
aacd3: hard error cmd=read 40785792-40785871
aacd1: hard error cmd=read 40785792-40785879
aacd1: hard error cmd=read 40790624-40790879
aacd3: hard error cmd=read 40790000-40790175
aacd1: hard error cmd=read 40799280-40799535
aacd3: hard error cmd=read 41121032-41121287
aacd1: hard error cmd=read 44290824-44290999
aacd1: hard error cmd=read 44301408-44301583
aacd1: hard error cmd=read 44315680-44315935
aacd1: hard error cmd=read 44330184-44330359
aacd1: hard error cmd=read 44337224-44337399
aacd1: hard error cmd=read 44344472-44344727
aacd1: hard error cmd=read 51561672-51561927
aacd1: hard error cmd=read 51571528-51571783
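
A log like this is easier to read once it is tallied per device. The following is a minimal sketch that assumes the messages keep exactly the `aacdN: hard error cmd=... first-last` shape seen above; the regular expression and the `summarize` helper are my own, not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Matches lines such as "aacd1: hard error cmd=read 40790712-40790967".
HARD_ERROR = re.compile(r"^(aacd\d+): hard error cmd=(\w+) (\d+)-(\d+)$")


def summarize(log_text):
    """Count hard errors per device and collect the distinct failing ranges."""
    counts = Counter()
    ranges = {}
    for line in log_text.splitlines():
        match = HARD_ERROR.match(line.strip())
        if not match:
            continue  # controller timeouts and other messages are ignored
        device, _cmd, first, last = match.groups()
        counts[device] += 1
        ranges.setdefault(device, set()).add((int(first), int(last)))
    return counts, ranges


# A few lines copied from the log above.
SAMPLE = """\
aacd3: hard error cmd=read 40789032-40789207
aacd3: hard error cmd=read 40789032-40789207
aacd1: hard error cmd=read 40790712-40790967
aac0: COMMAND 0xfffffe0000e97690 (TYPE 502) TIMEOUT AFTER 137 SECONDS
aacd1: hard error cmd=read 7811867664-7811867679
"""

counts, ranges = summarize(SAMPLE)
for device in sorted(counts):
    print(device, counts[device], "errors,", len(ranges[device]), "distinct ranges")
```

Counting distinct ranges separately from total errors helps tell repeated retries of the same spot apart from damage spread across the disk.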

During the test there were many problems with the Adaptec RAID 2805 controller, which crashed with a panic on the problem disks, so I had to restart the check about six times. Nevertheless, the check eventually completed, and no data was lost.

Pool state

  pool: zroot
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub in progress since Fri Jan 26 16:05:37 2018
        67,0G scanned out of 101G at 71,3M/s, 0h8m to go
        2,63M repaired, 66,05% done
config:

        NAME         STATE     READ WRITE CKSUM
        zroot        DEGRADED     0     0     0
          raidz1-0   DEGRADED     0     0     0
            aacd0p4  ONLINE       0     0     0
            aacd1p4  FAULTED     40    93     7  too many errors  (repairing)
            aacd2p4  ONLINE       0     0     0
            aacd3p4  ONLINE       0     0     0

  pool: zroot
 state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 4,54M in 0h18m with 0 errors on Fri Jan 26 16:45:35 2018
config:

        NAME         STATE     READ WRITE CKSUM
        zroot        ONLINE       0     0     0
          raidz1-0   ONLINE       0     0     0
            aacd0p4  ONLINE       0     0     0
            aacd1p4  ONLINE       1     0    11
            aacd2p4  ONLINE       0     0     0
            aacd3p4  ONLINE       0     0     2

errors: No known data errors

As a second confirmation of integrity I used a torrent client, which found no errors in the test set.
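
A torrent re-check works because the client compares hashes of the data on disk with the hashes stored in the torrent metadata. The same idea can be sketched without a torrent as a checksum manifest taken before and after the test. This is not what I actually ran: the helper names, the SHA-256 choice, and the demo file `clip.bin` are all assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large media files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path relative to root to its SHA-256 digest."""
    manifest = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            manifest[os.path.relpath(path, root)] = file_digest(path)
    return manifest


def changed_files(before, after):
    """Paths that are missing or whose digest differs from the first manifest."""
    return sorted(p for p in before if after.get(p) != before[p])


# Demo on a throwaway directory (hypothetical file name and content).
with tempfile.TemporaryDirectory() as root:
    sample = os.path.join(root, "clip.bin")
    with open(sample, "wb") as handle:
        handle.write(b"\x00" * 4096)
    before = build_manifest(root)

    with open(sample, "r+b") as handle:  # simulate a single corrupted byte
        handle.seek(100)
        handle.write(b"\x01")
    after = build_manifest(root)

    print(changed_files(before, after))
```

In a real run the first manifest would be built on the source machine before copying, and the second on the pool after the scrub; an empty result from `changed_files` would match what the torrent client reported.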

Conclusions

ZFS performed well even under deliberately bad conditions. Working against it were worn-out disks and a controller whose firmware crashed because of the disk problems. Notably, writes to the disks were very fast (they saturated the gigabit NIC), whereas reads, with all the errors, took a very long time. What to do about that is up to you 😉
