Rather than start a new thread, I'll ask here:
I have a server that, at some point after a reboot, started lagging badly. It turned out the system is constantly hammering the hard drive. On top of that, trying to read a file I need fails with the error: can not copy file input/output error (5). It looks like something is wrong with the drive.
Here is what I did:
[root@lgate ~]# fdisk -l
Disk /dev/sda: 1000.2 GB, 1000204886016 bytes, 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x5eb7dd01
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048     4194303     2096128   82  Linux swap / Solaris
/dev/sda2         4194304  1953519615   974662656   83  Linux
[root@lgate ~]# tune2fs -c 1 /dev/sda2
After a reboot, the logs show:
Oct 22 10:27:14 localhost smartd[2777]: smartd 6.1 2013-03-16 r3800 [x86_64-linux-3.10.32-std-def-alt1] (ALT Linux 6.1-alt2)
Oct 22 10:27:14 localhost smartd[2777]: Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
Oct 22 10:27:14 localhost smartd[2777]: Opened configuration file /etc/smartd.conf
Oct 22 10:27:14 localhost smartd[2777]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Oct 22 10:27:14 localhost smartd[2777]: Device: /dev/sda, type changed from 'scsi' to 'sat'
Oct 22 10:27:14 localhost smartd[2777]: Device: /dev/sda [SAT], opened
Oct 22 10:27:14 localhost smartd[2777]: Device: /dev/sda [SAT], ST31000528AS, S/N:9VP77WRM, WWN:5-000c50-02066110a, FW:CC38, 1.00 TB
Oct 22 10:27:14 localhost smartd[2777]: Device: /dev/sda [SAT], found in smartd database: Seagate Barracuda 7200.12
Oct 22 10:27:14 localhost smartd[2777]: Device: /dev/sda [SAT], WARNING: A firmware update for this drive may be available,
Oct 22 10:27:14 localhost smartd[2777]: see the following Seagate web pages:
Oct 22 10:27:14 localhost smartd[2777]: http://knowledge.seagate.com/articles/en_US/FAQ/207931en
Oct 22 10:27:14 localhost smartd[2777]: http://knowledge.seagate.com/articles/en_US/FAQ/213891en
Oct 22 10:27:14 localhost smartd[2777]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
Oct 22 10:27:14 localhost smartd[2777]: Monitoring 1 ATA and 0 SCSI devices
Oct 22 10:27:14 localhost smartd[2777]: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
Oct 22 10:27:14 localhost smartd[2777]: Sending warning via <mail> to root ...
Oct 22 10:27:14 localhost smartd[2777]: Warning via <mail> to root: successful
Oct 22 10:27:14 localhost smartd[2777]: Device: /dev/sda [SAT], 1 Offline uncorrectable sectors
Oct 22 10:27:14 localhost smartd[2777]: Sending warning via <mail> to root ...
Oct 22 10:27:14 localhost smartd[2777]: Warning via <mail> to root: successful
Oct 22 10:27:14 localhost smartd[2834]: smartd has fork()ed into background mode. New PID=2834.
Oct 22 10:27:14 localhost smartd[2834]: file /var/run/smartd.pid written containing PID 2834
Oct 22 10:27:15 localhost smartd: smartd startup succeeded
Oct 22 10:27:16 localhost fsck: /dev/sda2 has been mounted 1 times without being checked, check forced.
Oct 22 10:27:16 localhost fsck: ^B
Oct 22 10:27:16 localhost last message repeated 22 times
Oct 22 10:27:16 localhost fsck: /dev/sda2: 64131/60923904 files (0.2% non-contiguous), 208420875/243665664 blocks
Oct 22 10:27:16 localhost rc.sysinit: Checking root filesystem succeeded
And there are many repeated errors like this one:
Oct 22 17:15:34 localhost kernel: [ 317.138152] ata1.00: error: { UNC }
Oct 22 17:15:34 localhost kernel: [ 317.265769] ata1.00: configured for UDMA/133
Oct 22 17:15:34 localhost kernel: [ 317.265786] sd 0:0:0:0: [sda] Unhandled sense code
Oct 22 17:15:34 localhost kernel: [ 317.265788] sd 0:0:0:0: [sda]
Oct 22 17:15:34 localhost kernel: [ 317.265789] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 22 17:15:34 localhost kernel: [ 317.265791] sd 0:0:0:0: [sda]
Oct 22 17:15:34 localhost kernel: [ 317.265792] Sense Key : Medium Error [current] [descriptor]
Oct 22 17:15:34 localhost kernel: [ 317.265794] Descriptor sense data with sense descriptors (in hex):
Oct 22 17:15:34 localhost kernel: [ 317.265795] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Oct 22 17:15:34 localhost kernel: [ 317.265800] 02 61 a8 0e
Oct 22 17:15:34 localhost kernel: [ 317.265803] sd 0:0:0:0: [sda]
Oct 22 17:15:34 localhost kernel: [ 317.265804] Add. Sense: Unrecovered read error - auto reallocate failed
Oct 22 17:15:34 localhost kernel: [ 317.265806] sd 0:0:0:0: [sda] CDB:
Oct 22 17:15:34 localhost kernel: [ 317.265807] Read(10): 28 00 02 61 a8 08 00 00 08 00
Oct 22 17:15:34 localhost kernel: [ 317.265812] end_request: I/O error, dev sda, sector 39954446
Oct 22 17:15:34 localhost kernel: [ 317.265823] ata1: EH complete
Oct 22 17:15:34 localhost kernel: [ 320.328848] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 22 17:15:34 localhost kernel: [ 320.328851] ata1.00: BMDMA stat 0x24
Oct 22 17:15:34 localhost kernel: [ 320.328854] ata1.00: failed command: READ DMA
Oct 22 17:15:35 localhost kernel: [ 320.328858] ata1.00: cmd c8/00:08:08:a8:61/00:00:00:00:00/e2 tag 0 dma 4096 in
Oct 22 17:15:35 localhost kernel: [ 320.328858] res 51/40:00:0e:a8:61/00:00:00:00:00/02 Emask 0x9 (media error)
Oct 22 17:15:35 localhost kernel: [ 320.328860] ata1.00: status: { DRDY ERR }
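The failing sector can be read straight out of the "cmd" and "res" register dumps in the kernel messages above. Below is a minimal Python sketch of that decoding. It assumes the field order libata normally prints (command or status / feature or error : nsect : lbal : lbam : lbah / five "hob" bytes / device) and 28-bit addressing, which fits the READ DMA command shown; the function name is my own.

```python
def lba28_from_taskfile(taskfile: str) -> int:
    """Decode a 28-bit LBA from a libata 'cmd' or 'res' register dump.

    Assumed layout: XX/FE:NS:LL:LM:LH/h:h:h:h:h/DV, where LL, LM, LH are
    LBA bits 0-23 and the low nibble of DV holds LBA bits 24-27.
    """
    _first, low, _hob, device = taskfile.split("/")
    _feature, _nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    return ((int(device, 16) & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal


# Register dumps copied from the kernel log above.
failed = lba28_from_taskfile("51/40:00:0e:a8:61/00:00:00:00:00/02")
start = lba28_from_taskfile("c8/00:08:08:a8:61/00:00:00:00:00/e2")

print(failed, hex(failed))  # 39954446 0x261a80e
print(failed - start)       # 6
```

The decoded value matches the "sector 39954446" reported by end_request, and the offset of 6 suggests every failure is the same sector inside one 8-sector read.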
smartctl -d ata -t short /dev/sda
[root@lgate ~]# smartctl -d ata -a /dev/sda
smartctl 6.1 2013-03-16 r3800 [x86_64-linux-3.10.32-std-def-alt1] (ALT Linux 6.1-alt2)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.12
Device Model: ST31000528AS
Serial Number: 9VP77WRM
LU WWN Device Id: 5 000c50 02066110a
Firmware Version: CC38
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 4
SATA Version is: SATA 2.6, 3.0 Gb/s
Local Time is: Thu Oct 23 12:36:38 2014 NOVT
==> WARNING: A firmware update for this drive may be available,
see the following Seagate web pages:
http://knowledge.seagate.com/articles/en_US/FAQ/207931en
http://knowledge.seagate.com/articles/en_US/FAQ/213891en
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 121) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: ( 609) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 172) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x103f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   097   075   006    Pre-fail  Always       -       162395348
  3 Spin_Up_Time            0x0003   095   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   097   097   020    Old_age   Always       -       3227
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   082   060   030    Pre-fail  Always       -       197439137
  9 Power_On_Hours          0x0032   083   083   000    Old_age   Always       -       15317
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always       -       1609
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       20869
188 Command_Timeout         0x0032   100   097   000    Old_age   Always       -       75
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   049   045    Old_age   Always       -       29 (Min/Max 20/30)
194 Temperature_Celsius     0x0022   029   051   000    Old_age   Always       -       29 (0 17 0 0 0)
195 Hardware_ECC_Recovered  0x001a   035   023   000    Old_age   Always       -       162395348
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       60232621378695
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       928789645
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       762256775
SMART Error Log Version: 1
ATA Error Count: 20869 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 20869 occurred at disk power-on lifetime: 15316 hours (638 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 0e a8 61 02 Error: UNC at LBA = 0x0261a80e = 39954446
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 08 a8 61 e2 00 1d+20:13:25.401 READ DMA
ea 00 00 ff ff ff af 00 1d+20:13:25.382 FLUSH CACHE EXT
35 00 08 ff ff ff ef 00 1d+20:13:25.381 WRITE DMA EXT
ea 00 00 ff ff ff af 00 1d+20:13:25.381 FLUSH CACHE EXT
27 00 00 00 00 00 e0 00 1d+20:13:25.381 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
Error 20868 occurred at disk power-on lifetime: 15316 hours (638 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 0e a8 61 02 Error: UNC at LBA = 0x0261a80e = 39954446
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 08 a8 61 e2 00 1d+20:13:22.132 READ DMA
35 00 30 ff ff ff ef 00 1d+20:13:22.131 WRITE DMA EXT
c8 00 08 00 a8 61 e2 00 1d+20:13:22.131 READ DMA
35 00 08 ff ff ff ef 00 1d+20:13:22.130 WRITE DMA EXT
c8 00 08 f8 a7 61 e2 00 1d+20:13:22.130 READ DMA
Error 20867 occurred at disk power-on lifetime: 15316 hours (638 days + 4 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 0e a8 61 02 Error: UNC at LBA = 0x0261a80e = 39954446
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 00 e0 80 60 e2 00 1d+20:13:18.489 READ DMA
c8 00 e0 00 80 60 e2 00 1d+20:13:18.480 READ DMA
c8 00 00 e0 41 5f e2 00 1d+20:13:18.232 READ DMA
c8 00 00 e0 40 5f e2 00 1d+20:13:18.222 READ DMA
c8 00 e0 00 40 5f e2 00 1d+20:13:18.221 READ DMA
Error 20866 occurred at disk power-on lifetime: 15298 hours (637 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 0e a8 61 02 Error: UNC at LBA = 0x0261a80e = 39954446
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 08 a8 61 e2 00 1d+02:16:38.686 READ DMA
ea 00 00 ff ff ff af 00 1d+02:16:38.667 FLUSH CACHE EXT
35 00 08 ff ff ff ef 00 1d+02:16:38.667 WRITE DMA EXT
ea 00 00 ff ff ff af 00 1d+02:16:38.667 FLUSH CACHE EXT
25 00 08 ff ff ff ef 00 1d+02:16:38.667 READ DMA EXT
Error 20865 occurred at disk power-on lifetime: 15298 hours (637 days + 10 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 0e a8 61 02 Error: UNC at LBA = 0x0261a80e = 39954446
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
c8 00 08 08 a8 61 e2 00 1d+02:16:35.448 READ DMA
35 00 28 ff ff ff ef 00 1d+02:16:35.448 WRITE DMA EXT
c8 00 08 00 a8 61 e2 00 1d+02:16:35.447 READ DMA
35 00 08 ff ff ff ef 00 1d+02:16:35.447 WRITE DMA EXT
c8 00 08 f8 a7 61 e2 00 1d+02:16:35.447 READ DMA
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed: read failure 90% 15316 39954446
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
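To make the attribute table in the smartctl output above easier to read, here is a small Python sketch that pulls out the counters I would look at first. The watch list (IDs 5, 187, 197, 198) is my own assumption about which attributes matter for media errors, not something smartctl defines, and the function name is hypothetical. The sample lines are copied from the table.

```python
# Assumption: these attribute IDs are the ones most relevant to bad sectors.
WATCHED_IDS = {5, 187, 197, 198}


def nonzero_watched(lines):
    """Return {attribute_name: raw_value} for watched attributes with a non-zero raw value."""
    found = {}
    for line in lines:
        fields = line.split()
        # A data row has at least 10 columns and starts with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) != 0:
            found[name] = int(raw)
    return found


sample = [
    "ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE",
    "5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0",
    "187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 20869",
    "197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 1",
    "198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 1",
]

print(nonzero_watched(sample))
```

With these lines it reports Reported_Uncorrect = 20869, Current_Pending_Sector = 1 and Offline_Uncorrectable = 1, while Reallocated_Sector_Ct stays at 0.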
Please advise:
1. It looks like the drive is dying. Should I replace it?
2. If I buy a drive of the same size, can the old one be cloned onto it somehow, so that I don't have to reinstall everything?
2.2. If cloning is possible, how do I find out which files were damaged? Or is it safer after all to install the system from scratch?
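On question 2.2, one starting point I can think of is converting the failing LBA into a filesystem block number on /dev/sda2. A rough Python sketch is below. All the numbers come from the fdisk, fsck and kernel output in this post; the block size is not stated anywhere, so it is derived from the partition size and the fsck block count, which is an assumption on my part. The function name is hypothetical.

```python
SECTOR_SIZE = 512       # logical sector size (fdisk)
PART_START = 4194304    # first sector of /dev/sda2 (fdisk)
PART_KIB = 974662656    # size of /dev/sda2 in 1 KiB blocks (fdisk)
FS_BLOCKS = 243665664   # total block count reported by fsck

# Assumption: the filesystem fills the partition, so this ratio is the block size.
BLOCK_SIZE = PART_KIB * 1024 // FS_BLOCKS


def fs_block_for_lba(lba: int) -> int:
    """Map an absolute disk sector to a block number inside /dev/sda2."""
    if lba < PART_START:
        raise ValueError("sector lies before the start of /dev/sda2")
    return (lba - PART_START) * SECTOR_SIZE // BLOCK_SIZE


print(BLOCK_SIZE)                  # 4096
print(fs_block_for_lba(39954446))  # 4470017
```

If the filesystem is ext2/3/4 (the tune2fs call suggests so), the block number could presumably be passed to debugfs `icheck` to get an inode and then to `ncheck` to get a path, but I have not tried this on this machine.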