当前位置: 首页 > 故障处理 > 正文

我们的文章会在微信公众号IT民工的龙马人生博客网站( www.htz.pw )同步更新 ,欢迎关注收藏,也欢迎大家转载,但是请在文章开始地方标注文章出处,谢谢!
由于博客中有大量代码,通过页面浏览效果更佳。

最近半个月遇到有两个客户的Oracle Exadata一体机出现物理磁盘的损坏,一个客户是机械磁盘、一个客户是FLASH磁盘。很巧的是这两个客户他们的日常运维过程中都是只看物理服务器的故障信号灯。但是在一体机环境中其实这远远不够的,就如今天我们分享的这个案例一样,一台存储节点故障灯并没有亮,但是磁盘已经出现了大量的坏块,导致最后重平衡失败。

针对Oracle一体机环境在硬件巡检时,有几个维度必须检查到:

1,所有节点的messages日志
2,存储节点cell的alert日志
3,cellcli中的alerthistory日志
4,asm的alert日志
5,ilom日志
6,最后才服务器信号灯

故障概述

2025年6月19日,值班人员在机房巡检时发现Exadata一体机存储节点亮黄灯,随即通知数据库工程师。经排查,确认存储6节点(CD_05_HTZHTZHADM06)出现故障,影响了部分业务磁盘的正常使用。

故障现象

  • 存储节点 HTZHTZHADM06 的物理磁盘出现异常,指示灯报警。
  • 通过 CellCLI 工具查询,发现相关物理磁盘、celldisk、griddisk 状态异常,部分磁盘状态为 proactive failure。
  • ASM 日志显示,业务 I/O 被转移到其他联机伙伴磁盘,rebalance 过程中,其他节点磁盘也出现坏块计数增加,rebalance 长时间无法完成。
  • 数据库层面,部分 ASM 磁盘未能自动 drop,rebalance 状态长时间处于 WAIT 或 RUN,无法顺利结束。

故障原因分析

  1. 硬盘物理故障
    存储节点6的物理磁盘(/dev/sdf,252:5)出现硬件故障,CellCLI 查询状态为 proactive failure,errorCount 累计至 46 次。
  2. 磁盘组冗余受损
    故障磁盘droping过程中触发磁盘组重平衡过程,遇到节点磁盘出现坏块,使得重平衡失败。
  3. 自动修复受阻
    由于部分磁盘健康状况不佳,导致相关 griddisk 的 asmdeactivationoutcome 状态为“Cannot deactivate because partner disk … has poor health”,影响了自动修复和冗余恢复流程。

处理过程

确定损坏磁盘的信息

1,通过list physicaldisk 查看状态标志包含failure的所有磁盘

CellCLI> list physicaldisk attributes all
	 252:0     	 	 	 22	 /dev/sda    	 HardDisk 	 252	 	 	 	 0   	 	 	 	 	 0_0 	 "HGST    H1231A823SUN010T"               	 	 A680    	 2018-06-27T12:43:40+08:00	 sas	 	 	 R6WSUN              	 8.91015625T                   	 	 	 0                      	 normal
	 252:1     	 	 	 23	 /dev/sdb    	 HardDisk 	 252	 	 	 	 0   	 	 	 	 	 0_1 	 "HGST    H1231A823SUN010T"               	 	 A680    	 2018-06-27T12:43:40+08:00	 sas	 	 	 R74S6N              	 8.91015625T                   	 	 	 1                      	 normal
	 252:2     	 	 	 28	 /dev/sdc    	 HardDisk 	 252	 	 	 	 0   	 	 	 	 	 0_2 	 "HGST    H1231A823SUN010T"               	 	 A680    	 2024-09-12T10:24:58+08:00	 sas	 	 	 R52JHK              	 8.91015625T                   	 	 	 2                      	 normal
	 252:3     	 	 	 20	 /dev/sdd    	 HardDisk 	 252	 	 	 	 0   	 	 	 	 	 0_3 	 "HGST    H1231A823SUN010T"               	 	 A680    	 2018-06-27T12:43:40+08:00	 sas	 	 	 R75DLN              	 8.91015625T                   	 	 	 3                      	 normal
	 252:4     	 	 	 26	 /dev/sde    	 HardDisk 	 252	 	 	 	 0   	 	 	 	 	 0_4 	 "HGST    H1231A823SUN010T"               	 	 A680    	 2018-06-27T12:43:40+08:00	 sas	 	 	 R63KJN              	 8.91015625T                   	 	 	 4                      	 normal
	 252:5     	 	 	 27	 /dev/sdf    	 HardDisk 	 252	 	 	 	 0   	 	 	 	 	 0_5 	 "HGST    H1231A823SUN010T"               	 	 A680    	 2018-06-27T12:43:40+08:00	 sas	 	 	 R714UN              	 8.91015625T                   	 	 	 5                      	 normal
	 252:6     	 	 	 25	 /dev/sdg    	 HardDisk 	 252	 	 	 	 0   	 	 	 	 	 0_6 	 "HGST    H1231A823SUN010T"               	 	 A680    	 2018-06-27T12:43:40+08:00	 sas	 	 	 R6SLBN              	 8.91015625T                   	 	 	 6                      	 normal
	 252:7     	 	 	 24	 /dev/sdh    	 HardDisk 	 252	 	 	 	 1106	 	 	 	 	 0_7 	 "HGST    H1231A823SUN010T"               	 	 A680    	 2018-06-27T12:43:40+08:00	 sas	 	 	 R7D8GN              	 8.91015625T                   	 	 	 7                      	 normal
	 252:8     	 	 	 19	 /dev/sdi    	 HardDisk 	 252	 	 	 	 0   	 	 	 	 	 0_8 	 "HGST    H1231A823SUN010T"               	 	 A680    	 2018-06-27T12:43:40+08:00	 sas	 	 	 R6W8AN              	 8.91015625T                   	 	 	 8                      	 normal
	 252:9     	 	 	 18	 /dev/sdj    	 HardDisk 	 252	 	 	 	 0   	 	 	 	 	 0_9 	 "HGST    H1231A823SUN010T"               	 	 A680    	 2018-06-27T12:43:40+08:00	 sas	 	 	 R6Y44N              	 8.91015625T                   	 	 	 9                      	 normal
	 252:10    	 	 	 17	 /dev/sdk    	 HardDisk 	 252	 	 	 	 0   	 	 	 	 	 0_10	 "HGST    H1231A823SUN010T"               	 	 A680    	 2018-06-27T12:43:40+08:00	 sas	 	 	 R7D4RN              	 8.91015625T                   	 	 	 10                     	 normal
	 252:11    	 	 	 16	 /dev/sdl    	 HardDisk 	 252	 	 	 	 0   	 	 	 	 	 0_11	 "HGST    H1231A823SUN010T"               	 	 A680    	 2018-06-27T12:43:40+08:00	 sas	 	 	 R7G62N              	 8.91015625T                   	 	 	 11                     	 normal
	 FLASH_10_1	 	 	   	 /dev/nvme2n1	 FlashDisk	    	 	 	 	     	 	 	 	 	 10_0	 "Oracle Flash Accelerator F640 PCIe Card"	 	 QDV1RF35	 2018-06-27T12:44:13+08:00	    	 	 	 PHLE805600PD6P4BGN-1	 2.910957656800746917724609375T	 	 	 "PCI Slot: 10; FDOM: 1"	 normal
	 FLASH_10_2	 	 	   	 /dev/nvme3n1	 FlashDisk	    	 	 	 	     	 	 	 	 	 10_0	 "Oracle Flash Accelerator F640 PCIe Card"	 	 QDV1RF35	 2018-06-27T12:44:13+08:00	    	 	 	 PHLE805600PD6P4BGN-2	 2.910957656800746917724609375T	 	 	 "PCI Slot: 10; FDOM: 2"	 normal
	 FLASH_4_1 	 	 	   	 /dev/nvme4n1	 FlashDisk	    	 	 	 	     	 	 	 	 	 4_0 	 "Oracle Flash Accelerator F640 PCIe Card"	 	 QDV1RF35	 2018-06-27T12:44:13+08:00	    	 	 	 PHLE8055008T6P4BGN-1	 2.910957656800746917724609375T	 	 	 "PCI Slot: 4; FDOM: 1" 	 normal
	 FLASH_4_2 	 	 	   	 /dev/nvme5n1	 FlashDisk	    	 	 	 	     	 	 	 	 	 4_0 	 "Oracle Flash Accelerator F640 PCIe Card"	 	 QDV1RF35	 2018-06-27T12:44:13+08:00	    	 	 	 PHLE8055008T6P4BGN-2	 2.910957656800746917724609375T	 	 	 "PCI Slot: 4; FDOM: 2" 	 normal
	 FLASH_5_1 	 	 	   	 /dev/nvme6n1	 FlashDisk	    	 	 	 	     	 	 	 	 	 5_0 	 "Oracle Flash Accelerator F640 PCIe Card"	 	 QDV1RF35	 2018-06-27T12:44:13+08:00	    	 	 	 PHLE805600P76P4BGN-1	 2.910957656800746917724609375T	 	 	 "PCI Slot: 5; FDOM: 1" 	 normal
	 FLASH_5_2 	 	 	   	 /dev/nvme7n1	 FlashDisk	    	 	 	 	     	 	 	 	 	 5_0 	 "Oracle Flash Accelerator F640 PCIe Card"	 	 QDV1RF35	 2019-04-21T10:13:44+08:00	    	 	 	 PHLE805600P76P4BGN-2	 2.910957656800746917724609375T	 	 	 "PCI Slot: 5; FDOM: 2" 	 normal
	 FLASH_6_1 	 	 	   	 /dev/nvme0n1	 FlashDisk	    	 	 	 	     	 	 	 	 	 6_0 	 "Oracle Flash Accelerator F640 PCIe Card"	 	 QDV1RF35	 2018-06-27T12:44:13+08:00	    	 	 	 PHLE8056009A6P4BGN-1	 2.910957656800746917724609375T	 	 	 "PCI Slot: 6; FDOM: 1" 	 normal
	 FLASH_6_2 	 	 	   	 /dev/nvme1n1	 FlashDisk	    	 	 	 	     	 	 	 	 	 6_0 	 "Oracle Flash Accelerator F640 PCIe Card"	 	 QDV1RF35	 2019-04-22T09:36:12+08:00	    	 	 	 PHLE8056009A6P4BGN-2	 2.910957656800746917724609375T	 	 	 "PCI Slot: 6; FDOM: 2" 	 normal
	 M2_SYS_0  	 	 	   	 /dev/sdm    	 M2Disk   	    	 	 	 	     	 	 	 	 	     	 "INTEL SSDSCKJB150G7"                    	 	 N2010121	 2018-06-27T12:44:19+08:00	    	 	 	 PHDW802004L8150A    	 139.73558807373046875G        	 	 	 "M.2 Slot: 0"          	 normal
	 M2_SYS_1  	 	 	   	 /dev/sdn    	 M2Disk   	    	 	 	 	     	 	 	 	 	     	 "INTEL SSDSCKJB150G7"                    	 	 N2010121	 2018-06-27T12:44:19+08:00	    	 	 	 PHDW802004ZT150A    	 139.73558807373046875G        	 	 	 "M.2 Slot: 1"          	 normal

2,通过list celldisk查看状态

    CD_00_HTZHTZHADM01	 	 2018-06-27T17:33:49+08:00	 /dev/sda  	 /dev/sda  	 HardDisk 	 0       	 0	 e432c76d-f0d4-46b5-8ef4-c6aa2f7efee0	 R6WSUN                                   	 8.9094085693359375T	 normal
	 CD_01_HTZHTZHADM01	 	 2018-06-27T17:33:49+08:00	 /dev/sdb  	 /dev/sdb  	 HardDisk 	 0       	 0	 25f32ef1-dd5a-4cf7-8440-1e74abe8e858	 R74S6N                                   	 8.9094085693359375T	 normal
	 CD_02_HTZHTZHADM01	 	 2024-09-12T10:25:05+08:00	 /dev/sdc  	 /dev/sdc  	 HardDisk 	 0       	 0	 76bcd3bd-9e46-4f7e-bece-046d4e83276a	 R52JHK                                   	 8.9094085693359375T	 normal
	 CD_03_HTZHTZHADM01	 	 2018-06-27T17:33:49+08:00	 /dev/sdd  	 /dev/sdd  	 HardDisk 	 0       	 0	 0b1c0522-e657-4d70-b6ea-85811ddf5913	 R75DLN                                   	 8.9094085693359375T	 normal
	 CD_04_HTZHTZHADM01	 	 2018-06-27T17:33:49+08:00	 /dev/sde  	 /dev/sde  	 HardDisk 	 0       	 0	 c169081a-1ba8-435f-ba7c-a605b3049c78	 R63KJN                                   	 8.9094085693359375T	 normal
	 CD_05_HTZHTZHADM01	 	 2018-06-27T17:33:49+08:00	 /dev/sdf  	 /dev/sdf  	 HardDisk 	 0       	 0	 4b56c0f2-a6a0-43a0-bbeb-314a70626b09	 R714UN                                   	 8.9094085693359375T	 normal
	 CD_06_HTZHTZHADM01	 	 2018-06-27T17:33:50+08:00	 /dev/sdg  	 /dev/sdg  	 HardDisk 	 0       	 0	 0161eb49-0dca-4ae5-a247-a0b810491270	 R6SLBN                                   	 8.9094085693359375T	 normal
	 CD_07_HTZHTZHADM01	 	 2018-06-27T17:33:50+08:00	 /dev/sdh  	 /dev/sdh  	 HardDisk 	 11413920	 0	 d7ef18a7-0aae-4136-8a98-2ec3978c498b	 R7D8GN                                   	 8.9094085693359375T	 normal
	 CD_08_HTZHTZHADM01	 	 2018-06-27T17:33:50+08:00	 /dev/sdi  	 /dev/sdi  	 HardDisk 	 0       	 0	 312cc384-6cb4-4724-a88d-e43a06003e80	 R6W8AN                                   	 8.9094085693359375T	 normal
	 CD_09_HTZHTZHADM01	 	 2018-06-27T17:33:50+08:00	 /dev/sdj  	 /dev/sdj  	 HardDisk 	 0       	 0	 700e6561-de2f-4577-8cfe-d8d09435d291	 R6Y44N                                   	 8.9094085693359375T	 normal
	 CD_10_HTZHTZHADM01	 	 2018-06-27T17:33:50+08:00	 /dev/sdk  	 /dev/sdk  	 HardDisk 	 0       	 0	 1ba9303f-16a7-4c75-bd06-9e772307b63a	 R7D4RN                                   	 8.9094085693359375T	 normal
	 CD_11_HTZHTZHADM01	 	 2018-06-27T17:33:50+08:00	 /dev/sdl  	 /dev/sdl  	 HardDisk 	 0       	 0	 0a429ee8-790b-4087-bfa5-74a94a79238a	 R7G62N                                   	 8.9094085693359375T	 normal
	 FD_00_HTZHTZHADM01	 	 2018-06-27T17:33:51+08:00	 /dev/md310	 /dev/md310	 FlashDisk	 0       	 0	 4d5272bf-4aba-48c4-8e98-b5c651ff19ed	 PHLE805600PD6P4BGN-2,PHLE805600PD6P4BGN-1	 5.8218994140625T   	 normal
	 FD_01_HTZHTZHADM01	 	 2018-06-27T17:33:51+08:00	 /dev/md304	 /dev/md304	 FlashDisk	 0       	 0	 6eea3fb4-deaf-436f-aa61-46ecf4bc5f2b	 PHLE8055008T6P4BGN-2,PHLE8055008T6P4BGN-1	 5.8218994140625T   	 normal
	 FD_02_HTZHTZHADM01	 	 2018-06-27T17:33:52+08:00	 /dev/md305	 /dev/md305	 FlashDisk	 0       	 0	 e10fe5b7-2d5c-48d4-a1a6-37612ddee585	 PHLE805600P76P4BGN-2,PHLE805600P76P4BGN-1	 5.8218994140625T   	 normal
	 FD_03_HTZHTZHADM01	 	 2018-06-27T17:33:53+08:00	 /dev/md306	 /dev/md306	 FlashDisk	 0       	 0	 38181a27-b808-4238-a98c-04325146db87	 PHLE8056009A6P4BGN-2,PHLE8056009A6P4BGN-1	 5.8218994140625T   	 normal

3,通过list griddisk查看状态

CellCLI> list griddisk attributes all
	 DATAC1_CD_00_HTZHTZHADM01	 DATAC1	 DATAC1_CD_00_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_03_HTZHTZHADM01	 default	 CD_00_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup DATAC1"	 2018-06-27T17:38:56+08:00	 HardDisk	 0       	 8666a91f-f41b-4ad9-91af-72e5e1dd3840	 	 7.1279296875T    	 	 active
	 DATAC1_CD_01_HTZHTZHADM01	 DATAC1	 DATAC1_CD_01_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_01_HTZHTZHADM01	 default	 CD_01_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup DATAC1"	 2018-06-27T17:38:56+08:00	 HardDisk	 0       	 98e70d44-755a-416e-b221-1339bfe6accf	 	 7.1279296875T    	 	 active
	 DATAC1_CD_02_HTZHTZHADM01	 DATAC1	 DATAC1_CD_02_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_03_HTZHTZHADM01	 default	 CD_02_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup DATAC1"	 2024-09-12T10:25:05+08:00	 HardDisk	 0       	 20be2fbd-de47-455f-aceb-6f33451227e9	 	 7.1279296875T    	 	 active
	 DATAC1_CD_03_HTZHTZHADM01	 DATAC1	 DATAC1_CD_03_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_01_HTZHTZHADM01	 default	 CD_03_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup DATAC1"	 2018-06-27T17:38:56+08:00	 HardDisk	 0       	 c5dd1a7b-3ab4-4afd-ba95-6b52d7e0c990	 	 7.1279296875T    	 	 active
	 DATAC1_CD_04_HTZHTZHADM01	 DATAC1	 DATAC1_CD_04_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_00_HTZHTZHADM01	 default	 CD_04_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup DATAC1"	 2018-06-27T17:38:55+08:00	 HardDisk	 0       	 66daaae1-f9bd-409f-a587-056a49bb0737	 	 7.1279296875T    	 	 active
	 DATAC1_CD_05_HTZHTZHADM01	 DATAC1	 DATAC1_CD_05_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_02_HTZHTZHADM01	 default	 CD_05_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup DATAC1"	 2018-06-27T17:38:56+08:00	 HardDisk	 0       	 475cab0a-eef4-4c5b-b3be-1ff23b155f51	 	 7.1279296875T    	 	 active
	 DATAC1_CD_06_HTZHTZHADM01	 DATAC1	 DATAC1_CD_06_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_03_HTZHTZHADM01	 default	 CD_06_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup DATAC1"	 2018-06-27T17:38:57+08:00	 HardDisk	 0       	 f75ea3a4-7ee2-4f85-9fee-dada8e11c6cf	 	 7.1279296875T    	 	 active
	 DATAC1_CD_07_HTZHTZHADM01	 DATAC1	 DATAC1_CD_07_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_01_HTZHTZHADM01	 default	 CD_07_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup DATAC1"	 2018-06-27T17:38:55+08:00	 HardDisk	 4       	 d30809fc-57df-442d-bea9-b22d12c72281	 	 7.1279296875T    	 	 active
	 DATAC1_CD_08_HTZHTZHADM01	 DATAC1	 DATAC1_CD_08_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_00_HTZHTZHADM01	 default	 CD_08_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup DATAC1"	 2018-06-27T17:38:56+08:00	 HardDisk	 0       	 9cc88c99-31dc-4731-be5f-1f4e467efb01	 	 7.1279296875T    	 	 active
	 DATAC1_CD_09_HTZHTZHADM01	 DATAC1	 DATAC1_CD_09_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_02_HTZHTZHADM01	 default	 CD_09_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup DATAC1"	 2018-06-27T17:38:55+08:00	 HardDisk	 0       	 1656d696-640e-459b-a15f-c6c3c2ec504a	 	 7.1279296875T    	 	 active
	 DATAC1_CD_10_HTZHTZHADM01	 DATAC1	 DATAC1_CD_10_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_00_HTZHTZHADM01	 default	 CD_10_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup DATAC1"	 2018-06-27T17:38:55+08:00	 HardDisk	 0       	 881d0d07-1a4a-4492-8320-407bc8b2e5f0	 	 7.1279296875T    	 	 active
	 DATAC1_CD_11_HTZHTZHADM01	 DATAC1	 DATAC1_CD_11_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_02_HTZHTZHADM01	 default	 CD_11_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup DATAC1"	 2018-06-27T17:38:57+08:00	 HardDisk	 0       	 13fddb80-857b-40c5-9b41-68efced3dc40	 	 7.1279296875T    	 	 active
	 RECOC1_CD_00_HTZHTZHADM01	 RECOC1	 RECOC1_CD_00_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_03_HTZHTZHADM01	 default	 CD_00_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup RECOC1"	 2018-06-27T17:39:19+08:00	 HardDisk	 0       	 a4ad0311-da18-4419-80dc-8b75227329d2	 	 1.78143310546875T	 	 active
	 RECOC1_CD_01_HTZHTZHADM01	 RECOC1	 RECOC1_CD_01_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_01_HTZHTZHADM01	 default	 CD_01_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup RECOC1"	 2018-06-27T17:39:19+08:00	 HardDisk	 0       	 ce81a19a-3651-4dc1-9dc7-e84b41754a76	 	 1.78143310546875T	 	 active
	 RECOC1_CD_02_HTZHTZHADM01	 RECOC1	 RECOC1_CD_02_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_03_HTZHTZHADM01	 default	 CD_02_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup RECOC1"	 2024-09-12T10:25:05+08:00	 HardDisk	 0       	 789705ed-e76f-43ab-ac1f-2544bebb1ea3	 	 1.78143310546875T	 	 active
	 RECOC1_CD_03_HTZHTZHADM01	 RECOC1	 RECOC1_CD_03_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_01_HTZHTZHADM01	 default	 CD_03_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup RECOC1"	 2018-06-27T17:39:21+08:00	 HardDisk	 0       	 37098fd1-2708-4a4b-86a7-d4a8c14ddf88	 	 1.78143310546875T	 	 active
	 RECOC1_CD_04_HTZHTZHADM01	 RECOC1	 RECOC1_CD_04_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_00_HTZHTZHADM01	 default	 CD_04_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup RECOC1"	 2018-06-27T17:39:20+08:00	 HardDisk	 0       	 e18f3551-0af6-44b3-b577-7f08bde773c3	 	 1.78143310546875T	 	 active
	 RECOC1_CD_05_HTZHTZHADM01	 RECOC1	 RECOC1_CD_05_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_02_HTZHTZHADM01	 default	 CD_05_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup RECOC1"	 2018-06-27T17:39:20+08:00	 HardDisk	 0       	 cbb48299-dbac-4230-b784-508c954a461b	 	 1.78143310546875T	 	 active
	 RECOC1_CD_06_HTZHTZHADM01	 RECOC1	 RECOC1_CD_06_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_03_HTZHTZHADM01	 default	 CD_06_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup RECOC1"	 2018-06-27T17:39:21+08:00	 HardDisk	 0       	 0010e607-713f-4055-8924-2fc23a31e9a1	 	 1.78143310546875T	 	 active
	 RECOC1_CD_07_HTZHTZHADM01	 RECOC1	 RECOC1_CD_07_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_01_HTZHTZHADM01	 default	 CD_07_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup RECOC1"	 2018-06-27T17:39:20+08:00	 HardDisk	 23825674	 a2560429-fb29-4022-a85a-98e0baeacf28	 	 1.78143310546875T	 	 active
	 RECOC1_CD_08_HTZHTZHADM01	 RECOC1	 RECOC1_CD_08_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_00_HTZHTZHADM01	 default	 CD_08_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup RECOC1"	 2018-06-27T17:39:20+08:00	 HardDisk	 0       	 70997c8e-4bf0-4820-97a9-32c48be234ff	 	 1.78143310546875T	 	 active
	 RECOC1_CD_09_HTZHTZHADM01	 RECOC1	 RECOC1_CD_09_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_02_HTZHTZHADM01	 default	 CD_09_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup RECOC1"	 2018-06-27T17:39:19+08:00	 HardDisk	 0       	 309f073f-262f-4042-b9c7-b935c6f1a143	 	 1.78143310546875T	 	 active
	 RECOC1_CD_10_HTZHTZHADM01	 RECOC1	 RECOC1_CD_10_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_00_HTZHTZHADM01	 default	 CD_10_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup RECOC1"	 2018-06-27T17:39:19+08:00	 HardDisk	 0       	 fbc56c09-c95f-4230-a971-31b2a18fc10c	 	 1.78143310546875T	 	 active
	 RECOC1_CD_11_HTZHTZHADM01	 RECOC1	 RECOC1_CD_11_HTZHTZHADM01	 HTZHTZHADM01	 	 FD_02_HTZHTZHADM01	 default	 CD_11_HTZHTZHADM01	 "Cluster cluster-clu1 diskgroup RECOC1"	 2018-06-27T17:39:19+08:00	 HardDisk	 0       	 96061cc1-ad24-4a8e-afe7-f974d0930a8f	 	 1.78143310546875T	 	 active

CellCLI> 

4,确认故障LUN的位置

         name:               CD_05_HTZHTZHADM06
         comment:
         creationTime:       2018-06-27T17:33:50+08:00
         deviceName:         /dev/sdf
         devicePartition:    /dev/sdf
         diskType:           HardDisk
         errorCount:         46
         freeSpace:          0
         id:                 86b90b17-fbe4-4622-821c-f3e96881ff09
         physicalDisk:       R74YWN
         size:               8.9094085693359375T
         status:             proactive failure

CellCLI>
查看griddisk的信息
list griddisk where celldisk=CD_05_HTZHTZHADM06
CellCLI> list griddisk where celldisk=CD_05_HTZHTZHADM06
         DATAC1_CD_05_HTZHTZHADM06   proactive failure
         RECOC1_CD_05_HTZHTZHADM06   proactive failure

5,确认失败的磁盘所关联的ASM disk是否已经自动drop。

使用grid用户登录到数据库节点,使用sqlplus / as sysasm连接到ASM实例。

SQL> set lines 180 pages 999
col path format a50
select group_number,path,header_status,mount_status,mode_status,name from v$ASM_DISK where path like '%CD_05_HTZHTZHADM06%';
SQL> SQL>
GROUP_NUMBER PATH                                               HEADER_STATU MOUNT_S MODE_ST NAME
------------ -------------------------------------------------- ------------ ------- ------- ------------------------------
           0 o/192.168.10.19;192.168.10.20/DATAC1_CD_05_sjxzcel FORMER       CLOSED  ONLINE
             adm06
           3 o/192.168.10.19;192.168.10.20/RECOC1_CD_05_sjxzcel MEMBER       CACHED  ONLINE  RECOC1_CD_05_HTZHTZHADM06
             adm06

发现未完全drop。

6,查看数据库reblance状态

SQL> 
INST_ID GROUP_NUMBER OPERA PASS STAT  POWER ACTUAL SOFAR    EST_WORK EST_RATE EST_MINUTES ERROR_CODE CON_ID
------- ------------ ----- ---- ----- ----- ------ -------- -------- -------- ----------- ---------- ------
      1            3 REBAL COMPACT WAIT   12     12        0        0        0           0          0
      3            3 REBAL REBALANCE RUN    12 121789 1240900        0        0           0          0
      3            3 REBAL REBUILD   DONE   12     12        0        0        0           0          0
      3            3 REBAL RESYNC    DONE   12     12        0        0        0           0          0
      1            3 REBAL COMPACT WAIT   12     12        0        0        0           0          0
      2            3 REBAL REBALANCE WAIT   12     12        0        0        0           0          0
      2            3 REBAL REBUILD   WAIT   12     12        0        0        0           0          0
      2            3 REBAL RESYNC    WAIT   12     12        0        0        0           0          0
      3            3 REBAL COMPACT WAIT   12     12        0        0        0           0          0
      3            3 REBAL REBALANCE WAIT   12     12        0        0        0           0          0
      3            3 REBAL REBUILD   WAIT   12     12        0        0        0           0          0
      3            3 REBAL RESYNC    WAIT   12     12        0        0        0           0          0
      1            3 REBAL COMPACT WAIT   12     12        0        0        0           0          0
      1            3 REBAL REBALANCE WAIT   12     12        0        0        0           0          0
      1            3 REBAL REBUILD   WAIT   12     12        0        0        0           0          0
      1            3 REBAL RESYNC    WAIT   12     12        0        0        0           0          0
      2            3 REBAL COMPACT WAIT   12     12        0        0        0           0          0
      2            3 REBAL REBALANCE WAIT   12     12        0        0        0           0          0
      2            3 REBAL REBUILD   WAIT   12     12        0        0        0           0          0
      2            3 REBAL RESYNC    WAIT   12     12        0        0        0           0          0

20 rows selected.

发现一直是这个状态。

确认ALERT日志信息

某节点的日志

----------------Alert----------------
2025-06-16T16:51:26.645923+08:00
NOTE: updating disk modes to 0x5 from 0x7 for disk 19 (DATAC1_CD_05_HTZHTZHADM06) in group 1 (DATAC1): lflags 0x0
NOTE: disk 19 (DATAC1_CD_05_HTZHTZHADM06) in group 1 (DATAC1) is offline for reads
NOTE: updating disk modes to 0x1 from 0x5 for disk 19 (DATAC1_CD_05_HTZHTZHADM06) in group 1 (DATAC1): lflags 0x0
NOTE: disk 19 (DATAC1_CD_05_HTZHTZHADM06) in group 1 (DATAC1) is offline for writes
NOTE: disk 19 (DATAC1_CD_05_HTZHTZHADM06) in group 1 (DATAC1) is offline for writes
SUCCESS: disk DATAC1_CD_05_HTZHTZHADM06 (19.1897451371) dropped from diskgroup DATAC1

----------------ASM磁盘故障-------------------------------
WARNING: I/O on unhealthy ASM disk (DATAC1_CD_05_HTZHTZHADM06) in group DATAC1 /0x3a82859c will be diverted to its online partner disks
2025-06-19T18:08:35.323341+08:00
SQL> /* Exadata Auto Mgmt: Proactive DROP ASM Disk */
alter diskgroup RECOC1 drop
disk RECOC1_CD_05_HTZHTZHADM06

NOTE: GroupBlock outside rolling migration privileged region
NOTE: initiating offline for alter one membership refresh for group=3
2025-06-19T18:08:37.501287+08:00

2025-06-19T18:08:25.583485+08:00
SQL> /* Exadata Auto Mgmt: Proactive DROP ASM Disk */
alter diskgroup RECOC1 drop
disk RECOC1_CD_05_HTZHTZHADM06

2025-06-19T18:08:26.836233+08:00
NOTE: Attempting voting file refresh on diskgroup RECOC1
NOTE: Refresh completed on diskgroup RECOC1. No voting file found.

某节点的信息

2025-06-23T21:54:19.843872+08:00
WARNING: Read Failed. group:3 disk:80 AU:38398 offset:1048576 size:1048576
path:o/192.168.10.9;192.168.10.10/RECOC1_CD_07_HTZHTZHADM01
incarnation:0x7118d6cb asynchronous result:'I/O error'
subsys:OSS krq:0x7f19ef8f3000 osderr1:0xc9 osderr2:0x0
Exadata error:'Generic I/O error'
IO elapsed time: 7057 usec Time waited on I/O: 6048 usec
WARNING: Read Failed. group:3 disk:80 AU:38398 offset:0 size:1048576
path:o/192.168.10.9;192.168.10.10/RECOC1_CD_07_HTZHTZHADM01
incarnation:0x7118d6cb asynchronous result:'I/O error'
subsys:OSS krq:0x7f19ef98288 bufp:0x7f19ea383000 osderr1:0xc9 osderr2:0x0
Exadata error:'Generic I/O error'
IO elapsed time: 15155 usec Time waited on I/O: 14146 usec
NOTE: Suppressing further IO Read errors on group:3 disk:80
WARNING: Read Failed. group:3 disk:80 AU:38374 offset:0 size:1048576
path:o/192.168.10.9;192.168.10.10/RECOC1_CD_07_HTZHTZHADM01
incarnation:0x7118d6cb asynchronous result:'I/O error'
subsys:OSS krq:0x7f19ef969348 bufp:0x7f19ea073000 osderr1:0xc9 osderr2:0x0
Exadata error:'Generic I/O error'
IO elapsed time: 7068 usec Time waited on I/O: 2021 usec
2025-06-23T21:54:50.567089+08:00
WARNING: Read Failed. group:3 disk:80 AU:38398 offset:1048576 size:1048576
path:o/192.168.10.9;192.168.10.10/RECOC1_CD_07_HTZHTZHADM01
incarnation:0x7118d6cb asynchronous result:'I/O error'
subsys:OSS krq:0x7f19ef8f3000 osderr1:0xc9 osderr2:0x0
Exadata error:'Generic I/O error'
IO elapsed time: 40362 usec Time waited on I/O: 6050 usec
WARNING: Read Failed. group:3 disk:80 AU:38398 offset:0 size:1048576
path:o/192.168.10.9;192.168.10.10/RECOC1_CD_07_HTZHTZHADM01
incarnation:0x7118d6cb asynchronous result:'I/O error'
subsys:OSS krq:0x7f19ef969348 bufp:0x7f19ea073000 osderr1:0xc9 osderr2:0x0
Exadata error:'Generic I/O error'
IO elapsed time: 10082 usec Time waited on I/O: 1009 usec
NOTE: Suppressing further IO Read errors on group:3 disk:80
WARNING: Read Failed. group:3 disk:80 AU:57491 offset:0 size:1048576

Exadata error:’Generic I/O error’和Read Failed这里给得很明显,提示在从平衡的时候读取AU时出现了IO错误。

确定IO故障节点的CELL的日志

Read Error on Cell Disk CD_07_sjxzceladm01 (/dev/sdh) at device offset 8078426636288 bytes with size 1048576 bytes membuf 0x1c89a00000, bioreq 0x616d522c (errno: No data available [61])
Read Error on Grid Disk RECOC1_CD_07_sjxzceladm01 at grid disk offset 241134731264 bytes with size 1048576 bytes from database +ASM
2025-06-23T17:23:05.781390+08:00
Read Error on Cell Disk CD_07_sjxzceladm01 (/dev/sdh) at device offset 7844585799680 bytes with size 1048576 bytes membuf 0x1e70f00000, bioreq 0x6120b2b4 (errno: No data available [61])
Read Error on Grid Disk RECOC1_CD_07_sjxzceladm01 at grid disk offset 7293894656 bytes with size 1048576 bytes from database +ASM
2025-06-23T17:23:05.790166+08:00

errno: No data available [61]通过这行日志,也可以很明确知道IO有异常

操作系统日志

Jun 24 00:17:43 sjxzceladm01 kernel: [29925171.789307] sd 0:2:7:0: [sdh] tag#7 BRCM Debug mfi stat 0x2d, data len requested/completed 0x100000/0x0
Jun 24 00:17:43 sjxzceladm01 kernel: [29925171.790573] sd 0:2:7:0: [sdh] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 24 00:17:43 sjxzceladm01 kernel: [29925171.790576] sd 0:2:7:0: [sdh] tag#7 Sense Key : Medium Error [current]
Jun 24 00:17:43 sjxzceladm01 kernel: [29925171.790578] sd 0:2:7:0: [sdh] tag#7 Add. Sense: Unrecovered read error

操作系统层面同样在报读取错误

故障处理

强制删除磁盘

使用踢盘命令alter physicaldisk 252:5 drop for replacement剔除存储节点6的故障硬盘

更换物理磁盘

更换磁盘,每个磁盘都有一个挂扣,拨动挂扣将旧的磁盘移除,然后在将新的磁盘推入槽位,锁起挂扣。更换完成后,磁盘上的LED指示灯消失,绿灯亮起。

确认更换磁盘的状态

在物理磁盘更换完成以后,系统会自动创建LUN,celldisk,griddisk,当其是系统盘时,如果磁盘包含系统分区,RAID同时也会自动进行重组。
在存储服务器这一端的cellcli命令提示符下执行如下命令可查看lun,physicaldisk,celldisk,griddisk的状态,创建时间及名称,确认更换后的信息正确无误。

CellCLI> list lun 0_5 detail
         name:               0_5
         cellDisk:           CD_05_HTZHTZHADM06
         deviceName:         /dev/sdf
         diskType:           HardDisk
         id:                 0_5
         isSystemLun:        FALSE
         lunSize:            8.90940952301025390625T
         lunUID:             0_5
         physicalDrives:     252:5
         raidLevel:          0
         lunWriteCacheMode:  "WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU"
         status:             normal
CellCLI> list physicaldisk 252:5 detail
         name:               252:5
         deviceId:           29
         deviceName:         /dev/sdf
         diskType:           HardDisk
         enclosureDeviceId:  252
         errOtherCount:      0
         luns:               0_5
         makeModel:          "HGST    H1231A823SUN010T"
         physicalFirmware:   A680
         physicalInsertTime: 2025-06-24T02:30:45+08:00
         physicalInterface:  sas
         physicalSerial:     R94HEN
         physicalSize:       8.91015625T
         slotNumber:         5
         status:             normal
CellCLI> list celldisk where lun=0_5 detail
         name:               CD_05_HTZHTZHADM06
         comment:
         creationTime:       2025-06-24T02:30:53+08:00
         deviceName:         /dev/sdf
         devicePartition:    /dev/sdf
         diskType:           HardDisk
         errorCount:         0
         freeSpace:          0
         id:                 1a41238e-cbb6-4d66-8427-5e7dc2e2729f
         physicalDisk:       R94HEN
         size:               8.9094085693359375T
         status:             normal
CellCLI> list griddisk where celldisk=CD_05_HTZHTZHADM06 detail
         name:               DATAC1_CD_05_HTZHTZHADM06
         asmDiskGroupName:   DATAC1
         asmDiskName:        DATAC1_CD_05_HTZHTZHADM06
         asmFailGroupName:   HTZHTZHADM06
         availableTo:
         cachedBy:           FD_01_HTZHTZHADM06
         cachingPolicy:      default
         cellDisk:           CD_05_HTZHTZHADM06
         comment:            "Cluster cluster-clu1 diskgroup DATAC1"
         creationTime:       2025-06-24T02:30:53+08:00
         diskType:           HardDisk
         errorCount:         0
         id:                 ac2ecc45-619e-464c-bf66-5c0fd7e8c608
         size:               7.1279296875T
         status:             active

         name:               RECOC1_CD_05_HTZHTZHADM06
         asmDiskGroupName:   RECOC1
         asmDiskName:        RECOC1_CD_05_HTZHTZHADM06
         asmFailGroupName:   HTZHTZHADM06
         availableTo:
         cachedBy:           FD_01_HTZHTZHADM06
         cachingPolicy:      default
         cellDisk:           CD_05_HTZHTZHADM06
         comment:            "Cluster cluster-clu1 diskgroup RECOC1"
         creationTime:       2025-06-24T02:30:53+08:00
         diskType:           HardDisk
         errorCount:         0
         id:                 5ad626b7-9167-4f64-b07d-cfb767ca8d3d
         size:               1.78143310546875T
         status:             active

然后在数据库服务器的ASM实例这一段查看griddisk是否已经正确添加到ASM磁盘组:

SQL> set linesize 180 pages 999
col path format a50
select group_number,path,header_status,mount_status,name from v$ASM_DISK where path like '%CD_05_HTZHTZHADM06%';
SQL> SQL>
GROUP_NUMBER PATH                                               HEADER_STATU MOUNT_S NAME
------------ -------------------------------------------------- ------------ ------- ------------------------------
           1 o/192.168.10.19;192.168.10.20/DATAC1_CD_05_sjxzcel MEMBER       CACHED  DATAC1_CD_05_HTZHTZHADM06
             adm06
           3 o/192.168.10.19;192.168.10.20/RECOC1_CD_05_sjxzcel MEMBER       CACHED  RECOC1_CD_05_HTZHTZHADM06
             adm06

——————作者介绍———————–
姓名:黄廷忠
现就职:Oracle中国高级服务团队
曾就职:OceanBase、云和恩墨、东方龙马等
电话、微信、QQ:18081072613
个人博客: (http://www.htz.pw)
CSDN地址: (https://blog.csdn.net/wwwhtzpw)
博客园地址: (https://www.cnblogs.com/www-htz-pw)


故障处理:Oracle一体机磁盘故障时磁盘组重平衡失败的故障处理:等您坐沙发呢!

发表评论

gravatar

? razz sad evil ! smile oops grin eek shock ??? cool lol mad twisted roll wink idea arrow neutral cry mrgreen

快捷键:Ctrl+Enter