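Our articles are published simultaneously on the WeChat official account "IT民工的龙马人生" and on the blog (www.htz.pw). You are welcome to follow, bookmark, and repost them; when reposting, please credit the source at the top of the article. Thank you!
Because the blog contains a lot of code, it reads best in a web browser.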
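Two weeks ago I shared "故障处理:19C RAC改私网IP后重建集群时报网络找不到" (Troubleshooting: network not found when rebuilding a 19c RAC cluster after changing the private IP). When root.sh was rerun in that environment, cluster initialization still failed. Today I made a round trip to Chongqing and did not feel like reading in the evening, so I decided to analyze this failure instead. It took a little while; here is the general line of reasoning.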
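Environment
The environment is a virtual machine on my own macOS machine running the ARM build of Oracle, version 19.19, with no other patches installed.
Reproducing the failure
Deconfiguring the cluster
To reproduce the failure, I first deconfigured the environment. Note the key options -lastnode -force: they mean that deconfig removes the last remaining cluster configuration.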
[root@arm01 install]# ./rootcrs.sh -deconfig -lastnode -force
.....
2025/09/02 22:06:11 CLSRSC-558: failed to deconfigure ASM
2025/09/02 22:06:11 CLSRSC-651: One or more deconfiguration steps failed, but the deconfiguration process continued because the -force option was specified.
Redirecting to /bin/systemctl restart rsyslog.service
2025/09/02 22:06:39 CLSRSC-4006: Removing Oracle Trace File Analyzer (TFA) Collector.
2025/09/02 22:08:29 CLSRSC-4007: Successfully removed Oracle Trace File Analyzer (TFA) Collector.
2025/09/02 22:09:00 CLSRSC-336: Successfully deconfigured Oracle Clusterware stack on this node
2025/09/02 22:09:00 CLSRSC-559: Ensure that the GPnP profile data under the 'gpnp' directory in /oracle/app/19.3.0/grid is deleted on each node before using the software in the current Grid Infrastructure home for reconfiguration.
Ignoring the intermediate messages and going by the final success message (CLSRSC-336), the cluster was deconfigured successfully.
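The output above ends in a success message, but it also contains a CLSRSC-558 line that is easy to skim past. As a minimal sketch, assuming the rootcrs.sh output has been saved to a file, the messages that mention a failure can be listed like this. The function name and the file path are made up for illustration; this is not an Oracle tool.

```shell
# Hypothetical helper: print every CLSRSC message in a saved rootcrs.sh
# output whose text contains the word "failed" (case-insensitive).
list_failed_steps() {
    # $1: path to a file holding the captured command output
    grep -E 'CLSRSC-[0-9]+:' "$1" | grep -i 'failed'
}

# Example call (the path is an assumption):
#   list_failed_steps /tmp/deconfig_output.log
```

With the deconfig output shown above, this would print the CLSRSC-558 and CLSRSC-651 lines and skip the CLSRSC-336 success line.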
Running the root.sh script
[root@arm01 install]# /oracle/app/19.3.0/grid/root.sh
2025/09/02 23:05:50 CLSRSC-594: Executing installation step 16 of 19: 'InitConfig'.
2025/09/02 23:06:36 CLSRSC-4002: Successfully installed Oracle Trace File Analyzer (TFA) Collector.
ASM has been created and started successfully.
[DBT-30022] Disk group arm_ocr mounted successfully.
2025/09/02 23:06:59 CLSRSC-428: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall.
Died at /oracle/app/19.3.0/grid/crs/install/oraocr.pm line 1890.
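The earlier, normal log lines have been omitted here. The key line is the one reporting success for the disk group. I read that as the disk group having been created, which would mean the old disk header had been wiped (otherwise the create would fail). Yet right after the disk group step, CLSRSC-428 was raised.
Let's look at the detailed log: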
2025-09-02 23:06:59: Executing the step [ocr_configFirstNode_step_2] to configure OCR on the first node
2025-09-02 23:06:59: Reuse Disk Group is set to 0
2025-09-02 23:06:59: Executing cmd: /oracle/app/19.3.0/grid/bin/ocrcheck -debug
2025-09-02 23:06:59: Command output:
> Status of Oracle Cluster Registry is as follows :
> Version : 4
> Total space (kbytes) : 901284
> Used space (kbytes) : 84400
> Available space (kbytes) : 816884
> ID : 1509093020
> Device/File Name : +ARM_OCR
> PROT-713: Device/File integrity check succeeded
>
> PROT-710: Device/File not configured
>
> PROT-710: Device/File not configured
>
> PROT-710: Device/File not configured
>
> PROT-710: Device/File not configured
>
> PROT-707: Cluster registry integrity check succeeded
>
> PROT-720: Logical corruption check succeeded
>
>End Command output
2025-09-02 23:06:59: checkOCR rc=0
2025-09-02 23:06:59: OCR check: passed
2025-09-02 23:06:59: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall
2025-09-02 23:06:59: Executing cmd: /oracle/app/19.3.0/grid/bin/clsecho -p has -f clsrsc -m 428
2025-09-02 23:06:59: Executing cmd: /oracle/app/19.3.0/grid/bin/clsecho -p has -f clsrsc -m 428
2025-09-02 23:06:59: Command output:
> CLSRSC-428: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall.
>End Command output
2025-09-02 23:06:59: CLSRSC-428: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall.
2025-09-02 23:06:59: ###### Begin DIE Stack Trace ######
2025-09-02 23:06:59: Package File Line Calling
2025-09-02 23:06:59: --------------- -------------------- ---- ----------
2025-09-02 23:06:59: 1: main rootcrs.pl 358 crsutils::dietrap
2025-09-02 23:06:59: 2: oraClusterwareComp::oraocr oraocr.pm 1890 main::__ANON__
2025-09-02 23:06:59: 3: oraClusterwareComp::oraocr oraocr.pm 1836 oraClusterwareComp::oraocr::configureOCR
2025-09-02 23:06:59: 4: oraClusterwareComp::oraocr oraocr.pm 245 oraClusterwareComp::oraocr::configSteps
2025-09-02 23:06:59: 5: oraClusterwareComp oraClusterwareComp.pm 91 oraClusterwareComp::oraocr::configureFirstNode
2025-09-02 23:06:59: 6: crsinstall crsinstall.pm 2586 oraClusterwareComp::configureCurrentNode
2025-09-02 23:06:59: 7: crsinstall crsinstall.pm 2427 crsinstall::perform_initial_config
2025-09-02 23:06:59: 8: crsinstall crsinstall.pm 1085 crsinstall::perform_init_config
2025-09-02 23:06:59: 9: crsinstall crsinstall.pm 1243 crsinstall::init_config
2025-09-02 23:06:59: 10: crsinstall crsinstall.pm 487 crsinstall::CRSInstall
2025-09-02 23:06:59: 11: main rootcrs.pl 559 crsinstall::new
2025-09-02 23:06:59: ####### End DIE Stack Trace #######
Note the following three lines:
2025-09-02 23:06:59: checkOCR rc=0
2025-09-02 23:06:59: OCR check: passed
2025-09-02 23:06:59: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall
checkOCR returned 0, which means the check passed, yet the very next line reports that an OCR configuration already exists. That looked odd.
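One way to read these three lines is that, during a first-node configuration, a passing OCR check is itself the problem: it shows that an old registry is still readable. The sketch below applies that reading to saved ocrcheck output. The function name and file path are hypothetical, and this is not the logic in oraocr.pm, only an illustration of the idea.

```shell
# Hypothetical helper: succeed if saved `ocrcheck` output shows a configured
# OCR location whose registry integrity check passed.
ocr_already_present() {
    # $1: file containing ocrcheck output
    grep -q 'Device/File Name' "$1" &&
    grep -q 'PROT-707' "$1"
}

# Example use before rerunning root.sh (the path is an assumption):
#   ocrcheck > /tmp/ocrcheck.out 2>&1
#   if ocr_already_present /tmp/ocrcheck.out; then
#       echo "old OCR data is still readable"
#   fi
```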
Analysis
Checking the disk group status
Disk Group Name Fail Group Path File Name Status Status Status TYPE File Size (MB) Used Size (MB) Pct. Used
---------------- ------------------ --------------------------------- ---------------------------- ------------------------ -------------- -------------- --------- -------------- -------------- ---------
ARM_OCR ARM_OCR_0000 /dev/nvme0n2 ARM_OCR_0000 MEMBER CACHED ONLINE REGULAR 5,120 320 6.25
****************** -------------- --------------
TOTAL 5,120 320
The disk group status looks completely normal.
Checking the disk group contents
Next, look at what is inside the disk group:
[grid@arm01 ~]$ ocrconfig -showbackup
^CPROT-26: Oracle Cluster Registry backup locations were retrieved from a local copy
arm01 2025/08/22 02:08:51 +arm_ocr:/raccluster/OCRBACKUP/backup00.ocr.261.1209780531 0
arm01 2025/08/21 14:16:32 +arm_ocr:/raccluster/OCRBACKUP/backup01.ocr.258.1209737791 0
arm01 2025/08/21 10:16:31 +arm_ocr:/raccluster/OCRBACKUP/backup02.ocr.263.1209723391 0
arm01 2025/08/21 10:16:31 +arm_ocr:/raccluster/OCRBACKUP/day.ocr.259.1209723391 0
arm01 2025/08/21 10:16:31 +arm_ocr:/raccluster/OCRBACKUP/week.ocr.260.1209723391 0
Even the earlier backup records are still there? That left me puzzled: where does this information come from?
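Root cause analysis
From the analysis above, the simple conclusion is that historical information still on the disk caused the error during cluster initialization. But why did deconfig not wipe the disk group completely, and why could the disk group still be created successfully afterwards? My guess is that in this version, -lastnode only wipes the disk group header and not the area that holds the cluster configuration files, so the cluster check reads the old cluster information and exits immediately.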
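The guess above could be probed without writing to the disk by checking whether a given region of the device still contains non-zero bytes. This is only a sketch: the function name is made up, and the offsets are placeholders, since where the old cluster data actually sits on the disk is not known from this case.

```shell
# Hypothetical helper: succeed if LEN_MIB mebibytes starting at OFFSET_MIB
# of a device or file contain only zero bytes. Read-only.
region_is_zero() {
    # $1: device or file, $2: offset in MiB, $3: length in MiB
    [ -r "$1" ] || return 2    # unreadable input must not look like "all zero"
    nonzero=$(dd if="$1" bs=1048576 skip="$2" count="$3" 2>/dev/null |
              LC_ALL=C tr -d '\000' | wc -c | tr -d ' ')
    [ "$nonzero" -eq 0 ]
}

# Example (run as root; the offsets are arbitrary placeholders):
#   region_is_zero /dev/nvme0n2 0 1  && echo "first MiB is empty"
#   region_is_zero /dev/nvme0n2 1 64 || echo "data remains after the first MiB"
```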
Solution
Wipe the disk manually. It feels like going back to the 10g days of cleaning disks by hand.
[root@arm01 install]# dd if=/dev/zero of=/dev/nvme0n2 bs=8192 count=1000000
dd: error writing '/dev/nvme0n2': No space left on device
^C
^C^C^C
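The "No space left on device" message is expected if the device is the 5,120 MB disk shown in the disk group listing (an assumption, since the raw device size is not printed in this post): dd was asked to write more than the disk holds and stopped at the end of the device. The arithmetic can be sketched as follows; the helper names are made up for illustration.

```shell
# requested_bytes BS COUNT: total bytes dd would try to write
requested_bytes() { echo $(( $1 * $2 )); }

# full_blocks SIZE_BYTES BS: number of whole blocks that fit on the device
full_blocks() { echo $(( $1 / $2 )); }

# Assumption: the device is exactly 5120 MiB, as in the disk group listing.
dev_bytes=$(( 5120 * 1024 * 1024 ))

requested_bytes 8192 1000000      # prints 8192000000
full_blocks "$dev_bytes" 8192     # prints 655360

# On a live Linux system the real size could be read with:
#   blockdev --getsize64 /dev/nvme0n2
```

Under that assumption, any count above 655360 at bs=8192 runs past the end of the disk, so the whole device had been zeroed by the time the error appeared.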
After running root.sh again, the cluster initialized successfully. The log is as follows:
[root@arm01 ~]# tail -1000f /oracle/app/19.3.0/grid/install/root_arm01_2025-09-02_23-17-34-060803682.log
Performing root user operation.
The following environment variables are set as:
ORACLE_OWNER= grid
ORACLE_HOME= /oracle/app/19.3.0/grid
Copying dbhome to /usr/local/bin ...
Copying oraenv to /usr/local/bin ...
Copying coraenv to /usr/local/bin ...
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Relinking oracle with rac_on option
Using configuration parameter file: /oracle/app/19.3.0/grid/crs/install/crsconfig_params
The log of current session can be found at:
/oracle/app/grid/crsdata/arm01/crsconfig/rootcrs_arm01_2025-09-02_11-17-34PM.log
2025/09/02 23:17:36 CLSRSC-594: Executing installation step 1 of 19: 'ValidateEnv'.
2025/09/02 23:17:36 CLSRSC-363: User ignored prerequisites during installation
2025/09/02 23:17:36 CLSRSC-594: Executing installation step 2 of 19: 'CheckFirstNode'.
2025/09/02 23:17:37 CLSRSC-594: Executing installation step 3 of 19: 'GenSiteGUIDs'.
2025/09/02 23:17:37 CLSRSC-594: Executing installation step 4 of 19: 'SetupOSD'.
Redirecting to /bin/systemctl restart rsyslog.service
2025/09/02 23:17:37 CLSRSC-594: Executing installation step 5 of 19: 'CheckCRSConfig'.
2025/09/02 23:17:37 CLSRSC-594: Executing installation step 6 of 19: 'SetupLocalGPNP'.
2025/09/02 23:17:41 CLSRSC-594: Executing installation step 7 of 19: 'CreateRootCert'.
2025/09/02 23:17:42 CLSRSC-594: Executing installation step 8 of 19: 'ConfigOLR'.
2025/09/02 23:17:50 CLSRSC-594: Executing installation step 9 of 19: 'ConfigCHMOS'.
2025/09/02 23:17:50 CLSRSC-594: Executing installation step 10 of 19: 'CreateOHASD'.
2025/09/02 23:17:51 CLSRSC-594: Executing installation step 11 of 19: 'ConfigOHASD'.
2025/09/02 23:17:51 CLSRSC-330: Adding Clusterware entries to file 'oracle-ohasd.service'
2025/09/02 23:18:05 CLSRSC-594: Executing installation step 12 of 19: 'SetupTFA'.
2025/09/02 23:18:05 CLSRSC-594: Executing installation step 13 of 19: 'InstallAFD'.
2025/09/02 23:18:05 CLSRSC-594: Executing installation step 14 of 19: 'InstallACFS'.
2025/09/02 23:18:25 CLSRSC-594: Executing installation step 15 of 19: 'InstallKA'.
2025/09/02 23:18:26 CLSRSC-594: Executing installation step 16 of 19: 'InitConfig'.
2025/09/02 23:19:11 CLSRSC-4002: Successfully installed Oracle Trace File Analyzer (TFA) Collector.
ASM has been created and started successfully.
[DBT-30001] Disk groups created successfully. Check /oracle/app/grid/cfgtoollogs/asmca/asmca-250902PM111854.log for details.
2025/09/02 23:19:37 CLSRSC-482: Running command: '/oracle/app/19.3.0/grid/bin/ocrconfig -upgrade grid oinstall'
CRS-4256: Updating the profile
Successful addition of voting disk 38a1f1f25b454f55bfbeeb3f52abb8e3.
Successfully replaced voting disk group with +arm_ocr.
CRS-4256: Updating the profile
CRS-4266: Voting file(s) successfully replaced
## STATE File Universal Id File Name Disk group
-- ----- ----------------- --------- ---------
1. ONLINE 38a1f1f25b454f55bfbeeb3f52abb8e3 (/dev/nvme0n2) [ARM_OCR]
Located 1 voting disk(s).
2025/09/02 23:20:09 CLSRSC-594: Executing installation step 17 of 19: 'StartCluster'.
2025/09/02 23:21:17 CLSRSC-343: Successfully started Oracle Clusterware stack
2025/09/02 23:21:17 CLSRSC-594: Executing installation step 18 of 19: 'ConfigNode'.
2025/09/02 23:21:55 CLSRSC-594: Executing installation step 19 of 19: 'PostConfig'.
2025/09/02 23:22:04 CLSRSC-325: Configure Oracle Grid Infrastructure for a Cluster ... succeeded
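Comparing the two root.sh runs quoted in this post, the disk group message differs: the failed run printed DBT-30022 ("mounted"), while this successful run printed DBT-30001 ("created"). A small sketch for pulling those messages out of saved output, so that two runs can be compared side by side; the function name and file paths are made up for illustration.

```shell
# Hypothetical helper: print each [DBT-nnnnn] message in a saved root.sh
# output, cut at the first full stop.
dbt_messages() {
    # $1: file containing root.sh output
    # An unescaped ']' outside a bracket expression is an ordinary character.
    grep -oE '\[DBT-[0-9]+][^.]*' "$1"
}

# Example (paths are assumptions):
#   dbt_messages /tmp/root_failed.log
#   dbt_messages /tmp/root_ok.log
```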
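------ About the author ------
Name: Huang Tingzhong (黄廷忠)
Current employer: Oracle China advanced services team
Previous employers: OceanBase, Enmotech (云和恩墨), 东方龙马, and others
Phone / WeChat / QQ: 18081072613
Personal blog: (http://www.htz.pw)
CSDN: (https://blog.csdn.net/wwwhtzpw)
cnblogs: (https://www.cnblogs.com/www-htz-pw)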