
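Our articles are published in parallel on the WeChat official account IT民工的龙马人生 and on the blog ( www.htz.pw ). You are welcome to follow, bookmark, and repost them, but please credit the source at the top of the reposted article. Thank you!
Because the blog contains a lot of code, it reads best in a web browser.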

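Two weeks ago I shared "Troubleshooting: network not found when rebuilding the cluster after changing the private network IP on 19c RAC". When root.sh was rerun on that same environment, cluster initialization still failed. I made a round trip to Chongqing today and did not feel like reading tonight, so I decided to analyze this failure instead. It took a little while; here is the general line of reasoning: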

Environment

This environment is the Arm build of Oracle installed in a virtual machine on my own macOS machine. The version is 19.19, with no other patches applied.

Reproducing the failure

Deconfiguring the cluster

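To reproduce the failure end to end, I first deconfigured the environment. Note the key options -lastnode -force: they mean that deconfig removes the last remaining cluster configuration.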

[root@arm01 install]# ./rootcrs.sh  -deconfig -lastnode -force
.....
2025/09/02 22:06:11 CLSRSC-558: failed to deconfigure ASM
2025/09/02 22:06:11 CLSRSC-651: One or more deconfiguration steps failed, but the deconfiguration process continued because the -force option was specified.
Redirecting to /bin/systemctl restart rsyslog.service
2025/09/02 22:06:39 CLSRSC-4006: Removing Oracle Trace File Analyzer (TFA) Collector.
2025/09/02 22:08:29 CLSRSC-4007: Successfully removed Oracle Trace File Analyzer (TFA) Collector.
2025/09/02 22:09:00 CLSRSC-336: Successfully deconfigured Oracle Clusterware stack on this node
2025/09/02 22:09:00 CLSRSC-559: Ensure that the GPnP profile data under the 'gpnp' directory in /oracle/app/19.3.0/grid is deleted on each node before using the software in the current Grid Infrastructure home for reconfiguration.

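Ignoring the intermediate log lines (including the CLSRSC-558 ASM deconfiguration failure, which the run continued past because of -force), the final success message (CLSRSC-336) shows that the cluster was deconfigured.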
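The last line of the output above (CLSRSC-559) asks that GPnP profile data under the 'gpnp' directory be deleted before the Grid home is reused. A minimal, read-only sketch for checking that is shown below. The function name list_gpnp_leftovers is hypothetical, and the file name pattern profile.xml is an assumption about the directory layout; adjust it if your installation differs.

```shell
# List GPnP profile files that are still present under a Grid home.
# Assumption: profiles are regular files named profile.xml somewhere
# below <grid_home>/gpnp. The function only reads; it deletes nothing.
list_gpnp_leftovers() {
    local grid_home="$1"
    local gpnp_dir="${grid_home}/gpnp"
    # Nothing to report if the gpnp directory does not exist at all.
    [ -d "$gpnp_dir" ] || return 0
    find "$gpnp_dir" -type f -name 'profile.xml' | sort
}

# Example (prints nothing when no profile files are left):
# list_gpnp_leftovers /oracle/app/19.3.0/grid
```

Any path printed by the function is a candidate for manual review before rerunning root.sh.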

Running the root.sh script

[root@arm01 install]# /oracle/app/19.3.0/grid/root.sh

2025/09/02 23:05:50 CLSRSC-594: Executing installation step 16 of 19: 'InitConfig'.
2025/09/02 23:06:36 CLSRSC-4002: Successfully installed Oracle Trace File Analyzer (TFA) Collector.

ASM has been created and started successfully.

[DBT-30022] Disk group arm_ocr mounted successfully.

2025/09/02 23:06:59 CLSRSC-428: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall.
Died at /oracle/app/19.3.0/grid/crs/install/oraocr.pm line 1890.

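The earlier, normal log lines are omitted here. The key line is that ASM brought up the disk group successfully (the log reports it as mounted, DBT-30022), which implies that the old disk header information had been wiped; otherwise the disk group could not have been created. Yet once the disk group was in place, the CLSRSC-428 error was raised.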

Let's look at the detailed log:

2025-09-02 23:06:59: Executing the step [ocr_configFirstNode_step_2] to configure OCR on the first node
2025-09-02 23:06:59: Reuse Disk Group is set to 0
2025-09-02 23:06:59: Executing cmd: /oracle/app/19.3.0/grid/bin/ocrcheck -debug
2025-09-02 23:06:59: Command output:
>  Status of Oracle Cluster Registry is as follows :
>        Version                  :          4
>        Total space (kbytes)     :     901284
>        Used space (kbytes)      :      84400
>        Available space (kbytes) :     816884
>        ID                       : 1509093020
>        Device/File Name         :   +ARM_OCR
>                                      PROT-713: Device/File integrity check succeeded
>
>                                      PROT-710: Device/File not configured
>
>                                      PROT-710: Device/File not configured
>
>                                      PROT-710: Device/File not configured
>
>                                      PROT-710: Device/File not configured
>
>        PROT-707: Cluster registry integrity check succeeded
>
>        PROT-720: Logical corruption check succeeded
>
>End Command output
2025-09-02 23:06:59: checkOCR rc=0
2025-09-02 23:06:59: OCR check: passed
2025-09-02 23:06:59: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall
2025-09-02 23:06:59: Executing cmd: /oracle/app/19.3.0/grid/bin/clsecho -p has -f clsrsc -m 428
2025-09-02 23:06:59: Executing cmd: /oracle/app/19.3.0/grid/bin/clsecho -p has -f clsrsc -m 428
2025-09-02 23:06:59: Command output:
>  CLSRSC-428: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall.
>End Command output
2025-09-02 23:06:59: CLSRSC-428: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall.
2025-09-02 23:06:59: ###### Begin DIE Stack Trace ######
2025-09-02 23:06:59:     Package         File                 Line Calling
2025-09-02 23:06:59:     --------------- -------------------- ---- ----------
2025-09-02 23:06:59:  1: main            rootcrs.pl            358 crsutils::dietrap
2025-09-02 23:06:59:  2: oraClusterwareComp::oraocr oraocr.pm            1890 main::__ANON__
2025-09-02 23:06:59:  3: oraClusterwareComp::oraocr oraocr.pm            1836 oraClusterwareComp::oraocr::configureOCR
2025-09-02 23:06:59:  4: oraClusterwareComp::oraocr oraocr.pm             245 oraClusterwareComp::oraocr::configSteps
2025-09-02 23:06:59:  5: oraClusterwareComp oraClusterwareComp.pm   91 oraClusterwareComp::oraocr::configureFirstNode
2025-09-02 23:06:59:  6: crsinstall      crsinstall.pm        2586 oraClusterwareComp::configureCurrentNode
2025-09-02 23:06:59:  7: crsinstall      crsinstall.pm        2427 crsinstall::perform_initial_config
2025-09-02 23:06:59:  8: crsinstall      crsinstall.pm        1085 crsinstall::perform_init_config
2025-09-02 23:06:59:  9: crsinstall      crsinstall.pm        1243 crsinstall::init_config
2025-09-02 23:06:59: 10: crsinstall      crsinstall.pm         487 crsinstall::CRSInstall
2025-09-02 23:06:59: 11: main            rootcrs.pl            559 crsinstall::new
2025-09-02 23:06:59: ####### End DIE Stack Trace #######
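The rootcrs log is long, so it can help to pull out only the failure portion. The following is a small sketch that prints the lines between the "Begin DIE Stack Trace" and "End DIE Stack Trace" markers seen in the log above. The function name extract_die_trace is hypothetical, and the sketch assumes the markers appear in the log exactly as shown here.

```shell
# Print the DIE stack trace section(s) of a rootcrs log file.
# Assumption: the section is delimited by lines containing
# "Begin DIE Stack Trace" and "End DIE Stack Trace", as in the log above.
extract_die_trace() {
    local log_file="$1"
    sed -n '/Begin DIE Stack Trace/,/End DIE Stack Trace/p' "$log_file"
}

# Example, using the session log path that root.sh prints at startup:
# extract_die_trace /oracle/app/grid/crsdata/arm01/crsconfig/rootcrs_arm01_<timestamp>.log
```

In this case the trace points at oraocr.pm line 1890, reached from configureOCR during first-node configuration.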

Note the following three lines:

2025-09-02 23:06:59: checkOCR rc=0
2025-09-02 23:06:59: OCR check: passed
2025-09-02 23:06:59: Existing OCR configuration found, aborting the configuration. Rerun configuration setup after deinstall

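checkOCR returned 0, which means the check passed, yet the "existing OCR configuration" error was reported immediately afterwards. That seemed a little odd.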
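One way to read these three lines is that, during first-node configuration, a passing ocrcheck is taken as evidence that a usable OCR already exists, which is exactly what triggers the abort. Under that reading, the same condition can be checked by hand before rerunning root.sh. The sketch below inspects ocrcheck output for a configured location plus the PROT-707 integrity message; the function name ocr_already_configured is hypothetical, and the two patterns are assumptions taken from the output shown in this post rather than from the rootcrs scripts themselves.

```shell
# Decide whether ocrcheck output (read from stdin) describes an OCR that
# is already configured and intact.
# Exit status 0: an existing OCR was found. Exit status 1: none found.
ocr_already_configured() {
    local output
    output="$(cat)"
    # A configured location appears as "Device/File Name : <location>".
    printf '%s\n' "$output" \
        | grep -Eq 'Device/File Name[[:space:]]*:[[:space:]]*[^[:space:]]' || return 1
    # PROT-707 is the overall integrity-check success message seen above.
    printf '%s\n' "$output" | grep -q 'PROT-707' || return 1
    return 0
}

# Example:
# /oracle/app/19.3.0/grid/bin/ocrcheck | ocr_already_configured \
#     && echo "old OCR still readable; root.sh would likely stop with CLSRSC-428"
```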

Analysis

Checking the disk group status

Disk Group Name  Fail Group         Path                              File Name                    Status                   Status         Status         TYPE      File Size (MB) Used Size (MB) Pct. Used
---------------- ------------------ --------------------------------- ---------------------------- ------------------------ -------------- -------------- --------- -------------- -------------- ---------
ARM_OCR          ARM_OCR_0000       /dev/nvme0n2                      ARM_OCR_0000                 MEMBER                   CACHED         ONLINE         REGULAR            5,120            320      6.25
                 ******************                                                                                                                                 -------------- --------------
                 TOTAL                                                                                                                                                       5,120            320

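The disk group status is completely normal here (the three Status columns are header status MEMBER, mount status CACHED, and mode status ONLINE).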

Checking the disk group contents

Next, look at what is in the disk group:

[grid@arm01 ~]$ ocrconfig -showbackup


^CPROT-26: Oracle Cluster Registry backup locations were retrieved from a local copy

arm01     2025/08/22 02:08:51     +arm_ocr:/raccluster/OCRBACKUP/backup00.ocr.261.1209780531     0
arm01     2025/08/21 14:16:32     +arm_ocr:/raccluster/OCRBACKUP/backup01.ocr.258.1209737791     0
arm01     2025/08/21 10:16:31     +arm_ocr:/raccluster/OCRBACKUP/backup02.ocr.263.1209723391     0
arm01     2025/08/21 10:16:31     +arm_ocr:/raccluster/OCRBACKUP/day.ocr.259.1209723391     0
arm01     2025/08/21 10:16:31     +arm_ocr:/raccluster/OCRBACKUP/week.ocr.260.1209723391     0

Even the earlier backup records are still listed? This left me somewhat puzzled: where is this information coming from?

Root cause analysis

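From the analysis above, the simple conclusion is that historical information still exists on the disk, and that is what makes cluster initialization fail. But why did deconfig not wipe the disk group completely, and why could the disk group still be created normally afterwards? My rough guess is that in this version the -lastnode deconfig wipes only the disk group header and leaves the area holding the cluster configuration files untouched, so the cluster check reads the old cluster information and exits immediately.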

Solution

Wipe the disk manually. It feels like going back to the 10g days, when disks were cleaned by hand.

[root@arm01 install]# dd if=/dev/zero of=/dev/nvme0n2 bs=8192 count=1000000
dd: error writing '/dev/nvme0n2': No space left on device
^C

^C^C^C
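The "No space left on device" message is expected here: 1,000,000 blocks of 8,192 bytes is about 8.2 GB, more than the 5,120 MB disk, so dd simply stops at the end of the device. To confirm that a wipe like this really cleared the device before rerunning root.sh, one option is to count the non-zero bytes that remain. The sketch below does that with standard tools; the function name count_nonzero_bytes is hypothetical, and reading a whole device this way can take a while.

```shell
# Count the non-zero bytes within the first <nbytes> bytes of a device or
# file. A result of 0 means that region contains only zero bytes.
count_nonzero_bytes() {
    local path="$1" nbytes="$2"
    # Drop every NUL byte and count whatever is left.
    head -c "$nbytes" "$path" | tr -d '\000' | wc -c | tr -d '[:space:]'
}

# Example (run as root; blockdev --getsize64 prints the device size in bytes):
# count_nonzero_bytes /dev/nvme0n2 "$(blockdev --getsize64 /dev/nvme0n2)"
```

Running the same check on only the first few megabytes and then on the whole device is also a cheap way to test the guess that only the header region had been wiped.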

After running root.sh again, cluster initialization succeeded. The log is as follows:

[root@arm01 ~]# tail -1000f /oracle/app/19.3.0/grid/install/root_arm01_2025-09-02_23-17-34-060803682.log
Performing root user operation.

The following environment variables are set as:
    ORACLE_OWNER= grid
    ORACLE_HOME=  /oracle/app/19.3.0/grid
   Copying dbhome to /usr/local/bin ...
   Copying oraenv to /usr/local/bin ...
   Copying coraenv to /usr/local/bin ...

Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.
Relinking oracle with rac_on option
Using configuration parameter file: /oracle/app/19.3.0/grid/crs/install/crsconfig_params
The log of current session can be found at:
  /oracle/app/grid/crsdata/arm01/crsconfig/rootcrs_arm01_2025-09-02_11-17-34PM.log
2025/09/02 23:17:36 CLSRSC-594: Executing installation step 1 of 19: 'ValidateEnv'.
2025/09/02 23:17:36 CLSRSC-363: User ignored prerequisites during installation
2025/09/02 23:17:36 CLSRSC-594: Executing installation step 2 of 19: 'CheckFirstNode'.
2025/09/02 23:17:37 CLSRSC-594: Executing installation step 3 of 19: 'GenSiteGUIDs'.
2025/09/02 23:17:37 CLSRSC-594: Executing installation step 4 of 19: 'SetupOSD'.
Redirecting to /bin/systemctl restart rsyslog.service
2025/09/02 23:17:37 CLSRSC-594: Executing installation step 5 of 19: 'CheckCRSConfig'.
2025/09/02 23:17:37 CLSRSC-594: Executing installation step 6 of 19: 'SetupLocalGPNP'.
2025/09/02 23:17:41 CLSRSC-594: Executing installation step 7 of 19: 'CreateRootCert'.
2025/09/02 23:17:42 CLSRSC-594: Executing installation step 8 of 19: 'ConfigOLR'.


2025/09/02 23:17:50 CLSRSC-594: Executing installation step 9 of 19: 'ConfigCHMOS'.
2025/09/02 23:17:50 CLSRSC-594: Executing installation step 10 of 19: 'CreateOHASD'.
2025/09/02 23:17:51 CLSRSC-594: Executing installation step 11 of 19: 'ConfigOHASD'.
2025/09/02 23:17:51 CLSRSC-330: Adding Clusterware entries to file 'oracle-ohasd.service'
2025/09/02 23:18:05 CLSRSC-594: Executing installation step 12 of 19: 'SetupTFA'.
2025/09/02 23:18:05 CLSRSC-594: Executing installation step 13 of 19: 'InstallAFD'.
2025/09/02 23:18:05 CLSRSC-594: Executing installation step 14 of 19: 'InstallACFS'.
2025/09/02 23:18:25 CLSRSC-594: Executing installation step 15 of 19: 'InstallKA'.
2025/09/02 23:18:26 CLSRSC-594: Executing installation step 16 of 19: 'InitConfig'.
2025/09/02 23:19:11 CLSRSC-4002: Successfully installed Oracle Trace File Analyzer (TFA) Collector.

ASM has been created and started successfully.

[DBT-30001] Disk groups created successfully. Check /oracle/app/grid/cfgtoollogs/asmca/asmca-250902PM111854.log for details.

2025/09/02 23:19:37 CLSRSC-482: Running command: '/oracle/app/19.3.0/grid/bin/ocrconfig -upgrade grid oinstall'
CRS-4256: Updating the profile
Successful addition of voting disk 38a1f1f25b454f55bfbeeb3f52abb8e3.
Successfully replaced voting disk group with +arm_ocr.
CRS-4256: Updating the profile
CRS-4266: Voting file(s) successfully replaced
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   38a1f1f25b454f55bfbeeb3f52abb8e3 (/dev/nvme0n2) [ARM_OCR]
Located 1 voting disk(s).
2025/09/02 23:20:09 CLSRSC-594: Executing installation step 17 of 19: 'StartCluster'.
2025/09/02 23:21:17 CLSRSC-343: Successfully started Oracle Clusterware stack
2025/09/02 23:21:17 CLSRSC-594: Executing installation step 18 of 19: 'ConfigNode'.
2025/09/02 23:21:55 CLSRSC-594: Executing installation step 19 of 19: 'PostConfig'.
2025/09/02 23:22:04 CLSRSC-325: Configure Oracle Grid Infrastructure for a Cluster ... succeeded

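------ About the author ------
Name: 黄廷忠 (Huang Tingzhong)
Currently at: Oracle China Advanced Services team
Previously at: OceanBase, 云和恩墨, 东方龙马, and others
Phone, WeChat, QQ: 18081072613
Personal blog: (http://www.htz.pw)
CSDN: (https://blog.csdn.net/wwwhtzpw)
cnblogs (博客园): (https://www.cnblogs.com/www-htz-pw)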

