当前位置: 首页 > ORA > 正文

RAC中IPC Send timeout导致实例重启(reason 0x1)

1  环境介绍

      AIX平台2个节点数据库,同时只能启动一个节点,当启动第二个节点的时候会由于IPC timeout将第一个节点踢掉,数据库版本是

10.2.0.5

 

2hosts与网卡配置

2.1 hosts信息

127.0.0.1               loopback localhost      # loopback (lo0) name/address

22.111.111.215    htzpw1

22.111.111.216    htzpw2

22.111.111.208    htzpw1-vip

22.111.111.209    htzpw2-vip

#192.178.4.10      htzpw1-priv

#192.178.4.11      htzpw2-priv

 

111.111.1.10    htzpw1_boot1 htzpw1-priv

192.178.3.10    htzpw1_boot2

111.111.1.11    htzpw2_boot1 htzpw2-priv

192.178.3.11    htzpw2_boot2

22.111.111.175  htzpw1_per

22.111.111.176  htzpw2_per

22.111.111.130  htzpwVIP

2.2 网卡信息

node1

ifconfig -a

en4: flags=1e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

        inet 111.111.1.10 netmask 0xffffff80 broadcast 111.111.1.127

        inet 22.111.111.175 netmask 0xffffff80 broadcast 10.185.240.255

         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

en5: flags=1e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

        inet 192.178.4.10 netmask 0xffffff80 broadcast 192.178.4.127

         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

en6: flags=1e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

        inet 192.178.3.10 netmask 0xffffff80 broadcast 192.178.3.127

         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

en7: flags=1e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

        inet 22.111.111.215 netmask 0xffffff80 broadcast 10.185.240.255

        inet 22.111.111.208 netmask 0xffffff80 broadcast 10.185.240.255

         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

lo0: flags=e08084b,c0<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,LARGESEND,CHAIN>

        inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255

        inet6 ::1%1/0

         tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1

        

        

node2

 

ifconfig -a

en4: flags=1e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

        inet 111.111.1.11 netmask 0xffffff80 broadcast 192.178.2.127

        inet 22.111.111.176 netmask 0xffffff80 broadcast 10.185.240.255

         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

en5: flags=1e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

        inet 192.178.4.11 netmask 0xffffff80 broadcast 192.178.4.127

         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

en6: flags=1e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

        inet 192.178.3.11 netmask 0xffffff80 broadcast 192.178.3.127

         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

en7: flags=1e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>

        inet 22.111.111.216 netmask 0xffffff80 broadcast 10.185.240.255

        inet 22.111.111.209 netmask 0xffffff80 broadcast 10.185.240.255

         tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0

lo0: flags=e08084b,c0<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,LARGESEND,CHAIN>

        inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255

        inet6 ::1%1/0

         tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1

2.3 oifcfg信息

oifcfg getif

en7 22.111.111.128 global public

en4 111.112.1.0 global cluster_interconnect

 

node 1

en4 111.111.1.10  netmask 0xffffff80

en7 22.111.111.215  netmask 0xffffff80

 

node2

en4 111.111.1.11   netmask 0xffffff80

en7 22.111.111.216  netmask 0xffffff80

从上面的信息可以得到,HACMP有两个boot地址分别配置在ent4ent6上面,在hacmp上面配置了一个永久IP地址与服务地址,

ORACLE 共有网络在同一个网段,ORACLE共有网络配置在en7上面。其实整个HACMP配置网络是有问题的,出现了共有网段

同时出现在ORACLE的时候与共有网络上面。在10G RAC中,HACMP在资源处的主要作用就是提供VG并发的功能,并不提供IP

服务功能。这里还有一个错误的地方,私有网络的网段是111.112.1.0这个网段,但是网卡配置的确实111.111.1这个网段,

这里肯定是错误的。

 

3,启动报错

启动节点2alert日志信息如下

SuSat Mar 28 10:34:40 GMT+08:00 2015Starting ORACLE instance (normal)

sskgpgetexecname failed to get name

LICENSE_MAX_SESSION = 0

LICENSE_SESSIONS_WARNING = 0

Interface type 1 en4 111.111.1.0 configured from OCR for use as a cluster interconnect

Interface type 1 en7 22.111.111.128 configured from OCR for use as  a public interface

Picked latch-free SCN scheme 3

Autotune of undo retention is turned off.

LICENSE_MAX_USERS = 0

SYS auditing is disabled

ksdpec: called for event 13740 prior to event group initialization

Starting up ORACLE RDBMS Version: 10.2.0.5.0.

System parameters with non-default values:

  processes                = 4000

  sessions                 = 4405

  sga_max_size             = 37580963840

  __shared_pool_size       = 4731174912

  __large_pool_size        = 16777216

  __java_pool_size         = 67108864

  __streams_pool_size      = 0

  spfile                   = /dev/rspfile

  sga_target               = 37580963840

  control_files            = /dev/rcontrol1, /dev/rcontrol2, /dev/rcontrol3

  db_block_size            = 8192

  __db_cache_size          = 32749125632

  compatible               = 10.2.0.5.0

  log_archive_dest_1       = location=/arch2

  db_files                 = 2000

  db_file_multiblock_read_count= 16

  cluster_database         = TRUE

  cluster_database_instances= 2

  _gc_undo_affinity        = FALSE

  _gc_affinity_time        = 0

  _gc_affinity_limit       = 250

  thread                   = 2

  instance_number          = 2

  undo_management          = AUTO

  undo_tablespace          = UNDOTBS2

  _undo_autotune           = FALSE

  undo_retention           = 3600

  remote_os_authent        = FALSE

  remote_login_passwordfile= EXCLUSIVE

  db_domain                =

  local_listener           = (ADDRESS=(PROTOCOL=TCP)(HOST=22.111.111.209)(PORT=1521))

  remote_listener          = LISTENERS_IPAPCQ

  job_queue_processes      = 10

  background_dump_dest     = /oracle/admin/ipapcq/bdump

  user_dump_dest           = /oracle/admin/ipapcq/udump

  core_dump_dest           = /oracle/admin/ipapcq/cdump

  audit_file_dest          = /oracle/admin/ipapcq/adump

  db_name                  = ipapcq

  open_cursors             = 1000

  sql92_security           = TRUE

  pga_aggregate_target     = 5368709120

  aq_tm_processes          = 10

Cluster communication is configured to use the following interface(s) for this instance

  111.111.1.11

Sat Mar 28 10:34:42 GMT+08:00 2015cluster interconnect IPC version:Oracle UDP/IP (generic)

IPC Vendor 1 proto 2

PMON started with pid=2, OS id=15795144

DIAG started with pid=3, OS id=14811274

PSP0 started with pid=4, OS id=15663318

LMON started with pid=5, OS id=9044900

LMD0 started with pid=6, OS id=15663464

LMS0 started with pid=7, OS id=8717302

LMS1 started with pid=8, OS id=8520698

LMS2 started with pid=9, OS id=10748526

LMS3 started with pid=10, OS id=15335548

MMAN started with pid=11, OS id=11076346

DBW0 started with pid=12, OS id=12386814

DBW1 started with pid=13, OS id=11796972

LGWR started with pid=14, OS id=17498712

CKPT started with pid=15, OS id=12255636

SMON started with pid=16, OS id=14811574

RECO started with pid=17, OS id=15270292

CJQ0 started with pid=18, OS id=9437972

MMON started with pid=19, OS id=11338422

MMNL started with pid=20, OS id=10879030

Sat Mar 28 10:34:52 GMT+08:00 2015lmon registered with NM – instance id 2 (internal mem no 1)

Sat Mar 28 10:34:52 GMT+08:00 2015Reconfiguration started (old inc 0, new inc 82)

Dynamic remastering is disabled

List of nodes:

 0 1

 Global Resource Directory frozen

* allocate domain 0, invalid = TRUE

 Communication channels reestablished

 * domain 0 valid = 1 according to instance 0

Sat Mar 28 10:34:53 GMT+08:00 2015 Master broadcasted resource hash value bitmaps

 Non-local Process blocks cleaned out

Sat Mar 28 10:34:53 GMT+08:00 2015 LMS 0: 0 GCS shadows cancelled, 0 closed

Sat Mar 28 10:34:53 GMT+08:00 2015 LMS 3: 0 GCS shadows cancelled, 0 closed

Sat Mar 28 10:34:53 GMT+08:00 2015 LMS 1: 0 GCS shadows cancelled, 0 closed

Sat Mar 28 10:34:53 GMT+08:00 2015 LMS 2: 0 GCS shadows cancelled, 0 closed

 Set master node info

 Submitted all remote-enqueue requests

 Dwn-cvts replayed, VALBLKs dubious

Sat Mar 28 10:40:06 GMT+08:00 2015IPC Send timeout detected.Sender: ospid 15663464

Receiver: inst 1 binc 91474918 ospid 23855480

Sat Mar 28 10:40:08 GMT+08:00 2015IPC Send timeout to 0.0 inc 82 for msg type 26 from opid 6

Sat Mar 28 10:40:08 GMT+08:00 2015Communications reconfiguration: instance_number 1

Sat Mar 28 10:40:09 GMT+08:00 2015Trace dumping is performing id=[cdmp_20150328104008]

Sat Mar 28 10:40:09 GMT+08:00 2015Reconfiguration started (old inc 82, new inc 86)

List of nodes:

 1

 Nested reconfiguration detected.

 Global Resource Directory frozen

 * dead instance detected – domain 0 invalid = TRUE

 Communication channels reestablished

 Master broadcasted resource hash value bitmaps

 Non-local Process blocks cleaned out

Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 1: 0 GCS shadows cancelled, 0 closed

Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 0: 0 GCS shadows cancelled, 0 closed

Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 2: 0 GCS shadows cancelled, 0 closed

Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 3: 0 GCS shadows cancelled, 0 closed

 Set master node info

 Submitted all remote-enqueue requests

 Dwn-cvts replayed, VALBLKs dubious

 All grantable enqueues granted

 Post SMON to start 1st pass IR

Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 0: 0 GCS shadows traversed, 0 replayed

Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 2: 0 GCS shadows traversed, 0 replayed

Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 1: 0 GCS shadows traversed, 0 replayed

Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 3: 0 GCS shadows traversed, 0 replayed

Sat Mar 28 10:40:09 GMT+08:00 2015 Submitted all GCS remote-cache requests

 Fix write in gcs resources

Reconfiguration complete

LCK0 started with pid=21, OS id=10879338

Sat Mar 28 10:40:10 GMT+08:00 2015ALTER DATABASE   MOUNT

这里可以看到数据库在nomount阶段的时候由于ipc send timeout将节点1给关了,这里数据库识别出来的是111.111.1这个网段是私用网络,但是oifcfg出来是111.112这个网络。

下面是节点1的日志

Sat Mar 28 10:34:35 GMT+08:00 2015Reconfiguration started (old inc 80, new inc 82)

List of nodes:

 0 1

 Global Resource Directory frozen

 Communication channels reestablished

 Master broadcasted resource hash value bitmaps

 Non-local Process blocks cleaned out

Sat Mar 28 10:34:35 GMT+08:00 2015 LMS 3: 0 GCS shadows cancelled, 0 closed

Sat Mar 28 10:34:35 GMT+08:00 2015 LMS 1: 0 GCS shadows cancelled, 0 closed

Sat Mar 28 10:34:35 GMT+08:00 2015 LMS 0: 0 GCS shadows cancelled, 0 closed

Sat Mar 28 10:34:35 GMT+08:00 2015 LMS 2: 0 GCS shadows cancelled, 0 closed

 Set master node info

 Submitted all remote-enqueue requests

 Dwn-cvts replayed, VALBLKs dubious

Sat Mar 28 10:39:49 GMT+08:00 2015IPC Send timeout detected. Receiver ospid 23855480

Sat Mar 28 10:39:49 GMT+08:00 2015Errors in file /oracle/admin/ipapcq/bdump/ipapcq1_lmd0_23855480.trc:

Sat Mar 28 10:39:51 GMT+08:00 2015Received an instance abort message from instance 2 (reason 0x1)

Please check instance 2 alert and LMON trace files for detail.

LMS0: terminating instance due to error 481

Sat Mar 28 10:39:52 GMT+08:00 2015Trace dumping is performing id=[cdmp_20150328104008]

Sat Mar 28 10:39:54 GMT+08:00 2015Shutting down instance (abort)

License high water mark = 1186

Sat Mar 28 10:39:56 GMT+08:00 2015Instance terminated by LMS0, pid = 20251510

Sat Mar 28 10:39:59 GMT+08:00 2015Instance terminated by USER, pid = 25232348

这里看到节点1由于IPC SEND TIMEOUTlms0进程将实例关闭,ipc timeout的原因是(reason 0x1)0x1说明:The Node Monitor generated the reconfiguration

 

lmd0进程日志信息如下:

Shared Memory:

ID                KEY

3145740         0xd36e6b18

Maximum processes:               = 4000

ksxpdmp:  facility 2 (SKGFAIPC) (0x1, 0x0) counts 0, 0

ksxpdmp:  Dumping the osd context (brief)

SKGXPCTX: 0x11040f4d0 ctx

WAIT HISTORY

Wait Time  Time since     Fast reaps  Wait Type  Return Code

  (ms)     prev wait(ms)  before                          

———  ————– ———– ———  ———–

117        0              0           NORMAL     TIMEDOUT

80         0              0           NORMAL     TIMEDOUT

80         0              0           NORMAL     TIMEDOUT

80         0              0           NORMAL     TIMEDOUT

151        0              0           NORMAL     POSTED 这里是成功了的

80         0              0           NORMAL     TIMEDOUT

80         0              0           NORMAL     TIMEDOUT

80         0              0           NORMAL     TIMEDOUT

80         0              0           NORMAL     TIMEDOUT

80         0              0           NORMAL     TIMEDOUT

80         0              0           NORMAL     TIMEDOUT

80         0              0           NORMAL     TIMEDOUT

80         0              0           NORMAL     TIMEDOUT

80         0              0           NORMAL     TIMEDOUT

80         0              0           NORMAL     TIMEDOUT

80         0              0           NORMAL     TIMEDOUT

wait delta 0 sec (60 msec) ctx ts 0x27b9f8e8 last ts 0x27b9f8e8

user cpu time since last wait 0 sec 0 ticks

system cpu time since last wait 0 sec 0 ticks

locked 1

blocked 10

timed wait receives 2

fast reaps since last wait 0

admno 0x4440d03e admport:

SSKGXPT 0x104118a0 flags SSKGXPT_BUFSZ_SET         socket no 9         IP 111.111.1.10         UDP 34970

last bytes received: 0

context timestamp 0x27b9f8e8

done Queue

no completed requests

port Queue

SKGXPT 11046c718 portno = 165815271 port:

SSKGXPT 0x1046c728 flags SSKGXPT_BUFSZ_SET         socket no 14         IP 111.111.1.10         UDP 34972

connection Queue

no pending connect disconnect operations

sends waiting to be transmitted

no sends waiting to be transmitted

pending ack Queue

no send requests pending ack

Mapped regions

ksxpdmp:  Dumping the ksxp contexts

ksxpdmp:  Dumping ksxp context 0x11046c8f8 client 2

这里给处了IP地址与IPC TIMEDOUT的信息。

 

lms0日志信息,只有下面的信息。

*** 2015-03-28 10:34:35.858

 0 GCS shadows cancelled, 0 closed

 885825 GCS resources traversed, 0 cancelled

 LMS 0: 2443790 GCS resources on freelist, 4215772 on array, 4215775 allocated

*** 2015-03-28 10:39:51.259

Received a special admin message (type 2) from node 1

Message body: flag 0x1 data 0x0 0x1 0x0

Abort the instance

ksuitm: waiting up to [5] seconds before killing DIAG(25297038)

此时节点2上面的lmd,lms进程里面没有任何与节点1相关的一些信息。

 

4 分析总结

     通过上面分析,怀疑是配置错误的私有网络导致的,另外HACMP配置网络错误,出现公网网络与私网网络在同一个网卡上面。

本文固定链接: http://www.htz.pw/2015/03/28/rac%e4%b8%adipc-send-timeout%e5%af%bc%e8%87%b4%e5%ae%9e%e4%be%8b%e9%87%8d%e5%90%afreason-0x1.html | 认真就输

该日志由 huangtingzhong 于2015年03月28日发表在 ORA 分类下, 通告目前不可用,你可以至底部留下评论。
原创文章转载请注明: RAC中IPC Send timeout导致实例重启(reason 0x1) | 认真就输
关键字: