1 环境介绍
AIX平台2个节点数据库,同时只能启动一个节点,当启动第二个节点的时候会由于IPC timeout将第一个节点踢掉,数据库版本是
10.2.0.5
2,hosts与网卡配置
2.1 hosts信息
127.0.0.1 loopback localhost # loopback (lo0) name/address 22.111.111.215 htzpw1 22.111.111.216 htzpw2 22.111.111.208 htzpw1-vip 22.111.111.209 htzpw2-vip #192.178.4.10 htzpw1-priv #192.178.4.11 htzpw2-priv
111.111.1.10 htzpw1_boot1 htzpw1-priv 192.178.3.10 htzpw1_boot2 111.111.1.11 htzpw2_boot1 htzpw2-priv 192.178.3.11 htzpw2_boot2 22.111.111.175 htzpw1_per 22.111.111.176 htzpw2_per 22.111.111.130 htzpwVIP |
2.2 网卡信息
node1 ifconfig -a en4: flags=1e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN> inet 111.111.1.10 netmask 0xffffff80 broadcast 111.111.1.127 inet 22.111.111.175 netmask 0xffffff80 broadcast 10.185.240.255 tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0 en5: flags=1e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN> inet 192.178.4.10 netmask 0xffffff80 broadcast 192.178.4.127 tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0 en6: flags=1e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN> inet 192.178.3.10 netmask 0xffffff80 broadcast 192.178.3.127 tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0 en7: flags=1e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN> inet 22.111.111.215 netmask 0xffffff80 broadcast 10.185.240.255 inet 22.111.111.208 netmask 0xffffff80 broadcast 10.185.240.255 tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0 lo0: flags=e08084b,c0<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,LARGESEND,CHAIN> inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255 inet6 ::1%1/0 tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1
node2
ifconfig -a en4: flags=1e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN> inet 111.111.1.11 netmask 0xffffff80 broadcast 192.178.2.127 inet 22.111.111.176 netmask 0xffffff80 broadcast 10.185.240.255 tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0 en5: flags=1e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN> inet 192.178.4.11 netmask 0xffffff80 broadcast 192.178.4.127 tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0 en6: flags=1e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN> inet 192.178.3.11 netmask 0xffffff80 broadcast 192.178.3.127 tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0 en7: flags=1e080863,c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN> inet 22.111.111.216 netmask 0xffffff80 broadcast 10.185.240.255 inet 22.111.111.209 netmask 0xffffff80 broadcast 10.185.240.255 tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0 lo0: flags=e08084b,c0<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,LARGESEND,CHAIN> inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255 inet6 ::1%1/0 tcp_sendspace 131072 tcp_recvspace 131072 rfc1323 1 |
2.3 oifcfg信息
oifcfg getif en7 22.111.111.128 global public en4 111.112.1.0 global cluster_interconnect
node 1 en4 111.111.1.10 netmask 0xffffff80 en7 22.111.111.215 netmask 0xffffff80
node2 en4 111.111.1.11 netmask 0xffffff80 en7 22.111.111.216 netmask 0xffffff80 |
从上面的信息可以得到,HACMP有两个boot地址分别配置在ent4与ent6上面,在hacmp上面配置了一个永久IP地址与服务地址,
与ORACLE的 共有网络在同一个网段,ORACLE共有网络配置在en7上面。其实整个HACMP配置网络是有问题的,出现了共有网段
同时出现在ORACLE的时候与共有网络上面。在10G RAC中,HACMP在资源处的主要作用就是提供VG并发的功能,并不提供IP
服务功能。这里还有一个错误的地方,私有网络的网段是111.112.1.0这个网段,但是网卡配置的确实111.111.1这个网段,
这里肯定是错误的。
3,启动报错
启动节点2,alert日志信息如下
SuSat Mar 28 10:34:40 GMT+08:00 2015Starting ORACLE instance (normal) sskgpgetexecname failed to get name LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Interface type 1 en4 111.111.1.0 configured from OCR for use as a cluster interconnect Interface type 1 en7 22.111.111.128 configured from OCR for use as a public interface Picked latch-free SCN scheme 3 Autotune of undo retention is turned off. LICENSE_MAX_USERS = 0 SYS auditing is disabled ksdpec: called for event 13740 prior to event group initialization Starting up ORACLE RDBMS Version: 10.2.0.5.0. System parameters with non-default values: processes = 4000 sessions = 4405 sga_max_size = 37580963840 __shared_pool_size = 4731174912 __large_pool_size = 16777216 __java_pool_size = 67108864 __streams_pool_size = 0 spfile = /dev/rspfile sga_target = 37580963840 control_files = /dev/rcontrol1, /dev/rcontrol2, /dev/rcontrol3 db_block_size = 8192 __db_cache_size = 32749125632 compatible = 10.2.0.5.0 log_archive_dest_1 = location=/arch2 db_files = 2000 db_file_multiblock_read_count= 16 cluster_database = TRUE cluster_database_instances= 2 _gc_undo_affinity = FALSE _gc_affinity_time = 0 _gc_affinity_limit = 250 thread = 2 instance_number = 2 undo_management = AUTO undo_tablespace = UNDOTBS2 _undo_autotune = FALSE undo_retention = 3600 remote_os_authent = FALSE remote_login_passwordfile= EXCLUSIVE db_domain = local_listener = (ADDRESS=(PROTOCOL=TCP)(HOST=22.111.111.209)(PORT=1521)) remote_listener = LISTENERS_IPAPCQ job_queue_processes = 10 background_dump_dest = /oracle/admin/ipapcq/bdump user_dump_dest = /oracle/admin/ipapcq/udump core_dump_dest = /oracle/admin/ipapcq/cdump audit_file_dest = /oracle/admin/ipapcq/adump db_name = ipapcq open_cursors = 1000 sql92_security = TRUE pga_aggregate_target = 5368709120 aq_tm_processes = 10 Cluster communication is configured to use the following interface(s) for this instance 111.111.1.11 Sat Mar 28 10:34:42 GMT+08:00 2015cluster interconnect IPC version:Oracle UDP/IP (generic) IPC Vendor 1 proto 2 PMON started with pid=2, OS id=15795144 DIAG started with pid=3, OS id=14811274 PSP0 started with pid=4, OS id=15663318 LMON started with pid=5, OS id=9044900 LMD0 started with pid=6, OS id=15663464 LMS0 started with pid=7, OS id=8717302 LMS1 started with pid=8, OS id=8520698 LMS2 started with pid=9, OS id=10748526 LMS3 started with pid=10, OS id=15335548 MMAN started with pid=11, OS id=11076346 DBW0 started with pid=12, OS id=12386814 DBW1 started with pid=13, OS id=11796972 LGWR started with pid=14, OS id=17498712 CKPT started with pid=15, OS id=12255636 SMON started with pid=16, OS id=14811574 RECO started with pid=17, OS id=15270292 CJQ0 started with pid=18, OS id=9437972 MMON started with pid=19, OS id=11338422 MMNL started with pid=20, OS id=10879030 Sat Mar 28 10:34:52 GMT+08:00 2015lmon registered with NM – instance id 2 (internal mem no 1) Sat Mar 28 10:34:52 GMT+08:00 2015Reconfiguration started (old inc 0, new inc 82) Dynamic remastering is disabled List of nodes: 0 1 Global Resource Directory frozen * allocate domain 0, invalid = TRUE Communication channels reestablished * domain 0 valid = 1 according to instance 0 Sat Mar 28 10:34:53 GMT+08:00 2015 Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out Sat Mar 28 10:34:53 GMT+08:00 2015 LMS 0: 0 GCS shadows cancelled, 0 closed Sat Mar 28 10:34:53 GMT+08:00 2015 LMS 3: 0 GCS shadows cancelled, 0 closed Sat Mar 28 10:34:53 GMT+08:00 2015 LMS 1: 0 GCS shadows cancelled, 0 closed Sat Mar 28 10:34:53 GMT+08:00 2015 LMS 2: 0 GCS shadows cancelled, 0 closed Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious Sat Mar 28 10:40:06 GMT+08:00 2015IPC Send timeout detected.Sender: ospid 15663464 Receiver: inst 1 binc 91474918 ospid 23855480 Sat Mar 28 10:40:08 GMT+08:00 2015IPC Send timeout to 0.0 inc 82 for msg type 26 from opid 6 Sat Mar 28 10:40:08 GMT+08:00 2015Communications reconfiguration: instance_number 1 Sat Mar 28 10:40:09 GMT+08:00 2015Trace dumping is performing id=[cdmp_20150328104008] Sat Mar 28 10:40:09 GMT+08:00 2015Reconfiguration started (old inc 82, new inc 86) List of nodes: 1 Nested reconfiguration detected. Global Resource Directory frozen * dead instance detected – domain 0 invalid = TRUE Communication channels reestablished Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 1: 0 GCS shadows cancelled, 0 closed Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 0: 0 GCS shadows cancelled, 0 closed Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 2: 0 GCS shadows cancelled, 0 closed Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 3: 0 GCS shadows cancelled, 0 closed Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Post SMON to start 1st pass IR Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 0: 0 GCS shadows traversed, 0 replayed Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 2: 0 GCS shadows traversed, 0 replayed Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 1: 0 GCS shadows traversed, 0 replayed Sat Mar 28 10:40:09 GMT+08:00 2015 LMS 3: 0 GCS shadows traversed, 0 replayed Sat Mar 28 10:40:09 GMT+08:00 2015 Submitted all GCS remote-cache requests Fix write in gcs resources Reconfiguration complete LCK0 started with pid=21, OS id=10879338 Sat Mar 28 10:40:10 GMT+08:00 2015ALTER DATABASE MOUNT |
这里可以看到数据库在nomount阶段的时候由于ipc send timeout将节点1给关了,这里数据库识别出来的是111.111.1这个网段是私用网络,但是oifcfg出来是111.112这个网络。
下面是节点1的日志
Sat Mar 28 10:34:35 GMT+08:00 2015Reconfiguration started (old inc 80, new inc 82) List of nodes: 0 1 Global Resource Directory frozen Communication channels reestablished Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out Sat Mar 28 10:34:35 GMT+08:00 2015 LMS 3: 0 GCS shadows cancelled, 0 closed Sat Mar 28 10:34:35 GMT+08:00 2015 LMS 1: 0 GCS shadows cancelled, 0 closed Sat Mar 28 10:34:35 GMT+08:00 2015 LMS 0: 0 GCS shadows cancelled, 0 closed Sat Mar 28 10:34:35 GMT+08:00 2015 LMS 2: 0 GCS shadows cancelled, 0 closed Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious Sat Mar 28 10:39:49 GMT+08:00 2015IPC Send timeout detected. Receiver ospid 23855480 Sat Mar 28 10:39:49 GMT+08:00 2015Errors in file /oracle/admin/ipapcq/bdump/ipapcq1_lmd0_23855480.trc: Sat Mar 28 10:39:51 GMT+08:00 2015Received an instance abort message from instance 2 (reason 0x1) Please check instance 2 alert and LMON trace files for detail. LMS0: terminating instance due to error 481 Sat Mar 28 10:39:52 GMT+08:00 2015Trace dumping is performing id=[cdmp_20150328104008] Sat Mar 28 10:39:54 GMT+08:00 2015Shutting down instance (abort) License high water mark = 1186 Sat Mar 28 10:39:56 GMT+08:00 2015Instance terminated by LMS0, pid = 20251510 Sat Mar 28 10:39:59 GMT+08:00 2015Instance terminated by USER, pid = 25232348 |
这里看到节点1由于IPC SEND TIMEOUT,lms0进程将实例关闭,ipc timeout的原因是(reason 0x1),0x1说明:The Node Monitor generated the reconfiguration。
lmd0进程日志信息如下:
Shared Memory: ID KEY 3145740 0xd36e6b18 Maximum processes: = 4000 ksxpdmp: facility 2 (SKGFAIPC) (0x1, 0x0) counts 0, 0 ksxpdmp: Dumping the osd context (brief) SKGXPCTX: 0x11040f4d0 ctx WAIT HISTORY Wait Time Time since Fast reaps Wait Type Return Code (ms) prev wait(ms) before ——— ————– ———– ——— ———– 117 0 0 NORMAL TIMEDOUT 80 0 0 NORMAL TIMEDOUT 80 0 0 NORMAL TIMEDOUT 80 0 0 NORMAL TIMEDOUT 151 0 0 NORMAL POSTED 这里是成功了的 80 0 0 NORMAL TIMEDOUT 80 0 0 NORMAL TIMEDOUT 80 0 0 NORMAL TIMEDOUT 80 0 0 NORMAL TIMEDOUT 80 0 0 NORMAL TIMEDOUT 80 0 0 NORMAL TIMEDOUT 80 0 0 NORMAL TIMEDOUT 80 0 0 NORMAL TIMEDOUT 80 0 0 NORMAL TIMEDOUT 80 0 0 NORMAL TIMEDOUT 80 0 0 NORMAL TIMEDOUT wait delta 0 sec (60 msec) ctx ts 0x27b9f8e8 last ts 0x27b9f8e8 user cpu time since last wait 0 sec 0 ticks system cpu time since last wait 0 sec 0 ticks locked 1 blocked 10 timed wait receives 2 fast reaps since last wait 0 admno 0x4440d03e admport: SSKGXPT 0x104118a0 flags SSKGXPT_BUFSZ_SET socket no 9 IP 111.111.1.10 UDP 34970 last bytes received: 0 context timestamp 0x27b9f8e8 done Queue no completed requests port Queue SKGXPT 11046c718 portno = 165815271 port: SSKGXPT 0x1046c728 flags SSKGXPT_BUFSZ_SET socket no 14 IP 111.111.1.10 UDP 34972 connection Queue no pending connect disconnect operations sends waiting to be transmitted no sends waiting to be transmitted pending ack Queue no send requests pending ack Mapped regions ksxpdmp: Dumping the ksxp contexts ksxpdmp: Dumping ksxp context 0x11046c8f8 client 2 |
这里给处了IP地址与IPC TIMEDOUT的信息。
lms0日志信息,只有下面的信息。
*** 2015-03-28 10:34:35.858 0 GCS shadows cancelled, 0 closed 885825 GCS resources traversed, 0 cancelled LMS 0: 2443790 GCS resources on freelist, 4215772 on array, 4215775 allocated *** 2015-03-28 10:39:51.259 Received a special admin message (type 2) from node 1 Message body: flag 0x1 data 0x0 0x1 0x0 Abort the instance ksuitm: waiting up to [5] seconds before killing DIAG(25297038) |
此时节点2上面的lmd,lms进程里面没有任何与节点1相关的一些信息。
4 分析总结
通过上面分析,怀疑是配置错误的私有网络导致的,另外HACMP配置网络错误,出现公网网络与私网网络在同一个网卡上面。
RAC中IPC Send timeout导致实例重启(reason 0x1):等您坐沙发呢!