数据库今天早上突然出现大量的log file sync等待事件,如下:
I EVENT SQL_ID COUNT(*) — —————————————- —————— ———- 1 Backup: MML write backup piece 8 PX Deq: Execution Msg 81prbu1ddkd5c 1 db file scattered read f404sdb940xrp 1 log file sync 7 read by other session f404sdb940xrp 4 2 PX Deq: Execute Reply 81prbu1ddkd5c 1 PX Deq: Execution Msg 81prbu1ddkd5c 1 db file sequential read 06xny3qtp17j4 1 db file sequential read 0pthc62tk3b6d 1 db file sequential read 1rmtm7zh5qvyq 1 db file sequential read 2m6tz655jw57t 1 db file sequential read 3yfa7shq4vvgw 1 db file sequential read 48kt9xh5g1tdj 1 db file sequential read 4xjm94x3v9pyu 1 db file sequential read 5j6c12bw59bvk 1 db file sequential read 6dkrf3bzhgxrp 1 db file sequential read 6nwm098n4vs5z 1 db file sequential read 77zm7d3rhnu5z 1 db file sequential read 7r3jkwzkcuz5d 1 db file sequential read 81p0m707yjwbs 1 db file sequential read 94jsv1hdp6ks9 1 db file sequential read 96ncsaz7tfh3v 1 db file sequential read 9mcnbjwx7qynf 1 db file sequential read 9nx093r7hr0g3 1 db file sequential read a4j25uk2q0jdp 1 db file sequential read asphg45zk6d7m 1 db file sequential read bcsjzna3fvrb1 1 db file sequential read dp2jf1dpvd7x3 1 db file sequential read dwbxgb7838xht 1 db file sequential read fpabwmm0c5812 1 db file sequential read gb995a2pwuqu2 1 db file sequential read gf25ng83zukxx 1 db file sequential read gm42ngbs5ac9a 1 db file sequential read gmk8cs80tb9u9 1 enq: TX – contention 3dhx24xm7u1xp 1 library cache lock 85spadpadkvhj 1 log file sync 41
37 rows selected.
I EVENT COUNT(*) — —————————————- ———- 1 Backup: MML write backup piece 8 log file sync 7 read by other session 4 db file scattered read 1 PX Deq: Execution Msg 1 2 log file sync 38 db file sequential read 29 enq: TX – contention 1 PX Deq: Execution Msg 1 PX Deq: Execute Reply 1 SQL*Net message from client 1 |
基本都出现在节点2上,节点1只有少量的log file sync等待事件,不过在节点1上面出现了一个rman的等待事件,并且数量达到了8个,有点异常了。在原来做NBU的时候,一般配置都不会操作过4个。
查看lgwr的trace文件,怀疑是由于自适应日志写导致的,这是11G中的新功能。
Warning: log write elapsed time 589ms, size 660KB Warning: log write elapsed time 635ms, size 1290KB Warning: log write elapsed time 866ms, size 1312KB …………….. Warning: log write elapsed time 1414ms, size 930KB Warning: log write elapsed time 510ms, size 1844KB |
通过trace文件,未发现相当的trace信息输出
查看io速度
由于aix平台,sar命令需要root用户权限才可以执行,所以这里直接查看光纤的速度就可以了
节点2的光纤速度
lqtopas_nmonqqL=LargePageStatsqqqHost=abmdb02qqqqqqqqRefresh=2 secsqqq09:32 x Fibre-Channel-Adapter qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq x Adapter Receive Transmit -Requests- –Size KB– xNumber Name KB/s KB/s In Out In Out x 1 fcs0 23594.9 11831.3 546.5 481.7 43.2 24.6 x 2 fcs1 0.0 0.0 0.0 0.0 0.0 0.0 x 3 fcs2 22541.6 10784.7 541.1 471.9 41.7 22.9 x 4 fcs3 0.0 0.0 0.0 0.0 0.0 0.0 lqtopas_nmonqqk=Kernel-statsqqqqqHost=abmdb02qqqqqqqqRefresh=2 secsqqq09:32 x Fibre-Channel-Adapter qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq x Adapter Receive Transmit -Requests- –Size KB– xNumber Name KB/s KB/s In Out In Out x 1 fcs0 28903.6 21866.5 705.2 811.4 41.0 26.9 x 2 fcs1 0.0 0.0 0.0 0.0 0.0 0.0 x 3 fcs2 28384.6 20360.7 639.4 796.6 44.4 25.6 x 4 fcs3 0.0 0.0 0.0 0.0 0.0 0.0 |
这里发现节点2的传递速度只有20M/S左右,太不正常了
节点1的光纤速度
lqtopas_nmonqq#=PURR StatsqqqqqqqHost=abmdb01qqqqqqqqRefresh=2 secsqqq09:50 x Fibre-Channel-Adapter qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq x Adapter Receive Transmit -Requests- –Size KB– xNumber Name KB/s KB/s In Out In Out x 1 fcs0 145064.1 1317.7 1245.3 67.3 116.5 19.6 x 2 fcs1 0.0 224332.0 0.0 1752.6 0.0 128.0 x 3 fcs2 144409.3 1985.3 1259.9 63.9 114.6 31.1 x 4 fcs3 0.0 0.0 0.0 0.0 0.0 0.0 lqtopas_nmonqqc=CPUqqqqqqqqqqqqqqHost=abmdb01qqqqqqqqRefresh=2 secsqqq09:50 x Fibre-Channel-Adapter qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq x Adapter Receive Transmit -Requests- –Size KB– xNumber Name KB/s KB/s In Out In Out x 1 fcs0 146195.3 133.1 1289.7 11.3 113.4 11.8 x 2 fcs1 0.0 228583.8 0.0 1785.8 0.0 128.0 x 3 fcs2 143075.5 151.9 1267.7 13.2 112.9 11.5 x 4 fcs3 0.0 0.0 0.0 0.0 0.0 0.0 |
这里发现节点1的传递速度达到了220M/S的速度。
通过两个节点的光纤速度相加差不250M/S的速度了,基本上已经达到了存储的限制值了
收集两个节点AWR数据
节点2异常时的IO统计
节点2异常时的IO统计
基本上跟之前的光纤卡的速度对应,并且节点1的RMAN占用的IO达到了274M/S,太吓人了
下面来看看前一天的相同时段的AWR数据
节点2
节点1
可以看到在正常时段的IO还不到150M/S
分析总结
由于节点1RMAN消耗大量的IO,导致存储IO性能下降,导致节点2出现大量的log file sync,通知TSM备份工程师,暂停RMAN备份后,系统恢复正常。
一次log file sync等待事件分析:等您坐沙发呢!