Case Analysis: Abnormal Shutdown of One Node in a RAC Cluster on IBM Midrange Servers
2021-07-27 11:02:03



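This is a two-node cluster. Early on Saturday morning, the on-duty staff reported in the WeChat group that monitoring showed a connection error. After confirming that business was not affected, I checked the connection configuration in the monitoring software and concluded that one instance had most likely gone down. Because the machines sit on the internal network and there was no VPN, I had to go to the office to handle it. Running crsctl stat res -t confirmed that instance 1 was indeed down.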

The alert log showed the following.

his instance is not in good health and terminating itself.
2021-07-24T05:21:52.460053+08:00
Errors in file /u01/app/oracle/diag/rdbms/mydb/mydb1/trace/mydb1_lmon_18036.trc:
ORA-29743: exiting from instance membership recovery protocol because this instance is not in good health <===== LMON reported the error first
2021-07-24T05:21:52.573893+08:00
Error: Shutdown in progress. Error: 29743.
Errors in file /u01/app/oracle/diag/rdbms/mydb/mydb1/trace/mydb1_ora_17901.trc (incident=756041) (PDBNAME=CDB$ROOT):
ORA-00600: internal error code, arguments: [ksqsgn:join], [error in lmon process], [32], [], [], [], [], [], [], [], [], [] <==== the ORA-600 occurred after the LMON failure
Incident details in: /u01/app/oracle/diag/rdbms/mydb/mydb1/incident/incdir_756041/mydb1_ora_17901_i756041.trc
2021-07-24T05:21:52.673750+08:00
LMON (ospid: 18036): terminating the instance due to error 481
2021-07-24T05:21:52.863668+08:00
System state dump requested by (instance=1, osid=4294985332 (LMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/mydb/mydb1/trace/mydb1_diag_18020_20210724052152.trc
2021-07-24T05:21:54.356135+08:00
Dumping diagnostic data in directory=[cdmp_20210724052154], requested by (instance=1, osid=4294985332 (LMON)), summary=[abnormal instance termination].
2021-07-24T05:21:57.666877+08:00
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
2021-07-24T05:21:58.679728+08:00
Instance terminated by LMON, pid = 18036 <===== LMON shut down the DB, so ORA-600: [ksqsgn:join], [error in lmon process], [32], [], [], [], [], [], [], is only a side effect

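At first I suspected an Oracle bug. The vendor's support engineer confirmed that, judging by the order of events, LMON did report its error first and the ORA-600 was a by-product of the LMON failure.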
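The ordering argument above can be checked mechanically. The following is a minimal sketch, assuming an alert log laid out like the excerpt above (a timestamp line followed by its message lines). The excerpt is abridged, and the helper name `first_seen` is hypothetical rather than part of any Oracle tooling.

```python
import re
from datetime import datetime

# Abridged excerpt of the alert log shown above (wrapped lines joined,
# annotations and some message lines removed).
ALERT_LOG = """\
2021-07-24T05:21:52.460053+08:00
Errors in file /u01/app/oracle/diag/rdbms/mydb/mydb1/trace/mydb1_lmon_18036.trc:
ORA-29743: exiting from instance membership recovery protocol because this instance is not in good health
2021-07-24T05:21:52.573893+08:00
Error: Shutdown in progress. Error: 29743.
ORA-00600: internal error code, arguments: [ksqsgn:join], [error in lmon process], [32]
2021-07-24T05:21:52.673750+08:00
LMON (ospid: 18036): terminating the instance due to error 481
"""

# A timestamp line such as 2021-07-24T05:21:52.460053+08:00
TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6}[+-]\d{2}:\d{2}$")
ORA_RE = re.compile(r"ORA-\d{5}")


def first_seen(log_text):
    """Return [(ora_code, timestamp), ...] in order of first appearance."""
    current = None   # timestamp of the entry being read
    seen = {}        # ora_code -> timestamp of its first occurrence
    for line in log_text.splitlines():
        line = line.strip()
        if TS_RE.match(line):
            current = datetime.fromisoformat(line)
            continue
        if current is None:
            continue  # ignore anything before the first timestamp
        for code in ORA_RE.findall(line):
            seen.setdefault(code, current)
    return list(seen.items())


for code, ts in first_seen(ALERT_LOG):
    print(ts.isoformat(), code)
```

On this excerpt, ORA-29743 is listed before ORA-00600, matching the conclusion that the LMON error came first.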

Next, let us analyze the LMON trace file.

LOG FILE
--------------
Filename=mydb1_lmon_18036.trc
See the following:

*** CONTAINER ID:(1) 2021-07-24T05:21:48.326391+08:00

LMD0 group 0 GES resources 82296 pool 21
LMD1 group 0 GES resources 82296 pool 21
LMD2 group 0 GES resources 82296 pool 21
GES enqueues 127123
GES IPC: Receivers 7 Senders 7
GES IPC: Buffers Receive 1000 Send (i:0 b:0) Reserve 0
GES IPC: Msg Size Regular 512 Batch 8376
Batching factor: enqueue replay 206, ack 229
Batching factor: cache replay 93 size per lock 88
Read-write Instance? 1, Designated Master? 1, BOC? 1, Broadcast SCN mode: 1
CSS cluster type is UNKNOWN (1)

*** 2021-07-24T05:21:51.874352+08:00 (CDB$ROOT(1))
kjxggin: CGS tickets = 1000
kjxgmin: set instance reconnect max time to 40 secs
kjxgrdmpcpu: CPU Total 128 Core 16 Socket 2 OCPU 64
kjxgrdmpcpu: High load threshold 81920
CGS/IMR TIMEOUTS:
CSS recovery timeout = 31 sec (Total CSS waittime = 65)
IMR Reconfig timeout = 75 sec
CGS rcfg timeout = 85 sec
kjxgmjoin: rimlost event instmap:

*** 2021-07-24T05:21:52.346067+08:00 (CDB$ROOT(1))
kjxgmrcfg: network health verification fails. <=============== LMON reports a network problem

=========================
== My IP address Usage ==
=========================
Local instance 1 uses 4 interfaces.
[0]: 169.254.41.56
[1]: 169.254.111.220
[2]: 169.254.182.82
[3]: 169.254.199.102
================================
== System Network Information ==
================================
==[ Network Interfaces : 13 (13 max) ]============
lo0 | 127.0.0.1 | 255.0.0.0 | UP|RUNNING
aggr0 | 192.168.140.51 | 255.255.255.0 | UP|RUNNING
aggr0:1 | 169.254.182.82 | 255.255.192.0 | DOWN|RUNNING <<<< address in abnormal state
aggr1 | x.x.x.179 | 255.255.255.128 | UP|RUNNING
aggr1:1 | x.x.x.181 | 255.255.255.128 | UP|RUNNING
aggr1:2 | x.x.x.182 | 255.255.255.128 | UP|RUNNING
net4 | 172.16.10.11 | 255.255.255.0 | UP|RUNNING
net4:1 | 169.254.41.56 | 255.255.192.0 | UP|RUNNING
net4:2 | 169.254.199.102 | 255.255.192.0 | DOWN|RUNNING <<<< address in abnormal state
net4:3 | 169.254.182.82 | 255.255.192.0 | UP|RUNNING
net5 | 172.16.11.11 | 255.255.255.0 | UP|RUNNING
net5:1 | 169.254.111.220 | 255.255.192.0 | UP|RUNNING
net5:2 | 169.254.199.102 | 255.255.192.0 | UP|RUNNING

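Here net4 and net5 are the private interconnect NICs, and aggr0 carries the storage network address of the IBM server. Clearly, aggr0:1 and net4:3 are using the same address.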
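Spotting a duplicated address by eye is easy to get wrong in a longer table. The following is a minimal sketch that groups the interface table from the LMON trace by IP address and reports addresses that appear on more than one interface. It assumes the `name | ip | netmask | flags` layout shown above; the masked aggr1 rows are left out, and the function name `duplicate_addresses` is hypothetical.

```python
from collections import defaultdict

# Interface table copied from the LMON trace above
# (masked x.x.x.* rows omitted, annotations removed).
IFACE_TABLE = """\
lo0 | 127.0.0.1 | 255.0.0.0 | UP|RUNNING
aggr0 | 192.168.140.51 | 255.255.255.0 | UP|RUNNING
aggr0:1 | 169.254.182.82 | 255.255.192.0 | DOWN|RUNNING
net4 | 172.16.10.11 | 255.255.255.0 | UP|RUNNING
net4:1 | 169.254.41.56 | 255.255.192.0 | UP|RUNNING
net4:2 | 169.254.199.102 | 255.255.192.0 | DOWN|RUNNING
net4:3 | 169.254.182.82 | 255.255.192.0 | UP|RUNNING
net5 | 172.16.11.11 | 255.255.255.0 | UP|RUNNING
net5:1 | 169.254.111.220 | 255.255.192.0 | UP|RUNNING
net5:2 | 169.254.199.102 | 255.255.192.0 | UP|RUNNING
"""


def duplicate_addresses(table):
    """Map each IP that appears more than once to its (interface, flags) pairs."""
    by_ip = defaultdict(list)
    for line in table.splitlines():
        # Split on the first three '|' only, because the flags field
        # itself contains '|' (for example "UP|RUNNING").
        parts = [p.strip() for p in line.split("|", 3)]
        if len(parts) != 4:
            continue  # skip lines that do not match the expected layout
        name, ip, _mask, flags = parts
        by_ip[ip].append((name, flags))
    return {ip: owners for ip, owners in by_ip.items() if len(owners) > 1}


for ip, owners in duplicate_addresses(IFACE_TABLE).items():
    print(ip, owners)
```

On this table the sketch reports 169.254.182.82 (aggr0:1 and net4:3) and 169.254.199.102 (net4:2 and net5:2); in each pair, one of the two entries is in the DOWN state.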

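As for who was actually using the address on aggr0, I later asked the relevant engineers, who said it is used by the IBM server's storage network; evidently it also uses addresses in the 169.254/16 range.
According to Doc ID 1629814.1, when the IBM Integrated Management Module (IMM) is in use, the IMM by default uses the IANA (Internet Assigned Numbers Authority) "Link Local" address range, 169.254/16.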
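To see which configured addresses fall inside that link-local range, and could therefore collide with HAIP addresses, a minimal sketch using Python's standard `ipaddress` module might look like this. The sample addresses are taken from the trace above; the helper name `in_link_local` is hypothetical.

```python
import ipaddress

# The IPv4 link-local range, which the HAIP addresses in the trace also use.
LINK_LOCAL = ipaddress.ip_network("169.254.0.0/16")


def in_link_local(addr):
    """Return True if addr lies inside 169.254.0.0/16."""
    return ipaddress.ip_address(addr) in LINK_LOCAL


# A few addresses from the interface table in the LMON trace.
addresses = ["192.168.140.51", "169.254.182.82", "172.16.10.11", "169.254.41.56"]
flagged = [a for a in addresses if in_link_local(a)]
print(flagged)
```

Any non-Oracle component that assigns itself an address from this range on a cluster node is a candidate for the kind of conflict described here.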

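A later check showed that the cluster was configured with three private NICs, so Oracle would have four HAIP addresses: one on aggr0, two on net4, and one on net5. ifconfig showed that net4:3 and net5:2 each carried an additional address in the HAIP range. Checking the hosts file revealed a configuration problem as well: the public IP was mapped to two names. The second mapping turned out to be related to the 爱数 backup software, but its engineer said the mapping does not work on their side. Who added it is unknown, but it was clearly a problem. I commented out the mapping and restarted the cluster, and the address conflict between net4:3 and aggr0:1 was gone.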

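After discussing it with the project owner, we first commented out the mapping that Oracle does not use.
/etc/hosts
x.x.x.179 node1

#x.x.x.179 osi_client <<<<< comment out this mapping

After the cluster was restarted, the HAIP conflict was resolved.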
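A quick way to audit a hosts file for this kind of duplicate mapping is sketched below. It is only an illustration: the address 192.0.2.179 is a placeholder standing in for the masked x.x.x.179, and the function name `ips_on_multiple_lines` is hypothetical.

```python
from collections import defaultdict


def ips_on_multiple_lines(hosts_text):
    """Return {ip: [names_per_line, ...]} for IPs mapped on more than one active line."""
    lines_by_ip = defaultdict(list)
    for raw in hosts_text.splitlines():
        # Drop comments, including entries that were commented out entirely.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        ip, *names = line.split()
        lines_by_ip[ip].append(names)
    return {ip: entries for ip, entries in lines_by_ip.items() if len(entries) > 1}


# Placeholder address: the real public IP is masked in the article.
before = "192.0.2.179 node1\n192.0.2.179 osi_client\n"
after = "192.0.2.179 node1\n#192.0.2.179 osi_client\n"

print(ips_on_multiple_lines(before))  # the duplicate mapping is reported
print(ips_on_multiple_lines(after))   # nothing reported once it is commented out
```

The sketch counts lines rather than names, because a single line with several aliases for one IP is normal, while the same IP on two separate lines is what was found here.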


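Note: the core issue is still that the IBM Integrated Management Module (IMM) added an address in the HAIP range on the related device, which caused the conflict. It was resolved by commenting out the duplicate mapping in the hosts file and restarting the cluster. How the two are related, and how the IMM works, still needs to be analyzed by the server engineers.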

