One Cluster Node Fails to Start
1 Check the clusterware startup state on the failed node
[grid@rac2 rac2]$ crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE
ora.crf
      1        ONLINE  ONLINE       rac2
ora.crsd
      1        ONLINE  OFFLINE
ora.cssd
      1        ONLINE  OFFLINE
ora.cssdmonitor
      1        ONLINE  ONLINE       rac2
ora.ctssd
      1        ONLINE  OFFLINE
ora.diskmon
      1        OFFLINE OFFLINE
ora.evmd
      1        ONLINE  OFFLINE
ora.gipcd
      1        ONLINE  ONLINE       rac2
ora.gpnpd
      1        ONLINE  ONLINE       rac2
ora.mdnsd
      1        ONLINE  ONLINE       rac2
At this point only the low-level resources needed to bootstrap the cluster are online; CSSD itself has not started.
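To confirm that CSS is the blocking daemon and to locate its log, the standard checks look like this (a minimal sketch; the path assumes the 11gR2 per-daemon log layout, with $ORACLE_HOME pointing at the Grid home for the grid user):
[grid@rac2 ~]$ crsctl check css    # expect a CRS-4530 communications failure while ora.cssd is OFFLINE
[grid@rac2 ~]$ cd $ORACLE_HOME/log/$(hostname -s)/cssd && ls -lrt    # ocssd.log lives here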
2 Node 1 is healthy, but CSSD on the second node cannot start. Tailing the log with [grid@rac2 cssd]$ tail -100f ocssd.log | more reveals the following errors:
2020-09-13 22:10:16.117: [ CSSD][3833571072]clssgmDiscEndpcl: gipcDestroy 0x96e8
2020-09-13 22:10:16.451: [ CSSD][3833571072]clssgmExecuteClientRequest: MAINT recvd from proc 2 (0x7f06dc0593b0)
2020-09-13 22:10:16.451: [ CSSD][3833571072]clssgmShutDown: Received abortive shutdown request from client.
2020-09-13 22:10:16.451: [ CSSD][3833571072]###################################
2020-09-13 22:10:16.451: [ CSSD][3833571072]clssscExit: CSSD aborting from thread GMClientListener
2020-09-13 22:10:16.451: [ CSSD][3833571072]###################################
2020-09-13 22:10:16.451: [ CSSD][3833571072](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
2020-09-13 22:10:16.552: [ CSSD][3481773824]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2020-09-13 22:10:16.553: [ CSSD][3480196864]clssnmPollingThread: state(1) clusterState(0) exit
2020-09-13 22:10:16.553: [ CSSD][3480196864]clssscExit: abort already set 0
2020-09-13 22:10:16.561: [ CSSD][3486504704]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 495240370, wrtcnt,
695082, LATS 3972724, lastSeqNo 695081, uniqueness 1599745548, timestamp 1600006216/11824574
2020-09-13 22:10:17.553: [ CSSD][3481773824]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2020-09-13 22:10:17.563: [ CSSD][3486504704]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 495240370, wrtcnt,
695083, LATS 3973724, lastSeqNo 695082, uniqueness 1599745548, timestamp 1600006217/11825574
2020-09-13 22:10:18.554: [ CSSD][3481773824]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2020-09-13 22:10:18.566: [ CSSD][3486504704]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 495240370, wrtcnt,
695084, LATS 3974734, lastSeqNo 695083, uniqueness 1599745548, timestamp 1600006218/11826574
2020-09-13 22:10:19.555: [ CSSD][3478619904]clssnmSendingThread: sending join msg to all nodes
2020-09-13 22:10:19.555: [ CSSD][3481773824]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2020-09-13 22:10:19.555: [ CSSD][3478619904]clssnmSendingThread: sent 4 join msgs to all nodes
2020-09-13 22:10:19.570: [ CSSD][3486504704]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 495240370, wrtcnt,
695085, LATS 3975734, lastSeqNo 695084, uniqueness 1599745548, timestamp 1600006219/11827574
2020-09-13 22:10:20.555: [ CSSD][3481773824]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
2020-09-13 22:10:20.575: [ CSSD][3486504704]clssnmvDHBValidateNcopy: node 1, rac1, has a disk HB, but no network HB, DHB has rcfg 495240370, wrtcnt,
695086, LATS 3976744, lastSeqNo 695085, uniqueness 1599745548, timestamp 1600006220/11828574
2020-09-13 22:10:21.556: [ CSSD][3481773824]clssgmWaitOnEventValue: after CmInfo State val 3, eval 1 waited 0
Note that these errors date from the 13th. When I restarted the clusterware on this node on the morning of the 15th, ocssd.log recorded nothing at all, meaning the stack never even reached CSSD during that attempt; it was stuck at some lower layer.
From the historical entries in ocssd.log, the core symptom is clear: node 1 has a disk heartbeat but no network heartbeat.
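A quick way to pull that symptom out of the historical log (a minimal sketch; the path again assumes the 11gR2 layout):
[grid@rac2 ~]$ grep "no network HB" $ORACLE_HOME/log/$(hostname -s)/cssd/ocssd.log | tail -5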
3 Test the private interconnect: the two nodes' private addresses ping each other without any problem
[grid@rac2 rac2]$ ping rac1-priv
PING rac1-priv (192.168.57.101) 56(84) bytes of data.
64 bytes from rac1-priv (192.168.57.101): icmp_seq=1 ttl=64 time=0.268 ms
64 bytes from rac1-priv (192.168.57.101): icmp_seq=2 ttl=64 time=0.451 ms
64 bytes from rac1-priv (192.168.57.101): icmp_seq=3 ttl=64 time=0.407 ms
64 bytes from rac1-priv (192.168.57.101): icmp_seq=4 ttl=64 time=0.629 ms
64 bytes from rac1-priv (192.168.57.101): icmp_seq=5 ttl=64 time=0.452 ms
--- rac1-priv ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4363ms
rtt min/avg/max/mdev = 0.268/0.441/0.629/0.116 ms
[grid@rac2 rac2]$ ping rac2-priv
PING rac2-priv (192.168.57.102) 56(84) bytes of data.
64 bytes from rac2-priv (192.168.57.102): icmp_seq=1 ttl=64 time=0.035 ms
64 bytes from rac2-priv (192.168.57.102): icmp_seq=2 ttl=64 time=0.050 ms
64 bytes from rac2-priv (192.168.57.102): icmp_seq=3 ttl=64 time=0.047 ms
64 bytes from rac2-priv (192.168.57.102): icmp_seq=4 ttl=64 time=0.050 ms
64 bytes from rac2-priv (192.168.57.102): icmp_seq=5 ttl=64 time=0.048 ms
64 bytes from rac2-priv (192.168.57.102): icmp_seq=6 ttl=64 time=0.096 ms
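A successful ping only proves basic ICMP reachability. It is also worth confirming which network Grid Infrastructure has actually registered as the interconnect (oifcfg is a standard GI utility; the comment describes the expected columns rather than output captured from this system):
[grid@rac2 ~]$ oifcfg getif    # each line: interface  subnet  global  public|cluster_interconnect
Also note that from 11.2.0.2 onward CSS relies on multicast over the interconnect as well, so a clean unicast ping does not by itself rule out a network-heartbeat failure.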
4 Next, look at the network configuration with ifconfig -a
Node 1 has the HAIP address bound on eth1:1, while on the second node eth1 carries no HAIP address. That is odd; very likely an address in the HAIP range is already in use on the second node somewhere else. Checking node 2, we find one configured on eth3.
Node 1:
eth1:1 Link encap:Ethernet HWaddr 08:00:27:AF:CA:C2
inet addr:169.254.13.18 Bcast:169.254.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
Node 2:
eth1 Link encap:Ethernet HWaddr 08:00:27:5B:E4:92
inet addr:192.168.57.102 Bcast:192.168.57.255 Mask:255.255.255.0
inet6 addr: fe80::a00:27ff:fe5b:e492/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:28778 errors:0 dropped:0 overruns:0 frame:0
TX packets:52599 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:16441517 (15.6 MiB) TX bytes:72160947 (68.8 MiB)
eth3 Link encap:Ethernet HWaddr 08:00:27:ED:7D:A2
inet addr:169.254.1.100 Bcast:169.254.255.255 Mask:255.255.0.0
inet6 addr: fe80::a00:27ff:feed:7da2/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:11725 errors:0 dropped:0 overruns:0 frame:0
TX packets:78 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:10457605 (9.9 MiB) TX bytes:13433 (13.1 KiB)
So the cause is that an address in the HAIP link-local range was already in use on eth3 of the second node, which broke HAIP and blocked CSSD startup. We decided to delete that interface configuration (it is unclear when and why this NIC and address were added) and restart the clusterware.
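As a sanity check before removing anything, every node can be scanned for stray link-local addresses; 169.254.0.0/16 is reserved for HAIP, and Oracle requires that no other interface use it (a minimal sketch using the same legacy ifconfig tooling shown above):
[grid@rac2 ~]$ ifconfig -a | grep -B1 "inet addr:169.254"    # -B1 also prints the interface-name line above each match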
5 Deleted the eth3 interface configuration on the second node and restarted the clusterware; the problem was resolved.
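For completeness, a sketch of the fix itself, assuming a RHEL/OEL-style network-scripts layout (the exact ifcfg path, and whether a simple ifdown is enough, depend on how the address was configured):
[root@rac2 ~]# ifdown eth3
[root@rac2 ~]# mv /etc/sysconfig/network-scripts/ifcfg-eth3 /root/ifcfg-eth3.bak    # keep a backup instead of deleting outright
[root@rac2 ~]# crsctl stop crs -f
[root@rac2 ~]# crsctl start crs
[root@rac2 ~]# crsctl stat res -t -init    # ora.cssd and ora.cluster_interconnect.haip should now come ONLINE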