加veritas共享文件系统后实例无法启动

2021-06-24 16:03:24

添加veritas共享文件系统后ASM磁盘I/O异常造成实例无法启动,这个 veritas共享文件是为复制软件用的，跟Oracle没有关系，系统组在处理完后，我们发现了问题，下面是处理过程。其实我们定为了IO问题，但是具体哪里问题后还是需要系统和存储工程师去协调处理。下面是处理过程。

一套三节点集群，使用ASM共享存储，发现２，３结点数据库实例无法启动，1节点实例又频繁重启。下面是分析过程。

CKPT进程写文件异常　LMHB 发现后台核心进程异常terminate实例。
021-06-23T20:04:39.016044+08:00
CKPT (ospid: 20145) waits for event 'control file parallel write' for 94 secs.　　<<<<<<ＣＫＰＴ进程写文件异常
2021-06-23T20:04:41.177782+08:00
Errors in file /oracle/diag/rdbms/xxxxx01/xxxxx012/trace/xxxxx012_lmhb_20127.trc (incident=486956):
ORA-29770: global enqueue process CKPT (OSID 20145) is hung for more than 70 seconds
Incident details in: /oracle/diag/rdbms/xxxxx01/xxxxx012/incident/incdir_486956/xxxxx012_lmhb_20127_i486956.trc
2021-06-23T20:04:43.647557+08:00
LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_xxxxx01 by memno 1
ERROR: Some process(s) is not making progress.
LMHB (ospid: 20127) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
LMHB (ospid: 20127): terminating the instance due to error 29770 <<<<<<LMHB 发现后台核心进程异常terminate实例。
2021-06-23T20:04:44.149112+08:00
System state dump requested by (instance=2, osid=20127 (LMHB)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/diag/rdbms/xxxxx01/xxxxx012/trace/xxxxx012_diag_20054_20210623200444.trc
2021-06-23T20:04:45.052499+08:00
Dumping diagnostic data in directory=[cdmp_20210623200444], requested by (instance=2, osid=20127 (LMHB)), summary=[abnormal instance termination].
2021-06-23T20:04:45.702580+08:00
License high water mark = 6
2021-06-23T20:04:49.393904+08:00
Termination issued to instance processes. Waiting for the processes to exit, wait time 5 sec
2021-06-23T20:04:54.395504+08:00
Termination issued to instance processes. Waiting for the processes to exit, wait time 0 sec
2021-06-23T20:04:55.395871+08:00
Instance termination failed to kill one or more processes
2021-06-23T20:04:59.398103+08:00
Instance terminated by LMHB, pid = 20127
2021-06-23T20:04:59.399099+08:00
Warning: 2 processes are still attach to shmid 19:
(size: 32768 bytes, creator pid: 7297, last attach/detach pid: 45950)
2021-06-23T20:04:59.704748+08:00
USER (ospid: 3914): terminating the instance
2021-06-23T20:04:59.705632+08:00
Instance terminated by USER, pid = 3914

结点２　集群尝试启动，如下步骤很慢，数据库日志如下
NOTE: ASMB mounting group 2 (DATA1)
NOTE: ASM background process initiating disk discovery for grp 2 (reqid:0)
2021-06-24T10:35:51.989120+08:00
KSXPPING: KSXP selected for Ping
2021-06-24T10:36:50.257524+08:00
Decreasing number of high priority LMS from 3 to 0
2021-06-24T10:39:01.164090+08:00
NOTE: disk group discovery running slower than expected, ASMB diagnostic requested after 189 seconds
NOTE: ASMB process state dumped to trace file /oracle/diag/rdbms/xxxxx01/xxxxx012/trace/xxxxx012_ckpt_47894.trc

后续逐渐发现了ASM磁盘，但是后续启动步骤非常慢，ASMB负责与ASM实例通信，在启动时负责初始化磁盘组，找到对应的ASM磁盘，
但是这个过程持续１８９秒ASMB进程开始dump。　

数据库日志

这里同样是控制文件读写异常，也是i/o问题，超时。由于这个异常实例被关闭。
2021-06-24T10:54:43.712362+08:00
IMR0 (ospid: 1872) waits for event 'control file sequential read' for 71 secs.
2021-06-24T10:54:43.712463+08:00
IMR0 (ospid: 1872) is hung in an acceptable location (cfio 0x11.00).
2021-06-24T10:54:46.719241+08:00
IMR has experienced some problems and can't re-enable, retry count 1
2021-06-24T10:54:54.018174+08:00
IMR0 (ospid: 1872) waits for event 'control file sequential read' for 91 secs.
2021-06-24T10:56:21.809079+08:00
IMR0 (ospid: 1872) waits for event 'control file sequential read' for 169 secs.
2021-06-24T10:56:21.809159+08:00
IMR0 (ospid: 1872) is hung in an acceptable location (cfio 0x11.00).
2021-06-24T10:56:32.416434+08:00
IMR0 (ospid: 1872) waits for event 'control file sequential read' for 189 secs.
......

2021-06-24T11:04:34.400380+08:00
Reconfiguration started (old inc 6, new inc 8)
List of instances (total 2) :
2 3
Dead instances (total 1) :
1
My inst 2
.....

2021-06-24T11:06:04.341089+08:00
IMR0 (ospid: 1872) waits for event 'control file parallel write' for 82 secs.
2021-06-24T11:07:20.598935+08:00
requested by (instance=2, osid=47050 (DIA0)), summary=[process state dump request (kjznHA)]
requested by (instance=2, osid=47050 (DIA0)), summary=[process state dump request (kjznHA)]
2021-06-24T11:09:29.246934+08:00
Errors in file /oracle/diag/rdbms/xxxxx01/xxxxx012/trace/xxxxx012_ora_2453.trc (incident=535284):
ORA-00240: control file enqueue held for more than 120 seconds
Incident details in: /oracle/diag/rdbms/xxxxx01/xxxxx012/incident/incdir_535284/xxxxx012_ora_2453_i535284.trc
2021-06-24T11:11:23.302105+08:00

* instance 3 validates domain 0
2021-06-24T11:11:25.438586+08:00
License high water mark = 4
2021-06-24T11:11:25.439008+08:00
USER (ospid: 33543): terminating the instance
2021-06-24T11:11:29.443778+08:00
......
2021-06-24T11:11:45.451636+08:00
Instance terminated by USER, pid = 33543
2021-06-24T11:11:45.452326+08:00
Warning: 4 processes are still attach to shmid 34:
(size: 32768 bytes, creator pid: 13106, last attach/detach pid: 33543)

处理第二个结点问题时，个结点重起
ＬＧＷＲ　写盘异常，ａｓｍｂ（监控核心后台进程）　　结束实例

2021-06-24T11:26:07.035994+08:00
CKPT (ospid: 12600) waits for event 'control file parallel write' for 71 secs.
2021-06-24T11:26:07.036087+08:00
CKPT (ospid: 12600) is hung in an acceptable location (cfio 0x11.00).
2021-06-24T11:26:09.675861+08:00
LGWR (ospid: 12598) waits for event 'LGWR all worker groups' for 23 secs.
2021-06-24T11:26:14.050214+08:00
CKPT (ospid: 12600) waits for event 'control file parallel write' for 80 secs.
2021-06-24T11:26:16.295247+08:00
Errors in file /oracle/diag/rdbms/xxxxx01/xxxxx011/trace/xxxxx011_lmhb_12582.trc (incident=555113):
ORA-29770: global enqueue process CKPT (OSID 12600) is hung for more than 70 seconds
Incident details in: /oracle/diag/rdbms/xxxxx01/xxxxx011/incident/incdir_555113/xxxxx011_lmhb_12582_i555113.trc
2021-06-24T11:26:18.830136+08:00
LOCK_DBGRP: GCR_SYSTEST debug event locked group GR+DB_xxxxx01 by memno 0
ERROR: Some process(s) is not making progress.
LMHB (ospid: 12582) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
LMHB (ospid: 12582): terminating the instance due to error 29770
2021-06-24T11:26:19.197840+08:00
System state dump requested by (instance=1, osid=12582 (LMHB)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/diag/rdbms/xxxxx01/xxxxx011/trace/xxxxx011_diag_12473_20210624112619.trc
。。。。。。
2021-06-24T11:26:44.194255+08:00
Instance terminated by LMHB, pid = 12582

这里可以看出CKPT进程异常，无法写入，LMHB进程监控到这个问题，在ｄｕｍｐ后随之Instance terminated by LMHB
虽然后续结点１实例启动成功，但是问题依旧。在确认三个结点都可以读取asm磁盘，且权限属组都没有问题后，我们判断是
ｉ／ｏ　，１，３结点可以访问ASM磁盘但是偶尔会慢造成无法访问，而结点２根本无法访问，i/o更慢。　

总结：以上现象都说明存储I/O异常，不稳定需要系统组排查，后经项目负责人协调后确认系统组已经意识到I/O异常，进一步分析处理，
至此该问题分析完毕。