绑定完请刷新页面
取消
刷新

分享好友

×
取消 复制
log file switch (archiving needed) 事件导致系统缓慢-归档空间正常
2022-01-06 12:11:01


用户反映整个业务操作缓慢,影响业务运行,需要紧急排查,下面是分析步骤

1 看下数据库告警

ORACLE Instance xxxx303 - Cannot allocate log, archival required
Thread 3 cannot allocate new log, sequence 146566
All online logs need archiving
Examine archive trace files for archiving errors
Current log# 9 seq# 146565 mem# 0: //u011/xxxx30/redo17.log
Current log# 9 seq# 146565 mem# 1: //u012/xxxx30/redo18.log
Thread 3 advanced to log sequence 146566 (LGWR switch)
Current log# 11 seq# 146566 mem# 0: //u011/xxxx30/redo21.log
Current log# 11 seq# 146566 mem# 1: //u012/xxxx30/redo22.log
Archived Log entry 268556 added for T-3.S-146565 ID 0xe4617d4f LAD:1
Thread 3 advanced to log sequence 146567 (LGWR switch)
Current log# 12 seq# 146567 mem# 0: //u011/xxxx30/redo23.log
Current log# 12 seq# 146567 mem# 1: //u012/xxxx30/redo24.log
Archived Log entry 268557 added for T-3.S-146566 ID 0xe4617d4f LAD:1
ORACLE Instance xxxx303 - Cannot allocate log, archival required
Thread 3 cannot allocate new log, sequence 146568
All online logs need archiving
Examine archive trace files for archiving errors
Current log# 12 seq# 146567 mem# 0: //u011/xxxx30/redo23.log
Current log# 12 seq# 146567 mem# 1: //u012/xxxx30/redo24.log
(其实这里可以查看arc跟踪文件,因为感觉问题出在归档上,根据自己习惯做了如下操作)
在线日志需要归档 但是有问题无法完成归档,我们看看数据库等待事件

select event ,count(*) from v$active_session_history
where sample_time >to_date('202201061100','yyyymmddhh24mi') and event is not null group by event having count(*)>10 order
by count(*);
EVENT COUNT(*)
-------------------------------------------------- ----------
log file switch (archiving needed) 529848
我们看到严重的写日志有问题,也就是Oracle认为无法归档成功。
查询用户会话阻塞情况
SQL> select user_id,sql_id,event ,count(*) from v$active_session_history
where sample_time >to_date('202201061100','yyyymmddhh24mi') and event like 'log file%' and event is not null group by user_id,sql_id,event having count(*)>10 order
by count(*);

省略输出
231 2hsuv64sdjt4w log file switch (archiving needed) 5788
147 dkqkn4rbv2q44 log file switch (archiving needed) 6073
147 3qmth1t2zjw18 log file switch (archiving needed) 8816
147 48nya9ct5z7v7 log file switch (archiving needed) 13505
228 3dx8vk477jzup log file switch (archiving needed) 21302
169 1fcxttxxfmxsu log file switch (archiving needed) 141689
939 rows selected.

发现大量用户被写归档事件阻塞,这就是大量用户反映慢的根因。但是为什么写入有问题。我们从两个角度考虑
1 归档满了
2 归档进程写磁盘异常

查询空间排除归档满的问题。
检查归档进程是否异常等待

SQL> select sid,serial#,process ,program,EVENT,SECONDS_IN_WAIT from v$session where program like '%ARC%';

SID SERIAL# PROCESS PROGRAM EVENT SECONDS_IN_WAIT
---------- ---------- ---------- ---------------------------------------- ------------------------------ ---------------
13189 39093 255660 oracle@n-pc-sr850-324 (ARC0) rdbms ipc message 47
13503 53177 255664 oracle@n-pc-sr850-324 (ARC1) Disk file operations I/O 5130
13660 36034 255666 oracle@n-pc-sr850-324 (ARC2) rdbms ipc message 229
13817 30293 255668 oracle@n-pc-sr850-324 (ARC3) rdbms ipc message 107

我们看到此时process 255664等待了5130秒,等待事件为文件I/O等待,所以我们怀疑问题出在操作系统和存储层面,此时归档依然可以写
,但是写入慢,好像被什么阻塞了。

后续从操作系统messages查看有无异常,发现如下报错

Jan 6 09:07:37 n-pc-sr850-324 kernel: INFO: task ora_arc1_xxxx30:255664 blocked for more than 120 seconds.
Jan 6 09:07:37 n-pc-sr850-324 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 6 09:07:37 n-pc-sr850-324 kernel: ora_arc1_xxxx30 D 00000000005b1013 0 255664 1 0x00000084
Jan 6 09:07:37 n-pc-sr850-324 kernel: ffff88726b68b730 0000000000000086 ffff883fb879cf10 ffff88726b68bfd8
Jan 6 09:07:37 n-pc-sr850-324 kernel: ffff88726b68bfd8 ffff88726b68bfd8 ffff883fb879cf10 ffff883fb879cf10
Jan 6 09:07:37 n-pc-sr850-324 kernel: ffff8854afb0c2b0 ffff8854afb0c2a8 ffff887172b0d870 00000000005b1013
Jan 6 09:07:37 n-pc-sr850-324 kernel: Call Trace:
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffff816a94c9>] schedule+0x29/0x70
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0ea84e0>] vx_svar_sleep_unlock+0x70/0xd0 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0ea8420>] ? vx_basic_set_mnamfs+0xd0/0xd0 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0ea8ccf>] vx_event_wait+0x2f/0x80 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0d93af0>] vx_async_waitmsg+0x20/0x50 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0d93c98>] vx_msg_send+0x178/0x190 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc09e2bb1>] ? gab_allocmsg+0x11/0x30 [gab]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0e9331b>] vx_extset_msg+0x2eb/0x660 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0d9df7e>] vx_cfs_extset+0x3e/0x50 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0e37727>] vx_extset+0x447/0x580 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0e37b45>] vx_setext+0x2e5/0x5c0 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0a1c80b>] ? vxg_api_lock+0x8b/0xc0 [vxglm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0f662b3>] vx_uioctl+0x6a3/0x9e0 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0a33b9a>] ? vxg_lock_ilock_omnibus+0x3aa/0x3c0 [vxglm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0e778ae>] ? vx_iglock2+0x8e/0x110 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0a1acd0>] ? vxg_upgrade_fastsub+0x50/0x50 [vxglm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0e43ef4>] ? vx_glm_unlock+0x34/0x60 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0f06d33>] vx_vop_ioctl+0x23/0x30 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0ee23c3>] vx_odm_resize+0xc3/0x280 [vxfs]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0ae9ef4>] odm_vx_resize+0x94/0x130 [vxodm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0aea0c1>] odm_vx_create+0x131/0x190 [vxodm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0af24b2>] odm_file_create.isra.12+0x173/0x2c1 [vxodm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0ae98bf>] ? odm_vx_dirlook+0x3f/0x190 [vxodm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0adf44a>] ? odm_node_rele+0x4a/0x1d0 [vxodm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0ae0358>] ? odm_clust_reserve+0xd8/0x190 [vxodm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0ae681d>] odm_create+0x65d/0x850 [vxodm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0add27f>] odm_ioctl_ctl+0x16f/0x3a0 [vxodm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffffc0aee101>] odm_ioctl_ctl_unlocked+0x21/0x30 [vxodm]
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffff812151cd>] do_vfs_ioctl+0x33d/0x540
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffff816affd1>] ? __do_page_fault+0x171/0x450
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffff81215471>] SyS_ioctl+0xa1/0xc0
Jan 6 09:07:37 n-pc-sr850-324 kernel: [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b

上面提示很明显task ora_arc1_xxxx30:255664 blocked for more than 120 seconds.也就是数据库的ARC1归档进程卡在IO上了,而这个IO
明显是操作系统和存储层面的问题,后续交由系统组和存储工程师排查,我们暂时将业务迁移到其他实例。


后续:参考MOS (Doc ID 1476444.1)的说明,还是很有借鉴意义
事件log file switch (archiving needed)发生原因,lgwr进程切换日志,但是要切换到的日志目前其归档无法完成写操作。
问题症状:
1 对于集群而言本地实例有问题,前提是只是本地进程写有问题,如果是共享归档空间,比如满了,也可能影响其他实例。
2只是部分进程收到影响,比如增删改,其实对于OLTP系统,这个影响面已经很大了。
3 该等待事件在DBtime占比很高
4 归档空间不足

如何解决
1 检查告警日志是否归档被失败的归档写入操作阻塞(这个在本例中确认了,写入被存储层面问题阻塞了)
2 加速归档,增加归档进程,或者归档存储条带化处理(条带化这个生产库不现实)
3 检查归档空间问题,满了释放空间或者扩容解决




分享好友

分享这个小栈给你的朋友们,一起进步吧。

Oracle运维新鲜事-技术与管理各占半边天
创建时间:2020-08-04 11:34:57
本技术栈旨在分享技术心得,运维趣事,故障处理经验,调优案例,故障处理涉及集群,DG,OGG,大家生产中遇到的问题基本都会囊括了,我会发布生产库遇到的故障,希望在交流中互助互益,共同提高,也希望大家讨论,如果您有生产中遇到的集群问题,也可以在这里提出来,一起讨论,现实中也帮助不少同学解决了生产库的故障。
展开
订阅须知

• 所有用户可根据关注领域订阅专区或所有专区

• 付费订阅:虚拟交易,一经交易不退款;若特殊情况,可3日内客服咨询

• 专区发布评论属默认订阅所评论专区(除付费小栈外)

栈主、嘉宾

查看更多
  • Abraham林老师
    栈主
  • 小雨滴
    嘉宾
  • hawkliu
    嘉宾
  • u_97a59a25246404
    嘉宾

小栈成员

查看更多
  • 栈栈
  • dapan
  • 小菜鸟___
  • hwayw
戳我,来吐槽~