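Oracle RAC is a shared-everything architecture: every node in the cluster must be able to access the database files at the same time. If the control file records a data file (for example, as part of a tablespace) but that file does not exist from a node's point of view, that node reports errors. This case is built on that scenario: we reproduce the failure by simulation and then migrate the data file to ASM.
1 Fault simulation
(1) On node 2, create the tablespace and the table as follows: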
SYS@orcl2>create tablespace corruption_tbs datafile '/oracle/db/corp_tbs.dbf' size 10m;
Tablespace created.
Elapsed: 00:00:00.97
SYS@orcl2>create table scott.test_corp tablespace corruption_tbs as select * from scott.emp;
Table created.
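At this point, querying the table on node 2 works. In production, applications connect through a user-defined Service that routes them directly to this instance, so reads succeed. Note, however, that the tablespace's data file is a local file on node 2, so node 1 cannot read it. Let's try a backup on node 1 and see what error it reports.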
RMAN> backup as compressed backupset database;
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1910 instance=orcl1 device type=DISK
RMAN-06169: could not read file header for datafile 10 error reason 4
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup command at
RMAN-06056: could not access datafile 10
The file is simply not accessible from node 1. Now try to query the table on node 1:
SYS@orcl1>select * from scott.test_corp;
select * from scott.test_corp
*
ERROR at line 1:
ORA-01157: cannot identify/lock data file 10 - see DBWR trace file
ORA-01110: data file 10: '/oracle/db/corp_tbs.dbf'
Elapsed: 00:00:00.01
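From this we can already work out the problem. The tablespace definition lives in the data dictionary and the physical file information is recorded in the control file; both are shared, so both nodes can see them. The physical file itself, however, can be accessed only from node 2.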
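One way to spot this kind of misplaced file before it causes trouble is to list the data files whose names do not point into ASM. The query below is a minimal sketch, assuming that every shared data file in this cluster lives in an ASM disk group (so its name starts with `+`). A path on a shared cluster file system would also be listed and would need a manual check.

```sql
-- Candidate data files that are not stored in ASM.
-- Assumption: ASM file names start with '+'; anything else is either
-- node-local or on some other shared file system and must be checked by hand.
SELECT file#, name
  FROM v$datafile
 WHERE name NOT LIKE '+%'
 ORDER BY file#;
```

In the scenario above, this should flag '/oracle/db/corp_tbs.dbf' as a file to investigate.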
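If node 1 is restarted, the root cause is the same: because the control file records the data file, Oracle tries to open it, fails because the physical file does not exist on that node, and the instance cannot open the database. The alert log shows the following errors: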
ALTER DATABASE OPEN /* db agent *//* {2:28603:2} */
This instance was first to open
Errors in file /oracle/db/diag/rdbms/orcl/orcl1/trace/orcl1_dbw0_11999.trc:
ORA-01157: cannot identify/lock data file 10 - see DBWR trace file
ORA-01110: data file 10: '/oracle/db/corp_tbs.dbf'
ORA-27037: unable to obtain file status
Linux-x86_64 Error: 2: No such file or directory
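How do we bring instance 1 up so that it can serve clients? The reasoning is the same as when recovering from a damaged data file: the first priority is to get the database serving again, so we tell Oracle to take the data file offline, which lets the instance open without it. This has to be done with the database in the mount state, and both nodes need to be in the mount state, so the data file is temporarily inaccessible on both nodes (and the database itself is unavailable during the restart).
Take the data file offline on either node; here we use node 1: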
SYS@orcl1> alter database datafile 10 offline;
Database altered.
Elapsed: 00:00:00.05
SYS@orcl1>alter database open;
Database altered.
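Node 1 now has the database open, but because the data file does not exist on that node it cannot be brought online there, and the table data is not accessible: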
SYS@orcl1> alter database datafile 10 online;
alter database datafile 10 online
*
ERROR at line 1:
ORA-01157: cannot identify/lock data file 10 - see DBWR trace file
ORA-01110: data file 10: '/oracle/db/corp_tbs.dbf'
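To see how the control file currently describes the file, the state can be queried on either instance. This is a hedged sketch: the file number 10 comes from the error messages above, and the second query simply lists whatever files the instance believes need media recovery, which may or may not include file 10 depending on timing.

```sql
-- Control-file view of data file 10 (file number taken from the errors above).
SELECT file#, status, name
  FROM v$datafile
 WHERE file# = 10;

-- Files that this instance considers in need of media recovery, if any.
SELECT file#, online_status, error
  FROM v$recover_file;
```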
Next, on node 2, bring the data file online so that the tablespace can serve requests again:
SYS@orcl2>alter database datafile 10 online;
Database altered.
Elapsed: 00:00:00.10
SYS@orcl2>select count(*) from scott.test_corp;
COUNT(*)
----------
14
Elapsed: 00:00:00.02
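At this point instance 1 is back up and the table can be served from node 2, but because the tablespace's data file is still a local file on node 2, node 1 still cannot read the table.
2 Migrating the data file to ASM
Run the following on node 2: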
SYS@orcl2>alter tablespace corruption_tbs offline;
Tablespace altered.
Elapsed: 00:00:00.28
SYS@orcl2>exit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
[oracle@rac2 ~]$ rman target /
Recovery Manager: Release 11.2.0.4.0 - Production on Tue Sep 8 11:28:30 2020
Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.
connected to target database: ORCL (DBID=1533381003)
RMAN> copy datafile 10 to '+ASMVG1';
Starting backup at
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1911 instance=orcl2 device type=DISK
channel ORA_DISK_1: starting datafile copy
input datafile file number=00010 name=/oracle/db/corp_tbs.dbf
output file name=+ASMVG1/orcl/datafile/corruption_tbs.272.1050578927 tag=TAG20200908T112845 RECID=1 STAMP=1050578926
channel ORA_DISK_1: datafile copy complete, elapsed time: 00:00:01
Finished backup at
Starting Control File and SPFILE Autobackup at
piece handle=/oracle/db/product/11.2/dbs/c-1533381003-20200908-03 comment=NONE
Finished Control File and SPFILE Autobackup at
Update the file location recorded in the control file:
RMAN> switch datafile 10 to copy;
datafile 10 switched to datafile copy "+ASMVG1/orcl/datafile/corruption_tbs.272.1050578927"
SYS@orcl2>alter tablespace corruption_tbs online;
Tablespace altered.
Elapsed: 00:00:00.25
Now a query on node 1 should return the table data (the alert log shows the tablespace being recovered):
SYS@orcl1>select count(*) from scott.test_corp;
COUNT(*)
----------
14
Elapsed: 00:00:00.03
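As an additional check, the data dictionary can confirm that the tablespace now points at the ASM copy. This is a minimal sketch, assuming the tablespace name CORRUPTION_TBS used earlier; it can be run from either node.

```sql
-- Confirm the tablespace's data file now lives in ASM and is online.
SELECT file_id, file_name, online_status
  FROM dba_data_files
 WHERE tablespace_name = 'CORRUPTION_TBS';
```

The file name returned should begin with the '+ASMVG1' disk group rather than the old local path.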
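The migration is now complete. Because it does not affect operations on other tablespaces, it still counts as a high-availability approach.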