GaussDB T Distributed Cluster Failure Recovery Case: CN Isolation Recovery
2020-03-18 11:32:06

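Summary: 1. After the bin directory was restored, the cluster was still abnormal, and the primary and standby DN roles in group_1 had swapped. This left the DNs unbalanced: one host held two primaries and no standby, while another held two standbys and no primary. 2. After CN isolation recovery, the omm user's password reverted to the default gaussdb_123. This is a bug in 1.0.1 that is fixed in 1.0.2. 3. After CN isolation recovery, the database name reverted to the default GAUSS. This is a bug that Huawei has not yet fixed.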

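Background:

The environment is a set of virtual machines running a 4-node GaussDB T 1.0.1 distributed cluster. While configuring python3 in preparation for an upgrade to 1.0.2, I accidentally deleted the /usr/bin/ directory on one host, which left that whole node in an abnormal state.
After /usr/bin was restored, the cluster instances on that host were still abnormal: CM, ETCD, and DN were OFFLINE, and the CN was in the DELETED state.
How to restore an accidentally deleted /usr/bin/ is not covered in detail here. Roughly: create a new virtual machine (make sure its VG name differs from the old one), detach the failed host's disk, attach it to the new VM, copy a good /usr/bin/ onto it, unmount it, and attach it back to the original failed host.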

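Notes on CN isolation recovery:

● A failed CN that was isolated and replaced because of physical damage cannot be recovered this way.
● DDL operations are not allowed while the failed CN is being recovered.
● If the cluster has a load-balancing component, remove the failed CN from its traffic distribution first, and add the CN back to the load balancer after the recovery completes.
● If a CN node stays in the route_conflict state, and the failed CN node is not configured in the SYS_DATA_NODES system table on some other CN or on a primary DN, the manual recovery procedure also applies.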

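Problems encountered

1. After the /usr/bin/ directory was restored, the cluster was still abnormal, and the primary and standby DN roles in group_1 had swapped. This left the DNs unbalanced: host gsdb12 held two primaries and no standby, while gsdb11 held two standbys and no primary. (Details below.)

2. After CN isolation recovery, the omm user's password on that CN had reverted to the default gaussdb_123. This is a bug in 1.0.1 that has been fixed in 1.0.2. (Details below.)

3. After CN isolation recovery, the database name had reverted to the default GAUSS. This is a bug that has not been fixed yet. (Details below.)

The recovery process was as follows:

After the /usr/bin directory was restored, check the cluster status. CM, ETCD, and DN on gsdb11 are OFFLINE, and the CN is DELETED.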

[omm@gsdb11 ~]$ gs_om -t status
Set output to terminal.
2020-03-09 11:22:23.772 [error] instance (AZ1/gsdb11/ETCD1): get error(etcdserver: key is not provided) when get status, offline
2020-03-09 11:22:23.780 [error] instance (AZ1/gsdb11/CM1): get error(etcdserver: key is not provided) when get status, offline
--------------------------------------------------------------------Cluster Status--------------------------------------------------------------------
az_state : single_az
cluster_state : Degraded
balanced : false
----------------------------------------------------------------------AZ Status-----------------------------------------------------------------------
AZ:AZ1 ROLE:primary STATUS:ONLINE
---------------------------------------------------------------------Host Status----------------------------------------------------------------------
HOST:gsdb11 AZ:AZ1 STATUS:ONLINE IP:192.168.179.126
HOST:gsdb12 AZ:AZ1 STATUS:ONLINE IP:192.168.179.127
HOST:gsdb13 AZ:AZ1 STATUS:ONLINE IP:192.168.179.128
HOST:gsdb14 AZ:AZ1 STATUS:ONLINE IP:192.168.179.129
----------------------------------------------------------------Cluster Manager Status----------------------------------------------------------------
INSTANCE:CM1 ROLE:slave STATUS:OFFLINE HOST:gsdb11 ID:601
INSTANCE:CM2 ROLE:slave STATUS:ONLINE HOST:gsdb12 ID:602
INSTANCE:CM3 ROLE:primary STATUS:ONLINE HOST:gsdb13 ID:603
INSTANCE:CM4 ROLE:slave STATUS:ONLINE HOST:gsdb14 ID:604
---------------------------------------------------------------------ETCD Status----------------------------------------------------------------------
INSTANCE:ETCD1 ROLE:backup STATUS:OFFLINE HOST:gsdb11 ID:701 PORT:2379 DataDir:/u01/gaussdb/data/etcd
INSTANCE:ETCD2 ROLE:follower STATUS:ONLINE HOST:gsdb12 ID:702 PORT:2379 DataDir:/u01/gaussdb/data/etcd
INSTANCE:ETCD3 ROLE:leader STATUS:ONLINE HOST:gsdb13 ID:703 PORT:2379 DataDir:/u01/gaussdb/data/etcd
----------------------------------------------------------------------CN Status-----------------------------------------------------------------------
INSTANCE:cn_401 ROLE:no role STATUS:DELETED HOST:gsdb11 ID:401 PORT:8000 DataDir:/u01/gaussdb/data/cn
INSTANCE:cn_402 ROLE:no role STATUS:ONLINE HOST:gsdb12 ID:402 PORT:8000 DataDir:/u01/gaussdb/data/cn
INSTANCE:cn_403 ROLE:no role STATUS:ONLINE HOST:gsdb13 ID:403 PORT:8000 DataDir:/u01/gaussdb/data/cn
INSTANCE:cn_404 ROLE:no role STATUS:ONLINE HOST:gsdb14 ID:404 PORT:8000 DataDir:/u01/gaussdb/data/cn
---------------------------------------------------------Instances Status in Group (group_1)----------------------------------------------------------
INSTANCE:DB1_1 ROLE:standby STATUS:OFFLINE HOST:gsdb11 ID:1 PORT:40000 DataDir:/u01/gaussdb/data/dn1
INSTANCE:DB1_2 ROLE:primary STATUS:ONLINE HOST:gsdb12 ID:2 PORT:40021 DataDir:/u01/gaussdb/data/dn1
---------------------------------------------------------Instances Status in Group (group_2)----------------------------------------------------------
INSTANCE:DB2_3 ROLE:primary STATUS:ONLINE HOST:gsdb12 ID:3 PORT:40000 DataDir:/u01/gaussdb/data/dn2
INSTANCE:DB2_4 ROLE:standby STATUS:ONLINE HOST:gsdb13 ID:4 PORT:40021 DataDir:/u01/gaussdb/data/dn2
---------------------------------------------------------Instances Status in Group (group_3)----------------------------------------------------------
INSTANCE:DB3_5 ROLE:primary STATUS:ONLINE HOST:gsdb13 ID:5 PORT:40000 DataDir:/u01/gaussdb/data/dn3
INSTANCE:DB3_6 ROLE:standby STATUS:ONLINE HOST:gsdb14 ID:6 PORT:40021 DataDir:/u01/gaussdb/data/dn3
---------------------------------------------------------Instances Status in Group (group_4)----------------------------------------------------------
INSTANCE:DB4_8 ROLE:standby STATUS:ONLINE HOST:gsdb11 ID:8 PORT:40021 DataDir:/u01/gaussdb/data/dn4
INSTANCE:DB4_7 ROLE:primary STATUS:ONLINE HOST:gsdb14 ID:7 PORT:40000 DataDir:/u01/gaussdb/data/dn4
-----------------------------------------------------------------------Manage IP----------------------------------------------------------------------
HOST:gsdb11 IP:192.168.179.126
HOST:gsdb12 IP:192.168.179.127
HOST:gsdb13 IP:192.168.179.128
HOST:gsdb14 IP:192.168.179.129
-------------------------------------------------------------------Query Action Info------------------------------------------------------------------
HOSTNAME: gsdb11 TIME: 2020-03-09 11:22:23.862783
------------------------------------------------------------------------Float Ip------------------------------------------------------------------
HOST:gsdb14 DB4_7:192.168.179.129 IP:
HOST:gsdb13 DB3_5:192.168.179.128 IP:
HOST:gsdb12 DB2_3:192.168.179.127 IP:
HOST:gsdb12 DB1_2:192.168.179.127 IP:
[omm@gsdb11 ~]$
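When the status output is this long, it can help to filter it down to the instances that are not ONLINE. The following is a minimal Python sketch, assuming the text of `gs_om -t status` has already been captured into a string. The names `find_unhealthy` and `SAMPLE` are hypothetical, and the sample lines are copied from the output above. It relies only on the `INSTANCE:… ROLE:… STATUS:… HOST:…` field layout seen in this transcript, which other versions of the tool may not share.

```python
import re

# Matches instance lines such as:
#   INSTANCE:CM1 ROLE:slave STATUS:OFFLINE HOST:gsdb11 ID:601
# ROLE can contain a space ("no role"), so it is captured lazily up to STATUS.
INSTANCE_RE = re.compile(
    r"INSTANCE:(?P<name>\S+)\s+ROLE:(?P<role>.+?)\s+STATUS:(?P<status>\S+)"
    r"\s+HOST:(?P<host>\S+)"
)


def find_unhealthy(status_text):
    """Return (instance, role, status, host) for every instance not ONLINE."""
    return [
        (m["name"], m["role"], m["status"], m["host"])
        for m in INSTANCE_RE.finditer(status_text)
        if m["status"] != "ONLINE"
    ]


# A few lines copied from the status output in this article.
SAMPLE = """\
INSTANCE:CM1 ROLE:slave STATUS:OFFLINE HOST:gsdb11 ID:601
INSTANCE:CM2 ROLE:slave STATUS:ONLINE HOST:gsdb12 ID:602
INSTANCE:ETCD1 ROLE:backup STATUS:OFFLINE HOST:gsdb11 ID:701 PORT:2379 DataDir:/u01/gaussdb/data/etcd
INSTANCE:cn_401 ROLE:no role STATUS:DELETED HOST:gsdb11 ID:401 PORT:8000 DataDir:/u01/gaussdb/data/cn
INSTANCE:cn_402 ROLE:no role STATUS:ONLINE HOST:gsdb12 ID:402 PORT:8000 DataDir:/u01/gaussdb/data/cn
INSTANCE:DB1_1 ROLE:standby STATUS:OFFLINE HOST:gsdb11 ID:1 PORT:40000 DataDir:/u01/gaussdb/data/dn1
INSTANCE:DB1_2 ROLE:primary STATUS:ONLINE HOST:gsdb12 ID:2 PORT:40021 DataDir:/u01/gaussdb/data/dn1
"""

for name, role, status, host in find_unhealthy(SAMPLE):
    print(f"{host}: {name} ({role}) is {status}")
```

For the sample lines, every reported instance is on gsdb11, which matches the host whose /usr/bin was deleted.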

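CN isolation recovery
Note: the directory from which you run the recovery command must not be the data directory of the CN being recovered.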
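The working-directory rule above is easy to forget, so a small guard can check it before the recovery command is run. This is only a sketch under stated assumptions: the helper names `is_inside` and `safe_to_run_recovery` are hypothetical, the CN data directory is taken from the DataDir shown in the status output (/u01/gaussdb/data/cn), and the check is deliberately stricter than the note, since it also rejects subdirectories of the data directory.

```python
from pathlib import Path


def is_inside(directory, candidate):
    """Return True if `candidate` is `directory` itself or lies beneath it."""
    directory = Path(directory).resolve()
    candidate = Path(candidate).resolve()
    return candidate == directory or directory in candidate.parents


def safe_to_run_recovery(cn_data_dir, cwd=None):
    """Return True if the working directory is outside the CN data directory.

    Assumption: being anywhere below the data directory is treated as unsafe,
    which is stricter than the rule stated in the article.
    """
    cwd = Path.cwd() if cwd is None else cwd
    return not is_inside(cn_data_dir, cwd)


CN_DATA_DIR = "/u01/gaussdb/data/cn"  # DataDir of cn_401 in the status output

if safe_to_run_recovery(CN_DATA_DIR):
    print("Working directory is outside the CN data directory.")
else:
    print("Change to another directory (for example the home directory) first.")
```

Comparing resolved paths rather than raw strings avoids a false match on a sibling directory such as /u01/gaussdb/data/cn2.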

[omm@gsdb11 ~]$ gs_om -t recoverycn
Start to recovery cn.
Get deleted cn.
Check deleted cn datadir and backup dir.
Successfully check deleted cn datadir and backup dir.
Close cm update node route.
Close the CM heartbeat about CN.
Add node info of deleted CNs on other instances.
Successfully add node info of deleted CNs on other instances.
Restore deleted CNs.
Successfully restore deleted CNs.
Handle the pending dist trans of deleted CNs.
..................18s
Completed to handle the pending dist trans of deleted CNs.
Backup the deleted CNs.
..........158s
Successfully backup the deleted CNs.
Reconstruct deleted CNs.
.........96s
Successfully reconstruct deleted CNs.
Export metadata from original instance.
Export metadata from original instance cn_404.
.Successfully export metadata from original instance.
Import metadata into deleted CNs.
Successfully import metadata into deleted CNs.
Start deleted CNs.
Successfully start deleted CNs.
Open the CM heartbeat about CN.
Close cm update node route.
Successfully recovery cn.
[omm@gsdb11 ~]$

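After the recovery, the cluster is back to normal: cluster_state is Normal, although balanced is still false.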

[omm@gsdb11 ~]$ gs_om -t status
Set output to terminal.
--------------------------------------------------------------------Cluster Status--------------------------------------------------------------------
az_state : single_az
cluster_state : Normal
balanced : false
----------------------------------------------------------------------AZ Status-----------------------------------------------------------------------
AZ:AZ1 ROLE:primary STATUS:ONLINE
---------------------------------------------------------------------Host Status----------------------------------------------------------------------
HOST:gsdb11 AZ:AZ1 STATUS:ONLINE IP:192.168.179.126
HOST:gsdb12 AZ:AZ1 STATUS:ONLINE IP:192.168.179.127
HOST:gsdb13 AZ:AZ1 STATUS:ONLINE IP:192.168.179.128
HOST:gsdb14 AZ:AZ1 STATUS:ONLINE IP:192.168.179.129
----------------------------------------------------------------Cluster Manager Status----------------------------------------------------------------
INSTANCE:CM1 ROLE:slave STATUS:ONLINE HOST:gsdb11 ID:601
INSTANCE:CM2 ROLE:slave STATUS:ONLINE HOST:gsdb12 ID:602
INSTANCE:CM3 ROLE:primary STATUS:ONLINE HOST:gsdb13 ID:603
INSTANCE:CM4 ROLE:slave STATUS:ONLINE HOST:gsdb14 ID:604
---------------------------------------------------------------------ETCD Status----------------------------------------------------------------------
INSTANCE:ETCD1 ROLE:follower STATUS:ONLINE HOST:gsdb11 ID:701 PORT:2379 DataDir:/u01/gaussdb/data/etcd
INSTANCE:ETCD2 ROLE:follower STATUS:ONLINE HOST:gsdb12 ID:702 PORT:2379 DataDir:/u01/gaussdb/data/etcd
INSTANCE:ETCD3 ROLE:leader STATUS:ONLINE HOST:gsdb13 ID:703 PORT:2379 DataDir:/u01/gaussdb/data/etcd
----------------------------------------------------------------------CN Status-----------------------------------------------------------------------
INSTANCE:cn_401 ROLE:no role STATUS:ONLINE HOST:gsdb11 ID:401 PORT:8000 DataDir:/u01/gaussdb/data/cn
INSTANCE:cn_402 ROLE:no role STATUS:ONLINE HOST:gsdb12 ID:402 PORT:8000 DataDir:/u01/gaussdb/data/cn
INSTANCE:cn_403 ROLE:no role STATUS:ONLINE HOST:gsdb13 ID:403 PORT:8000 DataDir:/u01/gaussdb/data/cn
INSTANCE:cn_404 ROLE:no role STATUS:ONLINE HOST:gsdb14 ID:404 PORT:8000 DataDir:/u01/gaussdb/data/cn
---------------------------------------------------------Instances Status in Group (group_1)----------------------------------------------------------
INSTANCE:DB1_1 ROLE:standby STATUS:ONLINE HOST:gsdb11 ID:1 PORT:40000 DataDir:/u01/gaussdb/data/dn1
INSTANCE:DB1_2 ROLE:primary STATUS:ONLINE HOST:gsdb12 ID:2 PORT:40021 DataDir:/u01/gaussdb/data/dn1
---------------------------------------------------------Instances Status in Group (group_2)----------------------------------------------------------
INSTANCE:DB2_3 ROLE:primary STATUS:ONLINE HOST:gsdb12 ID:3 PORT:40000 DataDir:/u01/gaussdb/data/dn2
INSTANCE:DB2_4 ROLE:standby STATUS:ONLINE HOST:gsdb13 ID:4 PORT:40021 DataDir:/u01/gaussdb/data/dn2
---------------------------------------------------------Instances Status in Group (group_3)----------------------------------------------------------
INSTANCE:DB3_5 ROLE:primary STATUS:ONLINE HOST:gsdb13 ID:5 PORT:40000 DataDir:/u01/gaussdb/data/dn3
INSTANCE:DB3_6 ROLE:standby STATUS:ONLINE HOST:gsdb14 ID:6 PORT:40021 DataDir:/u01/gaussdb/data/dn3
---------------------------------------------------------Instances Status in Group (group_4)----------------------------------------------------------
INSTANCE:DB4_8 ROLE:standby STATUS:ONLINE HOST:gsdb11 ID:8 PORT:40021 DataDir:/u01/gaussdb/data/dn4
INSTANCE:DB4_7 ROLE:primary STATUS:ONLINE HOST:gsdb14 ID:7 PORT:40000 DataDir:/u01/gaussdb/data/dn4
-----------------------------------------------------------------------Manage IP----------------------------------------------------------------------
HOST:gsdb11 IP:192.168.179.126
HOST:gsdb12 IP:192.168.179.127
HOST:gsdb13 IP:192.168.179.128
HOST:gsdb14 IP:192.168.179.129
-------------------------------------------------------------------Query Action Info------------------------------------------------------------------
HOSTNAME: gsdb11 TIME: 2020-03-09 11:36:15.686140
------------------------------------------------------------------------Float Ip------------------------------------------------------------------
HOST:gsdb14 DB4_7:192.168.179.129 IP:
HOST:gsdb13 DB3_5:192.168.179.128 IP:
HOST:gsdb12 DB2_3:192.168.179.127 IP:
HOST:gsdb12 DB1_2:192.168.179.127 IP:
[omm@gsdb11 ~]$

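Open questions after recovery

Problem 1

The cluster status shows that the primary and standby DN roles in group_1 have swapped, which leaves the DNs unbalanced. Within group_1, gsdb11 should normally be the primary and gsdb12 the standby, but after /usr/bin/ was restored the roles had switched. As the gs_om -t status output below shows, gsdb12 now holds two primaries and no standby (primary DN1 and primary DN2), while gsdb11 holds two standbys and no primary (standby DN1 and standby DN4).
The cause: with /usr/bin deleted, some commands could not run, which triggered an automatic primary/standby failover. Once the cluster is recovered, the roles are not switched back automatically, to avoid affecting running workloads; they can only be switched back manually with a switchover.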

[omm@gsdb11 ~]$ gs_om -t status
---------------------------------------------------------Instances Status in Group (group_1)----------------------------------------------------------
INSTANCE:DB1_1 ROLE:standby STATUS:OFFLINE HOST:gsdb11 ID:1 PORT:40000 DataDir:/u01/gaussdb/data/dn1
INSTANCE:DB1_2 ROLE:primary STATUS:ONLINE HOST:gsdb12 ID:2 PORT:40021 DataDir:/u01/gaussdb/data/dn1
---------------------------------------------------------Instances Status in Group (group_2)----------------------------------------------------------
INSTANCE:DB2_3 ROLE:primary STATUS:ONLINE HOST:gsdb12 ID:3 PORT:40000 DataDir:/u01/gaussdb/data/dn2
INSTANCE:DB2_4 ROLE:standby STATUS:ONLINE HOST:gsdb13 ID:4 PORT:40021 DataDir:/u01/gaussdb/data/dn2
---------------------------------------------------------Instances Status in Group (group_3)----------------------------------------------------------
INSTANCE:DB3_5 ROLE:primary STATUS:ONLINE HOST:gsdb13 ID:5 PORT:40000 DataDir:/u01/gaussdb/data/dn3
INSTANCE:DB3_6 ROLE:standby STATUS:ONLINE HOST:gsdb14 ID:6 PORT:40021 DataDir:/u01/gaussdb/data/dn3
---------------------------------------------------------Instances Status in Group (group_4)----------------------------------------------------------
INSTANCE:DB4_8 ROLE:standby STATUS:ONLINE HOST:gsdb11 ID:8 PORT:40021 DataDir:/u01/gaussdb/data/dn4
INSTANCE:DB4_7 ROLE:primary STATUS:ONLINE HOST:gsdb14 ID:7 PORT:40000 DataDir:/u01/gaussdb/data/dn4
-----------------------------------------------------------------------Manage IP----------------------------------------------------------------------
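The imbalance described in Problem 1 can be read off the DN lines by counting primaries and standbys per host. The sketch below does that in Python, assuming the status text has been captured into a string. The names `role_counts_by_host` and `unbalanced_hosts` are hypothetical, and the rule that a host is balanced when it holds as many primaries as standbys is an assumption that fits this 4-host, 4-group layout rather than a rule documented for the tool.

```python
import re
from collections import defaultdict

# DN instance lines look like:
#   INSTANCE:DB1_1 ROLE:standby STATUS:OFFLINE HOST:gsdb11 ID:1 PORT:40000 ...
DN_RE = re.compile(
    r"INSTANCE:(?P<name>DB\S+)\s+ROLE:(?P<role>primary|standby)"
    r"\s+STATUS:\S+\s+HOST:(?P<host>\S+)"
)


def role_counts_by_host(status_text):
    """Count primary and standby DN instances on each host."""
    counts = defaultdict(lambda: {"primary": 0, "standby": 0})
    for m in DN_RE.finditer(status_text):
        counts[m["host"]][m["role"]] += 1
    return {host: dict(c) for host, c in counts.items()}


def unbalanced_hosts(counts):
    """Hosts whose primary and standby counts differ (assumed balance rule)."""
    return sorted(h for h, c in counts.items() if c["primary"] != c["standby"])


# DN lines copied from the gs_om -t status excerpt above.
SAMPLE = """\
INSTANCE:DB1_1 ROLE:standby STATUS:OFFLINE HOST:gsdb11 ID:1 PORT:40000 DataDir:/u01/gaussdb/data/dn1
INSTANCE:DB1_2 ROLE:primary STATUS:ONLINE HOST:gsdb12 ID:2 PORT:40021 DataDir:/u01/gaussdb/data/dn1
INSTANCE:DB2_3 ROLE:primary STATUS:ONLINE HOST:gsdb12 ID:3 PORT:40000 DataDir:/u01/gaussdb/data/dn2
INSTANCE:DB2_4 ROLE:standby STATUS:ONLINE HOST:gsdb13 ID:4 PORT:40021 DataDir:/u01/gaussdb/data/dn2
INSTANCE:DB3_5 ROLE:primary STATUS:ONLINE HOST:gsdb13 ID:5 PORT:40000 DataDir:/u01/gaussdb/data/dn3
INSTANCE:DB3_6 ROLE:standby STATUS:ONLINE HOST:gsdb14 ID:6 PORT:40021 DataDir:/u01/gaussdb/data/dn3
INSTANCE:DB4_8 ROLE:standby STATUS:ONLINE HOST:gsdb11 ID:8 PORT:40021 DataDir:/u01/gaussdb/data/dn4
INSTANCE:DB4_7 ROLE:primary STATUS:ONLINE HOST:gsdb14 ID:7 PORT:40000 DataDir:/u01/gaussdb/data/dn4
"""

counts = role_counts_by_host(SAMPLE)
for host in unbalanced_hosts(counts):
    print(host, counts[host])
```

On the sample lines this flags gsdb11 (two standbys) and gsdb12 (two primaries), the same pair the text identifies.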

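Problem 2

After CN isolation recovery, the omm user's password on that CN had changed to the default gaussdb_123. This looked related to the CN isolation recovery, so I raised it with Huawei.

As shown below, CN1 was restored from data exported from CN4, so in principle its omm password should match the one on CN4, yet it had become the default password.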

[omm@gsdb14 ~]$ zsql omm/yhadmin_123@192.168.179.129:8000 -q
connected.

SQL> select instance_name ,status from v$instance;

INSTANCE_NAME        STATUS
-------------------- --------------------
cn_404               OPEN

1 rows fetched.

SQL> exit
[omm@gsdb14 ~]$

After CN1 was recovered, the omm password had become the default value.

[omm@gsdb11 ~]$ zsql omm/yhadmin_123@192.168.179.126:8000 -q
GS-00329, Incorrect user or password
[omm@gsdb11 ~]$ zsql omm/gaussdb_123@192.168.179.126:8000 -q
connected.

SQL> select instance_name,status from v$instance;

INSTANCE_NAME        STATUS
-------------------- --------------------
cn_401               OPEN

1 rows fetched.

SQL> exit
[omm@gsdb11 ~]$

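Huawei's reply: this is a bug, and it has been fixed in version 1.0.2.

Problem 3

After CN isolation recovery, the database name had changed to the default GAUSS.
The cause: CN isolation recovery first exports the metadata, then re-creates the database, and finally imports the metadata. The database is re-created from the default creation template, so its name ends up as the default GAUSS.
This is a bug that Huawei has not yet fixed.

After CN1 was recovered, its database name became the default GAUSS. The name I had previously set at installation time was YHDB.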

[omm@gsdb11 ~]$ zsql omm/yhadmin_123@192.168.179.126:8000 -q
connected.

SQL> select * from v$version;

VERSION
----------------------------------------------------------------
GaussDB_100_1.0.1.B023 Release d92e025
ZENGINE
d92e025

3 rows fetched.

SQL> select name,status,open_status,database_role from v$database;

NAME            STATUS     OPEN_STATUS     DATABASE_ROLE
--------------- ---------- --------------- -------------------------
GAUSS           OPEN       READ WRITE      PRIMARY

1 rows fetched.

SQL> exit
[omm@gsdb11 ~]$

Logging in on the other CN nodes (gsdb12, gsdb13, gsdb14) shows that the database name is unchanged there:

[omm@gsdb12 ~]$ zsql omm/yhadmin_123@192.168.179.127:8000 -q
connected.

SQL> select * from v$version;

VERSION
----------------------------------------------------------------
GaussDB_T_1.0.2.B307 Release d4484ac
ZENGINE

2 rows fetched.

SQL> select instance_name,status from v$instance;

INSTANCE_NAME        STATUS
-------------------- --------------------
cn_402               OPEN

1 rows fetched.

SQL> select name,status,open_status,database_role from v$database;

NAME                             STATUS               OPEN_STATUS          DATABASE_ROLE
-------------------------------- -------------------- -------------------- ------------------------------
YHDB                             OPEN                 READ WRITE           PRIMARY

1 rows fetched.

SQL> exit
[omm@gsdb12 ~]$

[omm@gsdb13 ~]$ zsql omm/yhadmin_123@192.168.179.128:8000 -q
connected.

SQL> select * from v$version;

VERSION
----------------------------------------------------------------
GaussDB_T_1.0.2.B307 Release d4484ac
ZENGINE

2 rows fetched.

SQL> select instance_name,status from v$instance;

INSTANCE_NAME        STATUS
-------------------- --------------------
cn_403               OPEN

1 rows fetched.

SQL> select name,status,open_status,database_role from v$database;

NAME                             STATUS               OPEN_STATUS          DATABASE_ROLE
-------------------------------- -------------------- -------------------- ------------------------------
YHDB                             OPEN                 READ WRITE           PRIMARY

1 rows fetched.

SQL> exit
[omm@gsdb13 ~]$

[omm@gsdb14 ~]$ zsql omm/yhadmin_123@192.168.179.128:8000 -q
connected.

SQL> select * from v$version;

VERSION
----------------------------------------------------------------
GaussDB_T_1.0.2.B307 Release d4484ac
ZENGINE

2 rows fetched.

SQL> select instance_name,status from v$instance;

INSTANCE_NAME        STATUS
-------------------- --------------------
cn_404               OPEN

1 rows fetched.

SQL> select name,status,open_status,database_role from v$database;

NAME                             STATUS               OPEN_STATUS          DATABASE_ROLE
-------------------------------- -------------------- -------------------- ------------------------------
YHDB                             OPEN                 READ WRITE           PRIMARY

1 rows fetched.

SQL> exit
[omm@gsdb14 ~]$
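Checking each CN by hand works for four nodes, but the comparison itself is simple to script once the names have been collected. The following Python sketch assumes the NAME value from v$database has already been gathered per CN into a dictionary; how it is gathered is left out. The function name `find_name_outliers` is hypothetical, and the values in `observed` are the ones shown in the sessions in this article.

```python
from collections import Counter


def find_name_outliers(names_by_cn):
    """Return (majority_name, {cn: name}) for CNs that differ from the majority.

    On a tie, Counter.most_common keeps the name that was seen first, so a
    tied result should be reviewed by hand.
    """
    if not names_by_cn:
        raise ValueError("no database names supplied")
    majority, _ = Counter(names_by_cn.values()).most_common(1)[0]
    outliers = {cn: name for cn, name in names_by_cn.items() if name != majority}
    return majority, outliers


# NAME values from v$database as shown in the zsql sessions in this article.
observed = {
    "cn_401": "GAUSS",
    "cn_402": "YHDB",
    "cn_403": "YHDB",
    "cn_404": "YHDB",
}

majority, outliers = find_name_outliers(observed)
print("majority name:", majority)
for cn, name in outliers.items():
    print(f"{cn} reports {name}")
```

With these values the majority name is YHDB and only cn_401, the recovered CN, stands out.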