This article analyzes the memory cache modules that MySQL Group Replication (MGR) adds to mysqld compared with traditional MySQL replication, illustrates the problems they can cause, and previews potential optimizations.
Unlike traditional MySQL primary-secondary replication, mysqld consumes noticeably more memory when running in MySQL Group Replication mode. If you deploy MGR in a memory-constrained environment such as a cloud VM or a Docker container, pay special attention and plan memory capacity according to your workload.
MGR introduces several memory caches that serve different purposes. This article covers only the two that matter most during normal operation: the conflict-detection database of the certification module (the certification_info object) and the paxos cache of the underlying xcom module.
The certification_info object
Let us start with the conflict-detection database, i.e. the certification_info object. It stores the writesets of transactions; see the previous article for the implementation details.
A transaction's writeset can only be purged from certification_info after the transaction has been executed or applied on every MGR node, so the larger the replication lag between nodes, the more writesets accumulate. Each writeset entry consists of two parts: the hash of a key and the gtid_executed snapshot taken when the transaction executed. The former has a fixed size, but the size of the latter cannot be bounded in advance. For example:
gtid_executed a is
6c6aa49f-22dd-11e9-82e2-c81f66e48c6e:1-696578346
gtid_executed b is
597672ea-a498-11e6-9dd2-246e9627d610:1-13276447793,
9041b15e-a686-11e8-8bf2-246e9672a2f0:1-1234925114,
cbaf3b47-bcb3-11e8-aaa6-246e96c4fc68:1-8077131748,
e8ea688c-2b73-11e7-a9d6-246e9627f950:1-6362,
ec82f85a-145d-11e7-979e-246e96280570:1-45540504450,
f73f342f-a496-11e6-a74a-246e9627cdc8:1-5118868458
b occupies several times the memory of a; if every writeset carries a gtid_executed like b, far more memory is consumed. Therefore, when using MGR, the fewer server_uuids in gtid_executed, the better.
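As a rough illustration, we can compare the textual lengths of the two examples. String length is only a proxy for the real in-memory Gtid_set, but both grow with the number of server_uuid:interval entries, so it gives a first-order feel for the relative cost:

```c
#include <assert.h>
#include <string.h>

/* The two gtid_executed examples from the text above. Comparing their
 * textual lengths is only a proxy for the in-memory Gtid_set size,
 * but both grow with the number of server_uuid:interval entries. */
static const char *gtid_a =
    "6c6aa49f-22dd-11e9-82e2-c81f66e48c6e:1-696578346";
static const char *gtid_b =
    "597672ea-a498-11e6-9dd2-246e9627d610:1-13276447793,"
    "9041b15e-a686-11e8-8bf2-246e9672a2f0:1-1234925114,"
    "cbaf3b47-bcb3-11e8-aaa6-246e96c4fc68:1-8077131748,"
    "e8ea688c-2b73-11e7-a9d6-246e9627f950:1-6362,"
    "ec82f85a-145d-11e7-979e-246e96280570:1-45540504450,"
    "f73f342f-a496-11e6-a74a-246e9627cdc8:1-5118868458";
```

Since every writeset entry carries such a snapshot, this per-entry overhead is multiplied by the number of writesets accumulated in certification_info.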
MGR's flow-control module currently throttles based on the number of transactions waiting for certification and waiting to be applied, but the number of writesets has no fixed relationship to the number of transactions: one transaction may touch a single row, or hundreds of them. In such cases MGR's flow control cannot cap the number of writesets held in the certification_info object, and you can end up with very few executed transactions but a huge number of writesets in memory.
This scenario is easy to reproduce with a multi-threaded sysbench 1.0 prepare run. During data preparation each transaction is about 1MB and inserts 500 rows. Each table has 2 indexes, and each index record produces 2 writeset variants, one with collation and one without, so a single transaction generates 2,000 writesets. With 128 tables and 128 prepare threads, even at 1 TPS per thread more than 250,000 writesets are produced per second. With the MySQL community edition's 60-second purge interval, roughly 15 million writesets can accumulate. That is a substantial memory cost.
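The arithmetic of this scenario can be checked with a small sketch; all constants are taken from the description above:

```c
#include <assert.h>

/* Back-of-the-envelope check of the sysbench prepare scenario:
 * 500 rows per transaction, 2 indexes per table, 2 writeset variants
 * per index record (with and without collation), 128 threads at
 * 1 TPS each, and a 60-second purge interval. */
enum {
  ROWS_PER_TRX       = 500,
  INDEXES_PER_TABLE  = 2,
  VARIANTS_PER_INDEX = 2, /* with / without collation */
  THREADS            = 128,
  TPS_PER_THREAD     = 1,
  PURGE_PERIOD_SEC   = 60
};

static long writesets_per_trx(void) {
  return (long)ROWS_PER_TRX * INDEXES_PER_TABLE * VARIANTS_PER_INDEX;
}

static long writesets_per_purge_period(void) {
  return writesets_per_trx() * THREADS * TPS_PER_THREAD * PURGE_PERIOD_SEC;
}
```

That is 2,000 writesets per transaction, 256,000 per second across all threads, and over 15 million accumulated within one 60-second purge window.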
paxos cache
The following definitions relate to the paxos cache.
/*
  We require that the number of elements in the cache is big enough that
  it is always possible to find instances that are not busy.
  Under normal circumstances the number of busy instances will be
  less than event_horizon, since the proposers only consider
  instances which belong to the local node.
  A node may start proposing no_ops for instances belonging
  to other nodes, meaning that event_horizon * NSERVERS instances may be
  involved. However, for the time being, proposing a no_op for an instance
  will not mark it as busy. This may change in the future, so a safe upper
  limit on the number of nodes marked as busy is event_horizon * NSERVERS.
*/
#define CACHED 50000
static lru_machine cache[CACHED]; /* The Paxos instances, plus a link for the LRU chain */
In the MySQL community edition, the paxos cache holds 50,000 lru_machine entries.
/* {{{ Paxos machine cache */
struct lru_machine {
  linkage lru_link;
  pax_machine pax;
};
lru_machine is a thin wrapper around pax_machine, which is defined as follows:
/* Definition of a Paxos instance */
struct pax_machine {
  linkage hash_link;
  lru_machine *lru;
  synode_no synode;
  double last_modified; /* Start time */
  linkage rv; /* Tasks may sleep on this until something interesting happens */
  struct {
    ballot bal;            /* The current ballot we are working on */
    bit_set *prep_nodeset; /* Nodes which have answered my prepare */
    ballot sent_prop;
    bit_set *prop_nodeset; /* Nodes which have answered my propose */
    pax_msg *msg;          /* The value we are trying to push */
    ballot sent_learn;
  } proposer;
  struct {
    ballot promise; /* Promise to not accept any proposals less than this */
    pax_msg *msg;   /* The value we have accepted */
  } acceptor;
  struct {
    pax_msg *msg; /* The value we have learned */
  } learner;
  int lock; /* Busy ? */
  pax_op op;
  int force_delivery;
};
Each pax_machine is an independent paxos instance. As the definition shows, it contains three sub-objects, proposer, acceptor and learner, corresponding to the three roles of the paxos consensus protocol. The pax_msg *msg fields hold the writeset payload of one or more transactions (a msg contains multiple transactions when paxos batches them).
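A simplified sketch can make the cost structure explicit. These are illustrative mirror types, not the real MySQL structures: the statically allocated array is only a fixed skeleton whose size is known at compile time, while the part that actually varies, and that cache_limit has to track at runtime, is the dynamically allocated pax_msg payloads hanging off each instance:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative mirror of the structures above (not the real types).
 * Each instance owns up to three pax_msg payloads whose sizes depend
 * on the writesets of the batched transactions. */
typedef struct sketch_pax_msg { size_t payload_bytes; } sketch_pax_msg;

typedef struct sketch_machine {
  sketch_pax_msg *proposer_msg; /* mirrors proposer.msg */
  sketch_pax_msg *acceptor_msg; /* mirrors acceptor.msg */
  sketch_pax_msg *learner_msg;  /* mirrors learner.msg  */
} sketch_machine;

#define SKETCH_CACHED 50000
static sketch_machine sketch_cache[SKETCH_CACHED];

/* Fixed cost: known at compile time, independent of the workload. */
static size_t sketch_skeleton_bytes(void) { return sizeof(sketch_cache); }

/* Variable cost: the sum of all attached payloads; this is the part
 * that a limit like cache_limit must account for at runtime. */
static size_t sketch_payload_bytes(void) {
  size_t total = 0;
  int i;
  for (i = 0; i < SKETCH_CACHED; i++) {
    if (sketch_cache[i].proposer_msg)
      total += sketch_cache[i].proposer_msg->payload_bytes;
    if (sketch_cache[i].acceptor_msg)
      total += sketch_cache[i].acceptor_msg->payload_bytes;
    if (sketch_cache[i].learner_msg)
      total += sketch_cache[i].learner_msg->payload_bytes;
  }
  return total;
}
```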
Consequently, the total size of the paxos cache also depends on the number of writesets per transaction and cannot be computed precisely. Compared with the conflict-detection database, however, MySQL does better here by introducing a limit:
/*
  cache size limit and interval
*/
size_t cache_limit;
Its default value is 1,000,000,000 bytes, roughly 1GB:
/* Reasonable initial cache limit */
#define CACHE_LIMIT 1000000000ULL
The logic is as follows: after a paxos message that has reached consensus is pushed up to MySQL for execution, the current cache size is checked, and if it exceeds 1GB, cache cleanup is triggered.
/*
  Loop through the LRU (protected_lru) and deallocate objects until the size of
  the cache is below the limit.
  The freshly initialized objects are put into the probation_lru, so we can
  always start scanning at the end of protected_lru.
  lru_get will always look in probation_lru first.
*/
void shrink_cache()
{
  FWD_ITER(&protected_lru, lru_machine,
    if (above_cache_limit() && can_deallocate(link_iter)) {
      last_removed_cache = link_iter->pax.synode;
      hash_out(&link_iter->pax); /* Remove from hash table */
      link_into(link_out(&link_iter->lru_link), &probation_lru); /* Put in probation lru */
      init_pax_machine(&link_iter->pax, link_iter, null_synode);
    } else {
      return;
    }
  );
}
The cleanup logic walks the whole cache and uses the can_deallocate function to find lru_machines that can be reclaimed; one of the criteria is that every node has already received the message that reached consensus. It then calls init_pax_machine to free the pax_msg objects attached to the instance. If during cleanup the cache drops below 1GB, cleanup also stops. Under normal conditions, therefore, the paxos cache hovers around 1GB. But if inter-node network latency is high and one node lags far behind, the cache can grow past the hard-coded 1GB threshold.
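The reclamation criterion can be sketched as follows. The names and details are simplified for illustration; the real can_deallocate in xcom is more involved:

```c
#include <assert.h>

/* Simplified sketch of the reclamation rule described above: an
 * instance may be freed only when every node has already received the
 * agreed-on message, i.e. its synode is no newer than the slowest
 * node's delivered position. Names are illustrative, not xcom's. */
typedef unsigned long long sketch_synode;

static int sketch_can_deallocate(sketch_synode machine_synode,
                                 const sketch_synode *delivered_by_node,
                                 int nservers) {
  int i;
  for (i = 0; i < nservers; i++)
    if (machine_synode > delivered_by_node[i])
      return 0; /* some node has not received this message yet */
  return 1;
}
```

This also makes the failure mode visible: a single lagging node pins every instance newer than its delivered position, so shrink_cache cannot reclaim them and the cache keeps growing.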
Moreover, even after the total cache size has exceeded the threshold, it can keep growing, as the following code shows:
/*
  Get a machine for (re)use.
  The machines are statically allocated, and organized in two lists.
  probation_lru is the free list.
  protected_lru tracks the machines that are currently in the cache in
  least recently used order.
*/
static lru_machine *lru_get()
{
  lru_machine *retval = NULL;
  /* !above_cache_limit(): added by InnoSQL here to make sure the cache
     size will not grow past the cache limit */
  if (!link_empty(&probation_lru)) {
    retval = (lru_machine *)link_first(&probation_lru);
  } else {
    /* Find the first non-busy instance in the LRU */
    FWD_ITER(&protected_lru, lru_machine,
      if (!is_busy_machine(&link_iter->pax)) {
        retval = link_iter;
        /* Since this machine is in the cache, we need to update
           last_removed_cache */
        last_removed_cache = retval->pax.synode;
        break;
      }
    )
  }
  assert(retval && !is_busy_machine(&retval->pax));
  return retval;
}
As the comment explains, lru_get allocates an lru_machine from the paxos cache. If the free list probation_lru still has idle machines, it allocates from there first. But when the total cache size is already above the threshold, taking another machine from probation_lru grows the cache even further.
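The "added by InnoSQL" note in the code above hints at one possible mitigation, which can be sketched as follows. This is an illustrative simplification of the idea, not the actual patch:

```c
#include <assert.h>

/* Sketch of the mitigation hinted at in lru_get above: consult the
 * free list only while the cache is still under its limit; once over
 * the limit, fall through to recycling a non-busy instance from
 * protected_lru so the total size stops growing. Illustrative only. */
static int sketch_take_from_free_list(int free_list_nonempty,
                                      int over_cache_limit) {
  /* The original behaviour is simply `free_list_nonempty`, which lets
   * the cache keep growing past cache_limit. */
  return free_list_nonempty && !over_cache_limit;
}
```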
Summary
This article introduced the memory modules that MGR adds on top of ordinary MySQL replication and analyzed their potential problems. The next article will discuss how to optimize them.