BlockIdData in PostgreSQL Explained
2020-05-19 17:31:22

This article is based on the PostgreSQL master branch as of commit 087d3d0583cf292146a7385746d1f5b53eeeaee6.

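Anyone who has studied PostgreSQL's storage layout will be familiar with ctid. A ctid is PostgreSQL's identifier for a physical tuple. It is only meaningful within a single relation, so tuples in different objects can carry the same ctid.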

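A ctid has two parts, a page number and a tuple number, and is usually written as '(12, 33)'. The page number is the page's position counted across all segment files of the relation; it is a single contiguous numbering, not one that restarts in each segment. The tuple number is the position within that page. Together the two values identify a tuple.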
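To make "position counted across all segments" concrete, here is a minimal sketch of how a contiguous page number could be mapped to a segment file. It assumes a default build (8 KB blocks, 1 GB segment files, hence 131072 blocks per segment); the `sketch_` names are hypothetical helpers for illustration, not PostgreSQL functions.

```c
#include <stdint.h>

/* Assumption: default build with 8 KB blocks and 1 GB segment files,
 * i.e. 131072 blocks per segment. */
#define SKETCH_RELSEG_SIZE 131072u

typedef uint32_t BlockNumber;

/* Segment file that holds the block: 0 is the base file, 1 is ".1", ... */
static inline uint32_t sketch_segment_of(BlockNumber blkno)
{
    return blkno / SKETCH_RELSEG_SIZE;
}

/* Position of the block inside that segment file. */
static inline uint32_t sketch_block_in_segment(BlockNumber blkno)
{
    return blkno % SKETCH_RELSEG_SIZE;
}
```

Under these assumptions, page 12 lives in segment 0 at position 12, while page 131072 is the first page of segment 1. The ctid itself only ever stores the contiguous number.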

/*
 * ItemPointer:
 *
 * This is a pointer to an item within a disk page of a known file
 * (for example, a cross-link from an index to its parent table).
 * ip_blkid tells us which block, ip_posid tells us which entry in
 * the linp (ItemIdData) array we want.
 *
 * Note: because there is an item pointer in each tuple header and index
 * tuple header on disk, it's very important not to waste space with
 * structure padding bytes.  The struct is designed to be six bytes long
 * (it contains three int16 fields) but a few compilers will pad it to
 * eight bytes unless coerced.  We apply appropriate persuasion where
 * possible.  If your compiler can't be made to play along, you'll waste
 * lots of space.
 */
typedef struct ItemPointerData
{
    BlockIdData ip_blkid;
    OffsetNumber ip_posid;
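}
/* pg_attribute_packed() and pg_attribute_aligned(2) are applied here when the compiler supports them */
ItemPointerData;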

The above is the source-level representation of a ctid, that is, ItemPointerData. OffsetNumber ip_posid holds the tuple number.

The part to pay attention to is BlockIdData ip_blkid. Here is the source of BlockIdData:

/*
 * BlockId:
 *
 * this is a storage type for BlockNumber.  in other words, this type
 * is used for on-disk structures (e.g., in HeapTupleData) whereas
 * BlockNumber is the type on which calculations are performed (e.g.,
 * in access method code).
 *
 * there doesn't appear to be any reason to have separate types except
 * for the fact that BlockIds can be SHORTALIGN'd (and therefore any
 * structures that contains them, such as ItemPointerData, can also be
 * SHORTALIGN'd).  this is an important consideration for reducing the
 * space requirements of the line pointer (ItemIdData) array on each
 * page and the header of each heap or index tuple, so it doesn't seem
 * wise to change this without good reason.
 */
typedef struct BlockIdData
{
    uint16      bi_hi;
    uint16      bi_lo;
} BlockIdData;

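As shown above, the page number is made up of two parts, bi_hi and bi_lo, and taken together it is really just an unsigned integer. For a long time I assumed that hi was the number of the segment file and lo the position within that segment. While going through the storage code recently, I noticed that the page number is contiguous, looked again, and found that the page number is simply typedef uint32 BlockNumber: bi_hi holds its upper 16 bits and bi_lo its lower 16 bits.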
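The split into two halves can be sketched as follows. PostgreSQL's block.h provides its own accessors for this; the `sketch_` functions below are hypothetical stand-ins written for illustration, using `uint16_t`/`uint32_t` from `<stdint.h>` in place of PostgreSQL's `uint16`/`uint32`.

```c
#include <stdint.h>

typedef uint32_t BlockNumber;

typedef struct BlockIdData
{
    uint16_t bi_hi;   /* upper 16 bits of the block number */
    uint16_t bi_lo;   /* lower 16 bits of the block number */
} BlockIdData;

/* Store a 32-bit block number as two 16-bit halves. */
static inline void sketch_block_id_set(BlockIdData *id, BlockNumber blkno)
{
    id->bi_hi = (uint16_t) (blkno >> 16);
    id->bi_lo = (uint16_t) (blkno & 0xFFFF);
}

/* Reassemble the 32-bit block number from its two halves. */
static inline BlockNumber sketch_block_id_get(const BlockIdData *id)
{
    return ((BlockNumber) id->bi_hi << 16) | (BlockNumber) id->bi_lo;
}
```

For block 131073 (0x20001) this yields bi_hi = 2 and bi_lo = 1. With default 1 GB segments that block sits in segment 1, so bi_hi is clearly not a segment number; it is only the high half of one integer.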

Here is the source related to BlockNumber:

/*
 * BlockNumber:
 *
 * each data file (heap or index) is divided into postgres disk blocks
 * (which may be thought of as the unit of i/o -- a postgres buffer
 * contains exactly one disk block).  the blocks are numbered
 * sequentially, 0 to 0xFFFFFFFE.
 *
 * InvalidBlockNumber is the same thing as P_NEW in bufmgr.h.
 *
 * the access methods, the buffer manager and the storage manager are
 * more or less the only pieces of code that should be accessing disk
 * blocks directly.
 */
typedef uint32 BlockNumber;

#define InvalidBlockNumber		((BlockNumber) 0xFFFFFFFF)

#define MaxBlockNumber			((BlockNumber) 0xFFFFFFFE)

So why does PostgreSQL use BlockIdData in place of BlockNumber? It is purely a way to save bytes on disk and has no other meaning.

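ItemPointerData is part of the tuple header, so every tuple contains one. The tuple number, ip_posid, never exceeds the range of an unsigned short (2 bytes)[1], so 2 bytes are enough for ip_posid and keep data bloat down[2]. If the block number were stored as a 4-byte type, however, the compiler would align the struct on a 4-byte boundary and pad ItemPointerData to 8 bytes. PostgreSQL therefore stores the BlockNumber as a BlockIdData, which needs only 2-byte alignment and makes ItemPointerData 6 bytes. A few compilers still pad it to 8 bytes, but most, such as GCC and Clang, produce 6. This trick avoids wasted space. It does cost some extra computation to reassemble the block number, but that is far cheaper than the I/O overhead the extra bloat would cause.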
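The padding effect can be checked with a small sketch. Both struct layouts below are hypothetical illustrations, not PostgreSQL definitions, and the sizes are what common ABIs (for example x86-64 with GCC or Clang) typically produce; the C standard does not guarantee them.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: block number stored directly as a 4-byte integer.
 * The struct inherits 4-byte alignment, so 2 padding bytes are added. */
typedef struct SketchPaddedPointer
{
    uint32_t blkno;
    uint16_t posid;
} SketchPaddedPointer;

/* Block number split into two 2-byte halves, as BlockIdData does. */
typedef struct SketchBlockId
{
    uint16_t bi_hi;
    uint16_t bi_lo;
} SketchBlockId;

/* Every member is 2-byte aligned, so no padding is needed. */
typedef struct SketchCompactPointer
{
    SketchBlockId blkid;
    uint16_t      posid;
} SketchCompactPointer;

static inline size_t sketch_padded_size(void)
{
    return sizeof(SketchPaddedPointer);    /* typically 8 */
}

static inline size_t sketch_compact_size(void)
{
    return sizeof(SketchCompactPointer);   /* typically 6 */
}
```

Two bytes per pointer looks small, but the saving applies to every heap tuple header and every index tuple, which is why the extra shift-and-or on each access is considered a good trade.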

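References

  1. ^ The largest page size is 32 KB, i.e. 32768 bytes, while the maximum value of an unsigned short is 65535.
  2. ^ A single byte would be too small.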