BlockIdData in PostgreSQL Explained

2020-05-19 17:31:22

This article is based on the PostgreSQL master branch at the time of writing; the commit ID is 087d3d0583cf292146a7385746d1f5b53eeeaee6.

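Anyone who has studied PostgreSQL's storage layout will be familiar with ctid. A ctid is PostgreSQL's identifier for a physical tuple. It is only unique within one relation, so different objects can contain the same ctid.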

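A ctid has two parts, a page number and a tuple number, and is usually written like '(12, 33)'. The page number is the page's position counted across all segment files of the relation, so it is a single contiguous sequence rather than one that restarts in each segment. The tuple number is the tuple's position within that page. Together the two values identify a tuple.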
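To make "counted across all segment files" concrete, here is a minimal sketch of how a contiguous page number could be mapped to a segment file and a position inside it. It assumes the default build settings of 8 kB blocks and 1 GB segment files (131072 blocks per segment). The names SketchSegmentPos, sketch_locate_block and SKETCH_BLOCKS_PER_SEGMENT are hypothetical and are not PostgreSQL identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 1 GB segment files / 8 kB blocks = 131072 blocks per segment. */
#define SKETCH_BLOCKS_PER_SEGMENT 131072u

/* Hypothetical helper type: where a page lives on disk. */
typedef struct SketchSegmentPos
{
    uint32_t segment;           /* which segment file (0, 1, 2, ...) */
    uint32_t block_in_segment;  /* page index inside that segment file */
} SketchSegmentPos;

/* Map a relation-wide page number to (segment, page within segment). */
static SketchSegmentPos
sketch_locate_block(uint32_t block_number)
{
    SketchSegmentPos pos;

    pos.segment = block_number / SKETCH_BLOCKS_PER_SEGMENT;
    pos.block_in_segment = block_number % SKETCH_BLOCKS_PER_SEGMENT;
    return pos;
}

/* Small usage example: the page number keeps counting past a segment boundary. */
static void
sketch_locate_block_check(void)
{
    SketchSegmentPos last_of_first = sketch_locate_block(SKETCH_BLOCKS_PER_SEGMENT - 1);
    SketchSegmentPos first_of_second = sketch_locate_block(SKETCH_BLOCKS_PER_SEGMENT);

    assert(last_of_first.segment == 0);
    assert(first_of_second.segment == 1);
    assert(first_of_second.block_in_segment == 0);
}
```

The segment is derived from the page number by division, so the ctid itself never needs to store a segment number.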

/*
 * ItemPointer:
 *
 * This is a pointer to an item within a disk page of a known file
 * (for example, a cross-link from an index to its parent table).
 * ip_blkid tells us which block, ip_posid tells us which entry in
 * the linp (ItemIdData) array we want.
 *
 * Note: because there is an item pointer in each tuple header and index
 * tuple header on disk, it's very important not to waste space with
 * structure padding bytes.  The struct is designed to be six bytes long
 * (it contains three int16 fields) but a few compilers will pad it to
 * eight bytes unless coerced.  We apply appropriate persuasion where
 * possible.  If your compiler can't be made to play along, you'll waste
 * lots of space.
 */
typedef struct ItemPointerData
{
    BlockIdData ip_blkid;
    OffsetNumber ip_posid;
} ItemPointerData;

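The above is the source representation of a ctid, that is, ItemPointerData. OffsetNumber ip_posid holds the tuple number.

The field to pay attention to here is BlockIdData ip_blkid. First, the source of BlockIdData: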

/*
 * BlockId:
 *
 * this is a storage type for BlockNumber.  in other words, this type
 * is used for on-disk structures (e.g., in HeapTupleData) whereas
 * BlockNumber is the type on which calculations are performed (e.g.,
 * in access method code).
 *
 * there doesn't appear to be any reason to have separate types except
 * for the fact that BlockIds can be SHORTALIGN'd (and therefore any
 * structures that contains them, such as ItemPointerData, can also be
 * SHORTALIGN'd).  this is an important consideration for reducing the
 * space requirements of the line pointer (ItemIdData) array on each
 * page and the header of each heap or index tuple, so it doesn't seem
 * wise to change this without good reason.
 */
typedef struct BlockIdData
{
    uint16      bi_hi;
    uint16      bi_lo;
} BlockIdData;

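As shown above, the page number is stored in two parts, bi_hi and bi_lo, yet the page number itself is really a single unsigned integer. For a long time I assumed that hi was the number of the segment file and lo was the page number within that segment. While going through the storage code recently I noticed that page numbers are contiguous, so I looked again: the page number is simply typedef uint32 BlockNumber, and bi_hi and bi_lo are its high and low 16 bits.

Here is the source for BlockNumber: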
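The relationship between the two 16-bit halves and the 32-bit page number can be sketched as below. PostgreSQL's block.h provides its own conversion helpers (BlockIdSet and BlockIdGetBlockNumber). The standalone version here only illustrates the same bit arithmetic, and the names SketchBlockId, sketch_block_id_from_number and sketch_block_id_to_number are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for BlockIdData: two 16-bit halves. */
typedef struct SketchBlockId
{
    uint16_t hi;    /* upper 16 bits of the page number */
    uint16_t lo;    /* lower 16 bits of the page number */
} SketchBlockId;

/* Split a 32-bit page number into its storage form. */
static SketchBlockId
sketch_block_id_from_number(uint32_t block_number)
{
    SketchBlockId id;

    id.hi = (uint16_t) (block_number >> 16);
    id.lo = (uint16_t) (block_number & 0xFFFFu);
    return id;
}

/* Reassemble the 32-bit page number from the two halves. */
static uint32_t
sketch_block_id_to_number(SketchBlockId id)
{
    return ((uint32_t) id.hi << 16) | (uint32_t) id.lo;
}

/* Small usage example: splitting and rejoining is lossless. */
static void
sketch_block_id_check(void)
{
    uint32_t      n = 0x00010002u;
    SketchBlockId id = sketch_block_id_from_number(n);

    assert(id.hi == 1 && id.lo == 2);
    assert(sketch_block_id_to_number(id) == n);
}
```

Neither half has a meaning of its own. In particular, hi is not a segment number. It only becomes non-zero once the page number exceeds 65535.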

/*
 * BlockNumber:
 *
 * each data file (heap or index) is divided into postgres disk blocks
 * (which may be thought of as the unit of i/o -- a postgres buffer
 * contains exactly one disk block).  the blocks are numbered
 * sequentially, 0 to 0xFFFFFFFE.
 *
 * InvalidBlockNumber is the same thing as P_NEW in bufmgr.h.
 *
 * the access methods, the buffer manager and the storage manager are
 * more or less the only pieces of code that should be accessing disk
 * blocks directly.
 */
typedef uint32 BlockNumber;

#define InvalidBlockNumber		((BlockNumber) 0xFFFFFFFF)

#define MaxBlockNumber			((BlockNumber) 0xFFFFFFFE)

So why does PostgreSQL store a BlockIdData instead of a BlockNumber? It is purely a way to save bytes and has no other significance.

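ItemPointerData is part of the tuple header, so every tuple carries one. The tuple number, ip_posid, never needs more than an unsigned short (2 bytes)^[1], so ip_posid is declared as 2 bytes to limit bloat^[2]. If ip_blkid were a single 4-byte integer such as BlockNumber, the compiler would align the struct to 4 bytes and pad ItemPointerData to 8 bytes. PostgreSQL therefore stores the BlockNumber as a BlockIdData made of two 2-byte fields, which needs only 2-byte alignment and lets ItemPointerData be 6 bytes. A few compilers still pad it to 8 bytes, but most, including GCC and Clang, produce 6. The trick avoids wasted space at the cost of some extra computation to split and reassemble the block number, and that cost is far smaller than the I/O penalty of larger data.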
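The padding effect can be observed with a small sketch that compares the two layouts. The struct names are hypothetical stand-ins, not PostgreSQL's types, and the C standard does not guarantee the exact sizes. On mainstream compilers and platforms (for example GCC or Clang on x86-64) the wide layout is typically 8 bytes and the narrow one 6.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout with a 4-byte page number: the struct inherits 4-byte alignment. */
typedef struct SketchWidePointer
{
    uint32_t block;     /* 4 bytes, usually 4-byte aligned */
    uint16_t offset;    /* 2 bytes, usually followed by 2 padding bytes */
} SketchWidePointer;

/* Hypothetical stand-in for BlockIdData. */
typedef struct SketchBlockId
{
    uint16_t hi;
    uint16_t lo;
} SketchBlockId;

/* Layout with the page number split in two: only 2-byte alignment needed. */
typedef struct SketchNarrowPointer
{
    SketchBlockId block;    /* 2 + 2 bytes */
    uint16_t      offset;   /* 2 bytes, typically no trailing padding */
} SketchNarrowPointer;

/* Bytes saved per pointer by the split layout on the current platform. */
static size_t
sketch_bytes_saved_per_pointer(void)
{
    return sizeof(SketchWidePointer) - sizeof(SketchNarrowPointer);
}

/* Small usage example: the narrow layout is never larger than the wide one. */
static void
sketch_pointer_size_check(void)
{
    assert(sizeof(SketchNarrowPointer) <= sizeof(SketchWidePointer));
    assert(offsetof(SketchNarrowPointer, offset) == sizeof(SketchBlockId));
}
```

Two bytes per pointer looks small, but one such pointer sits in every heap tuple header and every index tuple, so the saving is multiplied by the number of tuples stored.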