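This article is based on the current PostgreSQL master branch; the commit ID at the time of writing is 087d3d0583cf292146a7385746d1f5b53eeeaee6.
Anyone who has studied PostgreSQL's storage layout will be familiar with ctid. A ctid is PostgreSQL's identifier for a physical tuple. It is only meaningful within one object: different objects can each contain a tuple with the same ctid.
A ctid has two parts, a page number and a tuple number, and is usually written like '(12, 33)'. The page number is the page's position across all segments of the object, so it is a single continuous value rather than a per-segment one; the tuple number is the tuple's position within that page. Together, these two values identify a tuple.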
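Because the page number runs continuously across segment files, locating a page on disk takes one division. The sketch below illustrates this under the assumption of a default build (8 kB blocks and 1 GB segment files, so 131072 blocks per segment). The `sketch_`/`SKETCH_` names are hypothetical and are not PostgreSQL's own identifiers.

```c
#include <stdint.h>

/* Assumption: default build settings, 8 kB blocks and 1 GB segment files. */
#define SKETCH_BLCKSZ      8192u
#define SKETCH_RELSEG_SIZE 131072u   /* blocks per segment = 1 GB / 8 kB */

typedef uint32_t SketchBlockNumber;

/* Index of the segment file that holds this block (0 is the first file). */
static inline uint32_t
sketch_segment_of(SketchBlockNumber blkno)
{
    return blkno / SKETCH_RELSEG_SIZE;
}

/* Byte offset of the block inside its segment file. */
static inline uint64_t
sketch_offset_in_segment(SketchBlockNumber blkno)
{
    return (uint64_t) (blkno % SKETCH_RELSEG_SIZE) * SKETCH_BLCKSZ;
}
```

Under these assumptions, page 12 from the ctid '(12, 33)' lives in the first segment file, while page 131072 would be the first page of the second one.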
/*
* ItemPointer:
*
* This is a pointer to an item within a disk page of a known file
* (for example, a cross-link from an index to its parent table).
* ip_blkid tells us which block, ip_posid tells us which entry in
* the linp (ItemIdData) array we want.
*
* Note: because there is an item pointer in each tuple header and index
* tuple header on disk, it's very important not to waste space with
* structure padding bytes. The struct is designed to be six bytes long
* (it contains three int16 fields) but a few compilers will pad it to
* eight bytes unless coerced. We apply appropriate persuasion where
* possible. If your compiler can't be made to play along, you'll waste
* lots of space.
*/
typedef struct ItemPointerData
{
    BlockIdData ip_blkid;
    OffsetNumber ip_posid;
}
/* compiler packing/alignment attributes omitted here */
ItemPointerData;
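The above is the source-level representation of a ctid, that is, ItemPointerData. OffsetNumber ip_posid holds the tuple number.
The field to pay attention to is BlockIdData ip_blkid. First, look at the source of BlockIdData: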
/*
* BlockId:
*
* this is a storage type for BlockNumber. in other words, this type
* is used for on-disk structures (e.g., in HeapTupleData) whereas
* BlockNumber is the type on which calculations are performed (e.g.,
* in access method code).
*
* there doesn't appear to be any reason to have separate types except
* for the fact that BlockIds can be SHORTALIGN'd (and therefore any
* structures that contains them, such as ItemPointerData, can also be
* SHORTALIGN'd). this is an important consideration for reducing the
* space requirements of the line pointer (ItemIdData) array on each
* page and the header of each heap or index tuple, so it doesn't seem
* wise to change this without good reason.
*/
typedef struct BlockIdData
{
    uint16 bi_hi;
    uint16 bi_lo;
} BlockIdData;
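As shown above, the page number is made up of two parts, bi_hi and bi_lo, and together they form one unsigned integer. For a long time I assumed that hi was the segment file number and lo was the page number within that segment. Recently, while going through the storage code, I noticed that the page number is a continuous value, so I looked again and found that the page number is simply typedef uint32 BlockNumber.
Here is the source related to BlockNumber: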
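The relationship between the two halves and the 32-bit block number can be sketched as follows. PostgreSQL has its own helpers for this in block.h (BlockIdSet and BlockIdGetBlockNumber); the `Sketch`-prefixed types and functions below are hypothetical stand-ins written for illustration, not copies of that code.

```c
#include <stdint.h>

typedef uint32_t SketchBlockNumber;

typedef struct SketchBlockId
{
    uint16_t bi_hi;   /* upper 16 bits of the block number */
    uint16_t bi_lo;   /* lower 16 bits of the block number */
} SketchBlockId;

/* Split a 32-bit block number into its two 16-bit halves. */
static inline void
sketch_block_id_set(SketchBlockId *id, SketchBlockNumber blkno)
{
    id->bi_hi = (uint16_t) (blkno >> 16);
    id->bi_lo = (uint16_t) (blkno & 0xFFFF);
}

/* Reassemble the 32-bit block number from the two halves. */
static inline SketchBlockNumber
sketch_block_id_get(const SketchBlockId *id)
{
    return ((SketchBlockNumber) id->bi_hi << 16) | id->bi_lo;
}
```

Neither half has anything to do with segment files: bi_hi only becomes non-zero once the block number reaches 65536.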
/*
* BlockNumber:
*
* each data file (heap or index) is divided into postgres disk blocks
* (which may be thought of as the unit of i/o -- a postgres buffer
* contains exactly one disk block). the blocks are numbered
* sequentially, 0 to 0xFFFFFFFE.
*
* InvalidBlockNumber is the same thing as P_NEW in bufmgr.h.
*
* the access methods, the buffer manager and the storage manager are
* more or less the only pieces of code that should be accessing disk
* blocks directly.
*/
typedef uint32 BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
#define MaxBlockNumber ((BlockNumber) 0xFFFFFFFE)
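So why does PostgreSQL use BlockIdData in place of BlockNumber? It is purely a way to save bytes and carries no other meaning.
ItemPointerData is part of the tuple header, so every tuple carries one. It is easy to see that the tuple number, ip_posid, never exceeds the range of an unsigned short (2 bytes)[1], so to limit data bloat it is enough to make ip_posid 2 bytes wide[2]. If ip_blkid were then declared as a single 4-byte integer, the compiler would align the struct to 4 bytes and pad ItemPointerData to 8 bytes. PostgreSQL therefore stores the BlockNumber as a BlockIdData made of two 2-byte halves, which brings ItemPointerData down to 6 bytes. A few compilers still pad it to 8 bytes, but most, including GCC and Clang, produce 6 bytes. This trick avoids wasted space. It does cost some extra computation to reassemble the block number, but that cost is far smaller than the I/O penalty the extra bloat would cause.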
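The padding effect can be checked with two toy layouts. This is a minimal sketch, not PostgreSQL code: both struct names and the helper are hypothetical, and the sizes it reports depend on the compiler and target (mainstream GCC and Clang targets such as x86-64 are expected to give 8 and 6 bytes).

```c
#include <stddef.h>
#include <stdint.h>

/* Layout A: block number kept as one 4-byte integer (4-byte alignment). */
typedef struct SketchPaddedPointer
{
    uint32_t blkno;
    uint16_t posid;
} SketchPaddedPointer;

/* Layout B: block number split into two 2-byte halves (2-byte alignment). */
typedef struct SketchCompactPointer
{
    uint16_t bi_hi;
    uint16_t bi_lo;
    uint16_t posid;
} SketchCompactPointer;

/* Bytes saved per pointer by choosing layout B on this compiler. */
static inline size_t
sketch_bytes_saved(void)
{
    return sizeof(SketchPaddedPointer) - sizeof(SketchCompactPointer);
}
```

Layout A needs trailing padding so that array elements keep the uint32 field aligned, while layout B only ever needs 2-byte alignment, which is the property the BlockId comment calls SHORTALIGN.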