您好,登錄后才能下訂單哦!
本節簡單介紹了PostgreSQL在執行插入過程中與存儲相關的函數RecordAndGetPageWithFreeSpace->fsm_set_and_search,該函數設置給定的FSM page的相應值,并返回合適的slot。
FSMAddress
內部的FSM處理過程以邏輯地址scheme的方式工作,樹的每一個層次都可以認為是一個獨立的地址文件.
/*
* The internal FSM routines work on a logical addressing scheme. Each
* level of the tree can be thought of as a separately addressable file.
* 內部的FSM處理過程工作在一個邏輯地址scheme上.
* 樹的每一個層次都可以認為是一個獨立的地址文件.
*/
typedef struct
{
//層次
int level; /* level */
//該層次內的頁編號
int logpageno; /* page number within the level */
} FSMAddress;
/* Address of the root page. */
//根頁地址
static const FSMAddress FSM_ROOT_ADDRESS = {FSM_ROOT_LEVEL, 0};
FSMPage
FSM page數據結構.詳細可參看src/backend/storage/freespace/README.
/*
* Structure of a FSM page. See src/backend/storage/freespace/README for
* details.
* FSM page數據結構.詳細可參看src/backend/storage/freespace/README.
*/
typedef struct
{
/*
* fsm_search_avail() tries to spread the load of multiple backends by
* returning different pages to different backends in a round-robin
* fashion. fp_next_slot points to the next slot to be returned (assuming
* there's enough space on it for the request). It's defined as an int,
* because it's updated without an exclusive lock. uint16 would be more
* appropriate, but int is more likely to be atomically
* fetchable/storable.
* fsm_search_avail()函數嘗試通過在一輪循環中返回不同的頁面到不同的后臺進程,
* 從而分散在后臺進程上分散負載.
* 該字段因為無需獨占鎖,因此定義為整型.
* unit16可能會更合適,但整型看起來更適合于原子提取和存儲.
*/
int fp_next_slot;
/*
* fp_nodes contains the binary tree, stored in array. The first
* NonLeafNodesPerPage elements are upper nodes, and the following
* LeafNodesPerPage elements are leaf nodes. Unused nodes are zero.
* fp_nodes以數組的形式存儲二叉樹.
* 第一個NonLeafNodesPerPage元素是上一層的節點,接下來的LeafNodesPerPage元素是葉子節點.
* 未使用的節點為0.
*/
uint8 fp_nodes[FLEXIBLE_ARRAY_MEMBER];
} FSMPageData;
typedef FSMPageData *FSMPage;
FSMLocalMap
對于小表,不需要創建FSM來存儲空間信息,使用本地的內存映射信息.
/* Either already tried, or beyond the end of the relation */
//已嘗試或者已在表的末尾之后
#define FSM_LOCAL_NOT_AVAIL 0x00
/* Available to try */
//可用于嘗試
#define FSM_LOCAL_AVAIL 0x01
/*
* For small relations, we don't create FSM to save space, instead we use
* local in-memory map of pages to try. To locate free space, we simply try
* pages directly without knowing ahead of time how much free space they have.
* 對于小表,不需要創建FSM來存儲空間信息,使用本地的內存映射信息.
* 為了定位空閑空間,我們不需要知道他們有多少空閑空間而是直接簡單的對page進行嘗試.
*
* Note that this map is used to the find the block with required free space
* for any given relation. We clear this map when we have found a block with
* enough free space, when we extend the relation, or on transaction abort.
* See src/backend/storage/freespace/README for further details.
* 注意這個map用于搜索給定表的請求空閑空間.
* 在找到有足夠空閑空間的block/擴展了relation/在事務回滾時,則清除這個map的信息.
* 詳細可查看src/backend/storage/freespace/README.
*/
typedef struct
{
BlockNumber nblocks;//塊數
uint8 map[HEAP_FSM_CREATION_THRESHOLD];//數組
} FSMLocalMap;
static FSMLocalMap fsm_local_map =
{
0,
{
FSM_LOCAL_NOT_AVAIL
}
};
#define FSM_LOCAL_MAP_EXISTS (fsm_local_map.nblocks > 0)
fsm_set_and_search設置給定的FSM page的相應值,并返回合適的slot.
主要處理邏輯如下:
1.初始化相關變量
2.設置相應頁面的FSM可用標記
3.如search_cat/minvalue(請求空間)不為0,則調用fsm_search_avail搜索slot
4.解鎖,返回
搜索樹的算法是逐漸擴展”search triangle”(暫且稱為搜索三角),也就是說,被當前節點所覆蓋的所有節點,確保我們從開始點向右搜索.在第一步,只有目標slot會被檢查.當我們從左邊子節點往上移動到父節點時,我們同時添加了父節點的右子樹到搜索三角中.當我們從右邊子節點往上移動到父節點時,我們同時刪除了當前搜索三角(這時候已經知道了沒有合適的page), 同時向右檢索下一個更大尺寸的三角.因此,我們不會從起點向左搜索,在每一步搜索三角的尺寸都會翻倍,確保只需要log2(N)步就可以搜索N個頁面.
例如,考慮下面這棵樹:
7
7 6
5 7 6 5
4 5 5 7 2 6 5 2
T
假定目標節點是字母T指示的節點,同時我們正在使用6或更大的數字搜索節點.檢索從T開始.在第一輪迭代,移到右邊,然后到父節點,到達最右邊的節點 5.在第二輪迭代,移到右邊,回卷,然后上溯,在第三個層次上到達節點 7這時候7滿足我們的搜索要求,因此我們沿著7這條路徑下降到底部.實際上,這是(考慮到回卷)起點右側第一個滿足條件的頁面.
/*
* Set value in given FSM page and slot.
* 設置給定的FSM page的相應值,并返回合適的slot.
*
* If minValue > 0, the updated page is also searched for a page with at
* least minValue of free space. If one is found, its slot number is
* returned, -1 otherwise.
* 如minValue大于0,更新后的page還將搜索空閑空間只是為最小值的page.
* 如果找到了,返回slot編號,否則返回-1.
*/
//search_slot = fsm_set_and_search(rel, addr, slot, old_cat, search_cat);
static int
fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
uint8 newValue, uint8 minValue)
{
Buffer buf;
Page page;
int newslot = -1;
//獲取FSM的buffer
buf = fsm_readbuf(rel, addr, true);
//鎖定
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
//獲取FSM的page
page = BufferGetPage(buf);
if (fsm_set_avail(page, slot, newValue))
MarkBufferDirtyHint(buf, false);
if (minValue != 0)
{
/* Search while we still hold the lock */
//搜索slot
newslot = fsm_search_avail(buf, minValue,
addr.level == FSM_BOTTOM_LEVEL,
true);
}
UnlockReleaseBuffer(buf);
return newslot;
}
/*
* Searches for a slot with category at least minvalue.
* Returns slot number, or -1 if none found.
* 搜索目錄至少為最小值的slot.
*
* The caller must hold at least a shared lock on the page, and this
* function can unlock and lock the page again in exclusive mode if it
* needs to be updated. exclusive_lock_held should be set to true if the
* caller is already holding an exclusive lock, to avoid extra work.
* 調用者必須持有page共享鎖,如page需要更新則該函數可能重新以獨占模式解鎖或鎖定page.
* 如調用者已持有獨占鎖exclusive_lock_held需設置為T,以避免額外的工作.
*
* If advancenext is false, fp_next_slot is set to point to the returned
* slot, and if it's true, to the slot after the returned slot.
* 如果advancenext是F,fp_next_slot被設置為指向返回的slot;
* 如為T,則指向返回的slot后的slot.
*/
//newslot = fsm_search_avail(buf, minValue, \
addr.level == FSM_BOTTOM_LEVEL, \
true);
int
fsm_search_avail(Buffer buf, uint8 minvalue, bool advancenext,
bool exclusive_lock_held)
{
Page page = BufferGetPage(buf);//獲取page
FSMPage fsmpage = (FSMPage) PageGetContents(page);//獲取FSMPage
int nodeno;
int target;
uint16 slot;
restart:
/*
* Check the root first, and exit quickly if there's no leaf with enough
* free space
* 首先檢查根節點,如葉子節點無足夠的空閑空間則快速退出.
*/
if (fsmpage->fp_nodes[0] < minvalue)
return -1;
/*
* Start search using fp_next_slot. It's just a hint, so check that it's
* sane. (This also handles wrapping around when the prior call returned
* the last slot on the page.)
* 使用fp_next_slot開始搜索.
* 這個標記只是一個提示而已,檢查它是否正常.
* (這同時處理了在上一次調用返回的頁面最后一個slot的回卷問題)
*/
//#define SlotsPerFSMPage LeafNodesPerPage
//#define LeafNodesPerPage (NodesPerPage - NonLeafNodesPerPage)
//#define NodesPerPage (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
offsetof(FSMPageData, fp_nodes))
//#define NonLeafNodesPerPage (BLCKSZ / 2 - 1) = 4095
target = fsmpage->fp_next_slot;
if (target < 0 || target >= LeafNodesPerPage)
target = 0;
target += NonLeafNodesPerPage;
/*----------
* Start the search from the target slot. At every step, move one
* node to the right, then climb up to the parent. Stop when we reach
* a node with enough free space (as we must, since the root has enough
* space).
* 從目標slot開始檢索.每一步,把節點移到右邊,然后上溯父節點.
* 在到達一個有足夠空閑空間的節點時停止(因為根節點有足夠的空間).
*
* The idea is to gradually expand our "search triangle", that is, all
* nodes covered by the current node, and to be sure we search to the
* right from the start point. At the first step, only the target slot
* is examined. When we move up from a left child to its parent, we are
* adding the right-hand subtree of that parent to the search triangle.
* When we move right then up from a right child, we are dropping the
* current search triangle (which we know doesn't contain any suitable
* page) and instead looking at the next-larger-size triangle to its
* right. So we never look left from our original start point, and at
* each step the size of the search triangle doubles, ensuring it takes
* only log2(N) work to search N pages.
* 算法思想是逐漸擴展"search triangle 搜索三角",也就是說,
* 被當前節點所覆蓋的所有節點,確保我們從開始點向右搜索.
* 在第一步,只有目標slot會被檢查.
* 當我們從左邊子節點往上移動到父節點時,我們同時添加了父節點的右子樹到搜索三角中.
* 當我們從右邊子節點往上移動到父節點時,我們同時刪除了當前搜索三角(這時候已經知道了沒有合適的page),
* 同時向右檢索下一個更大尺寸的三角.
* 因此,我們不會從起點向左搜索,在每一步搜索三角的尺寸都會翻倍,確保只需要log2(N)步就可以搜索N個頁面.
*
* The "move right" operation will wrap around if it hits the right edge
* of the tree, so the behavior is still good if we start near the right.
* Note also that the move-and-climb behavior ensures that we can't end
* up on one of the missing nodes at the right of the leaf level.
* "move right"操作會在觸碰到樹的右邊緣時回卷,因此在右邊鄰近開始搜索也是沒有問題的.
* 注意move-and-climb操作確保了我們不能在葉子層右邊的節點上結束.
*
* For example, consider this tree:
* 例如,考慮下面這棵樹:
*
* 7
* 7 6
* 5 7 6 5
* 4 5 5 7 2 6 5 2
* T
*
* Assume that the target node is the node indicated by the letter T,
* and we're searching for a node with value of 6 or higher. The search
* begins at T. At the first iteration, we move to the right, then to the
* parent, arriving at the rightmost 5. At the second iteration, we move
* to the right, wrapping around, then climb up, arriving at the 7 on the
* third level. 7 satisfies our search, so we descend down to the bottom,
* following the path of sevens. This is in fact the first suitable page
* to the right of (allowing for wraparound) our start point.
* 假定目標節點是字母T指示的節點,同時我們正在使用6或更大的數字搜索節點.檢索從T開始.
* 在第一輪迭代,移到右邊,然后到父節點,到達最右邊的節點 5.在第二輪迭代,移到右邊,回卷,
* 然后上溯,在第三個層次上到達節點 7這時候7滿足我們的搜索要求,
* 因此我們沿著7這條路徑下降到底部.
* 實際上,這是(考慮到回卷)起點右側第一個滿足條件的頁面.
*----------
*/
nodeno = target;
while (nodeno > 0)
{
if (fsmpage->fp_nodes[nodeno] >= minvalue)
break;
/*
* Move to the right, wrapping around on same level if necessary, then
* climb up.
* 移動到右邊,如需要在同一層次上回卷,然后上溯.
*/
nodeno = parentof(rightneighbor(nodeno));
}
/*
* We're now at a node with enough free space, somewhere in the middle of
* the tree. Descend to the bottom, following a path with enough free
* space, preferring to move left if there's a choice.
* 現在已到達了有足夠空閑空間的節點,樹中間的某個位置上面.
* 沿著有足夠空閑空間的路徑下降到底部,如有選擇,往左移動.
*/
while (nodeno < NonLeafNodesPerPage)
{
//左樹節點
int childnodeno = leftchild(nodeno);
if (childnodeno < NodesPerPage &&
fsmpage->fp_nodes[childnodeno] >= minvalue)
{
//如有機會,往左移動
nodeno = childnodeno;
continue;
}
//指向右樹節點
childnodeno++; /* point to right child */
if (childnodeno < NodesPerPage &&
fsmpage->fp_nodes[childnodeno] >= minvalue)
{
//相當于下降了一層
nodeno = childnodeno;
}
else
{
/*
* Oops. The parent node promised that either left or right child
* has enough space, but neither actually did. This can happen in
* case of a "torn page", IOW if we crashed earlier while writing
* the page to disk, and only part of the page made it to disk.
* 父節點承諾了不是左樹就是右樹有足夠的空間,但實際上并不是這樣.
* 這會發生在"torn page"的情況下,在寫入page到磁盤上時系統崩潰,
* page只有一部分寫入到磁盤中.
*
* Fix the corruption and restart.
* 修復損壞并重啟.
*/
RelFileNode rnode;
ForkNumber forknum;
BlockNumber blknum;
//獲取tag
BufferGetTag(buf, &rnode, &forknum, &blknum);
elog(DEBUG1, "fixing corrupt FSM block %u, relation %u/%u/%u",
blknum, rnode.spcNode, rnode.dbNode, rnode.relNode);
/* make sure we hold an exclusive lock */
//確保持有獨占鎖
if (!exclusive_lock_held)
{
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
exclusive_lock_held = true;
}
//重建FSM
fsm_rebuild_page(page);
MarkBufferDirtyHint(buf, false);
//重新搜索
goto restart;
}
}
/* We're now at the bottom level, at a node with enough space. */
//現在已到底層,在有足夠空間的節點上.
slot = nodeno - NonLeafNodesPerPage;
/*
* Update the next-target pointer. Note that we do this even if we're only
* holding a shared lock, on the grounds that it's better to use a shared
* lock and get a garbled next pointer every now and then, than take the
* concurrency hit of an exclusive lock.
* 更新下一個目標塊指針.
* 注意:就算我們只持有共享鎖,也可以做這個事情,
* 基于這樣的理由,最好是使用共享鎖,時不時的獲取一個打亂的下一個指針,
* 這樣比持有獨占鎖要好.
*
* Wrap-around is handled at the beginning of this function.
* 在該函數的開始,回卷已處理.
*/
fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0);
return slot;
}
測試腳本
15:54:13 (xdb@[local]:5432)testdb=# insert into t1 values (1,'1','1');
啟動gdb,設置斷點
(gdb) b fsm_set_and_search
Breakpoint 1 at 0x888850: file freespace.c, line 676.
(gdb) c
Continuing.
Breakpoint 1, fsm_set_and_search (rel=0x7fd0d2f10788, addr=..., slot=5, newValue=0 '\000', minValue=1 '\001')
at freespace.c:676
676 int newslot = -1;
(gdb)
輸入參數
(gdb) p *rel
$1 = {rd_node = {spcNode = 1663, dbNode = 16402, relNode = 50820}, rd_smgr = 0x1233b00, rd_refcnt = 1, rd_backend = -1,
rd_islocaltemp = false, rd_isnailed = false, rd_isvalid = true, rd_indexvalid = 1 '\001', rd_statvalid = false,
rd_createSubid = 0, rd_newRelfilenodeSubid = 0, rd_rel = 0x7fd0d2f109a0, rd_att = 0x7fd0d2f10ab8, rd_id = 50820,
rd_lockInfo = {lockRelId = {relId = 50820, dbId = 16402}}, rd_rules = 0x0, rd_rulescxt = 0x0, trigdesc = 0x0,
rd_rsdesc = 0x0, rd_fkeylist = 0x0, rd_fkeyvalid = false, rd_partkeycxt = 0x0, rd_partkey = 0x0, rd_pdcxt = 0x0,
rd_partdesc = 0x0, rd_partcheck = 0x0, rd_indexlist = 0x7fd0d2f0f820, rd_oidindex = 0, rd_pkindex = 0,
rd_replidindex = 0, rd_statlist = 0x0, rd_indexattr = 0x0, rd_projindexattr = 0x0, rd_keyattr = 0x0, rd_pkattr = 0x0,
rd_idattr = 0x0, rd_projidx = 0x0, rd_pubactions = 0x0, rd_options = 0x0, rd_index = 0x0, rd_indextuple = 0x0,
rd_amhandler = 0, rd_indexcxt = 0x0, rd_amroutine = 0x0, rd_opfamily = 0x0, rd_opcintype = 0x0, rd_support = 0x0,
rd_supportinfo = 0x0, rd_indoption = 0x0, rd_indexprs = 0x0, rd_indpred = 0x0, rd_exclops = 0x0, rd_exclprocs = 0x0,
rd_exclstrats = 0x0, rd_amcache = 0x0, rd_indcollation = 0x0, rd_fdwroutine = 0x0, rd_toastoid = 0,
pgstat_info = 0x12275f0}
(gdb)
(gdb) p addr
$2 = {level = 0, logpageno = 0}
獲取buffere,并鎖定
(gdb) n
678 buf = fsm_readbuf(rel, addr, true);
(gdb)
679 LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
(gdb) p buf
$3 = 105
(gdb)
獲取page
(gdb) p page
$4 = (Page) 0x7fd0bf41dc00 ""
(gdb) p *page
$5 = 0 '\000'
調用fsm_search_avail函數,進入fsm_search_avail
(gdb) n
684 MarkBufferDirtyHint(buf, false);
(gdb)
686 if (minValue != 0)
(gdb)
690 addr.level == FSM_BOTTOM_LEVEL,
(gdb) step
689 newslot = fsm_search_avail(buf, minValue,
(gdb)
fsm_search_avail (buf=105, minvalue=1 '\001', advancenext=true, exclusive_lock_held=true) at fsmpage.c:161
161 Page page = BufferGetPage(buf);
(gdb)
獲取FSMPage,fp_nodes數組,0表示未使用
(gdb) n
162 FSMPage fsmpage = (FSMPage) PageGetContents(page);
(gdb) p page
$6 = (Page) 0x7fd0bf41dc00 ""
(gdb) n
173 if (fsmpage->fp_nodes[0] < minvalue)
(gdb) p *fsmpage
$7 = {fp_next_slot = 6, fp_nodes = 0x7fd0bf41dc1c "{{"}
(gdb) p *fsmpage->fp_nodes
$8 = 123 '{'
(gdb) p fsmpage->fp_nodes[0]
$9 = 123 '{'
(gdb) p fsmpage->fp_nodes[1]
$10 = 123 '{'
(gdb) p fsmpage->fp_nodes[2]
$11 = 0 '\000'
(gdb) p fsmpage->fp_nodes[3]
$12 = 123 '{'
(gdb) p fsmpage->fp_nodes[4]
$13 = 0 '\000'
(gdb) p fsmpage->fp_nodes[5]
$14 = 0 '\000'
(gdb) p fsmpage->fp_nodes[6]
$15 = 0 '\000'
使用fp_next_slot開始搜索.
從目標slot開始檢索.每一步,把節點移到右邊,然后上溯父節點.
在到達一個有足夠空閑空間的節點時停止(因為根節點有足夠的空間).
(gdb) n
181 target = fsmpage->fp_next_slot;
(gdb)
182 if (target < 0 || target >= LeafNodesPerPage)
(gdb) p target
$16 = 6
(gdb) p LeafNodesPerPage
No symbol "__builtin_offsetof" in current context.
(gdb) n
184 target += NonLeafNodesPerPage;
(gdb)
227 nodeno = target;
(gdb) p target
$17 = 4101
(gdb)
循環,直至找到滿足條件的節點.
方法是移動到右邊,如需要在同一層次上回卷,然后上溯.
(gdb) p fsmpage->fp_nodes[nodeno]
$19 = 0 '\000'
(gdb) p minvalue
$20 = 1 '\001'
(gdb) n
237 nodeno = parentof(rightneighbor(nodeno));
(gdb) n
228 while (nodeno > 0)
(gdb) p nodeno
$21 = 2050
(gdb) n
230 if (fsmpage->fp_nodes[nodeno] >= minvalue)
(gdb) fsmpage->fp_nodes[nodeno]
Undefined command: "fsmpage->fp_nodes". Try "help".
(gdb) p fsmpage->fp_nodes[nodeno]
$22 = 0 '\000'
(gdb) n
237 nodeno = parentof(rightneighbor(nodeno));
(gdb)
228 while (nodeno > 0)
(gdb)
230 if (fsmpage->fp_nodes[nodeno] >= minvalue)
(gdb) p nodeno
$23 = 1025
(gdb) n
231 break;
(gdb) p fsmpage->fp_nodes[nodeno]
$24 = 1 '\001'
(gdb)
現在已到達了有足夠空閑空間的節點,樹中間的某個位置上面.
沿著有足夠空閑空間的路徑下降到底部.
如有選擇,往左移動.本例,往左移動了.
(gdb) n
245 while (nodeno < NonLeafNodesPerPage)
(gdb) p nodeno
$25 = 1025
(gdb) n
247 int childnodeno = leftchild(nodeno);
(gdb)
249 if (childnodeno < NodesPerPage &&
(gdb) p childnodeno
$26 = 2051
(gdb) n
250 fsmpage->fp_nodes[childnodeno] >= minvalue)
(gdb) p fsmpage->fp_nodes[childnodeno]
$27 = 1 '\001'
(gdb) n
249 if (childnodeno < NodesPerPage &&
(gdb)
252 nodeno = childnodeno;
(gdb)
253 continue;
(gdb)
找到了相應的葉子節點
(gdb)
245 while (nodeno < NonLeafNodesPerPage)
(gdb) n
247 int childnodeno = leftchild(nodeno);
(gdb)
249 if (childnodeno < NodesPerPage &&
(gdb)
250 fsmpage->fp_nodes[childnodeno] >= minvalue)
(gdb)
249 if (childnodeno < NodesPerPage &&
(gdb)
255 childnodeno++; /* point to right child */
(gdb)
256 if (childnodeno < NodesPerPage &&
(gdb) p childnodeno
$28 = 4104
(gdb) n
257 fsmpage->fp_nodes[childnodeno] >= minvalue)
(gdb)
256 if (childnodeno < NodesPerPage &&
(gdb)
259 nodeno = childnodeno;
(gdb) p childnodeno
$29 = 4104
(gdb) n
245 while (nodeno < NonLeafNodesPerPage)
(gdb)
現在已到底層,在有足夠空間的節點上.
同時,更新下一個目標塊指針.
(gdb)
293 slot = nodeno - NonLeafNodesPerPage;
(gdb) n
303 fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0);
(gdb) p slot
$30 = 9
(gdb) p nodeno
$31 = 4104
(gdb) n
305 return slot;
(gdb)
306 }
(gdb)
回到fsm_set_and_search,返回slot
(gdb)
fsm_set_and_search (rel=0x7fd0d2f10788, addr=..., slot=5, newValue=0 '\000', minValue=1 '\001') at freespace.c:694
694 UnlockReleaseBuffer(buf);
(gdb)
696 return newslot;
(gdb)
(gdb) p newslot
$32 = 9
(gdb)
DONE!
PG Source Code
Database Cluster, Databases, and Tables
免責聲明:本站發布的內容(圖片、視頻和文字)以原創、轉載和分享為主,文章觀點不代表本網站立場,如果涉及侵權請聯系站長郵箱:is@yisu.com進行舉報,并提供相關證據,一經查實,將立刻刪除涉嫌侵權內容。