您好,登錄后才能下訂單哦!
這篇文章主要講解了“數據庫中如何搜索時空行為數據”,文中的講解內容簡單清晰,易于學習與理解,下面請大家跟著小編的思路慢慢深入,一起來研究和學習“數據庫中如何搜索時空行為數據”吧!
數據結構
時空行為數據包含三個屬性:時間、空間和對象。
非結構化索引:
create table test( id int8, crt_time timestamp, -- Time pos geometry, -- Location obj jsonb -- Object description );
除了應用于JSON,結構化數據還可以用于對象描述。例如:
create table test( id int8, crt_time timestamp, -- Time pos geometry, -- Location c1 int, -- Some property examples c2 int, c3 text, c4 float8, c5 int, c6 date, c7 text, c8 int, c9 int, c10 int );
時空行為數據的SQL查詢實例
select * from test where pos <-> ? < ? and crt_time between ? and ? and ( (c1 = ? and c2 between ? and ?) or c10=?) ... ;
優化方法
考慮運用以下知識:
時間序列BRIN索引
crt_time字段是一個時間序列字段,表示生成數據的時間。在PostgreSQL堆存儲中,存儲和該字段的值具有很強的線性相關性。
因此,BRIN索引很合適。
使用BRIN索引來代替分區表進行TPC-H測試。大范圍搜索的性能甚至優于使用分區表時的功能。
create index idx_test_1 on test using brin(crt_time);
空間索引
顯然,空間檢索需要空間索引。PostgreSQL中可以使用三種方法實現空間檢索。
1. 幾何類型的GIST索引
create index idx_test_2 on test using gist(pos);
該索引支持空間KNN搜索和空間位置確定等功能。
2. 幾何類型的主索引
create index idx_test_2 on test using spgist(pos);
該索引支持空間KNN搜索和空間位置確定等功能。
3. Geohash和B-tree索引(將經度和緯度轉換為Geohash并為hash值創建B-tree索引)。只需使用表達式索引。
create index idx_test_3 on test using btree( ST_GeoHash(pos,15) );
此索引支持前綴搜索(其能落實編碼地理信息網格中包含的關系)。它屬于有損索引,需要二次過濾。
GiST和SPGiST空間索引能夠找到準確的地理位置信息,優于GEOHASH索引。但是,查詢信息時需要特別注意。
GIN 索引
此索引類型的目標是對象屬性字段JSONB或多個結構化對象屬性字段。只需使用GIN索引。
例如:
create extension btree_gin;
非結構化索引:
create index idx_test_4 on test using gin( obj );
結構化索引:
create index idx_test_4 on test using gin( c1,c2,c3,c4,c5,c6,c7,c8,c9 );
BitmapAnd和BitmapOr
但是,可以同時使用這些索引嗎? PostgreSQL為多個索引提供bitmapAnd及bitmapOr接口。它們可以組合多個索引,減少需要掃描的數據庫數量。
Heap, one square = one page: +---------------------------------------------+ |c____u_____X___u___X_________u___cXcc______u_| +---------------------------------------------+ Rows marked c match customers pkey condition. Rows marked u match username condition. Rows marked X match both conditions. Bitmap scan from customers_pkey: +---------------------------------------------+ |100000000001000000010000000000000111100000000| bitmap 1 +---------------------------------------------+ One bit per heap page, in the same order as the heap Bits 1 when condition matches, 0 if not Bitmap scan from ix_cust_username: +---------------------------------------------+ |000001000001000100010000000001000010000000010| bitmap 2 +---------------------------------------------+ Once the bitmaps are created a bitwise AND is performed on them: +---------------------------------------------+ |100000000001000000010000000000000111100000000| bitmap 1 |000001000001000100010000000001000010000000010| bitmap 2 &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& |000000000001000000010000000000000010000000000| Combined bitmap +-----------+-------+--------------+----------+ | | | v v v Used to scan the heap only for matching pages: +---------------------------------------------+ |___________X_______X______________X__________| +---------------------------------------------+ The bitmap heap scan then seeks to the start of each page and reads the page: +---------------------------------------------+ |___________X_______X______________X__________| +---------------------------------------------+ seek------->^seek-->^seek--------->^ | | | ------------------------ only these pages read
例如:
select * from test where c1 ... and crt_time between ? and ? and test->> c1 in (?, ? ...);
根據統計數據自動使用適當的索引。如果需要,bitmapAnd和bitmapOr將在多個索引上自動執行合并掃描。跳過不需要掃描的頁面,重新檢查命中的頁面。
堆表存儲分級和分區
存儲可以分為一級分區或多級分區:
1. 單一分區
例如,按時間劃分。
create table test( id int8, crt_time timestamp, -- Time pos geometry, -- Location obj jsonb -- Object description ) PARTITION BY range (crt_time) ; create table test_201701 PARTITION OF test for values FROM ( 2017-01-01 ) TO ( 2017-02-01 ); ......
2. 多層分區
例如,先按時間分區,然后按Geohash劃分。
create table test_201701 PARTITION OF test for values FROM ( 2017-01-01 ) TO ( 2017-02-01 ) partition by range(st_geohash(pos,15)); ... create table test_201701_prefix1 PARTITION OF test for values FROM ( xxxx1 ) TO ( xxxx2 ); -- Generate BOX (GRID) on a map, find corresponding boundaries and use -- boundaries as partitioning conditions
使用分區時,如果查詢條件包括分區鍵(如時間和空間范圍),相應的分區將自動定位,這即為需要掃描的數據量。
創建面向對象屬性的GIN索引,以實現高效查詢。
索引分級與分區
與數據一樣,索引在不使用分區表的情況下也支持分區邏輯。
空間索引+時間分區
create index idx_20170101 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 ; ... create index idx_20170102 on tbl using gist (pos) where crt_time between 2017-01-02 and 2017-01-03 ; ...
通過使用前述分區索引,可以在輸入時間范圍后快速定位目標數據,執行空間搜索。
select * from tbl where crt_time between 2017-01-01 and 2017-01-02 -- Time and (pos <-> ?) < ? -- Distance to a point to be searched for and ? -- Other conditions order by pos <-> ? -- Sort by distance limit ?; -- Number of results to be returned
可以使用更多的索引分區,比如用作搜索條件和商店類型的維度(對象屬性)(假設它是可枚舉的或在范圍相對較小的情況下)。
create index idx_20170101_mod0 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 and dtype=0; ... create index idx_20170101_mod1 on tbl using gist (pos) where crt_time between 2017-01-01 and 2017-01-02 and dtype=1; ...
通過使用前面的分區索引,在輸入時間范圍或特定條件以執行空間搜索后,可以快速定位目標數據。
select * from tbl where crt_time between 2017-01-01 and 2017-01-02 -- Time and (pos <-> ?) < ? -- Distance to a point to be searched for and dtype=0 -- Object condition and ? -- Other conditions order by pos <-> ? -- Sort by distance limit ?; -- Number of results to be returned
請注意,前面的SQL查詢可以實現最佳性能優化。
索引組織形式(或索引結構)可以由邏輯分區重新構造,可以用上述類似的索引創建方法覆蓋所有條件。
CTID相交陣列連接掃描
如前所述,BitmapAnd和BitmapOr合并掃描是在多個索引或GIN索引中自動執行的。事實上,這種掃描也可以在SQL中顯式執行。
每個條件滲透對應的CTID。
使用Intersect或Union生成滿足總體需求的CTID。(Intersect對應于“and”條件;union對應于“or”條件。)
生成一個ctid數組。
示例
1. 創建對象提要數據表
postgres=# create table tbl (id int, info text, crt_time timestamp, pos point, c1 int , c2 int, c3 int ); CREATE TABLE
2. 將5000萬條測試數據寫入表中
postgres=# insert into tbl select generate_series(1,50000000), md5(random()::text), clock_timestamp(), point(180-random()*180, 90-random()*90), random()*10000, random()*5000, random()*1000; INSERT 0 50000000
3. 創建對象索引
postgres=# create index idx_tbl_1 on tbl using gin (info, c1, c2, c3); CREATE INDEX
4. 創建時間索引
postgres=# create index idx_tbl_2 on tbl using btree (crt_time); CREATE INDEX
5. 創建空間索引
postgres=# create index idx_tbl_3 on tbl using gist (pos); CREATE INDEX
6. 生成數據布局以方便后續查詢
postgres=# select min(crt_time),max(crt_time),count(*) from tbl; min | max | count ----------------------------+----------------------------+---------- 2017-07-22 17:59:34.136497 | 2017-07-22 18:01:27.233688 | 50000000 (1 row)
7. 創建一個極限KNN查詢函數
create or replace function ff(point, float8, int) returns setof tid as $ declare v_rec record; v_limit int := $3; begin set local enable_seqscan=off; -- Force index that exits when scanned rows reach a specific number for v_rec in select *, (pos <-> $1) as dist, ctid from tbl order by pos <-> $1 loop if v_limit <=0 then -- raise notice "Sufficient data obtained" return; end if; if v_rec.dist > $2 then -- raise notice "All matching points returned" return; else return next v_rec.ctid; end if; v_limit := v_limit -1; end loop; end; $ language plpgsql strict volatile; postgres=# select * from ff(point (100,100) ,100,100) ; ff ------------- (407383,11) (640740,9) (26073,51) (642750,34) ... (100 rows) Time: 1.061 ms
8. CTID合并檢索
顯示符合以下條件的記錄
( c1 in (1,2,3,4,100,200,99,88,77,66,55) or c2 < 10 ) and pos <-> point (0,0) < 5 and crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 ;
首先,分別查看每個條件,找匹配一個條件的記錄數量,以及在索引掃描上所花時長。
1. 54,907條記錄
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1 in (1,2,3,4,100,200,99,88,77,66,55); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on postgres.tbl (cost=820.07..65393.94 rows=54151 width=73) (actual time=23.842..91.911 rows=54907 loops=1) Output: id, info, crt_time, pos, c1, c2, c3 Recheck Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) Heap Blocks: exact=52778 Buffers: shared hit=52866 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=14.264..14.264 rows=54907 loops=1) Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) Buffers: shared hit=88 Planning time: 0.105 ms Execution time: 94.606 ms (10 rows)
2. 95,147條記錄
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c2<10; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on postgres.tbl (cost=835.73..112379.10 rows=99785 width=73) (actual time=69.243..179.388 rows=95147 loops=1) Output: id, info, crt_time, pos, c1, c2, c3 Recheck Cond: (tbl.c2 < 10) Heap Blocks: exact=88681 Buffers: shared hit=88734 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=53.612..53.612 rows=95147 loops=1) Index Cond: (tbl.c2 < 10) Buffers: shared hit=53 Planning time: 0.094 ms Execution time: 186.201 ms (10 rows)
3. 149930條記錄(為快速獲得結果,PostgreSQL使用位圖進行合并掃描)
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where c1 in (1,2,3,4,100,200,99,88,77,66,55) or c2 <10; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------ Bitmap Heap Scan on postgres.tbl (cost=1694.23..166303.58 rows=153828 width=73) (actual time=98.988..266.852 rows=149930 loops=1) Output: id, info, crt_time, pos, c1, c2, c3 Recheck Cond: ((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10)) Heap Blocks: exact=134424 Buffers: shared hit=134565 -> BitmapOr (cost=1694.23..1694.23 rows=153936 width=0) (actual time=73.763..73.763 rows=0 loops=1) Buffers: shared hit=141 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=16.733..16.733 rows=54907 loops=1) Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) Buffers: shared hit=88 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=57.029..57.029 rows=95147 loops=1) Index Cond: (tbl.c2 < 10) Buffers: shared hit=53 Planning time: 0.149 ms Execution time: 274.548 ms (15 rows)
4. 60,687條記錄(即使運用出色的KNN性能優化,仍然需要耗費195毫秒)。
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ff(point (0,0) ,5,1000000); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Function Scan on postgres.ff (cost=0.25..10.25 rows=1000 width=6) (actual time=188.563..192.114 rows=60687 loops=1) Output: ff Function Call: ff( (0,0) ::point, 5 ::double precision, 1000000) Buffers: shared hit=61296 Planning time: 0.029 ms Execution time: 195.097 ms (6 rows)
讓我們看看不使用KNN優化需要多長時間。
結果非常令人驚訝——極限優化性能提高了一個數量級。
5. 2,640,751條記錄
使用所有索引逐個掃描數據條件,得到ctid并執行ctid掃描。
現在,讓我們來分解這個過程:
首先,讓我們看看時間和對象屬性的合并查詢,成果非常驚人。使用位圖BitmapOr時,查詢可以跳過大多數數據塊,并且掃描時間比單索引掃描要短。
注意,在此步驟中記錄的數量減少到7,847條。
postgres=# explain (analyze,verbose,timing,costs,buffers) select ctid from tbl where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 and ( c1 in (1,2,3,4,100,200,99,88,77,66,55) or c2 < 10 ); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on postgres.tbl (cost=35025.85..44822.94 rows=7576 width=6) (actual time=205.577..214.821 rows=7847 loops=1) Output: ctid Recheck Cond: (((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10)) AND (tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) Heap Blocks: exact=6983 Buffers: shared hit=14343 -> BitmapAnd (cost=35025.85..35025.85 rows=7581 width=0) (actual time=204.048..204.048 rows=0 loops=1) Buffers: shared hit=7360 -> BitmapOr (cost=1621.11..1621.11 rows=153936 width=0) (actual time=70.279..70.279 rows=0 loops=1) Buffers: shared hit=141 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=15.860..15.860 rows=54907 loops=1) Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) Buffers: shared hit=88 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=54.418..54.418 rows=95147 loops=1) Index Cond: (tbl.c2 < 10) Buffers: shared hit=53 -> Bitmap Index Scan on idx_tbl_2 (cost=0.00..33402.60 rows=2462443 width=0) (actual time=127.101..127.101 rows=2640751 loops=1) Index Cond: ((tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) Buffers: shared hit=7219 Planning time: 0.203 ms Execution time: 216.697 ms (20 rows)
然后,看KNN的掃描時間:
注意,60,687條記錄滿足KNN距離條件,所以接下來將解釋CTID合并掃描與原始掃描之間的性能比較。
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from ff(point (0,0) ,5,1000000); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Function Scan on postgres.ff (cost=0.25..10.25 rows=1000 width=6) (actual time=188.563..192.114 rows=60687 loops=1) Output: ff Function Call: ff( (0,0) ::point, 5 ::double precision, 1000000) Buffers: shared hit=61296 Planning time: 0.029 ms Execution time: 195.097 ms (6 rows)
最后,將這些片段合并到ctid中。
select * from ff(point (0,0) ,5,1000000) intersect select ctid from tbl where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 and ( c1 in (1,2,3,4,100,200,99,88,77,66,55) or c2 < 10 ); ff ------------ (1394,8) (3892,50) (6124,45) (7235,8) (7607,45) (11540,8) (13397,31) (14266,36) (18149,7) (19256,44) (24671,62) (26525,64) (30235,48) (13 rows) Time: 463.012 ms
取得最終紀錄。
select * from tbl where ctid = any ( array( -- array start select * from ff(point (0,0) ,5,1000000) intersect select ctid from tbl where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 and ( c1 in (1,2,3,4,100,200,99,88,77,66,55) or c2 < 10 ) ) -- array end ); id | info | crt_time | pos | c1 | c2 | c3 ---------+----------------------------------+----------------------------+----------------------------------------+------+------+----- 104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543 291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360 459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8 542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245 570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561 865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875 1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208 1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355 1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232 1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820 1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321 1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74 2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917 (13 rows) Time: 462.715 ms
過程花費462毫秒。
9. 測試原始SQL查詢的性能: PostgreSQL Multi-Index BitmapAnd and BitmapOr跳過掃描
直接編寫SQL查詢,而不是使用多CTID掃描。
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from tbl where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 and ( c1 in (1,2,3,4,100,200,99,88,77,66,55) or c2 < 10 ) and pos <-> point (0,0) < 5; Bitmap Heap Scan on postgres.tbl (cost=35022.06..44857.06 rows=2525 width=73) (actual time=205.542..214.547 rows=13 loops=1) Output: id, info, crt_time, pos, c1, c2, c3 Recheck Cond: (((tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) OR (tbl.c2 < 10)) AND (tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) Filter: ((tbl.pos <-> (0,0) ::point) < 5 ::double precision) Rows Removed by Filter: 7834 Heap Blocks: exact=6983 Buffers: shared hit=14343 -> BitmapAnd (cost=35022.06..35022.06 rows=7581 width=0) (actual time=203.620..203.620 rows=0 loops=1) Buffers: shared hit=7360 -> BitmapOr (cost=1618.58..1618.58 rows=153936 width=0) (actual time=71.660..71.660 rows=0 loops=1) Buffers: shared hit=141 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..806.54 rows=54151 width=0) (actual time=14.861..14.861 rows=54907 loops=1) Index Cond: (tbl.c1 = ANY ( {1,2,3,4,100,200,99,88,77,66,55} ::integer[])) Buffers: shared hit=88 -> Bitmap Index Scan on idx_tbl_1 (cost=0.00..810.79 rows=99785 width=0) (actual time=56.797..56.797 rows=95147 loops=1) Index Cond: (tbl.c2 < 10) Buffers: shared hit=53 -> Bitmap Index Scan on idx_tbl_2 (cost=0.00..33402.60 rows=2462443 width=0) (actual time=125.255..125.255 rows=2640751 loops=1) Index Cond: ((tbl.crt_time >= 2017-07-22 17:59:34 ::timestamp without time zone) AND (tbl.crt_time <= 2017-07-22 17:59:40 ::timestamp without time zone)) Buffers: shared hit=7219 Planning time: 0.160 ms Execution time: 216.797 ms (22 rows)
性能如預期的那樣好,之前解釋過原因。KNN條件以外的條件已經將結果收斂到7,000條記錄,因此沒有必要使用包含KNN條件的索引。(即使使用KNN索引也需要195毫秒,因為有60,687條記錄滿足KNN條件。)
校驗結果:
select * from tbl where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 and ( c1 in (1,2,3,4,100,200,99,88,77,66,55) or c2 < 10 ) and pos <-> point (0,0) < 5; id | info | crt_time | pos | c1 | c2 | c3 ---------+----------------------------------+----------------------------+----------------------------------------+------+------+----- 104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543 291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360 459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8 542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245 570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561 865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875 1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208 1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355 1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232 1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820 1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321 1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74 2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917 (13 rows)
分區索引示例
假設前面的查詢條件保持不變,使用分區索引來測試性能。
這是為了演示分區索引的極端效果。在實際場景中,集合級別可能沒有那么高(例如按天集合或按ID散列集合)。只要集合是可能的,就可以展現出色的性能。
postgres=# create index idx_tbl_4 on tbl using gist (pos) where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 and ( c1 in (1,2,3,4,100,200,99,88,77,66,55) or c2 < 10 ) ; CREATE INDEX Time: 8359.330 ms (00:08.359)
重構極值KNN優化函數
create or replace function ff(point, float8, int) returns setof record as $ declare v_rec record; v_limit int := $3; begin set local enable_seqscan=off; -- Force index that exits when scanned rows reach a specific number for v_rec in select *, (pos <-> $1) as dist from tbl where crt_time between 2017-07-22 17:59:34 and 2017-07-22 17:59:40 and ( c1 in (1,2,3,4,100,200,99,88,77,66,55) or c2 < 10 ) order by pos <-> $1 loop if v_limit <=0 then -- raise notice "Sufficient data obtained" return; end if; if v_rec.dist > $2 then -- raise notice "All matching points returned" return; else return next v_rec; end if; v_limit := v_limit -1; end loop; end; $ language plpgsql strict volatile;
查詢性能:
postgres=# select * from ff(point (0,0) , 5, 10000000) as t(id int, info text, crt_time timestamp, pos point, c1 int, c2 int, c3 int, dist float8); id | info | crt_time | pos | c1 | c2 | c3 | dist ---------+----------------------------------+----------------------------+----------------------------------------+------+------+-----+------------------- 1850387 | 6e0011c6db76075edd2aa7f81ec94129 | 2017-07-22 17:59:38.243091 | (0.0168232340365648,0.420973123982549) | 100 | 4395 | 321 | 0.421309141034319 1989439 | 6211907ac254a4a3ca54f90822a2095e | 2017-07-22 17:59:38.551637 | (0.0274275150150061,0.490507003851235) | 1850 | 5 | 74 | 0.49127323294376 1444244 | 41bf6f8e4b89458c13fb408a7db05284 | 2017-07-22 17:59:37.339594 | (0.52773853763938,2.16670122463256) | 1 | 2470 | 820 | 2.23004532710301 542633 | c422d6137f9111d5c2dc723b40c7023f | 2017-07-22 17:59:35.334278 | (0.0631888210773468,2.2334903664887) | 4968 | 3 | 245 | 2.23438404136508 291950 | 1c2901689ab1eb7653d8ad972f7aa376 | 2017-07-22 17:59:34.776808 | (2.5384977646172,1.09820357523859) | 3 | 2131 | 360 | 2.76586731309247 1361182 | 7c4c1c208c2b2b21f00772c43955d238 | 2017-07-22 17:59:37.155127 | (1.7334086727351,2.18367457855493) | 9742 | 0 | 232 | 2.78803520274409 2267673 | 898fdd54dcc5b14c27cf1c8b9afe2471 | 2017-07-22 17:59:39.170035 | (0.394239127635956,2.86229319870472) | 2892 | 6 | 917 | 2.88931598221975 459345 | 9e46548f29d914019ce53a589be8ebac | 2017-07-22 17:59:35.148699 | (0.715781506150961,3.1486327573657) | 1 | 1276 | 8 | 3.22896754478952 570570 | fc57bfc6b7781d89b17c90417bd306f7 | 2017-07-22 17:59:35.39653 | (3.14926156774163,1.04107855819166) | 88 | 2560 | 561 | 3.31688000783581 1004806 | afe9f88cbebf615a7ae5f41180c4b33f | 2017-07-22 17:59:36.362027 | (1.13972157239914,3.28763140831143) | 3 | 1639 | 208 | 3.47958123047986 865508 | 34509c7f7640afaf288a5e1d38199701 | 2017-07-22 17:59:36.052573 | (3.12869547866285,2.34822122845799) | 2 | 65 | 875 | 3.91188935630676 104558 | c4699c933d4e2d2a10d828c4ff0b3362 | 2017-07-22 17:59:34.362508 | (4.20534582808614,2.43749532848597) | 99 | 4858 | 543 | 4.86069100130757 1069986 | 6b9f27bfde993fb0bae3336ac010af7a | 2017-07-22 17:59:36.507775 | (4.51995821669698,2.08761331625283) | 2 | 200 | 355 | 4.97877009299311 (13 rows) Time: 0.592 ms
太棒了!查詢時間從200毫秒減少到1毫秒以內。
優化方法綜述
優化方法回顧:
1. 為不同的數據類型構建不同的索引。
例如,對空間使用GiST或SP-GiST索引,對時間使用B樹或BRIN索引,對多個對象屬性使用GIN索引。索引的目的是縮小數據掃描的范圍。
2. 方法五提到數據分區。
數據分區的目的是有意地組織數據,這意味著有意地組織數據以滿足搜索需求。例如,如果時間是必需的查詢條件或公共查詢條件,那么可以按時間(分區)分割數據,以減少需要掃描的數據量。
3. 方法六描述了索引分區。
目的類似于方法五。方法五和方法六的區別在于分區在索引級別使用,因此當執行索引掃描時,數據命中率會直接提高。
4.方法七中的ctid合并掃描類似于PostgreSQL中的多索引bitmapAnd或bitmapOr掃描。
bitmapAnd/bitmapOr跳過不需要掃描的塊,方法七中的ctid合并掃描跳過不需要掃描的行。
合并從多個索引掃描獲得的ctid。跳過不需要掃描的行數。
如果當其他條件為“AND”時,過濾條件可以顯著減少ctid(記錄),則沒有必要使用ctid合并掃描。相反,使用FILTER作為另一個條件。(這將略微增加CPU開銷。)
5. 最好的功夫總是以最大的靈活性、自由和對每一個動作的無限想象為特征。
PostgreSQL實現多索引BitmapAnd或BitmapOr掃描,顯著提高了多種條件(索引)下的數據命中率。
此外,PostgreSQL具有出色的CBO估計機制,它允許PostgreSQL不總是使用位圖合并掃描的所有索引。這也是為什么在“測試原始SQL查詢的性能——PostgreSQL多索引BitmapAnd位圖或跳過掃描”一節中描述的性能更好。
6. 如何實現極端優化
采用方法五或六,并使用可修復的條件作為分區鍵來分區數據或索引。
對于其他條件,可以使用PostgreSQL中的多索引BitmapAnd或BitmapOr掃描來提高多條件(索引)的數據命中率。
我們可以看到,按照時間、空間和對象屬性從5,000萬數據塊中進行多維檢索所需的時間減少到了0.592毫秒。
7. 對于空間數據,除了使用GiST索引,我們還可以使用BRIN索引,這降低了成本。有條理地組織數據后,會使濾波性能良好。
感謝各位的閱讀,以上就是“數據庫中如何搜索時空行為數據”的內容了,經過本文的學習后,相信大家對數據庫中如何搜索時空行為數據這一問題有了更深刻的體會,具體使用情況還需要大家實踐驗證。這里是億速云,小編將為大家推送更多相關知識點的文章,歡迎關注!
免責聲明:本站發布的內容(圖片、視頻和文字)以原創、轉載和分享為主,文章觀點不代表本網站立場,如果涉及侵權請聯系站長郵箱:is@yisu.com進行舉報,并提供相關證據,一經查實,將立刻刪除涉嫌侵權內容。