Phoenix(sql on hbase

發布時間：2021-11-23 15:13:42 來源：億速云閱讀：144 作者：柒染欄目：數據庫

Phoenix(sql on hbase)簡介

介紹：

Phoenix is a SQL skin over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular
JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance
on the order of milliseconds for small queries, or seconds for tens of millions of rows.

部署：

1：wget http://phoenix-bin.github.com/client/phoenix-2.2.1-install.tar，將jar包拷貝至HBASE_HOME/lib即可 2：執行psql.sh localhost ../examples/web_stat.sql ../examples/web_stat.csv ../examples/web_stat_queries.sql，加載示例數據 3：sqlline.sh localhost（zookeeper地址）進入命令行客戶端

二級索引相關（索引的使用需要調用Phoenix API）：

二級索引（多列時）使用需要在hbase-site.xml中加入如下配置

<property> <name>hbase.regionserver.wal.codec</name> <value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value> </property>

創建索引例子：

create table usertable (id varchar primary key, firstname varchar, lastname varchar); create index idx_name on usertable (firstname) include (lastname); 可通過如下方法查看當前SQL語句索引是否生效： explain select id, firstname, lastname from usertable where firstname = 'foo';

explain的相關說明：

RANGE SCAN means that only a subset of the rows in your table will be scanned over. This occurs if you use one or more leading columns from your primary key constraint. Query that is not filtering on leading PK columns ex. select * from test where pk2='x'
and pk3='y'; will result in full scan whereas the following query will result in range scan select * from test where pk1='x' and pk2='y';. Note that you can add a secondary index on your "pk2" and "pk3" columns and that would cause a range scan to be done
for the first query (over the index table). DEGENERATE SCAN means that a query can't possibly return any rows. If we can determine that at compile time, then we don't bother to even run the scan. FULL SCAN means that all rows of the table will be scanned over (potentially with a filter applied if you have a WHERE clause) SKIP SCAN means that either a subset or all rows in your table will be scanned over, however it will skip large groups of rows depending on the conditions in your filter. See this blog for more detail. We don't do a SKIP SCAN if you have no filter on the
leading primary key columns, but you can force a SKIP SCAN by using the /*+ SKIP_SCAN */ hint. Under some conditions, namely when the cardinality of your leading primary key columns is low, it will be more efficient than a FULL SCAN.

索引使用介紹：

主鍵索引：主鍵索引要按創建時的順序引用。如primary key（id,name,add），那么會隱式的創建(id),(id,name),(id,name,add)三個索引，如果在where中用這三個條件會用到索引，其他組合則無法使用索引（FULL SCAN）。
二級索引：除了要按創建時的順序引用外，如果查詢的列不全在索引或者覆蓋索引中則無法使用索引。舉例： DDL：create table usertable (id varchar primary key, firstname varchar, lastname varchar); create index idx_name on usertable (firstname); DML：select id, firstname, lastname from usertable where firstname = 'foo'; 此查詢不會使用到索引，因為lastname不再索引中。執行DDL：create idx_name on usertable (firstname) include (lastname)后該查詢語句才能使用索引。遺留問題：include和on在Phoenix中具體有什么區別？
查詢條件中主鍵索引+二級索引同時存在的話，Phoenix會自己選擇最優索引。

Phoenix的SQL表結構與Hbase結構的映射實驗

結論：
1：Phoenix會把“CONSTRAINT PK PRIMARY KEY (id,firstname)”這樣定義的列拼起來加入到Hbase主鍵中（用\x00進行分割），同時將聯合主鍵涉及到的列合并默認名為"_0"的一列。其值為空。其他列放入Hbase的同名列中存儲
2：Phoenix在Hbase中維護了一張系統表（SYSTEM TABLE）來存儲相關Phoenix表的scheme元數據。
3：創建二級索引（create index）操作不會影響表結構
4：如果建表時不指定列族，則列族以_0、_1的方式命名
5：如果有多列時value值通過HBase接口獲取的并不是直接可用的值（只能通過Phoenix接口獲取正常值）

動態scheme相關

1：支持修改列

Example: ALTER TABLE my_schema.my_table ADD d.dept_id char(10) VERSIONS=10 ALTER TABLE my_table ADD dept_name char(50) ALTER TABLE my_table ADD parent_id char(15) null primary key ALTER TABLE my_table DROP COLUMN d.dept_id ALTER TABLE my_table DROP COLUMN dept_name ALTER TABLE my_table DROP COLUMN parent_id ALTER TABLE my_table SET IMMUTABLE_ROWS=true

2：支持修改二級索引

Example: CREATE INDEX my_idx ON sales.opportunity(last_updated_date DESC) CREATE INDEX my_idx ON log.event(created_date DESC) INCLUDE (name, payload) SALT_BUCKETS=10 CREATE INDEX IF NOT EXISTS my_comp_idx ON server_metrics ( gc_time DESC, created_date DESC ) DATA_BLOCK_ENCODING='NONE',VERSIONS=?,MAX_FILESIZE=2000000 split on (?, ?, ?) ALTER INDEX my_idx ON sales.opportunity DISABLE ALTER INDEX IF EXISTS my_idx ON server_metrics REBUILD DROP INDEX my_idx ON sales.opportunity DROP INDEX IF EXISTS my_idx ON server_metrics

3：應該是不支持修改主鍵索引（沒找到相關信息。理論上也不好支持，因為主鍵索引就是rowkey的值。）

Java客戶端示例代碼(直接面向JDBC接口編程)：

import java.sql.Connection; import java.sql.DriverManager; import java.sql.ResultSet; import java.sql.SQLException; import java.sql.PreparedStatement; import java.sql.Statement; public class test { public static void main(String[] args) throws SQLException { Statement stmt = null; ResultSet rset = null; Connection con = DriverManager.getConnection("jdbc:phoenix:zookeeper"); stmt = con.createStatement(); stmt.executeUpdate("create table test (mykey integer not null primary key, mycolumn varchar)"); stmt.executeUpdate("upsert into test values (1,'Hello')"); stmt.executeUpdate("upsert into test values (2,'World!')"); con.commit(); PreparedStatement statement = con.prepareStatement("select * from test"); rset = statement.executeQuery(); while (rset.next()) { System.out.println(rset.getString("mycolumn")); } statement.close(); con.close(); } }

單節點測試：

建表： CREATE TABLE IF NOT EXISTS $table (HOST CHAR(2) NOT NULL,DOMAIN VARCHAR NOT NULL, FEATURE VARCHAR NOT NULL,DATE DATE NOT NULL,USAGE.CORE BIGINT,USAGE.DB BIGINT,STATS.ACTIVE_VISITOR INTEGER CONSTRAINT PK PRIMARY KEY (HOST, DOMAIN, FEATURE, DATE)) SPLIT ON ('CSGoogle','CSSalesforce','EUApple','EUGoogle','EUSalesforce','NAApple','NAGoogle','NASalesforce'); performance_10000000數據表中是1000W條如下數據： +------+------------+------------+---------------------+----------+----------+----------------+ | HOST | DOMAIN | FEATURE | DATE | CORE | DB | ACTIVE_VISITOR | +------+------------+------------+---------------------+----------+----------+----------------+ | CS | Apple.com | Dashboard | 2013-12-23 | 363 | 795 | 8390 | +------+------------+------------+---------------------+----------+----------+----------------+ Query # 1 - Count - SELECT COUNT(1) FROM performance_10000000; COUNT(1) -------- 10000000 Time: 66.044 sec(s) Query # 2 - Group By First PK - SELECT HOST FROM performance_10000000 GROUP BY HOST; HOST ---- CS EU NA Time: 51.43 sec(s) Query # 3 - Group By Second PK - SELECT DOMAIN FROM performance_10000000 GROUP BY DOMAIN; DOMAIN ---------- Apple.com Google.com Salesforce.com Time: 46.908 sec(s) Query # 4 - Truncate + Group By - SELECT TRUNC(DATE,'DAY') DAY FROM performance_10000000 GROUP BY TRUNC(DATE,'DAY'); DAY ------------------- 2013-12-23 00:00:00 2013-12-24 00:00:00 2013-12-25 00:00:00 ...... Time: 48.132 sec(s) Query # 5 - Filter + Count - SELECT COUNT(1) FROM performance_10000000 WHERE CORE<10; COUNT(1) -------- 198669 Time: 31.301 sec(s)

集群(5*RegionServer)測試：

performance_10000000數據表中是1000W條如下數據： +------+------------+------------+---------------------+----------+----------+----------------+ | HOST | DOMAIN | FEATURE | DATE | CORE | DB | ACTIVE_VISITOR | +------+------------+------------+---------------------+----------+----------+----------------+ | CS | Apple.com | Dashboard | 2013-12-23 | 363 | 795 | 8390 | +------+------------+------------+---------------------+----------+----------+----------------+ Query # 1 - Count - SELECT COUNT(1) FROM performance_10000000; COUNT(1) -------- 19630614 Time: 13.879 sec(s) Query # 2 - Group By First PK - SELECT HOST FROM performance_10000000 GROUP BY HOST; HOST ---- CS EU NA Time: 13.545 sec(s) Query # 3 - Group By Second PK - SELECT DOMAIN FROM performance_10000000 GROUP BY DOMAIN; DOMAIN ---------- Apple.com Google.com Salesforce.com Time: 12.907 sec(s) Query # 4 - Truncate + Group By - SELECT TRUNC(DATE,'DAY') DAY FROM performance_10000000 GROUP BY TRUNC(DATE,'DAY'); DAY ------------------- 2013-12-23 00:00:00 2013-12-24 00:00:00 2013-12-25 00:00:00 ...... Time: 13.845 sec(s) Query # 5 - Filter + Count - SELECT COUNT(1) FROM performance_10000000 WHERE CORE<10; COUNT(1) -------- 393154 Time: 8.522 sec(s)

優點：

1：命令行和java客戶端使用都很簡單。尤其是java客戶端直接面向JDBC接口編程，封裝且優化了Hbase很多細節。
2：在單表操作上性能比Hive Handler好很多（但是handler也有可能會升級加入斜處理器相關聚合等特性）
3：支持多列的二級索引，列數不限。其中可變索引時列數越多寫入速度越慢，不可變索引不影響寫入速度（參考：https://github.com/forcedotcom/phoenix/wiki/Secondary-Indexing#mutable-indexing）。
4：對Top-N查詢速度遠超Hive（參考：https://github.com/forcedotcom/phoenix/wiki/Performance#top-n）
5：提供對rowkey分桶的特性，可以實現數據在各個region的均勻分布（參考：https://github.com/forcedotcom/phoenix/wiki/Performance#salting）
6：低侵入性，基本對原Hbase的使用沒什么影響
7：提供的函數基本都能cover住絕大多數需求了
8：與Hive不同的是，Phoenix的sql語句更接近標準sql規范。

缺點：

1：Phoenix創建的表Hbase可以識別并使用，但是使用Hbase創建的表，Phoenix不能識別，因為Phoenix對每張表都有其相應的元數據信息。
2：硬傷：多表join操作當前不支持（官方文檔對當前2.2.3版本的說法不一致，但3.0應該會支持，有可能會引入Apache Drill把大表join分割成小任務的特性）。
3：目前只支持hbase0.94系列（這個應該問題不大）

其他

1：Phoenix對所有數據都是基于內存進行統計。因此從sql語句復雜程度以及java各數據結構的性能，基本能對其消耗的時間有一個大概的估計。

功能擴展（impala不太熟，主要針對Hive說了）

如果將Hive和Phoenix集成的話，還是很不錯的，兩者剛好互補。Hive并沒使用coprocesser，只是通過把數據取出來做MR，而Phoenix剛好是在單表取數據方面做了很多優化。集成后可以享受到Phoenix的單表操作優勢，同時可以解決多表join的問題（在Phoenix估計短時間難做出來大表join的方案，說是要模仿Drill，但是現在Drill本身救處于Alpha階段）。

如果集成的話主要工作是需要在Hive-hbase-handler中適配Hive相關單表操作到Phoenix的java客戶端接口。

不太成熟的問題：

1：是把Phoenix的單表操作移植到Hive中還是把Hive的join移植到Phoenix中？
2：是只對外提供Hive的接口還是同時對外提供Hive和Phoenix兩種接口呢？
3：適配的過程還有很多細節

嘗試調研了下Phoenix二級索引是否可以達到像華為一樣創建完可以無需修改HBase任何代碼就享受到二級索引的效果

擴展閱讀：

sql for hbase(Phoenix、Impala、Drill): http://www.orzota.com/sql-for-hbase/
SQL on Hadoop的最新進展及7項相關技術分享：http://www.csdn.net/article/2013-10-18/2817214-big-data-hadoop

對比華為HBase二級索引：

缺點：華為二級索引需要在建表時指定列（及不支持動態修改），同時華為代碼對Hbase本身侵入性太大（比如balancer要用華為的），難以升級維護。

優點：但是索引建好后，在對Hbase的scan、Puts、Deletes操作時使用Hbase原生代碼（無需任何改動）即可享受到索引的效果。也不需要指定使用哪個索引，它會自己使用最優索引。

也就是說如果加上華為索引，Hive-hbase-handler無需改動即可使用二級索引。但是phoenix目前只支持通過phoenix sql方式使用二級索引。

性能對比：暫未測試，估計差不太多

綜合看移植phoenix比移植華為更靠譜，phoenix侵入性小，功能更強大，且升級維護方面也比華為要靠譜。但是移植phoenix難度也相對比較大。

但是如果只是想短期起效果，可以嘗試下華為索引。

淘寶開源項目Lealone：

是一個可用于HBase的分布式SQL引擎，主要功能就是能用SQL方式（JDBC）查詢Hbase，避免了HBase使用的繁瑣操作。相對與Phoenix的功能弱多了。

支持高性能的分布式事務，
使用一個非常新穎的基于局部時間戳的多版本沖突與有效性檢測的分布式事務模型
是對H2關系數據庫SQL引擎的改進和擴展
HBase建的表Lealone只能讀；Lealone建的表Lealone可以讀寫。

基于Solr的HBase多條件查詢：

介紹：ApacheSolr 是一個開源的搜索服務器，Solr 使用 Java 語言開發，主要基于 HTTP 和Apache Lucene 實現。

原理：基于Solr的HBase多條件查詢原理很簡單，將HBase表中涉及條件過濾的字段和rowkey在Solr中建立索引，通過Solr的多條件查詢快速獲得符合過濾條件的rowkey值，拿到這些rowkey之后在HBASE中通過指定rowkey進行查詢。

缺點：
1：ApacheSolr本身并不是專為HBase設計的。需要專門針對ApacheSolr寫Hbase的相關應用，比如HBase寫數據時同步更新索引的過程需要我們自己寫協處理器。
2：ApacheSolr本身是一個WebService服務，需要額外維護一個或多個ApacheSolr服務器。

向AI問一下細節

中文字幕av专区_日韩电影在线播放_精品国产精品久久一区免费式_av在线免费观看网站

Phoenix(sql on hbase