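This article looks at how Delta 0.7.0, integrated with Spark 3.0.1, supports DDL operations.
Before 0.7.0, Delta could not save data as a table; it could only write to files. In other words, its metadata was kept separately from Spark's other metadata: Delta stood on its own, and its data could not be joined with other tables through the catalog. Only from Delta 0.7.0 onward is it truly integrated with Spark, thanks to the Catalog plugin API in Spark 3.x.
Let's start with how Delta configures the SparkSession: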
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder()
      .appName("...")
      .master("...")
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()
Consider the second setting, config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog").
The Spark configuration documentation describes spark.sql.catalog.spark_catalog as follows:
A catalog implementation that will be used as the v2 interface to Spark's built-in v1 catalog: spark_catalog. This catalog shares its identifier namespace with the spark_catalog and must be consistent with it; for example, if a table can be loaded by the spark_catalog, this catalog must also return the table metadata. To delegate operations to the spark_catalog, implementations can extend 'CatalogExtension'.
In other words, this setting is what unifies the metadata; it is also a result of collaboration between the Spark and Delta communities.
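To make the effect of this setting concrete, here is a minimal sketch of DDL against a session configured as above. It assumes Spark 3.0.1 and Delta 0.7.0 are on the classpath and a local run; the app name and the table names `events` and `labels` are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession

object DeltaDdlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("delta-ddl-example") // hypothetical app name
      .master("local[*]")           // assumption: local run
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      .getOrCreate()

    // DDL: the Delta table is registered in the session catalog by name.
    spark.sql("CREATE TABLE IF NOT EXISTS events (id BIGINT, name STRING) USING delta")
    spark.sql("INSERT INTO events VALUES (1, 'a')")

    // A non-Delta table in the same catalog (hypothetical name).
    spark.sql("CREATE TABLE IF NOT EXISTS labels (id BIGINT, label STRING) USING parquet")

    // Both tables are visible side by side and can be joined by name.
    spark.sql("SHOW TABLES").show()
    spark.sql("SELECT e.id, l.label FROM events e JOIN labels l ON e.id = l.id").show()

    spark.stop()
  }
}
```

Because both tables live in the same session catalog, the join needs no path-based loading of the Delta data.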
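To understand why Delta can perform DDL and DML operations, we first need to understand the Catalog plugin mechanism in Spark 3.x (SPARK-31121).
First comes the interface CatalogPlugin, the top-level interface for catalog plugins. As its Javadoc says: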
     * A marker interface to provide a catalog implementation for Spark.
     * <p>
     * Implementations can provide catalog functions by implementing additional interfaces for tables,
     * views, and functions.
     * <p>
     * Catalog implementations must implement this marker interface to be loaded by
     * {@link Catalogs#load(String, SQLConf)}. The loader will instantiate catalog classes using the
     * required public no-arg constructor. After creating an instance, it will be configured by calling
     * {@link #initialize(String, CaseInsensitiveStringMap)}.
     * <p>
     * Catalog implementations are registered to a name by adding a configuration option to Spark:
     * {@code spark.sql.catalog.catalog-name=com.example.YourCatalogClass}. All configuration properties
     * in the Spark configuration that share the catalog name prefix,
     * {@code spark.sql.catalog.catalog-name.(key)=(value)} will be passed in the case insensitive
     * string map of options in initialization with the prefix removed.
     * {@code name}, is also passed and is the catalog's name; in this case, "catalog-name".
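A catalog can therefore be plugged into Spark by setting spark.sql.catalog.catalog-name=com.example.YourCatalogClass.
An implementation can also implement the additional interfaces for tables, views, and functions. One of these is the interface TableCatalog, which provides the table-related methods: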
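The registration rule described in the Javadoc can be sketched in plain Scala. This is a simplified model, not Spark's own loader: the object and method names are hypothetical, and lowercasing the keys only approximates the case-insensitive option map.

```scala
import java.util.Locale

// Simplified model of how catalog options are derived from Spark configuration.
// Names here are hypothetical; Spark's real loader is Catalogs.load.
object CatalogOptionsSketch {

  /** The implementation class registered under the catalog name, if any. */
  def implementationFor(catalogName: String, conf: Map[String, String]): Option[String] =
    conf.get(s"spark.sql.catalog.$catalogName")

  /** Options sharing the catalog name prefix, with the prefix removed. */
  def optionsFor(catalogName: String, conf: Map[String, String]): Map[String, String] = {
    val prefix = s"spark.sql.catalog.$catalogName."
    conf.collect {
      case (key, value) if key.startsWith(prefix) =>
        // Lowercasing stands in for the case-insensitive map (an approximation).
        key.stripPrefix(prefix).toLowerCase(Locale.ROOT) -> value
    }
  }
}
```

For a catalog named `my_catalog`, a setting such as `spark.sql.catalog.my_catalog.warehouse` would reach `initialize` under the key `warehouse`, while settings of other catalogs are ignored.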
    /**
     * List the tables in a namespace from the catalog.
     * <p>
     * If the catalog supports views, this must return identifiers for only tables and not views.
     *
     * @param namespace a multi-part namespace
     * @return an array of Identifiers for tables
     * @throws NoSuchNamespaceException If the namespace does not exist (optional).
     */
    Identifier[] listTables(String[] namespace) throws NoSuchNamespaceException;

    /**
     * Load table metadata by {@link Identifier identifier} from the catalog.
     * <p>
     * If the catalog supports views and contains a view for the identifier and not a table, this
     * must throw {@link NoSuchTableException}.
     *
     * @param ident a table identifier
     * @return the table's metadata
     * @throws NoSuchTableException If the table doesn't exist or is a view
     */
    Table loadTable(Identifier ident) throws NoSuchTableException;
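You can thus build your own catalog on top of TableCatalog, which is what enables multi-catalog support.
There is also DelegatingCatalogExtension, an abstract class that implements the CatalogExtension interface, which in turn extends TableCatalog and SupportsNamespaces. DeltaCatalog extends DelegatingCatalogExtension; we analyze this part later.
Finally there is the class CatalogManager, which manages CatalogPlugins and is thread-safe: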
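To illustrate what multi-catalog support means for table names, here is a simplified model of how a multi-part name could be split into catalog, namespace, and table. It is a sketch under assumptions, not Spark's actual resolution code: the object name, the `Resolved` type, and the default values are hypothetical.

```scala
// Simplified model of multi-part name resolution across catalogs.
// All names here are hypothetical and exist only for illustration.
object CatalogResolutionSketch {

  final case class Resolved(catalog: String, namespace: Seq[String], table: String)

  def resolve(
      nameParts: Seq[String],
      registeredCatalogs: Set[String],
      currentCatalog: String = "spark_catalog",
      currentNamespace: Seq[String] = Seq("default")): Resolved = {
    require(nameParts.nonEmpty, "a table name needs at least one part")
    nameParts match {
      // A bare table name uses the current catalog and namespace.
      case Seq(table) =>
        Resolved(currentCatalog, currentNamespace, table)
      // If the first part names a registered catalog, it selects that catalog.
      case head +: rest if registeredCatalogs.contains(head) =>
        Resolved(head, rest.init, rest.last)
      // Otherwise every leading part is a namespace in the current catalog.
      case parts =>
        Resolved(currentCatalog, parts.init, parts.last)
    }
  }
}
```

Under this model, `my_catalog.db.events` goes to a registered catalog `my_catalog`, while `db.events` stays in the current catalog.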
    /**
     * A thread-safe manager for [[CatalogPlugin]]s. It tracks all the registered catalogs, and allow
     * the caller to look up a catalog by name.
     *
     * There are still many commands (e.g. ANALYZE TABLE) that do not support v2 catalog API. They
     * ignore the current catalog and blindly go to the v1 `SessionCatalog`. To avoid tracking current
     * namespace in both `SessionCatalog` and `CatalogManger`, we let `CatalogManager` to set/get
     * current database of `SessionCatalog` when the current catalog is the session catalog.
     */
    // TODO: all commands should look up table from the current catalog. The `SessionCatalog` doesn't
    //       need to track current database at all.
    private[sql] class CatalogManager(
        conf: SQLConf,
        defaultSessionCatalog: CatalogPlugin,
        val v1SessionCatalog: SessionCatalog) extends Logging {
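As we can see, CatalogManager manages both the v2 CatalogPlugin and the v1 SessionCatalog; for historical reasons, the v1 catalog must remain supported.
So where is CatalogManager used? Looking at BaseSessionStateBuilder, we can see that this is where it is actually constructed: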
    /**
     * Catalog for managing table and database states. If there is a pre-existing catalog, the state
     * of that catalog (temp tables & current database) will be copied into the new catalog.
     *
     * Note: this depends on the `conf`, `functionRegistry` and `sqlParser` fields.
     */
    protected lazy val catalog: SessionCatalog = {
      val catalog = new SessionCatalog(
        () => session.sharedState.externalCatalog,
        () => session.sharedState.globalTempViewManager,
        functionRegistry,
        conf,
        SessionState.newHadoopConf(session.sparkContext.hadoopConfiguration, conf),
        sqlParser,
        resourceLoader)
      parentState.foreach(_.catalog.copyStateTo(catalog))
      catalog
    }

    protected lazy val v2SessionCatalog = new V2SessionCatalog(catalog, conf)

    protected lazy val catalogManager = new CatalogManager(conf, v2SessionCatalog, catalog)
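SessionCatalog is the v1 catalog. It mainly talks to the underlying metadata store and manages temporary views and UDFs; we won't analyze it here. Let's focus instead on the v2 session catalog, V2SessionCatalog: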
    /**
     * A [[TableCatalog]] that translates calls to the v1 SessionCatalog.
     */
    class V2SessionCatalog(catalog: SessionCatalog, conf: SQLConf)
      extends TableCatalog with SupportsNamespaces {
      import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.NamespaceHelper
      import V2SessionCatalog._

      override val defaultNamespace: Array[String] = Array("default")

      override def name: String = CatalogManager.SESSION_CATALOG_NAME

      // This class is instantiated by Spark, so `initialize` method will not be called.
      override def initialize(name: String, options: CaseInsensitiveStringMap): Unit = {}

      override def listTables(namespace: Array[String]): Array[Identifier] = {
        namespace match {
          case Array(db) =>
            catalog
              .listTables(db)
              .map(ident => Identifier.of(Array(ident.database.getOrElse("")), ident.table))
              .toArray
          case _ =>
            throw new NoSuchNamespaceException(namespace)
        }
      }
      ...
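Looking at listTables, we can see that the v2 session catalog delegates its operations to the v1 SessionCatalog, and the other methods do the same. Its name defaults to CatalogManager.SESSION_CATALOG_NAME, i.e. spark_catalog; keep this in mind, as it comes up again later. CatalogManager is also used by the analyzer and the optimizer of the logical plan, because they need the metadata it provides: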
    protected def analyzer: Analyzer = new Analyzer(catalogManager, conf) {
      ...

    protected def optimizer: Optimizer = {
      new SparkOptimizer(catalogManager, catalog, experimentalMethods) {
        override def earlyScanPushDownRules: Seq[Rule[LogicalPlan]] =
          super.earlyScanPushDownRules ++ customEarlyScanPushDownRules

        override def extendedOperatorOptimizationRules: Seq[Rule[LogicalPlan]] =
          super.extendedOperatorOptimizationRules ++ customOperatorOptimizationRules
      }
    }
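The analyzer and optimizer are at the very core of how Spark SQL processes a query, along with physical plan generation. So where do these session-state components come into play?
Take Dataset's filter method as an example; it goes through the session state to parse the condition: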
    def filter(conditionExpr: String): Dataset[T] = {
      filter(Column(sparkSession.sessionState.sqlParser.parseExpression(conditionExpr)))
    }
sessionState.sqlParser is the same sqlParser that was passed to SessionCatalog above:
    protected lazy val sqlParser: ParserInterface = {
      extensions.buildParser(session, new SparkSqlParser(conf))
    }
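With the whole path laid out, from SQL parsing to the use of metadata, we now have a rough picture of how the pieces fit together.
Now let's go back and see how Delta's DeltaCatalog plugs into Spark 3.x. Here is the source of DeltaCatalog: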
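One way to observe this path from parsing through analysis and optimization is to print the plans of a query. The following is a minimal sketch assuming a local Spark 3.x session; the app name and the temporary view `numbers` are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object PlanStagesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("plan-stages-example") // hypothetical app name
      .master("local[*]")             // assumption: local run
      .getOrCreate()

    // A temporary view gives the analyzer something to resolve by name.
    spark.range(10).createOrReplaceTempView("numbers")

    val df = spark.sql("SELECT id FROM numbers WHERE id > 5")

    // Prints the parsed, analyzed, and optimized logical plans and the physical plan.
    df.explain(true)

    // The individual stages are also available on the query execution.
    println(df.queryExecution.analyzed)      // after name resolution against the catalog
    println(df.queryExecution.optimizedPlan) // after optimizer rules

    spark.stop()
  }
}
```

The analyzed plan is where the relation name has been resolved through the catalog, which is the step that depends on the metadata discussed here.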
    class DeltaCatalog(val spark: SparkSession) extends DelegatingCatalogExtension
      with StagingTableCatalog
      with SupportsPathIdentifier {

      def this() = {
        this(SparkSession.active)
      }
      ...
As mentioned, DeltaCatalog extends DelegatingCatalogExtension. The name suggests a delegating class, so how does it delegate, and to whom?
    public abstract class DelegatingCatalogExtension implements CatalogExtension {

      private CatalogPlugin delegate;

      public final void setDelegateCatalog(CatalogPlugin delegate) {
        this.delegate = delegate;
      }
      ...
The class has a setDelegateCatalog method, which is called from the loadV2SessionCatalog method of CatalogManager:
    private def loadV2SessionCatalog(): CatalogPlugin = {
      Catalogs.load(SESSION_CATALOG_NAME, conf) match {
        case extension: CatalogExtension =>
          extension.setDelegateCatalog(defaultSessionCatalog)
          extension
        case other => other
      }
    }
That method is in turn called by v2SessionCatalog:
    private[sql] def v2SessionCatalog: CatalogPlugin = {
      conf.getConf(SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION).map { customV2SessionCatalog =>
        try {
          catalogs.getOrElseUpdate(SESSION_CATALOG_NAME, loadV2SessionCatalog())
        } catch {
          case NonFatal(_) =>
            logError(
              "Fail to instantiate the custom v2 session catalog: " + customV2SessionCatalog)
            defaultSessionCatalog
        }
      }.getOrElse(defaultSessionCatalog)
    }
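This returns the v2 session catalog instance that Spark will use. Let's walk through the method:
It first reads the setting SQLConf.V2_SESSION_CATALOG_IMPLEMENTATION, i.e. spark.sql.catalog.spark_catalog. If it is set, loadV2SessionCatalog is called to load that class; if it is not set, or if loading fails, the default v2 session catalog is used, which is the V2SessionCatalog instance.
Here is what we find:
Delta sets spark.sql.catalog.spark_catalog to "org.apache.spark.sql.delta.catalog.DeltaCatalog". In other words, Spark's v2 session catalog is a DeltaCatalog instance, and DeltaCatalog delegates to the V2SessionCatalog instance created in BaseSessionStateBuilder.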
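The wiring described above can be modeled in a few lines of plain Scala. This is a sketch of the delegation pattern only, with no Spark dependency: the traits are reduced to what the example needs, and `BuiltInSessionCatalog`, `DeltaLikeCatalog`, and the `instantiate` parameter are hypothetical stand-ins for V2SessionCatalog, DeltaCatalog, and Catalogs.load.

```scala
import scala.util.control.NonFatal

// Plain-Scala model of the session catalog wiring; not Spark's real classes.
object DelegationSketch {

  trait CatalogPlugin {
    def name: String
    def loadTable(table: String): String
  }

  trait CatalogExtension extends CatalogPlugin {
    def setDelegateCatalog(delegate: CatalogPlugin): Unit
  }

  // Stand-in for the built-in V2SessionCatalog.
  class BuiltInSessionCatalog extends CatalogPlugin {
    val name: String = "spark_catalog"
    def loadTable(table: String): String = s"v1:$table"
  }

  // Stand-in for DeltaCatalog: it wraps whatever its delegate returns.
  class DeltaLikeCatalog extends CatalogExtension {
    private var delegate: CatalogPlugin = _ // set by the manager before use
    val name: String = "spark_catalog"
    def setDelegateCatalog(d: CatalogPlugin): Unit = delegate = d
    def loadTable(table: String): String = s"delta(${delegate.loadTable(table)})"
  }

  // Mirrors the shape of v2SessionCatalog plus loadV2SessionCatalog.
  def v2SessionCatalog(
      conf: Map[String, String],
      defaultSessionCatalog: CatalogPlugin,
      instantiate: String => CatalogPlugin): CatalogPlugin =
    conf.get("spark.sql.catalog.spark_catalog") match {
      case Some(className) =>
        try {
          instantiate(className) match {
            case extension: CatalogExtension =>
              extension.setDelegateCatalog(defaultSessionCatalog)
              extension
            case other => other
          }
        } catch {
          // On failure, fall back to the built-in catalog.
          case NonFatal(_) => defaultSessionCatalog
        }
      case None => defaultSessionCatalog
    }
}
```

With the setting present, a lookup passes through the Delta-like catalog first and then the built-in one; without it, or if instantiation fails, only the built-in catalog is involved.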
Let's look at DeltaCatalog's createTable method in detail; the other methods are similar:
    override def createTable(
        ident: Identifier,
        schema: StructType,
        partitions: Array[Transform],
        properties: util.Map[String, String]): Table = {
      if (DeltaSourceUtils.isDeltaDataSourceName(getProvider(properties))) {
        createDeltaTable(
          ident, schema, partitions, properties, sourceQuery = None, TableCreationModes.Create)
      } else {
        super.createTable(ident, schema, partitions, properties)
      }
    }
    ...
    private def createDeltaTable(
        ident: Identifier,
        schema: StructType,
        partitions: Array[Transform],
        properties: util.Map[String, String],
        sourceQuery: Option[LogicalPlan],
        operation: TableCreationModes.CreationMode): Table = {
      ...
      val tableDesc = new CatalogTable(
        identifier = TableIdentifier(ident.name(), ident.namespace().lastOption),
        tableType = tableType,
        storage = storage,
        schema = schema,
        provider = Some("delta"),
        partitionColumnNames = partitionColumns,
        bucketSpec = maybeBucketSpec,
        properties = tableProperties.toMap,
        comment = Option(properties.get("comment")))
      // END: copy-paste from the super method finished.

      val withDb = verifyTableAndSolidify(tableDesc, None)
      ParquetSchemaConverter.checkFieldNames(tableDesc.schema.fieldNames)
      CreateDeltaTableCommand(
        withDb,
        getExistingTableIfExists(tableDesc),
        operation.mode,
        sourceQuery,
        operation,
        tableByPath = isByPath).run(spark)

      loadTable(ident)
    }

    override def loadTable(ident: Identifier): Table = {
      try {
        super.loadTable(ident) match {
          case v1: V1Table if DeltaTableUtils.isDeltaTable(v1.catalogTable) =>
            DeltaTableV2(
              spark,
              new Path(v1.catalogTable.location),
              catalogTable = Some(v1.catalogTable),
              tableIdentifier = Some(ident.toString))
          case o => o
        }
      }
      ...
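createTable first checks whether the provider is the Delta data source. If so, it goes to createDeltaTable; otherwise it calls super.createTable directly.
createDeltaTable first runs the Delta-specific CreateDeltaTableCommand.run() to write the Delta data, and then calls loadTable.
loadTable calls super.loadTable, which calls V2SessionCatalog's loadTable, which in turn calls getTableMetadata on the v1 SessionCatalog and wraps the result as V1Table(catalogTable). In this way the Delta table's metadata is persisted in, and read back from, the metastore managed by the v1 SessionCatalog.
If the provider is not Delta, super.createTable is called, which calls V2SessionCatalog's createTable and ultimately the createTable method of the v1 SessionCatalog.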
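The branching just described can be summarized in a small plain-Scala model. It is only a sketch: the `Route` type and function names are hypothetical, and it assumes the provider is read from a "provider" table property with a fallback default, compared case-insensitively against "delta".

```scala
import java.util.Locale

// Plain-Scala model of the createTable dispatch; names are hypothetical.
object CreateTableDispatchSketch {

  sealed trait Route
  case object DeltaRoute extends Route     // createDeltaTable, then loadTable
  case object DelegatedRoute extends Route // super.createTable on the delegate

  // Assumption: the provider comes from the "provider" property, else a default.
  def provider(properties: Map[String, String], defaultProvider: String = "parquet"): String =
    properties.getOrElse("provider", defaultProvider)

  def isDeltaProvider(name: String): Boolean =
    name.toLowerCase(Locale.ROOT) == "delta"

  def route(properties: Map[String, String]): Route =
    if (isDeltaProvider(provider(properties))) DeltaRoute else DelegatedRoute
}
```

In SQL terms, a statement with USING delta would take the first route, and any other provider the second; both end up registering the table through the v1 SessionCatalog.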
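We focused here on how the metadata of a Delta data source is stored; the code for non-Delta data sources is not reproduced. If you are interested, build the source and trace through it yourself.
One more thing worth mentioning: spark.sql.defaultCatalog defaults to spark_catalog, so the default catalog for SQL is spark_catalog, which with Delta configured is DeltaCatalog.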