[HUDI-69] Support Spark Datasource for MOR table #1722
@@ -18,7 +18,9 @@
 package org.apache.hudi
 
 import org.apache.hudi.DataSourceReadOptions._
-import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.common.model.HoodieTableType
+import org.apache.hudi.common.table.HoodieTableMetaClient
+import org.apache.hudi.exception.{HoodieException, TableNotFoundException}
 import org.apache.hudi.hadoop.HoodieROTablePathFilter
 import org.apache.log4j.LogManager
 import org.apache.spark.sql.execution.datasources.DataSource
@@ -58,26 +60,28 @@ class DefaultSource extends RelationProvider
       throw new HoodieException("'path' must be specified.")
     }
 
+    // Try to create hoodie table meta client from the give path
+    // TODO: Smarter path handling
+    val metaClient = try {
+      val conf = sqlContext.sparkContext.hadoopConfiguration
+      Option(new HoodieTableMetaClient(conf, path.get, true))
Member: Wouldn't it be problematic if

Member (Author): At this point we have:
What I am trying to do here is:

Member: Let me think about this more. We need to support some form of globbing for MOR/snapshot queries.

Member (Author): Udit's PR has this path handling. Should we merge part of his PR first? https://github.com/apache/hudi/pull/1702/files#diff-9a21766ebf794414f94b302bcb968f41R31
+    } catch {
+      case e: HoodieException => Option.empty
|
Member: Can we just error out there?

Member (Author): I used this as a flag that the
+    }
+
     if (parameters(QUERY_TYPE_OPT_KEY).equals(QUERY_TYPE_SNAPSHOT_OPT_VAL)) {
-      // this is just effectively RO view only, where `path` can contain a mix of
-      // non-hoodie/hoodie path files. set the path filter up
-      sqlContext.sparkContext.hadoopConfiguration.setClass(
-        "mapreduce.input.pathFilter.class",
-        classOf[HoodieROTablePathFilter],
-        classOf[org.apache.hadoop.fs.PathFilter])
-
-      log.info("Constructing hoodie (as parquet) data source with options :" + parameters)
-      log.warn("Snapshot view not supported yet via data source, for MERGE_ON_READ tables. " +
-        "Please query the Hive table registered using Spark SQL.")
-      // simply return as a regular parquet relation
-      DataSource.apply(
-        sparkSession = sqlContext.sparkSession,
-        userSpecifiedSchema = Option(schema),
-        className = "parquet",
-        options = parameters)
-        .resolveRelation()
+      if (metaClient.isDefined && metaClient.get.getTableType.equals(HoodieTableType.MERGE_ON_READ)) {
+        new SnapshotRelation(sqlContext, path.get, optParams, schema, metaClient.get)
+      } else {
+        getReadOptimizedView(sqlContext, parameters, schema)
+      }
+    } else if (parameters(QUERY_TYPE_OPT_KEY).equals(QUERY_TYPE_READ_OPTIMIZED_OPT_VAL)) {
+      getReadOptimizedView(sqlContext, parameters, schema)
     } else if (parameters(QUERY_TYPE_OPT_KEY).equals(QUERY_TYPE_INCREMENTAL_OPT_VAL)) {
-      new IncrementalRelation(sqlContext, path.get, optParams, schema)
+      if (metaClient.isEmpty) {
+        throw new TableNotFoundException(path.get)
+      }
+      new IncrementalRelation(sqlContext, path.get, optParams, schema, metaClient.get)
     } else {
       throw new HoodieException("Invalid query type :" + parameters(QUERY_TYPE_OPT_KEY))
     }
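Taken together, the branch above now routes a snapshot query on a MERGE_ON_READ table to the new SnapshotRelation, keeps the read-optimized path for the other cases, and falls back to the read-optimized view when the path cannot be resolved to a Hudi table base path (for example, a glob). A minimal usage sketch of the three query types, assuming a hypothetical local table path and the option constants from DataSourceReadOptions:

```scala
import org.apache.hudi.DataSourceReadOptions
import org.apache.spark.sql.SparkSession

object HudiMorReadSketch extends App {
  val spark = SparkSession.builder()
    .appName("hudi-mor-read-sketch")
    .master("local[*]")
    .getOrCreate()

  // Hypothetical table base path.
  val basePath = "file:///tmp/hudi/my_mor_table"

  // Snapshot query: for a MERGE_ON_READ table this resolves to SnapshotRelation,
  // which merges base parquet files with their log files where needed.
  val snapshotDF = spark.read.format("hudi")
    .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
    .load(basePath)

  // Read-optimized query: base parquet files only, no log merging.
  val roDF = spark.read.format("hudi")
    .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_READ_OPTIMIZED_OPT_VAL)
    .load(basePath)

  // Incremental query: requires the path to resolve to a Hudi table (metaClient defined).
  val incrementalDF = spark.read.format("hudi")
    .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
    .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, "000")
    .load(basePath)

  snapshotDF.show(); roDF.show(); incrementalDF.show()
}
```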
@@ -123,4 +127,25 @@ class DefaultSource extends RelationProvider
   }
 
   override def shortName(): String = "hudi"
+
+  private def getReadOptimizedView(sqlContext: SQLContext,
Member: We can rename this to something like

Member (Author): Sure, will do.
+                                   optParams: Map[String, String],
+                                   schema: StructType): BaseRelation = {
+    log.warn("Loading Read Optimized view.")
+    // this is just effectively RO view only, where `path` can contain a mix of
+    // non-hoodie/hoodie path files. set the path filter up
+    sqlContext.sparkContext.hadoopConfiguration.setClass(
+      "mapreduce.input.pathFilter.class",
+      classOf[HoodieROTablePathFilter],
+      classOf[org.apache.hadoop.fs.PathFilter])
+
+    log.info("Constructing hoodie (as parquet) data source with options :" + optParams)
+    // simply return as a regular parquet relation
+    DataSource.apply(
+      sparkSession = sqlContext.sparkSession,
+      userSpecifiedSchema = Option(schema),
+      className = "parquet",
+      options = optParams)
+      .resolveRelation()
+  }
 }
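getReadOptimizedView is essentially the pre-existing read path: it installs HoodieROTablePathFilter so that only the latest committed base files survive file listing, then hands off to the plain parquet datasource. A rough equivalent that a user could write by hand today looks like the sketch below (a sketch only, with a hypothetical partition glob):

```scala
import org.apache.hadoop.fs.PathFilter
import org.apache.hudi.hadoop.HoodieROTablePathFilter
import org.apache.spark.sql.SparkSession

object ManualReadOptimizedSketch extends App {
  val spark = SparkSession.builder()
    .appName("manual-ro-read-sketch")
    .master("local[*]")
    .getOrCreate()

  // Install the Hudi path filter so that, for each file group, only the latest
  // committed base parquet file is visible to the listing.
  spark.sparkContext.hadoopConfiguration.setClass(
    "mapreduce.input.pathFilter.class",
    classOf[HoodieROTablePathFilter],
    classOf[PathFilter])

  // Hypothetical partition glob under the table base path.
  val df = spark.read.parquet("file:///tmp/hudi/my_mor_table/*/*/*")
  df.show()
}
```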
@@ -0,0 +1,139 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hudi

import org.apache.hudi.avro.HoodieAvroUtils
import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
import org.apache.hudi.common.table.timeline.HoodieTimeline
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hadoop.{HoodieParquetInputFormat, HoodieROTablePathFilter}
import org.apache.hudi.hadoop.utils.HoodieRealtimeInputFormatUtils
import org.apache.hudi.exception.HoodieException
import org.apache.hudi.table.HoodieTable

import org.apache.hadoop.mapred.JobConf
import org.apache.log4j.LogManager
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

import java.util
import scala.collection.JavaConverters._

/**
 * This is the Spark DataSourceV1 relation to read Hudi MOR table.
 *
 * @param sqlContext
 * @param basePath
 * @param optParams
 * @param userSchema
 */
class SnapshotRelation(val sqlContext: SQLContext,
                       val basePath: String,
                       val optParams: Map[String, String],
                       val userSchema: StructType,
                       val metaClient: HoodieTableMetaClient) extends BaseRelation with PrunedFilteredScan {

  private val log = LogManager.getLogger(classOf[SnapshotRelation])
  private val conf = sqlContext.sparkContext.hadoopConfiguration

  // Load Hudi table
  private val hoodieTable = HoodieTable.create(metaClient, HoodieWriteConfig.newBuilder().withPath(basePath).build(), conf)
  private val commitTimeline = hoodieTable.getMetaClient.getCommitsAndCompactionTimeline
  if (commitTimeline.empty()) {
    throw new HoodieException("No Valid Hudi timeline exists")
  }
  private val completedCommitTimeline = hoodieTable.getMetaClient.getCommitsTimeline.filterCompletedInstants()
  private val lastInstant = completedCommitTimeline.lastInstant().get()

  // Set config for listStatus() in HoodieParquetInputFormat
  conf.setClass(
    "mapreduce.input.pathFilter.class",
    classOf[HoodieROTablePathFilter],
    classOf[org.apache.hadoop.fs.PathFilter])
  conf.setStrings("mapreduce.input.fileinputformat.inputdir", basePath)
  conf.setStrings("mapreduce.input.fileinputformat.input.dir.recursive", "true")
  conf.setStrings("hoodie.realtime.last.commit", lastInstant.getTimestamp)

  private val hoodieInputFormat = new HoodieParquetInputFormat
  hoodieInputFormat.setConf(conf)

  // List all parquet files
  private val fileStatus = hoodieInputFormat.listStatus(new JobConf(conf))

  val (parquetPaths, parquetWithLogPaths) = if (lastInstant.getAction.equals(HoodieTimeline.COMMIT_ACTION)
    || lastInstant.getAction.equals(HoodieTimeline.COMPACTION_ACTION)) {
    (fileStatus.map(f => f.getPath.toString).toList, Map.empty[String, String])
  } else {
    val fileGroups = HoodieRealtimeInputFormatUtils.groupLogsByBaseFile(conf, util.Arrays.stream(fileStatus)).asScala
    // Split the file group to: parquet file without a matching log file, parquet file need to merge with log files
    val parquetPaths: List[String] = fileGroups.filter(p => p._2.size() == 0).keys.toList
    val parquetWithLogPaths: Map[String, String] = fileGroups
      .filter(p => p._2.size() > 0)
      .map{ case(k, v) => (k, v.asScala.toList.mkString(","))}
      .toMap
    (parquetPaths, parquetWithLogPaths)
  }

  if (log.isDebugEnabled) {
    log.debug("Stand alone parquet files: \n" + parquetPaths.mkString("\n"))
    log.debug("Parquet files that have matching log files: \n" + parquetWithLogPaths.map(m => s"${m._1}:${m._2}").mkString("\n"))
  }

  // Add log file map to options
  private val finalOps = optParams ++ parquetWithLogPaths

  // use schema from latest metadata, if not present, read schema from the data file
  private val latestSchema = {
    val schemaUtil = new TableSchemaResolver(metaClient)
    val tableSchema = HoodieAvroUtils.createHoodieWriteSchema(schemaUtil.getTableAvroSchemaWithoutMetadataFields)
    AvroConversionUtils.convertAvroSchemaToStructType(tableSchema)
  }

  override def schema: StructType = latestSchema

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    if (parquetWithLogPaths.isEmpty) {
      sqlContext
        .read
        .options(finalOps)
        .schema(schema)
        .format("parquet")
        .load(parquetPaths:_*)
        .selectExpr(requiredColumns:_*)
        .rdd
    } else {
      val regularParquet = sqlContext
        .read
        .options(finalOps)
        .schema(schema)
        .format("parquet")
        .load(parquetPaths:_*)
      // Hudi parquet files needed to merge with log file
      sqlContext
        .read
        .options(finalOps)
        .schema(schema)
        .format("org.apache.spark.sql.execution.datasources.parquet.HoodieParquetRealtimeFileFormat")
        .load(parquetWithLogPaths.keys.toList: _*)
        .union(regularParquet)
        .rdd
    }
  }
}
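buildScan therefore produces one RDD from two logical reads over the same schema: standalone base files go through the stock parquet format, while base files that have log files go through the Hudi realtime file format, and the two DataFrames are unioned. A generic, self-contained sketch of that pattern in plain Spark (hypothetical local paths, with ordinary parquet standing in for the Hudi-specific realtime format):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object UnionScanSketch extends App {
  val spark = SparkSession.builder()
    .appName("union-scan-sketch")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Write two small parquet "file groups" so the sketch is runnable end to end.
  val standalonePath = "file:///tmp/union_sketch/standalone"
  val withLogsPath = "file:///tmp/union_sketch/with_logs"
  Seq(("key1", "v1")).toDF("_hoodie_record_key", "value").write.mode("overwrite").parquet(standalonePath)
  Seq(("key2", "v2")).toDF("_hoodie_record_key", "value").write.mode("overwrite").parquet(withLogsPath)

  def readAs(format: String, paths: String*): DataFrame =
    spark.read.format(format).load(paths: _*)

  // SnapshotRelation does the same thing, except the second read would use the Hudi
  // realtime file format, which merges log records into the base-file rows.
  val merged = readAs("parquet", standalonePath)
    .union(readAs("parquet", withLogsPath))

  merged.show()
}
```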
Don't think this change is necessary, right? The RO view does map to the snapshot query for COW. We may need to have two maps, one for COW and one for MOR.

I got confused by the naming sometimes...
For a COW table, snapshot view = read optimized view; for MOR, the snapshot view and the read optimized view are different things. With bootstrap, we will have one more view.
Can we call:
read optimized view -> parquet only (including bootstrap)
snapshot view -> parquet (with bootstrap) merged with log
regardless of table type?

Let's keep this mapping because we should be able to do an RO view on MOR.

No.. there are no more views.. we did a renaming exercise to clear things up as "query types".. with that there should be no confusion.. our docs are consistent with this as well.. On COW there is in fact no RO view.. so this change has to be done differently, if you need it for MOR..

Sorry, my previous comments are confusing, let me rephrase.
What I am trying to do here is to not change the query behavior. Since we did not support snapshot queries for MOR before, the RO and snapshot query types behaved the same regardless of whether the table is COW or MOR.
If we don't change this mapping, users will see different behavior after upgrading to the next release. If they are using VIEW_TYPE_READ_OPTIMIZED_OPT_VAL (deprecated) on MOR in their code, then after upgrading the code will run a snapshot query instead of an RO query. This could surprise users even though the key is deprecated.

We have been logging a warning for some time on the use of the deprecated configs, so I think it's fair to do the right thing here moving forward and call this out in the release notes. Let me push some changes..
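For reference, the two spellings being discussed, as a user would write them. This is a hedged sketch: the table path is hypothetical, and the deprecated view-type constants are assumed to still be present in DataSourceReadOptions alongside the newer query-type constants.

```scala
import org.apache.hudi.DataSourceReadOptions
import org.apache.spark.sql.SparkSession

object QueryTypeOptionsSketch extends App {
  val spark = SparkSession.builder()
    .appName("query-type-options-sketch")
    .master("local[*]")
    .getOrCreate()

  // Hypothetical MOR table path.
  val basePath = "file:///tmp/hudi/my_mor_table"

  // Old, deprecated spelling: whether this maps to a read-optimized or a snapshot
  // query on MOR after this PR is exactly what the thread above is deciding.
  val deprecatedStyle = spark.read.format("hudi")
    .option(DataSourceReadOptions.VIEW_TYPE_OPT_KEY, DataSourceReadOptions.VIEW_TYPE_READ_OPTIMIZED_OPT_VAL)
    .load(basePath)

  // Explicit, non-deprecated spelling that keeps read-optimized semantics on MOR.
  val explicitReadOptimized = spark.read.format("hudi")
    .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, DataSourceReadOptions.QUERY_TYPE_READ_OPTIMIZED_OPT_VAL)
    .load(basePath)
}
```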