Description
I am trying to understand the code we currently have. It is similar to the configuration below.
job.yml
metrics:
  - metric.yaml
inputs:
  df_input:
    file:
      path: s3a://bucket1/database1/table1/*.csv
      format: csv
      options:
        header: true
        delimiter: ","
output:
  file:
    dir: s3a://bucket1/
metric.yaml
steps:
- dataFrameName: df1
  sql:
    SELECT * FROM df_input
output:
- dataFrameName: df1
  outputType: File
  format: parquet
  outputOptions:
    saveMode: Overwrite
    path: final/hive/database1/table1
    protectFromEmptyOutput: false
    tableName: database1.table1
    partitionBy:
      - as_of_date
What is the significance of tableName under output in the metric.yaml file? The sample config at https://github.com/YotpoLtd/metorikku/blob/master/config/metric_config_sample.yaml comments this property as "# save output to hive metastore (or any other catalog provider)". What does that mean? Does Metorikku issue "MSCK REPAIR", "ALTER TABLE ADD PARTITION", or something similar to update the Hive metastore? What are the prerequisites for this property to work? It worked for us on our old cluster but not on the new one.
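For reference, these are the kinds of manual Hive statements I have in mind. This is only a sketch: the partition value 2020-01-01 is hypothetical, and the LOCATION assumes the output dir and path above are simply joined together, which I have not confirmed.

```sql
-- Option 1: let Hive scan the table location and register any partitions it finds.
MSCK REPAIR TABLE database1.table1;

-- Option 2: register a single partition explicitly.
-- The partition value and the location below are assumptions for illustration.
ALTER TABLE database1.table1
  ADD IF NOT EXISTS PARTITION (as_of_date = '2020-01-01')
  LOCATION 's3a://bucket1/final/hive/database1/table1/as_of_date=2020-01-01';
```

What I would like to know is whether tableName does the equivalent of one of these automatically, or whether we are expected to run them ourselves.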
Another question, indirectly related to the one above. Suppose I have two metric files in my job.yml, and the first one writes data to a file on which a Hive external table is defined. Is it possible to access that data from the second metric file, assuming the tableName property of the output is not working in the first metric file? Is there an example that does this?