Significance of tableName for output in metric.yaml file #482

@kiranbobba

I am trying to understand the configuration we currently have. It is similar to the following.

job.yml
metrics:
  - metric.yaml
inputs:
  df_input:
    file: 
      path: s3a://bucket1/database1/table1/*.csv
      format: csv
      options:
        header: true
        delimiter: ","
output:
  file:
    dir: s3a://bucket1/

metric.yaml
steps:
  - dataFrameName: df1
    sql:
      SELECT * FROM df_input

output:
  - dataFrameName: df1
    outputType: File
    format: parquet
    outputOptions:
      saveMode: Overwrite
      path: final/hive/database1/table1
      protectFromEmptyOutput: false
      tableName: database1.table1
      partitionBy:
        - as_of_date

What is the significance of tableName under output in the metric.yaml file? The sample config at https://github.com/YotpoLtd/metorikku/blob/master/config/metric_config_sample.yaml describes this property with the comment "# save output to hive metastore (or any other catalog provider)". What does that mean in practice? Does Metorikku issue "MSCK REPAIR", "ALTER TABLE ADD PARTITION", or something similar to update the Hive metastore? What are the prerequisites for this property to work? It worked for us on our old cluster but does not work on the new one.
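To make the question concrete, these are the kinds of Hive statements I have in mind. This is only a sketch: I have not confirmed that Metorikku runs either of them, the partition value is a made-up placeholder, and the LOCATION assumes the output dir and path from the config above are simply joined.

```sql
-- Option 1: let Hive scan the table location and register any missing partitions.
MSCK REPAIR TABLE database1.table1;

-- Option 2: register a single partition explicitly.
-- '2020-01-01' is a hypothetical as_of_date value; the LOCATION is an assumed
-- concatenation of output.file.dir and outputOptions.path.
ALTER TABLE database1.table1
  ADD IF NOT EXISTS PARTITION (as_of_date = '2020-01-01')
  LOCATION 's3a://bucket1/final/hive/database1/table1/as_of_date=2020-01-01';
```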

A second, indirectly related question: suppose my job.yml lists two metric files. Can the second metric file read the data that the first one wrote to a file location (on which a Hive external table is defined), assuming the tableName property of the output is not working in the first metric file? Is there an example that does this?
