Description
Running the HMS migration script with spark-submit fails with:
AttributeError: 'str' object has no attribute '_jdf'
which is triggered by the call:
id_type = df.get_schema_type(id_col)
If I change the call to:
id_type = get_schema_type(df, id_col)
I get past this error, but other df-related errors then surface in other functions.
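For reference, here is a minimal sketch of what the message itself implies: a plain string reached `select` as `self`. It runs without Spark. `FakeDataFrame` and `describe_failure` are hypothetical stand-ins invented for illustration, and the helper body is simplified. It shows one way the message can arise (the class being passed where an instance is expected), which is not necessarily what happens in the migration script.

```python
class FakeDataFrame:
    """Hypothetical stand-in for pyspark.sql.DataFrame; not the real class."""

    def __init__(self, jdf):
        self._jdf = jdf

    def select(self, *cols):
        # Like the method in the traceback, this touches self._jdf first.
        return (self._jdf, cols)


def get_schema_type(df, column_name):
    # Simplified: the script's helper goes on to read
    # .schema.fields[0].dataType from the result.
    return df.select(column_name)


def describe_failure(df, column_name):
    """Return the AttributeError message, or 'ok' if the call succeeds."""
    try:
        get_schema_type(df, column_name)
    except AttributeError as exc:
        return str(exc)
    return "ok"


# An instance behaves as expected.
print(describe_failure(FakeDataFrame("jdf"), "DB_ID"))

# Passing the class instead of an instance makes "DB_ID" the `self`
# inside select, so the lookup of _jdf happens on a str.
print(describe_failure(FakeDataFrame, "DB_ID"))
```

If something similar is happening in the script, logging `type(df)` just before the failing call should show which object actually arrives there.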
This is tested on:
"emr-5.31.0"
"Hadoop":"2.10.0"
"Hive":"2.3.7"
"Spark":"2.4.6"
Full stack trace:
Traceback (most recent call last):
File "/home/hadoop/hive_metastore_migration.py", line 1525, in <module>
main()
File "/home/hadoop/hive_metastore_migration.py", line 1519, in main
etl_from_metastore(sc, sql_context, db_prefix, table_prefix, hive_metastore, options)
File "/home/hadoop/hive_metastore_migration.py", line 1414, in etl_from_metastore
.transform(hive_metastore)
File "/home/hadoop/hive_metastore_migration.py", line 753, in transform
ms_database_params=hive_metastore.ms_database_params)
File "/home/hadoop/hive_metastore_migration.py", line 734, in transform_databases
dbs_with_params = self.join_with_params(df=ms_dbs, df_params=ms_database_params, id_col='DB_ID')
File "/home/hadoop/hive_metastore_migration.py", line 336, in join_with_params
df_params_map = self.transform_params(params_df=df_params, id_col=id_col)
File "/home/hadoop/hive_metastore_migration.py", line 314, in transform_params
return self.kv_pair_to_map(params_df, id_col, key, value, 'parameters')
File "/home/hadoop/hive_metastore_migration.py", line 326, in kv_pair_to_map
id_type = df.get_schema_type(id_col)
File "/home/hadoop/hive_metastore_migration.py", line 199, in get_schema_type
return df.select(column_name).schema.fields[0].dataType
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 1327, in select
AttributeError: 'str' object has no attribute '_jdf'
Suspecting a Spark version issue, I also tried EMR v6.5 with Spark v3.1.2 and got the same error.
Which Spark and EMR versions has this script run successfully on?
I launch spark-submit as described in the README, with the --jdbc* options changed as needed.