
Conversation

@MaxGekk (Member) commented Feb 16, 2021

What changes were proposed in this pull request?

Remove .internal() from the SQL legacy config spark.sql.legacy.replaceDatabricksSparkAvro.enabled.

Why are the changes needed?

In fact, the SQL config spark.sql.legacy.replaceDatabricksSparkAvro.enabled has already been documented publicly, see http://spark.apache.org/docs/latest/sql-data-sources-avro.html. So, it cannot be considered an internal config.
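For context, a minimal sketch of the user-visible behavior this config controls (the file path is hypothetical, and the built-in but external Avro module must be on the classpath, e.g. via --packages org.apache.spark:spark-avro_2.12):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("avro-legacy-demo")
  .master("local[*]")
  .getOrCreate()

// The documented default is true: the legacy Databricks provider name is
// mapped to the built-in Avro data source, so old code keeps working
// without the third-party spark-avro library.
spark.conf.set("spark.sql.legacy.replaceDatabricksSparkAvro.enabled", true)
val df = spark.read.format("com.databricks.spark.avro").load("/tmp/users.avro")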

Does this PR introduce any user-facing change?

This updates the list of auto-generated SQL configs.

How was this patch tested?

By generating docs:

$ ./sql/create-docs.sh
$ cd docs
$ SKIP_API=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch

The github-actions bot added the SQL label on Feb 16, 2021
@MaxGekk (Member, Author) commented Feb 16, 2021

There are other legacy configs that are non-internal:

  • spark.sql.legacy.allowHashOnMapType
  • spark.sql.legacy.sessionInitWithConfigDefaults

Though I am not sure why the configs above are non-internal; it may be a mistake, since they are not mentioned in the public docs. @HyukjinKwon WDYT?
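For reference, the internal/non-internal distinction is a single call in the config builder DSL used inside object SQLConf; a rough sketch, with made-up keys, doc strings, and versions:

// Sketch of the DSL used in object SQLConf; keys, docs, and versions are placeholders.
val LEGACY_INTERNAL_EXAMPLE = buildConf("spark.sql.legacy.someInternalFlag")
  .internal()                      // hidden from the auto-generated public docs
  .doc("Placeholder description.")
  .version("3.2.0")
  .booleanConf
  .createWithDefault(false)

val LEGACY_PUBLIC_EXAMPLE = buildConf("spark.sql.legacy.somePublicFlag")
  .doc("Placeholder description.") // no .internal(): listed in the generated docs
  .version("3.2.0")
  .booleanConf
  .createWithDefault(false)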

@MaxGekk (Member, Author) commented Feb 16, 2021

@HyukjinKwon @gengliangwang @cloud-fan Could you take a look at this, please?

@SparkQA commented Feb 16, 2021

Test build #135169 has started for PR 31571 at commit c43224e.

@dongjoon-hyun (Member) left a comment

Given that this has been there since 2.4.0, I'm not sure why Apache Spark 3.2.0 suddenly requires this. I'm a little reluctant to advertise this option more explicitly on the Apache Spark side.

This additional exposure puts us in a more difficult situation when we want to remove this completely from the Apache Spark codebase in the future. Eventually, the Apache Spark Avro data source should be the best and the de facto standard.

@MaxGekk (Member, Author) commented Feb 16, 2021

> Given that this has been there since 2.4.0, I'm not sure why Apache Spark 3.2.0 suddenly requires this.

FYI, I have found this config while documenting other "internal" configs, see #31564 (review)

> I'm a little reluctant to advertise this option more explicitly on the Apache Spark side.

It has already been explicitly advertised in the public docs:
http://spark.apache.org/docs/latest/sql-data-sources-avro.html#compatibility-with-databricks-spark-avro
http://spark.apache.org/docs/latest/sql-data-sources-avro.html#configuration

> This additional exposure puts us in a more difficult situation when we want to remove this completely from the Apache Spark codebase in the future.

Do you envision a situation where you could just remove this config without deprecation, only because it is marked as .internal()?

From my point of view, even if the config is "internal" de jure, it is external de facto. In this situation, it cannot just be removed silently from users. I believe we should make it external de jure too, and then remove it in the future only via deprecation; otherwise we could potentially break user apps. @HyukjinKwon @cloud-fan WDYT?

@dongjoon-hyun (Member) commented Feb 16, 2021

> Do you envision a situation where you could just remove this config without deprecation, only because it is marked as .internal()?

No, I don't. Removing a config w/o deprecation sounds wrong.

> it is external de facto.

No. I disagree with the expression "de facto". The documentation change is far easier than removing a non-internal config.

As https://github.com/databricks/spark-avro says, only Databricks customers can use that library directly. For all other distributions, especially 3.2.0+ versions, I don't think this PR makes sense.

> Databricks customers can also use this library directly on the Databricks Unified Analytics Platform without any additional dependency configurations.

Review thread on these lines of the config definition:

    .createWithDefault(false)

    val LEGACY_REPLACE_DATABRICKS_SPARK_AVRO_ENABLED =
      buildConf("spark.sql.legacy.replaceDatabricksSparkAvro.enabled")

@HyukjinKwon (Member) commented Feb 17, 2021
@gengliangwang do we need this conf? We already do the same thing for com.databricks.spark.csv by default, internally. We could just expose it and deprecate this config.
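For context, the fallback mentioned here lives in Spark's provider resolution (DataSource.lookupDataSource); a simplified sketch of the mechanism, with the method shape approximated:

// Simplified sketch: the CSV mapping is unconditional, while the Avro one is
// gated behind the legacy config (approximation of DataSource.lookupDataSource).
val backwardCompatibilityMap: Map[String, String] = Map(
  "com.databricks.spark.csv" ->
    "org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")

def lookupProvider(provider: String, replaceDatabricksAvro: Boolean): String =
  backwardCompatibilityMap.getOrElse(provider, provider) match {
    case "com.databricks.spark.avro" if replaceDatabricksAvro =>
      "org.apache.spark.sql.avro.AvroFileFormat"
    case other => other
  }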

Member commented:

In this way, I think it will address most of the concerns raised.

Member commented:

And then we could remove both the com.databricks.spark.csv and com.databricks.spark.avro fallbacks together in the future, of course (Spark 4.0?). I don't think it makes sense to keep both mappings forever.

Contributor commented:

Let's deprecate it and then remove it in later versions. It's OK if a deprecated internal config is mentioned in the public docs for migration purposes.

@MaxGekk (Member, Author) commented:

Here is the PR #31578 to deprecate the config.
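For the curious, deprecating a SQL config amounts to adding one entry to SQLConf's deprecation registry; a hedged sketch of the shape of that change (the message text is illustrative, not the exact wording from #31578):

// Sketch modeled on SQLConf's deprecated-config registry.
case class DeprecatedConfig(key: String, version: String, comment: String)

val deprecatedSQLConfigs: Map[String, DeprecatedConfig] = Seq(
  DeprecatedConfig(
    "spark.sql.legacy.replaceDatabricksSparkAvro.enabled", "3.2",
    "Use `.format(\"avro\")` in `DataFrameWriter` or `DataFrameReader` instead."))
  .map(cfg => cfg.key -> cfg).toMap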

Member commented:

Thank you all!

@cloud-fan (Contributor) commented:

> There are other legacy configs that are non-internal:

All legacy configs should be internal. These 2 are very likely mistakes.

@MaxGekk (Member, Author) commented Feb 17, 2021

> All legacy configs should be internal. These 2 are very likely mistakes.

We already marked all legacy configs as internal before the 3.0 release, see #27448. Those 2 configs were probably added after that commit.

I opened the PR #31577 with a test that checks that all legacy SQL configs are internal.
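Roughly, such a test can be sketched like this (a suite-style check, assuming getAllDefinedConfs only lists non-internal configs, so any legacy key appearing there is a bug):

// Sketch of a check inside a SparkFunSuite-style test; assumes a SparkSession
// named `spark` and that getAllDefinedConfs returns only non-internal configs.
test("all legacy SQL configs are internal") {
  val nonInternalLegacyConfigs = spark.sessionState.conf.getAllDefinedConfs
    .filter { case (key, _, _, _) => key.startsWith("spark.sql.legacy.") }
  assert(nonInternalLegacyConfigs.isEmpty,
    "legacy SQL configs must be marked as internal: " +
      nonInternalLegacyConfigs.map(_._1).mkString(", "))
}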

@dongjoon-hyun (Member) commented:

@MaxGekk, shall we close this PR because we have #31577 and #31578?

@dongjoon-hyun (Member) commented:

Since #31578 is merged, I'll close this one.
