Skip to content

Comments

[SPARK-28907][CORE] Review invalid usage of new Configuration()#25616

Closed
advancedxy wants to merge 2 commits intoapache:masterfrom
advancedxy:remove_invalid_configuration
Closed

[SPARK-28907][CORE] Review invalid usage of new Configuration()#25616
advancedxy wants to merge 2 commits intoapache:masterfrom
advancedxy:remove_invalid_configuration

Conversation

@advancedxy
Copy link
Contributor

What changes were proposed in this pull request?

Replaces some incorrect usage of new Configuration() as it will load default configs defined in Hadoop

Why are the changes needed?

Unexpected config could be accessed instead of the expected config, see SPARK-28203 for example

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existed tests.

if (r) {
this.curReader.asInstanceOf[HConfigurable].setConf(getConf)
if (getConf != null) {
this.curReader.asInstanceOf[HConfigurable].setConf(getConf)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed because initNextRecordReader could be called in the constructor, which getConf would be null.

We have to override setConf too to set conf for the first reader.

@advancedxy
Copy link
Contributor Author

cc @gatorsmile.

Can one of the admins verify this patch?

Ah, would you please add me to the white list.

@joshrosen-stripe
Copy link
Contributor

joshrosen-stripe commented Aug 30, 2019

(Drive-by comment; not a serious review):

Is it worth adding a Scalastyle regex check to ban the zero-argument Configuration constructor? See the existing Scalastyle XML configurations for examples of how this was done for other similar API (mis)uses. This would prevent re-occurrence of this problem in new code.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-28907] Review invalid usage of new Configuration() [SPARK-28907][CORE] Review invalid usage of new Configuration() Aug 30, 2019
@dongjoon-hyun
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Aug 30, 2019

Test build #109928 has finished for PR 25616 at commit 50d6d75.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor Author

@advancedxy advancedxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth adding a Scalastyle regex check to ban the zero-argument Configuration constructor? See the existing Scalastyle XML configurations for examples of how this was done for other similar API (mis)uses. This would prevent re-occurrence of this problem in new code.

@joshrosen-stripe This is a good point. Two problems remain uncertain.

  1. new Configuration() are mostly used in Suite/Test files, can we skip these files in Scalastyle XMLs?
  2. new Configuration() is valid and should be called in some places. How can we whitelists these calls? such as allowed in certain classes?

val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId)
val reader = new WholeTextFileRecordReader(fileSplit, hadoopAttemptContext, 0)
reader.setConf(hadoopAttemptContext.getConfiguration)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WholeTextFileRecordReader is Configurable, setConf should be called after creation.
This is why tests are failing before this patch.

However, I am wondering for org.apache.spark.input.WholeTextFileRecordReader and org.apache.spark.input.ConfigurableCombineFileRecordReader, we can already retrieve config from org.apache.hadoop.mapreduce.TaskAttemptContext. There is no need to make these class Configurable

I am wondering if we should remove Configurable trait for the related classes all at once. what do you think @gatorsmile

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an existing test that fails without this change, as you mention? should it be reenabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image
Some tests in WholeTextFileSuite and SaveLoadSuite are failing without this change.

However, the failure is introduced by my change to WholeTextFileRecordReader

override def nextKeyValue(): Boolean = {
if (!processed) {
val conf = getConf
val factory = new CompressionCodecFactory(conf)

We use getConf instead of new Configuration, then should call setConf first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, they're not failing in master but can fail if run in an env where Hadoop config files are present?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, they're not failing in master but can fail if run in an env where Hadoop config files are present?I see, they're not failing in master but can fail if run in an env where Hadoop config files are present?

Eh, yes, they are not failing in master. The code(master) even normally won't fail in an env where Hadoop configs are present. They could fail or get unexpected result unless the Hadoop configs are incorrectly configured in executor env(such as yarn-cluster), even user supplies correct configs (passed to TaskAttemptContext

@SparkQA
Copy link

SparkQA commented Aug 30, 2019

Test build #109937 has finished for PR 25616 at commit 149de72.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonble to me.

@SparkQA
Copy link

SparkQA commented Aug 31, 2019

Test build #4851 has finished for PR 25616 at commit 149de72.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen srowen closed this in ca71177 Sep 5, 2019
@srowen
Copy link
Member

srowen commented Sep 5, 2019

Merged to master

@advancedxy advancedxy deleted the remove_invalid_configuration branch September 5, 2019 02:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants