[SPARK-28907][CORE] Review invalid usage of new Configuration() by advancedxy · Pull Request #25616 · apache/spark

advancedxy · 2019-08-29T07:35:40Z

What changes were proposed in this pull request?

Replaces some incorrect usages of new Configuration(), since the zero-argument constructor loads the default configs defined in Hadoop.

Why are the changes needed?

An unexpected config could be accessed instead of the expected one; see SPARK-28203 for an example.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

advancedxy · 2019-08-29T07:37:41Z

core/src/main/scala/org/apache/spark/input/WholeTextFileRecordReader.scala

    if (r) {
-      this.curReader.asInstanceOf[HConfigurable].setConf(getConf)
+      if (getConf != null) {
+        this.curReader.asInstanceOf[HConfigurable].setConf(getConf)


This is needed because initNextRecordReader can be called from the constructor, at which point getConf is still null.

We also have to override setConf so that the conf is set for the first reader.
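The null guard and the setConf override described above can be sketched as follows. This is a minimal, dependency-free illustration and not the code in the PR: SimpleConfigurable, InnerReader, CombineLikeReader and ConfigurableCombineLikeReader are hypothetical stand-ins for the Hadoop and Spark classes, and a plain Map stands in for a Hadoop Configuration.

```scala
// Stand-in for Hadoop's Configurable interface (hypothetical, not the real API).
trait SimpleConfigurable {
  private var conf: Map[String, String] = null // stays null until setConf is called
  def setConf(c: Map[String, String]): Unit = { conf = c }
  def getConf: Map[String, String] = conf
}

// Hypothetical per-file reader that expects to be handed a configuration.
class InnerReader extends SimpleConfigurable

// Stand-in for a combine-style base reader: its constructor already
// creates the first inner reader.
abstract class CombineLikeReader {
  var curReader: InnerReader = null

  protected def initNextRecordReader(): Boolean = {
    curReader = new InnerReader
    true
  }

  // Runs during construction, before any caller can invoke setConf.
  initNextRecordReader()
}

class ConfigurableCombineLikeReader extends CombineLikeReader with SimpleConfigurable {
  override protected def initNextRecordReader(): Boolean = {
    val r = super.initNextRecordReader()
    // Guard: when called from the base constructor, getConf is still null,
    // so there is nothing to propagate yet.
    if (r && getConf != null) {
      curReader.setConf(getConf)
    }
    r
  }

  // Hand the conf to the reader that was created inside the constructor.
  override def setConf(c: Map[String, String]): Unit = {
    super.setConf(c)
    if (curReader != null) curReader.setConf(c)
  }
}
```

In this sketch the first inner reader has no conf right after construction and receives one only once setConf is called on the outer reader, which is the ordering the two comments above describe.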

advancedxy · 2019-08-29T07:39:12Z

cc @gatorsmile.

Can one of the admins verify this patch?

Ah, would you please add me to the white list.

joshrosen-stripe · 2019-08-30T00:51:51Z

(Drive-by comment; not a serious review):

Is it worth adding a Scalastyle regex check to ban the zero-argument Configuration constructor? See the existing Scalastyle XML configurations for examples of how this was done for other similar API (mis)uses. This would prevent re-occurrence of this problem in new code.

dongjoon-hyun · 2019-08-30T01:07:05Z

ok to test

SparkQA · 2019-08-30T02:53:10Z

Test build #109928 has finished for PR 25616 at commit 50d6d75.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

advancedxy

Is it worth adding a Scalastyle regex check to ban the zero-argument Configuration constructor? See the existing Scalastyle XML configurations for examples of how this was done for other similar API (mis)uses. This would prevent re-occurrence of this problem in new code.

@joshrosen-stripe This is a good point. Two questions remain open:

new Configuration() is mostly used in Suite/Test files. Can we skip these files in the Scalastyle XML?
new Configuration() is valid and should be called in some places. How can we whitelist those calls, for example by allowing them only in certain classes?
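To make the idea under discussion concrete, here is a rough sketch of what such a regex check would match and how a whitelist could work. It is only a plain Scala model, not a Scalastyle rule: the pattern, the object name ConfigurationCtorCheck and the rule id hadoopconfiguration are assumptions made for illustration. The off/on markers are modeled on Scalastyle's comment filters, which are one possible answer to the whitelisting question.

```scala
import scala.collection.mutable.ListBuffer
import scala.util.matching.Regex

object ConfigurationCtorCheck {
  // Hypothetical pattern: matches only the zero-argument constructor,
  // so new Configuration(false) and new Configuration(conf) are not flagged.
  val zeroArgCtor: Regex = """new\s+Configuration\(\s*\)""".r

  // Suppression markers modeled on Scalastyle's off/on comment filters.
  // The rule id "hadoopconfiguration" is made up for this sketch.
  val Off = "// scalastyle:off hadoopconfiguration"
  val On = "// scalastyle:on hadoopconfiguration"

  /** Returns the 1-based line numbers that use the zero-argument
    * constructor outside of an off/on region. */
  def violations(source: String): List[Int] = {
    val hits = ListBuffer.empty[Int]
    var enabled = true
    for ((line, idx) <- source.split("\n").zipWithIndex) {
      line.trim match {
        case Off => enabled = false
        case On => enabled = true
        case _ =>
          if (enabled && zeroArgCtor.findFirstIn(line).isDefined) {
            hits += (idx + 1)
          }
      }
    }
    hits.toList
  }
}
```

With this model, call sites where new Configuration() is intentional would be wrapped in the off/on comments rather than being exempted per class or per file.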

advancedxy · 2019-08-30T06:37:02Z

...re/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileWholeTextReader.scala

    val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
    val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId)
    val reader = new WholeTextFileRecordReader(fileSplit, hadoopAttemptContext, 0)
+    reader.setConf(hadoopAttemptContext.getConfiguration)


WholeTextFileRecordReader is Configurable, so setConf should be called after creation.
This missing call is why the tests were failing before this change.

However, for org.apache.spark.input.WholeTextFileRecordReader and org.apache.spark.input.ConfigurableCombineFileRecordReader, we can already retrieve the config from org.apache.hadoop.mapreduce.TaskAttemptContext, so there is no need to make these classes Configurable.

I am wondering whether we should remove the Configurable trait from the related classes all at once. What do you think, @gatorsmile?
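The alternative raised here, reading the config from the task attempt context instead of relying on a separate setConf call, might look roughly like this. It is a dependency-free sketch under stated assumptions: FakeConf, FakeTaskAttemptContext and ContextBackedReader are hypothetical stand-ins, not Hadoop or Spark types, and only the getConfiguration accessor mirrors the real TaskAttemptContext.

```scala
// Stand-in for a Hadoop Configuration (hypothetical).
final case class FakeConf(values: Map[String, String])

// Stand-in for TaskAttemptContext; only getConfiguration is modeled.
final class FakeTaskAttemptContext(conf: FakeConf) {
  def getConfiguration: FakeConf = conf
}

// Hypothetical reader that takes its conf from the context it is built with,
// so there is no setConf step for a caller to forget.
class ContextBackedReader(context: FakeTaskAttemptContext) {
  private val conf: FakeConf = context.getConfiguration

  /** Looks up a setting from the job-supplied configuration. */
  def setting(key: String): Option[String] = conf.values.get(key)
}
```

The design point is that the conf is available from construction onward, which removes both the null window in the constructor and the need for callers to remember setConf.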

Is there an existing test that fails without this change, as you mention? Should it be re-enabled?

Some tests in WholeTextFileSuite and SaveLoadSuite are failing without this change.

However, the failure is introduced by my change to WholeTextFileRecordReader

spark/core/src/main/scala/org/apache/spark/input/WholeTextFileRecordReader.scala

Lines 70 to 73 in 149de72

    override def nextKeyValue(): Boolean = {
      if (!processed) {
        val conf = getConf
        val factory = new CompressionCodecFactory(conf)

Since we now use getConf instead of new Configuration, setConf must be called first.

I see, they're not failing in master but can fail if run in an env where Hadoop config files are present?

I see, they're not failing in master but can fail if run in an env where Hadoop config files are present?

Eh, yes, they are not failing in master. The code in master normally won't even fail in an env where Hadoop configs are present. It could fail or produce unexpected results only if the Hadoop configs are incorrectly configured in the executor env (such as yarn-cluster), even when the user supplies correct configs (passed to TaskAttemptContext).

SparkQA · 2019-08-30T07:05:02Z

Test build #109937 has finished for PR 25616 at commit 149de72.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

srowen

This seems reasonable to me.

SparkQA · 2019-08-31T17:52:01Z

Test build #4851 has finished for PR 25616 at commit 149de72.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-09-05T00:52:27Z

Merged to master

[SPARK-28907] Review invalid usage of new Configuration()

50d6d75

dongjoon-hyun changed the title ~~[SPARK-28907] Review invalid usage of new Configuration()~~ [SPARK-28907][CORE] Review invalid usage of new Configuration() Aug 30, 2019

dongjoon-hyun added the SPARK CORE label Aug 30, 2019

call setConf for WholeTextFileReader to avoid NPE.

149de72

srowen reviewed Aug 31, 2019

srowen closed this in ca71177 Sep 5, 2019

advancedxy deleted the remove_invalid_configuration branch September 5, 2019 02:26
