Skip to content

Conversation

@echohlne
Copy link
Contributor

@echohlne echohlne commented Nov 20, 2020

What changes were proposed in this pull request?

To make sure the sensitive attributes to be redacted in the history server log.

Why are the changes needed?

We found the secure attributes like password in SparkListenerJobStart and SparkListenerStageSubmitted events would not been redated, resulting in sensitive attributes can be viewd directly.
The screenshot can be viewed in the attachment of JIRA spark-33504

Does this PR introduce any user-facing change?

no

How was this patch tested?

muntual test works well, I have also added unit testcase.

@github-actions github-actions bot added the CORE label Nov 20, 2020
@mridulm
Copy link
Contributor

mridulm commented Nov 20, 2020

Redaction support looks kind of inconsistent.
For ApplicationEnvironmentInfo, the event file contains unredacted event with the api redacting on demand.
On other hand, SparkListenerEnvironmentUpdate is redacted before getting logged into event file.

The current pr takes the approach of SparkListenerEnvironmentUpdate - and logs redacted events.
Personally I am inclined to have source of truth in the event file, and expose redacted view.

But then, all of these was last changed 4 years or so back (#15971).
Any thoughts @vanzin, @tgravescs ?

@mridulm
Copy link
Contributor

mridulm commented Nov 20, 2020

ok to test

@SparkQA
Copy link

SparkQA commented Nov 20, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36047/

@SparkQA
Copy link

SparkQA commented Nov 20, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36047/

@SparkQA
Copy link

SparkQA commented Nov 21, 2020

Test build #131441 has finished for PR 30446 at commit 33b6e24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 21, 2020

Test build #131466 has finished for PR 30446 at commit f03fdef.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 21, 2020

Test build #131467 has finished for PR 30446 at commit fdb305c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 21, 2020

Test build #131468 has finished for PR 30446 at commit 26440c3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 21, 2020

Test build #131469 has finished for PR 30446 at commit 84a1a28.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 21, 2020

Test build #131471 has finished for PR 30446 at commit a73f527.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 21, 2020

Test build #131473 has finished for PR 30446 at commit 1a4e5c2.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 22, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36103/

@SparkQA
Copy link

SparkQA commented Nov 22, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36103/

@SparkQA
Copy link

SparkQA commented Nov 22, 2020

Test build #131499 has finished for PR 30446 at commit effbead.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@echohlne
Copy link
Contributor Author

@mridulm thanks for your great advice, Please review it again when you have time.

@tgravescs
Copy link
Contributor

So both of the configs in the jira look like they are user generated configs. It sounds like you updated the spark.redaction.regex config but that didn't work for these events. It seems like we just missed these, so thanks for catching this.

I think the current approach makes sense and matches our previous implementations.

@SparkQA
Copy link

SparkQA commented Nov 24, 2020

Test build #131591 has finished for PR 30446 at commit effbead.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mridulm
Copy link
Contributor

mridulm commented Nov 25, 2020

@tgravescs The question I had was regarding when redaction is applied - while logging the event or when surfacing in UI/cli/etc.
Only for the case of EventLoggingListener.onEnvironmentUpdate we are redacting event file itself - in all other cases of application of redaction regex, we do not modify the source (event file, spark map, etc), but rather change when presenting to user.

I am not sure if this was done intentionally (allowing superuser to look at the unredact properties - and so onEnvironmentUpdate is a bug) or was an oversight (in which case all properties should be redacted before event logging).

Thoughts ? Thanks.

@tgravescs
Copy link
Contributor

sorry for my delay, behind on reviews.

I think we should redact from both ui/rest or from event logging as people may not realize event logging has the password and history server could allow viewing from people who shouldn't see those. If people want it differently I would say we add a config to allow it but have it off by default - but I would wait until someone requests this.

I think it was an oversight that these events weren't redacted. so if you see other cases we aren't we should fix them.

@mridulm
Copy link
Contributor

mridulm commented Dec 1, 2020

Thanks for clarifying Tom ! Sounds good to me.
I am fine with this patch given @tgravescs's comments above.

We can have follow up work to redact from other events as well - at event logging time.

@echohlne
Copy link
Contributor Author

echohlne commented Dec 2, 2020

@mridulm @tgravescs thanks for your review!

@echohlne
Copy link
Contributor Author

echohlne commented Dec 2, 2020

@mridulm Does it need someone else to continue reviewing, or can someone help me merge the code? thanks!

@tgravescs
Copy link
Contributor

I will merge this shortly

@tgravescs tgravescs changed the title [spark-33504] The application log in the Spark history server contains sensitive attributes should be redated [SPARK-33504][webui] The application log in the Spark history server contains sensitive attributes should be redacted Dec 2, 2020
@tgravescs tgravescs changed the title [SPARK-33504][webui] The application log in the Spark history server contains sensitive attributes should be redacted [SPARK-33504][CORE] The application log in the Spark history server contains sensitive attributes should be redacted Dec 2, 2020
@asfgit asfgit closed this in 28dad1b Dec 2, 2020
asfgit pushed a commit that referenced this pull request Dec 2, 2020
…ontains sensitive attributes should be redacted

### What changes were proposed in this pull request?
To make sure the sensitive attributes to be redacted in the history server log.

### Why are the changes needed?
We found the secure attributes like password  in SparkListenerJobStart and SparkListenerStageSubmitted events would not been redated, resulting in sensitive attributes can be viewd directly.
The screenshot can be viewed in the attachment of JIRA spark-33504
### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
muntual test works well, I have also added unit testcase.

Closes #30446 from akiyamaneko/eventlog_unredact.

Authored-by: neko <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
(cherry picked from commit 28dad1b)
Signed-off-by: Thomas Graves <[email protected]>
@tgravescs
Copy link
Contributor

thanks @akiyamaneko @mridulm merged to master and branch-3.0

@dongjoon-hyun
Copy link
Member

Hi, @tgravescs .
There is a compilation error in branch-3.0.

[error] /home/runner/work/spark/spark/core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala:118: not found: value resourceProfileId
[error] Error occurred in an application involving default arguments.
[error]       resourceProfileId = ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID)
[error]       ^

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Dec 2, 2020

Could you take a look, please? I guess we need to recover the branch-3.0 first by reverting it and make a PR to branch-3.0 to pass all CIs. But, it's up to you. Thanks in advance.

@tgravescs
Copy link
Contributor

yeah lets revert first

@tgravescs
Copy link
Contributor

#30576 is revert

@akiyamaneko could you port this to branch-3.0 and put up a PR?

asfgit pushed a commit that referenced this pull request Dec 2, 2020
…server contains sensitive attributes should be redacted"

### What changes were proposed in this pull request?

Revert SPARK-33504 on branch-3.0 compilation error. Original PR #30446

This reverts commit e59179b.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #30576 from tgravescs/revert33504.

Authored-by: Thomas Graves <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
@tgravescs
Copy link
Contributor

@akiyamaneko do you have time to back port?

@echohlne
Copy link
Contributor Author

@tgravescs sorry for delay, Is there anything I can do?

@tgravescs
Copy link
Contributor

I don't think anyone back ported it yet so if you have time it would be great to put up a version again branch-3.0

viirya added a commit that referenced this pull request Feb 24, 2021
…ver contains sensitive attributes should be redacted

### What changes were proposed in this pull request?

To make sure the sensitive attributes to be redacted in the history server log. This is the backport of original PR #30446.

### Why are the changes needed?

We found the secure attributes like password  in SparkListenerJobStart and SparkListenerStageSubmitted events would not been redated, resulting in sensitive attributes can be viewd directly.

The screenshot can be viewed in the attachment of JIRA Spark-33504

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #31631 from viirya/SPARK-33504-3.0.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
rshkv pushed a commit to palantir/spark that referenced this pull request Mar 9, 2021
…ver contains sensitive attributes should be redacted

### What changes were proposed in this pull request?

To make sure the sensitive attributes to be redacted in the history server log. This is the backport of original PR apache#30446.

### Why are the changes needed?

We found the secure attributes like password  in SparkListenerJobStart and SparkListenerStageSubmitted events would not been redated, resulting in sensitive attributes can be viewd directly.

The screenshot can be viewed in the attachment of JIRA Spark-33504

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes apache#31631 from viirya/SPARK-33504-3.0.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants