Skip to content

Conversation

@turboFei
Copy link
Member

@turboFei turboFei commented Aug 11, 2025

Which issue does this PR close?

Closes #1149

Similar issue with gluten we met before, apache/incubator-gluten#9156

Rationale for this change

The purpose of this change is to ensure that the shuffle data files allowing read access for the others to fix shuffle fetch fail.

The error message on NodeManager:

Caused by: java.io.FileNotFoundException: /hadoop/2/yarn/local/usercache/b_stf/appcache/application_1754277161261_3068012/blockmgr-62f53ceb-e9f1-4ac8-bbc7-7442b47d6990/2c/shuffle_0_114_0.data (Permission denied)
	at java.io.RandomAccessFile.open0(Native Method)
	at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
	at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
	at org.sparkproject.io.netty.channel.DefaultFileRegion.open(DefaultFileRegion.java:88)
	at org.sparkproject.io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:128)

Set the shuffle file permissions to 0644 to keep it consistent with the permissions of the built-in shuffler manager in Spark.

  1. https://github.com/apache/spark/blob/549c30acfebdc0f72916448a4cbd1dab6ff9b760/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L87
image
  1. https://github.com/apache/spark/blob/549c30acfebdc0f72916448a4cbd1dab6ff9b760/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L231
image

Related spark PR: apache/spark#35085

What changes are included in this PR?

Are there any user-facing changes?

No.

How was this patch tested?

Have tested in our cluster, the file permission looks good now.

It can fetch the shuffle data after this patch.
image

ls -l  /hadoop/2/yarn/local/usercache/b_stf/appcache/application_1754277161261_3329570/blockmgr-ddbec8d5-06d4-41e9-91dd-e07c2d79800d/3c/
total 259048
-rw-r--r-- 1 b_stf b_dpms     32008 Aug 11 11:59 shuffle_0_17718_0.index
-rw-r--r-- 1 b_stf b_dpms     32008 Aug 11 11:50 shuffle_0_20_0.index
-rw-r--r-- 1 b_stf b_dpms 265166734 Aug 11 12:02 shuffle_0_28109_0.data
-rw-r--r-- 1 b_stf b_dpms     32008 Aug 11 12:02 shuffle_0_28109_0.index.tmp

@turboFei turboFei changed the title Fix shuffle file permission issue when using BlazeShuffleManager [BLAZE-1149] Fix shuffle file permission issue when using BlazeShuffleManager Aug 11, 2025
@turboFei turboFei marked this pull request as draft August 11, 2025 05:45
@turboFei turboFei marked this pull request as ready for review August 11, 2025 05:54
@turboFei turboFei marked this pull request as draft August 11, 2025 15:35
@turboFei turboFei force-pushed the fix_permission branch 2 times, most recently from e2618ac to 5c28632 Compare August 11, 2025 17:04

// Set the shuffle file permissions to 0644 to keep it consistent with the permissions of
// the built-in shuffler manager in Spark.
std::fs::set_permissions(path_ref, std::fs::Permissions::from_mode(0o644))?;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

below code would be impacted by umask.

OpenOptions::new()
    .write(true)
    .create(true)
    .truncate(true)
    .mode(0o644)
    .open(path_ref)?;

So, we have to use std::fs::set_permissions

@turboFei turboFei marked this pull request as ready for review August 11, 2025 19:05
@turboFei
Copy link
Member Author

I have tested in our cluster, it can fetch the shuffle data after this patch.
image

cc @richox

@turboFei
Copy link
Member Author

gentle ping @richox

@richox
Copy link
Contributor

richox commented Aug 20, 2025

i didnt see this setPermission logics in spark. so why do we need a different implementation here?

@turboFei
Copy link
Member Author

i didnt see this setPermission logics in spark. so why do we need a different implementation here?

cc @wangyum

@turboFei
Copy link
Member Author

turboFei commented Aug 20, 2025

@turboFei
Copy link
Member Author

Have updated the PR description.

@wangyum
Copy link
Member

wangyum commented Aug 20, 2025

i didnt see this setPermission logics in spark. so why do we need a different implementation here?

The directory permission created by the c++/rust method in a specific file system or operating system may be inconsistent with the directory permissions created by Spark through Java.

@richox richox merged commit 5794753 into apache:master Aug 20, 2025
73 checks passed
@turboFei turboFei deleted the fix_permission branch August 20, 2025 13:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix shuffle file permission issue when using BlazeShuffleManager

3 participants