
Conversation

ouyangxiaochen commented Oct 24, 2018

What changes were proposed in this pull request?

The cleanup mechanism clears every eligible directory under SPARK_WORK_DIR. If another configured path points to the same location as SPARK_WORK_DIR, the directories belonging to that other setting will be deleted by mistake, for example when SPARK_LOCAL_DIRS and SPARK_WORK_DIR are set to the same path.

We should add another condition, requiring that the directory name starts with 'app-', when removing the expired app-* directories from SPARK_WORK_DIR.
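
For clarity, here is a minimal sketch of the intended check, assuming the cleanup walks the children of the work dir and deletes the stale ones; the names `workDir`, `runningAppIds`, and `isExpired` are illustrative placeholders, not the exact identifiers used in Worker.scala:

```scala
import java.io.File

// Minimal sketch (not the exact Worker.scala code): select stale application
// directories under the work dir. The proposed fix adds the startsWith("app-")
// guard so that only application directories are cleanup candidates.
def staleAppDirs(workDir: File, runningAppIds: Set[String], isExpired: File => Boolean): Seq[File] = {
  val children = Option(workDir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
  children.filter { dir =>
    dir.isDirectory &&
      dir.getName.startsWith("app-") &&       // proposed extra condition
      !runningAppIds.contains(dir.getName) && // do not touch running applications
      isExpired(dir)                          // older than the configured retention period
  }
}
```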

How was this patch tested?

manual test

ouyangxiaochen changed the title from "[SPARK-25818][CORE] WorkDirCleanup should only remove the directory at the beginning of t…" to "[SPARK-25818][CORE] WorkDirCleanup should only remove the directory at the beginning of the app" Oct 24, 2018
AmplabJenkins commented

Can one of the admins verify this patch?

jiangxb1987 (Contributor) commented Oct 24, 2018

IIUC, SPARK_WORK_DIR is not expected to be shared with any other usage. Does this happen in a real environment, and is it hard to work around the limitation?

ouyangxiaochen (Author) commented Oct 25, 2018

@jiangxb1987 Yes, it happened in our real environment.
The scenario is as follows:
Disk corruption happens in the production cluster, which is normal.
SPARK_LOCAL_DIRS = /data1/bigdata/spark/tmp. Disk data1 broke, so the maintenance engineer changed data1 to data2 (or another disk). Unfortunately, SPARK_WORK_DIR = /data2/bigdata/spark/tmp. When we then start a Thriftserver process, some temporary folders are created under the new path, which is now the same as SPARK_WORK_DIR. When the cleanup cycle is reached, the folders created by the Thriftserver are removed by WorkDirCleanup, so Beeline and JDBC queries fail.
There is also a very extreme situation where the user configures an operating system directory, which would cause a lot of trouble. So I think adding this condition could reduce some unnecessary risk.

srowen (Member) commented Oct 27, 2018

I don't think that's a reasonable usage scenario. That said, is there any harm in this change? Would it ever miss cleaning up something that it should?

dongjoon-hyun (Member) commented

@ouyangxiaochen Sorry, but the use case sounds like a misconfiguration.

ouyangxiaochen (Author) commented

As far as I know, when a Spark program is submitted to the cluster, a directory is created under SPARK_WORK_DIR. The directory name consists of an application prefix, a timestamp, and a five-digit serial number. WorkDirCleanup should only delete expired application directories. @srowen Could you tell me whether any other types of directories or files are created under SPARK_WORK_DIR, in addition to the application directories? Thanks!
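
For illustration only, a stricter name check could look like the sketch below; the pattern is an assumption based on the naming convention described above (app-<timestamp>-<serial>), not copied from the Spark source:

```scala
// Hypothetical sketch: match names of the form "app-<timestamp>-<serial>",
// e.g. "app-20181024120000-00001". The exact digit widths are assumptions.
val appDirPattern = """app-\d+-\d+""".r

def looksLikeAppDir(name: String): Boolean =
  appDirPattern.pattern.matcher(name).matches()
```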

ouyangxiaochen (Author) commented

@dongjoon-hyun Yes, you could put it that way. I want to solve this problem on the Spark side to reduce the risk from mistakes by operations and maintenance engineers. WorkDirCleanup should only be responsible for cleaning up the directories that it generates itself.

srowen (Member) commented Oct 29, 2018

I don't know what else goes in the work dir. It isn't valid to reuse it for anything else. Can you simply avoid using a work dir that is or has been used by something else?

The argument for making this change anyway is that the code should delete only what it writes. But I am not sure it can be something you rely on for correct behavior.

ouyangxiaochen (Author) commented

@srowen Your suggestion is very good, but sometimes the maintenance engineers have limited skills in this area. If they configure the operating system root directory as SPARK_WORK_DIR after disk damage, it will cause a catastrophic accident. So I think it is necessary to add this condition to avoid such production accidents.

srowen (Member) commented Nov 1, 2018

I just think that if you have engineers randomly writing and reading stuff in this dir, a bunch of other stuff goes wrong. This is not a problem that Spark can reasonably solve. Certainly, you have much bigger production problems if this level of discipline can't be enforced.

srowen mentioned this pull request Nov 10, 2018
asfgit closed this in a3ba3a8 Nov 11, 2018
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
Closes apache#21766
Closes apache#21679
Closes apache#21161
Closes apache#20846
Closes apache#19434
Closes apache#18080
Closes apache#17648
Closes apache#17169

Add:
Closes apache#22813
Closes apache#21994
Closes apache#22005
Closes apache#22463

Add:
Closes apache#15899

Add:
Closes apache#22539
Closes apache#21868
Closes apache#21514
Closes apache#21402
Closes apache#21322
Closes apache#21257
Closes apache#20163
Closes apache#19691
Closes apache#18697
Closes apache#18636
Closes apache#17176

Closes apache#23001 from wangyum/CloseStalePRs.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
