[SPARK-38888][BUILD][CORE][YARN][DOCS] Add RocksDB support for shuffle service state store
#37610
Conversation
Testing first; will update the PR description later.
RocksDB support for shuffle state store
private static final Logger logger = LoggerFactory.getLogger(RocksDBProvider.class);

public static RocksDB initRockDB(File dbFile, StoreVersion version, ObjectMapper mapper) throws
cc @dongjoon-hyun Do you have time to help review the RocksDB part?
mridulm left a comment:
Mostly looks good, except for a couple of comments.
In particular, I have not looked at RocksDBProvider - will let @dongjoon-hyun review that (in addition to the rest of the PR).
Force-pushed from 699530c to 15b2325.
Friendly ping @Ngone51
docs/configuration.md
eventually gets cleaned up. This config may be removed in the future.
</td>
<td>3.0.0</td>
</tr>
This should be in the section at the bottom - the "External Shuffle service (server) side configuration options" section.
Having said that, I was wrong about spark.shuffle.service.db.enabled - it is always enabled in yarn mode - so we cannot control it with the .enabled flag.
The newly introduced spark.shuffle.service.db.backend is relevant though; can we add a blurb that it is relevant for yarn and standalone - with the db enabled by default for yarn - and point to standalone.md for more details on how to configure it for standalone? Thanks
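To make that concrete, here is an illustrative standalone configuration; the property names come from this PR, but the exact documented wording lives in spark-standalone.md, so treat this as a sketch:

```
# spark-defaults.conf (standalone mode) - illustrative sketch only.
# The backend choice only matters once the state store is enabled;
# on yarn the store is always enabled, so only the backend line applies there.
spark.shuffle.service.db.enabled    true
spark.shuffle.service.db.backend    ROCKSDB
```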
Does e124c84 look better?
https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior
Should this table entry be sorted by property name? It doesn't seem to follow that rule.
Actually, that section I mentioned is for push-based shuffle - so it probably does not make sense for our changes here.
There is a section here which looks more relevant for yarn.
We should perhaps move the push-based shuffle blurb to the yarn doc as well ... but let us do that in a later PR.
Move it back to after lines 1000 to 1008 in 6e62b93?

<td><code>spark.shuffle.service.removeShuffle</code></td>
<td>false</td>
<td>
Whether to use the ExternalShuffleService for deleting shuffle blocks for
deallocated executors when the shuffle is no longer needed. Without this enabled,
shuffle data on executors that are deallocated will remain on disk until the
application ends.
</td>
<td>3.3.0</td>
They have been added into spark-standalone.md and running-on-yarn.md. Is it necessary to keep them in configuration.md?
You are right, we need it only in the standalone and yarn docs for now - not in configuration.md.
removed
This reverts commit 4d5f44c.
cc @HeartSaVioR FYI
Thanks @HyukjinKwon
When <code>spark.shuffle.service.db.enabled</code> is true, user can use this to specify the kind of disk-based
store used in shuffle state store. This supports `LEVELDB` and `ROCKSDB` now and `LEVELDB` as default value.
This only affects standalone mode (yarn always has this behavior enabled).
The original data store in `LevelDB/RocksDB` will not be automatically convert to another kind of storage now.
Do you mean the data that is stored in LevelDB is not available after changing to RocksDB, for example?
Yes, there is no function to automatically convert *.ldb to *.rdb right now.
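Since no converter ships with Spark, an operator who really needed to migrate would have to copy the entries by hand. A purely hypothetical one-off sketch, assuming the leveldbjni and rocksdbjni Java APIs - the class and method names below are illustrative and not part of this PR:

```java
import java.io.File;
import java.util.Map;
import org.fusesource.leveldbjni.JniDBFactory;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBIterator;
import org.iq80.leveldb.Options;
import org.rocksdb.RocksDB;

// Hypothetical helper, not part of this PR: copy a LevelDB state store into RocksDB.
public final class StateStoreMigration {
  public static void copyLevelDbToRocksDb(File ldbDir, File rdbDir) throws Exception {
    RocksDB.loadLibrary();
    Options ldbOpts = new Options().createIfMissing(false);
    org.rocksdb.Options rdbOpts = new org.rocksdb.Options().setCreateIfMissing(true);
    try (DB src = JniDBFactory.factory.open(ldbDir, ldbOpts);
         RocksDB dst = RocksDB.open(rdbOpts, rdbDir.getAbsolutePath());
         DBIterator it = src.iterator()) {
      // Both stores are plain byte[] -> byte[] maps, so a verbatim copy suffices.
      for (it.seekToFirst(); it.hasNext(); ) {
        Map.Entry<byte[], byte[]> e = it.next();
        dst.put(e.getKey(), e.getValue());
      }
    }
  }
}
```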
}

@Override
public void remove() {
Wondering why we have a remove() API (with no parameters, even) for the iterator? What's the expected behavior of a possible implementation?
Will remove this - I found

default void remove() {
  throw new UnsupportedOperationException("remove");
}

in java.util.Iterator
done
| logger.error("error opening rocksdb file {}. Creating new file, will not be able to " + | ||
| "recover state for existing applications", dbFile, e); | ||
| if (dbFile.isDirectory()) { | ||
| for (File f : Objects.requireNonNull(dbFile.listFiles())) { |
Is it possible to have a sub-directory here?
The process here refers to LevelDBProvider:
spark/common/network-common/src/main/java/org/apache/spark/network/util/LevelDBProvider.java
Lines 40 to 85 in a27238a
public static DB initLevelDB(File dbFile, StoreVersion version, ObjectMapper mapper) throws
    IOException {
  DB tmpDb = null;
  if (dbFile != null) {
    Options options = new Options();
    options.createIfMissing(false);
    options.logger(new LevelDBLogger());
    try {
      tmpDb = JniDBFactory.factory.open(dbFile, options);
    } catch (NativeDB.DBException e) {
      if (e.isNotFound() || e.getMessage().contains(" does not exist ")) {
        logger.info("Creating state database at " + dbFile);
        options.createIfMissing(true);
        try {
          tmpDb = JniDBFactory.factory.open(dbFile, options);
        } catch (NativeDB.DBException dbExc) {
          throw new IOException("Unable to create state store", dbExc);
        }
      } else {
        // the leveldb file seems to be corrupt somehow. Lets just blow it away and create a new
        // one, so we can keep processing new apps
        logger.error("error opening leveldb file {}. Creating new file, will not be able to " +
          "recover state for existing applications", dbFile, e);
        if (dbFile.isDirectory()) {
          for (File f : dbFile.listFiles()) {
            if (!f.delete()) {
              logger.warn("error deleting {}", f.getPath());
            }
          }
        }
        if (!dbFile.delete()) {
          logger.warn("error deleting {}", dbFile.getPath());
        }
        options.createIfMissing(true);
        try {
          tmpDb = JniDBFactory.factory.open(dbFile, options);
        } catch (NativeDB.DBException dbExc) {
          throw new IOException("Unable to create state store", dbExc);
        }
      }
    }
    // if there is a version mismatch, we throw an exception, which means the service is unusable
    checkVersion(tmpDb, version, mapper);
  }
  return tmpDb;
I think it is safer to follow the old code flow; could LevelDBProvider and RocksDBProvider be fixed together after further investigation?
I think the scenario here is a directory or file with the same name as dbFile that cannot be recovered and is not in the NotFound state - do I understand correctly, @tgravescs?
The case where it is a directory looks like a corner case.
> I think it is safer to follow the old code flow; could LevelDBProvider and RocksDBProvider be fixed together after further investigation?

Sounds good.
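For context, a condensed sketch of how the analogous RocksDB path might look, assuming the org.rocksdb JNI API; this mirrors the LevelDB flow quoted above rather than reproducing Spark's exact RocksDBProvider (which additionally distinguishes the not-found case and logs failed deletions):

```java
import java.io.File;
import java.io.IOException;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

final class RocksDbRecoverySketch {
  // Open an existing store; on failure, wipe it and start empty (old state is lost).
  static RocksDB openOrRecreate(File dbFile) throws IOException {
    RocksDB.loadLibrary();
    Options options = new Options().setCreateIfMissing(false);
    try {
      return RocksDB.open(options, dbFile.toString());
    } catch (RocksDBException e) {
      if (dbFile.isDirectory()) {
        File[] children = dbFile.listFiles();
        if (children != null) {
          for (File f : children) {
            f.delete(); // if this fails, the re-open below fails as well
          }
        }
      }
      dbFile.delete();
      options.setCreateIfMissing(true);
      try {
        return RocksDB.open(options, dbFile.toString());
      } catch (RocksDBException dbExc) {
        throw new IOException("Unable to create state store", dbExc);
      }
    }
  }
}
```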
if (!dbFile.delete()) {
  logger.warn("error deleting {}", dbFile.getPath());
}
dbOptions.setCreateIfMissing(true);
This will not take effect if dbFile.delete() fails?
If dbFile.delete() fails, the following RocksDB.open will also fail.
dbOptions.setCompressionType(CompressionType.LZ4_COMPRESSION);
dbOptions.setTableFormatConfig(tableFormatConfig);
try {
  return RocksDB.open(dbOptions, file.toString());
Why don't we have the setCreateIfMissing reset as we do in initRockDB(File dbFile, StoreVersion version, ObjectMapper mapper)? Is the call to this method supposed to be the first-time RocksDB creation?
This method is only used by tests (ShuffleTestAccessor), and the DB should already exist in the test.
Lines 210 to 217 in 43b02d7
def reloadRegisteredExecutors(
    dbBackend: DBBackend,
    file: File): ConcurrentMap[ExternalShuffleBlockResolver.AppExecId, ExecutorShuffleInfo] = {
  val db = DBProvider.initDB(dbBackend, file)
  val result = ExternalShuffleBlockResolver.reloadRegisteredExecutors(db)
  db.close()
  result
}
@Ngone51 should we move it to a new TestUtils or ShuffleTestAccessor?
Looks good to me - this PR depends on other PRs, so we should wait for those to be merged.
+CC @Ngone51
LGTM
Changed the title: RocksDB support for shuffle state store → RocksDB support for shuffle service state store
Will wait for a day or so to give Dongjoon a chance to review the PR.
| test("Finalized merged shuffle are written into DB and cleaned up after application stopped") { | ||
| // TODO: should enable this test after SPARK-40186 is resolved. | ||
| ignore("Finalized merged shuffle are written into DB and cleaned up after application stopped") { |
Will re-enable this, since it should pass after SPARK-40186 is merged.
mvn clean test -pl resource-managers/yarn -Pyarn -DwildcardSuites=org.apache.spark.network.yarn.YarnShuffleServiceWithRocksDBBackendSuite
YarnShuffleServiceWithRocksDBBackendSuite:
- executor and merged shuffle state kept across NM restart
- removed applications should not be in registered executor file and merged shuffle file
- shuffle service should be robust to corrupt registered executor file
- get correct recovery path
- moving recovery file from NM local dir to recovery path
- service throws error if cannot start
- Consistency in AppPathInfo between in-memory hashmap and the DB
- Finalized merged shuffle are written into DB and cleaned up after application stopped
- SPARK-40186: shuffleMergeManager should have been shutdown before db closed
- Dangling finalized merged partition info in DB will be removed during restart
- Dangling application path or shuffle information in DB will be removed during restart
- Cleanup for former attempts local path info should be triggered in applicationRemoved
- recovery db should not be created if NM recovery is not enabled
- SPARK-31646: metrics should be registered into Node Manager's metrics system
- SPARK-34828: metrics should be registered with configured name
- create default merged shuffle file manager instance
- create remote block push resolver instance
- invalid class name of merge manager will use noop instance
Run completed in 9 seconds, 454 milliseconds.
Total number of tests run: 18
Suites: completed 2, aborted 0
Tests: succeeded 18, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
Merged to master.
Thank you, @LuciferYang, @mridulm, @Ngone51 and @HyukjinKwon.
What changes were proposed in this pull request?

This is extended work of SPARK-38909. In this PR, a RocksDB implementation is added for the shuffle local state store. This PR adds the following code:

- `shuffledb.RocksDB` and `shuffledb.RocksDBIterator`: RocksDB implementations corresponding to `shuffledb.DB` and `shuffledb.DBIterator` (a rough sketch follows this list)
- Add `ROCKSDB` to `shuffle.DBBackend`, with `.rdb` as the corresponding file suffix; the description of `SHUFFLE_SERVICE_DB_BACKEND` is also changed
- Add `RocksDBProvider` to build `RocksDB` instances, and extend `DBProvider` to produce the corresponding instances
- Add `rocksdbjni` to the `network-common` module
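As a rough illustration of the first bullet, an adapter like the following wraps the rocksdbjni handle behind the shuffledb interface. The `DB` interface shape shown here is assumed from SPARK-38909, and the error handling is a sketch, not the exact code in this PR:

```java
import java.io.Closeable;
import java.io.IOException;
import org.rocksdb.RocksDBException;

// Assumed interface shape from SPARK-38909; the real one lives in
// common/network-common under org.apache.spark.network.shuffledb.
interface DB extends Closeable {
  void put(byte[] key, byte[] value);
  byte[] get(byte[] key);
  void delete(byte[] key);
}

public class RocksDB implements DB {
  private final org.rocksdb.RocksDB db;

  public RocksDB(org.rocksdb.RocksDB db) {
    this.db = db;
  }

  @Override
  public void put(byte[] key, byte[] value) {
    try {
      db.put(key, value);
    } catch (RocksDBException e) {
      // Wrap the checked JNI exception so callers can stay storage-agnostic.
      throw new RuntimeException(e);
    }
  }

  @Override
  public byte[] get(byte[] key) {
    try {
      return db.get(key);
    } catch (RocksDBException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void delete(byte[] key) {
    try {
      db.delete(key);
    } catch (RocksDBException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void close() throws IOException {
    db.close();
  }
}
```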
Why are the changes needed?

Support using RocksDB as the shuffle local state store.
Does this PR introduce any user-facing change?

When the user configures `spark.shuffle.service.db.enabled` as true, they can use RocksDB as the shuffle local state store by specifying `SHUFFLE_SERVICE_DB_BACKEND` (`spark.shuffle.service.db.backend`) as `ROCKSDB` in `spark-defaults.conf` or `spark-shuffle-site.xml` (for yarn). Data originally stored in LevelDB/RocksDB will not be automatically converted to the other kind of storage for now.

How was this patch tested?
Added new tests.
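For the yarn case mentioned above, a hypothetical spark-shuffle-site.xml carrying the same setting might look like this; only the property name comes from this PR, the rest is illustrative:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Illustrative: select RocksDB as the shuffle service state store backend. -->
  <property>
    <name>spark.shuffle.service.db.backend</name>
    <value>ROCKSDB</value>
  </property>
</configuration>
```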