Conversation

@LuciferYang
Contributor

@LuciferYang LuciferYang commented Aug 22, 2022

What changes were proposed in this pull request?

This is an extended work of SPARK-38909. In this PR, a RocksDB implementation is added for the shuffle local state store.

This PR adds the following code:

  • shuffledb.RocksDB and shuffledb.RocksDBIterator: RocksDB implementations corresponding to shuffledb.DB and shuffledb.DBIterator
  • Add ROCKSDB to shuffle.DBBackend, with the corresponding file suffix .rdb; the description of SHUFFLE_SERVICE_DB_BACKEND is also changed
  • Add RocksDBProvider to build RocksDB instances and extend DBProvider to produce the corresponding instances
  • Add a dependency on rocksdbjni to the network-common module
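To make the division of labor concrete, here is a minimal sketch of the kind of key/value abstraction the shuffledb.DB/shuffledb.DBIterator split implies. All interface and method names below are illustrative assumptions for the sketch, not the exact Spark API, and a TreeMap stands in for the native RocksDB handle that rocksdbjni would provide.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Illustrative shape of the shuffledb abstraction: a byte[]-keyed store plus an
// ordered iterator. Names and signatures are assumptions, not the Spark API;
// a sorted in-memory map stands in for the native RocksDB handle.
interface SimpleDB {
    void put(byte[] key, byte[] value);
    byte[] get(byte[] key);
    Iterator<Map.Entry<byte[], byte[]>> iterator();
    void close();
}

class InMemoryDB implements SimpleDB {
    // RocksDB and LevelDB iterate keys in byte order; a sorted map mirrors that.
    private final TreeMap<byte[], byte[]> store = new TreeMap<>(Arrays::compare);

    @Override public void put(byte[] key, byte[] value) { store.put(key, value); }
    @Override public byte[] get(byte[] key) { return store.get(key); }
    @Override public Iterator<Map.Entry<byte[], byte[]>> iterator() {
        return store.entrySet().iterator();
    }
    @Override public void close() { store.clear(); }
}
```

A RocksDB-backed implementation would hold an org.rocksdb.RocksDB handle instead of the map and wrap an org.rocksdb.RocksIterator for iteration.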

Why are the changes needed?

Support using RocksDB as the shuffle local state store.

Does this PR introduce any user-facing change?

When the user configures spark.shuffle.service.db.enabled as true, the user can use RocksDB as the shuffle local state store by specifying SHUFFLE_SERVICE_DB_BACKEND (spark.shuffle.service.db.backend) as ROCKSDB in spark-defaults.conf or spark-shuffle-site.xml (for YARN).

Data originally stored in LevelDB/RocksDB will not be automatically converted to the other kind of storage for now.
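For example, an illustrative spark-defaults.conf sketch (property names as introduced or referenced in this PR; the surrounding values assume the external shuffle service is in use):

```
spark.shuffle.service.enabled     true
spark.shuffle.service.db.enabled  true
spark.shuffle.service.db.backend  ROCKSDB
```

On YARN, the backend setting can instead go into spark-shuffle-site.xml on the node manager side.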

How was this patch tested?

Add new test.

@LuciferYang
Contributor Author

Testing first; will update the PR description later

@LuciferYang LuciferYang changed the title [SPARK-38888][CORE][YARN] Add RocksDB support for shuffle state store [SPARK-38888][BUILD][CORE][YARN] Add RocksDB support for shuffle state store Aug 22, 2022
@github-actions github-actions bot added the DOCS label Aug 22, 2022

private static final Logger logger = LoggerFactory.getLogger(RocksDBProvider.class);

public static RocksDB initRockDB(File dbFile, StoreVersion version, ObjectMapper mapper) throws
Contributor Author

cc @dongjoon-hyun Do you have time to help review the RocksDB part?

@LuciferYang
Contributor Author

@LuciferYang LuciferYang changed the title [SPARK-38888][BUILD][CORE][YARN] Add RocksDB support for shuffle state store [SPARK-38888][BUILD][CORE][YARN][DOCS] Add RocksDB support for shuffle state store Aug 22, 2022
Contributor

@mridulm mridulm left a comment

Mostly looks good, but for a couple of comments.
Particularly have not looked at RocksDBProvider - will let @dongjoon-hyun review that (in addition to rest of the PR).

@LuciferYang
Contributor Author

friendly ping @Ngone51

eventually gets cleaned up. This config may be removed in the future.
</td>
<td>3.0.0</td>
</tr>
Contributor

@mridulm mridulm Aug 24, 2022

This should be in the section at the bottom - in External Shuffle service(server) side configuration options section.

Having said that, I was wrong about spark.shuffle.service.db.enabled - it is always enabled in yarn mode - so we cannot control it with .enabled flag.
The newly introduced spark.shuffle.service.db.backend is relevant though; but can we add a blurb that it is relevant for yarn and standalone - with db enabled by default for yarn, and to look at standalone.md for more details on how to configure it for standalone ? Thx

Contributor Author

Does e124c84 look better?

Contributor Author

https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior

Should this table item be sorted by Property Name? It seems not to follow this rule.

Contributor

@mridulm mridulm Aug 31, 2022

Actually, that section I mentioned is for push based shuffle - so probably does not make sense for our changes here.
There is a section here which looks more relevant for yarn.
We should perhaps move the push based shuffle blurb to the yarn doc as well ... but let us do that in a later PR.

Contributor Author

Move it back to after

spark/docs/configuration.md

Lines 1000 to 1008 in 6e62b93

<td><code>spark.shuffle.service.removeShuffle</code></td>
<td>false</td>
<td>
Whether to use the ExternalShuffleService for deleting shuffle blocks for
deallocated executors when the shuffle is no longer needed. Without this enabled,
shuffle data on executors that are deallocated will remain on disk until the
application ends.
</td>
<td>3.3.0</td>

?

They have been added into spark-standalone.md and running-on-yarn.md. Is it necessary to keep them in configuration.md?

Contributor

You are right, we need it only in standalone and yarn docs for now - not in configuration

Contributor Author

removed

@HyukjinKwon
Member

cc @HeartSaVioR FYI

@LuciferYang
Contributor Author

thanks @HyukjinKwon

When <code>spark.shuffle.service.db.enabled</code> is true, user can use this to specify the kind of disk-based
store used in shuffle state store. This supports `LEVELDB` and `ROCKSDB` now, with `LEVELDB` as the default value.
This only affects standalone mode (yarn always has this behavior enabled).
Data originally stored in `LevelDB`/`RocksDB` will not be automatically converted to another kind of storage now.
Member

Do you mean the data that is stored in LevelDB is not available after changing to RocksDB for example?

Contributor Author

Yes, there is no function to automatically switch *.ldb to *.rdb now.

}

@Override
public void remove() {
Member

Wondering why we have a remove() (and even no parameter) API for the iterator? What's the expected behavior with a possible implementation?

Contributor Author

Will remove this; I found

default void remove() {
        throw new UnsupportedOperationException("remove");
    }

in java.util.Iterator
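For reference, a small self-contained sketch (hypothetical class names) of why no explicit override is needed: an iterator that only implements hasNext()/next() inherits the throwing default remove() from java.util.Iterator.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical read-only iterator: it implements only hasNext()/next(),
// so remove() falls back to java.util.Iterator's default, which throws
// UnsupportedOperationException.
class ReadOnlyIterator implements Iterator<String> {
    private final String[] items;
    private int pos = 0;

    ReadOnlyIterator(String[] items) { this.items = items; }

    @Override
    public boolean hasNext() { return pos < items.length; }

    @Override
    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        return items[pos++];
    }
}
```

Calling remove() on such an iterator throws, which matches the read-only semantics a DB iterator wants here.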

Contributor Author

done

logger.error("error opening rocksdb file {}. Creating new file, will not be able to " +
"recover state for existing applications", dbFile, e);
if (dbFile.isDirectory()) {
for (File f : Objects.requireNonNull(dbFile.listFiles())) {
Member

Is it possible to have sub-dir here?

Contributor Author

The process here refers to LevelDBProvider:

public static DB initLevelDB(File dbFile, StoreVersion version, ObjectMapper mapper) throws
    IOException {
  DB tmpDb = null;
  if (dbFile != null) {
    Options options = new Options();
    options.createIfMissing(false);
    options.logger(new LevelDBLogger());
    try {
      tmpDb = JniDBFactory.factory.open(dbFile, options);
    } catch (NativeDB.DBException e) {
      if (e.isNotFound() || e.getMessage().contains(" does not exist ")) {
        logger.info("Creating state database at " + dbFile);
        options.createIfMissing(true);
        try {
          tmpDb = JniDBFactory.factory.open(dbFile, options);
        } catch (NativeDB.DBException dbExc) {
          throw new IOException("Unable to create state store", dbExc);
        }
      } else {
        // the leveldb file seems to be corrupt somehow. Lets just blow it away and create a new
        // one, so we can keep processing new apps
        logger.error("error opening leveldb file {}. Creating new file, will not be able to " +
          "recover state for existing applications", dbFile, e);
        if (dbFile.isDirectory()) {
          for (File f : dbFile.listFiles()) {
            if (!f.delete()) {
              logger.warn("error deleting {}", f.getPath());
            }
          }
        }
        if (!dbFile.delete()) {
          logger.warn("error deleting {}", dbFile.getPath());
        }
        options.createIfMissing(true);
        try {
          tmpDb = JniDBFactory.factory.open(dbFile, options);
        } catch (NativeDB.DBException dbExc) {
          throw new IOException("Unable to create state store", dbExc);
        }
      }
    }
    // if there is a version mismatch, we throw an exception, which means the service is unusable
    checkVersion(tmpDb, version, mapper);
  }
  return tmpDb;

I think it is safer to follow the old code flow. Could LevelDBProvider and RocksDBProvider be fixed together after further investigation?

Contributor Author

@LuciferYang LuciferYang Aug 26, 2022

I think the scenario here is a directory or file with the same name as dbFile that cannot be recovered and is not in the NotFound state. Do I understand correctly, @tgravescs?

Contributor Author

The case where it is a directory looks like a corner case.

Member

I think it is safer to follow the old code flow, could LevelDBProvider and RocksDBProvider be fixed together after further investigation?

sounds good.

if (!dbFile.delete()) {
logger.warn("error deleting {}", dbFile.getPath());
}
dbOptions.setCreateIfMissing(true);
Member

This will not take effect if dbFile.delete() failed?

Contributor Author

If dbFile.delete() fails, the following RocksDB.open will also fail

dbOptions.setCompressionType(CompressionType.LZ4_COMPRESSION);
dbOptions.setTableFormatConfig(tableFormatConfig);
try {
return RocksDB.open(dbOptions, file.toString());
Member

Why don't we have setCreateIfMissing resetting as we do in initRockDB(File dbFile, StoreVersion version, ObjectMapper mapper)? Is the call on this method supposed to be the first time RocksDB creation?


Contributor Author

@LuciferYang LuciferYang Aug 26, 2022

This method is just used by tests (ShuffleTestAccessor), and it should already exist in the test.

Contributor Author

def reloadRegisteredExecutors(
    dbBackend: DBBackend,
    file: File): ConcurrentMap[ExternalShuffleBlockResolver.AppExecId, ExecutorShuffleInfo] = {
  val db = DBProvider.initDB(dbBackend, file)
  val result = ExternalShuffleBlockResolver.reloadRegisteredExecutors(db)
  db.close()
  result
}

Contributor Author

@LuciferYang LuciferYang Aug 26, 2022

@Ngone51 should we move it to a new Test Utils or ShuffleTestAccessor

Contributor Author

@Ngone51 #37648 unified the use of initDB method. If it is merged, this method can be deleted.

Contributor

@mridulm mridulm left a comment

Looks good to me - this PR depends on other PRs, so we should wait for those to be merged.

+CC @Ngone51

@Ngone51
Member

Ngone51 commented Sep 2, 2022

LGTM

@Ngone51 Ngone51 changed the title [SPARK-38888][BUILD][CORE][YARN][DOCS] Add RocksDB support for shuffle state store [SPARK-38888][BUILD][CORE][YARN][DOCS] Add RocksDB support for shuffle service state store Sep 2, 2022
@mridulm
Contributor

mridulm commented Sep 6, 2022

Will wait for a day or so to give Dongjoon a chance to review the PR.
Would be great if you can review this PR @dongjoon-hyun, particularly RocksDBProvider. Thanks !


test("Finalized merged shuffle are written into DB and cleaned up after application stopped") {
// TODO: should enable this test after SPARK-40186 is resolved.
ignore("Finalized merged shuffle are written into DB and cleaned up after application stopped") {
Contributor Author

Will re-enable this, since it should pass after SPARK-40186 is merged.

mvn clean test -pl resource-managers/yarn -Pyarn -DwildcardSuites=org.apache.spark.network.yarn.YarnShuffleServiceWithRocksDBBackendSuite
YarnShuffleServiceWithRocksDBBackendSuite:
- executor and merged shuffle state kept across NM restart
- removed applications should not be in registered executor file and merged shuffle file
- shuffle service should be robust to corrupt registered executor file
- get correct recovery path
- moving recovery file from NM local dir to recovery path
- service throws error if cannot start
- Consistency in AppPathInfo between in-memory hashmap and the DB
- Finalized merged shuffle are written into DB and cleaned up after application stopped
- SPARK-40186: shuffleMergeManager should have been shutdown before db closed
- Dangling finalized merged partition info in DB will be removed during restart
- Dangling application path or shuffle information in DB will be removed during restart
- Cleanup for former attempts local path info should be triggered in applicationRemoved
- recovery db should not be created if NM recovery is not enabled
- SPARK-31646: metrics should be registered into Node Manager's metrics system
- SPARK-34828: metrics should be registered with configured name
- create default merged shuffle file manager instance
- create remote block push resolver instance
- invalid class name of merge manager will use noop instance
Run completed in 9 seconds, 454 milliseconds.
Total number of tests run: 18
Suites: completed 2, aborted 0
Tests: succeeded 18, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

@LuciferYang
Contributor Author

01220ed merges with master, and 52c1c7e re-enables "Finalized merged shuffle are written into DB and cleaned up after application stopped" in YarnShuffleServiceSuite.

@mridulm mridulm closed this in e83aedd Sep 9, 2022
@mridulm
Contributor

mridulm commented Sep 9, 2022

Merged to master.
Thanks for working on this @LuciferYang !
Thanks for the review @Ngone51 :-)

@LuciferYang
Contributor Author

thanks @mridulm @Ngone51

@dongjoon-hyun
Member

Thank you, @LuciferYang, @mridulm , @Ngone51 and @HyukjinKwon .
