HBASE-28839: Handle all types of exceptions during retrieval of bucket-cache from persistence. #6250
Conversation
Force-pushed 43cf0ad to daa93dc
wchevreuil left a comment
We should rather understand and solve the trigger, than put this bandaid here.
Force-pushed daa93dc to f528fae
Force-pushed f528fae to 7cae31e
…t-cache from persistence. In certain scenarios there can be a discrepancy between the number of chunks recorded in the metadata and the number of chunks actually stored in the persistence file, because the bucket cache may be operated upon in parallel. During the retrieval of the bucket cache from persistence, it was observed that if an exception other than an IOException occurs, the exception is not logged and the retrieval thread exits, leaving the bucket cache in an uninitialised and unusable state. With this change, the retrieval code no longer relies on the metadata (the number of chunks); instead, it reads from the file stream as long as data is available to be read. The retrieval thread now logs all types of exceptions and also reinitialises the bucket cache, keeping it usable. Change-Id: I81b7f5fe06945702bbc59df96d054f95f03de499
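The approach described in the commit message can be sketched roughly as below. This is a hypothetical, simplified illustration using plain `java.io` streams instead of HBase's actual `BucketCache` classes: read entries while data remains (rather than trusting a persisted chunk count), catch `Throwable` rather than only `IOException`, and reset the cache on any failure.

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch: names and structure are illustrative, not HBase's code.
class CacheRestoreSketch {
    final Map<Integer, Integer> backingMap = new HashMap<>();

    // Returns true if the cache was restored, false if it was reset after an error.
    boolean retrieveFromPersistence(InputStream in) {
        try (DataInputStream dis = new DataInputStream(in)) {
            // Read entries as long as data is available, instead of relying on
            // a persisted chunk count that may be stale.
            while (dis.available() > 0) {
                backingMap.put(dis.readInt(), dis.readInt());
            }
            return true;
        } catch (Throwable t) {
            // Any failure (not only IOException) is handled here, and the cache
            // is reinitialised so it stays usable.
            backingMap.clear();
            return false;
        }
    }
}
```

A truncated or corrupt persistence file then no longer leaves the cache in a half-initialised state; the restore simply fails cleanly and the cache starts empty.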
Force-pushed 7cae31e to c5275f0
} catch (IOException ioex) {
  LOG.error("Can't restore from file[{}] because of ", persistencePath, ioex);
} catch (Throwable ex) {
  LOG.error("Can't restore from file[{}] because of ", persistencePath, ex);
nit: since we are handling the error, shouldn't this be a warn? And let's explain that the cache will be reset and a reload will happen.
ack
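The reviewer's nit above could look roughly like the following. This is a hypothetical sketch using `java.util.logging` from the JDK (the real HBase code uses slf4j, and `persistencePath` here is just an illustrative parameter): log at WARN since the error is handled, and state that the cache will be reset and reloaded.

```java
import java.util.logging.*;

// Hypothetical sketch of the suggested log-level change; not HBase's code.
class RestoreLogging {
    static final Logger LOG = Logger.getLogger("BucketCache");

    static void onRestoreFailure(String persistencePath, Throwable ex) {
        // WARN, not ERROR: the failure is handled and the cache recovers.
        LOG.log(Level.WARNING,
            "Can't restore from file[" + persistencePath
                + "]; the cache will be reset and reloaded.", ex);
    }
}
```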
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/BucketCache.java
if (firstChunkPersisted == false) {
  // Persist all details along with the first chunk into BucketCacheEntry
  BucketProtoUtils.toPB(cache, builder.build()).writeDelimitedTo(fos);
  firstChunkPersisted = true;
} else {
  // Directly persist subsequent backing-map chunks.
  builder.build().writeDelimitedTo(fos);
maybe for consistency and simplicity, we should just write the BucketCacheProtos.BucketCacheEntry before the for loop, then the backing-map chunks only within this loop. That way we wouldn't need this firstChunkPersisted flag.
ack!
Change-Id: Icf6cdcf829e7d4bd16f50f48fd02059b415f2d09
wchevreuil left a comment
Needs to also address the spotless issues.
// Create the first chunk and persist all details along with it.
while (entrySetIter.hasNext()) {
  blockCount++;
  Map.Entry<BlockCacheKey, BucketEntry> entry = entrySetIter.next();
  addToBuilder(entry, entryBuilder, builder);
  if (blockCount % chunkSize == 0 || (blockCount == backingMapSize)) {
    chunkCount++;
    if (chunkCount == 1) {
      // Persist all details along with the first chunk into BucketCacheEntry
      BucketProtoUtils.toPB(cache, builder.build()).writeDelimitedTo(fos);
    } else {
      // Directly persist subsequent backing-map chunks.
      break;
    }
  }
}
builder.clear();
Why do we need two while loops? just do the BucketProtoUtils.toPB(cache, builder.build()).writeDelimitedTo(fos); before the loop. That would write all the BucketCacheEntry meta info at the beginning of the file with an empty backingMap at this point, but that's fine as we'll write all backingMap chunks subsequently and this code would be cleaner, I suppose.
ack!
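The layout the reviewer suggests — metadata written exactly once before the loop, then only backing-map chunks inside it, read back by draining the stream — can be sketched as below. This is a hypothetical illustration using JDK `DataOutputStream`/`DataInputStream` instead of the protobuf `writeDelimitedTo`/`parseDelimitedFrom` calls in the actual patch; all names here are made up.

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch of one-header-then-chunks persistence; not HBase's code.
class ChunkedPersistenceSketch {
    // Writes the metadata entry once, before the loop, then the map in
    // chunks of chunkSize entries — no firstChunkPersisted flag needed.
    static void persist(Map<Integer, Integer> map, int chunkSize, DataOutputStream out)
            throws IOException {
        out.writeUTF("CACHE-META"); // metadata entry, written exactly once
        List<Map.Entry<Integer, Integer>> entries = new ArrayList<>(map.entrySet());
        for (int i = 0; i < entries.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, entries.size());
            out.writeInt(end - i);  // chunk length prefix
            for (Map.Entry<Integer, Integer> e : entries.subList(i, end)) {
                out.writeInt(e.getKey());
                out.writeInt(e.getValue());
            }
        }
    }

    // Reads the metadata once, then chunks until the stream is drained,
    // without relying on a persisted chunk count.
    static Map<Integer, Integer> restore(DataInputStream in) throws IOException {
        Map<Integer, Integer> map = new HashMap<>();
        in.readUTF();               // metadata entry, read exactly once
        while (in.available() > 0) {
            int n = in.readInt();
            for (int j = 0; j < n; j++) {
                map.put(in.readInt(), in.readInt());
            }
        }
        return map;
    }
}
```

Writing the header unconditionally before the loop removes the first-iteration special case, which is the simplification being requested in the comments above.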
  BucketCacheProtos.BucketCacheEntry firstChunk =
    BucketCacheProtos.BucketCacheEntry.parseDelimitedFrom(in);
  parseFirstChunk(firstChunk);

  // Subsequent chunks have the backingMap entries.
- for (int i = 1; i < numChunks; i++) {
-   LOG.info("Reading chunk no: {}", i + 1);
+ int numChunks = 0;
+ while (in.available() > 0) {
    parseChunkPB(BucketCacheProtos.BackingMap.parseDelimitedFrom(in),
      firstChunk.getDeserializersMap());
-   LOG.info("Retrieved chunk: {}", i + 1);
+   numChunks++;
  }
If we change the way we write to the file as I suggested, so that we have only the BucketCacheProtos.BucketCacheEntry at the beginning followed by all BucketCacheProtos.BackingMap chunks, then we should change the naming here: firstChunk should now be entry. And these parseFirstChunk, parseChunkPB are not really parsing anything (parsing is delegated to the proto utils), but rather updating the cache index structures, so we should rename them to something like updateCacheIndex.
Also, looking at parseFirstChunk and parseChunkPB, we should replace the duplicate code inside parseFirstChunk with a call to parseChunkPB. Or, since parseFirstChunk is only used here, we could just get rid of it and simply do:
fullyCachedFiles.clear(); fullyCachedFiles.putAll(BucketProtoUtils.fromPB(entry.getCachedFilesMap()));
whilst the backingMap and blocksByHfile would get updated all in the loop.
ack!
Force-pushed 533b590 to 58ebef3
Change-Id: I97936a683673ff89e04a15bc66542fb93a32fe8a
Force-pushed 58ebef3 to 409554d
🎊 +1 overall
This message was automatically generated.
💔 -1 overall
🎊 +1 overall
Hi @wchevreuil, the failing test seems to have passed in the rerun. I have addressed your review comments. Please take a look.
  .putAllDeserializers(CacheableDeserializerIdManager.save())
- .putAllCachedFiles(toCachedPB(cache.fullyCachedFiles)).setBackingMap(backingMap)
+ .putAllCachedFiles(toCachedPB(cache.fullyCachedFiles))
+ .setBackingMap(backingMapBuilder.build())
We need to pass an empty backing map here, otherwise we see an exception for an incomplete object. Hence, we just pass an empty backing map along with the metadata. Subsequently, we persist all entries of the backing map in chunks.
We don't need a builder, just pass an empty map. Or, since we don't persist any map entries within the BucketCacheEntry proto object, just remove the map from the protobuf message. We already changed the persistent file format on HBASE-28805, as long as this can land on all related branches whilst HBASE-28805 has not made into any release, we are free to change the format.
Hi @wchevreuil, we will need to retain the old format of the protobuf message to maintain backward compatibility. Hence, we cannot change the protobuf message. We can reuse this protobuf message by persisting an empty backing map instead of introducing a new version.
hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/bucket/BucketCache.java
…t-cache from persistence. (#6250) Signed-off-by: Wellington Chevreuil <[email protected]>
…t-cache from persistence. (apache#6250) Signed-off-by: Wellington Chevreuil <[email protected]>
…t-cache from persistence. (#6250) (#6288) Signed-off-by: Wellington Chevreuil <[email protected]>
HBASE-28839: Handle all types of exceptions during retrieval of bucket-cache from persistence. (apache#6250) Signed-off-by: Wellington Chevreuil <[email protected]> Change-Id: I1e8147bffcc456a59375ec67471e736079e5e107 (cherry picked from commit 2e0b01f)
HBASE-28839: Handle all types of exceptions during retrieval of bucket-cache from persistence. (apache#6250) Signed-off-by: Wellington Chevreuil <[email protected]> Change-Id: I1e8147bffcc456a59375ec67471e736079e5e107 (cherry picked from commit 2e0b01f)
…t-cache from persistence. (apache#6250) (apache#6288) Signed-off-by: Wellington Chevreuil <[email protected]> Change-Id: Ied978410cc7d353e675144b877365465fcf96c67
During the retrieval of the bucket cache from persistence, it was observed that if an exception other than an IOException occurs, the exception is not logged and the retrieval thread exits, leaving the bucket cache in an uninitialised and unusable state.
This change enables the retrieval thread to log all types of exceptions, and also reinitialises the bucket cache so it remains usable.
Unfortunately, the exception was seen due to a parallel execution of eviction during the persistence of the backing map. Hence, this use-case may not be tested via a unit test.
Change-Id: I81b7f5fe06945702bbc59df96d054f95f03de499