Skip to content

Conversation

@botarchiveforandy
Copy link
Contributor

What changes were proposed in this pull request?

Updates SnapshotDiffJobCodec to use Protobuf for serialization and deserialization instead of Jackson.
To maintain backward compatibility, the codec falls back to Jackson when Protobuf parsing fails.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9223

How was this patch tested?

CI:
https://github.com/yuxian0908/ozone/actions/runs/15177445601

@ivandika3 ivandika3 added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label May 23, 2025
Copy link
Contributor

@ivandika3 ivandika3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuxian0908 Thanks for the patch. Overall looks good.

Could you add a simple test to serialize using Jackson and then deserialize using the fromPersistedFormatImpl?

@ivandika3 ivandika3 requested a review from szetszwo May 25, 2025 11:47
@botarchiveforandy
Copy link
Contributor Author

@yuxian0908 Thanks for the patch. Overall looks good.

Could you add a simple test to serialize using Jackson and then deserialize using the fromPersistedFormatImpl?

Thanks for the suggestion, @ivandika3!

I’ve added a compatibility test that verifies a SnapshotDiffJob serialized with OldSnapshotDiffJobCodecForTesting (Jackson-based) can be deserialized correctly using the current protobuf-based codec.

@ivandika3
Copy link
Contributor

@yuxian0908 Thanks for the update. Overall LGTM. Please kindly fix the checkstyle and license issues.

private final Codec<SnapshotDiffJob> newCodec = SnapshotDiffJob.getCodec();

@Test
public void testOldJsonSerializedDataCanBeReadByNewCodec() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SnapshotDiffJob.getFromProtoBuf(proto) is expected to throw exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getFromProtoBuf() appears to be a safe method that builds a POJO from a fully-parsed SnapshotDiffJobProto object. Since it relies on accessor methods like getCreationTime() and getVolume() which are expected to be present in a valid proto message, we don’t anticipate any checked exceptions being thrown during this process.

Please correct me if I missed anything or overlooked a potential edge case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just wondering: does the fallback have to be triggered when newCodec.fromPersistedFormatImpl(oldFormatData) is called in this test, since the Jackson and Protobuf formats are incompatible? Or could the Protobuf codec parse the Jackson-encoded binary, making the fallback merely a safeguard?

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuxian0908 , thanks for the update! Please see the comments inlined.

Copy link
Contributor

@swamirishi swamirishi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuxian0908 Instead of having to maintain 2 codecs can we just drop the entire snapdiff db as part of the upgrade framework. I just feel it would reduce a lot of complexity. Since snapshot diff db is an ephemeral db anyhow as all entries would be eventually deleted.

@botarchiveforandy
Copy link
Contributor Author

@yuxian0908 Instead of having to maintain 2 codecs can we just drop the entire snapdiff db as part of the upgrade framework. I just feel it would reduce a lot of complexity. Since snapshot diff db is an ephemeral db anyhow as all entries would be eventually deleted.

Thank you for the suggestion — I agree that dropping the snapdiff DB during upgrade would certainly simplify the implementation by removing the need to maintain backward compatibility codecs.
However, I'm not entirely sure if it's safe to drop the entire DB as part of the upgrade process. While the snapshot diff DB is indeed ephemeral in nature, I’m not fully certain whether doing so might have implications for in-progress or recently triggered snapshot diff jobs, especially in active systems.
Would you mind sharing a bit more context or guidance on whether this approach has been considered before, and if there are any concerns around data loss or job consistency?
Please feel free to correct me if I misunderstood anything.

@botarchiveforandy
Copy link
Contributor Author

Thanks for the update. Overall LGTM. Please kindly fix the checkstyle and license issues.

Thank you for the review! I have fixed the checkstyle and license issues as requested.

@ivandika3
Copy link
Contributor

ivandika3 commented Jun 1, 2025

@yuxian0908 Instead of having to maintain 2 codecs can we just drop the entire snapdiff db as part of the upgrade framework. I just feel it would reduce a lot of complexity. Since snapshot diff db is an ephemeral db anyhow as all entries would be eventually deleted.

@swamirishi Thanks for the input. In my opinion, this needs to be discussed and handled in a follow up ticket. Simply handling both the old and new codec should make things transparent to users without the need to carry out more operations like dropping the snapdiff db with upgrade framework (which requires some OM downtime due to current non-rolling upgrade framework). If we decide to do it in the upgrade framework, then we can remove the old codec entirely.

@ivandika3 ivandika3 requested review from swamirishi and szetszwo June 1, 2025 05:51
Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yuxian0908 , thanks for the update! Just have some minor comments inlined.

}

@Override
public byte[] toPersistedFormatImpl(SnapshotDiffJob object) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation no longer throws any exception. Change the method to

    public byte[] toPersistedFormat(SnapshotDiffJob object) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

try {
SnapshotDiffJobProto proto = SnapshotDiffJobProto.parseFrom(rawData);
return SnapshotDiffJob.getFromProtoBuf(proto);
} catch (IOException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's catch InvalidProtocolBufferException and add a comment:

      } catch (InvalidProtocolBufferException e) {
        // the rawData was in old format, fallback to the old implementation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@swamirishi
Copy link
Contributor

swamirishi commented Jun 2, 2025

@yuxian0908 Instead of having to maintain 2 codecs can we just drop the entire snapdiff db as part of the upgrade framework. I just feel it would reduce a lot of complexity. Since snapshot diff db is an ephemeral db anyhow as all entries would be eventually deleted.

@swamirishi Thanks for the input. In my opinion, this needs to be discussed and handled in a follow up ticket. Simply handling both the old and new codec should make things transparent to users without the need to carry out more operations like dropping the snapdiff db with upgrade framework (which requires some OM downtime due to current non-rolling upgrade framework). If we decide to do it in the upgrade framework, then we can remove the old codec entirely.

@ivandika3 How about we introduce a version file inside the snapdiff.db directory and read it here if the version number is different then we can just drop the db. It would be cleaner this way and we can get rid of the earlier implementation altogether.

@szetszwo
Copy link
Contributor

szetszwo commented Jun 2, 2025

... How about we introduce a version file ...

@swamirishi , we already have layout version mechanism in Ozone. We should not introduce a version file.

The question is: do we need bump layout version for this change? It seems unnecessary. Or we could do it in a separated JIRA since the work should also take care how upgrade it.

@chungen0126
Copy link
Contributor

chungen0126 commented Jun 2, 2025

... How about we introduce a version file inside the snapdiff.db directory ...

Thanks @swamirishi for the comment.
I'm not sure dropping the entire snapdiff.db without any handling is a good idea.
If we do that, users will have no way to know which SnapshotDiff jobs have been silently deleted, or what the status of a particular job is.
Even if a job had already completed successfully, users would have no way of knowing its current state or whether it ever existed.
This could be confusing and problematic from both a usability and observability standpoint.

Please feel free to correct me if I’ve misunderstood your comment.

@chungen0126
Copy link
Contributor

I have already created a Jira ticket (HDDS-13164) to track the discussion, I suggest we focus on using protobuf for now in this PR.

@ivandika3
Copy link
Contributor

ivandika3 commented Jun 13, 2025

@yuxian0908 Could you help address the remaining @szetszwo 's comments? Afterwards, I think we can merge this.

@botarchiveforandy
Copy link
Contributor Author

@yuxian0908 , thanks for the update! Just have some minor comments inlined.

Thanks for the review! I’ve addressed the inlined comments.

Copy link
Member

@peterxcli peterxcli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yuxian0908 for updating the patch!

public class TestOmSnapshotDiffJobCodec {
private final OldSnapshotDiffJobCodecForTesting oldCodec
= new OldSnapshotDiffJobCodecForTesting();
private final Codec<SnapshotDiffJob> newCodec = SnapshotDiffJob.getCodec();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
private final Codec<SnapshotDiffJob> newCodec = SnapshotDiffJob.getCodec();
private final SnapshotDiffJobCodec newCodec = new SnapshotDiffJobCodec();

@Override
public byte[] toPersistedFormatImpl(SnapshotDiffJob object) throws IOException {
return MAPPER.writeValueAsBytes(object);
public byte[] toPersistedFormat(SnapshotDiffJob object) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szetszwo I’m not sure when it’s appropriate to override toPersistedFormat or toPersistedFormatImpl in a Codec implementation—their use seems inconsistent across the codebase.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I think this is ok as most of the codec do implement this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toPersistedFormatImpl is for the case that it needs to throw an exception E other than CodecException since the default toPersistedFormat will convert E to CodecException.

private final Codec<SnapshotDiffJob> newCodec = SnapshotDiffJob.getCodec();

@Test
public void testOldJsonSerializedDataCanBeReadByNewCodec() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just wondering: does the fallback have to be triggered when newCodec.fromPersistedFormatImpl(oldFormatData) is called in this test, since the Jackson and Protobuf formats are incompatible? Or could the Protobuf codec parse the Jackson-encoded binary, making the fallback merely a safeguard?

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 the change looks good.

@Override
public byte[] toPersistedFormatImpl(SnapshotDiffJob object) throws IOException {
return MAPPER.writeValueAsBytes(object);
public byte[] toPersistedFormat(SnapshotDiffJob object) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toPersistedFormatImpl is for the case that it needs to throw an exception E other than CodecException since the default toPersistedFormat will convert E to CodecException.

@szetszwo szetszwo merged commit 6df3077 into apache:master Jun 17, 2025
41 checks passed
@szetszwo
Copy link
Contributor

Thanks @yuxian0908 for working on this!

Thanks @ivandika3 , @peterxcli, @swamirishi , @chungen0126 , for reviewing this!

aswinshakil added a commit to aswinshakil/ozone that referenced this pull request Jun 20, 2025
…239-container-reconciliation

Commits: 62

da53b5b HDDS-13299. Fix failures related to delete (apache#8665)
8c1b439 HDDS-13296. Integration check always passes due to missing output (apache#8662)
7329859 HDDS-13023. Container checksum is missing after container import (apache#8459)
a0af93e HDDS-13292. Change `<? extends KeyValue>` to `<KeyValue>` in test (apache#8657)
f3050cf HDDS-13276. Use KEY_ONLY/VALUE_ONLY iterator in SCM/Datanode. (apache#8638)
e9c0a45 HDDS-13262. Simplify key name validation (apache#8619)
f713e57 HDDS-12482. Avoid using CommonConfigurationKeys (apache#8647)
b574709 HDDS-12924. datanode used space calculation optimization (apache#8365)
de683aa HDDS-13263. Refactor DB Checkpoint Utilities. (apache#8620)
97262aa HDDS-13256. Updated OM Snapshot Grafana Dashboard to reflect metric updates from HDDS-13181. (apache#8639)
9d2b415 HDDS-13234. Expired secret key can abort leader OM startup. (apache#8601)
d9049a2 HDDS-13220. Change Recon 'Negative usedBytes' message loglevel to DEBUG (apache#8648)
6df3077 HDDS-9223. Use protobuf for SnapshotDiffJobCodec (apache#8503)
a7fc290 HDDS-13236. Change Table methods not to throw IOException. (apache#8645)
9958f5b HDDS-13287. Upgrade commons-beanutils to 1.11.0 due to CVE-2025-48734 (apache#8646)
48aefea HDDS-13277. [Docs] Native C/C++ Ozone clients (apache#8630)
052d912 HDDS-13037. Let container create command support STANDALONE , RATIS and EC containers (apache#8559)
90ed60b HDDS-13279. Skip verifying Apache Ranger binaries in CI (apache#8633)
9bc53b2 HDDS-11513. All deletion configurations should be configurable without restart (apache#8003)
ac511ac HDDS-13259. Deletion Progress - Grafana Dashboard (apache#8617)
3370f42 HDDS-13246. Change `<? extend KeyValue>` to `<KeyValue>` in hadoop-hdds (apache#8631)
7af8c44 HDDS-11454. Ranger integration for Docker Compose environment (apache#8575)
5a3e4e7 HDDS-13273. Bump awssdk to 2.31.63 (apache#8626)
77138b8 HDDS-13254. Change table iterator to optionally read key or value. (apache#8621)
ce288b6 HDDS-13265. Simplify the page Access Ozone using HTTPFS REST API (apache#8629)
36fe888 HDDS-13275. Improve CheckNative implementation (apache#8628)
d38484e HDDS-13274. Bump sqlite-jdbc to 3.50.1.0 (apache#8627)
3f3ec43 HDDS-13266. `ozone debug checknative` to show OpenSSL lib (apache#8623)
8983a63 HDDS-13272. Bump junit to 5.13.1 (apache#8625)
a927113 HDDS-13271. [Docs] Minor text updates, reference links. (apache#8624)
7e77058 HDDS-13112. [Docs] OM Bootstrap can also happen when follower falls behind too much. (apache#8600)
fd13300 HDDS-10775. Support bucket ownership verification (apache#8558)
3ecf345 HDDS-13207. [Docs] Third party systems compatible with Ozone S3. (apache#8584)
ad5a507 HDDS-13035. SnapshotDeletingService should hold write locks while purging deleted snapshots (apache#8554)
38a9186 HDDS-12637. Increase max buffer size for tar entry read/write (apache#8618)
f31c264 HDDS-13045. Implement Immediate Triggering of Heartbeat when Volume Full (apache#8590)
0701d6a HDDS-13248. Remove `ozone debug replicas verify` option --output-dir (apache#8612)
ca1afe8 HDDS-13257. Remove separate split for shell integration tests (apache#8616)
5d6fe94 HDDS-13216. Standardize Container[Replica]NotFoundException messages (apache#8599)
1e47217 HDDS-13168. Fix error response format in CheckUploadContentTypeFilter (apache#8614)
6d4d423 HDDS-13181. Added metrics for internal Snapshot Operations. (apache#8606)
4a461b2 HDDS-10490. Intermittent NPE in TestSnapshotDiffManager#testLoadJobsOnStartUp (apache#8596)
bf29f7f HDDS-13235. The equals/hashCode methods in anonymous KeyValue classes may not work. (apache#8607)
6ff3ad6 HDDS-12873. Improve ContainerData statistics synchronization. (apache#8305)
09d3b27 HDDS-13244. TestSnapshotDeletingServiceIntegrationTest should close snapshots after deleting them (apache#8611)
931bc2d HDDS-13243. copy-rename-maven-plugin version is missing (apache#8605)
3b5985c HDDS-13244. Disable TestSnapshotDeletingServiceIntegrationTest
6bf009c HDDS-12927. metrics and log to indicate datanode crossing disk limits (apache#8573)
752da2b HDDS-12760. Intermittent Timeout in testImportedContainerIsClosed (apache#8349)
8c32363 HDDS-13050. Update StartFromDockerHub.md. (apache#8586)
ba1887c HDDS-13241. Fix some potential resource leaks (apache#8602)
bbaf71e HDDS-13130. Rename all instances of Disk Usage to Namespace usage (apache#8571)
0628386 HDDS-13142. Correct SCMPerformanceMetrics for delete operation. (apache#8592)
516bc96 HDDS-13148. [Docs] Update Transparent Data Encryption doc. (apache#8530)
5787135 HDDS-13229. [Doc] Fix incorrect CLI argument order in OM upgrade docs (apache#8598)
ba95074 HDDS-13107. Support limiting output of `ozone admin datanode list` (apache#8595)
e7f5544 HDDS-13171. Replace pipelineID if nodes are changed (apache#8562)
3c9d4d8 HDDS-13103. Correct transaction metrics in SCMBlockDeletingService. (apache#8516)
f62eb8a HDDS-13160. Remove SnapshotDirectoryCleaningService and refactor AbstractDeletingService (apache#8547)
b46e6b2 HDDS-13150. Fixed SnapshotLimitCheck when failures occur. (apache#8532)
203c1d3 HDDS-13206. Update documentation for Apache Ranger (apache#8583)
2072ef0 HDDS-13214. populate-cache fails due to unused dependency (apache#8594)

Conflicts:
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/ContainerData.java
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainer.java
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/helpers/KeyValueContainerUtil.java
	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/statemachine/background/BlockDeletingTask.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

snapshot https://issues.apache.org/jira/browse/HDDS-6517

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants