Skip to content

Conversation

@hemantk-12
Copy link
Contributor

@hemantk-12 hemantk-12 commented Jun 27, 2023

What changes were proposed in this pull request?

This patch contains following changes.

  1. First change is to return job failure reason to client if it is available otherwise return "unknown reason".
  2. As part of HDDS-8490 and PR-4819, support to cancel snapshot job was added. In PR-4819, snapshot diff job creation API was altered to cancel the snap diff job as well. It kinda convoluted snapshot diff code path. In this patch, separate API was created to cancel snapshot diff job to simplify the snapshot diff job creation and cancellation flow. Shell command is kept the same.
  3. Reduced the snapshot diff cleanup service time interval to 1 min from 1 hour.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8859

How was this patch tested?

Updated existing unit and integration tests as of now.

Copy link
Contributor

@prashantpogde prashantpogde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @hemantk-12 . Other than minor comment. the changes look good to me.

@prashantpogde prashantpogde added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Jun 27, 2023
Copy link
Contributor

@prashantpogde prashantpogde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thank you for making these changes @hemantk-12

bucketName, fromSnapshotName, toSnapshotName, new ArrayList<>(),
null),
CANCELLED, 0L, JobCancelResult.CANCELLATION_SUCCESS);
CANCELLED, 0L, null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this switch case is for "job status == CANCELLED", should the 'null' be 'CANCEL_ALREADY_CANCELLED_JOB'?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is be nothing similar to In_progress, Done and others. I'll update it.

Thanks for the catch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure!

.equals(CANCELLED)) {
String jobKey = generateSnapDiffJobKey.apply(fromSnapInfo, toSnapInfo);
SnapshotDiffJob diffJob = snapDiffJobTable.get(jobKey);
if (diffJob == null || diffJob.getStatus() == CANCELLED) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should job status DONE, REJECTED, FAILED also return false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE, REJECTED and FAILED should not come to this path. Because one job is single thread and once it is marked DONE, REJECTED or FAILED, it won't come to this flow. You can create new job after that cleanup of REJECTED or FAILED but that would be considered a different job altogether.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks Hemant!

private synchronized void updateJobStatusToFailed(String jobKey,
String reason) {
SnapshotDiffJob snapshotDiffJob = snapDiffJobTable.get(jobKey);
if (snapshotDiffJob.getStatus() != IN_PROGRESS) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we cancel QUEUED job?

Copy link
Contributor Author

@hemantk-12 hemantk-12 Jun 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that should not happen. If happens, job is in invalid state.

Flow is Queue -> In_Progress/Rejected -> Done/Failed/Cancelled.

  1. Once job is queue, we check if executor can take more job if yes, then change the state to In_Progress otherwise Rejected.
  2. In_Progress job is marked Done if report is generated successfully. In case of failure, job is marked Failed. Cancel is marked if someone execute the cancel request only if job is f In_Progress otherwise cancel request fails.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks Hemant!

@prashantpogde prashantpogde merged commit 1c785fd into apache:master Jun 28, 2023
errose28 added a commit to errose28/ozone that referenced this pull request Jun 30, 2023
* master:
  HDDS-8555. [Snapshot] When snapshot feature is disabled, block OM startup if there are still snapshots in the system (apache#4994)
  HDDS-8782. Improve Volume Scanner Health checks. (apache#4867)
  HDDS-8447. Datanodes should not process container deletes for failed volumes. (apache#4901)
  HDDS-5869. Added support for stream on S3Gateway write path (apache#4970)
  HDDS-8859. [Snapshot] Return failure message to client for a failed snapshot diff jobs (apache#4993)
  HDDS-8939. [Snapshot] isBlockLocationSame check should be skipped if object is not OmKeyInfo. (apache#4991)
  HDDS-8923. Expose XceiverClient cache stats as metrics (apache#4979)
  HDDS-8913. ContainerManagerImpl: reduce processing while locked (apache#4967)
  HDDS-8935. [Snapshot] Fallback to full diff if getDetlaFiles from compaction DAG fails (apache#4986)
  HDDS-8911. Update Hadoop to 3.3.6 (apache#4985)
  HDDS-8931. Allow EC PipelineChoosingPolicy to be defined separately from Ratis (apache#4983)
  HDDS-8895. Support dynamic change of ozone.readonly.administrators in SCM (apache#4977)
  HDDS-6814. Make OM service ID optional for `ozone s3` commands if only one is defined in config (apache#4953)
  HDDS-8925. BaseFreonGenerator may not complete if last attempts fail (apache#4975)
  HDDS-7100. Container scanner incorrectly marks containers unhealthy when DN is shutdown (apache#4951)
  HDDS-8919. Allow EC pipelines to be created and then added to PipelineManager in two steps (apache#4968)
  HDDS-8901. Enable mTLS for InterSCMGrpcProtocol. (apache#4964)

Conflicts:
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/interfaces/Container.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainer.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueContainerCheck.java
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/KeyValueHandler.java
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/common/ContainerTestUtils.java
vtutrinov pushed a commit to Cyrill/ozone that referenced this pull request Jul 3, 2023
vtutrinov pushed a commit to Cyrill/ozone that referenced this pull request Jul 3, 2023
@hemantk-12 hemantk-12 deleted the HDDS-8859 branch October 28, 2024 18:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

snapshot https://issues.apache.org/jira/browse/HDDS-6517

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants