@sodonnel
Contributor

What changes were proposed in this pull request?

If you create a container and then close it, the BCSID is synced on the datanodes, and the value is then updated in SCM by setting the sequenceId field on the containerInfo object for the container.

If you later restart only SCM, the sequenceId is reset to zero, and container reports for the replica then fail with a stack trace like:

Exception in thread "EventQueue-ContainerReportForContainerReportHandler" java.lang.AssertionError
	at org.apache.hadoop.hdds.scm.container.ContainerInfo.updateSequenceId(ContainerInfo.java:176)
	at org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.updateContainerStats(AbstractContainerReportHandler.java:108)
	at org.apache.hadoop.hdds.scm.container.AbstractContainerReportHandler.processContainerReplica(AbstractContainerReportHandler.java:83)
	at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:162)
	at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:130)
	at org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:50)
	at org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

The assertion fails because it does not allow the sequenceId to be changed on a CLOSED container:

  public void updateSequenceId(long sequenceID) {
    assert (isOpen() || state == HddsProtos.LifeCycleState.QUASI_CLOSED);
    sequenceId = max(sequenceID, sequenceId);
  }
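
To make the failure mode concrete, here is a minimal, self-contained sketch. It is not the Ozone code: ClosedContainerSketch, Container, State, processReport and reportFails are hypothetical stand-ins, the isOpen() check is simplified to a single OPEN state, and the guard in processReport (only update when the reported value is higher than the stored one) is an assumption about how the report handler reaches updateSequenceId. The sketch throws AssertionError explicitly so that it behaves the same whether or not the JVM runs with -ea.

```java
public class ClosedContainerSketch {

  /** Simplified lifecycle states; the real enum has more values. */
  enum State { OPEN, QUASI_CLOSED, CLOSED }

  /** Hypothetical stand-in for ContainerInfo. */
  static final class Container {
    final State state;
    long sequenceId;

    Container(State state, long sequenceId) {
      this.state = state;
      this.sequenceId = sequenceId;
    }

    /** Mirrors the guard quoted above: updates are rejected unless OPEN or QUASI_CLOSED. */
    void updateSequenceId(long reported) {
      if (!(state == State.OPEN || state == State.QUASI_CLOSED)) {
        throw new AssertionError("sequenceId update on " + state + " container");
      }
      sequenceId = Math.max(reported, sequenceId);
    }
  }

  /** Assumed handler behaviour: only touch the container if the replica is ahead. */
  static void processReport(Container c, long reported) {
    if (reported > c.sequenceId) {
      c.updateSequenceId(reported);
    }
  }

  /** Returns true if processing the report trips the guard. */
  static boolean reportFails(Container c, long reported) {
    try {
      processReport(c, reported);
      return false;
    } catch (AssertionError e) {
      return true;
    }
  }

  public static void main(String[] args) {
    // Before the restart: SCM already holds the replica's value, so no update is attempted.
    Container healthy = new Container(State.CLOSED, 42L);
    System.out.println(reportFails(healthy, 42L)); // false

    // After the restart: the stored value is 0, the replica still reports 42.
    Container restarted = new Container(State.CLOSED, 0L);
    System.out.println(reportFails(restarted, 42L)); // true
  }
}
```

Under these assumptions a CLOSED container only trips the guard when its stored value has fallen behind the replica, which matches the symptom appearing only after an SCM restart.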

The issue seems to be caused by the serialisation and deserialisation of the containerInfo object to and from protobuf, as sequenceId is never persisted or restored.
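
As a rough illustration of why a dropped field comes back as zero, the sketch below round-trips a value through a plain map instead of protobuf. Everything in it is hypothetical (SequenceIdRoundTrip, Info, toRecordLossy, toRecord, fromRecord); the only point carried over from the description is that a field which is not written cannot be restored, so the reader falls back to the default of 0.

```java
import java.util.HashMap;
import java.util.Map;

public class SequenceIdRoundTrip {

  /** Hypothetical stand-in for ContainerInfo; not the real Ozone class. */
  static final class Info {
    final long containerId;
    final long sequenceId;

    Info(long containerId, long sequenceId) {
      this.containerId = containerId;
      this.sequenceId = sequenceId;
    }
  }

  /** Lossy serialiser: writes containerId but forgets sequenceId. */
  static Map<String, Long> toRecordLossy(Info info) {
    Map<String, Long> record = new HashMap<>();
    record.put("containerId", info.containerId);
    return record;
  }

  /** Complete serialiser: writes both fields. */
  static Map<String, Long> toRecord(Info info) {
    Map<String, Long> record = toRecordLossy(info);
    record.put("sequenceId", info.sequenceId);
    return record;
  }

  /** Deserialiser: a field missing from the record falls back to 0. */
  static Info fromRecord(Map<String, Long> record) {
    return new Info(
        record.getOrDefault("containerId", 0L),
        record.getOrDefault("sequenceId", 0L));
  }

  public static void main(String[] args) {
    Info closed = new Info(1L, 42L);

    // The lossy round trip silently resets sequenceId.
    System.out.println(fromRecord(toRecordLossy(closed)).sequenceId); // 0

    // Writing the field as well preserves it.
    System.out.println(fromRecord(toRecord(closed)).sequenceId); // 42
  }
}
```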

However, I am also confused about how this ever worked, as this is a pretty significant problem.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-4336

How was this patch tested?

New integration test to reproduce the issue before fixing it.

@sodonnel sodonnel changed the title from "HDDS-4336. ContainerInfo does not persist BCSID (sequenceId) leading to failed replicas reports" to "HDDS-4336. ContainerInfo does not persist BCSID (sequenceId) leading to failed replica reports" on Oct 13, 2020
@sodonnel sodonnel merged commit 7ae037e into apache:master Oct 13, 2020
errose28 added a commit to errose28/ozone that referenced this pull request Oct 14, 2020
* master: (23 commits)
  HDDS-4122. Implement OM Delete Expired Open Key Request and Response (apache#1435)
  HDDS-4336. ContainerInfo does not persist BCSID (sequenceId) leading to failed replica reports (apache#1488)
  Remove extra serialization from getBlockID (apache#1470)
  HDDS-4262. Use ClientID and CallID from Rpc Client to detect retry requests (apache#1436)
  HDDS-4285. Read is slow due to frequent calls to UGI.getCurrentUser() and getTokens() (apache#1454)
  HDDS-4312. findbugs check succeeds despite compile error (apache#1476)
  HDDS-4311. Type-safe config design doc points to OM HA (apache#1477)
  HDDS-3814. Drop a column family through debug cli tool (apache#1083)
  HDDS-3728. Bucket space: check quotaUsageInBytes when write key and allocate block. (apache#1458)
  HDDS-4316. Upgrade to angular 1.8.0 due to CVE-2020-7676 (apache#1481)
  HDDS-4325. Incompatible return codes from Ozone getconf -confKey (apache#1485). Contributed by Doroszlai, Attila.
  HDDS-4309. Fix inconsistency in recon config keys starting with recon and not ozone (apache#1478)
  HDDS-4310: Ozone getconf broke the compatibility (apache#1475)
  HDDS-4298. Use an interface in Ozone client instead of XceiverClientManager (apache#1460)
  HDDS-4280. Document notable configurations for Recon. (apache#1448)
  HDDS-4156. add hierarchical layout to Chinese doc (apache#1368)
  HDDS-4242. Copy PrefixInfo proto to new project hadoop-ozone/interface-storage (apache#1444)
  HDDS-4264. Uniform naming conventions of Ozone Shell Options. (apache#1447)
  HDDS-4271. Avoid logging chunk content in Ozone Insight (apache#1466)
  HDDS-4299. Display Ratis version with ozone version (apache#1464)
  ...