@steveloughran
Contributor

This is #794 with my edits added.

Ben Roling and others added 3 commits May 2, 2019 12:30
commit ae876ab2df46c68ddd923edf8dd1d314191fcc94
Merge: 2e0254e 6a42745
Author: Ben Roling <[email protected]>
Date:   Thu May 2 10:14:10 2019 -0500

    Merge branch 'trunk' into HADOOP-16085-squashed-2

commit 2e0254e
Author: Ben Roling <[email protected]>
Date:   Thu Apr 18 12:13:40 2019 -0500

    Remove unused import

commit d1275e4
Merge: 450ba66 df76cdc
Author: Ben Roling <[email protected]>
Date:   Thu Apr 18 12:10:01 2019 -0500

    Merge branch 'trunk' into HADOOP-16085-squashed

commit 450ba66
Author: Ben Roling <[email protected]>
Date:   Thu Apr 18 11:45:41 2019 -0500

    Improvements to TestObjectChangeDetectionAttributes, AbstractS3AMockTest

commit 408af6c
Author: Ben Roling <[email protected]>
Date:   Thu Apr 18 10:29:05 2019 -0500

    Use HttpStatus code constant instead of magic number

commit 5f0532b
Author: Ben Roling <[email protected]>
Date:   Thu Apr 18 10:02:50 2019 -0500

    Update core-default.xml

commit 3488b20
Author: Ben Roling <[email protected]>
Date:   Wed Apr 17 16:14:43 2019 -0500

    Fix runaround of creating FileStatus and then calling fromFileStatus()

commit 90d5c9c
Author: Ben Roling <[email protected]>
Date:   Wed Apr 17 15:45:51 2019 -0500

    Fix minor nits

commit 3ff59e4
Author: Ben Roling <[email protected]>
Date:   Wed Apr 17 15:07:02 2019 -0500

    Mutate S3AFileStatus instead of creating new instance

commit 13fab97
Author: Ben Roling <[email protected]>
Date:   Wed Apr 17 14:30:55 2019 -0500

    Rename S3LocatedFileStatus to S3ALocatedFileStatus

commit bee4e52
Author: Ben Roling <[email protected]>
Date:   Wed Apr 17 14:25:17 2019 -0500

    Stop pretending to support group and permission attributes on S3AFileStatus

commit 807e13b
Author: Ben Roling <[email protected]>
Date:   Wed Apr 17 14:20:14 2019 -0500

    Add serialVersionUID to S3LocatedFileStatus

commit 9974cec
Author: Ben Roling <[email protected]>
Date:   Mon Apr 8 13:34:38 2019 -0500

    Fix missed group or owner tweak

commit 708c001
Author: Ben Roling <[email protected]>
Date:   Mon Apr 8 12:58:14 2019 -0500

    Fix S3AFileStatus group handling

    ITestS3AConfiguration.testUsernameFromUGI was failing, expecting the
    user to be copied into the group.

    Strict copying of user into group causes
    TestLocalMetadataStore.testPutNew() to fail since it expects the group
    to be preserved from the original FileStatus.

    This change copies user into group when group is null/empty. With this
    change, all existing tests pass.

commit 5239a9f
Author: Ben Roling <[email protected]>
Date:   Thu Apr 4 16:38:31 2019 -0500

    Skip tests that require versionId when bucket doesn't have versioning enabled

commit 4c6331e
Author: Ben Roling <[email protected]>
Date:   Mon Apr 1 13:58:24 2019 -0500

    Fix broken TestObjectChangeDetectionAttributes

commit 8a19c42
Author: Ben Roling <[email protected]>
Date:   Mon Apr 1 10:08:33 2019 -0500

    Squashed commit of the following:

    commit 9f4ad88
    Author: Ben Roling <[email protected]>
    Date:   Mon Apr 1 09:29:35 2019 -0500

        Add test for 412 response

    commit dc0a3fb
    Author: Ben Roling <[email protected]>
    Date:   Thu Mar 28 16:53:46 2019 -0500

        Update tests that started failing due to HADOOP-15999

    commit 5e1f3e3
    Author: Ben Roling <[email protected]>
    Date:   Thu Mar 28 15:49:26 2019 -0500

        Speed up ITestS3ARemoteFileChanged

    commit 1b6be40
    Author: Ben Roling <[email protected]>
    Date:   Thu Mar 28 14:23:53 2019 -0500

        Skip invalid test when object versioning enabled

    commit 8597d2e
    Merge: 2d235f8 b5db238
    Author: Ben Roling <[email protected]>
    Date:   Thu Mar 28 11:54:50 2019 -0500

        Merge remote-tracking branch 'apache/trunk' into HADOOP-16085

    commit 2d235f8
    Author: Ben Roling <[email protected]>
    Date:   Thu Mar 28 11:51:26 2019 -0500

        Fix typo

    commit dc83cef
    Author: Ben Roling <[email protected]>
    Date:   Thu Mar 28 10:28:09 2019 -0500

        Generalize TestObjectETag to cover versionId and test overwrite

    commit 0d71f32
    Author: Ben Roling <[email protected]>
    Date:   Thu Mar 28 08:45:42 2019 -0500

        Fix trailing whitespace

    commit 324be6d
    Author: Ben Roling <[email protected]>
    Date:   Wed Mar 27 22:00:57 2019 -0500

        S3GuardTool updates to correct ETag or versionId metadata

    commit 2a2bba7
    Author: Ben Roling <[email protected]>
    Date:   Wed Mar 27 21:27:27 2019 -0500

        Clarify log message

    commit 6e62a3a
    Author: Ben Roling <[email protected]>
    Date:   Wed Mar 27 21:17:48 2019 -0500

        Documentation updates per PR feedback

    commit 1ff8bef
    Author: Ben Roling <[email protected]>
    Date:   Wed Mar 27 16:05:59 2019 -0500

        check version.required on CopyResult

    commit e296275
    Author: Ben Roling <[email protected]>
    Date:   Wed Mar 27 16:04:50 2019 -0500

        Minor javadoc improvements from PR review

    commit 3e9ea19
    Author: Ben Roling <[email protected]>
    Date:   Wed Mar 27 13:15:58 2019 -0500

        Skip tests that aren't applicable with change.detection.source=versionId

    commit ddbf68b
    Author: Ben Roling <[email protected]>
    Date:   Wed Mar 27 11:56:38 2019 -0500

        Add tests of case where no version metadata is present

    commit 21d37dd
    Author: Ben Roling <[email protected]>
    Date:   Wed Mar 27 09:25:46 2019 -0500

        Fix compiler deprecation warning

    commit b8e1569
    Author: Ben Roling <[email protected]>
    Date:   Wed Mar 27 09:19:46 2019 -0500

        Fix license issue

    commit 33bb5f9
    Author: Ben Roling <[email protected]>
    Date:   Wed Mar 27 09:19:32 2019 -0500

        Fix findbugs issue

    commit 5b7fadb
    Author: Ben Roling <[email protected]>
    Date:   Wed Mar 27 09:00:39 2019 -0500

        Fix checkstyle issues

    commit 6110a11
    Author: Ben Roling <[email protected]>
    Date:   Wed Mar 27 08:28:37 2019 -0500

        Remove trailing whitespace

    commit d82069b
    Author: Ben Roling <[email protected]>
    Date:   Tue Mar 26 16:05:01 2019 -0500

        Improve S3Guard doc

    commit ca2f0e9
    Author: Ben Roling <[email protected]>
    Date:   Tue Mar 26 14:29:03 2019 -0500

        Fix ITestS3ARemoteFileChanged

    commit 1e4fa85
    Author: Ben Roling <[email protected]>
    Date:   Tue Mar 26 11:37:48 2019 -0500

        Increase local metastore cache timeout

    commit 34b0c80
    Author: Ben Roling <[email protected]>
    Date:   Tue Mar 26 11:35:34 2019 -0500

        Fix isEmptyDir inconsistency

    commit bbf8365
    Author: Ben Roling <[email protected]>
    Date:   Mon Mar 25 16:55:24 2019 -0500

        TestPathMetadataDynamoDBTranslation tests null etag, versionId

    commit 2ae7d16
    Author: Ben Roling <[email protected]>
    Date:   Mon Mar 25 16:54:49 2019 -0500

        Add constants in TestDirListingMetadata

    commit 068a55d
    Author: Ben Roling <[email protected]>
    Date:   Mon Mar 25 15:43:45 2019 -0500

        Add copy exception handling

    commit 0eca6f3
    Author: Ben Roling <[email protected]>
    Date:   Mon Mar 25 12:43:51 2019 -0500

        Don't process response from copy

    commit ad9e152
    Author: Ben Roling <[email protected]>
    Date:   Mon Feb 25 16:41:54 2019 -0600

        HADOOP-16085-003.patch

        Rebase of previous work after merge of HADOOP-15625.
Includes retries for regular reads, select(), and rename()
+add stevel review (primarily of tests)

Change-Id: I75a3b70917eefc0a0ec3190ca1de527e2081551e
@ben-roling
Contributor

The changes here looked good to me and I pulled this into #794 as mentioned here:
#794 (comment)

@steveloughran
Contributor Author

thanks, I'll move onto that again with my ongoing work.

shanthoosh added a commit to shanthoosh/hadoop that referenced this pull request Oct 15, 2019
Samza users may need to increase the partition count of the input streams of their stateful Samza jobs. For example, Kafka needs to limit the maximum size of each partition to scale up its performance. The number of partitions of a Kafka topic therefore has to be expanded to reduce the partition size if the average byte-in rate or retention time of the topic has doubled.

To perform a join between streams, stateful jobs generally have to route the partitions from the different input streams to the same task of a container. However, when an input stream is repartitioned, the key space of each partition gets redistributed, which causes the stateful jobs to produce erroneous results.

So if the partition count of an input stream is increased, users have to manually purge the changelog topics and the local RocksDB state of their stateful jobs. This results in increased operational complexity and data loss.

This patch takes a first stab at solving the above problem and is comprised of the following changes:

* Introduce a new group method in the `SystemStreamPartitionGrouper` interface to generate task assignments that factor in the partition expansion of input streams.
* Introduce a `StreamPartitionMapper` abstraction to allow the user to plug in the input stream partitioning function.
* Fix the existing unit tests and add new unit tests to validate the new grouper changes.
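
To illustrate why a partition mapping function helps, here is a minimal sketch of mapping an expanded partition back to the partition that previously held its keys. It assumes modulo hash partitioning and an expansion to a multiple of the previous partition count. The class and method names are hypothetical and are not the Samza `StreamPartitionMapper` API.

```java
public class PartitionExpansionSketch {

    /** Partition a key lands in, assuming hash(key) mod partitionCount (an assumption). */
    static int partitionFor(Object key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    /**
     * Maps a partition of the expanded stream back to the partition that held
     * the same keys before the expansion. Only valid under modulo partitioning
     * when the new count is a multiple of the old count.
     */
    static int previousPartition(int currentPartition, int previousCount, int currentCount) {
        if (previousCount <= 0 || currentCount % previousCount != 0) {
            throw new IllegalArgumentException(
                "current partition count must be a multiple of the previous count");
        }
        return currentPartition % previousCount;
    }

    public static void main(String[] args) {
        int before = 4;
        int after = 8;
        for (String key : new String[] {"user-1", "user-2", "user-3"}) {
            int oldPartition = partitionFor(key, before);
            int newPartition = partitionFor(key, after);
            // The key moved partitions, but the mapping recovers its old partition,
            // so a grouper could keep routing it to the task that owns its state.
            System.out.println(key + ": " + oldPartition + " -> " + newPartition
                + " (maps back to " + previousPartition(newPartition, before, after) + ")");
        }
    }
}
```

With a mapping like this, a grouper could assign each new partition to the task that already owns the state for its keys, rather than requiring the state to be purged.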

In a follow-up PR shortly, these grouper changes will be integrated with `JobModelManager` (waiting for PR 790 to land first, since it made significant changes to `JobModelManager`).

Author: Shanthoosh Venkataraman <[email protected]>

Reviewers: Prateek M<[email protected]>, Ray Matharu<[email protected]>, Daniel Nishimura<[email protected]>

Closes apache#803 from shanthoosh/SEP-5