
Releases: GoogleCloudDataproc/hadoop-connectors

2021-06-29 (GCS 2.2.2)

29 Jun 15:46
48d83c5

Changelog

Cloud Storage connector:

  1. Support footer prefetch in gRPC read channel.

  2. Fix in-place seek functionality in gRPC read channel.

  3. Add option to buffer requests for resumable upload over gRPC:

    fs.gs.grpc.write.buffered.requests (default: 20)
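
    For example, to allow up to 40 buffered requests per resumable upload, the property could be set in core-site.xml (a sketch; the value of 40 is an arbitrary illustration, to be tuned per workload):

    <property>
      <name>fs.gs.grpc.write.buffered.requests</name>
      <value>40</value>
    </property>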
    

2021-05-28 (GCS 2.2.1)

28 May 14:17
bb63be7

Changelog

Cloud Storage connector:

  1. Fix proxy configuration for Apache HTTP transport.

  2. Update gRPC dependency to latest version.

2021-01-07 (GCS 2.2.0, BQ 1.2.0)

07 Jan 16:58

Changelog

Cloud Storage connector:

  1. Delete deprecated methods.

  2. Update all dependencies to latest versions.

  3. Add support for Cloud Storage objects CSEK encryption:

    fs.gs.encryption.algorithm (not set by default)
    fs.gs.encryption.key (not set by default)
    fs.gs.encryption.key.hash (not set by default)
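
    As a sketch, a CSEK setup in core-site.xml might look like the following (the values are placeholders, assuming the usual Cloud Storage CSEK format: the AES256 algorithm name, a base64-encoded 256-bit key, and the base64-encoded SHA-256 hash of that key):

    <property>
      <name>fs.gs.encryption.algorithm</name>
      <value>AES256</value>
    </property>
    <property>
      <name>fs.gs.encryption.key</name>
      <value>BASE64_ENCODED_256_BIT_KEY</value>
    </property>
    <property>
      <name>fs.gs.encryption.key.hash</name>
      <value>BASE64_ENCODED_SHA256_HASH_OF_KEY</value>
    </property>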
    
  4. Add a property to override storage service path:

    fs.gs.storage.service.path (default: `storage/v1/`)
    
  5. Add a new output stream type which can be used by setting:

    fs.gs.outputstream.type=FLUSHABLE_COMPOSITE
    

    The FLUSHABLE_COMPOSITE output stream type behaves like the SYNCABLE_COMPOSITE type, except that it also supports hflush(), which shares its implementation with hsync() in the SYNCABLE_COMPOSITE output stream type.

  6. Add a new output stream parameter

    fs.gs.outputstream.sync.min.interval.ms (default: 0)
    

    to configure the minimum time interval (in milliseconds) between consecutive syncs, in order to avoid being rate limited by Cloud Storage. The default is 0, i.e. no wait between syncs. When rate limited, hsync() blocks until a permit becomes available, while hflush() performs no operation and returns immediately.
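
    For example, to enforce at least one second between consecutive syncs (an illustrative value; the right interval depends on how often the job calls hsync()):

    <property>
      <name>fs.gs.outputstream.sync.min.interval.ms</name>
      <value>1000</value>
    </property>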

  7. Add a new parameter to configure output stream pipe type:

    fs.gs.outputstream.pipe.type (default: IO_STREAM_PIPE)
    

    Valid values are NIO_CHANNEL_PIPE and IO_STREAM_PIPE.

    When the property is set to NIO_CHANNEL_PIPE, the output stream uses a Java NIO Pipe, which allows multiple threads to write to the output stream reliably without "Pipe broken" exceptions.

    Note that with the NIO_CHANNEL_PIPE option, maximum upload throughput can decrease by 10%.

  8. Add a property to impersonate a service account:

    fs.gs.auth.impersonation.service.account (not set by default)
    

    If this property is set, an access token will be generated for this service account to access GCS. The caller who issues a request for the access token must have been granted the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator) on the service account to impersonate.
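
    A minimal configuration sketch, using a hypothetical service account name:

    <property>
      <name>fs.gs.auth.impersonation.service.account</name>
      <value>gcs-writer@my-project.iam.gserviceaccount.com</value>
    </property>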

  9. Throw ClosedChannelException in GoogleHadoopOutputStream.write methods if stream already closed. This fixes Spark Streaming jobs checkpointing to Cloud Storage.

  10. Add properties to impersonate a service account through user or group name:

    fs.gs.auth.impersonation.service.account.for.user.<USER_NAME> (not set by default)
    fs.gs.auth.impersonation.service.account.for.group.<GROUP_NAME> (not set by default)
    

    If any of these properties is set, an access token will be generated for the service account associated with the specified user or group name in order to access GCS. The caller who issues a request for the access token must have been granted the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator) on the service account to impersonate.

  11. Fix globbing of complex patterns.

  12. Add support for an authorization handler for Cloud Storage requests, configurable through the properties:

    fs.gs.authorization.handler.impl=<FULLY_QUALIFIED_AUTHORIZATION_HANDLER_CLASS>
    fs.gs.authorization.handler.properties.<AUTHORIZATION_HANDLER_PROPERTY>=<VALUE>
    

    If the fs.gs.authorization.handler.impl property is set, the specified authorization handler is used to authorize Cloud Storage API requests before executing them. The handler throws AccessDeniedException for requests that the user is not authorized to execute.

    All properties with the fs.gs.authorization.handler.properties. prefix are passed to an instance of the configured authorization handler class after instantiation, before any Cloud Storage request handling methods are called.
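
    As a sketch, assuming a hypothetical handler class com.example.MyAuthorizationHandler that reads a hypothetical policy.file property:

    <property>
      <name>fs.gs.authorization.handler.impl</name>
      <value>com.example.MyAuthorizationHandler</value>
    </property>
    <property>
      <name>fs.gs.authorization.handler.properties.policy.file</name>
      <value>/etc/hadoop/gcs-policy.json</value>
    </property>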

  13. Set default value for fs.gs.status.parallel.enable property to true.

  14. Tune exponential backoff configuration for Cloud Storage requests.

  15. Increment Hadoop FileSystem.Statistics counters for read and write operations.

  16. Always infer implicit directories and remove fs.gs.implicit.dir.infer.enable property.

  17. Replace two glob-related properties (fs.gs.glob.flatlist.enable and fs.gs.glob.concurrent.enable) with a single property to configure the glob search algorithm:

    fs.gs.glob.algorithm (default: CONCURRENT)
    
  18. Do not create parent directory objects (including buckets) when creating a new file or directory; instead, rely on implicit directory inference.

  19. Use default logging backend for Google Flogger instead of Slf4j.

  20. Add FsBenchmark tool for benchmarking HCFS.

  21. Remove obsolete fs.gs.inputstream.buffer.size property and related functionality.

  22. Fix unauthenticated access support (fs.gs.auth.null.enable=true).

  23. Improve cache hit ratio when fs.gs.performance.cache.enable property is set to true.

  24. Remove obsolete configuration properties and related functionality:

    fs.gs.auth.client.id
    fs.gs.auth.client.file
    fs.gs.auth.client.secret
    
  25. Add a property that allows disabling HCFS semantics enforcement. If set to false, the GCS connector will not check whether a directory with the same name already exists when creating a new file, and vice versa.

    fs.gs.create.items.conflict.check.enable (default: true)
    
  26. Remove redundant properties:

    fs.gs.config.override.file
    fs.gs.copy.batch.threads
    fs.gs.copy.max.requests.per.batch
    
  27. Change default value of fs.gs.inputstream.min.range.request.size property from 524288 to 2097152.

BigQuery connector:

  1. Update all dependencies to latest versions.

  2. Fix BigQuery job status retrieval in non-US locations.

  3. Use default logging backend for Google Flogger instead of Slf4j.

  4. Remove unused mapred.bq.output.buffer.size configuration property.

  5. Fix unauthenticated access support (mapred.bq.auth.null.enable=true).

  6. Remove obsolete configuration properties and related functionality:

    mapred.bq.auth.client.id
    mapred.bq.auth.client.file
    mapred.bq.auth.client.secret
    

2020-11-09 (GCS 2.1.6, BQ 1.1.6)

09 Nov 19:01

Changelog

Cloud Storage connector:

  1. Increment Hadoop FileSystem.Statistics counters for read and write operations.

  2. Add FsBenchmark tool for benchmarking HCFS.

  3. Update all dependencies to latest versions.

BigQuery connector:

  1. Fix reads using DirectBigQueryInputFormat.

  2. Update all dependencies to latest versions.

2020-09-11 (GCS 2.1.5, BQ 1.1.5)

11 Sep 21:37

Changelog

Cloud Storage connector:

  1. Fix globbing of complex patterns.

  2. Tune exponential backoff configuration for Cloud Storage requests.

  3. Add a property to ignore Cloud Storage precondition failures when overwriting objects in a concurrent environment:

    fs.gs.overwrite.generation.mismatch.ignore (default: false)
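
    For example, to ignore generation-mismatch failures (a sketch; enabling this trades away a consistency check in exchange for fewer failures under concurrent overwrites):

    <property>
      <name>fs.gs.overwrite.generation.mismatch.ignore</name>
      <value>true</value>
    </property>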
    
  4. Update all dependencies to latest versions.

BigQuery connector:

  1. Fix BigQuery job status retrieval in non-US locations.

  2. Update all dependencies to latest versions.

2020-08-07 (GCS 1.9.18, BQ 0.13.18)

07 Aug 21:35
c746958

Changelog

Cloud Storage connector:

  1. Fix globbing of complex patterns.

  2. Throw ClosedChannelException in GoogleHadoopOutputStream.write methods
    if stream already closed. This fixes Spark Streaming jobs checkpointing to
    Cloud Storage.

  3. Fix proxy authentication when using JAVA_NET transport.

BigQuery connector:

  1. POM updates for GCS connector 1.9.18.

  2. Fix proxy authentication when using JAVA_NET transport.

2020-07-15 (GCS 2.1.4, BQ 1.1.4)

16 Jul 06:44

Changelog

Cloud Storage connector:

  1. Add a new parameter to configure output stream pipe type:

    fs.gs.outputstream.pipe.type (default: IO_STREAM_PIPE)
    

    Valid values are NIO_CHANNEL_PIPE and IO_STREAM_PIPE.

    When the property is set to NIO_CHANNEL_PIPE, the output stream uses a Java NIO Pipe, which allows multiple threads to write to the output stream reliably without "Pipe broken" exceptions.

    Note that with the NIO_CHANNEL_PIPE option, maximum upload throughput can decrease by 10%.

  2. Throw ClosedChannelException in GoogleHadoopOutputStream.write methods if stream already closed. This fixes Spark Streaming jobs checkpointing to Cloud Storage.

  3. Add a property to impersonate a service account:

    fs.gs.auth.impersonation.service.account (not set by default)
    

    If this property is set, an access token will be generated for this service account to access GCS. The caller who issues a request for the access token must have been granted the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator) on the service account to impersonate.

  4. Add properties to impersonate a service account through user or group name:

    fs.gs.auth.impersonation.service.account.for.user.<USER_NAME> (not set by default)
    fs.gs.auth.impersonation.service.account.for.group.<GROUP_NAME> (not set by default)
    

    If any of these properties is set, an access token will be generated for the service account associated with the specified user or group name in order to access GCS. The caller who issues a request for the access token must have been granted the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator) on the service account to impersonate.

  5. Update all dependencies to latest versions.

BigQuery connector:

  1. Update all dependencies to latest versions.

2020-05-08 (GCS 2.1.3, BQ 1.1.3)

08 May 23:13

Changelog

Cloud Storage connector:

  1. Add support for Cloud Storage objects CSEK encryption:

    fs.gs.encryption.algorithm (not set by default)
    fs.gs.encryption.key (not set by default)
    fs.gs.encryption.key.hash (not set by default)
    
  2. Update all dependencies to latest versions.

  3. Add a new output stream type which can be used by setting:

    fs.gs.outputstream.type=FLUSHABLE_COMPOSITE
    

    The FLUSHABLE_COMPOSITE output stream type behaves like the SYNCABLE_COMPOSITE type, except that it also supports hflush(), which shares its implementation with hsync() in the SYNCABLE_COMPOSITE output stream type.

  4. Add a new output stream parameter

    fs.gs.outputstream.sync.min.interval.ms (default: 0)
    

    to configure the minimum time interval (in milliseconds) between consecutive syncs, in order to avoid being rate limited by Cloud Storage. The default is 0, i.e. no wait between syncs. When rate limited, hsync() blocks until a permit becomes available, while hflush() performs no operation and returns immediately.

  5. Restore compatibility with pre-2.8 Hadoop versions.

BigQuery connector:

  1. Update all dependencies to latest versions.

2020-04-02 (GCS 2.1.2, BQ 1.1.2)

03 Apr 01:19

Changelog

Cloud Storage connector:

  1. Update all dependencies to latest versions.

BigQuery connector:

  1. Update all dependencies to latest versions.

2020-03-11 (GCS 2.1.1, BQ 1.1.1)

11 Mar 22:31

Changelog

Cloud Storage connector:

  1. Add an upload cache to support high-level retries of failed uploads. The cache size is configured via the following property; the cache is disabled by default (zero or negative value):

    fs.gs.outputstream.upload.cache.size (default: 0)
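
    As a sketch, any positive value enables the cache; the units are assumed here to be bytes, e.g. a 128 MiB cache:

    <property>
      <name>fs.gs.outputstream.upload.cache.size</name>
      <value>134217728</value>
    </property>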
    

BigQuery connector:

  1. Fix shaded jar - add back missing relocated dependencies.