
Releases: GoogleCloudDataproc/hadoop-connectors

2021-06-29 (GCS 2.2.2)

29 Jun 15:46
48d83c5

Changelog

Cloud Storage connector:

  1. Support footer prefetch in gRPC read channel.

  2. Fix in-place seek functionality in gRPC read channel.

  3. Add option to buffer requests for resumable upload over gRPC:

    fs.gs.grpc.write.buffered.requests (default: 20)
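
    For example, to allow up to 40 buffered requests per resumable upload, the property could be set in core-site.xml (a sketch; the value of 40 is an arbitrary illustration, to be tuned per workload):

    <property>
      <name>fs.gs.grpc.write.buffered.requests</name>
      <value>40</value>
    </property>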
    

2021-05-28 (GCS 2.2.1)

28 May 14:17
bb63be7

Changelog

Cloud Storage connector:

  1. Fix proxy configuration for Apache HTTP transport.

  2. Update gRPC dependency to latest version.

2021-01-07 (GCS 2.2.0, BQ 1.2.0)

07 Jan 16:58

Changelog

Cloud Storage connector:

  1. Delete deprecated methods.

  2. Update all dependencies to latest versions.

  3. Add support for Cloud Storage objects CSEK encryption:

    fs.gs.encryption.algorithm (not set by default)
    fs.gs.encryption.key (not set by default)
    fs.gs.encryption.key.hash (not set by default)
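
    As a sketch, a CSEK setup in core-site.xml might look like the following (the values are placeholders, assuming the usual Cloud Storage CSEK format: the AES256 algorithm name, a base64-encoded 256-bit key, and the base64-encoded SHA-256 hash of that key):

    <property>
      <name>fs.gs.encryption.algorithm</name>
      <value>AES256</value>
    </property>
    <property>
      <name>fs.gs.encryption.key</name>
      <value>BASE64_ENCODED_256_BIT_KEY</value>
    </property>
    <property>
      <name>fs.gs.encryption.key.hash</name>
      <value>BASE64_ENCODED_SHA256_HASH_OF_KEY</value>
    </property>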
    
  4. Add a property to override storage service path:

    fs.gs.storage.service.path (default: `storage/v1/`)
    
  5. Add a new output stream type which can be used by setting:

    fs.gs.outputstream.type=FLUSHABLE_COMPOSITE
    

    The FLUSHABLE_COMPOSITE output stream type behaves like the SYNCABLE_COMPOSITE type, except that it also supports hflush(), which shares its implementation with hsync() in the SYNCABLE_COMPOSITE output stream type.

  6. Add a new output stream parameter

    fs.gs.outputstream.sync.min.interval.ms (default: 0)
    

    to configure the minimum time interval (in milliseconds) between consecutive syncs, in order to avoid being rate limited by Cloud Storage. The default is 0, i.e. no wait between syncs. When rate limited, hsync() blocks until a permit becomes available, while hflush() performs no operation and returns immediately.
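
    For example, to enforce at least one second between consecutive syncs (an illustrative value; the right interval depends on how often the job calls hsync()):

    <property>
      <name>fs.gs.outputstream.sync.min.interval.ms</name>
      <value>1000</value>
    </property>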

  7. Add a new parameter to configure output stream pipe type:

    fs.gs.outputstream.pipe.type (default: IO_STREAM_PIPE)
    

    Valid values are NIO_CHANNEL_PIPE and IO_STREAM_PIPE.

    When the property is set to NIO_CHANNEL_PIPE, the output stream uses a Java NIO Pipe, which allows multiple threads to write to the output stream reliably without "Pipe broken" exceptions.

    Note that with the NIO_CHANNEL_PIPE option, maximum upload throughput can decrease by 10%.

  8. Add a property to impersonate a service account:

    fs.gs.auth.impersonation.service.account (not set by default)
    

    If this property is set, an access token will be generated for this service account to access GCS. The caller who issues a request for the access token must have been granted the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator) on the service account to impersonate.
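
    A minimal configuration sketch, using a hypothetical service account name:

    <property>
      <name>fs.gs.auth.impersonation.service.account</name>
      <value>gcs-writer@my-project.iam.gserviceaccount.com</value>
    </property>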

  9. Throw ClosedChannelException in GoogleHadoopOutputStream.write methods if stream already closed. This fixes Spark Streaming jobs checkpointing to Cloud Storage.

  10. Add properties to impersonate a service account through user or group name:

    fs.gs.auth.impersonation.service.account.for.user.<USER_NAME> (not set by default)
    fs.gs.auth.impersonation.service.account.for.group.<GROUP_NAME> (not set by default)
    

    If any of these properties is set, an access token will be generated for the service account associated with the specified user or group name in order to access GCS. The caller who issues a request for the access token must have been granted the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator) on the service account to impersonate.

  11. Fix globbing of complex patterns.

  12. Add support for an authorization handler for Cloud Storage requests, configurable through the properties:

    fs.gs.authorization.handler.impl=<FULLY_QUALIFIED_AUTHORIZATION_HANDLER_CLASS>
    fs.gs.authorization.handler.properties.<AUTHORIZATION_HANDLER_PROPERTY>=<VALUE>
    

    If the fs.gs.authorization.handler.impl property is set, the specified authorization handler is used to authorize Cloud Storage API requests before executing them. The handler throws AccessDeniedException for requests that the user is not authorized to execute.

    All properties with the fs.gs.authorization.handler.properties. prefix are passed to an instance of the configured authorization handler class after instantiation, before any Cloud Storage request handling methods are called.
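
    As a sketch, assuming a hypothetical handler class com.example.MyAuthorizationHandler that reads a hypothetical policy.file property:

    <property>
      <name>fs.gs.authorization.handler.impl</name>
      <value>com.example.MyAuthorizationHandler</value>
    </property>
    <property>
      <name>fs.gs.authorization.handler.properties.policy.file</name>
      <value>/etc/hadoop/gcs-policy.json</value>
    </property>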

  13. Set default value for fs.gs.status.parallel.enable property to true.

  14. Tune exponential backoff configuration for Cloud Storage requests.

  15. Increment Hadoop FileSystem.Statistics counters for read and write operations.

  16. Always infer implicit directories and remove fs.gs.implicit.dir.infer.enable property.

  17. Replace two glob-related properties (fs.gs.glob.flatlist.enable and fs.gs.glob.concurrent.enable) with a single property to configure the glob search algorithm:

    fs.gs.glob.algorithm (default: CONCURRENT)
    
  18. Do not create parent directory objects (including buckets) when creating a new file or directory; instead, rely on implicit directory inference.

  19. Use default logging backend for Google Flogger instead of Slf4j.

  20. Add FsBenchmark tool for benchmarking HCFS.

  21. Remove obsolete fs.gs.inputstream.buffer.size property and related functionality.

  22. Fix unauthenticated access support (fs.gs.auth.null.enable=true).

  23. Improve cache hit ratio when fs.gs.performance.cache.enable property is set to true.

  24. Remove obsolete configuration properties and related functionality:

    fs.gs.auth.client.id
    fs.gs.auth.client.file
    fs.gs.auth.client.secret
    
  25. Add a property that allows disabling HCFS semantics enforcement. If set to false, the GCS connector will not check whether a directory with the same name already exists when creating a new file, and vice versa.

    fs.gs.create.items.conflict.check.enable (default: true)
    
  26. Remove redundant properties:

    fs.gs.config.override.file
    fs.gs.copy.batch.threads
    fs.gs.copy.max.requests.per.batch
    
  27. Change default value of fs.gs.inputstream.min.range.request.size property from 524288 to 2097152.

BigQuery connector:

  1. Update all dependencies to latest versions.

  2. Fix BigQuery job status retrieval in non-US locations.

  3. Use default logging backend for Google Flogger instead of Slf4j.

  4. Remove unused mapred.bq.output.buffer.size configuration property.

  5. Fix unauthenticated access support (mapred.bq.auth.null.enable=true).

  6. Remove obsolete configuration properties and related functionality:

    mapred.bq.auth.client.id
    mapred.bq.auth.client.file
    mapred.bq.auth.client.secret
    

2020-11-09 (GCS 2.1.6, BQ 1.1.6)

09 Nov 19:01

Changelog

Cloud Storage connector:

  1. Increment Hadoop FileSystem.Statistics counters for read and write operations.

  2. Add FsBenchmark tool for benchmarking HCFS.

  3. Update all dependencies to latest versions.

BigQuery connector:

  1. Fix reads using DirectBigQueryInputFormat.

  2. Update all dependencies to latest versions.

2020-09-11 (GCS 2.1.5, BQ 1.1.5)

11 Sep 21:37

Changelog

Cloud Storage connector:

  1. Fix globbing of complex patterns.

  2. Tune exponential backoff configuration for Cloud Storage requests.

  3. Add a property to ignore Cloud Storage precondition failures when overwriting objects in a concurrent environment:

    fs.gs.overwrite.generation.mismatch.ignore (default: false)
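
    For example, to ignore generation-mismatch failures (a sketch; enabling this trades away a consistency check in exchange for fewer failures under concurrent overwrites):

    <property>
      <name>fs.gs.overwrite.generation.mismatch.ignore</name>
      <value>true</value>
    </property>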
    
  4. Update all dependencies to latest versions.

BigQuery connector:

  1. Fix BigQuery job status retrieval in non-US locations.

  2. Update all dependencies to latest versions.

2020-08-07 (GCS 1.9.18, BQ 0.13.18)

07 Aug 21:35
c746958

Changelog

Cloud Storage connector:

  1. Fix globbing of complex patterns.

  2. Throw ClosedChannelException in GoogleHadoopOutputStream.write methods
    if stream already closed. This fixes Spark Streaming jobs checkpointing to
    Cloud Storage.

  3. Fix proxy authentication when using JAVA_NET transport.

BigQuery connector:

  1. POM updates for GCS connector 1.9.18.

  2. Fix proxy authentication when using JAVA_NET transport.

2020-07-15 (GCS 2.1.4, BQ 1.1.4)

16 Jul 06:44

Changelog

Cloud Storage connector:

  1. Add a new parameter to configure output stream pipe type:

    fs.gs.outputstream.pipe.type (default: IO_STREAM_PIPE)
    

    Valid values are NIO_CHANNEL_PIPE and IO_STREAM_PIPE.

    When the property is set to NIO_CHANNEL_PIPE, the output stream uses a Java NIO Pipe, which allows multiple threads to write to the output stream reliably without "Pipe broken" exceptions.

    Note that with the NIO_CHANNEL_PIPE option, maximum upload throughput can decrease by 10%.

  2. Throw ClosedChannelException in GoogleHadoopOutputStream.write methods if stream already closed. This fixes Spark Streaming jobs checkpointing to Cloud Storage.

  3. Add a property to impersonate a service account:

    fs.gs.auth.impersonation.service.account (not set by default)
    

    If this property is set, an access token will be generated for this service account to access GCS. The caller who issues a request for the access token must have been granted the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator) on the service account to impersonate.

  4. Add properties to impersonate a service account through user or group name:

    fs.gs.auth.impersonation.service.account.for.user.<USER_NAME> (not set by default)
    fs.gs.auth.impersonation.service.account.for.group.<GROUP_NAME> (not set by default)
    

    If any of these properties is set, an access token will be generated for the service account associated with the specified user or group name in order to access GCS. The caller who issues a request for the access token must have been granted the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator) on the service account to impersonate.

  5. Update all dependencies to latest versions.

BigQuery connector:

  1. Update all dependencies to latest versions.

2020-05-08 (GCS 2.1.3, BQ 1.1.3)

08 May 23:13

Changelog

Cloud Storage connector:

  1. Add support for Cloud Storage objects CSEK encryption:

    fs.gs.encryption.algorithm (not set by default)
    fs.gs.encryption.key (not set by default)
    fs.gs.encryption.key.hash (not set by default)
    
  2. Update all dependencies to latest versions.

  3. Add a new output stream type which can be used by setting:

    fs.gs.outputstream.type=FLUSHABLE_COMPOSITE
    

    The FLUSHABLE_COMPOSITE output stream type behaves like the SYNCABLE_COMPOSITE type, except that it also supports hflush(), which shares its implementation with hsync() in the SYNCABLE_COMPOSITE output stream type.

  4. Add a new output stream parameter

    fs.gs.outputstream.sync.min.interval.ms (default: 0)
    

    to configure the minimum time interval (in milliseconds) between consecutive syncs, in order to avoid being rate limited by Cloud Storage. The default is 0, i.e. no wait between syncs. When rate limited, hsync() blocks until a permit becomes available, while hflush() performs no operation and returns immediately.

  5. Restore compatibility with pre-2.8 Hadoop versions.

BigQuery connector:

  1. Update all dependencies to latest versions.

2020-04-02 (GCS 2.1.2, BQ 1.1.2)

03 Apr 01:19

Changelog

Cloud Storage connector:

  1. Update all dependencies to latest versions.

BigQuery connector:

  1. Update all dependencies to latest versions.

2020-03-11 (GCS 2.1.1, BQ 1.1.1)

11 Mar 22:31

Changelog

Cloud Storage connector:

  1. Add an upload cache to support high-level retries of failed uploads. The cache size is configured via the following property; the cache is disabled by default (zero or negative value):

    fs.gs.outputstream.upload.cache.size (default: 0)
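
    As a sketch, any positive value enables the cache; the units are assumed here to be bytes, e.g. a 128 MiB cache:

    <property>
      <name>fs.gs.outputstream.upload.cache.size</name>
      <value>134217728</value>
    </property>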
    

BigQuery connector:

  1. Fix shaded jar - add back missing relocated dependencies.