Copy from Google Cloud Storage

Overview

AzCopy v10 (starting with the 10.9.0 release) supports copying data from Google Cloud Storage to Azure Blob storage. To do this, AzCopy uses the Put from URL APIs from the Azure Blob storage REST API, which copy a chunk of publicly accessible data directly from a given URL to an Azure Blob storage account. To copy data from GCP, AzCopy enumerates all objects in a given bucket, creates a pre-signed URL for each object, and then issues Put from URL calls to copy the data to Azure. Note that the copy operation does not consume the bandwidth of the machine where AzCopy runs, which makes it an efficient and performant way to copy data.
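
To make the mechanism concrete, the following is a minimal Python sketch of the same server-to-server pattern, using the google-cloud-storage and azure-storage-blob SDKs rather than AzCopy itself. The bucket and account names are placeholders, and the sketch omits the parallelism, chunking, and retry logic AzCopy adds on top.

from datetime import timedelta
from google.cloud import storage
from azure.storage.blob import ContainerClient

# Placeholder destination: a container URL carrying a valid SAS token.
DEST_SAS_URL = "https://destaccount.blob.core.windows.net/container?<SAS>"

# Assumes GOOGLE_APPLICATION_CREDENTIALS points to a service account key file.
gcs = storage.Client()
container = ContainerClient.from_container_url(DEST_SAS_URL)

for obj in gcs.list_blobs("source-bucket"):
    # Pre-signed URL: lets the Blob service read this object directly,
    # so the data never flows through the machine running this script.
    signed_url = obj.generate_signed_url(
        version="v4", expiration=timedelta(hours=1), method="GET"
    )
    # Put from URL: the copy happens service-to-service.
    container.get_blob_client(obj.name).upload_blob_from_url(
        signed_url, overwrite=True
    )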

Authentication

AzCopy uses service account keys to authenticate to GCP. For the destination Blob storage account, you can use any of the available authentication options (a SAS token or Azure Active Directory authentication).
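
As an illustration of that split (a service account key on the source side; a SAS token or Azure AD on the destination side), here is how the same two credential models might look with the Python SDKs. This is a sketch, not AzCopy configuration; the paths and account names are placeholders.

from google.cloud import storage
from google.oauth2 import service_account
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Source: authenticate to GCP with a service account key file.
gcp_creds = service_account.Credentials.from_service_account_file(
    "/path/to/service-account-key.json"
)
gcs = storage.Client(credentials=gcp_creds, project="my-gcp-project")

# Destination, option 1: a SAS token appended to the account URL.
dest_sas = BlobServiceClient("https://destaccount.blob.core.windows.net?<SAS>")

# Destination, option 2: Azure Active Directory authentication.
dest_aad = BlobServiceClient(
    "https://destaccount.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)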

Examples

Run azcopy copy --help to see example commands. To transfer any entity from GCP, set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the absolute path of the service account key file.

  1. Copy a single object to Blob Storage using a SAS token.
  • azcopy cp "https://storage.cloud.google.com/[bucket]/[object]" "https://[destaccount].blob.core.windows.net/[container]/[path/to/blob]?[SAS]"
  2. Copy a virtual directory on GCP to Blob Storage.
  • azcopy cp "https://storage.cloud.google.com/[bucket]/[folder]" "https://[destaccount].blob.core.windows.net/[container]/[path/to/directory]?[SAS]" --recursive=true
  3. Copy a bucket in GCP to a container in Blob Storage.
  • azcopy cp "https://storage.cloud.google.com/[bucket]" "https://[destaccount].blob.core.windows.net/?[SAS]" --recursive=true
  4. Copy all buckets in a GCP project as containers in an Azure Storage account. Set the environment variable GOOGLE_CLOUD_PROJECT to the project ID of the GCP source.
  • azcopy cp "https://storage.cloud.google.com/" "https://[destaccount].blob.core.windows.net/?[SAS]" --recursive=true
  5. Copy a subset of buckets in a GCP project as containers in an Azure Storage account. Set the environment variable GOOGLE_CLOUD_PROJECT to the project ID of the GCP source.
  • azcopy cp "https://storage.cloud.google.com/[bucket*name]/" "https://[destaccount].blob.core.windows.net/?[SAS]" --recursive=true

Remarks

a. URL styles supported

AzCopy supports GCS URLs in the path-style form https://storage.cloud.google.com/[bucket-name]/[object-path]. Other URL styles are not supported.

b. Bucket name resolving

Google Cloud Storage has a different set of naming conventions for bucket names than Azure Blob containers, ADLS Gen2 filesystems, and file shares.

For Azure, container/filesystem/share names follow these rules:

  1. Lowercase letters, numbers, and hyphens only.
  2. 3-63 characters in length.
  3. The name must not contain two consecutive hyphens.
  4. The name must not start or end with a hyphen.

For GCS:

  1. Bucket names must contain only lowercase letters, numbers, dashes (-), underscores (_), and dots (.). Spaces are not allowed.
  2. Bucket names must start and end with a number or letter.
  3. Bucket names must contain 3-63 characters.
  4. Bucket names cannot be represented as an IP address in dotted-decimal notation (for example, 192.168.5.4).
  5. Bucket names cannot begin with the "goog" prefix or contain "google" or close misspellings, such as "g00gle".

AzCopy will auto-resolve the following naming issues (a sketch of the resolution logic follows this list):

  1. Names with underscores (_): each underscore is replaced with a hyphen (-), e.g. bucket_name -> bucket-name.
  2. Names with consecutive hyphens: each run of hyphens is replaced with -[number of hyphens]-, e.g. bucket-----name -> bucket-5-name.
  3. Names with periods (.): each period is replaced with a hyphen (-).
  4. Naming collisions: a numeric suffix is added. For example, given the bucket names bucket-name and bucket.name, bucket.name first resolves to bucket-name, which collides, so it becomes bucket-name-2.
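
The following is a minimal Python sketch of that resolution logic, written from the four rules above; the function name and the ordering of the steps are assumptions of mine, not AzCopy's actual Go implementation.

import re

def resolve_bucket_name(name: str, taken: set) -> str:
    # Rules 1 and 3: underscores and periods become hyphens.
    resolved = name.replace("_", "-").replace(".", "-")
    # Rule 2: a run of consecutive hyphens becomes -<run length>-.
    resolved = re.sub(r"-{2,}", lambda m: "-%d-" % len(m.group()), resolved)
    # Rule 4: on a collision, append a numeric suffix.
    candidate, n = resolved, 2
    while candidate in taken:
        candidate = "%s-%d" % (resolved, n)
        n += 1
    taken.add(candidate)
    return candidate

taken = set()
print(resolve_bucket_name("bucket_name", taken))      # bucket-name
print(resolve_bucket_name("bucket-----name", taken))  # bucket-5-name
print(resolve_bucket_name("bucket.name", taken))      # bucket-name-2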

c. Object name handling

Please note that Azure Storage does not permit an object name (or any segment in the virtual directory path) to end with trailing dots (e.g. dir1/dir2.../file or dir1/dir2/file...). The Storage service trims away the trailing dots when the copy operation is performed.

d. Metadata handling

GCP allows a different character set for object metadata keys than Azure Blob storage does. AzCopy provides the --s2s-invalid-metadata-handle flag with three options for handling invalid metadata keys when transferring objects to Azure: ExcludeIfInvalid, FailIfInvalid, and RenameIfInvalid.

  1. ExcludeIfInvalid

This is the default option. Metadata with an invalid key is excluded from the transfer, while the object itself is copied to Azure. Use this option when you do not mind losing that metadata from the GCP source. When metadata is excluded, the following event is logged as a WARNING and the file transfer succeeds:

2019/03/13 09:05:21 WARN: [P#0-T#0] METADATAWARNING: For source "https://storage.cloud.google.com/[bucket-name]/[object-path]", invalid metadata with keys '$%^' '1abc' are excluded

  2. FailIfInvalid

If you set FailIfInvalid for the s2s-invalid-metadata-handle flag, objects with invalid metadata keys fail to transfer to Azure Blob storage. Each failure is logged and included in the failed count in the transfer summary. Use this option if you would like to fix the objects with invalid metadata in the GCP source. Once you have done that, you can restart the AzCopy job using the azcopy jobs resume command to retry the failed objects.

2019/03/13 09:22:38 ERR: [P#0-T#0] COPYFAILED: <sourceURL> : metadata with keys '$%^' '1abc' in source is invalid

  3. RenameIfInvalid

If you set RenameIfInvalid for the s2s-invalid-metadata-handle flag, AzCopy automatically resolves the invalid metadata key and copies the object to Azure using the resolved key-value pair. Use this option if you would rather fix the invalid metadata on the Azure side after moving all the data to Azure Storage. This lets you dispose of the contents of your GCP bucket, since all of the information is preserved in Azure Blob storage.

The rename logic is as follows (a sketch appears after the list):

  1. Replace all invalid characters (i.e. ASCII characters except [0-9A-Za-z_]) with '_'.
  2. Add 'rename_' as a prefix to the new valid key; this key is used to save the original metadata's value.
  3. Add 'rename_key_' as a prefix to the new valid key; this key is used to save the original metadata's invalid key. For example, given the invalid metadata pair '123-invalid':'content', it is resolved into two new key-value pairs: 'rename_123_invalid':'content' and 'rename_key_123_invalid':'123-invalid'.
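
A minimal Python sketch of this rename scheme, reconstructed from the three rules above; the validity check approximates Azure's requirement that metadata keys be identifier-like, and the helper names are mine, not AzCopy's.

import re

def is_valid_key(key: str) -> bool:
    # Approximation: ASCII letters, digits, and underscores only,
    # and the key must not start with a digit.
    return re.fullmatch(r"[A-Za-z_][0-9A-Za-z_]*", key) is not None

def rename_invalid(metadata: dict) -> dict:
    resolved = {}
    for key, value in metadata.items():
        if is_valid_key(key):
            resolved[key] = value
            continue
        # Rule 1: replace every character outside [0-9A-Za-z_] with '_'.
        safe = re.sub(r"[^0-9A-Za-z_]", "_", key)
        # Rule 2: 'rename_' + safe key keeps the original value.
        resolved["rename_" + safe] = value
        # Rule 3: 'rename_key_' + safe key keeps the original key.
        resolved["rename_key_" + safe] = key
    return resolved

print(rename_invalid({"123-invalid": "content"}))
# {'rename_123_invalid': 'content', 'rename_key_123_invalid': '123-invalid'}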

You can then recover the metadata on the Azure side, since the original key is preserved as a value on the Blob storage service. The transfer of the object fails if the rename operation fails.