Input and Output File Handling

An on-premises job scheduler like Grid Engine typically uses a shared file system. Your Grid Engine scripts can reference files by their paths on the shared file system.

With dsub, your input files reside in a Google Cloud Storage bucket, and your output files are copied back out to Cloud Storage.

When you submit a job with dsub:

  • your input files will be automatically copied from bucket paths to local disk.
  • your code will work on the local file system inside the Docker container.
  • your output files will be automatically copied from local disk back to bucket paths.

Rather than offering many options for which disks to allocate and where to put input and output files, dsub is prescriptive:

  • All input and output is written to a single data disk mounted at /mnt/data.
  • All input and output paths mirror the remote storage location with a local path of the form /mnt/data/gs/bucket/path.

Environment variables indicating the input and output paths inside the Docker container are made available to your script.
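
For example, a minimal job might be submitted as follows (a hedged sketch: the project, bucket, and image names are placeholders, and your chosen provider may require additional flags):

dsub \
  --project my-cloud-project \
  --logging gs://my-bucket/logs/ \
  --image ubuntu:22.04 \
  --input INPUT_FILE=gs://my-bucket/path/sample.txt \
  --output OUTPUT_FILE=gs://my-bucket/path/sample.md5 \
  --command 'md5sum "${INPUT_FILE}" > "${OUTPUT_FILE}"'

Inside the container, INPUT_FILE resolves to /mnt/data/input/gs/my-bucket/path/sample.txt and OUTPUT_FILE to /mnt/data/output/gs/my-bucket/path/sample.md5; the md5sum result is copied back to gs://my-bucket/path/sample.md5 when the command succeeds.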

There are several common use cases for both input and output, each described here and demonstrated in this example.

Input

1. Copy a single file from Cloud Storage.

To copy a single file from Cloud Storage, specify the full URL to the file on the dsub command-line:

--input INPUT_FILE=gs://bucket/path/file.bam

The object at the Cloud Storage path will be copied and made available at the path /mnt/data/input/gs/bucket/path/file.bam.

The Docker container will receive the environment variable:

INPUT_FILE=/mnt/data/input/gs/bucket/path/file.bam

Multiple --input parameters can be specified and they can be specified in any order. Since it will be used as an environment variable, the name of the input parameter must comply with the Open Group Base Specifications.
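
For example, your job command can operate on the localized file directly (the md5sum step is purely illustrative):

# INPUT_FILE is the local path under /mnt/data/input/; no Cloud Storage
# access is needed inside the container.
md5sum "${INPUT_FILE}"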

2. Copy a file pattern from Cloud Storage.

To copy a set of files from Cloud Storage, specify the full URL pattern on the dsub command-line:

--input INPUT_FILES=gs://bucket/path/*.bam

The object(s) at the Cloud Storage path will be copied and made available at the path /mnt/data/input/gs/bucket/path.

The Docker container will receive the environment variable:

INPUT_FILES=/mnt/data/input/gs/bucket/path/*.bam

You will likely want your script code to tokenize the environment variable into its constituent path and pattern components. To tokenize the INPUT_FILES variable, the following code:

INPUT_FILES_PATH="$(dirname "${INPUT_FILES}")"
INPUT_FILES_PATTERN="$(basename "${INPUT_FILES}")"

will set:

INPUT_FILES_PATH=/mnt/data/input/gs/bucket/path
INPUT_FILES_PATTERN=*.bam

To process a list of files from a path + wildcard pattern in Bash, a typical coding pattern is to create an array and iterate over the array.

If you know you don't have spaces in your paths, this can simply be:

readonly INPUT_FILE_LIST=( $(ls "${INPUT_FILES_PATH}"/${INPUT_FILES_PATTERN}) )

If you might have spaces in your file paths, then you need to take a bit more care. The following will create a list of files and force Bash to tokenize the list by newlines (instead of by whitespace):

declare INPUT_FILE_LIST="$(ls -1 "${INPUT_FILES_PATH}"/${INPUT_FILES_PATTERN})"
IFS=$'\n' INPUT_FILE_LIST=(${INPUT_FILE_LIST})
readonly INPUT_FILE_LIST

Note: in both cases above, do not quote ${INPUT_FILES_PATTERN} as that will suppress wildcard expansion.


The following code shows how to iterate over the list of files in the array:

for INPUT_FILE in "${INPUT_FILE_LIST[@]}"; do
  # INPUT_FILE will be the full path including the filename
  # If you need the filename alone, use basename:
  INPUT_FILE_NAME="$(basename "${INPUT_FILE}")"

  # If you further want to trim off the file extension, perhaps to construct
  # a new output file name, then use Bash suffix substitution:
  INPUT_FILE_ROOTNAME="${INPUT_FILE_NAME%.*}"

  # Do stuff with the INPUT_FILE environment variables you now have
  ...
done

Multiple --input parameters can be specified and they can be specified in any order. Since it will be used as an environment variable, the name of the input parameter must comply with the Open Group Base Specifications.

3. Copy a directory recursively from Cloud Storage.

To recursively copy a directory from Cloud Storage, use the dsub command-line flag --input-recursive.

--input-recursive INPUT_PATH=gs://bucket/path

The object(s) at the Cloud Storage path will be recursively copied and made available at the path /mnt/data/input/gs/bucket/path.

The Docker container will receive the environment variable:

INPUT_PATH=/mnt/data/input/gs/bucket/path

Multiple --input-recursive parameters can be specified and they can be specified in any order. Since it will be used as an environment variable, the name of the input parameter must comply with the Open Group Base Specifications.
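
Your script can then walk the localized tree with standard tools. A hedged sketch, assuming the files of interest end in .bam:

# Recursively find BAM files under the localized directory and process each.
find "${INPUT_PATH}" -name "*.bam" -print0 | while IFS= read -r -d '' BAM_FILE; do
  echo "Processing ${BAM_FILE}"
done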

Output

1. Copy a single file to Cloud Storage.

To copy a single file to Cloud Storage, specify the full URL to the file on the dsub command-line:

--output OUTPUT_FILE=gs://bucket/path/file.bam

Then have your script write the output file to ${OUTPUT_FILE} within the Docker container. The file will be automatically copied to Cloud Storage when your script or command exits with success.

The Docker container will receive the environment variable:

OUTPUT_FILE=/mnt/data/output/gs/bucket/path/file.bam

Multiple --output parameters can be specified and they can be specified in any order. Since it will be used as an environment variable, the name of the output parameter must comply with the Open Group Base Specifications.
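
For example, with a hypothetical my_tool that writes a single result file:

# Write the result to the local output path; dsub copies it to
# gs://bucket/path/file.bam when the script exits successfully.
my_tool --in "${INPUT_FILE}" --out "${OUTPUT_FILE}"   # my_tool and its flags are hypothetical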

2. Copy a file pattern to Cloud Storage.

To copy a set of files to Cloud Storage, specify the full URL pattern on the dsub command-line:

--output OUTPUT_FILES=gs://bucket/path/*.bam

Then have your job write output files to ${OUTPUT_FILES} within the Docker container. All files matching the pattern /mnt/data/output/gs/bucket/path/*.bam will be automatically copied to Cloud Storage when your script or command exits with success.

The Docker container will receive the environment variable:

OUTPUT_FILES=/mnt/data/output/gs/bucket/path/*.bam

Typically a job script will have the output file extension hard-coded, but if needed it can be parsed from the environment variable. More commonly, the job script will need the output directory.

To get the output directory and file extension in Bash:

# This will set OUTPUT_DIR to "/mnt/data/output/gs/bucket/path"
OUTPUT_DIR="$(dirname "${OUTPUT_FILES}")"

# This will set OUTPUT_FILE_PATTERN to "*.bam"
OUTPUT_FILE_PATTERN="$(basename "${OUTPUT_FILES}")"

# This will set OUTPUT_EXTENSION to "bam" using the Bash prefix removal
# operator "##", matching the longest pattern up to and including the period.
OUTPUT_EXTENSION="${OUTPUT_FILE_PATTERN##*.}"

Multiple --output parameters can be specified and they can be specified in any order. Since it will be used as an environment variable, the name of the output parameter must comply with the Open Group Base Specifications.
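
With OUTPUT_DIR and OUTPUT_EXTENSION in hand, a job can write any number of matching files. A hedged sketch using a hypothetical my_aligner tool:

# Every file written under OUTPUT_DIR that matches the *.bam pattern is
# copied back to gs://bucket/path/ when the job exits successfully.
for SAMPLE in sample1 sample2; do
  my_aligner "${SAMPLE}" > "${OUTPUT_DIR}/${SAMPLE}.${OUTPUT_EXTENSION}"   # my_aligner is hypothetical
done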

3. Copy a directory recursively to Cloud Storage.

To recursively copy a directory of output to Cloud Storage, use the dsub command-line flag --output-recursive:

--output-recursive OUTPUT_PATH=gs://bucket/path

Then have your job write output files and subdirectories to ${OUTPUT_PATH} within the Docker container. All files and directories under the path /mnt/data/output/gs/bucket/path will be automatically copied to Cloud Storage when your script or command exits with success.

The Docker container will receive the environment variable:

OUTPUT_PATH=/mnt/data/output/gs/bucket/path

Multiple --output-recursive parameters can be specified and they can be specified in any order. Since it will be used as an environment variable, the name of the output parameter must comply with the Open Group Base Specifications.
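
For example, a job can lay out an arbitrary tree of results (the directory and file names below are illustrative):

# Everything under OUTPUT_PATH, including subdirectories, is copied back to
# gs://bucket/path/ when the job exits successfully.
mkdir -p "${OUTPUT_PATH}/logs" "${OUTPUT_PATH}/results"
echo "run complete" > "${OUTPUT_PATH}/logs/run.log"
echo "42" > "${OUTPUT_PATH}/results/count.txt"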

Requester Pays

Unless Requester Pays is specifically enabled, a Google Cloud Storage bucket is "owner pays" for all requests. This includes network charges for egress (data downloads or copies to a different cloud region), as well as retrieval charges on files in "cold" storage classes, such as Nearline, Coldline, and Archive.

When Requester Pays is enabled on a bucket, the requester must specify a Cloud project to which charges can be billed. Use the dsub command-line option --user-project:

--user-project my-cloud-project

The user project specified will be passed for all GCS interactions, including:

  • Logging
  • Localization (inputs)
  • Delocalization (outputs)
  • Mounting (gcsfuse)
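
For example, adding --user-project to an otherwise ordinary submission (project and bucket names are placeholders, and provider-specific flags are omitted):

dsub \
  --project my-cloud-project \
  --logging gs://my-bucket/logs/ \
  --input INPUT_FILE=gs://requester-pays-bucket/path/file.bam \
  --output OUTPUT_FILE=gs://my-bucket/path/file.md5 \
  --user-project my-cloud-project \
  --command 'md5sum "${INPUT_FILE}" > "${OUTPUT_FILE}"'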

Unsupported path formats:

  • GCS recursive wildcards (**) are not supported
  • Wildcards in the middle of a path are not supported
  • Output parameters to a directory are not supported, instead:
    • use an explicit wildcard on the filename (such as gs://mybucket/mypath/*)
    • use the recursive copy feature