Performance: small blobs

It is inherently harder to achieve high throughput with small blobs (in the KBs), because the per-transaction overhead is large relative to the amount of data transferred. The AzCopy team is actively working on improving the user experience in this scenario. This article discusses some ways to tune AzCopy to increase throughput.

Job size

If your data set is large and a single job would contain more than 50 million files, consider breaking it into smaller parts.

Above 50 million files per job, AzCopy's job-tracking mechanism incurs significant overhead. For optimal performance, keep each job to roughly 10 million files or fewer.

There are multiple options for breaking down a job. Some examples:

  1. By subfolder: copy a single subfolder at a time, or use --include-path to select several.
  2. By include/exclude patterns: for example, --include-pattern "*.pdf" for the first job to copy only PDF files, and --exclude-pattern "*.pdf" for the second job to copy everything else. A sketch of both approaches follows this list.
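As a rough sketch, the split could look like the following. The storage account, container, SAS token, local path, and folder names are placeholders, not values from this article.

```
# Job 1: copy only PDF files (account, container, SAS, and local path are placeholders)
azcopy copy "/data" "https://myaccount.blob.core.windows.net/mycontainer?<SAS>" --recursive --include-pattern "*.pdf"

# Job 2: copy everything else
azcopy copy "/data" "https://myaccount.blob.core.windows.net/mycontainer?<SAS>" --recursive --exclude-pattern "*.pdf"

# Alternative: one job per group of subfolders, selected with --include-path (semicolon-separated)
azcopy copy "/data" "https://myaccount.blob.core.windows.net/mycontainer?<SAS>" --recursive --include-path "logs/2020;logs/2021"
```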

Upload and download

Depending on how powerful your machine is, raise the AZCOPY_CONCURRENCY_VALUE setting as high as possible without overloading your machine or network.
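For instance, a minimal sketch (the value shown is illustrative and should be tuned for your machine):

```
# Raise the number of concurrent requests before starting the job (Linux/macOS syntax; value is illustrative)
export AZCOPY_CONCURRENCY_VALUE=512
azcopy copy "/data" "https://myaccount.blob.core.windows.net/mycontainer?<SAS>" --recursive
```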

To achieve better performance, you can also lower the logging level to ERROR with --log-level, to minimize the time AzCopy spends logging requests and responses.

To lower the per-blob overhead, you can also turn off --check-length, saving one I/O operation that verifies the destination file's length after the transfer.
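Combined, those two options look something like this (the local path, account, container, and SAS are placeholders):

```
# Log only errors and skip the post-transfer length check to reduce per-blob overhead
azcopy copy "/data" "https://myaccount.blob.core.windows.net/mycontainer?<SAS>" --recursive --log-level ERROR --check-length=false
```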

On some Linux systems, scanning speed can be the bottleneck; in other words, enumeration does not happen fast enough to saturate all the parallel network connections. In that case you can set AZCOPY_CONCURRENT_SCAN. Refer to the help message in azcopy env.
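A sketch of what that might look like (the value is illustrative; check azcopy env for the variable's exact description on your version):

```
# List AzCopy's environment variables and their descriptions
azcopy env

# Allow more parallelism during scanning/enumeration (Linux/macOS syntax; value is illustrative)
export AZCOPY_CONCURRENT_SCAN=64
```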

Copy

Copying blobs between storage accounts is done on the service side: AzCopy coordinates the chunks, but the destination Storage service reads the data directly from the source Storage service. In this case you can be much more aggressive with AZCOPY_CONCURRENCY_VALUE and try setting it above 1000, since very little work happens on the client side.
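For example (the source and destination URLs and SAS tokens are placeholders):

```
# Service-to-service copy: the client only coordinates, so concurrency can be set much higher (illustrative value)
export AZCOPY_CONCURRENCY_VALUE=1000
azcopy copy "https://srcaccount.blob.core.windows.net/source?<SAS>" "https://dstaccount.blob.core.windows.net/dest?<SAS>" --recursive
```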

You can also break down the job and run AzCopy on more than one machine/VM, as sketched below. This has proven effective to some extent as well.
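One way to split the work across machines is to give each one a disjoint set of subfolders via --include-path; this is only a sketch, and all URLs, SAS tokens, and folder names are placeholders.

```
# VM 1: copies one set of subfolders
azcopy copy "https://srcaccount.blob.core.windows.net/source?<SAS>" "https://dstaccount.blob.core.windows.net/dest?<SAS>" --recursive --include-path "part1;part2"

# VM 2: copies the rest
azcopy copy "https://srcaccount.blob.core.windows.net/source?<SAS>" "https://dstaccount.blob.core.windows.net/dest?<SAS>" --recursive --include-path "part3;part4"
```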