Performance issue of azure.datalake.store.core.AzureDLFile.write() for Cosmos #319

Open
zjw0304 opened this issue Oct 29, 2020 · 0 comments

zjw0304 commented Oct 29, 2020

Description

I ran a benchmark comparing the performance of uploading data to Cosmos with AzureDLFile.write() versus the write API of the native libhdfs.so. The results show a significant gap: writing the same amount of data to Cosmos takes azure-datalake-store more than twice as long as HDFS. I also checked the network throughput: with HDFS we can push it to about 4 Gb/s, while with ADL the throughput only reaches about 1.3 Gb/s.

In my testing, I used multiple threads to write the data; each thread creates its own file and writes data into it. I tried increasing the thread count and the buffer size, but neither improved the performance.
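For reference, here is a minimal sketch of the write pattern I am benchmarking. The tenant/client credentials, store name, file paths, thread count, and block size below are placeholders, not the exact values from my runs:

```python
import threading
from azure.datalake.store import core, lib

TENANT_ID = "<tenant-id>"          # placeholder
CLIENT_ID = "<client-id>"          # placeholder
CLIENT_SECRET = "<client-secret>"  # placeholder
STORE_NAME = "<adls-account>"      # placeholder

# Service-principal auth, then one shared filesystem client
token = lib.auth(tenant_id=TENANT_ID,
                 client_id=CLIENT_ID,
                 client_secret=CLIENT_SECRET)
adl = core.AzureDLFileSystem(token, store_name=STORE_NAME)

CHUNK = b"x" * (4 * 1024 * 1024)   # 4 MiB payload per write() call
WRITES_PER_FILE = 256              # ~1 GiB per file

def upload(path):
    # Each thread opens its own file and streams data via AzureDLFile.write()
    with adl.open(path, mode='wb', blocksize=4 * 1024 * 1024) as f:
        for _ in range(WRITES_PER_FILE):
            f.write(CHUNK)

# One file per thread, written concurrently
threads = [threading.Thread(target=upload, args=(f"/benchmark/file_{i}.bin",))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```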

My questions are:

  1. Is this performance gap expected, given that azure-datalake-store is built on a REST API?
  2. Is there any advanced API or parameter I can try to improve the throughput? For my scenario, we have to use the streaming write API to upload the data.

Environment summary

SDK Version: What version of the SDK are you using? (pip show azure-datalake-store)
Answer here: The latest.

Python Version: What Python version are you using? Is it 64-bit or 32-bit?
Answer here: Python 3.6.9, 64-bit

OS Version: What OS and version are you using?
Answer here: Ubuntu 18.04
