Consultation: About DADI's use business scenarios & implementation status #120

Open
bengbeng-pp opened this issue Jun 10, 2022 · 10 comments

@bengbeng-pp

Excuse me, I have a few questions.
Does DADI have a user discussion group?
Are there any business scenarios that DADI is not suitable for?
Could you share the current status of DADI's adoption at Alibaba?

@BigVan
Member

BigVan commented Jun 10, 2022

Does DADI have a user discussion group?
Yep, we created a channel named 'overlaybd' in the CNCF Slack workspace today 😂. Log in to the workspace through https://slack.cncf.io/ and search for 'overlaybd'.

Are there any business scenarios that DADI is not suitable for?
Theoretically, overlaybd has higher compatibility than other image formats because it supports any filesystem. We strongly suggest using overlaybd in production environments such as CentOS / Ubuntu / Aliyun Linux, where it is very easy to support.

Could you share the current status of DADI's adoption at Alibaba?
DADI has been widely used in Alibaba Group for about 4 years, and it now supports almost all container-related products.
Some public information is available here:
https://help.aliyun.com/document_detail/184556.html
https://developer.aliyun.com/article/782918?utm_content=g_1000255233
https://www.sohu.com/a/504934204_612370

@bengbeng-pp
Author

Thanks for answering.
Let me explain the second and third questions.
In our actual rollout, although image pulls have become faster, container startup and business startup now take longer. I understand that this is unavoidable. I would like to ask about your practical experience with production deployment, to help us roll out DADI on a larger scale.

  1. How much does on-demand loading affect container startup and business startup time, and is there any data on this?
  2. How does that impact vary with, for example, the business scenario or the development language? Based on such data, we want to decide which scenarios to prioritize for a large-scale DADI rollout.
  3. In addition, besides trace prefetching, are there other optimization measures or plans for this situation?

@BigVan
Member

BigVan commented Jun 14, 2022

I see...
It seems you are concerned that on-demand loading will negatively affect the business startup time.
The difference in business startup time between a local image and overlaybd is usually not very noticeable; it depends on your registry latency and the I/O pattern of your application. In most cases, an application loads only a small amount of image data during its life cycle. BTW, overlaybd keeps the data it fetched on demand in its cache directory, so you won't notice any slowdown on the next startup.
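
To make the on-demand loading and caching behavior concrete, here is a minimal Python sketch. It is not overlaybd's implementation: `LazyImage`, `make_remote`, the in-memory `cache` dict (standing in for the cache directory), and the tiny block size are all assumptions chosen for illustration.

```python
from typing import Callable, Dict

BLOCK_SIZE = 4  # deliberately tiny so the example is easy to follow (assumption)


class LazyImage:
    """Hypothetical reader: fetches a block only when it is first read."""

    def __init__(self, fetch_block: Callable[[int], bytes], cache: Dict[int, bytes]):
        self.fetch_block = fetch_block  # stands in for a read from the registry
        self.cache = cache              # stands in for the on-disk cache directory
        self.remote_fetches = 0

    def read_block(self, index: int) -> bytes:
        if index not in self.cache:
            # Cache miss: go to the remote once, then keep the block locally.
            self.cache[index] = self.fetch_block(index)
            self.remote_fetches += 1
        return self.cache[index]


def make_remote(image: bytes) -> Callable[[int], bytes]:
    """Build a fake remote that serves fixed-size blocks of `image`."""
    def fetch(index: int) -> bytes:
        start = index * BLOCK_SIZE
        return image[start:start + BLOCK_SIZE]
    return fetch


image = bytes(range(40))      # a 10-block "image"
cache: Dict[int, bytes] = {}  # survives across "container starts"

first = LazyImage(make_remote(image), cache)
for i in (0, 3):              # the application touches only 2 of 10 blocks
    first.read_block(i)

second = LazyImage(make_remote(image), cache)
for i in (0, 3):              # the next startup is served from the cache
    second.read_block(i)

print(first.remote_fetches, second.remote_fetches)  # 2 0
```

In this sketch the first start pays for two remote reads and the second start pays for none, which is the effect described above; real latencies depend on your registry and workload.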

Our paper reports some performance measurements from our production environment.

There are also MySQL benchmarks which may help you:

One more thing: DADI is widely used for machine learning and WebIDE in Alibaba Cloud, but I think you should run your own tests in your environment.

For question 3: currently, we don't have other optimization plans.

You can contact me by email ([email protected]) if you need more information or help with adoption in your company :D

@beef9999
Contributor

@bengbeng-pp Currently in Alibaba Cloud, only Function Compute uses trace prefetching, because it's relatively easy for them to record a trace. Some businesses are reluctant to do such a thing.

I think what you need is Cache + P2P distribution. DADI has an open-source implementation of each. By setting up a large-scale SSD cluster, you basically distribute / cache every hot piece of data in the network, and thus a mighty network filesystem is formed :-)

@bengbeng-pp
Author

Thank you very much for your patience.

  1. Does "cold start" in the figure above mean the time until the application has started successfully? Was the data in the figure measured without a local cache?
  2. You said the difference in business startup time between a local image and overlaybd is not very noticeable and depends on registry latency and the application's IO pattern. What does the IO pattern of the application refer to?
  3. Is DADI mainly used for machine learning and WebIDE? Why these two scenarios, and has it been deployed for online business?
  4. Why does only Function Compute use trace prefetching? Trace prefetching may not be applicable to our scenario, because recording a trace may cause the business logic to run and modify business data.

Other:

  1. We are now using the cache function. Can you suggest a way to improve the observability of the cache, for example the percentage of an image that is cached locally and the cache hit rate?
  2. We haven't used the P2P function yet; it is in our plans. Do you have any performance data for P2P? Also, is it possible to monitor the download speed from the remote?

@BigVan
Member

BigVan commented Jun 16, 2022

  1. Does "cold start" in the figure above mean the time until the application has started successfully? Was the data in the figure measured without a local cache?

Yes, the figure shows the cold startup time of a tgz image versus overlaybd.

  2. What does the IO pattern of the application refer to?

By 'the IO pattern' I mean that most applications read only a small fraction of the image data (~6.4%, FAST '16).

  3. Is DADI mainly used for machine learning and WebIDE? Why these two scenarios, and has it been deployed for online business?

Actually, online business was the first scenario we deployed. Machine learning and WebIDE, which I mentioned, typically use larger images than others (~10GB+).

  4. Why does only Function Compute use trace prefetching? Trace prefetching may not be applicable to our scenario, because recording a trace may cause the business logic to run and modify business data.

Overlaybd records the image I/O trace without network access. In my experience, trace prefetching should be helpful to you.

About cache usage:

We use LRU to automatically evict unused cache data, so the cache never exceeds its configured capacity. If you want to know the cache's disk usage, run 'du -sh' on the cache directory.
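
For readers who want to reason about eviction and hit rate, here is a minimal Python sketch of a capacity-bounded LRU cache with hit/miss counters. It only illustrates the LRU policy described above; the class `LRUBlockCache`, its methods, and the byte-based capacity are hypothetical and do not reflect overlaybd's actual cache code or any metrics it exposes.

```python
from collections import OrderedDict
from typing import Optional


class LRUBlockCache:
    """Hypothetical byte-bounded LRU cache that also counts hits and misses."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.hits = 0
        self.misses = 0
        self._blocks: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        if key in self._blocks:
            self._blocks.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._blocks[key]
        self.misses += 1
        return None

    def put(self, key: str, data: bytes) -> None:
        if key in self._blocks:
            self.used_bytes -= len(self._blocks.pop(key))
        self._blocks[key] = data
        self.used_bytes += len(data)
        # Evict least recently used entries until we are back under the limit.
        while self.used_bytes > self.capacity_bytes and self._blocks:
            _, evicted = self._blocks.popitem(last=False)
            self.used_bytes -= len(evicted)

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = LRUBlockCache(capacity_bytes=8)
cache.put("a", b"1234")
cache.put("b", b"5678")
cache.get("a")            # hit; "a" becomes the most recently used entry
cache.put("c", b"abcd")   # over capacity, so "b" (least recently used) is evicted
cache.get("b")            # miss
print(cache.used_bytes, cache.hit_rate())  # 8 0.5
```

Counting hits and misses at the lookup path, as in this sketch, is one simple way to obtain a hit rate; whether and where such counters can be added in a real deployment is something to verify against the overlaybd source.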

About P2P:

There is a very rudimentary open-source implementation of our P2P, but I don't think you need it: https://github.com/data-accelerator/dadi-p2proxy

Anyway, as I said before, I can only share conclusions from my own experience; you should run your own tests. :-D

@bengbeng-pp
Author

Overlaybd records the image I/O trace without network access.

Won't this cause the application to get stuck when it needs to perform network operations, making it impossible to obtain a complete record of its IO operations?

@BigVan
Member

BigVan commented Jun 20, 2022

Yes. The prefetch trace depends on the application environment.
When the trace is recorded without network access, it can only accelerate the initial phase of container startup.
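
The limitation can be illustrated with a small Python sketch of record-then-replay prefetching. This is not overlaybd's trace format or API: `record_trace`, `prefetch`, `startup`, and the block numbers are all hypothetical, and the `network_available` flag simply models an application that reads more data only after a network call succeeds.

```python
from typing import Callable, Dict, List, Set

ReadFn = Callable[[int], bytes]


def record_trace(workload: Callable[[ReadFn], None], read_block: ReadFn) -> List[int]:
    """Run the workload once and record each block index on first access."""
    trace: List[int] = []
    seen: Set[int] = set()

    def traced_read(index: int) -> bytes:
        if index not in seen:
            seen.add(index)
            trace.append(index)
        return read_block(index)

    workload(traced_read)
    return trace


def prefetch(trace: List[int], fetch_block: ReadFn, cache: Dict[int, bytes]) -> None:
    """Replay the trace to warm the cache before the application asks."""
    for index in trace:
        if index not in cache:
            cache[index] = fetch_block(index)


def startup(read: ReadFn, network_available: bool) -> None:
    """Hypothetical application startup."""
    for i in (5, 1, 5, 2):
        read(i)
    if network_available:
        read(9)  # only reached after a successful network call


blocks = {i: bytes([i]) * 4 for i in range(10)}
fetch = blocks.__getitem__

# Record without network: the application never reaches block 9.
trace = record_trace(lambda read: startup(read, network_available=False), fetch)

cache: Dict[int, bytes] = {}
prefetch(trace, fetch, cache)

misses: List[int] = []


def cached_read(index: int) -> bytes:
    if index not in cache:
        misses.append(index)
        cache[index] = fetch(index)
    return cache[index]


startup(cached_read, network_available=True)
print(trace, misses)  # [5, 1, 2] [9]
```

In this sketch the blocks read before the network call are prefetched, while block 9 still has to be loaded on demand, matching the point that such a trace only covers the beginning of the container's life.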

@bengbeng-pp
Author

I understand. Thank you very much for the answer.

@dbfancier

@bengbeng-pp Currently in Alibaba Cloud, only Function Compute uses trace prefetching, because it's relatively easy for them to record a trace. Some businesses are reluctant to do such a thing.

I think what you need is Cache + P2P distribution. DADI has an open-source implementation of each. By setting up a large-scale SSD cluster, you basically distribute / cache every hot piece of data in the network, and thus a mighty network filesystem is formed :-)

Hello, is there any documentation on how to configure the cache and P2P? When I pulled an obd-format image from the registry, I could not see anything in /opt/overlaybd/registry_cache.
