Consultation: About DADI's use business scenarios & implementation status #120

bengbeng-pp · 2022-06-10T02:20:05Z

Excuse me, I have a few questions.
Does DADI have a user exchange group?
Are there any business scenarios for which DADI is unsuitable?
Could you share the current status of DADI's adoption at Alibaba?

BigVan · 2022-06-10T09:05:14Z

Does DADI have a user exchange group?
Yes, we created a channel named 'overlaybd' in the CNCF Slack workspace today 😂. Log in to the workspace through https://slack.cncf.io/ and search for 'overlaybd'.

Are there any business scenarios for which DADI is unsuitable?
Theoretically, overlaybd has higher compatibility than other image formats because it supports any filesystem. We strongly suggest using overlaybd in production environments such as CentOS, Ubuntu, or Aliyun Linux. It will be very easy to support there.

Could you share the current status of DADI's adoption at Alibaba?
DADI has been widely used in Alibaba Group for about 4 years, and it now supports almost all container-related products...
Some public information is available here:
https://help.aliyun.com/document_detail/184556.html
https://developer.aliyun.com/article/782918?utm_content=g_1000255233
https://www.sohu.com/a/504934204_612370

bengbeng-pp · 2022-06-14T08:48:14Z

Thanks for answering.
Let me explain the second and third questions.
In our actual rollout, although image pull time has been reduced, container startup and business startup now take longer. I understand that this is inevitable. I would like to ask about your practical experience with production adoption, to help us roll out DADI on a larger scale.

How much does on-demand loading affect container startup and business startup, and is there any data on this?
How does the impact of on-demand loading on container startup and business startup vary, for example across business scenarios or development languages? Based on this data, we want to decide which scenarios to prioritize for large-scale DADI adoption.
In addition, besides trace prefetching, are there other optimization measures or plans for this situation?

BigVan · 2022-06-14T14:33:23Z

I see...
It seems that you are concerned that on-demand loading will negatively affect the business startup time.
The difference in business startup time between a local image and overlaybd is usually not very noticeable; it depends on your registry latency and the I/O pattern of your application. In most cases, the application loads only a small portion of the image data during its life cycle. BTW, overlaybd keeps the on-demand image data in its cache directory, so you won't notice any slowdown on the next startup.
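To make the caching behavior described above concrete, here is a minimal sketch of a read-through cache. It is not overlaybd's implementation: `OnDemandImage`, `fetch_remote`, and `simulate_startup` are hypothetical names, and an in-memory dict stands in for the cache directory. It only illustrates why the first startup pays for remote fetches while the next one does not.

```python
class OnDemandImage:
    """Toy read-through cache: fetch a block remotely only on first access."""

    def __init__(self, fetch_remote):
        self.fetch_remote = fetch_remote  # hypothetical callable: block index -> bytes
        self.cache = {}                   # stands in for the local cache directory
        self.remote_fetches = 0

    def read_block(self, index):
        if index not in self.cache:
            # Cache miss: go to the registry (simulated) and keep the result.
            self.cache[index] = self.fetch_remote(index)
            self.remote_fetches += 1
        return self.cache[index]


def simulate_startup(image, block_indices):
    """Read the blocks an application touches at startup; return the remote fetch count."""
    before = image.remote_fetches
    for i in block_indices:
        image.read_block(i)
    return image.remote_fetches - before
```

Under these assumptions, the cost of a cold start scales with the number of distinct blocks the application touches and the latency of each fetch, which is why the registry latency and the I/O pattern matter.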

Our paper reports some performance results from our production environment.

There are also MySQL benchmarks which may help you:

One more thing: DADI is widely used for machine learning and WebIDE in Alibaba Cloud, but I think you should run your own tests in your environment.

For question 3: currently, we don't have other optimization plans...

You can contact me by email ([email protected]) if you need more information or help with adoption in your company :D

beef9999 · 2022-06-14T15:15:46Z

@bengbeng-pp Currently in Alibaba Cloud, only Function Compute uses trace prefetching, because it's relatively easy for them to record traces. Some businesses are reluctant to do such a thing.

I think what you need is Cache + P2P distribution. DADI has an open-source implementation of each. By setting up a large-scale SSD cluster, you basically distribute / cache every hot piece of data in the network, and thus a mighty network filesystem is formed :-)

bengbeng-pp · 2022-06-16T07:06:22Z

Thank you very much for your patience.

Does the cold start in the above figure mean the time until the application starts successfully? Was the data in the figure collected without a local cache?
You said the difference in business startup time between a local image and overlaybd depends on the registry latency and the I/O pattern of the application. What does the I/O pattern of the application refer to?
Is DADI mainly used in machine learning and WebIDE? Why these two scenarios, and has it been adopted for online business?
Why does only Function Compute use trace prefetching? Trace prefetching may not be applicable to our scenario, because running the business logic while recording traces may modify business data.

Other:

We are now using the cache function. Can you suggest ways to improve the observability of the cache? For example, the percentage of an image that is cached locally and the cache hit rate.
We haven't used the P2P function yet, but it is planned. Do you have any performance data for P2P? Also, is it possible to monitor the download speed from the remote?

BigVan · 2022-06-16T08:15:21Z

Does the cold start in the above figure mean the time until the application starts successfully? Was the data in the figure collected without a local cache?

Yes, the figure compares the cold startup time of a tgz image and overlaybd.

You said the difference in business startup time between a local image and overlaybd depends on the registry latency and the I/O pattern of the application. What does the I/O pattern of the application refer to?

By 'the I/O pattern' I mean that most applications read only a small fraction of the image data (~6.4%, according to a FAST '16 paper).

Is DADI mainly used in machine learning and WebIDE? Why these two scenarios, and has it been adopted for online business?

Actually, online business was the first scenario where DADI was adopted. Machine learning and WebIDE, which I mentioned, typically use larger images than others (~10 GB+).

Why does only Function Compute use trace prefetching? Trace prefetching may not be applicable to our scenario, because running the business logic while recording traces may modify business data.

Overlaybd records the image I/O trace without network access. In my experience, trace prefetching should be helpful to you.

About cache usage:

We use LRU to automatically evict unused cache data, so the cache never exceeds its capacity limit. If you want to know the cache's disk usage, try 'du -sh' on the cache directory.
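For readers who want a feel for LRU eviction under a capacity limit, and for how a hit rate could be counted, here is a minimal sketch. It is an illustration only, not overlaybd's cache: the class `LRUBlockCache` and its counters are hypothetical, and data is held in memory rather than in a cache directory.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Toy LRU cache with a byte capacity and hit/miss counters (hypothetical)."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.blocks = OrderedDict()  # key -> bytes, least recently used first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.blocks[key]
        self.misses += 1
        return None

    def put(self, key, data):
        if len(data) > self.capacity_bytes:
            return  # larger than the whole cache: do not store it
        if key in self.blocks:
            self.used_bytes -= len(self.blocks.pop(key))
        # Evict least recently used entries until the new data fits.
        while self.used_bytes + len(data) > self.capacity_bytes:
            _, evicted = self.blocks.popitem(last=False)
            self.used_bytes -= len(evicted)
        self.blocks[key] = data
        self.used_bytes += len(data)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

In this sketch `used_bytes` can never exceed `capacity_bytes`, which mirrors the guarantee described above, and `hit_rate()` is the kind of metric the observability question asks about.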

About P2P:

There is a very rudimentary open-source implementation of our P2P... but I don't think you need it. https://github.com/data-accelerator/dadi-p2proxy

Anyway, as I said before, I can only share conclusions from my own experience; you should run your own tests. :-D

bengbeng-pp · 2022-06-17T07:11:27Z

Overlaybd records the image I/O trace without network access.

Will this cause the application to get stuck when it needs network operations, making it impossible to obtain a complete record of I/O operations?

BigVan · 2022-06-20T09:16:22Z

Yes... the prefetch trace depends on the application environment.
When recorded without network, it can only accelerate the initial phase of container startup.
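The record-then-replay idea behind trace prefetching can be sketched as follows. This is an assumption-laden illustration, not overlaybd's trace format or API: `TraceRecorder`, `prefetch`, and the `(offset, length)` trace entries are hypothetical.

```python
class TraceRecorder:
    """Records the (offset, length) of each read during a trial run (hypothetical)."""

    def __init__(self, read_fn):
        self._read_fn = read_fn  # callable: (offset, length) -> bytes
        self.trace = []

    def read(self, offset, length):
        self.trace.append((offset, length))
        return self._read_fn(offset, length)


def prefetch(trace, fetch_fn, cache):
    """Replay a recorded trace, filling the cache before the application asks."""
    for offset, length in trace:
        if (offset, length) not in cache:
            cache[(offset, length)] = fetch_fn(offset, length)
```

Reads that the application only issues after a network-dependent step never enter a trace recorded without network, so in this model they are not prefetched and fall back to on-demand loading.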

bengbeng-pp · 2022-06-21T06:36:46Z

I understand. Thank you very much for the answer.

dbfancier · 2022-10-26T06:05:51Z

@bengbeng-pp Currently in Alibaba Cloud, only Function Compute uses trace prefetching, because it's relatively easy for them to record traces. Some businesses are reluctant to do such a thing.

I think what you need is Cache + P2P distribution. DADI has an open-source implementation of each. By setting up a large-scale SSD cluster, you basically distribute / cache every hot piece of data in the network, and thus a mighty network filesystem is formed :-)

Hello, is there any documentation on how to configure cache and P2P? When I pulled an obd-format image from the registry, I could not see anything in /opt/overlaybd/registry_cache

dbfancier mentioned this issue Oct 26, 2022

How to configure p2p and cache #143

Open
