Proposition to refactor cloud instance models and data #252
Comments
Example of a
A tricky example (`d3.2xlarge`):
Any opinions @demeringo @github-benjamin-davy @JacobValdemar? 🤗
Thank you for submitting this proposal. Here are my initial thoughts, dumped in random order 😄

- I agree, the CSV file can be confusing to interact with.
- The proposed solution makes it more explicit that a cloud instance is a fraction of a platform. I like that.
- The proposed solution de-duplicates data. I like that.
- I don't know if it is intentional, but it seems you have removed some of the fields from the platform which are currently in
- Do you propose creating a separate file (e.g.
- Regarding tricky example
- Something else is that I think it feels "bad" to work inside such a large CSV file.
Thank you, @samuelrince, for detailing our discussion so well, and thank you, @JacobValdemar, for your feedback, which confirms the importance of this reflection.
In my opinion, we should use the server archetype CSV, which already has the necessary columns. This would allow contributors to add instances by identifying a nearby generic platform already in the file, without having to add it. Regarding the tricky example `d3.2xlarge`, could the problem be that we assume one platform hosts only one type of instance? I think this issue will also occur for RAM and GPU. Could the problem be solved by allocating the impacts component by component?
Platform:
If we find out that this solution is too complicated or not relevant, I would also prefer the virtual platform solution.
Thank you, both @JacobValdemar @da-ekchajzer, for your quick feedback! 😄
I agree with @da-ekchajzer about the CSV for platforms: we should use the already existing one with server archetypes. If the file gets so big that it becomes an issue, we can still split it by cloud provider in the future.
On the subject of allocation by components, I like the idea, but I think we will struggle to make it happen in v1, given the architecture (cf. our previous discussion). Also, for this approach to work, I think we need to specify the "purpose" of an instance to decide on which component we are going to make the allocation (compute, storage, general purpose, etc.). For instance, if we take g5 instances, there are SSDs, but the impact is clearly due to the compute part (GPU, CPU, RAM), so we should say that it is a compute instance. But then, for a compute instance with a GPU, should we allocate by GPU and/or CPU and/or RAM? I think it adds complexity, and we need to think this through thoroughly.

Plus, I really don't know if it makes sense to have servers hosting different types of instances. Does that really exist? If I look at the CPU of the d3 instance, I see that m5, r5, vt1, g4 also share the same CPU, so it could be possible... And given that, I think the virtual platform makes sense in that use case as well.

I am not entirely convinced by the virtual platform, but I find it easier to deal with, even though it is something we will probably have a hard time fully automating (in terms of platform creation in the CSV file). And later (in v2?), we can maybe address the issue of component-wise allocation strategies. What do you think?
On that subject, I feel that it can also be an obstacle to new contributions. On my side, for the research part, I open the CSV on GitHub and filter the rows, but with data appended at the end of the file, I can see myself struggling with that as well. Maybe we can look for an open-source project that can expose a CSV file with a nice UI in the browser? I am thinking of projects like instances.vantage.sh, for example.
I was thinking more of the same "type" of instance but with different levels of resources. I think that in some cases the different resources (RAM, vCPU, SSD, GPU) don't scale linearly. Is that the problem for d3.8xlarge?

I think it would be easy to implement, but it may be more complicated to explain/document. We would need to apply a ratio to each component during the impacts aggregation. The ratio would be computed for each component from the platform and instance data. I just cannot figure out if that will solve our problem.
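The ratio-per-component computation described here could look something like the following sketch. The per-component impact split and the field names are assumptions made for illustration; the instance figures match a d3.4xlarge-like share (16 vcpu, 128 GB RAM, 12 disks) of the platform discussed later in the thread.

```python
# Sketch of the "ratio per component" idea, assuming the platform's embodied
# impacts are available broken down by component. All numbers are illustrative
# and the field names are hypothetical, not the project's actual schema.
def allocate_instance_impacts(platform_impacts, instance, platform):
    """Scale each component's platform impact by the instance's share of it."""
    return sum(
        impact * instance[component] / platform[component]
        for component, impact in platform_impacts.items()
    )

platform_impacts = {"cpu": 400, "ram": 300, "disk": 300}  # kgCO2eq, assumed split
platform = {"cpu": 48, "ram": 384, "disk": 36}            # vcpu, GB, disk units
instance = {"cpu": 16, "ram": 128, "disk": 12}            # d3.4xlarge-like share

print(allocate_instance_impacts(platform_impacts, instance, platform))
```

Here every component ratio happens to be 1/3, so the instance gets a third of the assumed 1000 kgCO2eq; with non-proportional instances the per-component result would diverge from a vcpu-only ratio.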
Well, it's not only about the allocation, but also about how to choose the platform instance in that case. The premise here is to guess the total number of vcpus of the platform. One CPU (Intel Xeon Platinum 8259CL) has 48 vcpus, so it's enough to fit 1x d3.8x + 1x d3.4x. But it could be possible (and highly probable in my opinion) that we have 2 CPUs. If that's the case, how do we guess the scaling of the other components (RAM and HDDs here) for the platform? That is what I proposed in Alternative 2 of the previous comment: use the most probable config in terms of vcpus for the platform, then infer the rest by trying to fit N times the same instance (ideally the biggest one, checking that it works with the other variants).

I remember from our discussion that you indeed mentioned that it was not that difficult to add allocation based on components. I think the best way to answer this is to test.

**Scenario 1**

The platform can fit 1x d3.8x + 1x d3.4x, meaning we can deduce the following minimal configuration:

Platform:
Minified platform archetype:
If we compute the embodied impacts of the platform, we have:
Input JSON:

```json
{
  "model": {
    "type": "rack"
  },
  "configuration": {
    "cpu": {
      "units": 1,
      "name": "Intel Xeon Platinum 8259CL"
    },
    "ram": [
      {
        "units": 6,
        "capacity": 64
      }
    ],
    "disk": [
      {
        "units": 36,
        "type": "hdd",
        "capacity": 2000
      }
    ]
  }
}
```

Meaning that we can now compute the impacts of the d3.8x and d3.4x instances, by vcpu only or by all components.

**By vcpu only**

For d3.8x: the instance has 32 vcpu, so 32/48 of the total embodied impacts.
For d3.4x: the instance has 16 vcpu, so 16/48 of the total embodied impacts.
**By all components**

For d3.8x: the instance has 32 vcpu, 256 GB of RAM and 24 disks.
We are very close to the impacts from the previous calculation. Detailed calculation:
For d3.4x: not doing this one, sorry.

**Scenario 2**

The platform can fit 3x d3.8x, meaning we can deduce the following minimal configuration:

Platform:
Minified platform archetype:
If we compute the embodied impacts of the platform, we have:
Input JSON:

```json
{
  "model": {
    "type": "rack"
  },
  "configuration": {
    "cpu": {
      "units": 2,
      "name": "Intel Xeon Platinum 8259CL"
    },
    "ram": [
      {
        "units": 12,
        "capacity": 64
      }
    ],
    "disk": [
      {
        "units": 72,
        "type": "hdd",
        "capacity": 2000
      }
    ]
  }
}
```

Meaning that we can now compute the impacts of the d3.8x and d3.4x instances, by vcpu only or by all components.

**By vcpu only**

For d3.8x: the instance has 32 vcpu, so 32/96 of the total embodied impacts.
For d3.4x: the instance has 16 vcpu, so 16/96 of the total embodied impacts.
**By all components**

For d3.8x:
Detailed calculation:
For d3.4x: not doing this one, again.

TL;DR: I think we are overengineering this. 😅 Of course, scenario one is kind of scaled based on the vcpu again, so I am not surprised by that result. If you want to test another configuration, feel free to try. But given the margins, I think it's overkill.
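Why the two strategies land so close in both scenarios can be checked with a short sketch. The total platform impact is made up and the impact is assumed to split evenly across the three components; only the ratios matter for the comparison.

```python
# Check why vcpu-only and per-component allocation coincide in both scenarios
# above: the d3 instances scale every component by the same factor.
platform_impact = 1000.0  # kgCO2eq, made-up total

scenarios = {
    1: {"vcpu": 48, "ram_gb": 384, "disks": 36},  # 1 CPU, 6x64 GB, 36 HDDs
    2: {"vcpu": 96, "ram_gb": 768, "disks": 72},  # 2 CPUs, 12x64 GB, 72 HDDs
}
d3_8x = {"vcpu": 32, "ram_gb": 256, "disks": 24}

for n, platform in scenarios.items():
    vcpu_only = platform_impact * d3_8x["vcpu"] / platform["vcpu"]
    # Per-component allocation, assuming an even impact split across the
    # three components (an assumption for illustration only).
    per_component = sum(
        (platform_impact / 3) * d3_8x[k] / platform[k] for k in platform
    )
    print(n, round(vcpu_only, 1), round(per_component, 1))
    # prints: 1 666.7 666.7  then  2 333.3 333.3
```

In both scenarios the instance takes the same fraction of every component (2/3 in scenario 1, 1/3 in scenario 2), so the strategies only differ when an instance's resources do not scale proportionally to the platform.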
Thank you very much for doing this exercise. So, from what you say, allocating by vcpu or per component is not an important question as long as the platforms are built accordingly? I'm sorry, but I've just managed to identify what's bothering me. The problem with doing so is that the following data would never be used in the impacts calculation (though it would be used by contributors to construct the platform).
This puts the complexity in the platform's construction, and it reduces the importance of the instance data, which is the most important to consider. I would have liked contributors to be able to associate an instance with a generic platform when they do not know how to build platforms. If the allocation is made per component, the API would only allocate the RAM/Storage/CPU/GPU impacts to the instance based on its own information. By doing so, we ensure that all reserved resources are accounted for, even if a generic platform is used. If we put great effort into building platforms based on instance information, it doesn't change anything (as you have shown); if a generic platform is used, it avoids totally incoherent evaluations.

TL;DR: Our families miss us
Well, yes, but only if you make an "educated" guess based on vcpu. In other scenarios, that's not the case. I agree with you that we don't use the instance's specs, and in that case we shouldn't even bother asking the user to input them! 😅 So I have made a notebook to quickly test different platforms and instances. Here is an example:
The additional "equivalent server" is there to compare against the impact of a probable server that has the same characteristics as the instance. I invite you to test the notebook (it is on Google Colab and editable by anyone; I have a local copy). This made me change my mind: I think we need to do the allocation by components. It makes more sense, and usually it's closer to "my expected reality" (whatever that means). Also, I think we will probably need to add more archetypes based on what exists in the wild, with smaller min/max ranges so that it makes sense. I think we can make the following archetypes:
With the following variants:
For instance, a

TL;DR: You were right from the beginning. 🙌 😇 🙏
Perfect! I will work on the implementation over the next few days. Do you think you could handle the addition of AWS platforms and existing instances in the right format? @JacobValdemar, since you made the file in the first place, you might be able to help with that as well.
Sure, just reach out if there is anything
I have started referencing all the instances in the new format and linking them to platforms (and "virtual platforms" when we don't know). https://docs.google.com/spreadsheets/d/1EmXYTUx0Nmmubj96_-fTThu7UK16Og-gcSqSOl7qB3c/edit?usp=sharing I still need to run some checks on this file and then create the virtual platforms. Note to myself:
Problem
We want to make the process of adding cloud instances as simple as possible, while:
This results in describing both instance characteristics and platform (or bare-metal) characteristics.
As of today, both are stored in the same CSV file (cloud archetypes), which can be confusing to interact with. First, it leads to duplicated data about some components (especially CPU, with cpu_specs.csv). Second, the contributor needs to understand complex concepts to be able to make a new submission (e.g. the difference between vcpu, platform_vcpu, CPU.core_units * CPU.units, and USAGE.instance_per_server based on vcpu counts).
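As a concrete illustration of the arithmetic a contributor currently has to reconcile between these fields (the hardware numbers below are assumptions, not real archetype data; the field names are taken from the text above):

```python
# Illustration of how the current CSV fields relate to each other.
# Hardware numbers are assumed for the example; field names are from the issue.
cpu_units = 2           # CPU.units: physical CPUs on the platform
core_units = 24         # CPU.core_units: physical cores per CPU
threads_per_core = 2    # hyper-threading assumed

platform_vcpu = cpu_units * core_units * threads_per_core  # 96
instance_vcpu = 8       # vcpu of a hypothetical instance

# USAGE.instance_per_server is then derived from the vcpu counts.
instances_per_server = platform_vcpu // instance_vcpu
print(platform_vcpu, instances_per_server)  # prints: 96 12
```

Keeping all of these consistent by hand in one CSV row is exactly the contributor burden the proposal tries to remove.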
Solution
We propose a new way to add cloud instances that should clarify this process. We will separate the concepts of a cloud instance and a platform (or bare-metal server).
A cloud instance will be described with very few fields that are close to the description provided by cloud providers.
Example of a `c5.2xlarge` (in new `aws.csv`):

The platform defined here is `c5.18xlarge`, which is another cloud instance AND a server archetype, defined as follows:

Cloud instance (also in new `aws.csv`):

Platform (server archetype):
In this description, the embodied impacts of the cloud instance can be derived from this operation:
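A plausible form of that operation, sketched with the AWS-published vcpu counts for the two instances named above and an invented impact figure:

```python
# Sketch of the vcpu-based derivation described above.
# The vcpu counts match AWS specs; the impact figure is an assumption.
platform_vcpu = 72             # c5.18xlarge, used as the platform
instance_vcpu = 8              # c5.2xlarge
platform_embodied_gwp = 900.0  # kgCO2eq, made-up total for the platform

# The instance inherits the platform's impacts scaled by its vcpu share.
instance_embodied_gwp = platform_embodied_gwp * instance_vcpu / platform_vcpu
print(instance_embodied_gwp)  # prints: 100.0
```

Note this is the vcpu-only allocation; the later discussion in the thread moves toward scaling each component (CPU, RAM, disk, GPU) separately.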
We will need to introduce the notion of "vcpu" in `CPU` modeling so that we can take into account the number of "threads" or "virtual cores" in hyper-threading scenarios. Like the following:

OR:
Happy to hear about your feedback @da-ekchajzer. I will detail other examples below.