Feature Request
Right now, the ONNX model zoo consists of one batch of models that has accumulated over the years. Some are quite recent models encoded with recent opsets, and others are older models, often encoded with early opsets. At this time, I believe users see only one number, e.g. the fraction of models that a given tool supports. However, this number does not directly indicate to users how up-to-date a given tool is. For example, a tool that supports all of the newer ops but not the older ops may get a similar score to a tool that supports all of the older ops but misses some of the more recent opsets. Distinguishing these scores would provide value to users.
Ideally, we would continue to modernize the older benchmarks to use the most recent opsets. In addition, I would suggest that we keep the old benchmarks and split them into two categories: one including legacy models with legacy ops (e.g. deprecated ops and/or attributes/inputs) and one including the recent models.
What is the problem that this feature solves?
By having two sets of benchmarks (one including legacy models and the other recent models), the coverage and applied optimizations become more directly consumable by users, as they will be better able to evaluate the usefulness of a tool for current and legacy apps.
Describe the feature
I would look at the current benchmarks and determine which ones are considered recent and which ones are older. We would discuss what criteria are appropriate to use. Alternatively, we could bin each of the benchmarks by opset and provide a number per opset.
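To make the per-opset binning concrete, here is a minimal Python sketch of what such a reporting script could look like; the models/ directory layout and the recent/legacy cutoff are placeholders, not an agreed-upon policy:

```python
# Minimal sketch: bin model zoo models by their declared default-domain opset.
# The "models/" directory and the recent/legacy cutoff are assumptions.
import glob
from collections import defaultdict

import onnx

RECENT_OPSET_CUTOFF = 13  # illustrative threshold only

bins = defaultdict(list)
for path in glob.glob("models/**/*.onnx", recursive=True):
    model = onnx.load(path)
    # opset_import has one entry per domain; "" (or "ai.onnx") is the default domain.
    opset = max(
        (imp.version for imp in model.opset_import if imp.domain in ("", "ai.onnx")),
        default=0,
    )
    bins[opset].append(path)

for opset in sorted(bins):
    label = "recent" if opset >= RECENT_OPSET_CUTOFF else "legacy"
    print(f"opset {opset:>2} ({label}): {len(bins[opset])} models")
```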
We can continue using converters, e.g. to include the up-converted older benchmarks as part of the recent models, to the extent that the up-conversion does not generate graphs that are too distinct from currently generated ones.
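For the up-conversion itself, the built-in onnx version converter is one option; a minimal sketch, with hypothetical model paths and an example target opset:

```python
# Minimal sketch: up-convert a legacy model and sanity-check the result.
# The paths and TARGET_OPSET are examples, not the zoo's actual layout.
import onnx
from onnx import version_converter

TARGET_OPSET = 18  # example target; the zoo would pick the agreed "recent" opset

model = onnx.load("models/legacy/alexnet.onnx")
converted = version_converter.convert_version(model, TARGET_OPSET)
onnx.checker.check_model(converted)
onnx.save(converted, "models/recent/alexnet-opset18.onnx")
```

Note that the checker only validates the converted model; deciding whether the converted graph is too distinct from a natively exported one would still require a manual or scripted graph comparison.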
One positive side effect of this effort is that we can better separate the performance of the converters, whose job is to transform one opset into another, from that of the execution tools (runtimes/compilers), whose job is to execute a given model.
Relevant Model Zoo feature areas
Mostly impacts the organization of the benchmarks and the performance reporting.
Notes
I know that this issue has been discussed before, and I am looking forward to learning from your past observations.
There is also interest in the community in a slightly separate issue, namely how to convert older models that use uniform precision into newer models that work with (possibly multiple) reduced precisions. Again, having a benchmark set that can better highlight the benefit of a tool when converting such benchmarks would be an attractive metric to present to users.
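For reference, one way such a precision conversion is often done today is with the onnxconverter-common float16 helper; this is only an assumed example of the kind of transformation such a benchmark set would exercise, not a proposal to standardize on this tool:

```python
# Minimal sketch: convert a uniform-precision (float32) model to float16.
# Assumes the onnxconverter-common package; the model paths are hypothetical.
import onnx
from onnxconverter_common import float16

model = onnx.load("models/recent/resnet50-fp32.onnx")
# keep_io_types=True keeps float32 graph inputs/outputs so existing callers still work.
fp16_model = float16.convert_float_to_float16(model, keep_io_types=True)
onnx.save(fp16_model, "models/recent/resnet50-fp16.onnx")
```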
It is common to deprecate old versions of software, in this case old versions of ONNX opsets and models, for some obvious reasons:
Developer inability/unwillingness to support old versions
Code complexity to work with many versions
Challenging build and CI/CD systems
Expensive unit and regression tests
Lack of users for old versions
ONNX has actually deprecated certain operators in the past, so I think it is reasonable to deprecate opsets and models in the model zoo. We don't necessarily need to remove them from the repos. The old versions could be placed in a different category so that active developers and ecosystem tools, such as frontend and backend converters, do not need to keep maintaining and debugging legacy code. The deprecated opsets and models can still be handled on a case-by-case basis as needed.
A solid ONNX version converter would help the entire community tremendously in supporting all ONNX versions, because deprecated models could be brought up to more recent versions and kept functional without legacy code in the various tools.