-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Cast Op performance fix. #6509
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cast Op performance fix. #6509
Conversation
| auto snprintf_result = std::snprintf(nullptr, 0, format, value); | ||
| ORT_ENFORCE(snprintf_result > 0, "Failed to determine required snprintf() buffer length."); | ||
|
|
||
| // buffer for string and trailing '\0' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a maximum size that a float can possibly use when converted to a string that we could just use as the fixed buffer size to avoid the double call to snprintf? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there probably is, but i'm not sure what it is. updated to make one call in the case where it does fit into a fixed buffer
In reply to: 569863895 [](ancestors = 569863895)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we make the static buffer a lot bigger (I would use at least 256 bytes) so the chance of needing the fallback path is much lower?
In reply to: 569930348 [](ancestors = 569930348,569863895)
| ORT_ENFORCE( | ||
| snprintf_result > 0 && gsl::narrow_cast<size_t>(snprintf_result) == buffer.size() - 1, | ||
| "Failed to write value with snprintf()."); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer returning a status in code that runs during model execution. In a minimal build that will gracefully return an error (vs. ORT_ENFORCE which will call abort() if exceptions are disabled).
#Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, good to know. however i think this is sufficiently bad to warrant termination
In reply to: 569864399 [](ancestors = 569864399)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either way the request will terminate as we essentially exit as soon as the status indicates failure. If you return a Status that will come back to the user gracefully with an error message in all builds. If you throw, it will do the same unless exceptions are disabled, in which case it's a hard abort which is much harder to debug.
In reply to: 569930622 [](ancestors = 569930622,569864399)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. the error message still gets printed though, right?
let me know if you have a strong preference for Status in this particular case. it'll require some plumbing to propagate it. i think these snprintf failures are unexpected - i.e., more like an assertion
In reply to: 569957335 [](ancestors = 569957335,569930622,569864399)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| }; | ||
|
|
||
| #if defined(_M_AMD64) | ||
| // add some specializations to use optimized MLFloat16 -> float conversion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment documenting why this is _M_AMD64 only would be good #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to say that this is _M_AMD64 specific. searched the code and from the filenames it looks like it is only implemented for this architecture, but i'm not sure why that is.
In reply to: 569866100 [](ancestors = 569866100)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Searched the ORT code or Eigen code? Would be good for the comment to mention where you looked and only found amd64 specific optimizations so that in the future we can look in the same places to see if anything has changed and whether other plaforms can have the specializations enabled.
In reply to: 569931342 [](ancestors = 569931342,569866100)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is MLAS specific: at one time, the Windows ML team cared about the performance of a FP16 model running on the CPU (for falling back from the DML EP), so I implemented a Windows x64 specific routine to do the conversion. That scenario wasn't important on other platforms, so the kernel was never adapted anywhere else. The scenario may no longer matter either, if you can get support for retiring this from the Windows ML team. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, thanks for explaining that. the MLAS routine is faster from my measurements, so i think it's good to have this optimized version.
In reply to: 570004323 [](ancestors = 570004323)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i searched the ORT code for MlasConvertHalfToFloatBuffer.
In reply to: 569962495 [](ancestors = 569962495,569931342,569866100)
skottmckay
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
![]()
Description
Update CPU Cast implementation to fix performance regressions.
Update Cast unit tests for more coverage.
Performance Measurements
baseline: 91b19b8 (prior to #6466 which pessimized Cast CPU performance).
updated: changes from this PR
measured 1000 iterations, median value seemed more stable than average
Motivation and Context
Performance.