Skip to content

Conversation

@edgchen1
Copy link
Contributor

@edgchen1 edgchen1 commented Jan 30, 2021

Description

Update CPU Cast implementation to fix performance regressions.

Update Cast unit tests for more coverage.

Performance Measurements

baseline: 91b19b8 (prior to #6466 which pessimized Cast CPU performance).
updated: changes from this PR

measured 1000 iterations, median value seemed more stable than average

n src_type dst_type median_us baseline median_us updated (baseline - updated) / baseline
32768 float16 float 50.7000 49.6 0.021696252
32768 float16 int64 84.8000 84.3 0.005896226
32768 float16 string 35995.4000 13615.3 0.62174889
32768 float float16 139.2000 139.9 -0.005028736
32768 float int64 62.1000 56.8 0.085346216
32768 float string 35930.9000 13621.3 0.6209029
32768 int64 float16 152.8000 147.7 0.033376963
32768 int64 float 60.6000 46.9 0.226072607
32768 int64 string 25288.2000 1217.2 0.951866879
32768 string float16 N/A (unimplemented) 5924 N/A
32768 string float 5847.4000 5512.2 0.057324623
32768 string int64 4936.9000 4808.4 0.026028479

Motivation and Context

Performance.

@edgchen1 edgchen1 requested a review from a team as a code owner January 30, 2021 03:20
@edgchen1 edgchen1 changed the title Edgchen1/cast perf Cast Op performance fix. Jan 30, 2021
auto snprintf_result = std::snprintf(nullptr, 0, format, value);
ORT_ENFORCE(snprintf_result > 0, "Failed to determine required snprintf() buffer length.");

// buffer for string and trailing '\0'
Copy link
Contributor

@skottmckay skottmckay Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a maximum size that a float can possibly use when converted to a string that we could just use as the fixed buffer size to avoid the double call to snprintf? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there probably is, but i'm not sure what it is. updated to make one call in the case where it does fit into a fixed buffer


In reply to: 569863895 [](ancestors = 569863895)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make the static buffer a lot bigger (I would use at least 256 bytes) so the chance of needing the fallback path is much lower?


In reply to: 569930348 [](ancestors = 569930348,569863895)

ORT_ENFORCE(
snprintf_result > 0 && gsl::narrow_cast<size_t>(snprintf_result) == buffer.size() - 1,
"Failed to write value with snprintf().");

Copy link
Contributor

@skottmckay skottmckay Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer returning a status in code that runs during model execution. In a minimal build that will gracefully return an error (vs. ORT_ENFORCE which will call abort() if exceptions are disabled).
#Pending

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, good to know. however i think this is sufficiently bad to warrant termination


In reply to: 569864399 [](ancestors = 569864399)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way the request will terminate as we essentially exit as soon as the status indicates failure. If you return a Status that will come back to the user gracefully with an error message in all builds. If you throw, it will do the same unless exceptions are disabled, in which case it's a hard abort which is much harder to debug.


In reply to: 569930622 [](ancestors = 569930622,569864399)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. the error message still gets printed though, right?
let me know if you have a strong preference for Status in this particular case. it'll require some plumbing to propagate it. i think these snprintf failures are unexpected - i.e., more like an assertion


In reply to: 569957335 [](ancestors = 569957335,569930622,569864399)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trying out Status


In reply to: 570436510 [](ancestors = 570436510,569957335,569930622,569864399)

};

#if defined(_M_AMD64)
// add some specializations to use optimized MLFloat16 -> float conversion
Copy link
Contributor

@skottmckay skottmckay Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment documenting why this is _M_AMD64 only would be good #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to say that this is _M_AMD64 specific. searched the code and from the filenames it looks like it is only implemented for this architecture, but i'm not sure why that is.


In reply to: 569866100 [](ancestors = 569866100)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Searched the ORT code or Eigen code? Would be good for the comment to mention where you looked and only found amd64 specific optimizations so that in the future we can look in the same places to see if anything has changed and whether other plaforms can have the specializations enabled.


In reply to: 569931342 [](ancestors = 569931342,569866100)

Copy link
Contributor

@tracysh tracysh Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is MLAS specific: at one time, the Windows ML team cared about the performance of a FP16 model running on the CPU (for falling back from the DML EP), so I implemented a Windows x64 specific routine to do the conversion. That scenario wasn't important on other platforms, so the kernel was never adapted anywhere else. The scenario may no longer matter either, if you can get support for retiring this from the Windows ML team. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, thanks for explaining that. the MLAS routine is faster from my measurements, so i think it's good to have this optimized version.


In reply to: 570004323 [](ancestors = 570004323)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i searched the ORT code for MlasConvertHalfToFloatBuffer.


In reply to: 569962495 [](ancestors = 569962495,569931342,569866100)

Copy link
Contributor

@skottmckay skottmckay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@edgchen1 edgchen1 merged commit 318b82c into master Feb 4, 2021
@edgchen1 edgchen1 deleted the edgchen1/cast_perf branch February 4, 2021 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants