Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix performance downgrade issue & update doc #229

Merged
merged 7 commits into from
Dec 4, 2023
Merged

Conversation

Binyang2014
Copy link
Contributor

For push function, we only need to make sure the instruction st.global will be executed after the while loop. Since there is a Write-After-Read hazard for trigger.fst (Check this->triggers[curFifoHead % size].fst != 0 first then write value to triggers[curFifoHead % size]), we can expect the compiler and hardware can handle this situation correctly. Remove the release.sys there.

BTW, st.global.release.sys.v2.u64 will cause perf regression issue. Previous we use st.global.release.cta.v2.u64, but seems not necessary.

include/mscclpp/fifo_device.hpp Outdated Show resolved Hide resolved
@saeedmaleki
Copy link
Contributor

For push function, we only need to make sure the instruction st.global will be executed after the while loop. Since there is a Write-After-Read hazard for trigger.fst (Check this->triggers[curFifoHead % size].fst != 0 first then write value to triggers[curFifoHead % size]), we can expect the compiler and hardware can handle this situation correctly. Remove the release.sys there.

BTW, st.global.release.sys.v2.u64 will cause perf regression issue. Previous we use st.global.release.cta.v2.u64, but seems not necessary.

Can you reason about why the compiler/hardware can handle this correctly?

@Binyang2014
Copy link
Contributor Author

Can you reason about why the compiler/hardware can handle this correctly?
I think the code has Write-After-Read hazard. If the compiler or hardware try to reorder the instruction, it will cause incorrect result.

The code can be simplified as this:

volatile int *addr;
while(true) {
  if (atomic_load(addr, std::memory_order_relaxed) == 0) {
    break;
  }
}
atomic_store(addr, 1, std::memory_order_relaxed);

For addr, it has write-after-read hazard. According to https://en.cppreference.com/w/cpp/language/as_if
Accesses (reads and writes) to volatile objects occur strictly according to the semantics of the expressions in which they occur. In particular, they are not reordered with respect to other volatile accesses on the same thread.
, for a single thread, compiler should not reorder the store before load, which will change the outcome of the program.
If we are talking about the different threads, since they will have different curFifoHead, then the polling addrs are different, that's should not be a problem.

@chhwang chhwang linked an issue Dec 4, 2023 that may be closed by this pull request
@Binyang2014
Copy link
Contributor Author

Put this link here: https://en.cppreference.com/w/cpp/language/eval_order. Seems we meet the rule 1, atomic_load(addr, std::memory_order_relaxed) == 0 need to be evaluated due to it has side-effect to atomic_store(addr, 1, std::memory_order_relaxed)

Copy link
Contributor

@chhwang chhwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@saeedmaleki saeedmaleki merged commit f1b2c9d into main Dec 4, 2023
18 checks passed
@saeedmaleki saeedmaleki deleted the binyli/check_perf branch December 4, 2023 18:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants