-
Notifications
You must be signed in to change notification settings - Fork 444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Map written LocationSet
s to program locations (loc_t
) instead of IR::Expression*
s
#4797
Conversation
d4d7529
to
407aa2d
Compare
frontends/p4/def_use.h
Outdated
/// For each expression the location set it writes | ||
hvec_map<const IR::Expression *, const LocationSet *> writes; | ||
/// For each program location the location set it writes | ||
ordered_map<loc_t, const LocationSet *> writes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@asl Let me know if there is a way to use hvec_map
here instead of ordered_map
. I reverted to ordered_map
because I ran into a bunch of errors when changing the key to loc_t
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ordered_map
is terrible expensive. And this is one of the most hottest places. I would probably suggest that every PR that contains changes to use-def in this way to contain runtimes of:
test/gtestp4c --gtest_filter=P4CParserUnroll.switch_20160512
and
ctest -R "p14_to_16/testdata/p4_14_samples/switch_20160512/switch.p4"
They are more or less the same source code, just running a bit different set of passes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Results of running the above commands on this branch:
test/gtestp4c --gtest_filter=P4CParserUnroll.switch_20160512
7930 ms
7918 ms
7949 ms
ctest -R "p14_to_16/testdata/p4_14_samples/switch_20160512/switch.p4"
23.14 s
23.19 s
23.00 s
Main branch:
test/gtestp4c --gtest_filter=P4CParserUnroll.switch_20160512
7858 ms
7854 ms
7873 ms
ctest -R "p14_to_16/testdata/p4_14_samples/switch_20160512/switch.p4"
22.78 s
22.94 s
23.19 s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ordered_map is terrible expensive. And this is one of the most hottest places. I would probably suggest that every PR that contains changes to use-def in this way to contain runtimes of:
@asl changed to hvec_map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way to test this is to remove the passes that were introduced to tree-ify the DAG before def-use (e.g., Cloner in simplifyDefUse.h). Unfortunately not all of them can be removed from the compiler, since other passes may require trees as well.
frontends/p4/def_use.h
Outdated
// A location in the program. Includes the context from the visitor, which needs to | ||
// be copied out of the Visitor::Context objects, as they are allocated on the stack and | ||
// will become invalid as the IR traversal continues | ||
struct loc_t { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this may be pretty expensive. def_use is already expensive in memory.
We had closed #151, but I am not sure the fundamental problem is really solved.
Maybe it's not a problem in practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@asl can probably comment on the current memory usage of def_use, but I'm not sure that this PR makes it noticeably worse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had closed #151, but I am not sure the fundamental problem is really solved.
It is not solved. The overhead is just a bit smaller. Still:
- Lots of malloc traffic (e.g.
joinDefinitions
, orgetPoints
, orwrites
, etc. – they all allocate as crazy) - Lots of maps everywhere and lots of map lookups.
In many cases use-def is the single pass that runs for a significant part of the whole frontend
frontends/p4/def_use.h
Outdated
/// For each expression the location set it writes | ||
hvec_map<const IR::Expression *, const LocationSet *> writes; | ||
/// For each program location the location set it writes | ||
ordered_map<loc_t, const LocationSet *> writes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ordered_map
is terrible expensive. And this is one of the most hottest places. I would probably suggest that every PR that contains changes to use-def in this way to contain runtimes of:
test/gtestp4c --gtest_filter=P4CParserUnroll.switch_20160512
and
ctest -R "p14_to_16/testdata/p4_14_samples/switch_20160512/switch.p4"
They are more or less the same source code, just running a bit different set of passes.
38ecfcd
to
37072bb
Compare
} | ||
private: | ||
// TODO: Make absl::flat_hash_set instead? | ||
std::unordered_set<loc_t> &cached_locs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or use an hvec_set
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is any hvec_set
in the tree
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to ensure stable addresses of values around insertions. So, flat_hash_set
is not an option here. You might want to give node_hash_set
a try:
std::unordered_set<loc_t> &cached_locs; | |
absl::node_hash_set<loc_t, Util::Hash> &cached_locs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not doing any iteration of cached_locs
, so stable insertion value ordering does not matter here. If you think absl::node_hash_set
would be better (e.g. for performance reasons), feel free to push your changes to this branch as I ran into some errors when trying your suggestion and I'd rather not spend the time debugging this unless there is a good reason to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not about stable iteration order. The issue is much more subtle, but is very important: you are doing (though I would prefer ->first
here):
return &*cached_locs.insert(tmp).first;
Note that you are returning the address of hash map entry. Therefore you need to ensure this address will not change during insertions / deletions. STL unordered containers guarantee this (as each bucket is essentially a list). flat_hash_set
does not guarantee this, node_hash_set
does. See https://abseil.io/docs/cpp/guides/container for more information
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that you are returning the address of hash map entry. Therefore you need to ensure this address will not change during insertions / deletions.
Ok, I see what you mean now.
I'm not convinced that using std::unordered_set
is a problem in the first place, so I will not spend the time trying to get node_hash_set
to work, but feel free to push such changes to this branch if you know how to do it and think that it will improve the performance.
(but I will at least update the comment if you don't want to make these changes)
This causes some tests to fail. I will investigate this later. |
@mihaibudiu This doesn't work because of a problem I mentioned earlier in #4548 (comment). If you disable the Disabling
and on
so these changes at least seem to be accomplishing their intended purpose. |
If I understand what you are saying, this means that the context representation you are using is incomplete. You need to keep the child number as well. |
abfffee
to
2e76c92
Compare
It may be incomplete, but it is still an improvement over the current implementation, so additional improvements can be made separately. And the future improvements will need to affect both |
Maybe we can use some well-known and proven approaches? E.g. some flavour of value numbering? |
That would probably be even better. We can consider doing this in a future PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@kfcripps Are there updated benchmarking results? Or the ones above are still valid and there were only correctness changes? |
frontends/p4/def_use.cpp
Outdated
std::size_t P4::loc_t::hash() const { | ||
if (!parent) return Util::Hash{}(node->id); | ||
|
||
return Util::Hash{}(node->id, parent->hash()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have an idea how "deep" the recursion here could be? Maybe it would worth to memoize the hash?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is as deep as the deepest IR::Node in an IR::Node DAG. In many of the P4-16 programs that I have analyzed it does not get egregiously deep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well... this is quite a lot actually and it could be very deep especially after inlining. I would probably check the effects of memoization. We can use zero hash value as a thombstone, in a very rare case of zero value hash collision, well, we'd have a single level of recursion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried unsuccessfully to memoize the hashes: kfcripps@04b7a4e
Let me know if you have any suggestions on how to accomplish this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your suggestion worked. Thanks!
// In this case parentLoc is the loc of n's direct parent. | ||
const P4::loc_t *ComputeWriteSet::getLoc(const IR::Node *n, const loc_t *parentLoc) { | ||
loc_t tmp{n, parentLoc}; | ||
return &*cached_locs.insert(tmp).first; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return &*cached_locs.insert(tmp).first; | |
return &*cached_locs.emplace(n, parentLoc).first; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was basically copied from the midend's version of loc_t
. Maybe tmp
was left for debugging purposes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this doesn't seem to work:
/local/kfcripps/repos/p4c/frontends/p4/def_use.cpp:776:46: required from here
/usr/include/c++/9/ext/new_allocator.h:146:4: error: new initializer expression list treated as compound expression [-fpermissive]
146 | { ::new((void *)__p) _Up(std::forward<_Args>(__args)...); }
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/include/c++/9/ext/new_allocator.h:146:4: error: no matching function for call to 'P4::loc_t::loc_t(const P4::loc_t*&)'
In file included from /local/kfcripps/repos/p4c/frontends/p4/def_use.cpp:17:
/local/kfcripps/repos/p4c/frontends/p4/def_use.h:39:8: note: candidate: 'P4::loc_t::loc_t()'
39 | struct loc_t {
| ^~~~~
/local/kfcripps/repos/p4c/frontends/p4/def_use.h:39:8: note: candidate expects 0 arguments, 1 provided
/local/kfcripps/repos/p4c/frontends/p4/def_use.h:39:8: note: candidate: 'constexpr P4::loc_t::loc_t(const P4::loc_t&)'
/local/kfcripps/repos/p4c/frontends/p4/def_use.h:39:8: note: no known conversion for argument 1 from 'const P4::loc_t*' to 'const P4::loc_t&'
/local/kfcripps/repos/p4c/frontends/p4/def_use.h:39:8: note: candidate: 'constexpr P4::loc_t::loc_t(P4::loc_t&&)'
/local/kfcripps/repos/p4c/frontends/p4/def_use.h:39:8: note: no known conversion for argument 1 from 'const P4::loc_t*' to 'P4::loc_t&&'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, you'd likely need to implement proper constructor ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was unsuccessful yet again: kfcripps@2275e49
Let me know if any suggestions on how to achieve this
const P4::loc_t *ComputeWriteSet::getLoc(const Visitor::Context *ctxt) { | ||
if (!ctxt) return nullptr; | ||
loc_t tmp{ctxt->node, getLoc(ctxt->parent)}; | ||
return &*cached_locs.insert(tmp).first; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return &*cached_locs.insert(tmp).first; | |
return &*cached_locs.emplace(ctxt->node, getLoc(ctxt->parent)).first; |
if (p->node == n) return getLoc(p); | ||
auto rv = getLoc(ctxt); | ||
loc_t tmp{n, rv}; | ||
return &*cached_locs.insert(tmp).first; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return &*cached_locs.insert(tmp).first; | |
return &*cached_locs.emplace(n, getLoc(ctxt)).first; |
} | ||
private: | ||
// TODO: Make absl::flat_hash_set instead? | ||
std::unordered_set<loc_t> &cached_locs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to ensure stable addresses of values around insertions. So, flat_hash_set
is not an option here. You might want to give node_hash_set
a try:
std::unordered_set<loc_t> &cached_locs; | |
absl::node_hash_set<loc_t, Util::Hash> &cached_locs; |
148e2d4
to
50c0f9f
Compare
New results on this branch (as of 50c0f9f): test/gtestp4c --gtest_filter=P4CParserUnroll.switch_20160512
ctest -R "p14_to_16/testdata/p4_14_samples/switch_20160512/switch.p4"
|
50c0f9f
to
9d1fb35
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about #4385? |
Re-run benchmarking in less noisy setup, 10 iterations + warmup:
I would say no impact |
See #4797 (comment).. Mihai's PRs masked the root problem, so I do not know of any test cases that pass with this PR but not on
We can probably revert it, but based on the changes to test outputs in https://github.com/p4lang/p4c/pull/4727/files, there may be other benefits to leaving the added constant folding pass? Also, to be safe, I'd rather not revert any of Mihai's fixes until all related problems with |
My main issue is that we are closing #4507 but it looks likes we are not adding the test case reported in it. For the other two issues we have added the tests already. We can add it here or in a separate PR. |
class StorageFactory; | ||
class LocationSet; | ||
|
||
// A location in the program. Includes the context from the visitor, which needs to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Three slashes. Same goes the other comments.
…Expression*s Signed-off-by: Kyle Cripps <[email protected]>
…_set instead of std::set for cached_locs Signed-off-by: Kyle Cripps <[email protected]>
Signed-off-by: Kyle Cripps <[email protected]>
Signed-off-by: Kyle Cripps <[email protected]>
Signed-off-by: Kyle Cripps <[email protected]>
9d1fb35
to
6858f16
Compare
Adds a modified version of
loc_t
from the midend def_use pass to the frontend one. Please see discussions in #4500, #4507, #4548 for more context. I did not add any test cases as @mihaibudiu's PRs #4539 and #4727 mask the root problem, but I did verify that these changes also fix the test cases from the below issues with an older version of the compiler that did not have Mihai's fixes yet.Fixes #4500
Fixes #4507
Fixes #4548