Stack monoid take 3 #66
I’m just getting started digging into the algorithms themselves, but two initial questions:

Section 3: What about this example? Am I calculating the semigroup incorrectly?

In Section 5, what is the meaning of the
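Regarding the semigroup question above: here is a minimal sketch, in Rust, of the bicyclic semigroup as the thread uses it, where Bic(s) = (a, b), with a counting unmatched close parentheses and b counting unmatched open parentheses in s. The type, helper names, and example strings here are illustrative, not taken from the paper.

```rust
// A minimal sketch (not the paper's code) of the bicyclic semigroup.
// Bic(s) = (a, b), where a counts unmatched ')' and b counts
// unmatched '(' in the string s.

#[derive(Clone, Copy, Debug, PartialEq)]
struct Bic {
    a: u32, // unmatched close parentheses
    b: u32, // unmatched open parentheses
}

impl Bic {
    fn from_char(c: char) -> Bic {
        match c {
            '(' => Bic { a: 0, b: 1 },
            ')' => Bic { a: 1, b: 0 },
            _ => Bic { a: 0, b: 0 },
        }
    }

    // Associative combine: opens from the left operand cancel
    // against closes from the right operand.
    fn combine(self, other: Bic) -> Bic {
        let m = self.b.min(other.a);
        Bic {
            a: self.a + other.a - m,
            b: self.b + other.b - m,
        }
    }
}

fn bic(s: &str) -> Bic {
    s.chars()
        .map(Bic::from_char)
        .fold(Bic { a: 0, b: 0 }, Bic::combine)
}

fn main() {
    // "(()(" has one matched pair and two unmatched opens.
    assert_eq!(bic("(()("), Bic { a: 0, b: 2 });
    // ")(" has one unmatched close followed by one unmatched open.
    assert_eq!(bic(")("), Bic { a: 1, b: 1 });
}
```

Since the combine is associative with identity (0, 0), Bic values can be computed with a parallel scan, which is the point of treating the problem this way.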
I got this wrong; it should be as i decreases. This is fixed in my current draft. I'll upload a new one today, which will address these and more. Glad you're interested!
What's attached is just a snapshot of the current state, but it doesn't reflect what I want it to be. I'll take a solid pass tomorrow and hopefully fill in the major gaps.
I'm trying to understand the output of the scan in Section 5.1: as described, I'd expect the reverse scan for the example string to write three elements.

Also, Section 5.2, first bullet: is this a reverse scan?
Thanks for these questions, they're very useful in making me describe things more clearly. In 5.2, yes, this is a reverse scan. The description of 5.1 is too informal, and is also missing the fact that only open parentheses are written (this is why your interpretation is that 3 elements are written when it should only be 2). Index i is written to location Bic(s).b - Bic(s[i..n]).b if s[i] = '(' and Bic(s[i+1..n]).a = 0. So for your example, only the two unmatched open parentheses are written.
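To make the write rule above concrete, here is a hedged sketch that evaluates it naively, reusing the Bic type and bic() helper from the earlier sketch. The paper computes the suffix values with a parallel reverse scan; this version recomputes them per index purely for illustration, and the function name and example string are mine.

```rust
// Hypothetical, naive evaluation of the rule quoted above, reusing
// Bic and bic() from the earlier sketch. Index i is written to
// location Bic(s).b - Bic(s[i..n]).b when s[i] == '(' and
// Bic(s[i+1..n]).a == 0.

fn stack_output(s: &str) -> Vec<usize> {
    let total_b = bic(s).b; // stack size: unmatched opens in all of s
    let chars: Vec<char> = s.chars().collect();
    let mut out = vec![usize::MAX; total_b as usize];
    for i in 0..chars.len() {
        let suffix_i: String = chars[i..].iter().collect();
        let suffix_i1: String = chars[i + 1..].iter().collect();
        // s[i] is an unmatched '(' iff no unmatched ')' follows it.
        if chars[i] == '(' && bic(&suffix_i1).a == 0 {
            // Its slot is its depth from the bottom of the stack.
            let loc = (total_b - bic(&suffix_i).b) as usize;
            out[loc] = i;
        }
    }
    out
}

fn main() {
    // For "(()((", the unmatched opens are at indices 0, 3, and 4;
    // each is written at its stack depth, so the output is [0, 3, 4].
    assert_eq!(stack_output("(()(("), vec![0, 3, 4]);
}
```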
Attached is my latest draft. I haven't done a real polish pass yet, but there's a new introduction (following the suggested format for the HPDC conference) and one graphic, so I think it says mostly what I want it to say now. I'll do some measurements and see if I can get another graphic in there. I'm very open to comments and suggestions: What's not clear? What could be better illustrated?
Still needs work, but here it is in the recommended template form, and incorporating some of the feedback I've gotten.
I'm actually pretty happy with this version, though of course there's always more that can be done. |
That's missing a closing parenthesis.

It's taken me quite a while to understand Equation 1 and the following paragraph. I think some additional exposition would have helped me (assuming I've gotten it right).

Section 7.2: This section does feel like it is missing context. For example, the introduction of segments, and that they can be empty vs. non-empty. Would these be tokens other than open parentheses?
Thanks! Since posting that draft I've actually added an explanation in English; I hope that will explain it better. I'll have another look, fix the parentheses, and have another go at 7.2 (I know this is still a bit in shorthand). New draft up soon.
I rewrote the section, expanding out a bit what was previously in shorthand. Hopefully, there's enough detail there to figure out what it's actually saying. (Empty refers strictly to unmatched open parentheses, and I believe I've made that more precise and explicit now.)

I don't really expect people to get 7.2, it's pretty detailed and weedsy. But it should be a description of what I've done, and arguably important as it's what supports the performance result and the ability to scale up pretty well even with two dispatches.

In any case, I very much appreciate your review and comments. I've been getting some on the Zulip as well, and between the two it makes me feel a lot less like I'm shouting into the void. I think I'll submit this version, and then I do have a small window to revise it further if I feel the need.
Sorry for the delay, I didn't see a notification of a new post.
I think the new Section 7.2 works quite well. Send it!
An update. The paper submission above was rejected. I am trying again. The new forum allows publication of the work but asks that the forum not be identified. I am furiously working to make the May 2 deadline. I've done some writing and editing, but the measurements (both of the existing stack monoid and the new bbox-oriented work) haven't been done yet. So at this point I'd love to have feedback on the writing, understanding that the obvious TODO sections will be filled in later.
Attached is a new draft, this time with performance measurements and analysis. I don't think I will have time to create a new visual as originally planned, so from here I will basically take a pass to clean up typos and obvious TODO items.
This was not accepted by the conference. I now plan to post to arXiv and write an associated blog post. The new version, with conference information stripped off, is attached. Comments are welcome; it will probably be a few days before I get everything lined up here. I'm also hoping to rerun measurements on Nvidia 3080 hardware, but that will also take a few days.
Hi there! I hope you do not mind some naive comments. I studied mathematics and work in a computer science group. I have no experience in theoretical parallel algorithms but am generally interested in practical implementations. I hope the comments are relevant.

Theoretical algorithms and their implementation

In Section 1.2, page 2, first bullet point, you say:

My question is: why is this mapping not clear?

PRAM model

In Section 1.3, page 3, third paragraph, you say:

My question is: would this not mean that a hierarchical memory model fits current hardware better?
First, the paper is now on arXiv and that will be the definitive place to find it. I am tentatively planning a v2 with more measurement and will probably delay publicizing it widely until then, but feel free to share.

Second, absolutely I don't mind comments. Think of this as the Q&A at the hypothetical conference it would have gotten accepted to :)

Personally I believe there is a huge and deep area of study in adapting "classical" parallel algorithms to more realistic hardware. There might be differences of opinion about whether GPU compute shaders represent a narrow niche (not as efficient as TPU-like accelerators, not as agile as CPUs) or whether the hierarchies are a more principled way to organize large numbers of threads, likely to persist in similar fashion well into the future. I'm honestly not sure to what extent making an abstract compute performance model is useful; there's a possibility it's too complicated. Making such a model is a fairly deep study, and I think wildly too ambitious for this particular paper. Essentially, if you were proposing such a thing, you'd need to justify that its predictions correlate with real-world performance. I think a good step in this direction is Vasily Volkov's thesis work. I'm not sure where this kind of work is being done, or where a good place to publish it would be.
I have figured out a seriously faster version of the stack monoid algorithm (see stack monoid revisited) and want to write that up.
The implementation exists in a PR on the piet-gpu repo, where it will become part of the GPU test suite.
This will become a blog post at least at some point, but I'm also doing a submission to the HPDC conference. Attached is a very early draft of that paper, for review while I iterate on it. It's also possible I'll do a PR adding the paper to the assets section here.
stack.pdf