Lots and lots of things #2924

Merged
merged 9 commits into from
Jul 3, 2024

Conversation

kddnewton
Collaborator

This PR does a lot of stuff, because it was all interconnected and it was easier to do as a group.

  • Reconfigure error tests
    • Previously we were creating syntax trees, comparing them to the recovered tree, and comparing error messages. This was very verbose, and required a lot of manual work.
    • The error tests also made heavy use of the DSL module, which I wanted to change for a while. The DSL module now accepts keyword arguments for each field that you want to change, and has sensible defaults for all other fields. This allows you to more quickly create nodes without having to worry about the order of the fields.
    • The new error tests are all in their own files, and we check them by parsing them again, formatting the errors inline, and asserting that nothing changed. I also added a bin/prism error utility for creating new files of this kind to make it slightly more manageable.
  • Move location in node initializers: now that the DSL has been changed, it opens up the possibility to reconfigure our node initializers. The location parameter has been moved to the front (after source and node_id) to keep the common fields at the beginning. The actual node initializers should be seen as private API for all intents and purposes, because they will see the most churn whenever something changes.
  • Expose flags on every node type: every node can have common flags (newline or static_literal). These were already present in the C and Rust APIs, but weren't exposed in the Ruby, JS, or Java APIs. They are now present everywhere. This addresses a long-standing request (fixes #1531, "A superclass/module/interface/flag for static literals"). Every node now has a common static_literal? method that can be used to check if something is a static literal. These flags are now exposed in the inspect output as well.
  • A utility method Node#breadth_first_search is added for convenience; I've written this out a bunch of times before.
  • node_id is now exposed in every API for every node. I was very hesitant to add this functionality to prism, because I was worried about memory usage and performance. However, I don't see any other way around it that will work with the same usage pattern for the way Rails uses node id. I was getting away with it for CRuby by using (line, column), but for Rails that changes based on the ERB being compiled and I don't see how to get it otherwise. Happily, it doesn't actually take up any more space in the C API because we had a 32-bit hole in all of our nodes on 64-bit systems because of pointer alignment, so it doesn't actually increase the size of the nodes. This fixes #2383 (Support node_id).
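
As a rough illustration of the breadth_first_search and node_id bullets above, here is a minimal sketch of a level-by-level search that looks a node up by its id. It does not use prism itself: SketchNode and its fields are hypothetical stand-ins, and the real Node#breadth_first_search may differ in signature and behavior.

```ruby
# Hypothetical stand-in for a syntax tree node; this is not prism's Node class.
SketchNode = Struct.new(:node_id, :type, :children) do
  # Visit nodes level by level and return the first one for which the
  # block returns a truthy value, or nil if no node matches.
  def breadth_first_search
    queue = [self]
    until queue.empty?
      node = queue.shift
      return node if yield(node)
      queue.concat(node.children)
    end
    nil
  end
end

leaf_a = SketchNode.new(2, :integer, [])
leaf_b = SketchNode.new(3, :string, [])
call   = SketchNode.new(1, :call, [leaf_a, leaf_b])
root   = SketchNode.new(0, :program, [call])

# Look a node up by its id, the way a consumer might after a re-parse.
found = root.breadth_first_search { |node| node.node_id == 3 }
puts found.type
```

A queue-based traversal is used here so that shallow matches are found before deep ones; whether that ordering matters depends on the caller.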

@kddnewton kddnewton force-pushed the stuff branch 5 times, most recently from 8c5afc6 to c385fbf Compare July 3, 2024 16:10
@eregon
Member

eregon commented Jul 3, 2024

The DSL module now accepts keyword arguments for each field that you want to change, and has sensible defaults for all other fields. This allows you to more quickly create nodes without having to worry about the order of the fields.

Great 🎉 (I remember adapting those tests when reordering fields wasn't fun)
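
To make the keyword-argument point concrete, here is a minimal sketch of a builder where every field has a default and callers name only the fields they care about. The names IntegerSketch, SketchDSL, and integer_node are hypothetical and chosen for illustration; they are not prism's DSL module or its actual method signatures.

```ruby
# Hypothetical miniature of a keyword-argument node builder; the names
# below are illustrative and are not taken from prism's DSL module.
IntegerSketch = Struct.new(:location, :flags, :value, keyword_init: true)

module SketchDSL
  DEFAULT_LOCATION = [0, 0].freeze # [start offset, length]

  # Every field has a default, so callers only name what they change and
  # never depend on the positional order of the fields.
  def integer_node(location: DEFAULT_LOCATION, flags: 0, value: 0)
    IntegerSketch.new(location: location, flags: flags, value: value)
  end
end

builder = Object.new.extend(SketchDSL)

node = builder.integer_node(value: 42)
puts node.value
puts node.flags
```

With this shape, reordering or adding fields in the node class does not require touching call sites that only set value.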

Expose flags on every node type — every node can have common flags (newline or static_literal).

Mmh, this will increase the serialized size probably significantly.
Although it probably won't increase the Java node objects size (which already had a boolean for newline).

#1531 has pretty limited value for Java, I was mostly thinking of it as a Ruby-specific convenience (since only Ruby defines these #value methods).

OTOH it's likely nice to have newline in Java, but not sure how it will play with the existing support there.
This newline flag means "potential newline" but can occur multiple times per line, right?
If so it still needs some kind of post-processing, so it might be better to give a more explicit name to avoid incorrect usages.

node_id is now exposed in every API for every node.

Mmh, I wonder if we need it for Java and in the serialized output.

However, I don't see any other way around it that will work with the same usage pattern for the way Rails uses node id.

Do you have a description of the problem? I'd like to understand better why it's needed or if there might be any alternative (for example implicit ordering).

@@ -92,13 +92,17 @@ public abstract class Nodes {

public static final Node[] EMPTY_ARRAY = {};

public final int nodeId;
Member

I think for now it would be best to not include this because it adds 8 or 0 bytes (due to 64-bit word alignment) for each Node, and AFAIK currently it has no purpose in Java.
Specifically (AFAIK) there is no public Ruby API to access the node_id or use it, only RubyVM stuff, which is CRuby-specific.


I feel the node_id stuff in CRuby was done in a rather rushed way, I think it would be worth to gather the requirements, design a proper API for this in Prism, and clarify what information the Ruby implementations need to keep and provide.
For instance TruffleRuby already keeps the startOffset & length of every node, isn't that already enough to identify a Node?

@eregon
Member

eregon commented Jul 3, 2024

Happily, it doesn't actually take up any more space in the C API because we had a 32-bit hole in all of our nodes on 64-bit systems because of pointer alignment, so it doesn't actually increase the size of the nodes.

Is it the case for all nodes?
I'm thinking if a node struct would have less than 4 bytes of padding then the effective size will be + 8 bytes.

A quick repro to show what I mean:

#include <stdio.h>

struct A {
  int a;
  // int nodeID;
};

struct B {
  struct A a;
  int b;
  int* ptr;
};

int main(int argc, char const *argv[]) {
  printf("%zu\n", sizeof(struct B)); /* %zu is the portable format for size_t */
  return 0;
}

So without nodeID, B is 16 bytes, and with nodeID B is 24 bytes.
(adding int* p1; at the start of B still gives an 8-byte increase for nodeID)

@kddnewton
Collaborator Author

Mmh, this will increase the serialized size probably significantly.

I'm not sure this is true. For most files it will probably add 1 byte for every node that didn't already have flags on it. But calls, strings, integers, parameters, etc. all already had flags on them, and now they're part of the same integer.

This newline flag means "potential newline" but can occur multiple times per line, right?

Yes, it matches the FL_NEWLINE flag from CRuby. It indicates that the node could potentially emit a :line event in tracepoint, depending on the order the compiler assigns the lines.

Do you have a description of the problem? I'd like to understand better why it's needed or if there might be any alternative (for example #2383 (comment)).

It's a combination of a lot of things. Here is the code in Rails that's relevant: https://github.com/rails/rails/blob/4fa56814f18fd3da49c83931fa773caa727d8096/actionview/lib/action_view/template.rb#L232-L248.

Every instruction in YARV keeps around an associated node_id. So I don't want to go the route of making another pass through the AST in order to assign them, because that would mean an entire other pass before we get to compile. If we instead did it while compiling, then the node id order is entirely dependent on the compiler, and we'd need to replicate that somehow in order to find the next node.

I don't love the feature, but at this point it's one of the last things blocking us getting merged into CRuby, so I'd like to get this moving.

Is it the case for all nodes? I'm thinking if a node struct would have less than 4 bytes of padding then the effective size will be + 8 bytes.

Yeah it's the case for all nodes. All nodes have 16 bits for type, 16 bits for flags, then a 32-bit hole, then a 64-bit pointer to the start of the node.

Mmh, I wonder if we need it for Java and in the serialized output.

Because we're exposing node_id in the Ruby API, we have to have it in the serialization API because of our FFI backend, otherwise it wouldn't work through the gem on JRuby/TruffleRuby. So it's got to be there either way. I'm fine removing it from the Java nodes though if it's not going to be used.

I feel the node_id stuff in CRuby was done in a rather rushed way, I think it would be worth to gather the requirements, design a proper API for this in Prism, and clarify what information the Ruby implementations need to keep and provide.

I think it would be good to go through this exercise, but I'd really like to get prism merged sooner rather than later, so I'd like to do this after. I don't think it will be a problem to refactor this later, as it's basically a continuation of what is already there.

For instance TruffleRuby already keeps the startOffset & length of every node, isn't that already enough to identify a Node?

It's not enough, unfortunately. The ERB use case makes this not sufficient because nodes can move when ERB compiles them.

@kddnewton kddnewton merged commit 12863fd into main Jul 3, 2024
57 checks passed
@kddnewton kddnewton deleted the stuff branch July 3, 2024 18:34
@eregon
Member

eregon commented Jul 3, 2024

Thank you for 23f3a1d, I think it's the best trade-off for now.
I understand for now you need node_id for CRuby and I wouldn't want to delay that either.

I will try to take a look at the Rails use-case to understand it better and whether it could be done another way.

@eregon
Member

eregon commented Jul 3, 2024

Could you clarify this part:

It's not enough, unfortunately. The ERB use case makes this not sufficient because nodes can move when ERB compiles them.

What do you mean by moving?
Is it that ERB adds some prelude?
Or that ERB doesn't always compile an (unchanged) template to the same generated Ruby code?
Or that the template changes and then it can still work if the modification is very minor?

@eregon
Member

eregon commented Jul 3, 2024

Yeah it's the case for all nodes. All nodes have 16 bits for type, 16 bits for flags, then a 32-bit hole, then a 64-bit pointer to the start of the node.

Ah right,

typedef struct pm_node {
    /**
     * This represents the type of the node. It somewhat maps to the nodes that
     * existed in the original grammar and ripper, but it's not a 1:1 mapping.
     */
    pm_node_type_t type;

    /**
     * This represents any flags on the node. Some are common to all nodes, and
     * some are specific to the type of node.
     */
    pm_node_flags_t flags;

    /**
     * The unique identifier for this node, which is deterministic based on the
     * source. It is used to identify unique nodes across parses.
     */
    uint32_t node_id;

    /**
     * This is the location of the node in the source. It's a range of bytes
     * containing a start and an end.
     */
    pm_location_t location;
} pm_node_t;

so there are pointers inside pm_node_t (via pm_location_t), and I forgot that C pads the whole struct to a multiple of its most strictly aligned member (i.e. a pointer here, so a multiple of 8 bytes, and there were 4 bytes of padding before this change).

If the location becomes a pair of uint32_t then the footprint increase would be visible, and I could repro that in https://gist.github.com/eregon/e6cd83df15af5e0338982b46d26a7be7 where pm_constant_read_node goes from 16 to 20 bytes. But that's not the current situation and indeed printing a few nodes, the sizeof() seems the same before & after this change, nice.
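
The padding argument can be checked with a small layout calculator. This is a sketch under stated assumptions: it models the usual C layout rules on a 64-bit target (8-byte pointers), treats type and flags as 16-bit fields, and assumes the constant read node adds a single 32-bit field after the base node. The helper struct_layout and the variable names are hypothetical, not part of prism.

```ruby
# Compute [size, alignment] of a C struct from its members' [size, alignment]
# pairs, applying the usual rules: each member starts at a multiple of its
# alignment, and the total size is rounded up to the largest member alignment.
def struct_layout(members)
  offset = 0
  max_align = 1
  members.each do |size, align|
    offset = (offset + align - 1) / align * align # pad up to the alignment
    offset += size
    max_align = [max_align, align].max
  end
  [(offset + max_align - 1) / max_align * max_align, max_align]
end

u16 = [2, 2]
u32 = [4, 4]
ptr = [8, 8] # assumption: 64-bit target

# Location as two pointers, the situation described in this thread.
ptr_location = struct_layout([ptr, ptr])
node_before  = struct_layout([u16, u16, ptr_location])
node_after   = struct_layout([u16, u16, u32, ptr_location])

# Hypothetical future layout: location as a pair of uint32_t values.
u32_location    = struct_layout([u32, u32])
node_before_u32 = struct_layout([u16, u16, u32_location])
node_after_u32  = struct_layout([u16, u16, u32, u32_location])

# Assumed shape of a constant read node: base node plus one 32-bit field.
read_before_u32 = struct_layout([node_before_u32, u32])
read_after_u32  = struct_layout([node_after_u32, u32])

puts "node, pointer location:         #{node_before[0]} -> #{node_after[0]} bytes"
puts "node, uint32 location:          #{node_before_u32[0]} -> #{node_after_u32[0]} bytes"
puts "constant read, uint32 location: #{read_before_u32[0]} -> #{read_after_u32[0]} bytes"
```

Under these assumptions the pointer-based node stays the same size when node_id is added, while the uint32-based variant grows by 4 bytes, which matches the 16 to 20 byte figure mentioned above.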

@eregon
Member

eregon commented Jul 24, 2024

I took a look at serialized size stats with only semantics fields and:
before:
https://github.com/ruby/prism/actions/runs/9746914415/job/26898392339

Total sizes for top 100 gems:
total source size:      90207647
total serialized size:  66075127
total serialized/total source: 0.732

Stats of ratio serialized/source per file:
average: 0.799
median:  0.778
1st quartile: 0.567
3rd quartile: 1.006
min - max: 0.076 - 3.662

after:
https://github.com/ruby/prism/actions/runs/9783124902/job/27877177285

Total sizes for top 100 gems:
total source size:      90207647
total serialized size:  86215117
total serialized/total source: 0.956

Stats of ratio serialized/source per file:
average: 0.950
median:  0.935
1st quartile: 0.669
3rd quartile: 1.204
min - max: 0.080 - 4.052

So that's about a 19% regression for average, quite significant and likely to impact serialization & deserialization in similar amounts for TruffleRuby & JRuby.
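
For reference, the relative growth can be recomputed from the figures quoted above. This is only a worked arithmetic sketch; the helper growth_percent is a hypothetical name, and the inputs are copied from the two CI runs in this comment.

```ruby
# Figures copied from the "before" and "after" runs quoted above.
before = { total: 66_075_127, average: 0.799, median: 0.778 }
after  = { total: 86_215_117, average: 0.950, median: 0.935 }

# Relative growth in percent, rounded to one decimal place.
def growth_percent(old, new)
  ((new - old) / old.to_f * 100).round(1)
end

before.each_key do |key|
  puts "#{key}: +#{growth_percent(before[key], after[key])}%"
end
```

The per-file average ratio grows by roughly 19%, while the total serialized size across all files grows by a larger fraction, since the total is weighted toward large files.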

I would assume this increase comes from node_id and from common flags (newline and static_literal).
These 3 things have very-limited/no utility for TruffleRuby & JRuby, so I think a good fix there is to not serialize them when PRISM_SERIALIZE_ONLY_SEMANTICS_FIELDS=true. I'll try to do that.

The common flags might increase serialized size because they take 2 more bits and make other flags start at 4 instead of 1, making it more likely for the value to be >=128 and need two bytes when serialized. But I think that's pretty minimal, so I'll measure how much it is; it's probably not worth optimizing or making things more complex.
EDIT: Ah but of course common flags means every node can have flags so an extra byte serialized per node which had no flags.
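
A small sketch of the two-byte effect described above. It assumes the serializer writes integers with 7 payload bits per byte (LEB128-style), which is inferred from the ">=128 needs two bytes" remark rather than confirmed here; varint_size and the example flag position are hypothetical.

```ruby
# Number of bytes needed for an unsigned integer when each byte carries
# 7 payload bits (LEB128-style). That the serializer works this way is an
# assumption inferred from the ">= 128 needs two bytes" remark.
def varint_size(value)
  size = 1
  while value >= 128
    value >>= 7
    size += 1
  end
  size
end

common_flag_bits = 2 # newline and static_literal

# A hypothetical node-specific flag at bit 5 moves to bit 7 after the shift.
old_flag = 1 << 5
new_flag = old_flag << common_flag_bits

puts varint_size(old_flag) # 32 still fits in one byte
puts varint_size(new_flag) # 128 no longer fits in one byte
```

Only flag values whose highest set bit crosses the 7-bit boundary are affected by the shift, which is why this part of the increase is expected to be small compared to the extra byte on nodes that previously had no flags.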

         end
         super(node)
       end
     end
   end

   class Node
-    def newline? # :nodoc:
-      @newline ? true : false
+    def newline_flag? # :nodoc:
Member

@eregon eregon Jul 24, 2024

It's not ideal to rename this one and have the common flag be called newline?, because it changes semantics and probably nobody should use the latter directly: the latter is the "potential but not necessarily newline" while the former is the real newline flag, i.e. "this node is the sole node on that line with the flag".
How about renaming the common flag as POTENTIAL_NEWLINE or so?
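
To illustrate the distinction between the two flags, here is a sketch of one plausible post-processing step: among nodes that carry the "potential newline" flag, keep only the first one on each line. LineNode and real_newlines are hypothetical names, and this is not a description of how prism or any Ruby implementation actually does it.

```ruby
# Hypothetical flattened view of nodes in source order; not prism's API.
LineNode = Struct.new(:name, :line, :potential_newline)

# Keep only the first flagged node on each line, so that every line has at
# most one node carrying the "real" newline marker.
def real_newlines(nodes)
  seen_lines = {}
  nodes.select do |node|
    next false unless node.potential_newline
    next false if seen_lines[node.line]
    seen_lines[node.line] = true
  end
end

nodes = [
  LineNode.new(:call_foo, 1, true),
  LineNode.new(:arg_bar, 1, true),  # same line, also flagged
  LineNode.new(:integer, 1, false),
  LineNode.new(:call_baz, 2, true),
]

puts real_newlines(nodes).map(&:name).inspect
```

A consumer that read the potential flag directly would treat both call_foo and arg_bar as line starts, which is the kind of misuse a more explicit name is meant to prevent.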

eregon added a commit to eregon/yarp that referenced this pull request Jul 24, 2024
…LDS is set

* Note that we could shift the flags by 2 on serialize & deserialize
  but it does not seems worth it as it does not save serialized size
  in any significant amount, i.e. average was 0.799 before ruby#2924.
* $ bundle exec rake serialized_size:topgems
Before:
Total sizes for top 100 gems:
total source size:      90207647
total serialized size:  69477115
total serialized/total source: 0.770

Stats of ratio serialized/source per file:
average: 0.844
median:  0.825
1st quartile: 0.597
3rd quartile: 1.064
min - max: 0.078 - 3.792

After:
Total sizes for top 100 gems:
total source size:      90207647
total serialized size:  66150209
total serialized/total source: 0.733

Stats of ratio serialized/source per file:
average: 0.800
median:  0.779
1st quartile: 0.568
3rd quartile: 1.007
min - max: 0.076 - 3.675
@enebo
Collaborator

enebo commented Jul 24, 2024

I wish I had this updated on JRuby to say what the impact is for growing the serialized size. There are two cases and we should consider the balance of both:

  1. JRuby and TruffleRuby do not need this so it is simple to say it should not be in the serialized format
  2. If we ever want to have a universal cross impl serialized format we need to not have separate serialization between impls.

This second option has been brought up but it seems to be an idea and not necessarily a goal? FFI means we need something consistent so I guess that is sort of a form of 2 at least for consuming it as a gem?

From my perspective we have already done other things to grow the size (like only having a callnode vs fcall, vcall, call), so this is just more of the same. I will not have time to look into this for probably two weeks, so I cannot assess what this costs (which for all we know may be nothing). I am not expressing a recommendation.

@eregon
Member

eregon commented Jul 24, 2024

Fixing the increase turned out to be easy, I made a PR: #2956
We already have a different serialization which ignores non-semantic fields (= location fields) for JRuby/TruffleRuby/Java API, and it's also properly guarded in both deserializers to ensure they have the semantics fields or not as they expect. The FFI backend with the Ruby deserializer uses the full serialization format.

We could potentially support the full serialization format for Loader.java/Nodes.java, but since there is no use case currently it would just be additional complexity (and it would notably need some different package to avoid Java class clashes if we want to load them in the same process).

@enebo
Collaborator

enebo commented Jul 24, 2024

@eregon yeah, I am just saying that for option 2 in my comment they would need to be the same, unless you decided to load multiple formats, which I agree adds complexity for something that is just a future possibility (and I am not suggesting we support full serialization, but something which all three impls can load). It might not be worth it or it might, which is why I asked.

I am not sure if this will actually add significant or possibly even measurable time. Deserialization is pretty fast in comparison to building the tree. I would love for us to make this as small as possible, but in retrospect I have been surprised how fast deserialization has been. I think we should measure the impact.

@enebo
Collaborator

enebo commented Jul 24, 2024

BLEH. I cannot even install my current build of jruby-prism-gem because gcc 14 gives an error about using transposed calloc args. Trying to override this with --with-cflags seems to not be picking this up (which could be something JRuby specific since it is uncommon to be compiling C in a JRuby gem).

@enebo
Collaborator

enebo commented Jul 24, 2024

OK, another issue is that I am not tracking serialization vs building the AST separately, but from what I remember most of the time is spent making the tree, and a significant part of that was in the newline visitor. I guess when I do have time, regardless of how we resolve this, I will break this down more.

eregon added a commit that referenced this pull request Jul 25, 2024
…LDS is set


Successfully merging this pull request may close these issues.

  • Support node_id
  • A superclass/module/interface/flag for static literals
3 participants