Lots and lots of things #2924

kddnewton · 2024-07-03T15:11:18Z

This PR does a lot of stuff, because it was all interconnected and it was easier to do as a group.

Reconfigure error tests
- Previously we were creating syntax trees, comparing them to the recovered tree, and comparing error messages. This was very verbose, and required a lot of manual work.
- The error tests also made heavy use of the DSL module, which I had wanted to change for a while. The DSL module now accepts keyword arguments for each field that you want to change, and has sensible defaults for all other fields. This allows you to more quickly create nodes without having to worry about the order of the fields.
- The new error tests are all in their own files, and we check them by parsing them again, formatting the errors inline, and asserting that nothing changed. I also added a bin/prism error utility for creating new files of this kind to make it slightly more manageable.
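
The keyword-argument style described for the DSL can be sketched roughly as follows. This is a toy illustration, not prism's actual DSL: `ToyIntegerNode`, `ToyDSL`, and the default values are hypothetical names chosen for the example.

```ruby
# Toy stand-in for a generated node class (hypothetical, not prism's IntegerNode).
ToyIntegerNode = Struct.new(:source, :node_id, :location, :flags, :value)

# Hypothetical DSL module: every field is a keyword argument with a default,
# so callers name only the fields they care about and order stops mattering.
module ToyDSL
  module_function

  def integer_node(source: "", node_id: 0, location: 0...0, flags: 0, value: 0)
    ToyIntegerNode.new(source, node_id, location, flags, value)
  end
end

node = ToyDSL.integer_node(value: 42)
same_a = ToyDSL.integer_node(flags: 1, value: 2)
same_b = ToyDSL.integer_node(value: 2, flags: 1)
```

With this shape, reordering or adding fields in the underlying initializer should only require touching the DSL method, not every call site in the tests.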
Move location in node initializers: now that the DSL has been changed, it opens up the possibility to reconfigure our node initializers. The location parameter has been moved to the front (after source and node_id) to keep the common fields at the beginning. The actual node initializers should be seen as private API for all intents and purposes, because they will contain the most churn whenever something changes.
Expose flags on every node type: every node can have common flags (newline or static_literal). This was already present in the C and Rust APIs, but wasn't exposed to the Ruby, JS, or Java APIs. This is now present everywhere. This addresses a long-standing request (fixes "A superclass/module/interface/flag for static literals", #1531). Now every node has a common static_literal? method that can be used to check if something is a static literal. These flags are now exposed in the inspect output as well.
A utility method, Node#breadth_first_search, has been added for convenience; I've written this out a bunch of times before.
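
As a rough idea of what such a helper does, here is a minimal sketch of a breadth-first search over a toy tree. `ToyNode` and its `children` field are hypothetical stand-ins; the real method lives on prism's `Node` and may differ in detail.

```ruby
# Toy tree node (hypothetical); the method returns the first node, in
# breadth-first order, for which the block is truthy, or nil if none matches.
ToyNode = Struct.new(:name, :children) do
  def breadth_first_search
    queue = [self]
    until queue.empty?
      node = queue.shift
      return node if yield(node)
      queue.concat(node.children)
    end
    nil
  end
end

tree = ToyNode.new(:program, [
  ToyNode.new(:call, [ToyNode.new(:integer, [])]),
  ToyNode.new(:integer, [])
])

# The shallower :integer node is found before the one nested under :call.
found = tree.breadth_first_search { |node| node.name == :integer }
missing = tree.breadth_first_search { |node| node.name == :string }
```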
node_id is now exposed in every API for every node. I was very hesitant to add this functionality to prism, because I was worried about memory usage and performance. However, I don't see any other way around it that will work with the same usage pattern for the way Rails uses node id. I was getting away with it for CRuby by using (line, column) but for Rails it changes based on the ERB being compiled and I don't see how to get it otherwise. Happily, it doesn't actually take up any more space in the C API because we had a 32-bit hole in all of our nodes on 64-bit systems because of pointer alignment, so it doesn't actually increase the size of the nodes. This fixes Support node_id #2383.

eregon · 2024-07-03T16:28:05Z

> The DSL module now accepts keyword arguments for each field that you want to change, and has sensible defaults for all other fields. This allows you to more quickly create nodes without having to worry about the order of the fields.

Great 🎉 (I remember that adapting those tests when reordering fields wasn't fun)

> Expose flags on every node type: every node can have common flags (newline or static_literal).

Mmh, this will increase the serialized size probably significantly.
Although it probably won't increase the Java node objects size (which already had a boolean for newline).

#1531 has pretty limited value for Java, I was mostly thinking of it as a Ruby-specific convenience (since only Ruby defines these #value methods).

OTOH it's likely nice to have newline in Java, but not sure how it will play with the existing support there.
This newline flag means "potential newline" but can occur multiple times per line, right?
If so it still needs some kind of post-processing, so it might be better to give a more explicit name to avoid incorrect usages.

> node_id is now exposed in every API for every node.

Mmh, I wonder if we need it for Java and in the serialized output.

> However, I don't see any other way around it that will work with the same usage pattern for the way Rails uses node id.

Do you have a description of the problem? I'd like to understand better why it's needed or if there might be any alternative (for example implicit ordering).

eregon · 2024-07-03T16:44:57Z

templates/java/org/prism/Nodes.java.erb

@@ -92,13 +92,17 @@ public abstract class Nodes {

        public static final Node[] EMPTY_ARRAY = {};

+        public final int nodeId;


I think for now it would be best to not include this because it adds 8 or 0 bytes (due to 64-bit word alignment) for each Node, and AFAIK currently it has no purpose in Java.
Specifically (AFAIK) there is no public Ruby API to access the node_id or use it, only RubyVM stuff, which is CRuby-specific.

I feel the node_id stuff in CRuby was done in a rather rushed way. I think it would be worth gathering the requirements, designing a proper API for this in Prism, and clarifying what information the Ruby implementations need to keep and provide.
For instance TruffleRuby already keeps the startOffset & length of every node, isn't that already enough to identify a Node?

eregon · 2024-07-03T16:53:23Z

> Happily, it doesn't actually take up any more space in the C API because we had a 32-bit hole in all of our nodes on 64-bit systems because of pointer alignment, so it doesn't actually increase the size of the nodes.

Is it the case for all nodes?
I'm thinking if a node struct would have less than 4 bytes of padding then the effective size will be + 8 bytes.

A quick repro to show what I mean:

#include <stdio.h>

struct A {
  int a;
  // int nodeID;
};

struct B {
  struct A a;
  int b;
  int* ptr;
};

int main(int argc, char const *argv[]) {
  printf("%zu\n", sizeof(struct B)); // %zu is the portable format for size_t
  return 0;
}

So without nodeID, B is 16 bytes, and with nodeID B is 24 bytes.
(adding int* p1; at the start of B still gives an 8-byte increase for nodeID)
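
The numbers from this repro can be reproduced by hand with the usual C layout rules. The sketch below models those rules in Ruby; it assumes a typical 64-bit ABI with 4-byte `int` and 8-byte pointers, and `struct_layout` is a hypothetical helper written for this illustration, not anything from prism.

```ruby
# Returns [size, alignment] of a C struct whose members are given as
# [size, alignment] pairs in declaration order.
def struct_layout(members)
  offset = 0
  strictest = 1
  members.each do |size, align|
    offset += (-offset) % align # padding so the member starts aligned
    offset += size
    strictest = [strictest, align].max
  end
  [offset + (-offset) % strictest, strictest] # tail padding
end

int = [4, 4] # assumed: 4-byte int
ptr = [8, 8] # assumed: 8-byte pointer

a_without = struct_layout([int])      # struct A { int a; }
a_with    = struct_layout([int, int]) # struct A { int a; int nodeID; }

b_without = struct_layout([a_without, int, ptr]).first
b_with    = struct_layout([a_with, int, ptr]).first

# Variant with an extra pointer at the start of B.
b_ptr_without = struct_layout([ptr, a_without, int, ptr]).first
b_ptr_with    = struct_layout([ptr, a_with, int, ptr]).first

puts b_without # 16
puts b_with    # 24
```

In this layout the extra `int` pushes `b` past the 8-byte boundary, so padding is needed before `ptr`, which is where the 8-byte jump comes from.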

kddnewton · 2024-07-03T18:08:56Z

> Mmh, this will increase the serialized size probably significantly.

I'm not sure this is true. For most files it will probably add 1 byte for every node that didn't already have flags on it. But calls, strings, integers, parameters, etc. all already had flags on them, and now they're part of the same integer.

> This newline flag means "potential newline" but can occur multiple times per line, right?

Yes, it matches the FL_NEWLINE flag from CRuby. It indicates that the node could potentially emit a :line event in tracepoint, depending on the order the compiler assigns the lines.

> Do you have a description of the problem? I'd like to understand better why it's needed or if there might be any alternative (for example #2383 (comment)).

It's a combination of a lot of things. Here is the code in Rails that's relevant: https://github.com/rails/rails/blob/4fa56814f18fd3da49c83931fa773caa727d8096/actionview/lib/action_view/template.rb#L232-L248.

Every instruction in YARV keeps around an associated node_id. So I don't want to go the route of making another pass through the AST in order to assign them, because that would mean an entire other pass before we get to compile. If we instead did it while compiling, then the node id order is entirely dependent on the compiler, and we'd need to replicate that somehow in order to find the next node.

I don't love the feature, but at this point it's one of the last things blocking us getting merged into CRuby, so I'd like to get this moving.

> Is it the case for all nodes? I'm thinking if a node struct would have less than 4 bytes of padding then the effective size will be + 8 bytes.

Yeah it's the case for all nodes. All nodes have 16 bits for type, 16 bits for flags, then a 32-bit hole, then a 64-bit pointer to the start of the node.

> Mmh, I wonder if we need it for Java and in the serialized output.

Because we're exposing node_id in the Ruby API, we have to have it in the serialization API because of our FFI backend, otherwise it wouldn't work through the gem on JRuby/TruffleRuby. So it's got to be there either way. I'm fine removing it from the Java nodes though if it's not going to be used.

> I feel the node_id stuff in CRuby was done in a rather rushed way. I think it would be worth gathering the requirements, designing a proper API for this in Prism, and clarifying what information the Ruby implementations need to keep and provide.

I think it would be good to go through this exercise, but I'd really like to get prism merged sooner rather than later, so I'd like to do this after. I don't think it will be a problem to refactor this later, as it's basically a continuation of what is already there.

> For instance TruffleRuby already keeps the startOffset & length of every node, isn't that already enough to identify a Node?

It's not enough, unfortunately. The ERB use case makes this not sufficient because nodes can move when ERB compiles them.

eregon · 2024-07-03T19:27:02Z

Thank you for 23f3a1d, I think it's the best trade-off for now.
I understand for now you need node_id for CRuby and I wouldn't want to delay that either.

I will try to take a look at the Rails use-case to understand it better and whether it could be done another way.

eregon · 2024-07-03T19:30:13Z

Could you clarify this part:

> It's not enough, unfortunately. The ERB use case makes this not sufficient because nodes can move when ERB compiles them.

What do you mean by moving?
Is it that ERB adds some prelude?
Or that ERB doesn't always compile an (unchanged) template to the same generated Ruby code?
Or that the template changes and then it can still work if the modification is very minor?

eregon · 2024-07-03T20:10:12Z

> Yeah it's the case for all nodes. All nodes have 16 bits for type, 16 bits for flags, then a 32-bit hole, then a 64-bit pointer to the start of the node.

Ah right,

prism/templates/include/prism/ast.h.erb

Lines 126 to 150 in ad5efb2

    
typedef struct pm_node {
    /**
     * This represents the type of the node. It somewhat maps to the nodes that
     * existed in the original grammar and ripper, but it's not a 1:1 mapping.
     */
    pm_node_type_t type;

    /**
     * This represents any flags on the node. Some are common to all nodes, and
     * some are specific to the type of node.
     */
    pm_node_flags_t flags;

    /**
     * The unique identifier for this node, which is deterministic based on the
     * source. It is used to identify unique nodes across parses.
     */
    uint32_t node_id;

    /**
     * This is the location of the node in the source. It's a range of bytes
     * containing a start and an end.
     */
    pm_location_t location;
} pm_node_t;

so there are pointers inside pm_node_t (via pm_location_t), and I forgot that C pads a struct so that its total size is a multiple of its strictest member alignment (a pointer here, so a multiple of 8 bytes, and there were 4 bytes of padding before this change).

If the location becomes a pair of uint32_t then the footprint increase would be visible, and I could repro that in https://gist.github.com/eregon/e6cd83df15af5e0338982b46d26a7be7 where pm_constant_read_node goes from 16 to 20 bytes. But that's not the current situation and indeed printing a few nodes, the sizeof() seems the same before & after this change, nice.
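
Both situations can be checked on paper with a small layout model. The sketch below is an illustration under stated assumptions: 64-bit pointers, `type` and `flags` as 16-bit integers, `pm_location_t` as a pair of pointers, and a constant read node modelled as the node header plus a single `uint32_t` field. `struct_layout` is a hypothetical helper, not part of prism.

```ruby
# Returns [size, alignment] of a C struct whose members are given as
# [size, alignment] pairs in declaration order.
def struct_layout(members)
  offset = 0
  strictest = 1
  members.each do |size, align|
    offset += (-offset) % align # padding so the member starts aligned
    offset += size
    strictest = [strictest, align].max
  end
  [offset + (-offset) % strictest, strictest] # tail padding
end

u16 = [2, 2]
u32 = [4, 4]
ptr = [8, 8] # assumes 64-bit pointers

pointer_location = struct_layout([ptr, ptr]) # assumed current pm_location_t
compact_location = struct_layout([u32, u32]) # hypothetical uint32_t pair

# Node header without and with node_id, for each kind of location.
pointers_before = struct_layout([u16, u16, pointer_location])
pointers_after  = struct_layout([u16, u16, u32, pointer_location])
compact_before  = struct_layout([u16, u16, compact_location])
compact_after   = struct_layout([u16, u16, u32, compact_location])

# Modelled constant read node: header plus one uint32_t field (assumption).
constant_read_before = struct_layout([compact_before, u32]).first
constant_read_after  = struct_layout([compact_after, u32]).first

puts pointers_before.first # 24
puts pointers_after.first  # 24
puts constant_read_before  # 16
puts constant_read_after   # 20
```

With pointer-based locations the new field lands in what was padding, so the header stays at 24 bytes; with 4-byte-aligned locations there is no padding to absorb it.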

eregon · 2024-07-24T19:27:24Z

I took a look at serialized size stats with only semantics fields and:
before:
https://github.com/ruby/prism/actions/runs/9746914415/job/26898392339

Total sizes for top 100 gems:
total source size:      90207647
total serialized size:  66075127
total serialized/total source: 0.732

Stats of ratio serialized/source per file:
average: 0.799
median:  0.778
1st quartile: 0.567
3rd quartile: 1.006
min - max: 0.076 - 3.662

after:
https://github.com/ruby/prism/actions/runs/9783124902/job/27877177285

Total sizes for top 100 gems:
total source size:      90207647
total serialized size:  86215117
total serialized/total source: 0.956

Stats of ratio serialized/source per file:
average: 0.950
median:  0.935
1st quartile: 0.669
3rd quartile: 1.204
min - max: 0.080 - 4.052

So that's about a 19% regression for average, quite significant and likely to impact serialization & deserialization in similar amounts for TruffleRuby & JRuby.

I would assume this increase comes from node_id and from common flags (newline and static_literal).
These 3 things have very-limited/no utility for TruffleRuby & JRuby, so I think a good fix there is to not serialize them when PRISM_SERIALIZE_ONLY_SEMANTICS_FIELDS=true. I'll try to do that.

The common flags might increase the serialized size because they take 2 more bits and make other flags start at 4 instead of 1, making it more likely for the value to be >=128 and need two serialized bytes. But I think that's pretty minimal, so I'll measure how much that is; it's probably not worth optimizing or making things more complex.
EDIT: Ah but of course common flags means every node can have flags so an extra byte serialized per node which had no flags.
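
Both effects can be illustrated with a tiny byte-count helper. This sketch assumes flags are written as a variable-length unsigned integer with 7 payload bits per byte (LEB128-style), which is consistent with the ">=128 needs two bytes" remark; `varuint_size` is a hypothetical helper, not prism's serializer.

```ruby
# Number of bytes needed to encode n with 7 payload bits per byte.
def varuint_size(n)
  size = 1
  while n >= 0x80
    n >>= 7
    size += 1
  end
  size
end

type_specific_flag = 1 << 5           # a flag that used to sit at bit 5
shifted_flag = type_specific_flag << 2 # same flag once 2 common bits come first

puts varuint_size(type_specific_flag) # 1
puts varuint_size(shifted_flag)       # 2
puts varuint_size(0)                  # 1
```

The last line is the effect noted in the EDIT: under this assumption, even an all-zero flags field costs one byte per node once every node has flags.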

eregon · 2024-07-24T19:45:24Z

lib/prism/parse_result/newlines.rb

        end
        super(node)
      end
    end
  end

  class Node
-    def newline? # :nodoc:
-      @newline ? true : false
+    def newline_flag? # :nodoc:


It's not ideal to rename this one and have the common flag be called newline?, because it changes semantics and probably nobody should use the latter directly: the latter is the "potential but not necessarily newline" while the former is the real newline flag, i.e. "this node is the sole node on that line with the flag".
How about renaming the common flag as POTENTIAL_NEWLINE or so?
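
The difference between the two flags can be sketched as a small post-processing pass. This is a toy model, not prism's newline-marking code: `ToyStatement` and `mark_newlines` are hypothetical names, and choosing the first flagged node on each line is an assumption made for illustration.

```ruby
# potential_newline: set by the parser, may appear on several nodes per line.
# newline: set by post-processing, on at most one node per line.
ToyStatement = Struct.new(:line, :potential_newline, :newline)

def mark_newlines(statements)
  seen_lines = {}
  statements.each do |statement|
    next unless statement.potential_newline
    next if seen_lines[statement.line]
    seen_lines[statement.line] = true
    statement.newline = true
  end
  statements
end

statements = [
  ToyStatement.new(1, true, false),
  ToyStatement.new(1, true, false),  # same line: stays unmarked
  ToyStatement.new(2, false, false), # never a candidate
  ToyStatement.new(2, true, false)
]
marked = mark_newlines(statements).map(&:newline)
```

A consumer that read the potential flag directly would treat both statements on line 1 as newlines, which is the misuse a more explicit name would help avoid.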

…LDS is set
* Note that we could shift the flags by 2 on serialize & deserialize but it does not seem worth it as it does not save serialized size in any significant amount, i.e. average was 0.799 before ruby#2924.
* $ bundle exec rake serialized_size:topgems

Before:
Total sizes for top 100 gems:
total source size:      90207647
total serialized size:  69477115
total serialized/total source: 0.770

Stats of ratio serialized/source per file:
average: 0.844
median:  0.825
1st quartile: 0.597
3rd quartile: 1.064
min - max: 0.078 - 3.792

After:
Total sizes for top 100 gems:
total source size:      90207647
total serialized size:  66150209
total serialized/total source: 0.733

Stats of ratio serialized/source per file:
average: 0.800
median:  0.779
1st quartile: 0.568
3rd quartile: 1.007
min - max: 0.076 - 3.675

enebo · 2024-07-24T20:30:34Z

I wish I had this updated on JRuby to say what the impact is for growing the serialized size. There are two cases and we should consider the balance of both:

1. JRuby and TruffleRuby do not need this, so it is simple to say it should not be in the serialized format.
2. If we ever want to have a universal cross-impl serialized format, we need to not have separate serialization between impls.

This second option has been brought up but it seems to be an idea and not necessarily a goal? FFI means we need something consistent so I guess that is sort of a form of 2 at least for consuming it as a gem?

From my perspective we have already done other things to grow the size (like only having a callnode vs fcall,vcall,call) so this is just more of the same. I will not have time to look into this for probably two weeks, so I cannot assess what this costs (which for all we know may be nothing). I am not expressing a recommendation.

eregon · 2024-07-24T20:37:28Z

Fixing the increase turned out to be easy, I made a PR: #2956
We already have a different serialization which ignores non-semantic fields (= location fields) for JRuby/TruffleRuby/Java API, and it's also properly guarded in both deserializers to ensure they have the semantics fields or not as they expect. The FFI backend with the Ruby deserializer uses the full serialization format.

We could potentially support the full serialization format for Loader.java/Nodes.java, but since there is no use case currently it would just be additional complexity (and it would notably need some different package to avoid Java class clashes if we want to load them in the same process).

enebo · 2024-07-24T21:50:43Z

@eregon yeah I am just saying for option 2 in my comment they would need to be the same unless you decided to load multiple formats which I agree adds complexity for something which is just a future (and I am not suggesting we support full serialization but something which all three impls can load). It might not be worth it or it might which is why I asked.

I am not sure if this will actually add significant or possibly even measurable time. Deserialization is pretty fast in comparison to building the tree. I would love for us to reduce the size of this as much as possible, but in retrospect I have been surprised how fast deserialization has been. I think we should measure the impact.

enebo · 2024-07-24T22:03:32Z

BLEH. I cannot even install my current build of jruby-prism-gem because gcc 14 gives an error about using transposed calloc args. Trying to override this with --with-cflags seems to not be picking this up (which could be something JRuby specific since it is uncommon to be compiling C in a JRuby gem).

enebo · 2024-07-24T22:08:31Z

ok another issue is I am not tracking serialize vs building AST but from what I remember most time is making the tree and a significant part of that was in the newline visitor. I guess when I do have time regardless of how we resolve this I will break this down more.


kddnewton added 7 commits July 2, 2024 12:34

Reconfigure error tests

6e46c61

Move location to second position for node initializers

dc3122f

Expose flags on every node type

4024fbb

Expose common flags in inspect output

6c4c5ee

Add Node#breadth_first_search

1e4ff49

Move Node#type and Node::type documentation

6c7ff47

Add node ids to nodes

7cfe5f0

kddnewton force-pushed the stuff branch 5 times, most recently from 8c5afc6 to c385fbf Compare July 3, 2024 16:10

eregon reviewed Jul 3, 2024


kddnewton added 2 commits July 3, 2024 14:16

Various cleanup for initializers and typechecks

3943797

Do not expose node_id to Java

23f3a1d

kddnewton force-pushed the stuff branch from c385fbf to 23f3a1d Compare July 3, 2024 18:17

kddnewton merged commit 12863fd into main Jul 3, 2024
57 checks passed

kddnewton deleted the stuff branch July 3, 2024 18:34

eregon reviewed Jul 24, 2024


eregon mentioned this pull request Jul 24, 2024

Optimize serialized size for PRISM_SERIALIZE_ONLY_SEMANTICS_FIELDS=1 #2956

Merged

andrykonchin mentioned this pull request Sep 12, 2024

Fix node flags values for Java #3054

Merged
