Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Expand node identification and manipulation API #105

Open
zachallaun opened this issue Sep 10, 2023 · 15 comments
Open

Proposal: Expand node identification and manipulation API #105

zachallaun opened this issue Sep 10, 2023 · 15 comments

Comments

@zachallaun
Copy link
Contributor

zachallaun commented Sep 10, 2023

I've recently found myself writing a handful of functions and guards that I think would be useful additions to Sourceror.

The motivation for this is to have a handful of primitives that allow you to differentiate between and navigate the nodes of an AST parsed by Sourceror (and to a lesser extent, the default Code.string_to_quoted options). This is based on an assumption that the fundamental "categories" of nodes in an Elixir AST are:

  1. Literals, which Sourceror may wrap in a :__block__ in order to retain additional metadata
  2. Variables
  3. Special forms
  4. Calls, either local or qualified.

Proposed guards:

  • is_local_call(node) - local calls like foo(), 1 + 1, etc. Excludes special nodes like :__block__, :__aliases__, or data structures like :%{}.
  • is_qualified_call(node) - similar to above but for qualified/dot calls, Foo.bar(), foo.(), etc.
  • is_call(node) - helper for local or qualified call
  • is_var(node) - identifies variables
  • is_literal(node) - literals in the AST and wrapped {:__block__, _, [literal]} nodes. (A block containing multiple literals is not a literal. I think this distinction is very useful when navigating AST.)

Proposed functions:

  • fetch_literal(node) - returns {:ok, value} | :error - these would return the same thing: fetch_literal({:__block__, _, [:foo]}) / fetch_literal(:foo)
  • fetch_children(node) - returns {:ok, [child]} | :error
  • has_range? - if true, node can be passed to get_range without raising (removed in favor of Change get_range/1 to allow returning nil and add syntax corpus #107)
  • Possibly deprecate Zipper's branch?(tree) and children(tree), eventually make private?

Additional considerations:

Some of this may be related to the Access work you've been doing. Can it be unified?

@doorgan
Copy link
Owner

doorgan commented Sep 10, 2023

This would be a welcome change!
I'd make the following changes though:

  • is_local_call should be named is_unqualified_call, as local or non-local is a runtime property and not necessarily a syntax thing
  • is_var should be named is_identifier

Also, what if instead of has_range? we add a fetch_range function with ok/error results? Or we could have both, I'm curious on the motivation for this one

@zachallaun
Copy link
Contributor Author

  • is_local_call should be named is_unqualified_call, as local or non-local is a runtime property and not necessarily a syntax thing

  • is_var should be named is_identifier

Good points! Agree with both of these.

Also, what if instead of has_range? we add a fetch_range function with ok/error results? Or we could have both, I'm curious on the motivation for this one

My real motivation for this is wanting to know, regardless of what node I have, whether I should be able to call get_range on it.

My first thought was to actually suggest a change to get_range: instead of raising when it can't calculate the range for a node, it should return nil (same semantics as other get_* functions). However, that has two problems that I could think of:

  1. It would be a breaking change.
  2. Since get_range is defined recursively, it would be a non-trivial refactor to guard against nil in all the recursive cases.

Unfortunately, these issues also sort of apply to a fetch_range -- it could rescue FunctionClauseError, but that's pretty gross :P.

Given that, the next-best thing I could think of was a has_range? function!

@doorgan
Copy link
Owner

doorgan commented Sep 10, 2023

I see, though wouldn't the addition of has_range? need a refactor regardless? For a bunch of nodes we need to recursively check the contents to calculate the range(if possible), so this function would need to do something similar.

Regardless, I think has_range? can still be useful so I'm good with the addition :)

@zachallaun
Copy link
Contributor Author

Hmm... Maybe so! Let's see what happens with the implementation and can decide then what the right API is 🙂

@doorgan
Copy link
Owner

doorgan commented Sep 10, 2023

Sounds good!

@doorgan
Copy link
Owner

doorgan commented Sep 10, 2023

Re: the Access work, that will be built on existing sourceror primitives unless some optimization is required, so it's not something to worry about :)

@zachallaun
Copy link
Contributor Author

zachallaun commented Sep 13, 2023

I started working on this a bit, but realized that I need a precise definition for unqualified calls.

For context, here's some "prior art" from ElixirSense:

defguardp is_call(call, params)
          when is_atom(call) and is_list(params) and
                 call not in [:., :__aliases__, :"::", :{}, :|>, :%, :%{}]

I think this generally makes sense, but the definition doesn't strike me as complete. For instance, it doesn't account for | usage in map-update syntax (%{foo | bar: :baz}), or the pin operator ^, which can only be used in a match operation or macro.

So what makes something a "call" vs. a special form? I think the critical element that can be used to evaluate something is whether that thing is definable by users without modifying the compiler.

By that definition, the following would not be considered calls because they are reserved words or forms that cannot be overridden:

  • reserved words: fn, when, and, or, not, in
  • data structure literals: %, %{}, {}, [], <<>>
  • operators that cannot be overridden: ^, ., =, &, ::, =>, <-, \\, |

This is a much larger set than what ElixirSense is using.

Given this, here's a possible definition:

Definition 1

Calls are functions, macros, and operators that operate on arguments and are definable or overridable in Elixir code.

Does this make sense and is it a useful definition?

An alternative could be to define calls more broadly and to add an additional is_special_form that identifies calls that, in essence, cannot be overridden.

Definition 2

Calls are functions, macros, operators, or special forms that operate on arguments.

If we accept this second definition, I'd think that :__aliases__ would be a call (and special form), while :__block__ would be neither.

Am I overthinking this? 😅

@doorgan
Copy link
Owner

doorgan commented Sep 13, 2023

@zachallaun those are good calls, but from my point of view Sourceror works at a lower level
A "call" vs "special form" is a compiler and runtime thing, but Sourceror works at the syntax level, where those semantics don't matter

As far as we are concerned, with foo <- bar do ... end could be a function, a macro, or none at all

In Elixir AST you can boil it down to two scenarios:
You have AST representing a node with arguments: {type, metadata, arguments} where type is an atom like :foo
This is a call, and includes operators, constructors like % or %{}, etc

In some cases, a call type is another call instead: {{:., dot_metadata, dot_arguments}, metadata, arguments}
Those are qualified calls, and represent the foo.bar construct, where the . is what makes it "qualified"
This is the case for foo.bar, Foo.bar, and even Foo.{Bar} which is a "qualified tuple" but a call nonetheless

So from Sourceror's point of view, a call or unqualified_call is any 3-tuple node where the type is an atom and the arguments a list, a qualified call is the same but with a dot call as the type

Now, from the point of view of a tool like ElixirSense it does make more sense to narrow down the definition because it's dealing with a higher level where the language semantics do matter.

I do welcome functions that deal with a higher level for the sake of making it easier to navigate the AST considering the semantics, maybe in a separate module under Sourceror that is aware of default elixir semantics, but functions in the Sourceror module itself deal with the syntax only and assume al syntax is meaningless, as in fn or when may not be reserved in some context like macros.

@doorgan
Copy link
Owner

doorgan commented Sep 13, 2023

An useful thing however may be to exclude :__block__ from any definition of call because those are workarounds to represent other syntax constructs, this would match your second definition

I'm torn on :__aliases__ though, because it doesn't accept other AST as arguments and it's pretty much another hack to make it easy to extract the alias segments

@zachallaun
Copy link
Contributor Author

Thanks for sharing your thoughts! This largely makes sense. I'm interested in identifying higher level semantics, but agree that it makes sense in a separate module. The existing Identifier module kinda does this, right? Perhaps it makes sense in there?

I think :__aliases__ is not a call because it's not transformative in any way -- as you mentioned, it's basically a syntax hack to represent a slightly more complex, but still "atomic", element.

@doorgan
Copy link
Owner

doorgan commented Sep 13, 2023

Actually now that you mention this, Sourceror.Identifier would be a great place for the functions in this proposal. That module started as a bag for vendored helpers of Code.Normalizer and I figured I could make it public

I do agree with you that :__aliases__ is not a call, so let's put it out of the notion of a "call" at the syntax level

I think the next step for semantics aware functions would be to come up with a name/place for such functions, maybe Sourceror.Code? And we rename the existing and private module with that name to something like Sourceror.Compat.Code

We can't have environment aware functions like ElixirSense may have, but we definitely can provide functions that operate on AST informed by vanilla elixir semantics

@zachallaun
Copy link
Contributor Author

Sounds good!

Another thing we need is a good definition for are literals.

I can imagine two versions, one that's lower level and lives in Sourceror.Identifier and one that's higher level and lives in the new Sourceror.Code.

The lower-level would be is_literal, and it returns true for AST literals. This has a narrow definition: numbers, atoms, strings, or :__block__ nodes that contain only one child -- a number, atom, or string.

The higher-level definition would be data? and would live in Sourceror.Code. This would be similar to Macro.quoted_literal?/1, with the exception that, like is_literal, it considers :__block__ nodes containing a single literal to also be literals.

Thoughts?

@zachallaun
Copy link
Contributor Author

I'm actually thinking these should be called:

Sourceror.Identifier.is_atomic_literal/1
Sourceror.Code.quoted_literal?/1

@zachallaun
Copy link
Contributor Author

Regarding moving the existing Sourceror.Code -- Formatter and Normalizer already is in a lib_vendored. What if we make them all Sourceror.Vendored.*? So:

Sourceror.Vendored.Code
Sourceror.Vendored.Code.Formatter
Sourceror.Vendored.Code.Normalizer

@doorgan
Copy link
Owner

doorgan commented Sep 16, 2023

That sounds like a good change, so yes we can do that!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants