Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor to use an arena #83

Closed
trishume opened this issue Jun 27, 2017 · 7 comments
Closed

Refactor to use an arena #83

trishume opened this issue Jun 27, 2017 · 7 comments

Comments

@trishume
Copy link
Owner

This would be a large project, and may never happen, but the architecture would work much better with an arena of Contexts instead of using Rc<RefCell<>> everywhere. Not only would the code be cleaner, but the lifetimes of SyntaxDefinition and SyntaxSet would be more correct, avoiding the footgun of dropping a SyntaxSet and keeping a cloned SyntaxDefinition that no longer has references to nested languages.

This would look something like using a Vec<Context> and indices everywhere.

It may also be easier to parallelize and make thread-safe, but that still requires extra work because of lazy regex compilation. See extensive discussion in #20.

@williamyaoh
Copy link
Contributor

williamyaoh commented Jun 27, 2017

Re: lazy regex compilation: Just the change to an arena of some sort should be enough to make SyntaxSet be Send; a Vec<RefCell<Context>> would do the trick without needing to change the existing logic that much. This also opens up the possibility of storing &RefCell<Context> after linking syntaxes instead of indices, making the resulting code easier to read/write.

It would be a bit too much to hope to combine Sync with lazy regex compilation; that would pretty much require synchronizing thread-shared mutability on Contexts, so using Mutex/RwLock or friends.

What's the Clone implementation on SyntaxDefinition being used for, within the codebase?

@williamyaoh
Copy link
Contributor

Disregard that part about using &RefCell instead of numerical indices; it would be implementable, but would preclude a Send implementation. However, using a Vec<RefCell<Context>> to store a given SyntaxSet's regexes + numerical indices as context links would still be enough to make it Send.

@trishume
Copy link
Owner Author

You're right, it would have the advantage of making it Send. I have a plan that could make it even Sync in #20: separate the compiled Regexes into a separate hash table and automatically pool those hash tables but access everything else between threads immutably. This is similar to what the Regex library does, I think I describe it more in #20.

@robinst
Copy link
Collaborator

robinst commented Apr 27, 2018

Hey, looked into this a bit. I have some thoughts and wanted to check if they make sense:

Should we split the types so that "syntax definition read from yaml" and "linked syntax definition" are different types? Let's call the first one SyntaxDefinition and the second SyntaxReference. We could then have a SyntaxSetBuilder where you can add SyntaxDefinition, but when you call
build, you get a SyntaxSet (where all the contexts are linked). SyntaxSet and SyntaxReference is what you would pass to ParseState.

SyntaxSet and SyntaxSetBuilder would both be serializable. The advantage of having the first one would be that it would already be linked, so you could load it from the file and it would be ready to go. That would mean you'd need to have another pair of packdumps though (or maybe just include the SyntaxSetBuilder and if a library user cares, they can create their own dump).

I'm not sure how the find_syntax_by methods fit in with this, as they might also be useful for someone while building a syntax set.

(Edit: This also means splitting types such as Context, Pattern, etc. But the "linked" ones can stay crate-internal I think.)

@trishume
Copy link
Owner Author

@robinst Yah I looked at the code a bit and that seems like a good idea. It'd entail some duplication in the structs but it'd buy a lot of type safety and thus avoid lots of runtime checks that things are linked.
The SyntaxDefinition versions could use something like the current ContextReference enum and the linked versions would just have the direct link, either a borrow from an arena with a lifetime parameter or an index into a Vec<Context>.

I'm still unsure of whether borrows and lifetime parameters or vec arenas would be better. Borrows might give more compile-time safety to the API (can't use the wrong arena with a SyntaxReference) and maybe avoid passing around the arena sometimes, but I think it may cause problems with steps that need mutation like lazy regex compiling where a vec arena wouldn't. Lazy regex compiling is super important for startup time.

As for dumps, I'm unsure of which way is better. Loading pre-linked dumps would be faster, but it makes it so it isn't possible to load new syntaxes into a set afterwards. The binary is pretty large so I only want to choose one set of dumps to include. I bet the linking step will be really fast so probably best to load unlinked ones.

So probably the API (with probably helpers to avoid unnecessary steps in simple cases) will be something like load a Vec<SyntaxDefinition> from a dump, then SyntaxSet will have an API add_definitions(&mut self, defs: &[SyntaxDefinition]) -> Result<(),LinkError> which will link in all the definitions simultaneously. If any links in defs fail to resolve it gives a link error. Sets added in the future can link to previous syntaxes but not future ones (that's a linker error). The reason for no late binding is the linked representation won't have the ContextReference info needed to bind to a syntax added later, it'll just have the direct links.

@robinst
Copy link
Collaborator

robinst commented Jun 24, 2018

Just a heads up: I have something that passes most tests but has lots of open questions! I'm going to raise a PR so we can discuss those soon.

(One of the things I didn't end up doing was splitting the types, but it should be possible to do that on top of my changes.)

@robinst robinst mentioned this issue Aug 1, 2018
9 tasks
@robinst
Copy link
Collaborator

robinst commented Oct 13, 2018

This has been implemented in 3.0.0 🎉: https://github.com/trishume/syntect/blob/master/CHANGELOG.md#version-30

@robinst robinst closed this as completed Oct 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants