The Two Stage Solution #24

Open
Anniepoo opened this issue May 11, 2019 · 7 comments
@Anniepoo
Collaborator

Anniepoo commented May 11, 2019

Radical idea - we leave tokenize doing more or less what it's doing - perhaps less

It doesn't recognize strings, comments, numbers, etc.

We provide a separate set of downstream filters as part of the pack.

strip_comments takes some options and will either strip comments or turn them into tokens, or perhaps filter them. Maybe it returns the noncomments on a similar stream.

strip_whitespace takes some options, and will do things like strip the space tokens, pack just them, provide the indent level, etc.

make_strings takes some options and makes a string/1 token out of bits of the string. This might be painful, as we've already lost authorial form.

make_numbers parses numbers

and so on

If we make the options the first argument for all of these, and add a tokenize_dcg//1 that just moves the options argument to the front, we can chain them with DCGs:

annies_tokenize -->
     tokenize_dcg(TokenizeOpts),
     strip_comments(CommentsOpts),
     make_strings(StringOpts),
     make_numbers(NumOpts).
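
To make the chaining mechanics concrete, here is a minimal sketch, assuming SWI-Prolog. The stage names `drop_spaces//1` and `mark_digits//1` are hypothetical stand-ins, not predicates from the pack. The point is only that a stage written as an ordinary predicate `stage(Opts, In, Out)` can be called from a DCG body as `stage(Opts)`, because DCG translation appends the two list arguments, so sequencing in the body composes the stages.

```prolog
:- use_module(library(apply)).   % exclude/3, maplist/3

% Hypothetical stage: drop_spaces(+Opts, +In, -Out).
% Removes space codes (32); everything else passes through untouched.
drop_spaces(_Opts, In, Out) :-
    exclude(==(32), In, Out).

% Hypothetical stage: mark_digits(+Opts, +In, -Out).
% Wraps digit codes (48..57) as digit(Code); passes anything else on.
mark_digits(_Opts, In, Out) :-
    maplist(mark_digit, In, Out).

mark_digit(C, digit(C)) :-
    integer(C), C >= 48, C =< 57,
    !.
mark_digit(X, X).

% The DCG body threads the list through both stages; it translates to
%   demo_pipeline(S0, S) :- drop_spaces([], S0, S1), mark_digits([], S1, S).
demo_pipeline -->
    drop_spaces([]),
    mark_digits([]).
```

Calling `phrase(demo_pipeline, [97, 32, 49], Out)` gives `Out = [97, digit(49)]`: the "remainder" argument of phrase/3 is being used as the output of the chain rather than as unparsed input.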
@shonfeder
Owner

Sounds excellent. I have some alternative preferences about names for the options, but that's trivial and can be worked out as implementations fall in place.

@Anniepoo
Collaborator Author

Notice that we can do preprocessing before the tokenize_dcg call, if it's useful, especially if tokenize_dcg passes anything in its input that isn't an int (code strings are usually just lists of ints) through as a separate token.

E.g., say the string handling doesn't handle a backslash escape that's in the language being parsed. Instead of writing a whole new string handler, we could 'fix' the oddball case:

annies_tokenize -->
     fix_oddball_string_escape,
     tokenize_dcg(TokenizeOpts),
     strip_comments(CommentsOpts),
     make_strings(StringOpts),
     make_numbers(NumOpts).
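
One possible shape for such a preprocessing stage, as a sketch only: the escape `\q` and the `escape(q)` token are made-up illustrations, and this assumes later stages pass non-integer items through unchanged. The stage rewrites the two codes for the escape into a single non-int token before tokenizing starts.

```prolog
% fix_oddball_string_escape(+In, -Out)
% Hypothetical preprocessing stage: replaces the code pair for "\q"
% (92 = backslash, 113 = q) with the token escape(q). All other items
% are copied as-is. Callable from a DCG body as fix_oddball_string_escape.
fix_oddball_string_escape([], []).
fix_oddball_string_escape([92, 113 | Rest0], [escape(q) | Rest]) :-
    !,
    fix_oddball_string_escape(Rest0, Rest).
fix_oddball_string_escape([X | Rest0], [X | Rest]) :-
    fix_oddball_string_escape(Rest0, Rest).
```

For example, `fix_oddball_string_escape([97, 92, 113, 98], Out)` gives `Out = [97, escape(q), 98]`.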

@Anniepoo
Collaborator Author

Anniepoo commented May 12, 2019

Late last night Shon and I ended with a discussion that 'tokenize' itself might go away, leaving just a chain of these stages. Each stage passes on anything that's not 'what it wants', which for most stages includes anything that isn't a number.

So, raw input

[73,32,97,109,32,97,32,115,101,110,116,101,110,99,101,46,32|...]

after passing through tokenize_words

[word([73]), 32, word([97, 109]), 32, word([97]), 32, word("sentence"), 46, 32|...]
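
A rough sketch of what a pass-through `tokenize_words` stage could look like, assuming SWI-Prolog and treating only ASCII letters as word characters (that restriction and the helper names `letters/3` and `is_letter/1` are assumptions of this sketch, not the twostage branch's code):

```prolog
% tokenize_words(+In, -Out)
% Groups runs of ASCII letter codes into word(Codes) and passes every
% other item (other codes, or tokens from earlier stages) through as-is.
tokenize_words([], []).
tokenize_words([C | Cs], [word([C | Ws]) | Ts]) :-
    is_letter(C),
    !,
    letters(Cs, Ws, Rest),
    tokenize_words(Rest, Ts).
tokenize_words([X | Xs], [X | Ts]) :-
    tokenize_words(Xs, Ts).

% letters(+In, -Letters, -Rest): take the longest prefix of letter codes.
letters([C | Cs], [C | Ws], Rest) :-
    is_letter(C),
    !,
    letters(Cs, Ws, Rest).
letters(Rest, [], Rest).

% Only integers can be letters, so non-int tokens are never consumed.
is_letter(C) :-
    integer(C),
    (   C >= 97, C =< 122
    ->  true
    ;   C >= 65, C =< 90
    ).
```

Given the codes of `I am.`, that is `[73, 32, 97, 109, 46]`, this yields `[word([73]), 32, word([97, 109]), 46]`; the space and the period are left for a later stage to claim.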

Eventually we're rehashing already tokenized stuff. E.g., we might have a combine_operators that takes [punct('*'), punct('=')] to [operator('*=')].
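
A sketch of such a re-tokenizing stage, assuming the `punct/1` tokens hold atoms and using a made-up `known_operator/1` table to decide which pairs merge:

```prolog
% combine_operators(+In, -Out)
% Merges two adjacent punct/1 tokens into one operator/1 token when
% their concatenation is a known operator; everything else passes on.
combine_operators([], []).
combine_operators([punct(A), punct(B) | T0], [operator(Op) | T]) :-
    atom_concat(A, B, Op),
    known_operator(Op),
    !,
    combine_operators(T0, T).
combine_operators([X | T0], [X | T]) :-
    combine_operators(T0, T).

% Hypothetical operator table, for illustration only.
known_operator('*=').
known_operator('+=').
known_operator('==').
```

With this table, `[punct('*'), punct('='), word([120])]` becomes `[operator('*='), word([120])]`, while a pair such as `[punct('*'), punct(',')]` is left alone.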

@shonfeder
Owner

We may still want tokenize/2 and tokenize/3, but we can just implement them on top of the new architecture. That way, users who don't want to drop into DCG land don't need to.
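
Roughly, such a wrapper could be a thin layer over phrase/3. This is a sketch under assumptions: `tokenize_via_stages/2`, `default_stages//0`, and the single `drop_spaces//0` stage are placeholder names, not the pack's real interface.

```prolog
:- use_module(library(apply)).   % exclude/3

% tokenize_via_stages(+Text, -Tokens)
% Hypothetical convenience wrapper: converts any text to codes and runs
% a fixed chain of stages, so callers never write a DCG themselves.
tokenize_via_stages(Text, Tokens) :-
    string_codes(Text, Codes),
    phrase(default_stages, Codes, Tokens).

% Placeholder default chain; a real one would list the actual stages.
default_stages -->
    drop_spaces.

% Placeholder stage: removes space codes (32).
drop_spaces(In, Out) :-
    exclude(==(32), In, Out).
```

Here `tokenize_via_stages("a b", Tokens)` gives `Tokens = [97, 98]`. An options-taking variant would pass its options on to each stage in the chain.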

@Anniepoo
Collaborator Author

Because of the size of this, we decided not to move forward during the hack day. Work on this is checked into the twostage branch, which is expected to hang around a while.

@Anniepoo
Collaborator Author

A big issue that came up is that this is an offline solution: once the stages have run, you don't have the original form and don't have file location information. The parser will need location info when it prints error messages.
This is a thorny issue, and may kill this idea.

@Anniepoo Anniepoo added the on hold Work shouldn't begin label May 13, 2019
@Anniepoo
Collaborator Author

This turns out to be what the current solution does, sorta. It's offline as well.
