The Two Stage Solution #24

Anniepoo · 2019-05-11T22:43:34Z

Radical idea: we leave tokenize doing more or less what it's doing now, perhaps less.

It doesn't recognize strings, comments, numbers, etc.

We provide a separate set of downstream filters as part of the pack.

strip_comments takes some options and will either strip comments or turn them into tokens, or perhaps filter them. Maybe it returns the noncomments on a similar stream.

strip_whitespace takes some options, and will do things like strip the space tokens, pack just them, provide the indent level, etc.

make_strings takes some options and makes a string/1 token out of bits of the string. This might be painful, as we've already lost authorial form.

make_numbers parses numbers

and so on

If we make the options the first argument for all these, and make a tokenize_dcg//1 that just moves the options arg to the front, we can do chains with DCGs:

annies_tokenize -->
     tokenize_dcg(TokenizeOpts),
     strip_comments(CommentsOpts),
     make_strings(StringOpts),
     make_numbers(NumOpts).
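
A minimal sketch of why the chain works, assuming SWI-Prolog: a predicate of the shape stage(Options, In, Out) can be called as the nonterminal stage(Options)//0, so the DCG threads the whole stream from one stage to the next. The two stage bodies and the keep(true) option below are invented for illustration and are not the pack's actual behaviour.

```prolog
:- use_module(library(option)).
:- use_module(library(apply)).

% Hypothetical stage: drop whitespace codes unless keep(true) is given.
strip_whitespace(Opts, In, Out) :-
    (   option(keep(true), Opts)
    ->  Out = In
    ;   exclude(is_space_code, In, Out)
    ).

is_space_code(C) :- integer(C), code_type(C, space).

% Hypothetical stage: turn each decimal digit code into digit(N) and
% pass everything else along untouched.
make_digits(_Opts, In, Out) :-
    maplist(digit_or_same, In, Out).

digit_or_same(C, digit(N)) :- integer(C), code_type(C, digit(N)), !.
digit_or_same(X, X).

% The DCG body just chains the stages: the output of one stage becomes
% the input of the next.
pipeline -->
    strip_whitespace([]),
    make_digits([]).
```

Calling phrase(pipeline, Codes, Tokens) then runs the whole chain, with Codes as the raw input and Tokens as the output of the last stage.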

shonfeder · 2019-05-11T22:55:28Z

Sounds excellent. I have some alternative preferences about names for the options, but that's trivial and can be worked out as implementations fall in place.

Anniepoo · 2019-05-11T23:15:51Z

Notice that we can do preprocessing before the tokenize_dcg call, if it's useful.
Especially if tokenize_dcg passes anything in its input that isn't an int (code strings are usually just
lists of ints) through as a separate token.

E.g., say the string handling doesn't handle a backslash escape that's in the language being parsed.
Instead of writing a whole new string handler, we could 'fix' the oddball case in a pre-pass:

annies_tokenize -->
     fix_oddball_string_escape,
     tokenize_dcg(TokenizeOpts),
     strip_comments(CommentsOpts),
     make_strings(StringOpts),
     make_numbers(NumOpts).
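
One way such a pre-pass might look, as a sketch only: it replaces the two codes for a backslash followed by q with a single non-integer term, relying on later stages passing non-ints through untouched. The escape chosen (\q) and the escaped(q) term are assumptions made up for this example.

```prolog
% Hypothetical pre-pass, usable as the nonterminal
% fix_oddball_string_escape//0 in a chain of stages.
fix_oddball_string_escape(In, Out) :-
    phrase(fix_escapes(Out), In).

% 92 is the code for backslash, 113 the code for q.
fix_escapes([escaped(q)|T]) --> [92, 113], !, fix_escapes(T).
fix_escapes([X|T])          --> [X], !, fix_escapes(T).
fix_escapes([])             --> [].
```

Everything other than the one oddball sequence is copied through unchanged, so the stage is safe to put in front of any tokenizer that tolerates non-int items.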

Anniepoo · 2019-05-12T14:35:13Z

Late last night Shon and I ended with a discussion that 'tokenize' itself might go away, leaving just a chain of these stages. Each stage passes on anything that's not 'what it wants', which for most stages includes anything that isn't a number (i.e. anything that's already a token).

So, raw input

[73,32,97,109,32,97,32,115,101,110,116,101,110,99,101,46,32|...]

after passing through tokenize_words

[word([73]), 32, word([97, 109]), 32, word([97]), 32, word("sentence"), 46, 32|...]
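
A sketch of what a tokenize_words stage could look like under this scheme, assuming SWI-Prolog and treating only ASCII letters as word characters (both assumptions; the helper names are invented here). Runs of letters become word(Codes), and anything else, including terms produced by earlier stages, is passed along as-is.

```prolog
% Hypothetical stage: group runs of ASCII letters into word(Codes).
tokenize_words(In, Out) :-
    phrase(words(Out), In).

words([word([C|Cs])|T]) --> letter(C), !, letters(Cs), words(T).
words([X|T])            --> [X], !, words(T).   % not ours: pass it on
words([])               --> [].

letters([C|Cs]) --> letter(C), !, letters(Cs).
letters([])     --> [].

letter(C) --> [C], { ascii_letter(C) }.

% Only integers can be letters, so already-tokenized terms fall through.
ascii_letter(C) :-
    integer(C),
    (   C >= 0'a, C =< 0'z
    ;   C >= 0'A, C =< 0'Z
    ),
    !.
```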

Eventually we're rehashing already tokenized stuff. E.g., we might have a combine_operators that takes [punct('*'), punct('=')] to [operator('*=')].
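
A later stage that works on tokens rather than codes could be sketched like this. The known_operator/1 table is a made-up stand-in for whatever operator set the target language has, and the punct/1 and operator/1 shapes are taken from the example above.

```prolog
% Hypothetical stage: merge adjacent punct/1 tokens that together
% spell a known operator.
combine_operators(In, Out) :-
    phrase(combined(Out), In).

combined([operator(Op)|T]) -->
    [punct(A), punct(B)],
    { atom_concat(A, B, Op), known_operator(Op) },
    !,
    combined(T).
combined([X|T]) --> [X], !, combined(T).   % anything else passes through
combined([])    --> [].

% Assumed operator table, for illustration only.
known_operator('*=').
known_operator('+=').
known_operator('==').
```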

shonfeder · 2019-05-12T14:58:54Z

We may still want tokenize/2 and tokenize/3, but we can just implement them on top of the new architecture. That way, users who don't want to drop into DCG land don't need to.
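
Roughly, the wrappers could be as thin as the sketch below, assuming SWI-Prolog. The names my_tokenize/2 and my_tokenize/3 are used so as not to suggest this is the pack's real tokenize/2,3, and the single stage and its spaces(false) option are invented for the example.

```prolog
:- use_module(library(option)).
:- use_module(library(apply)).

% Hypothetical front ends: callers never see the DCG.
my_tokenize(Text, Tokens) :-
    my_tokenize(Text, Tokens, []).

my_tokenize(Text, Tokens, Options) :-
    text_to_string(Text, String),
    string_codes(String, Codes),
    phrase(stages(Options), Codes, Tokens).

% The fixed chain of stages hidden behind the wrappers.
stages(Options) -->
    spaces_stage(Options).

% Invented stage: drop space codes when spaces(false) is requested.
spaces_stage(Options, In, Out) :-
    (   option(spaces(false), Options)
    ->  exclude(==(32), In, Out)
    ;   Out = In
    ).
```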

Anniepoo · 2019-05-12T19:40:45Z

Because of the size of this, we decided not to move forward during the hack day. Work on this is checked into the twostage branch, which is expected to hang around a while.

Anniepoo · 2019-05-12T19:42:01Z

A big issue that came up is that this is an offline solution: by the time later stages run, you don't have the original form and don't have file location information. The parser will need location info when it prints error messages.
This is a thorny issue, and may kill this idea.

Anniepoo · 2019-05-13T00:28:13Z

This turns out to be what the current solution does, sorta. It's offline as well.

This was referenced May 12, 2019

Add multifle predicate token_by_regex/2 #22

Open

Design framework for user-entensible tokenization rules #21

Closed

shonfeder assigned Anniepoo May 12, 2019

shonfeder mentioned this issue May 12, 2019

Replace adhoc option parsing with use of library(option) #13

Closed

Anniepoo added the "on hold" (Work shouldn't begin) label May 13, 2019
