
Conversation

andreubotella commented May 28, 2020

This change renames the Encoding-specific concept of "streams", which had been causing confusion with readable/writable streams for years, to "I/O queues". It also exports the corresponding definitions.

Closes #180.



Andreu Botella Botella added 2 commits May 28, 2020 20:31
This change renames the Encoding-specific concept of "streams", which
had been causing confusion with readable/writable streams for years, to
"token queues". It also exports the corresponding definitions.

Closes #180.

annevk commented May 29, 2020

I realize the timing of this is not great, but looking at this I wonder if this should be (a subtype of) https://infra.spec.whatwg.org/#queues. The main novelty is returning end-of-stream (which we should rename to final-item or final-token, I think) when the list is empty. And even that seems handled in a way as Infra returns nothing which is something we could branch on.

At that point all that remains is mapping strings/byte sequences to lists which is something we should allow implicitly anyway I think so "for each" and such can be used on them (although for strings we might need an explicit variant for code points; if you want neither code units nor scalar values).

(What it would continue to hide/neglect, which may or may not be bad, is some kind of waiting signal to indicate the difference between the end and I/O being slow.)

cc @domenic

andreubotella commented May 29, 2020

I realize the timing of this is not great, but looking at this I wonder if this should be (a subtype of) https://infra.spec.whatwg.org/#queues. The main novelty is returning end-of-stream (which we should rename to final-item or final-token, I think) when the list is empty. And even that seems handled in a way as Infra returns nothing which is something we could branch on.

I don't think token queues are a subtype of Infra queues, since prepend is a thing. Also, since read would need to behave differently from dequeue, and you can push multiple tokens at a time (which you can't do with enqueue), there would be no real benefit in depending on queue. Making token queues a subtype of list, rather than an "ordered sequence", would work, though.

By the way, should we even export prepend? The "implementation considerations" appendix lists alternatives to implementing prepend which work for the encoding algorithms in the spec, but which wouldn't work if other specs were allowed to use prepend arbitrarily.
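For readers following along, here is a rough Python sketch of the deque-like structure under discussion: push accepts multiple tokens, prepend puts tokens back at the front, and read yields an end-of-stream sentinel instead of Infra's "nothing" when the queue is empty. Names and details are illustrative, not the spec's normative definitions.

```python
from collections import deque

END_OF_STREAM = object()  # stand-in for the spec's end-of-stream item


class IOQueue:
    """Illustrative sketch only; not the Encoding standard's definition."""

    def __init__(self, items=()):
        self._items = deque(items)

    def push(self, *items):
        # Unlike Infra's enqueue, several tokens can be pushed at once.
        self._items.extend(items)

    def prepend(self, *items):
        # prepend is what makes this a deque rather than an Infra queue.
        self._items.extendleft(reversed(items))

    def read(self):
        # Unlike Infra's dequeue, which returns "nothing" for an empty queue,
        # reading an exhausted queue yields the end-of-stream sentinel.
        if not self._items:
            return END_OF_STREAM
        return self._items.popleft()
```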

At that point all that remains is mapping strings/byte sequences to lists which is something we should allow implicitly anyway I think so "for each" and such can be used on them (although for strings we might need an explicit variant for code points; if you want neither code units nor scalar values).

While I don't oppose defining strings and byte sequences as lists, I don't see how token queues would benefit from being able to iterate through them without dequeuing tokens, which is what is usually intended.

In any case, the fact that token queues can be implicitly converted to and from strings/byte sequences should be specified. This brings me to wonder whether the conversion into a string/byte sequence should indeed empty the queue, since if the token queue is backed by I/O it would have to block either way. If that is the case, then the BOM sniff hook would have to switch to read and prepend rather than use "starts with".
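To make the "read and prepend" idea concrete, here is a hedged sketch building on the IOQueue sketch above. It only checks for the UTF-8 byte order mark and is not the spec's BOM sniff algorithm; the helper name is mine.

```python
UTF8_BOM = [0xEF, 0xBB, 0xBF]


def sniff_utf8_bom(io_queue):
    # Read up to three bytes; if they form a UTF-8 BOM, leave them consumed,
    # otherwise prepend them back so the decoder still sees them.
    head = []
    for _ in range(3):
        byte = io_queue.read()
        if byte is END_OF_STREAM:
            break
        head.append(byte)
    if head == UTF8_BOM:
        return "UTF-8"
    io_queue.prepend(*head)  # no match: restore the bytes unread
    return None
```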

(What it would continue to hide/neglect, which may or may not be bad, is some kind of waiting signal to indicate the difference between the end and I/O being slow.)

Token queues are defined as simple list-like data structures, not dependent on I/O, which implies that a straightforward implementation would have to read a byte stream from the network in its entirety before passing it to one of the decode hooks. The only affordance for I/O in actual implementations is BOM sniff's (formerly decode's) "wait for three bytes or until the end-of-stream". So doing something like that would require changing token queues to optionally be backed by I/O, which would need a separate PR.


Btw, I changed "byte streams" to "byte token queues" but "scalar value streams" to "token queues of scalar values" because "scalar value token queues" sounded weird to me. But maybe that's my native Spanish shining through.

annevk commented Jun 3, 2020

Thanks. Given your feedback it seems somewhat tempting to try to remove the need for prepend and add the ability to enqueue multiple items to queues. If we go with a list subtype I think we need a different name from queues. Maybe consumable list or some such?

If that is the case, then the BOM sniff hook would have to switch to read and prepend rather than use "starts with".

Yeah, I think that would be better. Even if we allow conversions without explicit casting it seems best for these algorithms to operate as if they are getting an I/O stream.

@andreubotella

As I mentioned above, it bugged me a bit that token queues were defined as simple data structures but were in practice used for I/O. I've now come up with a relatively simple way to solve this problem, and so I've opened #221 with that proposal. Please let me know what you think.

annevk commented Jul 6, 2020

I think that's reasonable, but I do think we need a different name from queue as it's too confusing with the Infra primitives.

andreubotella commented Jul 11, 2020

I think that's reasonable, but I do think we need a different name from queue as it's too confusing with the Infra primitives.

So "streams" as defined here are properly a deque, for which we don't have any Infra primitives. It'd be okay by me to define it as a list, though IMO "constructable list" sounds too generic – I'd prefer something specific to tokens or encoding.

annevk commented Jul 13, 2020

Hmm, so:

  1. I think per our IRC conversation it's okay for this to be fully blocking and synchronous. If you need things to be non-blocking and asynchronous, use "in parallel".
  2. We're likely not going to rewrite the Encoding standard to not need "prepend" so we need something like list.
  3. As currently defined in the Encoding standard, "token" is just another word for Infra's "item". A token being a byte or a scalar value are just examples; the definition is not exhaustive.

I think rewriting everything in terms of lists would probably add clarity. I guess the one change we could make is that EOF should explicitly be pushed into the list so we can "wait" for the list to change when it's empty.

We might also want to define some shorthands for byte list and scalar value list or some such that clarify the types involved.

andreubotella commented Jul 14, 2020

I think rewriting everything in terms of lists would probably add clarity. I guess the one change we could make is that EOF should be explicitly be pushed into the list so we can "wait" for the list to change when it's empty.

@annevk So if I'm understanding this right, reading would mean waiting until there is at least one item in the list and, if that item is EOF, returning it without popping it off the list.

Also, we should probably add a "new consumable list" operation which populates the list with an EOF, and a note for users of the spec and implementers on how to create I/O-backed lists.
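A minimal Python sketch of the semantics described above, assuming an explicit end-of-queue item and a blocking read; the class and its details are illustrative, not spec text.

```python
import threading
from collections import deque

END_OF_QUEUE = object()  # the explicit EOF item discussed above


class ConsumableList:
    """Illustrative only: read blocks until at least one item is present; if
    the front item is the EOF, it is returned without being removed."""

    def __init__(self, items=(), finished=True):
        self._items = deque(items)
        if finished:
            # The suggested "new consumable list" operation: a freshly created
            # (non-I/O-backed) list is populated with the EOF item.
            self._items.append(END_OF_QUEUE)
        self._cond = threading.Condition()

    def push(self, *items):
        with self._cond:
            self._items.extend(items)
            self._cond.notify_all()

    def read(self):
        with self._cond:
            while not self._items:       # only an I/O-backed list can be empty
                self._cond.wait()
            if self._items[0] is END_OF_QUEUE:
                return END_OF_QUEUE      # left in place so later reads see it too
            return self._items.popleft()
```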

  1. I think per our IRC conversation it's okay for this to be fully blocking and synchronous. If you need things to be non-blocking and asynchronous, use "in parallel".

Come to think of it, the encoding APIs can't be made async for web compat, which precludes running them in parallel, and they aren't allowed to block because they must be callable from window agents. The way the spec text is right now, the blocking is hidden behind the scenes, but the APIs don't run into this issue because none of the lists used in those algorithms are I/O-backed. So we should not make "read" always potentially blocking, but instead have it depend on whether the list is I/O-backed -- or alternatively, on whether the list was created with an EOF.
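Building on the ConsumableList sketch above, the distinction could look like this (illustrative only):

```python
# A list created from a finished string/byte sequence never blocks:
q = ConsumableList([0x68, 0x69])             # finished=True appends END_OF_QUEUE
assert q.read() == 0x68
assert q.read() == 0x69
assert q.read() is END_OF_QUEUE              # returned, but not removed

# An I/O-backed list would be created without the EOF; reading it may block
# until whatever fills it (e.g. a task reading from the network) pushes more
# items or finally pushes END_OF_QUEUE.
io_backed = ConsumableList(finished=False)
io_backed.push(0x68)
assert io_backed.read() == 0x68              # a further read here would block
```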

Edit: Actually, I don't think we need a "wait" operation, since as we discussed previously, converting a consumable list into a string or byte sequence would consume it in its entirety. We should instead have a peek method for the encoding and MIME sniffing algorithms to use.

annevk commented Jul 15, 2020

Either that or you ensure the API callers always supply a list that contains EOF so wait won't be used, right?

I'm not sure I understand not needing wait.

@andreubotella

I'm not sure I understand not needing wait.

Wait is only used so that the encoding and MIME sniffing algorithms can work on a delimited prefix of the consumable list as if it were a byte sequence. But IIRC we discussed previously that the conversions from a consumable list into a string or byte sequence (implicit until this PR) would consume the entire list and block until an EOF was found, making "wait" useless. Instead, we should have a peek operation that would take a length, block until the list has that many items or an EOF, and return the prefix of the consumable list as a string or byte sequence.
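Continuing the ConsumableList sketch from earlier, a possible shape for that peek operation (illustrative, not spec text):

```python
class PeekableList(ConsumableList):
    """Extends the earlier ConsumableList sketch with the suggested peek."""

    def peek(self, n):
        # Block until the list holds n items or an EOF, then return the
        # available prefix without consuming anything.
        with self._cond:
            while len(self._items) < n and END_OF_QUEUE not in self._items:
                self._cond.wait()
            prefix = []
            for item in self._items:
                if item is END_OF_QUEUE or len(prefix) == n:
                    break
                prefix.append(item)
            return prefix
```

A BOM sniff, for instance, could call peek(3) and compare the result against the possible byte order marks without disturbing the queue.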

But if strings and byte sequences are going to be defined as subtypes of list at some point in the future, maybe it'd be best to scrap those conversions and continue using consumable lists as strings and byte sequences.

@andreubotella

Per an IRC discussion, I'm renaming streams / token queues to I/O queues.

With regard to I/O queue being (per the conversations earlier in this thread) a subtype of list, @domenic pointed to HTML's task queue, which is defined as a subtype of set. I additionally pointed out that the "prepend" algorithm (the only thing that keeps I/O queue from being a subtype of queue) is an internal implementation detail of the spec, as evidenced by the "Implementation considerations" section and by the decision earlier in this thread that we should not export that operation.

andreubotella changed the title Rename Encoding's "streams" to "token queues" → Rename Encoding's "streams" to "I/O queues" on Jul 15, 2020

annevk left a comment

The concepts look pretty good to me. I have mostly nits, some of which I'm happy to address myself once you're satisfied.

encoding.bs Outdated

<p>A <dfn id=concept-token>token</dfn> is a piece of data, such as a <a>byte</a> or
<a>scalar value</a>.
<p>An <dfn id=concept-stream export>I/O queue</dfn> is a type of <a>list</a>

Can we wrap new and changed text at 100 columns please? As it's editorial I'm also willing to do this as a final pass.

annevk commented Aug 31, 2020

Anyone any final thoughts on this? @domenic I think this requires your PR to be rebased. I can help with that.

If there's nothing further I'll merge this later this week.

andreubotella commented Sep 1, 2020

Anyone any final thoughts on this?

While working on whatwg/html#5874, I ran into the fact that the HTML parser runs on the event loop and so isn't supposed to block. While the tokenizer's "consume the next input character" could be formalized into not blocking by only reading after checking that the input stream isn't empty, the call to decode would still block until the end of the response body.

I previously discussed the problem of blocking in the HTML parser with @annevk, and they suggested changing the different parser stages to run in parallel. But for decode and the rest of the hooks, that wouldn't work, since the output queue is returned from the hooks, as well as being an immediate queue. I guess we could fix this without breaking other usages of the hooks by taking the output stream as an optional parameter and pushing end-of-queue after the run algorithm ends.
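A hypothetical sketch of that workaround, reusing the ConsumableList sketch from earlier in the thread; the function shape and names are illustrative, not the spec's actual decode hook.

```python
def decode_hook(io_queue, output=None):
    # If the caller (e.g. the HTML parser) supplies an output queue up front,
    # it can start draining it while this hook fills it in parallel; otherwise
    # a fresh queue is created and returned once decoding finishes.
    if output is None:
        output = ConsumableList(finished=False)
    while True:
        item = io_queue.read()
        if item is END_OF_QUEUE:
            break
        output.push(item)          # a real decoder would transform the item here
    output.push(END_OF_QUEUE)      # end-of-queue pushed once the run loop ends
    return output
```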

andreubotella commented Sep 4, 2020

The message of the original commit in this PR no longer reflects the full scope of the changes. Instead, the PR should be squashed and merged with the following message:

Rename Encoding's "streams" to "I/O queues"

This change renames the Encoding-specific concept of "streams", which
had been causing confusion with readable/writable streams for years, to
"I/O queues". It also refactors the I/O queue operations and exports the
corresponding definitions.

As part of this refactoring, "end-of-queue" (formerly "end-of-stream")
becomes an optional item in the queue, indicating that the end of the
streaming data has been reached and that no more data is expected. As a
result, the "read" operation explicitly blocks when trying to read from
an empty queue – a behavior that was previously left unstated.

Closes #180.

annevk merged commit 46711e6 into whatwg:master on Sep 8, 2020
andreubotella deleted the token-queues branch on September 8, 2020 at 11:03
@triple-underscore

The output argument “I/O queue of scalar values” in the following algorithm descriptions seems mistyped; it should be “I/O queue of bytes” (before this commit, that argument was introduced as a byte stream variable in the encode algorithm):

To encode an I/O queue of scalar values ioQueue given an encoding encoding and an optional I/O queue of scalar values output (default « »), run these steps:

To UTF-8 encode an I/O queue of scalar values ioQueue given an optional I/O queue of scalar values output (default « »), return the result of encoding ioQueue with encoding UTF-8 and output.

@andreubotella

The output argument “I/O queue of scalar values” in the following algorithm descriptions seems mistyped; it should be “I/O queue of bytes” (before this commit, that argument was introduced as a byte stream variable in the encode algorithm):

Damn copy-paste 😄

annevk pushed a commit that referenced this pull request Sep 21, 2020
An "output" parameter was added to the hooks for standards in #215, but no explanation was given as to why it was needed. This change adds that clarification.
andreubotella pushed a commit to andreubotella/html that referenced this pull request Oct 2, 2020
andreubotella pushed a commit to andreubotella/html that referenced this pull request Oct 2, 2020
andreubotella pushed a commit that referenced this pull request Jun 21, 2021
At various times in the ISO-2022-JP decoder and encoder, the prepend
operation was being called with an end-of-queue item, even though that
was made illegal in #215. This change removes those instances, in order
to keep the same behavior as before #215.
annevk pushed a commit that referenced this pull request Jun 21, 2021
At various times in the ISO-2022-JP decoder and encoder, the prepend operation was being called with an end-of-queue item, even though that was made illegal in #215. This change removes those instances, in order to keep the same behavior as before #215.