
Conversation

andreubotella commented May 28, 2020

This change renames the Encoding-specific concept of "streams", which had been causing confusion with readable/writable streams for years, to "I/O queues". It also exports the corresponding definitions.

Closes #180.



Andreu Botella Botella added 2 commits May 28, 2020 20:31
This change renames the Encoding-specific concept of "streams", which
had been causing confusion with readable/writable streams for years, to
"token queues". It also exports the corresponding definitions.

Closes #180.

annevk commented May 29, 2020

I realize the timing of this is not great, but looking at this I wonder if this should be (a subtype of) https://infra.spec.whatwg.org/#queues. The main novelty is returning end-of-stream (which we should rename to final-item or final-token, I think) when the list is empty. And even that seems handled in a way as Infra returns nothing which is something we could branch on.

At that point all that remains is mapping strings/byte sequences to lists which is something we should allow implicitly anyway I think so "for each" and such can be used on them (although for strings we might need an explicit variant for code points; if you want neither code units nor scalar values).

(What it would continue to hide/neglect, which may or may not be bad, is some kind of waiting signal to indicate the difference between the end and I/O being slow.)

cc @domenic

andreubotella commented May 29, 2020

I realize the timing of this is not great, but looking at this I wonder if this should be (a subtype of) https://infra.spec.whatwg.org/#queues. The main novelty is returning end-of-stream (which we should rename to final-item or final-token, I think) when the list is empty. And even that seems handled in a way as Infra returns nothing which is something we could branch on.

I don't think token queues are a subtype of Infra queues, since prepend is a thing. Also, since read would need to behave differently from dequeue, and you can push multiple tokens at a time (which you can't do with enqueue), there would be no real benefit in depending on queue. Making token queues a subtype of list, rather than an "ordered sequence", would work, though.

By the way, should we even export prepend? The "implementation considerations" appendix lists alternatives to implementing prepend which work for the encoding algorithms in the spec, but which wouldn't work if other specs were allowed to use prepend arbitrarily.
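For readers following along, here is a rough Python sketch of the deque-like structure under discussion: push accepts multiple tokens, prepend puts tokens back at the front, and read yields an end-of-stream sentinel instead of Infra's "nothing" when the queue is empty. Names and details are illustrative, not the spec's normative definitions.

```python
from collections import deque

END_OF_STREAM = object()  # stand-in for the spec's end-of-stream item


class IOQueue:
    """Illustrative sketch only; not the Encoding standard's definition."""

    def __init__(self, items=()):
        self._items = deque(items)

    def push(self, *items):
        # Unlike Infra's enqueue, several tokens can be pushed at once.
        self._items.extend(items)

    def prepend(self, *items):
        # prepend is what makes this a deque rather than an Infra queue.
        self._items.extendleft(reversed(items))

    def read(self):
        # Unlike Infra's dequeue, which returns "nothing" for an empty queue,
        # reading an exhausted queue yields the end-of-stream sentinel.
        if not self._items:
            return END_OF_STREAM
        return self._items.popleft()
```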

At that point all that remains is mapping strings/byte sequences to lists which is something we should allow implicitly anyway I think so "for each" and such can be used on them (although for strings we might need an explicit variant for code points; if you want neither code units nor scalar values).

While I don't oppose defining strings and byte sequences as lists, I don't see how token queues would benefit from being able to iterate through them without dequeuing tokens, which is what is usually intended.

In any case, the fact that token queues can be implicitly converted to and from strings/byte sequences should be specified. This brings me to wonder whether the conversion into a string/byte sequence should indeed empty the queue, since if the token queue is backed by I/O it would have to block either way. If that is the case, then the BOM sniff hook would have to switch to read and prepend rather than use "starts with".
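To make the "read and prepend" idea concrete, here is a hedged sketch building on the IOQueue sketch above. It only checks for the UTF-8 byte order mark and is not the spec's BOM sniff algorithm; the helper name is mine.

```python
UTF8_BOM = [0xEF, 0xBB, 0xBF]


def sniff_utf8_bom(io_queue):
    # Read up to three bytes; if they form a UTF-8 BOM, leave them consumed,
    # otherwise prepend them back so the decoder still sees them.
    head = []
    for _ in range(3):
        byte = io_queue.read()
        if byte is END_OF_STREAM:
            break
        head.append(byte)
    if head == UTF8_BOM:
        return "UTF-8"
    io_queue.prepend(*head)  # no match: restore the bytes unread
    return None
```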

(What it would continue to hide/neglect, which may or may not be bad, is some kind of waiting signal to indicate the difference between the end and I/O being slow.)

Token queues are defined as simple list-like data structures, not dependent on I/O, which implies that a straightforward implementation would have to read a byte stream from the network in its entirety before passing it to one of the decode hooks. The only affordance for I/O in actual implementations is BOM sniff's (formerly decode's) "wait for three bytes or until the end-of-stream". So doing something like that would require changing token queues to optionally be backed by I/O, which would need a separate PR.


Btw, I changed "byte streams" to "byte token queues" but "scalar value streams" to "token queues of scalar values" because "scalar value token queues" sounded weird to me. But maybe that's my native Spanish shining through.

annevk commented Jun 3, 2020

Thanks. Given your feedback it seems somewhat tempting to try to remove the need for prepend and add the ability to enqueue multiple items to queues. If we go with a list subtype I think we need a different name from queues. Maybe consumable list or some such?

If that is the case, then the BOM sniff hook would have to switch to read and prepend rather than use "starts with".

Yeah, I think that would be better. Even if we allow conversions without explicit casting it seems best for these algorithms to operate as if they are getting an I/O stream.

@andreubotella

As I mentioned above, it bugged me a bit that token queues were defined as simple data structures but were in practice used for I/O. I've now come up with a relatively simple way to solve this problem, and so I've opened #221 with that proposal. Please let me know what you think.

annevk commented Jul 6, 2020

I think that's reasonable, but I do think we need a different name from queue as it's too confusing with the Infra primitives.

andreubotella commented Jul 11, 2020

I think that's reasonable, but I do think we need a different name from queue as it's too confusing with the Infra primitives.

So "streams" as defined here are properly a deque, for which we don't have any Infra primitives. It'd be okay by me to define it as a list, though IMO "constructable list" sounds too generic – I'd prefer something specific to tokens or encoding.

annevk commented Jul 13, 2020

Hmm, so:

  1. I think per our IRC conversation it's okay for this to be fully blocking and synchronous. If you need things to be non-blocking and asynchronous, use "in parallel".
  2. We're likely not going to rewrite the Encoding standard to not need "prepend" so we need something like list.
  3. As currently defined in the Encoding standard, "token" is just another word for Infra's "item". A token being a byte or a scalar value are just examples; the definition is not exhaustive.

I think rewriting everything in terms of lists would probably add clarity. I guess the one change we could make is that EOF should explicitly be pushed into the list so we can "wait" for the list to change when it's empty.

We might also want to define some shorthands for byte list and scalar value list or some such that clarify the types involved.

andreubotella commented Jul 14, 2020

I think rewriting everything in terms of lists would probably add clarity. I guess the one change we could make is that EOF should be explicitly be pushed into the list so we can "wait" for the list to change when it's empty.

@annevk So if I'm understanding this right, reading would mean waiting until there is at least one item in the list and, if that item is EOF, returning it without popping it off the list.

Also, we should probably add a "new consumable list" operation which populates the list with an EOF, and a note for users of the spec and implementers on how to create I/O-backed lists.
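A minimal Python sketch of the semantics described above, assuming an explicit end-of-queue item and a blocking read; the class and its details are illustrative, not spec text.

```python
import threading
from collections import deque

END_OF_QUEUE = object()  # the explicit EOF item discussed above


class ConsumableList:
    """Illustrative only: read blocks until at least one item is present; if
    the front item is the EOF, it is returned without being removed."""

    def __init__(self, items=(), finished=True):
        self._items = deque(items)
        if finished:
            # The suggested "new consumable list" operation: a freshly created
            # (non-I/O-backed) list is populated with the EOF item.
            self._items.append(END_OF_QUEUE)
        self._cond = threading.Condition()

    def push(self, *items):
        with self._cond:
            self._items.extend(items)
            self._cond.notify_all()

    def read(self):
        with self._cond:
            while not self._items:       # only an I/O-backed list can be empty
                self._cond.wait()
            if self._items[0] is END_OF_QUEUE:
                return END_OF_QUEUE      # left in place so later reads see it too
            return self._items.popleft()
```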

  1. I think per our IRC conversation it's okay for this to be fully blocking and synchronous. If you need things to be non-blocking and asynchronous, use "in parallel".

Come to think of it, the encoding APIs can't be made async for web compat, which precludes running them in parallel, and they aren't allowed to block because they must be callable from window agents. The way the spec text is right now, the blocking is hidden behind the scenes, but the APIs don't run into this issue because none of the lists used in those algorithms are I/O-backed. So we should not make "read" always potentially blocking, but instead have it depend on whether the list is I/O-backed -- or alternatively, on whether the list was created with an EOF.
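Building on the ConsumableList sketch above, the distinction could look like this (illustrative only):

```python
# A list created from a finished string/byte sequence never blocks:
q = ConsumableList([0x68, 0x69])             # finished=True appends END_OF_QUEUE
assert q.read() == 0x68
assert q.read() == 0x69
assert q.read() is END_OF_QUEUE              # returned, but not removed

# An I/O-backed list would be created without the EOF; reading it may block
# until whatever fills it (e.g. a task reading from the network) pushes more
# items or finally pushes END_OF_QUEUE.
io_backed = ConsumableList(finished=False)
io_backed.push(0x68)
assert io_backed.read() == 0x68              # a further read here would block
```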

Edit: Actually, I don't think we need a "wait" operation, since as we discussed previously, converting a consumable list into a string or byte sequence would consume it in its entirety. We should instead have a peek method for the encoding and MIME sniffing algorithms to use.

annevk commented Jul 15, 2020

Either that or you ensure the API callers always supply a list that contains EOF so wait won't be used, right?

I'm not sure I understand not needing wait.

@andreubotella

I'm not sure I understand not needing wait.

Wait is only used so that the encoding and MIME sniffing algorithms can work on a delimited prefix of the consumable list as if it were a byte sequence. But IIRC we discussed previously that the conversions from a consumable list into a string or byte sequence (implicit until this PR) would consume the entire list and block until an EOF was found, making "wait" useless. Instead, we should have a peek operation that would take a length, block until the list has that many items or an EOF, and return the prefix of the consumable list as a string or byte sequence.
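Continuing the ConsumableList sketch from earlier, a possible shape for that peek operation (illustrative, not spec text):

```python
class PeekableList(ConsumableList):
    """Extends the earlier ConsumableList sketch with the suggested peek."""

    def peek(self, n):
        # Block until the list holds n items or an EOF, then return the
        # available prefix without consuming anything.
        with self._cond:
            while len(self._items) < n and END_OF_QUEUE not in self._items:
                self._cond.wait()
            prefix = []
            for item in self._items:
                if item is END_OF_QUEUE or len(prefix) == n:
                    break
                prefix.append(item)
            return prefix
```

A BOM sniff, for instance, could call peek(3) and compare the result against the possible byte order marks without disturbing the queue.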

But if strings and byte sequences are going to be defined as subtypes of list at some point in the future, maybe it'd be best to scrap those conversions and continue using consumable lists as strings and byte sequences.

@andreubotella

Per an IRC discussion, I'm renaming streams / token queues to I/O queues.

With regard to I/O queue being (per the conversations earlier in this thread) a subtype of list, @domenic pointed to HTML's task queue, which is defined as a subtype of set. I additionally pointed out that the "prepend" algorithm (the only thing that keeps I/O queue from being a subtype of queue) is an internal implementation detail of the spec, as evidenced by the "Implementation considerations" section and by the decision earlier in this thread that we should not export that operation.

andreubotella changed the title Rename Encoding's "streams" to "token queues" → Rename Encoding's "streams" to "I/O queues" on Jul 15, 2020

annevk left a comment

The concepts look pretty good to me. I have mostly nits, some of which I'm happy to address myself once you're satisfied.

encoding.bs Outdated

<p>A <dfn id=concept-token>token</dfn> is a piece of data, such as a <a>byte</a> or
<a>scalar value</a>.
<p>An <dfn id=concept-stream export>I/O queue</dfn> is a type of <a>list</a>

Can we wrap new and changed text at 100 columns please? As it's editorial I'm also willing to do this as a final pass.

annevk commented Aug 31, 2020

Anyone any final thoughts on this? @domenic I think this requires your PR to be rebased. I can help with that.

If there's nothing further I'll merge this later this week.

andreubotella commented Sep 1, 2020

Anyone any final thoughts on this?

While working on whatwg/html#5874, I ran into the fact that the HTML parser runs on the event loop and so isn't supposed to block. While the tokenizer's "consume the next input character" could be formalized into not blocking by only reading after checking that the input stream isn't empty, the call to decode would still block until the end of the response body.

I previously discussed the problem of blocking in the HTML parser with @annevk, and they suggested changing the different parser stages to run in parallel. But for decode and the rest of the hooks, that wouldn't work, since the output queue is returned from the hooks, as well as being an immediate queue. I guess we could fix this without breaking other usages of the hooks by taking the output stream as an optional parameter and pushing end-of-queue after the run algorithm ends.
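A hypothetical sketch of that workaround, reusing the ConsumableList sketch from earlier in the thread; the function shape and names are illustrative, not the spec's actual decode hook.

```python
def decode_hook(io_queue, output=None):
    # If the caller (e.g. the HTML parser) supplies an output queue up front,
    # it can start draining it while this hook fills it in parallel; otherwise
    # a fresh queue is created and returned once decoding finishes.
    if output is None:
        output = ConsumableList(finished=False)
    while True:
        item = io_queue.read()
        if item is END_OF_QUEUE:
            break
        output.push(item)          # a real decoder would transform the item here
    output.push(END_OF_QUEUE)      # end-of-queue pushed once the run loop ends
    return output
```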

andreubotella commented Sep 4, 2020

The message of the original commit in this PR no longer reflects the full scope of the changes. Instead, the PR should be squashed and merged with the following message:

Rename Encoding's "streams" to "I/O queues"

This change renames the Encoding-specific concept of "streams", which
had been causing confusion with readable/writable streams for years, to
"I/O queues". It also refactors the I/O queue operations and exports the
corresponding definitions.

As part of this refactoring, "end-of-queue" (formerly "end-of-stream")
becomes an optional item in the queue, indicating that the end of the
streaming data has been reached and that no more data is expected. As a
result, the "read" operation explicitly blocks when trying to read from
an empty queue – a behavior that was previously left unstated.

Closes #180.

annevk merged commit 46711e6 into whatwg:master on Sep 8, 2020
andreubotella deleted the token-queues branch on September 8, 2020 at 11:03
@triple-underscore

The output argument “I/O queue of scalar values” in the following algorithm descriptions seems mistyped; it should be “I/O queue of bytes” (before this commit, that argument was introduced as a byte stream variable in the encode algorithm):

To encode an I/O queue of scalar values ioQueue given an encoding encoding and an optional I/O queue of scalar values output (default « »), run these steps:

To UTF-8 encode an I/O queue of scalar values ioQueue given an optional I/O queue of scalar values output (default « »), return the result of encoding ioQueue with encoding UTF-8 and output.

@andreubotella

The output argument “I/O queue of scalar values” in the following algorithm descriptions seems mistyped; it should be “I/O queue of bytes” (before this commit, that argument was introduced as a byte stream variable in the encode algorithm):

Damn copy-paste 😄

annevk pushed a commit that referenced this pull request Sep 21, 2020
An "output" parameter was added to the hooks for standards in #215, but no explanation was given as to why it was needed. This change adds that clarification.
andreubotella pushed a commit to andreubotella/html that referenced this pull request Oct 2, 2020
andreubotella pushed a commit to andreubotella/html that referenced this pull request Oct 2, 2020
andreubotella pushed a commit that referenced this pull request Jun 21, 2021
At various times in the ISO-2022-JP decoder and encoder, the prepend
operation was being called with an end-of-queue item, even though that
was made illegal in #215. This change removes those instances, in order
to keep the same behavior as before #215.
annevk pushed a commit that referenced this pull request Jun 21, 2021
At various times in the ISO-2022-JP decoder and encoder, the prepend operation was being called with an end-of-queue item, even though that was made illegal in #215. This change removes those instances, in order to keep the same behavior as before #215.