-
Notifications
You must be signed in to change notification settings - Fork 85
Rename Encoding's "streams" to "I/O queues" #215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This change renames the Encoding-specific concept of "streams", which had been causing confusion with readable/writable streams for years, to "token queues". It also exports the corresponding definitions. Closes #180.
I realize the timing of this is not great, but looking at this I wonder if this should be (a subtype of) https://infra.spec.whatwg.org/#queues. The main novelty is returning end-of-stream (which we should rename to final-item or final-token, I think) when the list is empty. And even that seems handled in a way as Infra returns nothing which is something we could branch on. At that point all that remains is mapping strings/byte sequences to lists which is something we should allow implicitly anyway I think so "for each" and such can be used on them (although for strings we might need an explicit variant for code points; if you want neither code units nor scalar values). (What it would continue to hide/neglect, which may or may not be bad, is some kind of waiting signal to indicate the difference between the end and I/O being slow.) cc @domenic |
I don't think token queues are a subset of Infra queues, since prepend is a thing. Also, since read would need changes from dequeue, and you can push multiple tokens at a time which you can't do with enqueue, no real benefit would come from depending on queue. Making token queues a subset of list, rather than them being an "ordered sequence", would work though. By the way, should we even export prepend? The "implementation considerations" appendix lists alternatives to implementing prepend which work for the encoding algorithms in the spec, but wouldn't if other specs are allowed to use prepend arbitrarily.
While I don't oppose defining strings and byte sequences as lists, I don't see how token streams would benefit from being able to iterate through them without dequeuing tokens, which is what is usually intended. In any case, the fact that token queues can be implicitly converted to and from strings/byte sequences should be specified. Which brings me to wondering whether the conversion into a string/byte sequence should indeed empty the queue, since if the token queue is backed by I/O it'd have to block either way. If that is the case, then the BOM sniff hook would have to switch to read and prepend rather than use "starts with".
Token queues are defined as simple list-like data structures, not dependent on I/O, which implies that a straightforward implementation would have to read a byte stream from the network in its entirety before passing it to one of the decode hooks. The only affordance for I/O in actual implementations is BOM sniff's (formerly decode's) "wait for three bytes or until the end-of-stream". So doing something like that would require changing token queues to optionally be backed by I/O, which would need a separate PR. Btw, I changed "byte streams" to "byte token queues" but "scalar value streams" to "token queues of scalar values" because "scalar value token queues" sounded weird to me. But maybe that's my native Spanish shining through. |
Thanks. Given your feedback it seems somewhat tempting to try to remove the need for prepend and add the ability to enqueue multiple items to queues. If we go with a list subtype I think we need a different name from queues. Maybe consumable list or some such?
Yeah, I think that would be better. Even if we allow conversions without explicit casting it seems best for these algorithms to operate as if they are getting an I/O stream. |
…nces. Also refactors the BOM sniff algorithm to not use conversions.
As I mentioned above, it bugged me a bit the fact that token queues were defined as simple data structures but were in practice used for I/O. I've now come up with a relatively simple way to solve this problem, and so I've opened #221 with that proposal. Please let me know what you think. |
I think that's reasonable, but I do think we need a different name from queue as it's too confusing with the Infra primitives. |
So "streams" as defined here are properly a deque, for which we don't have any Infra primitives. It'd be okay by me to define it as a list, though IMO "constructable list" sounds too generic – I'd prefer something specific to tokens or encoding. |
Hmm, so:
I think rewriting everything in terms of lists would probably add clarity. I guess the one change we could make is that EOF should be explicitly be pushed into the list so we can "wait" for the list to change when it's empty. We might also want to define some shorthands for byte list and scalar value list or some such that clarify the types involved. |
@annevk So if I'm understanding this right, reading would mean waiting until there is at least one item in the list; and if that item is EOF, return an EOF without popping that one off the list. Also, we should probably add a "new consumable list" operation which populates it with the EOF, and a note for users of the spec and implementers on how to create IO-backed lists.
Come to think about it, the encoding APIs can't be made async for web compat, which precludes running them in parallel, and aren't allowed to block because they must be callable from window agents. The way the spec text is right now, the blocking is hidden behind the scenes, but the APIs don't run into this issue because none of the lists used in those algorithms are IO-backed. So we should not make "read" always potentially blocking, but instead have it depend on whether the list is IO-backed -- or alternative, on whether the list was created with an EOF. Edit: Actually, I don't think we need a "wait" operation, since as we discussed previously, converting a consumable list into a string or byte sequence would consume it in its entirety. We should instead have a peek method for the encoding and MIME sniffing algorithms to use. |
Either that or you ensure the API callers always supply a list that contains EOF so wait won't be used, right? I'm not sure I understand not needing wait. |
Wait is only used so that the encoding and MIME sniffing algorithms can work on a delimited prefix of the consumable list as if it were a byte sequence. But IIRC we discussed previously that the conversions from a consumable list into a string or byte sequence (implicit until this PR) would consume the entire list and block until an EOF was found, making "wait" useless. Instead, we should have a peek operation that would take a length, block until the list has that many items or an EOF, and return the prefix of the consumable list as a string or byte sequence. But if strings and byte sequences are going to be defined as subtypes of list at some point in the future, maybe it'd be best to scrap those conversions and continue using consumable lists as strings and byte sequences. |
Per an IRC discussion, I'm renaming streams / token queues to I/O queues. With regard to I/O queue being (by the conversations earlier in this thread) a subtype of list, @domenic pointed to HTML's task queue, which is defined as a subtype of set. I additionally pointed that the "prepend" algorithm (the only thing that makes I/O queue not a subtype of queue) is an internal implementation detail of the spec, as evidenced by the "Implementation considerations" section and by the decision earlier in this thread that we should not export that operation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concepts look pretty good to me. I have mostly nits, some I'm happy to address myself once you're satisfied.
encoding.bs
Outdated
|
||
<p>A <dfn id=concept-token>token</dfn> is a piece of data, such as a <a>byte</a> or | ||
<a>scalar value</a>. | ||
<p>An <dfn id=concept-stream export>I/O queue</dfn> is a type of <a>list</a> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we wrap new and changed text at 100 columns please? As it's editorial I'm also willing to do this as a final pass.
Anyone any final thoughts on this? @domenic I think this requires your PR to be rebased. I can help with that. If there's nothing further I'll merge this later this week. |
While working on whatwg/html#5874, I run into the fact that the HTML parser runs on the event loop and so isn't supposed to block. While the tokenizer's "consume the next input character" could be formalized into not blocking by only reading after checking that the input stream isn't empty, the call to decode would still block until the end of the response body. I previously discussed the problem of blocking on the HTML parser with @annevk, and they suggested changing the different parser stages to run in parallel. But for decode and the rest of hooks, that wouldn't work, since the output queue is returned from the hooks, as well as being an immediate queue. I guess we could fix this without breaking other usages of the hooks by taking the output stream as an optional parameter and pushing end-of-queue after the run algorithm ends. |
The message of the original commit in this PR no longer reflects the total of the changes. Instead, it should be squashed and merged with the following message:
|
The output argument “I/O queue of scalar values” in the following algorithm description seems mistyped, which would be “I/O queue of bytes” (before the commit, that argument was introduced as a byte stream variable in the encode algorithm):
|
Damn copy-paste 😄 |
An "output" parameter was added to the hooks for standards in #215, but no explanation was given as to why it was needed. This change adds that clarification.
This change renames the Encoding-specific concept of "streams", which had been causing confusion with readable/writable streams for years, to
"token queues""I/O queues". It also exports the corresponding definitions.Closes #180.
Preview | Diff