fastCodeAt that's actually fast #9458

Simn · 2020-05-21T08:13:12Z

The specification for StringTools.fastCodeAt is pretty silly:

  This method is faster than `String.charCodeAt()` on some platforms, but
  the result is unspecified if `index` is negative or greater than
  `s.length`.
  End of file status can be checked by calling `StringTools.isEof()` with
  the returned value as argument.

These two statements contradict each other: If the result is unspecified then we cannot make any guarantees about what you can do with the returned value.

As a consequence, some targets have to branch here, e.g. to avoid throwing exceptions. For instance, Java does this:

  return (index < s.length) ? cast(_charAt(s, index), Int) : -1;

This leads to unnecessary double branching in code that then calls isEof: once inside fastCodeAt for the bounds check, and once more on the returned value.

At this point we obviously cannot break fastCodeAt, but I would like to propose the introduction of an unsafeCodeAt which really is just the fastest implementation possible, with no out-of-bounds guarantees whatsoever. Consumers can check the bounds themselves by comparing whatever indices they use against the string length.
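To make the difference concrete, here is a minimal sketch of the two call patterns. It assumes the proposed function ends up as `StringTools.unsafeCodeAt(s, index)` with the same argument order as fastCodeAt; the `sumWithEof` and `sumUnsafe` helpers are hypothetical names used only for illustration.

```haxe
class Main {
	// Current pattern: on some targets fastCodeAt already branches on the
	// index internally, and isEof then checks the returned value again.
	static function sumWithEof(s:String):Int {
		var sum = 0;
		var i = 0;
		while (true) {
			var c = StringTools.fastCodeAt(s, i++);
			if (StringTools.isEof(c))
				break;
			sum += c;
		}
		return sum;
	}

	// Proposed pattern (assumes unsafeCodeAt exists): the loop bound is the
	// only check, so the accessor itself never needs to branch.
	static function sumUnsafe(s:String):Int {
		var sum = 0;
		for (i in 0...s.length)
			sum += StringTools.unsafeCodeAt(s, i);
		return sum;
	}

	static function main() {
		trace(sumWithEof("abc"));
		trace(sumUnsafe("abc"));
	}
}
```

Both functions compute the same sum for any string; the second one simply moves the bounds responsibility to the caller, which is the point of the proposal.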

This could then also be utilized by StringIterator, because Iterator.next also says this:

  A call to this method while hasNext() is false yields unspecified behavior.
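A rough sketch of what such an iterator could look like, again assuming `StringTools.unsafeCodeAt` exists as proposed. `UnsafeStringIterator` is a hypothetical class for illustration, not the actual haxe.iterators.StringIterator.

```haxe
// Hypothetical iterator; not the standard library implementation.
class UnsafeStringIterator {
	var offset = 0;
	var s:String;

	public inline function new(s:String) {
		this.s = s;
	}

	public inline function hasNext():Bool {
		return offset < s.length;
	}

	// No bounds check here: calling next() while hasNext() is false
	// returns an unspecified value, which the Iterator contract allows.
	public inline function next():Int {
		return StringTools.unsafeCodeAt(s, offset++);
	}
}

class Main {
	static function main() {
		// for-in accepts any object that provides hasNext() and next()
		for (c in new UnsafeStringIterator("abc"))
			trace(c);
	}
}
```

The only bounds comparison per character is the one in hasNext(), which the for loop has to perform anyway.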

And no, I don't want to "use Bytes instead" because I don't want to deal with Unicode myself.


Gama11 · 2020-05-21T08:36:24Z

Would this also lead to fastCodeAt() being deprecated, or at least having a note that points to unsafeCodeAt()?

Simn · 2020-05-21T08:44:42Z

I don't think I would deprecate it, just update the documentation.

RealyUniqueName · 2020-05-21T09:30:53Z

I agree we need unsafeCodeAt

ncannasse · 2020-05-21T12:09:48Z

I think for real "fast", we would need some kind of optimized StringIterator that gives direct access to the string data. That would work well with UTF-8, for instance.

ncannasse · 2020-05-21T12:11:16Z

Wait, we have StringIterator already :) But is it a good enough replacement for fastCodeAt?

ncannasse · 2020-05-21T12:11:49Z

Oh, StringIterator is not Unicode-compatible? :'(

Simn · 2020-05-21T12:12:50Z

StringIteratorUnicode can be optimized for UTF-8 targets by carrying some state, similar to the cursors eval strings have. That is a separate problem though and needs a fast character access function as a basis.

Simn · 2020-05-21T12:14:42Z

Actually it's a separate problem entirely because it's more of a character-offset to byte-offset mapping. Such an iterator would likely be based on Bytes, not String.

Simn added the discussion label May 21, 2020

Simn added this to the Design milestone May 21, 2020

skial mentioned this issue May 21, 2020

Haxe Roundup 530 skial/haxe.io#758

Closed

1 task

Simn mentioned this issue May 22, 2020

StringTools.unsafeCharAt #9467

Merged

Simn closed this as completed in #9467 May 25, 2020
