Plans for string prototype methods? #65

Open
boogie opened this issue Mar 30, 2023 · 14 comments

@boogie

boogie commented Mar 30, 2023

As far as I can see, even basics like [num] to access a character, or the length property, are not supported (there's a reference in the string test, but only as a comment). What is the plan for adding these, and basics like substr, indexOf, split, charCodeAt and fromCharCode? If there's no plan, what do you think is the best way to implement these?

@boogie
Author

boogie commented Apr 3, 2023

I think it is a similar answer to #64, however, the index (str[0]) format seems to be not implementable without parser/engine support.

@coder-mike
Owner

coder-mike commented Apr 3, 2023

Yeah, it's a similar story. But unlike arrays, Microvium does not have a mechanism to self-reflect on the content of its own strings. The reason for this is that strings in Microvium are defined as being UTF8-encoded, since it's potentially more compact and more aligned with what people commonly use in C, whereas the ECMAScript standard defines it as UTF16. So, a spec-conformant implementation of string.length cannot just return the size, but must actually iterate the string data to translate UTF8 to UTF16 to get the equivalent UTF16 size. Similarly for str[5] which needs to find the 6th UTF16 code unit. See here on MDN for a short explanation of how strings normally work.
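To illustrate why length is a linear-time operation on UTF-8 data, here is a minimal sketch in plain JavaScript (not Microvium's implementation). The function name utf16LengthOfUtf8 is hypothetical, and the sketch assumes the input is well-formed UTF-8: every lead byte contributes one UTF-16 code unit, except the lead byte of a 4-byte sequence, which contributes two (a surrogate pair).

```javascript
// Count the UTF-16 code units needed to represent UTF-8 data.
// Assumption: `bytes` is a Uint8Array holding well-formed UTF-8.
function utf16LengthOfUtf8(bytes) {
  let units = 0;
  for (let i = 0; i < bytes.length; i++) {
    const b = bytes[i];
    // Continuation bytes (10xxxxxx) were already counted with their lead byte.
    if ((b & 0xc0) === 0x80) continue;
    // A 4-byte sequence (lead byte 11110xxx) becomes a surrogate pair.
    units += (b & 0xf8) === 0xf0 ? 2 : 1;
  }
  return units;
}

const sample = "h\u00e9llo \u{1F600}";
const encoded = new TextEncoder().encode(sample);
console.log(encoded.length, utf16LengthOfUtf8(encoded), sample.length); // 11 8 8
```

The byte size (11) and the UTF-16 length (8) differ as soon as the string contains non-ASCII characters, which is why the size alone cannot be returned.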

Roughly speaking, your options are:

  • Use Uint8Array instead. These are as space-efficient as strings, but give you length and indexers which work on the byte level.
  • Implement the desired behavior in C. String indexing like str[5] can't be implemented at all in the host with the current API, but you could implement a plain function String_get(str, index). The C API function mvm_toStringUtf8 gives you access to the string data.
  • If you can explain to me your use case and tell me why str.length and str[5] (and any other string functions you need) are important for this use case and difficult to do any other way, maybe we can add this functionality to the engine. These will unfortunately be computed in linear time which may be unexpected to some.
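As a rough sketch of the first option, the snippet below treats a message as raw bytes and uses length and indexing on a Uint8Array. It is written in standard JavaScript; how the array is created or received from the host in Microvium is not covered in this thread, and the names message, PERCENT and describeMessage are made up for illustration.

```javascript
// Hypothetical message "%rgb" as raw ASCII bytes.
const message = new Uint8Array([0x25, 0x72, 0x67, 0x62]);

const PERCENT = 0x25; // byte value of '%'

// Inspect a byte-level message using only `length` and indexing.
function describeMessage(bytes) {
  if (bytes.length === 0) return "empty";
  if (bytes[0] === PERCENT) {
    return "led sequence with " + (bytes.length - 1) + " steps";
  }
  return "unknown";
}

console.log(describeMessage(message)); // led sequence with 3 steps
```

The trade-off is that comparisons are against byte values rather than character literals, which is less readable for script authors.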

@boogie
Author

boogie commented Apr 4, 2023

What I'm trying to implement is sending messages over Bluetooth to the device and to the Microvium engine. These are string messages with different content. The device should be able to process them with JavaScript code. There are also buttons on the device, and I would like to process button presses as well. Different sources send different messages over a generic string interface. The device also has an RGB LED and a vibration motor.

When I send "%rgb", I would like to set the RGB LED to red for 0.5s, then to green for 0.5s, and finally to blue for 0.5s. When I send "~...---..." the vibration motor should play SOS vibrations.

I would like to enter numbers with the numeric buttons, like 1234. My idea is to append every digit and check the string length. When it reaches 4 digits, I would like to do a calculation with it, and then send a Bluetooth message to another device.
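The digit-entry idea could look roughly like the sketch below. It assumes str.length is available (which is the feature being requested here), and the names entered and onDigitPressed, as well as the callback shape, are hypothetical.

```javascript
// Digits typed so far, kept as a string.
let entered = "";

// Append one digit; once four digits are collected, hand the code to
// `onComplete` and start over. Relies on `length` working on strings.
function onDigitPressed(digit, onComplete) {
  entered = entered + digit;
  if (entered.length === 4) {
    const code = entered;
    entered = "";
    onComplete(code);
  }
}

// Usage: simulate pressing 1, 2, 3, 4.
const received = [];
["1", "2", "3", "4"].forEach(function (d) {
  onDigitPressed(d, function (code) { received.push(code); });
});
console.log(received[0]); // 1234
```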

This would be a generic tool that can process, transform events and data, etc. I would like to allow users to write code and create their solutions. I also would like to write some code in JavaScript to accelerate development.

So in general, I would like to do basic text processing. I see there are workarounds, but would be great to keep it simple and easy to understand. It's hard to explain Uint8Array.

I think iterating over UTF-8 strings to find the 6th character is not a problem; it will be quick in most cases. Same for length. As the strings are mostly short, I would be happy to go with that as an MVP solution. I'm not working with non-ASCII strings right now, but it may become necessary later, so I'm aware of this problem.

@coder-mike
Owner

Ok, I'll take a look at it. I propose the following solution:

  1. I will add in a builtin string prototype object and global String object. The global String object will not be a constructor.
  2. Property access on strings using non-number keys (e.g. str.foo) will be delegated to the string prototype, except .length which will return the number of equivalent UTF-16 code units or throw if the string is not valid UTF-8. And .__proto__ which will return the prototype itself.
  3. The prototype will have just one method, str.charCodeAt, which will return the equivalent UTF-16 code unit. It will throw a type error if the underlying string is not valid UTF-8. The returned code unit is allowed to be part of a surrogate pair. Users can add more methods to the string prototype if they choose.
  4. The global String object will have one method, which is fromCharCode. This will take only one or two arguments, not an arbitrary number of arguments. The two arguments must be two valid UTF-16 code units. The function will return the equivalent string (internally encoded as UTF-8). It will throw if the arguments together do not form a valid unicode code point.
  5. Integer property access on strings, such as str[5], will return a new string of length 1 which is equivalent to calling String.fromCharCode(str.charCodeAt(5)). This will throw if the code unit 5 is part of a surrogate pair.

I believe that this is the minimum that would be required for a user to write all the other string methods, e.g. in the form of a library. charCodeAt, fromCharCode and length give a user access to the equivalent UTF-16 code units, and the ability to extend the prototype makes the solution extensible.
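To make the "library on top of three primitives" idea concrete, here is a sketch of substring and indexOf written using only length, charCodeAt and String.fromCharCode with one or two arguments, as described in the proposal. The helper names are hypothetical, the code is plain JavaScript tested in a standard engine rather than in Microvium, and it assumes the input contains no lone surrogates (which the proposed fromCharCode would reject).

```javascript
function isHighSurrogate(unit) { return unit >= 0xd800 && unit <= 0xdbff; }
function isLowSurrogate(unit) { return unit >= 0xdc00 && unit <= 0xdfff; }

// Copy code units [start, end) into a new string. Surrogate pairs that lie
// fully inside the range are passed to fromCharCode together.
function substring(str, start, end) {
  let result = "";
  let i = start;
  while (i < end && i < str.length) {
    const unit = str.charCodeAt(i);
    const hasNext = i + 1 < end && i + 1 < str.length;
    if (isHighSurrogate(unit) && hasNext && isLowSurrogate(str.charCodeAt(i + 1))) {
      result += String.fromCharCode(unit, str.charCodeAt(i + 1));
      i += 2;
    } else {
      result += String.fromCharCode(unit);
      i += 1;
    }
  }
  return result;
}

// Return the index of the first occurrence of `search`, or -1.
function indexOf(str, search) {
  const last = str.length - search.length;
  for (let i = 0; i <= last; i++) {
    let j = 0;
    while (j < search.length && str.charCodeAt(i + j) === search.charCodeAt(j)) {
      j++;
    }
    if (j === search.length) return i;
  }
  return -1;
}

console.log(indexOf("hello world", "world")); // 6
console.log(substring("hello world", 6, 11)); // world
```

Because each charCodeAt call would be linear in the proposed engine, loops like these would be quadratic on long strings; for the short messages described in this thread that is probably acceptable.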

@boogie
Author

boogie commented Apr 5, 2023

Oh, this sounds awesome. Thanks a lot. I agree that having these will allow me to implement all the features I would like to.

@boogie boogie closed this as completed Apr 6, 2023
@boogie
Author

boogie commented Apr 14, 2023

Probably I've closed this issue by accident. Your proposed solution is great.

@boogie boogie reopened this Apr 14, 2023
@coder-mike coder-mike added the enhancement New feature or request label Apr 21, 2023
@boogie
Author

boogie commented Jul 19, 2023

When do you think you can add these features? (No pressure)

@coder-mike
Owner

Hi. Sorry, my 2-year-old started going to childcare earlier in the year and suddenly started catching all these viruses from the other kids and has been sick almost constantly since then and it took all my energy and time. I'm starting to get back on track now, but I want to close off a few half-completed branches/features before I start on this one.

Let me see what I can do. Apologies for the long delay. I'll see if I have time later this week, otherwise we're looking sometime in August probably.

@boogie
Author

boogie commented Jul 20, 2023

No worries. Mine are 11 and 13, and it is still happening. :D 🤞 Thanks, and looking forward to the updates.

@coder-mike
Owner

coder-mike commented Jul 23, 2023

WIP - the branch 65-string-methods has support for str.length and str[i] if the strings are only ASCII.

Unicode support is a rabbit hole and I'll need to resume it another time.

My revised proposal is this:

  • Adding a MVM_TEXT_SUPPORT macro to the port file which is 0, 1, or 2 to support ASCII, the Unicode Basic Multilingual Plane, or full Unicode respectively.
  • Error upon restore if your compiled snapshot has strings which don't adhere to the above selection.
  • No support for global String or String.fromCharCode at this time, but tell me if you need it.
  • Also no support for charCodeAt at this time, but tell me if you need it.
  • Intention to support string prototype, which will be accessible by someStr.__proto__ since I won't put the global String.fromCharCode for the moment.

The reason for the MVM_TEXT_SUPPORT macro is that this feature is actually really expensive. The initial draft implementation for unicode support for just str.length and str[i] was almost 600 bytes of program space compared to the current 10kB for the whole engine.
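As an illustration of the three levels, the sketch below classifies a string by the lowest level that would cover it: 0 for ASCII only, 1 for the Basic Multilingual Plane, 2 for anything needing surrogate pairs. The function name requiredTextSupport is hypothetical, and this is plain JavaScript meant for a standard engine (for example as a build-time check), not the check the engine itself performs on restore.

```javascript
// Return the lowest text-support level (0, 1 or 2) that covers `str`.
function requiredTextSupport(str) {
  let level = 0;
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    // Any surrogate code unit means a character outside the BMP.
    if (unit >= 0xd800 && unit <= 0xdfff) return 2;
    // Anything above 0x7F is non-ASCII but still inside the BMP.
    if (unit > 0x7f) level = 1;
  }
  return level;
}

console.log(requiredTextSupport("%rgb"));      // 0
console.log(requiredTextSupport("caf\u00e9")); // 1
console.log(requiredTextSupport("\u{1F600}")); // 2
```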

@coder-mike
Owner

Sorry, this isn't likely to happen any time soon. I'm swamped with other things. If anyone else wants to pick up this work and contribute or help, the branch is 65-string-methods.

@coder-mike
Owner

How important are the string prototype methods? I'm encountering multiple problems in the implementation.

The first is that I don't want to force a whole object to be allocated in memory for people who don't use that feature. I think I can get around this by saying that the string prototype object is allocated lazily at compile time. So if you access ''.__proto__ at compile time then it will create the string prototype, which will persist to runtime.

The second problem is more significant which is that in the spec, method calls on strings will pass a this parameter which is not the string itself but a String object. As in the following:

''.__proto__.myMethod = function () { console.log(typeof this) }
'abc'.myMethod(); // Logs "object" not "string"

Microvium doesn't have these wrapper types like the String object. I'm hesitant to add them because I think it would be a lot of overhead for a feature that's not directly useful in itself (I think most code avoids String objects and instead just uses primitive string types), and it could complicate stuff like string coercion.

I could leave it non-compliant and just pass this as the string primitive instead of the object wrapper. But I prefer to avoid adding non-compliant behavior to the engine.
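For reference, in standard ECMAScript the wrapper object is only observable in non-strict functions: a strict-mode function receives the primitive itself as this. The sketch below shows both cases in a standard engine. It uses the Function constructor so that each body's strictness is explicit, and .call, which binds this the same way a method call such as 'abc'.myMethod() would. Whether this helps depends on whether Microvium treats all code as strict, which this thread does not state.

```javascript
// Function() bodies are non-strict unless they contain a "use strict"
// directive, so both variants behave the same however this file is loaded.
const sloppyTypeOfThis = new Function("return typeof this;");
const strictTypeOfThis = new Function("'use strict'; return typeof this;");

console.log(sloppyTypeOfThis.call("abc")); // object
console.log(strictTypeOfThis.call("abc")); // string
```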

So the question is, how important is it? And, can you (or anyone reading this thread) think of any other clever ways of supporting it in a way that is compliant and doesn't add much cost to those who don't use the feature?

@boogie
Author

boogie commented Nov 9, 2023

Hi, from a JavaScript developer's point of view, I think they are a very important feature of the language. Even basic string manipulation is hard without them. Please note, that in JavaScript, strings are most of the time NOT objects, ONLY behaving like objects. Probably this is the solution you are looking for. https://developer.mozilla.org/en-US/docs/Glossary/Primitive

"Primitives have no methods but still behave as if they do. When properties are accessed on primitives, JavaScript auto-boxes the value into a wrapper object and accesses the property on that object instead. For example, "foo".includes("f") implicitly creates a String wrapper object and calls String.prototype.includes() on that object. This auto-boxing behavior is not observable in JavaScript code but is a good mental model of various behaviors — for example, why "mutating" primitives does not work (because str.foo = 1 is not assigning to the property foo of str itself, but to an ephemeral wrapper object)."

Of course, you have to consider what are the use cases of Microvium you would like to support. My idea is that if you have a device that has a display, you most probably work with strings, and likely to have to manipulate them.

@boogie
Author

boogie commented Nov 9, 2023

I think passing the primitive value as this is a good enough solution. I have no real use cases in mind where you need more; I think it is 99% compliant. As a JavaScript developer, I have never used strings as real objects, other than calling their attributes/methods.
