Provide support for fixed capacity, variable length value type (inline) strings. #2099
How do you propose something like this be implemented? Types in the CLR must have a known size, so the only method would be to emit a different (and incompatible) struct for every size of this "string". |
@HaloFour - In a similar way that fixed buffers are. In fact these could be implemented as fixed buffers but wrapped in some syntactic sugar. |
Becomes `struct string_64 int curr_length; } plus a bunch of properties etc. |
You mention that fixed buffers didn't work for your solution. Any similar solution for strings would have the same limitations, the length would have to be a constant known at compile time. Why weren't fixed buffers sufficient for your purposes? |
That's the other problem, you have to generate a bunch of separate members just to make these things workable, and they'd all be incompatible with one another, as well as all normal |
We wanted consumers of our client/server API to be able to freely declare messages. Here's what a user-defined message might look like (the code is unavailable to me just now, so excuse typos).
This is how we wanted consumers to use it (and they do) but as you can see we needed a family of types (structs) for a large set of predefined capacities. We use a That interface defined to/from string conversions and compare etc. With the above design it worked well but a user could not use a So as you can see the consumer has no idea how In user code these members were interchangeable with At runtime a |
Exactly, that's the motivation for this post, to introduce a new kind of string type that provides all this out of the box. |
Seems like a very highly specialized solution that would have very narrow benefits but is massively complicated.
You're not asking for one string type, you're asking for 2 billion potential string types, every single one of them with a separate set of members. That would certainly result in metadata explosion. |
Well I'm certainly asking for something but not quite that. What we want (ultimately) is some syntactic mechanism that can convert this:
Or ideally a new type that is implemented better than this, but one that enables the user to declare the capacity at compile time without them needing any knowledge of the implementation. C# could perhaps support the passing of constants into the declaration of type instances, this would probably be enough. I mean support this:
This way we could create types that must contain compile time constants, that let the consumer of the type provide that compile time constant. Then a consumer could just code:
This is I think the fundamental requirement here, a way to propagate compile time constants into type declarations... |
That would involve CLR changes, and the end result is effectively the same, it's just that now the CLR has to generate potentially 2 billion different flavors of that class. |
Perhaps, so I guess what I'm asking is for a better way to solve this problem - on the surface this sounds rudimentary - provide support for fixed capacity (value type based and hence inline) strings - if we forget about what I've said above and what we currently do to implement this and just step back and view this as an abstract problem - are there options? Many other languages support the idea of fixed capacity strings so in principle this isn't a major challenge...or is it? We used fixed buffer wrapper structs only because we had no other way to deliver this, but being able to do this at a deeper language/CLR level might well be slicker and less messy, some of the problems you mention may be due purely to the way we implemented this and not necessarily inherent in the problem itself. |
Let's work backwards on this a bit. In many cases the language has adopted these sorts of 'less easy to use' but 'much more performant' solutions when the gains were made quite explicit. Could you show some real world examples that would benefit from this (along with measurements)? Basically, a real world piece of code that you would envision using this. To get the perf measurements, it would likely suffice to convert that real code to use fixed-size-buffers and see what the resulting difference was. |
The challenge here is the CLR which offers no real facility to accomplish this. Without the CLR it would be relatively easy. Languages that did support fixed-length strings, like Visual Basic, lost them in the transition to .NET. |
The application is a .Net to .Net messaging platform. Because of this, instances can be serialized (basically) as a memcpy, and the performance of this is outstanding (as I mentioned, one benchmark sees 1,000,000 instances per second on an i7 3960 core; these are small messages but did contain a few of these fixed length strings). Other forms of serialization do not achieve these levels. Recent advances to C# (improved However the inline fixed capacity strings remain as a contrivance in our design; as I explained this is quite ugly (but works very well). Because a type layout in the CLR is the same across different machines (guaranteed if we're using the same version of the CLR at each end) it is very easy to serialize an instance (even of a class) provided all fields are inline. The recipient then simply creates an instance of the type and "overwrites" its field block with the received block of bytes that were sent. That's the principle anyway (of course we also need to send and cache type descriptions and so on, but this is all part of the low level handshaking and protocol). All of this sits on top of a robust async socket management layer with ring buffers and stuff, but at the outer level app developers can just create classes that inherit from Of course sending the source for this is not easy as the system contains lots of other proprietary stuff and includes some dynamic method generation for certain things. I'm sure you get the idea though and I'm sure you can understand how this achieves very high performance. |
That sounds like a very dangerous assumption. From my understanding, unless you're using explicit layout, the CLR will layout the native member of that struct any way it sees fit, which can differ based on platform. Anyhow, in regards to benchmarking, you're probably going to have to demonstrate that difference and why only this particular solution is suitable. And you're likely going to have to compare that to the various other high-performant serialization libraries out there. |
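One way to reduce the layout risk raised here is to pin field offsets explicitly and verify the size at startup. The following is a minimal sketch under stated assumptions: `PriceUpdate` and `LayoutCheck` are hypothetical names, not types from the system described in this thread, and the check only covers total size, not byte order or CLR version differences.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical message struct with every field offset stated explicitly,
// so the runtime is not free to reorder or pad the fields.
[StructLayout(LayoutKind.Explicit, Size = 24)]
public struct PriceUpdate
{
    [FieldOffset(0)]  public int InstrumentId;
    [FieldOffset(8)]  public long Sequence;
    [FieldOffset(16)] public double BidPrice;
}

public static class LayoutCheck
{
    // Throws if the size seen by this runtime differs from the size
    // both peers agreed on (for example during a handshake).
    public static void EnsureSize<T>(int expectedBytes) where T : unmanaged
    {
        int actual = Unsafe.SizeOf<T>();
        if (actual != expectedBytes)
        {
            throw new InvalidOperationException(
                $"{typeof(T).Name} is {actual} bytes, expected {expectedBytes}.");
        }
    }
}
```

Calling `LayoutCheck.EnsureSize<PriceUpdate>(24)` once per message type at startup would turn a silent layout mismatch into an immediate failure.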
Thanks! |
@HaloFour - All designs involve compromises and assumptions, provided one is very clear about what these are and takes steps to verify these at runtime where necessary then what we do works very very well. For users who use Windows and the same hardware architecture (Intel, AMD) on the participating nodes (not a huge requirement) they can get these gains in performance. |
Agreed. This is very much feeling like a library problem currently. It may be necessary to elevate to a CLR/Language problem. But it would be really necessary to demonstrate why existing library solutions are insufficient. Note: CoreFx/asp.net was pretty involved in passing feedback along about the areas they needed help for perf. That's what led to all the ref/span/readonly stuff. I don't recall any feedback about this particular area. And they're def trying to make high-perf servers. |
At the outermost level an app developer creates an instance of The base message class contains a lot of low level mechanisms that enable us to get the address and size of the object's field block and then "memcpy" it to a byte[] the rest you can probably envisage). |
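The "memcpy to a byte[]" idea can be sketched with APIs that ship in .NET today. This is a minimal illustration, assuming a struct message rather than the class-based messages described above; `QuoteMessage` and `BlockSerializer` are hypothetical names, and the real system's field-block mechanism is not shown in this thread.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical message type: only inline (unmanaged) fields, no references.
public struct QuoteMessage
{
    public long Sequence;
    public double BidPrice;
    public int InstrumentId;
}

public static class BlockSerializer
{
    // Copies the raw bytes of the value into a new array (one block copy).
    public static byte[] ToBytes<T>(T value) where T : unmanaged
    {
        Span<T> one = MemoryMarshal.CreateSpan(ref value, 1);
        return MemoryMarshal.AsBytes(one).ToArray();
    }

    // Reinterprets the leading bytes as a T; throws if the input is too short.
    public static T FromBytes<T>(ReadOnlySpan<byte> bytes) where T : unmanaged
    {
        return MemoryMarshal.Read<T>(bytes);
    }
}
```

The `unmanaged` constraint is what makes this safe to express: a type with a `string` field would not compile here, which is exactly why inline fixed-capacity text matters for this design.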
I should add too that the system includes various optional compression and encryption modes, but these are not rocket science as you can imagine. I cannot over stress the impact that making all data inline has, this is the key to outstanding performance (I used to work in the City of London many years ago and have a lot of experience in this area on various platforms and languages). |
I'd prefer if there was a real piece of code that could be used as the exemplar case here. :) Honestly, i'm not trying to make your life hard. I'm just pointing out that for language features that exist solely for perf needs, we need real world code to look at and understand, so we can best assess what the right sort of solutions would be (and just to validate how things would improve). -- Another way of putting it: Imagine if we added this feature... and then you used it... and it didn't make performance any better. The feature would be a failure at its core goal. So we actually need some way of validating things. Furthermore, imagine if we added this feature, and you couldn't use it, because there was some limitation (akin to the ref limitations we have), and your own use case violated that limitation. This would also then fail. |
But you need to. Because other teams that are doing precisely this are not coming to us with this being a use case that must be addressed. These other teams are working in very competitive arenas, trying to squeeze out all the perf they can. Right now, this isn't a place they are finding problematic. So it's hard to gauge for certain if what you are saying is generally applicable, or if this is a very specific problem to your domain. |
Another way of putting it: You're the one asking for this to be done. Like it or not, that means the legwork is on you to provide enough compelling data to make others feel like this is worthwhile. It's unlikely that anyone else is going to go do it for you. So, to maximize your chance of success here, it is necessary to go beyond just saying you'd find it useful for your use case :) |
@CyrusNajmabadi - I understand Cyrus, I guess the only way to show the benefit would be to alter our library to support an alternative serialization method but the system is deeply predicated on this (at the core anyway) so this would be quite a lot of work. Bear in mind that the performance gain here is pure CPU, the cost of CPU time in sending and receiving these messages is far lower than something that uses XML serialization, MS Binary serialization or protocol buffers. A memcpy of a message is Your question is interesting because we need to compare this architecture with another one and I don't have that other one. |
What I would suggest is reimplementing a very basic form of this serialization architecture that could be compared directly to other serialization methods. After all, any solution here would have to be very general purpose. |
I did have benchmarks of the serialization layer, I'll see if I can dig these out - that might be a start! |
I agree, and recent support for generic pointers and |
I must get a flight soon, but thanks for taking the time to explore this. |
This proposal is very similar to the |
Understood :) But... well... that comes with the territory. If you want to make a language change (esp. one related to perf), then this sort of comes with the territory. The only way to escape it is to get someone excited enough to do it for you :) |
It seems that you're correct here, also from what others say even wide appeal features stand only a small chance of getting included.
Consider updating say an option price, we can do it pretty much like this:

```csharp
Option * option_ptr = datastore.GetItem<Option>(key); // can be updated soon to use new "ref" support.
option_ptr->bid_price = new_price;
```

This is a tiny cost (including the GetItem()) and enables updates to data at a very high rate and very low CPU cost, perhaps just 8 bytes change (e.g. a Decimal) despite the fact the Option might have many fields (including text fields like name, exchange etc). The datastore incidentally is rather specialized and proprietary and local to the machine running the update operations, we can write to the store like this for example:

```csharp
Option some_new_option = ...;
datastore.Write(ref some_new_option);
```

Because the code (a bit dated now but we can convert a ref to a ptr and vice versa with support code) can serialize very rapidly (using what I'm calling "memcpy" for ease of discussion) this too is very fast and low CPU.
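The in-place update pattern can also be written without pointers by using ref returns. The sketch below is an assumption-laden stand-in, not the proprietary datastore: `SlotStore`, its slot indexing, and the `Id` and `ask_price` fields are hypothetical, and only `bid_price` comes from the snippet above.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical record; only bid_price is taken from the thread's snippet.
public struct Option
{
    public int Id;
    public decimal bid_price;
    public decimal ask_price;
}

// Hypothetical stand-in for the datastore: fixed-size slots inside one byte[].
public sealed class SlotStore
{
    private readonly byte[] _storage;

    public SlotStore(int slots)
    {
        _storage = new byte[slots * Unsafe.SizeOf<Option>()];
    }

    // Returns a reference into the store so callers can update one field in place.
    public ref Option GetItem(int slot)
    {
        Span<Option> items = MemoryMarshal.Cast<byte, Option>(_storage.AsSpan());
        return ref items[slot];
    }

    // Overwrites a whole slot with a single block copy of the struct.
    public void Write(int slot, ref Option value)
    {
        GetItem(slot) = value;
    }
}
```

With this shape, `store.GetItem(2).bid_price = newPrice;` touches only the bytes of that one field, which is the property the comment above relies on.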
As soon as the format begins to deviate from its in-memory layout you begin to incur significant costs. Nothing comes close to a single "memcpy" (e.g. CopyBlock). We can do this and have "strings" because of the Furthermore because the data is stored in an identical structure to its managed memory layout we can use managed code (via pointers but we could use
Not really, most of the work is updates and most of it to non-string fields.
This is true but as I've said earlier we don't update the "string" stuff much at all, these may be part of a lookup key or data that's used when reports are pulled for example. But 85% of the work is perhaps updating primitive numeric fields and 15% perhaps writing new items both of which operations are very fast. Bear in mind that the datastore is part of the update service's (a Windows service) address space but not part of the |
Check this proposal, which would add int parameter(s) to generics: #749

If that proposal gets implemented, you could do something like this:

```csharp
struct ValueString<CH, const int SZ> {
    fixed CH chars[SZ]; // fixed size inline array with SZ elements of type CH
    // misc functions, properties, operators, ...
}
ValueString<char, 16> MyStringU16; // fixed size string-like value type with 16 Unicode chars
ValueString<byte, 64> MyStringA64; // fixed size string-like value type with 64 Ansi/Ascii chars
```

To reduce code bloat, the functions of ValueString could be implemented in an inner private empty (field-less) static class / struct. These inner helper functions would take a Span<> (which contains size and address of the fixed array) as parameter. The functions of the outer ValueString struct would be simple (and therefore maybe inline-able) wrappers around the inner work functions ... Note, that proposal 749 is not limited to chars, bytes, strings, one-dimensional arrays, fixed arrays ... |
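The helper-plus-thin-wrapper split described above can be approximated today, without proposal #749, at the cost of one small struct per capacity. This is a sketch only: `FixedText`, `Text16`, and `Text64` are hypothetical names, and the code needs `AllowUnsafeBlocks` enabled because it uses fixed buffers.

```csharp
using System;

// Capacity-independent logic lives here once. All names are hypothetical.
internal static class FixedText
{
    // Copies value into buffer, zeroes the unused tail, and returns the new length.
    public static int Store(Span<char> buffer, ReadOnlySpan<char> value)
    {
        if (value.Length > buffer.Length)
            throw new ArgumentException("Value exceeds the fixed capacity.", nameof(value));
        value.CopyTo(buffer);
        buffer.Slice(value.Length).Clear();
        return value.Length;
    }
}

// Thin wrapper: only the constant and the buffer differ between capacities.
public unsafe struct Text16
{
    private const int Size = 16;
    private int _length;
    private fixed char _chars[Size];

    public void Set(ReadOnlySpan<char> value)
    {
        fixed (char* p = _chars)
            _length = FixedText.Store(new Span<char>(p, Size), value);
    }

    public override string ToString()
    {
        fixed (char* p = _chars)
            return new string(p, 0, _length);
    }
}

public unsafe struct Text64
{
    private const int Size = 64;
    private int _length;
    private fixed char _chars[Size];

    public void Set(ReadOnlySpan<char> value)
    {
        fixed (char* p = _chars)
            _length = FixedText.Store(new Span<char>(p, Size), value);
    }

    public override string ToString()
    {
        fixed (char* p = _chars)
            return new string(p, 0, _length);
    }
}
```

The wrappers still have to be written (or generated) per capacity, which is the duplication a `const int` generic parameter would remove.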
@MillKaDe - Good lord, how did I miss that (I think someone else mentioned it and I glossed over it - inexcusable). You are absolutely right, that is exactly what's called for. I think this would work for me, very glad you mentioned this! Thanks |
I definitely think it's better. It's something you can do today. It uses the well-supported and understood 'Span/ReadOnlySpan' types. It's really simple (though does require some boilerplate in a few places). It will interoperate with the rest of the high-perf, low-overhead, side of C#/.net. Creating something new for this niche case seems pretty objectively worse. It would take years to get it. Would likely need an entirely new way of working with it. Would have to have a design around how it could work in the ref/span world, etc. etc. |
You're proposing something that wants to introduce a very different programming model than the one that C# has had since 1.0, while also interoperating seamlessly with 20+ years of existing APIs. That's non-trivial. It's equivalent to me coming to rust and asking it to have a totally different ownership model than what it has today. Or going to C++ and wanting lexical scoping to work entirely differently. It may be 'something that seems on the surface straightforward', but only is that way because it can ignore the deep design decisions and history involved here. |
@CyrusNajmabadi - All I can say in response to your most recent remarks is that it seems to me you've ultimately designed yourselves into a corner. If inline string data types cannot be supported (and this is a rather trivial concept just look at strings in Pascal or PL/1) without the Herculean effort you claim, then that has to tell us something about how you've all designed this. I can see now why you've been so critical, it's not that what I asked for is some huge piece of functionality, it's because your design and model is too restrictive, too inflexible. |
Indeed it does. It tells us that C# is a safe, garbage collected language without an ownership model. You might as well say that if Prolog cannot support object oriented programming without herculean effort then that has to tell us something about how they've all designed it. This is how the .Net programming model works. End of story. If you need to do something the programming model doesn't support, use a different language. |
@YairHalberstadt @CyrusNajmabadi - These analogies don't really help nor do I regard them as valid to be frank. Creating a supposed analogy (make Prolog more OO or change the scoping rules in C++) and then discrediting the analogy is referred to as a strawman argument in philosophy and logic, it has no place in a serious technical discussion. |
And it seems that there may be interest in treating fixed buffers as spans, which eliminates some boilerplate as you can use them in an expanding ecosystem of APIs.
Every language is a tradeoff of different philosophical concerns. Languages that allow arbitrary stack allocation and reinterpretation are inherently much less safe than C#, especially if they don't have an ownership model. C# and the CLR never has to concern itself with whether or not the memory backing a string has gone out of scope. This is why the compiler is so strict when it comes to |
It is not a strawman argument. You're arguing that something which is easy in a language with a completely different programming model is difficult in C#. Hence C# is badly designed. We're pointing out that this is an obviously nonsensical argument, and giving some examples of the sort of nonsense conclusions you would come to if you applied that argument. |
Clearly there is no prospect of getting what I sought and that's fine, if the experts see this as a huge challenge then I respect that. But I never asked for arbitrary stack allocation or reinterpretation! What I did seek was a value type mutable fixed capacity string like type which could be used in struct fields in much the same way as primitive types or fixed buffers. |
This is getting silly, I never anywhere said C# was badly designed, this is a false statement and I'm finished with this issue now, thank you all. |
@YairHalberstadt - so quote me next time rather than paraphrasing me, my remarks are not logically equivalent to "C# is badly designed" else I would have said that. |
Ok. I'm sorry for paraphrasing you. Let's leave it here shall we? |
Yes. In the same way you would be designed into a corner if you wanted to do something against the rust ownership model, or the C++ lexical scoping model. |
@CyrusNajmabadi - I was done with this thread yet you persist in arguing? Do you really want me to respond to this remark? I disagree with you but let's please drop this now. Thanks. |
I'm pointing out that you're asking to change something very intrinsic to the language. And yes, in that case, it's non-trivial to ask for that. I hoped, by way of example, that might be clearer to you. |
Could you explain what you disagree with? My point still stands and is valid. It is non-trivial to change something fundamental to how the language was designed. I hoped, by way of analogy, to help make that more understandable. I can go more in depth about this specific issue if you want. But it seems like we're somewhat at an impasse there as well. |
And there's lots of stuff that is trivial in some languages that would be hard to express in Pascal or PL/1. What's your point? All languages make tradeoffs based on the things they find most and least valuable. It's trivial for me to do some things in C# that are much harder in other languages. And the same is true for me with Rust, Python, Go, and TypeScript (languages i use on a regular basis). All this tells you is that, in terms of design, the activity you are trying to get language support for is not something the language designers thought was important enough to give support to. And that's something i've been telling you since just a few posts into this issue. Your use case is hugely niche. It's not at all common, and it's already supported in a manner that is felt to be "good enough" by what the language has already shipped with. |
@CyrusNajmabadi - Clearly this is not a capability that is going to get any support so there's little value in discussing the issue further. We could discuss the process we used to discuss the issue if you want, but that really isn't something I'd expect to do in an issue thread like this one. |
You can have that today. It's called a fixed size buffer. The main problem is that when presented with this option, you have rejected it. So you want something that is more than what you specified above. Namely, you want a fixed-size-buffer with a lot of convenience in the language that helps you avoid writing the code to work with the fixed-size-buffer. In other words, you just want things to be more pleasant. You're not asking for something that is not feasible today. This is certainly something you can want. But, then, the onus is on you to properly explain why this is so important a need. For example, there were lots of tangents about performance. However, your language proposal doesn't help with performance. i.e. it would boil down to just the same code that i showed above. So, really, it would be about just making things a little more pleasant. And, frankly, that's a hard thing to sell since you'd be asking to make an utterly niche area of the language nicer for an incredibly tiny group. |
The primary issue here is that you haven't really been listening to relevant feedback and advice on how best to get a language change in. The most critical thing to realize is:
That's really it. At the end of the day, nothing else really matters. So, if you want such a change to C#, the goal is to be able to get someone to 'champion' what you want (or something close enough). That means that the best thing you can do for the things you want is make a cohesive and compelling argument that is factual, accurate, and convincing as to why this is an exceptionally worthwhile thing to do. The feedback you've been getting has been to help understand why your current position is not there currently. For example, the tangents about perf aren't accurate (as you can get the perf you want today). The same holds for any sort of argument that implies that this sort of thing isn't possible today. Finally, little (if any) effort has been spent explaining why it would be so valuable to save a few lines of code (that you can write today) to warrant needing a full-fledged language feature here. So, effectively, the feature is a non-starter. Not because it's not a good idea. But entirely because the issue does not put in the necessary effort in the right ways to convince someone to champion it. Cheers, and i hope this helps! |
Note: "painted yourself into a corner" generally has a negative connotation. The idiom conveys the idea that you're now at the point where any direction you go you'll invariably have problems because you are going to walk through paint and make a mess. In other words, it directly implies that you did not think about what you were doing as you went along and now are in a situation you need to extricate yourself from. |
@CyrusNajmabadi - I disagree with some of the claims you make in your latest four posts, but as I say this isn't the place to discuss the discussion process itself. I find your tone a little rude and impatient and frankly a discouragement to open informal technical discussion. I've stated several times very recently in this thread that I recognize that what I asked people to consider is - in their informed opinion - very unlikely to ever see the light of day. I've acknowledged that and am quite prepared to cease further discussion, yet you persist. |
My advice was given in the spirit of making you the most successful at getting improvements to the language that would help you out. I tend to try to want to help steer things in that direction, and i often attempt to move things away from overfocusing on areas that go against that. Having done this a long time, my goal is toward both having things done as efficiently as possible, as well as avoiding the things that often derail a proposal entirely. I'm sorry you took my feedback negatively. However, i do recommend you keep a lot of what i mentioned in mind for the future. Note: if your goal is simply discuss things, i highly recommend gitter.im/dotnet/csharplang as a better venue. Github itself is a place for seriously moving proposals along to real language features. And, if that is your goal, then a lot of what i was talking about and focusing on is very relevant. Cheers! |
@CyrusNajmabadi - Again your condescending tone is evident:
I too recommend that you do the same.
It isn't feedback that I take negatively it is condescension, inaccuracies, false statements and inappropriate analogies and misleading paraphrasing. |
I will indeed work to make it clearer what i am attempting to help address, and what steerage on an issue will help give it the highest chance of success. Thanks! |
Strings in C# are perceived as buffers with an (to all intents and purposes) unlimited capacity and for this reason cannot be stored inline as primitive types are. I'm proposing that consideration be given to introducing an additional string type which has a capacity declared at runtime, and thus a maximum possible length.
This then makes it possible to define classes or structs which contain strings yet have these strings appear inline, within the datum's memory much as primitive types are.
This is a problem that came up in a sophisticated very high performance client server design in which we got huge benefits by being able to define fixed length messages that contained strings. In our case we simulated fixed capacity strings as properties that encapsulated fixed buffers (char or byte). This worked well but was messy because the language offers no way for us to 'pass' (at compile time) a length into a fixed buffer declaration, one must actually declare the fixed buffer explicitly with a constant.
As a result we created a huge family of types named like this: ANativeString_64 and UNativeString_128 (ansi and unicode variants) and so on, as I say this worked but was messy.
Each type was a pure struct (as in the new generic constraint 'unmanaged') so when used as member fields in other structs left that containing struct pure, giving us contiguous chunks of memory that contained strings.
As I say this worked very well but was messy under the hood and challenging to maintain.
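To make the maintenance burden concrete, here is a minimal sketch of what one member of such a family might look like. It is a reconstruction under assumptions: the thread names types like `UNativeString_128` but does not show their bodies, so `INativeString`, `UNativeString_16`, and every member below are hypothetical. The code requires `AllowUnsafeBlocks`.

```csharp
using System;

// Hypothetical shared interface for to/from string conversion.
public interface INativeString
{
    int Capacity { get; }
    int Length { get; }
    void Set(string value);
}

// One member of the family. The capacity has to be a compile-time constant,
// so every other capacity needs its own near-identical struct.
public unsafe struct UNativeString_16 : INativeString
{
    private const int Size = 16;
    private int _length;                 // number of chars currently in use
    private fixed char _chars[Size];     // storage is inline in the struct

    public int Capacity => Size;
    public int Length => _length;

    public void Set(string value)
    {
        if (value is null) throw new ArgumentNullException(nameof(value));
        if (value.Length > Size)
            throw new ArgumentException("Value exceeds the fixed capacity.", nameof(value));

        fixed (char* p = _chars)
        {
            for (int i = 0; i < value.Length; i++)
                p[i] = value[i];
        }
        _length = value.Length;
    }

    public override string ToString()
    {
        fixed (char* p = _chars)
            return new string(p, 0, _length);
    }
}
```

Because the struct holds only an `int` and a fixed buffer, it satisfies the `unmanaged` constraint and keeps any containing struct unmanaged as well.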
So could we consider a new primitive type:
string(64) user_name;
for example?
Such strings could be declared locally resulting in a simple stack allocated chunk, or as members within classes/structs in which case they appear inline just like fixed buffers do...
(just to be clear I'm not seeking the capacity to be defined at runtime but at compile time, and I know my syntax won't work but wanted to convey the idea).