
Reduce repeating parameters #28768

Closed
JeremyMahieu opened this issue Aug 17, 2022 · 19 comments
Labels
closed-no-further-action · customer-reported

Comments

@JeremyMahieu

JeremyMahieu commented Aug 17, 2022

When doing a query where multiple values are inserted, these values are turned into parameters and added multiple times, even if the values are the same.
[screenshot: generated SQL where the same value is added as several separate parameters]

It would be a shorter query, less network data and shorter logs if the parameters were not repeated, and potentially some performance savings on the database side as well, at the cost of checking whether the parameter already exists.
[screenshot: the same SQL with duplicate values collapsed into a single reused parameter]
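The screenshots haven't survived in this text capture. As a rough, hypothetical reconstruction of the difference they showed (table and column names invented):

```csharp
// Current behavior: every value slot gets its own parameter,
// even when the values are identical:
//
//   INSERT INTO [Orders] ([CustomerId], [BillingCustomerId], [Status])
//   VALUES (@p0, @p1, @p2);
//   -- @p0 = 42, @p1 = 42, @p2 = 'Open'
//
// Proposed behavior: identical values share a single parameter:
//
//   INSERT INTO [Orders] ([CustomerId], [BillingCustomerId], [Status])
//   VALUES (@p0, @p0, @p1);
//   -- @p0 = 42, @p1 = 'Open'
```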

@roji
Member

roji commented Aug 18, 2022

This is more complicated than it looks.

For example, two different parameters can have the same .NET value, but have a different facet on them (size, precision, scale); in that case we cannot deduplicate them as above. You can start comparing facets as well, but in #28769 you're also slowing down the general case for everyone, since every time a parameter is added you iterate through the entire parameter list to do the check; I don't think this makes sense just for the case where a parameter value is duplicated (that's generally rare). This could maybe be done more efficiently, but the complexity and overhead are probably not worth it; most parameter values are very short (number, short string).
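To make the trade-off concrete, here is a standalone sketch (not EF Core's actual code) of deduplication that avoids the linear scan in #28769 by hashing on the value plus its facets. Even this version adds hashing overhead to every parameter added, and parameters with equal values but different facets still cannot be merged:

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using Microsoft.Data.SqlClient;

static class ParameterDedup
{
    // The key includes the facets, not just the value: two parameters with
    // the same value but a different Size/Precision/Scale must stay distinct.
    private record struct Key(object? Value, SqlDbType Type, int Size, byte Precision, byte Scale);

    // Returns the placeholder name to use in the SQL, reusing an existing
    // parameter when an identical (value + facets) one was already added.
    public static string GetOrAdd(
        SqlCommand command,
        Dictionary<Key, string> seen, // one dictionary per command
        object? value, SqlDbType type,
        int size = 0, byte precision = 0, byte scale = 0)
    {
        var key = new Key(value, type, size, precision, scale);
        if (seen.TryGetValue(key, out var existing))
            return existing; // duplicate: reference the earlier parameter

        var name = "@p" + command.Parameters.Count;
        var parameter = command.CreateParameter();
        parameter.ParameterName = name;
        parameter.SqlDbType = type;
        parameter.Size = size;
        parameter.Precision = precision;
        parameter.Scale = scale;
        parameter.Value = value ?? DBNull.Value;
        command.Parameters.Add(parameter);
        seen[key] = name;
        return name;
    }
}
```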

One additional thing to keep in mind is that some providers don't support referencing the same parameter twice from SQL (e.g. ODBC uses ? as the parameter placeholder). This in itself isn't a blocker (we could disable this optimization there, and we probably don't support those scenarios well anyway), but it's yet more added complexity.
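For concreteness, the placeholder difference (generic SQL, not taken from the issue):

```csharp
// Named placeholders (e.g. SQL Server) can reference one parameter twice:
//
//   SELECT * FROM t WHERE a = @p0 OR b = @p0   -- one parameter, two uses
//
// ODBC-style positional placeholders cannot: each '?' consumes the next
// parameter in order, so the same value has to be sent twice:
//
//   SELECT * FROM t WHERE a = ? OR b = ?       -- two parameters required
```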

@JeremyMahieu
Author

Right. I thought this would be better as a postprocessing step: right before building the final query, substituting parameter names by using the string version of the values that would end up in the query. But then, what if someone relied on the parameter names?

So it's not significant enough of an optimisation to warrant the cost/complexity...

@roji
Member

roji commented Aug 18, 2022

One more point here is that if someone really is sending duplicate parameter values a lot, they should probably reconsider their schema and possibly redesign it in a way which avoids that. In other words, parameter deduplication only saves network bandwidth, but everything is still duplicated in the database itself. So I'm again not so sure this is a good idea.

@JeremyMahieu
Author

JeremyMahieu commented Aug 18, 2022

Where I noticed this for my case, these were foreign keys.
Also, these could be inserts of multiple rows, meaning you'd get queries with more parameters than there are columns. I don't know where the cutoff of using insert/update statements to modify many rows is, but it could be a big percentage of some types of queries.

Are there any standardized cases where we could measure how much of an impact this could have?

@ajcvickers
Member

> Where I noticed this for my case, these were foreign keys.

@roji This might be interesting as a specific optimization in the update pipeline. All the FK values that are for the same column and come from a relationship with the same principal (hence the same value) could be a single parameter in the batch.

@roji
Member

roji commented Aug 18, 2022

> This might be interesting as a specific optimization in the update pipeline. All the FK values that are for the same column and come from a relationship with the same principal (hence the same value) could be a single parameter in the batch.

That's an interesting idea - you're thinking of a scenario where we insert a principal with many dependents, right?

One problem is that this wouldn't work with the new batching API, where the same parameter set isn't shared across different SQL statements (each batch command has its own parameter set). The exception to that is SQL Server bulk insert mode, where we use a single command with MERGE, so that would still work even with the new batching API.
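A hedged sketch of that constraint using Npgsql, which already implements the new DbBatch API (table, column, and connection string are invented):

```csharp
using Npgsql;

var connectionString = "Host=localhost;Database=test"; // assumed
await using var conn = new NpgsqlConnection(connectionString);
await conn.OpenAsync();

// Each batch command carries its own parameter collection, so the shared
// FK value is re-sent for every statement; it can't be a single parameter.
const int sharedFk = 42;
await using var batch = new NpgsqlBatch(conn);
for (var i = 0; i < 2; i++)
{
    var cmd = new NpgsqlBatchCommand("INSERT INTO dependent (principal_id) VALUES (@fk)");
    cmd.Parameters.Add(new NpgsqlParameter("fk", sharedFk)); // duplicated per command
    batch.BatchCommands.Add(cmd);
}
await batch.ExecuteNonQueryAsync();
```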

Also, foreign keys tend to be particularly small (typically an int, maybe a GUID), so the gains may not be worth it.

@JeremyMahieu

> Also, these could be inserts of multiple rows, meaning you'd get queries with more parameters than there are columns. I don't know where the cutoff of using insert/update statements to modify many rows is, but it could be a big percentage of some types of queries.

Do you mean a scenario where you're inserting many rows in the same SaveChanges which largely have the same data in them? That seems like a very specific scenario...

In any case, I'm not sure about the relevance of the "cutoff of using insert/update statements to modify many rows"... The SQL Server provider does use MERGE when inserting multiple rows (but not when updating), but regardless of whether INSERT or MERGE is used, the parameters are essentially the same.

@JeremyMahieu
Author

JeremyMahieu commented Aug 18, 2022

> That's an interesting idea - you're thinking of a scenario where we insert a principal with many dependents, right?

Yes, or moving dependents from one principal to another. For example, 20 students change to a new class at the end of a semester: 20 student rows would get their ClassId column changed, and for this, 20 duplicate parameters would be created.
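For concreteness, a sketch of that scenario (entity and property names invented):

```csharp
// 20 students move to a new class. Each UPDATE in the SaveChanges batch
// gets its own ClassId parameter, all carrying the same value.
var students = await context.Students
    .Where(s => s.ClassId == oldClassId)
    .ToListAsync();

foreach (var student in students)
    student.ClassId = newClassId; // identical value for all 20 rows

await context.SaveChangesAsync(); // ~20 duplicate ClassId parameters today
```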

> …whether INSERT or MERGE is used, the parameters are essentially the same.

Ah, so large dataset changes would suffer from the same problem. Sorry, I'm not too familiar with everything. Disregard the "cutoff" comment; I assumed large operations wouldn't work with similar queries.

Another scenario where this might happen is in select statements where you use multiple Where(o => names.Contains(o.name)) filters. This names list would end up being serialized into a parameter, which could potentially be very long and used multiple times.
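Roughly this shape (query invented; whether and how the list is sent as a parameter depends on the provider and EF version):

```csharp
// The same potentially long `names` list appears in two predicates.
var names = new List<string> { "alice", "bob" /* ...possibly thousands... */ };

var results = await context.Documents
    .Where(d => names.Contains(d.CreatedBy))
    .Where(d => names.Contains(d.ModifiedBy)) // the list is sent again
    .ToListAsync();
```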

@JeremyMahieu
Author

> the same parameter set isn't shared across different SQL statements (each batch command has its own parameter set)

Use a temp parameter table to be used across the transaction. 🙃

@roji
Member

roji commented Aug 18, 2022

> Another scenario where this might happen is in select statements where you use multiple Where(o => names.Contains(o.name)) filters. This names list would end up being serialized into a parameter, which could potentially be very long and used multiple times.

Not sure I understand this one - where would this names list be used multiple times? Each query you execute is its own command (and round trip), with its own parameters; so you wouldn't be able to share the same parameter across multiple queries. In theory, when we enable query batching one day (either for the built-in split query scenario (#10878), or via a user-facing batching API (#10879)), then that could become meaningful. But it would mean the batched queries have to again resend the same parameter, i.e. two batched queries which happen to do the same names.Contains(o.Name) with the same names parameter. That seems like an extremely narrow scenario, and as above, it would also be incompatible with the new ADO.NET batching API which we're planning to implement, where each command has its own parameter set.

@JeremyMahieu
Author

JeremyMahieu commented Aug 18, 2022

Also think about default values.
One real-life example I have is an insert of 8 rows, with 21 parameters each: 168 parameters in total.
Most common values:

  • 48 times: False
  • 16 times: '0'
  • 16 times: NULL (DbType = Guid)
  • 8 times: NULL (DbType = Int32) (not sure if this could be grouped with the previous one)
  • 8 times: 'MyRedactedEnum' (Nullable = false) (Size = 4000)
  • 8 times a string that represents a barcode: 'abcdefg' (Size = 4000)
  • 8 times some float: 1.510000228881836
  • 8 times a default date: 0001-01-01T00:00:00.0000000
  • 8 times the same guid
  • The rest is unique data

So these 128 parameters could be reduced to 8 or 9 parameters, which would more than halve the query length.
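Working through those counts:

```csharp
// duplicated slots: 48 + 16 + 16 + 8 + 8 + 8 + 8 + 8 + 8 = 128
// after dedup:      9 distinct values (8 if the two NULLs can merge)
//                   + (168 - 128) = 40 unique parameters
//                   = ~49 parameters instead of 168
```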

@ErikEJ
Contributor

ErikEJ commented Aug 18, 2022

@JeremyMahieu Have you measured the impact on performance (via ADO.NET)?

@JeremyMahieu
Author

JeremyMahieu commented Aug 18, 2022

> @JeremyMahieu Have you measured the impact on performance (via ADO.NET)?

No, nothing of this nature. Could I use the benchmarks that are in the EFCore project for this? I'd like some sort of standardized scenario.

@JeremyMahieu
Author

> where would this names list be used multiple times?

You're right, I have never done this; it would be very rare/specific.

@roji
Member

roji commented Aug 18, 2022

> Could I use the benchmarks that are in the EFCore project for this? I'd like some sort of standardized scenario.

Not really - this is a very specific scenario; none of our benchmarks specifically check for repeated parameter values. But it should be trivial to do an ADO.NET benchmark (best to leave EF out of it) with BenchmarkDotNet, which measures the actual impact of duplicated vs. deduplicated parameters. I suspect you won't see a very meaningful difference.
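A hedged sketch of such a benchmark (connection string, query, and values are invented; both methods are dominated by the round trip, which is why the difference is likely to be small):

```csharp
using BenchmarkDotNet.Attributes;
using Microsoft.Data.SqlClient;

public class ParameterDedupBenchmark
{
    private const string ConnectionString = "Server=localhost;Integrated Security=true"; // assumed
    private SqlConnection _connection = null!;

    [GlobalSetup]
    public void Setup()
    {
        _connection = new SqlConnection(ConnectionString);
        _connection.Open();
    }

    [Benchmark(Baseline = true)]
    public int Duplicated()
    {
        using var cmd = _connection.CreateCommand();
        cmd.CommandText = "SELECT @p0 + @p1 + @p2 + @p3"; // four parameters, same value
        for (var i = 0; i < 4; i++)
            cmd.Parameters.Add(new SqlParameter("@p" + i, 42));
        return (int)cmd.ExecuteScalar();
    }

    [Benchmark]
    public int Deduplicated()
    {
        using var cmd = _connection.CreateCommand();
        cmd.CommandText = "SELECT @p0 + @p0 + @p0 + @p0"; // one parameter, four uses
        cmd.Parameters.Add(new SqlParameter("@p0", 42));
        return (int)cmd.ExecuteScalar();
    }

    [GlobalCleanup]
    public void Cleanup() => _connection.Dispose();
}
```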

@stevendarby
Contributor

Sending the additional parameters probably doesn't have a significant performance impact, but the number of rows inserted per batch does have an impact on performance, and the maximum you can insert per batch is limited by the number of parameters you can include in one command.

The example above of 21 parameters per row means inserting a max of ~100 rows at a time. It's possible a larger batch would be more optimal, if a reduced number of parameters allowed it.
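The arithmetic behind that estimate (SQL Server allows roughly 2100 parameters per command):

```csharp
Console.WriteLine(2100 / 21); // 100 rows per batch with 21 parameters per row
```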

@roji
Member

roji commented Aug 23, 2022

@stevendarby that's theoretically true, though in practice the 2100-parameter limit is rarely an actual limiting factor for batching.

The long-term plan for the 2100 parameter limit is to switch to using the new ADO.NET batching API (#18990); this would mean that instead of a single multistatement command, where all statements share a single parameter collection (as today), we'd batch together multiple commands, each with its own parameter collection. This would mean that the 2100 parameter limit will apply to a single statement, rather than to the entire batch. Unfortunately, for this to work, both EF and SqlClient would have to implement the batching API, which will likely take some time.

In any case, trying to increase batch sizes by deduplicating parameter values would only work for the very specific cases where values are actually duplicated. While this could help some specific scenarios, it overall seems like a questionable reason to go into parameter deduplication, which has complexity and performance overhead costs that everyone would need to pay.

@stevendarby
Contributor

stevendarby commented Aug 23, 2022

Sure, it's a consideration for @JeremyMahieu's benchmarking if they do end up doing that.

@stevendarby
Contributor

FWIW there is a related issue for reducing parameters in queries - #24476. This issue seems distinct from it, but just mentioning in case it's worth considering them together.

@ajcvickers
Member

Note from triage: we don't plan to do anything here as of now. Compelling perf analysis could change that.

@ajcvickers closed this as not planned Aug 30, 2022
@ajcvickers added the closed-no-further-action label and removed the type-enhancement label Aug 30, 2022