Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better API for ADO.NET command batching #15375

Closed
roji opened this issue Oct 7, 2015 · 40 comments
Closed

Better API for ADO.NET command batching #15375

roji opened this issue Oct 7, 2015 · 40 comments
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Data
Milestone

Comments

@roji
Copy link
Member

roji commented Oct 7, 2015

ADO.NET currently supports batching several statements in one command by packing them into a single string as follows:

cmd.CommandText =
        "UPDATE message SET text = $text1 WHERE id = 1;" +
        "UPDATE message SET text = $text2 WHERE id = 2";

While workable, it's a problematic API. The PostgreSQL protocol, for example, needs to send each single SQL statement as a separate protocol message, forcing Npgsql to split CommandText by semicolon at the client side, a pretty complex task (need to understand when the semicolon is quoted, etc.).

It would be much better to provide a list-based API, where the user provides, say, an array of strings. A specific provider can still choose to join by semicolon if needed, but this should be done at the provider level rather than by the user.

For backwards compatibility, the CommandText property would still support the legacy semicolon batching mode. A boolean LegacyBatchingMode connection string parameter, true by default, could be turned off to disable it (thereby removing the client-side SQL parsing in Npgsql, etc.).

@kkurni
Copy link
Contributor

kkurni commented Oct 30, 2015

@roji Does PostgreSQL support SQL Batching ?

It is used for SQL Batching support.
We sent the whole SQL Statement to Server and sql server will understand that statement.
https://msdn.microsoft.com/en-us/library/vstudio/haa3afyz(v=vs.100).aspx

e.g "select 1; update *...; select 2;"
Sql Server will return 2 results set.
How this will work in PostgreSQL ?

@roji
Copy link
Member Author

roji commented Oct 30, 2015

@kkurni, here are some details. PostgreSQL supports two network protocols: simple and extended (here are the detailed docs).

In the simple protocol you can send a single message with "select 1; update *...; select 2;", PostgresSQL will parse out the multiple statements and return 2 result sets. However, the simple protocol has several shortcomings: no binary data, no parameter support, no prepared statements. For these reasons, Npgsql 3.x switched to the extended protocol.

In the extended protocol you send a series of message - Parse, Describe, Bind, Execute, Sync. This allows you to do everything, but you can send only one SQL statement each time. You can still pack multiple messages into a single packet, achieving good network performance.

Npgsql 3.x does allow users to set DbCommand.CommandText to "select 1; update *...; select 2;". However, since it uses the extended protocol it is forced to parse the statements client-side, and send multiple sets of messages in a single packet (2 Parse messages each with a single statement, 2 Describe messages, and so on).

To summarize, PostgreSQL does support SQL batching, but requires you to submit the different SQL statements in separate protocol messages (if you use the extended protocol). Therefore, the ADO.NET batching API, which requires you to pack multiple statements into a single string with semicolon separators, is unsuitable...

Let me know if you need any more details!

@kkurni
Copy link
Contributor

kkurni commented Nov 3, 2015

Thanks @roji for providing us with this detail.

This specification seems to be a provider specific.
As there are other providers which is fine to use CommandText e.g SqlClient.
Do you have any other provider which require this ?

If we end up implementing your request, this will be breaking changes for other provider as well.
This will be much simpler for specific provider to override this if they need to.
E.g

    public class MyCommand : DbCommand
    {
        public List<string> CommandList
        {
            get;
            set;
        }

        private string _commandText;
        public override string CommandText
        {
            get { return _commandText; }
            set
            {
                if (IsExtendedQuery(value))
                {
                    throw new InvalidOperationException("Use CommandList for extended Query");
                }
                _commandText = value;
            }
        }

        private bool IsExtendedQuery(string query)
        {
            //check if it is extended query
            return false; //or true
        }
    }

@roji
Copy link
Member Author

roji commented Nov 3, 2015

@kkurni, I think I didn't explain myself clearly...

The PostgreSQL extended vs. simple protocol isn't really relevant here, it was just to give more context. The problem is that ADO.NET's current batching API assumes that provider drivers can simply send a single string with multiple SQL statements and semicolons, which the backend will take care of splitting and executing. This simply isn't the case for PostgreSQL (extended protocol).

The solution could simply be to add a string[] CommandTexts property to ADO.NET's DbCommand. The default setter implementation for this property would simply be to join the array on semi-colon and set the DbCommand's CommandText to it - this would allow people to use the new batching API with providers such as SqlClient, without a single change to those providers. In Npgsql's case, however, I would override this property and construct the PostgreSQL protocol messages directly from the array, without any sort of semicolon parsing or joining.

This would provide a structured, fully-flexible batching API while fully preserving all existing ADO.NET providers.

Please let me know if this is unclear in any way.

@kkurni
Copy link
Contributor

kkurni commented Nov 3, 2015

Thanks @roji for clarifying this.

by adding string[] CommandTexts will add more complexity for other providers which doesn't need it.
e.g How will you manage CommandTexts and CommandText value ?
Which one will be the source of truth ?

cmd.CommandTexts = new string[] {"select 1;", "select 2;"};
cmd.CommandText = "select 3;";
Assert.Equals(String.Join(cmd.CommandTexts), cmd.CommandText); //this will be false.

We can potentially add a flag like you suggested above boolean LegacyBatchingMode, which will decide CommandText or CommandTexts as source of truth or throw exception if developer set the wrong setter.
But again this will add additional complexity for other providers which doesn't need this.

It will be simpler to add string[] CommandTexts only to provider which need this. E.g PostGreSQL.
Due to specific requirement of PostGreSQL, your custom provider can handle this separately without impacting on other existing provider.

@roji
Copy link
Member Author

roji commented Nov 3, 2015

The addition of a CommandTexts property definitely adds a little of complexity (as always when something is added), but I'm not sure this complexity is added to other providers. If, as I proposed above, the default implementation of this property is to simply set and get the CommandText property, then things are extremely simple - there is only one CommandText value, and CommandTexts reads and writes to it). No need for any sort of legacy flag - the last set of either CommandText or CommandTexts will be the definitive one.

So again, unless I'm missing something this adds a useful, more structured batching API without complexifying existing providers in any way.

Of course I could add CommandTexts to Npgsql, but then I'm forcing people to write Npgsql-specific code in order to do batching. The point is that ADO.NET is supposed to be a provider-independent API, but its batching API models things in a way which isn't.

@kkurni
Copy link
Contributor

kkurni commented Nov 3, 2015

How will CommandTexts read back the value from CommandText ?
Will it need Split CommandText to return them as a list ?
This will push back the complexity from split those CommandText in the first place.

Except that we only allow to SetCommandTexts without able to read them back.
e.g cmd.SetCommandText(string[])

@roji
Copy link
Member Author

roji commented Nov 3, 2015

Providing a set-only function definitely could work, as you suggest.

@kkurni kkurni assigned YoungGah and unassigned kkurni Nov 4, 2015
@roji
Copy link
Member Author

roji commented Dec 1, 2015

@YoungGah, please let me know if you have any questions on this. It's extremely problematic to require providers to fully parse SQL only because their database protocol doesn't allow sending semicolon-joined statements.

@YoungGah
Copy link

YoungGah commented Dec 1, 2015

@roji, I understand the issue and I agree that the API is not optimal. In the past, we discussed providing a better API (similar to what you proposed) within the owning team, but due to complexity of error handling and other higher priority works, we never got to add the new API. We will evaluate and prioritize this request post V1, but the ask is clear and make sense. We will update on this request after we release V1.

@roji
Copy link
Member Author

roji commented Dec 2, 2015

Great, thanks @YoungGah. This is obviously not high-priority - existing providers such as Npgsql already have some sort of answer to the problem. But it's an important long-term goal to be able to cut out SQL parsing.

@roji
Copy link
Member Author

roji commented Dec 25, 2015

There's one more related issue with the ADO.NET batching API: managing parameters. Parameters are currently managed as a collection on the DbCommand - they are therefore "common" to all statements being sent in a batch; there's no association between a parameter and a specific statement in a batch.

Here's why this is a problem. In the PostgreSQL wire protocol, each statement (whether batched or not) comes with its own parameters - there's no way to send a single parameter and then to reference it from two different statements. This makes ADO.NET's "one parameter collection for all batched statements" approach totally incompatible with PostgreSQL. The current workaround is, again, to parse the statement SQLs, identify parameter placeholders, and send each statement in the batch with its own set of parameters.

If and when you guys get around to working on this, I'd recommend you review how the JDBC API works, it may provide some answers.

@NickCraver
Copy link
Member

@roji I'm curious here: for what range do you consider this API useful? By that I mean, what's the reasonable upper-bound before we're into bulk insert territory anyway? I'm thinking about how Dapper would support these scenarios across providers as well and just trying to gain an idea of expected use.

@roji
Copy link
Member Author

roji commented Jan 23, 2016

That's a very good question - there's definitely overlap between bulk insert and command batching. One key thing about batching is the ability to do anything, not just insert - I can batch DELETEs, UPDATEs, or even stored procedure calls. On the other hand, bulk insert is supposed to provide a way of inserting even more optimized than batching INSERT commands.

My main goal here is to make sure that ADO.NET's batching capabilities are more compatible with databases other than SqlServer. After all, a batching "API" already exists in ADO.NET (if you can call shoving a semicolon in the text an API); but while it works well for SqlServer, it's not well-suited for PostgreSQL, for example.

Does that make sense?

@roji
Copy link
Member Author

roji commented Apr 24, 2016

Note: have opened npgsql/npgsql#1042 to handle this specifically in Npgsql, in case you guys don't get around to working on it.

@roji
Copy link
Member Author

roji commented Jul 18, 2016

FYI, I'm working on implementing this for PostgreSQL, i.e. a structured statement API within DbCommand which doesn't depend on semicolon-delimited statements packed inside CommandText. The API looks like this:

using (var cmd = new NpgsqlCommand
{
    Connection = conn,
    Statements =
    {
        new NpgsqlStatement
        {
            SQL = "SELECT $1",
            InputParameters = { new NpgsqlParameter { Value = 8 } }
        },
        new NpgsqlStatement
        {
            SQL = "SELECT $1",
            InputParameters = { new NpgsqlParameter { Value = 9 } }
        }
    }
})

Note how each NpgsqlStatement has its own parameter list, this again corresponds to how the PostgreSQL protocol actually represents statements (parameter per statement rather than a shared parameter list for all statements of a command).

To decide which mode the command uses, i.e. whether it looks at CommandText or at the structured Statements list, it's simply a matter of which API is used last. Setting CommandText sets a flag one way, and modifying the Statements list in any way sets it the other way.

Any comments would be welcome, I'd love to see something like this in standard ADO.NET one day.

@fs7744
Copy link

fs7744 commented Jul 31, 2016

in abo.net , we can use System.Data.SqlClient.SqlCommandSet for batch insert

although SqlCommandSet actually internal

    SqlCommandSet commandSet = new SqlCommandSet();
    commandSet.Connection = conn;
    commandSet.Transaction = trans;

    for (int i = 0; i < data.Length; i++)
    {
        SqlCommand cmd = conn.CreateCommand();
        cmd.CommandText = @"insert into product(name, category, discontinued, id) values(@p1, @p2, @p3, @p4)";
        cmd.CommandType = System.Data.CommandType.Text;
        cmd.Parameters.AddWithValue("@p1", data[i][0]);
        cmd.Parameters.AddWithValue("@p2", data[i][1]);
        cmd.Parameters.AddWithValue("@p3", 0);
        cmd.Parameters.AddWithValue("@p4", Guid.NewGuid());
        commandSet.Append(cmd);
    }
    totalEffected += commandSet.ExecuteNonQuery();

@roji
Copy link
Member Author

roji commented Aug 1, 2016

@fs7744 thanks for commenting, this is the first time I hear of SqlCommandSet! This could indeed be a way forward, but this class seems to be internal? How are you using it?

@roji
Copy link
Member Author

roji commented Aug 1, 2016

Oh sorry, I missed that you actually said that. Any idea what this class is used for?

@fs7744
Copy link

fs7744 commented Aug 1, 2016

in abo.net's design, SqlCommandSet is actually for SqlDataAdapter, so it's internal

someone like use SqlCommandSet when without DataAdapters

like http://stackoverflow.com/questions/402630/batching-in-ado-net-without-dataadapters

You cane see SqlDataAdapter how to use SqlCommandSet in SqlDataAdapter.cs

@roji
Copy link
Member Author

roji commented Aug 1, 2016

Well, it's true that this class is an API for performing batched operations without CommandText concatentation. However, it's not clear what this contributes to the overall problem - the class isn't a public API...

@YoungGah, do you have any guidance here? I'd hate to go and implement my API in Npgsql just to change things again later. Does the team have any direction in mind, or at least an idea whether this will be tackled at some point?

@ngbrown
Copy link

ngbrown commented Aug 21, 2016

NHibernate, for example, got around SqlCommandSet being internal by using reflection to expose it: /src/NHibernate/AdoNet/SqlClientSqlCommandSet.cs.

@roji
Copy link
Member Author

roji commented Aug 24, 2016

I wonder if there's really a performance benefit associated with using SqlCommandSet as opposed to simply appending multiple SQL statements into a single command text (i.e. INSERT...; INSERT)...

@roji
Copy link
Member Author

roji commented Sep 14, 2016

@YoungGah, it's been a long while since someone from Microsoft commented on this. Is it possible to get some visibility on whether the ADO.NET API is going to get any attention soon?

I'm specifically looking for some guidance on what to do here, and I'd like to avoid going down a path that wouldn't be compatible with ADO.NET later. Specifically, SqlClient appears to support batching via two separate APIs - a single command with semicolons in its CommandText, and the internal SqlCommandSet. As I said above, I have a branch with a an NpgsqlRawCommand, which itself can contain multiple NpgsqlStatements - but I'd be happy to have a serious conversation about where this is going before merging this kind of thing.

@bgrainger
Copy link
Contributor

MySQL has a similar problem/opportunity.

There is a text protocol that allows multiple commands to be sent, and returns multiple result sets. However, it doesn't support parameter binding, prepared statements, and transmits all data types in a textual format. Then there is a binary protocol that supports parameter binding, prepared statements, and an efficient binary resultset; however, it only supports one command at a time.

As a result, in order to use the binary protocol, the client has to be restricted to only executing a single command at a time (which is inefficient in terms of network round-trips), or the connector library has to parse the SQL and split it into multiple commands (which runs into all the problems @roji has mentioned with SQL parsing, parameter management, etc.).

@roji
Copy link
Member Author

roji commented Nov 6, 2017

@bgrainger that's interesting, the MySQL protocol situation seems to be exactly identical to PostgreSQL's protocol situation... originally there was only a simple protocol which support multiple statements in the same protocol message, but no parameters, prepared statements or binary data representation. Then they added the "extended protocol", which supports all the above features but does one-statement-per-message. Statements can still easily be batched by packing several messages into the same packet (which is what Npgsql does), but ADO.NET forces Npgsql to parse since it doesn't has a proper batching API and forces users to shove multiple statements into the same CommandText.

I really hope you guys at Microsoft start looking at ADO.NET soon, these issues have been around for quite a while and the API has stagnated.

@bgribaudo
Copy link
Contributor

bgribaudo commented Nov 6, 2017

There's actually a MS SQL Server use case for this feature request, as well.

With SQL Server and ADO.Net, as has already been pointed out, it's currently possible to build a single command consisting of multiple SQL statements.

However, the same doesn't hold true for stored procedures: It's currently not possible to invoke multiple stored procedures using a single CommandType = CommandType.StoredProcedure command. The wire protocol used for client-to-SQL Server communications (Tabular Data Stream [TDS]) supports this scenario (allowing a single execute request to invoke multiple stored procedures) but ADO.Net/SqlCommand's API doesn't expose the functionality to take advantage of it.

Implementing this feature request could make this use case possible.

Keep in mind that, if a query uses parameters, SqlCommand converts the query into a stored procedure call behind-the-scenes. So, supporting this use case would affect a lot more than just direct user invocations of stored procedures--it would allow more efficient transmission of sets of queries that use parameters.

@dario-l
Copy link

dario-l commented Dec 11, 2017

For batching I'm using solution from NHibernate.

Unfortunately on .net core is not implemented. SqlCommandSet type doesn't exist.

PS
Compiled on netstandard2 and run over framework 4.6.1 works.

@roji
Copy link
Member Author

roji commented Dec 11, 2017

@dario-l SqlCommandSet is an SqlClient-specific API, and is not even exposed publicly. The point here would be to add a standard API in ADO.NET, which may or may not end up resembling SqlCommandSet.

@dario-l
Copy link

dario-l commented Dec 12, 2017

@roji That's the point. I must use hacky way to do things performant. There should be the way to do this with official standard API.

If batching is not implemented in .net core yet then, maybe, it is a chance to do this properly even if this requires some changes/extensions in API.

@roji
Copy link
Member Author

roji commented Dec 13, 2017

@dario-l, unless I'm mistaken you can do batching with the official standard API, as described above - you just pack your multiple SQL statements into a single string and set CommandText to it. So the issue is not that an API is lacking, it's that it's unstructured and therefore problematic.

@dario-l
Copy link

dario-l commented Dec 13, 2017

@roji Yes, I can but this leads to rebuild query by database server every time it will change and will be not cached. I've tested that scenario.

@roji
Copy link
Member Author

roji commented Dec 13, 2017

What do you mean when you say that the query "will change" and have to be rebuilt? If you properly use parameters why would the query have to change?

@dario-l
Copy link

dario-l commented Dec 14, 2017

"UPDATE X SET Y = 0 WHERE ID = @id;UPDATE X SET Y = 1; WHERE ID = @id;" != "UPDATE X SET Y = 0 WHERE ID = @id;"

I have better performance through SqlCommandSet and batching.

@roji
Copy link
Member Author

roji commented Dec 14, 2017

@dario-l that's interesting. I wonder if you have the same speed difference if the queries are not identical between them (i.e. two different updates).

In any case, we both agree and are asking for the same thing - a standard, structured API in ADO.NET for batching.

@roji
Copy link
Member Author

roji commented Dec 9, 2018

@roji
Copy link
Member Author

roji commented Dec 9, 2018

It's important to note that batching via CommandText with semicolons also seems to have some pretty serious drawbacks even for SqlServer, as described here: https://blogs.msdn.microsoft.com/dataaccess/2005/05/19/does-ado-net-update-batching-really-do-something/ (also see dotnet/SqlClient#19).

An ADO.NET batching API would permit exposing SqlCommandSet in a portable way and boost performance there as well.

@fubar-coder
Copy link
Contributor

fubar-coder commented Jan 30, 2019

The SQL Anywhere ADO.NET driver solved this problem quite elegantly:

  • Create a single DbCommand
  • Add n*m parameters (with n=number of parameters for a single insert and m=number of rows)
  • Execute

Example:

cmd.CommandText = "UPDATE message SET text = @text WHERE id = @id";
cmd.Parameters.Add(new SqlParameter("text", "text1");
cmd.Parameters.Add(new SqlParameter("id", 1);
cmd.Parameters.Add(new SqlParameter("text", "text2");
cmd.Parameters.Add(new SqlParameter("id", 2);

EDIT: Added example

@roji
Copy link
Member Author

roji commented Jan 30, 2019

One problematic issue with this approach is that it doesn't allow you to batch different statements, only reexecute the same statement with different parameter sets (IIRC the JDBC batching API has the same limitation). One of the goals here is to allow batching of arbitrary statements together (different types of inserts, updates and even selects).

Another limitation is that it's difficult to imagine how information on a specific statement in the batch would be returned. For example, it's important to be able to return the records affected for each individual statement in the batch, but this doesn't seem possible without a more structured API that has an explicit idea of a batch and a contained statement.

Nevertheless the approach above is elegant within the confines of the current API.

@roji
Copy link
Member Author

roji commented Feb 6, 2019

Superceded by dotnet/corefx#35135

@roji roji closed this as completed Feb 6, 2019
@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 4, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Data
Projects
None yet
Development

No branches or pull requests