Slow Handling of executemany() #120
See issue #62. I believe one of the problems is that "parameter arrays" require the input data to be converted to a fixed-width encoding, i.e. NOT UTF-8 or UTF-16, because the data has to be stored temporarily in a block of memory, in grid-like fashion. This means converting the input data to something like UCS-2 (with a small but significant possibility of data loss/corruption), and having to figure out how big the grid should be. Not an easy problem to solve, but it would certainly help me in the work I do.
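To illustrate the problem (this is a sketch of the concept, not pyodbc internals), here is why a fixed-width grid layout forces that encoding choice; the row data is invented:

```python
# Why a parameter array needs a fixed-width grid: every value in a
# column must occupy the same number of bytes, sized to the widest row.
rows = [("hi",), ("a much longer string",), ("mid",)]

# A fixed-width encoding makes each cell's size predictable; UCS-2
# (UTF-16-LE minus surrogate pairs) is 2 bytes per code unit, which is
# where the possibility of data loss comes from for non-BMP characters.
encoded = [r[0].encode("utf-16-le") for r in rows]
cell_width = max(len(e) for e in encoded)  # the widest value sizes the grid

# Lay the column out as one contiguous, zero-padded block of memory,
# which is what a column-wise ODBC parameter array expects.
grid = bytearray(cell_width * len(encoded))
for i, e in enumerate(encoded):
    grid[i * cell_width : i * cell_width + len(e)] = e

print(f"{len(rows)} rows x {cell_width} bytes = {len(grid)} byte grid")
```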
Is there a workaround that we can use for now? Someone mentioned using a transaction in the Google Code link -- I'm using that and it doesn't help. In issue #62, loading 250 records at a time was suggested; I can try that. Did anything work for you?
I use the bcp bulk load utility - a lot. Even for routine updates and inserts, I write the updates to a file, bcp them to a temporary table, then update the target table in one transaction. It's a very roundabout way of doing updates, but if you're doing millions of updates to tables containing hundreds of millions of rows, it's the fastest way. (I wrote Python modules to do all the heavy lifting, but they are company-proprietary and I can't publish them - sorry.) Using a separate transaction won't make a blind bit of difference. It's all about the number of round trips to the database server. With a parameter array, there's one round trip for all the updates. Without one, there's one round trip per updated row - very, very slow.
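A rough sketch of that bcp workflow (the proprietary modules aren't available, so all table, file, and server names here are hypothetical):

```python
import csv
import subprocess

import pyodbc

rows = [(1, "alice"), (2, "bob")]  # the pending updates

# 1. Write the updates to a flat file.
with open("updates.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# 2. Bulk load the file into a staging table with the bcp utility
#    (-T: trusted connection, -c: character format, -t,: comma delimiter).
subprocess.run(
    ["bcp", "dbo.staging_updates", "in", "updates.csv",
     "-S", "myserver", "-d", "mydb", "-T", "-c", "-t,"],
    check=True,
)

# 3. Apply all updates to the target table in one set-based statement.
cnxn = pyodbc.connect("DSN=mydb")
cnxn.execute("""
    UPDATE t SET t.name = s.name
    FROM dbo.target t
    JOIN dbo.staging_updates s ON s.id = t.id
""")
cnxn.commit()
```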
Kicking this issue to keep it alive.
@dagostinelli The link you provided did not suggest using a transaction; it suggested an explicit batch. You are limited to a certain number of parameters per query (2,100 when I looked), so how many statements you can combine depends on the specific insert you are doing. It's a workaround for sure, but it was much faster than executemany() when I tested it here (I only tested a single batch of 699 insert statements with 3 parameters per insert):
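The original snippet did not survive; a sketch of the explicit-batch workaround described above, with hypothetical table and column names:

```python
import pyodbc

cnxn = pyodbc.connect("DSN=mydb")  # hypothetical DSN
cursor = cnxn.cursor()

rows = [(i, f"name{i}", i * 1.5) for i in range(50_000)]
params_per_row = 3
# Leave headroom under SQL Server's 2,100-parameter cap; 699 rows of
# 3 parameters each matches the batch size tested above.
rows_per_batch = (2100 - 1) // params_per_row  # 699

for start in range(0, len(rows), rows_per_batch):
    batch = rows[start:start + rows_per_batch]
    # One INSERT with a VALUES clause per row, still fully parameterized,
    # so hundreds of rows go to the server in a single round trip.
    placeholders = ",".join("(?,?,?)" for _ in batch)
    sql = f"INSERT INTO mytable (id, name, score) VALUES {placeholders}"
    flat = [value for row in batch for value in row]
    cursor.execute(sql, flat)

cnxn.commit()
```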
It would not be difficult to build up an insert as you count parameters and send each batch as you go. @keitherskine's method is probably still faster, but this is simpler IMO.
Hello, I have had a similar experience with executemany() writing to an Azure SQL database - excruciatingly slow. Scouting around, I found this link: Perhaps someone more knowledgeable than me could explain why the "encapsulate" method improves performance so radically.
@BMagill It's batching the inserts so that one round trip inserts multiple rows. This is similar to the method I mention above, but he's doing it insecurely, without parameters: he's using string formatting everywhere and manually escaping values, which is vulnerable to SQL injection. This isn't code I would use at all. What is your use case?
I would suggest that people who have issues with pyodbc's executemany() speed try Turbodbc, which almost seems to have been created for this specific issue.
@woodlandhunter |
I am currently working on an implementation using the parameter array feature of ODBC, which is showing some promising results in preliminary benchmarks (time in seconds taken to insert 20K rows into a table with a single int column, SQL Server 2016 using Microsoft ODBC Driver 13). I still need to do some more testing with the other datatypes; this is just a notification that I am looking into it.
That's great news @v-chojas! The preliminary results look promising; that would make pyodbc usable for our project again (currently we write to CSV and use bulk load).
That sounds awesome! I've been meaning to look into that for a couple of years.
I have some preliminary code at https://github.com/v-chojas/pyodbc/tree/paramarray if you'd like to give it a try. @mkleehammer let me know if you'd like a PR. It has only been tested with SQL Server 2016 and the msodbcsql driver on Linux. To implement the data insertion efficiently, the changes are quite significant, so I opted to leave most of the existing code alone and have the new implementation alongside it. Notable differences/improvements are:
Performance test results: times in seconds for inserting 20K rows into a single-column table of the specified type, comparing the new executemany(), the old executemany(), and a Python execute() loop. There is at least an order of magnitude difference between the old and new executemany().
Wow. That looks really, really good. We do need to make sure it works with all of the not-so-great drivers out there, so a generic fallback will be needed. Some examples:
- Some drivers don't provide SQLDescribeParam, so lengths, etc. are not available. (Also, I tried calling it for all parameters, but it slowed things down quite a bit. We might need to watch this - perhaps separate out the executemany() implementation, which would clearly benefit?)
- Many drivers didn't get the binary numeric struct right. I think Oracle was one, but I don't have an Oracle test install right now. I spent a few hours setting up a new VM and downloading, but then didn't get it configured. PITA.
Yes, I'm aware there are drivers which don't fully implement ODBC correctly. However, it is necessary to determine the lengths and types in order to lay out the parameter array... I suppose it would be possible to let the application supply this information when the driver can't (similar to what #213 proposes), but I'm not sure of the value of that, since such a driver might not support parameter arrays either. Calling SQLDescribeParam certainly adds some overhead, and it will vary across drivers and network environments, but in my tests on SQL Server in a LAN environment, parameter arrays are already faster with as few as 10 rows. I tried it with many more parameters (128, to be exact) and the performance did not decrease; in fact it was even better, with a >100x speedup observed. But based on your experience, perhaps it would be better to default to the old execute()-in-a-loop path for executemany(), and provide an option to enable the new fast path (e.g. a fast_executemany flag on the cursor).
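For reference, later pyodbc versions did grow a way for the application to supply parameter metadata itself, in the direction #213 pointed at: cursor.setinputsizes(). A minimal sketch, assuming a hypothetical DSN and table:

```python
import pyodbc

cnxn = pyodbc.connect("DSN=mydb")  # hypothetical
cursor = cnxn.cursor()

# Supply the parameter metadata ourselves instead of relying on the
# driver's SQLDescribeParam: one (SQL type, column size, decimal digits)
# tuple per parameter marker in the statement.
cursor.setinputsizes([(pyodbc.SQL_WVARCHAR, 50, 0)])
cursor.executemany("INSERT INTO names (name) VALUES (?)",
                   [("alice",), ("bob",)])
cnxn.commit()
```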
I've just tried this fast_executemany code that was merged into master. I am getting the following error:
Here's the code, which works fine with fast_executemany turned off:
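The snippet itself did not survive; a hypothetical reconstruction based on the later comments in this thread (an UPDATE of ~22K rows keyed on the primary key), with all names and data invented:

```python
import pyodbc

cnxn = pyodbc.connect("DSN=mydb")  # hypothetical connection
cursor = cnxn.cursor()

# ~22K (value, key) pairs; id is the table's primary key.
data = [(i * 2, i) for i in range(21_876)]

cursor.fast_executemany = True  # the same code works with this left False
cursor.executemany("UPDATE mytable SET val = ? WHERE id = ?", data)
cnxn.commit()
```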
|
Can you give more information on your environment? OS, Python version, etc. |
Windows 7, 64-bit
I am unable to reproduce the error, although I am using Win8.1/64 and SQL Server 2016; the Python is also 3.6.2.
You can provide an ODBC trace (with any sensitive information removed) to assist in debugging this issue further. |
Yes, see attached ODBC trace. |
Thank you. I see that your code is attempting a fast executemany() after it has already used the cursor for some other execution, and the (ODBC, not pyODBC) cursor has not been closed. execute() and the regular executemany() close the cursor first, but the fast executemany implementation does not. As a workaround, you can consume all the resultsets first, which will automatically close the cursor:
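The original workaround snippet was lost; presumably it was the standard nextset() drain, sketched here with hypothetical statements:

```python
import pyodbc

cnxn = pyodbc.connect("DSN=mydb")  # hypothetical
cursor = cnxn.cursor()

# A prior statement has left one or more resultsets pending on the cursor...
cursor.execute("SELECT 1; SELECT 2;")

# ...so drain them all; once the last resultset is consumed, the
# underlying ODBC cursor is closed automatically.
while cursor.nextset():
    pass

cursor.fast_executemany = True
cursor.executemany("INSERT INTO mytable (id) VALUES (?)", [(1,), (2,)])
cnxn.commit()
```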
(This is recommended in general, especially in the case of statements that return multiple resultsets/rowcounts, because failure to do so causes other problems.) Note: I am referring to the ODBC cursor throughout; this is unrelated to, and not solved by, calling cursor.close() in Python, since that not only closes but frees the underlying ODBC statement handle.
Adding your workaround code fixed the error. However, the speedup I am getting seems small.
The data is a Python list of 21,876 rows. See the attached tracing. (Tracing was off for the time trials above.)
That's under 30 rows/s without fast_executemany, and ~35 rows/s with it. I don't know what hardware you're using, but that is more than an order of magnitude slower than what I'd expect even without fast_executemany, so I suspect the bottleneck for you is somewhere else. Observing network and CPU usage on the client/server and disk use on the server may yield more clues. From the trace you provided, I can see that it is working as expected - all rows are sent by the driver at once:
|
Seems to be a CPU usage issue. Both disk I/O and network I/O are very low during execution of the query, but CPU usage is pegged on the SQL Server. Could the WHERE clause be causing this slowdown? The WHERE column is the primary key, so it should be indexed.
I rewrote my query to INSERT INTO a temporary table and then UPDATE from the temporary table, and it is much, much faster. And your parameterized code is over an order of magnitude quicker. Thanks for the help.
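A sketch of that temp-table rewrite, reusing the hypothetical names from the reconstruction above:

```python
import pyodbc

cnxn = pyodbc.connect("DSN=mydb")  # hypothetical
cursor = cnxn.cursor()
cursor.fast_executemany = True

data = [(i, i * 2) for i in range(21_876)]  # (id, new val) pairs

# 1. Bulk insert into a session-local temp table: no per-row WHERE,
#    so the server does no index lookup during the load.
cursor.execute("CREATE TABLE #updates (id INT PRIMARY KEY, val INT)")
cursor.executemany("INSERT INTO #updates (id, val) VALUES (?, ?)", data)

# 2. Apply all updates in one set-based statement on the server.
cursor.execute("""
    UPDATE t SET t.val = u.val
    FROM mytable t
    JOIN #updates u ON u.id = t.id
""")
cnxn.commit()
```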
|
Feature included in the 4.0.19 release. See fast_executemany for details. |
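Minimal usage of the released feature (hypothetical DSN and table):

```python
import pyodbc

cnxn = pyodbc.connect("DSN=mydb")  # hypothetical
cursor = cnxn.cursor()
cursor.fast_executemany = True  # opt in; the attribute defaults to False
cursor.executemany("INSERT INTO mytable (id) VALUES (?)",
                   [(i,) for i in range(20_000)])
cnxn.commit()
```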
@keitherskine have you ever used a .fmt file to insert into tables with different schemas? I am having the following issue, even after I removed the spaces between column names. Any ideas or suggestions? SQLState = S1000, NativeError = 0
Closed due to inactivity. Feel free to re-open with current information if necessary. |
There was a long-standing, somewhat significant issue open at Google Code having to do with executemany() and MS SQL Server. It does not yet appear to be fixed. I'd like to call attention to it again. (Insert unhelpful chiding for leaving open issues at Google Code and not bringing them over.)
Issue #250
Basically -- executemany() is taking forever. The problem manifests itself when there are a lot of records (tens or hundreds of thousands) and MS SQL Server is being used.
When one uses the SQL Profiler, you can see, for every statement, an exec sp_prepexec followed by an exec sp_unprepare. Here's a bit of sample code that makes it happen:
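The original sample did not survive; a hypothetical minimal reproduction of the kind of code that triggers it:

```python
import pyodbc

cnxn = pyodbc.connect("DSN=mydb")  # hypothetical DSN
cursor = cnxn.cursor()

# With the classic executemany(), the Profiler trace described above
# shows an sp_prepexec/sp_unprepare round trip per statement, so 100K
# rows means 100K round trips to the server.
rows = [(i, f"name{i}") for i in range(100_000)]
cursor.executemany("INSERT INTO mytable (id, name) VALUES (?, ?)", rows)
cnxn.commit()
```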