How to speed up object conversion? #404
If you know that …

When I do this: `@time rows = pycall(cs.cursor[:fetchall], Array{Tuple})` I get this: …

What should I pass in as the type?

Actually it looks like …

Ok, so what I have is an iterator that correctly returns a …

What does the iterator return? A tuple of what size/type?

It depends on the query, but let's say for this particular query it is: …

You can do …

With the latest PyCall master, you should be able to do … One thing that has been on my to-do list for a while is to speed up type introspection by caching types in a hash table.

Thanks. That didn't change the time it takes, because presumably the call to …
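The type-caching idea mentioned above could be sketched in plain Julia along these lines. This is a hypothetical illustration, not PyCall's actual internals: `slow_lookup`, `cached_type`, and `TYPE_CACHE` are invented names, and the real introspection would inspect a Python object's type rather than a `Symbol`.

```julia
# Hypothetical sketch of memoizing expensive type introspection in a Dict.
# `slow_lookup` stands in for the per-object introspection work.
const TYPE_CACHE = Dict{Symbol,DataType}()

function slow_lookup(key::Symbol)
    # Pretend this is an expensive introspection step.
    key === :int ? Int : key === :float ? Float64 : String
end

# get! only calls slow_lookup on a cache miss; hits are a single Dict lookup.
cached_type(key::Symbol) = get!(() -> slow_lookup(key), TYPE_CACHE, key)
```

The first call for a given key pays the introspection cost and stores the result; every later call for that key is just a hash-table lookup.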
Ok, so I've managed to get huge speedups (90%) by trying several different things:

Overall, on average, for 100,000 rows and 40 columns I've seen a change from approximately 200 seconds to 17 seconds. Larger datasets have seen even bigger improvements, going from several hours of running time to under 10 minutes. Hope this helps others.
Why not just do …

That way, you eliminate all of the construction of temporary arrays, the transposition, etcetera.
Nice, that does improve performance a little and reduces memory usage. |
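The shape of the suggestion above can be illustrated in plain Julia, with no PyCall or database involved. The `rows` vector of tuples here is a stand-in for fetched query results; the point is to write each element straight into its final position instead of building a temporary and transposing it.

```julia
# Illustrative sketch: fill the output array directly in its final
# orientation, avoiding a temporary array and a transpose.
rows = [(i, 2i, 3i) for i in 1:5]   # stand-in for fetched rows
m, n = length(rows), 3

a = Matrix{Int}(undef, n, m)        # one column per row: no transpose needed
for i in 1:m, j in 1:n
    a[j, i] = rows[i][j]
end
```

Each element is written exactly once, and no intermediate arrays are allocated.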
I made a small change to your code and got it a little faster:

```julia
m = length(rows)
n = length(cs.description)
a = Array{PyObject}(n, m)
for i = 1:m
    row = get(rows, PyObject, i-1)
    for j = 1:n
        a[j, i] = get(row, PyObject, j-1)
    end
end
b = transpose(a)
```

This is faster because indexing a Julia array a column at a time is faster than indexing it a row at a time, and the …
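The column-major point above can be demonstrated with plain Julia arrays (illustrative; actual timings vary by machine). Julia stores matrices column-major, so making the row index the innermost loop walks memory contiguously:

```julia
# Julia arrays are column-major: with j (the row index) innermost,
# consecutive iterations touch adjacent memory locations.
function fill_colmajor!(a)
    n, m = size(a)
    for i in 1:m          # outer loop over columns
        for j in 1:n      # inner loop walks one column contiguously
            a[j, i] = j + i
        end
    end
    return a
end

a = Matrix{Float64}(undef, 1000, 1000)
fill_colmajor!(a)
```

Swapping the loop order (columns innermost) computes the same result but strides through memory, which is typically measurably slower on large arrays.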
Slightly faster and more concise with this:

```julia
m = length(rows)
n = length(cs.description)
a = Array{PyObject}(n, m)
for i = 1:m
    a[:, i] = get(rows, PyVector{PyObject}, i-1)
end
```
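The `a[:, i] = …` pattern above copies a whole fetched row into one contiguous column with a single slice assignment. A pure-Julia illustration of the same pattern (the `rows` vectors stand in for the `PyVector{PyObject}` results):

```julia
# Copy each row into a contiguous matrix column with one slice assignment.
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # stand-in for fetched rows
n, m = 2, length(rows)

a = Matrix{Float64}(undef, n, m)
for i in 1:m
    a[:, i] = rows[i]    # one contiguous copy per row
end
```

One bulk copy per row replaces an element-by-element inner loop, which both shortens the code and keeps each write contiguous.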
I'm using PyCall (1.7.2) to run an SQL query against a database, and then getting the results into Julia. It appears to be very slow to convert the Python list of tuples to a Julia array of tuples, and it seems that all the slowness is in iterating through the list elements, i.e., the speed is O(n) in the number of items in the list.

Here's some example code. The SQL statement returns exactly 100,000 rows:

Using automatic type conversion: …

Getting a `PyObject` and then converting with `map`: …
As you can see, calling `fetchall()` takes 157 seconds, and then the `map` is very fast, whereas calling `pycall(fetchall, PyObject)` takes 7 seconds, and then the `map` is very slow.

So, wise PyCall devs, is there a way for me to combine the fastest parts of the two approaches? I'm not averse to going as low-level as necessary, as this level of database slowness is causing us a lot of grief.
PS: I've tried parallelising this with `pmap` and other low-level Julia parallel functions, but this involves copying the object, which has the same issue.