[SPARK-20773][SQL] ParquetWriteSupport.writeFields is quadratic in number of fields #18005
tpoterba wants to merge 3 commits into apache:master from tpoterba:tpoterba-patch-1
Conversation
Fix quadratic List indexing in ParquetWriteSupport. The minimal fix is to convert `rootFieldWriters` to a `WrappedArray`, which has O(1) indexing and restores the overall complexity to linear.
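The effect is easy to reproduce outside Spark: positional lookup in a singly linked list walks k links, so one pass that indexes every position of an n-element list does 0 + 1 + ... + (n-1) traversal steps, i.e. O(n²) total, while array indexing keeps the pass linear. A minimal Python sketch (the `ConsList` class here is a stand-in for Scala's `List`, not the Spark code):

```python
class ConsList:
    """A minimal singly linked list, standing in for Scala's List."""

    def __init__(self, values):
        self.head = None
        for v in reversed(values):
            self.head = (v, self.head)  # cons cell: (value, rest)

    def index(self, k):
        """Positional lookup: walks k links; returns (value, links_walked)."""
        node, steps = self.head, 0
        for _ in range(k):
            node = node[1]
            steps += 1
        return node[0], steps


def total_steps(n):
    """Links walked by one pass that indexes every position 0..n-1."""
    lst = ConsList(list(range(n)))
    return sum(lst.index(k)[1] for k in range(n))


# One pass over n fields walks n*(n-1)/2 links in total:
print(total_steps(10))   # 45
print(total_steps(100))  # 4950 -- ~100x the work for 10x the fields
```

With an array-backed sequence (like `WrappedArray`), each lookup is O(1), so the same pass is linear; that is the whole content of the `toArray` change.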
ok to test

Can you also make sure that we do not use a

Test build #76983 has finished for PR 18005 at commit

It might be nice to explicitly use the type

Yeah, I can change that - I do hate the standard IndexedSeq implementation (Vector), though, and want to make sure that the collection is actually a WrappedArray. I've now done more than a one-line change through the GitHub UI, and will update with performance benchmarks and a successful build.
```diff
- this.rootFieldWriters = schema.map(_.dataType).map(makeWriter).toArray
+ this.rootFieldWriters = schema.map(_.dataType).map(makeWriter).toArray[ValueWriter]
```

Either call `toIndexedSeq` or make the `rootFieldWriters` an `Array`. Both are fine.

```diff
- this.rootFieldWriters = schema.map(_.dataType).map(makeWriter)
+ this.rootFieldWriters = schema.map(_.dataType).map(makeWriter).toArray[ValueWriter]
```

Either call `toIndexedSeq` or make the `rootFieldWriters` an `Array`. Both are fine.
LGTM pending jenkins
Addressed comments. I tried to get some benchmark stats for this code:

```python
spark.read.csv(text_file).write.mode('overwrite').parquet(parquet_path)
```

I wanted to see the performance improvement for files with various numbers of columns/rows that were all 1.5G. However, I didn't see much of a difference with <30 columns, and Catalyst blew up when I tried ~50 columns (I wanted to go up to several hundred).

What do you mean by "Catalyst blew up"?
I used this script to generate random CSV files:

```python
import sys
import uuid

try:
    print('args = ' + str(sys.argv))
    filename = sys.argv[1]
    cols = int(sys.argv[2])
    rows = int(sys.argv[3])
    if len(sys.argv) != 4 or cols <= 0 or rows <= 0:
        raise RuntimeError()
except Exception:
    raise RuntimeError('Usage: gen_text_file.py <filename> <cols> <rows>')

# Each uuid4 hex string is sliced into eight 4-character tokens below,
# so generate ceil(cols / 8) uuids per row. Use // so this also runs
# under Python 3 (plain / would produce a float and break range()).
rand_to_gen = (cols + 7) // 8

with open(filename, 'w') as f:
    f.write(','.join('col%d' % i for i in range(cols)))
    f.write('\n')
    for i in range(rows):
        if i % 10000 == 0:
            print('wrote %d lines' % i)
        rands = [x[j:j + 4]
                 for j in range(8)
                 for x in [uuid.uuid4().hex for _ in range(rand_to_gen)]]
        f.write(','.join(rands[:cols]))
        f.write('\n')
```

I generated files that were all the same size on disk with different dimensions (cols x rows). Here's what I tried to do to them:

```python
>>> spark.read.csv(text_file).write.mode('overwrite').parquet(parquet_path)
```

The 10-, 20-, and 30-column files all took between 40s and 1m to complete on 2 cores of my laptop. 60 and up never completed, and actually crashed the java process -- I had to kill it with

At one point for the 60-column table, I got a "GC overhead limit exceeded" OOM from the parquet writer (the error suggested that parquet was doing something silly trying to use dictionary encoding for random values, but I haven't figured out how to turn that off). I could be conflating this crash with one we encountered a few months ago, where Spark crashed because Catalyst generated bytecode larger than 64k for dataframes with a large schema.
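On the dictionary-encoding tangent: parquet-mr reads its writer settings from the Hadoop configuration, including the `parquet.enable.dictionary` key from `ParquetOutputFormat`, so one way to turn it off from PySpark is a sketch like the following (a config fragment, not verified against this benchmark; it assumes a live `SparkSession` named `spark` and the `text_file`/`parquet_path` variables from above):

```python
# Sketch: disable Parquet dictionary encoding, which buys nothing for
# high-entropy (random) column values and can waste memory in the writer.
# `parquet.enable.dictionary` is parquet-mr's ParquetOutputFormat key,
# set here on the Hadoop configuration that Spark hands to the writer.
spark.sparkContext._jsc.hadoopConfiguration().set(
    'parquet.enable.dictionary', 'false')

spark.read.csv(text_file).write.mode('overwrite').parquet(parquet_path)
```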
Others on my team suggest that the >64k bytecode issue has already been fixed (and backported to a 2.1 maintenance release as well)
Test build #77018 has finished for PR 18005 at commit
LGTM - merging to master/2.2. Thanks!
[SPARK-20773][SQL] ParquetWriteSupport.writeFields is quadratic in number of fields

Fix quadratic List indexing in ParquetWriteSupport. I noticed this function while profiling some code today. It showed up as a significant factor in a table with twenty columns; with hundreds of columns, it could dominate any other function call.

## What changes were proposed in this pull request?

The writeFields method iterates from 0 until the number of fields, indexing into rootFieldWriters for each element. rootFieldWriters is a List, so indexing is a linear operation, and the complexity of writeFields is thus quadratic in the number of fields. Solution: explicitly convert rootFieldWriters to an Array (implicitly converted to WrappedArray) for constant-time indexing.

## How was this patch tested?

This is a one-line change for performance reasons.

Author: tpoterba <tpoterba@broadinstitute.org>
Author: Tim Poterba <tpoterba@gmail.com>

Closes #18005 from tpoterba/tpoterba-patch-1.

(cherry picked from commit 3f2cd51)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>