[SPARK-20773][SQL] ParquetWriteSupport.writeFields is quadratic in number of fields #18005
tpoterba wants to merge 3 commits into apache:master from tpoterba:tpoterba-patch-1
Conversation
Fix quadratic List indexing in ParquetWriteSupport. The minimal fix is to convert `rootFieldWriters` to a `WrappedArray`, which has O(1) indexing and restores the overall complexity to linear.
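The effect is easy to reproduce outside Spark: positional lookup in a singly linked list walks k links, so one pass that indexes every position of an n-element list does 0 + 1 + ... + (n-1) traversal steps, i.e. O(n²) total, while array indexing keeps the pass linear. A minimal Python sketch (the `ConsList` class here is a stand-in for Scala's `List`, not the Spark code):

```python
class ConsList:
    """A minimal singly linked list, standing in for Scala's List."""

    def __init__(self, values):
        self.head = None
        for v in reversed(values):
            self.head = (v, self.head)  # cons cell: (value, rest)

    def index(self, k):
        """Positional lookup: walks k links; returns (value, links_walked)."""
        node, steps = self.head, 0
        for _ in range(k):
            node = node[1]
            steps += 1
        return node[0], steps


def total_steps(n):
    """Links walked by one pass that indexes every position 0..n-1."""
    lst = ConsList(list(range(n)))
    return sum(lst.index(k)[1] for k in range(n))


# One pass over n fields walks n*(n-1)/2 links in total:
print(total_steps(10))   # 45
print(total_steps(100))  # 4950 -- ~100x the work for 10x the fields
```

With an array-backed sequence (like `WrappedArray`), each lookup is O(1), so the same pass is linear; that is the whole content of the `toArray` change.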
ok to test

Can you also make sure that we do not use a

Test build #76983 has finished for PR 18005 at commit

It might be nice to explicitly use the type

Yeah, I can change that - I do hate the standard IndexedSeq implementation (Vector), though, and want to make sure that the collection is actually a WrappedArray. I've now done more than a one-line change through the GitHub UI, and will update with performance benchmarks and a successful build.
```diff
- this.rootFieldWriters = schema.map(_.dataType).map(makeWriter).toArray
+ this.rootFieldWriters = schema.map(_.dataType).map(makeWriter).toArray[ValueWriter]
```

Either call `toIndexedSeq` or make the `rootFieldWriters` an `Array`. Both are fine.

```diff
- this.rootFieldWriters = schema.map(_.dataType).map(makeWriter)
+ this.rootFieldWriters = schema.map(_.dataType).map(makeWriter).toArray[ValueWriter]
```

Either call `toIndexedSeq` or make the `rootFieldWriters` an `Array`. Both are fine.
LGTM pending jenkins
Addressed comments. I tried to get some benchmark stats for this code:

```python
spark.read.csv(text_file).write.mode('overwrite').parquet(parquet_path)
```

I wanted to see the performance improvement for files with various numbers of columns/rows that were all 1.5G. However, I didn't see much of a difference with <30 columns, and Catalyst blew up when I tried ~50 columns (I wanted to go up to several hundred).

What do you mean by "Catalyst blew up"?
I used this script to generate random CSV files:

```python
import sys
import uuid

try:
    print('args = ' + str(sys.argv))
    filename = sys.argv[1]
    cols = int(sys.argv[2])
    rows = int(sys.argv[3])
    if len(sys.argv) != 4 or cols <= 0 or rows <= 0:
        raise RuntimeError()
except Exception:
    raise RuntimeError('Usage: gen_text_file.py <filename> <cols> <rows>')

# Each uuid4 hex string is sliced into eight 4-character tokens below,
# so generate ceil(cols / 8) uuids per row. Use // so this also runs
# under Python 3 (plain / would produce a float and break range()).
rand_to_gen = (cols + 7) // 8

with open(filename, 'w') as f:
    f.write(','.join('col%d' % i for i in range(cols)))
    f.write('\n')
    for i in range(rows):
        if i % 10000 == 0:
            print('wrote %d lines' % i)
        rands = [x[j:j + 4]
                 for j in range(8)
                 for x in [uuid.uuid4().hex for _ in range(rand_to_gen)]]
        f.write(','.join(rands[:cols]))
        f.write('\n')
```

I generated files that were all the same size on disk with different dimensions (cols x rows). Here's what I tried to do to them:

```python
>>> spark.read.csv(text_file).write.mode('overwrite').parquet(parquet_path)
```

The 10-, 20-, and 30-column files all took between 40s and 1m to complete on 2 cores of my laptop. 60 and up never completed, and actually crashed the java process -- I had to kill it with

At one point for the 60-column table, I got a "GC overhead limit exceeded" OOM from the parquet writer (the error suggested that parquet was doing something silly trying to use dictionary encoding for random values, but I haven't figured out how to turn that off). I could be conflating this crash with one we encountered a few months ago, where Spark crashed because Catalyst generated bytecode larger than 64k for dataframes with a large schema.
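On the dictionary-encoding tangent: parquet-mr reads its writer settings from the Hadoop configuration, including the `parquet.enable.dictionary` key from `ParquetOutputFormat`, so one way to turn it off from PySpark is a sketch like the following (a config fragment, not verified against this benchmark; it assumes a live `SparkSession` named `spark` and the `text_file`/`parquet_path` variables from above):

```python
# Sketch: disable Parquet dictionary encoding, which buys nothing for
# high-entropy (random) column values and can waste memory in the writer.
# `parquet.enable.dictionary` is parquet-mr's ParquetOutputFormat key,
# set here on the Hadoop configuration that Spark hands to the writer.
spark.sparkContext._jsc.hadoopConfiguration().set(
    'parquet.enable.dictionary', 'false')

spark.read.csv(text_file).write.mode('overwrite').parquet(parquet_path)
```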
Others on my team suggest that the >64k bytecode issue has already been fixed (and backported to a 2.1 maintenance release as well)
Test build #77018 has finished for PR 18005 at commit
LGTM - merging to master/2.2. Thanks!
[SPARK-20773][SQL] ParquetWriteSupport.writeFields is quadratic in number of fields

Fix quadratic List indexing in ParquetWriteSupport. I noticed this function while profiling some code today. It showed up as a significant factor in a table with twenty columns; with hundreds of columns, it could dominate any other function call.

## What changes were proposed in this pull request?

The writeFields method iterates from 0 until the number of fields, indexing into rootFieldWriters for each element. rootFieldWriters is a List, so indexing is a linear operation, and the complexity of writeFields is thus quadratic in the number of fields. Solution: explicitly convert rootFieldWriters to an Array (implicitly converted to WrappedArray) for constant-time indexing.

## How was this patch tested?

This is a one-line change for performance reasons.

Author: tpoterba <tpoterba@broadinstitute.org>
Author: Tim Poterba <tpoterba@gmail.com>

Closes #18005 from tpoterba/tpoterba-patch-1.

(cherry picked from commit 3f2cd51)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>