Skip to content

Conversation

@icexelloss
Copy link
Collaborator

What changes were proposed in this pull request?

I have refactored arrow serialization into arrow column writers

How was this patch tested?

The two tests (int and string) in ArrowSuite pass

@BryanCutler
Copy link
Owner

Thanks for this @icexelloss. It looks good in general but like I mentioned before the Spark committers are very reluctant to add any files or classes. Since we are only working with toPandas() right now, it will be hard to justify adding so much for a small corner of Spark functionality. I would not like to add any more files or classes until we get a committer to sign off on what we have currently. So how about you hold off on this until then?

@wesm
Copy link

wesm commented Jan 18, 2017

Since this is Scala, can't these classes all go in a single file?

@icexelloss
Copy link
Collaborator Author

Bryan and Wes, I moved the column writers to Arrow.scala

@icexelloss
Copy link
Collaborator Author

I added support for more types and switched to use arrow NullableVector

Once #17 is merged, I will rebase and add tests for uncovered types

@icexelloss
Copy link
Collaborator Author

here are some performance issues I found with this change:

(For 1 million long and double)

Before:
internalRowsToArrowRecordBatch
[0.045898734002548736, 0.05031404899636982, 0.044390044997271616, 0.03245397000137018, 0.030072718996962067, 0.03008368899463676, 0.029609725999762304, 0.02923199100041529, 0.029868764999264386, 0.02980228500382509]

After:
internalRowsToArrowRecordBatch
[0.17399869799555745, 0.06763976799993543, 0.17177338899637107, 0.11198228599823778, 0.08500225299940212, 0.12630318199808244, 0.08831146299780812, 0.11189435399865033, 0.08414604800054803, 0.11313620099826949]

This contributes to ~20% slowness to toPandas()

I will spend some time looking into that

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually explicit imports are preferred. Is it basically an import for every vector type?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is similar to import org.apache.spark.sql.types._ , otherwise we end up with a very long import (10+)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it will probably be fine like this then

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets move this to a ColumnWriter object apply() method

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and subclasses should probably be package private

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove empty line

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be a little clearer to check for null here, like before and just define writeData and writeNull in the ColumnWriter interface. Is that do-able?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this implicit used? Generally these are somewhat frowned upon, can it be done without this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There isn't a scala function to turn a boolean into a int, that's what is this for. I will change it to be non implicit

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe some of these classes can be consolidated with a generic class definition?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has already been pretty much consolidated. Each column writers have different valueVector type, valueMutator type and writeData. The only part that is same among them is writeNull(), however, arrow doesn't have a nullable mutator interface, therefore this needs to be this way

@BryanCutler
Copy link
Owner

Ok, it looks better with these classes in one file and the ColumnWriter interface is a good way to do this. Just some minor comments, otherwise looking good.

@BryanCutler
Copy link
Owner

BryanCutler commented Jan 18, 2017

Once #17 is merged, I will rebase and add tests for uncovered types

Ok, I'll go ahead and merge that first

This contributes to ~20% slowness to toPandas()

Any idea what was causing this?

@icexelloss
Copy link
Collaborator Author

I boxed primitives by accidents. It is fixed now.

Before:
internalRowsToArrowRecordBatch
[0.18191100400144933, 0.036740217001351994, 0.03310888200212503, 0.02675118299521273, 0.02415680999547476, 0.02392288400005782, 0.023824697003874462, 0.023984511994058266, 0.022978036002314184, 0.026774197998747695]

After:
internalRowsToArrowRecordBatch
[0.2175193030052469, 0.032566589005000424, 0.030894666000676807, 0.031016137996630277, 0.06971160900138784, 0.02801599200029159, 0.028111334002460353, 0.028634793998207897, 0.027740458004700486, 0.0285637749984744]

@icexelloss
Copy link
Collaborator Author

Comments addressed

@icexelloss
Copy link
Collaborator Author

Oops, some tests are failing. Need to fix those.

@icexelloss icexelloss changed the title Implement Arrow column writers WIP: Implement Arrow column writers Jan 19, 2017
@icexelloss icexelloss changed the title WIP: Implement Arrow column writers Implement Arrow column writers Jan 19, 2017
@icexelloss
Copy link
Collaborator Author

Tests pass

@BryanCutler
Copy link
Owner

BryanCutler commented Jan 19, 2017

LGTM, merged. Thanks @icexelloss!

BryanCutler pushed a commit that referenced this pull request Jan 24, 2017
Move column writers to Arrow.scala

Add support for more types; Switch to arrow NullableVector

closes #16
BryanCutler pushed a commit that referenced this pull request Feb 23, 2017
Move column writers to Arrow.scala

Add support for more types; Switch to arrow NullableVector

closes #16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants