[SPARK-37896][SQL] Implement a ConstantColumnVector and improve performance of the hidden file metadata #35068
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,292 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one or more | ||
| * contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. | ||
| * The ASF licenses this file to You under the Apache License, Version 2.0 | ||
| * (the "License"); you may not use this file except in compliance with | ||
| * the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
| package org.apache.spark.sql.execution.vectorized; | ||
|
|
||
| import java.math.BigDecimal; | ||
| import java.math.BigInteger; | ||
|
|
||
| import org.apache.spark.sql.types.*; | ||
| import org.apache.spark.sql.vectorized.ColumnVector; | ||
| import org.apache.spark.sql.vectorized.ColumnarArray; | ||
| import org.apache.spark.sql.vectorized.ColumnarMap; | ||
| import org.apache.spark.unsafe.types.UTF8String; | ||
|
|
||
| /** | ||
| * A ColumnVector that holds one constant value shared by all rows. | ||
| * It supports all the types and exposes `set` APIs, | ||
| * each of which sets the exact same value for every row. | ||
| * | ||
| * Capacity: The vector stores only one copy of the data. | ||
| */ | ||
|
Contributor
Can we write a UT for this new vector?
Contributor
Author
Sure, working on it! |
||
| public class ConstantColumnVector extends ColumnVector { | ||
|
Contributor
I am thinking whether we should extend It seems that for partition columns, we copy the same value into every row (Parquet and ORC). A future improvement would be to use the constant column vector introduced here to avoid those unnecessary operations. @cloud-fan WDYT?
Contributor
Author
I was thinking to extend |
||
|
|
||
| // The data stored in this ConstantColumnVector, the vector stores only one copy of the data. | ||
| private byte nullData; | ||
| private byte byteData; | ||
| private short shortData; | ||
| private int intData; | ||
| private long longData; | ||
| private float floatData; | ||
| private double doubleData; | ||
| private UTF8String stringData; | ||
| private byte[] byteArrayData; | ||
| private ConstantColumnVector[] childData; | ||
| private ColumnarArray arrayData; | ||
| private ColumnarMap mapData; | ||
|
|
||
| private final int numRows; | ||
|
|
||
| /** | ||
| * @param numRows The number of rows for this ConstantColumnVector | ||
| * @param type The data type of this ConstantColumnVector | ||
| */ | ||
| public ConstantColumnVector(int numRows, DataType type) { | ||
|
Member
Contributor
It seems the
Contributor
Yeah, I actually looked at it as well. It seems more code changes are needed if we want to utilize
Member
Makes sense. Perhaps we can remove |
||
| super(type); | ||
| this.numRows = numRows; | ||
|
|
||
| if (type instanceof StructType) { | ||
| this.childData = new ConstantColumnVector[((StructType) type).fields().length]; | ||
| } else if (type instanceof CalendarIntervalType) { | ||
| // Three columns: months as int, days as int, microseconds as long. | ||
| this.childData = new ConstantColumnVector[3]; | ||
| } else { | ||
| this.childData = null; | ||
| } | ||
| } | ||
|
|
||
| @Override | ||
| public void close() { | ||
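| byteArrayData = null; | ||
| // childData is only allocated for struct and calendar-interval types. | ||
| if (childData != null) { | ||
| for (int i = 0; i < childData.length; i++) { | ||
| if (childData[i] != null) { | ||
| childData[i].close(); | ||
| childData[i] = null; | ||
| } | ||
| } | ||
| childData = null; | ||
| } | ||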
|
||
| arrayData = null; | ||
| mapData = null; | ||
| } | ||
|
|
||
| @Override | ||
| public boolean hasNull() { | ||
| return nullData == 1; | ||
| } | ||
|
|
||
| @Override | ||
| public int numNulls() { | ||
| return hasNull() ? numRows : 0; | ||
| } | ||
|
|
||
| @Override | ||
| public boolean isNullAt(int rowId) { | ||
| return nullData == 1; | ||
| } | ||
|
|
||
| /** | ||
| * Sets all rows as `null` | ||
| */ | ||
| public void setNull() { | ||
| nullData = (byte) 1; | ||
| } | ||
|
|
||
| /** | ||
| * Sets all rows as not `null` | ||
| */ | ||
| public void setNotNull() { | ||
| nullData = (byte) 0; | ||
| } | ||
|
|
||
| @Override | ||
| public boolean getBoolean(int rowId) { | ||
| return byteData == 1; | ||
| } | ||
|
|
||
| /** | ||
| * Sets the boolean `value` for all rows | ||
| */ | ||
| public void setBoolean(boolean value) { | ||
| byteData = (byte) ((value) ? 1 : 0); | ||
| } | ||
|
|
||
| @Override | ||
| public byte getByte(int rowId) { | ||
| return byteData; | ||
| } | ||
|
|
||
| /** | ||
| * Sets the byte `value` for all rows | ||
| */ | ||
| public void setByte(byte value) { | ||
| byteData = value; | ||
| } | ||
|
|
||
| @Override | ||
| public short getShort(int rowId) { | ||
| return shortData; | ||
| } | ||
|
|
||
| /** | ||
| * Sets the short `value` for all rows | ||
| */ | ||
| public void setShort(short value) { | ||
| shortData = value; | ||
| } | ||
|
|
||
| @Override | ||
| public int getInt(int rowId) { | ||
| return intData; | ||
| } | ||
|
|
||
| /** | ||
| * Sets the int `value` for all rows | ||
| */ | ||
| public void setInt(int value) { | ||
| intData = value; | ||
| } | ||
|
|
||
| @Override | ||
| public long getLong(int rowId) { | ||
| return longData; | ||
| } | ||
|
|
||
| /** | ||
| * Sets the long `value` for all rows | ||
| */ | ||
| public void setLong(long value) { | ||
| longData = value; | ||
| } | ||
|
|
||
| @Override | ||
| public float getFloat(int rowId) { | ||
| return floatData; | ||
| } | ||
|
|
||
| /** | ||
| * Sets the float `value` for all rows | ||
| */ | ||
| public void setFloat(float value) { | ||
| floatData = value; | ||
| } | ||
|
|
||
| @Override | ||
| public double getDouble(int rowId) { | ||
| return doubleData; | ||
| } | ||
|
|
||
| /** | ||
| * Sets the double `value` for all rows | ||
| */ | ||
| public void setDouble(double value) { | ||
| doubleData = value; | ||
| } | ||
|
|
||
| @Override | ||
| public ColumnarArray getArray(int rowId) { | ||
|
Member
Not sure if this can work properly. Looking at |
||
| return arrayData; | ||
| } | ||
|
|
||
| /** | ||
| * Sets the `ColumnarArray` `value` for all rows | ||
| */ | ||
| public void setArray(ColumnarArray value) { | ||
| arrayData = value; | ||
| } | ||
|
|
||
| @Override | ||
| public ColumnarMap getMap(int ordinal) { | ||
| return mapData; | ||
| } | ||
|
|
||
| /** | ||
| * Sets the `ColumnarMap` `value` for all rows | ||
| */ | ||
| public void setMap(ColumnarMap value) { | ||
| mapData = value; | ||
| } | ||
|
|
||
| @Override | ||
| public Decimal getDecimal(int rowId, int precision, int scale) { | ||
| // Copied and adapted from WritableColumnVector. | ||
| if (precision <= Decimal.MAX_INT_DIGITS()) { | ||
| return Decimal.createUnsafe(getInt(rowId), precision, scale); | ||
| } else if (precision <= Decimal.MAX_LONG_DIGITS()) { | ||
| return Decimal.createUnsafe(getLong(rowId), precision, scale); | ||
| } else { | ||
| byte[] bytes = getBinary(rowId); | ||
| BigInteger bigInteger = new BigInteger(bytes); | ||
| BigDecimal javaDecimal = new BigDecimal(bigInteger, scale); | ||
| return Decimal.apply(javaDecimal, precision, scale); | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Sets the `Decimal` `value` with the precision for all rows | ||
| */ | ||
| public void setDecimal(Decimal value, int precision) { | ||
| // Copied and adapted from WritableColumnVector. | ||
| if (precision <= Decimal.MAX_INT_DIGITS()) { | ||
| setInt((int) value.toUnscaledLong()); | ||
| } else if (precision <= Decimal.MAX_LONG_DIGITS()) { | ||
| setLong(value.toUnscaledLong()); | ||
| } else { | ||
| BigInteger bigInteger = value.toJavaBigDecimal().unscaledValue(); | ||
| setByteArray(bigInteger.toByteArray()); | ||
| } | ||
| } | ||
|
|
||
| @Override | ||
| public UTF8String getUTF8String(int rowId) { | ||
| return stringData; | ||
| } | ||
|
|
||
| /** | ||
| * Sets the `UTF8String` `value` for all rows | ||
| */ | ||
| public void setUtf8String(UTF8String value) { | ||
| stringData = value; | ||
| } | ||
|
|
||
|
Member
maybe add
Contributor
Author
Thanks for the suggestion. I just want to put minimal support in this PR (implement all necessary APIs extending from |
||
| /** | ||
| * Sets the byte array `value` for all rows | ||
| */ | ||
| private void setByteArray(byte[] value) { | ||
| byteArrayData = value; | ||
| } | ||
|
|
||
| @Override | ||
| public byte[] getBinary(int rowId) { | ||
| return byteArrayData; | ||
| } | ||
|
|
||
| /** | ||
| * Sets the binary `value` for all rows | ||
| */ | ||
| public void setBinary(byte[] value) { | ||
| setByteArray(value); | ||
| } | ||
|
|
||
| @Override | ||
| public ColumnVector getChild(int ordinal) { | ||
|
||
| return childData[ordinal]; | ||
| } | ||
|
|
||
| /** | ||
| * Sets the child `ConstantColumnVector` `value` at the given ordinal for all rows | ||
| */ | ||
| public void setChild(int ordinal, ConstantColumnVector value) { | ||
| childData[ordinal] = value; | ||
| } | ||
| } | ||
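To make the idea behind the class above concrete, here is a minimal, self-contained sketch of a constant column: one stored value, any number of logical rows. It is only an illustration and does not use Spark's `ColumnVector` API; the names `ConstantVectorSketch` and `ConstantIntColumn` are hypothetical, and only the int type is covered. The sketch's `close()` also tolerates a child array that was never allocated, which is the case for non-struct columns.

```java
/**
 * Hypothetical, simplified stand-in for the constant-vector idea.
 * It does not use Spark's ColumnVector API.
 */
final class ConstantVectorSketch {

    /** One int (or null) that every row id resolves to. */
    static final class ConstantIntColumn {
        private final int numRows;
        private boolean isNull;
        private int value;
        // Allocated only for struct-like columns, so it may be null.
        private ConstantIntColumn[] children;

        ConstantIntColumn(int numRows, int numChildren) {
            this.numRows = numRows;
            this.children = numChildren > 0 ? new ConstantIntColumn[numChildren] : null;
        }

        void setInt(int v) { value = v; }
        void setNull() { isNull = true; }
        void setChild(int ordinal, ConstantIntColumn child) { children[ordinal] = child; }

        // rowId is ignored: all rows share the single stored value.
        int getInt(int rowId) { return value; }
        boolean isNullAt(int rowId) { return isNull; }
        int numNulls() { return isNull ? numRows : 0; }

        /** Releases children, tolerating a missing array and unset slots. */
        void close() {
            if (children == null) return;
            for (int i = 0; i < children.length; i++) {
                if (children[i] != null) {
                    children[i].close();
                    children[i] = null;
                }
            }
            children = null;
        }
    }

    public static void main(String[] args) {
        // E.g. a partition column: one value for a 4096-row batch.
        ConstantIntColumn year = new ConstantIntColumn(4096, 0);
        year.setInt(2022);
        System.out.println(year.getInt(0) + " " + year.getInt(4095));
        year.close(); // no children allocated; must not throw
    }
}
```

The memory cost stays constant regardless of `numRows`, which is why a reviewer suggests the same approach for partition columns, where the value is otherwise copied once per row.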
Let's add the Apache license header similar to other files.
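Returning to `getDecimal` and `setDecimal` in the diff: they store a decimal's unscaled value as an int, a long, or a byte array depending on precision. The sketch below reproduces that three-way split with plain `java.math` types so it runs without Spark. The class name `DecimalRoundTripSketch` and its helpers are hypothetical, and the thresholds 9 and 18 are assumed to match `Decimal.MAX_INT_DIGITS()` and `Decimal.MAX_LONG_DIGITS()`.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

/** Hypothetical illustration of precision-based decimal storage; not Spark code. */
final class DecimalRoundTripSketch {
    // Assumed to mirror Spark's Decimal.MAX_INT_DIGITS / MAX_LONG_DIGITS.
    static final int MAX_INT_DIGITS = 9;
    static final int MAX_LONG_DIGITS = 18;

    /** Picks the storage form for the unscaled value: Integer, Long, or byte[]. */
    static Object encode(BigDecimal value, int precision) {
        BigInteger unscaled = value.unscaledValue();
        if (precision <= MAX_INT_DIGITS) {
            return unscaled.intValueExact();
        } else if (precision <= MAX_LONG_DIGITS) {
            return unscaled.longValueExact();
        } else {
            // Two's-complement big-endian bytes, as BigInteger defines them.
            return unscaled.toByteArray();
        }
    }

    /** Rebuilds the decimal from the stored form and the known scale. */
    static BigDecimal decode(Object stored, int scale) {
        if (stored instanceof Integer) {
            return BigDecimal.valueOf(((Integer) stored).longValue(), scale);
        } else if (stored instanceof Long) {
            return BigDecimal.valueOf(((Long) stored).longValue(), scale);
        } else {
            return new BigDecimal(new BigInteger((byte[]) stored), scale);
        }
    }

    public static void main(String[] args) {
        BigDecimal small = new BigDecimal("123.45");                    // precision 5
        BigDecimal medium = new BigDecimal("1234567890.12");            // precision 12
        BigDecimal large = new BigDecimal("12345678901234567890.1234"); // precision 24
        System.out.println(decode(encode(small, 5), small.scale()).equals(small));
        System.out.println(decode(encode(medium, 12), medium.scale()).equals(medium));
        System.out.println(decode(encode(large, 24), large.scale()).equals(large));
    }
}
```

The scale is not stored with the value in any of the three forms, so the reader must supply the same precision and scale that the writer used.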