[blocked] Switch to Utf8View for TPC-H #476

Closed
wants to merge 10 commits into from

a10y
Contributor

@a10y a10y commented Jul 17, 2024

@a10y changed the title from "Switch to Utf8View for TPC-H" to "[blocked] Switch to Utf8View for TPC-H" on Jul 17, 2024
.github/workflows/ci.yml (outdated, resolved)
@@ -115,3 +117,18 @@ warnings = "deny"
[workspace.lints.clippy]
all = { level = "deny", priority = -1 }
or_fun_call = "deny"

[patch.crates-io]
Contributor Author

will revert all these cheeky bits once upstream releases with fixes

bench-vortex/tpch/q7.sql (outdated, resolved)
encodings/dict/src/lib.rs (outdated, resolved)
@robert3005
Member

FWIW this is unblocked if you were to use master of datafusion

@@ -170,7 +170,7 @@ impl TableProvider for VortexMemTable {
     /// The array is flattened directly into the nearest Arrow-compatible encoding.
     async fn scan(
         &self,
-        state: &SessionState,
+        state: &dyn Session,
Contributor Author

looks like datafusion mainline changed the API here: apache/datafusion#11516

@a10y
Contributor Author

a10y commented Jul 30, 2024

Initial TPC-H benchmarks comparison:

aduffy@DuffyProBook ~/c/vortex (aduffy/utf8view) [1]> critcmp develop-vortex utf8view-vortex
group                                develop-vortex                         utf8view-vortex
-----                                --------------                         ---------------
tpch_q1/vortex-pushdown-disabled     1.00    339.2±0.83ms        ? ?/sec    1.24    421.0±1.13ms        ? ?/sec
tpch_q10/vortex-pushdown-disabled    1.00    165.3±5.40ms        ? ?/sec    1.34    222.0±3.07ms        ? ?/sec
tpch_q11/vortex-pushdown-disabled    1.00    105.7±3.73ms        ? ?/sec    1.26   133.3±17.73ms        ? ?/sec
tpch_q12/vortex-pushdown-disabled    1.00    119.1±3.35ms        ? ?/sec    2.41    287.2±8.59ms        ? ?/sec
tpch_q13/vortex-pushdown-disabled    1.00    156.1±2.57ms        ? ?/sec    1.59   248.4±21.85ms        ? ?/sec
tpch_q14/vortex-pushdown-disabled    1.00     21.1±0.64ms        ? ?/sec    1.58     33.3±1.74ms        ? ?/sec
tpch_q16/vortex-pushdown-disabled    1.00     78.0±2.25ms        ? ?/sec    1.37   107.2±13.18ms        ? ?/sec
tpch_q17/vortex-pushdown-disabled    1.00    307.4±8.72ms        ? ?/sec    1.12   344.4±10.19ms        ? ?/sec
tpch_q18/vortex-pushdown-disabled    1.00   588.3±48.01ms        ? ?/sec    1.04   611.9±23.91ms        ? ?/sec
tpch_q19/vortex-pushdown-disabled    1.00     99.8±0.56ms        ? ?/sec    3.84    383.4±3.93ms        ? ?/sec
tpch_q2/vortex-pushdown-disabled     1.00     77.6±1.22ms        ? ?/sec    1.12     86.8±0.90ms        ? ?/sec
tpch_q20/vortex-pushdown-disabled    1.00    134.3±7.21ms        ? ?/sec    1.10    147.5±5.04ms        ? ?/sec
tpch_q21/vortex-pushdown-disabled    1.00   543.7±33.73ms        ? ?/sec    1.10   598.0±25.40ms        ? ?/sec
tpch_q22/vortex-pushdown-disabled    1.00     66.3±1.76ms        ? ?/sec    1.10     72.6±1.95ms        ? ?/sec
tpch_q3/vortex-pushdown-disabled     1.00     98.0±2.36ms        ? ?/sec    1.05    103.1±4.27ms        ? ?/sec
tpch_q4/vortex-pushdown-disabled     1.00     69.5±1.13ms        ? ?/sec    1.51    104.7±2.85ms        ? ?/sec
tpch_q5/vortex-pushdown-disabled     1.00    153.9±6.25ms        ? ?/sec    1.01    156.0±6.22ms        ? ?/sec
tpch_q6/vortex-pushdown-disabled     1.00     15.7±0.17ms        ? ?/sec    1.01     15.9±0.35ms        ? ?/sec
tpch_q7/vortex-pushdown-disabled     1.00    297.5±1.56ms        ? ?/sec    1.08    321.1±9.67ms        ? ?/sec
tpch_q8/vortex-pushdown-disabled     1.00    135.5±2.94ms        ? ?/sec    1.08    146.7±3.22ms        ? ?/sec
tpch_q9/vortex-pushdown-disabled     1.00   278.7±19.86ms        ? ?/sec    1.14    319.0±6.45ms        ? ?/sec

Going to profile some of the big boys, starting with q19

@a10y
Contributor Author

a10y commented Jul 30, 2024

Oh right, I forgot that into_canonical currently copies the world 🤦

image

should be an easy fix

@a10y
Contributor Author

a10y commented Aug 1, 2024

Alright, a bit warmer now:

$ critcmp develop-vortex utf8view-vortex-fixed-faster

group                                develop-vortex                         utf8view-vortex-fixed-faster
-----                                --------------                         ----------------------------
tpch_q1/vortex-pushdown-disabled     1.00    339.2±0.83ms        ? ?/sec    1.13    382.8±3.05ms        ? ?/sec
tpch_q10/vortex-pushdown-disabled    1.00    165.3±5.40ms        ? ?/sec    1.50    247.8±0.57ms        ? ?/sec
tpch_q11/vortex-pushdown-disabled    1.00    105.7±3.73ms        ? ?/sec    1.11    117.3±7.19ms        ? ?/sec
tpch_q12/vortex-pushdown-disabled    1.00    119.1±3.35ms        ? ?/sec    1.27    151.5±1.06ms        ? ?/sec
tpch_q13/vortex-pushdown-disabled    1.00    156.1±2.57ms        ? ?/sec    1.01    157.7±2.25ms        ? ?/sec
tpch_q14/vortex-pushdown-disabled    1.00     21.1±0.64ms        ? ?/sec    1.34     28.2±0.72ms        ? ?/sec
tpch_q16/vortex-pushdown-disabled    1.00     78.0±2.25ms        ? ?/sec    1.25     97.6±1.14ms        ? ?/sec
tpch_q17/vortex-pushdown-disabled    1.00    307.4±8.72ms        ? ?/sec    1.12    342.9±3.81ms        ? ?/sec
tpch_q18/vortex-pushdown-disabled    1.00   588.3±48.01ms        ? ?/sec    1.35   795.7±81.62ms        ? ?/sec
tpch_q19/vortex-pushdown-disabled    1.00     99.8±0.56ms        ? ?/sec    1.37    136.9±3.13ms        ? ?/sec
tpch_q2/vortex-pushdown-disabled     1.00     77.6±1.22ms        ? ?/sec    1.07     83.0±2.37ms        ? ?/sec
tpch_q20/vortex-pushdown-disabled    1.00    134.3±7.21ms        ? ?/sec    1.05    141.1±2.10ms        ? ?/sec
tpch_q21/vortex-pushdown-disabled    1.00   543.7±33.73ms        ? ?/sec    1.46   793.8±27.10ms        ? ?/sec
tpch_q22/vortex-pushdown-disabled    1.00     66.3±1.76ms        ? ?/sec    1.06     70.5±1.44ms        ? ?/sec
tpch_q3/vortex-pushdown-disabled     1.00     98.0±2.36ms        ? ?/sec    1.01     99.5±1.49ms        ? ?/sec
tpch_q4/vortex-pushdown-disabled     1.00     69.5±1.13ms        ? ?/sec    1.04     72.6±0.81ms        ? ?/sec
tpch_q5/vortex-pushdown-disabled     1.00    153.9±6.25ms        ? ?/sec    1.02    157.7±2.30ms        ? ?/sec
tpch_q6/vortex-pushdown-disabled     1.00     15.7±0.17ms        ? ?/sec    1.01     15.9±0.18ms        ? ?/sec
tpch_q7/vortex-pushdown-disabled     1.00    297.5±1.56ms        ? ?/sec    1.15    342.7±8.28ms        ? ?/sec
tpch_q8/vortex-pushdown-disabled     1.00    135.5±2.94ms        ? ?/sec    1.05    142.6±3.56ms        ? ?/sec
tpch_q9/vortex-pushdown-disabled     1.00   278.7±19.86ms        ? ?/sec    1.14   317.8±11.65ms        ? ?/sec

Digging in on q10 and q21, it looks like Arrow's take implementation for StringView does a full UTF-8 validation, something that the take implementation for normal StringArray doesn't do. If we addressed that upstream, I suspect this would put us at parity with or above the Utf8 implementation.

image
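
To illustrate why a take over view-style arrays should not need to re-validate anything, here is a toy sketch. It is not Arrow's kernel and does not use Arrow's 16-byte view layout; `View`, `ToyViewArray`, and all method names are hypothetical. The point is only that `take` gathers views that already point into a buffer validated once at construction, so the string bytes never need to be touched.

```rust
use std::rc::Rc;

/// Toy stand-in for a string view: a (start, len) window into a shared buffer.
/// Hypothetical type; this is NOT Arrow's actual view representation.
#[derive(Clone, Copy, Debug, PartialEq)]
struct View {
    start: u32,
    len: u32,
}

/// Hypothetical view array: a list of views plus one shared UTF-8 buffer.
struct ToyViewArray {
    views: Vec<View>,
    /// Invariant: valid UTF-8, checked once when the array is built.
    data: Rc<str>,
}

impl ToyViewArray {
    /// Builds the array from string slices, which are already valid UTF-8.
    fn from_strs(values: &[&str]) -> Self {
        let mut data = String::new();
        let mut views = Vec::with_capacity(values.len());
        for v in values {
            views.push(View {
                start: data.len() as u32,
                len: v.len() as u32,
            });
            data.push_str(v);
        }
        Self {
            views,
            data: Rc::from(data),
        }
    }

    /// Returns the i-th value by slicing the shared buffer.
    fn value(&self, i: usize) -> &str {
        let v = self.views[i];
        let s: &str = &self.data;
        &s[v.start as usize..(v.start + v.len) as usize]
    }

    /// Gathers views by index. The buffer is shared, not copied, and no
    /// bytes are inspected, so no UTF-8 validation is required here.
    fn take(&self, indices: &[usize]) -> Self {
        let views = indices.iter().map(|&i| self.views[i]).collect();
        Self {
            views,
            data: Rc::clone(&self.data),
        }
    }
}
```

Under these assumptions the cost of `take` depends only on the number of indices, not on the total string length.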

@a10y
Contributor Author

a10y commented Aug 1, 2024

PR upstream for better take kernel: apache/arrow-rs#6168


impl IntoCanonical for VarBinArray {
    fn into_canonical(self) -> VortexResult<Canonical> {
        Ok(Canonical::VarBin(self))
fn into_byteview(array: &VarBinArray) -> ArrayRef {
    let mut builder = GenericByteViewBuilder::<BinaryViewType>::with_capacity(array.len());
Contributor Author

This is wrong: the direct path is only valid if max(offsets) of the VarBin is <= u32::MAX.

If it's > u32::MAX (i.e. more than 4GB of strings in a single array), we should construct via the iterator instead.
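
A minimal sketch of the guard described in this comment, assuming offsets are available as `u64` values. `offsets_fit_u32`, `BuildPath`, and `choose_build_path` are hypothetical names for illustration and are not part of Vortex or arrow-rs.

```rust
/// Hypothetical label for the two construction strategies discussed above.
#[derive(Debug, PartialEq)]
enum BuildPath {
    /// Build views directly from offsets; valid only if offsets fit in u32.
    DirectViews,
    /// Fall back to appending values one by one through an iterator.
    ViaIterator,
}

/// Returns true when every offset can be stored in a u32 field.
/// An empty offsets slice trivially fits.
fn offsets_fit_u32(offsets: &[u64]) -> bool {
    // Offsets are normally non-decreasing, so the last one is the max;
    // taking max() explicitly avoids relying on that invariant.
    offsets
        .iter()
        .copied()
        .max()
        .map_or(true, |m| m <= u32::MAX as u64)
}

/// Picks the construction strategy based on the largest offset.
fn choose_build_path(offsets: &[u64]) -> BuildPath {
    if offsets_fit_u32(offsets) {
        BuildPath::DirectViews
    } else {
        BuildPath::ViaIterator
    }
}
```

Checking once up front keeps the common case on the fast path and only pays for the iterator when a single array really holds more than 4GB of string data.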

Comment on lines +153 to +155
             s.field("inline", &"i".to_string());
         } else {
-            s.field("ref", unsafe { &self._ref });
+            s.field("ref", &"r".to_string());
Contributor Author

fix these

@robert3005
Member

Looks like this still needs C Data interface fixes upstream

@@ -196,33 +196,32 @@ fn pack_primitives(
///
/// It is expected this function is only called from [try_canonicalize_chunks], and thus all chunks have
/// been checked to have the same DType already.
#[allow(unused)]
fn pack_views(
Member

fwiw you might find it easier to use the arrow view array builder to avoid alignment issues

Member

until we have our own bytes type this is pretty fiddly/error-prone

@a10y
Contributor Author

a10y commented Aug 1, 2024

apache/arrow-rs#6171

@a10y
Contributor Author

a10y commented Aug 1, 2024

CI won't succeed until this is merged:

Benches will continue to be slower than regular Utf8 until this merges:

We may also want to consider how we handle arrays that canonicalize to something that doesn't fit in Arrow. It's not crazy to build a dictionary-encoded array that has >2GB of string data in a single buffer. That would add complexity to either our internal logic, or the into_arrow logic.

Discussion upstream on i32/u32 for BinaryView happening at apache/arrow-rs#6172

@a10y
Contributor Author

a10y commented Aug 2, 2024

Alright, the above 2 PRs have merged into arrow-rs, which means we now need to wait for them to make their way into DataFusion to get the pytests passing.

Looks like there's some amount of agreement on the discussion ticket that i32 should actually be used for BinaryView.

We basically have a few options on our end for how we want our VarBinView array to work:

  • Use i32 internally
  • Continue to use u32 internally, but only support arrays with blocks of size <=2GB
  • Continue to use u32 internally, and when we do into_arrow we can zero-copy it to arrow if all blocks are <=2GB, else we repack into multiple blocks
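
The third option can be sketched roughly as follows. This is only an illustration under stated assumptions: the 2GB limit is modelled as `i32::MAX` bytes, view inlining of short strings is ignored, and `ARROW_MAX_BLOCK_BYTES`, `can_zero_copy`, and `repack_block_sizes` are hypothetical names rather than Vortex or arrow-rs APIs. The `limit` parameter is explicit so the logic can be exercised with small numbers.

```rust
/// Assumed per-block limit if Arrow views use i32 offsets (about 2GB).
const ARROW_MAX_BLOCK_BYTES: u64 = i32::MAX as u64;

/// Zero-copy is possible only if every existing block is within the limit.
fn can_zero_copy(block_sizes: &[u64], limit: u64) -> bool {
    block_sizes.iter().all(|&b| b <= limit)
}

/// Greedily regroups values (given by byte length) into new blocks so that
/// no block exceeds `limit`. Returns the size of each new block, or None
/// if a single value is larger than `limit` and cannot fit in any block.
fn repack_block_sizes(value_lens: &[u64], limit: u64) -> Option<Vec<u64>> {
    let mut blocks = Vec::new();
    let mut current = 0u64;
    for &len in value_lens {
        if len > limit {
            return None;
        }
        // Written as a subtraction so the check cannot overflow.
        if current > limit - len {
            blocks.push(current);
            current = 0;
        }
        current += len;
    }
    if current > 0 {
        blocks.push(current);
    }
    Some(blocks)
}

/// Convenience wrapper using the assumed Arrow limit.
fn can_zero_copy_to_arrow(block_sizes: &[u64]) -> bool {
    can_zero_copy(block_sizes, ARROW_MAX_BLOCK_BYTES)
}
```

The trade-off is that the check is cheap (one pass over block sizes), while the repack path copies string data and is only taken for oversized blocks.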

@a10y
Contributor Author

a10y commented Sep 5, 2024

Superseded by #757

@a10y a10y closed this Sep 5, 2024
@a10y a10y deleted the aduffy/utf8view branch September 5, 2024 21:51