Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve LIKE performance for Dictionary arrays #11032

Closed
alamb opened this issue Jun 20, 2024 · 1 comment · Fixed by #11058
Closed

Improve LIKE performance for Dictionary arrays #11032

alamb opened this issue Jun 20, 2024 · 1 comment · Fixed by #11058
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jun 20, 2024

Is your feature request related to a problem or challenge?

We have a predicate like col LIKE <...> where col is a DictionaryArray. The filter ends up looking like this:

FilterExec: CAST(col@3 AS Utf8) LIKE '%CONST%' elapsed_compute=38.62343607s]

This is non ideal (in our case this requires many seconds of time) because it will copy each dictionary value column into a StringArray before it applies LIKE

Here is an self contained reproducer:

create or replace table strings as values
  ('FooBar'),
  ('Foo'),
  ('Foo'),
  ('Bar'),
  ('FooBar'),
  ('Bar'),
  ('Baz');
  
create or replace table dict_table as
select arrow_cast(column1, 'Dictionary(Int32, Utf8)') as column1
from strings;

> select column1, arrow_typeof(column1) from dict_table;
+---------+----------------------------------+
| column1 | arrow_typeof(dict_table.column1) |
+---------+----------------------------------+
| FooBar  | Dictionary(Int32, Utf8)          |
| Foo     | Dictionary(Int32, Utf8)          |
| Foo     | Dictionary(Int32, Utf8)          |
| Bar     | Dictionary(Int32, Utf8)          |
| FooBar  | Dictionary(Int32, Utf8)          |
| Bar     | Dictionary(Int32, Utf8)          |
| Baz     | Dictionary(Int32, Utf8)          |
+---------+----------------------------------+
7 row(s) fetched.
Elapsed 0.002 seconds.

The query gets the correct anser:

> select column1 from dict_table where column1 LIKE '%oo%';
+---------+
| column1 |
+---------+
| FooBar  |
| Foo     |
| Foo     |
| FooBar  |
+---------+
4 row(s) fetched.
Elapsed 0.010 seconds.

However, the plan shows it is casting to Utf8 which is quite expensive:

> explain select column1 from dict_table where column1 LIKE '%oo%';
+---------------+------------------------------------------------------------+
| plan_type     | plan                                                       |
+---------------+------------------------------------------------------------+
| logical_plan  | Filter: CAST(dict_table.column1 AS Utf8) LIKE Utf8("%oo%") |
|               |   TableScan: dict_table projection=[column1]               |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                |
|               |   FilterExec: CAST(column1@0 AS Utf8) LIKE %oo%            | <--- NOTE THIS HAS A CAST
|               |     MemoryExec: partitions=1, partition_sizes=[1]          |
|               |                                                            |
+---------------+------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.004 seconds.

Describe the solution you'd like

I would like LIKE to operate much faster on DictionaryArrays (without a cast)

Describe alternatives you've considered

What would likely be much faster is done is to evaluate LIKE on the dictionary values first to find any matching keys, and then look up the values

There is code in the arrow like kernels that appears to handle dictionary values, so this "just may work" if we update the coercion rules to avoid the cast.

https://docs.rs/arrow-string/52.0.0/src/arrow_string/like.rs.html#122-126

However, we may also have to implement something like

// find all keys in the dictionary
let matching_values = like(dictionary_array.values(), scalar);
// figure out which values in the dictionary match by callking take
let dict_indicies = dictionary_array.keys();
// now, leave the overall result here 
let like_result = take(matching_values, dict_indices);

Additional context

Note potentially related to LIKE for StringViewArray -- see #11024

Also potentially related to #5222

@alamb alamb added the enhancement New feature or request label Jun 20, 2024
@alamb alamb assigned alamb and unassigned alamb Jun 20, 2024
@Lordworms
Copy link
Contributor

would like to try it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants