-
Please provide the code you are using to scale up the data. A sample of the data would also be helpful.
-
I just tried this with my own private dataset: around 500_000 short texts, with around 50_000 unique words. It works without problems on the latest version installed via conda. So indeed, a reproducible example would be nice (essential, actually).
-
Hi,
I just started using Vaex yesterday, so maybe the answer is obvious, but I didn't find anything yet.
I have a column of strings containing several words (~3M rows, up to ~150 words per string), and I want to count words across the entire frame.
Testing at a small scale, everything works fine:
```python
import vaex

text = ['Something', 'very pretty', 'is coming', 'our', 'way.']
df = vaex.from_arrays(text=text)
test = df.text.str.split(' ')
df['test'] = test
df.test.value_counts()
```

which gives:

```
way.         1
is           1
pretty       1
our          1
Something    1
coming       1
very         1
dtype: int64
```
But when I try to scale up to the actual dataset, I get this error: `AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'values'`,
and I don't understand how to work with this 😢
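As a stopgap while the chunked-array issue is unresolved, one workaround (a minimal sketch, not Vaex-specific — `texts` here is a placeholder for the real column's contents) is to count the words outside Vaex with a plain `collections.Counter`:

```python
# Hedged workaround sketch: count words with a plain Counter instead of
# relying on df.test.value_counts(), which fails on a pyarrow ChunkedArray.
# `texts` stands in for the actual string column; for a real Vaex frame you
# could materialize the column (e.g. via .tolist() or iterating in chunks).
from collections import Counter

texts = ['Something', 'very pretty', 'is coming', 'our', 'way.']

counts = Counter()
for t in texts:
    counts.update(t.split(' '))  # split each row into words and tally them

print(counts.most_common(3))
```

For large frames, processing the column in chunks and merging the counters keeps memory bounded. Alternatively, pyarrow's `ChunkedArray.combine_chunks()` may flatten the chunked column into a single array, which could sidestep the missing-`.values` error, though whether that helps here depends on where Vaex hits the attribute.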