generate_statistics_from_csv is very slow for a large dataset on a single server #98

yajunwong · 2020-01-03T21:07:13Z

Hi, following the TFX examples, I pass pipeline_options to generate_statistics_from_csv with --direct_num_workers=16 set, like this:

pipeline_options = PipelineOptions(['--direct_num_workers=16'])

It seems that this option does not speed up this API: when I set direct_num_workers=1, the run takes about as long as with 16 workers:

# direct_num_workers=1
python prep.py  99.27s user 5.84s system 99% cpu 1:45.67 total

# direct_num_workers=16
python prep.py  101.92s user 5.22s system 98% cpu 1:48.44 total
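
One way to read these timings, as a rough sketch and not part of the original report: dividing CPU time (user + system) by wall-clock time estimates how many cores were busy on average. The helper name `cpu_utilization` is made up for illustration; the numbers are copied from the two runs above.

```python
def cpu_utilization(user_s, system_s, wall_s):
    """Average number of busy cores: CPU seconds divided by wall-clock seconds."""
    return (user_s + system_s) / wall_s


# Timings copied from the two runs above; wall time converted to seconds.
runs = {
    1: {"user": 99.27, "system": 5.84, "wall": 1 * 60 + 45.67},
    16: {"user": 101.92, "system": 5.22, "wall": 1 * 60 + 48.44},
}

busy_cores = {
    workers: cpu_utilization(t["user"], t["system"], t["wall"])
    for workers, t in runs.items()
}

for workers, cores in busy_cores.items():
    print(f"direct_num_workers={workers}: about {cores:.2f} cores busy")
```

Both runs come out just under one busy core, which matches the 99% and 98% CPU figures reported by the shell and suggests that the extra workers did no work in parallel.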

Could someone help me?


paulgc · 2020-01-03T21:18:57Z

Another option is to try using generate_statistics_from_dataframe if you can load your dataset as a pandas dataframe.

import tensorflow_data_validation as tfdv
import pandas as pd
CSV_FILE_PATH = ''
df = pd.read_csv(CSV_FILE_PATH)
stats = tfdv.generate_statistics_from_dataframe(df)

IveJ · 2020-01-04T18:07:24Z

Hi Yajunwong, when executing your pipeline locally, the default values for the properties in PipelineOptions are generally sufficient, and the direct runner executes on a single machine. https://cloud.google.com/dataflow/docs/guides/specifying-exec-params
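
For reference, a hedged sketch of how direct runner flags might be assembled and passed to TFDV. Nothing in this thread confirms the cause of the slowdown; the assumption here is that a newer Apache Beam release is installed, one that accepts `--direct_running_mode` (with `multi_processing` as one of its values) so that workers are not limited to threads in one Python process. The helper names `build_direct_runner_flags` and `generate_stats` are hypothetical.

```python
def build_direct_runner_flags(num_workers, running_mode="multi_processing"):
    """Build the DirectRunner command-line flags as a list of strings.

    Assumption: the installed Apache Beam version supports
    --direct_running_mode; older releases may reject this flag.
    """
    return [
        f"--direct_num_workers={num_workers}",
        f"--direct_running_mode={running_mode}",
    ]


def generate_stats(csv_path, num_workers):
    """Run TFDV statistics generation over a CSV with the flags above."""
    # Imported here so the flag helper works without TFDV or Beam installed.
    import tensorflow_data_validation as tfdv
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(build_direct_runner_flags(num_workers))
    return tfdv.generate_statistics_from_csv(
        data_location=csv_path,
        pipeline_options=options,
    )


flags = build_direct_runner_flags(16)
print(flags)
```

Calling `generate_stats('<path to csv>', 16)` and timing it against a run with one worker would show whether the running mode makes a difference on a given machine.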


yajunwong · 2020-01-06T08:22:31Z


This option still does not seem to take effect. Please refer to this gist: https://gist.github.com/yajunwong/f317c565f375125fd3ec2963967ba164

yajunwong · 2020-01-06T08:27:46Z

I tried the generate_statistics_from_dataframe API suggested above, but it reports an error. Please refer to this comment: #98 (comment)

rmothukuru self-assigned this Jan 6, 2020

rmothukuru added type:performance stat:awaiting tensorflower labels Jan 6, 2020

rmothukuru assigned IreneGi and unassigned rmothukuru Jan 6, 2020
