generate_statistics_from_csv is very slow for a large dataset on a single server #98

Open
yajunwong opened this issue Jan 3, 2020 · 4 comments

Comments

@yajunwong

Hi. Following the tfx examples, I pass pipeline_options to generate_statistics_from_csv with --direct_num_workers=16 set, like this:

pipeline_options = PipelineOptions(['--direct_num_workers=16'])

It seems that this option does not speed up this API. When I set direct_num_workers=1, the run takes about as long as it does with 16 workers:

# direct_num_workers=1
python prep.py  99.27s user 5.84s system 99% cpu 1:45.67 total

# direct_num_workers=16
python prep.py  101.92s user 5.22s system 98% cpu 1:48.44 total
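To put the two timings side by side, here is a minimal sketch that converts the `time` output above into seconds and compares the runs. The helper name `parse_elapsed` is made up for illustration; the numbers are copied from the two lines above. A ratio of (user + system) to elapsed time close to 1 suggests that only about one core was busy in both runs.

```python
def parse_elapsed(text):
    """Convert an elapsed field such as '1:45.67' (min:sec) into seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Figures copied from the two `time` lines reported in this issue.
runs = {
    1: {"user": 99.27, "system": 5.84, "elapsed": parse_elapsed("1:45.67")},
    16: {"user": 101.92, "system": 5.22, "elapsed": parse_elapsed("1:48.44")},
}

# Speedup of the 16-worker run relative to the 1-worker run.
speedup = runs[1]["elapsed"] / runs[16]["elapsed"]

# Average number of busy cores: CPU time divided by wall-clock time.
busy_cores = {
    workers: (r["user"] + r["system"]) / r["elapsed"]
    for workers, r in runs.items()
}

print(f"speedup with 16 workers: {speedup:.2f}x")
for workers, cores in busy_cores.items():
    print(f"direct_num_workers={workers}: about {cores:.2f} cores busy")
```

A speedup below 1 together with roughly one busy core in both runs is consistent with the work not being spread across workers at all.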

Could someone help me?

@paulgc
Member

paulgc commented Jan 3, 2020

Another option is to try using generate_statistics_from_dataframe if you can load your dataset as a pandas dataframe.

import tensorflow_data_validation as tfdv
import pandas as pd
CSV_FILE_PATH = ''
df = pd.read_csv(CSV_FILE_PATH)
stats = tfdv.generate_statistics_from_dataframe(df)

@IveJ

IveJ commented Jan 4, 2020 via email

@yajunwong
Author

yajunwong commented Jan 6, 2020

Hi Yajunwang, when you execute your pipeline locally, the default values for the properties in PipelineOptions are generally sufficient, and the direct runner runs on a single machine. See https://cloud.google.com/dataflow/docs/guides/specifying-exec-params

This option does not seem to take effect. Please refer to this gist: https://gist.github.com/yajunwong/f317c565f375125fd3ec2963967ba164
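One thing that may be worth checking, offered as an assumption rather than a fix confirmed anywhere in this thread: in Apache Beam releases that support it, the direct runner also has a `--direct_running_mode` option, and `--direct_num_workers` may only spread work across processes when that mode is set to `multi_processing`. The sketch below builds both flags and passes them to generate_statistics_from_csv. The helper names `direct_runner_flags` and `csv_statistics` are hypothetical, and whether this helps depends on the installed Beam and TFDV versions.

```python
def direct_runner_flags(num_workers, running_mode="multi_processing"):
    """Build direct-runner flags as strings for PipelineOptions.

    Assumption: the installed Apache Beam version accepts
    --direct_running_mode; older releases may not.
    """
    return [
        f"--direct_num_workers={num_workers}",
        f"--direct_running_mode={running_mode}",
    ]


def csv_statistics(csv_path, num_workers):
    """Compute statistics for a CSV file with the flags built above."""
    # Imported here so the flag helper can be used without Beam or TFDV installed.
    from apache_beam.options.pipeline_options import PipelineOptions
    import tensorflow_data_validation as tfdv

    options = PipelineOptions(direct_runner_flags(num_workers))
    return tfdv.generate_statistics_from_csv(
        data_location=csv_path,
        pipeline_options=options,
    )


# Show the flags that would be passed for a 16-worker run.
print(direct_runner_flags(16))
```

Timing `csv_statistics` once with 1 worker and once with 16, as in the original report, would show whether the extra flag changes anything in a given environment.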

@yajunwong
Author

Another option is to try using generate_statistics_from_dataframe if you can load your dataset as a pandas dataframe.

import tensorflow_data_validation as tfdv
import pandas as pd
CSV_FILE_PATH = ''
df = pd.read_csv(CSV_FILE_PATH)
stats = tfdv.generate_statistics_from_dataframe(df)

I tried this API, but it reports an error. Please refer to this issue: #98 (comment)
