
Load knowledgebase from zipped json file of pd.DataFrame #238

Open
weixuanfu opened this issue Jan 16, 2020 · 5 comments

weixuanfu commented Jan 16, 2020


Instead of using a tsv file, we can use a pickle file (pre-generated from a pd.DataFrame) to load the results into AI. This is much faster because no eval step is needed to convert the parameter strings into Python dictionaries; the dictionaries can be pickled directly within the pd.DataFrame.

But there is one issue with using pickle for the regression knowledgebase: the pickle file is over 200 MB because of its large number of results, while the classification knowledgebase is only 8 MB.
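
For illustration, a minimal sketch of the two loading paths, assuming a `parameters` column of stringified dicts in the tsv (the file names and column name are placeholders, not the actual ones in the repo):

```python
import pandas as pd

# Current path (assumed layout): the tsv stores parameters as strings,
# so every row needs an eval() to turn the string back into a dict.
kb = pd.read_csv("regression_knowledgebase.tsv", sep="\t")
kb["parameters"] = kb["parameters"].apply(eval)

# Proposed path: pre-generate a pickle of the DataFrame whose "parameters"
# column already holds dict objects, then load it directly -- no eval needed.
kb.to_pickle("regression_knowledgebase.pkl")
kb = pd.read_pickle("regression_knowledgebase.pkl")
```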

@weixuanfu weixuanfu changed the title Load knowledgebase via pickle Load knowledgebase from pickle file of pd.DataFrame Jan 16, 2020
@weixuanfu weixuanfu changed the title Load knowledgebase from pickle file of pd.DataFrame Load knowledgebase from json file of pd.DataFrame Jan 23, 2020
weixuanfu (Contributor Author) commented:

I tried using a json file instead of the pickle file because of the large size. The regression knowledgebase in json format is ~30 MB.
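
A rough sketch of the json route (the paths and `orient` are illustrative choices, not necessarily what ends up in the repo):

```python
import pandas as pd

# Dump the pre-generated DataFrame to json and read it back; nested dicts in
# the "parameters" column survive the round trip without any eval step.
kb = pd.read_pickle("regression_knowledgebase.pkl")
kb.to_json("regression_knowledgebase.json", orient="records")
kb = pd.read_json("regression_knowledgebase.json", orient="records")
```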

@weixuanfu weixuanfu changed the title Load knowledgebase from json file of pd.DataFrame Load knowledgebase from zipped pickle file of pd.DataFrame Jan 23, 2020
@weixuanfu weixuanfu changed the title Load knowledgebase from zipped pickle file of pd.DataFrame Load knowledgebase from zipped json file of pd.DataFrame Jan 23, 2020
weixuanfu (Contributor Author) commented:

Hmm, actually, the gzipped pickle file is less than 20 MB while the gzipped tsv file is more than 30 MB, so I think we can add both options (json/pickle).
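
For reference, a quick way to reproduce the size comparison (paths are placeholders):

```python
import os

import pandas as pd

# Write the knowledgebase out in both gzipped formats and compare file sizes.
kb = pd.read_pickle("regression_knowledgebase.pkl")
kb.to_pickle("regression_knowledgebase.pkl.gz", compression="gzip")
kb.to_csv("regression_knowledgebase.tsv.gz", sep="\t", index=False, compression="gzip")

for path in ("regression_knowledgebase.pkl.gz", "regression_knowledgebase.tsv.gz"):
    print(path, round(os.path.getsize(path) / 1e6, 1), "MB")
```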


weixuanfu commented Jan 28, 2020

[screenshot: timing comparison of deduplication approaches]

The screenshot shows that drop_duplicates or DataFrame.apply without hashing is much faster, even though I added one more step to convert the frozensets back to dicts.
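
A minimal sketch of the frozenset trick, with a made-up DataFrame (the column names and parameter values are just for illustration):

```python
import pandas as pd

kb = pd.DataFrame({
    "algorithm": ["RandomForestRegressor", "RandomForestRegressor", "SVR"],
    "parameters": [{"n_estimators": 100}, {"n_estimators": 100}, {"C": 1.0}],
})

# dicts are unhashable, so drop_duplicates cannot compare them directly;
# frozenset(d.items()) gives a hashable, order-independent stand-in.
kb["parameters"] = kb["parameters"].apply(lambda d: frozenset(d.items()))
kb = kb.drop_duplicates()

# the extra step mentioned above: convert the frozensets back to dicts
kb["parameters"] = kb["parameters"].apply(dict)
```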


weixuanfu commented Jan 28, 2020

Hmm, the new solution above does not work once the classification knowledgebase is merged with the large regression knowledgebase.

I tried using a custom JSON encoder to dump the dicts into a json file, but pandas cannot read it back, so I kept the current permHash solution.
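
For context, a rough illustration of a parameter-hash approach; the actual permHash implementation in the codebase may well differ, and the column names below are assumed:

```python
import hashlib
import json

import pandas as pd

kb = pd.DataFrame({
    "algorithm": ["RandomForestRegressor", "RandomForestRegressor"],
    "parameters": [{"n_estimators": 100}, {"n_estimators": 100}],
})

def param_hash(params):
    # stable key: json with sorted keys, then a short digest
    canonical = json.dumps(params, sort_keys=True, default=str)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

# deduplicate on the algorithm plus the hash of its parameter dict
kb["paramHash"] = kb["parameters"].apply(param_hash)
kb = kb.drop_duplicates(subset=["algorithm", "paramHash"])
```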

weixuanfu (Contributor Author) commented:

I monitored the time usage: deduplicating the results is not very slow, taking only ~5 seconds on my PC. The step that updates AI with the regression knowledgebase took ~1 minute, which needs some improvement.
