
[Numpy] Add "match_tokens_with_char_spans" + Enable downloading from S3 + Add Ubuntu test #1249

Merged: 18 commits into dmlc:numpy on Jun 16, 2020

Conversation

@sxjscience (Member) commented Jun 13, 2020

  • Add match_tokens_with_char_spans to the utilities.
    It converts character spans to token start/end indices via binary search (see the sketch after this list).
  • Enable downloading from S3. Now we are able to call
from gluonnlp.utils import download
download('s3://commoncrawl/crawl-data/CC-MAIN-2020-24/segments/1590347385193.5/wet/CC-MAIN-20200524210325-20200525000325-00003.warc.wet.gz', overwrite=True)
  • Add Ubuntu test
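
Below is a minimal NumPy sketch of the binary-search idea behind match_tokens_with_char_spans. The array shapes and the inclusive end index are assumptions for illustration; the exact signature and semantics of the function added in this PR may differ.

import numpy as np

def match_tokens_with_char_spans(token_offsets, char_spans):
    """Map character spans to token indices via binary search.

    token_offsets : (num_tokens, 2) array of [start, end) character offsets,
                    sorted in token order.
    char_spans    : (num_spans, 2) array of character spans to locate.
    Returns a (num_spans, 2) array of [first_token, last_token] (inclusive).
    """
    token_offsets = np.asarray(token_offsets)
    char_spans = np.asarray(char_spans)
    # First token whose end offset lies beyond the span start.
    starts = np.searchsorted(token_offsets[:, 1], char_spans[:, 0], side='right')
    # Last token whose start offset lies before the span end.
    ends = np.searchsorted(token_offsets[:, 0], char_spans[:, 1], side='left') - 1
    return np.stack([starts, ends], axis=-1)

# e.g. tokens "Hello" and "world" at offsets [[0, 5], [6, 11]]:
# match_tokens_with_char_spans([[0, 5], [6, 11]], [[6, 11]])  ->  [[1, 1]]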

In order for this feature to work, the user needs to configure their AWS/S3 credentials correctly; a quick sanity check is sketched below.
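
This check assumes boto3 is the S3 backend (the diff touches lazy_imports.py, which suggests it is imported lazily); credentials may come from aws configure (~/.aws/credentials), environment variables, or an IAM role:

import boto3

# Verify that AWS credentials are discoverable before calling download()
# on an s3:// URL.
session = boto3.Session()
if session.get_credentials() is None:
    raise RuntimeError('No AWS credentials found. Run `aws configure` or set '
                       'AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY.')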

Also, a speed test shows that downloading from S3 can be more than 3 times faster (2.5 s vs. 8.13 s for a 157 MB file) on a c4.8xlarge EC2 instance:

In [7]: %timeit download('s3://commoncrawl/crawl-data/CC-MAIN-2020-24/segments/1590347385193.5/
   ...: wet/CC-MAIN-20200524210325-20200525000325-00003.warc.wet.gz', overwrite=True)
Downloading CC-MAIN-20200524210325-20200525000325-00003.warc.wet.gz from s3://commoncrawl/crawl-data/CC-MAIN-2020-24/segments/1590347385193.5/wet/CC-MAIN-20200524210325-20200525000325-00003.warc.wet.gz...
100%|██████████████████████████████████████████████████████| 157M/157M [00:01<00:00, 87.3MiB/s]
2.5 s ± 401 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit download('https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-24/segme
   ...: nts/1590347385193.5/wet/CC-MAIN-20200524210325-20200525000325-00003.warc.wet.gz', overw
   ...: rite=True)
Downloading CC-MAIN-20200524210325-20200525000325-00003.warc.wet.gz from https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-24/segments/1590347385193.5/wet/CC-MAIN-20200524210325-20200525000325-00003.warc.wet.gz...
100%|██████████████████████████████████████████████████████| 157M/157M [00:08<00:00, 19.0MiB/s]
8.13 s ± 389 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This will help us download large datasets such as Wikipedia and Common Crawl.

@sxjscience (Member, Author) commented:

@zheyuye We may try to revise our Wikipedia downloading script as follows (see the sketch below):

  1. Try to use S3.
  2. Fall back to HTTPS if S3 raises an exception.
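
A hypothetical wrapper (not part of this PR; download_with_fallback and its arguments are illustrative names) sketching that strategy:

from gluonnlp.utils import download

def download_with_fallback(s3_url, https_url, **kwargs):
    """Try the s3:// URL first; fall back to the HTTPS mirror on any failure."""
    try:
        return download(s3_url, **kwargs)
    except Exception as err:  # e.g. missing credentials or an S3 outage
        print('S3 download failed ({}); falling back to HTTPS.'.format(err))
        return download(https_url, **kwargs)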

@zheyuye (Member) left a comment:

LGTM

@sxjscience (Member, Author) commented Jun 13, 2020

I found that Wikipedia is not available on S3. However, Common Crawl is, so this functionality is still helpful for downloading Common Crawl.

@codecov codecov bot commented Jun 13, 2020

Codecov Report

Merging #1249 into numpy will increase coverage by 0.11%.
The diff coverage is 61.40%.


@@            Coverage Diff             @@
##            numpy    #1249      +/-   ##
==========================================
+ Coverage   82.32%   82.44%   +0.11%     
==========================================
  Files          38       38              
  Lines        5410     5450      +40     
==========================================
+ Hits         4454     4493      +39     
- Misses        956      957       +1     
Impacted Files                        Coverage Δ
src/gluonnlp/utils/misc.py            48.18% <50.00%> (-2.61%) ⬇️
src/gluonnlp/utils/lazy_imports.py    55.71% <66.66%> (+1.02%) ⬆️
src/gluonnlp/utils/__init__.py        100.00% <100.00%> (ø)
src/gluonnlp/utils/preprocessing.py   100.00% <100.00%> (ø)
src/gluonnlp/data/loading.py          83.39% <0.00%> (+7.54%) ⬆️

@zheyuye (Member) commented Jun 13, 2020

Perhaps we can also fix some invalid dataset links in this PR, such as those in the General NLP Benchmarks section and the scripts under https://github.com/dmlc/gluon-nlp/blob/numpy/scripts/datasets/README.md

@sxjscience (Member, Author) commented Jun 15, 2020

I'm not sure why it reports a 61.40% diff hit (it might be how coverage is computed for this patch alone). The overall test coverage actually increases after this PR, so I'll merge this first.

@sxjscience changed the title from "[Numpy] Add match_tokens_with_char_spans to utility + Enable downloading from S3" to "[Numpy] Add "match_tokens_with_char_spans" + Enable downloading from S3 + Add Ubuntu test" on Jun 15, 2020
@szha previously requested changes Jun 15, 2020 (on codecov.yml).
@szha dismissed their stale review June 15, 2020 23:28: "addressed"

@szha merged commit 85b6f09 into dmlc:numpy Jun 16, 2020