[Numpy] Add "match_tokens_with_char_spans" + Enable downloading from S3 + Add Ubuntu test #1249
Conversation
@zheyuye We may try to revise our wikipedia downloading script as:
LGTM
I find that wikipedia is not available on S3. However, CommonCrawl is on S3, so this functionality will help us download CommonCrawl.
Codecov Report
@@ Coverage Diff @@
## numpy #1249 +/- ##
==========================================
+ Coverage 82.32% 82.44% +0.11%
==========================================
Files 38 38
Lines 5410 5450 +40
==========================================
+ Hits 4454 4493 +39
- Misses 956 957 +1
Perhaps this PR could also fix some invalid dataset links, such as those in the General NLP Benchmarks and the scripts in https://github.com/dmlc/gluon-nlp/blob/numpy/scripts/datasets/README.md
We convert the character spans to token start and end indices using binary search.
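The binary-search mapping described above can be sketched as follows. This is a minimal illustration, not the actual GluonNLP implementation; the function name `match_tokens_with_char_spans` and the offset-tuple representation are assumptions based on the PR title.

```python
import bisect

def match_tokens_with_char_spans(token_offsets, char_spans):
    """Map character spans to token index spans via binary search.

    token_offsets: list of (char_start, char_end) for each token,
                   sorted in document order.
    char_spans:    list of (char_start, char_end) queries.
    Returns a list of (token_start, token_end) pairs (end inclusive).
    """
    starts = [s for s, _ in token_offsets]
    ends = [e for _, e in token_offsets]
    result = []
    for span_start, span_end in char_spans:
        # First token whose end offset lies past the span start.
        token_start = bisect.bisect_right(ends, span_start)
        # Last token whose start offset lies before the span end.
        token_end = bisect.bisect_left(starts, span_end) - 1
        result.append((token_start, token_end))
    return result
```

Each query costs O(log n) instead of a linear scan over the tokens, which matters when matching many answer spans against long pretraining documents.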
In order for this feature to work, the user needs to configure S3 credentials correctly.
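One common way to set up those credentials is shown below (placeholder values, not real keys); boto3-based downloaders pick up either the AWS CLI profile or these environment variables automatically.

```shell
# Option 1: interactive setup via the AWS CLI (writes ~/.aws/credentials)
aws configure

# Option 2: export credentials directly in the environment
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_DEFAULT_REGION="us-east-1"
```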
Also, a speed test shows that downloading from S3 can be around 4x faster on a c4.8x machine on EC2. This will help us download large datasets like wikipedia and CommonCrawl.