[vie] Enhancements #133
Comments
I like the idea of adding a Relatedly, we have this somewhat crude way of not skipping spaces in Chinese transcriptions. Perhaps we should have a
I should mention it looks like the Vietnamese data has spaces in the IPA transcriptions between each syllable too. (Have to check whether Good idea re: two flags.
I believe segments handles it a bit awkwardly.
* Adds logging of dialect (when specified). Helps out for #133.
* Splits Vietnamese into three dialects.

Addresses, but does not close, issue #133.

Tested (and then reverted---can do once bug is complete):

* `./scrape.py` with modified (`vie`-only) `languages.json`

```
[src]$ wc -l ../tsv/vie_*
      1 ../tsv/vie_hanoi_phonemic.tsv
  13231 ../tsv/vie_hanoi_phonetic.tsv
      1 ../tsv/vie_hcmc_phonemic.tsv
  13231 ../tsv/vie_hcmc_phonetic.tsv
      1 ../tsv/vie_hue_phonemic.tsv
  13231 ../tsv/vie_hue_phonetic.tsv
  39696 total
```

* `./remove_duplicates_and_split.sh`

```
[src]$ wc -l ../tsv/vie_*
      1 ../tsv/vie_hanoi_phonemic.tsv
  11023 ../tsv/vie_hanoi_phonetic.tsv
      1 ../tsv/vie_hcmc_phonemic.tsv
  11023 ../tsv/vie_hcmc_phonetic.tsv
      1 ../tsv/vie_hue_phonemic.tsv
  11023 ../tsv/vie_hue_phonetic.tsv
  33072 total
```
Why were we skipping data consisting of multi words?
We must have talked about this before, but I cannot recall. Removing this code will help this issue.
I guess some reasons to skip them include:
You're right that that function controls it, but I think we probably want it to be scriptable so users can control whether they skip, and we can set it on a per-language basis for the big scrape. So I think the obvious thing to do is to add a flag (in Potentially we could also make the rules about what gets skipped more expressive---maybe you could pass in a regular expression and if it's matched, the word is skipped---but that's probably premature.
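A minimal sketch of what such a switch might look like. The `should_skip` helper and its `skip_spaces` and `skip_pattern` parameters are hypothetical names chosen for illustration, not taken from the WikiPron code base; the optional regular expression shows the more expressive variant floated above.

```python
import re
from typing import Optional, Pattern


def should_skip(
    word: str,
    skip_spaces: bool = True,
    skip_pattern: Optional[Pattern[str]] = None,
) -> bool:
    """Hypothetical filter deciding whether a scraped headword is dropped."""
    # Default behaviour: drop multi-word entries, i.e. anything with whitespace.
    if skip_spaces and any(ch.isspace() for ch in word):
        return True
    # Optional, more expressive rule: drop anything the caller's regex matches.
    if skip_pattern is not None and skip_pattern.search(word):
        return True
    return False


# A language whose entries legitimately contain spaces could turn the check off.
print(should_skip("ăn cơm"), should_skip("ăn cơm", skip_spaces=False))
```

Keeping the default at "skip" would preserve the current behaviour, so only languages that opt out would be affected.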
Any thoughts on what we should do with transcriptions containing spaces between syllables?
Additionally I've found that some of our Persian data and some of our Tibetan data already contain
We should remove the "#" (as a general post-processing step after tokenization).
We also should remove the non-breaking space (before tokenization, presumably).
K
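The two steps suggested here could be sketched roughly as follows. This is only an illustration: `normalize_spaces` and `strip_boundaries` are hypothetical helper names, the tokenized string is copied from the `segments` output quoted in this thread rather than produced by calling the library, and replacing the non-breaking space with an ordinary space (rather than deleting it outright) is an assumption about what "remove" should mean.

```python
NBSP = "\u00a0"  # non-breaking space


def normalize_spaces(transcription: str) -> str:
    """Before tokenization: turn non-breaking spaces into ordinary spaces.

    Assumption: the syllable boundary should survive as a plain space.
    """
    return transcription.replace(NBSP, " ")


def strip_boundaries(tokenized: str) -> str:
    """After tokenization: drop the "#" word-boundary tokens."""
    return " ".join(tok for tok in tokenized.split() if tok != "#")


# Tokenized form copied from the example quoted in this thread.
tokenized = "ʔ aː m ˧˦ # h i ə w ˧˨ ʔ"
print(strip_boundaries(tokenized))
```

Dropping "#" only as a whole token, rather than with a blanket string replace, avoids touching a "#" that might appear inside some other segment.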
…On Sat, Mar 28, 2020 at 12:31 PM Lucas Ashby ***@***.***> wrote:
Any thoughts on what we should do with transcriptions containing spaces
between syllables? segments does the following:

```
import segments

tokenizer = segments.Tokenizer()
print(tokenizer("ʔaːm˧˦ hiəw˧˨ʔ", ipa=True))  # Will print: ʔ aː m ˧˦ # h i ə w ˧˨ ʔ
```

Additionally I've found that some of our Persian
<https://github.com/kylebgorman/wikipron/blob/master/languages/wikipron/tsv/per_phonemic.tsv#L388>
data and some of our Tibetan
<https://github.com/kylebgorman/wikipron/blob/master/languages/wikipron/tsv/tib_phonemic.tsv#L304>
data already contain "#". These transcriptions utilize the non-breaking
space <https://en.wikipedia.org/wiki/Non-breaking_space> character, which
we don't check for and therefore don't skip.
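
To illustrate why a check for the ordinary space misses these entries, here is a small sketch. The names `has_space_naive` and `has_any_whitespace` are hypothetical and not from the WikiPron code; the point is only that U+00A0 is a different character from U+0020, while Python's `str.isspace` treats both as whitespace.

```python
def has_space_naive(text: str) -> bool:
    """Only detects the ordinary ASCII space (U+0020)."""
    return " " in text


def has_any_whitespace(text: str) -> bool:
    """Detects any Unicode whitespace character, including U+00A0."""
    return any(ch.isspace() for ch in text)


sample = "a\u00a0b"  # two letters joined by a non-breaking space
print(has_space_naive(sample), has_any_whitespace(sample))
```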
Could you give us example words with three vs. four dialects? Then I'll try
to replicate.
FWIW, I hadn't seen those other dialect specifications for Vietnamese.
…On Sat, Mar 28, 2020 at 10:26 PM yeonju123 ***@***.***> wrote:
I have some issue. When a word has three dialects, nothing is scraped.
When there are four dialects including Vinh, Thanh Chương, Hà Tĩnh dialect,
only this dialect is scraped. I tried scraping with a dialect specified as
'hanoi', but it still does not work. Because of this issue, I cannot test
whether my code for fixing the space is working. Does anyone have the same
issue?
Ah, don't worry about that issue. I set the phonemic parameter wrong. But as for the 4th dialect,
Thanks for those two examples; we should add those dialects to the list, I suppose, and see if they get over threshold.
Scraping for a Vietnamese dialect from the three that Kyle added in #135 returns results for all three current dialects (when
I notice a few obvious problems with Vietnamese:
languages.json
--no-skip-space
) here, or we could make a language-specific extractor, I suppose.