Support "SpaceAfter=No" for Untokenized Text #6

Merged (1 commit) on Apr 28, 2020

Conversation

KoichiYasuoka
Contributor

Support "SpaceAfter=No" for Untokenized Text

@BramVanroy changed the title from "Update __init__.py" to "Support "SpaceAfter=No" for Untokenized Text" on Feb 21, 2020
@BramVanroy
Owner

We should probably test whether the .whitespace_ attribute is present in stanfordnlp; otherwise this might break the module.
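One way such a presence check could look is sketched below. This is only an illustration, not the code in this PR: `misc_field` is a hypothetical helper, the tokens are stand-in objects rather than real spaCy or stanfordnlp tokens, and treating a missing attribute as "followed by a space" is an assumption.

```python
from types import SimpleNamespace


def misc_field(token):
    """Return the CoNLL-U MISC value for a token, tolerating a missing attribute."""
    # Assumption: a token without `whitespace_` is treated as followed by a space.
    whitespace = getattr(token, "whitespace_", " ")
    return "_" if whitespace else "SpaceAfter=No"


# Hypothetical stand-ins for tokens, used only to exercise the helper.
with_space = SimpleNamespace(whitespace_=" ")
no_space = SimpleNamespace(whitespace_="")
no_attr = SimpleNamespace()

print(misc_field(with_space), misc_field(no_space), misc_field(no_attr))
```

Using `getattr` with a default keeps the output loop from raising `AttributeError` if the attribute is ever absent.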

@KoichiYasuoka
Contributor Author

import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

en = StanfordNLPLanguage(stanfordnlp.Pipeline(lang="en"))
doc = en("Yes, it's on-going.")
for t in doc:
    head = 0 if t.head == t else t.head.i + 1  # the root token gets head 0
    misc = "_" if t.whitespace_ else "SpaceAfter=No"  # empty whitespace_ means no space follows
    print("\t".join([str(t.i + 1), t.orth_, t.lemma_, t.pos_, t.tag_, "_", str(head), t.dep_, "_", misc]))

The script above works on my Linux (Debian) machine and produces the CoNLL-U output below.

1	Yes	yes	INTJ	UH	_	7	discourse	_	SpaceAfter=No
2	,	,	PUNCT	,	_	7	punct	_	_
3	it	it	PRON	PRP	_	7	nsubj	_	SpaceAfter=No
4	's	be	AUX	VBZ	_	7	cop	_	_
5	on	on	ADV	RB	_	7	advmod	_	SpaceAfter=No
6	-	-	PUNCT	HYPH	_	7	punct	_	SpaceAfter=No
7	going	go	VERB	VBG	_	0	root	_	SpaceAfter=No
8	.	.	PUNCT	.	_	7	punct	_	SpaceAfter=No
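As a rough check that the MISC column carries enough information to recover the untokenized text, the rows above can be joined back together. This is a minimal sketch: `detokenize` is a hypothetical helper, not part of spacy_stanfordnlp, and it assumes ten tab-separated columns per row.

```python
def detokenize(rows):
    """Rebuild the surface text from CoNLL-U rows using SpaceAfter=No."""
    pieces = []
    for row in rows:
        cols = row.split("\t")
        form, misc = cols[1], cols[9]
        pieces.append(form)
        # Insert a space unless the MISC column says no space follows.
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    return "".join(pieces).rstrip()


# The eight rows printed by the script above.
rows = [
    "1\tYes\tyes\tINTJ\tUH\t_\t7\tdiscourse\t_\tSpaceAfter=No",
    "2\t,\t,\tPUNCT\t,\t_\t7\tpunct\t_\t_",
    "3\tit\tit\tPRON\tPRP\t_\t7\tnsubj\t_\tSpaceAfter=No",
    "4\t's\tbe\tAUX\tVBZ\t_\t7\tcop\t_\t_",
    "5\ton\ton\tADV\tRB\t_\t7\tadvmod\t_\tSpaceAfter=No",
    "6\t-\t-\tPUNCT\tHYPH\t_\t7\tpunct\t_\tSpaceAfter=No",
    "7\tgoing\tgo\tVERB\tVBG\t_\t0\troot\t_\tSpaceAfter=No",
    "8\t.\t.\tPUNCT\t.\t_\t7\tpunct\t_\tSpaceAfter=No",
]

print(detokenize(rows))
```

With these rows the helper reproduces the input sentence, "Yes, it's on-going."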

@BramVanroy merged commit c954aa0 into BramVanroy:master on Apr 28, 2020
@BramVanroy
Owner

LGTM, thanks!

2 participants