Japanese test improvements #962

Merged on Jul 28, 2021 (4 commits)


wallace11
Contributor

Hi there,
Here's some more sensible Japanese tests.
I hope that they pass 😆

@eikek
Owner

eikek commented Jul 28, 2021

Hi @wallace11 thanks! I'm afraid that this won't pass. The current algorithm first generates a sequence of words that are determined by some "separating characters" like whitespace/punctuation etc. But now the dates are surrounded by text: 付は2021.7.21で - there is no whitespace or something like that here?

Edit: the CI complains about formatting, this can be fixed by running sbt fix (just fyi)
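
To illustrate the word-splitting problem described in this comment, here is a minimal Python sketch. It is not Docspell's code: the separator set, the date pattern, and the names `split_words` and `find_dates_by_word` are all assumptions made for the example. It only shows why a date surrounded by Japanese text is never seen as a word of its own.

```python
import re
from typing import List

# Assumed separator set: whitespace plus a few punctuation marks.
# '.' is deliberately left out so that "2021.7.21" stays in one piece.
SEPARATORS = re.compile(r"[\s,;:!?()]+")

# Assumed date shape for this sketch: yyyy.m.d
DATE = re.compile(r"\d{4}\.\d{1,2}\.\d{1,2}")


def split_words(text: str) -> List[str]:
    """Split text into words at the separating characters."""
    return [w for w in SEPARATORS.split(text) if w]


def find_dates_by_word(text: str) -> List[str]:
    """Return only those words that are, as a whole, a date."""
    return [w for w in split_words(text) if DATE.fullmatch(w)]


# With spaces around it, the date is a word of its own and is found.
print(find_dates_by_word("The date is 2021.7.21 today"))

# Without separators the whole sentence is one "word", so nothing matches.
print(find_dates_by_word("付は2021.7.21で"))
```

The second call returns an empty list because the only word produced is the complete string `付は2021.7.21で`.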

@wallace11
Contributor Author

@eikek
Hey!
Sorry, I noticed your message only after pushing a possible fix (done manually, I don't have a Scala environment set up...).

Regarding spaces, that's the thing: in "normal" Japanese there's no such thing. That's exactly why I wanted to create proper Japanese tests, to see if it catches that.

I looked at some of my documents and indeed on some of them you've got the date as part of the first sentence or the title (which is also a sentence).

Do you think it'd be possible to fix that?
If these tests won't work, would you like me to give it a test run on a couple of documents and see how it catches the dates?

@eikek
Owner

eikek commented Jul 28, 2021

@wallace11 no worries! (you would only need to install sbt for this) Thanks for your explanation! I just read on Wikipedia that there are no spaces in Japanese :) Well, I guess this means doing it completely differently here. If you have some documents you could share, that would help! That way I could run this against some "real" data. I might be able to remove all characters that are not Arabic numerals or the characters for year/month/day… maybe this gives some results.

Not very efficient, but should work to find the position of dates in Japanese text.
@eikek
Owner

eikek commented Jul 28, 2021

@wallace11 I just pushed quite a crude fix :-). It preprocesses the text and removes all characters that don't take part in a date. Your tests should pass now. You could try this against your documents. I can merge this, and a few minutes later a nightly version will be published.
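
The preprocessing idea in this comment can be sketched roughly as follows, in Python. This is an illustration under stated assumptions, not the code from the commit: the set of characters that "take part in a date" (ASCII digits, `.`, `/`, `-`, and 年/月/日), the date pattern, and the names `keep_date_chars` and `find_dates` are all hypothetical. Each unwanted character is replaced by a single space instead of being dropped, so the cleaned text keeps its length and a match position still refers to the original text.

```python
import re
from typing import List

# Assumption: only these characters can be part of a date.
NOT_DATE_CHAR = re.compile(r"[^0-9./\-年月日]")

# Assumed date shapes: 2021.7.21, 2021/7/21, 2021-7-21, 2021年7月21日
DATE = re.compile(r"\d{4}[./\-年]\d{1,2}[./\-月]\d{1,2}日?")


def keep_date_chars(text: str) -> str:
    """Replace every character that cannot be part of a date with a space."""
    return NOT_DATE_CHAR.sub(" ", text)


def find_dates(text: str) -> List[str]:
    """Clean the text, split on whitespace, and keep the words that are dates."""
    cleaned = keep_date_chars(text)
    return [w for w in cleaned.split() if DATE.fullmatch(w)]


print(find_dates("付は2021.7.21で"))
print(find_dates("2021年7月21日に届きました"))
```

This sketch ignores full-width digits and would be confused by a date directly followed by a word that starts with 日 (for example 日本), so it is only a starting point for trying the approach on real documents.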

@wallace11
Contributor Author

@eikek
Looks perfect.
I'll definitely give it a go and let you know how it went.
I guess the upcoming weekend is going to be all about organizing documents ;)

@eikek
Owner

eikek commented Jul 28, 2021

@wallace11 Great 😃 ! Sounds like a weekend 😉 Thank you for your help!

@eikek eikek merged commit 16ade69 into eikek:master Jul 28, 2021
@eikek eikek added this to the Docspell 0.25.0 milestone Jul 29, 2021