Add splitexp dataset doc #7
base: main
docs/TDD-C-202105-UNL-002.md
Outdated
@@ -0,0 +1,67 @@
# SplitExp

A rule-based method for determining the end of sentence has been developed for Turkish news texts. By including direct quotations that have not been addressed in the problem before, the punctuation ambiguities at the end of the sentence are eliminated by means of a single regular expression. The proposed method was applied to the news content collected from the internet and compared with the Punkt algorithm, which is a statistical approach, in terms of speed and accuracy.
Here we care about the dataset that was created rather than the sentence boundary detection algorithms. So this section should describe only the dataset, not the algorithms that were developed.
Also, the GitHub link of the dataset should be added to this section.
### Fields

Explain the fields of the instances.
This section should show the example given at https://raw.githubusercontent.com/ideateknoloji/SplitExp/master/Dataset/sbd.json
| field | dtype |
|----------|------------|
| \n | new sentence (token) |
Add a Splits section and state there how many documents and samples the dataset consists of.
## Dataset Details

This dataset has been created using quotes that are frequently found in Turkish news texts. More than one case was evaluated and a matcher was created over the samples that fit each case.
The related paper gives the distribution within the dataset of the different cases that involve sentence endings, such as numbers and quotations. Those distributions could also be added as a table.
## Additional Information
Add a heading named `Version` and state which commit of the relevant repository the dataset was taken from.
}
``

{"_id":"5bcdd1ac31878cb578d6a13f","text":"Merkez Bankası, Ziraat, Halkbank, Vakıfbank ve Kalkınma Bankası Hazine ve Maliye Bakanı Berat Albayrak’a bağlandı.","indexes":["113"],"types":["0"]}
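For readers of this thread, here is a minimal sketch of how a record like the one above might be consumed. It rests on an assumption that is not confirmed anywhere in the diff: that each entry in `indexes` is a 0-based character offset of a sentence-ending character in `text`. The sample is consistent with that reading, since the text is 114 characters long and its final period sits at offset 113. The meaning of `types` is not stated, so the sketch leaves it untouched. `split_record` is a hypothetical helper name.

```python
import json
import unicodedata

# One record copied from the diff above.
RAW = (
    '{"_id":"5bcdd1ac31878cb578d6a13f",'
    '"text":"Merkez Bankası, Ziraat, Halkbank, Vakıfbank ve Kalkınma Bankası '
    'Hazine ve Maliye Bakanı Berat Albayrak’a bağlandı.",'
    '"indexes":["113"],"types":["0"]}'
)


def split_record(record):
    """Cut record["text"] after every offset listed in record["indexes"].

    Assumption: offsets are 0-based positions of sentence-ending characters.
    """
    # Normalize to NFC so that letters such as "ğ" count as one character.
    text = unicodedata.normalize("NFC", record["text"])
    sentences, start = [], 0
    for offset in sorted(int(i) for i in record["indexes"]):
        sentences.append(text[start:offset + 1].strip())
        start = offset + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


record = json.loads(RAW)
sentences = split_record(record)
print(len(sentences), repr(sentences[0][-1]))
```

If the offsets turn out to be 1-based or to mark the start of the next sentence, only the two `offset + 1` expressions would need to change.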
I think this was left over from the example section; it needs to be deleted.
## Dataset Details

This dataset has been created using quotes that are frequently found in Turkish news texts. More than one case was evaluated and a matcher was created over the samples that fit each case.
I did not understand exactly what you meant in this section. Could you rewrite it more simply? I think we are trying to write something similar to the following part of the paper in order to explain the table below:
mark ‘!’, question mark ‘?’ and the ellipsis character ‘...’ (U+2026).
In the method proposed in our study, the cases in which these characters
do not end a sentence are examined under 4 headings: quotations, numbers,
abbreviations and extensions. The regular expressions that match these cases
were built using features such as recursion and conditional constructs, and
were combined into a single main expression that matches the text without interruption.
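To make the quoted passage more concrete, here is a deliberately simplified sketch of the general idea: one combined expression whose earlier alternatives consume quotations, numbers, abbreviations and extensions, so that only the remaining punctuation counts as a sentence end. This is not the SplitExp expression. Python's built-in `re` module has no recursion, the abbreviation and extension lists are hypothetical, treating the period as an end character is an assumption (the quoted excerpt starts mid-sentence), and `split_sentences` is an illustrative name.

```python
import re

# Hypothetical lists, for illustration only.
ABBR = "|".join(map(re.escape, ("Dr", "Prof", "vb", "vs")))
EXT = "|".join(map(re.escape, ("com", "org", "net", "tr")))

# Alternatives are tried left to right, so the protected cases win
# over the generic sentence-end alternative at the bottom.
TOKEN = re.compile(
    rf'''
      (?P<quote>"[^"]*"|“[^”]*”)          # quotation: skip everything inside
    | (?P<number>\d+(?:[.,]\d+)+)         # number such as 1.250,75
    | (?P<abbr>\b(?:{ABBR})\.)            # listed abbreviation plus its period
    | (?P<ext>\b(?:\w+\.)+(?:{EXT})\b)    # dotted name ending in a listed extension
    | (?P<end>[.!?…]+)                    # whatever is left ends a sentence
    ''',
    re.VERBOSE,
)


def split_sentences(text):
    """Split text at every match of the `end` alternative."""
    sentences, start = [], 0
    for match in TOKEN.finditer(text):
        if match.lastgroup != "end":
            continue  # protected span, not a sentence boundary
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = (
    'Dr. Kaya "Geldim! Nerede?" dedi. '
    "Fiyat 1.250,75 TL oldu. "
    "Ayrıntılar www.example.com adresinde…"
)
result = split_sentences(sample)
for sentence in result:
    print(sentence)
```

The sketch never splits inside or directly after a quotation, which is far cruder than the handling of direct quotations that the dataset doc attributes to the paper.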
No description provided.