Add splitexp dataset doc #7

Open
wants to merge 3 commits into
base: main

tugberkcil

No description provided.

@@ -0,0 +1,67 @@
# SplitExp

A rule-based method for determining the end of sentence has been developed for Turkish news texts. By including direct quotations that have not been addressed in the problem before, the punctuation ambiguities at the end of the sentence are eliminated by means of a single regular expression. The proposed method was applied to the news content collected from the internet and compared with the Punkt algorithm, which is a statistical approach, in terms of speed and accuracy.
Collaborator

Here we are interested in the dataset that was created rather than in the sentence boundary detection algorithms. So this section should describe only the dataset, not the algorithms that were developed.

Collaborator

Also, the dataset's GitHub link should be added to this section.


### Fields

Explain the fields of the instances.
Collaborator

| field | dtype |
|----------|------------|
| \n | new sentence (token) |

Collaborator

Add a Splits section and state there how many documents and samples the dataset consists of.

## Dataset Details

This dataset has been created using quotes that are frequently found in Turkish news texts. More than one case was evaluated and a matcher was created over the samples that fit each case.

Collaborator

The related paper gives the distribution within the dataset of the different cases that involve sentence-ending punctuation, such as numbers and quotations. Those distributions could also be added as a table.



## Additional Information

Collaborator

Add a heading named

Version

and state which commit of the relevant repository the dataset was taken from.

}
``

{"_id":"5bcdd1ac31878cb578d6a13f","text":"Merkez Bankası, Ziraat, Halkbank, Vakıfbank ve Kalkınma Bankası Hazine ve Maliye Bakanı Berat Albayrak’a bağlandı.","indexes":["113"],"types":["0"]}
Collaborator

I think this is left over from the example section; it needs to be removed.
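
For readers unfamiliar with the record layout, the leftover line can be read as follows. This is only a minimal sketch: it assumes that each entry of `indexes` is a 0-based character offset into `text` stored as a string, and that `types` is aligned with `indexes`. Neither assumption is stated in this thread, and the meaning of the type code `"0"` is not given here.

```python
import json

# The record from the diff above, copied verbatim.
raw = '{"_id":"5bcdd1ac31878cb578d6a13f","text":"Merkez Bankası, Ziraat, Halkbank, Vakıfbank ve Kalkınma Bankası Hazine ve Maliye Bakanı Berat Albayrak’a bağlandı.","indexes":["113"],"types":["0"]}'

record = json.loads(raw)
text = record["text"]

# Assumption: "indexes" holds 0-based character offsets encoded as strings.
offsets = [int(i) for i in record["indexes"]]

# Assumption: "types" is aligned with "indexes"; the codes are left uninterpreted.
for offset, kind in zip(offsets, record["types"]):
    print(offset, repr(text[offset]), kind)  # prints: 113 '.' 0
```

Under these assumptions, offset 113 points at the final period of the 114-character text, which is consistent with the field marking a sentence-end position.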


## Dataset Details

This dataset has been created using quotes that are frequently found in Turkish news texts. More than one case was evaluated and a matcher was created over the samples that fit each case.
Collaborator

I didn't understand exactly what you meant in this part; could you rewrite it in a simpler way? I think that, to explain the table below, we are trying to write something similar to this part of the paper:

mark ‘!’, the question mark ‘?’ and the ellipsis character ‘...’ (U+2026).
In the method proposed in our study, the cases in which these characters
do not mark the end of a sentence are examined under 4 headings:
quotations, numbers, abbreviations and extensions. The regular expressions that match these cases were built using features such as recursion and conditional constructs
and were combined into a single main expression that matches the
text without interruption.
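
To make the four cases in the quoted passage concrete, here is a rough sketch of a single expression that consumes quotations, numbers, abbreviations and extensions as whole units before looking for a terminator. It is not the paper's expression: Python's standard `re` module has no recursive patterns, so that part is not reproduced, and the abbreviation list, the extension rule and the sample text are all made up for illustration. Only straight double quotes are handled.

```python
import re

# Illustrative pattern only; every sub-pattern below is an assumption.
SENTENCE = re.compile(
    r'''
    \s*
    (?P<sentence>
      (?:
          "[^"]*"               # quotation: kept together, inner punctuation ignored
        | \d+(?:[.,]\d+)+       # number such as 1.250,75 or 14.30
        | \b(?:Dr|Prof|vb)\.    # abbreviation (tiny hypothetical list)
        | \w+\.[a-z]{2,4}\b     # extension such as rapor.pdf
        | [^.!?…]               # any character that is not a terminator
      )+
      (?:\.{3}|[.!?…]|$)        # terminator, or end of text
    )
    ''',
    re.VERBOSE,
)


def split_sentences(text: str) -> list[str]:
    """Return the sentences found by the single expression above."""
    return [m.group("sentence") for m in SENTENCE.finditer(text)]


sample = (
    'Dr. Kaya "Geldik. Gördük." dedi. '
    "Fiyat 1.250,75 TL oldu! "
    "Dosya rapor.pdf olarak kaydedildi."
)

for sentence in split_sentences(sample):
    print(sentence)
```

On this sample the sketch yields three sentences, because the periods inside `Dr.`, the quotation, `1.250,75` and `rapor.pdf` are consumed by the protected alternatives before the terminator is tried. Alternation order matters: the protected cases must come before the catch-all character class.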
