Sentence tokenizer not working on Full stop #76

Open
shivambatra76 opened this issue Jan 27, 2020 · 0 comments

I passed the following input to clean_text_by_sentences, imported as follows:

from summa.preprocessing.textcleaner import clean_text_by_sentences as _clean_text_by_sentences

text='''Ad sales boost Time Warner profit
Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales. TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn. Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.
'''
This is the output I received after preprocessing. As you can see, the second unit should be split at the full stop in "year-earlier.The firm", but it is not. The text is only being split at the line break and at full stops that are followed by a space.

[Original unit: 'Ad sales boost Time Warner profit' --- Processed unit: 'ad sale boost time warner profit',
Original unit: 'Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.The firm, which is now one of the biggest investors in Google, benefited from sales of high-speed internet connections and higher advert sales.' --- Processed unit: 'quarter profit media giant timewarn jump bn £m month decemb m year earlier firm biggest investor googl benefit sale high speed internet connect higher advert sale',
Original unit: 'TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn.' --- Processed unit: 'timewarn said fourth quarter sale rose bn bn',
Original unit: 'Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.' --- Processed unit: 'profit buoy gain offset profit dip warner bros user aol']
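
A possible workaround, until the tokenizer handles this case, is to normalize the text before passing it in. The sketch below assumes (based only on the output above, not on summa's source) that a full stop is treated as a sentence boundary only when whitespace follows it. The helper name add_space_after_full_stop is hypothetical and not part of summa; it inserts a space after a full stop that sits directly between a lowercase letter and an uppercase letter, so figures such as $1.13bn are left alone.

```python
import re

# Hypothetical helper, not part of summa.
# Matches a full stop that directly follows a lowercase letter and is
# directly followed by an uppercase letter, e.g. "year-earlier.The".
_MISSING_SPACE = re.compile(r"(?<=[a-z])\.(?=[A-Z])")


def add_space_after_full_stop(text):
    """Insert the missing space after a sentence-ending full stop."""
    return _MISSING_SPACE.sub(". ", text)


sample = "from $639m year-earlier.The firm benefited. Sales rose 2% to $11.1bn."
fixed = add_space_after_full_stop(sample)
print(fixed)
```

The normalized string could then be passed to _clean_text_by_sentences in place of the raw text. Note that this heuristic would also split identifiers written like "name.Value", so it may not suit every corpus.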
