Tokenizing text

Juan Alvarez

Hello,

I need to tokenize English text into sentences. I realize this is a very
complex task to get right all of the time (if possible at all) but for
the time being I'm only trying to implement a better solution than
string.split('.').

Browsing around I found this snippet:

string.scan( /\w.+?[.!?]+(?=\s|\Z)/ )

which almost works for what I need except for two cases: ellipses and at
least the most common abbreviations. Abbreviations are the hardest part and
I've been tinkering with a couple of possible solutions. How would you
approach this?

Thanks in advance
Juan
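
For reference, a minimal sketch of the two problem cases, plus one possible
direction: keep the regex as is, then glue a fragment back onto the previous
one when that one ends in a known abbreviation, or ends in an ellipsis and the
next fragment starts with a lowercase letter. The abbreviation list and the
method names (naive_sentences, merged_sentences, continues?) are hypothetical
and only for illustration; this is not a complete or tested solution.

```ruby
# Hypothetical, deliberately tiny abbreviation list; extend it for real text.
ABBREVIATIONS = %w[Dr Mr Mrs Ms Prof etc].freeze

# The regex from the post, unchanged.
SENTENCE_RE = /\w.+?[.!?]+(?=\s|\Z)/

# Plain scan: shows where the regex alone splits too early.
def naive_sentences(text)
  text.scan(SENTENCE_RE)
end

# True when `fragment` looks like a continuation of `previous`.
def continues?(previous, fragment)
  # Word right before a single trailing period, e.g. "Dr" in "Dr."
  last_word = previous[/(\w+)\.\z/, 1]
  after_abbreviation = ABBREVIATIONS.include?(last_word)
  # An ellipsis followed by a lowercase start is assumed to be mid-sentence.
  after_ellipsis = previous.end_with?("...") && fragment.match?(/\A[a-z]/)
  after_abbreviation || after_ellipsis
end

# Scan first, then merge fragments that were split too early.
# Note: merged fragments are rejoined with a single space.
def merged_sentences(text)
  naive_sentences(text).each_with_object([]) do |fragment, sentences|
    previous = sentences.last
    if previous && continues?(previous, fragment)
      sentences[-1] = "#{previous} #{fragment}"
    else
      sentences << fragment
    end
  end
end

p naive_sentences("Dr. Smith arrived. He was late.")
# => ["Dr.", "Smith arrived.", "He was late."]
p naive_sentences("Wait... what happened? Nothing.")
# => ["Wait...", "what happened?", "Nothing."]

p merged_sentences("Dr. Smith arrived. He was late.")
# => ["Dr. Smith arrived.", "He was late."]
p merged_sentences("Wait... what happened? Nothing.")
# => ["Wait... what happened?", "Nothing."]
```

Known limits of this sketch: an abbreviation that really does end a sentence
("... apples, pears, etc. Then we left.") gets merged with the next one, and
an ellipsis followed by a capitalized word is always treated as a sentence
break.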
 
