Tokenizing text

Thread starter Juan Alvarez
Start date Feb 24, 2009

Juan Alvarez

Feb 24, 2009

Hello,

I need to tokenize English text into sentences. I realize this is a very
complex task to get right all of the time (if it is possible at all), but for
the time being I'm only trying to implement a better solution than
string.split('.').

Browsing around I found this snippet:

string.scan( /\w.+?[.!?]+(?=\s|\Z)/ )

which almost works for what I need except for two cases: ellipses and, at
the very least, the most common abbreviations. Abbreviations are the hardest
part, and I've been tinkering with a couple of possible solutions. How would
you approach this?

Thanks in advance
Juan
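
One possible way to approach the two problem cases, as a rough sketch rather than a complete solution: walk the text token by token instead of relying on a single regular expression, and refuse to end a sentence after a known abbreviation or after an ellipsis that is followed by a lowercase word. The ABBREVIATIONS list and the method names sentence_end? and split_sentences are made up for this illustration; the list is deliberately tiny and would need extending for real text.

```ruby
# Hypothetical, deliberately small list of abbreviations (lowercase, without
# the final period). Extend it to suit the text being processed.
ABBREVIATIONS = %w[mr mrs ms dr prof sr jr st vs etc e.g i.e].freeze

# Decide whether +token+ closes a sentence, looking one token ahead.
def sentence_end?(token, next_token)
  # Must end in ., ! or ?, optionally followed by closing quotes/brackets.
  return false unless token.match?(/[.!?]+["')\]]*\z/)

  # Strip opening quotes/brackets and one trailing period, then check
  # the remainder against the abbreviation list.
  word = token.downcase.sub(/\A["'(\[]+/, '').sub(/\.["')\]]*\z/, '')
  return false if ABBREVIATIONS.include?(word)

  # Assumption: an ellipsis followed by a lowercase word is a pause
  # inside the sentence, not the end of it.
  if token.match?(/\.{3,}["')\]]*\z/) && next_token && next_token.match?(/\A[a-z]/)
    return false
  end

  true
end

# Split +text+ into sentences. Whitespace is normalized to single spaces.
def split_sentences(text)
  tokens = text.split
  sentences = []
  current = []
  tokens.each_with_index do |token, i|
    current << token
    if sentence_end?(token, tokens[i + 1])
      sentences << current.join(' ')
      current = []
    end
  end
  # Keep any trailing text that never reached a terminator.
  sentences << current.join(' ') unless current.empty?
  sentences
end

puts split_sentences("Dr. Smith arrived late... but he stayed. Did Mr. Jones leave? Yes!")
# Prints:
#   Dr. Smith arrived late... but he stayed.
#   Did Mr. Jones leave?
#   Yes!
```

Known limitations of this sketch: a sentence that genuinely ends in an abbreviation ("... apples, pears, etc. Then we left.") will not be split, and an ellipsis followed by a capitalized word is always treated as a sentence break. Handling those reliably probably needs more context than a one-token lookahead.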
