Text parser (text into sentences) that works with UTF-8 and multiple languages?

Discussion in 'Ruby' started by mike b., Jul 30, 2007.

  1. mike b.

    mike b. Guest

    Hi all,

    I have to parse about 2000 files that are written in multiple
    languages (some English, some Korean, some Arabic and some Japanese).
    I have to split these UTF-8 encoded files into individual sentences. Has
    anyone written a good parser that can parse all these non-Latin
    character languages or can someone give me some advice on how to go
    about writing a parser that can handle all these fairly different
    languages?

    Thank you,

    Mike
    mike b., Jul 30, 2007
    #1

  2. 2007/7/30, mike b. <>:
    > I have to parse about 2000 files that are written in multiple
    > languages (some English, some Korean, some Arabic and some Japanese).
    > I have to split these UTF-8 encoded into individual sentences. Has
    > anyone written a good parser that can parse all these non-Latin
    > character languages or can someone give me some advice on how to go
    > about writing a parser that can handle all these fairly different
    > languages?


    I would consider doing this in Java, as Java's regular expressions
    support Unicode. That might make the job much easier. OTOH, if all
    files use only dot, question mark etc. (i.e. ASCII chars) as sentence
    delimiters then Ruby's regular expressions might as well do the job.

    Kind regards

    robert
    Robert Klemme, Jul 30, 2007
    #2

  3. Oblomov

    Oblomov Guest

    On Jul 30, 11:26 am, "Robert Klemme" <>
    wrote:
    > 2007/7/30, mike b. <>:
    >
    > > I have to parse about 2000 files that are written in multiple
    > > languages (some English, some Korean, some Arabic and some Japanese).
    > > I have to split these UTF-8 encoded into individual sentences. Has
    > > anyone written a good parser that can parse all these non-Latin
    > > character languages or can someone give me some advice on how to go
    > > about writing a parser that can handle all these fairly different
    > > languages?

    >
    > I would consider doing this in Java, as Java's regular expressions
    > support Unicode. That might make the job much easier. OTOH, if all
    > files use only dot, question mark etc. (i.e. ASCII chars) as sentence
    > delimiters then Ruby's regular expressions might as well do the job.


    Ruby supports UTF-8 regular expressions: for example, /\w+|\W/u can be
    used to scan a string, splitting it into words and non-words. There were
    some bugs with Unicode character classifications in older versions of
    Ruby, but I'm not aware of any in 1.8.6; OTOH I've never tried it with
    non-Latin text, so I don't know if it works correctly in those cases too.
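
    A minimal sketch of the word/non-word scan described above, assuming
    Ruby 1.9 or later rather than the 1.8.6 discussed in this post. In
    those versions \w matches only ASCII word characters, so the sketch
    swaps in the Unicode-aware \p{Word} property. The method name
    words_and_nonwords is hypothetical.

```ruby
# Split a UTF-8 string into runs of word characters and single
# non-word characters. \p{Word} covers letters, marks, numbers and
# connector punctuation in any script; \P{Word} is its negation.
def words_and_nonwords(text)
  text.scan(/\p{Word}+|\P{Word}/)
end

# Keep only the word tokens, dropping punctuation and whitespace.
def words_only(text)
  words_and_nonwords(text).select { |token| token =~ /\A\p{Word}+\z/ }
end
```

    Note that scripts written without spaces between words (Japanese, for
    instance) come back as one long token per run of letters, so this is
    a character-class split and not real word segmentation.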
    Oblomov, Jul 30, 2007
    #3
  4. On Jul 30, 2007, at 3:50 AM, mike b. wrote:

    > I have to parse about 2000 files that are written in multiple
    > languages (some English, some Korean, some Arabic and some Japanese).
    > I have to split these UTF-8 encoded into individual sentences.


    As has been stated, Ruby's regular expression engine has a Unicode
    mode and that may be all you need here, depending on how you
    recognize sentence boundaries.
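
    As a rough illustration of the regex-only approach, here is a sketch
    that treats a fixed set of punctuation marks as sentence boundaries.
    It assumes Ruby 1.9 or later (for \u escapes in regex literals), and
    the terminator list is an assumption chosen to cover the languages
    mentioned in the question, not a complete inventory. The names
    SENTENCE and split_sentences are hypothetical.

```ruby
# A sentence is a run of non-terminators followed by one or more
# terminators, or a trailing run of text with no terminator at all.
# Terminators assumed here:
#   . ! ?     ASCII (English, Korean, most Arabic text)
#   \u3002    ideographic full stop (Japanese)
#   \uFF01    fullwidth exclamation mark
#   \uFF1F    fullwidth question mark
#   \u061F    Arabic question mark
SENTENCE = /[^.!?\u3002\uFF01\uFF1F\u061F]+(?:[.!?\u3002\uFF01\uFF1F\u061F]+|\z)/

# Return the sentences found in a UTF-8 string, with surrounding
# ASCII whitespace trimmed and empty fragments dropped.
def split_sentences(text)
  text.scan(SENTENCE).map(&:strip).reject(&:empty?)
end
```

    A pattern this simple will also split on abbreviations ("e.g.") and
    decimal numbers ("3.14"), so whether it is good enough depends on
    the files in question.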

    > Has anyone written a good parser that can parse all these non-Latin
    > character languages or can someone give me some advice on how to go
    > about writing a parser that can handle all these fairly different
    > languages?


    I've released an initial version of my Ghost Wheel parser generator
    library. It doesn't have documentation yet, but it was built using
    TDD and you should be able to look over the tests to see how it
    works. I'm also happy to answer questions.

    My hope is that it works fine for non-Latin languages, but I'll
    confess that I haven't tested it that way yet. I would try to fix
    any issues you uncovered though.

    James Edward Gray II
    James Edward Gray II, Jul 30, 2007
    #4