matching a sentence, greedy up!

Discussion in 'Python' started by Christian Buck, Aug 10, 2003.

  1. Hi,

    I'm writing a regexp that matches complete sentences in a German text
    and correctly ignores abbreviations. Here is a very simplified version of
    it; once it works I could post the complete regexp if anyone is
    interested (it is actually 11 kB):

    [A-Z](?:[^\.\?\!]+|[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+(?!\s[a-z])

    As you can see, I use [] for character sets because I don't want to depend
    on locales, and speed doesn't matter. (I removed the German characters from
    the above example.) I also allow - and _ within a sentence.

    OK, this is what I think it should do:
    [A-Z] - start with an uppercase character
    (?: - open a non-capturing group
    [^\.\?\!]+ - eat everything that does not look like an end
    | - OR
    [^a-zA-Z0-9\-_] - accept a non-word character
    (?: - followed by ...
    [a-zA-Z0-9\-_]\. - a character and a dot like 'i.', '1.' (doesn't work!!!)
    | - OR
    \d+\. - a number and a dot
    | - OR
    a\.[\s\-]?A\. - some common abbreviations (only one here)
    )){3,} - repeated, at least 3 times
    [\.\?\!]+ - this is the end, and should also match '...'
    (?!\s[a-z]) - not followed by whitespace and a lowercase character
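
    The breakdown above can also be written directly into the pattern with
    re.VERBOSE, which ignores whitespace outside character classes and treats
    # as a comment marker. This is only a sketch of the simplified pattern from
    this post; the name SENTENCE is an assumption made for illustration, and
    the backslashes inside the classes are dropped where they are not needed.

```python
import re

# Simplified sentence pattern from the post, one commented part per line.
# SENTENCE is a hypothetical name chosen for this sketch.
SENTENCE = re.compile(r"""
    [A-Z]                       # start with an uppercase character
    (?:                         # non-capturing group, repeated below
        [^.?!]+                 #   everything that does not look like an end
      |                         #   OR
        [^a-zA-Z0-9\-_]         #   a non-word character ...
        (?:                     #   ... followed by one of:
            [a-zA-Z0-9\-_]\.    #     a character and a dot ('i.', '1.')
          | \d+\.               #     a number and a dot
          | a\.[\s\-]?A\.       #     the single sample abbreviation
        )
    ){3,}                       # at least three repetitions
    [.?!]+                      # the end, which also matches '...'
    (?!\s[a-z])                 # not followed by whitespace + lowercase
""", re.VERBOSE)
```

    SENTENCE.search(text) and SENTENCE.findall(text) should then behave like
    the one-line pattern, since only the layout has changed.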

    Here is a sample script:

    - snip -
    import re

    text = 'My text may i. E. look like this: This is the end.'
    # Simplified sentence pattern (see the breakdown above).
    re_satz = re.compile(r'[A-Z](?:[^\.\?\!]+|'
                         r'[^a-zA-Z0-9\-_](?:[a-zA-Z0-9\-_]\.|'
                         r'\d+\.|a\.[\s\-]?A\.)){3,}[\.\?\!]+'
                         r'(?:(?!\s[a-z]))')
    mo = re_satz.search(text)
    if mo:
        print("found!")
        sentences = re_satz.findall(text)
        for sentence in sentences:
            print("Sentence: " + sentence)
    else:
        print("not found :-(")

    - snip -

    Output:
    found!
    Sentence: My text may i.
    Sentence: This is the end.

    Why isn't the above regexp greedier, so that it matches the whole sentence?

    thx in advance

    Christian
     
    Christian Buck, Aug 10, 2003

  2. Christian Buck wrote:
    > Why isn't the above regexp greedier, so that it matches the whole sentence?
    >


    First, you don't need to escape '.', '?' or '!' within a character class [].

    The very first part, r'[A-Z](?:[^\.\?\!]+', cannot be greedier since
    you exclude the '.'. So it matches up to but not including the first dot.
    Now, as far as I can see, nothing else fits. So the output is just what
    I expected. How do you think you can differentiate between the end of a
    sentence and (the first part of) an abbreviation?
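
    Both points can be checked with a few lines. This is a minimal sketch
    using the sample string from the original post; the variable names are
    assumptions made for illustration. The last check looks at what follows
    the first dot, which may explain why the lookahead (?!\s[a-z]) does not
    reject a sentence end there.

```python
import re

text = 'My text may i. E. look like this: This is the end.'

# Inside [...] the characters . ? ! are literal with or without a backslash.
escaped = re.compile(r'[^\.\?\!]+')
plain = re.compile(r'[^.?!]+')
same_match = escaped.match(text).group() == plain.match(text).group()
print(same_match)

# However greedy the + is, the negated class cannot cross a dot.
first_run = plain.match(text).group()
print(repr(first_run))  # 'My text may i'

# After that dot comes ' E', i.e. whitespace plus an UPPERCASE letter,
# so a check for whitespace plus lowercase finds nothing here.
after_dot = text[len(first_run) + 1:]
lowercase_follows = bool(re.match(r'\s[a-z]', after_dot))
print(lowercase_follows)  # False
```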


    --
    Helmut Jarausch

    Lehrstuhl fuer Numerische Mathematik
    RWTH - Aachen University
    D 52056 Aachen, Germany
     
    Helmut Jarausch, Aug 11, 2003
