Extracting repeated words

Discussion in 'Python' started by candide, Apr 1, 2011.

  1. candide

    candide Guest

    Another question about regular expressions.

    How can I extract all duplicated words in a given text using regular
    expression methods? To make the question concrete, if the text is

    ------------------
    Now is better than never.
    Although never is often better than *right* now.
    ------------------

    then the duplicates are:

    ------------------------
    better is now than never
    ------------------------

    Some code can solve the problem, for instance:

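    # ------------------
    import re

    # Match runs of word characters (letters, digits, underscore).
    regexp = r"\w+"

    c = re.compile(regexp, re.IGNORECASE)

    text = """
    Now is better than never.
    Although never is often better than *right* now."""

    # Lowercase every word so that "Now" and "now" count as the same word.
    z = [s.lower() for s in c.findall(text)]

    # Print each word that occurs more than once (set order is arbitrary).
    for d in set(s for s in z if z.count(s) > 1):
        print(d, end=" ")
    # ------------------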

    but I'm looking for a "plain" re solution.
     
    candide, Apr 1, 2011
    #1

  2. Ian Kelly

    Ian Kelly Guest

    On Fri, Apr 1, 2011 at 2:54 PM, candide wrote:
    > Another question relative to regular expressions.
    >
    > How to extract all word duplicates in a given text by use of regular
    > expression methods ?  To make the question concrete, if the text is
    >
    > ------------------
    > Now is better than never.
    > Although never is often better than *right* now.
    > ------------------
    >
    > duplicates are :
    >
    > ------------------------
    > better is now than never
    > ------------------------
    >
    > Some code can solve the question, for instance
    >
    > # ------------------
    > import re
    >
    > regexp=r"\w+"
    >
    > c=re.compile(regexp, re.IGNORECASE)
    >
    > text="""
    > Now is better than never.
    > Although never is often better than *right* now."""
    >
    > z=[s.lower() for s in c.findall(text)]
    >
    > for d in set([s for s in z if z.count(s)>1]):
    >    print d,
    > # ------------------
    >
    > but I'm in search of "plain" re code.


    You could use a look-ahead assertion with a captured group:

    >>> regexp = r'\b(?P<dup>\w+)\b(?=.+\b(?P=dup)\b)'
    >>> c = re.compile(regexp, re.IGNORECASE | re.DOTALL)
    >>> c.findall(text)
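
    As a runnable sketch of the session above (Python 3 is assumed, and the
    helper name repeated_words is only illustrative): the look-ahead matches
    a word each time the same word appears again later, so a word that
    occurs three times is reported twice. The helper below lowercases the
    matches and keeps the first occurrence of each.

```python
import re

# The pattern from the session above: a whole word that is followed,
# somewhere later in the text, by the same whole word. IGNORECASE makes the
# backreference case-insensitive; DOTALL lets ".+" cross line breaks.
DUP_RE = re.compile(r'\b(?P<dup>\w+)\b(?=.+\b(?P=dup)\b)',
                    re.IGNORECASE | re.DOTALL)


def repeated_words(text):
    """Return the lowercased repeated words, in order of first appearance."""
    seen = []
    for word in DUP_RE.findall(text):
        word = word.lower()
        if word not in seen:
            seen.append(word)
    return seen


text = """
Now is better than never.
Although never is often better than *right* now."""

print(repeated_words(text))
# ['now', 'is', 'better', 'than', 'never']
```

    The deduplication step is ordinary Python rather than "plain" re; without
    it, findall returns one entry per repeated occurrence except the last.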


    But note that this is computationally expensive. The regex that you
    posted is probably more efficient if you use a collections.Counter
    object instead of z.count.
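
    A minimal sketch of the Counter variant, assuming Python 3 and the same
    sample text: z.count(s) rescans the whole list for every word, whereas
    Counter tallies all words in a single pass. The variable names here are
    only illustrative.

```python
import re
from collections import Counter

text = """
Now is better than never.
Although never is often better than *right* now."""

# Split into words and lowercase them so "Now" and "now" are counted together.
words = [w.lower() for w in re.findall(r"\w+", text)]

# One pass over the words builds the tally.
counts = Counter(words)

# Keep the words seen more than once; sorted only to make the output stable.
duplicates = sorted(w for w, n in counts.items() if n > 1)

print(" ".join(duplicates))
# better is never now than
```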

    Cheers,
    Ian
     
    Ian Kelly, Apr 1, 2011
    #2

  3. candide

    candide Guest

    On 02/04/2011 00:42, Ian Kelly wrote:

    > You could use a look-ahead assertion with a captured group:
    >
    > >>> regexp = r'\b(?P<dup>\w+)\b(?=.+\b(?P=dup)\b)'
    > >>> c = re.compile(regexp, re.IGNORECASE | re.DOTALL)
    > >>> c.findall(text)


    It works fine. Lookahead assertions in action are exactly what I was
    looking for. Many thanks.
     
    candide, Apr 2, 2011
    #3
