Re: Help with regular expression in python

Discussion in 'Python' started by Matt Funk, Aug 19, 2011.

  1. Matt Funk

    Matt Funk Guest

    Hi,
    thanks for the suggestion. I guess i had found another way around the
    problem as well. But i really wanted to match the line exactly and i
    wanted to know why it doesn't work. That is less for the purpose of
    getting the thing to work but more because it greatly annoys me that
    i can't figure out why it doesn't work. I.e. why the expression does not
    match {32} times. I just don't get it.

    anyway, thanks though
    matt

    On 8/19/2011 8:41 AM, Jason Friedman wrote:
    >> Hi Josh,
    >> thanks for the reply. I am no expert so please bear with me:
    >> I thought that the {32} was supposed to match the previous expression 32
    >> times?
    >>
    >> So how can i have all matches accessible to me?

    > $ python
    > Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
    > [GCC 4.4.3] on linux2
    > Type "help", "copyright", "credits" or "license" for more information.
    >>>> data

    > '1.002000e+01 2.037000e+01 2.128000e+01 1.908000e+01 1.871000e+01
    > 1.914000e+01 2.007000e+01 1.664000e+01 2.204000e+01 2.109000e+01
    > 2.209000e+01 2.376000e+01 2.158000e+01 2.177000e+01 2.152000e+01
    > 2.267000e+01 1.084000e+01 1.671000e+01 1.888000e+01 1.854000e+01
    > 2.064000e+01 2.000000e+01 2.200000e+01 2.139000e+01 2.137000e+01
    > 2.178000e+01 2.179000e+01 2.123000e+01 2.201000e+01 2.150000e+01
    > 2.150000e+01 2.199000e+01 : (instance: 0) : some
    > description'
    >>>> import re
    >>>> re.findall(r"\d\.\d+e\+\d+", data)

    > ['1.002000e+01', '2.037000e+01', '2.128000e+01', '1.908000e+01',
    > '1.871000e+01', '1.914000e+01', '2.007000e+01', '1.664000e+01',
    > '2.204000e+01', '2.109000e+01', '2.209000e+01', '2.376000e+01',
    > '2.158000e+01', '2.177000e+01', '2.152000e+01', '2.267000e+01',
    > '1.084000e+01', '1.671000e+01', '1.888000e+01', '1.854000e+01',
    > '2.064000e+01', '2.000000e+01', '2.200000e+01', '2.139000e+01',
    > '2.137000e+01', '2.178000e+01', '2.179000e+01', '2.123000e+01',
    > '2.201000e+01', '2.150000e+01', '2.150000e+01', '2.199000e+01']
     
    Matt Funk, Aug 19, 2011
    #1

  2. Matt Funk

    jmfauth Guest

    On 19 Aug, 17:20, Matt Funk <> wrote:
    > Hi,
    > thanks for the suggestion. I guess i had found another way around the
    > problem as well. But i really wanted to match the line exactly and i
    > wanted to know why it doesn't work. That is less for the purpose of
    > getting the thing to work but more because it greatly annoys me off that
    > i can't figure out why it doesn't work. I.e. why the expression is not
    > matches {32} times. I just don't get it.
    >


    re is not always the right tool to use.
    Without more details:

    >>> s = '2.201000e+01 2.150000e+01 2.150000e+01\
    ...  : (instance: 0) : some description'
    >>> s
    2.201000e+01 2.150000e+01 2.150000e+01 : (instance: 0) : some description
    >>> s[:s.find(':')]
    2.201000e+01 2.150000e+01 2.150000e+01
    >>> s[:s.find(':')].split()
    ['2.201000e+01', '2.150000e+01', '2.150000e+01']
    >>>
    >>>
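A minimal sketch of the same idea as a reusable function, written for Python 3. The name `split_record` is made up here, and it assumes the first ':' always separates the numbers from the description. `str.partition` is used instead of `find` so that a line with no ':' is still handled (with `find`, a missing ':' returns -1 and the slice would silently drop the last character).

```python
def split_record(line):
    """Split 'n1 n2 ... : rest' into (list of floats, rest of line)."""
    # partition() returns (before, sep, after); sep is '' when ':' is absent.
    numbers_part, sep, rest = line.partition(':')
    numbers = [float(tok) for tok in numbers_part.split()]
    return numbers, rest.strip()


s = '2.201000e+01 2.150000e+01 2.150000e+01 : (instance: 0) : some description'
numbers, description = split_record(s)
print(numbers)       # [22.01, 21.5, 21.5]
print(description)   # (instance: 0) : some description
```

Only the first ':' is treated as the separator, so any further colons stay inside the description.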


    jmf
     
    jmfauth, Aug 19, 2011
    #2

  3. Matt Funk <> writes:

    > thanks for the suggestion. I guess i had found another way around the
    > problem as well. But i really wanted to match the line exactly and i
    > wanted to know why it doesn't work. That is less for the purpose of
    > getting the thing to work but more because it greatly annoys me off that
    > i can't figure out why it doesn't work. I.e. why the expression is not
    > matches {32} times. I just don't get it.


    Because a line is not 32 times a number, it is a number followed by 31
    times "a space followed by a number". Using Jason's regexp, you can
    build the regexp step by step:

    number = r"\d\.\d+e\+\d+"
    numbersequence = r"%s( %s){31}" % (number,number)

    There are better ways to build your regexp, but I think this one is
    convenient to answer your question. You still have to append what will
    match the end of the line.
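To see the difference concretely, here is a small sketch in Python 3, shortened to three numbers instead of 32 and using a made-up sample line. A pattern of the form "number, three times" leaves no room for the separating spaces, whereas "a number, then two times (space + number)" accounts for them.

```python
import re

number = r"\d\.\d+e\+\d+"
line = "2.201000e+01 2.150000e+01 2.199000e+01"   # made-up sample, 3 numbers

# "3 times a number": nothing allows for the spaces, so it cannot match.
three_numbers = re.compile(r"(?:%s){3}" % number)

# "a number, then 2 times (space + number)": matches the whole line.
number_sequence = re.compile(r"%s(?: %s){2}" % (number, number))

no_match = three_numbers.match(line)
full_match = number_sequence.match(line)

print(no_match)             # None
print(full_match.group(0))  # the whole line
```

The repeated part is written as a non-capturing group `(?: ...)` here because only the overall match is of interest.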

    -- Alain.

    P/S: please do not top-post

     
    Alain Ketterlin, Aug 19, 2011
    #3
  4. Matt Funk

    Matt Funk Guest

    On Friday, August 19, 2011, Alain Ketterlin wrote:
    > Matt Funk <> writes:
    > > thanks for the suggestion. I guess i had found another way around the
    > > problem as well. But i really wanted to match the line exactly and i
    > > wanted to know why it doesn't work. That is less for the purpose of
    > > getting the thing to work but more because it greatly annoys me off that
    > > i can't figure out why it doesn't work. I.e. why the expression is not
    > > matches {32} times. I just don't get it.

    >
    > Because a line is not 32 times a number, it is a number followed by 31
    > times "a space followed by a number". Using Jason's regexp, you can
    > build the regexp step by step:
    >
    > number = r"\d\.\d+e\+\d+"
    > numbersequence = r"%s( %s){31}" % (number,number)

    That didn't work either. I used the following expression, modified so that
    (.+) matches the end of the line:

    number = r"\d\.\d+e\+\d+"
    numbersequence = r"%s( %s){31}(.+)" % (number,number)
    instance_linetype_pattern = re.compile(numbersequence)

    The results obtained are:
    results:
    [(' 2.199000e+01', ' : (instance: 0)\t:\tsome description')]
    so this matches the last number plus the string at the end of the line, but
    does not retain the previous numbers.

    Anyway, i think at this point i will go another route. Not sure where the
    issue lies at this point.

    thanks for all the help
    matt


    >
    > There are better ways to build your regexp, but I think this one is
    > convenient to answer your question. You still have to append what will
    > match the end of the line.
    >
    > -- Alain.
    >
    > P/S: please do not top-post
    >
     
    Matt Funk, Aug 19, 2011
    #4
  5. Matt Funk

    jmfauth Guest

    On 19 Aug, 19:33, Matt Funk <> wrote:
    >
    > The results obtained are:
    > results:
    > [(' 2.199000e+01', ' : (instance: 0)\t:\tsome description')]
    > so this matches the last number plus the string at the end of the line, but no
    > retaining the previous numbers.
    >
    > Anyway, i think at this point i will go another route. Not sure where the
    > issues lies at this point.
    >



    Seen on this list:

    And always keep this in mind:
    'Some people, when confronted with a problem, think "I know, I'll use
    regular expressions." Now they have two problems.'
    --Jamie Zawinski, comp.lang.emacs


    I proposed a solution which seems to correspond to your problem,
    if it were better formulated...

    jmf
     
    jmfauth, Aug 19, 2011
    #5
  6. Matt Funk

    Guest

    On 08/19/2011 11:33 AM, Matt Funk wrote:
    > On Friday, August 19, 2011, Alain Ketterlin wrote:
    >> Matt Funk <> writes:
    >> > thanks for the suggestion. I guess i had found another way around the
    >> > problem as well. But i really wanted to match the line exactly and i
    >> > wanted to know why it doesn't work. That is less for the purpose of
    >> > getting the thing to work but more because it greatly annoys me off that
    >> > i can't figure out why it doesn't work. I.e. why the expression is not
    >> > matches {32} times. I just don't get it.

    >>
    >> Because a line is not 32 times a number, it is a number followed by 31
    >> times "a space followed by a number". Using Jason's regexp, you can
    >> build the regexp step by step:
    >>
    >> number = r"\d\.\d+e\+\d+"
    >> numbersequence = r"%s( %s){31}" % (number,number)

    > That didn't work either. Using the (modified (where the (.+) matches the end of
    > the line)) expression as:
    >
    > number = r"\d\.\d+e\+\d+"
    > numbersequence = r"%s( %s){31}(.+)" % (number,number)
    > instance_linetype_pattern = re.compile(numbersequence)
    >
    > The results obtained are:
    > results:
    > [(' 2.199000e+01', ' : (instance: 0)\t:\tsome description')]
    > so this matches the last number plus the string at the end of the line, but no
    > retaining the previous numbers.


    The secret is buried very unobtrusively in the re docs,
    where it has caught me out in the past. Specifically
    in the docs for the match object's group() method:

    "If a group is contained in a part of the pattern that
    matched multiple times, the last match is returned."

    In addition to the findall solution someone else
    posted, another thing you could do is to explicitly
    express the groups in your re:

    number = r"\d\.\d+e\+\d+"
    groups = (r" (%s)" % number)*31
    numbersequence = r"(%s)%s(.+)" % (number,groups)
    ...
    # groups 1..32 are the numbers; group 33 is the rest of the line
    results = match_object.group(*range(1,33))

    Or (what I would probably do), simply match the
    whole string of numbers and pull it apart later:

    number = r"\d\.\d+e\+\d+"
    numbersequence = r"(%s(?: %s){31})(.+)" % (number,number)
    results = (match_object.group(1)).split()

    [none of this code is tested but should be close
    enough to convey the general idea.]
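For reference, a runnable sketch of both approaches under Python 3, cut down to three numbers. The sample line and the names `explicit_pattern` and `block_pattern` are illustrative only. In the first pattern every number gets its own group, so `groups()` returns them all; in the second, one group captures the whole run of numbers and `split()` separates them afterwards.

```python
import re

number = r"\d\.\d+e\+\d+"
line = "2.201000e+01 2.150000e+01 2.199000e+01 : (instance: 0) : some description"
count = 3  # 32 in the original problem

# Approach 1: write out one capturing group per number.
explicit_pattern = re.compile(
    r"(%s)%s(.+)" % (number, (r" (%s)" % number) * (count - 1)))
m1 = explicit_pattern.match(line)
numbers1 = list(m1.groups()[:count])   # the last group is the rest of the line
rest1 = m1.group(count + 1)

# Approach 2: capture the whole run once, then split it.
block_pattern = re.compile(
    r"(%s(?: %s){%d})(.+)" % (number, number, count - 1))
m2 = block_pattern.match(line)
numbers2 = m2.group(1).split()
rest2 = m2.group(2)

print(numbers1)
print(numbers2)
```

Both lists hold the same three strings; the second pattern stays the same length no matter how large `count` gets.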
     
    , Aug 19, 2011
    #6
  7. Matt Funk

    Carl Banks Guest

    On Friday, August 19, 2011 10:33:49 AM UTC-7, Matt Funk wrote:
    > number = r"\d\.\d+e\+\d+"
    > numbersequence = r"%s( %s){31}(.+)" % (number,number)
    > instance_linetype_pattern = re.compile(numbersequence)
    >
    > The results obtained are:
    > results:
    > [(' 2.199000e+01', ' : (instance: 0)\t:\tsome description')]
    > so this matches the last number plus the string at the end of the line, but no
    > retaining the previous numbers.
    >
    > Anyway, i think at this point i will go another route. Not sure where the
    > issues lies at this point.



    I think the problem is that repeat counts don't actually repeat the groupings; they just repeat the matchings. Take this expression:

    r"(\w+\s*){2}"

    This will match exactly two words separated by whitespace. But the match result won't contain two groups; it'll only contain one group, and the value of that group will match only the very last thing repeated:

    Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53)
    [GCC 4.5.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import re
    >>> m = re.match(r"(\w+\s*){2}","abc def")
    >>> m.group(1)

    'def'

    So you see, the regular expression is doing what you think it is, but the way it forms groups is not.
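The same effect explains the findall output quoted above. As a small sketch (Python 3, three numbers, made-up sample line): the overall match does cover every number, but findall reports only the capturing groups, and the repeated group holds just its final repetition. Making the repeated group non-capturing and wrapping the whole run in a single group keeps everything.

```python
import re

number = r"\d\.\d+e\+\d+"
line = "2.201000e+01 2.150000e+01 2.199000e+01 : some description"

# Capturing group inside a repeat: only the last repetition is kept.
repeated = r"%s( %s){2}(.+)" % (number, number)
m = re.match(repeated, line)
whole = m.group(0)                  # the entire line did match
last_only = m.group(1)              # ' 2.199000e+01'
as_findall = re.findall(repeated, line)

# Non-capturing repeat inside a single capturing group: nothing is lost.
kept = r"(%s(?: %s){2})(.+)" % (number, number)
all_numbers = re.match(kept, line).group(1).split()

print(as_findall)    # [(' 2.199000e+01', ' : some description')]
print(all_numbers)   # ['2.201000e+01', '2.150000e+01', '2.199000e+01']
```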


    Just a little advice (I know you've found a different method, and that's good, this is for the general reader).

    The functions re.findall and re.finditer could have helped here; they find all the matches in a string and let you iterate through them. (findall returns the strings matched, and finditer returns the sequence of match objects.) You could have done something like this:

    row = [ float(x) for x in re.findall(r'\d+\.\d+e\+\d+',line) ]

    And regexp matching is often overkill for a particular problem; this may be one of them. line.split() could have been sufficient:

    row = [ float(x) for x in line.split() ]

    Of course, these solutions don't account for the case where you have lines, some of which aren't 32 floating-point numbers. You need extra error handling for that, but you get the idea.
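A sketch of what that extra error handling might look like, in Python 3. The function name `parse_row`, the `expected` parameter and the choice to return None for a bad line are assumptions made for illustration, as is the rule that everything before the first ':' is numeric data.

```python
def parse_row(line, expected=32):
    """Return a list of `expected` floats, or None if the line does not fit."""
    # Assumed layout: numbers first, then ':' and free-form text.
    fields = line.partition(':')[0].split()
    if len(fields) != expected:
        return None
    try:
        return [float(x) for x in fields]
    except ValueError:
        # At least one field was not a valid floating-point literal.
        return None


good = "2.201000e+01 2.150000e+01 2.199000e+01 : (instance: 0) : some description"
print(parse_row(good, expected=3))                                     # [22.01, 21.5, 21.99]
print(parse_row("2.201000e+01 oops 2.199000e+01 : text", expected=3))  # None
print(parse_row("2.201000e+01 : too short", expected=3))               # None
```

Returning None keeps the caller in charge of whether a bad line is skipped, logged or treated as fatal.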


    Carl Banks
     
    Carl Banks, Aug 19, 2011
    #7
  8. Matt Funk

    MRAB Guest

    On 19/08/2011 20:55, wrote:
    > On 08/19/2011 11:33 AM, Matt Funk wrote:
    >> On Friday, August 19, 2011, Alain Ketterlin wrote:
    >>> Matt Funk<> writes:
    >>>> thanks for the suggestion. I guess i had found another way around the
    >>>> problem as well. But i really wanted to match the line exactly and i
    >>>> wanted to know why it doesn't work. That is less for the purpose of
    >>>> getting the thing to work but more because it greatly annoys me off that
    >>>> i can't figure out why it doesn't work. I.e. why the expression is not
    >>>> matches {32} times. I just don't get it.
    >>>
    >>> Because a line is not 32 times a number, it is a number followed by 31
    >>> times "a space followed by a number". Using Jason's regexp, you can
    >>> build the regexp step by step:
    >>>
    >>> number = r"\d\.\d+e\+\d+"
    >>> numbersequence = r"%s( %s){31}" % (number,number)

    >> That didn't work either. Using the (modified (where the (.+) matches the end of
    >> the line)) expression as:
    >>
    >> number = r"\d\.\d+e\+\d+"
    >> numbersequence = r"%s( %s){31}(.+)" % (number,number)
    >> instance_linetype_pattern = re.compile(numbersequence)
    >>
    >> The results obtained are:
    >> results:
    >> [(' 2.199000e+01', ' : (instance: 0)\t:\tsome description')]
    >> so this matches the last number plus the string at the end of the line, but no
    >> retaining the previous numbers.

    >
    > The secret is buried very unobtrusively in the re docs,
    > where it has caught me out in the past. Specifically
    > in the docs for re.group():
    >
    > "If a group is contained in a part of the pattern that
    > matched multiple times, the last match is returned."
    >

    [snip]
    There's a regex implementation on PyPI:

    http://pypi.python.org/pypi/regex

    which does support capturing all of the matches of a group.
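A hedged sketch of how that might look with the third-party regex module, whose match objects offer a captures() method returning every capture of a group. It is written for Python 3 and shortened to three numbers; the sample line and the helper name `capture_numbers` are made up, and the fallback to the standard re module is only there so the snippet still runs where regex is not installed.

```python
import re

try:
    import regex  # third-party package from PyPI: pip install regex
except ImportError:
    regex = None

NUMBER = r"\d\.\d+e\+\d+"
# One capturing group, repeated three times (32 in the original problem).
PATTERN = r"(?:(%s)\s+){3}" % NUMBER


def capture_numbers(line):
    """Return every number matched by the repeated group, in order."""
    if regex is not None:
        m = regex.match(PATTERN, line)
        # captures(1) lists all repetitions of group 1, not just the last.
        return m.captures(1) if m else []
    # Fallback with the standard library: match first, then pull the
    # numbers out of the matched text with findall.
    m = re.match(PATTERN, line)
    return re.findall(NUMBER, m.group(0)) if m else []


line = "2.201000e+01 2.150000e+01 2.199000e+01 : (instance: 0) : some description"
print(capture_numbers(line))
```

With plain re, `m.group(1)` on the same pattern would give only the last number.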
     
    MRAB, Aug 19, 2011
    #8
  9. Matt Funk

    Matt Funk Guest

    On Friday, August 19, 2011, jmfauth wrote:
    > On 19 Aug, 19:33, Matt Funk <> wrote:
    > > The results obtained are:
    > > results:
    > > [(' 2.199000e+01', ' : (instance: 0)\t:\tsome description')]
    > > so this matches the last number plus the string at the end of the line,
    > > but no retaining the previous numbers.
    > >
    > > Anyway, i think at this point i will go another route. Not sure where the
    > > issues lies at this point.

    >
    > Seen on this list:
    >
    > And always keep this in mind:
    > 'Some people, when confronted with a problem, think "I know, I'll use
    > regular expressions." Now they have two problems.'
    > --Jamie Zawinski, comp.lang.emacs
    >
    >
    > I proposed a solution which seems to corresponds to your problem
    > if it were better formulated...

    Agreed, and i will probably take your proposed route or a similar one.
    However, i still won't know WHY it didn't work. I would really LIKE to know
    why, simply because it tickles me.

    matt

    >
    > jmf
     
    Matt Funk, Aug 19, 2011
    #9
  10. Matt Funk

    Matt Funk Guest

    On Friday, August 19, 2011, Carl Banks wrote:
    > On Friday, August 19, 2011 10:33:49 AM UTC-7, Matt Funk wrote:
    > > number = r"\d\.\d+e\+\d+"
    > > numbersequence = r"%s( %s){31}(.+)" % (number,number)
    > > instance_linetype_pattern = re.compile(numbersequence)
    > >
    > > The results obtained are:
    > > results:
    > > [(' 2.199000e+01', ' : (instance: 0)\t:\tsome description')]
    > > so this matches the last number plus the string at the end of the line,
    > > but no retaining the previous numbers.
    > >
    > > Anyway, i think at this point i will go another route. Not sure where the
    > > issues lies at this point.

    >
    > I think the problem is that repeat counts don't actually repeat the
    > groupings; they just repeat the matchings. Take this expression:
    >
    > r"(\w+\s*){2}"

    I see

    >
    > This will match exactly two words separated by whitespace. But the match
    > result won't contain two groups; it'll only contain one group, and the
    > value of that group will match only the very last thing repeated:
    >
    > Python 2.7.1+ (r271:86832, Apr 11 2011, 18:13:53)
    > [GCC 4.5.2] on linux2
    > Type "help", "copyright", "credits" or "license" for more information.
    >
    > >>> import re
    > >>> m = re.match(r"(\w+\s*){2}","abc def")
    > >>> m.group(1)

    >
    > 'def'
    >
    > So you see, the regular expression is doing what you think it is, but the
    > way it forms groups is not.
    >
    >
    > Just a little advice (I know you've found a different method, and that's
    > good, this is for the general reader).
    >
    > The functions re.findall and re.finditer could have helped here, they find
    > all the matches in a string and let you iterate through them. (findall
    > returns the strings matched, and finditer returns the sequence of match
    > objects.) You could have done something like this:

    I did use findall but when i tried to match everything (including the 'some
    description' part) it did not work. But i think the explanation you gave above
    matches this case and explains why it did not.


    >
    > row = [ float(x) for x in re.findall(r'\d+\.\d+e\+\d+',line) ]
    >
    > And regexp matching is often overkill for a particular problem; this may be
    > of them. line.split() could have been sufficient:
    >
    > row = [ float(x) for x in line.split() ]
    >
    > Of course, these solutions don't account for the case where you have lines,
    > some of which aren't 32 floating-point numbers. You need extra error
    > handling for that, but you get the idea.


    thanks
    matt

    >
    >
    > Carl Banks
     
    Matt Funk, Aug 19, 2011
    #10
  11. Sorry if I missed some further specification in the earlier thread or
    if the following is an oversimplification of the original problem (using
    3 numbers instead of 32),
    would something like the following work for your data?

    >>> import re
    >>> data = """2.201000e+01 2.150000e+01 2.199000e+01 : (instance: 0) : some description
    ... 2.201000e+01 2.150000e+01 2.199000e+01 : (instance: 0) : some description
    ... 2.201000e+01 2.150000e+01 2.199000e+01 : (instance: 0) : some description
    ... 2.201000e+01 2.150000e+01 2.199000e+01 : (instance: 0) : some description"""
    >>> for res in re.findall(r"(?m)^(?:(?:[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+))?\s+){3}(?:.+)$", data): print res
    ...
    2.201000e+01 2.150000e+01 2.199000e+01 : (instance: 0) : some description
    2.201000e+01 2.150000e+01 2.199000e+01 : (instance: 0) : some description
    2.201000e+01 2.150000e+01 2.199000e+01 : (instance: 0) : some description
    2.201000e+01 2.150000e+01 2.199000e+01 : (instance: 0) : some description
    >>>


    i.e. all parentheses are non-capturing (?:...) and there are extra
    anchors for line beginning and end ^...$ with the multiline flag set
    via (?m)
    Each result is one matching line in this sample (if you need to access
    single numbers, you could process these matches further or use the new
    regex implementation mentioned earlier by MRAB (its developer) with
    the new match method captures() - using an appropriate pattern with
    the needed groupings).
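As a sketch of the "process these matches further" step (Python 3, again with three numbers per line and made-up data): one capturing group around the run of numbers and one around the trailing text, with finditer walking the lines. The pattern here is a simplified variant assumed for illustration, not the exact expression above; it separates fields with spaces or tabs only, so a match cannot run across a line break.

```python
import re

data = """\
2.201000e+01 2.150000e+01 2.199000e+01 : (instance: 0) : some description
1.002000e+01 2.037000e+01 2.128000e+01 : (instance: 1) : other description
this line is not a record"""

# Optional sign, mantissa, optional exponent.
number = r"[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?"
# Group 1: the run of three numbers; group 2: the text after the first ':'.
record = re.compile(
    r"(?m)^((?:%s[ \t]+){3}):[ \t]*(.+)$" % number)

rows = []
for m in record.finditer(data):
    values = [float(x) for x in m.group(1).split()]
    rows.append((values, m.group(2)))

for values, text in rows:
    print(values, text)
```

Lines that do not fit the pattern, such as the third one in the sample, are simply skipped.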

    regards,
    vbr
     
    Vlastimil Brom, Aug 22, 2011
    #11
