returning regex matches as lists

Discussion in 'Python' started by Jonathan Lukens, Feb 15, 2008.

  1. I am in the last phase of building a Django app based on something I
    wrote in Java a while back. Right now I am stuck on how to return the
    matches of a regular expression as a list *at all*, and in particular
    given that the regex has a number of groupings. The only method I've
    seen that returns a list is .findall(string), but then I get back the
    groups as tuples, which is sort of a problem.

    Thank you,
    Jonathan
    Jonathan Lukens, Feb 15, 2008
    #1

  2. On Feb 16, 6:07 am, Jonathan Lukens <> wrote:
    > I am in the last phase of building a Django app based on something I
    > wrote in Java a while back. Right now I am stuck on how to return the
    > matches of a regular expression as a list *at all*, and in particular
    > given that the regex has a number of groupings. The only method I've
    > seen that returns a list is .findall(string), but then I get back the
    > groups as tuples, which is sort of a problem.
    >


    It would help if you explained what you want the contents of the list
    to be, why you want a list as opposed to a tuple or a generator or
    whatever ... we can't be expected to imagine why getting groups as
    tuples is "sort of a problem".

    Use a concrete example, e.g.

    >>> import re
    >>> regex = re.compile(r'(\w+)\s+(\d+)')
    >>> text = 'python 1 junk xyzzy 42 java 666'
    >>> r = regex.findall(text)
    >>> r
    [('python', '1'), ('xyzzy', '42'), ('java', '666')]


    What would you like to see instead?
    John Machin, Feb 15, 2008
    #2

  3. > What would you like to see instead?

    I had mostly just expected that there was some method that would
    return each entire match as an item on a list. I have this pattern:

    >>> import re
    >>> corporate_names = re.compile(u'(?u)\\b([á-ñ]{2,}\\s+)([<<"][Á-Ñá-ñ]+)(\\s*-?[Á-Ñá-ñ]+)*([>>"])')
    >>> terms = corporate_names.findall(sourcetext)


    Which matches a specific way that Russian company names are
    formatted. I was expecting a method that would return this:

    >>> terms
    [u'string one', u'string two', u'string three']

    ...mostly because I was working it this way in Java and haven't
    learned to do things the Python way yet. At the suggestion from
    someone on the list, I just used list() on all the tuples like so:

    >>> detupled_terms = [list(term_tuple) for term_tuple in terms]
    >>> delisted_terms = [''.join(term_list) for term_list in detupled_terms]


    which achieves the desired result, but I am not a programmer and so I
    would still be interested to know if there is a more elegant way of
    doing this.

    I appreciate the help.

    Jonathan
    Jonathan Lukens, Feb 15, 2008
    #3
  4. On Fri, 15 Feb 2008 17:07:21 -0200, Jonathan Lukens
    <> wrote:

    > I am in the last phase of building a Django app based on something I
    > wrote in Java a while back. Right now I am stuck on how to return the
    > matches of a regular expression as a list *at all*, and in particular
    > given that the regex has a number of groupings. The only method I've
    > seen that returns a list is .findall(string), but then I get back the
    > groups as tuples, which is sort of a problem.


    Do you want something like this?

    py> re.findall(r"([a-z]+)([0-9]+)", "foo bar3 w000 no abc123")
    [('bar', '3'), ('w', '000'), ('abc', '123')]
    py> re.findall(r"(([a-z]+)([0-9]+))", "foo bar3 w000 no abc123")
    [('bar3', 'bar', '3'), ('w000', 'w', '000'), ('abc123', 'abc', '123')]
    py> groups = re.findall(r"(([a-z]+)([0-9]+))", "foo bar3 w000 no abc123")
    py> groups
    [('bar3', 'bar', '3'), ('w000', 'w', '000'), ('abc123', 'abc', '123')]
    py> [group[0] for group in groups]
    ['bar3', 'w000', 'abc123']

    --
    Gabriel Genellina
    Gabriel Genellina, Feb 16, 2008
    #4
  5. On Fri, 15 Feb 2008 19:25:59 -0200, Jonathan Lukens
    <> wrote:

    >> What would you like to see instead?

    >
    > I had mostly just expected that there was some method that would
    > return each entire match as an item on a list. I have this pattern:
    >
    >>>> import re
    >>>> corporate_names =
    >>>> re.compile(u'(?u)\\b([А-Я]{2,}\\s+)([<<"][а-яА-Я]+)(\\s*-?[а-яА-Я]+)*([>>"])')
    >>>> terms = corporate_names.findall(sourcetext)

    >
    > Which matches a specific way that Russian company names are
    > formatted. I was expecting a method that would return this:
    >
    >>>> terms

    > [u'string one', u'string two', u'string three']
    >
    > ...mostly because I was working it this way in Java and haven't
    > learned to do things the Python way yet. At the suggestion from
    > someone on the list, I just used list() on all the tuples like so:


    The group() method of match objects does what you want:

    terms = [match.group() for match in corporate_names.finditer(sourcetext)]

    See http://docs.python.org/lib/match-objects.html
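A minimal sketch of the difference between findall() and finditer(), written for Python 3. The pattern and text are stand-ins borrowed from the earlier example in this thread, not the poster's real data:

```python
import re

# Stand-in pattern with two capturing groups: a word, whitespace, a number.
pattern = re.compile(r'(\w+)\s+(\d+)')
text = 'python 1 junk xyzzy 42 java 666'

# When the pattern has groups, findall() returns one tuple of groups per match.
as_tuples = pattern.findall(text)

# finditer() yields match objects; group() with no argument is the whole match.
whole_matches = [m.group() for m in pattern.finditer(text)]

print(as_tuples)      # [('python', '1'), ('xyzzy', '42'), ('java', '666')]
print(whole_matches)  # ['python 1', 'xyzzy 42', 'java 666']
```

The result of the comprehension is a plain list of strings, which is the shape asked for in the original question.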

    >>>> detupled_terms = [list(term_tuple) for term_tuple in terms]
    >>>> delisted_terms = [''.join(term_list) for term_list in detupled_terms]

    >
    > which achieves the desired result, but I am not a programmer and so I
    > would still be interested to know if there is a more elegant way of
    > doing this.


    That ''.join(...) works equally well on tuples; you don't have to convert
    tuples to lists first:

    delisted_terms = [''.join(term) for term in terms]
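One caveat, sketched here in Python 3 with a hypothetical pattern: joining the groups only rebuilds the whole match when every matched character lands in some group and no group repeats. A group under `*` keeps only its last repetition, so text can go missing:

```python
import re

# Hypothetical pattern: a word followed by any number of "-word" parts.
# The second group is repeated, so only its LAST repetition is captured.
pattern = re.compile(r'(\w+)(-\w+)*')
text = 'a-b-c x-y z'

groups = pattern.findall(text)
joined = [''.join(g) for g in groups]
whole = [m.group() for m in pattern.finditer(text)]

print(groups)  # [('a', '-c'), ('x', '-y'), ('z', '')]
print(joined)  # ['a-c', 'x-y', 'z']    <- '-b' was lost
print(whole)   # ['a-b-c', 'x-y', 'z']
```

The pattern in this thread also has a repeated group, so match.group() is probably the safer route there.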

    --
    Gabriel Genellina
    Gabriel Genellina, Feb 16, 2008
    #5
  6. On Feb 15, 8:31 pm, "Gabriel Genellina" <>
    wrote:
    > On Fri, 15 Feb 2008 19:25:59 -0200, Jonathan Lukens
    > <> wrote:
    >
    >
    >
    > >> What would you like to see instead?

    >
    > > I had mostly just expected that there was some method that would
    > > return each entire match as an item on a list. I have this pattern:

    >
    > >>>> import re
    > >>>> corporate_names =
    > >>>> re.compile(u'(?u)\\b([А-Я]{2,}\\s+)([<<"][а-яА-Я]+)(\\s*-?[а-яА-Я]+)*([>>"])')
    > >>>> terms = corporate_names.findall(sourcetext)

    >
    > > Which matches a specific way that Russian company names are
    > > formatted. I was expecting a method that would return this:

    >
    > >>>> terms

    > > [u'string one', u'string two', u'string three']

    >
    > > ...mostly because I was working it this way in Java and haven't
    > > learned to do things the Python way yet. At the suggestion from
    > > someone on the list, I just used list() on all the tuples like so:

    >
    > The group() method of match objects does what you want:
    >
    > terms = [match.group() for match in corporate_names.finditer(sourcetext)]
    >
    > See http://docs.python.org/lib/match-objects.html
    >
    > >>>> detupled_terms = [list(term_tuple) for term_tuple in terms]
    > >>>> delisted_terms = [''.join(term_list) for term_list in detupled_terms]

    >
    > > which achieves the desired result, but I am not a programmer and so I
    > > would still be interested to know if there is a more elegant way of
    > > doing this.

    >
    > That ''.join(...) works equally well on tuples; you don't have to convert
    > tuples to lists first:
    >
    > delisted_terms = [''.join(term) for term in terms]
    >
    > --
    > Gabriel Genellina


    Thanks Gabriel,

    That is just what I was looking for.

    Jonathan
    Jonathan Lukens, Feb 16, 2008
    #6
  7. On Feb 16, 8:25 am, Jonathan Lukens <> wrote:
    > > What would you like to see instead?

    >
    > I had mostly just expected that there was some method that would
    > return each entire match as an item on a list. I have this pattern:
    >
    > >>> import re
    > >>> corporate_names = re.compile(u'(?u)\\b([á-ñ]{2,}\\s+)([<<"][Á-Ñá-ñ]+)(\\s*-?[Á-Ñá-ñ]+)*([>>"])')
    > >>> terms = corporate_names.findall(sourcetext)

    >
    > Which matches a specific way that Russian company names are
    > formatted. I was expecting a method that would return this:
    >
    > >>> terms

    >
    > [u'string one', u'string two', u'string three']


    What is the point of having parenthesised groups in the regex if you
    are interested only in the whole match?
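To illustrate the point, a small Python 3 sketch using the sample text from earlier in the thread: when parentheses are needed only for grouping, the non-capturing form `(?:...)` makes findall() return whole matches as plain strings.

```python
import re

text = 'foo bar3 w000 no abc123'

# Capturing groups: findall() returns a tuple of groups for each match.
with_groups = re.findall(r'([a-z]+)([0-9]+)', text)

# Non-capturing groups: findall() returns each whole match as a string.
whole = re.findall(r'(?:[a-z]+)(?:[0-9]+)', text)

print(with_groups)  # [('bar', '3'), ('w', '000'), ('abc', '123')]
print(whole)        # ['bar3', 'w000', 'abc123']
```

The same holds for a pattern with no parentheses at all; the groups here are kept only to show that grouping and capturing are separate choices.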

    Other comments:
    (1) raw string for improved legibility
    ur'(?u)\b([á-ñ]{2,}\s+)([<<"][Á-Ñá-ñ]+)(\s*-?[Á-Ñá-ñ]+)*([>>"])'
    (2) consider not including space at the end of a group
    ur'(?u)\b([á-ñ]{2,})\s+([<<"][Á-Ñá-ñ]+)\s*(-?[Á-Ñá-ñ]+)*([>>"])'
    (3) what appears between [] is a set of characters, so [<<"] is the
    same as [<"] and probably isn't doing what you expect; have you tested
    this regex for correctness?
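A short Python 3 check of comment (3), using a made-up sample string: a character class matches exactly one character from a set, so listing `<` twice changes nothing. Matching the two-character sequence `<<` would need alternation instead.

```python
import re

sample = '<<a"'

# Both classes match a single '<' or a single '"'; the duplicate is redundant.
doubled = re.findall(r'[<<"]', sample)
single = re.findall(r'[<"]', sample)
print(doubled == single)  # True
print(single)             # ['<', '<', '"']

# Alternation matches the literal sequence '<<' or one '"'.
opening = re.findall(r'(?:<<|")', sample)
print(opening)            # ['<<', '"']
```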

    >
    > ...mostly because I was working it this way in Java and haven't
    > learned to do things the Python way yet. At the suggestion from
    > someone on the list, I just used list() on all the tuples like so:
    >
    > >>> detupled_terms = [list(term_tuple) for term_tuple in terms]
    > >>> delisted_terms = [''.join(term_list) for term_list in detupled_terms]

    >
    > which achieves the desired result, but I am not a programmer and so I
    > would still be interested to know if there is a more elegant way of
    > doing this.


    I can't imagine how "not a programmer" implies "interested to know if
    there is a more elegant way". In any case, explore the correctness
    axis first.

    Cheers,
    John
    John Machin, Feb 16, 2008
    #7
  8. John,

    > (1) raw string for improved legibility
    > ur'(?u)\b([á-ñ]{2,}\s+)([<<"][Á-Ñá-ñ]+)(\s*-?[Á-Ñá-ñ]+)*([>>"])'


    This actually escaped my notice after I had posted -- the letters with
    diacritics are incorrectly decoded Cyrillic letters -- I suppose I
    could use the Unicode escape sequences (the sets [á-ñ] and [Á-Ñá-ñ] are
    the Cyrillic equivalents of [A-Z] and [a-zA-Z]) but then suddenly the
    legibility goes out the window again.
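One way to keep escapes readable is re.VERBOSE with comments. The Python 3 sketch below is a guess at the intended pattern, not the original: it assumes the ranges are basic Cyrillic А-Я (U+0410 to U+042F) and а-я (U+0430 to U+044F), that the quotes are the guillemets U+00AB and U+00BB, and the sample sentence is invented. Capturing groups are dropped because only the whole match is wanted.

```python
import re

# Assumed ranges: U+0410-U+042F is А-Я, U+0430-U+044F is а-я.
# Ё (U+0401) and ё (U+0451) fall outside these ranges and are NOT matched.
corporate_names = re.compile(r'''
    \b
    [\u0410-\u042F]{2,}                        # two or more uppercase letters
    \s+
    [\u00AB"]                                  # opening guillemet or double quote
    [\u0410-\u042F\u0430-\u044F]+              # first word of the name
    (?:\s*-?[\u0410-\u042F\u0430-\u044F]+)*    # further words, optionally hyphenated
    [\u00BB"]                                  # closing guillemet or double quote
''', re.VERBOSE)

# Invented sample text, for illustration only.
sourcetext = 'Компания ООО «Рога и Копыта» и ЗАО "Вектор-Плюс" работают.'

terms = [m.group() for m in corporate_names.finditer(sourcetext)]
print(terms)
```

With the sample above, terms holds the two quoted company names together with their uppercase prefixes.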

    > (3) what appears between [] is a set of characters, so [<<"] is the
    > same as [<"] and probably isn't doing what you expect; have you tested
    > this regex for correctness?


    These were angled quotation marks in the original Unicode. Sorry
    again. The regex matches everything it is supposed to. The extra
    parentheses were there because I had somehow missed the .group method,
    and without them findall had been returning only what was in the one
    set of parentheses I needed.

    > I can't imagine how "not a programmer" implies "interested to know if
    > there is a more elegant way".


    More carefully stated: "I am self-taught, have no real training or
    experience as a programmer, and would be interested in seeing how a
    programmer with training and experience would go about this."

    Thank you,
    Jonathan
    Jonathan Lukens, Feb 16, 2008
    #8
