returning regex matches as lists

Jonathan Lukens · Feb 15, 2008

I am in the last phase of building a Django app based on something I
wrote in Java a while back. Right now I am stuck on how to return the
matches of a regular expression as a list *at all*, and in particular
given that the regex has a number of groupings. The only method I've
seen that returns a list is .findall(string), but then I get back the
groups as tuples, which is sort of a problem.

Thank you,
Jonathan

John Machin · Feb 15, 2008

It would help if you explained what you want the contents of the list
to be, why you want a list as opposed to a tuple or a generator or
whatever ... we can't be expected to imagine why getting groups as
tuples is "sort of a problem".

Use a concrete example, e.g.

import re
regex = re.compile(r'(\w+)\s+(\d+)')
text = 'python 1 junk xyzzy 42 java 666'
r = regex.findall(text)
print(r)  # [('python', '1'), ('xyzzy', '42'), ('java', '666')]

What would you like to see instead?

Jonathan Lukens · Feb 15, 2008

I had mostly just expected that there was some method that would
return each entire match as an item in a list. I have this pattern:

import re
corporate_names = re.compile(u'(?u)\\b([á-ñ]{2,}\\s+)([<<"][Á-Ñá-ñ]+)(\\s*-?[Á-Ñá-ñ]+)*([>>"])')
terms = corporate_names.findall(sourcetext)

Which matches a specific way that Russian company names are
formatted. I was expecting a method that would return this:
[u'string one', u'string two', u'string three']

...mostly because I was working it this way in Java and haven't
learned to do things the Python way yet. At the suggestion of
someone on the list, I just used list() on all the tuples like so:

detupled_terms = [list(term_tuple) for term_tuple in terms]
delisted_terms = [''.join(term_list) for term_list in detupled_terms]

which achieves the desired result, but I am not a programmer and so I
would still be interested to know if there is a more elegant way of
doing this.

I appreciate the help.

Jonathan

Gabriel Genellina · Feb 16, 2008

Do you want something like this?

py> re.findall(r"([a-z]+)([0-9]+)", "foo bar3 w000 no abc123")
[('bar', '3'), ('w', '000'), ('abc', '123')]
py> re.findall(r"(([a-z]+)([0-9]+))", "foo bar3 w000 no abc123")
[('bar3', 'bar', '3'), ('w000', 'w', '000'), ('abc123', 'abc', '123')]
py> groups = re.findall(r"(([a-z]+)([0-9]+))", "foo bar3 w000 no abc123")
py> groups
[('bar3', 'bar', '3'), ('w000', 'w', '000'), ('abc123', 'abc', '123')]
py> [group[0] for group in groups]
['bar3', 'w000', 'abc123']

Gabriel Genellina · Feb 16, 2008

The group() method of match objects does what you want:

terms = [match.group() for match in corporate_names.finditer(sourcetext)]

See http://docs.python.org/lib/match-objects.html
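
A minimal sketch of the difference, in Python 3 syntax. The pattern and sample text are illustrative stand-ins, not the pattern from this thread: findall() returns one tuple of groups per match when the pattern has more than one group, while group() called with no argument on a match object returns the whole matched string.

```python
import re

# Hypothetical sample pattern with two groups: a word followed by a number.
pattern = re.compile(r"(\w+)\s+(\d+)")
text = "python 1 junk xyzzy 42 java 666"

# findall() returns one tuple of groups per match.
as_tuples = pattern.findall(text)

# finditer() yields match objects; group() with no argument is the whole match.
whole_matches = [match.group() for match in pattern.finditer(text)]

print(as_tuples)      # [('python', '1'), ('xyzzy', '42'), ('java', '666')]
print(whole_matches)  # ['python 1', 'xyzzy 42', 'java 666']
```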

That ''.join(...) works equally well on tuples; you don't have to convert
tuples to lists first:

delisted_terms = [''.join(term) for term in terms]
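
One caveat, sketched here under assumptions (Python 3; the pattern and text are made up for illustration): joining the tuples from findall() reproduces the whole match only when the groups together cover the match exactly. A repeated group such as (...)* keeps just its last repetition, so the joined string can come out shorter than what match.group() returns.

```python
import re

# Hypothetical pattern: a word, then zero or more "-suffix" parts in a repeated group.
pattern = re.compile(r"([a-z]+)(-[a-z]+)*")
text = "well-known state-of-the-art"

# The repeated second group holds only its last repetition.
groups = pattern.findall(text)

# Joining the groups therefore drops the earlier repetitions.
joined = ["".join(group) for group in groups]

# group() on the match object still returns everything that matched.
whole = [match.group() for match in pattern.finditer(text)]

print(groups)  # [('well', '-known'), ('state', '-art')]
print(joined)  # ['well-known', 'state-art']
print(whole)   # ['well-known', 'state-of-the-art']
```

Since the pattern in this thread ends with a repeated group, finditer() with group() looks like the safer of the two approaches there.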

Jonathan Lukens · Feb 16, 2008

Thanks Gabriel,

That is just what I was looking for.

Jonathan

John Machin · Feb 16, 2008

What is the point of having parenthesised groups in the regex if you
are interested only in the whole match?

Other comments:
(1) raw string for improved legibility
ur'(?u)\b([á-ñ]{2,}\s+)([<<"][Á-Ñá-ñ]+)(\s*-?[Á-Ñá-ñ]+)*([>>"])'
(2) consider not including space at the end of a group
ur'(?u)\b([á-ñ]{2,})\s+([<<"][Á-Ñá-ñ]+)\s*(-?[Á-Ñá-ñ]+)*([>>"])'
(3) what appears between [] is a set of characters, so [<<"] is the
same as [<"] and probably isn't doing what you expect; have you tested
this regex for correctness?
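
If only the whole match is wanted, one option is to make the groups non-capturing with (?:...), in which case findall() returns the matched strings directly. A minimal Python 3 sketch follows; the ASCII pattern and sample text are hypothetical stand-ins for the Cyrillic pattern in this thread.

```python
import re

text = 'Both LLC "Acme" and JSC "North-West Trade" are listed.'

# With one capturing group, findall() returns only that group's content
# (its last repetition, or '' when the group did not take part in the match).
capturing = re.compile(r'\b[A-Z]{2,}\s+"[A-Za-z]+(\s*-?[A-Za-z]+)*"')
only_group = capturing.findall(text)

# With a non-capturing group, findall() returns each whole match.
non_capturing = re.compile(r'\b[A-Z]{2,}\s+"[A-Za-z]+(?:\s*-?[A-Za-z]+)*"')
whole_matches = non_capturing.findall(text)

print(only_group)     # ['', ' Trade']
print(whole_matches)  # ['LLC "Acme"', 'JSC "North-West Trade"']
```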

I can't imagine how "not a programmer" implies "interested to know if
there is a more elegant way". In any case, explore the correctness
axis first.

Cheers,
John

Jonathan Lukens · Feb 16, 2008

John,

(1) raw string for improved legibility
ur'(?u)\b([á-ñ]{2,}\s+)([<<"][Á-Ñá-ñ]+)(\s*-?[Á-Ñá-ñ]+)*([>>"])'

This actually escaped my notice until after I had posted -- the letters
with diacritics are incorrectly decoded Cyrillic letters -- I suppose I
could use the Unicode escape sequences (the sets [á-ñ] and [Á-Ñá-ñ] are
the Cyrillic equivalents of [a-z] and [A-Za-z]) but then suddenly the
legibility goes out the window again.

(3) what appears between [] is a set of characters, so [<<"] is the
same as [<"] and probably isn't doing what you expect; have you tested
this regex for correctness?

These were angled quotation marks in the original Unicode. Sorry
again. The regex matches everything it is supposed to. The extra
parentheses were there because I had somehow missed the .group method,
and it had been returning only what was in the one needed set of
parentheses.
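
A sketch of how the escape sequences might be kept readable by giving them names, in Python 3. It assumes the garbled ranges stand for the Cyrillic capitals А-Я and the combined set а-яА-Я, and that the quotation marks are « and »; none of that is confirmed beyond the description above. The constant names and sample text are made up for illustration, and the repeated group is non-capturing so that findall() returns whole matches.

```python
import re

# Assumed decoding of the garbled character sets (not confirmed by the source).
UPPER = "\u0410-\u042f"   # А-Я; note that Ё (U+0401) lies outside this range
LOWER = "\u0430-\u044f"   # а-я; ё (U+0451) lies outside this range
OPEN_QUOTE = "\u00ab\""   # « or "
CLOSE_QUOTE = "\u00bb\""  # » or "

# Non-capturing group, so findall() returns each whole match as a string.
corporate_names = re.compile(
    r"\b[%s]{2,}\s+[%s][%s%s]+(?:\s*-?[%s%s]+)*[%s]"
    % (UPPER, OPEN_QUOTE, LOWER, UPPER, LOWER, UPPER, CLOSE_QUOTE)
)

# Hypothetical sample text.
sourcetext = 'Компании ООО «Рога и копыта» и ЗАО "Вектор-М" работают.'
terms = corporate_names.findall(sourcetext)
print(terms)  # ['ООО «Рога и копыта»', 'ЗАО "Вектор-М"']
```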

I can't imagine how "not a programmer" implies "interested to know if
there is a more elegant way".

More carefully stated: "I am self-taught, have no real training or
experience as a programmer, and would be interested in seeing how a
programmer with training and experience would go about this."

Thank you,
Jonathan
