simple regular expression problem

duikboot · Sep 17, 2007

Hello,

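I am trying to extract a list of strings from a text. I have been
looking at it for hours now, and googling didn't help either.
Could you please help me?

import re

s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""
regex = re.compile(r'<organisatie.*</organisatie>', re.S)
L = regex.findall(s)
print(L)

This prints:
['<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>']

I expected:
['<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>', '<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>']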

I must be missing something very obvious.

Greetings Arjen

Jason Drew · Sep 17, 2007

You just need a one-character addition to your regex:

regex = re.compile(r'<organisatie.*?</organisatie>', re.S)

Note, there is now a question mark (?) after the .*

By default, regular expressions are "greedy" and will grab as much
text as possible when making a match. So your original expression was
grabbing everything between the first opening tag and the last closing
tag. The question mark says, don't be greedy, and you get the
behaviour you need.
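
As a minimal sketch of the difference (written for Python 3 and using the sample string from the question; the names `greedy` and `lazy` are only illustrative), the two patterns can be compared side by side:

```python
import re

# Sample text from the question: two <organisatie> elements in a row.
s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""

# Greedy: .* runs to the LAST closing tag, so both elements end up in one match.
greedy = re.compile(r'<organisatie.*</organisatie>', re.S)
# Non-greedy: .*? stops at the FIRST closing tag it can reach.
lazy = re.compile(r'<organisatie.*?</organisatie>', re.S)

greedy_matches = greedy.findall(s)
lazy_matches = lazy.findall(s)

print(len(greedy_matches))  # 1
print(len(lazy_matches))    # 2
```

The re.S flag matters here as well: without it, the dot does not match newlines, and neither pattern would find anything in this multi-line text.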

This is covered in the documentation for the re module.
http://docs.python.org/lib/module-re.html

Jason

duikboot · Sep 17, 2007

Thank you very much, it works. I guess I didn't read it right.

Arjen

George Sakkis · Sep 17, 2007

duikboot said:
I must be missing something very obvious.

The less obvious thing that you're missing is that regular expressions
are not the best solution to every text-related problem. Thinking at a
higher level sometimes helps; for example, here you don't want to
extract "a list of strings from a text", you want to extract specific
elements from an XML data source. There are several standard and
non-standard Python packages for XML processing; look for them online.
Here's how to do it using the (third-party) BeautifulSoup module:
[<organisatie>
<profiel_id>28996</profiel_id>
</organisatie>, <organisatie>
<profiel_id>28997</profiel_id>
</organisatie>]

HTH,
George

Jason Drew · Sep 17, 2007

You're welcome!

Also, of course, parsing XML is a very common task and you might be
interested in using one of the standard modules for that, e.g.
http://docs.python.org/lib/module-xml.parsers.expat.html

Then all the tricky parsing work has been done for you.
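
For illustration, here is a minimal sketch of the expat approach (Python 3). The handler names and the `state` dictionary are made up for this example, and wrapping the sample in a `<root>` element is an assumption, needed because an XML document must have a single root:

```python
import xml.parsers.expat

# Sample data from the question, wrapped in a single (assumed) root element.
s = ("<root><organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n"
     "<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie></root>")

ids = []
state = {"inside": False, "buf": []}

def start_element(name, attrs):
    # Start collecting text when a Profiel_Id element opens.
    if name == "Profiel_Id":
        state["inside"] = True
        state["buf"] = []

def char_data(data):
    # Text may arrive in several chunks, so collect the pieces.
    if state["inside"]:
        state["buf"].append(data)

def end_element(name):
    # Store the collected text when the Profiel_Id element closes.
    if name == "Profiel_Id":
        ids.append("".join(state["buf"]))
        state["inside"] = False

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.CharacterDataHandler = char_data
parser.EndElementHandler = end_element
parser.Parse(s, True)  # True marks this as the final chunk of input

print(ids)  # ['28996', '28997']
```

expat is event-driven, so it takes more code than a tree-based parser, but it never has to hold the whole document in memory.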

Jason

Bruno Desthuilliers · Sep 17, 2007

With regard to the regexp, Jason gave you the answer. Another point is
that, when dealing with XML, it's sometimes better to use an XML parser.

Quick and dirty:

>>> from xml.etree import ElementTree as ET
>>> s = "<root>" + s + "</root>"
>>> tree = ET.fromstring(s)
>>> tree.findall("organisatie/Profiel_Id")[0].text
'28996'
>>> [it.text for it in tree.findall("organisatie/Profiel_Id")]
['28996', '28997']

HTH

Diez B. Roggisch · Sep 17, 2007

Don't use regular expressions to process XML. They are not the right tool
for the job, and even if simple cases such as yours can often be made to
work initially, the longer you work with them, the more complex and
troublesome the code gets.
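
To make that concrete, here is a small sketch (Python 3, standard library only) with a hypothetical input that is not from the original question: one `<organisatie>` nested inside another. The non-greedy regex from earlier in the thread cuts the outer element short, while a parser sees both elements:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: an <organisatie> element nested inside another one.
s = ("<root><organisatie><Profiel_Id>1</Profiel_Id>"
     "<organisatie><Profiel_Id>2</Profiel_Id></organisatie>"
     "</organisatie></root>")

# The regex stops at the first closing tag, which belongs to the INNER
# element, so the single match is a truncated copy of the outer element.
lazy = re.compile(r'<organisatie.*?</organisatie>', re.S)
regex_matches = lazy.findall(s)
print(len(regex_matches))  # 1

# A parser tracks nesting and yields both elements in document order.
root = ET.fromstring(s)
ids = [el.findtext("Profiel_Id") for el in root.iter("organisatie")]
print(ids)  # ['1', '2']
```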

Instead, use the right tool, for example lxml. It has XPath expressions
built in, which do the job:

from lxml import etree

tree = etree.fromstring("""<root><organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie></root>""")

# Print the text of every Profiel_Id element directly under an organisatie element.
for feld in tree.xpath('//organisatie/Profiel_Id'):
    print(feld.text)

Diez

Aahz · Sep 17, 2007

duikboot said:
I am trying to extract a list of strings from a text. I have been
looking at it for hours now, googling didn't help either.

To emphasize the other answers you got about avoiding regexps, here's a
nice quote from my .sig database:

'Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.' --Jamie Zawinski
