simple regular expression problem

duikboot

Hello,

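I am trying to extract a list of strings from a text. I have been looking at it
for hours now, and googling didn't help either.
Could you please help me?

import re
s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""
regex = re.compile(r'<organisatie.*</organisatie>', re.S)
L = regex.findall(s)
print(L)

['organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie']

I expected:
[('organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>), (<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie')]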

I must be missing something very obvious.

Greetings Arjen
 
Jason Drew

You just need a one-character addition to your regex:

regex = re.compile(r'<organisatie.*?</organisatie>', re.S)

Note, there is now a question mark (?) after the .*

By default, regular expressions are "greedy" and will grab as much
text as possible when making a match. So your original expression was
grabbing everything between the first opening tag and the last closing
tag. The question mark says, don't be greedy, and you get the
behaviour you need.
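
To see the difference concretely, here is a minimal sketch that runs both patterns against the sample string from the question. The names greedy, lazy, greedy_matches and lazy_matches are only illustrative and are not from the original post.

```python
import re

# Sample text from the question (the \n escapes become real newlines).
s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""

# Greedy: .* runs to the LAST closing tag, so both records end up in one match.
greedy = re.compile(r'<organisatie.*</organisatie>', re.S)
# Non-greedy: .*? stops at the FIRST closing tag it can reach.
lazy = re.compile(r'<organisatie.*?</organisatie>', re.S)

greedy_matches = greedy.findall(s)
lazy_matches = lazy.findall(s)

print(len(greedy_matches))  # 1
print(len(lazy_matches))    # 2
for match in lazy_matches:
    print(repr(match))
```

The re.S flag matters in both cases: without it the dot would not match the newlines inside each record.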

This is covered in the documentation for the re module.
http://docs.python.org/lib/module-re.html

Jason

 
duikboot

Thank you very much, it works. I guess I didn't read it right.

Arjen

duikboot said:
import re
s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""
regex = re.compile(r'<organisatie.*</organisatie>', re.S)
L = regex.findall(s)
print(L)
 
George Sakkis

The less obvious thing that you're missing is that regular expressions
are not the best solution to every text-related problem. Thinking at a
higher level sometimes helps; for example, here you don't want to
extract "a list of strings from a text", you want to extract specific
elements from an XML data source. There are several standard and
non-standard Python packages for XML processing; look for them online.
Here's how to do it using the (3rd party) BeautifulSoup module:
[<organisatie>
<profiel_id>28996</profiel_id>
</organisatie>, <organisatie>
<profiel_id>28997</profiel_id>
</organisatie>]


HTH,
George
 
Jason Drew

You're welcome!

Also, of course, parsing XML is a very common task and you might be
interested in using one of the standard modules for that, e.g.
http://docs.python.org/lib/module-xml.parsers.expat.html

Then all the tricky parsing work has been done for you.
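
As a rough sketch of what that could look like with the expat module, assuming the fragment from the question is wrapped in a made-up <root> element so that it forms a single well-formed document. The handler names and the profile_ids list are illustrative only.

```python
import xml.parsers.expat

# Sample text from the question.
s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""

profile_ids = []      # collected text of each <Profiel_Id> element
inside_id = [False]   # one-element list so the handlers can mutate the flag

def start_element(name, attrs):
    # Start collecting text when a <Profiel_Id> element opens.
    if name == "Profiel_Id":
        inside_id[0] = True
        profile_ids.append("")

def end_element(name):
    if name == "Profiel_Id":
        inside_id[0] = False

def char_data(data):
    # expat may deliver text in several chunks, so accumulate them.
    if inside_id[0]:
        profile_ids[-1] += data

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.CharacterDataHandler = char_data

# The <root> wrapper is an assumption: the fragment has two top-level elements.
parser.Parse("<root>" + s + "</root>", True)

print(profile_ids)  # ['28996', '28997']
```

expat is event-based, so it takes more code than a tree API, but it never has to guess where an element ends.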

Jason


 
Bruno Desthuilliers

Regarding the regexp, Jason gave you the answer. Another point is that, when
dealing with XML, it's sometimes better to use an XML parser.

Quick and dirty:
>>> from xml.etree import ElementTree as ET
>>> s = "<root>" + s + "</root>"
>>> tree = ET.fromstring(s)
>>> tree
>>> tree.findall("organisatie/Profiel_Id")
>>> _[0].text
'28996'
>>> [it.text for it in tree.findall("organisatie/Profiel_Id")]
['28996', '28997']
>>>
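
If the goal really is one string per organisatie element, as in the original question, the same ElementTree approach can hand the elements back as text. This is only a sketch: the <root> wrapper and the names blocks and ids are assumptions for illustration, and encoding="unicode" needs Python 3.

```python
import xml.etree.ElementTree as ET

# Sample text from the question.
s = """ \n<organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie>"""

# Wrap the fragment in a made-up <root> so it parses as one document.
root = ET.fromstring("<root>" + s + "</root>")

# Serialize each <organisatie> element back to a string.
# strip() drops the whitespace that follows an element (its "tail").
blocks = [
    ET.tostring(el, encoding="unicode").strip()
    for el in root.findall("organisatie")
]

# Or pull out just the ids, one per <organisatie>.
ids = [el.findtext("Profiel_Id") for el in root.findall("organisatie")]

print(blocks)
print(ids)  # ['28996', '28997']
```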

HTH
 
Diez B. Roggisch

Don't use regular expressions to process XML. They're not the right tool for
the job, and even if simple cases like yours can often be made to work
initially, the longer you work with them, the more complex and troublesome
the code gets.

Instead, use the right tool, for example lxml. It has e.g.
XPath expressions built in, which do the job:


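from lxml import etree

# The fragment is wrapped in <root> so it parses as a single document.
tree = etree.fromstring("""<root><organisatie>\n<Profiel_Id>28996</Profiel_Id>\n</organisatie>\n<organisatie>\n<Profiel_Id>28997</Profiel_Id>\n</organisatie></root>""")

# Print the text of every Profiel_Id that sits inside an organisatie element.
for feld in tree.xpath('//organisatie/Profiel_Id'):
    print(feld.text)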



Diez
 
Aahz

To emphasize the other answers you got about avoiding regexps, here's a
nice quote from my .sig database:

'Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.' --Jamie Zawinski
 
