Regex help needed!

Oltmans · Dec 21, 2009

Hello, everyone.

I've a string that looks something like
----
lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
= "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
----

From the above string I need the digits within the id attribute. For
example, the required output from the above string is
- 35343433
- 345343
- 8898

I've written this regex, which kind of works:
re.findall("\w+\s*\W+amazon_(\d+)",str)

but I was wondering whether there might be a better regex to do the
same thing. Could you kindly suggest an improved regex? Thank you
in advance.

Umakanth · Dec 21, 2009

How about re.findall(r'\d+(?:\.\d+)?', str)?

It extracts only the numbers from any string.

~uk

mik3 · Dec 21, 2009

You don't need a regular expression. Just split on "amazon_":

>>> s = """lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id = "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>"""
>>> for item in s.split("amazon_")[1:]:
...     print(item)
...
345343'> kdjff lsdfs </div> sdjfls <div id = "
35343433">sdfsd</div><div id='
8898'>welcome</div>

Then find the index of the closing ' or " in each item and slice up to it.
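
The last step (finding the quote and slicing) might look roughly like the sketch below. The helper name `ids_by_splitting` and its `prefix` parameter are made up for illustration, and it assumes every id value is closed by a ' or " character.

```python
# Sample string from the original question.
SAMPLE = """lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
= "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>"""


def ids_by_splitting(text, prefix="amazon_"):
    """Return whatever sits between each prefix and the next quote character."""
    ids = []
    for chunk in text.split(prefix)[1:]:
        # Positions of the first ' and the first " in this chunk (-1 means absent).
        positions = [i for i in (chunk.find("'"), chunk.find('"')) if i != -1]
        if positions:
            # Slice up to whichever quote comes first.
            ids.append(chunk[:min(positions)])
    return ids


print(ids_by_splitting(SAMPLE))  # ['345343', '35343433', '8898']
```

Note that nothing here checks that the sliced text is numeric or that "amazon_" appeared inside an id attribute, so it is only as reliable as the input is regular.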

Peter Otten · Dec 21, 2009

>>> from BeautifulSoup import BeautifulSoup
>>> bs = BeautifulSoup("""lksjdfls <div id ='amazon_345343'> kdjff lsdfs
... </div> sdjfls <div id = "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>""")
>>> [node["id"][7:] for node in bs(id=lambda id: id.startswith("amazon_"))]
[u'345343', u'35343433', u'8898']

I think BeautifulSoup is a better tool for the task since it actually
"understands" HTML.

Peter

Oltmans · Dec 21, 2009

Umakanth said:
How about re.findall(r'\d+(?:\.\d+)?',str)

extracts only numbers from any string....

Thank you. However, I only need the digits within the ID attribute of
the DIV. The regex that you suggested fails on the following string

Umakanth · Dec 21, 2009

OK, how about re.findall(r'\w+_(\d+)', str)?

It returns ['345343', '35343433', '8898', '8898']!

MRAB · Dec 21, 2009

Try:

re.findall(r"""<div\s*id\s*=\s*['"]amazon_(\d+)['"]>""", str)

You shouldn't be using 'str' as a variable name because it hides the
builtin string class 'str'.
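
A minimal sketch of this pattern in use, with three changes that are assumptions rather than part of the suggestion above: the input is bound to a hypothetical name `html_text` instead of `str`, a backreference (`\1`) makes the closing quote match the opening one, and the match is case-insensitive.

```python
import re

# Sample string from the original question.
html_text = """lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
= "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>"""

# Group 1 captures the opening quote, group 2 the digits; \1 requires the
# closing quote to be the same character as the opening one.
AMAZON_ID = re.compile(r"""<div\s+id\s*=\s*(['"])amazon_(\d+)\1""", re.IGNORECASE)

amazon_ids = [m.group(2) for m in AMAZON_ID.finditer(html_text)]
print(amazon_ids)  # ['345343', '35343433', '8898']
```

Like any regex over HTML, this still assumes that id is the first attribute of the tag.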

Johann Spies · Dec 22, 2009

Your string is in /tmp/y in this example:

$ grep -oE '[0-9]+' /tmp/y
345343
35343433
8898

Much simpler, isn't it? But that is not Python.

Regards
Johann


Umakanth · Dec 22, 2009

How about re.findall(r'\w+.=\W\D+(\d+)?', str)?

This will work for any string within id!

~Ukanth

Paul McGuire · Dec 22, 2009

The issue with using regexen for parsing HTML is that you often get
surprised by attributes that you never expected, or that are out of
order, or that have weird or missing quotation marks, or by tags or
attributes in upper/lower case. BeautifulSoup is one tool to use for
HTML scraping; here is a pyparsing example, with hopefully descriptive
comments:

from pyparsing import makeHTMLTags,ParseException

src = """
lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
= "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
hello, my age is 86 years old and I was born in 1945. Do you know
that
PI is roughly 3.1443534534534534534 """

# use makeHTMLTags to return an expression that will match
# HTML <div> tags, including attributes, upper/lower case,
# etc. (makeHTMLTags will return expressions for both
# opening and closing tags, but we only care about the
# opening one, so just use the [0]th returned item)
div = makeHTMLTags("div")[0]

# define a parse action to filter only for <div> tags
# with the proper id form
def filterByIdStartingWithAmazon(tokens):
    if not tokens.id.startswith("amazon_"):
        raise ParseException(
            "must have id attribute starting with 'amazon_'")

# define a parse action that will add a pseudo-
# attribute 'amazon_id', to make it easier to get the
# numeric portion of the id after the leading 'amazon_'
def makeAmazonIdAttribute(tokens):
    tokens["amazon_id"] = tokens.id[len("amazon_"):]

# attach parse action callbacks to the div expression -
# these will be called during parse time
div.setParseAction(filterByIdStartingWithAmazon,
                   makeAmazonIdAttribute)

# search through the input string for matching <div>s,
# and print out their amazon_id's
for divtag in div.searchString(src):
    print(divtag.amazon_id)

Prints:

345343
35343433
8898
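
The pitfalls listed above (unexpected attributes, odd quoting, upper/lower case) can also be handled with only the standard library. The sketch below swaps in `html.parser.HTMLParser` in place of pyparsing. The class name `AmazonIdCollector` and the extra upper-case, unquoted `<DIV>` line in the sample are assumptions added for illustration.

```python
from html.parser import HTMLParser


class AmazonIdCollector(HTMLParser):
    """Collect the part after 'amazon_' from the id of every <div>."""

    PREFIX = "amazon_"

    def __init__(self):
        super().__init__()
        self.amazon_ids = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and strips the
        # quotes from attribute values before calling this method.
        if tag != "div":
            return
        for name, value in attrs:
            if name == "id" and value and value.startswith(self.PREFIX):
                self.amazon_ids.append(value[len(self.PREFIX):])


src = """
lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
= "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
hello, my age is 86 years old and I was born in 1945.
<DIV CLASS=x ID=amazon_12>upper case, unquoted, id not first</DIV> """

collector = AmazonIdCollector()
collector.feed(src)
collector.close()
print(collector.amazon_ids)  # ['345343', '35343433', '8898', '12']
```

The numbers in the running text (86, 1945) are ignored because only attribute values of opening div tags are inspected.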

F.R. · Dec 24, 2009

If you filter in two or even more sequential steps the problem becomes a
lot simpler, not least because you can
test each step separately:

>>> r1 = re.compile(r'<div id\D*\d+[^>]*')  # Add ignore case and variable white space
>>> r2 = re.compile(r'\d+')
>>> [r2.search(item).group() for item in r1.findall(s) if item]  # s is your sample
['345343', '35343433', '8898']  # Supposing all ids have digits

Frederic
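
A sketch of the same two-step filter with the ignore-case and variable-whitespace handling that the comment on r1 asks for. The names `div_with_id`, `digits`, `tags` and `ids` are chosen here for illustration, and tags without any digits are skipped rather than assumed away.

```python
import re

# Sample string from the original question.
s = """lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
= "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>"""

# Step 1: isolate opening <div> tags whose first attribute is id,
# ignoring case and allowing variable whitespace.
div_with_id = re.compile(r'<div\s+id\s*=[^>]*>', re.IGNORECASE)
# Step 2: pull the first run of digits out of each isolated tag.
digits = re.compile(r'\d+')

tags = div_with_id.findall(s)
found = [digits.search(tag) for tag in tags]
ids = [m.group() for m in found if m]  # skip tags that contain no digits
print(ids)  # ['345343', '35343433', '8898']
```

Printing `tags` on its own is an easy way to test the first step separately before adding the second.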

Aahz · Jan 7, 2010

'Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.'
--Jamie Zawinski

Take the advice other people gave you and use BeautifulSoup.

Rolando Espinoza La Fuente · Jan 7, 2010

# http://gist.github.com/271661

import lxml.html
import re

src = """
lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
= "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
hello, my age is 86 years old and I was born in 1945. Do you know
that
PI is roughly 3.1443534534534534534 """

regex = re.compile(r'amazon_(\d+)')

doc = lxml.html.document_fromstring(src)

for div in doc.xpath('//div[starts-with(@id, "amazon_")]'):
    match = regex.match(div.get('id'))
    if match:
        print(match.group(1))
