Regex help needed!

Oltmans

Hello, everyone.

I've a string that looks something like
----
lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
= "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
----

From the above string I need the digits within the id attribute. For
example, the required output from the above string is
- 35343433
- 345343
- 8898

I've written this regex, which kind of works:
re.findall(r"\w+\s*\W+amazon_(\d+)", str)

but I was wondering whether there might be a better regex for the
same task. Can you kindly suggest a better or improved regex? Thank you
in advance.
 
Umakanth

How about re.findall(r'\d+(?:\.\d+)?', str)?

It extracts only the numbers from any string.

~uk
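
To see what this pattern returns in practice, here is a minimal sketch. The last sentence of the text is a made-up addition (an assumption, not part of the question) that puts a few unrelated numbers next to the sample.

```python
import re

# The sample from the question, plus a hypothetical sentence with
# unrelated numbers (assumption: added only for illustration).
text = (
    "lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id\n"
    "= \"amazon_35343433\">sdfsd</div><div id='amazon_8898'>welcome</div>\n"
    "I was born in 1945 and pi is roughly 3.14"
)

# Matches every integer or decimal number, wherever it appears.
all_numbers = re.findall(r"\d+(?:\.\d+)?", text)
print(all_numbers)
```

The pattern has no notion of the id attribute, so numbers outside the div tags are returned along with the ids.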
 
mik3

You don't need a regular expression. Just split on "amazon_":
>>> s = """lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id = "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>"""
>>> for item in s.split("amazon_")[1:]:
...     print(item)
...
345343'> kdjff lsdfs </div> sdjfls <div id = "
35343433">sdfsd</div><div id='
8898'>welcome</div>

Then find the index of the closing ' or " in each piece and slice up to it.
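
One way the remaining slicing step might look is sketched below. The function name ids_by_split and the prefix parameter are hypothetical names chosen for this example, and the sketch assumes every id value is closed by a single or double quote.

```python
SAMPLE = """lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
= "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>"""


def ids_by_split(text, prefix="amazon_"):
    """Return the digits that follow each occurrence of prefix.

    Hypothetical helper: splits on the prefix, then slices each piece
    up to the first closing quote.
    """
    ids = []
    for chunk in text.split(prefix)[1:]:
        # Position of the first closing quote, single or double.
        ends = [i for i in (chunk.find("'"), chunk.find('"')) if i != -1]
        if not ends:
            continue  # no closing quote, so skip this piece
        value = chunk[:min(ends)]
        if value.isdigit():
            ids.append(value)
    return ids


print(ids_by_split(SAMPLE))
```

This stays regex-free, but it also picks up "amazon_" followed by digits and a quote anywhere in the text, not only inside a div's id attribute.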
 
Peter Otten

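>>> from BeautifulSoup import BeautifulSoup
>>> bs = BeautifulSoup("""lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
... = "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>""")
>>> [node["id"][7:] for node in bs(id=lambda id: id.startswith("amazon_"))]
[u'345343', u'35343433', u'8898']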

I think BeautifulSoup is a better tool for the task since it actually
"understands" HTML.

Peter
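
For readers without BeautifulSoup installed, the same idea (let an HTML parser find the attributes, then inspect them) can be sketched with the standard library's html.parser module. This is a swapped-in technique, not the BeautifulSoup approach itself, and the names AmazonIdCollector and collect_ids are hypothetical.

```python
from html.parser import HTMLParser

SAMPLE = """lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
= "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>"""


class AmazonIdCollector(HTMLParser):
    """Collect the numeric part of every <div id="amazon_NNN"> start tag."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        if tag != "div":
            return
        for name, value in attrs:
            if name == "id" and value and value.startswith("amazon_"):
                suffix = value[len("amazon_"):]
                if suffix.isdigit():
                    self.ids.append(suffix)


def collect_ids(text):
    """Hypothetical helper: feed text to the parser and return the ids."""
    parser = AmazonIdCollector()
    parser.feed(text)
    parser.close()
    return parser.ids


print(collect_ids(SAMPLE))
```

Because the parser handles quoting, whitespace around "=", and letter case, the filtering code only has to look at the attribute value.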
 
Oltmans

Umakanth said:
How about re.findall(r'\d+(?:\.\d+)?',str)

extracts only numbers from any string....

Thank you. However, I only need the digits within the id attribute of
the DIV. The regex that you suggested fails on the following string
 
Umakanth

OK, how about re.findall(r'\w+_(\d+)', str)?

It returns ['345343', '35343433', '8898', '8898']!
 
MRAB

Try:

re.findall(r"""<div\s+id\s*=\s*['"]amazon_(\d+)['"]>""", str)

You shouldn't be using 'str' as a variable name because it hides the
builtin string class 'str'.
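
A minimal sketch of this suggestion with a non-shadowing variable name follows. The names AMAZON_DIV and find_amazon_ids are hypothetical; the pattern requires at least one whitespace character between "div" and "id", and the re.IGNORECASE flag is an added assumption to tolerate upper-case tags.

```python
import re

SAMPLE = """lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
= "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>"""

# Compiled once; the single group captures the digits after "amazon_".
AMAZON_DIV = re.compile(
    r"""<div\s+id\s*=\s*['"]amazon_(\d+)['"]>""",
    re.IGNORECASE,
)


def find_amazon_ids(text):
    """Return the numeric ids of all matching <div> tags in text."""
    return AMAZON_DIV.findall(text)


print(find_amazon_ids(SAMPLE))
```

Using "text" (or any other name) as the parameter keeps the builtin str available inside the function.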
 
Johann Spies

Your string is in /tmp/y in this example:

$ grep -oE '[0-9]+' /tmp/y
345343
35343433
8898

Much simpler, isn't it? But that is not Python.

Regards
Johann

 
Umakanth

How about re.findall(r'\w+.=\W\D+(\d+)?', str)?

This will work for any string within id!

~Ukanth
 
Paul McGuire

The issue with using regexen for parsing HTML is that you often get
surprised by attributes that you never expected, attributes that are
out of order, weird or missing quotation marks, or tags and attributes
in mixed upper/lower case. BeautifulSoup is one tool to use for HTML
scraping. Here is a pyparsing example, with hopefully descriptive
comments:


from pyparsing import makeHTMLTags, ParseException

src = """
lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
= "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
hello, my age is 86 years old and I was born in 1945. Do you know that
PI is roughly 3.1443534534534534534 """

# use makeHTMLTags to return an expression that will match
# HTML <div> tags, including attributes, upper/lower case,
# etc. (makeHTMLTags will return expressions for both
# opening and closing tags, but we only care about the
# opening one, so just use the [0]th returned item)
div = makeHTMLTags("div")[0]

# define a parse action to filter only for <div> tags
# with the proper id form
def filterByIdStartingWithAmazon(tokens):
    if not tokens.id.startswith("amazon_"):
        raise ParseException(
            "must have id attribute starting with 'amazon_'")

# define a parse action that will add a pseudo-
# attribute 'amazon_id', to make it easier to get the
# numeric portion of the id after the leading 'amazon_'
def makeAmazonIdAttribute(tokens):
    tokens["amazon_id"] = tokens.id[len("amazon_"):]

# attach parse action callbacks to the div expression -
# these will be called during parse time
div.setParseAction(filterByIdStartingWithAmazon,
                   makeAmazonIdAttribute)

# search through the input string for matching <div>s,
# and print out their amazon_id's
for divtag in div.searchString(src):
    print(divtag.amazon_id)


Prints:

345343
35343433
8898
 
F.R.

If you filter in two or more sequential steps, the problem becomes a
lot simpler, not least because you can test each step separately:
>>> import re
>>> r1 = re.compile(r'<div id\D*\d+[^>]*')  # Add ignore case and variable white space
>>> r2 = re.compile(r'\d+')
>>> # s is your sample; this supposes all ids have digits
>>> [r2.search(item).group() for item in r1.findall(s) if item]
['345343', '35343433', '8898']
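
The "ignore case and variable white space" note could be filled in roughly as follows. This is a sketch under assumptions: the names DIV_WITH_ID, DIGITS and two_step_ids are hypothetical, and the choice of re.IGNORECASE plus \s+ is one possible reading of that comment.

```python
import re

SAMPLE = """lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
= "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>"""

# Step 1: isolate each opening <div id ...> tag that contains digits.
DIV_WITH_ID = re.compile(r"<div\s+id\D*\d+[^>]*", re.IGNORECASE)
# Step 2: pull the first run of digits out of an isolated tag.
DIGITS = re.compile(r"\d+")


def two_step_ids(text):
    """Apply the two filters in sequence and return the digit strings."""
    tags = DIV_WITH_ID.findall(text)  # each step can be inspected on its own
    return [DIGITS.search(tag).group() for tag in tags]


print(two_step_ids(SAMPLE))
```

Printing the intermediate list of tags is an easy way to check the first step before trusting the second. Note that \D* is not limited to a single tag, so a div whose id has no digits can cause the match to run on into later text.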

Frederic
 
Aahz

'Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.'
--Jamie Zawinski

Take the advice other people gave you and use BeautifulSoup.
 
Rolando Espinoza La Fuente

# http://gist.github.com/271661

import lxml.html
import re

src = """
lksjdfls <div id ='amazon_345343'> kdjff lsdfs </div> sdjfls <div id
= "amazon_35343433">sdfsd</div><div id='amazon_8898'>welcome</div>
hello, my age is 86 years old and I was born in 1945. Do you know
that
PI is roughly 3.1443534534534534534 """

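# capture the digits that follow the 'amazon_' prefix
regex = re.compile(r'amazon_(\d+)')

doc = lxml.html.document_fromstring(src)

# select only <div> elements whose id starts with 'amazon_'
for div in doc.xpath('//div[starts-with(@id, "amazon_")]'):
    match = regex.match(div.get('id'))
    if match:
        print(match.group(1))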
 
