Regex single quotes in scraper script?

Rock

Hi, I started using a Python-based screen scraper called newsscraper that I
downloaded from SourceForge:
http://sourceforge.net/projects/newsscraper/. I have created many Python
templates that work just fine from their examples. However, I ran into a
roadblock with sites that use single quotes instead of double quotes for
specifying URLs in their web pages.

For example: <a href='http://www.foo/'>

instead of the usual
<a href="http://www.foo/">

Being a real newbie with this, I think I found the area of code that parses
the href. It is in a file called parsefns.py.
The full excerpt is listed below, but here is the regex line that I believe
is not dealing with single quotes.

m = re.search(r'href\s*=\s*"?([^>" ]+)["> ]', text, re.I)

I have tried many different variations with no luck, and I have had no luck
getting hold of the author either. Any ideas? Thx.

---------------------
def get_href(text, base_url=None):
    """get_href(text[, base_url]) -> href or None

    Extract the URL out of an HREF tag. If base_url is provided,
    will attempt to resolve relative links.

    """
    m = re.search(r'href\s*=\s*"?([^>" ]+)["> ]', text, re.I)
    if not m:
        return None
    link = m.group(1)
    if base_url and not link.lower().startswith("http"):
        import urlparse
        link = urlparse.urljoin(base_url, link)
    return link
===============
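
For anyone following along, here is a small sketch of what the pattern above returns for each quoting style. The helper name extract and the two sample tags are only illustrative (the tags are the ones from the example earlier in this post); the pattern itself is copied from get_href.

```python
import re

# Pattern copied verbatim from get_href in parsefns.py.
ORIGINAL = r'href\s*=\s*"?([^>" ]+)["> ]'

def extract(pattern, text):
    """Return the captured href value, or None if the pattern does not match."""
    m = re.search(pattern, text, re.I)
    return m.group(1) if m else None

double_quoted = '<a href="http://www.foo/">'
single_quoted = "<a href='http://www.foo/'>"

# Double quotes: the optional " is consumed, so only the URL is captured.
print(extract(ORIGINAL, double_quoted))   # http://www.foo/

# Single quotes: the pattern still matches, but the quotes are not excluded
# by the character class, so they end up inside the captured value.
print(extract(ORIGINAL, single_quoted))   # 'http://www.foo/'
```

So the regex does not fail outright on single-quoted links. It captures the URL with the quotes still attached, and because that value no longer starts with "http", get_href then treats it as a relative link whenever base_url is given.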
 
T

Terry Reedy

Unknown said:
Hi, I started using a python based screen scraper called newsscraper I
downloaded from sourceforge.
http://sourceforge.net/projects/newsscraper/. I have created many python
templates that work just fine from their examples however I ran into a road
block with sites that use single quotes instead of double quotes for
specifying url in their web pages.

For example: <a href='http://www.foo/'>

instead of the usual
<a href="http://www.foo/">

Being a real newbie with this I think I found the area of code that parses
the href. It is in a file called parsefns.py
the full excerpt is listed below but here is the regex line that I believe
is not dealing with single quote.

m = re.search(r'href\s*=\s*"?([^>" ]+)["> ]', text, re.I)

I have tried many different variations but no luck and no luck getting hold
of the author. Any ideas? Thx.

Did you try reversing all single and double quotes? I.e., r"...'...'...'..."
If that doesn't work, you need someone else to answer.
A list of the variations that did not work might also help someone to answer.

TJR
 
C

Christopher T King

Being a real newbie with this I think I found the area of code that parses
the href. It is in a file called parsefns.py
the full excerpt is listed below but here is the regex line that I believe
is not dealing with single quote.

m = re.search(r'href\s*=\s*"?([^>" ]+)["> ]', text, re.I)

I have tried many different variations but no luck and no luck getting hold
of the author. Any ideas? Thx.

Good job tracking that down. Methinks you'll want to change it to read
thusly:

m = re.search(r'href\s*=\s*["\']?([^>"\' ]+)["\'> ]', text, re.I)

This will possibly break some sites, though (namely those that use single
quotes in their URLs, but those are broken anyways). A proper fix would
require a tad more work (i.e. either a much, much messier regex or a
change in the function), and it's really late right now ;)
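
A rough sketch of both options, for comparison. EITHER_QUOTE is the pattern suggested above; MATCHED_QUOTE and the two function names are hypothetical, just one way the "messier regex" could look (it uses a backreference so a quoted value must close with the same quote it opened with). Neither is a substitute for a real HTML parser.

```python
import re

# The pattern suggested above: accept either quote character around the value.
EITHER_QUOTE = r'href\s*=\s*["\']?([^>"\' ]+)["\'> ]'

# Hypothetical stricter variant (an assumption, not from newsscraper):
# group 1 is the opening quote, \1 requires the same quote to close,
# and the second alternative handles unquoted values.
MATCHED_QUOTE = r'''href\s*=\s*(?:(["'])(.*?)\1|([^\s>"']+))'''

def get_href_loose(text):
    m = re.search(EITHER_QUOTE, text, re.I)
    return m.group(1) if m else None

def get_href_strict(text):
    m = re.search(MATCHED_QUOTE, text, re.I)
    if not m:
        return None
    # Group 2 holds a quoted value, group 3 an unquoted one.
    return m.group(2) if m.group(1) else m.group(3)

print(get_href_loose("<a href='http://www.foo/'>"))    # http://www.foo/
print(get_href_loose('<a href="http://www.foo/">'))    # http://www.foo/

# A URL containing an apostrophe shows the difference between the two.
tricky = '<a href="http://www.foo/o\'brien">'
print(get_href_loose(tricky))    # http://www.foo/o
print(get_href_strict(tricky))   # http://www.foo/o'brien
```

The loose pattern is a one-line change to get_href; the strict one needs the extra group handling shown in get_href_strict, which is the "change in the function" part.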
 
R

Rock

Christopher T King said:
Being a real newbie with this I think I found the area of code that parses
the href. It is in a file called parsefns.py
the full excerpt is listed below but here is the regex line that I believe
is not dealing with single quote.

m = re.search(r'href\s*=\s*"?([^>" ]+)["> ]', text, re.I)

I have tried many different variations but no luck and no luck getting hold
of the author. Any ideas? Thx.

Good job tracking that down. Methinks you'll want to change it to read
thusly:

m = re.search(r'href\s*=\s*["\']?([^>"\' ]+)["\'> ]', text, re.I)

Woohoo! That fixed my problem with single-quoted sites, and double quotes
still seem to work just fine.

Thanks man.