Saving search results in a dictionary

Lukas Holcik

Hi everyone!

How can I simply search text for regexps (let's say <a
href="(.*?)">(.*?)</a>) and save all URLs (1) and link contents (2) in a
dictionary { name : URL }? In a single pass, if possible.

Or how can I replace the HTML &entities; in a string
"blablabla&amp;blablabal&amp;balbalbal" with the characters they stand for,
using re.sub? I found out they are stored in a dict [from htmlentitydefs
import entitydefs]. I thought about something like this:

regexp = re.compile("&[a-zA-Z];")
regexp.sub(entitydefs[r'\1'], url)

but it can't work, because the r'...' must be consumed directly by sub and
cannot be used independently like this (at least I think so). Any ideas?
Thanks in advance.

-i

---------------------------------------_.)--
| Lukas Holcik ([email protected]) (\=)*
----------------------------------------''--
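
For what it's worth, re.sub also accepts a function as its replacement
argument, which is what makes this kind of lookup workable after all; a
minimal sketch (Python 2, using the entitydefs table from htmlentitydefs;
the helper name unescape is just illustrative):

import re
from htmlentitydefs import entitydefs

entity_re = re.compile(r"&([a-zA-Z]+);")

def unescape(text):
    # sub() calls the function once per match; the captured entity name is
    # looked up in entitydefs, and unknown entities are left untouched.
    return entity_re.sub(lambda m: entitydefs.get(m.group(1), m.group(0)), text)

print unescape("blablabla&amp;blablabal&amp;balbalbal")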
 
Duncan Booth

Or how can I replace the HTML &entities; in a string
"blablabla&amp;blablabal&amp;balbalbal" with the characters they stand for,
using re.sub? I found out they are stored in a dict [from htmlentitydefs
import entitydefs]. I thought about something like this:

You really don't want to use a regex for this. Remember that as well as
forms like &amp; you can equally use hex escapes such as &#x26;.

I suggest you consider parsing your HTML using sgmllib as that will
automatically handle all the entity definitions without you having to worry
about them.

Likewise your question about extracting all the links in a single pass is
much easier to do reliably if you use sgmllib than with a regular
expression.
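
A rough sketch of that suggestion (Python 2, sgmllib; the class name
LinkExtractor and its links attribute are my own, not part of the library):

import sgmllib
import htmlentitydefs
import urllib

class LinkExtractor(sgmllib.SGMLParser):
    # Use the full named-entity table so &amp;, &eacute; and friends are
    # decoded in character data without any regex work on our part.
    entitydefs = htmlentitydefs.entitydefs

    def reset(self):
        sgmllib.SGMLParser.reset(self)
        self.links = {}          # {link text : href}
        self._href = None
        self._text = []

    def start_a(self, attrs):
        # attrs is a list of (name, value) pairs; values are already unquoted
        for name, value in attrs:
            if name == "href":
                self._href = value
        self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def end_a(self):
        if self._href is not None:
            self.links["".join(self._text).strip()] = self._href
        self._href = None

parser = LinkExtractor()
parser.feed(urllib.urlopen("http://www.yahoo.com").read())
parser.close()
print parser.links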
 
Paul McGuire

Lukas Holcik said:
<snip>
Lukas -

Here is an example script from the upcoming 1.2 release of pyparsing. It is
certainly not a one-liner, but it should be fairly easy to follow. (This
example makes two passes over the input, but only to show two different
output styles - the dictionary creation is done in a single pass.)

Download pyparsing at http://pyparsing.sourceforge.net .

-- Paul

# URL extractor
# Copyright 2004, Paul McGuire
from pyparsing import Literal, Suppress, CharsNotIn, CaselessLiteral, \
    Word, dblQuotedString, alphanums
import urllib
import pprint

# Define the pyparsing grammar for a URL, that is:
#   URLlink ::= <a href= URL>linkText</a>
#   URL ::= doubleQuotedString | alphanumericWordPath
# Note that whitespace may appear just about anywhere in the link. Note also
# that it is not necessary to explicitly show this in the pyparsing grammar;
# by default, pyparsing skips over whitespace between tokens.
linkOpenTag = (Literal("<") + "a" + "href" + "=").suppress() + \
              ( dblQuotedString | Word(alphanums+"/") ) + \
              Suppress(">")
linkCloseTag = Literal("<") + "/" + CaselessLiteral("a") + ">"
link = linkOpenTag + CharsNotIn("<") + linkCloseTag.suppress()

# Go get some HTML with some links in it.
serverListPage = urllib.urlopen( "http://www.yahoo.com" )
htmlText = serverListPage.read()
serverListPage.close()

# scanString is a generator that loops through the input htmlText, and for each
# match yields the tokens and start and end locations (for this application, we
# are not interested in the start and end values).
for toks,strt,end in link.scanString(htmlText):
    print toks.asList()

# Rerun scanString, but this time create a dict of text:URL key-value pairs.
# Need to reverse the tokens returned by link, using a parse action.
link.setParseAction( lambda st,loc,toks: [ toks[1], toks[0] ] )

# Create dictionary from list comprehension, assembled from each pair of
# tokens returned from a matched URL.
pprint.pprint(
    dict( [ toks for toks,strt,end in link.scanString(htmlText) ] )
    )
 
Lukas Holcik

Hi Paul, and thanks for the reply,

Why is the pyparsing module better than re? Just a question I must ask
before I can use it, meant with no offense. I found an extra PDF howto
about regexps on python.org and found out that there is a method called
finditer which can accomplish this task quite easily:

import re

# 'text' holds the HTML being searched
regexp = re.compile(r'<a href="(?P<href>.*?)">(?P<pcdata>.*?)</a>', re.I)
links = {}
iterator = regexp.finditer(text)
for match in iterator:
    links[match.group("pcdata")] = match.group("href")

---------------------------------------_.)--
| Lukas Holcik ([email protected]) (\=)*
----------------------------------------''--

<snip>
 
Paul McGuire

Lukas Holcik said:
Hi Paul, and thanks for the reply,

Why is the pyparsing module better than re? Just a question I must ask
before I can use it, meant with no offense. I found an extra PDF howto
about regexps on python.org and found out that there is a method called
finditer which can accomplish this task quite easily:

import re

# 'text' holds the HTML being searched
regexp = re.compile(r'<a href="(?P<href>.*?)">(?P<pcdata>.*?)</a>', re.I)
links = {}
iterator = regexp.finditer(text)
for match in iterator:
    links[match.group("pcdata")] = match.group("href")

---------------------------------------_.)--
| Lukas Holcik ([email protected]) (\=)*
----------------------------------------''--
<snip>

Lukas -

A reasonable question, no offense taken. :)

Well, I'm not sure I'd say pyparsing was "better" than re - maybe "easier",
or "easier to read", or "easier to maintain", or "easier for those who don't
do regexps frequently enough to have all the re symbols memorized". And
I'd be the first to admit that pyparsing is slow at runtime. I would also
tell you that I am far from being a regexp expert, having had to delve into
them only 3 or 4 times in the past 10 years (and consequently had to
re-learn them each time).

On the other hand, pyparsing does do some things to simplify your life. For
instance, there are a number of valid HTML anchors that the re you listed
above would miss. First of all, whitespace has a funny way of cropping up
in unexpected places, such as between 'href' and '=', or between '=' and the
leading ", or in the closing /a tag as "< /a >". What often starts out as a
fairly clean-looking regexp such as you posted quickly becomes mangled with
markers for optional whitespace. (Although I guess there *is* a magic
re tag to indicate that whitespace between tokens may or may not be
there...)
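
For illustration (my own sketch, not from the thread), here is roughly what
the "clean" pattern turns into once optional whitespace is allowed around
every token:

import re

link_re = re.compile(
    r'<\s*a\s+href\s*=\s*"(?P<href>[^"]*)"\s*>'   # whitespace-tolerant open tag
    r'(?P<pcdata>.*?)'                            # link text
    r'<\s*/\s*a\s*>',                             # whitespace-tolerant close tag
    re.IGNORECASE | re.DOTALL)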

Comments are another element that can confound well-intentioned regexp
efforts. The pyparsing example that I gave earlier did not handle HTML
comments, but to do so, you would define an HTML comment element, and
then add the statement:
link.ignore( htmlComment )
(pyparsing includes a comment definition for C-style block comments of
the /* */ variety - maybe adding an HTML comment definition would be
useful?) What would the above regexp look like to handle embedded HTML
comments?
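
Such an htmlComment element might look something like the sketch below;
whether SkipTo was already available in the pyparsing 1.2 of that era is an
assumption on my part:

from pyparsing import Literal, SkipTo

# everything from "<!--" up to and including the closing "-->" is ignorable
htmlComment = Literal("<!--") + SkipTo("-->") + Literal("-->")
link.ignore( htmlComment )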

In the sample I posted earlier, extracting URL refs from www.yahoo.com, a
number of href's were *not* inside quoted strings - how quickly could the
above regexp be modified to handle this?

Doesn't the .* only match non-white characters? Does the above regexp
handle hrefs that are quoted strings with embedded spaces? What about
pcdata with embedded spaces? (Sorry if my re ignorance is showing here.)
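
For what it's worth, "." in Python's re actually matches any character except
a newline, so .*? does cross spaces; to let it cross line breaks as well you
need the re.DOTALL (re.S) flag - a quick check:

import re

print re.search(r'>(?P<pcdata>.*?)<', '>two words<').group('pcdata')         # two words
print re.search(r'>(?P<pcdata>.*?)<', '>two\nlines<', re.S).group('pcdata')  # spans the newline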

Lastly, pyparsing does handle some problems that regexps can't, most
notably those that have some recursive definition, such as algebraic infix
notation, or EBNF. Working samples of both of these are included in the
samples that come with pyparsing. (There are other parsers out there besides
pyparsing, too, that can do this same job.)
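
As a tiny illustration of the recursive case (again my own sketch, written
against a later pyparsing release than 1.2): nested parenthesized groups,
which a single regexp cannot match in general, need only a Forward
placeholder:

from pyparsing import Forward, Group, Literal, Word, ZeroOrMore, alphanums

expr = Forward()        # placeholder, so the grammar can refer to itself
atom = Word(alphanums)
expr << Group( Literal("(").suppress()
               + ZeroOrMore( atom | expr )
               + Literal(")").suppress() )

print expr.parseString("(a (b c) (d (e)))").asList()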

pyparsing's runtime performance is pretty slow, positively glacial compared
to compiled regexps or string splits. I've warned away some potential
pyparsing users who had *very* clean input data (no hand-edited input text,
very stable and simple input format) and whose string split() approach ran
50X faster than pyparsing. This was a good exercise for me: I used the
hotshot profiler to remove 30-40% of the runtime, but was still far shy of
the much speedier string-splitting algorithm. But again, this application had
*very* clean input data, with a straightforward format. He also had a very
demanding runtime performance criterion, having to read and process about
50,000 data records at startup - string.split() took about 0.08 seconds,
pyparsing took about 5 seconds. My recommendation was to *not* use
pyparsing in this case.

On the other hand, for simple one-offs, or for functions that are not
time-critical parts of a program, or if it doesn't matter whether the program
takes 10 minutes to write and 30 seconds to run (with, say, pyparsing) vs. 15
minutes to write and 0.5 seconds to run (with, say, regexps), I'd say
pyparsing is a good choice. And when you find you need to add or extend a
given parsing construct, it is usually a very straightforward process with
pyparsing.

I've had a number of e-mails telling me how pleasant and intuitive it is to
work with pyparsing, in some ways reminiscent of the "I like coding in
Python, even if it is slower than C at runtime" comments we read in c.l.py
every week (along with many expositions on how raw runtime performance
is not always the best indicator of what solution is the "best").

Just as David Mertz describes in his Text Processing with Python book,
each of these is just one of many tools in our toolkit. Don't make your
solution more complicated than it needs to be. The person who, in 6
months, needs to figure out just how the heck your code works
just might be you.

Sorry for the length of response, hope some of you are still awake...

-- Paul
 
Jean Brouwers

If you need a fast parser in Python, try SimpleParse (mxTextTools). We
use it to parse and process large log files, 100+ MB in size.

<http://members.rogers.com/mcfletch/programming/simpleparse/simpleparse.html>

As an example, the run time for the parsing step alone, with a simple but
non-trivial grammar, is comparable to grep. Total run time is dominated
by the processing step and increases for more complex grammars,
obviously.

/Jean Brouwers
ProphICy Semiconductor, Inc.
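
In case it helps, here is a very small sketch of what a SimpleParse grammar
and parser call look like; the production names and the exact declaration are
my own guess at the style, not taken from the thread:

from simpleparse.parser import Parser

declaration = r'''
logfile  := line+
line     := word, (ws, word)*, newline
word     := [a-zA-Z0-9./]+
ws       := [ \t]+
newline  := [\n]
'''

parser = Parser(declaration, "logfile")
success, children, next_char = parser.parse("GET /index.html 200\n")
print success, next_char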


Paul McGuire said:
<snip>
 
