Help on regular expression match

Discussion in 'Python' started by Johnny Lee, Sep 23, 2005.

  1. Johnny Lee

    Hi,
    I've run into a problem matching a regular expression in Python. I hope
    one of you can help me. Here are the details:

    I have many tags like this:
    xxx<a href="http://xxx.xxx.xxx" xxx>xxx
    xxx<a href="wap://xxx.xxx.xxx" xxx>xxx
    xxx<a href="http://xxx.xxx.xxx" xxx>xxx
    .....
    I want to extract all of the "http://xxx.xxx.xxx" parts, so I do it
    like this:

    httpPat = re.compile("(<a )(href=\")(http://.*)(\")")
    result = httpPat.findall(data)

    I use this to observe my output:

    for i in result:
        print i[2]
    Surprisingly I will get some output like this:
    http://xxx.xxx.xxx">xxx</a>xxx
    In fact it's filtered from this kind of source:
    <a href="http://xxx.xxx.xxx">xxx</a>xxx"
    But some results are right. I wonder how I can get all of the answers
    clean, like "http://xxx.xxx.xxx"? Thanks for your help.


    Regards,
    Johnny
     
    Johnny Lee, Sep 23, 2005
    #1

  2. Johnny Lee wrote:

    > I've run into a problem matching a regular expression in Python. I hope
    > one of you can help me. Here are the details:
    >
    > I have many tags like this:
    > xxx<a href="http://xxx.xxx.xxx" xxx>xxx
    > xxx<a href="wap://xxx.xxx.xxx" xxx>xxx
    > xxx<a href="http://xxx.xxx.xxx" xxx>xxx
    > .....
    > I want to extract all of the "http://xxx.xxx.xxx" parts, so I do it
    > like this:
    > httpPat = re.compile("(<a )(href=\")(http://.*)(\")")
    > result = httpPat.findall(data)
    > I use this to observe my output:
    > for i in result:
    >     print i[2]
    > Surprisingly I will get some output like this:
    > http://xxx.xxx.xxx">xxx</a>xxx
    > In fact it's filtered from this kind of source:
    > <a href="http://xxx.xxx.xxx">xxx</a>xxx"
    > But some results are right. I wonder how I can get all of the answers
    > clean, like "http://xxx.xxx.xxx"? Thanks for your help.


    ".*" gives the longest possible match (you can think of it as searching
    backwards from the right end). if you want to match "everything until a
    given character", "[^x]*x" is often a better choice than ".*x".
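    to see the difference concretely, here is a small sketch (the sample
    text is made up to mirror the kind of source in the post):

```python
import re

# made-up sample of the kind of source described in the post
text = '<a href="http://xxx.xxx.xxx">xxx</a>xxx"'

# greedy: .* runs all the way to the LAST quote in the string
greedy = re.findall(r'href="(.*)"', text)
# greedy == ['http://xxx.xxx.xxx">xxx</a>xxx']

# negated class: [^"]* cannot cross a quote, so it stops at the first one
clean = re.findall(r'href="([^"]*)"', text)
# clean == ['http://xxx.xxx.xxx']
```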

    in this case, I suggest using something like

    print re.findall("href=\"([^\"]+)\"", text)

    or, if you're going to parse HTML pages from many different sources, a
    real parser:

    from HTMLParser import HTMLParser

    class MyHTMLParser(HTMLParser):

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for key, value in attrs:
                    if key == "href":
                        print value

    p = MyHTMLParser()
    p.feed(text)
    p.close()

    see:

    http://docs.python.org/lib/module-HTMLParser.html
    http://docs.python.org/lib/htmlparser-example.html
    http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html

    </F>
     
    Fredrik Lundh, Sep 23, 2005
    #2

  3. Johnny Lee

    Fredrik Lundh wrote:
    > ".*" gives the longest possible match (you can think of it as searching
    > backwards from the right end). if you want to match "everything until a
    > given character", "[^x]*x" is often a better choice than ".*x".
    >
    > in this case, I suggest using something like
    >
    > print re.findall("href=\"([^\"]+)\"", text)
    >
    > or, if you're going to parse HTML pages from many different sources, a
    > real parser:
    >
    > from HTMLParser import HTMLParser
    >
    > class MyHTMLParser(HTMLParser):
    >
    >     def handle_starttag(self, tag, attrs):
    >         if tag == "a":
    >             for key, value in attrs:
    >                 if key == "href":
    >                     print value
    >
    > p = MyHTMLParser()
    > p.feed(text)
    > p.close()
    >
    > see:
    >
    > http://docs.python.org/lib/module-HTMLParser.html
    > http://docs.python.org/lib/htmlparser-example.html
    > http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html
    >
    > </F>


    Thanks for your help.
    I found another solution: simply adding a '?' after ".*" makes the
    match non-greedy, so it matches the minimal length that satisfies the
    regular expression.
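    A quick check of the non-greedy form (the sample line is invented for
    illustration):

```python
import re

# invented sample line of the kind described earlier in the thread
data = 'xxx<a href="http://xxx.xxx.xxx">xxx</a>xxx"'

# the original pattern, with ? added after .* to make it non-greedy:
# the capture now stops at the FIRST closing quote instead of the last
httpPat = re.compile('(<a )(href=")(http://.*?)(")')
result = httpPat.findall(data)
# result[0][2] == 'http://xxx.xxx.xxx'
```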
    As for HTMLParser, there is another problem (take my code for example):

    import htmllib
    import urllib
    import formatter

    parser = htmllib.HTMLParser(formatter.NullFormatter())
    parser.feed(urllib.urlopen(baseUrl).read())
    parser.close()
    for url in parser.anchorlist:
        if url[0:7] == "http://":
            print url

    When baseUrl = "http://www.nba.com", an HTMLParseError is raised
    because of the line "<! Copyright IBM Corporation, 2001, 2002 !>". I
    found that this line is inside <script> tags; maybe that's the cause?
     
    Johnny Lee, Sep 23, 2005
    #3
  4. John J. Lee

    "Fredrik Lundh" <> writes:
    [...]
    > or, if you're going to parse HTML pages from many different sources, a
    > real parser:
    >
    > from HTMLParser import HTMLParser
    >
    > class MyHTMLParser(HTMLParser):
    >
    >     def handle_starttag(self, tag, attrs):
    >         if tag == "a":
    >             for key, value in attrs:
    >                 if key == "href":
    >                     print value
    >
    > p = MyHTMLParser()
    > p.feed(text)
    > p.close()
    >
    > see:
    >
    > http://docs.python.org/lib/module-HTMLParser.html
    > http://docs.python.org/lib/htmlparser-example.html
    > http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html


    It's worth noting that module HTMLParser is less tolerant of the bad
    HTML you find in the real world than is module sgmllib, which has a
    similar interface. There are also third party libraries like
    BeautifulSoup and mxTidy that you may find useful for parsing "HTML as
    deployed" (i.e. often bad HTML).

    Also, htmllib is an extension to sgmllib, and will do your link
    parsing with even less effort:

    import htmllib, formatter, urllib2
    pp = htmllib.HTMLParser(formatter.NullFormatter())
    pp.feed(urllib2.urlopen("http://python.org/").read())
    print pp.anchorlist


    Module HTMLParser does have better support for XHTML, though.


    John
     
    John J. Lee, Sep 24, 2005
    #4
  5. John J. Lee

    "Johnny Lee" <> writes:

    > Fredrik Lundh wrote:

    [...]
    > To the HTMLParser, there is another problem (take my code for example):
    >
    > import htmllib
    > import urllib
    > import formatter
    >
    > parser = htmllib.HTMLParser(formatter.NullFormatter())
    > parser.feed(urllib.urlopen(baseUrl).read())
    > parser.close()
    > for url in parser.anchorlist:
    >     if url[0:7] == "http://":
    >         print url
    >
    > When baseUrl = "http://www.nba.com", an HTMLParseError is raised
    > because of the line "<! Copyright IBM Corporation, 2001, 2002 !>". I
    > found that this line is inside <script> tags; maybe that's the cause?


    No, it's because they're using a broken HTML comment (it should be
    "<!--comment-->"). BeautifulSoup is more tolerant:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    bs = BeautifulSoup(urllib2.urlopen('http://www.nba.com/').read())
    for el in bs.fetch('a'):
        print el['href']


    Or you could pre-process the HTML using mxTidy, and carry on using
    module htmllib.

    Hmm, are you the same Johnny Lee who contributed the MSIE cookie
    support to LWP?


    John
     
    John J. Lee, Sep 24, 2005
    #5
