How to extract a part of an HTML file

Discussion in 'Python' started by Joe, Oct 20, 2005.

  1. Joe

    Joe Guest

    I'm trying to extract part of the HTML code between two tags. The code begins
    with <span class="boldyellow"><B><U> and ends with
    TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>

    I was thinking of using a regular expression, but I'm having a hard time
    getting the desired string. I use

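    import urllib.request  # Python 3; in Python 2 this was urllib.urlopen

    htmlSource = urllib.request.urlopen("http://address/")
    s = htmlSource.read().decode("utf-8", errors="replace")  # bytes -> str; UTF-8 is an assumption
    htmlSource.close()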

    to get the HTML into a string. Now I want to match string s from the <span
    class tag to <img src="http://whatever/some.gif"> </TD></TR></TABLE> and
    store that in a new string.

    Thanks
     
    Joe, Oct 20, 2005
    #1

  2. Ben Finney

    Ben Finney Guest

    Joe wrote:
    > I'm trying to extract part of html code from a tag to a tag


    For tag soup, use BeautifulSoup:

    <URL:http://www.crummy.com/software/BeautifulSoup/>

    Available as a package in Debian, probably other decent OSen also.

    --
    \ "I think it would be a good idea." -- Mahatma Gandhi (when |
    `\ asked what he thought of Western civilization) |
    _o__) |
    Ben Finney
     
    Ben Finney, Oct 20, 2005
    #2

  3. Mike Meyer

    Mike Meyer Guest

    Ben Finney writes:

    > Joe wrote:
    >> I'm trying to extract part of html code from a tag to a tag

    > For tag soup, use BeautifulSoup:
    > <URL:http://www.crummy.com/software/BeautifulSoup/>


    Except he's trying to extract an apparently random part of the
    file. BeautifulSoup is a wonderful thing for dealing with X/HTML
    documents as structured documents, which is how you want to deal with
    them most of the time.

    In this case, an re works nicely:

    >>> import re
    >>> s = '<span class="boldyellow"><B><U> and ends with TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>'
    >>> r = re.match('<span class="boldyellow"><B><U>(.*)TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>', s)
    >>> r.group(1)
    ' and ends with '
    >>>
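
    The session above matches because the sample string starts with the
    opening marker and sits on one line. On a whole page the markers are
    usually somewhere in the middle and the text between them spans several
    lines, so a variant along these lines may be closer to what is needed.
    This is only a sketch: the function name and the sample page are made up
    for illustration, and the markers are copied from the question as posted.

```python
import re

# Markers copied from the question; adjust them to the real page.
START = '<span class="boldyellow"><B><U>'
END = 'TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>'


def extract_with_re(html, start=START, end=END):
    """Return the text between the first `start` and the next `end`, or None."""
    # re.escape keeps characters such as '.' in the markers literal.
    # (.*?) is non-greedy, so the match stops at the first END after START.
    # re.DOTALL lets '.' match newlines as well.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    found = re.search(pattern, html, re.DOTALL)
    return found.group(1) if found else None


# Hypothetical page: markers in the middle, text over two lines, END twice.
page = "<html><body>\n" + START + "first\nline" + END + "\n<p>junk</p>" + END
print(repr(extract_with_re(page)))
```

    re.search scans the whole string, whereas re.match only tries at the
    very beginning, which is why the original example needs the string to
    start with the <span> tag.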


    String.find also works really well:

    >>> start = s.find('<span class="boldyellow"><B><U>') + len('<span class="boldyellow"><B><U>')
    >>> stop = s.find('TD><TD> <img src="http://whatever/some.gif"> </TD></TR></TABLE>', start)
    >>> s[start:stop]
    ' and ends with '
    >>>
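
    One thing to watch with find: it returns -1 when the marker is missing,
    and the session above would then slice from the wrong place without
    complaining. A small wrapper can check for that. Again just a sketch;
    the function name and the sample string are hypothetical.

```python
def extract_with_find(html, start_marker, end_marker):
    """Return the text between the markers, or None if either is missing."""
    begin = html.find(start_marker)
    if begin == -1:
        return None  # opening marker not found
    begin += len(start_marker)  # skip past the opening marker itself
    stop = html.find(end_marker, begin)  # search only after the opening marker
    if stop == -1:
        return None  # closing marker not found
    return html[begin:stop]


# Hypothetical input, just to show the call.
sample = "junk <b>wanted</b> more junk"
print(extract_with_find(sample, "<b>", "</b>"))
```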


    Not a lot to choose between them.

    <mike
    --
    Mike Meyer http://www.mired.org/home/mwm/
    Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
     
    Mike Meyer, Oct 20, 2005
    #3
  4. Joe

    Joe Guest

    Thanks Mike, that is just what I was looking for. I have looked at
    BeautifulSoup, but it doesn't really do what I want it to do; maybe I'm
    just new to Python and don't know exactly what it is doing just yet.
    However, string find works. Thanks

     
    Joe, Oct 20, 2005
    #4