Parsing html

Discussion in 'Python' started by C Gillespie, Jul 8, 2004.

  1. C Gillespie

    C Gillespie Guest

    Dear All,

    I have hopefully a very simple problem. I wish to parse an html page and
    extract everything between the <body> tags.

    E.g.
    <head>
    <body>
    <b>afsdf</b>
    </body>
    </head>

    Would give
    <body>
    <b>afsdf</b>
    </body>

    I've been playing about with htmllib with no success. Any suggestions?

    Thanks

    Colin
    C Gillespie, Jul 8, 2004
    #1

  2. C Gillespie

    William Park Guest

    C Gillespie <> wrote:
    > Dear All,
    >
    > I have hopefully a very simple problem. I wish to parse an html page and
    > extract everything between the <body> tags.
    >
    > E.g.
    > <head>
    > <body>
    > <b>afsdf</b>
    > </body>
    > </head>
    >
    > Would give
    > <body>
    > <b>afsdf</b>
    > </body>
    >
    > I've been playing about with htmllib with no success. Any suggestions?
    >
    > Thanks
    >
    > Colin


    1. Take a look at
    http://freshmeat.net/projects/bashdiff/
    and if you want to give it a try, I'll give you some pointers.
    Essentially,
    x=()
    array -p '<body>' -q '</body>' x "..."

    2. In Python, read the whole thing as a string. Delete everything before
    '<body>' and everything after '</body>'.

    3. Use your editor. :)

    --
    William Park, Open Geometry Consulting, <>
    Toronto, Ontario, Canada
    William Park, Jul 8, 2004
    #2

  3. C Gillespie wrote:
    > I have hopefully a very simple problem. I wish to parse an html page and
    > extract everything between the <body> tags.


    People are actually suggesting using DOM for this?! A simple approach is
    much better:

    def get_body(html):
        # Slice from the opening <body ...> tag through the closing </body>.
        body_start = html.find('<body')
        body_end = html.find('</body>', body_start) + 7
        return html[body_start:body_end]
    Leif K-Brooks, Jul 8, 2004
    #3
  4. C Gillespie

    Lee Harr Guest

    On 2004-07-08, C Gillespie <> wrote:
    > Dear All,
    >
    > I have hopefully a very simple problem. I wish to parse an html page and
    > extract everything between the <body> tags.
    >


    I have not used it yet,
    but I hear that Beautiful Soup
    works well:

    http://www.crummy.com/software/BeautifulSoup/
    Lee Harr, Jul 8, 2004
    #4
  5. C Gillespie

    wes weston Guest

    C Gillespie wrote:
    > Dear All,
    >
    > I have hopefully a very simple problem. I wish to parse an html page and
    > extract everything between the <body> tags.
    >
    > E.g.
    > <head>
    > <body>
    > <b>afsdf</b>
    > </body>
    > </head>
    >
    > Would give
    > <body>
    > <b>afsdf</b>
    > </body>
    >
    > I've been playing about with htmllib with no success. Any suggestions?
    >
    > Thanks
    >
    > Colin
    >
    >


    #--------------------------------------------------------------------------
    def TokenizeHTML( s ):
        #return a list containing two types of tokens:
        # 1. html tokens starting with '<' and ending with '>'
        # 2. strings between '>' and '<'
        state = 0
        htmlStr = ""
        text = ""      #renamed from 'str' to avoid shadowing the builtin
        tokens = []    #renamed from 'list' to avoid shadowing the builtin
        for ch in s:
            if state == 0: #initial state; detection state
                if ch == '<':
                    state = 1
                    htmlStr += ch
                else:
                    state = 2
                    text += ch
            elif state == 1: #html state; in a <> pair
                htmlStr += ch
                if ch == '>':
                    state = 0
                    tokens.append(htmlStr)
                    htmlStr = ""
            elif state == 2: #non html state; not in a <> pair
                if ch == '<':
                    state = 1
                    tokens.append(text)
                    text = ""
                    htmlStr = "<"
                else:
                    text += ch
        if len(text) > 0:
            tokens.append(text)
        return tokens
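
    The tokenizer above stops at the token list; picking out the body is a
    further step. A rough sketch of that step follows. It assumes a regular
    expression as a compact stand-in for the state machine (not the posted
    code), and the names tokenize_html and extract_body are hypothetical.
    Like the original, it has no notion of comments.

```python
import re


def tokenize_html(s):
    # Tags ('<...>') and the text runs between them, in document order.
    return re.findall(r"<[^>]*>|[^<]+", s)


def extract_body(html):
    """Join the tokens from the opening body tag through </body>."""
    tokens = tokenize_html(html)
    start = end = None
    for i, tok in enumerate(tokens):
        if start is None and re.match(r"<body[\s>]", tok, re.I):
            start = i
        elif start is not None and re.match(r"</body\s*>", tok, re.I):
            end = i
            break
    if start is None or end is None:
        return ""  # no complete body element found
    return "".join(tokens[start:end + 1])
```

    Matching the tags case-insensitively means <BODY> and <body> are treated
    alike, which a plain string search would not do.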
    wes weston, Jul 9, 2004
    #5

  6. > > I have hopefully a very simple problem. I wish to parse an html page and
    > > extract everything between the <body> tags.

    >
    > People are actually suggesting using DOM for this?! A simple approach is
    > much better:


    "For every complex problem, there is a solution that is simple ... and wrong"
    Yes, it will work, some of the time. However, it doesn't handle the following
    properly (there are probably others).

    1. Comments.
    2. CDATA sections.
    3. White space.
    4. Mixed or upper case.

    The advantage of using a proper parser is that it caters for these sorts of things,
    and you only have to get it right once. OTOH, these advantages are largely
    negated if you can't be sure your input HTML is valid. What works best for
    you depends on what you are using it for.
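
    A minimal sketch of the parser route, assuming the standard library's
    html.parser module from current Python (the thread itself only mentions
    htmllib and DOM). The names BodyLocator and get_body_parsed are
    hypothetical. The parser is used only to find where the body tags sit,
    and the original text is then sliced at those positions, so commented-out
    markup and upper-case tags are left to the parser instead of a string
    search.

```python
from html.parser import HTMLParser


class BodyLocator(HTMLParser):
    """Remember where the parser sees <body ...> and </body>."""

    def __init__(self):
        super().__init__()
        self.start = None  # (line, column) of the opening body tag
        self.end = None    # (line, column) of the closing body tag

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <BODY> and <Body> match as well.
        if tag == "body" and self.start is None:
            self.start = self.getpos()

    def handle_endtag(self, tag):
        if tag == "body" and self.start is not None:
            self.end = self.getpos()


def get_body_parsed(html):
    """Return the source text from the body start tag through </body>."""
    locator = BodyLocator()
    locator.feed(html)
    locator.close()
    if locator.start is None or locator.end is None:
        return ""

    # getpos() reports 1-based lines and 0-based columns, counting only
    # "\n" as a line break; turn those pairs into string offsets.
    line_starts = [0]
    for line in html.split("\n")[:-1]:
        line_starts.append(line_starts[-1] + len(line) + 1)

    def offset(pos):
        line, column = pos
        return line_starts[line - 1] + column

    end = html.find(">", offset(locator.end)) + 1
    return html[offset(locator.start):end]
```

    Whether this is worth the extra lines depends, as said, on how far the
    input can be trusted.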
    Richard Brodie, Jul 9, 2004
    #6
  7. C Gillespie

    C Gillespie Guest

    Dear All,

    Thanks for all the suggestions, much appreciated.

    Colin
    C Gillespie, Jul 9, 2004
    #7
  8. On Thu, 08 Jul 2004 17:04:24 +0100, C Gillespie wrote:

    > Dear All,
    >
    > I have hopefully a very simple problem. I wish to parse an html page and
    > extract everything between the <body> tags.
    >
    > E.g.
    > <head>
    > <body>
    > <b>afsdf</b>
    > </body>
    > </head>
    >
    > Would give
    > <body>
    > <b>afsdf</b>
    > </body>
    >
    > I've been playing about with htmllib with no success. Any suggestions?


    HTML can be broken in many ways. If you want
    a solution which can read most of the HTML on the
    web, you can use tidy and use XML as output.


    XML can be handled much more easily with SAX/DOM.
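
    Once the page is well-formed XML, the DOM step is short. A sketch using
    the standard library's xml.dom.minidom; body_from_xhtml is a hypothetical
    name, and the input is assumed to have been cleaned up already (tidy
    itself is not called here, and named entities such as &nbsp; are assumed
    absent, since the XML parser would reject them).

```python
from xml.dom import minidom


def body_from_xhtml(xhtml):
    """Return the serialized <body> element of a well-formed XHTML string."""
    doc = minidom.parseString(xhtml)
    bodies = doc.getElementsByTagName("body")
    # toxml() writes the element and its children back out as markup.
    return bodies[0].toxml() if bodies else ""
```

    Input that is not well-formed raises an exception from the XML parser,
    which is why the cleanup step has to come first.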

    Regards,
    Thomas

    --
    Thomas Güttler, http://www.thomas-guettler.de/
    Thomas Guettler, Jul 9, 2004
    #8
  9. Istvan Albert, Jul 9, 2004
    #9