RE: Parsing HTML - modify URLs

Discussion in 'Python' started by Robert Brewer, Jul 7, 2004.

  1. Fuzzyman wrote:
    > I am trying to parse an HTML page and only modify URLs within tags -
    > e.g. inside IMG, A, SCRIPT, FRAME tags etc...
    >
    > I have built one that works fine using the HTMLParser.HTMLParser and
    > it works fine.... on good HTML. Having done a google it looks like
    > parsing dodgy HTML and having HTMLParser choke is a common theme.


    Haven't used it, but Beautiful Soup sounds like it fits the bill:

    http://www.crummy.com/software/BeautifulSoup/


    FuManChu
     
    Robert Brewer, Jul 7, 2004
    #1

  2. Robert Brewer

    Fuzzyman Guest

    "Robert Brewer" <> wrote in message news:<>...
    > Fuzzyman wrote:
    > > I am trying to parse an HTML page and only modify URLs within tags -
    > > e.g. inside IMG, A, SCRIPT, FRAME tags etc...
    > >
    > > I have built one that works fine using the HTMLParser.HTMLParser and
    > > it works fine.... on good HTML. Having done a google it looks like
    > > parsing dodgy HTML and having HTMLParser choke is a common theme.

    >
    > Haven't used it, but Beautiful Soup sounds like it fits the bill:
    >
    > http://www.crummy.com/software/BeautifulSoup/


    It talks about 'walking the parse tree'... which is a bit more magic
    than I want... I just want to modify URLs in tags... which means I
    mainly want to extract the HTML unchanged and also modify a few tags -
    HTMLParser is quite good at this - but dies *horribly* at bad HTML... I
    may have to try beautiful soup though :)
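
For illustration, here is a minimal sketch of that pass-through idea in present-day Python 3, where the parser lives in html.parser and tolerates malformed markup. It is not the code from the original post: the names URLRewriter, URL_ATTRS and rewrite_html are made up for this example, and rewrite_url is whatever callable you supply. Tags with nothing to rewrite are echoed verbatim via get_starttag_text(); rewritten tags are rebuilt, so their quoting and case may come out normalised.

```python
from html import escape
from html.parser import HTMLParser

# Which attribute carries the URL for each tag (an illustrative subset).
URL_ATTRS = {"a": "href", "img": "src", "script": "src", "frame": "src"}


class URLRewriter(HTMLParser):
    """Echo the input markup, rewriting only the URL attributes listed above."""

    def __init__(self, rewrite_url):
        # convert_charrefs=False keeps entities such as &amp; as separate
        # events, so they can be written back out as entities.
        super().__init__(convert_charrefs=False)
        self.rewrite_url = rewrite_url
        self.out = []

    def _emit_start(self, tag, attrs, closer):
        url_attr = URL_ATTRS.get(tag)
        if not any(name == url_attr and value for name, value in attrs):
            # Nothing to rewrite: emit the tag exactly as it appeared.
            self.out.append(self.get_starttag_text())
            return
        parts = [tag]
        for name, value in attrs:
            if value is None:
                parts.append(name)  # valueless attribute, e.g. "ismap"
            else:
                if name == url_attr:
                    value = self.rewrite_url(value)
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        self._emit_start(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs, " />")

    def handle_endtag(self, tag):
        # The parser lower-cases tag names, so end tags come out lower-case.
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def rewrite_html(source, rewrite_url):
    """Return source with every listed URL attribute passed through rewrite_url."""
    parser = URLRewriter(rewrite_url)
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

Calling rewrite_html(page_text, some_function) returns the page as a string. Stray end tags such as an unmatched </span> simply pass through, because nothing here tracks nesting.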

    Regards,



    Fuzzy

    http://www.voidspace.org.uk/atlantibots/pythonutils.html

    >
    >
    > FuManChu
     
    Fuzzyman, Jul 7, 2004
    #2

  3. Robert Brewer

    John J. Lee Guest

    (Fuzzyman) writes:

    > "Robert Brewer" <> wrote in message news:<>...
    > > Fuzzyman wrote:
    > > > I am trying to parse an HTML page and only modify URLs within tags -
    > > > e.g. inside IMG, A, SCRIPT, FRAME tags etc...
    > > >
    > > > I have built one that works fine using the HTMLParser.HTMLParser and
    > > > it works fine.... on good HTML. Having done a google it looks like
    > > > parsing dodgy HTML and having HTMLParser choke is a common theme.


    Use sgmllib instead (or htmllib, which adds a few bits and bobs on top
    of sgmllib). sgmllib.SGMLParser (and htmllib.HTMLParser) is more
    robust than HTMLParser.HTMLParser. OTOH, HTMLParser.HTMLParser is
    more suitable for XHTML.

    I remember that sorting out the precise differences between the two
    libraries (htmllib and HTMLParser) was mildly painful and confusing,
    so you might find it useful to look at ClientForm as an example,
    because it can use both htmllib and HTMLParser modules.


    > > Haven't used it, but Beautiful Soup sounds like it fits the bill:
    > >
    > > http://www.crummy.com/software/BeautifulSoup/

    >
    > It talks about 'walking the parse tree'... which is a bit more magic
    > than I want... I just want to modify URLs in tags... which means I
    > mainly want to extract the HTML unchanged and also modify a few tags -
    > HTMLParser is quite good at this - but dies *horribly* at bad HTML... I
    > may have to try beautiful soup though :)


    In general, Murphy has more shots at anything that both parses *and*
    builds a tree, so sticking to just a parser (e.g. sgmllib) is
    advantageous in that respect. However, microdom is a tree-building
    library that claims to be relatively tolerant of bad HTML.
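
To illustrate the parser-only point, here is a small sketch using Python 3's html.parser. That module is a stand-in chosen for this example, not sgmllib, which was removed in Python 3. EventLogger and tag_events are hypothetical names. The parser reports tags in the order it meets them and never checks that they nest, so unbalanced markup is just another sequence of events.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record start/end tag events without building any tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(markup):
    """Return the list of (kind, tag) events seen while parsing markup."""
    logger = EventLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


# Misnested and unclosed tags are reported as-is; nothing tries to repair them.
events = tag_events("<p><span>one</p></span></span><b>two")
```

A tree builder given the same input has to decide where each element ends, which is where the extra failure modes come from.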


    John
     
    John J. Lee, Jul 7, 2004
    #3
  4. Robert Brewer

    richard Guest

    > (Fuzzyman) writes:
    >> "Robert Brewer" <> wrote in message
    >> news:<>...
    >> > Haven't used it, but Beautiful Soup sounds like it fits the bill:
    >> >
    >> > http://www.crummy.com/software/BeautifulSoup/

    >>
    >> It talks about 'walking the parse tree'... which is a bit more magic
    >> than I want... I just want to modify URLs in tags... which means I
    >> mainly want to extract the HTML unchanged and also modify a few tags -
    >> HTMLParser is quite good at this - but dies *horribly* at bad HTML... I
    >> may have to try beautiful soup though :)


    From the BeautifulSoup page:

    "You can modify a Tag or NavigableText in place. Printing it out as a
    string will print the new markup text."

    And really, it handles *any* HTML, no matter how crappy - I'm using it to
    deal with pages that have random <span> and </span> in them with no
    matching end / start tags. Eugh.

    Once you've written rewrite_url(), this will do the job on the BeautifulSoup
    side:

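    from BeautifulSoup import BeautifulSoup

    soup = BeautifulSoup()
    soup.feed(source_html)
    # Rewrite the URL-bearing attribute of each tag type.
    for tag_name, attr in (('img', 'src'), ('a', 'href')):
        for tag in soup(tag_name):
            if tag.get(attr):
                tag[attr] = rewrite_url(tag[attr])
    print soup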
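
The thread never shows rewrite_url() itself. As a rough sketch of what it might look like for a CGI proxy of the kind mentioned later in the thread, assuming Python 3: make_rewriter, PROXY_ENDPOINT and the url query parameter are all hypothetical, chosen only for illustration. The idea is to resolve each link against the URL of the page it came from, then wrap the absolute URL in a request back to the proxy.

```python
from urllib.parse import quote, urljoin, urlsplit

# Hypothetical location of the proxy script; adjust to your own setup.
PROXY_ENDPOINT = "/cgi-bin/proxy.py"


def make_rewriter(page_url, proxy_endpoint=PROXY_ENDPOINT):
    """Build a one-argument rewrite_url() for links found on page_url."""

    def rewrite_url(url):
        url = url.strip()
        # Fragment-only links stay within the current page.
        if url.startswith("#"):
            return url
        # Resolve relative links against the page they appeared on.
        absolute = urljoin(page_url, url)
        # Leave mailto:, javascript: and other non-HTTP schemes alone.
        if urlsplit(absolute).scheme not in ("http", "https"):
            return url
        # Percent-encode the whole target so it survives as a query value.
        return "%s?url=%s" % (proxy_endpoint, quote(absolute, safe=""))

    return rewrite_url
```

With rewrite_url = make_rewriter("http://example.com/dir/page.html"), the returned function takes a single URL string, which matches how the loop above calls it.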


    Richard
     
    richard, Jul 8, 2004
    #4
  5. Robert Brewer

    Fuzzyman Guest

    richard <> wrote in message news:<40ec817a$0$25460$>...
    > > (Fuzzyman) writes:
    > >> "Robert Brewer" <> wrote in message
    > >> news:<>...
    > >> > Haven't used it, but Beautiful Soup sounds like it fits the bill:
    > >> >
    > >> > http://www.crummy.com/software/BeautifulSoup/
    > >>
    > >> It talks about 'walking the parse tree'... which is a bit more magic
    > >> than I want... I just want to modify URLs in tags... which means I
    > >> mainly want to extract the HTML unchanged and also modify a few tags -
    > >> HTMLParser is quite good at this - but dies *horribly* at bad HTML... I
    > >> may have to try beautiful soup though :)

    >
    > From the BeautifulSoup page:
    >
    > "You can modify a Tag or NavigableText in place. Printing it out as a
    > string will print the new markup text."
    >
    > And really, it handles *any* HTML, no matter how crappy - I'm using it to
    > deal with pages that have random <span> and </span> in them with no
    > matching end / start tags. Eugh.
    >
    > Once you've written rewrite_url(), this will do the job on the BeautifulSoup
    > side:
    >
    > soup = BeautifulSoup()
    > soup.feed(source_html)
    > for tag, attr in (('img', 'src'), ('a', 'href')):
    >     for tag in soup(tag):
    >         if tag.get(attr):
    >             tag[attr] = rewrite_url(tag[attr])
    > print soup
    >
    >
    > Richard


    Brilliant Richard.
    I did hack together a version that worked inside the Tag class of
    BeautifulSoup - but your suggestion is much more elegant. I've already
    written rewrite_url - twice now :) Should work fine........

    Thanks

    Fuzzy

    http://www.voidspace.org.uk/atlantibots/pythonutils.html
     
    Fuzzyman, Jul 8, 2004
    #5
  6. Robert Brewer

    Fuzzyman Guest

    richard <> wrote in message news:<40ec817a$0$25460$>...
    > > (Fuzzyman) writes:
    > >> "Robert Brewer" <> wrote in message
    > >> news:<>...
    > >> > Haven't used it, but Beautiful Soup sounds like it fits the bill:
    > >> >
    > >> > http://www.crummy.com/software/BeautifulSoup/
    > >>
    > >> It talks about 'walking the parse tree'... which is a bit more magic
    > >> than I want... I just want to modify URLs in tags... which means I
    > >> mainly want to extract the HTML unchanged and also modify a few tags -
    > >> HTMLParser is quite good at this - but dies *horribly* at bad HTML... I
    > >> may have to try beautiful soup though :)

    >
    > From the BeautifulSoup page:
    >
    > "You can modify a Tag or NavigableText in place. Printing it out as a
    > string will print the new markup text."
    >
    > And really, it handles *any* HTML, no matter how crappy - I'm using it to
    > deal with pages that have random <span> and </span> in them with no
    > matching end / start tags. Eugh.
    >
    > Once you've written rewrite_url(), this will do the job on the BeautifulSoup
    > side:
    >
    > soup = BeautifulSoup()
    > soup.feed(source_html)
    > for tag, attr in (('img', 'src'), ('a', 'href')):
    >     for tag in soup(tag):
    >         if tag.get(attr):
    >             tag[attr] = rewrite_url(tag[attr])
    > print soup
    >
    >
    > Richard


    Haha - just switched to BS and so far it works like a dream...
    building a CGI proxy for escaping restricted/censored internet
    environments...

    Thanks for the help.

    Regards,

    Fuzzy

    http://www.voidspace.org.uk/atlantibots/pythonutils.html
     
    Fuzzyman, Jul 8, 2004
    #6
