htmltools incorrectly parsing HTML containing server-side tags

Discussion in 'Ruby' started by dsutch@gmail.com, Jul 24, 2006.

  1. Guest

    I'm using HTML Tools 1.09 to parse HTML that contains tags that are to
    be processed by the web server. For example, here's an image tag:

    <img src="<$DCGallery$>Separators/gtabseps.gif" alt="this is a
    seperator">

    The <$DCGallery$> will be replaced by some text when returned to the
    browser by the web server.

    What I'm noticing is that HTML Tools doesn't handle tags that contain
    such an embedded tag. It seems to make an attempt at correcting what
    it sees as invalid HTML. So the above tag, after going through the
    parser and having a new class added, using:

    element.add_attribute('class', ' wide_content')

    results in the following tag:

    <img src="<$DCGallery$>Separators/gtabseps.gif"
    class="wide_content"><$DCGallery$>Separators/gtabseps.gif" alt="this is
    a seperator">

    The image tag is closed after the new class attribute and the
    server-side tag is duplicated and contains the alt attribute from the
    original image tag. Has anyone encountered such behavior?

    I know that HTML Tools probably wasn't built to handle HTML with
    embedded server-side tags, but for this project I need to process HTML
    before being served up by the web server. Shouldn't HTML Tools ignore
    tags found within the quotes of the src attribute's value? Is there an
    option or patch that might get HTML Tools to ignore tags found within
    the values of tag attributes?
    , Jul 24, 2006
    #1
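
For illustration only: the sketch below does not show HTML Tools' actual code, which this thread does not quote. It compares two hypothetical tag patterns (`NAIVE_TAG` and `QUOTE_AWARE_TAG`, names made up here) to show how a tokenizer that ends a tag at the first `>` would cut the img tag inside the src value, while a quote-aware one would keep it whole.

```ruby
html = '<img src="<$DCGallery$>Separators/gtabseps.gif" alt="this is a seperator">'

# Naive rule (assumption): a tag runs from "<" to the first ">" that follows.
NAIVE_TAG = /<[^>]*>/

# Quote-aware rule: a ">" inside a double-quoted value does not end the tag.
QUOTE_AWARE_TAG = /<(?:[^>"]|"[^"]*")*>/

naive       = html[NAIVE_TAG]        # stops inside the src value
quote_aware = html[QUOTE_AWARE_TAG]  # spans the whole img tag

puts naive
puts quote_aware
```

The naive match ends right after `<$DCGallery$>`, which is roughly where the broken output above closes the tag early.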

  2. wrote:
    > I'm using HTML Tools 1.09 to parse HTML that contains tags that are to
    > be processed by the web server. For example, here's an image tag:
    >
    > <img src="<$DCGallery$>Separators/gtabseps.gif" alt="this is a
    > seperator">


    Is this valid html? From another thread:


    > $ echo '<bar quux="foo>bar" />' | xmllint -
    > <?xml version="1.0"?>
    > <bar quux="foo&gt;bar"/>
    >
    > However, '<' needs to be escaped:
    >
    > $ echo '<bar quux="foo<bar" />' | xmllint -
    > -:1: parser error : Unescaped '<' not allowed in attributes values
    > <bar quux="foo<bar" />
    William James, Jul 24, 2006
    #2

  3. sutch Guest

    William James wrote:
    > wrote:
    > > I'm using HTML Tools 1.09 to parse HTML that contains tags that are to
    > > be processed by the web server. For example, here's an image tag:
    > >
    > > <img src="<$DCGallery$>Separators/gtabseps.gif" alt="this is a
    > > seperator">

    >
    > Is this valid html?


    Thank you for this information. I did a bit more research and now
    believe that this is not valid HTML. Read on...

    > From another thread:
    >
    > > $ echo '<bar quux="foo>bar" />' | xmllint -
    > > <?xml version="1.0"?>
    > > <bar quux="foo&gt;bar"/>
    > >
    > > However, '<' needs to be escaped:
    > >
    > > $ echo '<bar quux="foo<bar" />' | xmllint -
    > > -:1: parser error : Unescaped '<' not allowed in attributes values
    > > <bar quux="foo<bar" />


    Unfortunately, escaping is not an option since the HTML files that are
    being parsed are being output from another closed system.

    The question is: can HTML Tools be told to ignore "<" and ">" inside of
    attribute values? Or is there another HTML parser for Ruby that would
    handle this?

    Alternatively, is there some method for finding these characters within
    attribute values, escaping them before parsing by Ruby, and then
    un-escaping them after parsing (so that the server can perform the
    required processing of these PHP-like tags)?
    sutch, Jul 25, 2006
    #3
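
One way to approach the escape-then-unescape idea asked about above is to swap each server-side tag for an inert token before parsing and swap it back afterwards. This is a minimal sketch, not an HTML Tools feature: `protect_server_tags`, `restore_server_tags`, and the `__SERVER_TAG_n__` token are hypothetical names, and it assumes every server-side tag has the `<$...$>` form, that the token text never occurs in the real HTML, and that the parser passes the token through unchanged.

```ruby
# Hypothetical helpers (not part of HTML Tools).
# Assumption: every server-side tag looks like <$...$>.
SERVER_TAG = /<\$.*?\$>/m

# Replace each server-side tag with a numbered token and remember the tag.
def protect_server_tags(html)
  saved = []
  protected_html = html.gsub(SERVER_TAG) do |tag|
    saved << tag
    "__SERVER_TAG_#{saved.size - 1}__"
  end
  [protected_html, saved]
end

# Put the remembered tags back in place of their tokens.
def restore_server_tags(html, saved)
  html.gsub(/__SERVER_TAG_(\d+)__/) { saved[$1.to_i] }
end

original = '<img src="<$DCGallery$>Separators/gtabseps.gif" alt="this is a seperator">'
safe, saved = protect_server_tags(original)

# `safe` has no "<" or ">" inside the attribute value, so it could be handed
# to the parser. The string edit below only stands in for that parsing step.
edited = safe.sub(/>\z/, ' class="wide_content">')

puts restore_server_tags(edited, saved)
```

Because the original tags are stored rather than entity-escaped, restoring them cannot be confused with any `&lt;` that was already in the document.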
  4. sutch wrote:
    > William James wrote:
    > > wrote:
    > > > I'm using HTML Tools 1.09 to parse HTML that contains tags that are to
    > > > be processed by the web server. For example, here's an image tag:
    > > >
    > > > <img src="<$DCGallery$>Separators/gtabseps.gif" alt="this is a
    > > > seperator">

    > >
    > > Is this valid html?

    >
    > Thank you for this information. I did a bit more research and now
    > believe that this is not valid HTML. Read on...
    >
    > > From another thread:
    > >
    > > > $ echo '<bar quux="foo>bar" />' | xmllint -
    > > > <?xml version="1.0"?>
    > > > <bar quux="foo&gt;bar"/>
    > > >
    > > > However, '<' needs to be escaped:
    > > >
    > > > $ echo '<bar quux="foo<bar" />' | xmllint -
    > > > -:1: parser error : Unescaped '<' not allowed in attributes values
    > > > <bar quux="foo<bar" />

    >
    > Unfortunately, escaping is not an option since the HTML files that are
    > being parsed are being output from another closed system.
    >
    > The question is: can HTML Tools be told to ignore "<" and ">" inside of
    > attribute values? Or is there another HTML parser for Ruby that would
    > handle this?
    >
    > Alternatively, is there some method for finding these characters within
    > attribute values and escaping them before parsing by Ruby and then
    > un-escaping them after parsing (so that the server can perform the
    > required processing of these PHP-like tags).


    Perhaps this will work.

    str = <<HERE
    <html>
    <!--
    A comment can contain <,
    I think.
    -->
    <img src="<$DCGallery$>Separators/gtabseps.gif"
    alt="this is a separator">
    </html>
    HERE

    # We will split the html string into an array of strings.
    # Each member of the array will be an html comment, an
    # html tag, or plain text.

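    re = %r{ ( <!--.*?-->                        # an html comment
             | <                                 # or a tag:
                 (?: [^<>"]+                     #   text outside quotes
                   | " (?: \\. | [^\\"]+ )* "    #   a double-quoted value
                 )*
               >
             ) }xm

    str.split( re ).each { |x|
      # A tag starts with "<" but not with "<!", which marks a comment.
      if "<" == x[0,1] && "<!" != x[0,2]
        # Since > is o.k., change only <.
        x[1..-2] = x[1..-2].gsub( /</, "&lt;" )
      end

      print x
    }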
    William James, Jul 25, 2006
    #4
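
The snippet in the post above only escapes. The round trip that sutch asked about could be sketched as follows, reusing the same pattern. The helper names (`map_tag_bodies`, `escape_inner_lt`, `unescape_inner_lt`) are made up for this example, and the sketch assumes the parser writes attribute values back out unchanged. One caveat: the reverse step also turns any `&lt;` that was already inside a tag back into `<`.

```ruby
# Same comment/tag pattern as in the post above.
TAG_OR_COMMENT = %r{ ( <!--.*?--> | < (?: [^<>"]+ | " (?: \\. | [^\\"]+ )* " )* > ) }xm

# Hypothetical helper: yields the inside of every tag to the block and
# rebuilds the document. Comments and plain text pass through untouched.
def map_tag_bodies(html)
  html.split(TAG_OR_COMMENT).map { |piece|
    if piece.start_with?("<") && !piece.start_with?("<!") && piece.end_with?(">")
      "<" + yield(piece[1..-2]) + ">"
    else
      piece
    end
  }.join
end

# Before parsing: escape "<" found inside tags.
def escape_inner_lt(html)
  map_tag_bodies(html) { |body| body.gsub("<", "&lt;") }
end

# After parsing: turn "&lt;" inside tags back into "<".
def unescape_inner_lt(html)
  map_tag_bodies(html) { |body| body.gsub("&lt;", "<") }
end

sample  = '<!-- a < b --><img src="<$DCGallery$>Separators/gtabseps.gif" alt="x">text'
escaped = escape_inner_lt(sample)
puts escaped
puts unescape_inner_lt(escaped)
```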