htmltools incorrectly parsing HTML containing server-side tags

dsutch

I'm using HTML Tools 1.09 to parse HTML that contains tags that are to
be processed by the web server. For example, here's an image tag:

<img src="<$DCGallery$>Separators/gtabseps.gif" alt="this is a
seperator">

The <$DCGallery$> will be replaced by some text when returned to the
browser by the web server.

What I'm noticing is that HTML Tools doesn't handle tags that contain
such an embedded tag. It seems to attempt to correct what it sees as
invalid HTML. So the above tag, after going through the parser and
having a new class added, using:

element.add_attribute('class', ' wide_content')

results in the following tag:

<img src="<$DCGallery$>Separators/gtabseps.gif"
class="wide_content"><$DCGallery$>Separators/gtabseps.gif" alt="this is
a seperator">

The image tag is closed right after the new class attribute, and the
rest of the src value (from the server-side tag onward) is emitted
again after it, followed by the alt attribute from the original image
tag. Has anyone encountered such behavior?

I know that HTML Tools probably wasn't built to handle HTML with
embedded server-side tags, but for this project I need to process the
HTML before it is served by the web server. Shouldn't HTML Tools ignore
tags found within the quotes of the src attribute's value? Is there an
option or patch that might get HTML Tools to ignore tags found within
the values of tag attributes?
 
William James

I'm using HTML Tools 1.09 to parse HTML that contains tags that are to
be processed by the web server. For example, here's an image tag:

<img src="<$DCGallery$>Separators/gtabseps.gif" alt="this is a
seperator">

Is this valid html? From another thread:
 
sutch

William said:
Is this valid html?

Thank you for this information. I did a bit more research and now
believe that this is not valid HTML. Read on...
From another thread:

Unfortunately, escaping is not an option since the HTML files that are
being parsed are being output from another closed system.

The question is: can HTML Tools be told to ignore "<" and ">" inside of
attribute values? Or is there another HTML parser for Ruby that would
handle this?

Alternatively, is there some method for finding these characters within
attribute values, escaping them before parsing in Ruby, and then
un-escaping them after parsing (so that the server can still perform
the required processing of these PHP-like tags)?
 
William James

Perhaps this will work.

str = <<HERE
<html>
<!--
A comment can contain <,
I think.
-->
<img src="<$DCGallery$>Separators/gtabseps.gif"
alt="this is a separator">
</html>
HERE

# Split the HTML string into an array of strings. Each captured
# piece is an HTML comment or an HTML tag (up to, but not
# including, its closing ">"); everything else is plain text.
# Double-quoted attribute values may contain "<" and ">".

re = %r{ ( <!--.*?--> |
           < (?:
               [^<>"] +
             |
               " (?: \\. | [^\\"]+ ) * "
             ) * ) }xm

str.split( re ).each { |x|
  if "<" == x[0,1] && "<!" != x[0,2]
    # Any "<" after the tag's opening one must sit inside a quoted
    # value. Since > is o.k., change only <.
    x[1..-1] = x[1..-1].gsub( /</, "&lt;" )
  end

  print x
}
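
The un-escaping half of the question could be handled the same way. What follows is only a sketch of the full round trip, assuming that only double-quoted attribute values need protecting. The names PIECE, each_tag_body, escape_lt and restore_lt are made up for this example and are not part of HTML Tools, and the actual parsing step in the middle is left out. One caveat: restore_lt also turns any &lt; that was already inside a tag's attribute values back into <.

```ruby
# Hypothetical helpers (not part of HTML Tools).
# PIECE captures an HTML comment, or a tag up to (but not including)
# its closing ">". Double-quoted values may contain "<" and ">".
PIECE = %r{ ( <!--.*?--> | < (?: [^<>"]+ | "[^"]*" )* ) }xm

# Yields the text after the opening "<" of every tag and rebuilds the
# document from the block's return values. Comments, other "<!"
# constructs, and plain text pass through unchanged.
def each_tag_body(html)
  html.split(PIECE).map do |piece|
    if piece.start_with?("<") && !piece.start_with?("<!")
      "<" + yield(piece[1..-1])
    else
      piece
    end
  end.join
end

# Before parsing: hide every "<" that sits inside an attribute value.
def escape_lt(html)
  each_tag_body(html) { |body| body.gsub("<", "&lt;") }
end

# After parsing: put those "<" characters back for the web server.
def restore_lt(html)
  each_tag_body(html) { |body| body.gsub("&lt;", "<") }
end

src = '<img src="<$DCGallery$>Separators/gtabseps.gif" alt="this is a separator">'

escaped = escape_lt(src)
puts escaped
# prints: <img src="&lt;$DCGallery$>Separators/gtabseps.gif" alt="this is a separator">

# ... hand `escaped` to the parser here and serialize the result ...

puts restore_lt(escaped) == src
# prints: true
```

Text outside of tags is never touched, so an &lt; in ordinary page text survives both steps unchanged.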