REXML error reporting (XHTML validation)

  • Thread starter Dmitri Borodaenko

Dmitri Borodaenko

I've implemented a simple XHTML validation class based on REXML and
YAML, and it works like a charm except for invalid XML: when there is
something like a loose unescaped '<' character, it just raises
ParseException with no obvious reference to the guilty character. Is
it possible to get more useful info out of REXML, or should I use some
other XML validator?

Sanitize class (54 lines total):

http://savannah.nongnu.org/cgi-bin/viewcvs/samizdat/samizdat/samizdat/sanitize.rb?rev=1.99

YAML file with allowed XHTML tags and attributes:

http://savannah.nongnu.org/cgi-bin/viewcvs/samizdat/samizdat/xhtml.yaml?rev=1.99
 

James Britt

Dmitri said:
I've implemented a simple XHTML validation class based on REXML and
YAML, and it works like a charm except for invalid XML: when there is
something like a loose unescaped '<' character, it just raises
ParseException with no obvious reference to the guilty character. Is
it possible to get more useful info out of REXML, or should I use some
other XML validator?

This is quite nice. I'm poking around, looking to see how best to
recover validation error info.

Two comments:

I added logging to my copy so that I could see what was being clobbered
during sanitization. Might be worth including this by default.

I see that 'script' elements are deleted, as the yaml file makes no
mention of that element.
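
The logging mentioned in the first comment could look roughly like the sketch below. It is not the code from sanitize.rb: the helper name sanitize_with_log, the shape of the allowed hash (element name mapped to permitted attribute names), and the <root> wrapper are all assumptions made for illustration.

```ruby
require 'rexml/document'
require 'logger'

# Hypothetical helper, not part of Samizdat's Sanitize class.
# allowed: { 'p' => ['class'], ... } maps element names to permitted attributes.
def sanitize_with_log(fragment, allowed, logger = Logger.new($stderr))
  # Wrap the fragment so the parser always sees a single root element.
  doc = REXML::Document.new("<root>#{fragment}</root>")

  # Collect elements first so the tree is not modified while it is walked.
  elements = []
  doc.root.each_recursive { |el| elements << el }

  elements.each do |el|
    if allowed.key?(el.name)
      el.attributes.keys.each do |attr|
        next if allowed[el.name].include?(attr)
        logger.info("removed attribute #{attr} from <#{el.name}>")
        el.delete_attribute(attr)
      end
    else
      logger.info("removed element <#{el.name}>")
      el.remove
    end
  end

  # Serialize everything under the artificial root.
  doc.root.children.map(&:to_s).join
end
```

Passing the logger in as a parameter keeps the decision about where the messages go (a file, stderr, nowhere) with the calling application.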


Thanks for the nice work,


James
 

Sean E. Russell

I've implemented a simple XHTML validation class based on REXML and
YAML, and it works like a charm except for invalid XML: when there is
something like a loose unescaped '<' character, it just raises
ParseException with no obvious reference to the guilty character. Is
it possible to get more useful info out of REXML, or should I use some
other XML validator?

REXML saves information about the state of the parse stream, including line
numbers and the reason for the exception. However, this is admittedly pretty
weak; the problem being, of course, what constitutes a "line". However,
REXML tries to be good about saving parse state, and some of this is captured
in the Source class and repeated in the ParseException. I'd recommend
looking at the Source class to see if any of the methods help you.

The main problem right now is that REXML uses '<' as the line separator, as it
is the only reasonable way to parse open-ended streams.

The short version is that all I can say is that I'm struggling with how to
improve the error reporting of REXML while maintaining reasonable efficiency,
and I haven't come up with a good solution yet.
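
As a rough illustration of pulling the saved parse state out of the exception, the sketch below rescues REXML::ParseException and keeps only the parts that could be shown to an end user. The helper name describe_parse_error is made up here, and the line and position values should be treated as hints at best, for the reasons given above.

```ruby
require 'rexml/document'

# Hypothetical helper: returns nil for well-formed input, otherwise a hash
# describing the failure in terms that can be shown to an end user.
def describe_parse_error(xml)
  REXML::Document.new(xml)
  nil
rescue REXML::ParseException => e
  {
    # First line of the message only; the rest is REXML's state dump.
    summary:  e.message.lines.first.to_s.strip,
    line:     e.line,     # may be nil or imprecise
    position: e.position  # may be nil or imprecise
  }
end

report = describe_parse_error('<p>a < b</p>')
puts "Invalid markup: #{report[:summary]}" if report
```

Keeping the raw message out of the user-facing text avoids showing a Ruby dump while still giving some indication of what went wrong.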

--
### SER
### Deutsch|Esperanto|Francaise|Linux|XML|Java|Ruby|Aikido
### http://www.germane-software.com/~ser jabber.com:ser ICQ:83578737
### GPG: http://www.germane-software.com/~ser/Security/ser_public.gpg

 

Dmitri Borodaenko

I added logging to my copy so that I could see what was being clobbered
during sanitization. Might be worth including this by default.

Err, I can't throw Ruby dumps on unsuspecting Wiki users: my problem
is not just to find the cause, but also to report it nicely.

I see that 'script' elements are deleted, as the yaml file makes no
mention of that element.

Right, that was on purpose.

Btw, I've noticed that this script doesn't completely filter out things like:

<IMG width="0" height="0" style="bac\kground:
ur\l(javascript:alert('boop'));" />

...although it cripples it a bit by escaping quotes. I don't want to
remove "style" attributes; is there any easy way around parsing CSS?
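
One way to sidestep a full CSS parser is to accept a "style" value only when every declaration fits a very narrow allowlist. The sketch below is an assumption-laden illustration, not a complete defense: the helper name safe_style?, the property list, and the permitted value characters are all arbitrary choices made here. Because backslashes, parentheses, and colons are not permitted in values, the example above is rejected outright.

```ruby
# Hypothetical allowlist; the properties listed are arbitrary examples.
SAFE_PROPERTIES = %w[
  color background-color font-weight font-style text-align text-decoration
].freeze

# Values may contain only word characters, whitespace, and # % . , -
# so escapes, url(...), and expression(...) never get through.
SAFE_VALUE = /\A[\w\s#%.,-]+\z/

def safe_style?(style)
  declarations = style.split(';').map(&:strip).reject(&:empty?)
  declarations.all? do |decl|
    prop, value = decl.split(':', 2)
    SAFE_PROPERTIES.include?(prop.to_s.strip.downcase) &&
      SAFE_VALUE.match?(value.to_s.strip)
  end
end

puts safe_style?('color: red; text-align: center')
puts safe_style?("bac\\kground:\nur\\l(javascript:alert('boop'));")
```

The trade-off is that plenty of harmless CSS is rejected too; the attribute would be dropped whenever the check fails.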
 

James Britt

Dmitri said:
Err, I can't throw Ruby dumps on unsuspecting Wiki users: my problem
is not just to find the cause, but also to report it nicely.

Right, that was on purpose.

Ah, I see. I thought of this as the start of a general-purpose lib that
might then be used by some more specific application.

A suggestion (motivated by self-interest): arrange for the code to allow
all proper XHTML by default, with the option of passing in a set of
elements and/or attributes that are disallowed at validation time.

For example, if you decide to disallow style or class attributes, you
could pass this information in when calling sanitize.

Perhaps sanitize could take an optional hash parameter:

sanitize(html, filter = {})

and disallowed elements/attributes could be specified as:

'script' => '',               # no script element at all
'img'    => 'usemap, height', # allow images, but
                              # no usemap or height attributes
'*'      => 'style, class'    # no class or style on any element

Just a thought; it's easy to make suggestions when you're not writing
the code ;)

This way, you need not keep editing the base yaml file when adjusting
what to sanitize.

James
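
A minimal sketch of the overlay suggested above, assuming the allowed set is held as a hash that maps each element name to an array of permitted attribute names. That shape, and the helper name apply_filter, are assumptions for illustration; the actual layout of xhtml.yaml may differ.

```ruby
# Hypothetical helper: returns a copy of `allowed` with the filter applied.
#   'tag' => ''       removes the element entirely
#   'tag' => 'a, b'   removes attributes a and b from that element
#   '*'   => 'a, b'   removes attributes a and b from every element
def apply_filter(allowed, filter = {})
  # Copy the attribute lists so the complete base table is never modified.
  result = allowed.transform_values(&:dup)

  filter.each do |tag, spec|
    attrs = spec.split(',').map(&:strip).reject(&:empty?)
    if tag == '*'
      result.each_value { |list| list.replace(list - attrs) }
    elsif attrs.empty?
      result.delete(tag)
    elsif result.key?(tag)
      result[tag] = result[tag] - attrs
    end
  end

  result
end

base = {
  'script' => ['src'],
  'img'    => %w[src usemap height style],
  'p'      => %w[class style]
}
p apply_filter(base, 'script' => '', 'img' => 'usemap, height', '*' => 'style, class')
```

Since the base hash is left untouched, dropping the filter restores the most tolerant behaviour.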
 

Dmitri Borodaenko

Ah, I see. I thought of this as the start of a general-purpose lib that
might then be used by some more specific application.

Well, it is going to be general-purpose -- if someone is going to use
it. As with Samizdat's RDF layer, I don't want to implement features
that no-one needs, and since I'm the only one currently using it, I do
only the stuff I need.

A suggestion (motivated by self-interest): arrange for the code to allow
all proper XHTML by default, with the option of passing in a set of
elements and/or attributes that are disallowed at validation time. (...)
This way, you need not keep editing the base yaml file when adjusting
what to sanitize.

I don't see the point: the reason I've put it all into a YAML file is
that it's easier to edit it there, rather than in your source code.
Or, if you want to do it programmatically, all you have to do is:

class Sanitize
  attr_reader :xhtml
end
...
Sanitize.instance.xhtml['_common'].delete('style')
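
For anyone who wants to try this pattern without the Samizdat sources at hand, here is a self-contained sketch. DemoSanitize is a stand-in invented for this example, and the inline YAML is made up; only the reopen-and-delete step mirrors the snippet above.

```ruby
require 'singleton'
require 'yaml'

# Stand-in for the real Sanitize class, which is not reproduced here.
class DemoSanitize
  include Singleton

  def initialize
    # The real class reads xhtml.yaml; this inline document is invented.
    @xhtml = YAML.safe_load(<<~DOC)
      _common: [class, style, title]
      img: [src, alt]
    DOC
  end
end

# Reopen the class to expose the table of allowed tags and attributes.
class DemoSanitize
  attr_reader :xhtml
end

# Adjust the allowed set at run time instead of editing the YAML file.
DemoSanitize.instance.xhtml['_common'].delete('style')
p DemoSanitize.instance.xhtml['_common']
```

Because the instance is a singleton, the change applies to every later caller in the same process.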
 

James Britt

Dmitri said:
...

I don't see the point: the reason I've put it all into a YAML file is
that it's easier to edit it there, rather than in your source code.


I'm thinking that editing that YAML file directly means that as you add
or remove things you have to check that you're dealing with all the
valid XHTML items you might want. Once you delete something, how do you
know, sometime later, that it is an option that could be restored, other
than perhaps by looking through a DTD? Having a base that is always
complete, with edits overlaid, makes it easier to roll back to the most
tolerant sanitization.
 

Dmitri Borodaenko

I'm thinking that editing that YAML file directly means that as you add
or remove things you have to check that you're dealing with all the
valid XHTML items you might want. Once you delete something, how do you
know, sometime later, that it is an option that could be restored, other
than perhaps by looking through a DTD? Having a base that is always
complete, with edits overlaid, makes it easier to roll back to the most
tolerant sanitization.

How about keeping different versions of the YAML file? And once again,
you don't have to overload the API for something you can do directly, the
way I've shown.
 
