Problem with regular expressions

birgit

I have the following problem to solve with regular expressions:

An editor can enter data through a web front-end editor. Certain HTML tags
are allowed, others are not. When he saves the data, the content should
be checked using regular expressions.
- only chars of the ISO-8859-1 charset are allowed
- only <b>, </b>, <br />, <a ...> and </a> tags are allowed (no other
HTML Tags)

The HTML-Tags don't need to get deleted. The user only gets an error
message telling him to change his data according to the rules.

I know how to check for the charset:
[\u0000-\u00FF]*

But I don't know how to combine this with the other rules. I'd
appreciate any help.
 
Mark Jeffcoat

This is hard to get right, even for someone who's not worried
about the syntax of regular expressions. For example, the rules
you've given are very simple to write as a RE, but unfortunately
allow a user to insert arbitrary Javascript into the anchor tag.
Oops.


If you're doing this for real, don't. Find a third-party
implementation that you trust and that you can, through
a relaxed license or cash money, swipe.


If it's homework .. the easiest way to write a complicated
regular expression is to write simpler ones first. I'd start
by detecting pairs of (<,>), and build from there. Edi
Weitz's Regex Coach is pretty cool:
http://weitz.de/regex-coach/
 
Daniel Pitts

I would suggest using an existing XML parser to validate the tags. It's much
less error-prone than a regular expression, and more flexible when your
business rules change.
A SAX parser would probably be "good enough" in this case.
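
A minimal sketch of that idea, using the SAX parser that ships with the JDK. The class name `TagWhitelist`, the method `isValid` and the wrapper element `root` are made up for illustration. It assumes the editor's input is a well-formed XML fragment once wrapped, so it rejects things a browser would tolerate (an unclosed `<b>`, `<br>` without the slash, upper-case tags, or HTML entities such as `&nbsp;`). It does not look at attributes or at the character set.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class TagWhitelist {
    // Tags permitted by the original post.
    private static final Set<String> ALLOWED = Set.of("b", "br", "a");

    /** Returns true if the fragment is well-formed and uses only whitelisted tags. */
    static boolean isValid(String fragment) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Refuse DOCTYPE declarations so user input cannot define entities.
            factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);
            SAXParser parser = factory.newSAXParser();

            DefaultHandler handler = new DefaultHandler() {
                private boolean seenRoot = false;

                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts)
                        throws SAXException {
                    if (!seenRoot) {
                        // The first element is always our own wrapper.
                        seenRoot = true;
                        return;
                    }
                    if (!ALLOWED.contains(qName)) {
                        throw new SAXException("Tag not allowed: " + qName);
                    }
                }
            };

            // Wrap the fragment in a hypothetical root element so that
            // plain text and sibling tags form one XML document.
            String doc = "<root>" + fragment + "</root>";
            parser.parse(new InputSource(new StringReader(doc)), handler);
            return true;
        } catch (Exception e) {
            // Parse errors and whitelist violations both count as invalid.
            return false;
        }
    }
}
```

Calling `TagWhitelist.isValid("Hello <b>world</b><br />")` should accept the input, while `TagWhitelist.isValid("<i>x</i>")` should reject it. Changing the business rules then means editing the `ALLOWED` set rather than a pattern.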
 
Red Orchid

birgit said:
- only chars of the ISO-8859-1 charset are allowed
- only <b>, </b>, <br />, <a ...> and </a> tags are allowed (no other
HTML Tags)



Probably ...

<code>
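// Untested ..
// Needs: import java.util.regex.Matcher; import java.util.regex.Pattern;

//
// Precompile ..
//

// Any run of characters outside ISO-8859-1 (above U+00FF).
String NotISO_8859_1 = "[^\\u0000-\\u00FF]+";

// Anything that looks like an opening or closing tag.
String HTML_MARKUP = "</?\\w+[^>]*>";

// The whitelist. Pattern.COMMENTS makes the spaces around '|' insignificant.
// "<a" must be followed by whitespace or '>' so that tags such as
// <abbr> or <applet> are not mistaken for anchors.
String HTML_ALLOWED = "(<b>) | " +
                      "(</b>) | " +
                      "(<br\\s+/>) | " +
                      "(<a(\\s[^>]*)?>) | " +
                      "(</a>) ";

Pattern pNotISO_8859_1 = Pattern.compile(NotISO_8859_1);
Pattern pHTML_MARKUP = Pattern.compile(HTML_MARKUP);
Pattern pHTML_ALLOWED =
    Pattern.compile(HTML_ALLOWED,
                    Pattern.CASE_INSENSITIVE |
                    Pattern.COMMENTS);

//
// checking routine.
//

String src = ... // usr input

Matcher m;

m = pNotISO_8859_1.matcher(src);
if (m.find()) {
    // return Error.
}

m = pHTML_MARKUP.matcher(src);
while (m.find()) {
    // matches() rather than find(): the whole tag must be on the whitelist.
    Matcher ha_m = pHTML_ALLOWED.matcher(m.group());
    if (!ha_m.matches()) {
        // return Error.
    }
}

// return OK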
</code>
 
birgit

Thanks a lot for the answers so far.

I know that regular expressions are not optimal for checking HTML tags,
but in my case I would really like to use them anyway.
I am using OpenCms structured contents, and in the XML schema
definitions I can define validation rules as regular expressions and the
content gets checked automatically, which is very convenient. Usually an
editor should not be able to enter any HTML tags other than the ones I
provide through buttons, unless he copies and pastes something or I
allow him to view and edit the source.
So if I don't want to modify the source code of OpenCms, I need to use
regular expressions: if possible one which checks everything, or
maybe two or three which can be executed one after the other.
Unfortunately the last suggestion can't be used, although I am sure it
would work.

Does anyone have more suggestions?
 
Mark Jeffcoat

Red Orchid said:
Probably ...

// Untested ..

[snip code]

In some sense, this is technically excellent. I pasted your
code into a method object, replacing '//return Error' and
'//return OK' with 'return false' and 'return true', and
it worked exactly as specified, first time out of the box.

Not bad for untested. In fact, it makes a big improvement
over the original spec: a naive RE matcher for '<a ...>' would
have accepted strings like '<a></a><i>Anything</i> can go in
here as long as I finish with an angle bracket>', and your
pattern rejects that sort of thing nicely.

However, it also accepts this string as valid:

<a id="code" expr="alert('0wn3d.')"
style="background:url('javascript:eval(document.all.code.expr)')"></a>


This is not just a theoretical attack. When I put that "link"
on my own webpage, Safari and Firefox ignored the code, but
IE happily executed it.
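
The acceptance is easy to reproduce. Here is a small sketch with patterns modelled on the ones in Red Orchid's post (the class name `WhitelistDemo` and the method `accepts` are made up for illustration). The anchor pattern only checks that the tag starts with `<a` and ends with `>`, so any attribute at all gets through.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhitelistDemo {
    // Anything that looks like a tag.
    static final Pattern MARKUP = Pattern.compile("</?\\w+[^>]*>");

    // The allowed tags; note that <a ...> places no limit on its attributes.
    static final Pattern ALLOWED = Pattern.compile(
        "<b>|</b>|<br\\s+/>|<a(?:\\s[^>]*)?>|</a>",
        Pattern.CASE_INSENSITIVE);

    /** Returns true if every tag found in src is on the whitelist. */
    static boolean accepts(String src) {
        Matcher m = MARKUP.matcher(src);
        while (m.find()) {
            if (!ALLOWED.matcher(m.group()).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String attack = "<a id=\"code\" expr=\"alert('0wn3d.')\" "
            + "style=\"background:url('javascript:eval(document.all.code.expr)')\"></a>";
        System.out.println(accepts(attack));           // prints "true"
        System.out.println(accepts("<i>italic</i>"));  // prints "false"
    }
}
```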

Better add another RE to strip out "javascript", right?

Meditate on this story:
http://namb.la/popular/tech.html


It's possible that for this application, the right
cost/benefit trade-off is to leave the holes open, and
hope that nobody abuses them. Just be aware of what
you're doing.


(If I wanted to do this, I'd take a look at Slashcode; they're
solving exactly this problem in the comment submissions, and
their solution has been tested by years of attacks. It's
GPL'd (and in perl), so you can't just paste their solution in
directly, but if they have a solution you can re-implement in
Java as a set of RE tests, you'd likely have a winner.)
 
Red Orchid

birgit said:
[snip]
So if I don't want to modify the source code of OpenCms, I need to use
regular expressions: if possible one which checks everything, or
maybe two or three which can be executed one after the other.
Unfortunately the last suggestion can't be used, although I am sure it
would work.


Maybe it is possible to write one regex which checks everything
using conditionals and lookaround.

But, as far as I know,
the Java regex library does not support conditionals.
(I don't know the reason.)

If you think the regex is possible with conditionals,
it would be worth searching for a library that supports
them.
 
birgit

So far, I don't want to change any of the code which uses the regular
expression and so this is unfortunately no solution.

I am not really good with regular expressions! But I've got something
which works in most situations. However, if there is invalid HTML code at
the end of the text, it seems to run into an endless loop. Can anyone correct
this regular expression?

((<b>)??|(<\/b>)??|(<br\s+\/>)??|(<a[^>]*>)??|(<\/a>)??|([\u0000-\u003B\u003D\u003F-\u00FF]*)??)*?

I also know that with this expression no '>' and '<' are allowed in
normal text. Can I solve this somehow?

Thanks for your help!
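
A possible explanation and fix, offered as a sketch rather than a tested OpenCms rule. In the expression above every alternative is optional (`??`) and the text alternative is itself a `*` inside the outer `*?`, so the empty string can be matched in a huge number of ways. When the input ends in a tag that is not allowed, the engine tries all of those combinations before giving up, which looks like an endless loop. Making each alternative consume at least one character and using possessive quantifiers (`++`, `*+`) avoids that. The class name `SinglePattern` and the method `isValid` are made up for illustration, and the sketch assumes the validation applies the pattern to the whole value, as `Matcher.matches()` does.

```java
import java.util.regex.Pattern;

class SinglePattern {
    // One pattern for both rules: allowed tags, or runs of ISO-8859-1
    // characters other than '<' (U+003C) and '>' (U+003E).
    static final Pattern VALID = Pattern.compile(
        "(?:<b>|</b>|<br\\s+/>|<a(?:\\s[^<>]*)?>|</a>"
        + "|[\\u0000-\\u003B\\u003D\\u003F-\\u00FF]++)*+");

    /** Returns true if the whole string satisfies the pattern. */
    static boolean isValid(String s) {
        return VALID.matcher(s).matches();
    }
}
```

Outside a Java string literal, for example in a schema file, the same expression would be written with single backslashes. A literal '<' or '>' in normal text can still be entered as `&lt;` or `&gt;`, since those entities consist only of permitted characters. Like the earlier suggestions, this does not restrict the attributes of `<a ...>`.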
 
Daniel Pitts

Actually, I think the proper way to handle your problem is to HTML-escape
the whole thing, and then have "pseudo tags" similar to BBCode.
[b]bold[/b] [url=...]Link[/url]

That way, everything is safe for HTML, and you have control of what is
added.
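
A rough sketch of how that could look in Java. The pseudo-tag syntax (`[b]`, `[br]`, `[url=...]`), the class name `PseudoTags` and its methods are assumptions made for illustration, not an existing library. Everything is escaped first, and only then are the known pseudo tags turned into real HTML, with links limited to `http` and `https`.

```java
import java.util.regex.Pattern;

class PseudoTags {
    // Hypothetical pseudo-tag syntax: [b]...[/b], [br], [url=http://...]...[/url]
    private static final Pattern BOLD =
        Pattern.compile("\\[b\\](.*?)\\[/b\\]", Pattern.DOTALL);
    // Only http and https targets are converted, so javascript: URLs stay inert.
    private static final Pattern URL = Pattern.compile(
        "\\[url=(https?://[^\\]\\s\"'<>]+)\\](.*?)\\[/url\\]", Pattern.DOTALL);

    /** Escapes the five characters that are special in HTML text and attributes. */
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&#39;");
    }

    /** Escapes the input, then turns the known pseudo tags into HTML. */
    static String render(String input) {
        String out = escapeHtml(input);
        out = BOLD.matcher(out).replaceAll("<b>$1</b>");
        out = URL.matcher(out).replaceAll("<a href=\"$1\">$2</a>");
        out = out.replace("[br]", "<br />");
        return out;
    }
}
```

With this, `PseudoTags.render("[b]bold[/b] <i>x</i>")` should give `<b>bold</b> &lt;i&gt;x&lt;/i&gt;`: the pseudo tag becomes markup, and the pasted `<i>` is shown as text instead of being interpreted.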
 