Problem with regular expressions

birgit · Nov 16, 2006

I have the following problem to solve with regular expressions:

An editor can enter data through a web front-end editor. Certain HTML tags
are allowed, others are not. When he saves the data, the content should
be checked using regular expressions.
- only chars of the ISO-8859-1 charset are allowed
- only <b>, </b>, <br />, <a ...> and </a> tags are allowed (no other
HTML tags)

The HTML-Tags don't need to get deleted. The user only gets an error
message telling him to change his data according to the rules.

I know how to check for the charset:
[\u0000-\u00FF]*

But I don't know how to combine this with the other rules. I'd
appreciate any help.

Mark Jeffcoat · Nov 16, 2006

This is hard to get right, even for someone who's not worried
about the syntax of regular expressions. For example, the rules
you've given are very simple to write as a RE, but unfortunately
allow a user to insert arbitrary Javascript into the anchor tag.
Oops.

If you're doing this for real, don't. Find a third-party
implementation that you trust and that you can, through
a relaxed license or cash money, swipe.

If it's homework .. the easiest way to write a complicated
regular expression is to write simpler ones first. I'd start
by detecting pairs of (<,>), and build from there. Edi
Weitz's Regex Coach is pretty cool:
http://weitz.de/regex-coach/
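
As a rough illustration of the "simpler ones first" approach, the sketch below only finds substrings that look like a tag: a `<`, then anything that is not an angle bracket, then a `>`. The class and method names are hypothetical, and the pattern makes no attempt yet to decide which tags are allowed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagFinder {
    // A '<', then any run of characters that are not angle brackets, then '>'.
    private static final Pattern TAG_LIKE = Pattern.compile("<[^<>]*>");

    /** Returns every substring that looks like a tag, in order of appearance. */
    static List<String> findTagLike(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = TAG_LIKE.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findTagLike("a <b>bold</b> word and a <br /> break"));
    }
}
```

Once this step works, each found substring can be compared against a second, stricter pattern for the allowed tags.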

Daniel Pitts · Nov 16, 2006

I would suggest using an existing XML parser to validate tags. It's much
less error-prone than RE, and more flexible when your business rules
change.
A SAX parser would probably be "good enough" in this case.
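
A minimal sketch of the SAX idea, using the parser that ships with the JDK. The class name, the `root` wrapper element and the allowed-tag set are assumptions made for illustration. Note that this treats the input as XML, so tag names are case-sensitive and input that is not well-formed (an unclosed tag, a bare `&`, an undefined entity such as `&nbsp;`) is rejected as well.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxTagCheck {
    private static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("b", "br", "a"));

    static boolean isValid(String fragment) {
        // Hypothetical wrapper element so the fragment parses as one document.
        String doc = "<root>" + fragment + "</root>";
        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs)
                    throws SAXException {
                // Depth 0 is the wrapper itself; everything below must be allowed.
                if (depth > 0 && !ALLOWED.contains(qName)) {
                    throw new SAXException("Tag not allowed: " + qName);
                }
                depth++;
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                depth--;
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(doc)), handler);
            return true;
        } catch (Exception e) {
            // Either a disallowed tag or input that is not well-formed XML.
            return false;
        }
    }
}
```

Changing the business rules then means editing the `ALLOWED` set rather than a pattern. This sketch does not look at attributes at all.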

Red Orchid · Nov 16, 2006

birgit said:
- only chars of the ISO-8859-1 charset are allowed
- only <b>, </b>, <br />, <a ...> and </a> tags are allowed (no other
HTML tags)

Probably ...

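<code>
// Untested ..

//
// Precompile ..
//

// Matches any run of characters outside ISO-8859-1.
String NotISO_8859_1 = "[^\u0000-\u00FF]+";

// Matches anything that looks like an opening or closing tag.
String HTML_MARKUP = "</?\\w+[^>]*>";

// The spaces around '|' are ignored because of Pattern.COMMENTS.
// "<a" must be followed by whitespace or '>' so that tags such as
// <abbr> or <applet> are not mistaken for an anchor.
String HTML_ALLOWED = "(<b>) | " +
                      "(</b>) | " +
                      "(<br\\s+/>) | " +
                      "(<a(\\s[^>]*)?>) | " +
                      "(</a>) ";

Pattern pNotISO_8859_1 = Pattern.compile(NotISO_8859_1);

Pattern pHTML_MARKUP = Pattern.compile(HTML_MARKUP);

Pattern pHTML_ALLOWED =
    Pattern.compile(HTML_ALLOWED,
                    Pattern.CASE_INSENSITIVE |
                    Pattern.COMMENTS);

//
// checking routine.
//

String src = ... // usr input

Matcher m;

// Reject input containing any character outside ISO-8859-1.
m = pNotISO_8859_1.matcher(src);

if (m.find()) {

    // return Error.
}

// Check every tag-like substring against the allowed list.
m = pHTML_MARKUP.matcher(src);

while (m.find()) {

    Matcher ha_m = pHTML_ALLOWED.matcher(m.group());

    // matches() requires the whole tag to be an allowed one.
    if (!ha_m.matches()) {

        // return Error.
    }
}

// return OK

</code>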

birgit · Nov 17, 2006

Thanks a lot for the answers so far.

I know that regular expressions are not optimal for checking HTML tags,
but in my case I would really like to use them anyway.
I am using OpenCms structured contents, and in the XML schema
definitions I can define validation rules as regular expressions and the
content gets checked automatically, which is very convenient. Usually an
editor should not be able to enter any HTML tags other than the ones I
provide through buttons, except if he copies and pastes something or if I
allow him to view and edit the source.
So if I don't want to modify the source code of OpenCms, I need to use
regular expressions: if possible one which checks everything, or
maybe two or three which can be executed one after the other.
Unfortunately the last suggestion can't be used, although I am sure it
would work.

Does anyone have more suggestions?

Mark Jeffcoat · Nov 17, 2006

Red Orchid said:
Probably ...

// Untested ..

[snip code]

In some sense, this is technically excellent. I pasted your
code into a method object, replacing '//return Error' and
'//return OK' with 'return false' and 'return true', and
it worked exactly as specified, first time out of the box.

Not bad for untested. In fact, it makes a big improvement
over the original spec: a naive RE matcher for '<a ...>' would
have accepted strings like '<a></a>Anything can go in
here as long as I finish with an angle bracket>', and your
pattern rejects that sort of thing nicely.

However, it also accepts this string as valid:

<a id="code" expr="alert('0wn3d.')"
style="background:url('javascript:eval(document.all.code.expr)')"></a>

This is not just a theoretical attack. When I put that "link"
on my own webpage, Safari and Firefox ignored the code, but
IE happily executed it.

Better add another RE to strip out "javascript", right?

Meditate on this story:
http://namb.la/popular/tech.html

It's possible that for this application, the right
cost/benefit trade-off is to leave the holes open, and
hope that nobody abuses them. Just be aware of what
you're doing.

(If I wanted to do this, I'd take a look at Slashcode; they're
solving exactly this problem in the comment submissions, and
their solution has been tested by years of attacks. It's
GPL'd (and in Perl), so you can't just paste their solution in
directly, but if they have a solution you can re-implement in
Java as a set of RE tests, you'd likely have a winner.)
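
One way to narrow the hole described above is to stop accepting arbitrary attributes on the anchor. The sketch below is an assumption about what a stricter rule could look like, not a vetted filter: it accepts an opening `<a>` tag only if it carries exactly one attribute, a double-quoted `href` whose value starts with `http://` or `https://`. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class StrictAnchor {
    // Exactly one attribute, href, whose value must start with http:// or https://
    // and may not contain quotes, angle brackets or whitespace.
    private static final Pattern STRICT_A = Pattern.compile(
            "<a\\s+href=\"https?://[^\"<>\\s]*\"\\s*>",
            Pattern.CASE_INSENSITIVE);

    /** Returns true only if the whole string is an anchor tag of the strict form. */
    static boolean isAllowedAnchor(String tag) {
        return STRICT_A.matcher(tag).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowedAnchor("<a href=\"http://example.com/\">"));
        System.out.println(isAllowedAnchor("<a href=\"javascript:alert(1)\">"));
    }
}
```

A whitelist of this kind rejects the `style` and `expr` attributes from the example above simply because they are not `href`. It is still only one layer, and a tested third-party filter remains the safer choice.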

Red Orchid · Nov 17, 2006

birgit said:
[snip]
So if I don't want to modify the source code of OpenCms I need to use
regular expressions and if possible one which checks everything or
maybe two or three which can be executed one after the other.
Unfortunately the last suggestion can't be used although I am sure it
would work.

Maybe it is possible to write one regex which checks everything
using conditionals and lookaround.

But, as far as I know,
the Java regex library does not support conditionals.
(I don't know the reason.)

If you think the regex is possible with conditionals,
it may be worth searching for a library that supports
them.

birgit · Nov 17, 2006

So far, I don't want to change any of the code which uses the regular
expression and so this is unfortunately no solution.

I am not really good with regular expressions! But I've got something
which works in most situations, but if there is invalid HTML code at
the end of the text, it seems to go into an endless loop. Can anyone correct
this regular expression?

((<b>)??|(<\/b>)??|(<br\s+\/>)??|(<a[^>]*>)??|(<\/a>)??|([\u0000-\u003B\u003D\u003F-\u00FF]*)??)*?

I also know that with this expression no '>' and '<' are allowed in
normal text. Can I solve this somehow?

Thanks for your help!
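
The apparent endless loop is most likely catastrophic backtracking: every alternative is optional and sits inside another repeated group, so on a failing input the engine tries an enormous number of ways to split the text. A minimal sketch of a single pattern that avoids this is shown below. It assumes the validator requires the pattern to match the whole content, as `String.matches` does, and the class name is hypothetical. Each alternative consumes at least one character, and the possessive quantifiers (`++`, `*+`) stop the engine from retrying text it has already accepted.

```java
import java.util.regex.Pattern;

class OneRegexCheck {
    // ISO-8859-1 characters except '<' (U+003C) and '>' (U+003E).
    private static final String TEXT = "[\\u0000-\\u003B\\u003D\\u003F-\\u00FF]";

    static final String ONE_REGEX =
            "(?i)(?:" + TEXT + "++"             // a run of plain text
            + "|</?b>"                          // <b> or </b>
            + "|<br\\s+/>"                      // <br />
            + "|<a(?:\\s" + TEXT + "*+)?>"      // <a> or <a ...>
            + "|</a>"                           // </a>
            + ")*+";

    private static final Pattern P = Pattern.compile(ONE_REGEX);

    /** Returns true if the whole input consists of allowed text and tags. */
    static boolean isValid(String input) {
        return P.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Hello <b>bold</b><br />and <a href=\"x\">link</a>"));
        System.out.println(isValid("broken at the end <script"));
    }
}
```

Like the original expression, this still forbids bare '<' and '>' in normal text. Allowing them would mean asking editors to type `&lt;` and `&gt;`, which this pattern already accepts as plain text.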

Daniel Pitts · Nov 17, 2006

Actually, I think the proper way to handle your problem is to HTML
escape the whole thing, and then have "pseudo tags" similar to BBCode.
[b]bold[/b] [url=...]Link[/url]

That way, everything is safe for HTML, and you have control of what is
added.
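
A minimal sketch of the escape-then-convert idea. The `[b]` and `[url=...]` syntax, the class name and the method names are all assumptions for illustration. Everything is escaped first, and only the two pseudo tags are turned back into real HTML, with links limited to `http` and `https`.

```java
class PseudoTags {
    /** Escapes the characters that are special in HTML text and attribute values. */
    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    /** Escapes the input, then converts the hypothetical pseudo tags to HTML. */
    static String render(String input) {
        String html = escapeHtml(input);
        // [b]text[/b] becomes <b>text</b>
        html = html.replaceAll("(?is)\\[b\\](.*?)\\[/b\\]", "<b>$1</b>");
        // [url=http://...]text[/url] becomes an anchor; other schemes are left alone.
        html = html.replaceAll(
                "(?is)\\[url=(https?://[^\\]\\s]+)\\](.*?)\\[/url\\]",
                "<a href=\"$1\">$2</a>");
        return html;
    }

    public static void main(String[] args) {
        System.out.println(render("[b]bold[/b] [url=http://example.com]Link[/url] <script>"));
    }
}
```

Because the escaping runs before the conversion, a quote or angle bracket typed inside a pseudo tag is already harmless by the time it reaches the generated HTML.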
