regex: find semicolon that is not part of an entity

Robert Watkins · Jul 16, 2007

Okay, I have something that works, but I don't like it:

String SEMICOLON_NOT_ENTITY_REGEX = "(?<!&#?\\w{1,20}+);";

The part of the regex that says, "at least 1 but not more than 20" is a
horrible addition that is the only way I could find around the odd
Exception:

java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 9
(?<!&#?\w+);
         ^

Is there any way around this?
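
For reference, here is a minimal sketch of how the bounded look-behind above behaves when used with String.split. The class name, method name and sample input are made up for illustration; the {1,20} bound is the one from the post, so an entity name longer than 20 characters would slip past it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class LookbehindSplitDemo {
    // The bounded, possessive quantifier gives the look-behind a known
    // maximum length, which is what java.util.regex insists on here.
    static final Pattern SEMICOLON_NOT_ENTITY =
            Pattern.compile("(?<!&#?\\w{1,20}+);");

    // Splits on every semicolon that is not preceded by "&name" or "&#digits".
    static List<String> split(String input) {
        return Arrays.asList(SEMICOLON_NOT_ENTITY.split(input));
    }

    public static void main(String[] args) {
        // prints [a, b&amp;c, &#160;d]
        System.out.println(split("a;b&amp;c;&#160;d"));
    }
}
```

Note that any text of the form "&word;" is treated as an entity, whether or not the name is a known one.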

Daniel Pitts · Jul 16, 2007

Does this regex work for you: "(&#?[^;]*;[^&]*)?;"

Alternatively, you can use an XML parser: wherever you find a ; in
character data, you have found a ; that is not part of an entity.

Roedy Green · Jul 16, 2007

Robert Watkins said:
String SEMICOLON_NOT_ENTITY_REGEX

I have written some classes for interconverting entities and chars.
They use manual parsers. See
http://mindprod.com/products.html#ENTITIES
They also include big tables of known entities.

Robert Watkins · Jul 17, 2007

Daniel Pitts said:
Does this regex work for you: "(&#?[^;]*;[^&]*)?;"

Nope -- that one also finds semicolons that /are/ part of entities.

Daniel Pitts said:
Alternatively, you can use an XML parser, and wherever you find a ; in
character data, you found a ; not part of an entity

Hmmm. I suppose, but that's way more heavy-handed than I was hoping for (and
probably far less performant than the regex I've got that works). All I'm
doing is splitting a String on semicolons, while keeping entities intact.

Thanks,
-- Robert

Robert Watkins · Jul 17, 2007

Roedy Green said:
I have written some classes for interconverting entities and chars.
They use manual parsers. see
http://mindprod.com/products.html#ENTITIES
They also include big tables of known entities.

Thanks, but I don't want to convert the entities. All I'm doing is
splitting a String on semicolons, while keeping entities intact.

Thanks,
-- Robert

Oliver Wong · Jul 17, 2007

Robert Watkins said:
All I'm doing is
splitting a String on semicolons, while keeping entities intact.

It might be easier to solve the inverse problem, then: Find the
strings that are separated by semicolons that are not part of an entity.
I think the regular expression would look something like:

([^&;]|&[^&;]*;)*(&[^&;]*)?

Where the last bit "(&[^&;]*)?" is only necessary if you want to allow
for malformed XML where you have an unterminated entity (e.g.
"<BadXML>Hello World &unterminated</BadXML>").

What the regexp basically says is:

<pseudoRegExp>
(
Any character except '&' and ';'
OR
an entity; that is, '&' followed by any character except '&' and ';'
followed by ';'
) zero or more times
optionally followed by an unterminated entity.
</pseudoRegExp>

- Oliver
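
A minimal Java sketch of the inverse approach, assuming the pattern is applied repeatedly with Matcher.find(). The class and method names are hypothetical. Because the pattern can also match the empty string (at each separating semicolon and at the end of the input), this sketch skips empty matches, which means genuinely empty fields are dropped as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InverseMatchDemo {
    // (plain character | terminated entity)*, then an optional
    // unterminated entity at the end of the field.
    static final Pattern FIELD =
            Pattern.compile("([^&;]|&[^&;]*;)*(&[^&;]*)?");

    // Collects every non-empty run matched by FIELD.
    static List<String> fields(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            // Empty matches occur at separators and at end of input.
            if (!m.group().isEmpty()) {
                out.add(m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [a, b&amp;c, &#160;d]
        System.out.println(fields("a;b&amp;c;&#160;d"));
        // prints [x, y &oops]
        System.out.println(fields("x;y &oops"));
    }
}
```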

Roedy Green · Jul 17, 2007

Robert Watkins said:
Thanks, but I don't want to convert the entities. All I'm doing is
splitting a String on semicolons, while keeping entities intact.

One way to skin that cat would be to convert the entities back to
chars, then split on semicolons, then put the entities back.
Consider there are also decimal and hex entities.

You could use your code to find entities, or use mine.

Robert Watkins · Jul 18, 2007

Oliver Wong said:
Robert Watkins said:
All I'm doing is
splitting a String on semicolons, while keeping entities intact.

It might be easier to solve the inverse problem, then: Find the
strings that are separated by semi colons that are not part of an
entity. I think the regular expression would look something like:

([^&;]|&[^&;]*;)*(&[^&;]*)?

Thanks for this approach. It does work. Given that I will not allow
malformed entities, I've changed the regex to:

([^&;]|&#?\\w+;)*

which also restricts the entity-specific regex a bit. It took me a while
to respond because (along with all my other work!) I did a fair bit of
testing with your approach, my original approach and with yet another
approach: splitting on all semicolons, then reconstructing the strings
that were split at the end of entities. What surprised me was that there
weren't any hugely significant differences in the performance of all
three approaches. I expected the string reconstruction to be way slower
than the others, but the greatest difference in timing was a mere 8%
(which could certainly be considered more significant in different
contexts).
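
No code for the split-and-reconstruct approach was posted, so the following is only a guess at what it might look like: split on every semicolon, then glue a piece back onto its successor whenever the piece ends in something that looks like the start of an entity. All names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class SplitAndRejoinDemo {
    // A piece ending in "&name" or "&#123" was cut at an entity's semicolon.
    static final Pattern ENTITY_TAIL = Pattern.compile("&#?\\w+\\z");

    static List<String> split(String input) {
        // The limit -1 keeps trailing empty pieces, so a semicolon at the
        // very end of the input is not silently lost.
        String[] pieces = input.split(";", -1);
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < pieces.length; i++) {
            current.append(pieces[i]);
            boolean hasNext = i < pieces.length - 1;
            if (hasNext && ENTITY_TAIL.matcher(pieces[i]).find()) {
                // The semicolon belonged to an entity: restore it and
                // keep building the same field.
                current.append(';');
            } else {
                out.add(current.toString());
                current.setLength(0);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [a, b&amp;c, &#160;d]
        System.out.println(split("a;b&amp;c;&#160;d"));
    }
}
```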

I'm a bit reluctant to admit that I started out as a Perl programmer --
and as such have always fancied myself fairly good with regular
expressions -- but you've got me stumped here. While I was able to parse
your original regex easily enough not to need your kindly provided
pseudoRegExp, I have to admit that I can't figure out why the first
character class needs to be [^&;]. Why does the & have to figure in; why
could it not simply be:

<pseudoRegExp>
(
any character not a semicolon
OR
any entity
) zero or more times
</pseudoRegExp>

I tried this and it simply doesn't work, but I can't think why.

Robert Watkins · Jul 18, 2007

Don't it always happen that way? I answered my own question moments
after posting this response to you.

It's a matter of order. The regex you provided, and which I modified,
tries the single-character alternative first. Without the & in the
character class, that alternative consumes the & of an entity as an
ordinary character, so the match stops at the entity's semicolon before
the entity alternative is ever tried. Just switching things around a bit:

(&#?\\w+;|[^;])*

Gives the expected results and is (for me) much clearer, being
essentially what I was looking for in my question to you:

<pseudoRegExp>
(
any entity
OR
any character not a semicolon
) zero or more times
</pseudoRegExp>

In any case, thank you again, you certainly pointed me in the right
direction, having me look at the problem the other way 'round.

-- Robert
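
To make the ordering point concrete, here is a small sketch that runs both orderings of the alternation over the same input. The class name, constant names and sample string are illustrative only; empty matches are skipped, so empty fields would be dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlternationOrderDemo {
    // Entity alternative first: an '&' is tried as the start of an entity.
    static final Pattern ENTITY_FIRST = Pattern.compile("(&#?\\w+;|[^;])*");
    // Character alternative first: the '&' is consumed as a plain character,
    // so the match ends at the entity's own semicolon.
    static final Pattern CHAR_FIRST = Pattern.compile("([^;]|&#?\\w+;)*");

    // Collects every non-empty match of the given pattern.
    static List<String> fields(Pattern pattern, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            if (!m.group().isEmpty()) {
                out.add(m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [a, b&amp;c]
        System.out.println(fields(ENTITY_FIRST, "a;b&amp;c"));
        // prints [a, b&amp, c]
        System.out.println(fields(CHAR_FIRST, "a;b&amp;c"));
    }
}
```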
