regex: find semicolon that is not part of an entity

Robert Watkins

Okay, I have something that works, but I don't like it:

String SEMICOLON_NOT_ENTITY_REGEX = "(?<!&#?\\w{1,20}+);";

The part of the regex that says, "at least 1 but not more than 20" is a
horrible addition that is the only way I could find around the odd
Exception:

java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 9
(?<!&#?\w+);
         ^

Is there any way around this?
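
For anyone who wants to try the bounded version, here is a minimal sketch of how it behaves with String.split. The class name, the split helper and the sample input are made up for illustration; only the regex itself comes from the post above.

```java
import java.util.Arrays;

class BoundedLookbehindSplit {
    // The pattern from the post: a ';' that is not preceded by '&', an
    // optional '#', and 1 to 20 word characters. The {1,20} bound is what
    // gives the look-behind a maximum length that Java can determine.
    static final String SEMICOLON_NOT_ENTITY_REGEX = "(?<!&#?\\w{1,20}+);";

    // Hypothetical helper: split on semicolons that do not end an entity.
    static String[] split(String input) {
        return input.split(SEMICOLON_NOT_ENTITY_REGEX);
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch.
        System.out.println(Arrays.toString(split("a;b&amp;c;d")));
        // prints [a, b&amp;c, d]
    }
}
```

One caveat with this kind of pattern: any text that merely looks like an entity, such as "AT&T" right before a separator, is treated as one, so the semicolon after it is not split on.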
 
Daniel Pitts

Does this regex work for you: "(&#?[^;]*;[^&]*)?;"

Alternatively, you can use an XML parser, and wherever you find a ; in
character data, you found a ; not part of an entity :)
 
Robert Watkins

Daniel Pitts said:
Does this regex work for you: "(&#?[^;]*;[^&]*)?;"

Nope -- that one also finds semicolons that /are/ part of entities.

Daniel Pitts said:
Alternatively, you can use an XML parser, and wherever you find a ; in
character data, you found a ; not part of an entity :)

Hmmm. I suppose, but that is way more heavy-handed than I was hoping for (and
probably far less performant than the regex I've got that works). All I'm
doing is splitting a String on semicolons, while keeping entities intact.

Thanks,
-- Robert
 
Oliver Wong

Robert Watkins said:
All I'm doing is
splitting a String on semicolons, while keeping entities intact.

It might be easier to solve the inverse problem, then: Find the
strings that are separated by semicolons that are not part of an entity.
I think the regular expression would look something like:

([^&;]|&[^&;]*;)*(&[^&;]*)?

Where the last bit "(&[^&;]*)?" is only necessary if you want to allow
for malformed XML where you have an unterminated entity (e.g.
"<BadXML>Hello World &unterminated</BadXML>").

What the regexp basically says is:

<pseudoRegExp>
(
Any character except '&' and ';'
OR
an entity; that is, '&' followed by any character except '&' and ';'
followed by ';'
) zero or more times
optionally followed by an unterminated entity.
</pseudoRegExp>
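
In Java, the idea could be sketched roughly as follows. This is only an illustration under some assumptions: the class and method names are invented, and because the expression can match the empty string (it does so at every separator and at the end of the input), the sketch simply skips empty matches, which also means genuinely empty fields are dropped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InverseSplit {
    // The expression from the post, unchanged.
    static final Pattern FIELD =
            Pattern.compile("([^&;]|&[^&;]*;)*(&[^&;]*)?");

    // Hypothetical helper: collect every non-empty run matched by FIELD.
    static List<String> fields(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            // Empty matches occur at each separating ';' and at the end
            // of the input; they are not fields, so drop them.
            if (!m.group().isEmpty()) {
                out.add(m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample inputs invented for this sketch.
        System.out.println(fields("a;b&amp;c;d"));
        // prints [a, b&amp;c, d]
        System.out.println(fields("x;y &unterminated"));
        // prints [x, y &unterminated]
    }
}
```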

- Oliver
 
Roedy Green

Robert Watkins said:
Thanks, but I don't want to convert the entities. All I'm doing is
splitting a String on semicolons, while keeping entities intact.

One way to skin that cat would be to convert the entities back to
chars, then split on semicolons, then put the entities back.
Consider that there are also decimal and hex entities.

You could use your code to find entities, or use mine.
 
Robert Watkins

Oliver Wong said:
It might be easier to solve the inverse problem, then: Find the
strings that are separated by semicolons that are not part of an
entity. I think the regular expression would look something like:

([^&;]|&[^&;]*;)*(&[^&;]*)?

Thanks for this approach. It does work. Given that I will not allow
malformed entities, I've changed the regex to:

([^&;]|&#?\\w+;)*

which also restricts the entity-specific regex a bit. It took me a while
to respond because (along with all my other work!) I did a fair bit of
testing with your approach, my original approach and with yet another
approach: splitting on all semicolons, then reconstructing the strings
that were split at the end of entities. What surprised me was that there
weren't any hugely significant differences in the performance of all
three approaches. I expected the string reconstruction to be way slower
than the others, but the greatest difference in timing was a mere 8%
(which could certainly be considered more significant in different
contexts).
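
For illustration, the third approach (split on every semicolon, then glue back the pieces that were cut at the end of an entity) might look roughly like the sketch below. This is a guess at the idea, not the code that was actually timed; the class name, method name and the ENTITY_TAIL pattern are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SplitAndRejoin {
    // Assumed test for "this piece was cut off at the end of an entity":
    // it ends with '&', an optional '#', and one or more word characters.
    static final Pattern ENTITY_TAIL = Pattern.compile("&#?\\w+\\z");

    static List<String> splitAndRejoin(String input) {
        // Limit -1 keeps trailing empty pieces, so a ';' at the very end
        // of the input is not silently lost.
        String[] pieces = input.split(";", -1);
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < pieces.length; i++) {
            current.append(pieces[i]);
            boolean last = (i == pieces.length - 1);
            if (!last && ENTITY_TAIL.matcher(pieces[i]).find()) {
                // The ';' removed here closed an entity: put it back
                // and keep building the same field.
                current.append(';');
            } else {
                out.add(current.toString());
                current.setLength(0);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch.
        System.out.println(splitAndRejoin("a;b&amp;c;d"));
        // prints [a, b&amp;c, d]
    }
}
```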

I'm a bit reluctant to admit that I started out as a Perl programmer --
and as such have always fancied myself fairly good with regular
expressions -- but you've got me stumped here. While I was able to parse
your original regex easily enough not to need your kindly provided
pseudoRegExp, I have to admit that I can't figure out why the first
character class needs to be [^&;]. Why does the & have to figure in; why
could it not simply be:

<pseudoRegExp>
(
any character not a semicolon
OR
any entity
) zero or more times
</pseudoRegExp>

I tried this and it simply doesn't work, but I can't think why.
 
Robert Watkins

Don't it always happen that way? I answered my own question moments
after posting this response to you.

It's a matter of order. The regex you provided, and which I modified,
tries to match the sole semicolons first, but without the & in the
character class it finds the semicolons in entities before the regex has
tried to match entities. Just switching things around a bit:

(&#?\\w+;|[^;])*

Gives the expected results and is (for me) much clearer, being
essentially what I was looking for in my question to you:

<pseudoRegExp>
(
any entity
OR
any character not a semicolon
) zero or more times
</pseudoRegExp>
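
A small sketch that shows the effect of the ordering, in case it helps anyone else. The class name, the firstToken helper and the sample input are invented for this example; the two patterns are the "any character not a semicolon first" variant and the reordered one from above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrderDemo {
    // Tries "any character not a semicolon" before "any entity".
    static final Pattern CHAR_FIRST = Pattern.compile("([^;]|&#?\\w+;)*");
    // Tries "any entity" before "any character not a semicolon".
    static final Pattern ENTITY_FIRST = Pattern.compile("(&#?\\w+;|[^;])*");

    // Hypothetical helper: the text the pattern matches at the start of
    // the input, i.e. the first field it would produce.
    static String firstToken(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.lookingAt() ? m.group() : "";
    }

    public static void main(String[] args) {
        String input = "b&amp;c;d"; // sample input invented for this sketch

        // [^;] happily consumes the '&' as an ordinary character, so the
        // entity alternative never gets a chance and the match stops at
        // the ';' inside the entity.
        System.out.println(firstToken(CHAR_FIRST, input));
        // prints b&amp

        // With the entity alternative first, the whole "&amp;" is taken
        // in one step and the match stops at the real separator.
        System.out.println(firstToken(ENTITY_FIRST, input));
        // prints b&amp;c
    }
}
```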

In any case, thank you again, you certainly pointed me in the right
direction, having me look at the problem the other way 'round.

-- Robert
