Perl Regex Query

Jane

Hi,

I'm trying to make sure that any ampersands I produce in a webpage
use the proper entity code (i.e. "&amp;" as opposed to simply
"&"). I can get so far with it, and it works fine, except for one
small but important detail.

Importantly, when swapping any "&" I find, I have to check that it is
not already part of an entity, which could be "&amp;" itself, or
something like "&#155", for example. It may also simply be preceded or
followed by nothing other than a space.

So, I produced a little test sentence, below, to try out my regex, and
it swaps everything it's supposed to swap, but it gobbles up an extra
character as well, when I don't want it to. The regex, the original
test sentence, and the sentence after being regexed, are below
(hopefully the ampersands etc don't get escaped when I post this):

THE ORIGINAL SENTENCE:
$x="Apples & oranges from T&J are really good and tasty &#8230, &amp; I should know...\n";

(which contains two ampersands to be regexed, one between apples and
oranges, and one between "T" and "J". The others are, of course,
already in good shape).


THE REGEX:
$x=~s/&[^#a]/&amp;/g;


THE RESULT:
Apples &amp;oranges from T&amp; are really good and tasty &#8230, &amp;
I should know...


.... so it gobbles up the space character before oranges, and also the
J, from T&J. I've tried all sorts of things, but can only seem to make
it worse! ... Any assistance would be mightily appreciated.

Thanks!
Jane
 
Jane

Grrr! All my ampersands did get escaped when I posted, so the post
doesn't make a lot of sense ... however, hopefully someone will
understand what I was babbling about.

Jane
 
Gunnar Hjalmarsson

Jane said:
I'm trying to make sure that any ampersands I produce in a webpage
use the proper entity code (i.e. "&amp;" as opposed to simply
"&").
...
THE REGEX:
$x=~s/&[^#a]/&amp;/g;

Maybe you want something like:

$x =~ s/&(?!#?\w+;)/&amp;/g;

Please look for "negative look-ahead assertion" in "perldoc perlre".
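
To illustrate why the original pattern eats a character, here is a minimal sketch comparing a character class with a look-ahead. It assumes the test sentence from the original post with its last ampersand already written as "&amp;"; the variable names are only for illustration. The second substitution keeps the same "#" or "a" heuristic as the original regex and is not the stricter pattern suggested above; it only shows that a look-ahead inspects the next character without consuming it.

```perl
use strict;
use warnings;

# Test sentence from the original post (the numeric entity has no ';').
my $original = "Apples & oranges from T&J are really good and tasty &#8230, &amp; I should know...\n";

# Character class: [^#a] matches AND consumes the character after '&',
# so that character is replaced together with the ampersand.
(my $consumed = $original) =~ s/&[^#a]/&amp;/g;

# Negative look-ahead: (?![#a]) only checks the next character;
# it is not part of the match, so it survives the substitution.
(my $kept = $original) =~ s/&(?![#a])/&amp;/g;

print $consumed;   # Apples &amp;oranges from T&amp; are ... &#8230, &amp; I should know...
print $kept;       # Apples &amp; oranges from T&amp;J are ... &#8230, &amp; I should know...
```

The heuristic version would still misfire on text such as "R&amp" typed by hand or "Q&a", which is why checking for a complete entity is the safer test.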
 
Ben Bacarisse

Jane said:
Hi,

I'm trying to make sure that any ampersands I produce in a webpage
use the proper entity code (i.e. "&amp;" as opposed to simply
"&"). I can get so far with it, and it works fine, except for one
small but important detail.

Importantly, when swapping any "&" I find, I have to check that it is
not already part of an entity, which could be "&amp;" itself, or
something like "&#155", for example. It may also simply be preceded or
followed by nothing other than a space.

So, I produced a little test sentence, below, to try out my regex, and
it swaps everything it's supposed to swap, but it gobbles up an extra
character as well, when I don't want it to. The regex, the original
test sentence, and the sentence after being regexed, are below
(hopefully the ampersands etc don't get escaped when I post this):

THE ORIGINAL SENTENCE:
$x="Apples & oranges from T&J are really good and tasty &#8230, &amp; I should know...\n";

(which contains two ampersands to be regexed, one between apples and
oranges, and one between "T" and "J". The others are, of course,
already in good shape).


THE REGEX:
$x=~s/&[^#a]/&amp;/g;

One way is to use negative look ahead matching (X not followed by Y):

s/&(?!#?\w+;)/&amp;/g

Be careful though. You don't seem to have correct character entities:
your numeric ones are not followed by ";" as they should be, so the
above does not work with your example. If you have lots of these you
could use:

s/&(?!#\d+;?|\w+;)/&amp;/g

Making the ; optional after alphabetic entity names will not work
because &J will be seen as a valid entity. Let's hope you don't have
any of these.
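
As a rough check of the points above, this sketch runs both patterns, plus a third one with ";" optional after names, against the test sentence. It assumes the sentence as posted with the final ampersand already written as "&amp;"; the variable names are hypothetical.

```perl
use strict;
use warnings;

my $original = "Apples & oranges from T&J are really good and tasty &#8230, &amp; I should know...\n";

# Strict: an entity must end in ';', so the unterminated "&#8230," is escaped too.
(my $strict = $original) =~ s/&(?!#?\w+;)/&amp;/g;

# Lenient: ';' is optional after numeric references only.
(my $lenient = $original) =~ s/&(?!#\d+;?|\w+;)/&amp;/g;

# Too lenient: with ';' optional after names as well, "&J" looks like an entity.
(my $too_lenient = $original) =~ s/&(?!#\d+;?|\w+;?)/&amp;/g;

print $strict;       # ... T&amp;J ... tasty &amp;#8230, &amp; I should know...
print $lenient;      # ... T&amp;J ... tasty &#8230, &amp; I should know...
print $too_lenient;  # ... T&J ... tasty &#8230, &amp; I should know...
```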
 
Jane

Ben Bacarisse said:
Be careful though. You don't seem to have correct character entities:
your numeric ones are not followed by ";" as they should be, so the
above does not work with your example. If you have lots of these you
could use:

s/&(?!#\d+;?|\w+;)/&amp;/g

Making the ; optional after alphabetic entity names will not work
because &J will be seen as a valid entity. Let's hope you don't have
any of these.


Wow, thanks so much to everyone for the rapid responses, much
appreciated, and I'll research negative look-ahead for future
reference.

Anyway, the regex from Ben worked perfectly for all cases I have,
including the test sentence I was playing with. The others worked
except for one instance in the test sentence: they ended up replacing
the & after "tasty", where it was already an entity reference. I
really appreciate all the suggestions and comments!

Many thanks,
Jane
 
Gunnar Hjalmarsson

Jane said:
Wow, thanks so much to everyone for the rapid responses, much
appreciated, and I'll research negative look-ahead for future
reference.

Anyway, the regex from Ben worked perfectly for all cases I have,
including the test sentence I was playing with. The others worked
except for one instance in the test sentence: they ended up replacing
the & after "tasty", where it was already an entity reference.

No, it wasn't. Since there was no trailing ';', it was not an entity
reference. Hence it should not be treated as such, should it?
 
Jane

Gunnar Hjalmarsson said:
No, it wasn't. Since there was no trailing ';', it was not an entity
reference. Hence it should not be treated as such, should it?

Just for info: in the instances that I'm using it for, I think
that the "&" could in fact be followed by any character at all,
including spaces. In some cases it may be part of a numeric entity (or
I suppose other HTML entities, such as &nbsp;, etc.), but in other cases
it'll just be an ampersand that needs to be entity-ised. However, if
it's followed by a # symbol, then I know not to do anything with it,
and I guess that any other HTML & entity should always be terminated by
a ";", with no whitespace between the "&" and the ";".
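
For what it's worth, the rule described above could be sketched roughly as follows. The helper name escape_bare_ampersands is made up for this example, and the pattern is only one possible reading of the rule: leave "&" alone if it is followed by "#", or by a run of word characters ending in ";", and escape it otherwise.

```perl
use strict;
use warnings;

# Hypothetical helper, not from the thread: escape '&' unless it starts
# "&#..." or a name terminated by ';' with no whitespace in between.
sub escape_bare_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!#|\w+;)/&amp;/g;
    return $text;
}

print escape_bare_ampersands("Fish & chips, R&D, &nbsp;, &#155 and &amp;\n");
# Fish &amp; chips, R&amp;D, &nbsp;, &#155 and &amp;
```

One limitation of this reading: ordinary text such as "A&B;" would be taken for an entity and left unescaped.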

Anyway, many thanks to each contributor, I appreciate it.

Thanks,
Jane
 
