Perl Regex - Hex bytes

JEB

I am trying to use Perl to rescue some legacy word processor files.
The files are ASCII, except that some control codes use
bytes in the $80-$ff range. I slurp the file into a string for editing.

The regex can handle bytes up to \x7f, but fails to recognize bytes that are
\x80 or above.

e.g.,

/\x03//; works
/\x81//; doesn't

Since I thought the problem might be related to the adoption of Unicode, I've
tried various things like:

no encoding;
use bytes;
and various forms of encoding;
etc.

Nothing helped, but I may not have done it right.

I'm using Perl 5.8+ (whatever the latest revision is) with Redhat Linux
8.0.

Is this something a Perl regex just can't handle?

JEB
 
Rafael Garcia-Suarez

JEB said:
/\x03//; works
/\x81//; doesn't

You're giving too little information.
Could you post sample code that demonstrates the problem, along with
your definition of "doesn't work"? (warnings, errors, expected result vs.
actual result)
 
Alan J. Flavell

JEB said:
I'm using Perl 5.8+ (whatever the latest revision is)

I suspect you're really using 5.8.0 (as opposed to 5.8.1).

JEB said:
with Redhat Linux

I think that's your clue. Look for utf-8 in your linux locale
setting. It's confusing Perl 5.8.0 into using unicode mode.

(And read other discussions and FAQs on this issue).

Either change your locale setting to remove the reference
to utf-8 (I'm sure this works); or upgrade to 5.8.1, where this
coupling between locale and Perl default behaviour was found too
confusing and has been removed (so I'm told).
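
One way to check for a UTF-8 locale is sketched below. It is only a rough illustration: the function names `effective_ctype` and `ctype_is_utf8` are made up for this example, and it relies on the usual POSIX precedence (LC_ALL overrides LC_CTYPE, which overrides LANG) rather than on anything specific to Perl.

```shell
#!/bin/sh
# Print the locale setting that decides character handling.
# Precedence per POSIX: LC_ALL first, then LC_CTYPE, then LANG.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        printf '%s\n' "$LC_CTYPE"
    else
        printf '%s\n' "${LANG:-}"
    fi
}

# Return success (status 0) when the effective setting names a UTF-8 codeset.
# The codeset can be spelled UTF-8, utf8, and so on, so compare in lower case.
ctype_is_utf8() {
    case "$(effective_ctype | LC_ALL=C tr '[:upper:]' '[:lower:]')" in
        *utf-8*|*utf8*) return 0 ;;
        *) return 1 ;;
    esac
}

if ctype_is_utf8; then
    echo "UTF-8 locale in effect: $(effective_ctype)"
else
    echo "no UTF-8 locale in effect: $(effective_ctype)"
fi
```

If this reports a UTF-8 locale (for example en_US.UTF-8) on a machine running Perl 5.8.0, that is consistent with the diagnosis above.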

JEB said:
Is this something a Perl regex just can't handle?

Wrong diagnosis. Certainly it can handle it.
 
JEB

Alan J. Flavell said:
I think that's your clue. Look for utf-8 in your linux locale
setting. It's confusing Perl 5.8.0 into using unicode mode.

(And read other discussions and FAQs on this issue).

Either change your locale setting to remove the reference
to utf-8 (I'm sure this works); or upgrade to 5.8.1, where this
coupling between locale and Perl default behaviour was found too
confusing and has been removed (so I'm told).

THANKS for the idea and help.

Exporting LC_ALL="en_US" in /etc/profile fixed the problem, though in a
clumsy way. I hope it doesn't create problems elsewhere.

JEB
 
Ben Morrow

JEB said:
Exporting LC_ALL="en_US" in /etc/profile fixed the problem, though in a
clumsy way. I hope it doesn't create problems elsewhere.

Installing 5.8.1 will also fix it, without the need to lose your
Unicode locale. Alternatively, as a temporary fix, you could

1. Make sure you have /usr/bin/perl5.8.0: if not, copy it from
/usr/bin/perl
2. Remove /usr/bin/perl
3. Create a shell script /usr/bin/perl containing
#!/bin/sh
export LC_ALL="en_US.ISO8859-1"
exec /usr/bin/perl5.8.0 "$@"

Yes, I think this is a pretty evil hack, too, but if you have problems
with losing the Unicode locale it may help.
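
A less invasive variant of the same idea, sketched here as an assumption rather than something tested on 5.8.0, is to override the locale for a single command instead of replacing /usr/bin/perl or editing /etc/profile. The helper name `run_in_c_locale` is hypothetical, and the choice of the "C" locale is also an assumption (the thread itself used en_US and en_US.ISO8859-1).

```shell
#!/bin/sh
# Run one command with a byte-oriented locale, leaving the caller's own
# locale settings untouched. "C" is the POSIX locale; swap in another
# non-UTF-8 locale name if that suits the system better.
run_in_c_locale() {
    # env sets LC_ALL only in the environment of the command it starts.
    env LC_ALL=C "$@"
}
```

It could then be used as `run_in_c_locale perl rescue.pl old.doc`, where rescue.pl and old.doc are placeholder names for the conversion script and the legacy file.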

Ben
 
