comparing binary strings

Yakov

Let's say I create binary strings using pack() or like this:

sub CreateEightByteString { return $_[0]; }
$a = CreateEightByteString("\301" x 8);
$b = CreateEightByteString("\277" x 8);

and I need to ensure that comparisons (lt, gt) of those strings
do not depend on the locale and are not interpreted as UTF-8.

How can I "label" these strings as "raw binary" in CreateEightByteString
so that any subsequent comparison treats them as 8 raw bytes,
independent of the program's locale?

Y.L.
 
Ben Morrow

Quoth Yakov said:
Let's say I create binary strings using pack() or like this:

sub CreateEightByteString { return $_[0]; }

Uh... what's the point of this?
$a = CreateEightByteString("\301" x 8);
$b = CreateEightByteString("\277" x 8);

Don't use $a and $b, they are magic.
and I need to ensure that comparisons (lt, gt) of those strings
do not depend on the locale and are not interpreted as UTF-8.

Don't attempt to mix POSIX locales and Perl's UTF8 support. The two
don't play well together at all yet.
How can I "label" these strings as "raw binary" in CreateEightByteString
so that any subsequent comparison treats them as 8 raw bytes,
independent of the program's locale?

You can't. You need to perform *the comparisons* under 'use bytes'.

If you aren't using locales, and you call CreateEightByteString under
'use bytes', you will get byte-strings back. If you only mix these with
other byte-strings, they will stay that way.
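
The closest thing to "labelling" at creation time is to normalise the string to Perl's byte representation and reject anything that cannot be raw bytes. The following is only a sketch: the helper name make_eight_byte_string is hypothetical, and it relies on the core utf8::downgrade function. The result is not a permanent label; mixing it with a string containing characters above 255 will upgrade it again.

```perl
use strict;
use warnings;

# Hypothetical helper (name is illustrative, not from the thread): returns
# an 8-byte string stored in Perl's byte representation, or dies.
sub make_eight_byte_string {
    my ($s) = @_;
    # With a true second argument, utf8::downgrade returns false instead of
    # dying when the string holds a character above 255.
    utf8::downgrade($s, 1)
        or die "string contains characters above 255\n";
    length($s) == 8
        or die "expected 8 bytes, got " . length($s) . "\n";
    return $s;
}

# Lexical names are used instead of $a and $b, which sort() treats specially.
my $x = make_eight_byte_string("\301" x 8);
my $y = make_eight_byte_string("\277" x 8);

print $x gt $y ? "x sorts after y\n" : "x does not sort after y\n";
```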

Ben
 
Joost Diepenmaat

Don't use locales then. For the purposes of lt and gt and eq it's
irrelevant whether the strings are utf-8 encoded or not.

If you want to be able to output them as bytes, just don't insert any
chars > 255.
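
A small sketch of that point, assuming neither 'use locale' nor 'use bytes' is in effect. The variable names are illustrative; utf8::upgrade only changes how the string is stored, not which characters it contains.

```perl
use strict;
use warnings;

my $native   = "\301" x 8;      # eight characters, each code point 0xC1
my $upgraded = $native;
utf8::upgrade($upgraded);       # same characters, UTF-8 encoded internally

my $other = "\277" x 8;         # eight characters, each code point 0xBF

# eq/lt/gt compare code points here, so the internal representation
# of either operand makes no difference to the result.
print "native eq upgraded\n" if $native eq $upgraded;
print "native gt other\n"    if $native gt $other;
print "upgraded gt other\n"  if $upgraded gt $other;
```
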
Don't attempt to mix POSIX locales and Perl's UTF8 support. The two
don't play well together at all yet.
Yup.


You can't. You need to perform *the comparisons* under 'use bytes'.

No, you don't need to. The only time the encoding of the strings is
important is when you're passing them to external code as a C-style char*
pointer. Or at least it should be.

In general, I'd say: don't use bytes. It breaks stuff - for instance, it
makes it impossible to compare a utf-8 encoded string with a non-utf8
encoded one.
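
To illustrate the breakage, here is a minimal sketch with illustrative variable names: two strings holding the same characters compare equal normally, but not inside a 'use bytes' scope, where the comparison sees the internal storage instead.

```perl
use strict;
use warnings;

my $native   = "\301" x 8;
my $upgraded = $native;
utf8::upgrade($upgraded);       # same characters, different internal storage

my $normal = $native eq $upgraded ? 1 : 0;

my $under_bytes;
{
    use bytes;    # lexically scoped: eq now compares the internal bytes
    $under_bytes = $native eq $upgraded ? 1 : 0;
}

print "normal: $normal, under 'use bytes': $under_bytes\n";
```
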
If you aren't using locales, and you call CreateEightByteString under
'use bytes', you will get byte-strings back. If you only mix these with
other byte-strings, they will stay that way.

As long as you're reading and writing to filehandles that have the
correct encoding layer the internal encoding of the strings doesn't
matter to perl code.
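
A sketch of the filehandle point, assuming a temporary file created with the core File::Temp module and a ':raw' layer on both handles. Even a string that is stored internally as UTF-8 is written as 8 bytes, because none of its characters is above 255.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $data = "\301" x 8;
utf8::upgrade($data);            # force the UTF-8 internal representation

my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh, ':raw');            # byte-oriented handle, no encoding layer
print {$fh} $data;
close($fh) or die "close failed: $!";

my $size = -s $path;
print "bytes on disk: $size\n";

open(my $in, '<:raw', $path) or die "open failed: $!";
my $read = do { local $/; <$in> };   # slurp the whole file
close($in);

print "round trip ok\n" if $read eq "\301" x 8;
```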

Joost.
 
Ben Morrow

Quoth Joost Diepenmaat said:
No, you don't need to. The only time the encoding of the strings is
important is when you're passing them to external code as a C-style char*
pointer. Or at least it should be.

I agree, it should be; it's not, however. For instance, under 5.8.8, a
string containing "\xc1" (capital A acute) will match /\w/ if it is
utf8-encoded and not if it isn't. I'm not sure if this is fixed in 5.10;
I'm not sure, either, what the correct fix would be.
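
The inconsistency can be reproduced with a short script. This sketch assumes perl 5.14 or later, where the /u and /a regex modifiers are available to request Unicode or ASCII rules explicitly; the variable names are illustrative.

```perl
use strict;
use warnings;
# Deliberately no "use v5.12" or later here: that would enable the
# unicode_strings feature and hide the behaviour being shown.

my $native   = "\xC1";           # stored as a single byte
my $upgraded = "\xC1";
utf8::upgrade($upgraded);        # same character, UTF-8 encoded internally

my %result;
for my $pair ([native => $native], [upgraded => $upgraded]) {
    my ($label, $s) = @$pair;
    $result{$label} = {
        default => ($s =~ /\w/  ? 1 : 0),  # depends on the internal encoding
        unicode => ($s =~ /\w/u ? 1 : 0),  # Unicode rules for both strings
        ascii   => ($s =~ /\w/a ? 1 : 0),  # ASCII-only rules for both strings
    };
    printf "%-8s default=%d /u=%d /a=%d\n", $label,
        @{ $result{$label} }{qw(default unicode ascii)};
}
```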

Ben
 
Joost Diepenmaat

I agree, it should be; it's not, however. For instance, under 5.8.8, a
string containing "\xc1" (capital A acute) will match /\w/ if it is
utf8-encoded and not if it isn't. I'm not sure if this is fixed in 5.10;
I'm not sure, either, what the correct fix would be.

Hm... You're right. That's not good, and arguably a bug (and with some
heavy backward-compatibility considerations). It doesn't say anything
about binary strings, though.

I'll give it a shot in the latest blead perl (once I've got and compiled
it) to see what's what.

Joost.
 
Dr.Ruud

Joost Diepenmaat schreef:
Ben Morrow:

Hm... You're right. That's not good, and arguably a bug (and with some
heavy backward-compatibility considerations). It doesn't say anything
about binary strings, though.

"\xA0" is whitespace in iso-8859-1:

perl -wle'
print 0+/^[\s\w]/
for "\xC1",
"\xC1\x{100}",
"\xA0",
"\xA0\x{100}",
substr("\xA0\x{100}", 0, 1),
'
0
1
0
1
1
 
Ben Morrow

Quoth Dr.Ruud said:
Joost Diepenmaat schreef:

IMHO definitely a bug; IIRC those on p5p in a position to fix it agree
in principle, but it's harder than it seems, especially given
compatibility.

Err... no, I suppose not, unless you were expecting /\w/ to mean the
same as /[a-zA-Z0-9_]/ as it does under 'use bytes'.
"\xA0" is whitespace in iso-8859-1:
[and matches /\s/ iff utf8]

I'm confused... what does this add to what's already been said?

Ben
 
Michele Dondi

I'll give it a shot in the latest blead perl (once I've got and compiled
it) to see what's what.

No need for blead. 5.10 beta is out - or is that still "blead"?


Michele
 
