comparing binary strings

Yakov · Dec 10, 2007

Let's say I create binary strings using pack() or like this:

sub CreateEightByteString { return $_[0]; }
$a = CreateEightByteString("\301" x 8);
$b = CreateEightByteString("\277" x 8);

and I need to ensure that comparisons (lt, gt) of those strings
do not depend on the locale and are not interpreted as UTF-8.

How can I "label" this string as "raw binary" in CreateEightByteString
so that any subsequent comparison treats it as 8 raw bytes,
independent of the program's locale?

Y.L.

Ben Morrow · Dec 10, 2007

Quoth Yakov:
Let's say I create binary strings using pack() or like this:

sub CreateEightByteString { return $_[0]; }

Uh... what's the point of this?

$a = CreateEightByteString("\301" x 8);
$b = CreateEightByteString("\277" x 8);

Don't use $a and $b, they are magic.
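A minimal sketch of why $a and $b are best left alone: sort() uses them as the package variables that hold the two elements being compared. The strings from the question are kept in $first and $second instead, which are hypothetical names chosen only for illustration.

```perl
use strict;
use warnings;

# $a and $b are the package variables that sort() aliases to the two
# elements being compared, which is also why 'use strict' never
# complains about them.
my @sorted = sort { $a cmp $b } ("\301" x 8, "\277" x 8);

# Hypothetical names for the two strings from the question.
my $first  = "\301" x 8;
my $second = "\277" x 8;

# With no locale in effect, gt compares character by character.
my $order = $first gt $second ? "after" : "before";
print "first sorts $order second\n";
```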

and I need to ensure that comparisons (lt, gt) of those strings
do not depend on the locale and are not interpreted as UTF-8.

Don't attempt to mix POSIX locales and Perl's UTF8 support. The two
don't play well together at all yet.

How can I "label" this string as "raw binary" in CreateEightByteString
so that any subsequent comparison treats it as 8 raw bytes,
independent of the program's locale?

You can't. You need to perform *the comparisons* under 'use bytes'.

If you aren't using locales, and you call CreateEightByteString under
'use bytes', you will get byte-strings back. If you only mix these with
other byte-strings, they will stay that way.
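A minimal sketch of performing the comparison itself under 'use bytes', assuming no locale is in effect. The helper name bytes_cmp is hypothetical; the point is only that the pragma is lexically scoped, so it can be confined to the one place where the comparison happens.

```perl
use strict;
use warnings;

# Hypothetical helper: compare two strings by their internal bytes.
# 'use bytes' is lexically scoped, so it only affects this sub body.
sub bytes_cmp {
    my ($x, $y) = @_;
    use bytes;
    return $x cmp $y;
}

my $hi = "\301" x 8;   # eight 0xC1 bytes
my $lo = "\277" x 8;   # eight 0xBF bytes

print bytes_cmp($hi, $lo), "\n";
```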

Ben

Joost Diepenmaat · Dec 10, 2007

Don't use locales then. For the purposes of lt and gt and eq it's
irrelevant whether the strings are utf-8 encoded or not.

If you want to be able to output them as bytes, just don't insert any
chars > 255.
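A small sketch of that claim, assuming neither 'use locale' nor 'use bytes' is in effect: two copies of the same string, one of them forced into the UTF-8 internal representation with utf8::upgrade, compare identically. The variable names are hypothetical.

```perl
use strict;
use warnings;

my $plain    = "\301" x 8;
my $upgraded = "\301" x 8;
utf8::upgrade($upgraded);   # same characters, UTF-8 internal representation

my $other = "\277" x 8;

# eq, lt and gt work on characters, not on the internal encoding.
print "equal\n"         if $plain eq $upgraded;
print "same ordering\n" if ($plain gt $other) == ($upgraded gt $other);
```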

Don't attempt to mix POSIX locales and Perl's UTF8 support. The two
don't play well together at all yet.
Yup.

You can't. You need to perform *the comparisons* under 'use bytes'.

No, you don't need to. The only time the encoding of the strings is
important is when you're passing them to external code as a C-style char*
pointer. Or at least it should be.

In general, I'd say: don't use bytes. It breaks stuff - for instance, it
makes it impossible to compare a utf-8 encoded string with a non-utf8
encoded one.
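A sketch of that breakage, using hypothetical variable names: the same one-character string compares equal normally, but not under 'use bytes', because the pragma compares the internal representations, and the upgraded copy is stored as two bytes.

```perl
use strict;
use warnings;

my $plain    = "\301";      # stored as the single byte 0xC1
my $upgraded = "\301";
utf8::upgrade($upgraded);   # stored internally as the bytes 0xC3 0x81

my $normal = $plain eq $upgraded ? 1 : 0;

my $under_bytes;
{
    use bytes;              # lexically scoped to this block
    $under_bytes = $plain eq $upgraded ? 1 : 0;
}

print "without bytes: $normal, under bytes: $under_bytes\n";
```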

If you aren't using locales, and you call CreateEightByteString under
'use bytes', you will get byte-strings back. If you only mix these with
other byte-strings, they will stay that way.

As long as you're reading and writing to filehandles that have the
correct encoding layer the internal encoding of the strings doesn't
matter to perl code.
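A sketch of the filehandle point, assuming a raw (binary) layer and no characters above 255. The helper written_bytes is hypothetical; it writes into an in-memory file so the bytes that come out can be inspected.

```perl
use strict;
use warnings;

my $plain    = "\301" x 8;
my $upgraded = "\301" x 8;
utf8::upgrade($upgraded);   # same characters, different internal storage

# Hypothetical helper: write a string through a :raw layer into an
# in-memory file and return the bytes that were written.
sub written_bytes {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:raw', \$buffer or die "open failed: $!";
    print {$fh} $string;
    close $fh or die "close failed: $!";
    return $buffer;
}

print length(written_bytes($plain)), " ",
      length(written_bytes($upgraded)), "\n";
```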

Joost.

Ben Morrow · Dec 10, 2007

Quoth Joost Diepenmaat:
No, you don't need to. The only time the encoding of the strings is
important is when you're passing them to external code as a C-style char*
pointer. Or at least it should be.

I agree, it should be; it's not, however. For instance, under 5.8.8, a
string containing "\xc1" (capital A acute) will match /\w/ if it is
utf8-encoded and not if it's not. I'm not sure if this is fixed in 5.10;
I'm not sure, either, what the correct fix would be.
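A sketch that reproduces the difference described above, assuming no locale and no pragma that changes regex semantics is in effect. utf8::upgrade is used here only to force the UTF-8 internal representation; the variable names are hypothetical.

```perl
use strict;
use warnings;

my $plain    = "\xC1";      # one character, stored as a single byte
my $upgraded = "\xC1";
utf8::upgrade($upgraded);   # same character, stored as UTF-8 internally

my $plain_matches    = $plain    =~ /\w/ ? 1 : 0;
my $upgraded_matches = $upgraded =~ /\w/ ? 1 : 0;

# The two strings are equal as far as eq is concerned, yet /\w/
# can give different answers for them.
print "same string: ", ($plain eq $upgraded ? "yes" : "no"), "\n";
print "plain matches \\w: $plain_matches\n";
print "upgraded matches \\w: $upgraded_matches\n";
```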

Ben

Joost Diepenmaat · Dec 10, 2007

I agree, it should be; it's not, however. For instance, under 5.8.8, a
string containing "\xc1" (capital A acute) will match /\w/ if it is
utf8-encoded and not if it's not. I'm not sure if this is fixed in 5.10;
I'm not sure, either, what the correct fix would be.

Hm... You're right. That's not good, and arguably a bug (and with some
heavy backward-compatibility considerations). It doesn't say anything
about binary strings, though.

I'll give it a shot in the latest blead perl (once I've got and compiled
it) to see what's what.

Joost.

Joost Diepenmaat · Dec 10, 2007

I'll give it a shot in the latest blead perl (once I've got and compiled
it) to see what's what.

:-/ Perl 5.10 rc-2 still does the same.

Joost.

Dr.Ruud · Dec 11, 2007

Joost Diepenmaat schreef:

Ben Morrow:

Hm... You're right. That's not good, and arguably a bug (and with some
heavy backward-compatibility considerations). It doesn't say anything
about binary strings, though.

"\xA0" is whitespace in iso-8859-1:

perl -wle'
print 0+/^[\s\w]/
for "\xC1",
"\xC1\x{100}",
"\xA0",
"\xA0\x{100}",
substr("\xA0\x{100}", 0, 1),
'
0
1
0
1
1

Ben Morrow · Dec 11, 2007

Quoth Dr.Ruud:
Joost Diepenmaat schreef:

IMHO definitely a bug; IIRC those on p5p in a position to fix it agree
in principle, but it's harder than it seems, especially given
compatibility.

Err... no, I suppose not, unless you were expecting /\w/ to mean the
same as /[a-zA-Z0-9_]/ as it does under 'use bytes'.

"\xA0" is whitespace in iso-8859-1:

[and matches /\s/ iff utf8]

I'm confused... what does this add to what's already been said?

Ben

Michele Dondi · Dec 11, 2007

I'll give it a shot in the latest blead perl (once I've got and compiled
it) to see what's what.

No need for blead. 5.10 beta is out - or is that still "blead"?

Michele
