unicode: equal strings give different results?

peter pilsl · Sep 27, 2004

perl 5.8.5

Does a string hold any extra information beyond its characters?
I managed to create two strings that are equal according to the 'eq' operator
and have equal ord values for every character, but give different results when
fed to the very same subroutine. It seems one of the two strings does
not fully know that it is actually Unicode. (length gives the correct
result; wrong lengths are usually a first hint that a string is not
being treated as Unicode.)

I didn't manage to produce a really short example; the whole script is
46 lines and uses CGI.

I read a text (one char per line) from a CGI field (UTF-8) and print out
the sorted text. The sorting is supposed to follow German locale rules,
so I use the locale pragma (which is much faster than Unicode::Collate).
The sort order of my output is wrong, however. I manually included a
possible input as a reference string in the script, and for that string the
output is correct. If I enter the reference string in my text field, the
output is still wrong, although the two strings are exactly the same
according to 'eq' and the hex dump.
If I do a chr(ord($_)) on all chars, the result is OK again.
So obviously I am missing something very important about Unicode here. Some
extra information is stored somewhere, but I don't know about it.

The example is online at

http://www.customers.goldfisch.at/cgi-bin/unicodetest9.pl

If you enter (one per line; don't forget the final newline after p)
ä
b
ö
a
o
p
in the form, you'll produce the same input as the reference string, but
you will see different results.

Where am I stuck?

thnx,
peter

---------------------------------------------------------
#!/usr/local/bin/perl -w
use strict;

# step1: prepare for german locales
use POSIX qw(locale_h);
use locale;
setlocale(LC_COLLATE, "de_AT");

# step2: prepare for unicode
binmode(STDOUT,":utf8");
binmode(STDIN,":utf8");

# step3: prepare for CGI
use CGI;
my $query = new CGI;
my $charset = 'UTF-8';
$CGI::XHTML= 0;
print
$query->header(-charset=>$charset),$query->start_html(-title=>'Unicodetest');
print "cgi-version = ",$CGI::VERSION,"<br><br>\n";

# set reference-string
my $sr=("\x{00e4}\n\x{0062}\n\x{00f6}\n\x{0061}\n\x{006f}\n\x{0070}\n");

# stepA : get unicode and print it
print "<h4>your input</h4>";
my $si=$query->param('unicode');
$si=~s/\r//g;
#my $sin='';foreach(0..length($si)-1){$sin.=chr(ord(substr($si,$_,1)))};$si=$sin;
print_and_sort($si);

# stepB : get reference and print it
print "<h4>reference</h4>";
print "(input and reference are considered equal)<br>" if $si eq $sr;
$sr=~s/\r//g;
print_and_sort($sr);

# stepC : print text-field and finish CGI
print '<br><br>enter your unicode-testtext here :
',$query->start_multipart_form,
$query->textarea(-name=>'unicode',-rows=>10,-columns=>100),
"\n<br>\n",
$query->submit(-name=>'submit',-value=>'proceed'),"\n",
$query->endform,"\n";
print $query->end_html;

# sub : get a string, print its ord, split it by its linebreaks and then
# sort the data and print it out
sub print_and_sort {
my $s=shift;
print "hexdump : ";
foreach my $i (0..length($s)-1) {
print sprintf ("%04x",ord(substr($s,$i,1)))." ";
}
print "<br>\n";
print "<br>sorted:<br>\n";
my @data=split(/\n/,$s);
foreach (sort(@data)) {
print $_;
print "  (length=",length($_),")";
print "  ";
foreach my $j (0..length($_)-1) {
print sprintf ("%04x",ord(substr($_,$j,1)))." ";
}
print "<br>\n";
}

}

Alan J. Flavell · Sep 27, 2004

> perl 5.8.5

haven't got that far yet...

> Does a string hold any extra information additional to its pure characters?
> I managed to create two strings that are equal to the 'eq'-operator and have
> equal ord-values of all characters, but give different results if fed to
> the very same subroutine.

This sounds like an FAQ to me.

> It seems one of the two strings does not know fully
> that it's actually unicode.

At least in the versions of Perl that I've been familiar with, Perl
will not upgrade an iso-8859-1 string to Unicode unless it finds some
reason to do so. This can result in identical strings appearing to
not match. I don't have the references to hand, but I'm sure it's
either an FAQ or in the unicode tutorials.
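
As a rough sketch of how that could apply to the script in the question: one plausible reading of the symptom is that the CGI value arrives with Perl's internal UTF-8 flag set, while the literal reference string (all code points below 0x100) does not. The chr(ord(...)) loop in the commented-out line rebuilds the string character by character, which for such code points yields an unflagged string; utf8::downgrade() does the same in one call. The helper name to_native is hypothetical and not part of the original script, and the sketch assumes the input contains only characters below 0x100.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): return a copy of the
# string in the one-byte-per-character representation when possible.
sub to_native {
    my ($s) = @_;
    # Second argument true: do not die if a character is above 0xFF;
    # in that case the string is left unchanged.
    utf8::downgrade($s, 1);
    return $s;
}

my $input = "\x{e4}\nb\n\x{f6}\n";
utf8::upgrade($input);               # simulate a value that arrived flagged

my $fixed = to_native($input);
print "still equal under eq\n" if $fixed eq $input;
print "input flag:  ", (utf8::is_utf8($input) ? "on" : "off"), "\n";
print "fixed flag:  ", (utf8::is_utf8($fixed) ? "on" : "off"), "\n";
```

In the script this would mean something like $si = to_native($si) before the call to print_and_sort($si). Whether that is the right long-term fix depends on the locale and the Perl version, so treat it as a workaround to test rather than a definitive answer.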

Hope this helps a bit.

Shawn Corey · Sep 27, 2004

peter said:
> perl 5.8.5
>
> Does a string hold any extra information additional to its pure characters?
> I managed to create two strings that are equal to the 'eq'-operator and
> have equal ord-values of all characters, but give different results if
> fed to the very same subroutine. It seems one of the two strings does
> not know fully that it's actually unicode. (length gives the correct
> result. wrong lengths are usually a first hint that the string does not
> feel as unicode)

I haven't read your code but you can start with:

perldoc perluniintro
perldoc perlunicode
perldoc Encode

And yes, there are two types of strings in Perl 5.8+: one is flagged
internally as UTF-8 (utf8::is_utf8() returns true for it), the other is not.
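
A minimal sketch of what that looks like in practice (the variable names and the describe helper are made up for illustration): two strings with the same characters compare equal and have the same length, yet differ in the internal flag.

```perl
use strict;
use warnings;

# Two strings with identical characters: U+00E4 followed by "b".
my $plain    = "\x{e4}b";     # all code points < 0x100: flag stays off
my $upgraded = "\x{e4}b";
utf8::upgrade($upgraded);     # same characters, stored internally as UTF-8

# Illustrative helper: report length and the state of the internal flag.
sub describe {
    my ($s) = @_;
    return sprintf "length=%d flag=%s",
        length($s), (utf8::is_utf8($s) ? "on" : "off");
}

print "equal under eq\n" if $plain eq $upgraded;
print "plain:    ", describe($plain),    "\n";
print "upgraded: ", describe($upgraded), "\n";
```

Both utf8::upgrade() and utf8::is_utf8() are available without a "use utf8" line. The flag is not visible to eq, ord or length, which is why the hex dump in the question shows no difference.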

--- Shawn
