unicode: equal strings give different results?

peter pilsl

perl 5.8.5

Does a string hold any extra information in addition to its characters?
I managed to create two strings that are equal according to the 'eq' operator
and have equal ord values for all characters, but give different results when
fed to the very same subroutine. It seems one of the two strings does
not fully know that it's actually unicode. (length gives the correct
result; wrong lengths are usually a first hint that a string does not
consider itself unicode.)

I didn't manage to produce a really short example; the whole script is
46 lines and includes CGI.

I read a text (one char per line) from a CGI field (UTF-8) and print out
the sorted text. The sorting is supposed to follow the German
locale, so I use the locale pragma (which is way faster than
Unicode::Collate).
The sort order of my output is wrong, however. I manually included a
possible input as a reference in the script, and for that the output is
correct. If I enter the reference string in my text field the output is
still wrong, although the two strings are exactly the same according to 'eq'
and the hex dump.
If I do a chr(ord($_)) on all chars the result is OK again.
So obviously I'm missing something very important about unicode here. Some
extra information is stored somewhere, but I don't know about it.


The example is online at

http://www.customers.goldfisch.at/cgi-bin/unicodetest9.pl

If you enter (one per line, don't forget the final newline after p)
ä
b
ö
a
o
p
into the form, you'll produce the same input as the reference string, but
will see different results.

Where am I stuck?

thnx,
peter

---------------------------------------------------------
#!/usr/local/bin/perl -w
use strict;

# step1: prepare for german locales
use POSIX qw(locale_h);
use locale;
setlocale(LC_COLLATE, "de_AT");

# step2: prepare for unicode
binmode(STDOUT,":utf8");
binmode(STDIN,":utf8");

# step3: prepare for CGI
use CGI;
my $query = new CGI;
my $charset = 'UTF-8';
$CGI::XHTML= 0;
print $query->header(-charset=>$charset),
      $query->start_html(-title=>'Unicodetest');
print "cgi-version = ",$CGI::VERSION,"<br><br>\n";


# set reference-string
my $sr=("\x{00e4}\n\x{0062}\n\x{00f6}\n\x{0061}\n\x{006f}\n\x{0070}\n");

# stepA : get unicode and print it
print "<h4>your input</h4>";
my $si=$query->param('unicode');
$si=~s/\r//g;
# workaround (disabled): rebuilding the string via chr(ord(...)) makes the sort come out right
#my $sin='';foreach(0..length($si)-1){$sin.=chr(ord(substr($si,$_,1)))};$si=$sin;
print_and_sort($si);


# stepB : get reference and print it
print "<h4>reference</h4>";
print "(input and reference are considered equal)<br>" if $si eq $sr;
$sr=~s/\r//g;
print_and_sort($sr);


# stepC : print text-field and finish CGI
print '<br><br>enter your unicode-testtext here :
',$query->start_multipart_form,
$query->textarea(-name=>'unicode',-rows=>10,-columns=>100),
"\n<br>\n",
$query->submit(-name=>'submit',-value=>'proceed'),"\n",
$query->endform,"\n";
print $query->end_html;

# sub : get a string, print its ord, split it by its linebreaks and then
# sort the data and print it out
sub print_and_sort {
    my $s=shift;
    print "hexdump : ";
    foreach my $i (0..length($s)-1) {
        print sprintf ("%04x",ord(substr($s,$i,1)))."&nbsp;";
    }
    print "<br>\n";
    print "<br>sorted:<br>\n";
    my @data=split(/\n/,$s);
    # sort() compares with the current locale because of "use locale"
    foreach (sort(@data)) {
        print $_;
        print "&nbsp;&nbsp;(length=",length($_),")";
        print "&nbsp;&nbsp;";
        foreach my $j (0..length($_)-1) {
            print sprintf ("%04x",ord(substr($_,$j,1)))."&nbsp;";
        }
        print "<br>\n";
    }
}
 
Alan J. Flavell

> perl 5.8.5

haven't got that far yet...

> Does a string hold any extra information in addition to its characters?
> I managed to create two strings that are equal according to the 'eq'
> operator and have equal ord values for all characters, but give different
> results when fed to the very same subroutine.

This sounds like an FAQ to me.

> It seems one of the two strings does not fully know
> that it's actually unicode.

At least in the versions of Perl that I've been familiar with, Perl
will not upgrade an iso-8859-1 string to Unicode unless it finds some
reason to do so. This can result in identical strings appearing to
not match. I don't have the references to hand, but I'm sure it's
either an FAQ or in the unicode tutorials.

Hope this helps a bit.
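
A minimal sketch of the effect described above, for illustration only: the same character can be held in two internal forms, and 'eq', length and ord cannot tell them apart. The variable names are made up for the example; utf8::upgrade and utf8::is_utf8 are core functions that need no "use utf8".

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling the bytes pragma

# Same character (a-umlaut, U+00E4) in two internal representations.
my $plain    = "\x{e4}";   # all chars below 0x100: stored as one byte, UTF8 flag off
my $upgraded = "\x{e4}";
utf8::upgrade($upgraded);  # same character, now stored as UTF-8 with the flag on

printf "eq: %s\n", ($plain eq $upgraded ? "yes" : "no");
printf "length: %d / %d\n", length($plain), length($upgraded);
printf "UTF8 flag: %d / %d\n",
    (utf8::is_utf8($plain) ? 1 : 0), (utf8::is_utf8($upgraded) ? 1 : 0);
printf "internal bytes: %d / %d\n",
    bytes::length($plain), bytes::length($upgraded);
```

The two strings compare equal and have the same length; only the flag and the internal byte count differ. Anything that works on the internal bytes rather than on characters, which may be what locale-aware sorting does in Perl 5.8, could therefore treat them differently.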
 
Shawn Corey

Peter said:
> perl 5.8.5
>
> Does a string hold any extra information in addition to its characters?
> I managed to create two strings that are equal according to the 'eq'
> operator and have equal ord values for all characters, but give different
> results when fed to the very same subroutine. It seems one of the two
> strings does not fully know that it's actually unicode. (length gives the
> correct result; wrong lengths are usually a first hint that a string does
> not consider itself unicode.)

I haven't read your code but you can start with:

perldoc perluniintro
perldoc perlunicode
perldoc Encode

And yes, there are two types of strings in Perl 5.8+, one is_utf8(), the
other not.

--- Shawn
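
One possible way to act on this, sketched under two assumptions that the thread does not confirm: that de_AT is a single-byte ISO-8859-1 locale, and that the sort goes wrong only for strings carrying the UTF8 flag. The idea is to bring every string into byte form before the locale-aware sort. The chr(ord($_)) loop from the original post appears to do this by hand; utf8::downgrade is the core function for it. to_byte_form is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $s in byte (non-UTF8) form,
# or undef if it contains characters above 0xFF that cannot be downgraded.
sub to_byte_form {
    my ($s) = @_;
    # Second argument true: return false instead of dying on failure.
    return utf8::downgrade($s, 1) ? $s : undef;
}

my $input = "\x{e4}\nb\n\x{f6}\na\no\np\n";
utf8::upgrade($input);              # simulate text that arrived with the UTF8 flag on

my $bytes = to_byte_form($input);
defined $bytes or die "input has characters outside ISO-8859-1\n";

# Same characters as before, but the copy no longer carries the UTF8 flag.
print "still equal: ", ($bytes eq $input ? "yes" : "no"), "\n";
print "flag before/after: ",
    (utf8::is_utf8($input) ? 1 : 0), "/", (utf8::is_utf8($bytes) ? 1 : 0), "\n";
```

In the script from the original post, the downgraded copy would be passed to print_and_sort in place of $si. Text with characters above U+00FF cannot take this route and would need something like Unicode::Collate instead.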
 
