peter pilsl
perl 5.8.5
Does a string hold any extra information in addition to its pure characters?
I managed to create two strings that are equal according to the 'eq' operator
and have equal ord values for all characters, but give different results when
fed to the very same subroutine. It seems one of the two strings does
not fully know that it is actually Unicode. (length gives the correct
result; a wrong length is usually the first hint that a string does not
consider itself Unicode.)
I didn't manage to produce a really short example; the whole script is
46 lines and includes CGI.
I read a text (one char per line) from a CGI field (UTF-8) and print out
the sorted text. The sorting is supposed to follow the German locale
rules, so I use the locale pragma (which is much faster than
Unicode::Collate).
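For comparison, here is a minimal sketch of the slower alternative mentioned above. It assumes the core module Unicode::Collate with its default collation table, which is not tailored to German; the variable names are only illustrative and do not come from my script.

```perl
use strict;
use warnings;
use Unicode::Collate;

# The same six characters as in the test input, given as code points.
my @chars = ("\x{00e4}", "b", "\x{00f6}", "a", "o", "p");

# Default collator (no locale tailoring): umlauts differ from their
# base letters only at the secondary level, so they sort next to them.
my $collator = Unicode::Collate->new;
my @sorted   = $collator->sort(@chars);

# Print the code points of the sorted characters as a hex dump.
my $hex = join ' ', map { sprintf '%04x', ord } @sorted;
print "$hex\n";
```

With the default table this should place each umlaut directly after its base letter (a, ä, b, o, ö, p), independent of the locale settings.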
However, the sort order of my output is wrong. I manually included a
possible input as a reference string in the script, and for that string
the output is correct. If I enter the reference string in my text field,
the output is still wrong, even though the two strings are exactly the
same according to 'eq' and the hex dump.
If I do a chr(ord($_)) on all chars, the result is OK again.
So obviously I am missing something very important about Unicode here. Some
extra information is stored somewhere, but I don't know about it.
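A minimal sketch of the kind of hidden state I suspect, under the assumption (mine, not confirmed) that the difference is Perl's internal UTF8 flag. The functions utf8::is_utf8 and utf8::upgrade are built in; the variable names are made up for this illustration, and utf8::upgrade only stands in for whatever happens to the CGI input.

```perl
use strict;
use warnings;

# Reference string: all code points are below 0x100, so Perl stores it
# in its one-byte-per-character form (UTF8 flag off).
my $ref = "\x{00e4}\n\x{0062}\n";

# Same characters, but forced into the internal UTF-8 form (flag on).
my $in = $ref;
utf8::upgrade($in);

# 'eq' compares characters, not the internal representation.
my $same     = ($in eq $ref)        ? 1 : 0;
my $flag_ref = utf8::is_utf8($ref)  ? 1 : 0;
my $flag_in  = utf8::is_utf8($in)   ? 1 : 0;

# Rebuilding the string with chr(ord(...)) yields the flag-off form
# again, because every code point here is below 0x100.
my $rebuilt      = join '', map { chr(ord($_)) } split //, $in;
my $flag_rebuilt = utf8::is_utf8($rebuilt) ? 1 : 0;

print "eq: $same\n";
print "flag ref: $flag_ref, flag in: $flag_in, flag rebuilt: $flag_rebuilt\n";
```

If this assumption holds, it would explain why 'eq' and the hex dump see no difference while code that depends on the internal representation (such as sorting under the locale pragma) might behave differently.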
The example is online at
http://www.customers.goldfisch.at/cgi-bin/unicodetest9.pl
If you enter the following (one character per line; don't forget the final
newline after p)
ä
b
ö
a
o
p
in the form, you'll produce the same input as the reference string, but
will see different results.
Where am I stuck?
Thanks,
peter
---------------------------------------------------------
#!/usr/local/bin/perl -w
use strict;
# step1: prepare for german locales
use POSIX qw(locale_h);
use locale;
setlocale(LC_COLLATE, "de_AT");
# step2: prepare for unicode
binmode(STDOUT,":utf8");
binmode(STDIN,":utf8");
# step3: prepare for CGI
use CGI;
my $query = new CGI;
my $charset = 'UTF-8';
$CGI::XHTML= 0;
print $query->header(-charset=>$charset),
      $query->start_html(-title=>'Unicodetest');
print "cgi-version = ",$CGI::VERSION,"<br><br>\n";
# set reference-string
my $sr=("\x{00e4}\n\x{0062}\n\x{00f6}\n\x{0061}\n\x{006f}\n\x{0070}\n");
# stepA : get unicode and print it
print "<h4>your input</h4>";
my $si=$query->param('unicode');
$si='' unless defined $si;   # no input yet on the first page load
$si=~s/\r//g;
# workaround (commented out): rebuilding the string via chr(ord()) fixes the sort order
#my $sin='';foreach(0..length($si)-1){$sin.=chr(ord(substr($si,$_,1)))};$si=$sin;
print_and_sort($si);
# stepB : get reference and print it
print "<h4>reference</h4>";
print "(input and reference are considered equal)<br>" if $si eq $sr;
$sr=~s/\r//g;
print_and_sort($sr);
# stepC : print text-field and finish CGI
print '<br><br>enter your unicode-testtext here : ',
      $query->start_multipart_form,
      $query->textarea(-name=>'unicode',-rows=>10,-columns=>100),
      "\n<br>\n",
      $query->submit(-name=>'submit',-value=>'proceed'),"\n",
      $query->endform,"\n";
print $query->end_html;
# sub : get a string, print its ord, split it by its linebreaks and then
# sort the data and print it out
sub print_and_sort {
    my $s=shift;
    print "hexdump : ";
    foreach my $i (0..length($s)-1) {
        print sprintf ("%04x",ord(substr($s,$i,1)))." ";
    }
    print "<br>\n";
    print "<br>sorted:<br>\n";
    my @data=split(/\n/,$s);
    foreach (sort(@data)) {
        print $_;
        print " (length=",length($_),")";
        print " ";
        foreach my $j (0..length($_)-1) {
            print sprintf ("%04x",ord(substr($_,$j,1)))." ";
        }
        print "<br>\n";
    }
}