unicode newbie, can you help?

A

alexxx.magni

sorry if it's a naive question, but I never needed up to now to deal
with anything beyond a-zA-Z0-9....

now I have a long text with standard ascii and, sometimes, cyrillic
text - and I want to filter it out (the cyrillic, of course).
I tried with

perl -wne 'foreach($_){if (/(\p{InCyrillic})/){print"record $_ matches
but it fails to recognize it - what am I doing wrong?


thanks for any help...


alessandro
 
R

RedGrittyBrick

sorry if it's a naive question, but I never needed up to now to deal
with anything beyond a-zA-Z0-9....

now I have a long text with standard ascii and, sometimes, cyrillic
text - and I want to filter it out (the cyrillic, of course).
I tried with

perl -wne 'foreach($_){if (/(\p{InCyrillic})/){print"record $_ matches

but it fails to recognize it - what am I doing wrong?

What you might be doing wrong is not telling perl which encoding is used
in test.html!

Your test.htm might be encoded in UTF-8, KO18(?) or something else. My
Perl installation might assume ISO-8859-1 and therefore not /see/ any
Cyrillic characters. YMMV
 
J

Jürgen Exner

sorry if it's a naive question, but I never needed up to now to deal
with anything beyond a-zA-Z0-9....

Note: please remember latin resp. English characters are part of
Unicode, too.
now I have a long text with standard ascii and, sometimes, cyrillic
text - and I want to filter it out (the cyrillic, of course).

What specific encoding (charset, code page, ....) are you using and did
you tell Perl about that encoding or where you reading the text as byte
strings? Just saying "Unicode" is insufficient, because there are
several different enocdings for unicode, e.g. UTF-8, UTF-16, UCS-2 to
name just the most common.

jue
 
J

Jochen Lehmeier

now I have a long text with standard ascii and, sometimes, cyrillic
text - and I want to filter it out (the cyrillic, of course).
I tried with

perl -wne 'foreach($_){if (/(\p{InCyrillic})/){print"record $_ matches

but it fails to recognize it - what am I doing wrong?

You are not telling Perl to expect utf8.

Try to add "use Encoding;" and "$_ = decode_utf8($_)" (and maybe "binmode
STDOUT,':utf8'" to avoid a warning "wide character..." while printing) at
the appropriate places; see "perldoc Encode" and "perldoc -f binmode" for
more details on this.
 
A

alexxx.magni

You are not telling Perl to expect utf8.

Try to add "use Encoding;" and "$_ = decode_utf8($_)" (and maybe "binmode  
STDOUT,':utf8'" to avoid a warning "wide character..." while printing) at 
the appropriate places; see "perldoc Encode" and "perldoc -f binmode" for 
more details on this.

wow, so many things I didnt know... there isnt any utility able to
detect the coding of a file ?

I tried your suggestion, but discovered I do not even have Encoding.pm
installed - I need to do a bit of homework I guess...

thanks!

alessandro
 
J

Jochen Lehmeier

wow, so many things I didnt know... there isnt any utility able to
detect the coding of a file ?

There is Encode::Guess, but, again, you'd have to tell perl to use that.
It is not possible to reliably detect the encoding of a file, so Perl does
not try to do so by default. In other words, it does not change the bytes
it reads or writes unless you tell it to, generally.

Do you know the encoding of your input files? If it is utf8, first try to
get your code to work with the minimum amount of additions (for example,
those I gave you); you can fiddle around with Encode::Guess afterwards. If
it is not utf8... well... substitute whatever encoding it is. If you don't
know the encoding, it will become complicated.
I tried your suggestion, but discovered I do not even have Encoding.pm
installed - I need to do a bit of homework I guess...

Sorry, it's "use Encode" instead of "use Encoding". I assume you are on a
unix/linux machine; "perldoc Encode" should bring up its documentation.
 
J

Jürgen Exner

there isnt any utility able to
detect the coding of a file ?

Impossible.
At the very best you can make an educated guess, e.g. based on
- specific characteristics of the file format (if it has a byte order
mark then probably it is UTF-16 or UCS-2; this lead double byte is
unlikly to occur in any other encoding as the lead character(s) except
in Windows UTF-8),
- statistical analysis (if I assume this is Windows-1251 then I get a
character distribution that is consistent with the Serbian language)
- text analysis (if I assume this file as encoded in ISO 639-3 then most
of the resulting words can be found in a Farsi dictionary)

Obviously none of these methods are conclusive.

jue
 
S

sln

sorry if it's a naive question, but I never needed up to now to deal
with anything beyond a-zA-Z0-9....

now I have a long text with standard ascii and, sometimes, cyrillic
text - and I want to filter it out (the cyrillic, of course).
I tried with

perl -wne 'foreach($_){if (/(\p{InCyrillic})/){print"record $_ matches

but it fails to recognize it - what am I doing wrong?


thanks for any help...


alessandro

Try this on your file and post the output.
Assign $fname = 'your file' and run it.

-sln

# File: bomtest.pl
#
# CheckUTF() -
# UTF-8/16/32 check, Will add a guessed utf-16/32 layer
# if there is no current encoding() layer.
# Opened filehandle should be in byte context.
# This is more for automatically adding a utf layer on
# unknown files opened in byte mode.
# Guesses utf-8/16/32 encodings, BOMb'd or not.
# Uses Encode::Guess module. Will use standard guess settings (see docs).
# Used standalone or optionally with Get_BOM().
#
# Get_BOM() -
# SEEK_SET workaround for BOMb'd Encoding(utf-) layer Bug
# and a general way to get the BOM and offset for subsequent
# seeks.
# The file can be opened using any encoding. The handle passed
# is duped, that handle set to ':raw'. No character semantics
# are used when analysing the BOM.
# This will also detect a mismatch in width between the
# detected BOM and the current opened encoding layer.
# Used standalone or optionally with CheckUTF().
# CheckUTF() has an option to get the bom on the callers
# behalf, combining steps.
#
# test_surrogate_pair() -
# Just a utility function to print hi/lo surrogate pairs
# given a unicode test character (> 0x10000).
#
#
# Modify the top variables to test various configurations.
# Includes:
# $fname File name for output and input
# $write_file Flag 0/1 to create a test file
# (0 to not write, just read file $fname)
# $data Data string for writing test file, can be unicode
# $enc Encoding to read/write file $fname
# $test_surr Stand-alone surrogate pair test
#
# -sln, 8/13/09
# --------------------------------------------------------

use strict;
use warnings;

binmode (STDOUT, ':utf8');

{
# my $fname = 'dummy.txt';

my $fname = 'c:\temp\XML\CollectedData_1804.xml';

my $Unichar = "\x{21000}";

my $data = "\x{1}\x{2}$Unichar";
my $enc = ''; # blank or UTF-xxLE/BE or :bytes, :utf8, :raw, etc ...
my $enc_string = ($enc =~ /:/) ? "$enc" : (length($enc) ? "encoding($enc)" : '');
my $write_file = 0;
unless ($write_file eq 'dummy.txt') { $write_file = 0; }
my $test_surr = 0;

my ($fh, $line);
my ($width,$encfound,$bom,$offset);

# Test surrogate pairs ..
##
test_surrogate_pair($Unichar) if $test_surr;


# Create a test file with encoding ..
##
if ($write_file) {
print "\nWriting sample data as '$enc' to $fname\n",'-'x20,"\n";
open $fh, ">$enc_string", $fname or die "can't open $fname for write $!";

print $fh $data;

close $fh;
} else {
print "\nNOT WRITING to $fname !!\n",'-'x20,"\n";
}

# Read test file with encoding ..
##
print "\nReading $fname as '$enc'\n",'-'x20,"\n";
open $fh, "<$enc_string", $fname or die "can't open $fname for read $!";

$offset = CheckUTF($fh, 'size'=> 100, 'verbose'=> 1, 'getbom_v'=> 1);
print "\n";

for (1 .. 3) {
print "Pass $_: ";
seek($fh, $offset, 0); # SEEK_SET
my $fpos = tell($fh);
my $ok = ($fpos == $offset) ? 'ok' : 'bad';
$line = <$fh>;
print "seek $offset/$fpos $ok, data (".length($line).") = ".ordsplit($line)."\n";
print "$line\n";
}
close $fh;

# Read test file with encoding into scalar, open handle to scalar as :utf8 ..
##
print "\nBuffering $fname as '$enc', open/read buffer as ':utf8'\n",'-'x20,"\n";
open $fh, "<$enc_string", $fname or die "can't open $fname for read $!";

$offset = CheckUTF($fh, 'size'=> 100, 'verbose'=> 1, 'getbom_v'=> 1);
seek($fh, $offset, 0); # SEEK_SET

my $membuf = <$fh>;
close $fh;

open $fh, "<:utf8", \$membuf or die "can't open \$membuf for read $!";

$offset = 0;
print "\n";

for (1 .. 3) {
print "Pass $_: ";
seek($fh, $offset, 0); # SEEK_SET
my $fpos = tell($fh);
my $ok = ($fpos == $offset) ? 'ok' : 'bad';
$line = <$fh>;
print "seek $offset/$fpos $ok, data (".length($line).") = ".ordsplit($line)."\n";
}
close $fh;

# Read test file in byte mode ..
##
print "\nReading $fname as bytes\n",'-'x20,"\n";
open $fh, '<', $fname or die "can't open $fname for read $!";

$offset = 0;

for (1 .. 3) {
print "Pass $_: ";
seek($fh, $offset, 0); # SEEK_SET
my $fpos = tell($fh);
my $ok = ($fpos == $offset) ? 'ok' : 'bad';
$line = <$fh>;
print "seek $offset/$fpos $ok, data (".length($line).") = ";
for (map {ord $_} split //, $line) {
printf ("%x ",$_);
}
print "\n";
}
close $fh;

}
exit 0;

##
## End of program ...
##

sub ordsplit
{
my $string = shift;
my $buf = '';
for (map {ord $_} split //, $string) {
$buf.= sprintf ("%x ",$_);
}
return $buf;
}

sub Get_BOM
{
my ($fh_orig, $verbose) = @_;
$verbose = 0 unless (defined $verbose and lc($verbose) eq 'v');
use Encode;

my %bom2enc = (
map { encode($_, "\x{feff}") => $_ } qw(
UTF-8
UTF-16BE
UTF-16LE
UTF-32BE
UTF-32LE
)
);
my %enc2bom = (
reverse(%bom2enc),
map { $_ => encode($_, "\x{feff}") } qw(
UCS-2
iso-10646-1
utf8
)
);
my @bombs = sort { length $b <=> length $a } keys %bom2enc;
my $MAX_BOM_LENGTH = length $bombs[0];
my $bomstr = join '|', @bombs;
my $bom_re = qr/^($bomstr)/o;
my @arlays = PerlIO::get_layers($fh_orig);
my $layers = "@arlays";

print "Get_BOM() - Layers = $layers\n" if $verbose;

if (defined($fh_orig) && (ref($fh_orig) eq 'GLOB' || ref(\$fh_orig) eq 'GLOB'))
{
# Dup the passed in handle, binmode it to raw
# We will ONLY do byte semantics !!
# --------------------------------------------------
open my $fh_dup, "<&", $fh_orig or die "can't dup filehandle: $!";
binmode ($fh_dup, ":raw");

if (seek($fh_dup, 0, 0)) # SEEK_SET
{
my $sample = '';

## Read in $MAX_BOM_LENGTH characters.
## As a check, store character/bytes read based
## on the differential file position per read.
## Since our dup handle is in byte mode, the bytes
## per character will always be 1.
## This is a hold over when there could be encoded
## characters and is not necessary.
## -----------------------------------------------
my $fpos = tell($fh_dup);
my @samparray = ();
print " * Reading up to $MAX_BOM_LENGTH characters\n" if $verbose;

for (1 .. $MAX_BOM_LENGTH)
{
my $count = read ($fh_dup, my $buff, 1);
last if (!defined $count or $count != 1);
push @samparray, {'char'=> $buff, 'width'=> (tell($fh_dup)-$fpos)};
$fpos = tell($fh_dup);
}
$sample = '';
for my $href (@samparray) {
$sample .= $href->{'char'};
printf (" - char %x , bytes = %d\n", ord($href->{'char'}), $href->{'width'}) if $verbose;
}
if ($verbose) {
print " - Read ".length($sample)." characters. Position = $fpos\n";
print " - Sample (".length($sample).") = ";
for (map {ord $_} split //, $sample) {
printf ("%02x ", $_);
}
print "\n * Analysing bytes\n";
}
## Convert the 'characters' to bytes.
## Only process $MAX_BOM_LENGTH bytes.
## We only care about the BOM, which will match
## the encoding bytes we already have in %bom2enc.
## -----------------------------------------------
my ($octets, $numb_octets, $vecoffset) = ('',0,0);

for my $href (@samparray)
{
my @ar = ();
my ($charwidth, $ordchar) = (
$href->{'width'},
ord( $href->{'char'})
);
last if (($numb_octets + $charwidth) > $MAX_BOM_LENGTH);
$numb_octets += $charwidth;

while ($ordchar > 0) {
push @ar, $ordchar & 0xff;
$ordchar >>= 8;
--$charwidth;
}
push (@ar,0) while ($charwidth-- > 0);
for (reverse @ar) {
vec ($octets, $vecoffset++, 8) = $_;
}
}
if ($verbose) {
print " BOMS avail = ";
for my $bomavail (@bombs) {
print '( ';
for (map {ord $_} split //, $bomavail) {
printf ("%02x",$_);
}
print ' ) ';
}
print "\n - Bom bytes from sample (".length($octets).") = ";
for (map {ord $_} split //, $octets) {
printf ("%02x ", $_);
}
print "\n";
}
# end convert

my ($bomenc, $bom, $bomwidth, $bomoffset) = ('','','',0);
if (my ($found) = $octets =~ /$bom_re/) {

($bomenc, $bomoffset) = (
$bom2enc{$found},
length($found)
);
for (map {ord $_} split //, $found) {
$bom .= sprintf ("%02x", $_);
}
print " * Found $bomenc, BOM = $bom\n" if ($verbose);

my $noendian = $bomenc;
$noendian =~ s/(?:LE|BE)$//i;
$bomwidth = ($layers =~ /$noendian/i) ? 'ok' : 'mismatch';
print " - Does not match width of current encoding layer\n" if ($verbose and $bomwidth ne 'ok');
} else {
print " * BOM Not found\n" if $verbose;
}

close ($fh_dup);

# Original handle -> Must SEEK_SET/read twice to
# mitigate UTF-16/32 BOM seek bug
# ----------------------------------------------
seek($fh_orig, 0, 0); # SEEK_SET
read ($fh_orig, $sample, 1); # read
seek($fh_orig, 0, 0); # SEEK_SET last. Caller to seek past BOM.

return ($bomenc, $bom, $bomwidth, $bomoffset);
}
close ($fh_dup);

# seek failed, fall through
}
# There was an error ..
if ($verbose) {
print " * Invalid filehandle or file is unseekable\n - Utf BOMs available (max length = $MAX_BOM_LENGTH):\n";
while (my ($key,$val) = each %enc2bom)
{
my $valstring = '';
for (split //, $val) {
$valstring .= sprintf ("%x ",ord $_);
}
print " $key = $bom2enc{$val} = $valstring\n";
#print " $key = $bom2enc{$val} = '$val' = '$enc2bom{$key}' = $valstring\n";
}
}
return ('','','',-1);
}

sub CheckUTF
{
# UTF-8/16/32 check
# Will add a layer if there is no current
# encoding() layer. So open filehandle should
# be in byte context.
# ----------------------------------------
# Result value:
# -1 - not filehandle (or undef'ed)
# 0 - read error or empty file
# 1 - already have encoding layer (could force change later)
# 2 - guess_encoder() error message
# 3 - do nothing unless utf-16/32(be/le)
# 4 - added a layer

my ($fh_check,@args) = @_;
my %parm = ('verbose'=>0, 'size'=>60, 'getbom' =>0, 'bomopts'=>'');
while (my ($name, $val) = splice (@args, 0, 2))
{
$name =~ s/^\s+|\s+$//;
$name = lc $name;
next if not defined $val;

if ( $name =~ /^getbom(?:_(\w*))?/ and $val) {
$parm{'gbom'} = 1;
$parm{'bomopts'} = $1 if defined($1);
}
elsif ($name eq 'verbose' || $name eq 'size' && $val > 4) {
$parm{$name} = $val;
}
}
# 0 1 2 3 4 5 6
my ($Res, $Utfmsg, $layers, $bomenc, $bom, $bomwidth, $bomoffset) = (
0, 'UTF Check', '', '', '', '', 0);
$layers = ':'.join (':', PerlIO::get_layers($fh_check)).':';

if (!defined($fh_check) || (ref($fh_check) ne 'GLOB' && ref(\$fh_check) ne 'GLOB')) {
$Utfmsg .= ", not a filehandle";
print "$Utfmsg\n" if $parm{'verbose'};
$Res = -1;
}
elsif ($layers =~ /:encoding/) {
$Utfmsg .= ", already have encoding layer";
$Res = 1;
}
else {
my ($count, $sample);
my $utf8layer = $layers =~ /:utf8/;

binmode ($fh_check,":bytes");
seek ($fh_check, 0, 0); # SEEK_SET

# Try to read a large sample
if (!defined($count = read ($fh_check,$sample,$parm{'size'},0))) {
$Utfmsg .= ". $!"; # read error
} elsif ($count <= 0) {
$Utfmsg .= ". File is empty";
} else {
seek ($fh_check, 0, 0); # SEEK_SET
use Encode::Guess;

my $decoder = guess_encoding ($sample); # ascii/utf8/BOMed UTF
$Res = 2;

if (ref($decoder)) {
my $name = $decoder->name;
$decoder = 'Do nothing';
$Res = 3;
$Utfmsg .= ", guess($count): $name";
if ($name =~ /UTF.*?(?:16|32)/i) {
# $name =~ s/(?:LE|BE)$//i;
$decoder = "Adding layer";
$Res = 4;
binmode ($fh_check, ":encoding($name)");
}
}
$Utfmsg .= ". $decoder" if (defined $decoder); # guess error
}
binmode ($fh_check,":utf8") if $utf8layer;
}
$Utfmsg .= ' ..';
$layers = "@{[PerlIO::get_layers($fh_check)]}";

print "@{[$Utfmsg,$layers]}\n" if $parm{'verbose'};

$bomoffset = $Res if $Res == -1;
if ($Res != -1 && $parm{'gbom'}) {
($bomenc, $bom, $bomwidth, $bomoffset) = Get_BOM ($fh_check, $parm{'bomopts'});
$Utfmsg .= " bom($bomoffset) = $bom ..";
print "Get_BOM returned $bomenc, $bom, $bomwidth, $bomoffset\n" if $parm{'verbose'};
}
return ($Res, $Utfmsg, $layers, $bomenc, $bom, $bomwidth, $bomoffset);
}

sub test_surrogate_pair
{
my $test = shift;
print "\nTesting surrogate pairs\n",'-'x20,"\n";
if (ord($test) < 0x10000) {
print "nothing to test < 0x10000\n";
} else {
my $hi = (ord($test) - 0x10000) / 0x400 + 0xD800;
my $lo = (ord($test) - 0x10000) % 0x400 + 0xDC00;
my $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
printf "test uni = %x\n", ord($test);
printf "hi surrogate = %x\n", $hi;
printf "lo surrogate = %x\n", $lo;
printf "uni = %x\n", $uni;
}
}
__END__


c:\temp\new_v3c>perl bomtest.pl

NOT WRITING to c:\temp\XML\CollectedData_1804.xml !!
--------------------

Reading c:\temp\XML\CollectedData_1804.xml as ''
--------------------
UTF Check, guess(100): UTF-16. Adding layer .. unix crlf encoding(UTF-16) utf8
Get_BOM() - Layers = unix crlf encoding(UTF-16) utf8
* Reading up to 4 characters
- char ff , bytes = 1
- char fe , bytes = 1
- char 3c , bytes = 1
- char 0 , bytes = 1
- Read 4 characters. Position = 4
- Sample (4) = ff fe 3c 00
* Analysing bytes
BOMS avail = ( fffe0000 ) ( 0000feff ) ( efbbbf ) ( fffe ) ( feff )
- Bom bytes from sample (4) = ff fe 3c 00
* Found UTF-16LE, BOM = fffe
Get_BOM returned UTF-16LE, fffe, ok, 2

Pass 1: seek 2/2 ok, data (42) = 3c 3f 78 6d 6c 20 76 65 72 73 69 6f
6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 75 6e
69 63 6f 64 65 22 3f 3e d a
<?xml version="1.0" encoding="unicode"?>

Pass 2: seek 2/2 ok, data (42) = 3c 3f 78 6d 6c 20 76 65 72 73 69 6f
6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 75 6e
69 63 6f 64 65 22 3f 3e d a
<?xml version="1.0" encoding="unicode"?>

Pass 3: seek 2/2 ok, data (42) = 3c 3f 78 6d 6c 20 76 65 72 73 69 6f
6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 75 6e
69 63 6f 64 65 22 3f 3e d a
<?xml version="1.0" encoding="unicode"?>


Buffering c:\temp\XML\CollectedData_1804.xml as '', open/read buffer as ':utf8'
--------------------
UTF Check, guess(100): UTF-16. Adding layer .. unix crlf encoding(UTF-16) utf8
Get_BOM() - Layers = unix crlf encoding(UTF-16) utf8
* Reading up to 4 characters
- char ff , bytes = 1
- char fe , bytes = 1
- char 3c , bytes = 1
- char 0 , bytes = 1
- Read 4 characters. Position = 4
- Sample (4) = ff fe 3c 00
* Analysing bytes
BOMS avail = ( fffe0000 ) ( 0000feff ) ( efbbbf ) ( fffe ) ( feff )
- Bom bytes from sample (4) = ff fe 3c 00
* Found UTF-16LE, BOM = fffe
Get_BOM returned UTF-16LE, fffe, ok, 2

Pass 1: seek 0/0 ok, data (42) = 3c 3f 78 6d 6c 20 76 65 72 73 69 6f
6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 75 6e
69 63 6f 64 65 22 3f 3e d a
Pass 2: seek 0/0 ok, data (42) = 3c 3f 78 6d 6c 20 76 65 72 73 69 6f
6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 75 6e
69 63 6f 64 65 22 3f 3e d a
Pass 3: seek 0/0 ok, data (42) = 3c 3f 78 6d 6c 20 76 65 72 73 69 6f
6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 75 6e
69 63 6f 64 65 22 3f 3e d a

Reading c:\temp\XML\CollectedData_1804.xml as bytes
--------------------
Pass 1: seek 0/0 ok, data (85) = ff fe 3c 0 3f 0 78 0 6d 0 6c 0 2
0 0 76 0 65 0 72 0 73 0 69 0 6f 0 6e 0 3d 0 22 0 31 0 2e
0 30 0 22 0 20 0 65 0 6e 0 63 0 6f 0 64 0 69 0 6e 0 67 0
3d 0 22 0 75 0 6e 0 69 0 63 0 6f 0 64 0 65 0 22 0 3f 0 3e
0 d 0 a
Pass 2: seek 0/0 ok, data (85) = ff fe 3c 0 3f 0 78 0 6d 0 6c 0 2
0 0 76 0 65 0 72 0 73 0 69 0 6f 0 6e 0 3d 0 22 0 31 0 2e
0 30 0 22 0 20 0 65 0 6e 0 63 0 6f 0 64 0 69 0 6e 0 67 0
3d 0 22 0 75 0 6e 0 69 0 63 0 6f 0 64 0 65 0 22 0 3f 0 3e
0 d 0 a
Pass 3: seek 0/0 ok, data (85) = ff fe 3c 0 3f 0 78 0 6d 0 6c 0 2
0 0 76 0 65 0 72 0 73 0 69 0 6f 0 6e 0 3d 0 22 0 31 0 2e
0 30 0 22 0 20 0 65 0 6e 0 63 0 6f 0 64 0 69 0 6e 0 67 0
3d 0 22 0 75 0 6e 0 69 0 63 0 6f 0 64 0 65 0 22 0 3f 0 3e
0 d 0 a

c:\temp\new_v3c>
 
M

Martijn Lievaart

wow, so many things I didnt know... there isnt any utility able to
detect the coding of a file ?

Easy, as this file is html. Look at the encoding tag. If there is none,
it'a iso-latin-1 by default.

M4
 
B

Brad Baxter

Jochen said:
You are not telling Perl to expect utf8.

Try to add "use Encoding;" and "$_ = decode_utf8($_)" (and maybe
"binmode STDOUT,':utf8'" to avoid a warning "wide character..." while
printing) at the appropriate places; see "perldoc Encode" and "perldoc
-f binmode" for more details on this.

I found the perlunitut tutorial to be very helpful as
an entry point, e.g.,

http://perldoc.perl.org/perlunitut.html

If your experience is like mine, you'll find the author
is right when he says there, "You may have to re-read
this entire section a few times..."
 
A

alexxx.magni

Try this on your file and post the output.
Assign $fname = 'your file' and run it.

-sln

# File: bomtest.pl
#
(snip snip)


here is the output...

thank! Did you write this?

alessandro
----------------------------------------------

NOT WRITING to test.htm !!
--------------------

Reading test.htm as ''
--------------------
UTF Check, guess(100): ascii. Do nothing .. unix perlio
Get_BOM() - Layers = unix perlio
* Reading up to 4 characters
- char 3c , bytes = 1
- char 68 , bytes = 1
- char 74 , bytes = 1
- char 6d , bytes = 1
- Read 4 characters. Position = 4
- Sample (4) = 3c 68 74 6d
* Analysing bytes
BOMS avail = ( fffe0000 ) ( 0000feff ) ( efbbbf ) ( fffe )
( feff )
- Bom bytes from sample (4) = 3c 68 74 6d
* BOM Not found
Get_BOM returned , , , 0

Pass 1: seek 0/0 ok, data (332) = 3c 68 74 6d 6c 3e 3c 68 65
61 64 3e 3c 6d 65 74 61 20 68 74 74 70 2d 65 71 75
69 76 3d 22 43 6f 6e 74 65 6e 74 2d 54 79 70 65 22
20 63 6f 6e 74 65 6e 74 3d 22 74 65 78 74 2f 68 74
6d 6c 3b 20 63 68 61 72 73 65 74 3d 55 54 46 2d 38
22 20 2f 3e 3c 74 69 74 6c 65 3e 67 20 2d 20 5b 33
34 6d 7e 41 5b 33 34 6d 7e 42 5b 33 34 6d 7e 40 d0
b0 d0 bd d0 b8 5b 33 34 6d 7e 46 d0 b0 20 30 3c 2f
74 69 74 6c 65 3e 3c 2f 68 65 61 64 3e 3c 62 6f 64
79 3e 3c 61 20 68 72 65 65 78 2e 68 74 6d 6c 22 20
73 74 79 6c 65 3d 22 63 6f 6c 6f 72 3a 23 30 30 30
30 30 30 22 3e d0 b2 d0 b2 d0 b5 5b 33 34 6d 7e 40
5b 33 34 6d 7e 45 3c 2f 61 3e 3c 62 72 3e 3c 62 72
3e 3c 62 3e d0 a1 5b 33 34 6d 7e 42 5b 33 34 6d 7e
40 d0 b0 d0 bd d0 b8 5b 33 34 6d 7e 46 d0 b0 20 30
3c 2f 62 3e 3c 74 61 62 6c 65 3e 3c 74 62 6f 64 79
3e 3c 74 72 20 62 67 63 6f 6c 6f 72 3d 22 23 30 30
30 30 30 30 22 20 73 74 79 6c 65 3d 22 63 6f 6c 6f
72 3a 23 66 66 66 66 66 66 22 3e 3c 74 64 20 77 69
64 3e 3c 62 3e 49 44 3c 2f 62 3e 3c 2f 74 64 3e a
<html><head><meta http-equiv="Content-Type" content="text/html;
charset=UTF-8" /><title>g - [34m~A[34m~B[34m~@ани[34m~Fа 0</
title></head><body><a hreex.html" style="color:#000000">вве[34m~@
[34m~E</a><br><br><b>С[34m~B[34m~@ани[34m~Fа 0</
b><table><tbody><tr bgcolor="#000000" style="color:#ffffff"><td
wid><b>ID</b></td>

Pass 2: seek 0/0 ok, data (332) = 3c 68 74 6d 6c 3e 3c 68 65
61 64 3e 3c 6d 65 74 61 20 68 74 74 70 2d 65 71 75
69 76 3d 22 43 6f 6e 74 65 6e 74 2d 54 79 70 65 22
20 63 6f 6e 74 65 6e 74 3d 22 74 65 78 74 2f 68 74
6d 6c 3b 20 63 68 61 72 73 65 74 3d 55 54 46 2d 38
22 20 2f 3e 3c 74 69 74 6c 65 3e 67 20 2d 20 5b 33
34 6d 7e 41 5b 33 34 6d 7e 42 5b 33 34 6d 7e 40 d0
b0 d0 bd d0 b8 5b 33 34 6d 7e 46 d0 b0 20 30 3c 2f
74 69 74 6c 65 3e 3c 2f 68 65 61 64 3e 3c 62 6f 64
79 3e 3c 61 20 68 72 65 65 78 2e 68 74 6d 6c 22 20
73 74 79 6c 65 3d 22 63 6f 6c 6f 72 3a 23 30 30 30
30 30 30 22 3e d0 b2 d0 b2 d0 b5 5b 33 34 6d 7e 40
5b 33 34 6d 7e 45 3c 2f 61 3e 3c 62 72 3e 3c 62 72
3e 3c 62 3e d0 a1 5b 33 34 6d 7e 42 5b 33 34 6d 7e
40 d0 b0 d0 bd d0 b8 5b 33 34 6d 7e 46 d0 b0 20 30
3c 2f 62 3e 3c 74 61 62 6c 65 3e 3c 74 62 6f 64 79
3e 3c 74 72 20 62 67 63 6f 6c 6f 72 3d 22 23 30 30
30 30 30 30 22 20 73 74 79 6c 65 3d 22 63 6f 6c 6f
72 3a 23 66 66 66 66 66 66 22 3e 3c 74 64 20 77 69
64 3e 3c 62 3e 49 44 3c 2f 62 3e 3c 2f 74 64 3e a
<html><head><meta http-equiv="Content-Type" content="text/html;
charset=UTF-8" /><title>g - [34m~A[34m~B[34m~@ани[34m~Fа 0</
title></head><body><a hreex.html" style="color:#000000">вве[34m~@
[34m~E</a><br><br><b>С[34m~B[34m~@ани[34m~Fа 0</
b><table><tbody><tr bgcolor="#000000" style="color:#ffffff"><td
wid><b>ID</b></td>

Pass 3: seek 0/0 ok, data (332) = 3c 68 74 6d 6c 3e 3c 68 65
61 64 3e 3c 6d 65 74 61 20 68 74 74 70 2d 65 71 75
69 76 3d 22 43 6f 6e 74 65 6e 74 2d 54 79 70 65 22
20 63 6f 6e 74 65 6e 74 3d 22 74 65 78 74 2f 68 74
6d 6c 3b 20 63 68 61 72 73 65 74 3d 55 54 46 2d 38
22 20 2f 3e 3c 74 69 74 6c 65 3e 67 20 2d 20 5b 33
34 6d 7e 41 5b 33 34 6d 7e 42 5b 33 34 6d 7e 40 d0
b0 d0 bd d0 b8 5b 33 34 6d 7e 46 d0 b0 20 30 3c 2f
74 69 74 6c 65 3e 3c 2f 68 65 61 64 3e 3c 62 6f 64
79 3e 3c 61 20 68 72 65 65 78 2e 68 74 6d 6c 22 20
73 74 79 6c 65 3d 22 63 6f 6c 6f 72 3a 23 30 30 30
30 30 30 22 3e d0 b2 d0 b2 d0 b5 5b 33 34 6d 7e 40
5b 33 34 6d 7e 45 3c 2f 61 3e 3c 62 72 3e 3c 62 72
3e 3c 62 3e d0 a1 5b 33 34 6d 7e 42 5b 33 34 6d 7e
40 d0 b0 d0 bd d0 b8 5b 33 34 6d 7e 46 d0 b0 20 30
3c 2f 62 3e 3c 74 61 62 6c 65 3e 3c 74 62 6f 64 79
3e 3c 74 72 20 62 67 63 6f 6c 6f 72 3d 22 23 30 30
30 30 30 30 22 20 73 74 79 6c 65 3d 22 63 6f 6c 6f
72 3a 23 66 66 66 66 66 66 22 3e 3c 74 64 20 77 69
64 3e 3c 62 3e 49 44 3c 2f 62 3e 3c 2f 74 64 3e a
<html><head><meta http-equiv="Content-Type" content="text/html;
charset=UTF-8" /><title>g - [34m~A[34m~B[34m~@ани[34m~Fа 0</
title></head><body><a hreex.html" style="color:#000000">вве[34m~@
[34m~E</a><br><br><b>С[34m~B[34m~@ани[34m~Fа 0</
b><table><tbody><tr bgcolor="#000000" style="color:#ffffff"><td
wid><b>ID</b></td>


Buffering test.htm as '', open/read buffer as ':utf8'
--------------------
UTF Check, guess(100): ascii. Do nothing .. unix perlio
Get_BOM() - Layers = unix perlio
* Reading up to 4 characters
- char 3c , bytes = 1
- char 68 , bytes = 1
- char 74 , bytes = 1
- char 6d , bytes = 1
- Read 4 characters. Position = 4
- Sample (4) = 3c 68 74 6d
* Analysing bytes
BOMS avail = ( fffe0000 ) ( 0000feff ) ( efbbbf ) ( fffe )
( feff )
- Bom bytes from sample (4) = 3c 68 74 6d
* BOM Not found
Get_BOM returned , , , 0

Pass 1: seek 0/0 ok, data (320) = 3c 68 74 6d 6c 3e 3c 68 65
61 64 3e 3c 6d 65 74 61 20 68 74 74 70 2d 65 71 75
69 76 3d 22 43 6f 6e 74 65 6e 74 2d 54 79 70 65 22
20 63 6f 6e 74 65 6e 74 3d 22 74 65 78 74 2f 68 74
6d 6c 3b 20 63 68 61 72 73 65 74 3d 55 54 46 2d 38
22 20 2f 3e 3c 74 69 74 6c 65 3e 67 20 2d 20 5b 33
34 6d 7e 41 5b 33 34 6d 7e 42 5b 33 34 6d 7e 40 430
43d 438 5b 33 34 6d 7e 46 430 20 30 3c 2f 74 69 74 6c
65 3e 3c 2f 68 65 61 64 3e 3c 62 6f 64 79 3e 3c 61
20 68 72 65 65 78 2e 68 74 6d 6c 22 20 73 74 79 6c
65 3d 22 63 6f 6c 6f 72 3a 23 30 30 30 30 30 30 22
3e 432 432 435 5b 33 34 6d 7e 40 5b 33 34 6d 7e 45 3c
2f 61 3e 3c 62 72 3e 3c 62 72 3e 3c 62 3e 421 5b 33
34 6d 7e 42 5b 33 34 6d 7e 40 430 43d 438 5b 33 34 6d
7e 46 430 20 30 3c 2f 62 3e 3c 74 61 62 6c 65 3e 3c
74 62 6f 64 79 3e 3c 74 72 20 62 67 63 6f 6c 6f 72
3d 22 23 30 30 30 30 30 30 22 20 73 74 79 6c 65 3d
22 63 6f 6c 6f 72 3a 23 66 66 66 66 66 66 22 3e 3c
74 64 20 77 69 64 3e 3c 62 3e 49 44 3c 2f 62 3e 3c
2f 74 64 3e a
Pass 2: seek 0/0 ok, data (320) = 3c 68 74 6d 6c 3e 3c 68 65
61 64 3e 3c 6d 65 74 61 20 68 74 74 70 2d 65 71 75
69 76 3d 22 43 6f 6e 74 65 6e 74 2d 54 79 70 65 22
20 63 6f 6e 74 65 6e 74 3d 22 74 65 78 74 2f 68 74
6d 6c 3b 20 63 68 61 72 73 65 74 3d 55 54 46 2d 38
22 20 2f 3e 3c 74 69 74 6c 65 3e 67 20 2d 20 5b 33
34 6d 7e 41 5b 33 34 6d 7e 42 5b 33 34 6d 7e 40 430
43d 438 5b 33 34 6d 7e 46 430 20 30 3c 2f 74 69 74 6c
65 3e 3c 2f 68 65 61 64 3e 3c 62 6f 64 79 3e 3c 61
20 68 72 65 65 78 2e 68 74 6d 6c 22 20 73 74 79 6c
65 3d 22 63 6f 6c 6f 72 3a 23 30 30 30 30 30 30 22
3e 432 432 435 5b 33 34 6d 7e 40 5b 33 34 6d 7e 45 3c
2f 61 3e 3c 62 72 3e 3c 62 72 3e 3c 62 3e 421 5b 33
34 6d 7e 42 5b 33 34 6d 7e 40 430 43d 438 5b 33 34 6d
7e 46 430 20 30 3c 2f 62 3e 3c 74 61 62 6c 65 3e 3c
74 62 6f 64 79 3e 3c 74 72 20 62 67 63 6f 6c 6f 72
3d 22 23 30 30 30 30 30 30 22 20 73 74 79 6c 65 3d
22 63 6f 6c 6f 72 3a 23 66 66 66 66 66 66 22 3e 3c
74 64 20 77 69 64 3e 3c 62 3e 49 44 3c 2f 62 3e 3c
2f 74 64 3e a
Pass 3: seek 0/0 ok, data (320) = 3c 68 74 6d 6c 3e 3c 68 65
61 64 3e 3c 6d 65 74 61 20 68 74 74 70 2d 65 71 75
69 76 3d 22 43 6f 6e 74 65 6e 74 2d 54 79 70 65 22
20 63 6f 6e 74 65 6e 74 3d 22 74 65 78 74 2f 68 74
6d 6c 3b 20 63 68 61 72 73 65 74 3d 55 54 46 2d 38
22 20 2f 3e 3c 74 69 74 6c 65 3e 67 20 2d 20 5b 33
34 6d 7e 41 5b 33 34 6d 7e 42 5b 33 34 6d 7e 40 430
43d 438 5b 33 34 6d 7e 46 430 20 30 3c 2f 74 69 74 6c
65 3e 3c 2f 68 65 61 64 3e 3c 62 6f 64 79 3e 3c 61
20 68 72 65 65 78 2e 68 74 6d 6c 22 20 73 74 79 6c
65 3d 22 63 6f 6c 6f 72 3a 23 30 30 30 30 30 30 22
3e 432 432 435 5b 33 34 6d 7e 40 5b 33 34 6d 7e 45 3c
2f 61 3e 3c 62 72 3e 3c 62 72 3e 3c 62 3e 421 5b 33
34 6d 7e 42 5b 33 34 6d 7e 40 430 43d 438 5b 33 34 6d
7e 46 430 20 30 3c 2f 62 3e 3c 74 61 62 6c 65 3e 3c
74 62 6f 64 79 3e 3c 74 72 20 62 67 63 6f 6c 6f 72
3d 22 23 30 30 30 30 30 30 22 20 73 74 79 6c 65 3d
22 63 6f 6c 6f 72 3a 23 66 66 66 66 66 66 22 3e 3c
74 64 20 77 69 64 3e 3c 62 3e 49 44 3c 2f 62 3e 3c
2f 74 64 3e a

Reading test.htm as bytes
--------------------
Pass 1: seek 0/0 ok, data (332) = 3c 68 74 6d 6c 3e 3c 68
65 61 64 3e 3c 6d 65 74 61 20 68 74 74 70 2d 65 71
75 69 76 3d 22 43 6f 6e 74 65 6e 74 2d 54 79 70 65
22 20 63 6f 6e 74 65 6e 74 3d 22 74 65 78 74 2f 68
74 6d 6c 3b 20 63 68 61 72 73 65 74 3d 55 54 46 2d
38 22 20 2f 3e 3c 74 69 74 6c 65 3e 67 20 2d 20 5b
33 34 6d 7e 41 5b 33 34 6d 7e 42 5b 33 34 6d 7e 40
d0 b0 d0 bd d0 b8 5b 33 34 6d 7e 46 d0 b0 20 30 3c
2f 74 69 74 6c 65 3e 3c 2f 68 65 61 64 3e 3c 62 6f
64 79 3e 3c 61 20 68 72 65 65 78 2e 68 74 6d 6c 22
20 73 74 79 6c 65 3d 22 63 6f 6c 6f 72 3a 23 30 30
30 30 30 30 22 3e d0 b2 d0 b2 d0 b5 5b 33 34 6d 7e
40 5b 33 34 6d 7e 45 3c 2f 61 3e 3c 62 72 3e 3c 62
72 3e 3c 62 3e d0 a1 5b 33 34 6d 7e 42 5b 33 34 6d
7e 40 d0 b0 d0 bd d0 b8 5b 33 34 6d 7e 46 d0 b0 20
30 3c 2f 62 3e 3c 74 61 62 6c 65 3e 3c 74 62 6f 64
79 3e 3c 74 72 20 62 67 63 6f 6c 6f 72 3d 22 23 30
30 30 30 30 30 22 20 73 74 79 6c 65 3d 22 63 6f 6c
6f 72 3a 23 66 66 66 66 66 66 22 3e 3c 74 64 20 77
69 64 3e 3c 62 3e 49 44 3c 2f 62 3e 3c 2f 74 64 3e
a
Pass 2: seek 0/0 ok, data (332) = 3c 68 74 6d 6c 3e 3c 68
65 61 64 3e 3c 6d 65 74 61 20 68 74 74 70 2d 65 71
75 69 76 3d 22 43 6f 6e 74 65 6e 74 2d 54 79 70 65
22 20 63 6f 6e 74 65 6e 74 3d 22 74 65 78 74 2f 68
74 6d 6c 3b 20 63 68 61 72 73 65 74 3d 55 54 46 2d
38 22 20 2f 3e 3c 74 69 74 6c 65 3e 67 20 2d 20 5b
33 34 6d 7e 41 5b 33 34 6d 7e 42 5b 33 34 6d 7e 40
d0 b0 d0 bd d0 b8 5b 33 34 6d 7e 46 d0 b0 20 30 3c
2f 74 69 74 6c 65 3e 3c 2f 68 65 61 64 3e 3c 62 6f
64 79 3e 3c 61 20 68 72 65 65 78 2e 68 74 6d 6c 22
20 73 74 79 6c 65 3d 22 63 6f 6c 6f 72 3a 23 30 30
30 30 30 30 22 3e d0 b2 d0 b2 d0 b5 5b 33 34 6d 7e
40 5b 33 34 6d 7e 45 3c 2f 61 3e 3c 62 72 3e 3c 62
72 3e 3c 62 3e d0 a1 5b 33 34 6d 7e 42 5b 33 34 6d
7e 40 d0 b0 d0 bd d0 b8 5b 33 34 6d 7e 46 d0 b0 20
30 3c 2f 62 3e 3c 74 61 62 6c 65 3e 3c 74 62 6f 64
79 3e 3c 74 72 20 62 67 63 6f 6c 6f 72 3d 22 23 30
30 30 30 30 30 22 20 73 74 79 6c 65 3d 22 63 6f 6c
6f 72 3a 23 66 66 66 66 66 66 22 3e 3c 74 64 20 77
69 64 3e 3c 62 3e 49 44 3c 2f 62 3e 3c 2f 74 64 3e
a
Pass 3: seek 0/0 ok, data (332) = 3c 68 74 6d 6c 3e 3c 68
65 61 64 3e 3c 6d 65 74 61 20 68 74 74 70 2d 65 71
75 69 76 3d 22 43 6f 6e 74 65 6e 74 2d 54 79 70 65
22 20 63 6f 6e 74 65 6e 74 3d 22 74 65 78 74 2f 68
74 6d 6c 3b 20 63 68 61 72 73 65 74 3d 55 54 46 2d
38 22 20 2f 3e 3c 74 69 74 6c 65 3e 67 20 2d 20 5b
33 34 6d 7e 41 5b 33 34 6d 7e 42 5b 33 34 6d 7e 40
d0 b0 d0 bd d0 b8 5b 33 34 6d 7e 46 d0 b0 20 30 3c
2f 74 69 74 6c 65 3e 3c 2f 68 65 61 64 3e 3c 62 6f
64 79 3e 3c 61 20 68 72 65 65 78 2e 68 74 6d 6c 22
20 73 74 79 6c 65 3d 22 63 6f 6c 6f 72 3a 23 30 30
30 30 30 30 22 3e d0 b2 d0 b2 d0 b5 5b 33 34 6d 7e
40 5b 33 34 6d 7e 45 3c 2f 61 3e 3c 62 72 3e 3c 62
72 3e 3c 62 3e d0 a1 5b 33 34 6d 7e 42 5b 33 34 6d
7e 40 d0 b0 d0 bd d0 b8 5b 33 34 6d 7e 46 d0 b0 20
30 3c 2f 62 3e 3c 74 61 62 6c 65 3e 3c 74 62 6f 64
79 3e 3c 74 72 20 62 67 63 6f 6c 6f 72 3d 22 23 30
30 30 30 30 30 22 20 73 74 79 6c 65 3d 22 63 6f 6c
6f 72 3a 23 66 66 66 66 66 66 22 3e 3c 74 64 20 77
69 64 3e 3c 62 3e 49 44 3c 2f 62 3e 3c 2f 74 64 3e
a
 
A

alexxx.magni

thanks everybody for your help,
the following did the trick - ok it was written in the html header in
clear letters that it was utf-8 - but until yesterday I never even
knew what encoding could mean...

alessandro

perl -MEncode -wne '$_ = decode_utf8($_); foreach($_){if (/(\p
{InCyrillic})/){print"record $_ matches ===>>>$1<<<===\n\n\n"}else
{print"ok\n"}}' test.htm
 
S

sln

thanks everybody for your help,
the following did the trick - ok it was written in the html header in
clear letters that it was utf-8 - but until yesterday I never even
knew what encoding could mean...

dalessandro

perl -MEncode -wne '$_ = decode_utf8($_); foreach($_){if (/(\p
{InCyrillic})/){print"record $_ matches ===>>>$1<<<===\n\n\n"}else
{print"ok\n"}}' test.htm


Many times "Unicode" is written in the header. Whats written in the
header is not necessarily what the encoding actually is.

When you posted the sample analysis of bomtest it detected 'ascii'
because there was *no* utf-8 BOM marker and the utf-8 encoding didn't
come until later in the sample.

Traditionally, decoding is done in the perlio layers if you are
reading in data from file based handles. Unless you are doing something
special, decode strings can be tricky.

Since I use the CheckUTF() and Get_BOM() functions in other code,
to transparently add Unicode layers (utf8, utf-8/16/32LE/BE only) if
needed, I changed the functions slightly to actually add utf8 and
add a sample size option of -1 that will sample the entire file to
guess function.

So, if you take your test.htm and run it through a cyrillic test,
this would be a typical usage of CheckUTF() and Get_BOM().
You have to replace these stubbed out functions from the new
new bomtest1.pl code below.

-sln

----------------
# Your Cyrillic test on test.htm
----------------
use strict;
use warnings;

binmode (STDOUT, ':utf8');

# Open the file in byte mode, let the encoding be guessed
my $fname = 'test.htm';
open my $fh, '<', $fname or die "can't open $fname for read $!";

my $offset = CheckUTF($fh, 'size'=> -1, 'verbose'=> 1, 'getbom_v'=> 1); # verbose
# my $offset = CheckUTF($fh, 'size'=> -1, 'verbose'=> 0, 'getbom'=> 1); # transparent

seek($fh, $offset, 0); # not necessary if just reading the file once

while (<$fh>) {
print '-'x10,"\nRAW record:\n$_\n";
my $clean = '';
while ( / (\p{InCyrillic}+) | ((?:(?!\p{InCyrillic}).)+) /xg ) {
if (defined $1) {
print "Cyrillic: $1\n";
} else {
print "OK: $2\n";
$clean .= $2;
}
}
print "\nCleaned:\n$clean\n";
}
close $fh;
exit(0);

### Subs ..

# Add these subs from the bomtest1.pl code below this ..
sub ordsplit {}
sub Get_BOM {}
sub CheckUTF {}

__END__

# Your test.htm and Cyrillic test output:
UTF Check, guess(332): utf8. Adding layer .. unix crlf utf8
Get_BOM() - Layers = unix crlf utf8
* Reading up to 4 characters
- char 3c , bytes = 1
- char 68 , bytes = 1
- char 74 , bytes = 1
- char 6d , bytes = 1
- Read 4 characters. Position = 4
- Sample (4) = 3c 68 74 6d
* Analysing bytes
BOMS avail = ( fffe0000 ) ( 0000feff ) ( efbbbf ) ( fffe ) ( feff )
- Bom bytes from sample (4) = 3c 68 74 6d
* BOM Not found
Get_BOM returned , , , 0
----------
RAW record:
<title>g - [34m~A[34m~B[34m~@-¦-+-+[34m~F-¦ 0</title></head><body><a hreex.html
" style="color:#000000">-¦-¦-¦[34m~@[34m~E</a><br><br><b>-í[34m~B[34m~@-¦-+-+[34
m~F-¦ 0</b><table><tbody><tr bgcolor="#000000" style="color:#ffffff"><td wid><b>
ID</b></td>

OK: <html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-
8" /><title>g - [34m~A[34m~B[34m~@
Cyrillic: -¦-+-+
OK: [34m~F
Cyrillic: -¦
OK: 0</title></head><body><a hreex.html" style="color:#000000">
Cyrillic: -¦-¦-¦
OK: [34m~@[34m~E</a><br><br><b>
Cyrillic: -í
OK: [34m~B[34m~@
Cyrillic: -¦-+-+
OK: [34m~F
Cyrillic: -¦
OK: 0</b><table><tbody><tr bgcolor="#000000" style="color:#ffffff"><td wid><b>I
D</b></td>

Cleaned:
<title>g - [34m~A[34m~B[34m~@[34m~F 0</title></head><body><a hreex.html" style=
color:#000000">[34m~@[34m~E said:
<tr bgcolor="#000000" style="color:#ffffff"><td wid><b>ID</b></td>

-------
bomtest1.pl
-------
# File: bomtest1.pl
#
# CheckUTF() -
# UTF-8/16/32 check, Will add a guessed utf8 or utf-8/16/32 layer
# if there is no current encoding() layer.
# Opened filehandle should be in byte context.
# This is more for automatically adding a utf layer on
# unknown files opened in byte mode.
# Guesses utf-8/16/32 encodings, BOMb'd or not.
# Uses Encode::Guess module. Will use standard guess settings (see docs).
# Used standalone or optionally with Get_BOM().
#
# Get_BOM() -
# SEEK_SET workaround for BOMb'd Encoding(utf-) layer Bug
# and a general way to get the BOM and offset for subsequent
# seeks.
# The file can be opened using any encoding. The handle passed
# is duped, that handle set to ':raw'. No character semantics
# are used when analysing the BOM.
# This will also detect a mismatch in width between the
# detected BOM and the current opened encoding layer.
# Used standalone or optionally with CheckUTF().
# CheckUTF() has an option to get the bom on the callers
# behalf, combining steps.
#
# test_surrogate_pair() -
# Just a utility function to print hi/lo surrogate pairs
# given a unicode test character (> 0x10000).
#
#
# Modify the top variables to test various configurations.
# Includes:
# $fname File name for output and input
# $write_file Flag 0/1 to create a test file
# (0 to not write, just read file $fname)
# $check_size Size of sample to read for encoding guess in Check_UTF()
# (less than zero will read entire file, default -1)
# $data Data string for writing test file, can be unicode
# $enc Encoding to read/write file $fname
# $passes Number of passes to test BOM seek bug (default 1)
# $test_surr Stand-alone surrogate pair test
#
# -sln, 1/7/10
# --------------------------------------------------------

use strict;
use warnings;

binmode (STDOUT, ':utf8');

{
## PARAMETERS ------
my $fname = 'sample.htm.txt';
# my $fname = 'dummy.txt';
# my $fname = 'c:\temp\XML\CollectedData_1804.xml';

my $Unichar = "\x{21000}";

my $data = "\x{1}\x{2}$Unichar";
my $enc = ''; # blank or UTF-xxLE/BE or :bytes, :utf8, :raw, etc ...
my $enc_string = ($enc =~ /:/) ? "$enc" : (length($enc) ? "encoding($enc)" : '');
my $check_size = -1;
my $write_file = 0;
my $test_surr = 0;
my $passes = 1;
## END PARAMETERS ------

my ($fh, $line);
my ($width,$encfound,$bom,$offset);

# Test surrogate pairs ..
##
test_surrogate_pair($Unichar) if $test_surr;

# Create a test file with encoding ..
##
if ($write_file) {
print "\nWriting sample data as '$enc' to $fname\n",'-'x20,"\n";
open $fh, ">$enc_string", $fname or die "can't open $fname for write $!";

print $fh $data;

close $fh;
} else {
print "\nNOT WRITING to $fname !!\n",'-'x20,"\n";
}

# Read test file with encoding ..
##
print "\nReading $fname as '$enc'\n",'-'x20,"\n";
open $fh, "<$enc_string", $fname or die "can't open $fname for read $!";

$offset = CheckUTF($fh, 'size'=> $check_size, 'verbose'=> 1, 'getbom_v'=> 1);
print "\n";

for (1 .. $passes) {
print "Pass $_: ";
seek($fh, $offset, 0); # SEEK_SET
my $fpos = tell($fh);
my $ok = ($fpos == $offset) ? 'ok' : 'bad';
$line = <$fh>;
print "seek $offset/$fpos $ok, data (".length($line).") = ".ordsplit($line)."\n";
print "$line\n";
}
close $fh;

# Read test file with encoding into scalar, open handle to scalar as :utf8 ..
##
print "\nBuffering $fname as '$enc', open/read buffer as ':utf8'\n",'-'x20,"\n";
open $fh, "<$enc_string", $fname or die "can't open $fname for read $!";

$offset = CheckUTF($fh, 'size'=> $check_size, 'verbose'=> 1, 'getbom_v'=> 1);
seek($fh, $offset, 0); # SEEK_SET

my $membuf = <$fh>;
close $fh;

open $fh, "<:utf8", \$membuf or die "can't open \$membuf for read $!";

$offset = 0;
print "\n";

for (1 .. $passes) {
print "Pass $_: ";
seek($fh, $offset, 0); # SEEK_SET
my $fpos = tell($fh);
my $ok = ($fpos == $offset) ? 'ok' : 'bad';
$line = <$fh>;
print "seek $offset/$fpos $ok, data (".length($line).") = ".ordsplit($line)."\n";
}
close $fh;

# Read test file in byte mode ..
##
print "\nReading $fname as bytes\n",'-'x20,"\n";
open $fh, '<', $fname or die "can't open $fname for read $!";

$offset = 0;

for (1 .. $passes) {
print "Pass $_: ";
seek($fh, $offset, 0); # SEEK_SET
my $fpos = tell($fh);
my $ok = ($fpos == $offset) ? 'ok' : 'bad';
$line = <$fh>;
print "seek $offset/$fpos $ok, data (".length($line).") = ";
for (map {ord $_} split //, $line) {
printf ("%x ",$_);
}
print "\n";
}
close $fh;

}
exit 0;

##
## End of program ...
##

sub ordsplit
{
my $string = shift;
my $buf = '';
for (map {ord $_} split //, $string) {
$buf.= sprintf ("%x ",$_);
}
return $buf;
}

sub Get_BOM
{
my ($fh_orig, $verbose) = @_;
$verbose = 0 unless (defined $verbose and lc($verbose) eq 'v');
use Encode;

my %bom2enc = (
map { encode($_, "\x{feff}") => $_ } qw(
UTF-8
UTF-16BE
UTF-16LE
UTF-32BE
UTF-32LE
)
);
my %enc2bom = (
reverse(%bom2enc),
map { $_ => encode($_, "\x{feff}") } qw(
UCS-2
iso-10646-1
utf8
)
);
my @bombs = sort { length $b <=> length $a } keys %bom2enc;
my $MAX_BOM_LENGTH = length $bombs[0];
my $bomstr = join '|', @bombs;
my $bom_re = qr/^($bomstr)/o;
my @arlays = PerlIO::get_layers($fh_orig);
my $layers = "@arlays";

print "Get_BOM() - Layers = $layers\n" if $verbose;

if (defined($fh_orig) && (ref($fh_orig) eq 'GLOB' || ref(\$fh_orig) eq 'GLOB'))
{
# Dup the passed in handle, binmode it to raw
# We will ONLY do byte semantics !!
# --------------------------------------------------
open my $fh_dup, "<&", $fh_orig or die "can't dup filehandle: $!";
binmode ($fh_dup, ":raw");

if (seek($fh_dup, 0, 0)) # SEEK_SET
{
my $sample = '';

## Read in $MAX_BOM_LENGTH characters.
## As a check, store character/bytes read based
## on the differential file position per read.
## Since our dup handle is in byte mode, the bytes
## per character will always be 1.
## This is a hold over when there could be encoded
## characters and is not necessary.
## -----------------------------------------------
my $fpos = tell($fh_dup);
my @samparray = ();
print " * Reading up to $MAX_BOM_LENGTH characters\n" if $verbose;

for (1 .. $MAX_BOM_LENGTH)
{
my $count = read ($fh_dup, my $buff, 1);
last if (!defined $count or $count != 1);
push @samparray, {'char'=> $buff, 'width'=> (tell($fh_dup)-$fpos)};
$fpos = tell($fh_dup);
}
$sample = '';
for my $href (@samparray) {
$sample .= $href->{'char'};
printf (" - char %x , bytes = %d\n", ord($href->{'char'}), $href->{'width'}) if $verbose;
}
if ($verbose) {
print " - Read ".length($sample)." characters. Position = $fpos\n";
print " - Sample (".length($sample).") = ";
for (map {ord $_} split //, $sample) {
printf ("%02x ", $_);
}
print "\n * Analysing bytes\n";
}
## Convert the 'characters' to bytes.
## Only process $MAX_BOM_LENGTH bytes.
## We only care about the BOM, which will match
## the encoding bytes we already have in %bom2enc.
## -----------------------------------------------
my ($octets, $numb_octets, $vecoffset) = ('',0,0);

for my $href (@samparray)
{
my @ar = ();
my ($charwidth, $ordchar) = (
$href->{'width'},
ord( $href->{'char'})
);
last if (($numb_octets + $charwidth) > $MAX_BOM_LENGTH);
$numb_octets += $charwidth;

while ($ordchar > 0) {
push @ar, $ordchar & 0xff;
$ordchar >>= 8;
--$charwidth;
}
push (@ar,0) while ($charwidth-- > 0);
for (reverse @ar) {
vec ($octets, $vecoffset++, 8) = $_;
}
}
if ($verbose) {
print " BOMS avail = ";
for my $bomavail (@bombs) {
print '( ';
for (map {ord $_} split //, $bomavail) {
printf ("%02x",$_);
}
print ' ) ';
}
print "\n - Bom bytes from sample (".length($octets).") = ";
for (map {ord $_} split //, $octets) {
printf ("%02x ", $_);
}
print "\n";
}
# end convert

my ($bomenc, $bom, $bomwidth, $bomoffset) = ('','','',0);
if (my ($found) = $octets =~ /$bom_re/) {

($bomenc, $bomoffset) = (
$bom2enc{$found},
length($found)
);
for (map {ord $_} split //, $found) {
$bom .= sprintf ("%02x", $_);
}
print " * Found $bomenc, BOM = $bom\n" if ($verbose);

my $noendian = $bomenc;
$noendian =~ s/(?:LE|BE)$//i;
$bomwidth = ($layers =~ /$noendian/i) ? 'ok' : 'mismatch';
print " - Does not match width of current encoding layer\n" if ($verbose and $bomwidth ne 'ok');
} else {
print " * BOM Not found\n" if $verbose;
}

close ($fh_dup);

# Original handle -> Must SEEK_SET/read twice to
# mitigate UTF-16/32 BOM seek bug
# ----------------------------------------------
seek($fh_orig, 0, 0); # SEEK_SET
read ($fh_orig, $sample, 1); # read
seek($fh_orig, 0, 0); # SEEK_SET last. Caller to seek past BOM.

return ($bomenc, $bom, $bomwidth, $bomoffset);
}
close ($fh_dup);

# seek failed, fall through
}
# There was an error ..
if ($verbose) {
print " * Invalid filehandle or file is unseekable\n - Utf BOMs available (max length = $MAX_BOM_LENGTH):\n";
while (my ($key,$val) = each %enc2bom)
{
my $valstring = '';
for (split //, $val) {
$valstring .= sprintf ("%x ",ord $_);
}
print " $key = $bom2enc{$val} = $valstring\n";
#print " $key = $bom2enc{$val} = '$val' = '$enc2bom{$key}' = $valstring\n";
}
}
return ('','','',-1);
}

sub CheckUTF
{
# UTF-8/16/32 check
# Will add a layer if there is no current
# encoding() layer. So open filehandle should
# be in byte context.
# ----------------------------------------
# Result value:
# -1 - not filehandle (or undef'ed)
# 0 - read error or empty file
# 1 - already have encoding layer (could force change later)
# 2 - guess_encoder() error message
# 3 - do nothing unless utf-8/16/32(be/le)
# 4 - added a layer

my ($fh_check,@args) = @_;
my %parm = ('verbose'=>0, 'size'=>60, 'getbom' =>0, 'bomopts'=>'');
while (my ($name, $val) = splice (@args, 0, 2))
{
$name =~ s/^\s+|\s+$//;
$name = lc $name;
next if not defined $val;

if ( $name =~ /^getbom(?:_(\w*))?/ and $val) {
$parm{'gbom'} = 1;
$parm{'bomopts'} = $1 if defined($1);
}
elsif ($name eq 'verbose' || $name eq 'size' && ($val<0 || $val>4)) {
$parm{$name} = $val;
}
}
# 0 1 2 3 4 5 6
my ($Res, $Utfmsg, $layers, $bomenc, $bom, $bomwidth, $bomoffset) = (
0, 'UTF Check', '', '', '', '', 0);
$layers = ':'.join (':', PerlIO::get_layers($fh_check)).':';

if (!defined($fh_check) || (ref($fh_check) ne 'GLOB' && ref(\$fh_check) ne 'GLOB')) {
$Utfmsg .= ", not a filehandle";
print "$Utfmsg\n" if $parm{'verbose'};
$Res = -1;
}
elsif ($layers =~ /:encoding/) {
$Utfmsg .= ", already have encoding layer";
$Res = 1;
}
else {
my ($count, $sample, $buf);
my $utf8layer = $layers =~ /:utf8/;

binmode ($fh_check,":bytes");
seek ($fh_check, 0, 0); # SEEK_SET

# Set 'size' to file size if 'size' passed was less than zero

if ($parm{'size'} < 0) {
seek ($fh_check, 0, 2); # SEEK_END
$parm{'size'} = tell ($fh_check);
seek ($fh_check, 0, 0); # SEEK_SET
}

# Try to read a large sample

if (!defined($count = read ($fh_check,$sample,$parm{'size'},0))) {
$Utfmsg .= ". $!"; # read error
} elsif ($count <= 0) {
$Utfmsg .= ". File is empty";
} else {
seek ($fh_check, 0, 0); # SEEK_SET
use Encode::Guess;

my $decoder = guess_encoding ($sample); # ascii/utf8/BOMed UTF
$Res = 2;

if (ref($decoder)) {
my $name = $decoder->name;
$decoder = 'Do nothing';
$Res = 3;
$Utfmsg .= ", guess($count): $name";
if ($name =~ /UTF.*?(?:8|16|32)/i) {
# $name =~ s/(?:LE|BE)$//i;
$decoder = "Adding layer";
$Res = 4;
if ($name =~ /utf8/i && !$utf8layer) {
$utf8layer = 1;
} else {
binmode ($fh_check, ":encoding($name)");
}
}
}
$Utfmsg .= ". $decoder" if (defined $decoder); # guess error
}
binmode ($fh_check,":utf8") if $utf8layer;
}
$Utfmsg .= ' ..';
$layers = "@{[PerlIO::get_layers($fh_check)]}";

print "@{[$Utfmsg,$layers]}\n" if $parm{'verbose'};

$bomoffset = $Res if $Res == -1;
if ($Res != -1 && $parm{'gbom'}) {
($bomenc, $bom, $bomwidth, $bomoffset) = Get_BOM ($fh_check, $parm{'bomopts'});
$Utfmsg .= " bom($bomoffset) = $bom ..";
print "Get_BOM returned $bomenc, $bom, $bomwidth, $bomoffset\n" if $parm{'verbose'};
}
return ($Res, $Utfmsg, $layers, $bomenc, $bom, $bomwidth, $bomoffset);
}

sub test_surrogate_pair
{
my $test = shift;
print "\nTesting surrogate pairs\n",'-'x20,"\n";
if (ord($test) < 0x10000) {
print "nothing to test < 0x10000\n";
} else {
my $hi = (ord($test) - 0x10000) / 0x400 + 0xD800;
my $lo = (ord($test) - 0x10000) % 0x400 + 0xDC00;
my $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
printf "test uni = %x\n", ord($test);
printf "hi surrogate = %x\n", $hi;
printf "lo surrogate = %x\n", $lo;
printf "uni = %x\n", $uni;
}
}
__END__
 
S

sln

[snip]

sub CheckUTF
{
# UTF-8/16/32 check [snip]

if (ref($decoder)) {
my $name = $decoder->name;
$decoder = 'Do nothing';
$Res = 3;
$Utfmsg .= ", guess($count): $name";
if ($name =~ /UTF.*?(?:8|16|32)/i) {
# $name =~ s/(?:LE|BE)$//i;
$decoder = "Adding layer";
$Res = 4;
#start change
if ($name =~ /utf8/i) {
if ($utf8layer) {
$decoder = 'Already have utf8 layer';
$Res = 1;
}
$utf8layer = 1;
} else {
binmode ($fh_check, ":encoding($name)");
}
#end change
}
}
$Utfmsg .= ". $decoder" if (defined $decoder); # guess error [snip]
}
 
S

sln

But, decoded utf8 can increase regular expression processing
time by up to several magnitudes depending on the regex.
If the overwhelming majority of the sample is ascii, and you
don't care about the rest, or just want to filter ascii, it might
be better to process as bytes.

For this reason, you can have this function default to 'Do nothing' (default)
when it finds utf8, or force it to add the utf8 layer
if it finds it. Like this:

$offset = CheckUTF($fh, 'size'=> $check_size, 'verbose'=> 1, 'getbom_v'=> 1, 'utf8ok' => 1);
sub CheckUTF
{
# UTF-8/16/32 check
[snip]
my $utf_types = '(?:16|32)';

while (my ($name, $val) = splice (@args, 0, 2))
{
[snip]
elsif ($name eq 'utf8ok' and $val) {
$utf_types = '(?:8|16|32)';
}
}
[snip]
if (ref($decoder)) {
my $name = $decoder->name;
$decoder = 'Do nothing';
$Res = 3;
$Utfmsg .= ", guess($count): $name"; if ($name =~ /UTF.*?$utf_types/i) {
# $name =~ s/(?:LE|BE)$//i;
$decoder = "Adding layer";
$Res = 4;
#start change
if ($name =~ /utf8/i) {
if ($utf8layer) {
$decoder = 'Already have utf8 layer';
$Res = 1;
}
$utf8layer = 1;
} else {
binmode ($fh_check, ":encoding($name)");
}
#end change
}
}
$Utfmsg .= ". $decoder" if (defined $decoder); # guess error [snip]
}
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,582
Members
45,066
Latest member
VytoKetoReviews

Latest Threads

Top